unsloth

mirror of https://github.com/unslothai/unsloth synced 2026-04-21 13:37:39 +00:00


Daniel Han 5aa8c15246 Studio: hard-stop at n_ctx with a 'Context limit reached' toast (#5021)	2026-04-14 10:58:20 -07:00

* Studio: hard-stop at n_ctx with a dedicated 'Context limit reached' toast

  llama-server's default behavior when the KV cache fills is to silently drop the oldest non-``n_keep`` tokens and keep generating. The UI has no way to tell the user that earlier turns were evicted -- they just see degraded continuity and a confusing ``5,361 / 4,096`` on the context usage bar.

  Launch llama-server with ``--no-context-shift`` so it returns a clean error once the request would exceed ``n_ctx``. In the chat adapter, catch the error, identify it as a context-limit error via ``isContextLimitError()``, and surface a dedicated toast that names the exact control to adjust: the ``Context Length`` field in the chat Settings panel.

  Also add a lightweight tooltip hint on ``ContextUsageBar`` when usage crosses 85%, so users see the "raise Context Length in Settings" suggestion before they hit the hard stop.

  Tests:

  * ``test_llama_cpp_no_context_shift.py`` pins the ``--no-context-shift`` flag in the static launch-command template, and pins it inside the unconditional ``cmd = [ ... ]`` block so a future refactor can't hide it behind a branch.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci

* Shorten --no-context-shift comment to 1 line

* Match backend _friendly_error rewrite in isContextLimitError

  Codex review on PR caught that ``backend/routes/inference.py::_friendly_error`` rewrites the raw llama-server text "request (X tokens) exceeds the available context size (Y tokens)" into "Message too long: X tokens exceeds the Y-token context window. ..." on the main streaming GGUF path. The heuristic only looked for "context size" / "exceeds the available context" / "context shift", none of which survive the rewrite, so the new "Context limit reached" toast would never fire for the most common case.

  Add matches for "message too long" and "context window" so both wordings hit.

  Also addresses Gemini feedback on the launch-flag test:

  * Use ``inspect.getsource(LlamaCppBackend.load_model)`` instead of reading ``__file__`` directly; scopes the assertions to the function that actually launches llama-server.
  * Replace the hardcoded ``" ]"`` indent search with a line-at-a-time scan for a line that is just ``]``, so the test survives reformatting.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
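
The substring heuristic described in the commit message can be sketched roughly as follows. This is a Python illustration only, not the actual code: the real ``isContextLimitError()`` lives in the frontend chat adapter, and the name ``is_context_limit_error``, the ``CONTEXT_LIMIT_MARKERS`` tuple, and the case-insensitive comparison are all assumptions made for this sketch. Only the five marker strings and the two sample messages come from the commit message.

```python
# Hypothetical sketch of the context-limit heuristic; names are illustrative
# and do not come from the repository.
CONTEXT_LIMIT_MARKERS = (
    # Wordings found in the raw llama-server error text.
    "context size",
    "exceeds the available context",
    "context shift",
    # Wordings that survive the backend's friendly-error rewrite.
    "message too long",
    "context window",
)


def is_context_limit_error(message: str) -> bool:
    """Return True if the error text contains any known context-limit marker."""
    lowered = message.lower()  # assumption: matching is case-insensitive
    return any(marker in lowered for marker in CONTEXT_LIMIT_MARKERS)


raw = "request (5361 tokens) exceeds the available context size (4096 tokens)"
friendly = "Message too long: 5361 tokens exceeds the 4096-token context window."

print(is_context_limit_error(raw))                   # raw llama-server wording
print(is_context_limit_error(friendly))              # rewritten backend wording
print(is_context_limit_error("connection refused"))  # unrelated error
```

Checking both wordings in one list means the toast fires whether the error arrives raw or after the backend rewrite, which is the gap the review identified.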
..
backend	Studio: hard-stop at n_ctx with a 'Context limit reached' toast (#5021)	2026-04-14 10:58:20 -07:00
frontend	Studio: hard-stop at n_ctx with a 'Context limit reached' toast (#5021)	2026-04-14 10:58:20 -07:00
__init__.py	Final cleanup	2026-03-12 18:28:04 +00:00
install_llama_prebuilt.py	Add AMD ROCm/HIP support across installer and hardware detection (#4720)	2026-04-10 01:56:12 -07:00
install_python_stack.py	[Studio] Install flash attn at setup time for linux (#4979)	2026-04-14 16:40:17 +04:00
LICENSE.AGPL-3.0	Add AGPL-3.0 license to studio folder	2026-03-09 19:36:25 +00:00
setup.bat	Final cleanup	2026-03-12 18:28:04 +00:00
setup.ps1	split venv_t5 into tiered 5.3.0/5.5.0 and fix trust_remote_code (#4878)	2026-04-07 20:05:01 +04:00
setup.sh	Add AMD ROCm/HIP support across installer and hardware detection (#4720)	2026-04-10 01:56:12 -07:00
Unsloth_Studio_Colab.ipynb	Allow install_python_stack to run on Colab (#4633)	2026-03-27 00:29:27 +04:00