unsloth/studio
Daniel Han 5aa8c15246
Studio: hard-stop at n_ctx with a 'Context limit reached' toast (#5021)
* Studio: hard-stop at n_ctx with a dedicated 'Context limit reached' toast

llama-server's default behavior when the KV cache fills is to silently
drop the oldest non-``n_keep`` tokens and keep generating. The UI has
no way to tell the user that earlier turns were evicted -- they just
see degraded continuity and a confusing ``5,361 / 4,096`` on the
context usage bar.

Launch llama-server with ``--no-context-shift`` so it returns a clean
error once the request would exceed ``n_ctx``. In the chat adapter,
catch the error, identify it as a context-limit error via
``isContextLimitError()``, and surface a dedicated toast that names
the exact control to adjust: the ``Context Length`` field in the chat
Settings panel.

Also add a lightweight tooltip hint on ``ContextUsageBar`` when usage
crosses 85%, so users see the "raise Context Length in Settings"
suggestion before they hit the hard stop.

Tests:

  * ``test_llama_cpp_no_context_shift.py`` pins the ``--no-context-shift``
    flag in the static launch-command template, and pins it inside the
    unconditional ``cmd = [ ... ]`` block so a future refactor can't
    hide it behind a branch.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Shorten --no-context-shift comment to 1 line

* Match backend _friendly_error rewrite in isContextLimitError

Codex review on PR caught that ``backend/routes/inference.py::_friendly_error``
rewrites the raw llama-server text
  "request (X tokens) exceeds the available context size (Y tokens)"
into
  "Message too long: X tokens exceeds the Y-token context window. ..."
on the main streaming GGUF path. The heuristic only looked for
"context size" / "exceeds the available context" / "context shift",
none of which survive the rewrite, so the new "Context limit reached"
toast would never fire for the most common case. Add matches for
"message too long" and "context window" so both wordings hit.

Also addresses Gemini feedback on the launch-flag test:
  * Use ``inspect.getsource(LlamaCppBackend.load_model)`` instead of
    reading ``__file__`` directly; scopes the assertions to the
    function that actually launches llama-server.
  * Replace the hardcoded ``"            ]"`` indent search with a
    line-at-a-time scan for a line that is just ``]``, so the test
    survives reformatting.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-14 10:58:20 -07:00
..
backend Studio: hard-stop at n_ctx with a 'Context limit reached' toast (#5021) 2026-04-14 10:58:20 -07:00
frontend Studio: hard-stop at n_ctx with a 'Context limit reached' toast (#5021) 2026-04-14 10:58:20 -07:00
__init__.py Final cleanup 2026-03-12 18:28:04 +00:00
install_llama_prebuilt.py Add AMD ROCm/HIP support across installer and hardware detection (#4720) 2026-04-10 01:56:12 -07:00
install_python_stack.py [Studio] Install flash attn at setup time for linux (#4979) 2026-04-14 16:40:17 +04:00
LICENSE.AGPL-3.0 Add AGPL-3.0 license to studio folder 2026-03-09 19:36:25 +00:00
setup.bat Final cleanup 2026-03-12 18:28:04 +00:00
setup.ps1 split venv_t5 into tiered 5.3.0/5.5.0 and fix trust_remote_code (#4878) 2026-04-07 20:05:01 +04:00
setup.sh Add AMD ROCm/HIP support across installer and hardware detection (#4720) 2026-04-10 01:56:12 -07:00
Unsloth_Studio_Colab.ipynb Allow install_python_stack to run on Colab (#4633) 2026-03-27 00:29:27 +04:00