unsloth/studio
Daniel Han 767fa8cade
Studio: honor explicit GGUF ctx and default to 4096 when weights exceed VRAM (#5011)
* Studio: honor explicit GGUF ctx and default to 4096 when weights exceed VRAM

The load-time auto-fit in LlamaCppBackend.load_model had two issues for
models whose weights do not fit on any GPU subset (the common case for
large MoE GGUFs such as MiniMax-M2.7, Qwen3.5-397B-A17B, etc.):

1. Auto mode (max_seq_length=0) left effective_ctx at the model's native
   context when no subset passed the 90% fit check. The UI slider then
   defaulted to, e.g., 196608 for MiniMax-M2.7, far above any usable value.
   Default the auto-pick to 4096 so the UI starts at a sane value; the
   slider ceiling stays at the native context so the user can still
   opt in to longer contexts and receive the "might be slower" warning.

2. Explicit ctx was silently shrunk when weights fit but the requested
   KV overflowed the 90% budget. The shrink loop emitted -c <capped>
   -ngl -1 without informing the caller, so a user who had opted into
   a longer context via the UI never actually got it. Drop the shrink
   loop on the explicit path and emit -c <user_ctx> --fit on instead,
   letting llama-server flex -ngl (CPU layer offload).

Adds tests/test_llama_cpp_context_fit.py covering both paths, the
file-size-only fallback when KV metadata is missing, non-regression on
fittable auto-pick, and platform-agnostic input shape.
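The two paths can be sketched roughly as follows. This is an illustrative rewrite of the rule described above, not the actual LlamaCppBackend.load_model code; the function name, signature, and the AUTO_CTX_FALLBACK constant are hypothetical:

```python
# Hypothetical sketch of the ctx-selection rule from this commit; the real
# logic lives in LlamaCppBackend.load_model and differs in detail.
AUTO_CTX_FALLBACK = 4096  # sane starting point when weights exceed VRAM

def pick_server_args(requested_ctx: int, native_ctx: int, weights_fit: bool) -> list[str]:
    """Return llama-server context args for the auto and explicit paths.

    requested_ctx == 0 means auto mode; weights_fit is the 90% VRAM fit check.
    """
    if requested_ctx > 0:
        # Explicit path: never silently shrink the requested ctx;
        # let llama-server flex -ngl (CPU layer offload) instead.
        return ["-c", str(requested_ctx), "--fit", "on"]
    if not weights_fit:
        # Auto path, weights don't fit: start at 4096, but never above
        # the model's native context.
        return ["-c", str(min(AUTO_CTX_FALLBACK, native_ctx))]
    # Auto path, fittable: keep the auto-picked/native context.
    return ["-c", str(native_ctx)]
```

For example, an explicit 8192 request now always reaches the server as `-c 8192 --fit on`, while auto mode on an unfittable MoE GGUF starts at `-c 4096` instead of the native 196608.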

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-14 08:53:25 -07:00
backend Studio: honor explicit GGUF ctx and default to 4096 when weights exceed VRAM (#5011) 2026-04-14 08:53:25 -07:00
frontend [Studio] Show non exported models in chat UI (#4892) 2026-04-14 15:03:58 +04:00
__init__.py Final cleanup 2026-03-12 18:28:04 +00:00
install_llama_prebuilt.py Add AMD ROCm/HIP support across installer and hardware detection (#4720) 2026-04-10 01:56:12 -07:00
install_python_stack.py [Studio] Install flash attn at setup time for linux (#4979) 2026-04-14 16:40:17 +04:00
LICENSE.AGPL-3.0 Add AGPL-3.0 license to studio folder 2026-03-09 19:36:25 +00:00
setup.bat Final cleanup 2026-03-12 18:28:04 +00:00
setup.ps1 split venv_t5 into tiered 5.3.0/5.5.0 and fix trust_remote_code (#4878) 2026-04-07 20:05:01 +04:00
setup.sh Add AMD ROCm/HIP support across installer and hardware detection (#4720) 2026-04-10 01:56:12 -07:00
Unsloth_Studio_Colab.ipynb Allow install_python_stack to run on Colab (#4633) 2026-03-27 00:29:27 +04:00