mirror of
https://github.com/unslothai/unsloth
synced 2026-04-21 13:37:39 +00:00
* Studio: honor explicit GGUF ctx and default to 4096 when weights exceed VRAM

  The load-time auto-fit in `LlamaCppBackend.load_model` had two issues for models whose weights do not fit on any GPU subset (the common case for large MoE GGUFs such as MiniMax-M2.7, Qwen3.5-397B-A17B, etc.):

  1. Auto mode (`max_seq_length=0`) left `effective_ctx` at the model's native context when no subset passed the 90% fit check. The UI slider then landed on e.g. 196608 for MiniMax-M2.7, far above anything usable. Default the auto-pick to 4096 so the UI starts at a sane value; the slider ceiling stays at the native context so the user can still opt in to longer contexts and receive the "might be slower" warning.

  2. An explicit ctx was silently shrunk when the weights fit but the requested KV cache overflowed the 90% budget. The shrink loop emitted `-c <capped> -ngl -1` without informing the caller, so a user who had opted into a longer context via the UI never actually got it. Drop the shrink loop on the explicit path and emit `-c <user_ctx> --fit on` instead, letting llama-server flex `-ngl` (CPU layer offload).

  Adds `tests/test_llama_cpp_context_fit.py` covering both paths, the file-size-only fallback when KV metadata is missing, non-regression on the fittable auto-pick, and platform-agnostic input shape.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

  For more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
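The two fixed paths can be sketched as follows. This is a minimal illustration of the behavior the commit message describes, not the actual `LlamaCppBackend` code: the function name `pick_gguf_args` and its parameters are hypothetical, while the `4096` default, the `-c`/`-ngl -1` args, and `--fit on` come from the description above.

```python
# Hypothetical sketch of the fixed context-pick logic; pick_gguf_args and its
# signature are illustrative, not the real LlamaCppBackend API.
AUTO_DEFAULT_CTX = 4096  # sane UI starting point when weights don't fit on GPU


def pick_gguf_args(native_ctx: int, user_ctx: int, weights_fit_gpu: bool) -> list[str]:
    """Return llama-server args for the two paths described in the commit.

    user_ctx == 0 means auto mode (max_seq_length=0 in the UI).
    weights_fit_gpu: whether some GPU subset passed the 90% fit check.
    """
    if user_ctx == 0:
        # Auto path: when no subset fits, default to 4096 instead of the
        # native context; the UI slider ceiling stays at native_ctx so the
        # user can still opt in to longer contexts.
        ctx = native_ctx if weights_fit_gpu else min(AUTO_DEFAULT_CTX, native_ctx)
        return ["-c", str(ctx), "-ngl", "-1"]
    # Explicit path: never silently shrink the user's choice; pass it through
    # and let llama-server flex -ngl (CPU layer offload) via --fit on.
    return ["-c", str(user_ctx), "--fit", "on"]
```

For MiniMax-M2.7's 196608 native context with no fitting GPU subset, the auto path now yields `-c 4096`, while a user who explicitly requests 65536 gets `-c 65536 --fit on` rather than a silently capped value.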
Repository contents:

- backend
- frontend
- __init__.py
- install_llama_prebuilt.py
- install_python_stack.py
- LICENSE.AGPL-3.0
- setup.bat
- setup.ps1
- setup.sh
- Unsloth_Studio_Colab.ipynb