Commit graph

5055 commits

Author SHA1 Message Date
Daniel Han
b42e3a120d
Remove legacy venv Scripts entry from User PATH on upgrade (#5060)
Older installers persisted the venv Scripts directory directly in the
User PATH registry. The shim approach from #4961 no longer writes that
entry, but on upgrade the old one survived and python.exe / pip.exe
from the unsloth venv continued winning resolution in every new shell.

Before creating the shim, read the current User PATH, filter out any
entry matching $VenvDir\Scripts (using the same symmetric raw+expanded
comparison as Add-ToUserPath), and write back if changed. No-op on
fresh installs where the legacy entry was never written.

Confirmed on a real Windows machine: `where.exe python` was returning
the venv interpreter first even after the shim PR merged.
2026-04-16 07:36:59 -07:00
Daniel Han
5b8643969e Revert "Remove legacy venv Scripts entry from User PATH on upgrade"
This reverts commit cae4a74297.
2026-04-16 14:20:43 +00:00
Daniel Han
cae4a74297 Remove legacy venv Scripts entry from User PATH on upgrade
Older installers persisted the venv Scripts directory directly in the
User PATH registry. The shim approach (added in this PR) no longer writes
that entry, but it also did not remove the old one. On upgrade, the
legacy entry survived and python.exe / pip.exe from the unsloth venv
continued winning resolution in every new shell, which is exactly the
hijack the shim was designed to prevent.

Before creating the shim, read the current User PATH, filter out any
entry matching $VenvDir\Scripts (using the same symmetric raw+expanded
comparison as Add-ToUserPath), and write back if changed. This runs
once per install and is a no-op on fresh installs where the legacy
entry was never written.
2026-04-16 14:19:04 +00:00
Datta Nimmaturi
6764cb9b90
Restrict flash attn to <=256 head dim. Consolidate attn impl checks (#5051)
* Restrict flash attn to <=256 head dim. Consolidate attn impl checks

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Consolidate the changes into single function

* safeguard for dict instead of object

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-16 09:00:17 -05:00
Daniel Han
c5be8b1cd2
Chat-template repair: warn-by-default, AST classification, dict support (#5049)
* Chat-template repair: warn-by-default, AST classification, dict support

Follow-up hardening on top of PR #4426 (which fixed the #4150
RuntimeError for ChatML LoRA reloads).

Behavior changes:

- Warn-by-default instead of RuntimeError. When fix_chat_template cannot
  repair a broken template, emit a warning and return the original.
  Set UNSLOTH_STRICT_CHAT_TEMPLATE=1 to restore the pre-warn hard fail.
  Fixes the UX where a missing `{% if add_generation_prompt %}` block on
  a saved LoRA (typical after LlamaFactory / Axolotl re-serialize) would
  block model loading entirely.

- Local path vs HF hub distinguished in the warning message. For local
  paths the message points at the likely downstream tool; for HF IDs it
  points at the upstream model maintainers. Previously both said "file a
  bug report to the maintainers of <path>" even when <path> was the
  user's own saves/ directory.

- Dict / list chat_template now handled. Hermes-3 ships with
  {default, tool_use} and the previous code crashed with
  AttributeError: 'dict' object has no attribute 'find' when entering
  _fix_chat_template with a dict. Each variant is now fixed
  independently; structure is preserved.

Internals:

- _find_end_position now matches all four Jinja whitespace-control
  variants ({% %}, {%- %}, {% -%}, {%- -%}) and returns the rightmost
  endfor/endif so multi-for templates aren't locked onto the first loop.
  Previously {%- endfor -%} (both-side dash, used by Qwen3-Guard) was
  silently bypassed.

- _has_add_generation_prompt_block uses Jinja AST via
  jinja2.nodes.If/Name walks instead of substring matching, so
  templates that hide the block behind comments or dash-style variants
  are classified correctly.

- _template_ends_with_toplevel_for gates the GH#4150 ChatML repair on
  the AST: only fires when the last structural top-level node is a For
  (standard ChatML shape), ignoring trailing pure-whitespace output
  nodes. Templates wrapped in an outer If (Qwen3-Guard) are now
  explicitly skipped at the _fix_chat_template level as well, not just
  at load_correct_tokenizer's name-based exemption.

- _validate_patched_template renders the patched template with and
  without add_generation_prompt and confirms the patched output
  responds to the flag by appending (not replacing) content. If
  validation fails, the patch is discarded and we fall through to the
  warn path.

Verified with an expanded regression suite in tests/:
- test_fix_chat_template_pr4426.py: 42/42 template-matrix cells
- test_load_correct_tokenizer_pr4426.py: 5/5 tokenizer loads
- test_chat_template_followups.py: 10/10 new follow-up tests
- test_mistral_pr4426.py: 5 Mistral variants byte-identical
- test_qwen_pr4426.py: 14 Qwen variants byte-identical
  (Qwen1.5, Qwen2, Qwen2.5-Instruct/Coder/Math/VL, Qwen3,
  Qwen3-Coder, QwQ, Qwen3-Guard-Gen)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Guard _validate_patched_template against read-only chat_template

If tokenizer.chat_template is a property or otherwise read-only, the
validation helper would crash with AttributeError when trying to
temporarily set the patched template. Catch the assignment failure and
return False (skip validation), and best-effort restore in the finally
block.

* Replace regex separator inference with render-diff; broaden repair to non-ChatML templates

The previous `_infer_assistant_separator` was a four-tier regex heuristic that
only worked on ChatML-shaped templates and forced a hard `<|im_start|>` /
`<|im_end|>` presence gate on Case 2 repair. This meant a Llama-3, Gemma, or
Phi-3 template stripped of its generation-prompt block by a downstream tool
(LlamaFactory, Axolotl, etc.) would still warn-and-return even though the
structural shape is identical to the ChatML case the PR already handles.

This replaces the regex with `_derive_assistant_prefix_by_render`: render the
template with two dialogs that differ only in assistant content, then
`os.path.commonprefix` on the tails captures the exact assistant-turn prefix
the template emits. The template itself is ground truth, so non-ChatML shapes
work as long as the assistant block is a literal the template emits once per
message.

Three guards keep the derivation safe:
  A. both assistant renders extend the base render (no reordering);
  B. the divergence point is exactly the content-insertion site (sentinel
     follows the common prefix);
  C. a user-role cross-check: if a render with a user sentinel also emits
     the same prefix, role has no effect on output and we reject. A render
     failure on [user, user] (e.g. Gemma's `raise_exception` alternation
     check) is evidence that role matters; we accept.

Sentinels differ at character 0 so `commonprefix` cannot absorb them, and
trailing whitespace/comments after the last `{% endfor %}` are stripped
before probing (they would appear in base but not after the appended
assistant turn and break Guard A).

`_fix_chat_template` and `_repair_string_template` now thread an
`is_sharegpt` kwarg; `_fix_chat_template` retries once with
`is_sharegpt=True` if the first probe returns None (dual-probe fallback
for dict/list callers).

The ChatML `<|im_start|>` / `<|im_end|>` hard gate in Case 2 is dropped.
`_infer_assistant_separator` is deleted.

Verified via:
  - tests/test_fix_chat_template_pr4426.py: 51/51 cells (new Llama-3,
    Gemma, Phi-3 broken-template rows all repair FIX-OK)
  - tests/test_load_correct_tokenizer_pr4426.py: 5/5
  - tests/test_chat_template_followups.py: 18/18 (T11-T18 cover
    non-ChatML repair + probe failure modes)
  - tests/test_mistral_pr4426.py: 5/5 byte-identical
  - tests/test_qwen_pr4426.py: 14/14 byte-identical (Qwen3-Guard AST
    gate still rejects)
  - tests/hermes3_lora_pr4426.py reload: patched template ends with
    `<|im_start|>assistant\n`, inference returns sensible output.
  - temp/sim/battery.py: 79/79 followup; vs baseline: 0 regressions,
    9 improvements.
  - Spot-check probe on real stripped tokenizers (Hermes-3, Phi-4,
    Llama-3.2-1B, Gemma-3-1B): all derive the expected prefix.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address reviewer findings: variant routing, positive-gate detection, comment-safe end scan

Resolves three reviewer findings on PR #5049 (`fix/chat-template-followups`):

Finding #1 [10/10]: dict/list variants now route through
`_fix_chat_template_for_tokenizer` via a new `_VariantTokenizerProxy`
adapter. Previously the dict/list branches called `_fix_chat_template`
directly, silently bypassing the warn/strict (`UNSLOTH_STRICT_CHAT_TEMPLATE`)
contract, the `no == yes` diagnostic, broken-existing-block detection,
and `_validate_patched_template` guard. The proxy swaps
`base.chat_template` to the variant string before each
`apply_chat_template` call so tokenizer globals (`bos_token`, custom
filters, `raise_exception`) remain available; if the base is read-only
it falls back to isolated Jinja rendering.

Finding #2 [1/10]: `_has_add_generation_prompt_block` now requires the
`If` body to contain at least one `Output` node (a new
`_if_body_emits_content` helper walks descendants). This distinguishes a
real generation-prompt block from a header guard like
`{% if not add_generation_prompt is defined %}{% set ... %}{% endif %}`
(body contains only `Assign`) which references the name but emits
nothing. Also dropped a now-redundant `"add_generation_prompt" not in
scrubbed` guard in `_fix_chat_template` Case 2 so header-guarded
templates still get repaired.

Finding #4 [1/10]: `_find_end_position` now replaces Jinja comments with
equal-length whitespace before scanning for `{% endfor %}` / `{% endif %}`
tokens. This prevents a trailing comment containing those tokens from
being picked as the real end tag. Positions in the padded string map 1:1
to positions in the original template.

Tests:
  - tests/test_chat_template_followups.py: 21/21 (T19 strict-mode
    dict variant, T20 header-guard repair, T21 comment-endfor trap
    added; T4/T5 stubs updated with a working apply_chat_template
    that routes through Jinja).
  - tests/test_fix_chat_template_pr4426.py: 51/51 cells unchanged.
  - tests/test_load_correct_tokenizer_pr4426.py: 5/5.
  - tests/test_mistral_pr4426.py: 5/5 byte-identical.
  - tests/test_qwen_pr4426.py: 14/14 byte-identical.
  - temp/sim/battery.py: 79/79 followup; 0 regressions vs baseline.
  - Phase 3 Hermes-3 broken-LoRA reload: inference still returns
    `'The answer to the equation 2+2 is 4.'`.
  - Spot-checks on Hermes-3 / Phi-4 / Llama-3.2-1B / Gemma-3-1B real
    stripped templates: probe still derives the expected prefix.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Tighten comments in chat-template helpers

Pure comment minimization across `_find_end_position`,
`_has_add_generation_prompt_block`, `_if_body_emits_content`,
`_derive_assistant_prefix_by_render`, `_fix_chat_template` Case 2,
and `_VariantTokenizerProxy`. No behavior change; same intent,
fewer lines. All 21 follow-up tests and the 51-cell Phase 1 matrix
still pass.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Sandbox probe, fix is_sharegpt validator mismatch, reject negated gates

Three real bugs from the 10-agent Opus review:

1. Probe now uses `jinja2.sandbox.SandboxedEnvironment` instead of bare
   `jinja2.Environment`. The probe renders at model-load time (before
   the user calls `apply_chat_template`), so it was a new eager
   code-execution surface that the base HF tokenizer loading does not
   have. SandboxedEnvironment blocks attribute-chain exploits at
   negligible cost.

2. `_repair_string_template` now tries validation with both
   `is_sharegpt=False` and `is_sharegpt=True`. Previously, when
   `_fix_chat_template` internally fell back to the other schema via
   its dual-probe, the outer validation still used the caller's
   original `is_sharegpt` -- rendering with the wrong message keys and
   spuriously dropping a valid repair.

3. `_has_add_generation_prompt_block` now skips `If` nodes whose test
   is a `Not` expression. A negated gate like
   `{% if not add_generation_prompt %}{{ x }}{% endif %}` fires when
   agp=False, so its emitting body is not a generation block -- but the
   old code counted any Name reference regardless of polarity.

Cleanup: removed unused `self._label`, added `\r` escape in
generation-block literal, switched variant labels to `!r` formatting,
removed redundant `import os as _os`.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix jinja2.sandbox import and sandbox proxy fallback

Two critical findings from the 20-reviewer pass:

1. [20/20] The proxy read-only fallback used bare `jinja2.Environment`,
   not sandboxed. All 20 reviewers independently reproduced marker-file
   creation via `cycler.__init__.__globals__['os'].system(...)` during
   `fix_chat_template()`. Fixed: fallback now uses
   `from jinja2.sandbox import SandboxedEnvironment`.

2. [14/20] The render-diff probe did `import jinja2` then referenced
   `jinja2.sandbox.SandboxedEnvironment`. `jinja2.sandbox` is a
   submodule that is NOT auto-imported by `import jinja2` on Jinja 3.1.6.
   This caused `AttributeError` (swallowed by `except Exception`),
   making the entire Case 2 repair path silently return None in a clean
   process. The 6 reviewers who saw it work had `jinja2.sandbox`
   pre-imported by an earlier module in their process. Fixed: both the
   probe and the proxy fallback now use
   `from jinja2.sandbox import SandboxedEnvironment`.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-16 05:52:33 -07:00
Daniel Han
6e87bade25 Trim verbose comments in PATH helpers
Reduce inline comments from ~160 lines to ~25 across both files.
Keep one-line summaries of the "why"; drop multi-paragraph rationale
blocks that repeated information already captured in commit messages
and PR discussion.
2026-04-16 12:01:01 +00:00
Etherll
ec32ce2e82
fix: use direct registry API for PATH writes instead of SetEnvironmentVariable (#4961)
* fix: replacing SetEnvironmentVariable with direct registry API

* apply reviews

* Use CreateSubKey for HKCU\Environment

* Store PATH backup under HKCU\Software\Unsloth

* Fix $backupKey registry handle leak in PATH backup block

Wrap $backupKey operations in try/finally so the handle is closed even
if GetValue or SetValue throws. The Add-ToUserPath helper already uses
this pattern for its registry key -- the backup block was the only
place missing it.

* Isolate WM_SETTINGCHANGE broadcast from PATH write error handling

Wrap the broadcast dummy-variable calls in their own try/catch so a
broadcast failure does not mask a successful registry PATH write.
Previously, if SetEnvironmentVariable threw after SetValue already
committed the new PATH, Add-ToUserPath would return $false and the
caller would skip Refresh-SessionPath.

* PATH helper polish: venv precedence, quoted entries, raw/expanded dedup

Three small follow-ups surfaced by a 10-reviewer pass against the rebased
PR head. None fix a regression vs main; each strictly improves the new
helpers.

Refresh-SessionPath / Refresh-Environment:
- Move $env:Path to the front of the merge so an activated venv keeps
  precedence over machine/user PATH after a refresh. Pre-PR dropped
  process-only entries entirely; post-PR kept them but at the back.
- Dedup on both raw and expanded forms so %USERPROFILE%\foo and the
  already-expanded C:\Users\me\foo do not both survive.

Add-ToUserPath:
- Trim whitespace and surrounding double-quotes from each compared entry
  so quoted PATH entries like "C:\Program Files\CMake\bin" deduplicate
  against an unquoted directory of the same path.

* Back up User PATH inside Add-ToUserPath, before first mutation

Previously only studio/setup.ps1 took a one-time PATH backup, at script
top (line ~547). install.ps1 (the irm | iex entry point) had no backup,
so users who installed via that path had no recovery surface if anything
clobbered their PATH. The PR description's "one-time backup before any
modifications" promise only held for the studio installer flow.

Move the backup into Add-ToUserPath itself: just before the first actual
SetValue mutation, write the pristine raw PATH to
HKCU\Software\Unsloth\PathBackup if no backup already exists. This:

- Covers both entry points (install.ps1 and studio/setup.ps1).
- Captures the TRUE pristine PATH even when install.ps1 runs first and
  studio/setup.ps1 runs afterwards (the script-top backup in setup.ps1
  would otherwise see an already-modified PATH).
- Is idempotent: once a backup exists, subsequent calls preserve it.
- Skips when nothing would mutate (dedup match) or PATH is empty.

The script-top backup in studio/setup.ps1 is kept for defense in depth.

* Refresh PATH: venv-aware merge order

Reconcile two competing concerns about Refresh-SessionPath /
Refresh-Environment surfaced by separate review rounds:

  - venv at the back -> activated venv loses precedence to system Python
  - process at the front -> stale shims (old node, old python, etc.)
    still on $env:Path can beat a freshly installed tool

New merge order:
  1. Activated venv Scripts dir, only if $env:VIRTUAL_ENV is set
  2. Machine PATH freshly read from registry
  3. User PATH freshly read from registry
  4. Current $env:Path as fallback

This way an explicitly-activated venv keeps priority while a tool the
script just installed wins over any stale entry that was already on
the inherited shell PATH. When no venv is active, fresh registry
entries take precedence as expected.

* Append to User PATH by default, close $envKey in finally

Add-ToUserPath gains a -Position Append|Prepend parameter defaulting to
Append so installing unsloth no longer prepends the bundled venv Scripts
directory ahead of the user's existing python / pip on new shells. The
four current call sites (install.ps1 launcher, studio/setup.ps1 CMake,
nvcc, Python user Scripts) all take the Append default because each one
that needs in-session precedence already does an inline $env:Path prepend
independently. This matches rustup / cargo / nvm / pyenv / uv behavior.

Also wrap the script-top $envKey.GetValue in a try/finally so the
registry handle is released even if the read throws. Matches the pattern
already used for $backupKey five lines below.

* Prepend cmake, nvcc, Python Scripts; keep venv Scripts appended

The previous commit switched Add-ToUserPath to append by default so that
installing unsloth would not silently hijack the user's system python /
pip. That was correct for the venv Scripts dir (which contains python.exe
and pip.exe alongside unsloth.exe), but wrong for the three studio/setup
call sites. Those persist cmake, the driver-compatible nvcc, and the
Python user Scripts dir for future shells, and in all three cases an
older tool already earlier in the user PATH would keep winning after the
install finished. The nvcc case is especially load-bearing: setup selects
a driver-compatible CUDA toolkit, then llama.cpp builds against whatever
wins PATH resolution, so a stale older nvcc produces broken builds.

Pass -Position 'Prepend' explicitly at the three setup.ps1 call sites
(cmake at line 754, nvcc bin at line 1025, Python user Scripts at line
1191). None of those directories holds python.exe, so prepending them
does not re-introduce the original hijack problem. Leave the install.ps1
venv Scripts call on the default Append with a comment explaining why.

* Symmetric dedup, Prepend reorders duplicates, unsloth shim dir

Address three separate findings surfaced by review:

1. Dedup asymmetry (Gemini high-priority): the existing dedup expanded
   registry entries via ExpandEnvironmentVariables but did NOT expand the
   new directory. Passing "%USERPROFILE%\foo" when "C:\Users\me\foo" was
   already in PATH produced a duplicate. Expand both sides so the check
   is symmetric.

2. -Position Prepend no-op on existing duplicates: the dedup loop
   returned $false as soon as it saw a match, regardless of position.
   That left a late-position duplicate in place instead of moving it to
   the front, so "prepend the newly selected cmake/nvcc" did not always
   beat an older copy earlier in PATH. Partition entries into kept and
   dropped lists, then reinsert a single copy at the requested position.
   Append still returns $false on any match so user-curated orderings
   are not reshuffled. Prepend also returns $false when the only copy
   is already at position 0 so we preserve the user's casing.

3. Stop adding the venv Scripts dir to User PATH entirely. That dir
   holds python.exe and pip.exe alongside unsloth.exe, so neither
   Prepend nor Append worked: prepend hijacked the user's system python
   and pip, append made the freshly-installed unsloth.exe lose to any
   older unsloth.exe earlier on PATH. Replace the Scripts-dir PATH add
   with a dedicated shim directory that contains only unsloth.cmd, and
   prepend that dir. The shim calls the venv's unsloth.exe by absolute
   path so future pip upgrades inside the venv propagate automatically.

* Shim via hardlink, Append user Scripts, drop venv sysconfig fallback

Three follow-ups to the c0ab1ab shim commit, targeting concerns raised in
the second 20-reviewer pass:

1. Shim uses unsloth.exe (hardlink, copy fallback) instead of unsloth.cmd.
   The batch-file approach had three distinct regressions:
   - cmd.exe expanded %...% sequences inside user arguments, so prompts
     like "What does 50% mean?" got mangled before reaching the CLI
   - Git Bash / MSYS2 / POSIX-style shells on Windows do not resolve
     bare-name lookups to .cmd files, so `unsloth` stopped working there
   - Set-Content -Encoding ASCII replaced non-ASCII profile characters
     with '?', so installs under C:\Users\Jörg\... wrote a broken shim
   A hardlink (fallback: copy) of unsloth.exe is a native Windows
   executable with no shell indirection. PATHEXT picks .exe before .cmd
   in cmd.exe and PowerShell, Git Bash honors .exe natively, subprocess
   callers hit it directly, and a hardlink stays in sync with the venv
   on pip upgrades because both names point at the same inode.

2. studio/setup.ps1 Python user Scripts dir is added with default Append
   instead of -Position Prepend. That directory holds every pip-installed
   user console script (pip, pytest, huggingface-cli, and so on), not
   just unsloth, so reordering it silently changed resolution order for
   unrelated tools. The new install.ps1 shim at PATH position 0 already
   guarantees `unsloth` resolves to the freshly installed copy, so the
   Python user Scripts entry only needs to be present, not at the front.

3. The sysconfig lookup in studio/setup.ps1 no longer falls back to
   sysconfig.get_path('scripts') when the nt_user scheme dir does not
   exist. When setup.ps1 is invoked from an activated venv (a flow the
   linked issue actually hits) that fallback returns the venv's Scripts
   directory, which would then be added to the persisted User PATH and
   re-introduce the python / pip hijack the shim dir is meant to avoid.
   Stick strictly to the nt_user scheme; skip the block if it does not
   exist on disk.

* Do not crash installer when unsloth.exe shim is locked

The shim update sequence at install.ps1:1095 did a bare Remove-Item /
New-Item HardLink / Copy-Item. Under the script's $ErrorActionPreference
a locked target (most commonly 'unsloth studio' still running while the
user re-invokes the installer) turns the Remove-Item failure into a
terminating error that aborts the install with no actionable message.

The existing shim is perfectly usable in that state, so there is no
reason to abort. Wrap the whole remove/link/copy sequence in a try/catch
that logs the probable cause (Studio still running), points at the fix
(close Studio and re-run), and lets the installer finish with the old
launcher still serving the command.

Also only emit the "added unsloth launcher to PATH" step line when the
launcher was actually (re)created AND the PATH entry was newly added --
previously the message fired even when the shim refresh silently failed,
which was confusing.

* Guard shim PATH entry on existence, use NullString for broadcast delete

Two follow-ups surfaced by the latest review pass:

1. Do not add the shim directory to User PATH when the launcher was not
   actually created. Antivirus blocking unsloth.exe, a disk-full volume,
   or restrictive filesystem permissions can make both the hardlink and
   the copy fallback fail on a fresh install. In that case the existing
   sequence would report "added unsloth launcher to PATH" warnings but
   still prepend the empty $ShimDir to User PATH -- the user sees an
   install that claims success but then cannot resolve `unsloth` in a
   new shell. Gate Add-ToUserPath on Test-Path $ShimExe so the PATH
   entry is only persisted when the launcher is really there.

2. Pass [NullString]::Value instead of $null to the broadcast-delete
   call in Add-ToUserPath. On PowerShell 7.5 and later (running on .NET
   9), a bare $null going into [Environment]::SetEnvironmentVariable
   can be coerced to an empty string rather than a true .NET null,
   which sets the dummy UnslothPathRefresh_XXXXXXXX variable to "" in
   HKCU\Environment instead of deleting it. The leaked variable is
   visible in System Properties and accumulates one entry per install
   run. [NullString]::Value is a PowerShell-specific sentinel that
   crosses the interop boundary as a real null and works on both PS 5.1
   and PS 7.x. See PowerShell/PowerShell#24637 for the underlying issue.

---------

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>
2026-04-16 04:49:51 -07:00
Imgyu Kim
14ab6fbfae
BUG: fix _fix_chat_template for ChatML templates missing add_generation_prompt (#4426)
Fixes #4150.

Pre-PR, `_fix_chat_template` only patched templates where a trailing `{{ ... }}` expression followed the last `{% endfor %}`. ChatML templates (Hermes, Magnum, Phi-4, etc.) that end cleanly at `{% endfor %}` with no generation-prompt block were left unchanged, so the outer `fix_chat_template` raised:

```
RuntimeError: Unsloth: The tokenizer `...` does not have a
{% if add_generation_prompt %} for generation purposes.
```

This commonly shows up when a downstream tool (LlamaFactory, Axolotl) re-serializes the tokenizer during LoRA save and strips the generation-prompt block.

This PR adds a second branch to `_fix_chat_template` that fires when:

- the content after the last `{% endfor %}` is empty modulo Jinja `{# ... #}` comments,
- the scrubbed template contains `<|im_start|>` and `<|im_end|>`,
- and the scrubbed template does not already mention `add_generation_prompt`.

The assistant-turn separator is inferred from the template itself (preferring an explicit `'<|im_start|>assistant<sep>'` literal, then the unique `message['role'] + '<sep>'` from role concatenations, then `<|im_sep|>` for Phi-4-mini mixed-separator templates, then `\n`), so Phi-4-style templates are not silently corrupted with the wrong separator.

Verified against the existing chat-template corpus:

- Hermes-3, Magnum-v2, Phi-4-mini, Phi-4 multi-sep, ChatML with trailing whitespace, ChatML with trailing Jinja comment, dot-access `message.role`, split-literal `'<|im_start|>assistant'`: all repaired with the correct assistant prefix.
- Already-fixed ChatML templates: idempotent NOP.
- Trap templates with `<|im_start|>` only inside a Jinja comment: correctly not rewritten.
- Llama-3, Gemma-3, Qwen2.5 (non-ChatML): byte-identical.
- Mistral family (5 models including Mistral-Nemo, Mistral-Small-24B, Mixtral): byte-identical, protected both by the structural guard (no ChatML tokens) and the existing name-based exemption in `load_correct_tokenizer`.
- Qwen family (14 models including Qwen2.5, Qwen3, Qwen3-Coder, QwQ, VL, Math, Qwen3-Guard): byte-identical.

End-to-end reproduction: Hermes-3 LoRA SFT, save with stripped chat_template, reload. Pre-PR code path raises the RuntimeError above. Post-PR reload loads cleanly, patches the template at load time, and `apply_chat_template(add_generation_prompt=True)` produces the correct `<|im_start|>assistant\n` prefix.
2026-04-16 00:21:29 -07:00
DoubleMathew
a4d4dfe4ac
fix Gemma4 flash attn disable (#5045)
* fix pass attn implementation

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-15 17:50:48 -05:00
Daniel Han
3869fbe1cc
Bump installer minimum to 2026.4.5 (#5041) 2026-04-15 08:23:41 -07:00
Daniel Han
cdb3e752ec Update _utils.py 2026-04-15 08:06:43 -07:00
Daniel Han
ba387e2c8f Update pyproject.toml 2026-04-15 08:06:30 -07:00
Daniel Han
f0d03655e8
Studio: add folder browser modal for Custom Folders (#5035)
* Studio: add folder browser modal for Custom Folders

The Custom Folders row in the model picker currently only accepts a
typed path. On a remote-served Studio (Colab, shared workstation) that
means the user has to guess or paste the exact server-side absolute
path. A native browser folder picker can't solve this: HTML
`<input type="file" webkitdirectory>` hides the absolute path for
security, and the File System Access API (Chrome/Edge only) returns
handles rather than strings, neither of which the server can act on.

This PR adds a small in-app directory browser that lists paths on the
server and hands the chosen string back to the existing
`POST /api/models/scan-folders` flow.

## Backend

* New endpoint `GET /api/models/browse-folders`:
  * `path` query param (expands `~`, accepts relative or absolute; empty
    defaults to the user's home directory).
  * `show_hidden` boolean to include dotfiles/dotdirs.
  * Returns `{current, parent, entries[], suggestions[]}`. `parent` is
    null at the filesystem root.
  * Immediate subdirectories only (no recursion); files are never
    returned.
  * `entries[].has_models` is a cheap hint: the directory looks like it
    holds models if it is named `models--*` (HF hub cache layout) or
    one of the first 64 children is a .gguf/.safetensors/config.json/
    adapter_config.json or another `models--*` subfolder.
  * Sort order: model-bearing dirs, then plain, then hidden; case-
    insensitive alphabetical within each bucket.
  * Suggestions auto-populate from HOME, the HF cache root, and any
    already-registered scan folders, deduplicated.
  * Error surface: 404 for missing path, 400 for non-directory, 403 on
    permission errors. Auth-required like the other models routes.

* New Pydantic schemas `BrowseEntry` and `BrowseFoldersResponse` in
  `studio/backend/models/models.py`.

## Frontend

* New `FolderBrowser` component
  (`studio/frontend/src/components/assistant-ui/model-selector/folder-browser.tsx`)
  using the existing `Dialog` primitive. Features:
  * Clickable breadcrumb with a `..` row for parent navigation.
  * Quick-pick chips for the server-provided suggestions.
  * `Show hidden` checkbox.
  * In-flight fetch cancellation via AbortController so rapid
    navigation doesn't flash stale results.
  * Badges model-bearing directories inline.

* `chat-api.ts` gains `browseFolders(path?, showHidden?)` and matching
  types.

* `pickers.tsx` adds a folder-magnifier icon next to the existing `Add`
  button. Opening the browser seeds it with whatever the user has
  already typed; confirming fills the text input, leaving the existing
  validation and save flow unchanged.

## What it does NOT change

* The existing text-input flow still works; the browser is additive.
* No new permissions or escalation; the endpoint reads only directories
  the server process is already allowed to read.
* No model scanning or filesystem mutation happens from the browser
  itself -- it just returns basenames for render.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: cap folder-browser entries and expose truncated flag

Pointing the folder browser at a huge directory (``/usr/lib``,
``/proc``, or a synthetic tree with thousands of subfolders) previously
walked the whole listing and stat-probed every child via
``_looks_like_model_dir``. That is both a DoS shape for the server
process and a large-payload surprise for the client.

Introduce a hard cap of 2000 subdirectory entries and a
``truncated: bool`` field on the response. The frontend renders a small
hint below the list when it fires, prompting the user to narrow the
path. Below-cap directories are unchanged.

Verified end-to-end against the live backend with a synthetic tree of
2050 directories: response lands at 2000 entries, ``truncated=true``,
listing finishes in sub-second time (versus tens of seconds if we were
stat-storming).

* Studio: suggest LM Studio / Ollama dirs + 2-level model probe

Three improvements to the folder-browser, driven by actually dropping
an LM Studio-style install (publisher/model/weights.gguf) into the
sandbox and walking the UX:

## 1. Quick-pick chips for other local-LLM tools

`well_known_model_dirs()` (new) returns paths commonly used by
adjacent tools. Only paths that exist are returned so the UI never
shows dead chips.

* LM Studio current + legacy roots + user-configured
  `downloadsFolder` from its `settings.json` (reuses the existing
  `lmstudio_model_dirs()` helper).
* Ollama: `$OLLAMA_MODELS` env override, then `~/.ollama/models`,
  `/usr/share/ollama/.ollama/models`, and `/var/lib/ollama/.ollama/models`
  (the systemd-service install path surfaced in the upstream "where is
  everything?" issue).
* Generic user-choice locations: `~/models`, `~/Models`.

Dedup is stable across all sources.

## 2. Two-level model-bearing probe

LM Studio and Ollama both use `root/publisher/model/weights.gguf`.
The previous `has_models` heuristic only probed one level, so the
publisher dir (whose immediate children are model dirs, not weight
files) was always marked as non-model-bearing. Pulled the direct-
signal logic into `_has_direct_model_signal` and added a grandchild
probe so the classic layout is now recognised.

Still O(PROBE^2) worst-case, still returns immediately for
`models--*` names (HF cache layout) and for any direct weight file.

## 3. model_files_here hint on response body

A leaf model dir (just GGUFs, no subdirs) previously rendered as
`(empty directory)` in the modal, confusing users into thinking the
folder wasn't scannable. Added a `model_files_here` count on the
response (capped at 200) and a small hint row in the modal: `N model
files in this folder. Click "Use this folder" to scan it.`

## Verification

Simulated an LM Studio install by downloading the real 84 MB
`unsloth/SmolLM2-135M-Instruct-Q2_K.gguf` into
`~/.lmstudio/models/unsloth/SmolLM2-135M-Instruct-GGUF/`. Confirmed
end-to-end:

* Home listing suggests `~/.lmstudio/models` as a chip.
* Browsing `~/.lmstudio/models` flags `unsloth` (publisher) as
  `has_models=true` via the 2-level probe.
* Browsing the publisher flags `SmolLM2-135M-Instruct-GGUF` (model
  dir) as `has_models=true`.
* Browsing the model dir returns empty entries but
  `model_files_here=1`, and the frontend renders a hint telling the
  user it is a valid target.

* Studio: one-click scan-folder add + prominent remove + plain search icon

Three small Custom Folders UX fixes after real-use walkthrough:

* **One-click add from the folder browser**. Confirming `Use this
  folder` now submits the path directly to
  `POST /api/models/scan-folders` instead of just populating the text
  input. `handleAddFolder` takes an optional explicit path so the
  submit lands in the same tick as `setFolderInput`, avoiding a
  state-flush race. The typed-path + `Add` button flow is unchanged.

* **Prominent remove X on scan folders**. The per-folder delete
  button was `text-muted-foreground/40` and hidden entirely on
  desktop until hovered (`md:opacity-0 md:group-hover:opacity-100`).
  Dropped the hover-only cloak, bumped color to `text-foreground/70`,
  added a red hover/focus background, and sized the icon up from
  `size-2.5` to `size-3`. Always visible on every viewport.

* **Plain search icon for the Browse button**. `FolderSearchIcon`
  replaced with `Search01Icon` so it reads as a simple "find a
  folder" action alongside the existing `Add01Icon`.

* Studio: align Custom Folders + and X buttons on the same right edge

The Custom Folders header used `px-2.5` with a `p-0.5` icon button,
while each folder row used `px-3` with a `p-1` button. That put the
X icon 4px further from the right edge than the +. Normalised both
rows to `px-2.5` with `p-1` so the two icons share a column.

* Studio: empty-state button opens the folder browser directly

The first-run empty state for Custom Folders was a text link reading
"+ Add a folder to scan for local models" whose click toggled the
text input. That's the wrong default: a user hitting the empty state
usually doesn't know what absolute path to type, which is exactly
what the folder browser is for.

* Reword to "Browse for a models folder" with a search-icon
  affordance so the label matches what the click does.
* Click opens the folder browser modal directly. The typed-path +
  Add button flow is still available via the + icon in the
  section header, so users who know their path keep that option.
* Slightly bump the muted foreground opacity (70 -> hover:foreground)
  so the button reads as a primary empty-state action rather than a
  throwaway hint.

* Studio: Custom Folders header gets a dedicated search + add button pair

The Custom Folders section header had a single toggle button that
flipped between + and X. That put the folder-browser entry point
behind the separate empty-state link. Cleaner layout: two buttons in
the header, search first, then add.

* Search icon (left) opens the folder browser modal directly.
* Plus icon (right) toggles the text-path input (unchanged).
* The first-run empty-state link is removed -- the two header icons
  cover both flows on every state.

Both buttons share the same padding / icon size so they line up with
each other and with the per-folder remove X.

* Studio: sandbox folder browser + bound caps + UX recoveries

PR review fixes for the Custom Folders folder browser. Closes the
high-severity CodeQL path-traversal alert and addresses the codex /
gemini P2 findings.

Backend (studio/backend/routes/models.py):

* New _build_browse_allowlist + _is_path_inside_allowlist sandbox.
  browse_folders now refuses any target that doesn't resolve under
  HOME, HF cache, Studio dirs, registered scan folders, or the
  well-known third-party model dirs. realpath() is used so symlink
  traversal cannot escape the sandbox. Also gates the parent crumb
  so the up-row hides instead of 403'ing.
* _BROWSE_ENTRY_CAP now bounds *visited* iterdir entries, not
  *appended* entries. Dirs full of files (or hidden subdirs when
  show_hidden is False) used to defeat the cap.
* _count_model_files gets the same visited-count fix.
* PermissionError no longer swallowed silently inside the
  enumeration / counter loops -- now logged at debug.

Frontend (folder-browser.tsx, pickers.tsx, chat-api.ts):

* splitBreadcrumb stops mangling literal backslashes inside POSIX
  filenames; only Windows-style absolute paths trigger separator
  normalization. The Windows drive crumb value is now C:/ (drive
  root) instead of C: (drive-relative CWD-on-C).
* browseFolders accepts and forwards an AbortSignal so cancelled
  navigations actually cancel the in-flight backend enumeration.
* On initial-path fetch error, FolderBrowser now falls back to HOME
  instead of leaving the modal as an empty dead end.
* When the auto-add path (one-click "Use this folder") fails, the
  failure now surfaces via toast in addition to the inline
  paragraph (which is hidden when the typed-input panel is closed).

* Studio: rebuild browse target from trusted root for CodeQL clean dataflow

CodeQL's py/path-injection rule kept flagging the post-validation
filesystem operations because the sandbox check lived inside a
helper function (_is_path_inside_allowlist) and CodeQL only does
intra-procedural taint tracking by default. The user-derived
``target`` was still flowing into ``target.exists`` /
``target.is_dir`` / ``target.iterdir``.

The fix: after resolving the user-supplied ``candidate_path``,
locate the matching trusted root from the allowlist and rebuild
``target`` by appending each individually-validated segment to
that trusted root. Each segment is rejected if it isn't a single
safe path component (no separators, no ``..``, no empty/dot).
The downstream filesystem ops now operate on a Path constructed
entirely from ``allowed_roots`` (trusted) plus those validated
segments, so CodeQL's dataflow no longer sees a tainted source.

Behavior is unchanged for all valid inputs -- only the
construction of ``target`` is restructured. Live + unit tests
all pass (58 selected, 7 deselected for Playwright env).

* Studio: walk browse paths from trusted roots for CodeQL

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@h100-8-cheapest.us-east5-a.c.unsloth.internal>
2026-04-15 08:04:33 -07:00
Roland Tannous
800ddc95f8
Re-apply #4939: updated models template mappers (#4950)
* Reapply "updated models template mappers. added lfm2.5vl450m to transformers 5…" (#4945)

This reverts commit 33503ea248.

* Add missing gemma-4-31B-it bnb-4bit mapper entry and LFM2.5 upstream namespace for PR #4950

- Add unsloth/gemma-4-31B-it-unsloth-bnb-4bit to __INT_TO_FLOAT_MAPPER so
  the int-to-float resolution works for this model (already listed in
  TEMPLATE_TO_MODEL_MAPPER but had no mapper entry).
- Add LiquidAI/LFM2.5-1.2B-Instruct to lfm-2.5 TEMPLATE_TO_MODEL_MAPPER
  entry so the canonical upstream namespace is mapped consistently with lfm-2.

* Add missing gemma-4-31B-it bnb-4bit Ollama mapping and lfm-2.5 chat template alias

- Add unsloth/gemma-4-31B-it-unsloth-bnb-4bit to OLLAMA_TEMPLATE_TO_MODEL_MAPPER
  so Ollama export works for this model (E2B-it and E4B-it bnb-4bit variants were
  already present, 31B-it was inconsistently omitted)
- Register CHAT_TEMPLATES["lfm-2.5"] as alias of the lfm-2 template to prevent
  KeyError when Studio resolves LFM2.5 models through MODEL_TO_TEMPLATE_MAPPER

* Add missing LFM2 bnb-4bit INT_TO_FLOAT_MAPPER entry

unsloth/LFM2-1.2B-unsloth-bnb-4bit is referenced in model_mappings.py
but had no mapper.py entry, so model resolution would fail when users
load that variant with load_in_4bit=False or when the float name is
used with load_in_4bit=True.

* Fix review findings for PR #16

1. ollama_template_mappers.py: Restore dropped Gemma-4 base model IDs
   (E2B, E4B, 31B, 26B-A4B) and add missing google/ upstream IDs to
   the gemma4 Ollama mapper for consistency with other gemma entries.

2. mapper.py: Remove self-mapping non-bnb-4bit entries from
   __INT_TO_FLOAT_MAPPER that were polluting FLOAT_TO_INT_MAPPER with
   lowercase 16-bit names, causing load_in_4bit=True to return bad
   model names. Add direct MAP_TO_UNSLOTH_16bit entries to preserve
   the google->unsloth 16-bit redirects.

3. mapper.py: Add LFM2.5 MAP_TO_UNSLOTH_16bit redirect so
   LiquidAI/LFM2.5-1.2B-Instruct resolves to its unsloth mirror.

* Add review tests for PR #4950

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove top-level test files

These test_*.py files were added at the repo root rather than under tests/.
Removing them from this PR; the production mapper changes remain.

* Add gemma-4-26B-A4B-it mapping

Adds unsloth/gemma-4-26B-A4B-it to __INT_TO_FLOAT_MAPPER as a 2-tuple so
google/gemma-4-26B-A4B-it routes to unsloth/gemma-4-26B-A4B-it across
INT_TO_FLOAT_MAPPER, FLOAT_TO_INT_MAPPER, and MAP_TO_UNSLOTH_16bit.

The 26B-A4B (MoE) model has no bnb-4bit variant, so the key uses the
plain unsloth name rather than the -unsloth-bnb-4bit suffix.

Removes the now-redundant standalone _add_with_lower call for the -it
variant; the 16bit mapping is registered via the dict loop.

* Add unsloth-bnb-4bit mappings for gemma-4 base (non-it) models

Adds E2B, E4B, 31B base unsloth-bnb-4bit entries to __INT_TO_FLOAT_MAPPER.
The 26B-A4B (MoE) base has no bnb-4bit variant on HF, so it stays on the
standalone _add_with_lower line for the 16bit-only routing.

Removes the redundant _add_with_lower lines for E2B, E4B, 31B base since
the dict loop now registers the same google->unsloth route through the
2-tuple entries, plus full FLOAT_TO_INT and INT_TO_FLOAT coverage.

---------

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-15 07:52:12 -07:00
Avaya Aggarwal
7c5464ad71
feat: Add cactus QAT scheme support (#4679)
* feat: Add cactus QAT scheme support

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* test(qat): add tests for cactus QAT scheme and fix missing import

* Fix cactus QAT scheme: correct MappingType import, tighten PerGroup filter

- Drop the broken `from torchao.dtypes import MappingType` import. `MappingType`
  lives in `torchao.quantization` (and `torchao.quantization.quant_primitives`);
  it is not exported from `torchao.dtypes` in any supported torchao release
  (verified on 0.14, 0.16, 0.17). The previous code raised `ImportError` on
  every cactus call and was masked as a misleading 'torchao not found' error.
- Since `IntxWeightOnlyConfig` already defaults `mapping_type` to
  `MappingType.SYMMETRIC`, drop the explicit kwarg entirely and remove the
  import. Behavior is unchanged.
- Introduce a named `group_size = 32` constant (matches the int4 / fp8-int4
  pattern in the surrounding branches) and add a `% group_size == 0`
  divisibility guard to the filter. `PerGroup(32)` requires
  `in_features % 32 == 0` at `quantize_()` time, otherwise torchao raises
  `ValueError: in_features (N) % group_size (32) must be == 0`. The old
  `in_features >= 32` filter would admit non-aligned widths (e.g. 33, 48, 65,
  127) and crash `_prepare_model_for_qat` for those shapes.

* Warn when cactus QAT skips non-divisible Linear layers

Multiple reviewers flagged that the divisibility guard added in the
previous commit can silently leave Linear layers in full precision when
their in_features is not a multiple of 32. For currently supported
Unsloth models (Qwen, Llama, Gemma, Mistral, Phi) every Linear width is
already a multiple of 32/64/128 so this never triggers, but surfacing
the coverage gap is cheap and avoids users assuming 100% QAT coverage
when they bring a custom model with unusual shapes.

Emit a UserWarning listing up to the first 8 skipped layers whenever
the cactus filter excludes any Linear due to the modulo guard. This
keeps the lenient silent-skip behavior (consistent with int4 /
fp8-int4), but stops making it silent.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-04-15 07:40:03 -07:00
Avaya Aggarwal
f18e9dddf0
feat: Add support for OLMo-3 model (#4678)
* feat: Add support for OLMo-3 model in mapping and tests

* Update unsloth/models/mapper.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update tests/test_get_model_name.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Fix casing, add Think variants, and align version gate for OLMo-3 PR 4678

Mapper: switch slugs from OLMo-3 to canonical Olmo-3 mixed case, drop the
non-existent unsloth/Olmo-3-7B-Instruct-bnb-4bit dead alias, and add the
already-published Olmo-3-7B-Think and Olmo-3-32B-Think Unsloth mirrors.

Loader: change the olmo3 transformers version gate from Version("4.57.0")
to Version("4.57.0.dev0") so nightly/source builds that already contain
olmo3 are not blocked, matching the OLMo-2, Gemma 3 and Cohere patterns.

* Use canonical Olmo-3 casing and cover Think variants in OLMo-3 tests

Mirrors the mapper.py fixes on pr-4678-code: HuggingFace canonical slugs
for the OLMo-3 family use mixed-case Olmo-3 (not OLMo-3 like OLMo-2), and
Unsloth already hosts Olmo-3-7B-Think and Olmo-3-32B-Think mirrors, so
the resolution matrix now covers all three published Olmo-3 families.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-04-15 07:39:11 -07:00
Daniel Han
c3cd890357
Studio: refresh Downloaded GGUF list and recurse into variant subdirs (#5032)
* Studio: refresh Downloaded GGUF list and recurse into variant subdirs

Two fixes for the model picker's "Downloaded" section.

Frontend (`pickers.tsx`):
* `HubModelPicker`'s mount effect short-circuited the cached-gguf and
  cached-models refetch whenever the module-level cache already had
  entries (`if (alreadyCached) return;`). After downloading a new repo
  in the same session, reopening the picker rendered the stale cache
  and the new repo never appeared in "Downloaded" until a full page
  reload. The early return is removed so the lists are always refreshed
  on mount; the module cache still drives the initial render so there
  is no spinner flash when we already had data.

Backend (`utils/models/model_config.py`):
* `list_local_gguf_variants` and `_find_local_gguf_by_variant` used a
  non-recursive `Path.glob("*.gguf")`. Some HF GGUF repos (e.g.
  `unsloth/gemma-4-26B-A4B-it-GGUF`) place the largest quants under a
  variant-named subdirectory such as `BF16/...gguf`, which the
  top-level glob missed. Both helpers now use `rglob` and the variant
  filename is stored as a path relative to the scan root so the
  locator can still find the file.

The flat-layout case (variants directly in the snapshot root) is
unchanged: verified against `unsloth/gemma-4-E2B-it-GGUF` which still
returns its UD-Q4_K_XL variant correctly.

* Studio: emit posix-style relative filenames for local GGUF subdirs

`list_local_gguf_variants` was doing `str(f.relative_to(p))`, which on
Windows produces backslash-separated paths like `BF16\foo.gguf`. The
remote `list_gguf_variants` (HF API path) always returns forward-slash
filenames such as `BF16/foo.gguf`, so the two would diverge on Windows.

Switch to `.as_posix()` so the local and remote variant filenames stay
identical across Linux, macOS, and Windows. Verified by simulating with
`PureWindowsPath` in the test suite.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: detect mmproj at snapshot root for nested-variant layouts

When _find_local_gguf_by_variant returns a weight file inside a
quant-named subdir (e.g. snapshot/BF16/foo.gguf), detect_mmproj_file
was scanning only the immediate parent and missing the mmproj file
sitting at the snapshot root. The model was then loaded without
--mmproj, silently breaking vision support for repos that ship
nested variants.

detect_mmproj_file now takes an optional search_root and walks up
from the weight file to that root, in order, so the mmproj at the
snapshot root is picked up. Sibling quant subdirs are not scanned,
so an unrelated variant's mmproj does not leak in.

Also apply the suggested micro-optimization on relative_to in
list_local_gguf_variants -- only build the posix path when storing
the first file for a quant.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-15 07:34:42 -07:00
Daniel Han
156f3fc4b0
Gate trl disable_gradient_checkpointing patch warning on UNSLOTH_ENABLE_LOGGING (#5038)
The "Patched trl.models.utils.disable_gradient_checkpointing with a no-op"
warning fires once on every Unsloth import, including from notebooks where
the user did not opt into verbose logging. It is a routine integration
patch, not an anomaly the user needs to know about. Gate it on
UNSLOTH_ENABLE_LOGGING=1 like other diagnostic notices.
2026-04-15 07:33:48 -07:00
jonahsamost
777e1bd0ac
fix (#4887) 2026-04-15 07:21:03 -07:00
Daniel Han
1a4ca5eca8
Fix grad-accum accepts_loss_kwargs detection for vision wrappers (#5036)
* Fix grad-accum model_accepts_loss_kwargs detection for vision wrappers

Replace the source-string rewrite of Trainer.__init__ with an instance-level
accepts_loss_kwargs shadow applied on the loaded model. Covers:

  1. Unsloth-compiled forward -> True, so HF Trainer does not double-scale
     on top of unsloth_fixed_cross_entropy's num_items_in_batch division.
  2. Stock forward on a conditional-generation wrapper (Gemma3n, Gemma3
     pre-4.57, Qwen-VL family, etc.) where the outer class has no
     accepts_loss_kwargs but the inner .model declares False -> False.
     This is the case that reproduces issue #4982 under trust_remote_code
     or UNSLOTH_COMPILE_DISABLE, where the previous fix's outer-attr
     check walked past the inner model and fell through to signature
     inspection.
  3. Text LMs without any explicit accepts_loss_kwargs -> leave HF default.

The previous .replace()-based patch silently no-ops on transformers 4.48
through 4.52 (variable named model, not unwrapped_model) and is fragile
against any upstream reformat. The new helper walks the PEFT / HF wrapper
chain, finds the first class that declares accepts_loss_kwargs on its own
class dict (type(m).__dict__, not hasattr, to avoid PEFT __getattr__
forwarding), and setattr-shadows that value at every wrapper level so
HF Trainer's hasattr(unwrapped_model, ...) check picks it up at whichever
level accelerate.unwrap_model returns.

Also adds an unconditional post-init clamp of
accelerator.gradient_accumulation_steps = 1 to work around the
transformers 5.0 through 5.5 GradientAccumulationPlugin regression that
makes accelerator.backward divide loss by GA on top of training_step's
own /GA division. Fixed upstream in 5.6.0.dev0; no-op on 4.x and 5.6+.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Trim comments

* Address review: cover PEFT-after-load and custom compile location

Two review findings from 3/20 reviewers:

1. [3 of 20 reviewers] apply_accepts_loss_kwargs_fix was called from the
   loaders before get_peft_model wraps the base model, so on transformers
   4.48-4.52 (which does hasattr on the outer model) the instance shadow
   on the base model was lost after PEFT wrapping. Fix: also call it from
   the wrapped Trainer.__init__ so it runs on whatever model the user
   actually hands to Trainer, which is always the final wrapped form.

2. [1 of 20 reviewers] _forward_is_unsloth_compiled hard-coded the
   substrings "unsloth_compiled" / "unsloth_cache" in the co_filename
   check, which misclassifies compiled forwards when
   UNSLOTH_COMPILE_LOCATION is set to a custom directory. Fix: new
   _unsloth_compile_cache_leaves helper that reads the env var and
   matches the basename against path components, honoring both the
   default and any user override.

Verified locally:
- PEFT-after-load simulation: HF's hasattr(peft, "accepts_loss_kwargs")
  now returns True after our init wrapper runs, and value resolves to
  False on Gemma3n-style inner wrappers.
- Custom UNSLOTH_COMPILE_LOCATION simulation: compiled detection returns
  True for /tmp/my_custom_cache/compiled.py when the env var is set.
- End-to-end Gemma-3 270m + LoRA SFT unchanged: loss 4.9626, grad-norm
  matches prior run, all 4 wrapper levels now carry the shadowed attr.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-15 06:59:36 -07:00
Daniel Han
1ccfd2e0a5
fix(rocm): tighten gfx regex to ignore generic ISA lines (#5033)
* fix(rocm): tighten gfx regex to ignore generic ISA lines

ROCm 6.1+ rocminfo emits generic ISA names such as
"amdgcn-amd-amdhsa--gfx11-generic" and "amdgcn-amd-amdhsa--gfx9-4-generic"
alongside the real GPU name. The previous `gfx[1-9]` regex used in
`_has_rocm_gpu` matched both, so a host with only a generic ISA entry
would be reported as having a usable AMD GPU.

Tighten the pattern to `gfx[1-9][0-9a-z]{2,3}` so only real gfx ids
match. This covers every documented target from GFX6 (gfx600) through
GFX12 (gfx1201), including letter-suffixed ids like gfx90a (MI250 /
MI250X) and gfx90c. Documented generic ISA names always have 1 or 2
digits before the dash and no longer match.

Applied to both `studio/install_python_stack.py` and
`studio/install_llama_prebuilt.py` so the two detection paths agree.

Co-authored-by: Martin Hoyer <mhoyer@redhat.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Martin Hoyer <mhoyer@redhat.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-15 05:24:41 -07:00
Daniel Han
b7a8ff2833
Respect classification head skip list on pre-quantized 4-bit checkpoints (#5027) (#5034)
* Respect classification head skip list on pre-quantized 4-bit checkpoints (#5027)

FastLanguageModel.from_pretrained(..., num_labels=N) crashed with
"NotImplementedError: normal_kernel_cuda not implemented for 'Byte'" on
pre-quantized bnb 4-bit checkpoints (e.g. unsloth/Qwen3-4B-bnb-4bit)
when running on transformers 5.x.

Two pieces were needed to close this out:

1. unsloth_zoo PR: add "score", "classifier", "qa_outputs" to
   SKIP_QUANTIZATION_MODULES so replace_with_bnb_linear leaves task
   heads in the compute dtype.

2. This commit: for pre-quantized checkpoints, transformers reads
   llm_int8_skip_modules from the quantization_config baked into
   config.json and ignores the runtime BitsAndBytesConfig we pass via
   kwargs. Unsloth must merge its skip list into
   model_config.quantization_config.llm_int8_skip_modules before the
   from_pretrained call, or the checkpoint's frozen list
   (e.g. ["lm_head", "multi_modal_projector", "merger",
   "modality_projection"]) wins and the `score` head gets converted to
   Linear4bit with uint8 storage, then _init_weights calls normal_ on
   uint8 and crashes.

Also add a defensive post-load cast on the task head to guard against
any residual path that ends up with a non-floating head dtype.

Verified on transformers 4.57.6 and 5.5.0 with:
- unsloth/Qwen3-4B-bnb-4bit + num_labels=3
- unsloth/Qwen3-4B (non-bnb repo, load_in_4bit=True)
- unsloth/Llama-3.2-1B-Instruct + num_labels=3
- unsloth/ModernBERT-large classifier head (bert_classification notebook)
- Regression: causal LM path unchanged, backbone still 4-bit
- 3-step SFT on num_labels=3 confirms gradient flow and weight updates
  on score.weight

Fixes unslothai/unsloth#5027

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-15 05:16:33 -07:00
David Solanas Sanz
1fcb2502cf
fix: prevent offline freeze by fixing stats retry and forwarding local_files_only (#5016)
Fixes #2393.

- `_utils.py`: `has_internet()` now respects `HF_HUB_OFFLINE` with truthy variant parsing in addition to `TRANSFORMERS_OFFLINE`.
- `_utils.py`: replace uncontrolled `except Exception: stats_check()` retry (which had no time limit and could freeze on Kaggle offline mode) with a logged skip.
- `loader.py`: forward `local_files_only` from kwargs into all `AutoConfig.from_pretrained` and `PeftConfig.from_pretrained` probes in `FastLanguageModel.from_pretrained` and `FastModel.from_pretrained`, including the PEFT base-model reload paths.
2026-04-15 04:51:31 -07:00
Lee Jackson
f9ef639dde
Studio: support GGUF variant selection for non-suffixed repos (#5023)
* fix: support GGUF variant selection for non-suffixed repos

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: harden GGUF detection across cached models and picker flows

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* chore: use shared GGUF picker helper for search rows

* fix: avoid mixed cache duplication and preserve GGUF fallback detection

* fix: unify GGUF cache matching and merge picker hints

* fix: normalize local GGUF matching across picker and model config

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: robust cached-gguf classification + hint-aware click routing

- _repo_gguf_size_bytes: treat size_on_disk=None as 0 and dedupe fallback
  by commit_hash so partial/interrupted downloads don't TypeError out of
  sum() and wipe the entire cached list.
- list_cached_gguf / list_cached_models: narrow per-repo try/except so
  one malformed repo no longer poisons the whole response.
- handleModelClick: route through isKnownGgufRepo instead of the
  suffix-only isGgufRepo, so non-suffixed GGUF repos still open the
  variant expander from every call site.
- Replace the modelIsGgufById/resultIsGgufById Maps with Sets of known
  GGUF ids to stop conflating "no hint" with "known not-GGUF".
- Make HfModelResult.isGguf required (it is always set in makeMapModel).
- Add regression tests for the None size case, mixed-repo inclusion in
  cached-gguf, and per-repo error isolation.

* fix: exclude mmproj from GGUF classification and case-normalize hint lookups

- _repo_gguf_size_bytes now filters mmproj vision-adapter files so
  safetensors+mmproj.gguf repos stay on the cached-models path and
  non-GGUF rows no longer show zero pickable variants. A vision-capable
  GGUF repo (main weight + mmproj adapter) still classifies as GGUF and
  reports the main weight size.
- modelGgufIds / resultGgufIds now key on lowercased ids and
  isKnownGgufRepo lowercases its lookup, so store and HF-search ids
  that differ only by casing still match the same GGUF hint.
- New regression tests: mmproj-only repo excluded from cached-gguf,
  same repo included in cached-models, vision-capable repo still
  classified as GGUF with correct size.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
2026-04-15 15:32:01 +04:00
Roland Tannous
13928b5f0e
Add configurable PyTorch mirror via UNSLOTH_PYTORCH_MIRROR env var (#5024)
* Add configurable PyTorch mirror via UNSLOTH_PYTORCH_MIRROR env var

When set, UNSLOTH_PYTORCH_MIRROR overrides the default
https://download.pytorch.org/whl base URL in all four install scripts
(install.sh, install.ps1, studio/setup.ps1, studio/install_python_stack.py).
When unset or empty, the official URL is used. This lets users behind
corporate proxies or in regions with poor connectivity to pytorch.org
point at a local mirror without patching scripts.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add pytest for UNSLOTH_PYTORCH_MIRROR in install_python_stack.py

Tests that _PYTORCH_WHL_BASE picks up the env var when set, falls back
to the official URL when unset or empty, and preserves the value as-is
(including trailing slashes).

* Remove stale test assertions for missing install.sh messages

* Fix GPU mocking in test_get_torch_index_url.sh

Extract _has_usable_nvidia_gpu and _has_amd_rocm_gpu alongside
get_torch_index_url so the GPU-presence checks work in tests.
Add -L flag handling to mock nvidia-smi so it passes the GPU listing
check. All 26 tests now pass on CPU-only machines.

* Strip trailing slash from UNSLOTH_PYTORCH_MIRROR to avoid double-slash URLs

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-15 11:39:11 +04:00
Datta Nimmaturi
826c98f3c0
[moe][gemma4] Target MoE for gemma4 (#4913)
* Target MoE for gemma4

* refactor attention impl determine

* Revert "refactor attention impl determine"

This reverts commit 888fca08110a9a74278dc1ebc14d0da043bbd11d.

* Remove attention policy changes from gemma4 MoE fix
2026-04-14 16:53:07 -05:00
Daniel Han
5aa8c15246
Studio: hard-stop at n_ctx with a 'Context limit reached' toast (#5021)
* Studio: hard-stop at n_ctx with a dedicated 'Context limit reached' toast

llama-server's default behavior when the KV cache fills is to silently
drop the oldest non-``n_keep`` tokens and keep generating. The UI has
no way to tell the user that earlier turns were evicted -- they just
see degraded continuity and a confusing ``5,361 / 4,096`` on the
context usage bar.

Launch llama-server with ``--no-context-shift`` so it returns a clean
error once the request would exceed ``n_ctx``. In the chat adapter,
catch the error, identify it as a context-limit error via
``isContextLimitError()``, and surface a dedicated toast that names
the exact control to adjust: the ``Context Length`` field in the chat
Settings panel.

Also add a lightweight tooltip hint on ``ContextUsageBar`` when usage
crosses 85%, so users see the "raise Context Length in Settings"
suggestion before they hit the hard stop.

Tests:

  * ``test_llama_cpp_no_context_shift.py`` pins the ``--no-context-shift``
    flag in the static launch-command template, and pins it inside the
    unconditional ``cmd = [ ... ]`` block so a future refactor can't
    hide it behind a branch.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Shorten --no-context-shift comment to 1 line

* Match backend _friendly_error rewrite in isContextLimitError

Codex review on PR caught that ``backend/routes/inference.py::_friendly_error``
rewrites the raw llama-server text
  "request (X tokens) exceeds the available context size (Y tokens)"
into
  "Message too long: X tokens exceeds the Y-token context window. ..."
on the main streaming GGUF path. The heuristic only looked for
"context size" / "exceeds the available context" / "context shift",
none of which survive the rewrite, so the new "Context limit reached"
toast would never fire for the most common case. Add matches for
"message too long" and "context window" so both wordings hit.

Also addresses Gemini feedback on the launch-flag test:
  * Use ``inspect.getsource(LlamaCppBackend.load_model)`` instead of
    reading ``__file__`` directly; scopes the assertions to the
    function that actually launches llama-server.
  * Replace the hardcoded ``"            ]"`` indent search with a
    line-at-a-time scan for a line that is just ``]``, so the test
    survives reformatting.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-14 10:58:20 -07:00
Daniel Han
5861a7ce15
Studio: split model-load progress label across two rows (#5020)
* Studio: split model-load progress label across two rows

The chat flow and training overlay both compose a progress label like
"112.6 of 122.3 GB • 331.0 MB/s • 30s left" and render it next to the
percent badge in a single flex row. Once the rate + ETA part shows up,
the label outgrows the row width and wraps mid-phrase, orphaning the
percent ("19 left %") onto a second ragged line.

Fix in model-load-status.tsx: split the label on the first " • " into
a primary (size) chunk that stays on row 1 with the percent, and a
secondary (rate/ETA) chunk that renders on its own muted row below.
Labels without a bullet (e.g. "22.8 GB downloaded") collapse cleanly
to one row. The inline-status variant keeps only the primary and
surfaces the full label via the tooltip.

Also extracts the rate/ETA math out of useTransferStats into a pure
``transfer-stats.ts`` module (appendSample + computeTransferStats) so
it can be reasoned about and tested without React. The hook is now a
thin wrapper that feeds sample history through the pure functions.

Backend: adds two companion test files for load_progress():

  * test_llama_cpp_load_progress_matrix.py (21 tests) -- platform
    matrix (Linux /proc, macOS/Windows absence), VmRSS parsing
    variants (tab/space/missing/malformed), filesystem edges (HF-cache
    symlinks, broken symlinks, nonexistent paths, relative paths),
    shard aggregation (partial multi-shard, two series in same dir,
    mmproj-* exclusion, single-file), lifecycle races, concurrent
    sampling (10 threads x 50 iters against real /proc), fraction
    bounds.
  * test_llama_cpp_load_progress_live.py (5 tests) -- no-mock live
    integration: real subprocess allocating 100 MB to match VmRSS,
    real ready phase, real dead-pid degradation, real shard
    aggregation, repeated polling. Skipped on non-Linux.

Both complement the existing test_llama_cpp_load_progress.py.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Hoist splitProgressLabel out of JSX IIFE (review feedback)

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-14 10:58:16 -07:00
Eda Z
5b8dbdc3c2
Fix bitsandbytes ROCm install by using pip instead of uv (#4966)
* Fix bitsandbytes ROCm install by using pip instead of uv

* Also use pip for PyPI fallback path in _install_bnb_rocm

The original fix correctly switched the pre-release wheel install from
uv to pip, but left the PyPI fallback path on uv. If uv breaks bnb
on ROCm, the fallback would hit the same issue. Move pip bootstrap
before the branch so both paths use pip consistently.

* Harden pip bootstrap: try ensurepip first, warn on failure

- Try ensurepip --upgrade before falling back to uv pip install pip.
  ensurepip works offline and does not need PyPI, making the bootstrap
  robust when the network or index is unavailable.
- If both ensurepip and uv fail, emit a visible warning instead of
  silently swallowing the error (which previously led to a cryptic
  "No module named pip" downstream).
- Use run_maybe_quiet so --verbose users see bootstrap output.
- Update comment to document the actual root cause: uv rejects the
  wheel because filename version and metadata version disagree.

* Add --isolated to pip install calls in _install_bnb_rocm

uv pip install ignores pip.conf and PIP_* env vars, but python -m pip
reads them. Without --isolated, users with PIP_INDEX_URL pointing to a
private mirror that does not carry bitsandbytes would see the PyPI
fallback fail where it previously worked under uv. --isolated restores
parity with the old uv behavior.

* Drop --isolated from PyPI fallback in _install_bnb_rocm

--isolated suppresses PIP_INDEX_URL, PIP_EXTRA_INDEX_URL, and pip.conf.
This is correct for the pre-release path (hardcoded GitHub URL, no index
consulted), but breaks the PyPI fallback for users in corporate or
air-gapped environments whose only route to bitsandbytes is a private
mirror configured via those mechanisms. Keep --isolated on the direct-URL
pre-release install; drop it from the index-dependent fallback.

* Drop --isolated from pre-release pip install, fix warning wording

--isolated suppresses pip.conf cert/proxy/CA settings in addition to
index config. For the direct GitHub URL, index config is irrelevant but
cert/proxy settings matter in corporate SSL-inspection environments.
Without this fix, users with pip.conf-based CA bundles get a TLS error
on the pre-release download and silently fall back to the broken PyPI
version -- the exact outcome the PR is trying to prevent.

Also fix the fallback warning: "unreachable" is too specific since the
pre-release install can fail for reasons other than network reachability.

---------

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-04-14 10:23:40 -07:00
pre-commit-ci[bot]
a0b9d14081
[pre-commit.ci] pre-commit autoupdate (#5004)
updates:
- [github.com/astral-sh/ruff-pre-commit: v0.15.9 → v0.15.10](https://github.com/astral-sh/ruff-pre-commit/compare/v0.15.9...v0.15.10)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-14 09:49:18 -07:00
Daniel Han
bb14ab144a
Studio: live model-load progress + rate/ETA on download and load (#5017)
* Studio: live model-load progress + rate/ETA on download and load

Two UX fixes for the opaque multi-minute wait between clicking Load
and being able to chat, visible most clearly on large MoE GGUFs like
MiniMax-M2.7 (131 GB of weights on a 97 GB GPU):

1. **Model-load phase is now observable.** The existing chat flow
   transitions the toast to "Starting model..." as soon as the
   download hits 100%, then shows a spinner with no other feedback
   until llama-server reports healthy. For a 130 GB model that spinner
   freezes for five-plus minutes while the kernel pages shards into
   the page cache. A new `GET /api/inference/load-progress` endpoint
   samples `/proc/<pid>/status VmRSS` on the llama-server subprocess
   against the sum of shard file sizes on disk, so the UI can render
   a real bar plus rate / ETA during that window.

2. **Rate and ETA on downloads and loads.** Both the chat toast and
   the training-start overlay used to show a static pair of numbers
   (for example "15.4 of 140.8 GB"). A rolling 15-second window over
   the existing byte-series now surfaces "85.3 MB/s, 24m 23s left"
   beside that pair. The estimator is shared between the download
   and load phases so the numbers don't reset when the phase flips.

Also fixes a pre-existing assignment bug uncovered while wiring this
up: `load_model` was storing the caller's `gguf_path` kwarg into
`self._gguf_path`, which is `None` on the HF-download code path. The
resolved on-disk path (`model_path`) is what llama-server actually
mmaps; downstream consumers need that. No existing reader used
`_gguf_path`, so this is a correctness fix for the new endpoint.

- Backend: `LlamaCppBackend.load_progress()`, `GET /api/inference/load-progress`, `LoadProgressResponse` Pydantic model.
- Frontend: `useTransferStats` hook, `formatRate` / `formatEta` helpers, `getLoadProgress` client, rewired chat toast and `DownloadRow` in the training overlay.
- Tests: `studio/backend/tests/test_llama_cpp_load_progress.py` covers empty states, mmap phase, ready phase, sharded total aggregation, missing gguf_path, and unreadable /proc (7 cases). `tsc -b` and `vite build` on the frontend both clean.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-14 09:46:22 -07:00
Roland Tannous
514bb3a20e
studio: pin peft to 0.18.1 to fix export subprocess issues (#5015)
* studio: pin peft to 0.18.1 to fix export subprocess issues

peft 0.19.0 causes export subprocess shutdown failures in Studio.
Reverting to 0.18.1 resolves the issue.

* studio: move peft pin to extras-no-deps to prevent torch upgrade

Installing peft via overrides.txt would resolve its deps and pull in
torch>=0.11.0, breaking other pinned packages. Moving the pin to
extras-no-deps.txt ensures --no-deps is used during install.
2026-04-14 20:16:30 +04:00
Datta Nimmaturi
4328d0b4f6
Fix num_items_in_batch GA for Gemma4 (#4998)
* Fix num_items_in_batch GA for Gemma4

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-14 09:01:10 -07:00
Daniel Han
7252410ccc
studio: stream export worker output into the export dialog (#4897)
* studio: stream export worker output into the export dialog

The Export Model dialog only showed a spinner on the "Exporting..."
button while the worker subprocess was doing the actual heavy lifting.
For Merged to 16bit and GGUF / Llama.cpp exports this meant several
minutes (or more, for large models) of opaque silence, with no way to
tell whether save_pretrained_merged, convert_hf_to_gguf.py, or
llama-quantize was making progress.

This adds a live terminal-style output panel inside the export dialog,
rendered just above the Cancel / Start Export buttons and scrollable
with auto-follow-tail. It shows stdout and stderr from both the worker
process itself and any child process it spawns (GGUF converter,
llama-quantize), coloured by stream.

Backend

- core/export/worker.py: new _setup_log_capture(resp_queue) installed
  before LogConfig.setup_logging. It saves the original stdout/stderr
  fds, creates pipes, os.dup2's the write ends onto fds 1 and 2 (so
  every child process inherits the redirected fds), and spins up two
  daemon reader threads. Each thread reads bytes from a pipe, echoes
  them back to the original fd (so the server console keeps working),
  splits on \n and \r, and forwards each line to the resp queue as
  {"type":"log","stream":"stdout|stderr","line":...,"ts":...}.
  PYTHONUNBUFFERED=1 is set so nested Python converters flush
  immediately.

- core/export/orchestrator.py:
  - Thread-safe ring buffer (collections.deque, maxlen 4000) with a
    monotonically increasing seq counter. clear_logs(),
    get_logs_since(cursor), get_current_log_seq(), is_export_active().
  - _wait_response handles rtype == "log" by appending to the buffer
    and continuing the wait loop. Status messages are also surfaced as
    a "status" stream so users see high level progress alongside raw
    subprocess output.
  - load_checkpoint, _run_export, and cleanup_memory now wrap their
    bodies with the existing self._lock (previously unused), clear the
    log buffer at the start of each op, and flip _export_active in a
    try/finally so the SSE endpoint can detect idle.

- routes/export.py:
  - Wrapped every sync orchestrator call (load_checkpoint,
    cleanup_memory, export_merged_model, export_base_model,
    export_gguf, export_lora_adapter) in asyncio.to_thread so the
    FastAPI event loop stays free during long exports. Without this
    the new SSE endpoint could not be served concurrently with the
    blocking export POST.
  - New GET /api/export/logs/stream SSE endpoint. Honors
    Last-Event-ID and a since query param for reconnect, emits log /
    heartbeat / complete / error events, uses the id field to carry
    the log seq so clients can resume cleanly. On first connect
    without an explicit cursor it starts from the current seq so old
    lines from a previous run are not replayed.

Frontend

- features/export/api/export-api.ts: streamExportLogs() helper that
  authFetches the SSE endpoint and parses id / event / data fields
  manually (same pattern as streamTrainingProgress in train-api.ts).

- features/export/components/export-dialog.tsx:
  - Local useExportLogs(exporting) hook that opens the SSE stream on
    exporting transitions to true, accumulates up to 4000 lines in
    component state, and aborts on cleanup.
  - New scrollable output panel rendered above DialogFooter, only
    shown for Merged to 16bit and GGUF / Llama.cpp (LoRA adapter is
    a fast disk write with nothing to show). Dark terminal styling
    (bg-black/85, emerald text, rose for stderr, sky for status),
    max-height 14rem, auto-scrolls to the bottom on new output but
    stops following if the user scrolls up. A small streaming / idle
    indicator is shown next to the panel title.
  - DialogContent widens from sm:max-w-lg to sm:max-w-2xl when the
    output panel is visible so the logs have room to breathe.

Verified

- Python smoke test (tests/smoke_export_log_capture.py): spawns a
  real mp.get_context("spawn") process, installs _setup_log_capture,
  confirms that parent stdout prints, parent stderr prints, AND a
  child subprocess invoked via subprocess.run (both its stdout and
  stderr) are all captured in the resp queue. Passes.
- Orchestrator log helpers tested in isolation: _append_log,
  get_logs_since (with and without a cursor), clear_logs not
  resetting seq so reconnecting clients still progress. Passes.
- routes.export imports cleanly in the studio venv and /logs/stream
  shows up in router.routes.
- bun run build: tsc -b plus vite build, no TypeScript errors.

No existing export behavior is changed. If the subprocess, the SSE
endpoint, or the frontend hook fails, the export itself still runs to
completion the same way it did before, with or without logs visible.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* export dialog: trim bootstrap noise, scope logs per screen, show realpath

Several follow-ups to the live export log work:

1. Worker bootstrap noise (transformers venv activation, Unsloth banner,
   "Top GGUF/hub models" lists, vision detection, 2k-step weight load
   bar) is dropped from the export-dialog stream. A threading.Event
   gate in worker.py defaults closed and only opens once _handle_export
   actually starts; until then the reader thread still echoes lines to
   the saved console fd for debugging but does not push them onto the
   resp_queue. The orchestrator already spawns a fresh subprocess for
   every checkpoint load, so the gate is naturally reset between runs.

2. tqdm in non-tty mode defaults to a 10s mininterval, which makes
   multi-step bars look frozen in the panel. Set TQDM_MININTERVAL=0.5
   in the worker env so any tqdm-driven progress emits more often.

3. The dialog's useExportLogs hook now also clears its line buffer
   when exportMethod or open changes, so re-opening the dialog into a
   different action's screen no longer shows the previous action's
   saved output. A useElapsedSeconds tick + "Working Xs" badge in the
   log header gives users a visible sign that long single-step phases
   (cache copies, GGUF conversion) are still running when no new lines
   are arriving.

4. ExportBackend.export_{merged,base,gguf,lora} now return
   (success, message, output_path); the worker forwards output_path on
   each export_*_done response, the orchestrator's _run_export passes
   it to routes/export.py, which surfaces it via
   ExportOperationResponse.details.output_path. The dialog's Export
   Complete screen renders the resolved on-disk realpath under "Saved
   to" so users can find their exported model directly.

* fix(cli): unpack 3-tuple return from export backend

ExportOrchestrator.export_{merged,base,gguf,lora} now return
(success, message, output_path) so the studio dialog can show
the on-disk realpath. The CLI still unpacked 2 values, so every
`unsloth export --format ...` crashed with ValueError before
reporting completion. Update the four call sites and surface
output_path via a "Saved to:" echo.

* fix(studio): anchor export log SSE cursor at run start

The export dialog SSE defaulted its cursor to get_current_log_seq()
at connect time, so any line emitted between the POST that kicks
off the export and the client opening the stream was buffered with
seqs 1..k and then skipped (seq <= cursor). Long-running exports
looked silent during their first seconds.

Snapshot _log_seq into _run_start_seq inside clear_logs() and
expose it via get_run_start_seq(). The SSE default cursor now uses
that snapshot, so every line emitted since the current run began
is reachable regardless of when the client connects. Old runs
still can't leak in because their seqs are <= the snapshot.

* fix(studio): reconnect export log SSE on stream drop

useExportLogs launched streamExportLogs once per exporting
transition and recorded any drop in .catch(). Long GGUF exports
behind a proxy with an idle kill-timeout would silently lose the
stream for the rest of the run even though the backend already
supports Last-Event-ID resume. The "retry: 3000" directive emitted
by the backend is only meaningful to native EventSource; this
hook uses a manual fetch + ReadableStream parse so it had no
effect.

Wrap streamExportLogs in a retry loop that tracks lastSeq from
ExportLogEvent.id and passes it as since on reconnect. Backoff is
exponential with jitter, capped at 5s, reset on successful open.
The loop stops on explicit backend `complete` event or on effect
cleanup.

* fix(studio): register a second command so Typer keeps `export` as a subcommand

The CLI export unpacking tests wrap `unsloth_cli.commands.export.export`
in a fresh Typer app with a single registered command. Typer flattens a
single-command app into that command, so the test's
`runner.invoke(cli_app, ["export", ckpt, out, ...])` treats the leading
`"export"` token as an unexpected extra positional argument -- every
parametrized case failed with:

    Got unexpected extra argument (.../out)

Register a harmless `noop` second command so Typer preserves subcommand
routing and the tests actually exercise the 3-tuple unpack path they
were written to guard.

Before: 4 failed
After:  4 passed

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: studio-install <studio@local.install>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
2026-04-14 08:55:43 -07:00
Daniel Han
eca592effe
studio: show HF model download progress in training start overlay (#4894)
* studio: show HF model download progress in training start overlay

During the training setup phase, the overlay only displayed a static
"Loading model..." line while model weights were being downloaded from
Hugging Face. On slow connections this looked like the app had frozen.

This adds a small self-contained progress block inside the existing
TrainingStartOverlay that polls the existing
GET /api/models/download-progress endpoint and renders a Progress bar
with bytes downloaded, total bytes, and percent complete.

Notes:

- Frontend only change. No backend, worker, SSE, or runtime store edits.
- Reuses the existing getDownloadProgress client wrapper and the
  existing /api/models/download-progress endpoint that already scans
  the HF blob cache for completed and .incomplete files.
- selectedModel is read directly from useTrainingConfigStore inside the
  overlay, so no prop drilling and live-training-view.tsx is unchanged.
- Polling runs at 1500 ms and is gated on the HF repo regex
  (^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+$), the same regex the backend uses,
  so local paths and empty form state never hit the endpoint.
- Polling stops once progress reaches 1.0 so the bar can stay at 100
  until the overlay hides on the first training step.
- Network errors are silently swallowed, matching the chat side flow
  (the bar simply freezes at the last value).
- When downloadedBytes is 0 the block is hidden entirely, so cached
  models do not flash a progress bar.
- When the HF API cannot determine the total size, the block falls
  back to "X downloaded" with no percent and no bar.

Verified with bun run build (tsc -b plus vite build, no TypeScript
errors).

* training overlay: track dataset download + show on-disk realpath

Adds a dedicated "Downloading dataset..." section to the training-start
overlay alongside the existing model-weights one, so an HF dataset that
is downloading mid-startup is no longer mislabeled as model weights or
hidden entirely. The new GET /api/datasets/download-progress endpoint
mirrors /api/models/download-progress against the datasets-- prefix in
HF_HUB_CACHE.

Both endpoints now also return cache_path, the resolved on-disk
realpath of the snapshot directory (or the cache repo root if no
snapshot is materialized yet). The overlay surfaces this under each
download row so users can immediately see where the model and dataset
landed without digging through server logs.

The frontend's existing useModelDownloadProgress hook is generalized
to a single useHfDownloadProgress(repoId, fetcher) hook that the
model and dataset variants both delegate to, keeping polling, gating,
and completion semantics in one place.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: Polish training start overlay download progress UI (#4957)

* studio: polish training start overlay download progress visuals

* Fix formatCachePath cross-platform support and redundant sizeLabel

- Extend formatCachePath regex to also shorten macOS /Users/<user> paths to ~
- Suppress sizeLabel when no byte info is available (cachePath-only state),
  since the "Preparing" badge already conveys the status

* Fix misleading status badge when download total is unknown

- Hide badge when totalBytes is 0 but downloadedBytes > 0, since we cannot
  determine if the download is still in progress or already complete (happens
  when HF size metadata lookup fails for gated/private repos)
- Keep "Preparing" badge for the zero-bytes cachePath-only state
- Add Windows native path shortening to formatCachePath (C:\Users\<name>)

---------

Co-authored-by: Daniel Han <danielhanchen@gmail.com>

---------

Co-authored-by: studio-install <studio@local.install>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>
2026-04-14 08:54:01 -07:00
Daniel Han
44082cf88e
Studio: anchor ctx-slider warning threshold at 4096 when weights exceed VRAM (#5014)
* Studio: anchor ctx-slider warning threshold at 4096 when weights exceed VRAM

The chat settings sheet's ctx slider reads `max_context_length` from
`/api/inference/status` and renders

    Exceeds estimated VRAM capacity (N tokens). The model may use
    system RAM.

when the user drags the slider above that value. For models whose
weights fit on some GPU subset, `_max_context_length` was already set
to the binary-search cap and the warning fired correctly.

For models whose weights exceed 90% of every GPU subset's free memory
(e.g. MiniMax-M2.7-GGUF at 131 GB on a 97 GB GPU), the ceiling-probe
loop never matched a subset, so `max_available_ctx` stayed at the
native context (e.g. 196608). The slider ran all the way to native
with no indication that any value above the 4096 spec default would
trigger `--fit on` and degrade performance.

Anchor `max_available_ctx` at `min(4096, native_context_length)` when
no subset fits, so the warning fires at the right threshold and the
user sees the correct safe-zone / warning-zone split:

    Before (MiniMax-M2.7 on 97 GB GPU):
      slider 0 .. 196608, warning threshold = 196608  (never fires)

    After:
      slider 0 .. 196608, warning threshold = 4096    (fires correctly)

No frontend changes required: `chat-settings-sheet.tsx` already
consumes `ggufMaxContextLength` (= status.max_context_length) as the
warning threshold and `ggufNativeContextLength` as the slider max.

Adds tests/test_llama_cpp_max_context_threshold.py covering
weights-exceed-VRAM (single / multi-GPU), a native-ctx below the 4096
fallback case (don't lie about supported ctx), fittable-model
regressions (small / multi-GPU / tiny on huge GPU), and the
`max_context_length` property's fallback semantics.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-14 08:53:49 -07:00
Daniel Han
b2f80f210e
Studio: make GGUF disk-space preflight cache-aware (#5012)
* Studio: make GGUF disk-space preflight cache-aware

The pre-download disk check in LlamaCppBackend.load_model compared the
repo's total GGUF size against free disk without crediting bytes
already present in the Hugging Face cache. Re-loading a large cached
model (e.g. MiniMax-M2.7-GGUF at 131 GB) then failed cold with
"Not enough disk space to download any variant" whenever free disk
was below the full weight footprint, even though nothing actually
needed to be downloaded.

Subtract bytes already on disk via try_to_load_from_cache before
comparing against free space. A partial blob (interrupted download) is
not credited, so a second attempt still allocates room to finish the
download. The log line now also surfaces how much is already cached.

Adds tests/test_llama_cpp_cache_aware_disk_check.py covering the
fully-cached, partial-cache-insufficient-disk, partial-cache-enough-disk,
cold-cache, incomplete-blob, and zero-size-path-info cases. Sparse
tempfiles keep the GB-scale scenarios cheap to simulate.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-14 08:53:37 -07:00
Daniel Han
767fa8cade
Studio: honor explicit GGUF ctx and default to 4096 when weights exceed VRAM (#5011)
* Studio: honor explicit GGUF ctx and default to 4096 when weights exceed VRAM

The load-time auto-fit in LlamaCppBackend.load_model had two issues for
models whose weights do not fit on any GPU subset (the common case for
large MoE GGUFs such as MiniMax-M2.7, Qwen3.5-397B-A17B, etc.):

1. Auto mode (max_seq_length=0) left effective_ctx at the model's native
   context when no subset passed the 90% fit check. The UI slider then
   landed on e.g. 196608 for MiniMax-M2.7, far above anything usable.
   Default the auto-pick to 4096 so the UI starts at a sane value; the
   slider ceiling stays at the native context so the user can still
   opt in to longer contexts and receive the "might be slower" warning.

2. Explicit ctx was silently shrunk when weights fit but the requested
   KV overflowed the 90% budget. The shrink loop emitted -c <capped>
   -ngl -1 without informing the caller, so a user who had opted into
   a longer context via the UI never actually got it. Drop the shrink
   loop on the explicit path and emit -c <user_ctx> --fit on instead,
   letting llama-server flex -ngl (CPU layer offload).

Adds tests/test_llama_cpp_context_fit.py covering both paths, the
file-size-only fallback when KV metadata is missing, non-regression on
fittable auto-pick, and platform-agnostic input shape.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-14 08:53:25 -07:00
TF-MTGE
a31c82a640
fix(studio): remove 300s cap on load_checkpoint (inherits 3600s default) (#4922)
* fix: increase wait response timeout to 900 sec instead of 300 sec. #4845

* Apply suggestion from @gemini-code-assist[bot]

good catch

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-04-14 08:53:14 -07:00
Datta Nimmaturi
da78c6be71
[Studio] Install flash attn at setup time for linux (#4979)
* [Studio] Install flash attn at setup time for linux

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* cleanup changes

Signed-off-by: Datta Nimmaturi <venkatadattasainimmaturi@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Test cases

* wheel_utils: narrow url_exists exceptions and log at debug level

---------

Signed-off-by: Datta Nimmaturi <venkatadattasainimmaturi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
2026-04-14 16:40:17 +04:00
Datta Nimmaturi
dccc0ebada
[Studio] Show non exported models in chat UI (#4892)
* Show non exported models in chat UI

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Distinguish b/w LoRa and full fine tune saves. Cleanup

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
2026-04-14 15:03:58 +04:00
Bharath Kumar Adinarayan
a50f61009b
fix(studio): default chart view to full training history (#5007)
* fix(studio): default chart view to full training history instead of last 80 steps

Fixes #5003

* chore: windowsize as null code comment

---------

Co-authored-by: imagineer99 <samleejackson0@gmail.com>
Co-authored-by: Wasim Yousef Said <wasimysdev@gmail.com>
2026-04-14 03:29:27 -07:00
Lee Jackson
bfa17330bd
Studio: Polish API key copy button and harden async clipboard fallback (#5006)
* fix: polish clipboard style and fix async clipboard path

* Use copyToClipboardAsync in CopyButton for Safari fallback

CopyButton was calling navigator.clipboard.writeText directly,
bypassing the execCommand fallback added in this same PR. Switch
to copyToClipboardAsync which tries execCommand first (Safari
user-gesture requirement) then falls back to the async clipboard API.

* Fix copyToClipboard sync contract regression and improve async path

- Restore copyToClipboard() to return only the execCommand result,
  preserving the boolean contract that 7 existing callers depend on
  to gate their "Copied!" UI state. The fire-and-forget async fallback
  was returning true before the promise resolved, causing false success.

- Add document.body null guard to copyWithExecCommand for SSR safety.

- Reorder copyToClipboardAsync to try the async Clipboard API first,
  avoiding unnecessary DOM/focus overhead in Radix focus-trapped dialogs
  where execCommand always fails anyway.

* Restore queryCommandSupported guard and fix async catch path

- Restore the queryCommandSupported("copy") guard in copyToClipboard()
  to match the original contract exactly: when execCommand is entirely
  unsupported, fall through to fire-and-forget async clipboard write.

- Fix copyToClipboardAsync catch block: after navigator.clipboard.writeText
  rejects, the user-gesture frame is gone, so execCommand will also fail.
  Return false from catch instead of falling through. The execCommand
  fallback at the bottom only runs when the Clipboard API is absent
  (still in user-gesture frame).

* Restore execCommand fallback in copyToClipboardAsync catch path

The catch block was returning false after clipboard API rejection,
based on the incorrect premise that the user-gesture frame is lost
after an await. Per the HTML spec, transient user activation IS
preserved through promise microtask chains. The real reason
execCommand fails in the Radix dialog is the focus trap intercepting
textarea.focus(), not gesture loss.

For non-dialog callers, execCommand can still succeed after a
clipboard rejection. Inside a Radix modal, execCommand returns
false harmlessly (focus trap blocks it).

* Harden textarea fallback for mobile and continue to async path on failure

---------

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
2026-04-14 14:22:14 +04:00
Wasim Yousef Said
97eafd999e
studio: fix api-keys access + refresh (#5005)
* studio: fix api-keys access + refresh

* studio: guard v1 in spa fallback
2026-04-13 23:48:51 +04:00
AdamPlatin123
d2fc582840
studio: skip training status/metrics polling when idle (#4988)
* fix(studio): skip training status/metrics polling when idle

Add an early return in the status and metrics setInterval callbacks when
the runtime store reports phase === "idle" and hasHydrated is true.
Previously these polls fired unconditionally every 3s/5s, generating
unnecessary network traffic and console errors when no training was
running.

* fix(studio): reduce idle polling to 30s instead of stopping entirely

Review feedback (PR #4988): completely stopping polling when idle risks
permanent UI desync if hydration fails, and misses out-of-band state
changes from other clients. Add a 30s background poll that only fires
when idle to recover gracefully.

* fix: harden idle status polling around hydration and runtime reset

---------

Co-authored-by: AdamPlatin123 <AdamPlatin123@users.noreply.github.com>
Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>
Co-authored-by: imagineer99 <samleejackson0@gmail.com>
2026-04-13 12:02:12 -07:00
Daniel Han
9a261aec5f
Studio: Expose openai and anthropic compatible external API end points (#4956)
* Studio: add API key authentication for programmatic access

External users want to hit the Studio API (chat completions with tool
calling, training, export, etc.) without going through the browser
login flow. This adds sk-unsloth- prefixed API keys that work as a
drop-in replacement for JWTs in the Authorization: Bearer header.

Backend:
- New api_keys table in SQLite (storage.py)
- create/list/revoke/validate functions with SHA-256 hashed storage
- API key detection in _get_current_subject before the JWT path
- POST/GET/DELETE /api/auth/api-keys endpoints on the auth router

Frontend:
- /api-keys page with create form, one-time key reveal, keys table
- API Keys link in desktop and mobile navbar
- Route registered with requireAuth guard

Zero changes to any existing route handler -- every endpoint that uses
Depends(get_current_subject) automatically works with API keys.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use actual origin in API key usage examples

The examples on /api-keys were hardcoded to localhost:8888 which is
wrong for remote users. Use window.location.origin so the examples
show the correct URL regardless of where the user is connecting from.

* Add `unsloth studio run` CLI command for one-liner model serving

Adds a `run` subcommand that starts Studio, loads a model, creates an
API key, and prints a ready-to-use curl command -- similar to
`ollama run` or `vllm serve`.

Usage: unsloth studio run -m unsloth/Qwen3-1.7B-GGUF --gguf-variant UD-Q4_K_XL

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add end-to-end tests for `unsloth studio run` and API key usage

Tests the 4 usage examples from the API Keys page:
1. curl basic (non-streaming) chat completions
2. curl streaming (SSE) chat completions
3. OpenAI Python SDK streaming completions
4. curl with tools (web_search + python)

Also tests --help output, invalid key rejection, and no-key rejection.
All 7 tests pass against Qwen3-1.7B-GGUF.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add /v1/completions, /v1/embeddings, /v1/responses endpoints and --parallel support

- llama_cpp.py: accept n_parallel param, pass to llama-server --parallel
- run.py: plumb llama_parallel_slots through to app.state
- inference.py: add /completions and /embeddings as transparent proxies to
  llama-server, add /responses as application-level endpoint that converts
  to ChatCompletionRequest; thread n_parallel through load_model
- studio.py: set llama_parallel_slots=4 for `unsloth studio run` path

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Make /v1/responses endpoint match OpenAI Responses API format

The existing /v1/responses shim returned Chat Completions format, which
broke OpenAI SDK clients using openai.responses.create(). This commit
replaces the endpoint with a proper implementation that:

- Returns `output` array with `output_text` content parts instead of
  `choices` with `message`
- Uses `input_tokens`/`output_tokens` instead of `prompt_tokens`/
  `completion_tokens` in usage
- Sets `object: "response"` and `id: "resp_..."`
- Emits named SSE events for streaming (response.created,
  response.output_text.delta, response.completed, etc.)
- Accepts all OpenAI Responses API fields (tools, store, metadata,
  previous_response_id) without erroring -- silently ignored
- Maps `developer` role to `system` and `input_text`/`input_image`
  content parts to the internal Chat format

Adds Pydantic schemas for request/response models and 23 unit tests
covering schema validation, input normalisation, and response format.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Studio: add Anthropic-compatible /v1/messages endpoint (#4981)

* Add Anthropic-compatible /v1/messages endpoint with tool support

Translate Anthropic Messages API format to/from internal OpenAI format
and reuse the existing server-side agentic tool loop. Supports streaming
SSE (message_start, content_block_delta, etc.) and non-streaming JSON.
Includes offline unit tests and e2e tests in test_studio_run.py.

* Add enable_tools, enabled_tools, session_id to /v1/messages endpoint

Support the same shorthand as /v1/chat/completions: enable_tools=true
with an optional enabled_tools list uses built-in server tools without
requiring full Anthropic tool definitions. session_id is passed through
for sandbox isolation. max_tokens is now optional.

* Strip leaked tool-call XML from Anthropic endpoint content

Apply _TOOL_XML_RE to content events in both streaming and
non-streaming tool paths, matching the OpenAI endpoint behavior.

* Emit custom tool_result SSE event in Anthropic stream

Adds a non-standard tool_result event between the tool_use block close
and the next text block, so clients can see server-side tool execution
results. Anthropic SDKs ignore unknown event types.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Split /v1/messages into server-side and client-side tool paths

enable_tools=true runs the existing server-side agentic loop with
built-in tools (web_search/python/terminal). A bare tools=[...] field
now triggers a client-side pass-through: client-provided tools are
forwarded to llama-server and any tool_use output is returned to the
caller with stop_reason=tool_use for client execution.

This fixes Claude Code (and any Anthropic SDK client) which sends
tools=[...] expecting client-side execution but was previously routed
through execute_tool() and failing with 'Unknown tool'.

Adds AnthropicPassthroughEmitter to convert llama-server OpenAI SSE
chunks into Anthropic SSE events, plus unit tests covering text
blocks, tool_use blocks, mixed, stop reasons, and usage.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix httpcore GeneratorExit in /v1/messages passthrough stream

Explicitly aclose aiter_lines() before the surrounding async with
blocks unwind, mirroring the prior fix in external_provider.py
(a41160d3) and cc757b78's RuntimeError suppression.

* Wire stop_sequences through /v1/messages; warn on tool_choice

Plumb payload.stop_sequences to all three code paths (server-side
tool loop, no-tool plain, client-side passthrough) so Anthropic SDK
clients setting stop_sequences get the behavior they expect. The
llama_cpp backend already accepted `stop` on both generate_chat_
completion and generate_chat_completion_with_tools; the Anthropic
handler simply wasn't passing it.

tool_choice remains declared on the request model for Anthropic SDK
compatibility (the SDK often sets it by default) but is not yet
honored. Log a structured warning on each request carrying a non-
null tool_choice so the silent drop is visible to operators.

* Wire min_p / repetition_penalty / presence_penalty through /v1/messages

Align the Anthropic endpoint's sampling surface with /v1/chat/completions.
Adds the three fields as x-unsloth extensions on AnthropicMessagesRequest
and threads them through all three code paths: server-side tool loop,
no-tool plain, and client-side passthrough.

The passthrough builder emits "repeat_penalty" (not "repetition_penalty")
because that is llama-server's field name; the backend methods already
apply the same rename internally.

* Fix block ordering and prev_text reset in non-streaming tool path

_anthropic_tool_non_streaming was building the response by appending
all tool_use blocks first, then a single concatenated text block at
the end — losing generation order and merging pre-tool and post-tool
text into one block. It also never reset prev_text between synthesis
turns, so the first N characters of each post-tool turn were dropped
(where N = length of the prior turn's final cumulative text).

Rewrite to build content_blocks incrementally in generation order,
matching the streaming emitter's behavior: deltas within a turn are
merged into the trailing text block, tool_use blocks interrupt the
text sequence, and prev_text is reset on tool_end so turn N+1 diffs
against an empty baseline.

Caught by gemini-code-assist[bot] review on #4981.

* Make test_studio_run.py e2e tests pytest-compatible

Add a hybrid session-scoped studio_server fixture in conftest.py that
feeds base_url / api_key into the existing e2e test functions. Three
invocation modes are now supported:

1. Script mode (unchanged) — python tests/test_studio_run.py
2. Pytest + external server — point at a running instance via
   UNSLOTH_E2E_BASE_URL / UNSLOTH_E2E_API_KEY env vars, no per-run
   GGUF load cost
3. Pytest + fixture-managed server — pytest drives _start_server /
   _kill_server itself via --unsloth-model / --unsloth-gguf-variant,
   CI-friendly

The existing _start_server / _kill_server helpers and main() stay
untouched so the script entry point keeps working exactly as before.
Test function signatures are unchanged — the (base_url, api_key)
parameters now resolve via the new fixtures when running under
pytest.

* Rename test_studio_run.py -> test_studio_api.py

The file is entirely about HTTP API endpoint testing (OpenAI-compatible
/v1/chat/completions, Anthropic-compatible /v1/messages, API key auth,
plus a CLI --help sanity check on the command that runs the API). None
of its tests cover training, export, chat-UI, or internal-Python-API
concerns.

The old name misleadingly suggested "tests for the unsloth studio run
CLI subcommand" — the new name reflects the actual scope.

Updates:
- git mv the file (rename tracked, history preserved)
- Rewrite opening docstring to state the API surface focus and call
  out what is explicitly out of scope
- Update all 4 Usage-block path references to the new filename
- LOG_FILE renamed to test_studio_api.log
- conftest.py fixture import rewritten from test_studio_run to
  test_studio_api, plus 7 docstring/comment references updated

No functional changes to test logic, signatures, or main().

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix httpcore asyncgen cleanup in /v1/messages and /v1/completions

The earlier fix in 985e92a9 was incomplete: it closed aiter_lines()
explicitly but still used `async with httpx.AsyncClient()` /
`async with client.stream()` inside the generator. When the generator
is orphaned (e.g. client disconnects mid-stream and Starlette drops
the StreamingResponse iterator without explicitly calling aclose()),
Python's asyncgen finalizer runs the cleanup in a DIFFERENT task than
the one that originally entered the httpx context managers. The
`async with` exits then trigger httpcore's HTTP11ConnectionByteStream
.aclose(), which enters anyio.CancelScope.__exit__ with a mismatched
task and raises RuntimeError("Attempted to exit cancel scope in a
different task"). That error escapes any user-owned try/except
because it happens during GC finalization.

Replace `async with` with manual client/response lifecycle in both
/v1/messages passthrough and /v1/completions proxy. Close the
response and client in a finally block wrapped in
`try: ... except Exception: pass`. This suppresses RuntimeError (and
other Exception subclasses) from the anyio cleanup noise while
letting GeneratorExit (a BaseException, not Exception) propagate
cleanly so the generator terminates as Python expects.

Traceback observed in user report:
  File ".../httpcore/_async/connection_pool.py", line 404, in __aiter__
      yield part
  RuntimeError: async generator ignored GeneratorExit
...
  File ".../anyio/_backends/_asyncio.py", line 455, in __exit__
      raise RuntimeError(
  RuntimeError: Attempted to exit cancel scope in a different task

* Expand unsloth studio run banner with SDK base URL and more curl examples

Add an explicit "OpenAI / Anthropic SDK base URL" line inside the info
box so SDK users don't accidentally copy the bare server URL (without
/v1) into their OpenAI/Anthropic SDK constructors and hit 404s.

Replace the single /v1/chat/completions curl example with three
labeled blocks: chat/completions, Anthropic /messages, and OpenAI
Responses. The Anthropic example includes max_tokens (Anthropic SDKs
require it even though Studio accepts None).

All examples derived from a computed sdk_base_url so the /v1 prefix
stays in sync if the public path ever changes.

* Hash API keys with HMAC-SHA256 + persistent server secret

Stores the HMAC secret in a new app_secrets singleton table. Fixes
CodeQL py/weak-sensitive-data-hashing alert on storage.py:74-76,
394-395. Refresh tokens stay on plain SHA-256 (unchanged _hash_token)
so existing user sessions survive upgrade — API keys are new on this
branch so there is no migration.

* Use PBKDF2 for API key hashing per CodeQL recommendation

HMAC-SHA256 was still flagged by py/weak-sensitive-data-hashing.
Switch to hashlib.pbkdf2_hmac, which is in CodeQL's recommended
allowlist (Argon2/scrypt/bcrypt/PBKDF2). Persistent server-side
salt stays in app_secrets for defense-in-depth. 100k iterations to
match auth/hashing.py's password hasher.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
2026-04-13 21:08:11 +04:00
Roland Tannous
3bb72a557f
Pin kernels==0.12.1 to avoid huggingface_hub dataclass conflict (#5000) 2026-04-13 20:42:02 +04:00
Lee Jackson
21a7895959
Studio: Prompt manager, message deletion, and chat UI improvements (#4938)
* feat(chat): code block styling, delete with Dexie sync, settings sheet polish

* style: config save/delete padding fix

* fix(studio): centralize dark code-block surface and optimize message sync writes

* style: config padding/alignment polish

* fix(studio): upsert custom presets without implicit rename-delete

* fix settings sheet save state polish

* fix settings sheet button widths

* fix chat settings presets

* fix chat delete sync

* fix chat trust remote code flow

---------

Co-authored-by: shine1i <wasimysdev@gmail.com>
2026-04-13 16:42:33 +02:00
AdamPlatin123
3b092bcd46
fix(studio): prevent route transition DOM duplication via AnimatePresence (#4987)
Add mode="wait" and exit={{ opacity: 0 }} to the root AnimatePresence
wrapper so outgoing routes fully unmount before incoming routes render.
Without this, rapid navigation between Studio/Export/Recipes/Chat caused
pages to stack (2x–3x duplication).

Co-authored-by: AdamPlatin123 <AdamPlatin123@users.noreply.github.com>
Co-authored-by: Wasim Yousef Said <wasimysdev@gmail.com>
2026-04-13 01:38:00 -07:00
Manan Shah
80c12ff1a6
Move gemma4 script (#4994)
* updating gemma4 script

* moving gemma4 script to scripts folder
2026-04-12 23:41:15 -07:00