unsloth

mirror of https://github.com/unslothai/unsloth synced 2026-04-21 13:37:39 +00:00

Author	SHA1	Message	Date
Daniel Han	b42e3a120d	Remove legacy venv Scripts entry from User PATH on upgrade (#5060 ) Older installers persisted the venv Scripts directory directly in the User PATH registry. The shim approach from #4961 no longer writes that entry, but on upgrade the old one survived and python.exe / pip.exe from the unsloth venv continued winning resolution in every new shell. Before creating the shim, read the current User PATH, filter out any entry matching $VenvDir\Scripts (using the same symmetric raw+expanded comparison as Add-ToUserPath), and write back if changed. No-op on fresh installs where the legacy entry was never written. Confirmed on a real Windows machine: `where.exe python` was returning the venv interpreter first even after the shim PR merged.	2026-04-16 07:36:59 -07:00
Daniel Han	5b8643969e	Revert "Remove legacy venv Scripts entry from User PATH on upgrade" This reverts commit `cae4a74297`.	2026-04-16 14:20:43 +00:00
Daniel Han	cae4a74297	Remove legacy venv Scripts entry from User PATH on upgrade Older installers persisted the venv Scripts directory directly in the User PATH registry. The shim approach (added in this PR) no longer writes that entry, but it also did not remove the old one. On upgrade, the legacy entry survived and python.exe / pip.exe from the unsloth venv continued winning resolution in every new shell, which is exactly the hijack the shim was designed to prevent. Before creating the shim, read the current User PATH, filter out any entry matching $VenvDir\Scripts (using the same symmetric raw+expanded comparison as Add-ToUserPath), and write back if changed. This runs once per install and is a no-op on fresh installs where the legacy entry was never written.	2026-04-16 14:19:04 +00:00
Datta Nimmaturi	6764cb9b90	Restrict flash attn to <=256 head dim. Consolidate attn impl checks (#5051 ) * Restrict flash attn to <=256 head dim. Consolidate attn impl checks * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Consolidate the changes into single function * safeguard for dict instead of object * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-16 09:00:17 -05:00
Daniel Han	c5be8b1cd2	Chat-template repair: warn-by-default, AST classification, dict support (#5049 ) * Chat-template repair: warn-by-default, AST classification, dict support Follow-up hardening on top of PR #4426 (which fixed the #4150 RuntimeError for ChatML LoRA reloads). Behavior changes: - Warn-by-default instead of RuntimeError. When fix_chat_template cannot repair a broken template, emit a warning and return the original. Set UNSLOTH_STRICT_CHAT_TEMPLATE=1 to restore the pre-warn hard fail. Fixes the UX where a missing `{% if add_generation_prompt %}` block on a saved LoRA (typical after LlamaFactory / Axolotl re-serialize) would block model loading entirely. - Local path vs HF hub distinguished in the warning message. For local paths the message points at the likely downstream tool; for HF IDs it points at the upstream model maintainers. Previously both said "file a bug report to the maintainers of <path>" even when <path> was the user's own saves/ directory. - Dict / list chat_template now handled. Hermes-3 ships with {default, tool_use} and the previous code crashed with AttributeError: 'dict' object has no attribute 'find' when entering _fix_chat_template with a dict. Each variant is now fixed independently; structure is preserved. Internals: - _find_end_position now matches all four Jinja whitespace-control variants ({% %}, {%- %}, {% -%}, {%- -%}) and returns the rightmost endfor/endif so multi-for templates aren't locked onto the first loop. Previously {%- endfor -%} (both-side dash, used by Qwen3-Guard) was silently bypassed. - _has_add_generation_prompt_block uses Jinja AST via jinja2.nodes.If/Name walks instead of substring matching, so templates that hide the block behind comments or dash-style variants are classified correctly. - _template_ends_with_toplevel_for gates the GH#4150 ChatML repair on the AST: only fires when the last structural top-level node is a For (standard ChatML shape), ignoring trailing pure-whitespace output nodes. Templates wrapped in an outer If (Qwen3-Guard) are now explicitly skipped at the _fix_chat_template level as well, not just at load_correct_tokenizer's name-based exemption. - _validate_patched_template renders the patched template with and without add_generation_prompt and confirms the patched output responds to the flag by appending (not replacing) content. If validation fails, the patch is discarded and we fall through to the warn path. Verified with an expanded regression suite in tests/: - test_fix_chat_template_pr4426.py: 42/42 template-matrix cells - test_load_correct_tokenizer_pr4426.py: 5/5 tokenizer loads - test_chat_template_followups.py: 10/10 new follow-up tests - test_mistral_pr4426.py: 5 Mistral variants byte-identical - test_qwen_pr4426.py: 14 Qwen variants byte-identical (Qwen1.5, Qwen2, Qwen2.5-Instruct/Coder/Math/VL, Qwen3, Qwen3-Coder, QwQ, Qwen3-Guard-Gen) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Guard _validate_patched_template against read-only chat_template If tokenizer.chat_template is a property or otherwise read-only, the validation helper would crash with AttributeError when trying to temporarily set the patched template. Catch the assignment failure and return False (skip validation), and best-effort restore in the finally block. * Replace regex separator inference with render-diff; broaden repair to non-ChatML templates The previous `_infer_assistant_separator` was a four-tier regex heuristic that only worked on ChatML-shaped templates and forced a hard `<\|im_start\|>` / `<\|im_end\|>` presence gate on Case 2 repair. This meant a Llama-3, Gemma, or Phi-3 template stripped of its generation-prompt block by a downstream tool (LlamaFactory, Axolotl, etc.) would still warn-and-return even though the structural shape is identical to the ChatML case the PR already handles. This replaces the regex with `_derive_assistant_prefix_by_render`: render the template with two dialogs that differ only in assistant content, then `os.path.commonprefix` on the tails captures the exact assistant-turn prefix the template emits. The template itself is ground truth, so non-ChatML shapes work as long as the assistant block is a literal the template emits once per message. Three guards keep the derivation safe: A. both assistant renders extend the base render (no reordering); B. the divergence point is exactly the content-insertion site (sentinel follows the common prefix); C. a user-role cross-check: if a render with a user sentinel also emits the same prefix, role has no effect on output and we reject. A render failure on [user, user] (e.g. Gemma's `raise_exception` alternation check) is evidence that role matters; we accept. Sentinels differ at character 0 so `commonprefix` cannot absorb them, and trailing whitespace/comments after the last `{% endfor %}` are stripped before probing (they would appear in base but not after the appended assistant turn and break Guard A). `_fix_chat_template` and `_repair_string_template` now thread an `is_sharegpt` kwarg; `_fix_chat_template` retries once with `is_sharegpt=True` if the first probe returns None (dual-probe fallback for dict/list callers). The ChatML `<\|im_start\|>` / `<\|im_end\|>` hard gate in Case 2 is dropped. `_infer_assistant_separator` is deleted. Verified via: - tests/test_fix_chat_template_pr4426.py: 51/51 cells (new Llama-3, Gemma, Phi-3 broken-template rows all repair FIX-OK) - tests/test_load_correct_tokenizer_pr4426.py: 5/5 - tests/test_chat_template_followups.py: 18/18 (T11-T18 cover non-ChatML repair + probe failure modes) - tests/test_mistral_pr4426.py: 5/5 byte-identical - tests/test_qwen_pr4426.py: 14/14 byte-identical (Qwen3-Guard AST gate still rejects) - tests/hermes3_lora_pr4426.py reload: patched template ends with `<\|im_start\|>assistant\n`, inference returns sensible output. - temp/sim/battery.py: 79/79 followup; vs baseline: 0 regressions, 9 improvements. - Spot-check probe on real stripped tokenizers (Hermes-3, Phi-4, Llama-3.2-1B, Gemma-3-1B): all derive the expected prefix. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Address reviewer findings: variant routing, positive-gate detection, comment-safe end scan Resolves three reviewer findings on PR #5049 (`fix/chat-template-followups`): Finding #1 [10/10]: dict/list variants now route through `_fix_chat_template_for_tokenizer` via a new `_VariantTokenizerProxy` adapter. Previously the dict/list branches called `_fix_chat_template` directly, silently bypassing the warn/strict (`UNSLOTH_STRICT_CHAT_TEMPLATE`) contract, the `no == yes` diagnostic, broken-existing-block detection, and `_validate_patched_template` guard. The proxy swaps `base.chat_template` to the variant string before each `apply_chat_template` call so tokenizer globals (`bos_token`, custom filters, `raise_exception`) remain available; if the base is read-only it falls back to isolated Jinja rendering. Finding #2 [1/10]: `_has_add_generation_prompt_block` now requires the `If` body to contain at least one `Output` node (a new `_if_body_emits_content` helper walks descendants). This distinguishes a real generation-prompt block from a header guard like `{% if not add_generation_prompt is defined %}{% set ... %}{% endif %}` (body contains only `Assign`) which references the name but emits nothing. Also dropped a now-redundant `"add_generation_prompt" not in scrubbed` guard in `_fix_chat_template` Case 2 so header-guarded templates still get repaired. Finding #4 [1/10]: `_find_end_position` now replaces Jinja comments with equal-length whitespace before scanning for `{% endfor %}` / `{% endif %}` tokens. This prevents a trailing comment containing those tokens from being picked as the real end tag. Positions in the padded string map 1:1 to positions in the original template. Tests: - tests/test_chat_template_followups.py: 21/21 (T19 strict-mode dict variant, T20 header-guard repair, T21 comment-endfor trap added; T4/T5 stubs updated with a working apply_chat_template that routes through Jinja). - tests/test_fix_chat_template_pr4426.py: 51/51 cells unchanged. - tests/test_load_correct_tokenizer_pr4426.py: 5/5. - tests/test_mistral_pr4426.py: 5/5 byte-identical. - tests/test_qwen_pr4426.py: 14/14 byte-identical. - temp/sim/battery.py: 79/79 followup; 0 regressions vs baseline. - Phase 3 Hermes-3 broken-LoRA reload: inference still returns `'The answer to the equation 2+2 is 4.'`. - Spot-checks on Hermes-3 / Phi-4 / Llama-3.2-1B / Gemma-3-1B real stripped templates: probe still derives the expected prefix. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Tighten comments in chat-template helpers Pure comment minimization across `_find_end_position`, `_has_add_generation_prompt_block`, `_if_body_emits_content`, `_derive_assistant_prefix_by_render`, `_fix_chat_template` Case 2, and `_VariantTokenizerProxy`. No behavior change; same intent, fewer lines. All 21 follow-up tests and the 51-cell Phase 1 matrix still pass. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Sandbox probe, fix is_sharegpt validator mismatch, reject negated gates Three real bugs from the 10-agent Opus review: 1. Probe now uses `jinja2.sandbox.SandboxedEnvironment` instead of bare `jinja2.Environment`. The probe renders at model-load time (before the user calls `apply_chat_template`), so it was a new eager code-execution surface that the base HF tokenizer loading does not have. SandboxedEnvironment blocks attribute-chain exploits at negligible cost. 2. `_repair_string_template` now tries validation with both `is_sharegpt=False` and `is_sharegpt=True`. Previously, when `_fix_chat_template` internally fell back to the other schema via its dual-probe, the outer validation still used the caller's original `is_sharegpt` -- rendering with the wrong message keys and spuriously dropping a valid repair. 3. `_has_add_generation_prompt_block` now skips `If` nodes whose test is a `Not` expression. A negated gate like `{% if not add_generation_prompt %}{{ x }}{% endif %}` fires when agp=False, so its emitting body is not a generation block -- but the old code counted any Name reference regardless of polarity. Cleanup: removed unused `self._label`, added `\r` escape in generation-block literal, switched variant labels to `!r` formatting, removed redundant `import os as _os`. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix jinja2.sandbox import and sandbox proxy fallback Two critical findings from the 20-reviewer pass: 1. [20/20] The proxy read-only fallback used bare `jinja2.Environment`, not sandboxed. All 20 reviewers independently reproduced marker-file creation via `cycler.__init__.__globals__['os'].system(...)` during `fix_chat_template()`. Fixed: fallback now uses `from jinja2.sandbox import SandboxedEnvironment`. 2. [14/20] The render-diff probe did `import jinja2` then referenced `jinja2.sandbox.SandboxedEnvironment`. `jinja2.sandbox` is a submodule that is NOT auto-imported by `import jinja2` on Jinja 3.1.6. This caused `AttributeError` (swallowed by `except Exception`), making the entire Case 2 repair path silently return None in a clean process. The 6 reviewers who saw it work had `jinja2.sandbox` pre-imported by an earlier module in their process. Fixed: both the probe and the proxy fallback now use `from jinja2.sandbox import SandboxedEnvironment`. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-16 05:52:33 -07:00
Daniel Han	6e87bade25	Trim verbose comments in PATH helpers Reduce inline comments from ~160 lines to ~25 across both files. Keep one-line summaries of the "why"; drop multi-paragraph rationale blocks that repeated information already captured in commit messages and PR discussion.	2026-04-16 12:01:01 +00:00
Etherll	ec32ce2e82	fix: use direct registry API for PATH writes instead of SetEnvironmentVariable (#4961 ) * fix: replacing SetEnvironmentVariable with direct registry API * apply reviews * Use CreateSubKey for HKCU\Environment * Store PATH backup under HKCU\Software\Unsloth * Fix $backupKey registry handle leak in PATH backup block Wrap $backupKey operations in try/finally so the handle is closed even if GetValue or SetValue throws. The Add-ToUserPath helper already uses this pattern for its registry key -- the backup block was the only place missing it. * Isolate WM_SETTINGCHANGE broadcast from PATH write error handling Wrap the broadcast dummy-variable calls in their own try/catch so a broadcast failure does not mask a successful registry PATH write. Previously, if SetEnvironmentVariable threw after SetValue already committed the new PATH, Add-ToUserPath would return $false and the caller would skip Refresh-SessionPath. * PATH helper polish: venv precedence, quoted entries, raw/expanded dedup Three small follow-ups surfaced by a 10-reviewer pass against the rebased PR head. None fix a regression vs main; each strictly improves the new helpers. Refresh-SessionPath / Refresh-Environment: - Move $env:Path to the front of the merge so an activated venv keeps precedence over machine/user PATH after a refresh. Pre-PR dropped process-only entries entirely; post-PR kept them but at the back. - Dedup on both raw and expanded forms so %USERPROFILE%\foo and the already-expanded C:\Users\me\foo do not both survive. Add-ToUserPath: - Trim whitespace and surrounding double-quotes from each compared entry so quoted PATH entries like "C:\Program Files\CMake\bin" deduplicate against an unquoted directory of the same path. * Back up User PATH inside Add-ToUserPath, before first mutation Previously only studio/setup.ps1 took a one-time PATH backup, at script top (line ~547). install.ps1 (the irm \| iex entry point) had no backup, so users who installed via that path had no recovery surface if anything clobbered their PATH. The PR description's "one-time backup before any modifications" promise only held for the studio installer flow. Move the backup into Add-ToUserPath itself: just before the first actual SetValue mutation, write the pristine raw PATH to HKCU\Software\Unsloth\PathBackup if no backup already exists. This: - Covers both entry points (install.ps1 and studio/setup.ps1). - Captures the TRUE pristine PATH even when install.ps1 runs first and studio/setup.ps1 runs afterwards (the script-top backup in setup.ps1 would otherwise see an already-modified PATH). - Is idempotent: once a backup exists, subsequent calls preserve it. - Skips when nothing would mutate (dedup match) or PATH is empty. The script-top backup in studio/setup.ps1 is kept for defense in depth. * Refresh PATH: venv-aware merge order Reconcile two competing concerns about Refresh-SessionPath / Refresh-Environment surfaced by separate review rounds: - venv at the back -> activated venv loses precedence to system Python - process at the front -> stale shims (old node, old python, etc.) still on $env:Path can beat a freshly installed tool New merge order: 1. Activated venv Scripts dir, only if $env:VIRTUAL_ENV is set 2. Machine PATH freshly read from registry 3. User PATH freshly read from registry 4. Current $env:Path as fallback This way an explicitly-activated venv keeps priority while a tool the script just installed wins over any stale entry that was already on the inherited shell PATH. When no venv is active, fresh registry entries take precedence as expected. * Append to User PATH by default, close $envKey in finally Add-ToUserPath gains a -Position Append\|Prepend parameter defaulting to Append so installing unsloth no longer prepends the bundled venv Scripts directory ahead of the user's existing python / pip on new shells. The four current call sites (install.ps1 launcher, studio/setup.ps1 CMake, nvcc, Python user Scripts) all take the Append default because each one that needs in-session precedence already does an inline $env:Path prepend independently. This matches rustup / cargo / nvm / pyenv / uv behavior. Also wrap the script-top $envKey.GetValue in a try/finally so the registry handle is released even if the read throws. Matches the pattern already used for $backupKey five lines below. * Prepend cmake, nvcc, Python Scripts; keep venv Scripts appended The previous commit switched Add-ToUserPath to append by default so that installing unsloth would not silently hijack the user's system python / pip. That was correct for the venv Scripts dir (which contains python.exe and pip.exe alongside unsloth.exe), but wrong for the three studio/setup call sites. Those persist cmake, the driver-compatible nvcc, and the Python user Scripts dir for future shells, and in all three cases an older tool already earlier in the user PATH would keep winning after the install finished. The nvcc case is especially load-bearing: setup selects a driver-compatible CUDA toolkit, then llama.cpp builds against whatever wins PATH resolution, so a stale older nvcc produces broken builds. Pass -Position 'Prepend' explicitly at the three setup.ps1 call sites (cmake at line 754, nvcc bin at line 1025, Python user Scripts at line 1191). None of those directories holds python.exe, so prepending them does not re-introduce the original hijack problem. Leave the install.ps1 venv Scripts call on the default Append with a comment explaining why. * Symmetric dedup, Prepend reorders duplicates, unsloth shim dir Address three separate findings surfaced by review: 1. Dedup asymmetry (Gemini high-priority): the existing dedup expanded registry entries via ExpandEnvironmentVariables but did NOT expand the new directory. Passing "%USERPROFILE%\foo" when "C:\Users\me\foo" was already in PATH produced a duplicate. Expand both sides so the check is symmetric. 2. -Position Prepend no-op on existing duplicates: the dedup loop returned $false as soon as it saw a match, regardless of position. That left a late-position duplicate in place instead of moving it to the front, so "prepend the newly selected cmake/nvcc" did not always beat an older copy earlier in PATH. Partition entries into kept and dropped lists, then reinsert a single copy at the requested position. Append still returns $false on any match so user-curated orderings are not reshuffled. Prepend also returns $false when the only copy is already at position 0 so we preserve the user's casing. 3. Stop adding the venv Scripts dir to User PATH entirely. That dir holds python.exe and pip.exe alongside unsloth.exe, so neither Prepend nor Append worked: prepend hijacked the user's system python and pip, append made the freshly-installed unsloth.exe lose to any older unsloth.exe earlier on PATH. Replace the Scripts-dir PATH add with a dedicated shim directory that contains only unsloth.cmd, and prepend that dir. The shim calls the venv's unsloth.exe by absolute path so future pip upgrades inside the venv propagate automatically. * Shim via hardlink, Append user Scripts, drop venv sysconfig fallback Three follow-ups to the `c0ab1ab` shim commit, targeting concerns raised in the second 20-reviewer pass: 1. Shim uses unsloth.exe (hardlink, copy fallback) instead of unsloth.cmd. The batch-file approach had three distinct regressions: - cmd.exe expanded %...% sequences inside user arguments, so prompts like "What does 50% mean?" got mangled before reaching the CLI - Git Bash / MSYS2 / POSIX-style shells on Windows do not resolve bare-name lookups to .cmd files, so `unsloth` stopped working there - Set-Content -Encoding ASCII replaced non-ASCII profile characters with '?', so installs under C:\Users\Jörg\... wrote a broken shim A hardlink (fallback: copy) of unsloth.exe is a native Windows executable with no shell indirection. PATHEXT picks .exe before .cmd in cmd.exe and PowerShell, Git Bash honors .exe natively, subprocess callers hit it directly, and a hardlink stays in sync with the venv on pip upgrades because both names point at the same inode. 2. studio/setup.ps1 Python user Scripts dir is added with default Append instead of -Position Prepend. That directory holds every pip-installed user console script (pip, pytest, huggingface-cli, and so on), not just unsloth, so reordering it silently changed resolution order for unrelated tools. The new install.ps1 shim at PATH position 0 already guarantees `unsloth` resolves to the freshly installed copy, so the Python user Scripts entry only needs to be present, not at the front. 3. The sysconfig lookup in studio/setup.ps1 no longer falls back to sysconfig.get_path('scripts') when the nt_user scheme dir does not exist. When setup.ps1 is invoked from an activated venv (a flow the linked issue actually hits) that fallback returns the venv's Scripts directory, which would then be added to the persisted User PATH and re-introduce the python / pip hijack the shim dir is meant to avoid. Stick strictly to the nt_user scheme; skip the block if it does not exist on disk. * Do not crash installer when unsloth.exe shim is locked The shim update sequence at install.ps1:1095 did a bare Remove-Item / New-Item HardLink / Copy-Item. Under the script's $ErrorActionPreference a locked target (most commonly 'unsloth studio' still running while the user re-invokes the installer) turns the Remove-Item failure into a terminating error that aborts the install with no actionable message. The existing shim is perfectly usable in that state, so there is no reason to abort. Wrap the whole remove/link/copy sequence in a try/catch that logs the probable cause (Studio still running), points at the fix (close Studio and re-run), and lets the installer finish with the old launcher still serving the command. Also only emit the "added unsloth launcher to PATH" step line when the launcher was actually (re)created AND the PATH entry was newly added -- previously the message fired even when the shim refresh silently failed, which was confusing. * Guard shim PATH entry on existence, use NullString for broadcast delete Two follow-ups surfaced by the latest review pass: 1. Do not add the shim directory to User PATH when the launcher was not actually created. Antivirus blocking unsloth.exe, a disk-full volume, or restrictive filesystem permissions can make both the hardlink and the copy fallback fail on a fresh install. In that case the existing sequence would report "added unsloth launcher to PATH" warnings but still prepend the empty $ShimDir to User PATH -- the user sees an install that claims success but then cannot resolve `unsloth` in a new shell. Gate Add-ToUserPath on Test-Path $ShimExe so the PATH entry is only persisted when the launcher is really there. 2. Pass [NullString]::Value instead of $null to the broadcast-delete call in Add-ToUserPath. On PowerShell 7.5 and later (running on .NET 9), a bare $null going into [Environment]::SetEnvironmentVariable can be coerced to an empty string rather than a true .NET null, which sets the dummy UnslothPathRefresh_XXXXXXXX variable to "" in HKCU\Environment instead of deleting it. The leaked variable is visible in System Properties and accumulates one entry per install run. [NullString]::Value is a PowerShell-specific sentinel that crosses the interop boundary as a real null and works on both PS 5.1 and PS 7.x. See PowerShell/PowerShell#24637 for the underlying issue. --------- Co-authored-by: Daniel Han <danielhanchen@gmail.com> Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>	2026-04-16 04:49:51 -07:00
Imgyu Kim	14ab6fbfae	BUG: fix _fix_chat_template for ChatML templates missing add_generation_prompt (#4426 ) Fixes #4150. Pre-PR, `_fix_chat_template` only patched templates where a trailing `{{ ... }}` expression followed the last `{% endfor %}`. ChatML templates (Hermes, Magnum, Phi-4, etc.) that end cleanly at `{% endfor %}` with no generation-prompt block were left unchanged, so the outer `fix_chat_template` raised: ``` RuntimeError: Unsloth: The tokenizer `...` does not have a {% if add_generation_prompt %} for generation purposes. ``` This commonly shows up when a downstream tool (LlamaFactory, Axolotl) re-serializes the tokenizer during LoRA save and strips the generation-prompt block. This PR adds a second branch to `_fix_chat_template` that fires when: - the content after the last `{% endfor %}` is empty modulo Jinja `{# ... #}` comments, - the scrubbed template contains `<\|im_start\|>` and `<\|im_end\|>`, - and the scrubbed template does not already mention `add_generation_prompt`. The assistant-turn separator is inferred from the template itself (preferring an explicit `'<\|im_start\|>assistant<sep>'` literal, then the unique `message['role'] + '<sep>'` from role concatenations, then `<\|im_sep\|>` for Phi-4-mini mixed-separator templates, then `\n`), so Phi-4-style templates are not silently corrupted with the wrong separator. Verified against the existing chat-template corpus: - Hermes-3, Magnum-v2, Phi-4-mini, Phi-4 multi-sep, ChatML with trailing whitespace, ChatML with trailing Jinja comment, dot-access `message.role`, split-literal `'<\|im_start\|>assistant'`: all repaired with the correct assistant prefix. - Already-fixed ChatML templates: idempotent NOP. - Trap templates with `<\|im_start\|>` only inside a Jinja comment: correctly not rewritten. - Llama-3, Gemma-3, Qwen2.5 (non-ChatML): byte-identical. - Mistral family (5 models including Mistral-Nemo, Mistral-Small-24B, Mixtral): byte-identical, protected both by the structural guard (no ChatML tokens) and the existing name-based exemption in `load_correct_tokenizer`. - Qwen family (14 models including Qwen2.5, Qwen3, Qwen3-Coder, QwQ, VL, Math, Qwen3-Guard): byte-identical. End-to-end reproduction: Hermes-3 LoRA SFT, save with stripped chat_template, reload. Pre-PR code path raises the RuntimeError above. Post-PR reload loads cleanly, patches the template at load time, and `apply_chat_template(add_generation_prompt=True)` produces the correct `<\|im_start\|>assistant\n` prefix.	2026-04-16 00:21:29 -07:00
DoubleMathew	a4d4dfe4ac	fix Gemma4 flash attn disable (#5045 ) * fix pass attn implementation * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-15 17:50:48 -05:00
Daniel Han	3869fbe1cc	Bump installer minimum to 2026.4.5 (#5041 )	2026-04-15 08:23:41 -07:00
Daniel Han	cdb3e752ec	Update _utils.py	2026-04-15 08:06:43 -07:00
Daniel Han	ba387e2c8f	Update pyproject.toml	2026-04-15 08:06:30 -07:00
Daniel Han	f0d03655e8	Studio: add folder browser modal for Custom Folders (#5035 ) * Studio: add folder browser modal for Custom Folders The Custom Folders row in the model picker currently only accepts a typed path. On a remote-served Studio (Colab, shared workstation) that means the user has to guess or paste the exact server-side absolute path. A native browser folder picker can't solve this: HTML `<input type="file" webkitdirectory>` hides the absolute path for security, and the File System Access API (Chrome/Edge only) returns handles rather than strings, neither of which the server can act on. This PR adds a small in-app directory browser that lists paths on the server and hands the chosen string back to the existing `POST /api/models/scan-folders` flow. ## Backend * New endpoint `GET /api/models/browse-folders`: * `path` query param (expands `~`, accepts relative or absolute; empty defaults to the user's home directory). * `show_hidden` boolean to include dotfiles/dotdirs. * Returns `{current, parent, entries[], suggestions[]}`. `parent` is null at the filesystem root. * Immediate subdirectories only (no recursion); files are never returned. * `entries[].has_models` is a cheap hint: the directory looks like it holds models if it is named `models--` (HF hub cache layout) or one of the first 64 children is a .gguf/.safetensors/config.json/ adapter_config.json or another `models--` subfolder. * Sort order: model-bearing dirs, then plain, then hidden; case- insensitive alphabetical within each bucket. * Suggestions auto-populate from HOME, the HF cache root, and any already-registered scan folders, deduplicated. * Error surface: 404 for missing path, 400 for non-directory, 403 on permission errors. Auth-required like the other models routes. * New Pydantic schemas `BrowseEntry` and `BrowseFoldersResponse` in `studio/backend/models/models.py`. ## Frontend * New `FolderBrowser` component (`studio/frontend/src/components/assistant-ui/model-selector/folder-browser.tsx`) using the existing `Dialog` primitive. Features: * Clickable breadcrumb with a `..` row for parent navigation. * Quick-pick chips for the server-provided suggestions. * `Show hidden` checkbox. * In-flight fetch cancellation via AbortController so rapid navigation doesn't flash stale results. * Badges model-bearing directories inline. * `chat-api.ts` gains `browseFolders(path?, showHidden?)` and matching types. * `pickers.tsx` adds a folder-magnifier icon next to the existing `Add` button. Opening the browser seeds it with whatever the user has already typed; confirming fills the text input, leaving the existing validation and save flow unchanged. ## What it does NOT change * The existing text-input flow still works; the browser is additive. * No new permissions or escalation; the endpoint reads only directories the server process is already allowed to read. * No model scanning or filesystem mutation happens from the browser itself -- it just returns basenames for render. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Studio: cap folder-browser entries and expose truncated flag Pointing the folder browser at a huge directory (``/usr/lib``, ``/proc``, or a synthetic tree with thousands of subfolders) previously walked the whole listing and stat-probed every child via ``_looks_like_model_dir``. That is both a DoS shape for the server process and a large-payload surprise for the client. Introduce a hard cap of 2000 subdirectory entries and a ``truncated: bool`` field on the response. The frontend renders a small hint below the list when it fires, prompting the user to narrow the path. Below-cap directories are unchanged. Verified end-to-end against the live backend with a synthetic tree of 2050 directories: response lands at 2000 entries, ``truncated=true``, listing finishes in sub-second time (versus tens of seconds if we were stat-storming). * Studio: suggest LM Studio / Ollama dirs + 2-level model probe Three improvements to the folder-browser, driven by actually dropping an LM Studio-style install (publisher/model/weights.gguf) into the sandbox and walking the UX: ## 1. Quick-pick chips for other local-LLM tools `well_known_model_dirs()` (new) returns paths commonly used by adjacent tools. Only paths that exist are returned so the UI never shows dead chips. * LM Studio current + legacy roots + user-configured `downloadsFolder` from its `settings.json` (reuses the existing `lmstudio_model_dirs()` helper). * Ollama: `$OLLAMA_MODELS` env override, then `~/.ollama/models`, `/usr/share/ollama/.ollama/models`, and `/var/lib/ollama/.ollama/models` (the systemd-service install path surfaced in the upstream "where is everything?" issue). * Generic user-choice locations: `~/models`, `~/Models`. Dedup is stable across all sources. ## 2. Two-level model-bearing probe LM Studio and Ollama both use `root/publisher/model/weights.gguf`. The previous `has_models` heuristic only probed one level, so the publisher dir (whose immediate children are model dirs, not weight files) was always marked as non-model-bearing. Pulled the direct- signal logic into `_has_direct_model_signal` and added a grandchild probe so the classic layout is now recognised. Still O(PROBE^2) worst-case, still returns immediately for `models--` names (HF cache layout) and for any direct weight file. ## 3. model_files_here hint on response body A leaf model dir (just GGUFs, no subdirs) previously rendered as `(empty directory)` in the modal, confusing users into thinking the folder wasn't scannable. Added a `model_files_here` count on the response (capped at 200) and a small hint row in the modal: `N model files in this folder. Click "Use this folder" to scan it.` ## Verification Simulated an LM Studio install by downloading the real 84 MB `unsloth/SmolLM2-135M-Instruct-Q2_K.gguf` into `~/.lmstudio/models/unsloth/SmolLM2-135M-Instruct-GGUF/`. Confirmed end-to-end: Home listing suggests `~/.lmstudio/models` as a chip. * Browsing `~/.lmstudio/models` flags `unsloth` (publisher) as `has_models=true` via the 2-level probe. * Browsing the publisher flags `SmolLM2-135M-Instruct-GGUF` (model dir) as `has_models=true`. * Browsing the model dir returns empty entries but `model_files_here=1`, and the frontend renders a hint telling the user it is a valid target. * Studio: one-click scan-folder add + prominent remove + plain search icon Three small Custom Folders UX fixes after real-use walkthrough: * One-click add from the folder browser. Confirming `Use this folder` now submits the path directly to `POST /api/models/scan-folders` instead of just populating the text input. `handleAddFolder` takes an optional explicit path so the submit lands in the same tick as `setFolderInput`, avoiding a state-flush race. The typed-path + `Add` button flow is unchanged. * Prominent remove X on scan folders. The per-folder delete button was `text-muted-foreground/40` and hidden entirely on desktop until hovered (`md:opacity-0 md:group-hover:opacity-100`). Dropped the hover-only cloak, bumped color to `text-foreground/70`, added a red hover/focus background, and sized the icon up from `size-2.5` to `size-3`. Always visible on every viewport. * Plain search icon for the Browse button. `FolderSearchIcon` replaced with `Search01Icon` so it reads as a simple "find a folder" action alongside the existing `Add01Icon`. * Studio: align Custom Folders + and X buttons on the same right edge The Custom Folders header used `px-2.5` with a `p-0.5` icon button, while each folder row used `px-3` with a `p-1` button. That put the X icon 4px further from the right edge than the +. Normalised both rows to `px-2.5` with `p-1` so the two icons share a column. * Studio: empty-state button opens the folder browser directly The first-run empty state for Custom Folders was a text link reading "+ Add a folder to scan for local models" whose click toggled the text input. That's the wrong default: a user hitting the empty state usually doesn't know what absolute path to type, which is exactly what the folder browser is for. * Reword to "Browse for a models folder" with a search-icon affordance so the label matches what the click does. * Click opens the folder browser modal directly. The typed-path + Add button flow is still available via the + icon in the section header, so users who know their path keep that option. * Slightly bump the muted foreground opacity (70 -> hover:foreground) so the button reads as a primary empty-state action rather than a throwaway hint. * Studio: Custom Folders header gets a dedicated search + add button pair The Custom Folders section header had a single toggle button that flipped between + and X. That put the folder-browser entry point behind the separate empty-state link. Cleaner layout: two buttons in the header, search first, then add. * Search icon (left) opens the folder browser modal directly. * Plus icon (right) toggles the text-path input (unchanged). * The first-run empty-state link is removed -- the two header icons cover both flows on every state. Both buttons share the same padding / icon size so they line up with each other and with the per-folder remove X. * Studio: sandbox folder browser + bound caps + UX recoveries PR review fixes for the Custom Folders folder browser. Closes the high-severity CodeQL path-traversal alert and addresses the codex / gemini P2 findings. Backend (studio/backend/routes/models.py): * New _build_browse_allowlist + _is_path_inside_allowlist sandbox. browse_folders now refuses any target that doesn't resolve under HOME, HF cache, Studio dirs, registered scan folders, or the well-known third-party model dirs. realpath() is used so symlink traversal cannot escape the sandbox. Also gates the parent crumb so the up-row hides instead of 403'ing. * _BROWSE_ENTRY_CAP now bounds visited iterdir entries, not appended entries. Dirs full of files (or hidden subdirs when show_hidden is False) used to defeat the cap. * _count_model_files gets the same visited-count fix. * PermissionError no longer swallowed silently inside the enumeration / counter loops -- now logged at debug. Frontend (folder-browser.tsx, pickers.tsx, chat-api.ts): * splitBreadcrumb stops mangling literal backslashes inside POSIX filenames; only Windows-style absolute paths trigger separator normalization. The Windows drive crumb value is now C:/ (drive root) instead of C: (drive-relative CWD-on-C). * browseFolders accepts and forwards an AbortSignal so cancelled navigations actually cancel the in-flight backend enumeration. * On initial-path fetch error, FolderBrowser now falls back to HOME instead of leaving the modal as an empty dead end. * When the auto-add path (one-click "Use this folder") fails, the failure now surfaces via toast in addition to the inline paragraph (which is hidden when the typed-input panel is closed). * Studio: rebuild browse target from trusted root for CodeQL clean dataflow CodeQL's py/path-injection rule kept flagging the post-validation filesystem operations because the sandbox check lived inside a helper function (_is_path_inside_allowlist) and CodeQL only does intra-procedural taint tracking by default. The user-derived ``target`` was still flowing into ``target.exists`` / ``target.is_dir`` / ``target.iterdir``. The fix: after resolving the user-supplied ``candidate_path``, locate the matching trusted root from the allowlist and rebuild ``target`` by appending each individually-validated segment to that trusted root. Each segment is rejected if it isn't a single safe path component (no separators, no ``..``, no empty/dot). The downstream filesystem ops now operate on a Path constructed entirely from ``allowed_roots`` (trusted) plus those validated segments, so CodeQL's dataflow no longer sees a tainted source. Behavior is unchanged for all valid inputs -- only the construction of ``target`` is restructured. Live + unit tests all pass (58 selected, 7 deselected for Playwright env). * Studio: walk browse paths from trusted roots for CodeQL --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Ubuntu <ubuntu@h100-8-cheapest.us-east5-a.c.unsloth.internal>	2026-04-15 08:04:33 -07:00
Roland Tannous	800ddc95f8	Re-apply #4939 : updated models template mappers (#4950 ) * Reapply "updated models template mappers. added lfm2.5vl450m to transformers 5…" (#4945) This reverts commit `33503ea248`. * Add missing gemma-4-31B-it bnb-4bit mapper entry and LFM2.5 upstream namespace for PR #4950 - Add unsloth/gemma-4-31B-it-unsloth-bnb-4bit to __INT_TO_FLOAT_MAPPER so the int-to-float resolution works for this model (already listed in TEMPLATE_TO_MODEL_MAPPER but had no mapper entry). - Add LiquidAI/LFM2.5-1.2B-Instruct to lfm-2.5 TEMPLATE_TO_MODEL_MAPPER entry so the canonical upstream namespace is mapped consistently with lfm-2. * Add missing gemma-4-31B-it bnb-4bit Ollama mapping and lfm-2.5 chat template alias - Add unsloth/gemma-4-31B-it-unsloth-bnb-4bit to OLLAMA_TEMPLATE_TO_MODEL_MAPPER so Ollama export works for this model (E2B-it and E4B-it bnb-4bit variants were already present, 31B-it was inconsistently omitted) - Register CHAT_TEMPLATES["lfm-2.5"] as alias of the lfm-2 template to prevent KeyError when Studio resolves LFM2.5 models through MODEL_TO_TEMPLATE_MAPPER * Add missing LFM2 bnb-4bit INT_TO_FLOAT_MAPPER entry unsloth/LFM2-1.2B-unsloth-bnb-4bit is referenced in model_mappings.py but had no mapper.py entry, so model resolution would fail when users load that variant with load_in_4bit=False or when the float name is used with load_in_4bit=True. * Fix review findings for PR #16 1. ollama_template_mappers.py: Restore dropped Gemma-4 base model IDs (E2B, E4B, 31B, 26B-A4B) and add missing google/ upstream IDs to the gemma4 Ollama mapper for consistency with other gemma entries. 2. mapper.py: Remove self-mapping non-bnb-4bit entries from __INT_TO_FLOAT_MAPPER that were polluting FLOAT_TO_INT_MAPPER with lowercase 16-bit names, causing load_in_4bit=True to return bad model names. Add direct MAP_TO_UNSLOTH_16bit entries to preserve the google->unsloth 16-bit redirects. 3. mapper.py: Add LFM2.5 MAP_TO_UNSLOTH_16bit redirect so LiquidAI/LFM2.5-1.2B-Instruct resolves to its unsloth mirror. * Add review tests for PR #4950 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove top-level test files These test_.py files were added at the repo root rather than under tests/. Removing them from this PR; the production mapper changes remain. Add gemma-4-26B-A4B-it mapping Adds unsloth/gemma-4-26B-A4B-it to __INT_TO_FLOAT_MAPPER as a 2-tuple so google/gemma-4-26B-A4B-it routes to unsloth/gemma-4-26B-A4B-it across INT_TO_FLOAT_MAPPER, FLOAT_TO_INT_MAPPER, and MAP_TO_UNSLOTH_16bit. The 26B-A4B (MoE) model has no bnb-4bit variant, so the key uses the plain unsloth name rather than the -unsloth-bnb-4bit suffix. Removes the now-redundant standalone _add_with_lower call for the -it variant; the 16bit mapping is registered via the dict loop. * Add unsloth-bnb-4bit mappings for gemma-4 base (non-it) models Adds E2B, E4B, 31B base unsloth-bnb-4bit entries to __INT_TO_FLOAT_MAPPER. The 26B-A4B (MoE) base has no bnb-4bit variant on HF, so it stays on the standalone _add_with_lower line for the 16bit-only routing. Removes the redundant _add_with_lower lines for E2B, E4B, 31B base since the dict loop now registers the same google->unsloth route through the 2-tuple entries, plus full FLOAT_TO_INT and INT_TO_FLOAT coverage. --------- Co-authored-by: Daniel Han <danielhanchen@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-15 07:52:12 -07:00
Avaya Aggarwal	7c5464ad71	feat: Add cactus QAT scheme support (#4679 ) * feat: Add cactus QAT scheme support * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * test(qat): add tests for cactus QAT scheme and fix missing import * Fix cactus QAT scheme: correct MappingType import, tighten PerGroup filter - Drop the broken `from torchao.dtypes import MappingType` import. `MappingType` lives in `torchao.quantization` (and `torchao.quantization.quant_primitives`); it is not exported from `torchao.dtypes` in any supported torchao release (verified on 0.14, 0.16, 0.17). The previous code raised `ImportError` on every cactus call and was masked as a misleading 'torchao not found' error. - Since `IntxWeightOnlyConfig` already defaults `mapping_type` to `MappingType.SYMMETRIC`, drop the explicit kwarg entirely and remove the import. Behavior is unchanged. - Introduce a named `group_size = 32` constant (matches the int4 / fp8-int4 pattern in the surrounding branches) and add a `% group_size == 0` divisibility guard to the filter. `PerGroup(32)` requires `in_features % 32 == 0` at `quantize_()` time, otherwise torchao raises `ValueError: in_features (N) % group_size (32) must be == 0`. The old `in_features >= 32` filter would admit non-aligned widths (e.g. 33, 48, 65, 127) and crash `_prepare_model_for_qat` for those shapes. * Warn when cactus QAT skips non-divisible Linear layers Multiple reviewers flagged that the divisibility guard added in the previous commit can silently leave Linear layers in full precision when their in_features is not a multiple of 32. For currently supported Unsloth models (Qwen, Llama, Gemma, Mistral, Phi) every Linear width is already a multiple of 32/64/128 so this never triggers, but surfacing the coverage gap is cheap and avoids users assuming 100% QAT coverage when they bring a custom model with unusual shapes. Emit a UserWarning listing up to the first 8 skipped layers whenever the cactus filter excludes any Linear due to the modulo guard. This keeps the lenient silent-skip behavior (consistent with int4 / fp8-int4), but stops making it silent. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-04-15 07:40:03 -07:00
Avaya Aggarwal	f18e9dddf0	feat: Add support for OLMo-3 model (#4678 ) * feat: Add support for OLMo-3 model in mapping and tests * Update unsloth/models/mapper.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update tests/test_get_model_name.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Fix casing, add Think variants, and align version gate for OLMo-3 PR 4678 Mapper: switch slugs from OLMo-3 to canonical Olmo-3 mixed case, drop the non-existent unsloth/Olmo-3-7B-Instruct-bnb-4bit dead alias, and add the already-published Olmo-3-7B-Think and Olmo-3-32B-Think Unsloth mirrors. Loader: change the olmo3 transformers version gate from Version("4.57.0") to Version("4.57.0.dev0") so nightly/source builds that already contain olmo3 are not blocked, matching the OLMo-2, Gemma 3 and Cohere patterns. * Use canonical Olmo-3 casing and cover Think variants in OLMo-3 tests Mirrors the mapper.py fixes on pr-4678-code: HuggingFace canonical slugs for the OLMo-3 family use mixed-case Olmo-3 (not OLMo-3 like OLMo-2), and Unsloth already hosts Olmo-3-7B-Think and Olmo-3-32B-Think mirrors, so the resolution matrix now covers all three published Olmo-3 families. --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-04-15 07:39:11 -07:00
Daniel Han	c3cd890357	Studio: refresh Downloaded GGUF list and recurse into variant subdirs (#5032 ) * Studio: refresh Downloaded GGUF list and recurse into variant subdirs Two fixes for the model picker's "Downloaded" section. Frontend (`pickers.tsx`): * `HubModelPicker`'s mount effect short-circuited the cached-gguf and cached-models refetch whenever the module-level cache already had entries (`if (alreadyCached) return;`). After downloading a new repo in the same session, reopening the picker rendered the stale cache and the new repo never appeared in "Downloaded" until a full page reload. The early return is removed so the lists are always refreshed on mount; the module cache still drives the initial render so there is no spinner flash when we already had data. Backend (`utils/models/model_config.py`): * `list_local_gguf_variants` and `_find_local_gguf_by_variant` used a non-recursive `Path.glob(".gguf")`. Some HF GGUF repos (e.g. `unsloth/gemma-4-26B-A4B-it-GGUF`) place the largest quants under a variant-named subdirectory such as `BF16/...gguf`, which the top-level glob missed. Both helpers now use `rglob` and the variant filename is stored as a path relative to the scan root so the locator can still find the file. The flat-layout case (variants directly in the snapshot root) is unchanged: verified against `unsloth/gemma-4-E2B-it-GGUF` which still returns its UD-Q4_K_XL variant correctly. Studio: emit posix-style relative filenames for local GGUF subdirs `list_local_gguf_variants` was doing `str(f.relative_to(p))`, which on Windows produces backslash-separated paths like `BF16\foo.gguf`. The remote `list_gguf_variants` (HF API path) always returns forward-slash filenames such as `BF16/foo.gguf`, so the two would diverge on Windows. Switch to `.as_posix()` so the local and remote variant filenames stay identical across Linux, macOS, and Windows. Verified by simulating with `PureWindowsPath` in the test suite. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Studio: detect mmproj at snapshot root for nested-variant layouts When _find_local_gguf_by_variant returns a weight file inside a quant-named subdir (e.g. snapshot/BF16/foo.gguf), detect_mmproj_file was scanning only the immediate parent and missing the mmproj file sitting at the snapshot root. The model was then loaded without --mmproj, silently breaking vision support for repos that ship nested variants. detect_mmproj_file now takes an optional search_root and walks up from the weight file to that root, in order, so the mmproj at the snapshot root is picked up. Sibling quant subdirs are not scanned, so an unrelated variant's mmproj does not leak in. Also apply the suggested micro-optimization on relative_to in list_local_gguf_variants -- only build the posix path when storing the first file for a quant. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-15 07:34:42 -07:00
Daniel Han	156f3fc4b0	Gate trl disable_gradient_checkpointing patch warning on UNSLOTH_ENABLE_LOGGING (#5038 ) The "Patched trl.models.utils.disable_gradient_checkpointing with a no-op" warning fires once on every Unsloth import, including from notebooks where the user did not opt into verbose logging. It is a routine integration patch, not an anomaly the user needs to know about. Gate it on UNSLOTH_ENABLE_LOGGING=1 like other diagnostic notices.	2026-04-15 07:33:48 -07:00
jonahsamost	777e1bd0ac	fix (#4887 )	2026-04-15 07:21:03 -07:00
Daniel Han	1a4ca5eca8	Fix grad-accum accepts_loss_kwargs detection for vision wrappers (#5036 ) * Fix grad-accum model_accepts_loss_kwargs detection for vision wrappers Replace the source-string rewrite of Trainer.__init__ with an instance-level accepts_loss_kwargs shadow applied on the loaded model. Covers: 1. Unsloth-compiled forward -> True, so HF Trainer does not double-scale on top of unsloth_fixed_cross_entropy's num_items_in_batch division. 2. Stock forward on a conditional-generation wrapper (Gemma3n, Gemma3 pre-4.57, Qwen-VL family, etc.) where the outer class has no accepts_loss_kwargs but the inner .model declares False -> False. This is the case that reproduces issue #4982 under trust_remote_code or UNSLOTH_COMPILE_DISABLE, where the previous fix's outer-attr check walked past the inner model and fell through to signature inspection. 3. Text LMs without any explicit accepts_loss_kwargs -> leave HF default. The previous .replace()-based patch silently no-ops on transformers 4.48 through 4.52 (variable named model, not unwrapped_model) and is fragile against any upstream reformat. The new helper walks the PEFT / HF wrapper chain, finds the first class that declares accepts_loss_kwargs on its own class dict (type(m).__dict__, not hasattr, to avoid PEFT __getattr__ forwarding), and setattr-shadows that value at every wrapper level so HF Trainer's hasattr(unwrapped_model, ...) check picks it up at whichever level accelerate.unwrap_model returns. Also adds an unconditional post-init clamp of accelerator.gradient_accumulation_steps = 1 to work around the transformers 5.0 through 5.5 GradientAccumulationPlugin regression that makes accelerator.backward divide loss by GA on top of training_step's own /GA division. Fixed upstream in 5.6.0.dev0; no-op on 4.x and 5.6+. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Trim comments * Address review: cover PEFT-after-load and custom compile location Two review findings from 3/20 reviewers: 1. [3 of 20 reviewers] apply_accepts_loss_kwargs_fix was called from the loaders before get_peft_model wraps the base model, so on transformers 4.48-4.52 (which does hasattr on the outer model) the instance shadow on the base model was lost after PEFT wrapping. Fix: also call it from the wrapped Trainer.__init__ so it runs on whatever model the user actually hands to Trainer, which is always the final wrapped form. 2. [1 of 20 reviewers] _forward_is_unsloth_compiled hard-coded the substrings "unsloth_compiled" / "unsloth_cache" in the co_filename check, which misclassifies compiled forwards when UNSLOTH_COMPILE_LOCATION is set to a custom directory. Fix: new _unsloth_compile_cache_leaves helper that reads the env var and matches the basename against path components, honoring both the default and any user override. Verified locally: - PEFT-after-load simulation: HF's hasattr(peft, "accepts_loss_kwargs") now returns True after our init wrapper runs, and value resolves to False on Gemma3n-style inner wrappers. - Custom UNSLOTH_COMPILE_LOCATION simulation: compiled detection returns True for /tmp/my_custom_cache/compiled.py when the env var is set. - End-to-end Gemma-3 270m + LoRA SFT unchanged: loss 4.9626, grad-norm matches prior run, all 4 wrapper levels now carry the shadowed attr. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-15 06:59:36 -07:00
Daniel Han	1ccfd2e0a5	fix(rocm): tighten gfx regex to ignore generic ISA lines (#5033 ) * fix(rocm): tighten gfx regex to ignore generic ISA lines ROCm 6.1+ rocminfo emits generic ISA names such as "amdgcn-amd-amdhsa--gfx11-generic" and "amdgcn-amd-amdhsa--gfx9-4-generic" alongside the real GPU name. The previous `gfx[1-9]` regex used in `_has_rocm_gpu` matched both, so a host with only a generic ISA entry would be reported as having a usable AMD GPU. Tighten the pattern to `gfx[1-9][0-9a-z]{2,3}` so only real gfx ids match. This covers every documented target from GFX6 (gfx600) through GFX12 (gfx1201), including letter-suffixed ids like gfx90a (MI250 / MI250X) and gfx90c. Documented generic ISA names always have 1 or 2 digits before the dash and no longer match. Applied to both `studio/install_python_stack.py` and `studio/install_llama_prebuilt.py` so the two detection paths agree. Co-authored-by: Martin Hoyer <mhoyer@redhat.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: Martin Hoyer <mhoyer@redhat.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-15 05:24:41 -07:00
Daniel Han	b7a8ff2833	Respect classification head skip list on pre-quantized 4-bit checkpoints (#5027 ) (#5034 ) * Respect classification head skip list on pre-quantized 4-bit checkpoints (#5027) FastLanguageModel.from_pretrained(..., num_labels=N) crashed with "NotImplementedError: normal_kernel_cuda not implemented for 'Byte'" on pre-quantized bnb 4-bit checkpoints (e.g. unsloth/Qwen3-4B-bnb-4bit) when running on transformers 5.x. Two pieces were needed to close this out: 1. unsloth_zoo PR: add "score", "classifier", "qa_outputs" to SKIP_QUANTIZATION_MODULES so replace_with_bnb_linear leaves task heads in the compute dtype. 2. This commit: for pre-quantized checkpoints, transformers reads llm_int8_skip_modules from the quantization_config baked into config.json and ignores the runtime BitsAndBytesConfig we pass via kwargs. Unsloth must merge its skip list into model_config.quantization_config.llm_int8_skip_modules before the from_pretrained call, or the checkpoint's frozen list (e.g. ["lm_head", "multi_modal_projector", "merger", "modality_projection"]) wins and the `score` head gets converted to Linear4bit with uint8 storage, then _init_weights calls normal_ on uint8 and crashes. Also add a defensive post-load cast on the task head to guard against any residual path that ends up with a non-floating head dtype. Verified on transformers 4.57.6 and 5.5.0 with: - unsloth/Qwen3-4B-bnb-4bit + num_labels=3 - unsloth/Qwen3-4B (non-bnb repo, load_in_4bit=True) - unsloth/Llama-3.2-1B-Instruct + num_labels=3 - unsloth/ModernBERT-large classifier head (bert_classification notebook) - Regression: causal LM path unchanged, backbone still 4-bit - 3-step SFT on num_labels=3 confirms gradient flow and weight updates on score.weight Fixes unslothai/unsloth#5027 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-15 05:16:33 -07:00
David Solanas Sanz	1fcb2502cf	fix: prevent offline freeze by fixing stats retry and forwarding local_files_only (#5016 ) Fixes #2393. - `_utils.py`: `has_internet()` now respects `HF_HUB_OFFLINE` with truthy variant parsing in addition to `TRANSFORMERS_OFFLINE`. - `_utils.py`: replace uncontrolled `except Exception: stats_check()` retry (which had no time limit and could freeze on Kaggle offline mode) with a logged skip. - `loader.py`: forward `local_files_only` from kwargs into all `AutoConfig.from_pretrained` and `PeftConfig.from_pretrained` probes in `FastLanguageModel.from_pretrained` and `FastModel.from_pretrained`, including the PEFT base-model reload paths.	2026-04-15 04:51:31 -07:00
Lee Jackson	f9ef639dde	Studio: support GGUF variant selection for non-suffixed repos (#5023 ) * fix: support GGUF variant selection for non-suffixed repos * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: harden GGUF detection across cached models and picker flows * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * chore: use shared GGUF picker helper for search rows * fix: avoid mixed cache duplication and preserve GGUF fallback detection * fix: unify GGUF cache matching and merge picker hints * fix: normalize local GGUF matching across picker and model config * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: robust cached-gguf classification + hint-aware click routing - _repo_gguf_size_bytes: treat size_on_disk=None as 0 and dedupe fallback by commit_hash so partial/interrupted downloads don't TypeError out of sum() and wipe the entire cached list. - list_cached_gguf / list_cached_models: narrow per-repo try/except so one malformed repo no longer poisons the whole response. - handleModelClick: route through isKnownGgufRepo instead of the suffix-only isGgufRepo, so non-suffixed GGUF repos still open the variant expander from every call site. - Replace the modelIsGgufById/resultIsGgufById Maps with Sets of known GGUF ids to stop conflating "no hint" with "known not-GGUF". - Make HfModelResult.isGguf required (it is always set in makeMapModel). - Add regression tests for the None size case, mixed-repo inclusion in cached-gguf, and per-repo error isolation. * fix: exclude mmproj from GGUF classification and case-normalize hint lookups - _repo_gguf_size_bytes now filters mmproj vision-adapter files so safetensors+mmproj.gguf repos stay on the cached-models path and non-GGUF rows no longer show zero pickable variants. A vision-capable GGUF repo (main weight + mmproj adapter) still classifies as GGUF and reports the main weight size. - modelGgufIds / resultGgufIds now key on lowercased ids and isKnownGgufRepo lowercases its lookup, so store and HF-search ids that differ only by casing still match the same GGUF hint. - New regression tests: mmproj-only repo excluded from cached-gguf, same repo included in cached-models, vision-capable repo still classified as GGUF with correct size. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai> Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>	2026-04-15 15:32:01 +04:00
Roland Tannous	13928b5f0e	Add configurable PyTorch mirror via UNSLOTH_PYTORCH_MIRROR env var (#5024 ) * Add configurable PyTorch mirror via UNSLOTH_PYTORCH_MIRROR env var When set, UNSLOTH_PYTORCH_MIRROR overrides the default https://download.pytorch.org/whl base URL in all four install scripts (install.sh, install.ps1, studio/setup.ps1, studio/install_python_stack.py). When unset or empty, the official URL is used. This lets users behind corporate proxies or in regions with poor connectivity to pytorch.org point at a local mirror without patching scripts. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add pytest for UNSLOTH_PYTORCH_MIRROR in install_python_stack.py Tests that _PYTORCH_WHL_BASE picks up the env var when set, falls back to the official URL when unset or empty, and preserves the value as-is (including trailing slashes). * Remove stale test assertions for missing install.sh messages * Fix GPU mocking in test_get_torch_index_url.sh Extract _has_usable_nvidia_gpu and _has_amd_rocm_gpu alongside get_torch_index_url so the GPU-presence checks work in tests. Add -L flag handling to mock nvidia-smi so it passes the GPU listing check. All 26 tests now pass on CPU-only machines. * Strip trailing slash from UNSLOTH_PYTORCH_MIRROR to avoid double-slash URLs --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-15 11:39:11 +04:00
Datta Nimmaturi	826c98f3c0	[moe][gemma4] Target MoE for gemma4 (#4913 ) * Target MoE for gemma4 * refactor attention impl determine * Revert "refactor attention impl determine" This reverts commit 888fca08110a9a74278dc1ebc14d0da043bbd11d. * Remove attention policy changes from gemma4 MoE fix	2026-04-14 16:53:07 -05:00
Daniel Han	5aa8c15246	Studio: hard-stop at n_ctx with a 'Context limit reached' toast (#5021 ) * Studio: hard-stop at n_ctx with a dedicated 'Context limit reached' toast llama-server's default behavior when the KV cache fills is to silently drop the oldest non-``n_keep`` tokens and keep generating. The UI has no way to tell the user that earlier turns were evicted -- they just see degraded continuity and a confusing ``5,361 / 4,096`` on the context usage bar. Launch llama-server with ``--no-context-shift`` so it returns a clean error once the request would exceed ``n_ctx``. In the chat adapter, catch the error, identify it as a context-limit error via ``isContextLimitError()``, and surface a dedicated toast that names the exact control to adjust: the ``Context Length`` field in the chat Settings panel. Also add a lightweight tooltip hint on ``ContextUsageBar`` when usage crosses 85%, so users see the "raise Context Length in Settings" suggestion before they hit the hard stop. Tests: * ``test_llama_cpp_no_context_shift.py`` pins the ``--no-context-shift`` flag in the static launch-command template, and pins it inside the unconditional ``cmd = [ ... ]`` block so a future refactor can't hide it behind a branch. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Shorten --no-context-shift comment to 1 line * Match backend _friendly_error rewrite in isContextLimitError Codex review on PR caught that ``backend/routes/inference.py::_friendly_error`` rewrites the raw llama-server text "request (X tokens) exceeds the available context size (Y tokens)" into "Message too long: X tokens exceeds the Y-token context window. ..." on the main streaming GGUF path. The heuristic only looked for "context size" / "exceeds the available context" / "context shift", none of which survive the rewrite, so the new "Context limit reached" toast would never fire for the most common case. Add matches for "message too long" and "context window" so both wordings hit. Also addresses Gemini feedback on the launch-flag test: * Use ``inspect.getsource(LlamaCppBackend.load_model)`` instead of reading ``__file__`` directly; scopes the assertions to the function that actually launches llama-server. * Replace the hardcoded ``" ]"`` indent search with a line-at-a-time scan for a line that is just ``]``, so the test survives reformatting. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-14 10:58:20 -07:00
Daniel Han	5861a7ce15	Studio: split model-load progress label across two rows (#5020 ) * Studio: split model-load progress label across two rows The chat flow and training overlay both compose a progress label like "112.6 of 122.3 GB • 331.0 MB/s • 30s left" and render it next to the percent badge in a single flex row. Once the rate + ETA part shows up, the label outgrows the row width and wraps mid-phrase, orphaning the percent ("19 left %") onto a second ragged line. Fix in model-load-status.tsx: split the label on the first " • " into a primary (size) chunk that stays on row 1 with the percent, and a secondary (rate/ETA) chunk that renders on its own muted row below. Labels without a bullet (e.g. "22.8 GB downloaded") collapse cleanly to one row. The inline-status variant keeps only the primary and surfaces the full label via the tooltip. Also extracts the rate/ETA math out of useTransferStats into a pure ``transfer-stats.ts`` module (appendSample + computeTransferStats) so it can be reasoned about and tested without React. The hook is now a thin wrapper that feeds sample history through the pure functions. Backend: adds two companion test files for load_progress(): * test_llama_cpp_load_progress_matrix.py (21 tests) -- platform matrix (Linux /proc, macOS/Windows absence), VmRSS parsing variants (tab/space/missing/malformed), filesystem edges (HF-cache symlinks, broken symlinks, nonexistent paths, relative paths), shard aggregation (partial multi-shard, two series in same dir, mmproj-* exclusion, single-file), lifecycle races, concurrent sampling (10 threads x 50 iters against real /proc), fraction bounds. * test_llama_cpp_load_progress_live.py (5 tests) -- no-mock live integration: real subprocess allocating 100 MB to match VmRSS, real ready phase, real dead-pid degradation, real shard aggregation, repeated polling. Skipped on non-Linux. Both complement the existing test_llama_cpp_load_progress.py. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Hoist splitProgressLabel out of JSX IIFE (review feedback) --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-14 10:58:16 -07:00
Eda Z	5b8dbdc3c2	Fix bitsandbytes ROCm install by using pip instead of uv (#4966 ) * Fix bitsandbytes ROCm install by using pip instead of uv * Also use pip for PyPI fallback path in _install_bnb_rocm The original fix correctly switched the pre-release wheel install from uv to pip, but left the PyPI fallback path on uv. If uv breaks bnb on ROCm, the fallback would hit the same issue. Move pip bootstrap before the branch so both paths use pip consistently. * Harden pip bootstrap: try ensurepip first, warn on failure - Try ensurepip --upgrade before falling back to uv pip install pip. ensurepip works offline and does not need PyPI, making the bootstrap robust when the network or index is unavailable. - If both ensurepip and uv fail, emit a visible warning instead of silently swallowing the error (which previously led to a cryptic "No module named pip" downstream). - Use run_maybe_quiet so --verbose users see bootstrap output. - Update comment to document the actual root cause: uv rejects the wheel because filename version and metadata version disagree. * Add --isolated to pip install calls in _install_bnb_rocm uv pip install ignores pip.conf and PIP_* env vars, but python -m pip reads them. Without --isolated, users with PIP_INDEX_URL pointing to a private mirror that does not carry bitsandbytes would see the PyPI fallback fail where it previously worked under uv. --isolated restores parity with the old uv behavior. * Drop --isolated from PyPI fallback in _install_bnb_rocm --isolated suppresses PIP_INDEX_URL, PIP_EXTRA_INDEX_URL, and pip.conf. This is correct for the pre-release path (hardcoded GitHub URL, no index consulted), but breaks the PyPI fallback for users in corporate or air-gapped environments whose only route to bitsandbytes is a private mirror configured via those mechanisms. Keep --isolated on the direct-URL pre-release install; drop it from the index-dependent fallback. * Drop --isolated from pre-release pip install, fix warning wording --isolated suppresses pip.conf cert/proxy/CA settings in addition to index config. For the direct GitHub URL, index config is irrelevant but cert/proxy settings matter in corporate SSL-inspection environments. Without this fix, users with pip.conf-based CA bundles get a TLS error on the pre-release download and silently fall back to the broken PyPI version -- the exact outcome the PR is trying to prevent. Also fix the fallback warning: "unreachable" is too specific since the pre-release install can fail for reasons other than network reachability. --------- Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-04-14 10:23:40 -07:00
pre-commit-ci[bot]	a0b9d14081	[pre-commit.ci] pre-commit autoupdate (#5004 ) updates: - [github.com/astral-sh/ruff-pre-commit: v0.15.9 → v0.15.10](https://github.com/astral-sh/ruff-pre-commit/compare/v0.15.9...v0.15.10) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-14 09:49:18 -07:00
Daniel Han	bb14ab144a	Studio: live model-load progress + rate/ETA on download and load (#5017 ) * Studio: live model-load progress + rate/ETA on download and load Two UX fixes for the opaque multi-minute wait between clicking Load and being able to chat, visible most clearly on large MoE GGUFs like MiniMax-M2.7 (131 GB of weights on a 97 GB GPU): 1. Model-load phase is now observable. The existing chat flow transitions the toast to "Starting model..." as soon as the download hits 100%, then shows a spinner with no other feedback until llama-server reports healthy. For a 130 GB model that spinner freezes for five-plus minutes while the kernel pages shards into the page cache. A new `GET /api/inference/load-progress` endpoint samples `/proc/<pid>/status VmRSS` on the llama-server subprocess against the sum of shard file sizes on disk, so the UI can render a real bar plus rate / ETA during that window. 2. Rate and ETA on downloads and loads. Both the chat toast and the training-start overlay used to show a static pair of numbers (for example "15.4 of 140.8 GB"). A rolling 15-second window over the existing byte-series now surfaces "85.3 MB/s, 24m 23s left" beside that pair. The estimator is shared between the download and load phases so the numbers don't reset when the phase flips. Also fixes a pre-existing assignment bug uncovered while wiring this up: `load_model` was storing the caller's `gguf_path` kwarg into `self._gguf_path`, which is `None` on the HF-download code path. The resolved on-disk path (`model_path`) is what llama-server actually mmaps; downstream consumers need that. No existing reader used `_gguf_path`, so this is a correctness fix for the new endpoint. - Backend: `LlamaCppBackend.load_progress()`, `GET /api/inference/load-progress`, `LoadProgressResponse` Pydantic model. - Frontend: `useTransferStats` hook, `formatRate` / `formatEta` helpers, `getLoadProgress` client, rewired chat toast and `DownloadRow` in the training overlay. - Tests: `studio/backend/tests/test_llama_cpp_load_progress.py` covers empty states, mmap phase, ready phase, sharded total aggregation, missing gguf_path, and unreadable /proc (7 cases). `tsc -b` and `vite build` on the frontend both clean. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-14 09:46:22 -07:00
Roland Tannous	514bb3a20e	studio: pin peft to 0.18.1 to fix export subprocess issues (#5015 ) * studio: pin peft to 0.18.1 to fix export subprocess issues peft 0.19.0 causes export subprocess shutdown failures in Studio. Reverting to 0.18.1 resolves the issue. * studio: move peft pin to extras-no-deps to prevent torch upgrade Installing peft via overrides.txt would resolve its deps and pull in torch>=0.11.0, breaking other pinned packages. Moving the pin to extras-no-deps.txt ensures --no-deps is used during install.	2026-04-14 20:16:30 +04:00
Datta Nimmaturi	4328d0b4f6	Fix num_items_in_batch GA for Gemma4 (#4998 ) * Fix num_items_in_batch GA for Gemma4 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-14 09:01:10 -07:00
Daniel Han	7252410ccc	studio: stream export worker output into the export dialog (#4897 ) * studio: stream export worker output into the export dialog The Export Model dialog only showed a spinner on the "Exporting..." button while the worker subprocess was doing the actual heavy lifting. For Merged to 16bit and GGUF / Llama.cpp exports this meant several minutes (or more, for large models) of opaque silence, with no way to tell whether save_pretrained_merged, convert_hf_to_gguf.py, or llama-quantize was making progress. This adds a live terminal-style output panel inside the export dialog, rendered just above the Cancel / Start Export buttons and scrollable with auto-follow-tail. It shows stdout and stderr from both the worker process itself and any child process it spawns (GGUF converter, llama-quantize), coloured by stream. Backend - core/export/worker.py: new _setup_log_capture(resp_queue) installed before LogConfig.setup_logging. It saves the original stdout/stderr fds, creates pipes, os.dup2's the write ends onto fds 1 and 2 (so every child process inherits the redirected fds), and spins up two daemon reader threads. Each thread reads bytes from a pipe, echoes them back to the original fd (so the server console keeps working), splits on \n and \r, and forwards each line to the resp queue as {"type":"log","stream":"stdout\|stderr","line":...,"ts":...}. PYTHONUNBUFFERED=1 is set so nested Python converters flush immediately. - core/export/orchestrator.py: - Thread-safe ring buffer (collections.deque, maxlen 4000) with a monotonically increasing seq counter. clear_logs(), get_logs_since(cursor), get_current_log_seq(), is_export_active(). - _wait_response handles rtype == "log" by appending to the buffer and continuing the wait loop. Status messages are also surfaced as a "status" stream so users see high level progress alongside raw subprocess output. - load_checkpoint, _run_export, and cleanup_memory now wrap their bodies with the existing self._lock (previously unused), clear the log buffer at the start of each op, and flip _export_active in a try/finally so the SSE endpoint can detect idle. - routes/export.py: - Wrapped every sync orchestrator call (load_checkpoint, cleanup_memory, export_merged_model, export_base_model, export_gguf, export_lora_adapter) in asyncio.to_thread so the FastAPI event loop stays free during long exports. Without this the new SSE endpoint could not be served concurrently with the blocking export POST. - New GET /api/export/logs/stream SSE endpoint. Honors Last-Event-ID and a since query param for reconnect, emits log / heartbeat / complete / error events, uses the id field to carry the log seq so clients can resume cleanly. On first connect without an explicit cursor it starts from the current seq so old lines from a previous run are not replayed. Frontend - features/export/api/export-api.ts: streamExportLogs() helper that authFetches the SSE endpoint and parses id / event / data fields manually (same pattern as streamTrainingProgress in train-api.ts). - features/export/components/export-dialog.tsx: - Local useExportLogs(exporting) hook that opens the SSE stream on exporting transitions to true, accumulates up to 4000 lines in component state, and aborts on cleanup. - New scrollable output panel rendered above DialogFooter, only shown for Merged to 16bit and GGUF / Llama.cpp (LoRA adapter is a fast disk write with nothing to show). Dark terminal styling (bg-black/85, emerald text, rose for stderr, sky for status), max-height 14rem, auto-scrolls to the bottom on new output but stops following if the user scrolls up. A small streaming / idle indicator is shown next to the panel title. - DialogContent widens from sm:max-w-lg to sm:max-w-2xl when the output panel is visible so the logs have room to breathe. Verified - Python smoke test (tests/smoke_export_log_capture.py): spawns a real mp.get_context("spawn") process, installs _setup_log_capture, confirms that parent stdout prints, parent stderr prints, AND a child subprocess invoked via subprocess.run (both its stdout and stderr) are all captured in the resp queue. Passes. - Orchestrator log helpers tested in isolation: _append_log, get_logs_since (with and without a cursor), clear_logs not resetting seq so reconnecting clients still progress. Passes. - routes.export imports cleanly in the studio venv and /logs/stream shows up in router.routes. - bun run build: tsc -b plus vite build, no TypeScript errors. No existing export behavior is changed. If the subprocess, the SSE endpoint, or the frontend hook fails, the export itself still runs to completion the same way it did before, with or without logs visible. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * export dialog: trim bootstrap noise, scope logs per screen, show realpath Several follow-ups to the live export log work: 1. Worker bootstrap noise (transformers venv activation, Unsloth banner, "Top GGUF/hub models" lists, vision detection, 2k-step weight load bar) is dropped from the export-dialog stream. A threading.Event gate in worker.py defaults closed and only opens once _handle_export actually starts; until then the reader thread still echoes lines to the saved console fd for debugging but does not push them onto the resp_queue. The orchestrator already spawns a fresh subprocess for every checkpoint load, so the gate is naturally reset between runs. 2. tqdm in non-tty mode defaults to a 10s mininterval, which makes multi-step bars look frozen in the panel. Set TQDM_MININTERVAL=0.5 in the worker env so any tqdm-driven progress emits more often. 3. The dialog's useExportLogs hook now also clears its line buffer when exportMethod or open changes, so re-opening the dialog into a different action's screen no longer shows the previous action's saved output. A useElapsedSeconds tick + "Working Xs" badge in the log header gives users a visible sign that long single-step phases (cache copies, GGUF conversion) are still running when no new lines are arriving. 4. ExportBackend.export_{merged,base,gguf,lora} now return (success, message, output_path); the worker forwards output_path on each export__done response, the orchestrator's _run_export passes it to routes/export.py, which surfaces it via ExportOperationResponse.details.output_path. The dialog's Export Complete screen renders the resolved on-disk realpath under "Saved to" so users can find their exported model directly. fix(cli): unpack 3-tuple return from export backend ExportOrchestrator.export_{merged,base,gguf,lora} now return (success, message, output_path) so the studio dialog can show the on-disk realpath. The CLI still unpacked 2 values, so every `unsloth export --format ...` crashed with ValueError before reporting completion. Update the four call sites and surface output_path via a "Saved to:" echo. * fix(studio): anchor export log SSE cursor at run start The export dialog SSE defaulted its cursor to get_current_log_seq() at connect time, so any line emitted between the POST that kicks off the export and the client opening the stream was buffered with seqs 1..k and then skipped (seq <= cursor). Long-running exports looked silent during their first seconds. Snapshot _log_seq into _run_start_seq inside clear_logs() and expose it via get_run_start_seq(). The SSE default cursor now uses that snapshot, so every line emitted since the current run began is reachable regardless of when the client connects. Old runs still can't leak in because their seqs are <= the snapshot. * fix(studio): reconnect export log SSE on stream drop useExportLogs launched streamExportLogs once per exporting transition and recorded any drop in .catch(). Long GGUF exports behind a proxy with an idle kill-timeout would silently lose the stream for the rest of the run even though the backend already supports Last-Event-ID resume. The "retry: 3000" directive emitted by the backend is only meaningful to native EventSource; this hook uses a manual fetch + ReadableStream parse so it had no effect. Wrap streamExportLogs in a retry loop that tracks lastSeq from ExportLogEvent.id and passes it as since on reconnect. Backoff is exponential with jitter, capped at 5s, reset on successful open. The loop stops on explicit backend `complete` event or on effect cleanup. * fix(studio): register a second command so Typer keeps `export` as a subcommand The CLI export unpacking tests wrap `unsloth_cli.commands.export.export` in a fresh Typer app with a single registered command. Typer flattens a single-command app into that command, so the test's `runner.invoke(cli_app, ["export", ckpt, out, ...])` treats the leading `"export"` token as an unexpected extra positional argument -- every parametrized case failed with: Got unexpected extra argument (.../out) Register a harmless `noop` second command so Typer preserves subcommand routing and the tests actually exercise the 3-tuple unpack path they were written to guard. Before: 4 failed After: 4 passed --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: studio-install <studio@local.install> Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com> Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com> Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>	2026-04-14 08:55:43 -07:00
Daniel Han	eca592effe	studio: show HF model download progress in training start overlay (#4894 ) * studio: show HF model download progress in training start overlay During the training setup phase, the overlay only displayed a static "Loading model..." line while model weights were being downloaded from Hugging Face. On slow connections this looked like the app had frozen. This adds a small self-contained progress block inside the existing TrainingStartOverlay that polls the existing GET /api/models/download-progress endpoint and renders a Progress bar with bytes downloaded, total bytes, and percent complete. Notes: - Frontend only change. No backend, worker, SSE, or runtime store edits. - Reuses the existing getDownloadProgress client wrapper and the existing /api/models/download-progress endpoint that already scans the HF blob cache for completed and .incomplete files. - selectedModel is read directly from useTrainingConfigStore inside the overlay, so no prop drilling and live-training-view.tsx is unchanged. - Polling runs at 1500 ms and is gated on the HF repo regex (^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+$), the same regex the backend uses, so local paths and empty form state never hit the endpoint. - Polling stops once progress reaches 1.0 so the bar can stay at 100 until the overlay hides on the first training step. - Network errors are silently swallowed, matching the chat side flow (the bar simply freezes at the last value). - When downloadedBytes is 0 the block is hidden entirely, so cached models do not flash a progress bar. - When the HF API cannot determine the total size, the block falls back to "X downloaded" with no percent and no bar. Verified with bun run build (tsc -b plus vite build, no TypeScript errors). * training overlay: track dataset download + show on-disk realpath Adds a dedicated "Downloading dataset..." section to the training-start overlay alongside the existing model-weights one, so an HF dataset that is downloading mid-startup is no longer mislabeled as model weights or hidden entirely. The new GET /api/datasets/download-progress endpoint mirrors /api/models/download-progress against the datasets-- prefix in HF_HUB_CACHE. Both endpoints now also return cache_path, the resolved on-disk realpath of the snapshot directory (or the cache repo root if no snapshot is materialized yet). The overlay surfaces this under each download row so users can immediately see where the model and dataset landed without digging through server logs. The frontend's existing useModelDownloadProgress hook is generalized to a single useHfDownloadProgress(repoId, fetcher) hook that the model and dataset variants both delegate to, keeping polling, gating, and completion semantics in one place. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Studio: Polish training start overlay download progress UI (#4957) * studio: polish training start overlay download progress visuals * Fix formatCachePath cross-platform support and redundant sizeLabel - Extend formatCachePath regex to also shorten macOS /Users/<user> paths to ~ - Suppress sizeLabel when no byte info is available (cachePath-only state), since the "Preparing" badge already conveys the status * Fix misleading status badge when download total is unknown - Hide badge when totalBytes is 0 but downloadedBytes > 0, since we cannot determine if the download is still in progress or already complete (happens when HF size metadata lookup fails for gated/private repos) - Keep "Preparing" badge for the zero-bytes cachePath-only state - Add Windows native path shortening to formatCachePath (C:\Users\<name>) --------- Co-authored-by: Daniel Han <danielhanchen@gmail.com> --------- Co-authored-by: studio-install <studio@local.install> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>	2026-04-14 08:54:01 -07:00
Daniel Han	44082cf88e	Studio: anchor ctx-slider warning threshold at 4096 when weights exceed VRAM (#5014 ) * Studio: anchor ctx-slider warning threshold at 4096 when weights exceed VRAM The chat settings sheet's ctx slider reads `max_context_length` from `/api/inference/status` and renders Exceeds estimated VRAM capacity (N tokens). The model may use system RAM. when the user drags the slider above that value. For models whose weights fit on some GPU subset, `_max_context_length` was already set to the binary-search cap and the warning fired correctly. For models whose weights exceed 90% of every GPU subset's free memory (e.g. MiniMax-M2.7-GGUF at 131 GB on a 97 GB GPU), the ceiling-probe loop never matched a subset, so `max_available_ctx` stayed at the native context (e.g. 196608). The slider ran all the way to native with no indication that any value above the 4096 spec default would trigger `--fit on` and degrade performance. Anchor `max_available_ctx` at `min(4096, native_context_length)` when no subset fits, so the warning fires at the right threshold and the user sees the correct safe-zone / warning-zone split: Before (MiniMax-M2.7 on 97 GB GPU): slider 0 .. 196608, warning threshold = 196608 (never fires) After: slider 0 .. 196608, warning threshold = 4096 (fires correctly) No frontend changes required: `chat-settings-sheet.tsx` already consumes `ggufMaxContextLength` (= status.max_context_length) as the warning threshold and `ggufNativeContextLength` as the slider max. Adds tests/test_llama_cpp_max_context_threshold.py covering weights-exceed-VRAM (single / multi-GPU), a native-ctx below the 4096 fallback case (don't lie about supported ctx), fittable-model regressions (small / multi-GPU / tiny on huge GPU), and the `max_context_length` property's fallback semantics. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-14 08:53:49 -07:00
Daniel Han	b2f80f210e	Studio: make GGUF disk-space preflight cache-aware (#5012 ) * Studio: make GGUF disk-space preflight cache-aware The pre-download disk check in LlamaCppBackend.load_model compared the repo's total GGUF size against free disk without crediting bytes already present in the Hugging Face cache. Re-loading a large cached model (e.g. MiniMax-M2.7-GGUF at 131 GB) then failed cold with "Not enough disk space to download any variant" whenever free disk was below the full weight footprint, even though nothing actually needed to be downloaded. Subtract bytes already on disk via try_to_load_from_cache before comparing against free space. A partial blob (interrupted download) is not credited, so a second attempt still allocates room to finish the download. The log line now also surfaces how much is already cached. Adds tests/test_llama_cpp_cache_aware_disk_check.py covering the fully-cached, partial-cache-insufficient-disk, partial-cache-enough-disk, cold-cache, incomplete-blob, and zero-size-path-info cases. Sparse tempfiles keep the GB-scale scenarios cheap to simulate. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-14 08:53:37 -07:00
Daniel Han	767fa8cade	Studio: honor explicit GGUF ctx and default to 4096 when weights exceed VRAM (#5011 ) * Studio: honor explicit GGUF ctx and default to 4096 when weights exceed VRAM The load-time auto-fit in LlamaCppBackend.load_model had two issues for models whose weights do not fit on any GPU subset (the common case for large MoE GGUFs such as MiniMax-M2.7, Qwen3.5-397B-A17B, etc.): 1. Auto mode (max_seq_length=0) left effective_ctx at the model's native context when no subset passed the 90% fit check. The UI slider then landed on e.g. 196608 for MiniMax-M2.7, far above anything usable. Default the auto-pick to 4096 so the UI starts at a sane value; the slider ceiling stays at the native context so the user can still opt in to longer contexts and receive the "might be slower" warning. 2. Explicit ctx was silently shrunk when weights fit but the requested KV overflowed the 90% budget. The shrink loop emitted -c <capped> -ngl -1 without informing the caller, so a user who had opted into a longer context via the UI never actually got it. Drop the shrink loop on the explicit path and emit -c <user_ctx> --fit on instead, letting llama-server flex -ngl (CPU layer offload). Adds tests/test_llama_cpp_context_fit.py covering both paths, the file-size-only fallback when KV metadata is missing, non-regression on fittable auto-pick, and platform-agnostic input shape. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-14 08:53:25 -07:00
TF-MTGE	a31c82a640	fix(studio): remove 300s cap on load_checkpoint (inherits 3600s default) (#4922 ) * fix: increase wait response timeout to 900 sec instead of 300 sec. #4845 * Apply suggestion from @gemini-code-assist[bot] good catch Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-04-14 08:53:14 -07:00
Datta Nimmaturi	da78c6be71	[Studio] Install flash attn at setup time for linux (#4979 ) * [Studio] Install flash attn at setup time for linux * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * cleanup changes Signed-off-by: Datta Nimmaturi <venkatadattasainimmaturi@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Test cases * wheel_utils: narrow url_exists exceptions and log at debug level --------- Signed-off-by: Datta Nimmaturi <venkatadattasainimmaturi@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com> Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>	2026-04-14 16:40:17 +04:00
Datta Nimmaturi	dccc0ebada	[Studio] Show non exported models in chat UI (#4892 ) * Show non exported models in chat UI * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Distinguish b/w LoRa and full fine tune saves. Cleanup --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>	2026-04-14 15:03:58 +04:00
Bharath Kumar Adinarayan	a50f61009b	fix(studio): default chart view to full training history (#5007 ) * fix(studio): default chart view to full training history instead of last 80 steps Fixes #5003 * chore: windowsize as null code comment --------- Co-authored-by: imagineer99 <samleejackson0@gmail.com> Co-authored-by: Wasim Yousef Said <wasimysdev@gmail.com>	2026-04-14 03:29:27 -07:00
Lee Jackson	bfa17330bd	Studio: Polish API key copy button and harden async clipboard fallback (#5006 ) * fix: polish clipboard style and fix async clipboard path * Use copyToClipboardAsync in CopyButton for Safari fallback CopyButton was calling navigator.clipboard.writeText directly, bypassing the execCommand fallback added in this same PR. Switch to copyToClipboardAsync which tries execCommand first (Safari user-gesture requirement) then falls back to the async clipboard API. * Fix copyToClipboard sync contract regression and improve async path - Restore copyToClipboard() to return only the execCommand result, preserving the boolean contract that 7 existing callers depend on to gate their "Copied!" UI state. The fire-and-forget async fallback was returning true before the promise resolved, causing false success. - Add document.body null guard to copyWithExecCommand for SSR safety. - Reorder copyToClipboardAsync to try the async Clipboard API first, avoiding unnecessary DOM/focus overhead in Radix focus-trapped dialogs where execCommand always fails anyway. * Restore queryCommandSupported guard and fix async catch path - Restore the queryCommandSupported("copy") guard in copyToClipboard() to match the original contract exactly: when execCommand is entirely unsupported, fall through to fire-and-forget async clipboard write. - Fix copyToClipboardAsync catch block: after navigator.clipboard.writeText rejects, the user-gesture frame is gone, so execCommand will also fail. Return false from catch instead of falling through. The execCommand fallback at the bottom only runs when the Clipboard API is absent (still in user-gesture frame). * Restore execCommand fallback in copyToClipboardAsync catch path The catch block was returning false after clipboard API rejection, based on the incorrect premise that the user-gesture frame is lost after an await. Per the HTML spec, transient user activation IS preserved through promise microtask chains. The real reason execCommand fails in the Radix dialog is the focus trap intercepting textarea.focus(), not gesture loss. For non-dialog callers, execCommand can still succeed after a clipboard rejection. Inside a Radix modal, execCommand returns false harmlessly (focus trap blocks it). * Harden textarea fallback for mobile and continue to async path on failure --------- Co-authored-by: Daniel Han <danielhanchen@gmail.com> Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>	2026-04-14 14:22:14 +04:00
Wasim Yousef Said	97eafd999e	studio: fix api-keys access + refresh (#5005 ) * studio: fix api-keys access + refresh * studio: guard v1 in spa fallback	2026-04-13 23:48:51 +04:00
AdamPlatin123	d2fc582840	studio: skip training status/metrics polling when idle (#4988 ) * fix(studio): skip training status/metrics polling when idle Add an early return in the status and metrics setInterval callbacks when the runtime store reports phase === "idle" and hasHydrated is true. Previously these polls fired unconditionally every 3s/5s, generating unnecessary network traffic and console errors when no training was running. * fix(studio): reduce idle polling to 30s instead of stopping entirely Review feedback (PR #4988): completely stopping polling when idle risks permanent UI desync if hydration fails, and misses out-of-band state changes from other clients. Add a 30s background poll that only fires when idle to recover gracefully. * fix: harden idle status polling around hydration and runtime reset --------- Co-authored-by: AdamPlatin123 <AdamPlatin123@users.noreply.github.com> Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com> Co-authored-by: imagineer99 <samleejackson0@gmail.com>	2026-04-13 12:02:12 -07:00
Daniel Han	9a261aec5f	Studio: Expose openai and anthropic compatible external API end points (#4956 ) * Studio: add API key authentication for programmatic access External users want to hit the Studio API (chat completions with tool calling, training, export, etc.) without going through the browser login flow. This adds sk-unsloth- prefixed API keys that work as a drop-in replacement for JWTs in the Authorization: Bearer header. Backend: - New api_keys table in SQLite (storage.py) - create/list/revoke/validate functions with SHA-256 hashed storage - API key detection in _get_current_subject before the JWT path - POST/GET/DELETE /api/auth/api-keys endpoints on the auth router Frontend: - /api-keys page with create form, one-time key reveal, keys table - API Keys link in desktop and mobile navbar - Route registered with requireAuth guard Zero changes to any existing route handler -- every endpoint that uses Depends(get_current_subject) automatically works with API keys. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Use actual origin in API key usage examples The examples on /api-keys were hardcoded to localhost:8888 which is wrong for remote users. Use window.location.origin so the examples show the correct URL regardless of where the user is connecting from. * Add `unsloth studio run` CLI command for one-liner model serving Adds a `run` subcommand that starts Studio, loads a model, creates an API key, and prints a ready-to-use curl command -- similar to `ollama run` or `vllm serve`. Usage: unsloth studio run -m unsloth/Qwen3-1.7B-GGUF --gguf-variant UD-Q4_K_XL * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add end-to-end tests for `unsloth studio run` and API key usage Tests the 4 usage examples from the API Keys page: 1. curl basic (non-streaming) chat completions 2. curl streaming (SSE) chat completions 3. OpenAI Python SDK streaming completions 4. curl with tools (web_search + python) Also tests --help output, invalid key rejection, and no-key rejection. All 7 tests pass against Qwen3-1.7B-GGUF. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add /v1/completions, /v1/embeddings, /v1/responses endpoints and --parallel support - llama_cpp.py: accept n_parallel param, pass to llama-server --parallel - run.py: plumb llama_parallel_slots through to app.state - inference.py: add /completions and /embeddings as transparent proxies to llama-server, add /responses as application-level endpoint that converts to ChatCompletionRequest; thread n_parallel through load_model - studio.py: set llama_parallel_slots=4 for `unsloth studio run` path * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Make /v1/responses endpoint match OpenAI Responses API format The existing /v1/responses shim returned Chat Completions format, which broke OpenAI SDK clients using openai.responses.create(). This commit replaces the endpoint with a proper implementation that: - Returns `output` array with `output_text` content parts instead of `choices` with `message` - Uses `input_tokens`/`output_tokens` instead of `prompt_tokens`/ `completion_tokens` in usage - Sets `object: "response"` and `id: "resp_..."` - Emits named SSE events for streaming (response.created, response.output_text.delta, response.completed, etc.) - Accepts all OpenAI Responses API fields (tools, store, metadata, previous_response_id) without erroring -- silently ignored - Maps `developer` role to `system` and `input_text`/`input_image` content parts to the internal Chat format Adds Pydantic schemas for request/response models and 23 unit tests covering schema validation, input normalisation, and response format. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Studio: add Anthropic-compatible /v1/messages endpoint (#4981) * Add Anthropic-compatible /v1/messages endpoint with tool support Translate Anthropic Messages API format to/from internal OpenAI format and reuse the existing server-side agentic tool loop. Supports streaming SSE (message_start, content_block_delta, etc.) and non-streaming JSON. Includes offline unit tests and e2e tests in test_studio_run.py. * Add enable_tools, enabled_tools, session_id to /v1/messages endpoint Support the same shorthand as /v1/chat/completions: enable_tools=true with an optional enabled_tools list uses built-in server tools without requiring full Anthropic tool definitions. session_id is passed through for sandbox isolation. max_tokens is now optional. * Strip leaked tool-call XML from Anthropic endpoint content Apply _TOOL_XML_RE to content events in both streaming and non-streaming tool paths, matching the OpenAI endpoint behavior. * Emit custom tool_result SSE event in Anthropic stream Adds a non-standard tool_result event between the tool_use block close and the next text block, so clients can see server-side tool execution results. Anthropic SDKs ignore unknown event types. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Split /v1/messages into server-side and client-side tool paths enable_tools=true runs the existing server-side agentic loop with built-in tools (web_search/python/terminal). A bare tools=[...] field now triggers a client-side pass-through: client-provided tools are forwarded to llama-server and any tool_use output is returned to the caller with stop_reason=tool_use for client execution. This fixes Claude Code (and any Anthropic SDK client) which sends tools=[...] expecting client-side execution but was previously routed through execute_tool() and failing with 'Unknown tool'. Adds AnthropicPassthroughEmitter to convert llama-server OpenAI SSE chunks into Anthropic SSE events, plus unit tests covering text blocks, tool_use blocks, mixed, stop reasons, and usage. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix httpcore GeneratorExit in /v1/messages passthrough stream Explicitly aclose aiter_lines() before the surrounding async with blocks unwind, mirroring the prior fix in external_provider.py (`a41160d3`) and cc757b78's RuntimeError suppression. * Wire stop_sequences through /v1/messages; warn on tool_choice Plumb payload.stop_sequences to all three code paths (server-side tool loop, no-tool plain, client-side passthrough) so Anthropic SDK clients setting stop_sequences get the behavior they expect. The llama_cpp backend already accepted `stop` on both generate_chat_ completion and generate_chat_completion_with_tools; the Anthropic handler simply wasn't passing it. tool_choice remains declared on the request model for Anthropic SDK compatibility (the SDK often sets it by default) but is not yet honored. Log a structured warning on each request carrying a non- null tool_choice so the silent drop is visible to operators. * Wire min_p / repetition_penalty / presence_penalty through /v1/messages Align the Anthropic endpoint's sampling surface with /v1/chat/completions. Adds the three fields as x-unsloth extensions on AnthropicMessagesRequest and threads them through all three code paths: server-side tool loop, no-tool plain, and client-side passthrough. The passthrough builder emits "repeat_penalty" (not "repetition_penalty") because that is llama-server's field name; the backend methods already apply the same rename internally. * Fix block ordering and prev_text reset in non-streaming tool path _anthropic_tool_non_streaming was building the response by appending all tool_use blocks first, then a single concatenated text block at the end — losing generation order and merging pre-tool and post-tool text into one block. It also never reset prev_text between synthesis turns, so the first N characters of each post-tool turn were dropped (where N = length of the prior turn's final cumulative text). Rewrite to build content_blocks incrementally in generation order, matching the streaming emitter's behavior: deltas within a turn are merged into the trailing text block, tool_use blocks interrupt the text sequence, and prev_text is reset on tool_end so turn N+1 diffs against an empty baseline. Caught by gemini-code-assist[bot] review on #4981. * Make test_studio_run.py e2e tests pytest-compatible Add a hybrid session-scoped studio_server fixture in conftest.py that feeds base_url / api_key into the existing e2e test functions. Three invocation modes are now supported: 1. Script mode (unchanged) — python tests/test_studio_run.py 2. Pytest + external server — point at a running instance via UNSLOTH_E2E_BASE_URL / UNSLOTH_E2E_API_KEY env vars, no per-run GGUF load cost 3. Pytest + fixture-managed server — pytest drives _start_server / _kill_server itself via --unsloth-model / --unsloth-gguf-variant, CI-friendly The existing _start_server / _kill_server helpers and main() stay untouched so the script entry point keeps working exactly as before. Test function signatures are unchanged — the (base_url, api_key) parameters now resolve via the new fixtures when running under pytest. * Rename test_studio_run.py -> test_studio_api.py The file is entirely about HTTP API endpoint testing (OpenAI-compatible /v1/chat/completions, Anthropic-compatible /v1/messages, API key auth, plus a CLI --help sanity check on the command that runs the API). None of its tests cover training, export, chat-UI, or internal-Python-API concerns. The old name misleadingly suggested "tests for the unsloth studio run CLI subcommand" — the new name reflects the actual scope. Updates: - git mv the file (rename tracked, history preserved) - Rewrite opening docstring to state the API surface focus and call out what is explicitly out of scope - Update all 4 Usage-block path references to the new filename - LOG_FILE renamed to test_studio_api.log - conftest.py fixture import rewritten from test_studio_run to test_studio_api, plus 7 docstring/comment references updated No functional changes to test logic, signatures, or main(). --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Fix httpcore asyncgen cleanup in /v1/messages and /v1/completions The earlier fix in `985e92a9` was incomplete: it closed aiter_lines() explicitly but still used `async with httpx.AsyncClient()` / `async with client.stream()` inside the generator. When the generator is orphaned (e.g. client disconnects mid-stream and Starlette drops the StreamingResponse iterator without explicitly calling aclose()), Python's asyncgen finalizer runs the cleanup in a DIFFERENT task than the one that originally entered the httpx context managers. The `async with` exits then trigger httpcore's HTTP11ConnectionByteStream .aclose(), which enters anyio.CancelScope.__exit__ with a mismatched task and raises RuntimeError("Attempted to exit cancel scope in a different task"). That error escapes any user-owned try/except because it happens during GC finalization. Replace `async with` with manual client/response lifecycle in both /v1/messages passthrough and /v1/completions proxy. Close the response and client in a finally block wrapped in `try: ... except Exception: pass`. This suppresses RuntimeError (and other Exception subclasses) from the anyio cleanup noise while letting GeneratorExit (a BaseException, not Exception) propagate cleanly so the generator terminates as Python expects. Traceback observed in user report: File ".../httpcore/_async/connection_pool.py", line 404, in __aiter__ yield part RuntimeError: async generator ignored GeneratorExit ... File ".../anyio/_backends/_asyncio.py", line 455, in __exit__ raise RuntimeError( RuntimeError: Attempted to exit cancel scope in a different task * Expand unsloth studio run banner with SDK base URL and more curl examples Add an explicit "OpenAI / Anthropic SDK base URL" line inside the info box so SDK users don't accidentally copy the bare server URL (without /v1) into their OpenAI/Anthropic SDK constructors and hit 404s. Replace the single /v1/chat/completions curl example with three labeled blocks: chat/completions, Anthropic /messages, and OpenAI Responses. The Anthropic example includes max_tokens (Anthropic SDKs require it even though Studio accepts None). All examples derived from a computed sdk_base_url so the /v1 prefix stays in sync if the public path ever changes. * Hash API keys with HMAC-SHA256 + persistent server secret Stores the HMAC secret in a new app_secrets singleton table. Fixes CodeQL py/weak-sensitive-data-hashing alert on storage.py:74-76, 394-395. Refresh tokens stay on plain SHA-256 (unchanged _hash_token) so existing user sessions survive upgrade — API keys are new on this branch so there is no migration. * Use PBKDF2 for API key hashing per CodeQL recommendation HMAC-SHA256 was still flagged by py/weak-sensitive-data-hashing. Switch to hashlib.pbkdf2_hmac, which is in CodeQL's recommended allowlist (Argon2/scrypt/bcrypt/PBKDF2). Persistent server-side salt stays in app_secrets for defense-in-depth. 100k iterations to match auth/hashing.py's password hasher. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com> Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>	2026-04-13 21:08:11 +04:00
Roland Tannous	3bb72a557f	Pin kernels==0.12.1 to avoid huggingface_hub dataclass conflict (#5000 )	2026-04-13 20:42:02 +04:00
Lee Jackson	21a7895959	Studio: Prompt manager, message deletion, and chat UI improvements (#4938 ) * feat(chat): code block styling, delete with Dexie sync, settings sheet polish * style: config save/delete padding fix * fix(studio): centralize dark code-block surface and optimize message sync writes * style: config padding/alignment polish * fix(studio): upsert custom presets without implicit rename-delete * fix settings sheet save state polish * fix settings sheet button widths * fix chat settings presets * fix chat delete sync * fix chat trust remote code flow --------- Co-authored-by: shine1i <wasimysdev@gmail.com>	2026-04-13 16:42:33 +02:00
AdamPlatin123	3b092bcd46	fix(studio): prevent route transition DOM duplication via AnimatePresence (#4987 ) Add mode="wait" and exit={{ opacity: 0 }} to the root AnimatePresence wrapper so outgoing routes fully unmount before incoming routes render. Without this, rapid navigation between Studio/Export/Recipes/Chat caused pages to stack (2x–3x duplication). Co-authored-by: AdamPlatin123 <AdamPlatin123@users.noreply.github.com> Co-authored-by: Wasim Yousef Said <wasimysdev@gmail.com>	2026-04-13 01:38:00 -07:00
Manan Shah	80c12ff1a6	Move gemma4 script (#4994 ) * updating gemma4 script * moving gemma4 script to scripts folder	2026-04-12 23:41:15 -07:00

1 2 3 4 5 ...

5055 commits