* Studio: forward standard OpenAI tools / tool_choice on /v1/responses
Mirrors the /v1/chat/completions client-side tool pass-through from #5099
so clients (OpenAI Codex CLI, OpenAI Python SDK, ...) that target the
Responses API receive structured function_call output items instead of
plain text with tool-call tokens leaking into content.
- ResponsesRequest: type tools/tool_choice properly, add parallel_tool_calls;
accept function_call and function_call_output input items for multi-turn
- Translate flat Responses tool / tool_choice shape to the nested Chat
Completions shape before forwarding to llama-server
- _normalise_responses_input: map function_call_output -> role="tool",
function_call -> assistant tool_calls (preserving call_id)
- Non-streaming: map returned tool_calls -> top-level function_call
output items keyed by call_id
- Streaming: emit response.output_item.added (function_call),
response.function_call_arguments.delta/.done, and response.output_item.done
per tool call while keeping the text message at output_index 0
- Pytest coverage: tools/tool_choice translation, multi-turn input mapping,
non-streaming tool_calls mapping, response round-trip
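A minimal sketch of the flat-to-nested translation listed above, assuming the standard OpenAI request shapes on both sides. The helper names `responses_tool_to_chat` and `responses_tool_choice_to_chat` are hypothetical and are not the functions added by this change.

```python
from typing import Any


def responses_tool_to_chat(tool: dict[str, Any]) -> dict[str, Any]:
    """Nest a flat Responses function tool under a "function" key."""
    if tool.get("type") != "function" or "function" in tool:
        # Non-function tools and already-nested tools pass through unchanged.
        return tool
    fn = {k: v for k, v in tool.items() if k != "type"}
    return {"type": "function", "function": fn}


def responses_tool_choice_to_chat(choice: Any) -> Any:
    """Nest a flat {"type": "function", "name": ...} tool_choice.

    String values such as "auto", "none" and "required" are returned as-is.
    """
    if (
        isinstance(choice, dict)
        and choice.get("type") == "function"
        and "name" in choice
    ):
        return {"type": "function", "function": {"name": choice["name"]}}
    return choice
```

Returning unrecognised values unchanged keeps the sketch conservative: anything it does not understand is left for the upstream server to accept or reject.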
* Studio: merge system messages and close inner stream on /v1/responses
Fixes two issues surfacing when OpenAI Codex CLI drives /v1/responses
against a GGUF with a strict chat template (gpt-oss harmony, Qwen3, ...).
1. "System message must be at the beginning" upstream errors
Codex sends `instructions` AND a `role:"developer"` message in `input`,
producing two separate system-role messages. Strict templates raise
when a second system message exists or when one appears after a user
turn. _normalise_responses_input now hoists all instructions / system /
developer content into a single merged system message at the top of
the Chat Completions message list.
2. "async generator ignored GeneratorExit" / "Attempted to exit cancel
scope in a different task"
_responses_stream consumed the inner chat-completions body_iterator
without an explicit aclose() in a finally block. On client disconnect
(Codex frequently cancels mid-stream), Python 3.13 finalized the inner
async generator on a different task, tripping anyio's cancel-scope
check. Mirrored the same try/finally + aclose pattern used by the
/v1/messages, /v1/chat/completions, and /v1/completions passthroughs.
Tests: hoisting of instructions + developer, developer mid-conversation,
multiple system messages in input, no-system passthrough.
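A sketch of the hoisting step, assuming message content has already been flattened to plain strings. The function name `hoist_system_messages` and the blank-line separator used when merging are assumptions for illustration, not the behaviour of `_normalise_responses_input` itself.

```python
from typing import Any, Optional

SYSTEM_ROLES = {"system", "developer"}


def hoist_system_messages(
    instructions: Optional[str], messages: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Merge instructions plus every system/developer message into one
    leading system message, keeping the other messages in order."""
    parts = [instructions] if instructions else []
    rest = []
    for msg in messages:
        if msg.get("role") in SYSTEM_ROLES:
            content = msg.get("content")
            if isinstance(content, str) and content:
                parts.append(content)
        else:
            rest.append(msg)
    if not parts:
        return rest  # no-system passthrough
    return [{"role": "system", "content": "\n\n".join(parts)}] + rest
```

With this shape a developer message that arrives mid-conversation no longer follows a user turn, which is the condition strict templates reject.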
* Studio: accept Codex multi-turn shapes and fix cross-task stream close on /v1/responses
Two issues observed driving /v1/responses from OpenAI Codex CLI against a
GGUF backend.
1. 422 on every turn after the first
Codex replays prior assistant turns with
`content:[{"type":"output_text","text":...,"annotations":[],"logprobs":[]}]`
and carries forward `reasoning` items (o-series / gpt-5) between turns.
Our `ResponsesContentPart` union only accepted input_text / input_image,
and `ResponsesInputItem` only message / function_call / function_call_output,
so Pydantic failed the whole list and FastAPI returned
`"Input should be a valid string"` against the `str` branch of the
outer union.
- Add `ResponsesOutputTextPart` for assistant-replay content.
- Add `ResponsesUnknownContentPart` and `ResponsesUnknownInputItem`
as permissive catch-alls (drop during normalisation).
- Wire an explicit `Discriminator` so dispatch is deterministic and
the fallthrough reaches the catch-all instead of misreporting via
the outer `Union[str, list[...]]`.
- `_normalise_responses_input` now accepts output_text parts, flattens
single-part assistant text to a plain string (keeps legacy chat
templates happy), and silently drops reasoning / unknown items.
2. "async generator ignored GeneratorExit" / cross-task cancel scope
`_responses_stream` awaited `openai_chat_completions` in the parent
route-handler task, which opens the httpx client for the inner
passthrough on *that* task. The outer `StreamingResponse` then iterates
in a child task, so the asyncgen GC finalises the inner httpcore byte
stream on the child task, tripping anyio's "Attempted to exit cancel
scope in a different task". Move the `await` inside `event_generator`
so the httpx lifecycle stays within the single streaming child task,
and surface any HTTPException as a `response.failed` SSE frame.
Tests: assistant output_text replay, reasoning-item tolerance, unknown
content-part tolerance, end-to-end Codex-shape payload (developer + user +
reasoning + function_call + function_call_output + assistant output_text +
user), and single-part assistant flattening to plain string.
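The input mapping described across these commits can be sketched as follows, using plain dicts instead of the Pydantic models. `normalise_items` is a hypothetical name, and joining multiple text parts with a newline is a simplification made for this example; image parts are not handled here.

```python
from typing import Any

TEXT_PART_TYPES = ("input_text", "output_text")


def normalise_items(items: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Map Responses input items to Chat Completions messages."""
    messages: list[dict[str, Any]] = []
    for item in items:
        kind = item.get("type", "message")  # role-only items count as messages
        if kind == "function_call":
            messages.append({
                "role": "assistant",
                "content": None,
                "tool_calls": [{
                    "id": item["call_id"],  # preserve call_id for pairing
                    "type": "function",
                    "function": {
                        "name": item["name"],
                        "arguments": item.get("arguments", "{}"),
                    },
                }],
            })
        elif kind == "function_call_output":
            messages.append({
                "role": "tool",
                "tool_call_id": item["call_id"],
                "content": str(item.get("output", "")),
            })
        elif kind == "message":
            content = item.get("content")
            if isinstance(content, list):
                # Keep known text parts, drop unknown ones, flatten to a string.
                texts = [
                    part["text"]
                    for part in content
                    if isinstance(part, dict) and part.get("type") in TEXT_PART_TYPES
                ]
                content = "\n".join(texts)
            messages.append({"role": item.get("role", "user"), "content": content})
        # reasoning items and any other unknown item types are dropped
    return messages
```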
* Studio: call llama-server directly from streaming /v1/responses
The previous fix (running the inner await inside event_generator) was not
enough. Wrapping the existing `openai_chat_completions` pass-through still
stacks two async generators: when the outer generator is closed, the
innermost `HTTP11ConnectionByteStream.__aiter__` in httpcore doesn't
receive GeneratorExit before Python's asyncgen GC finalises it in a
sibling task, tripping "Attempted to exit cancel scope in a different
task" and "async generator ignored GeneratorExit" — the same Python 3.13
+ httpcore 1.0.x interaction already seen in PRs #4956, #4981, #5099.
The cure both pass-throughs already use is a single same-task httpx
lifecycle with an explicit `aiter_lines().aclose()` BEFORE `resp.aclose()` /
`client.aclose()` in the generator's finally block.
Apply it at the Responses layer by dropping the wrapper entirely for GGUF:
open httpx, consume `resp.aiter_lines()`, parse `chat.completion.chunk`,
emit Responses SSE events, close everything in finally — all in the
single StreamingResponse child task. Non-GGUF streaming is rejected with
a 400 (wrapping the transformers backend would re-introduce the
double-layer pattern and isn't a Codex-compatible path today anyway).
Also surfaces upstream httpx.RequestError / non-200 as a
`response.failed` SSE frame rather than a dropped stream now that the
request is dispatched after SSE headers have gone out.
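Because the HTTP status line has already been sent by the time the upstream call fails, the error has to travel inside the stream. A rough sketch of building such a frame follows. The helper names and the exact payload fields (`status`, `error.code`, `error.message`) are assumptions modelled on the public Responses streaming events, not a copy of what Studio emits.

```python
import json
from typing import Any


def sse_frame(event: str, payload: dict[str, Any]) -> str:
    """Serialise one Server-Sent Events frame (event name plus JSON data)."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def response_failed_frame(
    response_id: str, message: str, code: str = "server_error"
) -> str:
    """Build a response.failed frame carrying an upstream error message."""
    payload = {
        "type": "response.failed",
        "response": {
            "id": response_id,
            "object": "response",
            "status": "failed",
            "error": {"code": code, "message": message},
        },
    }
    return sse_frame("response.failed", payload)
```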
* Studio: silence benign httpcore asyncgen GC warnings on Python 3.13
The streaming pass-throughs (/v1/chat/completions, /v1/messages,
/v1/responses, /v1/completions) all use the proven #4981 / #5099 pattern
— single-task httpx lifecycle with explicit aiter_lines().aclose() ahead
of resp.aclose() / client.aclose() in the generator's finally block.
That handles our own iterators correctly.
The residual noise ("async generator ignored GeneratorExit" /
"Attempted to exit cancel scope in a different task") comes from an
innermost HTTP11ConnectionByteStream.__aiter__ that httpcore creates
internally inside its pool. We hold no reference to it, so we cannot
aclose it ourselves. Python 3.13's asyncgen GC hook finalises it on the
finaliser task, its aclose path enters an anyio CancelScope shield, and
Python flags the cross-task exit. The response has already been
delivered with a 200 by then — it is purely log noise, not a functional
failure. Same interaction seen in modelcontextprotocol/python-sdk #831,
agno #3556, chainlit #2361, langchain-mcp-adapters #254.
Install a targeted sys.unraisablehook that swallows this specific tuple
— RuntimeError mentioning "cancel scope" or "GeneratorExit" plus an
object repr referencing HTTP11ConnectionByteStream — and defers to the
default hook for every other unraisable. Idempotent; guarded by a
sentinel attribute so repeated imports don't stack filters.
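A minimal sketch of such a filter, assuming only the standard `sys.unraisablehook` interface. The function name and the sentinel attribute name are hypothetical, and the sketch chains to whichever hook was installed before it.

```python
import sys

_SENTINEL = "_httpcore_asyncgen_filter"  # hypothetical attribute name


def install_unraisable_filter() -> None:
    """Swallow the httpcore asyncgen GC noise; defer everything else."""
    current = sys.unraisablehook
    if getattr(current, _SENTINEL, False):
        return  # already installed, do not stack filters
    previous = current

    def hook(unraisable) -> None:
        exc = unraisable.exc_value
        text = str(exc) if exc is not None else ""
        is_noise = (
            isinstance(exc, RuntimeError)
            and ("cancel scope" in text or "GeneratorExit" in text)
            and "HTTP11ConnectionByteStream" in repr(unraisable.object)
        )
        if is_noise:
            return
        previous(unraisable)  # every other unraisable is reported as before

    setattr(hook, _SENTINEL, True)
    sys.unraisablehook = hook
```

Matching on both the exception text and the object repr keeps the filter narrow, so an unrelated RuntimeError raised during finalisation still reaches the log.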
* Chatbox, scroll, and menu fixes
- Fixed chatbox auto-expand height for multi-line text on the compare page
- Fixed chatbox UI to be consistent across compare and new chat
- Fixed scrolling being enabled on pages with no content, which also triggered the scroll-to-bottom button
- Fixed scroll-to-bottom button to only appear after scrolling up a reasonable amount instead of instantly
- Added shutdown studio button to the menu for easier access
- Fixed pop-up menu width to match the user button width
(cherry picked from commit cd4e390dfa84fe311fae79a781b96cc0ef5970a9)
* fix: correct compare scroll viewport and clean up chat composer UI polish
* Dark theme refactor and sidebar/chat UI refinements
- Complete refactoring of dark theme
- Replaced square rounded-corner user profile image with a circular bordered one
- Replaced user profile icon with 'U' initial and renamed label from 'Studio' to 'User'
- Chat bubbles now have a pointy top-right edge
- Sidebar menu tab line color selection is now consistent across all menus
- Tab-selection color animation now also applies to recent chats
- Removed 'Compare' menu autoselect when a compare chat conversation is selected
- Fixed UI consistency in Compare to match New Chat
- Removed sidebar animation and tab line, replaced with rounded selection for consistency
- Further adjustments to sidebar UI
- Further adjustments to compare chat UI
* Fixed sidebar collapse/expand for recent chats and recent runs not being clickable
* Sidebar, fonts, and chat UI refinements
- Replaced logo PNG with real font text for 'unsloth' and 'BETA' label
- Added Hellix font and applied it across menus and UI elements
- Lighter scrollbar in the sidebar compared to other areas of the app
- Adjusted chat font and chat bubble styling
- Adjusted app menu design to stay consistent with the sidebar
- Adjusted text style for 'New Chat' and repositioned content/chatbox
- Adjusted model selector and top area UI
- Fixed footer text from 'LLM's' to 'LLMs'
- Fixed active selection border color incorrectly appearing on page refresh and during general navigation
- Logo now defaults to 'New Chat' when clicked
* Sidebar, model selector, and mobile UI fixes
- Further adjustments to sidebar UI and logo
- Changed right bar icon
- Model selector adjustments
- Collapsed sidebar now matches the content area background
- Adjusted Hellix font spacing across pages
- Fixed sidebar icon overlap on mobile screens
* Adjust sidebar icons
* Fixed compare chat UI and scrolling issues
* Fixed inference settings icon behavior and context info positioning
- Fixed top right inference settings icon to move into sidepanel during expand/collapse, matching left sidebar behavior
- Adjusted context information element positioning
* Fix: textarea overflow in system prompt editor
* Code block redesign, font, and chat bubble adjustments
- Redesigned code block colors and theme
- Changed code block font to Fira Code
- Fixed scrollbar disappearing when expanding/collapsing tool calls in chats
- Adjusted chat bubble background color
* Fix chat bubble background color in dark theme
* fix: restore textarea auto-sizing and scope prompt editor sizing
* fix: add explicit textarea field sizing for prompt editor overflow
* fix: generate chat nonce on click instead of render
* fix: respect training lock on logo navigation
* Refactor compare page dual chat scrolling behavior
* Revert "Refactor compare page dual chat scrolling behavior"
This reverts commit d056ec09f2.
---------
Co-authored-by: sneakr <hauzin@hotmail.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
* export: update GGUF quant list and ordering
* gguf: add Q2_K_L quantize flags for output and embeddings
* export: add live console logs for LoRA export flow
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: stream q2_k_l quantize logs and include subprocess error details
* fix: route Q2_K_L preset to q2_k ftype with q8_0 output+embeddings
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Trashing a thread mid-stream used to delete the Dexie rows while the
model kept generating, because the sidebar has no access to the
@assistant-ui aui context. Expose per-thread cancelRun() through the
chat runtime store and call it from deleteChatItem so trash behaves
like Stop → Trash. Covers compare pairs by cancelling each paired
thread.
Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>
* fix(studio): forward OpenAI tools/tool_choice to llama-server (#4999)
Studio's /v1/chat/completions silently stripped standard OpenAI `tools`
and `tool_choice` fields, so clients using standard function calling
(opencode, Claude Code, Cursor, Continue, ...) never got structured
tool_calls back. Adds a client-side pass-through path mirroring the
existing Anthropic /v1/messages flow: when `tools` is present without
Studio's `enable_tools` shorthand, the request is forwarded to
llama-server verbatim so the client sees native id, finish_reason
("tool_calls"), delta.tool_calls, and accurate usage tokens.
Also wires Anthropic tool_choice forwarding: /v1/messages previously
accepted tool_choice on the request model but silently dropped it with
a warning. Translate the four Anthropic shapes to OpenAI format and
forward them so agentic clients can actually enforce tool use.
- ChatCompletionRequest: add tools, tool_choice, stop; extra="allow"
- ChatMessage: accept role="tool", optional tool_call_id / tool_calls /
name; content is now optional (assistant with only tool_calls)
- routes/inference.py: _openai_passthrough_stream /
_openai_passthrough_non_streaming helpers, routing branch in
openai_chat_completions, vision+tools via content-parts injection
- _build_passthrough_payload: tool_choice parameter (default "auto")
- anthropic_compat: anthropic_tool_choice_to_openai() translator
- tests/test_openai_tool_passthrough.py: Pydantic + translator unit tests
- tests/test_studio_api.py: 5 new E2E tests (non-stream, stream,
multi-turn, OpenAI SDK, Anthropic tool_choice=any regression)
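The four Anthropic `tool_choice` shapes have direct OpenAI counterparts. The sketch below shows that mapping under the assumption that the input is a plain dict. It is an illustration only, so it uses a different name from the real `anthropic_tool_choice_to_openai()` translator, whose exact behaviour is not shown here.

```python
from typing import Any, Optional


def tool_choice_to_openai_sketch(choice: Optional[dict[str, Any]]) -> Any:
    """Translate an Anthropic tool_choice dict to the OpenAI form.

    Returns None when there is nothing recognisable to forward.
    """
    if not choice:
        return None
    kind = choice.get("type")
    if kind == "auto":
        return "auto"
    if kind == "any":
        return "required"  # the model must call some tool
    if kind == "none":
        return "none"
    if kind == "tool" and choice.get("name"):
        # Force one specific tool by name.
        return {"type": "function", "function": {"name": choice["name"]}}
    return None
```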
* fix(studio): surface httpx transport errors from OpenAI passthrough
When the managed llama-server subprocess crashes mid-request, the
async pass-through helpers in routes/inference.py used to return a
bare 500 (non-streaming) or an "An internal error occurred" SSE chunk
(streaming) because _friendly_error only recognized the sync path's
"Lost connection to llama-server" substring -- httpx transport
failures (ConnectError / ReadError / RemoteProtocolError /
ReadTimeout) stringify differently and fell through to the generic
case.
- _friendly_error: map any httpx.RequestError subclass to the same
"Lost connection to the model server" message the sync chat path
emits. Placed before the substring heuristics so the streaming path
automatically picks it up via its existing except Exception catch.
- _openai_passthrough_non_streaming: wrap the httpx.AsyncClient.post
in a try/except httpx.RequestError and re-raise as HTTPException
502 with the friendly detail.
- tests/test_openai_tool_passthrough.py: new TestFriendlyErrorHttpx
class pinning the mapping for ConnectError, ReadError,
RemoteProtocolError, ReadTimeout, and confirming non-httpx paths
(context-size heuristic, generic fallback) are unchanged.
* fix(studio): close aiter_bytes/aiter_lines explicitly in passthroughs
The httpcore asyncgen cleanup fix in 5cedd9a5 is incomplete on Python
3.13 + httpcore 1.0.x: it switched to manual client/response lifecycle
but still used anonymous `async for raw_line in resp.aiter_lines():`
patterns in all three streaming paths. Python's async for does NOT
auto-close the iterator on break/return, so the aiter_lines /
aiter_bytes async generator remains alive, reachable only from the
surrounding coroutine frame. Once `_stream()` returns the frame is
GC'd and the orphaned asyncgen is finalized on a LATER GC pass in a
DIFFERENT asyncio task, where httpcore's
HTTP11ConnectionByteStream.aclose() enters anyio.CancelScope.__exit__
with a mismatched task and prints "Exception ignored in: <async
generator>" / "async generator ignored GeneratorExit" / "Attempted
to exit cancel scope in a different task" to the server log.
A user observed this on /v1/messages after successful (status 200)
requests, with the traceback pointing at
HTTP11ConnectionByteStream.__aiter__ / .aclose inside httpcore.
Fix: save resp.aiter_lines() / resp.aiter_bytes() as a variable and
explicitly `await iter.aclose()` in the finally block BEFORE
resp.aclose() / client.aclose(). This closes the asyncgen inside the
current task's event loop, so the internal httpcore byte stream is
cleaned up before Python's asyncgen GC hook has anything orphaned to
finalize. Each aclose is wrapped in try/except Exception so nested
anyio cleanup noise can't bubble out.
Applied to all three streaming passthrough paths:
- _anthropic_passthrough_stream (/v1/messages client-side tool path)
- _openai_passthrough_stream (/v1/chat/completions client-side tool
path, new in this PR)
- openai_completions (/v1/completions bytes proxy from PR #4956)
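The close-ordering fix can be illustrated without httpx. In the sketch below `fake_lines()` stands in for `resp.aiter_lines()`, and all names are hypothetical. It records which task runs the inner generator's cleanup, to show that an explicit `aclose()` in the outer `finally` keeps that cleanup on the current task.

```python
import asyncio

closed_in = []  # tasks that ran the inner generator's cleanup


async def fake_lines():
    """Stand-in for resp.aiter_lines()."""
    try:
        for i in range(100):
            yield f"data: {i}"
            await asyncio.sleep(0)
    finally:
        closed_in.append(asyncio.current_task())


async def stream():
    lines_iter = fake_lines()  # keep a reference instead of inlining the call
    try:
        async for line in lines_iter:
            yield line
            if line == "data: 2":
                return  # early exit: `async for` does not close lines_iter
    finally:
        try:
            await lines_iter.aclose()  # close on the current task
        except Exception:
            pass
        # resp.aclose() and client.aclose() would follow here


async def main():
    out = [line async for line in stream()]
    return out, closed_in == [asyncio.current_task()]
```

Running `asyncio.run(main())` returns the three lines that were streamed and a flag confirming that the inner generator was closed on the same task that consumed it.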
* fix(studio): default ChatCompletionRequest.stream to false per OpenAI spec
OpenAI's /v1/chat/completions spec defaults `stream` to false, so
clients that omit the field (naive curl, minimal integrations) expect
a single JSON response back. Studio was defaulting to true, silently
switching those clients into SSE and breaking any parser that didn't
also handle streaming. ResponsesRequest and AnthropicMessagesRequest
already default to false correctly; only ChatCompletionRequest was
wrong.
Studio's own frontend always sets `stream` explicitly on every
chat-adapter / chat-api / runtime-provider call site, so the flip has
no UI impact. SDK users (OpenAI Python/JS SDK, opencode, Claude Code,
Cursor, Continue) also always pass `stream` explicitly, so they're
unaffected. The only clients feeling the change are raw-curl users
who were relying on the wrong default -- those get the correct OpenAI
behavior now.
Added a regression test pinning the default so it can't silently
flip back.
* fix(studio): reject images in OpenAI tool passthrough for text-only GGUFs
The new tool passthrough branch runs before _extract_content_parts,
skipping the existing not is_vision guard. Requests combining tools
with an image on a text-only tool-capable GGUF were forwarded to
llama-server, producing opaque upstream errors instead of the
pre-existing clear 400. Restore the guard inline at the dispatch
point, checking both legacy image_base64 and inline image_url parts.
* fix(studio): require tool_call_id on role=tool chat messages
Enforce the OpenAI spec rule that role="tool" messages must carry a
tool_call_id. Without it, upstream backends cannot associate a tool
result with the assistant's prior tool_calls entry and the request
fails in non-obvious ways through the passthrough path. Reject at the
request boundary with a 422 instead.
* fix(studio): harden OpenAI tool passthrough validation and error surfacing
Three related fixes called out by the PR review:
1. Preserve upstream status codes in the streaming passthrough. The
httpx request is now dispatched before the StreamingResponse is
constructed. Non-200 upstream responses and httpx RequestError
transport failures raise HTTPException with the real status
instead of being buried inside a 200 SSE error frame, so OpenAI
SDK clients see APIError/BadRequestError/... as expected.
2. Require non-empty content on user/system/tool messages. Per the
OpenAI spec, content may only be omitted on assistant messages
that carry tool_calls; enforce that at the request boundary so
malformed messages never reach the passthrough path.
3. Role-constrain tool-call metadata. tool_calls is only valid on
role=assistant, tool_call_id and name only on role=tool. Without
this, a user/system message with tool_calls would flip the
passthrough branch on and be forwarded to llama-server, surfacing
as an opaque upstream error.
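The message rules from the last few commits can be read as a single boundary check. This is a sketch over plain dicts that raises `ValueError`. The function name is hypothetical, and the real checks live on the Pydantic `ChatMessage` model, where a failure surfaces as a 422.

```python
from typing import Any


def validate_chat_message(msg: dict[str, Any]) -> None:
    """Raise ValueError when a message breaks the role constraints."""
    role = msg.get("role")
    content = msg.get("content")
    tool_calls = msg.get("tool_calls")

    # tool_calls is only valid on assistant messages.
    if tool_calls and role != "assistant":
        raise ValueError("tool_calls is only valid on role=assistant")

    # tool_call_id and name are only valid on tool messages.
    if role != "tool" and (msg.get("tool_call_id") or msg.get("name")):
        raise ValueError("tool_call_id and name are only valid on role=tool")

    # A tool result must say which call it answers.
    if role == "tool" and not msg.get("tool_call_id"):
        raise ValueError("role=tool requires tool_call_id")

    # Content may be omitted only on an assistant message with tool_calls.
    is_empty = content is None or content == "" or content == []
    if is_empty and not (role == "assistant" and tool_calls):
        raise ValueError(f"role={role} requires non-empty content")
```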
* fix(studio): normalize image mode and passthrough JSON verbatim
Two Gemini-code-assist review findings on PR #5099:
1. Unconditionally convert decoded images to RGB before PNG encoding.
The prior code only handled RGBA, letting CMYK/I/F images crash
at img.save(format="PNG") and surface as opaque 400s. Applied to
both the passthrough helper and the non-passthrough GGUF path
that originally carried this pattern, keeping the two sites in
sync.
2. Return the upstream JSON body as raw bytes via Response rather
than parse-then-re-serialize with JSONResponse. Matches the
passthrough helper's "verbatim" contract and drops a redundant
round-trip.
---------
Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* unsloth gemma4 support files
* some fixes
* Fixing cache.empty() calls (#4813)
* Fixing cache.empty() calls
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Manan Shah <mananshah@Manans-MacBook-Pro.local>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix/gemma4 mlx (#4816)
* Fixing cache.empty() calls
* fixing for mlx versions
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Manan Shah <mananshah@Manans-MacBook-Pro.local>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* removed bidirectional check for 31b (#4839)
Co-authored-by: Manan17 <shahmanan170602@gmail.coml>
* Add Gemma 4 26B MoE support (MLX) (#4844)
* removed bidirectional check for 31b
* Change gemma4_text for moe
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Manan Shah <mananshah@Manans-MacBook-Pro.local>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(gemma4): cast RoPE offset to int before mx.arange() (#4901)
* fix(gemma4): cast RoPE offset to int before mx.arange()
* fix(gemma4): use zero-based arange + offset to avoid CPU-GPU sync
* qwen3.6 patches for multi-turn chat
* qwen3.6 script
* removing unnecessary scripts
* displaying errors for not installed packages
---------
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Co-authored-by: Manan Shah <mananshah@Manans-MacBook-Pro.local>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Manan17 <shahmanan170602@gmail.coml>
Co-authored-by: Théophile Lafargue <138336683+eauchs@users.noreply.github.com>
* Add Qwen3.6 inference defaults for Studio
Add qwen3.6 family entry to inference_defaults.json with the
recommended sampling parameters from Qwen's documentation:
temperature=0.7, top_p=0.8, top_k=20, min_p=0.0,
presence_penalty=1.5, repetition_penalty=1.0.
Without this, Qwen3.6 models fall through to the generic qwen3
pattern which uses different defaults (temperature=0.6,
top_p=0.95, no presence_penalty).
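To illustrate why the dedicated entry matters, here is a sketch of a family lookup where the most specific pattern wins. The dict layout, the substring matching and the function name are assumptions. The real structure of inference_defaults.json and its matching rule are not shown in this commit, and only the values quoted above are taken from it.

```python
from typing import Any

# Hypothetical in-memory view of the defaults; values are from the commit text.
FAMILY_DEFAULTS: dict[str, dict[str, Any]] = {
    "qwen3.6": {
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "min_p": 0.0,
        "presence_penalty": 1.5,
        "repetition_penalty": 1.0,
    },
    "qwen3": {"temperature": 0.6, "top_p": 0.95},
}


def resolve_defaults(model_name: str) -> dict[str, Any]:
    """Return the defaults of the longest family pattern found in the name."""
    name = model_name.lower()
    matches = [pattern for pattern in FAMILY_DEFAULTS if pattern in name]
    if not matches:
        return {}
    return FAMILY_DEFAULTS[max(matches, key=len)]
```

Without the `qwen3.6` entry, a name such as `Qwen3.6-35B-A3B-GGUF` would match only `qwen3` and inherit its sampling settings.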
* Add Qwen3.6-35B-A3B-GGUF to default model lists
* Add Qwen3.5/3.6 presence_penalty to thinking toggle and small-model disable logic
- Thinking toggle (on-load + button click) now sets presencePenalty: 1.5 for
Qwen3.5 and Qwen3.6 models (both thinking-ON and thinking-OFF states)
- Small-model thinking-disable check (<9B defaults to no-thinking) extended
from Qwen3.5-only to also cover Qwen3.6, in all 3 locations:
frontend on-load, frontend refresh, backend llama_cpp.py
* fix: multi-GPU inference crash for bnb 4-bit/8-bit models
When load_in_4bit or load_in_8bit is used with device_map="sequential"
and max_memory constraints that place weights across multiple GPUs (or
entirely on a non-default GPU like cuda:1), the bitsandbytes loading
path in transformers never calls dispatch_model. No AlignDevicesHook is
installed, and the first forward/generate call crashes with:
RuntimeError: Expected all tensors to be on the same device
This adds _attach_bnb_multidevice_hooks() which is called after
from_pretrained returns. It infers a device map from actual parameter
placements and calls dispatch_model(force_hooks=True) to install the
missing hooks. The function is a complete no-op for the common
single-GPU cuda:0 case.
Call sites: FastBaseModel.from_pretrained (vision.py) and
FastLlamaModel.from_pretrained (llama.py).
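The device-map inference step can be sketched without torch. In this example `param_devices` stands in for something like `{name: str(p.device) for name, p in model.named_parameters()}`. Both helper names are hypothetical, and the real `_attach_bnb_multidevice_hooks()` may group modules differently before handing the map to `dispatch_model`.

```python
def infer_device_map(param_devices: dict[str, str]) -> dict[str, str]:
    """Map each parameter's owning module to the device it actually sits on."""
    device_map: dict[str, str] = {}
    for name, device in param_devices.items():
        # "model.layers.0.mlp.weight" -> "model.layers.0.mlp"
        module = name.rsplit(".", 1)[0] if "." in name else ""
        device_map.setdefault(module, device)
    return device_map


def needs_hooks(device_map: dict[str, str], default: str = "cuda:0") -> bool:
    """True unless every module sits on the default device."""
    return set(device_map.values()) != {default}
```

When `needs_hooks` returns False the caller can return early, which corresponds to the no-op behaviour described for the single-GPU cuda:0 case.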
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: align with PR #5053 final review improvements
- Add hook call to the bnb quantized loading branch in llama.py (the
primary load_in_4bit path), not just the non-fast-inference fallback
- Expand bnb detection: also check model.is_loaded_in_4bit,
model.is_loaded_in_8bit, model.quantization_method
- Pass explicit main_device and skip_keys to dispatch_model
- Use logger.info instead of print for the success message
- Use kwargs.get("load_in_8bit", False) at llama.py call sites
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* auth: default to chat
* settings: relaunch onboarding
* onboarding: return to launch page
* studio: stop auto guided tour
* ui: soften global radius
* cleanup: rename onboarding exit prop
* fix onboarding redirect safety
* Show real Unsloth version in settings
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* feat(studio): replace navbar navigation with collapsible sidebar
Add an app-wide sidebar with hover-expand and pin-to-dock behavior.
Navigation items (Studio, Recipes, Export, Chat) move from the center
pill navbar to the sidebar. Chat threads and recipes render as
collapsible sub-lists. Navbar simplified to logo + update + close.
- Extend SidebarProvider with pinned/hovered state model
- New AppSidebar with animated active indicator, sloth profile menu,
theme toggle, guided tour, back/forward navigation
- Chat page refactored to URL-driven view state via search params
- Extract reusable hooks for chat thread and recipe sidebar data
- Guard startViewTransition for browser compatibility
- Wrap chat deletions in Dexie transaction for data integrity
* feat(studio): move logo to sidebar and make navbar overlay
- Sidebar is now full-height with logo in SidebarHeader
- Collapsed sidebar shows sticker.png, expanded shows full logo
- Navbar is absolute-positioned overlay (no layout space)
- Main content extends to top, aligning with navbar controls
* feat(studio): full-height sidebar with recents, edge-to-edge nav buttons
- Sidebar outside max-w-7xl, pinned to left edge
- Remove sidebar rounding, menu buttons rounded-md
- Nav buttons flush to sidebar edges with no left rounding
- Replace collapsible recipes/chat with flat nav items
- Add Recents section with chat history (1 item when not on chat, full on chat)
- New Chat as first nav item with PencilEdit02Icon
- Cursor pointer on all sidebar buttons
- Navbar temporarily hidden for screenshots
* fix(studio): fix chat scroll, action bar hover, collapsible recents
- Fix sticky composer by removing `relative` override on viewport footer
- Action bar buttons only show on hover (autohide=always)
- Remove floating border/shadow from action bar
- Add scroll space above composer for last message actions
- Back/forward buttons use router history (stay in-app)
- Recents section collapsible with chevron on chat route
- Set html/body/#root height for proper h-full chain
* fix(studio): address review feedback, clean up unused code
- Unhide navbar (was left hidden from screenshot)
- Remove unused imports: SidebarMenuSub*, BubbleChatIcon, ColumnInsertIcon
- Remove unused vars: recipeItems, activeRecipeId, canCompare, recipesOpen
- Include compare query id in active sidebar selection
- Use store type for contextUsage instead of inline type
- Simplify noop in sidebar.tsx
- Remove empty className prop
* feat(studio): add mobile sidebar, recent runs section, and misc UX fixes
* feat(studio): scaffold settings feature module with dialog store
* feat(studio): add tri-state theme store for settings
* feat(chat): add clear-all-chats and export-chat-history utils
* feat(studio): add settings dialog shell with tab rail
* feat(studio): add appearance tab with theme and sidebar pin
* feat(studio): add settings general tab with hf token, auto-title, reset prefs
* feat(studio): add settings chat tab with export and clear
* feat(studio): add api keys tab with list and revoke flow
* feat(studio): add create-key form and reveal dialog
* feat(studio): add usage examples panel to api keys tab
* feat(studio): add settings about tab with update and shutdown
* feat(studio): add settings dropdown item and cmd-comma shortcut
* feat(studio): remove legacy api-keys route and chat-sheet preference rows
* fix(studio): settings dialog a11y + polish pass
* feat(studio): inline api key reveal card replacing nested dialog
* fix(studio): hide revoked keys from settings list
* refactor(studio): strip navbar and hoist training unload guard
* feat(studio): explicit sidebar toggle, remove hover-open and pin icons
* fix(studio): use SidebarRight01Icon for collapsed sidebar open toggle
* fix(studio): address code review findings for settings dialog
* feat(studio): collapsible navigate group with standalone new-chat and compare
* fix(studio): chat-only standalone actions, use ColumnInsertIcon for compare
* fix(studio): sidebar new-chat/compare state reset and icon-mode collapsible
* feat(studio): add compact logo assets for sidebar header
* Fixed sidebar design
* fix(studio): sidebar delete icon hover contrast and sizing
* feat(studio): route-gate sidebar recents (chats off /studio, runs on /studio)
* feat(studio): add chat search store
* feat(studio): add chat search index hook with snapshot-on-open
* feat(studio): add chat search command dialog with global shortcut
* feat(studio): wire chat search into sidebar
* fix(studio): trim hf token on save, add show/hide toggle, commit on close
* revert(studio): restore original sidebar/border colors, brighten sidebar
* feat(studio): forward overlayClassName through CommandDialog
* fix(studio): wrap search dialog in Command context, redesign as flat 635px card
* fix(studio): reserve right padding on recent items so delete icon stops overlapping title
* fix(studio): skip hf token unmount-commit during reset-prefs reload
* chore(studio): drop unused icon import and unreachable runs navigate branch
* fix(studio): chat search index filters archived before limit, batches message query, picks up reasoning text
* fix(studio): keep CommandEmpty in tree so empty state renders correctly
* fix(studio): cap system prompt and chat template textareas so they scroll instead of growing
* fix(studio): attach chat-compare tour anchor to sidebar compare button
* fix(studio): persist system theme explicitly so next-themes does not clobber on reload
* fix(studio): auto-switch to history tab when selecting a recent run from sidebar
* UI overhaul: chatbox, scrollbar, sidebar, and compare view
UI Changes:
- Redesigned the Compare UI with general cleanup
- Redesigned the Chatbox UI
- Reduced the width of the user chat bubble for improved readability
- Narrowed the user chat box across the content page
- Adjusted thinking-box text color to be slightly darker
- Removed faded text effect from chat messages
- Removed faded text effect from the thinking box
- Added a small LLM chat safety note at the bottom of the chatbox
- Restyled the scrollbar
Layout & Behavior:
- Reworked the scrollbar to span the full height of the page (no top/bottom padding) and remain persistently visible when content is scrollable, rather than only on hover
- Reworked the Configuration sidebar to span full height — removed rounded corners and borders, with the scrollbar adjusted to match the full top-to-bottom layout
- Adjusted the top menu and bottom chatbox content areas to work correctly with the new full-page scroll behavior
- Made chat content match the chatbox width, with content sliding slightly behind the chatbox when scrolling
- Aligned chat text width with the chatbox for visual consistency, including how far the text extends behind the chatbox
Fixes:
- Fixed the chatbox not auto-expanding when typing multi-line input while bottom-positioned during an active chat (previously only worked before a chat had started)
- Fixed positioning and design of the user chat hover menu buttons to match the assistant chat box — now displayed below the chat bubble instead of on the left side
* Fix user message layout in thread component
* swap code icon
* fix compare layout
* fix compare pane flex
* Sidebar improvements and fixes
- Added scrolling support to the sidebar so menus and recent chats no longer get hidden
- Recent chats are now always visible in the sidebar, not hidden when in Studio, Recipes, or Export
- The selected recent chat is now deselected when navigating to another section
- Fixed sidebar glitch where browser resize could make the sidebar and expand button disappear completely
- Fixed glitch where the open-sidebar hover tooltip appeared above the logo when clicking expand sidebar
- Reduced sidebar width on mobile to around 2/3 of the screen (was too wide)
- Made the close-sidebar hover tooltip consistent with the rest of the design
- Removed sidebar collapse/expand animation
- Small adjustment to chat width
* Fix route scrolling, polling, and theme sync issues
* Fix Studio page scrolling
---------
Co-authored-by: sneakr <hauzin@hotmail.com>
* Studio: Ollama support, recommended folders, Custom Folders UX polish
Backend:
- Add _scan_ollama_dir that reads manifests/registry.ollama.ai/library/*
and creates .gguf symlinks under <ollama_dir>/.studio_links/ pointing
at the content-addressable blobs, so detect_gguf_model and llama-server
-m work unchanged for Ollama models
- Filter entries under .studio_links from the generic models/hf/lmstudio
scanners to avoid duplicate rows and leaked internal paths in the UI
- New GET /api/models/recommended-folders endpoint returning LM Studio
and Ollama model directories that currently exist on the machine
(OLLAMA_MODELS env var + standard paths, ~/.lmstudio/models, legacy
LM Studio cache), used by the Custom Folders quick-add chips
- detect_gguf_model now uses os.path.abspath instead of Path.resolve so
the readable symlink name is preserved as display_name (e.g.
qwen2.5-0.5b-Q4_K_M.gguf instead of sha256-abc...)
- llama-server failure with a path under .studio_links or .cache/ollama
surfaces a friendlier message ("Some Ollama models do not work with
llama.cpp. Try a different model, or use this model directly through
Ollama instead.") instead of the generic validation error
Frontend:
- ListLabel supports an optional leading icon and collapse toggle; used
for Downloaded (download icon), Custom Folders (folder icon), and
Recommended (star icon)
- Custom Folders header gets folder icon on the left, and +, search,
and chevron buttons on the right; chevron uses ml-auto so it aligns
with the Downloaded and Recommended chevrons
- New recommended folder chips render below the registered scan folders
when there are unregistered well-known paths; one click adds them as
a scan folder
- Custom folder rows that are direct .gguf files (Ollama symlinks) load
immediately via onSelect instead of opening the GGUF variant expander
(which is for repos containing multiple quants, not single files)
- When loading a direct .gguf file path, send max_seq_length = 0 so the
backend uses the model's native context instead of the 4096 chat
default (qwen2.5:0.5b now loads at 32768 instead of 4096)
- New listRecommendedFolders() helper on the chat API
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address review: log silent exceptions and support read-only Ollama dirs
Replace silent except blocks in _scan_ollama_dir and the
recommended-folders endpoint with narrower exception types plus debug
or warning logs, so failures are diagnosable without hiding signal.
Add _ollama_links_dir helper that falls back to a per-ollama-dir hashed
namespace under Studio's own cache (~/.unsloth/studio/cache/ollama_links)
when the Ollama models directory is read-only. Common for system installs
at /usr/share/ollama/.ollama/models and /var/lib/ollama/.ollama/models
where the Studio process has read but not write access. Previously the
scanner returned an empty list in that case and Ollama models would
silently not appear.
The fallback preserves the .gguf suffix on symlink names so
detect_gguf_model keeps recognising them. The prior "raw sha256 blob
path" fallback would have missed the suffix check and failed to load.
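A minimal sketch of the read-only fallback described above, not the Studio code itself. `links_dir_for`, its parameters, and the 16-character digest length are hypothetical; only the `.studio_links` name and the `ollama_links` cache directory come from this message.

```python
import hashlib
import os
from pathlib import Path


def links_dir_for(ollama_dir: Path, cache_root: Path) -> Path:
    """Pick a directory where .gguf links for one Ollama store can be created."""
    # Writable store: keep the links next to the models.
    if os.access(ollama_dir, os.W_OK | os.X_OK):
        return ollama_dir / ".studio_links"
    # Read-only (or missing) store: use a per-directory namespace under the
    # cache root, keyed by a hash of the store path so two stores never collide.
    digest = hashlib.sha256(str(ollama_dir).encode("utf-8")).hexdigest()[:16]
    return cache_root / "ollama_links" / digest
```

Because the hash depends only on the store path, repeated scans of the same read-only directory reuse the same link namespace.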
* Address review: detect mmproj next to symlink target for vision GGUFs
Codex P1 on model_config.py:1012: when detect_gguf_model returns the
symlink path (to preserve readable display names), detect_mmproj_file
searched the symlink's parent directory instead of the target's. For
vision GGUFs surfaced via Ollama's .studio_links/ -- where the weight
file is symlinked but any mmproj sidecar lives next to the real blob
-- mmproj was no longer detected, so the model was misclassified as
text-only and llama-server would start without --mmproj.
detect_mmproj_file now adds the resolved target's parent to the scan
order when path is a symlink. Direct (non-symlink) .gguf paths are
unchanged, so LM Studio and HF cache layouts keep working exactly as
before. Verified with a fake layout reproducing the bug plus a
regression check on a non-symlink LM Studio model.
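The scan-order change can be illustrated roughly as below. This is a sketch under assumptions: `mmproj_search_dirs` and `find_mmproj` are hypothetical names, and matching on "mmproj" in the file name is a simplification of whatever detect_mmproj_file actually checks.

```python
from pathlib import Path
from typing import List, Optional


def mmproj_search_dirs(gguf_path: Path) -> List[Path]:
    """Directories to scan for a projector sidecar, in priority order."""
    dirs = [gguf_path.parent]
    if gguf_path.is_symlink():
        # The sidecar may live next to the real blob, not next to the link.
        target_parent = gguf_path.resolve().parent
        if target_parent not in dirs:
            dirs.append(target_parent)
    return dirs


def find_mmproj(gguf_path: Path) -> Optional[Path]:
    """Return the first .gguf file whose name mentions mmproj, if any."""
    for directory in mmproj_search_dirs(gguf_path):
        for candidate in sorted(directory.glob("*.gguf")):
            if "mmproj" in candidate.name.lower():
                return candidate
    return None
```

For a plain (non-symlink) path the list has a single entry, so existing layouts are scanned exactly as before.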
* Address review: support all Ollama namespaces and vision projector layers
- Iterate over all directories under registry.ollama.ai/ instead of
hardcoding the "library" namespace. Custom namespaces like
"mradermacher/llama3" now get scanned and include the namespace
prefix in display names, model IDs, and symlink names to avoid
collisions.
- Create companion -mmproj.gguf symlinks for Ollama vision models
that have an "application/vnd.ollama.image.projector" layer, so
detect_mmproj_file can find the projector alongside the model.
- Extract symlink creation into _make_symlink helper to reduce
duplication between model and projector paths.
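As a rough illustration of the manifest handling, the sketch below maps one manifest to its weight and projector blobs. It assumes the manifest is JSON with a `layers` list of `mediaType` / `digest` entries, that weights use the `application/vnd.ollama.image.model` media type, and that blobs are stored as `sha256-<hex>`; `blobs_for_manifest` is a hypothetical helper, not the scanner's real function.

```python
import json
from pathlib import Path
from typing import Dict, Optional

MODEL_LAYER = "application/vnd.ollama.image.model"  # assumed weights layer type
PROJECTOR_LAYER = "application/vnd.ollama.image.projector"


def blobs_for_manifest(manifest_path: Path, blobs_dir: Path) -> Dict[str, Optional[Path]]:
    """Map one Ollama manifest to its model blob and optional projector blob."""
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    found: Dict[str, Optional[Path]] = {"model": None, "projector": None}
    for layer in manifest.get("layers", []):
        # Digests look like "sha256:<hex>"; blob files are named "sha256-<hex>".
        blob = blobs_dir / layer.get("digest", "").replace(":", "-")
        media_type = layer.get("mediaType")
        if media_type == MODEL_LAYER:
            found["model"] = blob
        elif media_type == PROJECTOR_LAYER:
            found["projector"] = blob
    return found
```

A scanner built on this would link `found["model"]` under a `.gguf` name and, when `found["projector"]` is set, add the companion `-mmproj.gguf` link beside it.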
* Address review: move imports to top level and add scan limit
- Move hashlib and json imports to the top of the file (PEP 8).
- Remove inline `import json as _json` and `import hashlib` from
function bodies, use the top-level imports directly.
- Add `limit` parameter to `_scan_ollama_dir()` with early exit
when the threshold is reached.
- Pass `_MAX_MODELS_PER_FOLDER` into the scanner so it stops
traversing once enough models are found.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address review: Windows fallback, all registry hosts, collision safety
_make_link (formerly _make_symlink):
- Falls back to os.link() hardlink when symlink_to() fails (Windows
without Developer Mode), then to shutil.copy2 as last resort
- Uses atomic os.replace via tmp file to avoid race window where the
.gguf path is missing during rescan
Scanner now handles all Ollama registry layouts:
- Uses rglob over manifests/ instead of hardcoding registry.ollama.ai
- Discovers hf.co/org/repo:tag and any other host, not just library/
- Filenames include a stable sha1 hash of the manifest path to prevent
collisions between models that normalize to the same stem
Per-model subdirectories under .studio_links/:
- Each model's links live in their own hash-keyed subdirectory
- detect_mmproj_file only sees the projector for that specific model,
not siblings from other Ollama models
Friendly Ollama error detection:
- Now also matches ollama_links/ (the read-only fallback cache path)
and model_identifier starting with "ollama/"
Recommended folders:
- Added os.access(R_OK | X_OK) check so unreadable system directories
like /var/lib/ollama/.ollama/models are not advertised as chips
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address review: filter ollama_links from generic scanners
The generic scanners (models_dir, hf_cache, lmstudio) already filter
out .studio_links to avoid duplicate Ollama entries, but missed the
ollama_links fallback cache directory used for read-only Ollama
installs. Add it to the filter.
* Address review: idempotent link creation and path-component filter
_make_link:
- Skip recreation when a valid link/copy already exists (samefile or
matching size check). Prevents blocking the model-list API with
multi-GB copies on repeated scans.
- Use uuid4 instead of os.getpid() for tmp file names to avoid race
conditions from concurrent scans.
- Log cleanup errors instead of silently swallowing them.
Path filter:
- Use os.sep-bounded checks instead of bare substring match to avoid
false positives on paths like "my.studio_links.backup/model.gguf".
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address review: drop copy fallback, targeted glob, robust path filter
_make_link:
- Drop shutil.copy2 fallback -- copying multi-GB GGUFs inside a sync
API request would block the backend. Log a warning and skip the
model when both symlink and hardlink fail.
Scanner:
- Replace rglob("*") with targeted glob patterns (*/*/* and */*/*/*)
to avoid traversing unrelated subdirectories in large custom folders.
Path filter:
- Use Path.parts membership check instead of os.sep substring matching
for robustness across platforms.
Scan limit:
- Skip _scan_ollama_dir when _generic already fills the per-folder cap.
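The Path.parts membership check can be sketched as follows. `is_internal_link_path` and `INTERNAL_LINK_DIRS` are illustrative names; the two directory names are the ones mentioned in this series.

```python
from pathlib import Path

# Directory names that hold Studio-generated links and must not be re-scanned.
INTERNAL_LINK_DIRS = {".studio_links", "ollama_links"}


def is_internal_link_path(path: str) -> bool:
    """True only when a whole path component matches, never on a substring."""
    return any(part in INTERNAL_LINK_DIRS for part in Path(path).parts)
```

Comparing whole components is what keeps a folder such as "my.studio_links.backup" from being filtered out by accident.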
* Address review: sha256, top-level uuid import, Path.absolute()
- Switch hashlib.sha1 to hashlib.sha256 for path hashing consistency.
- Move uuid import to the top of the file instead of inside _make_link.
- Replace os.path.abspath with Path.absolute() in detect_gguf_model
to match the pathlib style used throughout the codebase.
* Address review: fix stale comments (sha1, rglob, copy fallback)
Update three docstrings/comments that still referenced the old
implementation after recent changes:
- sha1 comment now says "not a security boundary" (no hash name)
- "rglob" -> "targeted glob patterns"
- "file copies as a last resort" -> removed (copy fallback was dropped)
* Address review: fix stale links, support all manifest depths, scope error
_make_link:
- Drop size-based idempotency shortcut that kept stale links after
ollama pull updates a tag to a same-sized blob. Only samefile()
is used now -- if the link doesn't point at the exact same inode,
it gets replaced.
Scanner:
- Revert targeted glob back to rglob so deeper OCI-style repo names
(5+ path segments) are not silently skipped.
Ollama error:
- Only show "Some Ollama models do not work with llama.cpp" when the
server output contains GGUF compatibility hints (key not found,
unknown architecture, failed to load). Unrelated failures like
OOM or missing binaries now show the generic error instead of
being misdiagnosed.
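Taken together, the _make_link behavior after this series (samefile-only idempotency, symlink then hardlink, atomic swap through a uuid-named temp file, no copy fallback) might look roughly like the sketch below. `make_link` is a simplified stand-in: it omits the logging the real helper does, and its signature is an assumption.

```python
import os
import uuid
from pathlib import Path


def make_link(target: Path, link: Path) -> bool:
    """Point `link` at `target`; return False when no link could be created."""
    try:
        if link.exists() and os.path.samefile(link, target):
            return True  # already points at the same file, nothing to do
    except OSError:
        pass  # unreadable or stale link: fall through and replace it
    link.parent.mkdir(parents=True, exist_ok=True)
    # Unique temp name so concurrent scans never share a temp path.
    tmp = link.with_name(f".{link.name}.{uuid.uuid4().hex}.tmp")
    try:
        try:
            tmp.symlink_to(target)
        except OSError:
            os.link(target, tmp)  # hardlink when symlinks are not permitted
        os.replace(tmp, link)  # atomic swap: `link` is never missing mid-scan
        return True
    except OSError:
        tmp.unlink(missing_ok=True)
        return False  # caller skips the model; no multi-GB copy is attempted
```

Because only samefile() short-circuits, a tag that was re-pulled to a different blob of the same size still gets its link replaced.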
---------
Co-authored-by: Daniel Han <info@unsloth.ai>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: danielhanchen <michaelhan2050@gmail.com>
* Fix review findings for PR #49
1. Sandbox fallback Jinja env in _VariantTokenizerProxy.apply_chat_template
(use SandboxedEnvironment, matching _derive_assistant_prefix_by_render)
2. Unwrap benign outer-If guards in _template_ends_with_toplevel_for so
templates like {% if messages %}{% for ... %}{% endfor %}{% endif %}
are still repairable (preserves Qwen3-Guard rejection via else-branch
and add_generation_prompt-name checks)
3. Preserve raw name_or_path in _VariantTokenizerProxy._source_path so
local-path detection works for dict/list variant tokenizers
4. Context-aware strict-mode messages: omit "will still load" and
"Set UNSLOTH_STRICT_CHAT_TEMPLATE=1" when already raising
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Older installers persisted the venv Scripts directory directly in the
User PATH registry. The shim approach from #4961 no longer writes that
entry, but on upgrade the old one survived and python.exe / pip.exe
from the unsloth venv continued winning resolution in every new shell.
Before creating the shim, read the current User PATH, filter out any
entry matching $VenvDir\Scripts (using the same symmetric raw+expanded
comparison as Add-ToUserPath), and write back if changed. No-op on
fresh installs where the legacy entry was never written.
Confirmed on a real Windows machine: `where.exe python` was returning
the venv interpreter first even after the shim PR merged.
Older installers persisted the venv Scripts directory directly in the
User PATH registry. The shim approach (added in this PR) no longer writes
that entry, but it also did not remove the old one. On upgrade, the
legacy entry survived and python.exe / pip.exe from the unsloth venv
continued winning resolution in every new shell, which is exactly the
hijack the shim was designed to prevent.
Before creating the shim, read the current User PATH, filter out any
entry matching $VenvDir\Scripts (using the same symmetric raw+expanded
comparison as Add-ToUserPath), and write back if changed. This runs
once per install and is a no-op on fresh installs where the legacy
entry was never written.
* Restrict flash attn to <=256 head dim. Consolidate attn impl checks
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Consolidate the changes into single function
* safeguard for dict instead of object
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Chat-template repair: warn-by-default, AST classification, dict support
Follow-up hardening on top of PR #4426 (which fixed the #4150
RuntimeError for ChatML LoRA reloads).
Behavior changes:
- Warn-by-default instead of RuntimeError. When fix_chat_template cannot
repair a broken template, emit a warning and return the original.
Set UNSLOTH_STRICT_CHAT_TEMPLATE=1 to restore the pre-warn hard fail.
Fixes the UX where a missing `{% if add_generation_prompt %}` block on
a saved LoRA (typical after LlamaFactory / Axolotl re-serialize) would
block model loading entirely.
- Local path vs HF hub distinguished in the warning message. For local
paths the message points at the likely downstream tool; for HF IDs it
points at the upstream model maintainers. Previously both said "file a
bug report to the maintainers of <path>" even when <path> was the
user's own saves/ directory.
- Dict / list chat_template now handled. Hermes-3 ships with
{default, tool_use} and the previous code crashed with
AttributeError: 'dict' object has no attribute 'find' when entering
_fix_chat_template with a dict. Each variant is now fixed
independently; structure is preserved.
Internals:
- _find_end_position now matches all four Jinja whitespace-control
variants ({% %}, {%- %}, {% -%}, {%- -%}) and returns the rightmost
endfor/endif so multi-for templates aren't locked onto the first loop.
Previously {%- endfor -%} (both-side dash, used by Qwen3-Guard) was
silently bypassed.
- _has_add_generation_prompt_block uses Jinja AST via
jinja2.nodes.If/Name walks instead of substring matching, so
templates that hide the block behind comments or dash-style variants
are classified correctly.
- _template_ends_with_toplevel_for gates the GH#4150 ChatML repair on
the AST: only fires when the last structural top-level node is a For
(standard ChatML shape), ignoring trailing pure-whitespace output
nodes. Templates wrapped in an outer If (Qwen3-Guard) are now
explicitly skipped at the _fix_chat_template level as well, not just
at load_correct_tokenizer's name-based exemption.
- _validate_patched_template renders the patched template with and
without add_generation_prompt and confirms the patched output
responds to the flag by appending (not replacing) content. If
validation fails, the patch is discarded and we fall through to the
warn path.
Verified with an expanded regression suite in tests/:
- test_fix_chat_template_pr4426.py: 42/42 template-matrix cells
- test_load_correct_tokenizer_pr4426.py: 5/5 tokenizer loads
- test_chat_template_followups.py: 10/10 new follow-up tests
- test_mistral_pr4426.py: 5 Mistral variants byte-identical
- test_qwen_pr4426.py: 14 Qwen variants byte-identical
(Qwen1.5, Qwen2, Qwen2.5-Instruct/Coder/Math/VL, Qwen3,
Qwen3-Coder, QwQ, Qwen3-Guard-Gen)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Guard _validate_patched_template against read-only chat_template
If tokenizer.chat_template is a property or otherwise read-only, the
validation helper would crash with AttributeError when trying to
temporarily set the patched template. Catch the assignment failure and
return False (skip validation), and best-effort restore in the finally
block.
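A simplified sketch of the validation contract, including the read-only guard. `validate_patched` and the `render(tokenizer, add_generation_prompt)` callable are hypothetical stand-ins for the real helper and for apply_chat_template.

```python
from typing import Any, Callable


def validate_patched(tokenizer: Any, patched: str,
                     render: Callable[[Any, bool], str]) -> bool:
    """Accept a patch only if add_generation_prompt appends to the output."""
    original = getattr(tokenizer, "chat_template", None)
    try:
        try:
            tokenizer.chat_template = patched
        except AttributeError:
            return False  # read-only template: skip validation
        without = render(tokenizer, False)
        with_prompt = render(tokenizer, True)
        # The flag must add a suffix, not rewrite what was already rendered.
        return with_prompt.startswith(without) and len(with_prompt) > len(without)
    except Exception:
        return False  # any render failure means the patch is discarded
    finally:
        try:
            tokenizer.chat_template = original  # best-effort restore
        except AttributeError:
            pass
```

A False result here corresponds to discarding the patch and falling through to the warn path.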
* Replace regex separator inference with render-diff; broaden repair to non-ChatML templates
The previous `_infer_assistant_separator` was a four-tier regex heuristic that
only worked on ChatML-shaped templates and forced a hard `<|im_start|>` /
`<|im_end|>` presence gate on Case 2 repair. This meant a Llama-3, Gemma, or
Phi-3 template stripped of its generation-prompt block by a downstream tool
(LlamaFactory, Axolotl, etc.) would still warn-and-return even though the
structural shape is identical to the ChatML case the PR already handles.
This replaces the regex with `_derive_assistant_prefix_by_render`: render the
template with two dialogs that differ only in assistant content, then
`os.path.commonprefix` on the tails captures the exact assistant-turn prefix
the template emits. The template itself is ground truth, so non-ChatML shapes
work as long as the assistant block is a literal the template emits once per
message.
Three guards keep the derivation safe:
A. both assistant renders extend the base render (no reordering);
B. the divergence point is exactly the content-insertion site (sentinel
follows the common prefix);
C. a user-role cross-check: if a render with a user sentinel also emits
the same prefix, role has no effect on output and we reject. A render
failure on [user, user] (e.g. Gemma's `raise_exception` alternation
check) is evidence that role matters; we accept.
Sentinels differ at character 0 so `commonprefix` cannot absorb them, and
trailing whitespace/comments after the last `{% endfor %}` are stripped
before probing (they would appear in base but not after the appended
assistant turn and break Guard A).
`_fix_chat_template` and `_repair_string_template` now thread an
`is_sharegpt` kwarg; `_fix_chat_template` retries once with
`is_sharegpt=True` if the first probe returns None (dual-probe fallback
for dict/list callers).
The ChatML `<|im_start|>` / `<|im_end|>` hard gate in Case 2 is dropped.
`_infer_assistant_separator` is deleted.
Verified via:
- tests/test_fix_chat_template_pr4426.py: 51/51 cells (new Llama-3,
Gemma, Phi-3 broken-template rows all repair FIX-OK)
- tests/test_load_correct_tokenizer_pr4426.py: 5/5
- tests/test_chat_template_followups.py: 18/18 (T11-T18 cover
non-ChatML repair + probe failure modes)
- tests/test_mistral_pr4426.py: 5/5 byte-identical
- tests/test_qwen_pr4426.py: 14/14 byte-identical (Qwen3-Guard AST
gate still rejects)
- tests/hermes3_lora_pr4426.py reload: patched template ends with
`<|im_start|>assistant\n`, inference returns sensible output.
- temp/sim/battery.py: 79/79 followup; vs baseline: 0 regressions,
9 improvements.
- Spot-check probe on real stripped tokenizers (Hermes-3, Phi-4,
Llama-3.2-1B, Gemma-3-1B): all derive the expected prefix.
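The render-diff idea and its three guards can be sketched without any template engine by passing in a render callable. This is an illustration only: `derive_assistant_prefix`, the sentinel strings, and the `render(messages)` interface are assumptions, and the pre-probe stripping of trailing whitespace/comments is left out.

```python
import os
from typing import Callable, Dict, List, Optional

Messages = List[Dict[str, str]]


def derive_assistant_prefix(render: Callable[[Messages], str]) -> Optional[str]:
    """Derive the text a template emits right before assistant content."""
    base_msgs = [{"role": "user", "content": "hi"}]
    sent_a, sent_b = "A_PROBE", "B_PROBE"  # differ at character 0
    try:
        base = render(base_msgs)
        out_a = render(base_msgs + [{"role": "assistant", "content": sent_a}])
        out_b = render(base_msgs + [{"role": "assistant", "content": sent_b}])
    except Exception:
        return None
    # Guard A: both assistant renders must extend the base render.
    if not (out_a.startswith(base) and out_b.startswith(base)):
        return None
    tail_a, tail_b = out_a[len(base):], out_b[len(base):]
    prefix = os.path.commonprefix([tail_a, tail_b])
    # Guard B: the sentinel must follow the common prefix immediately.
    if not prefix or not tail_a[len(prefix):].startswith(sent_a):
        return None
    if not tail_b[len(prefix):].startswith(sent_b):
        return None
    # Guard C: a user turn must not produce the same prefix.
    try:
        out_u = render(base_msgs + [{"role": "user", "content": sent_a}])
    except Exception:
        return prefix  # a render failure on [user, user] shows role matters
    if out_u.startswith(base) and out_u[len(base):].startswith(prefix + sent_a):
        return None
    return prefix
```

In the real helper the callable would be a sandboxed Jinja render of the tokenizer's template; here any function from a message list to a string works.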
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address reviewer findings: variant routing, positive-gate detection, comment-safe end scan
Resolves three reviewer findings on PR #5049 (`fix/chat-template-followups`):
Finding #1 [10/10]: dict/list variants now route through
`_fix_chat_template_for_tokenizer` via a new `_VariantTokenizerProxy`
adapter. Previously the dict/list branches called `_fix_chat_template`
directly, silently bypassing the warn/strict (`UNSLOTH_STRICT_CHAT_TEMPLATE`)
contract, the `no == yes` diagnostic, broken-existing-block detection,
and `_validate_patched_template` guard. The proxy swaps
`base.chat_template` to the variant string before each
`apply_chat_template` call so tokenizer globals (`bos_token`, custom
filters, `raise_exception`) remain available; if the base is read-only
it falls back to isolated Jinja rendering.
Finding #2 [1/10]: `_has_add_generation_prompt_block` now requires the
`If` body to contain at least one `Output` node (a new
`_if_body_emits_content` helper walks descendants). This distinguishes a
real generation-prompt block from a header guard like
`{% if not add_generation_prompt is defined %}{% set ... %}{% endif %}`
(body contains only `Assign`) which references the name but emits
nothing. Also dropped a now-redundant `"add_generation_prompt" not in
scrubbed` guard in `_fix_chat_template` Case 2 so header-guarded
templates still get repaired.
Finding #4 [1/10]: `_find_end_position` now replaces Jinja comments with
equal-length whitespace before scanning for `{% endfor %}` / `{% endif %}`
tokens. This prevents a trailing comment containing those tokens from
being picked as the real end tag. Positions in the padded string map 1:1
to positions in the original template.
Tests:
- tests/test_chat_template_followups.py: 21/21 (T19 strict-mode
dict variant, T20 header-guard repair, T21 comment-endfor trap
added; T4/T5 stubs updated with a working apply_chat_template
that routes through Jinja).
- tests/test_fix_chat_template_pr4426.py: 51/51 cells unchanged.
- tests/test_load_correct_tokenizer_pr4426.py: 5/5.
- tests/test_mistral_pr4426.py: 5/5 byte-identical.
- tests/test_qwen_pr4426.py: 14/14 byte-identical.
- temp/sim/battery.py: 79/79 followup; 0 regressions vs baseline.
- Phase 3 Hermes-3 broken-LoRA reload: inference still returns
`'The answer to the equation 2+2 is 4.'`.
- Spot-checks on Hermes-3 / Phi-4 / Llama-3.2-1B / Gemma-3-1B real
stripped templates: probe still derives the expected prefix.
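The comment-safe end scan from Finding #4 can be illustrated with two small regexes. The patterns and the return convention (end offset of the rightmost tag, or -1) are assumptions made for this sketch; the real `_find_end_position` may differ.

```python
import re

_COMMENT = re.compile(r"\{#.*?#\}", re.DOTALL)
# Matches all four whitespace-control variants of endfor / endif.
_END_TAG = re.compile(r"\{%-?\s*(?:endfor|endif)\s*-?%\}")


def mask_comments(template: str) -> str:
    """Blank out Jinja comments with spaces so every offset stays valid."""
    return _COMMENT.sub(lambda m: " " * len(m.group(0)), template)


def find_end_position(template: str) -> int:
    """End offset of the rightmost real endfor/endif tag, or -1 if none."""
    last = -1
    for match in _END_TAG.finditer(mask_comments(template)):
        last = match.end()
    return last
```

Since the masked string has the same length as the original, the returned offset can be used directly to slice the unmasked template.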
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Tighten comments in chat-template helpers
Pure comment minimization across `_find_end_position`,
`_has_add_generation_prompt_block`, `_if_body_emits_content`,
`_derive_assistant_prefix_by_render`, `_fix_chat_template` Case 2,
and `_VariantTokenizerProxy`. No behavior change; same intent,
fewer lines. All 21 follow-up tests and the 51-cell Phase 1 matrix
still pass.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Sandbox probe, fix is_sharegpt validator mismatch, reject negated gates
Three real bugs from the 10-agent Opus review:
1. Probe now uses `jinja2.sandbox.SandboxedEnvironment` instead of bare
`jinja2.Environment`. The probe renders at model-load time (before
the user calls `apply_chat_template`), so it was a new eager
code-execution surface that the base HF tokenizer loading does not
have. SandboxedEnvironment blocks attribute-chain exploits at
negligible cost.
2. `_repair_string_template` now tries validation with both
`is_sharegpt=False` and `is_sharegpt=True`. Previously, when
`_fix_chat_template` internally fell back to the other schema via
its dual-probe, the outer validation still used the caller's
original `is_sharegpt` -- rendering with the wrong message keys and
spuriously dropping a valid repair.
3. `_has_add_generation_prompt_block` now skips `If` nodes whose test
is a `Not` expression. A negated gate like
`{% if not add_generation_prompt %}{{ x }}{% endif %}` fires when
agp=False, so its emitting body is not a generation block -- but the
old code counted any Name reference regardless of polarity.
Cleanup: removed unused `self._label`, added `\r` escape in
generation-block literal, switched variant labels to `!r` formatting,
removed redundant `import os as _os`.
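The AST rules accumulated so far (name referenced in the test, non-negated gate, body that emits output) can be approximated as below. This is a sketch assuming Jinja2 is installed; `has_generation_prompt_block` is a simplified stand-in for `_has_add_generation_prompt_block` and only inspects the `If` body, not `elif`/`else` branches.

```python
from jinja2 import nodes
from jinja2.sandbox import SandboxedEnvironment


def _emits_output(body) -> bool:
    """True if any node in the body is, or contains, an Output node."""
    for node in body:
        if isinstance(node, nodes.Output):
            return True
        if next(node.find_all(nodes.Output), None) is not None:
            return True
    return False


def has_generation_prompt_block(template: str) -> bool:
    ast = SandboxedEnvironment().parse(template)  # parse only, nothing is rendered
    for if_node in ast.find_all(nodes.If):
        test = if_node.test
        if isinstance(test, nodes.Not):
            continue  # negated gate fires when the flag is False
        names = {n.name for n in test.find_all(nodes.Name)}
        if isinstance(test, nodes.Name):
            names.add(test.name)  # find_all does not yield the node itself
        if "add_generation_prompt" in names and _emits_output(if_node.body):
            return True
    return False
```

Working on the parsed tree is what makes the check indifferent to dash-style tags and to blocks that only appear inside comments.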
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix jinja2.sandbox import and sandbox proxy fallback
Two critical findings from the 20-reviewer pass:
1. [20/20] The proxy read-only fallback used bare `jinja2.Environment`,
not sandboxed. All 20 reviewers independently reproduced marker-file
creation via `cycler.__init__.__globals__['os'].system(...)` during
`fix_chat_template()`. Fixed: fallback now uses
`from jinja2.sandbox import SandboxedEnvironment`.
2. [14/20] The render-diff probe did `import jinja2` then referenced
`jinja2.sandbox.SandboxedEnvironment`. `jinja2.sandbox` is a
submodule that is NOT auto-imported by `import jinja2` on Jinja 3.1.6.
This caused `AttributeError` (swallowed by `except Exception`),
making the entire Case 2 repair path silently return None in a clean
process. The 6 reviewers who saw it work had `jinja2.sandbox`
pre-imported by an earlier module in their process. Fixed: both the
probe and the proxy fallback now use
`from jinja2.sandbox import SandboxedEnvironment`.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Reduce inline comments from ~160 lines to ~25 across both files.
Keep one-line summaries of the "why"; drop multi-paragraph rationale
blocks that repeated information already captured in commit messages
and PR discussion.
* fix: replacing SetEnvironmentVariable with direct registry API
* apply reviews
* Use CreateSubKey for HKCU\Environment
* Store PATH backup under HKCU\Software\Unsloth
* Fix $backupKey registry handle leak in PATH backup block
Wrap $backupKey operations in try/finally so the handle is closed even
if GetValue or SetValue throws. The Add-ToUserPath helper already uses
this pattern for its registry key -- the backup block was the only
place missing it.
* Isolate WM_SETTINGCHANGE broadcast from PATH write error handling
Wrap the broadcast dummy-variable calls in their own try/catch so a
broadcast failure does not mask a successful registry PATH write.
Previously, if SetEnvironmentVariable threw after SetValue already
committed the new PATH, Add-ToUserPath would return $false and the
caller would skip Refresh-SessionPath.
* PATH helper polish: venv precedence, quoted entries, raw/expanded dedup
Three small follow-ups surfaced by a 10-reviewer pass against the rebased
PR head. None fix a regression vs main; each strictly improves the new
helpers.
Refresh-SessionPath / Refresh-Environment:
- Move $env:Path to the front of the merge so an activated venv keeps
precedence over machine/user PATH after a refresh. Pre-PR dropped
process-only entries entirely; post-PR kept them but at the back.
- Dedup on both raw and expanded forms so %USERPROFILE%\foo and the
already-expanded C:\Users\me\foo do not both survive.
Add-ToUserPath:
- Trim whitespace and surrounding double-quotes from each compared entry
so quoted PATH entries like "C:\Program Files\CMake\bin" deduplicate
against an unquoted directory of the same path.
* Back up User PATH inside Add-ToUserPath, before first mutation
Previously only studio/setup.ps1 took a one-time PATH backup, at script
top (line ~547). install.ps1 (the irm | iex entry point) had no backup,
so users who installed via that path had no recovery surface if anything
clobbered their PATH. The PR description's "one-time backup before any
modifications" promise only held for the studio installer flow.
Move the backup into Add-ToUserPath itself: just before the first actual
SetValue mutation, write the pristine raw PATH to
HKCU\Software\Unsloth\PathBackup if no backup already exists. This:
- Covers both entry points (install.ps1 and studio/setup.ps1).
- Captures the TRUE pristine PATH even when install.ps1 runs first and
studio/setup.ps1 runs afterwards (the script-top backup in setup.ps1
would otherwise see an already-modified PATH).
- Is idempotent: once a backup exists, subsequent calls preserve it.
- Skips when nothing would mutate (dedup match) or PATH is empty.
The script-top backup in studio/setup.ps1 is kept for defense in depth.
* Refresh PATH: venv-aware merge order
Reconcile two competing concerns about Refresh-SessionPath /
Refresh-Environment surfaced by separate review rounds:
- venv at the back -> activated venv loses precedence to system Python
- process at the front -> stale shims (old node, old python, etc.)
still on $env:Path can beat a freshly installed tool
New merge order:
1. Activated venv Scripts dir, only if $env:VIRTUAL_ENV is set
2. Machine PATH freshly read from registry
3. User PATH freshly read from registry
4. Current $env:Path as fallback
This way an explicitly-activated venv keeps priority while a tool the
script just installed wins over any stale entry that was already on
the inherited shell PATH. When no venv is active, fresh registry
entries take precedence as expected.
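The merge order reads as follows when written out in Python for illustration (the real helpers are PowerShell). `merge_session_path` and its parameters are hypothetical, and the dedup key here is a simple case-insensitive comparison rather than the raw+expanded comparison used by the scripts.

```python
from typing import List, Optional


def merge_session_path(venv_scripts: Optional[str], machine: str,
                       user: str, process: str) -> str:
    """Merge PATH sources in priority order, keeping the first occurrence."""
    ordered: List[str] = []
    if venv_scripts:  # 1. only when a venv is explicitly activated
        ordered.append(venv_scripts)
    # 2. machine, 3. user (both fresh from the registry), 4. inherited process PATH
    for source in (machine, user, process):
        ordered.extend(source.split(";"))
    seen = set()
    merged: List[str] = []
    for entry in ordered:
        key = entry.strip().rstrip("\\").casefold()
        if not key or key in seen:
            continue
        seen.add(key)
        merged.append(entry.strip())
    return ";".join(merged)
```

A stale entry that only exists on the inherited process PATH survives, but always behind anything read fresh from the registry.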
* Append to User PATH by default, close $envKey in finally
Add-ToUserPath gains a -Position Append|Prepend parameter defaulting to
Append so installing unsloth no longer prepends the bundled venv Scripts
directory ahead of the user's existing python / pip on new shells. The
four current call sites (install.ps1 launcher, studio/setup.ps1 CMake,
nvcc, Python user Scripts) all take the Append default because each one
that needs in-session precedence already does an inline $env:Path prepend
independently. This matches rustup / cargo / nvm / pyenv / uv behavior.
Also wrap the script-top $envKey.GetValue in a try/finally so the
registry handle is released even if the read throws. Matches the pattern
already used for $backupKey five lines below.
* Prepend cmake, nvcc, Python Scripts; keep venv Scripts appended
The previous commit switched Add-ToUserPath to append by default so that
installing unsloth would not silently hijack the user's system python /
pip. That was correct for the venv Scripts dir (which contains python.exe
and pip.exe alongside unsloth.exe), but wrong for the three studio/setup
call sites. Those persist cmake, the driver-compatible nvcc, and the
Python user Scripts dir for future shells, and in all three cases an
older tool already earlier in the user PATH would keep winning after the
install finished. The nvcc case is especially load-bearing: setup selects
a driver-compatible CUDA toolkit, then llama.cpp builds against whatever
wins PATH resolution, so a stale older nvcc produces broken builds.
Pass -Position 'Prepend' explicitly at the three setup.ps1 call sites
(cmake at line 754, nvcc bin at line 1025, Python user Scripts at line
1191). None of those directories holds python.exe, so prepending them
does not re-introduce the original hijack problem. Leave the install.ps1
venv Scripts call on the default Append with a comment explaining why.
* Symmetric dedup, Prepend reorders duplicates, unsloth shim dir
Address three separate findings surfaced by review:
1. Dedup asymmetry (Gemini high-priority): the existing dedup expanded
registry entries via ExpandEnvironmentVariables but did NOT expand the
new directory. Passing "%USERPROFILE%\foo" when "C:\Users\me\foo" was
already in PATH produced a duplicate. Expand both sides so the check
is symmetric.
2. -Position Prepend no-op on existing duplicates: the dedup loop
returned $false as soon as it saw a match, regardless of position.
That left a late-position duplicate in place instead of moving it to
the front, so "prepend the newly selected cmake/nvcc" did not always
beat an older copy earlier in PATH. Partition entries into kept and
dropped lists, then reinsert a single copy at the requested position.
Append still returns $false on any match so user-curated orderings
are not reshuffled. Prepend also returns $false when the only copy
is already at position 0 so we preserve the user's casing.
3. Stop adding the venv Scripts dir to User PATH entirely. That dir
holds python.exe and pip.exe alongside unsloth.exe, so neither
Prepend nor Append worked: prepend hijacked the user's system python
and pip, append made the freshly-installed unsloth.exe lose to any
older unsloth.exe earlier on PATH. Replace the Scripts-dir PATH add
with a dedicated shim directory that contains only unsloth.cmd, and
prepend that dir. The shim calls the venv's unsloth.exe by absolute
path so future pip upgrades inside the venv propagate automatically.
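The dedup and reorder rules from findings 1 and 2 can be sketched in Python as follows. This is an illustration only, not the PowerShell in Add-ToUserPath: the names `expand_percent_vars`, `normalize`, and `add_to_path` are hypothetical, and the case-insensitive, trailing-separator-insensitive comparison is an assumption about how entries are matched.

```python
import re


def expand_percent_vars(entry, env):
    """Expand %NAME% references from env; unknown names are left untouched."""
    lookup = {key.upper(): value for key, value in env.items()}
    return re.sub(
        r"%([^%]+)%",
        lambda m: lookup.get(m.group(1).upper(), m.group(0)),
        entry,
    )


def normalize(entry, env):
    # Assumption: entries are compared expanded, lowercased, without a trailing separator.
    return expand_percent_vars(entry, env).rstrip("\\/").lower()


def add_to_path(entries, new_dir, env, position="Append"):
    """Return (entries, changed) after applying the dedup / reorder rules."""
    target = normalize(new_dir, env)  # expand the new directory too (symmetric check)
    kept = [e for e in entries if normalize(e, env) != target]
    dropped = [e for e in entries if normalize(e, env) == target]
    if dropped:
        if position == "Append":
            return entries, False  # do not reshuffle a user-curated ordering
        if len(dropped) == 1 and normalize(entries[0], env) == target:
            return entries, False  # the only copy is already first; keep its casing
    if position == "Prepend":
        return [new_dir] + kept, True  # single copy reinserted at the front
    return kept + [new_dir], True
```

With `env = {"USERPROFILE": r"C:\Users\me"}`, passing `r"%USERPROFILE%\foo"` against a PATH that already holds `r"C:\Users\me\foo"` is a no-op for Append and a move-to-front for Prepend.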
* Shim via hardlink, Append user Scripts, drop venv sysconfig fallback
Three follow-ups to the c0ab1ab shim commit, targeting concerns raised in
the second 20-reviewer pass:
1. Shim uses unsloth.exe (hardlink, copy fallback) instead of unsloth.cmd.
The batch-file approach had three distinct regressions:
- cmd.exe expanded %...% sequences inside user arguments, so prompts
like "What does 50% mean?" got mangled before reaching the CLI
- Git Bash / MSYS2 / POSIX-style shells on Windows do not resolve
bare-name lookups to .cmd files, so `unsloth` stopped working there
- Set-Content -Encoding ASCII replaced non-ASCII profile characters
with '?', so installs under C:\Users\Jörg\... wrote a broken shim
A hardlink (fallback: copy) of unsloth.exe is a native Windows
executable with no shell indirection. PATHEXT picks .exe before .cmd
in cmd.exe and PowerShell, Git Bash honors .exe natively, subprocess
callers hit it directly, and a hardlink stays in sync with the venv
on pip upgrades because both names point at the same inode.
2. studio/setup.ps1 Python user Scripts dir is added with default Append
instead of -Position Prepend. That directory holds every pip-installed
user console script (pip, pytest, huggingface-cli, and so on), not
just unsloth, so reordering it silently changed resolution order for
unrelated tools. The new install.ps1 shim at PATH position 0 already
guarantees `unsloth` resolves to the freshly installed copy, so the
Python user Scripts entry only needs to be present, not at the front.
3. The sysconfig lookup in studio/setup.ps1 no longer falls back to
sysconfig.get_path('scripts') when the nt_user scheme dir does not
exist. When setup.ps1 is invoked from an activated venv (a flow the
linked issue actually hits) that fallback returns the venv's Scripts
directory, which would then be added to the persisted User PATH and
re-introduce the python / pip hijack the shim dir is meant to avoid.
Stick strictly to the nt_user scheme; skip the block if it does not
exist on disk.
* Do not crash installer when unsloth.exe shim is locked
The shim update sequence at install.ps1:1095 did a bare Remove-Item /
New-Item HardLink / Copy-Item. Under the script's $ErrorActionPreference
a locked target (most commonly 'unsloth studio' still running while the
user re-invokes the installer) turns the Remove-Item failure into a
terminating error that aborts the install with no actionable message.
The existing shim is perfectly usable in that state, so there is no
reason to abort. Wrap the whole remove/link/copy sequence in a try/catch
that logs the probable cause (Studio still running), points at the fix
(close Studio and re-run), and lets the installer finish with the old
launcher still serving the command.
Also only emit the "added unsloth launcher to PATH" step line when the
launcher was actually (re)created AND the PATH entry was newly added --
previously the message fired even when the shim refresh silently failed,
which was confusing.
* Guard shim PATH entry on existence, use NullString for broadcast delete
Two follow-ups surfaced by the latest review pass:
1. Do not add the shim directory to User PATH when the launcher was not
actually created. Antivirus blocking unsloth.exe, a disk-full volume,
or restrictive filesystem permissions can make both the hardlink and
the copy fallback fail on a fresh install. In that case the existing
sequence would report "added unsloth launcher to PATH" warnings but
still prepend the empty $ShimDir to User PATH -- the user sees an
install that claims success but then cannot resolve `unsloth` in a
new shell. Gate Add-ToUserPath on Test-Path $ShimExe so the PATH
entry is only persisted when the launcher is really there.
2. Pass [NullString]::Value instead of $null to the broadcast-delete
call in Add-ToUserPath. On PowerShell 7.5 and later (running on .NET
9), a bare $null going into [Environment]::SetEnvironmentVariable
can be coerced to an empty string rather than a true .NET null,
which sets the dummy UnslothPathRefresh_XXXXXXXX variable to "" in
HKCU\Environment instead of deleting it. The leaked variable is
visible in System Properties and accumulates one entry per install
run. [NullString]::Value is a PowerShell-specific sentinel that
crosses the interop boundary as a real null and works on both PS 5.1
and PS 7.x. See PowerShell/PowerShell#24637 for the underlying issue.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>
Fixes #4150.
Pre-PR, `_fix_chat_template` only patched templates where a trailing `{{ ... }}` expression followed the last `{% endfor %}`. ChatML templates (Hermes, Magnum, Phi-4, etc.) that end cleanly at `{% endfor %}` with no generation-prompt block were left unchanged, so the outer `fix_chat_template` raised:
```
RuntimeError: Unsloth: The tokenizer `...` does not have a
{% if add_generation_prompt %} for generation purposes.
```
This commonly shows up when a downstream tool (LlamaFactory, Axolotl) re-serializes the tokenizer during LoRA save and strips the generation-prompt block.
This PR adds a second branch to `_fix_chat_template` that fires when:
- the content after the last `{% endfor %}` is empty modulo Jinja `{# ... #}` comments,
- the scrubbed template contains `<|im_start|>` and `<|im_end|>`,
- and the scrubbed template does not already mention `add_generation_prompt`.
The assistant-turn separator is inferred from the template itself (preferring an explicit `'<|im_start|>assistant<sep>'` literal, then the unique `message['role'] + '<sep>'` from role concatenations, then `<|im_sep|>` for Phi-4-mini mixed-separator templates, then `\n`), so Phi-4-style templates are not silently corrupted with the wrong separator.
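A minimal sketch of the three guard conditions, assuming a plain regex scrub of Jinja comments. The helper names (`scrub_comments`, `needs_chatml_generation_prompt`, `add_generation_prompt_block`) are hypothetical and are not the code in `_fix_chat_template`. Separator inference is left out, so the sketch always uses `\n`.

```python
import re

ENDFOR = re.compile(r"\{%-?\s*endfor\s*-?%\}")
# Jinja reads the \n escape inside the string literal as a newline.
GENERATION_BLOCK = (
    "{% if add_generation_prompt %}{{ '<|im_start|>assistant\\n' }}{% endif %}"
)


def scrub_comments(template):
    """Drop Jinja {# ... #} comments so tokens inside them are ignored."""
    return re.sub(r"\{#.*?#\}", "", template, flags=re.DOTALL)


def needs_chatml_generation_prompt(template):
    scrubbed = scrub_comments(template)
    matches = list(ENDFOR.finditer(scrubbed))
    if not matches:
        return False
    tail = scrubbed[matches[-1].end():]
    return (
        tail.strip() == ""  # nothing but whitespace / comments after the last endfor
        and "<|im_start|>" in scrubbed
        and "<|im_end|>" in scrubbed
        and "add_generation_prompt" not in scrubbed
    )


def add_generation_prompt_block(template):
    """Append the generation-prompt block only when all three guards hold."""
    if not needs_chatml_generation_prompt(template):
        return template
    return template + GENERATION_BLOCK
```

Because the repaired template mentions `add_generation_prompt`, a second call returns it unchanged, which is the idempotence property listed in the verification below.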
Verified against the existing chat-template corpus:
- Hermes-3, Magnum-v2, Phi-4-mini, Phi-4 multi-sep, ChatML with trailing whitespace, ChatML with trailing Jinja comment, dot-access `message.role`, split-literal `'<|im_start|>assistant'`: all repaired with the correct assistant prefix.
- Already-fixed ChatML templates: idempotent NOP.
- Trap templates with `<|im_start|>` only inside a Jinja comment: correctly not rewritten.
- Llama-3, Gemma-3, Qwen2.5 (non-ChatML): byte-identical.
- Mistral family (5 models including Mistral-Nemo, Mistral-Small-24B, Mixtral): byte-identical, protected both by the structural guard (no ChatML tokens) and the existing name-based exemption in `load_correct_tokenizer`.
- Qwen family (14 models including Qwen2.5, Qwen3, Qwen3-Coder, QwQ, VL, Math, Qwen3-Guard): byte-identical.
End-to-end reproduction: Hermes-3 LoRA SFT, save with stripped chat_template, reload. Pre-PR code path raises the RuntimeError above. Post-PR reload loads cleanly, patches the template at load time, and `apply_chat_template(add_generation_prompt=True)` produces the correct `<|im_start|>assistant\n` prefix.
* fix pass attn implementation
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Studio: add folder browser modal for Custom Folders
The Custom Folders row in the model picker currently only accepts a
typed path. On a remote-served Studio (Colab, shared workstation) that
means the user has to guess or paste the exact server-side absolute
path. A native browser folder picker can't solve this: HTML
`<input type="file" webkitdirectory>` hides the absolute path for
security, and the File System Access API (Chrome/Edge only) returns
handles rather than strings, neither of which the server can act on.
This PR adds a small in-app directory browser that lists paths on the
server and hands the chosen string back to the existing
`POST /api/models/scan-folders` flow.
## Backend
* New endpoint `GET /api/models/browse-folders`:
* `path` query param (expands `~`, accepts relative or absolute; empty
defaults to the user's home directory).
* `show_hidden` boolean to include dotfiles/dotdirs.
* Returns `{current, parent, entries[], suggestions[]}`. `parent` is
null at the filesystem root.
* Immediate subdirectories only (no recursion); files are never
returned.
* `entries[].has_models` is a cheap hint: the directory looks like it
holds models if it is named `models--*` (HF hub cache layout) or
one of the first 64 children is a .gguf/.safetensors/config.json/
adapter_config.json or another `models--*` subfolder.
* Sort order: model-bearing dirs, then plain, then hidden; case-
insensitive alphabetical within each bucket.
* Suggestions auto-populate from HOME, the HF cache root, and any
already-registered scan folders, deduplicated.
* Error surface: 404 for missing path, 400 for non-directory, 403 on
permission errors. Auth-required like the other models routes.
* New Pydantic schemas `BrowseEntry` and `BrowseFoldersResponse` in
`studio/backend/models/models.py`.
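The sort order described above could be expressed as a two-part key. This is a sketch with a hypothetical `Entry` dataclass rather than the real `BrowseEntry` schema. Placing a hidden directory that also holds models in the first bucket is an assumption, since the description does not say which rule wins.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    has_models: bool = False
    hidden: bool = False


def browse_sort_key(entry):
    # Bucket 0: model-bearing, bucket 1: plain, bucket 2: hidden.
    if entry.has_models:
        bucket = 0  # assumption: model-bearing wins over hidden
    elif entry.hidden:
        bucket = 2
    else:
        bucket = 1
    return (bucket, entry.name.lower())  # case-insensitive within each bucket


def sort_entries(entries):
    return sorted(entries, key=browse_sort_key)
```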
## Frontend
* New `FolderBrowser` component
(`studio/frontend/src/components/assistant-ui/model-selector/folder-browser.tsx`)
using the existing `Dialog` primitive. Features:
* Clickable breadcrumb with a `..` row for parent navigation.
* Quick-pick chips for the server-provided suggestions.
* `Show hidden` checkbox.
* In-flight fetch cancellation via AbortController so rapid
navigation doesn't flash stale results.
* Badges model-bearing directories inline.
* `chat-api.ts` gains `browseFolders(path?, showHidden?)` and matching
types.
* `pickers.tsx` adds a folder-magnifier icon next to the existing `Add`
button. Opening the browser seeds it with whatever the user has
already typed; confirming fills the text input, leaving the existing
validation and save flow unchanged.
## What it does NOT change
* The existing text-input flow still works; the browser is additive.
* No new permissions or escalation; the endpoint reads only directories
the server process is already allowed to read.
* No model scanning or filesystem mutation happens from the browser
itself -- it just returns basenames for render.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Studio: cap folder-browser entries and expose truncated flag
Pointing the folder browser at a huge directory (``/usr/lib``,
``/proc``, or a synthetic tree with thousands of subfolders) previously
walked the whole listing and stat-probed every child via
``_looks_like_model_dir``. That is both a DoS shape for the server
process and a large-payload surprise for the client.
Introduce a hard cap of 2000 subdirectory entries and a
``truncated: bool`` field on the response. The frontend renders a small
hint below the list when it fires, prompting the user to narrow the
path. Below-cap directories are unchanged.
Verified end-to-end against the live backend with a synthetic tree of
2050 directories: response lands at 2000 entries, ``truncated=true``,
listing finishes in sub-second time (versus tens of seconds if we were
stat-storming).
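The cap plus `truncated` flag can be sketched like this. The function name `list_subdirs` is hypothetical, and the sketch counts appended subdirectories, as this commit describes.

```python
import os


def list_subdirs(path, cap=2000):
    """Return (subdirectory names, truncated) and stop once cap entries are collected."""
    names = []
    truncated = False
    with os.scandir(path) as children:
        for child in children:
            if not child.is_dir():
                continue  # files are never returned
            if len(names) >= cap:
                truncated = True  # at least one more subdirectory exists
                break
            names.append(child.name)
    return sorted(names, key=str.lower), truncated
```

A directory with exactly `cap` subdirectories comes back complete with `truncated` left False, because the flag is only set when one more subdirectory is seen.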
* Studio: suggest LM Studio / Ollama dirs + 2-level model probe
Three improvements to the folder-browser, driven by actually dropping
an LM Studio-style install (publisher/model/weights.gguf) into the
sandbox and walking the UX:
## 1. Quick-pick chips for other local-LLM tools
`well_known_model_dirs()` (new) returns paths commonly used by
adjacent tools. Only paths that exist are returned so the UI never
shows dead chips.
* LM Studio current + legacy roots + user-configured
`downloadsFolder` from its `settings.json` (reuses the existing
`lmstudio_model_dirs()` helper).
* Ollama: `$OLLAMA_MODELS` env override, then `~/.ollama/models`,
`/usr/share/ollama/.ollama/models`, and `/var/lib/ollama/.ollama/models`
(the systemd-service install path surfaced in the upstream "where is
everything?" issue).
* Generic user-choice locations: `~/models`, `~/Models`.
Dedup is stable across all sources.
## 2. Two-level model-bearing probe
LM Studio and Ollama both use `root/publisher/model/weights.gguf`.
The previous `has_models` heuristic only probed one level, so the
publisher dir (whose immediate children are model dirs, not weight
files) was always marked as non-model-bearing. Pulled the direct-
signal logic into `_has_direct_model_signal` and added a grandchild
probe so the classic layout is now recognised.
Still O(PROBE^2) worst-case, still returns immediately for
`models--*` names (HF cache layout) and for any direct weight file.
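A rough sketch of the two-level probe, assuming a probe limit of 64 children per level. The function names mirror the description but the bodies are illustrative, not the code in the backend.

```python
from itertools import islice
from pathlib import Path

PROBE_LIMIT = 64  # assumption: same per-directory child limit at both levels
MODEL_SUFFIXES = (".gguf", ".safetensors")
MODEL_FILENAMES = ("config.json", "adapter_config.json")


def has_direct_model_signal(directory):
    """True if the directory itself looks like it holds model files."""
    if directory.name.startswith("models--"):  # HF hub cache layout
        return True
    try:
        for child in islice(directory.iterdir(), PROBE_LIMIT):
            if child.is_dir():
                if child.name.startswith("models--"):
                    return True
            elif child.suffix.lower() in MODEL_SUFFIXES or child.name in MODEL_FILENAMES:
                return True
    except OSError:
        return False
    return False


def looks_like_model_dir(directory):
    """Direct signal first, then one extra level for publisher/model/weights layouts."""
    if has_direct_model_signal(directory):
        return True
    try:
        subdirs = [c for c in islice(directory.iterdir(), PROBE_LIMIT) if c.is_dir()]
    except OSError:
        return False
    return any(has_direct_model_signal(sub) for sub in subdirs)
```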
## 3. model_files_here hint on response body
A leaf model dir (just GGUFs, no subdirs) previously rendered as
`(empty directory)` in the modal, confusing users into thinking the
folder wasn't scannable. Added a `model_files_here` count on the
response (capped at 200) and a small hint row in the modal: `N model
files in this folder. Click "Use this folder" to scan it.`
## Verification
Simulated an LM Studio install by downloading the real 84 MB
`unsloth/SmolLM2-135M-Instruct-Q2_K.gguf` into
`~/.lmstudio/models/unsloth/SmolLM2-135M-Instruct-GGUF/`. Confirmed
end-to-end:
* Home listing suggests `~/.lmstudio/models` as a chip.
* Browsing `~/.lmstudio/models` flags `unsloth` (publisher) as
`has_models=true` via the 2-level probe.
* Browsing the publisher flags `SmolLM2-135M-Instruct-GGUF` (model
dir) as `has_models=true`.
* Browsing the model dir returns empty entries but
`model_files_here=1`, and the frontend renders a hint telling the
user it is a valid target.
* Studio: one-click scan-folder add + prominent remove + plain search icon
Three small Custom Folders UX fixes after real-use walkthrough:
* **One-click add from the folder browser**. Confirming `Use this
folder` now submits the path directly to
`POST /api/models/scan-folders` instead of just populating the text
input. `handleAddFolder` takes an optional explicit path so the
submit lands in the same tick as `setFolderInput`, avoiding a
state-flush race. The typed-path + `Add` button flow is unchanged.
* **Prominent remove X on scan folders**. The per-folder delete
button was `text-muted-foreground/40` and hidden entirely on
desktop until hovered (`md:opacity-0 md:group-hover:opacity-100`).
Dropped the hover-only cloak, bumped color to `text-foreground/70`,
added a red hover/focus background, and sized the icon up from
`size-2.5` to `size-3`. Always visible on every viewport.
* **Plain search icon for the Browse button**. `FolderSearchIcon`
replaced with `Search01Icon` so it reads as a simple "find a
folder" action alongside the existing `Add01Icon`.
* Studio: align Custom Folders + and X buttons on the same right edge
The Custom Folders header used `px-2.5` with a `p-0.5` icon button,
while each folder row used `px-3` with a `p-1` button. That put the
X icon 4px further from the right edge than the +. Normalised both
rows to `px-2.5` with `p-1` so the two icons share a column.
* Studio: empty-state button opens the folder browser directly
The first-run empty state for Custom Folders was a text link reading
"+ Add a folder to scan for local models" whose click toggled the
text input. That's the wrong default: a user hitting the empty state
usually doesn't know what absolute path to type, which is exactly
what the folder browser is for.
* Reword to "Browse for a models folder" with a search-icon
affordance so the label matches what the click does.
* Click opens the folder browser modal directly. The typed-path +
Add button flow is still available via the + icon in the
section header, so users who know their path keep that option.
* Slightly bump the muted foreground opacity (70 -> hover:foreground)
so the button reads as a primary empty-state action rather than a
throwaway hint.
* Studio: Custom Folders header gets a dedicated search + add button pair
The Custom Folders section header had a single toggle button that
flipped between + and X. That put the folder-browser entry point
behind the separate empty-state link. Cleaner layout: two buttons in
the header, search first, then add.
* Search icon (left) opens the folder browser modal directly.
* Plus icon (right) toggles the text-path input (unchanged).
* The first-run empty-state link is removed -- the two header icons
cover both flows on every state.
Both buttons share the same padding / icon size so they line up with
each other and with the per-folder remove X.
* Studio: sandbox folder browser + bound caps + UX recoveries
PR review fixes for the Custom Folders folder browser. Closes the
high-severity CodeQL path-traversal alert and addresses the codex /
gemini P2 findings.
Backend (studio/backend/routes/models.py):
* New _build_browse_allowlist + _is_path_inside_allowlist sandbox.
browse_folders now refuses any target that doesn't resolve under
HOME, HF cache, Studio dirs, registered scan folders, or the
well-known third-party model dirs. realpath() is used so symlink
traversal cannot escape the sandbox. Also gates the parent crumb
so the up-row hides instead of 403'ing.
* _BROWSE_ENTRY_CAP now bounds *visited* iterdir entries, not
*appended* entries. Dirs full of files (or hidden subdirs when
show_hidden is False) used to defeat the cap.
* _count_model_files gets the same visited-count fix.
* PermissionError no longer swallowed silently inside the
enumeration / counter loops -- now logged at debug.
Frontend (folder-browser.tsx, pickers.tsx, chat-api.ts):
* splitBreadcrumb stops mangling literal backslashes inside POSIX
filenames; only Windows-style absolute paths trigger separator
normalization. The Windows drive crumb value is now C:/ (drive
root) instead of C: (drive-relative CWD-on-C).
* browseFolders accepts and forwards an AbortSignal so cancelled
navigations actually cancel the in-flight backend enumeration.
* On initial-path fetch error, FolderBrowser now falls back to HOME
instead of leaving the modal as an empty dead end.
* When the auto-add path (one-click "Use this folder") fails, the
failure now surfaces via toast in addition to the inline
paragraph (which is hidden when the typed-input panel is closed).
* Studio: rebuild browse target from trusted root for CodeQL clean dataflow
CodeQL's py/path-injection rule kept flagging the post-validation
filesystem operations because the sandbox check lived inside a
helper function (_is_path_inside_allowlist) and CodeQL only does
intra-procedural taint tracking by default. The user-derived
``target`` was still flowing into ``target.exists`` /
``target.is_dir`` / ``target.iterdir``.
The fix: after resolving the user-supplied ``candidate_path``,
locate the matching trusted root from the allowlist and rebuild
``target`` by appending each individually-validated segment to
that trusted root. Each segment is rejected if it isn't a single
safe path component (no separators, no ``..``, no empty/dot).
The downstream filesystem ops now operate on a Path constructed
entirely from ``allowed_roots`` (trusted) plus those validated
segments, so CodeQL's dataflow no longer sees a tainted source.
Behavior is unchanged for all valid inputs -- only the
construction of ``target`` is restructured. Live + unit tests
all pass (58 selected, 7 deselected for Playwright env).
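The rebuild step can be sketched as below. This is an assumption-laden illustration: `is_safe_segment` and `rebuild_from_trusted_root` are hypothetical names, and the real route may order or validate things differently.

```python
import os
from pathlib import Path


def is_safe_segment(segment):
    """A single path component: no separators, not empty, not '.' or '..'."""
    return segment not in ("", ".", "..") and "/" not in segment and "\\" not in segment


def rebuild_from_trusted_root(candidate, allowed_roots):
    """Return a path built from a trusted root plus validated segments, or None."""
    resolved = Path(os.path.realpath(candidate))  # resolve symlinks before checking
    for root in allowed_roots:
        trusted = Path(os.path.realpath(root))
        try:
            relative = resolved.relative_to(trusted)
        except ValueError:
            continue  # not under this root, try the next one
        target = trusted
        for segment in relative.parts:
            if not is_safe_segment(segment):
                return None
            target = target / segment  # append one validated component at a time
        return target
    return None
```

For a valid input the result equals the resolved candidate, so callers see the same path as before. Only its construction differs.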
* Studio: walk browse paths from trusted roots for CodeQL
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@h100-8-cheapest.us-east5-a.c.unsloth.internal>
* Reapply "updated models template mappers. added lfm2.5vl450m to transformers 5…" (#4945)
This reverts commit 33503ea248.
* Add missing gemma-4-31B-it bnb-4bit mapper entry and LFM2.5 upstream namespace for PR #4950
- Add unsloth/gemma-4-31B-it-unsloth-bnb-4bit to __INT_TO_FLOAT_MAPPER so
the int-to-float resolution works for this model (already listed in
TEMPLATE_TO_MODEL_MAPPER but had no mapper entry).
- Add LiquidAI/LFM2.5-1.2B-Instruct to lfm-2.5 TEMPLATE_TO_MODEL_MAPPER
entry so the canonical upstream namespace is mapped consistently with lfm-2.
* Add missing gemma-4-31B-it bnb-4bit Ollama mapping and lfm-2.5 chat template alias
- Add unsloth/gemma-4-31B-it-unsloth-bnb-4bit to OLLAMA_TEMPLATE_TO_MODEL_MAPPER
so Ollama export works for this model (E2B-it and E4B-it bnb-4bit variants were
already present, 31B-it was inconsistently omitted)
- Register CHAT_TEMPLATES["lfm-2.5"] as alias of the lfm-2 template to prevent
KeyError when Studio resolves LFM2.5 models through MODEL_TO_TEMPLATE_MAPPER
* Add missing LFM2 bnb-4bit INT_TO_FLOAT_MAPPER entry
unsloth/LFM2-1.2B-unsloth-bnb-4bit is referenced in model_mappings.py
but had no mapper.py entry, so model resolution would fail when users
load that variant with load_in_4bit=False or when the float name is
used with load_in_4bit=True.
* Fix review findings for PR #16
1. ollama_template_mappers.py: Restore dropped Gemma-4 base model IDs
(E2B, E4B, 31B, 26B-A4B) and add missing google/ upstream IDs to
the gemma4 Ollama mapper for consistency with other gemma entries.
2. mapper.py: Remove self-mapping non-bnb-4bit entries from
__INT_TO_FLOAT_MAPPER that were polluting FLOAT_TO_INT_MAPPER with
lowercase 16-bit names, causing load_in_4bit=True to return bad
model names. Add direct MAP_TO_UNSLOTH_16bit entries to preserve
the google->unsloth 16-bit redirects.
3. mapper.py: Add LFM2.5 MAP_TO_UNSLOTH_16bit redirect so
LiquidAI/LFM2.5-1.2B-Instruct resolves to its unsloth mirror.
* Add review tests for PR #4950
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove top-level test files
These test_*.py files were added at the repo root rather than under tests/.
Removing them from this PR; the production mapper changes remain.
* Add gemma-4-26B-A4B-it mapping
Adds unsloth/gemma-4-26B-A4B-it to __INT_TO_FLOAT_MAPPER as a 2-tuple so
google/gemma-4-26B-A4B-it routes to unsloth/gemma-4-26B-A4B-it across
INT_TO_FLOAT_MAPPER, FLOAT_TO_INT_MAPPER, and MAP_TO_UNSLOTH_16bit.
The 26B-A4B (MoE) model has no bnb-4bit variant, so the key uses the
plain unsloth name rather than the -unsloth-bnb-4bit suffix.
Removes the now-redundant standalone _add_with_lower call for the -it
variant; the 16bit mapping is registered via the dict loop.
* Add unsloth-bnb-4bit mappings for gemma-4 base (non-it) models
Adds E2B, E4B, 31B base unsloth-bnb-4bit entries to __INT_TO_FLOAT_MAPPER.
The 26B-A4B (MoE) base has no bnb-4bit variant on HF, so it stays on the
standalone _add_with_lower line for the 16bit-only routing.
Removes the redundant _add_with_lower lines for E2B, E4B, 31B base since
the dict loop now registers the same google->unsloth route through the
2-tuple entries, plus full FLOAT_TO_INT and INT_TO_FLOAT coverage.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* feat: Add cactus QAT scheme support
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* test(qat): add tests for cactus QAT scheme and fix missing import
* Fix cactus QAT scheme: correct MappingType import, tighten PerGroup filter
- Drop the broken `from torchao.dtypes import MappingType` import. `MappingType`
lives in `torchao.quantization` (and `torchao.quantization.quant_primitives`);
it is not exported from `torchao.dtypes` in any supported torchao release
(verified on 0.14, 0.16, 0.17). The previous code raised `ImportError` on
every cactus call and was masked as a misleading 'torchao not found' error.
- Since `IntxWeightOnlyConfig` already defaults `mapping_type` to
`MappingType.SYMMETRIC`, drop the explicit kwarg entirely and remove the
import. Behavior is unchanged.
- Introduce a named `group_size = 32` constant (matches the int4 / fp8-int4
pattern in the surrounding branches) and add a `% group_size == 0`
divisibility guard to the filter. `PerGroup(32)` requires
`in_features % 32 == 0` at `quantize_()` time, otherwise torchao raises
`ValueError: in_features (N) % group_size (32) must be == 0`. The old
`in_features >= 32` filter would admit non-aligned widths (e.g. 33, 48, 65,
127) and crash `_prepare_model_for_qat` for those shapes.
* Warn when cactus QAT skips non-divisible Linear layers
Multiple reviewers flagged that the divisibility guard added in the
previous commit can silently leave Linear layers in full precision when
their in_features is not a multiple of 32. For currently supported
Unsloth models (Qwen, Llama, Gemma, Mistral, Phi) every Linear width is
already a multiple of 32/64/128 so this never triggers, but surfacing
the coverage gap is cheap and avoids users assuming 100% QAT coverage
when they bring a custom model with unusual shapes.
Emit a UserWarning listing up to the first 8 skipped layers whenever
the cactus filter excludes any Linear due to the modulo guard. This
keeps the lenient skip-instead-of-fail behavior (consistent with int4 /
fp8-int4), but the skip is no longer silent.
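The guard and the warning can be sketched without torch or torchao, using a stub in place of `nn.Linear`. `LinearStub`, `cactus_filter`, and `select_cactus_layers` are hypothetical names, and the warning text is illustrative.

```python
import warnings
from dataclasses import dataclass

GROUP_SIZE = 32


@dataclass
class LinearStub:
    """Stand-in for a Linear layer; only in_features matters here."""
    in_features: int


def cactus_filter(module):
    # Per-group quantization with groups of 32 needs in_features to be a multiple of 32.
    return module.in_features >= GROUP_SIZE and module.in_features % GROUP_SIZE == 0


def select_cactus_layers(named_modules, max_listed=8):
    """Return (selected, skipped) names and warn about layers lost to the modulo guard."""
    named_modules = list(named_modules)
    selected = [name for name, m in named_modules if cactus_filter(m)]
    skipped = [
        name
        for name, m in named_modules
        if m.in_features >= GROUP_SIZE and m.in_features % GROUP_SIZE != 0
    ]
    if skipped:
        shown = ", ".join(skipped[:max_listed])
        warnings.warn(
            f"cactus QAT skipped {len(skipped)} Linear layer(s) whose in_features "
            f"is not a multiple of {GROUP_SIZE}: {shown}",
            UserWarning,
        )
    return selected, skipped
```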
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* feat: Add support for OLMo-3 model in mapping and tests
* Update unsloth/models/mapper.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Update tests/test_get_model_name.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Fix casing, add Think variants, and align version gate for OLMo-3 PR 4678
Mapper: switch slugs from OLMo-3 to canonical Olmo-3 mixed case, drop the
non-existent unsloth/Olmo-3-7B-Instruct-bnb-4bit dead alias, and add the
already-published Olmo-3-7B-Think and Olmo-3-32B-Think Unsloth mirrors.
Loader: change the olmo3 transformers version gate from Version("4.57.0")
to Version("4.57.0.dev0") so nightly/source builds that already contain
olmo3 are not blocked, matching the OLMo-2, Gemma 3 and Cohere patterns.
* Use canonical Olmo-3 casing and cover Think variants in OLMo-3 tests
Mirrors the mapper.py fixes on pr-4678-code: HuggingFace canonical slugs
for the OLMo-3 family use mixed-case Olmo-3 (not OLMo-3 like OLMo-2), and
Unsloth already hosts Olmo-3-7B-Think and Olmo-3-32B-Think mirrors, so
the resolution matrix now covers all three published Olmo-3 families.
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Studio: refresh Downloaded GGUF list and recurse into variant subdirs
Two fixes for the model picker's "Downloaded" section.
Frontend (`pickers.tsx`):
* `HubModelPicker`'s mount effect short-circuited the cached-gguf and
cached-models refetch whenever the module-level cache already had
entries (`if (alreadyCached) return;`). After downloading a new repo
in the same session, reopening the picker rendered the stale cache
and the new repo never appeared in "Downloaded" until a full page
reload. The early return is removed so the lists are always refreshed
on mount; the module cache still drives the initial render so there
is no spinner flash when we already had data.
Backend (`utils/models/model_config.py`):
* `list_local_gguf_variants` and `_find_local_gguf_by_variant` used a
non-recursive `Path.glob("*.gguf")`. Some HF GGUF repos (e.g.
`unsloth/gemma-4-26B-A4B-it-GGUF`) place the largest quants under a
variant-named subdirectory such as `BF16/...gguf`, which the
top-level glob missed. Both helpers now use `rglob` and the variant
filename is stored as a path relative to the scan root so the
locator can still find the file.
The flat-layout case (variants directly in the snapshot root) is
unchanged: verified against `unsloth/gemma-4-E2B-it-GGUF` which still
returns its UD-Q4_K_XL variant correctly.
* Studio: emit posix-style relative filenames for local GGUF subdirs
`list_local_gguf_variants` was doing `str(f.relative_to(p))`, which on
Windows produces backslash-separated paths like `BF16\foo.gguf`. The
remote `list_gguf_variants` (HF API path) always returns forward-slash
filenames such as `BF16/foo.gguf`, so the two would diverge on Windows.
Switch to `.as_posix()` so the local and remote variant filenames stay
identical across Linux, macOS, and Windows. Verified by simulating with
`PureWindowsPath` in the test suite.
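The divergence is easy to reproduce on any OS with `PureWindowsPath`. The snapshot path below is made up for illustration.

```python
from pathlib import PureWindowsPath

# Hypothetical snapshot root and a weight file inside a quant-named subdir.
root = PureWindowsPath(r"C:\hub\snapshots\abc")
weight = root / "BF16" / "foo.gguf"

relative = weight.relative_to(root)
native = str(relative)          # backslash-separated on Windows paths
portable = relative.as_posix()  # forward slashes, matching the remote listing

print(native)    # BF16\foo.gguf
print(portable)  # BF16/foo.gguf
```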
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Studio: detect mmproj at snapshot root for nested-variant layouts
When _find_local_gguf_by_variant returns a weight file inside a
quant-named subdir (e.g. snapshot/BF16/foo.gguf), detect_mmproj_file
was scanning only the immediate parent and missing the mmproj file
sitting at the snapshot root. The model was then loaded without
--mmproj, silently breaking vision support for repos that ship
nested variants.
detect_mmproj_file now takes an optional search_root and walks up
from the weight file to that root, in order, so the mmproj at the
snapshot root is picked up. Sibling quant subdirs are not scanned,
so an unrelated variant's mmproj does not leak in.
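A sketch of the walk-up, assuming an mmproj file is any `.gguf` whose name contains `mmproj`. That matching rule and the function name `detect_mmproj` are assumptions, and the real `detect_mmproj_file` may differ.

```python
from pathlib import Path


def detect_mmproj(weight_file, search_root=None):
    """Look for an mmproj .gguf from the weight file's directory up to search_root."""
    start = Path(weight_file).parent
    stop = Path(search_root) if search_root is not None else start
    if stop != start and stop not in start.parents:
        stop = start  # search_root is not an ancestor: scan only the immediate parent
    current = start
    while True:
        for candidate in sorted(current.glob("*.gguf")):
            if "mmproj" in candidate.name.lower():
                return candidate
        if current == stop:
            return None
        current = current.parent  # step toward the root; siblings are never scanned
```

Only directories on the path between the weight file and `search_root` are scanned, so an mmproj that belongs to a sibling quant directory is never picked up.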
Also apply the suggested micro-optimization on relative_to in
list_local_gguf_variants -- only build the posix path when storing
the first file for a quant.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
The "Patched trl.models.utils.disable_gradient_checkpointing with a no-op"
warning fires once on every Unsloth import, including from notebooks where
the user did not opt into verbose logging. It is a routine integration
patch, not an anomaly the user needs to know about. Gate it on
UNSLOTH_ENABLE_LOGGING=1 like other diagnostic notices.
* Fix grad-accum model_accepts_loss_kwargs detection for vision wrappers
Replace the source-string rewrite of Trainer.__init__ with an instance-level
accepts_loss_kwargs shadow applied on the loaded model. Covers:
1. Unsloth-compiled forward -> True, so HF Trainer does not double-scale
on top of unsloth_fixed_cross_entropy's num_items_in_batch division.
2. Stock forward on a conditional-generation wrapper (Gemma3n, Gemma3
pre-4.57, Qwen-VL family, etc.) where the outer class has no
accepts_loss_kwargs but the inner .model declares False -> False.
This is the case that reproduces issue #4982 under trust_remote_code
or UNSLOTH_COMPILE_DISABLE, where the previous fix's outer-attr
check walked past the inner model and fell through to signature
inspection.
3. Text LMs without any explicit accepts_loss_kwargs -> leave HF default.
The previous .replace()-based patch silently no-ops on transformers 4.48
through 4.52 (variable named model, not unwrapped_model) and is fragile
against any upstream reformat. The new helper walks the PEFT / HF wrapper
chain, finds the first class that declares accepts_loss_kwargs on its own
class dict (type(m).__dict__, not hasattr, to avoid PEFT __getattr__
forwarding), and setattr-shadows that value at every wrapper level so
HF Trainer's hasattr(unwrapped_model, ...) check picks it up at whichever
level accelerate.unwrap_model returns.
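The wrapper walk and shadowing can be sketched with plain Python objects. This is a simplified stand-in: `wrapper_chain` and `shadow_accepts_loss_kwargs` are hypothetical names, links are read from each instance's `__dict__` (real torch modules store submodules differently), and the Unsloth-compiled-forward case is omitted.

```python
def wrapper_chain(model, link_attrs=("base_model", "model")):
    """Follow wrapper links stored on each instance, outermost first."""
    chain, seen = [], set()
    current = model
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        chain.append(current)
        attrs = getattr(current, "__dict__", {})
        current = next(
            (attrs[name] for name in link_attrs if attrs.get(name) is not None),
            None,
        )
    return chain


def shadow_accepts_loss_kwargs(model):
    """Copy the first class-declared accepts_loss_kwargs onto every wrapper level."""
    chain = wrapper_chain(model)
    for level in chain:
        declared = type(level).__dict__  # class dict only, so forwarding cannot fool us
        if "accepts_loss_kwargs" in declared:
            value = declared["accepts_loss_kwargs"]
            break
    else:
        return None  # nothing declared anywhere: leave the HF default alone
    for level in chain:
        setattr(level, "accepts_loss_kwargs", value)  # instance-level shadow
    return value
```

Reading `type(level).__dict__` rather than calling `hasattr` matters when an outer wrapper forwards unknown attributes to the model it wraps, because `hasattr` would then report the inner value as if the outer class had declared it.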
Also adds an unconditional post-init clamp of
accelerator.gradient_accumulation_steps = 1 to work around the
transformers 5.0 through 5.5 GradientAccumulationPlugin regression that
makes accelerator.backward divide loss by GA on top of training_step's
own /GA division. Fixed upstream in 5.6.0.dev0; no-op on 4.x and 5.6+.
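A minimal sketch of the wrapper-chain shadowing described above, using toy classes instead of real PEFT / transformers models. The names `wrapper_chain`, `shadow_accepts_loss_kwargs`, and the `("base_model", "model")` attribute list are assumptions for illustration, not the helper's actual code. The Unsloth-compiled-forward case is omitted.

```python
from typing import Any, List, Optional


def wrapper_chain(model: Any) -> List[Any]:
    """Collect the model and each nested wrapper, outermost first."""
    chain, seen = [], set()
    m = model
    while m is not None and id(m) not in seen:
        chain.append(m)
        seen.add(id(m))
        nxt = None
        # Assumed attribute names for the next inner level (hypothetical).
        for name in ("base_model", "model"):
            candidate = m.__dict__.get(name)
            if candidate is not None and candidate is not m:
                nxt = candidate
                break
        m = nxt
    return chain


def shadow_accepts_loss_kwargs(model: Any) -> Optional[bool]:
    """Copy the first class-declared accepts_loss_kwargs onto every level."""
    chain = wrapper_chain(model)
    declared = None
    for m in chain:
        # type(m).__dict__ instead of hasattr: a forwarding __getattr__
        # on a wrapper must not be mistaken for a declaration.
        if "accepts_loss_kwargs" in type(m).__dict__:
            declared = type(m).__dict__["accepts_loss_kwargs"]
            break
    if declared is None:
        return None  # nothing declared anywhere: leave the default alone
    for m in chain:
        setattr(m, "accepts_loss_kwargs", declared)
    return declared


class Inner:
    accepts_loss_kwargs = False  # inner model declares the flag


class Outer:
    def __init__(self, inner):
        self.model = inner  # outer wrapper declares nothing


class PeftLike:
    def __init__(self, base):
        self.base_model = base

    def __getattr__(self, name):
        # Forwards unknown attributes inward, like a PEFT wrapper.
        return getattr(self.__dict__["base_model"], name)
```

After the call, every level carries the value as an instance attribute, so a `hasattr` check succeeds at whichever level is inspected.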
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Trim comments
* Address review: cover PEFT-after-load and custom compile location
Two review findings from the 20-reviewer pass:
1. [3 of 20 reviewers] apply_accepts_loss_kwargs_fix was called from the
loaders before get_peft_model wraps the base model, so on transformers
4.48-4.52 (which does hasattr on the outer model) the instance shadow
on the base model was lost after PEFT wrapping. Fix: also call it from
the wrapped Trainer.__init__ so it runs on whatever model the user
actually hands to Trainer, which is always the final wrapped form.
2. [1 of 20 reviewers] _forward_is_unsloth_compiled hard-coded the
substrings "unsloth_compiled" / "unsloth_cache" in the co_filename
check, which misclassifies compiled forwards when
UNSLOTH_COMPILE_LOCATION is set to a custom directory. Fix: new
_unsloth_compile_cache_leaves helper that reads the env var and
matches the basename against path components, honoring both the
default and any user override.
Verified locally:
- PEFT-after-load simulation: HF's hasattr(peft, "accepts_loss_kwargs")
now returns True after our init wrapper runs, and value resolves to
False on Gemma3n-style inner wrappers.
- Custom UNSLOTH_COMPILE_LOCATION simulation: compiled detection returns
True for /tmp/my_custom_cache/compiled.py when the env var is set.
- End-to-end Gemma-3 270m + LoRA SFT unchanged: loss 4.9626, grad-norm
matches prior run, all 4 wrapper levels now carry the shadowed attr.
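The second review finding (matching the cache directory's basename against path components) could look roughly like the sketch below. The default leaf name `unsloth_compiled_cache` and both function names are assumptions for illustration. Only the `UNSLOTH_COMPILE_LOCATION` variable comes from the text above.

```python
import os
from pathlib import PurePosixPath
from typing import Mapping, Optional, Set

# Assumed default cache directory name (hypothetical).
DEFAULT_LEAVES = ("unsloth_compiled_cache",)


def compile_cache_leaves(environ: Optional[Mapping[str, str]] = None) -> Set[str]:
    """Directory names that mark a compiled-forward source file."""
    env = os.environ if environ is None else environ
    leaves = set(DEFAULT_LEAVES)
    custom = env.get("UNSLOTH_COMPILE_LOCATION", "")
    if custom:
        # Keep only the last path component of the override.
        leaves.add(os.path.basename(os.path.normpath(custom)))
    return leaves


def forward_is_compiled(co_filename: str,
                        environ: Optional[Mapping[str, str]] = None) -> bool:
    """True when any path component equals a known cache leaf."""
    parts = set(PurePosixPath(co_filename).parts)
    return bool(parts & compile_cache_leaves(environ))
```

Matching whole path components rather than substrings keeps a user-chosen directory name from being missed, and avoids false positives on unrelated paths that merely contain a similar substring.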
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(rocm): tighten gfx regex to ignore generic ISA lines
ROCm 6.1+ rocminfo emits generic ISA names such as
"amdgcn-amd-amdhsa--gfx11-generic" and "amdgcn-amd-amdhsa--gfx9-4-generic"
alongside the real GPU name. The previous `gfx[1-9]` regex used in
`_has_rocm_gpu` matched both, so a host with only a generic ISA entry
would be reported as having a usable AMD GPU.
Tighten the pattern to `gfx[1-9][0-9a-z]{2,3}` so only real gfx ids
match. This covers every documented target from GFX6 (gfx600) through
GFX12 (gfx1201), including letter-suffixed ids like gfx90a (MI250 /
MI250X) and gfx90c. Documented generic ISA names always have 1 or 2
digits before the dash and no longer match.
Applied to both `studio/install_python_stack.py` and
`studio/install_llama_prebuilt.py` so the two detection paths agree.
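As a quick check of the two patterns quoted above, the sketch below runs them against the generic ISA names and a few real gfx ids from this message. The helper name `has_real_gfx` is hypothetical, and the inputs are bare strings rather than real `rocminfo` output.

```python
import re

OLD_GFX = re.compile(r"gfx[1-9]")                # previous, too-loose pattern
NEW_GFX = re.compile(r"gfx[1-9][0-9a-z]{2,3}")   # tightened pattern

GENERIC_ISAS = [
    "amdgcn-amd-amdhsa--gfx11-generic",
    "amdgcn-amd-amdhsa--gfx9-4-generic",
]
REAL_IDS = ["gfx600", "gfx90a", "gfx90c", "gfx1201"]


def has_real_gfx(text: str) -> bool:
    """True only when the text contains a concrete gfx target id."""
    return NEW_GFX.search(text) is not None


for name in GENERIC_ISAS + REAL_IDS:
    print(f"{name}: old={bool(OLD_GFX.search(name))} new={has_real_gfx(name)}")
```

The generic names fail the new pattern because a dash follows the first one or two digits, so the `{2,3}` run of `[0-9a-z]` cannot be satisfied.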
Co-authored-by: Martin Hoyer <mhoyer@redhat.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Martin Hoyer <mhoyer@redhat.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Respect classification head skip list on pre-quantized 4-bit checkpoints (#5027)
FastLanguageModel.from_pretrained(..., num_labels=N) crashed with
"NotImplementedError: normal_kernel_cuda not implemented for 'Byte'" on
pre-quantized bnb 4-bit checkpoints (e.g. unsloth/Qwen3-4B-bnb-4bit)
when running on transformers 5.x.
Two pieces were needed to close this out:
1. unsloth_zoo PR: add "score", "classifier", "qa_outputs" to
SKIP_QUANTIZATION_MODULES so replace_with_bnb_linear leaves task
heads in the compute dtype.
2. This commit: for pre-quantized checkpoints, transformers reads
llm_int8_skip_modules from the quantization_config baked into
config.json and ignores the runtime BitsAndBytesConfig we pass via
kwargs. Unsloth must merge its skip list into
model_config.quantization_config.llm_int8_skip_modules before the
from_pretrained call, or the checkpoint's frozen list
(e.g. ["lm_head", "multi_modal_projector", "merger",
"modality_projection"]) wins and the `score` head gets converted to
Linear4bit with uint8 storage, then _init_weights calls normal_ on
uint8 and crashes.
Also add a defensive post-load cast on the task head to guard against
any residual path that ends up with a non-floating head dtype.
Verified on transformers 4.57.6 and 5.5.0 with:
- unsloth/Qwen3-4B-bnb-4bit + num_labels=3
- unsloth/Qwen3-4B (non-bnb repo, load_in_4bit=True)
- unsloth/Llama-3.2-1B-Instruct + num_labels=3
- unsloth/ModernBERT-large classifier head (bert_classification notebook)
- Regression: causal LM path unchanged, backbone still 4-bit
- 3-step SFT on num_labels=3 confirms gradient flow and weight updates
on score.weight
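The skip-list merge in item 2 amounts to an order-preserving union. A minimal sketch follows, with a hypothetical helper name `merge_skip_modules` that operates on plain lists. The real change applies this to `model_config.quantization_config.llm_int8_skip_modules`.

```python
from typing import Iterable, List, Optional


def merge_skip_modules(existing: Optional[Iterable[str]],
                       extra: Iterable[str]) -> List[str]:
    """Union of both lists, keeping the checkpoint's order first."""
    merged = list(existing or [])
    for name in extra:
        if name not in merged:
            merged.append(name)
    return merged


# Frozen list from a pre-quantized checkpoint (example from the text above).
frozen = ["lm_head", "multi_modal_projector", "merger", "modality_projection"]
task_heads = ["score", "classifier", "qa_outputs"]
print(merge_skip_modules(frozen, task_heads))
```

Keeping the checkpoint's own entries means the original skip behavior is unchanged, and the task heads are only appended.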
Fixes unslothai/unsloth#5027
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Fixes #2393.
- `_utils.py`: `has_internet()` now respects `HF_HUB_OFFLINE` with truthy variant parsing in addition to `TRANSFORMERS_OFFLINE`.
- `_utils.py`: replace uncontrolled `except Exception: stats_check()` retry (which had no time limit and could freeze on Kaggle offline mode) with a logged skip.
- `loader.py`: forward `local_files_only` from kwargs into all `AutoConfig.from_pretrained` and `PeftConfig.from_pretrained` probes in `FastLanguageModel.from_pretrained` and `FastModel.from_pretrained`, including the PEFT base-model reload paths.
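A sketch of the truthy-variant parsing mentioned for `has_internet()`. The accepted spellings in `_TRUTHY` and the helper names are assumptions. The actual set used in `_utils.py` is not stated here.

```python
import os
from typing import Mapping, Optional

# Assumed truthy spellings (hypothetical); compared case-insensitively.
_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name: str, environ: Optional[Mapping[str, str]] = None) -> bool:
    """True when the variable is set to a truthy spelling."""
    env = os.environ if environ is None else environ
    return env.get(name, "").strip().lower() in _TRUTHY


def offline_requested(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Either offline switch disables network probes."""
    return env_flag("HF_HUB_OFFLINE", environ) or env_flag(
        "TRANSFORMERS_OFFLINE", environ
    )
```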
* fix: support GGUF variant selection for non-suffixed repos
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: harden GGUF detection across cached models and picker flows
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* chore: use shared GGUF picker helper for search rows
* fix: avoid mixed cache duplication and preserve GGUF fallback detection
* fix: unify GGUF cache matching and merge picker hints
* fix: normalize local GGUF matching across picker and model config
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: robust cached-gguf classification + hint-aware click routing
- _repo_gguf_size_bytes: treat size_on_disk=None as 0 and dedupe fallback
by commit_hash so partial/interrupted downloads don't TypeError out of
sum() and wipe the entire cached list.
- list_cached_gguf / list_cached_models: narrow per-repo try/except so
one malformed repo no longer poisons the whole response.
- handleModelClick: route through isKnownGgufRepo instead of the
suffix-only isGgufRepo, so non-suffixed GGUF repos still open the
variant expander from every call site.
- Replace the modelIsGgufById/resultIsGgufById Maps with Sets of known
GGUF ids to stop conflating "no hint" with "known not-GGUF".
- Make HfModelResult.isGguf required (it is always set in makeMapModel).
- Add regression tests for the None size case, mixed-repo inclusion in
cached-gguf, and per-repo error isolation.
* fix: exclude mmproj from GGUF classification and case-normalize hint lookups
- _repo_gguf_size_bytes now filters mmproj vision-adapter files so
safetensors+mmproj.gguf repos stay on the cached-models path and
non-GGUF rows no longer show zero pickable variants. A vision-capable
GGUF repo (main weight + mmproj adapter) still classifies as GGUF and
reports the main weight size.
- modelGgufIds / resultGgufIds now key on lowercased ids and
isKnownGgufRepo lowercases its lookup, so store and HF-search ids
that differ only by casing still match the same GGUF hint.
- New regression tests: mmproj-only repo excluded from cached-gguf,
same repo included in cached-models, vision-capable repo still
classified as GGUF with correct size.
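The classification rules from these two commits (a missing size counts as 0, mmproj adapters are ignored) can be sketched as below. The function name `classify_gguf` and the `(filename, size)` tuple input are assumptions. The real helper reads Hugging Face cache metadata.

```python
from typing import Iterable, Optional, Tuple


def classify_gguf(files: Iterable[Tuple[str, Optional[int]]]) -> Tuple[bool, int]:
    """Return (is_gguf_repo, total_main_weight_bytes)."""
    is_gguf = False
    total = 0
    for name, size in files:
        base = name.rsplit("/", 1)[-1].lower()
        if not base.endswith(".gguf"):
            continue
        if "mmproj" in base:
            continue  # vision adapter, not a pickable variant
        is_gguf = True
        total += size or 0  # partial downloads may report None
    return is_gguf, total
```

A safetensors repo that only ships an mmproj file therefore stays non-GGUF, while a vision-capable GGUF repo reports just its main weight size.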
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
* Add configurable PyTorch mirror via UNSLOTH_PYTORCH_MIRROR env var
When set, UNSLOTH_PYTORCH_MIRROR overrides the default
https://download.pytorch.org/whl base URL in all four install scripts
(install.sh, install.ps1, studio/setup.ps1, studio/install_python_stack.py).
When unset or empty, the official URL is used. This lets users behind
corporate proxies or in regions with poor connectivity to pytorch.org
point at a local mirror without patching scripts.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add pytest for UNSLOTH_PYTORCH_MIRROR in install_python_stack.py
Tests that _PYTORCH_WHL_BASE picks up the env var when set, falls back
to the official URL when unset or empty, and preserves the value as-is
(including trailing slashes).
* Remove stale test assertions for missing install.sh messages
* Fix GPU mocking in test_get_torch_index_url.sh
Extract _has_usable_nvidia_gpu and _has_amd_rocm_gpu alongside
get_torch_index_url so the GPU-presence checks work in tests.
Add -L flag handling to mock nvidia-smi so it passes the GPU listing
check. All 26 tests now pass on CPU-only machines.
* Strip trailing slash from UNSLOTH_PYTORCH_MIRROR to avoid double-slash URLs
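Taken together, the mirror commits above describe roughly the following resolution. The function names are hypothetical and `cu128` is only an example index tag. The official base URL and the env var name come from the text.

```python
import os
from typing import Mapping, Optional

OFFICIAL_WHL_BASE = "https://download.pytorch.org/whl"


def pytorch_whl_base(environ: Optional[Mapping[str, str]] = None) -> str:
    """Mirror when set and non-empty, else the official base URL."""
    env = os.environ if environ is None else environ
    mirror = env.get("UNSLOTH_PYTORCH_MIRROR", "")
    if not mirror:
        return OFFICIAL_WHL_BASE
    return mirror.rstrip("/")  # avoid double-slash URLs when joining


def torch_index_url(tag: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Join the base with an index tag such as a CUDA or ROCm suffix."""
    return f"{pytorch_whl_base(environ)}/{tag}"
```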
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Studio: hard-stop at n_ctx with a dedicated 'Context limit reached' toast
llama-server's default behavior when the KV cache fills is to silently
drop the oldest non-``n_keep`` tokens and keep generating. The UI has
no way to tell the user that earlier turns were evicted -- they just
see degraded continuity and a confusing ``5,361 / 4,096`` on the
context usage bar.
Launch llama-server with ``--no-context-shift`` so it returns a clean
error once the request would exceed ``n_ctx``. In the chat adapter,
catch the error, identify it as a context-limit error via
``isContextLimitError()``, and surface a dedicated toast that names
the exact control to adjust: the ``Context Length`` field in the chat
Settings panel.
Also add a lightweight tooltip hint on ``ContextUsageBar`` when usage
crosses 85%, so users see the "raise Context Length in Settings"
suggestion before they hit the hard stop.
Tests:
* ``test_llama_cpp_no_context_shift.py`` pins the ``--no-context-shift``
flag in the static launch-command template, and pins it inside the
unconditional ``cmd = [ ... ]`` block so a future refactor can't
hide it behind a branch.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Shorten --no-context-shift comment to 1 line
* Match backend _friendly_error rewrite in isContextLimitError
Codex review on the PR caught that ``backend/routes/inference.py::_friendly_error``
rewrites the raw llama-server text
"request (X tokens) exceeds the available context size (Y tokens)"
into
"Message too long: X tokens exceeds the Y-token context window. ..."
on the main streaming GGUF path. The heuristic only looked for
"context size" / "exceeds the available context" / "context shift",
none of which survive the rewrite, so the new "Context limit reached"
toast would never fire for the most common case. Add matches for
"message too long" and "context window" so both wordings hit.
Also addresses Gemini feedback on the launch-flag test:
* Use ``inspect.getsource(LlamaCppBackend.load_model)`` instead of
reading ``__file__`` directly; scopes the assertions to the
function that actually launches llama-server.
* Replace the hardcoded ``" ]"`` indent search with a
line-at-a-time scan for a line that is just ``]``, so the test
survives reformatting.
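The widened matching rule for ``isContextLimitError()`` is a substring check over both wordings. Below is a Python analogue of that frontend helper. The marker list comes from this message, while the case-insensitive comparison is an assumption.

```python
from typing import Optional

_CONTEXT_LIMIT_MARKERS = (
    "context size",
    "exceeds the available context",
    "context shift",
    "message too long",   # wording after the backend rewrite
    "context window",     # wording after the backend rewrite
)


def is_context_limit_error(message: Optional[str]) -> bool:
    """True when the error text looks like an n_ctx overflow."""
    text = (message or "").lower()
    return any(marker in text for marker in _CONTEXT_LIMIT_MARKERS)
```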
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Studio: split model-load progress label across two rows
The chat flow and training overlay both compose a progress label like
"112.6 of 122.3 GB • 331.0 MB/s • 30s left" and render it next to the
percent badge in a single flex row. Once the rate + ETA part shows up,
the label outgrows the row width and wraps mid-phrase, orphaning the
percent ("19 left %") onto a second ragged line.
Fix in model-load-status.tsx: split the label on the first " • " into
a primary (size) chunk that stays on row 1 with the percent, and a
secondary (rate/ETA) chunk that renders on its own muted row below.
Labels without a bullet (e.g. "22.8 GB downloaded") collapse cleanly
to one row. The inline-status variant keeps only the primary and
surfaces the full label via the tooltip.
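The split on the first " • " can be sketched in a few lines. This is a Python analogue of the TypeScript helper, with a hypothetical function name.

```python
from typing import Optional, Tuple


def split_progress_label(label: str) -> Tuple[str, Optional[str]]:
    """Split into (primary size chunk, secondary rate/ETA chunk or None)."""
    primary, sep, secondary = label.partition(" • ")
    return primary, (secondary if sep else None)


print(split_progress_label("112.6 of 122.3 GB • 331.0 MB/s • 30s left"))
print(split_progress_label("22.8 GB downloaded"))
```

Only the first bullet is used as the split point, so the rate and ETA stay together on the second row.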
Also extracts the rate/ETA math out of useTransferStats into a pure
``transfer-stats.ts`` module (appendSample + computeTransferStats) so
it can be reasoned about and tested without React. The hook is now a
thin wrapper that feeds sample history through the pure functions.
Backend: adds two companion test files for load_progress():
* test_llama_cpp_load_progress_matrix.py (21 tests) -- platform
matrix (Linux /proc, macOS/Windows absence), VmRSS parsing
variants (tab/space/missing/malformed), filesystem edges (HF-cache
symlinks, broken symlinks, nonexistent paths, relative paths),
shard aggregation (partial multi-shard, two series in same dir,
mmproj-* exclusion, single-file), lifecycle races, concurrent
sampling (10 threads x 50 iters against real /proc), fraction
bounds.
* test_llama_cpp_load_progress_live.py (5 tests) -- no-mock live
integration: real subprocess allocating 100 MB to match VmRSS,
real ready phase, real dead-pid degradation, real shard
aggregation, repeated polling. Skipped on non-Linux.
Both complement the existing test_llama_cpp_load_progress.py.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Hoist splitProgressLabel out of JSX IIFE (review feedback)
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix bitsandbytes ROCm install by using pip instead of uv
* Also use pip for PyPI fallback path in _install_bnb_rocm
The original fix correctly switched the pre-release wheel install from
uv to pip, but left the PyPI fallback path on uv. If uv breaks bnb
on ROCm, the fallback would hit the same issue. Move pip bootstrap
before the branch so both paths use pip consistently.
* Harden pip bootstrap: try ensurepip first, warn on failure
- Try ensurepip --upgrade before falling back to uv pip install pip.
ensurepip works offline and does not need PyPI, making the bootstrap
robust when the network or index is unavailable.
- If both ensurepip and uv fail, emit a visible warning instead of
silently swallowing the error (which previously led to a cryptic
"No module named pip" downstream).
- Use run_maybe_quiet so --verbose users see bootstrap output.
- Update comment to document the actual root cause: uv rejects the
wheel because filename version and metadata version disagree.
* Add --isolated to pip install calls in _install_bnb_rocm
uv pip install ignores pip.conf and PIP_* env vars, but python -m pip
reads them. Without --isolated, users with PIP_INDEX_URL pointing to a
private mirror that does not carry bitsandbytes would see the PyPI
fallback fail where it previously worked under uv. --isolated restores
parity with the old uv behavior.
* Drop --isolated from PyPI fallback in _install_bnb_rocm
--isolated suppresses PIP_INDEX_URL, PIP_EXTRA_INDEX_URL, and pip.conf.
This is correct for the pre-release path (hardcoded GitHub URL, no index
consulted), but breaks the PyPI fallback for users in corporate or
air-gapped environments whose only route to bitsandbytes is a private
mirror configured via those mechanisms. Keep --isolated on the direct-URL
pre-release install; drop it from the index-dependent fallback.
* Drop --isolated from pre-release pip install, fix warning wording
--isolated suppresses pip.conf cert/proxy/CA settings in addition to
index config. For the direct GitHub URL, index config is irrelevant but
cert/proxy settings matter in corporate SSL-inspection environments.
Without this fix, users with pip.conf-based CA bundles get a TLS error
on the pre-release download and silently fall back to the broken PyPI
version -- the exact outcome the PR is trying to prevent.
Also fix the fallback warning: "unreachable" is too specific since the
pre-release install can fail for reasons other than network reachability.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Studio: live model-load progress + rate/ETA on download and load
Two UX fixes for the opaque multi-minute wait between clicking Load
and being able to chat, visible most clearly on large MoE GGUFs like
MiniMax-M2.7 (131 GB of weights on a 97 GB GPU):
1. **Model-load phase is now observable.** The existing chat flow
transitions the toast to "Starting model..." as soon as the
download hits 100%, then shows a spinner with no other feedback
until llama-server reports healthy. For a 130 GB model that spinner
freezes for five-plus minutes while the kernel pages shards into
the page cache. A new `GET /api/inference/load-progress` endpoint
samples `/proc/<pid>/status VmRSS` on the llama-server subprocess
against the sum of shard file sizes on disk, so the UI can render
a real bar plus rate / ETA during that window.
2. **Rate and ETA on downloads and loads.** Both the chat toast and
the training-start overlay used to show a static pair of numbers
(for example "15.4 of 140.8 GB"). A rolling 15-second window over
the existing byte-series now surfaces "85.3 MB/s, 24m 23s left"
beside that pair. The estimator is shared between the download
and load phases so the numbers don't reset when the phase flips.
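A sketch of a rolling-window rate / ETA estimator of the kind item 2 describes. The function names and the choice to compare the oldest and newest samples in the window are assumptions. Only the 15-second window comes from the text.

```python
from typing import List, Optional, Tuple

WINDOW_SECONDS = 15.0
Sample = Tuple[float, int]  # (timestamp in seconds, bytes so far)


def append_sample(samples: List[Sample], t: float, done: int) -> List[Sample]:
    """Add a sample and drop those older than the window."""
    kept = [s for s in samples if t - s[0] <= WINDOW_SECONDS]
    kept.append((t, done))
    return kept


def compute_transfer_stats(
    samples: List[Sample], total: Optional[int]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (bytes per second, seconds left); None when unknown."""
    if len(samples) < 2:
        return None, None
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    dt = t1 - t0
    if dt <= 0 or b1 <= b0:
        return None, None  # stalled or not enough signal yet
    rate = (b1 - b0) / dt
    eta = max(total - b1, 0) / rate if total else None
    return rate, eta
```

Because the estimator only sees a byte series, the same sample history can be fed from the download phase and then from the load phase without resetting.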
Also fixes a pre-existing assignment bug uncovered while wiring this
up: `load_model` was storing the caller's `gguf_path` kwarg into
`self._gguf_path`, which is `None` on the HF-download code path. The
resolved on-disk path (`model_path`) is what llama-server actually
mmaps; downstream consumers need that. No existing reader used
`_gguf_path`, so this is a correctness fix for the new endpoint.
- Backend: `LlamaCppBackend.load_progress()`, `GET /api/inference/load-progress`, `LoadProgressResponse` Pydantic model.
- Frontend: `useTransferStats` hook, `formatRate` / `formatEta` helpers, `getLoadProgress` client, rewired chat toast and `DownloadRow` in the training overlay.
- Tests: `studio/backend/tests/test_llama_cpp_load_progress.py` covers empty states, mmap phase, ready phase, sharded total aggregation, missing gguf_path, and unreadable /proc (7 cases). `tsc -b` and `vite build` on the frontend both clean.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* studio: pin peft to 0.18.1 to fix export subprocess issues
peft 0.19.0 causes export subprocess shutdown failures in Studio.
Reverting to 0.18.1 resolves the issue.
* studio: move peft pin to extras-no-deps to prevent torch upgrade
Installing peft via overrides.txt would resolve its deps and pull in
torch>=0.11.0, breaking other pinned packages. Moving the pin to
extras-no-deps.txt ensures --no-deps is used during install.
* Fix num_items_in_batch GA for Gemma4
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* studio: stream export worker output into the export dialog
The Export Model dialog only showed a spinner on the "Exporting..."
button while the worker subprocess was doing the actual heavy lifting.
For Merged to 16bit and GGUF / Llama.cpp exports this meant several
minutes (or more, for large models) of opaque silence, with no way to
tell whether save_pretrained_merged, convert_hf_to_gguf.py, or
llama-quantize was making progress.
This adds a live terminal-style output panel inside the export dialog,
rendered just above the Cancel / Start Export buttons and scrollable
with auto-follow-tail. It shows stdout and stderr from both the worker
process itself and any child process it spawns (GGUF converter,
llama-quantize), coloured by stream.
Backend
- core/export/worker.py: new _setup_log_capture(resp_queue) installed
before LogConfig.setup_logging. It saves the original stdout/stderr
fds, creates pipes, os.dup2's the write ends onto fds 1 and 2 (so
every child process inherits the redirected fds), and spins up two
daemon reader threads. Each thread reads bytes from a pipe, echoes
them back to the original fd (so the server console keeps working),
splits on \n and \r, and forwards each line to the resp queue as
{"type":"log","stream":"stdout|stderr","line":...,"ts":...}.
PYTHONUNBUFFERED=1 is set so nested Python converters flush
immediately.
- core/export/orchestrator.py:
- Thread-safe ring buffer (collections.deque, maxlen 4000) with a
monotonically increasing seq counter. clear_logs(),
get_logs_since(cursor), get_current_log_seq(), is_export_active().
- _wait_response handles rtype == "log" by appending to the buffer
and continuing the wait loop. Status messages are also surfaced as
a "status" stream so users see high level progress alongside raw
subprocess output.
- load_checkpoint, _run_export, and cleanup_memory now wrap their
bodies with the existing self._lock (previously unused), clear the
log buffer at the start of each op, and flip _export_active in a
try/finally so the SSE endpoint can detect idle.
- routes/export.py:
- Wrapped every sync orchestrator call (load_checkpoint,
cleanup_memory, export_merged_model, export_base_model,
export_gguf, export_lora_adapter) in asyncio.to_thread so the
FastAPI event loop stays free during long exports. Without this
the new SSE endpoint could not be served concurrently with the
blocking export POST.
- New GET /api/export/logs/stream SSE endpoint. Honors
Last-Event-ID and a since query param for reconnect, emits log /
heartbeat / complete / error events, uses the id field to carry
the log seq so clients can resume cleanly. On first connect
without an explicit cursor it starts from the current seq so old
lines from a previous run are not replayed.
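The ring buffer with a monotonically increasing seq, as described for the orchestrator, might look like the sketch below. The class name `LogBuffer` and the entry shape are assumptions. The method names mirror the ones listed above.

```python
import threading
from collections import deque
from typing import Dict, List


class LogBuffer:
    """Bounded, thread-safe log store addressed by sequence number."""

    def __init__(self, maxlen: int = 4000) -> None:
        self._lock = threading.Lock()
        self._lines: deque = deque(maxlen=maxlen)
        self._seq = 0

    def append(self, stream: str, line: str) -> int:
        with self._lock:
            self._seq += 1
            self._lines.append({"seq": self._seq, "stream": stream, "line": line})
            return self._seq

    def clear_logs(self) -> None:
        with self._lock:
            self._lines.clear()  # seq is deliberately not reset

    def get_logs_since(self, cursor: int) -> List[Dict]:
        with self._lock:
            return [e for e in self._lines if e["seq"] > cursor]

    def get_current_log_seq(self) -> int:
        with self._lock:
            return self._seq
```

Leaving the counter untouched on clear lets a reconnecting client keep its cursor: anything it has not yet seen still has a larger seq.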
Frontend
- features/export/api/export-api.ts: streamExportLogs() helper that
authFetches the SSE endpoint and parses id / event / data fields
manually (same pattern as streamTrainingProgress in train-api.ts).
- features/export/components/export-dialog.tsx:
- Local useExportLogs(exporting) hook that opens the SSE stream when
exporting transitions to true, accumulates up to 4000 lines in
component state, and aborts on cleanup.
- New scrollable output panel rendered above DialogFooter, only
shown for Merged to 16bit and GGUF / Llama.cpp (LoRA adapter is
a fast disk write with nothing to show). Dark terminal styling
(bg-black/85, emerald text, rose for stderr, sky for status),
max-height 14rem, auto-scrolls to the bottom on new output but
stops following if the user scrolls up. A small streaming / idle
indicator is shown next to the panel title.
- DialogContent widens from sm:max-w-lg to sm:max-w-2xl when the
output panel is visible so the logs have room to breathe.
Verified
- Python smoke test (tests/smoke_export_log_capture.py): spawns a
real mp.get_context("spawn") process, installs _setup_log_capture,
confirms that parent stdout prints, parent stderr prints, AND a
child subprocess invoked via subprocess.run (both its stdout and
stderr) are all captured in the resp queue. Passes.
- Orchestrator log helpers tested in isolation: _append_log,
get_logs_since (with and without a cursor), clear_logs not
resetting seq so reconnecting clients still progress. Passes.
- routes.export imports cleanly in the studio venv and /logs/stream
shows up in router.routes.
- bun run build: tsc -b plus vite build, no TypeScript errors.
No existing export behavior is changed. If the subprocess, the SSE
endpoint, or the frontend hook fails, the export itself still runs to
completion the same way it did before, with or without logs visible.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* export dialog: trim bootstrap noise, scope logs per screen, show realpath
Several follow-ups to the live export log work:
1. Worker bootstrap noise (transformers venv activation, Unsloth banner,
"Top GGUF/hub models" lists, vision detection, 2k-step weight load
bar) is dropped from the export-dialog stream. A threading.Event
gate in worker.py defaults closed and only opens once _handle_export
actually starts; until then the reader thread still echoes lines to
the saved console fd for debugging but does not push them onto the
resp_queue. The orchestrator already spawns a fresh subprocess for
every checkpoint load, so the gate is naturally reset between runs.
2. tqdm in non-tty mode defaults to a 10s mininterval, which makes
multi-step bars look frozen in the panel. Set TQDM_MININTERVAL=0.5
in the worker env so any tqdm-driven progress emits more often.
3. The dialog's useExportLogs hook now also clears its line buffer
when exportMethod or open changes, so re-opening the dialog into a
different action's screen no longer shows the previous action's
saved output. A useElapsedSeconds tick + "Working Xs" badge in the
log header gives users a visible sign that long single-step phases
(cache copies, GGUF conversion) are still running when no new lines
are arriving.
4. ExportBackend.export_{merged,base,gguf,lora} now return
(success, message, output_path); the worker forwards output_path on
each export_*_done response, the orchestrator's _run_export passes
it to routes/export.py, which surfaces it via
ExportOperationResponse.details.output_path. The dialog's Export
Complete screen renders the resolved on-disk realpath under "Saved
to" so users can find their exported model directly.
* fix(cli): unpack 3-tuple return from export backend
ExportOrchestrator.export_{merged,base,gguf,lora} now return
(success, message, output_path) so the studio dialog can show
the on-disk realpath. The CLI still unpacked 2 values, so every
`unsloth export --format ...` crashed with ValueError before
reporting completion. Update the four call sites and surface
output_path via a "Saved to:" echo.
* fix(studio): anchor export log SSE cursor at run start
The export dialog SSE defaulted its cursor to get_current_log_seq()
at connect time, so any line emitted between the POST that kicks
off the export and the client opening the stream was buffered with
seqs 1..k and then skipped (seq <= cursor). Long-running exports
looked silent during their first seconds.
Snapshot _log_seq into _run_start_seq inside clear_logs() and
expose it via get_run_start_seq(). The SSE default cursor now uses
that snapshot, so every line emitted since the current run began
is reachable regardless of when the client connects. Old runs
still can't leak in because their seqs are <= the snapshot.
* fix(studio): reconnect export log SSE on stream drop
useExportLogs launched streamExportLogs once per exporting
transition and recorded any drop in .catch(). Long GGUF exports
behind a proxy with an idle kill-timeout would silently lose the
stream for the rest of the run even though the backend already
supports Last-Event-ID resume. The "retry: 3000" directive emitted
by the backend is only meaningful to native EventSource; this
hook uses a manual fetch + ReadableStream parse so it had no
effect.
Wrap streamExportLogs in a retry loop that tracks lastSeq from
ExportLogEvent.id and passes it as since on reconnect. Backoff is
exponential with jitter, capped at 5s, reset on successful open.
The loop stops on an explicit backend `complete` event or on effect
cleanup.
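The backoff schedule can be sketched as a pure function. The 5s cap comes from this message. The 0.5s base, the jitter shape (50% to 100% of the capped delay), and the name `reconnect_delay` are assumptions.

```python
import random
from typing import Callable


def reconnect_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 5.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before reconnect number `attempt` (0-based)."""
    raw = min(cap, base * (2 ** attempt))   # exponential growth, capped
    return raw * (0.5 + 0.5 * rng())        # jitter within [50%, 100%)


# The caller would reset `attempt` to 0 after a successful open.
print([round(reconnect_delay(a, rng=lambda: 1.0), 2) for a in range(6)])
```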
* fix(studio): register a second command so Typer keeps `export` as a subcommand
The CLI export unpacking tests wrap `unsloth_cli.commands.export.export`
in a fresh Typer app with a single registered command. Typer flattens a
single-command app into that command, so the test's
`runner.invoke(cli_app, ["export", ckpt, out, ...])` treats the leading
`"export"` token as an unexpected extra positional argument -- every
parametrized case failed with:
Got unexpected extra argument (.../out)
Register a harmless `noop` second command so Typer preserves subcommand
routing and the tests actually exercise the 3-tuple unpack path they
were written to guard.
Before: 4 failed
After: 4 passed
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: studio-install <studio@local.install>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
* studio: show HF model download progress in training start overlay
During the training setup phase, the overlay only displayed a static
"Loading model..." line while model weights were being downloaded from
Hugging Face. On slow connections this looked like the app had frozen.
This adds a small self-contained progress block inside the existing
TrainingStartOverlay that polls the existing
GET /api/models/download-progress endpoint and renders a Progress bar
with bytes downloaded, total bytes, and percent complete.
Notes:
- Frontend only change. No backend, worker, SSE, or runtime store edits.
- Reuses the existing getDownloadProgress client wrapper and the
existing /api/models/download-progress endpoint that already scans
the HF blob cache for completed and .incomplete files.
- selectedModel is read directly from useTrainingConfigStore inside the
overlay, so no prop drilling and live-training-view.tsx is unchanged.
- Polling runs at 1500 ms and is gated on the HF repo regex
(^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+$), the same regex the backend uses,
so local paths and empty form state never hit the endpoint.
- Polling stops once progress reaches 1.0 so the bar can stay at 100
until the overlay hides on the first training step.
- Network errors are silently swallowed, matching the chat side flow
(the bar simply freezes at the last value).
- When downloadedBytes is 0 the block is hidden entirely, so cached
models do not flash a progress bar.
- When the HF API cannot determine the total size, the block falls
back to "X downloaded" with no percent and no bar.
Verified with bun run build (tsc -b plus vite build, no TypeScript
errors).
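The polling gate in the notes above reduces to one regex test. The pattern is the one quoted in this message, and `should_poll` is a hypothetical name for the check.

```python
import re
from typing import Optional

HF_REPO_ID = re.compile(r"^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+$")


def should_poll(selected_model: Optional[str]) -> bool:
    """Poll download progress only for owner/name style repo ids."""
    return bool(selected_model and HF_REPO_ID.match(selected_model))
```

Absolute local paths fail the pattern because the owner segment cannot be empty and only one slash is allowed.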
* training overlay: track dataset download + show on-disk realpath
Adds a dedicated "Downloading dataset..." section to the training-start
overlay alongside the existing model-weights one, so an HF dataset that
is downloading mid-startup is no longer mislabeled as model weights or
hidden entirely. The new GET /api/datasets/download-progress endpoint
mirrors /api/models/download-progress against the datasets-- prefix in
HF_HUB_CACHE.
Both endpoints now also return cache_path, the resolved on-disk
realpath of the snapshot directory (or the cache repo root if no
snapshot is materialized yet). The overlay surfaces this under each
download row so users can immediately see where the model and dataset
landed without digging through server logs.
The frontend's existing useModelDownloadProgress hook is generalized
to a single useHfDownloadProgress(repoId, fetcher) hook that the
model and dataset variants both delegate to, keeping polling, gating,
and completion semantics in one place.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Studio: Polish training start overlay download progress UI (#4957)
* studio: polish training start overlay download progress visuals
* Fix formatCachePath cross-platform support and redundant sizeLabel
- Extend formatCachePath regex to also shorten macOS /Users/<user> paths to ~
- Suppress sizeLabel when no byte info is available (cachePath-only state),
since the "Preparing" badge already conveys the status
* Fix misleading status badge when download total is unknown
- Hide badge when totalBytes is 0 but downloadedBytes > 0, since we cannot
determine if the download is still in progress or already complete (happens
when HF size metadata lookup fails for gated/private repos)
- Keep "Preparing" badge for the zero-bytes cachePath-only state
- Add Windows native path shortening to formatCachePath (C:\Users\<name>)
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
---------
Co-authored-by: studio-install <studio@local.install>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>
* Studio: anchor ctx-slider warning threshold at 4096 when weights exceed VRAM
The chat settings sheet's ctx slider reads `max_context_length` from
`/api/inference/status` and renders
Exceeds estimated VRAM capacity (N tokens). The model may use
system RAM.
when the user drags the slider above that value. For models whose
weights fit on some GPU subset, `_max_context_length` was already set
to the binary-search cap and the warning fired correctly.
For models whose weights exceed 90% of every GPU subset's free memory
(e.g. MiniMax-M2.7-GGUF at 131 GB on a 97 GB GPU), the ceiling-probe
loop never matched a subset, so `max_available_ctx` stayed at the
native context (e.g. 196608). The slider ran all the way to native
with no indication that any value above the 4096 spec default would
trigger `--fit on` and degrade performance.
Anchor `max_available_ctx` at `min(4096, native_context_length)` when
no subset fits, so the warning fires at the right threshold and the
user sees the correct safe-zone / warning-zone split:
Before (MiniMax-M2.7 on 97 GB GPU):
slider 0 .. 196608, warning threshold = 196608 (never fires)
After:
slider 0 .. 196608, warning threshold = 4096 (fires correctly)
No frontend changes required: `chat-settings-sheet.tsx` already
consumes `ggufMaxContextLength` (= status.max_context_length) as the
warning threshold and `ggufNativeContextLength` as the slider max.
Adds tests/test_llama_cpp_max_context_threshold.py covering
weights-exceed-VRAM (single / multi-GPU), a native-ctx below the 4096
fallback case (don't lie about supported ctx), fittable-model
regressions (small / multi-GPU / tiny on huge GPU), and the
`max_context_length` property's fallback semantics.
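The anchoring rule can be sketched as a small pure function. This is an illustration under stated assumptions: `warning_threshold` and `fitted_ctx` are hypothetical names, with `fitted_ctx` standing for the binary-search cap, or `None` when no GPU subset fits the weights.

```python
from typing import Optional

FALLBACK_CTX = 4096  # the spec default named in the commit message


def warning_threshold(native_ctx: int, fitted_ctx: Optional[int]) -> int:
    """Context length above which the slider warning should fire."""
    if fitted_ctx is not None:
        # Weights fit on some GPU subset: keep the binary-search cap.
        return fitted_ctx
    # Weights exceed VRAM everywhere: anchor at 4096, but never above native.
    return min(FALLBACK_CTX, native_ctx)
```

With a native context of 196608 and no fitting subset this yields 4096, and a model whose native context is 2048 yields 2048, so the threshold never claims more context than the model supports.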
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Studio: make GGUF disk-space preflight cache-aware
The pre-download disk check in LlamaCppBackend.load_model compared the
repo's total GGUF size against free disk without crediting bytes
already present in the Hugging Face cache. Re-loading a large cached
model (e.g. MiniMax-M2.7-GGUF at 131 GB) then failed cold with
"Not enough disk space to download any variant" whenever free disk
was below the full weight footprint, even though nothing actually
needed to be downloaded.
Subtract bytes already on disk via try_to_load_from_cache before
comparing against free space. A partial blob (interrupted download) is
not credited, so a second attempt still allocates room to finish the
download. The log line now also surfaces how much is already cached.
Adds tests/test_llama_cpp_cache_aware_disk_check.py covering the
fully-cached, partial-cache-insufficient-disk, partial-cache-enough-disk,
cold-cache, incomplete-blob, and zero-size-path-info cases. Sparse
tempfiles keep the GB-scale scenarios cheap to simulate.
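A rough sketch of the cache-aware comparison. The real check uses `try_to_load_from_cache`; here the lookup is abstracted behind a hypothetical `cached_size` callable so the example stays self-contained. That callable is assumed to return 0 for missing or partial files.

```python
from typing import Callable, Dict


def bytes_still_needed(file_sizes: Dict[str, int], cached_size: Callable[[str], int]) -> int:
    """Sum the bytes that must still be downloaded.

    cached_size(name) is a hypothetical lookup returning the size of a
    complete cached copy, or 0 when the file is missing or only partial.
    """
    needed = 0
    for name, size in file_sizes.items():
        if size > 0 and cached_size(name) >= size:
            continue  # complete file already on disk: credit it
        needed += size  # missing or partial: reserve room for the whole file
    return needed


def has_enough_disk(
    file_sizes: Dict[str, int], cached_size: Callable[[str], int], free_bytes: int
) -> bool:
    """True when free disk covers only what is not cached yet."""
    return bytes_still_needed(file_sizes, cached_size) <= free_bytes
```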
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Studio: honor explicit GGUF ctx and default to 4096 when weights exceed VRAM
The load-time auto-fit in LlamaCppBackend.load_model had two issues for
models whose weights do not fit on any GPU subset (the common case for
large MoE GGUFs such as MiniMax-M2.7, Qwen3.5-397B-A17B, etc.):
1. Auto mode (max_seq_length=0) left effective_ctx at the model's native
context when no subset passed the 90% fit check. The UI slider then
landed on e.g. 196608 for MiniMax-M2.7, far above anything usable.
Default the auto-pick to 4096 so the UI starts at a sane value; the
slider ceiling stays at the native context so the user can still
opt in to longer contexts and receive the "might be slower" warning.
2. Explicit ctx was silently shrunk when weights fit but the requested
KV overflowed the 90% budget. The shrink loop emitted -c <capped>
-ngl -1 without informing the caller, so a user who had opted into
a longer context via the UI never actually got it. Drop the shrink
loop on the explicit path and emit -c <user_ctx> --fit on instead,
letting llama-server flex -ngl (CPU layer offload).
Adds tests/test_llama_cpp_context_fit.py covering both paths, the
file-size-only fallback when KV metadata is missing, non-regression on
fittable auto-pick, and platform-agnostic input shape.
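The two paths can be summarized in a sketch like the following. The names are hypothetical, and using `--fit on` for the auto path when weights do not fit is an assumption made for illustration. Only the flags `-c`, `--fit on`, and `-ngl -1` come from the commit message.

```python
from typing import List, Optional, Tuple

DEFAULT_CTX = 4096


def pick_context(requested_ctx: int, native_ctx: int, fitted_ctx: Optional[int]) -> Tuple[int, bool]:
    """Return (effective_ctx, use_fit).

    requested_ctx == 0 means auto mode. fitted_ctx is the largest context
    that passes the fit check, or None when weights fit no GPU subset.
    """
    weights_fit = fitted_ctx is not None
    if requested_ctx == 0:
        if weights_fit:
            return fitted_ctx, False
        # Auto mode with oversized weights: start at a sane default.
        return min(DEFAULT_CTX, native_ctx), True
    # Explicit mode: never shrink what the user asked for.
    overflow = (not weights_fit) or requested_ctx > fitted_ctx
    return requested_ctx, overflow


def context_args(ctx: int, use_fit: bool) -> List[str]:
    """Render the choice as llama-server style arguments."""
    return ["-c", str(ctx)] + (["--fit", "on"] if use_fit else ["-ngl", "-1"])
```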
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [Studio] Install flash-attn at setup time on Linux
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* cleanup changes
Signed-off-by: Datta Nimmaturi <venkatadattasainimmaturi@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Test cases
* wheel_utils: narrow url_exists exceptions and log at debug level
---------
Signed-off-by: Datta Nimmaturi <venkatadattasainimmaturi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
* Show non-exported models in chat UI
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Distinguish between LoRA and full fine-tune saves. Cleanup
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
* fix(studio): default chart view to full training history instead of last 80 steps
Fixes #5003
* chore: windowsize as null code comment
---------
Co-authored-by: imagineer99 <samleejackson0@gmail.com>
Co-authored-by: Wasim Yousef Said <wasimysdev@gmail.com>
* fix: polish clipboard style and fix async clipboard path
* Use copyToClipboardAsync in CopyButton for Safari fallback
CopyButton was calling navigator.clipboard.writeText directly,
bypassing the execCommand fallback added in this same PR. Switch
to copyToClipboardAsync which tries execCommand first (Safari
user-gesture requirement) then falls back to the async clipboard API.
* Fix copyToClipboard sync contract regression and improve async path
- Restore copyToClipboard() to return only the execCommand result,
preserving the boolean contract that 7 existing callers depend on
to gate their "Copied!" UI state. The fire-and-forget async fallback
was returning true before the promise resolved, causing false success.
- Add document.body null guard to copyWithExecCommand for SSR safety.
- Reorder copyToClipboardAsync to try the async Clipboard API first,
avoiding unnecessary DOM/focus overhead in Radix focus-trapped dialogs
where execCommand always fails anyway.
* Restore queryCommandSupported guard and fix async catch path
- Restore the queryCommandSupported("copy") guard in copyToClipboard()
to match the original contract exactly: when execCommand is entirely
unsupported, fall through to fire-and-forget async clipboard write.
- Fix copyToClipboardAsync catch block: after navigator.clipboard.writeText
rejects, the user-gesture frame is gone, so execCommand will also fail.
Return false from catch instead of falling through. The execCommand
fallback at the bottom only runs when the Clipboard API is absent
(still in user-gesture frame).
* Restore execCommand fallback in copyToClipboardAsync catch path
The catch block was returning false after clipboard API rejection,
based on the incorrect premise that the user-gesture frame is lost
after an await. Per the HTML spec, transient user activation IS
preserved through promise microtask chains. The real reason
execCommand fails in the Radix dialog is the focus trap intercepting
textarea.focus(), not gesture loss.
For non-dialog callers, execCommand can still succeed after a
clipboard rejection. Inside a Radix modal, execCommand returns
false harmlessly (focus trap blocks it).
* Harden textarea fallback for mobile and continue to async path on failure
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
* fix(studio): skip training status/metrics polling when idle
Add an early return in the status and metrics setInterval callbacks when
the runtime store reports phase === "idle" and hasHydrated is true.
Previously these polls fired unconditionally every 3s/5s, generating
unnecessary network traffic and console errors when no training was
running.
* fix(studio): reduce idle polling to 30s instead of stopping entirely
Review feedback (PR #4988): completely stopping polling when idle risks
permanent UI desync if hydration fails, and misses out-of-band state
changes from other clients. Add a 30s background poll that only fires
when idle to recover gracefully.
* fix: harden idle status polling around hydration and runtime reset
---------
Co-authored-by: AdamPlatin123 <AdamPlatin123@users.noreply.github.com>
Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>
Co-authored-by: imagineer99 <samleejackson0@gmail.com>
* Studio: add API key authentication for programmatic access
External users want to hit the Studio API (chat completions with tool
calling, training, export, etc.) without going through the browser
login flow. This adds sk-unsloth- prefixed API keys that work as a
drop-in replacement for JWTs in the Authorization: Bearer header.
Backend:
- New api_keys table in SQLite (storage.py)
- create/list/revoke/validate functions with SHA-256 hashed storage
- API key detection in _get_current_subject before the JWT path
- POST/GET/DELETE /api/auth/api-keys endpoints on the auth router
Frontend:
- /api-keys page with create form, one-time key reveal, keys table
- API Keys link in desktop and mobile navbar
- Route registered with requireAuth guard
Zero changes to any existing route handler -- every endpoint that uses
Depends(get_current_subject) automatically works with API keys.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use actual origin in API key usage examples
The examples on /api-keys were hardcoded to localhost:8888 which is
wrong for remote users. Use window.location.origin so the examples
show the correct URL regardless of where the user is connecting from.
* Add `unsloth studio run` CLI command for one-liner model serving
Adds a `run` subcommand that starts Studio, loads a model, creates an
API key, and prints a ready-to-use curl command -- similar to
`ollama run` or `vllm serve`.
Usage: unsloth studio run -m unsloth/Qwen3-1.7B-GGUF --gguf-variant UD-Q4_K_XL
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add end-to-end tests for `unsloth studio run` and API key usage
Tests the 4 usage examples from the API Keys page:
1. curl basic (non-streaming) chat completions
2. curl streaming (SSE) chat completions
3. OpenAI Python SDK streaming completions
4. curl with tools (web_search + python)
Also tests --help output, invalid key rejection, and no-key rejection.
All 7 tests pass against Qwen3-1.7B-GGUF.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add /v1/completions, /v1/embeddings, /v1/responses endpoints and --parallel support
- llama_cpp.py: accept n_parallel param, pass to llama-server --parallel
- run.py: plumb llama_parallel_slots through to app.state
- inference.py: add /completions and /embeddings as transparent proxies to
llama-server, add /responses as application-level endpoint that converts
to ChatCompletionRequest; thread n_parallel through load_model
- studio.py: set llama_parallel_slots=4 for `unsloth studio run` path
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Make /v1/responses endpoint match OpenAI Responses API format
The existing /v1/responses shim returned Chat Completions format, which
broke OpenAI SDK clients using openai.responses.create(). This commit
replaces the endpoint with a proper implementation that:
- Returns `output` array with `output_text` content parts instead of
`choices` with `message`
- Uses `input_tokens`/`output_tokens` instead of `prompt_tokens`/
`completion_tokens` in usage
- Sets `object: "response"` and `id: "resp_..."`
- Emits named SSE events for streaming (response.created,
response.output_text.delta, response.completed, etc.)
- Accepts all OpenAI Responses API fields (tools, store, metadata,
previous_response_id) without erroring -- silently ignored
- Maps `developer` role to `system` and `input_text`/`input_image`
content parts to the internal Chat format
Adds Pydantic schemas for request/response models and 23 unit tests
covering schema validation, input normalisation, and response format.
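As an illustration of the shape change, the sketch below converts a Chat Completions style dict into a Responses style dict. The function name is hypothetical. The `status`, `total_tokens`, and `type: "message"` fields are assumptions about the Responses shape that the commit message does not list.

```python
import uuid
from typing import Any, Dict


def chat_completion_to_response(chat: Dict[str, Any]) -> Dict[str, Any]:
    """Map a non-streaming chat completion onto a Responses-style payload."""
    message = chat["choices"][0]["message"]
    usage = chat.get("usage", {})
    input_tokens = usage.get("prompt_tokens", 0)
    output_tokens = usage.get("completion_tokens", 0)
    return {
        "id": f"resp_{uuid.uuid4().hex}",
        "object": "response",
        "status": "completed",  # assumed field
        "model": chat.get("model"),
        "output": [
            {
                "type": "message",  # assumed item type
                "role": "assistant",
                "content": [
                    {"type": "output_text", "text": message.get("content") or ""}
                ],
            }
        ],
        "usage": {
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "total_tokens": input_tokens + output_tokens,  # assumed field
        },
    }
```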
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Studio: add Anthropic-compatible /v1/messages endpoint (#4981)
* Add Anthropic-compatible /v1/messages endpoint with tool support
Translate Anthropic Messages API format to/from internal OpenAI format
and reuse the existing server-side agentic tool loop. Supports streaming
SSE (message_start, content_block_delta, etc.) and non-streaming JSON.
Includes offline unit tests and e2e tests in test_studio_run.py.
* Add enable_tools, enabled_tools, session_id to /v1/messages endpoint
Support the same shorthand as /v1/chat/completions: enable_tools=true
with an optional enabled_tools list uses built-in server tools without
requiring full Anthropic tool definitions. session_id is passed through
for sandbox isolation. max_tokens is now optional.
* Strip leaked tool-call XML from Anthropic endpoint content
Apply _TOOL_XML_RE to content events in both streaming and
non-streaming tool paths, matching the OpenAI endpoint behavior.
* Emit custom tool_result SSE event in Anthropic stream
Adds a non-standard tool_result event between the tool_use block close
and the next text block, so clients can see server-side tool execution
results. Anthropic SDKs ignore unknown event types.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Split /v1/messages into server-side and client-side tool paths
enable_tools=true runs the existing server-side agentic loop with
built-in tools (web_search/python/terminal). A bare tools=[...] field
now triggers a client-side pass-through: client-provided tools are
forwarded to llama-server and any tool_use output is returned to the
caller with stop_reason=tool_use for client execution.
This fixes Claude Code (and any Anthropic SDK client) which sends
tools=[...] expecting client-side execution but was previously routed
through execute_tool() and failing with 'Unknown tool'.
Adds AnthropicPassthroughEmitter to convert llama-server OpenAI SSE
chunks into Anthropic SSE events, plus unit tests covering text
blocks, tool_use blocks, mixed, stop reasons, and usage.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix httpcore GeneratorExit in /v1/messages passthrough stream
Explicitly aclose aiter_lines() before the surrounding async with
blocks unwind, mirroring the prior fix in external_provider.py
(a41160d3) and cc757b78's RuntimeError suppression.
* Wire stop_sequences through /v1/messages; warn on tool_choice
Plumb payload.stop_sequences to all three code paths (server-side
tool loop, no-tool plain, client-side passthrough) so Anthropic SDK
clients setting stop_sequences get the behavior they expect. The
llama_cpp backend already accepted `stop` on both generate_chat_
completion and generate_chat_completion_with_tools; the Anthropic
handler simply wasn't passing it.
tool_choice remains declared on the request model for Anthropic SDK
compatibility (the SDK often sets it by default) but is not yet
honored. Log a structured warning on each request carrying a non-
null tool_choice so the silent drop is visible to operators.
* Wire min_p / repetition_penalty / presence_penalty through /v1/messages
Align the Anthropic endpoint's sampling surface with /v1/chat/completions.
Adds the three fields as x-unsloth extensions on AnthropicMessagesRequest
and threads them through all three code paths: server-side tool loop,
no-tool plain, and client-side passthrough.
The passthrough builder emits "repeat_penalty" (not "repetition_penalty")
because that is llama-server's field name; the backend methods already
apply the same rename internally.
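A small sketch of the rename on the passthrough path. The builder name is hypothetical and the real payload carries many more fields. Only the `repetition_penalty` to `repeat_penalty` mapping is taken from the text above.

```python
from typing import Any, Dict, Optional


def build_passthrough_sampling(
    min_p: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    presence_penalty: Optional[float] = None,
) -> Dict[str, Any]:
    """Collect optional sampling fields for the upstream request body."""
    body: Dict[str, Any] = {}
    if min_p is not None:
        body["min_p"] = min_p
    if repetition_penalty is not None:
        # llama-server's field name differs from the request field name.
        body["repeat_penalty"] = repetition_penalty
    if presence_penalty is not None:
        body["presence_penalty"] = presence_penalty
    return body
```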
* Fix block ordering and prev_text reset in non-streaming tool path
_anthropic_tool_non_streaming was building the response by appending
all tool_use blocks first, then a single concatenated text block at
the end — losing generation order and merging pre-tool and post-tool
text into one block. It also never reset prev_text between synthesis
turns, so the first N characters of each post-tool turn were dropped
(where N = length of the prior turn's final cumulative text).
Rewrite to build content_blocks incrementally in generation order,
matching the streaming emitter's behavior: deltas within a turn are
merged into the trailing text block, tool_use blocks interrupt the
text sequence, and prev_text is reset on tool_end so turn N+1 diffs
against an empty baseline.
Caught by gemini-code-assist[bot] review on #4981.
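The incremental build can be sketched as follows. The event names (`content`, `tool_start`, `tool_end`) and payload shapes are assumptions chosen for illustration. The point is the `prev_text` reset and the ordering of blocks.

```python
from typing import Any, Dict, Iterable, List, Tuple

Event = Tuple[str, Any]


def build_content_blocks(events: Iterable[Event]) -> List[Dict[str, Any]]:
    """Build Anthropic-style content blocks in generation order.

    "content" events carry the cumulative text of the current turn.
    """
    blocks: List[Dict[str, Any]] = []
    prev_text = ""
    for kind, payload in events:
        if kind == "content":
            delta = payload[len(prev_text):]
            prev_text = payload
            if not delta:
                continue
            if blocks and blocks[-1]["type"] == "text":
                blocks[-1]["text"] += delta  # merge into trailing text block
            else:
                blocks.append({"type": "text", "text": delta})
        elif kind == "tool_start":
            blocks.append({"type": "tool_use", **payload})  # interrupts text
        elif kind == "tool_end":
            prev_text = ""  # next turn diffs against an empty baseline
    return blocks
```

Without the reset on `tool_end`, the first `len(prev_text)` characters of the post-tool turn would be sliced away, which is the truncation described above.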
* Make test_studio_run.py e2e tests pytest-compatible
Add a hybrid session-scoped studio_server fixture in conftest.py that
feeds base_url / api_key into the existing e2e test functions. Three
invocation modes are now supported:
1. Script mode (unchanged) — python tests/test_studio_run.py
2. Pytest + external server — point at a running instance via
UNSLOTH_E2E_BASE_URL / UNSLOTH_E2E_API_KEY env vars, no per-run
GGUF load cost
3. Pytest + fixture-managed server — pytest drives _start_server /
_kill_server itself via --unsloth-model / --unsloth-gguf-variant,
CI-friendly
The existing _start_server / _kill_server helpers and main() stay
untouched so the script entry point keeps working exactly as before.
Test function signatures are unchanged — the (base_url, api_key)
parameters now resolve via the new fixtures when running under
pytest.
* Rename test_studio_run.py -> test_studio_api.py
The file is entirely about HTTP API endpoint testing (OpenAI-compatible
/v1/chat/completions, Anthropic-compatible /v1/messages, API key auth,
plus a CLI --help sanity check on the command that runs the API). None
of its tests cover training, export, chat-UI, or internal-Python-API
concerns.
The old name misleadingly suggested "tests for the unsloth studio run
CLI subcommand" — the new name reflects the actual scope.
Updates:
- git mv the file (rename tracked, history preserved)
- Rewrite opening docstring to state the API surface focus and call
out what is explicitly out of scope
- Update all 4 Usage-block path references to the new filename
- LOG_FILE renamed to test_studio_api.log
- conftest.py fixture import rewritten from test_studio_run to
test_studio_api, plus 7 docstring/comment references updated
No functional changes to test logic, signatures, or main().
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix httpcore asyncgen cleanup in /v1/messages and /v1/completions
The earlier fix in 985e92a9 was incomplete: it closed aiter_lines()
explicitly but still used `async with httpx.AsyncClient()` /
`async with client.stream()` inside the generator. When the generator
is orphaned (e.g. client disconnects mid-stream and Starlette drops
the StreamingResponse iterator without explicitly calling aclose()),
Python's asyncgen finalizer runs the cleanup in a DIFFERENT task than
the one that originally entered the httpx context managers. The
`async with` exits then trigger httpcore's HTTP11ConnectionByteStream
.aclose(), which enters anyio.CancelScope.__exit__ with a mismatched
task and raises RuntimeError("Attempted to exit cancel scope in a
different task"). That error escapes any user-owned try/except
because it happens during GC finalization.
Replace `async with` with manual client/response lifecycle in both
/v1/messages passthrough and /v1/completions proxy. Close the
response and client in a finally block wrapped in
`try: ... except Exception: pass`. This suppresses RuntimeError (and
other Exception subclasses) from the anyio cleanup noise while
letting GeneratorExit (a BaseException, not Exception) propagate
cleanly so the generator terminates as Python expects.
Traceback observed in user report:
File ".../httpcore/_async/connection_pool.py", line 404, in __aiter__
yield part
RuntimeError: async generator ignored GeneratorExit
...
File ".../anyio/_backends/_asyncio.py", line 455, in __exit__
raise RuntimeError(
RuntimeError: Attempted to exit cancel scope in a different task
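The manual-lifecycle pattern can be sketched with stand-in objects. `FakeResponse` and `FakeClient` are hypothetical substitutes for the httpx objects, so this runs without a network. The fake `aclose()` raises the same RuntimeError to show that the `finally` block absorbs it.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


class FakeResponse:
    """Stand-in for a streaming HTTP response (hypothetical, not httpx)."""

    def __init__(self, lines: List[str], log: List[str]) -> None:
        self._lines = lines
        self._log = log

    async def aiter_lines(self) -> AsyncIterator[str]:
        for line in self._lines:
            yield line

    async def aclose(self) -> None:
        self._log.append("response closed")
        raise RuntimeError("Attempted to exit cancel scope in a different task")


class FakeClient:
    def __init__(self, log: List[str]) -> None:
        self._log = log

    async def aclose(self) -> None:
        self._log.append("client closed")


async def proxy_stream(client: FakeClient, response: FakeResponse) -> AsyncIterator[str]:
    """Yield upstream lines; close everything by hand instead of `async with`."""
    lines = response.aiter_lines()
    try:
        async for line in lines:
            yield line
    finally:
        for closer in (lines.aclose, response.aclose, client.aclose):
            try:
                await closer()
            except Exception:
                # Swallow cleanup noise; GeneratorExit is a BaseException
                # and is not caught here, so it still propagates.
                pass


async def demo() -> Tuple[str, List[str]]:
    log: List[str] = []
    gen = proxy_stream(FakeClient(log), FakeResponse(["a", "b", "c"], log))
    first = await gen.__anext__()
    await gen.aclose()  # simulates the consumer abandoning the stream
    return first, log
```

Running `asyncio.run(demo())` closes the response and the client even though the response's `aclose()` raises, and the generator terminates without surfacing the RuntimeError.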
* Expand unsloth studio run banner with SDK base URL and more curl examples
Add an explicit "OpenAI / Anthropic SDK base URL" line inside the info
box so SDK users don't accidentally copy the bare server URL (without
/v1) into their OpenAI/Anthropic SDK constructors and hit 404s.
Replace the single /v1/chat/completions curl example with three
labeled blocks: chat/completions, Anthropic /messages, and OpenAI
Responses. The Anthropic example includes max_tokens (Anthropic SDKs
require it even though Studio accepts None).
All examples derived from a computed sdk_base_url so the /v1 prefix
stays in sync if the public path ever changes.
* Hash API keys with HMAC-SHA256 + persistent server secret
Stores the HMAC secret in a new app_secrets singleton table. Fixes
CodeQL py/weak-sensitive-data-hashing alert on storage.py:74-76,
394-395. Refresh tokens stay on plain SHA-256 (unchanged _hash_token)
so existing user sessions survive upgrade — API keys are new on this
branch so there is no migration.
* Use PBKDF2 for API key hashing per CodeQL recommendation
HMAC-SHA256 was still flagged by py/weak-sensitive-data-hashing.
Switch to hashlib.pbkdf2_hmac, which is in CodeQL's recommended
allowlist (Argon2/scrypt/bcrypt/PBKDF2). Persistent server-side
salt stays in app_secrets for defense-in-depth. 100k iterations to
match auth/hashing.py's password hasher.
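A minimal sketch of PBKDF2-based key hashing with a server-side salt. The function names and the choice of SHA-256 as the PBKDF2 digest are assumptions. The `sk-unsloth-` prefix and the 100k iteration count come from the commit messages.

```python
import hashlib
import hmac
import os

PBKDF2_ITERATIONS = 100_000  # iteration count named in the commit message
KEY_PREFIX = "sk-unsloth-"


def hash_api_key(raw_key: str, server_salt: bytes) -> str:
    """Derive a hex digest for storage; the raw key itself is never stored."""
    digest = hashlib.pbkdf2_hmac(
        "sha256",  # assumed digest
        raw_key.encode("utf-8"),
        server_salt,
        PBKDF2_ITERATIONS,
    )
    return digest.hex()


def verify_api_key(raw_key: str, server_salt: bytes, stored_hash: str) -> bool:
    """Check a presented key against the stored hash in constant time."""
    if not raw_key.startswith(KEY_PREFIX):
        return False  # not an API key: let the caller try the JWT path
    return hmac.compare_digest(hash_api_key(raw_key, server_salt), stored_hash)


def new_server_salt() -> bytes:
    """Generate a salt once and persist it (persistence is out of scope here)."""
    return os.urandom(16)
```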
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
Add mode="wait" and exit={{ opacity: 0 }} to the root AnimatePresence
wrapper so outgoing routes fully unmount before incoming routes render.
Without this, rapid navigation between Studio/Export/Recipes/Chat caused
pages to stack (2x–3x duplication).
Co-authored-by: AdamPlatin123 <AdamPlatin123@users.noreply.github.com>
Co-authored-by: Wasim Yousef Said <wasimysdev@gmail.com>
* Fix Gemma-4 GRPO catastrophic KL divergence with TRL 1.0.0+
Two compounding bugs caused Gemma-4 GRPO training to diverge with KL ~10^12
at step 1 against TRL 1.0.0+. Both fixes are runtime patches in the existing
TRL/model patch flow and are no-ops for models and TRL versions that are not
affected.
Fix 1 (rl.py): replace trl.models.utils.disable_gradient_checkpointing with
a no-op context manager. TRL 1.0.0+ wraps generation in
`with torch.no_grad(), disable_gradient_checkpointing(self.model, ...):`
purely to suppress a cosmetic PyTorch warning ("None of the inputs have
requires_grad=True"). Inside torch.no_grad() the gradient checkpointing
state has no functional effect on the forward pass. On context exit, TRL
calls model.gradient_checkpointing_enable() which dispatches to HF's
generic implementation and overwrites Unsloth's custom
`use_gradient_checkpointing="unsloth"` wrapper, corrupting Gemma-4 forward
numerics. Replacing the toggle with a no-op preserves Unsloth's custom GC
wrapper across generation passes. The patch walks sys.modules dynamically
to also rebind the symbol on every trl.* module that already imported it
(grpo_trainer, dpo_trainer, rloo_trainer, dppo_trainer, gfpo_trainer,
grpo_with_replay_buffer_trainer, and any future trainer module).
Fix 2 (vision.py): inject `final_logit_softcapping` from `config.text_config`
into the top-level `model.config` for multimodal models. Unsloth's GRPO
trainer reads `getattr(model.config, "final_logit_softcapping", 0)` but
for Gemma-4 the attribute lives only on the nested `Gemma4TextConfig`,
so the lookup silently defaults to 0 instead of 30.
Backwards compatibility:
- trl 0.22.2: no `disable_gradient_checkpointing` symbol exists, the patch
early-returns via `hasattr` guard.
- trl 0.27.1: same broken pattern as 1.0.0, the noop replacement is correct.
- trl 1.0.0+: end-to-end verified on `unsloth/gemma-4-E2B-it` GRPO with TRL
1.0.0 and transformers 5.5.0. Step 1 loss=2.46e-08, kl=2.92e-05 (machine
zero) vs broken baseline loss=1.37e+06, kl=1.76e+09.
- Llama / non-VLM text models: Fix 2 is a no-op (no `text_config`); Fix 1
is functionally identical (Unsloth's GC wrapper is preserved).
- Qwen3-VL and other VLMs without final_logit_softcapping: Fix 2 is a no-op
(text_config.final_logit_softcapping is None).
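The rebinding step in Fix 1 can be sketched generically. This is an illustration on a made-up package name (`fakepkg_demo`), not the actual patch: it shows a no-op context manager replacing a symbol on every already-imported module under a prefix.

```python
import sys
import types
from contextlib import contextmanager
from typing import Any, Callable


@contextmanager
def noop_disable_gradient_checkpointing(*args: Any, **kwargs: Any):
    """Accept any arguments and do nothing on enter or exit."""
    yield


def rebind_symbol(package_prefix: str, name: str, replacement: Callable) -> int:
    """Replace `name` on every loaded module of a package; return the count."""
    patched = 0
    for mod_name, module in list(sys.modules.items()):
        if module is None:
            continue
        if mod_name != package_prefix and not mod_name.startswith(package_prefix + "."):
            continue
        if hasattr(module, name):
            setattr(module, name, replacement)
            patched += 1
    return patched


def demo() -> int:
    """Register two fake modules that share a symbol, then rebind both."""
    original = object()
    for suffix in ("utils", "trainer"):
        mod = types.ModuleType(f"fakepkg_demo.{suffix}")
        mod.disable_gradient_checkpointing = original
        sys.modules[mod.__name__] = mod
    sys.modules["fakepkg_demo.other"] = types.ModuleType("fakepkg_demo.other")
    return rebind_symbol(
        "fakepkg_demo", "disable_gradient_checkpointing", noop_disable_gradient_checkpointing
    )
```

Walking `sys.modules` matters because `from pkg.utils import name` copies the reference into the importing module, so patching only the defining module would leave those copies untouched.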
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Apply loop 1 review fixes for PR #4934
- Move Fix 2 from vision.py to rl_replacements.py:858 and :1110 at the
actual consumer sites. This avoids mutating model.config (which could
leak into save_pretrained output) and covers text-only Gemma-4 paths
that do not flow through FastBaseModel.from_pretrained.
- Revert the vision.py injection block entirely.
- Narrow the bare except blocks in patch_trl_disable_gradient_checkpointing

from `except Exception:` to `(AttributeError, ImportError)` and
`(AttributeError, TypeError)` to avoid masking unrelated bugs.
- Add logger.warning_once when the noop patch is installed, matching
patch_trl_openenv and patch_trl_vllm_generation convention.
- Remove the dead per-module `_unsloth_noop_patched` sentinel check inside
the sys.modules walk. The function-level early return already covers
this case.
- Move `import sys` and `from contextlib import contextmanager` to the
module-level imports instead of inside the function body.
- Rewrite the ordering comment in PatchFastRL to accurately describe
why patch_trl_disable_gradient_checkpointing must run before
patch_trl_rl_trainers.
- Fix keyword default spacing to match surrounding rl.py style.
End-to-end verified: Gemma-4-E2B GRPO on TRL 1.0.0 + transformers 5.5.0
step 1 loss=2.464e-08 kl=2.921e-05, all 5 steps succeed.
* Apply loop 2 review fix for PR #4934
Extract the final_logit_softcapping fallback logic into a shared helper
`_unsloth_get_final_logit_softcapping(config)` defined in rl_replacements.py
and injected into the compiled cache via RL_PRE_ITEMS["grpo_trainer"]. Both
call sites (`grpo_trainer__generate_and_score_completions` and
`grpo_trainer_compute_loss`) now use the helper instead of inlining the
same text_config fallback block twice.
Verified: compiled cache file lists the helper at module scope and both
consumer sites call it. Gemma-4-E2B GRPO step 1 loss=2.464e-08 kl=2.921e-05
(unchanged), all 5 steps pass.
* Apply loop 3 review fix for PR #4934
Extend _unsloth_get_final_logit_softcapping to also fall back to
config.get_text_config() for composite configs such as T5GemmaConfig
where the text sub-config is not exposed via the text_config attribute
but only via the get_text_config() method. Guard against (TypeError,
ValueError) raised by ambiguous composite configs, and skip the
self-referential case where get_text_config() returns self.
This addresses the 6/7 reviewer consensus from the third review loop.
Verified:
- Helper returns 30.0 for Gemma-4, T5Gemma, and Gemma 1/2 configs.
- Helper returns 0 for Llama, Qwen, Mistral, Cohere, Granite, and
ambiguous configs raising ValueError.
- Gemma-4-E2B GRPO step 1 loss=2.464e-08 kl=2.921e-05 (unchanged).
- Llama-3.2-1B GRPO all 5 steps loss=0 kl=0 (no regression).
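The fallback chain described across the three review loops might look roughly like this. It is a sketch that uses plain objects in place of real transformers configs. The function below is illustrative and is not the `_unsloth_get_final_logit_softcapping` source.

```python
from types import SimpleNamespace
from typing import Any


def get_final_logit_softcapping(config: Any, default: float = 0) -> float:
    """Look up final_logit_softcapping on a config or its text sub-config."""
    value = getattr(config, "final_logit_softcapping", None)
    if value is not None:
        return value
    text_config = getattr(config, "text_config", None)
    if text_config is None:
        getter = getattr(config, "get_text_config", None)
        if callable(getter):
            try:
                text_config = getter()
            except (TypeError, ValueError):
                text_config = None  # ambiguous composite config
    if text_config is None or text_config is config:
        return default  # nothing nested, or self-referential
    value = getattr(text_config, "final_logit_softcapping", None)
    return default if value is None else value


# Usage with stand-in configs:
gemma_like = SimpleNamespace(text_config=SimpleNamespace(final_logit_softcapping=30.0))
llama_like = SimpleNamespace()
```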
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Pin bitsandbytes to continuous-release_main on ROCm for 4-bit decode fix
bitsandbytes 0.49.2 on PyPI ships with a broken 4-bit GEMV kernel on
every ROCm target:
- CDNA (gfx90a / gfx942 / gfx950 = MI210 / MI300X / MI350) via a
broken blocksize=32/64 warp64 GEMV kernel whose tests were
explicitly skipped with ROCM_WARP_SIZE_64 guards because the
code was known broken.
- RDNA3 / RDNA3.5 (gfx1100-1103 / gfx1150-1152) via a compile-time
BNB_WARP_SIZE macro in the host-side dispatch that resolves to
64 when the multi-arch wheel is compiled with CDNA as the
primary target, so num_blocks is wrong on RDNA and half the GEMV
output is never written.
At decode shape (1, 1, hidden) both bugs produce NaN. Training is
unaffected because training shapes are (batch, seq_len > 1, hidden)
and never touch the GEMV path. The crash during autoregressive
inference surfaces as _assert_async_cuda_kernel in torch.multinomial
which on HIP becomes a hard HSA_STATUS_ERROR_EXCEPTION instead of
a clean Python error.
Both bugs are fixed by bitsandbytes commit 713a3b8 ("[ROCm] Enable
blocksize 32 4-bit quantization and GEMV kernels on AMD CDNA",
PR #1887, merged 2026-03-09) which replaces BNB_WARP_SIZE with a
runtime hipDeviceGetAttribute query and ships a working CDNA warp64
kernel. That commit has not shipped to PyPI yet, but
continuous-release_main wheels are published on every push to bnb
main via GitHub Releases.
Point the ROCm install path at the continuous-release_main x86_64 and
aarch64 wheels and fall back to PyPI >=0.49.1 when the pre-release is
unreachable (offline installs, firewalled hosts, or architectures not
covered by the pre-release wheels). Drop the pin once bnb cuts a
0.50+ tag on PyPI.
Verified on MI300X (gfx942, ROCm 7.2, torch 2.10.0+rocm7.1): direct
bnb GEMV shape test now returns 0.0078 max abs error at seq_len=1
(no NaN) vs NaN on 0.49.2, and full Unsloth + for_inference + 4-bit
sampling generation works end-to-end.
NVIDIA / CPU / Mac / Windows paths are unaffected -- the helper is
gated on the ROCm torch index and platform.machine() respectively.
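The fallback decision reduces to a small selection function. This is a sketch under assumptions: `pick_bnb_requirement` and the injected `url_exists` callable are hypothetical, and the wheel URL is supplied by the caller rather than guessed here.

```python
from typing import Callable, Optional

PYPI_FALLBACK = "bitsandbytes>=0.49.1"  # fallback spec named in the commit message


def pick_bnb_requirement(
    prerelease_wheel_url: Optional[str],
    url_exists: Callable[[str], bool],
    pypi_fallback: str = PYPI_FALLBACK,
) -> str:
    """Prefer the pre-release wheel when reachable, else the PyPI spec."""
    if prerelease_wheel_url and url_exists(prerelease_wheel_url):
        return prerelease_wheel_url
    # Offline, firewalled, or no wheel for this architecture.
    return pypi_fallback
```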
* Drop Studio ROCm 16-bit fallback now that bnb 0.50+ fixes 4-bit decode
The 16-bit fallback in studio/backend/core/inference/inference.py was
added as a workaround for a bug that this PR already fixes at the
install layer: bitsandbytes <= 0.49.2 has a broken 4-bit GEMV kernel
on every ROCm target, which NaNs at decode shape (seq_len=1) and
crashes autoregressive inference. bnb PR #1887 (commit 713a3b8, in
0.50.0.dev0+, pinned by install.sh / install_python_stack.py in this
PR) restores correct 4-bit decode on MI300X and was verified working
end-to-end with full Unsloth + for_inference + sampling.
Revert the dual code path so ROCm and NVIDIA both go through the
normal FastLanguageModel.from_pretrained + for_inference flow:
- Remove the conditional `from unsloth import` that skipped the
import on ROCm. The monkey-patches it was trying to avoid were
never the cause of the crash; bnb 4-bit GEMV was.
- Remove the `if _hw_module.IS_ROCM:` branch in load_model that
loaded with plain transformers + PEFT + bfloat16, and the
`_resolve_fp16_base` helper it relied on.
- Remove the `get_chat_template is not None` fallback in
_load_chat_template_info -- get_chat_template is now always
imported.
- Refactor the audio/vision ROCm guard to check _hw_module.IS_ROCM
directly instead of the removed _IS_ROCM_ENV global. Audio and
vision on ROCm still need separate validation (FastVisionModel
and the CSM audio codecs were never tested on HIP) so the guard
stays for now.
Add _bnb_rocm_4bit_ok() as a runtime safety net for users who
install from this PR before the install.sh bnb pin kicks in, or
whose installer fell back to the PyPI pin because the continuous-
release wheel was unreachable. When the installed bnb is < 0.50 on
ROCm, force load_in_4bit=False and strip any -unsloth-bnb-4bit /
-bnb-4bit suffix from the model path so a pre-quantized repo
resolves to its FP16 sibling instead of pulling bnb back in via
the repo's quantization_config. LoRA adapters whose base is a
pre-quantized repo on old bnb will still fail inside Unsloth's
loader -- the only real fix there is `unsloth studio update`.
Verified on MI300X (gfx942, ROCm 7.2, torch 2.10.0+rocm7.1):
- HAPPY path (bnb 0.50.0.dev0, load_in_4bit=True, pre-quantized
repo): loads in 4-bit via the fixed GEMV, generation returns
"Paris." for greedy and sampling.
- SAFETY-NET path (simulated old bnb, suffix-stripped to the
FP16 sibling, load_in_4bit=False): loads in bf16, generation
returns "Paris." for greedy and sampling.
Net diff is ~45 lines smaller than the pre-revert state because
the entire plain-transformers 16-bit branch is gone.
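A minimal sketch of the safety-net rewrite described above, for readers who want the logic at a glance. The name `resolve_model_for_old_bnb` and its arguments are hypothetical and are not the helper in inference.py; only the 0.50 threshold and the two suffixes come from this message.

```python
# Suffixes named in the commit message. The longer one is listed first so
# "-unsloth-bnb-4bit" is not half-stripped by the shorter "-bnb-4bit" match.
_BNB_SUFFIXES = ("-unsloth-bnb-4bit", "-bnb-4bit")


def _major_minor(version):
    """Leading (major, minor) of a version string such as '0.50.0.dev0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def resolve_model_for_old_bnb(model_path, load_in_4bit, bnb_version, is_rocm):
    """Hypothetical helper: return the (model_path, load_in_4bit) to use."""
    if not is_rocm or _major_minor(bnb_version) >= (0, 50):
        return model_path, load_in_4bit  # nothing to work around
    for suffix in _BNB_SUFFIXES:
        if model_path.endswith(suffix):
            model_path = model_path[: -len(suffix)]  # FP16 sibling repo
            break
    return model_path, False  # old bnb on ROCm: never load in 4-bit
```

With these assumptions, `resolve_model_for_old_bnb("org/model-bnb-4bit", True, "0.49.2", True)` resolves to the unsuffixed repo with 4-bit loading disabled, while the same call with bnb 0.50.0.dev0 is returned unchanged.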
* Cache _bnb_rocm_4bit_ok() with functools.cache
load_model() can be called many times in a single session but the bnb
version and hardware state cannot change at runtime, so memoise the
check. First call is ~1.9 ms (dominated by the lazy `import bitsandbytes`
inside the try block), subsequent calls drop to sub-microsecond dict
lookups. Zero behavioral change.
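Sketch of the memoisation pattern only. `bnb_check` and `probe_calls` are stand-ins invented for illustration; the real check lazily imports bitsandbytes, which is replaced here by a counter so the effect of the cache is visible.

```python
import functools

probe_calls = 0  # counts how many times the expensive body really runs


@functools.cache
def bnb_check():
    """Stand-in for the memoised check; the real one imports bitsandbytes."""
    global probe_calls
    probe_calls += 1
    return True


bnb_check()  # first call executes the body
bnb_check()  # second call is served from the cache
```

After both calls the body has run once; `bnb_check.cache_info()` reports the cache hit.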
* Shorten verbose bnb/ROCm comments
Comment-only cleanup across install.sh, studio/install_python_stack.py,
and studio/backend/core/inference/inference.py. No behavioral change.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove _bnb_rocm_4bit_ok safety net from inference.py
Studio's ROCm support is brand new (PR #4720, merged today) and every
fresh install pulls the bnb continuous-release_main wheel via
install.sh / install_python_stack.py in this same PR. There are no
existing ROCm Studio installs carrying bnb < 0.50, so the defensive
version-check fallback is guarding against a scenario that cannot
actually occur. Delete the helper, the functools import, and the
safety-net block -- inference.py now calls FastLanguageModel.from_pretrained
directly with no ROCm branching.
* Drop audio/vision ROCm guard in inference.py — verified unblocked by bnb fix
Vision inference was blocked by the same bnb 4-bit GEMV bug that affected
text inference (vision models use bnb 4-bit for the LM backbone). With
bnb 0.50+ pinned in install.sh / install_python_stack.py, vision works
end-to-end on MI300X: Llama-3.2-11B-Vision-Instruct-unsloth-bnb-4bit
loaded in 4-bit via FastVisionModel + for_inference returns a correct
answer to a multimodal prompt.
Audio (CSM) was never actually blocked by HIP — on this hardware CSM
loads and runs its backbone forward pass fine with bnb 0.50, then fails
during generate() with a transformers-level kwarg validation mismatch
in generation_csm.py (`backbone_last_hidden_state` rejected). That's a
pre-existing transformers/CSM integration bug that reproduces identically
on NVIDIA, so the ROCm-gated guard was never actually protecting users
from anything HIP-specific.
Remove the combined audio/vision guard and the now-unused _hw_module
import. Also restore the one-word "Can be" in an inline comment that
drifted during the earlier comment-shortening pass, so the inference.py
delta vs pre-#4720 is exactly the max_seq_length<=0 crash fix and
nothing else.
* Shorten max_seq_length=0 guard comment to one line
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add ROCm detection to install.sh and expand shell tests
Add AMD ROCm GPU detection to get_torch_index_url() in install.sh.
When nvidia-smi is not found, probe for ROCm via amd-smi, /opt/rocm
version file, hipconfig, dpkg-query, and rpm.
Includes validation guard for malformed _rocm_tag, Debian epoch prefix
stripping, ROCm 7.2+ cap to rocm7.1 index, bitsandbytes AMD install,
and status messaging. Shell tests expanded to 23 cases.
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Add ROCm torch reinstall support to install_python_stack.py
Add _detect_rocm_version() and _ensure_rocm_torch() to detect when a
Linux host has ROCm but the venv received CPU-only torch, and reinstall
with the correct ROCm wheels. Covers ROCm 6.0 through 7.1 with a
30-second timeout on the torch GPU probe subprocess.
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Add ROCm support to llama.cpp prebuilt installer
Add has_rocm field to HostInfo, extend detect_host() to probe for ROCm
via hipcc/amd-smi/rocm-smi/ROCM_PATH, and route ROCm hosts to upstream
prebuilts (Linux ROCm 7.2 prebuilt with source fallback, Windows HIP
prebuilt with CPU fallback). Add linux-rocm and windows-hip install
kinds to runtime_patterns_for_choice().
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Add IS_ROCM hardware flag and fix AMD error message
Add IS_ROCM flag to hardware.py detect_hardware() (set when
torch.version.hip is present, DeviceType stays CUDA). Export IS_ROCM
from __init__.py. Add "rocm" key to get_package_versions().
Replace "We do not support AMD" error in tokenizer_utils.py with a
helpful message pointing to ROCm installation docs.
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Add comprehensive ROCm support test suite (68 tests)
Add tests/studio/install/test_rocm_support.py covering all ROCm code
paths across install_llama_prebuilt.py, install_python_stack.py,
hardware.py, tokenizer_utils.py, and install.sh. All tests use mocks
and run without AMD hardware.
Covers: asset selection (11), runtime patterns (5), HostInfo (4),
ROCm version detection (9), torch reinstall (9), index mapping (8),
hardware flag (8), tokenizer message (2), install.sh structure (10),
and live regression (1).
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Harden ROCm support: probe error handling, version cap, validation
Address review findings from 8 independent reviewers:
- Wrap _ensure_rocm_torch() torch probe in try/except for
TimeoutExpired and OSError so a hung or broken torch import does not
crash the installer (8/8 reviewers flagged this)
- Add torch>=2.4,<2.11.0 version cap to the ROCm reinstall path to
prevent installing unsupported torch 2.11.0 from the rocm7.1 index
- Use with-statement for file reads in _detect_rocm_version() to avoid
resource leaks
- Handle ROCM_PATH="" correctly (use `or "/opt/rocm"` instead of
default parameter to avoid relative path resolution)
- Strengthen shell validation guard from rocm[0-9] to rocm[1-9] to
reject rocm0.x tags that would produce nonexistent PyTorch index URLs
- Switch shell version cap from blocklist to allowlist (rocm6.*|rocm7.0*
|rocm7.1* pass through, everything else caps to rocm7.1) so future
ROCm 10+ does not fall through to a nonexistent index
- Add sorted() to _ROCM_TORCH_INDEX lookup for defensive ordering
- Fix test_probe_timeout_handled: replace zero-assertion test with
proper assertions verifying reinstall proceeds after timeout
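The validation guard and the allowlist cap above live in a shell `case` statement. The Python below is only an illustration of the same decision table under that assumption; `cap_rocm_tag` and both patterns are hypothetical names, not code from install.sh.

```python
import re

# Mirrors the rocm[1-9] validation guard: rejects rocm0.x and garbage.
_VALID_TAG = re.compile(r"rocm[1-9]\d*\.\d+")
# Mirrors the allowlist: only these tags pass through unchanged.
_PASS_THROUGH = re.compile(r"rocm(6\.\d+|7\.0|7\.1)")


def cap_rocm_tag(tag):
    """Return the index tag to use, or None when the tag is malformed."""
    if not _VALID_TAG.fullmatch(tag):
        return None  # e.g. "rocm0.9" would yield a nonexistent index URL
    if _PASS_THROUGH.fullmatch(tag):
        return tag
    return "rocm7.1"  # allowlist miss (7.2, 10.0, ...): cap to rocm7.1
```

An allowlist fails safe: a tag nobody anticipated lands on a known index instead of being passed through.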
* Clean up rocm_paths list construction in detect_host()
Filter None from the ROCM_PATH env var lookup at list construction time
instead of relying on the inline `if p` guard in the any() call.
* Require actual AMD GPU presence before selecting ROCm paths
All 8 reviewers across 2 cycles independently flagged that ROCm
detection used toolkit/filesystem hints (hipcc, /opt/rocm, rocm-core)
as a proxy for GPU presence, which would misroute CPU-only or NVIDIA
hosts that happen to have ROCm tools installed.
Now all 3 detection points (install.sh, install_python_stack.py,
install_llama_prebuilt.py) probe for an actual AMD GPU before
entering the ROCm path:
- install.sh: check rocminfo for gfx* GPU names, or amd-smi list
for device rows, before version detection
- install_python_stack.py: new _has_rocm_gpu() function probes
rocminfo and amd-smi list before _ensure_rocm_torch() proceeds
- install_llama_prebuilt.py: detect_host() probes rocminfo/amd-smi
list instead of just checking tool existence or directory paths
Also:
- Shell test mock amd-smi now handles "list" subcommand
- Python tests updated to mock _has_rocm_gpu where needed
- Added test_no_gpu_with_rocm_tools_skips to verify the new guard
- Test index lookups now use sorted() to match production code
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Harden hipconfig version parsing and torch probe compatibility
- Add parts[1].isdigit() check in hipconfig version parsing to handle
versions like "6.3-HIP" where the minor component has a non-numeric
suffix (strip everything from the "-" onward before int() conversion)
- Use getattr() in torch probe subprocess to safely handle old or
custom torch builds that may lack torch.version.hip/cuda attributes
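A sketch of the hardened version parsing, assuming the input is the raw hipconfig version string. `parse_hip_version` is a hypothetical name and the real parser may be structured differently.

```python
def parse_hip_version(text):
    """Return (major, minor) from text like '6.3-HIP', or None if unparseable."""
    parts = text.strip().split(".")
    if len(parts) < 2:
        return None  # one-component input such as "6"
    major = parts[0]
    minor = parts[1].split("-", 1)[0]  # "3-HIP" -> "3"
    if not (major.isdigit() and minor.isdigit()):
        return None  # refuse to guess rather than raise in int()
    return int(major), int(minor)
```

Returning None instead of raising lets the caller move on to the next detection source.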
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Strengthen AMD GPU detection and add NVIDIA precedence guard
- Change amd-smi list detection from any-non-empty-output to requiring
"gpu" marker in output, matching the shell-side NR>1 check. Prevents
false positives from header-only amd-smi list output.
- Add nvidia-smi check at the top of _ensure_rocm_torch() so mixed
AMD+NVIDIA hosts preserve NVIDIA precedence (matching install.sh and
install_llama_prebuilt.py behavior).
- Apply the same amd-smi marker fix to install_llama_prebuilt.py
detect_host() for consistency.
* Add Windows-specific ROCm/HIP detection in detect_host()
The previous detect_host() ROCm check used rocminfo and amd-smi list
which are Linux-only tools. On Windows, has_rocm would always be False,
making the Windows HIP prebuilt path at line 1794 unreachable.
Now detect_host() uses platform-specific detection:
- Linux: rocminfo (check for gfx GPU names) or amd-smi list
- Windows: hipinfo.exe, amd-smi, or amdhip64.dll on PATH
This allows Windows AMD users to get the HIP prebuilt binary instead
of silently falling through to the CPU prebuilt.
* Add AMD ROCm gaps: Mamba/SSM source builds, GPU monitoring, Windows messaging, RDNA expansion
- worker.py: Add HIP detection to causal-conv1d/mamba-ssm probe, check
for hipcc before ROCm source builds, improve status messages and error
reporting, add timeout and uv support for the source build fallback
- amd.py: New AMD GPU monitoring module via amd-smi metric --json,
mirroring nvidia.py structure (utilization, temperature, power, VRAM)
- hardware.py: Branch to amd.py when IS_ROCM is True for GPU utilization,
visible GPU queries, and physical GPU count
- install_python_stack.py: Detect AMD GPUs on Windows and warn that
ROCm-enabled PyTorch must be installed manually
- kernels/utils.py: Expand is_rdna() to cover RDNA2 (gfx1030-1032),
RDNA3 (gfx1102-1103), RDNA3.5 (gfx1150-1152) alongside existing entries
- tests: Add 32 new tests covering all changes (95/95 pass)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Harden ROCm detection, fix VRAM heuristic, and expand RDNA2 coverage
- Windows ROCm detection: validate actual GPU presence via hipinfo/amd-smi
output markers instead of just checking tool existence on PATH
- _ensure_rocm_torch: validate nvidia-smi actually reports a GPU before
giving NVIDIA precedence (fixes AMD-only hosts with stale NVIDIA tools)
- amd.py _parse_numeric: handle dict-shaped metric objects from newer
amd-smi versions ({"value": 10, "unit": "W"}) and strip MiB/GiB units
- amd.py VRAM heuristic: raise threshold from 100k to 10M to correctly
handle MI300X (192 GB = 196608 MB) and other high-VRAM GPUs
- amd.py visible GPU: use AMD-reported GPU IDs instead of enumerate index
so non-dense sets like CUDA_VISIBLE_DEVICES=1,3 report correctly
- install.sh: add ROCm <6.0 minimum version guard (no PyTorch wheels
exist for older versions); fix rocm7.1* glob to not match rocm7.10+
- is_rdna: add gfx1033-1036 for RDNA2 mobile GPUs (RX 6600M etc.)
- worker.py: increase ROCm source build timeout from 600s to 1800s;
fix success log message for ROCm source builds
- Tests: update mocks for _has_usable_nvidia_gpu, add RDNA2 target asserts
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add HIP_VISIBLE_DEVICES support, unit-aware VRAM parsing, Windows GPU validation
- hardware.py: check HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES on ROCm
before falling back to CUDA_VISIBLE_DEVICES, so multi-GPU AMD setups with
HIP-specific env vars report the correct visible device set
- amd.py: add _parse_memory_mb() that reads "unit" from dict-shaped amd-smi
JSON (e.g. {"value": 192, "unit": "GiB"}) and converts to MB correctly;
fixes MI300X VRAM misreported as 0.19 GB instead of 192 GB
- install_python_stack.py: Windows AMD warning now validates actual GPU
presence via hipinfo/amd-smi output markers before printing
- install_llama_prebuilt.py: restore amdhip64.dll fallback for Windows HIP
detection after tool-based checks, so Windows HIP installs without CLI
tools on PATH are still detected
- hardware.py: fix IS_ROCM comment to accurately describe its role
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix HIP_VISIBLE_DEVICES empty-string handling in GPU visibility spec
Use explicit None checks instead of Python `or` operator when reading
HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES, so that an empty string
("") is correctly honored as "no visible GPUs" rather than silently
falling through to CUDA_VISIBLE_DEVICES on mixed ROCm+CUDA systems.
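The difference between `or` and an explicit None check can be sketched as follows. `visible_gpu_spec` is a hypothetical name, and the HIP-before-ROCR priority order is an assumption.

```python
def visible_gpu_spec(env, is_rocm):
    """Return the raw visibility mask to honour, or None when unset."""
    if is_rocm:
        # Assumed priority: HIP_VISIBLE_DEVICES, then ROCR_VISIBLE_DEVICES.
        for name in ("HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES"):
            value = env.get(name)
            if value is not None:  # "" is a real setting: no visible GPUs
                return value
    return env.get("CUDA_VISIBLE_DEVICES")
```

With `env.get(a) or env.get(b)` an empty HIP_VISIBLE_DEVICES would be falsy and the CUDA mask would win; the None check keeps the empty string.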
* Fix IS_ROCM test assertion for multi-line formatting
* Cap torchvision/torchaudio versions, remove amdhip64.dll fallback, fix visible GPU count
- Cap torchvision<0.26.0 and torchaudio<2.11.0 alongside torch<2.11.0 in
both install.sh and install_python_stack.py to prevent resolver from
selecting incompatible companion packages from ROCm wheel index
- Remove amdhip64.dll fallback in Windows ROCm detection (DLL presence
without hipinfo/amd-smi is not proof of GPU existence)
- Fix get_visible_gpu_count() to use _get_parent_visible_gpu_spec() which
respects HIP_VISIBLE_DEVICES/ROCR_VISIBLE_DEVICES on ROCm hosts
* Attribute is_rdna() RDNA2/3/3.5/4 expansion to PR #4428
The is_rdna() expansion to cover RDNA2 (gfx1030-1036), RDNA3
(gfx1100-1103), RDNA3.5 (gfx1150-1152), and RDNA4 (gfx1200-1201)
architectures is based on the original work from PR #4428.
Co-authored-by: GoldenGrapeGentleman <yueyuan@amd.com>
Co-authored-by: billishyahao <bill.he@amd.com>
* Support AMD Radeon for studio (#4770)
Co-authored-by: Iswarya Alex <iswarya.alex@amd.com>
* Remove ROCm test files from main PR
Move test_rocm_support.py and shell test additions to a separate PR
to keep the main ROCm support PR focused on implementation changes.
* Fix installer and hardware detection issues for PR #4720
- Fix empty _tri_arg passed to uv pip install in Radeon path (causes
"Empty field is not allowed for PEP508" error)
- Fix Radeon fallback: use ROCm index instead of CPU-only when
repo.radeon.com is unreachable (TORCH_INDEX_URL already has ROCm)
- Use $TORCH_CONSTRAINT in fallback paths instead of hardcoded strings
- Fix _pick_radeon_wheel: relax suffix to match manylinux_2_28_x86_64
wheels (AMD Radeon repo does not use bare linux_x86_64 platform tag)
- Fix IS_ROCM export: use __getattr__ so callers always see the live
value after detect_hardware() runs
- Fix apply_gpu_ids: set HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES
on ROCm so _get_parent_visible_gpu_spec picks up narrowed GPU set
- Fix _parse_memory_mb: distinguish GB (1000 MB) from GiB (1024 MiB)
- Add amd-smi version as a fallback in _detect_rocm_version
- Fix trailing whitespace and missing newline at EOF in install.sh
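Why the IS_ROCM export needs a module-level `__getattr__` can be shown with two stand-in modules. Both module objects and `_package_getattr` are illustrative only; the real flag lives in hardware.py and is re-exported from `__init__.py`.

```python
import types

# Stand-in for hardware.py: the flag starts False until detection runs.
hardware = types.ModuleType("hardware")
hardware.IS_ROCM = False

# Stand-in for the package __init__: resolve IS_ROCM on every access
# instead of copying the value once at import time.
package = types.ModuleType("package")


def _package_getattr(name):
    if name == "IS_ROCM":
        return hardware.IS_ROCM
    raise AttributeError(f"module 'package' has no attribute {name!r}")


package.__getattr__ = _package_getattr

snapshot = hardware.IS_ROCM  # what `from .hardware import IS_ROCM` captures
hardware.IS_ROCM = True      # detect_hardware() runs later and flips the flag
```

Here `snapshot` stays False while `package.IS_ROCM` follows the live value.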
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix GPU detection false positives and add missing health groups
- Fix _has_rocm_gpu() false positive: require "GPU: <number>" data rows
from amd-smi list, not just header containing "gpu"
- Apply same fix in detect_host() in install_llama_prebuilt.py
- Add runtime_payload_health_groups for linux-rocm and windows-hip so
partial/corrupt ROCm/HIP prebuilt installs are properly detected
- Add bitsandbytes install to Radeon fallback paths (was only in the
success path, skipped when repo.radeon.com was unreachable)
- Keep DEVICE/CHAT_ONLY as direct imports in __init__.py (matching main)
and only use __getattr__ for IS_ROCM
* Fix _ensure_rocm_torch and Windows AMD warning false positives
- _ensure_rocm_torch: only skip when HIP is already present, not for
CUDA builds (which are unusable on AMD-only hosts). Fixes the case
where a venv has a stale CUDA wheel and the repair step is skipped.
- Windows AMD warning: use GPU data row check (same as Linux fix) to
avoid false positives from amd-smi list header-only output.
* Fix amd-smi GPU detection for GPU[N] output format
Older amd-smi versions output "GPU[0] : Card series: ..." instead of
"GPU: 0". The regex now matches both "GPU: <digit>" and "GPU[<digit>"
formats to detect actual GPU data rows.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Harden AMD GPU detection against false positives
- install.sh: replace weak amd-smi list check (awk 'NR>1 && NF') with
strict pattern matching GPU data rows (/^GPU[[:space:]]*[:\[]/)
- All files: reject rocminfo gfx000 (CPU HSA agent) by requiring
gfx[1-9] instead of gfx[0-9] in the rocminfo GPU probe
- Fixes false positives on hosts with ROCm tools but no AMD GPU
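A sketch of the two output checks as pure functions over captured tool output. The function names and exact patterns are assumptions that follow the rules stated in this message (gfx[1-9] for rocminfo, "GPU: <digit>" or "GPU[<digit>" rows for amd-smi list); the sample strings in the usage note are illustrative, not real tool logs.

```python
import re

# gfx000 is the CPU HSA agent, so require a non-zero first digit.
_ROCMINFO_GPU = re.compile(r"\bgfx[1-9][0-9a-f]*")
# Data rows look like "GPU: 0" (newer amd-smi) or "GPU[0] : ..." (older).
_AMD_SMI_ROW = re.compile(r"^\s*GPU\s*(?::\s*\d|\[\d)", re.MULTILINE)


def rocminfo_reports_gpu(output):
    """True when rocminfo output names at least one real GPU agent."""
    return bool(_ROCMINFO_GPU.search(output))


def amd_smi_list_reports_gpu(output):
    """True when `amd-smi list` output contains a GPU data row."""
    return bool(_AMD_SMI_ROW.search(output))
```

A header-only line such as "GPU  BDF  UUID" matches neither row format, so a host with ROCm tools but no AMD GPU is not routed to the ROCm path.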
* Remove duplicate comment from pre-commit merge
* Refactor: deduplicate AMD detection, consolidate bitsandbytes, clean up imports
- Extract _has_amd_rocm_gpu() shell function to avoid duplicating the
rocminfo/amd-smi GPU detection logic in get_torch_index_url and
the Radeon auto-detect block
- Consolidate bitsandbytes install into a single case block after torch
install (was duplicated 4 times across Radeon success/fallback paths)
- Move math and re imports to top of amd.py (were inline in functions)
- Add _smi_query() helper in hardware.py to centralize IS_ROCM backend
selection for get_gpu_utilization and get_visible_gpu_utilization
Addresses Gemini code review suggestions.
* Fix VRAM parsing for string values and GB/GiB consistency
- Extract unit from string-valued VRAM fields (e.g. "192 GiB") so
_parse_memory_mb correctly applies the unit multiplier instead of
treating the value as bare MB
- Treat GB and GiB identically (both as binary x1024) since GPU tools
including amd-smi use binary units even when labeling them "GB"
- Fixes incorrect VRAM reporting on MI300-class cards (was showing
~0.19 GB instead of 192 GB for string-valued outputs)
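Unit-aware VRAM parsing as described above, sketched under two assumptions: bare numbers are already MB, and only the four units in `_TO_MB` are recognised. `parse_memory_mb` here is an illustration, not the body of `_parse_memory_mb` in amd.py.

```python
import re

# GB and GiB share the x1024 multiplier, per the text above.
_TO_MB = {"mb": 1.0, "mib": 1.0, "gb": 1024.0, "gib": 1024.0}


def parse_memory_mb(raw):
    """Convert an amd-smi style VRAM field to MB; None if unparseable."""
    unit = "mb"  # assumption: bare numbers are already MB
    if isinstance(raw, dict):  # {"value": 192, "unit": "GiB"}
        unit = str(raw.get("unit", "mb")).lower()
        raw = raw.get("value")
    if isinstance(raw, str):  # "192 GiB"
        match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([A-Za-z]*)\s*", raw)
        if match is None:
            return None
        raw = float(match.group(1))
        unit = match.group(2).lower() or unit
    if isinstance(raw, bool) or not isinstance(raw, (int, float)):
        return None
    factor = _TO_MB.get(unit)
    return None if factor is None else float(raw) * factor
```

Both the dict form and the string form of 192 GiB come out as 196608 MB, which is the MI300X case mentioned above.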
* Add --no-cache to uv for ROCm HIP source builds
Avoid stale cache artifacts from partial HIP source builds when
uv is used for causal-conv1d/mamba-ssm compilation on ROCm.
The pip path already uses --no-cache-dir; this adds the uv equivalent
(--no-cache) only when is_hip is True.
* Fix critical: initialize _amd_gpu_radeon before case block
_amd_gpu_radeon was only set inside the */rocm*) case arm, so on
NVIDIA/CPU/macOS paths where TORCH_INDEX_URL does not contain "rocm",
the variable was unbound. With set -u (nounset) enabled, this crashes
the installer for every non-AMD user.
Move initialization to before the case block so it is always defined.
* Fix Windows AMD: route has_rocm hosts to HIP prebuilt path
resolve_release_asset_choice was selecting windows-cpu for all Windows
x86_64 hosts including those with has_rocm=True. Windows AMD users
should fall through to resolve_upstream_asset_choice which tries the
HIP prebuilt first. Add "not host.has_rocm" guard to the published
windows-cpu selection.
* Harden ROCm detection, Radeon wheel fallback, and HIP visibility
Addresses review findings from parallel reviewers on PR #4720:
- install.sh: add _has_usable_nvidia_gpu() helper requiring nvidia-smi -L
to actually list a GPU before treating the host as NVIDIA. Fixes the
stale-nvidia-smi-on-PATH regression where AMD-only hosts fell into the
CUDA branch.
- install.sh: fix hipconfig awk blocks to propagate a non-zero exit code
when the output is not a recognisable version string, so the ||-chain
continues to dpkg-query / rpm instead of terminating early.
- install.sh: fail-closed on Radeon wheel fallback. When torch,
torchvision or torchaudio is missing from the Radeon repo for the
active Python tag, fall back to the standard ROCm index instead of
silently mixing Radeon wheels with PyPI defaults. Quote all wheel
arguments individually so wheel filenames cannot be word-split or
glob-expanded.
- install_llama_prebuilt.py: detect_host() now requires nvidia-smi -L to
list a GPU before setting has_physical_nvidia. Routes AMD ROCm hosts
with a broken leftover nvidia-smi to the ROCm path instead of
misclassifying them as NVIDIA.
- install_llama_prebuilt.py: scan upstream assets for any rocm-<version>
prebuilt instead of hard-coding rocm-7.2, so ROCm 6.x / 7.0 / 7.1 / 7.3+
users pick up a matching upstream prebuilt when one exists.
- install_llama_prebuilt.py: validate_server() adds --n-gpu-layers 1 for
linux-rocm and windows-hip hosts, so new HIP prebuilts are preflighted
on the GPU path instead of passing validation on CPU only.
- install_llama_prebuilt.py: restore the published windows-cpu fallback
for AMD Windows hosts without a HIP prebuilt so hash-approved bundles
are still preferred over the raw upstream CPU asset.
- install_python_stack.py: drop the /opt/rocm / hipcc gate in
_ensure_rocm_torch() and rely on _has_rocm_gpu(). Runtime-only ROCm
installs (package-managed minimal installs, Radeon software) that ship
amd-smi / rocminfo without hipcc can now repair a CPU-only venv via
"unsloth studio update". Adds an explicit IS_WINDOWS / IS_MACOS guard.
- studio/backend/utils/hardware/amd.py: honour HIP_VISIBLE_DEVICES /
ROCR_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES in
get_primary_gpu_utilization(). A process restricted to GPU 2 now
reports metrics for GPU 2 instead of physical GPU 0. Tighten the plain
bytes unit detection to an explicit allowlist.
- studio/backend/utils/hardware/hardware.py: route
get_backend_visible_gpu_info()'s backend_cuda_visible_devices field
through a helper that reads HIP_VISIBLE_DEVICES on ROCm. Drop the
unconditional "(rocm=False)" suffix in apply_gpu_ids() logs.
* Fix round 2 regressions: ROCm validate_server and Windows HIP routing
Follow-up to 810b833b addressing review findings on the first round of
hardening commits:
- install_llama_prebuilt.py validate_server: gate --n-gpu-layers on the
resolved install_kind instead of host.has_rocm. AMD Windows hosts
without a HIP prebuilt fall back to windows-cpu and must not be
validated with GPU layers; thread install_kind through from the
caller.
- install_llama_prebuilt.py resolve_release_asset_choice: reinstate the
"not has_rocm" guard on the published windows-cpu bundle so AMD
Windows hosts reach resolve_upstream_asset_choice() where the new
HIP prebuilt path lives. Prefer a published windows-hip bundle first
when one exists, fall through to upstream HIP + upstream CPU
otherwise.
- install_llama_prebuilt.py detect_host: also set has_physical_nvidia
when the secondary --query-gpu block confirms a working NVIDIA GPU,
so older nvidia-smi versions without -L support do not silently skip
the Linux diagnostics that key off has_physical_nvidia.
- install_llama_prebuilt.py: drop redundant "import re as _re" /
"import re as _re_rocm" local aliases in favour of the existing
top-level "import re".
- install_python_stack.py _ensure_rocm_torch: run the AMD
bitsandbytes install unconditionally after the HIP-torch probe so
"unsloth studio update" on venvs that already have ROCm torch still
gains the AMD bitsandbytes build.
- install.sh: add a non-x86_64 early-exit to get_torch_index_url() so
aarch64 / arm64 Linux hosts do not hit the ROCm wheel index
(PyTorch only publishes ROCm wheels for linux_x86_64).
- install.sh: add bitsandbytes install to the migrated-environment
branch so upgrades pick it up for ROCm hosts instead of only the
fresh-install path.
- install.sh: in the Radeon wheel path, pass version constraints +
--no-index --find-links to uv instead of explicit wheel URLs so a
version-compatible torch / torchvision / torchaudio triple is
resolved, rather than picking the highest-version wheel for each
package independently.
- studio/backend/utils/hardware/amd.py _first_visible_amd_gpu_id: fall
through to lower-priority visibility env vars when the first entry
is malformed (leading comma, all-whitespace first token) instead of
silently returning GPU 0.
* Fix round 3 findings: x86_64 guard, ROCm version clip, Radeon deps
Address issues surfaced by the round 3 reviewers on top of 8636fa63:
- install_python_stack.py _ensure_rocm_torch: add the same `x86_64`
guard that install.sh already has. Linux aarch64 / arm64 ROCm hosts
must skip the repair path entirely; PyTorch only publishes ROCm
wheels for linux_x86_64, and without this guard
`unsloth studio update` aborts with a missing-wheel error on
non-x86_64 hosts.
- install_llama_prebuilt.py resolve_upstream_asset_choice: add a
best-effort _detect_host_rocm_version() helper (reading
/opt/rocm/.info/version, amd-smi version, hipconfig --version) and
filter rocm_candidates to entries whose major.minor is <= host
version. Falls back to the newest candidate only when no compatible
one exists, so a ROCm 6.4 host downloads rocm-6.4 instead of being
handed the numerically newest rocm-7.2 bundle (which fails preflight
and forces a source build).
- install.sh: remove the round 2 --no-index switch from the Radeon
wheel branch. --no-index forced uv to ignore PyPI entirely, which
broke transitive dependency resolution (filelock, sympy, networkx,
jinja2, fsspec, setuptools, typing-extensions, ...) on a fresh venv.
Restore the round 1 explicit wheel URL invocation but add a
torch / torchvision / torchaudio version-pair sanity check so a
mismatched trio (e.g. torch 2.9.1 + torchvision 0.23.0 + torchaudio
2.9.0) falls back to the standard ROCm index instead of installing a
broken combination.
- install_python_stack.py _ensure_rocm_torch: restructure the
"tag is None" path so it no longer short-circuits the bitsandbytes
install. On a ROCm runtime older than anything in
_ROCM_TORCH_INDEX, print the "no wheel" warning but still run the
AMD bitsandbytes install.
- studio/backend/core/training/worker.py: restore the pre-PR
"no timeout" behaviour for non-HIP causal-conv1d / mamba-ssm source
builds. The round 2 "timeout = 1800 if is_hip else 300" cap aborts
slow non-HIP builds (Linux aarch64, unsupported torch/CUDA combos)
after 5 minutes; omit timeout for the non-HIP branch so the cap
only applies to ROCm source builds.
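The ROCm version clip in resolve_upstream_asset_choice can be sketched as a pure selection over (major, minor) tuples. `pick_rocm_asset` is a hypothetical name, and treating an undetected host version as "take the newest" is an assumption.

```python
def pick_rocm_asset(candidates, host_version):
    """Pick the best (major, minor) candidate for a host ROCm version.

    candidates: iterable of (major, minor) tuples parsed from asset names.
    host_version: (major, minor) tuple, or None when detection failed.
    """
    ordered = sorted(candidates)
    if not ordered:
        return None
    if host_version is not None:
        compatible = [c for c in ordered if c <= host_version]
        if compatible:
            return compatible[-1]  # newest release not newer than the host
    return ordered[-1]  # no compatible entry: fall back to the newest
```

A ROCm 6.4 host offered rocm-6.4 and rocm-7.2 gets (6, 4); it only receives (7, 2) when nothing at or below 6.4 exists.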
* Fix round 4 findings: apply_gpu_ids env inheritance, Radeon X.Y, bitsandbytes gate
Address remaining issues surfaced by the round 4 reviewers:
- studio/backend/utils/hardware/hardware.py apply_gpu_ids: mirror the
selection into HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES whenever
the caller already had a ROCm visibility env var set, not only when
IS_ROCM has already been set by detect_hardware(). Training and
inference workers call apply_gpu_ids() before detect_hardware()
runs, so the old guard would leave a forked ROCm worker with a
stale HIP_VISIBLE_DEVICES mask that no longer matched the
narrowed CUDA_VISIBLE_DEVICES selection.
- install.sh get_radeon_wheel_url: accept X.Y ROCm versions in
addition to X.Y.Z. The `/opt/rocm/.info/version` file and some
hipconfig versions report only two components, and the Radeon
repository publishes both rocm-rel-X.Y.Z/ and rocm-rel-X.Y/
directories, so treating X.Y as invalid caused Radeon hosts to fall
back to the generic ROCm index even when a matching AMD wheel set
existed.
- install_python_stack.py _ensure_rocm_torch: only install the AMD
bitsandbytes build when the venv actually has a ROCm-compatible
torch (either already present or just installed by this function).
Previously the bitsandbytes install ran unconditionally, which
could leave an AMD bitsandbytes layered on top of a CPU/CUDA torch
on hosts where the ROCm runtime is older than any entry in
_ROCM_TORCH_INDEX. Also add --force-reinstall so an existing
CPU/CUDA bitsandbytes is replaced by the AMD build during upgrades.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix gemini findings: amd-smi metric envelope validation and dict-wrapped GPU id
Two medium-severity defensive fixes from the gemini-code-assist review on
the AMD monitoring backend:
1. _extract_gpu_metrics may return a dict where every value is None when
amd-smi succeeds (zero exit) but the JSON envelope contains no usable
fields (error response, unsupported card). The new _has_real_metrics
helper lets get_primary_gpu_utilization surface available:False and
lets get_visible_gpu_utilization skip ghost device rows so the UI
does not render placeholder cards with empty numbers.
2. Newer amd-smi versions wrap scalar fields as {"value": 0, "unit":
"none"}, including the per-GPU id. The previous int(raw_id) call
silently fell back to the enumeration index in that case, losing the
real GPU id. Routing raw_id through the existing _parse_numeric
helper handles bare ints, floats, strings, and the dict shape
uniformly, with a debug log on parse failure.
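Both fixes reduce to small value-shape checks. The sketch below uses hypothetical names (`parse_numeric`, `has_real_metrics`) and omits the debug logging; it is not the code in amd.py.

```python
def parse_numeric(raw):
    """Return a float from a bare number, numeric string, or {"value": ...}."""
    if isinstance(raw, dict):  # newer amd-smi: {"value": 0, "unit": "none"}
        raw = raw.get("value")
    if raw is None or isinstance(raw, bool):
        return None
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None  # e.g. "N/A" or a nested structure


def has_real_metrics(metrics):
    """True when at least one extracted metric is not None."""
    return any(value is not None for value in metrics.values())
```

Passing the dict-wrapped id through `parse_numeric` yields 0.0 for GPU 0 rather than falling back to the enumeration index, and an all-None metrics dict is reported as having no real data.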
* Fix gemini round 2 findings: explicit length guard on ROCm version file parser
Both _detect_rocm_version (install_python_stack.py) and
_detect_host_rocm_version (install_llama_prebuilt.py) read /opt/rocm/.info/version
or $ROCM_PATH/lib/rocm_version, split on "." and unconditionally accessed
parts[1]. The surrounding broad `except Exception: pass` already swallowed
the resulting IndexError, so a one-component file like "6\n" did fall
through to the next detection source -- but the control flow relied on
exception handling instead of an explicit check.
Add `if len(parts) >= 2:` guards in both helpers so the loop falls through
on its own without raising. Behaviour is unchanged for the common multi-
component case; the previously-silent IndexError path becomes an explicit
no-op.
* Fix gemini round 3: include has_rocm in validate_server fallback path
When validate_server is called without an explicit install_kind (older
call sites that have not been updated), the fallback was only enabling
--n-gpu-layers for NVIDIA and macOS arm64 hosts. AMD ROCm Linux hosts
fell through to the CPU validation path even though the prebuilt being
exercised was a HIP binary.
Add host.has_rocm to the fallback expression so the GPU offload flag is
applied consistently with the install_kind=='linux-rocm' / 'windows-hip'
branches above.
* Fix gemini round 4: remove risky bytes-vs-MB heuristic in _parse_memory_mb
The previous heuristic divided any bare number above 10_000_000 by
1024*1024 on the assumption that large unit-less values were bytes.
This misclassified small VRAM allocations: 5 MB of used VRAM reported
as 5_242_880 bytes without a unit would be taken at face value and
render as 5_242_880 MB (~5 TB) in the monitoring UI.
Modern amd-smi always provides explicit units (MiB/GiB dict form),
and legacy amd-smi returns bare numbers in MB -- the heuristic never
had a real workload to handle. Drop it and default to MB for bare
numeric input, keeping the existing unit-aware branches for dict /
string inputs unchanged.
The unrelated gemini suggestion to "default minor to 0" in the
amd-smi version awk parser was intentionally NOT applied: rocm7.0
and rocm7.1 ship different wheel sets, so silently substituting 0
for a missing minor could install the wrong wheels. The existing
reject-and-fall-through behaviour is safer.
* Fix gemini round 5: POSIX compliance and leading-comma visibility parsing
Three medium findings from gemini-code-assist addressed in this commit:
1. _pick_radeon_wheel used grep -o and sort -V, both GNU extensions
that are not in POSIX and break on BSD/BusyBox coreutils. install.sh
has a #!/bin/sh shebang so the whole pipeline was rewritten as a
single awk script that extracts all href="..." hits on each line,
filters to wheels matching the package prefix and python tag, and
picks the newest version via zero-padded lexical comparison. No
external sort or grep is needed.
2. _first_visible_amd_gpu_id in the AMD monitoring backend treated a
leading comma (e.g. HIP_VISIBLE_DEVICES=",1") as "fall through to
the next env var", which is surprising given the clear intent to
narrow to device 1. Filter empty tokens after the split and return
the first real one. An all-commas value ("," / ",,,") still falls
through because no real tokens exist; the empty-string and "-1"
explicit-zero cases are unchanged.
The unrelated amd-smi version awk parser suggestion was not applied
(see round 4 commit message for rationale: defaulting a missing minor
to 0 could silently install the wrong ROCm wheel set).
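The empty-token filtering from item 2 can be sketched as below. The helper
name is hypothetical, and the surrounding env-var lookup order and the
empty-string / "-1" handling are deliberately left out:

```python
from typing import Optional


def first_real_token(raw: str) -> Optional[str]:
    """Return the first non-empty comma-separated token, or None.

    Hypothetical helper; only the token filtering is illustrated.
    """
    tokens = [token.strip() for token in raw.split(",")]
    # Drop empty tokens so a leading comma no longer hides the real id.
    tokens = [token for token in tokens if token]
    # All-commas input has no real tokens, so the caller falls through.
    return tokens[0] if tokens else None
```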
* Fix 20-reviewer.py findings: base drift, Radeon %2B, dpkg/rpm fallback, bnb, backend label
Consolidated fix batch from a 20-parallel reviewer.py run on the current
head. Each fix is drawn from a high-consensus finding and addresses a
real bug or feature gap, not a stylistic preference.
1. install.sh: bump `unsloth>=2026.4.2` -> `unsloth>=2026.4.4` at five
call sites so this branch no longer regresses main's version floor
(main bumped to 2026.4.4 in #4876). Without this, merging 4720 would
silently downgrade the minimum version pin for fresh installs.
2. install.sh: URL-decode Radeon wheel names before extracting the
torch / torchvision / torchaudio version strings. Real wheel URLs
from repo.radeon.com are percent-encoded ("torch-2.10.0%2Brocm7.2.0...")
so the previous `[+-]` terminator in the sed regex never matched,
`_torch_ver` stayed empty, `_radeon_versions_match` stayed false,
and every Radeon consumer install silently fell back to the generic
ROCm index. Now decode %2B -> + first, then extract, then validate.
3. install.sh: the two AMD bitsandbytes install lines were running
`uv pip install "bitsandbytes>=0.49.1"` without `--force-reinstall`,
so upgrades where the venv already has a CPU/CUDA bitsandbytes
satisfying the constraint would keep the stale non-AMD wheel. Add
`--force-reinstall --no-cache-dir` to both call sites, matching the
pattern already used in install_python_stack.py::_ensure_rocm_torch.
4. install_python_stack.py and install_llama_prebuilt.py: add
`dpkg-query -W rocm-core` and `rpm -q rocm-core` fallbacks to the
Python-side ROCm version detectors so they match the chain in
install.sh::get_torch_index_url. Package-managed ROCm installs
(Debian/Ubuntu/RHEL/Fedora distro packages) can expose GPUs via
rocminfo/amd-smi but still lack /opt/rocm/.info/version, hipconfig,
or amd-smi `version` output -- without these fallbacks, `unsloth
studio update` on such hosts returned None and skipped the ROCm
torch repair. Also strip the dpkg epoch prefix ("1:6.3.0-1") before
parsing so epoch-annotated packages parse correctly.
5. hardware.py: add a `_backend_label(device)` helper that returns
"rocm" when IS_ROCM is set and the device is DeviceType.CUDA, and
use it for every `"backend": ...` emission in JSON responses served
to the Studio frontend. Internally we still represent ROCm hosts as
DeviceType.CUDA (ROCm torch reuses the whole torch.cuda.* API
surface), but the user-facing API now correctly reports "rocm" on
AMD boxes instead of labeling them as "cuda".
All 250 simulation scenarios pass (was 233 before this batch: added 17
new regression tests covering the version pin, %2B decoding, bnb
force-reinstall flags, dpkg/rpm fallback presence, and the
_backend_label helper's four-way truth table).
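For reference, the four-way truth table of the backend label can be
sketched as below. The DeviceType enum here is a stand-in with assumed
members, and is_rocm is passed explicitly instead of reading the IS_ROCM
flag:

```python
from enum import Enum


class DeviceType(Enum):
    # Stand-in enum for illustration; the real one may have more members.
    CUDA = "cuda"
    CPU = "cpu"


def backend_label(device: DeviceType, is_rocm: bool) -> str:
    """Report "rocm" only for the CUDA device type on a ROCm host."""
    if is_rocm and device is DeviceType.CUDA:
        return "rocm"
    return device.value
```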
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix gemini round 6 + URL audit: amd.py defensive checks, rocm6.5+ clip to 6.4
Two rounds of fixes in one commit, plus a full URL audit of every PyPI /
download.pytorch.org / repo.radeon.com reference the PR introduces.
amd.py (4 medium gemini findings on commit b3627bc2):
1. _extract_gpu_metrics used `and vram_total_mb` as part of the vram_util
gate. The follow-up `vram_total_mb > 0` already handles the division
guard, but the truthiness check was redundant and slightly surprising
for a valid 0.0 value. Replace with explicit `is not None and > 0`
for both vram_util and power_util.
2. get_physical_gpu_count called `data.get("gpu", ...)` without guarding
for non-dict envelopes. A scalar / string JSON response from amd-smi
would raise AttributeError. Add an isinstance(data, dict) check and
return None for unexpected shapes.
3. get_visible_gpu_utilization had the same .get() exposure on the outer
envelope. Rewrite the gpu_list extraction as an explicit
list/dict/else cascade so a malformed scalar envelope produces
gpu_list=[data] and continues without raising.
4. The same function's per-entry loop also called gpu_data.get() on
whatever was inside gpu_list. If a scalar ever leaks into the list
(directly or via the previous fix's fallback), _extract_gpu_metrics
would raise on the first .get() inside the helper. Skip non-dict
entries in the loop before extracting metrics.
install.sh (URL audit finding, previously flagged by 20-reviewer as #13):
5. get_torch_index_url used `rocm6.*` in the rocm tag case statement,
which matched rocm6.5 and rocm6.6 and emitted
download.pytorch.org/whl/rocm6.5 -- which returns HTTP 403 because
PyTorch only publishes rocm 5.7, 6.0-6.4, 7.0-7.2. Enumerate the
supported 6.x minors explicitly and add a rocm6.* fallback branch
that clips to rocm6.4 (the last supported 6.x wheel set).
URL audit results (all URLs PR 4720 references):
- 14/14 download.pytorch.org/whl/{cpu,cu118,cu124,cu126,cu128,cu130,
rocm6.0..6.4,rocm7.0..7.2} return HTTP 200.
- 9/9 repo.radeon.com/rocm/manylinux/rocm-rel-{5.7,6.0,6.1,6.2,6.3,
6.4,7.0,7.1,7.2}/ return HTTP 200.
- X.Y.Z patch directories exist for 7.0.2, 7.1.1, 7.2.1 but NOT for
6.3.0, 6.4.0, 6.2.1 -- install.sh already handles this via the X.Y.Z
-> X.Y fallback sed in the Radeon wheel install block.
- Docs links (rocm.docs.amd.com, docs.unsloth.ai AMD guide) and the
llama.cpp GitHub releases API endpoint all return 200.
Test suite: 255 -> 258. New regression coverage:
- U17: get_physical_gpu_count tolerates scalar amd-smi envelope
- U18: get_visible_gpu_utilization tolerates scalar envelope
- U19a-c: vram_util / power_util return None on zero total, but
vram_total_gb still echoes 0.0 (not None)
- A_rocm{6.5,6.6,6.9}_clips_to_rocm64: install.sh clips unsupported
6.x minors to rocm6.4 instead of producing a 403 index URL
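The clipping rule from item 5 lives in a shell case statement; the same
logic is sketched here in Python for readability. The function name is
hypothetical, and passing non-6.x tags through unchanged is an assumption
of this sketch:

```python
# 6.x minors with a published wheel index, per the URL audit above.
SUPPORTED_ROCM6_TAGS = {"rocm6.0", "rocm6.1", "rocm6.2", "rocm6.3", "rocm6.4"}


def clip_rocm6_tag(tag: str) -> str:
    """Map unsupported rocm6.x tags to rocm6.4; keep supported ones."""
    if tag in SUPPORTED_ROCM6_TAGS:
        return tag
    if tag.startswith("rocm6."):
        # Fallback branch: clip to the last supported 6.x wheel set.
        return "rocm6.4"
    # Assumption: other tags are handled elsewhere and pass through.
    return tag
```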
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix reviewer.py round 2: tokenizer AMD multi-GPU, --no-torch bnb, main.py backend label
Three high-confidence findings from a second 20-parallel reviewer.py run
on commit 7effb3ae. Triaged 15 total findings and applied the three that
were confirmed as real bugs; the rest were either false positives (e.g.
"migrated AMD venv not repaired" -- _ensure_rocm_torch runs downstream
via setup.sh regardless), design decisions (e.g. visibility mask env
vars not consulted in installer detection), or edge cases the existing
fallback logic already handles.
1. unsloth/tokenizer_utils.py [6/20]: the multi-GPU guard's shell probe
runs `nvidia-smi --query-gpu=memory.used`, catches the failure, then
only raises if `torch.cuda.is_available()` is False. On ROCm torch,
torch.cuda.is_available() returns True (ROCm reuses the torch.cuda.*
API), so the guard becomes dead code on AMD hosts and multi-GPU AMD
setups slip through even though unsloth does not support them yet.
Add a torch.cuda.device_count() > 1 fallback inside the except so
AMD multi-visible-device setups are flagged consistently with the
original CUDA memory check.
2. install.sh [1/20]: the fresh-install bitsandbytes block for AMD ROCm
ran unconditionally when TORCH_INDEX_URL matched `*/rocm*`, even when
SKIP_TORCH=true (from --no-torch or Intel Mac auto-detect). A user
running `install.sh --no-torch` on an AMD host would still pull in
bitsandbytes despite explicitly asking for GGUF-only mode. Wrap the
case block in an outer `[ "$SKIP_TORCH" = false ]` guard.
3. studio/backend/main.py [3/20]: the /api/system endpoint returned
`"device_backend": get_device().value`, which is "cuda" on ROCm
hosts (because ROCm torch piggybacks on torch.cuda). Other endpoints
(hardware.py) already use the _backend_label helper which swaps
"cuda" -> "rocm" when IS_ROCM. Route /api/system through the same
helper so the Studio UI reports the backend consistently across all
endpoints.
4. studio/backend/tests/test_utils.py: update test_backend_matches_device
to call _backend_label(get_device()) instead of raw get_device().value
so the test matches the new contract and still passes on CUDA hosts.
Tests: 258 -> 261. New regression coverage:
- X08 main.py /api/system uses _backend_label
- X09 tokenizer multi-GPU guard has device_count() fallback
- X10 fresh-install bnb case block gated on SKIP_TORCH=false
* fix: prevent bitsandbytes from overwriting ROCm torch with CUDA wheels
During install, bitsandbytes was installed without --no-deps, causing
uv to resolve torch from PyPI (CUDA build) and silently overwrite the
ROCm wheels that were just installed in the previous step.
This happened in three places:
- install.sh: bitsandbytes install in both migrated and fresh paths
- install_python_stack.py: bitsandbytes install inside _ensure_rocm_torch()
Additionally, multiple install steps in install_python_stack.py (extras,
overrides, studio deps) can pull in CUDA torch via transitive
dependencies. A final _ensure_rocm_torch() call at the end of the
install sequence ensures ROCm torch is always in place at runtime.
All changes are gated behind ROCm-specific conditions and do not affect
NVIDIA, CPU-only, macOS, or Windows install paths.
Tested on AMD Instinct MI300X VF with ROCm 7.2.0 -- confirms
torch==2.10.0+rocm7.1 with HIP 7.1.25424 after install.
* fix: ROCm inference fallback -- skip Unsloth patching and bnb 4-bit on HIP
On AMD ROCm (HIP), two issues prevent the normal Unsloth inference path:
1. Unsloth's global monkey-patching of transformers model classes
(LlamaRotaryEmbedding, attention modules) triggers
_assert_async_cuda_kernel crashes on HIP during generation.
Training uses different code paths and works fine.
2. bitsandbytes 4-bit matmul kernels also trigger HIP assertion
failures on MI300X (CDNA3 / gfx942), even without Unsloth patching.
This commit adds a ROCm-specific inference fallback that:
- Skips importing Unsloth at module level (prevents global patching)
- Loads models in 16-bit with plain transformers + PEFT instead
- Resolves pre-quantized model names (e.g. "xxx-bnb-4bit" -> "xxx")
since pre-quantized HF repos still trigger bnb codepaths
- Guards get_chat_template calls (unavailable without Unsloth import)
- Fixes max_seq_length=0 being passed to from_pretrained (GGUF
semantics don't apply to transformers path)
The NVIDIA path is completely unchanged -- Unsloth import and
for_inference() optimization remain active. GGUF inference (via
llama-server/HIP) is unaffected since it never imports Python model
classes. AMD GPUs typically have large VRAM (e.g. 192GB on MI300X)
so 16-bit loading is practical for inference.
Tested on AMD Instinct MI300X VF (ROCm 7.2, HIP 7.1.25424):
- Simple generation: PASS
- Compare mode (base vs finetuned): PASS
- GGUF inference + tool calling: PASS (unaffected by this change)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: guard audio/vision inference on ROCm, remove unused import
- Add clear RuntimeError for audio/vision model inference on ROCm
(these paths use Unsloth's FastModel/FastVisionModel which would
crash on HIP; GGUF inference is the supported path on AMD)
- Remove unused `import os as _os` from the ROCm changes
* fix: amd-smi parsing for newer output format (gpu_data wrapper, mem_usage, temperature)
amd-smi on recent ROCm versions (7.x) wraps metric output in a
{"gpu_data": [...]} envelope instead of returning a raw list. This
caused get_primary_gpu_utilization() and get_visible_gpu_utilization()
to fail silently (returning available=False) because the GPU data
dict was never unwrapped.
Additionally:
- VRAM data moved from "vram" to "mem_usage" with "total_vram" /
"used_vram" keys. Added fallback key lookup.
- Temperature "edge" sensor returns "N/A" on MI300X VF; the previous
dict.get() chain returned the "N/A" string instead of falling
through to "hotspot". Changed to a loop that checks each key until
a parseable value is found.
Tested on AMD Instinct MI300X VF (ROCm 7.2, amd-smi 24.x):
- GPU utilization: 0% (idle), up to 100% during training
- Temperature: 40-44C (from hotspot sensor)
- VRAM: 0.28/191.69 GB (idle)
- Power: 158-211W draw
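A rough sketch of the two parsing changes, using hypothetical helper names.
The {"value": ...} nested form for a sensor reading is an assumption; the
"edge" then "hotspot" order follows the description above:

```python
from typing import Any, List, Optional, Sequence


def unwrap_gpu_list(data: Any) -> List[Any]:
    """Return the per-GPU list from either envelope shape."""
    # Newer output: {"gpu_data": [...]}; older output: a raw list.
    if isinstance(data, dict) and isinstance(data.get("gpu_data"), list):
        return data["gpu_data"]
    if isinstance(data, list):
        return data
    return [data]


def first_parseable_temperature(
    temps: dict, keys: Sequence[str] = ("edge", "hotspot")
) -> Optional[float]:
    """Check each sensor key in order until one parses as a number."""
    for key in keys:
        raw = temps.get(key)
        if isinstance(raw, dict):
            raw = raw.get("value")  # assumed nested form
        try:
            return float(raw)
        except (TypeError, ValueError):
            # "N/A" or a missing key: try the next sensor.
            continue
    return None
```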
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Bug fix detecting radeon (#4940)
* Bug fix detecting radeon
* Expanding GPU target for gfx1100*
* Generalize gfx family-prefix filter to cover gfx10/gfx12 as well
rocminfo on ROCm 6.1+ emits LLVM generic-family ISA lines alongside the
specific GPU (e.g. gfx11-generic next to gfx1100). The outer grep captures
the bare family prefix from the generic line, and passing that to
-DGPU_TARGETS breaks the HIP build because clang only accepts specific
gfxNNN ids.
The previous filter only special-cased gfx11. Generalize it so any bare
2-digit family prefix (gfx10, gfx11, gfx12, ...) is dropped whenever a
specific sibling target is present in the same list. No real AMD GPU has
a 2-digit gfx id, so the filter can only ever drop family prefixes and
never a real target.
Covers the existing gfx11 cases unchanged, and extends the same fix to
gfx10-1-generic / gfx10-3-generic (RDNA1/2) and gfx12-generic (RDNA4),
which would otherwise hit the same build failure on newer rocminfo.
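The filter itself is shell; the rule is sketched here in Python with a
hypothetical function name. "Sibling" is interpreted as a specific target
that starts with the family prefix, which is an assumption of this sketch:

```python
import re
from typing import List

# A bare family prefix is "gfx" followed by exactly two digits.
_FAMILY_PREFIX = re.compile(r"^gfx\d{2}$")


def drop_family_prefixes(targets: List[str]) -> List[str]:
    """Drop bare 2-digit gfx prefixes when a specific sibling is present."""
    specific = [t for t in targets if not _FAMILY_PREFIX.match(t)]
    kept = []
    for target in targets:
        is_prefix = _FAMILY_PREFIX.match(target) is not None
        if is_prefix and any(s.startswith(target) for s in specific):
            continue  # e.g. gfx11 next to gfx1100
        kept.append(target)
    return kept
```

Ids such as gfx942 or gfx90a never match the two-digit pattern, so they are
always kept.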
---------
Co-authored-by: Iswarya Alex <iswarya.alex@amd.com>
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
---------
Co-authored-by: Eda Z <eda.zhou@amd.com>
Co-authored-by: GoldenGrapeGentleman <yueyuan@amd.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: billishyahao <bill.he@amd.com>
Co-authored-by: Iswarya Alex <47045679+iswaryaalex@users.noreply.github.com>
Co-authored-by: Iswarya Alex <iswarya.alex@amd.com>
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* updated models template mappers. added lfm2.5vl450m to transformers 5.3.0 whitelist
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix: check find() return value before adding offset in try_fix_tokenizer
The `str.find()` result was checked for -1 only after adding
`len(find_text)`, turning the guard into dead code. When the substring
is absent, `start` becomes `len(find_text) - 1` (a positive number),
so the `if start == -1: continue` never triggers and the subsequent
slice extracts garbage from the tokenizer string.
Split the find and offset into two steps so the -1 check works correctly.
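The pattern, sketched with a hypothetical helper rather than the actual
try_fix_tokenizer body (the terminator default and the end-of-match guard
are assumptions of this sketch):

```python
from typing import Optional


def extract_after(haystack: str, find_text: str, terminator: str = '",') -> Optional[str]:
    """Return the text between find_text and the next terminator, or None."""
    # Step 1: find, and check for -1 BEFORE adding any offset.
    start = haystack.find(find_text)
    if start == -1:
        return None
    # Step 2: only now move past the matched prefix.
    start += len(find_text)
    end = haystack.find(terminator, start)
    if end == -1:
        return None
    return haystack[start:end]
```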
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add defensive guards for token_id None and end find() returning -1
- Skip loop iteration early when token_id is None to avoid constructing
a find_text that can never match valid JSON
- Guard end = tokenizer_string.find('",', start) against -1 to prevent
silent garbage extraction from malformed tokenizer strings
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(chat): sticky composer bar in thread
* fix(chat): fix compare pane clipping
* fix(chat): tighten scroll-to-bottom placement and compare footer spacing
* Fix TypeScript build break and clean up ViewportFooter classes
- Remove unused `compact` prop from ThreadScrollToBottom call site
(component is FC with no props, passing it caused TS2322)
- Extract shared classes (sticky, bottom-0, z-20, bg-transparent) from
ternary branches into the unconditional className string
- Restore `relative` on normal-mode footer so the inner absolute
bg-background strip has a positioning context
- Remove redundant md:pb-3 / md:pb-4 (same value as base pb-3 / pb-4)
- Remove no-op `sticky bottom-0` from SharedComposer wrapper in both
LoraCompareContent and GeneralCompareContent (flex layout with
shrink-0 already pins it at the bottom; parent has no scrollable
overflow for sticky to bind to)
- Fix truncated comment on pointer-events rationale
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Fix raw text paragraph break normalization
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Normalize horizontal whitespace before stripping non-ASCII and collapse leftover doubles
Run the [^\S\n]+ horizontal-whitespace collapse before the non-ASCII strip
so that Unicode whitespace (\u00A0, \u202F, \u2009, \u3000, \v, \f, etc.)
becomes a single ASCII space instead of being deleted outright. The prior
ordering silently merged adjacent words on HTML/PDF/OCR-sourced text:
"hello\u00a0world" produced "helloworld" under this PR's earlier ordering;
it now produces "hello world".
Also drop \t from the allow-list since the horizontal-whitespace collapse
already normalizes tabs to a single space, and add a targeted [ ]{2,} pass
right after the non-ASCII strip so that a non-whitespace non-ASCII character
sitting between two spaces ("word1 (c) word2") does not leave an interior
double space. Without this extra pass, clean_text was not idempotent on
such inputs: the first call produced "word1  word2" (two interior spaces)
and only the second call collapsed it to "word1 word2". Fuzz testing over
10000 random inputs now satisfies the idempotence invariant in every case.
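The ordering described above can be sketched as three passes. This is not
the real clean_text: the function name is hypothetical and the allow-list
(printable ASCII plus newline) is an assumption:

```python
import re


def clean_text_sketch(text: str) -> str:
    """Illustrate the pass ordering only; other clean_text steps are omitted."""
    # 1. Collapse horizontal whitespace (including Unicode spaces such as
    #    \u00a0) to one ASCII space, before anything is stripped.
    text = re.sub(r"[^\S\n]+", " ", text)
    # 2. Strip remaining characters outside the assumed allow-list.
    text = re.sub(r"[^\x20-\x7E\n]", "", text)
    # 3. Collapse doubles left behind when a stripped character sat
    #    between two spaces.
    text = re.sub(r"[ ]{2,}", " ", text)
    return text
```

Without pass 3, a second call would be needed to remove the leftover double
space, which is the idempotence failure described above.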
* Add regression tests for Unicode/control whitespace and non-ASCII edge cases
Cover:
- Unicode horizontal whitespace separators (NBSP, narrow NBSP, thin space,
en/em space, ideographic space, vertical tab, form feed) normalizing to
a single ASCII space instead of being deleted.
- Mixed paragraph + Unicode whitespace realistic input ("Section\u00a01\r\n\r\nBody\ftext\u202Fhere").
- Tab collapsing and space trimming around newlines.
- Non-whitespace non-ASCII characters (copyright, accented letters, emoji)
sitting between spaces: must not leave an interior double space, and
clean_text must be idempotent on these inputs.
- Non-ASCII characters adjacent to a newline: stripping must not leave
stray leading or trailing spaces on the neighbouring line, and must not
swallow an adjacent paragraph break.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Fix Mistral training crash when xformers is unavailable
* Fix/adjust Mistral DPO training crash fix for PR #4889
- Clarify comment in MistralForCausalLM_fast_forward: the DPO embed-masking
block runs BEFORE attention_mask is nulled out, and it is the consumer that
requires a 2D mask.
- Add defensive attention_mask.ndim == 2 guard to the LlamaModel_fast_forward
DPO embed-masking block so it self-protects if a 4D mask ever reaches it.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Only run ldconfig CUDA-linking recovery when we have permission
When `import unsloth` runs in a non-root environment (shared HPC,
locked-down container, CI runner, etc.) the CUDA-linking recovery path
shells out to `os.system("ldconfig /usr/lib64-nvidia")`, which fails
loudly with "Permission denied". It's especially noisy for users who
don't even have bitsandbytes installed - they're doing 16bit or full
finetuning and the line immediately above told them "16bit and full
finetuning works!". The reason the recovery runs at all in that case
is that `bnb.functional.lib.cdequantize_blockwise_fp32` raises
AttributeError when `bnb` is None, the bare `except:` swallows it, and
the code drops into the recovery unconditionally.
Fix: gate the recovery body on `os.geteuid() == 0`. When we don't
have permission to run ldconfig, silently skip the recovery. When we
do, the recovery runs UNCHANGED - same `os.system()` calls, same
reload + retry, same warnings. `libcuda_dirs()` is used by both triton
and bitsandbytes, so we still want to run the recovery whenever we
have permission, regardless of whether bnb is installed.
For non-root users who DO have bitsandbytes installed and broken,
emit a single remediation warning telling them how to fix it manually
(`sudo ldconfig /usr/lib64-nvidia`). This preserves the diagnostic
guidance from the original code without the Permission denied noise.
Scope:
- Only the `DEVICE_TYPE == "cuda"` branch is touched.
- The `hip` (AMD ROCm) and `xpu` (Intel) branches are unchanged.
- On a real CUDA box running as root, behavior is byte-identical to
main: same os.system() calls, same reload, same retry, same warnings.
AST-verified by /tmp/verify_minimal/verify.py.
- `hasattr(os, "geteuid")` guards against Windows where `os.geteuid`
doesn't exist.
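The permission gate can be sketched as below. The function name and the
injectable os_module parameter are hypothetical and exist only to make the
check easy to exercise:

```python
import os
from typing import Any


def can_run_ldconfig(os_module: Any = os) -> bool:
    """True only when running as root on a platform that has geteuid."""
    # hasattr guards Windows, where os.geteuid does not exist.
    return hasattr(os_module, "geteuid") and os_module.geteuid() == 0
```

The recovery body would run only when this returns True; otherwise it is
skipped without attempting ldconfig.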
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <info@unsloth.ai>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* feat: inject local model provider into recipe jobs via JWT
* feat: auto-generate JWT for local model providers in recipes
* feat: add is_local flag to model provider config types and utils
* fix(studio): skip endpoint validation for local providers
* feat(studio): add local/external model source toggle to provider dialog
* feat(studio): thread localProviderNames through model config dialog chain
* feat(studio): show 'Local model (Chat)' label for local model_provider configs
* fix: hardcode loopback for local endpoint, clear stale creds on toggle
* fix: document TOCTOU/JWT rotation, add deferred import comments, fix is_local serialization
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix(studio): clear stale local model state on provider toggle and validation
* fix(studio): override empty local endpoint in validation and skip model gate for unused providers
* fix(studio): resolve loopback port from app.state, clear stale local provider fields, sync model id on toggle
Address review feedback on the local-model-provider flow:
- Backend (jobs.py): _resolve_local_v1_endpoint now reads the actual bound
port from app.state.server_port (set in run.py after binding) instead of
parsing it out of request.base_url, which is wrong behind any reverse
proxy or non-default port. The two duplicated urlparse blocks are gone.
- Backend (jobs.py): defensively pop api_key_env, extra_headers, extra_body
from local providers so a previously external provider that flipped to
local cannot leak invalid JSON or rogue auth headers into the local /v1
call. Also dedupe the post-loop assignment and tighten the local-name
intersection so empty names cannot match.
- Backend (jobs.py): hoist datetime and urllib.parse imports to the top
import block for consistency with the rest of the file.
- Backend (run.py): expose the bound port on app.state.server_port after
the uvicorn server is constructed.
- Frontend (model-provider-dialog.tsx): clear extra_headers and extra_body
when toggling to local mode. Hidden inputs would otherwise keep stale
JSON blocking validate/run.
- Frontend (model-config-dialog.tsx): factor the local-aware provider
selection logic into applyProviderChange and call it from both
onValueChange and onBlur, so manually typing a provider name and tabbing
away keeps the model field consistent.
- Frontend (recipe-studio.ts store): handle both directions of the
is_local toggle in the cascade. external -> local now backfills
model: "local" on already-linked model_configs so they pass validation
immediately, mirroring the existing local -> external clear path.
- Frontend (validate.ts + build-payload.ts): thread localProviderNames
into validateModelConfigProviders and skip the "model is required"
check for local-linked configs. Local providers do not need a real
model id since the inference endpoint uses the loaded Chat model.
* fix(studio): narrow store cascade types, sync model placeholder on graph relink and node removal, harden ephemeral port path
Loop 2 review fixes:
- recipe-studio.ts: type-narrow next.is_local by also checking
next.kind === "model_provider". TS otherwise raised TS2339 because
next was typed as the union NodeConfig after the spread. The behavior
is unchanged but the code now compiles cleanly.
- model-config-dialog.tsx: convert the lastProviderRef / providerInputRef
ref-during-render pattern (pre-existing react-hooks/refs lint error)
to a useEffect that syncs providerInputRef from config.provider. The
combobox blur path still uses applyProviderChange and remains stable.
- recipe-graph-connection.ts: when a graph drag links a model_provider
to a model_config, mirror the dialog applyProviderChange behavior:
fill model: "local" if the new provider is local and the model field
is blank, clear model when relinking from a local placeholder to an
external provider, otherwise leave the model alone.
- reference-sync.ts: when a referenced provider node is removed, clear
the synthetic model: "local" placeholder along with the provider
field, so a future relink to an external provider does not pass
validation with a stale value that fails at runtime.
- run.py: only publish app.state.server_port when the bound port is a
real positive integer; for ephemeral binds (port==0) leave it unset
and let request handlers fall back to request.base_url.
- jobs.py: _resolve_local_v1_endpoint also falls back when
app.state.server_port is non-positive, and uses `is None` instead of
the truthy fallback so a literal 0 is handled correctly.
* fix(studio): strict is_local check, narrow loaded-model gate to LLM-reachable configs, add scope-server port fallback
Loop 3 review fixes:
- jobs.py, validate.py: require `is_local is True` instead of truthy
check. Malformed payloads such as is_local: "false" or is_local: 1
would otherwise be treated as local and silently rewritten to the
loopback endpoint.
- jobs.py: _resolve_local_v1_endpoint now tries request.scope["server"]
(the actual uvicorn-assigned (host, port) tuple) as a second
resolution step before falling back to parsing request.base_url.
This covers direct-uvicorn startup paths and ephemeral binds that
never publish app.state.server_port.
- jobs.py: new _used_llm_model_aliases helper collects the set of
model_aliases that an LLM column actually references, and the
"Chat model loaded" gate is now only triggered when a local
provider is reachable from that set. Orphan model_config nodes on
the canvas no longer block unrelated recipe runs.
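A simplified sketch of the strict is_local check and the port resolution
order after loop 3. The function names and the plain-argument signature are
hypothetical; the real code reads these values from the request object:

```python
from typing import Any, Optional, Tuple
from urllib.parse import urlparse


def is_local_provider(provider: dict) -> bool:
    """Only a literal True counts; "false" or 1 do not."""
    return provider.get("is_local") is True


def _positive_int(value: Any) -> bool:
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def resolve_port(
    state_port: Any,
    scope_server: Optional[Tuple[str, Optional[int]]],
    base_url: str,
) -> Optional[int]:
    """Try the published port, then the ASGI server tuple, then the URL."""
    if _positive_int(state_port):
        return state_port
    if scope_server is not None and _positive_int(scope_server[1]):
        return scope_server[1]
    # Last resort: parse the port out of the request's base URL.
    return urlparse(base_url).port
```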
* fix(studio): force skip_health_check on local-linked configs, skip JSON parsing for local providers, local-aware inline editor
Loop 4 review fixes:
- jobs.py: after rewriting local providers, also force
skip_health_check: true on any model_config linked to a local
provider. The /v1/models endpoint only advertises the real loaded
model id, so data_designer's default model-availability health check
would otherwise fail against the placeholder "local" id before the
first chat completion call. The inference route already ignores the
model id in chat completions, so skipping the check is safe.
- builders-model.ts: buildModelProvider now short-circuits for local
providers and emits only { name, endpoint: "", provider_type, is_local }
without running parseJsonObject on the hidden extra_headers/extra_body
inputs. Imported or hydrated recipes with stale invalid JSON in those
fields no longer block client-side validate/run.
- inline-model.tsx: the model_config branch now accepts an optional
localProviderNames prop and mirrors the dialog applyProviderChange
behavior. Changing provider to/from a local one auto-fills or clears
the "local" placeholder consistently with the other edit paths.
- recipe-graph-node.tsx: derive localProviderNames from the store via
useMemo (stable identity) and pass it through renderNodeBody to
<InlineModel>. Hooks order is preserved by declaring them above the
early return for markdown_note nodes.
- run.py: minor comment tweak - loop 3 already added the scope-server
fallback path, note that in the comment.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: danielhanchen <info@unsloth.ai>
* split venv_t5 into venv_t5_530 and venv_t5_550 for tiered transformers 5.x support
* fix bfloat16 crash on T4 for FORCE_FLOAT32 models and disable trust_remote_code auto-enable for native t5 models
* revert FORCE_FLOAT32 dtype change
* restrict trust_remote_code auto-enable to Nemotron models only
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* use config.json model_type for tier detection, add unsloth/nvidia namespace guard
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Revert "[pre-commit.ci] auto fixes from pre-commit.com hooks"
This reverts commit fb43d468e2.
* Revert "use config.json model_type for tier detection, add unsloth/nvidia namespace guard"
This reverts commit fc49ae2453.
* add unsloth/nvidia namespace guard to Nemotron trust_remote_code auto-enable
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* reorder tier checks: all substring matches before config.json fetches
* extract shared activate_transformers_for_subprocess into transformers_version.py
* narrow Nemotron trust_remote_code to nemotron_h/nemotron-3-nano, add to export worker
* clean venv_t5 dirs before re-install in setup.sh, clarify version alias comment
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* run venv_t5 migration outside deps fast-path gate in both setup scripts
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(chat): prevent implicit empty thread creation and stabilize new-chat flow
* fix(chat): harden compare thread sync and simplify sidebar thread query
* fix(chat): harden new-thread state sync and isolate compare active thread updates
* fix(chat): stabilize new-thread state sync and prevent compare/session bleed
* Fix thread restoration, handleNewThread guard, sidebar filter, and delete flow
- Remove __LOCALID_ filter from getInitialSingleChatView: in this
Dexie-backed adapter, AUI's __LOCALID_ prefixed IDs ARE the real
persistent thread IDs stored by initialize(). Filtering them out
breaks thread restoration on navigation.
- Simplify handleNewThread to synchronous: the async Dexie message
check is redundant (persistence is already deferred to first append)
and strands users on legacy empty threads. Use a simple guard that
checks the store's activeThreadId to detect unsent drafts.
- Add message-count filter to sidebar: filter threads to only show
those with at least one message, hiding legacy empty threads.
- Add store-based sidebar highlighting fallback: use activeThreadId
from the store when view.threadId is not set (nonce-backed chats).
- Fix handleDelete to call onNewThread() instead of onSelect(), and
clear activeThreadId, so the runtime properly resets after deleting
the active thread.
* Fix handleDelete nonce path and restore __LOCALID_ filter
handleDelete was calling onNewThread() after clearing activeThreadId,
but the handleNewThread guard sees !view.threadId && !activeThreadId
and returns early, leaving the UI stuck on the deleted thread.
Fix by directly calling onSelect with a new nonce instead.
Restore __LOCALID_ filter in getInitialSingleChatView to prevent
restoring unpersisted AUI local thread IDs on navigation. Without
this filter, navigating away from /chat before sending a message
would restore a non-existent thread that Dexie cannot fetch.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Fix custom folder scanning when pointing directly at a model directory.
When a user adds a custom scan folder that points directly at a model
directory (e.g. /path/to/gemma-4-e2b-it-gguf/ containing config.json
and gemma-4-E2B-it-BF16.gguf), the model list previously showed
individual .gguf files as separate entries instead of recognizing the
directory as a single model. Clicking any entry showed "No GGUF
variants found" because list_local_gguf_variants received a file path
and immediately returned empty.
Changes:
- Add _is_model_directory() helper that detects directories with both
config metadata and actual model weight files (excludes mmproj GGUFs
and non-weight .bin files like tokenizer.bin)
- _scan_models_dir: detect self-model and return single directory entry
- _scan_lmstudio_dir: surface model directories directly instead of
descending into them as publisher folders; handle both root and child
model directories
- Add _resolve_gguf_dir() helper for GGUF path resolution that only
falls back to parent directory when parent has model metadata
- list_local_gguf_variants / _find_local_gguf_by_variant: use resolver
so .gguf file paths inside model directories work correctly
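A minimal sketch of the detection rule described above, not the actual helper. The file sets (`config.json` as the metadata marker, the weight suffixes) and the function names are assumptions for illustration.

```python
from pathlib import Path

# Assumed for illustration; the real helper may recognise other files.
CONFIG_FILES = {"config.json"}
WEIGHT_SUFFIXES = {".gguf", ".safetensors", ".bin"}
NON_WEIGHT_BIN = {"tokenizer.bin"}


def is_weight_file(path: Path) -> bool:
    """Return True for files that look like model weights."""
    name = path.name.lower()
    if path.suffix.lower() not in WEIGHT_SUFFIXES:
        return False
    if name.endswith(".gguf") and "mmproj" in name:
        return False  # multimodal projector, not the model itself
    if name in NON_WEIGHT_BIN:
        return False  # tokenizer data, not weights
    return True


def is_model_directory(directory: Path) -> bool:
    """A directory counts as a model only with config metadata AND weights."""
    if not directory.is_dir():
        return False
    files = [p for p in directory.iterdir() if p.is_file()]
    has_config = any(p.name in CONFIG_FILES for p in files)
    has_weights = any(is_weight_file(p) for p in files)
    return has_config and has_weights
```

Requiring both conditions keeps a publisher folder (no weights at its top level) or a bare download folder (no config) from being surfaced as a single model.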
* fix: skip redundant HfFileSystem().glob() calls in loader.py
Guard the SUPPORTS_LLAMA32 glob blocks with `is_model and is_peft` so
the HfFileSystem HTTP call is only made when both configs could actually
exist. This prevents indefinite hangs on slow/unreliable networks since
the glob result is redundant when either AutoConfig or PeftConfig
already failed to load.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove test file from main PR - moved to separate PR
Tests for the glob skip guard belong in their own PR to keep
the loader change minimal and reviewable.
* Harden HfFileSystem glob: fix Windows path splitting, add try/except
- Use str.rsplit("/", 1) instead of os.path.split to extract filenames
from HfFileSystem paths. HfFileSystem always returns POSIX-style paths,
but os.path.split uses the OS separator, so on Windows the entire path
was returned as the "filename" and the config name comparison always
failed.
- Wrap the HfFileSystem().glob() call in try/except to gracefully handle
network failures (offline mode, timeouts, unreachable Hub). On failure
both_exist stays False, which is the safe default.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove redundant HfFileSystem().glob() call for remote repos
When is_model and is_peft are both True, AutoConfig and PeftConfig
have already loaded successfully, proving both config.json and
adapter_config.json exist. The HfFileSystem network call to re-verify
this was redundant and could cause hangs on slow networks.
Replace the glob + try/except block with a direct both_exist = True
assignment.
* Remove unused HfFileSystem import
HfFileSystem was only used for the glob() calls that were replaced
with direct both_exist = True assignments in the previous commit.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Gemma-4 does not need FORCE_FLOAT32. Testing shows that both float16 and
bfloat16 work correctly without the forced float32 override:
- Inference: identical outputs for float16 and bfloat16 (greedy decoding)
- Training (100 steps, 4-bit LoRA, SFT on FineTome-100k):
- float16 final loss: 3.048
- bfloat16 final loss: 3.065
- Losses converge to within 0.02 by step 60
- Grad norms healthy and comparable for both dtypes
The FORCE_FLOAT32 path was actually causing training divergence. With
it enabled, the compiled float32 run diverged at step ~28 with grad norms
collapsing to near zero and loss plateauing at ~12.4. Without it, both
dtypes train normally.
This enables float16 on Tesla T4 and other GPUs without bfloat16 support.
* Add tests for is_vision_model() caching behaviour
* Fix review feedback: remove dead helper, fix exception test
- Remove unused _make_config() helper function (dead code)
- Fix test_exception_result_cached to actually exercise the exception path
by mocking load_model_config to raise OSError instead of using
side_effect=[False] which only tested normal False returns
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use strict mock specs so tests exercise intended detection paths
Use MagicMock(spec=[]) for all config mocks so hasattr() only returns
True for explicitly set attributes. Without this, MagicMock defaults
make all hasattr checks truthy, allowing tests to pass via unintended
detection paths (e.g. img_processor instead of vision_config).
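The difference between a default mock and a strict one can be seen in a few lines. This is a standalone illustration using only `unittest.mock`; the attribute value is made up.

```python
from unittest.mock import MagicMock

loose = MagicMock()
strict = MagicMock(spec=[])
strict.vision_config = {"hidden_size": 1024}  # explicitly set, illustrative value

print(hasattr(loose, "img_processor"))   # True: MagicMock auto-creates the attribute
print(hasattr(strict, "img_processor"))  # False: not part of the empty spec
print(hasattr(strict, "vision_config"))  # True: it was set explicitly above
```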
---------
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add vision detection cache to is_vision_model() to avoid redundant subprocess spawns
is_vision_model() is called 4-5 times per training run for the same model
with zero caching. For transformers 5.x models, each call spawns a full
subprocess (~6s each). This adds a module-level _vision_detection_cache dict
following the same pattern as the existing _audio_detection_cache used by
detect_audio_type(). The function is refactored into a thin cache wrapper
around _is_vision_model_uncached(), saving ~12s per training run.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Include hf_token in vision cache key for gated model correctness
Cache key is now (model_name, hf_token) instead of just model_name.
This prevents stale False results when an unauthenticated probe for a
gated model is followed by an authenticated call.
* Remove test file from main PR - will be submitted separately
* Fix vision cache: normalize model names and skip caching transient failures
- Normalize model names in cache key using resolve_cached_repo_id_case()
to avoid duplicate entries for different casings of the same HF repo
(aligns with case normalization from #4822)
- Return None instead of False on transient failures (network errors,
subprocess timeouts, HF API issues) so the cache layer can distinguish
"definitely not a vision model" from "failed to check"
- Only cache definitive True/False results; transient failures are retried
on the next call instead of being permanently locked in as False
* Refine failure handling: cache deterministic failures, guard normalization
- Subprocess non-zero exit, JSON errors, and general exceptions return
False (deterministic, cached) instead of None (retryable). Only
subprocess.TimeoutExpired returns None since timeouts are transient.
- Wrap cache key normalization in try/except so resolve_cached_repo_id_case
or normalize_path failures fall back to raw model_name instead of
crashing callers.
* Harden vision detection cache: fix transient failure handling, thread safety, token security
- All subprocess failure paths now return None (transient) instead of False,
preventing permanent misclassification of VLMs after temporary HF/auth/network errors
- Use SHA256 fingerprint for hf_token in cache key instead of raw bearer token
- Add threading.Lock with double-checked locking to prevent thundering herd
of concurrent subprocess spawns for the same uncached model
- Distinguish permanent failures (RepositoryNotFoundError, GatedRepoError,
ValueError) from transient ones in _is_vision_model_uncached
- Pass resolved/normalized model name to detection (not just cache key)
- Log normalization fallback at debug level instead of silently swallowing it
- Thread hf_token through callers in routes/models.py and trainer.py
that previously omitted it
* Refine lock strategy and token fingerprint
- Move detection computation outside the lock to avoid serializing
long-running subprocess spawns (60s timeout) and HF API calls across
all concurrent model checks. Lock is now only held for cache writes.
- Use full SHA256 digest for token fingerprint instead of truncated
16-char prefix to eliminate collision risk.
* Fix huggingface_hub import fallback and use atomic cache read
- Add fallback import path for RepositoryNotFoundError/GatedRepoError
from huggingface_hub.utils (older hub versions) when .errors is
not available
- Use sentinel-based dict.get() for single atomic cache read instead
of two-step in/[] pattern (future-proof for no-GIL runtimes)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Add fallback message for Colab Studio button when localhost link doesn't work
* Make fallback message darker grey for better readability
* Make fallback message bold for better visibility
---------
Co-authored-by: LeoBorcherding <LeoBorcherding@users.noreply.github.com>
* studio: add speculative decoding support (ngram-mod, on by default)
Enable n-gram speculative decoding for GGUF models in Unsloth Studio.
Uses llama.cpp's ngram-mod mode which gives 10-40% faster generation
with zero VRAM cost via a 4MB fixed hash table that auto-resets on
low acceptance rates.
Backend:
- Add speculative_type field to LoadRequest, LoadResponse, and
InferenceStatusResponse pydantic models
- Add speculative_type parameter to LlamaCppBackend.load_model()
with allowlist validation (ngram-simple, ngram-mod)
- Pass --spec-type, --spec-ngram-size-n 16, --draft-max 24 flags
to llama-server when ngram-mod is active
- Default to ngram-mod for non-vision GGUF models server-side
- Silently skip speculative decoding for vision models (unsupported
in llama.cpp server-context.cpp)
Frontend:
- Add speculative_type to TS API types
- Add speculativeType/loadedSpeculativeType to chat runtime store
with default value of "ngram-mod"
- Add On/Off toggle in Model settings section (GGUF only, hidden
for vision models), included in dirty check for Apply/Reset
- Wire speculative_type through model load request and response
- Restore speculative type state on page refresh/reconnect
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: remove server-side speculative decoding override
The backend was overriding speculative_type=None to "ngram-mod" for
non-vision GGUF models, which prevented users from disabling spec
decoding via the UI toggle. The frontend store already defaults to
"ngram-mod", so the backend fallback was redundant and blocked the
explicit "Off" setting.
* fix: use recommended ngram-mod params from llama.cpp docs
Update speculative decoding params to match the recommended values
from llama.cpp docs (docs/speculative.md):
--spec-ngram-size-n 24 (was 16, docs say small n not recommended)
--draft-min 48 (was 0)
--draft-max 64 (was 24, docs note MoEs need long drafts)
Also fix comment: ngram-mod uses ~16 MB (4M entries * 4 bytes),
not 4 MB.
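As a rough sketch, the flag assembly with the updated values could look like this. The function name and the handling of `ngram-simple` (only `--spec-type` is passed) are assumptions; the flag names and numbers are the ones listed above.

```python
from typing import List, Optional

ALLOWED_SPEC_TYPES = {"ngram-simple", "ngram-mod"}


def speculative_args(speculative_type: Optional[str], is_vision: bool) -> List[str]:
    """Build extra llama-server arguments for speculative decoding."""
    if speculative_type is None or is_vision:
        return []  # off, or silently skipped for vision models
    if speculative_type not in ALLOWED_SPEC_TYPES:
        raise ValueError(f"unsupported speculative_type: {speculative_type!r}")
    args = ["--spec-type", speculative_type]
    if speculative_type == "ngram-mod":
        args += [
            "--spec-ngram-size-n", "24",
            "--draft-min", "48",
            "--draft-max", "64",
        ]
    return args
```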
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add benchmark table and references to speculative decoding comment
Include speedup numbers from llama.cpp PRs #18471 and #19164 as an
inline comment so future readers understand the expected gains.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(studio): harden sandbox security for terminal and python tools
The existing command blocklist used naive str.split() which is trivially
bypassable via quoting, full paths, nested shells, variable expansion,
and cross-tool pivoting through Python os.system/subprocess. Fixes #4818.
Changes:
- Replace str.split() blocklist with shlex.split() + os.path.basename()
tokenization and regex scanning at shell command boundaries
- Add sanitized subprocess environment (_build_safe_env) that strips
credentials (HF_TOKEN, WANDB_API_KEY, GH_TOKEN, AWS_*, etc.) and
restricts PATH to /usr/local/bin:/usr/bin:/bin
- Add PR_SET_NO_NEW_PRIVS via prctl on Linux so sudo/su/pkexec fail
at the kernel level regardless of how they are invoked
- Add RLIMIT_NPROC (256) and RLIMIT_FSIZE (100MB) to prevent fork
bombs and disk filling attacks
- Extend AST safety checker to detect os.system(), os.popen(),
subprocess.run/Popen/call/check_output, os.exec*, os.spawn* calls
containing blocked commands or dynamic (non-literal) arguments
- Add cross-platform support: cmd.exe on Windows, bash on Unix;
CREATE_NO_WINDOW flag on Windows, preexec_fn on Unix
- Expand blocklist from 7 to 14 commands: add su, chown, passwd,
mount, umount, fdisk, kill, killall, pkill
- Apply all layers to both _bash_exec and _python_exec
Zero measurable performance overhead -- shlex parsing and a single
prctl syscall per subprocess fork.
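A minimal sketch of the shlex + basename tokenization, assuming a small illustrative blocklist. `find_blocked_tokens` is a hypothetical name, and treating unparseable input as unsafe is an assumption rather than something stated above.

```python
import os
import shlex
from typing import List

BLOCKED = {"sudo", "su", "kill", "mount"}  # illustrative subset only


def find_blocked_tokens(command: str) -> List[str]:
    """Return blocked command names found among the shell tokens."""
    try:
        tokens = shlex.split(command)  # honours quoting, unlike str.split()
    except ValueError:
        return ["<unparseable>"]  # e.g. unbalanced quotes; assumed unsafe
    found: List[str] = []
    for token in tokens:
        base = os.path.basename(token)  # /usr/bin/sudo -> sudo
        if base in BLOCKED and base not in found:
            found.append(base)
    return found
```

With plain `str.split()`, `/usr/bin/sudo whoami` yields the token `/usr/bin/sudo` and `"sudo" whoami` yields `"sudo"` with the quotes attached, so neither matches a blocklist entry. Both are caught here.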
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix review findings: exception_catching dead code, false positives, process substitution
- Include exception_catching reasons in _check_code_safety so bare
except-in-loop timeout evasion is actually blocked (was computed in
_check_signal_escape_patterns but never read by the caller)
- Remove base.split() inner loop that caused false positives on quoted
text arguments containing blocked words (e.g. echo "kill this process")
- Add targeted nested shell detection for bash/sh/zsh -c arguments
instead, which catches bash -c 'sudo whoami' without false positives
- Add <() process substitution to the regex character class so
diff <(rm -rf /path) is also caught
- Fix error message to say "unsafe patterns" instead of specifically
mentioning signal manipulation when other categories trigger
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address review feedback: regex paths, keyword args, list element scanning
- Regex now matches blocked commands after optional path prefix at shell
boundaries (catches ls; /usr/bin/sudo and similar)
- Nested shell detection uses os.path.basename so bash -c "/bin/rm" is
caught
- AST checker now inspects keyword arguments (not just positional) so
subprocess.run(args="sudo ...", shell=True) is detected
- List elements in subprocess calls are now checked via
_find_blocked_commands for consistency (catches subprocess.run(["bash",
"-c", "rm -rf /"]))
- Dynamic argument check uses _is_safe_literal that validates list
contents are all string literals
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix nested shell scan to only check the script body, not positional args
bash -c 'script' arg0 arg1 -- only tokens[i+1] is the script body;
subsequent tokens are $0, $1 positional parameters passed to the script
and are not executed as shell commands. Scanning all remaining tokens
caused false positives.
* Add subshell parentheses to regex command boundary detection
(sudo whoami) was not caught because ( was not in the regex character
class for shell command boundaries. Add ( to the set alongside ;, &,
|, backtick, newline.
* Address high-priority review findings from 7 parallel reviewers
- Track from-imports of dangerous functions (from os import system,
from subprocess import run as r, etc.) via shell_exec_aliases dict
so bare-name calls are detected by the AST checker
- Include the active Python interpreter and virtualenv directories
in the sanitized PATH so pip, uv, and Studio packages remain
accessible in the sandbox
- Add Windows-specific blocked commands (rmdir, takeown, icacls,
runas, powershell, pwsh) only on win32 platform
- Add os.posix_spawn and os.posix_spawnp to _SHELL_EXEC_FUNCS
- Handle tuple literals same as list literals in AST argument
inspection (both _extract_strings_from_list and _is_safe_literal)
* Fix false positive on check=True kwargs and recursive nested shell scanning
- Only inspect command-carrying keyword arguments (args, command,
executable, path, file) in the AST checker, not control flags like
check=True, text=True, capture_output=True which are booleans and
were incorrectly flagged as non-literal dynamic arguments
- Replace split() in nested shell detection with recursive call to
_find_blocked_commands so that quoted commands (bash -c '"sudo"
whoami') and semicolons (bash -c "sudo;ls") within nested shells
are properly detected through the full shlex + regex pipeline
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Move preexec_fn imports to module level and use find_library for libc
Addresses two Gemini review findings:
1. preexec_fn thread safety: _sandbox_preexec previously imported ctypes
and resource inside the function body, which runs between fork() and
exec() in the child process. In a multi-threaded server, this could
deadlock if the import machinery locks were held by another thread at
fork time. Now all imports and the libc handle are resolved once at
module load time, so _sandbox_preexec only calls C-level functions
(prctl, setrlimit) with no Python import activity.
2. Hardcoded libc.so.6 path: replaced with ctypes.util.find_library("c")
which works on glibc (libc.so.6), musl (libc.musl-*.so.1), and other
Linux distributions where libc has a different soname.
* Apply Gemini style suggestions: combined regex, dict.fromkeys, constant hoisting
- Combine per-word regex loop into a single re.findall with alternation
pattern, avoiding repeated regex compilation and searching
- Replace manual dedup loop with dict.fromkeys for PATH entries
- Hoist _CMD_KWARGS frozenset out of visit_Call to avoid recreating it
on every AST node visit
* Add cmd /c nested shell detection for Windows parity
The nested shell scan only checked for Unix shells (bash -c, sh -c, etc).
Add cmd /c and cmd.exe /c detection so that Windows nested shell
invocations are also recursively scanned for blocked commands. The token
scan already catches blocked commands at any position, so this is
defense-in-depth for consistency across platforms.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Handle combined shell flags (-lc, -xc) and interleaved flags (--login -c)
The nested shell scan only matched token == "-c" with the immediately
preceding token being a shell name. This missed:
- Combined flags: bash -lc 'rm ...' (-lc ends with c, is a valid
combined flag meaning -l -c)
- Interleaved flags: bash --login -c 'sudo ...' (--login sits between
bash and -c)
Now matches any short flag ending in 'c' (e.g. -lc, -xc, -ic) and
walks backwards past intermediate flags to find the shell binary.
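The matching rule can be sketched like this. It is a simplified illustration that also reflects the earlier fix of scanning only the script body; `nested_shell_scripts` and the shell set are hypothetical.

```python
import os
import re
import shlex
from typing import List

SHELLS = {"bash", "sh", "zsh"}
_C_FLAG = re.compile(r"^-[A-Za-z]*c$")  # -c, -lc, -xc, -ic


def nested_shell_scripts(command: str) -> List[str]:
    """Return the script bodies passed to nested `<shell> -c` invocations."""
    tokens = shlex.split(command)
    scripts: List[str] = []
    for i, token in enumerate(tokens):
        if not _C_FLAG.match(token) or i + 1 >= len(tokens):
            continue
        j = i - 1
        while j >= 0 and tokens[j].startswith("-"):
            j -= 1  # walk back past flags such as --login
        if j >= 0 and os.path.basename(tokens[j]) in SHELLS:
            scripts.append(tokens[i + 1])  # only the script body, not $0/$1
    return scripts
```

Each returned script body would then be fed back through the full blocked-command scan.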
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix /bin/bash bypass, remove RLIMIT_NPROC, reduce AST false positives
Addresses three high-consensus findings from 20-reviewer pass:
1. /bin/bash -c 'sudo whoami' bypassed nested shell scan because the
backwards flag-skip logic treated paths starting with / as flags.
Now only skips tokens starting with - as Unix flags; on Windows
only skips short /X flags (not /bin/bash style paths). [9/20]
2. RLIMIT_NPROC=256 caused subprocess.run to fail with EAGAIN because
Linux enforces NPROC per real UID, not per process tree. Removed
RLIMIT_NPROC entirely; RLIMIT_FSIZE and PR_SET_NO_NEW_PRIVS remain
as the primary resource and privilege controls. [5/20]
3. AST checker rejected safe dynamic subprocess usage like
cmd=["git","status"]; subprocess.run(cmd) as shell_escape_dynamic.
Now only flags dynamic args for shell-string functions (os.system,
os.popen, subprocess.getoutput, etc.) or when shell=True is
explicitly set. List-based subprocess calls with shell=False (the
default) do not pass through a shell and are not flagged. [12/20]
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Handle Windows drive letter paths and .exe extensions in command detection
Gemini review found that Windows absolute paths (C:\Windows\System32\
shutdown.exe) and executable extensions (.exe, .com, .bat, .cmd) were
not handled:
- Token scan now strips .exe/.com/.bat/.cmd extensions before checking
the blocklist, so sudo.exe matches sudo, shutdown.bat matches shutdown
- Regex pattern now includes optional Windows drive letter prefix
([a-zA-Z]:[/\\]) and optional executable extension suffix, so commands
after shell metacharacters with full Windows paths are also caught
* Handle **kwargs dict expansion, non-literal shell=, and except Exception false positive
Addresses three findings from second 20-reviewer pass:
1. **kwargs dict expansion (9/20): subprocess.run(**{"args": "rm ...",
"shell": True}) bypassed the AST checker because **kwargs were
treated as opaque. Now expands literal dict **kwargs to inspect
their keys, and flags opaque **kwargs (variable dicts) as unsafe.
2. Non-literal shell= values (7/20): shell=variable was treated as
shell=False (safe). Now any shell= value that is not literally
False is treated as potentially True (conservative default).
3. except Exception false positive (1/20): except Exception in a loop
was flagged as timeout evasion, but Exception does not catch
SystemExit or KeyboardInterrupt which are used for timeout
enforcement. Narrowed to only flag except BaseException and
except TimeoutError in loops.
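Points 1 and 2 amount to a conservative rule on the call's keyword arguments. A sketch using the standard `ast` module, with hypothetical helper names:

```python
import ast


def _is_literal_false(node: ast.AST) -> bool:
    return isinstance(node, ast.Constant) and node.value is False


def shell_may_be_true(call: ast.Call) -> bool:
    """Conservatively decide whether a call could run through a shell."""
    for kw in call.keywords:
        if kw.arg is None:  # **kwargs expansion
            if not isinstance(kw.value, ast.Dict):
                return True  # opaque dict variable: assume the worst
            for key, value in zip(kw.value.keys, kw.value.values):
                if not isinstance(key, ast.Constant):
                    return True  # nested ** or computed key
                if key.value == "shell" and not _is_literal_false(value):
                    return True
        elif kw.arg == "shell" and not _is_literal_false(kw.value):
            return True  # shell=True or shell=<variable>
    return False


def first_call(source: str) -> ast.Call:
    """Helper for the examples: return the outermost call in a snippet."""
    return next(n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.Call))
```

For example, `shell_may_be_true(first_call("subprocess.run(cmd, shell=flag)"))` is true, while a list-based call without `shell=` is not flagged.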
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Fixes #4809
On a new Studio chat, the first tool call could start before the frontend
initializes the thread ID. That meant the first request could go out without
a session_id, so the backend started the tool in the shared sandbox root
instead of the chat's session sandbox.
Frontend:
- Eagerly initialize the thread when switching to a new chat
- Resolve the thread ID once at request time and keep it stable through
async model-load waits
- Disable ActiveThreadSync during new-chat initialization to prevent
stale thread IDs from being written back
- Add error handling for thread initialization failures
- Clear activeThreadId on all compare-mode entry paths to prevent
cross-session leakage
- Fix exitCompare to restore context usage from the saved view
- Coerce falsy thread IDs to undefined for consistent backend/frontend
fallback behavior
- Use _default as the image sessionId fallback to match the backend
Backend:
- Use ~/studio_sandbox/_default when a request arrives without a session_id
* fix(studio): reuse HF cached repo casing to prevent duplicate downloads
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Move cache case resolution tests to separate PR
Tests for resolve_cached_repo_id_case and get_model_config case resolution
belong in their own PR to keep this change focused on the runtime fix.
* fix(studio): debug-log HF_HUB_CACHE fallback in path_utils
* Fix stale memoization in resolve_cached_repo_id_case
- Check exact-case path before memo to ensure a newly-appeared exact
match always wins over a previously memoized variant
- Validate memoized entries still exist on disk before returning them
to prevent stale results when cache dirs are deleted/recreated
* Minor cleanups for cache case resolution
- Use .is_dir() instead of .exists() for exact-case cache check
(cache entries are always directories)
- Remove redundant fallback in _detect_audio_from_tokenizer since
get_cache_path already handles case resolution and returns None
when the model is not cached
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* feat: allow non-LLM recipes to run without provider block
* feat: reorder execution tabs and add generation-aware data tab empty state
* fix: add accessibility attrs to data tab spinner and use literal ellipsis
* fix(studio): use shared spinner, stub provider, and hide unused LLM metrics
Backend: inject stub model provider for sampler-only recipes so
DataDesigner init does not reject empty provider lists.
Frontend: use shared Spinner component, hide LLM columns metric
and model usage card when recipe has no LLM columns.
* Fix tab reset and terminal auto-scroll regressions for PR #4805
Reset detailTab to "data" when switching between executions so
the Data tab default is applied consistently, not only on first
mount. Also add detailTab to the terminal scroll effect deps so
auto-scroll-to-bottom fires when the user opens the Overview tab
after landing on Data.
* Guard terminal scroll reset to only fire on Overview tab
The previous scroll effect ran on every tab switch, which could
reset the user's manual scroll position if they scrolled up in
the terminal and briefly switched tabs. Now the scroll-to-bottom
and sticky-bottom reset only fires when navigating to the
Overview tab.
* Use None for stub provider api_key instead of literal string
The stub ModelProvider that satisfies the DataDesigner registry
for non-LLM recipes should not carry a fake credential string.
Using None avoids sending an Authorization header if the provider
is ever inadvertently invoked.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Differentiate web_search query searches from URL fetches in the Studio chat UI.
Backend (llama_cpp.py):
- Emit "Reading: hostname" for URL fetches and "Searching: query" for query searches in SSE status events
- Only show hostname for valid http/https URLs; schemeless/non-http URLs get the generic "Reading page..." fallback
- Strip www. prefix for consistency with the frontend
Frontend (tool-ui-web-search.tsx):
- Tool card shows "Read hostname" / "Reading hostname..." for URL fetches
- Shows "Searched query" / "Searching for query..." for query searches
- Uses new URL() with protocol check; falls back to "Read page" / "Reading page..." for non-http URLs
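The backend labelling rule can be sketched in a few lines. `web_search_status` is a hypothetical name; the label strings are the ones listed above.

```python
from typing import Optional
from urllib.parse import urlparse


def web_search_status(query: Optional[str] = None, url: Optional[str] = None) -> str:
    """Return the status text for a web_search tool call."""
    if url:
        parsed = urlparse(url)
        if parsed.scheme in ("http", "https") and parsed.hostname:
            host = parsed.hostname
            if host.startswith("www."):
                host = host[len("www."):]  # match the frontend's display
            return f"Reading: {host}"
        return "Reading page..."  # schemeless or non-http URL
    return f"Searching: {query}"
```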
* Simplify llama.cpp install logic
* print release tag
* Retry failed json decode
* don't pull all ggml releases
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove test file changes from main PR
Test changes for test_pr4562_bugfixes.py will be submitted in a separate PR to keep this PR focused on the install path simplification.
* Fix setup.sh executable bit and direct tag lookup for pinned releases
- Restore setup.sh file mode to 100755 (was accidentally changed to 100644)
- Add direct GitHub API tag lookup in iter_release_payloads_by_time for
non-latest requested tags (e.g. b7879) instead of relying on paginated
release scans that may miss older releases beyond the 5-page limit
- Update stale DEFAULT_PUBLISHED_REPO comment to match new value
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix force-compile default ref and remove dead code in setup.ps1
- Change FORCE_COMPILE_DEFAULT_REF from "main" to "master" in all three
files (install_llama_prebuilt.py, setup.sh, setup.ps1) since
ggml-org/llama.cpp uses "master" as its default branch, not "main".
Using "main" would cause git clone --branch to fail when
UNSLOTH_LLAMA_FORCE_COMPILE=1 with UNSLOTH_LLAMA_TAG=latest.
- Remove dead if ($SkipPrebuiltInstall) block inside the else branch of
setup.ps1 that could never be reached (the outer elseif already
handles $SkipPrebuiltInstall=true).
- Maintain setup.sh executable bit (100755).
* Improve iter_release_payloads_by_time error handling for direct tag lookup
When a pinned release tag is not found (HTTP 404), fall through to the
paginated release scan instead of silently returning empty results.
Non-404 errors (network failures, rate limits) are propagated to the
caller so users get actionable error messages.
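The error handling reduces to distinguishing 404 from everything else. A sketch with an injected fetcher so it runs without network access; both names are hypothetical.

```python
import urllib.error
from typing import Callable, Optional


def lookup_pinned_release(
    tag: str, fetch_release_by_tag: Callable[[str], dict]
) -> Optional[dict]:
    """Return the release payload, or None if the tag does not exist."""
    try:
        return fetch_release_by_tag(tag)
    except urllib.error.HTTPError as exc:
        if exc.code == 404:
            return None  # caller falls through to the paginated scan
        raise  # rate limits and other HTTP errors reach the caller
```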
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix: patch PEFT for Gemma4ClippableLinear in loader checkpoint path
The same Gemma4ClippableLinear monkey-patch that exists in vision.py
for training is needed in loader.py for loading existing checkpoints
(used by export and inference).
Gemma4ClippableLinear wraps nn.Linear but does not subclass it, so
PEFT's LoRA injection fails with "Target module not supported".
The patch redirects PEFT to target the inner .linear child instead.
Applied only to the vision model PeftModel.from_pretrained path.
Temporary fix until PEFT adds native support (peft#3129).
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: wrap ClippableLinear patch in try/finally to always restore
Ensures _create_and_replace is restored even if PeftModel.from_pretrained
raises, preventing leaked global state across subsequent model loads.
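The restore-on-failure pattern, shown generically. `patched_attr` is a hypothetical helper, not the code in loader.py.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def patched_attr(obj: Any, name: str, replacement: Any) -> Iterator[None]:
    """Temporarily replace obj.<name>, restoring it even if the body raises."""
    original = getattr(obj, name)
    setattr(obj, name, replacement)
    try:
        yield
    finally:
        setattr(obj, name, original)  # always runs, so no global state leaks
```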
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(studio): lazy-import AutoConfig in model_config.py to fix transformers 5.x version switch
Move `from transformers import AutoConfig` from module level to inside
load_model_config() where it is actually used.
model_config.py is transitively imported at module load time via:
core/inference/__init__ → llama_cpp → utils.models → model_config
In inference subprocesses (mp.spawn), this chain runs before
_activate_transformers_version() can prepend .venv_t5/ to sys.path.
The eager import caches transformers 4.57.6 in sys.modules, and the
subsequent sys.path change has no effect — Python always checks
sys.modules before sys.path.
Making the import lazy ensures transformers is not loaded until after
version activation, so the subprocess picks up the correct version.
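The sys.modules behaviour can be reproduced with a throwaway module. In this sketch, `fakelib_demo` and the two temporary directories stand in for transformers and its two install locations; all names are made up.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_package_dir(root: Path, version: str) -> str:
    """Create a directory holding a tiny module that reports its version."""
    d = root / f"site_{version}"
    d.mkdir()
    (d / "fakelib_demo.py").write_text(f"VERSION = {version!r}\n")
    return str(d)


def observed_version(import_before_switch: bool) -> str:
    with tempfile.TemporaryDirectory() as tmp:
        old_dir = make_package_dir(Path(tmp), "4")
        new_dir = make_package_dir(Path(tmp), "5")
        saved_path = list(sys.path)
        sys.modules.pop("fakelib_demo", None)
        try:
            sys.path.insert(0, old_dir)
            if import_before_switch:
                importlib.import_module("fakelib_demo")  # eager import
            sys.path.insert(0, new_dir)  # stands in for version activation
            importlib.invalidate_caches()
            return importlib.import_module("fakelib_demo").VERSION
        finally:
            sys.path[:] = saved_path
            sys.modules.pop("fakelib_demo", None)
```

`observed_version(True)` reports the old version because the cached module wins. `observed_version(False)`, the lazy case, picks up the new one.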
* fix(studio): also lazy-import extract_model_size_b in llama_cpp.py
Belt-and-suspenders: make the import that originally triggered the
chain lazy as well, so future module-level AutoConfig additions in
utils.models cannot reintroduce the problem.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
When DEFAULT_PUBLISHED_REPO is ggml-org/llama.cpp, the prebuilt
resolver raises PrebuiltFallback because ggml-org releases do not
include a llama-prebuilt-manifest.json asset. This was caught by the
generic Exception handler and printed as "fatal helper error" to
stderr, which triggers NativeCommandError on PowerShell.
Catch PrebuiltFallback separately in the top-level __main__ handler
and exit with EXIT_FALLBACK (code 2) instead of EXIT_ERROR (code 1).
The message is still logged but without the "fatal helper error"
prefix. The shell scripts already handle non-zero exits and fall
back to source builds.
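A sketch of the handler split. The exit code values and the exception name come from the description above; `run_helper` and the injectable `log` callable are assumptions, since the output stream used for the message is not stated.

```python
from typing import Callable

EXIT_ERROR = 1
EXIT_FALLBACK = 2


class PrebuiltFallback(Exception):
    """Raised when no usable prebuilt release exists (stand-in definition)."""


def run_helper(main: Callable[[], None], log: Callable[[str], None] = print) -> int:
    try:
        main()
    except PrebuiltFallback as exc:
        log(str(exc))  # logged without the "fatal helper error" prefix
        return EXIT_FALLBACK
    except Exception as exc:
        log(f"fatal helper error: {exc}")
        return EXIT_ERROR
    return 0
```

`PrebuiltFallback` has to be caught before the generic handler. Otherwise the `except Exception` branch would swallow it, which is the behaviour being fixed.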
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* fix(studio): revert llama.cpp default tag to latest
The latest ggml-org/llama.cpp release (b8637) now includes Gemma 4
support. Revert the temporary "b8637" pin from #4796 to "latest" so
the prebuilt resolver always picks the newest release automatically
without needing manual tag bumps.
* docs: add comment explaining latest vs master for llama.cpp tag
Document in all three files why "latest" is preferred over "master"
and when "master" should be used as a temporary override.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Gemma 4 is a native transformers 5.5 model and does not need
trust_remote_code=True. The auto-enable logic (added for NemotronH)
was catching all transformers 5.x models, including Gemma 4.
When trust_remote_code=True, unsloth_compile_transformers() returns
early without running the compiler. This disables the fused cross
entropy patch, causing logged training loss to be inflated by the
gradient_accumulation_steps factor.
Exclude models matching "gemma-4" or "gemma4" from the auto-enable
so the compiler runs and applies fused cross entropy correctly.
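As an illustration only, the exclusion could look like the sketch below. The function name and the `needs_transformers_5` flag are hypothetical; only the two substrings "gemma-4" and "gemma4" come from this message, and the real auto-enable logic may use other inputs.

```python
# Substrings named in the commit message.
NATIVE_T5_MODEL_TAGS = ("gemma-4", "gemma4")


def should_auto_trust_remote_code(model_name: str, needs_transformers_5: bool) -> bool:
    """Hypothetical gate: auto-enable trust_remote_code for 5.x models
    except those that ship natively in transformers."""
    if not needs_transformers_5:
        return False
    lowered = model_name.lower()
    return not any(tag in lowered for tag in NATIVE_T5_MODEL_TAGS)
```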
ggml-org/llama.cpp b8637 includes Gemma 4 support (ggml-org/llama.cpp#21309).
Revert the temporary "master" default back to a pinned release tag.
This eliminates the HTTP 422 errors from the prebuilt resolver (which
could not find a release matching "master"), avoids unnecessary source
builds, and restores prebuilt binary downloads on all platforms.
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* fix windows llama.cpp compile from source issue
* undo local repo usage
* fix llama.cpp install
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix windows
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: route resolve-source-build call through Invoke-LlamaHelper
The --resolve-source-build call at the source-build resolution path
was still calling install_llama_prebuilt.py directly instead of going
through Invoke-LlamaHelper. On PS7+ with ErrorActionPreference=Stop,
stderr from the 422 response (when tag is "master") would trigger a
terminating NativeCommandError and crash setup.
* fix: suppress stderr error records from Invoke-LlamaHelper
ErrorActionPreference=Continue prevents termination but PowerShell
still displays stderr lines as visible ErrorRecord objects. Capture
all output via 2>&1 and split stdout from stderr manually so that
stderr lines never appear on the console. When StderrPath is given
the stderr content is written to that file for diagnostics.
* fix: always rebuild llama.cpp on Windows when tag is master
When the requested llama.cpp tag is "master" (a moving target), skip
the "already built" early exit so the build path runs and syncs to
the latest commit. Without this, existing llama-server binaries from
an older build (e.g. b8635 which lacks Gemma 4 support) are reused
and model loading fails.
Pinned tags (e.g. b8635) still skip the rebuild when the binary
already exists, since the tag is immutable.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
The model list merge order was `top_gguf + top_hub + static_models`,
which meant the HF download-ranked models always came first. New models
like Gemma 4 have low download counts and were not in the HF top-40,
so they got buried after 80 other models despite being at the top of
the curated static defaults in defaults.py.
Flip the merge to `static_models + top_gguf + top_hub` so editorial
picks (new model launches, promoted models) always appear first in the
Recommended section, with HF popularity backfilling after.
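A small sketch of the new merge order. The list names mirror the ones in this message; the case-insensitive de-duplication and the function name are assumptions added for the example.

```python
def merge_recommended(static_models, top_gguf, top_hub):
    """Curated defaults first, then HF popularity lists, dropping repeats."""
    merged = []
    seen = set()
    for model_id in [*static_models, *top_gguf, *top_hub]:
        key = model_id.lower()  # assumption: ids compare case-insensitively
        if key not in seen:
            seen.add(key)
            merged.append(model_id)
    return merged
```

With this order a newly launched model listed in the static defaults stays at the top no matter how few downloads it has.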
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
The latest ggml-org/llama.cpp release (b8635) does not include Gemma 4
support (ggml-org/llama.cpp#21309 merged after the release was cut).
This causes `llama-server` to fail with "unknown model architecture:
gemma4" when loading Gemma 4 GGUFs.
Temporarily default _DEFAULT_LLAMA_TAG to "master" so all new installs
build from the llama.cpp master branch which includes Gemma 4 support.
Once a new upstream release is cut with Gemma 4, this can be reverted
back to "latest".
Changes:
- setup.sh: add _DEFAULT_LLAMA_TAG="master" maintainer default
- setup.ps1: add $DefaultLlamaTag="master" maintainer default
- install_llama_prebuilt.py: change DEFAULT_LLAMA_TAG fallback to "master"
Users can still override via UNSLOTH_LLAMA_TAG env var.
Revert the >= loosening from f9c4b08 back to exact pins.
Using transformers>=4.57.6 allows pip to install 5.x into the main
Studio venv, which breaks huggingface_hub imports
(is_offline_mode removed in newer hub versions).
The main venv must stay on transformers==4.57.6 and
huggingface-hub==0.36.2. The 5.x version lives only in .venv_t5/
and is dynamically switched via sys.path at runtime.
The v5.5-release branch now exists on huggingface/transformers.
Use transformers==5.5.0 for all install paths and
git+transformers.git@v5.5-release for the MLX installer.
Also bumps huggingface_hub from 1.7.1 to 1.8.0 in setup.sh and
setup.ps1 to stay consistent.
Hardcode the release repo to ggml-org/llama.cpp and remove the
UNSLOTH_LLAMA_RELEASE_REPO and UNSLOTH_LLAMA_SOURCE env var overrides
so that all users always build/download from mainline llama.cpp.
Gemma-4 support landed in transformers main
(huggingface/transformers#45192). Update the version pin from
5.5.0.dev0 to 5.5.0 across loader, Studio version switcher,
and the MLX installer. Also fix install_gemma4_mlx.sh which
referenced a non-existent v5.5-release branch -- pin it to
the correct commit (91b1ab1) instead.
Small GGUF models (<9B) frequently generate full code or lengthy
explanations instead of calling tools, bypassing the existing
plan-without-action re-prompt mechanism. Four issues:
1. _REPROMPT_MAX_CHARS=500 was too low -- models that output full
HTML/code responses (often 1000+ chars) never triggered the
re-prompt at all, since it only fires on short responses.
2. _MAX_REPROMPTS=1 gave the model only one chance to comply.
Small models often need 2-3 nudges before switching from
text generation to tool calling.
3. The re-prompt text ("Please use the available tools...") was
too polite for small models to follow reliably.
4. Tool-calling detection missed chat templates using Jinja
whitespace-trimming syntax ({%- if tools -%}) since only
({%- if tools %}) and ({% if tools %}) were checked.
Changes:
- Raise _REPROMPT_MAX_CHARS from 500 to 2000 so longer responses
(code blocks, multi-paragraph plans) still trigger re-prompts
- Raise _MAX_REPROMPTS from 1 to 3 for more retry budget
- Use direct, imperative re-prompt language that small models
follow more reliably ("STOP. You MUST call a tool NOW.")
- Strengthen the system prompt tool nudge to explicitly forbid
outputting code blocks (redirect to the python tool instead)
- Add Jinja whitespace-trimmed variants to the tool_markers
list so all template styles are detected correctly
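For the template detection, a minimal sketch assuming a plain substring check. The first three markers are the ones quoted in this message; the fourth (`{% if tools -%}`) and the function name are assumptions.

```python
# Jinja allows "-" on either side of a block tag to trim whitespace,
# so the same condition can be spelled several ways.
TOOL_MARKERS = (
    "{%- if tools %}",
    "{% if tools %}",
    "{%- if tools -%}",
    "{% if tools -%}",  # assumed extra variant, not named in the message
)


def template_supports_tools(chat_template: str) -> bool:
    """Return True when the chat template branches on `tools`."""
    return any(marker in chat_template for marker in TOOL_MARKERS)
```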
* UI Changes
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove unrelated test file
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* feat(studio): display images from Python tool execution in chat UI
When the model calls the Python tool to create a matplotlib plot or
other image file, the image now displays inline in the chat output
instead of being invisible to the user.
Backend:
- Detect new image files (png/jpg/gif/webp/bmp) after Python subprocess
completes by diffing os.listdir before/after execution
- Append __IMAGES__ sentinel to tool result for frontend consumption
- Strip sentinel before injecting result into LLM context (role: tool)
so the model never sees file paths
- Add GET /sandbox/{session_id}/{filename} endpoint with JWT auth
(header or query param), path traversal protection, extension
allowlist, realpath containment check, and nosniff header
Frontend:
- Parse __IMAGES__ sentinel in tool_end SSE events, create structured
result with text/images/sessionId
- Render <img> tags in Python tool UI pointing at the sandbox endpoint
Also fixes a bug where SyntaxError in user code was misreported as
"unsafe code detected" instead of showing the actual Python traceback.
The _check_code_safety function now lets SyntaxError pass through to
the subprocess for a proper error message.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix(studio): improve SVG detection and strip XML preamble
Handle <?xml ...?> declarations before <svg> tags in code fences,
strip XML declaration from SVGs before data URI rendering, and
update the sloth suggestion prompt to request showing code.
* fix(studio): persist parentId so retries survive reload
The append() handler was destructuring only { message } from
ExportedMessageRepositoryItem and discarding parentId. When loading
a saved thread, load() used ExportedMessageRepository.fromArray()
which chains all messages sequentially, flattening retry branches
into a linear list.
Now append() writes parentId to the MessageRecord, and load()
reconstructs the tree when parentIds are present. Old threads
without parentId fall back to the existing fromArray() behavior.
* fix(studio): address review findings for image display and retry persistence
Image detection:
- Use mtime comparison instead of filename-only diff so overwritten
files (e.g. plt.savefig("chart.png") called twice) are detected
Sentinel parsing:
- Use rsplit/lastIndexOf instead of split/indexOf so user code that
prints __IMAGES__: does not collide with the backend sentinel
Mixed legacy/new threads:
- For old messages without a stored parentId, infer sequential parent
from the previous message instead of null, preventing multiple roots
Sandbox endpoint:
- Change Cache-Control from "public, max-age=3600" to "private,
no-store" since these are authenticated responses
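The sentinel fix can be sketched as follows. The payload format after the sentinel (a JSON list of filenames) and the function name are assumptions; the point is only that splitting from the right leaves user-printed copies of the marker in the text.

```python
import json

SENTINEL = "__IMAGES__:"


def split_tool_result(result: str):
    """Split a tool result into (text, image filenames) at the LAST sentinel."""
    if SENTINEL not in result:
        return result, []
    text, payload = result.rsplit(SENTINEL, 1)
    try:
        images = json.loads(payload)  # assumption: payload is a JSON list
    except json.JSONDecodeError:
        return result, []
    if not isinstance(images, list):
        return result, []
    return text.rstrip("\n"), images
```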
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(frontend): scope sans font overrides to chat thread only
* fix(frontend): use font-sans fallback for heading stack and simplify chat font rules
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* update logic to incorporate custom prebuilt installs
* bug fixes
* update for review comments
* fix tags
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Separate test changes from main PR
Move test file changes out of this PR to keep the diff focused on
the install_llama_prebuilt.py and setup script changes. Test updates
will be submitted in a follow-up PR.
* Fix branch ref normalization and harden JSON parsing
- Add checkout_friendly_ref() to strip refs/heads/ prefix from branch
refs before emitting them in SourceBuildPlan. git clone --branch does
not accept fully qualified refs like refs/heads/main.
- Apply normalization in source_build_plan_for_release() and the
direct-ref fallback in resolve_source_build_plan().
- Allow validated_checksums_for_bundle() to accept releases that carry
only an exact-commit source archive without the legacy upstream-tag
source tarball.
- Add 2>/dev/null || true guards to all inline python -c JSON parsing
in setup.sh so a malformed payload does not abort the script under
set -e.
* Fix Windows CUDA asset ordering and tag ref normalization
- Reorder windows_cuda_upstream_asset_names to prefer the main binary
archive (llama-{tag}-bin-win-cuda-*) over the cudart sidecar archive
(cudart-llama-bin-win-cuda-*). The cudart ZIP only contains CUDA
runtime DLLs, not llama-server or llama-quantize binaries.
- Extend checkout_friendly_ref to also strip refs/tags/ prefix for tag
refs, matching the refs/heads/ handling for branch refs.
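A minimal sketch of the normalization, assuming simple prefix stripping. The function name is taken from this message; the body is illustrative and may differ from the version in install_llama_prebuilt.py.

```python
def checkout_friendly_ref(ref: str) -> str:
    """Turn a fully qualified ref into a name `git clone --branch` accepts."""
    for prefix in ("refs/heads/", "refs/tags/"):
        if ref.startswith(prefix):
            return ref[len(prefix):]
    # Short names and commit hashes pass through unchanged.
    return ref
```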
* Simplify JSON parsing consistency in setup.sh
Use json.load(sys.stdin) consistently for all inline JSON parsing
in setup.sh, instead of the more complex json.loads(raw) pattern
on the install-tag resolution path. The 2>/dev/null || true guard
already handles empty/malformed input gracefully.
* Fix source build plan fallback for commit ref kind in PR #4771
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <daniel@unsloth.ai>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Expand test coverage for install_llama_prebuilt.py:
- Add tests for source build plan resolution with custom repos
- Add tests for branch/commit/PR ref matching and normalization
- Add tests for manifest checksum validation
- Add tests for Windows CUDA upstream asset name patterns
- Update capsys checks to capture stderr after log() redirect
* fix(studio): prevent small models from stalling on tool-calling tasks
Small GGUF models (< 9B params) in "Think, Search, Code" mode would
often describe what they planned to do ("Let me create this dashboard")
and then stop generating without ever calling a tool.
Three changes:
1. Simplify web_tips for small models: remove the "fetch its full content
by calling web_search with the url parameter" guidance for models < 9B.
This multi-step instruction causes small models to plan elaborate
search-then-fetch-then-code sequences they cannot reliably execute.
2. Add "always call tools directly" imperative to the system prompt nudge
so models act immediately instead of narrating their intentions.
3. Add plan-without-action re-prompt in the agentic loop: when the model
emits planning text (matching patterns like "let me", "I'll", etc.)
without calling any tool, inject a nudge asking it to call the tool
and continue the loop. Capped at 2 re-prompts per request.
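The re-prompt gate from step 3 might look like this sketch. Only the "let me" and "I'll" patterns and the cap of 2 come from this message; the function name and signature are hypothetical, and the real loop checks more patterns.

```python
import re

# Only the two phrases quoted in the commit message are included here.
PLAN_PATTERN = re.compile(r"\b(let me|i'll)\b", re.IGNORECASE)


def needs_reprompt(text: str, tool_calls: list, reprompts_used: int,
                   max_reprompts: int = 2) -> bool:
    """Decide whether to inject a nudge and run another loop iteration."""
    if tool_calls:
        return False  # the model acted, nothing to fix
    if reprompts_used >= max_reprompts:
        return False  # budget exhausted, return the text as-is
    return bool(PLAN_PATTERN.search(text))
```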
Benchmarked with Qwen3.5-4B-GGUF (N=5 trials per variant):
- Baseline: 40% of requests had any tool call
- Combined fix: 100% of requests had at least one tool call
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix shell injection in GGML conversion paths
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove test file from security fix PR
Move test_save_shell_injection.py to a separate PR to keep this PR focused on the security fix itself.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Distinguish between actual network downloads and GPU memory loading for cached LoRA adapters in Studio chat.
- Add isCachedLora detection for local LoRA adapter paths using comprehensive cross-platform regex (Unix, Windows, UNC, WSL, tilde)
- Thread isCachedLora through loadInfo to chat-page inline status for proper 3-way distinction (cached / local LoRA / downloading)
- Skip download progress polling for cached LoRA models (no useless /download-progress API calls)
- Fix initial toast state to use isCachedLoad consistently instead of only checking isDownloaded
- Fix cancelLoading toast to not mention background downloads for cached/local loads
- Keep download-specific text ("Downloading model..." / "Download complete") inside the download-only polling block
- Add min-w-0 guards to thread/message/markdown containers to prevent
content overflow past the composer width
- Unify chat typography from Hellix/Space Grotesk to the sans stack,
keeping monospace for code blocks and inline code
- Restructure desktop navbar right-side controls with shrink-0 wrappers
for consistent spacing across HoverCard roots
- Soften tool-call label styling (font-medium + text-foreground/85
instead of bold)
- Add responsive code block sizing via @container queries
- Add horizontal scrolling for wide code blocks within the thread column
- Scope list-item code block alignment CSS to .aui-thread-root
- Preserve useScrollLock in tool-fallback and tool-group collapsibles
- Fall back to bg-background on ViewportFooter when hideComposer is true
- Widen inline code monospace selector to cover th, blockquote, and
heading elements
- Remove unused @fontsource-variable/space-grotesk import
* Fix script unbound variable error
* remove stale test script, add llama.cpp metal source builds, update tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix Metal precedence, test sync, and add behavioral tests
- Move macOS arm64 Metal check before CUDA/ROCm in GPU backend
decision chain so Metal is not bypassed when nvcc is in PATH
- Remove RPATH flags from CPU fallback CMAKE_ARGS (only needed
for Metal library linking)
- Update test_llama_pr_force_and_source.py to match _CLONE_ARGS
rename from _CLONE_BRANCH_ARGS in setup.sh
- Add confirm_install_tree guard test for
existing_install_matches_choice
- Add TestMacOSMetalBuildLogic bash subprocess tests verifying
Metal flag selection, nvcc precedence, and CPU fallback behavior
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix Metal CPU fallback to also cover cmake build failures and update tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* 1. _GPU_BACKEND_FRAGMENT synced -- removed dead CPU_FALLBACK_CMAKE_ARGS= init (6/8)
2. RPATH assertion replaced -- new test_macos_arm64_cpu_fallback_args_exclude_rpath checks the actual runtime CPU_FALLBACK_CMAKE_ARGS output for @loader_path and -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON (6/8)
3. _TRY_METAL_CPU_FALLBACK=false reset after both configure-failure and build-failure fallback branches in setup.sh (4/8)
4. macOS test now removes libmtmd.0.dylib instead of the platform-agnostic convert_hf_to_gguf.py (3/8)
5. Empty-string tag test added -- test_empty_tag_omits_branch_flag for resolved_tag= (2/8)
6. RPATH checks on cmake call logs -- both fallback tests now assert @loader_path and -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON are absent from CPU fallback cmake calls, plus baseline flag preservation (multiple)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* tests clean up
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(studio): allow context length slider to reach model's native limit
The context length slider was hard-capped to the VRAM-estimated maximum,
preventing users from requesting higher context even though the backend
already handles it safely (multi-GPU selection, --fit fallback). Expose
the model's native context length from GGUF metadata as a separate API
field and use it as the slider ceiling instead. Add an amber warning
when the selected context exceeds the estimated VRAM capacity.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Raise VRAM budget to 90% and add native_context_length tests
Increase the GPU memory utilization threshold from 70% to 90% across
_select_gpus and _fit_context_to_vram, allowing longer context lengths
before VRAM capping kicks in.
Add 33 tests for the native_context_length feature covering the backend
property, context value separation invariants, Pydantic models, route
completeness, edge cases, and cross-platform binary I/O.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix: add tokenizers to no-torch runtime deps and add TORCH_CONSTRAINT for arm64 macOS py313+
Two installer fixes:
1. Add `tokenizers` to `no-torch-runtime.txt` before `transformers`.
Without it, `from transformers import AutoConfig` crashes on startup
because `--no-deps` skips transitive dependencies.
2. Add `TORCH_CONSTRAINT` variable to `install.sh`. On arm64 macOS with
Python 3.13+, tighten the torch requirement to `>=2.6` since torch
<2.6 has no cp313 arm64 wheels. The variable replaces the previously
hard-coded constraint in the uv pip install line.
Includes 66 tests (42 pytest + 24 bash) covering:
- Structural checks on install.sh, install.ps1, no-torch-runtime.txt
- Shell snippet tests with mocked python for 13 platform/version combos
- Mock uv integration verifying correct constraint string
- E2E venv tests on Python 3.12 and 3.13 confirming AutoConfig works
- Negative control proving AutoConfig fails without tokenizers
- Full no-torch sandbox regression guards (safetensors, huggingface_hub)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix incomplete no-torch manifest and align E2E tests with real --no-deps path
- Add missing transitive deps to no-torch-runtime.txt that are required
under --no-deps: regex, typing_extensions, filelock, httpx, httpcore,
certifi, idna, anyio, sniffio, h11. Without these, `from transformers
import AutoConfig` still fails after install.sh --no-torch.
- Change all E2E tests to use --no-deps (matching what install.sh does)
instead of normal dep resolution. Previous tests passed even with an
incomplete manifest because uv backfilled transitive deps.
- Rewrite negative control to derive from the real no-torch-runtime.txt
with tokenizers stripped, proving the specific fix matters.
- Replace GNU-only sed -i with heredoc in shell test for macOS compat.
- Remove unused os/sys imports from Python test file.
- Quote SKIP_TORCH and mock uv paths in bash -c strings.
* Assert install succeeds before checking import results in E2E tests
Address review feedback: test_torch_not_importable and
test_tokenizers_directly_importable in Group 3 now assert that
uv pip install returns 0 before checking import behavior. This
prevents false positives when the install itself fails silently.
* Assert install succeeds in negative control and tighten error check
- Add missing install-success assertion in test_negative_control_no_tokenizers
to prevent false positives from network/install failures.
- Tighten error message check to look for "tokenizers" in stderr or
ModuleNotFoundError, rather than the generic "No module" substring
which could match unrelated import failures.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- Fix SSL handshake failures (SSLV3_ALERT_HANDSHAKE_FAILURE, CERTIFICATE_VERIFY_FAILED) when fetching HTTPS pages by introducing _PinnedHTTPSConnection that separates TCP connect (to pinned IP) from TLS handshake (with real hostname for SNI/cert verification)
- Fix SSRF DNS-rebinding vulnerability: the previous implementation swapped conn.host before connect(), causing a fresh DNS resolution; the new subclass keeps the TCP connection pinned to the validated IP
- Fix SPA/JS-rendered doc sites returning empty content by rotating real browser User-Agents (Chrome/Firefox/Safari)
- Strip nav/footer from HTML-to-Markdown output so article content is not buried under navigation chrome
- Increase raw fetch cap from 64KB to 512KB so SSR article content is reached on GitBook/Docusaurus/Next.js pages
- Fix IPv6 address bracketing in URL netloc construction
- Hoist SSL context, handler classes, and stdlib imports to module level (created once, not per-call)
- Use consistent UA across redirect hops to avoid breaking session-aware bot detection
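The pinned-connection idea from the first two bullets, as a hedged sketch rather than the actual `_PinnedHTTPSConnection`: the class name, constructor arguments, and defaults below are assumptions, and proxy tunnelling is left out.

```python
import http.client
import socket
import ssl


class PinnedHTTPSConnectionSketch(http.client.HTTPSConnection):
    """TCP goes to a pre-validated IP; TLS still uses the real hostname."""

    def __init__(self, host, pinned_ip, *, port=443, timeout=10.0, context=None):
        self._tls_context = context or ssl.create_default_context()
        super().__init__(host, port=port, timeout=timeout,
                         context=self._tls_context)
        self._pinned_ip = pinned_ip

    def connect(self):
        # No DNS lookup here: connect straight to the IP that was validated.
        raw = socket.create_connection((self._pinned_ip, self.port),
                                       timeout=self.timeout)
        try:
            # SNI and certificate verification use the hostname, not the IP.
            self.sock = self._tls_context.wrap_socket(
                raw, server_hostname=self.host)
        except Exception:
            raw.close()
            raise
```

Because self.host is never overwritten, the Host header and certificate check keep the real hostname while the socket cannot be redirected by a second DNS answer.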
Split out from #4741 to keep the main PR focused on installer logic.
- New test_install_llama_prebuilt_logic.py: tests for resolve logic,
fallback behavior, env_int, busy/lock handling
- New test_validate_llama_prebuilt.py: validator tests for staged
release_tag/upstream_tag handling
- New test_llama_pr_force_and_source.py: tests for PR_FORCE and
LLAMA_SOURCE maintainer defaults
- Updated test_selection_logic.py: expanded selection/fallback coverage
- Updated test_pr4562_bugfixes.py: updated bugfix tests for new logic
- Updated smoke_test_llama_prebuilt.py: minor update
Replaces the fixed prebuilt llama.cpp tag with dynamic published-release
resolution, adds bounded fallback across older published releases, and
introduces maintainer-editable defaults for PR/source overrides.
Changes:
- Resolve "latest" from the latest usable published release in unslothai/llama.cpp
- Use the selected release upstream_tag as the authoritative llama.cpp version
- Prefer Unsloth-published platform assets when available
- Fall back to same-tag upstream ggml-org/llama.cpp assets where allowed
- Keep Linux CUDA anchored to Unsloth-published CUDA bundles only
- Add bounded fallback across older Unsloth published releases
- Add separate busy/in-use install handling (exit code 3)
- Skip reinstall when the installed bundle already matches the selected candidate
- Add maintainer-editable _DEFAULT_LLAMA_PR_FORCE and _DEFAULT_LLAMA_SOURCE
- Harden env parsing so malformed installer env vars do not crash import-time fallback logic
- Honor UNSLOTH_LLAMA_RELEASE_TAG in all resolve steps
- Always sync git remote URL in existing-checkout path
* Fix save_pretrained_merged for full-finetuned models
save_pretrained_merged and push_to_hub_merged silently do nothing when
the model is not a PeftModel (i.e. full finetuning without LoRA).
merge_and_overwrite_lora returns None immediately for non-PeftModel,
and unsloth_generic_save does not check the return value.
Add a non-PeftModel branch in unsloth_generic_save that falls back to
model.save_pretrained / model.push_to_hub. When save_method contains
"16bit", cast weights to bfloat16 (or float16) via a state_dict copy
to honor the user's intent without mutating the live model.
The existing PeftModel (LoRA) code path is unchanged.
* Forward create_pr and revision to tokenizer.push_to_hub
The tokenizer push_to_hub call was missing create_pr and revision,
which could cause the tokenizer to push to the wrong branch or
bypass PR creation when the model push uses them.
* Honor merged_16bit dtype contract for full-finetuned models
Cast state_dict to bfloat16/float16 when save_method contains "16bit"
to match the documented behavior of save_pretrained_merged. Also pass
state_dict and save kwargs consistently to both save_pretrained and
push_to_hub paths.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address review feedback for PR #4755
- Simplify PeftModel isinstance check (PeftModelForCausalLM inherits
from PeftModel)
- Add is_main_process guard for distributed training
- Forward variant to save_pretrained
- Set tokenizer padding_side to "left" before saving (matches other
save paths)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* feat(studio): architecture-aware KV cache VRAM estimation
Replace the single legacy formula (2 * n_kv_heads * head_dim * n_layers
* n_ctx * bpe) with 5-path estimation that reads 8 additional GGUF
metadata fields:
1. MLA (DeepSeek-V2/V3, GLM-4.7, GLM-5, Kimi-K2.5) -- K-only cache
using compressed KV latent + RoPE; no separate V allocation
2. Hybrid Mamba (Qwen3.5-27B, Qwen3.5-35B-A3B) -- only attention
layers (1 in N) carry KV; Mamba layers have none
3. Sliding Window (Gemma-3, gpt-oss) -- SWA layers cache
min(ctx, window) tokens instead of the full context
4. Standard GQA -- uses explicit key_length/value_length from GGUF
instead of embed // n_heads (which is wrong for many models)
5. Legacy fallback -- identical to old formula for old GGUFs
New GGUF fields parsed: attention.key_length, attention.value_length,
attention.sliding_window, full_attention_interval,
attention.kv_lora_rank, attention.key_length_mla, ssm.inner_size,
ssm.state_size.
Validated against 9 real GGUF files (72/72 field checks pass).
The legacy formula was off by +682% for Gemma-3 and -81% for
DeepSeek-V3.1.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix MLA fallback and SWA global/local ratio heuristic
Two fixes based on review findings:
1. MLA fallback now uses key_length_mla from GGUF metadata instead of
hardcoded rope_dim=64. Falls back to 64 only when key_length_mla is
absent. This ensures correct estimates for MLA variants that use
rope dimensions other than 64.
2. SWA global/local layer ratio changed from 50/50 to 1/4 (25% global,
75% SWA). Most sliding window architectures have predominantly local
layers (Gemma-3 uses ~17% global, gpt-oss uses ~50%). The 1/4
heuristic is closer to the common case and still a large improvement
over the legacy formula which ignores SWA entirely.
* Tighten _can_estimate_kv gate and treat sliding_window=0 as disabled
Two additional fixes from review round 1 (5/8 and 4/8 reviewer consensus):
1. _can_estimate_kv now requires BOTH key_length AND value_length for
the explicit-dims path. Previously key_length alone was enough,
which could cause silent fallthrough to the legacy formula with
fabricated defaults (n_kv=1, head_dim=128) when value_length was
absent from the GGUF.
2. SWA path now requires sliding_window > 0. Some GGUFs use 0 as a
disabled sentinel. Without this guard, min(ctx, 0) would zero out
all SWA layer contributions, severely underestimating KV cache.
* Fix MLA n_kv safety and use ceiling division for hybrid path
Addresses Gemini Code Assist review findings:
1. MLA path now uses n_kv_mla = n_kv_heads or 1 (not n_heads). This
prevents a 128x overestimate for DeepSeek-V3 if head_count_kv is
absent from the GGUF (n_heads=128 would have been used instead).
2. Hybrid path now uses ceiling division for attention layer count.
This prevents undercounting by 1 when n_layers is not perfectly
divisible by full_attention_interval.
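To make the arithmetic concrete, here is a simplified sketch of three of the pieces described above: the legacy formula, the ceiling division for hybrid models, and the SWA path with the 1/4 global heuristic. Function names, the default of 2 bytes per element, and the rounding of the global layer count are assumptions; the Studio code handles more cases.

```python
def legacy_kv_bytes(n_kv_heads, head_dim, n_layers, n_ctx, bpe=2):
    """Legacy formula quoted in the commit message (K and V, every layer)."""
    return 2 * n_kv_heads * head_dim * n_layers * n_ctx * bpe


def attention_layer_count(n_layers, full_attention_interval):
    """Hybrid path: 1 in N layers carries KV, rounded up."""
    return -(-n_layers // full_attention_interval)


def swa_kv_bytes(n_layers, n_kv_heads, key_length, value_length,
                 n_ctx, sliding_window, bpe=2):
    """SWA path: about 1/4 of layers are global, the rest cache a window."""
    per_token = n_kv_heads * (key_length + value_length) * bpe
    if sliding_window <= 0:
        # 0 is treated as "disabled": every layer caches the full context.
        return per_token * n_layers * n_ctx
    n_global = max(1, n_layers // 4)  # assumption: round down, at least 1
    n_local = n_layers - n_global
    return per_token * (n_global * n_ctx + n_local * min(n_ctx, sliding_window))
```

For example, 8 KV heads of dimension 128 over 32 layers at 4096 tokens in 16-bit come to 512 MiB under the legacy formula.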
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix forward compatibility with transformers 5.x
Tested on transformers 4.57.6, 5.3.0, and 5.4.0. All changes are no-ops
on transformers 4.x.
1. Skip exec-based config patching for transformers >= 5.0
Config classes in v5 use @strict, @auto_docstring, and interval()
which break exec(inspect.getsource(...)). Those configs already use
rope_parameters (the v5 replacement for rope_scaling).
2. Slice position_ids to last token in fast_forward_inference
Transformers 5.x generate() accumulates position_ids as
[batch, full_seq_len] across decode steps instead of [batch, 1].
cos[position_ids] then produces the wrong shape for rotary
embeddings. Fixed in llama, qwen3, falcon_h1, gemma2, cohere,
granite. No-op on 4.x since position_ids is already [batch, 1].
3. Handle @strict config kwargs for sequence classification
num_labels, max_position_embeddings, id2label etc. are set on the
config object and passed via config= instead of as kwargs.
AutoModelForSequenceClassification routing added to FastModel loader.
4. Exclude modernbert from flex_attention
ModernBERT with flex_attention hits CUDA illegal memory access in
create_block_mask. Falls back to eager attention safely.
5. Propagate token_type_ids and mm_token_type_ids through GRPO VLM path
Gemma3 Vision requires token_type_ids during training. Qwen3VL
requires mm_token_type_ids for M-RoPE. Extract from inputs in
compute_loss, pass to grpo_accumulated_loss, and extend
mm_token_type_ids for completion tokens in
_generate_and_score_completions.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add try/except safety net around config exec for pre-release transformers versions
* Pop config-level kwargs in seqclass path and use except Exception
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
When searching for a specific publisher model (e.g. `openai/gpt-oss-20b`), the
unsloth search used the full `openai/gpt-oss-20b` string with `author=unsloth`,
which returned zero results because no unsloth model contains the publisher
prefix in its name. Users never discovered unsloth variants.
This PR strips the org prefix for publisher-qualified queries so unsloth variants
surface, then pins the original publisher model after a small batch of unsloth
results. Plain queries (no slash) and unsloth-prefixed queries are unchanged.
- Strict regex (`/^([^/\s]+)\/([^/\s]+)$/`) only triggers on valid `owner/repo`
identifiers; incomplete typeahead, multi-slash, and URL-like inputs are rejected
- Queries for `unsloth/...` models (case-insensitive) keep the full 20-result
prefetch and secondary sort
- Pinned model lookup fires in parallel with the unsloth prefetch
- Canonical-name dedup prevents duplicates when HF normalizes casing
- Publisher detection extracted into a single `useMemo` block
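A rough Python sketch of the publisher-prefix rule described above. The actual change lives in the Studio frontend; `unsloth_search_term` is a hypothetical helper name, and only the regex is taken from the description.

```python
import re

# Same pattern as the strict regex quoted above, written in Python syntax.
_OWNER_REPO_RE = re.compile(r"^([^/\s]+)/([^/\s]+)$")


def unsloth_search_term(query: str) -> str:
    """Return the term to search with author=unsloth (hypothetical helper)."""
    match = _OWNER_REPO_RE.fullmatch(query)
    if match is None:
        # Plain, incomplete, multi-slash, or URL-like input: leave untouched.
        return query
    owner, repo = match.groups()
    if owner.lower() == "unsloth":
        # unsloth-prefixed queries keep their existing behaviour.
        return query
    # Publisher-qualified query: drop the org prefix so unsloth variants surface.
    return repo
```

Under these assumptions `openai/gpt-oss-20b` is searched as `gpt-oss-20b`, while `openai/` (incomplete typeahead) and full URLs pass through unchanged.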
Replace strikethrough + opacity-50 OOM styling with gray text and red pill badge across all Studio model selectors (chat, training, onboarding).
- Use gray-500/gray-400 for OOM model names (better contrast than strikethrough)
- Red pill badge for OOM indicator with light/dark mode support
- Scope GGUF gray override to quant name only so downloaded/recommended labels keep colors
- Add !important on TIGHT/OOM badges to resist ComboboxItem hover overrides
* Fix Windows "Non-relative patterns are unsupported" when loading local GGUF models
When a user loads a GGUF model from a local Windows path (e.g.
C:\Users\danie\.lmstudio\models\unsloth\functiongemma-270m-it-GGUF),
the model identifier contains backslashes and a drive letter. Both
load_model_defaults() and _has_specific_yaml() constructed a YAML
filename from the full absolute path and passed it to Path.rglob(),
which rejects non-relative patterns on Windows.
Fixed by detecting Windows-style paths (drive letters, UNC paths,
backslashes) in addition to Unix-style paths, and using only the
directory basename for the YAML filename lookup when the identifier
is a local filesystem path.
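A minimal sketch of the basename idea, assuming hypothetical helper names (`looks_like_local_path`, `yaml_lookup_name`); it is not the project's own detection code. `PureWindowsPath` is used because it splits on both `\` and `/` regardless of the host OS.

```python
import re
from pathlib import PureWindowsPath

# Drive letter ("C:\" or "C:/") or UNC prefix ("\\server\share").
_WINDOWS_PATH_RE = re.compile(r"^(?:[A-Za-z]:[\\/]|\\\\)")


def looks_like_local_path(identifier: str) -> bool:
    """Heuristic: Windows-style or Unix-style filesystem path (assumed rules)."""
    if _WINDOWS_PATH_RE.match(identifier) or "\\" in identifier:
        return True
    return identifier.startswith(("/", "./", "../", "~"))


def yaml_lookup_name(identifier: str) -> str:
    """Name to build the YAML filename from (hypothetical helper)."""
    if not looks_like_local_path(identifier):
        return identifier  # Hub ids such as "org/model" are used as-is
    # Only the directory basename, so the glob pattern stays relative.
    return PureWindowsPath(identifier).name
```

With this sketch the example path above reduces to `functiongemma-270m-it-GGUF`, which is a relative pattern that `Path.rglob()` accepts.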
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Refactor: reuse is_local_path helper, fix case-sensitive suffix lookup
- Replace inline local-path detection in model_config.py and
inference_config.py with the existing is_local_path() from utils.paths,
which already handles Unix, Windows drive-letter, UNC, and backslash paths
- Fix case-sensitive suffix lookup in load_model_defaults(): the
_REVERSE_MODEL_MAPPING is lowercase-keyed, so suffix comparisons must use
.lower() to match paths like /path/to/Spark-TTS-0.5B/LLM
* Fix WSL path parsing and _has_specific_yaml suffix lookup
- Use normalize_path() before Path() operations so backslash Windows
paths (e.g. C:\Users\...\model) are correctly split on POSIX/WSL hosts
where pathlib treats backslashes as literal characters
- Add suffix-based (2-component and 1-component) lookup to
_has_specific_yaml() so it matches the same resolution rules as
load_model_defaults(), fixing wrong inference params for local
suffix-mapped models like Spark-TTS-0.5B/LLM
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix: clear tool status badge immediately after tool execution
The tool status timer badge (Searching 1s, 2s...) persisted after
tool calls finished because the status clear event was only sent
at the start of the next generation iteration, not after tool
execution completed.
Backend: yield status clear after all tools finish in the agentic
loop iteration, before continue starts the next generation pass.
Frontend: debounce badge visibility by 300ms so sub-second tool
calls don't flash the badge.
* Fix debounce regression for consecutive tool calls
Only apply the 300ms show-delay when transitioning from idle to
tool-active. When switching between consecutive tools in the same
turn (e.g. web_search -> python), keep the badge visible immediately
so it does not flicker or disappear during multi-tool runs.
* Delay wasActiveRef reset to bridge inter-iteration tool gaps
The backend emits a status-clear event between tool iterations,
which was resetting wasActiveRef immediately and causing the next
tool to be re-debounced (300ms hidden gap between consecutive tools
in the same turn). Now the ref reset is delayed by 500ms so a
follow-up tool within the same agentic turn shows the badge
immediately, while a genuinely new turn still gets the debounce.
* Use thread lifecycle to track tool-run boundaries
Replace the 500ms wall-clock timeout with the actual thread.isRunning
state to determine when wasActiveRef should reset. This properly
handles all cases:
- Consecutive tools within the same run stay visible without flicker
- The badge hides only when the thread run actually ends
- New turns always get a fresh 300ms debounce on the first tool
- No heuristic timeout that can misfire on slow or fast inference
* Consolidate wasActiveRef reset into single effect
Removes the separate isThreadRunning effect to avoid a race where
the ref resets before the tool-status effect reads it (when
isThreadRunning flips to false before setToolStatus(null) from
the adapter's finally block). Now wasActiveRef resets only when
both toolStatus is null AND the thread run has ended, eliminating
any flicker on the last tool of a run.
* Simplify debounce: use visible state instead of ref tracking
Drop wasActiveRef entirely and use the visible state as the
debounce gate. When the badge is not yet on screen, debounce
for 300ms before showing. When already visible from a prior tool,
keep showing immediately. This correctly handles all cases:
- All fast tools (<300ms) are suppressed, not just the first
- Consecutive tools after the badge is shown stay visible
- Badge persists across inter-iteration clears while thread runs
- New turns get a fresh debounce after visible resets
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* refactor: move folder management from sidebar into model selector
* Fix folder management: restore LoRA picker sync, error handling, caching
- Restore onFoldersChange callback to keep LoRA adapter picker in sync
when scan folders are added/removed (fixes regression from sidebar move)
- Thread onFoldersChange through ModelSelector -> HubModelPicker prop chain
- Add module-level _scanFoldersCache to prevent folder list flash on re-open
- Surface error toast on folder removal failure instead of silently ignoring
- Guard handleAddFolder against concurrent double-submit via folderLoading
- Clear folderInput on Escape key dismiss to prevent stale input on re-open
- Add refreshLocalModelsList and refreshScanFolders to useEffect dep array
* Fix compare-mode folder sync, Escape key propagation, cancel toggle state
- Wire onFoldersChange through CompareContent/GeneralCompareContent so
compare-mode selectors also refresh local models after folder changes
- Add e.stopPropagation() on Escape key in folder input to prevent
Radix Popover from closing the entire model selector dropdown
- Add e.preventDefault() on Enter key to prevent form submission
- Clear folderInput and folderError when cancel toggle hides the input,
matching the Escape key behavior for consistency
* Fix folder mutation state ordering and touch accessibility
- Use optimistic updates for add/remove so the folder list reflects
changes immediately instead of waiting on a second listScanFolders
round-trip that could silently fail.
- Move refreshScanFolders out of the finally block in handleRemoveFolder
so it runs after the cache update, not after onFoldersChange.
- Make the remove button visible on touch/mobile devices and reachable
via keyboard focus (opacity-100 on small screens, focus-visible).
- Add aria-label to the remove button for screen readers.
* Deduplicate optimistic folder add to match backend behavior
The backend returns the existing ScanFolderInfo row when adding a
path that is already registered. The optimistic update was blindly
appending the returned row, producing duplicate entries and React
key warnings. Now checks by id before appending.
* Add aria-label to folder toggle button and strengthen dedup check
- Add aria-label to the +/cancel icon button for screen readers.
- Extend optimistic dedup check to also compare by path, not just id,
to handle edge cases where the cache is stale.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* style(windows): clean installer/setup log output and remove seeded credential banner
* Keep startup credential hint without exposing plaintext password
Print the username and .bootstrap_password file path on first-run
admin creation instead of the raw password. Headless / Docker / SSH
operators still get a startup-time hint for initial sign-in, and the
plaintext credential no longer appears in terminal output or logs.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* feat: add scan_folders table and CRUD functions to studio_db
* feat: add scan folders API endpoints and integrate into model scan
* feat: add scan folders API client and update source types
* feat: add custom source to model filters and selector
* feat: add Model Folders section to chat settings sidebar
* style: fix biome formatting in ModelFoldersSection
* fix: address review findings for custom scan folders
empty string bypass, concurrent delete crash guard,
Windows case normalization, response_model on endpoints,
logging, deduplicated filter/map, module level cache for
custom folder models, consistent source labels, handleRemove
error surfacing, per folder scan cap
* fix: show custom folders section regardless of chatOnly mode
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* refactor: extract shared refreshLocalModelsList in pickers
* Harden custom scan folder validation and scanning
- Validate path exists, is a directory, and is readable before persisting
- Apply per-folder model cap during traversal instead of after (avoids
scanning millions of inodes in large directories)
- Wrap per-folder scan in try/except so one unreadable folder does not
break the entire /api/models/local endpoint for all callers
- Normalize case on Windows before storing so C:\Models and c:\models
dedup correctly
- Extend macOS denylist to cover /private/etc and /private/tmp (realpath
resolves /etc -> /private/etc, bypassing the original denylist)
- Add /boot and /run to Linux denylist
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Improve scan robustness and preserve Windows path casing
- Preserve original Windows path casing in DB instead of lowercasing
(normcase used only for dedup comparison, not storage)
- Catch PermissionError per child directory so one unreadable subdirectory
does not skip the entire custom folder scan
- Wrap list_scan_folders() DB call in try/except so a DB issue does not
break the entire /api/models/local endpoint
* fix: scan custom folders for both flat and HF cache layouts
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix Windows case-insensitive path dedup with COLLATE NOCASE
Use COLLATE NOCASE on the scan_folders.path column so that the UNIQUE
constraint correctly deduplicates C:\Models and c:\models on Windows
without lowercasing the stored path. Also use COLLATE NOCASE in the
pre-insert lookup query on Windows to catch existing rows with
different casing.
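The effect of `COLLATE NOCASE` on a UNIQUE column can be seen with a small in-memory SQLite sketch. Only `scan_folders.path` comes from the description above; the rest of the schema is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema: the real table likely has more columns.
conn.execute(
    "CREATE TABLE scan_folders ("
    " id INTEGER PRIMARY KEY,"
    " path TEXT NOT NULL UNIQUE COLLATE NOCASE)"
)
conn.execute("INSERT INTO scan_folders (path) VALUES (?)", (r"C:\Models",))

try:
    # Same folder, different casing: the UNIQUE constraint rejects it.
    conn.execute("INSERT INTO scan_folders (path) VALUES (?)", (r"c:\models",))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Lookups are case-insensitive too, but the stored casing is preserved.
stored_path = conn.execute(
    "SELECT path FROM scan_folders WHERE path = ?", (r"c:\MODELS",)
).fetchone()[0]
```

SQLite's built-in NOCASE collation only folds ASCII letters, so paths that differ in non-ASCII casing would still be treated as distinct.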
* Restore early-exit limit in _scan_models_dir for custom folders
Keep the limit parameter so _scan_models_dir stops iterating once
enough models are found, avoiding unbounded traversal of large
directories. The post-traversal slice is still applied after combining
with _scan_hf_cache results.
* feat: scan custom folders with LM Studio layout too
* Fix custom folder models being hidden by dedup
Custom folder entries were appended after HF cache and models_dir
entries. The dedup loop kept the first occurrence of each model id,
so custom models with the same id as an existing HF cache entry were
silently dropped -- they never appeared in the "Custom Folders" UI
section.
Use a separate dedup key for custom-source entries so they always
survive deduplication. This way a model can appear under both
"Downloaded" (from HF cache) and "Custom Folders" (from the
user-registered directory) at the same time.
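One way to picture the separate dedup key, as a sketch with assumed field names (`id`, `source`) and an assumed `"custom"` source label; the real entry type and loop may differ.

```python
def dedup_models(entries):
    """Keep the first entry per key; custom-source entries get their own key."""
    seen = set()
    result = []
    for entry in entries:
        # Custom-folder entries are keyed separately so they never collide
        # with an HF cache / models_dir entry that has the same model id.
        namespace = "custom" if entry["source"] == "custom" else "default"
        key = (namespace, entry["id"])
        if key in seen:
            continue
        seen.add(key)
        result.append(entry)
    return result
```

With this keying, a model present in both the HF cache and a registered folder yields two entries, one per UI section, while duplicates within the same namespace are still collapsed.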
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Harden LM Studio scan and fix COLLATE NOCASE on Linux
- Add per-child and per-publisher OSError handling in _scan_lmstudio_dir
so one unreadable subdirectory does not discard the entire custom
folder's results
- Only apply COLLATE NOCASE on the scan_folders schema on Windows where
paths are case-insensitive; keep default BINARY collation on Linux
and macOS where /Models and /models are distinct directories
* Use COLLATE NOCASE in post-IntegrityError fallback SELECT on Windows
The fallback SELECT after an IntegrityError race now uses the same
case-insensitive collation as the pre-insert check, so a concurrent
writer that stored the path with different casing does not cause a
false "Folder was concurrently removed" error.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Simplify tool-call dedup: drop hashlib, inline helpers
The duplicate tool-call detector only compares calls within a single
request from the same JSON parser, so dict key order is guaranteed
identical for identical calls (Python 3.7+ insertion-ordered dicts).
- Replace hashlib.md5(json.dumps(...)) with name + str(args)
- Inline _tool_call_key, _is_duplicate_call, _record_tool_call
since each was a one-liner used once
- Remove unused hashlib import
* Remove tool_calling_benchmark_results.md from repo
* Replace html2text with builtin HTML-to-Markdown converter
Drop the external html2text (GPL-3.0) dependency and its regex
fallback. Add _html_to_md.py (~190 lines, stdlib only) using
html.parser.HTMLParser that handles headings, links, bold/italic,
lists, tables, blockquotes, code blocks, and entity decoding.
Strips script/style/head tags entirely.
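For orientation, a heavily reduced sketch of the `html.parser.HTMLParser` approach. It is not `_html_to_md.py`: the class name and the handled tag set (h1-h3, p, b/strong, a, skipped script/style/head) are choices made for this illustration only.

```python
import re
from html.parser import HTMLParser


class MiniHtmlToMd(HTMLParser):
    """Tiny illustrative converter covering a handful of tags."""

    _SKIP = {"script", "style", "head"}
    _HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities decoded for us
        self._out = []
        self._skip_depth = 0
        self._href = None       # set while inside an <a> element
        self._link_text = []

    def _emit(self, text):
        # Text inside a link is buffered until the closing </a>.
        target = self._link_text if self._href is not None else self._out
        target.append(text)

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1
        elif self._skip_depth:
            return
        elif tag in self._HEADINGS:
            self._out.append("\n\n" + self._HEADINGS[tag])
        elif tag == "p":
            self._out.append("\n\n")
        elif tag in ("b", "strong"):
            self._emit("**")
        elif tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._link_text = []

    def handle_endtag(self, tag):
        if tag in self._SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth:
            return
        elif tag in ("b", "strong"):
            self._emit("**")
        elif tag == "a" and self._href is not None:
            text = " ".join("".join(self._link_text).split())
            href, self._href = self._href, None
            self._emit(f"[{text}]({href})")
        elif tag in self._HEADINGS or tag == "p":
            self._out.append("\n\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._emit(data)


def html_to_md(html: str) -> str:
    parser = MiniHtmlToMd()
    parser.feed(html)
    parser.close()
    # Limit blank runs to one empty line and trim the ends.
    return re.sub(r"\n{3,}", "\n\n", "".join(parser._out)).strip()
```

Lists, tables, blockquotes, and code blocks are deliberately left out here to keep the sketch short.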
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use json.dumps(sort_keys=True) for tool-call dedup key
str(dict) is sensitive to insertion order, so semantically identical
calls with different key ordering would bypass duplicate detection.
Switch to json.dumps with sort_keys=True for a canonical representation.
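The difference between the two key schemes can be checked directly. The helper names and the sample arguments below are made up for this illustration.

```python
import json


def key_via_str(name, arguments):
    # Sensitive to dict insertion order.
    return name + str(arguments)


def key_via_json(name, arguments):
    # Canonical: keys are sorted before serialization.
    return name + json.dumps(arguments, sort_keys=True)


# Two semantically identical calls whose keys arrive in a different order.
call_a = {"query": "unsloth", "max_results": 5}
call_b = {"max_results": 5, "query": "unsloth"}
```

Here `call_a == call_b` is true, yet the `str()` keys differ while the `json.dumps(sort_keys=True)` keys match.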
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Revert dedup key to str(arguments)
json.dumps(sort_keys=True) is unnecessary here -- the arguments dict
always comes from the same JSON parser within a single request, so
key insertion order is deterministic (Python 3.7+). str() is faster
and sufficient for consecutive-call dedup.
* Address review comments on _html_to_md.py
- Remove "hr" from _BLOCK_TAGS so the dedicated hr handler is reachable
- Prefix all newlines with ">" inside blockquotes (multi-line support)
- Emit full `![alt](src)` for images instead of alt text only
- Replace newlines with spaces inside table cells
- Track header cells per-row (_row_has_th) instead of last-cell-only
- Strip trailing tabs in addition to spaces in cleanup regex
* Fix blockquote rendering, truncated-HTML buffer flush, and dedup key canonicalization
_html_to_md.py:
- Rewrite blockquote handling with stack-based buffer approach so nested
blockquotes, pre blocks inside blockquotes, and multi-paragraph quotes
all render correctly with proper "> " prefix on every line.
- Add flush_pending() to recover content from truncated HTML where closing
tags are missing (common when _fetch_page_text caps the download size).
Flushes open <a>, <td>, <pre>, and blockquote buffers.
- Skip <img> tags to match prior html2text ignore_images=True behavior
and avoid data-URI amplification consuming the output budget.
- Collapse all whitespace (including newlines) in non-pre content per
standard HTML whitespace rules: \s+ -> single space.
- Escape pipe characters in table cell content to prevent column breakage.
- Emit separator row after the first row for tables without <th> headers.
- Guard against IndexError on _ol_counter for orphan <li> elements.
- Normalize CRLF line endings before parsing.
llama_cpp.py:
- Restore canonical dedup key with json.dumps(sort_keys=True) so that
semantically identical tool calls with different JSON key order are
correctly detected as duplicates.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix table optional end tags, inline code whitespace, and link text normalization
_html_to_md.py:
- Extract _finish_cell() and _finish_row() helpers to handle HTML tables
that omit optional </td>, </th>, or </tr> end tags. This is valid HTML
and common on real web pages -- previously the parser would silently
drop earlier cells and entire rows.
- Call _finish_cell()/_finish_row() from handle_starttag for <tr>/<td>/<th>,
handle_endtag for </tr>/</td>/</th>/</table>, and flush_pending() so all
three paths (normal close, implicit close, truncated HTML) use the same
row-finalization logic including header separator emission.
- Add _in_inline_code flag so handle_data() preserves literal whitespace
inside <code> spans instead of collapsing it. Runs of multiple spaces
inside inline code (e.g. in a <code>pip install unsloth</code> span) now
survive into the rendered backtick span rather than becoming single spaces.
- Extract _finish_link() helper that normalizes accumulated link text with
\s+ -> single space before building the Markdown link. Prevents block-
level content inside <a> tags (e.g. <a><div>one</div><div>two</div></a>)
from producing multiline [one\n\ntwo](href) link labels.
- Empty blockquotes now produce no output instead of a stray ">".
- Remove unused _bq_depth field (all routing uses _bq_stack).
- Flush open cells and rows in handle_endtag("table") for robustness.
* Support <ol start=N>, <dl>/<dt>/<dd>, and preserve code block whitespace
_html_to_md.py:
- Honor <ol start="N"> attribute so ordered lists preserve their original
numbering instead of always restarting from 1. Important for docs/tutorials
that continue numbering across sections.
- Add dl, dt, dd to _BLOCK_TAGS so definition lists (common on MDN, Python
docs, Django docs) produce separated text instead of concatenated blobs.
- Rewrite _cleanup() to be fence-aware: content inside fenced code blocks
is now preserved verbatim (intentional blank lines in <pre> content are
no longer collapsed). Outside code blocks, blank runs are limited to one
and trailing whitespace is stripped.
- Fix _prefix_blockquote() to strip trailing whitespace before collapsing
blank lines, preventing the "\n\n \n\n" pattern from sneaking through.
* Suppress whitespace-only text nodes between table structural elements
Indented HTML tables (nearly all real-world pages) produce whitespace
text nodes between <table>, <tr>, </tr> etc. that land in the output
as leading spaces before table rows, breaking Markdown table alignment.
Skip whitespace-only text nodes when inside a table but not inside a
cell, so indentation from source HTML does not leak into the output.
* Revert dedup key to str(arguments) with explanatory comment
json.dumps(sort_keys=True) is unnecessary overhead here: arguments
always comes from json.loads on model output within a single request,
so dict insertion order is deterministic in Python 3.7+. A repeated
call from the model produces the same JSON, which parses to the same
dict repr. str() avoids re-serialization on every tool call.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* studio: improve GGUF tool calling accuracy and reliability
- Add URL fetching to web_search tool so models can read full page
content instead of only getting search snippets. Uses html2text for
clean markdown conversion with regex fallback.
- Inject current date and behavioral guidance (URL fetch workflow,
no repeated queries, use code for data processing) into the
tool-use system prompt.
- Append error recovery nudge to tool results that indicate failure,
helping small models avoid looping on the same broken call.
- Strip leaked <tool_call> XML from assistant messages in conversation
history and from the outgoing SSE stream.
- Raise default max tool iterations from 10 to 25 across backend,
model schema, and frontend defaults.
- Increase _MAX_PAGE_CHARS from 4k to 16k so fetched pages contain
enough content for the model to extract useful information.
- Add "IMPORTANT: These are only short snippets" hint to search
results so models know to fetch full pages when needed.
Tested with Qwen3.5-4B-GGUF (UD-Q4_K_XL), 10 runs before/after:
- XML leaks in responses: 10/10 -> 0/10
- URL fetch usage: 0 -> 4/10 runs
- Runs producing actual correct answers: 0/10 -> 2/10
- Average tool calls per query: 5.5 -> 3.8 (more efficient)
- Average response time: 12.3s -> 9.8s
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add tool calling benchmark results across model sizes and quants
Tested 16 configurations (4 models x 2 quants x 2 KV cache types)
with 10 runs each on NVIDIA B200.
Best config: 27B UD-Q4_K_XL + bf16 KV -- 6/10 runs found all 4
correct songs, 0 XML leaks, 131s average response time.
* Add duplicate tool-call detection and final-answer synthesis
When the model repeats the exact same tool call (same name + arguments)
twice in a row, skip execution and return a redirect message telling it
to try a different approach. This prevents the 8x-repeated-query loops
observed on 27B and 35B models.
When the tool iteration cap (25) is reached, inject a "provide your
final answer now" message before the final streaming pass. This lets
the model synthesize a useful answer from everything it gathered
instead of being silently cut off.
Tested on Qwen3.5-27B UD-Q4_K_XL (10 runs):
- Repeated query runs: 4/10 -> 2/10
- Cap hits: 1/10 -> 0/10
- All 4/4 accuracy: 5/10 -> 7/10
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix CodeQL alert: handle whitespace in script/style closing tags
The regex fallback for HTML stripping did not match closing tags
with whitespace before the angle bracket (e.g. </script >).
Use \s* before > in both script and style patterns.
* Address reviewer findings: SSRF, timeout crash, XML regex, dedup
- SSRF: resolve hostname via getaddrinfo and reject private, loopback,
link-local, multicast, and reserved addresses before fetching
- Timeout: handle timeout=None (unlimited mode) in URL fetch path
by defaulting to 60s instead of crashing on min(None, 60)
- Download cap: read at most max_chars*4+1 bytes instead of the
full response body before truncating
- XML regex: match both <tool_call> and <function=...> markup in
the history/stream cleanup (inference.py)
- CodeQL: use [^>]* in closing script/style tags to handle any
whitespace or attributes before >
- Dedup: track whether each tool call failed so retries after
transient errors are allowed; only block consecutive identical
calls that both succeeded
- Final-answer synthesis: guard on max_tool_iterations > 0 so
callers who disable tools do not get a false "used all calls" turn
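The SSRF check in the first bullet can be sketched with the standard `ipaddress` module. Function names and the default port are assumptions; the real code may structure the validation differently.

```python
import ipaddress
import socket


def is_blocked_address(ip_text: str) -> bool:
    """True for private, loopback, link-local, multicast, or reserved IPs."""
    # Drop an IPv6 zone id such as "%eth0" before parsing.
    ip = ipaddress.ip_address(ip_text.split("%", 1)[0])
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
    )


def host_is_safe(hostname: str, port: int = 443) -> bool:
    """Resolve the host and reject it if any resolved address is blocked."""
    try:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return False
    addresses = {info[4][0] for info in infos}
    return bool(addresses) and not any(is_blocked_address(a) for a in addresses)
```

Checking every resolved address, not only the first, matters because a hostname can return a mix of public and private records.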
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix redirect SSRF, SSE streaming regression, dedup off-by-one
- SSRF redirect bypass: disable auto-redirect in urllib, manually
follow up to 5 hops with host validation at each step. Prevents
public URLs from redirecting to loopback/private targets.
- SSE streaming: track prev_text on the raw cumulative and strip
XML from the delta only, so completed tool_call tags do not cause
the cumulative to shrink and drop trailing real text.
- Dedup off-by-one: check the immediately previous call (window=1)
instead of requiring 2 matching history entries, so the second
identical successful call is blocked rather than the third.
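A compact sketch of the window=1 rule combined with the failure tracking from the earlier review round. The class and method names are hypothetical, and the key uses `name + str(arguments)` purely for brevity.

```python
class ToolCallDeduper:
    """Blocks a call only if the immediately previous call was identical
    and succeeded (illustrative sketch, not the project's implementation)."""

    def __init__(self):
        self._last = None  # (key, failed) of the previous call, if any

    @staticmethod
    def _key(name, arguments):
        return name + str(arguments)

    def is_duplicate(self, name, arguments) -> bool:
        # Retries after a failed call are allowed, so only match success.
        return self._last == (self._key(name, arguments), False)

    def record(self, name, arguments, failed: bool) -> None:
        self._last = (self._key(name, arguments), failed)
```

In a loop, `is_duplicate()` would be consulted before executing a tool and `record()` called once at the end of each iteration.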
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix redirect HTTPError handling and tighten error prefixes
- Redirect fix: urllib raises HTTPError (not a normal response) when
the redirect handler returns None. Catch HTTPError for 3xx codes
and extract the Location header from the exception object.
- Error prefixes: remove overly broad "No " prefix that matched
"No results found." (a valid empty-search outcome, not an error).
Replace with specific prefixes like "Blocked:", "No query provided",
"Failed to resolve". This ensures empty search results are correctly
classified as non-errors for duplicate-call tracking.
* Fix SSE cross-chunk XML leaks, cleanup review findings
- SSE streaming: sanitize the full cumulative text before diffing
against the previous sanitized snapshot, so XML tags that span
chunk boundaries are stripped correctly. The previous delta-based
approach leaked split tags.
- DRAINING fallback: use _strip_tool_markup() helper instead of a
manual regex that only handled <tool_call> but not <function=...>.
- Move hashlib import, _TOOL_XML_RE compile, and datetime import to
module level per style guide.
- Remove unused _hit_tool_cap variable.
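The cumulative-sanitize-then-diff idea from the first bullet might look roughly like this. The regex only covers `<tool_call>` markup, and holding back a trailing partial opening tag is an extra assumption of this sketch, not something stated above.

```python
import re

_OPEN_TAG = "<tool_call>"
# Strip complete blocks, plus a block that is still open at the end.
_TOOL_XML_RE = re.compile(r"<tool_call>.*?(?:</tool_call>|\Z)", re.DOTALL)


def sanitize(cumulative: str) -> str:
    clean = _TOOL_XML_RE.sub("", cumulative)
    # Assumption: hold back a trailing fragment that could still grow
    # into "<tool_call>" once the next chunk arrives.
    tail = clean.rfind("<")
    if tail != -1 and _OPEN_TAG.startswith(clean[tail:]):
        clean = clean[:tail]
    return clean


def stream_clean_deltas(chunks):
    """Yield only the newly visible text after sanitizing the full buffer."""
    raw = ""
    prev_clean = ""
    for chunk in chunks:
        raw += chunk
        clean = sanitize(raw)
        if clean.startswith(prev_clean) and len(clean) > len(prev_clean):
            yield clean[len(prev_clean):]
            prev_clean = clean


# A tool_call tag split across three chunks.
demo_chunks = ["Hello <tool_", 'call>{"name": "web_search"}</tool_', "call> world"]
demo_output = list(stream_clean_deltas(demo_chunks))
```

Because the diff is taken between sanitized snapshots, a tag that straddles chunk boundaries never reaches the client in pieces.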
* Fix DNS rebinding, charset detection, HTTPError handling, dedup double-record
- DNS rebinding: resolve hostname once via getaddrinfo, pin the
returned IP, rewrite the URL to connect to the pinned IP with
a Host header. Each redirect hop re-resolves and re-validates.
Closes the TOCTOU window between validation and connection.
- Charset: use resp.headers.get_content_charset() instead of
hardcoding utf-8, so pages with other encodings decode correctly.
- HTTPError: return descriptive "HTTP {code} {reason}" instead of
re-raising into a generic "Search failed" message.
- Dedup: remove redundant _record_tool_call in the duplicate branch;
the single call at the end of the loop handles all cases.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix: auto-retry stalled HF downloads with HF_HUB_DISABLE_XET=1
The heartbeat thread now monitors the HF Hub cache directory for
file-size growth. If no bytes are written for 3 minutes, it sends a
"stall" message to the orchestrator, which kills the subprocess and
retries with HF_HUB_DISABLE_XET=1 (falling back from Xet to standard
HTTPS). If the retry also stalls, it errors out with a clear message.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: include transport type (xet/https) in heartbeat and stall log messages
Makes it clear in backend logs whether the download is using xet or
https transport, and which transport stalled, which helps with debugging.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: monitor HF Hub .tmp dir to avoid false stall detections
huggingface_hub downloads into .tmp/ before atomically moving to
blobs/. Without monitoring .tmp, a large shard actively downloading
for several minutes would show zero blob growth and trigger a false
stall.
* fix: scope HF cache size check to specific model being loaded
Instead of scanning every models--*/blobs directory (O(N) with cached
models), only check the specific model's blobs dir plus the global
.tmp dir. Much faster on systems with many cached models.
* Fix false stall detection on cached/local models and cleanup issues
- Only fire stall if download activity was observed (cache size changed
at least once). Previously, any model load taking >180s would trigger
a false stall, even for already-cached or local models where no
download is happening.
- Return -1 from _get_hf_cache_size on exception to distinguish
"unable to measure" from "genuinely zero bytes". Skip stall logic
when measurement fails.
- Add _shutdown_subprocess before raising on terminal stall path to
prevent leaking a stuck subprocess.
- Detect pre-existing HF_HUB_DISABLE_XET=1 in the parent environment
to avoid a redundant retry cycle when Xet is already disabled.
- Remove global .tmp directory scanning (not used by modern
huggingface_hub; in-progress downloads use .incomplete files in
blobs/ which are already captured by iterdir).
- Add f.is_file() guard in cache size calculation.
- Replace em dashes with ASCII dashes for Windows terminal compat.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Harden stall detection edge cases
- Guard -1 to valid value transition: when initial _get_hf_cache_size
returns -1 (error) and later recovers to a real value, do not count
that as download activity. Only set saw_download_activity when the
previous measurement was also valid (>= 0).
- Move os import to top-level in orchestrator.py instead of inline
import os as _os.
- Fix misleading comment about post-download protection.
* Use .incomplete files to detect active downloads for stall detection
Replace the saw_download_activity heuristic with direct .incomplete file
detection. huggingface_hub creates *.incomplete files in blobs/ during
active downloads and removes them on completion. This gives a reliable
signal for whether a download is actually in progress.
Benefits:
- Cached models: no .incomplete files -> no stall fired even after 180s
- Post-download init (quantization, GPU loading): .incomplete files gone
so stall timer resets, long init phases are not killed
- Pre-download hangs (XET handshake stall): .incomplete files are
created at download start, so zero-byte stalls are now detected
- No more false positives from -1 to valid measurement transitions
The _get_hf_download_state function now returns (total_bytes,
has_incomplete) tuple or None on error, replacing _get_hf_cache_size.
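A simplified sketch of the state function and the stall decision built on it. The signature takes a single blobs directory and the `stalled` helper is hypothetical; the real `_get_hf_download_state` may take different arguments.

```python
import tempfile
from pathlib import Path

STALL_TIMEOUT_S = 180  # "3 minutes" without growth, per the log above


def get_download_state(blobs_dir):
    """Return (total_bytes, has_incomplete), or None if measuring fails."""
    try:
        total = 0
        has_incomplete = False
        for f in Path(blobs_dir).iterdir():
            if not f.is_file():
                continue
            total += f.stat().st_size
            if f.name.endswith(".incomplete"):
                has_incomplete = True
        return total, has_incomplete
    except OSError:
        return None


def stalled(state, idle_seconds: float) -> bool:
    """Fire only while a download is in progress and bytes stopped growing."""
    if state is None:
        return False  # unable to measure: skip stall logic
    _, has_incomplete = state
    return has_incomplete and idle_seconds >= STALL_TIMEOUT_S


# Demo on a throwaway directory standing in for a model's blobs/ dir.
with tempfile.TemporaryDirectory() as tmp:
    blobs = Path(tmp)
    (blobs / "abc123.incomplete").write_bytes(b"x" * 10)
    (blobs / "def456").write_bytes(b"y" * 5)
    demo_state = get_download_state(blobs)
```

A cached model has no `.incomplete` files, so `stalled()` stays false no matter how long loading takes.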
* Add debug logging to download state exception handler
Log the exception at debug level when _get_hf_download_state fails,
instead of silently returning None. Helps with troubleshooting cache
measurement issues.
* Watch both adapter and base model repos for LoRA stall detection
When loading a LoRA adapter, the actual download bottleneck is often
the base model, not the adapter itself. Update the heartbeat to watch
both mc.identifier and mc.base_model cache directories so stall
detection works for LoRA loads where the base model stalls on Xet.
Also update _get_hf_download_state to accept multiple model names and
skip names without "/" (local paths) since those do not have HF cache
directories.
* Fix model name filtering for official HF models without org prefix
Models like gpt2 and bert-base-uncased do not contain a slash but are
still valid HF Hub models with cache directories. Replace the "/" check
with a proper local-path detection that checks for path separators and
path-like prefixes instead.
Also fix the base_model watch list to not require "/" in the base model
name, so official models used as LoRA bases are also monitored.
* Fix local path detection that broke all org/model names on Linux
The os.path.sep check matched "/" in HF model IDs like "org/model" on
Linux, causing the stall detector to skip ALL standard HF models.
Replace with a check that only skips names starting with "/" (absolute
paths), "." (relative paths), "~" (home-relative), or containing "\"
(Windows paths). HF model IDs like "org/model" or "gpt2" pass through
correctly on all platforms.
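The final rule is small enough to state as code; `is_local_model_name` is a hypothetical name for this sketch.

```python
def is_local_model_name(name: str) -> bool:
    """True for names that look like filesystem paths rather than Hub ids."""
    # "/" absolute, "." relative, "~" home-relative, "\" Windows-style.
    return name.startswith(("/", ".", "~")) or "\\" in name
```

Note that a bare `"/" in name` test is deliberately absent, since that is what excluded every `org/model` id on Linux.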
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix(studio): change default weight_decay from 0.01 to 0.001
The default weight decay across Studio was 0.01 but should be 0.001.
Updated the default in all backend fallbacks, the Pydantic model, the
frontend config, and every YAML preset/model-default config.
* fix(studio): auto-set learning rate based on training method
Default LR should be 2e-4 for LoRA/QLoRA and 2e-5 for full fine-tuning.
Frontend: track whether the user has manually edited the LR field via a
_learningRateManuallySet flag (same pattern as trainOnCompletions).
When switching training method and the user has not touched the LR,
auto-set it to the appropriate default. Reset the flag on model load.
Backend: change trainer.py start_training default from 5e-5 to 2e-4,
update default.yaml fallback from 5e-5 to 2e-4, and fix
full_finetune.yaml from 0.0002 (2e-4) to 2e-5.
* refactor(studio): centralize weight_decay and learning rate defaults
Create studio/backend/core/training/constants.py as the single source of
truth for DEFAULT_WEIGHT_DECAY (0.001), DEFAULT_LEARNING_RATE (2e-4),
DEFAULT_LEARNING_RATE_FULL (2e-5), and DEFAULT_LEARNING_RATE_STR ("2e-4").
All backend modules (trainer.py, training.py, worker.py, models/training.py)
now import from constants.py instead of hardcoding values.
On the frontend, add LR_DEFAULT_LORA and LR_DEFAULT_FULL to
config/training.ts and use them in the store instead of magic numbers.
A comment cross-references the backend constants file.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix model-specific LR override, persist migration, and flag resets
- Preserve model-specific learning rates from YAML configs when the
async autoSelectTrainingMethod callback fires (fixes Qwen2.5-1.5B
getting 2e-4 instead of its configured 1e-5, etc.)
- Bump zustand persist version to 9 with migration so existing users
with weightDecay=0.01 get updated to 0.001
- Clear _learningRateManuallySet in reset() and applyConfigPatch()
for consistency with trainOnCompletions flag behavior
- Add DEFAULT_LEARNING_RATE_FULL_STR to constants.py
* Refine applyConfigPatch to only clear LR flag when patch includes LR
Only reset _learningRateManuallySet when the applied config patch
actually provides a learningRate value. This prevents unrelated config
patches from silently disarming the manual-edit guard, which would
cause a subsequent setTrainingMethod call to overwrite the user's
custom LR.
* Preserve model-specific LR when switching between qlora and lora
Only auto-switch the learning rate when the training category changes
(adapter <-> full fine-tuning). Switching between qlora and lora keeps
the current LR since both methods share the same learning rate range.
This preserves curated per-model defaults (e.g. 1e-5 for
Qwen2.5-1.5B-Instruct) when the user toggles between adapter methods.
* Remove constants.py, use YAML configs as the source of truth
The YAML config files (model-specific + default.yaml) are the intended
config layer for training defaults. The Python backend fallbacks now use
inline values that match the YAML configs, rather than importing from a
separate constants module. This keeps the config architecture simple:
YAML files are the single source of truth, and the inline Python
fallbacks are just safety nets that mirror them.
* fix(studio): preserve model-specific LR when switching training method
Stash YAML-provided learning rate and use it to restore the correct
value when switching between adapter and full fine-tune modes.
- qlora <-> lora no longer overwrites the model's LR
- full -> adapter restores the YAML LR instead of a hardcoded constant
- selecting a model while on full fine-tune uses LR_DEFAULT_FULL
instead of applying the YAML adapter LR
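The switching rules above can be summarized in a short sketch. The real logic lives in the frontend store; this Python version is only an illustration, and the method strings ("qlora", "lora", "full") and the function names are assumptions.

```python
LR_DEFAULT_LORA = 2e-4  # adapter default named in the commits above
LR_DEFAULT_FULL = 2e-5  # full fine-tuning default named in the commits above


def category(method: str) -> str:
    # Assumed method names: "qlora" and "lora" are adapters, "full" is not.
    return "full" if method == "full" else "adapter"


def next_learning_rate(current_lr, old_method, new_method,
                       yaml_lr=None, manually_set=False):
    """Pick the LR after a training-method switch (hypothetical helper)."""
    # A manual edit, or a switch within the same category, keeps the LR.
    if manually_set or category(old_method) == category(new_method):
        return current_lr
    if category(new_method) == "full":
        return LR_DEFAULT_FULL
    # Back to an adapter method: prefer the stashed YAML value if present.
    return yaml_lr if yaml_lr is not None else LR_DEFAULT_LORA
```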
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
* fix: throttle and cache HuggingFace modelInfo API calls
The frontend was firing 40 to 60 parallel modelInfo requests on app
startup with zero caching or deduplication, causing HF rate limits.
Adds a caching layer (hf-cache.ts) with TTL cache, inflight request
dedup, and a concurrency limiter. Also debounces the HF token input
so typing a token no longer re-fires all model searches per keystroke.
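The three mechanisms named above (TTL cache, in-flight dedup, concurrency limit) fit together roughly as shown below. The actual layer is TypeScript (hf-cache.ts); this is a Python asyncio analogue under assumed names, and the TTL value is a placeholder.

```python
import asyncio
import time


class CachedFetcher:
    """Hypothetical sketch: TTL cache + in-flight dedup + concurrency cap."""

    def __init__(self, fetch, ttl_seconds=300.0, max_concurrent=3):
        self._fetch = fetch            # async callable: key -> value
        self._ttl = ttl_seconds        # assumed TTL, not from the source
        self._max_concurrent = max_concurrent
        self._cache = {}               # key -> (expires_at, value)
        self._inflight = {}            # key -> asyncio.Task
        self._sem = None               # created lazily inside the event loop

    async def get(self, key):
        hit = self._cache.get(key)
        if hit is not None and hit[0] > time.monotonic():
            return hit[1]              # fresh cache entry
        task = self._inflight.get(key)
        if task is None:               # first caller starts the request
            task = asyncio.ensure_future(self._load(key))
            self._inflight[key] = task
        return await task              # later callers share the same task

    async def _load(self, key):
        if self._sem is None:
            self._sem = asyncio.Semaphore(self._max_concurrent)
        try:
            async with self._sem:      # at most max_concurrent fetches run
                value = await self._fetch(key)
            self._cache[key] = (time.monotonic() + self._ttl, value)
            return value
        finally:
            self._inflight.pop(key, None)
```

Five concurrent `get("a")` calls result in a single underlying fetch, and a later call within the TTL is served from the cache.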
* fix: only fetch VRAM info for visible models in chat selector
* Fix cache key isolation and VRAM badge stability for PR #4696
- Cache key now includes a token fingerprint (last 8 chars) instead of a
boolean, so switching HF tokens gives separate cache entries instead of
serving stale data from the previous token.
- Extract token via credentials?.accessToken to match the @huggingface/hub
API surface.
- Extend CachedResult type with safetensors/tags fields so downstream
consumers no longer need unsafe `as` casts.
- Merge VRAM param map with previous state on scroll instead of replacing
it, preventing a brief flash of missing VRAM badges when new models
become visible.
* Fix VRAM badges missing for search-filtered recommended models
When a user types a search query, filteredRecommendedIds can include
models beyond the currently visible page. These models had no VRAM data
because useRecommendedModelVram only received visibleRecommendedIds.
Now we pass the union of visibleRecommendedIds and filteredRecommendedIds
to the VRAM hook, so recommended models surfaced by search also show
their VRAM badges. The hf-cache layer ensures no duplicate network calls.
* Apply biome formatting to hf-cache.ts and use-recommended-model-vram.ts
Auto-formatted with biome check --write to match project lint rules:
- Block statements for single-line if/for bodies
- Import sorting (type imports first)
- Consistent line wrapping
* Fix extractToken to handle both current and deprecated HF auth forms
The @huggingface/hub CredentialsParams type is a union:
- { accessToken: "hf_..." } (current preferred form)
- { credentials: { accessToken: "..." } } (deprecated form)
Previously only checked params.credentials?.accessToken (deprecated path).
Now checks both forms so the cache key is correct regardless of which
calling convention is used.
* Simplify extractToken, map merge, and set construction
- extractToken: remove type assertions, use direct property access with
truthiness checks for cleaner union type handling
- VRAM map merge: use Map spread constructor instead of manual for loop
- idsForVram: use Set spread construction for more concise dedup
* Add rationale comment for MAX_CONCURRENT=3 in hf-cache.ts
* Skip GGUF repos in VRAM fetch and pre-populate cache from listModels
Two changes to reduce redundant HF API calls:
1. Filter GGUF repos from idsForVram before passing to useRecommendedModelVram.
GGUF repos have no safetensors metadata and the render layer already shows
a static "GGUF" badge -- fetching modelInfo for them is a no-op that wastes
a semaphore slot and a network round-trip.
2. Add primeCacheFromListing() to hf-cache.ts and call it from listModels
yield sites in mergedModelIterator and priorityThenListingIterator.
listModels returns the same type (ModelEntry & Pick<ApiModelInfo, T>) as
modelInfo with the same additionalFields, so the data is interchangeable.
Priming only writes if the key is not already fresh, so it never overwrites
a recent modelInfo response.
This means models discovered via listModels are already in cache when
useRecommendedModelVram later calls cachedModelInfo for them, eliminating
duplicate network requests.
* Fix cache key mismatch: prime both token and anonymous slots
The VRAM hook calls cachedModelInfo without credentials (anonymous key),
but listModels results were primed only under the authenticated key.
For authenticated users the priming was a no-op -- cache miss every time.
Fix: prime both the token-specific slot and the anonymous slot when an
access token is present. Public model metadata (safetensors, tags) is
identical regardless of auth so this is safe.
Also add a defensive guard in primeCacheFromListing for empty name.
* Auto-prime anonymous cache slot from authenticated modelInfo fetches
When cachedModelInfo is called with a token, the result was only stored
under the token-specific key (e.g. model::abc12345). The VRAM hook
calls cachedModelInfo without credentials and reads the anonymous slot
(model::anon), causing a cache miss and duplicate fetch for every
priority model.
Now cachedModelInfo also writes to the anonymous slot on success when
a token is present. Public model metadata (safetensors, tags) is
identical regardless of auth, so this is safe and eliminates ~10
duplicate API calls on first page load.
* Guard anonymous cache priming against gated/private models
Only prime the anonymous cache slot for non-gated, non-private models.
Previously, authenticated modelInfo responses and listing results were
unconditionally copied into the anonymous slot, which could briefly
expose gated/private model metadata after clearing the HF token.
Now checks result.gated and result.private before writing the anon slot.
Public unsloth/ models (the common case) still benefit from the
optimization; gated models like meta-llama/* require a fresh fetch
per auth context.
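A small sketch of the key scheme and the gated/private guard described in the commits above. The function names are hypothetical; the key shape (`model::abc12345`, `model::anon`) follows the examples in the commit text.

```python
def cache_key(model_id, token=None):
    # Fingerprint is the last 8 characters of the token, or "anon".
    fingerprint = token[-8:] if token else "anon"
    return f"{model_id}::{fingerprint}"


def slots_to_prime(model_id, token, gated, private):
    """Return the cache keys a result may be written to (hypothetical)."""
    keys = [cache_key(model_id, token)]
    # Only public, non-gated metadata is copied into the anonymous slot.
    if token and not gated and not private:
        keys.append(cache_key(model_id))
    return keys
```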
* Extract primeFromListing helper to deduplicate cache priming logic
The cache priming pattern (prime token slot + conditionally prime anon
slot for non-gated models) was duplicated in three places. Extracted
into a single primeFromListing() function for maintainability.
* Export CachedResult type, add isStale helper, simplify primeFromListing
- Export CachedResult so consumers can use it directly instead of
the indirect Parameters<typeof ...> pattern.
- Extract isStale(key) helper to deduplicate the cache freshness
check that was repeated in primeCacheFromListing, cachedModelInfo,
and the anonymous-slot priming logic.
- Simplify primeFromListing to use CachedResult directly for both
the data parameter and the gated/private guard, eliminating the
double cast.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Revert to balanced for inference
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove unused for_inference parameter from get_device_map
Since inference and training both use "balanced" now, the for_inference
flag is dead code. Remove it from the function signature, the call site
in inference.py, and simplify the tests accordingly.
* Remove redundant TestDeviceMapForInference test class
TestGpuAutoSelection already covers the same multi-gpu and single-gpu
device_map assertions. The TestDeviceMapForInference class was left
over from when for_inference had distinct behavior.
* Remove redundant test_get_device_map_multi_gpu_uses_balanced
Its assertions ([0,1] -> balanced, [0] -> sequential) are already
covered by test_get_device_map_uses_explicit_gpu_selection.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix(studio): open tour ReadMore links in new tab
The quick tour "Read more" links navigate away from Studio instead of
opening in a separate tab. Add target="_blank" and rel="noopener
noreferrer" to the ReadMore component so external doc links open in a
new browser tab.
* fix(studio): only open external ReadMore links in new tab
Apply target="_blank" conditionally based on whether the href starts
with "http", so internal links still navigate in the same tab.
* Tighten external-link detection in ReadMore component
Use regex /^https?:\/\// instead of startsWith("http") so the check
requires the full protocol prefix and does not match non-URL strings
that happen to begin with "http".
* Hoist regex to module scope for ReadMore
Move EXTERNAL_URL_RE to top-level constant to satisfy the biome
useTopLevelRegex lint rule and avoid re-creating the RegExp on
every render.
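The difference between the prefix check and the anchored regex is easy to see in a few lines. The component itself is TypeScript; this Python sketch only mirrors the pattern, and `is_external` is a hypothetical name.

```python
import re

# Compiled once at module scope, mirroring the hoisted EXTERNAL_URL_RE.
EXTERNAL_URL_RE = re.compile(r"^https?://")


def is_external(href: str) -> bool:
    return EXTERNAL_URL_RE.match(href) is not None
```

A string such as "http-notes.md" passes a plain startswith("http") test but is rejected here.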
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* studio: gate multimodal incompatibility warning on settled model capabilities
* Also disable Start button during isCheckingVision fallback
When getModelConfig fails and the fallback checkVisionModel is still
in-flight, isLoadingModelDefaults clears before isCheckingVision does.
Without also gating on isCheckingVision the Start button briefly
re-enables with stale capability flags.
Add isCheckingVision to the disabled condition and show "Loading
model..." text while either flag is active.
* Show correct error message for audio dataset incompatibility
The incompatibility warning always said "switch to a vision model"
even when the actual issue was an audio dataset on a non-audio model.
Now shows an audio-specific message when the mismatch is audio.
* Extract isLoadingModel constant for clarity
Pull the combined model-loading condition into a single constant
reused by the settled check, the disabled prop, and the button label.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
The 180s wall-clock timeout would kill model loads on slow connections
even when the download was actively progressing. Now the worker sends
heartbeat status messages every 30s during loading, and the orchestrator
resets its 300s deadline on each one — so it only times out when the
subprocess goes truly silent.
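The deadline-reset idea can be sketched with a queue of status messages. The message shape (`{"type": "heartbeat"}`) and the function name are assumptions made for illustration; only the 300s idle window comes from the text above.

```python
import queue
import time


def wait_for_load(messages, idle_timeout=300.0):
    """Block until a non-heartbeat message arrives (hypothetical sketch).

    Raises TimeoutError only if nothing at all arrives for idle_timeout
    seconds; every heartbeat pushes the deadline forward.
    """
    deadline = time.monotonic() + idle_timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("worker went silent while loading")
        try:
            msg = messages.get(timeout=remaining)
        except queue.Empty:
            continue  # next iteration sees the expired deadline and raises
        if msg.get("type") == "heartbeat":
            deadline = time.monotonic() + idle_timeout
            continue
        return msg
```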
* fix: skip download progress polling for exported GGUF models
* fix: revert isLocalGgufDir change — exported GGUFs are file paths, not dirs
* fix: set isDownloaded true for all adapters in LoraModelPicker
* fix(studio): replace unicode emoji in print() to avoid cp1252 crash on Windows
On Windows the default console encoding is cp1252 which cannot encode
unicode emoji like U+2705 or U+26A0. Bare print() calls with these
characters cause a UnicodeEncodeError at runtime.
- run.py: replace emoji with ASCII status prefixes [OK] and [WARNING]
- format_conversion.py: remove duplicate print() that mirrors the
logger.info() call on the next line, and drop the emoji from the
log message since loggers handle encoding separately
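The failure mode is reproducible without a Windows console by encoding directly to cp1252. The helper name `encodable` is hypothetical.

```python
def encodable(text: str, encoding: str = "cp1252") -> bool:
    """Return True if text can be encoded without raising."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False
```

Strings containing U+2705 or U+26A0 fail this check, while the ASCII prefixes "[OK]" and "[WARNING]" pass.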
* fix(studio): apply same emoji/print cleanup to parallel VLM conversion path
The parallel URL-based conversion logic has the same duplicate print()
with emoji that was fixed in the sequential path. Remove the bare
print() and drop the emoji from the logger.info() call.
* Treat install_python_stack.py failure as fatal in setup.ps1
On Linux/Mac, setup.sh runs under set -euo pipefail so a non-zero
exit from install_python_stack.py aborts the installer. On Windows,
setup.ps1 had no exit code check -- if the Python script crashed
(e.g. from the cp1252 UnicodeEncodeError), the installer silently
continued past the dependency loop and reported success. Studio
would then fail at launch with ModuleNotFoundError for structlog,
fastapi, and other deps that were never installed.
Capture $LASTEXITCODE and exit 1 if the dependency installer fails,
matching the error handling pattern already used for PyTorch install.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* fix: scope packages.find to prevent node_modules namespace scanning
The packages.find section had no include filter, so setuptools'
find_namespace_packages discovered all directories as potential Python
packages -- including the 6,557 directories inside
studio/frontend/node_modules/ after the frontend build step.
This caused the editable install overlay step to run 20,000+ glob
operations across 6,619 "packages", which on fast NVMe takes ~5s but
on slower disks can take 7+ minutes.
Adding an explicit include filter scopes discovery to only the packages
we actually ship (unsloth, unsloth_cli, studio, studio.backend), dropping
from 6,619 to 58 discovered packages and the editable build time from
5.4s to 1.2s.
Also removes the broken kernels/moe exclude (used "/" instead of "."
notation so it never matched) and adds a node_modules exclude as a
safety net.
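Why a "/"-style exclude never matches, and how node_modules patterns behave, can be checked with fnmatch. This sketch assumes setuptools compares exclude patterns against dotted package names using fnmatch-style wildcards; the package names and patterns below are illustrative only.

```python
from fnmatch import fnmatchcase


def excluded(package: str, patterns) -> bool:
    """True if the dotted package name matches any exclude pattern."""
    return any(fnmatchcase(package, p) for p in patterns)


# Two narrow patterns versus one broad pattern (illustrative values).
PRECISE = ("*.node_modules", "*.node_modules.*")
BROAD = ("*.node_modules*",)
```

A slash-separated pattern cannot match a dotted name, and the broad pattern also catches names that merely contain "node_modules" as a substring.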
* fix: use precise node_modules exclude patterns
Use "*.node_modules" and "*.node_modules.*" instead of "*.node_modules*"
to avoid accidentally excluding valid packages that might contain
"node_modules" as a substring in their name.
* [WIP] balanced device map for studio
* gpus as a request parameter
* API for multi GPU stuff
* return multi gpu util in new API
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use balanced_low0 instead of balanced
* Fix device_map typo, UUID parsing crash, set() filter bug, and broken tests
- balanced_low0 -> balanced_low_0 (transformers/accelerate rejects the old string)
- get_parent_visible_gpu_ids() now handles UUID/MIG CUDA_VISIBLE_DEVICES
gracefully instead of crashing on int() parse
- _get_backend_visible_gpu_info() set() or None bug: empty set is falsy so
CUDA_VISIBLE_DEVICES=-1 would disable filtering and report all GPUs
- test_gpu_selection.py: add missing get_visible_gpu_utilization import and
add required job_id arg to start_training() calls
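The `set() or None` pitfall mentioned above comes from empty containers being falsy. A minimal reproduction, with hypothetical function names:

```python
def visible_filter_buggy(ids):
    # An empty set is falsy, so "no visible GPUs" collapses to None,
    # which callers would read as "do not filter at all".
    return set(ids) or None


def visible_filter_fixed(ids):
    # Distinguish "no information" (None) from "nothing visible" (empty set).
    return None if ids is None else set(ids)
```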
* Smart GPU determinism using estimates
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* disallow gpu selection for gguf for now
* cleanup
* Slightly larger baseline
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Treat empty list as auto
* Verbose logging/debug
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Cleanup and revert unnecessary deletions
* Cleanup excessive logs and guard against disk/cpu offload
* auth for visibility API. cleanup redundant imports. Adjust QLoRA estimate
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* support for non cuda gpus
* Fix multi-GPU auto-selection memory accounting
The multi_gpu_factor was applied uniformly to all GPUs including the
first one, which unfairly penalized single-GPU capacity when
transitioning to multi-GPU. This created a discontinuity where a model
that barely fits 1 GPU would suddenly require 2 GPUs because the first
GPU's free memory was discounted by 20%.
Now the first GPU keeps its full free memory, and only additional GPUs
have an overhead factor (0.85) applied to account for inter-GPU
communication and sharding overhead. This gives more accurate
auto-selection and avoids unnecessary multi-GPU for models that
comfortably fit on one device.
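The accounting rule can be written out as a short sketch. The function names are hypothetical and the GPU ordering is assumed to be given; only the 0.85 factor for additional GPUs comes from the text above.

```python
def usable_memory_gb(free_gb, extra_gpu_factor=0.85):
    """First GPU counts in full; each additional GPU is discounted."""
    if not free_gb:
        return 0.0
    return free_gb[0] + extra_gpu_factor * sum(free_gb[1:])


def gpus_needed(required_gb, free_gb, extra_gpu_factor=0.85):
    """Smallest prefix of free_gb that fits required_gb, or None."""
    for n in range(1, len(free_gb) + 1):
        if usable_memory_gb(free_gb[:n], extra_gpu_factor) >= required_gb:
            return n
    return None
```

With two 24 GB GPUs, a 23 GB requirement stays on one device, and the pair offers 24 + 0.85 * 24 = 44.4 GB in total.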
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add sandbox tests for multi-GPU selection logic
24 tests covering model size estimation, memory requirements, automatic
GPU selection, device map generation, GPU ID validation, and multi-GPU
overhead accounting. All tests use mocks so they run without GPUs on
Linux, macOS, and Windows.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix reviewer findings: 4bit inference estimate, fallback, GGUF gpu_ids, retry
1. 4-bit inference now uses reduced memory estimate (model_size/3 + buffer)
instead of the FP16 1.3x multiplier. This prevents over-sharding
quantized models across unnecessary GPUs.
2. When model size estimation fails, auto_select_gpu_ids now falls back to
all visible GPUs instead of returning None (which could default to
single-GPU loading for an unknown-size model).
3. GGUF inference route now treats gpu_ids=[] as auto-selection (same as
None) instead of rejecting it as an unsupported explicit request.
4. Training retry path for "could not get source code" now preserves the
gpu_ids parameter so the retry lands on the same GPUs.
5. Updated sandbox tests to cover the new 4-bit inference estimate branch.
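Item 1 amounts to choosing between two formulas. In the sketch below the buffer size is a placeholder (the commit does not state it), and the function name is hypothetical.

```python
def inference_memory_gb(model_size_gb, load_in_4bit, buffer_gb=2.0):
    """Rough inference memory estimate (hypothetical sketch).

    buffer_gb is an assumed placeholder; the source only says "+ buffer".
    """
    if load_in_4bit:
        return model_size_gb / 3 + buffer_gb
    return model_size_gb * 1.3  # FP16 multiplier named in the commit
```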
* Remove accidentally added unsloth-zoo submodule
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix UUID/MIG visibility and update test expectations
1. nvidia.py: When CUDA_VISIBLE_DEVICES uses UUID/MIG tokens, the
visibility APIs now return "unresolved" with empty device lists instead
of exposing all physical GPUs. This prevents the UI from showing GPUs
that the backend process cannot actually use.
2. test_gpu_selection.py: Updated test expectations to match the new
multi-GPU overhead accounting (first GPU at full capacity, 0.85x for
additional GPUs) and 4-bit inference memory estimation formula.
All 60 tests now pass.
* Add CPU/disk offload guard to audio inference path
The audio model loading branch returned before the common
get_offloaded_device_map_entries() check, so audio models loaded with a
multi-GPU device_map that spilled layers to CPU/disk would be accepted
instead of rejected. Now audio loads also verify no modules are offloaded.
* Improve VRAM requirement estimates
* Replace balanced_low_0 with balanced
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* refine calculations for slightly easier nums
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* adjust estimates
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use nums instead of obj to avoid serialisation error
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Harden nvidia-smi parsing and fix fallback GPU list
1. nvidia.py: Wrap int() casts for GPU index and memory in try/except
so MIG slices, N/A values, or unexpected nvidia-smi output skip the
unparseable row instead of aborting the entire GPU list.
2. nvidia.py: Handle GPU names containing commas by using the last
field as memory instead of a fixed positional index.
3. hardware.py: fallback_all now uses gpu_candidates (GPUs with verified
VRAM data) instead of raw devices list, which could include GPUs
with null VRAM that were excluded from the ranking.
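Points 1 and 2 can be illustrated with a tolerant row parser. This sketch assumes CSV rows of the form `index, name, memory` (for example from `nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader,nounits`); the column set and the function name are assumptions, not taken from nvidia.py.

```python
def parse_gpu_rows(csv_text):
    """Parse 'index, name, memory' rows, skipping rows that do not parse."""
    gpus = []
    for line in csv_text.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 3:
            continue
        try:
            index = int(fields[0])
            memory_mib = int(fields[-1])  # last field, not a fixed position
        except ValueError:
            continue  # e.g. "[N/A]" memory or a non-numeric index
        # Everything between the first and last field is the GPU name,
        # so names that themselves contain commas survive intact.
        name = ", ".join(fields[1:-1])
        gpus.append({"index": index, "name": name, "memory_mib": memory_mib})
    return gpus
```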
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* cleanup
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* consolidate raise_if_offload
* Improve MoE support. Guard against nvidia-smi failures
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix shared-expert LoRA undercount, torch VRAM fallback, and apply_gpu_ids edge case
1. vram_estimation.py: compute_lora_params now includes shared experts
(n_shared_experts) alongside routed experts when computing MoE LoRA
adapter parameters. Previously only n_experts were counted, causing
the estimator to undercount adapter, optimizer, and gradient memory
for DeepSeek/GLM-style models with shared experts.
2. hardware.py: _torch_get_per_device_info now uses mem_get_info (which
reports system-wide VRAM usage) instead of memory_allocated (which
only reports this process's PyTorch allocations). This prevents
auto-selection from treating a GPU as mostly free when another
process is consuming VRAM. Falls back to memory_allocated when
mem_get_info is unavailable.
3. hardware.py: apply_gpu_ids([]) now returns early instead of setting
CUDA_VISIBLE_DEVICES="" which would disable CUDA entirely. Empty
list inherits the parent visibility, same as None.
4. hardware.py: Upgraded fallback_all GPU selection log from debug to
warning so operators are notified when the model likely will not fit
in available VRAM.
* Guard nvidia-smi subprocess calls against OSError and TimeoutExpired
get_visible_gpu_utilization and get_backend_visible_gpu_info now catch
OSError (nvidia-smi not found) and TimeoutExpired internally instead
of relying on callers to wrap every invocation. Returns the standard
available=False sentinel on failure so the torch-based fallback in
hardware.py can take over.
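The self-contained guard looks roughly like this. The function name, the `binary` parameter, and the exact sentinel shape are assumptions for illustration; `subprocess.run` raises `FileNotFoundError` (an `OSError`) for a missing executable and `subprocess.TimeoutExpired` on timeout.

```python
import subprocess


def run_smi(args, timeout=5.0, binary="nvidia-smi"):
    """Run the tool and return a dict; never raise for missing/slow binaries."""
    try:
        proc = subprocess.run(
            [binary, *args],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return {"available": False}
    if proc.returncode != 0:
        return {"available": False}
    return {"available": True, "stdout": proc.stdout}
```

Callers can then branch on `result["available"]` and fall back to torch without wrapping every call in their own try/except.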
* Guard get_primary_gpu_utilization and reset GPU caches between tests
1. nvidia.py: get_primary_gpu_utilization now catches OSError and
TimeoutExpired internally, matching the pattern already used in
get_visible_gpu_utilization and get_backend_visible_gpu_info. All
three nvidia-smi callers are now self-contained.
2. test_gpu_selection.py: Added _GpuCacheResetMixin that resets the
module-level _physical_gpu_count and _visible_gpu_count caches in
tearDown. Applied to all test classes that exercise GPU selection,
device map, or visibility functions. This prevents stale cache
values from leaking between tests and causing flaky results on
machines with real GPUs.
* Fix nvidia-smi fallback regression and physical GPU count validation
1. hardware.py: get_gpu_utilization, get_visible_gpu_utilization, and
get_backend_visible_gpu_info now check result.get("available") before
returning the nvidia-smi result. When nvidia-smi is unavailable or
returns no data (e.g., containers without nvidia-smi, UUID/MIG masks),
the functions fall through to the torch-based fallback instead of
returning an empty result. This fixes a regression where the internal
exception handling in nvidia.py prevented the caller's except block
from triggering the fallback.
2. hardware.py: resolve_requested_gpu_ids now separates negative-ID
validation from physical upper-bound validation. The physical count
check is only enforced when it is plausibly a true physical count
(i.e., higher than the largest parent-visible ID), since
torch.cuda.device_count() under CUDA_VISIBLE_DEVICES returns the
visible count, not the physical total. The parent-visible-set check
remains authoritative in all cases. This prevents valid physical IDs
like [2, 3] from being rejected as "out of range" when nvidia-smi is
unavailable and CUDA_VISIBLE_DEVICES="2,3" makes torch report only
2 devices.
* Fix UUID/MIG torch fallback to enumerate devices by ordinal
When CUDA_VISIBLE_DEVICES uses UUID or MIG identifiers,
get_parent_visible_gpu_ids() returns [] because the tokens are
non-numeric. The torch fallback in get_visible_gpu_utilization() and
get_backend_visible_gpu_info() previously passed that empty list to
_torch_get_per_device_info(), getting nothing back.
Now both functions detect the empty-list case and fall back to
enumerating torch-visible ordinals (0..device_count-1) with
index_kind="relative". This means the UI and auto-selection still
see real device data in Kubernetes, MIG, and Slurm-style UUID
environments where nvidia-smi output cannot be mapped to physical
indices.
Updated test_uuid_parent_visibility to verify the new torch fallback
path returns available=True with relative ordinals.
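The fallback described above can be sketched in two small functions. Names are hypothetical, and the "physical" label is an assumption; only `index_kind="relative"` appears in the commit text.

```python
def parse_visible_ids(value: str):
    """Numeric CUDA_VISIBLE_DEVICES tokens as ints; [] for UUID/MIG tokens."""
    ids = []
    for token in value.split(","):
        token = token.strip()
        if not token:
            continue
        if not token.isdigit():
            return []  # UUID or MIG identifier: cannot map to an index
        ids.append(int(token))
    return ids


def devices_to_query(value: str, device_count: int):
    """Return (ids, index_kind) for the torch-based fallback."""
    ids = parse_visible_ids(value)
    if ids:
        return ids, "physical"
    # Non-numeric mask: enumerate torch-visible ordinals 0..device_count-1.
    return list(range(device_count)), "relative"
```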
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add type hint for gpu_ids parameter in InferenceOrchestrator.load_model
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Fixes #4670
Separates the GGUF context slider ceiling from the currently active context length so lowering context via Chat Settings no longer locks the slider max to the reduced value.
- Backend: adds `max_context_length` to GGUF load/status responses, computed from the largest VRAM/KV-fit cap across all usable GPU subsets
- Frontend: stores `ggufMaxContextLength` and uses it for Context Length slider/input bounds; hydrates from both `/api/inference/load` and `/api/inference/status`
- Defaults UI ceiling to native context for CPU-only and fallback paths
- Seeds `effective_ctx` and `max_available_ctx` before GPU probing to prevent `UnboundLocalError` on probe failure
- Property fallback uses native `_context_length`, not effective `context_length`
* refactor(studio): unify setup terminal output style and add verbose setup mode
* studio(windows): align setup.ps1 banner/steps with setup.sh (ANSI, verbose)
* studio(setup): revert nvcc path reordering to match main
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* studio(setup): restore fail-fast llama.cpp setup flow
* studio(banner): use IPv6 loopback URL when binding :: or ::1
* Fix IPv6 URL bracketing, try_quiet stderr, _step label clamp
- Bracket IPv6 display_host in external_url to produce clickable URLs
- Redirect try_quiet failure log to stderr instead of stdout
- Clamp _step label to column width to prevent negative padding
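The bracketing rule for IPv6 hosts is short enough to show in full. `display_url` is a hypothetical name and the scheme is assumed to be http.

```python
def display_url(host: str, port: int) -> str:
    """Build a clickable URL, bracketing bare IPv6 literals."""
    if ":" in host and not host.startswith("["):
        host = f"[{host}]"
    return f"http://{host}:{port}"
```

Without the brackets, "http://::1:8888" is ambiguous because the port separator cannot be told apart from the address colons.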
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add sandbox integration tests for PR #4494 UX fixes
Simulation harness (tests/simulate_pr4494.py) creates an isolated uv
venv, copies the real source files into it, and runs subprocess tests
for all three fixes with visual before/after demos and edge cases.
Standalone bash test (tests/test_try_quiet.sh) validates try_quiet
stderr redirect across 8 scenarios including broken-version contrast.
39 integration tests total (14 IPv6 + 15 try_quiet + 10 _step), all
existing 75 unit tests still pass.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Truncate step() labels in setup.sh to match PS1 and Python
The %-15s printf format pads short labels but does not truncate long
ones. Change to %-15.15s so labels wider than 15 chars are clipped,
matching the PowerShell .Substring(0,15) and Python label[:15] logic.
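The Python side of the same pad-and-truncate behaviour, as a sketch (the helper name `step_label` is hypothetical):

```python
def step_label(label: str, width: int = 15) -> str:
    """Left-align in a fixed column: pad short labels, clip long ones."""
    return f"{label[:width]:<{width}}"
```

This is the equivalent of printf's `%-15.15s`: the first 15 sets the minimum field width and the second caps the number of characters printed.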
* Remove sandbox integration tests from PR
These test files are not part of the styling fix and should not
ship with this PR.
* Show error output on failure instead of suppressing it
- install_python_stack.py: restore _red for patch_package_file
warnings (was downgraded to _dim)
- setup.ps1: capture winget output and show on failure for CUDA,
Node, Python, and OpenSSL installs (was piped to Out-Null)
- setup.ps1: always show git pull failure warning, not just in
verbose mode
* Show winget error output for Git and CMake installs on failure
Same capture-and-print-on-failure pattern already used for
Node, Python, CUDA, and OpenSSL winget installs.
* fix: preserve stderr for _run_quiet error messages in setup.sh
The step() helper writes to stdout, but _run_quiet's error header
was originally sent to stderr (>&2). Without the redirect, callers
that separate stdout/stderr would miss the failure headline while
still seeing the log body on stderr. Add >&2 to both step calls
inside _run_quiet to match main's behavior.
* feat: add --verbose flag to setup and update commands
Wire UNSLOTH_VERBOSE=1 through _run_setup_script() so that
'unsloth studio update --verbose' (and the deprecated 'setup')
passes the flag to setup.sh / setup.ps1 / install_python_stack.py.
* fix(studio): honor verbose logging and keep llama.cpp failures non-blocking
* fix(studio): switch installer to 'studio update' and normalize Windows setup logs
* chore(studio): refine localhost tip and remove skip-base setup noise
* fix(studio): align Windows setup logs with Linux style and improve startup tips
* fix(studio): align Windows setup logs with Linux style
* refactor(windows-installer): align install/setup logs with Linux style and silence auto-launch output
* refactor(windows): align installer/setup output with Linux style and reduce default verbosity
* refactor(windows): match install.ps1 output style/colors to setup and quiet default logs
* fix(studio-banner): update personal-computer localhost tip
* fix(setup.sh): restore verbose llama.cpp build output while keeping default quiet mode
* fix(install.sh): align installer logging with setup style and restore POSIX-safe color output
* fix(install.sh): preserve installer reliability and launch visibility
Export verbose mode for child setup processes, harden install command handling under set -e, and keep first-run studio launch non-silent so users can always see URL and port fallback output.
* fix(windows installer): keep exit semantics and degrade status accurate
Use quiet command redirection that preserves native exit codes, keep startup output visible on first launch, and report limited install status when llama.cpp is unavailable.
* fix(setup.sh): improve log clarity and enforce GGUF degraded signaling
Restore clean default setup output, add verbose-only diagnostics, fail fast on Colab dependency install errors, and return non-zero when GGUF prerequisites or llama.cpp artifacts are unavailable.
* fix(installer): harden bash preflight and PowerShell GPU checks
Fail fast when bash is unavailable before invoking setup.sh, and replace remaining nvidia-smi pipeline checks with stream redirection patterns that preserve reliable native exit-code handling.
* fix(windows): keep verbose output visible while preserving exit codes
Ensure PowerShell wrapper helpers in install/update stream native command output to host without returning it as function output, so npm logs no longer corrupt exit-code checks in verbose mode.
* fix(windows): avoid sticky UNSLOTH_VERBOSE and gate studio update verbosity
* Fix degraded llama.cpp exit code, PS verbose stderr, banner URLs, npm verbose
- setup.sh: Do not exit non-zero when llama.cpp is unavailable; the footer
already reports the limitation, and install.sh runs under set -e so a
non-zero exit aborts the entire install including PATH/shortcuts/launch.
- setup.ps1: Remove $? check in Invoke-SetupCommand verbose path; PS 5.1
sets $? = $false when native commands write to stderr even with exit 0.
Merge stderr into stdout with 2>&1 and rely solely on $LASTEXITCODE.
- startup_banner.py: Show the actual bound address when Studio is bound to
a non-loopback interface instead of always showing 127.0.0.1/localhost.
- setup.sh: Use run_quiet_no_exit instead of run_quiet_no_exit_always for
npm install steps so --verbose correctly surfaces npm output.
* Fix install.ps1 verbose stderr, propagate UNSLOTH_VERBOSE, fix git clone verbose
- install.ps1: Apply same Invoke-InstallCommand fix as setup.ps1 -- merge
stderr into stdout with 2>&1 and drop the $? check that misclassifies
successful native commands on PS 5.1.
- install.ps1 + setup.ps1: Export UNSLOTH_VERBOSE=1 to the process env
when --verbose is passed so child processes like install_python_stack.py
also run in verbose mode.
- setup.sh: Use run_quiet_no_exit for git clone llama.cpp so --verbose
correctly surfaces clone diagnostics during source-build fallback.
* Surface prebuilt llama.cpp output in verbose mode, remove dead code, fix banner
- setup.sh: Use tee in verbose mode for prebuilt llama.cpp installer so
users can see download/validation progress while still capturing the log
for structured error reporting on failure.
- setup.ps1: Same fix for Windows -- use Tee-Object in verbose mode.
- setup.sh: Remove run_quiet_no_exit_always() which has no remaining callers.
- startup_banner.py: Avoid printing the same URL twice when Studio is
bound to a specific non-loopback address that matches the display host.
* Fix run_install_cmd exit code after failed if-statement
The previous pattern 'if "$@"; then return 0; fi; _rc=$?' always captured
$? = 0 because $? reflects the if-statement result, not the command's exit
code. Switch to '"$@" && return 0; _rc=$?' which preserves the actual
command exit code on failure. Applies to both verbose and quiet branches.
* Fix _run_quiet exit code, double uv install, missing --local flag
- setup.sh: Fix _run_quiet verbose path that always captured exit code 0
due to $? resetting after if-then-fi with no else. Switch to the same
'"$@" && return 0; exit_code=$?' pattern used in install.sh.
- setup.sh: Consolidate the two uv install branches (verbose + quiet)
into a single attempt with conditional output. Previously, when verbose
mode was on and the install failed, a second silent attempt was made.
- install.ps1: Pass --local flag to 'unsloth studio update' when
$StudioLocalInstall is true. Without this, studio.py's update() command
overwrites STUDIO_LOCAL_INSTALL to "0", which could cause issues if
setup.ps1 or install_python_stack.py later checks that variable.
* Revert SKIP_STUDIO_BASE change for --no-torch, restore install banners
- Revert SKIP_STUDIO_BASE from 0 to 1 for --no-torch. install.sh already
installs unsloth+unsloth-zoo and no-torch-runtime.txt before calling
setup.sh, so letting install_python_stack.py redo it was redundant and
slowed down --no-torch installs for no benefit.
- Restore the "Unsloth Studio installed!" success banner and "starting
Unsloth Studio..." launch message so users get clear install completion
feedback before the server starts.
* Make llama.cpp build failure a hard error with proper cleanup
- setup.sh: Restore exit 1 when _LLAMA_CPP_DEGRADED is true. GGUF
inference requires a working llama.cpp build, so this should be a
hard failure, not a silent degradation.
- install.sh: Catch setup.sh's non-zero exit with '|| _SETUP_EXIT=$?'
instead of letting set -e abort immediately. This ensures PATH setup,
symlinks, and shortcuts still get created so the user can fix the
build deps and retry with 'unsloth studio update'. After post-install
steps, propagate the failure with a clear error message.
* Revert install.ps1 to 'studio setup' to preserve SKIP_STUDIO_BASE
'studio update' pops SKIP_STUDIO_BASE from the environment, which
defeats the fast-path version check added in PR #4667. When called
from install.ps1 (which already installed packages), SKIP_STUDIO_BASE=1
must survive into setup.ps1 so it skips the redundant PyPI check and
package reinstallation. 'studio setup' does not modify env vars.
* Remove deprecation message from 'studio setup' command
install.ps1 uses 'studio setup' (not 'studio update') to preserve
SKIP_STUDIO_BASE. The deprecation message was confusing during first
install since the user never typed the command.
* Fix stale env vars, scope degraded exit, generic error message for PR #4651
- install.ps1: Always set STUDIO_LOCAL_INSTALL and clear STUDIO_LOCAL_REPO
when not using --local, to prevent stale values from a previous --local
run in the same PowerShell session. Fix log messages to say 'setup' not
'update' since we call 'studio setup'.
- setup.sh: Only exit non-zero for degraded llama.cpp when called from the
installer (SKIP_STUDIO_BASE=1). Direct 'unsloth studio update' keeps
degraded installs successful since Studio is still usable for non-GGUF
workflows and the footer already reports the limitation.
- install.sh: Make the setup failure error message generic instead of
GGUF-specific, so unrelated failures (npm, Python deps) do not show
misleading cmake/git recovery advice.
* Show captured output on failure in quiet mode for PR #4651
Both Invoke-InstallCommand (install.ps1) and Invoke-SetupCommand
(setup.ps1) now capture command output in quiet mode and display it
in red when the command fails. This matches the behavior of
run_install_cmd in install.sh where failure output is surfaced even
in quiet mode, making cross-platform error debugging consistent.
* Match degraded llama.cpp exit on Windows, fix --local recovery hint for PR #4651
- setup.ps1: Exit non-zero for degraded llama.cpp when called from
install.ps1 (SKIP_STUDIO_BASE=1), matching setup.sh behavior. Direct
'unsloth studio update' keeps degraded installs successful.
- install.sh: Show 'unsloth studio update --local' in the recovery
message when the install was run with --local, so users retry with
the correct flag instead of losing local checkout context.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix: add PyPI version check to setup.ps1 for fast update path
Port the update-flow logic from setup.sh to setup.ps1 so that
`unsloth studio update` on Windows skips Python dependency reinstall
when the installed version already matches PyPI latest.
* fix: clear SKIP_STUDIO_BASE in update command
install.ps1 sets SKIP_STUDIO_BASE=1 which persists in the PowerShell
session. If the user runs `unsloth studio update` in the same terminal,
the env var causes the version check to be skipped. Clear it explicitly
in the update command.
* fix: harden version check and clear stale env vars in update flow
- Normalize $InstalledVer with Out-String + Trim() to avoid array/whitespace
comparison issues in PowerShell 5.1 (python output can be captured as
string[] instead of scalar string)
- Move Fast-Install --upgrade pip inside if (-not $SkipPythonDeps) so the
fast path avoids unnecessary network round-trips
- Clear STUDIO_LOCAL_REPO when --local is not passed to prevent a previous
--local session from leaking into a plain update
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Fix blank page on Windows due to broken .js MIME type in registry
* Update studio/backend/main.py
Apply the defensive change suggested by Gemini: restrict the MIME type override to Windows platforms.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
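A minimal sketch of the Windows-only MIME guard described above. The helper name ensure_js_mime_type and the "application/javascript" value are assumptions for illustration; the actual change in studio/backend/main.py may differ.

```python
import mimetypes
import sys


def ensure_js_mime_type(platform: str = sys.platform) -> bool:
    """Register a usable MIME type for .js files, but only on Windows.

    Hypothetical helper: on Windows, mimetypes can pick up a broken
    registry entry for .js, so the mapping is overridden explicitly.
    Returns True when the override was applied.
    """
    if platform != "win32":
        return False
    mimetypes.add_type("application/javascript", ".js")
    return True
```

Passing the platform as an argument keeps the guard testable on any OS.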
* feat(studio): add HF/local model selection UI for GGUF export
* fix(studio): fix selector ring clipping
* fix(studio): export page trust_remote_code control and label styling
* fix(studio): accept hf_token in load_checkpoint orchestrator method
The route was passing hf_token to load_checkpoint() but the method
didn't accept it, causing a TypeError on every /api/export/load-checkpoint
request.
* fix(studio): clear HF model selection when input is edited
Previously selectedSourceModel was only cleared when the input became
empty, so editing to a different repo ID after selecting a model would
silently keep the old selection.
---------
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
normalize_path() unconditionally converted Windows paths like
C:\Users\... to WSL format /mnt/c/Users/..., which breaks path
resolution on native Windows. This caused LM Studio GGUF models
to fail detection (detect_gguf_model returned None for the invalid
path), falling through to the Unsloth import path which requires
a GPU.
Now only performs the /mnt/ mapping when actually running under WSL.
On native Windows, drive letters are preserved and backslashes are
normalized to forward slashes.
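The WSL-only mapping can be sketched roughly as follows. The running_under_wsl parameter is an assumption made for illustration; the real normalize_path() presumably detects WSL itself, and its signature is not shown in this log.

```python
import re

# Matches a Windows drive-letter path such as C:\Users\me or D:/models.
_DRIVE_RE = re.compile(r"^([A-Za-z]):[\\/](.*)$")


def normalize_path(path: str, running_under_wsl: bool) -> str:
    """Sketch: map drive-letter paths to /mnt/<drive>/ only under WSL."""
    match = _DRIVE_RE.match(path)
    if match is None:
        # Not a drive-letter path: leave it untouched in this sketch.
        return path
    drive = match.group(1)
    rest = match.group(2).replace("\\", "/")
    if running_under_wsl:
        return f"/mnt/{drive.lower()}/{rest}"
    # Native Windows: keep the drive letter, normalize the separators.
    return f"{drive}:/{rest}"
```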
* fix: default HF cache to standard platform path instead of legacy Unsloth cache
* feat: show LM Studio and local models in chat Fine-tuned tab
* feat: show LM Studio models in Hub models tab
* fix: fetch local models after auth refresh completes
* Revert "fix: fetch local models after auth refresh completes"
This reverts commit cfd61f0ac7.
* fix: increase llama-server health check timeout to 600s for large models
* feat: expandable GGUF variant picker for LM Studio local models
* fix: show GGUF variant label for locally loaded LM Studio models
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: show publisher name in LM Studio model labels
* fix: set model_id for loose GGUF files in LM Studio publisher dirs
* fix: show publisher prefix in Fine-tuned tab LM Studio models
* fix: only use model_id for lmstudio source models
* fix: only show LM Studio models in Hub tab on Mac/chat-only mode
* fix: respect XDG_CACHE_HOME, handle Windows paths in isLocalPath, refresh LM Studio on remount
- _setup_cache_env now reads XDG_CACHE_HOME (falls back to ~/.cache)
instead of hard-coding ~/.cache/huggingface. This follows the standard
HF cache resolution chain and respects distro/container overrides.
- isLocalPath in GgufVariantExpander uses a regex that covers Windows
drive letters (C:\, D:/), UNC paths (\\server\share), relative paths
(./, ../), and tilde (~/) -- not just startsWith("/").
- HubModelPicker.useEffect now calls listLocalModels() before the
alreadyCached early-return gate so LM Studio models are always
refreshed on remount. Also seeds useState from _lmStudioCache for
instant display on re-open.
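A rough sketch of the XDG_CACHE_HOME fallback from the first bullet. default_hf_cache_dir is a hypothetical name, and the env parameter exists only to make the lookup testable; _setup_cache_env itself is not reproduced here.

```python
import os
from pathlib import Path


def default_hf_cache_dir(env=None) -> Path:
    """Resolve <XDG_CACHE_HOME>/huggingface, falling back to ~/.cache."""
    env = os.environ if env is None else env
    # An empty XDG_CACHE_HOME is treated the same as an unset one.
    base = env.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(base) / "huggingface"
```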
* fix: add comment explaining isLocalPath regex for Windows/cross-platform paths
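For reference, the kind of pattern isLocalPath needs can be sketched in Python. The frontend helper is not Python, and the exact regex used in GgufVariantExpander is not shown in this log, so the pattern below is an assumption built only from the cases listed above.

```python
import re

# Prefixes treated as local: POSIX absolute (/), relative (./, ../),
# tilde (~/), Windows drive letters (C:\ or D:/), and UNC (\\server\share).
_LOCAL_PATH_RE = re.compile(r"^(?:/|\./|\.\./|~/|[A-Za-z]:[\\/]|\\\\)")


def is_local_path(value: str) -> bool:
    """Return True when value looks like a filesystem path, not a repo ID."""
    return _LOCAL_PATH_RE.match(value) is not None
```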
* fix: prioritize unsloth publisher in LM Studio model list
* fix: scope unsloth-first sort to LM Studio models on all platforms
* fix: add missing _lmStudioCache module-level declaration
* fix: prioritize unsloth publisher before timestamp sort in LM Studio group
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Some models like unsloth/Qwen3-0.6B have no safetensors metadata
on Hugging Face, so the training model selector showed no parameter
size badge. The chat model picker already had extractParamLabel()
as a fallback that parses sizes like "0.6B" from the model name.
Add the same fallback to the training model selector and the
onboarding model selection step.
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
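The name-based fallback can be illustrated with a small Python sketch. extractParamLabel lives in the frontend and its real pattern is not shown here; extract_param_label and the regex below are assumptions for illustration only.

```python
import re
from typing import Optional

# A size token is a number followed by B or M, not glued to other
# word characters (so the "3" in "Qwen3" is never picked up).
_PARAM_RE = re.compile(r"(?<![\w.])(\d+(?:\.\d+)?)([BbMm])(?![A-Za-z0-9])")


def extract_param_label(model_name: str) -> Optional[str]:
    """Parse a label such as "0.6B" from a model name, or return None."""
    match = _PARAM_RE.search(model_name)
    if match is None:
        return None
    return f"{match.group(1)}{match.group(2).upper()}"
```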
* Detect always-on reasoning models and show Think button as locked-on
Models with hardcoded <think>/</think> tags or reasoning_content in
their chat template (e.g. distilled reasoning models) always produce
thinking output regardless of any toggle. Previously these models
were not detected as reasoning-capable at all, so the Think button
was grayed out even though the model was actively reasoning.
Backend:
- Detect <think>/</think> and reasoning_content in GGUF chat templates
as a fallback when enable_thinking is not present
- Add reasoning_always_on flag to LoadResponse and InferenceStatusResponse
- Pass the flag through all GGUF load and status response paths
Frontend:
- Add reasoningAlwaysOn to the chat runtime store and API types
- When reasoning_always_on is true, show the Think button as lit
(active) but not clickable, with a tooltip explaining the model
always uses thinking
- Force reasoningEnabled=true when the model always reasons
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use pointer-events-none instead of disabled for always-on Think button
The HTML disabled attribute was not fully blocking clicks on the Think
button for always-on reasoning models. Switch to pointer-events-none
CSS class which prevents all mouse interaction at the CSS level.
* Use a static span instead of disabled button for always-on Think
Replace the button element with a plain span when reasoning is
always on. This makes it physically impossible to toggle since
there is no clickable element at all, avoiding any CSS or
disabled-attribute edge cases.
* Simplify always-on Think button to stay lit and remain toggleable
Keep the Think button as a normal toggleable button but ensure it
shows as lit when reasoning_always_on is true. The model always
reasons regardless of the toggle state so there is no need to
block interaction.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Use --no-deps for ALL packages (unsloth, unsloth-zoo, and runtime deps)
since the current PyPI metadata for unsloth still declares torch as a
hard dependency. Runtime deps (typer, pydantic, safetensors,
transformers, etc.) are installed from no-torch-runtime.txt with
--no-deps to prevent transitive torch resolution from accelerate, peft,
trl, and sentence-transformers.
no-torch-runtime.txt now includes unsloth's own direct deps (typer,
pydantic, pyyaml, nest-asyncio) since --no-deps skips those too.
install.sh installs no-torch-runtime.txt directly (via helper function
_find_no_torch_runtime). install.ps1 does the same via
Find-NoTorchRuntimeFile. SKIP_STUDIO_BASE stays at 1 to avoid setup.sh
fast-path issues.
install_python_stack.py NO_TORCH branch does the same for unsloth
studio update, using package_name instead of hardcoded "unsloth".
* Fix inference failing for transformers 5.x models (trust_remote_code)
The training worker in core/training/worker.py auto-enables
trust_remote_code for unsloth/* models that need transformers 5.x
(e.g. NVIDIA-Nemotron-3-Nano-4B). The inference worker did not have
the same logic, so loading these models for chat would fail with
"No config file found" while training worked fine.
Add the same auto-detection to the inference worker so
trust_remote_code is set automatically when needed.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Studio shutdown button
* fix: add auth to shutdown endpoint and improve UX
- Add JWT auth (Depends(get_current_subject)) to POST /api/shutdown
- Use authFetch instead of bare fetch in shutdown dialog
- Only show beforeunload prompt when training is running
- Remove Ctrl+W/Cmd+W interception (browsers don't allow it)
- Store shutdown task on app.state to prevent GC
---------
Co-authored-by: Datta Nimmaturi <venkatadattasainimmaturi@gmail.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix: only kill studio-managed llama-server processes, not user's own servers
_kill_orphaned_servers() checked for "unsloth" anywhere in the process
cmdline, which matched the user's own llama-server when serving models
from unsloth/ HF repos (the model path in -m contains "unsloth"). This
caused the user's server to get SIGKILLed on Studio startup, destroying
their prompt cache and forcing full model re-loads.
Narrow the check to only match processes whose binary path lives under
~/.unsloth/llama.cpp/ (the Studio install directory).
* Address review: cover env var paths, move Path.home() inside try block
- Also check LLAMA_SERVER_PATH and UNSLOTH_LLAMA_CPP_PATH so orphans
from custom install locations are still cleaned up.
- Move studio_dirs construction inside the try/except so a Path.home()
failure (containers without HOME) does not crash the constructor.
* Address reviewer feedback: proper path ancestry, /proc/pid/exe, legacy paths
Changes based on 10-reviewer consensus:
- Use Path.is_relative_to() instead of substring matching to prevent
false positives on sibling paths like ~/.unsloth/llama.cpp-backup/.
- Use /proc/<pid>/exe (symlink to real binary) instead of parsing the
first cmdline token, which breaks on paths with spaces. Falls back
to cmdline parsing on non-Linux or when /proc is unavailable.
- Add legacy in-tree install paths (project_root/llama.cpp/ and
project_root/bin/) so orphans from older setup.sh are still cleaned.
- Treat LLAMA_SERVER_PATH as an exact binary match rather than widening
it to its parent directory, which could match unrelated servers in
shared locations like /usr/local/bin/.
- Keep everything inside the try/except so Path.home() failures in
containers do not crash the constructor.
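A minimal sketch of the ancestry check versus substring matching. is_studio_managed and its parameters are hypothetical names; PurePosixPath is used so the example behaves the same on every platform, whereas the real code presumably works with concrete Path objects.

```python
from pathlib import PurePosixPath


def is_studio_managed(exe, install_roots, exact_binaries=()):
    """Return True if exe is an exact allowed binary or lives under a root."""
    exe_path = PurePosixPath(exe)
    # Exact match only, so a shared directory such as /usr/local/bin
    # is never widened into an install root.
    if any(exe_path == PurePosixPath(b) for b in exact_binaries):
        return True
    # is_relative_to compares whole path components, so a sibling like
    # llama.cpp-backup does not match the llama.cpp root.
    return any(exe_path.is_relative_to(PurePosixPath(r)) for r in install_roots)
```

A plain substring test would accept the llama.cpp-backup sibling, which is the false positive the review called out.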
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address review: add Linux platform guard and log cleanup errors
- Guard pgrep fallback with sys.platform check so it does not crash
on Windows/macOS when psutil is unavailable.
- Replace silent except-pass with logger.warning for observability.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
The [huggingfacenotorch] extras only exist in pyproject.toml but are
NOT published on PyPI, so uv pip install "unsloth[huggingfacenotorch]"
fails on fresh installs from the registry.
Fix: add studio/backend/requirements/no-torch-runtime.txt with the
runtime deps (safetensors, transformers, datasets, accelerate, etc.)
that mirror [huggingfacenotorch] from pyproject.toml. In no-torch mode:
1. install.sh/ps1 install unsloth + unsloth-zoo with --no-deps
2. SKIP_STUDIO_BASE=0 so install_python_stack.py's NO_TORCH branch runs
3. install_python_stack.py installs no-torch-runtime.txt
* Guard against late tool_calls after visible content, filter incomplete fragments
1. If visible content was already emitted (_last_emitted is non-empty)
when delta.tool_calls arrives, ignore the tool_calls instead of
reclassifying the turn as a tool call. llama-server never
interleaves content and tool_calls (they are mutually exclusive),
but this guard is defensive for other OpenAI-compatible backends.
2. Filter out incomplete structured tool_calls fragments before
execution. Entries with empty function.name (from truncation by
max_tokens, disconnect, or interruption) are skipped instead of
being passed to execute_tool().
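The fragment filter in item 2 might look roughly like this. complete_tool_calls is a hypothetical name, and the dict layout assumes the OpenAI-style tool_calls shape.

```python
def complete_tool_calls(tool_calls):
    """Drop fragments whose function.name is missing or empty."""
    kept = []
    for call in tool_calls:
        function = call.get("function") or {}
        if function.get("name"):
            kept.append(call)
    return kept
```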
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix: account for KV cache in GGUF GPU fit check and auto-cap context length
The GPU fit check only compared GGUF file size against free VRAM,
ignoring KV cache memory. Models with large native context lengths
(e.g. Qwen3.5-9B at 262k) would pass the fit check since the GGUF
is only 5.6 GB, but the KV cache at 262k context needs ~40 GB at
f16. This caused llama-server to silently fall back to CPU inference.
Changes:
- Parse block_count, head_count_kv, head_count, and embedding_length
from GGUF metadata alongside context_length
- Add KV cache VRAM estimation based on architecture params and the
selected cache quantization type (f16, q8_0, q4_0, etc.)
- Auto-reduce context length to the maximum that fits in available
GPU VRAM when the native context would exceed it
- Include estimated KV cache size in the _select_gpus total so the
fit decision reflects actual runtime memory, not just file size
For the reported scenario (Qwen3.5-9B on RTX 3090 with 22415 MiB
free), context is auto-reduced from 262144 to ~63k with f16 KV cache,
keeping the model fully on GPU. With q4_0 KV cache quantization the
context can reach ~226k.
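A rough sketch of the KV cache estimate, under stated assumptions: head dimension is embedding_length / head_count, every block holds a K and a V cache, and the quantized sizes follow llama.cpp block layouts (q8_0 is 34 bytes per 32 values, q4_0 is 18 bytes per 32 values). The function name and signature are illustrative and not taken from the Studio source; models with explicit key/value lengths or non-attention layers would need a different formula.

```python
# Bytes per cached element for each cache type (assumed block layouts).
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def estimate_kv_cache_bytes(
    n_ctx,
    block_count,
    head_count_kv,
    head_count,
    embedding_length,
    cache_type="f16",
):
    """Estimate KV cache size in bytes for a given context length."""
    if n_ctx <= 0:
        return 0
    head_dim = embedding_length // head_count
    # Factor 2 covers the separate K and V tensors in every block.
    elements_per_token = 2 * block_count * head_count_kv * head_dim
    return int(n_ctx * elements_per_token * BYTES_PER_ELEMENT[cache_type])
```

Because the estimate is linear in n_ctx, the largest context that fits a VRAM budget can be found by division or a simple search.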
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: resolve 6 bugs in KV cache VRAM estimation and add test harness
- Fix q8_0 bytes-per-element (BPE) constant: 1.125 -> 34/32 (1.0625) to match llama.cpp block size
- Fix _fit_context_to_vram returning min_ctx when weights exceed budget
(should return requested_ctx unchanged, let --fit handle it)
- Fix binary search inflating below-2048 requests (lo=min_ctx=2048 > hi)
- Fix n_ctx=0 regressing to 4096 when metadata unavailable (preserve sentinel)
- Fix multi-GPU auto-cap using single-GPU budget instead of aggregate
- Fix _context_length being overwritten with capped effective value
Add tests/test_gguf_kv_vram.py: 43 cross-platform pytest tests covering
pure logic, integration (monkeypatched load_model), and real GGUF parsing.
Runs in an isolated uv venv with only pytest -- no GPU/torch/structlog needed.
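The edge cases fixed above can be sketched as a single function. This is an illustration, not the actual _fit_context_to_vram: the signature, the kv_bytes_for_ctx callback, and the choice to return the floor when even the floor does not fit are assumptions.

```python
def fit_context_to_vram(
    requested_ctx,
    weights_bytes,
    budget_bytes,
    kv_bytes_for_ctx,
    min_ctx=2048,
):
    """Return the largest context <= requested_ctx that fits the budget."""
    if requested_ctx <= 0:
        # Preserve the "unknown/auto" sentinel instead of inventing a value.
        return requested_ctx
    if weights_bytes >= budget_bytes:
        # Capping context cannot help; leave the request unchanged.
        return requested_ctx

    def fits(ctx):
        return weights_bytes + kv_bytes_for_ctx(ctx) <= budget_bytes

    if fits(requested_ctx):
        return requested_ctx
    # Never raise a request that is already below the floor.
    lo = min(min_ctx, requested_ctx)
    hi = requested_ctx
    if not fits(lo):
        return lo  # assumption: fall back to the floor
    # Binary search for the largest fitting context in [lo, hi].
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo
```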
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: complete _effective_context_length lifecycle
- Initialize _effective_context_length in __init__ (prevents AttributeError)
- Reset _effective_context_length in unload_model (prevents stale values)
- Update context_length property to return effective (capped) value for
the UI/API, falling back to native _context_length if not set
* fix: multi-GPU selection tries smallest subset first
The previous approach summed all GPUs' memory to cap context, then
selected GPUs afterward. This was overly optimistic for heterogeneous
setups (e.g., 48 GiB + 4 GiB): the context was inflated by the tiny
GPU's contribution, then both GPUs were dragged in.
Now we try GPU subsets from smallest (1 GPU) to largest, capping
context for each. We pick the smallest subset where the model+KV
fits. This prefers single-GPU when possible (simpler, no tensor
split overhead) and avoids pulling in GPUs that barely help.
Add tests: test_multi_gpu_prefers_fewer_gpus,
test_multi_gpu_heterogeneous.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: prefer fewer GPUs over higher context in GPU selection
Multi-GPU inference is slower due to tensor-split overhead, so we
should prefer fewer GPUs with reduced context over more GPUs with
full context. Now the loop stops at the first GPU subset where the
model fits, rather than continuing to find subsets that allow higher
context. Only if the model can't fit on N GPUs do we try N+1.
This preserves the original behavior: use multi-GPU only when the
model doesn't fit on a single GPU.
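A simplified sketch of the "fewest GPUs first" policy for the auto-context case. Names and signature are hypothetical; it assumes KV cost is linear in context (kv_bytes_per_token) and treats the combined free memory of a subset as one budget, which ignores per-GPU split overhead.

```python
def pick_gpus_and_context(
    free_by_gpu,
    weights_bytes,
    kv_bytes_per_token,
    native_ctx,
    min_ctx=4096,
):
    """Grow the GPU subset one device at a time and stop at the first fit.

    Returns (gpu_ids, context). gpu_ids is None when nothing fits.
    """
    # Largest free memory first, so the first fitting subset is also
    # the smallest possible one.
    ranked = sorted(free_by_gpu.items(), key=lambda item: item[1], reverse=True)
    chosen = []
    budget = 0
    for gpu_id, free_bytes in ranked:
        chosen.append(gpu_id)
        budget += free_bytes
        headroom = budget - weights_bytes
        if headroom <= 0:
            continue  # weights alone do not fit yet; add another GPU
        max_ctx = headroom // kv_bytes_per_token
        if max_ctx >= min(min_ctx, native_ctx):
            # Accept a reduced context rather than pulling in more GPUs.
            return list(chosen), min(native_ctx, max_ctx)
    return None, native_ctx
```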
* fix: make _kill_orphaned_servers cross-platform via psutil
Replace pgrep + os.kill(SIGKILL) with psutil.process_iter() and
proc.kill(), which work on Linux, macOS, and Windows. Build an
allowlist of install roots matching _find_llama_server_binary so
only studio-managed servers are killed.
* fix: skip KV estimation loop when effective context is unknown
When n_ctx=0 and GGUF metadata lacks context_length, effective_ctx
stays 0. _estimate_kv_cache_bytes(0) returns 0, so a GPU could be
selected with no KV headroom. Guard the loop with effective_ctx > 0
to fall back to file-size-only GPU selection in this case.
* chore: temporarily remove test harness (will add back separately)
* refactor: deduplicate UINT32/UINT64 handling in GGUF parser
Replace duplicated if/elif chains for vtype 4 and 10 with a single
block using setattr. No behavioral change.
* fix: honor explicit n_ctx by using multi-GPU before capping
When the user explicitly sets n_ctx, try to fit the full requested
context using _select_gpus (which adds GPUs as needed). Only cap
context if it doesn't fit on any GPU combination.
When n_ctx=0 (auto/native context), keep the existing behavior:
prefer fewer GPUs with reduced context, since multi-GPU is slower
and the user didn't ask for a specific context length.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: context_length property returns native value for frontend slider
The frontend uses context_length as the slider max. Returning the
capped effective value prevented users from requesting higher context
on reload (e.g., after switching to q4_0 KV cache). Revert to
returning the native GGUF metadata value -- the backend auto-caps
at load time regardless.
* revert: context_length returns effective (capped) value
The UI slider should show what the server is actually running at,
not the theoretical maximum. Revert to returning the effective
context length.
* fix: raise minimum context floor from 2048 to 4096
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix ~1.2s TTFT penalty when tools are enabled in Studio
When users enable web search, Python execution, or terminal tools,
every message gets a ~1.2s delay before any text appears -- even when
the model does not call any tool. This happens because
generate_chat_completion_with_tools() does a non-streaming detection
pass (stream: False) first, waits for the complete response, then
checks for tool calls. For the ~90% of messages that don't trigger a
tool call, this blocking wait is entirely wasted.
Root cause: the detection pass payload uses stream: False, forcing
llama-server to generate the entire response before returning any
tokens.
Fix: replace the non-streaming detection pass with a streaming pass
(stream: True) and a speculative buffer state machine that detects
tool signals in the first 1-2 SSE chunks:
- BUFFERING: accumulate content tokens, check first chars for tool
signal prefixes (<tool_call>, <function=)
- STREAMING: no tool detected, yield tokens to caller immediately
- DRAINING: tool signal found, silently accumulate rest of stream
Three detection paths:
1. Structured delta.tool_calls -- detected instantly, transition to
DRAINING, accumulate fragments, assemble at stream end.
2. XML tool markup in content -- buffer holds up to 32 chars checking
for <tool_call> or <function= prefix, then transitions to DRAINING.
3. No tool signal -- first non-whitespace, non-XML char triggers
immediate transition to STREAMING (fast path, ~90% of requests).
Safety net: after any stream ends in STREAMING state, check accumulated
content for XML tool signals. Handles rare "content before tool call"
edge case.
Additional supporting changes:
- Add headers parameter to _stream_with_retry for auth forwarding
- Share _strip_tool_markup and regex patterns between the detection
pass and the final streaming pass (removes duplication)
- Remove the iteration==0 non-streaming content shortcut (no longer
needed since all iterations stream directly)
- Keep the final streaming pass as fallback for max_tool_iterations
exhaustion
Benchmarked on Qwen3.5-4B Q4_K_XL:
- No tools: TTFT ~112ms (unchanged)
- Tools enabled, no call: TTFT ~112ms (was ~1207ms)
- Decode TPS: 226 (unchanged in all cases)
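The content-side part of the state machine can be sketched as below. This is a simplified illustration with hypothetical names (route_chunks, TOOL_SIGNALS); it covers only XML prefix detection on content text and leaves out delta.tool_calls, reasoning tokens, and the end-of-stream safety net.

```python
from enum import Enum


class State(Enum):
    BUFFERING = "buffering"
    STREAMING = "streaming"
    DRAINING = "draining"


TOOL_SIGNALS = ("<tool_call>", "<function=")
MAX_BUFFER = 32  # upper bound on speculative buffering, per the description


def route_chunks(chunks):
    """Return (final_state, emitted_text, drained_text) for content chunks."""
    state = State.BUFFERING
    buffer = ""
    emitted = []
    drained = ""
    for chunk in chunks:
        if state is State.STREAMING:
            emitted.append(chunk)
            continue
        if state is State.DRAINING:
            drained += chunk
            continue
        buffer += chunk
        probe = buffer.lstrip()
        if not probe:
            continue  # only whitespace so far, keep buffering
        if any(probe.startswith(s) for s in TOOL_SIGNALS):
            state, drained, buffer = State.DRAINING, buffer, ""
        elif len(probe) < MAX_BUFFER and any(s.startswith(probe) for s in TOOL_SIGNALS):
            continue  # still a possible prefix of a tool signal
        else:
            state = State.STREAMING
            emitted.append(buffer)
            buffer = ""
    if state is State.BUFFERING and buffer:
        # Stream ended while undecided: release the held text.
        emitted.append(buffer)
        state = State.STREAMING
    return state, "".join(emitted), drained
```

The fast path is the final else branch: the first chunk that cannot be a tool-signal prefix flips the machine to STREAMING.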
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add unit tests for streaming tool detection state machine
16 tests covering every tool call parsing path:
- Plain text (no tool call) streaming
- Structured delta.tool_calls detection and fragment assembly
- XML <tool_call>JSON</tool_call> detection via buffer
- XML <function=name> tag detection via buffer
- Whitespace before tool XML
- Safety net (content then tool XML)
- Parallel multi-tool calls
- Reasoning token bypass (thinking models)
- Reasoning then tool call
- Empty response handling
- Buffer prefix timeout (HTML not mistaken for tool)
- Non-XML first char instant streaming
- False positive rejection (<tool_tip> vs <tool_call>)
- Arguments split across multiple chunks
- auto_heal_tool_calls=False respects the flag
- Metrics accumulation across tool iterations
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix reasoning-only BUFFERING, pre-tool content emission, and code duplication
Addresses review feedback on the streaming tool detection:
1. Reasoning tokens are no longer yielded during BUFFERING/DRAINING
states. The consumer in routes/inference.py tracks prev_text across
tool iterations without resetting it, so yielding reasoning during
a detection pass that resolves to a tool call would corrupt the
delta computation for subsequent iterations. Reasoning is now
silently accumulated during detection (matching the old non-streaming
behavior) and flushed together with content when the buffer resolves
to STREAMING.
2. Handle reasoning-only responses in the BUFFERING resolver. When a
thinking model emits only reasoning_content with no content tokens,
the stream ends while still in BUFFERING state. The resolver now
detects this case and yields reasoning as plain text (without
<think> wrapper), matching the final streaming pass behavior for
models like Qwen3 in always-think mode.
3. Replace duplicated re.sub calls for stripping tool markup with
the existing _strip_tool_markup(content_text, final=True) helper,
removing ~40 lines of redundant regex code.
4. Update tests: adjust reasoning test expectations to match the new
behavior (reasoning batched with content, not streamed individually
during BUFFERING). Add test_reasoning_only_no_content for the
reasoning-only edge case. 17/17 tests pass.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address remaining reviewer findings: late tool_call IDs and XML speculation
1. Late-arriving tool_calls.id: when a provider sends the real ID on a
later delta chunk (after the initial one with index and function
name), the accumulator now updates the ID instead of keeping the
synthetic "call_{idx}" placeholder. (P2, 2/10 reviewers)
2. XML speculation respects auto_heal_tool_calls: when auto_heal is
explicitly disabled, _TOOL_XML_SIGNALS is empty so the BUFFERING
state never speculatively holds content for XML prefix detection.
Content starting with literal "<tool_call>" or "<function=" text
flows straight through without delay. (P2, 1/10 reviewers)
Skipped: finish_reason="tool_calls" without delta.tool_calls fallback
(P1, 1/10 reviewers). llama-server always sends delta.tool_calls
fragments in streaming mode. A non-streaming fallback for this edge
case would add complexity for a scenario that does not occur in
practice with the supported backend.
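The late-arriving ID handling in item 1 can be sketched roughly as follows. This is a minimal illustration, not the Studio accumulator itself: the function name and the dict layout are assumptions, and only the "call_{idx}" placeholder and the per-index merging come from the description above.

```python
def accumulate_tool_call_deltas(deltas):
    """Merge streamed tool_call fragments into complete calls, keyed by index.

    Hypothetical helper: the field names mirror OpenAI-style delta chunks.
    """
    calls = {}
    for delta in deltas:
        idx = delta["index"]
        # Start with a synthetic placeholder ID until a real one shows up.
        call = calls.setdefault(
            idx, {"id": f"call_{idx}", "name": "", "arguments": ""}
        )
        # The real ID may arrive on any chunk, so let it replace the placeholder.
        if delta.get("id"):
            call["id"] = delta["id"]
        fn = delta.get("function") or {}
        if fn.get("name"):
            call["name"] = fn["name"]
        # Argument JSON arrives in pieces and is concatenated in order.
        call["arguments"] += fn.get("arguments") or ""
    return [calls[i] for i in sorted(calls)]
```

If the ID were only read from the first chunk, the second fragment's ID would be lost and the call would keep the placeholder.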
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Check request.is_disconnected() every 20 tokens instead of every token
The disconnect check is an async round-trip that adds overhead on every
loop iteration. Since the cancel watcher in llama_cpp.py already
handles connection teardown (closes the streaming response on cancel),
this route-layer check is a secondary safety net that does not need to
run on every single token.
Check every 20 tokens across all 4 streaming paths:
- gguf_tool_stream (tool-enabled GGUF)
- gguf_stream_chunks (standard GGUF)
- audio_input_generate (audio/whisper input)
- generic backend stream (non-GGUF fallback)
* Fix safety net, DRAINING metadata, and test import path
1. Safety net no longer retroactively executes tools after visible
content was already emitted to the user. Once _last_emitted is
non-empty, the stream is committed to normal content mode.
Retroactive tool execution after visible output would violate the
streaming contract and corrupt the route-layer cumulative delta
tracker (prev_text). The tool XML is still stripped by
_strip_tool_markup so the user sees clean content.
2. DRAINING false-positive path now merges accumulated metrics from
prior tool iterations instead of dropping them. Uses the same
merge formula as the STREAMING path.
3. Test import path fixed to use repo root instead of hardcoded
sibling directory. Works in clean checkouts and CI.
4. Renamed test_content_then_tool_xml_safety_net to
test_content_then_tool_xml_no_retroactive_execution to reflect
the corrected behavior.
17/17 tests pass.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Redact --api-key value from llama-server startup log
When UNSLOTH_DIRECT_STREAM=1, the generated bearer token was logged
verbatim in the startup command. Replace the secret with <redacted>
before logging.
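A minimal sketch of this kind of redaction, assuming the startup command is held as an argument list. The helper name is hypothetical, and handling the --api-key=value form is an extra assumption not stated above.

```python
def redact_api_key(cmd):
    """Return a copy of cmd that is safe to log, with the --api-key value masked."""
    redacted = list(cmd)
    for i, arg in enumerate(cmd):
        if arg == "--api-key" and i + 1 < len(cmd):
            # Two-token form: --api-key <secret>
            redacted[i + 1] = "<redacted>"
        elif arg.startswith("--api-key="):
            # Single-token form (assumed): --api-key=<secret>
            redacted[i] = "--api-key=<redacted>"
    return redacted
```

The original list is left untouched, so the process can still be launched with the real token while only the copy is logged.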
* Remove test file temporarily
* Revert disconnect throttle, reset prev_text on tool_start, restore XML safety net
Addresses all P1 findings from reviewer round 3 (10 reviewers):
1. Revert disconnect check to every iteration (was every 20th).
All 10 reviewers flagged this as a correctness regression for
short streams and sparse tool event loops. The cancel watcher in
llama_cpp.py is the primary mechanism but the route-layer check
must remain per-iteration for completeness. [10/10]
2. Reset prev_text on tool_start in gguf_tool_stream. When a tool
cycle begins after visible content was already streamed, the
route-layer cumulative delta tracker (prev_text) must be reset
so the post-tool synthesis response is not truncated or dropped.
[9/10]
3. Remove the _last_emitted gate from the XML safety net. The gate
was added to prevent retroactive tool execution after visible
content, but with prev_text now reset on tool_start (item 2), the
root cause is fixed and the safety net can correctly handle
content-then-tool-XML responses (matching pre-PR behavior).
[8/10]
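Why the prev_text reset in item 2 matters can be shown with a small sketch. It assumes content events carry cumulative text that restarts from empty after a tool cycle. The event tuple shape and function name are hypothetical, not the actual gguf_tool_stream code.

```python
def stream_deltas(events):
    """Turn cumulative content snapshots into incremental deltas."""
    prev_text = ""
    for kind, payload in events:
        if kind == "tool_start":
            # Without this reset, post-tool text would be sliced against the
            # pre-tool length and come out truncated or empty.
            prev_text = ""
            continue
        if kind == "content":
            delta = payload[len(prev_text):]
            prev_text = payload
            if delta:
                yield delta
```

With the reset removed, a post-tool snapshot such as "Done" sliced at the length of the earlier "Let me check" would yield nothing.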
* Use None instead of {} for empty auth headers in TTS methods
* Include accumulated metrics in STREAMING metadata check
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* refactor(studio): unify setup terminal output style and add verbose setup mode
* studio(windows): align setup.ps1 banner/steps with setup.sh (ANSI, verbose)
* studio(setup): revert nvcc path reordering to match main
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* studio(setup): restore fail-fast llama.cpp setup flow
* studio(banner): use IPv6 loopback URL when binding :: or ::1
* Fix IPv6 URL bracketing, try_quiet stderr, _step label clamp
- Bracket IPv6 display_host in external_url to produce clickable URLs
- Redirect try_quiet failure log to stderr instead of stdout
- Clamp _step label to column width to prevent negative padding
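The IPv6 bracketing fix can be sketched as below. The function signature is an assumption made for illustration; only the names display_host and external_url appear in the commit.

```python
def external_url(display_host, port):
    """Build a clickable URL, wrapping IPv6 literals in square brackets."""
    # A colon in the host means an IPv6 literal such as ::1, which must be
    # bracketed so the port separator stays unambiguous.
    if ":" in display_host and not display_host.startswith("["):
        display_host = f"[{display_host}]"
    return f"http://{display_host}:{port}"
```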
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add sandbox integration tests for PR #4494 UX fixes
Simulation harness (tests/simulate_pr4494.py) creates an isolated uv
venv, copies the real source files into it, and runs subprocess tests
for all three fixes with visual before/after demos and edge cases.
Standalone bash test (tests/test_try_quiet.sh) validates try_quiet
stderr redirect across 8 scenarios including broken-version contrast.
39 integration tests total (14 IPv6 + 15 try_quiet + 10 _step), all
existing 75 unit tests still pass.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Truncate step() labels in setup.sh to match PS1 and Python
The %-15s printf format pads short labels but does not truncate long
ones. Change to %-15.15s so labels wider than 15 chars are clipped,
matching the PowerShell .Substring(0,15) and Python label[:15] logic.
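For comparison, a sketch of the Python-side equivalent of this pad-and-truncate behavior. The helper name is hypothetical; the 15-column width comes from the description above.

```python
def format_label(label, width=15):
    """Clip label to width characters, then left-justify it to exactly width."""
    return label[:width].ljust(width)
```

Python's printf-style formatting accepts the same specifier, so "%-15.15s" % label gives an identical result.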
* Remove sandbox integration tests from PR
These test files are not part of the styling fix and should not
ship with this PR.
* Show error output on failure instead of suppressing it
- install_python_stack.py: restore _red for patch_package_file
warnings (was downgraded to _dim)
- setup.ps1: capture winget output and show on failure for CUDA,
Node, Python, and OpenSSL installs (was piped to Out-Null)
- setup.ps1: always show git pull failure warning, not just in
verbose mode
* Show winget error output for Git and CMake installs on failure
Same capture-and-print-on-failure pattern already used for
Node, Python, CUDA, and OpenSSL winget installs.
* fix: preserve stderr for _run_quiet error messages in setup.sh
The step() helper writes to stdout, but _run_quiet's error header
was originally sent to stderr (>&2). Without the redirect, callers
that separate stdout/stderr would miss the failure headline while
still seeing the log body on stderr. Add >&2 to both step calls
inside _run_quiet to match main's behavior.
* feat: add --verbose flag to setup and update commands
Wire UNSLOTH_VERBOSE=1 through _run_setup_script() so that
'unsloth studio update --verbose' (and the deprecated 'setup')
passes the flag to setup.sh / setup.ps1 / install_python_stack.py.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Make Studio shortcuts launch in a visible terminal
Studio shortcuts (Desktop/Start Menu) previously launched the server as a
hidden background process. Closing the browser tab did not stop the server,
leaving users with no obvious way to shut it down. This change makes shortcuts
open a visible terminal window so users can see server output and close the
terminal to stop Studio.
Launcher changes (install.sh):
- Add TTY detection in the launcher's main section. When a TTY is present
(foreground mode), the launcher spawns a background browser-opener and then
exec's the studio process directly. This means closing the terminal sends
SIGHUP to studio, stopping it cleanly. When no TTY is present (background
mode, e.g. macOS .app or headless), the existing _spawn_terminal behavior
is preserved.
- Add _open_browser_when_ready helper that polls health on the specific
launch port and opens the browser once ready.
- Add WSL fallback in _open_browser: uses powershell.exe Start-Process or
cmd.exe /c start instead of unreliable xdg-open under WSL.
Linux .desktop shortcut:
- Change Terminal=false to Terminal=true so the desktop environment opens
the user's default terminal emulator for the launcher.
WSL support:
- Remove the early-return that skipped WSL entirely. WSL now gets the
launcher script and studio.conf written.
- Add WSL shortcut creation: generates Windows Desktop and Start Menu .lnk
files via a temp PowerShell script. Targets wt.exe (Windows Terminal) with
automatic fallback to wsl.exe. Uses WSL_DISTRO_NAME for multi-distro setups.
Windows launcher (install.ps1):
- Add Find-FreeLaunchPort function that mirrors the Unix _find_launch_port
logic, scanning Get-NetTCPConnection for busy ports and returning the first
free port in the configured range.
- Replace the hardcoded $basePort with the dynamic port result, with a
MessageBox error dialog if no free port is found.
* Fix review findings: lock race, WSL quoting, Windows port fallback
Foreground lock race (10/10 reviewers):
The foreground mode released the single-instance lock before exec,
allowing a second launcher to acquire the lock and race for the same
port during startup. Move lock release into the background subshell
so it only happens after the health check passes.
WSL shortcut quoting (10/10 reviewers):
WSL_DISTRO_NAME values with spaces (e.g. "Ubuntu Preview", "Fedora
Remix for WSL") were not quoted, causing the distro name to be split
across multiple arguments. Add double-quoting around the distro name
and launcher path in the generated shortcut arguments.
Windows port fallback (3/10 reviewers):
Find-FreeLaunchPort silently assumed no ports were listening when
Get-NetTCPConnection was unavailable, which could return 8888 even
when busy. Add a Test-PortBusy fallback that probes ports with
TcpListener when Get-NetTCPConnection fails. Also scope the
Get-NetTCPConnection query to only the port range we care about.
* Skip powershell.exe shortcut creation if wslpath fails
If wslpath -w fails (returns empty), do not attempt to pass a Linux-style
path to powershell.exe -- it would always fail. Only run powershell.exe
when we have a valid Windows path for the temp PS1 script.
* Remove dead code and fix background health poll target
- Remove unused _open_browser_when_ready function
- Background mode now polls only the specific _launch_port instead of
scanning all ports via _find_healthy_port, matching foreground behavior
- Add launcher test harness (22 unit + 19 integration tests)
* Fix port probe scope, lock ownership, and T4 test coverage
- Test-PortBusy: bind on Any instead of Loopback to match Studio's
0.0.0.0 bind scope (prevents false-free in fallback path)
- _release_lock: verify PID ownership before removing lock dir
(prevents a timed-out subshell from deleting another launcher's lock)
- T4 test: fail first curl call so the test actually exercises the
lock-contention wait path instead of short-circuiting via fast path
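The PID ownership check in _release_lock might look roughly like this in Python. The real launcher is a shell script, and the "pid" file inside the lock directory is an assumed layout used only for illustration.

```python
import os
import shutil


def release_lock(lock_dir, owner_pid):
    """Remove lock_dir only if it was created by owner_pid.

    Returns True when the lock was released, False when it belongs to
    another process or cannot be read.
    """
    pid_file = os.path.join(lock_dir, "pid")  # hypothetical layout
    try:
        with open(pid_file) as fh:
            recorded = fh.read().strip()
    except OSError:
        return False
    if recorded != str(owner_pid):
        # Another launcher owns this lock now, so leave it alone.
        return False
    shutil.rmtree(lock_dir, ignore_errors=True)
    return True
```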
* Temporarily remove launcher test scripts
Tests will be re-added in a follow-up PR to keep this diff focused
on the launcher changes.
* Fix missing num_items_in_batch in unsloth_prediction_step
unsloth_prediction_step calls compute_loss without num_items_in_batch
during evaluation. This causes _unsloth_pre_compute_loss to see
num_items_in_batch=None, which triggers a spurious warning for every
model when gradient_accumulation_steps > 1:
"Unsloth: Not an error, but {model} does not accept num_items_in_batch.
Using gradient accumulation will be very slightly less accurate."
The standard transformers prediction_step computes num_items_in_batch
via _get_num_items_in_batch before passing it to compute_loss. This
patch does the same in unsloth_prediction_step.
Tested on Llama-3.2-1B-Instruct and Olmo-3-7B-Instruct with
gradient_accumulation_steps=3 and eval_steps=3. Warning is gone and
eval loss is computed correctly for both.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Guard _get_num_items_in_batch for older transformers versions
_get_num_items_in_batch was added in transformers 4.46. Wrap the call
in try/except so older versions fall back to num_items_in_batch=None,
which preserves the original behavior of not passing it.
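A version guard of this kind could be sketched as follows. The helper is hypothetical, and the arguments forwarded to the private method are placeholders rather than the real transformers signature.

```python
def call_or_none(obj, method_name, *args, **kwargs):
    """Call obj.method_name(...) if it exists, otherwise return None."""
    try:
        method = getattr(obj, method_name)
    except AttributeError:
        # Older library versions do not define the method at all.
        return None
    return method(*args, **kwargs)
```

Only the attribute lookup sits inside the try block, so a genuine error raised by the method itself is not swallowed. That is a narrower guard than wrapping the whole call.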
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix Gemma3N audio training stride assertion with non-reentrant checkpointing
Gemma3N audio conformer processes variable-length audio tensors
that cause stride mismatches in AOT autograd compiled backward
when non-reentrant gradient checkpointing is used. The error
manifests as:
AssertionError: expected size 2==2, stride 1928==1936 at dim=0
This happens because the audio conformer's conv/norm layers produce
tensors whose strides vary with audio clip duration, but AOT autograd
traces the backward graph assuming fixed strides from the first batch.
The notebook sets gradient_checkpointing_kwargs={"use_reentrant": False}
and TRL 0.27.0+ also forces this. Both override Unsloth's own
use_reentrant=True set during prepare_model_for_training.
Fix: intercept gradient_checkpointing_enable on Gemma3N models to
always force use_reentrant=True, regardless of what the notebook
or TRL passes.
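One way to intercept the call is to wrap the bound method, as in the sketch below. The function name is hypothetical, and the actual patch may hook in differently.

```python
import functools


def force_reentrant_checkpointing(model):
    """Wrap model.gradient_checkpointing_enable so use_reentrant is always True."""
    original = model.gradient_checkpointing_enable

    @functools.wraps(original)
    def patched(gradient_checkpointing_kwargs=None, **kwargs):
        merged = dict(gradient_checkpointing_kwargs or {})
        # Override whatever the caller requested (notebook, TRL, ...).
        merged["use_reentrant"] = True
        return original(gradient_checkpointing_kwargs=merged, **kwargs)

    model.gradient_checkpointing_enable = patched
    return model
```

Copying the caller's dict before modifying it avoids mutating a kwargs object that the trainer may reuse elsewhere.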
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
The previous --no-deps approach skipped ALL dependencies, not just
torch. This left safetensors, transformers, datasets, accelerate, etc.
missing, causing PackageNotFoundError at runtime.
Fix: in no-torch mode, install unsloth[huggingfacenotorch] (which pulls
all runtime deps except torch), then install unsloth-zoo with --no-deps
(since zoo's published metadata still declares torch as a hard dep).
This gives a working no-torch environment with all non-torch packages.
Applied to all three installer files: install.sh, install.ps1, and
studio/install_python_stack.py.
* fix: install.sh Mac Intel compatibility + Studio no-torch support (#4621)
On Intel Macs (x86_64), PyTorch has no wheels for torch >= 2.3, so the
installer crashes. Even when torch is absent, Studio crashes on startup
because two files have bare top-level torch imports.
Studio's GGUF inference (llama.cpp) does not need PyTorch. Training and
HF-inference already isolate torch to subprocesses. Only 2 files in the
server startup chain had top-level torch imports preventing startup.
Changes:
- install.sh: detect architecture, default to Python 3.12 on Intel Mac,
skip torch install, add Python 3.13.8 guard for arm64, pass
UNSLOTH_NO_TORCH env var to setup.sh
- data_collators.py: remove unused `import torch` (no torch.* refs)
- chat_templates.py: lazy-import IterableDataset into function bodies
- install_python_stack.py: add IS_MACOS/NO_TORCH constants, skip
torch-dependent packages, skip overrides.txt, skip triton on macOS
No existing working flow changes. Linux/WSL and macOS arm64 behavior is
identical.
* tests: add test suite for Mac Intel compat + no-torch mode
Shell tests (test_mac_intel_compat.sh):
- version_ge edge cases (9 tests)
- Architecture detection for Darwin x86_64/arm64, Linux x86_64/aarch64
- get_torch_index_url returns cpu on simulated Darwin
- UNSLOTH_NO_TORCH propagation to both setup.sh branches
Python unit tests (test_no_torch_filtering.py):
- _filter_requirements with NO_TORCH_SKIP_PACKAGES
- NO_TORCH env var parsing (true/1/TRUE/false/0/unset)
- IS_MACOS constant check
- Overrides skip and triton macOS skip guards
Python import tests (test_studio_import_no_torch.py):
- data_collators.py loads in isolated no-torch venv
- chat_templates.py has no top-level torch imports
- Negative control confirms import torch fails without torch
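The env var cases listed above (true/1/TRUE/false/0/unset) suggest parsing along these lines. This is a sketch under that assumption, not the code in install_python_stack.py, and the function name is hypothetical.

```python
import os


def parse_no_torch(env=None):
    """Return True when UNSLOTH_NO_TORCH is set to a truthy value."""
    env = os.environ if env is None else env
    # Case-insensitive match; anything else, including unset, means False.
    return env.get("UNSLOTH_NO_TORCH", "").strip().lower() in ("1", "true")
```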
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* tests: add E2E sandbox tests for Mac Intel no-torch mode
Replace static/synthetic test stubs with real sandbox tests:
- Shell: E2E uv venv creation at Python 3.12, mock uv shim to verify
torch install is skipped when MAC_INTEL=true, dynamic env propagation
test for UNSLOTH_NO_TORCH in both local and non-local install paths
- Python filtering: test real extras.txt and extras-no-deps.txt with
NO_TORCH_SKIP_PACKAGES, subprocess mock of install_python_stack() for
5 platform configs (NO_TORCH+macOS, Windows+NO_TORCH, normal Linux,
Windows-only, macOS-only), VCS URL and env marker edge cases
- Python imports: parametrized Python 3.12+3.13 venv fixture, dataclass
instantiation for all 3 collator classes, chat_templates.py exec with
stubs, negative controls proving import torch and torchao install fail
in no-torch venvs
91 total tests, all passing.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: address reviewer findings for Intel Mac no-torch mode
P1 fixes:
- Auto-infer NO_TORCH in install_python_stack.py via platform.machine()
so `unsloth studio update` preserves GGUF-only mode without needing
the UNSLOTH_NO_TORCH env var (6/10 reviewers)
- Add openai-whisper and transformers-cfg to NO_TORCH_SKIP_PACKAGES
since both have unconditional torch dependencies (4/10 reviewers)
- Skip unsloth-zoo on Intel Mac --local installs (depends on torch)
in both migrated and fresh install paths (1/10)
- Recreate stale 3.13 venvs as 3.12 on Intel Mac re-runs (1/10)
- Detect Apple Silicon under Rosetta via sysctl hw.optional.arm64
and warn user to use native arm64 terminal (1/10)
P2 fixes:
- Wire new test files into tests/run_all.sh (4/10 reviewers)
- Add update-path tests (skip_base=False) for Intel Mac
- Add _infer_no_torch tests for platform auto-detection
P3 fixes:
- Fix macOS progress bar total (triton step skipped but was counted)
- Fix temp file leak when Windows + NO_TORCH filters stack
All tests pass: 30 shell, 66 Python (96 total).
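The platform auto-detection in the first P1 fix could be approximated as below. The function name and its optional parameters are assumptions made for testability. The real _infer_no_torch may also consult the env var.

```python
import platform


def infer_no_torch(system=None, machine=None):
    """Guess whether this host is an Intel Mac, where torch is skipped."""
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    # Intel Macs report Darwin + x86_64; Apple Silicon reports arm64.
    return system == "Darwin" and machine == "x86_64"
```

An x86_64 Python running under Rosetta on Apple Silicon also reports x86_64, which is presumably why the separate sysctl check was added.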
* feat: add --python override flag to install.sh
Lets users force a specific Python version, e.g. ./install.sh --python 3.12.
Addresses M2 Mac users whose systems resolve to a problematic 3.13.x patch.
When --python is set, the Intel Mac stale-venv guard and 3.13.8 auto-downgrade
are skipped so the user's choice is respected.
* tests: add comprehensive E2E sandbox tests for no-torch mode
Add test_e2e_no_torch_sandbox.py with 7 test groups (43 tests total)
covering the full no-torch import chain, edge cases, and install logic:
- Group 1: BEFORE vs AFTER import chain comparison (proves the bug
existed and the fix works by synthetically prepending top-level torch
imports)
- Group 2: Dataclass instantiation without torch
- Group 3: Edge cases with broken/fake torch modules on sys.path
- Group 4: Hardware detection fallback to CPU without torch
- Group 5: install.sh flag parsing, version resolution, arch detection
- Group 6: install_python_stack.py NO_TORCH filtering
- Group 7: Live server startup without torch (marked @server, skipped
when studio venv is unavailable)
All 43 tests pass on both Python 3.12 and 3.13 isolated venvs.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* feat: add --no-torch flag to install.sh/ps1, fix lazy import bug in dataset formatting
- Fix chat_templates.py: narrow torch IterableDataset import into inner
try/except ImportError so dataset.map() works without torch installed
- Fix format_conversion.py: same lazy import fix for convert_chatml_to_alpaca
and convert_alpaca_to_chatml
- Add --no-torch flag to install.sh with unified SKIP_TORCH variable
(driven by --no-torch flag OR MAC_INTEL auto-detection)
- Add --no-torch flag to install.ps1 with $SkipTorch variable
- Print CPU hint when no GPU detected and --no-torch not set
- Replace MAC_INTEL guards with SKIP_TORCH in torch install sections
- Update shell tests (40 pass) and Python tests (90 pass)
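The narrowed lazy import described in the first two bullets follows a common pattern, sketched here. The helper name is hypothetical, and the real functions do more than this type check.

```python
def is_iterable_dataset(obj):
    """Check for a torch IterableDataset without requiring torch at import time."""
    try:
        # Imported inside the function so the module loads without torch.
        from torch.utils.data import IterableDataset
    except ImportError:
        # No torch installed: nothing can be a torch IterableDataset.
        return False
    return isinstance(obj, IterableDataset)
```

Keeping the try block around the import alone means unrelated errors later in the function are not mistaken for a missing torch install.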
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: address reviewer findings for --no-torch installer paths
- Fix migrated-env branch in install.sh and install.ps1: check
SKIP_TORCH first, then branch on STUDIO_LOCAL_INSTALL. Previously
SKIP_TORCH+non-local fell into else and installed unsloth-zoo (which
depends on torch), defeating --no-torch mode.
- Fix $env:UNSLOTH_NO_TORCH leak in install.ps1: always set to "true"
or "false" instead of only setting on the true branch. Prevents stale
no-torch state from leaking across runs in the same PS session.
- Fix install_python_stack.py update path: add NO_TORCH guard around
base.txt install so unsloth studio update does not reinstall
unsloth-zoo (which depends on torch) in no-torch mode.
* fix: install unsloth + unsloth-zoo with --no-deps in no-torch mode
Instead of skipping unsloth-zoo entirely (which breaks unsloth's
dependency on it), install both packages with --no-deps so they are
present but torch is not pulled in transitively. Applied consistently
across all no-torch paths: migrated-env, fresh-local, fresh-non-local
in install.sh, install.ps1, and install_python_stack.py.
* chore: temporarily remove test files (will be added in a follow-up)
* refactor: deduplicate SKIP_TORCH conditional branches in installers
Collapse if/else blocks that differ only by --no-deps into a single
branch with a conditional flag variable. Applied to migrated-env and
fresh-local paths in install.sh, install.ps1, and install_python_stack.py.
* fix: apply --no-deps to fresh non-local --no-torch install path
The non-local else branch was missing $_no_deps_arg/$noDepsArg, so
uv pip install unsloth would resolve torch from PyPI metadata (the
published unsloth package still declares torch as a hard dep). Now
--no-deps is applied consistently to all SKIP_TORCH code paths.
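The "conditional flag variable" approach, rendered in Python for illustration. The installers themselves use shell and PowerShell variables ($_no_deps_arg / $noDepsArg), and the function below is hypothetical.

```python
def build_install_cmd(packages, skip_torch):
    """Build a uv pip install command, adding --no-deps in no-torch mode."""
    # One branch for both modes: the flag list is simply empty when unused.
    no_deps = ["--no-deps"] if skip_torch else []
    return ["uv", "pip", "install", *no_deps, *packages]
```

Because every install path goes through the same builder, a path cannot silently miss the flag the way the non-local branch did.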
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
The inline querier's identity changed on every render, forcing useLiveQuery
to resubscribe continuously and causing CPU spikes. Store the querier in a
ref and resubscribe only when explicit deps change.
The ChatCompletionRequest Pydantic model defaulted repetition_penalty
to 1.1 when clients omitted the field. This silently forced
llama-server to perform per-token repetition scanning, dropping
streaming throughput from ~225 TPS to ~172 TPS (a 24% penalty).
The Studio frontend always sends repetition_penalty=1.0 explicitly,
so UI users were unaffected. But any API client hitting
/v1/chat/completions without setting the field (curl, third-party
integrations, Open WebUI, etc.) would get the slow path.
Benchmarked on Qwen3.5-4B Q4_K_XL, GPU 0:
- repeat_penalty=1.0: 225.2 TPS
- repeat_penalty=1.1: 172.7 TPS (24% slower)
- LM Studio (which applies rp internally): 170.8 TPS
This aligns the Pydantic default with the frontend default (1.0),
generate_chat_completion's function signature default (1.0), and
llama-server's own default (1.0).
* Allow install_python_stack to run on Colab
The _COLAB_NO_VENV flag was setting _SKIP_PYTHON_DEPS=true, which
skipped both the PyPI version check (needs $VENV_DIR/bin/python) and
install_python_stack (uses sys.executable, works without a venv).
Introduce a separate _SKIP_VERSION_CHECK flag for the version check,
so install_python_stack still runs on Colab. The _SKIP_PYTHON_DEPS
flag remains available for the "versions match" fast path.
* Remove colab.py workarounds that broke transformers/hf-hub compatibility
PR #4601 added _pip_install_backend_deps(), _bootstrap_studio_venv(),
and _is_colab() to colab.py as workarounds for install_python_stack
being skipped on Colab. These workarounds:
- Stripped version constraints from studio.txt and installed into system Python
- Upgraded huggingface-hub to >=1.0, breaking Colab's pre-installed
transformers which requires huggingface-hub<1.0
With install_python_stack now running on Colab (previous commit), these
workarounds are unnecessary — all deps are properly installed by setup.sh.
Restore colab.py to its original PR #4237 structure: just get_colab_url(),
show_link(), and start().
* Remove --local flag from setup.sh in Colab notebook
The --local flag is not needed for the standard Colab flow since
install_python_stack now runs on Colab and installs deps from PyPI.
* studio: humanize ETA display for long training runs
When training takes hours or days, the ETA displayed raw minutes
(e.g. '560m 50s'). This changes the format to:
- Under 1 hour: Xm Ys (unchanged)
- 1-24 hours: Xh Ym Zs
- Over 24 hours: Xd Yh Zm
* Fix formatDuration edge cases and consolidate duplicate for PR #4608
- Guard NaN/Infinity inputs with Number.isFinite() (matches formatNumber in same file)
- Add sub-minute branch so 30s displays as "30s" instead of "0m 30s"
- Accept undefined in type signature to match formatNumber pattern
- Remove duplicate formatDuration from history-card-grid.tsx and import the shared one
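The duration rules above, including the sub-minute and non-finite cases, can be sketched in Python. The shipped formatDuration is TypeScript; the "--" placeholder, the handling of negative values, and the treatment of exactly 24 hours as "1d 0h 0m" are assumptions made here.

```python
import math


def format_duration(seconds):
    """Render a duration in seconds as a compact human-readable string."""
    if seconds is None or not math.isfinite(seconds) or seconds < 0:
        return "--"  # assumed placeholder for missing or invalid input
    total = int(seconds)
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    if days:
        return f"{days}d {hours}h {minutes}m"
    if hours:
        return f"{hours}h {minutes}m {secs}s"
    if minutes:
        return f"{minutes}m {secs}s"
    return f"{secs}s"
```

With these rules, the raw '560m 50s' example (33650 seconds) renders as "9h 20m 50s".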
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix: avoid _yaml.pyd lock on Windows during dependency overrides
* fix: move pytorch_tokenizers and kernels to no-deps install to avoid Windows _yaml.pyd lock
* fix(studio): align config cards, dynamic height for expanders, LoRA collapsible
* Fix clipping regressions in training, dataset, and params section cards
- training-section: Add hasMessage conditional so the card expands
(min-h) when startError, vision/audio incompatibility, or config
validation messages are present instead of always using fixed height
- dataset-section: Expand card when a local dataset is selected via
upload (datasetSource === "upload" && selectedLocalDataset), not only
when the Advanced panel is open
- params-section: Guard loraOpen behind isLora so switching to full
fine-tune collapses the card instead of staying expanded from stale
React useState
* Fix dataset card clipping for direct file uploads
Use uploadedFile instead of selectedLocalDataset in the card height
condition. selectedLocalDataset is derived from localDatasets.find()
which only resolves for Data Recipe entries, not direct file uploads
(.jsonl, .csv, .parquet, .arrow). The card already renders the Eval
Dataset panel based on uploadedFile (line 750), so the height gate
should match.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Recommended models matching the query were filtered from HF results but the Recommended section was hidden during search, causing them to vanish entirely.
- Show filtered recommended models during search by introducing `filteredRecommendedIds`
- Switch `recommendedSet` to use filtered IDs when searching so dedup against HF results is correct
- Hide empty "Hugging Face" label when recommended matches cover the query
- Add `normalizeForSearch` helper to strip separators (spaces, hyphens, underscores, dots) so queries like "llama 3" match "Llama-3.2-1B" and "qwen 2.5" matches "Qwen2.5-7B" in both the recommended model filter and the LoRA adapter filter
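A Python sketch of the separator-stripping idea behind `normalizeForSearch`. The frontend helper is TypeScript; lowercasing and the substring match are assumptions inferred from the examples above.

```python
import re

# Spaces, hyphens, underscores, and dots are treated as separators.
_SEPARATORS = re.compile(r"[\s\-_.]+")


def normalize_for_search(text):
    """Drop separators and lowercase so 'llama 3' and 'Llama-3' compare equal."""
    return _SEPARATORS.sub("", text).lower()


def matches(query, model_id):
    """True when the normalized query is a substring of the normalized ID."""
    return normalize_for_search(query) in normalize_for_search(model_id)
```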
* Fix Colab setup skipping llama.cpp installation
The early exit 0 in the Colab no-venv path prevented setup.sh from
ever reaching the llama.cpp install section. Remove the early exit
and instead guard only the venv-dependent Python deps section, so
execution continues through to the llama.cpp prebuilt/source install.
* Simplify _SKIP_PYTHON_DEPS initialization
* Add --local flag to setup.sh in Colab notebook
* Fix Colab huggingface-hub conflict, ensurepip fallback, bump to 2026.3.14
- colab.py / setup.sh: relax == pins to >= when installing studio.txt
on Colab so huggingface-hub does not clobber Colab's bundled version
(breaks transformers is_offline_mode import)
- install_python_stack.py: when uv is unavailable and pip is missing
(uv-created venvs), bootstrap via ensurepip before attempting upgrade
- Bump version to 2026.3.14
- Bump installer min version pins to 2026.3.14
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix Colab Studio launch and setup.ps1 box alignment
- colab.py: when the Studio venv is missing on Colab, pip-install
backend dependencies (structlog, fastapi, etc.) from studio.txt
into the current Python instead of failing with ModuleNotFoundError
- setup.sh: on Colab without a venv, install backend deps into system
Python and skip venv-dependent sections (Python stack update,
llama.cpp build) that would otherwise fail
- setup.ps1: use PadRight(47) for the done-line so "Setup Complete!"
and "Update Complete!" both align with the box border
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* feat(studio): editable context length with Apply/Reset for GGUF model settings
Previously the Context Length field was read-only and the backend
hardcoded `-c 0`, ignoring custom values entirely. KV Cache Dtype also
triggered an immediate model reload with no way to cancel.
Backend:
- llama_cpp.py: pass the actual n_ctx value to `-c` instead of always 0
- models/inference.py: relax max_seq_length to 0..1048576 (0 = model
default) so GGUF models with large context windows are supported
Frontend:
- chat-runtime-store: add customContextLength and loadedKvCacheDtype
state fields for dirty tracking
- chat-settings-sheet: make Context Length an editable number input,
stop KV Cache Dtype from auto-reloading, show Apply/Reset buttons
when either setting has been changed
- use-chat-model-runtime: send customContextLength as max_seq_length
in the load request, reset after successful load
* fix: preserve maxSeqLength for non-GGUF models in load request
customContextLength ?? 0 sent max_seq_length=0 for non-GGUF models,
breaking the finetuning/inference path that needs the slider value.
Now uses a three-way branch:
- customContextLength set: use it (user edited GGUF context)
- GGUF without custom: 0 (model's native context)
- Non-GGUF: maxSeqLength from the sampling slider
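The three-way branch can be written out as a small function. This is a Python rendering with hypothetical parameter names; the actual logic lives in the TypeScript load request.

```python
def resolve_max_seq_length(is_gguf, custom_context_length, slider_max_seq_length):
    """Pick the max_seq_length to send in the load request."""
    if custom_context_length is not None:
        # The user edited the GGUF context length explicitly.
        return custom_context_length
    if is_gguf:
        # 0 asks the backend to use the model's native context.
        return 0
    # Non-GGUF models keep the value from the sampling slider.
    return slider_max_seq_length
```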
* fix: keep max_seq_length default at 4096 for non-GGUF callers
Only relax the bounds (ge=0 for GGUF's "model default" mode,
le=1048576 for large context windows). The default stays at 4096
so API callers that omit max_seq_length still get a sane value
for non-GGUF models.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix(studio): rename trust remote code toggle and hide when no model selected
- Rename "Trust remote code" to "Enable custom code"
- Shorten subtitle to "Only enable if sure"
- Hide the toggle when no model is loaded (already hidden for GGUFs)
* fix: restore ge=128 for max_seq_length validation
Keep the minimum at 128 so the API rejects nonsensical values.
GGUF path now sends the model's native context length (from
ggufContextLength) instead of 0 when the user has not customized it.
The upper bound stays at 1048576 for large-context GGUF models.
* feat(studio): replace Context Length input with slider
Use a ParamSlider (512 to model's native context, step 512) instead
of a small number input. Shows "Max" when at the model's native
context length. Consistent with the other slider controls in the
settings panel.
* feat(studio): add editable number input alongside Context Length slider
The slider and number input stay synced -- dragging the slider updates
the number, typing a number moves the slider. The input also accepts
values beyond the slider range for power users who need custom context
lengths larger than the model default.
* fix(studio): widen context length input and use 1024 step for slider
Make the number input wider (100px) so large values like 262144 are
fully visible. Change slider step from 512 to 1024 and min from 512
to 1024.
* fix(studio): context length number input increments by 1024
* fix(studio): cap context length input at model's native max
Adds max attribute and clamps typed/incremented values so the context
length cannot exceed the GGUF model's reported context window.
* fix(studio): point "What's new" link to changelog page
Changed from /blog to /docs/new/changelog.
* fix(studio): preserve custom context length after Apply, remove stale subtitle
- After a reload with a custom context length, keep the user's value
in the UI instead of snapping back to the model's native max.
ggufContextLength always reports the model's native metadata value
regardless of what -c was passed, so we need to preserve
customContextLength when it differs from native.
- Remove "Reload to apply." from KV Cache Dtype subtitle since the
Apply/Reset buttons now handle this.
* feat(studio): auto-enable Search and Code tools when model supports them
Previously toolsEnabled and codeToolsEnabled stayed false after loading
a model even if it reported supports_tools=true. Now both toggles are
automatically enabled when the loaded model supports tool calling,
matching the existing behavior for reasoning.
* fix(studio): auto-enable tools in autoLoadSmallestModel path
The suggestion cards trigger autoLoadSmallestModel which bypasses
selectModel entirely. It was hardcoding toolsEnabled: false and
codeToolsEnabled: false even when the model supports tool calling.
Now both are set from the load response, matching the selectModel
behavior. Also sets kvCacheDtype/loadedKvCacheDtype for dirty
tracking consistency.
* fix(studio): re-read tool flags after auto-loading model
The runtime state was captured once at the start of the chat adapter's
run(), before autoLoadSmallestModel() executes. After auto-load enables
tools in the store, the request was still built with the stale snapshot
that had toolsEnabled=false. Now re-reads the store after auto-load so
the first message includes tools.
* fix(studio): re-read entire runtime state after auto-load, not just tools
The runtime snapshot (including params.checkpoint, model id, and all
tool/reasoning flags) was captured once before auto-load. After
autoLoadSmallestModel sets the checkpoint and enables tools, the
request was still built with stale params (empty checkpoint, tools
disabled). Now re-reads the full store state after auto-load so the
first message has the correct model, tools, and reasoning flags.
* feat(studio): add Hugging Face token field in Preferences
Adds a password input under Configuration > Preferences for users to
enter their HF token. The token is persisted in localStorage and
passed to all model validate/load/download calls, replacing the
previously hardcoded null. This enables downloading gated and private
models.
* fix(studio): use model native context for GGUF auto-load, show friendly errors
The auto-load paths and selectModel for GGUF were sending
max_seq_length=4096 which now actually limits the context window
(since we fixed the backend to respect n_ctx). Changed to send 0
for GGUF, which means "use model's native context size".
Also replaced generic "An internal error occurred" messages with
user-friendly descriptions for known errors like context size
exceeded and lost connections.
LoadRequest validation changed to ge=0 to allow the GGUF "model
default" signal. The frontend slider still enforces min=128 for
non-GGUF models.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix(studio): filter out FP8 models from model search results
Hide models matching *-FP8-* or *FP8-Dynamic* from both the
recommended list and HF search results. These models are not
yet supported in the inference UI.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add PID file tracking and `unsloth studio stop` command
On macOS the .app shortcut launches Studio via osascript into a
Terminal window, then the launcher script exits. The server process
runs outside of the launcher's context with no PID file, so there
is no straightforward way to find or stop it.
This adds:
- PID file at ~/.unsloth/studio/studio.pid, written after the
server starts and removed on graceful shutdown or via atexit
- `unsloth studio stop` command that reads the PID file and sends
SIGTERM (or taskkill on Windows) to shut down the server
The PID file is only removed if it still contains the current
process ID, avoiding races when a new server instance replaces
a crashed one.
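The ownership check described above can be sketched as follows. This is an illustration under assumptions, not the code in this change: the helper names and signatures are hypothetical.

```python
from pathlib import Path


def write_pid_file(pid_file: Path, pid: int) -> None:
    """Record the server's PID once the server has started."""
    pid_file.parent.mkdir(parents=True, exist_ok=True)
    pid_file.write_text(str(pid))


def remove_pid_file(pid_file: Path, pid: int) -> bool:
    """Remove the PID file only if it still names `pid`.

    Returns True when the file was removed.
    """
    try:
        recorded = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        # Nothing to clean up, or the file holds something unexpected.
        return False
    if recorded != pid:
        # A newer server instance owns the file now; leave it alone.
        return False
    pid_file.unlink(missing_ok=True)
    return True
```

Comparing the recorded PID before unlinking is what keeps a late atexit handler from deleting the file that a replacement server has just written.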
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Move atexit PID cleanup into run_server()
The atexit registration was only in the __main__ block, so it
did not cover the `unsloth studio` CLI path that calls
run_server() directly via studio_default(). Moving it into
run_server() ensures the PID file is cleaned up on unexpected
exit regardless of entry point.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
The function was called with no arguments, so $args inside the function
was always empty. Script-level args (--local, --package) were never
forwarded. Use @args splatting to pass them through.
Windows install.ps1 had no way to install from a local repo checkout,
unlike install.sh which supports ./install.sh --local. This adds:
- --local: install from the local repo via editable install (-e . --no-deps)
after installing deps from PyPI, mirroring install.sh behavior
- --package: install a different package name for testing
The --local flag:
1. Validates pyproject.toml exists at the script's directory
2. Installs torch + unsloth deps normally
3. Overlays the local checkout with uv pip install -e <repo> --no-deps
4. Passes STUDIO_LOCAL_INSTALL and STUDIO_LOCAL_REPO to setup.ps1
After installation, `unsloth studio` only works if the user
activates the Studio venv first or uses the full absolute path.
The Desktop/Start Menu shortcuts work fine, but typing `unsloth
studio` in a fresh terminal does not.
This adds the venv Scripts dir to the persistent User PATH env
var (if not already present) so `unsloth studio` works from any
new terminal window. The current session is also updated via the
existing Refresh-SessionPath helper.
* feat: multi-source model discovery (HF default, legacy cache, LM Studio)
* Fix multi-source model discovery bugs
- Fix lmstudio_model_dirs: add ~/.lmstudio/models as default path,
remove dead sys.platform branch, add dedup via seen set
- Fix _setup_cache_env: preserve legacy HF cache env vars when the
legacy hub directory exists and is non-empty
- Fix _scan_lmstudio_dir: use absolute path for id field so
is_local_path() returns True
- Remove LM Studio dirs from allowed_roots (scanned unconditionally)
- Replace bare except passes with logger.warning in legacy cache blocks
- Fix delete_cached_model to search both default and legacy HF caches
- Make lmstudio_dirs non-optional in TS interface (matches Python schema)
- Exclude lmstudio source from trainable model filter
- Remove unused import sys
* Scan HF default cache alongside legacy and active caches
When _setup_cache_env overrides HF_HUB_CACHE to the legacy Unsloth
path, the standard HF default cache (~/.cache/huggingface/hub) was
never scanned, hiding models downloaded before Unsloth Studio was
installed.
Add hf_default_cache_dir() and _all_hf_cache_scans() helper that
deduplicates and scans all three HF cache locations (active, legacy,
default). Used in list_local_models, list_cached_gguf,
list_cached_models, and delete_cached_model.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Port the bun cache corruption fix from setup.sh to setup.ps1.
bun's package cache can become corrupt, storing only package metadata
without actual content. This causes bun install to exit 0 but leave
binaries like tsc missing from node_modules/.bin/.
Changes:
- After bun install, verify tsc and vite exist in node_modules\.bin\
- Check for both bare names and .cmd wrappers (Windows creates both)
- If missing, clear the bun cache and retry once
- Only fall back to npm if the retry also fails
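The post-install check can be sketched in Python as below (the real check lives in setup.ps1). The function name is hypothetical, and treating either the bare name or the .cmd wrapper as sufficient is an assumption about how the check is applied.

```python
from pathlib import Path
from typing import Iterable, List


def missing_bins(bin_dir: Path, names: Iterable[str] = ("tsc", "vite")) -> List[str]:
    """Return the tools that have neither a bare entry nor a .cmd wrapper."""
    missing = []
    for name in names:
        bare = bin_dir / name
        wrapper = bin_dir / f"{name}.cmd"
        if not (bare.exists() or wrapper.exists()):
            missing.append(name)
    return missing
```

A non-empty result after `bun install` exits 0 is the signal to clear the cache and retry once.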
* fix(studio): source-build fallback prefers Unsloth's tested tag over upstream latest
When the prebuilt install fails and falls back to source build,
--resolve-llama-tag now queries the Unsloth release repo
(unslothai/llama.cpp) first to get the latest tested/approved tag
(e.g. b8508), instead of going straight to ggml-org/llama.cpp which
may return a newer untested tag (e.g. b8514).
This ensures the source-build fallback compiles the same version that
the prebuilt path would have installed, rather than a potentially
incompatible bleeding-edge release.
Resolution order for "latest":
1. Unsloth release repo (tested/approved)
2. ggml-org upstream (bleeding-edge)
3. Raw requested tag string (last resort)
Changes:
- resolve_requested_llama_tag() accepts optional published_repo param
with docstring explaining the resolution order
- CLI --resolve-llama-tag passes --published-repo through
- setup.sh and setup.ps1 pass --published-repo to --resolve-llama-tag
with inline comments explaining the preference
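The resolution order can be sketched as a simple fallback chain. This is not the body of resolve_requested_llama_tag(); the function name and the two query callables are hypothetical stand-ins for the Unsloth and ggml-org lookups.

```python
from typing import Callable, Optional


def resolve_latest_tag(
    requested: str,
    query_unsloth: Callable[[], Optional[str]],
    query_upstream: Callable[[], Optional[str]],
) -> str:
    """Resolve "latest" via the tested repo first, then upstream, then give up."""
    if requested != "latest":
        return requested  # concrete tags pass through unchanged
    for query in (query_unsloth, query_upstream):
        try:
            tag = query()
        except Exception:
            # Sketch only: any lookup failure moves on to the next source.
            tag = None
        if tag:
            return tag
    return requested  # last resort: the raw requested string
```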
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
torch 2.11.0 has a torch.compile/dynamo bug that causes a
StopIteration crash in dict_keys_getitem when compiling MoE
router functions (e.g. GptOssTopKRouter_forward). Pin to
<2.11.0 until the upstream fix lands.
Applies to both install.sh (Linux/macOS) and install.ps1
(Windows) fresh install paths.
bun's package cache can become corrupt, storing only package metadata
(package.json, README) without actual content (bin/, lib/). When this
happens, bun install exits 0 and reports packages as installed, but
binaries like tsc are missing from node_modules/.bin/.
For example, a corrupt typescript cache entry is 64KB (metadata only)
vs 23MB when correctly downloaded.
Changes:
- After bun install, verify tsc and vite exist in node_modules/.bin/
- If missing, clear the bun cache with bun pm cache rm and retry once
- Only fall back to npm if the retry also fails
- Revert bun installation to npm install -g bun (the binary is fine,
the cache was the problem)
bun install (specifically the npm "bun" shim v1.3.x installed via
npm install -g bun) can exit 0 while silently failing to install
packages. This causes the frontend build to fail with "tsc: not found"
or missing type declarations, since the fallback to npm only triggers
on a non-zero exit code.
Changes:
1. Initial bun install now tries the official bun.sh installer first
(which gives a real bun runtime), falling back to npm install -g bun
only if that fails.
2. After bun install reports success, verify that critical binaries
(tsc, vite) actually exist in node_modules/.bin/. If they are
missing, reinstall bun from the official source and retry once
before falling back to npm.
3. Extract the bun install + validation logic into _try_bun_install()
to avoid duplicating the check/cleanup across both attempts.
The prebuilt llama.cpp binary (cuda13-newer) links against
libcudart.so.13 and libcublas.so.13. When torch is installed via pip,
these libraries live in the venv's site-packages under
nvidia/cu13/lib/, not in /usr/local/cuda/.
The existing LD_LIBRARY_PATH logic only searched /usr/local/cuda*
paths (which have CUDA 12.x), so the CUDA backend silently failed to
load and llama-server fell back to CPU -- even with -ngl -1.
This adds a glob scan of the venv's nvidia package directories
(cu*, cudnn, nvjitlink) to LD_LIBRARY_PATH before launching
llama-server, matching where pip puts the CUDA runtime.
Tested on Colab with RTX PRO 6000 Blackwell (CUDA 13.0, pip torch):
before -- 3 MiB GPU, 0% util, CPU inference
after -- 13317 MiB GPU, 77% util, full GPU inference
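A rough sketch of the glob scan, assuming the libraries sit under <site-packages>/nvidia/<package>/lib as described above. The function names are hypothetical and the actual implementation may differ.

```python
import glob
import os
from typing import List

# Sub-package patterns named in the commit message.
NVIDIA_PATTERNS = ("cu*", "cudnn", "nvjitlink")


def nvidia_lib_dirs(site_packages: str) -> List[str]:
    """Return existing <site-packages>/nvidia/<pkg>/lib directories."""
    found: List[str] = []
    for pattern in NVIDIA_PATTERNS:
        matches = glob.glob(os.path.join(site_packages, "nvidia", pattern, "lib"))
        for path in sorted(matches):
            # "cu*" also matches "cudnn", so skip anything already collected.
            if os.path.isdir(path) and path not in found:
                found.append(path)
    return found


def with_nvidia_libs(site_packages: str, current: str = "") -> str:
    """Prepend the discovered directories to an LD_LIBRARY_PATH value."""
    existing = [p for p in current.split(os.pathsep) if p]
    return os.pathsep.join(nvidia_lib_dirs(site_packages) + existing)
```

Putting the venv directories first means the pip-installed CUDA 13 runtime is found before any system CUDA 12.x copy.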
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
When _select_gpus determines that a GGUF model fits on the selected
GPU(s), the code sets CUDA_VISIBLE_DEVICES but never passes -ngl
(number of GPU layers) to llama-server. Without -ngl or --fit,
llama-server defaults to 0 GPU layers and runs entirely on CPU.
This adds -ngl -1 (offload all layers) in the elif branch where
gpu_indices is set and use_fit is False, so models that fit in VRAM
actually use the GPU for inference.
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* Use prebuilt llama.cpp for unsloth studio setup
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix 3 issues that cause unnecessary fallback to source build
1. Make filelock import optional -- environments without filelock
(e.g. minimal installs) crashed at import time instead of
gracefully skipping the lock.
2. Use already-verified converter script from the hydrated source
tree instead of re-downloading from raw.githubusercontent.com
with no checksum. Adds symlink with copy fallback for the
legacy filename.
3. Initialize $SkipPrebuiltInstall in setup.ps1 before first use
to prevent potential uninitialized variable errors.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Keep network fallback in ensure_converter_scripts
Prefer the local verified copy from the hydrated source tree, but
retain the original network download as a fallback if the file is
missing. Create the legacy hyphenated filename as a symlink with a
copy fallback instead of writing a second full copy.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix 4 bugs in source-build fallback and binary_env paths
- setup.ps1: Replace git pull + checkout FETCH_HEAD with fetch + checkout -B
to avoid detached HEAD state that breaks re-runs. Use pinned tag in both
fetch and clone paths.
- setup.sh: Move rm -rf after cmake/git prerequisite checks so a missing
tool no longer deletes the existing install. Add --branch tag to clone.
- install_llama_prebuilt.py: Add binary_path.parent to Linux LD_LIBRARY_PATH
in binary_env() so bundled .so files in build/bin are found even without
RPATH, matching the existing Windows PATH logic.
- Add test for binary_env LD_LIBRARY_PATH on Linux.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Handle unresolved "latest" tag in source-build fallback clone
When tag resolution fails and the requested tag is "latest", both
setup scripts now omit --branch from git clone so the default branch
is cloned instead of failing on a nonexistent "latest" branch/tag.
Similarly, the PS1 fetch path fetches the default ref when the tag
is "latest".
* Resolve actual latest ggml-org tag instead of using literal "latest"
When both Python tag resolution attempts fail and the requested tag
is "latest", query the GitHub API for the actual latest release tag
from ggml-org/llama.cpp (e.g. b8508) instead of passing the literal
string "latest" to git clone --branch, which would fail since no
such branch/tag exists.
setup.sh uses curl + python json parsing; setup.ps1 uses
Invoke-RestMethod. Both fall back to the raw requested tag if the
API call also fails.
* Try Unsloth release repo before ggml-org when resolving latest tag
When falling back to the GitHub API to resolve "latest", query the
Unsloth release repo (unslothai/llama.cpp) first since it has the
prebuilt binaries pinned to tested tags. Only fall back to
ggml-org/llama.cpp if the Unsloth repo query fails.
* Add comprehensive sandbox tests for PR #4562 bug fixes
35 tests covering all fixes across platforms:
- binary_env cross-platform (Linux LD_LIBRARY_PATH, Windows PATH,
macOS DYLD_LIBRARY_PATH) with edge cases (dedup, ordering, existing paths)
- resolve_requested_llama_tag (concrete, latest, None, empty)
- setup.sh logic via subprocess: prereq check ordering (cmake/git missing
preserves install), pinned tag in clone, fetch+checkout -B pattern,
fetch failure warns instead of aborting
- "latest" tag resolution fallback chain (Unsloth API -> ggml-org ->
raw) with mock curl: success, failure, malformed JSON, empty body,
empty tag_name, env overrides
- Source code pattern verification for both .sh and .ps1 files
All 138 tests pass in isolated uv venv.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add binary_path.parent to macOS DYLD_LIBRARY_PATH in binary_env
macOS prebuilt .dylib files are overlaid into build/bin (same as
Linux), but binary_env only added install_dir to DYLD_LIBRARY_PATH.
Add binary_path.parent so the loader can find sibling dylibs even
without embedded loader paths.
Mirrors the existing fix for Linux LD_LIBRARY_PATH and the Windows
PATH pattern.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Guard --branch when resolved tag is "latest"; fix broken test assertion
When all API fallbacks fail and the tag stays as literal "latest",
omit --branch from git clone (clones default branch instead of
failing). Both setup.sh and setup.ps1 now check for "latest" before
passing --branch to git clone/fetch.
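The guard amounts to the following, shown in Python for brevity (the real logic is in setup.sh and setup.ps1). The function name is hypothetical.

```python
from typing import List


def git_clone_args(repo_url: str, dest: str, tag: str) -> List[str]:
    """Build a git clone command, pinning a tag only when one was resolved."""
    args = ["git", "clone"]
    if tag and tag != "latest":
        # "latest" is not a real branch or tag, so never pass it to --branch.
        args += ["--branch", tag]
    args += [repo_url, dest]
    return args
```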
Also fix test_setup_ps1_clone_uses_branch_tag, which used the form
assert "x", "y" in z. That always passes: Python treats "y" in z as
the assertion message and only tests the truthy literal "x". Changed
to assert "x" in z and "y" in z.
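The pitfall is easy to reproduce. In this small illustration, the sample string and the helper name are made up for the example.

```python
source = "git clone --branch $Tag"

# Parsed as `assert <condition>, <message>`: the condition is the non-empty
# literal "--branch", which is always truthy, so the membership test on the
# right is never used as a condition.
assert "--branch", "--depth" in source  # passes although "--depth" is absent


def both_present(text: str, first: str, second: str) -> bool:
    """The intended check: each substring is tested on its own."""
    return first in text and second in text
```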
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix macOS DYLD trailing colon, install_lock no-op, and debug log
- binary_env macOS: use dedupe_existing_dirs instead of raw string
concatenation. Eliminates trailing colon in DYLD_LIBRARY_PATH
(which causes dyld to search CWD for libraries) and deduplicates
when binary_path.parent == install_dir. Now consistent with the
Linux and Windows branches.
- install_lock: when filelock is not installed, use os.O_CREAT|O_EXCL
as a fallback exclusive file lock with timeout, instead of yielding
with no locking. Prevents concurrent installs from corrupting each
other's staging directories.
- setup.ps1: remove [DEBUG] log line that printed to every user on
every Windows setup run.
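To illustrate the DYLD_LIBRARY_PATH point: joining with a raw separator leaves a trailing colon when the existing value is empty, while filtering and de-duplicating first does not. The helper below is a hypothetical sketch and not dedupe_existing_dirs itself; in particular it does not check that the directories exist.

```python
import os
from typing import Iterable, List


def build_search_path(dirs: Iterable[str], existing: str = "") -> str:
    """Join directories in order, dropping empty entries and repeats."""
    ordered: List[str] = []
    for entry in list(dirs) + existing.split(os.pathsep):
        if entry and entry not in ordered:
            ordered.append(entry)
    return os.pathsep.join(ordered)


# Raw concatenation with an empty existing value keeps the separator.
naive = "/opt/llama/build/bin" + os.pathsep + ""
```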
* Add stale-lock detection and atomic clone-then-swap
install_lock fallback (no filelock): write PID to lock file and
check if the holder process is still alive on contention. Dead PIDs
(ProcessLookupError) and unreadable lock files trigger immediate
cleanup. Live processes owned by other users (PermissionError) are
correctly recognized as alive -- the lock is not removed.
setup.sh/setup.ps1 source-build: clone into a temporary directory
first, then swap into place only on success. If git clone fails,
the existing install is preserved instead of being deleted by the
premature rm -rf.
* Remove redundant upstream_tag != release_tag check
load_approved_release_checksums compared checksums.upstream_tag
against the Unsloth release_tag, which are different namespaces
(upstream ggml-org tag vs Unsloth published tag). This only worked
because both happened to be "b8508" by convention. It would break if
Unsloth ever uses a different release naming scheme.
The existing check at parse_approved_release_checksums (line 950)
already validates the release_tag field correctly.
* Fix lock TOCTOU race and build-in-temp-dir swap
install_lock fallback: add os.fsync(fd) after writing PID to ensure
the PID is visible to racing processes before they check. Treat
empty lock files (PID not yet written) as "wait and retry" instead
of stale, closing the window where two processes could both see an
empty file, both unlink it, and both acquire the lock.
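A simplified, POSIX-only sketch of the fallback lock follows. The function names and the state labels are hypothetical, and the timeout/retry loop is left out. os.kill(pid, 0) is used purely as a liveness probe and must not be used this way on Windows.

```python
import os
from pathlib import Path


def pid_alive(pid: int) -> bool:
    """Probe a PID with signal 0 (POSIX only; nothing is delivered)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def try_acquire(lock_path: Path) -> bool:
    """Create the lock file exclusively and publish our PID in it."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    try:
        os.write(fd, str(os.getpid()).encode())
        os.fsync(fd)  # flush the PID to disk before closing
    finally:
        os.close(fd)
    return True


def holder_state(lock_path: Path) -> str:
    """Classify the lock as 'free', 'pending', 'stale', or 'held'."""
    try:
        text = lock_path.read_text().strip()
    except FileNotFoundError:
        return "free"
    if not text:
        return "pending"  # PID not written yet: wait and retry, do not unlink
    try:
        pid = int(text)
    except ValueError:
        return "stale"
    return "held" if pid_alive(pid) else "stale"
```

Only the 'stale' state justifies removing the file; 'pending' is the case that previously let two processes unlink the same empty lock.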
setup.sh/setup.ps1 source-build: clone AND build in a temp directory
(LLAMA_CPP_DIR.build.$$). Only swap into the final LLAMA_CPP_DIR
after the build succeeds. If clone or cmake or build fails, the temp
dir is cleaned up and the existing working install is preserved.
Previously, rm -rf ran after clone but before build, destroying the
existing install even if the build later failed.
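The build-then-swap pattern, sketched in Python (the scripts themselves are shell and PowerShell). The function name and the build callback are hypothetical; the temp directory name follows the LLAMA_CPP_DIR.build.$$ convention above.

```python
import os
import shutil
from pathlib import Path
from typing import Callable


def build_then_swap(final_dir: Path, build: Callable[[Path], None]) -> None:
    """Run `build` in a sibling temp dir and replace final_dir only on success."""
    tmp_dir = final_dir.with_name(f"{final_dir.name}.build.{os.getpid()}")
    try:
        build(tmp_dir)  # clone + cmake + build all happen here
    except Exception:
        # Any failure: drop the partial build, keep the working install.
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
    if final_dir.exists():
        shutil.rmtree(final_dir)
    tmp_dir.rename(final_dir)
```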
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* refactor: consolidate dual venvs into single ~/.unsloth/studio/unsloth_studio
* refactor: separate install.sh (first-time) from setup.sh (smart update with PyPI version check)
* fix: install.sh calls setup.sh directly, keep both setup and update CLI commands
* fix: use importlib.resources.files() directly without _path attribute
* fix: bootstrap uv before pip upgrade to handle uv venvs without pip
* fix: frontend 404 when launched via CLI, add global symlink to ~/.local/bin
* feat: add --local flag to install.sh and unsloth studio update for branch testing
* fix: resolve repo root from script location for --local installs
* feat: add --package flag to install.sh for testing with custom package names
* feat: add --package flag to unsloth studio update
* fix: always nuke venv in install.sh for clean installs
* revert: remove Windows changes, will handle in separate PR
* fix: error when --package is passed without an argument
* revert: restore Windows scripts to current main
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: always explicitly set STUDIO_LOCAL_INSTALL and STUDIO_PACKAGE_NAME env vars
* fix: pass explicit STUDIO_LOCAL_REPO env var for --local installs
* fix: align banner box for Setup vs Update labels
* deprecate: hide 'unsloth studio setup' command, point users to update/install.sh
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: check stdout not stdin for auto-launch detection (curl pipe fix)
* fix: update install URL to unsloth.ai/install.sh
* fix: update install.sh usage comments to unsloth.ai/install.sh
* fix: use --upgrade-package for base deps to preserve existing torch/CUDA installs
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: --local install now also installs unsloth-zoo via base.txt before editable overlay
* fix: don't skip base packages for --local installs (editable needs unsloth-zoo)
* refactor: move --local full dep install to install.sh, keep SKIP_STUDIO_BASE for all paths
* feat: add migration support for old .venv and CWD-based installs in setup.sh
* Revert "feat: add migration support for old .venv and CWD-based installs in setup.sh"
This reverts commit 301291d002.
* feat: migrate old .venv layout in install.sh instead of always nuking
* feat: validate old .venv with torch CUDA test before migration, recovery message on launch failure
* fix: try CUDA then fall back to CPU for migration validation
* fix: upgrade unsloth/unsloth-zoo with --reinstall-package on migration to preserve torch
* remove: delete unused unsloth ui command (use unsloth studio instead)
* Fix Windows venv path mismatch between install.ps1, setup.ps1, and studio.py
install.ps1 was creating the venv CWD-relative ($VenvName = "unsloth_studio"),
setup.ps1 was using an absolute path to ".unsloth\studio\.venv", and studio.py
looks for ".unsloth\studio\unsloth_studio". All three paths were different, so
the Windows installer would never produce a working Studio setup.
install.ps1:
- Use absolute $StudioHome + $VenvDir matching the Linux install.sh layout
- Add 3-way migration: old .venv at STUDIO_HOME, CWD-relative ~/unsloth_studio
from the previous install.ps1, or fresh creation with torch validation
- For migrated envs, upgrade unsloth while preserving existing torch/CUDA wheels
- Set SKIP_STUDIO_BASE=1 before calling setup.ps1 (matches install.sh behavior)
- Fix launch instructions to use the absolute venv path
setup.ps1:
- Change $VenvDir from ".unsloth\studio\.venv" to ".unsloth\studio\unsloth_studio"
- Add SKIP_STUDIO_BASE guard: error out if venv is missing when called from
install.ps1 (which should have already created it)
- Differentiate "Setup" vs "Update" in banners based on SKIP_STUDIO_BASE
* setup.ps1: unconditionally error if venv missing, matching setup.sh
setup.sh always errors out if the venv does not exist (lines 224-228),
telling the user to run install.sh first. setup.ps1 was conditionally
creating a bare venv with python -m venv when SKIP_STUDIO_BASE was not
set, which would produce an empty venv with no torch or unsloth. Now
setup.ps1 matches setup.sh: always error, always point to install.ps1.
* Fix --torch-backend=auto CPU solver dead-end on Linux, macOS, and Windows
On CPU-only machines, `uv pip install unsloth --torch-backend=auto`
falls back to unsloth==2024.8 because the CPU solver cannot satisfy
newer unsloth's dependencies. install.ps1 already solved this with a
two-step approach; this applies the same fix to install.sh and
install_python_stack.py.
install.sh: add get_torch_index_url() that detects GPU via nvidia-smi
and maps CUDA versions to PyTorch index URLs (matching install.ps1's
Get-TorchIndexUrl). Fresh installs now install torch first via explicit
--index-url, then install unsloth with --upgrade-package to preserve
the pre-installed torch. All 5 uses of --torch-backend=auto are removed
from the primary paths.
install.ps1: add fallback else-branch when TorchIndexUrl is empty,
using --torch-backend=auto as last resort (matching install.sh).
install_python_stack.py: remove unconditional --torch-backend=auto
from _build_uv_cmd. Torch is pre-installed by install.sh/setup.ps1
by the time this runs. Callers that need it can set UV_TORCH_BACKEND.
Both install.sh and install.ps1 now share the same three-branch logic:
migrated env (upgrade-package only), normal (torch-first + index-url),
and fallback (--torch-backend=auto if URL detection fails).
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use --reinstall-package for migrated envs on both Linux and Windows
For migrated environments (moved from legacy venv location),
--reinstall-package is better than --upgrade-package because it forces
a clean reinstall even if the same version is already installed. This
ensures proper .dist-info and .pyc state in the new venv location.
--upgrade-package remains correct for the fresh install path where
torch is already installed and we just want to add unsloth without
re-resolving torch.
* Address review findings: portability, parity, and stale comments
- Replace grep -oP (GNU-only Perl regex) with POSIX sed in
get_torch_index_url() so the script also works where grep lacks -P.
macOS is already guarded by the Darwin early-return, but
Alpine/BusyBox would silently get the wrong CUDA tag.
- Add LC_ALL=C before nvidia-smi invocation to prevent locale-dependent
output parsing issues
- Add warning on stderr when nvidia-smi output is unparseable, matching
install.ps1's [WARN] message
- Add explicit unsloth-zoo positional arg to install.ps1 migrated path,
matching install.sh (--reinstall-package alone won't install it if it
was never present in the migrated env)
- Fix stale comment in install_python_stack.py line 392 that still
claimed --torch-backend=auto is added by _build_uv_cmd
- Add sed to test tools directory (function now uses sed instead of grep)
* Add --index-url to migrated env path to prevent CPU torch resolution
The migrated path runs uv pip install with --reinstall-package for
unsloth/unsloth-zoo. While uv should keep existing torch as satisfied,
the resolver could still re-resolve torch as a transitive dependency.
Without --index-url pointing at the correct CUDA wheel index, the
resolver would fall back to plain PyPI and potentially pull CPU-only
torch. Adding --index-url $TORCH_INDEX_URL ensures CUDA wheels are
available if the resolver needs them.
Applied to both install.sh and install.ps1.
* Revert --index-url on migrated env path
The original install.ps1 on main already handles the migrated path
without --index-url and it works correctly. --reinstall-package only
forces reinstall of the named packages while uv keeps existing torch
as satisfied. No need for the extra flag.
* Fix unsloth studio update --local not installing local checkout
studio.py sets STUDIO_LOCAL_REPO when --local is passed, but
install_python_stack.py never read it. The update path always
installed from PyPI regardless of the --local flag.
Add a local_repo branch that first updates deps from base.txt
(with --upgrade-package to preserve torch), then overlays the
local checkout as an editable install with --no-deps.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Add support for ROCm in studio setup
* Fix ROCm detection bugs: ROCM_PATH resolution, CUDA guard, compiler selection
- Set GPU_BACKEND="cuda" when nvcc is found (CUDA path was unreachable)
- Guard ROCm detection with `if [ -z "$GPU_BACKEND" ]` so CUDA takes
priority on mixed-toolchain hosts
- Rename ROCM_PATH to ROCM_HIPCC for the hipcc binary; resolve the
actual ROCm root via readlink -f and hipconfig -R into ROCM_ROOT
- Export both ROCM_PATH and HIP_PATH as the resolved root directory
- Use HIPCXX via hipconfig -l instead of legacy CMAKE_C_COMPILER=hipcc
- Switch grep -oP to grep -oE for portability across Linux distros
- Use GPU_TARGETS (upstream cmake variable) instead of AMDGPU_TARGETS
- Remove stale hardcoded fallback targets; let cmake auto-detect instead
* Fix gfx regex to match gfx90a (MI210/MI250/MI250X)
The grep and bash regex used {3,4} digits after 'gfx', which silently
excluded gfx90a (2 digits + letter 'a') -- the architecture for AMD
Instinct MI210, MI250, and MI250X data-center GPUs. Change to {2,4}
so all real gfx targets from gfx90a through gfx1200 are matched.
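The effect of the quantifier change can be checked with a small Python example. The patterns below assume the shape "gfx, digits, optional trailing letter"; the exact expressions in the setup script may differ.

```python
import re

# Assumed shape of the old and new patterns.
OLD_GFX = re.compile(r"gfx[0-9]{3,4}[a-z]?")
NEW_GFX = re.compile(r"gfx[0-9]{2,4}[a-z]?")


def matches(pattern: "re.Pattern[str]", target: str) -> bool:
    """True when the whole target string is a valid gfx name for `pattern`."""
    return pattern.fullmatch(target) is not None
```

With these patterns, gfx90a is rejected by the old expression and accepted by the new one, while gfx942 and gfx1200 are accepted by both.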
---------
Co-authored-by: edamamez <eda.zhou@amd.com>
* feat(tokenizer): add get_tokenizer_info() diagnostic helper
Adds get_tokenizer_info(tokenizer) to tokenizer_utils.py, returning a
concise dict of key tokenizer properties: class name, is_fast, vocab
size, added token count, model_max_length, padding side, special
tokens (bos, eos, pad, unk), chat template presence, and total special
token count. All fields use getattr(..., None) fallbacks so the
function never raises on unusual or partially initialized tokenizers.
Exported via __all__ alongside the existing public helpers. Useful for
logging, debugging, and surfacing tokenizer state in the Unsloth
Studio UI.
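The getattr-with-fallback approach might look roughly like this. It is a reduced sketch and not the function added here: the name tokenizer_summary and the dict keys are illustrative, and several fields from the description (added token count, special token count) are left out.

```python
from typing import Any, Dict


def tokenizer_summary(tokenizer: Any) -> Dict[str, Any]:
    """Collect a few tokenizer properties without raising on missing attributes."""
    return {
        "tokenizer_class": type(tokenizer).__name__,
        "is_fast": getattr(tokenizer, "is_fast", None),
        "vocab_size": getattr(tokenizer, "vocab_size", None),
        "model_max_length": getattr(tokenizer, "model_max_length", None),
        "padding_side": getattr(tokenizer, "padding_side", None),
        "bos_token": getattr(tokenizer, "bos_token", None),
        "eos_token": getattr(tokenizer, "eos_token", None),
        "pad_token": getattr(tokenizer, "pad_token", None),
        "unk_token": getattr(tokenizer, "unk_token", None),
        # Presence only; the template text itself is not copied.
        "has_chat_template": getattr(tokenizer, "chat_template", None) is not None,
    }
```

Because every lookup has a None default, even a bare object() yields a complete dict.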
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix docstring, remove artifact, restore valuable comments in tokenizer_utils.py
- Fix get_tokenizer_info() docstring example: correct tokenizer_class to
PreTrainedTokenizerFast, vocab_size to 128000, swap added_tokens_count (256)
and special_tokens_count (3) to match actual Llama-3.2-1B-Instruct output
- Remove accidentally committed "# ... (rest of file unchanged)" diff artifact
- Restore fix_sentencepiece_gguf() docstring with llama.cpp upstream link
- Restore 10 comments containing upstream URLs, model-specific workarounds,
and non-obvious context (issue #292, sentencepiece#121, Starling hack,
Kaggle /tmp limit, Deepseek slow tokenizer, twitter/danielhanchen references)
* Revert "Fix docstring, remove artifact, restore valuable comments in tokenizer_utils.py"
This reverts commit 4e525b734b.
* Revert all deletions, keep only get_tokenizer_info() addition
Restore tokenizer_utils.py to main and add only the new
get_tokenizer_info() function and its __all__ entry.
All comment removals, dead code cleanup, and formatting
changes from the original PR are reverted.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* perf(studio): upgrade to Vite 8 + auto-install bun for 3x faster frontend builds
* fix(studio): make bun-to-npm fallback actually reachable
setup.sh used run_quiet() for the bun install attempt, but run_quiet
calls exit on failure. This killed the script before the npm fallback
could run, making the "falling back to npm" branch dead code.
Replace the run_quiet call with a direct bun invocation that captures
output to a temp file (same pattern, but returns instead of exiting).
Also clean up partial node_modules left by a failed bun install before
falling back to npm, in both setup.sh and build.sh. Without this, npm
inherits a corrupted node_modules tree from the failed bun run.
* fix(studio): restore commonjsOptions for dagre CJS interop
The previous commit removed build.commonjsOptions, assuming Vite 8's
Rolldown handles CJS natively. While optimizeDeps.include covers the
dev server (pre-bundling), it does NOT apply to production builds.
The resolve.alias still points @dagrejs/dagre to its .cjs.js entry,
so without commonjsOptions the production bundle fails to resolve
the CJS default export. This causes "TypeError: e is not a function"
on /chat after build (while dev mode works fine).
Restore the original commonjsOptions block to fix production builds.
* fix(studio): use motion/react instead of legacy framer-motion import
* fix(studio): address PR review findings for Vite 8 + bun upgrade
Fixes:
- Remove bun.lock from repo and add to .gitignore (npm is source of truth)
- Use & bun install *> $null pattern in setup.ps1 for reliable $LASTEXITCODE
- Add Remove-Item node_modules before npm fallback in setup.ps1
- Print bun install failure log in setup.sh before discarding
- Add Refresh-Environment after npm install -g bun in setup.ps1
- Tighten Node version check to ^20.19.0 || >=22.12.0 (Vite 8 requirement)
- Add engines field to package.json
- Use string comparison for _install_ok in build.sh
- Remove explicit framer-motion ^11.18.2 from package.json (motion pulls
framer-motion ^12.38.0 as its own dependency — the old pin caused a
version conflict)
* Fix Colab Node bypass and bun.lock stale-build trigger
Gate the Colab Node shortcut on NODE_OK=true so Colab
environments with a Node version too old for Vite 8 fall
through to the nvm install path instead of silently proceeding.
Exclude bun.lock from the stale-build probe in both setup.sh
and setup.ps1 so it does not force unnecessary frontend rebuilds
on every run.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: Shine1i <wasimysdev@gmail.com>
* Add macOS and Linux desktop shortcuts to install.sh
Adds create_studio_shortcuts() function that creates platform-native
shortcuts after `unsloth studio setup` completes, mirroring the Windows
shortcut behavior from PR #4558.
Linux: .desktop file in ~/.local/share/applications/ and ~/Desktop/
macOS: .app bundle in ~/Applications/ with Info.plist, exec stub, and
optional .icns icon built from unsloth-gem.png via sips+iconutil
Both platforms share a Bash launcher script at
~/.local/share/unsloth/launch-studio.sh that provides:
- Health check with service fingerprint verification
- Port scanning (8888-8908) via ss/lsof
- PID-file single-instance guard (no flock dependency)
- Terminal spawning (macOS: Terminal.app; Linux: gnome-terminal etc.)
- Browser open after health poll with 60s timeout
WSL is skipped (no native desktop environment).
* Fix 6 issues found by 10 parallel reviewers
1. [10/10] Health check now supports wget as fallback to curl via
_http_get() helper, matching the installer's own download() pattern.
Previously wget-only systems would time out on every launch.
2. [9/10] Exe path substitution now escapes sed metacharacters (&, \, |)
and shell single-quotes before injection, preventing launcher
corruption for paths like /opt/R&D/bin/unsloth.
3. [4/10] Linux .desktop Exec= field now quotes the launcher path,
fixing launches from home directories containing spaces.
4. [3/10] macOS AppleScript command now escapes backslashes and
double-quotes before interpolation into do script "...", fixing
Terminal.app launch failures.
5. [3/10] Single-instance guard now uses atomic mkdir instead of
racy check-then-write PID file, preventing duplicate concurrent
launches on rapid double-click.
6. [1/10] Launcher now scans for a free port via _find_launch_port()
instead of always hardcoding -p 8888, so Studio starts correctly
when another service already occupies port 8888.
Also fixed: `open` command on Linux (openvt) no longer incorrectly
triggers the macOS browser-open path -- now gated on uname=Darwin.
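The atomic-mkdir guard from item 5 can be illustrated as follows. The launcher itself is a Bash script; this Python sketch only demonstrates the idea, and the function names are hypothetical rather than taken from the installer.

```python
import os


def try_acquire_lock(lock_dir: str) -> bool:
    """Return True if this process created lock_dir, False if it already existed."""
    try:
        # mkdir either creates the directory or fails, so there is no gap
        # between "check" and "write" as with a check-then-write PID file.
        os.mkdir(lock_dir)
        return True
    except FileExistsError:
        return False


def release_lock(lock_dir: str) -> None:
    """Remove the lock directory so the next launch can acquire it."""
    os.rmdir(lock_dir)
```

Two rapid launches would both call try_acquire_lock(); only one can see True, and the other can fall back to polling for health.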
* Fix mktemp guard and exe path escaping from PR review comments
Two real issues identified from automated review comments:
1. Guard mktemp -d failure in macOS icns generation. If mktemp -d
returned empty, dirname would resolve to / and rm -rf would attempt
to delete the root directory. Now checks that the temp dir was
actually created before proceeding.
2. Replace sed-based exe path substitution with a conf file approach.
The previous sed escaping broke paths containing apostrophes
(e.g. /home/O'Connor/) because the '\'' escape introduced
backslashes that were then double-escaped by the metacharacter
pass. Now writes UNSLOTH_EXE to a separate studio.conf file that
the launcher sources at runtime, eliminating all sed metacharacter
and shell quoting interaction issues.
This also addresses the sed -i.bak portability concern (now moot
since sed is no longer used on the launcher file).
* Fix unbound variable crash and per-user lock in launcher
- Use ${UNSLOTH_EXE:-} so set -u does not crash before the friendly
error message when studio.conf is missing or empty.
- Append $(id -u) to the fallback lock path so each user gets their
own lock directory when XDG_RUNTIME_DIR is unset.
* Mark desktop shortcut as trusted for GNOME/Nautilus
On modern GNOME desktops, chmod +x alone is not sufficient to make
a .desktop file launchable by double-click on ~/Desktop. Nautilus
requires the metadata::trusted attribute to be set via gio, otherwise
it shows a warning dialog instead of launching the application.
The repo has both the CodeQL "default setup" (configured in repo
settings) and this advanced workflow file enabled. GitHub does not
allow both simultaneously, causing all PR CI runs to fail with:
"CodeQL analyses from advanced configurations cannot be processed
when the default setup is enabled"
Since the default setup already covers the same languages (Python,
JavaScript/TypeScript) with the same build-mode (none), remove the
redundant advanced workflow file.
* Add CodeQL analysis workflow configuration
* Add Dependabot configuration for package updates
Configure Dependabot to check for updates in various ecosystems weekly.
* Fix dependabot.yml: bun ecosystem, missing dir, grouping for PR #4479
1. studio/frontend uses bun.lock not package-lock.json, so change npm to bun
2. Add missing studio/backend/requirements/ pip entry (consumed by studio/setup.sh)
3. Add groups with patterns ["*"] to all pip/bun/npm entries to batch updates
and avoid 30+ individual Dependabot PRs on the first run
* Consolidate pip blocks to fix overlapping directory violation
GitHub Dependabot forbids multiple same-ecosystem entries with
overlapping directories on the same branch. The root "/" directory
overlapped the 3 nested pip dirs. Merge all 4 pip blocks into one
using the `directories:` (plural) key.
Also remove redundant open-pull-requests-limit from the bun block
since grouping with patterns: ["*"] already limits PR count.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* Try installing causal-conv1d from prebuilt wheels if available
* Prefer installing mamba-ssm from wheel to speed up things
* undo python stack install changes
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Revert "undo python stack install changes"
This reverts commit d943551092.
* add comments
* Fix wheel installer: model detection, platform tags, torch pin, error handling
- Add nemotron-h (hyphen) and granite-4.0-h / granitemoehybrid to model
detection for both causal-conv1d and mamba-ssm. These hybrid Mamba models
were silently skipped since nemotron_h (underscore) never matches real
HF model IDs like nvidia/Nemotron-H-8B-Base, and granite was missing
entirely despite being a supported model in model_config.py and loader.py.
- Fix _causal_conv1d_platform_tag to detect linux_aarch64 via
platform.machine() instead of hardcoding linux_x86_64. Both upstream
releases publish aarch64 wheels. Drop win_amd64 since neither repo
publishes Windows wheels (avoids a wasted HTTP probe on every run).
- Pin torch to >=2.6.0,<2.11.0 instead of <=2.10.0 to add a version floor
and document the wheel coverage range with upstream release links.
- Strip non-numeric suffixes from torch minor version so nightly builds
like 2.7a0 correctly resolve to wheel tag torch2.7 instead of torch2.7a0.
- Use stderr=_sp.PIPE instead of stderr=_sp.STDOUT in the env probe so
torch import warnings do not corrupt the JSON output.
- Add timeout=30 to the env probe subprocess to prevent indefinite hangs.
- Catch Exception (not just ImportError) on the existing-install check so
ABI-broken installs with OSError/RuntimeError are retried rather than
silently accepted.
- Guard uv invocation with shutil.which("uv") to prevent FileNotFoundError
crash when uv is not on PATH. Wrap the top-level ensure calls in
try/except so failures do not kill the training worker.
- Hoist _SSM_MODEL_SUBSTRINGS to module level.
- Remove redundant --torch-backend=auto flag from direct wheel URL install.
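A minimal sketch of the version-suffix stripping and architecture detection described in the list above. The function names are hypothetical and the real helpers in the worker may differ; only the behaviour named in the commit message is shown.

```python
import platform
import re


def torch_wheel_tag(torch_version: str) -> str:
    """Map a torch version string to a 'torchMAJOR.MINOR' wheel tag."""
    major, minor = torch_version.split(".")[:2]
    # Keep only the leading digits of the minor component: "7a0" -> "7".
    match = re.match(r"\d+", minor)
    if match is None:
        raise ValueError(f"unrecognised torch version: {torch_version!r}")
    return f"torch{major}.{match.group(0)}"


def linux_platform_tag() -> str:
    """Pick the Linux wheel platform tag from the machine architecture."""
    machine = platform.machine().lower()
    if machine in ("aarch64", "arm64"):
        return "linux_aarch64"
    return "linux_x86_64"
```

With this shape a nightly such as 2.7a0 resolves to the same tag as a 2.7 release build.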
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add LFM2 to causal-conv1d detection; stop training on install failure
- Add "lfm2" to _model_wants_causal_conv1d so Studio picks up the
fast kernel path for Liquid Foundation Model 2.
- Replace silent logger.warning on SSM dependency install failure
with an error event that tells the user to choose another model
and stops the training job immediately.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Catch subprocess timeout in torch probe; narrow import guard to ImportError
- _probe_causal_conv1d_env: wrap subprocess.run in try/except for
TimeoutExpired so a slow torch import returns None (falls back to
PyPI) instead of killing the training job.
- _install_package_wheel_first: narrow except Exception to except
ImportError on the __import__ check so unexpected errors from a
broken module still propagate.
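The timeout handling above might look roughly like this. It is a sketch under assumptions: probe_env is a hypothetical stand-in for _probe_causal_conv1d_env, and the probe code is passed in as a string instead of importing torch.

```python
import json
import subprocess
import sys
from typing import Optional


def probe_env(code: str, timeout: float = 30.0) -> Optional[dict]:
    """Run `code` in a child interpreter and parse its stdout as JSON.

    Returns None on timeout, non-zero exit, or unparsable output so the
    caller can fall back to another install path.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,  # keep warnings out of the JSON on stdout
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    try:
        return json.loads(result.stdout)
    except json.JSONDecodeError:
        return None
```

Returning None instead of raising lets the caller treat a slow or broken probe the same way as a missing prebuilt wheel.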
* Remove unconditional torch pin from install_python_stack
The torch>=2.6.0,<2.11.0 pin was added to ensure prebuilt
causal-conv1d / mamba-ssm wheels exist, but it runs at install
time for all users regardless of model choice. This can downgrade
or unnecessarily upgrade torch. The worker already handles wheel
compatibility at training time by probing the environment and
falling back to PyPI, so the install-time pin is not needed.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* feat(chat): ghost-style tool containers
Remove borders and card styling from tool call UI. ToolFallback
uses minimal padding with indented content. ToolGroup defaults
to ghost variant with subtle background for multi-tool grouping.
* feat(chat): compact web search source pills
Switch sources from vertical full-width badges to horizontal
wrapping pills with smaller icons.
* feat(chat): left-accent code and terminal tool UI
Replace bordered card layout with a left border accent for
Python and Terminal tool output. Add timer cleanup on unmount
for the copy button in both components.
* feat(chat): inline latex and clickable links
Enable single-dollar $...$ math rendering via createMathPlugin.
Add styled link component with target=_blank for external links.
* fix(chat): inline generating indicator, static tailwind classes, misc fixes
Move generating indicator from viewport footer into assistant
message using AnimatedShinyText shimmer. Only shows when message
content is empty, hides once tool calls or text appear.
Use static size class map in SourceIcon for Tailwind v4 compat.
Use unique keys for web search sources. Remove px-3 from ghost
tool group variant.
* fix(chat): only show generating indicator while message is running
Hide the shimmer when message is cancelled or errored with no
content, preventing stale loading UI on empty completed messages.
* fix: escape currency dollar signs in LaTeX math rendering and fix TS build error
- Add preprocessLaTeX() in lib/latex.ts to escape currency patterns ($5, $1,000, $5.99, $100K)
before they reach the math parser, preventing false positives when singleDollarTextMath is enabled.
Code blocks and already-escaped dollars are left untouched.
- Use preprocessLaTeX via useMemo in markdown-text.tsx so Streamdown receives clean input.
- Fix TS18048 in thread.tsx: message.status?.type (optional chaining) since status can be undefined.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Bump Data Designer to 0.5.4 (removes litellm dependency)
NVIDIA Data Designer v0.5.4 removes litellm entirely and replaces it
with native OpenAI and Anthropic adapters. This follows the litellm
supply chain incident where versions 1.82.7 and 1.82.8 were compromised
with a credential stealer.
Release notes: https://github.com/NVIDIA-NeMo/DataDesigner/releases/tag/v0.5.4
Changes:
- Bump data-designer, data-designer-config, data-designer-engine to 0.5.4
- Sync data-designer-deps.txt with 0.5.4 engine requirements:
- Added: chardet, fsspec, mcp
- Removed: python-json-logger, pymupdf, pymupdf4llm, mammoth
(these remain in the unstructured-seed plugin which still needs them)
- duckdb constraint relaxed from <1.5 to <2 (upstream fixed record_batch)
- Bump plugin lower bound to >=0.5.4
* Keep pymupdf, pymupdf4llm, mammoth in data-designer-deps
The unstructured-seed plugin is installed with --no-deps, so its
pyproject.toml dependencies are not auto-resolved. These three
packages are needed by the seed route (studio/backend/routes/
data_recipe/seed.py) and must remain in the explicit deps list.
* feat: Implement Q-GaLore optimizer and custom embedding learning rate in the Unsloth trainer.
* feat: Implement QGaLoreAdamW8bit optimizer with 8-bit states, GaLore low-rank gradient projection, and optional INT8 weight quantization, along with supporting projector and tests.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* feat: Introduce Q-GaLore AdamW optimizer with low-rank quantized gradient projection and integrate into the trainer, along with dedicated tests.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* feat: Implement Q-GaLore AdamW optimizer with gradient projection and quantization, including trainer integration and corresponding tests.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix 3 bugs in Q-GaLore optimizer and add weight_quant forward hooks
1. Fix use-after-delete crash: move `del p._saved_data` after the
weight decay block so decoupled weight decay can reference the
current weights correctly (p.data).
2. Fix substring matching in make_q_galore_param_groups: split
parameter names on "." and check exact component matches to
prevent false positives (e.g. "not_q_proj" matching "q_proj").
3. Implement forward pre-hooks for weight_quant: after the optimizer
quantizes weights to INT8, replace p.data with a 1-element
placeholder to free float memory. A register_forward_pre_hook
dequantizes back to float before each forward pass. The trainer
calls install_weight_quant_hooks() when weight_quant is enabled.
4. Update test_weight_decay_uses_saved_data to match the fixed code
path (decoupled decay uses p.data, expected value 2.7). Add
test_weight_quant_hook_restores_float to verify the INT8-to-float
hook round-trip.
All 24/24 Q-GaLore tests pass. Benchmarked on Llama-3.2-1B-Instruct
FFT: Q-GaLore saves 32% VRAM (10.63 -> 7.24 GB) with better loss
convergence (1.3 vs 2.0 at step 100). No regressions in 31-notebook
sweep across Llama, Qwen, Mistral, Phi, Gemma, vision, and GRPO.
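The substring-matching fix from item 2 can be sketched as below. matches_target is a hypothetical helper written for illustration, not the code inside make_q_galore_param_groups.

```python
def matches_target(param_name: str, targets: set) -> bool:
    """True if any dot-separated component of param_name is exactly a target."""
    return any(part in targets for part in param_name.split("."))


# A plain substring test would wrongly accept this name:
false_positive_name = "model.layers.0.not_q_proj.weight"
substring_hit = "q_proj" in false_positive_name
component_hit = matches_target(false_positive_name, {"q_proj"})
```

Here substring_hit is True while component_hit is False, which is the false positive the commit removes.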
* Default weight_quant to False in QGaloreConfig
Benchmarks show weight_quant=True adds ~1 GB on Llama-3.2-1B due to
INT8 copy/scale overhead exceeding savings from the placeholder trick.
Users can still opt in explicitly. The optimizer logic is unchanged.
* Optimize Q-GaLore projector and optimizer step performance
Projector (q_galore_projector.py):
- Use torch.svd_lowrank with oversampling p=10 (Halko et al. 2009) instead
of full SVD for large matrices. Falls back to full SVD when min(m,n) <= 2*rank.
SVD steps are 6-8x faster on Llama-3.2-1B (22s -> 3s for first step).
- Cache the dequantized ortho matrix between project() and project_back() to
avoid redundant dequantization when quant=True.
- Replace F.cosine_similarity with torch.dot for 1-D unit vectors in the
adaptive schedule. Remove unused torch.nn.functional import.
- Use collections.deque(maxlen=queue_size) instead of list with manual pop(0).
Optimizer (q_galore_adamw.py):
- Remove redundant .clone() on dequantized weights (line 151) and on float
data before re-quantization (line 211). _dequantize already returns a fresh
tensor and _quantize/_quantize_stochastic only reads its input.
- Consolidate per-group torch.cuda.synchronize() into a single call after
all param groups complete.
- Use torch.empty instead of torch.zeros for the scalar placeholder tensor
that is never read.
Verified: 24/24 unit tests pass. Llama-3.2-1B 61-step training produces
losses within 0.24% relative diff (correlation >0.9999) of the original.
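For the deque change in the projector, a short sketch of the behaviour it relies on. The values and queue_size here are made up for illustration.

```python
from collections import deque

queue_size = 3
recent = deque(maxlen=queue_size)

for value in [0.1, 0.2, 0.3, 0.4, 0.5]:
    # Once the deque is full, appending drops the oldest entry automatically,
    # replacing the manual "append then pop(0)" bookkeeping on a list.
    recent.append(value)
```

After the loop only the last queue_size values remain, and each append is O(1) where list.pop(0) is O(n).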
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix: remove auto wandb.finish() after train() to allow post-training evaluate()
The prepare_for_training_mode wrapper unconditionally called wandb.finish()
after trainer.train() completed. This terminated the active W&B run, causing
trainer.evaluate() to fail with "You must call wandb.init() before wandb.log()".
Users who need multiple training runs in one session can call wandb.finish()
manually between runs to avoid data overwriting.
Fixes #3954
* fix: defer wandb.finish() to next train() call instead of removing it
Instead of calling wandb.finish() at the end of train() (which breaks
evaluate/log) or removing it entirely (which causes data overwriting on
multiple train() calls), defer it to the start of the next train() call.
This way:
- train() + evaluate() works (run stays open after train)
- train() + train() gets separate W&B runs (previous run finished first)
- train() + evaluate() + train() also works correctly
Also resets HF's WandbCallback._initialized flag so it re-calls
wandb.init() for the new run.
Fixes #3954
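The deferral described above can be sketched without W&B itself. DeferredFinish and its finish_run argument are hypothetical; finish_run stands in for wandb.finish(), and the reset of WandbCallback._initialized is not modelled.

```python
class DeferredFinish:
    """Finish the previous run at the start of the next train() call."""

    def __init__(self, finish_run):
        self._finish_run = finish_run  # stand-in for wandb.finish
        self._run_open = False

    def train(self, train_fn):
        if self._run_open:
            # A previous train() left its run open so evaluate() could log;
            # close it now so this call gets a separate run.
            self._finish_run()
        self._run_open = True
        return train_fn()
```

Calling train(), then evaluating, then train() again closes the first run only when the second training call starts.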
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* feat(db): add SQLite storage layer for training history
* feat(api): add training history endpoints and response models
* feat(training): integrate DB persistence into training event loop
* feat(ui): add training history views and card grid
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix(studio): address review issues in training history persistence
- Strip hf_token/wandb_token from config before SQLite storage
- Add UUID suffix to job_id for collision resistance
- Use isfinite() for 0.0 metric handling throughout
- Respect _should_stop in error event finalization
- Run schema DDL once per process, not per connection
- Close connection on schema init failure
- Guard cleanup_orphaned_runs at startup
- Cap _metric_buffer at 500 entries
- Make FLUSH_THRESHOLD a class constant
- Map 'running' to 'training' phase in historical view
- Derive LR/GradNorm from history arrays in historical view
- Fix nested button with div[role=button] in history cards
- Guard String(value) against null/undefined in config popover
- Clear selectedHistoryRunId on auto tab switch
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix(studio): address round-2 review findings across training backend and frontend
Backend (training.py):
- Move state mutation after proc.start() so a failed spawn does not wedge
the backend with is_training=True
- Create DB run row eagerly after proc.start() so runs appear in history
during model loading, not after first metric event
- Rewrite _flush_metrics_to_db() with snapshot-before-insert pattern to
preserve metrics arriving during the write and retain buffer on failure
- Guard eval_loss with float() coercion and math.isfinite(), matching the
existing grad_norm guard
- Increase pump thread join timeout from 3s to 8s to cover SQLite's
default 5s lock timeout
Frontend (studio-page.tsx):
- Fix history navigation: check isTrainingRunning instead of
showTrainingView in onSelectRun so completed runs are not misrouted
- Replace activeTab state + auto-switch useEffect with derived tab to
eliminate react-hooks/set-state-in-effect lint violation
Frontend (historical-training-view.tsx):
- Add explicit "running" branch to message ternary so running runs no
longer fall through to "Training errored"
- Derive loading from detail/error state and move cleanup to effect
return to eliminate react-hooks/set-state-in-effect lint violation
Frontend (progress-section.tsx):
- Derive stopRequested from isTrainingRunning && stopRequestedLocal to
eliminate react-hooks/set-state-in-effect lint violation and remove
unused useEffect import
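The snapshot-before-insert flush mentioned for _flush_metrics_to_db() could be structured along these lines. MetricBuffer and write_rows are hypothetical names, and the SQLite write is replaced by an arbitrary callable.

```python
import threading


class MetricBuffer:
    def __init__(self, write_rows):
        self._write_rows = write_rows  # hypothetical callable that persists rows
        self._lock = threading.Lock()
        self._buffer = []

    def add(self, row):
        with self._lock:
            self._buffer.append(row)

    def flush(self) -> bool:
        # Take a snapshot under the lock, but do the slow write outside it.
        with self._lock:
            snapshot = list(self._buffer)
        if not snapshot:
            return True
        try:
            self._write_rows(snapshot)
        except Exception:
            # Write failed: leave the buffer untouched so rows are retried.
            return False
        with self._lock:
            # Drop only the rows that were written; rows appended during the
            # write sit after the snapshot and are preserved.
            del self._buffer[: len(snapshot)]
        return True
```

The sketch assumes rows are only ever appended at the end, so removing the first len(snapshot) entries removes exactly what was written.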
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix(studio): resolve 3 remaining bugs from round-2 review
1. Stuck on Current Run tab [12/20]: Only force "current-run" tab when
isTrainingRunning is true, not when stale completed-run data exists.
After training ends, users can freely navigate to Configure.
2. Incomplete metric sanitization [7/20]: Apply float() coercion and
isfinite() guards to loss and learning_rate, matching the existing
pattern used by grad_norm and eval_loss. Prevents TypeError from
string values and NaN leaks into history arrays.
3. Stop button state leak across runs [10/20]: Add key={runtime.jobId}
to ProgressSection so React remounts it when a new run starts,
resetting stopRequestedLocal state.
* fix(studio): deduplicate loss/lr sanitization in training event handler
Reuse _safe_loss/_safe_lr from the progress update block instead of
re-sanitizing the same raw event values for metric history.
* fix(studio): restore loss > 0 guard to prevent eval steps injecting 0.0 into metric histories
Round-2/3 fixes relaxed the history append guard from `loss > 0` to
`loss is not None`, which let eval-only log events (where loss defaults
to 0.0) append fake zeros into loss_history and lr_history. Restore the
`loss > 0` check to match the worker's own has_train_loss gate. The
float() coercion and isfinite() sanitization from round-3 remain intact.
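A minimal sketch of the combined sanitization and history guard. safe_float and should_append_train_loss are hypothetical names; the backend uses its own _safe_loss/_safe_lr values.

```python
import math
from typing import Optional


def safe_float(value) -> Optional[float]:
    """Coerce to float; return None for non-numeric or non-finite values."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return number if math.isfinite(number) else None


def should_append_train_loss(raw_loss) -> bool:
    """Only real training losses (finite and > 0) enter the history."""
    loss = safe_float(raw_loss)
    return loss is not None and loss > 0
```

An eval-only event whose loss defaults to 0.0 passes sanitization but is still rejected by the > 0 check, so it never reaches loss_history.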
* fix(studio): resolve training history bugs — nullable loss/lr, tab nav, sparkline
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
The wheel currently ships frontend/public/, frontend/src/, and
frontend/*.lock alongside frontend/dist/. These are build-time inputs
that Vite already copies into dist/ during the build step:
- public/ is copied verbatim into dist/ by vite build (28.6 MB duplicate)
- src/ is TSX source compiled into dist/assets/*.js (2.1 MB, not used at runtime)
- *.lock files are package manager lockfiles (0.9 MB, not used at runtime)
The backend only serves from frontend/dist/ (see main.py setup_frontend
and run.py frontend_path). Nothing references public/ or src/ at runtime.
This drops the wheel from ~62.7 MB to ~31 MB.
* feat(windows): add Studio desktop/Start shortcuts with health-check launcher
* chore(windows): bundle sloth.ico and set shortcut icons when valid
* chore(windows): add images/sloth.ico
* fix(windows): guard PSScriptRoot for Studio shortcut icon in iex installs
* fix(install): high-DPI sloth.ico and relocate to studio/frontend/publi
* chore(studio): update sloth.ico for clearer desktop and shell icons
* chore(studio): use unsloth.ico for Studio shortcut icon
* feat(windows): improve Studio shortcut launcher (fast health + browser UX)
* fix(windows): stable unsloth.ico URL and Unicode-safe Studio launcher scripts
* fix(windows): escape $ in exe path and write launcher UTF-8 with BOM
* fix(windows): skip shortcuts when Desktop or APPDATA paths are missing
* fix(install): log shortcut/icon/port failures and warn early on missing paths
* fix(install): guard missing LOCALAPPDATA before shortcut paths
* fix(install): harden New-StudioShortcuts and improve success messaging
* fix(install): include port 8908 in studio health check
* fix(install): fix launch-studio.ps1 quoting
* Fix launcher edge cases and normalize indentation in install.ps1
- Handle silent timeout: show a message when Studio is still starting
but did not become healthy within the timeout, instead of exiting
with no feedback
- Add -NoProfile to the visible PowerShell terminal launch so the
user profile cannot hang or error before Studio runs
- Add a named mutex (Local\UnslothStudioLauncher) to prevent
double-click from spawning duplicate terminals; second instance
polls for health and opens the browser when ready
- Normalize indentation inside New-StudioShortcuts outer try block
from mixed 8/12-space to consistent 12-space
* Simplify Get-CandidatePorts port dedup with Sort-Object -Unique
Replace the foreach/-notcontains loop with a single pipeline:
$ports = (@($basePort) + $listening) | Sort-Object -Unique
* Harden health probe and handle abandoned mutex in launcher
- Test-StudioHealth now checks resp.service == 'Unsloth UI Backend' to
avoid fingerprinting collisions with other local services on the same
port range.
- Wrap the mutex WaitOne(0) call in a try/catch for
AbandonedMutexException so the launcher recovers gracefully when a
previous instance was killed while holding the mutex.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix: prevent UnicodeEncodeError on Windows CP1252 consoles in studio setup
On Windows, `unsloth studio setup` crashes with a UnicodeEncodeError
when install_python_stack.py tries to print Unicode status glyphs
(✅, ❌, ⚠️) to a console that uses a legacy code page like CP1252.
Add a _safe_print() helper that catches UnicodeEncodeError and
gracefully degrades emoji to ASCII equivalents ([OK], [FAIL], [!]).
Replace all print() calls that emit Unicode glyphs with _safe_print().
Fixes #4509
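A helper of this kind might look as follows. This is a sketch, not the actual _safe_print() from install_python_stack.py; the mapping table and the final "replace" fallback are assumptions.

```python
import sys

# ASCII stand-ins for the glyphs named in the commit message.
_ASCII_FALLBACKS = {
    "\u2705": "[OK]",        # check mark
    "\u274c": "[FAIL]",      # cross mark
    "\u26a0\ufe0f": "[!]",   # warning sign with variation selector
    "\u26a0": "[!]",         # bare warning sign
}


def safe_print(text: str, file=None) -> None:
    """Print text, degrading known glyphs to ASCII if the stream cannot encode them."""
    stream = file if file is not None else sys.stdout
    try:
        print(text, file=stream)
    except UnicodeEncodeError:
        for glyph, ascii_text in _ASCII_FALLBACKS.items():
            text = text.replace(glyph, ascii_text)
        # Anything still unencodable becomes "?" rather than crashing.
        encoding = getattr(stream, "encoding", None) or "ascii"
        text = text.encode(encoding, errors="replace").decode(encoding)
        print(text, file=stream)
```

On a UTF-8 console the first print succeeds and the glyphs are shown unchanged; only legacy code pages take the fallback branch.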
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Replace Unicode dashes with ASCII in install_python_stack.py
Box-drawing (U+2500) and em dash (U+2014) chars in section dividers
and comments are themselves not representable on CP1252 -- replace
with plain ASCII dashes for consistency with the fix.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Add GRPO resume vLLM cleanup guard
* Guard GRPO resume sleep on vLLM sleep mode
* Harden GRPO resume vLLM cleanup guard
- Wrap llm.sleep(1) in try/except so a failed sleep does not block
training resume (best-effort cleanup)
- Also check kwargs["model_path"] which transformers.Trainer.train()
still accepts and normalizes to resume_from_checkpoint internally
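The two hardening points can be sketched as below. Both function names are hypothetical, and llm is any object exposing a sleep(level) method, as the commit describes for the vLLM engine.

```python
def is_resuming(train_kwargs: dict) -> bool:
    """Detect a resume request via either the current or the legacy keyword."""
    return bool(
        train_kwargs.get("resume_from_checkpoint")
        or train_kwargs.get("model_path")
    )


def best_effort_sleep(llm, level: int = 1) -> bool:
    """Try to put the engine to sleep; never let a failure block the resume."""
    try:
        llm.sleep(level)
        return True
    except Exception:
        return False
```

Because the cleanup is best-effort, a False return is something to log, not a reason to abort training.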
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* feat(chat): regroup settings sidebar into Model, Sampling, Tools, and Preferences sections
Split the monolithic Settings collapsible into focused sections with
icons. Model section shows context length and KV cache dtype for GGUF
models, trust remote code for non GGUF. Tools section groups auto heal,
max tool calls, and tool call timeout. Preferences section holds auto
title toggle.
* feat(chat): persist collapsible section open/closed state in localStorage
Remember which sections the user expanded or collapsed across sidebar
toggles, mobile sheet reopens, and browser sessions.
* fix(chat): harden collapsible state persistence and restore defaultOpen
- Validate localStorage values are booleans before using them, preventing
corrupted entries like string "false" from being treated as truthy
- Use Object.hasOwn() instead of `in` operator to avoid prototype chain
matches on keys like "constructor" or "toString"
- Restore defaultOpen={true} on Model and Preferences sections so they
are expanded on first visit, matching the old Settings section behavior
- Fix misleading Context Length description to reflect it is read-only
- Downgrade console.error to console.warn for non-critical localStorage
parse failures
* fix(chat): remove redundant disabled styles on Context Length input
The Input component already applies opacity-50 and cursor-not-allowed
via its disabled: variants. Specifying them unconditionally in the
className is redundant.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Ensures both install scripts always pull a version that has the
litellm removal fix. Without the pin, stale uv/pip caches could
resolve the older 2026.3.10 which still had litellm in
data-designer-deps.txt, causing setup to fail at step 8/11
while PyPI has litellm quarantined.
litellm has been quarantined on PyPI due to a supply chain attack
in version 1.82.8 (malicious credential-stealing .pth file).
No versions are currently installable, which blocks
`unsloth studio setup` at step 8/11 (data-designer deps).
Remove litellm from the single-env data-designer requirements
so setup completes. litellm can be re-added once PyPI lifts the
quarantine.
Ref: https://github.com/BerriAI/litellm/issues/24512
* Revert "fix: handle prompt/completion datasets in slow-path BOS detection (#4548)"
This reverts commit fca83182af.
* fix: support completion_only_loss=True with prompt/completion dataset columns
When completion_only_loss=True, TRL rejects formatting_func but Unsloth's
patched _prepare_dataset/_prepare_non_packed_dataloader assumed either
formatting_func or dataset_text_field was always set, causing a catch-22.
Now handles prompt/completion columns as a third case for BOS token
detection, with a safe None fallback for all other cases.
(cherry picked from commit 978f78c6f1)
* fix: handle prompt/completion datasets in slow-path BOS detection
The slow-path check_text blocks in rl_replacements.py and
tokenizer_utils.py crash when a prompt/completion dataset is used
because they unconditionally access dataset[0][dataset_text_field]
even when the dataset does not have a text field.
This fixes both files to:
- Default dataset_text_field to None instead of raising when undefined
- Detect prompt/completion columns and concatenate them for BOS check
- Guard with isinstance(str) on both prompt and completion to handle
conversational format (list of dicts) by setting test_text to None
- Add test_text is not None guard on has_bos_token_already to prevent
AttributeError on NoneType.startswith()
This is the slow-path complement to unslothai/unsloth-zoo#560 which
fixes the fast-path in sft_prepare_dataset.
Closes #4486
(cherry picked from commit b6ce5786d0)
* fix: preserve chat_template BOS check when test_text is None
The has_bos_token_already guard wrapped both test_text.startswith()
and bos_token in chat_template with test_text is not None, which
disabled the chat_template BOS detection for conversational datasets
where test_text is set to None.
Split the guard so test_text is not None only applies to the
startswith() call, while bos_token in chat_template is always checked.
(cherry picked from commit 40bd8b8917)
---------
Co-authored-by: Ayush Kushwaha <148432773+ayushkushwaha240@users.noreply.github.com>
* fix: system prompt was dropped in unsloth text and vision inference
* refactor: simplify system prompt message construction
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: use multimodal typed content parts for vision system message and add fallback
The system message content must use typed content parts
([{"type": "text", "text": ...}]) instead of a plain string to match
the multimodal processor contract (consistent with the audio path).
Plain strings cause some processors (e.g. LLaVA) to silently drop the
system prompt.
Also wraps processor.apply_chat_template in try/except so models that
reject the system role gracefully fall back to no system message with
a warning log.
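A minimal sketch of the typed-content-parts message and the fallback described above. The names `build_vision_messages` and `render_prompt` are hypothetical, the `{"type": "image"}` user part is an assumed placeholder, and `apply_chat_template` is passed in as a plain callable standing in for `processor.apply_chat_template` so the example runs without a model.

```python
import logging

logger = logging.getLogger(__name__)


def build_vision_messages(user_text, system_prompt=None):
    """Build messages whose content is a list of typed parts, not a plain string."""
    messages = []
    if system_prompt:
        messages.append(
            {"role": "system", "content": [{"type": "text", "text": system_prompt}]}
        )
    messages.append(
        {
            "role": "user",
            "content": [{"type": "image"}, {"type": "text", "text": user_text}],
        }
    )
    return messages


def render_prompt(apply_chat_template, user_text, system_prompt=None):
    """Try with the system message first; retry without it if the template rejects it."""
    messages = build_vision_messages(user_text, system_prompt)
    try:
        return apply_chat_template(messages)
    except Exception as exc:
        if not system_prompt:
            raise  # nothing to fall back to
        # Log the original exception so the failure is not silently hidden.
        logger.warning("Chat template rejected the system role (%s); retrying without it", exc)
        return apply_chat_template(build_vision_messages(user_text))
```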
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: capture and log original exception in vision system prompt fallback
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix: always show chat tool icons, gray out when model doesn't support them
Tool icons (Think, Search, Code) were hidden unless a model was loaded
and supported those features. Now they're always visible so users can
see and pre-select them. If a loaded model doesn't support a feature,
the button gets grayed out and disabled instead of being removed.
* refactor: centralize Qwen thinking params in store
* fix: disable tool buttons when no model is loaded
Change disabled condition from `modelLoaded && !supportsX` to
`!modelLoaded || !supportsX` so buttons are grayed out both when
no model is loaded and when the loaded model lacks the capability.
* Fix Qwen3 param clobbering and restore SuggestionItem capability guards
- Revert setReasoningEnabled() in the store to a pure boolean setter.
Moving the Qwen3 param logic into it caused reconnect/load/refresh
paths (which also call setReasoningEnabled) to silently overwrite
user-customized or server-provided temperature/topP/topK/minP.
- Restore applyQwenThinkingParams() as a standalone function called
only from explicit user toggle click handlers in thread.tsx and
shared-composer.tsx, matching the pre-PR behavior.
- Re-add supportsReasoning/supportsTools guards in the SuggestionItem
click handler so that clicking a suggestion card only activates
tool toggles the loaded model actually supports.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
PR #4543 removed useScrollLock from ReasoningRoot, causing the thread
viewport to jump when a user collapses a reasoning panel. Restore the
hook to freeze scrollTop during the 200ms collapse animation, matching
the pattern used by tool-fallback.tsx and tool-group.tsx.
* Fix port conflict detection when loopback address is held by another process
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use getaddrinfo for IPv6 host support, restore emojis in terminal output
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Guard against conn.pid being None in _get_pid_on_port
psutil.net_connections() can return entries with pid=None when the
current user lacks privileges to see the owning process (common on
macOS without root, Windows without admin, and some Linux configs).
psutil.Process(None) does not raise -- it silently returns the
current process, which would make the warning incorrectly blame
Unsloth Studio itself for blocking the port.
Skip entries with pid=None so the caller falls back to the generic
"port is already in use" message instead.
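The guard can be illustrated with a small sketch. It is not the code in studio/backend/run.py: `get_pid_on_port` here takes the connection list as an argument (in practice it would come from `psutil.net_connections()`), and the `Addr` / `Conn` tuples are hypothetical stand-ins for psutil's entries so the example runs without psutil installed.

```python
from collections import namedtuple

# Hypothetical stand-ins exposing the two attributes the lookup needs.
Addr = namedtuple("Addr", ["ip", "port"])
Conn = namedtuple("Conn", ["laddr", "pid"])


def get_pid_on_port(connections, port):
    """Return the PID bound to `port`, or None if it cannot be determined."""
    for conn in connections:
        # laddr can be empty for sockets that are not bound to a local address.
        if not conn.laddr or conn.laddr.port != port:
            continue
        # pid is None when the current user lacks privileges to see the owner.
        # Skipping it avoids resolving None to the current process later on.
        if conn.pid is None:
            continue
        return conn.pid
    return None
```

When this returns None, the caller can print the generic "port is already in use" message rather than naming a process.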
* Update studio/backend/run.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* fix(chat): stabilize thinking panel and thread scroll during generation
* fix: match ChatGPT scroll and thinking panel behavior
- Remove autoScroll={false} from thread viewport to restore default
follow-scroll during streaming (pauses when user scrolls up, resumes
at bottom)
- Rewrite reasoning panel state: auto-opens on stream start, user can
close during streaming, auto-collapses when reasoning ends, user can
re-expand after collapse
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix(studio): harden system prompt persistence and storage fallback
* Exclude checkpoint from localStorage persistence for PR #4538
checkpoint is backend-owned state -- refresh() already syncs it from
getInferenceStatus() on every page load. Persisting it to localStorage
causes a stale model ID to survive across backend restarts, which
prevents auto-load from triggering when no model is actually loaded.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Fixes #4492
The embedding_learning_rate parameter was assigned to a local variable
instead of self.embedding_learning_rate, causing UnslothTrainer.create_optimizer()
to always get None via getattr and silently fall back to a single param group.
Bug: embedding_learning_rate = embedding_learning_rate (no-op)
Fix: self.embedding_learning_rate = embedding_learning_rate
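A stripped-down reproduction of the bug, for illustration only. `BuggyArgs`, `FixedArgs`, and `param_group_count` are hypothetical names; the real classes are the training arguments and `UnslothTrainer.create_optimizer()`, and the "two groups" result is an assumed stand-in for a separate embedding param group.

```python
class BuggyArgs:
    def __init__(self, embedding_learning_rate=None):
        # Rebinds the local name to itself; nothing is stored on the instance.
        embedding_learning_rate = embedding_learning_rate


class FixedArgs:
    def __init__(self, embedding_learning_rate=None):
        self.embedding_learning_rate = embedding_learning_rate


def param_group_count(args):
    """Mimic the getattr lookup: no stored value means a single param group."""
    lr = getattr(args, "embedding_learning_rate", None)
    return 1 if lr is None else 2
```

With the buggy class the attribute never exists, so `getattr` returns None even when a value was passed in, and the single-group fallback is taken silently.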
<img alt="unsloth studio ui homepage" src="https://raw.githubusercontent.com/unslothai/unsloth/main/studio/frontend/public/studio%20github%20landscape%20colab%20display.png" style="max-width: 100%; margin-bottom: 0;"></a>
<br>
<a href="https://unsloth.ai/docs/new/studio">
<img alt="unsloth studio ui homepage" src="https://github.com/user-attachments/assets/53ae17a9-d975-44ef-9686-efb4ebd0454d" style="max-width: 100%; margin-bottom: 0;"></a>
Unsloth Studio (Beta) lets you run and train text, [audio](https://unsloth.ai/docs/basics/text-to-speech-tts-fine-tuning), [embedding](https://unsloth.ai/docs/new/embedding-finetuning), and [vision](https://unsloth.ai/docs/basics/vision-fine-tuning) models on Windows, Linux and macOS.
## ⚡ Get started
#### macOS, Linux, WSL:
```bash
curl -fsSL https://unsloth.ai/install.sh | sh
```
#### Windows:
```powershell
irm https://unsloth.ai/install.ps1 | iex
```
#### Community:
- [Discord](https://discord.gg/unsloth)
- [𝕏 (Twitter)](https://x.com/UnslothAI)
- [Reddit](https://reddit.com/r/unsloth)
## ⭐ Features
Unsloth provides several key features for both inference and training:
### Inference
* **Search + download + run models** including GGUF, LoRA adapters, safetensors
* **Export models**: [Save or export](https://unsloth.ai/docs/new/studio/export) models to GGUF, 16-bit safetensors and other formats.
@ -32,15 +47,15 @@ Unsloth provides several key features for both inference and training:
* We work directly with teams behind [gpt-oss](https://docs.unsloth.ai/new/gpt-oss-how-to-run-and-fine-tune#unsloth-fixes-for-gpt-oss), [Qwen3](https://www.reddit.com/r/LocalLLaMA/comments/1kaodxu/qwen3_unsloth_dynamic_ggufs_128k_context_bug_fixes/), [Llama 4](https://github.com/ggml-org/llama.cpp/pull/12889), [Mistral](models/tutorials/devstral-how-to-run-and-fine-tune.md), [Gemma 1-3](https://news.ycombinator.com/item?id=39671146), and [Phi-4](https://unsloth.ai/blog/phi4), where we’ve fixed bugs that improve model accuracy.
* Upload images, audio, PDFs, code, DOCX and more file types to chat with.
### Training
* Train **500+ models** up to **2x faster** with up to **70% less VRAM**, with no accuracy loss.
* Train and RL **500+ models** up to **2x faster** with up to **70% less VRAM**, with no accuracy loss.
* Custom Triton and mathematical **kernels**. See some collabs we did with [PyTorch](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/fp8-reinforcement-learning) and [Hugging Face](https://unsloth.ai/docs/new/faster-moe).
* **Data Recipes**: [Auto-create datasets](https://unsloth.ai/docs/new/studio/data-recipe) from **PDF, CSV, DOCX** etc. Edit data in a visual-node workflow.
* Supports full fine-tuning, pretraining, 4-bit, 16-bit, and FP8 training.
* **[Reinforcement Learning](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide)** (RL): The most efficient [RL](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide) library, using **80% less VRAM** for GRPO, [FP8](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/fp8-reinforcement-learning) etc.
* **Observability**: Monitor training live, track loss and GPU usage and customize graphs.
* **Reinforcement Learning**: The most efficient [RL](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide) library, using **80% less VRAM** for GRPO, [FP8](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/fp8-reinforcement-learning) etc.
* [Multi-GPU](https://unsloth.ai/docs/basics/multi-gpu-training-with-unsloth) training is supported, with major improvements coming soon.
## ⚡ Quickstart
## 📥 Install
Unsloth can be used in two ways: through **[Unsloth Studio](https://unsloth.ai/docs/new/studio/)**, the web UI, or through **Unsloth Core**, the code-based version. Each has different requirements.
### Unsloth Studio (web UI)
@ -49,7 +64,7 @@ Unsloth Studio (Beta) works on **Windows, Linux, WSL** and **macOS**.
* **CPU:** Supported for Chat and Data Recipes currently
* **NVIDIA:** Training works on RTX 30/40/50, Blackwell, DGX Spark, Station and more
* **macOS:** Currently supports chat and Data Recipes. **MLX training** is coming very soon
* **AMD:** Chat works. Train with [Unsloth Core](#unsloth-core-code-based). Studio support is coming soon.
* **AMD:** Chat and Data Recipes work. Train with [Unsloth Core](#unsloth-core-code-based). Studio support is coming soon.
* **Coming soon:** Training support for Apple MLX, AMD, and Intel.
* **Multi-GPU:** Available now, with a major upgrade on the way
@ -57,19 +72,20 @@ Unsloth Studio (Beta) works on **Windows, Linux, WSL** and **macOS**.
```bash
curl -fsSL https://unsloth.ai/install.sh | sh
```
If you don't have `curl`, use `wget`. Launch after setup via:
```bash
source unsloth_studio/bin/activate
unsloth studio -H 0.0.0.0 -p 8888
```
#### Windows:
```powershell
irm https://unsloth.ai/install.ps1 | iex
```
Launch after setup via:
```powershell
& .\unsloth_studio\Scripts\unsloth.exe studio -H 0.0.0.0 -p 8888
#### Launch
```bash
unsloth studio -H 0.0.0.0 -p 8888
```
#### Update
To update, rerun the install command above, or run the following (not supported on Windows):
```bash
unsloth studio update
```
#### Docker
@ -82,64 +98,8 @@ docker run -d -e JUPYTER_PASSWORD="mypassword" \
For developer, nightly, and uninstallation instructions, see [advanced installation](#-advanced-installation).
### Unsloth Core (code-based)
#### Linux, WSL:
@ -164,17 +124,19 @@ You can use the same Docker image as Unsloth Studio.
For RTX 50x, B200, 6000 GPUs: `uv pip install unsloth --torch-backend=auto`. Read our guides for: [Blackwell](https://unsloth.ai/docs/blog/fine-tuning-llms-with-blackwell-rtx-50-series-and-unsloth) and [DGX Spark](https://unsloth.ai/docs/blog/fine-tuning-llms-with-nvidia-dgx-spark-and-unsloth). <br>
To install Unsloth on **AMD** and **Intel** GPUs, follow our [AMD Guide](https://unsloth.ai/docs/get-started/install/amd) and [Intel Guide](https://unsloth.ai/docs/get-started/install/intel).
## ✨ Free Notebooks
## 📒 Free Notebooks
Train for free with our notebooks. Read our [guide](https://unsloth.ai/docs/get-started/fine-tuning-llms-guide). Add dataset, run, then deploy your trained model.
Train for free with our notebooks. You can use our new [free Unsloth Studio notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb) to run and train models for free in a web UI.
Read our [guide](https://unsloth.ai/docs/get-started/fine-tuning-llms-guide). Add dataset, run, then deploy your trained model.
| Model | Free Notebooks | Performance | Memory use |
|-----------|---------|--------|----------|
| **Gemma 4 (E2B)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_(E2B)-Vision.ipynb) | 1.5x faster | 50% less |
| **Qwen3.5 (4B)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_5_(4B)_Vision.ipynb) | 1.5x faster | 60% less |
| **gpt-oss (20B)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-Fine-tuning.ipynb) | 2x faster | 70% less |
| **Qwen3.5 GSPO** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_5_(4B)_Vision_GRPO.ipynb) | 2x faster | 70% less |
| **gpt-oss (20B): GRPO** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) | 2x faster | 80% less |
| **Qwen3: Advanced GRPO** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb) | 2x faster | 50% less |
| **Gemma 3 (4B) Vision** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma3_(4B)-Vision.ipynb) | 1.7x faster | 60% less |
| **Qwen3: Advanced GRPO** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)-GRPO.ipynb) | 2x faster | 70% less |
| **embeddinggemma (300M)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/EmbeddingGemma_(300M).ipynb) | 2x faster | 20% less |
| **Mistral Ministral 3 (3B)** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Ministral_3_VL_(3B)_Vision.ipynb) | 1.5x faster | 60% less |
| **Llama 3.1 (8B) Alpaca** | [▶️ Start for free](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb) | 2x faster | 70% less |
@ -186,6 +148,8 @@ Train for free with our notebooks. Read our [guide](https://unsloth.ai/docs/get-
- See detailed documentation for Unsloth [here](https://unsloth.ai/docs)
## 🦥 Unsloth News
- **Qwen3.6**: Qwen3.6-35B-A3B can now be trained and run in Unsloth Studio. [Blog](https://unsloth.ai/docs/models/qwen3.6)
- **Gemma 4**: Run and train Google’s new models directly in Unsloth. [Blog](https://unsloth.ai/docs/models/gemma-4)
- **Introducing Unsloth Studio**: our new web UI for running and training LLMs. [Blog](https://unsloth.ai/docs/new/studio)
- **Qwen3.5** - 0.8B, 2B, 4B, 9B, 27B, 35-A3B, 112B-A10B are now supported. [Guide + notebooks](https://unsloth.ai/docs/models/qwen3.5/fine-tune)
- Train **MoE LLMs 12x faster** with 35% less VRAM - DeepSeek, GLM, Qwen and gpt-oss. [Blog](https://unsloth.ai/docs/new/faster-moe)
@ -196,13 +160,83 @@ Train for free with our notebooks. Read our [guide](https://unsloth.ai/docs/get-
- **FP8 & Vision RL**: You can now do FP8 & VLM GRPO on consumer GPUs. [FP8 Blog](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/vision-reinforcement-learning-vlm-rl)
- **gpt-oss** by OpenAI: Read our [RL blog](https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune/gpt-oss-reinforcement-learning), [Flex Attention](https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune/long-context-gpt-oss-training) blog and [Guide](https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune).
## 🔗 Links and Resources
## 📥 Advanced Installation
The advanced instructions below are for Unsloth Studio. For Unsloth Core advanced installation, [view our docs](https://unsloth.ai/docs/get-started/install/pip-install#advanced-pip-installation).
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
.\install.ps1 --local
unsloth studio -H 0.0.0.0 -p 8888
```
Then to launch every time:
```bash
unsloth studio -H 0.0.0.0 -p 8888
```
#### Uninstall
You can uninstall Unsloth Studio by deleting its install folder, usually located at `$HOME/.unsloth/studio` on macOS/Linux/WSL and `%USERPROFILE%\.unsloth\studio` on Windows. Using the `rm -rf` commands will **delete everything**, including your history and cache:
For more info, [see our docs](https://unsloth.ai/docs/new/studio/install#uninstall).
#### Deleting model files
You can delete old model files either from the bin icon in model search or by removing the relevant cached model folder from the default Hugging Face cache directory. By default, HF uses:
| <img width="13" src="https://upload.wikimedia.org/wikipedia/commons/0/09/X_(formerly_Twitter)_logo_late_2025.svg" /> **Twitter (aka X)** | [Follow us on X](https://twitter.com/unslothai) |
if uv pip install --python "$_VENV_PY" -q "transformers>=5.2.0"; then
    substep "installed from PyPI"
else
    substep "PyPI install failed, trying GitHub..."
    if uv pip install --python "$_VENV_PY" -q "git+https://github.com/huggingface/transformers.git"; then
        substep "installed from huggingface/transformers main"
    else
        fail "Could not install transformers>=5.2.0 (required for Qwen3.5/3.6 model support). Please check your Python version (>=3.10 required) and network connection, then try again."
    fi
fi
step "install" "installing torch + torchvision (needed for Qwen3 VL processor)..."
"<a href=\"https://unsloth.ai/docs/\"><img src=\"https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true\" width=\"125\"></a> Join Discord if you need help + ⭐ <i>Star us on <a href=\"https://github.com/unslothai/unsloth\">Github</a> </i> ⭐\n",
"</div>\n",
"\n",
"To install Unsloth Studio on your local device, follow [our guide](https://unsloth.ai/docs/new/unsloth-studio/install). Unsloth Studio is licensed [AGPL-3.0](https://github.com/unslothai/unsloth/blob/main/studio/LICENSE.AGPL-3.0).\n",
"\n",
"### Unsloth Studio\n",
"\n",
"Train and run open models with [**Unsloth Studio**](https://unsloth.ai/docs/new/unsloth-studio/start). Currently, installation may take 30+ mins so use a newer GPU.\n",
"\n",
"\n",
"We are actively working on making Unsloth Studio install on Colab T4 GPUs faster.\n",
"for _ in range(10000): time.sleep(300), print(\"=\", end = \"\")"
],
"metadata": {
"id": "wb9UELh--XzX"
},
"id": "wb9UELh--XzX",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "f2b0c6a1",
"metadata": {
"id": "f2b0c6a1"
},
"source": [
"And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!\n",
"\n",
"Some other resources:\n",
"1. Looking to use Unsloth locally? Read our [Installation Guide](https://unsloth.ai/docs/get-started/install) for details on installing Unsloth on Windows, Docker, AMD, Intel GPUs.\n",
"2. Learn how to do Reinforcement Learning with our [RL Guide and notebooks](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide).\n",
"3. Read our guides and notebooks for [Text-to-speech (TTS)](https://unsloth.ai/docs/basics/text-to-speech-tts-fine-tuning) and [vision](https://unsloth.ai/docs/basics/vision-fine-tuning) model support.\n",
"4. Explore our [LLM Tutorials Directory](https://unsloth.ai/docs/models/tutorials-how-to-fine-tune-and-run-llms) to find dedicated guides for each model.\n",
"5. Need help with Inference? Read our [Inference & Deployment page](https://unsloth.ai/docs/basics/inference-and-deployment) for details on using vLLM, llama.cpp, Ollama etc.\n",
"<a href=\"https://unsloth.ai/docs/\"><img src=\"https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true\" width=\"125\"></a> Join Discord if you need help + ⭐ <i>Star us on <a href=\"https://github.com/unslothai/unsloth\">Github</a> </i> ⭐\n",
"</div>\n",
"\n",
"To install Unsloth Studio on your local device, follow [our guide](https://unsloth.ai/docs/new/unsloth-studio/install). Unsloth Studio is licensed [AGPL-3.0](https://github.com/unslothai/unsloth/blob/main/studio/LICENSE.AGPL-3.0).\n",
"\n",
"### Unsloth Studio\n",
"\n",
"Train and run open models with [**Unsloth Studio**](https://unsloth.ai/docs/new/unsloth-studio/start). NEW! Installation should now only take 2 mins!\n",
"\n",
"\n",
"We are actively working on making Unsloth Studio install on Colab T4 GPUs faster.\n",
"for _ in range(10000): time.sleep(300), print(\"=\", end = \"\")"
],
"metadata": {
"id": "wb9UELh--XzX"
},
"id": "wb9UELh--XzX",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "f2b0c6a1",
"metadata": {
"id": "f2b0c6a1"
},
"source": [
"And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!\n",
"\n",
"Some other resources:\n",
"1. Looking to use Unsloth locally? Read our [Installation Guide](https://unsloth.ai/docs/get-started/install) for details on installing Unsloth on Windows, Docker, AMD, Intel GPUs.\n",
"2. Learn how to do Reinforcement Learning with our [RL Guide and notebooks](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide).\n",
"3. Read our guides and notebooks for [Text-to-speech (TTS)](https://unsloth.ai/docs/basics/text-to-speech-tts-fine-tuning) and [vision](https://unsloth.ai/docs/basics/vision-fine-tuning) model support.\n",
"4. Explore our [LLM Tutorials Directory](https://unsloth.ai/docs/models/tutorials-how-to-fine-tune-and-run-llms) to find dedicated guides for each model.\n",
"5. Need help with Inference? Read our [Inference & Deployment page](https://unsloth.ai/docs/basics/inference-and-deployment) for details on using vLLM, llama.cpp, Ollama etc.\n",
description="Maximum sequence length (0 = model default for GGUF)",
)
load_in_4bit: bool = Field(True, description="Load model in 4-bit quantization")
is_lora: bool = Field(False, description="Whether this is a LoRA adapter")
@ -41,6 +44,14 @@ class LoadRequest(BaseModel):
None,
description="KV cache data type for both K and V (e.g. 'f16', 'bf16', 'q8_0', 'q4_1', 'q5_1')",
)
gpu_ids: Optional[List[int]] = Field(
None,
description="Physical GPU indices to use, for example [0, 1]. Omit or pass [] to use automatic selection. Explicit gpu_ids are unsupported when the parent CUDA_VISIBLE_DEVICES uses UUID/MIG entries. Not supported for GGUF models.",
)
speculative_type: Optional[str] = Field(
None,
description="Speculative decoding mode for GGUF models (e.g. 'ngram-simple', 'ngram-mod'). Ignored for non-GGUF and vision models.",
)
class UnloadRequest(BaseModel):
@ -83,6 +94,10 @@ class ValidateModelResponse(BaseModel):
is_gguf: bool = Field(False, description="Whether this is a GGUF model (llama.cpp)")
is_lora: bool = Field(False, description="Whether this is a LoRA adapter")
is_vision: bool = Field(False, description="Whether this is a vision-capable model")
requires_trust_remote_code: bool = Field(
False,
description="Whether the model defaults require trust_remote_code to be enabled for loading.",
)
class GenerateRequest(BaseModel):
@ -126,13 +141,28 @@ class LoadResponse(BaseModel):
description="Physical GPU indices to use, for example [0, 1]. Omit or pass [] to use automatic selection. Explicit gpu_ids are unsupported when the parent CUDA_VISIBLE_DEVICES uses UUID/MIG entries.",
)
class TrainingJobResponse(BaseModel):
"""Immediate response when training is initiated"""
@ -177,8 +183,8 @@ class TrainingProgress(BaseModel):