* Studio: forward standard OpenAI tools / tool_choice on /v1/responses
Mirrors the /v1/chat/completions client-side tool pass-through from #5099
so clients (OpenAI Codex CLI, OpenAI Python SDK, ...) that target the
Responses API receive structured function_call output items instead of
plain text with tool-call tokens leaking into content.
- ResponsesRequest: type tools/tool_choice properly, add parallel_tool_calls;
accept function_call and function_call_output input items for multi-turn
- Translate flat Responses tool / tool_choice shape to the nested Chat
Completions shape before forwarding to llama-server
- _normalise_responses_input: map function_call_output -> role="tool",
function_call -> assistant tool_calls (preserving call_id)
- Non-streaming: map returned tool_calls -> top-level function_call
output items keyed by call_id
- Streaming: emit response.output_item.added (function_call),
response.function_call_arguments.delta/.done, and response.output_item.done
per tool call while keeping the text message at output_index 0
- Pytest coverage: tools/tool_choice translation, multi-turn input mapping,
non-streaming tool_calls mapping, response round-trip
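To make the tool / tool_choice translation above concrete, here is a
minimal sketch of the flat Responses shape being nested into the Chat
Completions shape. The helper names are hypothetical and are not taken
from the Studio code; only function-type tools are handled.

```python
from typing import Any


def responses_tool_to_chat(tool: dict[str, Any]) -> dict[str, Any]:
    """Nest a flat Responses function tool under a "function" key."""
    if tool.get("type") != "function":
        # Assumption: non-function tools are passed through untouched.
        return tool
    function = {key: value for key, value in tool.items() if key != "type"}
    return {"type": "function", "function": function}


def responses_tool_choice_to_chat(tool_choice: Any) -> Any:
    """String choices ("auto", "none", "required") need no translation."""
    if (
        isinstance(tool_choice, dict)
        and tool_choice.get("type") == "function"
        and "name" in tool_choice
    ):
        return {"type": "function", "function": {"name": tool_choice["name"]}}
    return tool_choice


flat_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Look up the weather for a city",
    "parameters": {"type": "object", "properties": {}},
}
nested_tool = responses_tool_to_chat(flat_tool)
```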
* Studio: merge system messages and close inner stream on /v1/responses
Fixes two issues surfacing when OpenAI Codex CLI drives /v1/responses
against a GGUF with a strict chat template (gpt-oss harmony, Qwen3, ...).
1. "System message must be at the beginning" upstream errors
Codex sends `instructions` AND a `role:"developer"` message in `input`,
producing two separate system-role messages. Strict templates raise
when a second system message exists or when one appears after a user
turn. _normalise_responses_input now hoists all instructions / system /
developer content into a single merged system message at the top of
the Chat Completions message list.
2. "async generator ignored GeneratorExit" / "Attempted to exit cancel
scope in a different task"
_responses_stream consumed the inner chat-completions body_iterator
without an explicit aclose() in a finally block. On client disconnect
(Codex frequently cancels mid-stream), Python 3.13 finalized the inner
async generator on a different task, tripping anyio's cancel-scope
check. Mirrored the same try/finally + aclose pattern used by the
/v1/messages, /v1/chat/completions, and /v1/completions passthroughs.
Tests: hoisting of instructions + developer, developer mid-conversation,
multiple system messages in input, no-system passthrough.
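The hoisting described in item 1 can be sketched roughly as follows.
This is an illustration, not the actual _normalise_responses_input
code: the function name is hypothetical, message content is assumed to
be a plain string, and the blank-line separator is an assumption.

```python
from typing import Optional

SYSTEM_ROLES = {"system", "developer"}


def hoist_system_messages(
    instructions: Optional[str], messages: list[dict]
) -> list[dict]:
    """Merge instructions / system / developer text into one leading message."""
    system_parts = [instructions] if instructions else []
    rest = []
    for message in messages:
        if message.get("role") in SYSTEM_ROLES:
            if message.get("content"):
                system_parts.append(message["content"])
        else:
            rest.append(message)
    if not system_parts:
        return rest  # no-system passthrough
    # Assumption: parts are joined with a blank line between them.
    merged = {"role": "system", "content": "\n\n".join(system_parts)}
    return [merged, *rest]


hoisted = hoist_system_messages(
    "Be brief.",
    [
        {"role": "user", "content": "hi"},
        {"role": "developer", "content": "Use tools."},  # mid-conversation
        {"role": "user", "content": "go"},
    ],
)
```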
* Studio: accept Codex multi-turn shapes and fix cross-task stream close on /v1/responses
Two issues observed driving /v1/responses from OpenAI Codex CLI against a
GGUF backend.
1. 422 on every turn after the first
Codex replays prior assistant turns with
`content:[{"type":"output_text","text":...,"annotations":[],"logprobs":[]}]`
and carries forward `reasoning` items (o-series / gpt-5) between turns.
Our `ResponsesContentPart` union only accepted input_text / input_image,
and `ResponsesInputItem` only message / function_call / function_call_output,
so Pydantic failed the whole list and FastAPI returned
`"Input should be a valid string"` against the `str` branch of the
outer union.
- Add `ResponsesOutputTextPart` for assistant-replay content.
- Add `ResponsesUnknownContentPart` and `ResponsesUnknownInputItem`
as permissive catch-alls (drop during normalisation).
- Wire an explicit `Discriminator` so dispatch is deterministic and
the fallthrough reaches the catch-all instead of misreporting via
the outer `Union[str, list[...]]`.
- `_normalise_responses_input` now accepts output_text parts, flattens
single-part assistant text to a plain string (keeps legacy chat
templates happy), and silently drops reasoning / unknown items.
2. "async generator ignored GeneratorExit" / cross-task cancel scope
`_responses_stream` awaited `openai_chat_completions` in the parent
route-handler task, which opens the httpx client for the inner
passthrough on *that* task. The outer `StreamingResponse` then iterates
in a child task, so the asyncgen GC finalises the inner httpcore byte
stream on the child task, tripping anyio's "Attempted to exit cancel
scope in a different task". Move the `await` inside `event_generator`
so the httpx lifecycle stays within the single streaming child task,
and surface any HTTPException as a `response.failed` SSE frame.
Tests: assistant output_text replay, reasoning-item tolerance, unknown
content-part tolerance, end-to-end Codex-shape payload (developer + user +
reasoning + function_call + function_call_output + assistant output_text +
user), and single-part assistant flattening to plain string.
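A minimal sketch of the normalisation rules from item 1, working on
plain dicts rather than the Pydantic models. The function names are
hypothetical, image parts are ignored, and system hoisting is left out
to keep the example short.

```python
from typing import Any

TEXT_PART_TYPES = {"input_text", "output_text"}


def _normalise_content(item: dict[str, Any]) -> Any:
    content = item.get("content")
    if isinstance(content, str):
        return content
    parts = [
        {"type": "text", "text": part.get("text", "")}
        for part in content or []
        if part.get("type") in TEXT_PART_TYPES  # unknown parts are dropped
    ]
    if item.get("role") == "assistant" and len(parts) == 1:
        return parts[0]["text"]  # flatten single-part assistant text
    return parts


def normalise_items(items: list[dict[str, Any]]) -> list[dict[str, Any]]:
    messages: list[dict[str, Any]] = []
    for item in items:
        item_type = item.get("type", "message")
        if item_type == "function_call":
            call = {
                "id": item["call_id"],  # call_id is preserved as the tool call id
                "type": "function",
                "function": {"name": item["name"], "arguments": item["arguments"]},
            }
            messages.append({"role": "assistant", "content": None, "tool_calls": [call]})
        elif item_type == "function_call_output":
            messages.append(
                {"role": "tool", "tool_call_id": item["call_id"], "content": item["output"]}
            )
        elif item_type == "message":
            messages.append({"role": item["role"], "content": _normalise_content(item)})
        # Anything else (reasoning items, unknown types) is silently dropped.
    return messages
```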
* Studio: call llama-server directly from streaming /v1/responses
The previous fix (running the inner await inside event_generator) was not
enough. Wrapping the existing `openai_chat_completions` pass-through still
stacks two async generators: when the outer generator is closed, the
innermost `HTTP11ConnectionByteStream.__aiter__` in httpcore doesn't
receive GeneratorExit before Python's asyncgen GC finalises it in a
sibling task, tripping "Attempted to exit cancel scope in a different
task" and "async generator ignored GeneratorExit" — the same Python 3.13
+ httpcore 1.0.x interaction already seen in PRs #4956, #4981, #5099.
The cure both pass-throughs already use is a single same-task httpx
lifecycle with explicit `aiter_lines().aclose()` BEFORE `resp.aclose()` /
`client.aclose()` in the generator's finally block.
Apply it at the Responses layer by dropping the wrapper entirely for GGUF:
open httpx, consume `resp.aiter_lines()`, parse `chat.completion.chunk`,
emit Responses SSE events, close everything in finally — all in the
single StreamingResponse child task. Non-GGUF streaming is rejected with
a 400 (wrapping the transformers backend would re-introduce the
double-layer pattern and isn't a Codex-compatible path today anyway).
Also surfaces upstream httpx.RequestError / non-200 as a
`response.failed` SSE frame rather than a dropped stream now that the
request is dispatched after SSE headers have gone out.
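The parse-and-emit step can be sketched roughly as below, using a
synchronous iterable of lines in place of `resp.aiter_lines()` so it
runs without a server. The helper names are hypothetical, only text
deltas are translated, and the event payload is reduced to a few
fields; the real stream carries more events and more fields.

```python
import json
from typing import Iterable, Iterator


def sse_frame(event: str, payload: dict) -> str:
    """Format one server-sent event frame."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def translate_chunks(lines: Iterable[str]) -> Iterator[str]:
    """Turn chat.completion.chunk SSE lines into Responses text-delta frames."""
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield sse_frame(
                    "response.output_text.delta",
                    {
                        "type": "response.output_text.delta",
                        "output_index": 0,  # text message stays at index 0
                        "delta": text,
                    },
                )
```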
* Studio: silence benign httpcore asyncgen GC warnings on Python 3.13
The streaming pass-throughs (/v1/chat/completions, /v1/messages,
/v1/responses, /v1/completions) all use the proven #4981 / #5099 pattern
— single-task httpx lifecycle with explicit aiter_lines().aclose() ahead
of resp.aclose() / client.aclose() in the generator's finally block.
That handles our own iterators correctly.
The residual noise ("async generator ignored GeneratorExit" /
"Attempted to exit cancel scope in a different task") comes from an
innermost HTTP11ConnectionByteStream.__aiter__ that httpcore creates
internally inside its pool. We hold no reference to it, so we cannot
aclose it ourselves. Python 3.13's asyncgen GC hook finalises it on the
finaliser task, its aclose path enters an anyio CancelScope shield, and
Python flags the cross-task exit. The response has already been
delivered with a 200 by then — it is purely log noise, not a functional
failure. Same interaction seen in modelcontextprotocol/python-sdk #831,
agno #3556, chainlit #2361, langchain-mcp-adapters #254.
Install a targeted sys.unraisablehook that swallows this specific tuple
— RuntimeError mentioning "cancel scope" or "GeneratorExit" plus an
object repr referencing HTTP11ConnectionByteStream — and defers to the
default hook for every other unraisable. Idempotent; guarded by a
sentinel attribute so repeated imports don't stack filters.
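A minimal sketch of such a filter, assuming the match criteria listed
above. The function name and the sentinel attribute name are
hypothetical, not the ones used in Studio.

```python
import sys

_SENTINEL = "_httpcore_asyncgen_filter"  # hypothetical attribute name


def install_httpcore_unraisable_filter() -> None:
    previous = sys.unraisablehook
    if getattr(previous, _SENTINEL, False):
        return  # already installed; do not stack filters

    def _filtered_hook(unraisable):
        exc = unraisable.exc_value
        message = str(exc)
        is_known_noise = (
            isinstance(exc, RuntimeError)
            and ("cancel scope" in message or "GeneratorExit" in message)
            and "HTTP11ConnectionByteStream" in repr(unraisable.object)
        )
        if not is_known_noise:
            previous(unraisable)  # defer everything else to the prior hook

    setattr(_filtered_hook, _SENTINEL, True)
    sys.unraisablehook = _filtered_hook
```

Matching on both the exception text and the object repr keeps the
filter narrow, so an unrelated RuntimeError raised during finalisation
is still reported.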
* Chatbox, scroll, and menu fixes
- Fixed chatbox auto-expand height for multi-line text on the compare page
- Fixed chatbox UI to be consistent across compare and new chat
- Fixed scrolling being enabled on pages with no content, which also triggered the scroll-to-bottom button
- Fixed scroll-to-bottom button to only appear after scrolling up a reasonable amount instead of instantly
- Added shutdown studio button to the menu for easier access
- Fixed pop-up menu width to match the user button width
(cherry picked from commit cd4e390dfa84fe311fae79a781b96cc0ef5970a9)
* fix: correct compare scroll viewport and clean up chat composer UI polish
* Dark theme refactor and sidebar/chat UI refinements
- Complete refactoring of dark theme
- Replaced square rounded-corner user profile image with a circular bordered one
- Replaced user profile icon with 'U' initial and renamed label from 'Studio' to 'User'
- Chat bubbles now have a pointy top-right edge
- Sidebar menu tab line color selection is now consistent across all menus
- Tab-selection color animation now also applies to recent chats
- Removed 'Compare' menu autoselect when a compare chat conversation is selected
- Fixed UI consistency in Compare to match New Chat
- Removed sidebar animation and tab line, replaced with rounded selection for consistency
- Further adjustments to sidebar UI
- Further adjustments to compare chat UI
* Fixed sidebar collapse/expand for recent chats and recent runs not being clickable
* Sidebar, fonts, and chat UI refinements
- Replaced logo PNG with real font text for 'unsloth' and 'BETA' label
- Added Hellix font and applied it across menus and UI elements
- Lighter scrollbar in the sidebar compared to other areas of the app
- Adjusted chat font and chat bubble styling
- Adjusted app menu design to stay consistent with the sidebar
- Adjusted text style for 'New Chat' and repositioned content/chatbox
- Adjusted model selector and top area UI
- Fixed footer text from 'LLM's' to 'LLMs'
- Fixed active selection border color incorrectly appearing on page refresh and during general navigation
- Logo now defaults to 'New Chat' when clicked
* Sidebar, model selector, and mobile UI fixes
- Further adjustments to sidebar UI and logo
- Changed right bar icon
- Model selector adjustments
- Collapsed sidebar now matches the content area background
- Adjusted Hellix font spacing across pages
- Fixed sidebar icon overlap on mobile screens
* Adjust sidebar icons
* Fixed compare chat UI and scrolling issues
* Fixed inference settings icon behavior and context info positioning
- Fixed top right inference settings icon to move into sidepanel during expand/collapse, matching left sidebar behavior
- Adjusted context information element positioning
* Fix: textarea overflow in system prompt editor
* Code block redesign, font, and chat bubble adjustments
- Redesigned code block colors and theme
- Changed code block font to Fira Code
- Fixed scrollbar disappearing when expanding/collapsing tool calls in chats
- Adjusted chat bubble background color
* Fix chat bubble background color in dark theme
* fix: restore textarea auto-sizing and scope prompt editor sizing
* fix: add explicit textarea field sizing for prompt editor overflow
* fix: generate chat nonce on click instead of render
* fix: respect training lock on logo navigation
* Refactor compare page dual chat scrolling behavior
* Revert "Refactor compare page dual chat scrolling behavior"
This reverts commit d056ec09f2.
---------
Co-authored-by: sneakr <hauzin@hotmail.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
* export: update GGUF quant list and ordering
* gguf: add Q2_K_L quantize flags for output and embeddings
* export: add live console logs for LoRA export flow
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: stream q2_k_l quantize logs and include subprocess error details
* fix: route Q2_K_L preset to q2_k ftype with q8_0 output+embeddings
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Trashing a thread mid-stream used to delete the Dexie rows while the
model kept generating, because the sidebar has no access to the
@assistant-ui aui context. Expose per-thread cancelRun() through the
chat runtime store and call it from deleteChatItem so trash behaves
like Stop → Trash. Covers compare pairs by cancelling each paired
thread.
Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>
* fix(studio): forward OpenAI tools/tool_choice to llama-server (#4999)
Studio's /v1/chat/completions silently stripped standard OpenAI `tools`
and `tool_choice` fields, so clients using standard function calling
(opencode, Claude Code, Cursor, Continue, ...) never got structured
tool_calls back. Adds a client-side pass-through path mirroring the
existing Anthropic /v1/messages flow: when `tools` is present without
Studio's `enable_tools` shorthand, the request is forwarded to
llama-server verbatim so the client sees native id, finish_reason
("tool_calls"), delta.tool_calls, and accurate usage tokens.
Also wires Anthropic tool_choice forwarding: /v1/messages previously
accepted tool_choice on the request model but silently dropped it with
a warning. Translate the four Anthropic shapes to OpenAI format and
forward them so agentic clients can actually enforce tool use.
- ChatCompletionRequest: add tools, tool_choice, stop; extra="allow"
- ChatMessage: accept role="tool", optional tool_call_id / tool_calls /
name; content is now optional (assistant with only tool_calls)
- routes/inference.py: _openai_passthrough_stream /
_openai_passthrough_non_streaming helpers, routing branch in
openai_chat_completions, vision+tools via content-parts injection
- _build_passthrough_payload: tool_choice parameter (default "auto")
- anthropic_compat: anthropic_tool_choice_to_openai() translator
- tests/test_openai_tool_passthrough.py: Pydantic + translator unit tests
- tests/test_studio_api.py: 5 new E2E tests (non-stream, stream,
multi-turn, OpenAI SDK, Anthropic tool_choice=any regression)
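For reference, the four Anthropic tool_choice shapes map onto OpenAI
values roughly as sketched below. This is an illustration of the
mapping, not the code of anthropic_tool_choice_to_openai(); the
function name here is hypothetical.

```python
from typing import Any, Union


def sketch_tool_choice_to_openai(tool_choice: dict[str, Any]) -> Union[str, dict]:
    """Translate an Anthropic tool_choice dict to its OpenAI equivalent."""
    kind = tool_choice.get("type")
    if kind == "auto":
        return "auto"
    if kind == "any":
        return "required"  # the model must call some tool
    if kind == "none":
        return "none"
    if kind == "tool":
        # Force one specific tool by name.
        return {"type": "function", "function": {"name": tool_choice["name"]}}
    raise ValueError(f"unsupported tool_choice type: {kind!r}")
```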
* fix(studio): surface httpx transport errors from OpenAI passthrough
When the managed llama-server subprocess crashes mid-request, the
async pass-through helpers in routes/inference.py used to return a
bare 500 (non-streaming) or an "An internal error occurred" SSE chunk
(streaming) because _friendly_error only recognized the sync path's
"Lost connection to llama-server" substring -- httpx transport
failures (ConnectError / ReadError / RemoteProtocolError /
ReadTimeout) stringify differently and fell through to the generic
case.
- _friendly_error: map any httpx.RequestError subclass to the same
"Lost connection to the model server" message the sync chat path
emits. Placed before the substring heuristics so the streaming path
automatically picks it up via its existing except Exception catch.
- _openai_passthrough_non_streaming: wrap the httpx.AsyncClient.post
in a try/except httpx.RequestError and re-raise as HTTPException
502 with the friendly detail.
- tests/test_openai_tool_passthrough.py: new TestFriendlyErrorHttpx
class pinning the mapping for ConnectError, ReadError,
RemoteProtocolError, ReadTimeout, and confirming non-httpx paths
(context-size heuristic, generic fallback) are unchanged.
* fix(studio): close aiter_bytes/aiter_lines explicitly in passthroughs
The httpcore asyncgen cleanup fix in 5cedd9a5 is incomplete on Python
3.13 + httpcore 1.0.x: it switched to manual client/response lifecycle
but still used anonymous `async for raw_line in resp.aiter_lines():`
patterns in all three streaming paths. Python's async for does NOT
auto-close the iterator on break/return, so the aiter_lines /
aiter_bytes async generator remains alive, reachable only from the
surrounding coroutine frame. Once `_stream()` returns the frame is
GC'd and the orphaned asyncgen is finalized on a LATER GC pass in a
DIFFERENT asyncio task, where httpcore's
HTTP11ConnectionByteStream.aclose() enters anyio.CancelScope.__exit__
with a mismatched task and prints "Exception ignored in: <async
generator>" / "async generator ignored GeneratorExit" / "Attempted
to exit cancel scope in a different task" to the server log.
A user observed this on /v1/messages after successful (status 200)
requests, with the traceback pointing at HTTP11ConnectionByteStream
.__aiter__ / .aclose inside httpcore.
Fix: save resp.aiter_lines() / resp.aiter_bytes() as a variable and
explicitly `await iter.aclose()` in the finally block BEFORE
resp.aclose() / client.aclose(). This closes the asyncgen inside the
current task's event loop, so the internal httpcore byte stream is
cleaned up before Python's asyncgen GC hook has anything orphaned to
finalize. Each aclose is wrapped in try/except Exception so nested
anyio cleanup noise can't bubble out.
Applied to all three streaming passthrough paths:
- _anthropic_passthrough_stream (/v1/messages client-side tool path)
- _openai_passthrough_stream (/v1/chat/completions client-side tool
path, new in this PR)
- openai_completions (/v1/completions bytes proxy from PR #4956)
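The close ordering can be demonstrated without httpx. In the sketch
below, FakeResponse is a hypothetical stand-in for a streaming
response; the point is only that the line iterator is kept in a
variable and closed in the same task before the response itself.

```python
import asyncio

events = []


class FakeResponse:
    """Hypothetical stand-in for an httpx streaming response."""

    async def aiter_lines(self):
        try:
            for line in ("data: one", "data: two", "data: three"):
                yield line
        finally:
            events.append("lines closed")

    async def aclose(self):
        events.append("response closed")


async def stream_first_line(resp):
    lines = resp.aiter_lines()  # keep a reference so it can be closed
    try:
        async for line in lines:
            return line  # early exit leaves the generator suspended
    finally:
        # Close the iterator first, then the response, in this task.
        for closer in (lines.aclose, resp.aclose):
            try:
                await closer()
            except Exception:
                pass  # cleanup noise must not mask the real result


first = asyncio.run(stream_first_line(FakeResponse()))
```

Without the explicit `lines.aclose()`, the suspended generator would
only be finalised later by the event loop's asyncgen hook, which is
where the cross-task warning originates.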
* fix(studio): default ChatCompletionRequest.stream to false per OpenAI spec
OpenAI's /v1/chat/completions spec defaults `stream` to false, so
clients that omit the field (naive curl, minimal integrations) expect
a single JSON response back. Studio was defaulting to true, silently
switching those clients into SSE and breaking any parser that didn't
also handle streaming. ResponsesRequest and AnthropicMessagesRequest
already default to false correctly; only ChatCompletionRequest was
wrong.
Studio's own frontend always sets `stream` explicitly on every
chat-adapter / chat-api / runtime-provider call site, so the flip has
no UI impact. SDK users (OpenAI Python/JS SDK, opencode, Claude Code,
Cursor, Continue) also always pass `stream` explicitly, so they're
unaffected. The only clients feeling the change are raw-curl users
who were relying on the wrong default -- those get the correct OpenAI
behavior now.
Added a regression test pinning the default so it can't silently
flip back.
* fix(studio): reject images in OpenAI tool passthrough for text-only GGUFs
The new tool passthrough branch runs before _extract_content_parts,
skipping the existing not is_vision guard. Requests combining tools
with an image on a text-only tool-capable GGUF were forwarded to
llama-server, producing opaque upstream errors instead of the
pre-existing clear 400. Restore the guard inline at the dispatch
point, checking both legacy image_base64 and inline image_url parts.
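A rough sketch of that guard on a plain request dict. The function
names and the error text are hypothetical, and treating image_base64
as a top-level request field is an assumption.

```python
from typing import Optional


def request_has_image(body: dict) -> bool:
    """Detect legacy image_base64 or inline image_url content parts."""
    if body.get("image_base64"):  # assumption: top-level legacy field
        return True
    for message in body.get("messages", []):
        content = message.get("content")
        if isinstance(content, list) and any(
            isinstance(part, dict) and part.get("type") == "image_url"
            for part in content
        ):
            return True
    return False


def image_guard_error(body: dict, is_vision: bool) -> Optional[str]:
    """Return a 400 detail string for text-only models, else None."""
    if request_has_image(body) and not is_vision:
        return "This model does not support image input."
    return None
```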
* fix(studio): require tool_call_id on role=tool chat messages
Enforce the OpenAI spec rule that role="tool" messages must carry a
tool_call_id. Without it, upstream backends cannot associate a tool
result with the assistant's prior tool_calls entry and the request
fails in non-obvious ways through the passthrough path. Reject at the
request boundary with a 422 instead.
* fix(studio): harden OpenAI tool passthrough validation and error surfacing
Three related fixes called out by the PR review:
1. Preserve upstream status codes in the streaming passthrough. The
httpx request is now dispatched before the StreamingResponse is
constructed. Non-200 upstream responses and httpx RequestError
transport failures raise HTTPException with the real status
instead of being buried inside a 200 SSE error frame, so OpenAI
SDK clients see APIError/BadRequestError/... as expected.
2. Require non-empty content on user/system/tool messages. Per the
OpenAI spec, content may only be omitted on assistant messages
that carry tool_calls; enforce that at the request boundary so
malformed messages never reach the passthrough path.
3. Role-constrain tool-call metadata. tool_calls is only valid on
role=assistant, tool_call_id and name only on role=tool. Without
this, a user/system message with tool_calls would flip the
passthrough branch on and be forwarded to llama-server, surfacing
as an opaque upstream error.
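The message rules from items 2 and 3, together with the tool_call_id
requirement, can be sketched as one check over a plain dict. The
function name and error strings are hypothetical; the real validation
lives on the Pydantic request models.

```python
def validate_chat_message(message: dict) -> list[str]:
    """Return a list of rule violations (empty when the message is valid)."""
    errors = []
    role = message.get("role")
    tool_calls = message.get("tool_calls")

    if tool_calls and role != "assistant":
        errors.append("tool_calls is only valid on role=assistant")
    if role == "tool":
        if not message.get("tool_call_id"):
            errors.append("role=tool requires tool_call_id")
    else:
        for field in ("tool_call_id", "name"):
            if message.get(field) is not None:
                errors.append(f"{field} is only valid on role=tool")

    # Content may be omitted only by an assistant message with tool_calls.
    may_omit_content = role == "assistant" and bool(tool_calls)
    if not message.get("content") and not may_omit_content:
        errors.append("content must be non-empty")
    return errors
```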
* fix(studio): normalize image mode and passthrough JSON verbatim
Two Gemini-code-assist review findings on PR #5099:
1. Unconditionally convert decoded images to RGB before PNG encoding.
The prior code only handled RGBA, letting CMYK/I/F images crash
at img.save(format="PNG") and surface as opaque 400s. Applied to
both the passthrough helper and the non-passthrough GGUF path
that originally carried this pattern, keeping the two sites in
sync.
2. Return the upstream JSON body as raw bytes via Response rather
than parse-then-re-serialize with JSONResponse. Matches the
passthrough helper's "verbatim" contract and drops a redundant
round-trip.
---------
Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* unsloth gemma4 support files
* some fixes
* Fixing cache.empty() calls (#4813)
* Fixing cache.empty() calls
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Manan Shah <mananshah@Manans-MacBook-Pro.local>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix/gemma4 mlx (#4816)
* Fixing cache.empty() calls
* fixing for mlx versions
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Manan Shah <mananshah@Manans-MacBook-Pro.local>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* removed bidirectional check for 31b (#4839)
Co-authored-by: Manan17 <shahmanan170602@gmail.coml>
* Add Gemma 4 26B MoE support (MLX) (#4844)
* removed bidirectional check for 31b
* Change gemma4_text for moe
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Manan Shah <mananshah@Manans-MacBook-Pro.local>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(gemma4): cast RoPE offset to int before mx.arange() (#4901)
* fix(gemma4): cast RoPE offset to int before mx.arange()
* fix(gemma4): use zero-based arange + offset to avoid CPU-GPU sync
* qwen3.6 patches for multi-turn chat
* qwen3.6 script
* removing unnecessary scripts
* displaying errors for not installed packages
---------
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Co-authored-by: Manan Shah <mananshah@Manans-MacBook-Pro.local>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Manan17 <shahmanan170602@gmail.coml>
Co-authored-by: Théophile Lafargue <138336683+eauchs@users.noreply.github.com>
* Add Qwen3.6 inference defaults for Studio
Add qwen3.6 family entry to inference_defaults.json with the
recommended sampling parameters from Qwen's documentation:
temperature=0.7, top_p=0.8, top_k=20, min_p=0.0,
presence_penalty=1.5, repetition_penalty=1.0.
Without this, Qwen3.6 models fall through to the generic qwen3
pattern which uses different defaults (temperature=0.6,
top_p=0.95, no presence_penalty).
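The fall-through can be illustrated with a small lookup. The values
come from the text above; the dict layout, the substring matching, and
the longest-pattern-first ordering are assumptions for illustration
and do not reflect the real structure of inference_defaults.json.

```python
# Hypothetical in-memory view of the family defaults.
INFERENCE_DEFAULTS = {
    "qwen3.6": {
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "min_p": 0.0,
        "presence_penalty": 1.5,
        "repetition_penalty": 1.0,
    },
    # Only the generic qwen3 values mentioned in the text are listed.
    "qwen3": {"temperature": 0.6, "top_p": 0.95},
}


def resolve_defaults(model_name: str) -> dict:
    """Pick the most specific family pattern contained in the model name."""
    name = model_name.lower()
    # Assumption: longer patterns win, so "qwen3.6" is tried before "qwen3".
    for pattern in sorted(INFERENCE_DEFAULTS, key=len, reverse=True):
        if pattern in name:
            return INFERENCE_DEFAULTS[pattern]
    return {}
```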
* Add Qwen3.6-35B-A3B-GGUF to default model lists
* Add Qwen3.5/3.6 presence_penalty to thinking toggle and small-model disable logic
- Thinking toggle (on-load + button click) now sets presencePenalty: 1.5 for
Qwen3.5 and Qwen3.6 models (both thinking-ON and thinking-OFF states)
- Small-model thinking-disable check (<9B defaults to no-thinking) extended
from Qwen3.5-only to also cover Qwen3.6, in all 3 locations:
frontend on-load, frontend refresh, backend llama_cpp.py
* fix: multi-GPU inference crash for bnb 4-bit/8-bit models
When load_in_4bit or load_in_8bit is used with device_map="sequential"
and max_memory constraints that place weights across multiple GPUs (or
entirely on a non-default GPU like cuda:1), the bitsandbytes loading
path in transformers never calls dispatch_model. No AlignDevicesHook is
installed, and the first forward/generate call crashes with:
RuntimeError: Expected all tensors to be on the same device
This adds _attach_bnb_multidevice_hooks() which is called after
from_pretrained returns. It infers a device map from actual parameter
placements and calls dispatch_model(force_hooks=True) to install the
missing hooks. The function is a complete no-op for the common
single-GPU cuda:0 case.
Call sites: FastBaseModel.from_pretrained (vision.py) and
FastLlamaModel.from_pretrained (llama.py).
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: align with PR #5053 final review improvements
- Add hook call to the bnb quantized loading branch in llama.py (the
primary load_in_4bit path), not just the non-fast-inference fallback
- Expand bnb detection: also check model.is_loaded_in_4bit,
model.is_loaded_in_8bit, model.quantization_method
- Pass explicit main_device and skip_keys to dispatch_model
- Use logger.info instead of print for the success message
- Use kwargs.get("load_in_8bit", False) at llama.py call sites
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* auth: default to chat
* settings: relaunch onboarding
* onboarding: return to launch page
* studio: stop auto guided tour
* ui: soften global radius
* cleanup: rename onboarding exit prop
* fix onboarding redirect safety
* Show real Unsloth version in settings
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* feat(studio): replace navbar navigation with collapsible sidebar
Add an app-wide sidebar with hover-expand and pin-to-dock behavior.
Navigation items (Studio, Recipes, Export, Chat) move from the center
pill navbar to the sidebar. Chat threads and recipes render as
collapsible sub-lists. Navbar simplified to logo + update + close.
- Extend SidebarProvider with pinned/hovered state model
- New AppSidebar with animated active indicator, sloth profile menu,
theme toggle, guided tour, back/forward navigation
- Chat page refactored to URL-driven view state via search params
- Extract reusable hooks for chat thread and recipe sidebar data
- Guard startViewTransition for browser compatibility
- Wrap chat deletions in Dexie transaction for data integrity
* feat(studio): move logo to sidebar and make navbar overlay
- Sidebar is now full-height with logo in SidebarHeader
- Collapsed sidebar shows sticker.png, expanded shows full logo
- Navbar is absolute-positioned overlay (no layout space)
- Main content extends to top, aligning with navbar controls
* feat(studio): full-height sidebar with recents, edge-to-edge nav buttons
- Sidebar outside max-w-7xl, pinned to left edge
- Remove sidebar rounding, menu buttons rounded-md
- Nav buttons flush to sidebar edges with no left rounding
- Replace collapsible recipes/chat with flat nav items
- Add Recents section with chat history (1 item when not on chat, full on chat)
- New Chat as first nav item with PencilEdit02Icon
- Cursor pointer on all sidebar buttons
- Navbar temporarily hidden for screenshots
* fix(studio): fix chat scroll, action bar hover, collapsible recents
- Fix sticky composer by removing `relative` override on viewport footer
- Action bar buttons only show on hover (autohide=always)
- Remove floating border/shadow from action bar
- Add scroll space above composer for last message actions
- Back/forward buttons use router history (stay in-app)
- Recents section collapsible with chevron on chat route
- Set html/body/#root height for proper h-full chain
* fix(studio): address review feedback, clean up unused code
- Unhide navbar (was left hidden from screenshot)
- Remove unused imports: SidebarMenuSub*, BubbleChatIcon, ColumnInsertIcon
- Remove unused vars: recipeItems, activeRecipeId, canCompare, recipesOpen
- Include compare query id in active sidebar selection
- Use store type for contextUsage instead of inline type
- Simplify noop in sidebar.tsx
- Remove empty className prop
* feat(studio): add mobile sidebar, recent runs section, and misc UX fixes
* feat(studio): scaffold settings feature module with dialog store
* feat(studio): add tri-state theme store for settings
* feat(chat): add clear-all-chats and export-chat-history utils
* feat(studio): add settings dialog shell with tab rail
* feat(studio): add appearance tab with theme and sidebar pin
* feat(studio): add settings general tab with hf token, auto-title, reset prefs
* feat(studio): add settings chat tab with export and clear
* feat(studio): add api keys tab with list and revoke flow
* feat(studio): add create-key form and reveal dialog
* feat(studio): add usage examples panel to api keys tab
* feat(studio): add settings about tab with update and shutdown
* feat(studio): add settings dropdown item and cmd-comma shortcut
* feat(studio): remove legacy api-keys route and chat-sheet preference rows
* fix(studio): settings dialog a11y + polish pass
* feat(studio): inline api key reveal card replacing nested dialog
* fix(studio): hide revoked keys from settings list
* refactor(studio): strip navbar and hoist training unload guard
* feat(studio): explicit sidebar toggle, remove hover-open and pin icons
* fix(studio): use SidebarRight01Icon for collapsed sidebar open toggle
* fix(studio): address code review findings for settings dialog
* feat(studio): collapsible navigate group with standalone new-chat and compare
* fix(studio): chat-only standalone actions, use ColumnInsertIcon for compare
* fix(studio): sidebar new-chat/compare state reset and icon-mode collapsible
* feat(studio): add compact logo assets for sidebar header
* Fixed sidebar design
* fix(studio): sidebar delete icon hover contrast and sizing
* feat(studio): route-gate sidebar recents (chats off /studio, runs on /studio)
* feat(studio): add chat search store
* feat(studio): add chat search index hook with snapshot-on-open
* feat(studio): add chat search command dialog with global shortcut
* feat(studio): wire chat search into sidebar
* fix(studio): trim hf token on save, add show/hide toggle, commit on close
* revert(studio): restore original sidebar/border colors, brighten sidebar
* feat(studio): forward overlayClassName through CommandDialog
* fix(studio): wrap search dialog in Command context, redesign as flat 635px card
* fix(studio): reserve right padding on recent items so delete icon stops overlapping title
* fix(studio): skip hf token unmount-commit during reset-prefs reload
* chore(studio): drop unused icon import and unreachable runs navigate branch
* fix(studio): chat search index filters archived before limit, batches message query, picks up reasoning text
* fix(studio): keep CommandEmpty in tree so empty state renders correctly
* fix(studio): cap system prompt and chat template textareas so they scroll instead of growing
* fix(studio): attach chat-compare tour anchor to sidebar compare button
* fix(studio): persist system theme explicitly so next-themes does not clobber on reload
* fix(studio): auto-switch to history tab when selecting a recent run from sidebar
* UI overhaul: chatbox, scrollbar, sidebar, and compare view
UI Changes:
- Redesigned the Compare UI with general cleanup
- Redesigned the Chatbox UI
- Reduced the width of the user chat bubble for improved readability
- Narrowed the user chat box across the content page
- Adjusted thinking-box text color to be slightly darker
- Removed faded text effect from chat messages
- Removed faded text effect from the thinking box
- Added a small LLM chat safety note at the bottom of the chatbox
- Restyled the scrollbar
Layout & Behavior:
- Reworked the scrollbar to span the full height of the page (no top/bottom padding) and remain persistently visible when content is scrollable, rather than only on hover
- Reworked the Configuration sidebar to span full height: removed rounded corners and borders, with the scrollbar adjusted to match the full top-to-bottom layout
- Adjusted the top menu and bottom chatbox content areas to work correctly with the new full-page scroll behavior
- Made chat content match the chatbox width, with content sliding slightly behind the chatbox when scrolling
- Aligned chat text width with the chatbox for visual consistency, including how far the text extends behind the chatbox
Fixes:
- Fixed the chatbox not auto-expanding when typing multi-line input while bottom-positioned during an active chat (previously only worked before a chat had started)
- Fixed positioning and design of the user chat hover menu buttons to match the assistant chat box; they are now displayed below the chat bubble instead of on the left side
* Fix user message layout in thread component
* swap code icon
* fix compare layout
* fix compare pane flex
* Sidebar improvements and fixes
- Added scrolling support to the sidebar so menus and recent chats no longer get hidden
- Recent chats are now always visible in the sidebar, not hidden when in Studio, Recipes, or Export
- The selected recent chat is now deselected when navigating to another section
- Fixed sidebar glitch where browser resize could make the sidebar and expand button disappear completely
- Fixed glitch where the open-sidebar hover tooltip appeared above the logo when clicking expand sidebar
- Reduced sidebar width on mobile to around 2/3 of the screen (was too wide)
- Made the close-sidebar hover tooltip consistent with the rest of the design
- Removed sidebar collapse/expand animation
- Small adjustment to chat width
* Fix route scrolling, polling, and theme sync issues
* Fix Studio page scrolling
---------
Co-authored-by: sneakr <hauzin@hotmail.com>
* Studio: Ollama support, recommended folders, Custom Folders UX polish
Backend:
- Add _scan_ollama_dir that reads manifests/registry.ollama.ai/library/*
and creates .gguf symlinks under <ollama_dir>/.studio_links/ pointing
at the content-addressable blobs, so detect_gguf_model and llama-server
-m work unchanged for Ollama models
- Filter entries under .studio_links from the generic models/hf/lmstudio
scanners to avoid duplicate rows and leaked internal paths in the UI
- New GET /api/models/recommended-folders endpoint returning LM Studio
and Ollama model directories that currently exist on the machine
(OLLAMA_MODELS env var + standard paths, ~/.lmstudio/models, legacy
LM Studio cache), used by the Custom Folders quick-add chips
- detect_gguf_model now uses os.path.abspath instead of Path.resolve so
the readable symlink name is preserved as display_name (e.g.
qwen2.5-0.5b-Q4_K_M.gguf instead of sha256-abc...)
- llama-server failure with a path under .studio_links or .cache/ollama
surfaces a friendlier message ("Some Ollama models do not work with
llama.cpp. Try a different model, or use this model directly through
Ollama instead.") instead of the generic validation error
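A minimal sketch of the manifest-to-symlink idea described above, covering only the initial `library` namespace. The function name `scan_ollama_library`, the `application/vnd.ollama.image.model` media type, and the `<model>-<tag>.gguf` link naming are assumptions made for illustration; the real `_scan_ollama_dir` may differ.

```python
import json
from pathlib import Path

# Assumed media type of the weights layer in an Ollama manifest.
MODEL_LAYER = "application/vnd.ollama.image.model"


def scan_ollama_library(ollama_dir: Path) -> list:
    """Hypothetical stand-in for _scan_ollama_dir (library namespace only)."""
    links_dir = ollama_dir / ".studio_links"
    links_dir.mkdir(exist_ok=True)
    library = ollama_dir / "manifests" / "registry.ollama.ai" / "library"
    found = []
    # Manifests are laid out as library/<model>/<tag>.
    for manifest in sorted(library.glob("*/*")):
        if not manifest.is_file():
            continue
        try:
            layers = json.loads(manifest.read_text()).get("layers", [])
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or malformed manifest: skip it
        for layer in layers:
            if layer.get("mediaType") != MODEL_LAYER:
                continue
            # Digest "sha256:abc" maps to blob file "sha256-abc".
            blob = ollama_dir / "blobs" / layer.get("digest", "").replace(":", "-")
            if not blob.is_file():
                continue
            link = links_dir / f"{manifest.parent.name}-{manifest.name}.gguf"
            if not link.is_symlink():
                link.symlink_to(blob)
            found.append(link)
    return found
```

Because each link keeps a readable name ending in `.gguf`, code that only checks the suffix and then opens the file sees an ordinary GGUF path.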
Frontend:
- ListLabel supports an optional leading icon and collapse toggle; used
for Downloaded (download icon), Custom Folders (folder icon), and
Recommended (star icon)
- Custom Folders header gets folder icon on the left, and +, search,
and chevron buttons on the right; chevron uses ml-auto so it aligns
with the Downloaded and Recommended chevrons
- New recommended folder chips render below the registered scan folders
when there are unregistered well-known paths; one click adds them as
a scan folder
- Custom folder rows that are direct .gguf files (Ollama symlinks) load
immediately via onSelect instead of opening the GGUF variant expander
(which is for repos containing multiple quants, not single files)
- When loading a direct .gguf file path, send max_seq_length = 0 so the
backend uses the model's native context instead of the 4096 chat
default (qwen2.5:0.5b now loads at 32768 instead of 4096)
- New listRecommendedFolders() helper on the chat API
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address review: log silent exceptions and support read-only Ollama dirs
Replace silent except blocks in _scan_ollama_dir and the
recommended-folders endpoint with narrower exception types plus debug
or warning logs, so failures are diagnosable without hiding signal.
Add _ollama_links_dir helper that falls back to a per-ollama-dir hashed
namespace under Studio's own cache (~/.unsloth/studio/cache/ollama_links)
when the Ollama models directory is read-only. This is common for system
installs at /usr/share/ollama/.ollama/models and
/var/lib/ollama/.ollama/models, where the Studio process has read but not
write access. Previously the
scanner returned an empty list in that case and Ollama models would
silently not appear.
The fallback preserves the .gguf suffix on symlink names so
detect_gguf_model keeps recognising them. The prior "raw sha256 blob
path" fallback would have missed the suffix check and failed to load.
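The fallback can be sketched roughly as below. The name `ollama_links_dir`, the `cache_root` parameter, and the 16-character digest length are illustrative assumptions, not the actual `_ollama_links_dir` signature.

```python
import hashlib
import os
from pathlib import Path


def ollama_links_dir(ollama_dir: Path, cache_root: Path) -> Path:
    """Pick where .gguf links for one Ollama models directory should live."""
    # Preferred location: next to the models, when we may create entries there.
    if os.access(ollama_dir, os.W_OK | os.X_OK):
        return ollama_dir / ".studio_links"
    # Read-only (or missing) directory: use a per-directory namespace under
    # Studio's own cache, keyed by a hash of the absolute path.
    digest = hashlib.sha256(str(ollama_dir.absolute()).encode()).hexdigest()[:16]
    return cache_root / "ollama_links" / digest
```

Hashing the directory path keeps links from two different read-only Ollama installs apart while leaving the link file names (and their `.gguf` suffix) untouched.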
* Address review: detect mmproj next to symlink target for vision GGUFs
Codex P1 on model_config.py:1012: when detect_gguf_model returns the
symlink path (to preserve readable display names), detect_mmproj_file
searched the symlink's parent directory instead of the target's. For
vision GGUFs surfaced via Ollama's .studio_links/ -- where the weight
file is symlinked but any mmproj sidecar lives next to the real blob
-- mmproj was no longer detected, so the model was misclassified as
text-only and llama-server would start without --mmproj.
detect_mmproj_file now adds the resolved target's parent to the scan
order when path is a symlink. Direct (non-symlink) .gguf paths are
unchanged, so LM Studio and HF cache layouts keep working exactly as
before. Verified with a fake layout reproducing the bug plus a
regression check on a non-symlink LM Studio model.
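A rough sketch of the extended scan order, assuming a simple "file name contains mmproj" heuristic. `mmproj_search_dirs` and `find_mmproj` are hypothetical names; the real `detect_mmproj_file` may match projector files differently.

```python
from pathlib import Path
from typing import Optional


def mmproj_search_dirs(gguf_path: Path) -> list:
    """Directories to scan: the path's own parent, then the symlink target's."""
    dirs = [gguf_path.parent]
    if gguf_path.is_symlink():
        target_parent = gguf_path.resolve().parent
        if target_parent not in dirs:
            dirs.append(target_parent)
    return dirs


def find_mmproj(gguf_path: Path) -> Optional[Path]:
    """Return the first .gguf whose name mentions 'mmproj' (assumed heuristic)."""
    for directory in mmproj_search_dirs(gguf_path):
        for candidate in sorted(directory.glob("*.gguf")):
            if "mmproj" in candidate.name.lower():
                return candidate
    return None
```

For a non-symlink path the list has a single entry, so existing layouts are scanned exactly as before.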
* Address review: support all Ollama namespaces and vision projector layers
- Iterate over all directories under registry.ollama.ai/ instead of
hardcoding the "library" namespace. Custom namespaces like
"mradermacher/llama3" now get scanned and include the namespace
prefix in display names, model IDs, and symlink names to avoid
collisions.
- Create companion -mmproj.gguf symlinks for Ollama vision models
that have an "application/vnd.ollama.image.projector" layer, so
detect_mmproj_file can find the projector alongside the model.
- Extract symlink creation into _make_symlink helper to reduce
duplication between model and projector paths.
* Address review: move imports to top level and add scan limit
- Move hashlib and json imports to the top of the file (PEP 8).
- Remove inline `import json as _json` and `import hashlib` from
function bodies, use the top-level imports directly.
- Add `limit` parameter to `_scan_ollama_dir()` with early exit
when the threshold is reached.
- Pass `_MAX_MODELS_PER_FOLDER` into the scanner so it stops
traversing once enough models are found.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address review: Windows fallback, all registry hosts, collision safety
_make_link (formerly _make_symlink):
- Falls back to os.link() hardlink when symlink_to() fails (Windows
without Developer Mode), then to shutil.copy2 as last resort
- Uses atomic os.replace via tmp file to avoid race window where the
.gguf path is missing during rescan
Scanner now handles all Ollama registry layouts:
- Uses rglob over manifests/ instead of hardcoding registry.ollama.ai
- Discovers hf.co/org/repo:tag and any other host, not just library/
- Filenames include a stable sha1 hash of the manifest path to prevent
collisions between models that normalize to the same stem
Per-model subdirectories under .studio_links/:
- Each model's links live in their own hash-keyed subdirectory
- detect_mmproj_file only sees the projector for that specific model,
not siblings from other Ollama models
Friendly Ollama error detection:
- Now also matches ollama_links/ (the read-only fallback cache path)
and model_identifier starting with "ollama/"
Recommended folders:
- Added os.access(R_OK | X_OK) check so unreadable system directories
like /var/lib/ollama/.ollama/models are not advertised as chips
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address review: filter ollama_links from generic scanners
The generic scanners (models_dir, hf_cache, lmstudio) already filter
out .studio_links to avoid duplicate Ollama entries, but missed the
ollama_links fallback cache directory used for read-only Ollama
installs. Add it to the filter.
* Address review: idempotent link creation and path-component filter
_make_link:
- Skip recreation when a valid link/copy already exists (samefile or
matching size check). Prevents blocking the model-list API with
multi-GB copies on repeated scans.
- Use uuid4 instead of os.getpid() for tmp file names to avoid race
conditions from concurrent scans.
- Log cleanup errors instead of silently swallowing them.
Path filter:
- Use os.sep-bounded checks instead of bare substring match to avoid
false positives on paths like "my.studio_links.backup/model.gguf".
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address review: drop copy fallback, targeted glob, robust path filter
_make_link:
- Drop shutil.copy2 fallback -- copying multi-GB GGUFs inside a sync
API request would block the backend. Log a warning and skip the
model when both symlink and hardlink fail.
Scanner:
- Replace rglob("*") with targeted glob patterns (*/*/* and */*/*/*)
to avoid traversing unrelated subdirectories in large custom folders.
Path filter:
- Use Path.parts membership check instead of os.sep substring matching
for robustness across platforms.
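The difference between a substring match and a path-component match can be illustrated as follows; `INTERNAL_DIRS` and `is_internal_link_path` are hypothetical names for this sketch.

```python
from pathlib import PurePath

# Directory names that hold Studio-created links (taken from the text above).
INTERNAL_DIRS = {".studio_links", "ollama_links"}


def is_internal_link_path(path) -> bool:
    """True only when a whole path component equals an internal dir name."""
    parts = path.parts if isinstance(path, PurePath) else PurePath(path).parts
    return any(part in INTERNAL_DIRS for part in parts)
```

A directory such as `my.studio_links.backup` contains the substring but is a different component, so it is not filtered.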
Scan limit:
- Skip _scan_ollama_dir when _generic already fills the per-folder cap.
* Address review: sha256, top-level uuid import, Path.absolute()
- Switch hashlib.sha1 to hashlib.sha256 for path hashing consistency.
- Move uuid import to the top of the file instead of inside _make_link.
- Replace os.path.abspath with Path.absolute() in detect_gguf_model
to match the pathlib style used throughout the codebase.
* Address review: fix stale comments (sha1, rglob, copy fallback)
Update three docstrings/comments that still referenced the old
implementation after recent changes:
- sha1 comment now says "not a security boundary" (no hash name)
- "rglob" -> "targeted glob patterns"
- "file copies as a last resort" -> removed (copy fallback was dropped)
* Address review: fix stale links, support all manifest depths, scope error
_make_link:
- Drop size-based idempotency shortcut that kept stale links after
ollama pull updates a tag to a same-sized blob. Only samefile()
is used now -- if the link doesn't point at the exact same inode,
it gets replaced.
Scanner:
- Revert targeted glob back to rglob so deeper OCI-style repo names
(5+ path segments) are not silently skipped.
Ollama error:
- Only show "Some Ollama models do not work with llama.cpp" when the
server output contains GGUF compatibility hints (key not found,
unknown architecture, failed to load). Unrelated failures like
OOM or missing binaries now show the generic error instead of
being misdiagnosed.
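Taken together, the link-creation rules from the commits above (samefile-only idempotency, uuid4 temp names, atomic replace, symlink then hardlink, no copy fallback) might look roughly like this. `make_link` is a hypothetical stand-in for `_make_link`, and the temp-file naming is an assumption.

```python
import os
import uuid
from pathlib import Path


def make_link(target: Path, link: Path) -> bool:
    """Point `link` at `target`; return False if no link could be created."""
    target = target.absolute()
    try:
        # Idempotency: keep the link only if it is the exact same file.
        if link.exists() and os.path.samefile(link, target):
            return True
    except OSError:
        pass  # unreadable or racing entry: fall through and recreate it
    # Unique temp name so concurrent scans never share a temp path.
    tmp = link.with_name(f".{link.name}.{uuid.uuid4().hex}.tmp")
    try:
        try:
            tmp.symlink_to(target)
        except OSError:
            os.link(target, tmp)  # hardlink fallback, e.g. no symlink privilege
        # Atomic swap: the final name is never missing during a rescan.
        os.replace(tmp, link)
        return True
    except OSError:
        tmp.unlink(missing_ok=True)
        return False  # caller logs a warning and skips the model
```

A same-sized but different blob fails the `samefile` check, so an updated tag replaces the stale link instead of being kept.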
---------
Co-authored-by: Daniel Han <info@unsloth.ai>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: danielhanchen <michaelhan2050@gmail.com>
* Fix review findings for PR #49
1. Sandbox fallback Jinja env in _VariantTokenizerProxy.apply_chat_template
(use SandboxedEnvironment, matching _derive_assistant_prefix_by_render)
2. Unwrap benign outer-If guards in _template_ends_with_toplevel_for so
templates like {% if messages %}{% for ... %}{% endfor %}{% endif %}
are still repairable (preserves Qwen3-Guard rejection via else-branch
and add_generation_prompt-name checks)
3. Preserve raw name_or_path in _VariantTokenizerProxy._source_path so
local-path detection works for dict/list variant tokenizers
4. Context-aware strict-mode messages: omit "will still load" and
"Set UNSLOTH_STRICT_CHAT_TEMPLATE=1" when already raising
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Older installers persisted the venv Scripts directory directly in the
User PATH registry. The shim approach from #4961 no longer writes that
entry, but on upgrade the old one survived and python.exe / pip.exe
from the unsloth venv continued winning resolution in every new shell.
Before creating the shim, read the current User PATH, filter out any
entry matching $VenvDir\Scripts (using the same symmetric raw+expanded
comparison as Add-ToUserPath), and write back if changed. No-op on
fresh installs where the legacy entry was never written.
Confirmed on a real Windows machine: `where.exe python` was returning
the venv interpreter first even after the shim PR merged.
Older installers persisted the venv Scripts directory directly in the
User PATH registry. The shim approach (added in this PR) no longer writes
that entry, but it also did not remove the old one. On upgrade, the
legacy entry survived and python.exe / pip.exe from the unsloth venv
continued winning resolution in every new shell, which is exactly the
hijack the shim was designed to prevent.
Before creating the shim, read the current User PATH, filter out any
entry matching $VenvDir\Scripts (using the same symmetric raw+expanded
comparison as Add-ToUserPath), and write back if changed. This runs
once per install and is a no-op on fresh installs where the legacy
entry was never written.
* Restrict flash attn to <=256 head dim. Consolidate attn impl checks
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Consolidate the changes into single function
* safeguard for dict instead of object
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Chat-template repair: warn-by-default, AST classification, dict support
Follow-up hardening on top of PR #4426 (which fixed the #4150
RuntimeError for ChatML LoRA reloads).
Behavior changes:
- Warn-by-default instead of RuntimeError. When fix_chat_template cannot
repair a broken template, emit a warning and return the original.
Set UNSLOTH_STRICT_CHAT_TEMPLATE=1 to restore the pre-warn hard fail.
Fixes the UX where a missing `{% if add_generation_prompt %}` block on
a saved LoRA (typical after LlamaFactory / Axolotl re-serialize) would
block model loading entirely.
- Local path vs HF hub distinguished in the warning message. For local
paths the message points at the likely downstream tool; for HF IDs it
points at the upstream model maintainers. Previously both said "file a
bug report to the maintainers of <path>" even when <path> was the
user's own saves/ directory.
- Dict / list chat_template now handled. Hermes-3 ships with
{default, tool_use} and the previous code crashed with
AttributeError: 'dict' object has no attribute 'find' when entering
_fix_chat_template with a dict. Each variant is now fixed
independently; structure is preserved.
Internals:
- _find_end_position now matches all four Jinja whitespace-control
variants ({% %}, {%- %}, {% -%}, {%- -%}) and returns the rightmost
endfor/endif so multi-for templates aren't locked onto the first loop.
Previously {%- endfor -%} (both-side dash, used by Qwen3-Guard) was
silently bypassed.
- _has_add_generation_prompt_block uses Jinja AST via
jinja2.nodes.If/Name walks instead of substring matching, so
templates that hide the block behind comments or dash-style variants
are classified correctly.
- _template_ends_with_toplevel_for gates the GH#4150 ChatML repair on
the AST: only fires when the last structural top-level node is a For
(standard ChatML shape), ignoring trailing pure-whitespace output
nodes. Templates wrapped in an outer If (Qwen3-Guard) are now
explicitly skipped at the _fix_chat_template level as well, not just
at load_correct_tokenizer's name-based exemption.
- _validate_patched_template renders the patched template with and
without add_generation_prompt and confirms the patched output
responds to the flag by appending (not replacing) content. If
validation fails, the patch is discarded and we fall through to the
warn path.
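The "append, not replace" check can be sketched as below. `responds_by_appending` and `toy_render` are hypothetical; the real `_validate_patched_template` renders through the tokenizer rather than a plain callable.

```python
def responds_by_appending(render) -> bool:
    """True if enabling the flag only adds text after the flag-off output."""
    without = render(add_generation_prompt=False)
    with_prompt = render(add_generation_prompt=True)
    return len(with_prompt) > len(without) and with_prompt.startswith(without)


def toy_render(add_generation_prompt: bool) -> str:
    """Toy stand-in for a patched ChatML template."""
    text = "<|im_start|>user\nhi<|im_end|>\n"
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"
    return text
```

A template that ignores the flag, or that rewrites earlier output when the flag is set, fails the check and the patch would be discarded.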
Verified with an expanded regression suite in tests/:
- test_fix_chat_template_pr4426.py: 42/42 template-matrix cells
- test_load_correct_tokenizer_pr4426.py: 5/5 tokenizer loads
- test_chat_template_followups.py: 10/10 new follow-up tests
- test_mistral_pr4426.py: 5 Mistral variants byte-identical
- test_qwen_pr4426.py: 14 Qwen variants byte-identical
(Qwen1.5, Qwen2, Qwen2.5-Instruct/Coder/Math/VL, Qwen3,
Qwen3-Coder, QwQ, Qwen3-Guard-Gen)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Guard _validate_patched_template against read-only chat_template
If tokenizer.chat_template is a property or otherwise read-only, the
validation helper would crash with AttributeError when trying to
temporarily set the patched template. Catch the assignment failure and
return False (skip validation), and best-effort restore in the finally
block.
* Replace regex separator inference with render-diff; broaden repair to non-ChatML templates
The previous `_infer_assistant_separator` was a four-tier regex heuristic that
only worked on ChatML-shaped templates and forced a hard `<|im_start|>` /
`<|im_end|>` presence gate on Case 2 repair. This meant a Llama-3, Gemma, or
Phi-3 template stripped of its generation-prompt block by a downstream tool
(LlamaFactory, Axolotl, etc.) would still warn-and-return even though the
structural shape is identical to the ChatML case the PR already handles.
This replaces the regex with `_derive_assistant_prefix_by_render`: render the
template with two dialogs that differ only in assistant content, then
`os.path.commonprefix` on the tails captures the exact assistant-turn prefix
the template emits. The template itself is ground truth, so non-ChatML shapes
work as long as the assistant block is a literal the template emits once per
message.
Three guards keep the derivation safe:
A. both assistant renders extend the base render (no reordering);
B. the divergence point is exactly the content-insertion site (sentinel
follows the common prefix);
C. a user-role cross-check: if a render with a user sentinel also emits
the same prefix, role has no effect on output and we reject. A render
failure on [user, user] (e.g. Gemma's `raise_exception` alternation
check) is evidence that role matters; we accept.
Sentinels differ at character 0 so `commonprefix` cannot absorb them, and
trailing whitespace/comments after the last `{% endfor %}` are stripped
before probing (they would appear in base but not after the appended
assistant turn and break Guard A).
`_fix_chat_template` and `_repair_string_template` now thread an
`is_sharegpt` kwarg; `_fix_chat_template` retries once with
`is_sharegpt=True` if the first probe returns None (dual-probe fallback
for dict/list callers).
The ChatML `<|im_start|>` / `<|im_end|>` hard gate in Case 2 is dropped.
`_infer_assistant_separator` is deleted.
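A simplified sketch of the render-diff probe with its three guards. To stay dependency-free it takes a plain `render(messages)` callable instead of a Jinja template; `derive_assistant_prefix`, `render_chatml`, and the sentinel strings are illustrative assumptions.

```python
import os


def render_chatml(messages):
    """Toy stand-in for a ChatML template with no generation-prompt block."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


def derive_assistant_prefix(render):
    base_msgs = [{"role": "user", "content": "hi"}]
    base = render(base_msgs)
    # Sentinels differ at character 0 so commonprefix cannot absorb them.
    s_a, s_b = "AAAA_SENTINEL", "BBBB_SENTINEL"
    out_a = render(base_msgs + [{"role": "assistant", "content": s_a}])
    out_b = render(base_msgs + [{"role": "assistant", "content": s_b}])
    # Guard A: both assistant renders must extend the base render.
    if not (out_a.startswith(base) and out_b.startswith(base)):
        return None
    tail_a, tail_b = out_a[len(base):], out_b[len(base):]
    prefix = os.path.commonprefix([tail_a, tail_b])
    # Guard B: the divergence point must be the content-insertion site.
    if not prefix or not tail_a[len(prefix):].startswith(s_a):
        return None
    # Guard C: a user turn must not produce the same prefix.
    try:
        out_u = render(base_msgs + [{"role": "user", "content": s_a}])
    except Exception:
        return prefix  # render failure on [user, user]: role matters, accept
    if out_u.startswith(base) and out_u[len(base):].startswith(prefix + s_a):
        return None  # role has no effect on output, reject
    return prefix
```

With the toy ChatML renderer the derived prefix is `<|im_start|>assistant\n`, which is the text a generation-prompt block would need to emit.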
Verified via:
- tests/test_fix_chat_template_pr4426.py: 51/51 cells (new Llama-3,
Gemma, Phi-3 broken-template rows all repair FIX-OK)
- tests/test_load_correct_tokenizer_pr4426.py: 5/5
- tests/test_chat_template_followups.py: 18/18 (T11-T18 cover
non-ChatML repair + probe failure modes)
- tests/test_mistral_pr4426.py: 5/5 byte-identical
- tests/test_qwen_pr4426.py: 14/14 byte-identical (Qwen3-Guard AST
gate still rejects)
- tests/hermes3_lora_pr4426.py reload: patched template ends with
`<|im_start|>assistant\n`, inference returns sensible output.
- temp/sim/battery.py: 79/79 followup; vs baseline: 0 regressions,
9 improvements.
- Spot-check probe on real stripped tokenizers (Hermes-3, Phi-4,
Llama-3.2-1B, Gemma-3-1B): all derive the expected prefix.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address reviewer findings: variant routing, positive-gate detection, comment-safe end scan
Resolves three reviewer findings on PR #5049 (`fix/chat-template-followups`):
Finding #1 [10/10]: dict/list variants now route through
`_fix_chat_template_for_tokenizer` via a new `_VariantTokenizerProxy`
adapter. Previously the dict/list branches called `_fix_chat_template`
directly, silently bypassing the warn/strict (`UNSLOTH_STRICT_CHAT_TEMPLATE`)
contract, the `no == yes` diagnostic, broken-existing-block detection,
and `_validate_patched_template` guard. The proxy swaps
`base.chat_template` to the variant string before each
`apply_chat_template` call so tokenizer globals (`bos_token`, custom
filters, `raise_exception`) remain available; if the base is read-only
it falls back to isolated Jinja rendering.
Finding #2 [1/10]: `_has_add_generation_prompt_block` now requires the
`If` body to contain at least one `Output` node (a new
`_if_body_emits_content` helper walks descendants). This distinguishes a
real generation-prompt block from a header guard like
`{% if not add_generation_prompt is defined %}{% set ... %}{% endif %}`
(body contains only `Assign`) which references the name but emits
nothing. Also dropped a now-redundant `"add_generation_prompt" not in
scrubbed` guard in `_fix_chat_template` Case 2 so header-guarded
templates still get repaired.
Finding #4 [1/10]: `_find_end_position` now replaces Jinja comments with
equal-length whitespace before scanning for `{% endfor %}` / `{% endif %}`
tokens. This prevents a trailing comment containing those tokens from
being picked as the real end tag. Positions in the padded string map 1:1
to positions in the original template.
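The comment-masking trick can be sketched with the standard `re` module. The function names and the exact regular expressions here are assumptions for illustration, not the patterns used in `_find_end_position`.

```python
import re

_COMMENT = re.compile(r"\{#.*?#\}", re.DOTALL)
# Covers {% %}, {%- %}, {% -%} and {%- -%} around endfor / endif.
_END_TAG = re.compile(r"\{%-?\s*(endfor|endif)\s*-?%\}")


def mask_comments(template: str) -> str:
    """Replace each Jinja comment with spaces of the same length."""
    return _COMMENT.sub(lambda m: " " * len(m.group(0)), template)


def find_last_end_tag(template: str):
    """Return (tag_name, end_offset) of the rightmost real end tag, or None."""
    last = None
    for match in _END_TAG.finditer(mask_comments(template)):
        last = match
    return None if last is None else (last.group(1), last.end())
```

Since masking preserves length, the returned offset can be used directly to slice or patch the original template string.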
Tests:
- tests/test_chat_template_followups.py: 21/21 (T19 strict-mode
dict variant, T20 header-guard repair, T21 comment-endfor trap
added; T4/T5 stubs updated with a working apply_chat_template
that routes through Jinja).
- tests/test_fix_chat_template_pr4426.py: 51/51 cells unchanged.
- tests/test_load_correct_tokenizer_pr4426.py: 5/5.
- tests/test_mistral_pr4426.py: 5/5 byte-identical.
- tests/test_qwen_pr4426.py: 14/14 byte-identical.
- temp/sim/battery.py: 79/79 followup; 0 regressions vs baseline.
- Phase 3 Hermes-3 broken-LoRA reload: inference still returns
`'The answer to the equation 2+2 is 4.'`.
- Spot-checks on Hermes-3 / Phi-4 / Llama-3.2-1B / Gemma-3-1B real
stripped templates: probe still derives the expected prefix.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Tighten comments in chat-template helpers
Pure comment minimization across `_find_end_position`,
`_has_add_generation_prompt_block`, `_if_body_emits_content`,
`_derive_assistant_prefix_by_render`, `_fix_chat_template` Case 2,
and `_VariantTokenizerProxy`. No behavior change; same intent,
fewer lines. All 21 follow-up tests and the 51-cell Phase 1 matrix
still pass.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Sandbox probe, fix is_sharegpt validator mismatch, reject negated gates
Three real bugs from the 10-agent Opus review:
1. Probe now uses `jinja2.sandbox.SandboxedEnvironment` instead of bare
`jinja2.Environment`. The probe renders at model-load time (before
the user calls `apply_chat_template`), so it was a new eager
code-execution surface that the base HF tokenizer loading does not
have. SandboxedEnvironment blocks attribute-chain exploits at
negligible cost.
2. `_repair_string_template` now tries validation with both
`is_sharegpt=False` and `is_sharegpt=True`. Previously, when
`_fix_chat_template` internally fell back to the other schema via
its dual-probe, the outer validation still used the caller's
original `is_sharegpt` -- rendering with the wrong message keys and
spuriously dropping a valid repair.
3. `_has_add_generation_prompt_block` now skips `If` nodes whose test
is a `Not` expression. A negated gate like
`{% if not add_generation_prompt %}{{ x }}{% endif %}` fires when
agp=False, so its emitting body is not a generation block -- but the
old code counted any Name reference regardless of polarity.
Cleanup: removed unused `self._label`, added `\r` escape in
generation-block literal, switched variant labels to `!r` formatting,
removed redundant `import os as _os`.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix jinja2.sandbox import and sandbox proxy fallback
Two critical findings from the 20-reviewer pass:
1. [20/20] The proxy read-only fallback used bare `jinja2.Environment`,
not sandboxed. All 20 reviewers independently reproduced marker-file
creation via `cycler.__init__.__globals__['os'].system(...)` during
`fix_chat_template()`. Fixed: fallback now uses
`from jinja2.sandbox import SandboxedEnvironment`.
2. [14/20] The render-diff probe did `import jinja2` then referenced
`jinja2.sandbox.SandboxedEnvironment`. `jinja2.sandbox` is a
submodule that is NOT auto-imported by `import jinja2` on Jinja 3.1.6.
This caused `AttributeError` (swallowed by `except Exception`),
making the entire Case 2 repair path silently return None in a clean
process. The 6 reviewers who saw it work had `jinja2.sandbox`
pre-imported by an earlier module in their process. Fixed: both the
probe and the proxy fallback now use
`from jinja2.sandbox import SandboxedEnvironment`.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Reduce inline comments from ~160 lines to ~25 across both files.
Keep one-line summaries of the "why"; drop multi-paragraph rationale
blocks that repeated information already captured in commit messages
and PR discussion.
* fix: replacing SetEnvironmentVariable with direct registry API
* apply reviews
* Use CreateSubKey for HKCU\Environment
* Store PATH backup under HKCU\Software\Unsloth
* Fix $backupKey registry handle leak in PATH backup block
Wrap $backupKey operations in try/finally so the handle is closed even
if GetValue or SetValue throws. The Add-ToUserPath helper already uses
this pattern for its registry key -- the backup block was the only
place missing it.
* Isolate WM_SETTINGCHANGE broadcast from PATH write error handling
Wrap the broadcast dummy-variable calls in their own try/catch so a
broadcast failure does not mask a successful registry PATH write.
Previously, if SetEnvironmentVariable threw after SetValue already
committed the new PATH, Add-ToUserPath would return $false and the
caller would skip Refresh-SessionPath.
* PATH helper polish: venv precedence, quoted entries, raw/expanded dedup
Three small follow-ups surfaced by a 10-reviewer pass against the rebased
PR head. None fix a regression vs main; each strictly improves the new
helpers.
Refresh-SessionPath / Refresh-Environment:
- Move $env:Path to the front of the merge so an activated venv keeps
precedence over machine/user PATH after a refresh. Before the PR,
process-only entries were dropped entirely; after it, they were kept but
placed at the back.
- Dedup on both raw and expanded forms so %USERPROFILE%\foo and the
already-expanded C:\Users\me\foo do not both survive.
Add-ToUserPath:
- Trim whitespace and surrounding double-quotes from each compared entry
so quoted PATH entries like "C:\Program Files\CMake\bin" deduplicate
against an unquoted directory of the same path.
* Back up User PATH inside Add-ToUserPath, before first mutation
Previously only studio/setup.ps1 took a one-time PATH backup, at script
top (line ~547). install.ps1 (the irm | iex entry point) had no backup,
so users who installed via that path had no recovery surface if anything
clobbered their PATH. The PR description's "one-time backup before any
modifications" promise only held for the studio installer flow.
Move the backup into Add-ToUserPath itself: just before the first actual
SetValue mutation, write the pristine raw PATH to
HKCU\Software\Unsloth\PathBackup if no backup already exists. This:
- Covers both entry points (install.ps1 and studio/setup.ps1).
- Captures the TRUE pristine PATH even when install.ps1 runs first and
studio/setup.ps1 runs afterwards (the script-top backup in setup.ps1
would otherwise see an already-modified PATH).
- Is idempotent: once a backup exists, subsequent calls preserve it.
- Skips when nothing would mutate (dedup match) or PATH is empty.
The script-top backup in studio/setup.ps1 is kept for defense in depth.
* Refresh PATH: venv-aware merge order
Reconcile two competing concerns about Refresh-SessionPath /
Refresh-Environment surfaced by separate review rounds:
- venv at the back -> activated venv loses precedence to system Python
- process at the front -> stale shims (old node, old python, etc.)
still on $env:Path can beat a freshly installed tool
New merge order:
1. Activated venv Scripts dir, only if $env:VIRTUAL_ENV is set
2. Machine PATH freshly read from registry
3. User PATH freshly read from registry
4. Current $env:Path as fallback
This way an explicitly-activated venv keeps priority while a tool the
script just installed wins over any stale entry that was already on
the inherited shell PATH. When no venv is active, fresh registry
entries take precedence as expected.
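The merge order and the raw/expanded dedup can be modelled in a few lines of Python (the real helpers are PowerShell). `merge_path`, the explicit `env` mapping, and the case-insensitive comparison are assumptions of this sketch.

```python
import re


def expand(entry: str, env: dict) -> str:
    """Expand %VAR% references using the supplied mapping."""
    return re.sub(r"%([^%]+)%", lambda m: env.get(m.group(1), m.group(0)), entry)


def normalise(entry: str, env: dict) -> str:
    """Comparison key: unquoted, expanded, no trailing backslash, lowercased."""
    cleaned = entry.strip().strip('"')
    return expand(cleaned, env).rstrip("\\").lower()


def merge_path(machine: str, user: str, process: str, env: dict) -> str:
    ordered = []
    venv = env.get("VIRTUAL_ENV")
    if venv:
        ordered.append(venv + "\\Scripts")  # 1. activated venv Scripts dir
    ordered += machine.split(";")           # 2. fresh Machine PATH
    ordered += user.split(";")              # 3. fresh User PATH
    ordered += process.split(";")           # 4. current process PATH as fallback
    seen, result = set(), []
    for entry in ordered:
        key = normalise(entry, env)
        if key and key not in seen:         # first occurrence wins
            seen.add(key)
            result.append(entry.strip())
    return ";".join(result)
```

Because the first occurrence wins, a stale entry that only exists on the inherited process PATH ends up behind anything freshly read from the registry.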
* Append to User PATH by default, close $envKey in finally
Add-ToUserPath gains a -Position Append|Prepend parameter defaulting to
Append so installing unsloth no longer prepends the bundled venv Scripts
directory ahead of the user's existing python / pip on new shells. The
four current call sites (install.ps1 launcher, studio/setup.ps1 CMake,
nvcc, Python user Scripts) all take the Append default because each one
that needs in-session precedence already does an inline $env:Path prepend
independently. This matches rustup / cargo / nvm / pyenv / uv behavior.
Also wrap the script-top $envKey.GetValue in a try/finally so the
registry handle is released even if the read throws. Matches the pattern
already used for $backupKey five lines below.
* Prepend cmake, nvcc, Python Scripts; keep venv Scripts appended
The previous commit switched Add-ToUserPath to append by default so that
installing unsloth would not silently hijack the user's system python /
pip. That was correct for the venv Scripts dir (which contains python.exe
and pip.exe alongside unsloth.exe), but wrong for the three studio/setup
call sites. Those persist cmake, the driver-compatible nvcc, and the
Python user Scripts dir for future shells, and in all three cases an
older tool already earlier in the user PATH would keep winning after the
install finished. The nvcc case is especially load-bearing: setup selects
a driver-compatible CUDA toolkit, then llama.cpp builds against whatever
wins PATH resolution, so a stale older nvcc produces broken builds.
Pass -Position 'Prepend' explicitly at the three setup.ps1 call sites
(cmake at line 754, nvcc bin at line 1025, Python user Scripts at line
1191). None of those directories holds python.exe, so prepending them
does not re-introduce the original hijack problem. Leave the install.ps1
venv Scripts call on the default Append with a comment explaining why.
* Symmetric dedup, Prepend reorders duplicates, unsloth shim dir
Address three separate findings surfaced by review:
1. Dedup asymmetry (Gemini high-priority): the existing dedup expanded
registry entries via ExpandEnvironmentVariables but did NOT expand the
new directory. Passing "%USERPROFILE%\foo" when "C:\Users\me\foo" was
already in PATH produced a duplicate. Expand both sides so the check
is symmetric.
2. -Position Prepend no-op on existing duplicates: the dedup loop
returned $false as soon as it saw a match, regardless of position.
That left a late-position duplicate in place instead of moving it to
the front, so "prepend the newly selected cmake/nvcc" did not always
beat an older copy earlier in PATH. Partition entries into kept and
dropped lists, then reinsert a single copy at the requested position.
Append still returns $false on any match so user-curated orderings
are not reshuffled. Prepend also returns $false when the only copy
is already at position 0 so we preserve the user's casing.
3. Stop adding the venv Scripts dir to User PATH entirely. That dir
holds python.exe and pip.exe alongside unsloth.exe, so neither
Prepend nor Append worked: prepend hijacked the user's system python
and pip, append made the freshly-installed unsloth.exe lose to any
older unsloth.exe earlier on PATH. Replace the Scripts-dir PATH add
with a dedicated shim directory that contains only unsloth.cmd, and
prepend that dir. The shim calls the venv's unsloth.exe by absolute
path so future pip upgrades inside the venv propagate automatically.
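The dedup rules in findings 1 and 2 can be modelled outside PowerShell. The sketch below is a Python stand-in for the logic only, not the Add-ToUserPath implementation; the function name, the case-insensitive comparison, and the trailing-separator trimming are assumptions made for illustration.

```python
import os
from typing import Callable, List, Tuple


def add_to_path(
    entries: List[str],
    new_dir: str,
    position: str = "Append",
    expand: Callable[[str], str] = os.path.expandvars,
) -> Tuple[List[str], bool]:
    """Return (new_entries, changed) for a PATH-like list of directories."""

    def norm(path: str) -> str:
        # Expand BOTH sides so "%USERPROFILE%\\foo" matches its expanded form.
        return expand(path).rstrip("\\/").lower()

    target = norm(new_dir)
    kept = [e for e in entries if norm(e) != target]
    dropped = [e for e in entries if norm(e) == target]

    if dropped:
        if position == "Append":
            # Never reshuffle a user-curated ordering on Append.
            return entries, False
        if len(dropped) == 1 and norm(entries[0]) == target:
            # The only copy is already first: keep the user's casing.
            return entries, False

    if position == "Prepend":
        return [new_dir] + kept, True
    return kept + [new_dir], True
```

With Prepend, a late-position duplicate is dropped and a single copy is reinserted at the front; with Append, any existing match leaves the list untouched.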
* Shim via hardlink, Append user Scripts, drop venv sysconfig fallback
Three follow-ups to the c0ab1ab shim commit, targeting concerns raised in
the second 20-reviewer pass:
1. Shim uses unsloth.exe (hardlink, copy fallback) instead of unsloth.cmd.
The batch-file approach had three distinct regressions:
- cmd.exe expanded %...% sequences inside user arguments, so prompts
like "What does 50% mean?" got mangled before reaching the CLI
- Git Bash / MSYS2 / POSIX-style shells on Windows do not resolve
bare-name lookups to .cmd files, so `unsloth` stopped working there
- Set-Content -Encoding ASCII replaced non-ASCII profile characters
with '?', so installs under C:\Users\Jörg\... wrote a broken shim
A hardlink (fallback: copy) of unsloth.exe is a native Windows
executable with no shell indirection. PATHEXT picks .exe before .cmd
in cmd.exe and PowerShell, Git Bash honors .exe natively, subprocess
callers hit it directly, and a hardlink stays in sync with the venv
on pip upgrades because both names point at the same inode.
2. studio/setup.ps1 Python user Scripts dir is added with default Append
instead of -Position Prepend. That directory holds every pip-installed
user console script (pip, pytest, huggingface-cli, and so on), not
just unsloth, so reordering it silently changed resolution order for
unrelated tools. The new install.ps1 shim at PATH position 0 already
guarantees `unsloth` resolves to the freshly installed copy, so the
Python user Scripts entry only needs to be present, not at the front.
3. The sysconfig lookup in studio/setup.ps1 no longer falls back to
sysconfig.get_path('scripts') when the nt_user scheme dir does not
exist. When setup.ps1 is invoked from an activated venv (a flow the
linked issue actually hits) that fallback returns the venv's Scripts
directory, which would then be added to the persisted User PATH and
re-introduce the python / pip hijack the shim dir is meant to avoid.
Stick strictly to the nt_user scheme; skip the block if it does not
exist on disk.
* Do not crash installer when unsloth.exe shim is locked
The shim update sequence at install.ps1:1095 did a bare Remove-Item /
New-Item HardLink / Copy-Item. Under the script's $ErrorActionPreference
a locked target (most commonly 'unsloth studio' still running while the
user re-invokes the installer) turns the Remove-Item failure into a
terminating error that aborts the install with no actionable message.
The existing shim is perfectly usable in that state, so there is no
reason to abort. Wrap the whole remove/link/copy sequence in a try/catch
that logs the probable cause (Studio still running), points at the fix
(close Studio and re-run), and lets the installer finish with the old
launcher still serving the command.
Also only emit the "added unsloth launcher to PATH" step line when the
launcher was actually (re)created AND the PATH entry was newly added --
previously the message fired even when the shim refresh silently failed,
which was confusing.
* Guard shim PATH entry on existence, use NullString for broadcast delete
Two follow-ups surfaced by the latest review pass:
1. Do not add the shim directory to User PATH when the launcher was not
actually created. Antivirus blocking unsloth.exe, a disk-full volume,
or restrictive filesystem permissions can make both the hardlink and
the copy fallback fail on a fresh install. In that case the existing
sequence would report "added unsloth launcher to PATH" warnings but
still prepend the empty $ShimDir to User PATH -- the user sees an
install that claims success but then cannot resolve `unsloth` in a
new shell. Gate Add-ToUserPath on Test-Path $ShimExe so the PATH
entry is only persisted when the launcher is really there.
2. Pass [NullString]::Value instead of $null to the broadcast-delete
call in Add-ToUserPath. On PowerShell 7.5 and later (running on .NET
9), a bare $null going into [Environment]::SetEnvironmentVariable
can be coerced to an empty string rather than a true .NET null,
which sets the dummy UnslothPathRefresh_XXXXXXXX variable to "" in
HKCU\Environment instead of deleting it. The leaked variable is
visible in System Properties and accumulates one entry per install
run. [NullString]::Value is a PowerShell-specific sentinel that
crosses the interop boundary as a real null and works on both PS 5.1
and PS 7.x. See PowerShell/PowerShell#24637 for the underlying issue.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>
Fixes #4150.
Pre-PR, `_fix_chat_template` only patched templates where a trailing `{{ ... }}` expression followed the last `{% endfor %}`. ChatML templates (Hermes, Magnum, Phi-4, etc.) that end cleanly at `{% endfor %}` with no generation-prompt block were left unchanged, so the outer `fix_chat_template` raised:
```
RuntimeError: Unsloth: The tokenizer `...` does not have a
{% if add_generation_prompt %} for generation purposes.
```
This commonly shows up when a downstream tool (LlamaFactory, Axolotl) re-serializes the tokenizer during LoRA save and strips the generation-prompt block.
This PR adds a second branch to `_fix_chat_template` that fires when:
- the content after the last `{% endfor %}` is empty modulo Jinja `{# ... #}` comments,
- the scrubbed template contains `<|im_start|>` and `<|im_end|>`,
- and the scrubbed template does not already mention `add_generation_prompt`.
The assistant-turn separator is inferred from the template itself (preferring an explicit `'<|im_start|>assistant<sep>'` literal, then the unique `message['role'] + '<sep>'` from role concatenations, then `<|im_sep|>` for Phi-4-mini mixed-separator templates, then `\n`), so Phi-4-style templates are not silently corrupted with the wrong separator.
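The three firing conditions can be sketched as a standalone predicate. This is an illustration under stated assumptions, not the code in `_fix_chat_template`: the function name and both regexes are hypothetical, and "empty" is taken to mean empty after stripping whitespace.

```python
import re

# Jinja comments: {# ... #}, possibly spanning lines.
JINJA_COMMENT = re.compile(r"\{#.*?#\}", re.DOTALL)
# {% endfor %} with optional whitespace-control dashes.
ENDFOR = re.compile(r"\{%-?\s*endfor\s*-?%\}")


def needs_chatml_generation_prompt(template: str) -> bool:
    """True when a ChatML template ends at the last endfor with no generation block."""
    scrubbed = JINJA_COMMENT.sub("", template)
    matches = list(ENDFOR.finditer(scrubbed))
    if not matches:
        return False
    # 1. Nothing but whitespace (comments already removed) after the last endfor.
    if scrubbed[matches[-1].end():].strip():
        return False
    # 2. ChatML tokens must appear outside comments.
    if "<|im_start|>" not in scrubbed or "<|im_end|>" not in scrubbed:
        return False
    # 3. The template must not already handle add_generation_prompt.
    return "add_generation_prompt" not in scrubbed
```

Scrubbing comments before the token check is what keeps the "trap" templates (ChatML tokens only inside a comment) from being rewritten.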
Verified against the existing chat-template corpus:
- Hermes-3, Magnum-v2, Phi-4-mini, Phi-4 multi-sep, ChatML with trailing whitespace, ChatML with trailing Jinja comment, dot-access `message.role`, split-literal `'<|im_start|>assistant'`: all repaired with the correct assistant prefix.
- Already-fixed ChatML templates: idempotent NOP.
- Trap templates with `<|im_start|>` only inside a Jinja comment: correctly not rewritten.
- Llama-3, Gemma-3, Qwen2.5 (non-ChatML): byte-identical.
- Mistral family (5 models including Mistral-Nemo, Mistral-Small-24B, Mixtral): byte-identical, protected both by the structural guard (no ChatML tokens) and the existing name-based exemption in `load_correct_tokenizer`.
- Qwen family (14 models including Qwen2.5, Qwen3, Qwen3-Coder, QwQ, VL, Math, Qwen3-Guard): byte-identical.
End-to-end reproduction: Hermes-3 LoRA SFT, save with stripped chat_template, reload. Pre-PR code path raises the RuntimeError above. Post-PR reload loads cleanly, patches the template at load time, and `apply_chat_template(add_generation_prompt=True)` produces the correct `<|im_start|>assistant\n` prefix.
* fix pass attn implementation
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Studio: add folder browser modal for Custom Folders
The Custom Folders row in the model picker currently only accepts a
typed path. On a remote-served Studio (Colab, shared workstation) that
means the user has to guess or paste the exact server-side absolute
path. A native browser folder picker can't solve this: HTML
`<input type="file" webkitdirectory>` hides the absolute path for
security, and the File System Access API (Chrome/Edge only) returns
handles rather than strings, neither of which the server can act on.
This PR adds a small in-app directory browser that lists paths on the
server and hands the chosen string back to the existing
`POST /api/models/scan-folders` flow.
## Backend
* New endpoint `GET /api/models/browse-folders`:
* `path` query param (expands `~`, accepts relative or absolute; empty
defaults to the user's home directory).
* `show_hidden` boolean to include dotfiles/dotdirs.
* Returns `{current, parent, entries[], suggestions[]}`. `parent` is
null at the filesystem root.
* Immediate subdirectories only (no recursion); files are never
returned.
* `entries[].has_models` is a cheap hint: the directory looks like it
holds models if it is named `models--*` (HF hub cache layout) or
one of the first 64 children is a .gguf/.safetensors/config.json/
adapter_config.json or another `models--*` subfolder.
* Sort order: model-bearing dirs, then plain, then hidden; case-
insensitive alphabetical within each bucket.
* Suggestions auto-populate from HOME, the HF cache root, and any
already-registered scan folders, deduplicated.
* Error surface: 404 for missing path, 400 for non-directory, 403 on
permission errors. Auth-required like the other models routes.
* New Pydantic schemas `BrowseEntry` and `BrowseFoldersResponse` in
`studio/backend/models/models.py`.
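A minimal sketch of the `has_models` hint and the sort buckets described above. The helper names and the choice to put hidden directories in the last bucket even when they hold models are assumptions; the real endpoint may differ.

```python
from itertools import islice
from pathlib import Path
from typing import Tuple

PROBE_LIMIT = 64  # only the first 64 children are inspected
MODEL_SUFFIXES = {".gguf", ".safetensors"}
MODEL_FILE_NAMES = {"config.json", "adapter_config.json"}


def looks_like_model_dir(directory: Path) -> bool:
    """Cheap hint: HF cache naming, or a model-ish child among the first 64."""
    if directory.name.startswith("models--"):
        return True
    try:
        for child in islice(directory.iterdir(), PROBE_LIMIT):
            if child.is_dir():
                if child.name.startswith("models--"):
                    return True
            elif child.suffix.lower() in MODEL_SUFFIXES or child.name in MODEL_FILE_NAMES:
                return True
    except OSError:
        # Unreadable directories simply carry no hint.
        return False
    return False


def sort_key(name: str, has_models: bool) -> Tuple[int, str]:
    """Model-bearing first, then plain, then hidden; case-insensitive within a bucket."""
    if name.startswith("."):
        bucket = 2
    else:
        bucket = 0 if has_models else 1
    return (bucket, name.lower())
```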
## Frontend
* New `FolderBrowser` component
(`studio/frontend/src/components/assistant-ui/model-selector/folder-browser.tsx`)
using the existing `Dialog` primitive. Features:
* Clickable breadcrumb with a `..` row for parent navigation.
* Quick-pick chips for the server-provided suggestions.
* `Show hidden` checkbox.
* In-flight fetch cancellation via AbortController so rapid
navigation doesn't flash stale results.
* Badges model-bearing directories inline.
* `chat-api.ts` gains `browseFolders(path?, showHidden?)` and matching
types.
* `pickers.tsx` adds a folder-magnifier icon next to the existing `Add`
button. Opening the browser seeds it with whatever the user has
already typed; confirming fills the text input, leaving the existing
validation and save flow unchanged.
## What it does NOT change
* The existing text-input flow still works; the browser is additive.
* No new permissions or escalation; the endpoint reads only directories
the server process is already allowed to read.
* No model scanning or filesystem mutation happens from the browser
itself -- it just returns basenames for render.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Studio: cap folder-browser entries and expose truncated flag
Pointing the folder browser at a huge directory (``/usr/lib``,
``/proc``, or a synthetic tree with thousands of subfolders) previously
walked the whole listing and stat-probed every child via
``_looks_like_model_dir``. That is both a DoS shape for the server
process and a large-payload surprise for the client.
Introduce a hard cap of 2000 subdirectory entries and a
``truncated: bool`` field on the response. The frontend renders a small
hint below the list when it fires, prompting the user to narrow the
path. Below-cap directories are unchanged.
Verified end-to-end against the live backend with a synthetic tree of
2050 directories: response lands at 2000 entries, ``truncated=true``,
listing finishes in sub-second time (versus tens of seconds if we were
stat-storming).
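The cap-plus-flag shape can be sketched as follows. This is a simplified illustration, assuming a hypothetical `list_subdirs` helper; it stops enumerating as soon as one entry beyond the cap is seen rather than walking the whole listing.

```python
from pathlib import Path
from typing import List, Tuple

ENTRY_CAP = 2000


def list_subdirs(directory: Path, cap: int = ENTRY_CAP) -> Tuple[List[str], bool]:
    """Return at most `cap` subdirectory names plus a truncated flag."""
    names: List[str] = []
    truncated = False
    for child in directory.iterdir():
        if not child.is_dir():
            continue  # files are never returned
        if len(names) >= cap:
            # One more directory exists than we are willing to return.
            truncated = True
            break
        names.append(child.name)
    return sorted(names, key=str.lower), truncated
```

A directory with exactly `cap` subdirectories returns them all with `truncated` left false, so below-cap and at-cap listings are unchanged.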
* Studio: suggest LM Studio / Ollama dirs + 2-level model probe
Three improvements to the folder-browser, driven by actually dropping
an LM Studio-style install (publisher/model/weights.gguf) into the
sandbox and walking the UX:
## 1. Quick-pick chips for other local-LLM tools
`well_known_model_dirs()` (new) returns paths commonly used by
adjacent tools. Only paths that exist are returned so the UI never
shows dead chips.
* LM Studio current + legacy roots + user-configured
`downloadsFolder` from its `settings.json` (reuses the existing
`lmstudio_model_dirs()` helper).
* Ollama: `$OLLAMA_MODELS` env override, then `~/.ollama/models`,
`/usr/share/ollama/.ollama/models`, and `/var/lib/ollama/.ollama/models`
(the systemd-service install path surfaced in the upstream "where is
everything?" issue).
* Generic user-choice locations: `~/models`, `~/Models`.
Dedup is stable across all sources.
## 2. Two-level model-bearing probe
LM Studio and Ollama both use `root/publisher/model/weights.gguf`.
The previous `has_models` heuristic only probed one level, so the
publisher dir (whose immediate children are model dirs, not weight
files) was always marked as non-model-bearing. Pulled the direct-
signal logic into `_has_direct_model_signal` and added a grandchild
probe so the classic layout is now recognised.
Still O(PROBE^2) worst-case, still returns immediately for
`models--*` names (HF cache layout) and for any direct weight file.
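The two-level probe can be illustrated with a small sketch. The names below are hypothetical stand-ins for `_has_direct_model_signal` and its caller, and the direct signal is reduced to the HF cache name and weight files for brevity.

```python
from itertools import islice
from pathlib import Path
from typing import List

PROBE = 64
WEIGHT_SUFFIXES = (".gguf", ".safetensors")


def _first_children(directory: Path) -> List[Path]:
    try:
        return list(islice(directory.iterdir(), PROBE))
    except OSError:
        return []


def has_direct_model_signal(directory: Path) -> bool:
    """HF cache naming, or a weight file directly inside the directory."""
    if directory.name.startswith("models--"):
        return True
    return any(
        c.is_file() and c.name.lower().endswith(WEIGHT_SUFFIXES)
        for c in _first_children(directory)
    )


def looks_like_model_dir(directory: Path) -> bool:
    """Direct signal, or a child directory that has one (publisher/model/weights)."""
    if has_direct_model_signal(directory):
        return True
    # At most PROBE children, each probing at most PROBE grandchildren.
    return any(
        c.is_dir() and has_direct_model_signal(c)
        for c in _first_children(directory)
    )
```

Under this sketch a publisher directory is flagged because one of its model directories holds a weight file, while the root above the publisher is not, since the probe stops at grandchildren.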
## 3. model_files_here hint on response body
A leaf model dir (just GGUFs, no subdirs) previously rendered as
`(empty directory)` in the modal, confusing users into thinking the
folder wasn't scannable. Added a `model_files_here` count on the
response (capped at 200) and a small hint row in the modal: `N model
files in this folder. Click "Use this folder" to scan it.`
## Verification
Simulated an LM Studio install by downloading the real 84 MB
`unsloth/SmolLM2-135M-Instruct-Q2_K.gguf` into
`~/.lmstudio/models/unsloth/SmolLM2-135M-Instruct-GGUF/`. Confirmed
end-to-end:
* Home listing suggests `~/.lmstudio/models` as a chip.
* Browsing `~/.lmstudio/models` flags `unsloth` (publisher) as
`has_models=true` via the 2-level probe.
* Browsing the publisher flags `SmolLM2-135M-Instruct-GGUF` (model
dir) as `has_models=true`.
* Browsing the model dir returns empty entries but
`model_files_here=1`, and the frontend renders a hint telling the
user it is a valid target.
* Studio: one-click scan-folder add + prominent remove + plain search icon
Three small Custom Folders UX fixes after real-use walkthrough:
* **One-click add from the folder browser**. Confirming `Use this
folder` now submits the path directly to
`POST /api/models/scan-folders` instead of just populating the text
input. `handleAddFolder` takes an optional explicit path so the
submit lands in the same tick as `setFolderInput`, avoiding a
state-flush race. The typed-path + `Add` button flow is unchanged.
* **Prominent remove X on scan folders**. The per-folder delete
button was `text-muted-foreground/40` and hidden entirely on
desktop until hovered (`md:opacity-0 md:group-hover:opacity-100`).
Dropped the hover-only cloak, bumped color to `text-foreground/70`,
added a red hover/focus background, and sized the icon up from
`size-2.5` to `size-3`. Always visible on every viewport.
* **Plain search icon for the Browse button**. `FolderSearchIcon`
replaced with `Search01Icon` so it reads as a simple "find a
folder" action alongside the existing `Add01Icon`.
* Studio: align Custom Folders + and X buttons on the same right edge
The Custom Folders header used `px-2.5` with a `p-0.5` icon button,
while each folder row used `px-3` with a `p-1` button. That put the
X icon 4px further from the right edge than the +. Normalised both
rows to `px-2.5` with `p-1` so the two icons share a column.
* Studio: empty-state button opens the folder browser directly
The first-run empty state for Custom Folders was a text link reading
"+ Add a folder to scan for local models" whose click toggled the
text input. That's the wrong default: a user hitting the empty state
usually doesn't know what absolute path to type, which is exactly
what the folder browser is for.
* Reword to "Browse for a models folder" with a search-icon
affordance so the label matches what the click does.
* Click opens the folder browser modal directly. The typed-path +
Add button flow is still available via the + icon in the
section header, so users who know their path keep that option.
* Slightly bump the muted foreground opacity (70 -> hover:foreground)
so the button reads as a primary empty-state action rather than a
throwaway hint.
* Studio: Custom Folders header gets a dedicated search + add button pair
The Custom Folders section header had a single toggle button that
flipped between + and X. That put the folder-browser entry point
behind the separate empty-state link. Cleaner layout: two buttons in
the header, search first, then add.
* Search icon (left) opens the folder browser modal directly.
* Plus icon (right) toggles the text-path input (unchanged).
* The first-run empty-state link is removed -- the two header icons
cover both flows on every state.
Both buttons share the same padding / icon size so they line up with
each other and with the per-folder remove X.
* Studio: sandbox folder browser + bound caps + UX recoveries
PR review fixes for the Custom Folders folder browser. Closes the
high-severity CodeQL path-traversal alert and addresses the codex /
gemini P2 findings.
Backend (studio/backend/routes/models.py):
* New _build_browse_allowlist + _is_path_inside_allowlist sandbox.
browse_folders now refuses any target that doesn't resolve under
HOME, HF cache, Studio dirs, registered scan folders, or the
well-known third-party model dirs. realpath() is used so symlink
traversal cannot escape the sandbox. Also gates the parent crumb
so the up-row hides instead of 403'ing.
* _BROWSE_ENTRY_CAP now bounds *visited* iterdir entries, not
*appended* entries. Dirs full of files (or hidden subdirs when
show_hidden is False) used to defeat the cap.
* _count_model_files gets the same visited-count fix.
* PermissionError no longer swallowed silently inside the
enumeration / counter loops -- now logged at debug.
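A rough sketch of a realpath-based containment check, for illustration only. The function below is a hypothetical stand-in for `_is_path_inside_allowlist`, written with `os.path.commonpath`; the actual helper may be structured differently.

```python
import os
from typing import Iterable


def is_path_inside_allowlist(candidate: str, allowed_roots: Iterable[str]) -> bool:
    """True when the fully resolved candidate lies under any resolved root."""
    real = os.path.realpath(candidate)  # follows symlinks and collapses ".."
    for root in allowed_roots:
        real_root = os.path.realpath(root)
        try:
            if os.path.commonpath([real, real_root]) == real_root:
                return True
        except ValueError:
            # Raised e.g. for paths on different Windows drives.
            continue
    return False
```

Because both sides are resolved before comparison, a symlink inside an allowed root that points elsewhere is judged by its destination, not by where the link sits.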
Frontend (folder-browser.tsx, pickers.tsx, chat-api.ts):
* splitBreadcrumb stops mangling literal backslashes inside POSIX
filenames; only Windows-style absolute paths trigger separator
normalization. The Windows drive crumb value is now C:/ (drive
root) instead of C: (drive-relative CWD-on-C).
* browseFolders accepts and forwards an AbortSignal so cancelled
navigations actually cancel the in-flight backend enumeration.
* On initial-path fetch error, FolderBrowser now falls back to HOME
instead of leaving the modal as an empty dead end.
* When the auto-add path (one-click "Use this folder") fails, the
failure now surfaces via toast in addition to the inline
paragraph (which is hidden when the typed-input panel is closed).
* Studio: rebuild browse target from trusted root for CodeQL clean dataflow
CodeQL's py/path-injection rule kept flagging the post-validation
filesystem operations because the sandbox check lived inside a
helper function (_is_path_inside_allowlist) and CodeQL only does
intra-procedural taint tracking by default. The user-derived
``target`` was still flowing into ``target.exists`` /
``target.is_dir`` / ``target.iterdir``.
The fix: after resolving the user-supplied ``candidate_path``,
locate the matching trusted root from the allowlist and rebuild
``target`` by appending each individually-validated segment to
that trusted root. Each segment is rejected if it isn't a single
safe path component (no separators, no ``..``, no empty/dot).
The downstream filesystem ops now operate on a Path constructed
entirely from ``allowed_roots`` (trusted) plus those validated
segments, so CodeQL's dataflow no longer sees a tainted source.
Behavior is unchanged for all valid inputs -- only the
construction of ``target`` is restructured. Live + unit tests
all pass (58 selected, 7 deselected for Playwright env).
* Studio: walk browse paths from trusted roots for CodeQL
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Ubuntu <ubuntu@h100-8-cheapest.us-east5-a.c.unsloth.internal>
* Reapply "updated models template mappers. added lfm2.5vl450m to transformers 5…" (#4945)
This reverts commit 33503ea248.
* Add missing gemma-4-31B-it bnb-4bit mapper entry and LFM2.5 upstream namespace for PR #4950
- Add unsloth/gemma-4-31B-it-unsloth-bnb-4bit to __INT_TO_FLOAT_MAPPER so
the int-to-float resolution works for this model (already listed in
TEMPLATE_TO_MODEL_MAPPER but had no mapper entry).
- Add LiquidAI/LFM2.5-1.2B-Instruct to lfm-2.5 TEMPLATE_TO_MODEL_MAPPER
entry so the canonical upstream namespace is mapped consistently with lfm-2.
* Add missing gemma-4-31B-it bnb-4bit Ollama mapping and lfm-2.5 chat template alias
- Add unsloth/gemma-4-31B-it-unsloth-bnb-4bit to OLLAMA_TEMPLATE_TO_MODEL_MAPPER
so Ollama export works for this model (E2B-it and E4B-it bnb-4bit variants were
already present, 31B-it was inconsistently omitted)
- Register CHAT_TEMPLATES["lfm-2.5"] as alias of the lfm-2 template to prevent
KeyError when Studio resolves LFM2.5 models through MODEL_TO_TEMPLATE_MAPPER
* Add missing LFM2 bnb-4bit INT_TO_FLOAT_MAPPER entry
unsloth/LFM2-1.2B-unsloth-bnb-4bit is referenced in model_mappings.py
but had no mapper.py entry, so model resolution would fail when users
load that variant with load_in_4bit=False or when the float name is
used with load_in_4bit=True.
* Fix review findings for PR #16
1. ollama_template_mappers.py: Restore dropped Gemma-4 base model IDs
(E2B, E4B, 31B, 26B-A4B) and add missing google/ upstream IDs to
the gemma4 Ollama mapper for consistency with other gemma entries.
2. mapper.py: Remove self-mapping non-bnb-4bit entries from
__INT_TO_FLOAT_MAPPER that were polluting FLOAT_TO_INT_MAPPER with
lowercase 16-bit names, causing load_in_4bit=True to return bad
model names. Add direct MAP_TO_UNSLOTH_16bit entries to preserve
the google->unsloth 16-bit redirects.
3. mapper.py: Add LFM2.5 MAP_TO_UNSLOTH_16bit redirect so
LiquidAI/LFM2.5-1.2B-Instruct resolves to its unsloth mirror.
* Add review tests for PR #4950
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove top-level test files
These test_*.py files were added at the repo root rather than under tests/.
Removing them from this PR; the production mapper changes remain.
* Add gemma-4-26B-A4B-it mapping
Adds unsloth/gemma-4-26B-A4B-it to __INT_TO_FLOAT_MAPPER as a 2-tuple so
google/gemma-4-26B-A4B-it routes to unsloth/gemma-4-26B-A4B-it across
INT_TO_FLOAT_MAPPER, FLOAT_TO_INT_MAPPER, and MAP_TO_UNSLOTH_16bit.
The 26B-A4B (MoE) model has no bnb-4bit variant, so the key uses the
plain unsloth name rather than the -unsloth-bnb-4bit suffix.
Removes the now-redundant standalone _add_with_lower call for the -it
variant; the 16bit mapping is registered via the dict loop.
* Add unsloth-bnb-4bit mappings for gemma-4 base (non-it) models
Adds E2B, E4B, 31B base unsloth-bnb-4bit entries to __INT_TO_FLOAT_MAPPER.
The 26B-A4B (MoE) base has no bnb-4bit variant on HF, so it stays on the
standalone _add_with_lower line for the 16bit-only routing.
Removes the redundant _add_with_lower lines for E2B, E4B, 31B base since
the dict loop now registers the same google->unsloth route through the
2-tuple entries, plus full FLOAT_TO_INT and INT_TO_FLOAT coverage.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* feat: Add cactus QAT scheme support
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* test(qat): add tests for cactus QAT scheme and fix missing import
* Fix cactus QAT scheme: correct MappingType import, tighten PerGroup filter
- Drop the broken `from torchao.dtypes import MappingType` import. `MappingType`
lives in `torchao.quantization` (and `torchao.quantization.quant_primitives`);
it is not exported from `torchao.dtypes` in any supported torchao release
(verified on 0.14, 0.16, 0.17). The previous code raised `ImportError` on
every cactus call and was masked as a misleading 'torchao not found' error.
- Since `IntxWeightOnlyConfig` already defaults `mapping_type` to
`MappingType.SYMMETRIC`, drop the explicit kwarg entirely and remove the
import. Behavior is unchanged.
- Introduce a named `group_size = 32` constant (matches the int4 / fp8-int4
pattern in the surrounding branches) and add a `% group_size == 0`
divisibility guard to the filter. `PerGroup(32)` requires
`in_features % 32 == 0` at `quantize_()` time, otherwise torchao raises
`ValueError: in_features (N) % group_size (32) must be == 0`. The old
`in_features >= 32` filter would admit non-aligned widths (e.g. 33, 48, 65,
127) and crash `_prepare_model_for_qat` for those shapes.
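The difference between the old and new filter predicates can be shown without torch or torchao. The sketch below works on plain `in_features` integers; the function names are hypothetical and the real filter operates on `nn.Linear` modules.

```python
GROUP_SIZE = 32  # matches PerGroup(32)


def old_filter(in_features: int) -> bool:
    """Previous behavior: admits any width of at least 32."""
    return in_features >= 32


def cactus_filter(in_features: int, group_size: int = GROUP_SIZE) -> bool:
    """New behavior: width must also be an exact multiple of the group size."""
    return in_features >= group_size and in_features % group_size == 0


# Widths the old filter admitted but PerGroup(32) cannot quantize.
misaligned = [w for w in (33, 48, 65, 127) if old_filter(w) and not cactus_filter(w)]
```

Every width in the example list passes the old check and fails the new one, which is the set of shapes that previously crashed at `quantize_()` time.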
* Warn when cactus QAT skips non-divisible Linear layers
Multiple reviewers flagged that the divisibility guard added in the
previous commit can silently leave Linear layers in full precision when
their in_features is not a multiple of 32. For currently supported
Unsloth models (Qwen, Llama, Gemma, Mistral, Phi) every Linear width is
already a multiple of 32/64/128 so this never triggers, but surfacing
the coverage gap is cheap and avoids users assuming 100% QAT coverage
when they bring a custom model with unusual shapes.
Emit a UserWarning listing up to the first 8 skipped layers whenever
the cactus filter excludes any Linear due to the modulo guard. The
layers are still skipped rather than failing the run (consistent with
int4 / fp8-int4), but the skip is no longer silent.
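A sketch of the warning shape, assuming a hypothetical helper that receives a name-to-`in_features` mapping. The message wording is illustrative and is not the text emitted by Unsloth.

```python
import warnings
from typing import Dict, List


def warn_skipped_layers(
    layer_widths: Dict[str, int],
    group_size: int = 32,
    max_listed: int = 8,
) -> List[str]:
    """Warn once, naming up to `max_listed` layers excluded by the modulo guard."""
    skipped = [
        name
        for name, width in layer_widths.items()
        if width >= group_size and width % group_size != 0
    ]
    if skipped:
        shown = ", ".join(skipped[:max_listed])
        extra = len(skipped) - max_listed
        suffix = f" (and {extra} more)" if extra > 0 else ""
        warnings.warn(
            f"cactus QAT skipped {len(skipped)} Linear layer(s) whose in_features "
            f"is not a multiple of {group_size}: {shown}{suffix}",
            UserWarning,
            stacklevel=2,
        )
    return skipped
```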
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* feat: Add support for OLMo-3 model in mapping and tests
* Update unsloth/models/mapper.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Update tests/test_get_model_name.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Fix casing, add Think variants, and align version gate for OLMo-3 PR 4678
Mapper: switch slugs from OLMo-3 to canonical Olmo-3 mixed case, drop the
non-existent unsloth/Olmo-3-7B-Instruct-bnb-4bit dead alias, and add the
already-published Olmo-3-7B-Think and Olmo-3-32B-Think Unsloth mirrors.
Loader: change the olmo3 transformers version gate from Version("4.57.0")
to Version("4.57.0.dev0") so nightly/source builds that already contain
olmo3 are not blocked, matching the OLMo-2, Gemma 3 and Cohere patterns.
* Use canonical Olmo-3 casing and cover Think variants in OLMo-3 tests
Mirrors the mapper.py fixes on pr-4678-code: HuggingFace canonical slugs
for the OLMo-3 family use mixed-case Olmo-3 (not OLMo-3 like OLMo-2), and
Unsloth already hosts Olmo-3-7B-Think and Olmo-3-32B-Think mirrors, so
the resolution matrix now covers all three published Olmo-3 families.
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Studio: refresh Downloaded GGUF list and recurse into variant subdirs
Two fixes for the model picker's "Downloaded" section.
Frontend (`pickers.tsx`):
* `HubModelPicker`'s mount effect short-circuited the cached-gguf and
cached-models refetch whenever the module-level cache already had
entries (`if (alreadyCached) return;`). After downloading a new repo
in the same session, reopening the picker rendered the stale cache
and the new repo never appeared in "Downloaded" until a full page
reload. The early return is removed so the lists are always refreshed
on mount; the module cache still drives the initial render so there
is no spinner flash when we already had data.
Backend (`utils/models/model_config.py`):
* `list_local_gguf_variants` and `_find_local_gguf_by_variant` used a
non-recursive `Path.glob("*.gguf")`. Some HF GGUF repos (e.g.
`unsloth/gemma-4-26B-A4B-it-GGUF`) place the largest quants under a
variant-named subdirectory such as `BF16/...gguf`, which the
top-level glob missed. Both helpers now use `rglob` and the variant
filename is stored as a path relative to the scan root so the
locator can still find the file.
The flat-layout case (variants directly in the snapshot root) is
unchanged: verified against `unsloth/gemma-4-E2B-it-GGUF` which still
returns its UD-Q4_K_XL variant correctly.
* Studio: emit posix-style relative filenames for local GGUF subdirs
`list_local_gguf_variants` was doing `str(f.relative_to(p))`, which on
Windows produces backslash-separated paths like `BF16\foo.gguf`. The
remote `list_gguf_variants` (HF API path) always returns forward-slash
filenames such as `BF16/foo.gguf`, so the two would diverge on Windows.
Switch to `.as_posix()` so the local and remote variant filenames stay
identical across Linux, macOS, and Windows. Verified by simulating with
`PureWindowsPath` in the test suite.
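The divergence is easy to reproduce on any platform with `PureWindowsPath`. The helper name below is hypothetical and the paths are made up for the example.

```python
from pathlib import PurePath, PureWindowsPath


def variant_filename(file: PurePath, scan_root: PurePath) -> str:
    """Relative filename with forward slashes, regardless of path flavour."""
    return file.relative_to(scan_root).as_posix()


root = PureWindowsPath(r"C:\cache\snapshot")
weight = PureWindowsPath(r"C:\cache\snapshot\BF16\foo.gguf")

native = str(weight.relative_to(root))      # backslash-separated on Windows paths
portable = variant_filename(weight, root)   # matches the HF API style
```

Here `native` is `BF16\foo.gguf` while `portable` is `BF16/foo.gguf`, which is the form the remote listing uses.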
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Studio: detect mmproj at snapshot root for nested-variant layouts
When _find_local_gguf_by_variant returns a weight file inside a
quant-named subdir (e.g. snapshot/BF16/foo.gguf), detect_mmproj_file
was scanning only the immediate parent and missing the mmproj file
sitting at the snapshot root. The model was then loaded without
--mmproj, silently breaking vision support for repos that ship
nested variants.
detect_mmproj_file now takes an optional search_root and walks up
from the weight file to that root, in order, so the mmproj at the
snapshot root is picked up. Sibling quant subdirs are not scanned,
so an unrelated variant's mmproj does not leak in.
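The walk-up can be sketched as below. This is an illustration, not `detect_mmproj_file` itself: the function name is hypothetical, and matching on "mmproj" in a `.gguf` filename is an assumption about how the file is recognised.

```python
from pathlib import Path
from typing import List, Optional


def find_mmproj(weight_file: Path, search_root: Optional[Path] = None) -> Optional[Path]:
    """Look for an mmproj GGUF next to the weight file, then in each ancestor up to search_root."""
    start = weight_file.parent
    directories: List[Path] = [start]
    # Only walk upward when search_root really is an ancestor of the weight file.
    if search_root is not None and search_root in start.parents:
        for parent in start.parents:
            directories.append(parent)
            if parent == search_root:
                break
    for directory in directories:
        for candidate in sorted(directory.glob("*.gguf")):
            if "mmproj" in candidate.name.lower():
                return candidate
    return None
```

Only the weight file's own ancestors are visited, so an mmproj sitting in a sibling quant directory is never considered.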
Also apply the suggested micro-optimization on relative_to in
list_local_gguf_variants -- only build the posix path when storing
the first file for a quant.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
The "Patched trl.models.utils.disable_gradient_checkpointing with a no-op"
warning fires once on every Unsloth import, including from notebooks where
the user did not opt into verbose logging. It is a routine integration
patch, not an anomaly the user needs to know about. Gate it on
UNSLOTH_ENABLE_LOGGING=1 like other diagnostic notices.
* Fix grad-accum model_accepts_loss_kwargs detection for vision wrappers
Replace the source-string rewrite of Trainer.__init__ with an instance-level
accepts_loss_kwargs shadow applied on the loaded model. Covers:
1. Unsloth-compiled forward -> True, so HF Trainer does not double-scale
on top of unsloth_fixed_cross_entropy's num_items_in_batch division.
2. Stock forward on a conditional-generation wrapper (Gemma3n, Gemma3
pre-4.57, Qwen-VL family, etc.) where the outer class has no
accepts_loss_kwargs but the inner .model declares False -> False.
This is the case that reproduces issue #4982 under trust_remote_code
or UNSLOTH_COMPILE_DISABLE, where the previous fix's outer-attr
check walked past the inner model and fell through to signature
inspection.
3. Text LMs without any explicit accepts_loss_kwargs -> leave HF default.
The previous .replace()-based patch silently no-ops on transformers 4.48
through 4.52 (variable named model, not unwrapped_model) and is fragile
against any upstream reformat. The new helper walks the PEFT / HF wrapper
chain, finds the first class that declares accepts_loss_kwargs on its own
class dict (type(m).__dict__, not hasattr, to avoid PEFT __getattr__
forwarding), and setattr-shadows that value at every wrapper level so
HF Trainer's hasattr(unwrapped_model, ...) check picks it up at whichever
level accelerate.unwrap_model returns.
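The chain walk can be illustrated with plain Python objects. The sketch below is a simplified model of the idea, not the Unsloth helper: names are hypothetical, and wrappers are assumed to keep their inner model in an instance attribute called `base_model` or `model` (real `nn.Module` wrappers register submodules differently).

```python
from typing import Any, List, Optional


def _inner(m: Any) -> Optional[Any]:
    # Read the instance dict directly so a forwarding __getattr__ is never triggered.
    attrs = getattr(m, "__dict__", {})
    for name in ("base_model", "model"):
        child = attrs.get(name)
        if child is not None and child is not m:
            return child
    return None


def _wrapper_chain(model: Any) -> List[Any]:
    chain, seen, m = [], set(), model
    while m is not None and id(m) not in seen:
        seen.add(id(m))
        chain.append(m)
        m = _inner(m)
    return chain


def shadow_accepts_loss_kwargs(model: Any) -> Optional[bool]:
    """Copy the first class-declared accepts_loss_kwargs onto every wrapper level."""
    chain = _wrapper_chain(model)
    for m in chain:
        # type(m).__dict__ sees only what this class itself declares.
        if "accepts_loss_kwargs" in type(m).__dict__:
            value = type(m).__dict__["accepts_loss_kwargs"]
            break
    else:
        return None  # nothing declared: leave the HF default alone
    for m in chain:
        setattr(m, "accepts_loss_kwargs", value)
    return value
```

After the call, a plain `hasattr` check succeeds at whichever level of the chain is inspected, and models with no declaration anywhere are left untouched.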
Also adds an unconditional post-init clamp of
accelerator.gradient_accumulation_steps = 1 to work around the
transformers 5.0 through 5.5 GradientAccumulationPlugin regression that
makes accelerator.backward divide loss by GA on top of training_step's
own /GA division. Fixed upstream in 5.6.0.dev0; no-op on 4.x and 5.6+.
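A minimal sketch of the wrapper-chain walk described above, using plain Python stand-ins rather than real PEFT / HF classes. The helper names, the `base_model` / `model` attribute names, and the stand-in classes are assumptions made for illustration, not the actual Unsloth helper:

```python
def iter_wrapper_chain(model, inner_attrs=("base_model", "model")):
    """Yield the model and each nested inner module, outermost first."""
    seen = set()
    current = model
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        yield current
        # Read the instance dict directly so a forwarding __getattr__
        # on a wrapper cannot answer on behalf of an inner module.
        instance_attrs = getattr(current, "__dict__", {})
        current = next(
            (instance_attrs[name] for name in inner_attrs
             if instance_attrs.get(name) is not None),
            None,
        )


def resolve_accepts_loss_kwargs(model):
    """Return the first value declared on a class's own dict, else None."""
    for module in iter_wrapper_chain(model):
        declared = type(module).__dict__
        if "accepts_loss_kwargs" in declared:
            return declared["accepts_loss_kwargs"]
    return None


def shadow_accepts_loss_kwargs(model):
    """Copy the resolved value onto every wrapper level as an instance attr."""
    value = resolve_accepts_loss_kwargs(model)
    if value is None:
        return None  # nothing declared anywhere: leave the default alone
    for module in iter_wrapper_chain(model):
        setattr(module, "accepts_loss_kwargs", value)
    return value


# Hypothetical stand-ins: an inner model that declares False, and a wrapper
# that forwards unknown attributes the way a PEFT-style wrapper might.
class InnerModel:
    accepts_loss_kwargs = False


class ForwardingWrapper:
    def __init__(self, inner):
        self.model = inner

    def __getattr__(self, name):
        return getattr(self.__dict__["model"], name)
```

With these stand-ins, `hasattr(wrapper, "accepts_loss_kwargs")` is true before the shadow only because of forwarding; after the shadow, every level carries the value in its own instance dict.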
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Trim comments
* Address review: cover PEFT-after-load and custom compile location
Two review findings from 3/20 reviewers:
1. [3 of 20 reviewers] apply_accepts_loss_kwargs_fix was called from the
loaders before get_peft_model wraps the base model, so on transformers
4.48-4.52 (which does hasattr on the outer model) the instance shadow
on the base model was lost after PEFT wrapping. Fix: also call it from
the wrapped Trainer.__init__ so it runs on whatever model the user
actually hands to Trainer, which is always the final wrapped form.
2. [1 of 20 reviewers] _forward_is_unsloth_compiled hard-coded the
substrings "unsloth_compiled" / "unsloth_cache" in the co_filename
check, which misclassifies compiled forwards when
UNSLOTH_COMPILE_LOCATION is set to a custom directory. Fix: new
_unsloth_compile_cache_leaves helper that reads the env var and
matches the basename against path components, honoring both the
default and any user override.
Verified locally:
- PEFT-after-load simulation: HF's hasattr(peft, "accepts_loss_kwargs")
now returns True after our init wrapper runs, and value resolves to
False on Gemma3n-style inner wrappers.
- Custom UNSLOTH_COMPILE_LOCATION simulation: compiled detection returns
True for /tmp/my_custom_cache/compiled.py when the env var is set.
- End-to-end Gemma-3 270m + LoRA SFT unchanged: loss 4.9626, grad-norm
matches prior run, all 4 wrapper levels now carry the shadowed attr.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(rocm): tighten gfx regex to ignore generic ISA lines
ROCm 6.1+ rocminfo emits generic ISA names such as
"amdgcn-amd-amdhsa--gfx11-generic" and "amdgcn-amd-amdhsa--gfx9-4-generic"
alongside the real GPU name. The previous `gfx[1-9]` regex used in
`_has_rocm_gpu` matched both, so a host with only a generic ISA entry
would be reported as having a usable AMD GPU.
Tighten the pattern to `gfx[1-9][0-9a-z]{2,3}` so only real gfx ids
match. This covers every documented target from GFX6 (gfx600) through
GFX12 (gfx1201), including letter-suffixed ids like gfx90a (MI250 /
MI250X) and gfx90c. Documented generic ISA names always have 1 or 2
digits before the dash and no longer match.
Applied to both `studio/install_python_stack.py` and
`studio/install_llama_prebuilt.py` so the two detection paths agree.
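The difference between the two patterns can be checked with a short sketch. The helper name and the sample lines below are illustrative; only the two regexes and the ISA strings come from the message above:

```python
import re

OLD_GFX = re.compile(r"gfx[1-9]")
NEW_GFX = re.compile(r"gfx[1-9][0-9a-z]{2,3}")


def has_real_gfx_id(rocminfo_output, pattern=NEW_GFX):
    """Return True if any line contains a match for the given gfx pattern."""
    return any(pattern.search(line) for line in rocminfo_output.splitlines())


# Sample text containing only generic ISA entries (no real GPU name).
GENERIC_ONLY = "\n".join(
    [
        "amdgcn-amd-amdhsa--gfx11-generic",
        "amdgcn-amd-amdhsa--gfx9-4-generic",
    ]
)
```

On `GENERIC_ONLY` the old pattern matches and the tightened one does not, while ids such as `gfx600`, `gfx90a`, and `gfx1201` match both.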
Co-authored-by: Martin Hoyer <mhoyer@redhat.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Martin Hoyer <mhoyer@redhat.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Respect classification head skip list on pre-quantized 4-bit checkpoints (#5027)
FastLanguageModel.from_pretrained(..., num_labels=N) crashed with
"NotImplementedError: normal_kernel_cuda not implemented for 'Byte'" on
pre-quantized bnb 4-bit checkpoints (e.g. unsloth/Qwen3-4B-bnb-4bit)
when running on transformers 5.x.
Two pieces were needed to close this out:
1. unsloth_zoo PR: add "score", "classifier", "qa_outputs" to
SKIP_QUANTIZATION_MODULES so replace_with_bnb_linear leaves task
heads in the compute dtype.
2. This commit: for pre-quantized checkpoints, transformers reads
llm_int8_skip_modules from the quantization_config baked into
config.json and ignores the runtime BitsAndBytesConfig we pass via
kwargs. Unsloth must merge its skip list into
model_config.quantization_config.llm_int8_skip_modules before the
from_pretrained call, or the checkpoint's frozen list
(e.g. ["lm_head", "multi_modal_projector", "merger",
"modality_projection"]) wins and the `score` head gets converted to
Linear4bit with uint8 storage, then _init_weights calls normal_ on
uint8 and crashes.
Also add a defensive post-load cast on the task head to guard against
any residual path that ends up with a non-floating head dtype.
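The merge step in item 2 can be sketched on plain lists. The function name is hypothetical, and how the real code reads and writes `quantization_config` is not shown here; this only illustrates an order-preserving, duplicate-free merge:

```python
def merge_skip_modules(checkpoint_skip, extra_skip):
    """Append extra module names to the checkpoint's list without duplicates."""
    merged = list(checkpoint_skip or [])
    for name in extra_skip:
        if name not in merged:
            merged.append(name)
    return merged


# The frozen list and the task-head names quoted in the message above.
FROZEN = ["lm_head", "multi_modal_projector", "merger", "modality_projection"]
TASK_HEADS = ["score", "classifier", "qa_outputs"]
```

`merge_skip_modules(FROZEN, TASK_HEADS)` keeps the checkpoint's entries first and adds the three task heads after them.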
Verified on transformers 4.57.6 and 5.5.0 with:
- unsloth/Qwen3-4B-bnb-4bit + num_labels=3
- unsloth/Qwen3-4B (non-bnb repo, load_in_4bit=True)
- unsloth/Llama-3.2-1B-Instruct + num_labels=3
- unsloth/ModernBERT-large classifier head (bert_classification notebook)
- Regression: causal LM path unchanged, backbone still 4-bit
- 3-step SFT on num_labels=3 confirms gradient flow and weight updates
on score.weight
Fixes unslothai/unsloth#5027
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Fixes #2393.
- `_utils.py`: `has_internet()` now respects `HF_HUB_OFFLINE` with truthy variant parsing in addition to `TRANSFORMERS_OFFLINE`.
- `_utils.py`: replace uncontrolled `except Exception: stats_check()` retry (which had no time limit and could freeze on Kaggle offline mode) with a logged skip.
- `loader.py`: forward `local_files_only` from kwargs into all `AutoConfig.from_pretrained` and `PeftConfig.from_pretrained` probes in `FastLanguageModel.from_pretrained` and `FastModel.from_pretrained`, including the PEFT base-model reload paths.
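A minimal sketch of truthy-variant parsing for the two offline variables. The accepted spellings in `_TRUTHY` and both helper names are assumptions; the actual set used by `has_internet()` is not stated above:

```python
import os

# Assumed set of truthy spellings, compared case-insensitively.
_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name, environ=None):
    """Return True if the named variable holds a truthy spelling."""
    environ = os.environ if environ is None else environ
    return str(environ.get(name, "")).strip().lower() in _TRUTHY


def offline_requested(environ=None):
    """True if either offline variable is set to a truthy value."""
    return env_flag("HF_HUB_OFFLINE", environ) or env_flag(
        "TRANSFORMERS_OFFLINE", environ
    )
```

Passing a dict as `environ` keeps the check testable without mutating the process environment.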
* fix: support GGUF variant selection for non-suffixed repos
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: harden GGUF detection across cached models and picker flows
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* chore: use shared GGUF picker helper for search rows
* fix: avoid mixed cache duplication and preserve GGUF fallback detection
* fix: unify GGUF cache matching and merge picker hints
* fix: normalize local GGUF matching across picker and model config
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: robust cached-gguf classification + hint-aware click routing
- _repo_gguf_size_bytes: treat size_on_disk=None as 0 and dedupe fallback
by commit_hash so partial/interrupted downloads don't TypeError out of
sum() and wipe the entire cached list.
- list_cached_gguf / list_cached_models: narrow per-repo try/except so
one malformed repo no longer poisons the whole response.
- handleModelClick: route through isKnownGgufRepo instead of the
suffix-only isGgufRepo, so non-suffixed GGUF repos still open the
variant expander from every call site.
- Replace the modelIsGgufById/resultIsGgufById Maps with Sets of known
GGUF ids to stop conflating "no hint" with "known not-GGUF".
- Make HfModelResult.isGguf required (it is always set in makeMapModel).
- Add regression tests for the None size case, mixed-repo inclusion in
cached-gguf, and per-repo error isolation.
* fix: exclude mmproj from GGUF classification and case-normalize hint lookups
- _repo_gguf_size_bytes now filters mmproj vision-adapter files so
safetensors+mmproj.gguf repos stay on the cached-models path and
non-GGUF rows no longer show zero pickable variants. A vision-capable
GGUF repo (main weight + mmproj adapter) still classifies as GGUF and
reports the main weight size.
- modelGgufIds / resultGgufIds now key on lowercased ids and
isKnownGgufRepo lowercases its lookup, so store and HF-search ids
that differ only by casing still match the same GGUF hint.
- New regression tests: mmproj-only repo excluded from cached-gguf,
same repo included in cached-models, vision-capable repo still
classified as GGUF with correct size.
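The classification rule above can be sketched over `(filename, size)` pairs. The function names and the input shape are hypothetical; the real `_repo_gguf_size_bytes` works on cache metadata that is not shown here:

```python
def main_gguf_files(files):
    """Keep GGUF weight files, dropping mmproj vision-adapter files."""
    kept = []
    for name, size in files:
        lower = name.lower()
        if lower.endswith(".gguf") and "mmproj" not in lower:
            kept.append((name, size))
    return kept


def repo_gguf_size_bytes(files):
    """Sum main-weight sizes, treating an unknown (None) size as 0."""
    return sum(size or 0 for _, size in main_gguf_files(files))


def is_gguf_repo(files):
    """A repo counts as GGUF only if it has at least one main weight file."""
    return bool(main_gguf_files(files))
```

A safetensors repo that ships only an `mmproj` GGUF adapter is therefore not classified as GGUF, while a repo with a main weight plus adapter reports the main weight size.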
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
* Add configurable PyTorch mirror via UNSLOTH_PYTORCH_MIRROR env var
When set, UNSLOTH_PYTORCH_MIRROR overrides the default
https://download.pytorch.org/whl base URL in all four install scripts
(install.sh, install.ps1, studio/setup.ps1, studio/install_python_stack.py).
When unset or empty, the official URL is used. This lets users behind
corporate proxies or in regions with poor connectivity to pytorch.org
point at a local mirror without patching scripts.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add pytest for UNSLOTH_PYTORCH_MIRROR in install_python_stack.py
Tests that _PYTORCH_WHL_BASE picks up the env var when set, falls back
to the official URL when unset or empty, and preserves the value as-is
(including trailing slashes).
* Remove stale test assertions for missing install.sh messages
* Fix GPU mocking in test_get_torch_index_url.sh
Extract _has_usable_nvidia_gpu and _has_amd_rocm_gpu alongside
get_torch_index_url so the GPU-presence checks work in tests.
Add -L flag handling to mock nvidia-smi so it passes the GPU listing
check. All 26 tests now pass on CPU-only machines.
* Strip trailing slash from UNSLOTH_PYTORCH_MIRROR to avoid double-slash URLs
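A minimal sketch of the resolution order with the trailing-slash strip applied. The function name is hypothetical and the mirror URL in the usage note is a placeholder:

```python
import os

OFFICIAL_WHL_BASE = "https://download.pytorch.org/whl"


def pytorch_whl_base(environ=None):
    """Return the mirror base URL if set and non-empty, else the official one."""
    environ = os.environ if environ is None else environ
    mirror = (environ.get("UNSLOTH_PYTORCH_MIRROR") or "").rstrip("/")
    return mirror or OFFICIAL_WHL_BASE
```

With `UNSLOTH_PYTORCH_MIRROR=https://mirror.example/whl/`, joining the result with a suffix such as `/cpu` yields `https://mirror.example/whl/cpu` rather than a double-slash URL.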
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Studio: hard-stop at n_ctx with a dedicated 'Context limit reached' toast
llama-server's default behavior when the KV cache fills is to silently
drop the oldest non-``n_keep`` tokens and keep generating. The UI has
no way to tell the user that earlier turns were evicted -- they just
see degraded continuity and a confusing ``5,361 / 4,096`` on the
context usage bar.
Launch llama-server with ``--no-context-shift`` so it returns a clean
error once the request would exceed ``n_ctx``. In the chat adapter,
catch the error, identify it as a context-limit error via
``isContextLimitError()``, and surface a dedicated toast that names
the exact control to adjust: the ``Context Length`` field in the chat
Settings panel.
Also add a lightweight tooltip hint on ``ContextUsageBar`` when usage
crosses 85%, so users see the "raise Context Length in Settings"
suggestion before they hit the hard stop.
Tests:
* ``test_llama_cpp_no_context_shift.py`` pins the ``--no-context-shift``
flag in the static launch-command template, and pins it inside the
unconditional ``cmd = [ ... ]`` block so a future refactor can't
hide it behind a branch.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Shorten --no-context-shift comment to 1 line
* Match backend _friendly_error rewrite in isContextLimitError
Codex review on PR caught that ``backend/routes/inference.py::_friendly_error``
rewrites the raw llama-server text
"request (X tokens) exceeds the available context size (Y tokens)"
into
"Message too long: X tokens exceeds the Y-token context window. ..."
on the main streaming GGUF path. The heuristic only looked for
"context size" / "exceeds the available context" / "context shift",
none of which survive the rewrite, so the new "Context limit reached"
toast would never fire for the most common case. Add matches for
"message too long" and "context window" so both wordings hit.
Also addresses Gemini feedback on the launch-flag test:
* Use ``inspect.getsource(LlamaCppBackend.load_model)`` instead of
reading ``__file__`` directly; scopes the assertions to the
function that actually launches llama-server.
* Replace the hardcoded ``" ]"`` indent search with a
line-at-a-time scan for a line that is just ``]``, so the test
survives reformatting.
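The substring heuristic described above, rendered in Python for illustration. The real `isContextLimitError()` lives in the frontend; the case-insensitive comparison and the function name here are assumptions:

```python
# Markers named in the message above: the three original ones plus the two
# added to survive the backend's _friendly_error rewrite.
_CONTEXT_LIMIT_MARKERS = (
    "context size",
    "exceeds the available context",
    "context shift",
    "message too long",
    "context window",
)


def is_context_limit_error(message):
    """Return True if the error text looks like a context-limit failure."""
    text = (message or "").lower()
    return any(marker in text for marker in _CONTEXT_LIMIT_MARKERS)
```

Both the raw llama-server wording and the rewritten "Message too long" wording hit at least one marker, while unrelated errors do not.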
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Studio: split model-load progress label across two rows
The chat flow and training overlay both compose a progress label like
"112.6 of 122.3 GB • 331.0 MB/s • 30s left" and render it next to the
percent badge in a single flex row. Once the rate + ETA part shows up,
the label outgrows the row width and wraps mid-phrase, orphaning the
percent ("19 left %") onto a second ragged line.
Fix in model-load-status.tsx: split the label on the first " • " into
a primary (size) chunk that stays on row 1 with the percent, and a
secondary (rate/ETA) chunk that renders on its own muted row below.
Labels without a bullet (e.g. "22.8 GB downloaded") collapse cleanly
to one row. The inline-status variant keeps only the primary and
surfaces the full label via the tooltip.
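The split rule can be sketched in a few lines of Python. The real helper is in `model-load-status.tsx`; this rendering and its function name are illustrative only:

```python
SEPARATOR = " \u2022 "  # space, bullet, space


def split_progress_label(label):
    """Split on the first bullet into (primary, secondary or None)."""
    primary, found, secondary = label.partition(SEPARATOR)
    return primary, (secondary if found else None)
```

Only the first bullet splits, so the rate and ETA stay together in the secondary chunk, and a label with no bullet comes back with `None` as its secondary part.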
Also extracts the rate/ETA math out of useTransferStats into a pure
``transfer-stats.ts`` module (appendSample + computeTransferStats) so
it can be reasoned about and tested without React. The hook is now a
thin wrapper that feeds sample history through the pure functions.
Backend: adds two companion test files for load_progress():
* test_llama_cpp_load_progress_matrix.py (21 tests) -- platform
matrix (Linux /proc, macOS/Windows absence), VmRSS parsing
variants (tab/space/missing/malformed), filesystem edges (HF-cache
symlinks, broken symlinks, nonexistent paths, relative paths),
shard aggregation (partial multi-shard, two series in same dir,
mmproj-* exclusion, single-file), lifecycle races, concurrent
sampling (10 threads x 50 iters against real /proc), fraction
bounds.
* test_llama_cpp_load_progress_live.py (5 tests) -- no-mock live
integration: real subprocess allocating 100 MB to match VmRSS,
real ready phase, real dead-pid degradation, real shard
aggregation, repeated polling. Skipped on non-Linux.
Both complement the existing test_llama_cpp_load_progress.py.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Hoist splitProgressLabel out of JSX IIFE (review feedback)
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix bitsandbytes ROCm install by using pip instead of uv
* Also use pip for PyPI fallback path in _install_bnb_rocm
The original fix correctly switched the pre-release wheel install from
uv to pip, but left the PyPI fallback path on uv. If uv breaks bnb
on ROCm, the fallback would hit the same issue. Move pip bootstrap
before the branch so both paths use pip consistently.
* Harden pip bootstrap: try ensurepip first, warn on failure
- Try ensurepip --upgrade before falling back to uv pip install pip.
ensurepip works offline and does not need PyPI, making the bootstrap
robust when the network or index is unavailable.
- If both ensurepip and uv fail, emit a visible warning instead of
silently swallowing the error (which previously led to a cryptic
"No module named pip" downstream).
- Use run_maybe_quiet so --verbose users see bootstrap output.
- Update comment to document the actual root cause: uv rejects the
wheel because filename version and metadata version disagree.
* Add --isolated to pip install calls in _install_bnb_rocm
uv pip install ignores pip.conf and PIP_* env vars, but python -m pip
reads them. Without --isolated, users with PIP_INDEX_URL pointing to a
private mirror that does not carry bitsandbytes would see the PyPI
fallback fail where it previously worked under uv. --isolated restores
parity with the old uv behavior.
* Drop --isolated from PyPI fallback in _install_bnb_rocm
--isolated suppresses PIP_INDEX_URL, PIP_EXTRA_INDEX_URL, and pip.conf.
This is correct for the pre-release path (hardcoded GitHub URL, no index
consulted), but breaks the PyPI fallback for users in corporate or
air-gapped environments whose only route to bitsandbytes is a private
mirror configured via those mechanisms. Keep --isolated on the direct-URL
pre-release install; drop it from the index-dependent fallback.
* Drop --isolated from pre-release pip install, fix warning wording
--isolated suppresses pip.conf cert/proxy/CA settings in addition to
index config. For the direct GitHub URL, index config is irrelevant but
cert/proxy settings matter in corporate SSL-inspection environments.
Without this fix, users with pip.conf-based CA bundles get a TLS error
on the pre-release download and silently fall back to the broken PyPI
version -- the exact outcome the PR is trying to prevent.
Also fix the fallback warning: "unreachable" is too specific since the
pre-release install can fail for reasons other than network reachability.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Studio: live model-load progress + rate/ETA on download and load
Two UX fixes for the opaque multi-minute wait between clicking Load
and being able to chat, visible most clearly on large MoE GGUFs like
MiniMax-M2.7 (131 GB of weights on a 97 GB GPU):
1. **Model-load phase is now observable.** The existing chat flow
transitions the toast to "Starting model..." as soon as the
download hits 100%, then shows a spinner with no other feedback
until llama-server reports healthy. For a 130 GB model that spinner
freezes for five-plus minutes while the kernel pages shards into
the page cache. A new `GET /api/inference/load-progress` endpoint
samples `/proc/<pid>/status VmRSS` on the llama-server subprocess
against the sum of shard file sizes on disk, so the UI can render
a real bar plus rate / ETA during that window.
2. **Rate and ETA on downloads and loads.** Both the chat toast and
the training-start overlay used to show a static pair of numbers
(for example "15.4 of 140.8 GB"). A rolling 15-second window over
the existing byte-series now surfaces "85.3 MB/s, 24m 23s left"
beside that pair. The estimator is shared between the download
and load phases so the numbers don't reset when the phase flips.
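A rough sketch of a rolling-window estimator of the kind described in item 2. The function names, the sample representation, and the oldest-to-newest rate formula are assumptions; only the 15-second window comes from the text:

```python
WINDOW_SECONDS = 15.0


def append_sample(samples, timestamp, done_bytes, window=WINDOW_SECONDS):
    """Return a new sample list with the new point added and stale ones dropped."""
    kept = [(ts, b) for ts, b in samples if timestamp - ts <= window]
    kept.append((timestamp, done_bytes))
    return kept


def compute_transfer_stats(samples, total_bytes):
    """Return (bytes_per_second, eta_seconds); either may be None."""
    if len(samples) < 2:
        return None, None
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    elapsed = t1 - t0
    moved = b1 - b0
    if elapsed <= 0 or moved <= 0:
        return None, None  # stalled or not enough history yet
    rate = moved / elapsed
    eta = max(total_bytes - b1, 0) / rate if total_bytes else None
    return rate, eta
```

Because both functions are pure, the same sample history can be carried from the download phase into the load phase without resetting the numbers.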
Also fixes a pre-existing assignment bug uncovered while wiring this
up: `load_model` was storing the caller's `gguf_path` kwarg into
`self._gguf_path`, which is `None` on the HF-download code path. The
resolved on-disk path (`model_path`) is what llama-server actually
mmaps; downstream consumers need that. No existing reader used
`_gguf_path`, so this is a correctness fix for the new endpoint.
- Backend: `LlamaCppBackend.load_progress()`, `GET /api/inference/load-progress`, `LoadProgressResponse` Pydantic model.
- Frontend: `useTransferStats` hook, `formatRate` / `formatEta` helpers, `getLoadProgress` client, rewired chat toast and `DownloadRow` in the training overlay.
- Tests: `studio/backend/tests/test_llama_cpp_load_progress.py` covers empty states, mmap phase, ready phase, sharded total aggregation, missing gguf_path, and unreadable /proc (7 cases). `tsc -b` and `vite build` on the frontend both clean.
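A minimal sketch of the VmRSS sampling idea, operating on the text of a `/proc/<pid>/status` file. The function names and the clamping to 1.0 are assumptions; the actual `load_progress()` implementation is not reproduced here:

```python
def parse_vmrss_bytes(status_text):
    """Extract VmRSS from /proc/<pid>/status text, in bytes, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            fields = line.split()
            # Expected shape: ["VmRSS:", "<number>", "kB"]
            if len(fields) >= 2 and fields[1].isdigit():
                return int(fields[1]) * 1024
            return None
    return None


def load_fraction(rss_bytes, total_shard_bytes):
    """Resident bytes over total shard bytes, clamped to [0.0, 1.0]."""
    if not rss_bytes or not total_shard_bytes:
        return 0.0
    return min(rss_bytes / total_shard_bytes, 1.0)
```

Returning `None` for a missing or malformed line lets the caller degrade to "no progress available" instead of raising.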
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* studio: pin peft to 0.18.1 to fix export subprocess issues
peft 0.19.0 causes export subprocess shutdown failures in Studio.
Reverting to 0.18.1 resolves the issue.
* studio: move peft pin to extras-no-deps to prevent torch upgrade
Installing peft via overrides.txt would resolve its deps and pull in
torch>=0.11.0, breaking other pinned packages. Moving the pin to
extras-no-deps.txt ensures --no-deps is used during install.
* Fix num_items_in_batch GA for Gemma4
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* studio: stream export worker output into the export dialog
The Export Model dialog only showed a spinner on the "Exporting..."
button while the worker subprocess was doing the actual heavy lifting.
For Merged to 16bit and GGUF / Llama.cpp exports this meant several
minutes (or more, for large models) of opaque silence, with no way to
tell whether save_pretrained_merged, convert_hf_to_gguf.py, or
llama-quantize was making progress.
This adds a live terminal-style output panel inside the export dialog,
rendered just above the Cancel / Start Export buttons and scrollable
with auto-follow-tail. It shows stdout and stderr from both the worker
process itself and any child process it spawns (GGUF converter,
llama-quantize), coloured by stream.
Backend
- core/export/worker.py: new _setup_log_capture(resp_queue) installed
before LogConfig.setup_logging. It saves the original stdout/stderr
fds, creates pipes, os.dup2's the write ends onto fds 1 and 2 (so
every child process inherits the redirected fds), and spins up two
daemon reader threads. Each thread reads bytes from a pipe, echoes
them back to the original fd (so the server console keeps working),
splits on \n and \r, and forwards each line to the resp queue as
{"type":"log","stream":"stdout|stderr","line":...,"ts":...}.
PYTHONUNBUFFERED=1 is set so nested Python converters flush
immediately.
- core/export/orchestrator.py:
- Thread-safe ring buffer (collections.deque, maxlen 4000) with a
monotonically increasing seq counter. clear_logs(),
get_logs_since(cursor), get_current_log_seq(), is_export_active().
- _wait_response handles rtype == "log" by appending to the buffer
and continuing the wait loop. Status messages are also surfaced as
a "status" stream so users see high level progress alongside raw
subprocess output.
- load_checkpoint, _run_export, and cleanup_memory now wrap their
bodies with the existing self._lock (previously unused), clear the
log buffer at the start of each op, and flip _export_active in a
try/finally so the SSE endpoint can detect idle.
- routes/export.py:
- Wrapped every sync orchestrator call (load_checkpoint,
cleanup_memory, export_merged_model, export_base_model,
export_gguf, export_lora_adapter) in asyncio.to_thread so the
FastAPI event loop stays free during long exports. Without this
the new SSE endpoint could not be served concurrently with the
blocking export POST.
- New GET /api/export/logs/stream SSE endpoint. Honors
Last-Event-ID and a since query param for reconnect, emits log /
heartbeat / complete / error events, uses the id field to carry
the log seq so clients can resume cleanly. On first connect
without an explicit cursor it starts from the current seq so old
lines from a previous run are not replayed.
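The ring buffer with a monotonically increasing seq can be sketched as follows. The class name and entry shape are hypothetical; the method names mirror those listed above, and `clear_logs` deliberately leaves the seq counter untouched:

```python
import threading
from collections import deque


class ExportLogBuffer:
    """Bounded, thread-safe log buffer addressed by a monotonically increasing seq."""

    def __init__(self, maxlen=4000):
        self._lock = threading.Lock()
        self._entries = deque(maxlen=maxlen)
        self._seq = 0

    def append(self, stream, line):
        with self._lock:
            self._seq += 1
            self._entries.append({"seq": self._seq, "stream": stream, "line": line})
            return self._seq

    def get_logs_since(self, cursor):
        """Return entries with seq strictly greater than the cursor."""
        with self._lock:
            return [entry for entry in self._entries if entry["seq"] > cursor]

    def get_current_log_seq(self):
        with self._lock:
            return self._seq

    def clear_logs(self):
        """Drop buffered lines but keep counting, so old cursors stay valid."""
        with self._lock:
            self._entries.clear()
```

A client that reconnects with its last seen seq as the cursor receives only newer lines, and lines evicted by the `maxlen` bound are simply no longer returned.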
Frontend
- features/export/api/export-api.ts: streamExportLogs() helper that
authFetches the SSE endpoint and parses id / event / data fields
manually (same pattern as streamTrainingProgress in train-api.ts).
- features/export/components/export-dialog.tsx:
- Local useExportLogs(exporting) hook that opens the SSE stream when
exporting transitions to true, accumulates up to 4000 lines in
component state, and aborts on cleanup.
- New scrollable output panel rendered above DialogFooter, only
shown for Merged to 16bit and GGUF / Llama.cpp (LoRA adapter is
a fast disk write with nothing to show). Dark terminal styling
(bg-black/85, emerald text, rose for stderr, sky for status),
max-height 14rem, auto-scrolls to the bottom on new output but
stops following if the user scrolls up. A small streaming / idle
indicator is shown next to the panel title.
- DialogContent widens from sm:max-w-lg to sm:max-w-2xl when the
output panel is visible so the logs have room to breathe.
Verified
- Python smoke test (tests/smoke_export_log_capture.py): spawns a
real mp.get_context("spawn") process, installs _setup_log_capture,
confirms that parent stdout prints, parent stderr prints, AND a
child subprocess invoked via subprocess.run (both its stdout and
stderr) are all captured in the resp queue. Passes.
- Orchestrator log helpers tested in isolation: _append_log,
get_logs_since (with and without a cursor), clear_logs not
resetting seq so reconnecting clients still progress. Passes.
- routes.export imports cleanly in the studio venv and /logs/stream
shows up in router.routes.
- bun run build: tsc -b plus vite build, no TypeScript errors.
No existing export behavior is changed. If the subprocess, the SSE
endpoint, or the frontend hook fails, the export itself still runs to
completion the same way it did before, with or without logs visible.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* export dialog: trim bootstrap noise, scope logs per screen, show realpath
Several follow-ups to the live export log work:
1. Worker bootstrap noise (transformers venv activation, Unsloth banner,
"Top GGUF/hub models" lists, vision detection, 2k-step weight load
bar) is dropped from the export-dialog stream. A threading.Event
gate in worker.py defaults closed and only opens once _handle_export
actually starts; until then the reader thread still echoes lines to
the saved console fd for debugging but does not push them onto the
resp_queue. The orchestrator already spawns a fresh subprocess for
every checkpoint load, so the gate is naturally reset between runs.
2. tqdm in non-tty mode defaults to a 10s mininterval, which makes
multi-step bars look frozen in the panel. Set TQDM_MININTERVAL=0.5
in the worker env so any tqdm-driven progress emits more often.
3. The dialog's useExportLogs hook now also clears its line buffer
when exportMethod or open changes, so re-opening the dialog into a
different action's screen no longer shows the previous action's
saved output. A useElapsedSeconds tick + "Working Xs" badge in the
log header gives users a visible sign that long single-step phases
(cache copies, GGUF conversion) are still running when no new lines
are arriving.
4. ExportBackend.export_{merged,base,gguf,lora} now return
(success, message, output_path); the worker forwards output_path on
each export_*_done response, the orchestrator's _run_export passes
it to routes/export.py, which surfaces it via
ExportOperationResponse.details.output_path. The dialog's Export
Complete screen renders the resolved on-disk realpath under "Saved
to" so users can find their exported model directly.
* fix(cli): unpack 3-tuple return from export backend
ExportOrchestrator.export_{merged,base,gguf,lora} now return
(success, message, output_path) so the studio dialog can show
the on-disk realpath. The CLI still unpacked 2 values, so every
`unsloth export --format ...` crashed with ValueError before
reporting completion. Update the four call sites and surface
output_path via a "Saved to:" echo.
* fix(studio): anchor export log SSE cursor at run start
The export dialog SSE defaulted its cursor to get_current_log_seq()
at connect time, so any line emitted between the POST that kicks
off the export and the client opening the stream was buffered with
seqs 1..k and then skipped (seq <= cursor). Long-running exports
looked silent during their first seconds.
Snapshot _log_seq into _run_start_seq inside clear_logs() and
expose it via get_run_start_seq(). The SSE default cursor now uses
that snapshot, so every line emitted since the current run began
is reachable regardless of when the client connects. Old runs
still can't leak in because their seqs are <= the snapshot.
* fix(studio): reconnect export log SSE on stream drop
useExportLogs launched streamExportLogs once per exporting
transition and recorded any drop in .catch(). Long GGUF exports
behind a proxy with an idle kill-timeout would silently lose the
stream for the rest of the run even though the backend already
supports Last-Event-ID resume. The "retry: 3000" directive emitted
by the backend is only meaningful to native EventSource; this
hook uses a manual fetch + ReadableStream parse so it had no
effect.
Wrap streamExportLogs in a retry loop that tracks lastSeq from
ExportLogEvent.id and passes it as since on reconnect. Backoff is
exponential with jitter, capped at 5s, reset on successful open.
The loop stops on an explicit backend `complete` event or on effect
cleanup.
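A sketch of the delay schedule, in Python for illustration. The 0.25 s base and the jitter range of 50% to 100% are assumptions; only "exponential with jitter, capped at 5s" comes from the message above:

```python
import random


def reconnect_delay(attempt, base=0.25, cap=5.0, rng=random.random):
    """Delay in seconds before reconnect number `attempt` (0-based)."""
    ceiling = min(cap, base * (2 ** attempt))
    # Scale into [50%, 100%) of the ceiling so clients do not retry in lockstep.
    return ceiling * (0.5 + 0.5 * rng())
```

The caller would reset `attempt` to 0 after a successful open and pass the last seen seq as `since` on the next connect.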
* fix(studio): register a second command so Typer keeps `export` as a subcommand
The CLI export unpacking tests wrap `unsloth_cli.commands.export.export`
in a fresh Typer app with a single registered command. Typer flattens a
single-command app into that command, so the test's
`runner.invoke(cli_app, ["export", ckpt, out, ...])` treats the leading
`"export"` token as an unexpected extra positional argument -- every
parametrized case failed with:
Got unexpected extra argument (.../out)
Register a harmless `noop` second command so Typer preserves subcommand
routing and the tests actually exercise the 3-tuple unpack path they
were written to guard.
Before: 4 failed
After: 4 passed
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: studio-install <studio@local.install>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
* studio: show HF model download progress in training start overlay
During the training setup phase, the overlay only displayed a static
"Loading model..." line while model weights were being downloaded from
Hugging Face. On slow connections this looked like the app had frozen.
This adds a small self-contained progress block inside the existing
TrainingStartOverlay that polls the existing
GET /api/models/download-progress endpoint and renders a Progress bar
with bytes downloaded, total bytes, and percent complete.
Notes:
- Frontend-only change. No backend, worker, SSE, or runtime store edits.
- Reuses the existing getDownloadProgress client wrapper and the
existing /api/models/download-progress endpoint that already scans
the HF blob cache for completed and .incomplete files.
- selectedModel is read directly from useTrainingConfigStore inside the
overlay, so no prop drilling and live-training-view.tsx is unchanged.
- Polling runs at 1500 ms and is gated on the HF repo regex
(^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+$), the same regex the backend uses,
so local paths and empty form state never hit the endpoint.
- Polling stops once progress reaches 1.0 so the bar can stay at 100
until the overlay hides on the first training step.
- Network errors are silently swallowed, matching the chat side flow
(the bar simply freezes at the last value).
- When downloadedBytes is 0 the block is hidden entirely, so cached
models do not flash a progress bar.
- When the HF API cannot determine the total size, the block falls
back to "X downloaded" with no percent and no bar.
Verified with bun run build (tsc -b plus vite build, no TypeScript
errors).
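The polling gate from the notes above can be sketched with the same regex. The function name is hypothetical and the check is shown in Python rather than in the frontend code:

```python
import re

# The HF repo id pattern quoted in the notes above.
HF_REPO_ID = re.compile(r"^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+$")


def should_poll_download_progress(selected_model):
    """Poll only for values shaped like an HF repo id (owner/name)."""
    return bool(selected_model) and HF_REPO_ID.match(selected_model) is not None
```

Local paths contain extra separators or a leading slash, so they fail the match and never reach the endpoint.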
* training overlay: track dataset download + show on-disk realpath
Adds a dedicated "Downloading dataset..." section to the training-start
overlay alongside the existing model-weights one, so an HF dataset that
is downloading mid-startup is no longer mislabeled as model weights or
hidden entirely. The new GET /api/datasets/download-progress endpoint
mirrors /api/models/download-progress against the datasets-- prefix in
HF_HUB_CACHE.
Both endpoints now also return cache_path, the resolved on-disk
realpath of the snapshot directory (or the cache repo root if no
snapshot is materialized yet). The overlay surfaces this under each
download row so users can immediately see where the model and dataset
landed without digging through server logs.
The frontend's existing useModelDownloadProgress hook is generalized
to a single useHfDownloadProgress(repoId, fetcher) hook that the
model and dataset variants both delegate to, keeping polling, gating,
and completion semantics in one place.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Studio: Polish training start overlay download progress UI (#4957)
* studio: polish training start overlay download progress visuals
* Fix formatCachePath cross-platform support and redundant sizeLabel
- Extend formatCachePath regex to also shorten macOS /Users/<user> paths to ~
- Suppress sizeLabel when no byte info is available (cachePath-only state),
since the "Preparing" badge already conveys the status
* Fix misleading status badge when download total is unknown
- Hide badge when totalBytes is 0 but downloadedBytes > 0, since we cannot
determine if the download is still in progress or already complete (happens
when HF size metadata lookup fails for gated/private repos)
- Keep "Preparing" badge for the zero-bytes cachePath-only state
- Add Windows native path shortening to formatCachePath (C:\Users\<name>)
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
---------
Co-authored-by: studio-install <studio@local.install>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>
* Studio: anchor ctx-slider warning threshold at 4096 when weights exceed VRAM
The chat settings sheet's ctx slider reads `max_context_length` from
`/api/inference/status` and renders
Exceeds estimated VRAM capacity (N tokens). The model may use
system RAM.
when the user drags the slider above that value. For models whose
weights fit on some GPU subset, `_max_context_length` was already set
to the binary-search cap and the warning fired correctly.
For models whose weights exceed 90% of every GPU subset's free memory
(e.g. MiniMax-M2.7-GGUF at 131 GB on a 97 GB GPU), the ceiling-probe
loop never matched a subset, so `max_available_ctx` stayed at the
native context (e.g. 196608). The slider ran all the way to native
with no indication that any value above the 4096 spec default would
trigger `--fit on` and degrade performance.
Anchor `max_available_ctx` at `min(4096, native_context_length)` when
no subset fits, so the warning fires at the right threshold and the
user sees the correct safe-zone / warning-zone split:
Before (MiniMax-M2.7 on 97 GB GPU):
slider 0 .. 196608, warning threshold = 196608 (never fires)
After:
slider 0 .. 196608, warning threshold = 4096 (fires correctly)
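A minimal sketch of the anchoring rule above, not the actual LlamaCppBackend code. The function name and the `fitted_ctx` parameter are hypothetical; only the `min(4096, native_context_length)` fallback comes from this commit message.

```python
from typing import Optional

SPEC_DEFAULT_CTX = 4096  # fallback value named in the commit message


def warning_threshold(native_ctx: int, fitted_ctx: Optional[int]) -> int:
    """Return the context length above which the slider should warn.

    fitted_ctx is the binary-search cap when some GPU subset fits the
    weights, or None when no subset fits (hypothetical parameter).
    """
    if fitted_ctx is not None:
        return fitted_ctx
    # No subset fits: anchor at 4096, but never above the native context.
    return min(SPEC_DEFAULT_CTX, native_ctx)
```

With a native context of 196608 and no fitting subset this yields 4096. A model whose native context is below 4096 keeps its native value.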
No frontend changes required: `chat-settings-sheet.tsx` already
consumes `ggufMaxContextLength` (= status.max_context_length) as the
warning threshold and `ggufNativeContextLength` as the slider max.
Adds tests/test_llama_cpp_max_context_threshold.py covering
weights-exceed-VRAM (single / multi-GPU), a native-ctx below the 4096
fallback case (don't lie about supported ctx), fittable-model
regressions (small / multi-GPU / tiny on huge GPU), and the
`max_context_length` property's fallback semantics.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Studio: make GGUF disk-space preflight cache-aware
The pre-download disk check in LlamaCppBackend.load_model compared the
repo's total GGUF size against free disk without crediting bytes
already present in the Hugging Face cache. Re-loading a large cached
model (e.g. MiniMax-M2.7-GGUF at 131 GB) then failed cold with
"Not enough disk space to download any variant" whenever free disk
was below the full weight footprint, even though nothing actually
needed to be downloaded.
Subtract bytes already on disk via try_to_load_from_cache before
comparing against free space. A partial blob (interrupted download) is
not credited, so a second attempt still allocates room to finish the
download. The log line now also surfaces how much is already cached.
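The crediting logic can be sketched roughly as follows. This is an illustration under assumptions, not the code in load_model: `bytes_still_needed`, `has_room`, and the injected `cached_path_for` callable are hypothetical names. `cached_path_for` stands in for a lookup such as `huggingface_hub.try_to_load_from_cache`, which can also return a sentinel for files known not to exist; this sketch ignores that case.

```python
from typing import Callable, Dict, Optional


def bytes_still_needed(
    file_sizes: Dict[str, int],
    cached_path_for: Callable[[str], Optional[str]],
) -> int:
    """Sum the sizes of files that are not yet fully present in the cache.

    cached_path_for returns a path when the complete file is cached and
    None otherwise, so an interrupted (partial) download is not credited.
    """
    return sum(
        size for name, size in file_sizes.items() if cached_path_for(name) is None
    )


def has_room(
    file_sizes: Dict[str, int],
    cached_path_for: Callable[[str], Optional[str]],
    free_bytes: int,
) -> bool:
    """Compare only the uncached remainder against free disk space."""
    return bytes_still_needed(file_sizes, cached_path_for) <= free_bytes
```

A fully cached repo needs zero additional bytes, so the check passes regardless of how little free disk remains.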
Adds tests/test_llama_cpp_cache_aware_disk_check.py covering the
fully-cached, partial-cache-insufficient-disk, partial-cache-enough-disk,
cold-cache, incomplete-blob, and zero-size-path-info cases. Sparse
tempfiles keep the GB-scale scenarios cheap to simulate.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Studio: honor explicit GGUF ctx and default to 4096 when weights exceed VRAM
The load-time auto-fit in LlamaCppBackend.load_model had two issues for
models whose weights do not fit on any GPU subset (the common case for
large MoE GGUFs such as MiniMax-M2.7, Qwen3.5-397B-A17B, etc.):
1. Auto mode (max_seq_length=0) left effective_ctx at the model's native
context when no subset passed the 90% fit check. The UI slider then
landed on e.g. 196608 for MiniMax-M2.7, far above anything usable.
Default the auto-pick to 4096 so the UI starts at a sane value; the
slider ceiling stays at the native context so the user can still
opt in to longer contexts and receive the "might be slower" warning.
2. Explicit ctx was silently shrunk when weights fit but the requested
KV overflowed the 90% budget. The shrink loop emitted -c <capped>
-ngl -1 without informing the caller, so a user who had opted into
a longer context via the UI never actually got it. Drop the shrink
loop on the explicit path and emit -c <user_ctx> --fit on instead,
letting llama-server flex -ngl (CPU layer offload).
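A rough sketch of the explicit-ctx path in item 2, assuming a hypothetical helper `explicit_ctx_args` and a boolean `kv_fits_budget` that summarises the 90% budget check. The `-ngl -1` branch for the fitting case is an assumption based on the flags quoted above; the overflow branch follows the commit message.

```python
from typing import List


def explicit_ctx_args(user_ctx: int, kv_fits_budget: bool) -> List[str]:
    """Build context flags for a user-chosen context length.

    The requested context is never shrunk. When the KV cache overflows
    the budget, llama-server is asked to fit layers itself instead.
    """
    args = ["-c", str(user_ctx)]
    if kv_fits_budget:
        # Assumed: everything fits, so offload all layers to the GPU.
        args += ["-ngl", "-1"]
    else:
        # From the commit message: keep user_ctx and let --fit adjust -ngl.
        args += ["--fit", "on"]
    return args
```

The key property is that `user_ctx` appears unchanged in both branches, so the caller always gets the context length it asked for.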
Adds tests/test_llama_cpp_context_fit.py covering both paths, the
file-size-only fallback when KV metadata is missing, non-regression on
fittable auto-pick, and platform-agnostic input shape.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [Studio] Install flash attn at setup time for linux
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* cleanup changes
Signed-off-by: Datta Nimmaturi <venkatadattasainimmaturi@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Test cases
* wheel_utils: narrow url_exists exceptions and log at debug level
---------
Signed-off-by: Datta Nimmaturi <venkatadattasainimmaturi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
* Show non exported models in chat UI
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Distinguish between LoRA and full fine-tune saves. Cleanup
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
* fix(studio): default chart view to full training history instead of last 80 steps
Fixes #5003
* chore: windowsize as null code comment
---------
Co-authored-by: imagineer99 <samleejackson0@gmail.com>
Co-authored-by: Wasim Yousef Said <wasimysdev@gmail.com>
* fix: polish clipboard style and fix async clipboard path
* Use copyToClipboardAsync in CopyButton for Safari fallback
CopyButton was calling navigator.clipboard.writeText directly,
bypassing the execCommand fallback added in this same PR. Switch
to copyToClipboardAsync which tries execCommand first (Safari
user-gesture requirement) then falls back to the async clipboard API.
* Fix copyToClipboard sync contract regression and improve async path
- Restore copyToClipboard() to return only the execCommand result,
preserving the boolean contract that 7 existing callers depend on
to gate their "Copied!" UI state. The fire-and-forget async fallback
was returning true before the promise resolved, causing false success.
- Add document.body null guard to copyWithExecCommand for SSR safety.
- Reorder copyToClipboardAsync to try the async Clipboard API first,
avoiding unnecessary DOM/focus overhead in Radix focus-trapped dialogs
where execCommand always fails anyway.
* Restore queryCommandSupported guard and fix async catch path
- Restore the queryCommandSupported("copy") guard in copyToClipboard()
to match the original contract exactly: when execCommand is entirely
unsupported, fall through to fire-and-forget async clipboard write.
- Fix copyToClipboardAsync catch block: after navigator.clipboard.writeText
rejects, the user-gesture frame is gone, so execCommand will also fail.
Return false from catch instead of falling through. The execCommand
fallback at the bottom only runs when the Clipboard API is absent
(still in user-gesture frame).
* Restore execCommand fallback in copyToClipboardAsync catch path
The catch block was returning false after clipboard API rejection,
based on the incorrect premise that the user-gesture frame is lost
after an await. Per the HTML spec, transient user activation IS
preserved through promise microtask chains. The real reason
execCommand fails in the Radix dialog is the focus trap intercepting
textarea.focus(), not gesture loss.
For non-dialog callers, execCommand can still succeed after a
clipboard rejection. Inside a Radix modal, execCommand returns
false harmlessly (focus trap blocks it).
* Harden textarea fallback for mobile and continue to async path on failure
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
* fix(studio): skip training status/metrics polling when idle
Add an early return in the status and metrics setInterval callbacks when
the runtime store reports phase === "idle" and hasHydrated is true.
Previously these polls fired unconditionally every 3s/5s, generating
unnecessary network traffic and console errors when no training was
running.
* fix(studio): reduce idle polling to 30s instead of stopping entirely
Review feedback (PR #4988): completely stopping polling when idle risks
permanent UI desync if hydration fails, and misses out-of-band state
changes from other clients. Add a 30s background poll that only fires
when idle to recover gracefully.
* fix: harden idle status polling around hydration and runtime reset
---------
Co-authored-by: AdamPlatin123 <AdamPlatin123@users.noreply.github.com>
Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>
Co-authored-by: imagineer99 <samleejackson0@gmail.com>
* Studio: add API key authentication for programmatic access
External users want to hit the Studio API (chat completions with tool
calling, training, export, etc.) without going through the browser
login flow. This adds sk-unsloth- prefixed API keys that work as a
drop-in replacement for JWTs in the Authorization: Bearer header.
Backend:
- New api_keys table in SQLite (storage.py)
- create/list/revoke/validate functions with SHA-256 hashed storage
- API key detection in _get_current_subject before the JWT path
- POST/GET/DELETE /api/auth/api-keys endpoints on the auth router
Frontend:
- /api-keys page with create form, one-time key reveal, keys table
- API Keys link in desktop and mobile navbar
- Route registered with requireAuth guard
Zero changes to any existing route handler -- every endpoint that uses
Depends(get_current_subject) automatically works with API keys.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use actual origin in API key usage examples
The examples on /api-keys were hardcoded to localhost:8888 which is
wrong for remote users. Use window.location.origin so the examples
show the correct URL regardless of where the user is connecting from.
* Add `unsloth studio run` CLI command for one-liner model serving
Adds a `run` subcommand that starts Studio, loads a model, creates an
API key, and prints a ready-to-use curl command -- similar to
`ollama run` or `vllm serve`.
Usage: unsloth studio run -m unsloth/Qwen3-1.7B-GGUF --gguf-variant UD-Q4_K_XL
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add end-to-end tests for `unsloth studio run` and API key usage
Tests the 4 usage examples from the API Keys page:
1. curl basic (non-streaming) chat completions
2. curl streaming (SSE) chat completions
3. OpenAI Python SDK streaming completions
4. curl with tools (web_search + python)
Also tests --help output, invalid key rejection, and no-key rejection.
All 7 tests pass against Qwen3-1.7B-GGUF.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add /v1/completions, /v1/embeddings, /v1/responses endpoints and --parallel support
- llama_cpp.py: accept n_parallel param, pass to llama-server --parallel
- run.py: plumb llama_parallel_slots through to app.state
- inference.py: add /completions and /embeddings as transparent proxies to
llama-server, add /responses as application-level endpoint that converts
to ChatCompletionRequest; thread n_parallel through load_model
- studio.py: set llama_parallel_slots=4 for `unsloth studio run` path
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Make /v1/responses endpoint match OpenAI Responses API format
The existing /v1/responses shim returned Chat Completions format, which
broke OpenAI SDK clients using openai.responses.create(). This commit
replaces the endpoint with a proper implementation that:
- Returns `output` array with `output_text` content parts instead of
`choices` with `message`
- Uses `input_tokens`/`output_tokens` instead of `prompt_tokens`/
`completion_tokens` in usage
- Sets `object: "response"` and `id: "resp_..."`
- Emits named SSE events for streaming (response.created,
response.output_text.delta, response.completed, etc.)
- Accepts all OpenAI Responses API fields (tools, store, metadata,
previous_response_id) without erroring -- silently ignored
- Maps `developer` role to `system` and `input_text`/`input_image`
content parts to the internal Chat format
Adds Pydantic schemas for request/response models and 23 unit tests
covering schema validation, input normalisation, and response format.
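The role and content-part mapping listed above might look roughly like this. It is a simplified sketch, not the endpoint's actual normaliser: `normalise_input` is a hypothetical name, and the Chat-format part shapes (`text`, `image_url`) are assumed to follow the OpenAI Chat Completions convention.

```python
from typing import Any, Dict, List, Optional, Union


def normalise_input(
    instructions: Optional[str],
    items: Union[str, List[Dict[str, Any]]],
) -> List[Dict[str, Any]]:
    """Convert Responses-style input into Chat-style messages."""
    messages: List[Dict[str, Any]] = []
    if instructions:
        messages.append({"role": "system", "content": instructions})
    if isinstance(items, str):
        # A bare string is treated as a single user turn.
        messages.append({"role": "user", "content": items})
        return messages
    for item in items:
        role = item.get("role", "user")
        if role == "developer":
            role = "system"  # `developer` maps to `system`
        content = item.get("content")
        if isinstance(content, list):
            parts = []
            for part in content:
                if part.get("type") == "input_text":
                    parts.append({"type": "text", "text": part["text"]})
                elif part.get("type") == "input_image":
                    parts.append(
                        {"type": "image_url", "image_url": {"url": part["image_url"]}}
                    )
            content = parts
        messages.append({"role": role, "content": content})
    return messages
```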
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Studio: add Anthropic-compatible /v1/messages endpoint (#4981)
* Add Anthropic-compatible /v1/messages endpoint with tool support
Translate Anthropic Messages API format to/from internal OpenAI format
and reuse the existing server-side agentic tool loop. Supports streaming
SSE (message_start, content_block_delta, etc.) and non-streaming JSON.
Includes offline unit tests and e2e tests in test_studio_run.py.
* Add enable_tools, enabled_tools, session_id to /v1/messages endpoint
Support the same shorthand as /v1/chat/completions: enable_tools=true
with an optional enabled_tools list uses built-in server tools without
requiring full Anthropic tool definitions. session_id is passed through
for sandbox isolation. max_tokens is now optional.
* Strip leaked tool-call XML from Anthropic endpoint content
Apply _TOOL_XML_RE to content events in both streaming and
non-streaming tool paths, matching the OpenAI endpoint behavior.
* Emit custom tool_result SSE event in Anthropic stream
Adds a non-standard tool_result event between the tool_use block close
and the next text block, so clients can see server-side tool execution
results. Anthropic SDKs ignore unknown event types.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Split /v1/messages into server-side and client-side tool paths
enable_tools=true runs the existing server-side agentic loop with
built-in tools (web_search/python/terminal). A bare tools=[...] field
now triggers a client-side pass-through: client-provided tools are
forwarded to llama-server and any tool_use output is returned to the
caller with stop_reason=tool_use for client execution.
This fixes Claude Code (and any Anthropic SDK client) which sends
tools=[...] expecting client-side execution but was previously routed
through execute_tool() and failing with 'Unknown tool'.
Adds AnthropicPassthroughEmitter to convert llama-server OpenAI SSE
chunks into Anthropic SSE events, plus unit tests covering text
blocks, tool_use blocks, mixed, stop reasons, and usage.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix httpcore GeneratorExit in /v1/messages passthrough stream
Explicitly aclose aiter_lines() before the surrounding async with
blocks unwind, mirroring the prior fix in external_provider.py
(a41160d3) and cc757b78's RuntimeError suppression.
* Wire stop_sequences through /v1/messages; warn on tool_choice
Plumb payload.stop_sequences to all three code paths (server-side
tool loop, no-tool plain, client-side passthrough) so Anthropic SDK
clients setting stop_sequences get the behavior they expect. The
llama_cpp backend already accepted `stop` on both generate_chat_
completion and generate_chat_completion_with_tools; the Anthropic
handler simply wasn't passing it.
tool_choice remains declared on the request model for Anthropic SDK
compatibility (the SDK often sets it by default) but is not yet
honored. Log a structured warning on each request carrying a non-
null tool_choice so the silent drop is visible to operators.
* Wire min_p / repetition_penalty / presence_penalty through /v1/messages
Align the Anthropic endpoint's sampling surface with /v1/chat/completions.
Adds the three fields as x-unsloth extensions on AnthropicMessagesRequest
and threads them through all three code paths: server-side tool loop,
no-tool plain, and client-side passthrough.
The passthrough builder emits "repeat_penalty" (not "repetition_penalty")
because that is llama-server's field name; the backend methods already
apply the same rename internally.
* Fix block ordering and prev_text reset in non-streaming tool path
_anthropic_tool_non_streaming was building the response by appending
all tool_use blocks first, then a single concatenated text block at
the end — losing generation order and merging pre-tool and post-tool
text into one block. It also never reset prev_text between synthesis
turns, so the first N characters of each post-tool turn were dropped
(where N = length of the prior turn's final cumulative text).
Rewrite to build content_blocks incrementally in generation order,
matching the streaming emitter's behavior: deltas within a turn are
merged into the trailing text block, tool_use blocks interrupt the
text sequence, and prev_text is reset on tool_end so turn N+1 diffs
against an empty baseline.
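The incremental build can be illustrated with a small sketch. The event dictionaries (`content` carrying cumulative text, `tool_start`, `tool_end`) and the name `build_blocks` are hypothetical and only approximate the real tool loop's events.

```python
from typing import Any, Dict, List


def build_blocks(events: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Build content blocks in generation order from cumulative text events."""
    blocks: List[Dict[str, Any]] = []
    prev_text = ""
    for event in events:
        if event["type"] == "content":
            # Text events are cumulative within a turn, so diff against prev_text.
            delta = event["text"][len(prev_text):]
            prev_text = event["text"]
            if not delta:
                continue
            if blocks and blocks[-1]["type"] == "text":
                blocks[-1]["text"] += delta  # merge into the trailing text block
            else:
                blocks.append({"type": "text", "text": delta})
        elif event["type"] == "tool_start":
            # A tool_use block interrupts the text sequence.
            blocks.append(
                {
                    "type": "tool_use",
                    "id": event["id"],
                    "name": event["name"],
                    "input": event["input"],
                }
            )
        elif event["type"] == "tool_end":
            # The next turn starts from an empty baseline.
            prev_text = ""
    return blocks
```

Without the reset on `tool_end`, the post-tool text would be sliced by the previous turn's length and its leading characters would be lost.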
Caught by gemini-code-assist[bot] review on #4981.
* Make test_studio_run.py e2e tests pytest-compatible
Add a hybrid session-scoped studio_server fixture in conftest.py that
feeds base_url / api_key into the existing e2e test functions. Three
invocation modes are now supported:
1. Script mode (unchanged) — python tests/test_studio_run.py
2. Pytest + external server — point at a running instance via
UNSLOTH_E2E_BASE_URL / UNSLOTH_E2E_API_KEY env vars, no per-run
GGUF load cost
3. Pytest + fixture-managed server — pytest drives _start_server /
_kill_server itself via --unsloth-model / --unsloth-gguf-variant,
CI-friendly
The existing _start_server / _kill_server helpers and main() stay
untouched so the script entry point keeps working exactly as before.
Test function signatures are unchanged — the (base_url, api_key)
parameters now resolve via the new fixtures when running under
pytest.
* Rename test_studio_run.py -> test_studio_api.py
The file is entirely about HTTP API endpoint testing (OpenAI-compatible
/v1/chat/completions, Anthropic-compatible /v1/messages, API key auth,
plus a CLI --help sanity check on the command that runs the API). None
of its tests cover training, export, chat-UI, or internal-Python-API
concerns.
The old name misleadingly suggested "tests for the unsloth studio run
CLI subcommand" — the new name reflects the actual scope.
Updates:
- git mv the file (rename tracked, history preserved)
- Rewrite opening docstring to state the API surface focus and call
out what is explicitly out of scope
- Update all 4 Usage-block path references to the new filename
- LOG_FILE renamed to test_studio_api.log
- conftest.py fixture import rewritten from test_studio_run to
test_studio_api, plus 7 docstring/comment references updated
No functional changes to test logic, signatures, or main().
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix httpcore asyncgen cleanup in /v1/messages and /v1/completions
The earlier fix in 985e92a9 was incomplete: it closed aiter_lines()
explicitly but still used `async with httpx.AsyncClient()` /
`async with client.stream()` inside the generator. When the generator
is orphaned (e.g. client disconnects mid-stream and Starlette drops
the StreamingResponse iterator without explicitly calling aclose()),
Python's asyncgen finalizer runs the cleanup in a DIFFERENT task than
the one that originally entered the httpx context managers. The
`async with` exits then trigger httpcore's HTTP11ConnectionByteStream
.aclose(), which enters anyio.CancelScope.__exit__ with a mismatched
task and raises RuntimeError("Attempted to exit cancel scope in a
different task"). That error escapes any user-owned try/except
because it happens during GC finalization.
Replace `async with` with manual client/response lifecycle in both
/v1/messages passthrough and /v1/completions proxy. Close the
response and client in a finally block wrapped in
`try: ... except Exception: pass`. This suppresses RuntimeError (and
other Exception subclasses) from the anyio cleanup noise while
letting GeneratorExit (a BaseException, not Exception) propagate
cleanly so the generator terminates as Python expects.
Traceback observed in user report:
File ".../httpcore/_async/connection_pool.py", line 404, in __aiter__
yield part
RuntimeError: async generator ignored GeneratorExit
...
File ".../anyio/_backends/_asyncio.py", line 455, in __exit__
raise RuntimeError(
RuntimeError: Attempted to exit cancel scope in a different task
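The manual-lifecycle pattern can be sketched with the standard library alone. `FakeResponse` is a hypothetical stand-in for an httpx streaming response that deliberately raises on close to mimic the anyio noise; `proxy_stream` is an illustrative name, not the Studio handler.

```python
import asyncio


class FakeResponse:
    """Hypothetical stand-in for a streaming HTTP response."""

    def __init__(self):
        self.closed = False

    async def aiter_lines(self):
        for line in ("data: one", "data: two", "data: three"):
            yield line

    async def aclose(self):
        self.closed = True
        # Simulate the cleanup error described in the commit message.
        raise RuntimeError("Attempted to exit cancel scope in a different task")


async def proxy_stream(response):
    """Yield lines, closing resources manually instead of via `async with`."""
    lines = response.aiter_lines()
    try:
        async for line in lines:
            yield line
    finally:
        # Swallow Exception subclasses only. GeneratorExit derives from
        # BaseException, so it still propagates and the generator ends cleanly.
        for closer in (lines.aclose, response.aclose):
            try:
                await closer()
            except Exception:
                pass


async def demo():
    response = FakeResponse()
    stream = proxy_stream(response)
    first = await stream.__anext__()
    await stream.aclose()  # simulate a client disconnecting mid-stream
    return first, response.closed
```

Running `asyncio.run(demo())` closes the stream early without surfacing the RuntimeError raised during cleanup.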
* Expand unsloth studio run banner with SDK base URL and more curl examples
Add an explicit "OpenAI / Anthropic SDK base URL" line inside the info
box so SDK users don't accidentally copy the bare server URL (without
/v1) into their OpenAI/Anthropic SDK constructors and hit 404s.
Replace the single /v1/chat/completions curl example with three
labeled blocks: chat/completions, Anthropic /messages, and OpenAI
Responses. The Anthropic example includes max_tokens (Anthropic SDKs
require it even though Studio accepts None).
All examples derived from a computed sdk_base_url so the /v1 prefix
stays in sync if the public path ever changes.
* Hash API keys with HMAC-SHA256 + persistent server secret
Stores the HMAC secret in a new app_secrets singleton table. Fixes
CodeQL py/weak-sensitive-data-hashing alert on storage.py:74-76,
394-395. Refresh tokens stay on plain SHA-256 (unchanged _hash_token)
so existing user sessions survive upgrade — API keys are new on this
branch so there is no migration.
* Use PBKDF2 for API key hashing per CodeQL recommendation
HMAC-SHA256 was still flagged by py/weak-sensitive-data-hashing.
Switch to hashlib.pbkdf2_hmac, which is in CodeQL's recommended
allowlist (Argon2/scrypt/bcrypt/PBKDF2). Persistent server-side
salt stays in app_secrets for defense-in-depth. 100k iterations to
match auth/hashing.py's password hasher.
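A minimal sketch of hashing and verifying an API key with PBKDF2, assuming hypothetical helper names and a caller-supplied salt. It does not reproduce storage.py or the app_secrets table; only the `sk-unsloth-` prefix, `hashlib.pbkdf2_hmac`, and the 100k iteration count come from the messages above.

```python
import hashlib
import hmac
import secrets

KEY_PREFIX = "sk-unsloth-"
ITERATIONS = 100_000  # iteration count named in the commit message


def new_api_key() -> str:
    """Generate a random key with the Studio prefix (token length is assumed)."""
    return KEY_PREFIX + secrets.token_urlsafe(32)


def hash_api_key(key: str, server_salt: bytes) -> str:
    """Derive a hex digest; only this digest would be stored."""
    digest = hashlib.pbkdf2_hmac("sha256", key.encode("utf-8"), server_salt, ITERATIONS)
    return digest.hex()


def verify_api_key(key: str, stored_hash: str, server_salt: bytes) -> bool:
    """Recompute the digest and compare in constant time."""
    return hmac.compare_digest(hash_api_key(key, server_salt), stored_hash)
```

Because the salt is shared server-wide, a presented key can be hashed once and looked up directly rather than checked against every stored row.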
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
Add mode="wait" and exit={{ opacity: 0 }} to the root AnimatePresence
wrapper so outgoing routes fully unmount before incoming routes render.
Without this, rapid navigation between Studio/Export/Recipes/Chat caused
pages to stack (2x–3x duplication).
Co-authored-by: AdamPlatin123 <AdamPlatin123@users.noreply.github.com>
Co-authored-by: Wasim Yousef Said <wasimysdev@gmail.com>
* Fix Gemma-4 GRPO catastrophic KL divergence with TRL 1.0.0+
Two compounding bugs caused Gemma-4 GRPO training to diverge with KL ~10^12
at step 1 against TRL 1.0.0+. Both fixes are runtime patches in the existing
TRL/model patch flow and are no-ops for models and TRL versions that are not
affected.
Fix 1 (rl.py): replace trl.models.utils.disable_gradient_checkpointing with
a no-op context manager. TRL 1.0.0+ wraps generation in
`with torch.no_grad(), disable_gradient_checkpointing(self.model, ...):`
purely to suppress a cosmetic PyTorch warning ("None of the inputs have
requires_grad=True"). Inside torch.no_grad() the gradient checkpointing
state has no functional effect on the forward pass. On context exit, TRL
calls model.gradient_checkpointing_enable() which dispatches to HF's
generic implementation and overwrites Unsloth's custom
`use_gradient_checkpointing="unsloth"` wrapper, corrupting Gemma-4 forward
numerics. Replacing the toggle with a no-op preserves Unsloth's custom GC
wrapper across generation passes. The patch walks sys.modules dynamically
to also rebind the symbol on every trl.* module that already imported it
(grpo_trainer, dpo_trainer, rloo_trainer, dppo_trainer, gfpo_trainer,
grpo_with_replay_buffer_trainer, and any future trainer module).
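The no-op replacement plus sys.modules walk can be sketched as below. This is an illustration, not the patch in rl.py: `rebind_everywhere` and its `package_prefix` argument are hypothetical, and the sketch does not import TRL.

```python
import sys
from contextlib import contextmanager


@contextmanager
def noop_disable_gradient_checkpointing(model, *args, **kwargs):
    """Do nothing, so a custom checkpointing wrapper is left untouched."""
    yield


def rebind_everywhere(name, replacement, package_prefix):
    """Rebind `name` on every loaded module under package_prefix that defines it.

    Modules that ran `from x import name` hold their own reference, so
    patching only the defining module would not reach them.
    """
    patched = []
    for mod_name, module in list(sys.modules.items()):
        if module is None:
            continue
        in_package = mod_name == package_prefix or mod_name.startswith(
            package_prefix + "."
        )
        if in_package and name in getattr(module, "__dict__", {}):
            setattr(module, name, replacement)
            patched.append(mod_name)
    return patched
```

Under these assumptions, a call such as `rebind_everywhere("disable_gradient_checkpointing", noop_disable_gradient_checkpointing, "trl")` would cover trainer modules that had already imported the symbol.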
Fix 2 (vision.py): inject `final_logit_softcapping` from `config.text_config`
into the top-level `model.config` for multimodal models. Unsloth's GRPO
trainer reads `getattr(model.config, "final_logit_softcapping", 0)` but
for Gemma-4 the attribute lives only on the nested `Gemma4TextConfig`,
so the lookup silently defaults to 0 instead of 30.
Backwards compatibility:
- trl 0.22.2: no `disable_gradient_checkpointing` symbol exists, the patch
early-returns via `hasattr` guard.
- trl 0.27.1: same broken pattern as 1.0.0, the noop replacement is correct.
- trl 1.0.0+: end-to-end verified on `unsloth/gemma-4-E2B-it` GRPO with TRL
1.0.0 and transformers 5.5.0. Step 1 loss=2.46e-08, kl=2.92e-05 (machine
zero) vs broken baseline loss=1.37e+06, kl=1.76e+09.
- Llama / non-VLM text models: Fix 2 is a no-op (no `text_config`); Fix 1
is functionally identical (Unsloth's GC wrapper is preserved).
- Qwen3-VL and other VLMs without final_logit_softcapping: Fix 2 is a no-op
(text_config.final_logit_softcapping is None).
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Apply loop 1 review fixes for PR #4934
- Move Fix 2 from vision.py to rl_replacements.py:858 and :1110 at the
actual consumer sites. This avoids mutating model.config (which could
leak into save_pretrained output) and covers text-only Gemma-4 paths
that do not flow through FastBaseModel.from_pretrained.
- Revert the vision.py injection block entirely.
- Narrow the bare except blocks in patch_trl_disable_gradient_checkpointing
from `except Exception:` to `(AttributeError, ImportError)` and
`(AttributeError, TypeError)` to avoid masking unrelated bugs.
- Add logger.warning_once when the noop patch is installed, matching
patch_trl_openenv and patch_trl_vllm_generation convention.
- Remove the dead per-module `_unsloth_noop_patched` sentinel check inside
the sys.modules walk. The function-level early return already covers
this case.
- Move `import sys` and `from contextlib import contextmanager` to the
module-level imports instead of inside the function body.
- Rewrite the ordering comment in PatchFastRL to accurately describe
why patch_trl_disable_gradient_checkpointing must run before
patch_trl_rl_trainers.
- Fix keyword default spacing to match surrounding rl.py style.
End-to-end verified: Gemma-4-E2B GRPO on TRL 1.0.0 + transformers 5.5.0
step 1 loss=2.464e-08 kl=2.921e-05, all 5 steps succeed.
* Apply loop 2 review fix for PR #4934
Extract the final_logit_softcapping fallback logic into a shared helper
`_unsloth_get_final_logit_softcapping(config)` defined in rl_replacements.py
and injected into the compiled cache via RL_PRE_ITEMS["grpo_trainer"]. Both
call sites (`grpo_trainer__generate_and_score_completions` and
`grpo_trainer_compute_loss`) now use the helper instead of inlining the
same text_config fallback block twice.
Verified: compiled cache file lists the helper at module scope and both
consumer sites call it. Gemma-4-E2B GRPO step 1 loss=2.464e-08 kl=2.921e-05
(unchanged), all 5 steps pass.
* Apply loop 3 review fix for PR #4934
Extend _unsloth_get_final_logit_softcapping to also fall back to
config.get_text_config() for composite configs such as T5GemmaConfig
where the text sub-config is not exposed via the text_config attribute
but only via the get_text_config() method. Guard against (TypeError,
ValueError) raised by ambiguous composite configs, and skip the
self-referential case where get_text_config() returns self.
This addresses the 6/7 reviewer consensus from the third review loop.
Verified:
- Helper returns 30.0 for Gemma-4, T5Gemma, and Gemma 1/2 configs.
- Helper returns 0 for Llama, Qwen, Mistral, Cohere, Granite, and
ambiguous configs raising ValueError.
- Gemma-4-E2B GRPO step 1 loss=2.464e-08 kl=2.921e-05 (unchanged).
- Llama-3.2-1B GRPO all 5 steps loss=0 kl=0 (no regression).
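The fallback chain described in the review loops can be sketched as follows. This is a simplified illustration with a hypothetical function name, not the `_unsloth_get_final_logit_softcapping` helper itself; it uses plain attribute lookups so that it runs without transformers.

```python
def get_final_logit_softcapping(config, default=0):
    """Look up final_logit_softcapping on a config or its text sub-config."""
    value = getattr(config, "final_logit_softcapping", None)
    if value is not None:
        return value
    # First fallback: nested text_config attribute (multimodal configs).
    text_config = getattr(config, "text_config", None)
    if text_config is None:
        # Second fallback: get_text_config() for composite configs.
        getter = getattr(config, "get_text_config", None)
        if callable(getter):
            try:
                text_config = getter()
            except (TypeError, ValueError):
                text_config = None  # ambiguous composite config
    # Skip the self-referential case where the getter returns the config itself.
    if text_config is None or text_config is config:
        return default
    value = getattr(text_config, "final_logit_softcapping", None)
    return default if value is None else value
```

A config with the value only on its nested text sub-config returns that value, while a config without any text sub-config falls through to the default of 0.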
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Pin bitsandbytes to continuous-release_main on ROCm for 4-bit decode fix
bitsandbytes 0.49.2 on PyPI ships with a broken 4-bit GEMV kernel on
every ROCm target:
- CDNA (gfx90a / gfx942 / gfx950 = MI210 / MI300X / MI350) via a
broken blocksize=32/64 warp64 GEMV kernel whose tests were
explicitly skipped with ROCM_WARP_SIZE_64 guards because the
code was known broken.
- RDNA3 / RDNA3.5 (gfx1100-1103 / gfx1150-1152) via a compile-time
BNB_WARP_SIZE macro in the host-side dispatch that resolves to
64 when the multi-arch wheel is compiled with CDNA as the
primary target, so num_blocks is wrong on RDNA and half the GEMV
output is never written.
At decode shape (1, 1, hidden) both bugs produce NaN. Training is
unaffected because training shapes are (batch, seq_len > 1, hidden)
and never touch the GEMV path. The crash during autoregressive
inference surfaces as _assert_async_cuda_kernel in torch.multinomial
which on HIP becomes a hard HSA_STATUS_ERROR_EXCEPTION instead of
a clean Python error.
Both bugs are fixed by bitsandbytes commit 713a3b8 ("[ROCm] Enable
blocksize 32 4-bit quantization and GEMV kernels on AMD CDNA",
PR #1887, merged 2026-03-09) which replaces BNB_WARP_SIZE with a
runtime hipDeviceGetAttribute query and ships a working CDNA warp64
kernel. That commit has not shipped to PyPI yet, but
continuous-release_main wheels are published on every push to bnb
main via GitHub Releases.
Point the ROCm install path at the continuous-release_main x86_64 and
aarch64 wheels and fall back to PyPI >=0.49.1 when the pre-release is
unreachable (offline installs, firewalled hosts, or architectures not
covered by the pre-release wheels). Drop the pin once bnb cuts a
0.50+ tag on PyPI.
Verified on MI300X (gfx942, ROCm 7.2, torch 2.10.0+rocm7.1): direct
bnb GEMV shape test now returns 0.0078 max abs error at seq_len=1
(no NaN) vs NaN on 0.49.2, and full Unsloth + for_inference + 4-bit
sampling generation works end-to-end.
NVIDIA / CPU / Mac / Windows paths are unaffected -- the helper is
gated on the ROCm torch index and platform.machine() respectively.
* Drop Studio ROCm 16-bit fallback now that bnb 0.50+ fixes 4-bit decode
The 16-bit fallback in studio/backend/core/inference/inference.py was
added as a workaround for a bug that this PR already fixes at the
install layer: bitsandbytes <= 0.49.2 has a broken 4-bit GEMV kernel
on every ROCm target, which NaNs at decode shape (seq_len=1) and
crashes autoregressive inference. bnb PR #1887 (commit 713a3b8, in
0.50.0.dev0+, pinned by install.sh / install_python_stack.py in this
PR) restores correct 4-bit decode on MI300X and was verified working
end-to-end with full Unsloth + for_inference + sampling.
Revert the dual code path so ROCm and NVIDIA both go through the
normal FastLanguageModel.from_pretrained + for_inference flow:
- Remove the conditional `from unsloth import` that skipped the
import on ROCm. The monkey-patches it was trying to avoid were
never the cause of the crash; bnb 4-bit GEMV was.
- Remove the `if _hw_module.IS_ROCM:` branch in load_model that
loaded with plain transformers + PEFT + bfloat16, and the
`_resolve_fp16_base` helper it relied on.
- Remove the `get_chat_template is not None` fallback in
_load_chat_template_info -- get_chat_template is now always
imported.
- Refactor the audio/vision ROCm guard to check _hw_module.IS_ROCM
directly instead of the removed _IS_ROCM_ENV global. Audio and
vision on ROCm still need separate validation (FastVisionModel
and the CSM audio codecs were never tested on HIP) so the guard
stays for now.
Add _bnb_rocm_4bit_ok() as a runtime safety net for users who
install from this PR before the install.sh bnb pin kicks in, or
whose installer fell back to the PyPI pin because the continuous-
release wheel was unreachable. When the installed bnb is < 0.50 on
ROCm, force load_in_4bit=False and strip any -unsloth-bnb-4bit /
-bnb-4bit suffix from the model path so a pre-quantized repo
resolves to its FP16 sibling instead of pulling bnb back in via
the repo's quantization_config. LoRA adapters whose base is a
pre-quantized repo on old bnb will still fail inside Unsloth's
loader -- the only real fix there is `unsloth studio update`.
Verified on MI300X (gfx942, ROCm 7.2, torch 2.10.0+rocm7.1):
- HAPPY path (bnb 0.50.0.dev0, load_in_4bit=True, pre-quantized
repo): loads in 4-bit via the fixed GEMV, generation returns
"Paris." for greedy and sampling.
- SAFETY-NET path (simulated old bnb, suffix-stripped to the
FP16 sibling, load_in_4bit=False): loads in bf16, generation
returns "Paris." for greedy and sampling.
Net diff is ~45 lines smaller than the pre-revert state because
the entire plain-transformers 16-bit branch is gone.
* Cache _bnb_rocm_4bit_ok() with functools.cache
load_model() can be called many times in a single session but the bnb
version and hardware state cannot change at runtime, so memoise the
check. First call is ~1.9 ms (dominated by the lazy `import bitsandbytes`
inside the try block), subsequent calls drop to sub-microsecond dict
lookups. Zero behavioral change.
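A minimal sketch of the memoisation described above. The check body is a
stand-in: `_installed_bnb_version` and the version parsing are illustrative
assumptions, not the helper's actual code.

```python
import functools

CALLS = 0  # counts how often the expensive body actually runs


def _installed_bnb_version() -> str:
    # Hypothetical stand-in for the lazy `import bitsandbytes` lookup.
    return "0.50.0.dev0"


@functools.cache
def bnb_rocm_4bit_ok() -> bool:
    """Return True when the (stand-in) bnb version is at least 0.50."""
    global CALLS
    CALLS += 1
    major, minor = _installed_bnb_version().split(".")[:2]
    return (int(major), int(minor)) >= (0, 50)


first = bnb_rocm_4bit_ok()
second = bnb_rocm_4bit_ok()  # served from the cache; the body does not rerun
```

Because the function takes no arguments, functools.cache stores exactly one
result for the lifetime of the process.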
* Shorten verbose bnb/ROCm comments
Comment-only cleanup across install.sh, studio/install_python_stack.py,
and studio/backend/core/inference/inference.py. No behavioral change.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove _bnb_rocm_4bit_ok safety net from inference.py
Studio's ROCm support is brand new (PR #4720, merged today) and every
fresh install pulls the bnb continuous-release_main wheel via
install.sh / install_python_stack.py in this same PR. There are no
existing ROCm Studio installs carrying bnb < 0.50, so the defensive
version-check fallback is guarding against a scenario that cannot
actually occur. Delete the helper, the functools import, and the
safety-net block -- inference.py now calls FastLanguageModel.from_pretrained
directly with no ROCm branching.
* Drop audio/vision ROCm guard in inference.py — verified unblocked by bnb fix
Vision inference was blocked by the same bnb 4-bit GEMV bug that affected
text inference (vision models use bnb 4-bit for the LM backbone). With
bnb 0.50+ pinned in install.sh / install_python_stack.py, vision works
end-to-end on MI300X: Llama-3.2-11B-Vision-Instruct-unsloth-bnb-4bit
loaded in 4-bit via FastVisionModel + for_inference returns a correct
answer to a multimodal prompt.
Audio (CSM) was never actually blocked by HIP — on this hardware CSM
loads and runs its backbone forward pass fine with bnb 0.50, then fails
during generate() with a transformers-level kwarg validation mismatch
in generation_csm.py (`backbone_last_hidden_state` rejected). That's a
pre-existing transformers/CSM integration bug that reproduces identically
on NVIDIA, so the ROCm-gated guard was never actually protecting users
from anything HIP-specific.
Remove the combined audio/vision guard and the now-unused _hw_module
import. Also restore the one-word "Can be" in an inline comment that
drifted during the earlier comment-shortening pass, so the inference.py
delta vs pre-#4720 is exactly the max_seq_length<=0 crash fix and
nothing else.
* Shorten max_seq_length=0 guard comment to one line
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add ROCm detection to install.sh and expand shell tests
Add AMD ROCm GPU detection to get_torch_index_url() in install.sh.
When nvidia-smi is not found, probe for ROCm via amd-smi, /opt/rocm
version file, hipconfig, dpkg-query, and rpm.
Includes validation guard for malformed _rocm_tag, Debian epoch prefix
stripping, ROCm 7.2+ cap to rocm7.1 index, bitsandbytes AMD install,
and status messaging. Shell tests expanded to 23 cases.
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Add ROCm torch reinstall support to install_python_stack.py
Add _detect_rocm_version() and _ensure_rocm_torch() to detect when a
Linux host has ROCm but the venv received CPU-only torch, and reinstall
with the correct ROCm wheels. Covers ROCm 6.0 through 7.1 with a
30-second timeout on the torch GPU probe subprocess.
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Add ROCm support to llama.cpp prebuilt installer
Add has_rocm field to HostInfo, extend detect_host() to probe for ROCm
via hipcc/amd-smi/rocm-smi/ROCM_PATH, and route ROCm hosts to upstream
prebuilts (Linux ROCm 7.2 prebuilt with source fallback, Windows HIP
prebuilt with CPU fallback). Add linux-rocm and windows-hip install
kinds to runtime_patterns_for_choice().
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Add IS_ROCM hardware flag and fix AMD error message
Add IS_ROCM flag to hardware.py detect_hardware() (set when
torch.version.hip is present, DeviceType stays CUDA). Export IS_ROCM
from __init__.py. Add "rocm" key to get_package_versions().
Replace "We do not support AMD" error in tokenizer_utils.py with a
helpful message pointing to ROCm installation docs.
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Add comprehensive ROCm support test suite (68 tests)
Add tests/studio/install/test_rocm_support.py covering all ROCm code
paths across install_llama_prebuilt.py, install_python_stack.py,
hardware.py, tokenizer_utils.py, and install.sh. All tests use mocks
and run without AMD hardware.
Covers: asset selection (11), runtime patterns (5), HostInfo (4),
ROCm version detection (9), torch reinstall (9), index mapping (8),
hardware flag (8), tokenizer message (2), install.sh structure (10),
and live regression (1).
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Harden ROCm support: probe error handling, version cap, validation
Address review findings from 8 independent reviewers:
- Wrap _ensure_rocm_torch() torch probe in try/except for
TimeoutExpired and OSError so a hung or broken torch import does not
crash the installer (8/8 reviewers flagged this)
- Add torch>=2.4,<2.11.0 version cap to the ROCm reinstall path to
prevent installing unsupported torch 2.11.0 from the rocm7.1 index
- Use with-statement for file reads in _detect_rocm_version() to avoid
resource leaks
- Handle ROCM_PATH="" correctly (use `or "/opt/rocm"` instead of
default parameter to avoid relative path resolution)
- Strengthen shell validation guard from rocm[0-9] to rocm[1-9] to
reject rocm0.x tags that would produce nonexistent PyTorch index URLs
- Switch shell version cap from blocklist to allowlist (rocm6.*|rocm7.0*
|rocm7.1* pass through, everything else caps to rocm7.1) so future
ROCm 10+ does not fall through to a nonexistent index
- Add sorted() to _ROCM_TORCH_INDEX lookup for defensive ordering
- Fix test_probe_timeout_handled: replace zero-assertion test with
proper assertions verifying reinstall proceeds after timeout
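The probe hardening in the first bullet can be sketched roughly as follows.
`probe_torch_backend` is a hypothetical name, and a plain `print` stands in for
the real torch import; only the try/except shape is taken from the commit.

```python
import subprocess
import sys
from typing import Optional


def probe_torch_backend(cmd: list[str], timeout: float = 30.0) -> Optional[str]:
    """Run a probe command; return stripped stdout, or None on any failure."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, OSError):
        # Hung import or missing/broken interpreter: treat as "unknown"
        # so the installer can continue instead of crashing.
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()


ok = probe_torch_backend([sys.executable, "-c", "print('hip')"])
hung = probe_torch_backend(
    [sys.executable, "-c", "import time; time.sleep(10)"], timeout=0.5
)
missing = probe_torch_backend(["/nonexistent/python-binary", "-c", "pass"])
```

A None result is treated as "backend unknown", which lets the caller proceed
with the reinstall rather than abort.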
* Clean up rocm_paths list construction in detect_host()
Filter None from the ROCM_PATH env var lookup at list construction time
instead of relying on the inline `if p` guard in the any() call.
* Require actual AMD GPU presence before selecting ROCm paths
All 8 reviewers across 2 cycles independently flagged that ROCm
detection used toolkit/filesystem hints (hipcc, /opt/rocm, rocm-core)
as a proxy for GPU presence, which would misroute CPU-only or NVIDIA
hosts that happen to have ROCm tools installed.
Now all 3 detection points (install.sh, install_python_stack.py,
install_llama_prebuilt.py) probe for an actual AMD GPU before
entering the ROCm path:
- install.sh: check rocminfo for gfx* GPU names, or amd-smi list
for device rows, before version detection
- install_python_stack.py: new _has_rocm_gpu() function probes
rocminfo and amd-smi list before _ensure_rocm_torch() proceeds
- install_llama_prebuilt.py: detect_host() probes rocminfo/amd-smi
list instead of just checking tool existence or directory paths
Also:
- Shell test mock amd-smi now handles "list" subcommand
- Python tests updated to mock _has_rocm_gpu where needed
- Added test_no_gpu_with_rocm_tools_skips to verify the new guard
- Test index lookups now use sorted() to match production code
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Harden hipconfig version parsing and torch probe compatibility
- Add parts[1].isdigit() check in hipconfig version parsing to handle
versions like "6.3-HIP" where the minor component has a non-numeric
suffix (strip "-" prefix before int() conversion)
- Use getattr() in torch probe subprocess to safely handle old or
custom torch builds that may lack torch.version.hip/cuda attributes
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Strengthen AMD GPU detection and add NVIDIA precedence guard
- Change amd-smi list detection from any-non-empty-output to requiring
"gpu" marker in output, matching the shell-side NR>1 check. Prevents
false positives from header-only amd-smi list output.
- Add nvidia-smi check at the top of _ensure_rocm_torch() so mixed
AMD+NVIDIA hosts preserve NVIDIA precedence (matching install.sh and
install_llama_prebuilt.py behavior).
- Apply the same amd-smi marker fix to install_llama_prebuilt.py
detect_host() for consistency.
* Add Windows-specific ROCm/HIP detection in detect_host()
The previous detect_host() ROCm check used rocminfo and amd-smi list
which are Linux-only tools. On Windows, has_rocm would always be False,
making the Windows HIP prebuilt path at line 1794 unreachable.
Now detect_host() uses platform-specific detection:
- Linux: rocminfo (check for gfx GPU names) or amd-smi list
- Windows: hipinfo.exe, amd-smi, or amdhip64.dll on PATH
This allows Windows AMD users to get the HIP prebuilt binary instead
of silently falling through to the CPU prebuilt.
* Add AMD ROCm gaps: Mamba/SSM source builds, GPU monitoring, Windows messaging, RDNA expansion
- worker.py: Add HIP detection to causal-conv1d/mamba-ssm probe, check
for hipcc before ROCm source builds, improve status messages and error
reporting, add timeout and uv support for the source build fallback
- amd.py: New AMD GPU monitoring module via amd-smi metric --json,
mirroring nvidia.py structure (utilization, temperature, power, VRAM)
- hardware.py: Branch to amd.py when IS_ROCM is True for GPU utilization,
visible GPU queries, and physical GPU count
- install_python_stack.py: Detect AMD GPUs on Windows and warn that
ROCm-enabled PyTorch must be installed manually
- kernels/utils.py: Expand is_rdna() to cover RDNA2 (gfx1030-1032),
RDNA3 (gfx1102-1103), RDNA3.5 (gfx1150-1152) alongside existing entries
- tests: Add 32 new tests covering all changes (95/95 pass)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Harden ROCm detection, fix VRAM heuristic, and expand RDNA2 coverage
- Windows ROCm detection: validate actual GPU presence via hipinfo/amd-smi
output markers instead of just checking tool existence on PATH
- _ensure_rocm_torch: validate nvidia-smi actually reports a GPU before
giving NVIDIA precedence (fixes AMD-only hosts with stale NVIDIA tools)
- amd.py _parse_numeric: handle dict-shaped metric objects from newer
amd-smi versions ({"value": 10, "unit": "W"}) and strip MiB/GiB units
- amd.py VRAM heuristic: raise threshold from 100k to 10M to correctly
handle MI300X (192 GB = 196608 MB) and other high-VRAM GPUs
- amd.py visible GPU: use AMD-reported GPU IDs instead of enumerate index
so non-dense sets like CUDA_VISIBLE_DEVICES=1,3 report correctly
- install.sh: add ROCm <6.0 minimum version guard (no PyTorch wheels
exist for older versions); fix rocm7.1* glob to not match rocm7.10+
- is_rdna: add gfx1033-1036 for RDNA2 mobile GPUs (RX 6600M etc.)
- worker.py: increase ROCm source build timeout from 600s to 1800s;
fix success log message for ROCm source builds
- Tests: update mocks for _has_usable_nvidia_gpu, add RDNA2 target asserts
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add HIP_VISIBLE_DEVICES support, unit-aware VRAM parsing, Windows GPU validation
- hardware.py: check HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES on ROCm
before falling back to CUDA_VISIBLE_DEVICES, so multi-GPU AMD setups with
HIP-specific env vars report the correct visible device set
- amd.py: add _parse_memory_mb() that reads "unit" from dict-shaped amd-smi
JSON (e.g. {"value": 192, "unit": "GiB"}) and converts to MB correctly;
fixes MI300X VRAM misreported as 0.19 GB instead of 192 GB
- install_python_stack.py: Windows AMD warning now validates actual GPU
presence via hipinfo/amd-smi output markers before printing
- install_llama_prebuilt.py: restore amdhip64.dll fallback for Windows HIP
detection after tool-based checks, so Windows HIP installs without CLI
tools on PATH are still detected
- hardware.py: fix IS_ROCM comment to accurately describe its role
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix HIP_VISIBLE_DEVICES empty-string handling in GPU visibility spec
Use explicit None checks instead of Python `or` operator when reading
HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES, so that an empty string
("") is correctly honored as "no visible GPUs" rather than silently
falling through to CUDA_VISIBLE_DEVICES on mixed ROCm+CUDA systems.
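A small sketch of why the `or` chain was wrong. Function names are
illustrative, and the HIP-before-ROCR priority order is an assumption; the
point is only the None check versus truthiness.

```python
from typing import Mapping, Optional

_VARS = ("HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES")


def visible_gpu_spec(env: Mapping[str, str]) -> Optional[str]:
    """Return the first visibility variable that is set, even if empty."""
    for name in _VARS:
        value = env.get(name)
        if value is not None:  # "" is a valid value meaning "no visible GPUs"
            return value
    return None


def visible_gpu_spec_buggy(env: Mapping[str, str]) -> Optional[str]:
    # `or` treats "" as falsy and falls through to the next variable.
    return (
        env.get("HIP_VISIBLE_DEVICES")
        or env.get("ROCR_VISIBLE_DEVICES")
        or env.get("CUDA_VISIBLE_DEVICES")
    )


env = {"HIP_VISIBLE_DEVICES": "", "CUDA_VISIBLE_DEVICES": "0,1"}
fixed = visible_gpu_spec(env)        # "" (no GPUs visible)
buggy = visible_gpu_spec_buggy(env)  # "0,1" (wrongly inherits the CUDA mask)
```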
* Fix IS_ROCM test assertion for multi-line formatting
* Cap torchvision/torchaudio versions, remove amdhip64.dll fallback, fix visible GPU count
- Cap torchvision<0.26.0 and torchaudio<2.11.0 alongside torch<2.11.0 in
both install.sh and install_python_stack.py to prevent resolver from
selecting incompatible companion packages from ROCm wheel index
- Remove amdhip64.dll fallback in Windows ROCm detection (DLL presence
without hipinfo/amd-smi is not proof of GPU existence)
- Fix get_visible_gpu_count() to use _get_parent_visible_gpu_spec() which
respects HIP_VISIBLE_DEVICES/ROCR_VISIBLE_DEVICES on ROCm hosts
* Attribute is_rdna() RDNA2/3/3.5/4 expansion to PR #4428
The is_rdna() expansion to cover RDNA2 (gfx1030-1036), RDNA3
(gfx1100-1103), RDNA3.5 (gfx1150-1152), and RDNA4 (gfx1200-1201)
architectures is based on the original work from PR #4428.
Co-authored-by: GoldenGrapeGentleman <yueyuan@amd.com>
Co-authored-by: billishyahao <bill.he@amd.com>
* Support AMD Radeon for studio (#4770)
Co-authored-by: Iswarya Alex <iswarya.alex@amd.com>
* Remove ROCm test files from main PR
Move test_rocm_support.py and shell test additions to a separate PR
to keep the main ROCm support PR focused on implementation changes.
* Fix installer and hardware detection issues for PR #4720
- Fix empty _tri_arg passed to uv pip install in Radeon path (causes
"Empty field is not allowed for PEP508" error)
- Fix Radeon fallback: use ROCm index instead of CPU-only when
repo.radeon.com is unreachable (TORCH_INDEX_URL already has ROCm)
- Use $TORCH_CONSTRAINT in fallback paths instead of hardcoded strings
- Fix _pick_radeon_wheel: relax suffix to match manylinux_2_28_x86_64
wheels (AMD Radeon repo does not use bare linux_x86_64 platform tag)
- Fix IS_ROCM export: use __getattr__ so callers always see the live
value after detect_hardware() runs
- Fix apply_gpu_ids: set HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES
on ROCm so _get_parent_visible_gpu_spec picks up narrowed GPU set
- Fix _parse_memory_mb: distinguish GB (1000 MB) from GiB (1024 MiB)
- Add amd-smi version as a fallback in _detect_rocm_version
- Fix trailing whitespace and missing newline at EOF in install.sh
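The IS_ROCM export fix relies on module-level `__getattr__` (PEP 562). A
self-contained sketch, using an in-memory module and a hypothetical `_state`
dict in place of the real hardware package:

```python
import types

# Build a throwaway module in memory; `hw` and `_state` are illustrative names.
hw = types.ModuleType("hw")
_state = {"IS_ROCM": False}


def _module_getattr(name: str):
    # PEP 562: called only when normal module attribute lookup fails,
    # so the value is computed at access time rather than at import time.
    if name == "IS_ROCM":
        return _state["IS_ROCM"]
    raise AttributeError(f"module 'hw' has no attribute {name!r}")


hw.__getattr__ = _module_getattr

before = hw.IS_ROCM
_state["IS_ROCM"] = True  # stands in for detect_hardware() flipping the flag
after = hw.IS_ROCM
```

Note that this only helps callers that read `hw.IS_ROCM` at use time; a
`from hw import IS_ROCM` still snapshots whatever value exists at import.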
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix GPU detection false positives and add missing health groups
- Fix _has_rocm_gpu() false positive: require "GPU: <number>" data rows
from amd-smi list, not just header containing "gpu"
- Apply same fix in detect_host() in install_llama_prebuilt.py
- Add runtime_payload_health_groups for linux-rocm and windows-hip so
partial/corrupt ROCm/HIP prebuilt installs are properly detected
- Add bitsandbytes install to Radeon fallback paths (was only in the
success path, skipped when repo.radeon.com was unreachable)
- Keep DEVICE/CHAT_ONLY as direct imports in __init__.py (matching main)
and only use __getattr__ for IS_ROCM
* Fix _ensure_rocm_torch and Windows AMD warning false positives
- _ensure_rocm_torch: only skip when HIP is already present, not for
CUDA builds (which are unusable on AMD-only hosts). Fixes the case
where a venv has a stale CUDA wheel and the repair step is skipped.
- Windows AMD warning: use GPU data row check (same as Linux fix) to
avoid false positives from amd-smi list header-only output.
* Fix amd-smi GPU detection for GPU[N] output format
Older amd-smi versions output "GPU[0] : Card series: ..." instead of
"GPU: 0". The regex now matches both "GPU: <digit>" and "GPU[<digit>"
formats to detect actual GPU data rows.
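A sketch of a pattern that accepts both row formats. The exact regex in the
installers may differ; this one and the sample header line are assumptions
for illustration.

```python
import re

# Matches "GPU: 0" (newer amd-smi) and "GPU[0] ..." (older amd-smi)
# at the start of a line, but not a header that merely contains "GPU".
_GPU_ROW = re.compile(r"^\s*GPU\s*(?::\s*\d|\[\d)", re.MULTILINE)


def has_gpu_rows(amd_smi_output: str) -> bool:
    """Return True if the output contains at least one GPU data row."""
    return _GPU_ROW.search(amd_smi_output) is not None


new_style = has_gpu_rows("GPU: 0\n    BDF: 0000:03:00.0\n")
old_style = has_gpu_rows("GPU[0] : Card series: Instinct\n")
header_only = has_gpu_rows("GPU  BDF  UUID\n")
```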
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Harden AMD GPU detection against false positives
- install.sh: replace weak amd-smi list check (awk 'NR>1 && NF') with
strict pattern matching GPU data rows (/^GPU[[:space:]]*[:\[]/)
- All files: reject rocminfo gfx000 (CPU HSA agent) by requiring
gfx[1-9] instead of gfx[0-9] in the rocminfo GPU probe
- Fixes false positives on hosts with ROCm tools but no AMD GPU
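The gfx000 rejection can be sketched in Python as below (the shell side uses
the equivalent pattern). The helper name and sample lines are illustrative.

```python
import re

# gfx[1-9]... is treated as a GPU ISA name; gfx000 is the CPU HSA agent
# and must not count as a GPU.
_GFX_GPU = re.compile(r"\bgfx[1-9][0-9a-z]*\b")


def rocminfo_reports_gpu(rocminfo_output: str) -> bool:
    """Return True if rocminfo output names at least one non-CPU gfx target."""
    return _GFX_GPU.search(rocminfo_output) is not None


gpu_host = rocminfo_reports_gpu("  Name:                    gfx942\n")
cpu_only = rocminfo_reports_gpu("  Name:                    gfx000\n")
```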
* Remove duplicate comment from pre-commit merge
* Refactor: deduplicate AMD detection, consolidate bitsandbytes, clean up imports
- Extract _has_amd_rocm_gpu() shell function to avoid duplicating the
rocminfo/amd-smi GPU detection logic in get_torch_index_url and
the Radeon auto-detect block
- Consolidate bitsandbytes install into a single case block after torch
install (was duplicated 4 times across Radeon success/fallback paths)
- Move math and re imports to top of amd.py (were inline in functions)
- Add _smi_query() helper in hardware.py to centralize IS_ROCM backend
selection for get_gpu_utilization and get_visible_gpu_utilization
Addresses Gemini code review suggestions.
* Fix VRAM parsing for string values and GB/GiB consistency
- Extract unit from string-valued VRAM fields (e.g. "192 GiB") so
_parse_memory_mb correctly applies the unit multiplier instead of
treating the value as bare MB
- Treat GB and GiB identically (both as binary x1024) since GPU tools
including amd-smi use binary units even when labeling them "GB"
- Fixes incorrect VRAM reporting on MI300-class cards (was showing
~0.19 GB instead of 192 GB for string-valued outputs)
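A rough sketch of unit-aware parsing along these lines. This is not the
actual `_parse_memory_mb`; the regex, the unit table, and the "bare number
means MB" default are assumptions.

```python
import re
from typing import Optional, Union

# GB and GiB are both treated as binary multiples, per the commit above.
_UNIT_TO_MB = {"mb": 1.0, "mib": 1.0, "gb": 1024.0, "gib": 1024.0}
_VALUE_WITH_UNIT = re.compile(r"^\s*([0-9]*\.?[0-9]+)\s*([A-Za-z]*)\s*$")


def parse_memory_mb(raw: Union[int, float, str, dict, None]) -> Optional[float]:
    """Convert an amd-smi style memory field to MB; None if unparseable."""
    if isinstance(raw, dict):
        value, unit = raw.get("value"), str(raw.get("unit") or "")
    elif isinstance(raw, str):
        match = _VALUE_WITH_UNIT.match(raw)
        if match is None:
            return None
        value, unit = match.group(1), match.group(2)
    else:
        value, unit = raw, ""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    if not unit:
        return number  # bare number: assumed to already be MB
    factor = _UNIT_TO_MB.get(unit.lower())
    return None if factor is None else number * factor


from_string = parse_memory_mb("192 GiB")
from_dict = parse_memory_mb({"value": 192, "unit": "GiB"})
```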
* Add --no-cache to uv for ROCm HIP source builds
Avoid stale cache artifacts from partial HIP source builds when
uv is used for causal-conv1d/mamba-ssm compilation on ROCm.
The pip path already uses --no-cache-dir; this adds the uv equivalent
(--no-cache) only when is_hip is True.
* Fix critical: initialize _amd_gpu_radeon before case block
_amd_gpu_radeon was only set inside the */rocm*) case arm, so on
NVIDIA/CPU/macOS paths where TORCH_INDEX_URL does not contain "rocm",
the variable was unbound. With set -u (nounset) enabled, this crashes
the installer for every non-AMD user.
Move initialization to before the case block so it is always defined.
* Fix Windows AMD: route has_rocm hosts to HIP prebuilt path
resolve_release_asset_choice was selecting windows-cpu for all Windows
x86_64 hosts including those with has_rocm=True. Windows AMD users
should fall through to resolve_upstream_asset_choice which tries the
HIP prebuilt first. Add "not host.has_rocm" guard to the published
windows-cpu selection.
* Harden ROCm detection, Radeon wheel fallback, and HIP visibility
Addresses review findings from parallel reviewers on PR #4720:
- install.sh: add _has_usable_nvidia_gpu() helper requiring nvidia-smi -L
to actually list a GPU before treating the host as NVIDIA. Fixes the
stale-nvidia-smi-on-PATH regression where AMD-only hosts fell into the
CUDA branch.
- install.sh: fix hipconfig awk blocks to propagate a non-zero exit code
when the output is not a recognisable version string, so the ||-chain
continues to dpkg-query / rpm instead of terminating early.
- install.sh: fail-closed on Radeon wheel fallback. When torch,
torchvision or torchaudio is missing from the Radeon repo for the
active Python tag, fall back to the standard ROCm index instead of
silently mixing Radeon wheels with PyPI defaults. Quote all wheel
arguments individually so wheel filenames cannot be word-split or
glob-expanded.
- install_llama_prebuilt.py: detect_host() now requires nvidia-smi -L to
list a GPU before setting has_physical_nvidia. Routes AMD ROCm hosts
with a broken leftover nvidia-smi to the ROCm path instead of
misclassifying them as NVIDIA.
- install_llama_prebuilt.py: scan upstream assets for any rocm-<version>
prebuilt instead of hard-coding rocm-7.2, so ROCm 6.x / 7.0 / 7.1 / 7.3+
users pick up a matching upstream prebuilt when one exists.
- install_llama_prebuilt.py: validate_server() adds --n-gpu-layers 1 for
linux-rocm and windows-hip hosts, so new HIP prebuilts are preflighted
on the GPU path instead of passing validation on CPU only.
- install_llama_prebuilt.py: restore the published windows-cpu fallback
for AMD Windows hosts without a HIP prebuilt so hash-approved bundles
are still preferred over the raw upstream CPU asset.
- install_python_stack.py: drop the /opt/rocm / hipcc gate in
_ensure_rocm_torch() and rely on _has_rocm_gpu(). Runtime-only ROCm
installs (package-managed minimal installs, Radeon software) that ship
amd-smi / rocminfo without hipcc can now repair a CPU-only venv via
"unsloth studio update". Adds an explicit IS_WINDOWS / IS_MACOS guard.
- studio/backend/utils/hardware/amd.py: honour HIP_VISIBLE_DEVICES /
ROCR_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES in
get_primary_gpu_utilization(). A process restricted to GPU 2 now
reports metrics for GPU 2 instead of physical GPU 0. Tighten the plain
bytes unit detection to an explicit allowlist.
- studio/backend/utils/hardware/hardware.py: route
get_backend_visible_gpu_info()'s backend_cuda_visible_devices field
through a helper that reads HIP_VISIBLE_DEVICES on ROCm. Drop the
unconditional "(rocm=False)" suffix in apply_gpu_ids() logs.
* Fix round 2 regressions: ROCm validate_server and Windows HIP routing
Follow-up to 810b833b addressing review findings on the first round of
hardening commits:
- install_llama_prebuilt.py validate_server: gate --n-gpu-layers on the
resolved install_kind instead of host.has_rocm. AMD Windows hosts
without a HIP prebuilt fall back to windows-cpu and must not be
validated with GPU layers; thread install_kind through from the
caller.
- install_llama_prebuilt.py resolve_release_asset_choice: reinstate the
"not has_rocm" guard on the published windows-cpu bundle so AMD
Windows hosts reach resolve_upstream_asset_choice() where the new
HIP prebuilt path lives. Prefer a published windows-hip bundle first
when one exists, fall through to upstream HIP + upstream CPU
otherwise.
- install_llama_prebuilt.py detect_host: also set has_physical_nvidia
when the secondary --query-gpu block confirms a working NVIDIA GPU,
so older nvidia-smi versions without -L support do not silently skip
the Linux diagnostics that key off has_physical_nvidia.
- install_llama_prebuilt.py: drop redundant "import re as _re" /
"import re as _re_rocm" local aliases in favour of the existing
top-level "import re".
- install_python_stack.py _ensure_rocm_torch: run the AMD
bitsandbytes install unconditionally after the HIP-torch probe so
"unsloth studio update" on venvs that already have ROCm torch still
gains the AMD bitsandbytes build.
- install.sh: add a non-x86_64 early-exit to get_torch_index_url() so
aarch64 / arm64 Linux hosts do not hit the ROCm wheel index
(PyTorch only publishes ROCm wheels for linux_x86_64).
- install.sh: add bitsandbytes install to the migrated-environment
branch so upgrades pick it up for ROCm hosts instead of only the
fresh-install path.
- install.sh: in the Radeon wheel path, pass version constraints +
--no-index --find-links to uv instead of explicit wheel URLs so a
version-compatible torch / torchvision / torchaudio triple is
resolved, rather than picking the highest-version wheel for each
package independently.
- studio/backend/utils/hardware/amd.py _first_visible_amd_gpu_id: fall
through to lower-priority visibility env vars when the first entry
is malformed (leading comma, all-whitespace first token) instead of
silently returning GPU 0.
* Fix round 3 findings: x86_64 guard, ROCm version clip, Radeon deps
Address issues surfaced by the round 3 reviewers on top of 8636fa63:
- install_python_stack.py _ensure_rocm_torch: add the same `x86_64`
guard that install.sh already has. Linux aarch64 / arm64 ROCm hosts
must skip the repair path entirely; PyTorch only publishes ROCm
wheels for linux_x86_64, and without this guard
`unsloth studio update` aborts with a missing-wheel error on
non-x86_64 hosts.
- install_llama_prebuilt.py resolve_upstream_asset_choice: add a
best-effort _detect_host_rocm_version() helper (reading
/opt/rocm/.info/version, amd-smi version, hipconfig --version) and
filter rocm_candidates to entries whose major.minor is <= host
version. Falls back to the newest candidate only when no compatible
one exists, so a ROCm 6.4 host downloads rocm-6.4 instead of being
handed the numerically newest rocm-7.2 bundle (which fails preflight
and forces a source build).
- install.sh: remove the round 2 --no-index switch from the Radeon
wheel branch. --no-index forced uv to ignore PyPI entirely, which
broke transitive dependency resolution (filelock, sympy, networkx,
jinja2, fsspec, setuptools, typing-extensions, ...) on a fresh venv.
Restore the round 1 explicit wheel URL invocation but add a
torch / torchvision / torchaudio version-pair sanity check so a
mismatched trio (e.g. torch 2.9.1 + torchvision 0.23.0 + torchaudio
2.9.0) falls back to the standard ROCm index instead of installing a
broken combination.
- install_python_stack.py _ensure_rocm_torch: restructure the
"tag is None" path so it no longer short-circuits the bitsandbytes
install. On a ROCm runtime older than anything in
_ROCM_TORCH_INDEX, print the "no wheel" warning but still run the
AMD bitsandbytes install.
- studio/backend/core/training/worker.py: restore the pre-PR
"no timeout" behaviour for non-HIP causal-conv1d / mamba-ssm source
builds. The round 2 "timeout = 1800 if is_hip else 300" cap aborts
slow non-HIP builds (Linux aarch64, unsupported torch/CUDA combos)
after 5 minutes; omit timeout for the non-HIP branch so the cap
only applies to ROCm source builds.
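The rocm_candidates filtering in the resolve_upstream_asset_choice item can be
sketched as follows. `pick_rocm_candidate` is a hypothetical name, and picking
the newest of the compatible candidates is an assumption; the commit only
states the <= host filter and the newest-overall fallback.

```python
from typing import Optional

Version = tuple[int, int]  # (major, minor)


def pick_rocm_candidate(
    candidates: list[Version], host: Optional[Version]
) -> Optional[Version]:
    """Prefer the newest candidate <= host version; else the newest overall."""
    if not candidates:
        return None
    if host is not None:
        compatible = [c for c in candidates if c <= host]
        if compatible:
            return max(compatible)
    # Host version unknown, or nothing compatible: fall back to the newest.
    return max(candidates)


exact = pick_rocm_candidate([(6, 4), (7, 2)], (6, 4))
fallback = pick_rocm_candidate([(7, 2)], (6, 4))
```

Tuples compare lexicographically, so (6, 4) <= (7, 0) holds without any
string handling.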
* Fix round 4 findings: apply_gpu_ids env inheritance, Radeon X.Y, bitsandbytes gate
Address remaining issues surfaced by the round 4 reviewers:
- studio/backend/utils/hardware/hardware.py apply_gpu_ids: mirror the
selection into HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES whenever
the caller already had a ROCm visibility env var set, not only when
IS_ROCM has already been set by detect_hardware(). Training and
inference workers call apply_gpu_ids() before detect_hardware()
runs, so the old guard would leave a forked ROCm worker with a
stale HIP_VISIBLE_DEVICES mask that no longer matched the
narrowed CUDA_VISIBLE_DEVICES selection.
- install.sh get_radeon_wheel_url: accept X.Y ROCm versions in
addition to X.Y.Z. The `/opt/rocm/.info/version` file and some
hipconfig versions report only two components, and the Radeon
repository publishes both rocm-rel-X.Y.Z/ and rocm-rel-X.Y/
directories, so treating X.Y as invalid caused Radeon hosts to fall
back to the generic ROCm index even when a matching AMD wheel set
existed.
- install_python_stack.py _ensure_rocm_torch: only install the AMD
bitsandbytes build when the venv actually has a ROCm-compatible
torch (either already present or just installed by this function).
Previously the bitsandbytes install ran unconditionally, which
could leave an AMD bitsandbytes layered on top of a CPU/CUDA torch
on hosts where the ROCm runtime is older than any entry in
_ROCM_TORCH_INDEX. Also add --force-reinstall so an existing
CPU/CUDA bitsandbytes is replaced by the AMD build during upgrades.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix gemini findings: amd-smi metric envelope validation and dict-wrapped GPU id
Two medium-severity defensive fixes from the gemini-code-assist review on
the AMD monitoring backend:
1. _extract_gpu_metrics may return a dict where every value is None when
amd-smi succeeds (zero exit) but the JSON envelope contains no usable
fields (error response, unsupported card). The new _has_real_metrics
helper lets get_primary_gpu_utilization surface available:False and
lets get_visible_gpu_utilization skip ghost device rows so the UI
does not render placeholder cards with empty numbers.
2. Newer amd-smi versions wrap scalar fields as {"value": 0, "unit":
"none"}, including the per-GPU id. The previous int(raw_id) call
silently fell back to the enumeration index in that case, losing the
real GPU id. Routing raw_id through the existing _parse_numeric
helper handles bare ints, floats, strings, and the dict shape
uniformly, with a debug log on parse failure.
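A sketch of the dict-aware numeric parsing from item 2. The function bodies
are illustrative, not the actual `_parse_numeric`, and the debug log is
omitted.

```python
import math
from typing import Any, Optional


def parse_numeric(raw: Any) -> Optional[float]:
    """Accept bare ints/floats, numeric strings, and {"value": ...} dicts."""
    if isinstance(raw, dict):
        raw = raw.get("value")
    if raw is None or isinstance(raw, bool):
        return None
    if isinstance(raw, str):
        try:
            raw = float(raw.strip())
        except ValueError:
            return None
    if isinstance(raw, (int, float)) and math.isfinite(raw):
        return float(raw)
    return None


def gpu_id(raw_id: Any, fallback_index: int) -> int:
    """Use the reported GPU id when parseable, else the enumeration index."""
    parsed = parse_numeric(raw_id)
    return int(parsed) if parsed is not None else fallback_index


wrapped = gpu_id({"value": 3, "unit": "none"}, 0)
unparseable = gpu_id("N/A", 5)
```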
* Fix gemini round 2 findings: explicit length guard on ROCm version file parser
Both _detect_rocm_version (install_python_stack.py) and
_detect_host_rocm_version (install_llama_prebuilt.py) read /opt/rocm/.info/version
or $ROCM_PATH/lib/rocm_version, split on "." and unconditionally accessed
parts[1]. The surrounding broad `except Exception: pass` already swallowed
the resulting IndexError, so a one-component file like "6\n" did fall
through to the next detection source -- but the control flow relied on
exception handling instead of an explicit check.
Add `if len(parts) >= 2:` guards in both helpers so the loop falls through
on its own without raising. Behaviour is unchanged for the common multi-
component case; the previously-silent IndexError path becomes an explicit
no-op.
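The explicit guard amounts to something like the sketch below.
`parse_rocm_version` is a hypothetical helper; the real functions parse inline
inside their detection loops.

```python
from typing import Optional


def parse_rocm_version(text: str) -> Optional[tuple[int, int]]:
    """Return (major, minor) from text like "6.4.1"; None if too short."""
    parts = text.strip().split(".")
    if len(parts) >= 2 and parts[0].isdigit() and parts[1].isdigit():
        return int(parts[0]), int(parts[1])
    return None  # caller moves on to the next detection source


full = parse_rocm_version("6.4.1\n")
one_component = parse_rocm_version("6\n")
```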
* Fix gemini round 3: include has_rocm in validate_server fallback path
When validate_server is called without an explicit install_kind (older
call sites that have not been updated), the fallback was only enabling
--n-gpu-layers for NVIDIA and macOS arm64 hosts. AMD ROCm Linux hosts
fell through to the CPU validation path even though the prebuilt being
exercised was a HIP binary.
Add host.has_rocm to the fallback expression so the GPU offload flag is
applied consistently with the install_kind=='linux-rocm' / 'windows-hip'
branches above.
* Fix gemini round 4: remove risky bytes-vs-MB heuristic in _parse_memory_mb
The previous heuristic divided any bare number above 10_000_000 by
1024*1024 on the assumption that large unit-less values were bytes.
This misclassified small VRAM allocations: 5 MB of used VRAM reported
as 5_242_880 bytes without a unit would be taken at face value and
render as 5_242_880 MB (~5 TB) in the monitoring UI.
Modern amd-smi always provides explicit units (MiB/GiB dict form),
and legacy amd-smi returns bare numbers in MB -- the heuristic never
had a real workload to handle. Drop it and default to MB for bare
numeric input, keeping the existing unit-aware branches for dict /
string inputs unchanged.
The unrelated gemini suggestion to "default minor to 0" in the
amd-smi version awk parser was intentionally NOT applied: rocm7.0
and rocm7.1 ship different wheel sets, so silently substituting 0
for a missing minor could install the wrong wheels. The existing
reject-and-fall-through behaviour is safer.
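The resulting parsing rule might look roughly like the sketch below. The function name, the `{"value": ..., "unit": ...}` dict layout, and the unit table (which treats MB and MiB as equivalent) are assumptions made for illustration, not the actual `_parse_memory_mb` implementation.

```python
from typing import Any, Optional

# Assumed unit table; MB/MiB and GB/GiB are treated as equivalent here.
_UNIT_TO_MB = {
    "b": 1.0 / (1024 * 1024),
    "kb": 1.0 / 1024, "kib": 1.0 / 1024,
    "mb": 1.0, "mib": 1.0,
    "gb": 1024.0, "gib": 1024.0,
}


def _scaled(number: Any, unit: Optional[str]) -> Optional[float]:
    try:
        amount = float(number)
    except (TypeError, ValueError):
        return None
    if unit is None:
        return amount  # no unit given: interpret as MB
    factor = _UNIT_TO_MB.get(str(unit).strip().lower())
    return None if factor is None else amount * factor


def parse_memory_mb(value: Any) -> Optional[float]:
    """Sketch: bare numbers are MB; dict / string forms honour their unit."""
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)  # no bytes-vs-MB guessing
    if isinstance(value, dict):
        return _scaled(value.get("value"), value.get("unit"))
    if isinstance(value, str):
        tokens = value.split()
        if len(tokens) == 1:
            return _scaled(tokens[0], None)
        if len(tokens) == 2:
            return _scaled(tokens[0], tokens[1])
    return None
```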
* Fix gemini round 5: POSIX compliance and leading-comma visibility parsing
Three medium findings from gemini-code-assist addressed in this commit:
1. _pick_radeon_wheel used grep -o and sort -V, both GNU extensions
that are not in POSIX and break on BSD/BusyBox coreutils. install.sh
has a #!/bin/sh shebang so the whole pipeline was rewritten as a
single awk script that extracts all href="..." hits on each line,
filters to wheels matching the package prefix and python tag, and
picks the newest version via zero-padded lexical comparison. No
external sort or grep is needed.
2. _first_visible_amd_gpu_id in the AMD monitoring backend treated a
leading comma (e.g. HIP_VISIBLE_DEVICES=",1") as "fall through to
the next env var", which is surprising given the clear intent to
narrow to device 1. Filter empty tokens after the split and return
the first real one. An all-commas value ("," / ",,,") still falls
through because no real tokens exist; the empty-string and "-1"
explicit-zero cases are unchanged.
The unrelated amd-smi version awk parser suggestion was not applied
(see round 4 commit message for rationale: defaulting a missing minor
to 0 could silently install the wrong ROCm wheel set).
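The leading-comma handling from item 2 can be sketched as below. `first_real_token` is a hypothetical name, and the sketch covers only the token filtering; it does not model the empty-string / "-1" cases or the fall-through across environment variables.

```python
from typing import Optional


def first_real_token(value: str) -> Optional[str]:
    """Return the first non-empty comma-separated token, or None.

    Hypothetical helper for illustration only.
    """
    tokens = [token.strip() for token in value.split(",")]
    real = [token for token in tokens if token]
    # None signals "no real tokens": the caller falls through to the next var.
    return real[0] if real else None
```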
* Fix 20-reviewer.py findings: base drift, Radeon %2B, dpkg/rpm fallback, bnb, backend label
Consolidated fix batch from a 20-parallel reviewer.py run on the current
head. Each fix is drawn from a high-consensus finding and addresses a
real bug or feature gap, not a stylistic preference.
1. install.sh: bump `unsloth>=2026.4.2` -> `unsloth>=2026.4.4` at five
call sites so this branch no longer regresses main's version floor
(main bumped to 2026.4.4 in #4876). Without this, merging 4720 would
silently downgrade the minimum version pin for fresh installs.
2. install.sh: URL-decode Radeon wheel names before extracting the
torch / torchvision / torchaudio version strings. Real wheel URLs
from repo.radeon.com are percent-encoded ("torch-2.10.0%2Brocm7.2.0...")
so the previous `[+-]` terminator in the sed regex never matched,
`_torch_ver` stayed empty, `_radeon_versions_match` stayed false,
and every Radeon consumer install silently fell back to the generic
ROCm index. Now decode %2B -> + first, then extract, then validate.
3. install.sh: the two AMD bitsandbytes install lines were running
`uv pip install "bitsandbytes>=0.49.1"` without `--force-reinstall`,
so upgrades where the venv already has a CPU/CUDA bitsandbytes
satisfying the constraint would keep the stale non-AMD wheel. Add
`--force-reinstall --no-cache-dir` to both call sites, matching the
pattern already used in install_python_stack.py::_ensure_rocm_torch.
4. install_python_stack.py and install_llama_prebuilt.py: add
`dpkg-query -W rocm-core` and `rpm -q rocm-core` fallbacks to the
Python-side ROCm version detectors so they match the chain in
install.sh::get_torch_index_url. Package-managed ROCm installs
(Debian/Ubuntu/RHEL/Fedora distro packages) can expose GPUs via
rocminfo/amd-smi but still lack /opt/rocm/.info/version, hipconfig,
or amd-smi `version` output -- without these fallbacks, `unsloth
studio update` on such hosts returned None and skipped the ROCm
torch repair. Also strip the dpkg epoch prefix ("1:6.3.0-1") before
parsing so epoch-annotated packages parse correctly.
5. hardware.py: add a `_backend_label(device)` helper that returns
"rocm" when IS_ROCM is set and the device is DeviceType.CUDA, and
use it for every `"backend": ...` emission in JSON responses served
to the Studio frontend. Internally we still represent ROCm hosts as
DeviceType.CUDA (ROCm torch reuses the whole torch.cuda.* API
surface), but the user-facing API now correctly reports "rocm" on
AMD boxes instead of labeling them as "cuda".
All 250 simulation scenarios pass (was 233 before this batch: added 17
new regression tests covering the version pin, %2B decoding, bnb
force-reinstall flags, dpkg/rpm fallback presence, and the
_backend_label helper's four-way truth table).
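For illustration, the decode-then-extract order from item 2 could be expressed in Python as follows. install.sh does this with sed; the function name and the wheel filename suffix used in the usage note are made up for the example.

```python
import re
from typing import Optional
from urllib.parse import unquote


def torch_version_from_wheel(name: str) -> Optional[str]:
    """Extract the torch version from a possibly percent-encoded wheel name.

    Illustrative Python equivalent of the shell logic; not the shipped code.
    """
    decoded = unquote(name)  # turns %2B back into "+"
    # The version ends at the first "+" or "-" after the dotted number.
    match = re.match(r"torch-([0-9]+(?:\.[0-9]+)*)[+-]", decoded)
    return match.group(1) if match else None
```

Applied to a name such as "torch-2.10.0%2Brocm7.2.0-example.whl" (suffix invented here), this returns "2.10.0". Running the same regex on the undecoded name finds no `[+-]` terminator and returns nothing, which mirrors the bug described above.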
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix gemini round 6 + URL audit: amd.py defensive checks, rocm6.5+ clip to 6.4
Two rounds of fixes in one commit, plus a full URL audit of every PyPI /
download.pytorch.org / repo.radeon.com reference the PR introduces.
amd.py (4 medium gemini findings on commit b3627bc2):
1. _extract_gpu_metrics used `and vram_total_mb` as part of the vram_util
gate. The follow-up `vram_total_mb > 0` already handles the division
guard, so the truthiness check was redundant and slightly surprising
for a valid 0.0 value. Replace with explicit `is not None and > 0`
for both vram_util and power_util.
2. get_physical_gpu_count called `data.get("gpu", ...)` without guarding
for non-dict envelopes. A scalar / string JSON response from amd-smi
would raise AttributeError. Add an isinstance(data, dict) check and
return None for unexpected shapes.
3. get_visible_gpu_utilization had the same .get() exposure on the outer
envelope. Rewrite the gpu_list extraction as an explicit
list/dict/else cascade so a malformed scalar envelope produces
gpu_list=[data] and continues without raising.
4. The same function's per-entry loop also called gpu_data.get() on
whatever was inside gpu_list. If a scalar ever leaks into the list
(directly or via the previous fix's fallback), _extract_gpu_metrics
would raise on the first .get() inside the helper. Skip non-dict
entries in the loop before extracting metrics.
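A rough sketch of the list/dict/else cascade plus the non-dict skip from items 3 and 4. The helper name and the "gpu_data" envelope key are assumptions; the real code inlines this logic and may look up different keys.

```python
from typing import Any, Dict, List


def normalise_gpu_list(data: Any) -> List[Dict[str, Any]]:
    """Coerce an amd-smi JSON payload into a list of per-GPU dicts.

    Hypothetical helper; the envelope key "gpu_data" is an assumption.
    """
    if isinstance(data, list):
        gpu_list = data
    elif isinstance(data, dict):
        inner = data.get("gpu_data", data)
        gpu_list = inner if isinstance(inner, list) else [inner]
    else:
        # Malformed scalar envelope: wrap it so the loop below can skip it.
        gpu_list = [data]
    # Skip anything that is not a dict before metrics are extracted.
    return [entry for entry in gpu_list if isinstance(entry, dict)]
```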
install.sh (URL audit finding, previously flagged by 20-reviewer as #13):
5. get_torch_index_url used `rocm6.*` in the rocm tag case statement,
which matched rocm6.5 and rocm6.6 and emitted
download.pytorch.org/whl/rocm6.5 -- which returns HTTP 403 because
PyTorch only publishes rocm 5.7, 6.0-6.4, 7.0-7.2. Enumerate the
supported 6.x minors explicitly and add a rocm6.* fallback branch
that clips to rocm6.4 (the last supported 6.x wheel set).
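The clipping rule from item 5, restated in Python for clarity. install.sh implements it as a shell case statement; the function name and the choice to return None for unlisted versions are assumptions of this sketch.

```python
from typing import Optional

# Wheel sets listed above as published: 5.7, 6.0-6.4, 7.0-7.2.
_SUPPORTED = {(5, 7)} | {(6, m) for m in range(5)} | {(7, m) for m in range(3)}


def rocm_index_tag(major: int, minor: int) -> Optional[str]:
    """Map a detected ROCm version to a PyTorch index tag (sketch only)."""
    if (major, minor) in _SUPPORTED:
        return f"rocm{major}.{minor}"
    if major == 6 and minor > 4:
        return "rocm6.4"  # clip unsupported 6.x minors to the last 6.x set
    return None  # assumption: the caller decides what to do otherwise
```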
URL audit results (all URLs PR 4720 references):
- 14/14 download.pytorch.org/whl/{cpu,cu118,cu124,cu126,cu128,cu130,
rocm6.0..6.4,rocm7.0..7.2} return HTTP 200.
- 9/9 repo.radeon.com/rocm/manylinux/rocm-rel-{5.7,6.0,6.1,6.2,6.3,
6.4,7.0,7.1,7.2}/ return HTTP 200.
- X.Y.Z patch directories exist for 7.0.2, 7.1.1, 7.2.1 but NOT for
6.3.0, 6.4.0, 6.2.1 -- install.sh already handles this via the X.Y.Z
-> X.Y fallback sed in the Radeon wheel install block.
- Docs links (rocm.docs.amd.com, docs.unsloth.ai AMD guide) and the
llama.cpp GitHub releases API endpoint all return 200.
Test suite: 255 -> 258. New regression coverage:
- U17: get_physical_gpu_count tolerates scalar amd-smi envelope
- U18: get_visible_gpu_utilization tolerates scalar envelope
- U19a-c: vram_util / power_util return None on zero total, but
vram_total_gb still echoes 0.0 (not None)
- A_rocm{6.5,6.6,6.9}_clips_to_rocm64: install.sh clips unsupported
6.x minors to rocm6.4 instead of producing a 403 index URL
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix reviewer.py round 2: tokenizer AMD multi-GPU, --no-torch bnb, main.py backend label
Three high-confidence findings from a second 20-parallel reviewer.py run
on commit 7effb3ae. Triaged 15 total findings and applied the three that
were confirmed as real bugs; the rest were either false positives (e.g.
"migrated AMD venv not repaired" -- _ensure_rocm_torch runs downstream
via setup.sh regardless), design decisions (e.g. visibility mask env
vars not consulted in installer detection), or edge cases the existing
fallback logic already handles.
1. unsloth/tokenizer_utils.py [6/20]: the multi-GPU guard's shell probe
runs `nvidia-smi --query-gpu=memory.used`, catches the failure, then
only raises if `torch.cuda.is_available()` is False. On ROCm torch,
torch.cuda.is_available() returns True (ROCm reuses the torch.cuda.*
API), so the guard becomes dead code on AMD hosts and multi-GPU AMD
setups slip through even though unsloth does not support them yet.
Add a torch.cuda.device_count() > 1 fallback inside the except so
AMD multi-visible-device setups are flagged consistently with the
original CUDA memory check.
2. install.sh [1/20]: the fresh-install bitsandbytes block for AMD ROCm
ran unconditionally when TORCH_INDEX_URL matched `*/rocm*`, even when
SKIP_TORCH=true (from --no-torch or Intel Mac auto-detect). A user
running `install.sh --no-torch` on an AMD host would still pull in
bitsandbytes despite explicitly asking for GGUF-only mode. Wrap the
case block in an outer `[ "$SKIP_TORCH" = false ]` guard.
3. studio/backend/main.py [3/20]: the /api/system endpoint returned
`"device_backend": get_device().value`, which is "cuda" on ROCm
hosts (because ROCm torch piggybacks on torch.cuda). Other endpoints
(hardware.py) already use the _backend_label helper which swaps
"cuda" -> "rocm" when IS_ROCM. Route /api/system through the same
helper so the Studio UI reports the backend consistently across all
endpoints.
4. studio/backend/tests/test_utils.py: update test_backend_matches_device
to call _backend_label(get_device()) instead of raw get_device().value
so the test matches the new contract and still passes on CUDA hosts.
Tests: 258 -> 261. New regression coverage:
- X08 main.py /api/system uses _backend_label
- X09 tokenizer multi-GPU guard has device_count() fallback
- X10 fresh-install bnb case block gated on SKIP_TORCH=false
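The `_backend_label` contract that item 3 routes /api/system through can be sketched as below. The `DeviceType` enum members and the explicit `is_rocm` parameter are stand-ins for illustration; the real helper reads a module-level IS_ROCM flag.

```python
from enum import Enum


class DeviceType(Enum):
    # Stand-in enum; only the members needed for the example.
    CUDA = "cuda"
    CPU = "cpu"


def backend_label(device: DeviceType, is_rocm: bool) -> str:
    """Report "rocm" for CUDA-typed devices on ROCm hosts (sketch only)."""
    if is_rocm and device is DeviceType.CUDA:
        return "rocm"
    return device.value
```

Only the (CUDA, ROCm) combination is relabelled; the other three cases pass the raw device value through.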
* fix: prevent bitsandbytes from overwriting ROCm torch with CUDA wheels
During install, bitsandbytes was installed without --no-deps, causing
uv to resolve torch from PyPI (CUDA build) and silently overwrite the
ROCm wheels that were just installed in the previous step.
This happened in three places:
- install.sh: bitsandbytes install in both migrated and fresh paths
- install_python_stack.py: bitsandbytes install inside _ensure_rocm_torch()
Additionally, multiple install steps in install_python_stack.py (extras,
overrides, studio deps) can pull in CUDA torch via transitive
dependencies. A final _ensure_rocm_torch() call at the end of the
install sequence ensures ROCm torch is always in place at runtime.
All changes are gated behind ROCm-specific conditions and do not affect
NVIDIA, CPU-only, macOS, or Windows install paths.
Tested on AMD Instinct MI300X VF with ROCm 7.2.0 -- confirms
torch==2.10.0+rocm7.1 with HIP 7.1.25424 after install.
* fix: ROCm inference fallback -- skip Unsloth patching and bnb 4-bit on HIP
On AMD ROCm (HIP), two issues prevent the normal Unsloth inference path:
1. Unsloth's global monkey-patching of transformers model classes
(LlamaRotaryEmbedding, attention modules) triggers
_assert_async_cuda_kernel crashes on HIP during generation.
Training uses different code paths and works fine.
2. bitsandbytes 4-bit matmul kernels also trigger HIP assertion
failures on MI300X (CDNA3 / gfx942), even without Unsloth patching.
This commit adds a ROCm-specific inference fallback that:
- Skips importing Unsloth at module level (prevents global patching)
- Loads models in 16-bit with plain transformers + PEFT instead
- Resolves pre-quantized model names (e.g. "xxx-bnb-4bit" -> "xxx")
since pre-quantized HF repos still trigger bnb codepaths
- Guards get_chat_template calls (unavailable without Unsloth import)
- Fixes max_seq_length=0 being passed to from_pretrained (GGUF
semantics don't apply to transformers path)
The NVIDIA path is completely unchanged -- Unsloth import and
for_inference() optimization remain active. GGUF inference (via
llama-server/HIP) is unaffected since it never imports Python model
classes. AMD GPUs typically have large VRAM (e.g. 192GB on MI300X)
so 16-bit loading is practical for inference.
Tested on AMD Instinct MI300X VF (ROCm 7.2, HIP 7.1.25424):
- Simple generation: PASS
- Compare mode (base vs finetuned): PASS
- GGUF inference + tool calling: PASS (unaffected by this change)
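The pre-quantized name resolution mentioned above ("xxx-bnb-4bit" -> "xxx") amounts to a suffix strip. A minimal sketch, assuming a hypothetical function name and that "-bnb-4bit" is the only suffix handled:

```python
def resolve_16bit_name(model_name: str) -> str:
    """Drop a trailing "-bnb-4bit" so the 16-bit repo is loaded instead.

    Hypothetical helper; the real resolver may handle more suffixes.
    """
    suffix = "-bnb-4bit"
    if model_name.lower().endswith(suffix):
        return model_name[: -len(suffix)]
    return model_name
```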
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: guard audio/vision inference on ROCm, remove unused import
- Add clear RuntimeError for audio/vision model inference on ROCm
(these paths use Unsloth's FastModel/FastVisionModel which would
crash on HIP; GGUF inference is the supported path on AMD)
- Remove unused `import os as _os` from the ROCm changes
* fix: amd-smi parsing for newer output format (gpu_data wrapper, mem_usage, temperature)
amd-smi on recent ROCm versions (7.x) wraps metric output in a
{"gpu_data": [...]} envelope instead of returning a raw list. This
caused get_primary_gpu_utilization() and get_visible_gpu_utilization()
to fail silently (returning available=False) because the GPU data
dict was never unwrapped.
Additionally:
- VRAM data moved from "vram" to "mem_usage" with "total_vram" /
"used_vram" keys. Added fallback key lookup.
- Temperature "edge" sensor returns "N/A" on MI300X VF; the previous
dict.get() chain returned the "N/A" string instead of falling
through to "hotspot". Changed to a loop that checks each key until
a parseable value is found.
Tested on AMD Instinct MI300X VF (ROCm 7.2, amd-smi 24.x):
- GPU utilization: 0% (idle), up to 100% during training
- Temperature: 40-44C (from hotspot sensor)
- VRAM: 0.28/191.69 GB (idle)
- Power: 158-211W draw
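The temperature fallback described above (try each sensor key until one parses) might look like this sketch. The function name, the key order, and the optional `{"value": ...}` wrapper are assumptions, not the actual amd.py code.

```python
from typing import Any, Mapping, Optional, Sequence


def pick_temperature(
    temps: Mapping[str, Any],
    keys: Sequence[str] = ("edge", "hotspot"),
) -> Optional[float]:
    """Return the first sensor reading that parses as a number (sketch)."""
    for key in keys:
        raw = temps.get(key)
        if isinstance(raw, Mapping):
            raw = raw.get("value")  # assumed {"value": ..., "unit": ...} form
        if isinstance(raw, bool):
            continue
        try:
            return float(raw)
        except (TypeError, ValueError):
            continue  # e.g. "N/A" or a missing key: try the next sensor
    return None
```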
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Bug fix detecting radeon (#4940)
* Bug fix detecting radeon
* Expanding GPU target for gfx1100*
* Generalize gfx family-prefix filter to cover gfx10/gfx12 as well
rocminfo on ROCm 6.1+ emits LLVM generic-family ISA lines alongside the
specific GPU (e.g. gfx11-generic next to gfx1100). The outer grep captures
the bare family prefix from the generic line, and passing that to
-DGPU_TARGETS breaks the HIP build because clang only accepts specific
gfxNNN ids.
The previous filter only special-cased gfx11. Generalize it so any bare
2-digit family prefix (gfx10, gfx11, gfx12, ...) is dropped whenever a
specific sibling target is present in the same list. No real AMD GPU has
a 2-digit gfx id, so the filter can only ever drop family prefixes and
never a real target.
Covers the existing gfx11 cases unchanged, and extends the same fix to
gfx10-1-generic / gfx10-3-generic (RDNA1/2) and gfx12-generic (RDNA4),
which would otherwise hit the same build failure on newer rocminfo.
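The generalized filter can be illustrated in Python as follows. The actual fix lives in the build script; the function name and the list-of-strings input are assumptions of this sketch.

```python
import re
from typing import List


def drop_family_prefixes(targets: List[str]) -> List[str]:
    """Drop bare 2-digit family prefixes that have a specific sibling.

    Illustrative only; "gfx11" is dropped when e.g. "gfx1100" is present.
    """
    kept = []
    for target in targets:
        is_family = re.fullmatch(r"gfx\d{2}", target) is not None
        has_sibling = any(
            other != target and other.startswith(target) for other in targets
        )
        if is_family and has_sibling:
            continue
        kept.append(target)
    return kept
```

A lone family prefix with no specific sibling is left in place, matching the "whenever a specific sibling target is present" condition.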
---------
Co-authored-by: Iswarya Alex <iswarya.alex@amd.com>
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
---------
Co-authored-by: Eda Z <eda.zhou@amd.com>
Co-authored-by: GoldenGrapeGentleman <yueyuan@amd.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: billishyahao <bill.he@amd.com>
Co-authored-by: Iswarya Alex <47045679+iswaryaalex@users.noreply.github.com>
Co-authored-by: Iswarya Alex <iswarya.alex@amd.com>
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* updated models template mappers. added lfm2.5vl450m to transformers 5.3.0 whitelist
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix: check find() return value before adding offset in try_fix_tokenizer
The `str.find()` result was checked for -1 only after adding
`len(find_text)`, turning the guard into dead code. When the substring
is absent, `start` becomes `len(find_text) - 1` (a non-negative number),
so the `if start == -1: continue` never triggers and the subsequent
slice extracts garbage from the tokenizer string.
Split the find and offset into two steps so the -1 check works correctly.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add defensive guards for token_id None and end find() returning -1
- Skip loop iteration early when token_id is None to avoid constructing
a find_text that can never match valid JSON
- Guard end = tokenizer_string.find('",', start) against -1 to prevent
silent garbage extraction from malformed tokenizer strings
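Both find() guards, on the start and on the end delimiter, follow the same pattern: check for -1 before doing arithmetic or slicing. A minimal standalone sketch, with a hypothetical function name and sample strings:

```python
from typing import Optional


def extract_value(tokenizer_string: str, find_text: str) -> Optional[str]:
    """Return the text between find_text and the next '",', or None."""
    start = tokenizer_string.find(find_text)
    if start == -1:  # check BEFORE adding the offset
        return None
    start += len(find_text)
    end = tokenizer_string.find('",', start)
    if end == -1:  # malformed input: no closing delimiter
        return None
    return tokenizer_string[start:end]
```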
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(chat): sticky composer bar in thread
* fix(chat): fix compare pane clipping
* fix(chat): tighten scroll-to-bottom placement and compare footer spacing
* Fix TypeScript build break and clean up ViewportFooter classes
- Remove unused `compact` prop from ThreadScrollToBottom call site
(component is FC with no props, passing it caused TS2322)
- Extract shared classes (sticky, bottom-0, z-20, bg-transparent) from
ternary branches into the unconditional className string
- Restore `relative` on normal-mode footer so the inner absolute
bg-background strip has a positioning context
- Remove redundant md:pb-3 / md:pb-4 (same value as base pb-3 / pb-4)
- Remove no-op `sticky bottom-0` from SharedComposer wrapper in both
LoraCompareContent and GeneralCompareContent (flex layout with
shrink-0 already pins it at the bottom; parent has no scrollable
overflow for sticky to bind to)
- Fix truncated comment on pointer-events rationale
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Fix raw text paragraph break normalization
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Normalize horizontal whitespace before stripping non-ASCII and collapse leftover doubles
Run the [^\S\n]+ horizontal-whitespace collapse before the non-ASCII strip
so that Unicode whitespace (\u00A0, \u202F, \u2009, \u3000, \v, \f, etc.)
becomes a single ASCII space instead of being deleted outright. The prior
ordering silently merged adjacent words on HTML/PDF/OCR-sourced text:
"hello\u00a0world" previously produced "helloworld"; it now
produces "hello world".
Also drop \t from the allow-list since the horizontal-whitespace collapse
already normalizes tabs to a single space, and add a targeted [ ]{2,} pass
right after the non-ASCII strip so that a non-whitespace non-ASCII character
sitting between two spaces ("word1 (c) word2") does not leave an interior
double space. Without this extra pass, clean_text was not idempotent on
such inputs: the first call produced "word1  word2" (two spaces) and only
the second call collapsed it to "word1 word2". In fuzz testing over 10000
random inputs, clean_text now satisfies the idempotence invariant in every case.
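The pass ordering described above can be sketched as three substitutions. This is a simplified stand-in, not the real clean_text: the function name and the printable-ASCII allow-list are assumptions, and paragraph-break handling and trimming around newlines are omitted.

```python
import re


def clean_text_sketch(text: str) -> str:
    """Simplified illustration of the pass ordering (not the real function)."""
    # 1. Collapse horizontal whitespace first. In Python str patterns \s is
    #    Unicode-aware, so NBSP, thin space, \v, \f and tabs become one space.
    text = re.sub(r"[^\S\n]+", " ", text)
    # 2. Strip remaining non-ASCII / control characters (assumed allow-list:
    #    printable ASCII plus newline).
    text = re.sub(r"[^\x20-\x7E\n]", "", text)
    # 3. Collapse doubles left behind when step 2 removed a character that
    #    sat between two spaces.
    text = re.sub(r"[ ]{2,}", " ", text)
    return text
```

Without step 3, an input with a copyright sign between two spaces would need a second call to reach its final form.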
* Add regression tests for Unicode/control whitespace and non-ASCII edge cases
Cover:
- Unicode horizontal whitespace separators (NBSP, narrow NBSP, thin space,
en/em space, ideographic space, vertical tab, form feed) normalizing to
a single ASCII space instead of being deleted.
- Mixed paragraph + Unicode whitespace realistic input ("Section\u00a01\r\n\r\nBody\ftext\u202Fhere").
- Tab collapsing and space trimming around newlines.
- Non-whitespace non-ASCII characters (copyright, accented letters, emoji)
sitting between spaces: must not leave an interior double space, and
clean_text must be idempotent on these inputs.
- Non-ASCII characters adjacent to a newline: stripping must not leave
stray leading or trailing spaces on the neighbouring line, and must not
swallow an adjacent paragraph break.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Fix Mistral training crash when xformers is unavailable
* Fix/adjust Mistral DPO training crash fix for PR #4889
- Clarify comment in MistralForCausalLM_fast_forward: the DPO embed-masking
block runs BEFORE attention_mask is nulled out, and it is the consumer that
requires a 2D mask.
- Add defensive attention_mask.ndim == 2 guard to the LlamaModel_fast_forward
DPO embed-masking block so it self-protects if a 4D mask ever reaches it.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Only run ldconfig CUDA-linking recovery when we have permission
When `import unsloth` runs in a non-root environment (shared HPC,
locked-down container, CI runner, etc.) the CUDA-linking recovery path
shells out to `os.system("ldconfig /usr/lib64-nvidia")`, which fails
loudly with "Permission denied". It's especially noisy for users who
don't even have bitsandbytes installed - they're doing 16bit or full
finetuning and the line immediately above told them "16bit and full
finetuning works!". The reason the recovery runs at all in that case
is that `bnb.functional.lib.cdequantize_blockwise_fp32` raises
AttributeError when `bnb is None`, the bare `except:` swallows it, and
the code drops into the recovery unconditionally.
Fix: gate the recovery body on `os.geteuid() == 0`. When we don't
have permission to run ldconfig, silently skip the recovery. When we
do, the recovery runs UNCHANGED - same `os.system()` calls, same
reload + retry, same warnings. `libcuda_dirs()` is used by both triton
and bitsandbytes, so we still want to run the recovery whenever we
have permission, regardless of whether bnb is installed.
For non-root users who DO have bitsandbytes installed and broken,
emit a single remediation warning telling them how to fix it manually
(`sudo ldconfig /usr/lib64-nvidia`). This preserves the diagnostic
guidance from the original code without the Permission denied noise.
Scope:
- Only the `DEVICE_TYPE == "cuda"` branch is touched.
- The `hip` (AMD ROCm) and `xpu` (Intel) branches are unchanged.
- On a real CUDA box running as root, behavior is byte-identical to
main: same os.system() calls, same reload, same retry, same warnings.
AST-verified by /tmp/verify_minimal/verify.py.
- `hasattr(os, "geteuid")` guards against Windows where `os.geteuid`
doesn't exist.
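The permission gate reduces to a small predicate. A sketch with a hypothetical function name; the actual change inlines the condition around the recovery body:

```python
import os


def can_run_ldconfig() -> bool:
    """True only when running as root on a platform that has os.geteuid."""
    # os.geteuid does not exist on Windows, so probe for it first.
    return hasattr(os, "geteuid") and os.geteuid() == 0
```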
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <info@unsloth.ai>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* feat: inject local model provider into recipe jobs via JWT
* feat: auto-generate JWT for local model providers in recipes
* feat: add is_local flag to model provider config types and utils
* fix(studio): skip endpoint validation for local providers
* feat(studio): add local/external model source toggle to provider dialog
* feat(studio): thread localProviderNames through model config dialog chain
* feat(studio): show 'Local model (Chat)' label for local model_provider configs
* fix: hardcode loopback for local endpoint, clear stale creds on toggle
* fix: document TOCTOU/JWT rotation, add deferred import comments, fix is_local serialization
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix(studio): clear stale local model state on provider toggle and validation
* fix(studio): override empty local endpoint in validation and skip model gate for unused providers
* fix(studio): resolve loopback port from app.state, clear stale local provider fields, sync model id on toggle
Address review feedback on the local-model-provider flow:
- Backend (jobs.py): _resolve_local_v1_endpoint now reads the actual bound
port from app.state.server_port (set in run.py after binding) instead of
parsing it out of request.base_url, which is wrong behind any reverse
proxy or non-default port. The two duplicated urlparse blocks are gone.
- Backend (jobs.py): defensively pop api_key_env, extra_headers, extra_body
from local providers so a previously external provider that flipped to
local cannot leak invalid JSON or rogue auth headers into the local /v1
call. Also dedupe the post-loop assignment and tighten the local-name
intersection so empty names cannot match.
- Backend (jobs.py): hoist datetime and urllib.parse imports to the top
import block for consistency with the rest of the file.
- Backend (run.py): expose the bound port on app.state.server_port after
the uvicorn server is constructed.
- Frontend (model-provider-dialog.tsx): clear extra_headers and extra_body
when toggling to local mode. Hidden inputs would otherwise keep stale
JSON that blocks validate/run.
- Frontend (model-config-dialog.tsx): factor the local-aware provider
selection logic into applyProviderChange and call it from both
onValueChange and onBlur, so manually typing a provider name and tabbing
away keeps the model field consistent.
- Frontend (recipe-studio.ts store): handle both directions of the
is_local toggle in the cascade. external -> local now backfills
model: "local" on already-linked model_configs so they pass validation
immediately, mirroring the existing local -> external clear path.
- Frontend (validate.ts + build-payload.ts): thread localProviderNames
into validateModelConfigProviders and skip the "model is required"
check for local-linked configs. Local providers do not need a real
model id since the inference endpoint uses the loaded Chat model.
* fix(studio): narrow store cascade types, sync model placeholder on graph relink and node removal, harden ephemeral port path
Loop 2 review fixes:
- recipe-studio.ts: type-narrow next.is_local by also checking
next.kind === "model_provider". TS otherwise raised TS2339 because
next was typed as the union NodeConfig after the spread. The behavior
is unchanged but the code now compiles cleanly.
- model-config-dialog.tsx: convert the lastProviderRef / providerInputRef
ref-during-render pattern (pre-existing react-hooks/refs lint error)
to a useEffect that syncs providerInputRef from config.provider. The
combobox blur path still uses applyProviderChange and remains stable.
- recipe-graph-connection.ts: when a graph drag links a model_provider
to a model_config, mirror the dialog applyProviderChange behavior:
fill model: "local" if the new provider is local and the model field
is blank, clear model when relinking from a local placeholder to an
external provider, otherwise leave the model alone.
- reference-sync.ts: when a referenced provider node is removed, clear
the synthetic model: "local" placeholder along with the provider
field, so a future relink to an external provider does not pass
validation with a stale value that fails at runtime.
- run.py: only publish app.state.server_port when the bound port is a
real positive integer; for ephemeral binds (port==0) leave it unset
and let request handlers fall back to request.base_url.
- jobs.py: _resolve_local_v1_endpoint also falls back when
app.state.server_port is non-positive, and uses `is None` instead of
the truthy fallback so a literal 0 is handled correctly.
* fix(studio): strict is_local check, narrow loaded-model gate to LLM-reachable configs, add scope-server port fallback
Loop 3 review fixes:
- jobs.py, validate.py: require `is_local is True` instead of truthy
check. Malformed payloads such as is_local: "false" or is_local: 1
would otherwise be treated as local and silently rewritten to the
loopback endpoint.
- jobs.py: _resolve_local_v1_endpoint now tries request.scope["server"]
(the actual uvicorn-assigned (host, port) tuple) as a second
resolution step before falling back to parsing request.base_url.
This covers direct-uvicorn startup paths and ephemeral binds that
never publish app.state.server_port.
- jobs.py: new _used_llm_model_aliases helper collects the set of
model_aliases that an LLM column actually references, and the
"Chat model loaded" gate is now only triggered when a local
provider is reachable from that set. Orphan model_config nodes on
the canvas no longer block unrelated recipe runs.
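The three-step port resolution that _resolve_local_v1_endpoint ends up with after loops 2 and 3 can be sketched without any web framework. The signature below (plain values instead of a request object), the "127.0.0.1" host string, and the ValueError on failure are assumptions for illustration.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def resolve_local_v1_endpoint(
    state_port: Optional[int],
    scope_server: Optional[Tuple[str, Optional[int]]],
    base_url: str,
) -> str:
    """Pick the bound port: app state, then ASGI scope, then base_url."""
    port: Optional[int] = None
    if (
        isinstance(state_port, int)
        and not isinstance(state_port, bool)
        and state_port > 0
    ):
        port = state_port  # published after binding
    elif (
        scope_server is not None
        and isinstance(scope_server[1], int)
        and scope_server[1] > 0
    ):
        port = scope_server[1]  # (host, port) tuple from the server
    else:
        port = urlparse(base_url).port  # last resort; wrong behind a proxy
    if port is None:
        raise ValueError("could not determine the local server port")
    return f"http://127.0.0.1:{port}/v1"
```

A literal 0 or None for `state_port` falls through to the next step rather than producing an unusable endpoint.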
* fix(studio): force skip_health_check on local-linked configs, skip JSON parsing for local providers, local-aware inline editor
Loop 4 review fixes:
- jobs.py: after rewriting local providers, also force
skip_health_check: true on any model_config linked to a local
provider. The /v1/models endpoint only advertises the real loaded
model id, so data_designer's default model-availability health check
would otherwise fail against the placeholder "local" id before the
first chat completion call. The inference route already ignores the
model id in chat completions, so skipping the check is safe.
- builders-model.ts: buildModelProvider now short-circuits for local
providers and emits only { name, endpoint: "", provider_type, is_local }
without running parseJsonObject on the hidden extra_headers/extra_body
inputs. Imported or hydrated recipes with stale invalid JSON in those
fields no longer block client-side validate/run.
- inline-model.tsx: the model_config branch now accepts an optional
localProviderNames prop and mirrors the dialog applyProviderChange
behavior. Changing provider to/from a local one auto-fills or clears
the "local" placeholder consistently with the other edit paths.
- recipe-graph-node.tsx: derive localProviderNames from the store via
useMemo (stable identity) and pass it through renderNodeBody to
<InlineModel>. Hooks order is preserved by declaring them above the
early return for markdown_note nodes.
- run.py: minor comment tweak - loop 3 already added the scope-server
fallback path, note that in the comment.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: danielhanchen <info@unsloth.ai>
* split venv_t5 into venv_t5_530 and venv_t5_550 for tiered transformers 5.x support
* fix bfloat16 crash on T4 for FORCE_FLOAT32 models and disable trust_remote_code auto-enable for native t5 models
* revert FORCE_FLOAT32 dtype change
* restrict trust_remote_code auto-enable to Nemotron models only
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* use config.json model_type for tier detection, add unsloth/nvidia namespace guard
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Revert "[pre-commit.ci] auto fixes from pre-commit.com hooks"
This reverts commit fb43d468e2.
* Revert "use config.json model_type for tier detection, add unsloth/nvidia namespace guard"
This reverts commit fc49ae2453.
* add unsloth/nvidia namespace guard to Nemotron trust_remote_code auto-enable
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* reorder tier checks: all substring matches before config.json fetches
* extract shared activate_transformers_for_subprocess into transformers_version.py
* narrow Nemotron trust_remote_code to nemotron_h/nemotron-3-nano, add to export worker
* clean venv_t5 dirs before re-install in setup.sh, clarify version alias comment
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* run venv_t5 migration outside deps fast-path gate in both setup scripts
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(chat): prevent implicit empty thread creation and stabilize new-chat flow
* fix(chat): harden compare thread sync and simplify sidebar thread query
* fix(chat): harden new-thread state sync and isolate compare active thread updates
* fix(chat): stabilize new-thread state sync and prevent compare/session bleed
* Fix thread restoration, handleNewThread guard, sidebar filter, and delete flow
- Remove __LOCALID_ filter from getInitialSingleChatView: in this
Dexie-backed adapter, AUI's __LOCALID_ prefixed IDs ARE the real
persistent thread IDs stored by initialize(). Filtering them out
breaks thread restoration on navigation.
- Simplify handleNewThread to synchronous: the async Dexie message
check is redundant (persistence is already deferred to first append)
and strands users on legacy empty threads. Use a simple guard that
checks the store's activeThreadId to detect unsent drafts.
- Add message-count filter to sidebar: filter threads to only show
those with at least one message, hiding legacy empty threads.
- Add store-based sidebar highlighting fallback: use activeThreadId
from the store when view.threadId is not set (nonce-backed chats).
- Fix handleDelete to call onNewThread() instead of onSelect(), and
clear activeThreadId, so the runtime properly resets after deleting
the active thread.
* Fix handleDelete nonce path and restore __LOCALID_ filter
handleDelete was calling onNewThread() after clearing activeThreadId,
but the handleNewThread guard sees !view.threadId && !activeThreadId
and returns early, leaving the UI stuck on the deleted thread.
Fix by directly calling onSelect with a new nonce instead.
Restore __LOCALID_ filter in getInitialSingleChatView to prevent
restoring unpersisted AUI local thread IDs on navigation. Without
this filter, navigating away from /chat before sending a message
would restore a non-existent thread that Dexie cannot fetch.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Fix custom folder scanning when pointing directly at a model directory.
When a user adds a custom scan folder that points directly at a model
directory (e.g. /path/to/gemma-4-e2b-it-gguf/ containing config.json
and gemma-4-E2B-it-BF16.gguf), the model list previously showed
individual .gguf files as separate entries instead of recognizing the
directory as a single model. Clicking any entry showed "No GGUF
variants found" because list_local_gguf_variants received a file path
and immediately returned empty.
Changes:
- Add _is_model_directory() helper that detects directories with both
config metadata and actual model weight files (excludes mmproj GGUFs
and non-weight .bin files like tokenizer.bin)
- _scan_models_dir: detect self-model and return single directory entry
- _scan_lmstudio_dir: surface model directories directly instead of
descending into them as publisher folders; handle both root and child
model directories
- Add _resolve_gguf_dir() helper for GGUF path resolution that only
falls back to parent directory when parent has model metadata
- list_local_gguf_variants / _find_local_gguf_by_variant: use resolver
so .gguf file paths inside model directories work correctly
* fix: skip redundant HfFileSystem().glob() calls in loader.py
Guard the SUPPORTS_LLAMA32 glob blocks with `is_model and is_peft` so
the HfFileSystem HTTP call is only made when both configs could actually
exist. This prevents indefinite hangs on slow/unreliable networks since
the glob result is redundant when either AutoConfig or PeftConfig
already failed to load.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove test file from main PR - moved to separate PR
Tests for the glob skip guard belong in their own PR to keep
the loader change minimal and reviewable.
* Harden HfFileSystem glob: fix Windows path splitting, add try/except
- Use str.rsplit("/", 1) instead of os.path.split to extract filenames
from HfFileSystem paths. HfFileSystem always returns POSIX-style paths,
but os.path.split uses the OS separator, so on Windows the entire path
was returned as the "filename" and the config name comparison always
failed.
- Wrap the HfFileSystem().glob() call in try/except to gracefully handle
network failures (offline mode, timeouts, unreachable Hub). On failure
both_exist stays False, which is the safe default.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove redundant HfFileSystem().glob() call for remote repos
When is_model and is_peft are both True, AutoConfig and PeftConfig
have already loaded successfully, proving both config.json and
adapter_config.json exist. The HfFileSystem network call to re-verify
this was redundant and could cause hangs on slow networks.
Replace the glob + try/except block with a direct both_exist = True
assignment.
* Remove unused HfFileSystem import
HfFileSystem was only used for the glob() calls that were replaced
with direct both_exist = True assignments in the previous commit.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Gemma-4 does not need FORCE_FLOAT32. Testing shows that both float16 and
bfloat16 work correctly without the forced float32 override:
- Inference: identical outputs for float16 and bfloat16 (greedy decoding)
- Training (100 steps, 4-bit LoRA, SFT on FineTome-100k):
- float16 final loss: 3.048
- bfloat16 final loss: 3.065
- Losses converge to within 0.02 by step 60
- Grad norms healthy and comparable for both dtypes
The FORCE_FLOAT32 path was actually causing training divergence. With
it enabled, the compiled float32 run diverged at step ~28 with grad norms
collapsing to near zero and loss plateauing at ~12.4. Without it, both
dtypes train normally.
This enables float16 on Tesla T4 and other GPUs without bfloat16 support.
* Add tests for is_vision_model() caching behaviour
* Fix review feedback: remove dead helper, fix exception test
- Remove unused _make_config() helper function (dead code)
- Fix test_exception_result_cached to actually exercise the exception path
by mocking load_model_config to raise OSError instead of using
side_effect=[False] which only tested normal False returns
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use strict mock specs so tests exercise intended detection paths
Use MagicMock(spec=[]) for all config mocks so hasattr() only returns
True for explicitly set attributes. Without this, MagicMock defaults
make all hasattr checks truthy, allowing tests to pass via unintended
detection paths (e.g. img_processor instead of vision_config).
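The difference between a default MagicMock and one built with spec=[] can be illustrated with a short sketch. The attribute names come from the message above; the config value is made up for illustration.

```python
from unittest.mock import MagicMock

loose = MagicMock()          # fabricates any attribute on access
strict = MagicMock(spec=[])  # empty spec: unknown attributes raise AttributeError

# hasattr() is always truthy on a default MagicMock.
loose_has_vision = hasattr(loose, "vision_config")

# With the empty spec, nothing exists until the test sets it explicitly.
strict_has_vision_before = hasattr(strict, "vision_config")
strict.vision_config = {"hidden_size": 8}  # illustrative value only
strict_has_vision_after = hasattr(strict, "vision_config")

# Attributes the test never set stay absent, so an unintended
# detection path cannot make the test pass by accident.
strict_has_img_processor = hasattr(strict, "img_processor")
```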
---------
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add vision detection cache to is_vision_model() to avoid redundant subprocess spawns
is_vision_model() is called 4-5 times per training run for the same model
with zero caching. For transformers 5.x models, each call spawns a full
subprocess (~6s each). This adds a module-level _vision_detection_cache dict
following the same pattern as the existing _audio_detection_cache used by
detect_audio_type(). The function is refactored into a thin cache wrapper
around _is_vision_model_uncached(), saving ~12s per training run.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Include hf_token in vision cache key for gated model correctness
Cache key is now (model_name, hf_token) instead of just model_name.
This prevents stale False results when an unauthenticated probe for a
gated model is followed by an authenticated call.
* Remove test file from main PR - will be submitted separately
* Fix vision cache: normalize model names and skip caching transient failures
- Normalize model names in cache key using resolve_cached_repo_id_case()
to avoid duplicate entries for different casings of the same HF repo
(aligns with case normalization from #4822)
- Return None instead of False on transient failures (network errors,
subprocess timeouts, HF API issues) so the cache layer can distinguish
"definitely not a vision model" from "failed to check"
- Only cache definitive True/False results; transient failures are retried
on the next call instead of being permanently locked in as False
* Refine failure handling: cache deterministic failures, guard normalization
- Subprocess non-zero exit, JSON errors, and general exceptions return
False (deterministic, cached) instead of None (retryable). Only
subprocess.TimeoutExpired returns None since timeouts are transient.
- Wrap cache key normalization in try/except so resolve_cached_repo_id_case
or normalize_path failures fall back to raw model_name instead of
crashing callers.
* Harden vision detection cache: fix transient failure handling, thread safety, token security
- All subprocess failure paths now return None (transient) instead of False,
preventing permanent misclassification of VLMs after temporary HF/auth/network errors
- Use SHA256 fingerprint for hf_token in cache key instead of raw bearer token
- Add threading.Lock with double-checked locking to prevent thundering herd
of concurrent subprocess spawns for the same uncached model
- Distinguish permanent failures (RepositoryNotFoundError, GatedRepoError,
ValueError) from transient ones in _is_vision_model_uncached
- Pass resolved/normalized model name to detection (not just cache key)
- Log normalization fallback at debug level instead of silent swallow
- Thread hf_token through callers in routes/models.py and trainer.py
that previously omitted it
* Refine lock strategy and token fingerprint
- Move detection computation outside the lock to avoid serializing
long-running subprocess spawns (60s timeout) and HF API calls across
all concurrent model checks. Lock is now only held for cache writes.
- Use full SHA256 digest for token fingerprint instead of truncated
16-char prefix to eliminate collision risk.
* Fix huggingface_hub import fallback and use atomic cache read
- Add fallback import path for RepositoryNotFoundError/GatedRepoError
from huggingface_hub.utils (older hub versions) when .errors is
not available
- Use sentinel-based dict.get() for single atomic cache read instead
of two-step in/[] pattern (future-proof for no-GIL runtimes)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Add fallback message for Colab Studio button when localhost link doesn't work
* Make fallback message darker grey for better readability
* Make fallback message bold for better visibility
---------
Co-authored-by: LeoBorcherding <LeoBorcherding@users.noreply.github.com>
* studio: add speculative decoding support (ngram-mod, on by default)
Enable n-gram speculative decoding for GGUF models in Unsloth Studio.
Uses llama.cpp's ngram-mod mode which gives 10-40% faster generation
with zero VRAM cost via a 4MB fixed hash table that auto-resets on
low acceptance rates.
Backend:
- Add speculative_type field to LoadRequest, LoadResponse, and
InferenceStatusResponse pydantic models
- Add speculative_type parameter to LlamaCppBackend.load_model()
with allowlist validation (ngram-simple, ngram-mod)
- Pass --spec-type, --spec-ngram-size-n 16, --draft-max 24 flags
to llama-server when ngram-mod is active
- Default to ngram-mod for non-vision GGUF models server-side
- Silently skip speculative decoding for vision models (unsupported
in llama.cpp server-context.cpp)
Frontend:
- Add speculative_type to TS API types
- Add speculativeType/loadedSpeculativeType to chat runtime store
with default value of "ngram-mod"
- Add On/Off toggle in Model settings section (GGUF only, hidden
for vision models), included in dirty check for Apply/Reset
- Wire speculative_type through model load request and response
- Restore speculative type state on page refresh/reconnect
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: remove server-side speculative decoding override
The backend was overriding speculative_type=None to "ngram-mod" for
non-vision GGUF models, which prevented users from disabling spec
decoding via the UI toggle. The frontend store already defaults to
"ngram-mod", so the backend fallback was redundant and blocked the
explicit "Off" setting.
* fix: use recommended ngram-mod params from llama.cpp docs
Update speculative decoding params to match the recommended values
from llama.cpp docs (docs/speculative.md):
--spec-ngram-size-n 24 (was 16, docs say small n not recommended)
--draft-min 48 (was 0)
--draft-max 64 (was 24, docs note MoEs need long drafts)
Also fix comment: ngram-mod uses ~16 MB (4M entries * 4 bytes),
not 4 MB.
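As a rough sketch of how these flags could be assembled, assuming the flag names and values quoted above. The helper build_spec_args is hypothetical, and passing only --spec-type for ngram-simple is an assumption.

```python
from typing import List, Optional

ALLOWED_SPEC_TYPES = {"ngram-simple", "ngram-mod"}  # allowlist named in the commit


def build_spec_args(speculative_type: Optional[str], is_vision: bool) -> List[str]:
    """Return extra llama-server arguments for speculative decoding."""
    # None means the user switched it off; vision models are skipped silently.
    if speculative_type is None or is_vision:
        return []
    if speculative_type not in ALLOWED_SPEC_TYPES:
        raise ValueError(f"unsupported speculative_type: {speculative_type!r}")
    args = ["--spec-type", speculative_type]
    if speculative_type == "ngram-mod":
        args += ["--spec-ngram-size-n", "24", "--draft-min", "48", "--draft-max", "64"]
    return args


# Memory estimate from the corrected comment: 4M entries * 4 bytes each,
# reading "4M" as 4 * 1024 * 1024 (an assumption about the unit).
NGRAM_MOD_TABLE_BYTES = 4 * 1024 * 1024 * 4
NGRAM_MOD_TABLE_MIB = NGRAM_MOD_TABLE_BYTES // (1024 * 1024)
```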
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add benchmark table and references to speculative decoding comment
Include speedup numbers from llama.cpp PRs #18471 and #19164 as an
inline comment so future readers understand the expected gains.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(studio): harden sandbox security for terminal and python tools
The existing command blocklist used naive str.split() which is trivially
bypassable via quoting, full paths, nested shells, variable expansion,
and cross-tool pivoting through Python os.system/subprocess. Fixes #4818.
Changes:
- Replace str.split() blocklist with shlex.split() + os.path.basename()
tokenization and regex scanning at shell command boundaries
- Add sanitized subprocess environment (_build_safe_env) that strips
credentials (HF_TOKEN, WANDB_API_KEY, GH_TOKEN, AWS_*, etc.) and
restricts PATH to /usr/local/bin:/usr/bin:/bin
- Add PR_SET_NO_NEW_PRIVS via prctl on Linux so sudo/su/pkexec fail
at the kernel level regardless of how they are invoked
- Add RLIMIT_NPROC (256) and RLIMIT_FSIZE (100MB) to prevent fork
bombs and disk filling attacks
- Extend AST safety checker to detect os.system(), os.popen(),
subprocess.run/Popen/call/check_output, os.exec*, os.spawn* calls
containing blocked commands or dynamic (non-literal) arguments
- Add cross-platform support: cmd.exe on Windows, bash on Unix;
CREATE_NO_WINDOW flag on Windows, preexec_fn on Unix
- Expand blocklist from 7 to 14 commands: add su, chown, passwd,
mount, umount, fdisk, kill, killall, pkill
- Apply all layers to both _bash_exec and _python_exec
No measurable performance overhead: the only added work is shlex parsing
and a single prctl syscall per subprocess fork.
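A minimal sketch of why shlex.split() plus os.path.basename() closes the quoting and full-path bypasses. The blocklist here is an illustrative subset, the function names are hypothetical, and treating unparseable input as blocked is an assumption.

```python
import os
import shlex
from typing import List

BLOCKED = {"sudo", "su", "kill", "mount"}  # illustrative subset, not the real list


def naive_blocked(command: str) -> List[str]:
    """Old approach: whitespace split, exact token match."""
    return [tok for tok in command.split() if tok in BLOCKED]


def tokenized_blocked(command: str) -> List[str]:
    """Shell-aware tokenization, then compare each token's basename."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return ["<unparseable>"]  # e.g. unbalanced quotes
    names = (os.path.basename(tok) for tok in tokens)
    return [name for name in names if name in BLOCKED]
```

With this sketch, '/usr/bin/sudo whoami' and '"sudo" whoami' slip past naive_blocked but are caught by tokenized_blocked, while a quoted argument such as echo "kill this process" stays a single token and is not flagged.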
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix review findings: exception_catching dead code, false positives, process substitution
- Include exception_catching reasons in _check_code_safety so bare
except-in-loop timeout evasion is actually blocked (was computed in
_check_signal_escape_patterns but never read by the caller)
- Remove base.split() inner loop that caused false positives on quoted
text arguments containing blocked words (e.g. echo "kill this process")
- Add targeted nested shell detection for bash/sh/zsh -c arguments
instead, which catches bash -c 'sudo whoami' without false positives
- Add <() process substitution to the regex character class so
diff <(rm -rf /path) is also caught
- Fix error message to say "unsafe patterns" instead of specifically
mentioning signal manipulation when other categories trigger
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address review feedback: regex paths, keyword args, list element scanning
- Regex now matches blocked commands after optional path prefix at shell
boundaries (catches ls; /usr/bin/sudo and similar)
- Nested shell detection uses os.path.basename so bash -c "/bin/rm" is
caught
- AST checker now inspects keyword arguments (not just positional) so
subprocess.run(args="sudo ...", shell=True) is detected
- List elements in subprocess calls are now checked via
_find_blocked_commands for consistency (catches subprocess.run(["bash",
"-c", "rm -rf /"]))
- Dynamic argument check uses _is_safe_literal that validates list
contents are all string literals
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix nested shell scan to only check the script body, not positional args
bash -c 'script' arg0 arg1 -- only tokens[i+1] is the script body;
subsequent tokens are $0, $1 positional parameters passed to the script
and are not executed as shell commands. Scanning all remaining tokens
caused false positives.
* Add subshell parentheses to regex command boundary detection
(sudo whoami) was not caught because ( was not in the regex character
class for shell command boundaries. Add ( to the set alongside ;, &,
|, backtick, newline.
* Address high-priority review findings from 7 parallel reviewers
- Track from-imports of dangerous functions (from os import system,
from subprocess import run as r, etc.) via shell_exec_aliases dict
so bare-name calls are detected by the AST checker
- Include the active Python interpreter and virtualenv directories
in the sanitized PATH so pip, uv, and Studio packages remain
accessible in the sandbox
- Add Windows-specific blocked commands (rmdir, takeown, icacls,
runas, powershell, pwsh) only on win32 platform
- Add os.posix_spawn and os.posix_spawnp to _SHELL_EXEC_FUNCS
- Handle tuple literals same as list literals in AST argument
inspection (both _extract_strings_from_list and _is_safe_literal)
* Fix false positive on check=True kwargs and recursive nested shell scanning
- Only inspect command-carrying keyword arguments (args, command,
executable, path, file) in the AST checker, not control flags like
check=True, text=True, capture_output=True which are booleans and
were incorrectly flagged as non-literal dynamic arguments
- Replace split() in nested shell detection with recursive call to
_find_blocked_commands so that quoted commands (bash -c '"sudo"
whoami') and semicolons (bash -c "sudo;ls") within nested shells
are properly detected through the full shlex + regex pipeline
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Move preexec_fn imports to module level and use find_library for libc
Addresses two Gemini review findings:
1. preexec_fn thread safety: _sandbox_preexec previously imported ctypes
and resource inside the function body, which runs between fork() and
exec() in the child process. In a multi-threaded server, this could
deadlock if the import machinery locks were held by another thread at
fork time. Now all imports and the libc handle are resolved once at
module load time, so _sandbox_preexec only calls C-level functions
(prctl, setrlimit) with no Python import activity.
2. Hardcoded libc.so.6 path: replaced with ctypes.util.find_library("c")
which works on glibc (libc.so.6), musl (libc.musl-*.so.1), and other
Linux distributions where libc has a different soname.
* Apply Gemini style suggestions: combined regex, dict.fromkeys, constant hoisting
- Combine per-word regex loop into a single re.findall with alternation
pattern, avoiding repeated regex compilation and searching
- Replace manual dedup loop with dict.fromkeys for PATH entries
- Hoist _CMD_KWARGS frozenset out of visit_Call to avoid recreating it
on every AST node visit
* Add cmd /c nested shell detection for Windows parity
The nested shell scan only checked for Unix shells (bash -c, sh -c, etc).
Add cmd /c and cmd.exe /c detection so that Windows nested shell
invocations are also recursively scanned for blocked commands. The token
scan already catches blocked commands at any position, so this is
defense-in-depth for consistency across platforms.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Handle combined shell flags (-lc, -xc) and interleaved flags (--login -c)
The nested shell scan only matched token == "-c" with the immediately
preceding token being a shell name. This missed:
- Combined flags: bash -lc 'rm ...' (-lc ends with c, is a valid
combined flag meaning -l -c)
- Interleaved flags: bash --login -c 'sudo ...' (--login sits between
bash and -c)
Now matches any short flag ending in 'c' (e.g. -lc, -xc, -ic) and
walks backwards past intermediate flags to find the shell binary.
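The flag matching described above can be sketched as follows. The function name nested_shell_script and the shell set are hypothetical, and the sketch handles Unix-style flags only. It returns just the script body (the token right after the -c flag), which is the part that would then be rescanned.

```python
import os
import shlex
from typing import Optional

SHELLS = {"bash", "sh", "zsh"}  # illustrative set of shell binaries


def nested_shell_script(command: str) -> Optional[str]:
    """Return the script passed to a nested shell via -c, if any."""
    tokens = shlex.split(command)
    for i, tok in enumerate(tokens):
        # Short flag cluster ending in "c": -c, -lc, -xc, -ic (not --long options).
        is_c_flag = (
            tok.startswith("-") and not tok.startswith("--") and tok.endswith("c")
        )
        if not is_c_flag or i + 1 >= len(tokens):
            continue
        # Walk backwards past intermediate flags such as --login.
        j = i - 1
        while j >= 0 and tokens[j].startswith("-"):
            j -= 1
        # basename() lets /bin/bash match the same way as bash.
        if j >= 0 and os.path.basename(tokens[j]) in SHELLS:
            return tokens[i + 1]
    return None
```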
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix /bin/bash bypass, remove RLIMIT_NPROC, reduce AST false positives
Addresses three high-consensus findings from 20-reviewer pass:
1. /bin/bash -c 'sudo whoami' bypassed nested shell scan because the
backwards flag-skip logic treated paths starting with / as flags.
Now only skips tokens starting with - as Unix flags; on Windows
only skips short /X flags (not /bin/bash style paths). [9/20]
2. RLIMIT_NPROC=256 caused subprocess.run to fail with EAGAIN because
Linux enforces NPROC per real UID, not per process tree. Removed
RLIMIT_NPROC entirely; RLIMIT_FSIZE and PR_SET_NO_NEW_PRIVS remain
as the primary resource and privilege controls. [5/20]
3. AST checker rejected safe dynamic subprocess usage like
cmd=["git","status"]; subprocess.run(cmd) as shell_escape_dynamic.
Now only flags dynamic args for shell-string functions (os.system,
os.popen, subprocess.getoutput, etc.) or when shell=True is
explicitly set. List-based subprocess calls with shell=False (the
default) do not pass through a shell and are not flagged. [12/20]
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Handle Windows drive letter paths and .exe extensions in command detection
Gemini review found that Windows absolute paths (C:\Windows\System32\
shutdown.exe) and executable extensions (.exe, .com, .bat, .cmd) were
not handled:
- Token scan now strips .exe/.com/.bat/.cmd extensions before checking
the blocklist, so sudo.exe matches sudo, shutdown.bat matches shutdown
- Regex pattern now includes optional Windows drive letter prefix
([a-zA-Z]:[/\\]) and optional executable extension suffix, so commands
after shell metacharacters with full Windows paths are also caught
* Handle **kwargs dict expansion, non-literal shell=, and except Exception false positive
Addresses three findings from second 20-reviewer pass:
1. **kwargs dict expansion (9/20): subprocess.run(**{"args": "rm ...",
"shell": True}) bypassed the AST checker because **kwargs were
treated as opaque. Now expands literal dict **kwargs to inspect
their keys, and flags opaque **kwargs (variable dicts) as unsafe.
2. Non-literal shell= values (7/20): shell=variable was treated as
shell=False (safe). Now any shell= value that is not literally
False is treated as potentially True (conservative default).
3. except Exception false positive (1/20): except Exception in a loop
was flagged as timeout evasion, but Exception does not catch
SystemExit or KeyboardInterrupt which are used for timeout
enforcement. Narrowed to only flag except BaseException and
except TimeoutError in loops.
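A simplified sketch of the conservative shell= handling from findings 1 and 2, using only the standard ast module. The helper shell_may_be_true is hypothetical and far smaller than a full checker. The last lines check the class hierarchy fact behind finding 3.

```python
import ast


def _is_literal_false(node: ast.AST) -> bool:
    return isinstance(node, ast.Constant) and node.value is False


def shell_may_be_true(call_source: str) -> bool:
    """Treat anything other than an absent or literal shell=False as risky."""
    call = ast.parse(call_source, mode="eval").body
    if not isinstance(call, ast.Call):
        raise ValueError("expected a single call expression")
    for kw in call.keywords:
        if kw.arg == "shell":
            return not _is_literal_false(kw.value)
        if kw.arg is None:  # **kwargs expansion
            if not isinstance(kw.value, ast.Dict):
                return True  # opaque variable dict: cannot be inspected
            for key, value in zip(kw.value.keys, kw.value.values):
                if not (isinstance(key, ast.Constant) and isinstance(key.value, str)):
                    return True  # nested ** or computed key: treat as opaque
                if key.value == "shell":
                    return not _is_literal_false(value)
    return False


# SystemExit and KeyboardInterrupt derive from BaseException, not Exception,
# so "except Exception" does not swallow them.
NOT_CAUGHT_BY_EXCEPT_EXCEPTION = [
    exc for exc in (SystemExit, KeyboardInterrupt) if not issubclass(exc, Exception)
]
```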
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Fixes #4809
On a new Studio chat, the first tool call could start before the frontend
initializes the thread ID. That meant the first request could go out without
a session_id, so the backend started the tool in the shared sandbox root
instead of the chat's session sandbox.
Frontend:
- Eagerly initialize the thread when switching to a new chat
- Resolve the thread ID once at request time and keep it stable through
async model-load waits
- Disable ActiveThreadSync during new-chat initialization to prevent
stale thread IDs from being written back
- Add error handling for thread initialization failures
- Clear activeThreadId on all compare-mode entry paths to prevent
cross-session leakage
- Fix exitCompare to restore context usage from the saved view
- Coerce falsy thread IDs to undefined for consistent backend/frontend
fallback behavior
- Use _default as the image sessionId fallback to match the backend
Backend:
- Use ~/studio_sandbox/_default when a request arrives without a session_id
* fix(studio): reuse HF cached repo casing to prevent duplicate downloads
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Move cache case resolution tests to separate PR
Tests for resolve_cached_repo_id_case and get_model_config case resolution
belong in their own PR to keep this change focused on the runtime fix.
* fix(studio): debug-log HF_HUB_CACHE fallback in path_utils
* Fix stale memoization in resolve_cached_repo_id_case
- Check exact-case path before memo to ensure a newly-appeared exact
match always wins over a previously memoized variant
- Validate memoized entries still exist on disk before returning them
to prevent stale results when cache dirs are deleted/recreated
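The lookup order described above (exact case first, then a validated memo, then a rescan) might look roughly like this. resolve_case_sketch is a hypothetical stand-in for resolve_cached_repo_id_case. It compares listed directory names rather than probing paths, a simplification chosen so the sketch behaves the same on case-insensitive filesystems.

```python
import tempfile
from pathlib import Path
from typing import Dict, Optional, Set

_case_memo: Dict[str, str] = {}  # lowercase name -> on-disk casing seen earlier


def _dir_names(cache_root: Path) -> Set[str]:
    return {entry.name for entry in cache_root.iterdir() if entry.is_dir()}


def resolve_case_sketch(cache_root: Path, dir_name: str) -> Optional[str]:
    names = _dir_names(cache_root)
    # 1. An exact-case match always wins, even over a memoized variant.
    if dir_name in names:
        return dir_name
    key = dir_name.lower()
    # 2. Trust the memo only if the directory still exists on disk.
    memoized = _case_memo.get(key)
    if memoized is not None:
        if memoized in names:
            return memoized
        del _case_memo[key]
    # 3. Otherwise rescan for a case-insensitive match and memoize it.
    for name in sorted(names):
        if name.lower() == key:
            _case_memo[key] = name
            return name
    return None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "models--Org--Model").mkdir()
    variant_hit = resolve_case_sketch(root, "models--org--model")
    (root / "models--Org--Model").rmdir()  # cache dir deleted: memo is now stale
    after_delete = resolve_case_sketch(root, "models--org--model")
    (root / "models--org--model").mkdir()
    exact_hit = resolve_case_sketch(root, "models--org--model")
```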
* Minor cleanups for cache case resolution
- Use .is_dir() instead of .exists() for exact-case cache check
(cache entries are always directories)
- Remove redundant fallback in _detect_audio_from_tokenizer since
get_cache_path already handles case resolution and returns None
when the model is not cached
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* feat: allow non-LLM recipes to run without provider block
* feat: reorder execution tabs and add generation-aware data tab empty state
* fix: add accessibility attrs to data tab spinner and use literal ellipsis
* fix(studio): use shared spinner, stub provider, and hide unused LLM metrics
Backend: inject stub model provider for sampler-only recipes so
DataDesigner init does not reject empty provider lists.
Frontend: use shared Spinner component, hide LLM columns metric
and model usage card when recipe has no LLM columns.
* Fix tab reset and terminal auto-scroll regressions for PR #4805
Reset detailTab to "data" when switching between executions so
the Data tab default is applied consistently, not only on first
mount. Also add detailTab to the terminal scroll effect deps so
auto-scroll-to-bottom fires when the user opens the Overview tab
after landing on Data.
* Guard terminal scroll reset to only fire on Overview tab
The previous scroll effect ran on every tab switch, which could
reset the user's manual scroll position if they scrolled up in
the terminal and briefly switched tabs. Now the scroll-to-bottom
and sticky-bottom reset only fires when navigating to the
Overview tab.
* Use None for stub provider api_key instead of literal string
The stub ModelProvider that satisfies the DataDesigner registry
for non-LLM recipes should not carry a fake credential string.
Using None avoids sending an Authorization header if the provider
is ever inadvertently invoked.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Differentiate web_search query searches from URL fetches in the Studio chat UI.
Backend (llama_cpp.py):
- Emit "Reading: hostname" for URL fetches and "Searching: query" for query searches in SSE status events
- Only show hostname for valid http/https URLs; schemeless/non-http URLs get "Reading page..." generic fallback
- Strip www. prefix for consistency with the frontend
Frontend (tool-ui-web-search.tsx):
- Tool card shows "Read hostname" / "Reading hostname..." for URL fetches
- Shows "Searched query" / "Searching for query..." for query searches
- Uses new URL() with protocol check; falls back to "Read page" / "Reading page..." for non-http URLs
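The backend labelling rules can be sketched in a few lines. The function web_search_status and its parameters are hypothetical; only the label strings come from the description above.

```python
from urllib.parse import urlparse


def web_search_status(query: str = "", url: str = "") -> str:
    """Build the status label for a web_search tool call."""
    if not url:
        return f"Searching: {query}"
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Only valid http/https URLs expose their hostname.
    if parsed.scheme in ("http", "https") and host:
        if host.startswith("www."):
            host = host[len("www."):]  # match the frontend's display
        return f"Reading: {host}"
    return "Reading page..."  # schemeless or non-http URL
```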
* Simplify llama.cpp install logic
* print release tag
* Retry failed json decode
* don't pull all ggml releases
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove test file changes from main PR
Test changes for test_pr4562_bugfixes.py will be submitted in a separate PR to keep this PR focused on the install path simplification.
* Fix setup.sh executable bit and direct tag lookup for pinned releases
- Restore setup.sh file mode to 100755 (was accidentally changed to 100644)
- Add direct GitHub API tag lookup in iter_release_payloads_by_time for
non-latest requested tags (e.g. b7879) instead of relying on paginated
release scans that may miss older releases beyond the 5-page limit
- Update stale DEFAULT_PUBLISHED_REPO comment to match new value
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix force-compile default ref and remove dead code in setup.ps1
- Change FORCE_COMPILE_DEFAULT_REF from "main" to "master" in all three
files (install_llama_prebuilt.py, setup.sh, setup.ps1) since
ggml-org/llama.cpp uses "master" as its default branch, not "main".
Using "main" would cause git clone --branch to fail when
UNSLOTH_LLAMA_FORCE_COMPILE=1 with UNSLOTH_LLAMA_TAG=latest.
- Remove dead if ($SkipPrebuiltInstall) block inside the else branch of
setup.ps1 that could never be reached (the outer elseif already
handles $SkipPrebuiltInstall=true).
- Maintain setup.sh executable bit (100755).
* Improve iter_release_payloads_by_time error handling for direct tag lookup
When a pinned release tag is not found (HTTP 404), fall through to the
paginated release scan instead of silently returning empty results.
Non-404 errors (network failures, rate limits) are propagated to the
caller so users get actionable error messages.
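A minimal sketch of the 404 fall-through, under stated assumptions: HttpStatusError, the injected fetch_by_tag and scan_pages callables, and the stand-in helpers are all hypothetical, and stopping after a successful direct lookup is an assumption.

```python
from typing import Callable, Dict, Iterable, Iterator


class HttpStatusError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def iter_release_payloads_sketch(
    tag: str,
    fetch_by_tag: Callable[[str], Dict],
    scan_pages: Callable[[], Iterable[Dict]],
) -> Iterator[Dict]:
    if tag != "latest":
        try:
            payload = fetch_by_tag(tag)  # direct tag lookup
        except HttpStatusError as exc:
            if exc.status != 404:
                raise  # rate limits, network failures: surface to the caller
            # 404: fall through to the paginated scan below
        else:
            yield payload
            return
    yield from scan_pages()


# Stand-ins for the GitHub API calls.
def _found(tag: str) -> Dict:
    return {"tag_name": tag}


def _missing(tag: str) -> Dict:
    raise HttpStatusError(404)


def _rate_limited(tag: str) -> Dict:
    raise HttpStatusError(403)


def _pages() -> Iterable[Dict]:
    return [{"tag_name": "b1"}, {"tag_name": "b2"}]


direct = list(iter_release_payloads_sketch("b7879", _found, _pages))
fallback = [p["tag_name"] for p in iter_release_payloads_sketch("b7879", _missing, _pages)]
try:
    list(iter_release_payloads_sketch("b7879", _rate_limited, _pages))
    propagated = False
except HttpStatusError:
    propagated = True
```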
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix: patch PEFT for Gemma4ClippableLinear in loader checkpoint path
The same Gemma4ClippableLinear monkey-patch that exists in vision.py
for training is needed in loader.py for loading existing checkpoints
(used by export and inference).
Gemma4ClippableLinear wraps nn.Linear but does not subclass it, so
PEFT's LoRA injection fails with "Target module not supported".
The patch redirects PEFT to target the inner .linear child instead.
Applied only to the vision model PeftModel.from_pretrained path.
Temporary fix until PEFT adds native support (peft#3129).
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: wrap ClippableLinear patch in try/finally to always restore
Ensures _create_and_replace is restored even if PeftModel.from_pretrained
raises, preventing leaked global state across subsequent model loads.
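The try/finally restore pattern can be shown with a generic stand-in. The Injector class and load_with_patch below are hypothetical and do not use PEFT; only the idea of redirecting to an inner .linear child and always restoring the original method mirrors the fix.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class Injector:
    """Stand-in for the class whose _create_and_replace gets patched."""

    def _create_and_replace(self, target: str) -> str:
        return f"original:{target}"


def load_with_patch(loader: Callable[[], T]) -> T:
    original = Injector._create_and_replace

    def patched(self: Injector, target: str) -> str:
        # Redirect to the inner child, as the real patch does for .linear.
        return original(self, target + ".linear")

    Injector._create_and_replace = patched
    try:
        return loader()
    finally:
        # Runs on success and on failure, so the patch never leaks.
        Injector._create_and_replace = original


_original_method = Injector._create_and_replace


def _failing_loader() -> str:
    raise RuntimeError("load failed")


try:
    load_with_patch(_failing_loader)
except RuntimeError:
    pass
restored_after_failure = Injector._create_and_replace is _original_method
patched_result = load_with_patch(lambda: Injector()._create_and_replace("proj"))
```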
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(studio): lazy-import AutoConfig in model_config.py to fix transformers 5.x version switch
Move `from transformers import AutoConfig` from module level to inside
load_model_config() where it is actually used.
model_config.py is transitively imported at module load time via:
core/inference/__init__ → llama_cpp → utils.models → model_config
In inference subprocesses (mp.spawn), this chain runs before
_activate_transformers_version() can prepend .venv_t5/ to sys.path.
The eager import caches transformers 4.57.6 in sys.modules, and the
subsequent sys.path change has no effect — Python always checks
sys.modules before sys.path.
Making the import lazy ensures transformers is not loaded until after
version activation, so the subprocess picks up the correct version.
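The sys.modules behaviour can be reproduced with a throwaway module. Everything here is illustrative: fake_tf_sketch stands in for transformers, the two temporary directories stand in for the default environment and .venv_t5/, and the second version string is made up.

```python
import importlib
import sys
import tempfile
from pathlib import Path
from typing import List

MODULE = "fake_tf_sketch"  # hypothetical stand-in for transformers


def _make_version_dir(root: Path, name: str, version: str) -> str:
    directory = root / name
    directory.mkdir()
    (directory / f"{MODULE}.py").write_text(f'__version__ = "{version}"\n')
    return str(directory)


def _reset(paths: List[str]) -> None:
    sys.modules.pop(MODULE, None)
    for path in paths:
        while path in sys.path:
            sys.path.remove(path)
    importlib.invalidate_caches()


def eager_import_then_activate(old_dir: str, new_dir: str) -> str:
    _reset([old_dir, new_dir])
    sys.path.insert(0, old_dir)
    importlib.import_module(MODULE)  # eager import caches the old version
    sys.path.insert(0, new_dir)      # "activation" arrives too late
    version = importlib.import_module(MODULE).__version__  # served from sys.modules
    _reset([old_dir, new_dir])
    return version


def activate_then_lazy_import(old_dir: str, new_dir: str) -> str:
    _reset([old_dir, new_dir])
    sys.path.insert(0, old_dir)
    sys.path.insert(0, new_dir)      # activation happens first
    version = importlib.import_module(MODULE).__version__  # first real import
    _reset([old_dir, new_dir])
    return version


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    old_dir = _make_version_dir(root, "venv", "4.57.6")
    new_dir = _make_version_dir(root, "venv_t5", "5.5.0")
    eager_version = eager_import_then_activate(old_dir, new_dir)
    lazy_version = activate_then_lazy_import(old_dir, new_dir)
```

In this sketch the eager path keeps reporting the old version even though the new directory is first on sys.path, while the lazy path picks up the new one.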
* fix(studio): also lazy-import extract_model_size_b in llama_cpp.py
Belt-and-suspenders: make the import that originally triggered the
chain lazy as well, so future module-level AutoConfig additions in
utils.models cannot reintroduce the problem.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
When DEFAULT_PUBLISHED_REPO is ggml-org/llama.cpp, the prebuilt
resolver raises PrebuiltFallback because ggml-org releases do not
include a llama-prebuilt-manifest.json asset. This was caught by the
generic Exception handler and printed as "fatal helper error" to
stderr, which triggers NativeCommandError on PowerShell.
Catch PrebuiltFallback separately in the top-level __main__ handler
and exit with EXIT_FALLBACK (code 2) instead of EXIT_ERROR (code 1).
The message is still logged but without the "fatal helper error"
prefix. The shell scripts already handle non-zero exits and fall
back to source builds.
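A sketch of the handler split, assuming the exit-code names and values given above. The run_helper wrapper and the stand-in actions are hypothetical, and writing the fallback message to stderr is an assumption.

```python
import sys
from typing import Callable

EXIT_OK = 0
EXIT_ERROR = 1
EXIT_FALLBACK = 2


class PrebuiltFallback(Exception):
    """Signals that no prebuilt is usable and a source build should run."""


def run_helper(action: Callable[[], None]) -> int:
    try:
        action()
    except PrebuiltFallback as exc:
        # Expected outcome: log it plainly, without the fatal prefix.
        print(str(exc), file=sys.stderr)
        return EXIT_FALLBACK
    except Exception as exc:
        print(f"fatal helper error: {exc}", file=sys.stderr)
        return EXIT_ERROR
    return EXIT_OK


def _no_manifest() -> None:
    raise PrebuiltFallback("release has no llama-prebuilt-manifest.json asset")


def _broken() -> None:
    raise RuntimeError("unexpected failure")


fallback_code = run_helper(_no_manifest)
error_code = run_helper(_broken)
ok_code = run_helper(lambda: None)
```

Ordering matters here: the PrebuiltFallback clause must precede the generic Exception clause, otherwise the generic handler would catch it first.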
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* fix(studio): revert llama.cpp default tag to latest
The latest ggml-org/llama.cpp release (b8637) now includes Gemma 4
support. Revert the temporary "b8637" pin from #4796 to "latest" so
the prebuilt resolver always picks the newest release automatically
without needing manual tag bumps.
* docs: add comment explaining latest vs master for llama.cpp tag
Document in all three files why "latest" is preferred over "master"
and when "master" should be used as a temporary override.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Gemma 4 is a native transformers 5.5 model and does not need
trust_remote_code=True. The auto-enable logic (added for NemotronH)
was catching all transformers 5.x models, including Gemma 4.
When trust_remote_code=True, unsloth_compile_transformers() returns
early without running the compiler. This disables the fused cross
entropy patch, causing logged training loss to be inflated by the
gradient_accumulation_steps factor.
Exclude models matching "gemma-4" or "gemma4" from the auto-enable
so the compiler runs and applies fused cross entropy correctly.
ggml-org/llama.cpp b8637 includes Gemma 4 support (ggml-org/llama.cpp#21309).
Revert the temporary "master" default back to a pinned release tag.
This eliminates the HTTP 422 errors from the prebuilt resolver (which
could not find a release matching "master"), avoids unnecessary source
builds, and restores prebuilt binary downloads on all platforms.
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* fix windows llama.cpp compile from source issue
* undo local repo usage
* fix llama.cpp install
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix windows
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: route resolve-source-build call through Invoke-LlamaHelper
The --resolve-source-build call at the source-build resolution path
was still calling install_llama_prebuilt.py directly instead of going
through Invoke-LlamaHelper. On PS7+ with ErrorActionPreference=Stop,
stderr from the 422 response (when tag is "master") would trigger a
terminating NativeCommandError and crash setup.
* fix: suppress stderr error records from Invoke-LlamaHelper
ErrorActionPreference=Continue prevents termination but PowerShell
still displays stderr lines as visible ErrorRecord objects. Capture
all output via 2>&1 and split stdout from stderr manually so that
stderr lines never appear on the console. When StderrPath is given
the stderr content is written to that file for diagnostics.
* fix: always rebuild llama.cpp on Windows when tag is master
When the requested llama.cpp tag is "master" (a moving target), skip
the "already built" early exit so the build path runs and syncs to
the latest commit. Without this, existing llama-server binaries from
an older build (e.g. b8635 which lacks Gemma 4 support) are reused
and model loading fails.
Pinned tags (e.g. b8635) still skip the rebuild when the binary
already exists, since the tag is immutable.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
The model list merge order was `top_gguf + top_hub + static_models`,
which meant the HF download-ranked models always came first. New models
like Gemma 4 have low download counts and were not in the HF top-40,
so they got buried after 80 other models despite being at the top of
the curated static defaults in defaults.py.
Flip the merge to `static_models + top_gguf + top_hub` so editorial
picks (new model launches, promoted models) always appear first in the
Recommended section, with HF popularity backfilling after.
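The effect of the new merge order can be sketched as follows. The function name and the model ids are made up for illustration, and the order-preserving de-duplication is an assumption; the log only states the concatenation order.

```python
def merge_recommended(static_models, top_gguf, top_hub):
    """Curated picks first, then popularity-ranked lists, without repeats."""
    seen = set()
    merged = []
    for model_id in [*static_models, *top_gguf, *top_hub]:
        if model_id not in seen:  # assumption: first occurrence wins
            seen.add(model_id)
            merged.append(model_id)
    return merged


# Hypothetical ids, only to show the ordering.
static_models = ["org/new-launch-GGUF", "org/popular-GGUF"]
top_gguf = ["org/popular-GGUF", "org/other-GGUF"]
top_hub = ["org/hub-model"]
merged = merge_recommended(static_models, top_gguf, top_hub)
```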
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
The latest ggml-org/llama.cpp release (b8635) does not include Gemma 4
support (ggml-org/llama.cpp#21309 merged after the release was cut).
This causes `llama-server` to fail with "unknown model architecture:
gemma4" when loading Gemma 4 GGUFs.
Temporarily default _DEFAULT_LLAMA_TAG to "master" so all new installs
build from the llama.cpp master branch which includes Gemma 4 support.
Once a new upstream release is cut with Gemma 4, this can be reverted
back to "latest".
Changes:
- setup.sh: add _DEFAULT_LLAMA_TAG="master" maintainer default
- setup.ps1: add $DefaultLlamaTag="master" maintainer default
- install_llama_prebuilt.py: change DEFAULT_LLAMA_TAG fallback to "master"
Users can still override via UNSLOTH_LLAMA_TAG env var.
Revert the >= loosening from f9c4b08 back to exact pins.
Using transformers>=4.57.6 allows pip to install 5.x into the main
Studio venv, which breaks huggingface_hub imports
(is_offline_mode removed in newer hub versions).
The main venv must stay on transformers==4.57.6 and
huggingface-hub==0.36.2. The 5.x version lives only in .venv_t5/
and is dynamically switched via sys.path at runtime.
The v5.5-release branch now exists on huggingface/transformers.
Use transformers==5.5.0 for all install paths and
git+transformers.git@v5.5-release for the MLX installer.
Also bumps huggingface_hub from 1.7.1 to 1.8.0 in setup.sh and
setup.ps1 to stay consistent.
Hardcode the release repo to ggml-org/llama.cpp and remove the
UNSLOTH_LLAMA_RELEASE_REPO and UNSLOTH_LLAMA_SOURCE env var overrides
so that all users always build/download from mainline llama.cpp.
Gemma-4 support landed in transformers main
(huggingface/transformers#45192). Update the version pin from
5.5.0.dev0 to 5.5.0 across loader, Studio version switcher,
and the MLX installer. Also fix install_gemma4_mlx.sh which
referenced a non-existent v5.5-release branch -- pin it to
the correct commit (91b1ab1) instead.
Small GGUF models (<9B) frequently generate full code or lengthy
explanations instead of calling tools, bypassing the existing
plan-without-action re-prompt mechanism. Four issues:
1. _REPROMPT_MAX_CHARS=500 was too low -- models that output full
HTML/code responses (often 1000+ chars) never triggered the
re-prompt at all, since it only fires on short responses.
2. _MAX_REPROMPTS=1 gave the model only one chance to comply.
Small models often need 2-3 nudges before switching from
text generation to tool calling.
3. The re-prompt text ("Please use the available tools...") was
too polite for small models to follow reliably.
4. Tool-calling detection missed chat templates using Jinja
whitespace-trimming syntax ({%- if tools -%}) since only
({%- if tools %}) and ({% if tools %}) were checked.
Changes:
- Raise _REPROMPT_MAX_CHARS from 500 to 2000 so longer responses
(code blocks, multi-paragraph plans) still trigger re-prompts
- Raise _MAX_REPROMPTS from 1 to 3 for more retry budget
- Use direct, imperative re-prompt language that small models
follow more reliably ("STOP. You MUST call a tool NOW.")
- Strengthen the system prompt tool nudge to explicitly forbid
outputting code blocks (redirect to the python tool instead)
- Add Jinja whitespace-trimmed variants to the tool_markers
list so all template styles are detected correctly
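The template check with whitespace-trimmed variants might look like the sketch below. The tuple contents follow the markers quoted above plus the remaining trim combination; the function name and the sample templates are hypothetical.

```python
# Markers that suggest a chat template branches on the `tools` variable.
TOOL_MARKERS = (
    "{% if tools %}",
    "{%- if tools %}",
    "{% if tools -%}",
    "{%- if tools -%}",
)


def template_supports_tools(chat_template: str) -> bool:
    """Return True if any known tool marker appears in the template."""
    return any(marker in chat_template for marker in TOOL_MARKERS)


trimmed = "{%- if tools -%}\n{{ tools | tojson }}\n{%- endif -%}"
plain = "{% for message in messages %}{{ message.content }}{% endfor %}"
```

A plain substring list keeps the check cheap and easy to audit; a regex tolerant of arbitrary spacing would be an alternative design.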
* UI Changes
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove unrelated test file
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* feat(studio): display images from Python tool execution in chat UI
When the model calls the Python tool to create a matplotlib plot or
other image file, the image now displays inline in the chat output
instead of being invisible to the user.
Backend:
- Detect new image files (png/jpg/gif/webp/bmp) after Python subprocess
completes by diffing os.listdir before/after execution
- Append __IMAGES__ sentinel to tool result for frontend consumption
- Strip sentinel before injecting result into LLM context (role: tool)
so the model never sees file paths
- Add GET /sandbox/{session_id}/{filename} endpoint with JWT auth
(header or query param), path traversal protection, extension
allowlist, realpath containment check, and nosniff header
Frontend:
- Parse __IMAGES__ sentinel in tool_end SSE events, create structured
result with text/images/sessionId
- Render <img> tags in Python tool UI pointing at the sandbox endpoint
Also fixes a bug where SyntaxError in user code was misreported as
"unsafe code detected" instead of showing the actual Python traceback.
The _check_code_safety function now lets SyntaxError pass through to
the subprocess for a proper error message.
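The sandbox endpoint's file checks (extension allowlist, path traversal protection, realpath containment) could be sketched roughly like this. `resolve_sandbox_file` is a hypothetical helper, not the actual route code, and JWT auth and response headers are left out.

```python
import os
import tempfile

ALLOWED_EXTENSIONS = {".png", ".jpg", ".gif", ".webp", ".bmp"}


def resolve_sandbox_file(sandbox_dir, filename):
    """Return the real path of an allowed image inside sandbox_dir, else None."""
    # Only bare filenames are accepted: no directories, no "..".
    if filename in ("", ".", "..") or filename != os.path.basename(filename):
        return None
    if os.path.splitext(filename)[1].lower() not in ALLOWED_EXTENSIONS:
        return None
    root = os.path.realpath(sandbox_dir)
    candidate = os.path.realpath(os.path.join(root, filename))
    try:
        # Containment check after symlinks are resolved.
        if os.path.commonpath([root, candidate]) != root:
            return None
    except ValueError:  # e.g. different drives on Windows
        return None
    return candidate if os.path.isfile(candidate) else None


with tempfile.TemporaryDirectory() as sandbox:
    with open(os.path.join(sandbox, "plot.png"), "wb") as handle:
        handle.write(b"\x89PNG")
    found = resolve_sandbox_file(sandbox, "plot.png")
    served_name = os.path.basename(found) if found else None
    traversal = resolve_sandbox_file(sandbox, "../plot.png")
    bad_extension = resolve_sandbox_file(sandbox, "notes.txt")
```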
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix(studio): improve SVG detection and strip XML preamble
Handle <?xml ...?> declarations before <svg> tags in code fences,
strip XML declaration from SVGs before data URI rendering, and
update the sloth suggestion prompt to request showing code.
* fix(studio): persist parentId so retries survive reload
The append() handler was destructuring only { message } from
ExportedMessageRepositoryItem and discarding parentId. When loading
a saved thread, load() used ExportedMessageRepository.fromArray()
which chains all messages sequentially, flattening retry branches
into a linear list.
Now append() writes parentId to the MessageRecord, and load()
reconstructs the tree when parentIds are present. Old threads
without parentId fall back to the existing fromArray() behavior.
* fix(studio): address review findings for image display and retry persistence
Image detection:
- Use mtime comparison instead of filename-only diff so overwritten
files (e.g. plt.savefig("chart.png") called twice) are detected
Sentinel parsing:
- Use rsplit/lastIndexOf instead of split/indexOf so user code that
prints __IMAGES__: does not collide with the backend sentinel
Mixed legacy/new threads:
- For old messages without a stored parentId, infer sequential parent
from the previous message instead of null, preventing multiple roots
Sandbox endpoint:
- Change Cache-Control from "public, max-age=3600" to "private,
no-store" since these are authenticated responses
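A sketch of the two review fixes for image detection and sentinel parsing. The exact sentinel wire format (a newline-prefixed `__IMAGES__:` followed by comma-separated filenames) is an assumption, as are all function names.

```python
import os

IMAGE_EXTENSIONS = (".png", ".jpg", ".gif", ".webp", ".bmp")
SENTINEL = "\n__IMAGES__:"  # assumed format, not confirmed by the log


def snapshot_images(directory):
    """Map image filename -> modification time (ns) for one directory."""
    return {
        entry.name: entry.stat().st_mtime_ns
        for entry in os.scandir(directory)
        if entry.is_file() and entry.name.lower().endswith(IMAGE_EXTENSIONS)
    }


def changed_images(before, after):
    """New files and overwritten files both show up as a changed mtime."""
    return sorted(name for name, mtime in after.items() if before.get(name) != mtime)


def split_tool_result(result):
    """Split on the LAST sentinel so user output containing it is kept as text."""
    parts = result.rsplit(SENTINEL, 1)
    if len(parts) == 1:
        return result, []
    text, tail = parts
    return text, [name for name in tail.split(",") if name]
```

Taking `snapshot_images()` before and after the subprocess and passing both dicts to `changed_images()` would catch a second `plt.savefig("chart.png")` that a filename-only diff misses.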
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(frontend): scope sans font overrides to chat thread only
* fix(frontend): use font-sans fallback for heading stack and simplify chat font rules
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* update logic to incorporate custom prebuilt installs
* bug fixes
* update for review comments
* fix tags
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Separate test changes from main PR
Move test file changes out of this PR to keep the diff focused on
the install_llama_prebuilt.py and setup script changes. Test updates
will be submitted in a follow-up PR.
* Fix branch ref normalization and harden JSON parsing
- Add checkout_friendly_ref() to strip refs/heads/ prefix from branch
refs before emitting them in SourceBuildPlan. git clone --branch does
not accept fully qualified refs like refs/heads/main.
- Apply normalization in source_build_plan_for_release() and the
direct-ref fallback in resolve_source_build_plan().
- Allow validated_checksums_for_bundle() to accept releases that carry
only an exact-commit source archive without the legacy upstream-tag
source tarball.
- Add 2>/dev/null || true guards to all inline python -c JSON parsing
in setup.sh so a malformed payload does not abort the script under
set -e.
* Fix Windows CUDA asset ordering and tag ref normalization
- Reorder windows_cuda_upstream_asset_names to prefer the main binary
archive (llama-{tag}-bin-win-cuda-*) over the cudart sidecar archive
(cudart-llama-bin-win-cuda-*). The cudart ZIP only contains CUDA
runtime DLLs, not llama-server or llama-quantize binaries.
- Extend checkout_friendly_ref to also strip refs/tags/ prefix for tag
refs, matching the refs/heads/ handling for branch refs.
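The ref normalization described above can be sketched in a few lines. The function name comes from the commit message; the body is an illustrative guess at the behaviour, not the shipped implementation.

```python
def checkout_friendly_ref(ref: str) -> str:
    """Strip refs/heads/ or refs/tags/ so the ref works with git clone --branch."""
    for prefix in ("refs/heads/", "refs/tags/"):
        if ref.startswith(prefix):
            return ref[len(prefix):]
    # Short names and commit hashes pass through unchanged.
    return ref
```

For example, `checkout_friendly_ref("refs/heads/main")` gives `"main"`, which `git clone --branch` accepts.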
* Simplify JSON parsing consistency in setup.sh
Use json.load(sys.stdin) consistently for all inline JSON parsing
in setup.sh, instead of the more complex json.loads(raw) pattern
on the install-tag resolution path. The 2>/dev/null || true guard
already handles empty/malformed input gracefully.
* Fix source build plan fallback for commit ref kind in PR #4771
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <daniel@unsloth.ai>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Expand test coverage for install_llama_prebuilt.py:
- Add tests for source build plan resolution with custom repos
- Add tests for branch/commit/PR ref matching and normalization
- Add tests for manifest checksum validation
- Add tests for Windows CUDA upstream asset name patterns
- Update capsys checks to capture stderr after log() redirect
* fix(studio): prevent small models from stalling on tool-calling tasks
Small GGUF models (< 9B params) in "Think, Search, Code" mode would
often describe what they planned to do ("Let me create this dashboard")
and then stop generating without ever calling a tool.
Three changes:
1. Simplify web_tips for small models: remove the "fetch its full content
by calling web_search with the url parameter" guidance for models < 9B.
This multi-step instruction causes small models to plan elaborate
search-then-fetch-then-code sequences they cannot reliably execute.
2. Add "always call tools directly" imperative to the system prompt nudge
so models act immediately instead of narrating their intentions.
3. Add plan-without-action re-prompt in the agentic loop: when the model
emits planning text (matching patterns like "let me", "I'll", etc.)
without calling any tool, inject a nudge asking it to call the tool
and continue the loop. Capped at 2 re-prompts per request.
Benchmarked with Qwen3.5-4B-GGUF (N=5 trials per variant):
- Baseline: 40% of requests had any tool call
- Combined fix: 100% of requests had at least one tool call
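The plan-without-action re-prompt can be sketched as a small loop. The regex, the nudge text, and the `generate` callback shape are assumptions; only the "let me" / "I'll" patterns and the cap of 2 come from the description above.

```python
import re

# Assumed pattern list; the real one may contain more phrases.
PLANNING_PATTERN = re.compile(r"\b(?:let me|i'll|i will)\b", re.IGNORECASE)
MAX_REPROMPTS = 2


def needs_reprompt(text, tool_calls, reprompts_used, max_reprompts=MAX_REPROMPTS):
    """True when the model narrated a plan but called no tool and budget remains."""
    if tool_calls or reprompts_used >= max_reprompts:
        return False
    return bool(PLANNING_PATTERN.search(text))


def run_agent_loop(generate, messages, max_reprompts=MAX_REPROMPTS):
    """generate(messages) -> (text, tool_calls); returns the final turn and nudge count."""
    reprompts = 0
    while True:
        text, tool_calls = generate(messages)
        if not needs_reprompt(text, tool_calls, reprompts, max_reprompts):
            return text, tool_calls, reprompts
        reprompts += 1
        messages = messages + [
            {"role": "assistant", "content": text},
            {"role": "user", "content": "Call the tool now."},  # hypothetical nudge
        ]


# Scripted stand-in for a model that plans twice, then calls a tool.
scripted = iter([
    ("Let me create this dashboard.", []),
    ("I'll start by searching.", []),
    ("", [{"name": "python", "arguments": "{}"}]),
])
final_text, final_calls, nudges = run_agent_loop(lambda msgs: next(scripted), [])
```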
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix shell injection in GGML conversion paths
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove test file from security fix PR
Move test_save_shell_injection.py to a separate PR to keep this PR focused on the security fix itself.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Distinguish between actual network downloads and GPU memory loading for cached LoRA adapters in Studio chat.
- Add isCachedLora detection for local LoRA adapter paths using comprehensive cross-platform regex (Unix, Windows, UNC, WSL, tilde)
- Thread isCachedLora through loadInfo to chat-page inline status for proper 3-way distinction (cached / local LoRA / downloading)
- Skip download progress polling for cached LoRA models (no useless /download-progress API calls)
- Fix initial toast state to use isCachedLoad consistently instead of only checking isDownloaded
- Fix cancelLoading toast to not mention background downloads for cached/local loads
- Keep download-specific text ("Downloading model..." / "Download complete") inside the download-only polling block
- Add min-w-0 guards to thread/message/markdown containers to prevent
content overflow past the composer width
- Unify chat typography from Hellix/Space Grotesk to the sans stack,
keeping monospace for code blocks and inline code
- Restructure desktop navbar right-side controls with shrink-0 wrappers
for consistent spacing across HoverCard roots
- Soften tool-call label styling (font-medium + text-foreground/85
instead of bold)
- Add responsive code block sizing via @container queries
- Add horizontal scrolling for wide code blocks within the thread column
- Scope list-item code block alignment CSS to .aui-thread-root
- Preserve useScrollLock in tool-fallback and tool-group collapsibles
- Fall back to bg-background on ViewportFooter when hideComposer is true
- Widen inline code monospace selector to cover th, blockquote, and
heading elements
- Remove unused @fontsource-variable/space-grotesk import
* Fix script unbound variable error
* remove stale test script, add llama.cpp metal source builds, update tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix Metal precedence, test sync, and add behavioral tests
- Move macOS arm64 Metal check before CUDA/ROCm in GPU backend
decision chain so Metal is not bypassed when nvcc is in PATH
- Remove RPATH flags from CPU fallback CMAKE_ARGS (only needed
for Metal library linking)
- Update test_llama_pr_force_and_source.py to match _CLONE_ARGS
rename from _CLONE_BRANCH_ARGS in setup.sh
- Add confirm_install_tree guard test for
existing_install_matches_choice
- Add TestMacOSMetalBuildLogic bash subprocess tests verifying
Metal flag selection, nvcc precedence, and CPU fallback behavior
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix Metal CPU fallback to also cover cmake build failures and update tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* 1. _GPU_BACKEND_FRAGMENT synced -- removed dead CPU_FALLBACK_CMAKE_ARGS= init (6/8)
2. RPATH assertion replaced -- new test_macos_arm64_cpu_fallback_args_exclude_rpath checks the actual runtime CPU_FALLBACK_CMAKE_ARGS output for @loader_path and -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON (6/8)
3. _TRY_METAL_CPU_FALLBACK=false reset after both configure-failure and build-failure fallback branches in setup.sh (4/8)
4. macOS test now removes libmtmd.0.dylib instead of the platform-agnostic convert_hf_to_gguf.py (3/8)
5. Empty-string tag test added -- test_empty_tag_omits_branch_flag for resolved_tag= (2/8)
6. RPATH checks on cmake call logs -- both fallback tests now assert @loader_path and -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON are absent from CPU fallback cmake calls, plus baseline flag preservation (multiple)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* tests clean up
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(studio): allow context length slider to reach model's native limit
The context length slider was hard-capped to the VRAM-estimated maximum,
preventing users from requesting higher context even though the backend
already handles it safely (multi-GPU selection, --fit fallback). Expose
the model's native context length from GGUF metadata as a separate API
field and use it as the slider ceiling instead. Add an amber warning
when the selected context exceeds the estimated VRAM capacity.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Raise VRAM budget to 90% and add native_context_length tests
Increase the GPU memory utilization threshold from 70% to 90% across
_select_gpus and _fit_context_to_vram, allowing longer context lengths
before VRAM capping kicks in.
Add 33 tests for the native_context_length feature covering the backend
property, context value separation invariants, Pydantic models, route
completeness, edge cases, and cross-platform binary I/O.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix: add tokenizers to no-torch runtime deps and add TORCH_CONSTRAINT for arm64 macOS py313+
Two installer fixes:
1. Add `tokenizers` to `no-torch-runtime.txt` before `transformers`.
Without it, `from transformers import AutoConfig` crashes on startup
because `--no-deps` skips transitive dependencies.
2. Add `TORCH_CONSTRAINT` variable to `install.sh`. On arm64 macOS with
Python 3.13+, tighten the torch requirement to `>=2.6` since torch
<2.6 has no cp313 arm64 wheels. The variable replaces the previously
hard-coded constraint in the uv pip install line.
Includes 66 tests (42 pytest + 24 bash) covering:
- Structural checks on install.sh, install.ps1, no-torch-runtime.txt
- Shell snippet tests with mocked python for 13 platform/version combos
- Mock uv integration verifying correct constraint string
- E2E venv tests on Python 3.12 and 3.13 confirming AutoConfig works
- Negative control proving AutoConfig fails without tokenizers
- Full no-torch sandbox regression guards (safetensors, huggingface_hub)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix incomplete no-torch manifest and align E2E tests with real --no-deps path
- Add missing transitive deps to no-torch-runtime.txt that are required
under --no-deps: regex, typing_extensions, filelock, httpx, httpcore,
certifi, idna, anyio, sniffio, h11. Without these, `from transformers
import AutoConfig` still fails after install.sh --no-torch.
- Change all E2E tests to use --no-deps (matching what install.sh does)
instead of normal dep resolution. Previous tests passed even with an
incomplete manifest because uv backfilled transitive deps.
- Rewrite negative control to derive from the real no-torch-runtime.txt
with tokenizers stripped, proving the specific fix matters.
- Replace GNU-only sed -i with heredoc in shell test for macOS compat.
- Remove unused os/sys imports from Python test file.
- Quote SKIP_TORCH and mock uv paths in bash -c strings.
* Assert install succeeds before checking import results in E2E tests
Address review feedback: test_torch_not_importable and
test_tokenizers_directly_importable in Group 3 now assert that
uv pip install returns 0 before checking import behavior. This
prevents false positives when the install itself fails silently.
* Assert install succeeds in negative control and tighten error check
- Add missing install-success assertion in test_negative_control_no_tokenizers
to prevent false positives from network/install failures.
- Tighten error message check to look for "tokenizers" in stderr or
ModuleNotFoundError, rather than the generic "No module" substring
which could match unrelated import failures.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- Fix SSL handshake failures (SSLV3_ALERT_HANDSHAKE_FAILURE, CERTIFICATE_VERIFY_FAILED) when fetching HTTPS pages by introducing _PinnedHTTPSConnection that separates TCP connect (to pinned IP) from TLS handshake (with real hostname for SNI/cert verification)
- Fix SSRF DNS-rebinding vulnerability: the previous implementation swapped conn.host before connect(), causing a fresh DNS resolution; the new subclass keeps TCP pinned to the validated IP
- Fix SPA/JS-rendered doc sites returning empty content by rotating real browser User-Agents (Chrome/Firefox/Safari)
- Strip nav/footer from HTML-to-Markdown output so article content is not buried under navigation chrome
- Increase raw fetch cap from 64KB to 512KB so SSR article content is reached on GitBook/Docusaurus/Next.js pages
- Fix IPv6 address bracketing in URL netloc construction
- Hoist SSL context, handler classes, and stdlib imports to module level (created once, not per-call)
- Use consistent UA across redirect hops to avoid breaking session-aware bot detection
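A minimal sketch of the pinned-connection idea, assuming a subclass of the standard library's `http.client.HTTPSConnection`. The class and helper names here are illustrative, and details such as proxy tunnelling and redirect handling are omitted.

```python
import http.client
import ipaddress
import socket
import ssl


class PinnedHTTPSConnection(http.client.HTTPSConnection):
    """TCP connects to a pre-validated IP; TLS still verifies the real hostname."""

    def __init__(self, host, pinned_ip, *, context=None, **kwargs):
        self._tls_context = context or ssl.create_default_context()
        super().__init__(host, context=self._tls_context, **kwargs)
        self._pinned_ip = pinned_ip

    def connect(self):
        # No DNS lookup here: the socket goes straight to the validated address.
        sock = socket.create_connection((self._pinned_ip, self.port), self.timeout)
        # SNI and certificate verification use the original hostname.
        self.sock = self._tls_context.wrap_socket(sock, server_hostname=self.host)


def format_netloc(host, port):
    """Bracket IPv6 literals when building a URL netloc."""
    try:
        is_v6 = ipaddress.ip_address(host).version == 6
    except ValueError:  # not an IP literal, i.e. a hostname
        is_v6 = False
    return f"[{host}]:{port}" if is_v6 else f"{host}:{port}"
```

Keeping the hostname on the connection object and overriding only `connect()` leaves the Host header and certificate checks on the real name while the socket itself cannot be redirected by a second DNS answer.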
Split out from #4741 to keep the main PR focused on installer logic.
- New test_install_llama_prebuilt_logic.py: tests for resolve logic,
fallback behavior, env_int, busy/lock handling
- New test_validate_llama_prebuilt.py: validator tests for staged
release_tag/upstream_tag handling
- New test_llama_pr_force_and_source.py: tests for PR_FORCE and
LLAMA_SOURCE maintainer defaults
- Updated test_selection_logic.py: expanded selection/fallback coverage
- Updated test_pr4562_bugfixes.py: updated bugfix tests for new logic
- Updated smoke_test_llama_prebuilt.py: minor update
Replaces the fixed prebuilt llama.cpp tag with dynamic published-release
resolution, adds bounded fallback across older published releases, and
introduces maintainer-editable defaults for PR/source overrides.
Changes:
- Resolve "latest" to the newest usable published release in unslothai/llama.cpp
- Use the selected release upstream_tag as the authoritative llama.cpp version
- Prefer Unsloth-published platform assets when available
- Fall back to same-tag upstream ggml-org/llama.cpp assets where allowed
- Keep Linux CUDA anchored to Unsloth-published CUDA bundles only
- Add bounded fallback across older Unsloth published releases
- Add separate busy/in-use install handling (exit code 3)
- Skip reinstall when the installed bundle already matches the selected candidate
- Add maintainer-editable _DEFAULT_LLAMA_PR_FORCE and _DEFAULT_LLAMA_SOURCE
- Harden env parsing so malformed installer env vars do not crash import-time fallback logic
- Honor UNSLOTH_LLAMA_RELEASE_TAG in all resolve steps
- Always sync git remote URL in existing-checkout path
* Fix save_pretrained_merged for full-finetuned models
save_pretrained_merged and push_to_hub_merged silently do nothing when
the model is not a PeftModel (i.e. full finetuning without LoRA).
merge_and_overwrite_lora returns None immediately for non-PeftModel,
and unsloth_generic_save does not check the return value.
Add a non-PeftModel branch in unsloth_generic_save that falls back to
model.save_pretrained / model.push_to_hub. When save_method contains
"16bit", cast weights to bfloat16 (or float16) via a state_dict copy
to honor the user's intent without mutating the live model.
The existing PeftModel (LoRA) code path is unchanged.
* Forward create_pr and revision to tokenizer.push_to_hub
The tokenizer push_to_hub call was missing create_pr and revision,
which could cause the tokenizer to push to the wrong branch or
bypass PR creation when the model push uses them.
* Honor merged_16bit dtype contract for full-finetuned models
Cast state_dict to bfloat16/float16 when save_method contains "16bit"
to match the documented behavior of save_pretrained_merged. Also pass
state_dict and save kwargs consistently to both save_pretrained and
push_to_hub paths.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address review feedback for PR #4755
- Simplify PeftModel isinstance check (PeftModelForCausalLM inherits
from PeftModel)
- Add is_main_process guard for distributed training
- Forward variant to save_pretrained
- Set tokenizer padding_side to "left" before saving (matches other
save paths)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* feat(studio): architecture-aware KV cache VRAM estimation
Replace the single legacy formula (2 * n_kv_heads * head_dim * n_layers
* n_ctx * bpe) with 5-path estimation that reads 8 additional GGUF
metadata fields:
1. MLA (DeepSeek-V2/V3, GLM-4.7, GLM-5, Kimi-K2.5) -- K-only cache
using compressed KV latent + RoPE; no separate V allocation
2. Hybrid Mamba (Qwen3.5-27B, Qwen3.5-35B-A3B) -- only attention
layers (1 in N) carry KV; Mamba layers have none
3. Sliding Window (Gemma-3, gpt-oss) -- SWA layers cache
min(ctx, window) tokens instead of the full context
4. Standard GQA -- uses explicit key_length/value_length from GGUF
instead of embed // n_heads (which is wrong for many models)
5. Legacy fallback -- identical to old formula for old GGUFs
New GGUF fields parsed: attention.key_length, attention.value_length,
attention.sliding_window, full_attention_interval,
attention.kv_lora_rank, attention.key_length_mla, ssm.inner_size,
ssm.state_size.
Validated against 9 real GGUF files (72/72 field checks pass).
The legacy formula was off by +682% for Gemma-3 and -81% for
DeepSeek-V3.1.
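To make the difference between the legacy formula and the sliding-window path concrete, here is a rough sketch. The legacy expression is the one quoted above; the SWA function, its parameter names, and the example dimensions are assumptions for illustration and do not correspond to any specific model.

```python
def legacy_kv_bytes(n_kv_heads, head_dim, n_layers, n_ctx, bytes_per_element=2):
    """Legacy estimate: K and V of equal size, every layer caches the full context."""
    return 2 * n_kv_heads * head_dim * n_layers * n_ctx * bytes_per_element


def swa_kv_bytes(n_kv_heads, key_length, value_length, n_global_layers,
                 n_swa_layers, n_ctx, sliding_window, bytes_per_element=2):
    """Sliding-window estimate: SWA layers cache at most `sliding_window` tokens."""
    per_token = n_kv_heads * (key_length + value_length) * bytes_per_element
    swa_tokens = min(n_ctx, sliding_window)
    return per_token * (n_global_layers * n_ctx + n_swa_layers * swa_tokens)


# Made-up dimensions: 32 layers, 8 KV heads, 128-dim K and V, fp16 cache.
legacy = legacy_kv_bytes(8, 128, 32, 4096)
windowed = swa_kv_bytes(8, 128, 128, n_global_layers=8, n_swa_layers=24,
                        n_ctx=4096, sliding_window=1024)
```

With these numbers the legacy formula reports 512 MiB while the windowed estimate is 224 MiB, which shows the direction of the overestimate for SWA models.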
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix MLA fallback and SWA global/local ratio heuristic
Two fixes based on review findings:
1. MLA fallback now uses key_length_mla from GGUF metadata instead of
hardcoded rope_dim=64. Falls back to 64 only when key_length_mla is
absent. This ensures correct estimates for MLA variants that use
rope dimensions other than 64.
2. SWA global/local layer ratio changed from 50/50 to 1/4 (25% global,
75% SWA). Most sliding window architectures have predominantly local
layers (Gemma-3 uses ~17% global, gpt-oss uses ~50%). The 1/4
heuristic is closer to the common case and still a large improvement
over the legacy formula which ignores SWA entirely.
* Tighten _can_estimate_kv gate and treat sliding_window=0 as disabled
Two additional fixes from review round 1 (5/8 and 4/8 reviewer consensus):
1. _can_estimate_kv now requires BOTH key_length AND value_length for
the explicit-dims path. Previously key_length alone was enough,
which could cause silent fallthrough to the legacy formula with
fabricated defaults (n_kv=1, head_dim=128) when value_length was
absent from the GGUF.
2. SWA path now requires sliding_window > 0. Some GGUFs use 0 as a
disabled sentinel. Without this guard, min(ctx, 0) would zero out
all SWA layer contributions, severely underestimating KV cache.
* Fix MLA n_kv safety and use ceiling division for hybrid path
Addresses Gemini Code Assist review findings:
1. MLA path now uses n_kv_mla = n_kv_heads or 1 (not n_heads). This
prevents a 128x overestimate for DeepSeek-V3 if head_count_kv is
absent from the GGUF (n_heads=128 would have been used instead).
2. Hybrid path now uses ceiling division for attention layer count.
This prevents undercounting by 1 when n_layers is not perfectly
divisible by full_attention_interval.
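Two of the guards above are easy to get subtly wrong, so a small sketch may help. Both helper names are hypothetical; the logic follows the descriptions of the ceiling-division and `sliding_window > 0` fixes.

```python
def attention_layer_count(n_layers, full_attention_interval):
    """Layers carrying KV cache in a hybrid model, using ceiling division."""
    return -(-n_layers // full_attention_interval)


def effective_window(n_ctx, sliding_window):
    """Tokens cached per SWA layer; 0 or None means sliding window is disabled."""
    if sliding_window and sliding_window > 0:
        return min(n_ctx, sliding_window)
    return n_ctx
```

Floor division would give 15 for 62 layers at an interval of 4, and an unguarded `min(ctx, 0)` would zero out the SWA contribution entirely.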
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix forward compatibility with transformers 5.x
Tested on transformers 4.57.6, 5.3.0, and 5.4.0. All changes are no-ops
on transformers 4.x.
1. Skip exec-based config patching for transformers >= 5.0
Config classes in v5 use @strict, @auto_docstring, and interval()
which break exec(inspect.getsource(...)). Those configs already use
rope_parameters (the v5 replacement for rope_scaling).
2. Slice position_ids to last token in fast_forward_inference
Transformers 5.x generate() accumulates position_ids as
[batch, full_seq_len] across decode steps instead of [batch, 1].
cos[position_ids] then produces the wrong shape for rotary
embeddings. Fixed in llama, qwen3, falcon_h1, gemma2, cohere,
granite. No-op on 4.x since position_ids is already [batch, 1].
3. Handle @strict config kwargs for sequence classification
num_labels, max_position_embeddings, id2label etc. are set on the
config object and passed via config= instead of as kwargs.
AutoModelForSequenceClassification routing added to FastModel loader.
4. Exclude modernbert from flex_attention
ModernBERT with flex_attention hits CUDA illegal memory access in
create_block_mask. Falls back to eager attention safely.
5. Propagate token_type_ids and mm_token_type_ids through GRPO VLM path
Gemma3 Vision requires token_type_ids during training. Qwen3VL
requires mm_token_type_ids for M-RoPE. Extract from inputs in
compute_loss, pass to grpo_accumulated_loss, and extend
mm_token_type_ids for completion tokens in
_generate_and_score_completions.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add try/except safety net around config exec for pre-release transformers versions
* Pop config-level kwargs in seqclass path and use except Exception
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
When searching for a specific publisher model (e.g. `openai/gpt-oss-20b`), the
unsloth search used the full `openai/gpt-oss-20b` string with `author=unsloth`,
which returned zero results because no unsloth model contains the publisher
prefix in its name. Users never discovered unsloth variants.
This PR strips the org prefix for publisher-qualified queries so unsloth variants
surface, then pins the original publisher model after a small batch of unsloth
results. Plain queries (no slash) and unsloth-prefixed queries are unchanged.
- Strict regex (`/^([^/\s]+)\/([^/\s]+)$/`) only triggers on valid `owner/repo`
identifiers; incomplete typeahead, multi-slash, and URL-like inputs are rejected
- Queries for `unsloth/...` models (case-insensitive) keep the full 20-result
prefetch and secondary sort
- Pinned model lookup fires in parallel with the unsloth prefetch
- Canonical-name dedup prevents duplicates when HF normalizes casing
- Publisher detection extracted into a single `useMemo` block
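The query handling described above can be sketched in Python as follows. The Studio code is frontend code, so this is only a transliteration of the regex under stated assumptions; `unsloth_search_term` and its return shape are hypothetical.

```python
import re

# Same character classes as /^([^/\s]+)\/([^/\s]+)$/. fullmatch() is used
# because Python's "$" would also accept a trailing newline.
_OWNER_REPO = re.compile(r"([^/\s]+)/([^/\s]+)")


def unsloth_search_term(query):
    """Return (search_term, pinned_model) for a search box query."""
    m = _OWNER_REPO.fullmatch(query)
    if m is None:
        return query, None            # plain, incomplete, or URL-like input
    owner, repo = m.groups()
    if owner.lower() == "unsloth":
        return query, None            # unsloth-prefixed queries are unchanged
    return repo, query                # strip the org prefix, pin the original
```

For example, `openai/gpt-oss-20b` yields the search term `gpt-oss-20b` with the original id pinned, while `openai/` and `a/b/c` are left untouched.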
Replace strikethrough + opacity-50 OOM styling with gray text and red pill badge across all Studio model selectors (chat, training, onboarding).
- Use gray-500/gray-400 for OOM model names (better contrast than strikethrough)
- Red pill badge for OOM indicator with light/dark mode support
- Scope GGUF gray override to quant name only so downloaded/recommended labels keep colors
- Add !important on TIGHT/OOM badges to resist ComboboxItem hover overrides
* Fix Windows "Non-relative patterns are unsupported" when loading local GGUF models
When a user loads a GGUF model from a local Windows path (e.g.
C:\Users\danie\.lmstudio\models\unsloth\functiongemma-270m-it-GGUF),
the model identifier contains backslashes and a drive letter. Both
load_model_defaults() and _has_specific_yaml() constructed a YAML
filename from the full absolute path and passed it to Path.rglob(),
which rejects non-relative patterns on Windows.
Fixed by detecting Windows-style paths (drive letters, UNC paths,
backslashes) in addition to Unix-style paths, and using only the
directory basename for the YAML filename lookup when the identifier
is a local filesystem path.
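A minimal sketch of the basename idea, assuming hypothetical helper names (`looks_like_local_path`, `yaml_lookup_name`) rather than the real ones in the backend. PureWindowsPath is used so that backslash paths are split correctly on any host OS.

```python
import re
from pathlib import PurePosixPath, PureWindowsPath

_DRIVE = re.compile(r"^[A-Za-z]:[\\/]")  # e.g. "C:\" or "C:/"


def looks_like_local_path(identifier):
    """Heuristic: Unix-style, drive-letter, UNC, or backslash paths."""
    return (
        identifier.startswith(("/", "./", "../", "~"))
        or bool(_DRIVE.match(identifier))
        or "\\" in identifier  # also covers UNC paths such as \\server\share
    )


def yaml_lookup_name(identifier):
    """Use only the directory basename for local paths."""
    if not looks_like_local_path(identifier):
        return identifier  # a Hub id such as "unsloth/model" is kept whole
    if "\\" in identifier or _DRIVE.match(identifier):
        return PureWindowsPath(identifier).name
    return PurePosixPath(identifier).name
```

The resulting name is relative, so it can be used to build a glob pattern without tripping the "Non-relative patterns are unsupported" error.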
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Refactor: reuse is_local_path helper, fix case-sensitive suffix lookup
- Replace inline local-path detection in model_config.py and
inference_config.py with the existing is_local_path() from utils.paths,
which already handles Unix, Windows drive-letter, UNC, and backslash paths
- Fix case-sensitive suffix lookup in load_model_defaults(): the
_REVERSE_MODEL_MAPPING is lowercase-keyed, so suffix comparisons must use
.lower() to match paths like /path/to/Spark-TTS-0.5B/LLM
* Fix WSL path parsing and _has_specific_yaml suffix lookup
- Use normalize_path() before Path() operations so backslash Windows
paths (e.g. C:\Users\...\model) are correctly split on POSIX/WSL hosts
where pathlib treats backslashes as literal characters
- Add suffix-based (2-component and 1-component) lookup to
_has_specific_yaml() so it matches the same resolution rules as
load_model_defaults(), fixing wrong inference params for local
suffix-mapped models like Spark-TTS-0.5B/LLM
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix: clear tool status badge immediately after tool execution
The tool status timer badge (Searching 1s, 2s...) persisted after
tool calls finished because the status clear event was only sent
at the start of the next generation iteration, not after tool
execution completed.
Backend: yield status clear after all tools finish in the agentic
loop iteration, before continue starts the next generation pass.
Frontend: debounce badge visibility by 300ms so sub-second tool
calls don't flash the badge.
* Fix debounce regression for consecutive tool calls
Only apply the 300ms show-delay when transitioning from idle to
tool-active. When switching between consecutive tools in the same
turn (e.g. web_search -> python), keep the badge visible immediately
so it does not flicker or disappear during multi-tool runs.
* Delay wasActiveRef reset to bridge inter-iteration tool gaps
The backend emits a status-clear event between tool iterations,
which was resetting wasActiveRef immediately and causing the next
tool to be re-debounced (300ms hidden gap between consecutive tools
in the same turn). Now the ref reset is delayed by 500ms so a
follow-up tool within the same agentic turn shows the badge
immediately, while a genuinely new turn still gets the debounce.
* Use thread lifecycle to track tool-run boundaries
Replace the 500ms wall-clock timeout with the actual thread.isRunning
state to determine when wasActiveRef should reset. This properly
handles all cases:
- Consecutive tools within the same run stay visible without flicker
- The badge hides only when the thread run actually ends
- New turns always get a fresh 300ms debounce on the first tool
- No heuristic timeout that can misfire on slow or fast inference
* Consolidate wasActiveRef reset into single effect
Removes the separate isThreadRunning effect to avoid a race where
the ref resets before the tool-status effect reads it (when
isThreadRunning flips to false before setToolStatus(null) from
the adapter's finally block). Now wasActiveRef resets only when
both toolStatus is null AND the thread run has ended, eliminating
any flicker on the last tool of a run.
* Simplify debounce: use visible state instead of ref tracking
Drop wasActiveRef entirely and use the visible state as the
debounce gate. When the badge is not yet on screen, debounce
for 300ms before showing. When already visible from a prior tool,
keep showing immediately. This correctly handles all cases:
- All fast tools (<300ms) are suppressed, not just the first
- Consecutive tools after the badge is shown stay visible
- Badge persists across inter-iteration clears while thread runs
- New turns get a fresh debounce after visible resets
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* refactor: move folder management from sidebar into model selector
* Fix folder management: restore LoRA picker sync, error handling, caching
- Restore onFoldersChange callback to keep LoRA adapter picker in sync
when scan folders are added/removed (fixes regression from sidebar move)
- Thread onFoldersChange through ModelSelector -> HubModelPicker prop chain
- Add module-level _scanFoldersCache to prevent folder list flash on re-open
- Surface error toast on folder removal failure instead of silently ignoring
- Guard handleAddFolder against concurrent double-submit via folderLoading
- Clear folderInput on Escape key dismiss to prevent stale input on re-open
- Add refreshLocalModelsList and refreshScanFolders to useEffect dep array
* Fix compare-mode folder sync, Escape key propagation, cancel toggle state
- Wire onFoldersChange through CompareContent/GeneralCompareContent so
compare-mode selectors also refresh local models after folder changes
- Add e.stopPropagation() on Escape key in folder input to prevent
Radix Popover from closing the entire model selector dropdown
- Add e.preventDefault() on Enter key to prevent form submission
- Clear folderInput and folderError when cancel toggle hides the input,
matching the Escape key behavior for consistency
* Fix folder mutation state ordering and touch accessibility
- Use optimistic updates for add/remove so the folder list reflects
changes immediately instead of waiting on a second listScanFolders
round-trip that could silently fail.
- Move refreshScanFolders out of the finally block in handleRemoveFolder
so it runs after the cache update, not after onFoldersChange.
- Make the remove button visible on touch/mobile devices and reachable
via keyboard focus (opacity-100 on small screens, focus-visible).
- Add aria-label to the remove button for screen readers.
* Deduplicate optimistic folder add to match backend behavior
The backend returns the existing ScanFolderInfo row when adding a
path that is already registered. The optimistic update was blindly
appending the returned row, producing duplicate entries and React
key warnings. Now checks by id before appending.
* Add aria-label to folder toggle button and strengthen dedup check
- Add aria-label to the +/cancel icon button for screen readers.
- Extend optimistic dedup check to also compare by path, not just id,
to handle edge cases where the cache is stale.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* style(windows): clean installer/setup log output and remove seeded credential banner
* Keep startup credential hint without exposing plaintext password
Print the username and .bootstrap_password file path on first-run
admin creation instead of the raw password. Headless / Docker / SSH
operators still get a startup-time hint for initial sign-in, and the
plaintext credential no longer appears in terminal output or logs.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* feat: add scan_folders table and CRUD functions to studio_db
* feat: add scan folders API endpoints and integrate into model scan
* feat: add scan folders API client and update source types
* feat: add custom source to model filters and selector
* feat: add Model Folders section to chat settings sidebar
* style: fix biome formatting in ModelFoldersSection
* fix: address review findings for custom scan folders
empty string bypass, concurrent delete crash guard,
Windows case normalization, response_model on endpoints,
logging, deduplicated filter/map, module level cache for
custom folder models, consistent source labels, handleRemove
error surfacing, per folder scan cap
* fix: show custom folders section regardless of chatOnly mode
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* refactor: extract shared refreshLocalModelsList in pickers
* Harden custom scan folder validation and scanning
- Validate path exists, is a directory, and is readable before persisting
- Apply per-folder model cap during traversal instead of after (avoids
scanning millions of inodes in large directories)
- Wrap per-folder scan in try/except so one unreadable folder does not
break the entire /api/models/local endpoint for all callers
- Normalize case on Windows before storing so C:\Models and c:\models
dedup correctly
- Extend macOS denylist to cover /private/etc and /private/tmp (realpath
resolves /etc -> /private/etc, bypassing the original denylist)
- Add /boot and /run to Linux denylist
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Improve scan robustness and preserve Windows path casing
- Preserve original Windows path casing in DB instead of lowercasing
(normcase used only for dedup comparison, not storage)
- Catch PermissionError per child directory so one unreadable subdirectory
does not skip the entire custom folder scan
- Wrap list_scan_folders() DB call in try/except so a DB issue does not
break the entire /api/models/local endpoint
* fix: scan custom folders for both flat and HF cache layouts
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix Windows case-insensitive path dedup with COLLATE NOCASE
Use COLLATE NOCASE on the scan_folders.path column so that the UNIQUE
constraint correctly deduplicates C:\Models and c:\models on Windows
without lowercasing the stored path. Also use COLLATE NOCASE in the
pre-insert lookup query on Windows to catch existing rows with
different casing.
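The effect of COLLATE NOCASE on the UNIQUE constraint can be reproduced with an in-memory SQLite database. This is a sketch only: the column list is an assumption and the real scan_folders schema may have more fields. Note that SQLite's built-in NOCASE folds ASCII letters only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scan_folders ("
    " id INTEGER PRIMARY KEY,"
    " path TEXT NOT NULL COLLATE NOCASE UNIQUE)"
)
conn.execute("INSERT INTO scan_folders (path) VALUES (?)", (r"C:\Models",))

# A second insert that differs only in casing violates the UNIQUE constraint.
try:
    conn.execute("INSERT INTO scan_folders (path) VALUES (?)", (r"c:\models",))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Lookups are case-insensitive, but the stored casing is preserved.
stored = conn.execute(
    "SELECT path FROM scan_folders WHERE path = ?", (r"c:\MODELS",)
).fetchone()[0]
```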
* Restore early-exit limit in _scan_models_dir for custom folders
Keep the limit parameter so _scan_models_dir stops iterating once
enough models are found, avoiding unbounded traversal of large
directories. The post-traversal slice is still applied after combining
with _scan_hf_cache results.
* feat: scan custom folders with LM Studio layout too
* Fix custom folder models being hidden by dedup
Custom folder entries were appended after HF cache and models_dir
entries. The dedup loop kept the first occurrence of each model id,
so custom models with the same id as an existing HF cache entry were
silently dropped -- they never appeared in the "Custom Folders" UI
section.
Use a separate dedup key for custom-source entries so they always
survive deduplication. This way a model can appear under both
"Downloaded" (from HF cache) and "Custom Folders" (from the
user-registered directory) at the same time.
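A sketch of the separate dedup key, assuming each entry is a dict with hypothetical `id` and `source` fields; the real entry type and source labels may differ.

```python
def dedup_models(entries):
    """Keep the first entry per key; custom-source entries get their own key."""
    seen = set()
    kept = []
    for entry in entries:
        # Custom entries are keyed separately, so they never collide with
        # HF cache or models_dir entries that share the same model id.
        key = (entry["source"] == "custom", entry["id"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(entry)
    return kept


entries = [
    {"id": "unsloth/some-model", "source": "hf_cache"},
    {"id": "unsloth/some-model", "source": "models_dir"},
    {"id": "unsloth/some-model", "source": "custom"},
]
kept_sources = [e["source"] for e in dedup_models(entries)]
```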
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Harden LM Studio scan and fix COLLATE NOCASE on Linux
- Add per-child and per-publisher OSError handling in _scan_lmstudio_dir
so one unreadable subdirectory does not discard the entire custom
folder's results
- Only apply COLLATE NOCASE on the scan_folders schema on Windows where
paths are case-insensitive; keep default BINARY collation on Linux
and macOS where /Models and /models are distinct directories
* Use COLLATE NOCASE in post-IntegrityError fallback SELECT on Windows
The fallback SELECT after an IntegrityError race now uses the same
case-insensitive collation as the pre-insert check, so a concurrent
writer that stored the path with different casing does not cause a
false "Folder was concurrently removed" error.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Simplify tool-call dedup: drop hashlib, inline helpers
The duplicate tool-call detector only compares calls within a single
request from the same JSON parser, so dict key order is guaranteed
identical for identical calls (Python 3.7+ insertion-ordered dicts).
- Replace hashlib.md5(json.dumps(...)) with name + str(args)
- Inline _tool_call_key, _is_duplicate_call, _record_tool_call
since each was a one-liner used once
- Remove unused hashlib import
* Remove tool_calling_benchmark_results.md from repo
* Replace html2text with builtin HTML-to-Markdown converter
Drop the external html2text (GPL-3.0) dependency and its regex
fallback. Add _html_to_md.py (~190 lines, stdlib only) using
html.parser.HTMLParser that handles headings, links, bold/italic,
lists, tables, blockquotes, code blocks, and entity decoding.
Strips script/style/head tags entirely.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use json.dumps(sort_keys=True) for tool-call dedup key
str(dict) is sensitive to insertion order, so semantically identical
calls with different key ordering would bypass duplicate detection.
Switch to json.dumps with sort_keys=True for a canonical representation.
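The difference between the two keys is easy to demonstrate. The function names below are illustrative and are not the helpers used in llama_cpp.py.

```python
import json

# Two semantically identical argument dicts with different key order.
args_a = json.loads('{"query": "unsloth", "max_results": 5}')
args_b = json.loads('{"max_results": 5, "query": "unsloth"}')


def key_from_str(name, arguments):
    return name + str(arguments)  # depends on insertion order


def key_canonical(name, arguments):
    return name + json.dumps(arguments, sort_keys=True)  # order-independent


same_dict = args_a == args_b
str_keys_match = key_from_str("web_search", args_a) == key_from_str("web_search", args_b)
canonical_keys_match = key_canonical("web_search", args_a) == key_canonical("web_search", args_b)
```

The dicts compare equal, but only the canonical key treats the two calls as duplicates.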
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Revert dedup key to str(arguments)
json.dumps(sort_keys=True) is unnecessary here -- the arguments dict
always comes from the same JSON parser within a single request, so
key insertion order is deterministic (Python 3.7+). str() is faster
and sufficient for consecutive-call dedup.
* Address review comments on _html_to_md.py
- Remove "hr" from _BLOCK_TAGS so the dedicated hr handler is reachable
- Prefix all newlines with ">" inside blockquotes (multi-line support)
- Emit full ![alt](src) for images instead of alt text only
- Replace newlines with spaces inside table cells
- Track header cells per-row (_row_has_th) instead of last-cell-only
- Strip trailing tabs in addition to spaces in cleanup regex
* Fix blockquote rendering, truncated-HTML buffer flush, and dedup key canonicalization
_html_to_md.py:
- Rewrite blockquote handling with stack-based buffer approach so nested
blockquotes, pre blocks inside blockquotes, and multi-paragraph quotes
all render correctly with proper "> " prefix on every line.
- Add flush_pending() to recover content from truncated HTML where closing
tags are missing (common when _fetch_page_text caps the download size).
Flushes open <a>, <td>, <pre>, and blockquote buffers.
- Skip <img> tags to match prior html2text ignore_images=True behavior
and avoid data-URI amplification consuming the output budget.
- Collapse all whitespace (including newlines) in non-pre content per
standard HTML whitespace rules: \s+ -> single space.
- Escape pipe characters in table cell content to prevent column breakage.
- Emit separator row after the first row for tables without <th> headers.
- Guard against IndexError on _ol_counter for orphan <li> elements.
- Normalize CRLF line endings before parsing.
llama_cpp.py:
- Restore canonical dedup key with json.dumps(sort_keys=True) so that
semantically identical tool calls with different JSON key order are
correctly detected as duplicates.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix table optional end tags, inline code whitespace, and link text normalization
_html_to_md.py:
- Extract _finish_cell() and _finish_row() helpers to handle HTML tables
that omit optional </td>, </th>, or </tr> end tags. This is valid HTML
and common on real web pages -- previously the parser would silently
drop earlier cells and entire rows.
- Call _finish_cell()/_finish_row() from handle_starttag for <tr>/<td>/<th>,
handle_endtag for </tr>/</td>/</th>/</table>, and flush_pending() so all
three paths (normal close, implicit close, truncated HTML) use the same
row-finalization logic including header separator emission.
- Add _in_inline_code flag so handle_data() preserves literal whitespace
inside <code> spans instead of collapsing it. A span such as
<code>pip install unsloth</code> written with runs of several spaces
now keeps those spaces in the rendered inline code rather than
collapsing them to `pip install unsloth`.
- Extract _finish_link() helper that normalizes accumulated link text with
\s+ -> single space before building the Markdown link. Prevents block-
level content inside <a> tags (e.g. <a><div>one</div><div>two</div></a>)
from producing multiline [one\n\ntwo](href) link labels.
- Empty blockquotes now produce no output instead of a stray ">".
- Remove unused _bq_depth field (all routing uses _bq_stack).
- Flush open cells and rows in handle_endtag("table") for robustness.
* Support <ol start=N>, <dl>/<dt>/<dd>, and preserve code block whitespace
_html_to_md.py:
- Honor <ol start="N"> attribute so ordered lists preserve their original
numbering instead of always restarting from 1. Important for docs/tutorials
that continue numbering across sections.
- Add dl, dt, dd to _BLOCK_TAGS so definition lists (common on MDN, Python
docs, Django docs) produce separated text instead of concatenated blobs.
- Rewrite _cleanup() to be fence-aware: content inside fenced code blocks
is now preserved verbatim (intentional blank lines in <pre> content are
no longer collapsed). Outside code blocks, blank runs are limited to one
and trailing whitespace is stripped.
- Fix _prefix_blockquote() to strip trailing whitespace before collapsing
blank lines, preventing the "\n\n \n\n" pattern from sneaking through.
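A fence-aware cleanup pass along these lines might look like the sketch below. It is an assumption about the shape of _cleanup(), not its actual code; the function name `cleanup` and the fence detection rule are illustrative.

```python
FENCE = "`" * 3  # built at runtime to keep this example easy to embed


def cleanup(markdown):
    """Collapse blank runs and strip trailing whitespace outside code fences."""
    out = []
    in_fence = False
    blank_run = 0
    for line in markdown.split("\n"):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            blank_run = 0
            out.append(line.rstrip())
            continue
        if in_fence:
            out.append(line)  # fenced content is preserved verbatim
            continue
        line = line.rstrip(" \t")
        if line == "":
            blank_run += 1
            if blank_run > 1:
                continue  # keep at most one blank line in a row
        else:
            blank_run = 0
        out.append(line)
    return "\n".join(out)


source = "a  \n\n\n\nb\n" + FENCE + "\nx\n\n\n  y  \n" + FENCE + "\n\n\nc"
cleaned = cleanup(source)
```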
* Suppress whitespace-only text nodes between table structural elements
Indented HTML tables (nearly all real-world pages) produce whitespace
text nodes between <table>, <tr>, </tr> etc. that land in the output
as leading spaces before table rows, breaking Markdown table alignment.
Skip whitespace-only text nodes when inside a table but not inside a
cell, so indentation from source HTML does not leak into the output.
* Revert dedup key to str(arguments) with explanatory comment
json.dumps(sort_keys=True) is unnecessary overhead here: arguments
always comes from json.loads on model output within a single request,
so dict insertion order is deterministic in Python 3.7+. A repeated
call from the model produces the same JSON, which parses to the same
dict repr. str() avoids re-serialization on every tool call.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* studio: improve GGUF tool calling accuracy and reliability
- Add URL fetching to web_search tool so models can read full page
content instead of only getting search snippets. Uses html2text for
clean markdown conversion with regex fallback.
- Inject current date and behavioral guidance (URL fetch workflow,
no repeated queries, use code for data processing) into the
tool-use system prompt.
- Append error recovery nudge to tool results that indicate failure,
helping small models avoid looping on the same broken call.
- Strip leaked <tool_call> XML from assistant messages in conversation
history and from the outgoing SSE stream.
- Raise default max tool iterations from 10 to 25 across backend,
model schema, and frontend defaults.
- Increase _MAX_PAGE_CHARS from 4k to 16k so fetched pages contain
enough content for the model to extract useful information.
- Add "IMPORTANT: These are only short snippets" hint to search
results so models know to fetch full pages when needed.
Tested with Qwen3.5-4B-GGUF (UD-Q4_K_XL), 10 runs before/after:
- XML leaks in responses: 10/10 -> 0/10
- URL fetch usage: 0 -> 4/10 runs
- Runs producing actual correct answers: 0/10 -> 2/10
- Average tool calls per query: 5.5 -> 3.8 (more efficient)
- Average response time: 12.3s -> 9.8s
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add tool calling benchmark results across model sizes and quants
Tested 16 configurations (4 models x 2 quants x 2 KV cache types)
with 10 runs each on NVIDIA B200.
Best config: 27B UD-Q4_K_XL + bf16 KV -- 6/10 runs found all 4
correct songs, 0 XML leaks, 131s average response time.
* Add duplicate tool-call detection and final-answer synthesis
When the model repeats the exact same tool call (same name + arguments)
twice in a row, skip execution and return a redirect message telling it
to try a different approach. This prevents the 8x-repeated-query loops
observed on 27B and 35B models.
When the tool iteration cap (25) is reached, inject a "provide your
final answer now" message before the final streaming pass. This lets
the model synthesize a useful answer from everything it gathered
instead of being silently cut off.
Tested on Qwen3.5-27B UD-Q4_K_XL (10 runs):
- Repeated query runs: 4/10 -> 2/10
- Cap hits: 1/10 -> 0/10
- All 4/4 accuracy: 5/10 -> 7/10
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix CodeQL alert: handle whitespace in script/style closing tags
The regex fallback for HTML stripping did not match closing tags
with whitespace before the angle bracket (e.g. </script >).
Use \s* before > in both script and style patterns.
* Address reviewer findings: SSRF, timeout crash, XML regex, dedup
- SSRF: resolve hostname via getaddrinfo and reject private, loopback,
link-local, multicast, and reserved addresses before fetching
- Timeout: handle timeout=None (unlimited mode) in URL fetch path
by defaulting to 60s instead of crashing on min(None, 60)
- Download cap: read at most max_chars*4+1 bytes instead of the
full response body before truncating
- XML regex: match both <tool_call> and <function=...> markup in
the history/stream cleanup (inference.py)
- CodeQL: use [^>]* in closing script/style tags to handle any
whitespace or attributes before >
- Dedup: track whether each tool call failed so retries after
transient errors are allowed; only block consecutive identical
calls that both succeeded
- Final-answer synthesis: guard on max_tool_iterations > 0 so
callers who disable tools do not get a false "used all calls" turn
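The SSRF check in the first bullet can be sketched with the standard ipaddress module. The function names are hypothetical, and this sketch only validates addresses; it does not cover redirects or pinning the resolved IP.

```python
import ipaddress
import socket


def is_blocked_ip(ip_text):
    """True for private, loopback, link-local, multicast, or reserved IPs."""
    ip = ipaddress.ip_address(ip_text)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
    )


def host_is_safe(hostname, port=443):
    """Resolve the host and require every returned address to be public."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    # Drop any IPv6 zone id ("fe80::1%eth0") before parsing.
    addresses = {info[4][0].split("%")[0] for info in infos}
    return bool(addresses) and not any(is_blocked_ip(a) for a in addresses)
```

Checking every resolved address, rather than only the first, avoids a host that mixes public and private records slipping through.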
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix redirect SSRF, SSE streaming regression, dedup off-by-one
- SSRF redirect bypass: disable auto-redirect in urllib, manually
follow up to 5 hops with host validation at each step. Prevents
public URLs from redirecting to loopback/private targets.
- SSE streaming: track prev_text on the raw cumulative and strip
XML from the delta only, so completed tool_call tags do not cause
the cumulative to shrink and drop trailing real text.
- Dedup off-by-one: check the immediately previous call (window=1)
instead of requiring 2 matching history entries, so the second
identical successful call is blocked rather than the third.
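The window=1 rule, combined with the earlier "retries after failures are allowed" rule, can be sketched as a small guard object. The class name, key format, and method names are assumptions for illustration.

```python
import json


class ConsecutiveCallGuard:
    """Block a call only if the immediately previous identical call succeeded."""

    def __init__(self):
        self._last = None  # (key, succeeded) of the previous call

    @staticmethod
    def _key(name, arguments):
        return (name, json.dumps(arguments, sort_keys=True))

    def should_skip(self, name, arguments):
        return self._last == (self._key(name, arguments), True)

    def record(self, name, arguments, succeeded):
        self._last = (self._key(name, arguments), succeeded)


guard = ConsecutiveCallGuard()
first_skipped = guard.should_skip("web_search", {"query": "x"})
guard.record("web_search", {"query": "x"}, succeeded=True)
second_skipped = guard.should_skip("web_search", {"query": "x"})
other_skipped = guard.should_skip("web_search", {"query": "y"})
guard.record("web_search", {"query": "x"}, succeeded=False)
retry_skipped = guard.should_skip("web_search", {"query": "x"})
```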
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix redirect HTTPError handling and tighten error prefixes
- Redirect fix: urllib raises HTTPError (not a normal response) when
the redirect handler returns None. Catch HTTPError for 3xx codes
and extract the Location header from the exception object.
- Error prefixes: remove overly broad "No " prefix that matched
"No results found." (a valid empty-search outcome, not an error).
Replace with specific prefixes like "Blocked:", "No query provided",
"Failed to resolve". This ensures empty search results are correctly
classified as non-errors for duplicate-call tracking.
* Fix SSE cross-chunk XML leaks, cleanup review findings
- SSE streaming: sanitize the full cumulative text before diffing
against the previous sanitized snapshot, so XML tags that span
chunk boundaries are stripped correctly. The previous delta-based
approach leaked split tags.
- DRAINING fallback: use _strip_tool_markup() helper instead of a
manual regex that only handled <tool_call> but not <function=...>.
- Move hashlib import, _TOOL_XML_RE compile, and datetime import to
module level per style guide.
- Remove unused _hit_tool_cap variable.
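The "sanitize the cumulative text, then diff" approach can be sketched as below. This is a simplified assumption of the idea: it handles only <tool_call> markup (not <function=...>), and the helper names are hypothetical.

```python
import re

_OPEN = "<tool_call>"
# Strip complete blocks, plus a block that is still open at the end of the text.
_TOOL_XML_RE = re.compile(r"<tool_call>.*?(?:</tool_call>|\Z)", re.DOTALL)


def sanitize(text):
    clean = _TOOL_XML_RE.sub("", text)
    # Hold back a trailing fragment that could be the start of an opening tag,
    # so a tag split across chunks is never emitted. A stream that genuinely
    # ends in such a fragment would need a final flush, omitted here.
    for k in range(min(len(_OPEN) - 1, len(clean)), 0, -1):
        if clean.endswith(_OPEN[:k]):
            return clean[:-k]
    return clean


def visible_deltas(chunks):
    """Yield only the new sanitized text after each raw chunk arrives."""
    raw = ""
    prev_clean = ""
    for chunk in chunks:
        raw += chunk
        clean = sanitize(raw)
        if len(clean) > len(prev_clean) and clean.startswith(prev_clean):
            yield clean[len(prev_clean):]
            prev_clean = clean


chunks = ["Hello <tool_", 'call>{"name": "web_search"}</tool_', "call> world"]
deltas = list(visible_deltas(chunks))
```

Because the diff is taken between two sanitized snapshots, the tag that spans all three chunks never reaches the client.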
* Fix DNS rebinding, charset detection, HTTPError handling, dedup double-record
- DNS rebinding: resolve hostname once via getaddrinfo, pin the
returned IP, rewrite the URL to connect to the pinned IP with
a Host header. Each redirect hop re-resolves and re-validates.
Closes the TOCTOU window between validation and connection.
- Charset: use resp.headers.get_content_charset() instead of
hardcoding utf-8, so pages with other encodings decode correctly.
- HTTPError: return descriptive "HTTP {code} {reason}" instead of
re-raising into a generic "Search failed" message.
- Dedup: remove redundant _record_tool_call in the duplicate branch;
the single call at the end of the loop handles all cases.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix: auto-retry stalled HF downloads with HF_HUB_DISABLE_XET=1
The heartbeat thread now monitors the HF Hub cache directory for
file-size growth. If no bytes are written for 3 minutes, it sends a
"stall" message to the orchestrator, which kills the subprocess and
retries with HF_HUB_DISABLE_XET=1 (falling back from Xet to standard
HTTPS). If the retry also stalls, it errors out with a clear message.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: include transport type (xet/https) in heartbeat and stall log messages
Makes it clear in backend logs whether the download is using xet or
https transport, and which transport stalled, which helps with debugging.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: monitor HF Hub .tmp dir to avoid false stall detections
huggingface_hub downloads into .tmp/ before atomically moving to
blobs/. Without monitoring .tmp, a large shard actively downloading
for several minutes would show zero blob growth and trigger a false
stall.
* fix: scope HF cache size check to specific model being loaded
Instead of scanning every models--*/blobs directory (O(N) with cached
models), only check the specific model's blobs dir plus the global
.tmp dir. Much faster on systems with many cached models.
* Fix false stall detection on cached/local models and cleanup issues
- Only fire stall if download activity was observed (cache size changed
at least once). Previously, any model load taking >180s would trigger
a false stall, even for already-cached or local models where no
download is happening.
- Return -1 from _get_hf_cache_size on exception to distinguish
"unable to measure" from "genuinely zero bytes". Skip stall logic
when measurement fails.
- Add _shutdown_subprocess before raising on terminal stall path to
prevent leaking a stuck subprocess.
- Detect pre-existing HF_HUB_DISABLE_XET=1 in the parent environment
to avoid a redundant retry cycle when Xet is already disabled.
- Remove global .tmp directory scanning (not used by modern
huggingface_hub; in-progress downloads use .incomplete files in
blobs/ which are already captured by iterdir).
- Add f.is_file() guard in cache size calculation.
- Replace em dashes with ASCII dashes for Windows terminal compat.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Harden stall detection edge cases
- Guard -1 to valid value transition: when initial _get_hf_cache_size
returns -1 (error) and later recovers to a real value, do not count
that as download activity. Only set saw_download_activity when the
previous measurement was also valid (>= 0).
- Move os import to top-level in orchestrator.py instead of inline
import os as _os.
- Fix misleading comment about post-download protection.
* Use .incomplete files to detect active downloads for stall detection
Replace the saw_download_activity heuristic with direct .incomplete file
detection. huggingface_hub creates *.incomplete files in blobs/ during
active downloads and removes them on completion. This gives a reliable
signal for whether a download is actually in progress.
Benefits:
- Cached models: no .incomplete files -> no stall fired even after 180s
- Post-download init (quantization, GPU loading): .incomplete files gone
so stall timer resets, long init phases are not killed
- Pre-download hangs (XET handshake stall): .incomplete files are
created at download start, so zero-byte stalls are now detected
- No more false positives from -1 to valid measurement transitions
The _get_hf_download_state function now returns (total_bytes,
has_incomplete) tuple or None on error, replacing _get_hf_cache_size.
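A minimal sketch of the (total_bytes, has_incomplete) idea, assuming a single blobs directory is passed in. The function names, the `limit` parameter, and the stall rule are illustrative; the real _get_hf_download_state resolves cache directories from model names.

```python
import tempfile
from pathlib import Path


def get_download_state(blobs_dir):
    """Return (total_bytes, has_incomplete), or None if measuring fails."""
    try:
        total = 0
        has_incomplete = False
        for f in Path(blobs_dir).iterdir():
            if not f.is_file():
                continue
            total += f.stat().st_size
            if f.name.endswith(".incomplete"):
                has_incomplete = True
        return total, has_incomplete
    except OSError:
        return None


def should_fire_stall(state, idle_seconds, limit=180):
    """Fire only when a download is in progress and has made no progress."""
    if state is None:
        return False  # unable to measure: skip stall logic
    _total, has_incomplete = state
    return has_incomplete and idle_seconds >= limit


# Demo with a throwaway directory standing in for a blobs/ folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "abc123.incomplete").write_bytes(b"12345")
    (Path(tmp) / "def456").write_bytes(b"123")
    demo_state = get_download_state(tmp)
    missing_state = get_download_state(Path(tmp) / "does-not-exist")
```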
* Add debug logging to download state exception handler
Log the exception at debug level when _get_hf_download_state fails,
instead of silently returning None. Helps with troubleshooting cache
measurement issues.
* Watch both adapter and base model repos for LoRA stall detection
When loading a LoRA adapter, the actual download bottleneck is often
the base model, not the adapter itself. Update the heartbeat to watch
both mc.identifier and mc.base_model cache directories so stall
detection works for LoRA loads where the base model stalls on Xet.
Also update _get_hf_download_state to accept multiple model names and
skip names without "/" (local paths) since those do not have HF cache
directories.
* Fix model name filtering for official HF models without org prefix
Models like gpt2 and bert-base-uncased do not contain a slash but are
still valid HF Hub models with cache directories. Replace the "/" check
with a proper local-path detection that checks for path separators and
path-like prefixes instead.
Also fix the base_model watch list to not require "/" in the base model
name, so official models used as LoRA bases are also monitored.
* Fix local path detection that broke all org/model names on Linux
The os.path.sep check matched "/" in HF model IDs like "org/model" on
Linux, causing the stall detector to skip ALL standard HF models.
Replace with a check that only skips names starting with "/" (absolute
paths), "." (relative paths), "~" (home-relative), or containing "\"
(Windows paths). HF model IDs like "org/model" or "gpt2" pass through
correctly on all platforms.
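The final filtering rule is small enough to sketch directly. The helper names are hypothetical; only the prefix and backslash rules come from the description above.

```python
def is_local_path_name(name):
    """Absolute, relative, home-relative, or Windows-style paths."""
    return name.startswith(("/", ".", "~")) or "\\" in name


def names_to_watch(names):
    """Keep only names that can have an HF Hub cache directory."""
    return [n for n in names if n and not is_local_path_name(n)]


watched = names_to_watch(
    ["org/model", "gpt2", "/abs/model", "./rel/model", "~/model", "C:\\models\\m", None]
)
```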
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix(studio): change default weight_decay from 0.01 to 0.001
The default weight decay across Studio was 0.01 but should be 0.001.
Updated the default in all backend fallbacks, the Pydantic model, the
frontend config, and every YAML preset/model-default config.
* fix(studio): auto-set learning rate based on training method
Default LR should be 2e-4 for LoRA/QLoRA and 2e-5 for full fine-tuning.
Frontend: track whether the user has manually edited the LR field via a
_learningRateManuallySet flag (same pattern as trainOnCompletions).
When switching training method and the user has not touched the LR,
auto-set it to the appropriate default. Reset the flag on model load.
Backend: change trainer.py start_training default from 5e-5 to 2e-4,
update default.yaml fallback from 5e-5 to 2e-4, and fix
full_finetune.yaml from 0.0002 (2e-4) to 2e-5.
* refactor(studio): centralize weight_decay and learning rate defaults
Create studio/backend/core/training/constants.py as the single source of
truth for DEFAULT_WEIGHT_DECAY (0.001), DEFAULT_LEARNING_RATE (2e-4),
DEFAULT_LEARNING_RATE_FULL (2e-5), and DEFAULT_LEARNING_RATE_STR ("2e-4").
All backend modules (trainer.py, training.py, worker.py, models/training.py)
now import from constants.py instead of hardcoding values.
On the frontend, add LR_DEFAULT_LORA and LR_DEFAULT_FULL to
config/training.ts and use them in the store instead of magic numbers.
A comment cross-references the backend constants file.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix model-specific LR override, persist migration, and flag resets
- Preserve model-specific learning rates from YAML configs when the
async autoSelectTrainingMethod callback fires (fixes Qwen2.5-1.5B
getting 2e-4 instead of its configured 1e-5, etc.)
- Bump zustand persist version to 9 with migration so existing users
with weightDecay=0.01 get updated to 0.001
- Clear _learningRateManuallySet in reset() and applyConfigPatch()
for consistency with trainOnCompletions flag behavior
- Add DEFAULT_LEARNING_RATE_FULL_STR to constants.py
* Refine applyConfigPatch to only clear LR flag when patch includes LR
Only reset _learningRateManuallySet when the applied config patch
actually provides a learningRate value. This prevents unrelated config
patches from silently disarming the manual-edit guard, which would
cause a subsequent setTrainingMethod call to overwrite the user's
custom LR.
* Preserve model-specific LR when switching between qlora and lora
Only auto-switch the learning rate when the training category changes
(adapter <-> full fine-tuning). Switching between qlora and lora keeps
the current LR since both methods share the same learning rate range.
This preserves curated per-model defaults (e.g. 1e-5 for
Qwen2.5-1.5B-Instruct) when the user toggles between adapter methods.
* Remove constants.py, use YAML configs as the source of truth
The YAML config files (model-specific + default.yaml) are the intended
config layer for training defaults. The Python backend fallbacks now use
inline values that match the YAML configs, rather than importing from a
separate constants module. This keeps the config architecture simple:
YAML files are the single source of truth, and the inline Python
fallbacks are just safety nets that mirror them.
* fix(studio): preserve model-specific LR when switching training method
Stash YAML-provided learning rate and use it to restore the correct
value when switching between adapter and full fine-tune modes.
- qlora <-> lora no longer overwrites the model's LR
- full -> adapter restores the YAML LR instead of a hardcoded constant
- selecting a model while on full fine-tune uses LR_DEFAULT_FULL
instead of applying the YAML adapter LR
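The switching rule described in the commits above can be sketched in Python. The real logic lives in the TypeScript store; every name below is hypothetical, and falling back to 2e-4 when no YAML value was stashed is an assumption.

```python
LR_DEFAULT_LORA = 2e-4  # adapter default from the commit text
LR_DEFAULT_FULL = 2e-5  # full fine-tuning default from the commit text
ADAPTER_METHODS = {"lora", "qlora"}


def category(method: str) -> str:
    """Group training methods into 'adapter' or 'full'."""
    return "adapter" if method in ADAPTER_METHODS else "full"


def next_learning_rate(current_lr, old_method, new_method, yaml_lr, manually_set):
    """Learning rate to use after switching training method."""
    # A manual edit, or a switch inside one category (qlora <-> lora), keeps the LR.
    if manually_set or category(old_method) == category(new_method):
        return current_lr
    if category(new_method) == "full":
        return LR_DEFAULT_FULL
    # full -> adapter restores the stashed YAML value (assumed fallback: 2e-4).
    return yaml_lr if yaml_lr is not None else LR_DEFAULT_LORA
```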
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
* fix: throttle and cache HuggingFace modelInfo API calls
The frontend was firing 40 to 60 parallel modelInfo requests on app
startup with zero caching or deduplication, causing HF rate limits.
Adds a caching layer (hf-cache.ts) with TTL cache, inflight request
dedup, and a concurrency limiter. Also debounces the HF token input
so typing a token no longer re-fires all model searches per keystroke.
* fix: only fetch VRAM info for visible models in chat selector
* Fix cache key isolation and VRAM badge stability for PR #4696
- Cache key now includes a token fingerprint (last 8 chars) instead of a
boolean, so switching HF tokens gives separate cache entries instead of
serving stale data from the previous token.
- Extract token via credentials?.accessToken to match the @huggingface/hub
API surface.
- Extend CachedResult type with safetensors/tags fields so downstream
consumers no longer need unsafe `as` casts.
- Merge VRAM param map with previous state on scroll instead of replacing
it, preventing a brief flash of missing VRAM badges when new models
become visible.
* Fix VRAM badges missing for search-filtered recommended models
When a user types a search query, filteredRecommendedIds can include
models beyond the currently visible page. These models had no VRAM data
because useRecommendedModelVram only received visibleRecommendedIds.
Now we pass the union of visibleRecommendedIds and filteredRecommendedIds
to the VRAM hook, so recommended models surfaced by search also show
their VRAM badges. The hf-cache layer ensures no duplicate network calls.
* Apply biome formatting to hf-cache.ts and use-recommended-model-vram.ts
Auto-formatted with biome check --write to match project lint rules:
- Block statements for single-line if/for bodies
- Import sorting (type imports first)
- Consistent line wrapping
* Fix extractToken to handle both current and deprecated HF auth forms
The @huggingface/hub CredentialsParams type is a union:
- { accessToken: "hf_..." } (current preferred form)
- { credentials: { accessToken: "..." } } (deprecated form)
Previously only checked params.credentials?.accessToken (deprecated path).
Now checks both forms so the cache key is correct regardless of which
calling convention is used.
* Simplify extractToken, map merge, and set construction
- extractToken: remove type assertions, use direct property access with
truthiness checks for cleaner union type handling
- VRAM map merge: use Map spread constructor instead of manual for loop
- idsForVram: use Set spread construction for more concise dedup
* Add rationale comment for MAX_CONCURRENT=3 in hf-cache.ts
* Skip GGUF repos in VRAM fetch and pre-populate cache from listModels
Two changes to reduce redundant HF API calls:
1. Filter GGUF repos from idsForVram before passing to useRecommendedModelVram.
GGUF repos have no safetensors metadata and the render layer already shows
a static "GGUF" badge -- fetching modelInfo for them is a no-op that wastes
a semaphore slot and a network round-trip.
2. Add primeCacheFromListing() to hf-cache.ts and call it from listModels
yield sites in mergedModelIterator and priorityThenListingIterator.
listModels returns the same type (ModelEntry & Pick<ApiModelInfo, T>) as
modelInfo with the same additionalFields, so the data is interchangeable.
Priming only writes if the key is not already fresh, so it never overwrites
a recent modelInfo response.
This means models discovered via listModels are already in cache when
useRecommendedModelVram later calls cachedModelInfo for them, eliminating
duplicate network requests.
* Fix cache key mismatch: prime both token and anonymous slots
The VRAM hook calls cachedModelInfo without credentials (anonymous key),
but listModels results were primed only under the authenticated key.
For authenticated users the priming was a no-op -- cache miss every time.
Fix: prime both the token-specific slot and the anonymous slot when an
access token is present. Public model metadata (safetensors, tags) is
identical regardless of auth so this is safe.
Also add a defensive guard in primeCacheFromListing for empty name.
* Auto-prime anonymous cache slot from authenticated modelInfo fetches
When cachedModelInfo is called with a token, the result was only stored
under the token-specific key (e.g. model::abc12345). The VRAM hook
calls cachedModelInfo without credentials and reads the anonymous slot
(model::anon), causing a cache miss and duplicate fetch for every
priority model.
Now cachedModelInfo also writes to the anonymous slot on success when
a token is present. Public model metadata (safetensors, tags) is
identical regardless of auth, so this is safe and eliminates ~10
duplicate API calls on first page load.
* Guard anonymous cache priming against gated/private models
Only prime the anonymous cache slot for non-gated, non-private models.
Previously, authenticated modelInfo responses and listing results were
unconditionally copied into the anonymous slot, which could briefly
expose gated/private model metadata after clearing the HF token.
Now checks result.gated and result.private before writing the anon slot.
Public unsloth/ models (the common case) still benefit from the
optimization; gated models like meta-llama/* require a fresh fetch
per auth context.
* Extract primeFromListing helper to deduplicate cache priming logic
The cache priming pattern (prime token slot + conditionally prime anon
slot for non-gated models) was duplicated in three places. Extracted
into a single primeFromListing() function for maintainability.
* Export CachedResult type, add isStale helper, simplify primeFromListing
- Export CachedResult so consumers can use it directly instead of
the indirect Parameters<typeof ...> pattern.
- Extract isStale(key) helper to deduplicate the cache freshness
check that was repeated in primeCacheFromListing, cachedModelInfo,
and the anonymous-slot priming logic.
- Simplify primeFromListing to use CachedResult directly for both
the data parameter and the gated/private guard, eliminating the
double cast.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Revert to balanced for inference
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove unused for_inference parameter from get_device_map
Since inference and training both use "balanced" now, the for_inference
flag is dead code. Remove it from the function signature, the call site
in inference.py, and simplify the tests accordingly.
* Remove redundant TestDeviceMapForInference test class
TestGpuAutoSelection already covers the same multi-gpu and single-gpu
device_map assertions. The TestDeviceMapForInference class was left
over from when for_inference had distinct behavior.
* Remove redundant test_get_device_map_multi_gpu_uses_balanced
Its assertions ([0,1] -> balanced, [0] -> sequential) are already
covered by test_get_device_map_uses_explicit_gpu_selection.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix(studio): open tour ReadMore links in new tab
The quick tour "Read more" links navigate away from Studio instead of
opening in a separate tab. Add target="_blank" and rel="noopener
noreferrer" to the ReadMore component so external doc links open in a
new browser tab.
* fix(studio): only open external ReadMore links in new tab
Apply target="_blank" conditionally based on whether the href starts
with "http", so internal links still navigate in the same tab.
* Tighten external-link detection in ReadMore component
Use regex /^https?:\/\// instead of startsWith("http") so the check
requires the full protocol prefix and does not match non-URL strings
that happen to begin with "http".
* Hoist regex to module scope for ReadMore
Move EXTERNAL_URL_RE to top-level constant to satisfy the biome
useTopLevelRegex lint rule and avoid re-creating the RegExp on
every render.
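The difference between the prefix check and the anchored regex can be shown with a short Python sketch. The component itself is TypeScript; is_external is a hypothetical helper and the URLs are placeholders.

```python
import re

# Compiled once at module scope rather than on every call.
EXTERNAL_URL_RE = re.compile(r"^https?://")


def is_external(href: str) -> bool:
    """True only when href starts with a full http:// or https:// prefix."""
    return EXTERNAL_URL_RE.match(href) is not None
```

A string such as "httpfoo" passes a plain startswith("http") test but fails the regex, so it would be treated as an internal link.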
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* studio: gate multimodal incompatibility warning on settled model capabilities
* Also disable Start button during isCheckingVision fallback
When getModelConfig fails and the fallback checkVisionModel is still
in-flight, isLoadingModelDefaults clears before isCheckingVision does.
Without also gating on isCheckingVision the Start button briefly
re-enables with stale capability flags.
Add isCheckingVision to the disabled condition and show "Loading
model..." text while either flag is active.
* Show correct error message for audio dataset incompatibility
The incompatibility warning always said "switch to a vision model"
even when the actual issue was an audio dataset on a non-audio model.
Now shows an audio-specific message when the mismatch is audio.
* Extract isLoadingModel constant for clarity
Pull the combined model-loading condition into a single constant
reused by the settled check, the disabled prop, and the button label.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
The 180s wall-clock timeout would kill model loads on slow connections
even when the download was actively progressing. Now the worker sends
heartbeat status messages every 30s during loading, and the orchestrator
resets its 300s deadline on each one — so it only times out when the
subprocess goes truly silent.
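A minimal sketch of a deadline that resets on each heartbeat, assuming the orchestrator polls it between messages. The class name and the injectable clock are hypothetical and exist only to make the behavior testable.

```python
import time


class HeartbeatDeadline:
    """Deadline that is pushed forward each time a heartbeat arrives."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock
        self._deadline = clock() + timeout_s

    def beat(self) -> None:
        # Any sign of life from the worker restarts the countdown.
        self._deadline = self._clock() + self._timeout_s

    def expired(self) -> bool:
        # True only after a full timeout with no heartbeat.
        return self._clock() > self._deadline
```

With a 300s timeout and heartbeats every 30s, the deadline never fires while the worker keeps reporting, however long the download takes.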
* fix: skip download progress polling for exported GGUF models
* fix: revert isLocalGgufDir change — exported GGUFs are file paths, not dirs
* fix: set isDownloaded true for all adapters in LoraModelPicker
* fix(studio): replace unicode emoji in print() to avoid cp1252 crash on Windows
On Windows the default console encoding is cp1252 which cannot encode
Unicode emoji like U+2705 or U+26A0. Bare print() calls with these
characters cause a UnicodeEncodeError at runtime.
- run.py: replace emoji with ASCII status prefixes [OK] and [WARNING]
- format_conversion.py: remove duplicate print() that mirrors the
logger.info() call on the next line, and drop the emoji from the
log message since loggers handle encoding separately
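The failure can be reproduced on any platform by encoding the text directly. This is a small sketch; can_encode is a hypothetical helper, not part of run.py.

```python
def can_encode(text: str, encoding: str = "cp1252") -> bool:
    """Return True if text can be written to a stream using this encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False
```

Both U+2705 and U+26A0 fail under cp1252, while the ASCII prefixes [OK] and [WARNING] encode under any common console code page.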
* fix(studio): apply same emoji/print cleanup to parallel VLM conversion path
The parallel URL-based conversion logic has the same duplicate print()
with emoji that was fixed in the sequential path. Remove the bare
print() and drop the emoji from the logger.info() call.
* Treat install_python_stack.py failure as fatal in setup.ps1
On Linux/Mac, setup.sh runs under set -euo pipefail so a non-zero
exit from install_python_stack.py aborts the installer. On Windows,
setup.ps1 had no exit code check -- if the Python script crashed
(e.g. from the cp1252 UnicodeEncodeError), the installer silently
continued past the dependency loop and reported success. Studio
would then fail at launch with ModuleNotFoundError for structlog,
fastapi, and other deps that were never installed.
Capture $LASTEXITCODE and exit 1 if the dependency installer fails,
matching the error handling pattern already used for PyTorch install.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* fix: scope packages.find to prevent node_modules namespace scanning
The packages.find section had no include filter, so setuptools'
find_namespace_packages discovered all directories as potential Python
packages -- including the 6,557 directories inside
studio/frontend/node_modules/ after the frontend build step.
This caused the editable install overlay step to run 20,000+ glob
operations across 6,619 "packages", which on fast NVMe takes ~5s but
on slower disks can take 7+ minutes.
Adding an explicit include filter scopes discovery to only the packages
we actually ship (unsloth, unsloth_cli, studio, studio.backend), dropping
from 6,619 to 58 discovered packages and the editable build time from
5.4s to 1.2s.
Also removes the broken kernels/moe exclude (used "/" instead of "."
notation so it never matched) and adds a node_modules exclude as a
safety net.
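A sketch of why an include filter helps and why a slash-style exclude never matches, assuming patterns are compared against dotted package names with fnmatch-style globbing. The pattern lists and the dotted name unsloth.kernels.moe are hypothetical, not copied from pyproject.toml.

```python
from fnmatch import fnmatchcase

# Hypothetical include patterns covering only the shipped packages.
INCLUDE = ["unsloth*", "studio", "studio.backend*"]


def selected(package, include, exclude=()):
    """A package is kept if it matches an include and no exclude pattern."""
    if not any(fnmatchcase(package, p) for p in include):
        return False
    return not any(fnmatchcase(package, p) for p in exclude)
```

Because discovered names use "." as the separator, an exclude written as "unsloth/kernels/moe" leaves unsloth.kernels.moe selected, while the dotted form removes it.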
* fix: use precise node_modules exclude patterns
Use "*.node_modules" and "*.node_modules.*" instead of "*.node_modules*"
to avoid accidentally excluding valid packages that might contain
"node_modules" as a substring in their name.
* [WIP] balanced device map for studio
* gpus as a request parameter
* API for multi GPU stuff
* return multi gpu util in new API
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use balanced_low0 instead of balanced
* Fix device_map typo, UUID parsing crash, set() filter bug, and broken tests
- balanced_low0 -> balanced_low_0 (transformers/accelerate rejects the old string)
- get_parent_visible_gpu_ids() now handles UUID/MIG CUDA_VISIBLE_DEVICES
gracefully instead of crashing on int() parse
- _get_backend_visible_gpu_info() set() or None bug: empty set is falsy so
CUDA_VISIBLE_DEVICES=-1 would disable filtering and report all GPUs
- test_gpu_selection.py: add missing get_visible_gpu_utilization import and
add required job_id arg to start_training() calls
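The set() or None pitfall is easy to reproduce in isolation. The sketch below uses hypothetical function names and is not the code in _get_backend_visible_gpu_info.

```python
def allowed_buggy(visible_ids):
    # Bug pattern: an empty set is falsy, so `or None` turns
    # "no GPUs visible" into "no filter at all".
    return set(visible_ids) or None


def allowed_fixed(visible_ids):
    # Only a missing mask (None) disables filtering.
    return None if visible_ids is None else set(visible_ids)


def filter_gpus(all_gpus, allowed):
    """Return every GPU when allowed is None, else only the allowed ones."""
    if allowed is None:
        return list(all_gpus)
    return [g for g in all_gpus if g in allowed]
```

With an empty visible list, the buggy variant reports every GPU and the fixed variant reports none.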
* Smart GPU determinism using estimates
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* disallow gpu selection for gguf for now
* cleanup
* Slightly larger baseline
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Treat empty list as auto
* Verbose logging/debug
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Cleanup and revert unnecessary deletions
* Cleanup excessive logs and guard against disk/cpu offload
* auth for visibility API. cleanup redundant imports. Adjust QLoRA estimate
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* support for non cuda gpus
* Fix multi-GPU auto-selection memory accounting
The multi_gpu_factor was applied uniformly to all GPUs including the
first one, which unfairly penalizes single-GPU capacity when
transitioning to multi-GPU. This created a discontinuity where a model
that barely fits 1 GPU would suddenly require 2 GPUs because the first
GPU's free memory was discounted by 20%.
Now the first GPU keeps its full free memory, and only additional GPUs
have an overhead factor (0.85) applied to account for inter-GPU
communication and sharding overhead. This gives more accurate
auto-selection and avoids unnecessary multi-GPU for models that
comfortably fit on one device.
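The accounting can be sketched as follows, assuming GPUs are considered in a fixed order and memory is given in GB. Only the 0.85 factor comes from the commit text; the function names are hypothetical.

```python
ADDITIONAL_GPU_FACTOR = 0.85  # overhead factor named in the commit message


def usable_memory(free_gb_per_gpu):
    """Total usable memory: first GPU at full capacity, the rest discounted."""
    if not free_gb_per_gpu:
        return 0.0
    first, *rest = free_gb_per_gpu
    return first + ADDITIONAL_GPU_FACTOR * sum(rest)


def gpus_needed(required_gb, free_gb_per_gpu):
    """Smallest prefix of GPUs whose usable memory covers the requirement."""
    for count in range(1, len(free_gb_per_gpu) + 1):
        if usable_memory(free_gb_per_gpu[:count]) >= required_gb:
            return count
    return None  # does not fit even with every GPU
```

A model needing 23 GB on two 24 GB cards stays on one GPU, because the first card is no longer discounted.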
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add sandbox tests for multi-GPU selection logic
24 tests covering model size estimation, memory requirements, automatic
GPU selection, device map generation, GPU ID validation, and multi-GPU
overhead accounting. All tests use mocks so they run without GPUs on
Linux, macOS, and Windows.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix reviewer findings: 4bit inference estimate, fallback, GGUF gpu_ids, retry
1. 4-bit inference now uses reduced memory estimate (model_size/3 + buffer)
instead of the FP16 1.3x multiplier. This prevents over-sharding
quantized models across unnecessary GPUs.
2. When model size estimation fails, auto_select_gpu_ids now falls back to
all visible GPUs instead of returning None (which could default to
single-GPU loading for an unknown-size model).
3. GGUF inference route now treats gpu_ids=[] as auto-selection (same as
None) instead of rejecting it as an unsupported explicit request.
4. Training retry path for "could not get source code" now preserves the
gpu_ids parameter so the retry lands on the same GPUs.
5. Updated sandbox tests to cover the new 4-bit inference estimate branch.
* Remove accidentally added unsloth-zoo submodule
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix UUID/MIG visibility and update test expectations
1. nvidia.py: When CUDA_VISIBLE_DEVICES uses UUID/MIG tokens, the
visibility APIs now return "unresolved" with empty device lists instead
of exposing all physical GPUs. This prevents the UI from showing GPUs
that the backend process cannot actually use.
2. test_gpu_selection.py: Updated test expectations to match the new
multi-GPU overhead accounting (first GPU at full capacity, 0.85x for
additional GPUs) and 4-bit inference memory estimation formula.
All 60 tests now pass.
* Add CPU/disk offload guard to audio inference path
The audio model loading branch returned before the common
get_offloaded_device_map_entries() check, so audio models loaded with a
multi-GPU device_map that spilled layers to CPU/disk would be accepted
instead of rejected. Now audio loads also verify no modules are offloaded.
* Improve VRAM requirement estimates
* Replace balanced_low_0 with balanced
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* refine calculations for slightly easier nums
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* adjust estimates
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use nums instead of obj to avoid serialisation error
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Harden nvidia-smi parsing and fix fallback GPU list
1. nvidia.py: Wrap int() casts for GPU index and memory in try/except
so MIG slices, N/A values, or unexpected nvidia-smi output skip the
unparseable row instead of aborting the entire GPU list.
2. nvidia.py: Handle GPU names containing commas by using the last
field as memory instead of a fixed positional index.
3. hardware.py: fallback_all now uses gpu_candidates (GPUs with verified
VRAM data) instead of raw devices list, which could include GPUs
with null VRAM that were excluded from the ranking.
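A sketch of the hardened row parsing from points 1 and 2, assuming each row has the shape "index, name, memory". That row layout, the function name, and the sample values are assumptions, not taken from nvidia.py.

```python
def parse_gpu_rows(output: str):
    """Parse 'index, name, memory_mib' rows, skipping rows that do not parse."""
    gpus = []
    for line in output.splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 3:
            continue
        try:
            index = int(fields[0])
            # Memory is the last field, so commas inside the name are harmless.
            memory_mib = int(fields[-1])
        except ValueError:
            continue  # e.g. "N/A": skip this row, keep the others
        name = ", ".join(fields[1:-1])
        gpus.append({"index": index, "name": name, "memory_mib": memory_mib})
    return gpus
```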
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* cleanup
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* consolidate raise_if_offload
* Improve MoE support. Guard against nvidia-smi failures
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix shared-expert LoRA undercount, torch VRAM fallback, and apply_gpu_ids edge case
1. vram_estimation.py: compute_lora_params now includes shared experts
(n_shared_experts) alongside routed experts when computing MoE LoRA
adapter parameters. Previously only n_experts were counted, causing
the estimator to undercount adapter, optimizer, and gradient memory
for DeepSeek/GLM-style models with shared experts.
2. hardware.py: _torch_get_per_device_info now uses mem_get_info (which
reports system-wide VRAM usage) instead of memory_allocated (which
only reports this process's PyTorch allocations). This prevents
auto-selection from treating a GPU as mostly free when another
process is consuming VRAM. Falls back to memory_allocated when
mem_get_info is unavailable.
3. hardware.py: apply_gpu_ids([]) now returns early instead of setting
CUDA_VISIBLE_DEVICES="" which would disable CUDA entirely. Empty
list inherits the parent visibility, same as None.
4. hardware.py: Upgraded fallback_all GPU selection log from debug to
warning so operators are notified when the model likely will not fit
in available VRAM.
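Point 3 can be sketched like this. The function name and the env parameter are hypothetical; the parameter lets the example run without touching the real process environment.

```python
import os


def apply_gpu_ids_sketch(gpu_ids, env=None):
    """Set CUDA_VISIBLE_DEVICES only for a non-empty explicit selection."""
    env = os.environ if env is None else env
    if not gpu_ids:
        # None and [] both mean "inherit the parent's visibility".
        return
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
```

Writing an empty string to CUDA_VISIBLE_DEVICES would hide every device, which is why the empty list returns before the assignment.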
* Guard nvidia-smi subprocess calls against OSError and TimeoutExpired
get_visible_gpu_utilization and get_backend_visible_gpu_info now catch
OSError (nvidia-smi not found) and TimeoutExpired internally instead
of relying on callers to wrap every invocation. Returns the standard
available=False sentinel on failure so the torch-based fallback in
hardware.py can take over.
* Guard get_primary_gpu_utilization and reset GPU caches between tests
1. nvidia.py: get_primary_gpu_utilization now catches OSError and
TimeoutExpired internally, matching the pattern already used in
get_visible_gpu_utilization and get_backend_visible_gpu_info. All
three nvidia-smi callers are now self-contained.
2. test_gpu_selection.py: Added _GpuCacheResetMixin that resets the
module-level _physical_gpu_count and _visible_gpu_count caches in
tearDown. Applied to all test classes that exercise GPU selection,
device map, or visibility functions. This prevents stale cache
values from leaking between tests and causing flaky results on
machines with real GPUs.
* Fix nvidia-smi fallback regression and physical GPU count validation
1. hardware.py: get_gpu_utilization, get_visible_gpu_utilization, and
get_backend_visible_gpu_info now check result.get("available") before
returning the nvidia-smi result. When nvidia-smi is unavailable or
returns no data (e.g., containers without nvidia-smi, UUID/MIG masks),
the functions fall through to the torch-based fallback instead of
returning an empty result. This fixes a regression where the internal
exception handling in nvidia.py prevented the caller's except block
from triggering the fallback.
2. hardware.py: resolve_requested_gpu_ids now separates negative-ID
validation from physical upper-bound validation. The physical count
check is only enforced when it is plausibly a true physical count
(i.e., higher than the largest parent-visible ID), since
torch.cuda.device_count() under CUDA_VISIBLE_DEVICES returns the
visible count, not the physical total. The parent-visible-set check
remains authoritative in all cases. This prevents valid physical IDs
like [2, 3] from being rejected as "out of range" when nvidia-smi is
unavailable and CUDA_VISIBLE_DEVICES="2,3" makes torch report only
2 devices.
* Fix UUID/MIG torch fallback to enumerate devices by ordinal
When CUDA_VISIBLE_DEVICES uses UUID or MIG identifiers,
get_parent_visible_gpu_ids() returns [] because the tokens are
non-numeric. The torch fallback in get_visible_gpu_utilization() and
get_backend_visible_gpu_info() previously passed that empty list to
_torch_get_per_device_info(), getting nothing back.
Now both functions detect the empty-list case and fall back to
enumerating torch-visible ordinals (0..device_count-1) with
index_kind="relative". This means the UI and auto-selection still
see real device data in Kubernetes, MIG, and Slurm-style UUID
environments where nvidia-smi output cannot be mapped to physical
indices.
Updated test_uuid_parent_visibility to verify the new torch fallback
path returns available=True with relative ordinals.
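The fallback can be sketched as below, with the torch device count passed in as a plain integer so the example runs without a GPU. The function names and the "physical" label are assumptions; only index_kind="relative" appears in the commit text.

```python
def parse_visible_ids(mask):
    """Return numeric GPU IDs from a CUDA_VISIBLE_DEVICES-style string.

    Returns [] when any token is non-numeric (UUID or MIG identifiers).
    """
    tokens = [t.strip() for t in (mask or "").split(",") if t.strip()]
    try:
        return [int(t) for t in tokens]
    except ValueError:
        return []


def devices_to_report(mask, torch_device_count):
    """Pick device indices to enumerate and how to label them."""
    ids = parse_visible_ids(mask)
    if ids:
        return ids, "physical"
    # Fallback: torch-visible ordinals 0..device_count-1.
    return list(range(torch_device_count)), "relative"
```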
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add type hint for gpu_ids parameter in InferenceOrchestrator.load_model
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Fixes #4670
Separates the GGUF context slider ceiling from the currently active context length so lowering context via Chat Settings no longer locks the slider max to the reduced value.
- Backend: adds `max_context_length` to GGUF load/status responses, computed from the largest VRAM/KV-fit cap across all usable GPU subsets
- Frontend: stores `ggufMaxContextLength` and uses it for Context Length slider/input bounds; hydrates from both `/api/inference/load` and `/api/inference/status`
- Defaults UI ceiling to native context for CPU-only and fallback paths
- Seeds `effective_ctx` and `max_available_ctx` before GPU probing to prevent `UnboundLocalError` on probe failure
- Property fallback uses native `_context_length`, not effective `context_length`
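The seeding pattern that avoids the `UnboundLocalError` can be sketched as follows. The function name, the probe callable, the exception type, and the use of `min` to cap values at the native context are all assumptions made for illustration.

```python
def resolve_context(native_ctx: int, probe):
    """Return (effective_ctx, max_available_ctx), surviving a failed probe."""
    # Seed both values first so they are always bound, even if probe() raises.
    effective_ctx = native_ctx
    max_available_ctx = native_ctx
    try:
        cap = probe()
        max_available_ctx = min(native_ctx, cap)
        effective_ctx = min(effective_ctx, max_available_ctx)
    except RuntimeError:
        pass  # keep the seeded native values
    return effective_ctx, max_available_ctx
```

Without the two seed assignments, the return statement would reference names that were never bound whenever the probe fails.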
* refactor(studio): unify setup terminal output style and add verbose setup mode
* studio(windows): align setup.ps1 banner/steps with setup.sh (ANSI, verbose)
* studio(setup): revert nvcc path reordering to match main
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* studio(setup): restore fail-fast llama.cpp setup flow
* studio(banner): use IPv6 loopback URL when binding :: or ::1
* Fix IPv6 URL bracketing, try_quiet stderr, _step label clamp
- Bracket IPv6 display_host in external_url to produce clickable URLs
- Redirect try_quiet failure log to stderr instead of stdout
- Clamp _step label to column width to prevent negative padding
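IPv6 bracketing can be sketched with the standard ipaddress module. The function name display_url and the port numbers are illustrative, not taken from startup_banner.py.

```python
import ipaddress


def display_url(host: str, port: int, scheme: str = "http") -> str:
    """Build a URL, wrapping IPv6 literals in brackets."""
    try:
        is_v6 = ipaddress.ip_address(host).version == 6
    except ValueError:
        is_v6 = False  # hostnames such as "localhost"
    shown = f"[{host}]" if is_v6 else host
    return f"{scheme}://{shown}:{port}"
```

Without the brackets, the colons of the address run into the port separator and the URL is not clickable.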
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add sandbox integration tests for PR #4494 UX fixes
Simulation harness (tests/simulate_pr4494.py) creates an isolated uv
venv, copies the real source files into it, and runs subprocess tests
for all three fixes with visual before/after demos and edge cases.
Standalone bash test (tests/test_try_quiet.sh) validates try_quiet
stderr redirect across 8 scenarios including broken-version contrast.
39 integration tests total (14 IPv6 + 15 try_quiet + 10 _step), all
existing 75 unit tests still pass.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Truncate step() labels in setup.sh to match PS1 and Python
The %-15s printf format pads short labels but does not truncate long
ones. Change to %-15.15s so labels wider than 15 chars are clipped,
matching the PowerShell .Substring(0,15) and Python label[:15] logic.
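Python's %-formatting accepts the same width and precision syntax as printf, so the difference is easy to check. step_label is a hypothetical helper that mirrors the label[:15] logic.

```python
def step_label(label: str, width: int = 15) -> str:
    """Pad short labels and clip long ones to exactly `width` columns."""
    return f"{label[:width]:<{width}}"


LONG = "a very long step label"
padded_only = "%-15s" % LONG    # pads but never truncates
clipped = "%-15.15s" % LONG     # pads and truncates to 15 characters
```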
* Remove sandbox integration tests from PR
These test files are not part of the styling fix and should not
ship with this PR.
* Show error output on failure instead of suppressing it
- install_python_stack.py: restore _red for patch_package_file
warnings (was downgraded to _dim)
- setup.ps1: capture winget output and show on failure for CUDA,
Node, Python, and OpenSSL installs (was piped to Out-Null)
- setup.ps1: always show git pull failure warning, not just in
verbose mode
* Show winget error output for Git and CMake installs on failure
Same capture-and-print-on-failure pattern already used for
Node, Python, CUDA, and OpenSSL winget installs.
* fix: preserve stderr for _run_quiet error messages in setup.sh
The step() helper writes to stdout, but _run_quiet's error header
was originally sent to stderr (>&2). Without the redirect, callers
that separate stdout/stderr would miss the failure headline while
still seeing the log body on stderr. Add >&2 to both step calls
inside _run_quiet to match main's behavior.
* feat: add --verbose flag to setup and update commands
Wire UNSLOTH_VERBOSE=1 through _run_setup_script() so that
'unsloth studio update --verbose' (and the deprecated 'setup')
passes the flag to setup.sh / setup.ps1 / install_python_stack.py.
* fix(studio): honor verbose logging and keep llama.cpp failures non-blocking
* fix(studio): switch installer to 'studio update' and normalize Windows setup logs
* chore(studio): refine localhost tip and remove skip-base setup noise
* fix(studio): align Windows setup logs with Linux style and improve startup tips
* fix(studio): align Windows setup logs with Linux style
* refactor(windows-installer): align install/setup logs with Linux style and silence auto-launch output
* refactor(windows): align installer/setup output with Linux style and reduce default verbosity
* refactor(windows): match install.ps1 output style/colors to setup and quiet default logs
* fix(studio-banner): update personal-computer localhost tip
* fix(setup.sh): restore verbose llama.cpp build output while keeping default quiet mode
* fix(install.sh): align installer logging with setup style and restore POSIX-safe color output
* fix(install.sh): preserve installer reliability and launch visibility
Export verbose mode for child setup processes, harden install command handling under set -e, and keep first-run studio launch non-silent so users can always see URL and port fallback output.
* fix(windows installer): keep exit semantics and degrade status accurate
Use quiet command redirection that preserves native exit codes, keep startup output visible on first launch, and report limited install status when llama.cpp is unavailable.
* fix(setup.sh): improve log clarity and enforce GGUF degraded signaling
Restore clean default setup output, add verbose-only diagnostics, fail fast on Colab dependency install errors, and return non-zero when GGUF prerequisites or llama.cpp artifacts are unavailable.
* fix(installer): harden bash preflight and PowerShell GPU checks
Fail fast when bash is unavailable before invoking setup.sh, and replace remaining nvidia-smi pipeline checks with stream redirection patterns that preserve reliable native exit-code handling.
* fix(windows): keep verbose output visible while preserving exit codes
Ensure PowerShell wrapper helpers in install/update stream native command output to host without returning it as function output, so npm logs no longer corrupt exit-code checks in verbose mode.
* fix(windows): avoid sticky UNSLOTH_VERBOSE and gate studio update verbosity
* Fix degraded llama.cpp exit code, PS verbose stderr, banner URLs, npm verbose
- setup.sh: Do not exit non-zero when llama.cpp is unavailable; the footer
already reports the limitation, and install.sh runs under set -e so a
non-zero exit aborts the entire install including PATH/shortcuts/launch.
- setup.ps1: Remove $? check in Invoke-SetupCommand verbose path; PS 5.1
sets $? = $false when native commands write to stderr even with exit 0.
Merge stderr into stdout with 2>&1 and rely solely on $LASTEXITCODE.
- startup_banner.py: Show the actual bound address when Studio is bound to
a non-loopback interface instead of always showing 127.0.0.1/localhost.
- setup.sh: Use run_quiet_no_exit instead of run_quiet_no_exit_always for
npm install steps so --verbose correctly surfaces npm output.
* Fix install.ps1 verbose stderr, propagate UNSLOTH_VERBOSE, fix git clone verbose
- install.ps1: Apply same Invoke-InstallCommand fix as setup.ps1 -- merge
stderr into stdout with 2>&1 and drop the $? check that misclassifies
successful native commands on PS 5.1.
- install.ps1 + setup.ps1: Export UNSLOTH_VERBOSE=1 to the process env
when --verbose is passed so child processes like install_python_stack.py
also run in verbose mode.
- setup.sh: Use run_quiet_no_exit for git clone llama.cpp so --verbose
correctly surfaces clone diagnostics during source-build fallback.
* Surface prebuilt llama.cpp output in verbose mode, remove dead code, fix banner
- setup.sh: Use tee in verbose mode for prebuilt llama.cpp installer so
users can see download/validation progress while still capturing the log
for structured error reporting on failure.
- setup.ps1: Same fix for Windows -- use Tee-Object in verbose mode.
- setup.sh: Remove run_quiet_no_exit_always() which has no remaining callers.
- startup_banner.py: Avoid printing the same URL twice when Studio is
bound to a specific non-loopback address that matches the display host.
* Fix run_install_cmd exit code after failed if-statement
The previous pattern 'if "$@"; then return 0; fi; _rc=$?' always captured
$? = 0 because $? reflects the if-statement result, not the command's exit
code. Switch to '"$@" && return 0; _rc=$?' which preserves the actual
command exit code on failure. Applies to both verbose and quiet branches.
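The difference between the two patterns can be reproduced in isolation. The following is a minimal sketch, assuming a POSIX `sh` is on PATH; the shell function `run` and the Python helper `last_status` are hypothetical names for illustration and are not taken from install.sh.

```python
import subprocess

# Old pattern: after `fi`, $? is the status of the if-statement itself,
# which is 0 when the condition fails and there is no else branch.
BUGGY = 'run() { if "$@"; then return 0; fi; _rc=$?; return $_rc; }; run false; echo $?'

# New pattern: after a failed `cmd && return 0`, $? is cmd's own status.
FIXED = 'run() { "$@" && return 0; _rc=$?; return $_rc; }; run false; echo $?'


def last_status(script: str) -> int:
    """Run a shell snippet (hypothetical helper) and return the integer it echoed."""
    result = subprocess.run(
        ["sh", "-c", script], capture_output=True, text=True, check=True
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    print("old pattern reports:", last_status(BUGGY))
    print("new pattern reports:", last_status(FIXED))
```

Running both snippets against `false` shows the old pattern losing the failure status while the new one keeps it.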
* Fix _run_quiet exit code, double uv install, missing --local flag
- setup.sh: Fix _run_quiet verbose path that always captured exit code 0
due to $? resetting after if-then-fi with no else. Switch to the same
'"$@" && return 0; exit_code=$?' pattern used in install.sh.
- setup.sh: Consolidate the two uv install branches (verbose + quiet)
into a single attempt with conditional output. Previously, when verbose
mode was on and the install failed, a second silent attempt was made.
- install.ps1: Pass --local flag to 'unsloth studio update' when
$StudioLocalInstall is true. Without this, studio.py's update() command
overwrites STUDIO_LOCAL_INSTALL to "0", which could cause issues if
setup.ps1 or install_python_stack.py later checks that variable.
* Revert SKIP_STUDIO_BASE change for --no-torch, restore install banners
- Revert SKIP_STUDIO_BASE from 0 to 1 for --no-torch. install.sh already
installs unsloth+unsloth-zoo and no-torch-runtime.txt before calling
setup.sh, so letting install_python_stack.py redo it was redundant and
slowed down --no-torch installs for no benefit.
- Restore the "Unsloth Studio installed!" success banner and "starting
Unsloth Studio..." launch message so users get clear install completion
feedback before the server starts.
* Make llama.cpp build failure a hard error with proper cleanup
- setup.sh: Restore exit 1 when _LLAMA_CPP_DEGRADED is true. GGUF
inference requires a working llama.cpp build, so this should be a
hard failure, not a silent degradation.
- install.sh: Catch setup.sh's non-zero exit with '|| _SETUP_EXIT=$?'
instead of letting set -e abort immediately. This ensures PATH setup,
symlinks, and shortcuts still get created so the user can fix the
build deps and retry with 'unsloth studio update'. After post-install
steps, propagate the failure with a clear error message.
* Revert install.ps1 to 'studio setup' to preserve SKIP_STUDIO_BASE
'studio update' pops SKIP_STUDIO_BASE from the environment, which
defeats the fast-path version check added in PR #4667. When called
from install.ps1 (which already installed packages), SKIP_STUDIO_BASE=1
must survive into setup.ps1 so it skips the redundant PyPI check and
package reinstallation. 'studio setup' does not modify env vars.
* Remove deprecation message from 'studio setup' command
install.ps1 uses 'studio setup' (not 'studio update') to preserve
SKIP_STUDIO_BASE. The deprecation message was confusing during first
install since the user never typed the command.
* Fix stale env vars, scope degraded exit, generic error message for PR #4651
- install.ps1: Always set STUDIO_LOCAL_INSTALL and clear STUDIO_LOCAL_REPO
when not using --local, to prevent stale values from a previous --local
run in the same PowerShell session. Fix log messages to say 'setup' not
'update' since we call 'studio setup'.
- setup.sh: Only exit non-zero for degraded llama.cpp when called from the
installer (SKIP_STUDIO_BASE=1). Direct 'unsloth studio update' keeps
degraded installs successful since Studio is still usable for non-GGUF
workflows and the footer already reports the limitation.
- install.sh: Make the setup failure error message generic instead of
GGUF-specific, so unrelated failures (npm, Python deps) do not show
misleading cmake/git recovery advice.
* Show captured output on failure in quiet mode for PR #4651
Both Invoke-InstallCommand (install.ps1) and Invoke-SetupCommand
(setup.ps1) now capture command output in quiet mode and display it
in red when the command fails. This matches the behavior of
run_install_cmd in install.sh where failure output is surfaced even
in quiet mode, making cross-platform error debugging consistent.
* Match degraded llama.cpp exit on Windows, fix --local recovery hint for PR #4651
- setup.ps1: Exit non-zero for degraded llama.cpp when called from
install.ps1 (SKIP_STUDIO_BASE=1), matching setup.sh behavior. Direct
'unsloth studio update' keeps degraded installs successful.
- install.sh: Show 'unsloth studio update --local' in the recovery
message when the install was run with --local, so users retry with
the correct flag instead of losing local checkout context.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix: add PyPI version check to setup.ps1 for fast update path
Port the update-flow logic from setup.sh to setup.ps1 so that
`unsloth studio update` on Windows skips Python dependency reinstall
when the installed version already matches PyPI latest.
* fix: clear SKIP_STUDIO_BASE in update command
install.ps1 sets SKIP_STUDIO_BASE=1 which persists in the PowerShell
session. If the user runs `unsloth studio update` in the same terminal,
the env var causes the version check to be skipped. Clear it explicitly
in the update command.
* fix: harden version check and clear stale env vars in update flow
- Normalize $InstalledVer with Out-String + Trim() to avoid array/whitespace
comparison issues in PowerShell 5.1 (python output can be captured as
string[] instead of scalar string)
- Move Fast-Install --upgrade pip inside if (-not $SkipPythonDeps) so the
fast path avoids unnecessary network round-trips
- Clear STUDIO_LOCAL_REPO when --local is not passed to prevent a previous
--local session from leaking into a plain update
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Fix blank page on Windows due to broken .js MIME type in registry
* Update studio/backend/main.py
Apply the defensive suggestion from Gemini: make the MIME type override specific to Windows platforms.
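A platform-guarded override of this kind might look like the sketch below. The function name `register_js_mime` and the choice of "application/javascript" are assumptions for illustration; the actual code in studio/backend/main.py is not shown in this log.

```python
import mimetypes
import sys


def register_js_mime(platform: str = sys.platform) -> bool:
    """Force a JavaScript MIME type for .js, but only on Windows.

    Hypothetical helper: on Windows, mimetypes may read a broken .js
    mapping from the registry, so the override is applied only there.
    """
    if platform != "win32":
        return False
    mimetypes.add_type("application/javascript", ".js")
    return True


if __name__ == "__main__":
    register_js_mime()
    print(mimetypes.guess_type("app.js"))
```

Taking the platform as a parameter keeps the guard testable on any OS.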
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* feat(studio): add HF/local model selection UI for GGUF export
* fix(studio): fix selector ring clipping
* fix(studio): export page trust_remote_code control and label styling
* fix(studio): accept hf_token in load_checkpoint orchestrator method
The route was passing hf_token to load_checkpoint() but the method
didn't accept it, causing a TypeError on every /api/export/load-checkpoint
request.
* fix(studio): clear HF model selection when input is edited
Previously selectedSourceModel was only cleared when the input became
empty, so editing to a different repo ID after selecting a model would
silently keep the old selection.
---------
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
normalize_path() unconditionally converted Windows paths like
C:\Users\... to WSL format /mnt/c/Users/..., which breaks path
resolution on native Windows. This caused LM Studio GGUF models
to fail detection (detect_gguf_model returned None for the invalid
path), falling through to the Unsloth import path which requires
a GPU.
Now only performs the /mnt/ mapping when actually running under WSL.
On native Windows, drive letters are preserved and backslashes are
normalized to forward slashes.
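The intended behavior can be sketched as follows. This is an illustration only: the real normalize_path() signature is not shown here, and passing `running_under_wsl` as a parameter is an assumption made to keep the example self-contained.

```python
import re


def normalize_path(path: str, running_under_wsl: bool) -> str:
    """Normalize a Windows-style path; map to /mnt/<drive>/ only under WSL.

    Hypothetical sketch: WSL detection is supplied by the caller.
    """
    unified = path.replace("\\", "/")
    match = re.match(r"^([A-Za-z]):/(.*)$", unified)
    if match and running_under_wsl:
        drive, rest = match.groups()
        return f"/mnt/{drive.lower()}/{rest}"
    # Native Windows (or a non-drive path): keep the drive letter as-is.
    return unified


if __name__ == "__main__":
    print(normalize_path(r"C:\Users\me\model.gguf", running_under_wsl=False))
    print(normalize_path(r"C:\Users\me\model.gguf", running_under_wsl=True))
```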
* fix: default HF cache to standard platform path instead of legacy Unsloth cache
* feat: show LM Studio and local models in chat Fine-tuned tab
* feat: show LM Studio models in Hub models tab
* fix: fetch local models after auth refresh completes
* Revert "fix: fetch local models after auth refresh completes"
This reverts commit cfd61f0ac7.
* fix: increase llama-server health check timeout to 600s for large models
* feat: expandable GGUF variant picker for LM Studio local models
* fix: show GGUF variant label for locally loaded LM Studio models
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: show publisher name in LM Studio model labels
* fix: set model_id for loose GGUF files in LM Studio publisher dirs
* fix: show publisher prefix in Fine-tuned tab LM Studio models
* fix: only use model_id for lmstudio source models
* fix: only show LM Studio models in Hub tab on Mac/chat-only mode
* fix: respect XDG_CACHE_HOME, handle Windows paths in isLocalPath, refresh LM Studio on remount
- _setup_cache_env now reads XDG_CACHE_HOME (falls back to ~/.cache)
instead of hard-coding ~/.cache/huggingface. This follows the standard
HF cache resolution chain and respects distro/container overrides.
- isLocalPath in GgufVariantExpander uses a regex that covers Windows
drive letters (C:\, D:/), UNC paths (\\server\share), relative paths
(./, ../), and tilde (~/) -- not just startsWith("/").
- HubModelPicker.useEffect now calls listLocalModels() before the
alreadyCached early-return gate so LM Studio models are always
refreshed on remount. Also seeds useState from _lmStudioCache for
instant display on re-open.
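The first two points can be sketched in Python, although isLocalPath itself lives in the frontend. The regex and the names `is_local_path` and `default_hf_cache` below are assumptions for illustration, not the shipped code.

```python
import os
import re
from pathlib import Path

# Drive letters (C:\, D:/), UNC (\\server\share), ./ and ../, ~/, POSIX absolute.
_LOCAL_PATH_RE = re.compile(r"^(?:[A-Za-z]:[\\/]|\\\\|\.{1,2}[\\/]|~[\\/]|/)")


def is_local_path(value: str) -> bool:
    """Return True when the value looks like a filesystem path, not a repo ID."""
    return bool(_LOCAL_PATH_RE.match(value))


def default_hf_cache(env=None) -> Path:
    """Resolve the cache dir from XDG_CACHE_HOME, falling back to ~/.cache."""
    env = os.environ if env is None else env
    base = env.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(base) / "huggingface"


if __name__ == "__main__":
    print(is_local_path("unsloth/Qwen3-0.6B"), is_local_path("D:/models/a.gguf"))
    print(default_hf_cache({"XDG_CACHE_HOME": "/tmp/xdg"}))
```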
* fix: add comment explaining isLocalPath regex for Windows/cross-platform paths
* fix: prioritize unsloth publisher in LM Studio model list
* fix: scope unsloth-first sort to LM Studio models on all platforms
* fix: add missing _lmStudioCache module-level declaration
* fix: prioritize unsloth publisher before timestamp sort in LM Studio group
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Some models like unsloth/Qwen3-0.6B have no safetensors metadata
on Hugging Face, so the training model selector showed no parameter
size badge. The chat model picker already had extractParamLabel()
as a fallback that parses sizes like "0.6B" from the model name.
Add the same fallback to the training model selector and the
onboarding model selection step.
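A name-based fallback of this kind could be sketched as below. The real extractParamLabel() is a frontend helper whose implementation is not shown here; the regex and the Python name `extract_param_label` are assumptions.

```python
import re

# A number followed by B or M, not glued to surrounding letters or digits.
_PARAM_RE = re.compile(
    r"(?<![A-Za-z0-9.])(\d+(?:\.\d+)?)([BM])(?![A-Za-z0-9])", re.IGNORECASE
)


def extract_param_label(model_name: str):
    """Parse a size label such as '0.6B' from a model name, or return None."""
    match = _PARAM_RE.search(model_name)
    if not match:
        return None
    return f"{match.group(1)}{match.group(2).upper()}"


if __name__ == "__main__":
    print(extract_param_label("unsloth/Qwen3-0.6B"))
```

The lookbehind keeps version digits such as the "3" in "Qwen3" from being read as a size.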
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* Detect always-on reasoning models and show Think button as locked-on
Models with hardcoded <think>/</think> tags or reasoning_content in
their chat template (e.g. distilled reasoning models) always produce
thinking output regardless of any toggle. Previously these models
were not detected as reasoning-capable at all, so the Think button
was grayed out even though the model was actively reasoning.
Backend:
- Detect <think>/</think> and reasoning_content in GGUF chat templates
as a fallback when enable_thinking is not present
- Add reasoning_always_on flag to LoadResponse and InferenceStatusResponse
- Pass the flag through all GGUF load and status response paths
Frontend:
- Add reasoningAlwaysOn to the chat runtime store and API types
- When reasoning_always_on is true, show the Think button as lit
(active) but not clickable, with a tooltip explaining the model
always uses thinking
- Force reasoningEnabled=true when the model always reasons
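The backend detection order described above can be sketched as a small function. The name `detect_reasoning` and the tuple return shape are hypothetical; only the markers and the reasoning_always_on concept come from the commit text.

```python
def detect_reasoning(chat_template: str):
    """Return (supports_reasoning, reasoning_always_on) for a chat template.

    Hypothetical sketch: enable_thinking means a toggle exists; hardcoded
    think tags or reasoning_content are the always-on fallback.
    """
    if "enable_thinking" in chat_template:
        return True, False
    markers = ("<think>", "</think>", "reasoning_content")
    if any(marker in chat_template for marker in markers):
        return True, True
    return False, False


if __name__ == "__main__":
    print(detect_reasoning("{{ '<think>\\n' }}{{ message.content }}"))
```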
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use pointer-events-none instead of disabled for always-on Think button
The HTML disabled attribute was not fully blocking clicks on the Think
button for always-on reasoning models. Switch to pointer-events-none
CSS class which prevents all mouse interaction at the CSS level.
* Use a static span instead of disabled button for always-on Think
Replace the button element with a plain span when reasoning is
always on. This makes it physically impossible to toggle since
there is no clickable element at all, avoiding any CSS or
disabled-attribute edge cases.
* Simplify always-on Think button to stay lit and remain toggleable
Keep the Think button as a normal toggleable button but ensure it
shows as lit when reasoning_always_on is true. The model always
reasons regardless of the toggle state so there is no need to
block interaction.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Use --no-deps for ALL packages (unsloth, unsloth-zoo, and runtime deps)
since the current PyPI metadata for unsloth still declares torch as a
hard dependency. Runtime deps (typer, pydantic, safetensors,
transformers, etc.) are installed from no-torch-runtime.txt with
--no-deps to prevent transitive torch resolution from accelerate, peft,
trl, and sentence-transformers.
no-torch-runtime.txt now includes unsloth's own direct deps (typer,
pydantic, pyyaml, nest-asyncio) since --no-deps skips those too.
install.sh installs no-torch-runtime.txt directly (via helper function
_find_no_torch_runtime). install.ps1 does the same via
Find-NoTorchRuntimeFile. SKIP_STUDIO_BASE stays at 1 to avoid setup.sh
fast-path issues.
install_python_stack.py NO_TORCH branch does the same for unsloth
studio update, using package_name instead of hardcoded "unsloth".
* Fix inference failing for transformers 5.x models (trust_remote_code)
The training worker in core/training/worker.py auto-enables
trust_remote_code for unsloth/* models that need transformers 5.x
(e.g. NVIDIA-Nemotron-3-Nano-4B). The inference worker did not have
the same logic, so loading these models for chat would fail with
"No config file found" while training worked fine.
Add the same auto-detection to the inference worker so
trust_remote_code is set automatically when needed.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Studio shutdown button
* fix: add auth to shutdown endpoint and improve UX
- Add JWT auth (Depends(get_current_subject)) to POST /api/shutdown
- Use authFetch instead of bare fetch in shutdown dialog
- Only show beforeunload prompt when training is running
- Remove Ctrl+W/Cmd+W interception (browsers don't allow it)
- Store shutdown task on app.state to prevent GC
---------
Co-authored-by: Datta Nimmaturi <venkatadattasainimmaturi@gmail.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix: only kill studio-managed llama-server processes, not user's own servers
_kill_orphaned_servers() checked for "unsloth" anywhere in the process
cmdline, which matched the user's own llama-server when serving models
from unsloth/ HF repos (the model path in -m contains "unsloth"). This
caused the user's server to get SIGKILLed on Studio startup, destroying
their prompt cache and forcing full model re-loads.
Narrow the check to only match processes whose binary path lives under
~/.unsloth/llama.cpp/ (the Studio install directory).
* Address review: cover env var paths, move Path.home() inside try block
- Also check LLAMA_SERVER_PATH and UNSLOTH_LLAMA_CPP_PATH so orphans
from custom install locations are still cleaned up.
- Move studio_dirs construction inside the try/except so a Path.home()
failure (containers without HOME) does not crash the constructor.
* Address reviewer feedback: proper path ancestry, /proc/pid/exe, legacy paths
Changes based on 10-reviewer consensus:
- Use Path.is_relative_to() instead of substring matching to prevent
false positives on sibling paths like ~/.unsloth/llama.cpp-backup/.
- Use /proc/<pid>/exe (symlink to real binary) instead of parsing the
first cmdline token, which breaks on paths with spaces. Falls back
to cmdline parsing on non-Linux or when /proc is unavailable.
- Add legacy in-tree install paths (project_root/llama.cpp/ and
project_root/bin/) so orphans from older setup.sh are still cleaned.
- Treat LLAMA_SERVER_PATH as an exact binary match rather than widening
it to its parent directory, which could match unrelated servers in
shared locations like /usr/local/bin/.
- Keep everything inside the try/except so Path.home() failures in
containers do not crash the constructor.
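Why ancestry matching beats substring matching can be shown in a few lines. This is a sketch, assuming Python 3.9+ for Path.is_relative_to(); the function name `is_studio_managed` and the example paths are hypothetical.

```python
from pathlib import Path


def is_studio_managed(binary: Path, install_roots, exact_binaries=()) -> bool:
    """True if the binary is an exact match or lives under an install root."""
    if any(binary == Path(exact) for exact in exact_binaries):
        return True
    return any(binary.is_relative_to(Path(root)) for root in install_roots)


if __name__ == "__main__":
    root = "/home/u/.unsloth/llama.cpp"
    sibling = Path("/home/u/.unsloth/llama.cpp-backup/llama-server")
    # Substring matching would wrongly accept the sibling directory.
    print(root in str(sibling), is_studio_managed(sibling, [root]))
```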
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address review: add Linux platform guard and log cleanup errors
- Guard pgrep fallback with sys.platform check so it does not crash
on Windows/macOS when psutil is unavailable.
- Replace silent except-pass with logger.warning for observability.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
The [huggingfacenotorch] extras only exist in pyproject.toml but are
NOT published on PyPI, so uv pip install "unsloth[huggingfacenotorch]"
fails on fresh installs from the registry.
Fix: add studio/backend/requirements/no-torch-runtime.txt with the
runtime deps (safetensors, transformers, datasets, accelerate, etc.)
that mirror [huggingfacenotorch] from pyproject.toml. In no-torch mode:
1. install.sh/ps1 install unsloth + unsloth-zoo with --no-deps
2. SKIP_STUDIO_BASE=0 so install_python_stack.py's NO_TORCH branch runs
3. install_python_stack.py installs no-torch-runtime.txt
* Guard against late tool_calls after visible content, filter incomplete fragments
1. If visible content was already emitted (_last_emitted is non-empty)
when delta.tool_calls arrives, ignore the tool_calls instead of
reclassifying the turn as a tool call. llama-server never
interleaves content and tool_calls (they are mutually exclusive),
but this guard is defensive for other OpenAI-compatible backends.
2. Filter out incomplete structured tool_calls fragments before
execution. Entries with empty function.name (from truncation by
max_tokens, disconnect, or interruption) are skipped instead of
being passed to execute_tool().
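The second guard amounts to a filter over the assembled tool calls. A minimal sketch, assuming the OpenAI-style `{"function": {"name": ...}}` shape; the helper name `complete_tool_calls` is hypothetical.

```python
def complete_tool_calls(tool_calls):
    """Keep only entries whose function.name is a non-empty string."""
    kept = []
    for call in tool_calls:
        name = (call.get("function") or {}).get("name")
        if isinstance(name, str) and name.strip():
            kept.append(call)
    return kept


if __name__ == "__main__":
    calls = [
        {"id": "call_0", "function": {"name": "web_search", "arguments": "{}"}},
        {"id": "call_1", "function": {"name": "", "arguments": '{"q": "tru'}},
    ]
    print([c["id"] for c in complete_tool_calls(calls)])
```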
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix: account for KV cache in GGUF GPU fit check and auto-cap context length
The GPU fit check only compared GGUF file size against free VRAM,
ignoring KV cache memory. Models with large native context lengths
(e.g. Qwen3.5-9B at 262k) would pass the fit check since the GGUF
is only 5.6 GB, but the KV cache at 262k context needs ~40 GB at
f16. This caused llama-server to silently fall back to CPU inference.
Changes:
- Parse block_count, head_count_kv, head_count, and embedding_length
from GGUF metadata alongside context_length
- Add KV cache VRAM estimation based on architecture params and the
selected cache quantization type (f16, q8_0, q4_0, etc.)
- Auto-reduce context length to the maximum that fits in available
GPU VRAM when the native context would exceed it
- Include estimated KV cache size in the _select_gpus total so the
fit decision reflects actual runtime memory, not just file size
For the reported scenario (Qwen3.5-9B on RTX 3090 with 22415 MiB
free), context is auto-reduced from 262144 to ~63k with f16 KV cache,
keeping the model fully on GPU. With q4_0 KV cache quantization the
context can reach ~226k.
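The estimate can be sketched with the standard KV cache size formula. This is a simplification under stated assumptions: every layer is treated as a full-attention layer, the per-element sizes follow llama.cpp block layouts (q8_0 is 34 bytes and q4_0 is 18 bytes per 32 elements), and the function name and example architecture values are hypothetical, not Qwen3.5-9B's real parameters.

```python
# Bytes per cached element for a few cache types (assumed block layouts).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def estimate_kv_cache_bytes(
    n_ctx, block_count, head_count_kv, head_count, embedding_length, cache_type="f16"
):
    """Estimate KV cache size: K and V, per layer, per KV head, per token."""
    if n_ctx <= 0:
        return 0
    head_dim = embedding_length // head_count
    elements = 2 * block_count * head_count_kv * head_dim * n_ctx
    return int(elements * BYTES_PER_ELEMENT[cache_type])


if __name__ == "__main__":
    # Hypothetical architecture values, for illustration only.
    size = estimate_kv_cache_bytes(4096, 32, 8, 32, 4096, "f16")
    print(f"{size / 1024**2:.0f} MiB")
```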
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: resolve 6 bugs in KV cache VRAM estimation and add test harness
- Fix q8_0 BPE constant: 1.125 -> 34/32 (1.0625) to match llama.cpp block size
- Fix _fit_context_to_vram returning min_ctx when weights exceed budget
(should return requested_ctx unchanged, let --fit handle it)
- Fix binary search inflating below-2048 requests (lo=min_ctx=2048 > hi)
- Fix n_ctx=0 regressing to 4096 when metadata unavailable (preserve sentinel)
- Fix multi-GPU auto-cap using single-GPU budget instead of aggregate
- Fix _context_length being overwritten with capped effective value
Add tests/test_gguf_kv_vram.py: 43 cross-platform pytest tests covering
pure logic, integration (monkeypatched load_model), and real GGUF parsing.
Runs in an isolated uv venv with only pytest -- no GPU/torch/structlog needed.
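The edge cases fixed in the list above can be sketched as a small binary search. The signature below is an assumption (the real _fit_context_to_vram is not shown), and a linear per-token KV cost is used for simplicity; with a linear cost a division would also work, so the search is kept only to mirror the described approach.

```python
def fit_context_to_vram(
    requested_ctx, weights_bytes, budget_bytes, kv_bytes_per_token, min_ctx=2048
):
    """Return the largest context <= requested_ctx that fits the VRAM budget."""
    if requested_ctx <= 0 or weights_bytes >= budget_bytes:
        # Unknown context, or weights alone exceed the budget: leave unchanged.
        return requested_ctx
    if weights_bytes + requested_ctx * kv_bytes_per_token <= budget_bytes:
        return requested_ctx
    # Never start above the request, so small requests are not inflated.
    lo, hi = min(min_ctx, requested_ctx), requested_ctx
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if weights_bytes + mid * kv_bytes_per_token <= budget_bytes:
            lo = mid
        else:
            hi = mid - 1
    return lo


if __name__ == "__main__":
    print(fit_context_to_vram(262144, 6 * 1024**3, 22 * 1024**3, 160 * 1024))
```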
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: complete _effective_context_length lifecycle
- Initialize _effective_context_length in __init__ (prevents AttributeError)
- Reset _effective_context_length in unload_model (prevents stale values)
- Update context_length property to return effective (capped) value for
the UI/API, falling back to native _context_length if not set
* fix: multi-GPU selection tries smallest subset first
The previous approach summed all GPUs' memory to cap context, then
selected GPUs afterward. This was overly optimistic for heterogeneous
setups (e.g., 48 GiB + 4 GiB): the context was inflated by the tiny
GPU's contribution, then both GPUs were dragged in.
Now we try GPU subsets from smallest (1 GPU) to largest, capping
context for each. We pick the smallest subset where the model+KV
fits. This prefers single-GPU when possible (simpler, no tensor
split overhead) and avoids pulling in GPUs that barely help.
Add tests: test_multi_gpu_prefers_fewer_gpus,
test_multi_gpu_heterogeneous.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: prefer fewer GPUs over higher context in GPU selection
Multi-GPU inference is slower due to tensor-split overhead, so we
should prefer fewer GPUs with reduced context over more GPUs with
full context. Now the loop stops at the first GPU subset where the
model fits, rather than continuing to find subsets that allow higher
context. Only if the model can't fit on N GPUs do we try N+1.
This preserves the original behavior: use multi-GPU only when the
model doesn't fit on a single GPU.
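The "fewest GPUs first" rule can be sketched as follows. The function name, the dict input, and the `min_kv_bytes` parameter are assumptions for illustration; the real _select_gpus logic is not shown in this log.

```python
def select_fewest_gpus(free_bytes_by_gpu, weights_bytes, min_kv_bytes=0):
    """Return the smallest GPU set that fits the model, or None if none does."""
    needed = weights_bytes + min_kv_bytes
    # Largest GPUs first, so the first k entries are the best k-GPU subset.
    gpu_ids = sorted(free_bytes_by_gpu, key=free_bytes_by_gpu.get, reverse=True)
    for size in range(1, len(gpu_ids) + 1):
        subset = gpu_ids[:size]
        if sum(free_bytes_by_gpu[g] for g in subset) >= needed:
            return subset  # stop at the first size that fits
    return None


if __name__ == "__main__":
    gib = 1024**3
    print(select_fewest_gpus({0: 48 * gib, 1: 4 * gib}, 30 * gib))
```

In the heterogeneous 48 GiB + 4 GiB case the small GPU is only pulled in when the large one cannot hold the model alone.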
* fix: make _kill_orphaned_servers cross-platform via psutil
Replace pgrep + os.kill(SIGKILL) with psutil.process_iter() and
proc.kill(), which work on Linux, macOS, and Windows. Build an
allowlist of install roots matching _find_llama_server_binary so
only studio-managed servers are killed.
* fix: skip KV estimation loop when effective context is unknown
When n_ctx=0 and GGUF metadata lacks context_length, effective_ctx
stays 0. _estimate_kv_cache_bytes(0) returns 0, so a GPU could be
selected with no KV headroom. Guard the loop with effective_ctx > 0
to fall back to file-size-only GPU selection in this case.
* chore: temporarily remove test harness (will add back separately)
* refactor: deduplicate UINT32/UINT64 handling in GGUF parser
Replace duplicated if/elif chains for vtype 4 and 10 with a single
block using setattr. No behavioral change.
* fix: honor explicit n_ctx by using multi-GPU before capping
When the user explicitly sets n_ctx, try to fit the full requested
context using _select_gpus (which adds GPUs as needed). Only cap
context if it doesn't fit on any GPU combination.
When n_ctx=0 (auto/native context), keep the existing behavior:
prefer fewer GPUs with reduced context, since multi-GPU is slower
and the user didn't ask for a specific context length.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: context_length property returns native value for frontend slider
The frontend uses context_length as the slider max. Returning the
capped effective value prevented users from requesting higher context
on reload (e.g., after switching to q4_0 KV cache). Revert to
returning the native GGUF metadata value -- the backend auto-caps
at load time regardless.
* revert: context_length returns effective (capped) value
The UI slider should show what the server is actually running at,
not the theoretical maximum. Revert to returning the effective
context length.
* fix: raise minimum context floor from 2048 to 4096
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix ~1.2s TTFT penalty when tools are enabled in Studio
When users enable web search, Python execution, or terminal tools,
every message gets a ~1.2s delay before any text appears -- even when
the model does not call any tool. This happens because
generate_chat_completion_with_tools() does a non-streaming detection
pass (stream: False) first, waits for the complete response, then
checks for tool calls. For the ~90% of messages that don't trigger a
tool call, this blocking wait is entirely wasted.
Root cause: the detection pass payload uses stream: False, forcing
llama-server to generate the entire response before returning any
tokens.
Fix: replace the non-streaming detection pass with a streaming pass
(stream: True) and a speculative buffer state machine that detects
tool signals in the first 1-2 SSE chunks:
- BUFFERING: accumulate content tokens, check first chars for tool
signal prefixes (<tool_call>, <function=)
- STREAMING: no tool detected, yield tokens to caller immediately
- DRAINING: tool signal found, silently accumulate rest of stream
Three detection paths:
1. Structured delta.tool_calls -- detected instantly, transition to
DRAINING, accumulate fragments, assemble at stream end.
2. XML tool markup in content -- buffer holds up to 32 chars checking
for <tool_call> or <function= prefix, then transitions to DRAINING.
3. No tool signal -- first non-whitespace, non-XML char triggers
immediate transition to STREAMING (fast path, ~90% of requests).
Safety net: after any stream ends in STREAMING state, check accumulated
content for XML tool signals. Handles rare "content before tool call"
edge case.
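The content-only part of the state machine (detection paths 2 and 3) can be sketched as below. This is a simplified illustration, not the shipped implementation: it ignores structured delta.tool_calls, reasoning tokens, and the safety net, and the function name `classify_stream` and its return shape are hypothetical.

```python
from enum import Enum

TOOL_SIGNALS = ("<tool_call>", "<function=")
MAX_BUFFER = 32  # upper bound on speculatively held characters


class State(Enum):
    BUFFERING = 1
    STREAMING = 2
    DRAINING = 3


def classify_stream(chunks):
    """Return (final_state, emitted_text, drained_text) for content chunks."""
    state, buffer, emitted, drained = State.BUFFERING, "", [], []
    for chunk in chunks:
        if state is State.STREAMING:
            emitted.append(chunk)  # fast path: pass tokens straight through
        elif state is State.DRAINING:
            drained.append(chunk)  # tool call: accumulate silently
        else:
            buffer += chunk
            probe = buffer.lstrip()
            could_be_tool = any(sig.startswith(probe) for sig in TOOL_SIGNALS)
            if any(probe.startswith(sig) for sig in TOOL_SIGNALS):
                state = State.DRAINING
                drained.append(buffer)
            elif probe and (len(probe) >= MAX_BUFFER or not could_be_tool):
                state = State.STREAMING
                emitted.append(buffer)  # flush what was held back
    if state is State.BUFFERING and buffer:
        emitted.append(buffer)  # stream ended undecided: treat as plain text
    return state, "".join(emitted), "".join(drained)


if __name__ == "__main__":
    print(classify_stream(["Hel", "lo"]))
    print(classify_stream(["<tool", '_call>{"name": "web_search"}']))
```

Plain text leaves BUFFERING on its first non-whitespace characters, which is what removes the blocking wait for the common no-tool case.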
Additional supporting changes:
- Add headers parameter to _stream_with_retry for auth forwarding
- Share _strip_tool_markup and regex patterns between the detection
pass and the final streaming pass (removes duplication)
- Remove the iteration==0 non-streaming content shortcut (no longer
needed since all iterations stream directly)
- Keep the final streaming pass as fallback for max_tool_iterations
exhaustion
Benchmarked on Qwen3.5-4B Q4_K_XL:
- No tools: TTFT ~112ms (unchanged)
- Tools enabled, no call: TTFT ~112ms (was ~1207ms)
- Decode TPS: 226 (unchanged in all cases)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add unit tests for streaming tool detection state machine
16 tests covering every tool call parsing path:
- Plain text (no tool call) streaming
- Structured delta.tool_calls detection and fragment assembly
- XML <tool_call>JSON</tool_call> detection via buffer
- XML <function=name> tag detection via buffer
- Whitespace before tool XML
- Safety net (content then tool XML)
- Parallel multi-tool calls
- Reasoning token bypass (thinking models)
- Reasoning then tool call
- Empty response handling
- Buffer prefix timeout (HTML not mistaken for tool)
- Non-XML first char instant streaming
- False positive rejection (<tool_tip> vs <tool_call>)
- Arguments split across multiple chunks
- auto_heal_tool_calls=False respects the flag
- Metrics accumulation across tool iterations
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix reasoning-only BUFFERING, pre-tool content emission, and code duplication
Addresses review feedback on the streaming tool detection:
1. Reasoning tokens are no longer yielded during BUFFERING/DRAINING
states. The consumer in routes/inference.py tracks prev_text across
tool iterations without resetting it, so yielding reasoning during
a detection pass that resolves to a tool call would corrupt the
delta computation for subsequent iterations. Reasoning is now
silently accumulated during detection (matching the old non-streaming
behavior) and flushed together with content when the buffer resolves
to STREAMING.
2. Handle reasoning-only responses in the BUFFERING resolver. When a
thinking model emits only reasoning_content with no content tokens,
the stream ends while still in BUFFERING state. The resolver now
detects this case and yields reasoning as plain text (without
<think> wrapper), matching the final streaming pass behavior for
models like Qwen3 in always-think mode.
3. Replace duplicated re.sub calls for stripping tool markup with
the existing _strip_tool_markup(content_text, final=True) helper,
removing ~40 lines of redundant regex code.
4. Update tests: adjust reasoning test expectations to match the new
behavior (reasoning batched with content, not streamed individually
during BUFFERING). Add test_reasoning_only_no_content for the
reasoning-only edge case. 17/17 tests pass.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address remaining reviewer findings: late tool_call IDs and XML speculation
1. Late-arriving tool_calls.id: when a provider sends the real ID on a
later delta chunk (after the initial one with index and function
name), the accumulator now updates the ID instead of keeping the
synthetic "call_{idx}" placeholder. (P2, 2/10 reviewers)
2. XML speculation respects auto_heal_tool_calls: when auto_heal is
explicitly disabled, _TOOL_XML_SIGNALS is empty so the BUFFERING
state never speculatively holds content for XML prefix detection.
Content starting with literal "<tool_call>" or "<function=" text
flows straight through without delay. (P2, 1/10 reviewers)
Skipped: finish_reason="tool_calls" without delta.tool_calls fallback
(P1, 1/10 reviewers). llama-server always sends delta.tool_calls
fragments in streaming mode. A non-streaming fallback for this edge
case would add complexity for a scenario that does not occur in
practice with the supported backend.
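Illustrative sketch of the late-ID handling in item 1, not the actual accumulator. The function name and the dict layout are assumptions; only the "call_{idx}" placeholder and the idea of overwriting it when a real ID arrives come from the description above.

```python
from typing import Any, Dict, List


def accumulate_tool_call_deltas(deltas: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Merge streamed tool_call fragments into complete calls, keyed by index.

    Hypothetical helper: the fragment shape mirrors Chat Completions
    delta.tool_calls entries, but the real accumulator may differ.
    """
    calls: Dict[int, Dict[str, Any]] = {}
    for delta in deltas:
        idx = delta["index"]
        # Start with a synthetic placeholder ID until a real one shows up.
        call = calls.setdefault(idx, {"id": f"call_{idx}", "name": "", "arguments": ""})
        # The provider may send the real ID on any chunk, not just the first.
        if delta.get("id"):
            call["id"] = delta["id"]
        fn = delta.get("function") or {}
        if fn.get("name"):
            call["name"] = fn["name"]
        # Argument JSON arrives as string fragments that are concatenated.
        call["arguments"] += fn.get("arguments") or ""
    return [calls[i] for i in sorted(calls)]
```

A call whose ID never arrives keeps its placeholder, so downstream code always has something to key on.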
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Check request.is_disconnected() every 20 tokens instead of every token
The disconnect check is an async round-trip that adds overhead on every
loop iteration. Since the cancel watcher in llama_cpp.py already
handles connection teardown (closes the streaming response on cancel),
this route-layer check is a secondary safety net that does not need to
run on every single token.
Check every 20 tokens across all 4 streaming paths:
- gguf_tool_stream (tool-enabled GGUF)
- gguf_stream_chunks (standard GGUF)
- audio_input_generate (audio/whisper input)
- generic backend stream (non-GGUF fallback)
* Fix safety net, DRAINING metadata, and test import path
1. Safety net no longer retroactively executes tools after visible
content was already emitted to the user. Once _last_emitted is
non-empty, the stream is committed to normal content mode.
Retroactive tool execution after visible output would violate the
streaming contract and corrupt the route-layer cumulative delta
tracker (prev_text). The tool XML is still stripped by
_strip_tool_markup so the user sees clean content.
2. DRAINING false-positive path now merges accumulated metrics from
prior tool iterations instead of dropping them. Uses the same
merge formula as the STREAMING path.
3. Test import path fixed to use repo root instead of hardcoded
sibling directory. Works in clean checkouts and CI.
4. Renamed test_content_then_tool_xml_safety_net to
test_content_then_tool_xml_no_retroactive_execution to reflect
the corrected behavior.
17/17 tests pass.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Redact --api-key value from llama-server startup log
When UNSLOTH_DIRECT_STREAM=1, the generated bearer token was logged
verbatim in the startup command. Replace the secret with <redacted>
before logging.
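A minimal sketch of the redaction, assuming the startup command is held as an argument list. The helper name is hypothetical, and handling the --api-key=VALUE form is an extra assumption beyond what the commit describes.

```python
from typing import List


def redact_api_key(cmd: List[str], placeholder: str = "<redacted>") -> List[str]:
    """Return a copy of cmd that is safe to log (hypothetical helper)."""
    safe: List[str] = []
    redact_next = False
    for arg in cmd:
        if redact_next:
            # This argument is the secret that followed --api-key.
            safe.append(placeholder)
            redact_next = False
        elif arg == "--api-key":
            safe.append(arg)
            redact_next = True
        elif arg.startswith("--api-key="):
            # Assumed variant: flag and value joined with "=".
            safe.append("--api-key=" + placeholder)
        else:
            safe.append(arg)
    return safe
```

The original list is left untouched, so the real token still reaches the subprocess while only the logged copy is masked.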
* Remove test file temporarily
* Revert disconnect throttle, reset prev_text on tool_start, restore XML safety net
Addresses all P1 findings from reviewer round 3 (10 reviewers):
1. Revert disconnect check to every iteration (was every 20th).
All 10 reviewers flagged this as a correctness regression for
short streams and sparse tool event loops. The cancel watcher in
llama_cpp.py is the primary mechanism but the route-layer check
must remain per-iteration for completeness. [10/10]
2. Reset prev_text on tool_start in gguf_tool_stream. When a tool
cycle begins after visible content was already streamed, the
route-layer cumulative delta tracker (prev_text) must be reset
so the post-tool synthesis response is not truncated or dropped.
[9/10]
3. Remove the _last_emitted gate from the XML safety net. The gate
was added to prevent retroactive tool execution after visible
content, but with prev_text now reset on tool_start (item 2), the
root cause is fixed and the safety net can correctly handle
content-then-tool-XML responses (matching pre-PR behavior).
[8/10]
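Why the prev_text reset in item 2 matters can be sketched as follows. The event dicts and the function name are hypothetical; the sketch only assumes that the route layer receives cumulative text and slices off what it has already sent.

```python
from typing import Dict, Iterable, Iterator


def to_deltas(events: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Turn cumulative content snapshots into incremental deltas.

    Hypothetical event shape: {"type": "content", "text": ...} or
    {"type": "tool_start"}.
    """
    prev_text = ""
    for event in events:
        if event["type"] == "tool_start":
            # A new generation pass starts its cumulative text from scratch,
            # so the tracker must forget the pre-tool content.
            prev_text = ""
            continue
        if event["type"] != "content":
            continue
        text = event["text"]
        delta = text[len(prev_text):]
        prev_text = text
        if delta:
            yield delta
```

Without the reset, the first post-tool snapshot would be sliced at the length of the pre-tool text, which truncates or drops the start of the synthesis response.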
* Use None instead of {} for empty auth headers in TTS methods
* Include accumulated metrics in STREAMING metadata check
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* refactor(studio): unify setup terminal output style and add verbose setup mode
* studio(windows): align setup.ps1 banner/steps with setup.sh (ANSI, verbose)
* studio(setup): revert nvcc path reordering to match main
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* studio(setup): restore fail-fast llama.cpp setup flow
* studio(banner): use IPv6 loopback URL when binding :: or ::1
* Fix IPv6 URL bracketing, try_quiet stderr, _step label clamp
- Bracket IPv6 display_host in external_url to produce clickable URLs
- Redirect try_quiet failure log to stderr instead of stdout
- Clamp _step label to column width to prevent negative padding
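The bracketing rule can be sketched like this. The function name and signature are assumptions; the real code builds external_url from display_host and may differ in detail.

```python
import ipaddress


def display_url(host: str, port: int, scheme: str = "http") -> str:
    """Build a clickable URL, bracketing IPv6 literals (hypothetical helper)."""
    try:
        is_v6 = ipaddress.ip_address(host).version == 6
    except ValueError:
        # Not an IP literal at all, e.g. a hostname such as "localhost".
        is_v6 = False
    shown = f"[{host}]" if is_v6 else host
    return f"{scheme}://{shown}:{port}"
```

Without the brackets, "http://::1:8888" is ambiguous because the port separator is indistinguishable from the address colons.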
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add sandbox integration tests for PR #4494 UX fixes
Simulation harness (tests/simulate_pr4494.py) creates an isolated uv
venv, copies the real source files into it, and runs subprocess tests
for all three fixes with visual before/after demos and edge cases.
Standalone bash test (tests/test_try_quiet.sh) validates try_quiet
stderr redirect across 8 scenarios including broken-version contrast.
39 integration tests total (14 IPv6 + 15 try_quiet + 10 _step), all
existing 75 unit tests still pass.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Truncate step() labels in setup.sh to match PS1 and Python
The %-15s printf format pads short labels but does not truncate long
ones. Change to %-15.15s so labels wider than 15 chars are clipped,
matching the PowerShell .Substring(0,15) and Python label[:15] logic.
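Python's %-formatting accepts the same width and precision syntax as printf, so the difference between the two formats can be checked directly. The function names here are illustrative only.

```python
def pad_only(label: str) -> str:
    # Width 15, no precision: short labels are padded, long ones overflow.
    return "%-15s" % label


def pad_and_truncate(label: str) -> str:
    # Width 15 and precision 15: labels are padded AND clipped to 15 chars.
    return "%-15.15s" % label


def slice_then_pad(label: str) -> str:
    # Equivalent to the label[:15] logic mentioned above, plus padding.
    return label[:15].ljust(15)
```

For any input, pad_and_truncate and slice_then_pad return the same 15-character string, while pad_only grows with the label.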
* Remove sandbox integration tests from PR
These test files are not part of the styling fix and should not
ship with this PR.
* Show error output on failure instead of suppressing it
- install_python_stack.py: restore _red for patch_package_file
warnings (was downgraded to _dim)
- setup.ps1: capture winget output and show on failure for CUDA,
Node, Python, and OpenSSL installs (was piped to Out-Null)
- setup.ps1: always show git pull failure warning, not just in
verbose mode
* Show winget error output for Git and CMake installs on failure
Same capture-and-print-on-failure pattern already used for
Node, Python, CUDA, and OpenSSL winget installs.
* fix: preserve stderr for _run_quiet error messages in setup.sh
The step() helper writes to stdout, but _run_quiet's error header
was originally sent to stderr (>&2). Without the redirect, callers
that separate stdout/stderr would miss the failure headline while
still seeing the log body on stderr. Add >&2 to both step calls
inside _run_quiet to match main's behavior.
* feat: add --verbose flag to setup and update commands
Wire UNSLOTH_VERBOSE=1 through _run_setup_script() so that
'unsloth studio update --verbose' (and the deprecated 'setup')
passes the flag to setup.sh / setup.ps1 / install_python_stack.py.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Make Studio shortcuts launch in a visible terminal
Studio shortcuts (Desktop/Start Menu) previously launched the server as a
hidden background process. Closing the browser tab did not stop the server,
leaving users with no obvious way to shut it down. This change makes shortcuts
open a visible terminal window so users can see server output and close the
terminal to stop Studio.
Launcher changes (install.sh):
- Add TTY detection in the launcher's main section. When a TTY is present
(foreground mode), the launcher spawns a background browser-opener and then
exec's the studio process directly. This means closing the terminal sends
SIGHUP to studio, stopping it cleanly. When no TTY is present (background
mode, e.g. macOS .app or headless), the existing _spawn_terminal behavior
is preserved.
- Add _open_browser_when_ready helper that polls health on the specific
launch port and opens the browser once ready.
- Add WSL fallback in _open_browser: uses powershell.exe Start-Process or
cmd.exe /c start instead of unreliable xdg-open under WSL.
Linux .desktop shortcut:
- Change Terminal=false to Terminal=true so the desktop environment opens
the user's default terminal emulator for the launcher.
WSL support:
- Remove the early-return that skipped WSL entirely. WSL now gets the
launcher script and studio.conf written.
- Add WSL shortcut creation: generates Windows Desktop and Start Menu .lnk
files via a temp PowerShell script. Targets wt.exe (Windows Terminal) with
automatic fallback to wsl.exe. Uses WSL_DISTRO_NAME for multi-distro setups.
Windows launcher (install.ps1):
- Add Find-FreeLaunchPort function that mirrors the Unix _find_launch_port
logic, scanning Get-NetTCPConnection for busy ports and returning the first
free port in the configured range.
- Replace the hardcoded $basePort with the dynamic port result, with a
MessageBox error dialog if no free port is found.
* Fix review findings: lock race, WSL quoting, Windows port fallback
Foreground lock race (10/10 reviewers):
The foreground mode released the single-instance lock before exec,
allowing a second launcher to acquire the lock and race for the same
port during startup. Move lock release into the background subshell
so it only happens after the health check passes.
WSL shortcut quoting (10/10 reviewers):
WSL_DISTRO_NAME values with spaces (e.g. "Ubuntu Preview", "Fedora
Remix for WSL") were not quoted, causing the distro name to be split
across multiple arguments. Add double-quoting around the distro name
and launcher path in the generated shortcut arguments.
Windows port fallback (3/10 reviewers):
Find-FreeLaunchPort silently assumed no ports were listening when
Get-NetTCPConnection was unavailable, which could return 8888 even
when busy. Add a Test-PortBusy fallback that probes ports with
TcpListener when Get-NetTCPConnection fails. Also scope the
Get-NetTCPConnection query to only the port range we care about.
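The probe-by-binding fallback can be sketched in Python as an analogue of the PowerShell TcpListener approach. Function names and the range handling are assumptions; the sketch only shows the technique of treating a failed bind as "port busy".

```python
import socket
from typing import Optional


def port_busy(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if binding the port fails, i.e. something already holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return True
    return False


def find_free_port(base: int, span: int) -> Optional[int]:
    """Return the first free port in [base, base + span), or None."""
    # Cap at 65536 so the probe never tries an out-of-range port number.
    for port in range(base, min(base + span, 65536)):
        if not port_busy(port):
            return port
    return None
```

Returning None instead of guessing lets the caller surface an error, which matches the intent of not silently reporting a busy port as free.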
* Skip powershell.exe shortcut creation if wslpath fails
If wslpath -w fails (returns empty), do not attempt to pass a Linux-style
path to powershell.exe -- it would always fail. Only run powershell.exe
when we have a valid Windows path for the temp PS1 script.
* Remove dead code and fix background health poll target
- Remove unused _open_browser_when_ready function
- Background mode now polls only the specific _launch_port instead of
scanning all ports via _find_healthy_port, matching foreground behavior
- Add launcher test harness (22 unit + 19 integration tests)
* Fix port probe scope, lock ownership, and T4 test coverage
- Test-PortBusy: bind on Any instead of Loopback to match Studio's
0.0.0.0 bind scope (prevents false-free in fallback path)
- _release_lock: verify PID ownership before removing lock dir
(prevents a timed-out subshell from deleting another launcher's lock)
- T4 test: fail first curl call so the test actually exercises the
lock-contention wait path instead of short-circuiting via fast path
* Temporarily remove launcher test scripts
Tests will be re-added in a follow-up PR to keep this diff focused
on the launcher changes.
* Fix missing num_items_in_batch in unsloth_prediction_step
unsloth_prediction_step calls compute_loss without num_items_in_batch
during evaluation. This causes _unsloth_pre_compute_loss to see
num_items_in_batch=None, which triggers a spurious warning for every
model when gradient_accumulation_steps > 1:
"Unsloth: Not an error, but {model} does not accept num_items_in_batch.
Using gradient accumulation will be very slightly less accurate."
The standard transformers prediction_step computes num_items_in_batch
via _get_num_items_in_batch before passing it to compute_loss. This
patch does the same in unsloth_prediction_step.
Tested on Llama-3.2-1B-Instruct and Olmo-3-7B-Instruct with
gradient_accumulation_steps=3 and eval_steps=3. Warning is gone and
eval loss is computed correctly for both.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Guard _get_num_items_in_batch for older transformers versions
_get_num_items_in_batch was added in transformers 4.46. Wrap the call
in try/except so older versions fall back to num_items_in_batch=None,
which preserves the original behavior of not passing it.
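A sketch of the guard using stand-in classes rather than transformers itself. The wrapper name, the caught exception types, and the stub method signature are assumptions; the real Trainer method may take different arguments depending on the version.

```python
def num_items_or_none(trainer, batch_samples, device=None):
    """Call trainer._get_num_items_in_batch if present, else return None."""
    try:
        return trainer._get_num_items_in_batch(batch_samples, device)
    except (AttributeError, TypeError):
        # Older versions: method missing or signature mismatch.
        return None


class OldTrainerStub:
    """Stands in for a Trainer that lacks the helper."""


class NewTrainerStub:
    """Stands in for a Trainer that provides the helper (assumed signature)."""

    def _get_num_items_in_batch(self, batch_samples, device=None):
        return sum(len(sample["labels"]) for sample in batch_samples)
```

With the old stub the wrapper yields None, so compute_loss is called the same way it was before the fix.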
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix Gemma3N audio training stride assertion with non-reentrant checkpointing
Gemma3N audio conformer processes variable-length audio tensors
that cause stride mismatches in AOT autograd compiled backward
when non-reentrant gradient checkpointing is used. The error
manifests as:
AssertionError: expected size 2==2, stride 1928==1936 at dim=0
This happens because the audio conformer's conv/norm layers produce
tensors whose strides vary with audio clip duration, but AOT autograd
traces the backward graph assuming fixed strides from the first batch.
The notebook sets gradient_checkpointing_kwargs={"use_reentrant": False}
and TRL 0.27.0+ also forces this. Both override Unsloth's own
use_reentrant=True set during prepare_model_for_training.
Fix: intercept gradient_checkpointing_enable on Gemma3N models to
always force use_reentrant=True, regardless of what the notebook
or TRL passes.
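The interception can be sketched as a wrapper around the model's own method. This is an illustration with a stub model, not the actual patch; the wrapper name is hypothetical, and only the gradient_checkpointing_kwargs parameter name is taken from the text above.

```python
def force_reentrant_checkpointing(model):
    """Wrap gradient_checkpointing_enable so use_reentrant is always True."""
    original = model.gradient_checkpointing_enable

    def patched(gradient_checkpointing_kwargs=None, **kwargs):
        # Copy the caller's kwargs, then override whatever they requested.
        merged = dict(gradient_checkpointing_kwargs or {})
        merged["use_reentrant"] = True
        return original(gradient_checkpointing_kwargs=merged, **kwargs)

    model.gradient_checkpointing_enable = patched
    return model


class StubModel:
    """Minimal stand-in that records the kwargs it was given."""

    def __init__(self):
        self.seen = None

    def gradient_checkpointing_enable(self, gradient_checkpointing_kwargs=None):
        self.seen = gradient_checkpointing_kwargs
```

Because the override happens inside the wrapper, it applies no matter which caller (notebook, TRL, or user code) enables checkpointing later.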
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
The previous --no-deps approach skipped ALL dependencies, not just
torch. This left safetensors, transformers, datasets, accelerate, etc.
missing, causing PackageNotFoundError at runtime.
Fix: in no-torch mode, install unsloth[huggingfacenotorch] (which pulls
all runtime deps except torch), then install unsloth-zoo with --no-deps
(since zoo's published metadata still declares torch as a hard dep).
This gives a working no-torch environment with all non-torch packages.
Applied to all three installer files: install.sh, install.ps1, and
studio/install_python_stack.py.
* fix: install.sh Mac Intel compatibility + Studio no-torch support (#4621)
On Intel Macs (x86_64), PyTorch has no wheels for torch >= 2.3, so the
installer crashes. Even when torch is absent, Studio crashes on startup
because two files have bare top-level torch imports.
Studio's GGUF inference (llama.cpp) does not need PyTorch. Training and
HF-inference already isolate torch to subprocesses. Only 2 files in the
server startup chain had top-level torch imports preventing startup.
Changes:
- install.sh: detect architecture, default to Python 3.12 on Intel Mac,
skip torch install, add Python 3.13.8 guard for arm64, pass
UNSLOTH_NO_TORCH env var to setup.sh
- data_collators.py: remove unused `import torch` (no torch.* refs)
- chat_templates.py: lazy-import IterableDataset into function bodies
- install_python_stack.py: add IS_MACOS/NO_TORCH constants, skip
torch-dependent packages, skip overrides.txt, skip triton on macOS
No existing working flow changes. Linux/WSL and macOS arm64 behavior is
identical.
* tests: add test suite for Mac Intel compat + no-torch mode
Shell tests (test_mac_intel_compat.sh):
- version_ge edge cases (9 tests)
- Architecture detection for Darwin x86_64/arm64, Linux x86_64/aarch64
- get_torch_index_url returns cpu on simulated Darwin
- UNSLOTH_NO_TORCH propagation to both setup.sh branches
Python unit tests (test_no_torch_filtering.py):
- _filter_requirements with NO_TORCH_SKIP_PACKAGES
- NO_TORCH env var parsing (true/1/TRUE/false/0/unset)
- IS_MACOS constant check
- Overrides skip and triton macOS skip guards
Python import tests (test_studio_import_no_torch.py):
- data_collators.py loads in isolated no-torch venv
- chat_templates.py has no top-level torch imports
- Negative control confirms import torch fails without torch
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* tests: add E2E sandbox tests for Mac Intel no-torch mode
Replace static/synthetic test stubs with real sandbox tests:
- Shell: E2E uv venv creation at Python 3.12, mock uv shim to verify
torch install is skipped when MAC_INTEL=true, dynamic env propagation
test for UNSLOTH_NO_TORCH in both local and non-local install paths
- Python filtering: test real extras.txt and extras-no-deps.txt with
NO_TORCH_SKIP_PACKAGES, subprocess mock of install_python_stack() for
5 platform configs (NO_TORCH+macOS, Windows+NO_TORCH, normal Linux,
Windows-only, macOS-only), VCS URL and env marker edge cases
- Python imports: parametrized Python 3.12+3.13 venv fixture, dataclass
instantiation for all 3 collator classes, chat_templates.py exec with
stubs, negative controls proving import torch and torchao install fail
in no-torch venvs
91 total tests, all passing.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: address reviewer findings for Intel Mac no-torch mode
P1 fixes:
- Auto-infer NO_TORCH in install_python_stack.py via platform.machine()
so `unsloth studio update` preserves GGUF-only mode without needing
the UNSLOTH_NO_TORCH env var (6/10 reviewers)
- Add openai-whisper and transformers-cfg to NO_TORCH_SKIP_PACKAGES
since both have unconditional torch dependencies (4/10 reviewers)
- Skip unsloth-zoo on Intel Mac --local installs (depends on torch)
in both migrated and fresh install paths (1/10)
- Recreate stale 3.13 venvs as 3.12 on Intel Mac re-runs (1/10)
- Detect Apple Silicon under Rosetta via sysctl hw.optional.arm64
and warn user to use native arm64 terminal (1/10)
P2 fixes:
- Wire new test files into tests/run_all.sh (4/10 reviewers)
- Add update-path tests (skip_base=False) for Intel Mac
- Add _infer_no_torch tests for platform auto-detection
P3 fixes:
- Fix macOS progress bar total (triton step skipped but was counted)
- Fix temp file leak when Windows + NO_TORCH filters stack
All tests pass: 30 shell, 66 Python (96 total).
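The platform auto-detection in the first P1 fix might look roughly like this. The function name, the accepted truthy strings, and the rule that an explicit env value wins over detection are assumptions for illustration.

```python
import os
import platform

_TRUTHY = {"1", "true"}


def infer_no_torch(env=None, system=None, machine=None) -> bool:
    """Decide whether to run in GGUF-only (no-torch) mode.

    Parameters default to the live environment; they are injectable so the
    logic can be exercised without an Intel Mac.
    """
    env = os.environ if env is None else env
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine

    raw = env.get("UNSLOTH_NO_TORCH")
    if raw is not None:
        # Assumption: an explicit setting overrides platform detection.
        return raw.strip().lower() in _TRUTHY
    # Intel Mac: Darwin kernel on x86_64 hardware.
    return system == "Darwin" and machine == "x86_64"
```

Note that platform.machine() also reports x86_64 for a process running under Rosetta, which is why the list above adds a separate sysctl-based check.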
* feat: add --python override flag to install.sh
Lets users force a specific Python version, e.g. ./install.sh --python 3.12.
Addresses M2 Mac users whose systems resolve to a problematic 3.13.x patch.
When --python is set, the Intel Mac stale-venv guard and 3.13.8 auto-downgrade
are skipped so the user's choice is respected.
* tests: add comprehensive E2E sandbox tests for no-torch mode
Add test_e2e_no_torch_sandbox.py with 7 test groups (43 tests total)
covering the full no-torch import chain, edge cases, and install logic:
- Group 1: BEFORE vs AFTER import chain comparison (proves the bug
existed and the fix works by synthetically prepending top-level torch
imports)
- Group 2: Dataclass instantiation without torch
- Group 3: Edge cases with broken/fake torch modules on sys.path
- Group 4: Hardware detection fallback to CPU without torch
- Group 5: install.sh flag parsing, version resolution, arch detection
- Group 6: install_python_stack.py NO_TORCH filtering
- Group 7: Live server startup without torch (marked @server, skipped
when studio venv is unavailable)
All 43 tests pass on both Python 3.12 and 3.13 isolated venvs.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* feat: add --no-torch flag to install.sh/ps1, fix lazy import bug in dataset formatting
- Fix chat_templates.py: narrow torch IterableDataset import into inner
try/except ImportError so dataset.map() works without torch installed
- Fix format_conversion.py: same lazy import fix for convert_chatml_to_alpaca
and convert_alpaca_to_chatml
- Add --no-torch flag to install.sh with unified SKIP_TORCH variable
(driven by --no-torch flag OR MAC_INTEL auto-detection)
- Add --no-torch flag to install.ps1 with $SkipTorch variable
- Print CPU hint when no GPU detected and --no-torch not set
- Replace MAC_INTEL guards with SKIP_TORCH in torch install sections
- Update shell tests (40 pass) and Python tests (90 pass)
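The lazy-import pattern from the first two fixes can be sketched as below. The helper name is hypothetical; the point is that the torch import lives inside the function and its failure is treated as "not an IterableDataset".

```python
def is_torch_iterable_dataset(obj) -> bool:
    """Check for torch's IterableDataset without requiring torch at import time."""
    try:
        # Imported here, not at module top level, so the module still loads
        # in environments where torch is not installed.
        from torch.utils.data import IterableDataset
    except ImportError:
        return False
    return isinstance(obj, IterableDataset)
```

Keeping the try/except narrow (around the import only) avoids swallowing unrelated errors raised by the rest of the function.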
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: address reviewer findings for --no-torch installer paths
- Fix migrated-env branch in install.sh and install.ps1: check
SKIP_TORCH first, then branch on STUDIO_LOCAL_INSTALL. Previously
SKIP_TORCH+non-local fell into else and installed unsloth-zoo (which
depends on torch), defeating --no-torch mode.
- Fix $env:UNSLOTH_NO_TORCH leak in install.ps1: always set to "true"
or "false" instead of only setting on the true branch. Prevents stale
no-torch state from leaking across runs in the same PS session.
- Fix install_python_stack.py update path: add NO_TORCH guard around
base.txt install so unsloth studio update does not reinstall
unsloth-zoo (which depends on torch) in no-torch mode.
* fix: install unsloth + unsloth-zoo with --no-deps in no-torch mode
Instead of skipping unsloth-zoo entirely (which breaks unsloth's
dependency on it), install both packages with --no-deps so they are
present but torch is not pulled in transitively. Applied consistently
across all no-torch paths: migrated-env, fresh-local, fresh-non-local
in install.sh, install.ps1, and install_python_stack.py.
* chore: temporarily remove test files (will be added in a follow-up)
* refactor: deduplicate SKIP_TORCH conditional branches in installers
Collapse if/else blocks that differ only by --no-deps into a single
branch with a conditional flag variable. Applied to migrated-env and
fresh-local paths in install.sh, install.ps1, and install_python_stack.py.
* fix: apply --no-deps to fresh non-local --no-torch install path
The non-local else branch was missing $_no_deps_arg/$noDepsArg, so
uv pip install unsloth would resolve torch from PyPI metadata (the
published unsloth package still declares torch as a hard dep). Now
--no-deps is applied consistently to all SKIP_TORCH code paths.
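The "single branch with a conditional flag variable" shape can be sketched as follows. The function name is hypothetical and the command is only built, not executed.

```python
from typing import List


def build_install_cmd(packages: List[str], skip_torch: bool) -> List[str]:
    """Build one uv pip install command for both torch and no-torch modes."""
    # The flag list is empty in normal mode, so no if/else duplication is needed.
    no_deps = ["--no-deps"] if skip_torch else []
    return ["uv", "pip", "install", *no_deps, *packages]
```

Every install path then calls the same builder, which makes it harder for one branch to miss the flag the way the non-local path did.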
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
The inline querier's identity changed on every render, forcing useLiveQuery
to resubscribe continuously and causing CPU spikes. Store the querier in a
ref and resubscribe only when explicit deps change.
The ChatCompletionRequest Pydantic model defaulted repetition_penalty
to 1.1 when clients omitted the field. This silently forced
llama-server to perform per-token repetition scanning, dropping
streaming throughput from ~225 TPS to ~173 TPS (roughly a 23% penalty).
The Studio frontend always sends repetition_penalty=1.0 explicitly,
so UI users were unaffected. But any API client hitting
/v1/chat/completions without setting the field (curl, third-party
integrations, Open WebUI, etc.) would get the slow path.
Benchmarked on Qwen3.5-4B Q4_K_XL, GPU 0:
- repeat_penalty=1.0: 225.2 TPS
- repeat_penalty=1.1: 172.7 TPS (about 23% slower)
- LM Studio (which applies rp internally): 170.8 TPS
This aligns the Pydantic default with the frontend default (1.0),
generate_chat_completion's function signature default (1.0), and
llama-server's own default (1.0).
* Allow install_python_stack to run on Colab
The _COLAB_NO_VENV flag was setting _SKIP_PYTHON_DEPS=true, which
skipped both the PyPI version check (needs $VENV_DIR/bin/python) and
install_python_stack (uses sys.executable, works without a venv).
Introduce a separate _SKIP_VERSION_CHECK flag for the version check,
so install_python_stack still runs on Colab. The _SKIP_PYTHON_DEPS
flag remains available for the "versions match" fast path.
* Remove colab.py workarounds that broke transformers/hf-hub compatibility
PR #4601 added _pip_install_backend_deps(), _bootstrap_studio_venv(),
and _is_colab() to colab.py as workarounds for install_python_stack
being skipped on Colab. These workarounds:
- Stripped version constraints from studio.txt and installed into system Python
- Upgraded huggingface-hub to >=1.0, breaking Colab's pre-installed
transformers which requires huggingface-hub<1.0
With install_python_stack now running on Colab (previous commit), these
workarounds are unnecessary: all deps are properly installed by setup.sh.
Restore colab.py to its original PR #4237 structure: just get_colab_url(),
show_link(), and start().
* Remove --local flag from setup.sh in Colab notebook
The --local flag is not needed for the standard Colab flow since
install_python_stack now runs on Colab and installs deps from PyPI.
* studio: humanize ETA display for long training runs
When training takes hours or days, the ETA displayed raw minutes
(e.g. '560m 50s'). This changes the format to:
- Under 1 hour: Xm Ys (unchanged)
- 1-24 hours: Xh Ym Zs
- Over 24 hours: Xd Yh Zm
* Fix formatDuration edge cases and consolidate duplicate for PR #4608
- Guard NaN/Infinity inputs with Number.isFinite() (matches formatNumber in same file)
- Add sub-minute branch so 30s displays as "30s" instead of "0m 30s"
- Accept undefined in type signature to match formatNumber pattern
- Remove duplicate formatDuration from history-card-grid.tsx and import the shared one
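The formatting rules from these two commits, rendered in Python for illustration (the actual formatDuration helper is TypeScript). The "--" placeholder for invalid input and the treatment of exactly 24 hours as "1d 0h 0m" are assumptions.

```python
import math
from typing import Optional


def format_duration(seconds: Optional[float]) -> str:
    """Humanize a duration given in seconds."""
    # Guard undefined, NaN, Infinity, and negative inputs.
    if seconds is None or not math.isfinite(seconds) or seconds < 0:
        return "--"
    total = int(seconds)
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    if days:
        return f"{days}d {hours}h {minutes}m"
    if hours:
        return f"{hours}h {minutes}m {secs}s"
    if minutes:
        return f"{minutes}m {secs}s"
    # Sub-minute branch: "30s" rather than "0m 30s".
    return f"{secs}s"
```

For the example in the commit, 560m 50s is 33650 seconds, which this sketch renders as "9h 20m 50s".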
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix: avoid _yaml.pyd lock on Windows during dependency overrides
* fix: move pytorch_tokenizers and kernels to no-deps install to avoid Windows _yaml.pyd loc
* fix(studio): align config cards, dynamic height for expanders, LoRA collapsible
* Fix clipping regressions in training, dataset, and params section cards
- training-section: Add hasMessage conditional so the card expands
(min-h) when startError, vision/audio incompatibility, or config
validation messages are present instead of always using fixed height
- dataset-section: Expand card when a local dataset is selected via
upload (datasetSource === "upload" && selectedLocalDataset), not only
when the Advanced panel is open
- params-section: Guard loraOpen behind isLora so switching to full
fine-tune collapses the card instead of staying expanded from stale
React useState
* Fix dataset card clipping for direct file uploads
Use uploadedFile instead of selectedLocalDataset in the card height
condition. selectedLocalDataset is derived from localDatasets.find()
which only resolves for Data Recipe entries, not direct file uploads
(.jsonl, .csv, .parquet, .arrow). The card already renders the Eval
Dataset panel based on uploadedFile (line 750), so the height gate
should match.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Recommended models matching the query were filtered from HF results but the Recommended section was hidden during search, causing them to vanish entirely.
- Show filtered recommended models during search by introducing `filteredRecommendedIds`
- Switch `recommendedSet` to use filtered IDs when searching so dedup against HF results is correct
- Hide empty "Hugging Face" label when recommended matches cover the query
- Add `normalizeForSearch` helper to strip separators (spaces, hyphens, underscores, dots) so queries like "llama 3" match "Llama-3.2-1B" and "qwen 2.5" matches "Qwen2.5-7B" in both the recommended model filter and the LoRA adapter filter
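A Python rendering of the separator-stripping idea, for illustration only (the real normalizeForSearch lives in the frontend). Lowercasing and substring matching are assumptions about how the normalized strings are compared.

```python
import re

# Spaces, hyphens, underscores, and dots are treated as separators.
_SEPARATORS = re.compile(r"[\s\-_.]+")


def normalize_for_search(text: str) -> str:
    """Lowercase the text and strip all separator characters."""
    return _SEPARATORS.sub("", text.lower())


def matches(query: str, candidate: str) -> bool:
    """True if the normalized query occurs inside the normalized candidate."""
    return normalize_for_search(query) in normalize_for_search(candidate)
```

With this normalization "llama 3" becomes "llama3" and "Llama-3.2-1B" becomes "llama321b", so the query matches as a substring.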
* Fix Colab setup skipping llama.cpp installation
The early exit 0 in the Colab no-venv path prevented setup.sh from
ever reaching the llama.cpp install section. Remove the early exit
and instead guard only the venv-dependent Python deps section, so
execution continues through to the llama.cpp prebuilt/source install.
* Simplify _SKIP_PYTHON_DEPS initialization
* Add --local flag to setup.sh in Colab notebook
* Fix Colab huggingface-hub conflict, ensurepip fallback, bump to 2026.3.14
- colab.py / setup.sh: relax == pins to >= when installing studio.txt
on Colab so huggingface-hub does not clobber Colab's bundled version
(breaks transformers is_offline_mode import)
- install_python_stack.py: when uv is unavailable and pip is missing
(uv-created venvs), bootstrap via ensurepip before attempting upgrade
- Bump version to 2026.3.14
- Bump installer min version pins to 2026.3.14
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix Colab Studio launch and setup.ps1 box alignment
- colab.py: when the Studio venv is missing on Colab, pip-install
backend dependencies (structlog, fastapi, etc.) from studio.txt
into the current Python instead of failing with ModuleNotFoundError
- setup.sh: on Colab without a venv, install backend deps into system
Python and skip venv-dependent sections (Python stack update,
llama.cpp build) that would otherwise fail
- setup.ps1: use PadRight(47) for the done-line so "Setup Complete!"
and "Update Complete!" both align with the box border
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* feat(studio): editable context length with Apply/Reset for GGUF model settings
Previously the Context Length field was read-only and the backend
hardcoded `-c 0`, ignoring custom values entirely. KV Cache Dtype also
triggered an immediate model reload with no way to cancel.
Backend:
- llama_cpp.py: pass the actual n_ctx value to `-c` instead of always 0
- models/inference.py: relax max_seq_length to 0..1048576 (0 = model
default) so GGUF models with large context windows are supported
Frontend:
- chat-runtime-store: add customContextLength and loadedKvCacheDtype
state fields for dirty tracking
- chat-settings-sheet: make Context Length an editable number input,
stop KV Cache Dtype from auto-reloading, show Apply/Reset buttons
when either setting has been changed
- use-chat-model-runtime: send customContextLength as max_seq_length
in the load request, reset after successful load
* fix: preserve maxSeqLength for non-GGUF models in load request
customContextLength ?? 0 sent max_seq_length=0 for non-GGUF models,
breaking the finetuning/inference path that needs the slider value.
Now uses a three-way branch:
- customContextLength set: use it (user edited GGUF context)
- GGUF without custom: 0 (model's native context)
- Non-GGUF: maxSeqLength from the sampling slider
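The three-way branch as of this commit, sketched in Python (the real code is in the frontend load request; the function and parameter names are illustrative). A later entry in this log replaces the 0 with the model's native context length.

```python
from typing import Optional


def resolve_max_seq_length(
    custom_context_length: Optional[int],
    is_gguf: bool,
    slider_max_seq_length: int,
) -> int:
    """Pick the max_seq_length value to send in the load request."""
    if custom_context_length is not None:
        # User edited the GGUF context length: honor it.
        return custom_context_length
    if is_gguf:
        # 0 means "use the model's native context" at this point in history.
        return 0
    # Non-GGUF models keep using the sampling slider value.
    return slider_max_seq_length
```

The bug being fixed corresponds to collapsing the last two branches into one, which sent 0 for non-GGUF models as well.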
* fix: keep max_seq_length default at 4096 for non-GGUF callers
Only relax the bounds (ge=0 for GGUF's "model default" mode,
le=1048576 for large context windows). The default stays at 4096
so API callers that omit max_seq_length still get a sane value
for non-GGUF models.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix(studio): rename trust remote code toggle and hide when no model selected
- Rename "Trust remote code" to "Enable custom code"
- Shorten subtitle to "Only enable if sure"
- Hide the toggle when no model is loaded (already hidden for GGUFs)
* fix: restore ge=128 for max_seq_length validation
Keep the minimum at 128 so the API rejects nonsensical values.
GGUF path now sends the model's native context length (from
ggufContextLength) instead of 0 when the user has not customized it.
The upper bound stays at 1048576 for large-context GGUF models.
* feat(studio): replace Context Length input with slider
Use a ParamSlider (512 to model's native context, step 512) instead
of a small number input. Shows "Max" when at the model's native
context length. Consistent with the other slider controls in the
settings panel.
* feat(studio): add editable number input alongside Context Length slider
The slider and number input stay synced -- dragging the slider updates
the number, typing a number moves the slider. The input also accepts
values beyond the slider range for power users who need custom context
lengths larger than the model default.
* fix(studio): widen context length input and use 1024 step for slider
Make the number input wider (100px) so large values like 262144 are
fully visible. Change slider step from 512 to 1024 and min from 512
to 1024.
* fix(studio): context length number input increments by 1024
* fix(studio): cap context length input at model's native max
Adds max attribute and clamps typed/incremented values so the context
length cannot exceed the GGUF model's reported context window.
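The clamp can be sketched as below. The function name is hypothetical, and applying the slider's 1024 minimum to typed values is an assumption; the commit itself only states the upper cap.

```python
def clamp_context_length(value: int, native_max: int, minimum: int = 1024) -> int:
    """Keep a typed or incremented value within [minimum, native_max]."""
    # Apply the lower bound first, then the cap, so the result never
    # exceeds the model's native context even if native_max < minimum.
    return min(native_max, max(minimum, value))
```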
* fix(studio): point "What's new" link to changelog page
Changed from /blog to /docs/new/changelog.
* fix(studio): preserve custom context length after Apply, remove stale subtitle
- After a reload with a custom context length, keep the user's value
in the UI instead of snapping back to the model's native max.
ggufContextLength always reports the model's native metadata value
regardless of what -c was passed, so we need to preserve
customContextLength when it differs from native.
- Remove "Reload to apply." from KV Cache Dtype subtitle since the
Apply/Reset buttons now handle this.
* feat(studio): auto-enable Search and Code tools when model supports them
Previously toolsEnabled and codeToolsEnabled stayed false after loading
a model even if it reported supports_tools=true. Now both toggles are
automatically enabled when the loaded model supports tool calling,
matching the existing behavior for reasoning.
* fix(studio): auto-enable tools in autoLoadSmallestModel path
The suggestion cards trigger autoLoadSmallestModel which bypasses
selectModel entirely. It was hardcoding toolsEnabled: false and
codeToolsEnabled: false even when the model supports tool calling.
Now both are set from the load response, matching the selectModel
behavior. Also sets kvCacheDtype/loadedKvCacheDtype for dirty
tracking consistency.
* fix(studio): re-read tool flags after auto-loading model
The runtime state was captured once at the start of the chat adapter's
run(), before autoLoadSmallestModel() executes. After auto-load enables
tools in the store, the request was still built with the stale snapshot
that had toolsEnabled=false. Now re-reads the store after auto-load so
the first message includes tools.
* fix(studio): re-read entire runtime state after auto-load, not just tools
The runtime snapshot (including params.checkpoint, model id, and all
tool/reasoning flags) was captured once before auto-load. After
autoLoadSmallestModel sets the checkpoint and enables tools, the
request was still built with stale params (empty checkpoint, tools
disabled). Now re-reads the full store state after auto-load so the
first message has the correct model, tools, and reasoning flags.
* feat(studio): add Hugging Face token field in Preferences
Adds a password input under Configuration > Preferences for users to
enter their HF token. The token is persisted in localStorage and
passed to all model validate/load/download calls, replacing the
previously hardcoded null. This enables downloading gated and private
models.
* fix(studio): use model native context for GGUF auto-load, show friendly errors
The auto-load paths and selectModel for GGUF were sending
max_seq_length=4096 which now actually limits the context window
(since we fixed the backend to respect n_ctx). Changed to send 0
for GGUF, which means "use model's native context size".
Also replaced generic "An internal error occurred" messages with
user-friendly descriptions for known errors like context size
exceeded and lost connections.
LoadRequest validation changed to ge=0 to allow the GGUF "model
default" signal. The frontend slider still enforces min=128 for
non-GGUF models.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix(studio): filter out FP8 models from model search results
Hide models matching *-FP8-* or *FP8-Dynamic* from both the
recommended list and HF search results. These models are not
yet supported in the inference UI.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add PID file tracking and `unsloth studio stop` command
On macOS the .app shortcut launches Studio via osascript into a
Terminal window, then the launcher script exits. The server process
runs outside of the launcher's context with no PID file, so there
is no straightforward way to find or stop it.
This adds:
- PID file at ~/.unsloth/studio/studio.pid, written after the
server starts and removed on graceful shutdown or via atexit
- `unsloth studio stop` command that reads the PID file and sends
SIGTERM (or taskkill on Windows) to shut down the server
The PID file is only removed if it still contains the current
process ID, avoiding races when a new server instance replaces
a crashed one.
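The ownership check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Studio's actual code: the helper names `write_pid_file` and `remove_pid_file_if_owner` are hypothetical, and the path is passed in rather than fixed to ~/.unsloth/studio/studio.pid.

```python
import os
from pathlib import Path


def write_pid_file(pid_file: Path) -> None:
    """Record the current process ID (hypothetical helper)."""
    pid_file.parent.mkdir(parents=True, exist_ok=True)
    pid_file.write_text(str(os.getpid()))


def remove_pid_file_if_owner(pid_file: Path) -> bool:
    """Delete the PID file only if it still names this process.

    Returns True when the file was removed.
    """
    try:
        recorded = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False  # missing or unreadable: nothing of ours to remove
    if recorded != os.getpid():
        return False  # a newer server instance owns the file now
    pid_file.unlink(missing_ok=True)
    return True
```

A function like this could be registered with `atexit.register(remove_pid_file_if_owner, pid_file)` so the same guard applies on unexpected exit.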
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Move atexit PID cleanup into run_server()
The atexit registration was only in the __main__ block, so it
did not cover the `unsloth studio` CLI path that calls
run_server() directly via studio_default(). Moving it into
run_server() ensures the PID file is cleaned up on unexpected
exit regardless of entry point.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
The function was called with no arguments, so $args inside the function
was always empty. Script-level args (--local, --package) were never
forwarded. Use @args splatting to pass them through.
Windows install.ps1 had no way to install from a local repo checkout,
unlike install.sh which supports ./install.sh --local. This adds:
- --local: install from the local repo via editable install (-e . --no-deps)
after installing deps from PyPI, mirroring install.sh behavior
- --package: install a different package name for testing
The --local flag:
1. Validates pyproject.toml exists at the script's directory
2. Installs torch + unsloth deps normally
3. Overlays the local checkout with uv pip install -e <repo> --no-deps
4. Passes STUDIO_LOCAL_INSTALL and STUDIO_LOCAL_REPO to setup.ps1
After installation, `unsloth studio` only works if the user
activates the Studio venv first or uses the full absolute path.
The Desktop/Start Menu shortcuts work fine, but typing `unsloth
studio` in a fresh terminal does not.
This adds the venv Scripts dir to the persistent User PATH env
var (if not already present) so `unsloth studio` works from any
new terminal window. The current session is also updated via the
existing Refresh-SessionPath helper.
* feat: multi-source model discovery (HF default, legacy cache, LM Studio)
* Fix multi-source model discovery bugs
- Fix lmstudio_model_dirs: add ~/.lmstudio/models as default path,
remove dead sys.platform branch, add dedup via seen set
- Fix _setup_cache_env: preserve legacy HF cache env vars when the
legacy hub directory exists and is non-empty
- Fix _scan_lmstudio_dir: use absolute path for id field so
is_local_path() returns True
- Remove LM Studio dirs from allowed_roots (scanned unconditionally)
- Replace bare except passes with logger.warning in legacy cache blocks
- Fix delete_cached_model to search both default and legacy HF caches
- Make lmstudio_dirs non-optional in TS interface (matches Python schema)
- Exclude lmstudio source from trainable model filter
- Remove unused import sys
* Scan HF default cache alongside legacy and active caches
When _setup_cache_env overrides HF_HUB_CACHE to the legacy Unsloth
path, the standard HF default cache (~/.cache/huggingface/hub) was
never scanned, hiding models downloaded before Unsloth Studio was
installed.
Add hf_default_cache_dir() and _all_hf_cache_scans() helper that
deduplicates and scans all three HF cache locations (active, legacy,
default). Used in list_local_models, list_cached_gguf,
list_cached_models, and delete_cached_model.
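The deduplication step could look roughly like the sketch below. The name `unique_cache_dirs` is hypothetical and the real `_all_hf_cache_scans()` is not shown in this log; the sketch only illustrates resolving the candidate cache locations and keeping each existing directory once, in first-seen order.

```python
from pathlib import Path
from typing import Iterable, List


def unique_cache_dirs(candidates: Iterable[Path]) -> List[Path]:
    """Return existing directories once each, in first-seen order."""
    seen = set()
    result = []
    for path in candidates:
        resolved = path.expanduser().resolve()
        # Skip duplicates (e.g. active == legacy) and locations that do not exist.
        if resolved in seen or not resolved.is_dir():
            continue
        seen.add(resolved)
        result.append(resolved)
    return result
```

Resolving before comparing means two spellings of the same directory (a symlink, or a path with `~`) are scanned only once.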
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Port the bun cache corruption fix from setup.sh to setup.ps1.
bun's package cache can become corrupt, storing only package metadata
without actual content. This causes bun install to exit 0 but leave
binaries like tsc missing from node_modules/.bin/.
Changes:
- After bun install, verify tsc and vite exist in node_modules\.bin\
- Check for both bare names and .cmd wrappers (Windows creates both)
- If missing, clear the bun cache and retry once
- Only fall back to npm if the retry also fails
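The post-install verification can be sketched in Python as below. The actual check lives in setup.ps1; the function name `missing_bins` is hypothetical, and treating either the bare shim or the `.cmd` wrapper as sufficient is an assumption about how the check is meant to work.

```python
from pathlib import Path
from typing import Iterable, List


def missing_bins(project_dir: Path, names: Iterable[str] = ("tsc", "vite")) -> List[str]:
    """List required binaries absent from node_modules/.bin."""
    bin_dir = project_dir / "node_modules" / ".bin"
    missing = []
    for name in names:
        # Accept either the bare shim or the Windows .cmd wrapper.
        candidates = (bin_dir / name, bin_dir / f"{name}.cmd")
        if not any(candidate.is_file() for candidate in candidates):
            missing.append(name)
    return missing
```

A non-empty result after an install that exited 0 is the signal to clear the cache and retry once before falling back to npm.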
* fix(studio): source-build fallback prefers Unsloth's tested tag over upstream latest
When the prebuilt install fails and falls back to source build,
--resolve-llama-tag now queries the Unsloth release repo
(unslothai/llama.cpp) first to get the latest tested/approved tag
(e.g. b8508), instead of going straight to ggml-org/llama.cpp which
may return a newer untested tag (e.g. b8514).
This ensures the source-build fallback compiles the same version that
the prebuilt path would have installed, rather than a potentially
incompatible bleeding-edge release.
Resolution order for "latest":
1. Unsloth release repo (tested/approved)
2. ggml-org upstream (bleeding-edge)
3. Raw requested tag string (last resort)
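A minimal sketch of this fallback chain, assuming each source is wrapped in a callable that returns a tag or fails. The name `resolve_latest_tag` and the injected lookups are hypothetical; the real logic lives in resolve_requested_llama_tag() and the setup scripts.

```python
from typing import Callable, Optional, Sequence

# A lookup returns a concrete tag such as "b8508", or None/"" when it has no answer.
TagLookup = Callable[[], Optional[str]]


def resolve_latest_tag(requested: str, lookups: Sequence[TagLookup]) -> str:
    """Try each lookup in order; fall back to the raw requested string."""
    if requested != "latest":
        return requested  # concrete tags need no resolution
    for lookup in lookups:
        try:
            tag = lookup()
        except Exception:
            continue  # network or parse failure: try the next source
        if tag:
            return tag
    return requested  # last resort: the literal requested tag
```

Passing the Unsloth lookup first and the ggml-org lookup second reproduces the order listed above.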
Changes:
- resolve_requested_llama_tag() accepts optional published_repo param
with docstring explaining the resolution order
- CLI --resolve-llama-tag passes --published-repo through
- setup.sh and setup.ps1 pass --published-repo to --resolve-llama-tag
with inline comments explaining the preference
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
torch 2.11.0 has a torch.compile/dynamo bug that causes a
StopIteration crash in dict_keys_getitem when compiling MoE
router functions (e.g. GptOssTopKRouter_forward). Pin to
<2.11.0 until the upstream fix lands.
Applies to both install.sh (Linux/macOS) and install.ps1
(Windows) fresh install paths.
bun's package cache can become corrupt, storing only package metadata
(package.json, README) without actual content (bin/, lib/). When this
happens, bun install exits 0 and reports packages as installed, but
binaries like tsc are missing from node_modules/.bin/.
For example, a corrupt typescript cache entry is 64KB (metadata only)
vs 23MB when correctly downloaded.
Changes:
- After bun install, verify tsc and vite exist in node_modules/.bin/
- If missing, clear the bun cache with bun pm cache rm and retry once
- Only fall back to npm if the retry also fails
- Revert bun installation to npm install -g bun (the binary is fine,
the cache was the problem)
bun install (specifically the npm "bun" shim v1.3.x installed via
npm install -g bun) can exit 0 while silently failing to install
packages. This causes the frontend build to fail with "tsc: not found"
or missing type declarations, since the fallback to npm only triggers
on a non-zero exit code.
Changes:
1. Initial bun install now tries the official bun.sh installer first
(which gives a real bun runtime), falling back to npm install -g bun
only if that fails.
2. After bun install reports success, verify that critical binaries
(tsc, vite) actually exist in node_modules/.bin/. If they are
missing, reinstall bun from the official source and retry once
before falling back to npm.
3. Extract the bun install + validation logic into _try_bun_install()
to avoid duplicating the check/cleanup across both attempts.
The prebuilt llama.cpp binary (cuda13-newer) links against
libcudart.so.13 and libcublas.so.13. When torch is installed via pip,
these libraries live in the venv's site-packages under
nvidia/cu13/lib/, not in /usr/local/cuda/.
The existing LD_LIBRARY_PATH logic only searched /usr/local/cuda*
paths (which have CUDA 12.x), so the CUDA backend silently failed to
load and llama-server fell back to CPU -- even with -ngl -1.
This adds a glob scan of the venv's nvidia package directories
(cu*, cudnn, nvjitlink) to LD_LIBRARY_PATH before launching
llama-server, matching where pip puts the CUDA runtime.
Tested on Colab with RTX PRO 6000 Blackwell (CUDA 13.0, pip torch):
before -- 3 MiB GPU, 0% util, CPU inference
after -- 13317 MiB GPU, 77% util, full GPU inference
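The glob scan can be sketched as follows. This is an illustration, not the code from the commit: the function name `venv_nvidia_lib_dirs` is hypothetical, and the `nvidia/<package>/lib` layout is assumed from the nvidia/cu13/lib/ path mentioned above.

```python
import glob
import os
from typing import List

# Package directory patterns named in the commit message.
SUBDIR_PATTERNS = ("cu*", "cudnn", "nvjitlink")


def venv_nvidia_lib_dirs(site_packages: str) -> List[str]:
    """Collect lib directories of pip-installed NVIDIA packages."""
    found: List[str] = []
    for sub in SUBDIR_PATTERNS:
        pattern = os.path.join(site_packages, "nvidia", sub, "lib")
        for lib_dir in sorted(glob.glob(pattern)):
            # "cu*" also matches "cudnn", so skip directories already collected.
            if os.path.isdir(lib_dir) and lib_dir not in found:
                found.append(lib_dir)
    return found
```

The returned directories would be placed ahead of the existing LD_LIBRARY_PATH entries before llama-server is launched.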
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
When _select_gpus determines that a GGUF model fits on the selected
GPU(s), the code sets CUDA_VISIBLE_DEVICES but never passes -ngl
(number of GPU layers) to llama-server. Without -ngl or --fit,
llama-server defaults to 0 GPU layers and runs entirely on CPU.
This adds -ngl -1 (offload all layers) in the elif branch where
gpu_indices is set and use_fit is False, so models that fit in VRAM
actually use the GPU for inference.
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* Use prebuilt llama.cpp for unsloth studio setup
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix 3 issues that cause unnecessary fallback to source build
1. Make filelock import optional -- environments without filelock
(e.g. minimal installs) crashed at import time instead of
gracefully skipping the lock.
2. Use already-verified converter script from the hydrated source
tree instead of re-downloading from raw.githubusercontent.com
with no checksum. Adds symlink with copy fallback for the
legacy filename.
3. Initialize $SkipPrebuiltInstall in setup.ps1 before first use
to prevent potential uninitialized variable errors.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Keep network fallback in ensure_converter_scripts
Prefer the local verified copy from the hydrated source tree, but
retain the original network download as a fallback if the file is
missing. Create the legacy hyphenated filename as a symlink with a
copy fallback instead of writing a second full copy.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix 4 bugs in source-build fallback and binary_env paths
- setup.ps1: Replace git pull + checkout FETCH_HEAD with fetch + checkout -B
to avoid detached HEAD state that breaks re-runs. Use pinned tag in both
fetch and clone paths.
- setup.sh: Move rm -rf after cmake/git prerequisite checks so a missing
tool no longer deletes the existing install. Add --branch tag to clone.
- install_llama_prebuilt.py: Add binary_path.parent to Linux LD_LIBRARY_PATH
in binary_env() so bundled .so files in build/bin are found even without
RPATH, matching the existing Windows PATH logic.
- Add test for binary_env LD_LIBRARY_PATH on Linux.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Handle unresolved "latest" tag in source-build fallback clone
When tag resolution fails and the requested tag is "latest", both
setup scripts now omit --branch from git clone so the default branch
is cloned instead of failing on a nonexistent "latest" branch/tag.
Similarly, the PS1 fetch path fetches the default ref when the tag
is "latest".
* Resolve actual latest ggml-org tag instead of using literal "latest"
When both Python tag resolution attempts fail and the requested tag
is "latest", query the GitHub API for the actual latest release tag
from ggml-org/llama.cpp (e.g. b8508) instead of passing the literal
string "latest" to git clone --branch, which would fail since no
such branch/tag exists.
setup.sh uses curl + python json parsing; setup.ps1 uses
Invoke-RestMethod. Both fall back to the raw requested tag if the
API call also fails.
* Try Unsloth release repo before ggml-org when resolving latest tag
When falling back to the GitHub API to resolve "latest", query the
Unsloth release repo (unslothai/llama.cpp) first since it has the
prebuilt binaries pinned to tested tags. Only fall back to
ggml-org/llama.cpp if the Unsloth repo query fails.
* Add comprehensive sandbox tests for PR #4562 bug fixes
35 tests covering all fixes across platforms:
- binary_env cross-platform (Linux LD_LIBRARY_PATH, Windows PATH,
macOS DYLD_LIBRARY_PATH) with edge cases (dedup, ordering, existing paths)
- resolve_requested_llama_tag (concrete, latest, None, empty)
- setup.sh logic via subprocess: prereq check ordering (cmake/git missing
preserves install), pinned tag in clone, fetch+checkout -B pattern,
fetch failure warns instead of aborting
- "latest" tag resolution fallback chain (Unsloth API -> ggml-org ->
raw) with mock curl: success, failure, malformed JSON, empty body,
empty tag_name, env overrides
- Source code pattern verification for both .sh and .ps1 files
All 138 tests pass in isolated uv venv.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add binary_path.parent to macOS DYLD_LIBRARY_PATH in binary_env
macOS prebuilt .dylib files are overlaid into build/bin (same as
Linux), but binary_env only added install_dir to DYLD_LIBRARY_PATH.
Add binary_path.parent so the loader can find sibling dylibs even
without embedded loader paths.
Mirrors the existing fix for Linux LD_LIBRARY_PATH and the Windows
PATH pattern.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Guard --branch when resolved tag is "latest"; fix broken test assertion
When all API fallbacks fail and the tag stays as literal "latest",
omit --branch from git clone (clones default branch instead of
failing). Both setup.sh and setup.ps1 now check for "latest" before
passing --branch to git clone/fetch.
Also fix test_setup_ps1_clone_uses_branch_tag, which used the form
assert "x", "y" in z. Python parses that as the condition "x" with
("y" in z) as the failure message, so it always passes. Changed to
assert "x" in z and "y" in z.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix macOS DYLD trailing colon, install_lock no-op, and debug log
- binary_env macOS: use dedupe_existing_dirs instead of raw string
concatenation. Eliminates trailing colon in DYLD_LIBRARY_PATH
(which causes dyld to search CWD for libraries) and deduplicates
when binary_path.parent == install_dir. Now consistent with the
Linux and Windows branches.
- install_lock: when filelock is not installed, use os.O_CREAT|O_EXCL
as a fallback exclusive file lock with timeout, instead of yielding
with no locking. Prevents concurrent installs from corrupting each
other's staging directories.
- setup.ps1: remove [DEBUG] log line that printed to every user on
every Windows setup run.
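The trailing-colon problem from the first item comes from concatenating onto an empty variable. A sketch of the difference, using a hypothetical `join_library_path` helper (the real dedupe_existing_dirs is not shown in this log and, judging by its name, presumably also filters out directories that do not exist):

```python
import os
from typing import Iterable, List


def join_library_path(dirs: Iterable[str], existing: str = "") -> str:
    """Join search directories without empty or duplicate entries."""
    ordered: List[str] = []
    for entry in list(dirs) + existing.split(os.pathsep):
        # Empty entries are what produce a leading or trailing separator.
        if entry and entry not in ordered:
            ordered.append(entry)
    return os.pathsep.join(ordered)


install_dir = "/opt/llama/build/bin"  # made-up path for illustration
naive = install_dir + os.pathsep + ""  # raw concatenation with an unset variable
clean = join_library_path([install_dir, install_dir], "")
```

Here `naive` ends with the path separator while `clean` is just the single directory.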
* Add stale-lock detection and atomic clone-then-swap
install_lock fallback (no filelock): write PID to lock file and
check if the holder process is still alive on contention. Dead PIDs
(ProcessLookupError) and unreadable lock files trigger immediate
cleanup. Live processes owned by other users (PermissionError) are
correctly recognized as alive -- the lock is not removed.
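The liveness probe can be sketched as below, assuming a POSIX system where signal 0 performs only the existence and permission checks. The function name `lock_holder_alive` is hypothetical, and this form should not be used on Windows, where `os.kill` treats signal values differently.

```python
import os


def lock_holder_alive(pid: int) -> bool:
    """Probe a PID with signal 0 (POSIX): nothing is delivered, only checked."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process: the lock is stale
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

Only a False result justifies removing the lock file; a PermissionError is evidence that the holder is still running.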
setup.sh/setup.ps1 source-build: clone into a temporary directory
first, then swap into place only on success. If git clone fails,
the existing install is preserved instead of being deleted by the
premature rm -rf.
* Remove redundant upstream_tag != release_tag check
load_approved_release_checksums compared checksums.upstream_tag
against the Unsloth release_tag, which are different namespaces
(upstream ggml-org tag vs Unsloth published tag). This only worked
because both happened to be "b8508" by convention, and would break if
Unsloth ever used a different release naming scheme.
The existing check at parse_approved_release_checksums (line 950)
already validates the release_tag field correctly.
* Fix lock TOCTOU race and build-in-temp-dir swap
install_lock fallback: add os.fsync(fd) after writing PID to ensure
the PID is visible to racing processes before they check. Treat
empty lock files (PID not yet written) as "wait and retry" instead
of stale, closing the window where two processes could both see an
empty file, both unlink it, and both acquire the lock.
setup.sh/setup.ps1 source-build: clone AND build in a temp directory
(LLAMA_CPP_DIR.build.$$). Only swap into the final LLAMA_CPP_DIR
after the build succeeds. If clone or cmake or build fails, the temp
dir is cleaned up and the existing working install is preserved.
Previously, rm -rf ran after clone but before build, destroying the
existing install even if the build later failed.
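The build-then-swap pattern, sketched in Python rather than shell. The function name `build_then_swap` and the `build` callback are hypothetical stand-ins for the clone, cmake, and compile steps in setup.sh/setup.ps1.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def build_then_swap(final_dir: Path, build: Callable[[Path], None]) -> None:
    """Build in a sibling temp directory; replace final_dir only on success."""
    final_dir.parent.mkdir(parents=True, exist_ok=True)
    tmp_dir = Path(
        tempfile.mkdtemp(prefix=final_dir.name + ".build.", dir=final_dir.parent)
    )
    try:
        build(tmp_dir)  # clone + configure + compile; raises on failure
    except BaseException:
        # Any failure: discard the temp dir and leave the existing install alone.
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
    if final_dir.exists():
        shutil.rmtree(final_dir)
    tmp_dir.rename(final_dir)
```

Note that the remove-then-rename pair at the end is not strictly atomic; the point of the sketch is only that the old install is untouched until the build has succeeded.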
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* refactor: consolidate dual venvs into single ~/.unsloth/studio/unsloth_studio
* refactor: separate install.sh (first-time) from setup.sh (smart update with PyPI version check)
* fix: install.sh calls setup.sh directly, keep both setup and update CLI commands
* fix: use importlib.resources.files() directly without _path attribute
* fix: bootstrap uv before pip upgrade to handle uv venvs without pip
* fix: frontend 404 when launched via CLI, add global symlink to ~/.local/bin
* feat: add --local flag to install.sh and unsloth studio update for branch testing
* fix: resolve repo root from script location for --local installs
* feat: add --package flag to install.sh for testing with custom package names
* feat: add --package flag to unsloth studio update
* fix: always nuke venv in install.sh for clean installs
* revert: remove Windows changes, will handle in separate PR
* fix: error when --package is passed without an argument
* revert: restore Windows scripts to current main
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: always explicitly set STUDIO_LOCAL_INSTALL and STUDIO_PACKAGE_NAME env vars
* fix: pass explicit STUDIO_LOCAL_REPO env var for --local installs
* fix: align banner box for Setup vs Update labels
* deprecate: hide 'unsloth studio setup' command, point users to update/install.sh
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: check stdout not stdin for auto-launch detection (curl pipe fix)
* fix: update install URL to unsloth.ai/install.sh
* fix: update install.sh usage comments to unsloth.ai/install.sh
* fix: use --upgrade-package for base deps to preserve existing torch/CUDA installs
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: --local install now also installs unsloth-zoo via base.txt before editable overlay
* fix: don't skip base packages for --local installs (editable needs unsloth-zoo)
* refactor: move --local full dep install to install.sh, keep SKIP_STUDIO_BASE for all paths
* feat: add migration support for old .venv and CWD-based installs in setup.sh
* Revert "feat: add migration support for old .venv and CWD-based installs in setup.sh"
This reverts commit 301291d002.
* feat: migrate old .venv layout in install.sh instead of always nuking
* feat: validate old .venv with torch CUDA test before migration, recovery message on launch failure
* fix: try CUDA then fall back to CPU for migration validation
* fix: upgrade unsloth/unsloth-zoo with --reinstall-package on migration to preserve torch
* remove: delete unused unsloth ui command (use unsloth studio instead)
* Fix Windows venv path mismatch between install.ps1, setup.ps1, and studio.py
install.ps1 was creating the venv CWD-relative ($VenvName = "unsloth_studio"),
setup.ps1 was using an absolute path to ".unsloth\studio\.venv", and studio.py
looks for ".unsloth\studio\unsloth_studio". All three paths were different, so
the Windows installer would never produce a working Studio setup.
install.ps1:
- Use absolute $StudioHome + $VenvDir matching the Linux install.sh layout
- Add 3-way migration: old .venv at STUDIO_HOME, CWD-relative ~/unsloth_studio
from the previous install.ps1, or fresh creation with torch validation
- For migrated envs, upgrade unsloth while preserving existing torch/CUDA wheels
- Set SKIP_STUDIO_BASE=1 before calling setup.ps1 (matches install.sh behavior)
- Fix launch instructions to use the absolute venv path
setup.ps1:
- Change $VenvDir from ".unsloth\studio\.venv" to ".unsloth\studio\unsloth_studio"
- Add SKIP_STUDIO_BASE guard: error out if venv is missing when called from
install.ps1 (which should have already created it)
- Differentiate "Setup" vs "Update" in banners based on SKIP_STUDIO_BASE
* setup.ps1: unconditionally error if venv missing, matching setup.sh
setup.sh always errors out if the venv does not exist (lines 224-228),
telling the user to run install.sh first. setup.ps1 was conditionally
creating a bare venv with python -m venv when SKIP_STUDIO_BASE was not
set, which would produce an empty venv with no torch or unsloth. Now
setup.ps1 matches setup.sh: always error, always point to install.ps1.
* Fix --torch-backend=auto CPU solver dead-end on Linux, macOS, and Windows
On CPU-only machines, `uv pip install unsloth --torch-backend=auto`
falls back to unsloth==2024.8 because the CPU solver cannot satisfy
newer unsloth's dependencies. install.ps1 already solved this with a
two-step approach; this applies the same fix to install.sh and
install_python_stack.py.
install.sh: add get_torch_index_url() that detects GPU via nvidia-smi
and maps CUDA versions to PyTorch index URLs (matching install.ps1's
Get-TorchIndexUrl). Fresh installs now install torch first via explicit
--index-url, then install unsloth with --upgrade-package to preserve
the pre-installed torch. All 5 uses of --torch-backend=auto are
removed from the primary paths.
install.ps1: add fallback else-branch when TorchIndexUrl is empty,
using --torch-backend=auto as last resort (matching install.sh).
install_python_stack.py: remove unconditional --torch-backend=auto
from _build_uv_cmd. Torch is pre-installed by install.sh/setup.ps1
by the time this runs. Callers that need it can set UV_TORCH_BACKEND.
Both install.sh and install.ps1 now share the same three-branch logic:
migrated env (upgrade-package only), normal (torch-first + index-url),
and fallback (--torch-backend=auto if URL detection fails).
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use --reinstall-package for migrated envs on both Linux and Windows
For migrated environments (moved from legacy venv location),
--reinstall-package is better than --upgrade-package because it forces
a clean reinstall even if the same version is already installed. This
ensures proper .dist-info and .pyc state in the new venv location.
--upgrade-package remains correct for the fresh install path where
torch is already installed and we just want to add unsloth without
re-resolving torch.
* Address review findings: portability, parity, and stale comments
- Replace grep -oP (GNU Perl regex) with POSIX sed in
get_torch_index_url() so the script works on BSD grep (macOS is
already guarded by the Darwin early-return, but Alpine/BusyBox
would silently get the wrong CUDA tag)
- Add LC_ALL=C before nvidia-smi invocation to prevent locale-dependent
output parsing issues
- Add warning on stderr when nvidia-smi output is unparseable, matching
install.ps1's [WARN] message
- Add explicit unsloth-zoo positional arg to install.ps1 migrated path,
matching install.sh (--reinstall-package alone won't install it if it
was never present in the migrated env)
- Fix stale comment in install_python_stack.py line 392 that still
claimed --torch-backend=auto is added by _build_uv_cmd
- Add sed to test tools directory (function now uses sed instead of grep)
* Add --index-url to migrated env path to prevent CPU torch resolution
The migrated path runs uv pip install with --reinstall-package for
unsloth/unsloth-zoo. While uv should keep existing torch as satisfied,
the resolver could still re-resolve torch as a transitive dependency.
Without --index-url pointing at the correct CUDA wheel index, the
resolver would fall back to plain PyPI and potentially pull CPU-only
torch. Adding --index-url $TORCH_INDEX_URL ensures CUDA wheels are
available if the resolver needs them.
Applied to both install.sh and install.ps1.
* Revert --index-url on migrated env path
The original install.ps1 on main already handles the migrated path
without --index-url and it works correctly. --reinstall-package only
forces reinstall of the named packages while uv keeps existing torch
as satisfied. No need for the extra flag.
* Fix unsloth studio update --local not installing local checkout
studio.py sets STUDIO_LOCAL_REPO when --local is passed, but
install_python_stack.py never read it. The update path always
installed from PyPI regardless of the --local flag.
Add a local_repo branch that first updates deps from base.txt
(with --upgrade-package to preserve torch), then overlays the
local checkout as an editable install with --no-deps.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Add support for ROCm in studio setup
* Fix ROCm detection bugs: ROCM_PATH resolution, CUDA guard, compiler selection
- Set GPU_BACKEND="cuda" when nvcc is found (CUDA path was unreachable)
- Guard ROCm detection with `if [ -z "$GPU_BACKEND" ]` so CUDA takes
priority on mixed-toolchain hosts
- Rename ROCM_PATH to ROCM_HIPCC for the hipcc binary; resolve the
actual ROCm root via readlink -f and hipconfig -R into ROCM_ROOT
- Export both ROCM_PATH and HIP_PATH as the resolved root directory
- Use HIPCXX via hipconfig -l instead of legacy CMAKE_C_COMPILER=hipcc
- Switch grep -oP to grep -oE for portability across Linux distros
- Use GPU_TARGETS (upstream cmake variable) instead of AMDGPU_TARGETS
- Remove stale hardcoded fallback targets; let cmake auto-detect instead
* Fix gfx regex to match gfx90a (MI210/MI250/MI250X)
The grep and bash regex used {3,4} digits after 'gfx', which silently
excluded gfx90a (2 digits + letter 'a') -- the architecture for AMD
Instinct MI210, MI250, and MI250X data-center GPUs. Change to {2,4}
so all real gfx targets from gfx90a through gfx1200 are matched.
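The effect of the quantifier change can be checked with a short script. The exact patterns in setup.sh are not quoted in this log, so the two expressions below are assumed shapes that differ only in the digit count.

```python
import re

OLD = re.compile(r"gfx[0-9]{3,4}[a-z]?")  # assumed shape of the previous pattern
NEW = re.compile(r"gfx[0-9]{2,4}[a-z]?")  # assumed shape of the fixed pattern


def matches(pattern: "re.Pattern[str]", target: str) -> bool:
    """True when the whole target string is a valid match."""
    return pattern.fullmatch(target) is not None


targets = ["gfx90a", "gfx942", "gfx1100", "gfx1200"]
old_hits = [t for t in targets if matches(OLD, t)]
new_hits = [t for t in targets if matches(NEW, t)]
```

Under these assumed patterns, `old_hits` omits gfx90a because only two digits precede the letter, while `new_hits` contains all four targets.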
---------
Co-authored-by: edamamez <eda.zhou@amd.com>
* feat(tokenizer): add get_tokenizer_info() diagnostic helper
Adds get_tokenizer_info(tokenizer) to tokenizer_utils.py, returning a
concise dict of key tokenizer properties: class name, is_fast, vocab
size, added token count, model_max_length, padding side, special tokens
(bos, eos, pad, unk), chat template presence, and total special token
count. All fields use getattr(..., None) fallbacks so the function never
raises on unusual or partially initialized tokenizers. Exported via
__all__ alongside the existing public helpers. Useful for logging,
debugging, and surfacing tokenizer state in the Unsloth Studio UI.
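The getattr-fallback approach can be sketched as below. This is not the function from tokenizer_utils.py: the name `tokenizer_info_sketch` is hypothetical, most key names are assumptions, and the token counts are left out because how they are computed is not described here.

```python
from typing import Any, Dict


def tokenizer_info_sketch(tokenizer: Any) -> Dict[str, Any]:
    """Summarize a tokenizer-like object, using None for missing attributes."""
    info: Dict[str, Any] = {"tokenizer_class": type(tokenizer).__name__}
    for field in (
        "is_fast",
        "vocab_size",
        "model_max_length",
        "padding_side",
        "bos_token",
        "eos_token",
        "pad_token",
        "unk_token",
    ):
        # getattr with a default avoids AttributeError on partial objects.
        info[field] = getattr(tokenizer, field, None)
    info["has_chat_template"] = getattr(tokenizer, "chat_template", None) is not None
    return info
```

Because every lookup has a default, the function also works on stand-in objects such as `types.SimpleNamespace(is_fast=True)`, which is convenient for logging code paths in tests.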
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix docstring, remove artifact, restore valuable comments in tokenizer_utils.py
- Fix get_tokenizer_info() docstring example: correct tokenizer_class to
PreTrainedTokenizerFast, vocab_size to 128000, swap added_tokens_count (256)
and special_tokens_count (3) to match actual Llama-3.2-1B-Instruct output
- Remove accidentally committed "# ... (rest of file unchanged)" diff artifact
- Restore fix_sentencepiece_gguf() docstring with llama.cpp upstream link
- Restore 10 comments containing upstream URLs, model-specific workarounds,
and non-obvious context (issue #292, sentencepiece#121, Starling hack,
Kaggle /tmp limit, Deepseek slow tokenizer, twitter/danielhanchen references)
* Revert "Fix docstring, remove artifact, restore valuable comments in tokenizer_utils.py"
This reverts commit 4e525b734b.
* Revert all deletions, keep only get_tokenizer_info() addition
Restore tokenizer_utils.py to main and add only the new
get_tokenizer_info() function and its __all__ entry.
All comment removals, dead code cleanup, and formatting
changes from the original PR are reverted.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* perf(studio): upgrade to Vite 8 + auto-install bun for 3x faster frontend builds
* fix(studio): make bun-to-npm fallback actually reachable
setup.sh used run_quiet() for the bun install attempt, but run_quiet
calls exit on failure. This killed the script before the npm fallback
could run, making the "falling back to npm" branch dead code.
Replace the run_quiet call with a direct bun invocation that captures
output to a temp file (same pattern, but returns instead of exiting).
Also clean up partial node_modules left by a failed bun install before
falling back to npm, in both setup.sh and build.sh. Without this, npm
inherits a corrupted node_modules tree from the failed bun run.
* fix(studio): restore commonjsOptions for dagre CJS interop
The previous commit removed build.commonjsOptions, assuming Vite 8's
Rolldown handles CJS natively. While optimizeDeps.include covers the
dev server (pre-bundling), it does NOT apply to production builds.
The resolve.alias still points @dagrejs/dagre to its .cjs.js entry,
so without commonjsOptions the production bundle fails to resolve
the CJS default export. This causes "TypeError: e is not a function"
on /chat after build (while dev mode works fine).
Restore the original commonjsOptions block to fix production builds.
* fix(studio): use motion/react instead of legacy framer-motion import
* fix(studio): address PR review findings for Vite 8 + bun upgrade
Fixes:
- Remove bun.lock from repo and add to .gitignore (npm is source of truth)
- Use & bun install *> $null pattern in setup.ps1 for reliable $LASTEXITCODE
- Add Remove-Item node_modules before npm fallback in setup.ps1
- Print bun install failure log in setup.sh before discarding
- Add Refresh-Environment after npm install -g bun in setup.ps1
- Tighten Node version check to ^20.19.0 || >=22.12.0 (Vite 8 requirement)
- Add engines field to package.json
- Use string comparison for _install_ok in build.sh
- Remove explicit framer-motion ^11.18.2 from package.json (motion pulls
framer-motion ^12.38.0 as its own dependency — the old pin caused a
version conflict)
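For illustration, the tightened Node range (^20.19.0 || >=22.12.0) can be expressed as a small predicate. This is a sketch only: the real check lives in the setup scripts, and `node_version_ok` is a hypothetical name.

```python
import re


def node_version_ok(version: str) -> bool:
    """Return True if `version` satisfies ^20.19.0 || >=22.12.0 (hypothetical helper)."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version.strip())
    if match is None:
        return False
    major, minor, patch = (int(part) for part in match.groups())
    # ^20.19.0 means >=20.19.0 and <21.0.0
    in_20_line = major == 20 and (minor, patch) >= (19, 0)
    # >=22.12.0 has no upper bound
    at_least_22_12 = (major, minor, patch) >= (22, 12, 0)
    return in_20_line or at_least_22_12


checks = {
    v: node_version_ok(v)
    for v in ["v20.18.1", "v20.19.0", "v21.7.0", "v22.11.0", "v22.12.0", "v24.1.0"]
}
```

Note that the 21.x line is excluded entirely, which is why a plain ">= 20.19" comparison would be too loose.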
* Fix Colab Node bypass and bun.lock stale-build trigger
Gate the Colab Node shortcut on NODE_OK=true so Colab
environments with a Node version too old for Vite 8 fall
through to the nvm install path instead of silently proceeding.
Exclude bun.lock from the stale-build probe in both setup.sh
and setup.ps1 so it does not force unnecessary frontend rebuilds
on every run.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: Shine1i <wasimysdev@gmail.com>
* Add macOS and Linux desktop shortcuts to install.sh
Adds create_studio_shortcuts() function that creates platform-native
shortcuts after `unsloth studio setup` completes, mirroring the Windows
shortcut behavior from PR #4558.
Linux: .desktop file in ~/.local/share/applications/ and ~/Desktop/
macOS: .app bundle in ~/Applications/ with Info.plist, exec stub, and
optional .icns icon built from unsloth-gem.png via sips+iconutil
Both platforms share a Bash launcher script at
~/.local/share/unsloth/launch-studio.sh that provides:
- Health check with service fingerprint verification
- Port scanning (8888-8908) via ss/lsof
- PID-file single-instance guard (no flock dependency)
- Terminal spawning (macOS: Terminal.app; Linux: gnome-terminal etc.)
- Browser open after health poll with 60s timeout
WSL is skipped (no native desktop environment).
* Fix 6 issues found by 10 parallel reviewers
1. [10/10] Health check now supports wget as fallback to curl via
_http_get() helper, matching the installer's own download() pattern.
Previously wget-only systems would time out on every launch.
2. [9/10] Exe path substitution now escapes sed metacharacters (&, \, |)
and shell single-quotes before injection, preventing launcher
corruption for paths like /opt/R&D/bin/unsloth.
3. [4/10] Linux .desktop Exec= field now quotes the launcher path,
fixing launches from home directories containing spaces.
4. [3/10] macOS AppleScript command now escapes backslashes and
double-quotes before interpolation into do script "...", fixing
Terminal.app launch failures.
5. [3/10] Single-instance guard now uses atomic mkdir instead of
racy check-then-write PID file, preventing duplicate concurrent
launches on rapid double-click.
6. [1/10] Launcher now scans for a free port via _find_launch_port()
instead of always hardcoding -p 8888, so Studio starts correctly
when another service already occupies port 8888.
Also fixed: `open` command on Linux (openvt) no longer incorrectly
triggers the macOS browser-open path -- now gated on uname=Darwin.
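The atomic-mkdir guard from item 5 can be sketched in Python as follows. The launcher itself is a Bash script; the function names and lock path here are assumptions made for the example. Directory creation either succeeds or fails in one step, so there is no window between "check" and "write" for a second launcher to slip through.

```python
import os
import tempfile


def try_acquire_lock(lock_dir: str) -> bool:
    """Return True if this process created the lock directory (hypothetical helper)."""
    try:
        os.mkdir(lock_dir)  # fails if the directory already exists
    except FileExistsError:
        return False
    return True


def release_lock(lock_dir: str) -> None:
    """Remove the lock directory so the next launch can acquire it."""
    os.rmdir(lock_dir)


with tempfile.TemporaryDirectory() as base:
    lock_dir = os.path.join(base, "studio-launcher.lock")  # illustrative path
    first = try_acquire_lock(lock_dir)   # first launcher wins
    second = try_acquire_lock(lock_dir)  # concurrent launcher is rejected
    release_lock(lock_dir)
    third = try_acquire_lock(lock_dir)   # lock is reusable after release
```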
* Fix mktemp guard and exe path escaping from PR review comments
Two real issues identified from automated review comments:
1. Guard mktemp -d failure in macOS icns generation. If mktemp -d
returned empty, dirname would resolve to / and rm -rf would attempt
to delete the root directory. Now checks that the temp dir was
actually created before proceeding.
2. Replace sed-based exe path substitution with a conf file approach.
The previous sed escaping broke paths containing apostrophes
(e.g. /home/O'Connor/) because the '\'' escape introduced
backslashes that were then double-escaped by the metacharacter
pass. Now writes UNSLOTH_EXE to a separate studio.conf file that
the launcher sources at runtime, eliminating all sed metacharacter
and shell quoting interaction issues.
This also addresses the sed -i.bak portability concern (now moot
since sed is no longer used on the launcher file).
* Fix unbound variable crash and per-user lock in launcher
- Use ${UNSLOTH_EXE:-} so set -u does not crash before the friendly
error message when studio.conf is missing or empty.
- Append $(id -u) to the fallback lock path so each user gets their
own lock directory when XDG_RUNTIME_DIR is unset.
* Mark desktop shortcut as trusted for GNOME/Nautilus
On modern GNOME desktops, chmod +x alone is not sufficient to make
a .desktop file launchable by double-click on ~/Desktop. Nautilus
requires the metadata::trusted attribute to be set via gio, otherwise
it shows a warning dialog instead of launching the application.
The repo has both the CodeQL "default setup" (configured in repo
settings) and this advanced workflow file enabled. GitHub does not
allow both simultaneously, causing all PR CI runs to fail with:
"CodeQL analyses from advanced configurations cannot be processed
when the default setup is enabled"
Since the default setup already covers the same languages (Python,
JavaScript/TypeScript) with the same build-mode (none), remove the
redundant advanced workflow file.
* Add CodeQL analysis workflow configuration
* Add Dependabot configuration for package updates
Configure Dependabot to check for updates in various ecosystems weekly.
* Fix dependabot.yml: bun ecosystem, missing dir, grouping for PR #4479
1. studio/frontend uses bun.lock not package-lock.json, so change npm to bun
2. Add missing studio/backend/requirements/ pip entry (consumed by studio/setup.sh)
3. Add groups with patterns ["*"] to all pip/bun/npm entries to batch updates
and avoid 30+ individual Dependabot PRs on the first run
* Consolidate pip blocks to fix overlapping directory violation
GitHub Dependabot forbids multiple same-ecosystem entries with
overlapping directories on the same branch. The root "/" directory
overlapped the 3 nested pip dirs. Merge all 4 pip blocks into one
using the `directories:` (plural) key.
Also remove redundant open-pull-requests-limit from the bun block
since grouping with patterns: ["*"] already limits PR count.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* Try installing causal-conv1d from prebuilt wheels if available
* Prefer installing mamba-ssm from wheel to speed things up
* undo python stack install changes
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Revert "undo python stack install changes"
This reverts commit d943551092.
* add comments
* Fix wheel installer: model detection, platform tags, torch pin, error handling
- Add nemotron-h (hyphen) and granite-4.0-h / granitemoehybrid to model
detection for both causal-conv1d and mamba-ssm. These hybrid Mamba models
were silently skipped since nemotron_h (underscore) never matches real
HF model IDs like nvidia/Nemotron-H-8B-Base, and granite was missing
entirely despite being a supported model in model_config.py and loader.py.
- Fix _causal_conv1d_platform_tag to detect linux_aarch64 via
platform.machine() instead of hardcoding linux_x86_64. Both upstream
releases publish aarch64 wheels. Drop win_amd64 since neither repo
publishes Windows wheels (avoids a wasted HTTP probe on every run).
- Pin torch to >=2.6.0,<2.11.0 instead of <=2.10.0 to add a version floor
and document the wheel coverage range with upstream release links.
- Strip non-numeric suffixes from torch minor version so nightly builds
like 2.7a0 correctly resolve to wheel tag torch2.7 instead of torch2.7a0.
- Use stderr=_sp.PIPE instead of stderr=_sp.STDOUT in the env probe so
torch import warnings do not corrupt the JSON output.
- Add timeout=30 to the env probe subprocess to prevent indefinite hangs.
- Catch Exception (not just ImportError) on the existing-install check so
ABI-broken installs with OSError/RuntimeError are retried rather than
silently accepted.
- Guard uv invocation with shutil.which("uv") to prevent FileNotFoundError
crash when uv is not on PATH. Wrap the top-level ensure calls in
try/except so failures do not kill the training worker.
- Hoist _SSM_MODEL_SUBSTRINGS to module level.
- Remove redundant --torch-backend=auto flag from direct wheel URL install.
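A minimal sketch of the version-suffix stripping and platform detection described above. The helper names are hypothetical and the real worker code may differ; only the behavior (2.7a0 resolving to torch2.7, aarch64 detected via platform.machine()) comes from the commit message.

```python
import platform
import re
from typing import Optional


def torch_wheel_tag(torch_version: str) -> str:
    """Map a torch version string to a wheel tag such as 'torch2.7'."""
    major, minor = torch_version.split(".")[:2]
    # Keep only the leading digits of the minor component ("7a0" -> "7").
    minor_digits = re.match(r"\d+", minor)
    if minor_digits is None:
        raise ValueError(f"unrecognised torch version: {torch_version!r}")
    return f"torch{major}.{minor_digits.group()}"


def linux_platform_tag(machine: Optional[str] = None) -> str:
    """Pick the Linux wheel platform tag from the machine architecture."""
    machine = machine or platform.machine()
    return "linux_aarch64" if machine == "aarch64" else "linux_x86_64"


nightly_tag = torch_wheel_tag("2.7a0")
release_tag = torch_wheel_tag("2.6.0+cu124")
arm_tag = linux_platform_tag("aarch64")
x86_tag = linux_platform_tag("x86_64")
```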
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add LFM2 to causal-conv1d detection; stop training on install failure
- Add "lfm2" to _model_wants_causal_conv1d so Studio picks up the
fast kernel path for Liquid Foundation Model 2.
- Replace silent logger.warning on SSM dependency install failure
with an error event that tells the user to choose another model
and stops the training job immediately.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Catch subprocess timeout in torch probe; narrow import guard to ImportError
- _probe_causal_conv1d_env: wrap subprocess.run in try/except for
TimeoutExpired so a slow torch import returns None (falls back to
PyPI) instead of killing the training job.
- _install_package_wheel_first: narrow except Exception to except
ImportError on the __import__ check so unexpected errors from a
broken module still propagate.
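The probe behavior (separate stderr, a timeout, and None on failure so the caller can fall back to PyPI) might look roughly like this. `probe_env` is a hypothetical stand-in for _probe_causal_conv1d_env, and the probe code passed in is illustrative.

```python
import json
import subprocess
import sys
from typing import Optional


def probe_env(code: str, timeout: float = 30.0) -> Optional[dict]:
    """Run `code` in a child interpreter and parse its stdout as JSON."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,  # keep warnings out of the JSON on stdout
            timeout=timeout,
            text=True,
        )
    except subprocess.TimeoutExpired:
        return None  # slow import: let the caller fall back
    if result.returncode != 0:
        return None
    try:
        return json.loads(result.stdout)
    except json.JSONDecodeError:
        return None


# A probe that writes a warning to stderr still yields clean JSON.
noisy = probe_env(
    "import sys, json; sys.stderr.write('warning\\n'); "
    "print(json.dumps({'torch': '2.7'}))"
)
# A probe that outlives the timeout returns None instead of raising.
slow = probe_env("import time; time.sleep(10)", timeout=0.5)
```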
* Remove unconditional torch pin from install_python_stack
The torch>=2.6.0,<2.11.0 pin was added to ensure prebuilt
causal-conv1d / mamba-ssm wheels exist, but it runs at install
time for all users regardless of model choice. This can downgrade
or unnecessarily upgrade torch. The worker already handles wheel
compatibility at training time by probing the environment and
falling back to PyPI, so the install-time pin is not needed.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* feat(chat): ghost-style tool containers
Remove borders and card styling from tool call UI. ToolFallback
uses minimal padding with indented content. ToolGroup defaults
to ghost variant with subtle background for multi-tool grouping.
* feat(chat): compact web search source pills
Switch sources from vertical full-width badges to horizontal
wrapping pills with smaller icons.
* feat(chat): left-accent code and terminal tool UI
Replace bordered card layout with a left border accent for
Python and Terminal tool output. Add timer cleanup on unmount
for the copy button in both components.
* feat(chat): inline latex and clickable links
Enable single-dollar $...$ math rendering via createMathPlugin.
Add styled link component with target=_blank for external links.
* fix(chat): inline generating indicator, static tailwind classes, misc fixes
Move generating indicator from viewport footer into assistant
message using AnimatedShinyText shimmer. Only shows when message
content is empty, hides once tool calls or text appear.
Use static size class map in SourceIcon for Tailwind v4 compat.
Use unique keys for web search sources. Remove px-3 from ghost
tool group variant.
* fix(chat): only show generating indicator while message is running
Hide the shimmer when message is cancelled or errored with no
content, preventing stale loading UI on empty completed messages.
* fix: escape currency dollar signs in LaTeX math rendering and fix TS build error
- Add preprocessLaTeX() in lib/latex.ts to escape currency patterns ($5, $1,000, $5.99, $100K)
before they reach the math parser, preventing false positives when singleDollarTextMath is enabled.
Code blocks and already-escaped dollars are left untouched.
- Use preprocessLaTeX via useMemo in markdown-text.tsx so Streamdown receives clean input.
- Fix TS18048 in thread.tsx: message.status?.type (optional chaining) since status can be undefined.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Bump Data Designer to 0.5.4 (removes litellm dependency)
NVIDIA Data Designer v0.5.4 removes litellm entirely and replaces it
with native OpenAI and Anthropic adapters. This follows the litellm
supply chain incident where versions 1.82.7 and 1.82.8 were compromised
with a credential stealer.
Release notes: https://github.com/NVIDIA-NeMo/DataDesigner/releases/tag/v0.5.4
Changes:
- Bump data-designer, data-designer-config, data-designer-engine to 0.5.4
- Sync data-designer-deps.txt with 0.5.4 engine requirements:
- Added: chardet, fsspec, mcp
- Removed: python-json-logger, pymupdf, pymupdf4llm, mammoth
(these remain in the unstructured-seed plugin which still needs them)
- duckdb constraint relaxed from <1.5 to <2 (upstream fixed record_batch)
- Bump plugin lower bound to >=0.5.4
* Keep pymupdf, pymupdf4llm, mammoth in data-designer-deps
The unstructured-seed plugin is installed with --no-deps, so its
pyproject.toml dependencies are not auto-resolved. These three
packages are needed by the seed route (studio/backend/routes/
data_recipe/seed.py) and must remain in the explicit deps list.
* feat: Implement Q-GaLore optimizer and custom embedding learning rate in the Unsloth trainer.
* feat: Implement QGaLoreAdamW8bit optimizer with 8-bit states, GaLore low-rank gradient projection, and optional INT8 weight quantization, along with supporting projector and tests.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* feat: Introduce Q-GaLore AdamW optimizer with low-rank quantized gradient projection and integrate into the trainer, along with dedicated tests.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* feat: Implement Q-GaLore AdamW optimizer with gradient projection and quantization, including trainer integration and corresponding tests.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix 3 bugs in Q-GaLore optimizer and add weight_quant forward hooks
1. Fix use-after-delete crash: move `del p._saved_data` after the
weight decay block so decoupled weight decay can reference the
current weights correctly (p.data).
2. Fix substring matching in make_q_galore_param_groups: split
parameter names on "." and check exact component matches to
prevent false positives (e.g. "not_q_proj" matching "q_proj").
3. Implement forward pre-hooks for weight_quant: after the optimizer
quantizes weights to INT8, replace p.data with a 1-element
placeholder to free float memory. A register_forward_pre_hook
dequantizes back to float before each forward pass. The trainer
calls install_weight_quant_hooks() when weight_quant is enabled.
4. Update test_weight_decay_uses_saved_data to match the fixed code
path (decoupled decay uses p.data, expected value 2.7). Add
test_weight_quant_hook_restores_float to verify the INT8-to-float
hook round-trip.
All 24/24 Q-GaLore tests pass. Benchmarked on Llama-3.2-1B-Instruct
FFT: Q-GaLore saves 32% VRAM (10.63 -> 7.24 GB) with better loss
convergence (1.3 vs 2.0 at step 100). No regressions in 31-notebook
sweep across Llama, Qwen, Mistral, Phi, Gemma, vision, and GRPO.
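The substring false positive from item 2 is easy to reproduce. A sketch of exact component matching, where `matches_target` is a hypothetical name and not the actual make_q_galore_param_groups code:

```python
def matches_target(param_name: str, target_modules: list) -> bool:
    """True if any dot-separated component of the name equals a target exactly."""
    components = param_name.split(".")
    return any(target in components for target in target_modules)


targets = ["q_proj", "v_proj"]
real_hit = matches_target("model.layers.0.self_attn.q_proj.weight", targets)
false_positive = matches_target("model.layers.0.not_q_proj.weight", targets)
# The old substring test would have accepted the second name.
substring_would_match = any(t in "model.layers.0.not_q_proj.weight" for t in targets)
```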
* Default weight_quant to False in QGaloreConfig
Benchmarks show weight_quant=True adds ~1 GB on Llama-3.2-1B due to
INT8 copy/scale overhead exceeding savings from the placeholder trick.
Users can still opt in explicitly. The optimizer logic is unchanged.
* Optimize Q-GaLore projector and optimizer step performance
Projector (q_galore_projector.py):
- Use torch.svd_lowrank with oversampling p=10 (Halko et al. 2009) instead
of full SVD for large matrices. Falls back to full SVD when min(m,n) <= 2*rank.
SVD steps are 6-8x faster on Llama-3.2-1B (22s -> 3s for first step).
- Cache the dequantized ortho matrix between project() and project_back() to
avoid redundant dequantization when quant=True.
- Replace F.cosine_similarity with torch.dot for 1-D unit vectors in the
adaptive schedule. Remove unused torch.nn.functional import.
- Use collections.deque(maxlen=queue_size) instead of list with manual pop(0).
Optimizer (q_galore_adamw.py):
- Remove redundant .clone() on dequantized weights (line 151) and on float
data before re-quantization (line 211). _dequantize already returns a fresh
tensor and _quantize/_quantize_stochastic only reads its input.
- Consolidate per-group torch.cuda.synchronize() into a single call after
all param groups complete.
- Use torch.empty instead of torch.zeros for the scalar placeholder tensor
that is never read.
Verified: 24/24 unit tests pass. Llama-3.2-1B 61-step training produces
losses within 0.24% relative diff (correlation >0.9999) of the original.
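As a small illustration of the deque change in the projector: a bounded deque evicts the oldest entry on append, so no manual pop(0) is needed. The values and queue size below are made up for the example.

```python
from collections import deque

queue_size = 3  # illustrative; the projector's real queue_size may differ
recent = deque(maxlen=queue_size)

for similarity in [0.91, 0.93, 0.95, 0.97, 0.99]:
    recent.append(similarity)  # oldest entry is dropped once the deque is full

window = list(recent)
```

Appending to a full deque is O(1), whereas list.pop(0) shifts every remaining element.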
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix: remove auto wandb.finish() after train() to allow post-training evaluate()
The prepare_for_training_mode wrapper unconditionally called wandb.finish()
after trainer.train() completed. This terminated the active W&B run, causing
trainer.evaluate() to fail with "You must call wandb.init() before wandb.log()".
Users who need multiple training runs in one session can call wandb.finish()
manually between runs to avoid data overwriting.
Fixes #3954
* fix: defer wandb.finish() to next train() call instead of removing it
Instead of calling wandb.finish() at the end of train() (which breaks
evaluate/log) or removing it entirely (which causes data overwriting on
multiple train() calls), defer it to the start of the next train() call.
This way:
- train() + evaluate() works (run stays open after train)
- train() + train() gets separate W&B runs (previous run finished first)
- train() + evaluate() + train() also works correctly
Also resets HF's WandbCallback._initialized flag so it re-calls
wandb.init() for the new run.
Fixes #3954
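The deferred-finish idea can be sketched with a stand-in tracker. Nothing here is the wandb API or the actual Unsloth wrapper; `FakeTracker` and `Trainer` are hypothetical classes that only record the order of calls.

```python
class FakeTracker:
    """Stand-in for an experiment tracker; records calls for inspection."""

    def __init__(self):
        self.events = []
        self.active = False

    def init(self):
        self.active = True
        self.events.append("init")

    def log(self, name):
        if not self.active:
            raise RuntimeError("log() called without an active run")
        self.events.append(f"log:{name}")

    def finish(self):
        self.active = False
        self.events.append("finish")


class Trainer:
    def __init__(self, tracker):
        self.tracker = tracker
        self._finish_pending = False

    def train(self):
        # Close the previous run only when a new train() begins.
        if self._finish_pending:
            self.tracker.finish()
        self.tracker.init()
        self.tracker.log("train")
        self._finish_pending = True  # run stays open for evaluate()

    def evaluate(self):
        self.tracker.log("eval")


tracker = FakeTracker()
trainer = Trainer(tracker)
trainer.train()
trainer.evaluate()  # works: the run is still open
trainer.train()     # previous run is finished first, then a new one starts
```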
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* feat(db): add SQLite storage layer for training history
* feat(api): add training history endpoints and response models
* feat(training): integrate DB persistence into training event loop
* feat(ui): add training history views and card grid
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix(studio): address review issues in training history persistence
- Strip hf_token/wandb_token from config before SQLite storage
- Add UUID suffix to job_id for collision resistance
- Use isfinite() for 0.0 metric handling throughout
- Respect _should_stop in error event finalization
- Run schema DDL once per process, not per connection
- Close connection on schema init failure
- Guard cleanup_orphaned_runs at startup
- Cap _metric_buffer at 500 entries
- Make FLUSH_THRESHOLD a class constant
- Map 'running' to 'training' phase in historical view
- Derive LR/GradNorm from history arrays in historical view
- Fix nested button with div[role=button] in history cards
- Guard String(value) against null/undefined in config popover
- Clear selectedHistoryRunId on auto tab switch
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix(studio): address round-2 review findings across training backend and frontend
Backend (training.py):
- Move state mutation after proc.start() so a failed spawn does not wedge
the backend with is_training=True
- Create DB run row eagerly after proc.start() so runs appear in history
during model loading, not after first metric event
- Rewrite _flush_metrics_to_db() with snapshot-before-insert pattern to
preserve metrics arriving during the write and retain buffer on failure
- Guard eval_loss with float() coercion and math.isfinite(), matching the
existing grad_norm guard
- Increase pump thread join timeout from 3s to 8s to cover SQLite's
default 5s lock timeout
Frontend (studio-page.tsx):
- Fix history navigation: check isTrainingRunning instead of
showTrainingView in onSelectRun so completed runs are not misrouted
- Replace activeTab state + auto-switch useEffect with derived tab to
eliminate react-hooks/set-state-in-effect lint violation
Frontend (historical-training-view.tsx):
- Add explicit "running" branch to message ternary so running runs no
longer fall through to "Training errored"
- Derive loading from detail/error state and move cleanup to effect
return to eliminate react-hooks/set-state-in-effect lint violation
Frontend (progress-section.tsx):
- Derive stopRequested from isTrainingRunning && stopRequestedLocal to
eliminate react-hooks/set-state-in-effect lint violation and remove
unused useEffect import
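One possible reading of the snapshot-before-insert pattern mentioned for _flush_metrics_to_db(), as a hedged sketch. `MetricBuffer` and `write_batch` are hypothetical; the 500-entry cap is taken from the earlier review-fix commit.

```python
import threading


class MetricBuffer:
    MAX_BUFFERED = 500  # cap on buffered entries

    def __init__(self, write_batch):
        self._write_batch = write_batch  # callable that persists a list of rows
        self._lock = threading.Lock()
        self._buffer = []

    def add(self, metric):
        with self._lock:
            self._buffer.append(metric)
            del self._buffer[:-self.MAX_BUFFERED]  # drop oldest beyond the cap

    def flush(self):
        # Take a snapshot and clear under the lock, then write outside it,
        # so metrics arriving during the write land in the fresh buffer.
        with self._lock:
            snapshot = list(self._buffer)
            self._buffer.clear()
        if not snapshot:
            return True
        try:
            self._write_batch(snapshot)
        except Exception:
            with self._lock:
                self._buffer[:0] = snapshot  # restore ahead of newer metrics
                del self._buffer[:-self.MAX_BUFFERED]
            return False
        return True


written = []
fail_next = [True]


def write_batch(rows):
    if fail_next[0]:
        fail_next[0] = False
        raise OSError("database is locked")
    written.extend(rows)


buf = MetricBuffer(write_batch)
buf.add({"step": 1})
buf.add({"step": 2})
first_ok = buf.flush()   # write fails, rows are retained
buf.add({"step": 3})
second_ok = buf.flush()  # retry succeeds with all three rows in order
```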
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix(studio): resolve 3 remaining bugs from round-2 review
1. Stuck on Current Run tab [12/20]: Only force "current-run" tab when
isTrainingRunning is true, not when stale completed-run data exists.
After training ends, users can freely navigate to Configure.
2. Incomplete metric sanitization [7/20]: Apply float() coercion and
isfinite() guards to loss and learning_rate, matching the existing
pattern used by grad_norm and eval_loss. Prevents TypeError from
string values and NaN leaks into history arrays.
3. Stop button state leak across runs [10/20]: Add key={runtime.jobId}
to ProgressSection so React remounts it when a new run starts,
resetting stopRequestedLocal state.
* fix(studio): deduplicate loss/lr sanitization in training event handler
Reuse _safe_loss/_safe_lr from the progress update block instead of
re-sanitizing the same raw event values for metric history.
* fix(studio): restore loss > 0 guard to prevent eval steps injecting 0.0 into metric histories
Round-2/3 fixes relaxed the history append guard from `loss > 0` to
`loss is not None`, which let eval-only log events (where loss defaults
to 0.0) append fake zeros into loss_history and lr_history. Restore the
`loss > 0` check to match the worker's own has_train_loss gate. The
float() coercion and isfinite() sanitization from round-3 remain intact.
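The combined guard (float() coercion, isfinite(), and loss > 0) might be sketched like this. The helper names and event shape are assumptions; only the guard conditions come from the commit messages.

```python
import math


def safe_metric(value):
    """Coerce to float; return None for missing, non-numeric, NaN or inf values."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return number if math.isfinite(number) else None


def append_train_metrics(event, loss_history, lr_history):
    loss = safe_metric(event.get("loss"))
    lr = safe_metric(event.get("learning_rate"))
    # loss > 0 filters eval-only events, where loss defaults to 0.0.
    if loss is not None and loss > 0:
        loss_history.append(loss)
        if lr is not None:
            lr_history.append(lr)


loss_history, lr_history = [], []
events = [
    {"loss": "1.25", "learning_rate": 2e-4},  # string values are coerced
    {"loss": 0.0, "eval_loss": 1.1},          # eval-only event, skipped
    {"loss": float("nan")},                   # NaN, skipped
    {"loss": 0.9, "learning_rate": 1e-4},
]
for event in events:
    append_train_metrics(event, loss_history, lr_history)
```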
* fix(studio): resolve training history bugs — nullable loss/lr, tab nav, sparkline
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
The wheel currently ships frontend/public/, frontend/src/, and
frontend/*.lock alongside frontend/dist/. These are build-time inputs
that Vite already copies into dist/ during the build step:
- public/ is copied verbatim into dist/ by vite build (28.6 MB duplicate)
- src/ is TSX source compiled into dist/assets/*.js (2.1 MB, not used at runtime)
- *.lock files are package manager lockfiles (0.9 MB, not used at runtime)
The backend only serves from frontend/dist/ (see main.py setup_frontend
and run.py frontend_path). Nothing references public/ or src/ at runtime.
This drops the wheel from ~62.7 MB to ~31 MB.
* feat(windows): add Studio desktop/Start shortcuts with health-check launcher
* chore(windows): bundle sloth.ico and set shortcut icons when valid
* chore(windows): add images/sloth.ico
* fix(windows): guard PSScriptRoot for Studio shortcut icon in iex installs
* fix(install): high-DPI sloth.ico and relocate to studio/frontend/publi
* chore(studio): update sloth.ico for clearer desktop and shell icons
* chore(studio): use unsloth.ico for Studio shortcut icon
* feat(windows): improve Studio shortcut launcher (fast health + browser UX)
* fix(windows): stable unsloth.ico URL and Unicode-safe Studio launcher scripts
* fix(windows): escape $ in exe path and write launcher UTF-8 with BOM
* fix(windows): skip shortcuts when Desktop or APPDATA paths are missing
* fix(install): log shortcut/icon/port failures and warn early on missing paths
* fix(install): guard missing LOCALAPPDATA before shortcut paths
* fix(install): harden New-StudioShortcuts and improve success messaging
* fix(install): include port 8908 in studio health check
* fix(install): fix launch-studio.ps1 quoting
* Fix launcher edge cases and normalize indentation in install.ps1
- Handle silent timeout: show a message when Studio is still starting
but did not become healthy within the timeout, instead of exiting
with no feedback
- Add -NoProfile to the visible PowerShell terminal launch so the
user profile cannot hang or error before Studio runs
- Add a named mutex (Local\UnslothStudioLauncher) to prevent
double-click from spawning duplicate terminals; second instance
polls for health and opens the browser when ready
- Normalize indentation inside New-StudioShortcuts outer try block
from mixed 8/12-space to consistent 12-space
* Simplify Get-CandidatePorts port dedup with Sort-Object -Unique
Replace the foreach/-notcontains loop with a single pipeline:
$ports = (@($basePort) + $listening) | Sort-Object -Unique
* Harden health probe and handle abandoned mutex in launcher
- Test-StudioHealth now checks resp.service == 'Unsloth UI Backend' to
avoid fingerprinting collisions with other local services on the same
port range.
- Wrap the mutex WaitOne(0) call in a try/catch for
AbandonedMutexException so the launcher recovers gracefully when a
previous instance was killed while holding the mutex.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix: prevent UnicodeEncodeError on Windows CP1252 consoles in studio setup
On Windows, `unsloth studio setup` crashes with a UnicodeEncodeError
when install_python_stack.py tries to print Unicode status glyphs
(✅, ❌, ⚠️) to a console that uses a legacy code page like CP1252.
Add a _safe_print() helper that catches UnicodeEncodeError and
gracefully degrades emoji to ASCII equivalents ([OK], [FAIL], [!]).
Replace all print() calls that emit Unicode glyphs with _safe_print().
Fixes #4509
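A minimal sketch of the fallback, assuming a simplified single-string signature. The actual _safe_print() in install_python_stack.py may differ; the demo simulates a CP1252 console with an in-memory stream.

```python
import io
import sys

# Glyph -> ASCII mapping from the commit message; the emoji-variant warning
# sign (with U+FE0F) is listed before the bare code point.
_ASCII_FALLBACKS = {
    "\u2705": "[OK]",
    "\u274c": "[FAIL]",
    "\u26a0\ufe0f": "[!]",
    "\u26a0": "[!]",
}


def safe_print(text: str, file=None) -> None:
    """Print `text`, degrading to ASCII if the stream cannot encode it."""
    stream = file if file is not None else sys.stdout
    try:
        print(text, file=stream)
    except UnicodeEncodeError:
        for glyph, ascii_text in _ASCII_FALLBACKS.items():
            text = text.replace(glyph, ascii_text)
        # Replace anything else the stream still cannot represent.
        encoding = getattr(stream, "encoding", None) or "ascii"
        text = text.encode(encoding, errors="replace").decode(encoding)
        print(text, file=stream)


# Simulate a legacy CP1252 console.
raw = io.BytesIO()
console = io.TextIOWrapper(raw, encoding="cp1252", newline="\n")
safe_print("\u2705 done", file=console)
safe_print("\u26a0\ufe0f check PATH", file=console)
console.flush()
output = raw.getvalue()
```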
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Replace Unicode dashes with ASCII in install_python_stack.py
Box-drawing (U+2500) and em dash (U+2014) chars in section dividers
and comments are themselves not representable on CP1252 -- replace
with plain ASCII dashes for consistency with the fix.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Add GRPO resume vLLM cleanup guard
* Guard GRPO resume sleep on vLLM sleep mode
* Harden GRPO resume vLLM cleanup guard
- Wrap llm.sleep(1) in try/except so a failed sleep does not block
training resume (best-effort cleanup)
- Also check kwargs["model_path"] which transformers.Trainer.train()
still accepts and normalizes to resume_from_checkpoint internally
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* feat(chat): regroup settings sidebar into Model, Sampling, Tools, and Preferences sections
Split the monolithic Settings collapsible into focused sections with
icons. Model section shows context length and KV cache dtype for GGUF
models, trust remote code for non-GGUF. Tools section groups auto heal,
max tool calls, and tool call timeout. Preferences section holds auto
title toggle.
* feat(chat): persist collapsible section open/closed state in localStorage
Remember which sections the user expanded or collapsed across sidebar
toggles, mobile sheet reopens, and browser sessions.
* fix(chat): harden collapsible state persistence and restore defaultOpen
- Validate localStorage values are booleans before using them, preventing
corrupted entries like string "false" from being treated as truthy
- Use Object.hasOwn() instead of `in` operator to avoid prototype chain
matches on keys like "constructor" or "toString"
- Restore defaultOpen={true} on Model and Preferences sections so they
are expanded on first visit, matching the old Settings section behavior
- Fix misleading Context Length description to reflect it is read-only
- Downgrade console.error to console.warn for non-critical localStorage
parse failures
* fix(chat): remove redundant disabled styles on Context Length input
The Input component already applies opacity-50 and cursor-not-allowed
via its disabled: variants. Specifying them unconditionally in the
className is redundant.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Ensures both install scripts always pull a version that has the
litellm removal fix. Without the pin, stale uv/pip caches could
resolve the older 2026.3.10 which still had litellm in
data-designer-deps.txt, causing setup to fail at step 8/11
while PyPI has litellm quarantined.
litellm has been quarantined on PyPI due to a supply chain attack
in version 1.82.8 (malicious credential-stealing .pth file).
No versions are currently installable, which blocks
`unsloth studio setup` at step 8/11 (data-designer deps).
Remove litellm from the single-env data-designer requirements
so setup completes. litellm can be re-added once PyPI lifts the
quarantine.
Ref: https://github.com/BerriAI/litellm/issues/24512
* Revert "fix: handle prompt/completion datasets in slow-path BOS detection (#4548)"
This reverts commit fca83182af.
* fix: support completion_only_loss=True with prompt/completion dataset columns
When completion_only_loss=True, TRL rejects formatting_func but Unsloth's
patched _prepare_dataset/_prepare_non_packed_dataloader assumed either
formatting_func or dataset_text_field was always set, causing a catch-22.
Now handles prompt/completion columns as a third case for BOS token
detection, with a safe None fallback for all other cases.
(cherry picked from commit 978f78c6f1)
* fix: handle prompt/completion datasets in slow-path BOS detection
The slow-path check_text blocks in rl_replacements.py and
tokenizer_utils.py crash when a prompt/completion dataset is used
because they unconditionally access dataset[0][dataset_text_field]
even when the dataset does not have a text field.
This fixes both files to:
- Default dataset_text_field to None instead of raising when undefined
- Detect prompt/completion columns and concatenate them for BOS check
- Guard with isinstance(str) on both prompt and completion to handle
conversational format (list of dicts) by setting test_text to None
- Add test_text is not None guard on has_bos_token_already to prevent
AttributeError on NoneType.startswith()
This is the slow-path complement to unslothai/unsloth-zoo#560 which
fixes the fast-path in sft_prepare_dataset.
Closes #4486
(cherry picked from commit b6ce5786d0)
* fix: preserve chat_template BOS check when test_text is None
The has_bos_token_already guard wrapped both test_text.startswith()
and bos_token in chat_template with test_text is not None, which
disabled the chat_template BOS detection for conversational datasets
where test_text is set to None.
Split the guard so test_text is not None only applies to the
startswith() call, while bos_token in chat_template is always checked.
(cherry picked from commit 40bd8b8917)
---------
Co-authored-by: Ayush Kushwaha <148432773+ayushkushwaha240@users.noreply.github.com>
* fix: system prompt was dropped in unsloth text and vision inference
* refactor: simplify system prompt message construction
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: use multimodal typed content parts for vision system message and add fallback
The system message content must use typed content parts
([{"type": "text", "text": ...}]) instead of a plain string to match
the multimodal processor contract (consistent with the audio path).
Plain strings cause some processors (e.g. LLaVA) to silently drop the
system prompt.
Also wraps processor.apply_chat_template in try/except so models that
reject the system role gracefully fall back to no system message with
a warning log.
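A minimal sketch of the typed-content system message and the fallback, under stated assumptions: `apply_chat_template` here is any callable standing in for processor.apply_chat_template, and `build_messages` / `render_prompt` are hypothetical names, not the functions in the Studio backend.

```python
import logging

logger = logging.getLogger(__name__)


def build_messages(user_text, system_prompt=None):
    """Build a message list that uses typed content parts throughout."""
    messages = []
    if system_prompt:
        messages.append(
            {"role": "system", "content": [{"type": "text", "text": system_prompt}]}
        )
    messages.append(
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": user_text}]}
    )
    return messages


def render_prompt(apply_chat_template, user_text, system_prompt=None):
    """Try with the system message first; retry without it if the template rejects it."""
    try:
        return apply_chat_template(build_messages(user_text, system_prompt))
    except Exception as exc:
        if not system_prompt:
            raise  # nothing to fall back to
        logger.warning("Chat template rejected the system role (%s); retrying without it", exc)
        return apply_chat_template(build_messages(user_text))
```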
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: capture and log original exception in vision system prompt fallback
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix: always show chat tool icons, gray out when model doesn't support them
Tool icons (Think, Search, Code) were hidden unless a model was loaded
and supported those features. Now they're always visible so users can
see and pre-select them. If a loaded model doesn't support a feature,
the button gets grayed out and disabled instead of being removed.
* refactor: centralize Qwen thinking params in store
* fix: disable tool buttons when no model is loaded
Change disabled condition from `modelLoaded && !supportsX` to
`!modelLoaded || !supportsX` so buttons are grayed out both when
no model is loaded and when the loaded model lacks the capability.
* Fix Qwen3 param clobbering and restore SuggestionItem capability guards
- Revert setReasoningEnabled() in the store to a pure boolean setter.
Moving the Qwen3 param logic into it caused reconnect/load/refresh
paths (which also call setReasoningEnabled) to silently overwrite
user-customized or server-provided temperature/topP/topK/minP.
- Restore applyQwenThinkingParams() as a standalone function called
only from explicit user toggle click handlers in thread.tsx and
shared-composer.tsx, matching the pre-PR behavior.
- Re-add supportsReasoning/supportsTools guards in the SuggestionItem
click handler so that clicking a suggestion card only activates
tool toggles the loaded model actually supports.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
PR #4543 removed useScrollLock from ReasoningRoot, causing the thread
viewport to jump when a user collapses a reasoning panel. Restore the
hook to freeze scrollTop during the 200ms collapse animation, matching
the pattern used by tool-fallback.tsx and tool-group.tsx.
* Fix port conflict detection when loopback address is held by another process
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use getaddrinfo for IPv6 host support, restore emojis in terminal output
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Guard against conn.pid being None in _get_pid_on_port
psutil.net_connections() can return entries with pid=None when the
current user lacks privileges to see the owning process (common on
macOS without root, Windows without admin, and some Linux configs).
psutil.Process(None) does not raise -- it silently returns the
current process, which would make the warning incorrectly blame
Unsloth Studio itself for blocking the port.
Skip entries with pid=None so the caller falls back to the generic
"port is already in use" message instead.
* Update studio/backend/run.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* fix(chat): stabilize thinking panel and thread scroll during generation
* fix: match ChatGPT scroll and thinking panel behavior
- Remove autoScroll={false} from thread viewport to restore default
follow-scroll during streaming (pauses when user scrolls up, resumes
at bottom)
- Rewrite reasoning panel state: auto-opens on stream start, user can
close during streaming, auto-collapses when reasoning ends, user can
re-expand after collapse
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix(studio): harden system prompt persistence and storage fallback
* Exclude checkpoint from localStorage persistence for PR #4538
checkpoint is backend-owned state -- refresh() already syncs it from
getInferenceStatus() on every page load. Persisting it to localStorage
causes a stale model ID to survive across backend restarts, which
prevents auto-load from triggering when no model is actually loaded.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Fixes #4492
The embedding_learning_rate parameter was assigned to a local variable
instead of self.embedding_learning_rate, causing UnslothTrainer.create_optimizer()
to always get None via getattr and silently fall back to a single param group.
Bug: embedding_learning_rate = embedding_learning_rate (no-op)
Fix: self.embedding_learning_rate = embedding_learning_rate
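The effect of the no-op assignment is easy to reproduce in isolation. The classes below are toy stand-ins written for this illustration, not the real training-arguments class; `pick_embedding_lr` only mimics the getattr lookup that create_optimizer() is described as doing.

```python
class BuggyArgs:
    def __init__(self, embedding_learning_rate=None):
        # No-op: rebinds the local name, nothing is stored on the instance.
        embedding_learning_rate = embedding_learning_rate


class FixedArgs:
    def __init__(self, embedding_learning_rate=None):
        self.embedding_learning_rate = embedding_learning_rate


def pick_embedding_lr(args):
    """Mimic the optimizer-side lookup: a missing attribute silently becomes None."""
    return getattr(args, "embedding_learning_rate", None)
```

Because the lookup uses getattr with a default, the buggy version raises no error; it just never sees the configured rate.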
* Fix Studio silently exiting on Windows without error output
On Windows, `unsloth studio` launches a child process via
subprocess.Popen to run the server in the studio venv. If the child
crashes (e.g. due to a missing package), the parent just calls
typer.Exit(rc) with no message -- the user sees "Launching Unsloth
Studio... Please wait..." and then the prompt returns with zero
feedback.
Root cause: `data_designer_unstructured_seed` is imported at the top
level in seed.py. If this package is not installed in the studio venv,
the entire import chain (seed.py -> routes/__init__.py -> main.py ->
run_server()) crashes with ModuleNotFoundError. Since run.py has no
try/except around run_server() and studio.py does not report nonzero
exit codes, the failure is completely silent.
Changes:
- run.py: wrap run_server() in try/except, print clear error with
traceback to stderr. Also reconfigure stderr encoding on Windows so
tracebacks with non-ASCII paths do not cause secondary failures.
- studio.py: print an error message when the child process exits with
a nonzero code on Windows, so the user knows something went wrong.
- seed.py: make data_designer_unstructured_seed import optional with
a try/except fallback. The server starts normally and only returns
HTTP 500 if the unstructured seed endpoints are actually called.
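The optional-import pattern used for seed.py looks roughly like this. It is a sketch under assumptions: `optional_import` and `require_plugin` are hypothetical helpers, and the RuntimeError stands in for whatever the route layer turns into an HTTP 500.

```python
import importlib


def optional_import(module_name):
    """Import a module if it is installed; return None instead of raising."""
    try:
        return importlib.import_module(module_name)
    except ImportError:  # ModuleNotFoundError is a subclass of ImportError
        return None


def require_plugin(plugin, feature):
    """Fail only when the optional feature is actually used."""
    if plugin is None:
        raise RuntimeError(f"{feature} is unavailable: optional dependency is not installed")
    return plugin


# Resolved once at import time; the server still starts if this is None.
_unstructured_seed = optional_import("data_designer_unstructured_seed")
```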
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Skip Anaconda/Miniconda Python when creating Studio venv on Windows
Conda-bundled CPython ships modified DLL search paths that prevent
torch from loading c10.dll on Windows. The Studio server fails
silently at startup because the venv was created with conda's Python.
Standalone CPython (python.org, winget, uv) does not have this issue.
Both install.ps1 and setup.ps1 now skip any Python binary whose path
contains conda, miniconda, anaconda, miniforge, or mambaforge when
selecting the interpreter for the studio venv. If only conda Python
is available, the scripts print an error with instructions to install
standalone CPython.
* Fix multi-file preview crash and improve setup.ps1 Python discovery
Addresses review findings [10/10] and [8/10]:
1. seed.py: _read_preview_rows_from_multi_files() had a hard import
of build_multi_file_preview_rows inside the function body, bypassing
the optional-plugin guard. Moved it into the top-level try/except
block and added a None guard matching the other functions.
2. setup.ps1: Python discovery now probes py.exe (Python Launcher)
first, uses Get-Command -All to look past conda entries that shadow
standalone CPython further down PATH, skips WindowsApps stubs, and
resolves the actual executable path so venv creation does not
re-resolve back to a conda interpreter.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Check sys.base_prefix to catch venvs created from conda Python
A venv created from conda Python (e.g. C:\Users\danie\.venv) has a
path that does not contain "conda", but sys.base_prefix still points
to the conda install (e.g. C:\Users\danie\miniconda3). The previous
path-only check missed this case entirely.
Both install.ps1 and setup.ps1 now use a Test-IsConda helper that
checks both the executable path AND sys.base_prefix against the
conda/miniconda/anaconda/miniforge/mambaforge pattern. This catches:
- Direct conda Python executables
- Venvs created from conda Python (base_prefix reveals the origin)
* Fix install.ps1 passing version string to uv venv instead of resolved path
Find-CompatiblePython returned a bare version string (e.g. "3.13")
which was passed to `uv venv --python 3.13`. uv performs its own
interpreter discovery and can resolve that version string back to a
conda Python, defeating the entire conda-skip logic.
Now Find-CompatiblePython returns a hashtable with both .Version (for
display) and .Path (the resolved absolute executable path). The venv
is created with `uv venv --python <absolute-path>`, ensuring uv uses
the exact interpreter we validated.
* Quote resolved Python path in uv venv call for paths with spaces
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(studio): prevent ModuleNotFoundError in dataset.map() on Windows
On Windows, dataset.map() uses "spawn", which requires workers to
import compiled modules from disk. Previously, clear_unsloth_compiled_cache()
deleted the entire directory, causing workers to crash when looking for
UnslothSFTTrainer.py.
Changes:
1. Added `preserve_patterns` to cache cleanup to keep `Unsloth*Trainer.py`
on Windows while clearing model-specific files.
2. Added the cache directory to PYTHONPATH for spawn workers.
Linux/macOS behavior is unchanged.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix spawn-platform coverage, CWD path mismatch, and race condition for PR #4473
- Extend platform guard from win32-only to include macOS (also uses spawn
since Python 3.8, same ModuleNotFoundError would occur)
- Replace fragile CWD-based PYTHONPATH registration with centralized
register_compiled_cache_on_path() that uses the same __file__-relative
_CACHE_DIRS already used by cache_cleanup -- fixes path mismatch when
studio is launched from a directory other than the repo root
- Move PYTHONPATH registration to the top of _train_worker(), before any
dataset.map() call (previously it ran late in config assembly, after
dataset formatting which also calls dataset.map())
- Update inference.py model-unload to preserve trainer files on spawn
platforms, preventing a race where unloading a model via inference tab
would delete UnslothSFTTrainer.py while training workers are importing it
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix cache-dir precedence reversal in register_compiled_cache_on_path()
Iterating _CACHE_DIRS in forward order while calling insert(0) each time
reverses the declared priority: later entries shadow earlier ones. When
multiple compiled-cache directories exist, spawned workers could import a
stale trainer from the wrong cache.
Fix: iterate in reverse so that the highest-priority entry (first in
_CACHE_DIRS) is inserted last and ends up at position 0 in sys.path and
PYTHONPATH.
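The ordering problem and its fix can be shown on a plain list. This sketch takes the path list as a parameter instead of touching sys.path, and `register_on_path` is a hypothetical name, not the real register_compiled_cache_on_path().

```python
def register_on_path(cache_dirs, path):
    """Prepend cache_dirs to path so that cache_dirs[0] ends up at index 0."""
    # Reverse iteration: the highest-priority entry is inserted last,
    # so it lands in front of the lower-priority ones.
    for directory in reversed(cache_dirs):
        if directory in path:
            path.remove(directory)
        path.insert(0, directory)
    return path
```

Iterating in forward order with the same insert(0) call would leave the last entry of cache_dirs at index 0, which is the reversal described above.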
* fix: harden worker-count helpers against cpu_count=None and desired<=0
- safe_num_proc: guard os.cpu_count() with `or 1`, clamp multi-GPU
path with max(1, min(4, desired)), clamp return with max(1, desired)
- safe_thread_num_proc: same os.cpu_count() guard and return clamp
- Add regression tests (31 L1 unit + 10 sandbox edge-case tests)
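A rough sketch of the clamping described in the bullets above. The real safe_num_proc has more heuristics than this; the signature, the injectable `cpu_count` parameter, and the default of "use all CPUs" are assumptions made so the edge cases can be exercised.

```python
import os


def safe_num_proc_sketch(desired=None, multi_gpu=False, cpu_count=os.cpu_count):
    """Return a worker count that is never below 1."""
    cpus = cpu_count() or 1  # os.cpu_count() may return None
    if desired is None:
        desired = cpus
    if multi_gpu:
        # Cap at 4 on the multi-GPU path, but never go below 1.
        return max(1, min(4, desired))
    return max(1, desired)
```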
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* remove regression tests from PR
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
The previous prompt "Show me a live weather dashboard, no API key needed"
was too vague. The new wording explicitly asks for HTML code, which
produces more useful and consistent responses.
* fix(install.ps1): split torch+unsloth install to fix non-NVIDIA package resolution
--torch-backend=auto on a non-NVIDIA Windows machine causes uv to resolve
unsloth==2024.8 (pre-CLI, no unsloth.exe). Fix: detect GPU robustly (PATH +
hardcoded fallback paths, mirrors setup.ps1), install torch first with an
explicit --index-url (CUDA variant for NVIDIA, CPU for everyone else), then
install unsloth separately without --torch-backend so the solver always picks
a modern release that ships the Studio CLI.
Closes the remaining gap flagged in #4478.
* fix(install.ps1): align warning with setup.ps1, add --upgrade, handle CUDA 11.x
- Match the no-GPU warning message to studio/setup.ps1 wording
(chat-only GGUF mode, driver download link)
- Add CUDA 11.x floor check in Get-TorchIndexUrl so old drivers
fall back to CPU wheels instead of silently getting cu124
- Log a warning when nvidia-smi output cannot be parsed
- Add --upgrade to both uv pip install calls so re-runs pick up
newer package versions
* revert --upgrade from uv pip install calls
uv pip install already resolves to the latest satisfying version;
--upgrade is unnecessary and could force unwanted re-installs.
* fix: replace frozen cu124 fallbacks with cu126, guard CUDA 11.x
cu124 wheels are frozen at torch 2.6.0 -- falling back to them pins
users to an outdated PyTorch. Three issues fixed in both install.ps1
and setup.ps1:
1. CUDA 12.0-12.5 now maps to cu126 (was cu124).
2. CUDA 11.x and older now falls back to cpu (was cu124, which would
silently install incompatible GPU wheels).
3. Parse-failure and no-nvidia-smi fallbacks updated to cu126/cpu.
Adds tests/test_cuda_wheel_mapping.py covering the mapping logic,
nvidia-smi parsing, PS1 file sync, PyTorch index URL validation,
and sandbox torch installs.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* remove test file from PR branch
Test file kept locally, not needed in the PR.
* fix: map CUDA 11.x to cu118 instead of cpu
PyTorch still publishes cu118 wheels (up to torch 2.7.1), so CUDA 11.x
users get GPU-accelerated torch rather than being forced to CPU-only.
Only CUDA 10.x and older fall back to cpu.
* fix: revert CUDA 12.0-12.5 to cu124, handle cpu tag in setup.ps1
CUDA 12.0-12.5 drivers only support up to their reported CUDA version,
so cu126 wheels (built with CUDA 12.6) fail to load. Revert the catch-
all for 12.0-12.5 back to cu124.
Also fix setup.ps1 caller: when Get-PytorchCudaTag returns "cpu" (e.g.
CUDA 10.x driver), the installer now correctly skips Triton and prints
"CPU-only" instead of "CUDA support (cpu)".
* fix: add --upgrade to unsloth install for stale venv repair
On reruns against an existing venv, uv pip install unsloth makes no
changes if unsloth==2024.8 is already installed (it satisfies the
constraint). Adding --upgrade only to the unsloth install ensures
stale installs get repaired without forcing a multi-GB torch
re-download.
* fix: use --upgrade-package to avoid clobbering torch CUDA wheels
`--upgrade unsloth` re-resolves torch from default PyPI, stripping the
+cuXXX suffix installed in step 1. `--upgrade-package unsloth unsloth`
upgrades only unsloth (and pulls missing deps like transformers, trl)
while preserving the pinned torch from the CUDA-specific index.
* docs: explain why split-install and --upgrade-package are needed
Expand the inline comment block to document both design decisions:
1. Why torch is installed separately (solver fallback to 2024.8)
2. Why --upgrade-package is used instead of --upgrade (preserves CUDA wheels)
---------
Co-authored-by: LeoBorcherding <LeoBorcherding@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix Studio crash on Anaconda Python due to platform._sys_version() parse failure
Anaconda and conda-forge modify sys.version to include distributor
metadata between pipe characters, e.g.:
3.12.4 | packaged by Anaconda, Inc. | (main, ...) [MSC v.1929 ...]
Python's platform._sys_version() has a hardcoded regex that cannot
parse this format, raising ValueError. CPython closed this as "not
planned" (cpython#102396) since Anaconda modified the binary.
This breaks the import chain: run.py -> structlog -> rich -> attrs,
which calls platform.python_implementation() at module scope.
Fix: before any library imports, strip the pipe segments, parse the
cleaned version string via the standard parser, and cache the result
under the original sys.version key so all subsequent platform calls
hit the cache.
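The string-cleaning step can be sketched as below. This covers only the paired-pipe case and leaves out the cache seeding; the regex and the function name are assumptions for illustration, not the code that shipped.

```python
import re

# One "| ... |" segment together with the whitespace around it.
_PIPE_SEGMENT = re.compile(r"\s*\|[^|]*\|\s*")


def strip_distributor_metadata(version):
    """Remove '| packaged by ... |' style segments from a sys.version string."""
    return _PIPE_SEGMENT.sub(" ", version)
```

A standard CPython version string contains no pipes, so it passes through unchanged.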
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add defensive fallback for unpaired pipe edge cases in version patch
Address Gemini review suggestion: if the paired-pipe regex leaves
residual pipes (hypothetical single-pipe distributor metadata), fall
back to extracting the version number and the parenthesized build
info directly. Wrap the entire patch in try/except so unexpected
version string formats degrade gracefully instead of crashing the
patch itself.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Refactor into shared _platform_compat module, cover colab.py entrypoint
Address reviewer feedback:
1. Extract the Anaconda/conda-forge sys.version fix into a shared
_platform_compat.py module that wraps platform._sys_version() with
a retry-on-ValueError fallback. This is more robust than cache-seeding
because it handles all future platform._sys_version() calls, not just
the first one.
2. Import the fix from both run.py and colab.py entrypoints, so Studio
no longer crashes on Anaconda Python regardless of the launch path.
3. The wrapper is idempotent (guarded by a flag) and handles edge cases:
paired pipes (Anaconda, conda-forge), unpaired pipes (hypothetical),
and standard CPython strings (no-op since ValueError is never raised).
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Replace monkey-patch with cache-prime, fix colab.py duplicate sys.path, cover main.py
- Rewrite _platform_compat.py: replace function-wrapping monkey-patch with
one-shot cache seed (_seed_sys_version_cache). Parses cleaned sys.version
once and seeds platform._sys_version_cache so the stdlib parser never sees
the problematic Anaconda/conda-forge pipe-delimited string. No function
replacement, no idempotency flag, no reload edge cases.
- colab.py: remove duplicate backend_path sys.path insertion after
_bootstrap_studio_venv(). The early insertion (before _platform_compat
import) already covers it. This also fixes backend/ ending up behind
venv site-packages in sys.path ordering.
- run.py: move PYTHONWARNINGS=ignore before _platform_compat import to
preserve original intent of suppressing warnings early.
- main.py: add sys.path + _platform_compat import before route imports,
covering the direct `uvicorn main:app` launch path.
- Add test_platform_compat.py with 7 tests covering Anaconda, conda-forge,
and standard CPython version strings, plus the loggers import chain.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove test_platform_compat.py from PR
* Handle Format B conda-forge version strings with duplicate paren groups
Some conda-forge builds produce sys.version with the build info both
before and after the pipe label (e.g. "3.9.7 (default, ...) | packaged
by conda-forge | (default, ...) \n[GCC 7.5.0]"). After stripping the
pipe segment, two consecutive (...) groups remain, which still fails
platform._sys_version(). Add a second regex pass to drop the duplicate
paren group.
* Guard _sys_version call with try/except to avoid making things worse
If the cleaned version string is still unparseable by the stdlib regex
(e.g. nested parens, exotic multi-pipe formats), silently give up
instead of letting ValueError propagate at import time -- which would
be a worse crash than the original deferred one.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix: handle Windows subprocess crash during dataset.map()
Windows uses spawn (not fork) for multiprocessing. Spawned workers
cannot resolve Unsloth's dynamically compiled cache modules from
unsloth_compiled_cache/, causing ModuleNotFoundError and RuntimeError
during dataset.map() tokenization.
Add two platform-guarded patches for sys.platform == "win32":
1. Force HF_DATASETS_MULTITHREADING_MAX_WORKERS=1 and set spawn method
2. Monkey-patch Dataset.map() to force num_proc=None
Fixes #4490
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* address review: extend spawn fix to macOS, add multiprocess fallback
- Change platform checks from sys.platform == "win32" to
sys.platform != "linux" so macOS (also spawn-based) is covered
- Wrap multiprocess import in try/except falling back to stdlib
multiprocessing when the multiprocess package isn't installed
- Rename _win32_safe_map to _spawn_safe_map to reflect broader scope
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: replace global Dataset.map monkey-patch with targeted num_proc routing
The previous approach had issues: Patch 1 set HF_DATASETS_MULTITHREADING_MAX_WORKERS
and forced set_start_method (dead code on platforms already using spawn), and Patch 2
globally monkey-patched Dataset.map() (too broad, missed Dataset.filter()).
Replace with a two-layer fix:
1. Studio layer: Add dataset_map_num_proc() that returns None on spawn platforms
(Windows, macOS). Unlike num_proc=1 which still creates Pool(1) and spawns a
worker, num_proc=None runs Dataset.map()/filter() truly in-process.
Update all dataset.map() callsites to use it. ThreadPoolExecutor callers
(format_conversion.py) keep using safe_num_proc() since threads are unaffected.
2. Root-cause layer: Propagate UNSLOTH_COMPILE_LOCATION via PYTHONPATH on spawn
platforms so spawned workers can import compiled modules. Mirrors the .venv_t5
pattern in worker.py. Does not import unsloth_zoo.compiler (heavy torch/triton
imports). Completely skipped on Linux.
Also extend safe_num_proc() to return 1 on macOS (was only guarding Windows),
and narrow the transformers 5.x dataloader guard from != "linux" to explicit
("win32", "darwin").
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: add safe_thread_num_proc() for ThreadPoolExecutor callsites
safe_num_proc() correctly caps to 1 on macOS/Windows for process-based
multiprocessing, but format_conversion.py reuses it for ThreadPoolExecutor
workers. Threads share address space and are unaffected by spawn, so
capping to 1 makes image URL downloads sequential -- a real regression.
Add safe_thread_num_proc() that skips the platform guard but keeps the
cpu_count heuristic, and switch both ThreadPoolExecutor callsites in
format_conversion.py to use it.
* fix: remove double-wrap in dataset_num_proc + fix num_proc=1 in datasets route
- trainer.py:3009: Replace safe_num_proc(max(1, os.cpu_count() // 4))
with max(1, (os.cpu_count() or 1) // 4) to avoid double-wrapping
inside dataset_map_num_proc which already calls safe_num_proc
- trainer.py:15-20: Clarify comment on PYTHONPATH propagation
- datasets.py:445: Change num_proc=1 to num_proc=None for 10-row
preview slice (avoids unnecessary multiprocessing overhead)
* fix: guard os.cpu_count() against None in worker-count helpers
os.cpu_count() can return None on some platforms. Use (os.cpu_count() or 1)
to prevent TypeError in safe_num_proc() and safe_thread_num_proc().
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* gate on min uv version and shortcut python candidate search if known
* fix sort -V cross compat issue, run_quiet early exit on llamacpp, autolaunch
* update launch message
* Fix PR comments
* auto launch and find open port
* remove dev install
* Fix review findings: major-version guard, non-fatal port fallback, tty comment, restore local
* Remove autolaunch, clean up dead state and debug noise
- Remove find_open_port, TTY-gated autolaunch, and </dev/tty
redirection from install.sh; just print launch instructions
- Remove unused BEST_MAJOR variable from studio/setup.sh
- Remove stray "finished finding best python" debug echo
- Fix stale comment "below 3.12" to "below 3.11"
* Reject prerelease uv at exact minimum version boundary
* Remove 2>/dev/null from version_ge numeric comparisons
Let non-numeric version parts surface errors on stderr
instead of being silently swallowed.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix: reconfigure stdout UTF-8 on Windows to prevent UnicodeEncodeError from emoji
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: default frontend_path when None to fix blank page when venv is pre-activated
* Restore Windows UTF-8 stdout fix dropped in earlier commit
The cp1252 console encoding on Windows cannot render emoji characters
used in startup messages (e.g. print("✅ Frontend loaded ...")).
This causes UnicodeEncodeError and crashes the server before it starts.
Place sys.stdout.reconfigure(encoding="utf-8", errors="replace") at the
top of run_server(), unconditionally before any print() or structlog
call, so all emoji output is covered -- including the frontend status
messages and silent=True paths that the original placement missed.
Guarded by sys.platform == "win32" and hasattr check, so it is a no-op
on Linux/macOS and safe in non-standard stdout environments (Jupyter,
piped IO).
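A minimal sketch of the guarded reconfigure call. The stream and platform are parameters here only so the guard can be tested off Windows; `ensure_utf8_stdout` is a hypothetical helper, whereas the commit places the call directly at the top of run_server().

```python
import sys


def ensure_utf8_stdout(stream=None, platform=sys.platform):
    """Switch stdout to UTF-8 on Windows; return True if it was reconfigured."""
    stream = sys.stdout if stream is None else stream
    if platform != "win32":
        return False  # no-op on Linux/macOS
    if not hasattr(stream, "reconfigure"):
        return False  # replaced stdout (e.g. some notebook or pipe wrappers)
    stream.reconfigure(encoding="utf-8", errors="replace")
    return True
```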
* fix: preserve run_server(None) as headless, fix CLI frontend kwarg
Remove the frontend_path=None fallback in run_server() that changed
None from "headless/API-only" to "mount bundled frontend", breaking
backwards compatibility for embedders.
The blank-page bug was actually caused by the CLI wrappers always
passing frontend_path=frontend (even when frontend=None), which
overrode run_server()'s default. Fix studio.py and ui.py to only
pass frontend_path when the user explicitly sets --frontend.
* fix: use timeout loop for shutdown event in ui command
Match studio_default()'s shutdown loop that uses a 1-second timeout
on Event.wait(). Without a timeout, the bare wait() blocks at the C
level on Linux, preventing Python from delivering SIGINT (Ctrl+C).
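The timeout loop has this general shape. `wait_for_shutdown` and `poll_seconds` are names chosen for the sketch; the commit only states that the ui command now matches studio_default()'s 1-second timeout loop.

```python
import threading


def wait_for_shutdown(event, poll_seconds=1.0):
    """Block until `event` is set, waking up periodically between waits."""
    # Event.wait(timeout) returns False on timeout and True once the event is set,
    # so the interpreter regains control at least every poll_seconds.
    while not event.wait(timeout=poll_seconds):
        pass
```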
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix: add CUDA minimum version check and abort for llama.cpp (>= 12.4)
- setup.ps1/setup.sh: abort with clear error if CUDA toolkit < 12.4
(llama.cpp requirement); link to cuda-toolkit-archive for upgrade
- setup.ps1: promote CUDA VS integration copy failure from WARN to
ERROR + exit 1; remove manual-copy hack instructions per Roland. The
correct fix is re-installing CUDA/MSBuild, not a manual workaround
Fixes: https://github.com/unslothai/unsloth/issues/4437
Reported by: Sebastien
* fix: wipe stale studio venv when torch CUDA tag changes
When the NVIDIA driver is updated, the required PyTorch CUDA tag changes
(e.g. cu124 -> cu130) but setup.ps1 was silently reusing the existing
.venv, leaving the old torch wheel in place and breaking the UI for
everyone on the next setup run.
Before creating/reusing the venv, inspect the installed torch version
string. If its CUDA tag does not match what the current driver requires,
wipe the venv so we always get a clean, correct install.
* Fix CUDA version check: portability, non-fatal fallback, stale venv detection
- setup.sh: Replace grep -oP with POSIX sed for macOS compatibility
- setup.sh: Replace exit 1 with NVCC_PATH="" to fall back to CPU-only build
- setup.sh: Move version check before -DGGML_CUDA=ON append
- setup.sh: Add else branch warning when nvcc version is unparseable
- setup.ps1: Replace exit 1 with $NvccPath=$null for non-fatal CUDA fallback
- setup.ps1: Add driver vs toolkit guidance in version warning
- setup.ps1: Guard CUDA env/VS integration setup with if ($NvccPath)
- setup.ps1: VS integration catch: downgrade to WARN, restore source/dest paths
- setup.ps1: Stale venv: detect CPU torch and untagged wheels, not just +cuNNN
- setup.ps1: Stale venv: rebuild on failed torch import
- setup.ps1: Stale venv: wrap Remove-Item in try/catch for locked files
* Remove incorrect CUDA >= 12.4 check, keep only stale venv detection
llama.cpp has no hard minimum CUDA version -- it builds with CUDA as old
as 11.2 and degrades features gracefully via #if CUDART_VERSION guards.
The 12.4 figure was the default Docker/CI baseline, not a build requirement.
Reverted:
- CUDA version check in setup.sh (entirely removed)
- CUDA version check in setup.ps1 (entirely removed)
- VS integration catch block cosmetic changes (restored to main)
- if ($NvccPath) guard around CUDA env setup (not needed without version check)
Kept:
- Stale venv detection in setup.ps1: detects torch CUDA tag mismatch
(cu124 vs cu130, cpu vs cuXXX, broken torch import) and rebuilds venv
* Fix stale venv detection: incomplete venvs, timeout, fatal delete failure
- Add 30s timeout for torch import probe via ProcessStartInfo/WaitForExit
- Use Test-Path -PathType Container to reject files masquerading as venv dir
- Trigger rebuild when python.exe is missing (incomplete venv)
- Make Remove-Item failure fatal ([ERROR] + exit 1) instead of warn-and-continue
- Move $expectedTorchTag computation inside -not $shouldRebuild guard
---------
Co-authored-by: LeoBorcherding <LeoBorcherding@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
The ruff pre-commit hook runs on all file types by default, including
.ipynb notebooks. Colab notebooks are authored in Colab's editor and
can contain IPython magics (%cd, !git) that ruff cannot parse. This
causes pre-commit.ci to fail on unrelated PRs when a notebook on main
has syntax ruff does not understand.
Add `exclude: '\.ipynb$'` to the ruff hook so notebooks are skipped.
* feat(chat): add server-side timings and context display for GGUF
Extract timings/usage metadata from llama-server SSE stream and forward
through the full stack. Replace client-side estimates with accurate
server-reported metrics (prompt eval, tok/s, token counts, cache hits).
Add context window usage bar to chat top nav.
* feat(chat): source badges with hover cards and 2-row collapse
- Add hover cards to source badges showing favicon, title, URL and
snippet description on hover
- Limit source badges to 2 rows with +X more expand/collapse
- Parse snippet from web search results for hover card descriptions
- Replace individual Source rendering with grouped SourcesGroup component
* fix(chat): add null guards for server timings edge cases
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix(chat): reset contextUsage on thread switch, remove unused context-display
* fix(chat): stop double-counting completion tokens in tool-calling path
* fix(chat): skip metadata events in llm_assist consumers
* fix(chat): hide context usage bar in compare mode
* fix(chat): harden timings pipeline and context usage persistence
Accumulate prompt_ms, predicted_ms, and predicted_n from intermediate
tool-detection passes so the final metadata reflects total server work.
Persist contextUsage in message metadata (Dexie) and restore on thread
load. Add type guard in gguf_stream_chunks for unexpected dict events.
Clear contextUsage when entering compare mode.
* feat(chat): make GGUF stream metadata OpenAI-compatible
* fix(chat): address PR review feedback
* feat(chat): address PR review feedback
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(recipe-studio): prevent fitView from zooming to wrong location on recipe load
* feat: add pymupdf/python-docx deps and unstructured uploads storage root
* feat: add POST /seed/upload-unstructured-file endpoint
* feat: add multi-file chunking with source_file column
* feat: update frontend types and API layer for multi-file upload
* feat: round-robin preview rows across source files
Ensures every uploaded file is represented in the preview table
by cycling through sources instead of just taking the first N rows.
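A minimal sketch of the round-robin idea described above, not the actual Studio code. The function name `round_robin_preview` and the `rows_by_file` mapping are hypothetical and used only for illustration.

```python
from itertools import zip_longest

_MISSING = object()  # marks "this file has no more rows"


def round_robin_preview(rows_by_file, limit):
    """Take up to `limit` rows, cycling through the source files in order.

    `rows_by_file` is a hypothetical mapping of file name -> list of rows.
    """
    if limit <= 0:
        return []
    preview = []
    # Each "tier" holds the n-th row of every file (or _MISSING if exhausted).
    for tier in zip_longest(*rows_by_file.values(), fillvalue=_MISSING):
        for row in tier:
            if row is _MISSING:
                continue
            preview.append(row)
            if len(preview) == limit:
                return preview
    return preview


rows = {
    "a.pdf": ["a1", "a2", "a3"],
    "b.docx": ["b1"],
    "c.txt": ["c1", "c2"],
}
print(round_robin_preview(rows, 4))  # ['a1', 'b1', 'c1', 'a2']
```

Taking the first four rows naively would return only rows from `a.pdf` plus one from `b.docx`; cycling guarantees each file appears before any file contributes a second row.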
* fix: disable OCR, fix auto-load timing, fix persistence on reload
- Disable pymupdf4llm OCR with write_images=False, show_progress=False
- Replace onAllUploaded callback with useEffect that detects uploading→done
transition (avoids stale closure reading empty file IDs)
- Fix importer to preserve file IDs from saved recipes instead of clearing
(clearing only happens at share time via sanitizeSeedForShare)
* fix: harden unstructured upload with input validation and state fixes
Validate block_id/file_id with alphanumeric regex to prevent path
traversal, use exact stem match for file deletion, add error handling
for metadata writes and empty files, fix React stale closures and
object mutations in upload loop, and correct validation logic for
unstructured seed resolved_paths.
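The ID validation and exact stem match above could look roughly like the sketch below. The regex, the helper names `is_safe_id` and `find_upload`, and the directory layout are assumptions made for illustration; the commit does not state the exact pattern.

```python
import re
import tempfile
from pathlib import Path

# Assumed pattern: letters and digits only, so "/", "\\" and ".." never pass.
_SAFE_ID = re.compile(r"[A-Za-z0-9]+")


def is_safe_id(value):
    """Return True only for non-empty, purely alphanumeric identifiers."""
    return isinstance(value, str) and _SAFE_ID.fullmatch(value) is not None


def find_upload(upload_dir, file_id):
    """Find the stored file whose stem is exactly `file_id` (hypothetical helper)."""
    if not is_safe_id(file_id):
        raise ValueError(f"invalid file_id: {file_id!r}")
    for path in sorted(Path(upload_dir).iterdir()):
        # Exact stem comparison: "abc" must not match "abcdef.pdf".
        if path.is_file() and path.stem == file_id:
            return path
    return None


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "abc.pdf").write_text("x")
    (Path(tmp) / "abcdef.pdf").write_text("y")
    matched_name = find_upload(tmp, "abc").name
    print(matched_name)  # abc.pdf
```

A prefix or glob match such as `abc*` would have selected both files here, which is why the exact stem comparison matters for deletion.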
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: address PR review - legacy path import, share sanitizer, sync effect
Promote legacy source.path into resolved_paths for old unstructured
recipes, clear source.paths in share sanitizer to prevent leaking local
filesystem paths, and gate file sync effect to dialog open transition
so users can actually delete all uploaded files.
* fix: CSV column fix (BOM + whitespace + unnamed index re-save) for #4470
* fix: harden unstructured upload flow and polish dialog UX
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix Windows installer Python detection and winget error handling
The PowerShell installer crashes on some Windows machines due to two
issues:
1. Windows Store App Execution Aliases: Get-Command finds the stub at
WindowsApps\python.exe, then python --version writes to stderr.
With $ErrorActionPreference = "Stop" on PowerShell 5.1, stderr
from native commands becomes a terminating error, killing the
script before it tries to install Python.
2. winget "already installed" exit code: winget returns -1978335189
(APPINSTALLER_CLI_ERROR_UPDATE_NOT_APPLICABLE) when the package is
already at the latest version. The script treated any non-zero exit
as failure. The fallback Get-Command check could also find the
Store stub or fail if Python was partially uninstalled.
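For reference, the negative exit code in item 2 is the signed 32-bit view of an HRESULT-style value. The short Python sketch below only shows the arithmetic; the helper name `to_unsigned32` is made up for this example.

```python
def to_unsigned32(exit_code):
    """Reinterpret a signed 32-bit process exit code as an unsigned value."""
    return exit_code & 0xFFFFFFFF


winget_exit = -1978335189
print(hex(to_unsigned32(winget_exit)))  # 0x8a15002b
```

Seen in hex, the value is easier to look up in winget's error-code listing than the raw negative decimal that PowerShell reports in `$LASTEXITCODE`.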
Changes:
- Add Find-CompatiblePython helper that tries the py launcher first,
then python3/python via Get-Command -All, explicitly skipping any
WindowsApps stubs. All invocations wrapped in try-catch so stderr
never triggers ErrorActionPreference.
- Replace exit-code-based winget error handling with outcome-based:
re-detect Python after install, retry with --force if not found,
show actionable manual install instructions on final failure.
- Deduplicate PATH entries in Refresh-SessionPath to prevent unbounded
growth from repeated machine+user path prepending.
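The PATH deduplication itself lives in PowerShell; the Python sketch below only illustrates the idea of an order-preserving, case-insensitive dedupe. The function name `dedupe_path` and the normalization rules (trimming trailing slashes, lowercasing) are assumptions, not the installer's exact behavior.

```python
def dedupe_path(path_value, sep=";"):
    """Drop empty and repeated PATH entries while keeping first-seen order."""
    seen = set()
    kept = []
    for entry in path_value.split(sep):
        entry = entry.strip()
        # Windows paths are case-insensitive; ignore trailing separators too.
        key = entry.rstrip("\\/").lower()
        if not key or key in seen:
            continue
        seen.add(key)
        kept.append(entry)
    return sep.join(kept)


print(dedupe_path(r"C:\A;C:\B;c:\a\;;C:\B"))  # C:\A;C:\B
```

Without a step like this, every refresh that prepends the machine and user paths again makes the session PATH longer.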
* Address reviewer feedback: wrap winget calls, remove blanket WindowsApps filter
Three fixes based on code review:
1. Wrap all winget install calls in $ErrorActionPreference = "Continue"
blocks so that winget stderr (progress bars, warnings) does not
become a terminating error on PowerShell 5.1. This matches the
pattern already used in studio/setup.ps1 line 983.
2. Remove the blanket *\WindowsApps\* path filter that rejected all
WindowsApps executables including valid Microsoft Store Python
installs. Instead, rely on the existing try-catch + version regex
probing to determine if a candidate is functional. Non-functional
entries (App Execution Alias stubs) fail the try-catch and are
skipped naturally.
3. Use $pyLauncher.Source (resolved path) instead of bare py name,
add -CommandType Application to avoid matching aliases/functions,
and derive winget package ID from $PythonVersion variable instead
of hardcoding Python.Python.3.13.
* Add back WindowsApps filter for python3/python fallback path
The App Execution Alias stubs in WindowsApps can open the Microsoft
Store as a side effect when invoked, even though the try-catch handles
the error. Since the py launcher (tried first) already detects
legitimate Store Python -- Store packages include py since Python
3.11 -- filtering WindowsApps in the python3/python fallback is safe
and avoids the Store popup.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* fix(install.ps1): detect AMD/no-NVIDIA GPU early and guard unsloth.exe existence
When a user has an AMD GPU (no nvidia-smi), uv's --torch-backend=auto
resolves to CPU torch, which constrains the solver to unsloth==2024.8.
That ancient release has no unsloth.exe CLI entry point, so the subsequent
unsloth.exe studio setup call throws a confusing PowerShell
'module could not be loaded' CommandNotFoundException instead of a
clear error.
Two fixes:
- Detect nvidia-smi early; if no NVIDIA GPU is found, print a clear
error explaining AMD/Intel GPUs are unsupported and exit before
wasting time installing the wrong package version.
- Guard the unsloth.exe invocation with Test-Path, so any future case
where the CLI entry point is missing produces a readable error
instead of a cryptic PowerShell exception.
Fixes: unsloth_studio\Scripts\unsloth.exe CommandNotFoundException
on AMD GPU systems (Windows).
* fix(install.ps1): correct GPU support message - AMD is Linux-only via ROCm
* Slim down to just the unsloth.exe existence guard
Remove the early NVIDIA GPU detection gate -- Studio supports Windows
and Mac without a GPU (finetuning is simply disabled). The GPU gate
was blocking legitimate non-NVIDIA users from installing.
Keep only the Test-Path guard on unsloth.exe before invoking it. This
turns the confusing PowerShell CommandNotFoundException into a clear
error message pointing at the likely cause (older unsloth version
resolved by the package solver that does not include the Studio CLI).
* Fix quickstart link in unsloth.exe guard message
---------
Co-authored-by: LeoBorcherding <LeoBorcherding@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* feat: support full model GGUF export, disable incompatible methods in UI
* fix: resolve base model from config.json for venv_t5 export switching
* feat: detect BNB-quantized models and disable all export methods for quantized non-PEFT checkpoints
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: relocate Ollama Modelfile alongside GGUFs during non-PEFT export cleanup
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix macOS install.sh: stdin consumption and Python discovery
Two issues when running `curl | sh` on macOS:
1. Commands like `brew install` consume bytes from the piped stdin,
causing the shell to lose its place in the script. The remaining
source code gets printed as text instead of being executed, so
users have to run the installer twice. Fixed by redirecting stdin
from /dev/null for brew, apt-get, xcode-select, and the uv
installer subprocess.
2. setup.sh searches for Python 3.11-3.13 on the system PATH via
`compgen -c`. On macOS systems that only have Python 3.9 and/or
3.14, this fails with "No Python version between 3.11 and 3.13
found" even though uv already installed Python 3.13 into the
venv. Fixed by adding the venv's bin/ to PATH before invoking
`unsloth studio setup`.
* Guard PATH export against empty VENV_ABS_BIN
If cd into the venv bin/ fails, VENV_ABS_BIN would be empty and
PATH would start with ":", causing the current directory to be
searched for executables. Wrap the export in a non-empty check.
* full finetuning studio
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update studio/backend/core/training/trainer.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* One liner setup for unsloth studio
* Fix install scripts: system deps, activation bugs, curl/wget support
- install.sh: detect platform (macOS/Linux/WSL) and check for missing
system dependencies (cmake, git, build-essential, libcurl4-openssl-dev).
Prompt user once for permission to install all missing packages via
brew (macOS) or sudo apt-get (Linux/WSL). Add wget fallback via
download() helper since curl is not always present on minimal Linux
installs. Fix nested curl|sh stdin stealing by downloading uv installer
to a tempfile first. Replace venv activation (no-op in a pipe subshell)
with explicit --python flag for uv pip install and direct venv binary
invocation. Add idempotency guard for venv creation. Redirect stdin
on unsloth studio setup to prevent pipe consumption. On macOS, check
for Xcode Command Line Tools and trigger install if missing.
- install.ps1: wrap script body in Install-UnslothStudio function so
that errors use return instead of exit (exit kills the terminal when
run via irm|iex). Remove activate.ps1 invocation entirely -- use
explicit --python path for uv pip install and & $UnslothExe for
studio setup. This avoids both the child-scope activation bug (& vs
dot-source) and the execution policy error on default Windows systems.
Add winget availability check with clear error message. Fix PATH
refresh to append registry paths instead of replacing the session PATH.
Add uv installer fallback via astral.sh PowerShell script if winget
install does not put uv on PATH. Broaden Python version check to
accept 3.11-3.13. Add idempotency guard for venv creation.
- README.md: add wget one-liner alternative for systems without curl.
* Fix Tailwind CSS v4 .gitignore bug on Windows (#4444)
- Add .gitignore hiding workaround to setup.ps1 (matching existing
setup.sh logic) so venv .gitignore files containing "*" don't prevent
Tailwind's oxide scanner from finding .tsx source files
- Add CSS size validation to setup.sh, setup.ps1, and build.sh to catch
truncated Tailwind builds early
- Remove stray force-rebuild overrides that made the "skip build if
current" cache check dead code in both setup scripts
- Add rm -rf dist to build.sh to force clean rebuilds for wheel packaging
* Change default port 8000 to 8888, fix installer bugs, improve UX
- Change default Studio port from 8000 to 8888 across all entry points
(run.py, studio.py, ui.py, colab.py, vite.config.ts, setup scripts)
- Update launch banner: "Launching with studio venv..." to
"Launching Unsloth Studio... Please wait..."
- Add "Open your web browser" banner and rename labels
(Local -> Local Access, External -> Worldwide Web Address)
- Fix venv idempotency: check for bin/python instead of just directory
existence, clean up partial venvs on retry
- Fix build.sh CSS validation: handle empty CSS case that silently
bypassed the check with "integer expression expected"
- Fix install.sh sudo handling: try apt-get without sudo first (works
when root), then escalate with per-package tracking and user prompt
- Fix install.ps1: check exit code from studio setup, fail on error
- Add pciutils to WSL GGUF build dependencies
- Apply same smart apt-get escalation pattern to studio/setup.sh
* Use detected Python version for venv, abort on non-apt Linux
- install.ps1: detect existing Python 3.11/3.12/3.13 and use that
version for venv creation instead of always forcing 3.13
- install.sh: exit with error on non-apt Linux distros when required
packages cannot be auto-installed, instead of silently continuing
* Make sudo permission prompt more prominent with warning banner
* Add Accept [Y/n] sudo prompt to studio/setup.sh for consistency
* Fix native command exit code handling and sudo decline flow
install.ps1: Add $LASTEXITCODE checks after winget (Python), uv venv,
and uv pip install calls. $ErrorActionPreference only catches PowerShell
cmdlet errors, not native executable failures. The Python check also
handles winget returning non-zero for "already installed".
setup.sh: Skip llama-server build when user declines sudo or sudo is
unavailable. Previously the script continued to section 8 which would
fail with confusing errors (e.g. "gcc: command not found") since
build-essential was never installed.
* Move rm -rf llama.cpp inside build branch to preserve existing install
When _SKIP_GGUF_BUILD is set (user declined sudo or sudo unavailable),
the previous rm -rf would destroy an already-working llama-server before
the skip check ran. Move it inside the else branch so existing builds
are preserved when the rebuild is skipped.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Fixing Qwen3.5 bug and adding Outetts dependencies
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Apply suggestion from @danielhanchen
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Fix studio crash on Mac: vendor check_signal_escape_patterns from unsloth_zoo
Vendor the `check_signal_escape_patterns` function from
`unsloth_zoo.rl_environments` directly into `tools.py`. The function is
pure Python (only uses stdlib `ast`) and has zero GPU dependencies, but
importing it from unsloth_zoo triggers `unsloth_zoo.__init__` which calls
`get_device_type()` at module scope -- raising NotImplementedError on
Apple Silicon Macs.
By vendoring the code, the safety checks still run on all platforms
(Mac, Linux, Windows) without needing unsloth_zoo at all.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- tool-ui-python.tsx: use explicit tuple type instead of `as const` to
match the mutable `[BundledTheme, BundledTheme]` expected by Streamdown
- chat-adapter.ts: add missing `argsText` field required by
ToolCallMessagePart and fix `args` type to use ReadonlyJSONObject
* Add elapsed timer to tool status pill in Studio
Show a count-up seconds timer (0s, 1s, 2s, ...) next to the tool status
text in the composer area. Helps users gauge how long a tool call (web
search, code execution) has been running. Timer resets when a new tool
starts and disappears when all tools finish.
* Fix tool call parsing, add tool outputs panel and reasoning copy button
Backend:
- Rewrite tool call XML parser to use balanced-brace JSON extraction
instead of greedy regex, fixing truncation on nested braces in
code/JSON arguments
- Handle optional closing tags (</tool_call>, </function>, </parameter>)
that models frequently omit
- Support bare <function=...> tags without <tool_call> wrapper
- Strip tool call markup from streamed content so raw XML never leaks
into the chat UI
- Use a persistent ~/studio_sandbox/ working directory for tool
execution so files persist across calls within a session
- Emit tool_start/tool_end SSE events so the frontend can display
tool inputs and outputs
Frontend:
- Add collapsible "Tool Outputs" panel below assistant messages showing
each tool call's input and output with copy buttons
- Add copy button to reasoning blocks
- Add elapsed timer to tool status pill
- Update project URLs in pyproject.toml (http -> https, add docs link)
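To illustrate the balanced-brace extraction mentioned in the backend changes above, here is a minimal sketch. It is not the parser from this commit; the function name `extract_json_object` is hypothetical, and it only handles a single JSON object following the opening tag.

```python
import json


def extract_json_object(text, start=0):
    """Return (obj, end_index) for the first balanced {...} at or after `start`.

    Braces inside JSON strings are ignored, so code passed as an argument
    does not truncate the match. Returns None if no complete object exists.
    """
    begin = text.find("{", start)
    if begin == -1:
        return None
    depth = 0
    in_string = False
    escaped = False
    for i in range(begin, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(text[begin:i + 1]), i + 1
                except json.JSONDecodeError:
                    return None
    return None  # unbalanced: the object never closed


raw = '<tool_call>{"name": "python", "arguments": {"code": "print({1: 2})"}} trailing }'
call, end = extract_json_object(raw)
print(call["arguments"]["code"])  # print({1: 2})
```

A greedy regex such as `\{.*\}` would have swallowed the stray trailing brace, and a non-greedy one would have stopped at the first `}` inside the code string; counting depth avoids both failure modes.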
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add interactive HTML preview with fullscreen toggle for code blocks
HTML code fences now render an interactive sandboxed iframe preview
below the syntax-highlighted code, similar to how SVG fences show
an image preview. The iframe uses sandbox="allow-scripts" to allow
JavaScript execution while blocking access to the parent page.
Includes a fullscreen toggle (enlarge/minimize button) that expands
the preview into a viewport overlay, dismissible via button, Escape
key, or backdrop click. A streaming placeholder prevents partial
HTML from rendering mid-stream.
* Add tool call settings: auto-heal toggle, max iterations, timeout
Add three user-configurable tool call settings to the Studio Settings panel:
- Auto Heal Tool Calls: toggle to control fallback XML parsing of malformed
tool calls from model output (default: on)
- Max Tool Calls Per Message: slider 0-40 + Max to cap tool call iterations
per message (default: 10)
- Max Tool Call Duration: slider 1-30 minutes + Max to set per-tool-call
execution timeout (default: 5 minutes)
All settings persist to localStorage and flow through the full stack:
frontend store -> API request -> Pydantic model -> route -> llama_cpp -> tools.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix tool call timeout: respect no-limit and apply to web search
- Use a sentinel to distinguish timeout=None (no limit) from the default
(300s). Previously None was silently replaced with _EXEC_TIMEOUT.
- Pass the configured timeout to DDGS() for web searches so the setting
applies uniformly to all tool types.
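The sentinel pattern described in the first bullet can be sketched as follows. `_EXEC_TIMEOUT` and the 300s default come from the commit text; the sentinel name `_UNSET` and the function `resolve_timeout` are illustrative assumptions.

```python
_EXEC_TIMEOUT = 300  # default timeout in seconds
_UNSET = object()    # hypothetical sentinel: "caller passed nothing"


def resolve_timeout(timeout=_UNSET):
    """Return the effective timeout; None means "no limit"."""
    if timeout is _UNSET:
        return _EXEC_TIMEOUT
    return timeout  # an explicit None is preserved instead of being replaced


print(resolve_timeout())      # 300
print(resolve_timeout(None))  # None
print(resolve_timeout(60))    # 60
```

Using `timeout=None` as the default would make "not specified" and "no limit" indistinguishable, which is the bug the sentinel avoids.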
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add input validation bounds and per-thread sandbox isolation
- Add ge=0 constraint to max_tool_calls_per_message (rejects negative values)
- Add ge=1 constraint to tool_call_timeout (minimum 1 second)
- Thread session_id from frontend through backend to tool execution
- Scope sandbox directories per conversation: ~/studio_sandbox/{thread_id}/
- Backwards compatible: API callers without session_id use ~/studio_sandbox/
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix non-monotonic streaming and Python temp script path
- Split tool markup stripping into closed-only (mid-stream) and full
(final flush) to prevent cumulative text from shrinking mid-stream
- Enforce monotonicity: only emit when cleaned text grows, so the
proxy's delta logic (cumulative[len(prev_text):]) never breaks
- Place Python temp scripts in the sandbox workdir instead of /tmp so
sys.path[0] points to the sandbox and cross-call imports work
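A small sketch of the monotonicity guard from the second bullet, assuming the cleaned cumulative text arrives as a sequence of snapshots. The generator name `monotonic_deltas` is hypothetical, and the sketch only checks length growth, as the bullet describes.

```python
def monotonic_deltas(snapshots):
    """Yield only the newly added suffix of each cleaned cumulative snapshot.

    Snapshots that did not grow are skipped, so slicing with
    cumulative[len(prev_text):] never sees a shrinking string.
    """
    prev_text = ""
    for cumulative in snapshots:
        if len(cumulative) <= len(prev_text):
            continue  # text shrank or stalled (e.g. markup was just stripped)
        yield cumulative[len(prev_text):]
        prev_text = cumulative


snapshots = ["Hel", "Hello", "Hell", "Hello wor", "Hello world"]
print(list(monotonic_deltas(snapshots)))  # ['Hel', 'lo', ' wor', 'ld']
```

Without the guard, the shrinking snapshot `"Hell"` would reset the baseline and the following delta would re-send characters the client already has.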
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Sanitize session_id to prevent path traversal in sandbox
Strip path separators and parent-dir references from session_id before
using it as a directory name. Verify the resolved path stays under
~/studio_sandbox/ as a second guard.
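The two guards described above might be sketched like this. The `~/studio_sandbox/` root comes from the commit text; the function name `sandbox_dir` and the exact stripping rules are assumptions for illustration.

```python
from pathlib import Path

SANDBOX_ROOT = Path.home() / "studio_sandbox"


def sandbox_dir(session_id=None, root=SANDBOX_ROOT):
    """Map a session_id to a directory that is guaranteed to sit under `root`."""
    root = Path(root).resolve()
    if not session_id:
        return root  # callers without a session share the root sandbox
    # Guard 1: strip path separators and parent-dir references.
    cleaned = session_id.replace("/", "").replace("\\", "").replace("..", "")
    if not cleaned:
        return root
    # Guard 2: resolve and confirm the result is still inside the root.
    candidate = (root / cleaned).resolve()
    if candidate != root and root not in candidate.parents:
        raise ValueError("session_id escapes the sandbox root")
    return candidate


print(sandbox_dir("../../etc", root=Path.cwd()).name)  # etc
```

The second check is deliberately redundant: if the stripping logic ever misses a case, the resolved-path comparison still refuses to leave the sandbox.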
* feat(chat): proper assistant-ui tool call UIs with sources
Replace custom metadata-based ToolOutputsGroup with native assistant-ui
tool-call content parts. Backend SSE tool_start/tool_end events now emit
proper { type: "tool-call" } parts from the adapter, enabling per-tool
UIs registered via tools.by_name in MessagePrimitive.Parts.
- Web search: Globe icon, Source badges with favicons, auto-collapse
when LLM starts responding
- Python: Code icon, syntax-highlighted code via Streamdown/shiki,
output block with copy
- Terminal: Terminal icon, command in trigger, output with copy
- ToolGroup wraps consecutive tool calls (skips for single calls)
- Sources component renders URL badges at end of message
- Flattened code block CSS (single border, no nested boxes)
* fix(inference): respect empty enabled_tools allowlist
`if payload.enabled_tools:` is falsy for [], falling through to
ALL_TOOLS. Use `is not None` so an explicit empty list disables
all tools as intended.
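The difference between the two checks is easy to see in isolation. In the sketch below `ALL_TOOLS` is named in the commit, but its contents and the function `select_tools` are hypothetical.

```python
ALL_TOOLS = ["web_search", "python", "terminal"]  # illustrative contents


def select_tools(enabled_tools):
    """Return the tools to expose for a request.

    None  -> no allowlist given, expose everything.
    []    -> explicit empty allowlist, expose nothing.
    """
    if enabled_tools is not None:
        return [tool for tool in ALL_TOOLS if tool in enabled_tools]
    return list(ALL_TOOLS)


print(select_tools(None))        # ['web_search', 'python', 'terminal']
print(select_tools([]))          # []
print(select_tools(["python"]))  # ['python']
```

With a truthiness check, the `[]` case would fall through to the last line and return every tool.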
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Shine1i <wasimysdev@gmail.com>
* revert: remove frontend build caching from setup scripts
The mtime-based caching introduced in #4404/#4413 can incorrectly skip
frontend builds, e.g. after a git pull when filesystem timestamps are
not preserved. The cached path also skips hiding the site-packages
.gitignore, which Tailwind v4 requires before vite build.
Always rebuild the frontend on setup. Rebuilding takes ~15s and is
safer than risking a stale dist/.
* revert: disable frontend build caching, keep code commented out
Caching disabled by always setting _NEED_FRONTEND_BUILD=true.
The mtime-based logic is preserved in comments for future re-enabling.
Reasons for disabling:
- Git does not preserve file timestamps, so cached dist/ can appear
newer than freshly checked-out source after a pull
- Tailwind v4 requires hiding site-packages/.gitignore before vite
build; the cache path bypasses this, producing broken CSS
* revert: always rebuild frontend, remove mtime caching
* revert: always rebuild frontend, override caching with _NEED_FRONTEND_BUILD=true
* fix: exclude nemotron_h from flex_attention
NemotronHForCausalLM does not support flex_attention and raises:
NotImplementedError: NemotronHForCausalLM does not support an
attention implementation through torch's flex_attention.
Add nemotron_h to the exclusion list alongside gpt_oss and mllama
so Unsloth falls back to the default attention implementation.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix inference stall during prefill by removing retry storm
The _stream_with_retry method used a 0.5s read timeout and retried by
sending a brand new POST request each time. During prompt prefill (which
can take 5-30+ seconds for long contexts or reasoning models), this
caused 10-60 duplicate requests that forced llama-server to restart
processing from scratch each time, resulting in 10-20s stalls visible
as "Generating" with no progress in the UI.
Fix: send the request ONCE with a 120s read timeout for the initial
response headers. Cancel support during the prefill wait is handled by
a background thread that monitors cancel_event (checked every 0.3s)
and closes the response to unblock the httpx read immediately. This
preserves the ability to stop/cancel/refresh during generation.
The existing 0.5s timeout on the httpx.Client is still used by
_iter_text_cancellable for per-token cancel checking during streaming
(after prefill), which is unaffected by this change.
* Fix race in cancel watcher when response is not yet created
When cancel_event fires before client.stream() returns (response is
still None), the watcher would hit return and exit without closing
anything. The main thread stays blocked for up to 120s.
Fix: after cancel is requested, keep polling _response_ref every 0.1s
until the response object appears (then close it) or _cancel_closed
is set (main thread finished on its own).
* Minor cleanup: remove redundant None check, add debug logging in cancel watcher
Address Gemini review: cancel_event is guaranteed non-None when the
watcher thread runs, and logging the close exception aids debugging.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Retry r.close() on failure instead of giving up
If r.close() raises, stay in the polling loop and retry rather than
returning and leaving the main thread blocked for up to 120s.
* fix: keep short read timeout during token streaming
The prefill_timeout (read=120s) was passed to client.stream(), which
applied to ALL reads -- not just the initial response headers. This
meant _iter_text_cancellable's ReadTimeout-based cancel checking was
broken during token streaming: the Stop button could take up to 120s
to respond instead of 0.5s.
Fix: keep the client's short read timeout (0.5s) for the stream call.
During prefill, catch ReadTimeout in a loop and re-check cancel_event
instead of re-sending the POST (which was the original retry storm).
Once the first bytes arrive, yield the response with a PrependStream
wrapper so iter_text() sees the buffered first chunk.
This preserves both:
- Fast cancel during prefill (via cancel watcher + ReadTimeout loop)
- Fast cancel during streaming (via _iter_text_cancellable's 0.5s
ReadTimeout, which now fires correctly again)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: swap to short-timeout stream after prefill completes
Address two review issues:
1. _PrependStream did not inherit from httpx.SyncByteStream, so
Response.iter_raw() would raise RuntimeError. Replaced with a
_ShortTimeoutStream that inherits SyncByteStream properly.
2. client.stream() entry itself raises ReadTimeout during slow prefill
(before headers arrive). The previous fix tried to catch this at
the body-read level but missed the connection-level timeout.
New approach: keep the 120s read timeout for client.stream() so the
connection survives long prefills. Once headers arrive, replace the
response stream with _ShortTimeoutStream -- a wrapper that uses a
background reader thread and a Queue with a short get() timeout to
re-raise ReadTimeout at the original 0.5s interval. This way
_iter_text_cancellable's cancel-checking remains responsive during
token streaming while prefill gets the long timeout it needs.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: move _ShortTimeoutStream before LlamaCppBackend class
The class was placed inside LlamaCppBackend's body, splitting the
class in two and making _codec_mgr and other attributes unreachable.
Move it to module level before LlamaCppBackend.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: remove _ShortTimeoutStream, use watcher for all cancel
_ShortTimeoutStream had two critical issues:
1. Raising ReadTimeout from a generator kills it -- Python finalizes
generators after an uncaught exception, so the next next() call
hits StopIteration and streaming ends mid-response.
2. The unbounded Queue in the background reader loses backpressure,
causing memory spikes with slow clients.
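Issue 1 is standard Python generator behavior and can be reproduced without httpx. In the sketch below the built-in `TimeoutError` stands in for httpx's ReadTimeout, and the generator `chunks` is a made-up example.

```python
def chunks():
    yield "a"
    # Stand-in for a ReadTimeout raised inside the streaming generator.
    raise TimeoutError("simulated read timeout")
    yield "b"  # never reached


gen = chunks()
received = [next(gen)]

try:
    next(gen)
except TimeoutError:
    pass  # the caller "handles" the timeout and tries to keep reading

try:
    received.append(next(gen))
    resumed = True
except StopIteration:
    resumed = False  # the generator was finalized by the exception

print(received, resumed)  # ['a'] False
```

Catching the exception on the consumer side does not help: once the exception has propagated out of the generator, it cannot be resumed.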
Simpler approach: use the 120s read timeout for the entire stream and
rely on the cancel watcher thread for all cancellation (both prefill
and streaming). The watcher closes the response on cancel_event,
which unblocks any blocking httpx read within ~0.3s. This eliminates
the need for short timeout tricks entirely.
Cancel latency:
- Prefill: ~0.3s (watcher polls cancel_event every 0.3s)
- Streaming: ~0.3s (same watcher mechanism)
- Both faster than the old 0.5s ReadTimeout approach
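A minimal sketch of the watcher approach, assuming a response object that exposes `close()`. `FakeResponse`, `start_cancel_watcher`, and `done_event` are hypothetical names used for illustration, not the backend's actual identifiers.

```python
import threading


class FakeResponse:
    """Stand-in for a streaming HTTP response; only close() matters here."""

    def __init__(self):
        self.closed = threading.Event()

    def close(self):
        self.closed.set()


def start_cancel_watcher(cancel_event, response, done_event, poll=0.3):
    """Close `response` shortly after `cancel_event` is set.

    The watcher exits on its own once `done_event` signals that the main
    thread finished streaming without a cancel.
    """

    def watch():
        while not done_event.is_set():
            # wait() returns True as soon as cancel is requested.
            if cancel_event.wait(timeout=poll):
                response.close()  # unblocks the reader in the main thread
                return

    thread = threading.Thread(target=watch, daemon=True)
    thread.start()
    return thread


cancel, done, resp = threading.Event(), threading.Event(), FakeResponse()
watcher = start_cancel_watcher(cancel, resp, done, poll=0.05)
cancel.set()
watcher.join(timeout=2)
print(resp.closed.is_set())  # True
```

Because the watcher owns cancellation, the read timeout on the stream can stay long without affecting how quickly a stop request takes effect.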
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* docs: clarify cancel limitations in _stream_with_retry
The docstrings claimed ~0.3s cancel in all cases, but httpx cannot
interrupt a blocked read before the response object exists. Update
the docstrings to accurately describe the behavior:
- Cancel during prefill (header wait) is deferred until headers arrive
- Cancel during streaming works via response.close() from the watcher
- _iter_text_cancellable docstring updated to reflect the watcher-based
cancel mechanism instead of the old ReadTimeout polling
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Allow Windows setup to complete without NVIDIA GPU
setup.ps1 previously hard-exited if nvidia-smi was not found, blocking
setup entirely on CPU-only or non-NVIDIA machines. The backend already
supports CPU and MLX (Apple Silicon) in chat-only GGUF mode, and the
Linux/Mac setup.sh handles missing GPUs gracefully.
Changes:
- Convert the GPU check from a hard exit to a warning
- Guard CUDA toolkit installation behind $HasNvidiaSmi
- Install CPU-only PyTorch when no GPU is detected
- Build llama.cpp without CUDA flags when no GPU is present
- Update doc comment to reflect CPU support
* Cache frontend build across setup runs
Skip the frontend npm install + build if frontend/dist already exists.
Previously setup.ps1 nuked node_modules and package-lock.json on every
run, and both scripts always rebuilt even when dist/ was already present.
On a git clone editable install, the first setup run still builds the
frontend as before. Subsequent runs skip it, saving several minutes.
To force a rebuild, delete frontend/dist and re-run setup.
* Show pip progress for PyTorch download on Windows
The torch CUDA wheel is ~2.8 GB and the CPU wheel is ~300 MB. With
| Out-Null suppressing all output, the install appeared completely
frozen with no feedback. Remove | Out-Null for the torch install
lines so pip's download progress bar is visible. Add a size hint
so users know the download is expected to take a while.
Also moves the Triton success message inside the GPU branch so it
only prints when Triton was actually installed.
* Guard CUDA env re-sanitization behind GPU check in llama.cpp build
The CUDA_PATH re-sanitization block (lines 1020-1033) references
$CudaToolkitRoot which is only set when $HasNvidiaSmi is true and
the CUDA Toolkit section runs. On CPU-only machines, $CudaToolkitRoot
is null, causing Split-Path to throw:
Split-Path : Cannot bind argument to parameter 'Path' because it is null.
Wrap the entire block in `if ($HasNvidiaSmi -and $CudaToolkitRoot)`.
* Rebuild frontend when source files are newer than dist/
Instead of only checking if dist/ exists, compare source file timestamps
against the dist/ directory. If any file in frontend/src/ is newer than
dist/, trigger a rebuild. This handles the case where a developer pulls
new frontend changes and re-runs setup -- stale assets get rebuilt
automatically.
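As a rough illustration of the timestamp comparison described above, here is a minimal Python sketch. The real logic lives in setup.sh and setup.ps1; `needs_rebuild` and its arguments are hypothetical names used only for this example.

```python
from pathlib import Path


def needs_rebuild(src_dir: Path, dist_dir: Path) -> bool:
    """Return True when dist/ is missing or any source file is newer than it."""
    if not dist_dir.is_dir():
        return True  # nothing has been built yet
    dist_mtime = dist_dir.stat().st_mtime
    # Any regular file under src_dir modified after dist/ marks the build stale.
    return any(
        p.is_file() and p.stat().st_mtime > dist_mtime
        for p in src_dir.rglob("*")
    )
```

Comparing against the mtime of the dist/ directory itself keeps the check cheap, at the cost of relying on the build touching that directory.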
* Fix cmake not found on Windows after winget install
Two issues fixed:
1. After winget installs cmake, Refresh-Environment may not pick up the
new PATH entry (MSI PATH changes sometimes need a new shell). Added a
fallback that probes cmake's default install locations (Program Files,
LocalAppData) and adds the directory to PATH explicitly if found.
2. If cmake is still unavailable when the llama.cpp build starts (e.g.
winget failed silently or PATH was not updated), the build now skips
gracefully with a [SKIP] warning instead of crashing with
"cmake : The term 'cmake' is not recognized".
* Fix frontend rebuild detection and decouple oxc-validator install
Address review feedback:
- Check entire frontend/ directory for changes, not just src/.
The build also depends on package.json, vite.config.ts,
tailwind.config.ts, public/, and other config files. A change
to any of these now triggers a rebuild.
- Move oxc-validator npm install outside the frontend build gate
in setup.sh so it always runs on setup, matching setup.ps1
which already had it outside the gate.
* Show cmake errors on failure and retry CUDA VS integration with elevation
Two fixes for issue #4405 (Windows setup fails at cmake configure):
1. cmake configure: capture output and display it on failure instead of
piping to Out-Null. When the error mentions "No CUDA toolset found",
print a hint about the CUDA VS integration files.
2. CUDA VS integration copy: when the direct Copy-Item fails (needs
admin access to write to Program Files), retry with Start-Process
-Verb RunAs to prompt for elevation. This is the root cause of the
"No CUDA toolset found" cmake failure -- the .targets files that let
MSBuild compile .cu files are missing from the VS BuildCustomizations
directory.
* Address reviewer feedback: cmake PATH persistence, stale cache, torch error check
1. Persist cmake PATH to user registry so Refresh-Environment cannot
drop it later in the same setup run. Previously the process-only
PATH addition at phase 1 could vanish when Refresh-Environment
rebuilt PATH from registry during phase 2/3 installs.
2. Clean stale CMake cache before configure. If a previous run built
with CUDA and the user reruns without a GPU (or vice versa), the
cached GGML_CUDA value would persist. Now the build dir is removed
before configure.
3. Explicitly set -DGGML_CUDA=OFF for CPU-only builds instead of just
omitting CUDA flags. This prevents cmake from auto-detecting a
partial CUDA installation.
4. Fix CUDA cmake flag indentation -- was misaligned from the original
PR, now consistently indented inside the if/else block.
5. Fail hard if pip install torch returns a non-zero exit code instead
of silently continuing with a broken environment.
* Remove extra CUDA cmake flags to align Windows with Linux build
Drop GGML_CUDA_FA_ALL_QUANTS, GGML_CUDA_F16, GGML_CUDA_GRAPHS,
GGML_CUDA_FORCE_CUBLAS, and GGML_CUDA_PEER_MAX_BATCH_SIZE flags.
The Linux build in setup.sh only sets GGML_CUDA=ON and lets llama.cpp
use its defaults for everything else. Keep Windows consistent.
* Address reviewer round 2: GPU probe fallback, Triton check, stale binary rebuild
1. GPU detection: fall back to default nvidia-smi install locations
(Program Files\NVIDIA Corporation\NVSMI, System32) when nvidia-smi
is not on PATH. Prevents silent CPU-only provisioning on machines
that have a GPU but a broken PATH.
2. Triton: check $LASTEXITCODE after pip install and print [WARN]
on failure instead of unconditional [OK].
3. Stale llama-server: check CMakeCache.txt for GGML_CUDA setting
and rebuild if the existing binary does not match the current GPU
mode (e.g. CUDA binary on a now-CPU-only rerun, or vice versa).
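The stale-binary check in item 3 could look roughly like the Python sketch below. It assumes the usual `NAME:TYPE=VALUE` layout of CMakeCache.txt entries; the function names are hypothetical, and treating a missing entry as "unknown, do not force a rebuild" is an assumption of this example rather than something the commit states.

```python
from pathlib import Path
from typing import Optional


def cached_cuda_setting(cache_file: Path) -> Optional[bool]:
    """Read GGML_CUDA from a CMakeCache.txt; None if absent or unreadable."""
    try:
        text = cache_file.read_text(encoding="utf-8", errors="replace")
    except OSError:
        return None
    for line in text.splitlines():
        # Cache entries look like NAME:TYPE=VALUE, e.g. GGML_CUDA:BOOL=ON.
        # The trailing colon keeps GGML_CUDA_F16 and friends from matching.
        if line.startswith("GGML_CUDA:"):
            value = line.partition("=")[2].strip().upper()
            return value in {"ON", "1", "TRUE", "YES", "Y"}
    return None


def binary_is_stale(cache_file: Path, has_gpu: bool) -> bool:
    """Stale when the cached CUDA mode differs from the current GPU mode."""
    cached = cached_cuda_setting(cache_file)
    # Assumption: an unknown cached mode does not force a rebuild.
    return cached is not None and cached != has_gpu
```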
* Fix frontend rebuild detection and npm dependency issues
Addresses reviewer feedback on the frontend caching logic:
1. setup.sh: Fix broken find command that caused exit under pipefail.
The piped `find | xargs find -newer` had paths after the expression
which GNU find rejects. Replaced with a simpler `find -maxdepth 1
-type f -newer dist/` that checks ALL top-level files (catches
index.html, bun.lock, etc. that the extension allowlist missed).
2. setup.sh: Guard oxc-validator npm install behind `command -v npm`
check. When the frontend build is skipped (dist/ is cached), Node
bootstrap is also skipped, so npm may not be available.
3. setup.ps1: Replace Get-ChildItem -Include with explicit path
probing for src/ and public/. PowerShell's -Include without a
trailing wildcard silently returns nothing, so src/public changes
were never detected. Also check ALL top-level files instead of
just .json/.ts/.js/.mjs extensions.
* Fix studio setup: venv isolation, centralized .venv_t5, uv targeting
- All platforms (including Colab) now create ~/.unsloth/studio/.venv
with --without-pip fallback for broken ensurepip environments
- Add --python sys.executable to uv pip install in install_python_stack.py
so uv targets the correct venv instead of system Python
- Centralize .venv_t5 bootstrap in transformers_version.py with proper
validation (checks required packages exist, not just non-empty dir)
- Replace ~150 lines of duplicated install code across 3 worker files
with calls to the shared _ensure_venv_t5_exists() helper
- Use uv-if-present with pip fallback; do not install uv at runtime
- Add site.addsitedir() shim in colab.py so notebook cells can import
studio packages from the venv without system-Python double-install
- Update .venv_t5 packages: huggingface_hub 1.3.0->1.7.1, add hf_xet
- Bump transformers pin 4.57.1->4.57.6 in requirements + constraints
- Add Fast-Install helper to setup.ps1 with uv+pip fallback
- Keep Colab-specific completion banner in setup.sh
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix nvidia-smi PATH persistence and cmake requirement for CPU-only
1. Store nvidia-smi as an absolute path ($NvidiaSmiExe) on first
detection. All later calls (Get-CudaComputeCapability,
Get-PytorchCudaTag, CUDA toolkit detection) use this absolute
path instead of relying on PATH. This survives Refresh-Environment
which rebuilds PATH from the registry and drops process-only
additions.
2. Make cmake fatal for CPU-only installs. CPU-only machines depend
entirely on llama-server for GGUF chat mode, so reporting "Setup
Complete!" without it is misleading. GPU machines can still skip
the llama-server build since they have other inference paths.
* Fix broken frontend freshness detection in setup scripts
- setup.sh: Replace broken `find | xargs find -newer` pipeline with
single `find ... -newer` call. The old pipeline produced "paths must
precede expression" errors (silently suppressed by 2>/dev/null),
causing top-level config changes to never trigger a rebuild.
- setup.sh: Add `command -v npm` guard to oxc-validator block so it
does not fail when Node was not installed (build-skip path).
- setup.ps1: Replace `Get-ChildItem -Include` (unreliable without
-Recurse on PS 5.1) with explicit directory paths for src/ and
public/ scanning.
- Both: Add *.html to tracked file patterns so index.html (Vite
entry point) changes trigger a rebuild.
- Both: Use -print -quit instead of piping to head -1 for efficiency.
* Fix bugs found during review of PRs #4404, #4400, #4399
- setup.sh: Add || true guard to find command that checks frontend/src
and frontend/public dirs, preventing script abort under set -euo
pipefail when either directory is missing
- colab.py: Use sys.path.insert(0, ...) instead of site.addsitedir()
so Studio venv packages take priority over system copies. Add warning
when venv is missing instead of silently failing.
- transformers_version.py: _venv_t5_is_valid() now checks installed
package versions via .dist-info metadata, not just directory presence.
Prevents false positives from stale or wrong-version packages.
- transformers_version.py: _install_to_venv_t5() now passes --upgrade
so pip replaces existing stale packages in the target directory.
- setup.ps1: CPU-only PyTorch install uses --index-url for cpu wheel
and all install commands use Fast-Install (uv with pip fallback).
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix _venv_t5_is_valid dist-info loop exiting after first directory
Remove premature break that caused the loop over .dist-info directories
to exit after the first match even if it had no METADATA file. Now
continues iterating until a valid METADATA is found or all dirs are
exhausted.
* Capture error output on failure instead of discarding with Out-Null
setup.ps1: 6 locations changed from `| Out-Null` to `| Out-String` with
output shown on failure -- PyTorch GPU/CPU install, Triton install,
venv_t5 package loop, cmake llama-server and llama-quantize builds.
transformers_version.py: clean stale .venv_t5 directory before reinstall
when validation detects missing or version-mismatched packages.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix ModuleNotFoundError when CLI imports studio.backend.core
The backend uses bare "from utils.*" imports everywhere, relying on
backend/ being on sys.path. Workers and routes add it at startup, but
the CLI imports studio.backend.core as a package -- backend/ was never
added. Add sys.path setup at the top of core/__init__.py so lazy
imports resolve correctly regardless of entry point.
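A minimal sketch of that kind of sys.path setup follows. The helper name is hypothetical, and the assumed layout (backend/ being the parent of core/) is inferred from the description, not confirmed by it.

```python
import sys
from pathlib import Path


def ensure_on_sys_path(directory: Path) -> None:
    """Put directory at the front of sys.path once, so bare imports resolve."""
    entry = str(directory.resolve())
    if entry not in sys.path:
        sys.path.insert(0, entry)


# In a package __init__.py this might be called as (assumed layout):
# ensure_on_sys_path(Path(__file__).resolve().parent.parent)
```

The membership check keeps repeated imports from growing sys.path.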
Fixes: `unsloth inference unsloth/Qwen3-8B "who are you"` crashing with
"No module named 'utils'"
* Fix frontend freshness check to detect all top-level file changes
The extension allowlist (*.json, *.ts, *.js, *.mjs, *.html) missed
files like bun.lock, so lockfile-only dependency changes could skip
the frontend rebuild. Check all top-level files instead.
* Add tiktoken to .venv_t5 for Qwen-family tokenizers
Qwen models use tiktoken-based tokenizers which fail when routed through
the transformers 5.x overlay without tiktoken installed. Add it to the
setup scripts (with deps for Windows) and runtime fallback list.
Integrates PR #4418.
* Fix tiktoken crash in _venv_t5_is_valid and stray brace in setup.ps1
_venv_t5_is_valid() crashed with ValueError on unpinned packages like
"tiktoken" (no ==version). Handle by splitting safely and skipping
version check for unpinned packages (existence check only).
Also remove stray closing brace in setup.ps1 tiktoken install block.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* feat(studio): infinite scroll for recommended models list
The model selector showed a hard cap of 4 GGUFs + 4 safetensors in the
Recommended section. Users who wanted to browse more had to search
manually on Hugging Face.
Backend: increase the default model pool from 8+8 to 40+40 (the HF
fetch already pulls 80, so no extra network cost).
Frontend: replace the static 4+4 cap with on-demand lazy loading.
A page counter tracks how many groups of 4 to show per category.
An IntersectionObserver on a sentinel div at the bottom of the list
increments the page when the user scrolls down. Models are interleaved
in groups of 4 GGUFs then 4 hub models per page for a balanced view.
Key implementation details:
- Callback ref for the sentinel so the observer attaches reliably on
first popover open (useRef would miss the initial mount)
- Observer disconnects after each fire and re-attaches via useEffect
with a 100ms layout delay to prevent runaway page loading
- VRAM info fetched incrementally via useRecommendedModelVram on the
visible slice only
- recommendedSet uses visible IDs so HF search dedup stays correct
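The page-based interleaving described above can be sketched as follows. The frontend itself is React/TypeScript; this Python version only illustrates the slicing, and `visible_models` is a hypothetical name.

```python
from typing import List


def visible_models(gguf: List[str], hub: List[str], pages: int, page_size: int = 4) -> List[str]:
    """Interleave page by page: page_size GGUFs, then page_size hub models."""
    visible: List[str] = []
    for page in range(pages):
        start, end = page * page_size, (page + 1) * page_size
        visible.extend(gguf[start:end])
        visible.extend(hub[start:end])
    return visible
```

Incrementing `pages` (as the IntersectionObserver callback would) only appends to the visible list, so already-rendered rows keep their positions.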
* refactor: address review feedback on recommended infinite scroll
- Simplify visibleRecommendedIds: use findIndex to locate the GGUF/hub
split point instead of re-filtering the entire array each time.
recommendedIds is already sorted GGUF-first, so a single slice is
enough.
- Fix VRAM refetch churn: pass the full recommendedIds (stable across
page increments) to useRecommendedModelVram instead of the growing
visibleRecommendedIds slice. The hook derives its stableKey from the
sorted+joined input, so passing the same pool on every page avoids
redundant HF modelInfo requests.
* Comment out large unused packages from Studio setup requirements
Audited all packages installed by `unsloth studio setup` against actual
imports in unsloth, unsloth_zoo, and studio/backend. The following have
zero imports anywhere and are the largest offenders by disk size:
- gradio (148 MB) in studio.txt -- Studio uses React + FastAPI, not Gradio
- executorch (41.5 MB) in extras-no-deps.txt -- no imports found
- scikit-learn (31.8 MB) in extras.txt -- no imports found
- MeCab (19.9 MB) in extras.txt -- Japanese tokenizer, no imports found
- coremltools (10.2 MB) in extras.txt -- Apple CoreML, no imports found
- uroman (4.0 MB) in extras.txt -- romanization tool, no imports found
Total savings: ~255 MB (~32% of the 805 MB installed by setup).
Each line is commented out with the package size annotated so they can be
re-enabled easily if needed in the future.
* Restore scikit-learn -- needed by sentence_transformers
sentence_transformers is installed with --no-deps in extras-no-deps.txt,
so its sklearn dependency is not auto-resolved. Multiple modules in
sentence_transformers import sklearn at the top level (evaluation,
util/similarity), so removing scikit-learn would break embedding jobs.
* Rename cli/ to unsloth_cli/ to fix namespace collision with stringzilla
stringzilla installs a namespace package at cli/ (cli/split.py, cli/wc.py)
in site-packages without an __init__.py. When unsloth is installed as an
editable package (pip install -e .), the entry point script does
`from cli import app` which finds stringzilla's namespace cli/ first and
fails with `ImportError: cannot import name 'app' from 'cli'`.
Non-editable installs happened to work because unsloth's cli/__init__.py
overwrites the namespace directory, but this is fragile and breaks if
stringzilla is installed after unsloth.
Renaming to unsloth_cli/ avoids the collision entirely and fixes both
editable and non-editable install paths.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update stale cli/ references in comments and license files
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* compare for 2 diff models
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* resolving gemini comments
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix(studio): refine model-load toast stop action and compare selector sizing (#4369)
Co-authored-by: imagineer99 <samleejackson0@gmail.com>
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: imagineer99 <samleejackson0@gmail.com>
* studio: improve onboarding UX, tooltips, and training defaults
- Change splash text to "Train and run LLMs locally"
- Add "Chat Only" card with BubbleChatIcon to skip directly to chat
- Add Skip/Skip to Chat buttons in sidebar and footer
- Back button on step 1 returns to splash screen instead of being disabled
- Change "Watch video guide" to "Get started with our guide" with new URL
- Update intro text to mention all model types + chat
- Make all tooltips clickable (in addition to hover) via React context
- Strip surrounding quotes from pasted HF tokens
- Rename "Eval Split" to "Evaluation Split"
- Add SparklesIcon to "Auto Detect" format option
- Change step 4 heading to "Choose your training parameters"
- Default max_steps to 60
- Learning rate displayed in scientific notation with +/- stepper
- Context length options capped by model's max_position_embeddings (via AutoConfig)
- Fix "QLORA"/"LORA" to "QLoRA"/"LoRA" in summary step
- Backend: add max_position_embeddings to model config endpoint
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* compare for 2 diff models
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* resolving gemini comments
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* studio: disable thinking for Qwen3.5 <9B and always for AI Assist
- Change Qwen3.5 thinking threshold from <=2B to <9B (0.8B, 2B, 4B
all disable thinking by default; 9B+ enables it)
- Always pass enable_thinking=False in AI Assist helper calls
(_run_with_helper and _generate_with_backend) regardless of chat
thinking settings
* studio: address PR review comments
- Extract _get_max_position_embeddings helper to DRY config extraction
- Fix "Skip to Chat" to navigate to /chat on step 1 (was /studio)
* fix: comment out debug print statements
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* studio: skip Shiki highlighting for incomplete SVG code fences
While streaming SVG content, the syntax highlighter (Shiki) re-parses
the entire growing SVG on every token, blocking the main thread and
freezing the code area until the fence closes. Show a plain-text
preview for incomplete SVG fences instead, similar to how Mermaid
diagrams show a placeholder while streaming.
* studio: fix default top_k from 50/40 to 20 for chat inference
Per Qwen3.5 docs (unsloth.ai/docs/models/qwen3.5), top_k should be 20
for both thinking and non-thinking modes. The model-specific config in
inference_defaults.json already had top_k=20 for Qwen3.5, but the
generic fallback defaults were wrong:
- Frontend DEFAULT_INFERENCE_PARAMS.topK: 50 -> 20
- Backend generate_chat_completion top_k: 40 -> 20
- Backend generate_chat_completion_with_tools top_k: 40 -> 20
- Frontend title generation top_k: 40 -> 20
* studio: set universal inference defaults for unknown models
Default params for any model without specific config:
temperature=0.6, top_p=0.95, top_k=20, min_p=0.01,
presence_penalty=0.0, repetition_penalty=1.0
Models with entries in inference_defaults.json (Qwen3.5, Gemma-3,
Llama, etc.) override these with their recommended values.
Updated in: frontend DEFAULT_INFERENCE_PARAMS, backend Pydantic
request models, and backend generate_chat_completion defaults.
* studio: only trust_remote_code for unsloth/ models in AutoConfig
Only set trust_remote_code=True when the model name starts with
"unsloth/". All other models default to False for safety.
* studio: move Generating spinner above the composer
The "Generating" spinner was below the send message bar, causing
the bar to jump up and down. Move it above the composer in both
the regular thread view and the welcome/empty view.
* studio: adjust toast close button position away from edge
Move the X close button on toasts (like "Starting model...") from
top-1.5 to top-3 and add right-3, giving more breathing room from
the top-right corner.
* studio: make Think button smaller with tighter icon-text gap
Reduce gap from 1.5 to 0.5, padding from px-2.5/py-1 to px-2/py-0.5,
and icon from size-3.5 to size-3.
* studio: multiple onboarding and chat UX improvements
- Move Generating spinner above composer (fixes jumping send bar)
- Make Think button smaller with tighter icon-text gap
- Chat card now inside grid (same size as Audio/Embeddings cards)
- Rename "Chat Only" to "Chat"
- Chat card requires Continue to proceed (no auto-advance)
- Continue on Chat selection skips onboarding and goes to /chat
- Tooltip (i) click on Chat card doesn't trigger navigation
- Step 1 footer Back button goes back to splash (label is "Back")
- Splash "Skip Onboarding" renamed to "Skip to Chat", navigates to /chat
- Toast close button moved away from edge
* studio: align Skip to Chat button, add Skip to footer
- Sidebar "Skip to Chat" now uses primary (green) Button style with
arrow icon, full width, aligned like step items. Shows on all steps.
- Footer: added "Skip" outline button next to Continue that goes
directly to /studio with progress saved (markOnboardingDone)
* studio: change default max steps from 30 to 60 in toggle hook
The DEFAULT_MAX_STEPS in use-max-steps-epochs-toggle.ts was still 30,
used as fallback when toggling from epochs back to max steps.
* studio: extend context length options to 262K
CONTEXT_LENGTHS now includes 65536, 131072, 262144 in addition to
the existing 512-32768 range. The onboarding step filters these by
the model's max_position_embeddings (e.g. Nemotron-3-Nano-4B has
262144), showing powers of 2 up to the model's maximum.
* studio: auto-select LoRA vs QLoRA based on model size and GPU memory
After selecting a model in onboarding, detect the total model weight
file size from HF Hub (safetensors/bin files). Then estimate memory
needed: model_size_gb * 1.5 * context_scale, where context_scale is:
- <=8192 tokens: 1.0x
- 8193-16383 tokens: 1.7x
- 16384-32767 tokens: 2.0x
- >=32768 tokens: 4.0x
If the estimate fits in free GPU VRAM, default to LoRA (16-bit).
Otherwise default to QLoRA (4-bit).
Backend changes:
- Add model_size_bytes to ModelDetails (models.py)
- Add _get_model_size_bytes() using HfApi.repo_info (routes/models.py)
- Add vram_free_gb to get_gpu_summary (hardware.py)
Frontend changes:
- Add autoSelectTrainingMethod() in training-config-store.ts
- Called after model defaults are loaded
- Add model_size_bytes to ModelConfigResponse type
- Add vramFreeGb to HardwareInfo hook
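The memory estimate behind the LoRA/QLoRA default can be sketched in Python as below, reading the tier thresholds as "highest matching tier wins". The function names and the bytes-to-GiB conversion are assumptions of this example; the actual selection runs in training-config-store.ts.

```python
def context_scale(context_length: int) -> float:
    """Multiplier per context tier; the highest matching tier wins (assumption)."""
    if context_length >= 32768:
        return 4.0
    if context_length >= 16384:
        return 2.0
    if context_length > 8192:
        return 1.7
    return 1.0


def pick_training_method(model_size_bytes: int, vram_free_gb: float, context_length: int) -> str:
    """Return 'lora' when the estimate fits in free VRAM, else 'qlora'."""
    model_size_gb = model_size_bytes / 1024**3  # GiB conversion is an assumption
    estimate_gb = model_size_gb * 1.5 * context_scale(context_length)
    return "lora" if estimate_gb <= vram_free_gb else "qlora"
```

For example, an 8 GiB checkpoint at 2048 tokens is estimated at 12 GB and fits on a card with 24 GB free, while the same checkpoint at 32768 tokens is estimated at 48 GB and falls back to QLoRA.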
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* studio: rename "Importing ML libraries..." to "Importing Unsloth..."
* studio: show model/dataset in training status, fix LoRA/QLoRA casing
- Training status now shows 'Training "model_name"' and 'Dataset = ...'
instead of generic "Starting training..."
- Fix Studio progress section to show QLoRA/LoRA instead of QLORA/LORA
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* studio: rename 'Skip to Chat' to 'Skip Onboarding' on splash screen
* studio: add presence_penalty support for chat inference
Add presence_penalty as a parameter across the full stack:
- Backend: llama_cpp.py generate_chat_completion/with_tools, Pydantic
models (inference.py), routes/inference.py pass-through
- Frontend: InferenceParams type, DEFAULT_INFERENCE_PARAMS (0.0),
chat-adapter.ts payload, chat-settings-sheet.tsx slider (0-2),
model defaults loading from inference_defaults.json
- Set Qwen3.5 default presence_penalty to 1.5 per official docs
- Default for unknown models is 0.0 (off)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* studio: fix Chat card deselecting Text and aligning with other cards
* studio: fix presence_penalty not loading from inference defaults
The inference_config.py load_inference_config() was not including
presence_penalty in the returned config dict, so the Qwen3.5
default of 1.5 from inference_defaults.json never reached the
frontend. Added it to the config builder.
* studio: add delete button for cached models in model selector
Add trash icon on each downloaded model row (GGUF and safetensors) with
confirmation dialog. Backend DELETE /api/models/delete-cached endpoint
uses huggingface_hub scan_cache_dir + delete_revisions to cleanly remove
cached repos, refusing if the model is currently loaded.
* studio: restore inference defaults, reasoning, and tools on page refresh
On page refresh with a model already loaded, the frontend was not
re-applying model-specific inference defaults (presence_penalty,
temperature, etc.) or restoring reasoning/tools support flags.
Backend: Add inference config, supports_reasoning, supports_tools,
and context_length to InferenceStatusResponse.
Frontend: In the refresh callback, when an active model is detected,
apply mergeRecommendedInference and restore reasoning/tools flags
with proper Qwen3.5 size-based defaults.
* studio: fix delete dialog closing before async completes
Prevent AlertDialogAction's default close behavior with
e.preventDefault() so the dialog stays open during deletion.
Also block onOpenChange dismiss while deleting is in progress.
* fix: add Dict and Any imports to inference models
* studio: fix Qwen3.5 reasoning threshold in frontend load path
The frontend loadModel handler had the old threshold (<=2) for
disabling reasoning on small Qwen3.5 models. Changed to <9 to
match the backend. This was causing 4B to not properly disable
thinking by default when auto-loaded.
* studio: move GGUF delete to per-variant level
For GGUF repos, the trash icon now appears on each downloaded variant
row inside the quantization expander instead of on the repo-level row.
Backend accepts optional variant param to delete specific GGUF files
(blob + symlink) rather than the entire repo cache.
* studio: restore ggufContextLength on page refresh
The Max Tokens slider was capped at 32768 on page refresh because
ggufContextLength was not restored from the status response.
Now set it from statusRes.context_length on reconnect.
* fix: remove <think> from Qwen3.5 response template marker
The train-on-responses-only feature uses template markers to find
where the assistant response starts. The Qwen3.5 response marker
included '<think>\n' which is only present when thinking mode is
enabled. With thinking disabled (default for <9B), the marker
never matched, causing 100% of samples to be dropped.
Changed response marker from '<|im_start|>assistant\n<think>\n'
to '<|im_start|>assistant\n' which works regardless of thinking mode.
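To see why the old marker dropped every sample, consider this small Python sketch. The two rendered conversations are simplified, hand-written strings rather than real chat-template output, and `response_start` is a hypothetical helper.

```python
OLD_MARKER = "<|im_start|>assistant\n<think>\n"
NEW_MARKER = "<|im_start|>assistant\n"


def response_start(rendered: str, marker: str) -> int:
    """Index just past the marker, or -1 when the marker never appears."""
    idx = rendered.find(marker)
    return -1 if idx == -1 else idx + len(marker)


# Simplified renderings for illustration only.
no_thinking = (
    "<|im_start|>user\nHi<|im_end|>\n"
    "<|im_start|>assistant\nHello!<|im_end|>\n"
)
with_thinking = (
    "<|im_start|>user\nHi<|im_end|>\n"
    "<|im_start|>assistant\n<think>\nGreet.\n</think>\nHello!<|im_end|>\n"
)

print(response_start(no_thinking, OLD_MARKER))          # -1: sample would be dropped
print(response_start(no_thinking, NEW_MARKER) != -1)    # True
print(response_start(with_thinking, NEW_MARKER) != -1)  # True
```

With the shorter marker, any `<think>` block simply becomes part of the response span instead of being required for a match.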
* studio: fix sloth ASCII art alignment in training overlay
* fix: correct sloth ASCII art alignment to match Unsloth banner
* studio: add Python and terminal tool calling to chat
Register python and terminal tools alongside web search. Python
executor validates imports (stdlib only) via unsloth_zoo
rl_environments, runs code in a subprocess sandbox with 5-min
timeout and cancel support. Terminal executor blocks dangerous
commands (rm, sudo, etc.) and runs in a temp directory.
Update llama_cpp tool loop to show tool-specific status messages
and pass cancel_event through to executors. Rename composer
toggle from "Search" to "Tools" and show TerminalIcon for
execution status pills.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* studio: fix Nemotron/transformers 5.x support, onboarding navigation, port binding
Backend:
- Dynamic transformers 5.x detection via tokenizer_config.json fetch
(checks for TokenizersBackend class, cached per-model)
- Bump transformers 5.x version from 5.2.0 to 5.3.0 across all workers
and setup scripts (setup.sh, setup.ps1)
- Auto-enable trust_remote_code for unsloth/* models needing transformers 5.x
(workaround for NemotronH config parsing bug in transformers)
- Auto-install mamba-ssm/causal-conv1d for SSM models (NemotronH, Falcon-H1)
with --no-build-isolation --no-deps to avoid torch version conflicts
- Add SO_REUSEADDR to port check in run.py (fixes Colab proxy stale connection
falsely reporting port as in-use)
Frontend:
- Fix "Skip to Chat" navigation: use window.location.href instead of React
Router navigate() to bypass useEffect redirect race
- Fix "Skip Onboarding" on splash: navigates to /studio (not /chat)
- Fix onboarding guard: only check isOnboardingDone() on initial mount
- Fix Chat card on step 1: add sr-only spacer for consistent alignment
- Fix Chat+Text both selected: clear RadioGroup value when Chat is selected
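For the SO_REUSEADDR port check mentioned under Backend, a minimal Python sketch might look like this. The function name and default host are assumptions; the actual check in run.py may differ.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Try to bind host:port with SO_REUSEADDR set; True if the bind succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Without this option, lingering TIME_WAIT connections from an
        # earlier server can make bind() fail even though nothing is listening.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True
```

SO_REUSEADDR semantics vary by operating system, so this sketch should be read as describing the Linux behaviour relevant to Colab.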
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* studio: split tools toggle into Search and Code buttons
Replace the single "Tools" toggle with two independent toggles:
- "Search" (globe icon) enables web search only
- "Code" (terminal icon) enables Python and terminal execution
Add enabled_tools list field to the inference payload so the
backend only registers the tools the user has toggled on. Both
toggles appear in the main composer and the compare composer.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* studio: fix tool calling import validation and error logging
Replace unsloth_zoo-dependent import checker with a standalone
ast-based validator using sys.stdlib_module_names. This properly
blocks non-stdlib imports (numpy, requests, etc.) and returns a
clear error message to the model so it can rewrite using only
stdlib.
Add full traceback to tool streaming error logs for debugging.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: parse gpt-oss harmony channels for clean safetensors chat output
gpt-oss models emit multi-channel output via harmony protocol tokens
(<|channel|>analysis<|message|>... and <|channel|>final<|message|>...).
TextIteratorStreamer with skip_special_tokens=True strips the special
tokens but leaves channel names concatenated with content, producing
garbled output like "analysisWe need to...assistantfinalHello!".
Add HarmonyTextStreamer that decodes with skip_special_tokens=False,
parses harmony markup via regex, and emits <think>analysis</think>
for the analysis channel and plain text for the final channel --
reusing the existing frontend reasoning UI.
Also expose supports_reasoning=True for non-GGUF gpt-oss models in
the /status endpoint so the frontend enables the Think toggle.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* studio: use unsloth_zoo for Python sandbox validation
Set UNSLOTH_IS_PRESENT=1 and import check_python_modules and
check_signal_escape_patterns directly from unsloth_zoo instead
of a standalone fallback. This gives us the full Unsloth
validation including stdlib-only import checks and signal/timeout
escape pattern detection.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* studio: allow all imports in Python tool sandbox
Remove stdlib-only import restriction. Keep signal escape
pattern detection via unsloth_zoo for safety.
* studio: fix ReadTimeout on tool streaming final pass
The 0.5s read timeout used for cancel-checking during streaming
also fires when waiting for the first response from llama-server
(e.g. reasoning model thinking for 15+ seconds). Add
_stream_with_retry() context manager that retries on ReadTimeout
while checking cancel_event, so the model has unlimited time to
think before producing the first token. Applied to both the
regular streaming path and the tool-calling final pass.
* fix: rewrite HarmonyTextStreamer with stateful incremental parsing
The delta-on-transformed approach had two critical bugs:
1. Before the full <|channel|>X<|message|> pattern was complete, the
strip-tokens fallback emitted "analysis" as plain text. Then when
the regex matched, _transform returned a completely different format
(<think>...</think>) and the delta was computed against the wrong
base string, producing fragments like "think>", "nk>", ">".
2. Even with full matches, the closing </think> tag shifted position
as content grew, so text[prev_len:] produced garbled deltas.
Replace with stateful incremental parsing that:
- Buffers until a complete channel+message pair is seen
- Emits <think> once when analysis channel first appears
- Streams analysis content deltas (computed on channel content directly)
- Emits </think> once when final channel first appears
- Streams final content deltas
- Closes open think tags in end()
Also skip the generic all_special_tokens stripping in
_clean_generated_text for gpt-oss since HarmonyTextStreamer already
produces clean output and the generic stripping was mangling <think>
tags.
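The stateful approach can be sketched roughly as below. This is an illustration under assumptions, not the real HarmonyTextStreamer: the class name `HarmonyStreamSketch` and its `feed()`/`end()` methods are hypothetical, and it only handles one analysis and one final channel.

```python
CHANNEL, MESSAGE = "<|channel|>", "<|message|>"


class HarmonyStreamSketch:
    """Incrementally turn harmony channel text into <think>-style output."""

    def __init__(self):
        self.text = ""                           # everything decoded so far
        self.sent = {"analysis": 0, "final": 0}  # chars already emitted per channel
        self.think_open = False
        self.think_closed = False

    def _content(self, channel):
        """Return the channel body so far, or None if its header is incomplete."""
        header = CHANNEL + channel + MESSAGE
        start = self.text.find(header)
        if start < 0:
            return None
        body = self.text[start + len(header):]
        cut = body.find("<|")                    # the next special token ends the body
        if cut >= 0:
            return body[:cut]
        # Hold back a trailing "<" that may be the start of a special token.
        return body[:-1] if body.endswith("<") else body

    def _delta(self, channel, body):
        # Deltas are computed on the channel content itself, never on the
        # transformed output, so tag positions cannot shift them.
        new = body[self.sent[channel]:]
        self.sent[channel] = len(body)
        return new

    def feed(self, chunk):
        self.text += chunk
        out = []
        analysis = self._content("analysis")
        if analysis is not None:
            if not self.think_open:
                out.append("<think>")            # emitted exactly once
                self.think_open = True
            out.append(self._delta("analysis", analysis))
        final = self._content("final")
        if final is not None:
            if self.think_open and not self.think_closed:
                out.append("</think>")           # emitted exactly once
                self.think_closed = True
            out.append(self._delta("final", final))
        return "".join(out)

    def end(self):
        """Close a think tag left open when generation stops mid-analysis."""
        if self.think_open and not self.think_closed:
            self.think_closed = True
            return "</think>"
        return ""
```

Because nothing is emitted until a full channel+message header is present, the word "analysis" never leaks out as plain text, whatever the chunk boundaries are.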
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: strip all <|...|> tokens in gpt-oss cleanup, not just harmony subset
The gpt-oss tokenizer has added tokens like <|return|> (id=200002) that
are not part of the harmony channel protocol but can leak into output.
The previous regex only stripped channel|message|start|end tokens.
Broaden the _clean_generated_text regex for gpt-oss to <\|[a-z_]+\|>
which catches all pipe-delimited tokens (return, constrain, reserved,
etc.) without matching <think>/</think> tags.
Verified: gpt-oss all_special_tokens are only <|return|>,
<|reserved_200017|>, <|startoftext|> -- none overlap with <think>.
The harmony tokens (channel, message, start, end) are added_tokens
but not in all_special_tokens.
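For illustration, the pattern quoted above behaves as follows on a small sample. The helper name `strip_pipe_tokens` is hypothetical. Note that the character class as written contains no digits, so a token such as `<|reserved_200017|>` would need `[a-z0-9_]` to match; the sketch keeps the pattern exactly as quoted.

```python
import re

# Pattern as quoted in the commit message: lowercase letters and underscores
# between "<|" and "|>".
PIPE_TOKEN_RE = re.compile(r"<\|[a-z_]+\|>")


def strip_pipe_tokens(text):
    """Remove pipe-delimited special tokens while leaving <think> tags alone."""
    return PIPE_TOKEN_RE.sub("", text)
```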
* fix: hide config-only model repos from cached models list
Repos that only have metadata/config files cached (no .safetensors or
.bin weight files) were showing up in the Downloaded list with tiny
sizes like "1.8 KB" or "24 KB". These are just leftover config
snapshots from architecture checks, not usable models.
Filter the cached-models endpoint to only include repos that contain
actual model weight files (.safetensors or .bin).
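A sketch of the filter, assuming the endpoint already has a mapping from repo id to cached file names (the function names and that mapping are hypothetical):

```python
WEIGHT_SUFFIXES = (".safetensors", ".bin")


def has_weight_files(filenames):
    """True if at least one cached file looks like a model weight file."""
    return any(name.lower().endswith(WEIGHT_SUFFIXES) for name in filenames)


def filter_cached_repos(repos):
    """Keep only repos that contain weights; repos maps repo_id -> file names."""
    return [repo_id for repo_id, files in repos.items() if has_weight_files(files)]
```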
* studio: fix toast description text contrast in dark mode
Add explicit !text-muted-foreground to toast description classNames
so secondary text (e.g. "Releases VRAM and resets inference state.")
is readable in dark mode.
* studio: fix Chat card icon alignment with size-4 spacer
Replace sr-only span (takes no space) with a size-4 shrink-0 div
matching the RadioGroupItem dimensions in other cards, so the Chat
icon aligns vertically with Text/Audio/Vision/Embeddings icons.
---------
Co-authored-by: workspace <user@workspace.local>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Manan17 <shahmanan170602@gmail.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
* fix(llm_assist): disable thinking mode for helper model JSON output
Pass enable_thinking=False to generate_chat_completion() in both
_run_with_helper() and _generate_with_backend() so the Qwen3.5-4B
helper model produces clean JSON instead of wrapping responses in
<think> tags.
* fix(llm_assist): log per-request enable_thinking=False override
Add info-level log lines so the user can see that each helper/advisor
request overrides the server-level thinking default to False.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- Add SVG preview rendering below code blocks using safe data URI
in <img> tag. Includes sanitization to block script/event handlers.
- Fix GGUF streaming crash: cache response.iter_text() iterator
instead of creating a new one on every loop iteration.
- Fix model selector showing "Select model..." after auto-load by
re-reading store state after setCheckpoint before setParams.
- Remove unused warmupToastShown variable (TS6133 build error).
- Change default suggestion to "Draw an SVG of a cute sloth".
The streaming loop used response.iter_text() with timeout=None, which
blocks until the next chunk arrives from llama-server. On large models
like Qwen3.5-27B where each token takes seconds, pressing Stop in the
UI would not take effect until the next token was produced.
Fix by using a 0.5s read timeout and a new _iter_text_cancellable()
helper that checks cancel_event between timeout windows and explicitly
closes the response when cancelled. Applied to both the regular chat
completion and tool-calling streaming paths.
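A rough sketch of the cancellable loop, not the real `_iter_text_cancellable()`. Here `read_chunk` and `close` are hypothetical callables, `TimeoutError` stands in for the client's read timeout, and `None` is an assumed end-of-stream marker:

```python
import threading


def iter_text_cancellable(read_chunk, cancel_event, close):
    """Yield chunks from read_chunk(), checking cancel_event between reads.

    read_chunk() is expected to raise TimeoutError after a short wait
    (0.5s in the commit above) when no data has arrived yet.
    """
    while True:
        if cancel_event.is_set():
            close()  # explicitly close the response so the server stops generating
            return
        try:
            chunk = read_chunk()
        except TimeoutError:
            continue  # nothing arrived in this window; go back to the cancel check
        if chunk is None:
            return
        yield chunk
```

The worst-case delay between pressing Stop and the stream closing is one timeout window rather than one token.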
Creative: temperature=1.5, min_p=0.1, top_p=Off (1.0), top_k=Off (0)
Precise: temperature=0.1, top_p=0.95, top_k=80, min_p=0.01
Also show "Off" in the slider label for top_p=1.0, top_k=0, and
repetition_penalty=1.0 since those values disable their respective
samplers. Changed top_k slider min from -1 to 0.
* studio: switch helper model to Qwen3.5-4B-GGUF
Replace Qwen3-4B-Instruct-2507-GGUF with Qwen3.5-4B-GGUF as the
default helper model for LLM-assisted dataset detection. Same
UD-Q4_K_XL variant.
* studio: fix stale GGUF metadata when switching models (#4347)
Reset _supports_reasoning, _supports_tools, _context_length, and
_chat_template at the start of _read_gguf_metadata() to prevent
stale settings from a previous model leaking into the next load.
Co-authored-by: Daniel Han <daniel@unsloth.ai>
* studio: change login error to "Incorrect password", add reset-password CLI
- Login error now says "Incorrect password" instead of the generic
"Incorrect username or password" since Studio only has one account.
- Add `unsloth studio reset-password` command that deletes the auth
database so a fresh admin account with a new random password is
created on the next server start.
* studio: include reset command in login error message
* studio: change password setup subtitle wording
## Summary
- Add web search tool calling for GGUF models (Search toggle, DuckDuckGo via ddgs)
- Add KV cache dtype dropdown (f16/bf16/q8_0/q5_1/q4_1) in Chat Settings
- Fix Qwen3/3.5 inference defaults per official docs (thinking on/off params)
- Enable reasoning by default for Qwen3.5 4B and 9B
- Replace "Generating" toast with inline spinner
- Fix stop button via asyncio.to_thread (event loop no longer blocked)
- Fix CUDA 12 compat lib paths for llama-server on CUDA 13 systems
- Fix auto-load model name not appearing in selector
- Training progress messages + dataset_num_proc fix
Integrated PRs:
- #4327 (imagineer99): BETA badge alignment (already in tree)
- #4340 (Manan Shah): prioritize training models in model selection
- #4344 (Roland Tannous): setup.sh macOS python version compatibility
- #4345 (Manan Shah): revamp model+dataset checking logic
* Removing .precommit config
* edited colab comments
* studio: update Unsloth_Studio_Colab.ipynb
* studio: update Unsloth_Studio_Colab.ipynb
* studio: add Colab T4 GPU metadata to force T4 instance
* style: update colab popup to black/white theme with gem icon and play button
* feat: center landscape image in colab notebook
* style: shrink popup to fit content, truncate URL display
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* feat: center landscape image in colab notebook
* feat: use GitHub raw URL for studio landscape image in notebook
* chore: update colab notebook
* feat: add studio landscape colab display image and update notebook
* feat: update notebook with studio landscape image
* style: remove colors, add progress bar, add VERBOSE flag to install output
* docs: add comments explaining VERBOSE flag and progress bar
* chore: update colab notebook
* fix: define VERBOSE, _STEP, _TOTAL at module level to fix NameError
---------
Co-authored-by: LeoBorcherding <LeoBorcherding@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Remove outdated xformers Blackwell version guard
The guard at _utils.py:976-989 blocked xformers 0.0.32.post2 on
Blackwell/RTX 50x/Jetson GPUs (SM 10.0/11.0/12.0) due to a FA3
dispatch bug that caused CUDA errors (issue #1329).
This is no longer needed because:
1. xformers fixed the FA3 dispatch in 0.0.33.post2 by capping it
at SM <= 9.0, so FA3 is never attempted on Blackwell. The FA2
backend works correctly via PTX forward compatibility.
2. The only blocked version (0.0.32.post2) was built for torch 2.8.0
and cannot load on torch 2.9+ due to ABI mismatch, so the guard
never actually triggers for any current user.
3. The existing _register_extensions() check plus the except Exception
fallback already handle broken xformers installs gracefully by
falling back to SDPA.
Verified on NVIDIA RTX PRO 6000 Blackwell (SM 12.0) with both
pre-built wheels (0.0.33.post2) and source builds -- all attention
tests pass with exact numerical match vs SDPA.
* Update xformers Blackwell guard with root cause and broader coverage
Changes to the xformers version guard for Blackwell/RTX 50x/Jetson GPUs:
1. Broaden version check from `in (0.0.32.post2,)` to `<= 0.0.32.post2`
to cover all versions with the broken FA3 dispatch, not just one.
2. Add `DEVICE_TYPE == "cuda"` guard to avoid calling
`get_device_capability()` on non-CUDA devices (XPU, etc.).
3. Document the root cause: xformers <= 0.0.32.post2 used
`capability >= (9, 0)` in the FA3 dispatch, which matched
Blackwell SM 12.0 and attempted sm_90a Hopper kernels on it.
Fixed upstream in 0.0.33 with `<= (9, 0)`.
4. Update error message to include the installed version, mention
the fix (upgrade to >= 0.0.33), and keep the build-from-source
fallback. The raise is caught by `except Exception` which shows
the message when UNSLOTH_ENABLE_LOGGING is set and falls back
to SDPA.
Verified on NVIDIA RTX PRO 6000 Blackwell (SM 12.0):
- xformers 0.0.33.post2 pre-built wheel: works (FA2 via PTX)
- xformers source build: works (FA2 native)
- Both have exact numerical match vs SDPA
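The root cause and the broadened guard can be illustrated as below. This is a sketch only: the function names are hypothetical, the dispatch predicates model just the single comparison quoted above, and treating "major capability >= 10" as Blackwell-class is an assumption based on the SM 10.0/11.0/12.0 list.

```python
import re


def fa3_dispatch_old(capability):
    """Comparison used by xformers <= 0.0.32.post2, per the commit message."""
    return capability >= (9, 0)


def fa3_dispatch_new(capability):
    """Comparison after the upstream fix, per the commit message."""
    return capability <= (9, 0)


def version_key(version):
    """Turn '0.0.32.post2' into (0, 0, 32, 2); a missing post-release counts as 0."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)(?:\.post(\d+))?", version)
    if m is None:
        raise ValueError(f"unrecognised version: {version!r}")
    return tuple(int(g or 0) for g in m.groups())


def xformers_blocked(version, capability, device_type="cuda"):
    """True when the guard should reject this xformers build on this GPU."""
    if device_type != "cuda":
        return False  # never query CUDA capability semantics on XPU etc.
    blackwell_class = capability[0] >= 10  # assumption: SM 10.0 / 11.0 / 12.0
    return blackwell_class and version_key(version) <= version_key("0.0.32.post2")
```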
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* fix: add Qwen3.5 version gate in loader dispatch (#4188)
Qwen3.5 (model_type qwen3_5) only exists in transformers >= 5.0.0.
Without this gate, loading a Qwen3.5 model on transformers 4.x gives
an unhelpful generic error. This adds a clear version check before the
qwen3 dispatch to prevent substring misrouting and give a useful error
message pointing users to upgrade.
No dedicated FastQwen3_5Model is needed -- the compiler already applies
fused CE automatically via apply_fused_lm_head for both
Qwen3_5ForCausalLM and Qwen3_5ForConditionalGeneration. The generic
FastModel fallback path handles everything.
FORCE_FLOAT32 already has qwen3_5 on main.
Tested on transformers 5.3.0: Qwen3.5-0.8B 4bit, 1.38 GB peak memory.
Backwards compatible: import unsloth works on transformers 4.57.6.
* fix: update FORCE_FLOAT32 comment for qwen3_5
The (1+w) RMSNorm pattern does not overflow float16 since Qwen3_5RMSNorm
computes in float32 internally. The actual reason FORCE_FLOAT32 is needed
is that Qwen3.5 GDN layers produce NaN grad norms during float16 training.
Updated the comment to reflect the real reason.
* fix: move qwen3_5 version check before dispatch chain
The elif block intercepted qwen3_5 on transformers >= 5.0.0 without
setting dispatch_model, causing UnboundLocalError at line 715.
Move the version check before the if/elif dispatch chain so on
transformers >= 5.0.0 the model_type falls through to the generic
FastModel path as intended.
* fix: qwen3_5 requires transformers >= 5.2.0, not 5.0.0
Checked all 5.x releases:
- 5.0.0: no qwen3_5 module
- 5.1.0: no qwen3_5 module
- 5.2.0: qwen3_5 available
* fix: move qwen3_5 version check into AutoConfig error handler
The previous version check at the dispatch chain was unreachable --
AutoConfig.from_pretrained fails first with a generic "does not
recognize this architecture" error on transformers < 5.2.0, so
execution never reached the check.
Move the qwen3_5-specific error message into the AutoConfig exception
handler where "architecture" errors are caught. This intercepts the
error before the generic message and gives users a clear upgrade path.
Also remove the now-redundant check before the dispatch chain.
Both FastLanguageModel and FastModel paths are covered.
Tested: transformers 4.57.6 shows the Qwen3.5-specific error,
transformers 5.3.0 loads and trains normally.
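A sketch of the error-handler approach, under assumptions: `explain_autoconfig_error` is a hypothetical helper, and it assumes the caller already knows the model_type (for example from the repo's config.json) when AutoConfig fails.

```python
import re


def version_tuple(version):
    """Parse the leading 'X.Y.Z' of a version string into a tuple of ints."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if m is None:
        raise ValueError(f"unrecognised version: {version!r}")
    return tuple(int(g) for g in m.groups())


def explain_autoconfig_error(error_message, model_type, transformers_version):
    """Return a Qwen3.5-specific hint for 'architecture' errors, else the original."""
    too_old = version_tuple(transformers_version) < (5, 2, 0)
    if "architecture" in error_message and model_type == "qwen3_5" and too_old:
        return (
            "Qwen3.5 (model_type qwen3_5) requires transformers >= 5.2.0, "
            f"but {transformers_version} is installed. Please upgrade transformers."
        )
    return error_message
```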
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* Removing .precommit config
* edited colab comments
* studio: update Unsloth_Studio_Colab.ipynb
* studio: update Unsloth_Studio_Colab.ipynb
* studio: add Colab T4 GPU metadata to force T4 instance
* style: update colab popup to black/white theme with gem icon and play button
* feat: center landscape image in colab notebook
* style: shrink popup to fit content, truncate URL display
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* feat: center landscape image in colab notebook
* feat: use GitHub raw URL for studio landscape image in notebook
* chore: update colab notebook
---------
Co-authored-by: LeoBorcherding <LeoBorcherding@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix: prefer existing CUDA_PATH toolkit to avoid version mismatch on multi-CUDA systems
* fix: validate GPU arch support before accepting CUDA toolkit (sm_120 + CUDA 12.4 fallback)
* debug: add temporary CUDA compatibility check print
* fix: auto-copy CUDA VS integration files when missing (No CUDA toolset found)
* fix: return false when nvcc --list-gpu-arch unavailable (reject old toolkit, scan for newer)
* fix: re-sanitize CUDA env vars before cmake build (survives Refresh-Environment)
* fix: use --list-gpu-code (sm_*) instead of --list-gpu-arch (compute_*) for arch probing
* studio: extract param count from model name as fallback
When HuggingFace API doesn't return totalParams for a model,
extract the param count from the model name (e.g. "Qwen3-0.6B"
-> "0.6B", "Llama-3.2-1B-Instruct" -> "1B"). Applied to both
the recommended list and HF search results.
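One way to do the name-based extraction, shown as a sketch (the actual Studio code may use a different pattern; `param_count_from_name` is a hypothetical name):

```python
import re

# A number followed by "B", not glued to a preceding word character or dot,
# so the "3" in "Qwen3" and the "3.2" in "Llama-3.2" are skipped.
_PARAM_RE = re.compile(r"(?<![\w.])(\d+(?:\.\d+)?)[Bb](?![A-Za-z0-9])")


def param_count_from_name(model_name):
    """Return e.g. '0.6B' for 'Qwen3-0.6B', or None if no size is present."""
    m = _PARAM_RE.search(model_name)
    return f"{m.group(1)}B" if m else None
```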
* studio: read GGUF context_length via fast header parser, set max tokens
- Fast GGUF metadata reader (~30-55ms) parses only KV header, skips
tensor data and large arrays (tokenizer vocab etc)
- Extracts context_length and chat_template from GGUF metadata
- Returns context_length in LoadResponse for frontend to use
- Frontend sets maxTokens to actual context_length for GGUFs (e.g.
262144 for Qwen3.5-9B, 131072 for Qwen2.5-7B)
- Max Tokens slider shows "Max" and is locked for GGUFs
- Auto-load path also uses actual context_length from load response
- Toast auto-dismiss (5s) and close button for auto-load toast
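The header-only read can be sketched as below. This is not Studio's parser: it assumes the little-endian GGUF v2/v3 layout (magic, uint32 version, uint64 tensor count, uint64 KV count, then typed key/value pairs), and the function names are hypothetical.

```python
import io
import struct

# GGUF scalar value types -> struct formats (little-endian).
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    return struct.unpack(fmt, f.read(struct.calcsize(fmt)))[0]


def _read_string(f):
    length = _read(f, "<Q")
    return f.read(length).decode("utf-8", errors="replace")


def _skip_value(f, vtype):
    """Seek past a value without decoding it (used for vocab arrays etc.)."""
    if vtype in _SCALARS:
        f.seek(struct.calcsize(_SCALARS[vtype]), io.SEEK_CUR)
    elif vtype == _STRING:
        f.seek(_read(f, "<Q"), io.SEEK_CUR)
    elif vtype == _ARRAY:
        elem_type = _read(f, "<I")
        count = _read(f, "<Q")
        if elem_type in _SCALARS:
            f.seek(count * struct.calcsize(_SCALARS[elem_type]), io.SEEK_CUR)
        else:
            for _ in range(count):
                _skip_value(f, elem_type)
    else:
        raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_header(f):
    """Return context_length and chat_template from the KV section only."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    _read(f, "<I")                      # version
    _read(f, "<Q")                      # tensor count (tensor data is never read)
    kv_count = _read(f, "<Q")
    found = {}
    for _ in range(kv_count):
        key = _read_string(f)
        vtype = _read(f, "<I")
        if key.endswith(".context_length") and vtype in _SCALARS:
            found["context_length"] = _read(f, _SCALARS[vtype])
        elif key == "tokenizer.chat_template" and vtype == _STRING:
            found["chat_template"] = _read_string(f)
        else:
            _skip_value(f, vtype)
    return found
```

Since the loop stops after the KV section, the cost depends on the metadata size rather than on the size of the model file.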
* studio: GGUF TTS audio support (from PR #4318)
Add GGUF TTS audio generation via llama-server. When a GGUF model
loads, the backend probes its vocabulary to detect audio codecs
(SNAC/BiCodec/DAC/CSM/Whisper). If detected, the codec is pre-loaded
and the model is reported as audio to the frontend.
During chat, TTS models route to the audio generation path which sends
a per-codec prompt to llama-server's /completion endpoint, extracts
generated tokens/text, and decodes to WAV using AudioCodecManager.
Also strips base64 audio data from prior assistant messages to prevent
context overflow.
Co-authored-by: Manan Shah <mananshah511@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove package-lock.json from tracking
* studio: per-model inference defaults, GGUF max tokens fix, reasoning toggle
- Add inference_defaults.json with per-model-family sampling parameters
for ~50 families (Qwen3.5, Qwen3, Gemma-3, Llama-3, DeepSeek, etc.).
Values sourced from unslothai/docs and Ollama params blobs.
- Family-based lookup in inference_config.py: extracts model family from
identifier, matches against patterns (longest match first), merges with
priority: model-specific YAML > family JSON > default.yaml.
- Fix GGUF Max Tokens slider locked at "Max": store ggufContextLength
separately from maxTokens so the slider is adjustable (step=64).
- Fix Ministral YAML: top_p was literal string "default", now 0.95.
- Add reasoning toggle for thinking models (Qwen3.5, Qwen3, DeepSeek-R1,
DeepSeek-V3.1, etc.): detect enable_thinking support from GGUF chat
template metadata, pass --jinja to llama-server, send
chat_template_kwargs per-request. Frontend shows "Reasoning is ON/OFF"
pill button next to attachment button in composer.
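The family lookup and merge order can be sketched as follows. The function names are hypothetical and plain dicts stand in for the YAML/JSON files; the real inference_config.py may structure this differently.

```python
def match_family(model_id, family_defaults):
    """Pick the longest family pattern contained in the model identifier."""
    name = model_id.lower()
    hits = [pattern for pattern in family_defaults if pattern.lower() in name]
    return max(hits, key=len) if hits else None


def resolve_inference_params(model_id, base_defaults, family_defaults, model_overrides):
    """Merge with priority: model-specific > family > base defaults."""
    params = dict(base_defaults)
    family = match_family(model_id, family_defaults)
    if family is not None:
        params.update(family_defaults[family])
    params.update(model_overrides.get(model_id, {}))
    return params
```

Longest match first matters because "qwen3" is a substring of "qwen3.5": without it, a Qwen3.5 model could silently pick up the Qwen3 defaults.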
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* studio: remove default system prompt injection
Backend was injecting "You are a helpful AI assistant." when no system
prompt was provided. Neither unslothai/docs nor Ollama specify a default
system prompt for most models. Now defaults to empty string, letting the
model's own chat template handle system behavior.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* studio: use lightbulb icons and "Think" label for reasoning toggle
Lightbulb on when thinking enabled, lightbulb-off when disabled.
Label is just "Think" in both states; grayed out styling when off.
* studio: fix HTML file upload breaking chat
Replace SimpleTextAttachmentAdapter with custom TextAttachmentAdapter
(excludes text/html) and HtmlAttachmentAdapter that strips tags via
DOMParser, removing scripts/styles and extracting readable text content
instead of dumping raw HTML markup into the conversation.
* studio: show chat template in Configuration panel
Display the model's Jinja2 chat template in a new "Chat Template"
section under Settings (now open by default). For GGUFs, reads from
GGUF metadata; for safetensors, reads from tokenizer.chat_template.
Template is editable with a "Restore default chat template" button
that appears when modified. Section only shows when a model with a
chat template is loaded.
* studio: editable chat template with Apply & Reload
Chat template section now functional:
- Editing the template shows "Apply & Reload" (reloads model with
custom template) and "Revert changes" buttons
- For GGUFs: writes template to temp .jinja file, passes
--chat-template-file to llama-server on reload
- For non-GGUF: passes chat_template_override in load request
- Settings section now open by default
- selectModel supports forceReload to reload same model
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* studio: fix DeepSeek reasoning detection and auto-load metadata
- Set _model_identifier before _read_gguf_metadata so DeepSeek
"thinking" template detection works (was always None before)
- Populate ggufContextLength, supportsReasoning, reasoningEnabled,
defaultChatTemplate in autoLoadSmallestModel GGUF path
* studio: add spacing before BETA badge in navbar
Add gap-1.5 on the logo Link container to space the BETA label
from the wordmark.
Co-authored-by: Imagineer99 <Imagineer99@users.noreply.github.com>
* studio: vertically center BETA badge with logo
---------
Co-authored-by: Manan Shah <mananshah511@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Imagineer99 <Imagineer99@users.noreply.github.com>
* Strip <think> blocks from LLM assist model output
* Add debug logging for raw LLM assist output
* Quiet llama-server logs, use structlog in llm_assist
* Fix think-tag stripping when response is inside tags
* Remove debug logging of raw model output
* Clarify GGUF download logs: show cache hit vs actual download
* Clarify heuristic-detected mapping in UI text
* Default helper model to Qwen3-4B-Instruct-2507 UD-Q4_K_XL
* Remove package-lock.json from tracking, add to .gitignore
* Auto-open mapping dialog on Start Training for custom_heuristic format
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use last think block when extracting inner content (review feedback)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix VLM GRPO matmul shape mismatch in _get_per_token_logps_and_entropies
VLM models (e.g. Qwen2.5-VL) can return logits [B*T, vocab_size] instead
of hidden states [B*T, hidden_dim] from their forward pass. When this
happens, chunked_hidden_states_selective_log_softmax tries to compute
logits @ lm_head.t() which fails with a shape mismatch.
Add a shape guard in the VLM branch of _get_per_token_logps_and_entropies:
check output.shape[-1] against lm_head.shape[1] (hidden_dim). When hidden
states are returned, the existing path is taken. When logits are returned,
scaling/softcapping/temperature are applied manually and
chunked_selective_log_softmax is used instead.
Also add chunked_selective_log_softmax to the import from unsloth_zoo.
The text-only branch (pixel_values is None) is unchanged.
Companion PR to unslothai/unsloth-zoo for grpo_accumulated_loss.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove redundant scaling in logits fallback path
When COMPILE_DISABLE=1 and the model returns logits directly, scaling
and softcapping are already applied by the model forward. Only
temperature (a GRPO training parameter) needs to be applied.
* Pass temperature to chunked_selective_log_softmax instead of manual cast
Use the new temperature parameter in chunked_selective_log_softmax
(added in companion zoo PR) to avoid casting the entire logits tensor
to float32 before the function call.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
The existing fix that removes use_reentrant=False from
gradient_checkpointing_kwargs was gated behind RLConfig_name ==
"GRPOConfig", so only GRPOConfig was protected. SFTConfig, DPOConfig,
KTOConfig, CPOConfig, ORPOConfig etc. were all still affected.
Remove the GRPOConfig guard so the fix applies to all compiled trainer
configs when TRL >= 0.27.0.
This is defense-in-depth alongside the unsloth_zoo fix that forces
use_reentrant=True in unsloth_checkpoint() itself.
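In isolation, the kwarg cleanup amounts to something like this sketch. The helper name is hypothetical, and the TRL >= 0.27.0 version gate is left out:

```python
def drop_use_reentrant_false(gc_kwargs):
    """Return a copy of gradient_checkpointing_kwargs without use_reentrant=False."""
    cleaned = dict(gc_kwargs or {})
    if cleaned.get("use_reentrant") is False:
        del cleaned["use_reentrant"]
    return cleaned
```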
- GGUF: use -c 0 for model's native context size (no 4096 cap)
- GGUF: hide Max Seq Length slider (irrelevant), set Max Tokens to Max
- Non-GGUF: default Max Tokens to 4096
- Max Tokens slider shows "Max" label when at ceiling for GGUFs
- Run non-GGUF load_model in asyncio.to_thread for progress polling
- Auto-load smallest downloaded model when chatting without selection
- Wait for in-progress model load before inference (modelLoading store flag)
- Recommended list: 4 GGUFs + 4 hub models after case-insensitive dedup
- Model selector waits for cached data before rendering
- Toast close button repositioned, Sampling section open by default
- Add logging to _get_repo_size_cached exception handler
- Use -c 0 for llama-server (model's native context size, no 4096 cap)
- Run non-GGUF backend.load_model in asyncio.to_thread for progress polling
- Auto-load smallest downloaded model when user chats without selecting one
- Wait for in-progress model load before inference (no "No model loaded" error)
- Add modelLoading flag to zustand store for cross-component coordination
- Dynamic top models: send 8 GGUFs + 8 hub models, frontend caps 4+4 after dedup
- Case-insensitive dedup: downloaded models correctly hide from recommended list
- Prevent duplicate toasts: guard against double selectModel calls
- Model selector waits for cached data before rendering (no empty flash)
- Toast close button positioned at top-right with proper spacing
- Sampling section expanded by default in chat settings
- Global toast close button styling fix
Change all repetition_penalty defaults from 1.1 (or 1.05/1.2 in
presets) to 1.0 across the entire backend and frontend. Most models
handle repetition well on their own and a non-1.0 penalty can degrade
output quality, especially for code, structured output, and creative
tasks.
Files changed:
- Backend: inference.py, llama_cpp.py, orchestrator.py, worker.py,
models/inference.py (Field defaults)
- Frontend: chat-settings-sheet.tsx (Creative/Precise presets),
runtime-provider.tsx (auto-title generation)
- maxTokens: 2048 -> 8192. The old 2048 limit caused generation to
stop mid-output for longer responses (e.g. reasoning/thinking models
that produce long chain-of-thought before the answer).
- repetitionPenalty: 1.1 -> 1.0 (disabled). Most models handle
repetition well on their own. A penalty of 1.1 can hurt quality
for creative tasks like code generation and ASCII art.
- Change welcome message from "Run LLMs or test your fine-tune" to
"Chat with your model".
Merge the toast UX refactor from PR #4304 (by @Shine1i):
- Toast duration 5s default with close button (X) for manual dismiss
- Inline progress bar component (ModelLoadInlineStatus) shown in the
header after toast is dismissed
- Model switch warning only for image compatibility (not generic)
- activeThreadId tracked in store via ActiveThreadSync
- Loading state cleanup via resetLoadingUi helper
- Toast uses Infinity duration during loading with onDismiss handler
Re-applied non-GGUF download progress additions on top:
- getDownloadProgress for all models (not just GGUF)
- hasShownProgress flag, loadingModelRef race condition checks
- First poll at 500ms, bytes-only fallback when expected size unknown
Don't show "Model changed for this chat" toast when the thread has
no messages. On a fresh page load with a stale thread from a previous
session, this warning is confusing. The warning is only useful
mid-conversation to alert about image compatibility with the new model.
When messages.length === 0, silently update the thread's modelId and
proceed with loading.
The _VISION_CHECK_SCRIPT subprocess used logger.info() but logger was
never defined in the subprocess context. This caused a NameError on
every vision check, making all transformers 5.x models (Qwen3.5,
GLM, etc.) fall back to text-only mode even when they support vision.
Replace logger.info() with print() since the parent process reads
the subprocess stdout via result.stdout.
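The failure mode is easy to reproduce with a toy script. In this sketch the two script strings and `run_check` are hypothetical stand-ins for `_VISION_CHECK_SCRIPT` and its caller:

```python
import subprocess
import sys

BROKEN_SCRIPT = 'logger.info("is_vision=True")'  # logger is undefined in the child
FIXED_SCRIPT = 'print("is_vision=True")'          # stdout is what the parent reads


def run_check(script):
    """Run a script in a fresh interpreter and return (returncode, stdout, stderr)."""
    result = subprocess.run(
        [sys.executable, "-c", script],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout.strip(), result.stderr
```

The child process starts with an empty namespace, so nothing defined in the parent module (including `logger`) exists there unless the script defines it itself.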
- Add sloth emoji prefix to "Downloaded" and "Recommended" section
labels in the Hub model picker so they are visually distinct.
- Replace browser network errors ("NetworkError when attempting to
fetch resource" / "Failed to fetch") with a clearer message:
"Studio isn't running -- please relaunch it."
llama-server uses stb_image internally which does not support WebP,
TIFF, AVIF, and other formats that browsers accept for upload.
Uploading a WebP image to a vision GGUF model caused a 400 error:
"Failed to load image or audio file" / "failed to decode image bytes".
Convert all uploaded images to PNG via PIL before base64-encoding and
forwarding to llama-server. This handles WebP, TIFF, BMP, GIF, AVIF,
and any other format PIL supports. RGBA images are converted to RGB
first since PNG with alpha can cause issues in some vision pipelines.
GGUF repos with mmproj files (e.g. Qwen3.5-0.8B-GGUF) are already
detected as vision-capable by list_gguf_variants(), and is_vision is
set correctly in ModelConfig. However, the HF download path only
downloaded the main GGUF file without the mmproj projection file,
so llama-server started without --mmproj and rejected image uploads
with "text-only model" errors.
Add _download_mmproj() to LlamaCppBackend that:
- Lists repo files for mmproj*.gguf matches
- Prefers mmproj-F16.gguf (best quality), falls back to any mmproj
- Downloads via hf_hub_download (uses the same HF cache)
In load_model(), when is_vision=True and no explicit mmproj_path was
provided (HF mode), auto-download the mmproj after the main GGUF.
The downloaded path is passed to llama-server via --mmproj.
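The selection rule can be sketched as a pure function over the repo's file list. `pick_mmproj` is a hypothetical name, and the alphabetical tie-break among fallbacks is an assumption; the listing and the `hf_hub_download` call are omitted.

```python
def pick_mmproj(repo_files):
    """Choose an mmproj file: prefer mmproj-F16.gguf, else any mmproj*.gguf."""
    def basename(path):
        return path.rsplit("/", 1)[-1].lower()

    candidates = [
        path for path in repo_files
        if basename(path).startswith("mmproj") and basename(path).endswith(".gguf")
    ]
    if not candidates:
        return None
    for path in candidates:
        if basename(path) == "mmproj-f16.gguf":
            return path
    return sorted(candidates)[0]
```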
1. Backend: When a model fails with "No config file found" or similar
unsupported-model errors, wrap the message with "This model is not
supported yet. Try a different model." instead of showing the raw
Unsloth exception.
2. Frontend: Compute estimated download size from the HF search API's
safetensors.parameters dtype breakdown (BF16 = 2 bytes/param, I32 = 4 bytes/param,
F32 = 4 bytes/param, etc.) and show it in the model picker instead of just
the param count. For example, Kimi-K2.5 now shows "~554 GB" instead
of "171B" (which was misleading since 171B params != 171GB download).
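The size estimate is a weighted sum over the dtype breakdown. A sketch in Python (the frontend does this in TypeScript; the function names and the exact dtype table here are assumptions):

```python
# Bytes per parameter for common safetensors dtypes.
BYTES_PER_PARAM = {"F64": 8, "F32": 4, "I32": 4, "BF16": 2, "F16": 2, "I8": 1, "U8": 1}


def estimate_download_bytes(dtype_counts):
    """dtype_counts maps dtype name -> parameter count, e.g. {"BF16": 8_000_000_000}.

    Unknown dtypes raise KeyError rather than being silently guessed.
    """
    return sum(count * BYTES_PER_PARAM[dtype] for dtype, count in dtype_counts.items())


def format_size(num_bytes):
    """Render an approximate size label for the model picker."""
    if num_bytes >= 1e9:
        return f"~{num_bytes / 1e9:.0f} GB"
    return f"~{num_bytes / 1e6:.0f} MB"
```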
Three fixes on top of the download progress feature:
1. Backend: Replace broken "no .incomplete = done" completion check
with a 95% byte threshold. HF downloads files sequentially, so
between files there are briefly no .incomplete files even though
the download is far from done (e.g. Kimi-K2.5 reported "done"
after downloading 22KB of config files out of 595GB).
2. Frontend: Track hasShownProgress flag. Only show "Download
complete. Loading into memory..." if we actually displayed
download progress before. For already-cached models where the
first poll returns progress=1.0, this avoids the misleading
"Download complete" message.
3. Frontend: Deduplicate recommended vs downloaded -- filter out
models already in the "Downloaded" section. Cache the fetched
lists at module level so re-mounting the popover does not flash
an empty "Downloaded" section.
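The byte-threshold check from fix 1 can be sketched as below. `download_progress` is a hypothetical name, and returning 0.0 when the expected size is unknown is an assumption:

```python
def download_progress(bytes_on_disk, expected_bytes, threshold=0.95):
    """Return (fraction, done) based on bytes downloaded, not on .incomplete files."""
    if expected_bytes <= 0:
        return 0.0, False  # expected size unknown; cannot judge completion
    fraction = min(bytes_on_disk / expected_bytes, 1.0)
    return fraction, fraction >= threshold
```

With this rule, the gap between two sequential file downloads no longer looks like completion, because the byte count is still far below the threshold.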
Previously only GGUF models showed download progress in Chat. Non-GGUF
models (safetensors, bnb quantized, etc.) showed a static message with
no progress indication. This adds progress tracking for all model types
and fixes several related issues.
Backend:
- Add /api/models/download-progress endpoint that checks the HF cache
blobs directory for completed and .incomplete files. Uses model_info()
(cached per repo) to determine expected total size for percentage.
- Add /api/models/cached-models endpoint that lists non-GGUF model repos
from the HF cache via scan_cache_dir().
- Fix progress stuck at 0.99: when no .incomplete files remain, report
1.0 immediately (blob deduplication can make byte totals mismatch).
Frontend:
- Remove the ggufVariant gate so download progress polling works for all
non-cached models, not just GGUFs.
- Use GGUF-specific endpoint when variant + expectedBytes available,
otherwise use the general download-progress endpoint.
- Fix toast stuck after load: check loadingModelRef.current before and
after the async poll to prevent overwriting the success toast.
- First poll at 500ms instead of waiting for the 2s interval.
- Show downloaded non-GGUF models in the Hub model picker "Downloaded"
section alongside GGUFs.
* fix: Ctrl+C not breaking out of backend on Linux
threading.Event.wait() without a timeout blocks at the C level on
Linux, preventing Python from delivering SIGINT. Use a 1-second
timeout loop so the interpreter can process pending signals.
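The timeout loop looks roughly like this (a sketch; `wait_interruptibly` is a hypothetical name):

```python
import threading


def wait_interruptibly(stop_event, poll_seconds=1.0):
    """Block until stop_event is set, waking periodically.

    Each wake-up returns control to the interpreter, which gives Python a
    chance to run pending signal handlers such as the one for Ctrl+C.
    """
    while not stop_event.wait(timeout=poll_seconds):
        pass
```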
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* allow users to upload an eval dataset, fix related bugs
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* resolving merge conflicts
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* resolving gpt comments
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
The `set -u` (nounset) flag in setup.sh causes `${_HIDDEN_GITIGNORES[@]}`
to fail with "unbound variable" when no parent .gitignore with `*` is
found (common on Mac where the install is not inside a Python venv).
Use the `${arr[@]+"${arr[@]}"}` idiom to safely expand empty arrays
under nounset mode.
Two issues caused the studio frontend to render without any styling
when installed via `pip install` (non-editable):
1. `pyproject.toml` package-data only included `frontend/dist/**/*`.
The `include-package-data = true` setting relies on `git ls-files`,
which fails in isolated builds (pip/uv copy source to a temp dir
without `.git`). This meant `frontend/src/`, `package.json`,
`vite.config.ts`, and other build files were missing from the
installed package. Tailwind had no source files to scan.
2. Python venvs auto-create a `.gitignore` with a bare `*` pattern.
Tailwind v4's oxide scanner walks parent directories and respects
`.gitignore` -- so even when source files are present, the venv's
`*` pattern causes the scanner to skip all `.tsx` files. The result
is a 34KB CSS skeleton with zero utility classes instead of the
expected 265KB.
Additionally, Vite adds `crossorigin` to script/link tags by default.
This forces CORS mode on font subresource loads, which Firefox
HTTPS-Only Mode does not exempt -- causing all @font-face downloads
to fail silently when Studio is served over HTTP.
Changes:
- pyproject.toml: Expand package-data to include frontend source,
config files, setup scripts, and backend requirements using glob
patterns (no node_modules)
- studio/setup.sh: Temporarily hide parent .gitignore files containing
a bare `*` during `npm run build`, with trap-based restoration
- studio/backend/main.py: Strip `crossorigin` attributes from HTML
at serve time so fonts load correctly on any protocol
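A minimal sketch of the serve-time `crossorigin` stripping described above, assuming a simple regex pass over the built HTML is acceptable. The name `strip_crossorigin` and the pattern are illustrative assumptions, not the code in studio/backend/main.py:

```python
import re

# Hypothetical pattern: a `crossorigin` attribute with or without a value,
# followed by whitespace, "/" or ">" (so `crossorigin-foo` is left alone).
# Deliberately simple: it would also match the same text outside of tags.
_CROSSORIGIN_RE = re.compile(
    r"""\s+crossorigin(?:=(?:"[^"]*"|'[^']*'|[^\s>]+))?(?=[\s>/])""",
    re.IGNORECASE,
)


def strip_crossorigin(html: str) -> str:
    """Return the HTML with every crossorigin attribute removed."""
    return _CROSSORIGIN_RE.sub("", html)


page = (
    '<script type="module" crossorigin src="/assets/index.js"></script>'
    '<link rel="stylesheet" crossorigin="anonymous" href="/assets/index.css">'
)
print(strip_crossorigin(page))
```

Doing this at serve time rather than in the Vite config keeps the built assets untouched, which matches the change list above.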
* feat(studio): switch to password-only login and simplify first-time setup
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: align change-password button state with validation rules
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
* fix: graceful shutdown on Windows (signal handlers for Ctrl+C)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix: update Colab notebook to use public unsloth repo and correct paths
* Update studio/Unsloth_Studio_Colab.ipynb
For efficiency, especially in environments like Colab, it's better to perform a shallow clone of the repository. This fetches only the latest commit from the specified branch, which is significantly faster and uses less disk space than cloning the entire project history.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Update Unsloth_Studio_Colab.ipynb
* studio: add standard Unsloth header, news, section headings, and footer to Colab notebook
* studio: refine Colab notebook section headings and cell cleanup
---------
Co-authored-by: LeoBorcherding <LeoBorcherding@users.noreply.github.com>
* chat only with gguf for mac devices
* resolving gpt comments
* add change-password for chat only
* hide LoRA adapters dropdown
* solving gpt comments
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* addressing the comment
* fixing auth flow
---------
Co-authored-by: Datta Nimmaturi <venkatadattasainimmaturi@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- Run GGUF load_model in asyncio.to_thread so the event loop stays free
for progress polling during download (was blocking all requests).
- Extract download phase out of the lock in LlamaCppBackend.load_model
so unload_model/cancel can take effect immediately during download.
- Fix "downloaded" badge for split GGUFs: check total cached bytes
across all shards vs expected size, not just first shard existence.
- Respect CUDA_VISIBLE_DEVICES in /api/system GPU reporting so the
frontend GGUF fit estimation uses actual available VRAM.
- Sort tight variants (need CPU offload) smallest-first instead of
largest-first -- closer to GPU budget = faster inference.
- Fix cancel: use refs instead of React state for abort controller and
toast ID so both cancel buttons (text + toast) work reliably. Make
cancel synchronous (fire-and-forget unload) for instant UI response.
Check abortCtrl.signal.aborted after loadModel returns to prevent
ghost model state. Skip rollback and suppress errors on cancel.
- Dynamic top 4 GGUF models fetched from HF API sorted by downloads,
prepended to the default recommended list.
- Remove turnAnchor="top" for auto-scroll to bottom during generation.
- Set default toast duration to 10s (was infinite for loading toasts).
- Deduplicate cached GGUF repos using scan_cache_dir API (fixes
Qwen/X-GGUF vs qwen/x-gguf duplicates from lowercased HF cache).
- Pre-compile repo_id validation regex to silence CodeQL ReDoS warning.
- Change welcome text and default suggestion text.
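The first item in the list above (running the blocking GGUF load through asyncio.to_thread so progress polling keeps working) can be illustrated with a small, self-contained sketch. `blocking_load` and the polling coroutine are stand-ins invented for this example, not Studio code; asyncio.to_thread requires Python 3.9 or newer:

```python
import asyncio
import time


def blocking_load(seconds: float) -> str:
    """Stand-in for a synchronous load_model call (download + load)."""
    time.sleep(seconds)
    return "loaded"


async def main():
    ticks = 0

    async def poll_progress():
        # Stand-in for progress requests served by the same event loop.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.01)

    poller = asyncio.create_task(poll_progress())
    # The blocking call runs in a worker thread, so the loop keeps polling.
    result = await asyncio.to_thread(blocking_load, 0.1)
    poller.cancel()
    try:
        await poller
    except asyncio.CancelledError:
        pass
    return result, ticks


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Calling `blocking_load` directly inside `main` would instead freeze the loop for the whole load, which is the "was blocking all requests" symptom mentioned above.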
_get_gpu_free_memory was filtering by CUDA_VISIBLE_DEVICES, so with
CUDA_VISIBLE_DEVICES='0' set by the training env, llama-server only
saw 1 GPU and used --fit for CPU offloading instead of spreading
across all 8 GPUs.
Since llama-server manages its own GPU allocation (the _select_gpus
method picks GPUs and sets CUDA_VISIBLE_DEVICES for the subprocess),
the query must see ALL physical GPUs to make the right decision.
1. Progress endpoint now takes a variant parameter and only counts
.gguf files matching that variant (not all files in the repo cache,
which would include previously downloaded variants)
2. Tracks .incomplete files in HF blobs dir for in-progress single-shard
downloads, capping at 99% until the file is fully committed
3. Fixed loading text: "Loading model..." for cached, "Downloading
model..." for new downloads, with appropriate descriptions
4. Wording: "Downloading and loading model. Large models can take a
while." instead of "This may include downloading."
1. Loading text: shows "Loading model..." for cached models,
"Downloading model..." for new downloads. Toast description
adapts accordingly.
2. Download progress: polls /api/models/gguf-download-progress every
2s during downloads, updating the toast with percentage and GB
downloaded. Progress is estimated by checking the HF cache folder
size against the expected total bytes.
3. Passes isDownloaded and expectedBytes through the full chain from
variant click to selectModel for accurate UI state.
1. n_gpu_layers kwarg: accept (and ignore) in load_model signature
so callers like llm_assist.py don't get TypeError
2. mmproj exclusion: filter out mmproj files in _find_smallest_fitting_variant
so fallback doesn't pick a tiny vision projection as the "model"
3. Shard preservation after fallback: re-discover shards for the
fallback variant instead of resetting to empty list, so split
GGUFs download all shards
4. Orphan cleanup safety: only kill llama-server processes whose
cmdline contains ".unsloth/", avoiding termination of unrelated
llama-server instances on the same machine
5. Path expression sanitization: validate repo_id format before using
it in cache directory lookups
The variant filename includes a subfolder prefix (e.g.
UD-Q4_K_XL/Kimi-K2.5-UD-Q4_K_XL-00001-of-00013.gguf) but rglob
returns just the filename. Use Path.name for the comparison.
HF cache dirs use the exact case from the repo_id at download time
(e.g. models--unsloth--kimi-k2.5-gguf) which may differ from the
canonical HF repo_id (unsloth/Kimi-K2.5-GGUF). Use case-insensitive
matching to find the cache directory.
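A minimal sketch of the case-insensitive lookup, assuming the usual `models--<org>--<name>` directory naming of the Hugging Face hub cache. The helper name `find_cache_dir` is hypothetical:

```python
from pathlib import Path
from typing import Optional


def find_cache_dir(hub_cache: Path, repo_id: str) -> Optional[Path]:
    """Find the cache directory for repo_id, ignoring letter case."""
    expected = ("models--" + repo_id.replace("/", "--")).lower()
    for entry in hub_cache.iterdir():
        if entry.is_dir() and entry.name.lower() == expected:
            return entry
    return None
```

With this, a lookup for `unsloth/Kimi-K2.5-GGUF` still finds a directory that was created as `models--unsloth--kimi-k2.5-gguf`.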
Downloaded variants now take priority over the recommended badge in
sort order. Within the same tier (downloaded+fits, etc.), recommended
still sorts first. Order: downloaded -> recommended -> fits -> tight -> OOM
- Backend: /gguf-variants now checks HF cache for each variant's file
and returns a downloaded flag per variant
- Frontend: downloaded variants sort before non-downloaded (after
recommended), and show a green "downloaded" badge
- Sort order: recommended -> downloaded+fits -> downloaded+tight ->
fits -> tight -> OOM
1. Interruptible downloads: load_model now checks a cancel event
between shard downloads. unload_model sets the event so cancel
stops the download at the next shard boundary.
2. /api/models/cached-gguf endpoint: scans the HF cache for
already-downloaded GGUF repos with their total size and cache path.
3. "Downloaded" section in Hub model picker: shows cached GGUF repos
at the top (before Recommended) so users can quickly re-load
previously downloaded models without re-downloading.
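Item 1 above (checking a cancel event between shard downloads) could look roughly like the following. The names `download_shards`, `fetch`, and `DownloadCancelled` are assumptions made for this sketch; the real load_model is not shown in this log:

```python
import threading
from typing import Callable, Iterable, List


class DownloadCancelled(Exception):
    """Raised when the cancel event is set between two shards."""


def download_shards(
    shards: Iterable[str],
    fetch: Callable[[str], None],
    cancel: threading.Event,
) -> List[str]:
    """Fetch shards one by one, stopping at the next shard boundary on cancel."""
    done: List[str] = []
    for shard in shards:
        if cancel.is_set():
            raise DownloadCancelled(f"cancelled after {len(done)} shard(s)")
        fetch(shard)  # a single shard download is not interrupted mid-file
        done.append(shard)
    return done
```

In this sketch an unload request only needs to call `cancel.set()`; the loop notices it before starting the next shard.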
The unload endpoint checked is_loaded (requires healthy=True), but
during initial loading the server is not yet healthy. Cancel had no
effect because the unload route fell through to the Unsloth backend.
Fix: add is_active property (process exists, loading or loaded) and
check it in the unload route so cancel kills llama-server even during
the download/loading phase.
Also: toast cancel button now properly triggers the backend unload.
Replace toast.promise with a manual toast.loading that includes a
Cancel action button. Users can now cancel model downloads/loads from
the toast notification itself, not just from the header bar spinner.
When the studio process is killed (SIGTERM/SIGKILL), atexit handlers
may not run in the subprocess orchestrator, leaving llama-server
processes orphaned and holding GPU memory. This caused OOM errors when
trying to load a new model after a studio restart.
On init, LlamaCppBackend now runs pgrep to find and SIGKILL any stale
llama-server processes before starting fresh.
Updated GGUF fit classification to match llama-server's --fit behavior:
- fits: model <= 70% of total GPU memory (all GPUs)
- tight: model > 70% GPU but <= 70% GPU + 70% available system RAM
(llama-server uses --fit to offload layers to CPU)
- OOM: model exceeds both GPU and system RAM budgets
useGpuInfo now also returns systemRamAvailableGb from /api/system so the
frontend can compute the combined GPU+RAM budget.
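The three-way classification above reduces to two threshold comparisons. The sketch below restates it in Python for clarity (the actual check lives in the frontend); the function name and parameters are illustrative:

```python
def classify_fit(
    model_gb: float,
    total_gpu_gb: float,
    available_ram_gb: float,
    fraction: float = 0.70,
) -> str:
    """Classify a GGUF variant as 'fits', 'tight', or 'oom'."""
    gpu_budget = fraction * total_gpu_gb
    if model_gb <= gpu_budget:
        return "fits"
    # Beyond the GPU budget, count on --fit offloading layers to system RAM.
    if model_gb <= gpu_budget + fraction * available_ram_gb:
        return "tight"
    return "oom"


# One 80 GB GPU and 64 GB of available system RAM:
print(classify_fit(17, 80, 64))   # below the 56 GB GPU budget
print(classify_fit(70, 80, 64))   # needs CPU offload
print(classify_fit(230, 80, 64))  # exceeds both budgets
```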
Two fixes for accurate GGUF OOM detection:
1. /api/system now uses nvidia-smi to enumerate all physical GPUs
instead of torch.cuda which only sees CUDA_VISIBLE_DEVICES. This
matches llama-server which can use all GPUs regardless of the env
var. Falls back to torch-based detection if nvidia-smi unavailable.
2. Frontend GGUF OOM check now uses 70% of total GPU memory as the
budget, matching the PR's _select_gpus logic (30% reserved for KV
cache and compute buffers). Previously used checkVramFit's 100%
threshold which was too generous.
Adds a Cancel button next to the "Downloading model..." spinner so
users can abort long downloads. Clicking it aborts the in-flight load,
calls unloadModel to kill any running llama-server process, and clears
the loading state.
OOM variants are more useful sorted ascending by size since smaller ones
are more likely to run with --fit. Non-OOM variants remain largest-first
(best quality).
Two fixes for GGUF variant dropdown:
1. useGpuInfo now sums memory across all GPU devices instead of only
reading devices[0]. This matches llama-server's multi-GPU allocation
where models can be split across GPUs.
2. When the backend-recommended variant (e.g. UD-Q4_K_XL) exceeds total
GPU VRAM, the frontend picks the largest variant that fits instead.
If all variants are OOM, it recommends the smallest one (most likely
to work with --fit).
The useMemo for sortedVariants was placed after the loading/error early
returns, which violated React's rules of hooks (hooks must be called in
the same order every render). Move it before the conditional returns.
Fixes: Minified React error #310
Move the sort logic from the backend to the frontend GgufVariantExpander
component where GPU VRAM info is available. The backend now does a simple
size-descending sort. The frontend pins the recommended variant at the
top, pushes OOM variants to the bottom, and sorts the rest by file size
descending (largest/best quality first).
The variants list was returned in HuggingFace file listing order (alphabetical),
making the dropdown confusing (e.g. BF16 before Q4_0). Now sorted as:
1. Recommended variant (from _pick_best_gguf) pinned at top
2. Other UD (Unsloth Dynamic) variants sorted by disk size ascending
3. Non-UD variants sorted by disk size ascending
If the requested port (default 8000) is already in use, auto-
increment and try the next port, up to 20 attempts. Prints a
message like "Port 8000 is in use, using port 8001 instead".
Previously, if port 8000 was busy, uvicorn would fail with
"[Errno 98] address already in use" and the studio would not
start. Now it gracefully finds the next free port.
Uses socket.bind() to check availability before starting uvicorn.
Cross-platform (Linux, macOS, Windows).
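A minimal sketch of the probe-and-increment loop described above. The function name `find_free_port` and the choice of `127.0.0.1` as the probe host are assumptions for this example:

```python
import socket


def find_free_port(start: int = 8000, host: str = "127.0.0.1",
                   max_attempts: int = 20) -> int:
    """Return the first bindable port in [start, start + max_attempts)."""
    for port in range(start, start + max_attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is busy, try the next one
        if port != start:
            print(f"Port {start} is in use, using port {port} instead")
        return port
    raise RuntimeError(
        f"No free port found in range {start}-{start + max_attempts - 1}"
    )
```

The probe socket is closed before returning, so there is a small window in which another process could claim the port before uvicorn binds it.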
Reorder _GGUF_QUANT_PREFERENCE so all UD (Unsloth Dynamic) variants
come before standard quants. UD-Q4_K_XL is the default (best
size/quality tradeoff), followed by other UD quants in decreasing
preference order.
For repos without UD variants (e.g., bartowski), falls through to
standard quants starting with Q4_K_M.
Verified with:
- unsloth/Qwen3.5-35B-A3B-GGUF -> UD-Q4_K_XL
- bartowski/Qwen_Qwen3.5-35B-A3B-GGUF -> Q4_K_M
- unsloth/DeepSeek-V3.2-GGUF -> UD-Q4_K_XL (9 shards)
- unsloth/Llama-3.2-1B-Instruct-GGUF -> UD-Q4_K_XL
The smallest-fitting-variant fallback now groups split GGUF shards
by their variant prefix and sums all shard sizes per variant.
For example, DeepSeek-V3.2 UD-Q4_K_XL has 9 shards totaling
379.8 GB. The previous code treated each shard as a separate
"variant" and would have incorrectly selected a single 50 GB shard
as fitting, ignoring the other 8 shards needed.
Tested with unsloth/DeepSeek-V3.2-GGUF (237 GGUF files, 27
variants from 150 GB to 1.25 TB). Correctly groups and sorts
all variants by total size.
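The grouping step can be sketched as follows, assuming shard files follow the `-NNNNN-of-NNNNN.gguf` naming seen elsewhere in this log. `variant_sizes` is a hypothetical helper, not the function in the backend:

```python
import re
from collections import defaultdict
from typing import Dict

# Strips ".gguf" and, when present, the "-00001-of-00009" shard suffix.
_SUFFIX_RE = re.compile(r"(?:-\d{5}-of-\d{5})?\.gguf$")


def variant_sizes(files: Dict[str, int]) -> Dict[str, int]:
    """Sum file sizes per variant, treating split shards as one variant."""
    totals: Dict[str, int] = defaultdict(int)
    for path, size in files.items():
        name = path.rsplit("/", 1)[-1]  # drop any subfolder prefix
        totals[_SUFFIX_RE.sub("", name)] += size
    return dict(totals)


files = {
    "UD-Q4_K_XL/M-UD-Q4_K_XL-00001-of-00002.gguf": 50,
    "UD-Q4_K_XL/M-UD-Q4_K_XL-00002-of-00002.gguf": 45,
    "M-Q4_K_M.gguf": 20,
}
print(variant_sizes(files))
```

Treating each shard as its own entry would report a 50-unit "variant" here, which is the bug the commit describes.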
Two changes for GGUF variant selection:
1. Default variant preference now starts with UD-Q4_K_XL (Unsloth
Dynamic quantization) which provides better quality per bit than
standard Q4_K_M. Also added UD-Q2_K_XL, UD-IQ2_M, UD-IQ1_M,
UD-IQ1_S as small fallback options.
2. If the selected variant doesn't fit on disk, automatically fall
back to the smallest GGUF variant in the repo that does fit.
Queries all GGUF file sizes via get_paths_info() and picks the
smallest one under the free disk space limit. If nothing fits,
raises a clear error.
This means users with limited disk space won't get a download
error -- they'll get a smaller quantization instead.
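A sketch of the fallback rule in point 2, under the assumption that per-variant sizes have already been collected. `pick_variant` is a hypothetical name, and the free-space figure would typically come from `shutil.disk_usage(cache_dir).free`:

```python
from typing import Dict


def pick_variant(preferred: str, sizes: Dict[str, int], free_bytes: int) -> str:
    """Keep the preferred variant if it fits, else the smallest one that does."""
    if preferred in sizes and sizes[preferred] <= free_bytes:
        return preferred
    fitting = {name: size for name, size in sizes.items() if size <= free_bytes}
    if not fitting:
        raise RuntimeError(
            f"No GGUF variant fits in {free_bytes} bytes of free disk space"
        )
    return min(fitting, key=fitting.get)


sizes = {"UD-Q4_K_XL": 20, "UD-Q2_K_XL": 12, "UD-IQ1_S": 8}
print(pick_variant("UD-Q4_K_XL", sizes, free_bytes=30))
print(pick_variant("UD-Q4_K_XL", sizes, free_bytes=10))
```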
Query file sizes from HuggingFace via get_paths_info() before
downloading, and compare against free disk space on the cache
partition. Raises a clear error if there is not enough space,
instead of failing mid-download.
Uses get_paths_info() instead of repo_info() because xet-stored
repos return size=None from repo_info().siblings, but
get_paths_info() returns the actual file sizes.
If the size check fails for any reason (network error, API change),
it logs a warning and continues with the download anyway.
Set HF_HOME, HF_HUB_CACHE, HF_XET_CACHE, UV_CACHE_DIR, and
VLLM_CACHE_ROOT to a unified location under ~/.unsloth/studio/cache/
on startup. This keeps all model downloads, datasets, and caches
in one place instead of scattered across ~/.cache/huggingface,
~/.cache/uv, etc.
Layout:
~/.unsloth/studio/cache/
  huggingface/   (HF_HOME)
    hub/         (HF_HUB_CACHE -- model/dataset downloads)
    xet/         (HF_XET_CACHE -- xet blob store)
  uv/            (UV_CACHE_DIR -- uv package cache)
  vllm/          (VLLM_CACHE_ROOT -- vllm compiled kernels)
Only sets variables that are not already in the environment, so
user overrides (e.g. HF_HOME=/data/models) are respected.
Cross-platform: uses Path.home() which resolves correctly on
Linux (~), macOS (~), and Windows (C:\Users\<user>).
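A minimal sketch of the "set only if unset" behavior. The function name is hypothetical, and placing `hub/` and `xet/` under `huggingface/` is an assumption based on the usual Hugging Face defaults:

```python
import os
from pathlib import Path
from typing import MutableMapping, Optional


def configure_cache_env(
    env: Optional[MutableMapping[str, str]] = None,
    root: Optional[Path] = None,
) -> MutableMapping[str, str]:
    """Point cache-related variables at one root, keeping user overrides."""
    if env is None:
        env = os.environ
    if root is None:
        root = Path.home() / ".unsloth" / "studio" / "cache"
    defaults = {
        "HF_HOME": root / "huggingface",
        "HF_HUB_CACHE": root / "huggingface" / "hub",  # assumed nesting
        "HF_XET_CACHE": root / "huggingface" / "xet",  # assumed nesting
        "UV_CACHE_DIR": root / "uv",
        "VLLM_CACHE_ROOT": root / "vllm",
    }
    for name, path in defaults.items():
        env.setdefault(name, str(path))  # never overwrite an existing value
    return env
```

Passing a plain dict as `env` makes the behavior easy to check without touching the real process environment.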
If CUDA_VISIBLE_DEVICES is already set in the environment (e.g.,
by the user or a wrapper script), only consider those GPUs when
selecting devices for llama-server. nvidia-smi reports all physical
GPUs regardless of CUDA_VISIBLE_DEVICES, so we filter its output
to match the allowed set.
Without this, the GPU selector could pick a GPU outside the user's
allowed set, overriding their restriction.
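The filtering step might look like the sketch below. It assumes CUDA_VISIBLE_DEVICES holds plain integer indices (UUID-style entries are ignored here) and that GPU data arrives as `(index, free_mib)` pairs; both the function name and that shape are illustrative:

```python
from typing import List, Optional, Tuple


def filter_visible(
    gpus: List[Tuple[int, int]], cuda_visible: Optional[str]
) -> List[Tuple[int, int]]:
    """Keep only the GPUs allowed by a CUDA_VISIBLE_DEVICES value."""
    if cuda_visible is None:  # variable not set: every physical GPU is allowed
        return list(gpus)
    allowed = {
        int(token) for token in cuda_visible.split(",") if token.strip().isdigit()
    }
    return [(index, free) for index, free in gpus if index in allowed]


physical = [(0, 100_000), (1, 150_000), (2, 90_000)]
print(filter_visible(physical, "0,2"))
```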
Automatically select the best GPU(s) for a GGUF model based on
file size and available VRAM, instead of relying on hardcoded
-ngl -1 or letting llama-server guess.
Logic:
1. Measure total GGUF file size (including split shards)
2. Query free memory per GPU via nvidia-smi
3. If the model fits in 70% of the most-free GPU's memory,
pin to that single GPU (CUDA_VISIBLE_DEVICES=X, no --fit)
4. If it needs multiple GPUs, pick the N most-free GPUs
(CUDA_VISIBLE_DEVICES=X,Y, no --fit)
5. If it's too large for all GPUs combined, omit
CUDA_VISIBLE_DEVICES and use --fit on to let llama-server
handle partial offloading
The 70% threshold accounts for KV cache and compute buffers
that sit on top of the model weights.
Removed the -ngl parameter (was hardcoded to -1). llama-server's
default of "auto" handles layer offloading correctly, especially
with --fit on for oversized models.
Tested on 8x B200:
- 1B model (0.75 GB): picks 1 GPU, no --fit
- 27B model (17 GB): picks 1 GPU, no --fit
- 405B model (230 GB): picks 2 GPUs, no --fit
- 2TB model: all GPUs, --fit on
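The selection logic above can be sketched as a greedy pass over GPUs ordered by free memory. `select_gpus` and its return shape are assumptions for this example, not the signature of the real _select_gpus method:

```python
from typing import Dict, List, Optional, Tuple


def select_gpus(
    model_bytes: float, free_by_gpu: Dict[int, float], fraction: float = 0.70
) -> Tuple[Optional[List[int]], bool]:
    """Return (gpu_indices, use_fit).

    gpu_indices is None when the model is too large for all GPUs combined;
    in that case CUDA_VISIBLE_DEVICES is left alone and --fit is used.
    """
    ranked = sorted(free_by_gpu.items(), key=lambda item: item[1], reverse=True)
    chosen: List[int] = []
    budget = 0.0
    for index, free in ranked:
        chosen.append(index)
        budget += fraction * free  # keep 30% for KV cache and compute buffers
        if model_bytes <= budget:
            return sorted(chosen), False
    return None, True


free = {0: 100, 1: 180, 2: 150}  # GB of free memory per GPU
print(select_gpus(17, free))
print(select_gpus(200, free))
print(select_gpus(1000, free))
```

The chosen indices would then be joined with commas to form the CUDA_VISIBLE_DEVICES value for the llama-server subprocess.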
Refactor command building (deduplicate HF/local paths) and add
flags for better performance:
- --parallel 1: studio is single-user, so only 1 inference slot
is needed. The previous auto-detect picked 4 slots, wasting
VRAM on 3 unused KV caches.
- --flash-attn on: force flash attention for faster inference.
Default is "auto" which may not always enable it.
- --fit on: auto-adjust parameters to fit in available device
memory. Already the default but now explicit.
Also cleaned up the duplicated command building for HF vs local
mode into a single block.
Remove the hard max_tokens=2048 default and le=4096 cap for GGUF
chat completions. When max_tokens is not set (None), the field is
omitted from the llama-server payload entirely, letting the model
generate until it produces an EOS token or hits the context limit.
This is critical for thinking/reasoning models (Qwen3.5, DeepSeek-R1,
etc.) where the thinking phase alone can consume 1000+ tokens before
the actual answer. With the previous 2048 default, simple questions
like "What is 2+2?" used all tokens on thinking and produced empty
visible responses.
Changes:
- llama_cpp.py: max_tokens default None, only include in payload
when explicitly set
- models/inference.py: default None, remove le=4096 cap
- routes/inference.py: pass max_tokens directly, no "or 2048" fallback
llama-server handles omitted max_tokens gracefully (generates until
EOS or context limit). The context size (-c flag, default 4096) acts
as the hard upper bound.
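The payload change amounts to adding the field only when the caller set it. A minimal sketch, with `build_payload` as a hypothetical helper name:

```python
from typing import Any, Dict, List, Optional


def build_payload(
    messages: List[Dict[str, str]],
    max_tokens: Optional[int] = None,
    stream: bool = True,
) -> Dict[str, Any]:
    """Build a chat-completions payload, omitting max_tokens when unset."""
    payload: Dict[str, Any] = {"messages": messages, "stream": stream}
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    return payload


question = [{"role": "user", "content": "What is 2+2?"}]
print(build_payload(question))
print(build_payload(question, max_tokens=512))
```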
llama-server sends thinking/reasoning tokens as "reasoning_content"
in the SSE delta (separate from "content"). The studio was only
reading delta.content, so all reasoning tokens from models like
Qwen3.5, Qwen3-Thinking, DeepSeek-R1, etc. were silently dropped.
This caused "replies with nothing" for thinking models: the model
would spend its entire token budget on reasoning, produce zero
content tokens, and the user would see an empty response.
Fix: read reasoning_content from the delta and wrap it in
<think>...</think> tags. The frontend already has full support
for these tags (parse-assistant-content.ts splits them into
reasoning parts, reasoning.tsx renders a collapsible "Thinking..."
indicator).
Verified with Qwen3.5-27B-GGUF (UD-Q4_K_XL):
- Before: "What is 2+2?" -> empty response (all tokens in reasoning)
- After: shows collapsible thinking + answer "4"
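A sketch of the wrapping step, assuming each SSE delta has already been parsed into a dict with optional `reasoning_content` and `content` keys. The generator name `wrap_reasoning` is illustrative:

```python
from typing import Dict, Iterable, Iterator, Optional


def wrap_reasoning(deltas: Iterable[Dict[str, Optional[str]]]) -> Iterator[str]:
    """Yield text chunks, wrapping reasoning tokens in <think>...</think>."""
    in_think = False
    for delta in deltas:
        reasoning = delta.get("reasoning_content")
        content = delta.get("content")
        if reasoning:
            if not in_think:
                in_think = True
                yield "<think>"
            yield reasoning
        if content:
            if in_think:
                in_think = False
                yield "</think>"
            yield content
    if in_think:  # stream ended while still reasoning
        yield "</think>"


stream = [
    {"reasoning_content": "2+2"},
    {"reasoning_content": " is 4"},
    {"content": "4"},
]
print("".join(wrap_reasoning(stream)))
```

Closing the tag at end of stream keeps the output well-formed even when the model spends its whole budget on reasoning.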
* fix: resolve compare mode deadlock, cancel_event poisoning, and add dispatcher-based IPC optimization
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* revert to 2048 tokens
* refactor: extract dispatcher timeout values into named constants
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: guard dispatcher shutdown against active compare mailboxes
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* miscellaneous studio
* chore: upload dataset misc
* chore: redundancy studio cleanup
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: address the PR comments
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: address comments about recipes
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* quiet llama.cpp build, smarter CUDA install via winget, accept Python 3.11-3.13
* studio: hide Python traceback when setup script exits with error
* setup.ps1: auto-add Python Scripts dir to PATH so 'unsloth' command works in new terminals
* setup.ps1: fix GPU check to run nvidia-smi instead of just checking command existence
* setup.ps1: fix PATH check to use exact entry comparison instead of substring match
* setup.ps1: validate Python probe exit code before persisting Scripts PATH
* fix: quotation marks
* diceware passphrase generation
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Replace _in_virtualenv() heuristic with a runtime probe. At
bootstrap time, try a dry-run uv install without --system. If
that fails (exit code 2, "No virtual environment found"), retry
with --system to confirm it works. This handles all environments
correctly: venvs, Colab (system Python), local machines, containers.
Three fixes based on review:
1. Make uv truly optional: _bootstrap_uv() now only checks if uv is
already on PATH. It no longer tries to pip install uv. If uv is
not present, pip is used with zero changes to behavior.
2. Add --system flag for Colab: on Colab there is no venv (packages
install into system Python). uv requires --system in this case,
otherwise it errors with "No virtual environment found". Added
_in_virtualenv() check that detects VIRTUAL_ENV, sys.real_prefix,
or sys.base_prefix != sys.prefix.
3. Fix label printed twice on uv fallback: when uv fails and falls
back to pip, the label now says "(pip)" to distinguish from the
initial uv attempt, instead of printing the same label twice.
Tested:
- venv path: no --system flag, uv installs correctly
- no-venv path (Colab sim): --system flag added automatically
- full unsloth studio setup + training run (Llama-3.2-1B, 10 steps)
install_python_stack.py:
- Print uv error output on failure for debuggability
- Refactor pip_install() to use early return after uv success,
removing duplicated pip command path
setup.sh:
- Guard nvidia-smi command substitution with || true so it does
not abort the script under set -euo pipefail when nvidia-smi
fails (e.g., containerized environments, driver quirks)
- Read all GPU compute capabilities and deduplicate, so
mixed-GPU hosts get kernels built for all present architectures
instead of only the first GPU
Restore separate cmake --build calls for llama-server and
llama-quantize on both setup.sh and setup.ps1. The combined
approach made llama-quantize failure fatal, but it was originally
best-effort (|| true on Linux, [WARN] on Windows). The timing
savings from combining were only ~2.7s, not worth the semantic
change.
The Ninja + arch detection speedups are preserved (55s vs 1m 37s).
Build llama-server and llama-quantize in a single cmake --build
invocation on Windows, matching the same optimization done in
setup.sh. This allows MSBuild to better parallelize the two targets.
The Visual Studio generator is kept as-is (not switching to Ninja on
Windows since VS generator is the standard approach and interacts
with MSBuild).
Four improvements to the llama.cpp build step in setup.sh:
1. Detect GPU compute capability via nvidia-smi and limit
CMAKE_CUDA_ARCHITECTURES to the current GPU. Without this, cmake
builds for all default CUDA architectures which is very slow.
2. Use Ninja build generator when available. Ninja has better
parallelism than Make for CUDA compilation.
3. Build both llama-server and llama-quantize targets in a single
cmake --build invocation for better parallelism.
4. Add --threads=0 to CMAKE_CUDA_FLAGS for multi-threaded nvcc
compilation.
Measured on 192-core machine with B200 (sm_100):
Make (all archs): very slow (minutes for each arch)
Make (single arch): 1m 37s
Ninja (single arch): 55s
Speedup: ~1.7x
Combined with the uv change, total setup goes from ~4m 35s to ~1m 40s.
Replace pip with uv in install_python_stack.py to speed up the Python
dependency installation phase of `unsloth studio setup`.
- Add _bootstrap_uv() that checks for uv on PATH, and if not found,
installs it via pip. Falls back to pip if uv is unavailable.
- Translate pip flags to uv equivalents (--no-cache-dir dropped since
uv caching is fast, --force-reinstall becomes --reinstall).
- Add --torch-backend=auto so uv auto-detects CUDA version for
PyTorch ecosystem packages.
- Per-install fallback: if any uv install step fails, it retries that
step with pip before exiting.
Measured on clean venv setup:
Python packages (pip): 2m 28s
Python packages (uv): 18s
Speedup: ~8x
Total setup time goes from ~4m 35s to ~2m 30s (llama.cpp build is
now the bottleneck at 1m 40s).
Related to #1615
Add documentation and function for exporting models from Colab to local machines.
* **README.md**: Add a new section titled "Exporting Models from Colab to Local Machine" under "✨ Finetune for Free" with detailed steps for exporting models from Colab to local machines.
* **CONTRIBUTING.md**: Add a note about the new documentation section for exporting models from Colab.
* **unsloth/save.py**: Add a new function `export_model_to_local` to handle exporting models from Colab to local machines.
(cherry picked from commit 0361bd658f)
Editable installs (-e) work via a .pth file that is only processed at
Python startup. In Colab the kernel is already running when setup.sh
installs the plugin, so the .pth file never gets picked up and
data_designer_unstructured_seed is not importable.
Remove -e so pip copies the package files directly into site-packages,
which the live kernel can find immediately. Local venv installs are
unaffected since the venv is always created fresh before install.
* fix(seed): disable remote code execution for seed inspect loads
* fix(test): use __file__-relative path in seed test
The test used a CWD-relative path (`studio/backend/routes/...`) which
only resolved when pytest was invoked from the repo root. Use
`Path(__file__).resolve()` so the test passes regardless of CWD.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Test <test@test.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix: disable remote code loading for ai-assist model hint lookup
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
The studio was disabling flex attention entirely on Blackwell+ GPUs
(sm_120 and above) by setting UNSLOTH_ENABLE_FLEX_ATTENTION=0 at
startup. This was a workaround for the flex_attention backward kernel
exceeding shared memory limits on these GPUs.
The root cause is now fixed in unsloth-zoo (PR #542) which patches the
backward kernel config selection to generate safe fallback configs that
fit within the GPU's shared memory limit. With that fix, flex attention
works correctly on Blackwell GPUs and provides a ~1.3x speedup over
the SDPA fallback.
Fix `llm_int8_skip_modules` not being respected for VLMs with dynamic quantization on transformers 5.x.
Dynamic quant checkpoints (e.g. `gemma-3-4b-it-unsloth-bnb-4bit`) encode skip paths as `language_model.model.layers.*`, but the live module tree on 5.x surfaces them as `model.language_model.layers.*`. This prefix mismatch causes `should_convert_module` to miss the skip list, so 22 modules meant to stay in 16-bit get wrapped in `Linear4bit` without a `quant_state`, producing "Skipping ... no quant_state found" warnings.
Patches `should_convert_module` to expand both the module name and the skip patterns into all equivalent alias forms before matching. Guarded by `hasattr` so it is a no-op on transformers 4.x where the bug does not exist.
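The alias-expansion idea can be illustrated with a standalone sketch. Everything here is hypothetical: the alias table, the helper names, and the prefix-matching rule are assumptions for illustration and do not reproduce the transformers `should_convert_module` implementation or the actual patch:

```python
from typing import Iterable, Set

# Assumed pair of equivalent prefixes (checkpoint form, live-module form).
_ALIASES = [("language_model.model.", "model.language_model.")]


def alias_forms(name: str) -> Set[str]:
    """Return the name together with its equivalent prefixed forms."""
    forms = {name}
    for old, new in _ALIASES:
        if name.startswith(old):
            forms.add(new + name[len(old):])
        if name.startswith(new):
            forms.add(old + name[len(new):])
    return forms


def should_skip(module_name: str, skip_patterns: Iterable[str]) -> bool:
    """True if any alias of the module falls under any alias of a skip path."""
    names = alias_forms(module_name)
    patterns: Set[str] = set()
    for pattern in skip_patterns:
        patterns |= alias_forms(pattern)
    return any(
        name == pattern or name.startswith(pattern + ".")
        for name in names
        for pattern in patterns
    )


skip = ["language_model.model.layers.3.mlp"]
print(should_skip("model.language_model.layers.3.mlp.down_proj", skip))
```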
Closes #4208
* Update CODEOWNERS for studio and cli
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* packing optimization with cache to reduce D2H copy
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* cache per device to avoid race condition for multi-gpu
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add cache freeing up func
---------
Co-authored-by: ruixiangw <ruixiangw@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: ruixiang <wangruixiang07@outlook.com>
* Rebuild Studio branch on top of main
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix security and code quality issues for Studio PR #4237
- Validate models_dir query param against allowed directory roots
to prevent path traversal in /api/models/local endpoint
- Replace string startswith() with Path.is_relative_to() for
frontend path traversal check in serve_frontend
- Sanitize SSE error messages to not leak exception details to
clients (4 locations in inference.py)
- Bind port-discovery socket to 127.0.0.1 instead of all interfaces
in llama_cpp backend
- Import datasets_root and resolve_output_dir in embedding training
function to fix NameError and use managed output directory
- Remove stale .gitignore entries for package-lock.json and test
directories so tests can be tracked in version control
- Add venv-reexecution logic to ui CLI command matching the studio
command behavior
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Move models_dir path validation before try/except block
The HTTPException(403) was inside the try/except Exception handler,
so it would be caught and re-raised as a 500. Moving the validation
before the try block ensures the 403 is returned directly and also
makes the control flow clearer for static analysis (path is validated
before any filesystem operations).
* Use os.path.realpath + startswith for models_dir validation
CodeQL py/path-injection does not recognize Path.is_relative_to() as
a sanitizer. Switched to os.path.realpath + str.startswith which is
a recognized sanitizer pattern in CodeQL's taint analysis. The
startswith check uses root_str + os.sep to prevent prefix collisions
(e.g. /app/models_evil matching /app/models).
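A minimal sketch of the realpath + startswith check with the separator guard. The helper name `is_within_root` is hypothetical:

```python
import os


def is_within_root(candidate: str, root: str) -> bool:
    """True if candidate resolves to root itself or a path beneath it."""
    real_root = os.path.realpath(root)
    real_candidate = os.path.realpath(candidate)
    # Appending os.sep stops "/app/models_evil" from matching "/app/models".
    return real_candidate == real_root or real_candidate.startswith(
        real_root + os.sep
    )


print(is_within_root("/app/models/llama", "/app/models"))
print(is_within_root("/app/models_evil", "/app/models"))
print(is_within_root("/app/models/../secrets", "/app/models"))
```

Because `os.path.realpath` resolves `..` segments and symlinks first, traversal attempts are normalized before the prefix comparison runs.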
* Never pass user input to Path constructor in models_dir validation
CodeQL traces taint through Path(resolved) even after a startswith
barrier guard. Fix: the user-supplied models_dir is only used as a
string for comparison against allowed roots. The Path object passed
to _scan_models_dir comes from the trusted allowed_roots list, not
from user input. This fully breaks the taint chain.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- Workers now compute backend_path and venv_t5 locally via Path(__file__)
- Moved .venv_t5 to ~/.unsloth/studio/.venv_t5
- Added ensure_studio_directories() call on server startup
- Expanded CLI studio command into sub-app with setup subcommand
Tier 1 check-format was picking images.zip over testmini.parquet,
causing wrong columns (image/label) and broken VLM mapping.
Also log first VLM conversion failure instead of swallowing silently.
Prevent negative Train Split Start/End values in the dataset advanced UI and sanitize payload mapping so negative slice values are never sent to the backend.
Made-with: Cursor
Instead of downloading the full dataset and then slicing, use
streaming mode to only fetch the rows needed (up to slice_end + 1)
when a manual dataset slice is configured.
startswith(prefix) could match unrelated split variants whose names
extend the selected file's prefix (e.g. model-Q8_0-v2-00001-of-...).
Now builds an exact regex from the chosen file's base prefix and shard
total so only true siblings are downloaded.
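A sketch of building the exact sibling pattern from the chosen file, assuming the `<prefix>-NNNNN-of-NNNNN.gguf` shard naming. `sibling_shards` is an illustrative name:

```python
import re
from typing import Iterable, List

_SHARD_RE = re.compile(r"^(?P<prefix>.+)-\d{5}-of-(?P<total>\d{5})\.gguf$")


def sibling_shards(chosen: str, candidates: Iterable[str]) -> List[str]:
    """Return every shard sharing the chosen file's base prefix and total."""
    match = _SHARD_RE.match(chosen)
    if match is None:
        return [chosen]  # not a split model: the file stands alone
    prefix = re.escape(match.group("prefix"))
    total = match.group("total")
    exact = re.compile(rf"^{prefix}-\d{{5}}-of-{total}\.gguf$")
    return sorted(name for name in candidates if exact.match(name))


names = [
    "model-Q8_0-00001-of-00003.gguf",
    "model-Q8_0-00002-of-00003.gguf",
    "model-Q8_0-00003-of-00003.gguf",
    "model-Q8_0-v2-00001-of-00003.gguf",
]
print(sibling_shards(names[0], names))
```

A plain `startswith("model-Q8_0")` would also pick up the `-v2` file, which is the mismatch the commit fixes.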
Substring matching (e.g. "Q8_0" in filename) could match superset
variants like "IQ8_0", causing wrong quantizations to be downloaded.
Now uses word-boundary regex for variant matching and discovers split
shards by shared filename prefix rather than treating all variant
matches as shards.
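The difference between substring and word-boundary matching can be illustrated with a short sketch. The function name matches_variant is hypothetical, and the case-insensitive flag is an assumption rather than something the commit states.

```python
import re


def matches_variant(filename: str, variant: str) -> bool:
    """True if `variant` appears in `filename` as a whole token, not as part of a longer one."""
    # \b requires a word/non-word transition on both sides, so "Q8_0" is not
    # found inside "IQ8_0" and "Q4_K" is not found inside "Q4_K_M".
    pattern = r"\b" + re.escape(variant) + r"\b"
    return re.search(pattern, filename, flags=re.IGNORECASE) is not None
```

One caveat of this sketch: Python treats the underscore as a word character, so a name like model_Q8_0.gguf would not match and may need a looser separator rule.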
start_training() cherry-picks kwargs into a config dict but was missing
is_embedding, so config.get("is_embedding", False) in worker.py always
returned False and embedding training never ran.
LlamaCppBackend.load_model() only downloaded the first matching GGUF
file. For split models (e.g. 7B Q8_0 with 3 shards), llama-server
needs all shards present. Now collects and downloads all matching files.
Separate pure-audio from audio-VLM logic in runDatasetCheck so pure
audio models are always forced to trainOnCompletions=false regardless
of dataset type, while audio VLMs (gemma3n) only uncheck when the
dataset is audio.
Clear stale isAudioModel in the fallback path when getModelConfig
fails, preventing a previously-selected audio model's flag from
leaking into the next model selection.
Add end-to-end embedding/sentence-transformer training pipeline using
FastSentenceTransformer, SentenceTransformerTrainer, and
MultipleNegativesRankingLoss with BatchSamplers.NO_DUPLICATES.
Backend:
- Add is_embedding_model() detection via HF tags + pipeline_tag
- Add /check-embedding/ API route and EmbeddingCheckResponse
- Extend derive_model_type() to return "embeddings"
- Add _run_embedding_training() in worker.py with progress callbacks,
stop handling, LoRA (task_type=FEATURE_EXTRACTION), and model saving
- Add is_embedding field to TrainingStartRequest and ModelDetails
- Add YAML configs for 5 models: all-MiniLM-L6-v2, bge-m3,
embeddinggemma-300m, gte-modernbert-base, Qwen3-Embedding-0.6B
Frontend:
- Wire isEmbeddingModel flag through store, API types, and mappers
- Force packing=false, train_on_completions=false, warmup_ratio=0.03
- Hide packing and train_on_completions checkboxes for embedding models
- Auto-set modelType to "embeddings" from backend model_type response
- Pass 1: clearer definition of "conversational" vs non-conversational,
constrained dataset_type to specific enum values
- Pass 2: much more explicit worked examples with step-by-step reasoning,
added "skip" role for metadata columns, stronger reminder at end that
all-user is wrong
- Pass 3: returns raw text instead of JSON for cleaner system prompts,
removed system message to give model more freedom
Pure audio models (orpheus, sparktts, whisper, sesame-csm) now
always have trainOnCompletions auto-unchecked when selected.
Gemma3n (audio_vlm) only unchecks when the dataset is audio.
- Add is_audio to frontend ModelConfigResponse (backend already returns it)
- Add isAudioModel state to training config store
- Auto-set trainOnCompletions=false for pure audio models on model load
- Auto-set trainOnCompletions=false for audio VLMs when dataset is audio
- Respect manual user override via existing _trainOnCompletionsManuallySet flag
The advisor now only assigns columns to user/assistant roles and
generates a system prompt. Templates (user_template, assistant_template)
are removed entirely — the LLM was frequently putting all columns in
user or copying actual data values into templates.
Column values are now used directly as message content, grouped and
concatenated by role. This is simpler, more robust, and prevents the
class of bugs where the advisor generates bad template content.
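The template-free construction described above might look roughly like this. The function name row_to_messages, the "skip" handling, and the newline joiner are assumptions made for illustration only.

```python
from typing import Dict, List, Optional


def row_to_messages(
    row: Dict[str, object],
    column_roles: Dict[str, str],
    system_prompt: Optional[str] = None,
) -> List[Dict[str, str]]:
    """Build a conversation by concatenating column values that share a role."""
    messages: List[Dict[str, str]] = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    for role in ("user", "assistant"):
        # Columns mapped to any other role (e.g. "skip") are simply ignored.
        parts = [
            str(row[col])
            for col, assigned in column_roles.items()
            if assigned == role and col in row
        ]
        if parts:
            # Joining with a newline is an assumption; any separator would do.
            messages.append({"role": role, "content": "\n".join(parts)})
    return messages
```

For an SNLI-style row, mapping premise and hypothesis to user and label to assistant yields one user message containing both sentences and one assistant message containing the label.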
Derive a single model_type string ("text" | "vision" | "audio" | "embeddings")
from existing is_vision and audio_type detection, so the frontend doesn't have
to infer modality from scattered boolean flags.
The LLM was putting all columns in user_template (e.g. summarization
dataset had both document AND summary as user input). Fixed by:
- Reframed system message: explicitly states user=INPUT, assistant=OUTPUT
- Added 4 concrete correct examples (summarization, NLI, translation, QA)
showing exactly how to split columns
- Added "NEVER put the output/target column in the user template" rule
- Added sanity check: if assistant_template has no column placeholders,
reject the result and fall back to simple classification
Pass 3 now sees the label mapping from Pass 2 (e.g. "0 = does not follow,
1 = follows, 2 = entailed") so the generated system prompt can explain
what each label value means. Also bumped to 2-4 sentences to give room
for the label descriptions.
- Add "Beta" badge next to AI Assist button text
- When advisor generates a system prompt, show it as a "System (generated)"
column prepended to the data table so user can see it alongside data
- Fix table being squished to near-zero height when advisor notification
banner is present: add min-h-[250px] to table wrapper, change body
from overflow-hidden to overflow-auto
Pass 1: Classify dataset type (unchanged)
Pass 2: Generate user/assistant templates + label mapping + column roles
(system_prompt removed from this pass to keep it focused)
Pass 3: Generate system prompt (only for non-conversational datasets)
- Dedicated pass with focused prompt that sees the templates from Pass 2
- Skipped entirely for conversational datasets
- Produces specific, task-relevant system prompts
- System prompt is now optional — LLM only generates one when the task
is ambiguous from the data alone (persona, domain, format constraints)
- Sanitize system_prompt extraction (handle literal "null" string)
- Show system prompt, user template, and assistant template in the
advisor notification banner so user can see exactly what was generated
- Templates displayed in monospace with labeled sections
The LLM was bad at scoring its own conversion quality — rejecting good
Pass 2 output (score 5/10 for a perfectly usable conversion). Instead:
- Remove Pass 3 entirely (saves ~0.4s and one inference call)
- Trust Pass 2 output and return it to the user
- Build notification from Pass 1 classification info instead
- User can always adjust mapping via dropdowns if they disagree
- Reject advisor result when Pass 3 scores < 6 or is_acceptable=false, and
fall back to simple column classification instead of using bad output
- Improved Pass 2 prompt: explicit rules for label_mapping completeness,
{column_name} vs {column_name_name} for mapped labels, column_roles
must match which template uses them
- Build suggested_mapping from ALL template-referenced columns (not just
first match per role) — fixes hypothesis being dropped from SNLI mapping
- Guard against LLM returning literal string "null" for revised_system_prompt
- Always show AI Assist button when available, even when mapping looks complete
- Handle dict columns (e.g. squad answers) by extracting text instead
of raw repr()
- Handle list columns by joining or extracting single value
- Catch ValueError in .format() calls (stray { } in column data)
- Add missing json import to dataset_utils.py
Non-conversational HF datasets (e.g. stanfordnlp/snli) were naively mapped
column→role, producing poor training results. The AI Assist button now runs
a 3-pass advisor using Qwen 7B that:
1. Fetches the HF dataset card/README to understand the dataset purpose
2. Classifies the dataset type and determines if conversion is needed
3. Generates a system prompt, user/assistant templates with {column}
placeholders, and label mappings (e.g. 0→entailment)
4. Validates the conversion quality (score ≥7/10 required)
Architecture: advisor metadata flows as __-prefixed keys in
custom_format_mapping (e.g. __system_prompt, __user_template,
__assistant_template, __label_mapping). The existing _apply_user_mapping()
detects these keys and routes to template-based conversation construction.
No __ keys = existing simple mode (backwards compatible).
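The routing on __-prefixed keys can be sketched as follows. The helper name split_mapping is hypothetical; only the key names in the usage example come from the description above.

```python
from typing import Any, Dict, Tuple


def split_mapping(custom_format_mapping: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate advisor metadata (keys starting with "__") from plain column -> role entries."""
    metadata = {k: v for k, v in custom_format_mapping.items() if k.startswith("__")}
    column_roles = {k: v for k, v in custom_format_mapping.items() if not k.startswith("__")}
    return metadata, column_roles


mapping = {
    "premise": "user",
    "label": "assistant",
    "__system_prompt": "Decide whether the hypothesis follows from the premise.",
    "__label_mapping": {"label": {"0": "entailment"}},
}
metadata, column_roles = split_mapping(mapping)
# Any metadata key switches on the template-based path; none keeps the simple mode.
use_template_mode = bool(metadata)
```

Keeping the metadata inside the existing mapping dict means older requests without __ keys continue down the simple path unchanged.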
Backend: upgraded llm_assist.py (7B default, multi-pass advisor,
HF card fetching), extended API models, added _apply_template_mapping()
to dataset_utils.py.
Frontend: extended store with advisor state fields, wired AI Assist
to store templates/system prompt, inject __ metadata in training request,
show advisor notification banner in mapping card.
LlamaCppBackend.load_model() and precache_helper_gguf() only downloaded
the first matching GGUF file. For split models (e.g. 7B Q8_0 with 3
shards), llama-server needs all shards present. Now collects and
downloads all matching files.
Move LLM-assisted column mapping from silent /check-format automation
to an explicit "AI Assist" button in the dataset mapping dialog. This
makes the feature transparent and user-controlled.
- Remove llm_classify_columns() from check_dataset_format() (heuristic-only)
- Remove auto-save suggested_mapping from use-training-actions.ts
- Add POST /api/datasets/ai-assist-mapping endpoint (receives preview
samples from frontend, no dataset re-loading needed)
- Add AiAssistMappingRequest/Response models
- Add aiAssistMapping() frontend API function
- Add Sparkles AI Assist button to DatasetMappingCard with loading state
- Wire up handleAiAssist handler in dataset-preview-dialog.tsx
- Frontend auto-saves suggested_mapping into datasetManualMapping when
check-format returns requires_manual_mapping=false, so the mapping
flows to training via custom_format_mapping (no redundant AI calls)
- Backend returns meaningful warning when column detection fails
(LLM-generated or static fallback) for both text and VLM datasets
- /check-format endpoint merges check_dataset_format warnings with
existing URL-based image detection warnings
When multiple image columns are found, probes them (HEAD for URLs,
os.path.exists for paths) and picks the first that works.
Skips probing when top candidate is PIL/dict (score >= 75).
find_image_column now scores candidates by resolvability (PIL > dict > URL > path)
and has a Pass 2 value-based fallback for columns not matching image keywords.
Fixes phiyodr/coco2017 picking file_name (unresolvable) over coco_url (resolvable).
- Detect and convert ShareGPT/ChatML conversations with <image> placeholders
- Add file_name/filename as image column keywords
- Detect image paths and URLs by value (string ending in .jpg/.png/etc)
Datasets like VQAonline store image filenames (e.g. "img.png") without
the directory prefix. Build a basename→repo_path lookup using
list_repo_files, then resolve each file via hf_hub_download.
Tier 1 check-format was picking images.zip over testmini.parquet,
causing wrong columns (image/label) and broken VLM mapping.
Also log first VLM conversion failure instead of swallowing silently.
* Refactor loss computation to include completion_mask
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fixes for trl 0.28 and above
Remove sync/reload weights calls, remove vllm.LLM instantiation
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Refactor loss computation to include completion_mask
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fixes for trl 0.28 and above
Remove sync/reload weights calls, remove vllm.LLM instantiation
* patch rpc in openenv for newer trl
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pluesclues <136766175+pluesclues@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix gpt temporary patch for grpo to happen after compile
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Refactor loss computation to include completion_mask
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1. Export route: stop_training() only signals the subprocess — wait up to
30s for it to actually exit before loading the export checkpoint, avoiding
a GPU memory race.
2. Training reset: clear _should_stop so /api/train/status returns phase=idle
instead of staying stuck on phase=stopped after a user-triggered stop.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without this, /v1/chat/completions requests in local dev are served by
Vite instead of being proxied to the FastAPI backend.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two overlapping /chat/completions requests could both read from the shared
resp_queue, consuming and dropping each other's token events. Replace the
request_id filtering (which silently dropped non-matching messages) with a
threading.Lock that serializes generation — correct for single-GPU inference.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
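The lock-based serialization described in the commit above can be sketched like this. The names _generation_lock and serialized_stream are hypothetical; the real code reads from a response queue rather than a plain generator.

```python
import threading
from typing import Callable, Iterable, Iterator

# One process-wide lock: only a single generation may run at a time.
_generation_lock = threading.Lock()


def serialized_stream(generate: Callable[[], Iterable[str]]) -> Iterator[str]:
    """Stream tokens from `generate` while holding the generation lock."""
    with _generation_lock:
        # The lock stays held until the stream is exhausted or closed, so a
        # second request blocks instead of stealing this request's tokens.
        for token in generate():
            yield token
```

A second caller blocks on its first token until the first stream finishes, which matches the single-GPU assumption stated in the commit.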
Models like GLM-4.7-Flash have architectures (glm4_moe_lite) that
AutoConfig in the main process (transformers 4.57.x) can't recognize.
Instead of a raw config.json workaround, run the AutoConfig check in
a subprocess with .venv_t5/ activated — same pattern as training and
inference workers. This is more robust and consistent.
AutoConfig.from_pretrained() fails for models needing transformers 5.x
(e.g. glm4_moe_lite) when running with 4.57.x. Add a raw config.json
fallback that bypasses AutoConfig's architecture registry — fetches
config.json directly from local path or HuggingFace Hub and checks
for vision indicators without needing the architecture to be registered.
All version switching now uses .venv_t5/ (pre-installed by setup.sh).
The old .venv_overlay/ with runtime pip installs is removed.
ensure_transformers_version() (used only by export) now does a
lightweight sys.path swap instead of pip installing at runtime.
Reusing a subprocess after unsloth patches torch internals causes
inspect.getsource() failures when loading a different model type.
Each load now gets a clean Python interpreter.
Replaces cmd_queue-based cancel polling with a shared mp.Event.
Fixes two issues:
- Loading a new model while generating no longer hangs (cancel is instant)
- Subprocess shuts down cleanly after explicit stop generation
Inference now runs in a persistent subprocess, solving the same
transformers version-switching problem that was fixed for training.
The subprocess stays alive between requests (model in GPU memory)
and is only restarted when switching transformers versions.
New files:
- core/inference/worker.py: subprocess entry point with command loop
- core/inference/orchestrator.py: parent-side proxy with same API
Modified:
- core/inference/__init__.py: exports orchestrator as default backend
- routes/inference.py: removed in-process ensure_transformers_version()
These are standalone benchmark scripts that were force-added despite being
gitignored. They have no test functions and run network calls at module
level, which breaks pytest collection in CI.
- Add 200-sample parallel probe using ThreadPoolExecutor + safe_num_proc
to estimate download speed and failure rate before full conversion
- Abort with clear error if >=30% of probe images fail to download
- Show estimated download time in the training overlay modal
- Parallel batch conversion for URL-based datasets (vs sequential for local)
- Add warning field to /check-format response for URL-based image datasets
- Display URL warning in dataset preview dialog (amber banner)
- Thread progress_callback from trainer through format_and_template_dataset
to convert_to_vlm_format for real-time status updates
Place Train Split Start / End inputs inside the Advanced collapsible
with descriptive tooltips clarifying they slice the training split.
Revert the selectors component to its original eval-split-only layout.
Place Slice Start and Slice End inputs alongside the Eval Split
selector in a single row (grid-cols-3) so the dataset card stays
compact. Remove the duplicate controls from the Advanced section.
Add Start/End index inputs under Advanced in the dataset card,
allowing users to slice a dataset by row range before training.
Wired end-to-end: frontend store, API payload, backend Pydantic
model, and trainer dataset loading (inclusive on both ends).
trl/trainer/callbacks.py imports is_wandb_available from
accelerate.utils, not from transformers. The original fix in #4147
only patched the transformers version, so `from trl import GRPOTrainer`
still crashed via the callbacks.py -> accelerate -> wandb path.
Must patch both the source module (accelerate.utils.imports) AND the
re-export namespace (accelerate.utils) since Python's
`from accelerate.utils import X` reads from the latter, which holds
its own cached reference.
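Why both namespaces need patching can be shown with stand-in modules. Everything named fakepkg below is hypothetical and only mimics the source-module / re-export relationship; it does not touch accelerate itself.

```python
import sys
import types


def patch_attr_in_modules(module_names, attr, replacement):
    """Overwrite `attr` in every already-imported module that exposes it."""
    patched = []
    for name in module_names:
        module = sys.modules.get(name)
        if module is not None and hasattr(module, attr):
            setattr(module, attr, replacement)
            patched.append(name)
    return patched


# Stand-in modules (hypothetical names): a source module and a package
# namespace that re-exports the same function object.
source = types.ModuleType("fakepkg.utils.imports")
source.is_wandb_available = lambda: True
reexport = types.ModuleType("fakepkg.utils")
reexport.is_wandb_available = source.is_wandb_available
sys.modules[source.__name__] = source
sys.modules[reexport.__name__] = reexport

# Patching only the source leaves the re-exported reference untouched.
patch_attr_in_modules(["fakepkg.utils.imports"], "is_wandb_available", lambda: False)
stale = reexport.is_wandb_available()

# Patching both namespaces makes the two lookups agree.
patch_attr_in_modules(
    ["fakepkg.utils.imports", "fakepkg.utils"], "is_wandb_available", lambda: False
)
fixed = reexport.is_wandb_available()
```

Here stale still reflects the original function, while fixed reflects the replacement, which is the behaviour the commit works around.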
* Fix broken wandb import crashing unsloth startup
When wandb is installed but broken (e.g., wandb < 0.19.11 with
protobuf >= 6.0), the import chain unsloth -> trl -> transformers ->
is_wandb_available() -> import wandb crashes with:
ImportError: cannot import name 'Imports' from
'wandb.proto.wandb_telemetry_pb2'
This happens because transformers' is_wandb_available() has no
try/except around `import wandb`. The error propagates up and kills
`from unsloth import FastLanguageModel` even though wandb is optional.
Add disable_broken_wandb() following the same pattern as
disable_torchcodec_if_broken(). It proactively tries importing wandb
during early init, and if the import fails, patches
is_wandb_available() to return False and sets WANDB_DISABLED=true.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fixup mapper issues and resolve properly
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix: update GGUF save paths to use ~/.unsloth/llama.cpp with Windows support
* fix: quote LLAMA_CPP_DEFAULT_DIR in fallback shell commands to handle paths with spaces
* refactor: deduplicate platform-specific build instructions in quantization error message
* chore: remove accidentally committed PR description file
* Fix import safety and f-string bugs in save.py
- H4: Add defensive try/except for LLAMA_CPP_DEFAULT_DIR and IS_WINDOWS imports
with fallback defaults, so save.py works even if zoo PR #526 is not merged yet
- H5: Fix Kaggle error path using plain "Error: {e}" instead of f"Error: {e}",
so the actual exception is shown to users
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Datta Nimmaturi <venkatadattasainimmaturi@gmail.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix lm_head lora save
* Fix _need_to_train_embeddings guard for lm_head LoRA targets
When lm_head is already in final_modules as a LoRA target, the
_need_to_train_embeddings block should not also add it to
modules_to_save. This prevents dual-wrapping (LoRA + modules_to_save
on the same module) which causes assertion failures downstream.
Check if embed_tokens/lm_head are already being trained as LoRA
targets before adding them to modules_to_save. Also prevents
duplicate entries with elif guards.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
The current arch.startswith("gfx1") check incorrectly matches:
- RDNA1 (gfx10xx) and RDNA2 (gfx103x): not ROCm supported
- gfx1102 (RX 7600), gfx1103 (Phoenix APU): not in ROCm support matrix
- gfx1150/1151/1152 (RDNA3.5 APUs): not in ROCm support matrix
Replace with explicit whitelist aligned to the ROCm Linux support matrix:
https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html
gfx1100 - RDNA3 discrete (RX 7900 series, PRO W7900/W7800)
gfx1101 - RDNA3 discrete (RX 7800/7700 series, PRO W7700)
gfx1200 - RDNA4 discrete (RX 9060 series)
gfx1201 - RDNA4 discrete (RX 9070 series, AI PRO R9700)
Mirrors the existing is_cdna() pattern. Avoids silently applying
unverified Triton kernel tuning to unsupported hardware.
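A minimal sketch of the whitelist approach, using the four architectures listed above. The function name is_supported_rdna and the handling of feature suffixes in the reported architecture string are assumptions.

```python
# Architectures taken from the list above; anything else is treated as unsupported.
ROCM_SUPPORTED_RDNA_ARCHS = frozenset({"gfx1100", "gfx1101", "gfx1200", "gfx1201"})


def is_supported_rdna(gcn_arch_name: str) -> bool:
    """Return True only for RDNA architectures on the explicit whitelist."""
    # Assumption: the reported name may carry feature suffixes such as
    # "gfx1100:sramecc+:xnack-", so only the part before the first colon is compared.
    base = gcn_arch_name.split(":", 1)[0].strip().lower()
    return base in ROCM_SUPPORTED_RDNA_ARCHS
```

Unlike a prefix test, an exact set lookup rejects gfx1030, gfx1102, and gfx1151 without any extra exclusion rules.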
Fix global dequantize buffer dtype mismatch when loading multiple 4-bit models with different dtypes in the same process. Adds dtype check alongside existing None check for WEIGHT_BUFFER in both CUDA/HIP and XPU paths.
Use 16 warps for RDNA in the chunked cross-entropy forward kernel
(large vocab > 65536), matching the existing CDNA optimization.
Benchmarked on W7900 (gfx1100) with actual unsloth kernels (5 trials, median):
- Chunked CE forward (BS=65536): 16 warps = 2.4-2.6x faster than 32
- All other kernels (LayerNorm, RoPE, SwiGLU): default heuristic is
already optimal for RDNA; no modification needed.
Depends on: #4109 (provides is_rdna() detection)
TMA (Tensor Memory Accelerator) is an NVIDIA Hopper+ feature that does
not exist on AMD GPUs. However, _check_tma_support() incorrectly
returns True on ROCm because:
1. torch.cuda.get_device_capability() returns (11, 0) for gfx1100,
satisfying the >= 9 check intended for Hopper (sm_90).
2. ROCm Triton exports tl.make_tensor_descriptor (the symbol exists
even though the hardware does not support TMA).
This would cause MoE grouped_gemm to attempt TMA operations on AMD
GPUs, leading to runtime failures.
Fix: early-return False for HIP devices, matching the existing XPU
guard.
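The decision logic can be sketched as a pure function. The name check_tma_support and its parameters are hypothetical; the real _check_tma_support() queries torch and Triton directly instead of taking arguments.

```python
def check_tma_support(device_type: str, capability_major: int, has_tensor_descriptor: bool) -> bool:
    """Decide whether TMA code paths may be used.

    Hypothetical, argument-driven version of the check for illustration.
    """
    # TMA is an NVIDIA Hopper+ feature, so non-NVIDIA backends return early
    # even when the capability number and the Triton symbol look right.
    if device_type in ("hip", "xpu"):
        return False
    return capability_major >= 9 and has_tensor_descriptor
```

With the early return, a gfx1100 device reporting capability (11, 0) no longer passes the Hopper check.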
* fix(Triton): ensure float32 eps in RMS LayerNorm rsqrt for HIP/ROCm
On HIP (AMD ROCm), Triton constexpr eps may not promote to float32
in rsqrt, causing numerical instability (NaN/Inf) on RDNA GPUs
(gfx1100, gfx1151 Strix Halo, etc.).
Use tl.full((), eps, tl.float32) to explicitly create a float32
scalar before adding to row_var in rsqrt. Applied to both standard
and Gemma RMS LayerNorm forward kernels.
Tested on W7900 (gfx1100): full test suite passed (dim 512-2048,
bf16/fp16, various seqlen).
Related: #3385, #3588
* Apply same float32 eps fix to layernorm.py for PR #4110
layernorm.py has the identical tl.constexpr eps pattern in
layernorm_forward that can misfire on HIP/ROCm. Apply the same
tl.full((), eps, tl.float32) fix for consistency.
Both testing_suite_layernorm (standard LayerNorm) and
testing_suite_layernorm (RMS LayerNorm) pass on NVIDIA after
this change.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix(ROCm): comprehensive RDNA GPU support - fix Gemma3 NaN & add is_rdna()
- Add is_rdna() detection for RDNA3/3.5/RDNA4 consumer GPUs (gfx11xx, gfx1151, gfx12xx)
- Disable torch.compile for Gemma3 on HIP to fix NaN loss (fixes #3385, #4029)
- Export is_cdna/is_rdna from kernels for downstream use
- Import is_rdna into cross_entropy_loss for future RDNA-specific tuning
Tested on AMD Radeon PRO W7900 (gfx1100) with ROCm 7.1:
✓ Gemma3-1B: loss 3.37→3.25 (no NaN)
✓ Llama-3.2-1B: loss 2.44→2.37 (no NaN)
✓ Qwen2.5-1.5B: loss 1.89→1.85 (no NaN)
✓ RMS LayerNorm Triton kernel: bf16/fp16 PASSED
✓ Cross Entropy Loss Triton kernel: 32K/256K vocab PASSED
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address review: scope compile disable to RDNA only, use partial mode, remove unused import
Changes based on Daniel's review:
1. (HIGH) Replace DEVICE_TYPE=='hip' with is_rdna() to avoid disabling
torch.compile on CDNA GPUs (MI250X/MI300X/MI350) where it works fine
2. (MEDIUM) Use 'partial' instead of '1' for UNSLOTH_COMPILE_DISABLE to
only disable model forward compilation while keeping loss compilation,
matching the existing Sesame pattern
3. (LOW) Remove unused is_rdna import from cross_entropy_loss.py (F401)
* Remove redundant is_cdna/is_rdna exports from kernels/__init__.py
These functions are imported directly from .utils where needed
(e.g. cross_entropy_loss.py, loader.py). No external code imports
them from the unsloth.kernels namespace.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
The function (introduced in #3923) assumed that the absence of
`triton.runtime.triton_key` on ROCm means torch.compile will crash.
Investigation shows this is incorrect:
1. `triton.runtime.triton_key` was renamed/removed in the ROCm Triton
fork — it does not exist at that path. However,
`triton.compiler.compiler.triton_key` (the path torch._inductor
actually imports) EXISTS and works correctly on ROCm.
2. Both call-sites in torch._inductor (codecache.py and
async_compile.py) already wrap the import in try/except, so even a
genuinely missing triton_key would be handled gracefully.
3. Comprehensive testing on ROCm 7.1 + Triton 3.4.0 + gfx1100 confirms
torch.compile works correctly for matmul, cross-entropy, RMSNorm,
multi-layer transformer forward+backward, and LoRA — all without
triton.runtime.triton_key.
The original code was also ineffective (environment variables set after
torch import have no effect on torch._dynamo config), so removing it
has zero behavioral change on existing installations.
Supersedes the compile-disable portion of #3923.
Adds lower bound (>= 3.11) and tightens upper bound (< 3.14) for
Python version discovery in setup.sh. Extracts bounds into
MIN_PY_MINOR / MAX_PY_MINOR variables for easy future updates.
`attachment.type` resolves to `string & {}` via @assistant-ui/store@0.1.6's
generic type chain when installed through npm (package-lock.json), breaking
the `const _exhaustiveCheck: never = type` exhaustive check pattern.
Replace with a direct throw that compiles cleanly across library versions
while preserving identical runtime behaviour.
Fixes #263
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix transformers v5 RoPE inv_freq corruption during model loading
Transformers v5 initializes models on the meta device, then
_move_missing_keys_from_meta_to_device() replaces all non-persistent
buffers with torch.empty_like() (uninitialized memory). Vanilla
transformers restores inv_freq via _init_weights() checking for
original_inv_freq, but Unsloth's LlamaRotaryEmbedding subclasses
lack this attribute, so inv_freq stays corrupted with garbage values.
This caused 5-11x higher training loss on transformers v5 for all
models using Unsloth's rope (Llama 3.x, Qwen3, Mistral, TinyLlama,
Granite). Models using native transformers rope (Gemma, Phi-4,
Falcon-H1) were unaffected.
The fix recomputes inv_freq from the stored base/dim after model
loading, applies model-specific scaling via _apply_inv_freq_scaling(),
and rebuilds cos/sin caches. Also handles LongRopeRotaryEmbedding
(Phi-3.5 style short/long inv_freq). Guarded by transformers >= 5.0.0
so it is a no-op on v4.
Tested on: Llama 3.1 8B, Llama 3.2 3B, Qwen3 14B, Qwen3 4B, Phi-4,
TinyLlama, Mistral 7B, Gemma2 2B, Falcon-H1 -- all v5 losses now
match v4 baselines to < 0.004 absolute difference.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Unpack BatchEncoding in generate() for v4/v5 backwards compatibility
Old notebooks pass the full tokenizer output as input_ids:
inputs = tokenizer(..., return_tensors="pt").to("cuda")
model.generate(input_ids=inputs, ...)
This worked on transformers v4 because generate() internally
extracted the tensor. Transformers v5 calls .shape on input_ids
directly, which crashes since BatchEncoding has no .shape attribute.
Fix: in unsloth_fast_generate(), detect when input_ids is a dict-like
object (BatchEncoding) and unpack its contents into separate kwargs
before forwarding to the underlying generate(). This makes both old
and new notebook patterns work on both v4 and v5.
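The unpacking step might look roughly like the sketch below. The helper name unpack_generate_kwargs is hypothetical, and letting explicitly passed kwargs win over packed ones is an assumption, not something the commit specifies.

```python
from collections.abc import Mapping


def unpack_generate_kwargs(kwargs: dict) -> dict:
    """If input_ids is a dict-like tokenizer output, spread its entries into kwargs."""
    input_ids = kwargs.get("input_ids")
    if not isinstance(input_ids, Mapping):
        # Already a tensor (or absent): forward the kwargs unchanged.
        return kwargs
    unpacked = dict(kwargs)
    unpacked.pop("input_ids")
    for key, value in input_ids.items():
        # Assumption: an explicitly passed kwarg takes precedence over the packed one.
        unpacked.setdefault(key, value)
    return unpacked
```

A plain dict stands in for BatchEncoding here; both are dict-like, so the same Mapping check applies.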
* Remove redundant seen_ids dedup in _fix_rope_inv_freq
named_modules() already deduplicates with remove_duplicate=True (default).
Also clarify that native v5 rotary classes (Gemma3 etc.) have original_inv_freq
which transformers v5's _init_weights() uses to restore inv_freq, so they do
not need this fix.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix left-padding masks and positions in batched decode/prefill
* Fix batched generation with left padding
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix attention mask handling, padding_idx zeroing, and Mistral batched generation
1. attention_dispatch.py: Fall back from flash/xformers to SDPA when an
attention_mask is present, since flash attention only supports causal
masking via flag and cannot consume arbitrary padding masks.
2. gemma2.py: Apply attention_mask during decode inference for bsz > 1.
Guard against boolean SWA/GA flags with isinstance check. Slice mask
to match K/V length when sliding window is active. Remove dead
commented-out SDPA branch (SDPA does not support softcapping).
3. granite.py: Apply attention_mask during decode inference for bsz > 1.
Remove dead commented-out SDPA branch and misleading comment.
4. mistral.py: Fix 2D-to-4D padding mask conversion -- convert 0/1 mask
to additive format (0 for keep, -inf for mask) before combining with
the causal mask. Force SDPA backend when attention_mask is present.
5. llama.py: Skip zeroing embed_tokens.weight[padding_idx] when the
embedding is weight-tied to lm_head, since zeroing the shared weight
forces logit(pad) = 0 which is higher than real token logits in models
like Gemma, causing the decoder to emit pad tokens as gibberish. Also
add eos != pad guard, clean up unused _seq_length variable, and fix
get_max_cache_shape handling.
6. vision.py: Same padding_idx fix as llama.py for the vision model
loading path.
Tested on gemma-2b-it, gemma-2-2b-it, Llama-3.2-1B, Mistral-7B-v0.3,
Qwen2.5-0.5B, Qwen3-0.6B with flash-attn 2.8.3 active. All outputs
coherent, zero crashes, zero resize warnings.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Inference path optimizations: eliminate per-layer GPU-CPU sync, cache inspect.signature, add Granite SDPA split
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* More inference path optimizations across model files
- gemma: hoist rotary_seq_len computation to model level (eliminates N
per-layer GPU-CPU syncs from position_ids.max().item()), pre-convert
attention mask to bool once for all layers, use scalar float multiply
instead of torch.tensor allocation for embedding scaling
- gemma2: use in-place tanh_() for softcap attention, use scalar float
multiply for embedding scaling
- granite: pre-convert attention mask to bool once for all layers
- cohere: use in-place neg_() for rotary embedding (consistent with
all other model files)
- falcon_h1: use in-place mul_() for key_multiplier scaling
- llama: use in-place tanh_() for logit softcapping
* Revert scalar multiply for Gemma/Gemma2 embedding scaling
The original torch.tensor(..., dtype=hidden_states.dtype) is intentional:
sqrt(3072) rounds to 55.5 in bfloat16 vs 55.4256 in float32. A plain
scalar multiply may compute at higher precision internally, producing
different results. Restore the explicit dtype-cast tensor to match the
training path in LlamaModel_fast_forward.
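The rounding difference can be reproduced without a GPU by emulating bfloat16 rounding in pure Python. This is an illustrative emulation for ordinary finite values, not the code path torch uses; the helper name round_to_bfloat16 is hypothetical.

```python
import math
import struct


def round_to_bfloat16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # bfloat16 keeps the top 16 bits of the float32 pattern; add the rounding
    # bias before truncating the low 16 bits.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


scale = math.sqrt(3072)
as_float32 = struct.unpack(">f", struct.pack(">f", scale))[0]
as_bfloat16 = round_to_bfloat16(scale)
```

Between 32 and 64 bfloat16 values are spaced 0.25 apart, which is why the scale lands on 55.5 rather than 55.4256.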
* Fix hardcoded cuda:0 device strings and add Cohere .eq(0) bool mask
Replace 15 hardcoded "cuda:0" with f"{DEVICE_TYPE_TORCH}:0" across
gemma.py, gemma2.py, cohere.py, and falcon_h1.py to support multi-GPU
and non-CUDA devices (XPU, etc.). Add .eq(0) bool mask pre-conversion
in CohereModel_fast_forward_inference for batched inference consistency
with llama.py, granite.py, and gemma.py.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Disable flex_attention for Mllama (Llama 3.2 Vision)
Mllama's _update_causal_mask uses the deprecated make_flex_block_causal_mask
which creates a BlockMask with Q_LEN=KV_LEN=total_seq_len. During decode
with KV cache, q_len=1 but the block_mask still has Q_LEN=total_seq_len,
causing a ValueError. This is an upstream transformers issue -- newer models
use flex_attention_mask from masking_utils which handles decode correctly
via cache_position, but mllama has not been updated yet.
Add mllama to the exclusion list in prefer_flex_attn_if_supported alongside
gpt_oss so it falls back to sdpa, which works correctly for both training
and inference.
* Fix off-by-one in sliding window K/V slicing for gemma2, qwen3, falcon_h1, cohere
The old formula `slicing_tokens = 1 - sliding_window` uses negative indexing
that keeps `sliding_window - 1` tokens instead of `sliding_window`. For example
with sliding_window=32 and kv_seq_len=100, `1-32 = -31` keeps indices 69..99
(31 tokens) instead of the correct 68..99 (32 tokens).
Replace with `start = kv_seq_len - sliding_window` to match the fix already
applied in llama.py and the canonical definition in transformers masking_utils
(sliding_window_overlay: kv_idx > q_idx - W, which keeps exactly W tokens).
Also add attention_mask slicing after K/V trim in qwen3, falcon_h1, and cohere
to prevent mask/K dimension mismatch during batched SDPA inference, matching
the pattern already used in llama.py.
Currently only gemma2 (sliding_window=4096) is actively affected. The other
three models have sliding_window=None in their configs so the code path is
not triggered, but this keeps it correct for any future models that set it.
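The off-by-one can be checked with plain list slicing. This is a minimal sketch using a list of positions in place of the K/V tensors; the function names are hypothetical, and it assumes kv_seq_len > sliding_window.

```python
def keep_old(kv_seq_len, sliding_window):
    """Old formula: negative index computed as 1 - sliding_window."""
    positions = list(range(kv_seq_len))
    slicing_tokens = 1 - sliding_window
    return positions[slicing_tokens:]

def keep_new(kv_seq_len, sliding_window):
    """New formula: explicit start index, keeps exactly sliding_window tokens."""
    positions = list(range(kv_seq_len))
    start = kv_seq_len - sliding_window
    return positions[start:]

old = keep_old(100, 32)
new = keep_new(100, 32)
print(old[0], old[-1], len(old))  # 69 99 31
print(new[0], new[-1], len(new))  # 68 99 32
```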
* Fix Gemma2 softcapping order: apply mask after softcap, not before
The attention mask must be applied AFTER logit softcapping, not before.
Both the Google DeepMind reference implementation (google-deepmind/gemma,
gm/nn/_modules.py lines 254-277) and transformers' eager_attention_forward
(gemma2/modeling_gemma2.py lines 187-193) use this order:
1. logits = Q @ K^T * scale
2. logits = tanh(logits / softcap) * softcap # softcap first
3. logits = logits + mask # mask after
4. probs = softmax(logits)
The PR had the mask addition before softcapping, which causes tanh to
clamp the -inf mask values to -softcap instead of preserving them as -inf
for softmax. While the practical impact is small (masked positions get
~1e-23 probability instead of exact zero), this should match upstream.
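The effect of the ordering can be seen in a small pure-Python sketch. The softcap value of 50.0 and the three example logits are assumptions chosen for illustration; the real code operates on tensors.

```python
import math

def softcap(x, cap):
    """tanh-based soft capping: squashes x into (-cap, cap)."""
    return math.tanh(x / cap) * cap

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

CAP = 50.0                        # assumed softcap value
logits = [2.0, 1.0, 0.5]          # illustrative attention logits
mask = [0.0, 0.0, -math.inf]      # last position is masked out

# Mask before softcap: tanh turns -inf into -CAP, so the masked slot leaks.
mask_first = softmax([softcap(l + m, CAP) for l, m in zip(logits, mask)])
# Softcap before mask: -inf survives and softmax gives exactly zero.
cap_first = softmax([softcap(l, CAP) + m for l, m in zip(logits, mask)])

print(mask_first[2] > 0.0)   # True (tiny but non-zero)
print(cap_first[2] == 0.0)   # True
```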
* Clarify GQA condition precedence and remove stale comments
Add explicit parentheses to grouped query attention conditions in
llama.py, qwen3.py, granite.py to make operator precedence clear.
The expression `bsz == 1 or not X and Y` relies on Python binding
`not` > `and` > `or` which is correct but easy to misread.
Remove dead commented-out code (`# else: # Knn, Vnn = Knn, Vnn`)
and stale mask comments (`# if attention_mask ...`) from the bsz==1
fast path in llama, qwen3, cohere, falcon_h1, gemma2 inference
functions. These were leftover from the pre-batched-inference
structure and no longer apply.
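A brute-force truth table confirms that the added parentheses do not change behaviour, and shows the grouping a reader might wrongly assume. `x` and `y` stand in for the placeholders X and Y above; the function names are hypothetical.

```python
from itertools import product

def implicit(bsz, x, y):
    return bsz == 1 or not x and y          # relies on not > and > or

def explicit(bsz, x, y):
    return (bsz == 1) or ((not x) and y)    # same meaning, parenthesised

def misread(bsz, x, y):
    return (bsz == 1 or not x) and y        # the easy-to-assume wrong grouping

cases = list(product([1, 2], [False, True], [False, True]))
same = all(implicit(*c) == explicit(*c) for c in cases)
differs = [c for c in cases if implicit(*c) != misread(*c)]

print(same)     # True
print(differs)  # [(1, False, False), (1, True, False)]
```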
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Allow fp8 for non fast inference
* Extensive fp8 allow and quantizer patch
* Clean up commented-out code, duplicate import, and revert unnecessary Version() changes
- Delete commented-out FP8 fast_inference guard in FastModel (loader.py)
instead of leaving it commented -- matches FastLanguageModel which was
properly deleted
- Delete commented-out fast_inference guard in loader_utils.py
- Remove duplicate `from transformers import GenerationConfig, CompileConfig`
in vision.py (line 112 already imports both plus AutoConfig)
- Revert Version(trl.__version__) back to Version(trl) in trainer.py --
trainer.py imports Version from unsloth_zoo.utils which already handles
module objects
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Add resilience to TRL internal API reclassification
TRL is moving toward v1.0 and will reclassify several
currently-importable symbols as internal with no stability
guarantees. This adds try/except cascading imports with local
fallbacks so Unsloth keeps working regardless of whether TRL
removes, moves, or restructures these symbols.
Changes:
- rl.py: Add try/except cascade for unwrap_model_for_generation
with local contextmanager fallback. Wire sanitize_logprob from
RL_REPLACEMENTS into the compiled trainer template (same pipeline
as selective_log_softmax and other global functions). Add import
math and import logging to the template header.
- rl_replacements.py: Remove inline import of sanitize_logprob
from trl.scripts.vllm_serve in the regex replacement. The
function is now a module-level global in the compiled file.
- tokenizer_utils.py: Wrap dynamic exec import with per-item
fallback so a single removed symbol does not break the entire
bulk import.
Depends on unslothai/unsloth-zoo#516.
Tested across all TRL versions from 0.22.2 through 0.29.0.dev0
(git main). Training losses and grad norms are bit-identical
to unpatched runs.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Warn when save_pretrained_gguf overrides quantization to MXFP4 for GPT-OSS
GPT-OSS only supports MXFP4 format. If the user passes a different
quantization_method, log a warning via logger.warning_once before
overriding. Pass quantization_method=None to suppress the warning.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
GGUF was in the global EXCLUDED_TAGS set which filtered it from all
consumers of useHfModelSearch, including the chat page. Move GGUF
exclusion to an opt-in excludeGguf option so only training and
onboarding pages filter out GGUF models.
GGUF models can't be fine-tuned, so hide them from the training/studio
page while keeping them available for inference on the chat page.
- Add "gguf" to EXCLUDED_TAGS in HF model search hook
- Filter local models with .gguf extension or -GGUF in ID
* Fix Nemotron-H and Nemotron-VL model support
- Add Mamba kernel precision settings for Nemotron-H hybrid models
- Fix VL model auto_model selection for models that only register
AutoModelForCausalLM in their auto_map
- Skip quantization of out_proj for Nemotron-H Mamba layers
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Simplify VLM auto_model selection logic
Reduce three branches to two since the first and third both assign
AutoModelForVision2Seq. The simplified condition checks whether the
auto_map exclusively registers AutoModelForCausalLM without the VLM
class, and defaults to AutoModelForVision2Seq otherwise.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- Replaced `CookBookIcon` with `ChefHatIcon` in navbar for improved clarity.
- Added dark mode-specific gradient styles to recipe cards for better visual differentiation.
Remove os.chdir(save_directory) from export.py which was causing all of
unsloth-zoo's relative-path internals (check_llama_cpp, use_local_gguf,
_download_convert_hf_to_gguf) to resolve against the export directory
instead of the repo root. This caused llama.cpp to be cloned inside each
export dir and destroyed the repo root's llama-server build on cleanup.
Now passes absolute paths to save_pretrained_gguf so unsloth resolves
llama.cpp from the repo root where setup.sh already built it.
Also builds llama-quantize in setup.sh (needed by unsloth-zoo's export
pipeline) and symlinks it to llama.cpp root for check_llama_cpp().
- Replace datetime.UTC with datetime.timezone.utc in authentication.py and storage.py
- Fixes ImportError on Python versions < 3.11
- timezone.utc works on Python 3.9+
Resolves #237
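A minimal sketch of the portable spelling. The one-hour lifetime is a made-up value for illustration and is not taken from authentication.py or storage.py.

```python
from datetime import datetime, timedelta, timezone

# timezone.utc is available on older Pythons; datetime.UTC needs 3.11+.
now = datetime.now(timezone.utc)
expires_at = now + timedelta(hours=1)  # hypothetical token lifetime

print(now.tzinfo)  # UTC
```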
Fixes two bugs:
1. Chat template tags (<|im_start|>, <|im_end|>) leaking into output
because /v1/completions treated them as literal text
2. Image hallucination because image_b64 was never passed to llama-server
Now llama-server handles chat templates natively and receives images
as OpenAI-format multimodal content parts for vision models.
Replace Python-side GGUF download with llama-server's native -hf flag for
HuggingFace repos. Add frontend variant picker so users can choose
quantization (Q4_K_M, Q8_0, BF16, etc.) with file sizes. Fix vision
detection via mmproj files instead of hardcoding is_vision=False.
* Fix FP8 model loading for BNB/16-bit: redirect to BF16 sibling
Models like Ministral-3-3B-Instruct-2512 ship with FP8 weights and an FP8
quantization_config in their config.json. Loading these with BNB 4-bit/8-bit
fails because BNB cannot quantize FP8 tensors. Loading with 16-bit also fails
because the FP8 quantization config has activation_scheme=static which is
unsupported by transformers' FineGrainedFP8Config.
When an FP8 model is detected and the user is not explicitly requesting FP8
loading, check if a BF16 sibling repo exists (model_name + "-BF16") and
redirect to it. This happens early in the loading flow before any quantization
config processing.
Also pass the modified model_config to auto_model.from_pretrained to avoid
transformers re-reading the original config from the model repo.
Tested with Ministral-3-3B in 4-bit and 16-bit modes. Both now load and
train correctly.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Simplify FP8 condition and narrow exception handling
Simplify the load_in_fp8 check (works for bool and string values).
Narrow inner except to KeyError and add comment for outer except.
* Warn user when FP8 model has no BF16 sibling for redirect
Previously the except block silently fell through with `pass`,
so users would get a confusing BNB dtype error later. Now prints
a clear message explaining the FP8 situation and suggesting
load_in_fp8=True or uploading a BF16 version.
* Fix FP8 redirect state corruption and add fbgemm_fp8 support
- Fix state corruption: model_name was reassigned before
AutoConfig.from_pretrained, so if config fetch failed,
model_name pointed to BF16 repo while auto_config still
had FP8. Now only updates state after both checks succeed.
- Save original model_name so warning message is correct
even on failure.
- Handle fbgemm_fp8 quant method in addition to fp8.
* Extract FP8 redirect to shared _redirect_fp8_to_bf16() in _utils.py
Addresses reviewer feedback:
- Move FP8 redirect logic to a shared function callable from both
vision.py (FastBaseModel) and llama.py (FastLlamaModel)
- Raise RuntimeError instead of warning when BF16 sibling not found
- Add FP8 redirect to llama.py for text-only model loading path
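The redirect flow described in these commits can be sketched roughly as below. This is not the actual _redirect_fp8_to_bf16() implementation: the signature, the `repo_exists` callback, and the example repo names are all hypothetical, and the real function works on the model config rather than a bare string.

```python
def redirect_fp8_to_bf16(model_name, quant_method, load_in_fp8, repo_exists):
    """Return the repo to load from (hypothetical signature for illustration)."""
    # Leave the name alone if the user asked for FP8 or the model is not FP8.
    if load_in_fp8 or quant_method not in ("fp8", "fbgemm_fp8"):
        return model_name
    sibling = model_name + "-BF16"
    if repo_exists(sibling):
        return sibling
    # Fail loudly instead of letting a confusing BNB dtype error surface later.
    raise RuntimeError(
        f"{model_name} ships FP8 weights and no {sibling} repo was found. "
        "Use load_in_fp8=True or upload a BF16 version."
    )

known_repos = {"example-org/Example-3B-BF16"}  # made-up repo for the demo
print(redirect_fp8_to_bf16(
    "example-org/Example-3B", "fp8", False, known_repos.__contains__
))  # example-org/Example-3B-BF16
```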
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add Ministral 3B/8B/14B mapper entries
Adds all 9 Ministral model variants to the mapper:
- Instruct (3B, 8B, 14B) with FP8 variant mappings
- Base (3B, 8B, 14B)
- Reasoning (3B, 8B, 14B)
This routes mistralai/Ministral-* to unsloth/Ministral-* repos
(BF16 weights), which also avoids the FP8 config issue for the
standard loading path through loader.py.
* Add FP8 mapper entries for Mistral-Small-3.2 and Magistral-Small-2509
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-16-253.us-east-2.compute.internal>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- Added constrained dependency files for single-env installations: `constraints.txt`, `data-designer.txt`, and `data-designer-deps.txt`.
- Implemented a `patch_metadata.py` script to resolve metadata conflicts between dependency versions.
- Updated `setup.sh` to integrate single-env setup, including dependency installation and metadata patching.
- Upgraded `fastmcp` and `websockets` versions in `extras.txt` for compatibility.
- Commented out unused "Start Tutorial" button in `data-recipes-page.tsx`.
- Added support for configuring markdown note block styles, including color and opacity.
- Enabled double-click on markdown notes to open their configuration dialog.
- Adjusted layout styles in markdown previews for better interaction control.
- Updated relevant payloads, types, and UI logic to support added styling features.
- Integrated multiple example notes in learning recipes for better visualization.
- Added "Markdown Note" block to allow users to add UI-only markdown notes to the canvas for documentation purposes.
- Integrated note creation, editing, and rendering in the `recipe-studio` UI, including markdown previews.
- Updated payload generation logic to omit markdown notes from backend payloads.
- Enhanced block types, definitions, and dialog support to include the new "Markdown Note" feature.
- Introduced "Multi-Turn Chat" recipe to generate structured user-assistant conversations with domain/topic-based goals and constraints.
- Added `conversation.json` with model configuration, sampling strategies, and LLM prompts.
- Updated UI nodes, layout, and graph rendering logic to support new recipe.
- Enhanced `recipe-studio` fit view logic to improve editor layout responsiveness.
- Added three new learning recipes: "Instruction from Answer," "PDF Grounded QA," and "Structured Outputs Jinja," with respective metadata and configuration.
- Integrated support for unstructured and structured input handling, including sampling strategies, prompt definitions, and model specifications.
- Enhanced JSON structure and UI nodes to facilitate better recipe visualization and execution.
- Introduced `layoutDirection` to control graph orientation ("LR" or "TB") and integrate into edges, nodes, and payloads.
- Enhanced handle management with new default, semantic, and data-specific mappings based on layout direction.
- Added handle normalization for consistent connections across layouts and semantic/data flows.
- Updated UI to reflect layout-aware positioning and semantic connections.
- Added handle normalization functions to standardize handle IDs across connections.
- Expanded UI for scorer options with real-time updates, input fields for values and descriptions, and support for adding/removing options.
- Updated graph node handles and their layout logic for better connection visualization.
- Stripped sensitive fields (e.g., `api_key`) from payloads during export.
- Introduced a new "Instruction from Answer" learning recipe with related metadata, payload integration, and UI updates.
- Enhanced badge display logic to include up to 3 badges with overflow indication for additional learning badges.
- Changed default eval_steps from 0.01 to 0.0 across backend and frontend
- Fixed UI to allow eval_steps=0 (removed min=0.001 constraint)
- Added conditional eval logic with helpful console messages
- Updated tooltip to explain how to disable evaluation
- Tested: confirmed eval disabled by default with eval_steps=0.0
* Suppress FBGEMM CUTLASS "Arch conditional MMA" stdout spam on Blackwell GPUs
On Blackwell GPUs (B200/B100, SM100), FBGEMM's f8f8bf16_blockwise kernel
is hardcoded to cutlass::arch::Sm90 with no SM100 code path. When
test_has_fbgemm() probes this kernel, it fires 2304 "ERROR : Arch
conditional MMA instruction used without targeting appropriate compute
capability" lines before aborting and returning zeros.
The existing HidePrintMessage filter on sys.stderr (line 109) does not
catch these because CUDA device-side printf writes to stdout fd 1 at the
C level, bypassing Python's sys.stdout/sys.stderr entirely.
Fix: add suppress_cuda_printf() context manager in import_fixes.py that
redirects fd 1 and fd 2 to /dev/null at the OS level, with
torch.cuda.synchronize() and libc fflush before restoring. Wrap the
test_has_fbgemm() call in fp8.py with this context manager.
Tested on B200 with fbgemm-gpu-genai 1.4.0+cu130 and 1.5.0+cu130:
- Before: 2304 warning lines on every import
- After: 0 warning lines
- UNSLOTH_HAS_FBGEMM correctly set to 0 (Triton fallback works)
- Works with both UNSLOTH_ENABLE_LOGGING=0 and =1
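The OS-level redirection described above can be sketched as below. This is a simplified illustration, not the suppress_cuda_printf() code in import_fixes.py: the name `suppress_fd_output` is hypothetical, and the torch.cuda.synchronize() and libc fflush steps are omitted so the example runs without a GPU.

```python
import contextlib
import os
import sys

@contextlib.contextmanager
def suppress_fd_output(fds=(1, 2)):
    """Temporarily point the given file descriptors at the null device."""
    sys.stdout.flush()
    sys.stderr.flush()
    saved = [os.dup(fd) for fd in fds]          # keep copies to restore later
    devnull = os.open(os.devnull, os.O_WRONLY)
    try:
        for fd in fds:
            os.dup2(devnull, fd)                # C-level writes now go nowhere
        yield
    finally:
        sys.stdout.flush()
        sys.stderr.flush()
        for fd, copy in zip(fds, saved):
            os.dup2(copy, fd)                   # restore the original target
            os.close(copy)
        os.close(devnull)

with suppress_fd_output():
    os.write(1, b"this line is discarded\n")    # bypasses sys.stdout entirely
print("stdout is restored")
```

Restoring inside `finally` and closing every duplicated descriptor is what prevents an fd leak if the wrapped call raises.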
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Guard _libc init and fflush to prevent fd leak on failure
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-16-253.us-east-2.compute.internal>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix VLM processor load degradation and vLLM CUDA version detection
vision.py - Fix VLM processor load for issue #4085:
- Before loading the processor, scan local config files and strip the
_Unsloth_Patched_ prefix. AutoProcessor.from_pretrained silently
degrades to a text-only tokenizer instead of raising an exception
when it encounters the unrecognized class name, so the existing
get_auto_processor fallback never triggers. Sanitizing the configs
before loading fixes backwards compat for old corrupted saves.
- After loading, detect when AutoProcessor returned a text-only
tokenizer for a VLM model (has no image_processor attribute) and
trigger the manual fallback constructor.
import_fixes.py - Fix vLLM CUDA version mismatch detection:
- _is_broken_vllm_error now also matches CUDA shared library errors
(libcudart, libcublas, libnvrtc) with "cannot open shared object
file". Previously it only matched errors containing "vllm._c" in
the message text, which missed cases where the error message was
about the missing CUDA library itself (e.g. vllm built for CUDA 12
on a CUDA 13 system).
- New _get_vllm_cuda_mismatch_message function extracts the CUDA
version from the error, compares to the system CUDA version via
torch.version.cuda, and returns a targeted install command using
the correct GitHub releases wheel URL.
- disable_broken_vllm uses the targeted message when a CUDA mismatch
is detected, falling back to the existing generic message otherwise.
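The detection step can be sketched with a regular expression. This is an illustration only: the function names and the exact pattern are assumptions, not the code in import_fixes.py, and `system_cuda` stands in for the string returned by torch.version.cuda.

```python
import re

# Matches e.g. "libcudart.so.12: cannot open shared object file".
_CUDA_LIB_ERROR = re.compile(
    r"(libcudart|libcublas|libnvrtc)\.so\.(\d+)[^:]*: cannot open shared object file"
)

def cuda_major_from_error(message):
    """Return the CUDA major version the missing library belongs to, or None."""
    match = _CUDA_LIB_ERROR.search(message)
    return int(match.group(2)) if match else None

def is_cuda_mismatch(message, system_cuda):
    """True when the error names a CUDA major version other than the system's."""
    wanted = cuda_major_from_error(message)
    return wanted is not None and wanted != int(system_cuda.split(".")[0])

msg = "ImportError: libcudart.so.12: cannot open shared object file: No such file or directory"
print(cuda_major_from_error(msg))     # 12
print(is_cuda_mismatch(msg, "13.0"))  # True
```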
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-16-253.us-east-2.compute.internal>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- Removed unused model usage properties (`total`, `tps`, `requestsSuccess`, etc.) for cleaner data handling.
- Added new metrics: total input/output tokens, null rate, and low uniqueness flags.
- Improved UI for execution summary cards with consolidated insights and model usage tables.
- Introduced detailed analysis for dataset columns, including dropped columns and LLM column counts.
- Optimized rendering logic to reduce clutter and enhance user experience.
- Added `log_lines` field to track and display runtime logs for executions.
- Enhanced progress tracking with terminal-like log outputs and live log scrolling.
- Introduced detailed "model usage" and "dropped columns" analysis in `ExecutionsView`.
- Optimized UI components for displaying dataset metrics, including input/output token averages.
- Added column visibility toggles using a dropdown menu for greater customization.
- Introduced expandable table cells for long values with "expand/collapse" functionality.
- Ensured hidden columns reset on execution change, providing a consistent user experience.
- Added logic to calculate and manage column-level progress for job executions.
- Introduced `progress_columns_total` and `_column_done` fields for more granular progress updates.
- Improved overall progress computation by considering total columns and individual progress per column.
- Extracted shared execution utilities into `execution-helpers.ts` for reusability across features.
- Replaced deprecated `/preview` endpoint and its logic with unified job execution handling.
- Consolidated job execution flows ("Preview" and "Full Run") into shared `runJobExecution` logic.
- Enhanced execution progress tracking with support for column-level progress reporting.
- Added support for handling execution job events and improved error reporting from the backend.
- Updated backend to better manage dataset access errors and provide more informative error messages.
- Cleaned up redundant code in `use-recipe-studio-actions` and streamlined execution APIs.
- Introduced backend changes to handle dataset pagination with limit, offset, and total row support.
- Updated frontend execution view with dataset pagination controls, including "Next" and "Prev" buttons.
- Extended recipe execution logic to manage dataset pagination details like page number, page size, and total records.
- Introduced "Full Run" support in execution logic, including progress tracking, cancellation, and job status updates.
- Extended backend to manage full execution jobs, handle dataset previews, and return detailed analysis and artifacts.
- Updated frontend components to support full runs, with execution sorting, live updates, and detailed execution views.
- Enhanced `ExecutionsView` with progress indicators, status filtering, and dataset preview capabilities.
- Added IndexedDB schema migration to track additional execution metadata.
- Added `ExecutionsView` with execution history tracking, live updates, and detailed data analysis.
- Implemented IndexedDB support via Dexie to persist execution records locally.
- Enhanced backend preview logic to return execution analysis and artifacts.
- Updated studio header with view toggling between "Editor" and "Executions."
- Deleted `jinja-ref-autocomplete` components and related hooks.
- Replaced custom Jinja variable autocomplete with standard `Textarea` and `Input` components.
- Streamlined variable handling logic by replacing `getAvailableRefItems` with `getAvailableVariables`.
- Removed unused state (`flowMoving`) and redundant logic tied to Jinja-specific functionality.
* Add `datasets` metadata support to model cards
Add an optional `datasets` parameter to all save/push functions so users
can specify which datasets were used for training. The metadata is set
via `ModelCard.data.datasets` for standard paths and via
`metadata_update` for GGUF and generic save paths.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Fix datasets metadata for existing repos, add token, improve errors
- Add metadata_update fallback in create_huggingface_repo and
upload_to_huggingface so datasets metadata is set even when the
repo already exists (previously only worked on first creation).
- Pass token=token to all metadata_update calls so they work
without a global HF login.
- Replace silent except:pass with logger.warning_once for
metadata failures so users know if something went wrong.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix generic datasets metadata repo resolution for PR #4076
* Fix create_huggingface_repo username resolution for PR #4076
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
- Adjusted padding, spacing, and grid configurations for better alignment and scaling across screen sizes.
- Enhanced mobile responsiveness by updating flex and grid layouts, ensuring optimal display on smaller devices.
- Tuned container dimensions and card styling to maintain design consistency.
- Implemented a utility to manage `training-compare-handoff` data in `sessionStorage` with strict validation and expiration logic.
- Added methods to set, retrieve, and clear handoff data for improved chat training flow.
- Added halfway/completed training hints with actionable links.
- Introduced sliders for adjusting max steps and epochs dynamically.
- Refined tooltip explanations for configuration parameters.
- Enabled custom overlay styling for `AlertDialogContent`.
- Introduced `EvalLossChartCard`, `GradNormChartCard`, `LearningRateChartCard`, and `TrainingLossChartCard` components.
- Implemented shared chart settings via `SharedChartSettings` to manage scale, outliers, and view configuration.
- Added utilities for metrics formatting, step tick generation, data compression, and smoothing (`utils.ts`).
- Created types and structures for chart data handling (`types.ts`).
- Removed redundant vision-check controllers.
- Added `NON_PERSISTED_STATE_KEYS` to manage persisted training state.
- Introduced `partializePersistedState` for cleaner state filtering.
- Implemented backend model configuration mapping to training state.
- Added auto-apply logic for default configurations when models are selected.
- Introduced utilities for type conversion and validation within training configuration.
- Create studio/backend/colab.py using Colab's built-in proxy
- Uses google.colab.kernel.proxyPort() for URL (no cloudflare)
- Shows nice clickable link with IPython.display.HTML
- Notebook has just 2 cells: setup and start
- Much simpler than external tunneling approach
* FP8 per tensor quant support
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix security-regression fallout in chat templates and PDL patching
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Drop security regression test files from PR scope
* Apply suggestion from @danielhanchen
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Guard optional vLLM imports when extension is broken
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove vLLM import guard tests from PR scope
* Block broken vLLM imports like causal_conv1d
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Raise ImportError for stable torchvision mismatches
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove torchvision compatibility tests from PR scope
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- Updated step descriptions across Studio, Chat, and Export tours for better clarity.
- Added `openSidebar` state management function and integrated it into the tour logic.
- Improved target detection in guided tours with retry logic for better handling of unavailable elements.
- Replaced `setThreadWarming` logic with streamlined token settlement functions (`settleFirstTokenOk` and `settleFirstTokenErr`) for improved readability and reliability.
- Simplified model loading/unloading functions with reusable `performLoad` and `performUnload` patterns.
- Removed `warmingByThreadId` from runtime store and associated code for reduced complexity.
- Enhanced title generation flow by consolidating logic for persisting and streaming titles.
- Refactored loading/unloading logic to provide detailed toast notifications with statuses (loading, success, error).
- Removed unused `WarmupIndicator` component from thread UI to simplify interface.
- Introduced better error handling for model refresh and inference tasks.
When using device_map='balanced' with multiple GPUs, the labels tensor
may reside on a different device than the logits/losses tensors. This
causes a RuntimeError at the masked_fill_ call in the chunked
cross-entropy forward path.
Fix: explicitly move labels to the same device as logits at the start
of Fast_CrossEntropyLoss.forward(). This is a no-op on single-GPU
setups.
Fixes #4041
* Auto-configure AMDGPU_ASIC_ID_TABLE_PATH on ROCm startup
* Remove ROCm fd2 amdgpu.ids noise filter wrappers
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use PyPI bitsandbytes for amd extra to avoid malformed wheel URL
* Add amd-preview extra for bitsandbytes continuous wheel channel
* Keep amd extra on bitsandbytes>=0.49.1 and remove amd-preview
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Wrap unsloth_zoo import with HIP amdgpu.ids filter
* Refactor ROCm ids filter helpers for readability
* Rename ROCm ids filter helper and annotate call sites
* Remove obsolete amdgpu ids filter alias
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
MI355X (gfx950) has the same 1024-thread workgroup limit as MI300X (gfx942),
but was missing from is_cdna(), causing all Triton kernels to use num_warps=32
(2048 threads) instead of 16 (1024 threads), resulting in OutOfResources crash.
Tested on: 8x AMD Instinct MI355X (gfx950), ROCm 7.1
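The thread counts above follow from the 64-thread wavefront used on CDNA GPUs. A short arithmetic sketch with hypothetical helper names:

```python
WAVEFRONT_SIZE = 64            # threads per wavefront on CDNA GPUs
MAX_WORKGROUP_THREADS = 1024   # workgroup limit stated for gfx942 and gfx950

def threads(num_warps):
    return num_warps * WAVEFRONT_SIZE

def fits(num_warps):
    return threads(num_warps) <= MAX_WORKGROUP_THREADS

print(threads(32), fits(32))  # 2048 False
print(threads(16), fits(16))  # 1024 True
```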
* Suppress HIP libdrm stderr noise in causal_conv1d probe
* Broaden HIP libdrm stderr suppression for early ROCm startup
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
* Improve HIP GPU name reporting in startup banner
* Drop MI300X arch suffix in banner name
* Normalize _utils.py file mode
* Simplify FA2 fallback text and filter AMD ids noise
* Strip trailing GPU arch suffix via regex
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use gfx lookup default and normalize Ryzen AI naming
* Remove name-path Ryzen AI normalization
* Expand ROCm gfx map to full documented GPU name aliases
* Simplify HIP fallback naming to AMD gfx token
* Remove Ryzen AI torch_name normalization
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Improve HIP GPU name reporting in startup banner
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Handle broken causal_conv1d import at runtime
Add a startup import-time probe for causal_conv1d and disable the fast path when the shared library is ABI broken. This keeps Falcon H1/model loading resilient without requiring env flags.
- Add disable_broken_causal_conv1d in import_fixes.
- Invoke it early from unsloth/__init__ during package init.
- Make Falcon H1 optional imports in loader and models/__init__ soft-fail instead of failing hard.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Enforce unavailable semantics for broken causal_conv1d
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove Falcon H1 import swallowing
* Restore optional Falcon H1 import guard
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove causal_conv1d regression tests
* Trim FA2 fallback messaging
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- Introduced `ViewportControls` for zoom, fit view, and interactive toggle in canvas lab.
- Extracted and reused `CANVAS_FLOATING_ICON_BUTTON_CLASS` for consistent button styling.
- Updated API base paths and server proxy settings.
- Enabled dynamic interaction states for nodes and connections in canvas lab.
* convert print to logger
* Print but cleaner
* Hide model on multiple devices
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix typo
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix typo transfomers -> transformers, revert MoE message change
* Update MoE detection message to show num_experts and target_modules
* Fix llama-cli path in save info message
* target_parameters warning for moe
* fix should_convert_module for llm_int8_skip_modules
* fix should_convert_module for llm_int8_skip_modules
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Logging filters
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* negation
* remove should_convert_module patch
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
* Fix warmup_ratio deprecation warning for transformers >= 5.0
In transformers 5.0, warmup_ratio is deprecated in favor of
warmup_steps which now accepts float values (< 1 = ratio,
>= 1 = absolute steps).
The compiler now conditionally sets warmup_steps=0.1 on
transformers >= 5.0 (same semantics as warmup_ratio=0.1) and
keeps warmup_ratio=0.1 on older versions where warmup_steps
only accepts int.
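A minimal sketch of the version switch described above. The helper names (`_major_version`, `warmup_kwargs`) are hypothetical and are not the compiler's actual code; only the 5.0 boundary and the 0.1 value come from this commit message.

```python
import re


def _major_version(version: str) -> int:
    """Return the leading major number of a version string such as '5.0.0.dev0'."""
    match = re.match(r"\d+", version)
    if match is None:
        raise ValueError(f"Unparseable version: {version!r}")
    return int(match.group(0))


def warmup_kwargs(transformers_version: str, ratio: float = 0.1) -> dict:
    """Pick the warmup argument that the given transformers version accepts."""
    if _major_version(transformers_version) >= 5:
        # transformers >= 5.0: warmup_steps accepts a float, and < 1 means a ratio.
        return {"warmup_steps": ratio}
    # Older versions: warmup_steps only accepts int, so keep warmup_ratio.
    return {"warmup_ratio": ratio}
```

The returned dict can be splatted into the training arguments, so the caller never names either parameter directly.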
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Inject token_type_ids for Gemma3 multimodal training on transformers 5.x
In transformers 5.x, create_causal_mask_mapping() raises ValueError when
is_training=True and token_type_ids is None. When doing text-only SFT on
Gemma3 4B (a multimodal model), the dataset_utils detection for
_needs_token_type_ids can miss because:
- The model is wrapped in PeftModel, so type(model).__module__ points to
peft.peft_model instead of transformers
- The processing_class is a tokenizer (not Gemma3Processor), so the
fallback MRO check resolves to a module without create_causal_mask_mapping
This adds a fallback in _unsloth_pre_compute_loss that injects
token_type_ids=zeros when:
1. token_type_ids is not already in inputs
2. The inner model config has model_type "gemma3"
3. The model's module has create_causal_mask_mapping (transformers 5.x)
4. The model is in training mode
On transformers 4.x, create_causal_mask_mapping does not exist so this
check is inert.
Depends on: unslothai/unsloth-zoo#488
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* FP8: Load model on-the-fly in vLLM
**Summary:** Existing support for `load_in_fp8=True` performs
an offline quantization when loading the initial model.
This is no longer necessary as of vllm==0.12.0 (after
https://github.com/vllm-project/vllm/pull/23014), where we
can quantize the model on-the-fly when we load it:
```
llm = LLM(
    ...
    hf_overrides={
        "quantization_config_dict_str": json.dumps(torchao_config),
    },
)
```
**Note:** Needs https://github.com/unslothai/unsloth-zoo/pull/380
**Test Plan:**
https://gist.github.com/andrewor14/5b85119fae46845d07b608d420907423
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix on-the-fly FP8: always check mapper first, fallback to on-the-fly
The original implementation bypasses the FP8 mapper entirely for
vllm >= 0.12.0, meaning models like Llama-3.2-1B-Instruct and Qwen3-8B
that have pre-quantized FP8-Block/FP8 checkpoints would never use them.
This fixes the priority order:
1. Mapper has a pre-quantized model -> use it (always)
2. Mapper has no match + vllm >= 0.12.0 -> on-the-fly FP8 via torchao
3. Mapper has no match + vllm < 0.12.0 -> offline quantization
Changes:
- loader_utils.py: Move vllm >= 0.12.0 check after mapper lookups
- loader.py: Set load_in_fp8=False when mapper resolves to a
pre-quantized model to prevent double quantization
Tested on B200 with Llama-3.2-1B-Instruct and Qwen3-8B. Corrected code
produces results matching baseline (pre-quantized path preserved).
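The priority order above can be sketched as a small decision function. This is an illustration only: `resolve_fp8_strategy`, the strategy labels, and the placeholder model names are hypothetical, and the real mapper lookup in loader_utils.py may be shaped differently.

```python
import re
from typing import Dict, Tuple


def _version_tuple(version: str) -> Tuple[int, ...]:
    """Parse 'X.Y.Z' (ignoring any '+local' suffix) into a 3-int tuple."""
    numbers = []
    for piece in version.split("+")[0].split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group(0)))
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def resolve_fp8_strategy(
    model_name: str, mapper: Dict[str, str], vllm_version: str
) -> Tuple[str, str]:
    """Return (strategy, model_to_load) following the priority order above."""
    prequantized = mapper.get(model_name)
    if prequantized is not None:
        # 1. A pre-quantized checkpoint always wins.
        return ("prequantized", prequantized)
    if _version_tuple(vllm_version) >= (0, 12, 0):
        # 2. No mapper match and a new enough vLLM: quantize on the fly.
        return ("on_the_fly", model_name)
    # 3. No mapper match and an older vLLM: offline quantization.
    return ("offline", model_name)
```

When the first branch is taken, the caller would also set load_in_fp8=False so the already quantized checkpoint is not quantized again.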
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
* convert print to logger
* Print but cleaner
* Hide model on multiple devices
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix typo
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix typo transfomers -> transformers, revert MoE message change
* Update MoE detection message to show num_experts and target_modules
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
* Fix #3397: Prevent trainer tokenization hang with safe num_proc
* Fix #3397: Add missing import sys for Windows-safe tokenization
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Consolidate with existing num_proc guard in dataset_utils.py
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
* Fix EmbeddingGemma float16 NaN by adding gemma3_text to FORCE_FLOAT32 and SDPA lists
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Inject model reference for dynamic token_type_ids detection in SFTTrainer
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Suppress vLLM v1 executor sleep/wake log messages
Add HideLoggingMessage filters for vllm.v1.executor.abstract logger to
suppress repetitive sleep/wake INFO and WARNING messages that spam training
output when UNSLOTH_VLLM_STANDBY is enabled. The existing filter at line 275
handles the legacy vllm.executor.executor_base path; this adds coverage for
the v1 engine path used by vllm 0.11+.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- Moved `AvailableVariables` to shared directory.
- Updated dialogs to use shared `AvailableVariables` component.
- Enhanced inline expressions and processors dialog with better variable display.
- Introduce `AvailableVariables` for displaying variables linked to configs.
- Implement `ChipInput` for dynamic value management in category and subcategory dialogs.
- Add `AuxVariableBadges` to aux nodes for displaying variable references.
- Update inline components with comboboxes for better user experience.
- Replace badges and manual inputs with streamlined reusable components.
* Silence peft target_parameters RuntimeWarning for MoE models
Wrap _get_peft_model calls with warnings.catch_warnings() to suppress
the "target_parameters were set but no parameter was matched" warning.
This fires on MoE models where expert layers use nn.Parameter naming
that peft warns about but handles correctly.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Strip the "anihilate"/"annihilate" warning block from compiled trainer
source so it does not fire when Unsloth auto-enables padding-free mode
with batch size 1 (the common single-GPU case).
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
* Fix dtype mismatch in fp16 + 4-bit/8-bit LoRA training
Two fixes for training with dtype=torch.float16 and load_in_4bit=True:
1. fast_lora.py: fast_dequantize() returns tensors in quant_state.dtype
(typically bfloat16 or float32), but activations may be float16. The
subsequent matmul/addmm operations require matching dtypes. Add dtype
casts after each fast_dequantize() call in LoRA_MLP.backward and
LoRA_QKV.backward (5 locations total).
2. rl.py: TRL unconditionally casts trainable parameters to bfloat16 in
the peft init block. When training with fp16=True, this causes
GradScaler to crash since it requires float32 parameters. Make the
cast conditional -- use float32 when fp16 is enabled, bfloat16
otherwise. This is a no-op for GRPOTrainer (whose peft init block is
already removed by the existing regex), but fixes SFTTrainer and
other TRL trainers.
Tested with Llama-3.2-1B-Instruct 4-bit on both fp16 and bf16 training.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix fp16 + 4-bit LoRA: thread correct_dtype through post_patch
Root cause: fast_dequantize returns tensors in quant_state.dtype, which
for pre-quantized models is bfloat16 (from config.json). The post_patch
methods in llama/gemma/gemma2 call patch_model_and_tokenizer without
passing correct_dtype, so quant_state.dtype is never overridden to match
the user's requested dtype. This causes a dtype mismatch crash in the
backward pass when training with dtype=torch.float16.
Fix: pass the user's dtype from from_pretrained through post_patch to
patch_model_and_tokenizer as correct_dtype, matching the pattern already
used by vision.py.
Revert the 5 symptom-level dtype casts in fast_lora.py (upW, gateW, QW,
KW, VW) since they are no longer needed with quant_state.dtype properly
set at the source.
Tested: fp16+4bit and bf16+4bit Llama-3.2-1B-Instruct 15-step SFT runs
both complete successfully with similar losses (~1.558 vs ~1.563).
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove TRL's unconditional bfloat16 cast instead of patching the dtype
TRL 0.26.0+ hardcodes `param.data.to(torch.bfloat16)` for all trainable
params in quantized models, citing the QLoRA paper recommendation. This
is wrong: it ignores the user's requested dtype and breaks GradScaler
when fp16=True. The block exists in sft_trainer, grpo_trainer,
rloo_trainer, and reward_trainer (not dpo_trainer).
Previous fix patched the cast to be dtype-conditional. This commit
replaces the entire guard `if getattr(model, "is_loaded_in_4bit", ...)
or getattr(model, "is_loaded_in_8bit", ...):` with `if False:` to
disable the block entirely. Unsloth already handles adapter dtype via
patch_model_and_tokenizer, making TRL's cast both unnecessary and
harmful.
For GRPOTrainer the enclosing peft init block is already removed by
the regex above, making this a no-op for GRPO.
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix trainer compilation failures from trl.experimental thin wrappers
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix OOM from prepare_model_for_kbit_training overwriting peft_config patching
---------
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
TRL 0.22.x checks _is_vlm (model type) instead of _is_vision_dataset
(dataset content, added in 0.25.1+) in _set_signature_columns_if_needed.
When _is_vlm=True (e.g. Gemma3), signature columns are set to vision-only
["messages","prompt","completion","images"], which has zero overlap with
tokenized text columns [input_ids, labels, attention_mask, ...], causing
a ValueError.
Fix: expand the VLM branch signature columns to include both vision and
text column names. Extra columns not present in the dataset are harmlessly
ignored by _remove_unused_columns (it only raises when zero columns match).
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
* Patch before compile?
* Fix notebook compatibility for transformers 4.57.6 and TRL 0.22-0.27
Fixes several notebook failures discovered during testing all 125
notebooks with transformers==4.57.6 + TRL 0.22.2 and TRL 0.27.1.
Warning suppression (import_fixes.py):
- Suppress torch 2.9+ pin_memory/is_pinned device deprecation warnings
- Suppress cuda.cudart/cuda.nvrtc module deprecation FutureWarning
- Filter vllm "Level is deprecated" stderr noise
- Filter PydanticSerializationUnexpectedValue warnings
- Filter Triton "df: No such file" stderr noise
VLM tokenizer loading (vision.py):
- Add _construct_vlm_processor_fallback() for models where
AutoProcessor.from_pretrained fails (e.g., ERNIE 4.5 VL, LFM2.5-VL)
- Wrap processor loading in try/except with fallback to manual
construction from separate image_processor + tokenizer components
- Add fallback to AutoTokenizer/PreTrainedTokenizerFast when tokenizer
loading or patching fails
TRL 0.27.1 trainer compatibility (trainer.py):
- Add _resolve_trainer_params() to handle thin wrapper trainers that
only have def __init__(self, *args, **kwargs) (e.g., ORPOTrainer
in TRL 0.27.1) by walking MRO for real parameter signature
VLM _is_vlm detection (rl.py):
- Replace blanket _is_vlm=False override with model-architecture-based
detection that checks vision_config or ForConditionalGeneration class
name, fixing VLM training when bare tokenizer is passed as
processing_class
ModernBERT SDPA compatibility (loader.py, sentence_transformer.py):
- Add "modernbert" to DISABLE_SDPA_MODEL_NAMES to avoid stride
alignment issues with torch.compile backward pass
- Add DISABLE_SDPA check for sentence transformer models
Other fixes (_utils.py):
- Suppress false uninitialized weight warnings for VLM
multi_modal_projector.layer_norm
Tested: 92/125 notebooks pass with TRL 0.22.2, 94/125 with TRL 0.27.1.
Remaining failures are infra (missing FFmpeg, network timeouts, GPU
arch) not code bugs.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix KTO shape mismatch on TRL 0.27.2+ and truncation alignment
- Patch KTO get_batch_logps to auto-align logits and labels when Unsloth
model forward truncates input_ids beyond max_seq_length. TRL 0.27.2
changed _process_tokens to only truncate completions (not prompts), so
sequences with long prompts exceed max_seq_length and trigger model-side
truncation. The original ValueError is replaced with min-length alignment.
- Also truncate attention_mask in LlamaModel forward when input_ids are
truncated to max_seq_length, preventing shape mismatches in attention.
- Widen except clause in rl_replacements.py openenv import from
`except ImportError` to `except (ImportError, NameError, Exception)` to
handle vllm SamplingParams NameError in TRL 0.27.2.
* Fix TRL 0.26+ thin wrapper resolution, enable ModernBERT SDPA, clean up warning filters
TRL 0.26+ thin wrapper resolution (rl.py):
- Filter _-prefixed private imports when discovering Trainer/Config classes
- Look up Config in separate *_config.py module when not found in trainer module
- Detect thin wrappers (<1000 chars source) and resolve to experimental parent
via MRO walk; use resolved module for imports and create_new_function
- Enables all 15 trainers to patch successfully (was 5/15 before)
ModernBERT SDPA (loader.py):
- Remove "modernbert" from DISABLE_SDPA_MODEL_NAMES
- SDPA works correctly for both classification and sentence transformers
- Verified: 88.9% accuracy on emotion classification, correct domain-specific
embeddings after sentence transformer fine-tuning
Warning filter cleanup (import_fixes.py):
- Remove cuda.cudart/cuda.nvrtc FutureWarning filters (no such warnings
exist in torch 2.9.1+; proactive suppression is unnecessary)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove multi_modal_projector.layer_norm from uninitialized weight guard
The LFM2.5-VL projector LayerNorm is properly initialized by
transformers and does not need to be excluded from the uninitialized
weight check. The original exclusion was added as a workaround but is
no longer needed after the upstream fix.
* Add transformers 5.0 compat: rope_theta helper, config-as-dim detection, BatchEncoding guard, try/except for TRL trainer source, push_to_hub_token compiler fix
- llama.py: Add _get_rope_theta() helper handling both config.rope_theta and rope_parameters dict
- llama.py: Handle BatchEncoding in unsloth_fast_generate (transformers 5.0+ returns BatchEncoding from apply_chat_template)
- gemma.py: Detect config passed as dim arg in GemmaFixedRotaryEmbedding
- tokenizer_utils.py: Add try/except for TRL trainer getsource in patch_sft_trainer_tokenizer
- rl_replacements.py: Add compiler fix replacing bare pop("push_to_hub_token") with pop(..., None)
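Two of the items above can be illustrated in a few lines. This is a sketch under assumptions: `get_rope_theta` and `drop_push_to_hub_token` are hypothetical names, the 10000.0 default is a placeholder, and the actual _get_rope_theta() in llama.py may differ.

```python
from types import SimpleNamespace


def get_rope_theta(config, default: float = 10000.0) -> float:
    """Read rope_theta from either config.rope_theta or a rope_parameters dict."""
    rope_theta = getattr(config, "rope_theta", None)
    if rope_theta is not None:
        return rope_theta
    rope_parameters = getattr(config, "rope_parameters", None)
    if isinstance(rope_parameters, dict) and "rope_theta" in rope_parameters:
        return rope_parameters["rope_theta"]
    return default  # assumed fallback, not taken from the source


def drop_push_to_hub_token(kwargs: dict) -> dict:
    """pop(key, None) never raises KeyError, unlike a bare pop(key)."""
    kwargs.pop("push_to_hub_token", None)
    return kwargs


# Attribute-style config versus dict-style config.
old_style = SimpleNamespace(rope_theta=500000.0)
new_style = SimpleNamespace(rope_parameters={"rope_theta": 1000000.0})
```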
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use trl.experimental string check instead of char-count heuristic for thin wrapper detection
The <1000 / >1000 char threshold was fragile -- XPOConfig's parent is only
994 chars and would be skipped. All thin wrappers in TRL 0.26+ contain
"trl.experimental" in their deprecation warning, while no real trainer or
config class does, making it a reliable detection marker.
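A sketch of the two ideas involved: the string marker check, and an MRO walk that skips `*args, **kwargs`-only constructors to find the real signature. The classes and function names here are hypothetical stand-ins, not TRL or Unsloth code.

```python
import inspect


def is_thin_wrapper(source: str) -> bool:
    """Marker check: thin wrappers mention trl.experimental in their source."""
    return "trl.experimental" in source


def resolve_init_params(cls) -> list:
    """Walk the MRO and return the first __init__ with real named parameters."""
    for klass in cls.__mro__:
        init = klass.__dict__.get("__init__")
        if init is None:
            continue
        try:
            signature = inspect.signature(init)
        except (TypeError, ValueError):
            continue
        params = [p for name, p in signature.parameters.items() if name != "self"]
        only_variadic = all(
            p.kind in (p.VAR_POSITIONAL, p.VAR_KEYWORD) for p in params
        )
        if params and not only_variadic:
            return [p.name for p in params]
    return []


class RealTrainer:  # stand-in for the experimental parent class
    def __init__(self, model=None, args=None, train_dataset=None):
        self.model = model


class ThinTrainer(RealTrainer):  # stand-in for a forwarding wrapper
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
```

A content marker does not depend on source length, which is the property the char-count heuristic lacked.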
* Move DISABLE_SDPA_MODEL_NAMES import to module level in sentence_transformer
The function-level import was redundant since loader.py is already imported
at module level. Move it to the existing loader import line.
---------
Co-authored-by: Datta Nimmaturi <venkatadattasainimmaturi@gmail.com>
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Add `inputs_embeds` parameter to `_fast_prepare_inputs_for_generation` so
`model.generate(inputs_embeds=...)` works with Unsloth-patched models.
Changes:
- Add `inputs_embeds=None` to function signature (fixes HF inspect check)
- Track `use_inputs_embeds` flag: True when inputs_embeds provided and no cache
- Conditionally return inputs_embeds on first step, input_ids on subsequent steps
- Handle input_ids being None/empty for batch size and device extraction
- Add attention_mask None-guard before slicing
Fixes: https://github.com/unslothai/unsloth/issues/3798
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: siddhudonda <siddhudonda@users.noreply.github.com>
When using torchrun with quantized models (4bit/8bit/fp8), each rank
must load the model directly onto its own GPU. The default device_map
("sequential") places everything on GPU 0, causing illegal memory
access errors when Accelerate tries to relocate quantized weights.
Use the existing prepare_device_map() utility from loader_utils to
detect distributed training via LOCAL_RANK/WORLD_SIZE env vars and
override device_map to target each rank's local GPU. This is applied
in both FastLanguageModel.from_pretrained and FastModel.from_pretrained,
covering text, vision, and audio model paths.
Fixes #3914
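A rough sketch of the per-rank override. `sketch_device_map` is a hypothetical helper and is not the prepare_device_map() implementation; the `{"": local_rank}` return shape is an assumption based on the common convention for placing a whole model on one device.

```python
import os
from typing import Mapping, Optional, Union


def sketch_device_map(
    device_map: Union[str, dict] = "sequential",
    env: Optional[Mapping[str, str]] = None,
) -> Union[str, dict]:
    """Return a per-rank device_map under torchrun, else the input unchanged."""
    env = os.environ if env is None else env
    try:
        world_size = int(env.get("WORLD_SIZE", "1"))
        local_rank = int(env.get("LOCAL_RANK", "-1"))
    except ValueError:
        return device_map  # malformed env vars: leave the default alone
    if world_size > 1 and local_rank >= 0:
        # Load the whole model directly onto this rank's own GPU.
        return {"": local_rank}
    return device_map
```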
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
* Refactor Ollama template wiring and harden packing helpers
Signed-off-by: Mohammad Miadh Angkad <MAngkad.BSDSBA2027@aim.edu>
* Fix Qwen3 and Gemma3n template bindings and tidy packing test helper
* Fix gptoss Ollama comment and tinyllama stop parameter
- Fix wrong comment referencing gemma3n for gptoss_ollama in chat_templates.py
- Add missing stop keyword to tinyllama PARAMETER in ollama_template_mappers.py
* Fix _DummyTrainer compatibility across TRL versions
The try/except only handled the removal of return_position_ids
(TRL v0.24+) but not the absence of padding_free (TRL v0.18.2).
Gracefully degrade through all optional collator flags so the
test works from trl>=0.18.2 through v0.27+.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: Mohammad Miadh Angkad <MAngkad.BSDSBA2027@aim.edu>
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* separate gguf
* fix Modelfile log
* ollama Modelfile create
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix GGUF file placement: move initial conversion to _gguf dir, fix cleanup
- Move initial GGUF files (from convert_to_gguf) into {model_directory}_gguf/
immediately after conversion, so all GGUF outputs live in the dedicated
directory regardless of quantization method (fixes bf16-only case where
quant == first_conversion skipped the loop and _gguf dir was never created)
- Remove redundant gguf_directory/makedirs from inside the re-quant loop
since the directory is now created before the loop
- Use Path.unlink(missing_ok=True) for base GGUF cleanup robustness
- Unify Modelfile location to {save_directory}_gguf/Modelfile for both
VLM and non-VLM models
- Fix print message to show actual modelfile_location path
- Add gguf_directory key to return dict
- Clean up {save_directory}_gguf in push_to_hub_gguf error/finally blocks
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
* Implement GGUF upload method for SentenceTransformer
Added a method to convert and upload SentenceTransformer models to GGUF format, including handling of tokenizer, quantization methods, and repository management on Hugging Face Hub.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
On Windows and macOS (Python 3.8+), multiprocessing uses the spawn
start method. When datasets .map(num_proc=N) is called, it creates a
Pool(N) which re-imports __main__ in each worker, causing infinite
recursion and a RuntimeError during bootstrapping.
Guard the auto-computed dataset_num_proc in the generated Config
__init__ by checking multiprocessing.get_start_method() != 'fork'.
When the start method is not fork (spawn/forkserver), force
dataset_num_proc = None so datasets takes the single-process path.
Linux fork behavior is unchanged.
Also replace the fixed memory threshold logic with the simpler
adaptive approach: cap at 64, then min(num_proc, int(available_gb)),
with a safety floor of 1 when available memory is at or below 2GB.
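The guard and the adaptive cap can be sketched as follows. `safe_dataset_num_proc` is a hypothetical name and the generated Config __init__ may order these checks differently; the constants (64, 2GB) come from the description above.

```python
import multiprocessing
from typing import Optional


def safe_dataset_num_proc(
    cpu_count: int,
    available_gb: float,
    start_method: Optional[str] = None,
) -> Optional[int]:
    """Pick a num_proc for datasets.map(), or None for the single-process path."""
    if start_method is None:
        # Note: this call fixes the default start method if none was set yet.
        start_method = multiprocessing.get_start_method()
    if start_method != "fork":
        # spawn/forkserver re-import __main__ in each worker, so stay single-process.
        return None
    if available_gb <= 2:
        return 1  # safety floor when memory is very low
    num_proc = min(cpu_count, 64)
    return max(1, min(num_proc, int(available_gb)))
```

Passing start_method explicitly makes the function testable on any platform.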
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
* Disable torchcodec in transformers when FFmpeg is missing
When torchcodec is installed but FFmpeg libraries are unavailable,
transformers still thinks torchcodec is available (via find_spec check)
and tries to use it for audio loading, causing RuntimeError.
This adds disable_torchcodec_if_broken() which tests if torchcodec can
actually load its native libraries, and if not, patches transformers'
_torchcodec_available to False so it falls back to librosa instead.
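A minimal sketch of an import probe of this kind. `native_import_works` is a hypothetical name; the real disable_torchcodec_if_broken() may probe more than a plain import, and the patching of transformers' flag is not shown here.

```python
import importlib


def native_import_works(module_name: str) -> bool:
    """Return True only if the module actually imports in this environment."""
    try:
        importlib.import_module(module_name)
    except Exception:
        # Broken native libraries can surface as RuntimeError or OSError,
        # not only ImportError, so catch broadly.
        return False
    return True
```

A find_spec check only proves that the package is installed, while a real import also exercises the shared libraries it depends on.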
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
The cuda.cutlass_epilogue_fusion_enabled and cuda.cutlass_tma_only
inductor config options were added in PyTorch 2.8.0. Using these
options on older PyTorch versions causes a RuntimeError during
GRPOTrainer initialization.
This fix adds a version check to only include these options when
running PyTorch 2.8.0 or later, allowing GRPO training to work on
older PyTorch versions (e.g., Colab environments with PyTorch 2.5-2.7).
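A sketch of the version gate. `cutlass_inductor_options` is a hypothetical helper, and the `True` values are placeholders since this message does not state which values are set; only the two option names and the 2.8.0 boundary come from the source.

```python
import re
from typing import Tuple


def _version_tuple(version: str) -> Tuple[int, ...]:
    """Parse 'X.Y.Z' (ignoring any '+local' suffix) into a 3-int tuple."""
    numbers = []
    for piece in version.split("+")[0].split(".")[:3]:
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group(0)))
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def cutlass_inductor_options(torch_version: str) -> dict:
    """Only include the CUTLASS options on PyTorch versions that define them."""
    options = {}
    if _version_tuple(torch_version) >= (2, 8, 0):
        options["cuda.cutlass_epilogue_fusion_enabled"] = True  # placeholder value
        options["cuda.cutlass_tma_only"] = True  # placeholder value
    return options
```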
Co-authored-by: Daniel Hanchen <danielhanchen@users.noreply.github.com>
When datasets library has torchcodec installed but FFmpeg libraries
are missing, torchcodec raises a RuntimeError during import. The
exception handler only caught ImportError and AttributeError, causing
the error to propagate and crash Unsloth imports in environments
like Colab where FFmpeg may not be installed.
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* Improve MoE performance
* small changes
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix imports
* disable autotune
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* LoRA for MoE
* Make autotune default
* make dy contiguous
* use non lora model as base for RL
* Revert "use non lora model as base for RL"
This reverts commit bc8f15629d060593b2eaf436f158ff5ac9df0d5d.
* fixup derp
* non TMA [T4]
* Revert "non TMA [T4]"
This reverts commit 35304566690e7c9ab9632899920c85bff322409a.
* Fixes for VL MoE and v5 transformers
* [transformers] [v5] remove unused hybridcache (#3910)
* remove unused hybridcache
* cleanup
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* No double compile for qwen3moe
* Fix top_k on trl GRPO
* Recognise GLM as MoE
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix missing RotaryEmbeddingConfigMixin
* Licensing for autotuning cache
* Cleanup
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Erland366 <erland.pg366@gmail.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
_patch_trl_rl_trainers enumerates all trainer modules from dir(trl.trainer)
and attempts to import each one. Modules like alignprop_trainer fail because
they depend on optional packages (diffusers) that may not be installed. The
failure is harmless but the print() call produces noise on every import.
Change print() to logger.info() so these messages only appear when
UNSLOTH_ENABLE_LOGGING=1.
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
GPT-OSS models use eager attention during inference because flex
attention returns incorrect results (likely due to left padding).
However, when _attn_implementation is set to "flex_attention",
transformers creates BlockMask objects which cause a TypeError
when passed to the eager attention path:
TypeError: unsupported operand type(s) for +=: 'Tensor' and 'BlockMask'
This fix excludes GPT-OSS from using flex_attention, keeping it on
the eager path to avoid the BlockMask/Tensor type mismatch.
* Enable flex attention by default
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Avoid dropping flex attention when SDPA unsupported
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Update rl_replacements.py
* Update rl_replacements.py
* Update rl.py
* Update rl_replacements.py
* Update rl_replacements.py
* Update rl.py
* Update rl.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update rl_replacements.py
* Update rl.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update rl_replacements.py, remove chat template from codexes commits
* Update rl.py, got rid of gradient checkpointing code that did not work
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix torchvision compatibility check for source builds and future torch versions
The torchvision version check raised a hard ImportError for custom/source-built
PyTorch installations (e.g. AMD ROCm from source with +git* suffixes), even when
the actual build was functional. This also silently skipped any torch version
not already in the hardcoded table, giving no warning at all for future releases.
Changes:
- Detect custom/source builds by checking the raw version string's local
identifier against known standard prefixes (cu, rocm, cpu, xpu). Our custom
Version() strips local identifiers via regex, so detection must happen on the
raw string before parsing.
- Downgrade to a warning (instead of ImportError) for custom/source builds,
since their version numbers may not follow standard PyPI release pairings.
- Add formula-based inference for future torch versions not yet in the table.
The torch->torchvision minor version formula (torch 2.x -> tv 0.(x+15)) has
held for every release from torch 2.0 through 2.9. For formula-predicted
versions, mismatches produce a warning rather than a hard error.
- Add UNSLOTH_SKIP_TORCHVISION_CHECK=1 env var to skip the check entirely.
- Wrap importlib_version and Version calls in try/except so broken metadata
never crashes the import.
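The minor-version formula can be sketched as below. The function names are hypothetical, and the sketch only covers the formula-based inference, not the hardcoded table or the warning/error policy.

```python
import re
from typing import Optional


def expected_torchvision_minor(torch_version: str) -> Optional[int]:
    """torch 2.x pairs with torchvision 0.(x+15); None outside the 2.x series."""
    match = re.match(r"(\d+)\.(\d+)", torch_version)
    if match is None or int(match.group(1)) != 2:
        return None
    return int(match.group(2)) + 15


def torchvision_matches(torch_version: str, torchvision_version: str) -> Optional[bool]:
    """True/False when both versions parse, None when the formula does not apply."""
    expected = expected_torchvision_minor(torch_version)
    match = re.match(r"(\d+)\.(\d+)", torchvision_version)
    if expected is None or match is None:
        return None
    return (int(match.group(1)), int(match.group(2))) == (0, expected)
```

Returning None rather than False for unparseable or non-2.x versions lets the caller skip the check instead of reporting a mismatch.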
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address review: stricter regex, case insensitivity, pre-release detection
Fixes three edge cases found during review:
1. Regex precision: cu/rocm now require a trailing digit (cu\d, rocm\d) to
avoid false negatives on suffixes like "+custom_build" that happen to
start with "cu". cpu/xpu match as exact strings only.
2. Case insensitivity: added re.IGNORECASE so "+ROCM6.3" and "+CPU" are
correctly recognized as standard builds rather than custom ones.
3. Pre-release detection: nightly/dev/alpha/beta/rc builds with standard
CUDA/ROCm suffixes (e.g. "2.7.0.dev20250301+cu124") now produce a
warning instead of a hard ImportError. These builds commonly have
version mismatches that are expected during development.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address PR review comments: fullmatch, env var casing, torchvision pre-release
1. Switch re.match to re.fullmatch for the custom build regex so the
entire local identifier must match. Fixes false negatives where
suffixes like +cu124_custom were misclassified as standard because
re.match only checked the start of the string.
2. Use .lower() for the UNSLOTH_SKIP_TORCHVISION_CHECK env var so
any casing of "true" / "TRUE" / etc. is accepted.
3. Check torchvision_version_raw for pre-release tags in addition to
torch_version_raw, so a stable torch paired with a nightly
torchvision (e.g. 0.23.0.dev...) also gets a warning instead of
a hard ImportError.
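A sketch of the detection logic after these review rounds. The pattern and function names are assumptions written to match the behaviour described in this message, not the exact regex in the code.

```python
import re

# Assumed pattern: cu and rocm need digits, cpu and xpu match exactly.
_STANDARD_LOCAL = re.compile(r"cu\d+|rocm\d+(\.\d+)*|cpu|xpu", re.IGNORECASE)
_PRERELEASE = re.compile(r"dev|alpha|beta|rc|a\d|b\d", re.IGNORECASE)


def is_custom_build(raw_version: str) -> bool:
    """True when the '+local' identifier is not a known standard suffix."""
    if "+" not in raw_version:
        return False
    local = raw_version.split("+", 1)[1]
    # fullmatch: the whole identifier must match, so '+cu124_custom' is custom.
    return _STANDARD_LOCAL.fullmatch(local) is None


def is_prerelease(raw_version: str) -> bool:
    """True for nightly/dev/alpha/beta/rc tags in the public version part."""
    public = raw_version.split("+", 1)[0]
    return _PRERELEASE.search(public) is not None
```

Both checks run on the raw version string, before any parsing that strips the local identifier.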
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
vLLM's distributed module (device_communicators) crashes with std::bad_alloc
when imported on SM100 GPUs (B200/B100/Blackwell) with torch < 2.9.0.
This adds an early check that runs before vLLM is imported, providing a
helpful error message instead of a cryptic C++ exception.
The check:
1. Detects if vLLM is installed
2. Checks if torch version is < 2.9.0
3. Checks if any GPU is SM100 (Blackwell)
4. If all conditions met, raises RuntimeError with clear upgrade instructions
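A minimal sketch of the four-step check, written as pure functions so it can be read without a GPU. The function names are hypothetical; in real code the inputs would presumably come from `importlib.util.find_spec("vllm")`, `torch.__version__`, and `torch.cuda.get_device_capability(i)` for each device. Treating compute capability major version 10 as SM100 is an assumption of this sketch.

```python
from typing import Iterable, Tuple


def blackwell_vllm_conflict(
    vllm_installed: bool,
    torch_version: Tuple[int, int, int],
    gpu_capabilities: Iterable[Tuple[int, int]],
) -> bool:
    """True when all three conditions from the list above hold."""
    if not vllm_installed:
        return False
    if torch_version >= (2, 9, 0):
        return False
    # Assumption: SM100 devices report compute capability major version 10.
    return any(major == 10 for major, _minor in gpu_capabilities)


def check_before_vllm_import(vllm_installed, torch_version, gpu_capabilities):
    """Raise a readable error instead of letting the C++ crash happen later."""
    if blackwell_vllm_conflict(vllm_installed, torch_version, gpu_capabilities):
        raise RuntimeError(
            "vLLM with torch < 2.9.0 is expected to crash on SM100 (Blackwell) "
            "GPUs. Please upgrade torch to 2.9.0 or newer."
        )
```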
* Add TRL truncation regression and metadata loss fixes
Fix 1: TRL 0.24.0-0.25.1 right-truncation regression
- These versions pass max_length=self.max_prompt_length and truncation=True
to the tokenizer, which right-truncates prompts and strips the assistant
turn suffix
- Use regex to remove these kwargs from the generated code
Fix 3: Metadata loss for chat_template_kwargs
- TRL 0.24.0+ extracts prompts = [x["prompt"] for x in inputs], losing metadata
like reasoning_effort
- Inject code to store per-sample chat_template_kwargs on self before extraction
- Preserve these kwargs in prompts_text generation for all TRL versions
Tested with TRL versions 0.22.2, 0.23.1, 0.24.0, 0.25.1, 0.26.2, and 0.27.1.
* Update Fix 1 comment with detailed TRL version behavior explanation
Expand the comment for the TRL 0.24.0-0.25.1 truncation regression fix
to clarify what each TRL version does:
- TRL 0.22.2-0.23.1: Uses truncate_with_protected_tokens() for smart
truncation that preserves rightmost tokens and protects special tokens
- TRL 0.24.0-0.25.1: Removed smart truncation, passes kwargs directly
to tokenizer (max_length, truncation=True, add_special_tokens=False)
- TRL 0.26.2+: Removed these kwargs entirely
The fix removes these problematic kwargs so 0.24.0-0.25.1 behaves like
0.26.2+ (no tokenizer-level truncation).
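The regex-based removal might look like the sketch below. The source snippet is a made-up stand-in for the generated trainer code, and `strip_truncation_kwargs` is a hypothetical name; the actual patterns in the patch may differ.

```python
import re


def strip_truncation_kwargs(source: str) -> str:
    """Drop the max_length / truncation kwargs from a tokenizer call in source text."""
    source = re.sub(r"\s*max_length\s*=\s*self\.max_prompt_length\s*,", "", source)
    source = re.sub(r"\s*truncation\s*=\s*True\s*,", "", source)
    return source


# Illustrative input only; not copied from TRL.
snippet = (
    "self.processing_class(\n"
    "    text=prompts_text,\n"
    "    max_length=self.max_prompt_length,\n"
    "    truncation=True,\n"
    "    add_special_tokens=False,\n"
    ")"
)
cleaned = strip_truncation_kwargs(snippet)
```

When neither kwarg is present the substitutions are no-ops, which is what lets the same patch run against TRL versions that never passed them.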
---------
Co-authored-by: danielhanchen <danielhanchen@users.noreply.github.com>
When users pass `num_train_epochs=None` to GRPOConfig (relying on
max_steps to control training duration), Trainer.__init__ fails with:
TypeError: '>' not supported between instances of 'NoneType' and 'int'
This happens because transformers.Trainer does `args.num_train_epochs > 0`
in its __init__ which fails when the value is None.
This fix converts None to 3.0 (the default) before Trainer initialization.
The actual training duration is still controlled by max_steps since it
takes precedence when both are set.
Example that now works:
```python
config = GRPOConfig(
num_train_epochs=None, # Previously caused TypeError
max_steps=500, # This controls actual duration
...
)
```
* [fix] Vision GRPO string prompts and OpenEnv async compatibility
- Guard prepare_multimodal_messages in GRPO trainer to skip processing
when prompts are pre-templated strings. Notebooks that pre-apply
apply_chat_template() produce strings with image tokens already
embedded; calling prepare_multimodal_messages on those crashes with
TypeError.
- Apply nest_asyncio when OpenEnv EnvClient exposes async reset/step,
so scripts using run_until_complete() wrappers work in all contexts.
- Add wrapper to call patch_torchcodec_audio_decoder() from unsloth_zoo
for AudioDecoder dict-compatibility.
* Add apply_chat_template guard for pre-templated string prompts in Vision GRPO
When notebooks pre-apply apply_chat_template, prompts become strings.
The existing guard skips prepare_multimodal_messages for strings. This
adds a second guard to skip apply_chat_template in the forward_kwargs
block, using prompts directly as prompts_text instead. Covers both
TRL 0.25.x (no tools param) and TRL 0.26.2+ (with tools=self.tools).
Non-matching replacements silently pass for older TRL versions.
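The guard amounts to "use strings as-is, template everything else". A minimal sketch, where `render` is a hypothetical stand-in for the real apply_chat_template call:

```python
from typing import Callable, List, Union

Prompt = Union[str, List[dict]]


def build_prompts_text(
    prompts: List[Prompt],
    render: Callable[[List[dict]], str],
) -> List[str]:
    """Pass pre-templated strings through unchanged; render conversational prompts."""
    return [p if isinstance(p, str) else render(p) for p in prompts]
```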
* Add TRL 0.25.1 single-line variant for apply_chat_template guard
TRL 0.25.1 uses single-line formatting for apply_chat_template:
apply_chat_template({"prompt": prompt}, ...)["prompt"]
While TRL 0.26.2+ uses multi-line formatting:
apply_chat_template(
{"prompt": prompt}, ...
)["prompt"]
Add both variants to ensure full backwards compatibility.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: danielhanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix TRL 0.27.0 GRPO compatibility and PEFT model handling
- Remove use_reentrant=False from gradient_checkpointing_kwargs for TRL 0.27.0+
TRL 0.27.0 auto-sets use_reentrant=False in GRPOConfig.__post_init__, but
Unsloth gradient checkpointing requires use_reentrant=True. This adds a
post-init cleanup that removes the setting when present.
- Handle prepare_peft_model standalone function pattern for TRL 0.22.0+
TRL changed from self._prepare_peft_model() method to prepare_peft_model()
standalone function. Both patterns are now bypassed to let Unsloth handle
PEFT model preparation.
Tested with TRL versions 0.22.2, 0.23.1, 0.24.0, 0.25.1, 0.26.2, and 0.27.1.
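The use_reentrant cleanup can be pictured as a small dict filter. This is a sketch under the assumption that the kwargs are a plain dict; `drop_use_reentrant_false` is a hypothetical name.

```python
from typing import Optional


def drop_use_reentrant_false(gc_kwargs: Optional[dict]) -> Optional[dict]:
    """Remove an auto-set use_reentrant=False so the caller's default applies."""
    if gc_kwargs and gc_kwargs.get("use_reentrant") is False:
        # Return a copy so the original config dict is not mutated.
        return {k: v for k, v in gc_kwargs.items() if k != "use_reentrant"}
    return gc_kwargs
```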
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: danielhanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* reduce code duplication
* address reviewer feedback: keep original function name
- Keep original function name `_offload_frozen_module_for_training`
- Make `offload_device` parameter Optional (can be None)
- Keep original error handling (return None for missing modules_to_save)
- Maintain code deduplication by reusing the helper function
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Use standard gradient checkpointing for small sequence lengths
When max_seq_length < 512, the overhead of gradient offloading in
gc="unsloth" mode is not worth it. Benchmarks on B200 show:
| seq_len | gc=unsloth | gc=True    | Difference |
|---------|------------|------------|------------|
| 256     | 6,803 t/s  | 6,993 t/s  | +2.8%      |
| 384     | 9,889 t/s  | 9,963 t/s  | +0.7%      |
| 512     | 13,151 t/s | 13,092 t/s | -0.4%      |
| 1024    | 26,662 t/s | 25,094 t/s | -5.9%      |
The crossover point is around seq_len 384-512. For sequences shorter
than 512, we now automatically use standard gradient checkpointing
instead of the custom offloading implementation.
Additionally, when user explicitly sets use_gradient_checkpointing to
True or False in get_peft_model, it now correctly overrides any
previous "unsloth" patching from from_pretrained. This ensures
consistent behavior regardless of the order of function calls.
Updated in three locations:
- FastLlamaModel.get_peft_model (llama.py)
- FastLanguageModel.from_pretrained (loader.py)
- FastModel.from_pretrained (loader.py)
* Refactor: extract gradient checkpointing heuristic into utility function
Addresses code review feedback to reduce duplication. The gradient
checkpointing heuristic logic was duplicated in 3 places:
- FastLlamaModel.get_peft_model (llama.py)
- FastLanguageModel.from_pretrained (loader.py)
- FastModel.from_pretrained (loader.py)
Created apply_unsloth_gradient_checkpointing() utility function in
_utils.py that handles:
- Heuristic: seq < 512 falls back to standard gc
- Explicit True/False overrides unpatch previous patching
- Returns the effective use_gradient_checkpointing value
Net reduction of ~6 lines while improving maintainability.
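The selection rule itself is small. The sketch below covers only the value-selection part (not the patching and unpatching), and `effective_gradient_checkpointing` is a hypothetical name rather than the utility added in _utils.py.

```python
def effective_gradient_checkpointing(use_gradient_checkpointing, max_seq_length: int):
    """Return the gradient checkpointing mode that should actually be used."""
    # Short sequences: offloading overhead outweighs the benefit, so fall
    # back to standard gradient checkpointing.
    if use_gradient_checkpointing == "unsloth" and max_seq_length < 512:
        return True
    # Explicit True / False (or "unsloth" with long sequences) is kept as-is.
    return use_gradient_checkpointing
```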
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: danielhanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix for intel devices
* Refactor torch_compile_options to use base options with device-specific extensions
- Extract common options into base_options shared by all device types
- CUDA devices get additional CUDA-specific options
- XPU, HIP, and other devices use base options only
- Reduces code duplication and improves maintainability
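The base-plus-extension layout could look like this. The option keys are illustrative inductor settings chosen for the sketch, not the list used in the patch, and `build_compile_options` is a hypothetical name.

```python
def build_compile_options(device_type: str) -> dict:
    """Shared options for every device, plus CUDA-only extras."""
    # Illustrative keys only; the real option set may differ.
    base_options = {
        "epilogue_fusion": True,
        "max_autotune": False,
    }
    if device_type == "cuda":
        return {**base_options, "triton.cudagraphs": False}
    # XPU, HIP, and other devices use the base options unchanged.
    return dict(base_options)
```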
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: danielhanchen <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix for qwen3-guard tokenizer
* Better qwen3guard check
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
backend restructuring and housekeeping
Changes made:
- Moved all files from backend/backend/ → backend/core/ with nested subdirectories
- Created __init__.py for each submodule with proper exports
- Updated all imports in routes (routes/training.py, routes/models.py)
- Updated internal relative imports to use .. for parent references
- Deleted old backend/backend/ directory
- Moved shared modules (path_utils.py, model_config.py) to utils/ subfolder
* [transformers] [v5] remove unused hybridcache (#3910)
* remove unused hybridcache
* cleanup
* Fix top_k on trl GRPO
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add torch compile options for GRPOTrainer
* Update CUDA settings based on device capability
* Add triton persistent TMA matmul condition
* Fix syntax for triton.enable_persistent_tma_matmul
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update rl.py
* Update rl.py
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Guard torch.compile on ROCm when triton_key missing
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update unsloth/import_fixes.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Tighten ROCm Triton import handling
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Rachel Li <rachelliqx07@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* add FastSentenceTransformer
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Gemini code review suggestions
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* unsloth-zoo patch only fixed usage for XLMRobertaForMaskedLM, this is a fix for XLMRobertaModel
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* refactor do_lower_case
* add some comments
* force disable FP8 loading
* refactor pooling detection, add missing pooling types
* add save_pretrained_merged method which gets modules and config
* fix _save_pretrained_merged
* rename read_pooling_mode, load modules instead of hard-coding them
* comment
* revert save_pretrained_merged change
* propagate trust_remote_code properly
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add super hacky mpnet patch from hell
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* refactor _load_modules, add for_inference to from_pretrained, add transformers 5 code for mpnet, add distilbert patches
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add ModernBert
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* deberta-v2 support (provisional), fix remote_code
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add generic add_pooling_layer logic
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix for missing config
* add push_to_hub_merged
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* edit messages, throw exception if no HF token
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix device_map mismatch
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add comments, move import, other suggestions by Datta0
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* re-add adapter removal to save_pretrained_merged, but if saving to folder which had adapters before, leave them
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add unsloth branding to save_pretrained_merged
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* propagate dtype to internal module when loading for inference
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix mpnet gradient checkpointing for torch >= 2.9
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* same thing for transformers 5, oops =)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix FastSentenceTransformer performance: 6x speedup via torch.compile + SDPA
The original implementation was 31% slower than naive SentenceTransformer due to
conflicting decorators from Unsloth's auto-compiler (@torch.compile on attention
modules but @torch.compiler.disable on sub-modules).
Changes:
- Add fast encoder path that bypasses Unsloth patching for encoder models
- Use native torch.compile with mode="reduce-overhead" for 6x speedup
- Auto-detect and enable SDPA for models that support it (BERT, RoBERTa, etc.)
- Change defaults: load_in_16bit=True, load_in_4bit=False (16-bit is optimal)
- Change default: use_gradient_checkpointing=False (conflicts with torch.compile)
- Add UNSLOTH_COMPILE_DISABLE=1 env var to fall back to old path if needed
Supported encoder types: mpnet, bert, distilbert, roberta, xlm-roberta, albert, electra
Benchmark results (BS=32, seq_len=128):
- Naive 16-bit LoRA: 13-50ms per iter
- Unsloth 16-bit LoRA: 2-9ms per iter (5.4x-6.7x faster)
- Memory usage: 61MB-1.3GB (even largest model fits easily)
Note: 4-bit + torch.compile has a PyTorch bug (pytorch/pytorch#90665).
4-bit is also 1.7-1.9x slower than 16-bit due to dequantization overhead,
so 16-bit is recommended for these small encoder models anyway.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use Unsloth's prepare_model_for_kbit_training for consistency
Changed from peft.prepare_model_for_kbit_training to
unsloth.models._utils.prepare_model_for_kbit_training.
Unsloth's version provides:
- Float32 mixed precision upcasting for LoRA layers
- Better numerical stability
- Consistency with rest of Unsloth codebase
* Use relative imports and add float16 machine support
- Changed absolute import to relative: from ._utils import prepare_model_for_kbit_training
- Added SUPPORTS_BFLOAT16 import for proper dtype detection
- Handle devices that don't support bfloat16 by falling back to float16
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add save_pretrained_torchao
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add auto-compile for torch.compile based on training step breakeven analysis
Changes:
- Change default compile_mode from "reduce-overhead" to "default" since CUDA
Graphs (used by reduce-overhead) are incompatible with PEFT/LoRA
- Add _estimate_compile_threshold() to calculate minimum steps needed for
torch.compile to be beneficial based on model parameter count
- Add _apply_torch_compile() helper with accelerate unwrap_model bug workaround
- Defer torch.compile application to trainer initialization time so we can
check max_steps against the breakeven threshold
- Patch SentenceTransformerTrainer to auto-apply compile when max_steps
exceeds the calculated threshold
Breakeven thresholds (with 1.2x safety margin):
- 22M params (MiniLM): ~1388 steps
- 110M params (mpnet): ~242 steps
- 335M params (snowflake): ~203 steps
This ensures torch.compile warmup cost is only paid when training is long
enough to benefit from the speedup.
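The breakeven idea can be illustrated with a generic formula: compile only if the one-time warmup cost is recovered by per-step savings, with a safety margin. This is a sketch under stated assumptions; the patch estimates the threshold from parameter count, whereas the timing inputs and function names here are hypothetical.

```python
import math
from typing import Optional


def breakeven_steps(
    compile_seconds: float,
    eager_step_seconds: float,
    compiled_step_seconds: float,
    safety_margin: float = 1.2,
) -> Optional[int]:
    """Steps needed before compilation pays for itself, or None if it never does."""
    saving = eager_step_seconds - compiled_step_seconds
    if saving <= 0:
        return None
    return math.ceil(safety_margin * compile_seconds / saving)


def should_compile(max_steps: int, threshold: Optional[int]) -> bool:
    """Apply torch.compile only when training runs past the breakeven point."""
    return threshold is not None and max_steps > threshold
```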
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* do QAT preparation for fast path
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix double loading model, thanks Etherl
* do mpnet gradient checkpoint patch if gc is enabled
* remove distilbert patches from mpnet fix
* sanity check on model params, thanks Etherl
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add save_pretrained_gguf, thanks Etherl
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Refine compile threshold estimation for sentence transformers
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* add int8 weight-only QAT scheme, add test, fix tests for current torchao version
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* change quantization to PerAxis
* lambda =/
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add torchao messages, remove group_size from int8
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* raise exception on missing torchao
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* touch up the torchao imports
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
ModulesToSaveWrapper was removed from peft.tuners.tuners_utils in PEFT
0.16.0. The class has been available in peft.utils.other since at least
PEFT 0.7.1, which is the minimum version Unsloth requires.
This fixes the ImportError when using PEFT >= 0.16.0.
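The fix itself is a changed import path. For code that must tolerate several library layouts, a generic fallback helper is one option; `import_first` below is a hypothetical helper, not part of the patch.

```python
import importlib
from typing import Sequence


def import_first(attr: str, module_names: Sequence[str]):
    """Return `attr` from the first listed module that provides it."""
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue
        if hasattr(module, attr):
            return getattr(module, attr)
    raise ImportError(f"{attr} not found in any of: {', '.join(module_names)}")


# Example call (requires peft to be installed):
# ModulesToSaveWrapper = import_first(
#     "ModulesToSaveWrapper", ["peft.utils.other", "peft.tuners.tuners_utils"]
# )
```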
- Fix Kaggle misclassification by prioritizing filesystem markers over env vars
- Preserve telemetry pings when statistics is explicitly provided
- Replace bare except with except Exception
- Minor cleanup based on automated review feedback
Fixed Codex regression: keep snapshot_download pings for explicit statistics values; detection only runs when statistics is None. Also replaced bare except.
Problem: Kaggle notebook environments can expose both KAGGLE_* and COLAB_* environment keys. _get_statistics currently checks COLAB_ before KAGGLE_, causing Kaggle sessions to be labeled colab/colabpro.
Prefer filesystem markers (e.g. /kaggle/working, /content + /opt/colab) before env-key heuristics, then fall back to the existing env-key checks. This avoids misclassification when providers leak overlapping env vars.
Kaggle test notebook: https://www.kaggle.com/code/hnxnq07/kaggle-stats-gathering-test
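A rough sketch of the detection order described above: filesystem markers first, env-key heuristics as a fallback. The function name and return labels are assumptions; `path_exists` and `environ` are injectable so the logic can be exercised outside a notebook.

```python
import os
from typing import Callable, Mapping, Optional


def detect_notebook_provider(
    path_exists: Callable[[str], bool] = os.path.exists,
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Classify the environment, preferring filesystem markers over env keys."""
    environ = os.environ if environ is None else environ
    # 1. Filesystem markers are harder to leak across providers.
    if path_exists("/kaggle/working"):
        return "kaggle"
    if path_exists("/content") and path_exists("/opt/colab"):
        return "colab"
    # 2. Fall back to env-key checks, keeping the pre-existing order.
    if any(key.startswith("COLAB_") for key in environ):
        return "colab"
    if any(key.startswith("KAGGLE_") for key in environ):
        return "kaggle"
    return "other"
```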
- Fix test file: use return_tokenized instead of return_tensors
- Fix test file: use text_dataset instead of undefined dataset variable
- Move parameter validation to constructor (fail fast on invalid params)
- Add labels field in tokenized output for causal LM training
- Add empty file handling with clear error message
- Add tests for constructor validation and labels field
This PR fixes the "Arch conditional MMA instruction used without targeting
appropriate compute capability. Aborting." errors that occur when using
FBGEMM on Blackwell GPUs (B200/B100, SM100).
Changes:
- Add stderr filters in import_fixes.py for CUTLASS/FBGEMM MMA errors
- Add warning filters for various deprecation messages
- Update check_fbgemm_gpu_version() to disable FBGEMM instead of raising
an error when old versions are detected
- Update test_has_fbgemm() in fp8.py to catch broader CUTLASS/CUDA errors
and gracefully fall back to Triton kernels
- Update loader_utils.py to disable FBGEMM instead of raising ValueError
for old fbgemm_gpu versions
The key behavior change is that FBGEMM errors no longer crash the script.
Instead, FBGEMM is disabled and Triton kernels are used automatically.
This allows Unsloth to work on SM100 GPUs where CUTLASS SM90 kernels fail,
and also gracefully handles old FBGEMM versions.
The GitHub issue check had two problems:
1. Network latency on import
2. Issue being closed does not mean the fix is in the installed vLLM version
Now skip the PDL workaround if vLLM version > 0.13.2, which is when
the upstream fix is expected to be included.
- Patch vllm.lora.ops.triton_ops.utils directly where supports_pdl is defined
- Clear lru_cache before patching to prevent stale cached results
- Add fused_moe_lora_op to consumer modules list
- Use *args, **kwargs in fake function for compatibility
- Add _spec_exists helper function to reduce duplication
- Scan all GPUs for SM100 instead of just device 0
- Use loop for module patching to improve maintainability
When using base models with custom chat templates applied after loading,
vLLM's internal tokenizer may not have the chat_template set. This causes
issues during RL training with vLLM inference.
This fix syncs the chat_template from the processing_class (the tokenizer
you loaded and configured) to vLLM's internal tokenizer during trainer
initialization, but only if vLLM's tokenizer does not already have one set.
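A minimal sketch of the sync rule, assuming both tokenizers expose a `chat_template` attribute; `sync_chat_template` is a hypothetical name.

```python
from types import SimpleNamespace


def sync_chat_template(source_tokenizer, vllm_tokenizer) -> bool:
    """Copy chat_template to vLLM's tokenizer only if it has none. Returns True if copied."""
    template = getattr(source_tokenizer, "chat_template", None)
    if template is None:
        return False
    if getattr(vllm_tokenizer, "chat_template", None) is not None:
        # Never overwrite a template vLLM already has.
        return False
    vllm_tokenizer.chat_template = template
    return True


# Stand-in objects for illustration.
configured = SimpleNamespace(chat_template="{{ messages }}")
internal = SimpleNamespace(chat_template=None)
copied = sync_chat_template(configured, internal)
```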
vLLM's LoRA Triton kernels use tl.extra.cuda.gdc_wait() for PDL
optimization on SM90+ GPUs. This fails on SM100 (Blackwell) during
CUDA graph capture because Triton's pipeliner cannot handle gdc_wait
in complex kernels.
This fix:
- Detects SM100 GPUs and applies the workaround automatically
- Sets TRITON_DISABLE_PDL=1 environment variable
- Monkey-patches supports_pdl to return False in lora_expand_op and
lora_shrink_op
- Checks GitHub issue #30872 status (with 3s timeout) to auto-disable
the workaround once the upstream fix is merged
- Includes quick internet connectivity check (0.5s) to avoid delays
when offline
Fixes the error:
'tt.elementwise_inline_asm' op pipeliner doesn't know how to predicate this op
LLVM ERROR: Fatal pipeliner error
See: https://github.com/vllm-project/vllm/issues/30872
When users load a model with fast_inference=False but then try to use
vLLM-style arguments with fast_generate, they previously got confusing
errors. This adds a wrapper that detects common mistakes and provides
helpful guidance:
- Using sampling_params: explains to use HF generate args instead
- Using lora_request: explains LoRA weights are already merged
- Passing text strings: shows how to tokenize input first
Changes:
- Add make_fast_generate_wrapper to _utils.py
- Apply wrapper in llama.py when fast_inference=False
- Apply wrapper in vision.py when fast_inference=False
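The wrapper idea, sketched with hypothetical names and messages (the real wording and checks in make_fast_generate_wrapper may differ):

```python
import functools


def make_generate_guard(generate):
    """Wrap an HF-style generate function and reject vLLM-style arguments early."""

    @functools.wraps(generate)
    def wrapper(*args, **kwargs):
        if "sampling_params" in kwargs:
            raise ValueError(
                "sampling_params is a vLLM argument; pass HF generate arguments "
                "such as temperature or max_new_tokens instead."
            )
        if "lora_request" in kwargs:
            raise ValueError(
                "lora_request is a vLLM argument and is not needed when "
                "fast_inference=False."
            )
        first = args[0] if args else None
        is_text = isinstance(first, str) or (
            isinstance(first, (list, tuple))
            and len(first) > 0
            and all(isinstance(item, str) for item in first)
        )
        if is_text:
            raise ValueError(
                "Text input is not supported here; tokenize it first and pass input_ids."
            )
        return generate(*args, **kwargs)

    return wrapper
```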
Gemma3 models have a large vocabulary (262144 tokens) which causes
training loss to explode when using int8 embedding quantization.
This fix auto-detects Gemma3 models and switches from int8-int4
(phone-deployment) to int4 weight-only QAT for stable training.
1. cohere.py:347-348 - Fixed wrong variable names in QK normalization.
Used `Q`/`K` but variables were named `Qn`/`Kn`. This caused NameError
when `use_qk_norm=True` (e.g., c4ai-command-r-plus models).
2. cohere.py:482 - Fixed wrong object reference in inference loop.
Used `self.mlp` but should be `decoder_layer.mlp` since we're
iterating through decoder layers. Caused AttributeError during inference.
3. falcon_h1.py:459,461 - Fixed wrong attribute names in inference path.
Used `post_attention_layernorm` and `mlp` but Falcon H1 uses
`pre_ff_layernorm` and `feed_forward`. Caused AttributeError during generation.
4. qwen3_moe.py:210 - Fixed wrong module path with incorrect capitalization.
Used `transformers.models.Qwen3Moe` but should be `transformers.models.qwen3_moe`.
Caused AttributeError when patching rotary embeddings.
5. qwen3_moe.py:239 - Fixed wrong model_patcher class.
Used `FastQwen3Model` but should be `FastQwen3MoeModel` for MoE models.
Caused incorrect patching for Qwen3 MoE models.
6. hf_hub.py:21-22 - Fixed floor division and missing return for billion values.
Used `//` instead of `/` for millions, and had no return for values >= 1B.
Caused incorrect formatting and None return for large numbers.
7. save.py:550 - Fixed self-assignment that did nothing.
`sharded_ram_usage = sharded_ram_usage` should be `= max_shard_size`.
Caused integer shard sizes to be ignored.
8. rl.py:562-567 - Fixed orphan string not included in length_check.
The elif branch for max_seq_length validation was a standalone string
expression, not concatenated to length_check. Caused silent skip of
the max_seq_length > model_max_seq_length warning.
9. granite.py:49-52 - Fixed wrong model name and version in error message.
Said "Gemma2" and "4.42.3" but should be "Granite" and "4.45.0".
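As an illustration of item 6, a count formatter that uses true division and returns for every range might look like this. The output format (one decimal, K/M/B suffixes) and the name `format_count` are assumptions, not the code in hf_hub.py.

```python
def format_count(value: int) -> str:
    """Format a count with a K/M/B suffix; every branch returns a string."""
    if value < 1_000:
        return str(value)
    if value < 1_000_000:
        return f"{value / 1_000:.1f}K"
    if value < 1_000_000_000:
        # True division keeps the fractional part; floor division would drop it.
        return f"{value / 1_000_000:.1f}M"
    # Without this final return, values >= 1B would yield None.
    return f"{value / 1_000_000_000:.1f}B"
```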
* Fix correctness bugs in rl.py, rl_replacements.py, and vision.py
1. rl_replacements.py (lines 864, 870): Fixed undefined `nanmin`/`nanmax`
functions by using `.nan_to_num(nan=inf/-inf).min()/.max()` pattern.
PyTorch doesn't have torch.nanmin/nanmax, so we replace NaN values
before computing min/max.
2. vision.py (line 150): Fixed bug where code checked for "input" key
but then accessed kwargs["input_ids"] instead of kwargs["input"].
3. vision.py (line 159): Fixed bug where literal string "key" was used
instead of the variable `key` when accessing kwargs.
4. rl.py (lines 903, 905): Fixed non-existent `MathError` exception
by replacing with `ValueError`.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Add "corda" as an allowed value for the init_lora_weights parameter
in FastLanguageModel.get_peft_model() and FastBaseModel.get_peft_model().
This enables users to use CorDA (Correlation-aware Decomposed Adaptation)
initialization from PEFT, which provides an alternative LoRA initialization
strategy for improved finetuning performance.
Fixes #3693
Signed-off-by: majiayu000 <1835304752@qq.com>
* Fix is_contiguous() method call and remove duplicate imports
- Fix bug in rope_embedding.py where is_contiguous was used without
parentheses, causing the method object (always truthy) to be evaluated
instead of calling the method. This fixes issue #3781 where fast rope
backpropagation was broken for zero strided/non-contiguous tensors.
- Remove duplicate `import torch` in rl.py (lines 20 and 25)
- Remove duplicate `import functools` and `import types` in vision.py
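The is_contiguous bug is easy to reproduce without torch. `FakeTensor` below is a stand-in class used only for illustration.

```python
class FakeTensor:
    """Minimal stand-in exposing an is_contiguous() method."""

    def __init__(self, contiguous: bool):
        self._contiguous = contiguous

    def is_contiguous(self) -> bool:
        return self._contiguous


t = FakeTensor(contiguous=False)

# Bug: a bound method object is always truthy, whatever the tensor layout.
buggy = bool(t.is_contiguous)

# Fix: call the method to get the actual answer.
fixed = bool(t.is_contiguous())
```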
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Fix Boolean value of Tensor ambiguity error in mistral.py
Replace `or` operator with explicit `is None` check when getting
n_items from kwargs. The `or` operator fails when the value is a
Tensor because Python cannot determine the boolean value of a
multi-element tensor.
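A small reproduction of the pattern, using a stand-in class instead of a real tensor. The kwarg names `num_items_in_batch` and `n_items` are assumptions made for the sketch.

```python
class AmbiguousValue:
    """Stand-in for a multi-element tensor whose truth value is undefined."""

    def __bool__(self):
        raise RuntimeError("Boolean value of Tensor with more than one value is ambiguous")


def get_n_items(kwargs: dict):
    """Look up n_items without ever evaluating the value's truthiness."""
    n_items = kwargs.get("num_items_in_batch", None)
    if n_items is None:
        n_items = kwargs.get("n_items", None)
    return n_items
```

The previous form, `a or b`, calls `bool(a)` to decide which operand to return, which is what raises for a multi-element tensor.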
Fixes #3766
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* Update rope_embedding.py
---------
Co-authored-by: yurekami <yurekami@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Guard optional trl.experimental.openenv usage in RL patches
* Simplify optional trl.openenv import handling
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(trainer): import psutil to prevent NameError in _prepare_dataset
Fixes #3777
* Update rl.py
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Use regex to dynamically detect and preserve the original indentation
when replacing the 'return output' statement, instead of hardcoding
spaces. This ensures the patched code maintains consistent indentation
regardless of the original formatting.
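One way to do this is to capture the leading whitespace and reuse it for every replacement line. This is a sketch; `replace_return_output` is a hypothetical name and the replacement body is illustrative.

```python
import re


def replace_return_output(source: str, replacement_body: str) -> str:
    """Replace each 'return output' line, re-indenting the replacement to match."""

    def _substitute(match):
        indent = match.group(1)
        return "\n".join(indent + line for line in replacement_body.splitlines())

    # MULTILINE lets ^ and $ anchor at each line; group 1 is the indentation.
    return re.sub(
        r"^([ \t]*)return output[ \t]*$",
        _substitute,
        source,
        flags=re.MULTILINE,
    )
```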
Store the model's training state before generation and restore inference
mode after completion if the model wasn't originally in training mode.
This ensures the model returns to the correct state after generate and
score operations.
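The save-and-restore pattern can be expressed as a context manager. The sketch assumes an nn.Module-like object with a `training` flag and an `eval()` method; `restore_inference_mode` and `StubModel` are hypothetical names.

```python
from contextlib import contextmanager


@contextmanager
def restore_inference_mode(model):
    """Put the model back in eval mode afterwards if it was not training before."""
    was_training = model.training
    try:
        yield model
    finally:
        if not was_training:
            model.eval()


class StubModel:
    """Tiny stand-in with the same training/eval interface as nn.Module."""

    def __init__(self):
        self.training = False

    def train(self):
        self.training = True

    def eval(self):
        self.training = False


model = StubModel()
with restore_inference_mode(model):
    model.train()  # generation code may flip the mode internally
```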
* Update _utils.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* [FIX] [Transformers] VLM input embeds fix for gradients (#3715)
* Fix get_input_embeds call for VLMs
* patch input_require_grads instead
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* cleanup old patch
* cleanup old patch
* cleanup
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Apply suggestion from @danielhanchen
* use logger instead of prints
* Move unsloth present set
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Update rope_embedding.py
* Fixes
* Update _utils.py
* Update import_fixes.py
* Update rl_replacements.py
* fix_openenv_no_vllm
* Fix
* Update __init__.py
* Update __init__.py
* Update __init__.py
* Update import_fixes.py
* Update import_fixes.py
* Update import_fixes.py
* logger
* Update __init__.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update __init__.py
* Update import_fixes.py
* Update __init__.py
* Update import_fixes.py
* Update import_fixes.py
* Update import_fixes.py
* Update import_fixes.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update import_fixes.py
* Update unsloth/import_fixes.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Update save.py
* [fbgemm] Silence tma fbgemm (#3735)
* Silence fbgemm TMA print
Also safer .push_to_hub
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Update loader.py
* Update save.py
* Update save.py
* Update _utils.py
* Update _utils.py
* Diffusers warnings
* Update pyproject.toml
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* [hf_hub] Token login (#3739)
* login on token
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* cleanup old code
* safer imports
* cleanup
* Return token after login
* correct return types
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Apply suggestion from @danielhanchen
* add back imports
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* finish return token
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Do not overwrite slots (#3752)
* Do not overwrite slots
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Update save.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Datta Nimmaturi <venkatadattasainimmaturi@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Update _utils.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* [FIX] [Transformers] VLM input embeds fix for gradients (#3715)
* Fix get_input_embeds call for VLMs
* patch input_require_grads instead
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* cleanup old patch
* cleanup old patch
* cleanup
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Apply suggestion from @danielhanchen
* use logger instead of prints
* Move unsloth present set
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Update rope_embedding.py
* Fixes
* Update _utils.py
* Update import_fixes.py
* Update rl_replacements.py
* fix_openenv_no_vllm
* Fix
* Update __init__.py
* Update __init__.py
* Update __init__.py
* Update import_fixes.py
* Update import_fixes.py
* Update import_fixes.py
* logger
* Update __init__.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update __init__.py
* Update import_fixes.py
* Update __init__.py
* Update import_fixes.py
* Update import_fixes.py
* Update import_fixes.py
* Update import_fixes.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update import_fixes.py
* Update unsloth/import_fixes.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Update save.py
* [fbgemm] Silence tma fbgemm (#3735)
* Silence fbgemm TMA print
Also safer .push_to_hub
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Update loader.py
* Update save.py
* Update save.py
* Update _utils.py
* Update _utils.py
* Diffusers warnings
* Update pyproject.toml
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* [hf_hub] Token login (#3739)
* login on token
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* cleanup old code
* safer imports
* cleanup
* Return token after login
* correct return types
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Apply suggestion from @danielhanchen
* add back imports
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* finish return token
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Do not overwrite slots (#3752)
* Do not overwrite slots
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Datta Nimmaturi <venkatadattasainimmaturi@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* Enable 4-bit quant on Radeon
* Fix table centering
* Update comments for clarity
* Handle failure to import Bitsandbytes
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update device_type.py
* Apply suggestion from @danielhanchen
* Update device_type.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Do not overwrite slots
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* login on token
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* cleanup old code
* safer imports
* cleanup
* Return token after login
* correct return types
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Apply suggestion from @danielhanchen
* add back imports
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* finish return token
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Silence fbgemm TMA print
Also safer .push_to_hub
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Update _utils.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* [FIX] [Transformers] VLM input embeds fix for gradients (#3715)
* Fix get_input_embeds call for VLMs
* patch input_require_grads instead
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* cleanup old patch
* cleanup old patch
* cleanup
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Apply suggestion from @danielhanchen
* use logger instead of prints
* Move unsloth present set
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Update rope_embedding.py
* Fixes
* Update _utils.py
* Update import_fixes.py
* Update rl_replacements.py
* fix_openenv_no_vllm
* Fix
* Update __init__.py
* Update __init__.py
* Update __init__.py
* Update import_fixes.py
* Update import_fixes.py
* Update import_fixes.py
* logger
* Update __init__.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update __init__.py
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Datta Nimmaturi <venkatadattasainimmaturi@gmail.com>
* Update torchao save
* up
* up
* up
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Apply suggestion from @danielhanchen
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Fix get_input_embeds call for VLMs
* patch input_require_grads instead
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* cleanup old patch
* cleanup old patch
* cleanup
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Apply suggestion from @danielhanchen
* use logger instead of prints
* Move unsloth present set
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* skip xpu fbgemm fp8
* Apply suggestion from @gemini-code-assist[bot]
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Apply suggestion from @danielhanchen
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* update TRL filter
* both filters
* Apply suggestion from @danielhanchen
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Update _utils.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fbgemm version check
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* safer version check
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add check for torchvision-torch compatibility
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* refactor package check logic
* Remove logs and enforce torch
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove reload_weights rpc call from grpo trainer
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use regex instead of static string
* patch openenv reload_weights call
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Better handle sleep and wakeup
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Reset indentation
* Handle multi line self.llm.chat better
* Use logger
* re-indent
* Stricter regex to replace wildcard
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove reload_weights rpc call from grpo trainer
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use regex instead of static string
* patch openenv reload_weights call
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Better handle sleep and wakeup
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Reset indentation
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Update transformers version constraint in pyproject.toml
The latest transformers version just fixes the local training.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update transformers version constraint in pyproject.toml
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* vllm sampling params fix
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* do not patch base_trainer
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* separate vllm fixes
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fixup deletion
* Fix indentation
* revert to old style
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* vllm sampling params fix
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* do not patch base_trainer
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* separate vllm fixes
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Apply suggestion from @danielhanchen
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Revert "[pre-commit.ci] auto fixes from pre-commit.com hooks"
This reverts commit 58b483dc0d1790f99580665801d3fa0d7267c533.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Revert "[pre-commit.ci] auto fixes from pre-commit.com hooks"
This reverts commit b2497519659a9f301e7a633795d9efdafdc2b277.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Revert "[pre-commit.ci] auto fixes from pre-commit.com hooks"
This reverts commit de3daaf429f81aceb6632932b0cb1af5149652a8.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Fix: remove load_in_fp8 from kwargs to prevent Qwen3Moe init TypeError (Fix #3649)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
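The `load_in_fp8` fix above can be pictured as consuming a loader-only flag before the remaining keyword arguments reach a model constructor that does not accept it. A minimal sketch, assuming hypothetical `load_model` and `ToyMoeModel` stand-ins (neither is the actual Unsloth loader nor the Qwen3Moe class):

```python
class ToyMoeModel:
    """Hypothetical stand-in for a model class whose __init__ rejects unknown kwargs."""

    def __init__(self, hidden_size=16):
        self.hidden_size = hidden_size


def load_model(model_cls, **kwargs):
    # Pop the loader-only flag so it is never forwarded to the constructor.
    load_in_fp8 = kwargs.pop("load_in_fp8", False)
    model = model_cls(**kwargs)
    return model, load_in_fp8


model, use_fp8 = load_model(ToyMoeModel, hidden_size=32, load_in_fp8=True)
```

Forwarding the flag unchanged (`ToyMoeModel(load_in_fp8=True)`) raises `TypeError` in this sketch, which mirrors the failure mode described in the commit title.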
* Only restore training mode after generation, if the model started out in training mode
Signed-off-by: Dina Suehiro Jones <dina.s.jones@intel.com>
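The behaviour described above can be sketched as a context manager that records the mode on entry and only switches back to training if that is where the model started. This is an illustration under stated assumptions: `ToyModel` and `generation_mode` are hypothetical names, and `ToyModel` only imitates the `training` / `train()` / `eval()` interface of `torch.nn.Module`.

```python
from contextlib import contextmanager


class ToyModel:
    """Hypothetical stand-in mimicking torch.nn.Module's train/eval switches."""

    def __init__(self):
        self.training = True

    def train(self, mode=True):
        self.training = mode
        return self

    def eval(self):
        return self.train(False)


@contextmanager
def generation_mode(model):
    was_training = model.training  # remember the caller's mode
    model.eval()
    try:
        yield model
    finally:
        # Restore training mode only if the model started out in it.
        if was_training:
            model.train()


model = ToyModel()
with generation_mode(model):
    inside_flag = model.training  # False while generating
```

A model that was already in eval mode before generation stays in eval mode afterwards, instead of being flipped into training mode unconditionally.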
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: Dina Suehiro Jones <dina.s.jones@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Enable FP8 + RL training for bf16 models (#3440)
* Enable FP8 + RL training for bf16 models
**Summary:** Enable FP8 + RL training using TorchAO for 1.33x faster training and 42% less model memory usage:
- We quantize the frozen LoRA weights into fp8 and keep the LoRA adapters in bf16
- We leverage TorchAO's `Float8Tensor`, which calls into fbgemm's fp8 x fp8 rowwise matmul kernel
- For now, we need to do an offline quantization first, because vllm doesn't support on-the-fly quantization for torchao yet (this is in progress: https://github.com/vllm-project/vllm/pull/26327)
**Example usage:**
```
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-8B-Base",
    max_seq_length = 2048,
    load_in_4bit = False,
    fast_inference = True,
    max_lora_rank = 32,
    load_in_fp8 = True,  # set this to True
)
# the rest is the same as before
model = FastLanguageModel.get_peft_model(...)
```
**Initial results:**
```
# fp8
{'train_runtime': 1725.4337, 'train_samples_per_second': 0.232, 'train_steps_per_second': 0.058, 'train_loss': 0.00015715716748673002, 'epoch': 0.01}
# bf16
{'train_runtime': 2297.8145, 'train_samples_per_second': 0.174, 'train_steps_per_second': 0.044, 'train_loss': 0.00016081033063528594, 'epoch': 0.01}
```
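The 1.33x figure in the summary can be reproduced from the two logged runs above. A small check, using only the `train_runtime` and `train_samples_per_second` values copied from those logs (the variable names are illustrative):

```python
# Values copied from the fp8 and bf16 training logs above.
fp8_runtime_s = 1725.4337
bf16_runtime_s = 2297.8145
fp8_samples_per_s = 0.232
bf16_samples_per_s = 0.174

# Speedup measured two ways: wall-clock ratio and throughput ratio.
speedup = bf16_runtime_s / fp8_runtime_s
throughput_ratio = fp8_samples_per_s / bf16_samples_per_s

print(f"runtime speedup: {speedup:.2f}x, throughput ratio: {throughput_ratio:.2f}x")
```

Both ratios round to 1.33, so the headline speedup is consistent with either metric.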
<img width="1199" height="448" alt="Screenshot 2025-11-11 at 4 10 50 PM" src="https://github.com/user-attachments/assets/b6304afd-89e9-42b1-8064-775807e17b23" />
Test script: https://gist.github.com/andrewor14/5b85119fae46845d07b608d420907423
**Requires:**
- https://github.com/pytorch/ao/pull/3158 (torchao nightly or 0.15.0+)
- https://github.com/unslothai/unsloth-zoo/pull/351
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update utils.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* _get_inference_mode_context_manager
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update utils.py
* Update utils.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Update __init__.py
* Fix/save torchao model loading logic (#3621)
* make loading gpt-oss-BF16 faster. Linked to unsloth-zoo PR #314
* fix model loading and clean merged model directory
* revert default quant
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* revert mapper.py
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Update loader_utils.py
* Update loader_utils.py
* Add 128x128 PerBlock FP8 + RL (#3629)
* Add 128x128 PerBlock FP8 + RL
**Summary:** Following https://github.com/unslothai/unsloth/pull/3440,
this PR extends torchao FP8 + RL support to also handle 128x128
PerBlock granularity (in addition to PerRow).
**Example usage:**
```
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-8B-Base",
    max_seq_length = 2048,
    load_in_4bit = False,
    fast_inference = True,
    max_lora_rank = 32,
    load_in_fp8 = "block",  # or "row" or True
)
```
**Initial results:** TBD
**Note:**
- Requires https://github.com/pytorch/ao/pull/3370
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Version
* Update vision.py
* Update rl.py
* Add torch 2.9.1
* Fix auto installer
* Update fp8.py
* Float8
* Update fp8.py
* Update mapper.py
* Update mapper.py
* Update loader_utils.py
* Update loader.py
* Update fp8.py
* Versioning
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: andrewor14 <andrewor14@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
* make loading gpt-oss-BF16 faster. Linked to unsloth-zoo PR #314
* fix model loading and clean merged model directory
* revert default quant
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* revert mapper.py
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Enable FP8 + RL training for bf16 models
**Summary:** Enable FP8 + RL training using TorchAO for 1.33x faster training and 42% less model memory usage:
- We quantize the frozen LoRA weights into fp8 and keep the LoRA adapters in bf16
- We leverage TorchAO's `Float8Tensor`, which calls into fbgemm's fp8 x fp8 rowwise matmul kernel
- For now, we need to do an offline quantization first, because vllm doesn't support on-the-fly quantization for torchao yet (this is in progress: https://github.com/vllm-project/vllm/pull/26327)
**Example usage:**
```
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-8B-Base",
    max_seq_length = 2048,
    load_in_4bit = False,
    fast_inference = True,
    max_lora_rank = 32,
    load_in_fp8 = True,  # set this to True
)
# the rest is the same as before
model = FastLanguageModel.get_peft_model(...)
```
**Initial results:**
```
# fp8
{'train_runtime': 1725.4337, 'train_samples_per_second': 0.232, 'train_steps_per_second': 0.058, 'train_loss': 0.00015715716748673002, 'epoch': 0.01}
# bf16
{'train_runtime': 2297.8145, 'train_samples_per_second': 0.174, 'train_steps_per_second': 0.044, 'train_loss': 0.00016081033063528594, 'epoch': 0.01}
```
<img width="1199" height="448" alt="Screenshot 2025-11-11 at 4 10 50 PM" src="https://github.com/user-attachments/assets/b6304afd-89e9-42b1-8064-775807e17b23" />
Test script: https://gist.github.com/andrewor14/5b85119fae46845d07b608d420907423
**Requires:**
- https://github.com/pytorch/ao/pull/3158 (torchao nightly or 0.15.0+)
- https://github.com/unslothai/unsloth-zoo/pull/351
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update utils.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* _get_inference_mode_context_manager
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update utils.py
* Update utils.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Remove grpo requirement bs=num_generations
* Update rl.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix qwen3 vl gradient accumulation
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Update unsloth/models/_utils.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* up
* up
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Support for out-of-source quantizers
* Fix decorators and functions to be staticmethod
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Patch in tiled mlp
* Update unsloth/models/llama.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Update rl_replacements.py grpo accumulation kwargs
* Update rl.py, remove bnpo default when setting dapo
* Update rl.py
* Update rl_replacements.py, add support for vllm importance sampling
* Update rl_replacements.py, added ability to get metrics
* Update rl_replacements.py send sampling per token logps to backend
* Update rl_replacements.py, corrected if statement in monkey patch
* Update rl_replacements.py, updating to handle nan cases as well
* Update rl_replacements.py, imported textwrap
* Update rl_replacements.py, yes
* Add error handling for sampling_per_token_logps
Handle NameError for sampling_per_token_logps assignment.
* Add delta check for use_vllm condition
* Refactor vision model flag to use is_vlm variable
* Fix FP8 for models with non 8 multiple weights
* patch fp8 forward methods for compiled models
* patch hf quantizer for fp8
* Failsafe import of fbgemmfp8linear and fp8linear
* Beautify
* Prefer loading model from pretrained instead of config
* Fixup FP8 forward pass and inference
* [WIP] Fix lora forwards
* Infer block size from weight shapes
* reconstruct weights from fp8 quants for lora matmul
* Return weight transpose and fix dtype
* Refactor FP8 operations
* Fix naming :)
* Saner compile
* do not depend on transformers
* [WIP] fix training
* Update comment
* fixup training
* use dequant kernel from deepseek
* Differentiate between fp8 and fbgemmfp8
* fixup differentiation b/w fp8 and fbgemm_fp8
* make inputs contiguous if required
* Improve dequant
* More robust handling
* Fixup backward pass for fbgemm_fp8
* refactor and use bf16 for dequant
* Use torch fp8 block matmul
* Disable torch block matmul for now
* safer import and cosmetics
* more cosmetics
* add torchao operations
* Spaceeeeeee
* GGUF conversion code + model to template mappers + chat template adds/fixes
* syntax fixes
* extract tokenizer from video processor
* model file cleanup after multiple quantizations
* flip is_vlm flag if mmproj has text-only llama.cpp support for MLM
* preserve processor files for merge operation
* reinstate chr(92)
* fixed starling mapping
* ollama Modelfile from gguf for text models
* specify bf16 ollama model precision for vision models
* fix keyError in templatedict when no mapping
* revert chat_templates.py to original syntax
* ollama modelfile template to model mapper
* link save to ollama mapper, fix some bugs
* rename to ollama_template_mappers
* Remove old template_mappers file (renamed ollama_template_mappers)
* fix final printout
* fix model list and printout
* remove yi base model, keep chat/instruct
* fixed dangling > in HF repo readme for uploaded models
* added granite model ollama support
* Combine use_local_gguf() blocks
* model_name relative to base_model_name
**Summary:** The existing QAT + LoRA path only applied fake
quantization to the original slow path, but the default is the
fast path that calls unsloth's fast LoRA primitives. This commit
integrates fake quantization into these fast primitives as well,
and adds unit tests to assert that fake quantization is actually
taking place.
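For readers unfamiliar with the term, fake quantization rounds values onto the quantized grid and immediately converts them back to floats, so the training forward pass sees the quantization error. A minimal sketch, assuming a symmetric per-tensor int4 scheme; `fake_quantize_int4` is a hypothetical helper and is not the torchao or Unsloth implementation (real schemes typically use group-wise scales and a straight-through estimator in the backward pass):

```python
def fake_quantize_int4(weights):
    """Quantize to symmetric int4 and immediately dequantize back to floats."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / 7.0  # map the largest magnitude onto the int4 value 7
    out = []
    for w in weights:
        q = max(-8, min(7, round(w / scale)))  # clamp to the int4 range [-8, 7]
        out.append(q * scale)
    return out


weights = [0.7, -0.35, 0.12, 0.0]
fake_quantized = fake_quantize_int4(weights)
```

Each output stays within half a quantization step of its input, which is the error a QAT forward pass is meant to expose to the optimizer.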
**Test Plan:**
Unit tests:
```
pytest tests/utils/test_qat.py
```
End-to-end test: https://gist.github.com/andrewor14/6360dd69b5784c71c46e80c14f53e6b6
Full fine-tuning Llama3.1-8B with and without QAT + LoRA on yahma/alpaca-cleaned for 1 epoch:
- Batch size = 8 (no grad accum)
- Learning rate = 2e-4
- Quantization scheme = int4 weight only (with bf16 activations)
Wikitext perplexity:
- Baseline = int4 quantized model finetuned without QAT
- QAT int4 quantized model (with this PR) achieved 33% lower perplexity than the int4 baseline
- QAT int4 quantized model without this PR was worse than the int4 baseline
```
==> unsloth_model_lora_baseline_output/lm_eval_float.log <==
| | |none | 0|word_perplexity|↓ |7.5551|± | N/A|
==> unsloth_model_lora_baseline_output/lm_eval_quantized.log <==
| | |none | 0|word_perplexity|↓ |8.7655|± | N/A|
==> unsloth_model_lora_qat_int4_output/lm_eval_quantized.log <==
| | |none | 0|word_perplexity|↓ |8.3548|± | N/A|
```
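One way to relate the 33% figure to the three logged perplexities is as the share of the quantization-induced gap that QAT closes. That reading is an assumption, not something the PR text states; the sketch below simply computes both candidate metrics from the values shown above.

```python
# Word perplexities copied from the three logs above.
float_ppl = 7.5551          # baseline, unquantized
int4_baseline_ppl = 8.7655  # baseline, int4 quantized
int4_qat_ppl = 8.3548       # QAT, int4 quantized

# Share of the float-to-int4 perplexity gap that QAT recovers.
gap = int4_baseline_ppl - float_ppl
gap_recovered = (int4_baseline_ppl - int4_qat_ppl) / gap

# Plain relative reduction in perplexity versus the int4 baseline.
relative_drop = (int4_baseline_ppl - int4_qat_ppl) / int4_baseline_ppl

print(f"gap recovered: {gap_recovered:.1%}, relative drop: {relative_drop:.1%}")
```

With these numbers the gap-recovery metric lands near one third, while the plain relative reduction is under 5%.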
* Kept, padding logic
* Made sure prediction step in rl.py allows logging for callbacks in RL trainers
* updated llama.py to new online_dpo changes
* Update rl.py to make logic simpler
* Update rl.py, made sure tokenized_output on eval step was on same device
* Update rl.py, corrected tokenized_outputs to inputs
* Update rl.py, removed sagemaker stuff
* Update llama.py, figures out if there is right padding automatically
* Update llama.py, changed conditional statement for right padding slightly
* Update llama.py, updated OS.environ variable to temp variable
* Update rl.py, made it account for right padding in online dpo and reward modeling
* Update llama.py, automatically figures out if right padding is needed
* Update rl_replacements.py, fixed up passing image data to functions
* Update rl_replacements.py, for VLM GRPO support with TRL
* Update rl_replacements.py, gspo added
* Update rl.py, forgot about Online_DPO changes in this branch
* Update rl.py, forgot to not include Online DPO PR changes
* Update llama.py, forgot to exclude Online DPO PR changes
* Update rl_replacements.py, updated generate and score completions to be up to date for trl
* Update rl_replacements.py
* Update rl_replacements.py, fixed nan issues with vlms
* Update rl_replacements.py, added indent
* Update rl_replacements.py, added attention mask to calculations of old and ref hidden states
* Update unsloth/models/rl_replacements.py
* Update unsloth/models/rl_replacements.py
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
**Summary:** Following https://github.com/unslothai/unsloth/pull/2976,
which adds support for QAT + LoRA, this PR adds support for QAT
during full fine-tuning. See the [torchao QAT README](https://github.com/pytorch/ao/blob/main/torchao/quantization/qat/README.md)
for more details.
Current QAT schemes supported are:
```
fp8-int4, targeting the torch.ops.fbgemm.f8i4bf16_shuffled kernel
fp8-fp8, targeting the torch.ops.fbgemm.f8f8bf16_rowwise kernel
```
**Test Plan:** https://gist.github.com/andrewor14/048b5c1bd01b7fa23c53913856a8ef9f
Full fine-tuning Llama3.1-8B with and without QAT on `yahma/alpaca-cleaned` for 1 epoch:
- Batch size = 16 (no grad accum)
- Learning rate = 4e-5
- Quantization scheme = fp8-int4
Wikitext perplexity:
- QAT improved perplexity by 19.2% compared to regular fine-tuning
- QAT's int4 quantized model even outperformed the bf16 baseline
- Regular int4 quantized model (without QAT) was significantly worse than the bf16 baseline
```
==> unsloth_model_full_baseline_output/eval_float.log <==
| | |none | 0|word_perplexity|↓ |9.8446|± | N/A|
==> unsloth_model_full_baseline_output/eval_quantized.log <==
| | |none | 0|word_perplexity|↓ |11.4595|± | N/A|
==> unsloth_model_full_qat_fp8-int4_output/eval_quantized.log <==
| | |none | 0|word_perplexity|↓ |9.2336|± | N/A|
```
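The three perplexity claims above can be checked against the logged values. A small sketch, assuming `eval_float.log` corresponds to the bf16 baseline mentioned in the bullets (variable names are illustrative); the exact percentage depends on rounding and on which value is used as the denominator.

```python
# Word perplexities copied from the three logs above.
bf16_ppl = 9.8446           # baseline, float (assumed to be the bf16 baseline)
int4_regular_ppl = 11.4595  # regular fine-tuning, int4 quantized
int4_qat_ppl = 9.2336       # QAT (fp8-int4), quantized

# Relative improvement of QAT over regular fine-tuning after quantization.
qat_vs_regular = (int4_regular_ppl - int4_qat_ppl) / int4_regular_ppl

# How the two quantized models compare with the bf16 baseline.
qat_vs_bf16 = (bf16_ppl - int4_qat_ppl) / bf16_ppl
regular_vs_bf16 = (int4_regular_ppl - bf16_ppl) / bf16_ppl

print(f"QAT vs regular int4: {qat_vs_regular:.1%} lower")
print(f"QAT vs bf16: {qat_vs_bf16:.1%} lower, regular int4 vs bf16: {regular_vs_bf16:.1%} higher")
```

The QAT model comes out roughly 19% lower than the regular int4 model and slightly below the bf16 baseline, while the regular int4 model is about 16% above it.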
Fibonacci test:
- Both bf16 baseline and int4 quantized models correctly identified 13 as the next number
- QAT quantized model was more succinct in its response
- No substantial differences here
```
### Instruction:
Continue the fibonnaci sequence.
### Input:
1, 1, 2, 3, 5, 8
==> unsloth_model_full_baseline_output/eval_float.log <==
### Response:
The next number in the Fibonacci sequence is 13.<|end_of_text|>
==> unsloth_model_full_baseline_output/eval_quantized.log <==
### Response:
The next number in the Fibonacci sequence is 13.<|end_of_text|>
==> unsloth_model_full_qat_fp8-int4_output/eval_quantized.log <==
### Response:
13<|end_of_text|>
```
Summary:
Previously the test was not run correctly and saving to a local path was not tested.
This PR adds support for that and tests it properly.
Note: `python tests/saving/test_unsloth_save.py` doesn't run the test
Test Plan:
pytest tests/saving/test_unsloth_save.py -k test_save_torchao
* Update test_qwen3_grpo.py to correct function call
This test file uses the incorrect name for the function, which is gradient_checkpointing_disable(), not disable_gradient_checkpointing().
I copied the line from test_llama32_sft.py - I'm not sure if this actually is required, just wanted it consistent for when other people like me test this and have no clue what they're doing when it throws an exception.
* Update blackwell/test_qwen3_grpo.py
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Update mistral.py, showed flag to not call cut cross entropy
* Update mistral.py, made it so if it's not equal to zero
* Update unsloth/models/mistral.py
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Summary:
Allow users to merge the LoRA weights and then do post-training quantization with torchao
Usage:
```
from torchao.quantization import Int8DynamicActivationInt8WeightConfig
torchao_config = Int8DynamicActivationInt8WeightConfig()
model.save_pretrained_torchao(
save_path,
tokenizer=tokenizer,
torchao_config=torchao_config,
)
```
Test Plan:
python tests/saving/test_unsloth_save.py
Reviewers:
Subscribers:
Tasks:
Tags:
1. Removed the `--extra-index-url https://wheels.vllm.ai/nightly` from the uv install instructions because this causes it to crash; removing that flag solves the issue and is more stable overall. Tested with RTX 5090 CUDA 12.8 on Linux.
2. Removed `uv pip install -U triton>=3.3.1` because triton 3.3.1 is already installed with the vllm command.
* sync all instead
* sync after move and rope init instead
* sync after rope inside
* Return new tensors and no sync
* Sync only current stream
* Fixup mask for xformers
* sync for prefill only
* clean up
* Support pre-dequantized quantization states in fast_dequantize kernel
* has_nested_quant conditional set to only
* Update utils.py
* Update utils.py
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Fixes "Argument of type 'float' cannot be assigned to parameter 'lora_dropout' of type 'int'" error by ensuring lora_dropout is consistently a float (0.0) rather than int (0) across vision.py, llama.py, and unsloth-cli.py
Because we don't have down and gate multipliers, the MLP output values are too large, causing NaN and unstable training. To bypass that, let's rely on HF's implementation for the time being.
* Move tensors to right devices
* fix multi gpu for non mistral models
* multi GPU RoPE for gemma2
* Finish up multi GPU inference
* Make multiGPU rope a list
* Remove unnecessary transfer to CPU
* Remove unnecessary move to CPU
* Do not move inputs to device yet
will be handled separately in another PR
* Move inputs to appropriate decoder device
* Make device count global variable
* Cleanup RoPE device code
* Fixup num_gpu to device count
* Cleanup device counts
* Use device index for RoPE get_cache
* Do not typecast
* Use tuple instead of list for tensors. Use device index directly
* fixup move to device logic
Pass `init_lora_weights` and `loftq_config` to `LoraConfig` constructor, which enables classes like `FastModel` to use LoftQ support. Thank you very much in advance!
* import undefined transformers_version for falcon model
fixed falcon transformers version check and added error handling for FalconH1Attention bad import
* Also, conditionally load module from falcon_h1 depending on if the transformers version supports it
* vLLM sleep once generation is done
* Make enable_sleep_model configurable
* Make default to false
Signed-off-by: datta0 <venkatadattasainimmaturi@gmail.com>
* Force standby under environment variable
---------
Signed-off-by: datta0 <venkatadattasainimmaturi@gmail.com>
* Update llama.py, sequence_classification update
* Update llama.py, adapting to original commit
* Update llama.py, for sequence classification update
* Update llama.py, added transformer import
* Update llama.py, dealt with output weight
* Update llama.py, renamed it peft model fast forward
* Update llama.py, set up is classification variable
* Update llama.py, updated lora dict to initialize sequence classification object
* Update llama.py, gets model name correctly before Lora dict is initialized
* Update llama.py, Task_type_SEQ_CLS doesnt work but it does work with Task_type.CAUSAL_LM
Ignore `None` params when building the `subprocess_command` for vLLM. `None` values stop vLLM from deploying properly: `--quantize` is passed with `None` if the quantization type isn't specified in the model name.
* Update llama.py making set and reset functions in order to properly use autoSequenceClassification
* Update fast_lora.py, added mixed precision PyTorch autocasting
* Update llama.py did not include rotary embeddings in the reset functions correctly
* Update rl.py: correct get reward model added as well as the eval step stuff
* Update rl.py removed function that did not need to be patched
* Update llama.py: kept reset functions and made their names generic
* Update fast_lora.py
* Update rl.py, try except
* Update fast_lora.py, removing downcasting stuff
* Update llama.py removed deprecated LLamaLinearScalingRotaryEmbedding
* Update rl.py for VLLM RLOO and PPO
* Update rl.py reverted
* Update rl.py with peft changes
* Update rl.py, disabling adapters screws inference up
* Update rl.py getting PPO support
* Update rl.py cleanup
* Update rl.py cleaned up not useful commented code
* Update llama.py, enabled new flag, keep padding
* Upgrade trl fix
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
* Update rl.py made changes relative to the review
* Revert accidental patch block for non grpo
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
* Fixup sampling params issue
* Fix rl.py regex
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
* loss type: grpo, drgrpo and bnpo
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
* Add trl version check for vllm colocate mode for RL trainers
* Update rl.py
For TRL 0.18.0 (the main branch of TRL at the time, since the release is on 0.17.0), the SFT trainer for some reason deletes the labels column. Unsloth's internal loss functions need that column for their calculations, so I add it back in like this.
* Update llama.py, merge it to be dattas llama version
* Update rl.py, sft changes to get 0.18.0 to be working
* Update rl_replacements.py, added hidden state stuff
* Update rl_replacements.py
* Update rl_replacements.py
* Update rl_replacements.py, rechanged the accumulated loss
* Fixup num_iterations>1 for grpo
Signed-off-by: datta0 <venkatadattasainimmaturi@gmail.com>
* Update rl_replacements.py
* no unnecessary logits upcast. fix naming
Signed-off-by: datta0 <venkatadattasainimmaturi@gmail.com>
* Update rl_replacements.py returned hidden states from logprobs
* Update rl_replacements.py removed debug logic
* Update rl_replacements.py, should be fine now
* Update rl_replacements.py, should take new args for GRPO trainer
* Update rl_replacements.py, made it compatible with trl 0.15.2
* Update rl_replacements.py, fixed typo in per-token logps
---------
Signed-off-by: Dattu Sharma <venkatadattasainimmaturi@gmail.com>
Signed-off-by: datta0 <venkatadattasainimmaturi@gmail.com>
Co-authored-by: pluesclues <136766175+pluesclues@users.noreply.github.com>
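The `None` filtering described earlier in this log for the vLLM subprocess command can be sketched roughly as follows. This is a minimal illustration, not the actual patch: `build_subprocess_command`, the `server-binary` base command, and the flag names are hypothetical placeholders.

```python
def build_subprocess_command(base_command, options):
    """Append `--flag value` pairs, skipping every option whose value is None.

    Hypothetical helper: the real code builds its command differently.
    """
    command = list(base_command)
    for flag, value in options.items():
        if value is None:
            # Passing "--quantize None" would break the launch, so drop it.
            continue
        command.extend([f"--{flag}", str(value)])
    return command


# Placeholder base command and flags, used for illustration only.
cmd = build_subprocess_command(
    ["server-binary", "my-model"],
    {"quantize": None, "max-model-len": 2048},
)
print(cmd)
```

Building the command as a list (instead of a formatted string) keeps each argument separate, so a skipped option leaves no stray flag behind.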
* Update loader.py
change vllm installed check by transformers utils function
* Update llama.py
change vllm installed check by transformers utils function
* add sample notebook
* fix Indentation
* add global is_vLLM_available function
* Pythonic style
* Delete nb/Qwen2.5_(3B)-GRPO-windows.ipynb
Would be great to move it to https://github.com/unslothai/notebooks - appreciate it!
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* add DEVICE_TYPE and resolve device specific API
* reuse import torch
* move env under device type
* resolve comments
* add more comments
* add more comments
This commit fixes a NameError that occurs when `importlib` is referenced in _utils.py
without being imported, especially when UNSLOTH_USE_MODELSCOPE=1 is enabled.
By adding the missing import statement, the code will no longer throw a NameError.
* fix: config.torch_dtype in LlamaModel_fast_forward_inference
* Update llama.py
* update for consistency
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
When loading a PEFT model fails, only the `autoconfig_error` is shown. Instead of the `peft_error`, which is what really matters when we're trying to load a PEFT adapter, the user will see something like this:
```
RuntimeError: Unrecognized model in my_model. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: albert, align, altclip, ...
```
This PR just changes it so `autoconfig_error` and `peft_error` are both displayed.
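A rough sketch of the idea, assuming two loader callables that each may raise: both error messages are collected and reported together. The names `load_with_fallback`, `load_auto`, and `load_peft` are hypothetical stand-ins, not the functions in `loader.py`.

```python
def load_with_fallback(load_auto, load_peft, model_name):
    """Try both loaders and report both failures if neither succeeds."""
    autoconfig_error = None
    peft_error = None
    try:
        return load_auto(model_name)
    except Exception as error:
        # Store the message: `error` is unbound once the except block ends.
        autoconfig_error = str(error)
    try:
        return load_peft(model_name)
    except Exception as error:
        peft_error = str(error)
    raise RuntimeError(
        f"Could not load {model_name}.\n"
        f"AutoConfig error: {autoconfig_error}\n"
        f"PEFT error: {peft_error}"
    )


def failing_auto(name):
    raise ValueError("Unrecognized model")


def failing_peft(name):
    raise ValueError("adapter_config.json not found")


try:
    load_with_fallback(failing_auto, failing_peft, "my_model")
except RuntimeError as combined:
    message = str(combined)
    print(message)
```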
Fix typo in comment: know -> now.
This was printed when running the Llama3.1_(8B)-GRPO.ipynb example notebook, so I'd expect others to run into it as well.
* check for torch.cuda and triton if available
on my machine (Mac M3) CUDA was not available
* Update pyproject.toml
* Update __init__.py
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Update __init__.py
This PR is solving the [issue](https://github.com/unslothai/unsloth/issues/1518) with some GPUs
* Update __init__.py
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Update granite.py
Grab residual multiplier directly from layer
* Update llama.py
Version should read >= 4.47.1 as that is the version requiring the changes
* Update granite.py
* Update llama.py
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Update granite to work with latest post_patch methods
* Pass position_embeddings for granite even if transformers<4.47
* Update llama.py
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* change initialization of n_heads, n_kv_heads, hidden_size in llama.py
* do the same for cohere, mistral, gemma2, granite
* do the same for flexattention,cohere, mistral, granite
* Update README.md
Llama 3.3 + Reddit
* Update README.md
Apple ML Cross Entropy
* Update README.md
Removing double citation
* Fix loader.py to work on Windows
---------
Co-authored-by: Michael Han <107991372+shimmyshimmer@users.noreply.github.com>
* change the colab notebook for dpo zephyr and orpo
* use original tokenizer
* Update README.md
* Update README.md
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Refactor trainer.py to import SFTConfig directly and update UnslothTrainingArguments class inheritance
* Update trainer.py
* Update trainer.py
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Refactor `get_chat_template` to now support system message instead. It supposed to fix ollama tokenizer chattemplate to
* Remove type hinting
* Update chat_templates.py
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Update README.md with os.environ in example
Added `os.environ` to the example to avoid device conflicts; for a user in a Jupyter notebook, at least, this allows selecting a GPU in a multi-GPU setup.
Currently the Unsloth init checks all GPUs and takes the first in order, which can be an issue when some GPUs are in use and the list still shows them. To avoid this manually, this os config is required.
A small change, but a time saver for those who copy the tutorials straight away.
* Update README.md
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
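For the `os.environ` README change above, a minimal sketch of pinning a process to one GPU might look like this. It assumes the variable in question is the standard `CUDA_VISIBLE_DEVICES`; the commit message does not name it, and `select_gpu` is a hypothetical helper.

```python
import os


def select_gpu(device_index):
    """Restrict this process to a single GPU.

    Assumption: must run before torch or unsloth is imported, because CUDA
    reads CUDA_VISIBLE_DEVICES when it initializes.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device_index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


visible = select_gpu(0)
print(f"CUDA_VISIBLE_DEVICES={visible}")
```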
* Enhance install_python_non_blocking to handle protobuf installation and process management
* Revert "Enhance install_python_non_blocking to handle protobuf installation and process management"
This reverts commit a3b796a05841fb8d93c652c845591e12cf81ea93.
* Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION to 'python' to address issue #1266
* Revert "Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION to 'python' to address issue #1266"
This reverts commit f00fbf5eac7ad4f5d48c70b98d770255d1a9ef58.
* Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION to 'python' to address issue #1266
* Update __init__.py
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Throw error when inferencing longer than max_position_embeddings without rope scaling
* Update llama.py
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Fix: cast logits to float32 in cross_entropy_forward to prevent errors
* Update cross_entropy_loss.py
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Update save.py
Check whether path is in /tmp dir for Kaggle environment
* Update save.py
Move temporary_location to /tmp in Kaggle
* Enhance Kaggle environment support in save and tokenizer utilities
---------
Co-authored-by: dendarrion <37800703+dendarrion@users.noreply.github.com>
Co-authored-by: Erland366 <erland.pg366@gmail.com>
* Bring back float32 if float16 instead of bfloat16
* Refactor mixed precision handling for lm_head and embed_tokens to ensure correct dtype usage
* Fix dtype retrieval for embed_tokens and lm_head in mixed precision training
* Fix dtype retrieval for embed_tokens and lm_head to use weight dtype in mixed precision training
* Fix dtype handling for embed_tokens and lm_head to ensure correct float32 usage in mixed precision training
* Fix dtype assignment for lm_head modules to ensure correct weight dtype usage in mixed precision training
* Enhance rotary embedding handling in LlamaAttention and LongRopeRotaryEmbedding
* Typo
* Improve rotary embedding handling in LlamaAttention to prevent errors with short KV cache
* Update llama.py
* Update llama.py
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Currently, Unsloth doesn't pass additional parameters to Trainer.compute_loss such as return_outputs. This leads to errors when calling trainer.evaluate(). This change fixes the bug by properly passing parameters to Trainer.compute_loss.
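The forwarding pattern can be sketched with toy classes. `BaseTrainer` and `PatchedTrainer` below are hypothetical stand-ins, not Hugging Face's `Trainer` or Unsloth's actual patch; the point is only that the override passes every extra argument through.

```python
class BaseTrainer:
    """Toy stand-in for a trainer whose compute_loss accepts return_outputs."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        outputs = model(inputs)
        loss = outputs["loss"]
        return (loss, outputs) if return_outputs else loss


class PatchedTrainer(BaseTrainer):
    def compute_loss(self, model, inputs, *args, **kwargs):
        # Forward all extra arguments (e.g. return_outputs) instead of
        # dropping them, so evaluation code gets the tuple it expects.
        return super().compute_loss(model, inputs, *args, **kwargs)


def toy_model(inputs):
    return {"loss": inputs * 2}


trainer = PatchedTrainer()
loss_only = trainer.compute_loss(toy_model, 3)
loss, outputs = trainer.compute_loss(toy_model, 3, return_outputs=True)
print(loss_only, loss, outputs)
```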
* bugs
* Update _utils.py
* flash-attn softcapping
* Update gemma2.py
* Update gemma2.py
* Update gemma2.py
* Update gemma2.py
* Update mapper.py
* Update README.md
* Update _utils.py
* Fix ROPE extension issue and device mismatch (#840)
* When an exception has been assigned using as target, it is cleared at the end of the except clause.(https://docs.python.org/3/reference/compound_stmts.html#the-try-statement)
* Update loader.py
* round up to extend rope size
* inv_freq.device changed, make sure they are on the same device
---------
Co-authored-by: xiaoyang <xiaoyang@youzan.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Update gemma.py
---------
Co-authored-by: XiaoYang <xyangk@gmail.com>
Co-authored-by: xiaoyang <xiaoyang@youzan.com>
* When an exception has been assigned using as target, it is cleared at the end of the except clause.(https://docs.python.org/3/reference/compound_stmts.html#the-try-statement)
* Update loader.py
* round up to extend rope size
* inv_freq.device changed, make sure they are on the same device
---------
Co-authored-by: xiaoyang <xiaoyang@youzan.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* When an exception has been assigned using as target, it is cleared at the end of the except clause.(https://docs.python.org/3/reference/compound_stmts.html#the-try-statement)
* Update loader.py
---------
Co-authored-by: xiaoyang <xiaoyang@youzan.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
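The `except ... as` rule cited in the commits above can be demonstrated in a few lines. This is a standalone illustration, not the code from `loader.py`; the function name is made up.

```python
def capture_error_message():
    """Show that the `as` target is unbound once the except clause ends."""
    saved_message = None
    try:
        raise ValueError("first loader failed")
    except ValueError as error:
        # Copy what is needed; `error` is deleted when this block exits.
        saved_message = str(error)

    try:
        error  # no longer bound at this point
    except NameError:
        name_cleared = True
    else:
        name_cleared = False
    return saved_message, name_cleared


saved_message, name_cleared = capture_error_message()
print(saved_message, name_cleared)
```

Saving the message (or the exception object under a different name) inside the `except` block is what makes it available to later code.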
* Update mapper.py
* Update Model Conversion Command in `save.py` to `convert_hf_to_gguf.py` (#730)
* Updated convert_hf_to_gguf.py call to align with changes in llama.cpp repository
* Update save.py
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Typo Fix (#690)
---------
Co-authored-by: M. Ali Bayram <malibayram91@gmail.com>
Co-authored-by: johnpaulbin <johnpaulbin@gmail.com>
Thank you for not only using Unsloth but also for being interested in helping out! We value all contributions, whether they come in the form of code, ideas, support for others or just by simply spreading the word of Unsloth! 💕
- **[Support the Community](https://github.com/unslothai/unsloth/issues)**: Answer questions, review pull requests, or assist others in discussions.
- **Fix Bugs**: Identify and resolve issues with the existing codebase.
- **Submit Ideas**: Request new features or share enhancements you'd like to see.
- **Develop Features**: Implement new functionality or improve existing tools which can be done via PRs.
- **[Improve Documentation](https://docs.unsloth.ai/)**: Help by creating guides, FAQs, or enhancing clarity.
One of the best ways to support us is by spreading the word about Unsloth! Share how it’s powering your amazing projects in blog posts or social media, and inspire others to explore its potential. Even a simple star on our repo goes a long way in showing your support and helping the community grow. 🌟
## Submitting Issues
If you find a bug or have a feature idea, we’d love to hear from you! Here’s how to make your submission stand out:
### Reporting Bugs
1. **Search First**: Check if the issue has already been reported using GitHub’s search bar under Issues.
2. **Details Matter**: Is this on Google Colab, Kaggle, or on another platform service? Are you using Unsloth's official notebook? Include your OS, Python version, and other relevant details. For bugs, a concise code snippet that reproduces the issue is incredibly helpful.
3. **Be Thorough**: Attach screenshots, traceback logs, or any additional information that might speed up resolution.
## Spread the Word
Your support extends beyond code:
- Spread the word by writing about Unsloth in blogs or social media.
- Share how Unsloth powers your projects.
- Star our repository to show your appreciation.
Finally, please be mindful of our [Code of Conduct](https://github.com/unslothai/unsloth/blob/main/CODE_OF_CONDUCT.md) to ensure a welcoming and inclusive environment for everyone.
Thank you so much for reading and we hope you have lots of fun using Unsloth! 🦥
<a href="https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing"><img src="./images/Free version button.png" height="50"></a>
* **NEW!** [TinyLlama 1.1b](https://github.com/jzhang38/TinyLlama) on 3T tokens! ⭐**Free!** example <a href="https://colab.research.google.com/drive/1AZghoNBQaMDgWJpi4RbffGM1h6raLUj9?usp=sharing"><img src="./images/Colab.png" height="20">
* **NEW!** We're in 🤗 Huggingface's official docs! We're on the [SFT docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth) and the [DPO docs](https://huggingface.co/docs/trl/main/en/dpo_trainer#accelerate-dpo-fine-tuning-using-unsloth)!
* Supports Llama, Yi, Mistral, CodeLlama, Qwen (llamafied), Deepseek and their derived models (Open Hermes etc).
* All kernels written in [OpenAI's Triton](https://openai.com/research/triton) language. **Manual backprop engine**.
* **0% loss in accuracy** - no approximation methods - all exact.
* No change of hardware. Supports NVIDIA GPUs since 2018+. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40 etc) [Check your GPU!](https://developer.nvidia.com/cuda-gpus) GTX 1070, 1080 works, but is slow.
* Works on **Linux** and **Windows** via WSL.
* **NEW!** Download 4 bit models 4x faster from 🤗 Huggingface! Eg: `unsloth/mistral-7b-bnb-4bit`
* Supports 4bit and 16bit QLoRA / LoRA finetuning via [bitsandbytes](https://github.com/TimDettmers/bitsandbytes).
* **NEW!** Want a UI for finetuning? Try [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory) and use `--use_unsloth`!
* Open source trains 5x faster - see [Unsloth Pro](https://unsloth.ai/) for **30x faster training**!
Do **NOT** use this if you have Anaconda. You must use the Conda install method, or else stuff will BREAK.
1. Find your CUDA version via
```python
import torch; torch.version.cuda
#### Windows:
```powershell
irm https://unsloth.ai/install.ps1 | iex
```
2. For Pytorch 2.1.0: You can update Pytorch via Pip (interchange `cu121` / `cu118`). Go to https://pytorch.org/ to learn more. Select either `cu118` for CUDA 11.8 or `cu121` for CUDA 12.1. If you have an RTX 3060 or higher (A100, H100 etc), use the `"ampere"` path. For Pytorch 2.1.1: go to step 3.
#### Community:
- [Discord](https://discord.gg/unsloth)
- [𝕏 (Twitter)](https://x.com/UnslothAI)
- [Reddit](https://reddit.com/r/unsloth)
## ⭐ Features
Unsloth Studio (Beta) lets you run and train text, [audio](https://unsloth.ai/docs/basics/text-to-speech-tts-fine-tuning), [embedding](https://unsloth.ai/docs/new/embedding-finetuning), [vision](https://unsloth.ai/docs/basics/vision-fine-tuning) models on Windows, Linux and macOS.
### Inference
* **Search + download + run models** including GGUF, LoRA adapters, safetensors
* **Export models**: [Save or export](https://unsloth.ai/docs/new/studio/export) models to GGUF, 16-bit safetensors and other formats.
* **Tool calling**: Support for [self-healing tool calling](https://unsloth.ai/docs/new/studio/chat#auto-healing-tool-calling) and web search
* **[Code execution](https://unsloth.ai/docs/new/studio/chat#code-execution)**: lets LLMs test code in Claude artifacts and sandbox environments
* [Auto-tune inference parameters](https://unsloth.ai/docs/new/studio/chat#auto-parameter-tuning) and customize chat templates.
* We work directly with teams behind [gpt-oss](https://docs.unsloth.ai/new/gpt-oss-how-to-run-and-fine-tune#unsloth-fixes-for-gpt-oss), [Qwen3](https://www.reddit.com/r/LocalLLaMA/comments/1kaodxu/qwen3_unsloth_dynamic_ggufs_128k_context_bug_fixes/), [Llama 4](https://github.com/ggml-org/llama.cpp/pull/12889), [Mistral](models/tutorials/devstral-how-to-run-and-fine-tune.md), [Gemma 1-3](https://news.ycombinator.com/item?id=39671146), and [Phi-4](https://unsloth.ai/blog/phi4), where we’ve fixed bugs that improve model accuracy.
* Upload images, audio, PDFs, code, DOCX and more file types to chat with.
### Training
* Train and RL **500+ models** up to **2x faster** with up to **70% less VRAM**, with no accuracy loss.
* Custom Triton and mathematical **kernels**. See some collabs we did with [PyTorch](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/fp8-reinforcement-learning) and [Hugging Face](https://unsloth.ai/docs/new/faster-moe).
* **Data Recipes**: [Auto-create datasets](https://unsloth.ai/docs/new/studio/data-recipe) from **PDF, CSV, DOCX** etc. Edit data in a visual-node workflow.
* **[Reinforcement Learning](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide)** (RL): The most efficient [RL](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide) library, using **80% less VRAM** for GRPO, [FP8](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/fp8-reinforcement-learning) etc.
* **Observability**: Monitor training live, track loss and GPU usage and customize graphs.
* [Multi-GPU](https://unsloth.ai/docs/basics/multi-gpu-training-with-unsloth) training is supported, with major improvements coming soon.
## 📥 Install
Unsloth can be used in two ways: through **[Unsloth Studio](https://unsloth.ai/docs/new/studio/)**, the web UI, or through **Unsloth Core**, the code-based version. Each has different requirements.
### Unsloth Studio (web UI)
Unsloth Studio (Beta) works on **Windows, Linux, WSL** and **macOS**.
* **CPU:** Supported for Chat and Data Recipes currently
* **NVIDIA:** Training works on RTX 30/40/50, Blackwell, DGX Spark, Station and more
* **macOS:** Currently supports chat and Data Recipes. **MLX training** is coming very soon
* **AMD:** Chat + Data works. Train with [Unsloth Core](#unsloth-core-code-based). Studio support is out soon.
* **Coming soon:** Training support for Apple MLX, AMD, and Intel.
* **Multi-GPU:** Available now, with a major upgrade on the way
5. If you get errors, try the below first, then go back to step 1:
docker run -d -e JUPYTER_PASSWORD="mypassword" \
-p 8888:8888 -p 8000:8000 -p 2222:22 \
-v $(pwd)/work:/workspace/work \
--gpus all \
unsloth/unsloth
```
#### Developer, Nightly, Uninstall
To see developer, nightly and uninstallation etc. instructions, see [advanced installation](#-advanced-installation).
### Unsloth Core (code-based)
#### Linux, WSL:
```bash
pip install --upgrade pip
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv unsloth_env --python 3.13
source unsloth_env/bin/activate
uv pip install unsloth --torch-backend=auto
```
# Documentation
We support Huggingface's TRL, Trainer, Seq2SeqTrainer or even Pytorch code!
We're in 🤗 Huggingface's official docs! We're on the [SFT docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth) and the [DPO docs](https://huggingface.co/docs/trl/main/en/dpo_trainer#accelerate-dpo-fine-tuning-using-unsloth)!
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
use_gradient_checkpointing = True,
random_state = 3407,
max_seq_length = max_seq_length,
)
trainer = SFTTrainer(
model = model,
train_dataset = dataset,
dataset_text_field = "text",
max_seq_length = max_seq_length,
tokenizer = tokenizer,
args = TrainingArguments(
per_device_train_batch_size = 2,
gradient_accumulation_steps = 4,
warmup_steps = 10,
max_steps = 60,
fp16 = not torch.cuda.is_bf16_supported(),
bf16 = torch.cuda.is_bf16_supported(),
logging_steps = 1,
output_dir = "outputs",
optim = "adamw_8bit",
seed = 3407,
),
)
trainer.train()
#### Windows:
```powershell
winget install -e --id Python.Python.3.13
winget install --id=astral-sh.uv -e
uv venv unsloth_env --python 3.13
.\unsloth_env\Scripts\activate
uv pip install unsloth --torch-backend=auto
```
For Windows, `pip install unsloth` works only if you have PyTorch installed. Read our [Windows Guide](https://unsloth.ai/docs/get-started/install/windows-installation).
You can use the same Docker image as Unsloth Studio.
<a name="DPO"></a>
# DPO (Direct Preference Optimization) Support
DPO, PPO, Reward Modelling all seem to work as per 3rd party independent testing from [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory). We have a preliminary Google Colab notebook for reproducing Zephyr on Tesla T4 here: [notebook](https://colab.research.google.com/drive/15vttTpzzVXv_tJwEk-hIcQ0S9FcEWvwP?usp=sharing).
#### AMD, Intel:
For RTX 50x, B200, 6000 GPUs: `uv pip install unsloth --torch-backend=auto`. Read our guides for: [Blackwell](https://unsloth.ai/docs/blog/fine-tuning-llms-with-blackwell-rtx-50-series-and-unsloth) and [DGX Spark](https://unsloth.ai/docs/blog/fine-tuning-llms-with-nvidia-dgx-spark-and-unsloth). <br>
To install Unsloth on **AMD** and **Intel** GPUs, follow our [AMD Guide](https://unsloth.ai/docs/get-started/install/amd) and [Intel Guide](https://unsloth.ai/docs/get-started/install/intel).
We're in 🤗 Huggingface's official docs! We're on the [SFT docs](https://huggingface.co/docs/trl/main/en/sft_trainer#accelerate-fine-tuning-2x-using-unsloth) and the [DPO docs](https://huggingface.co/docs/trl/main/en/dpo_trainer#accelerate-dpo-fine-tuning-using-unsloth)!
## 📒 Free Notebooks
```python
from unsloth import FastLanguageModel, PatchDPOTrainer
PatchDPOTrainer()
import torch
from transformers import TrainingArguments
from trl import DPOTrainer
Train for free with our notebooks. You can use our new [free Unsloth Studio notebook](https://colab.research.google.com/github/unslothai/unsloth/blob/main/studio/Unsloth_Studio_Colab.ipynb) to run and train models for free in a web UI.
Read our [guide](https://unsloth.ai/docs/get-started/fine-tuning-llms-guide). Add dataset, run, then deploy your trained model.
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
use_gradient_checkpointing = True,
random_state = 3407,
max_seq_length = max_seq_length,
)
- See all our notebooks for: [Kaggle](https://github.com/unslothai/notebooks?tab=readme-ov-file#-kaggle-notebooks), [GRPO](https://unsloth.ai/docs/get-started/unsloth-notebooks#grpo-reasoning-rl-notebooks), [TTS](https://unsloth.ai/docs/get-started/unsloth-notebooks#text-to-speech-tts-notebooks), [embedding](https://unsloth.ai/docs/new/embedding-finetuning) & [Vision](https://unsloth.ai/docs/get-started/unsloth-notebooks#vision-multimodal-notebooks)
- See [all our models](https://unsloth.ai/docs/get-started/unsloth-model-catalog) and [all our notebooks](https://unsloth.ai/docs/get-started/unsloth-notebooks)
- See detailed documentation for Unsloth [here](https://unsloth.ai/docs)
dpo_trainer = DPOTrainer(
model = model,
ref_model = None,
args = TrainingArguments(
per_device_train_batch_size = 4,
gradient_accumulation_steps = 8,
warmup_ratio = 0.1,
num_train_epochs = 3,
fp16 = not torch.cuda.is_bf16_supported(),
bf16 = torch.cuda.is_bf16_supported(),
logging_steps = 1,
optim = "adamw_8bit",
seed = 42,
output_dir = "outputs",
),
beta = 0.1,
train_dataset = YOUR_DATASET_HERE,
# eval_dataset = YOUR_DATASET_HERE,
tokenizer = tokenizer,
max_length = 1024,
max_prompt_length = 512,
)
dpo_trainer.train()
```
## 🦥 Unsloth News
- **Qwen3.6**: Qwen3.6-35B-A3B can now be trained and run in Unsloth Studio. [Blog](https://unsloth.ai/docs/models/qwen3.6)
- **Gemma 4**: Run and train Google’s new models directly in Unsloth. [Blog](https://unsloth.ai/docs/models/gemma-4)
- **Introducing Unsloth Studio**: our new web UI for running and training LLMs. [Blog](https://unsloth.ai/docs/new/studio)
- **Qwen3.5** - 0.8B, 2B, 4B, 9B, 27B, 35-A3B, 112B-A10B are now supported. [Guide + notebooks](https://unsloth.ai/docs/models/qwen3.5/fine-tune)
- Train **MoE LLMs 12x faster** with 35% less VRAM - DeepSeek, GLM, Qwen and gpt-oss. [Blog](https://unsloth.ai/docs/new/faster-moe)
- New **7x longer context RL** vs. all other setups, via our new batching algorithms. [Blog](https://unsloth.ai/docs/new/grpo-long-context)
- New RoPE & MLP **Triton Kernels** & **Padding Free + Packing**: 3x faster training & 30% less VRAM. [Blog](https://unsloth.ai/docs/new/3x-faster-training-packing)
- **500K Context**: Training a 20B model with >500K context is now possible on an 80GB GPU. [Blog](https://unsloth.ai/docs/blog/500k-context-length-fine-tuning)
- **FP8 & Vision RL**: You can now do FP8 & VLM GRPO on consumer GPUs. [FP8 Blog](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/fp8-reinforcement-learning) • [Vision RL](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide/vision-reinforcement-learning-vlm-rl)
- **gpt-oss** by OpenAI: Read our [RL blog](https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune/gpt-oss-reinforcement-learning), [Flex Attention](https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune/long-context-gpt-oss-training) blog and [Guide](https://unsloth.ai/docs/models/gpt-oss-how-to-run-and-fine-tune).
# Support us!
We're currently 2 brothers trying to make LLMs for everyone! It'll be super cool if you can support our work!!
* Slim Orca `bsz=1` for all benchmarks since `bsz=2` OOMs. We can handle `bsz=2`, but we benchmark it with `bsz=1` for consistency.
# Llama-Factory 3rd party benchmarking
| Method | Bits | TGS | GRAM | Speed |
| --- | --- | --- | --- | --- |
| HF | 16 | 2392 | 18GB | 100% |
| HF+FA2 | 16 | 2954 | 17GB | 123% |
| Unsloth+FA2 | 16 | 4007 | 16GB | **168%** |
| HF | 4 | 2415 | 9GB | 101% |
| Unsloth+FA2 | 4 | 3726 | 7GB | **160%** |
[Link](https://github.com/hiyouga/LLaMA-Factory/wiki/Performance-Comparison) to performance table. TGS: tokens per GPU per second. Model: LLaMA2-7B. GPU: NVIDIA A100 * 1. Batch size: 4. Gradient accumulation: 2. LoRA rank: 8. Max length: 1024.
# How did we make it faster?
Manual autograd, Triton kernels etc. See our [Benchmark Breakdown](https://unsloth.ai/blog/mistral-benchmark) for more info!
# Troubleshooting
1. Sometimes `bitsandbytes` or `xformers` does not link properly. Try running:
## 📥 Advanced Installation
The below advanced instructions are for Unsloth Studio. For Unsloth Core advanced installation, [view our docs](https://unsloth.ai/docs/get-started/install/pip-install#advanced-pip-installation).
#### Developer installs: macOS, Linux, WSL:
```bash
!ldconfig /usr/lib64-nvidia
git clone https://github.com/unslothai/unsloth
cd unsloth
./install.sh --local
unsloth studio -H 0.0.0.0 -p 8888
```
Then to update :
```bash
unsloth studio update
```
2. Windows is not supported as of yet - we rely on Xformers and Triton support, so Unsloth will support Windows once both packages support it officially.
3. If it doesn't install - maybe try updating `pip`.
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
.\install.ps1 --local
unsloth studio -H 0.0.0.0 -p 8888
```
Then to launch every time:
```bash
unsloth studio -H 0.0.0.0 -p 8888
```
#### Uninstall
You can uninstall Unsloth Studio by deleting its install folder usually located under `$HOME/.unsloth/studio` on Mac/Linux/WSL and `%USERPROFILE%\.unsloth\studio` on Windows. Using the `rm -rf` commands will **delete everything**, including your history, cache:
You can delete old model files either from the bin icon in model search or by removing the relevant cached model folder from the default Hugging Face cache directory. By default, HF uses:
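As a hedged sketch of how that cache location is usually resolved: `huggingface_hub` checks the `HF_HUB_CACHE` environment variable first, then `HF_HOME` (appending `hub`), and otherwise falls back to `~/.cache/huggingface/hub` (honoring `XDG_CACHE_HOME` if set). The helper name `hf_cache_dir` is hypothetical and this is a simplified approximation of that lookup, not Unsloth Studio's own code.

```python
import os
from pathlib import Path


def hf_cache_dir(env=None):
    """Approximate the default Hugging Face hub cache directory.

    `env` is a mapping of environment variables (defaults to os.environ);
    passing it explicitly makes the lookup easy to test.
    """
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):
        # Explicit cache override wins.
        return Path(env["HF_HUB_CACHE"]).expanduser()
    if env.get("HF_HOME"):
        # HF_HOME relocates the whole Hugging Face directory; the hub cache sits under it.
        return Path(env["HF_HOME"]).expanduser() / "hub"
    # Fall back to the XDG cache dir, or ~/.cache when it is not set.
    base = env.get("XDG_CACHE_HOME") or "~/.cache"
    return Path(base).expanduser() / "huggingface" / "hub"


print(hf_cache_dir())
```

Cached models normally appear there as folders named `models--<org>--<name>`, so it is worth listing the directory and checking what it contains before deleting anything.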
### Mistral 7b
| 1 A100 40GB | Hugging Face | Flash Attention 2 | Unsloth Open | Unsloth Equal | Unsloth Pro | Unsloth Max |
| <img width="13" src="https://upload.wikimedia.org/wikipedia/commons/0/09/X_(formerly_Twitter)_logo_late_2025.svg" /> **Twitter (aka X)** | [Follow us on X](https://twitter.com/unslothai) |
author = {Daniel Han, Michael Han and Unsloth team},
title = {Unsloth},
url = {https://github.com/unslothai/unsloth},
year = {2023}
}
```
If you trained a model with 🦥Unsloth, you can use this cool sticker! <img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/made with unsloth.png" width="200" align="center" />
| 1 T4 16GB | Hugging Face | Flash Attention | Unsloth Open | Unsloth Pro Equal | Unsloth Pro | Unsloth Max |
Unsloth uses a dual-licensing model of Apache 2.0 and AGPL-3.0. The core Unsloth package remains licensed under **[Apache 2.0](https://github.com/unslothai/unsloth?tab=Apache-2.0-1-ov-file)**, while certain optional components, such as the Unsloth Studio UI, are licensed under the open-source license **[AGPL-3.0](https://github.com/unslothai/unsloth?tab=AGPL-3.0-2-ov-file)**.
if uv pip install --python "$_VENV_PY" -q "transformers>=5.2.0"; then
substep "installed from PyPI"
else
substep "PyPI install failed, trying GitHub..."
if uv pip install --python "$_VENV_PY" -q "git+https://github.com/huggingface/transformers.git"; then
substep "installed from huggingface/transformers main"
else
fail "Could not install transformers>=5.2.0 (required for Qwen3.5/3.6 model support). Please check your Python version (>=3.10 required) and network connection, then try again."
fi
fi
step "install" "installing torch + torchvision (needed for Qwen3 VL processor)..."
"<a href=\"https://unsloth.ai/docs/\"><img src=\"https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true\" width=\"125\"></a> Join Discord if you need help + ⭐ <i>Star us on <a href=\"https://github.com/unslothai/unsloth\">Github</a> </i> ⭐\n",
"</div>\n",
"\n",
"To install Unsloth Studio on your local device, follow [our guide](https://unsloth.ai/docs/new/unsloth-studio/install). Unsloth Studio is licensed [AGPL-3.0](https://github.com/unslothai/unsloth/blob/main/studio/LICENSE.AGPL-3.0).\n",
"\n",
"### Unsloth Studio\n",
"\n",
"Train and run open models with [**Unsloth Studio**](https://unsloth.ai/docs/new/unsloth-studio/start). NEW! Installation should now only take 2 mins!\n",
"\n",
"\n",
"We are actively working on making Unsloth Studio install on Colab T4 GPUs faster.\n",
"for _ in range(10000): time.sleep(300), print(\"=\", end = \"\")"
],
"metadata": {
"id": "wb9UELh--XzX"
},
"id": "wb9UELh--XzX",
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"id": "f2b0c6a1",
"metadata": {
"id": "f2b0c6a1"
},
"source": [
"And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!\n",
"\n",
"Some other resources:\n",
"1. Looking to use Unsloth locally? Read our [Installation Guide](https://unsloth.ai/docs/get-started/install) for details on installing Unsloth on Windows, Docker, AMD, Intel GPUs.\n",
"2. Learn how to do Reinforcement Learning with our [RL Guide and notebooks](https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide).\n",
"3. Read our guides and notebooks for [Text-to-speech (TTS)](https://unsloth.ai/docs/basics/text-to-speech-tts-fine-tuning) and [vision](https://unsloth.ai/docs/basics/vision-fine-tuning) model support.\n",
"4. Explore our [LLM Tutorials Directory](https://unsloth.ai/docs/models/tutorials-how-to-fine-tune-and-run-llms) to find dedicated guides for each model.\n",
"5. Need help with Inference? Read our [Inference & Deployment page](https://unsloth.ai/docs/basics/inference-and-deployment) for details on using vLLM, llama.cpp, Ollama etc.\n",
# Model defaults for unsloth/Llama-3.2-11B-Vision-Instruct
# Based on Llama3.2_(11B)-Vision.ipynb
# Also applies to: unsloth/Llama-3.2-11B-Vision-Instruct-unsloth-bnb-4bit, meta-llama/Llama-3.2-11B-Vision-Instruct, unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit
# added inference parameters from unsloth notebook
# Model defaults for unsloth/Llama-3.2-1B-Instruct
# Based on Llama3.2_(1B)-RAFT.ipynb
# Also applies to: unsloth/Llama-3.2-1B-Instruct-unsloth-bnb-4bit, meta-llama/Llama-3.2-1B-Instruct, unsloth/Llama-3.2-1B-Instruct-bnb-4bit, RedHatAI/Llama-3.2-1B-Instruct-FP8, unsloth/Llama-3.2-1B-Instruct-FP8-Block, unsloth/Llama-3.2-1B-Instruct-FP8-Dynamic
# Model defaults for unsloth/Llama-3.2-3B-Instruct
# Based on Llama3.2_(1B_and_3B)-Conversational.ipynb
# Also applies to: unsloth/Llama-3.2-3B-Instruct-unsloth-bnb-4bit, meta-llama/Llama-3.2-3B-Instruct, unsloth/Llama-3.2-3B-Instruct-bnb-4bit, RedHatAI/Llama-3.2-3B-Instruct-FP8, unsloth/Llama-3.2-3B-Instruct-FP8-Block, unsloth/Llama-3.2-3B-Instruct-FP8-Dynamic
# added inference parameters from unsloth notebook
# Model defaults for unsloth/Llama-3.3-70B-Instruct
# Based on Llama3.3_(70B)_A100-Conversational.ipynb
# Also applies to: unsloth/Llama-3.3-70B-Instruct-unsloth-bnb-4bit, meta-llama/Llama-3.3-70B-Instruct, unsloth/Llama-3.3-70B-Instruct-bnb-4bit, RedHatAI/Llama-3.3-70B-Instruct-FP8, unsloth/Llama-3.3-70B-Instruct-FP8-Block, unsloth/Llama-3.3-70B-Instruct-FP8-Dynamic
# added inference parameters from unsloth notebook
# Model defaults for unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
# Based on Llama3.1_(8B)-Inference.ipynb
# Also applies to: "unsloth/Meta-Llama-3.1-8B-Instruct-unsloth-bnb-4bit", "meta-llama/Meta-Llama-3.1-8B-Instruct", "unsloth/Meta-Llama-3.1-8B-Instruct","RedHatAI/Llama-3.1-8B-Instruct-FP8","unsloth/Llama-3.1-8B-Instruct-FP8-Block","unsloth/Llama-3.1-8B-Instruct-FP8-Dynamic"
# Model defaults for unsloth/Mistral-Nemo-Base-2407-bnb-4bit
# Based on Mistral_Nemo_(12B)-Alpaca.ipynb
# Also applies to: "unsloth/Mistral-Nemo-Base-2407", "mistralai/Mistral-Nemo-Base-2407", "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit", "unsloth/Mistral-Nemo-Instruct-2407", "mistralai/Mistral-Nemo-Instruct-2407",