Commit graph

5103 commits

Author SHA1 Message Date
Roland Tannous
daaea21af1
Merge branch 'main' into fix/studio-stop-button 2026-04-20 20:54:53 +04:00
Michael Han
b24f3f61b8
Update README.md 2026-04-20 00:37:40 -07:00
Michael Han
f5eec8a6f2
Qwen3.6 and ReadMe revamp.md 2026-04-19 23:16:36 -07:00
pre-commit-ci[bot]
8037c9f928 [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
2026-04-19 11:56:36 +00:00
Daniel Han
02562e7449 Merge remote-tracking branch 'staging/pr-5069-tests' into pr-5069-head 2026-04-19 11:56:05 +00:00
Daniel Han
6ee93148eb Consolidate review tests for Studio stop-button cancel flow
Move review-added tests out of test_cancel_dispatch_edges.py into the
existing PR test files that already cover the same areas:
- backend registry fan-out / exclusivity / idempotency / falsy-keys
  edge cases moved into tests/studio/test_cancel_atomicity.py
- frontend plain-fetch (not authFetch) + manual Authorization header
  moved into tests/studio/test_cancel_id_wiring.py
Delete the now-empty test_cancel_dispatch_edges.py.
2026-04-19 11:54:31 +00:00
Daniel Han
e770e76e9f studio: trim comments on stop-button review changes
Collapse multi-paragraph rationale blocks on the cancel registry,
_openai_passthrough_stream, and the frontend onAbortCancel handler
into one-line explanations of why the non-obvious behaviour exists.
Drop authFetch import that became unused when the cancel POST
switched to plain fetch.
2026-04-19 11:51:36 +00:00
Daniel Han
348065814e Add review tests for Studio stop-button cancel flow 2026-04-19 11:48:54 +00:00
Daniel Han
9f60dfedd9 studio: harden cancel registry against ghost-cancel and leak paths
- Revert the session_id/completion_id stash in the fallback cancel
  helper. session_id is thread-scoped and reused across runs, so
  stashing it on an unmatched POST would fire cancel_event for the
  user's next unrelated request via _TrackedCancel.__enter__.
  cancel_id remains the only per-run unique key that gets stashed.
- Default max_tokens to _DEFAULT_MAX_TOKENS in the tool-passthrough
  body. Mirror the direct GGUF path so OpenAI/Anthropic passthrough
  callers who omit max_tokens get the same zombie-decode cap instead
  of relying on the wall-clock backstop alone.
- Wrap _openai_passthrough_stream setup with an outer try/except
  BaseException. The inner except httpx.RequestError does not catch
  asyncio.CancelledError at await client.send, which would otherwise
  leave _tracker registered in _CANCEL_REGISTRY indefinitely.
- Frontend stop POST uses plain fetch + manual Authorization header
  instead of authFetch. A 401 on the cancel POST no longer refreshes
  tokens or redirects the user to the login page mid-stop.
2026-04-19 11:43:39 +00:00
Daniel Han
35f6af4ad0 studio: extend stop-path to passthrough streams; tighten wall-clock cap
- Lower _DEFAULT_T_MAX_PREDICT_MS from 1 hour to 10 minutes so the
  wall-clock backstop actually bounds runaway decodes when cancel
  signaling fails.
- Wire _TrackedCancel and cancel_event.is_set() into
  _openai_passthrough_stream and _anthropic_passthrough_stream and
  disable httpx keepalive so stop requests from /v1 and /v1/messages
  tool-calling clients reach llama-server.
- Apply t_max_predict_ms to the tool-passthrough request body so the
  backstop covers passthrough paths as well.
- Symmetric pre-registration stash for session_id/completion_id
  cancels (_cancel_by_keys_or_stash) so early cancels by those keys
  replay on later registration like cancel_id.
- Drop dead except BaseException guards around StreamingResponse()
  at four streaming sites; cleanup lives in the generator's finally.
2026-04-19 11:19:00 +00:00
pre-commit-ci[bot]
34a4825311 [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
2026-04-19 04:40:06 +00:00
Daniel Han
d12c448a31 Consolidate review tests for Studio stop-button cancel flow
- Delete standalone test_cancel_registry.py at repo root: tests duplicated
  test_cancel_atomicity.py / test_cancel_id_wiring.py and re-implemented
  registry primitives inline (scaffolding).
- Extend tests/studio/test_stream_cancel_registration_timing.py with
  regression guards for the iter-1 cancel-loop fixes:
    structural: each streaming generator checks cancel_event in its loop;
                audio_input_stream offloads next() via asyncio.to_thread;
                stream_chunks cancel branch calls reset_generation_state().
    runtime:    Unsloth loop breaks on external cancel and resets state;
                audio loop stays responsive under blocking next();
                both loops emit zero tokens on pre-set cancel (replay path).
2026-04-19 04:38:15 +00:00
Daniel Han
e9f9dcfebb Add review tests for Studio stop-button cancel flow 2026-04-19 04:33:38 +00:00
Daniel Han
f22ed4acc0 studio: make cancel-via-POST interrupt Unsloth and audio-input streams
Close two remaining gaps in the stop-button cancellation wiring:

- stream_chunks (Unsloth path): add a top-of-loop cancel_event check and
  call backend.reset_generation_state() so cancel POSTs flush GPU state
  and close the SSE cleanly instead of relying on request.is_disconnected
  (which does not fire through proxies like Colab's).
- audio_input_stream: run the synchronous audio_input_generate() via
  asyncio.to_thread so blocking whisper chunks do not freeze the event
  loop, matching the pattern already used by the GGUF streaming paths.
2026-04-19 04:12:24 +00:00
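The `asyncio.to_thread` pattern from the audio_input_stream fix can be shown with a stdlib sketch; `blocking_chunks` stands in for the synchronous whisper generator, and the heartbeat task demonstrates that the event loop stays responsive while the blocking `next()` runs off-loop.

```python
import asyncio
import time

def blocking_chunks():
    # Stand-in for a synchronous generator whose next() blocks (e.g. a
    # whisper decode step); names here are illustrative only.
    for i in range(3):
        time.sleep(0.05)
        yield f"chunk-{i}"

async def stream_chunks():
    it = blocking_chunks()
    while True:
        # next(it, None) runs in a worker thread, so the event loop is
        # free to service other tasks while the chunk is produced.
        chunk = await asyncio.to_thread(next, it, None)
        if chunk is None:
            break
        yield chunk

async def main():
    heartbeat = 0

    async def ticker():
        nonlocal heartbeat
        while True:
            await asyncio.sleep(0.01)
            heartbeat += 1

    t = asyncio.create_task(ticker())
    out = [c async for c in stream_chunks()]
    t.cancel()
    return out, heartbeat
```

Calling the blocking `next()` directly inside the async generator would freeze the loop for the full decode; the heartbeat counter staying non-zero is the observable difference.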
pre-commit-ci[bot]
d3b8afdaa9 [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
2026-04-19 01:19:10 +00:00
Daniel Han
420c1a9fbd Merge remote-tracking branch 'staging/pr-5069-tests' into pr-5069-head 2026-04-19 01:18:57 +00:00
Daniel Han
f12e07e1bd Consolidate review tests for Studio stop-button cancel flow
- Merge the 6 behavioral tests from test_stream_cleanup_on_disconnect.py
  (finally cleanup on normal/exception/aclose, pre-set cancel_event
  pattern, and its regressions) into test_stream_cancel_registration_timing.py,
  which is the PR's existing file covering the same area.
- Extend structural invariants to include audio_input_stream alongside the
  three GGUF / Unsloth streaming generators: no _tracker.__enter__ inside
  the async gen body, cleanup via try/finally, no background= on
  StreamingResponse.
- Delete test_stream_cleanup_on_disconnect.py (now empty).
2026-04-19 01:17:08 +00:00
Daniel Han
a174b871d8 studio: wire audio-input stream into cancel registry
- Register cancel_event with _TrackedCancel on the audio-input streaming
  path so POST /api/inference/cancel can stop whisper / audio-input GGUF
  runs. Previously the registry stayed empty on this branch, so the stop
  button returned {"cancelled":0} and the decode ran to completion.
- Apply the same finally-based cleanup and pre-iteration cancel-event
  check used on the other three streaming paths.
- Update the _CANCEL_REGISTRY block comment to list cancel_id as the
  primary key (was stale "session_id preferred").
2026-04-19 01:10:56 +00:00
Daniel Han
4400c90181 Add review tests for Studio stop-button 2026-04-19 00:51:41 +00:00
Daniel Han
caa56091fa studio: move cancel cleanup to generator finally; drop dead helper
- Move _tracker.__exit__ from Starlette BackgroundTask into each
  streaming generator's finally block. Starlette skips the background
  callback when stream_response raises (OSError / ClientDisconnect),
  which leaked _CANCEL_REGISTRY entries on abrupt disconnect.
- Check cancel_event.is_set() at the top of each GGUF while loop so a
  pending-replay cancel falls through to final_chunk + [DONE] instead
  of propagating GeneratorExit out of _stream_with_retry.
- Remove unused _remember_pending_cancel; _cancel_by_cancel_id_or_stash
  superseded it.
2026-04-19 00:47:42 +00:00
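The finally-based cleanup the commit above moves to can be sketched in a few lines: a `finally` inside an async generator runs on normal completion, on an exception, and when the consumer calls `aclose()` (as Starlette does on disconnect), whereas a separate background callback can be skipped. Names are hypothetical, not the Studio code.

```python
import asyncio

cleaned: list[str] = []

async def stream(tracker_key: str):
    try:
        for i in range(100):
            yield i
            await asyncio.sleep(0)
    finally:
        # Stand-in for _tracker.__exit__: runs whether the stream ends
        # normally, raises, or is aclosed() after an abrupt disconnect.
        cleaned.append(tracker_key)

async def main():
    agen = stream("run-1")
    await agen.__anext__()   # consume one chunk
    await agen.aclose()      # simulate the client dropping mid-stream
    return cleaned
```

`aclose()` throws `GeneratorExit` into the suspended generator, which is exactly what makes `finally` a reliable place for registry cleanup.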
pre-commit-ci[bot]
510e2115da [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
2026-04-18 12:04:27 +00:00
Daniel Han
6e0a3eb517 Align cancel-route test with exclusive cancel_id semantics 2026-04-18 12:03:09 +00:00
Daniel Han
acdea3f2d5 Consolidate review tests for Studio stop button 2026-04-18 12:03:09 +00:00
Daniel Han
0667520771 Add review tests for Studio stop button 2026-04-18 12:03:09 +00:00
Daniel Han
78573842e0 studio/llama_cpp: drop upstream PR hashes from benchmark comment 2026-04-18 11:56:57 +00:00
Daniel Han
7fbf4061f1 studio: trim verbose comments and docstrings in cancel path 2026-04-18 11:52:40 +00:00
Daniel Han
023bc0cd14 studio: close TOCTOU race and restore wall-clock backstop on UI path
- Close TOCTOU race in the pending-cancel mechanism. The previous fix
  split cancel_inference's (cancel_by_keys + remember_pending_cancel)
  and _TrackedCancel.__enter__'s (register + consume_pending) into
  four separate lock acquisitions. Under contention a cancel POST
  could acquire-then-release the lock, find the registry empty, and
  stash ONLY AFTER __enter__ had already registered and consumed an
  empty pending map -- silently dropping the cancel. Both call sites
  now do their work inside a single _CANCEL_LOCK critical section, via
  the new atomic helper _cancel_by_cancel_id_or_stash() and an
  inlined consume-pending step in __enter__. Reproduced the race under
  forced interleaving pre-fix; 0/2000 drops post-fix under parallel
  stress.

- Apply t_max_predict_ms UNCONDITIONALLY at all three llama-server
  payload sites. The previous iteration gated the cap on
  `max_tokens is None`, which turned out to be dead code on the
  primary Studio UI path: chat-adapter.ts sets
  maxTokens=loadResp.context_length after every model load, so every
  chat request carries an explicit max_tokens and the wall-clock
  safety net never fired. The cap's original purpose is to bound
  stuck decodes regardless of the token budget; it must always apply.

- Raise _DEFAULT_T_MAX_PREDICT_MS from 10 minutes to 1 hour. 10
  minutes was too aggressive for legitimate slow-CPU chat responses
  (a 4096-token reply at 2 tok/s takes ~34 min); 1 hour accommodates
  that and still catches genuine zombie decodes.

- Prune _PENDING_CANCELS inside _cancel_by_keys as well, so stashed
  entries expire proportionally to overall cancel traffic rather than
  only to cancel_id-specific POSTs.
2026-04-18 11:48:23 +00:00
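The single-critical-section fix described above can be illustrated with a stdlib sketch: the cancel side looks up the registry and stashes in one lock acquisition, and the register side registers and consumes any pending cancel in one lock acquisition, so no interleaving can drop a cancel between the two. All names (`cancel_or_stash`, `register`, `_PENDING`) are illustrative.

```python
import threading
import time

_LOCK = threading.Lock()
_REGISTRY: dict[str, set[threading.Event]] = {}
_PENDING: dict[str, float] = {}

def cancel_or_stash(cancel_id: str) -> int:
    # One critical section: look up AND stash. Splitting these into two
    # lock acquisitions reopens the TOCTOU window the commit describes.
    with _LOCK:
        events = _REGISTRY.get(cancel_id)
        if events:
            for ev in events:
                ev.set()
            return len(events)
        _PENDING[cancel_id] = time.monotonic()  # a TTL prune would expire this
        return 0

def register(cancel_id: str, ev: threading.Event) -> None:
    # One critical section: register AND consume any pending cancel.
    with _LOCK:
        _REGISTRY.setdefault(cancel_id, set()).add(ev)
        if _PENDING.pop(cancel_id, None) is not None:
            ev.set()  # replay the early cancel immediately
```

Under this shape, an early cancel that arrives before registration is always either stashed-then-replayed or matched directly; there is no ordering in which it vanishes.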
Daniel Han
132d4202c0 studio: harden stop-button cancel semantics and wall-clock cap
- Make /inference/cancel match cancel_id EXCLUSIVELY when supplied.
  Previously the handler iterated ('cancel_id','session_id','completion_id')
  and unioned matches, so a stale cancel POST carrying {cancel_id:old,
  session_id:thr} would still cancel a later run on the same thread via
  the shared session_id. cancel_id is now a per-run exclusive key;
  session_id / completion_id are only used as fallbacks when cancel_id
  is absent.

- Close the early-cancel race. If /inference/cancel lands before the
  streaming handler reaches _TrackedCancel.__enter__() (stop clicked
  during prefill / warmup / proxy buffering), the cancel was silently
  dropped. Stash unmatched cancel_ids in _PENDING_CANCELS with a 30 s
  TTL; _TrackedCancel.__enter__() now replays any matching pending
  cancel by set()-ing the event immediately after registration.

- Make t_max_predict_ms = _DEFAULT_T_MAX_PREDICT_MS conditional on
  max_tokens is None at all three llama-server payload sites. The cap
  is a safety net for callers who leave max_tokens unset (otherwise
  llama-server defaults n_predict to n_ctx, up to 262144). Callers who
  set an explicit max_tokens are already self-limiting and must not be
  silently truncated at 10 minutes on slow CPU / macOS / Windows
  legitimate long generations.

- Guard each StreamingResponse return with try/except BaseException so
  _tracker.__exit__ runs even if StreamingResponse construction or any
  preceding statement raises between _tracker.__enter__() and the
  BackgroundTask attachment. Prevents a registry leak on that narrow
  window.
2026-04-18 11:21:10 +00:00
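The exclusive-key matching rule from the first bullet can be sketched as a small key-resolution helper; the function name and dict shape are hypothetical, but the semantics follow the commit: when `cancel_id` is present it is the only key consulted, otherwise `session_id` / `completion_id` act as fallbacks.

```python
def resolve_cancel_keys(body: dict) -> list[tuple[str, str]]:
    # cancel_id is a per-run exclusive key: when present, session_id and
    # completion_id are ignored, so a stale POST carrying an old cancel_id
    # plus a still-live session_id cannot cancel a later run on that thread.
    if body.get("cancel_id"):
        return [("cancel_id", body["cancel_id"])]
    return [
        (k, body[k])
        for k in ("session_id", "completion_id")
        if body.get(k)
    ]
```

The earlier behaviour (unioning matches across all three keys) is exactly what this shape forbids.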
Daniel Han
2aee7a6c3d Merge remote-tracking branch 'origin/main' 2026-04-18 10:57:37 +00:00
Roland Tannous
ac2daf8b7a
Studio: forward standard OpenAI tools / tool_choice to llama-server (#5099)
* fix(studio): forward OpenAI tools/tool_choice to llama-server (#4999)

Studio's /v1/chat/completions silently stripped standard OpenAI `tools`
and `tool_choice` fields, so clients using standard function calling
(opencode, Claude Code, Cursor, Continue, ...) never got structured
tool_calls back. Adds a client-side pass-through path mirroring the
existing Anthropic /v1/messages flow: when `tools` is present without
Studio's `enable_tools` shorthand, the request is forwarded to
llama-server verbatim so the client sees native id, finish_reason
("tool_calls"), delta.tool_calls, and accurate usage tokens.

Also wires Anthropic tool_choice forwarding: /v1/messages previously
accepted tool_choice on the request model but silently dropped it with
a warning. Translate the four Anthropic shapes to OpenAI format and
forward them so agentic clients can actually enforce tool use.

- ChatCompletionRequest: add tools, tool_choice, stop; extra="allow"
- ChatMessage: accept role="tool", optional tool_call_id / tool_calls /
  name; content is now optional (assistant with only tool_calls)
- routes/inference.py: _openai_passthrough_stream /
  _openai_passthrough_non_streaming helpers, routing branch in
  openai_chat_completions, vision+tools via content-parts injection
- _build_passthrough_payload: tool_choice parameter (default "auto")
- anthropic_compat: anthropic_tool_choice_to_openai() translator
- tests/test_openai_tool_passthrough.py: Pydantic + translator unit tests
- tests/test_studio_api.py: 5 new E2E tests (non-stream, stream,
  multi-turn, OpenAI SDK, Anthropic tool_choice=any regression)

* fix(studio): surface httpx transport errors from OpenAI passthrough

When the managed llama-server subprocess crashes mid-request, the
async pass-through helpers in routes/inference.py used to return a
bare 500 (non-streaming) or an "An internal error occurred" SSE chunk
(streaming) because _friendly_error only recognized the sync path's
"Lost connection to llama-server" substring -- httpx transport
failures (ConnectError / ReadError / RemoteProtocolError /
ReadTimeout) stringify differently and fell through to the generic
case.

- _friendly_error: map any httpx.RequestError subclass to the same
  "Lost connection to the model server" message the sync chat path
  emits. Placed before the substring heuristics so the streaming path
  automatically picks it up via its existing except Exception catch.
- _openai_passthrough_non_streaming: wrap the httpx.AsyncClient.post
  in a try/except httpx.RequestError and re-raise as HTTPException
  502 with the friendly detail.
- tests/test_openai_tool_passthrough.py: new TestFriendlyErrorHttpx
  class pinning the mapping for ConnectError, ReadError,
  RemoteProtocolError, ReadTimeout, and confirming non-httpx paths
  (context-size heuristic, generic fallback) are unchanged.

* fix(studio): close aiter_bytes/aiter_lines explicitly in passthroughs

The httpcore asyncgen cleanup fix in 5cedd9a5 is incomplete on Python
3.13 + httpcore 1.0.x: it switched to manual client/response lifecycle
but still used anonymous `async for raw_line in resp.aiter_lines():`
patterns in all three streaming paths. Python's async for does NOT
auto-close the iterator on break/return, so the aiter_lines /
aiter_bytes async generator remains alive, reachable only from the
surrounding coroutine frame. Once `_stream()` returns the frame is
GC'd and the orphaned asyncgen is finalized on a LATER GC pass in a
DIFFERENT asyncio task, where httpcore's
HTTP11ConnectionByteStream.aclose() enters anyio.CancelScope.__exit__
with a mismatched task and prints "Exception ignored in: <async
generator>" / "async generator ignored GeneratorExit" / "Attempted
to exit cancel scope in a different task" to the server log.

User observed this on /v1/messages after successful (status 200)
requests, with the traceback pointing at HTTP11ConnectionByteStream
.__aiter__ / .aclose inside httpcore.

Fix: save resp.aiter_lines() / resp.aiter_bytes() as a variable and
explicitly `await iter.aclose()` in the finally block BEFORE
resp.aclose() / client.aclose(). This closes the asyncgen inside the
current task's event loop, so the internal httpcore byte stream is
cleaned up before Python's asyncgen GC hook has anything orphaned to
finalize. Each aclose is wrapped in try/except Exception so nested
anyio cleanup noise can't bubble out.

Applied to all three streaming passthrough paths:
- _anthropic_passthrough_stream (/v1/messages client-side tool path)
- _openai_passthrough_stream (/v1/chat/completions client-side tool
  path, new in this PR)
- openai_completions (/v1/completions bytes proxy from PR #4956)

* fix(studio): default ChatCompletionRequest.stream to false per OpenAI spec

OpenAI's /v1/chat/completions spec defaults `stream` to false, so
clients that omit the field (naive curl, minimal integrations) expect
a single JSON response back. Studio was defaulting to true, silently
switching those clients into SSE and breaking any parser that didn't
also handle streaming. ResponsesRequest and AnthropicMessagesRequest
already default to false correctly; only ChatCompletionRequest was
wrong.

Studio's own frontend always sets `stream` explicitly on every
chat-adapter / chat-api / runtime-provider call site, so the flip has
no UI impact. SDK users (OpenAI Python/JS SDK, opencode, Claude Code,
Cursor, Continue) also always pass `stream` explicitly, so they're
unaffected. The only clients feeling the change are raw-curl users
who were relying on the wrong default -- those get the correct OpenAI
behavior now.

Added a regression test pinning the default so it can't silently
flip back.

* fix(studio): reject images in OpenAI tool passthrough for text-only GGUFs

The new tool passthrough branch runs before _extract_content_parts,
skipping the existing not is_vision guard. Requests combining tools
with an image on a text-only tool-capable GGUF were forwarded to
llama-server, producing opaque upstream errors instead of the
pre-existing clear 400. Restore the guard inline at the dispatch
point, checking both legacy image_base64 and inline image_url parts.

* fix(studio): require tool_call_id on role=tool chat messages

Enforce the OpenAI spec rule that role="tool" messages must carry a
tool_call_id. Without it, upstream backends cannot associate a tool
result with the assistant's prior tool_calls entry and the request
fails in non-obvious ways through the passthrough path. Reject at the
request boundary with a 422 instead.

* fix(studio): harden OpenAI tool passthrough validation and error surfacing

Three related fixes called out by the PR review:

1. Preserve upstream status codes in the streaming passthrough. The
   httpx request is now dispatched before the StreamingResponse is
   constructed. Non-200 upstream responses and httpx RequestError
   transport failures raise HTTPException with the real status
   instead of being buried inside a 200 SSE error frame, so OpenAI
   SDK clients see APIError/BadRequestError/... as expected.

2. Require non-empty content on user/system/tool messages. Per the
   OpenAI spec, content may only be omitted on assistant messages
   that carry tool_calls; enforce that at the request boundary so
   malformed messages never reach the passthrough path.

3. Role-constrain tool-call metadata. tool_calls is only valid on
   role=assistant, tool_call_id and name only on role=tool. Without
   this, a user/system message with tool_calls would flip the
   passthrough branch on and be forwarded to llama-server, surfacing
   as an opaque upstream error.

* fix(studio): normalize image mode and passthrough JSON verbatim

Two Gemini-code-assist review findings on PR #5099:

1. Unconditionally convert decoded images to RGB before PNG encoding.
   The prior code only handled RGBA, letting CMYK/I/F images crash
   at img.save(format="PNG") and surface as opaque 400s. Applied to
   both the passthrough helper and the non-passthrough GGUF path
   that originally carried this pattern, keeping the two sites in
   sync.

2. Return the upstream JSON body as raw bytes via Response rather
   than parse-then-re-serialize with JSONResponse. Matches the
   passthrough helper's "verbatim" contract and drops a redundant
   round-trip.

---------

Co-authored-by: Lee Jackson <130007945+Imagineer99@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-04-18 12:53:23 +04:00
pre-commit-ci[bot]
163052a734 [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
2026-04-18 01:02:00 +00:00
Daniel Han
46ca892958 Add review tests for PR #5069 2026-04-18 00:59:48 +00:00
Daniel Han
f259215286 studio: close cancel-race and stale-cancel gaps in stop path
- Register the cancel tracker before returning StreamingResponse so a
  stop POST that arrives during prefill / warmup / proxy buffering
  finds an entry in _CANCEL_REGISTRY. Cleanup now runs via a Starlette
  BackgroundTask instead of a finally inside the async generator body.
- Add a per-run cancel_id on the frontend (crypto.randomUUID) and in
  ChatCompletionRequest so /api/inference/cancel matches one specific
  generation. Removes the stale-cancel bug where pressing stop then
  starting a new run in the same thread would cancel the retry.
- Apply t_max_predict_ms unconditionally in all three llama-server
  payload builders (previously gated on max_tokens=None, which made it
  dead code for UI callers that always send params.maxTokens). Raise
  the default to 10 minutes so slow CPU / macOS / Windows installs are
  not cut off mid-generation.
- Make _cancel_by_keys refuse empty input (return 0) so a future
  internal caller can not accidentally mass-cancel every in-flight
  request.
- Accept cancel_id (primary), session_id, and completion_id on the
  /api/inference/cancel route. Unify the three streaming sites on the
  same _cancel_keys / _tracker variable names.
- Annotate _CANCEL_REGISTRY as dict[str, set[threading.Event]].
2026-04-18 00:56:39 +00:00
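Two details of the registry described above (the `dict[str, set[threading.Event]]` shape and the refuse-empty / dedup semantics of `_cancel_by_keys`) can be sketched with stdlib primitives; the function bodies are illustrative, not the actual implementation.

```python
import threading

_REGISTRY: dict[str, set[threading.Event]] = {}

def register(keys: list[str], ev: threading.Event) -> None:
    # One run may be reachable under several keys (cancel_id, session_id,
    # completion_id); all of them map to the same Event.
    for k in keys:
        _REGISTRY.setdefault(k, set()).add(ev)

def cancel_by_keys(keys: list[str]) -> int:
    if not keys:
        return 0  # refuse empty input: never mass-cancel every run
    hit: set[threading.Event] = set()
    for k in keys:
        hit |= _REGISTRY.get(k, set())
    for ev in hit:
        ev.set()
    return len(hit)  # unique cancelled requests, not raw key matches
```

Deduplicating through a set is what keeps the reported `cancelled` count meaningful when a single run was registered under multiple keys.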
Daniel Han
b0735f71db studio: harden stop-button cancel path and scope cancel route
- Require at least one identifier for /api/inference/cancel so a missing
  thread id cannot silently cancel every in-flight generation.
- Scope /cancel to a dedicated studio_router so it is not exposed under
  the /v1 OpenAI-compat prefix as a surprise endpoint.
- Store a set of cancel events per key in _CANCEL_REGISTRY so concurrent
  requests on the same session_id do not overwrite each other, and
  deduplicate in _cancel_by_keys so the cancelled count reflects unique
  requests.
- Always send session_id with chat completions (not only when tools are
  enabled) so non-tool GGUF streams register under it and are reachable
  from /cancel.
- Register the non-GGUF stream_chunks path in the cancel registry too,
  so transformers-based stop-button works behind proxies that swallow
  fetch aborts.
- Only apply the 2-minute t_max_predict_ms wall-clock cap when the
  caller did not pass max_tokens, so legitimate long generations on
  slow CPU / macOS / Windows installs are not silently truncated.
- Remove the abort listener on normal stream completion so reused
  AbortSignals cannot fire a spurious cancel POST after the fact.
2026-04-18 00:31:16 +00:00
Daniel Han
667dfd66f8 Merge remote-tracking branch 'origin/main' into pr-5069-head 2026-04-18 00:13:50 +00:00
Manan Shah
7d0d2f256c
Add qwen3.6 script (#5084)
* unsloth gemma4 support files

* some fixes

* Fixing cache.empty() calls (#4813)

* Fixing cache.empty() calls

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Manan Shah <mananshah@Manans-MacBook-Pro.local>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Fix/gemma4 mlx (#4816)

* Fixing cache.empty() calls

* fixing for mlx versions

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Manan Shah <mananshah@Manans-MacBook-Pro.local>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* removed bidirectional check for 31b (#4839)

Co-authored-by: Manan17 <shahmanan170602@gmail.coml>

* Add Gemma 4 26B MoE support (MLX) (#4844)

* removed bidirectional check for 31b

* Change gemma4_text for moe

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Manan Shah <mananshah@Manans-MacBook-Pro.local>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix(gemma4): cast RoPE offset to int before mx.arange() (#4901)

* fix(gemma4): cast RoPE offset to int before mx.arange()

* fix(gemma4): use zero-based arange + offset to avoid CPU-GPU sync

* qwen3.6 patches for multi-turn chat

* qwen3.6 script

* removing unnecessary scripts

* displaying errors for not installed packages

---------

Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Co-authored-by: Manan Shah <mananshah@Manans-MacBook-Pro.local>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Manan17 <shahmanan170602@gmail.coml>
Co-authored-by: Théophile Lafargue <138336683+eauchs@users.noreply.github.com>
2026-04-17 01:21:30 -07:00
Daniel Han
d20b306755 Versioning 2026-04-16 12:06:10 -07:00
pre-commit-ci[bot]
5dfbf37aa1 [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
2026-04-16 18:58:46 +00:00
danielhanchen
9d26096fe3 Studio: make stop button actually stop generation
The UI stop button routes through assistant-ui's cancelRun, which aborts
the frontend fetch. Four issues combined to let llama-server keep decoding
long after the user clicked stop:

1. request.is_disconnected() does not fire reliably behind proxies
   (e.g. Colab) that don't propagate fetch aborts.
2. llama-server defaults n_predict to n_ctx when max_tokens is not sent,
   so a cancelled request keeps producing tokens up to 262144.
3. The httpx.Client pool keeps TCP keep-alive, so even a cleanly closed
   stream reuses the same connection and llama-server's liveness poll
   never sees a disconnect.
4. No explicit backend route to cancel - every cancel path relied on
   is_disconnected.

Changes:
- Add POST /api/inference/cancel keyed by session_id/completion_id, with
  a registry populated for the lifetime of each streaming response.
- Have the frontend (chat-adapter.ts) POST /inference/cancel on
  AbortController abort, alongside the existing fetch teardown.
- Send max_tokens=4096 + t_max_predict_ms=120000 as defaults on every
  outbound chat completion to llama-server; explicit user-supplied
  values still take precedence.
- Disable httpx keep-alive on the streaming client so connection close
  reaches llama-server and its 1s liveness check fires.

No behaviour changes for non-streaming paths or for existing callers
that already pass max_tokens/session_id.
2026-04-16 18:57:59 +00:00
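The defaulting behaviour from the max_tokens / t_max_predict_ms change above can be sketched as a payload builder; the constants mirror the values named in the commit, while the function itself is a hypothetical stand-in for the real payload-construction sites.

```python
_DEFAULT_MAX_TOKENS = 4096
_DEFAULT_T_MAX_PREDICT_MS = 120_000  # 2-minute wall-clock backstop

def build_payload(body: dict) -> dict:
    payload = dict(body)
    # Without an explicit max_tokens, llama-server defaults n_predict to
    # n_ctx (up to 262144), so a cancelled request can keep decoding.
    payload.setdefault("max_tokens", _DEFAULT_MAX_TOKENS)
    # setdefault keeps the commit's contract: defaults apply only when
    # the caller did not supply a value of their own.
    payload.setdefault("t_max_predict_ms", _DEFAULT_T_MAX_PREDICT_MS)
    return payload
```

Callers that already pass `max_tokens` or the wall-clock cap see no change, which matches the "no behaviour changes for existing callers" note.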
Daniel Han
0b57884120
Add Qwen3.6 inference defaults for Studio (#5065)
* Add Qwen3.6 inference defaults for Studio

Add qwen3.6 family entry to inference_defaults.json with the
recommended sampling parameters from Qwen's documentation:
temperature=0.7, top_p=0.8, top_k=20, min_p=0.0,
presence_penalty=1.5, repetition_penalty=1.0.

Without this, Qwen3.6 models fall through to the generic qwen3
pattern which uses different defaults (temperature=0.6,
top_p=0.95, no presence_penalty).

* Add Qwen3.6-35B-A3B-GGUF to default model lists

* Add Qwen3.5/3.6 presence_penalty to thinking toggle and small-model disable logic

- Thinking toggle (on-load + button click) now sets presencePenalty: 1.5 for
  Qwen3.5 and Qwen3.6 models (both thinking-ON and thinking-OFF states)
- Small-model thinking-disable check (<9B defaults to no-thinking) extended
  from Qwen3.5-only to also cover Qwen3.6, in all 3 locations:
  frontend on-load, frontend refresh, backend llama_cpp.py
2026-04-16 11:42:42 -07:00
Daniel Han
d56f980452
fix: multi-GPU inference crash for bnb 4-bit/8-bit models (#5068)
* fix: multi-GPU inference crash for bnb 4-bit/8-bit models

When load_in_4bit or load_in_8bit is used with device_map="sequential"
and max_memory constraints that place weights across multiple GPUs (or
entirely on a non-default GPU like cuda:1), the bitsandbytes loading
path in transformers never calls dispatch_model. No AlignDevicesHook is
installed, and the first forward/generate call crashes with:

  RuntimeError: Expected all tensors to be on the same device

This adds _attach_bnb_multidevice_hooks() which is called after
from_pretrained returns. It infers a device map from actual parameter
placements and calls dispatch_model(force_hooks=True) to install the
missing hooks. The function is a complete no-op for the common
single-GPU cuda:0 case.

Call sites: FastBaseModel.from_pretrained (vision.py) and
FastLlamaModel.from_pretrained (llama.py).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: align with PR #5053 final review improvements

- Add hook call to the bnb quantized loading branch in llama.py (the
  primary load_in_4bit path), not just the non-fast-inference fallback
- Expand bnb detection: also check model.is_loaded_in_4bit,
  model.is_loaded_in_8bit, model.quantization_method
- Pass explicit main_device and skip_keys to dispatch_model
- Use logger.info instead of print for the success message
- Use kwargs.get("load_in_8bit", False) at llama.py call sites

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-16 11:35:02 -07:00
Lee Jackson
ee86530e55
chore: switch helper and no-cache fallback to Gemma (#5066) 2026-04-16 22:27:30 +04:00
Wasim Yousef Said
bc9ddb3af6
Fix onboarding followups (#5064)
* Fix onboarding followups

* Rename sidebar studio to train
2026-04-16 10:11:35 -07:00
Wasim Yousef Said
7ef65bd2e5
Chat first onboarding (#5063)
* auth: default to chat

* settings: relaunch onboarding

* onboarding: return to launch page

* studio: stop auto guided tour

* ui: soften global radius

* cleanup: rename onboarding exit prop

* fix onboarding redirect safety

* Show real Unsloth version in settings

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-16 09:58:10 -07:00
हिमांशु
f4422b0a62
change torchcodec version to 0.10.0 in extra-no-deps (#5043)
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
2026-04-16 19:50:57 +04:00
Wasim Yousef Said
b01e9af124
feat(studio): replace navbar with collapsible sidebar (#4936)
* feat(studio): replace navbar navigation with collapsible sidebar

Add an app-wide sidebar with hover-expand and pin-to-dock behavior.
Navigation items (Studio, Recipes, Export, Chat) move from the center
pill navbar to the sidebar. Chat threads and recipes render as
collapsible sub-lists. Navbar simplified to logo + update + close.

- Extend SidebarProvider with pinned/hovered state model
- New AppSidebar with animated active indicator, sloth profile menu,
  theme toggle, guided tour, back/forward navigation
- Chat page refactored to URL-driven view state via search params
- Extract reusable hooks for chat thread and recipe sidebar data
- Guard startViewTransition for browser compatibility
- Wrap chat deletions in Dexie transaction for data integrity

* feat(studio): move logo to sidebar and make navbar overlay

- Sidebar is now full-height with logo in SidebarHeader
- Collapsed sidebar shows sticker.png, expanded shows full logo
- Navbar is absolute-positioned overlay (no layout space)
- Main content extends to top, aligning with navbar controls

* feat(studio): full-height sidebar with recents, edge-to-edge nav buttons

- Sidebar outside max-w-7xl, pinned to left edge
- Remove sidebar rounding, menu buttons rounded-md
- Nav buttons flush to sidebar edges with no left rounding
- Replace collapsible recipes/chat with flat nav items
- Add Recents section with chat history (1 item when not on chat, full on chat)
- New Chat as first nav item with PencilEdit02Icon
- Cursor pointer on all sidebar buttons
- Navbar temporarily hidden for screenshots

* fix(studio): fix chat scroll, action bar hover, collapsible recents

- Fix sticky composer by removing `relative` override on viewport footer
- Action bar buttons only show on hover (autohide=always)
- Remove floating border/shadow from action bar
- Add scroll space above composer for last message actions
- Back/forward buttons use router history (stay in-app)
- Recents section collapsible with chevron on chat route
- Set html/body/#root height for proper h-full chain

* fix(studio): address review feedback, clean up unused code

- Unhide navbar (was left hidden from screenshot)
- Remove unused imports: SidebarMenuSub*, BubbleChatIcon, ColumnInsertIcon
- Remove unused vars: recipeItems, activeRecipeId, canCompare, recipesOpen
- Include compare query id in active sidebar selection
- Use store type for contextUsage instead of inline type
- Simplify noop in sidebar.tsx
- Remove empty className prop

* feat(studio): add mobile sidebar, recent runs section, and misc UX fixes

* feat(studio): scaffold settings feature module with dialog store

* feat(studio): add tri-state theme store for settings

* feat(chat): add clear-all-chats and export-chat-history utils

* feat(studio): add settings dialog shell with tab rail

* feat(studio): add appearance tab with theme and sidebar pin

* feat(studio): add settings general tab with hf token, auto-title, reset prefs

* feat(studio): add settings chat tab with export and clear

* feat(studio): add api keys tab with list and revoke flow

* feat(studio): add create-key form and reveal dialog

* feat(studio): add usage examples panel to api keys tab

* feat(studio): add settings about tab with update and shutdown

* feat(studio): add settings dropdown item and cmd-comma shortcut

* feat(studio): remove legacy api-keys route and chat-sheet preference rows

* fix(studio): settings dialog a11y + polish pass

* feat(studio): inline api key reveal card replacing nested dialog

* fix(studio): hide revoked keys from settings list

* refactor(studio): strip navbar and hoist training unload guard

* feat(studio): explicit sidebar toggle, remove hover-open and pin icons

* fix(studio): use SidebarRight01Icon for collapsed sidebar open toggle

* fix(studio): address code review findings for settings dialog

* feat(studio): collapsible navigate group with standalone new-chat and compare

* fix(studio): chat-only standalone actions, use ColumnInsertIcon for compare

* fix(studio): sidebar new-chat/compare state reset and icon-mode collapsible

* feat(studio): add compact logo assets for sidebar header

* Fixed sidebar design

* fix(studio): sidebar delete icon hover contrast and sizing

* feat(studio): route-gate sidebar recents (chats off /studio, runs on /studio)

* feat(studio): add chat search store

* feat(studio): add chat search index hook with snapshot-on-open

* feat(studio): add chat search command dialog with global shortcut

* feat(studio): wire chat search into sidebar

* fix(studio): trim hf token on save, add show/hide toggle, commit on close

* revert(studio): restore original sidebar/border colors, brighten sidebar

* feat(studio): forward overlayClassName through CommandDialog

* fix(studio): wrap search dialog in Command context, redesign as flat 635px card

* fix(studio): reserve right padding on recent items so delete icon stops overlapping title

* fix(studio): skip hf token unmount-commit during reset-prefs reload

* chore(studio): drop unused icon import and unreachable runs navigate branch

* fix(studio): chat search index filters archived before limit, batches message query, picks up reasoning text

* fix(studio): keep CommandEmpty in tree so empty state renders correctly

* fix(studio): cap system prompt and chat template textareas so they scroll instead of growing

* fix(studio): attach chat-compare tour anchor to sidebar compare button

* fix(studio): persist system theme explicitly so next-themes does not clobber on reload

* fix(studio): auto-switch to history tab when selecting a recent run from sidebar

* UI overhaul: chatbox, scrollbar, sidebar, and compare view

UI Changes:
- Redesigned the Compare UI with general cleanup
- Redesigned the Chatbox UI
- Reduced the width of the user chat bubble for improved readability
- Narrowed the user chat box across the content page
- Adjusted thinking-box text color to be slightly darker
- Removed faded text effect from chat messages
- Removed faded text effect from the thinking box
- Added a small LLM chat safety note at the bottom of the chatbox
- Restyled the scrollbar

Layout & Behavior:
- Reworked the scrollbar to span the full height of the page (no top/bottom padding) and remain persistently visible when content is scrollable, rather than only on hover
- Reworked the Configuration sidebar to span full height — removed rounded corners and borders, with the scrollbar adjusted to match the full top-to-bottom layout
- Adjusted the top menu and bottom chatbox content areas to work correctly with the new full-page scroll behavior
- Made chat content match the chatbox width, with content sliding slightly behind the chatbox when scrolling
- Aligned chat text width with the chatbox for visual consistency, including how far the text extends behind the chatbox

Fixes:
- Fixed the chatbox not auto-expanding when typing multi-line input while bottom-positioned during an active chat (previously only worked before a chat had started)
- Fixed positioning and design of the user chat hover menu buttons to match the assistant chat box — now displayed below the chat bubble instead of on the left side

* Fix user message layout in thread component

* swap code icon

* fix compare layout

* fix compare pane flex

* Sidebar improvements and fixes

- Added scrolling support to the sidebar so menus and recent chats no longer get hidden
- Recent chats are now always visible in the sidebar, not hidden when in Studio, Recipes, or Export
- Recent chat is now deselected when navigating to other sections
- Fixed sidebar glitch where browser resize could make the sidebar and expand button disappear completely
- Fixed glitch where the open-sidebar hover tooltip appeared above the logo when clicking expand sidebar
- Reduced sidebar width on mobile to around 2/3 of the screen (was too wide)
- Made the close-sidebar hover tooltip consistent with the rest of the design
- Removed sidebar collapse/expand animation
- Small adjustment to chat width

* Fix route scrolling, polling, and theme sync issues

* Fix Studio page scrolling

---------

Co-authored-by: sneakr <hauzin@hotmail.com>
2026-04-16 08:46:16 -07:00
Daniel Han
05ec0f110b
Studio: Ollama support, recommended folders, Custom Folders UX polish (#5050)
* Studio: Ollama support, recommended folders, Custom Folders UX polish

Backend:
- Add _scan_ollama_dir that reads manifests/registry.ollama.ai/library/*
  and creates .gguf symlinks under <ollama_dir>/.studio_links/ pointing
  at the content-addressable blobs, so detect_gguf_model and llama-server
  -m work unchanged for Ollama models
- Filter entries under .studio_links from the generic models/hf/lmstudio
  scanners to avoid duplicate rows and leaked internal paths in the UI
- New GET /api/models/recommended-folders endpoint returning LM Studio
  and Ollama model directories that currently exist on the machine
  (OLLAMA_MODELS env var + standard paths, ~/.lmstudio/models, legacy
  LM Studio cache), used by the Custom Folders quick-add chips
- detect_gguf_model now uses os.path.abspath instead of Path.resolve so
  the readable symlink name is preserved as display_name (e.g.
  qwen2.5-0.5b-Q4_K_M.gguf instead of sha256-abc...)
- llama-server failure with a path under .studio_links or .cache/ollama
  surfaces a friendlier message ("Some Ollama models do not work with
  llama.cpp. Try a different model, or use this model directly through
  Ollama instead.") instead of the generic validation error

Frontend:
- ListLabel supports an optional leading icon and collapse toggle; used
  for Downloaded (download icon), Custom Folders (folder icon), and
  Recommended (star icon)
- Custom Folders header gets folder icon on the left, and +, search,
  and chevron buttons on the right; chevron uses ml-auto so it aligns
  with the Downloaded and Recommended chevrons
- New recommended folder chips render below the registered scan folders
  when there are unregistered well-known paths; one click adds them as
  a scan folder
- Custom folder rows that are direct .gguf files (Ollama symlinks) load
  immediately via onSelect instead of opening the GGUF variant expander
  (which is for repos containing multiple quants, not single files)
- When loading a direct .gguf file path, send max_seq_length = 0 so the
  backend uses the model's native context instead of the 4096 chat
  default (qwen2.5:0.5b now loads at 32768 instead of 4096)
- New listRecommendedFolders() helper on the chat API
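
The scanner behaviour described above can be sketched roughly as follows. This is an illustrative Python model, not the actual Studio code: the function name, link-naming scheme, and error handling are assumptions; only the overall shape (read `manifests/registry.ollama.ai/library/*`, resolve each model layer's digest to a blob, expose it as a readable `.gguf` symlink under `.studio_links/`) comes from the commit message.

```python
import json
from pathlib import Path


def scan_ollama_dir(ollama_dir: str) -> list[Path]:
    """Illustrative sketch: mirror Ollama models as .gguf symlinks.

    Assumes the standard Ollama layout: manifests/<host>/<ns>/<name>/<tag>
    JSON files whose "application/vnd.ollama.image.model" layer digest
    names a content-addressable file under blobs/. Names are hypothetical.
    """
    root = Path(ollama_dir)
    links_dir = root / ".studio_links"
    links_dir.mkdir(exist_ok=True)
    found: list[Path] = []
    manifests = root / "manifests" / "registry.ollama.ai" / "library"
    if not manifests.is_dir():
        return found
    for manifest in manifests.rglob("*"):
        if not manifest.is_file():
            continue
        try:
            layers = json.loads(manifest.read_text()).get("layers", [])
        except (OSError, json.JSONDecodeError):
            continue  # unreadable/corrupt manifest: skip it, keep scanning
        for layer in layers:
            if layer.get("mediaType") != "application/vnd.ollama.image.model":
                continue
            # Digest "sha256:abc..." maps to the blob file "sha256-abc..."
            blob = root / "blobs" / layer["digest"].replace(":", "-")
            if not blob.is_file():
                continue
            # Readable link name so detect_gguf_model sees a .gguf suffix
            # instead of a raw sha256 blob path.
            link = links_dir / f"{manifest.parent.name}-{manifest.name}.gguf"
            if not link.exists():
                link.symlink_to(blob)
            found.append(link)
    return found
```

Because `llama-server -m` only follows the symlink, nothing downstream needs to know about Ollama's content-addressable storage.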

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address review: log silent exceptions and support read-only Ollama dirs

Replace silent except blocks in _scan_ollama_dir and the
recommended-folders endpoint with narrower exception types plus debug
or warning logs, so failures are diagnosable without hiding signal.

Add _ollama_links_dir helper that falls back to a per-ollama-dir hashed
namespace under Studio's own cache (~/.unsloth/studio/cache/ollama_links)
when the Ollama models directory is read-only. Common for system installs
at /usr/share/ollama/.ollama/models and /var/lib/ollama/.ollama/models
where the Studio process has read but not write access. Previously the
scanner returned an empty list in that case and Ollama models would
silently not appear.

The fallback preserves the .gguf suffix on symlink names so
detect_gguf_model keeps recognising them. The prior "raw sha256 blob
path" fallback would have missed the suffix check and failed to load.

* Address review: detect mmproj next to symlink target for vision GGUFs

Codex P1 on model_config.py:1012: when detect_gguf_model returns the
symlink path (to preserve readable display names), detect_mmproj_file
searched the symlink's parent directory instead of the target's. For
vision GGUFs surfaced via Ollama's .studio_links/ -- where the weight
file is symlinked but any mmproj sidecar lives next to the real blob
-- mmproj was no longer detected, so the model was misclassified as
text-only and llama-server would start without --mmproj.

detect_mmproj_file now adds the resolved target's parent to the scan
order when path is a symlink. Direct (non-symlink) .gguf paths are
unchanged, so LM Studio and HF cache layouts keep working exactly as
before. Verified with a fake layout reproducing the bug plus a
regression check on a non-symlink LM Studio model.

* Address review: support all Ollama namespaces and vision projector layers

- Iterate over all directories under registry.ollama.ai/ instead of
  hardcoding the "library" namespace. Custom namespaces like
  "mradermacher/llama3" now get scanned and include the namespace
  prefix in display names, model IDs, and symlink names to avoid
  collisions.

- Create companion -mmproj.gguf symlinks for Ollama vision models
  that have an "application/vnd.ollama.image.projector" layer, so
  detect_mmproj_file can find the projector alongside the model.

- Extract symlink creation into _make_symlink helper to reduce
  duplication between model and projector paths.

* Address review: move imports to top level and add scan limit

- Move hashlib and json imports to the top of the file (PEP 8).
- Remove inline `import json as _json` and `import hashlib` from
  function bodies, use the top-level imports directly.
- Add `limit` parameter to `_scan_ollama_dir()` with early exit
  when the threshold is reached.
- Pass `_MAX_MODELS_PER_FOLDER` into the scanner so it stops
  traversing once enough models are found.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address review: Windows fallback, all registry hosts, collision safety

_make_link (formerly _make_symlink):
- Falls back to os.link() hardlink when symlink_to() fails (Windows
  without Developer Mode), then to shutil.copy2 as last resort
- Uses atomic os.replace via tmp file to avoid race window where the
  .gguf path is missing during rescan

Scanner now handles all Ollama registry layouts:
- Uses rglob over manifests/ instead of hardcoding registry.ollama.ai
- Discovers hf.co/org/repo:tag and any other host, not just library/
- Filenames include a stable sha1 hash of the manifest path to prevent
  collisions between models that normalize to the same stem

Per-model subdirectories under .studio_links/:
- Each model's links live in their own hash-keyed subdirectory
- detect_mmproj_file only sees the projector for that specific model,
  not siblings from other Ollama models

Friendly Ollama error detection:
- Now also matches ollama_links/ (the read-only fallback cache path)
  and model_identifier starting with "ollama/"

Recommended folders:
- Added os.access(R_OK | X_OK) check so unreadable system directories
  like /var/lib/ollama/.ollama/models are not advertised as chips

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address review: filter ollama_links from generic scanners

The generic scanners (models_dir, hf_cache, lmstudio) already filter
out .studio_links to avoid duplicate Ollama entries, but missed the
ollama_links fallback cache directory used for read-only Ollama
installs. Add it to the filter.

* Address review: idempotent link creation and path-component filter

_make_link:
- Skip recreation when a valid link/copy already exists (samefile or
  matching size check). Prevents blocking the model-list API with
  multi-GB copies on repeated scans.
- Use uuid4 instead of os.getpid() for tmp file names to avoid race
  conditions from concurrent scans.
- Log cleanup errors instead of silently swallowing them.

Path filter:
- Use os.sep-bounded checks instead of bare substring match to avoid
  false positives on paths like "my.studio_links.backup/model.gguf".

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address review: drop copy fallback, targeted glob, robust path filter

_make_link:
- Drop shutil.copy2 fallback -- copying multi-GB GGUFs inside a sync
  API request would block the backend. Log a warning and skip the
  model when both symlink and hardlink fail.

Scanner:
- Replace rglob("*") with targeted glob patterns (*/*/* and */*/*/*)
  to avoid traversing unrelated subdirectories in large custom folders.

Path filter:
- Use Path.parts membership check instead of os.sep substring matching
  for robustness across platforms.

Scan limit:
- Skip _scan_ollama_dir when _generic already fills the per-folder cap.

* Address review: sha256, top-level uuid import, Path.absolute()

- Switch hashlib.sha1 to hashlib.sha256 for path hashing consistency.
- Move uuid import to the top of the file instead of inside _make_link.
- Replace os.path.abspath with Path.absolute() in detect_gguf_model
  to match the pathlib style used throughout the codebase.

* Address review: fix stale comments (sha1, rglob, copy fallback)

Update three docstrings/comments that still referenced the old
implementation after recent changes:
- sha1 comment now says "not a security boundary" (no hash name)
- "rglob" -> "targeted glob patterns"
- "file copies as a last resort" -> removed (copy fallback was dropped)

* Address review: fix stale links, support all manifest depths, scope error

_make_link:
- Drop size-based idempotency shortcut that kept stale links after
  ollama pull updates a tag to a same-sized blob. Only samefile()
  is used now -- if the link doesn't point at the exact same inode,
  it gets replaced.

Scanner:
- Revert targeted glob back to rglob so deeper OCI-style repo names
  (5+ path segments) are not silently skipped.

Ollama error:
- Only show "Some Ollama models do not work with llama.cpp" when the
  server output contains GGUF compatibility hints (key not found,
  unknown architecture, failed to load). Unrelated failures like
  OOM or missing binaries now show the generic error instead of
  being misdiagnosed.

---------

Co-authored-by: Daniel Han <info@unsloth.ai>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: danielhanchen <michaelhan2050@gmail.com>
2026-04-16 08:24:08 -07:00
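
The `_make_link` fallback chain this PR converged on (symlink, then hardlink for Windows without Developer Mode, no copy fallback; atomic publish via a uuid-named tmp file and `os.replace`; samefile-only idempotency) can be sketched like this. A hedged illustration, not the shipped helper: the name, signature, and logging are assumptions.

```python
import logging
import os
import uuid
from pathlib import Path

log = logging.getLogger(__name__)


def make_link(target: Path, link: Path) -> bool:
    """Illustrative sketch of the link-creation fallback described above."""
    # samefile-only idempotency: a size check would keep stale links after
    # `ollama pull` swaps a tag to a same-sized blob.
    if link.exists() and os.path.samefile(link, target):
        return True
    # uuid4 (not pid) so concurrent scans never collide on the tmp name.
    tmp = link.with_name(f".{uuid.uuid4().hex}.tmp")
    try:
        try:
            tmp.symlink_to(target)
        except OSError:
            # Hardlink fallback (e.g. Windows without Developer Mode).
            # No shutil.copy2 last resort: copying multi-GB GGUFs inside a
            # sync API request would block the backend.
            os.link(target, tmp)
        # Atomic replace: rescans never observe a missing .gguf path.
        os.replace(tmp, link)
        return True
    except OSError as exc:
        log.warning("could not link %s -> %s: %s", link, target, exc)
        try:
            tmp.unlink(missing_ok=True)
        except OSError as cleanup_exc:
            log.warning("tmp cleanup failed for %s: %s", tmp, cleanup_exc)
        return False
```

Note that `os.replace` over an existing link also covers the "points at a different inode" case: the old link is replaced rather than reused.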
Daniel Han
ff23ce40b4
Fix review findings for chat-template repair (#5049) (#5056)
* Fix review findings for PR #5049

1. Sandbox fallback Jinja env in _VariantTokenizerProxy.apply_chat_template
   (use SandboxedEnvironment, matching _derive_assistant_prefix_by_render)
2. Unwrap benign outer-If guards in _template_ends_with_toplevel_for so
   templates like {% if messages %}{% for ... %}{% endfor %}{% endif %}
   are still repairable (preserves Qwen3-Guard rejection via else-branch
   and add_generation_prompt-name checks)
3. Preserve raw name_or_path in _VariantTokenizerProxy._source_path so
   local-path detection works for dict/list variant tokenizers
4. Context-aware strict-mode messages: omit "will still load" and
   "Set UNSLOTH_STRICT_CHAT_TEMPLATE=1" when already raising

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-16 08:02:05 -07:00
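
For context on fix (1) above: rendering a tokenizer-supplied chat template through Jinja's `SandboxedEnvironment` looks roughly like the sketch below. The template string and message shape here are hypothetical stand-ins; real templates ship with the tokenizer. The point is that the sandbox renders ordinary templates identically to a plain `Environment` while raising `jinja2.sandbox.SecurityError` on unsafe attribute access, so untrusted template strings cannot escape into Python internals.

```python
from jinja2.sandbox import SandboxedEnvironment

# Hypothetical minimal chat template, standing in for a real tokenizer's
# chat_template string.
template_src = (
    "{% for m in messages %}<|{{ m['role'] }}|>{{ m['content'] }}\n{% endfor %}"
    "{% if add_generation_prompt %}<|assistant|>{% endif %}"
)

env = SandboxedEnvironment()
rendered = env.from_string(template_src).render(
    messages=[{"role": "user", "content": "hi"}],
    add_generation_prompt=True,
)
print(rendered)
# Unsafe operations fail with SecurityError instead of executing, e.g.:
# env.from_string("{{ ''.__class__() }}").render()
```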
Daniel Han
b42e3a120d
Remove legacy venv Scripts entry from User PATH on upgrade (#5060)
Older installers persisted the venv Scripts directory directly in the
User PATH registry. The shim approach from #4961 no longer writes that
entry, but on upgrade the old one survived and python.exe / pip.exe
from the unsloth venv continued winning resolution in every new shell.

Before creating the shim, read the current User PATH, filter out any
entry matching $VenvDir\Scripts (using the same symmetric raw+expanded
comparison as Add-ToUserPath), and write back if changed. No-op on
fresh installs where the legacy entry was never written.

Confirmed on a real Windows machine: `where.exe python` was returning
the venv interpreter first even after the shim PR merged.
2026-04-16 07:36:59 -07:00
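
The actual fix is PowerShell, but the entry-matching logic it describes (drop any User PATH entry equal to `$VenvDir\Scripts`, comparing each entry both raw and after environment-variable expansion, case-insensitively) can be modelled in Python. Everything below is illustrative: the function name is made up, and only the comparison idea comes from the commit message.

```python
import ntpath


def strip_venv_scripts(user_path: str, venv_dir: str) -> str:
    """Illustrative model of the User PATH cleanup described above.

    Removes entries equal to <venv_dir>\\Scripts. Windows PATH entries may
    be stored unexpanded (e.g. "%UNSLOTH_VENV%\\Scripts"), so each entry is
    compared both raw and with %VAR% expansion, case-insensitively.
    """
    scripts = ntpath.normcase(ntpath.normpath(ntpath.join(venv_dir, "Scripts")))

    def matches(entry: str) -> bool:
        for form in (entry, ntpath.expandvars(entry)):
            if ntpath.normcase(ntpath.normpath(form)) == scripts:
                return True
        return False

    kept = [e for e in user_path.split(";") if e and not matches(e)]
    return ";".join(kept)
```

Writing the filtered value back only when it changed keeps the operation a no-op on fresh installs, matching the behaviour described above.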
Daniel Han
5b8643969e Revert "Remove legacy venv Scripts entry from User PATH on upgrade"
This reverts commit cae4a74297.
2026-04-16 14:20:43 +00:00