unsloth

mirror of https://github.com/unslothai/unsloth synced 2026-04-21 13:37:39 +00:00

Author	SHA1	Message	Date
Roland Tannous	f801e59c29	split venv_t5 into tiered 5.3.0/5.5.0 and fix trust_remote_code (#4878 ) * split venv_t5 into venv_t5_530 and venv_t5_550 for tiered transformers 5.x support * fix bfloat16 crash on T4 for FORCE_FLOAT32 models and disable trust_remote_code auto-enable for native t5 models * revert FORCE_FLOAT32 dtype change * restrict trust_remote_code auto-enable to Nemotron models only * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * use config.json model_type for tier detection, add unsloth/nvidia namespace guard * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Revert "[pre-commit.ci] auto fixes from pre-commit.com hooks" This reverts commit `fb43d468e2`. * Revert "use config.json model_type for tier detection, add unsloth/nvidia namespace guard" This reverts commit `fc49ae2453`. * add unsloth/nvidia namespace guard to Nemotron trust_remote_code auto-enable * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * reorder tier checks: all substring matches before config.json fetches * extract shared activate_transformers_for_subprocess into transformers_version.py * narrow Nemotron trust_remote_code to nemotron_h/nemotron-3-nano, add to export worker * clean venv_t5 dirs before re-install in setup.sh, clarify version alias comment * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * run venv_t5 migration outside deps fast-path gate in both setup scripts --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-07 20:05:01 +04:00
Lee Jackson	8c89b84bb6	Studio: Fix empty chat threads on navigation and stabilize new chat flow (#4872 ) * fix(chat): prevent implicit empty thread creation and stabilize new-chat flow * fix(chat): harden compare thread sync and simplify sidebar thread query * fix(chat): harden new-thread state sync and isolate compare active thread updates * fix(chat): stabilize new-thread state sync and prevent compare/session bleed * Fix thread restoration, handleNewThread guard, sidebar filter, and delete flow - Remove __LOCALID_ filter from getInitialSingleChatView: in this Dexie-backed adapter, AUI's __LOCALID_ prefixed IDs ARE the real persistent thread IDs stored by initialize(). Filtering them out breaks thread restoration on navigation. - Simplify handleNewThread to synchronous: the async Dexie message check is redundant (persistence is already deferred to first append) and strands users on legacy empty threads. Use a simple guard that checks the store's activeThreadId to detect unsent drafts. - Add message-count filter to sidebar: filter threads to only show those with at least one message, hiding legacy empty threads. - Add store-based sidebar highlighting fallback: use activeThreadId from the store when view.threadId is not set (nonce-backed chats). - Fix handleDelete to call onNewThread() instead of onSelect(), and clear activeThreadId, so the runtime properly resets after deleting the active thread. * Fix handleDelete nonce path and restore __LOCALID_ filter handleDelete was calling onNewThread() after clearing activeThreadId, but the handleNewThread guard sees !view.threadId && !activeThreadId and returns early, leaving the UI stuck on the deleted thread. Fix by directly calling onSelect with a new nonce instead. Restore __LOCALID_ filter in getInitialSingleChatView to prevent restoring unpersisted AUI local thread IDs on navigation. Without this filter, navigating away from /chat before sending a message would restore a non-existent thread that Dexie cannot fetch. 
--------- Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-04-06 09:32:54 -07:00
JYYYYYT	aa4c6010e1	fix(studio): custom folder scan fails to find GGUF variants when pointing directly at a model directory (#4860 ) Fix custom folder scanning when pointing directly at a model directory. When a user adds a custom scan folder that points directly at a model directory (e.g. /path/to/gemma-4-e2b-it-gguf/ containing config.json and gemma-4-E2B-it-BF16.gguf), the model list previously showed individual .gguf files as separate entries instead of recognizing the directory as a single model. Clicking any entry showed "No GGUF variants found" because list_local_gguf_variants received a file path and immediately returned empty. Changes: - Add _is_model_directory() helper that detects directories with both config metadata and actual model weight files (excludes mmproj GGUFs and non-weight .bin files like tokenizer.bin) - _scan_models_dir: detect self-model and return single directory entry - _scan_lmstudio_dir: surface model directories directly instead of descending into them as publisher folders; handle both root and child model directories - Add _resolve_gguf_dir() helper for GGUF path resolution that only falls back to parent directory when parent has model metadata - list_local_gguf_variants / _find_local_gguf_by_variant: use resolver so .gguf file paths inside model directories work correctly	2026-04-06 08:31:07 -07:00
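Note on aa4c6010e1: the directory check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR. The file names and suffixes in the constants, and the helper names `is_weight_file` and `is_model_directory`, are assumptions made for the example.

```python
from pathlib import Path

# Assumed names and suffixes for illustration; the real helper may differ.
CONFIG_FILES = ("config.json",)
WEIGHT_SUFFIXES = (".gguf", ".safetensors", ".bin")
NON_WEIGHT_BIN = ("tokenizer.bin",)


def is_weight_file(path: Path) -> bool:
    """Return True for files that look like model weights."""
    name = path.name.lower()
    if not name.endswith(WEIGHT_SUFFIXES):
        return False
    if name.endswith(".gguf") and "mmproj" in name:
        return False  # multimodal projector GGUF, not the model itself
    if name in NON_WEIGHT_BIN:
        return False  # .bin file that does not hold weights
    return True


def is_model_directory(directory: Path) -> bool:
    """A model directory needs config metadata and at least one weight file."""
    if not directory.is_dir():
        return False
    has_config = any((directory / name).is_file() for name in CONFIG_FILES)
    has_weights = any(
        entry.is_file() and is_weight_file(entry) for entry in directory.iterdir()
    )
    return has_config and has_weights
```

Requiring both conditions is what keeps a folder that only holds an mmproj file or a tokenizer from being listed as a model.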
Daniel Han	ab65b47c73	Add tests for is_vision_model() caching behaviour (#4855 ) * Add tests for is_vision_model() caching behaviour * Fix review feedback: remove dead helper, fix exception test - Remove unused _make_config() helper function (dead code) - Fix test_exception_result_cached to actually exercise the exception path by mocking load_model_config to raise OSError instead of using side_effect=[False] which only tested normal False returns * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Use strict mock specs so tests exercise intended detection paths Use MagicMock(spec=[]) for all config mocks so hasattr() only returns True for explicitly set attributes. Without this, MagicMock defaults make all hasattr checks truthy, allowing tests to pass via unintended detection paths (e.g. img_processor instead of vision_config). --------- Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-06 06:41:40 -07:00
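Note on ab65b47c73: the strict-spec point can be seen in a few lines. The `looks_like_vision_config` function below is a hypothetical stand-in for the detection logic, not the project's implementation. Only the `unittest.mock` behaviour is the point.

```python
from unittest.mock import MagicMock


def looks_like_vision_config(config) -> bool:
    # Hypothetical detection: either attribute marks a vision model.
    return hasattr(config, "vision_config") or hasattr(config, "img_processor")


# A default MagicMock invents any attribute on access, so hasattr() is always True.
loose = MagicMock()

# With spec=[], reading an attribute that was never set raises AttributeError,
# so hasattr() is True only for attributes the test assigned explicitly.
strict = MagicMock(spec=[])
strict_vlm = MagicMock(spec=[])
strict_vlm.vision_config = {"hidden_size": 1024}
```

With the loose mock, a test meant to exercise the `vision_config` path would also pass through `img_processor`, which is the unintended path the commit message mentions.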
Roland Tannous	278f462996	[Studio][Optimization]Add vision detection cache to is_vision_model() (#4853 ) * Add vision detection cache to is_vision_model() to avoid redundant subprocess spawns is_vision_model() is called 4-5 times per training run for the same model with zero caching. For transformers 5.x models, each call spawns a full subprocess (~6s each). This adds a module-level _vision_detection_cache dict following the same pattern as the existing _audio_detection_cache used by detect_audio_type(). The function is refactored into a thin cache wrapper around _is_vision_model_uncached(), saving ~12s per training run. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Include hf_token in vision cache key for gated model correctness Cache key is now (model_name, hf_token) instead of just model_name. This prevents stale False results when an unauthenticated probe for a gated model is followed by an authenticated call. * Remove test file from main PR - will be submitted separately * Fix vision cache: normalize model names and skip caching transient failures - Normalize model names in cache key using resolve_cached_repo_id_case() to avoid duplicate entries for different casings of the same HF repo (aligns with case normalization from #4822) - Return None instead of False on transient failures (network errors, subprocess timeouts, HF API issues) so the cache layer can distinguish "definitely not a vision model" from "failed to check" - Only cache definitive True/False results; transient failures are retried on the next call instead of being permanently locked in as False * Refine failure handling: cache deterministic failures, guard normalization - Subprocess non-zero exit, JSON errors, and general exceptions return False (deterministic, cached) instead of None (retryable). Only subprocess.TimeoutExpired returns None since timeouts are transient. 
- Wrap cache key normalization in try/except so resolve_cached_repo_id_case or normalize_path failures fall back to raw model_name instead of crashing callers. * Harden vision detection cache: fix transient failure handling, thread safety, token security - All subprocess failure paths now return None (transient) instead of False, preventing permanent misclassification of VLMs after temporary HF/auth/network errors - Use SHA256 fingerprint for hf_token in cache key instead of raw bearer token - Add threading.Lock with double-checked locking to prevent thundering herd of concurrent subprocess spawns for the same uncached model - Distinguish permanent failures (RepositoryNotFoundError, GatedRepoError, ValueError) from transient ones in _is_vision_model_uncached - Pass resolved/normalized model name to detection (not just cache key) - Log normalization fallback at debug level instead of silent swallow - Thread hf_token through callers in routes/models.py and trainer.py that previously omitted it * Refine lock strategy and token fingerprint - Move detection computation outside the lock to avoid serializing long-running subprocess spawns (60s timeout) and HF API calls across all concurrent model checks. Lock is now only held for cache writes. - Use full SHA256 digest for token fingerprint instead of truncated 16-char prefix to eliminate collision risk. * Fix huggingface_hub import fallback and use atomic cache read - Add fallback import path for RepositoryNotFoundError/GatedRepoError from huggingface_hub.utils (older hub versions) when .errors is not available - Use sentinel-based dict.get() for single atomic cache read instead of two-step in/[] pattern (future-proof for no-GIL runtimes) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-04-06 06:41:20 -07:00
Leo Borcherding	68965988cf	Fix/studio colab button message: Add fallback message for Colab Studio button when proxy URL fails (#4866) * Add fallback message for Colab Studio button when localhost link doesn't work * Make fallback message darker grey for better readability * Make fallback message bold for better visibility --------- Co-authored-by: LeoBorcherding <LeoBorcherding@users.noreply.github.com>	2026-04-05 21:57:45 -07:00
Daniel Han	4020a70a93	Add tests for cache case resolution (from PR #4822) (#4823) Tests for resolve_cached_repo_id_case and get_model_config case resolution, separated from the runtime changes in PR #4822.	2026-04-03 13:58:26 -07:00
Daniel Han	4f65cc94bc	Add Gemma 4 model sampling defaults (#4838) Add per-model YAML configs and MODEL_NAME_MAPPING entries for all 8 Gemma 4 models (4 instruct + 4 base): - gemma-4-31B-it / gemma-4-31B - gemma-4-26B-A4B-it / gemma-4-26B-A4B - gemma-4-E2B-it / gemma-4-E2B - gemma-4-E4B-it / gemma-4-E4B GGUF variants (only for -it models) resolve via the gemma-4 family entry in inference_defaults.json. Sampling defaults: temperature=1.0, top_p=0.95, top_k=64, min_p=0.0, no repetition or presence penalty. Matches gemma-3n and gemma-3.	2026-04-03 13:57:15 -07:00
Daniel Han	a32b871f0e	studio: add speculative decoding support (ngram-mod, on by default) (#4836 ) * studio: add speculative decoding support (ngram-mod, on by default) Enable n-gram speculative decoding for GGUF models in Unsloth Studio. Uses llama.cpp's ngram-mod mode which gives 10-40% faster generation with zero VRAM cost via a 4MB fixed hash table that auto-resets on low acceptance rates. Backend: - Add speculative_type field to LoadRequest, LoadResponse, and InferenceStatusResponse pydantic models - Add speculative_type parameter to LlamaCppBackend.load_model() with allowlist validation (ngram-simple, ngram-mod) - Pass --spec-type, --spec-ngram-size-n 16, --draft-max 24 flags to llama-server when ngram-mod is active - Default to ngram-mod for non-vision GGUF models server-side - Silently skip speculative decoding for vision models (unsupported in llama.cpp server-context.cpp) Frontend: - Add speculative_type to TS API types - Add speculativeType/loadedSpeculativeType to chat runtime store with default value of "ngram-mod" - Add On/Off toggle in Model settings section (GGUF only, hidden for vision models), included in dirty check for Apply/Reset - Wire speculative_type through model load request and response - Restore speculative type state on page refresh/reconnect * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: remove server-side speculative decoding override The backend was overriding speculative_type=None to "ngram-mod" for non-vision GGUF models, which prevented users from disabling spec decoding via the UI toggle. The frontend store already defaults to "ngram-mod", so the backend fallback was redundant and blocked the explicit "Off" setting. 
* fix: use recommended ngram-mod params from llama.cpp docs Update speculative decoding params to match the recommended values from llama.cpp docs (docs/speculative.md): --spec-ngram-size-n 24 (was 16, docs say small n not recommended) --draft-min 48 (was 0) --draft-max 64 (was 24, docs note MoEs need long drafts) Also fix comment: ngram-mod uses ~16 MB (4M entries * 4 bytes), not 4 MB. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add benchmark table and references to speculative decoding comment Include speedup numbers from llama.cpp PRs #18471 and #19164 as an inline comment so future readers understand the expected gains. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-03 13:56:59 -07:00
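Note on a32b871f0e: the flag handling can be sketched as below. The flag names and values are the ones quoted in the commit message. The function name `speculative_args`, raising ValueError for values outside the allowlist, and passing only `--spec-type` for ngram-simple are assumptions for illustration.

```python
from typing import List, Optional

ALLOWED_SPEC_TYPES = frozenset({"ngram-simple", "ngram-mod"})


def speculative_args(speculative_type: Optional[str], is_vision: bool) -> List[str]:
    """Build the extra llama-server arguments for speculative decoding."""
    if speculative_type is None or is_vision:
        return []  # turned off by the user, or silently skipped for vision models
    if speculative_type not in ALLOWED_SPEC_TYPES:
        raise ValueError(f"unsupported speculative_type: {speculative_type!r}")
    args = ["--spec-type", speculative_type]
    if speculative_type == "ngram-mod":
        # Values quoted in the follow-up commit that cites llama.cpp docs.
        args += [
            "--spec-ngram-size-n", "24",
            "--draft-min", "48",
            "--draft-max", "64",
        ]
    return args
```

Treating None as "off" matches the fix that removed the server-side override, so the UI toggle alone decides whether the flags are passed.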
Daniel Han	2c73ab7871	fix(studio): harden sandbox security for terminal and python tools (#4827 ) * fix(studio): harden sandbox security for terminal and python tools The existing command blocklist used naive str.split() which is trivially bypassable via quoting, full paths, nested shells, variable expansion, and cross-tool pivoting through Python os.system/subprocess. Fixes #4818. Changes: - Replace str.split() blocklist with shlex.split() + os.path.basename() tokenization and regex scanning at shell command boundaries - Add sanitized subprocess environment (_build_safe_env) that strips credentials (HF_TOKEN, WANDB_API_KEY, GH_TOKEN, AWS_, etc.) and restricts PATH to /usr/local/bin:/usr/bin:/bin - Add PR_SET_NO_NEW_PRIVS via prctl on Linux so sudo/su/pkexec fail at the kernel level regardless of how they are invoked - Add RLIMIT_NPROC (256) and RLIMIT_FSIZE (100MB) to prevent fork bombs and disk filling attacks - Extend AST safety checker to detect os.system(), os.popen(), subprocess.run/Popen/call/check_output, os.exec, os.spawn* calls containing blocked commands or dynamic (non-literal) arguments - Add cross-platform support: cmd.exe on Windows, bash on Unix; CREATE_NO_WINDOW flag on Windows, preexec_fn on Unix - Expand blocklist from 7 to 14 commands: add su, chown, passwd, mount, umount, fdisk, kill, killall, pkill - Apply all layers to both _bash_exec and _python_exec Zero measurable performance overhead -- shlex parsing and a single prctl syscall per subprocess fork. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix review findings: exception_catching dead code, false positives, process substitution - Include exception_catching reasons in _check_code_safety so bare except-in-loop timeout evasion is actually blocked (was computed in _check_signal_escape_patterns but never read by the caller) - Remove base.split() inner loop that caused false positives on quoted text arguments containing blocked words (e.g. echo "kill this process") - Add targeted nested shell detection for bash/sh/zsh -c arguments instead, which catches bash -c 'sudo whoami' without false positives - Add <() process substitution to the regex character class so diff <(rm -rf /path) is also caught - Fix error message to say "unsafe patterns" instead of specifically mentioning signal manipulation when other categories trigger * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Address review feedback: regex paths, keyword args, list element scanning - Regex now matches blocked commands after optional path prefix at shell boundaries (catches ls; /usr/bin/sudo and similar) - Nested shell detection uses os.path.basename so bash -c "/bin/rm" is caught - AST checker now inspects keyword arguments (not just positional) so subprocess.run(args="sudo ...", shell=True) is detected - List elements in subprocess calls are now checked via _find_blocked_commands for consistency (catches subprocess.run(["bash", "-c", "rm -rf /"])) - Dynamic argument check uses _is_safe_literal that validates list contents are all string literals * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix nested shell scan to only check the script body, not positional args bash -c 'script' arg0 arg1 -- only tokens[i+1] is the script body; subsequent tokens are $0, $1 positional parameters passed to the script and are not executed as shell 
commands. Scanning all remaining tokens caused false positives. * Add subshell parentheses to regex command boundary detection (sudo whoami) was not caught because ( was not in the regex character class for shell command boundaries. Add ( to the set alongside ;, &, \|, backtick, newline. * Address high-priority review findings from 7 parallel reviewers - Track from-imports of dangerous functions (from os import system, from subprocess import run as r, etc.) via shell_exec_aliases dict so bare-name calls are detected by the AST checker - Include the active Python interpreter and virtualenv directories in the sanitized PATH so pip, uv, and Studio packages remain accessible in the sandbox - Add Windows-specific blocked commands (rmdir, takeown, icacls, runas, powershell, pwsh) only on win32 platform - Add os.posix_spawn and os.posix_spawnp to _SHELL_EXEC_FUNCS - Handle tuple literals same as list literals in AST argument inspection (both _extract_strings_from_list and _is_safe_literal) * Fix false positive on check=True kwargs and recursive nested shell scanning - Only inspect command-carrying keyword arguments (args, command, executable, path, file) in the AST checker, not control flags like check=True, text=True, capture_output=True which are booleans and were incorrectly flagged as non-literal dynamic arguments - Replace split() in nested shell detection with recursive call to _find_blocked_commands so that quoted commands (bash -c '"sudo" whoami') and semicolons (bash -c "sudo;ls") within nested shells are properly detected through the full shlex + regex pipeline * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Move preexec_fn imports to module level and use find_library for libc Addresses two Gemini review findings: 1. preexec_fn thread safety: _sandbox_preexec previously imported ctypes and resource inside the function body, which runs between fork() and exec() in the child process. 
In a multi-threaded server, this could deadlock if the import machinery locks were held by another thread at fork time. Now all imports and the libc handle are resolved once at module load time, so _sandbox_preexec only calls C-level functions (prctl, setrlimit) with no Python import activity. 2. Hardcoded libc.so.6 path: replaced with ctypes.util.find_library("c") which works on glibc (libc.so.6), musl (libc.musl-.so.1), and other Linux distributions where libc has a different soname. Apply Gemini style suggestions: combined regex, dict.fromkeys, constant hoisting - Combine per-word regex loop into a single re.findall with alternation pattern, avoiding repeated regex compilation and searching - Replace manual dedup loop with dict.fromkeys for PATH entries - Hoist _CMD_KWARGS frozenset out of visit_Call to avoid recreating it on every AST node visit * Add cmd /c nested shell detection for Windows parity The nested shell scan only checked for Unix shells (bash -c, sh -c, etc). Add cmd /c and cmd.exe /c detection so that Windows nested shell invocations are also recursively scanned for blocked commands. The token scan already catches blocked commands at any position, so this is defense-in-depth for consistency across platforms. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Handle combined shell flags (-lc, -xc) and interleaved flags (--login -c) The nested shell scan only matched token == "-c" with the immediately preceding token being a shell name. This missed: - Combined flags: bash -lc 'rm ...' (-lc ends with c, is a valid combined flag meaning -l -c) - Interleaved flags: bash --login -c 'sudo ...' (--login sits between bash and -c) Now matches any short flag ending in 'c' (e.g. -lc, -xc, -ic) and walks backwards past intermediate flags to find the shell binary. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix /bin/bash bypass, remove RLIMIT_NPROC, reduce AST false positives Addresses three high-consensus findings from 20-reviewer pass: 1. /bin/bash -c 'sudo whoami' bypassed nested shell scan because the backwards flag-skip logic treated paths starting with / as flags. Now only skips tokens starting with - as Unix flags; on Windows only skips short /X flags (not /bin/bash style paths). [9/20] 2. RLIMIT_NPROC=256 caused subprocess.run to fail with EAGAIN because Linux enforces NPROC per real UID, not per process tree. Removed RLIMIT_NPROC entirely; RLIMIT_FSIZE and PR_SET_NO_NEW_PRIVS remain as the primary resource and privilege controls. [5/20] 3. AST checker rejected safe dynamic subprocess usage like cmd=["git","status"]; subprocess.run(cmd) as shell_escape_dynamic. Now only flags dynamic args for shell-string functions (os.system, os.popen, subprocess.getoutput, etc.) or when shell=True is explicitly set. List-based subprocess calls with shell=False (the default) do not pass through a shell and are not flagged. [12/20] * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Handle Windows drive letter paths and .exe extensions in command detection Gemini review found that Windows absolute paths (C:\Windows\System32\ shutdown.exe) and executable extensions (.exe, .com, .bat, .cmd) were not handled: - Token scan now strips .exe/.com/.bat/.cmd extensions before checking the blocklist, so sudo.exe matches sudo, shutdown.bat matches shutdown - Regex pattern now includes optional Windows drive letter prefix ([a-zA-Z]:[/\\]) and optional executable extension suffix, so commands after shell metacharacters with full Windows paths are also caught * Handle kwargs dict expansion, non-literal shell=, and except Exception false positive Addresses three findings from second 20-reviewer pass: 1. 
kwargs dict expansion (9/20): subprocess.run({"args": "rm ...", "shell": True}) bypassed the AST checker because kwargs were treated as opaque. Now expands literal dict kwargs to inspect their keys, and flags opaque kwargs (variable dicts) as unsafe. 2. Non-literal shell= values (7/20): shell=variable was treated as shell=False (safe). Now any shell= value that is not literally False is treated as potentially True (conservative default). 3. except Exception false positive (1/20): except Exception in a loop was flagged as timeout evasion, but Exception does not catch SystemExit or KeyboardInterrupt which are used for timeout enforcement. Narrowed to only flag except BaseException and except TimeoutError in loops. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-03 13:33:42 -07:00
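Note on 2c73ab7871: the two scanning layers (token scan plus regex at shell command boundaries) can be sketched as follows. This is a reduced illustration, not the Studio code. The blocklist is a small subset, and `find_blocked_commands` here is a hypothetical simplification of the helper named in the commit message.

```python
import os
import re
import shlex
from typing import Set

BLOCKED = frozenset({"sudo", "su", "rm", "kill"})  # illustrative subset only

# A blocked command right after a shell boundary (start, ; & | ` newline or "("),
# optionally preceded by a path prefix such as /usr/bin/.
_BOUNDARY_RE = re.compile(
    r"(?:^|[;&|`\n(])\s*(?:[\w./-]*/)?(" + "|".join(sorted(BLOCKED)) + r")\b"
)


def find_blocked_commands(command: str) -> Set[str]:
    found: Set[str] = set()
    try:
        tokens = shlex.split(command)  # respects quoting, unlike str.split()
    except ValueError:
        tokens = command.split()  # unbalanced quotes: fall back to a plain split
    for token in tokens:
        base = os.path.basename(token)  # /usr/bin/sudo -> sudo
        if base in BLOCKED:
            found.add(base)
    found.update(_BOUNDARY_RE.findall(command))
    return found
```

A quoted argument such as `echo "kill this process"` stays one token and is not flagged, while `(sudo whoami)` is caught by the boundary regex. Nested shells like `bash -c '...'` need the recursive scan described in the commit and are not covered by this sketch.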
Neodon	c027ec192e	fix(studio): ensure first chat tool call starts in session sandbox (#4810 ) Fixes #4809 On a new Studio chat, the first tool call could start before the frontend initializes the thread ID. That meant the first request could go out without a session_id, so the backend started the tool in the shared sandbox root instead of the chat's session sandbox. Frontend: - Eagerly initialize the thread when switching to a new chat - Resolve the thread ID once at request time and keep it stable through async model-load waits - Disable ActiveThreadSync during new-chat initialization to prevent stale thread IDs from being written back - Add error handling for thread initialization failures - Clear activeThreadId on all compare-mode entry paths to prevent cross-session leakage - Fix exitCompare to restore context usage from the saved view - Coerce falsy thread IDs to undefined for consistent backend/frontend fallback behavior - Use _default as the image sessionId fallback to match the backend Backend: - Use ~/studio_sandbox/_default when a request arrives without a session_id	2026-04-03 11:44:22 -07:00
Lee Jackson	a29b4e23fd	studio: reuse HF cached repo casing to prevent duplicate downloads (#4822 ) * fix(studio): reuse HF cached repo casing to prevent duplicate downloads * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Move cache case resolution tests to separate PR Tests for resolve_cached_repo_id_case and get_model_config case resolution belong in their own PR to keep this change focused on the runtime fix. * fix(studio): debug-log HF_HUB_CACHE fallback in path_utils * Fix stale memoization in resolve_cached_repo_id_case - Check exact-case path before memo to ensure a newly-appeared exact match always wins over a previously memoized variant - Validate memoized entries still exist on disk before returning them to prevent stale results when cache dirs are deleted/recreated * Minor cleanups for cache case resolution - Use .is_dir() instead of .exists() for exact-case cache check (cache entries are always directories) - Remove redundant fallback in _detect_audio_from_tokenizer since get_cache_path already handles case resolution and returns None when the model is not cached --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-04-03 05:48:24 -07:00
Wasim Yousef Said	50dede11cc	Allow non-LLM recipes to run and move Data tab first in executions (#4805 ) * feat: allow non-LLM recipes to run without provider block * feat: reorder execution tabs and add generation-aware data tab empty state * fix: add accessibility attrs to data tab spinner and use literal ellipsis * fix(studio): use shared spinner, stub provider, and hide unused LLM metrics Backend: inject stub model provider for sampler-only recipes so DataDesigner init does not reject empty provider lists. Frontend: use shared Spinner component, hide LLM columns metric and model usage card when recipe has no LLM columns. * Fix tab reset and terminal auto-scroll regressions for PR #4805 Reset detailTab to "data" when switching between executions so the Data tab default is applied consistently, not only on first mount. Also add detailTab to the terminal scroll effect deps so auto-scroll-to-bottom fires when the user opens the Overview tab after landing on Data. * Guard terminal scroll reset to only fire on Overview tab The previous scroll effect ran on every tab switch, which could reset the user's manual scroll position if they scrolled up in the terminal and briefly switched tabs. Now the scroll-to-bottom and sticky-bottom reset only fires when navigating to the Overview tab. * Use None for stub provider api_key instead of literal string The stub ModelProvider that satisfies the DataDesigner registry for non-LLM recipes should not carry a fake credential string. Using None avoids sending an Authorization header if the provider is ever inadvertently invoked. --------- Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-04-03 05:37:26 -07:00
Wasim Yousef Said	5b7c0615f3	feat(studio): differentiate web search and URL fetch in chat tool UI (#4802) Differentiate web_search query searches from URL fetches in the Studio chat UI. Backend (llama_cpp.py): - Emit "Reading: hostname" for URL fetches and "Searching: query" for query searches in SSE status events - Only show hostname for valid http/https URLs; schemeless/non-http URLs get "Reading page..." generic fallback - Strip www. prefix for consistency with the frontend Frontend (tool-ui-web-search.tsx): - Tool card shows "Read hostname" / "Reading hostname..." for URL fetches - Shows "Searched query" / "Searching for query..." for query searches - Uses new URL() with protocol check; falls back to "Read page" / "Reading page..." for non-http URLs	2026-04-03 05:03:27 -07:00
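Note on 5b7c0615f3: the backend status text rules can be sketched with the standard library. The function name `status_text` and the `query`/`url` keyword arguments are assumptions. The commit message does not say how the backend tells a fetch from a search.

```python
from typing import Optional
from urllib.parse import urlparse


def status_text(query: Optional[str] = None, url: Optional[str] = None) -> str:
    """Return the status line for a web_search tool call."""
    if url:
        parsed = urlparse(url)
        host = parsed.hostname or ""
        if parsed.scheme in ("http", "https") and host:
            if host.startswith("www."):
                host = host[len("www."):]  # match the frontend's display
            return f"Reading: {host}"
        return "Reading page..."  # schemeless or non-http URL
    return f"Searching: {query}"
```

For instance, `status_text(url="https://www.example.com/docs")` yields a hostname line, while `status_text(url="example.com/docs")` falls back to the generic text because no scheme is present.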
DoubleMathew	ac562bac66	Fix/llama.cppbuilding (#4804 ) * Simplify llama.cpp install logic * print release tag * Retry failed json decode * don't pull all ggml releases * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove test file changes from main PR Test changes for test_pr4562_bugfixes.py will be submitted in a separate PR to keep this PR focused on the install path simplification. * Fix setup.sh executable bit and direct tag lookup for pinned releases - Restore setup.sh file mode to 100755 (was accidentally changed to 100644) - Add direct GitHub API tag lookup in iter_release_payloads_by_time for non-latest requested tags (e.g. b7879) instead of relying on paginated release scans that may miss older releases beyond the 5-page limit - Update stale DEFAULT_PUBLISHED_REPO comment to match new value * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix force-compile default ref and remove dead code in setup.ps1 - Change FORCE_COMPILE_DEFAULT_REF from "main" to "master" in all three files (install_llama_prebuilt.py, setup.sh, setup.ps1) since ggml-org/llama.cpp uses "master" as its default branch, not "main". Using "main" would cause git clone --branch to fail when UNSLOTH_LLAMA_FORCE_COMPILE=1 with UNSLOTH_LLAMA_TAG=latest. - Remove dead if ($SkipPrebuiltInstall) block inside the else branch of setup.ps1 that could never be reached (the outer elseif already handles $SkipPrebuiltInstall=true). - Maintain setup.sh executable bit (100755). * Improve iter_release_payloads_by_time error handling for direct tag lookup When a pinned release tag is not found (HTTP 404), fall through to the paginated release scan instead of silently returning empty results. Non-404 errors (network failures, rate limits) are propagated to the caller so users get actionable error messages. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-04-03 00:34:20 -07:00
Roland Tannous	f91ef8f9b0	fix(studio): lazy-import transformers in model_config to fix 5.x version switch (#4806 ) * fix(studio): lazy-import AutoConfig in model_config.py to fix transformers 5.x version switch Move `from transformers import AutoConfig` from module level to inside load_model_config() where it is actually used. model_config.py is transitively imported at module load time via: core/inference/__init__ → llama_cpp → utils.models → model_config In inference subprocesses (mp.spawn), this chain runs before _activate_transformers_version() can prepend .venv_t5/ to sys.path. The eager import caches transformers 4.57.6 in sys.modules, and the subsequent sys.path change has no effect — Python always checks sys.modules before sys.path. Making the import lazy ensures transformers is not loaded until after version activation, so the subprocess picks up the correct version. * fix(studio): also lazy-import extract_model_size_b in llama_cpp.py Belt-and-suspenders: make the import that originally triggered the chain lazy as well, so future module-level AutoConfig additions in utils.models cannot reintroduce the problem. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-03 02:56:01 +04:00
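Note on f91ef8f9b0: the sys.modules versus sys.path behaviour behind this fix can be reproduced without transformers. The sketch below uses a throwaway module written to two temporary directories. The module name and the `simulate` and `demo` helpers are hypothetical, and the version strings are just labels.

```python
import importlib
import sys
import tempfile
from pathlib import Path

MODULE = "fake_transformers_demo"  # hypothetical stand-in module name


def _write_version(root: Path, version: str) -> str:
    """Create <root>/<version>/<MODULE>.py exposing VERSION."""
    directory = root / version
    directory.mkdir()
    (directory / f"{MODULE}.py").write_text(f"VERSION = {version!r}\n")
    return str(directory)


def simulate(old_dir: str, new_dir: str, eager: bool) -> str:
    """Mimic a subprocess that activates new_dir after start-up."""
    saved_path = list(sys.path)
    sys.modules.pop(MODULE, None)
    importlib.invalidate_caches()
    try:
        sys.path.insert(0, old_dir)
        if eager:
            importlib.import_module(MODULE)  # module-level import, before activation
        sys.path.insert(0, new_dir)  # "version activation" prepends the new path
        # A lazy import happens here; it consults sys.modules first, then sys.path.
        return importlib.import_module(MODULE).VERSION
    finally:
        sys.path[:] = saved_path
        sys.modules.pop(MODULE, None)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        old_dir = _write_version(root, "4.57.6")
        new_dir = _write_version(root, "5.5.0")
        return (
            simulate(old_dir, new_dir, eager=True),
            simulate(old_dir, new_dir, eager=False),
        )
```

With the eager import, the old module is already cached in sys.modules, so the later path change is ignored. With the lazy import, the first lookup happens after activation and resolves to the new directory.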
Daniel Han	e553a8ad0b	fix(studio): suppress fatal error when prebuilt manifest is missing (#4799 ) When DEFAULT_PUBLISHED_REPO is ggml-org/llama.cpp, the prebuilt resolver raises PrebuiltFallback because ggml-org releases do not include a llama-prebuilt-manifest.json asset. This was caught by the generic Exception handler and printed as "fatal helper error" to stderr, which triggers NativeCommandError on PowerShell. Catch PrebuiltFallback separately in the top-level __main__ handler and exit with EXIT_FALLBACK (code 2) instead of EXIT_ERROR (code 1). The message is still logged but without the "fatal helper error" prefix. The shell scripts already handle non-zero exits and fall back to source builds. Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>	2026-04-02 12:18:11 -07:00
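The exit-code split described in the commit above might look roughly like this. It is a sketch under assumptions: `PrebuiltFallback`, `EXIT_FALLBACK` (2) and `EXIT_ERROR` (1) are named in the commit message, while `run_helper`, `EXIT_OK`, and the choice of output stream are hypothetical.

```python
import sys

EXIT_OK = 0        # assumed success code, not stated in the commit
EXIT_ERROR = 1     # value taken from the commit message
EXIT_FALLBACK = 2  # value taken from the commit message


class PrebuiltFallback(Exception):
    """Signals that no prebuilt bundle is usable and a source build should run."""


def run_helper(main) -> int:
    """Run `main` and map its outcome to a process exit code."""
    try:
        main()
    except PrebuiltFallback as exc:
        # Logged without the "fatal helper error" prefix.
        # Where the message is written is an assumption of this sketch.
        print(str(exc), file=sys.stderr)
        return EXIT_FALLBACK
    except Exception as exc:
        print(f"fatal helper error: {exc}", file=sys.stderr)
        return EXIT_ERROR
    return EXIT_OK
```

The ordering matters: the specific `except PrebuiltFallback` clause has to come before the generic `except Exception`, otherwise the fallback is reported as a fatal error again.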
Daniel Han	934478ae31	fix(studio): revert llama.cpp default tag to latest (#4797 ) * fix(studio): revert llama.cpp default tag to latest The latest ggml-org/llama.cpp release (b8637) now includes Gemma 4 support. Revert the temporary "b8637" pin from #4796 to "latest" so the prebuilt resolver always picks the newest release automatically without needing manual tag bumps. * docs: add comment explaining latest vs master for llama.cpp tag Document in all three files why "latest" is preferred over "master" and when "master" should be used as a temporary override. --------- Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>	2026-04-02 11:52:37 -07:00
Daniel Han	401621618b	fix(studio): don't set trust_remote_code for Gemma 4 training (#4795 ) Gemma 4 is a native transformers 5.5 model and does not need trust_remote_code=True. The auto-enable logic (added for NemotronH) was catching all transformers 5.x models, including Gemma 4. When trust_remote_code=True, unsloth_compile_transformers() returns early without running the compiler. This disables the fused cross entropy patch, causing logged training loss to be inflated by the gradient_accumulation_steps factor. Exclude models matching "gemma-4" or "gemma4" from the auto-enable so the compiler runs and applies fused cross entropy correctly.	2026-04-02 11:44:26 -07:00
Daniel Han	8d1712b4ea	fix(studio): pin llama.cpp to b8637 release (Gemma 4 support) (#4796 ) ggml-org/llama.cpp b8637 includes Gemma 4 support (ggml-org/llama.cpp#21309). Revert the temporary "master" default back to a pinned release tag. This eliminates the HTTP 422 errors from the prebuilt resolver (which could not find a release matching "master"), avoids unnecessary source builds, and restores prebuilt binary downloads on all platforms. Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>	2026-04-02 11:43:53 -07:00
DoubleMathew	7ae9b7f45f	fix windows llama.cpp compile from source issue (#4793 ) * fix windows llama.cpp compile from source issue * undo local repo usage * fix llama.cpp install * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix windows * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: route resolve-source-build call through Invoke-LlamaHelper The --resolve-source-build call at the source-build resolution path was still calling install_llama_prebuilt.py directly instead of going through Invoke-LlamaHelper. On PS7+ with ErrorActionPreference=Stop, stderr from the 422 response (when tag is "master") would trigger a terminating NativeCommandError and crash setup. * fix: suppress stderr error records from Invoke-LlamaHelper ErrorActionPreference=Continue prevents termination but PowerShell still displays stderr lines as visible ErrorRecord objects. Capture all output via 2>&1 and split stdout from stderr manually so that stderr lines never appear on the console. When StderrPath is given the stderr content is written to that file for diagnostics. * fix: always rebuild llama.cpp on Windows when tag is master When the requested llama.cpp tag is "master" (a moving target), skip the "already built" early exit so the build path runs and syncs to the latest commit. Without this, existing llama-server binaries from an older build (e.g. b8635 which lacks Gemma 4 support) are reused and model loading fails. Pinned tags (e.g. b8635) still skip the rebuild when the binary already exists, since the tag is immutable. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>	2026-04-02 11:43:46 -07:00
Daniel Han	7023e2a4ff	fix(studio): prioritize curated defaults over HF download ranking in Recommended (#4792 ) The model list merge order was `top_gguf + top_hub + static_models`, which meant the HF download-ranked models always came first. New models like Gemma 4 have low download counts and were not in the HF top-40, so they got buried after 80 other models despite being at the top of the curated static defaults in defaults.py. Flip the merge to `static_models + top_gguf + top_hub` so editorial picks (new model launches, promoted models) always appear first in the Recommended section, with HF popularity backfilling after. Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>	2026-04-02 10:46:53 -07:00
Daniel Han	1ce83c40aa	fix(studio): build llama.cpp from master instead of latest release tag (#4790 ) The latest ggml-org/llama.cpp release (b8635) does not include Gemma 4 support (ggml-org/llama.cpp#21309 merged after the release was cut). This causes `llama-server` to fail with "unknown model architecture: gemma4" when loading Gemma 4 GGUFs. Temporarily default _DEFAULT_LLAMA_TAG to "master" so all new installs build from the llama.cpp master branch which includes Gemma 4 support. Once a new upstream release is cut with Gemma 4, this can be reverted back to "latest". Changes: - setup.sh: add _DEFAULT_LLAMA_TAG="master" maintainer default - setup.ps1: add $DefaultLlamaTag="master" maintainer default - install_llama_prebuilt.py: change DEFAULT_LLAMA_TAG fallback to "master" Users can still override via UNSLOTH_LLAMA_TAG env var.	2026-04-02 09:45:56 -07:00
Daniel Han	2af53bf9a6	Pin transformers and huggingface-hub in main Studio venv (#4788 ) Revert the >= loosening from `f9c4b08` back to exact pins. Using transformers>=4.57.6 allows pip to install 5.x into the main Studio venv, which breaks huggingface_hub imports (is_offline_mode removed in newer hub versions). The main venv must stay on transformers==4.57.6 and huggingface-hub==0.36.2. The 5.x version lives only in .venv_t5/ and is dynamically switched via sys.path at runtime.	2026-04-02 09:21:30 -07:00
Daniel Han	a241c58d84	Use transformers v5.5-release branch and pin to 5.5.0 (#4786 ) The v5.5-release branch now exists on huggingface/transformers. Use transformers==5.5.0 for all install paths and git+transformers.git@v5.5-release for the MLX installer. Also bumps huggingface_hub from 1.7.1 to 1.8.0 in setup.sh and setup.ps1 to stay consistent.	2026-04-02 09:10:02 -07:00
Daniel Han	a353557249	Force llama.cpp to always use mainline ggml-org (#4785 ) Hardcode the release repo to ggml-org/llama.cpp and remove the UNSLOTH_LLAMA_RELEASE_REPO and UNSLOTH_LLAMA_SOURCE env var overrides so that all users always build/download from mainline llama.cpp.	2026-04-02 09:03:00 -07:00
Daniel Han	f1c3b9caa9	Pin Gemma-4 transformers requirement to 5.5.0 stable (#4784 ) Gemma-4 support landed in transformers main (huggingface/transformers#45192). Update the version pin from 5.5.0.dev0 to 5.5.0 across loader, Studio version switcher, and the MLX installer. Also fix install_gemma4_mlx.sh which referenced a non-existent v5.5-release branch -- pin it to the correct commit (91b1ab1) instead.	2026-04-02 08:59:21 -07:00
Daniel Han	4f9986ecb9	fix(studio): improve tool-calling re-prompt for small models (#4783) Small GGUF models (<9B) frequently generate full code or lengthy explanations instead of calling tools, bypassing the existing plan-without-action re-prompt mechanism. Four issues: 1. _REPROMPT_MAX_CHARS=500 was too low -- models that output full HTML/code responses (often 1000+ chars) never triggered the re-prompt at all, since it only fires on short responses. 2. _MAX_REPROMPTS=1 gave the model only one chance to comply. Small models often need 2-3 nudges before switching from text generation to tool calling. 3. The re-prompt text ("Please use the available tools...") was too polite for small models to follow reliably. 4. Tool-calling detection missed chat templates using Jinja whitespace-trimming syntax ({%- if tools -%}) since only ({%- if tools %}) and ({% if tools %}) were checked. Changes: - Raise _REPROMPT_MAX_CHARS from 500 to 2000 so longer responses (code blocks, multi-paragraph plans) still trigger re-prompts - Raise _MAX_REPROMPTS from 1 to 3 for more retry budget - Use direct, imperative re-prompt language that small models follow more reliably ("STOP. You MUST call a tool NOW.") - Strengthen the system prompt tool nudge to explicitly forbid outputting code blocks (redirect to the python tool instead) - Add Jinja whitespace-trimmed variants to the tool_markers list so all template styles are detected correctly	2026-04-02 08:59:02 -07:00
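The template-marker check mentioned in the commit above can be illustrated with a plain substring scan. This is a sketch, not the Studio implementation: the first three markers are the ones quoted in the commit message, the fourth (`{% if tools -%}`) is an assumed extra variant, and the names `TOOL_MARKERS` and `template_supports_tools` are hypothetical.

```python
# Markers that suggest a chat template has a tool-calling branch.
TOOL_MARKERS = (
    "{% if tools %}",
    "{%- if tools %}",
    "{%- if tools -%}",  # whitespace-trimmed form named in the commit
    "{% if tools -%}",   # assumed additional variant, not from the commit
)


def template_supports_tools(chat_template: str) -> bool:
    """Return True if any known tool marker appears in the template text."""
    return any(marker in chat_template for marker in TOOL_MARKERS)


trimmed = "{%- if tools -%}\n{{ tools | tojson }}\n{%- endif -%}"
plain = "{% for m in messages %}{{ m.content }}{% endfor %}"
print(template_supports_tools(trimmed), template_supports_tools(plain))
```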
Daniel Han	f9c4b08726	UI Changes (#4782 ) * UI Changes * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove unrelated test file --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-02 08:05:55 -07:00
Daniel Han	c8d311a053	feat(studio): display images from Python tool execution in chat UI (#4778 ) * feat(studio): display images from Python tool execution in chat UI When the model calls the Python tool to create a matplotlib plot or other image file, the image now displays inline in the chat output instead of being invisible to the user. Backend: - Detect new image files (png/jpg/gif/webp/bmp) after Python subprocess completes by diffing os.listdir before/after execution - Append __IMAGES__ sentinel to tool result for frontend consumption - Strip sentinel before injecting result into LLM context (role: tool) so the model never sees file paths - Add GET /sandbox/{session_id}/{filename} endpoint with JWT auth (header or query param), path traversal protection, extension allowlist, realpath containment check, and nosniff header Frontend: - Parse __IMAGES__ sentinel in tool_end SSE events, create structured result with text/images/sessionId - Render <img> tags in Python tool UI pointing at the sandbox endpoint Also fixes a bug where SyntaxError in user code was misreported as "unsafe code detected" instead of showing the actual Python traceback. The _check_code_safety function now lets SyntaxError pass through to the subprocess for a proper error message. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix(studio): improve SVG detection and strip XML preamble Handle <?xml ...?> declarations before <svg> tags in code fences, strip XML declaration from SVGs before data URI rendering, and update the sloth suggestion prompt to request showing code. * fix(studio): persist parentId so retries survive reload The append() handler was destructuring only { message } from ExportedMessageRepositoryItem and discarding parentId. When loading a saved thread, load() used ExportedMessageRepository.fromArray() which chains all messages sequentially, flattening retry branches into a linear list. 
Now append() writes parentId to the MessageRecord, and load() reconstructs the tree when parentIds are present. Old threads without parentId fall back to the existing fromArray() behavior. * fix(studio): address review findings for image display and retry persistence Image detection: - Use mtime comparison instead of filename-only diff so overwritten files (e.g. plt.savefig("chart.png") called twice) are detected Sentinel parsing: - Use rsplit/lastIndexOf instead of split/indexOf so user code that prints __IMAGES__: does not collide with the backend sentinel Mixed legacy/new threads: - For old messages without a stored parentId, infer sequential parent from the previous message instead of null, preventing multiple roots Sandbox endpoint: - Change Cache-Control from "public, max-age=3600" to "private, no-store" since these are authenticated responses --------- Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-02 05:08:16 -07:00
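The `rsplit` point in the review fixes above is easiest to see with a concrete payload. The sketch below assumes a wire format the commit does not spell out: the backend appends `__IMAGES__:` followed by a JSON list of filenames as the last thing in the tool result. The function name `split_tool_result` is hypothetical.

```python
import json

SENTINEL = "__IMAGES__:"  # exact sentinel layout is an assumption of this sketch


def split_tool_result(raw: str) -> tuple[str, list]:
    """Split a tool result into (text for the model, image filenames)."""
    if SENTINEL not in raw:
        return raw, []
    # Split on the LAST occurrence so text printed by user code that
    # happens to contain the sentinel stays part of the visible output.
    text, payload = raw.rsplit(SENTINEL, 1)
    try:
        images = json.loads(payload)
    except json.JSONDecodeError:
        return raw, []  # not a backend sentinel after all
    return text.rstrip("\n"), images


raw = 'user printed __IMAGES__: here\n__IMAGES__:["chart.png"]'
print(split_tool_result(raw))
```

Splitting on the first occurrence instead would hand `here\n__IMAGES__:["chart.png"]` to the JSON parser and lose the image list.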
Lee Jackson	5a5f1a4f34	studio: fix chat font changes leaking outside chat page (#4775 ) * fix(frontend): scope sans font overrides to chat thread only * fix(frontend): use font-sans fallback for heading stack and simplify chat font rules --------- Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-04-02 05:04:23 -07:00
DoubleMathew	1ce8a8e7cd	Feat/custom llama prebuilt (#4771 ) * update logic to incorporate custom prebuilt installs * bug fixes * update for review comments * fix tags * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Separate test changes from main PR Move test file changes out of this PR to keep the diff focused on the install_llama_prebuilt.py and setup script changes. Test updates will be submitted in a follow-up PR. * Fix branch ref normalization and harden JSON parsing - Add checkout_friendly_ref() to strip refs/heads/ prefix from branch refs before emitting them in SourceBuildPlan. git clone --branch does not accept fully qualified refs like refs/heads/main. - Apply normalization in source_build_plan_for_release() and the direct-ref fallback in resolve_source_build_plan(). - Allow validated_checksums_for_bundle() to accept releases that carry only an exact-commit source archive without the legacy upstream-tag source tarball. - Add 2>/dev/null \|\| true guards to all inline python -c JSON parsing in setup.sh so a malformed payload does not abort the script under set -e. * Fix Windows CUDA asset ordering and tag ref normalization - Reorder windows_cuda_upstream_asset_names to prefer the main binary archive (llama-{tag}-bin-win-cuda-) over the cudart sidecar archive (cudart-llama-bin-win-cuda-). The cudart ZIP only contains CUDA runtime DLLs, not llama-server or llama-quantize binaries. - Extend checkout_friendly_ref to also strip refs/tags/ prefix for tag refs, matching the refs/heads/ handling for branch refs. * Simplify JSON parsing consistency in setup.sh Use json.load(sys.stdin) consistently for all inline JSON parsing in setup.sh, instead of the more complex json.loads(raw) pattern on the install-tag resolution path. The 2>/dev/null \|\| true guard already handles empty/malformed input gracefully. 
* Fix source build plan fallback for commit ref kind in PR #4771 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Daniel Han <daniel@unsloth.ai> Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-04-02 04:52:26 -07:00
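The ref normalization described in the commit above amounts to stripping a known prefix before the ref is passed to `git clone --branch`. The function name `checkout_friendly_ref` comes from the commit message; the body below is an assumed, minimal version rather than the actual implementation.

```python
def checkout_friendly_ref(ref: str) -> str:
    """Strip refs/heads/ or refs/tags/ so the ref can be used with --branch."""
    for prefix in ("refs/heads/", "refs/tags/"):
        if ref.startswith(prefix):
            return ref[len(prefix):]
    return ref  # already short (branch name, tag name, or commit)


for ref in ("refs/heads/master", "refs/tags/b8637", "b8637"):
    print(ref, "->", checkout_friendly_ref(ref))
```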
Daniel Han	e4d1499230	fix(studio): prevent small models from stalling on tool-calling tasks (#4769 ) * fix(studio): prevent small models from stalling on tool-calling tasks Small GGUF models (< 9B params) in "Think, Search, Code" mode would often describe what they planned to do ("Let me create this dashboard") and then stop generating without ever calling a tool. Three changes: 1. Simplify web_tips for small models: remove the "fetch its full content by calling web_search with the url parameter" guidance for models < 9B. This multi-step instruction causes small models to plan elaborate search-then-fetch-then-code sequences they cannot reliably execute. 2. Add "always call tools directly" imperative to the system prompt nudge so models act immediately instead of narrating their intentions. 3. Add plan-without-action re-prompt in the agentic loop: when the model emits planning text (matching patterns like "let me", "I'll", etc.) without calling any tool, inject a nudge asking it to call the tool and continue the loop. Capped at 2 re-prompts per request. Benchmarked with Qwen3.5-4B-GGUF (N=5 trials per variant): - Baseline: 40% of requests had any tool call - Combined fix: 100% of requests had at least one tool call * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-02 02:11:07 -07:00
AdamPlatin123	ba8081fc96	fix(chat): correct loading text for cached models during inference (#4764 ) Distinguish between actual network downloads and GPU memory loading for cached LoRA adapters in Studio chat. - Add isCachedLora detection for local LoRA adapter paths using comprehensive cross-platform regex (Unix, Windows, UNC, WSL, tilde) - Thread isCachedLora through loadInfo to chat-page inline status for proper 3-way distinction (cached / local LoRA / downloading) - Skip download progress polling for cached LoRA models (no useless /download-progress API calls) - Fix initial toast state to use isCachedLoad consistently instead of only checking isDownloaded - Fix cancelLoading toast to not mention background downloads for cached/local loads - Keep download-specific text ("Downloading model..." / "Download complete") inside the download-only polling block	2026-04-01 20:24:48 -07:00
Lee Jackson	ca4ea8b9fb	studio: align composer/code, unify fonts, and remove tool collapse jitter (#4763 ) - Add min-w-0 guards to thread/message/markdown containers to prevent content overflow past the composer width - Unify chat typography from Hellix/Space Grotesk to the sans stack, keeping monospace for code blocks and inline code - Restructure desktop navbar right-side controls with shrink-0 wrappers for consistent spacing across HoverCard roots - Soften tool-call label styling (font-medium + text-foreground/85 instead of bold) - Add responsive code block sizing via @container queries - Add horizontal scrolling for wide code blocks within the thread column - Scope list-item code block alignment CSS to .aui-thread-root - Preserve useScrollLock in tool-fallback and tool-group collapsibles - Fall back to bg-background on ViewportFooter when hideComposer is true - Widen inline code monospace selector to cover th, blockquote, and heading elements - Remove unused @fontsource-variable/space-grotesk import	2026-04-01 19:57:10 -07:00
DoubleMathew	71b934ef9d	Fix custom llama.cpp source builds and macos metal source builds (#4762 ) * Fix script unbound variable error * remove stale test script, add llama.cpp metal source builds, update tests * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Metal precedence, test sync, and add behavioral tests - Move macOS arm64 Metal check before CUDA/ROCm in GPU backend decision chain so Metal is not bypassed when nvcc is in PATH - Remove RPATH flags from CPU fallback CMAKE_ARGS (only needed for Metal library linking) - Update test_llama_pr_force_and_source.py to match _CLONE_ARGS rename from _CLONE_BRANCH_ARGS in setup.sh - Add confirm_install_tree guard test for existing_install_matches_choice - Add TestMacOSMetalBuildLogic bash subprocess tests verifying Metal flag selection, nvcc precedence, and CPU fallback behavior * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Metal CPU fallback to also cover cmake build failures and update tests * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * 1. _GPU_BACKEND_FRAGMENT synced -- removed dead CPU_FALLBACK_CMAKE_ARGS= init (6/8) 2. RPATH assertion replaced -- new test_macos_arm64_cpu_fallback_args_exclude_rpath checks the actual runtime CPU_FALLBACK_CMAKE_ARGS output for @loader_path and -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON (6/8) 3. _TRY_METAL_CPU_FALLBACK=false reset after both configure-failure and build-failure fallback branches in setup.sh (4/8) 4. macOS test now removes libmtmd.0.dylib instead of the platform-agnostic convert_hf_to_gguf.py (3/8) 5. Empty-string tag test added -- test_empty_tag_omits_branch_flag for resolved_tag= (2/8) 6. 
RPATH checks on cmake call logs -- both fallback tests now assert @loader_path and -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON are absent from CPU fallback cmake calls, plus baseline flag preservation (multiple) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tests clean up * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-01 14:06:39 -05:00
Daniel Han	39fe23ded8	Tests for architecture-aware KV cache estimation (#4760 ) * test: add 66 tests for architecture-aware KV cache estimation Covers all 5 estimation paths (MLA, Hybrid Mamba, Sliding Window, Standard GQA, Legacy), GGUF parser for 8 new metadata fields, _can_estimate_kv gate conditions, quantization scaling, edge cases, path priority ordering, and lifecycle (init/unload/reparse). Zero external dependencies beyond pytest. No GPU or network required. Cross-platform (Linux, macOS, Windows, WSL). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-01 06:13:37 -07:00
Daniel Han	653eb3819a	fix(studio): allow context length slider to reach model's native limit (#4746 ) * fix(studio): allow context length slider to reach model's native limit The context length slider was hard-capped to the VRAM-estimated maximum, preventing users from requesting higher context even though the backend already handles it safely (multi-GPU selection, --fit fallback). Expose the model's native context length from GGUF metadata as a separate API field and use it as the slider ceiling instead. Add an amber warning when the selected context exceeds the estimated VRAM capacity. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Raise VRAM budget to 90% and add native_context_length tests Increase the GPU memory utilization threshold from 70% to 90% across _select_gpus and _fit_context_to_vram, allowing longer context lengths before VRAM capping kicks in. Add 33 tests for the native_context_length feature covering the backend property, context value separation invariants, Pydantic models, route completeness, edge cases, and cross-platform binary I/O. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-01 06:12:52 -07:00
Daniel Han	d22b2a18f9	fix: add tokenizers to no-torch deps and TORCH_CONSTRAINT for arm64 macOS py313+ (#4748 ) * fix: add tokenizers to no-torch runtime deps and add TORCH_CONSTRAINT for arm64 macOS py313+ Two installer fixes: 1. Add `tokenizers` to `no-torch-runtime.txt` before `transformers`. Without it, `from transformers import AutoConfig` crashes on startup because `--no-deps` skips transitive dependencies. 2. Add `TORCH_CONSTRAINT` variable to `install.sh`. On arm64 macOS with Python 3.13+, tighten the torch requirement to `>=2.6` since torch <2.6 has no cp313 arm64 wheels. The variable replaces the previously hard-coded constraint in the uv pip install line. Includes 66 tests (42 pytest + 24 bash) covering: - Structural checks on install.sh, install.ps1, no-torch-runtime.txt - Shell snippet tests with mocked python for 13 platform/version combos - Mock uv integration verifying correct constraint string - E2E venv tests on Python 3.12 and 3.13 confirming AutoConfig works - Negative control proving AutoConfig fails without tokenizers - Full no-torch sandbox regression guards (safetensors, huggingface_hub) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix incomplete no-torch manifest and align E2E tests with real --no-deps path - Add missing transitive deps to no-torch-runtime.txt that are required under --no-deps: regex, typing_extensions, filelock, httpx, httpcore, certifi, idna, anyio, sniffio, h11. Without these, `from transformers import AutoConfig` still fails after install.sh --no-torch. - Change all E2E tests to use --no-deps (matching what install.sh does) instead of normal dep resolution. Previous tests passed even with an incomplete manifest because uv backfilled transitive deps. - Rewrite negative control to derive from the real no-torch-runtime.txt with tokenizers stripped, proving the specific fix matters. - Replace GNU-only sed -i with heredoc in shell test for macOS compat. 
- Remove unused os/sys imports from Python test file. - Quote SKIP_TORCH and mock uv paths in bash -c strings. * Assert install succeeds before checking import results in E2E tests Address review feedback: test_torch_not_importable and test_tokenizers_directly_importable in Group 3 now assert that uv pip install returns 0 before checking import behavior. This prevents false positives when the install itself fails silently. * Assert install succeeds in negative control and tighten error check - Add missing install-success assertion in test_negative_control_no_tokenizers to prevent false positives from network/install failures. - Tighten error message check to look for "tokenizers" in stderr or ModuleNotFoundError, rather than the generic "No module" substring which could match unrelated import failures. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-01 06:12:17 -07:00
Daniel Han	76cb48be0b	fix: studio web search SSL failures and empty page content (#4754 ) - Fix SSL handshake failures (SSLV3_ALERT_HANDSHAKE_FAILURE, CERTIFICATE_VERIFY_FAILED) when fetching HTTPS pages by introducing _PinnedHTTPSConnection that separates TCP connect (to pinned IP) from TLS handshake (with real hostname for SNI/cert verification) - Fix SSRF DNS-rebinding vulnerability: previous impl swapped conn.host before connect(), causing fresh DNS resolution; new subclass keeps TCP pinned to validated IP - Fix SPA/JS-rendered doc sites returning empty content by rotating real browser User-Agents (Chrome/Firefox/Safari) - Strip nav/footer from HTML-to-Markdown output so article content is not buried under navigation chrome - Increase raw fetch cap from 64KB to 512KB so SSR article content is reached on GitBook/Docusaurus/Next.js pages - Fix IPv6 address bracketing in URL netloc construction - Hoist SSL context, handler classes, and stdlib imports to module level (created once, not per-call) - Use consistent UA across redirect hops to avoid breaking session-aware bot detection	2026-04-01 06:12:02 -07:00
DoubleMathew	428efc7d95	Resolve latest usable published llama.cpp release instead of fixed pinned tag (#4741 ) Replaces the fixed prebuilt llama.cpp tag with dynamic published-release resolution, adds bounded fallback across older published releases, and introduces maintainer-editable defaults for PR/source overrides. Changes: - Resolve latest from the latest usable published release in unslothai/llama.cpp - Use the selected release upstream_tag as the authoritative llama.cpp version - Prefer Unsloth-published platform assets when available - Fall back to same-tag upstream ggml-org/llama.cpp assets where allowed - Keep Linux CUDA anchored to Unsloth-published CUDA bundles only - Add bounded fallback across older Unsloth published releases - Add separate busy/in-use install handling (exit code 3) - Skip reinstall when the installed bundle already matches the selected candidate - Add maintainer-editable _DEFAULT_LLAMA_PR_FORCE and _DEFAULT_LLAMA_SOURCE - Harden env parsing so malformed installer env vars do not crash import-time fallback logic - Honor UNSLOTH_LLAMA_RELEASE_TAG in all resolve steps - Always sync git remote URL in existing-checkout path	2026-04-01 06:06:17 -07:00
Daniel Han	77e1a9edc9	feat(studio): architecture-aware KV cache VRAM estimation (#4757 ) * feat(studio): architecture-aware KV cache VRAM estimation Replace the single legacy formula (2 * n_kv_heads * head_dim * n_layers * n_ctx * bpe) with 5-path estimation that reads 8 additional GGUF metadata fields: 1. MLA (DeepSeek-V2/V3, GLM-4.7, GLM-5, Kimi-K2.5) -- K-only cache using compressed KV latent + RoPE; no separate V allocation 2. Hybrid Mamba (Qwen3.5-27B, Qwen3.5-35B-A3B) -- only attention layers (1 in N) carry KV; Mamba layers have none 3. Sliding Window (Gemma-3, gpt-oss) -- SWA layers cache min(ctx, window) tokens instead of the full context 4. Standard GQA -- uses explicit key_length/value_length from GGUF instead of embed // n_heads (which is wrong for many models) 5. Legacy fallback -- identical to old formula for old GGUFs New GGUF fields parsed: attention.key_length, attention.value_length, attention.sliding_window, full_attention_interval, attention.kv_lora_rank, attention.key_length_mla, ssm.inner_size, ssm.state_size. Validated against 9 real GGUF files (72/72 field checks pass). The legacy formula was off by +682% for Gemma-3 and -81% for DeepSeek-V3.1. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix MLA fallback and SWA global/local ratio heuristic Two fixes based on review findings: 1. MLA fallback now uses key_length_mla from GGUF metadata instead of hardcoded rope_dim=64. Falls back to 64 only when key_length_mla is absent. This ensures correct estimates for MLA variants that use rope dimensions other than 64. 2. SWA global/local layer ratio changed from 50/50 to 1/4 (25% global, 75% SWA). Most sliding window architectures have predominantly local layers (Gemma-3 uses ~17% global, gpt-oss uses ~50%). The 1/4 heuristic is closer to the common case and still a large improvement over the legacy formula which ignores SWA entirely. 
* Tighten _can_estimate_kv gate and treat sliding_window=0 as disabled Two additional fixes from review round 1 (5/8 and 4/8 reviewer consensus): 1. _can_estimate_kv now requires BOTH key_length AND value_length for the explicit-dims path. Previously key_length alone was enough, which could cause silent fallthrough to the legacy formula with fabricated defaults (n_kv=1, head_dim=128) when value_length was absent from the GGUF. 2. SWA path now requires sliding_window > 0. Some GGUFs use 0 as a disabled sentinel. Without this guard, min(ctx, 0) would zero out all SWA layer contributions, severely underestimating KV cache. * Fix MLA n_kv safety and use ceiling division for hybrid path Addresses Gemini Code Assist review findings: 1. MLA path now uses n_kv_mla = n_kv_heads or 1 (not n_heads). This prevents a 128x overestimate for DeepSeek-V3 if head_count_kv is absent from the GGUF (n_heads=128 would have been used instead). 2. Hybrid path now uses ceiling division for attention layer count. This prevents undercounting by 1 when n_layers is not perfectly divisible by full_attention_interval. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-01 06:04:12 -07:00
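Three of the estimation rules from the commit above can be written down compactly. This is a sketch under stated assumptions, not the Studio code: the legacy formula and the ceiling division are taken from the commit text, while the function names, the way layers are split into global and sliding-window groups (ceiling of one quarter), and the use of separate key and value lengths are illustrative choices.

```python
def legacy_kv_bytes(n_kv_heads: int, head_dim: int, n_layers: int,
                    n_ctx: int, bpe: int) -> int:
    """Legacy formula quoted in the commit: K and V for every layer and token."""
    return 2 * n_kv_heads * head_dim * n_layers * n_ctx * bpe


def hybrid_attention_layers(n_layers: int, full_attention_interval: int) -> int:
    """Ceiling division, so a trailing partial group still counts one layer."""
    return -(-n_layers // full_attention_interval)


def swa_kv_bytes(n_kv_heads: int, key_length: int, value_length: int,
                 n_layers: int, n_ctx: int, sliding_window: int, bpe: int) -> int:
    """Sliding-window estimate with an assumed 1/4 global, 3/4 local split."""
    per_token_per_layer = n_kv_heads * (key_length + value_length) * bpe
    if sliding_window <= 0:
        # 0 is treated as "disabled": every layer caches the full context.
        return per_token_per_layer * n_layers * n_ctx
    n_global = -(-n_layers // 4)  # rounding direction is an assumption
    n_local = n_layers - n_global
    cached_local = min(n_ctx, sliding_window)
    return per_token_per_layer * (n_global * n_ctx + n_local * cached_local)


full = legacy_kv_bytes(8, 128, 32, 8192, 2)
windowed = swa_kv_bytes(8, 128, 128, 32, 8192, 1024, 2)
print(full, windowed, hybrid_attention_layers(50, 4))
```

With `sliding_window=0` the sliding-window function reduces to the legacy result whenever key and value lengths both equal `head_dim`, which gives a quick sanity check on the disabled-sentinel guard.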
Roland Tannous	41df4ec437	feat(studio): strip org prefix in model search to surface unsloth variants (#4749 ) When searching for a specific publisher model (e.g. `openai/gpt-oss-20b`), the unsloth search used the full `openai/gpt-oss-20b` string with `author=unsloth`, which returned zero results because no unsloth model contains the publisher prefix in its name. Users never discovered unsloth variants. This PR strips the org prefix for publisher-qualified queries so unsloth variants surface, then pins the original publisher model after a small batch of unsloth results. Plain queries (no slash) and unsloth-prefixed queries are unchanged. - Strict regex (`/^([^/\s]+)\/([^/\s]+)$/`) only triggers on valid `owner/repo` identifiers; incomplete typeahead, multi-slash, and URL-like inputs are rejected - Queries for `unsloth/...` models (case-insensitive) keep the full 20-result prefetch and secondary sort - Pinned model lookup fires in parallel with the unsloth prefetch - Canonical-name dedup prevents duplicates when HF normalizes casing - Publisher detection extracted into a single `useMemo` block	2026-04-01 04:37:28 -07:00
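The strict `owner/repo` match from the commit above can be sketched as follows. The pattern is a Python port of the JavaScript regex quoted in the commit message; the function name `unsloth_search_term` and its return convention are hypothetical, and the pinning of the original publisher model is not shown.

```python
import re

# Python port of /^([^/\s]+)\/([^/\s]+)$/ from the commit message.
OWNER_REPO = re.compile(r"([^/\s]+)/([^/\s]+)")


def unsloth_search_term(query: str) -> str:
    """Return the string to search for under author=unsloth."""
    match = OWNER_REPO.fullmatch(query)
    if match is None:
        return query  # plain, incomplete, multi-slash, or URL-like input
    owner, repo = match.groups()
    if owner.lower() == "unsloth":
        return query  # unsloth-prefixed queries are left unchanged
    return repo      # strip the publisher prefix


for q in ("openai/gpt-oss-20b", "unsloth/gpt-oss-20b", "gpt-oss", "openai/", "a/b/c"):
    print(repr(q), "->", repr(unsloth_search_term(q)))
```

`fullmatch` is used instead of `^...$` because `$` in Python also matches just before a trailing newline, which the JavaScript pattern would reject.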
Leo Borcherding	63ad6dbd6d	Fix OOM model styling in Studio model selectors (#4738) Replace strikethrough + opacity-50 OOM styling with gray text and red pill badge across all Studio model selectors (chat, training, onboarding). - Use gray-500/gray-400 for OOM model names (better contrast than strikethrough) - Red pill badge for OOM indicator with light/dark mode support - Scope GGUF gray override to quant name only so downloaded/recommended labels keep colors - Add !important on TIGHT/OOM badges to resist ComboboxItem hover overrides	2026-04-01 02:06:49 -07:00
Daniel Han	6c0826a9e4	Fix Windows local GGUF model loading crash (#4730 ) * Fix Windows "Non-relative patterns are unsupported" when loading local GGUF models When a user loads a GGUF model from a local Windows path (e.g. C:\Users\danie\.lmstudio\models\unsloth\functiongemma-270m-it-GGUF), the model identifier contains backslashes and a drive letter. Both load_model_defaults() and _has_specific_yaml() constructed a YAML filename from the full absolute path and passed it to Path.rglob(), which rejects non-relative patterns on Windows. Fixed by detecting Windows-style paths (drive letters, UNC paths, backslashes) in addition to Unix-style paths, and using only the directory basename for the YAML filename lookup when the identifier is a local filesystem path. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Refactor: reuse is_local_path helper, fix case-sensitive suffix lookup - Replace inline local-path detection in model_config.py and inference_config.py with the existing is_local_path() from utils.paths, which already handles Unix, Windows drive-letter, UNC, and backslash paths - Fix case-sensitive suffix lookup in load_model_defaults(): the _REVERSE_MODEL_MAPPING is lowercase-keyed, so suffix comparisons must use .lower() to match paths like /path/to/Spark-TTS-0.5B/LLM * Fix WSL path parsing and _has_specific_yaml suffix lookup - Use normalize_path() before Path() operations so backslash Windows paths (e.g. C:\Users\...\model) are correctly split on POSIX/WSL hosts where pathlib treats backslashes as literal characters - Add suffix-based (2-component and 1-component) lookup to _has_specific_yaml() so it matches the same resolution rules as load_model_defaults(), fixing wrong inference params for local suffix-mapped models like Spark-TTS-0.5B/LLM --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-04-01 01:38:09 -07:00
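A rough sketch of the idea in this commit: detect a local filesystem identifier (Unix, Windows drive-letter, UNC, or backslash style) and use only its last path component for the YAML lookup. The repository uses its own `is_local_path()` and `normalize_path()` helpers, whose implementations are not shown here; the functions below are hypothetical stand-ins written under that assumption.

```python
import re

# Drive-letter prefix such as "C:\" or "D:/".
_DRIVE_RE = re.compile(r"^[A-Za-z]:[\\/]")


def looks_like_local_path(identifier: str) -> bool:
    """Heuristic: True for filesystem paths, False for Hub ids like org/model."""
    return (
        identifier.startswith(("/", "./", "../", "~"))
        or "\\" in identifier  # covers UNC paths and Windows separators
        or _DRIVE_RE.match(identifier) is not None
    )


def yaml_lookup_name(identifier: str) -> str:
    """Name used to build the YAML filename for a model identifier."""
    if not looks_like_local_path(identifier):
        return identifier
    # Normalise separators first so the split also works on POSIX/WSL hosts,
    # where backslashes are otherwise treated as literal characters.
    normalized = identifier.replace("\\", "/").rstrip("/")
    return normalized.rsplit("/", 1)[-1]
```

Passing only the basename keeps the pattern relative, which is what `Path.rglob()` requires.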
Wasim Yousef Said	d63cc57e1e	fix: clear tool status badge immediately after tool execution (#4733) * fix: clear tool status badge immediately after tool execution The tool status timer badge (Searching 1s, 2s...) persisted after tool calls finished because the status clear event was only sent at the start of the next generation iteration, not after tool execution completed. Backend: yield status clear after all tools finish in the agentic loop iteration, before continue starts the next generation pass. Frontend: debounce badge visibility by 300ms so sub-second tool calls don't flash the badge. * Fix debounce regression for consecutive tool calls Only apply the 300ms show-delay when transitioning from idle to tool-active. When switching between consecutive tools in the same turn (e.g. web_search -> python), keep the badge visible immediately so it does not flicker or disappear during multi-tool runs. * Delay wasActiveRef reset to bridge inter-iteration tool gaps The backend emits a status-clear event between tool iterations, which was resetting wasActiveRef immediately and causing the next tool to be re-debounced (300ms hidden gap between consecutive tools in the same turn). Now the ref reset is delayed by 500ms so a follow-up tool within the same agentic turn shows the badge immediately, while a genuinely new turn still gets the debounce. * Use thread lifecycle to track tool-run boundaries Replace the 500ms wall-clock timeout with the actual thread.isRunning state to determine when wasActiveRef should reset. 
This properly handles all cases: - Consecutive tools within the same run stay visible without flicker - The badge hides only when the thread run actually ends - New turns always get a fresh 300ms debounce on the first tool - No heuristic timeout that can misfire on slow or fast inference * Consolidate wasActiveRef reset into single effect Removes the separate isThreadRunning effect to avoid a race where the ref resets before the tool-status effect reads it (when isThreadRunning flips to false before setToolStatus(null) from the adapter's finally block). Now wasActiveRef resets only when both toolStatus is null AND the thread run has ended, eliminating any flicker on the last tool of a run. * Simplify debounce: use visible state instead of ref tracking Drop wasActiveRef entirely and use the visible state as the debounce gate. When the badge is not yet on screen, debounce for 300ms before showing. When already visible from a prior tool, keep showing immediately. This correctly handles all cases: - All fast tools (<300ms) are suppressed, not just the first - Consecutive tools after the badge is shown stay visible - Badge persists across inter-iteration clears while thread runs - New turns get a fresh debounce after visible resets --------- Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-04-01 00:28:38 -07:00
Wasim Yousef Said	4fb9778988	feat: move folder management into model selector dropdown (#4731 ) * refactor: move folder management from sidebar into model selector * Fix folder management: restore LoRA picker sync, error handling, caching - Restore onFoldersChange callback to keep LoRA adapter picker in sync when scan folders are added/removed (fixes regression from sidebar move) - Thread onFoldersChange through ModelSelector -> HubModelPicker prop chain - Add module-level _scanFoldersCache to prevent folder list flash on re-open - Surface error toast on folder removal failure instead of silently ignoring - Guard handleAddFolder against concurrent double-submit via folderLoading - Clear folderInput on Escape key dismiss to prevent stale input on re-open - Add refreshLocalModelsList and refreshScanFolders to useEffect dep array * Fix compare-mode folder sync, Escape key propagation, cancel toggle state - Wire onFoldersChange through CompareContent/GeneralCompareContent so compare-mode selectors also refresh local models after folder changes - Add e.stopPropagation() on Escape key in folder input to prevent Radix Popover from closing the entire model selector dropdown - Add e.preventDefault() on Enter key to prevent form submission - Clear folderInput and folderError when cancel toggle hides the input, matching the Escape key behavior for consistency * Fix folder mutation state ordering and touch accessibility - Use optimistic updates for add/remove so the folder list reflects changes immediately instead of waiting on a second listScanFolders round-trip that could silently fail. - Move refreshScanFolders out of the finally block in handleRemoveFolder so it runs after the cache update, not after onFoldersChange. - Make the remove button visible on touch/mobile devices and reachable via keyboard focus (opacity-100 on small screens, focus-visible). - Add aria-label to the remove button for screen readers. 
* Deduplicate optimistic folder add to match backend behavior The backend returns the existing ScanFolderInfo row when adding a path that is already registered. The optimistic update was blindly appending the returned row, producing duplicate entries and React key warnings. Now checks by id before appending. * Add aria-label to folder toggle button and strengthen dedup check - Add aria-label to the +/cancel icon button for screen readers. - Extend optimistic dedup check to also compare by path, not just id, to handle edge cases where the cache is stale. --------- Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-03-31 23:15:50 -07:00
Lee Jackson	2cac3e8e4d	studio: Polish Windows installer/setup logs (#4736) * style(windows): clean installer/setup log output and remove seeded credential banner * Keep startup credential hint without exposing plaintext password Print the username and .bootstrap_password file path on first-run admin creation instead of the raw password. Headless / Docker / SSH operators still get a startup-time hint for initial sign-in, and the plaintext credential no longer appears in terminal output or logs. --------- Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>	2026-03-31 23:12:42 -07:00
Wasim Yousef Said	1e8875584d	feat: custom scan folders for GGUF model discovery (#4723 ) * feat: add scan_folders table and CRUD functions to studio_db * feat: add scan folders API endpoints and integrate into model scan * feat: add scan folders API client and update source types * feat: add custom source to model filters and selector * feat: add Model Folders section to chat settings sidebar * style: fix biome formatting in ModelFoldersSection * fix: address review findings for custom scan folders empty string bypass, concurrent delete crash guard, Windows case normalization, response_model on endpoints, logging, deduplicated filter/map, module level cache for custom folder models, consistent source labels, handleRemove error surfacing, per folder scan cap * fix: show custom folders section regardless of chatOnly mode * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refactor: extract shared refreshLocalModelsList in pickers * Harden custom scan folder validation and scanning - Validate path exists, is a directory, and is readable before persisting - Apply per-folder model cap during traversal instead of after (avoids scanning millions of inodes in large directories) - Wrap per-folder scan in try/except so one unreadable folder does not break the entire /api/models/local endpoint for all callers - Normalize case on Windows before storing so C:\Models and c:\models dedup correctly - Extend macOS denylist to cover /private/etc and /private/tmp (realpath resolves /etc -> /private/etc, bypassing the original denylist) - Add /boot and /run to Linux denylist * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Improve scan robustness and preserve Windows path casing - Preserve original Windows path casing in DB instead of lowercasing (normcase used only for dedup comparison, not storage) - Catch PermissionError per child directory so one unreadable subdirectory 
does not skip the entire custom folder scan - Wrap list_scan_folders() DB call in try/except so a DB issue does not break the entire /api/models/local endpoint * fix: scan custom folders for both flat and HF cache layouts * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Windows case-insensitive path dedup with COLLATE NOCASE Use COLLATE NOCASE on the scan_folders.path column so that the UNIQUE constraint correctly deduplicates C:\Models and c:\models on Windows without lowercasing the stored path. Also use COLLATE NOCASE in the pre-insert lookup query on Windows to catch existing rows with different casing. * Restore early-exit limit in _scan_models_dir for custom folders Keep the limit parameter so _scan_models_dir stops iterating once enough models are found, avoiding unbounded traversal of large directories. The post-traversal slice is still applied after combining with _scan_hf_cache results. * feat: scan custom folders with LM Studio layout too * Fix custom folder models being hidden by dedup Custom folder entries were appended after HF cache and models_dir entries. The dedup loop kept the first occurrence of each model id, so custom models with the same id as an existing HF cache entry were silently dropped -- they never appeared in the "Custom Folders" UI section. Use a separate dedup key for custom-source entries so they always survive deduplication. This way a model can appear under both "Downloaded" (from HF cache) and "Custom Folders" (from the user-registered directory) at the same time. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Harden LM Studio scan and fix COLLATE NOCASE on Linux - Add per-child and per-publisher OSError handling in _scan_lmstudio_dir so one unreadable subdirectory does not discard the entire custom folder's results - Only apply COLLATE NOCASE on the scan_folders schema on Windows where paths are case-insensitive; keep default BINARY collation on Linux and macOS where /Models and /models are distinct directories * Use COLLATE NOCASE in post-IntegrityError fallback SELECT on Windows The fallback SELECT after an IntegrityError race now uses the same case-insensitive collation as the pre-insert check, so a concurrent writer that stored the path with different casing does not cause a false "Folder was concurrently removed" error. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-03-31 06:40:31 -07:00
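The `COLLATE NOCASE` behaviour described above can be demonstrated with the standard `sqlite3` module. This is a minimal sketch, not the project's schema: the table layout and the helper names `create_scan_folders` and `add_folder` are assumptions for illustration.

```python
import sqlite3


def create_scan_folders(conn: sqlite3.Connection, case_insensitive: bool) -> None:
    """Create a scan_folders table; NOCASE would only be chosen on Windows."""
    collation = " COLLATE NOCASE" if case_insensitive else ""
    conn.execute(
        "CREATE TABLE scan_folders ("
        "id INTEGER PRIMARY KEY, "
        f"path TEXT NOT NULL{collation} UNIQUE)"
    )


def add_folder(conn: sqlite3.Connection, path: str) -> bool:
    """Insert a folder; return False if the UNIQUE constraint rejects it."""
    try:
        conn.execute("INSERT INTO scan_folders (path) VALUES (?)", (path,))
        return True
    except sqlite3.IntegrityError:
        return False
```

With the NOCASE column, `C:\Models` and `c:\models` collide on the UNIQUE constraint while the first-stored casing is preserved; with the default BINARY collation, `/Models` and `/models` remain distinct rows.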
Daniel Han	9a8b622306	Studio: simplify tool-call dedup and replace html2text with builtin converter (#4722 ) * Simplify tool-call dedup: drop hashlib, inline helpers The duplicate tool-call detector only compares calls within a single request from the same JSON parser, so dict key order is guaranteed identical for identical calls (Python 3.7+ insertion-ordered dicts). - Replace hashlib.md5(json.dumps(...)) with name + str(args) - Inline _tool_call_key, _is_duplicate_call, _record_tool_call since each was a one-liner used once - Remove unused hashlib import * Remove tool_calling_benchmark_results.md from repo * Replace html2text with builtin HTML-to-Markdown converter Drop the external html2text (GPL-3.0) dependency and its regex fallback. Add _html_to_md.py (~190 lines, stdlib only) using html.parser.HTMLParser that handles headings, links, bold/italic, lists, tables, blockquotes, code blocks, and entity decoding. Strips script/style/head tags entirely. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Use json.dumps(sort_keys=True) for tool-call dedup key str(dict) is sensitive to insertion order, so semantically identical calls with different key ordering would bypass duplicate detection. Switch to json.dumps with sort_keys=True for a canonical representation. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Revert dedup key to str(arguments) json.dumps(sort_keys=True) is unnecessary here -- the arguments dict always comes from the same JSON parser within a single request, so key insertion order is deterministic (Python 3.7+). str() is faster and sufficient for consecutive-call dedup. 
* Address review comments on _html_to_md.py - Remove "hr" from _BLOCK_TAGS so the dedicated hr handler is reachable - Prefix all newlines with ">" inside blockquotes (multi-line support) - Emit full ![alt](url) for images instead of alt text only - Replace newlines with spaces inside table cells - Track header cells per-row (_row_has_th) instead of last-cell-only - Strip trailing tabs in addition to spaces in cleanup regex * Fix blockquote rendering, truncated-HTML buffer flush, and dedup key canonicalization _html_to_md.py: - Rewrite blockquote handling with stack-based buffer approach so nested blockquotes, pre blocks inside blockquotes, and multi-paragraph quotes all render correctly with proper "> " prefix on every line. - Add flush_pending() to recover content from truncated HTML where closing tags are missing (common when _fetch_page_text caps the download size). Flushes open <a>, <td>, <pre>, and blockquote buffers. - Skip <img> tags to match prior html2text ignore_images=True behavior and avoid data-URI amplification consuming the output budget. - Collapse all whitespace (including newlines) in non-pre content per standard HTML whitespace rules: \s+ -> single space. - Escape pipe characters in table cell content to prevent column breakage. - Emit separator row after the first row for tables without <th> headers. - Guard against IndexError on _ol_counter for orphan <li> elements. - Normalize CRLF line endings before parsing. llama_cpp.py: - Restore canonical dedup key with json.dumps(sort_keys=True) so that semantically identical tool calls with different JSON key order are correctly detected as duplicates. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix table optional end tags, inline code whitespace, and link text normalization _html_to_md.py: - Extract _finish_cell() and _finish_row() helpers to handle HTML tables that omit optional </td>, </th>, or </tr> end tags. 
This is valid HTML and common on real web pages -- previously the parser would silently drop earlier cells and entire rows. - Call _finish_cell()/_finish_row() from handle_starttag for <tr>/<td>/<th>, handle_endtag for </tr>/<td>/<th>/<table>, and flush_pending() so all three paths (normal close, implicit close, truncated HTML) use the same row-finalization logic including header separator emission. - Add _in_inline_code flag so handle_data() preserves literal whitespace inside <code> spans instead of collapsing it. Source like <code>pip install unsloth</code> now correctly renders as `pip install unsloth` rather than `pip install unsloth`. - Extract _finish_link() helper that normalizes accumulated link text with \s+ -> single space before building the Markdown link. Prevents block- level content inside <a> tags (e.g. <a><div>one</div><div>two</div></a>) from producing multiline [one\n\ntwo](href) link labels. - Empty blockquotes now produce no output instead of a stray ">". - Remove unused _bq_depth field (all routing uses _bq_stack). - Flush open cells and rows in handle_endtag("table") for robustness. * Support <ol start=N>, <dl>/<dt>/<dd>, and preserve code block whitespace _html_to_md.py: - Honor <ol start="N"> attribute so ordered lists preserve their original numbering instead of always restarting from 1. Important for docs/tutorials that continue numbering across sections. - Add dl, dt, dd to _BLOCK_TAGS so definition lists (common on MDN, Python docs, Django docs) produce separated text instead of concatenated blobs. - Rewrite _cleanup() to be fence-aware: content inside fenced code blocks is now preserved verbatim (intentional blank lines in <pre> content are no longer collapsed). Outside code blocks, blank runs are limited to one and trailing whitespace is stripped. - Fix _prefix_blockquote() to strip trailing whitespace before collapsing blank lines, preventing the "\n\n \n\n" pattern from sneaking through. 
* Suppress whitespace-only text nodes between table structural elements Indented HTML tables (nearly all real-world pages) produce whitespace text nodes between <table>, <tr>, </tr> etc. that land in the output as leading spaces before table rows, breaking Markdown table alignment. Skip whitespace-only text nodes when inside a table but not inside a cell, so indentation from source HTML does not leak into the output. * Revert dedup key to str(arguments) with explanatory comment json.dumps(sort_keys=True) is unnecessary overhead here: arguments always comes from json.loads on model output within a single request, so dict insertion order is deterministic in Python 3.7+. A repeated call from the model produces the same JSON, which parses to the same dict repr. str() avoids re-serialization on every tool call. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-03-31 06:15:18 -07:00
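The back-and-forth in this commit between `str(arguments)` and `json.dumps(sort_keys=True)` comes down to one property, which the sketch below illustrates. The helper names `key_via_str` and `key_via_json` are hypothetical; they only mirror the two key styles discussed in the message.

```python
import json
from typing import Any, Dict


def key_via_str(name: str, arguments: Dict[str, Any]) -> str:
    """Dedup key from the dict repr: cheap, but follows insertion order."""
    return name + str(arguments)


def key_via_json(name: str, arguments: Dict[str, Any]) -> str:
    """Canonical dedup key: key order in the source JSON does not matter."""
    return name + json.dumps(arguments, sort_keys=True)


# Same JSON text parsed twice gives the same dict repr, which is the
# argument made for str(); reordered keys only match under sort_keys=True.
first = json.loads('{"query": "unsloth", "limit": 5}')
repeat = json.loads('{"query": "unsloth", "limit": 5}')
reordered = json.loads('{"limit": 5, "query": "unsloth"}')
```

Which key is appropriate depends on whether a model is expected to re-emit a repeated call with its keys in a different order; the final commit in this series settles on `str(arguments)`.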
Lee Jackson	9451bb1bac	fix(export): preserve selected/manual model on enter and blur (#4726)	2026-03-31 17:05:55 +04:00
Daniel Han	e159b93b97	studio: improve GGUF tool calling accuracy and reliability (#4700 ) * studio: improve GGUF tool calling accuracy and reliability - Add URL fetching to web_search tool so models can read full page content instead of only getting search snippets. Uses html2text for clean markdown conversion with regex fallback. - Inject current date and behavioral guidance (URL fetch workflow, no repeated queries, use code for data processing) into the tool-use system prompt. - Append error recovery nudge to tool results that indicate failure, helping small models avoid looping on the same broken call. - Strip leaked <tool_call> XML from assistant messages in conversation history and from the outgoing SSE stream. - Raise default max tool iterations from 10 to 25 across backend, model schema, and frontend defaults. - Increase _MAX_PAGE_CHARS from 4k to 16k so fetched pages contain enough content for the model to extract useful information. - Add "IMPORTANT: These are only short snippets" hint to search results so models know to fetch full pages when needed. Tested with Qwen3.5-4B-GGUF (UD-Q4_K_XL), 10 runs before/after: - XML leaks in responses: 10/10 -> 0/10 - URL fetch usage: 0 -> 4/10 runs - Runs producing actual correct answers: 0/10 -> 2/10 - Average tool calls per query: 5.5 -> 3.8 (more efficient) - Average response time: 12.3s -> 9.8s * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add tool calling benchmark results across model sizes and quants Tested 16 configurations (4 models x 2 quants x 2 KV cache types) with 10 runs each on NVIDIA B200. Best config: 27B UD-Q4_K_XL + bf16 KV -- 6/10 runs found all 4 correct songs, 0 XML leaks, 131s average response time. * Add duplicate tool-call detection and final-answer synthesis When the model repeats the exact same tool call (same name + arguments) twice in a row, skip execution and return a redirect message telling it to try a different approach. 
This prevents the 8x-repeated-query loops observed on 27B and 35B models. When the tool iteration cap (25) is reached, inject a "provide your final answer now" message before the final streaming pass. This lets the model synthesize a useful answer from everything it gathered instead of being silently cut off. Tested on Qwen3.5-27B UD-Q4_K_XL (10 runs): - Repeated query runs: 4/10 -> 2/10 - Cap hits: 1/10 -> 0/10 - All 4/4 accuracy: 5/10 -> 7/10 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix CodeQL alert: handle whitespace in script/style closing tags The regex fallback for HTML stripping did not match closing tags with whitespace before the angle bracket (e.g. </script >). Use \s* before > in both script and style patterns. * Address reviewer findings: SSRF, timeout crash, XML regex, dedup - SSRF: resolve hostname via getaddrinfo and reject private, loopback, link-local, multicast, and reserved addresses before fetching - Timeout: handle timeout=None (unlimited mode) in URL fetch path by defaulting to 60s instead of crashing on min(None, 60) - Download cap: read at most max_chars4+1 bytes instead of the full response body before truncating - XML regex: match both <tool_call> and <function=...> markup in the history/stream cleanup (inference.py) - CodeQL: use [^>] in closing script/style tags to handle any whitespace or attributes before > - Dedup: track whether each tool call failed so retries after transient errors are allowed; only block consecutive identical calls that both succeeded - Final-answer synthesis: guard on max_tool_iterations > 0 so callers who disable tools do not get a false "used all calls" turn * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix redirect SSRF, SSE streaming regression, dedup off-by-one - SSRF redirect bypass: disable auto-redirect in urllib, manually follow up to 5 hops with host validation at each step. 
Prevents public URLs from redirecting to loopback/private targets. - SSE streaming: track prev_text on the raw cumulative and strip XML from the delta only, so completed tool_call tags do not cause the cumulative to shrink and drop trailing real text. - Dedup off-by-one: check the immediately previous call (window=1) instead of requiring 2 matching history entries, so the second identical successful call is blocked rather than the third. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix redirect HTTPError handling and tighten error prefixes - Redirect fix: urllib raises HTTPError (not a normal response) when the redirect handler returns None. Catch HTTPError for 3xx codes and extract the Location header from the exception object. - Error prefixes: remove overly broad "No " prefix that matched "No results found." (a valid empty-search outcome, not an error). Replace with specific prefixes like "Blocked:", "No query provided", "Failed to resolve". This ensures empty search results are correctly classified as non-errors for duplicate-call tracking. * Fix SSE cross-chunk XML leaks, cleanup review findings - SSE streaming: sanitize the full cumulative text before diffing against the previous sanitized snapshot, so XML tags that span chunk boundaries are stripped correctly. The previous delta-based approach leaked split tags. - DRAINING fallback: use _strip_tool_markup() helper instead of a manual regex that only handled <tool_call> but not <function=...>. - Move hashlib import, _TOOL_XML_RE compile, and datetime import to module level per style guide. - Remove unused _hit_tool_cap variable. * Fix DNS rebinding, charset detection, HTTPError handling, dedup double-record - DNS rebinding: resolve hostname once via getaddrinfo, pin the returned IP, rewrite the URL to connect to the pinned IP with a Host header. Each redirect hop re-resolves and re-validates. 
Closes the TOCTOU window between validation and connection. - Charset: use resp.headers.get_content_charset() instead of hardcoding utf-8, so pages with other encodings decode correctly. - HTTPError: return descriptive "HTTP {code} {reason}" instead of re-raising into a generic "Search failed" message. - Dedup: remove redundant _record_tool_call in the duplicate branch; the single call at the end of the loop handles all cases. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-03-31 03:06:44 -07:00
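The address validation step described in this commit (resolve via `getaddrinfo`, then reject private, loopback, link-local, multicast, and reserved targets) might look roughly like the following. It is a sketch using only the standard library; the function names are hypothetical, and the IP pinning and per-hop redirect checks mentioned above are deliberately left out.

```python
import ipaddress
import socket
from typing import List


def is_blocked_address(ip_text: str) -> bool:
    """True if the address must not be fetched by the URL tool."""
    ip = ipaddress.ip_address(ip_text)
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def resolve_public_addresses(hostname: str) -> List[str]:
    """Resolve a hostname once and fail if any result is a blocked address."""
    infos = socket.getaddrinfo(hostname, None, type=socket.SOCK_STREAM)
    # sockaddr[0] is the address text; drop any IPv6 scope suffix ("%eth0").
    addresses = sorted({info[4][0].split("%")[0] for info in infos})
    for address in addresses:
        if is_blocked_address(address):
            raise ValueError(f"Blocked: {hostname} resolves to {address}")
    return addresses
```

A caller would connect to one of the returned addresses directly, rather than resolving the hostname a second time, to avoid the validation/connection gap the commit calls out.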
Lee Jackson	815619d972	feat: add update instructions card with OS toggle and mobile expand flow (#4721) Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>	2026-03-31 14:05:05 +04:00
Roland Tannous	cc5e4fbf17	fix: auto-retry stalled HF downloads with HF_HUB_DISABLE_XET=1 (#4712 ) * fix: auto-retry stalled HF downloads with HF_HUB_DISABLE_XET=1 The heartbeat thread now monitors the HF Hub cache directory for file-size growth. If no bytes are written for 3 minutes, it sends a "stall" message to the orchestrator, which kills the subprocess and retries with HF_HUB_DISABLE_XET=1 (falling back from Xet to standard HTTPS). If the retry also stalls, it errors out with a clear message. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: include transport type (xet/https) in heartbeat and stall log messages Makes it clear in backend logs whether the download is using xet or https transport, and which transport stalled — helpful for debugging. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: monitor HF Hub .tmp dir to avoid false stall detections huggingface_hub downloads into .tmp/ before atomically moving to blobs/. Without monitoring .tmp, a large shard actively downloading for several minutes would show zero blob growth and trigger a false stall. * fix: scope HF cache size check to specific model being loaded Instead of scanning every models--/blobs directory (O(N) with cached models), only check the specific model's blobs dir plus the global .tmp dir. Much faster on systems with many cached models. Fix false stall detection on cached/local models and cleanup issues - Only fire stall if download activity was observed (cache size changed at least once). Previously, any model load taking >180s would trigger a false stall, even for already-cached or local models where no download is happening. - Return -1 from _get_hf_cache_size on exception to distinguish "unable to measure" from "genuinely zero bytes". Skip stall logic when measurement fails. 
- Add _shutdown_subprocess before raising on terminal stall path to prevent leaking a stuck subprocess. - Detect pre-existing HF_HUB_DISABLE_XET=1 in the parent environment to avoid a redundant retry cycle when Xet is already disabled. - Remove global .tmp directory scanning (not used by modern huggingface_hub; in-progress downloads use .incomplete files in blobs/ which are already captured by iterdir). - Add f.is_file() guard in cache size calculation. - Replace em dashes with ASCII dashes for Windows terminal compat. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Harden stall detection edge cases - Guard -1 to valid value transition: when initial _get_hf_cache_size returns -1 (error) and later recovers to a real value, do not count that as download activity. Only set saw_download_activity when the previous measurement was also valid (>= 0). - Move os import to top-level in orchestrator.py instead of inline import os as _os. - Fix misleading comment about post-download protection. * Use .incomplete files to detect active downloads for stall detection Replace the saw_download_activity heuristic with direct .incomplete file detection. huggingface_hub creates .incomplete files in blobs/ during active downloads and removes them on completion. This gives a reliable signal for whether a download is actually in progress. Benefits: - Cached models: no .incomplete files -> no stall fired even after 180s - Post-download init (quantization, GPU loading): .incomplete files gone so stall timer resets, long init phases are not killed - Pre-download hangs (XET handshake stall): .incomplete files are created at download start, so zero-byte stalls are now detected - No more false positives from -1 to valid measurement transitions The _get_hf_download_state function now returns (total_bytes, has_incomplete) tuple or None on error, replacing _get_hf_cache_size. 
Add debug logging to download state exception handler Log the exception at debug level when _get_hf_download_state fails, instead of silently returning None. Helps with troubleshooting cache measurement issues. * Watch both adapter and base model repos for LoRA stall detection When loading a LoRA adapter, the actual download bottleneck is often the base model, not the adapter itself. Update the heartbeat to watch both mc.identifier and mc.base_model cache directories so stall detection works for LoRA loads where the base model stalls on Xet. Also update _get_hf_download_state to accept multiple model names and skip names without "/" (local paths) since those do not have HF cache directories. * Fix model name filtering for official HF models without org prefix Models like gpt2 and bert-base-uncased do not contain a slash but are still valid HF Hub models with cache directories. Replace the "/" check with a proper local-path detection that checks for path separators and path-like prefixes instead. Also fix the base_model watch list to not require "/" in the base model name, so official models used as LoRA bases are also monitored. * Fix local path detection that broke all org/model names on Linux The os.path.sep check matched "/" in HF model IDs like "org/model" on Linux, causing the stall detector to skip ALL standard HF models. Replace with a check that only skips names starting with "/" (absolute paths), "." (relative paths), "~" (home-relative), or containing "\" (Windows paths). HF model IDs like "org/model" or "gpt2" pass through correctly on all platforms. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-03-31 03:00:46 -07:00
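The `.incomplete`-file signal described in this commit can be sketched as a small directory scan that returns a `(total_bytes, has_incomplete)` tuple, or `None` when the directory cannot be measured. The function name `download_state` and the single-directory signature are assumptions for illustration; the commit's `_get_hf_download_state` accepts multiple model names and resolves their cache directories itself.

```python
from pathlib import Path
from typing import Optional, Tuple


def download_state(blobs_dir: Path) -> Optional[Tuple[int, bool]]:
    """Measure one blobs/ directory.

    Returns (total_bytes, has_incomplete), or None if the directory cannot
    be read, so callers can tell "unable to measure" from "zero bytes".
    """
    try:
        total_bytes = 0
        has_incomplete = False
        for entry in blobs_dir.iterdir():
            if not entry.is_file():
                continue
            total_bytes += entry.stat().st_size
            if entry.name.endswith(".incomplete"):
                has_incomplete = True
        return total_bytes, has_incomplete
    except OSError:
        return None
```

A heartbeat loop built on this would only start its stall timer while `has_incomplete` is true and `total_bytes` has stopped growing, which is why cached models and post-download initialisation no longer trigger a stall.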
Daniel Han	e164c930ff	fix(studio): correct default weight_decay and learning rate (#4695 ) * fix(studio): change default weight_decay from 0.01 to 0.001 The default weight decay across Studio was 0.01 but should be 0.001. Updated the default in all backend fallbacks, the Pydantic model, the frontend config, and every YAML preset/model-default config. * fix(studio): auto-set learning rate based on training method Default LR should be 2e-4 for LoRA/QLoRA and 2e-5 for full fine-tuning. Frontend: track whether the user has manually edited the LR field via a _learningRateManuallySet flag (same pattern as trainOnCompletions). When switching training method and the user has not touched the LR, auto-set it to the appropriate default. Reset the flag on model load. Backend: change trainer.py start_training default from 5e-5 to 2e-4, update default.yaml fallback from 5e-5 to 2e-4, and fix full_finetune.yaml from 0.0002 (2e-4) to 2e-5. * refactor(studio): centralize weight_decay and learning rate defaults Create studio/backend/core/training/constants.py as the single source of truth for DEFAULT_WEIGHT_DECAY (0.001), DEFAULT_LEARNING_RATE (2e-4), DEFAULT_LEARNING_RATE_FULL (2e-5), and DEFAULT_LEARNING_RATE_STR ("2e-4"). All backend modules (trainer.py, training.py, worker.py, models/training.py) now import from constants.py instead of hardcoding values. On the frontend, add LR_DEFAULT_LORA and LR_DEFAULT_FULL to config/training.ts and use them in the store instead of magic numbers. A comment cross-references the backend constants file. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix model-specific LR override, persist migration, and flag resets - Preserve model-specific learning rates from YAML configs when the async autoSelectTrainingMethod callback fires (fixes Qwen2.5-1.5B getting 2e-4 instead of its configured 1e-5, etc.) 
- Bump zustand persist version to 9 with migration so existing users with weightDecay=0.01 get updated to 0.001 - Clear _learningRateManuallySet in reset() and applyConfigPatch() for consistency with trainOnCompletions flag behavior - Add DEFAULT_LEARNING_RATE_FULL_STR to constants.py * Refine applyConfigPatch to only clear LR flag when patch includes LR Only reset _learningRateManuallySet when the applied config patch actually provides a learningRate value. This prevents unrelated config patches from silently disarming the manual-edit guard, which would cause a subsequent setTrainingMethod call to overwrite the user's custom LR. * Preserve model-specific LR when switching between qlora and lora Only auto-switch the learning rate when the training category changes (adapter <-> full fine-tuning). Switching between qlora and lora keeps the current LR since both methods share the same learning rate range. This preserves curated per-model defaults (e.g. 1e-5 for Qwen2.5-1.5B-Instruct) when the user toggles between adapter methods. * Remove constants.py, use YAML configs as the source of truth The YAML config files (model-specific + default.yaml) are the intended config layer for training defaults. The Python backend fallbacks now use inline values that match the YAML configs, rather than importing from a separate constants module. This keeps the config architecture simple: YAML files are the single source of truth, and the inline Python fallbacks are just safety nets that mirror them. * fix(studio): preserve model-specific LR when switching training method Stash YAML-provided learning rate and use it to restore the correct value when switching between adapter and full fine-tune modes. 
- qlora <-> lora no longer overwrites the model's LR - full -> adapter restores the YAML LR instead of a hardcoded constant - selecting a model while on full fine-tune uses LR_DEFAULT_FULL instead of applying the YAML adapter LR --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com> Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com> Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>	2026-03-31 13:50:25 +04:00
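The learning-rate rules this commit settles on can be sketched as below. The actual logic lives in the frontend store; this Python version is only an illustration, and `TrainingState`, `category`, and `set_training_method` are hypothetical names. The two default values and the `LR_DEFAULT_LORA` / `LR_DEFAULT_FULL` names come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional

LR_DEFAULT_LORA = 2e-4  # default for LoRA/QLoRA, per the commit message
LR_DEFAULT_FULL = 2e-5  # default for full fine-tuning, per the commit message
ADAPTER_METHODS = {"lora", "qlora"}


@dataclass
class TrainingState:
    method: str = "qlora"
    learning_rate: float = LR_DEFAULT_LORA
    lr_manually_set: bool = False      # stands in for _learningRateManuallySet
    yaml_lr: Optional[float] = None    # stashed model-specific LR from YAML


def category(method: str) -> str:
    return "adapter" if method in ADAPTER_METHODS else "full"


def set_training_method(state: TrainingState, new_method: str) -> TrainingState:
    old_cat, new_cat = category(state.method), category(new_method)
    state.method = new_method
    # Leave the LR alone if the user edited it or the category did not change
    # (qlora <-> lora share the same learning rate range).
    if state.lr_manually_set or old_cat == new_cat:
        return state
    if new_cat == "full":
        state.learning_rate = LR_DEFAULT_FULL
    else:
        # full -> adapter: restore the YAML value if one was stashed.
        state.learning_rate = (
            state.yaml_lr if state.yaml_lr is not None else LR_DEFAULT_LORA
        )
    return state
```

With a model whose YAML config sets 1e-5, toggling qlora to lora keeps 1e-5, switching to full yields 2e-5, and switching back restores 1e-5.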
Wasim Yousef Said	28aaf849bf	fix: throttle and cache HuggingFace modelInfo API calls (#4696 ) * fix: throttle and cache HuggingFace modelInfo API calls The frontend was firing 40 to 60 parallel modelInfo requests on app startup with zero caching or deduplication, causing HF rate limits. Adds a caching layer (hf-cache.ts) with TTL cache, inflight request dedup, and a concurrency limiter. Also debounces the HF token input so typing a token no longer re-fires all model searches per keystroke. * fix: only fetch VRAM info for visible models in chat selector * Fix cache key isolation and VRAM badge stability for PR #4696 - Cache key now includes a token fingerprint (last 8 chars) instead of a boolean, so switching HF tokens gives separate cache entries instead of serving stale data from the previous token. - Extract token via credentials?.accessToken to match the @huggingface/hub API surface. - Extend CachedResult type with safetensors/tags fields so downstream consumers no longer need unsafe `as` casts. - Merge VRAM param map with previous state on scroll instead of replacing it, preventing a brief flash of missing VRAM badges when new models become visible. * Fix VRAM badges missing for search-filtered recommended models When a user types a search query, filteredRecommendedIds can include models beyond the currently visible page. These models had no VRAM data because useRecommendedModelVram only received visibleRecommendedIds. Now we pass the union of visibleRecommendedIds and filteredRecommendedIds to the VRAM hook, so recommended models surfaced by search also show their VRAM badges. The hf-cache layer ensures no duplicate network calls. 
* Apply biome formatting to hf-cache.ts and use-recommended-model-vram.ts Auto-formatted with biome check --write to match project lint rules: - Block statements for single-line if/for bodies - Import sorting (type imports first) - Consistent line wrapping * Fix extractToken to handle both current and deprecated HF auth forms The @huggingface/hub CredentialsParams type is a union: - { accessToken: "hf_..." } (current preferred form) - { credentials: { accessToken: "..." } } (deprecated form) Previously only checked params.credentials?.accessToken (deprecated path). Now checks both forms so the cache key is correct regardless of which calling convention is used. * Simplify extractToken, map merge, and set construction - extractToken: remove type assertions, use direct property access with truthiness checks for cleaner union type handling - VRAM map merge: use Map spread constructor instead of manual for loop - idsForVram: use Set spread construction for more concise dedup * Add rationale comment for MAX_CONCURRENT=3 in hf-cache.ts * Skip GGUF repos in VRAM fetch and pre-populate cache from listModels Two changes to reduce redundant HF API calls: 1. Filter GGUF repos from idsForVram before passing to useRecommendedModelVram. GGUF repos have no safetensors metadata and the render layer already shows a static "GGUF" badge -- fetching modelInfo for them is a no-op that wastes a semaphore slot and a network round-trip. 2. Add primeCacheFromListing() to hf-cache.ts and call it from listModels yield sites in mergedModelIterator and priorityThenListingIterator. listModels returns the same type (ModelEntry & Pick<ApiModelInfo, T>) as modelInfo with the same additionalFields, so the data is interchangeable. Priming only writes if the key is not already fresh, so it never overwrites a recent modelInfo response. 
This means models discovered via listModels are already in cache when useRecommendedModelVram later calls cachedModelInfo for them, eliminating duplicate network requests. * Fix cache key mismatch: prime both token and anonymous slots The VRAM hook calls cachedModelInfo without credentials (anonymous key), but listModels results were primed only under the authenticated key. For authenticated users the priming was a no-op -- cache miss every time. Fix: prime both the token-specific slot and the anonymous slot when an access token is present. Public model metadata (safetensors, tags) is identical regardless of auth so this is safe. Also add a defensive guard in primeCacheFromListing for empty name. * Auto-prime anonymous cache slot from authenticated modelInfo fetches When cachedModelInfo is called with a token, the result was only stored under the token-specific key (e.g. model::abc12345). The VRAM hook calls cachedModelInfo without credentials and reads the anonymous slot (model::anon), causing a cache miss and duplicate fetch for every priority model. Now cachedModelInfo also writes to the anonymous slot on success when a token is present. Public model metadata (safetensors, tags) is identical regardless of auth, so this is safe and eliminates ~10 duplicate API calls on first page load. * Guard anonymous cache priming against gated/private models Only prime the anonymous cache slot for non-gated, non-private models. Previously, authenticated modelInfo responses and listing results were unconditionally copied into the anonymous slot, which could briefly expose gated/private model metadata after clearing the HF token. Now checks result.gated and result.private before writing the anon slot. Public unsloth/ models (the common case) still benefit from the optimization; gated models like meta-llama/* require a fresh fetch per auth context. 
* Extract primeFromListing helper to deduplicate cache priming logic The cache priming pattern (prime token slot + conditionally prime anon slot for non-gated models) was duplicated in three places. Extracted into a single primeFromListing() function for maintainability. * Export CachedResult type, add isStale helper, simplify primeFromListing - Export CachedResult so consumers can use it directly instead of the indirect Parameters<typeof ...> pattern. - Extract isStale(key) helper to deduplicate the cache freshness check that was repeated in primeCacheFromListing, cachedModelInfo, and the anonymous-slot priming logic. - Simplify primeFromListing to use CachedResult directly for both the data parameter and the gated/private guard, eliminating the double cast. --------- Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-03-31 02:21:17 -07:00
Datta Nimmaturi	3b5a49776b	[studio] multi gpu: revert to balanced for inference. (#4698 ) * Revert to balanced for inference * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove unused for_inference parameter from get_device_map Since inference and training both use "balanced" now, the for_inference flag is dead code. Remove it from the function signature, the call site in inference.py, and simplify the tests accordingly. * Remove redundant TestDeviceMapForInference test class TestGpuAutoSelection already covers the same multi-gpu and single-gpu device_map assertions. The TestDeviceMapForInference class was left over from when for_inference had distinct behavior. * Remove redundant test_get_device_map_multi_gpu_uses_balanced Its assertions ([0,1] -> balanced, [0] -> sequential) are already covered by test_get_device_map_uses_explicit_gpu_selection. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-03-31 01:24:41 -07:00
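After this revert, the mapping the tests assert is small enough to sketch directly. The signature below is an assumption; only the two input/output pairs ([0, 1] -> balanced, [0] -> sequential) come from the commit message.

```python
from typing import List


def get_device_map(gpu_ids: List[int]) -> str:
    """Hypothetical simplification: multiple GPUs shard with "balanced",
    a single GPU loads with "sequential"."""
    return "balanced" if len(gpu_ids) > 1 else "sequential"
```

Both strings are values that `device_map` accepts when loading a model through transformers/accelerate.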
Daniel Han	fe6609a624	fix(studio): open tour ReadMore links in new tab (#4694 ) * fix(studio): open tour ReadMore links in new tab The quick tour "Read more" links navigate away from Studio instead of opening in a separate tab. Add target="_blank" and rel="noopener noreferrer" to the ReadMore component so external doc links open in a new browser tab. * fix(studio): only open external ReadMore links in new tab Apply target="_blank" conditionally based on whether the href starts with "http", so internal links still navigate in the same tab. * Tighten external-link detection in ReadMore component Use regex /^https?:\/\// instead of startsWith("http") so the check requires the full protocol prefix and does not match non-URL strings that happen to begin with "http". * Hoist regex to module scope for ReadMore Move EXTERNAL_URL_RE to top-level constant to satisfy the biome useTopLevelRegex lint rule and avoid re-creating the RegExp on every render. --------- Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>	2026-03-30 23:41:14 -07:00
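The external-link check from this commit is a one-line regex in the React component. A rough Python equivalent, with a hypothetical `link_attrs` helper standing in for the JSX props, might look like this:

```python
import re
from typing import Dict

# Compiled once at module scope, mirroring the hoisted EXTERNAL_URL_RE constant.
EXTERNAL_URL_RE = re.compile(r"^https?://")


def link_attrs(href: str) -> Dict[str, str]:
    """Return the extra anchor attributes for external links only."""
    if EXTERNAL_URL_RE.match(href):
        return {"target": "_blank", "rel": "noopener noreferrer"}
    return {}
```

Requiring the full `http://` or `https://` prefix is what keeps strings such as `httpfoo` or internal routes from opening in a new tab.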
Lee Jackson	308bb948d1	studio: prevent false multimodal warning during model loading (#4704 ) * studio: gate multimodal incompatibility warning on settled model capabilities * Also disable Start button during isCheckingVision fallback When getModelConfig fails and the fallback checkVisionModel is still in-flight, isLoadingModelDefaults clears before isCheckingVision does. Without also gating on isCheckingVision the Start button briefly re-enables with stale capability flags. Add isCheckingVision to the disabled condition and show "Loading model..." text while either flag is active. * Show correct error message for audio dataset incompatibility The incompatibility warning always said "switch to a vision model" even when the actual issue was an audio dataset on a non-audio model. Now shows an audio-specific message when the mismatch is audio. * Extract isLoadingModel constant for clarity Pull the combined model-loading condition into a single constant reused by the settled check, the disabled prop, and the button label. --------- Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>	2026-03-30 23:11:20 -07:00
Roland Tannous	d6d3f59984	fix: replace hard timeout with inactivity timeout for model loading (#4707 ) The 180s wall-clock timeout would kill model loads on slow connections even when the download was actively progressing. Now the worker sends heartbeat status messages every 30s during loading, and the orchestrator resets its 300s deadline on each one — so it only times out when the subprocess goes truly silent.	2026-03-31 07:35:04 +04:00
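The difference between a wall-clock timeout and an inactivity timeout can be sketched as follows. `InactivityDeadline` is a hypothetical class, not the orchestrator's actual code; the 300s figure is taken from the commit message, and the clock is injectable so the behavior can be checked without sleeping.

```python
import time
from typing import Callable


class InactivityDeadline:
    """Deadline that is pushed back every time a heartbeat arrives."""

    def __init__(
        self, timeout_s: float = 300.0, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._timeout_s = timeout_s
        self._clock = clock
        self._deadline = clock() + timeout_s

    def heartbeat(self) -> None:
        # Called when the worker sends a status message (every 30s in the commit).
        self._deadline = self._clock() + self._timeout_s

    def expired(self) -> bool:
        return self._clock() >= self._deadline
```

A slow but progressing download keeps calling `heartbeat()`, so `expired()` only becomes true once the subprocess has been silent for the whole timeout window.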
Roland Tannous	7f353acfd4	fix: skip download progress polling for exported GGUF models (#4709 ) * fix: skip download progress polling for exported GGUF models * fix: revert isLocalGgufDir change — exported GGUFs are file paths, not dirs * fix: set isDownloaded true for all adapters in LoraModelPicker	2026-03-31 07:21:23 +04:00
Etherll	34272a796f	Fix/bun windows bin detection (#4703) * fix(studio): detect bun .exe shims in Windows binary check * Update setup.sh * add .bunx checking	2026-03-30 21:58:33 +04:00
Daniel Han	6d83ad9a28	fix(studio): avoid UnicodeEncodeError on Windows cp1252 consoles (#4699 ) * fix(studio): replace unicode emoji in print() to avoid cp1252 crash on Windows On Windows the default console encoding is cp1252 which cannot encode unicode emoji like U+2705 or U+26A0. bare print() calls with these characters cause a UnicodeEncodeError at runtime. - run.py: replace emoji with ASCII status prefixes [OK] and [WARNING] - format_conversion.py: remove duplicate print() that mirrors the logger.info() call on the next line, and drop the emoji from the log message since loggers handle encoding separately * fix(studio): apply same emoji/print cleanup to parallel VLM conversion path The parallel URL-based conversion logic has the same duplicate print() with emoji that was fixed in the sequential path. Remove the bare print() and drop the emoji from the logger.info() call. * Treat install_python_stack.py failure as fatal in setup.ps1 On Linux/Mac, setup.sh runs under set -euo pipefail so a non-zero exit from install_python_stack.py aborts the installer. On Windows, setup.ps1 had no exit code check -- if the Python script crashed (eg from the cp1252 UnicodeEncodeError), the installer silently continued past the dependency loop and reported success. Studio would then fail at launch with ModuleNotFoundError for structlog, fastapi, and other deps that were never installed. Capture $LASTEXITCODE and exit 1 if the dependency installer fails, matching the error handling pattern already used for PyTorch install. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-30 06:40:47 -07:00
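The failure mode behind this fix is easy to reproduce without a Windows console: encoding the emoji with cp1252 raises the same error that `print()` hits. The helpers below are hypothetical illustrations; only the `[OK]` and `[WARNING]` prefixes come from the commit message.

```python
def can_encode(text: str, encoding: str = "cp1252") -> bool:
    """True if text survives encoding to the given console code page."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def status_line(ok: bool, message: str) -> str:
    """ASCII-only status prefix, safe on any console encoding."""
    prefix = "[OK]" if ok else "[WARNING]"
    return f"{prefix} {message}"
```

`can_encode("\u2705")` and `can_encode("\u26a0")` are both false under cp1252, while anything produced by `status_line` with an ASCII message encodes cleanly.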
Datta Nimmaturi	9311df2b29	[Studio] multi gpu finetuning/inference via "balanced_low0/sequential" device_map (#4602 ) * [WIP] balanced device map for studio * gpus as a request parameter * API for multi GPU stuff * return multi gpu util in new API * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Use balanced_low0 instead of balanced * Use balanced_low0 instead of balanced * Fix device_map typo, UUID parsing crash, set() filter bug, and broken tests - balanced_low0 -> balanced_low_0 (transformers/accelerate rejects the old string) - get_parent_visible_gpu_ids() now handles UUID/MIG CUDA_VISIBLE_DEVICES gracefully instead of crashing on int() parse - _get_backend_visible_gpu_info() set() or None bug: empty set is falsy so CUDA_VISIBLE_DEVICES=-1 would disable filtering and report all GPUs - test_gpu_selection.py: add missing get_visible_gpu_utilization import and add required job_id arg to start_training() calls * Smart GPU determinism using estimates * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * disallow gpu selection for gguf for now * cleanup * Slightly larger baseline * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Treat empty list as auto * Verbose logging/debug * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Cleanup and revert unnecessary deletions * Cleanup excessive logs and guard against disk/cpu offload * auth for visibility API. cleanup redundant imports. Adjust QLoRA estimate * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * support for non cuda gpus * Fix multi-GPU auto-selection memory accounting The multi_gpu_factor was applied uniformly to all GPUs including the first one, which unfairly penalizes single-GPU capacity when transitioning to multi-GPU. 
This created a discontinuity where a model that barely fits 1 GPU would suddenly require 2 GPUs because the first GPU's free memory was discounted by 20%. Now the first GPU keeps its full free memory, and only additional GPUs have an overhead factor (0.85) applied to account for inter-GPU communication and sharding overhead. This gives more accurate auto-selection and avoids unnecessary multi-GPU for models that comfortably fit on one device. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add sandbox tests for multi-GPU selection logic 24 tests covering model size estimation, memory requirements, automatic GPU selection, device map generation, GPU ID validation, and multi-GPU overhead accounting. All tests use mocks so they run without GPUs on Linux, macOS, and Windows. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix reviewer findings: 4bit inference estimate, fallback, GGUF gpu_ids, retry 1. 4-bit inference now uses reduced memory estimate (model_size/3 + buffer) instead of the FP16 1.3x multiplier. This prevents over-sharding quantized models across unnecessary GPUs. 2. When model size estimation fails, auto_select_gpu_ids now falls back to all visible GPUs instead of returning None (which could default to single-GPU loading for an unknown-size model). 3. GGUF inference route now treats gpu_ids=[] as auto-selection (same as None) instead of rejecting it as an unsupported explicit request. 4. Training retry path for "could not get source code" now preserves the gpu_ids parameter so the retry lands on the same GPUs. 5. Updated sandbox tests to cover the new 4-bit inference estimate branch. * Remove accidentally added unsloth-zoo submodule * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix UUID/MIG visibility and update test expectations 1. 
nvidia.py: When CUDA_VISIBLE_DEVICES uses UUID/MIG tokens, the visibility APIs now return "unresolved" with empty device lists instead of exposing all physical GPUs. This prevents the UI from showing GPUs that the backend process cannot actually use. 2. test_gpu_selection.py: Updated test expectations to match the new multi-GPU overhead accounting (first GPU at full capacity, 0.85x for additional GPUs) and 4-bit inference memory estimation formula. All 60 tests now pass. * Add CPU/disk offload guard to audio inference path The audio model loading branch returned before the common get_offloaded_device_map_entries() check, so audio models loaded with a multi-GPU device_map that spilled layers to CPU/disk would be accepted instead of rejected. Now audio loads also verify no modules are offloaded. * Improve VRAM requirement estimates * Replace balanced_low_0 with balanced * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refine calculations for slightly easier nums * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * adjust estimates * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Use nums instead of obj to avoid seralisation error * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Harden nvidia-smi parsing and fix fallback GPU list 1. nvidia.py: Wrap int() casts for GPU index and memory in try/except so MIG slices, N/A values, or unexpected nvidia-smi output skip the unparseable row instead of aborting the entire GPU list. 2. nvidia.py: Handle GPU names containing commas by using the last field as memory instead of a fixed positional index. 3. hardware.py: fallback_all now uses gpu_candidates (GPUs with verified VRAM data) instead of raw devices list, which could include GPUs with null VRAM that were excluded from the ranking. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * cleanup * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * consolidate raise_if_offload * Improve MoE support. Guard against nvidia-smi failures * Improve MoE support. Guard against nvidia-smi failures * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix shared-expert LoRA undercount, torch VRAM fallback, and apply_gpu_ids edge case 1. vram_estimation.py: compute_lora_params now includes shared experts (n_shared_experts) alongside routed experts when computing MoE LoRA adapter parameters. Previously only n_experts were counted, causing the estimator to undercount adapter, optimizer, and gradient memory for DeepSeek/GLM-style models with shared experts. 2. hardware.py: _torch_get_per_device_info now uses mem_get_info (which reports system-wide VRAM usage) instead of memory_allocated (which only reports this process's PyTorch allocations). This prevents auto-selection from treating a GPU as mostly free when another process is consuming VRAM. Falls back to memory_allocated when mem_get_info is unavailable. 3. hardware.py: apply_gpu_ids([]) now returns early instead of setting CUDA_VISIBLE_DEVICES="" which would disable CUDA entirely. Empty list inherits the parent visibility, same as None. 4. hardware.py: Upgraded fallback_all GPU selection log from debug to warning so operators are notified when the model likely will not fit in available VRAM. * Guard nvidia-smi subprocess calls against OSError and TimeoutExpired get_visible_gpu_utilization and get_backend_visible_gpu_info now catch OSError (nvidia-smi not found) and TimeoutExpired internally instead of relying on callers to wrap every invocation. Returns the standard available=False sentinel on failure so the torch-based fallback in hardware.py can take over. 
* Guard get_primary_gpu_utilization and reset GPU caches between tests 1. nvidia.py: get_primary_gpu_utilization now catches OSError and TimeoutExpired internally, matching the pattern already used in get_visible_gpu_utilization and get_backend_visible_gpu_info. All three nvidia-smi callers are now self-contained. 2. test_gpu_selection.py: Added _GpuCacheResetMixin that resets the module-level _physical_gpu_count and _visible_gpu_count caches in tearDown. Applied to all test classes that exercise GPU selection, device map, or visibility functions. This prevents stale cache values from leaking between tests and causing flaky results on machines with real GPUs. * Fix nvidia-smi fallback regression and physical GPU count validation 1. hardware.py: get_gpu_utilization, get_visible_gpu_utilization, and get_backend_visible_gpu_info now check result.get("available") before returning the nvidia-smi result. When nvidia-smi is unavailable or returns no data (e.g., containers without nvidia-smi, UUID/MIG masks), the functions fall through to the torch-based fallback instead of returning an empty result. This fixes a regression where the internal exception handling in nvidia.py prevented the caller's except block from triggering the fallback. 2. hardware.py: resolve_requested_gpu_ids now separates negative-ID validation from physical upper-bound validation. The physical count check is only enforced when it is plausibly a true physical count (i.e., higher than the largest parent-visible ID), since torch.cuda.device_count() under CUDA_VISIBLE_DEVICES returns the visible count, not the physical total. The parent-visible-set check remains authoritative in all cases. This prevents valid physical IDs like [2, 3] from being rejected as "out of range" when nvidia-smi is unavailable and CUDA_VISIBLE_DEVICES="2,3" makes torch report only 2 devices. 
* Fix UUID/MIG torch fallback to enumerate devices by ordinal When CUDA_VISIBLE_DEVICES uses UUID or MIG identifiers, get_parent_visible_gpu_ids() returns [] because the tokens are non-numeric. The torch fallback in get_visible_gpu_utilization() and get_backend_visible_gpu_info() previously passed that empty list to _torch_get_per_device_info(), getting nothing back. Now both functions detect the empty-list case and fall back to enumerating torch-visible ordinals (0..device_count-1) with index_kind="relative". This means the UI and auto-selection still see real device data in Kubernetes, MIG, and Slurm-style UUID environments where nvidia-smi output cannot be mapped to physical indices. Updated test_uuid_parent_visibility to verify the new torch fallback path returns available=True with relative ordinals. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add type hint for gpu_ids parameter in InferenceOrchestrator.load_model --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-03-30 02:33:15 -07:00
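The multi-GPU memory accounting described in this commit (first GPU at full free memory, 0.85x for each additional GPU, fall back to all candidates when nothing fits) can be sketched as below. The function names, the dict-based input, and the "most free memory first" ranking are assumptions for illustration; only the 0.85 factor and the fallback behavior are taken from the commit message.

```python
from typing import Dict, List, Sequence

ADDITIONAL_GPU_FACTOR = 0.85  # overhead factor for GPUs after the first


def usable_capacity(free_gb: Sequence[float]) -> float:
    """First GPU counts in full; later GPUs are discounted for sharding overhead."""
    if not free_gb:
        return 0.0
    return free_gb[0] + ADDITIONAL_GPU_FACTOR * sum(free_gb[1:])


def auto_select_gpu_ids(free_by_gpu: Dict[int, float], required_gb: float) -> List[int]:
    """Pick the smallest prefix of GPUs (ranked by free memory) that fits the model.

    If even all GPUs are not enough, return every candidate as a fallback.
    """
    ranked = sorted(free_by_gpu.items(), key=lambda kv: kv[1], reverse=True)
    chosen: List[int] = []
    free: List[float] = []
    for gpu_id, gb in ranked:
        chosen.append(gpu_id)
        free.append(gb)
        if usable_capacity(free) >= required_gb:
            break
    return sorted(chosen)
```

Because the first GPU is not discounted, a model that needs 23 GB still lands on a single 24 GB device instead of being pushed onto two.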
Lee Jackson	2f0a5baa87	fix(studio): preserve GGUF context max after apply and refresh (#4691 ) Fixes #4670 Separates the GGUF context slider ceiling from the currently active context length so lowering context via Chat Settings no longer locks the slider max to the reduced value. - Backend: adds `max_context_length` to GGUF load/status responses, computed from the largest VRAM/KV-fit cap across all usable GPU subsets - Frontend: stores `ggufMaxContextLength` and uses it for Context Length slider/input bounds; hydrates from both `/api/inference/load` and `/api/inference/status` - Defaults UI ceiling to native context for CPU-only and fallback paths - Seeds `effective_ctx` and `max_available_ctx` before GPU probing to prevent `UnboundLocalError` on probe failure - Property fallback uses native `_context_length`, not effective `context_length`	2026-03-30 01:33:16 -07:00
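A minimal sketch of the "seed before probing" pattern this commit describes, which keeps the slider ceiling separate from the active context length. `resolve_context` and the `probe_caps` callable are hypothetical; the variable names `effective_ctx` and `max_available_ctx` come from the commit message.

```python
from typing import Callable, List, Optional, Tuple


def resolve_context(
    native_ctx: int,
    requested_ctx: Optional[int],
    probe_caps: Callable[[], List[int]],
) -> Tuple[int, int]:
    """Return (effective_ctx, max_available_ctx).

    probe_caps is assumed to return one VRAM/KV-fit cap per usable GPU subset.
    """
    # Seed both values first so a failing probe cannot leave them unbound.
    effective_ctx = min(requested_ctx, native_ctx) if requested_ctx else native_ctx
    max_available_ctx = native_ctx  # CPU-only / fallback ceiling
    try:
        caps = probe_caps()
        if caps:
            # Ceiling is the largest cap across subsets, never above native.
            max_available_ctx = min(native_ctx, max(caps))
            effective_ctx = min(effective_ctx, max_available_ctx)
    except Exception:
        pass  # probe failed: keep the seeded defaults
    return effective_ctx, max_available_ctx
```

Lowering the active context to 8192 on a model whose largest cap is 24576 returns `(8192, 24576)`, so the slider can still be raised again afterwards.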
Lee Jackson	5557e1fd27	studio: unify Windows installer/setup logging style, verbosity controls, and startup messaging (#4651 ) * refactor(studio): unify setup terminal output style and add verbose setup mode * studio(windows): align setup.ps1 banner/steps with setup.sh (ANSI, verbose) * studio(setup): revert nvcc path reordering to match main * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * studio(setup): restore fail-fast llama.cpp setup flow * studio(banner): use IPv6 loopback URL when binding :: or ::1 * Fix IPv6 URL bracketing, try_quiet stderr, _step label clamp - Bracket IPv6 display_host in external_url to produce clickable URLs - Redirect try_quiet failure log to stderr instead of stdout - Clamp _step label to column width to prevent negative padding * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add sandbox integration tests for PR #4494 UX fixes Simulation harness (tests/simulate_pr4494.py) creates an isolated uv venv, copies the real source files into it, and runs subprocess tests for all three fixes with visual before/after demos and edge cases. Standalone bash test (tests/test_try_quiet.sh) validates try_quiet stderr redirect across 8 scenarios including broken-version contrast. 39 integration tests total (14 IPv6 + 15 try_quiet + 10 _step), all existing 75 unit tests still pass. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Truncate step() labels in setup.sh to match PS1 and Python The %-15s printf format pads short labels but does not truncate long ones. Change to %-15.15s so labels wider than 15 chars are clipped, matching the PowerShell .Substring(0,15) and Python label[:15] logic. * Remove sandbox integration tests from PR These test files are not part of the styling fix and should not ship with this PR. 
* Show error output on failure instead of suppressing it - install_python_stack.py: restore _red for patch_package_file warnings (was downgraded to _dim) - setup.ps1: capture winget output and show on failure for CUDA, Node, Python, and OpenSSL installs (was piped to Out-Null) - setup.ps1: always show git pull failure warning, not just in verbose mode * Show winget error output for Git and CMake installs on failure Same capture-and-print-on-failure pattern already used for Node, Python, CUDA, and OpenSSL winget installs. * fix: preserve stderr for _run_quiet error messages in setup.sh The step() helper writes to stdout, but _run_quiet's error header was originally sent to stderr (>&2). Without the redirect, callers that separate stdout/stderr would miss the failure headline while still seeing the log body on stderr. Add >&2 to both step calls inside _run_quiet to match main's behavior. * feat: add --verbose flag to setup and update commands Wire UNSLOTH_VERBOSE=1 through _run_setup_script() so that 'unsloth studio update --verbose' (and the deprecated 'setup') passes the flag to setup.sh / setup.ps1 / install_python_stack.py. 
* fix(studio): honor verbose logging and keep llama.cpp failures non-blocking * fix(studio): switch installer to 'studio update' and normalize Windows setup logs * chore(studio): refine localhost tip and remove skip-base setup nois * fix(studio): align Windows setup logs with Linux style and improve startup tips * fix(studio): align Windows setup logs with Linux style * refactor(windows-installer): align install/setup logs with Linux style and silence auto-launch output * refactor(windows): align installer/setup output with Linux style and reduce default verbosity * refactor(windows): match install.ps1 output style/colors to setup and quiet default logs * fix(studio-banner): update personal-computer localhost tip * fix(setup.sh): restore verbose llama.cpp build output while keeping default quiet mode * fix(install.sh): align installer logging with setup style and restore POSIX-safe color output * fix(install.sh): preserve installer reliability and launch visibility Export verbose mode for child setup processes, harden install command handling under set -e, and keep first-run studio launch non-silent so users can always see URL and port fallback output. * fix(windows installer): keep exit semantics and degrade status accurate Use quiet command redirection that preserves native exit codes, keep startup output visible on first launch, and report limited install status when llama.cpp is unavailable. * fix(setup.sh): improve log clarity and enforce GGUF degraded signaling Restore clean default setup output, add verbose-only diagnostics, fail fast on Colab dependency install errors, and return non-zero when GGUF prerequisites or llama.cpp artifacts are unavailable. * fix(installer): harden bash preflight and PowerShell GPU checks Fail fast when bash is unavailable before invoking setup.sh, and replace remaining nvidia-smi pipeline checks with stream redirection patterns that preserve reliable native exit-code handling. 
* fix(windows): keep verbose output visible while preserving exit codes Ensure PowerShell wrapper helpers in install/update stream native command output to host without returning it as function output, so npm logs no longer corrupt exit-code checks in verbose mode. * fix(windows): avoid sticky UNSLOTH_VERBOSE and gate studio update verbosity * Fix degraded llama.cpp exit code, PS verbose stderr, banner URLs, npm verbose - setup.sh: Do not exit non-zero when llama.cpp is unavailable; the footer already reports the limitation, and install.sh runs under set -e so a non-zero exit aborts the entire install including PATH/shortcuts/launch. - setup.ps1: Remove $? check in Invoke-SetupCommand verbose path; PS 5.1 sets $? = $false when native commands write to stderr even with exit 0. Merge stderr into stdout with 2>&1 and rely solely on $LASTEXITCODE. - startup_banner.py: Show the actual bound address when Studio is bound to a non-loopback interface instead of always showing 127.0.0.1/localhost. - setup.sh: Use run_quiet_no_exit instead of run_quiet_no_exit_always for npm install steps so --verbose correctly surfaces npm output. * Fix install.ps1 verbose stderr, propagate UNSLOTH_VERBOSE, fix git clone verbose - install.ps1: Apply same Invoke-InstallCommand fix as setup.ps1 -- merge stderr into stdout with 2>&1 and drop the $? check that misclassifies successful native commands on PS 5.1. - install.ps1 + setup.ps1: Export UNSLOTH_VERBOSE=1 to the process env when --verbose is passed so child processes like install_python_stack.py also run in verbose mode. - setup.sh: Use run_quiet_no_exit for git clone llama.cpp so --verbose correctly surfaces clone diagnostics during source-build fallback. * Surface prebuilt llama.cpp output in verbose mode, remove dead code, fix banner - setup.sh: Use tee in verbose mode for prebuilt llama.cpp installer so users can see download/validation progress while still capturing the log for structured error reporting on failure. 
- setup.ps1: Same fix for Windows -- use Tee-Object in verbose mode. - setup.sh: Remove run_quiet_no_exit_always() which has no remaining callers. - startup_banner.py: Avoid printing the same URL twice when Studio is bound to a specific non-loopback address that matches the display host. * Fix run_install_cmd exit code after failed if-statement The previous pattern 'if "$@"; then return 0; fi; _rc=$?' always captured $? = 0 because $? reflects the if-statement result, not the command's exit code. Switch to '"$@" && return 0; _rc=$?' which preserves the actual command exit code on failure. Applies to both verbose and quiet branches. * Fix _run_quiet exit code, double uv install, missing --local flag - setup.sh: Fix _run_quiet verbose path that always captured exit code 0 due to $? resetting after if-then-fi with no else. Switch to the same '"$@" && return 0; exit_code=$?' pattern used in install.sh. - setup.sh: Consolidate the two uv install branches (verbose + quiet) into a single attempt with conditional output. Previously, when verbose mode was on and the install failed, a second silent attempt was made. - install.ps1: Pass --local flag to 'unsloth studio update' when $StudioLocalInstall is true. Without this, studio.py's update() command overwrites STUDIO_LOCAL_INSTALL to "0", which could cause issues if setup.ps1 or install_python_stack.py later checks that variable. * Revert SKIP_STUDIO_BASE change for --no-torch, restore install banners - Revert SKIP_STUDIO_BASE from 0 to 1 for --no-torch. install.sh already installs unsloth+unsloth-zoo and no-torch-runtime.txt before calling setup.sh, so letting install_python_stack.py redo it was redundant and slowed down --no-torch installs for no benefit. - Restore the "Unsloth Studio installed!" success banner and "starting Unsloth Studio..." launch message so users get clear install completion feedback before the server starts. 
* Make llama.cpp build failure a hard error with proper cleanup - setup.sh: Restore exit 1 when _LLAMA_CPP_DEGRADED is true. GGUF inference requires a working llama.cpp build, so this should be a hard failure, not a silent degradation. - install.sh: Catch setup.sh's non-zero exit with '\|\| _SETUP_EXIT=$?' instead of letting set -e abort immediately. This ensures PATH setup, symlinks, and shortcuts still get created so the user can fix the build deps and retry with 'unsloth studio update'. After post-install steps, propagate the failure with a clear error message. * Revert install.ps1 to 'studio setup' to preserve SKIP_STUDIO_BASE 'studio update' pops SKIP_STUDIO_BASE from the environment, which defeats the fast-path version check added in PR #4667. When called from install.ps1 (which already installed packages), SKIP_STUDIO_BASE=1 must survive into setup.ps1 so it skips the redundant PyPI check and package reinstallation. 'studio setup' does not modify env vars. * Remove deprecation message from 'studio setup' command install.ps1 uses 'studio setup' (not 'studio update') to preserve SKIP_STUDIO_BASE. The deprecation message was confusing during first install since the user never typed the command. * Fix stale env vars, scope degraded exit, generic error message for PR #4651 - install.ps1: Always set STUDIO_LOCAL_INSTALL and clear STUDIO_LOCAL_REPO when not using --local, to prevent stale values from a previous --local run in the same PowerShell session. Fix log messages to say 'setup' not 'update' since we call 'studio setup'. - setup.sh: Only exit non-zero for degraded llama.cpp when called from the installer (SKIP_STUDIO_BASE=1). Direct 'unsloth studio update' keeps degraded installs successful since Studio is still usable for non-GGUF workflows and the footer already reports the limitation. 
- install.sh: Make the setup failure error message generic instead of GGUF-specific, so unrelated failures (npm, Python deps) do not show misleading cmake/git recovery advice. * Show captured output on failure in quiet mode for PR #4651 Both Invoke-InstallCommand (install.ps1) and Invoke-SetupCommand (setup.ps1) now capture command output in quiet mode and display it in red when the command fails. This matches the behavior of run_install_cmd in install.sh where failure output is surfaced even in quiet mode, making cross-platform error debugging consistent. * Match degraded llama.cpp exit on Windows, fix --local recovery hint for PR #4651 - setup.ps1: Exit non-zero for degraded llama.cpp when called from install.ps1 (SKIP_STUDIO_BASE=1), matching setup.sh behavior. Direct 'unsloth studio update' keeps degraded installs successful. - install.sh: Show 'unsloth studio update --local' in the recovery message when the install was run with --local, so users retry with the correct flag instead of losing local checkout context. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-03-30 00:53:23 -07:00
Roland Tannous	5bbfabb151	fix: [Studio] setup.ps1 update-flow for windows (#4667) * fix: add PyPI version check to setup.ps1 for fast update path Port the update-flow logic from setup.sh to setup.ps1 so that `unsloth studio update` on Windows skips Python dependency reinstall when the installed version already matches PyPI latest. * fix: clear SKIP_STUDIO_BASE in update command install.ps1 sets SKIP_STUDIO_BASE=1 which persists in the PowerShell session. If the user runs `unsloth studio update` in the same terminal, the env var causes the version check to be skipped. Clear it explicitly in the update command. * fix: harden version check and clear stale env vars in update flow - Normalize $InstalledVer with Out-String + Trim() to avoid array/whitespace comparison issues in PowerShell 5.1 (python output can be captured as string[] instead of scalar string) - Move Fast-Install --upgrade pip inside if (-not $SkipPythonDeps) so the fast path avoids unnecessary network round-trips - Clear STUDIO_LOCAL_REPO when --local is not passed to prevent a previous --local session from leaking into a plain update --------- Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-03-29 21:14:36 -07:00
Roland Tannous	a6c1f893fc	Fix blank page on Windows due to broken .js MIME type (#4674) * Fix blank page on Windows due to broken .js MIME type in registry * Update studio/backend/main.py: apply the defensive suggestion from Gemini to make the mimetypes override specific to Windows platforms Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2026-03-28 22:26:49 +04:00
Lee Jackson	5d2dca801c	studio: add HF/local model selection UI for GGUF export (#4365) * feat(studio): add HF/local model selection UI for GGUF export * fix(studio): fix selector ring clipping * fix(studio): export page trust_remote_code control and label styling * fix(studio): accept hf_token in load_checkpoint orchestrator method The route was passing hf_token to load_checkpoint() but the method didn't accept it, causing a TypeError on every /api/export/load-checkpoint request. * fix(studio): clear HF model selection when input is edited Previously selectedSourceModel was only cleared when the input became empty, so editing to a different repo ID after selecting a model would silently keep the old selection. --------- Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>	2026-03-28 22:18:25 +04:00
Daniel Han	82d14b44d3	fix: preserve Windows drive-letter paths on native Windows (#4665) normalize_path() unconditionally converted Windows paths like C:\Users\... to WSL format /mnt/c/Users/..., which breaks path resolution on native Windows. This caused LM Studio GGUF models to fail detection (detect_gguf_model returned None for the invalid path), falling through to the Unsloth import path which requires a GPU. Now only performs the /mnt/ mapping when actually running under WSL. On native Windows, drive letters are preserved and backslashes are normalized to forward slashes.	2026-03-27 08:19:41 -07:00
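The path handling described in #4665 can be sketched roughly as below. This is a minimal illustration, not the Studio code: `normalize_path_sketch` and its `running_under_wsl` parameter are hypothetical names, and how the real `normalize_path()` detects WSL is not stated in the commit message.

```python
import re

# Matches a drive-letter path after backslashes have been normalized, e.g. "C:/Users/me".
_DRIVE_RE = re.compile(r"^([A-Za-z]):/(.*)$")


def normalize_path_sketch(path: str, running_under_wsl: bool) -> str:
    """Hypothetical stand-in for normalize_path(); WSL detection is left to the caller."""
    # Backslashes are normalized to forward slashes in every environment.
    normalized = path.replace("\\", "/")
    match = _DRIVE_RE.match(normalized)
    if match and running_under_wsl:
        # Only under WSL is the drive letter mapped to /mnt/<drive>/...
        drive, rest = match.groups()
        return f"/mnt/{drive.lower()}/{rest}"
    # On native Windows (or for POSIX paths) the drive letter is preserved.
    return normalized
```

Passing the WSL flag in explicitly keeps the sketch testable on any platform; the pre-fix behavior corresponds to always taking the `/mnt/` branch.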
Roland Tannous	562e54fc6e	Fix HF cache default and show LM Studio models in chat/inference (#4653 ) * fix: default HF cache to standard platform path instead of legacy Unsloth cache * feat: show LM Studio and local models in chat Fine-tuned tab * feat: show LM Studio models in Hub models tab * fix: fetch local models after auth refresh completes * Revert "fix: fetch local models after auth refresh completes" This reverts commit `cfd61f0ac7`. * fix: increase llama-server health check timeout to 600s for large models * feat: expandable GGUF variant picker for LM Studio local models * fix: show GGUF variant label for locally loaded LM Studio models * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: show publisher name in LM Studio model labels * fix: set model_id for loose GGUF files in LM Studio publisher dirs * fix: show publisher prefix in Fine-tuned tab LM Studio models * fix: only use model_id for lmstudio source models * fix: only show LM Studio models in Hub tab on Mac/chat-only mode * fix: respect XDG_CACHE_HOME, handle Windows paths in isLocalPath, refresh LM Studio on remount - _setup_cache_env now reads XDG_CACHE_HOME (falls back to ~/.cache) instead of hard-coding ~/.cache/huggingface. This follows the standard HF cache resolution chain and respects distro/container overrides. - isLocalPath in GgufVariantExpander uses a regex that covers Windows drive letters (C:\, D:/), UNC paths (\\server\share), relative paths (./, ../), and tilde (~/) -- not just startsWith("/"). - HubModelPicker.useEffect now calls listLocalModels() before the alreadyCached early-return gate so LM Studio models are always refreshed on remount. Also seeds useState from _lmStudioCache for instant display on re-open. 
* fix: add comment explaining isLocalPath regex for Windows/cross-platform paths * fix: prioritize unsloth publisher in LM Studio model list * fix: scope unsloth-first sort to LM Studio models on all platforms * fix: add missing _lmStudioCache module-level declaration * fix: prioritize unsloth publisher before timestamp sort in LM Studio group --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-03-27 06:59:27 -07:00
Wasim Yousef Said	73969a1e4f	fix: disable OCR in pymupdf4llm PDF extraction (#4659)	2026-03-27 06:53:33 -07:00
Daniel Han	c4e34c88c8	Fall back to parsing model name when HF API has no param count (#4656) Some models like unsloth/Qwen3-0.6B have no safetensors metadata on Hugging Face, so the training model selector showed no parameter size badge. The chat model picker already had extractParamLabel() as a fallback that parses sizes like "0.6B" from the model name. Add the same fallback to the training model selector and the onboarding model selection step. Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>	2026-03-27 05:57:49 -07:00
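The name-parsing fallback from #4656 could look something like the following. The frontend helper `extractParamLabel()` is not shown in the commit message, so this is a Python analogue under assumptions: the function name, the regular expression, and the choice to return `None` when nothing matches are all hypothetical.

```python
import re
from typing import Optional

# A number (optionally decimal) followed by "B"/"b", not glued to other
# alphanumerics, so "4bit" and the "3.5" in "Qwen3.5" are not mistaken for sizes.
_PARAM_RE = re.compile(r"(?<![A-Za-z0-9.])(\d+(?:\.\d+)?)[bB](?![A-Za-z0-9])")


def extract_param_label_sketch(model_name: str) -> Optional[str]:
    """Hypothetical fallback: derive a size badge such as "0.6B" from the model name."""
    match = _PARAM_RE.search(model_name)
    return f"{match.group(1)}B" if match else None
```

A caller would use this only when the Hugging Face metadata carries no parameter count, as the commit describes.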
Wasim Yousef Said	4ab7fb1f7b	fix: replace navbar shutdown text button with icon-only button (#4655)	2026-03-27 05:44:59 -07:00
Daniel Han	e36f72c685	Detect always-on reasoning models and show Think button as locked-on (#4654) * Detect always-on reasoning models and show Think button as locked-on Models with hardcoded <think>/</think> tags or reasoning_content in their chat template (e.g. distilled reasoning models) always produce thinking output regardless of any toggle. Previously these models were not detected as reasoning-capable at all, so the Think button was grayed out even though the model was actively reasoning. Backend: - Detect <think>/</think> and reasoning_content in GGUF chat templates as a fallback when enable_thinking is not present - Add reasoning_always_on flag to LoadResponse and InferenceStatusResponse - Pass the flag through all GGUF load and status response paths Frontend: - Add reasoningAlwaysOn to the chat runtime store and API types - When reasoning_always_on is true, show the Think button as lit (active) but not clickable, with a tooltip explaining the model always uses thinking - Force reasoningEnabled=true when the model always reasons * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Use pointer-events-none instead of disabled for always-on Think button The HTML disabled attribute was not fully blocking clicks on the Think button for always-on reasoning models. Switch to pointer-events-none CSS class which prevents all mouse interaction at the CSS level. * Use a static span instead of disabled button for always-on Think Replace the button element with a plain span when reasoning is always on. This makes it physically impossible to toggle since there is no clickable element at all, avoiding any CSS or disabled-attribute edge cases. * Simplify always-on Think button to stay lit and remain toggleable Keep the Think button as a normal toggleable button but ensure it shows as lit when reasoning_always_on is true. The model always reasons regardless of the toggle state so there is no need to block interaction. 
--------- Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-03-27 05:42:26 -07:00
Daniel Han	eacaf6827c	fix: no-torch install deps without pulling torch transitively (#4650) Use --no-deps for ALL packages (unsloth, unsloth-zoo, and runtime deps) since the current PyPI metadata for unsloth still declares torch as a hard dependency. Runtime deps (typer, pydantic, safetensors, transformers, etc.) are installed from no-torch-runtime.txt with --no-deps to prevent transitive torch resolution from accelerate, peft, trl, and sentence-transformers. no-torch-runtime.txt now includes unsloth's own direct deps (typer, pydantic, pyyaml, nest-asyncio) since --no-deps skips those too. install.sh installs no-torch-runtime.txt directly (via helper function _find_no_torch_runtime). install.ps1 does the same via Find-NoTorchRuntimeFile. SKIP_STUDIO_BASE stays at 1 to avoid setup.sh fast-path issues. install_python_stack.py NO_TORCH branch does the same for unsloth studio update, using package_name instead of hardcoded "unsloth".	2026-03-27 05:19:26 -07:00
Daniel Han	a7c43bc46d	Fix inference failing for transformers 5.x models (trust_remote_code) (#4652) * Fix inference failing for transformers 5.x models (trust_remote_code) The training worker in core/training/worker.py auto-enables trust_remote_code for unsloth/* models that need transformers 5.x (e.g. NVIDIA-Nemotron-3-Nano-4B). The inference worker did not have the same logic, so loading these models for chat would fail with "No config file found" while training worked fine. Add the same auto-detection to the inference worker so trust_remote_code is set automatically when needed. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-03-27 04:51:30 -07:00
Wasim Yousef Said	887b8cb1c2	fix: add auth + UX improvements to shutdown button (#4642) * Studio shutdown button * fix: add auth to shutdown endpoint and improve UX - Add JWT auth (Depends(get_current_subject)) to POST /api/shutdown - Use authFetch instead of bare fetch in shutdown dialog - Only show beforeunload prompt when training is running - Remove Ctrl+W/Cmd+W interception (browsers don't allow it) - Store shutdown task on app.state to prevent GC --------- Co-authored-by: Datta Nimmaturi <venkatadattasainimmaturi@gmail.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-03-27 04:36:08 -07:00
Daniel Han	1fb9fe3304	Fix orphan server cleanup killing user's own llama-server (#4622 ) * fix: only kill studio-managed llama-server processes, not user's own servers _kill_orphaned_servers() checked for "unsloth" anywhere in the process cmdline, which matched the user's own llama-server when serving models from unsloth/ HF repos (the model path in -m contains "unsloth"). This caused the user's server to get SIGKILLed on Studio startup, destroying their prompt cache and forcing full model re-loads. Narrow the check to only match processes whose binary path lives under ~/.unsloth/llama.cpp/ (the Studio install directory). * Address review: cover env var paths, move Path.home() inside try block - Also check LLAMA_SERVER_PATH and UNSLOTH_LLAMA_CPP_PATH so orphans from custom install locations are still cleaned up. - Move studio_dirs construction inside the try/except so a Path.home() failure (containers without HOME) does not crash the constructor. * Address reviewer feedback: proper path ancestry, /proc/pid/exe, legacy paths Changes based on 10-reviewer consensus: - Use Path.is_relative_to() instead of substring matching to prevent false positives on sibling paths like ~/.unsloth/llama.cpp-backup/. - Use /proc/<pid>/exe (symlink to real binary) instead of parsing the first cmdline token, which breaks on paths with spaces. Falls back to cmdline parsing on non-Linux or when /proc is unavailable. - Add legacy in-tree install paths (project_root/llama.cpp/ and project_root/bin/) so orphans from older setup.sh are still cleaned. - Treat LLAMA_SERVER_PATH as an exact binary match rather than widening it to its parent directory, which could match unrelated servers in shared locations like /usr/local/bin/. - Keep everything inside the try/except so Path.home() failures in containers do not crash the constructor. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Address review: add Linux platform guard and log cleanup errors - Guard pgrep fallback with sys.platform check so it does not crash on Windows/macOS when psutil is unavailable. - Replace silent except-pass with logger.warning for observability. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-03-27 04:33:04 -07:00
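The difference between substring matching and proper path ancestry in #4622 is easy to see in a small sketch. `is_studio_managed` and the example paths are hypothetical; the only library call relied on is `PurePath.is_relative_to()` (Python 3.9+), which the commit names. `PurePosixPath` is used here only so the example behaves the same on every platform.

```python
from pathlib import PurePosixPath
from typing import Iterable


def is_studio_managed(exe_path: PurePosixPath, install_roots: Iterable[PurePosixPath]) -> bool:
    """Hypothetical check: True only if the binary lives under one of the install roots."""
    # is_relative_to() compares whole path components, so a sibling directory
    # such as ".../llama.cpp-backup" is not treated as being inside ".../llama.cpp".
    return any(exe_path.is_relative_to(root) for root in install_roots)


roots = [PurePosixPath("/home/u/.unsloth/llama.cpp")]
managed = PurePosixPath("/home/u/.unsloth/llama.cpp/build/bin/llama-server")
sibling = PurePosixPath("/home/u/.unsloth/llama.cpp-backup/bin/llama-server")
```

Here a naive `str(roots[0]) in str(sibling)` test would report a match, while the ancestry check does not.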
Daniel Han	b1c3a1e857	fix: replace [huggingfacenotorch] with no-torch-runtime.txt requirements (#4649) The [huggingfacenotorch] extras only exist in pyproject.toml but are NOT published on PyPI, so uv pip install "unsloth[huggingfacenotorch]" fails on fresh installs from the registry. Fix: add studio/backend/requirements/no-torch-runtime.txt with the runtime deps (safetensors, transformers, datasets, accelerate, etc.) that mirror [huggingfacenotorch] from pyproject.toml. In no-torch mode: 1. install.sh/ps1 install unsloth + unsloth-zoo with --no-deps 2. SKIP_STUDIO_BASE=0 so install_python_stack.py's NO_TORCH branch runs 3. install_python_stack.py installs no-torch-runtime.txt	2026-03-27 03:58:51 -07:00
Daniel Han	9d68621614	Streaming tool detection: guard late tool_calls, filter incomplete fragments (#4648) * Guard against late tool_calls after visible content, filter incomplete fragments 1. If visible content was already emitted (_last_emitted is non-empty) when delta.tool_calls arrives, ignore the tool_calls instead of reclassifying the turn as a tool call. llama-server never interleaves content and tool_calls (they are mutually exclusive), but this guard is defensive for other OpenAI-compatible backends. 2. Filter out incomplete structured tool_calls fragments before execution. Entries with empty function.name (from truncation by max_tokens, disconnect, or interruption) are skipped instead of being passed to execute_tool().	* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-03-27 03:40:14 -07:00
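The fragment filter in point 2 of #4648 amounts to dropping entries whose `function.name` is empty. A minimal sketch, assuming the accumulated tool calls are plain dicts in the OpenAI-style shape (`{"id": ..., "function": {"name": ..., "arguments": ...}}`); the function name is hypothetical and the real accumulator structure is not shown in the commit message.

```python
from typing import Any, Dict, List


def filter_complete_tool_calls(tool_calls: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical filter: keep only entries that carry a non-empty function name."""
    complete = []
    for call in tool_calls:
        # A missing or None "function" is treated the same as an empty name.
        name = (call.get("function") or {}).get("name")
        if name:
            complete.append(call)
    return complete


calls = [
    {"id": "call_0", "function": {"name": "web_search", "arguments": "{\"q\": \"x\"}"}},
    {"id": "call_1", "function": {"name": "", "arguments": "{\"q\""}},  # truncated fragment
    {"id": "call_2"},  # no function payload at all
]
```

Only the first entry survives the filter, so nothing with an empty name would reach the tool executor.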
Wasim Yousef Said	5c7c3883cb	feat: update app icons to rounded logo (#4640) Replace favicon.png, unsloth-gem.png, and unsloth.ico with rounded.png. Update install.sh to source rounded.png for Linux/macOS shortcuts.	2026-03-27 03:18:20 -07:00
Daniel Han	79d9bf0c9a	Fix GGUF GPU fit check to account for KV cache VRAM (#4623 ) * fix: account for KV cache in GGUF GPU fit check and auto-cap context length The GPU fit check only compared GGUF file size against free VRAM, ignoring KV cache memory. Models with large native context lengths (e.g. Qwen3.5-9B at 262k) would pass the fit check since the GGUF is only 5.6 GB, but the KV cache at 262k context needs ~40 GB at f16. This caused llama-server to silently fall back to CPU inference. Changes: - Parse block_count, head_count_kv, head_count, and embedding_length from GGUF metadata alongside context_length - Add KV cache VRAM estimation based on architecture params and the selected cache quantization type (f16, q8_0, q4_0, etc.) - Auto-reduce context length to the maximum that fits in available GPU VRAM when the native context would exceed it - Include estimated KV cache size in the _select_gpus total so the fit decision reflects actual runtime memory, not just file size For the reported scenario (Qwen3.5-9B on RTX 3090 with 22415 MiB free), context is auto-reduced from 262144 to ~63k with f16 KV cache, keeping the model fully on GPU. With q4_0 KV cache quantization the context can reach ~226k. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: resolve 6 bugs in KV cache VRAM estimation and add test harness - Fix q8_0 BPE constant: 1.125 -> 34/32 (1.0625) to match llama.cpp block size - Fix _fit_context_to_vram returning min_ctx when weights exceed budget (should return requested_ctx unchanged, let --fit handle it) - Fix binary search inflating below-2048 requests (lo=min_ctx=2048 > hi) - Fix n_ctx=0 regressing to 4096 when metadata unavailable (preserve sentinel) - Fix multi-GPU auto-cap using single-GPU budget instead of aggregate - Fix _context_length being overwritten with capped effective value Add tests/test_gguf_kv_vram.py: 43 cross-platform pytest tests covering pure logic, integration (monkeypatched load_model), and real GGUF parsing. Runs in an isolated uv venv with only pytest -- no GPU/torch/structlog needed. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: complete _effective_context_length lifecycle - Initialize _effective_context_length in __init__ (prevents AttributeError) - Reset _effective_context_length in unload_model (prevents stale values) - Update context_length property to return effective (capped) value for the UI/API, falling back to native _context_length if not set * fix: multi-GPU selection tries smallest subset first The previous approach summed all GPUs' memory to cap context, then selected GPUs afterward. This was overly optimistic for heterogeneous setups (e.g., 48 GiB + 4 GiB): the context was inflated by the tiny GPU's contribution, then both GPUs were dragged in. Now we try GPU subsets from smallest (1 GPU) to largest, capping context for each. We pick the smallest subset where the model+KV fits. This prefers single-GPU when possible (simpler, no tensor split overhead) and avoids pulling in GPUs that barely help. Add tests: test_multi_gpu_prefers_fewer_gpus, test_multi_gpu_heterogeneous. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: prefer fewer GPUs over higher context in GPU selection Multi-GPU inference is slower due to tensor-split overhead, so we should prefer fewer GPUs with reduced context over more GPUs with full context. Now the loop stops at the first GPU subset where the model fits, rather than continuing to find subsets that allow higher context. Only if the model can't fit on N GPUs do we try N+1. This preserves the original behavior: use multi-GPU only when the model doesn't fit on a single GPU. * fix: make _kill_orphaned_servers cross-platform via psutil Replace pgrep + os.kill(SIGKILL) with psutil.process_iter() and proc.kill(), which work on Linux, macOS, and Windows. Build an allowlist of install roots matching _find_llama_server_binary so only studio-managed servers are killed. * fix: skip KV estimation loop when effective context is unknown When n_ctx=0 and GGUF metadata lacks context_length, effective_ctx stays 0. _estimate_kv_cache_bytes(0) returns 0, so a GPU could be selected with no KV headroom. Guard the loop with effective_ctx > 0 to fall back to file-size-only GPU selection in this case. * chore: temporarily remove test harness (will add back separately) * refactor: deduplicate UINT32/UINT64 handling in GGUF parser Replace duplicated if/elif chains for vtype 4 and 10 with a single block using setattr. No behavioral change. * fix: honor explicit n_ctx by using multi-GPU before capping When the user explicitly sets n_ctx, try to fit the full requested context using _select_gpus (which adds GPUs as needed). Only cap context if it doesn't fit on any GPU combination. When n_ctx=0 (auto/native context), keep the existing behavior: prefer fewer GPUs with reduced context, since multi-GPU is slower and the user didn't ask for a specific context length. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: context_length property returns native value for frontend slider The frontend uses context_length as the slider max. Returning the capped effective value prevented users from requesting higher context on reload (e.g., after switching to q4_0 KV cache). Revert to returning the native GGUF metadata value -- the backend auto-caps at load time regardless. * revert: context_length returns effective (capped) value The UI slider should show what the server is actually running at, not the theoretical maximum. Revert to returning the effective context length. * fix: raise minimum context floor from 2048 to 4096 --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-03-27 03:14:42 -07:00
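The KV cache estimate and context cap from #4623 can be approximated with the standard formula: two tensors (K and V) per layer, each `n_ctx x head_count_kv x head_dim` elements. This is a sketch under assumptions, not the Studio code. The function names and signatures are hypothetical; the f16 and q8_0 (34/32) byte sizes follow the commit message, while the q4_0 value of 18/32 is an assumption based on llama.cpp's block layout. The commit mentions a binary search; because this estimate is linear in context length, the sketch swaps in a plain division.

```python
# Bytes per cached element. q4_0 is an assumption, not taken from the commit.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def estimate_kv_cache_bytes(n_ctx: int, block_count: int, head_count_kv: int,
                            head_count: int, embedding_length: int,
                            cache_type: str = "f16") -> int:
    """Hypothetical estimate: K and V tensors for every layer at n_ctx tokens."""
    head_dim = embedding_length // head_count
    elements = 2 * block_count * head_count_kv * head_dim * n_ctx
    return int(elements * BYTES_PER_ELEMENT[cache_type])


def fit_context_to_vram(requested_ctx: int, weights_bytes: int, budget_bytes: int,
                        kv_bytes_per_token: int, min_ctx: int = 4096) -> int:
    """Hypothetical cap: largest context whose KV cache fits next to the weights."""
    if weights_bytes >= budget_bytes or kv_bytes_per_token <= 0:
        # Weights alone exceed the budget: leave the request unchanged.
        return requested_ctx
    max_ctx = (budget_bytes - weights_bytes) // kv_bytes_per_token
    if max_ctx >= requested_ctx:
        return requested_ctx
    # Never inflate a request that was already below the floor.
    return max(max_ctx, min(min_ctx, requested_ctx))


GIB = 1024 ** 3
# Made-up architecture values for illustration; not any specific model's metadata.
per_token = estimate_kv_cache_bytes(1, 32, 8, 32, 4096, "f16")
```

With these illustrative numbers, each token costs 131072 bytes at f16, so a 262144-token context needs 32 GiB of KV cache, far more than a small GGUF file size alone would suggest.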
Daniel Han	e318da21a7	Fix ~1.2s TTFT penalty when tools are enabled in Studio (#4639 ) * Fix ~1.2s TTFT penalty when tools are enabled in Studio When users enable web search, Python execution, or terminal tools, every message gets a ~1.2s delay before any text appears -- even when the model does not call any tool. This happens because generate_chat_completion_with_tools() does a non-streaming detection pass (stream: False) first, waits for the complete response, then checks for tool calls. For the ~90% of messages that don't trigger a tool call, this blocking wait is entirely wasted. Root cause: the detection pass payload uses stream: False, forcing llama-server to generate the entire response before returning any tokens. Fix: replace the non-streaming detection pass with a streaming pass (stream: True) and a speculative buffer state machine that detects tool signals in the first 1-2 SSE chunks: - BUFFERING: accumulate content tokens, check first chars for tool signal prefixes (<tool_call>, <function=) - STREAMING: no tool detected, yield tokens to caller immediately - DRAINING: tool signal found, silently accumulate rest of stream Three detection paths: 1. Structured delta.tool_calls -- detected instantly, transition to DRAINING, accumulate fragments, assemble at stream end. 2. XML tool markup in content -- buffer holds up to 32 chars checking for <tool_call> or <function= prefix, then transitions to DRAINING. 3. No tool signal -- first non-whitespace, non-XML char triggers immediate transition to STREAMING (fast path, ~90% of requests). Safety net: after any stream ends in STREAMING state, check accumulated content for XML tool signals. Handles rare "content before tool call" edge case. 
Additional supporting changes: - Add headers parameter to _stream_with_retry for auth forwarding - Share _strip_tool_markup and regex patterns between the detection pass and the final streaming pass (removes duplication) - Remove the iteration==0 non-streaming content shortcut (no longer needed since all iterations stream directly) - Keep the final streaming pass as fallback for max_tool_iterations exhaustion Benchmarked on Qwen3.5-4B Q4_K_XL: - No tools: TTFT ~112ms (unchanged) - Tools enabled, no call: TTFT ~112ms (was ~1207ms) - Decode TPS: 226 (unchanged in all cases) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add unit tests for streaming tool detection state machine 16 tests covering every tool call parsing path: - Plain text (no tool call) streaming - Structured delta.tool_calls detection and fragment assembly - XML <tool_call>JSON</tool_call> detection via buffer - XML <function=name> tag detection via buffer - Whitespace before tool XML - Safety net (content then tool XML) - Parallel multi-tool calls - Reasoning token bypass (thinking models) - Reasoning then tool call - Empty response handling - Buffer prefix timeout (HTML not mistaken for tool) - Non-XML first char instant streaming - False positive rejection (<tool_tip> vs <tool_call>) - Arguments split across multiple chunks - auto_heal_tool_calls=False respects the flag - Metrics accumulation across tool iterations * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix reasoning-only BUFFERING, pre-tool content emission, and code duplication Addresses review feedback on the streaming tool detection: 1. Reasoning tokens are no longer yielded during BUFFERING/DRAINING states. 
The consumer in routes/inference.py tracks prev_text across tool iterations without resetting it, so yielding reasoning during a detection pass that resolves to a tool call would corrupt the delta computation for subsequent iterations. Reasoning is now silently accumulated during detection (matching the old non-streaming behavior) and flushed together with content when the buffer resolves to STREAMING. 2. Handle reasoning-only responses in the BUFFERING resolver. When a thinking model emits only reasoning_content with no content tokens, the stream ends while still in BUFFERING state. The resolver now detects this case and yields reasoning as plain text (without <think> wrapper), matching the final streaming pass behavior for models like Qwen3 in always-think mode. 3. Replace duplicated re.sub calls for stripping tool markup with the existing _strip_tool_markup(content_text, final=True) helper, removing ~40 lines of redundant regex code. 4. Update tests: adjust reasoning test expectations to match the new behavior (reasoning batched with content, not streamed individually during BUFFERING). Add test_reasoning_only_no_content for the reasoning-only edge case. 17/17 tests pass. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Address remaining reviewer findings: late tool_call IDs and XML speculation 1. Late-arriving tool_calls.id: when a provider sends the real ID on a later delta chunk (after the initial one with index and function name), the accumulator now updates the ID instead of keeping the synthetic "call_{idx}" placeholder. (P2, 2/10 reviewers) 2. XML speculation respects auto_heal_tool_calls: when auto_heal is explicitly disabled, _TOOL_XML_SIGNALS is empty so the BUFFERING state never speculatively holds content for XML prefix detection. Content starting with literal "<tool_call>" or "<function=" text flows straight through without delay. 
(P2, 1/10 reviewers) Skipped: finish_reason="tool_calls" without delta.tool_calls fallback (P1, 1/10 reviewers). llama-server always sends delta.tool_calls fragments in streaming mode. A non-streaming fallback for this edge case would add complexity for a scenario that does not occur in practice with the supported backend. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Check request.is_disconnected() every 20 tokens instead of every token The disconnect check is an async round-trip that adds overhead on every loop iteration. Since the cancel watcher in llama_cpp.py already handles connection teardown (closes the streaming response on cancel), this route-layer check is a secondary safety net that does not need to run on every single token. Check every 20 tokens across all 4 streaming paths: - gguf_tool_stream (tool-enabled GGUF) - gguf_stream_chunks (standard GGUF) - audio_input_generate (audio/whisper input) - generic backend stream (non-GGUF fallback) * Fix safety net, DRAINING metadata, and test import path 1. Safety net no longer retroactively executes tools after visible content was already emitted to the user. Once _last_emitted is non-empty, the stream is committed to normal content mode. Retroactive tool execution after visible output would violate the streaming contract and corrupt the route-layer cumulative delta tracker (prev_text). The tool XML is still stripped by _strip_tool_markup so the user sees clean content. 2. DRAINING false-positive path now merges accumulated metrics from prior tool iterations instead of dropping them. Uses the same merge formula as the STREAMING path. 3. Test import path fixed to use repo root instead of hardcoded sibling directory. Works in clean checkouts and CI. 4. Renamed test_content_then_tool_xml_safety_net to test_content_then_tool_xml_no_retroactive_execution to reflect the corrected behavior. 17/17 tests pass. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Redact --api-key value from llama-server startup log When UNSLOTH_DIRECT_STREAM=1, the generated bearer token was logged verbatim in the startup command. Replace the secret with <redacted> before logging. * Remove test file temporarily * Revert disconnect throttle, reset prev_text on tool_start, restore XML safety net Addresses all P1 findings from reviewer round 3 (10 reviewers): 1. Revert disconnect check to every iteration (was every 20th). All 10 reviewers flagged this as a correctness regression for short streams and sparse tool event loops. The cancel watcher in llama_cpp.py is the primary mechanism but the route-layer check must remain per-iteration for completeness. [10/10] 2. Reset prev_text on tool_start in gguf_tool_stream. When a tool cycle begins after visible content was already streamed, the route-layer cumulative delta tracker (prev_text) must be reset so the post-tool synthesis response is not truncated or dropped. [9/10] 3. Remove the _last_emitted gate from the XML safety net. The gate was added to prevent retroactive tool execution after visible content, but with prev_text now reset on tool_start (#2), the root cause is fixed and the safety net can correctly handle content-then-tool-XML responses (matching pre-PR behavior). [8/10] * Use None instead of {} for empty auth headers in TTS methods * Include accumulated metrics in STREAMING metadata check * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-03-27 03:13:38 -07:00
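The `--api-key` redaction mentioned in the commit above could look roughly like the sketch below. It assumes the llama-server command is held as a list of argument strings and that the secret either follows `--api-key` as its own element or is attached with `=`. The helper name `redact_api_key` is hypothetical and is not taken from the repository.

```python
from typing import List


def redact_api_key(cmd: List[str], flag: str = "--api-key") -> List[str]:
    """Return a copy of `cmd` that is safe to write to a log.

    Assumption: the secret is the argument right after `flag`,
    or is attached to it as `--api-key=<value>`.
    """
    safe: List[str] = []
    hide_next = False
    for arg in cmd:
        if hide_next:
            # This element is the secret itself: never log it.
            safe.append("<redacted>")
            hide_next = False
        elif arg == flag:
            safe.append(arg)
            hide_next = True
        elif arg.startswith(flag + "="):
            safe.append(flag + "=<redacted>")
        else:
            safe.append(arg)
    return safe
```

Building a redacted copy, rather than editing the list in place, keeps the real command intact for the actual process launch.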
Lee Jackson	0233fe7f9c	studio: setup log styling (#4494 ) * refactor(studio): unify setup terminal output style and add verbose setup mode * studio(windows): align setup.ps1 banner/steps with setup.sh (ANSI, verbose) * studio(setup): revert nvcc path reordering to match main * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * studio(setup): restore fail-fast llama.cpp setup flow * studio(banner): use IPv6 loopback URL when binding :: or ::1 * Fix IPv6 URL bracketing, try_quiet stderr, _step label clamp - Bracket IPv6 display_host in external_url to produce clickable URLs - Redirect try_quiet failure log to stderr instead of stdout - Clamp _step label to column width to prevent negative padding * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add sandbox integration tests for PR #4494 UX fixes Simulation harness (tests/simulate_pr4494.py) creates an isolated uv venv, copies the real source files into it, and runs subprocess tests for all three fixes with visual before/after demos and edge cases. Standalone bash test (tests/test_try_quiet.sh) validates try_quiet stderr redirect across 8 scenarios including broken-version contrast. 39 integration tests total (14 IPv6 + 15 try_quiet + 10 _step), all existing 75 unit tests still pass. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Truncate step() labels in setup.sh to match PS1 and Python The %-15s printf format pads short labels but does not truncate long ones. Change to %-15.15s so labels wider than 15 chars are clipped, matching the PowerShell .Substring(0,15) and Python label[:15] logic. * Remove sandbox integration tests from PR These test files are not part of the styling fix and should not ship with this PR. 
* Show error output on failure instead of suppressing it - install_python_stack.py: restore _red for patch_package_file warnings (was downgraded to _dim) - setup.ps1: capture winget output and show on failure for CUDA, Node, Python, and OpenSSL installs (was piped to Out-Null) - setup.ps1: always show git pull failure warning, not just in verbose mode * Show winget error output for Git and CMake installs on failure Same capture-and-print-on-failure pattern already used for Node, Python, CUDA, and OpenSSL winget installs. * fix: preserve stderr for _run_quiet error messages in setup.sh The step() helper writes to stdout, but _run_quiet's error header was originally sent to stderr (>&2). Without the redirect, callers that separate stdout/stderr would miss the failure headline while still seeing the log body on stderr. Add >&2 to both step calls inside _run_quiet to match main's behavior. * feat: add --verbose flag to setup and update commands Wire UNSLOTH_VERBOSE=1 through _run_setup_script() so that 'unsloth studio update --verbose' (and the deprecated 'setup') passes the flag to setup.sh / setup.ps1 / install_python_stack.py. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-03-27 03:12:48 -07:00
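Two of the fixes in the commit above (clamping the step label to its column width and bracketing IPv6 hosts in the displayed URL) can be illustrated with a minimal sketch. The function names and the port value used in the usage note are hypothetical; only the 15-column width and the `label[:15]` slicing come from the commit message.

```python
def format_step_label(label: str, width: int = 15) -> str:
    """Pad short labels and clip long ones to exactly `width` columns.

    Mirrors the effect of printf's %-15.15s: left-justify, then truncate.
    """
    return f"{label[:width]:<{width}}"


def display_url(host: str, port: int) -> str:
    """Build an http URL, wrapping IPv6 literals in square brackets.

    Assumption: any host containing ':' is an IPv6 literal.
    """
    if ":" in host and not host.startswith("["):
        host = f"[{host}]"
    return f"http://{host}:{port}"
```

For example, `display_url("::1", 8000)` yields a URL with `[::1]` as the host part, which terminals and browsers can parse unambiguously because the port separator no longer collides with the address colons.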
Daniel Han	3c9f0ed149	fix: use unsloth[huggingfacenotorch] instead of --no-deps in no-torch mode (#4647 ) The previous --no-deps approach skipped ALL dependencies, not just torch. This left safetensors, transformers, datasets, accelerate, etc. missing, causing PackageNotFoundError at runtime. Fix: in no-torch mode, install unsloth[huggingfacenotorch] (which pulls all runtime deps except torch), then install unsloth-zoo with --no-deps (since zoo's published metadata still declares torch as a hard dep). This gives a working no-torch environment with all non-torch packages. Applied to all three installer files: install.sh, install.ps1, and studio/install_python_stack.py.	2026-03-27 02:38:11 -07:00
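The install order described in the commit above can be sketched as a small command builder. This is an illustration only: it assumes packages are installed through `uv pip install`, the function name `build_install_commands` is hypothetical, and the regular (torch-enabled) branch is a simplified placeholder rather than the installers' real logic.

```python
from typing import List


def build_install_commands(no_torch: bool) -> List[List[str]]:
    """Return the install commands to run, in order.

    Assumption: `uv pip install` is the installer front end.
    """
    base = ["uv", "pip", "install"]
    if no_torch:
        return [
            # The extra pulls the runtime dependencies except torch.
            base + ["unsloth[huggingfacenotorch]"],
            # unsloth-zoo's published metadata declares torch as a hard
            # dependency, so its dependency resolution is skipped.
            base + ["--no-deps", "unsloth-zoo"],
        ]
    # Placeholder for the normal path (not taken from the source).
    return [base + ["unsloth", "unsloth-zoo"]]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` in sequence.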
Daniel Han	e9ac785346	fix: install.sh Mac Intel compatibility + Studio no-torch support (#4624 ) * fix: install.sh Mac Intel compatibility + Studio no-torch support (#4621) On Intel Macs (x86_64), PyTorch has no wheels for torch >= 2.3, so the installer crashes. Even when torch is absent, Studio crashes on startup because two files have bare top-level torch imports. Studio's GGUF inference (llama.cpp) does not need PyTorch. Training and HF-inference already isolate torch to subprocesses. Only 2 files in the server startup chain had top-level torch imports preventing startup. Changes: - install.sh: detect architecture, default to Python 3.12 on Intel Mac, skip torch install, add Python 3.13.8 guard for arm64, pass UNSLOTH_NO_TORCH env var to setup.sh - data_collators.py: remove unused `import torch` (no torch.* refs) - chat_templates.py: lazy-import IterableDataset into function bodies - install_python_stack.py: add IS_MACOS/NO_TORCH constants, skip torch-dependent packages, skip overrides.txt, skip triton on macOS No existing working flow changes. Linux/WSL and macOS arm64 behavior is identical. 
* tests: add test suite for Mac Intel compat + no-torch mode Shell tests (test_mac_intel_compat.sh): - version_ge edge cases (9 tests) - Architecture detection for Darwin x86_64/arm64, Linux x86_64/aarch64 - get_torch_index_url returns cpu on simulated Darwin - UNSLOTH_NO_TORCH propagation to both setup.sh branches Python unit tests (test_no_torch_filtering.py): - _filter_requirements with NO_TORCH_SKIP_PACKAGES - NO_TORCH env var parsing (true/1/TRUE/false/0/unset) - IS_MACOS constant check - Overrides skip and triton macOS skip guards Python import tests (test_studio_import_no_torch.py): - data_collators.py loads in isolated no-torch venv - chat_templates.py has no top-level torch imports - Negative control confirms import torch fails without torch * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tests: add E2E sandbox tests for Mac Intel no-torch mode Replace static/synthetic test stubs with real sandbox tests: - Shell: E2E uv venv creation at Python 3.12, mock uv shim to verify torch install is skipped when MAC_INTEL=true, dynamic env propagation test for UNSLOTH_NO_TORCH in both local and non-local install paths - Python filtering: test real extras.txt and extras-no-deps.txt with NO_TORCH_SKIP_PACKAGES, subprocess mock of install_python_stack() for 5 platform configs (NO_TORCH+macOS, Windows+NO_TORCH, normal Linux, Windows-only, macOS-only), VCS URL and env marker edge cases - Python imports: parametrized Python 3.12+3.13 venv fixture, dataclass instantiation for all 3 collator classes, chat_templates.py exec with stubs, negative controls proving import torch and torchao install fail in no-torch venvs 91 total tests, all passing. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: address reviewer findings for Intel Mac no-torch mode P1 fixes: - Auto-infer NO_TORCH in install_python_stack.py via platform.machine() so `unsloth studio update` preserves GGUF-only mode without needing the UNSLOTH_NO_TORCH env var (6/10 reviewers) - Add openai-whisper and transformers-cfg to NO_TORCH_SKIP_PACKAGES since both have unconditional torch dependencies (4/10 reviewers) - Skip unsloth-zoo on Intel Mac --local installs (depends on torch) in both migrated and fresh install paths (1/10) - Recreate stale 3.13 venvs as 3.12 on Intel Mac re-runs (1/10) - Detect Apple Silicon under Rosetta via sysctl hw.optional.arm64 and warn user to use native arm64 terminal (1/10) P2 fixes: - Wire new test files into tests/run_all.sh (4/10 reviewers) - Add update-path tests (skip_base=False) for Intel Mac - Add _infer_no_torch tests for platform auto-detection P3 fixes: - Fix macOS progress bar total (triton step skipped but was counted) - Fix temp file leak when Windows + NO_TORCH filters stack All tests pass: 30 shell, 66 Python (96 total). * feat: add --python override flag to install.sh Lets users force a specific Python version, e.g. ./install.sh --python 3.12. Addresses M2 Mac users whose systems resolve to a problematic 3.13.x patch. When --python is set, the Intel Mac stale-venv guard and 3.13.8 auto-downgrade are skipped so the user's choice is respected. 
* tests: add comprehensive E2E sandbox tests for no-torch mode Add test_e2e_no_torch_sandbox.py with 7 test groups (43 tests total) covering the full no-torch import chain, edge cases, and install logic: - Group 1: BEFORE vs AFTER import chain comparison (proves the bug existed and the fix works by synthetically prepending top-level torch imports) - Group 2: Dataclass instantiation without torch - Group 3: Edge cases with broken/fake torch modules on sys.path - Group 4: Hardware detection fallback to CPU without torch - Group 5: install.sh flag parsing, version resolution, arch detection - Group 6: install_python_stack.py NO_TORCH filtering - Group 7: Live server startup without torch (marked @server, skipped when studio venv is unavailable) All 43 tests pass on both Python 3.12 and 3.13 isolated venvs. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * feat: add --no-torch flag to install.sh/ps1, fix lazy import bug in dataset formatting - Fix chat_templates.py: narrow torch IterableDataset import into inner try/except ImportError so dataset.map() works without torch installed - Fix format_conversion.py: same lazy import fix for convert_chatml_to_alpaca and convert_alpaca_to_chatml - Add --no-torch flag to install.sh with unified SKIP_TORCH variable (driven by --no-torch flag OR MAC_INTEL auto-detection) - Add --no-torch flag to install.ps1 with $SkipTorch variable - Print CPU hint when no GPU detected and --no-torch not set - Replace MAC_INTEL guards with SKIP_TORCH in torch install sections - Update shell tests (40 pass) and Python tests (90 pass) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: address reviewer findings for --no-torch installer paths - Fix migrated-env branch in install.sh and install.ps1: check SKIP_TORCH first, then branch on STUDIO_LOCAL_INSTALL. 
Previously SKIP_TORCH+non-local fell into else and installed unsloth-zoo (which depends on torch), defeating --no-torch mode. - Fix $env:UNSLOTH_NO_TORCH leak in install.ps1: always set to "true" or "false" instead of only setting on the true branch. Prevents stale no-torch state from leaking across runs in the same PS session. - Fix install_python_stack.py update path: add NO_TORCH guard around base.txt install so unsloth studio update does not reinstall unsloth-zoo (which depends on torch) in no-torch mode. * fix: install unsloth + unsloth-zoo with --no-deps in no-torch mode Instead of skipping unsloth-zoo entirely (which breaks unsloth's dependency on it), install both packages with --no-deps so they are present but torch is not pulled in transitively. Applied consistently across all no-torch paths: migrated-env, fresh-local, fresh-non-local in install.sh, install.ps1, and install_python_stack.py. * chore: temporarily remove test files (will be added in a follow-up) * refactor: deduplicate SKIP_TORCH conditional branches in installers Collapse if/else blocks that differ only by --no-deps into a single branch with a conditional flag variable. Applied to migrated-env and fresh-local paths in install.sh, install.ps1, and install_python_stack.py. * fix: apply --no-deps to fresh non-local --no-torch install path The non-local else branch was missing $_no_deps_arg/$noDepsArg, so uv pip install unsloth would resolve torch from PyPI metadata (the published unsloth package still declares torch as a hard dep). Now --no-deps is applied consistently to all SKIP_TORCH code paths. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-03-27 02:09:21 -07:00
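The no-torch detection described in the commit above (an `UNSLOTH_NO_TORCH` environment variable plus auto-inference through `platform.machine()`) might be sketched as follows. The function names are hypothetical, and the precedence rule (an explicit variable wins over auto-detection) is an assumption rather than something the commit message states.

```python
import os
import platform
from typing import Mapping, Optional


def parse_no_torch(env: Mapping[str, str]) -> Optional[bool]:
    """Return True/False when UNSLOTH_NO_TORCH is set, None when unset."""
    raw = env.get("UNSLOTH_NO_TORCH")
    if raw is None:
        return None
    # Accepts true/1/TRUE; everything else (false, 0, ...) is False.
    return raw.strip().lower() in ("1", "true")


def infer_no_torch(env: Mapping[str, str], system: str, machine: str) -> bool:
    """Decide whether torch should be skipped.

    Assumption: an explicit env value overrides platform detection.
    """
    explicit = parse_no_torch(env)
    if explicit is not None:
        return explicit
    # Intel Mac: macOS on x86_64, where recent torch wheels are missing.
    return system == "Darwin" and machine == "x86_64"


def current_no_torch() -> bool:
    """Evaluate the rule against the running interpreter's platform."""
    return infer_no_torch(os.environ, platform.system(), platform.machine())
```

Passing the environment and platform values as arguments keeps the rule testable on any machine, which matches the simulated-platform tests the commit message describes.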
Daniel Han	d57a4d993d	studio: fix chat CPU spike (#4632) The inline querier's identity changed on every render, forcing useLiveQuery to resubscribe continuously and causing CPU spikes. Store the querier in a ref and only re-subscribe when explicit deps change.	2026-03-27 06:20:26 +00:00
Daniel Han	e62085a3d6	Fix repetition_penalty default causing 24% TPS drop in GGUF inference (#4634 ) The ChatCompletionRequest Pydantic model defaulted repetition_penalty to 1.1 when clients omitted the field. This silently forced llama-server to perform per-token repetition scanning, dropping streaming throughput from ~225 TPS to ~172 TPS (a 24% penalty). The Studio frontend always sends repetition_penalty=1.0 explicitly, so UI users were unaffected. But any API client hitting /v1/chat/completions without setting the field (curl, third-party integrations, Open WebUI, etc.) would get the slow path. Benchmarked on Qwen3.5-4B Q4_K_XL, GPU 0: - repeat_penalty=1.0: 225.2 TPS - repeat_penalty=1.1: 172.7 TPS (24% slower) - LM Studio (which applies rp internally): 170.8 TPS This aligns the Pydantic default with the frontend default (1.0), generate_chat_completion's function signature default (1.0), and llama-server's own default (1.0).	2026-03-26 20:20:53 -07:00
Roland Tannous	e79a178200	Allow install_python_stack to run on Colab (#4633 ) * Allow install_python_stack to run on Colab The _COLAB_NO_VENV flag was setting _SKIP_PYTHON_DEPS=true, which skipped both the PyPI version check (needs $VENV_DIR/bin/python) and install_python_stack (uses sys.executable, works without a venv). Introduce a separate _SKIP_VERSION_CHECK flag for the version check, so install_python_stack still runs on Colab. The _SKIP_PYTHON_DEPS flag remains available for the "versions match" fast path. * Remove colab.py workarounds that broke transformers/hf-hub compatibility PR #4601 added _pip_install_backend_deps(), _bootstrap_studio_venv(), and _is_colab() to colab.py as workarounds for install_python_stack being skipped on Colab. These workarounds: - Stripped version constraints from studio.txt and installed into system Python - Upgraded huggingface-hub to >=1.0, breaking Colab's pre-installed transformers which requires huggingface-hub<1.0 With install_python_stack now running on Colab (previous commit), these workarounds are unnecessary — all deps are properly installed by setup.sh. Restore colab.py to its original PR #4237 structure: just get_colab_url(), show_link(), and start(). * Remove --local flag from setup.sh in Colab notebook The --local flag is not needed for the standard Colab flow since install_python_stack now runs on Colab and installs deps from PyPI.	2026-03-27 00:29:27 +04:00
Wasim Yousef Said	71781272dd	fix: add python-json-logger dependency to data-designer-deps (#4627)	2026-03-26 09:50:51 -07:00
Radouane Elhajali	a6fe743ebe	studio: humanize ETA display for long training runs (#4608 ) * studio: humanize ETA display for long training runs When training takes hours or days, the ETA displayed raw minutes (e.g. '560m 50s'). This changes the format to: - Under 1 hour: Xm Ys (unchanged) - 1-24 hours: Xh Ym Zs - Over 24 hours: Xd Xh Xm * Fix formatDuration edge cases and consolidate duplicate for PR #4608 - Guard NaN/Infinity inputs with Number.isFinite() (matches formatNumber in same file) - Add sub-minute branch so 30s displays as "30s" instead of "0m 30s" - Accept undefined in type signature to match formatNumber pattern - Remove duplicate formatDuration from history-card-grid.tsx and import the shared one --------- Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-03-26 06:55:54 -07:00
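The tiered ETA format described in the commit above can be sketched in Python (the Studio helper itself is a TypeScript function named `formatDuration`). The `"--"` placeholder for invalid input and the treatment of exactly 24 hours as the day tier are assumptions made for this illustration.

```python
import math
from typing import Optional


def format_duration(seconds: Optional[float]) -> str:
    """Render a duration in seconds as a short human-readable string."""
    # Guard missing, NaN, infinite, or negative input.
    if seconds is None or not math.isfinite(seconds) or seconds < 0:
        return "--"  # placeholder text is an assumption
    total = int(seconds)
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    if days:
        return f"{days}d {hours}h {minutes}m"
    if hours:
        return f"{hours}h {minutes}m {secs}s"
    if minutes:
        return f"{minutes}m {secs}s"
    return f"{secs}s"
```

With this rule, the commit's example of 560 minutes and 50 seconds (33650 seconds) falls into the hour tier instead of being shown as raw minutes.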
Michael Han	937da02f6c	Update Unsloth_Studio_Colab.ipynb	2026-03-26 05:45:30 -07:00
Etherll	b3a3435ac3	fix: Windows installer fails on _yaml.pyd Access Denied (os error 5) (#4617 ) * fix: avoid _yaml.pyd lock on Windows during dependency overrides * fix: move pytorch_tokenizers and kernels to no-deps install to avoid Windows _yaml.pyd loc	2026-03-26 05:15:19 -07:00
Lee Jackson	352455610b	studio: align Dataset/Parameters/Training cards, fix expandable height, animate LoRA settings (#4614 ) * fix(studio): align config cards, dynamic height for expanders, LoRA collapsible * Fix clipping regressions in training, dataset, and params section cards - training-section: Add hasMessage conditional so the card expands (min-h) when startError, vision/audio incompatibility, or config validation messages are present instead of always using fixed height - dataset-section: Expand card when a local dataset is selected via upload (datasetSource === "upload" && selectedLocalDataset), not only when the Advanced panel is open - params-section: Guard loraOpen behind isLora so switching to full fine-tune collapses the card instead of staying expanded from stale React useState * Fix dataset card clipping for direct file uploads Use uploadedFile instead of selectedLocalDataset in the card height condition. selectedLocalDataset is derived from localDatasets.find() which only resolves for Data Recipe entries, not direct file uploads (.jsonl, .csv, .parquet, .arrow). The card already renders the Eval Dataset panel based on uploadedFile (line 750), so the height gate should match. --------- Co-authored-by: Daniel Han <danielhanchen@gmail.com>	2026-03-26 04:05:30 -07:00
Wasim Yousef Said	07abcb46de	fix: normalize search matching for recommended models and LoRA picker (#4615 ) Recommended models matching the query were filtered from HF results but the Recommended section was hidden during search, causing them to vanish entirely. - Show filtered recommended models during search by introducing `filteredRecommendedIds` - Switch `recommendedSet` to use filtered IDs when searching so dedup against HF results is correct - Hide empty "Hugging Face" label when recommended matches cover the query - Add `normalizeForSearch` helper to strip separators (spaces, hyphens, underscores, dots) so queries like "llama 3" match "Llama-3.2-1B" and "qwen 2.5" matches "Qwen2.5-7B" in both the recommended model filter and the LoRA adapter filter	2026-03-26 03:40:11 -07:00
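The separator-stripping match described in the commit above can be sketched in Python (the actual `normalizeForSearch` helper lives in the TypeScript frontend). Lowercasing is an assumption inferred from the commit's examples, and the `matches` wrapper is a hypothetical name.

```python
import re

# Separators named in the commit: spaces, hyphens, underscores, dots.
_SEPARATORS = re.compile(r"[\s\-_.]+")


def normalize_for_search(text: str) -> str:
    """Drop separators and lowercase so formatting differences vanish."""
    return _SEPARATORS.sub("", text).lower()


def matches(query: str, model_id: str) -> bool:
    """True when the normalized query occurs in the normalized model id."""
    return normalize_for_search(query) in normalize_for_search(model_id)
```

Under these assumptions, `matches("llama 3", "Llama-3.2-1B")` and `matches("qwen 2.5", "Qwen2.5-7B")` both succeed, because the query and the model id reduce to the same separator-free form. Note that an empty query matches every model id in this sketch.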
Roland Tannous	6b3eb504b2	Fix Colab setup skipping llama.cpp installation (#4618 ) * Fix Colab setup skipping llama.cpp installation The early exit 0 in the Colab no-venv path prevented setup.sh from ever reaching the llama.cpp install section. Remove the early exit and instead guard only the venv-dependent Python deps section, so execution continues through to the llama.cpp prebuilt/source install. * Simplify _SKIP_PYTHON_DEPS initialization * Add --local flag to setup.sh in Colab notebook	2026-03-26 13:55:46 +04:00
Daniel Han	baabfa0a6e	Fix Colab huggingface-hub conflict, ensurepip fallback, bump to 2026.3.14 (#4603 ) * Fix Colab huggingface-hub conflict, ensurepip fallback, bump to 2026.3.14 - colab.py / setup.sh: relax == pins to >= when installing studio.txt on Colab so huggingface-hub does not clobber Colab's bundled version (breaks transformers is_offline_mode import) - install_python_stack.py: when uv is unavailable and pip is missing (uv-created venvs), bootstrap via ensurepip before attempting upgrade - Bump version to 2026.3.14 - Bump installer min version pins to 2026.3.14 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-03-25 09:38:02 -07:00
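Relaxing `==` pins to `>=`, as the commit above describes for the Colab install of studio.txt, could be done per requirement line along these lines. This is a sketch under the assumption that a simple regex rewrite is sufficient; the helper name `relax_pin` and the version numbers in the usage note are hypothetical.

```python
import re

# A requirement name (optionally with extras), then an exact pin.
_PIN = re.compile(r"^([A-Za-z0-9._\-\[\]]+)\s*==\s*(\S+)")


def relax_pin(line: str) -> str:
    """Turn `pkg==1.2.3` into `pkg>=1.2.3`; leave other lines unchanged."""
    return _PIN.sub(r"\1>=\2", line.strip())
```

For instance, `relax_pin("huggingface-hub==0.34.0")` gives a lower bound instead of an exact pin, so an already-installed newer or compatible version is not replaced. Comment lines and requirements that already use `>=` pass through untouched.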
Daniel Han	23eb7fc0a7	Fix Colab Studio launch and setup.ps1 box alignment (#4601 ) * Fix Colab Studio launch and setup.ps1 box alignment - colab.py: when the Studio venv is missing on Colab, pip-install backend dependencies (structlog, fastapi, etc.) from studio.txt into the current Python instead of failing with ModuleNotFoundError - setup.sh: on Colab without a venv, install backend deps into system Python and skip venv-dependent sections (Python stack update, llama.cpp build) that would otherwise fail - setup.ps1: use PadRight(47) for the done-line so "Setup Complete!" and "Update Complete!" both align with the box border * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-03-25 09:00:08 -07:00
Daniel Han	55d24d7c49	feat(studio): editable context length with Apply/Reset for GGUF settings (#4592 ) * feat(studio): editable context length with Apply/Reset for GGUF model settings Previously the Context Length field was read-only and the backend hardcoded `-c 0`, ignoring custom values entirely. KV Cache Dtype also triggered an immediate model reload with no way to cancel. Backend: - llama_cpp.py: pass the actual n_ctx value to `-c` instead of always 0 - models/inference.py: relax max_seq_length to 0..1048576 (0 = model default) so GGUF models with large context windows are supported Frontend: - chat-runtime-store: add customContextLength and loadedKvCacheDtype state fields for dirty tracking - chat-settings-sheet: make Context Length an editable number input, stop KV Cache Dtype from auto-reloading, show Apply/Reset buttons when either setting has been changed - use-chat-model-runtime: send customContextLength as max_seq_length in the load request, reset after successful load * fix: preserve maxSeqLength for non-GGUF models in load request customContextLength ?? 0 sent max_seq_length=0 for non-GGUF models, breaking the finetuning/inference path that needs the slider value. Now uses a three-way branch: - customContextLength set: use it (user edited GGUF context) - GGUF without custom: 0 (model's native context) - Non-GGUF: maxSeqLength from the sampling slider * fix: keep max_seq_length default at 4096 for non-GGUF callers Only relax the bounds (ge=0 for GGUF's "model default" mode, le=1048576 for large context windows). The default stays at 4096 so API callers that omit max_seq_length still get a sane value for non-GGUF models. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix(studio): rename trust remote code toggle and hide when no model selected - Rename "Trust remote code" to "Enable custom code" - Shorten subtitle to "Only enable if sure" - Hide the toggle when no model is loaded (already hidden for GGUFs) * fix: restore ge=128 for max_seq_length validation Keep the minimum at 128 so the API rejects nonsensical values. GGUF path now sends the model's native context length (from ggufContextLength) instead of 0 when the user has not customized it. The upper bound stays at 1048576 for large-context GGUF models. * feat(studio): replace Context Length input with slider Use a ParamSlider (512 to model's native context, step 512) instead of a small number input. Shows "Max" when at the model's native context length. Consistent with the other slider controls in the settings panel. * feat(studio): add editable number input alongside Context Length slider The slider and number input stay synced -- dragging the slider updates the number, typing a number moves the slider. The input also accepts values beyond the slider range for power users who need custom context lengths larger than the model default. * fix(studio): widen context length input and use 1024 step for slider Make the number input wider (100px) so large values like 262144 are fully visible. Change slider step from 512 to 1024 and min from 512 to 1024. * fix(studio): context length number input increments by 1024 * fix(studio): cap context length input at model's native max Adds max attribute and clamps typed/incremented values so the context length cannot exceed the GGUF model's reported context window. * fix(studio): point "What's new" link to changelog page Changed from /blog to /docs/new/changelog. 
* fix(studio): preserve custom context length after Apply, remove stale subtitle - After a reload with a custom context length, keep the user's value in the UI instead of snapping back to the model's native max. ggufContextLength always reports the model's native metadata value regardless of what -c was passed, so we need to preserve customContextLength when it differs from native. - Remove "Reload to apply." from KV Cache Dtype subtitle since the Apply/Reset buttons now handle this. * feat(studio): auto-enable Search and Code tools when model supports them Previously toolsEnabled and codeToolsEnabled stayed false after loading a model even if it reported supports_tools=true. Now both toggles are automatically enabled when the loaded model supports tool calling, matching the existing behavior for reasoning. * fix(studio): auto-enable tools in autoLoadSmallestModel path The suggestion cards trigger autoLoadSmallestModel which bypasses selectModel entirely. It was hardcoding toolsEnabled: false and codeToolsEnabled: false even when the model supports tool calling. Now both are set from the load response, matching the selectModel behavior. Also sets kvCacheDtype/loadedKvCacheDtype for dirty tracking consistency. * fix(studio): re-read tool flags after auto-loading model The runtime state was captured once at the start of the chat adapter's run(), before autoLoadSmallestModel() executes. After auto-load enables tools in the store, the request was still built with the stale snapshot that had toolsEnabled=false. Now re-reads the store after auto-load so the first message includes tools. * fix(studio): re-read entire runtime state after auto-load, not just tools The runtime snapshot (including params.checkpoint, model id, and all tool/reasoning flags) was captured once before auto-load. After autoLoadSmallestModel sets the checkpoint and enables tools, the request was still built with stale params (empty checkpoint, tools disabled). 
Now re-reads the full store state after auto-load so the first message has the correct model, tools, and reasoning flags. * feat(studio): add Hugging Face token field in Preferences Adds a password input under Configuration > Preferences for users to enter their HF token. The token is persisted in localStorage and passed to all model validate/load/download calls, replacing the previously hardcoded null. This enables downloading gated and private models. * fix(studio): use model native context for GGUF auto-load, show friendly errors The auto-load paths and selectModel for GGUF were sending max_seq_length=4096 which now actually limits the context window (since we fixed the backend to respect n_ctx). Changed to send 0 for GGUF, which means "use model's native context size". Also replaced generic "An internal error occurred" messages with user-friendly descriptions for known errors like context size exceeded and lost connections. LoadRequest validation changed to ge=0 to allow the GGUF "model default" signal. The frontend slider still enforces min=128 for non-GGUF models. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix(studio): filter out FP8 models from model search results Hide models matching -FP8- or FP8-Dynamic from both the recommended list and HF search results. These models are not yet supported in the inference UI. --------- Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>	2026-03-25 08:32:38 -07:00
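The three-way branch for choosing `max_seq_length` in the load request, described in the commit above, can be sketched in Python (the real logic is in the TypeScript frontend). The sketch follows the variant in which a GGUF model without a custom value sends 0, meaning "use the model's native context"; the commit message shows this default changing between revisions, so treat it as one possible reading. Function and parameter names are hypothetical, and applying the slider minimum of 1024 to typed input is an assumption.

```python
from typing import Optional


def resolve_max_seq_length(
    is_gguf: bool,
    custom_context_length: Optional[int],
    max_seq_length: int,
) -> int:
    """Pick the context length to send when loading a model."""
    if custom_context_length is not None:
        # The user edited the GGUF context length: honor it.
        return custom_context_length
    if is_gguf:
        # 0 signals "use the model's native context" (assumed variant).
        return 0
    # Non-GGUF models keep the value from the sampling slider.
    return max_seq_length


def clamp_context_length(value: int, native_max: int, minimum: int = 1024) -> int:
    """Keep a typed value within [minimum, native_max]."""
    return max(minimum, min(value, native_max))
```

Checking `custom_context_length is not None` rather than truthiness avoids the earlier bug pattern where a fallback such as `?? 0` silently replaced the non-GGUF slider value.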


1155 commits