* split venv_t5 into venv_t5_530 and venv_t5_550 for tiered transformers 5.x support
* fix bfloat16 crash on T4 for FORCE_FLOAT32 models and disable trust_remote_code auto-enable for native t5 models
* revert FORCE_FLOAT32 dtype change
* restrict trust_remote_code auto-enable to Nemotron models only
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* use config.json model_type for tier detection, add unsloth/nvidia namespace guard
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Revert "[pre-commit.ci] auto fixes from pre-commit.com hooks"
This reverts commit fb43d468e2.
* Revert "use config.json model_type for tier detection, add unsloth/nvidia namespace guard"
This reverts commit fc49ae2453.
* add unsloth/nvidia namespace guard to Nemotron trust_remote_code auto-enable
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* reorder tier checks: all substring matches before config.json fetches
* extract shared activate_transformers_for_subprocess into transformers_version.py
* narrow Nemotron trust_remote_code to nemotron_h/nemotron-3-nano, add to export worker
* clean venv_t5 dirs before re-install in setup.sh, clarify version alias comment
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* run venv_t5 migration outside deps fast-path gate in both setup scripts
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(chat): prevent implicit empty thread creation and stabilize new-chat flow
* fix(chat): harden compare thread sync and simplify sidebar thread query
* fix(chat): harden new-thread state sync and isolate compare active thread updates
* fix(chat): stabilize new-thread state sync and prevent compare/session bleed
* Fix thread restoration, handleNewThread guard, sidebar filter, and delete flow
- Remove __LOCALID_ filter from getInitialSingleChatView: in this
Dexie-backed adapter, AUI's __LOCALID_ prefixed IDs ARE the real
persistent thread IDs stored by initialize(). Filtering them out
breaks thread restoration on navigation.
- Simplify handleNewThread to synchronous: the async Dexie message
check is redundant (persistence is already deferred to first append)
and strands users on legacy empty threads. Use a simple guard that
checks the store's activeThreadId to detect unsent drafts.
- Add message-count filter to sidebar: filter threads to only show
those with at least one message, hiding legacy empty threads.
- Add store-based sidebar highlighting fallback: use activeThreadId
from the store when view.threadId is not set (nonce-backed chats).
- Fix handleDelete to call onNewThread() instead of onSelect(), and
clear activeThreadId, so the runtime properly resets after deleting
the active thread.
* Fix handleDelete nonce path and restore __LOCALID_ filter
handleDelete was calling onNewThread() after clearing activeThreadId,
but the handleNewThread guard sees !view.threadId && !activeThreadId
and returns early, leaving the UI stuck on the deleted thread.
Fix by directly calling onSelect with a new nonce instead.
Restore __LOCALID_ filter in getInitialSingleChatView to prevent
restoring unpersisted AUI local thread IDs on navigation. Without
this filter, navigating away from /chat before sending a message
would restore a non-existent thread that Dexie cannot fetch.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Fix custom folder scanning when pointing directly at a model directory.
When a user adds a custom scan folder that points directly at a model
directory (e.g. /path/to/gemma-4-e2b-it-gguf/ containing config.json
and gemma-4-E2B-it-BF16.gguf), the model list previously showed
individual .gguf files as separate entries instead of recognizing the
directory as a single model. Clicking any entry showed "No GGUF
variants found" because list_local_gguf_variants received a file path
and immediately returned empty.
Changes:
- Add _is_model_directory() helper that detects directories with both
config metadata and actual model weight files (excludes mmproj GGUFs
and non-weight .bin files like tokenizer.bin)
- _scan_models_dir: detect self-model and return single directory entry
- _scan_lmstudio_dir: surface model directories directly instead of
descending into them as publisher folders; handle both root and child
model directories
- Add _resolve_gguf_dir() helper for GGUF path resolution that only
falls back to parent directory when parent has model metadata
- list_local_gguf_variants / _find_local_gguf_by_variant: use resolver
so .gguf file paths inside model directories work correctly
* Add tests for is_vision_model() caching behaviour
* Fix review feedback: remove dead helper, fix exception test
- Remove unused _make_config() helper function (dead code)
- Fix test_exception_result_cached to actually exercise the exception path
by mocking load_model_config to raise OSError instead of using
side_effect=[False] which only tested normal False returns
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use strict mock specs so tests exercise intended detection paths
Use MagicMock(spec=[]) for all config mocks so hasattr() only returns
True for explicitly set attributes. Without this, MagicMock defaults
make all hasattr checks truthy, allowing tests to pass via unintended
detection paths (e.g. img_processor instead of vision_config).
---------
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add vision detection cache to is_vision_model() to avoid redundant subprocess spawns
is_vision_model() is called 4-5 times per training run for the same model
with zero caching. For transformers 5.x models, each call spawns a full
subprocess (~6s each). This adds a module-level _vision_detection_cache dict
following the same pattern as the existing _audio_detection_cache used by
detect_audio_type(). The function is refactored into a thin cache wrapper
around _is_vision_model_uncached(), saving ~12s per training run.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Include hf_token in vision cache key for gated model correctness
Cache key is now (model_name, hf_token) instead of just model_name.
This prevents stale False results when an unauthenticated probe for a
gated model is followed by an authenticated call.
* Remove test file from main PR - will be submitted separately
* Fix vision cache: normalize model names and skip caching transient failures
- Normalize model names in cache key using resolve_cached_repo_id_case()
to avoid duplicate entries for different casings of the same HF repo
(aligns with case normalization from #4822)
- Return None instead of False on transient failures (network errors,
subprocess timeouts, HF API issues) so the cache layer can distinguish
"definitely not a vision model" from "failed to check"
- Only cache definitive True/False results; transient failures are retried
on the next call instead of being permanently locked in as False
* Refine failure handling: cache deterministic failures, guard normalization
- Subprocess non-zero exit, JSON errors, and general exceptions return
False (deterministic, cached) instead of None (retryable). Only
subprocess.TimeoutExpired returns None since timeouts are transient.
- Wrap cache key normalization in try/except so resolve_cached_repo_id_case
or normalize_path failures fall back to raw model_name instead of
crashing callers.
* Harden vision detection cache: fix transient failure handling, thread safety, token security
- All subprocess failure paths now return None (transient) instead of False,
preventing permanent misclassification of VLMs after temporary HF/auth/network errors
- Use SHA256 fingerprint for hf_token in cache key instead of raw bearer token
- Add threading.Lock with double-checked locking to prevent thundering herd
of concurrent subprocess spawns for the same uncached model
- Distinguish permanent failures (RepositoryNotFoundError, GatedRepoError,
ValueError) from transient ones in _is_vision_model_uncached
- Pass resolved/normalized model name to detection (not just cache key)
- Log normalization fallback at debug level instead of silent swallow
- Thread hf_token through callers in routes/models.py and trainer.py
that previously omitted it
* Refine lock strategy and token fingerprint
- Move detection computation outside the lock to avoid serializing
long-running subprocess spawns (60s timeout) and HF API calls across
all concurrent model checks. Lock is now only held for cache writes.
- Use full SHA256 digest for token fingerprint instead of truncated
16-char prefix to eliminate collision risk.
* Fix huggingface_hub import fallback and use atomic cache read
- Add fallback import path for RepositoryNotFoundError/GatedRepoError
from huggingface_hub.utils (older hub versions) when .errors is
not available
- Use sentinel-based dict.get() for single atomic cache read instead
of two-step in/[] pattern (future-proof for no-GIL runtimes)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Add fallback message for Colab Studio button when localhost link doesn't work
* Make fallback message darker grey for better readability
* Make fallback message bold for better visibility
---------
Co-authored-by: LeoBorcherding <LeoBorcherding@users.noreply.github.com>
* studio: add speculative decoding support (ngram-mod, on by default)
Enable n-gram speculative decoding for GGUF models in Unsloth Studio.
Uses llama.cpp's ngram-mod mode which gives 10-40% faster generation
with zero VRAM cost via a 4MB fixed hash table that auto-resets on
low acceptance rates.
Backend:
- Add speculative_type field to LoadRequest, LoadResponse, and
InferenceStatusResponse pydantic models
- Add speculative_type parameter to LlamaCppBackend.load_model()
with allowlist validation (ngram-simple, ngram-mod)
- Pass --spec-type, --spec-ngram-size-n 16, --draft-max 24 flags
to llama-server when ngram-mod is active
- Default to ngram-mod for non-vision GGUF models server-side
- Silently skip speculative decoding for vision models (unsupported
in llama.cpp server-context.cpp)
Frontend:
- Add speculative_type to TS API types
- Add speculativeType/loadedSpeculativeType to chat runtime store
with default value of "ngram-mod"
- Add On/Off toggle in Model settings section (GGUF only, hidden
for vision models), included in dirty check for Apply/Reset
- Wire speculative_type through model load request and response
- Restore speculative type state on page refresh/reconnect
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: remove server-side speculative decoding override
The backend was overriding speculative_type=None to "ngram-mod" for
non-vision GGUF models, which prevented users from disabling spec
decoding via the UI toggle. The frontend store already defaults to
"ngram-mod", so the backend fallback was redundant and blocked the
explicit "Off" setting.
* fix: use recommended ngram-mod params from llama.cpp docs
Update speculative decoding params to match the recommended values
from llama.cpp docs (docs/speculative.md):
--spec-ngram-size-n 24 (was 16, docs say small n not recommended)
--draft-min 48 (was 0)
--draft-max 64 (was 24, docs note MoEs need long drafts)
Also fix comment: ngram-mod uses ~16 MB (4M entries * 4 bytes),
not 4 MB.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* add benchmark table and references to speculative decoding comment
Include speedup numbers from llama.cpp PRs #18471 and #19164 as an
inline comment so future readers understand the expected gains.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(studio): harden sandbox security for terminal and python tools
The existing command blocklist used naive str.split() which is trivially
bypassable via quoting, full paths, nested shells, variable expansion,
and cross-tool pivoting through Python os.system/subprocess. Fixes#4818.
Changes:
- Replace str.split() blocklist with shlex.split() + os.path.basename()
tokenization and regex scanning at shell command boundaries
- Add sanitized subprocess environment (_build_safe_env) that strips
credentials (HF_TOKEN, WANDB_API_KEY, GH_TOKEN, AWS_*, etc.) and
restricts PATH to /usr/local/bin:/usr/bin:/bin
- Add PR_SET_NO_NEW_PRIVS via prctl on Linux so sudo/su/pkexec fail
at the kernel level regardless of how they are invoked
- Add RLIMIT_NPROC (256) and RLIMIT_FSIZE (100MB) to prevent fork
bombs and disk filling attacks
- Extend AST safety checker to detect os.system(), os.popen(),
subprocess.run/Popen/call/check_output, os.exec*, os.spawn* calls
containing blocked commands or dynamic (non-literal) arguments
- Add cross-platform support: cmd.exe on Windows, bash on Unix;
CREATE_NO_WINDOW flag on Windows, preexec_fn on Unix
- Expand blocklist from 7 to 14 commands: add su, chown, passwd,
mount, umount, fdisk, kill, killall, pkill
- Apply all layers to both _bash_exec and _python_exec
Zero measurable performance overhead -- shlex parsing and a single
prctl syscall per subprocess fork.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix review findings: exception_catching dead code, false positives, process substitution
- Include exception_catching reasons in _check_code_safety so bare
except-in-loop timeout evasion is actually blocked (was computed in
_check_signal_escape_patterns but never read by the caller)
- Remove base.split() inner loop that caused false positives on quoted
text arguments containing blocked words (e.g. echo "kill this process")
- Add targeted nested shell detection for bash/sh/zsh -c arguments
instead, which catches bash -c 'sudo whoami' without false positives
- Add <() process substitution to the regex character class so
diff <(rm -rf /path) is also caught
- Fix error message to say "unsafe patterns" instead of specifically
mentioning signal manipulation when other categories trigger
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address review feedback: regex paths, keyword args, list element scanning
- Regex now matches blocked commands after optional path prefix at shell
boundaries (catches ls; /usr/bin/sudo and similar)
- Nested shell detection uses os.path.basename so bash -c "/bin/rm" is
caught
- AST checker now inspects keyword arguments (not just positional) so
subprocess.run(args="sudo ...", shell=True) is detected
- List elements in subprocess calls are now checked via
_find_blocked_commands for consistency (catches subprocess.run(["bash",
"-c", "rm -rf /"]))
- Dynamic argument check uses _is_safe_literal that validates list
contents are all string literals
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix nested shell scan to only check the script body, not positional args
bash -c 'script' arg0 arg1 -- only tokens[i+1] is the script body;
subsequent tokens are $0, $1 positional parameters passed to the script
and are not executed as shell commands. Scanning all remaining tokens
caused false positives.
* Add subshell parentheses to regex command boundary detection
(sudo whoami) was not caught because ( was not in the regex character
class for shell command boundaries. Add ( to the set alongside ;, &,
|, backtick, newline.
* Address high-priority review findings from 7 parallel reviewers
- Track from-imports of dangerous functions (from os import system,
from subprocess import run as r, etc.) via shell_exec_aliases dict
so bare-name calls are detected by the AST checker
- Include the active Python interpreter and virtualenv directories
in the sanitized PATH so pip, uv, and Studio packages remain
accessible in the sandbox
- Add Windows-specific blocked commands (rmdir, takeown, icacls,
runas, powershell, pwsh) only on win32 platform
- Add os.posix_spawn and os.posix_spawnp to _SHELL_EXEC_FUNCS
- Handle tuple literals same as list literals in AST argument
inspection (both _extract_strings_from_list and _is_safe_literal)
* Fix false positive on check=True kwargs and recursive nested shell scanning
- Only inspect command-carrying keyword arguments (args, command,
executable, path, file) in the AST checker, not control flags like
check=True, text=True, capture_output=True which are booleans and
were incorrectly flagged as non-literal dynamic arguments
- Replace split() in nested shell detection with recursive call to
_find_blocked_commands so that quoted commands (bash -c '"sudo"
whoami') and semicolons (bash -c "sudo;ls") within nested shells
are properly detected through the full shlex + regex pipeline
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Move preexec_fn imports to module level and use find_library for libc
Addresses two Gemini review findings:
1. preexec_fn thread safety: _sandbox_preexec previously imported ctypes
and resource inside the function body, which runs between fork() and
exec() in the child process. In a multi-threaded server, this could
deadlock if the import machinery locks were held by another thread at
fork time. Now all imports and the libc handle are resolved once at
module load time, so _sandbox_preexec only calls C-level functions
(prctl, setrlimit) with no Python import activity.
2. Hardcoded libc.so.6 path: replaced with ctypes.util.find_library("c")
which works on glibc (libc.so.6), musl (libc.musl-*.so.1), and other
Linux distributions where libc has a different soname.
* Apply Gemini style suggestions: combined regex, dict.fromkeys, constant hoisting
- Combine per-word regex loop into a single re.findall with alternation
pattern, avoiding repeated regex compilation and searching
- Replace manual dedup loop with dict.fromkeys for PATH entries
- Hoist _CMD_KWARGS frozenset out of visit_Call to avoid recreating it
on every AST node visit
* Add cmd /c nested shell detection for Windows parity
The nested shell scan only checked for Unix shells (bash -c, sh -c, etc).
Add cmd /c and cmd.exe /c detection so that Windows nested shell
invocations are also recursively scanned for blocked commands. The token
scan already catches blocked commands at any position, so this is
defense-in-depth for consistency across platforms.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Handle combined shell flags (-lc, -xc) and interleaved flags (--login -c)
The nested shell scan only matched token == "-c" with the immediately
preceding token being a shell name. This missed:
- Combined flags: bash -lc 'rm ...' (-lc ends with c, is a valid
combined flag meaning -l -c)
- Interleaved flags: bash --login -c 'sudo ...' (--login sits between
bash and -c)
Now matches any short flag ending in 'c' (e.g. -lc, -xc, -ic) and
walks backwards past intermediate flags to find the shell binary.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix /bin/bash bypass, remove RLIMIT_NPROC, reduce AST false positives
Addresses three high-consensus findings from 20-reviewer pass:
1. /bin/bash -c 'sudo whoami' bypassed nested shell scan because the
backwards flag-skip logic treated paths starting with / as flags.
Now only skips tokens starting with - as Unix flags; on Windows
only skips short /X flags (not /bin/bash style paths). [9/20]
2. RLIMIT_NPROC=256 caused subprocess.run to fail with EAGAIN because
Linux enforces NPROC per real UID, not per process tree. Removed
RLIMIT_NPROC entirely; RLIMIT_FSIZE and PR_SET_NO_NEW_PRIVS remain
as the primary resource and privilege controls. [5/20]
3. AST checker rejected safe dynamic subprocess usage like
cmd=["git","status"]; subprocess.run(cmd) as shell_escape_dynamic.
Now only flags dynamic args for shell-string functions (os.system,
os.popen, subprocess.getoutput, etc.) or when shell=True is
explicitly set. List-based subprocess calls with shell=False (the
default) do not pass through a shell and are not flagged. [12/20]
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Handle Windows drive letter paths and .exe extensions in command detection
Gemini review found that Windows absolute paths (C:\Windows\System32\
shutdown.exe) and executable extensions (.exe, .com, .bat, .cmd) were
not handled:
- Token scan now strips .exe/.com/.bat/.cmd extensions before checking
the blocklist, so sudo.exe matches sudo, shutdown.bat matches shutdown
- Regex pattern now includes optional Windows drive letter prefix
([a-zA-Z]:[/\\]) and optional executable extension suffix, so commands
after shell metacharacters with full Windows paths are also caught
* Handle **kwargs dict expansion, non-literal shell=, and except Exception false positive
Addresses three findings from second 20-reviewer pass:
1. **kwargs dict expansion (9/20): subprocess.run(**{"args": "rm ...",
"shell": True}) bypassed the AST checker because **kwargs were
treated as opaque. Now expands literal dict **kwargs to inspect
their keys, and flags opaque **kwargs (variable dicts) as unsafe.
2. Non-literal shell= values (7/20): shell=variable was treated as
shell=False (safe). Now any shell= value that is not literally
False is treated as potentially True (conservative default).
3. except Exception false positive (1/20): except Exception in a loop
was flagged as timeout evasion, but Exception does not catch
SystemExit or KeyboardInterrupt which are used for timeout
enforcement. Narrowed to only flag except BaseException and
except TimeoutError in loops.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Fixes#4809
On a new Studio chat, the first tool call could start before the frontend
initializes the thread ID. That meant the first request could go out without
a session_id, so the backend started the tool in the shared sandbox root
instead of the chat's session sandbox.
Frontend:
- Eagerly initialize the thread when switching to a new chat
- Resolve the thread ID once at request time and keep it stable through
async model-load waits
- Disable ActiveThreadSync during new-chat initialization to prevent
stale thread IDs from being written back
- Add error handling for thread initialization failures
- Clear activeThreadId on all compare-mode entry paths to prevent
cross-session leakage
- Fix exitCompare to restore context usage from the saved view
- Coerce falsy thread IDs to undefined for consistent backend/frontend
fallback behavior
- Use _default as the image sessionId fallback to match the backend
Backend:
- Use ~/studio_sandbox/_default when a request arrives without a session_id
* fix(studio): reuse HF cached repo casing to prevent duplicate downloads
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Move cache case resolution tests to separate PR
Tests for resolve_cached_repo_id_case and get_model_config case resolution
belong in their own PR to keep this change focused on the runtime fix.
* fix(studio): debug-log HF_HUB_CACHE fallback in path_utils
* Fix stale memoization in resolve_cached_repo_id_case
- Check exact-case path before memo to ensure a newly-appeared exact
match always wins over a previously memoized variant
- Validate memoized entries still exist on disk before returning them
to prevent stale results when cache dirs are deleted/recreated
* Minor cleanups for cache case resolution
- Use .is_dir() instead of .exists() for exact-case cache check
(cache entries are always directories)
- Remove redundant fallback in _detect_audio_from_tokenizer since
get_cache_path already handles case resolution and returns None
when the model is not cached
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* feat: allow non-LLM recipes to run without provider block
* feat: reorder execution tabs and add generation-aware data tab empty state
* fix: add accessibility attrs to data tab spinner and use literal ellipsis
* fix(studio): use shared spinner, stub provider, and hide unused LLM metrics
Backend: inject stub model provider for sampler-only recipes so
DataDesigner init does not reject empty provider lists.
Frontend: use shared Spinner component, hide LLM columns metric
and model usage card when recipe has no LLM columns.
* Fix tab reset and terminal auto-scroll regressions for PR #4805
Reset detailTab to "data" when switching between executions so
the Data tab default is applied consistently, not only on first
mount. Also add detailTab to the terminal scroll effect deps so
auto-scroll-to-bottom fires when the user opens the Overview tab
after landing on Data.
* Guard terminal scroll reset to only fire on Overview tab
The previous scroll effect ran on every tab switch, which could
reset the user's manual scroll position if they scrolled up in
the terminal and briefly switched tabs. Now the scroll-to-bottom
and sticky-bottom reset only fires when navigating to the
Overview tab.
* Use None for stub provider api_key instead of literal string
The stub ModelProvider that satisfies the DataDesigner registry
for non-LLM recipes should not carry a fake credential string.
Using None avoids sending an Authorization header if the provider
is ever inadvertently invoked.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Differentiate web_search query searches from URL fetches in the Studio chat UI.
Backend (llama_cpp.py):
- Emit "Reading: hostname" for URL fetches and "Searching: query" for query searches in SSE status events
- Only show hostname for valid http/https URLs; schemeless/non-http URLs get "Reading page..." generic fallback
- Strip www. prefix for consistency with the frontend
Frontend (tool-ui-web-search.tsx):
- Tool card shows "Read hostname" / "Reading hostname..." for URL fetches
- Shows "Searched query" / "Searching for query..." for query searches
- Uses new URL() with protocol check; falls back to "Read page" / "Reading page..." for non-http URLs
* Simplify llama.cpp install logic
* print release tag
* Retry failed json decode
* don't pull all ggml releases
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove test file changes from main PR
Test changes for test_pr4562_bugfixes.py will be submitted in a separate PR to keep this PR focused on the install path simplification.
* Fix setup.sh executable bit and direct tag lookup for pinned releases
- Restore setup.sh file mode to 100755 (was accidentally changed to 100644)
- Add direct GitHub API tag lookup in iter_release_payloads_by_time for
non-latest requested tags (e.g. b7879) instead of relying on paginated
release scans that may miss older releases beyond the 5-page limit
- Update stale DEFAULT_PUBLISHED_REPO comment to match new value
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix force-compile default ref and remove dead code in setup.ps1
- Change FORCE_COMPILE_DEFAULT_REF from "main" to "master" in all three
files (install_llama_prebuilt.py, setup.sh, setup.ps1) since
ggml-org/llama.cpp uses "master" as its default branch, not "main".
Using "main" would cause git clone --branch to fail when
UNSLOTH_LLAMA_FORCE_COMPILE=1 with UNSLOTH_LLAMA_TAG=latest.
- Remove dead if ($SkipPrebuiltInstall) block inside the else branch of
setup.ps1 that could never be reached (the outer elseif already
handles $SkipPrebuiltInstall=true).
- Maintain setup.sh executable bit (100755).
* Improve iter_release_payloads_by_time error handling for direct tag lookup
When a pinned release tag is not found (HTTP 404), fall through to the
paginated release scan instead of silently returning empty results.
Non-404 errors (network failures, rate limits) are propagated to the
caller so users get actionable error messages.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix(studio): lazy-import AutoConfig in model_config.py to fix transformers 5.x version switch
Move `from transformers import AutoConfig` from module level to inside
load_model_config() where it is actually used.
model_config.py is transitively imported at module load time via:
core/inference/__init__ → llama_cpp → utils.models → model_config
In inference subprocesses (mp.spawn), this chain runs before
_activate_transformers_version() can prepend .venv_t5/ to sys.path.
The eager import caches transformers 4.57.6 in sys.modules, and the
subsequent sys.path change has no effect — Python always checks
sys.modules before sys.path.
Making the import lazy ensures transformers is not loaded until after
version activation, so the subprocess picks up the correct version.
* fix(studio): also lazy-import extract_model_size_b in llama_cpp.py
Belt-and-suspenders: make the import that originally triggered the
chain lazy as well, so future module-level AutoConfig additions in
utils.models cannot reintroduce the problem.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
When DEFAULT_PUBLISHED_REPO is ggml-org/llama.cpp, the prebuilt
resolver raises PrebuiltFallback because ggml-org releases do not
include a llama-prebuilt-manifest.json asset. This was caught by the
generic Exception handler and printed as "fatal helper error" to
stderr, which triggers NativeCommandError on PowerShell.
Catch PrebuiltFallback separately in the top-level __main__ handler
and exit with EXIT_FALLBACK (code 2) instead of EXIT_ERROR (code 1).
The message is still logged but without the "fatal helper error"
prefix. The shell scripts already handle non-zero exits and fall
back to source builds.
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* fix(studio): revert llama.cpp default tag to latest
The latest ggml-org/llama.cpp release (b8637) now includes Gemma 4
support. Revert the temporary "b8637" pin from #4796 to "latest" so
the prebuilt resolver always picks the newest release automatically
without needing manual tag bumps.
* docs: add comment explaining latest vs master for llama.cpp tag
Document in all three files why "latest" is preferred over "master"
and when "master" should be used as a temporary override.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Gemma 4 is a native transformers 5.5 model and does not need
trust_remote_code=True. The auto-enable logic (added for NemotronH)
was catching all transformers 5.x models, including Gemma 4.
When trust_remote_code=True, unsloth_compile_transformers() returns
early without running the compiler. This disables the fused cross
entropy patch, causing logged training loss to be inflated by the
gradient_accumulation_steps factor.
Exclude models matching "gemma-4" or "gemma4" from the auto-enable
so the compiler runs and applies fused cross entropy correctly.
ggml-org/llama.cpp b8637 includes Gemma 4 support (ggml-org/llama.cpp#21309).
Revert the temporary "master" default back to a pinned release tag.
This eliminates the HTTP 422 errors from the prebuilt resolver (which
could not find a release matching "master"), avoids unnecessary source
builds, and restores prebuilt binary downloads on all platforms.
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* fix windows llama.cpp compile from source issue
* undo local repo usage
* fix llama.cpp install
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix windows
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: route resolve-source-build call through Invoke-LlamaHelper
The --resolve-source-build call at the source-build resolution path
was still calling install_llama_prebuilt.py directly instead of going
through Invoke-LlamaHelper. On PS7+ with ErrorActionPreference=Stop,
stderr from the 422 response (when tag is "master") would trigger a
terminating NativeCommandError and crash setup.
* fix: suppress stderr error records from Invoke-LlamaHelper
ErrorActionPreference=Continue prevents termination but PowerShell
still displays stderr lines as visible ErrorRecord objects. Capture
all output via 2>&1 and split stdout from stderr manually so that
stderr lines never appear on the console. When StderrPath is given
the stderr content is written to that file for diagnostics.
* fix: always rebuild llama.cpp on Windows when tag is master
When the requested llama.cpp tag is "master" (a moving target), skip
the "already built" early exit so the build path runs and syncs to
the latest commit. Without this, existing llama-server binaries from
an older build (e.g. b8635 which lacks Gemma 4 support) are reused
and model loading fails.
Pinned tags (e.g. b8635) still skip the rebuild when the binary
already exists, since the tag is immutable.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
The model list merge order was `top_gguf + top_hub + static_models`,
which meant the HF download-ranked models always came first. New models
like Gemma 4 have low download counts and were not in the HF top-40,
so they got buried after 80 other models despite being at the top of
the curated static defaults in defaults.py.
Flip the merge to `static_models + top_gguf + top_hub` so editorial
picks (new model launches, promoted models) always appear first in the
Recommended section, with HF popularity backfilling after.
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
The latest ggml-org/llama.cpp release (b8635) does not include Gemma 4
support (ggml-org/llama.cpp#21309 merged after the release was cut).
This causes `llama-server` to fail with "unknown model architecture:
gemma4" when loading Gemma 4 GGUFs.
Temporarily default _DEFAULT_LLAMA_TAG to "master" so all new installs
build from the llama.cpp master branch which includes Gemma 4 support.
Once a new upstream release is cut with Gemma 4, this can be reverted
back to "latest".
Changes:
- setup.sh: add _DEFAULT_LLAMA_TAG="master" maintainer default
- setup.ps1: add $DefaultLlamaTag="master" maintainer default
- install_llama_prebuilt.py: change DEFAULT_LLAMA_TAG fallback to "master"
Users can still override via UNSLOTH_LLAMA_TAG env var.
Revert the >= loosening from f9c4b08 back to exact pins.
Using transformers>=4.57.6 allows pip to install 5.x into the main
Studio venv, which breaks huggingface_hub imports
(is_offline_mode removed in newer hub versions).
The main venv must stay on transformers==4.57.6 and
huggingface-hub==0.36.2. The 5.x version lives only in .venv_t5/
and is dynamically switched via sys.path at runtime.
The v5.5-release branch now exists on huggingface/transformers.
Use transformers==5.5.0 for all install paths and
git+transformers.git@v5.5-release for the MLX installer.
Also bumps huggingface_hub from 1.7.1 to 1.8.0 in setup.sh and
setup.ps1 to stay consistent.
Hardcode the release repo to ggml-org/llama.cpp and remove the
UNSLOTH_LLAMA_RELEASE_REPO and UNSLOTH_LLAMA_SOURCE env var overrides
so that all users always build/download from mainline llama.cpp.
Gemma-4 support landed in transformers main
(huggingface/transformers#45192). Update the version pin from
5.5.0.dev0 to 5.5.0 across loader, Studio version switcher,
and the MLX installer. Also fix install_gemma4_mlx.sh which
referenced a non-existent v5.5-release branch -- pin it to
the correct commit (91b1ab1) instead.
Small GGUF models (<9B) frequently generate full code or lengthy
explanations instead of calling tools, bypassing the existing
plan-without-action re-prompt mechanism. Three issues:
1. _REPROMPT_MAX_CHARS=500 was too low -- models that output full
HTML/code responses (often 1000+ chars) never triggered the
re-prompt at all, since it only fires on short responses.
2. _MAX_REPROMPTS=1 gave the model only one chance to comply.
Small models often need 2-3 nudges before switching from
text generation to tool calling.
3. The re-prompt text ("Please use the available tools...") was
too polite for small models to follow reliably.
4. Tool-calling detection missed chat templates using Jinja
whitespace-trimming syntax ({%- if tools -%}) since only
({%- if tools %}) and ({% if tools %}) were checked.
Changes:
- Raise _REPROMPT_MAX_CHARS from 500 to 2000 so longer responses
(code blocks, multi-paragraph plans) still trigger re-prompts
- Raise _MAX_REPROMPTS from 1 to 3 for more retry budget
- Use direct, imperative re-prompt language that small models
follow more reliably ("STOP. You MUST call a tool NOW.")
- Strengthen the system prompt tool nudge to explicitly forbid
outputting code blocks (redirect to the python tool instead)
- Add Jinja whitespace-trimmed variants to the tool_markers
list so all template styles are detected correctly
* UI Changes
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove unrelated test file
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* feat(studio): display images from Python tool execution in chat UI
When the model calls the Python tool to create a matplotlib plot or
other image file, the image now displays inline in the chat output
instead of being invisible to the user.
Backend:
- Detect new image files (png/jpg/gif/webp/bmp) after Python subprocess
completes by diffing os.listdir before/after execution
- Append __IMAGES__ sentinel to tool result for frontend consumption
- Strip sentinel before injecting result into LLM context (role: tool)
so the model never sees file paths
- Add GET /sandbox/{session_id}/{filename} endpoint with JWT auth
(header or query param), path traversal protection, extension
allowlist, realpath containment check, and nosniff header
Frontend:
- Parse __IMAGES__ sentinel in tool_end SSE events, create structured
result with text/images/sessionId
- Render <img> tags in Python tool UI pointing at the sandbox endpoint
Also fixes a bug where SyntaxError in user code was misreported as
"unsafe code detected" instead of showing the actual Python traceback.
The _check_code_safety function now lets SyntaxError pass through to
the subprocess for a proper error message.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix(studio): improve SVG detection and strip XML preamble
Handle <?xml ...?> declarations before <svg> tags in code fences,
strip XML declaration from SVGs before data URI rendering, and
update the sloth suggestion prompt to request showing code.
* fix(studio): persist parentId so retries survive reload
The append() handler was destructuring only { message } from
ExportedMessageRepositoryItem and discarding parentId. When loading
a saved thread, load() used ExportedMessageRepository.fromArray()
which chains all messages sequentially, flattening retry branches
into a linear list.
Now append() writes parentId to the MessageRecord, and load()
reconstructs the tree when parentIds are present. Old threads
without parentId fall back to the existing fromArray() behavior.
* fix(studio): address review findings for image display and retry persistence
Image detection:
- Use mtime comparison instead of filename-only diff so overwritten
files (e.g. plt.savefig("chart.png") called twice) are detected
Sentinel parsing:
- Use rsplit/lastIndexOf instead of split/indexOf so user code that
prints __IMAGES__: does not collide with the backend sentinel
Mixed legacy/new threads:
- For old messages without a stored parentId, infer sequential parent
from the previous message instead of null, preventing multiple roots
Sandbox endpoint:
- Change Cache-Control from "public, max-age=3600" to "private,
no-store" since these are authenticated responses
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(frontend): scope sans font overrides to chat thread only
* fix(frontend): use font-sans fallback for heading stack and simplify chat font rules
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* update logic to incorporate custom prebuilt installs
* bug fixes
* update for review comments
* fix tags
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Separate test changes from main PR
Move test file changes out of this PR to keep the diff focused on
the install_llama_prebuilt.py and setup script changes. Test updates
will be submitted in a follow-up PR.
* Fix branch ref normalization and harden JSON parsing
- Add checkout_friendly_ref() to strip refs/heads/ prefix from branch
refs before emitting them in SourceBuildPlan. git clone --branch does
not accept fully qualified refs like refs/heads/main.
- Apply normalization in source_build_plan_for_release() and the
direct-ref fallback in resolve_source_build_plan().
- Allow validated_checksums_for_bundle() to accept releases that carry
only an exact-commit source archive without the legacy upstream-tag
source tarball.
- Add 2>/dev/null || true guards to all inline python -c JSON parsing
in setup.sh so a malformed payload does not abort the script under
set -e.
* Fix Windows CUDA asset ordering and tag ref normalization
- Reorder windows_cuda_upstream_asset_names to prefer the main binary
archive (llama-{tag}-bin-win-cuda-*) over the cudart sidecar archive
(cudart-llama-bin-win-cuda-*). The cudart ZIP only contains CUDA
runtime DLLs, not llama-server or llama-quantize binaries.
- Extend checkout_friendly_ref to also strip refs/tags/ prefix for tag
refs, matching the refs/heads/ handling for branch refs.
* Simplify JSON parsing consistency in setup.sh
Use json.load(sys.stdin) consistently for all inline JSON parsing
in setup.sh, instead of the more complex json.loads(raw) pattern
on the install-tag resolution path. The 2>/dev/null || true guard
already handles empty/malformed input gracefully.
* Fix source build plan fallback for commit ref kind in PR #4771
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <daniel@unsloth.ai>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix(studio): prevent small models from stalling on tool-calling tasks
Small GGUF models (< 9B params) in "Think, Search, Code" mode would
often describe what they planned to do ("Let me create this dashboard")
and then stop generating without ever calling a tool.
Three changes:
1. Simplify web_tips for small models: remove the "fetch its full content
by calling web_search with the url parameter" guidance for models < 9B.
This multi-step instruction causes small models to plan elaborate
search-then-fetch-then-code sequences they cannot reliably execute.
2. Add "always call tools directly" imperative to the system prompt nudge
so models act immediately instead of narrating their intentions.
3. Add plan-without-action re-prompt in the agentic loop: when the model
emits planning text (matching patterns like "let me", "I'll", etc.)
without calling any tool, inject a nudge asking it to call the tool
and continue the loop. Capped at 2 re-prompts per request.
Benchmarked with Qwen3.5-4B-GGUF (N=5 trials per variant):
- Baseline: 40% of requests had any tool call
- Combined fix: 100% of requests had at least one tool call
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Distinguish between actual network downloads and GPU memory loading for cached LoRA adapters in Studio chat.
- Add isCachedLora detection for local LoRA adapter paths using comprehensive cross-platform regex (Unix, Windows, UNC, WSL, tilde)
- Thread isCachedLora through loadInfo to chat-page inline status for proper 3-way distinction (cached / local LoRA / downloading)
- Skip download progress polling for cached LoRA models (no useless /download-progress API calls)
- Fix initial toast state to use isCachedLoad consistently instead of only checking isDownloaded
- Fix cancelLoading toast to not mention background downloads for cached/local loads
- Keep download-specific text ("Downloading model..." / "Download complete") inside the download-only polling block
- Add min-w-0 guards to thread/message/markdown containers to prevent
content overflow past the composer width
- Unify chat typography from Hellix/Space Grotesk to the sans stack,
keeping monospace for code blocks and inline code
- Restructure desktop navbar right-side controls with shrink-0 wrappers
for consistent spacing across HoverCard roots
- Soften tool-call label styling (font-medium + text-foreground/85
instead of bold)
- Add responsive code block sizing via @container queries
- Add horizontal scrolling for wide code blocks within the thread column
- Scope list-item code block alignment CSS to .aui-thread-root
- Preserve useScrollLock in tool-fallback and tool-group collapsibles
- Fall back to bg-background on ViewportFooter when hideComposer is true
- Widen inline code monospace selector to cover th, blockquote, and
heading elements
- Remove unused @fontsource-variable/space-grotesk import
* Fix script unbound variable error
* remove stale test script, add llama.cpp metal source builds, update tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix Metal precedence, test sync, and add behavioral tests
- Move macOS arm64 Metal check before CUDA/ROCm in GPU backend
decision chain so Metal is not bypassed when nvcc is in PATH
- Remove RPATH flags from CPU fallback CMAKE_ARGS (only needed
for Metal library linking)
- Update test_llama_pr_force_and_source.py to match _CLONE_ARGS
rename from _CLONE_BRANCH_ARGS in setup.sh
- Add confirm_install_tree guard test for
existing_install_matches_choice
- Add TestMacOSMetalBuildLogic bash subprocess tests verifying
Metal flag selection, nvcc precedence, and CPU fallback behavior
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix Metal CPU fallback to also cover cmake build failures and update tests
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* 1. _GPU_BACKEND_FRAGMENT synced -- removed dead CPU_FALLBACK_CMAKE_ARGS= init (6/8)
2. RPATH assertion replaced -- new test_macos_arm64_cpu_fallback_args_exclude_rpath checks the actual runtime CPU_FALLBACK_CMAKE_ARGS output for @loader_path and -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON (6/8)
3. _TRY_METAL_CPU_FALLBACK=false reset after both configure-failure and build-failure fallback branches in setup.sh (4/8)
4. macOS test now removes libmtmd.0.dylib instead of the platform-agnostic convert_hf_to_gguf.py (3/8)
5. Empty-string tag test added -- test_empty_tag_omits_branch_flag for resolved_tag= (2/8)
6. RPATH checks on cmake call logs -- both fallback tests now assert @loader_path and -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON are absent from CPU fallback cmake calls, plus baseline flag preservation (multiple)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* tests clean up
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(studio): allow context length slider to reach model's native limit
The context length slider was hard-capped to the VRAM-estimated maximum,
preventing users from requesting higher context even though the backend
already handles it safely (multi-GPU selection, --fit fallback). Expose
the model's native context length from GGUF metadata as a separate API
field and use it as the slider ceiling instead. Add an amber warning
when the selected context exceeds the estimated VRAM capacity.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Raise VRAM budget to 90% and add native_context_length tests
Increase the GPU memory utilization threshold from 70% to 90% across
_select_gpus and _fit_context_to_vram, allowing longer context lengths
before VRAM capping kicks in.
Add 33 tests for the native_context_length feature covering the backend
property, context value separation invariants, Pydantic models, route
completeness, edge cases, and cross-platform binary I/O.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix: add tokenizers to no-torch runtime deps and add TORCH_CONSTRAINT for arm64 macOS py313+
Two installer fixes:
1. Add `tokenizers` to `no-torch-runtime.txt` before `transformers`.
Without it, `from transformers import AutoConfig` crashes on startup
because `--no-deps` skips transitive dependencies.
2. Add `TORCH_CONSTRAINT` variable to `install.sh`. On arm64 macOS with
Python 3.13+, tighten the torch requirement to `>=2.6` since torch
<2.6 has no cp313 arm64 wheels. The variable replaces the previously
hard-coded constraint in the uv pip install line.
Includes 66 tests (42 pytest + 24 bash) covering:
- Structural checks on install.sh, install.ps1, no-torch-runtime.txt
- Shell snippet tests with mocked python for 13 platform/version combos
- Mock uv integration verifying correct constraint string
- E2E venv tests on Python 3.12 and 3.13 confirming AutoConfig works
- Negative control proving AutoConfig fails without tokenizers
- Full no-torch sandbox regression guards (safetensors, huggingface_hub)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix incomplete no-torch manifest and align E2E tests with real --no-deps path
- Add missing transitive deps to no-torch-runtime.txt that are required
under --no-deps: regex, typing_extensions, filelock, httpx, httpcore,
certifi, idna, anyio, sniffio, h11. Without these, `from transformers
import AutoConfig` still fails after install.sh --no-torch.
- Change all E2E tests to use --no-deps (matching what install.sh does)
instead of normal dep resolution. Previous tests passed even with an
incomplete manifest because uv backfilled transitive deps.
- Rewrite negative control to derive from the real no-torch-runtime.txt
with tokenizers stripped, proving the specific fix matters.
- Replace GNU-only sed -i with heredoc in shell test for macOS compat.
- Remove unused os/sys imports from Python test file.
- Quote SKIP_TORCH and mock uv paths in bash -c strings.
* Assert install succeeds before checking import results in E2E tests
Address review feedback: test_torch_not_importable and
test_tokenizers_directly_importable in Group 3 now assert that
uv pip install returns 0 before checking import behavior. This
prevents false positives when the install itself fails silently.
* Assert install succeeds in negative control and tighten error check
- Add missing install-success assertion in test_negative_control_no_tokenizers
to prevent false positives from network/install failures.
- Tighten error message check to look for "tokenizers" in stderr or
ModuleNotFoundError, rather than the generic "No module" substring
which could match unrelated import failures.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- Fix SSL handshake failures (SSLV3_ALERT_HANDSHAKE_FAILURE, CERTIFICATE_VERIFY_FAILED) when fetching HTTPS pages by introducing _PinnedHTTPSConnection that separates TCP connect (to pinned IP) from TLS handshake (with real hostname for SNI/cert verification)
- Fix SSRF DNS-rebinding vulnerability: previous impl swapped conn.host before connect(), causing fresh DNS resolution; new subclass keeps TCP pinned to validated IP
- Fix SPA/JS-rendered doc sites returning empty content by rotating real browser User-Agents (Chrome/Firefox/Safari)
- Strip nav/footer from HTML-to-Markdown output so article content is not buried under navigation chrome
- Increase raw fetch cap from 64KB to 512KB so SSR article content is reached on GitBook/Docusaurus/Next.js pages
- Fix IPv6 address bracketing in URL netloc construction
- Hoist SSL context, handler classes, and stdlib imports to module level (created once, not per-call)
- Use consistent UA across redirect hops to avoid breaking session-aware bot detection
Replaces the fixed prebuilt llama.cpp tag with dynamic published-release
resolution, adds bounded fallback across older published releases, and
introduces maintainer-editable defaults for PR/source overrides.
Changes:
- Resolve latest from the latest usable published release in unslothai/llama.cpp
- Use the selected release upstream_tag as the authoritative llama.cpp version
- Prefer Unsloth-published platform assets when available
- Fall back to same-tag upstream ggml-org/llama.cpp assets where allowed
- Keep Linux CUDA anchored to Unsloth-published CUDA bundles only
- Add bounded fallback across older Unsloth published releases
- Add separate busy/in-use install handling (exit code 3)
- Skip reinstall when the installed bundle already matches the selected candidate
- Add maintainer-editable _DEFAULT_LLAMA_PR_FORCE and _DEFAULT_LLAMA_SOURCE
- Harden env parsing so malformed installer env vars do not crash import-time fallback logic
- Honor UNSLOTH_LLAMA_RELEASE_TAG in all resolve steps
- Always sync git remote URL in existing-checkout path
* feat(studio): architecture-aware KV cache VRAM estimation
Replace the single legacy formula (2 * n_kv_heads * head_dim * n_layers
* n_ctx * bpe) with 5-path estimation that reads 8 additional GGUF
metadata fields:
1. MLA (DeepSeek-V2/V3, GLM-4.7, GLM-5, Kimi-K2.5) -- K-only cache
using compressed KV latent + RoPE; no separate V allocation
2. Hybrid Mamba (Qwen3.5-27B, Qwen3.5-35B-A3B) -- only attention
layers (1 in N) carry KV; Mamba layers have none
3. Sliding Window (Gemma-3, gpt-oss) -- SWA layers cache
min(ctx, window) tokens instead of the full context
4. Standard GQA -- uses explicit key_length/value_length from GGUF
instead of embed // n_heads (which is wrong for many models)
5. Legacy fallback -- identical to old formula for old GGUFs
New GGUF fields parsed: attention.key_length, attention.value_length,
attention.sliding_window, full_attention_interval,
attention.kv_lora_rank, attention.key_length_mla, ssm.inner_size,
ssm.state_size.
Validated against 9 real GGUF files (72/72 field checks pass).
The legacy formula was off by +682% for Gemma-3 and -81% for
DeepSeek-V3.1.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix MLA fallback and SWA global/local ratio heuristic
Two fixes based on review findings:
1. MLA fallback now uses key_length_mla from GGUF metadata instead of
hardcoded rope_dim=64. Falls back to 64 only when key_length_mla is
absent. This ensures correct estimates for MLA variants that use
rope dimensions other than 64.
2. SWA global/local layer ratio changed from 50/50 to 1/4 (25% global,
75% SWA). Most sliding window architectures have predominantly local
layers (Gemma-3 uses ~17% global, gpt-oss uses ~50%). The 1/4
heuristic is closer to the common case and still a large improvement
over the legacy formula which ignores SWA entirely.
* Tighten _can_estimate_kv gate and treat sliding_window=0 as disabled
Two additional fixes from review round 1 (5/8 and 4/8 reviewer consensus):
1. _can_estimate_kv now requires BOTH key_length AND value_length for
the explicit-dims path. Previously key_length alone was enough,
which could cause silent fallthrough to the legacy formula with
fabricated defaults (n_kv=1, head_dim=128) when value_length was
absent from the GGUF.
2. SWA path now requires sliding_window > 0. Some GGUFs use 0 as a
disabled sentinel. Without this guard, min(ctx, 0) would zero out
all SWA layer contributions, severely underestimating KV cache.
* Fix MLA n_kv safety and use ceiling division for hybrid path
Addresses Gemini Code Assist review findings:
1. MLA path now uses n_kv_mla = n_kv_heads or 1 (not n_heads). This
prevents a 128x overestimate for DeepSeek-V3 if head_count_kv is
absent from the GGUF (n_heads=128 would have been used instead).
2. Hybrid path now uses ceiling division for attention layer count.
This prevents undercounting by 1 when n_layers is not perfectly
divisible by full_attention_interval.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
When searching for a specific publisher model (e.g. `openai/gpt-oss-20b`), the
unsloth search used the full `openai/gpt-oss-20b` string with `author=unsloth`,
which returned zero results because no unsloth model contains the publisher
prefix in its name. Users never discovered unsloth variants.
This PR strips the org prefix for publisher-qualified queries so unsloth variants
surface, then pins the original publisher model after a small batch of unsloth
results. Plain queries (no slash) and unsloth-prefixed queries are unchanged.
- Strict regex (`/^([^/\s]+)\/([^/\s]+)$/`) only triggers on valid `owner/repo`
identifiers; incomplete typeahead, multi-slash, and URL-like inputs are rejected
- Queries for `unsloth/...` models (case-insensitive) keep the full 20-result
prefetch and secondary sort
- Pinned model lookup fires in parallel with the unsloth prefetch
- Canonical-name dedup prevents duplicates when HF normalizes casing
- Publisher detection extracted into a single `useMemo` block
Replace strikethrough + opacity-50 OOM styling with gray text and red pill badge across all Studio model selectors (chat, training, onboarding).
- Use gray-500/gray-400 for OOM model names (better contrast than strikethrough)
- Red pill badge for OOM indicator with light/dark mode support
- Scope GGUF gray override to quant name only so downloaded/recommended labels keep colors
- Add !important on TIGHT/OOM badges to resist ComboboxItem hover overrides
* Fix Windows "Non-relative patterns are unsupported" when loading local GGUF models
When a user loads a GGUF model from a local Windows path (e.g.
C:\Users\danie\.lmstudio\models\unsloth\functiongemma-270m-it-GGUF),
the model identifier contains backslashes and a drive letter. Both
load_model_defaults() and _has_specific_yaml() constructed a YAML
filename from the full absolute path and passed it to Path.rglob(),
which rejects non-relative patterns on Windows.
Fixed by detecting Windows-style paths (drive letters, UNC paths,
backslashes) in addition to Unix-style paths, and using only the
directory basename for the YAML filename lookup when the identifier
is a local filesystem path.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Refactor: reuse is_local_path helper, fix case-sensitive suffix lookup
- Replace inline local-path detection in model_config.py and
inference_config.py with the existing is_local_path() from utils.paths,
which already handles Unix, Windows drive-letter, UNC, and backslash paths
- Fix case-sensitive suffix lookup in load_model_defaults(): the
_REVERSE_MODEL_MAPPING is lowercase-keyed, so suffix comparisons must use
.lower() to match paths like /path/to/Spark-TTS-0.5B/LLM
* Fix WSL path parsing and _has_specific_yaml suffix lookup
- Use normalize_path() before Path() operations so backslash Windows
paths (e.g. C:\Users\...\model) are correctly split on POSIX/WSL hosts
where pathlib treats backslashes as literal characters
- Add suffix-based (2-component and 1-component) lookup to
_has_specific_yaml() so it matches the same resolution rules as
load_model_defaults(), fixing wrong inference params for local
suffix-mapped models like Spark-TTS-0.5B/LLM
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix: clear tool status badge immediately after tool execution
The tool status timer badge (Searching 1s, 2s...) persisted after
tool calls finished because the status clear event was only sent
at the start of the next generation iteration, not after tool
execution completed.
Backend: yield status clear after all tools finish in the agentic
loop iteration, before continue starts the next generation pass.
Frontend: debounce badge visibility by 300ms so sub-second tool
calls dont flash the badge.
* Fix debounce regression for consecutive tool calls
Only apply the 300ms show-delay when transitioning from idle to
tool-active. When switching between consecutive tools in the same
turn (e.g. web_search -> python), keep the badge visible immediately
so it does not flicker or disappear during multi-tool runs.
* Delay wasActiveRef reset to bridge inter-iteration tool gaps
The backend emits a status-clear event between tool iterations,
which was resetting wasActiveRef immediately and causing the next
tool to be re-debounced (300ms hidden gap between consecutive tools
in the same turn). Now the ref reset is delayed by 500ms so a
follow-up tool within the same agentic turn shows the badge
immediately, while a genuinely new turn still gets the debounce.
* Use thread lifecycle to track tool-run boundaries
Replace the 500ms wall-clock timeout with the actual thread.isRunning
state to determine when wasActiveRef should reset. This properly
handles all cases:
- Consecutive tools within the same run stay visible without flicker
- The badge hides only when the thread run actually ends
- New turns always get a fresh 300ms debounce on the first tool
- No heuristic timeout that can misfire on slow or fast inference
* Consolidate wasActiveRef reset into single effect
Removes the separate isThreadRunning effect to avoid a race where
the ref resets before the tool-status effect reads it (when
isThreadRunning flips to false before setToolStatus(null) from
the adapter's finally block). Now wasActiveRef resets only when
both toolStatus is null AND the thread run has ended, eliminating
any flicker on the last tool of a run.
* Simplify debounce: use visible state instead of ref tracking
Drop wasActiveRef entirely and use the visible state as the
debounce gate. When the badge is not yet on screen, debounce
for 300ms before showing. When already visible from a prior tool,
keep showing immediately. This correctly handles all cases:
- All fast tools (<300ms) are suppressed, not just the first
- Consecutive tools after the badge is shown stay visible
- Badge persists across inter-iteration clears while thread runs
- New turns get a fresh debounce after visible resets
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* refactor: move folder management from sidebar into model selector
* Fix folder management: restore LoRA picker sync, error handling, caching
- Restore onFoldersChange callback to keep LoRA adapter picker in sync
when scan folders are added/removed (fixes regression from sidebar move)
- Thread onFoldersChange through ModelSelector -> HubModelPicker prop chain
- Add module-level _scanFoldersCache to prevent folder list flash on re-open
- Surface error toast on folder removal failure instead of silently ignoring
- Guard handleAddFolder against concurrent double-submit via folderLoading
- Clear folderInput on Escape key dismiss to prevent stale input on re-open
- Add refreshLocalModelsList and refreshScanFolders to useEffect dep array
* Fix compare-mode folder sync, Escape key propagation, cancel toggle state
- Wire onFoldersChange through CompareContent/GeneralCompareContent so
compare-mode selectors also refresh local models after folder changes
- Add e.stopPropagation() on Escape key in folder input to prevent
Radix Popover from closing the entire model selector dropdown
- Add e.preventDefault() on Enter key to prevent form submission
- Clear folderInput and folderError when cancel toggle hides the input,
matching the Escape key behavior for consistency
* Fix folder mutation state ordering and touch accessibility
- Use optimistic updates for add/remove so the folder list reflects
changes immediately instead of waiting on a second listScanFolders
round-trip that could silently fail.
- Move refreshScanFolders out of the finally block in handleRemoveFolder
so it runs after the cache update, not after onFoldersChange.
- Make the remove button visible on touch/mobile devices and reachable
via keyboard focus (opacity-100 on small screens, focus-visible).
- Add aria-label to the remove button for screen readers.
* Deduplicate optimistic folder add to match backend behavior
The backend returns the existing ScanFolderInfo row when adding a
path that is already registered. The optimistic update was blindly
appending the returned row, producing duplicate entries and React
key warnings. Now checks by id before appending.
* Add aria-label to folder toggle button and strengthen dedup check
- Add aria-label to the +/cancel icon button for screen readers.
- Extend optimistic dedup check to also compare by path, not just id,
to handle edge cases where the cache is stale.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* style(windows): clean installer/setup log output and remove seeded credential banner
* Keep startup credential hint without exposing plaintext password
Print the username and .bootstrap_password file path on first-run
admin creation instead of the raw password. Headless / Docker / SSH
operators still get a startup-time hint for initial sign-in, and the
plaintext credential no longer appears in terminal output or logs.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* feat: add scan_folders table and CRUD functions to studio_db
* feat: add scan folders API endpoints and integrate into model scan
* feat: add scan folders API client and update source types
* feat: add custom source to model filters and selector
* feat: add Model Folders section to chat settings sidebar
* style: fix biome formatting in ModelFoldersSection
* fix: address review findings for custom scan folders
empty string bypass, concurrent delete crash guard,
Windows case normalization, response_model on endpoints,
logging, deduplicated filter/map, module level cache for
custom folder models, consistent source labels, handleRemove
error surfacing, per folder scan cap
* fix: show custom folders section regardless of chatOnly mode
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* refactor: extract shared refreshLocalModelsList in pickers
* Harden custom scan folder validation and scanning
- Validate path exists, is a directory, and is readable before persisting
- Apply per-folder model cap during traversal instead of after (avoids
scanning millions of inodes in large directories)
- Wrap per-folder scan in try/except so one unreadable folder does not
break the entire /api/models/local endpoint for all callers
- Normalize case on Windows before storing so C:\Models and c:\models
dedup correctly
- Extend macOS denylist to cover /private/etc and /private/tmp (realpath
resolves /etc -> /private/etc, bypassing the original denylist)
- Add /boot and /run to Linux denylist
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Improve scan robustness and preserve Windows path casing
- Preserve original Windows path casing in DB instead of lowercasing
(normcase used only for dedup comparison, not storage)
- Catch PermissionError per child directory so one unreadable subdirectory
does not skip the entire custom folder scan
- Wrap list_scan_folders() DB call in try/except so a DB issue does not
break the entire /api/models/local endpoint
* fix: scan custom folders for both flat and HF cache layouts
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix Windows case-insensitive path dedup with COLLATE NOCASE
Use COLLATE NOCASE on the scan_folders.path column so that the UNIQUE
constraint correctly deduplicates C:\Models and c:\models on Windows
without lowercasing the stored path. Also use COLLATE NOCASE in the
pre-insert lookup query on Windows to catch existing rows with
different casing.
* Restore early-exit limit in _scan_models_dir for custom folders
Keep the limit parameter so _scan_models_dir stops iterating once
enough models are found, avoiding unbounded traversal of large
directories. The post-traversal slice is still applied after combining
with _scan_hf_cache results.
* feat: scan custom folders with LM Studio layout too
* Fix custom folder models being hidden by dedup
Custom folder entries were appended after HF cache and models_dir
entries. The dedup loop kept the first occurrence of each model id,
so custom models with the same id as an existing HF cache entry were
silently dropped -- they never appeared in the "Custom Folders" UI
section.
Use a separate dedup key for custom-source entries so they always
survive deduplication. This way a model can appear under both
"Downloaded" (from HF cache) and "Custom Folders" (from the
user-registered directory) at the same time.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Harden LM Studio scan and fix COLLATE NOCASE on Linux
- Add per-child and per-publisher OSError handling in _scan_lmstudio_dir
so one unreadable subdirectory does not discard the entire custom
folder's results
- Only apply COLLATE NOCASE on the scan_folders schema on Windows where
paths are case-insensitive; keep default BINARY collation on Linux
and macOS where /Models and /models are distinct directories
* Use COLLATE NOCASE in post-IntegrityError fallback SELECT on Windows
The fallback SELECT after an IntegrityError race now uses the same
case-insensitive collation as the pre-insert check, so a concurrent
writer that stored the path with different casing does not cause a
false "Folder was concurrently removed" error.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Simplify tool-call dedup: drop hashlib, inline helpers
The duplicate tool-call detector only compares calls within a single
request from the same JSON parser, so dict key order is guaranteed
identical for identical calls (Python 3.7+ insertion-ordered dicts).
- Replace hashlib.md5(json.dumps(...)) with name + str(args)
- Inline _tool_call_key, _is_duplicate_call, _record_tool_call
since each was a one-liner used once
- Remove unused hashlib import
* Remove tool_calling_benchmark_results.md from repo
* Replace html2text with builtin HTML-to-Markdown converter
Drop the external html2text (GPL-3.0) dependency and its regex
fallback. Add _html_to_md.py (~190 lines, stdlib only) using
html.parser.HTMLParser that handles headings, links, bold/italic,
lists, tables, blockquotes, code blocks, and entity decoding.
Strips script/style/head tags entirely.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use json.dumps(sort_keys=True) for tool-call dedup key
str(dict) is sensitive to insertion order, so semantically identical
calls with different key ordering would bypass duplicate detection.
Switch to json.dumps with sort_keys=True for a canonical representation.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Revert dedup key to str(arguments)
json.dumps(sort_keys=True) is unnecessary here -- the arguments dict
always comes from the same JSON parser within a single request, so
key insertion order is deterministic (Python 3.7+). str() is faster
and sufficient for consecutive-call dedup.
* Address review comments on _html_to_md.py
- Remove "hr" from _BLOCK_TAGS so the dedicated hr handler is reachable
- Prefix all newlines with ">" inside blockquotes (multi-line support)
- Emit full  for images instead of alt text only
- Replace newlines with spaces inside table cells
- Track header cells per-row (_row_has_th) instead of last-cell-only
- Strip trailing tabs in addition to spaces in cleanup regex
* Fix blockquote rendering, truncated-HTML buffer flush, and dedup key canonicalization
_html_to_md.py:
- Rewrite blockquote handling with stack-based buffer approach so nested
blockquotes, pre blocks inside blockquotes, and multi-paragraph quotes
all render correctly with proper "> " prefix on every line.
- Add flush_pending() to recover content from truncated HTML where closing
tags are missing (common when _fetch_page_text caps the download size).
Flushes open <a>, <td>, <pre>, and blockquote buffers.
- Skip <img> tags to match prior html2text ignore_images=True behavior
and avoid data-URI amplification consuming the output budget.
- Collapse all whitespace (including newlines) in non-pre content per
standard HTML whitespace rules: \s+ -> single space.
- Escape pipe characters in table cell content to prevent column breakage.
- Emit separator row after the first row for tables without <th> headers.
- Guard against IndexError on _ol_counter for orphan <li> elements.
- Normalize CRLF line endings before parsing.
llama_cpp.py:
- Restore canonical dedup key with json.dumps(sort_keys=True) so that
semantically identical tool calls with different JSON key order are
correctly detected as duplicates.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix table optional end tags, inline code whitespace, and link text normalization
_html_to_md.py:
- Extract _finish_cell() and _finish_row() helpers to handle HTML tables
that omit optional </td>, </th>, or </tr> end tags. This is valid HTML
and common on real web pages -- previously the parser would silently
drop earlier cells and entire rows.
- Call _finish_cell()/_finish_row() from handle_starttag for <tr>/<td>/<th>,
handle_endtag for </tr>/<td>/<th>/<table>, and flush_pending() so all
three paths (normal close, implicit close, truncated HTML) use the same
row-finalization logic including header separator emission.
- Add _in_inline_code flag so handle_data() preserves literal whitespace
inside <code> spans instead of collapsing it. Source like
<code>pip install unsloth</code> now correctly renders as
`pip install unsloth` rather than `pip install unsloth`.
- Extract _finish_link() helper that normalizes accumulated link text with
\s+ -> single space before building the Markdown link. Prevents block-
level content inside <a> tags (e.g. <a><div>one</div><div>two</div></a>)
from producing multiline [one\n\ntwo](href) link labels.
- Empty blockquotes now produce no output instead of a stray ">".
- Remove unused _bq_depth field (all routing uses _bq_stack).
- Flush open cells and rows in handle_endtag("table") for robustness.
* Support <ol start=N>, <dl>/<dt>/<dd>, and preserve code block whitespace
_html_to_md.py:
- Honor <ol start="N"> attribute so ordered lists preserve their original
numbering instead of always restarting from 1. Important for docs/tutorials
that continue numbering across sections.
- Add dl, dt, dd to _BLOCK_TAGS so definition lists (common on MDN, Python
docs, Django docs) produce separated text instead of concatenated blobs.
- Rewrite _cleanup() to be fence-aware: content inside fenced code blocks
is now preserved verbatim (intentional blank lines in <pre> content are
no longer collapsed). Outside code blocks, blank runs are limited to one
and trailing whitespace is stripped.
- Fix _prefix_blockquote() to strip trailing whitespace before collapsing
blank lines, preventing the "\n\n \n\n" pattern from sneaking through.
* Suppress whitespace-only text nodes between table structural elements
Indented HTML tables (nearly all real-world pages) produce whitespace
text nodes between <table>, <tr>, </tr> etc. that land in the output
as leading spaces before table rows, breaking Markdown table alignment.
Skip whitespace-only text nodes when inside a table but not inside a
cell, so indentation from source HTML does not leak into the output.
* Revert dedup key to str(arguments) with explanatory comment
json.dumps(sort_keys=True) is unnecessary overhead here: arguments
always comes from json.loads on model output within a single request,
so dict insertion order is deterministic in Python 3.7+. A repeated
call from the model produces the same JSON, which parses to the same
dict repr. str() avoids re-serialization on every tool call.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* studio: improve GGUF tool calling accuracy and reliability
- Add URL fetching to web_search tool so models can read full page
content instead of only getting search snippets. Uses html2text for
clean markdown conversion with regex fallback.
- Inject current date and behavioral guidance (URL fetch workflow,
no repeated queries, use code for data processing) into the
tool-use system prompt.
- Append error recovery nudge to tool results that indicate failure,
helping small models avoid looping on the same broken call.
- Strip leaked <tool_call> XML from assistant messages in conversation
history and from the outgoing SSE stream.
- Raise default max tool iterations from 10 to 25 across backend,
model schema, and frontend defaults.
- Increase _MAX_PAGE_CHARS from 4k to 16k so fetched pages contain
enough content for the model to extract useful information.
- Add "IMPORTANT: These are only short snippets" hint to search
results so models know to fetch full pages when needed.
Tested with Qwen3.5-4B-GGUF (UD-Q4_K_XL), 10 runs before/after:
- XML leaks in responses: 10/10 -> 0/10
- URL fetch usage: 0 -> 4/10 runs
- Runs producing actual correct answers: 0/10 -> 2/10
- Average tool calls per query: 5.5 -> 3.8 (more efficient)
- Average response time: 12.3s -> 9.8s
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add tool calling benchmark results across model sizes and quants
Tested 16 configurations (4 models x 2 quants x 2 KV cache types)
with 10 runs each on NVIDIA B200.
Best config: 27B UD-Q4_K_XL + bf16 KV -- 6/10 runs found all 4
correct songs, 0 XML leaks, 131s average response time.
* Add duplicate tool-call detection and final-answer synthesis
When the model repeats the exact same tool call (same name + arguments)
twice in a row, skip execution and return a redirect message telling it
to try a different approach. This prevents the 8x-repeated-query loops
observed on 27B and 35B models.
When the tool iteration cap (25) is reached, inject a "provide your
final answer now" message before the final streaming pass. This lets
the model synthesize a useful answer from everything it gathered
instead of being silently cut off.
Tested on Qwen3.5-27B UD-Q4_K_XL (10 runs):
- Repeated query runs: 4/10 -> 2/10
- Cap hits: 1/10 -> 0/10
- All 4/4 accuracy: 5/10 -> 7/10
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix CodeQL alert: handle whitespace in script/style closing tags
The regex fallback for HTML stripping did not match closing tags
with whitespace before the angle bracket (e.g. </script >).
Use \s* before > in both script and style patterns.
* Address reviewer findings: SSRF, timeout crash, XML regex, dedup
- SSRF: resolve hostname via getaddrinfo and reject private, loopback,
link-local, multicast, and reserved addresses before fetching
- Timeout: handle timeout=None (unlimited mode) in URL fetch path
by defaulting to 60s instead of crashing on min(None, 60)
- Download cap: read at most max_chars*4+1 bytes instead of the
full response body before truncating
- XML regex: match both <tool_call> and <function=...> markup in
the history/stream cleanup (inference.py)
- CodeQL: use [^>]* in closing script/style tags to handle any
whitespace or attributes before >
- Dedup: track whether each tool call failed so retries after
transient errors are allowed; only block consecutive identical
calls that both succeeded
- Final-answer synthesis: guard on max_tool_iterations > 0 so
callers who disable tools do not get a false "used all calls" turn
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix redirect SSRF, SSE streaming regression, dedup off-by-one
- SSRF redirect bypass: disable auto-redirect in urllib, manually
follow up to 5 hops with host validation at each step. Prevents
public URLs from redirecting to loopback/private targets.
- SSE streaming: track prev_text on the raw cumulative and strip
XML from the delta only, so completed tool_call tags do not cause
the cumulative to shrink and drop trailing real text.
- Dedup off-by-one: check the immediately previous call (window=1)
instead of requiring 2 matching history entries, so the second
identical successful call is blocked rather than the third.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix redirect HTTPError handling and tighten error prefixes
- Redirect fix: urllib raises HTTPError (not a normal response) when
the redirect handler returns None. Catch HTTPError for 3xx codes
and extract the Location header from the exception object.
- Error prefixes: remove overly broad "No " prefix that matched
"No results found." (a valid empty-search outcome, not an error).
Replace with specific prefixes like "Blocked:", "No query provided",
"Failed to resolve". This ensures empty search results are correctly
classified as non-errors for duplicate-call tracking.
* Fix SSE cross-chunk XML leaks, cleanup review findings
- SSE streaming: sanitize the full cumulative text before diffing
against the previous sanitized snapshot, so XML tags that span
chunk boundaries are stripped correctly. The previous delta-based
approach leaked split tags.
- DRAINING fallback: use _strip_tool_markup() helper instead of a
manual regex that only handled <tool_call> but not <function=...>.
- Move hashlib import, _TOOL_XML_RE compile, and datetime import to
module level per style guide.
- Remove unused _hit_tool_cap variable.
* Fix DNS rebinding, charset detection, HTTPError handling, dedup double-record
- DNS rebinding: resolve hostname once via getaddrinfo, pin the
returned IP, rewrite the URL to connect to the pinned IP with
a Host header. Each redirect hop re-resolves and re-validates.
Closes the TOCTOU window between validation and connection.
- Charset: use resp.headers.get_content_charset() instead of
hardcoding utf-8, so pages with other encodings decode correctly.
- HTTPError: return descriptive "HTTP {code} {reason}" instead of
re-raising into a generic "Search failed" message.
- Dedup: remove redundant _record_tool_call in the duplicate branch;
the single call at the end of the loop handles all cases.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix: auto-retry stalled HF downloads with HF_HUB_DISABLE_XET=1
The heartbeat thread now monitors the HF Hub cache directory for
file-size growth. If no bytes are written for 3 minutes, it sends a
"stall" message to the orchestrator, which kills the subprocess and
retries with HF_HUB_DISABLE_XET=1 (falling back from Xet to standard
HTTPS). If the retry also stalls, it errors out with a clear message.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: include transport type (xet/https) in heartbeat and stall log messages
Makes it clear in backend logs whether the download is using xet or
https transport, and which transport stalled — helpful for debugging.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: monitor HF Hub .tmp dir to avoid false stall detections
huggingface_hub downloads into .tmp/ before atomically moving to
blobs/. Without monitoring .tmp, a large shard actively downloading
for several minutes would show zero blob growth and trigger a false
stall.
* fix: scope HF cache size check to specific model being loaded
Instead of scanning every models--*/blobs directory (O(N) with cached
models), only check the specific model's blobs dir plus the global
.tmp dir. Much faster on systems with many cached models.
* Fix false stall detection on cached/local models and cleanup issues
- Only fire stall if download activity was observed (cache size changed
at least once). Previously, any model load taking >180s would trigger
a false stall, even for already-cached or local models where no
download is happening.
- Return -1 from _get_hf_cache_size on exception to distinguish
"unable to measure" from "genuinely zero bytes". Skip stall logic
when measurement fails.
- Add _shutdown_subprocess before raising on terminal stall path to
prevent leaking a stuck subprocess.
- Detect pre-existing HF_HUB_DISABLE_XET=1 in the parent environment
to avoid a redundant retry cycle when Xet is already disabled.
- Remove global .tmp directory scanning (not used by modern
huggingface_hub; in-progress downloads use .incomplete files in
blobs/ which are already captured by iterdir).
- Add f.is_file() guard in cache size calculation.
- Replace em dashes with ASCII dashes for Windows terminal compat.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Harden stall detection edge cases
- Guard -1 to valid value transition: when initial _get_hf_cache_size
returns -1 (error) and later recovers to a real value, do not count
that as download activity. Only set saw_download_activity when the
previous measurement was also valid (>= 0).
- Move os import to top-level in orchestrator.py instead of inline
import os as _os.
- Fix misleading comment about post-download protection.
* Use .incomplete files to detect active downloads for stall detection
Replace the saw_download_activity heuristic with direct .incomplete file
detection. huggingface_hub creates *.incomplete files in blobs/ during
active downloads and removes them on completion. This gives a reliable
signal for whether a download is actually in progress.
Benefits:
- Cached models: no .incomplete files -> no stall fired even after 180s
- Post-download init (quantization, GPU loading): .incomplete files gone
so stall timer resets, long init phases are not killed
- Pre-download hangs (XET handshake stall): .incomplete files are
created at download start, so zero-byte stalls are now detected
- No more false positives from -1 to valid measurement transitions
The _get_hf_download_state function now returns (total_bytes,
has_incomplete) tuple or None on error, replacing _get_hf_cache_size.
* Add debug logging to download state exception handler
Log the exception at debug level when _get_hf_download_state fails,
instead of silently returning None. Helps with troubleshooting cache
measurement issues.
* Watch both adapter and base model repos for LoRA stall detection
When loading a LoRA adapter, the actual download bottleneck is often
the base model, not the adapter itself. Update the heartbeat to watch
both mc.identifier and mc.base_model cache directories so stall
detection works for LoRA loads where the base model stalls on Xet.
Also update _get_hf_download_state to accept multiple model names and
skip names without "/" (local paths) since those do not have HF cache
directories.
* Fix model name filtering for official HF models without org prefix
Models like gpt2 and bert-base-uncased do not contain a slash but are
still valid HF Hub models with cache directories. Replace the "/" check
with a proper local-path detection that checks for path separators and
path-like prefixes instead.
Also fix the base_model watch list to not require "/" in the base model
name, so official models used as LoRA bases are also monitored.
* Fix local path detection that broke all org/model names on Linux
The os.path.sep check matched "/" in HF model IDs like "org/model" on
Linux, causing the stall detector to skip ALL standard HF models.
Replace with a check that only skips names starting with "/" (absolute
paths), "." (relative paths), "~" (home-relative), or containing "\"
(Windows paths). HF model IDs like "org/model" or "gpt2" pass through
correctly on all platforms.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix(studio): change default weight_decay from 0.01 to 0.001
The default weight decay across Studio was 0.01 but should be 0.001.
Updated the default in all backend fallbacks, the Pydantic model, the
frontend config, and every YAML preset/model-default config.
* fix(studio): auto-set learning rate based on training method
Default LR should be 2e-4 for LoRA/QLoRA and 2e-5 for full fine-tuning.
Frontend: track whether the user has manually edited the LR field via a
_learningRateManuallySet flag (same pattern as trainOnCompletions).
When switching training method and the user has not touched the LR,
auto-set it to the appropriate default. Reset the flag on model load.
Backend: change trainer.py start_training default from 5e-5 to 2e-4,
update default.yaml fallback from 5e-5 to 2e-4, and fix
full_finetune.yaml from 0.0002 (2e-4) to 2e-5.
* refactor(studio): centralize weight_decay and learning rate defaults
Create studio/backend/core/training/constants.py as the single source of
truth for DEFAULT_WEIGHT_DECAY (0.001), DEFAULT_LEARNING_RATE (2e-4),
DEFAULT_LEARNING_RATE_FULL (2e-5), and DEFAULT_LEARNING_RATE_STR ("2e-4").
All backend modules (trainer.py, training.py, worker.py, models/training.py)
now import from constants.py instead of hardcoding values.
On the frontend, add LR_DEFAULT_LORA and LR_DEFAULT_FULL to
config/training.ts and use them in the store instead of magic numbers.
A comment cross-references the backend constants file.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix model-specific LR override, persist migration, and flag resets
- Preserve model-specific learning rates from YAML configs when the
async autoSelectTrainingMethod callback fires (fixes Qwen2.5-1.5B
getting 2e-4 instead of its configured 1e-5, etc.)
- Bump zustand persist version to 9 with migration so existing users
with weightDecay=0.01 get updated to 0.001
- Clear _learningRateManuallySet in reset() and applyConfigPatch()
for consistency with trainOnCompletions flag behavior
- Add DEFAULT_LEARNING_RATE_FULL_STR to constants.py
* Refine applyConfigPatch to only clear LR flag when patch includes LR
Only reset _learningRateManuallySet when the applied config patch
actually provides a learningRate value. This prevents unrelated config
patches from silently disarming the manual-edit guard, which would
cause a subsequent setTrainingMethod call to overwrite the user's
custom LR.
* Preserve model-specific LR when switching between qlora and lora
Only auto-switch the learning rate when the training category changes
(adapter <-> full fine-tuning). Switching between qlora and lora keeps
the current LR since both methods share the same learning rate range.
This preserves curated per-model defaults (e.g. 1e-5 for
Qwen2.5-1.5B-Instruct) when the user toggles between adapter methods.
* Remove constants.py, use YAML configs as the source of truth
The YAML config files (model-specific + default.yaml) are the intended
config layer for training defaults. The Python backend fallbacks now use
inline values that match the YAML configs, rather than importing from a
separate constants module. This keeps the config architecture simple:
YAML files are the single source of truth, and the inline Python
fallbacks are just safety nets that mirror them.
* fix(studio): preserve model-specific LR when switching training method
Stash YAML-provided learning rate and use it to restore the correct
value when switching between adapter and full fine-tune modes.
- qlora <-> lora no longer overwrites the model's LR
- full -> adapter restores the YAML LR instead of a hardcoded constant
- selecting a model while on full fine-tune uses LR_DEFAULT_FULL
instead of applying the YAML adapter LR
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
* fix: throttle and cache HuggingFace modelInfo API calls
The frontend was firing 40 to 60 parallel modelInfo requests on app
startup with zero caching or deduplication, causing HF rate limits.
Adds a caching layer (hf-cache.ts) with TTL cache, inflight request
dedup, and a concurrency limiter. Also debounces the HF token input
so typing a token no longer re-fires all model searches per keystroke.
* fix: only fetch VRAM info for visible models in chat selector
* Fix cache key isolation and VRAM badge stability for PR #4696
- Cache key now includes a token fingerprint (last 8 chars) instead of a
boolean, so switching HF tokens gives separate cache entries instead of
serving stale data from the previous token.
- Extract token via credentials?.accessToken to match the @huggingface/hub
API surface.
- Extend CachedResult type with safetensors/tags fields so downstream
consumers no longer need unsafe `as` casts.
- Merge VRAM param map with previous state on scroll instead of replacing
it, preventing a brief flash of missing VRAM badges when new models
become visible.
* Fix VRAM badges missing for search-filtered recommended models
When a user types a search query, filteredRecommendedIds can include
models beyond the currently visible page. These models had no VRAM data
because useRecommendedModelVram only received visibleRecommendedIds.
Now we pass the union of visibleRecommendedIds and filteredRecommendedIds
to the VRAM hook, so recommended models surfaced by search also show
their VRAM badges. The hf-cache layer ensures no duplicate network calls.
* Apply biome formatting to hf-cache.ts and use-recommended-model-vram.ts
Auto-formatted with biome check --write to match project lint rules:
- Block statements for single-line if/for bodies
- Import sorting (type imports first)
- Consistent line wrapping
* Fix extractToken to handle both current and deprecated HF auth forms
The @huggingface/hub CredentialsParams type is a union:
- { accessToken: "hf_..." } (current preferred form)
- { credentials: { accessToken: "..." } } (deprecated form)
Previously only checked params.credentials?.accessToken (deprecated path).
Now checks both forms so the cache key is correct regardless of which
calling convention is used.
* Simplify extractToken, map merge, and set construction
- extractToken: remove type assertions, use direct property access with
truthiness checks for cleaner union type handling
- VRAM map merge: use Map spread constructor instead of manual for loop
- idsForVram: use Set spread construction for more concise dedup
* Add rationale comment for MAX_CONCURRENT=3 in hf-cache.ts
* Skip GGUF repos in VRAM fetch and pre-populate cache from listModels
Two changes to reduce redundant HF API calls:
1. Filter GGUF repos from idsForVram before passing to useRecommendedModelVram.
GGUF repos have no safetensors metadata and the render layer already shows
a static "GGUF" badge -- fetching modelInfo for them is a no-op that wastes
a semaphore slot and a network round-trip.
2. Add primeCacheFromListing() to hf-cache.ts and call it from listModels
yield sites in mergedModelIterator and priorityThenListingIterator.
listModels returns the same type (ModelEntry & Pick<ApiModelInfo, T>) as
modelInfo with the same additionalFields, so the data is interchangeable.
Priming only writes if the key is not already fresh, so it never overwrites
a recent modelInfo response.
This means models discovered via listModels are already in cache when
useRecommendedModelVram later calls cachedModelInfo for them, eliminating
duplicate network requests.
* Fix cache key mismatch: prime both token and anonymous slots
The VRAM hook calls cachedModelInfo without credentials (anonymous key),
but listModels results were primed only under the authenticated key.
For authenticated users the priming was a no-op -- cache miss every time.
Fix: prime both the token-specific slot and the anonymous slot when an
access token is present. Public model metadata (safetensors, tags) is
identical regardless of auth so this is safe.
Also add a defensive guard in primeCacheFromListing for empty name.
* Auto-prime anonymous cache slot from authenticated modelInfo fetches
When cachedModelInfo is called with a token, the result was only stored
under the token-specific key (e.g. model::abc12345). The VRAM hook
calls cachedModelInfo without credentials and reads the anonymous slot
(model::anon), causing a cache miss and duplicate fetch for every
priority model.
Now cachedModelInfo also writes to the anonymous slot on success when
a token is present. Public model metadata (safetensors, tags) is
identical regardless of auth, so this is safe and eliminates ~10
duplicate API calls on first page load.
* Guard anonymous cache priming against gated/private models
Only prime the anonymous cache slot for non-gated, non-private models.
Previously, authenticated modelInfo responses and listing results were
unconditionally copied into the anonymous slot, which could briefly
expose gated/private model metadata after clearing the HF token.
Now checks result.gated and result.private before writing the anon slot.
Public unsloth/ models (the common case) still benefit from the
optimization; gated models like meta-llama/* require a fresh fetch
per auth context.
* Extract primeFromListing helper to deduplicate cache priming logic
The cache priming pattern (prime token slot + conditionally prime anon
slot for non-gated models) was duplicated in three places. Extracted
into a single primeFromListing() function for maintainability.
* Export CachedResult type, add isStale helper, simplify primeFromListing
- Export CachedResult so consumers can use it directly instead of
the indirect Parameters<typeof ...> pattern.
- Extract isStale(key) helper to deduplicate the cache freshness
check that was repeated in primeCacheFromListing, cachedModelInfo,
and the anonymous-slot priming logic.
- Simplify primeFromListing to use CachedResult directly for both
the data parameter and the gated/private guard, eliminating the
double cast.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Revert to balanced for inference
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Remove unused for_inference parameter from get_device_map
Since inference and training both use "balanced" now, the for_inference
flag is dead code. Remove it from the function signature, the call site
in inference.py, and simplify the tests accordingly.
* Remove redundant TestDeviceMapForInference test class
TestGpuAutoSelection already covers the same multi-gpu and single-gpu
device_map assertions. The TestDeviceMapForInference class was left
over from when for_inference had distinct behavior.
* Remove redundant test_get_device_map_multi_gpu_uses_balanced
Its assertions ([0,1] -> balanced, [0] -> sequential) are already
covered by test_get_device_map_uses_explicit_gpu_selection.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix(studio): open tour ReadMore links in new tab
The quick tour "Read more" links navigate away from Studio instead of
opening in a separate tab. Add target="_blank" and rel="noopener
noreferrer" to the ReadMore component so external doc links open in a
new browser tab.
* fix(studio): only open external ReadMore links in new tab
Apply target="_blank" conditionally based on whether the href starts
with "http", so internal links still navigate in the same tab.
* Tighten external-link detection in ReadMore component
Use regex /^https?:\/\// instead of startsWith("http") so the check
requires the full protocol prefix and does not match non-URL strings
that happen to begin with "http".
* Hoist regex to module scope for ReadMore
Move EXTERNAL_URL_RE to top-level constant to satisfy the biome
useTopLevelRegex lint rule and avoid re-creating the RegExp on
every render.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* studio: gate multimodal incompatibility warning on settled model capabilities
* Also disable Start button during isCheckingVision fallback
When getModelConfig fails and the fallback checkVisionModel is still
in-flight, isLoadingModelDefaults clears before isCheckingVision does.
Without also gating on isCheckingVision the Start button briefly
re-enables with stale capability flags.
Add isCheckingVision to the disabled condition and show "Loading
model..." text while either flag is active.
* Show correct error message for audio dataset incompatibility
The incompatibility warning always said "switch to a vision model"
even when the actual issue was an audio dataset on a non-audio model.
Now shows an audio-specific message when the mismatch is audio.
* Extract isLoadingModel constant for clarity
Pull the combined model-loading condition into a single constant
reused by the settled check, the disabled prop, and the button label.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
The 180s wall-clock timeout would kill model loads on slow connections
even when the download was actively progressing. Now the worker sends
heartbeat status messages every 30s during loading, and the orchestrator
resets its 300s deadline on each one — so it only times out when the
subprocess goes truly silent.
* fix: skip download progress polling for exported GGUF models
* fix: revert isLocalGgufDir change — exported GGUFs are file paths, not dirs
* fix: set isDownloaded true for all adapters in LoraModelPicker
* fix(studio): replace unicode emoji in print() to avoid cp1252 crash on Windows
On Windows the default console encoding is cp1252 which cannot encode
unicode emoji like U+2705 or U+26A0. bare print() calls with these
characters cause a UnicodeEncodeError at runtime.
- run.py: replace emoji with ASCII status prefixes [OK] and [WARNING]
- format_conversion.py: remove duplicate print() that mirrors the
logger.info() call on the next line, and drop the emoji from the
log message since loggers handle encoding separately
* fix(studio): apply same emoji/print cleanup to parallel VLM conversion path
The parallel URL-based conversion logic has the same duplicate print()
with emoji that was fixed in the sequential path. Remove the bare
print() and drop the emoji from the logger.info() call.
* Treat install_python_stack.py failure as fatal in setup.ps1
On Linux/Mac, setup.sh runs under set -euo pipefail so a non-zero
exit from install_python_stack.py aborts the installer. On Windows,
setup.ps1 had no exit code check -- if the Python script crashed
(eg from the cp1252 UnicodeEncodeError), the installer silently
continued past the dependency loop and reported success. Studio
would then fail at launch with ModuleNotFoundError for structlog,
fastapi, and other deps that were never installed.
Capture $LASTEXITCODE and exit 1 if the dependency installer fails,
matching the error handling pattern already used for PyTorch install.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* [WIP] balanced device map for studio
* gpus as a request parameter
* API for multi GPU stuff
* return multi gpu util in new API
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use balanced_low0 instead of balanced
* Use balanced_low0 instead of balanced
* Fix device_map typo, UUID parsing crash, set() filter bug, and broken tests
- balanced_low0 -> balanced_low_0 (transformers/accelerate rejects the old string)
- get_parent_visible_gpu_ids() now handles UUID/MIG CUDA_VISIBLE_DEVICES
gracefully instead of crashing on int() parse
- _get_backend_visible_gpu_info() set() or None bug: empty set is falsy so
CUDA_VISIBLE_DEVICES=-1 would disable filtering and report all GPUs
- test_gpu_selection.py: add missing get_visible_gpu_utilization import and
add required job_id arg to start_training() calls
* Smart GPU determinism using estimates
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* disallow gpu selection for gguf for now
* cleanup
* Slightly larger baseline
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Treat empty list as auto
* Verbose logging/debug
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Cleanup and revert unnecessary deletions
* Cleanup excessive logs and guard against disk/cpu offload
* auth for visibility API. cleanup redundant imports. Adjust QLoRA estimate
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* support for non cuda gpus
* Fix multi-GPU auto-selection memory accounting
The multi_gpu_factor was applied uniformly to all GPUs including the
first one, which unfairly penalizes single-GPU capacity when
transitioning to multi-GPU. This created a discontinuity where a model
that barely fits 1 GPU would suddenly require 2 GPUs because the first
GPU's free memory was discounted by 20%.
Now the first GPU keeps its full free memory, and only additional GPUs
have an overhead factor (0.85) applied to account for inter-GPU
communication and sharding overhead. This gives more accurate
auto-selection and avoids unnecessary multi-GPU for models that
comfortably fit on one device.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add sandbox tests for multi-GPU selection logic
24 tests covering model size estimation, memory requirements, automatic
GPU selection, device map generation, GPU ID validation, and multi-GPU
overhead accounting. All tests use mocks so they run without GPUs on
Linux, macOS, and Windows.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix reviewer findings: 4bit inference estimate, fallback, GGUF gpu_ids, retry
1. 4-bit inference now uses reduced memory estimate (model_size/3 + buffer)
instead of the FP16 1.3x multiplier. This prevents over-sharding
quantized models across unnecessary GPUs.
2. When model size estimation fails, auto_select_gpu_ids now falls back to
all visible GPUs instead of returning None (which could default to
single-GPU loading for an unknown-size model).
3. GGUF inference route now treats gpu_ids=[] as auto-selection (same as
None) instead of rejecting it as an unsupported explicit request.
4. Training retry path for "could not get source code" now preserves the
gpu_ids parameter so the retry lands on the same GPUs.
5. Updated sandbox tests to cover the new 4-bit inference estimate branch.
* Remove accidentally added unsloth-zoo submodule
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix UUID/MIG visibility and update test expectations
1. nvidia.py: When CUDA_VISIBLE_DEVICES uses UUID/MIG tokens, the
visibility APIs now return "unresolved" with empty device lists instead
of exposing all physical GPUs. This prevents the UI from showing GPUs
that the backend process cannot actually use.
2. test_gpu_selection.py: Updated test expectations to match the new
multi-GPU overhead accounting (first GPU at full capacity, 0.85x for
additional GPUs) and 4-bit inference memory estimation formula.
All 60 tests now pass.
* Add CPU/disk offload guard to audio inference path
The audio model loading branch returned before the common
get_offloaded_device_map_entries() check, so audio models loaded with a
multi-GPU device_map that spilled layers to CPU/disk would be accepted
instead of rejected. Now audio loads also verify no modules are offloaded.
* Improve VRAM requirement estimates
* Replace balanced_low_0 with balanced
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* refine calculations for slightly easier nums
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* adjust estimates
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use nums instead of obj to avoid seralisation error
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Harden nvidia-smi parsing and fix fallback GPU list
1. nvidia.py: Wrap int() casts for GPU index and memory in try/except
so MIG slices, N/A values, or unexpected nvidia-smi output skip the
unparseable row instead of aborting the entire GPU list.
2. nvidia.py: Handle GPU names containing commas by using the last
field as memory instead of a fixed positional index.
3. hardware.py: fallback_all now uses gpu_candidates (GPUs with verified
VRAM data) instead of raw devices list, which could include GPUs
with null VRAM that were excluded from the ranking.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* cleanup
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* consolidate raise_if_offload
* Improve MoE support. Guard against nvidia-smi failures
* Improve MoE support. Guard against nvidia-smi failures
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix shared-expert LoRA undercount, torch VRAM fallback, and apply_gpu_ids edge case
1. vram_estimation.py: compute_lora_params now includes shared experts
(n_shared_experts) alongside routed experts when computing MoE LoRA
adapter parameters. Previously only n_experts were counted, causing
the estimator to undercount adapter, optimizer, and gradient memory
for DeepSeek/GLM-style models with shared experts.
2. hardware.py: _torch_get_per_device_info now uses mem_get_info (which
reports system-wide VRAM usage) instead of memory_allocated (which
only reports this process's PyTorch allocations). This prevents
auto-selection from treating a GPU as mostly free when another
process is consuming VRAM. Falls back to memory_allocated when
mem_get_info is unavailable.
3. hardware.py: apply_gpu_ids([]) now returns early instead of setting
CUDA_VISIBLE_DEVICES="" which would disable CUDA entirely. Empty
list inherits the parent visibility, same as None.
4. hardware.py: Upgraded fallback_all GPU selection log from debug to
warning so operators are notified when the model likely will not fit
in available VRAM.
* Guard nvidia-smi subprocess calls against OSError and TimeoutExpired
get_visible_gpu_utilization and get_backend_visible_gpu_info now catch
OSError (nvidia-smi not found) and TimeoutExpired internally instead
of relying on callers to wrap every invocation. Returns the standard
available=False sentinel on failure so the torch-based fallback in
hardware.py can take over.
* Guard get_primary_gpu_utilization and reset GPU caches between tests
1. nvidia.py: get_primary_gpu_utilization now catches OSError and
TimeoutExpired internally, matching the pattern already used in
get_visible_gpu_utilization and get_backend_visible_gpu_info. All
three nvidia-smi callers are now self-contained.
2. test_gpu_selection.py: Added _GpuCacheResetMixin that resets the
module-level _physical_gpu_count and _visible_gpu_count caches in
tearDown. Applied to all test classes that exercise GPU selection,
device map, or visibility functions. This prevents stale cache
values from leaking between tests and causing flaky results on
machines with real GPUs.
* Fix nvidia-smi fallback regression and physical GPU count validation
1. hardware.py: get_gpu_utilization, get_visible_gpu_utilization, and
get_backend_visible_gpu_info now check result.get("available") before
returning the nvidia-smi result. When nvidia-smi is unavailable or
returns no data (e.g., containers without nvidia-smi, UUID/MIG masks),
the functions fall through to the torch-based fallback instead of
returning an empty result. This fixes a regression where the internal
exception handling in nvidia.py prevented the caller's except block
from triggering the fallback.
2. hardware.py: resolve_requested_gpu_ids now separates negative-ID
validation from physical upper-bound validation. The physical count
check is only enforced when it is plausibly a true physical count
(i.e., higher than the largest parent-visible ID), since
torch.cuda.device_count() under CUDA_VISIBLE_DEVICES returns the
visible count, not the physical total. The parent-visible-set check
remains authoritative in all cases. This prevents valid physical IDs
like [2, 3] from being rejected as "out of range" when nvidia-smi is
unavailable and CUDA_VISIBLE_DEVICES="2,3" makes torch report only
2 devices.
* Fix UUID/MIG torch fallback to enumerate devices by ordinal
When CUDA_VISIBLE_DEVICES uses UUID or MIG identifiers,
get_parent_visible_gpu_ids() returns [] because the tokens are
non-numeric. The torch fallback in get_visible_gpu_utilization() and
get_backend_visible_gpu_info() previously passed that empty list to
_torch_get_per_device_info(), getting nothing back.
Now both functions detect the empty-list case and fall back to
enumerating torch-visible ordinals (0..device_count-1) with
index_kind="relative". This means the UI and auto-selection still
see real device data in Kubernetes, MIG, and Slurm-style UUID
environments where nvidia-smi output cannot be mapped to physical
indices.
Updated test_uuid_parent_visibility to verify the new torch fallback
path returns available=True with relative ordinals.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add type hint for gpu_ids parameter in InferenceOrchestrator.load_model
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Fixes#4670
Separates the GGUF context slider ceiling from the currently active context length so lowering context via Chat Settings no longer locks the slider max to the reduced value.
- Backend: adds `max_context_length` to GGUF load/status responses, computed from the largest VRAM/KV-fit cap across all usable GPU subsets
- Frontend: stores `ggufMaxContextLength` and uses it for Context Length slider/input bounds; hydrates from both `/api/inference/load` and `/api/inference/status`
- Defaults UI ceiling to native context for CPU-only and fallback paths
- Seeds `effective_ctx` and `max_available_ctx` before GPU probing to prevent `UnboundLocalError` on probe failure
- Property fallback uses native `_context_length`, not effective `context_length`
* refactor(studio): unify setup terminal output style and add verbose setup mode
* studio(windows): align setup.ps1 banner/steps with setup.sh (ANSI, verbose)
* studio(setup): revert nvcc path reordering to match main
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* studio(setup): restore fail-fast llama.cpp setup flow
* studio(banner): use IPv6 loopback URL when binding :: or ::1
* Fix IPv6 URL bracketing, try_quiet stderr, _step label clamp
- Bracket IPv6 display_host in external_url to produce clickable URLs
- Redirect try_quiet failure log to stderr instead of stdout
- Clamp _step label to column width to prevent negative padding
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add sandbox integration tests for PR #4494 UX fixes
Simulation harness (tests/simulate_pr4494.py) creates an isolated uv
venv, copies the real source files into it, and runs subprocess tests
for all three fixes with visual before/after demos and edge cases.
Standalone bash test (tests/test_try_quiet.sh) validates try_quiet
stderr redirect across 8 scenarios including broken-version contrast.
39 integration tests total (14 IPv6 + 15 try_quiet + 10 _step), all
existing 75 unit tests still pass.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Truncate step() labels in setup.sh to match PS1 and Python
The %-15s printf format pads short labels but does not truncate long
ones. Change to %-15.15s so labels wider than 15 chars are clipped,
matching the PowerShell .Substring(0,15) and Python label[:15] logic.
* Remove sandbox integration tests from PR
These test files are not part of the styling fix and should not
ship with this PR.
* Show error output on failure instead of suppressing it
- install_python_stack.py: restore _red for patch_package_file
warnings (was downgraded to _dim)
- setup.ps1: capture winget output and show on failure for CUDA,
Node, Python, and OpenSSL installs (was piped to Out-Null)
- setup.ps1: always show git pull failure warning, not just in
verbose mode
* Show winget error output for Git and CMake installs on failure
Same capture-and-print-on-failure pattern already used for
Node, Python, CUDA, and OpenSSL winget installs.
* fix: preserve stderr for _run_quiet error messages in setup.sh
The step() helper writes to stdout, but _run_quiet's error header
was originally sent to stderr (>&2). Without the redirect, callers
that separate stdout/stderr would miss the failure headline while
still seeing the log body on stderr. Add >&2 to both step calls
inside _run_quiet to match main's behavior.
* feat: add --verbose flag to setup and update commands
Wire UNSLOTH_VERBOSE=1 through _run_setup_script() so that
'unsloth studio update --verbose' (and the deprecated 'setup')
passes the flag to setup.sh / setup.ps1 / install_python_stack.py.
* fix(studio): honor verbose logging and keep llama.cpp failures non-blocking
* fix(studio): switch installer to 'studio update' and normalize Windows setup logs
* chore(studio): refine localhost tip and remove skip-base setup nois
* fix(studio): align Windows setup logs with Linux style and improve startup tips
* fix(studio): align Windows setup logs with Linux style
* refactor(windows-installer): align install/setup logs with Linux style and silence auto-launch output
* refactor(windows): align installer/setup output with Linux style and reduce default verbosity
* refactor(windows): match install.ps1 output style/colors to setup and quiet default logs
* fix(studio-banner): update personal-computer localhost tip
* fix(setup.sh): restore verbose llama.cpp build output while keeping default quiet mode
* fix(install.sh): align installer logging with setup style and restore POSIX-safe color output
* fix(install.sh): preserve installer reliability and launch visibility
Export verbose mode for child setup processes, harden install command handling under set -e, and keep first-run studio launch non-silent so users can always see URL and port fallback output.
* fix(windows installer): keep exit semantics and degrade status accurate
Use quiet command redirection that preserves native exit codes, keep startup output visible on first launch, and report limited install status when llama.cpp is unavailable.
* fix(setup.sh): improve log clarity and enforce GGUF degraded signaling
Restore clean default setup output, add verbose-only diagnostics, fail fast on Colab dependency install errors, and return non-zero when GGUF prerequisites or llama.cpp artifacts are unavailable.
* fix(installer): harden bash preflight and PowerShell GPU checks
Fail fast when bash is unavailable before invoking setup.sh, and replace remaining nvidia-smi pipeline checks with stream redirection patterns that preserve reliable native exit-code handling.
* fix(windows): keep verbose output visible while preserving exit codes
Ensure PowerShell wrapper helpers in install/update stream native command output to host without returning it as function output, so npm logs no longer corrupt exit-code checks in verbose mode.
* fix(windows): avoid sticky UNSLOTH_VERBOSE and gate studio update verbosity
* Fix degraded llama.cpp exit code, PS verbose stderr, banner URLs, npm verbose
- setup.sh: Do not exit non-zero when llama.cpp is unavailable; the footer
already reports the limitation, and install.sh runs under set -e so a
non-zero exit aborts the entire install including PATH/shortcuts/launch.
- setup.ps1: Remove $? check in Invoke-SetupCommand verbose path; PS 5.1
sets $? = $false when native commands write to stderr even with exit 0.
Merge stderr into stdout with 2>&1 and rely solely on $LASTEXITCODE.
- startup_banner.py: Show the actual bound address when Studio is bound to
a non-loopback interface instead of always showing 127.0.0.1/localhost.
- setup.sh: Use run_quiet_no_exit instead of run_quiet_no_exit_always for
npm install steps so --verbose correctly surfaces npm output.
* Fix install.ps1 verbose stderr, propagate UNSLOTH_VERBOSE, fix git clone verbose
- install.ps1: Apply same Invoke-InstallCommand fix as setup.ps1 -- merge
stderr into stdout with 2>&1 and drop the $? check that misclassifies
successful native commands on PS 5.1.
- install.ps1 + setup.ps1: Export UNSLOTH_VERBOSE=1 to the process env
when --verbose is passed so child processes like install_python_stack.py
also run in verbose mode.
- setup.sh: Use run_quiet_no_exit for git clone llama.cpp so --verbose
correctly surfaces clone diagnostics during source-build fallback.
* Surface prebuilt llama.cpp output in verbose mode, remove dead code, fix banner
- setup.sh: Use tee in verbose mode for prebuilt llama.cpp installer so
users can see download/validation progress while still capturing the log
for structured error reporting on failure.
- setup.ps1: Same fix for Windows -- use Tee-Object in verbose mode.
- setup.sh: Remove run_quiet_no_exit_always() which has no remaining callers.
- startup_banner.py: Avoid printing the same URL twice when Studio is
bound to a specific non-loopback address that matches the display host.
* Fix run_install_cmd exit code after failed if-statement
The previous pattern 'if "$@"; then return 0; fi; _rc=$?' always captured
$? = 0 because $? reflects the if-statement result, not the command's exit
code. Switch to '"$@" && return 0; _rc=$?' which preserves the actual
command exit code on failure. Applies to both verbose and quiet branches.
* Fix _run_quiet exit code, double uv install, missing --local flag
- setup.sh: Fix _run_quiet verbose path that always captured exit code 0
due to $? resetting after if-then-fi with no else. Switch to the same
'"$@" && return 0; exit_code=$?' pattern used in install.sh.
- setup.sh: Consolidate the two uv install branches (verbose + quiet)
into a single attempt with conditional output. Previously, when verbose
mode was on and the install failed, a second silent attempt was made.
- install.ps1: Pass --local flag to 'unsloth studio update' when
$StudioLocalInstall is true. Without this, studio.py's update() command
overwrites STUDIO_LOCAL_INSTALL to "0", which could cause issues if
setup.ps1 or install_python_stack.py later checks that variable.
* Revert SKIP_STUDIO_BASE change for --no-torch, restore install banners
- Revert SKIP_STUDIO_BASE from 0 to 1 for --no-torch. install.sh already
installs unsloth+unsloth-zoo and no-torch-runtime.txt before calling
setup.sh, so letting install_python_stack.py redo it was redundant and
slowed down --no-torch installs for no benefit.
- Restore the "Unsloth Studio installed!" success banner and "starting
Unsloth Studio..." launch message so users get clear install completion
feedback before the server starts.
* Make llama.cpp build failure a hard error with proper cleanup
- setup.sh: Restore exit 1 when _LLAMA_CPP_DEGRADED is true. GGUF
inference requires a working llama.cpp build, so this should be a
hard failure, not a silent degradation.
- install.sh: Catch setup.sh's non-zero exit with '|| _SETUP_EXIT=$?'
instead of letting set -e abort immediately. This ensures PATH setup,
symlinks, and shortcuts still get created so the user can fix the
build deps and retry with 'unsloth studio update'. After post-install
steps, propagate the failure with a clear error message.
* Revert install.ps1 to 'studio setup' to preserve SKIP_STUDIO_BASE
'studio update' pops SKIP_STUDIO_BASE from the environment, which
defeats the fast-path version check added in PR #4667. When called
from install.ps1 (which already installed packages), SKIP_STUDIO_BASE=1
must survive into setup.ps1 so it skips the redundant PyPI check and
package reinstallation. 'studio setup' does not modify env vars.
* Remove deprecation message from 'studio setup' command
install.ps1 uses 'studio setup' (not 'studio update') to preserve
SKIP_STUDIO_BASE. The deprecation message was confusing during first
install since the user never typed the command.
* Fix stale env vars, scope degraded exit, generic error message for PR #4651
- install.ps1: Always set STUDIO_LOCAL_INSTALL and clear STUDIO_LOCAL_REPO
when not using --local, to prevent stale values from a previous --local
run in the same PowerShell session. Fix log messages to say 'setup' not
'update' since we call 'studio setup'.
- setup.sh: Only exit non-zero for degraded llama.cpp when called from the
installer (SKIP_STUDIO_BASE=1). Direct 'unsloth studio update' keeps
degraded installs successful since Studio is still usable for non-GGUF
workflows and the footer already reports the limitation.
- install.sh: Make the setup failure error message generic instead of
GGUF-specific, so unrelated failures (npm, Python deps) do not show
misleading cmake/git recovery advice.
* Show captured output on failure in quiet mode for PR #4651
Both Invoke-InstallCommand (install.ps1) and Invoke-SetupCommand
(setup.ps1) now capture command output in quiet mode and display it
in red when the command fails. This matches the behavior of
run_install_cmd in install.sh where failure output is surfaced even
in quiet mode, making cross-platform error debugging consistent.
* Match degraded llama.cpp exit on Windows, fix --local recovery hint for PR #4651
- setup.ps1: Exit non-zero for degraded llama.cpp when called from
install.ps1 (SKIP_STUDIO_BASE=1), matching setup.sh behavior. Direct
'unsloth studio update' keeps degraded installs successful.
- install.sh: Show 'unsloth studio update --local' in the recovery
message when the install was run with --local, so users retry with
the correct flag instead of losing local checkout context.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix: add PyPI version check to setup.ps1 for fast update path
Port the update-flow logic from setup.sh to setup.ps1 so that
`unsloth studio update` on Windows skips Python dependency reinstall
when the installed version already matches PyPI latest.
* fix: clear SKIP_STUDIO_BASE in update command
install.ps1 sets SKIP_STUDIO_BASE=1 which persists in the PowerShell
session. If the user runs `unsloth studio update` in the same terminal,
the env var causes the version check to be skipped. Clear it explicitly
in the update command.
* fix: harden version check and clear stale env vars in update flow
- Normalize $InstalledVer with Out-String + Trim() to avoid array/whitespace
comparison issues in PowerShell 5.1 (python output can be captured as
string[] instead of scalar string)
- Move Fast-Install --upgrade pip inside if (-not $SkipPythonDeps) so the
fast path avoids unnecessary network round-trips
- Clear STUDIO_LOCAL_REPO when --local is not passed to prevent a previous
--local session from leaking into a plain update
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Fix blank page on Windows due to broken .js MIME type in registry
* Update studio/backend/main.py
adding defensive suggestion by gemini where we make the mimetypes specific to windows platforms
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* feat(studio): add HF/local model selection UI for GGUF export
* fix(studio):fix selector ring clipping
* fix(studio): export page trust_remote_code control and label styling
* fix(studio): accept hf_token in load_checkpoint orchestrator method
The route was passing hf_token to load_checkpoint() but the method
didn't accept it, causing a TypeError on every /api/export/load-checkpoint
request.
* fix(studio): clear HF model selection when input is edited
Previously selectedSourceModel was only cleared when the input became
empty, so editing to a different repo ID after selecting a model would
silently keep the old selection.
---------
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
normalize_path() unconditionally converted Windows paths like
C:\Users\... to WSL format /mnt/c/Users/..., which breaks path
resolution on native Windows. This caused LM Studio GGUF models
to fail detection (detect_gguf_model returned None for the invalid
path), falling through to the Unsloth import path which requires
a GPU.
Now only performs the /mnt/ mapping when actually running under WSL.
On native Windows, drive letters are preserved and backslashes are
normalized to forward slashes.
* fix: default HF cache to standard platform path instead of legacy Unsloth cache
* feat: show LM Studio and local models in chat Fine-tuned tab
* feat: show LM Studio models in Hub models tab
* fix: fetch local models after auth refresh completes
* Revert "fix: fetch local models after auth refresh completes"
This reverts commit cfd61f0ac7.
* fix: increase llama-server health check timeout to 600s for large models
* feat: expandable GGUF variant picker for LM Studio local models
* fix: show GGUF variant label for locally loaded LM Studio models
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: show publisher name in LM Studio model labels
* fix: set model_id for loose GGUF files in LM Studio publisher dirs
* fix: show publisher prefix in Fine-tuned tab LM Studio models
* fix: only use model_id for lmstudio source models
* fix: only show LM Studio models in Hub tab on Mac/chat-only mode
* fix: respect XDG_CACHE_HOME, handle Windows paths in isLocalPath, refresh LM Studio on remount
- _setup_cache_env now reads XDG_CACHE_HOME (falls back to ~/.cache)
instead of hard-coding ~/.cache/huggingface. This follows the standard
HF cache resolution chain and respects distro/container overrides.
- isLocalPath in GgufVariantExpander uses a regex that covers Windows
drive letters (C:\, D:/), UNC paths (\\server\share), relative paths
(./, ../), and tilde (~/) -- not just startsWith("/").
- HubModelPicker.useEffect now calls listLocalModels() before the
alreadyCached early-return gate so LM Studio models are always
refreshed on remount. Also seeds useState from _lmStudioCache for
instant display on re-open.
* fix: add comment explaining isLocalPath regex for Windows/cross-platform paths
* fix: prioritize unsloth publisher in LM Studio model list
* fix: scope unsloth-first sort to LM Studio models on all platforms
* fix: add missing _lmStudioCache module-level declaration
* fix: prioritize unsloth publisher before timestamp sort in LM Studio group
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Some models like unsloth/Qwen3-0.6B have no safetensors metadata
on Hugging Face, so the training model selector showed no parameter
size badge. The chat model picker already had extractParamLabel()
as a fallback that parses sizes like "0.6B" from the model name.
Add the same fallback to the training model selector and the
onboarding model selection step.
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* Detect always-on reasoning models and show Think button as locked-on
Models with hardcoded <think>/<think> tags or reasoning_content in
their chat template (e.g. distilled reasoning models) always produce
thinking output regardless of any toggle. Previously these models
were not detected as reasoning-capable at all, so the Think button
was grayed out even though the model was actively reasoning.
Backend:
- Detect <think>/<think> and reasoning_content in GGUF chat templates
as a fallback when enable_thinking is not present
- Add reasoning_always_on flag to LoadResponse and InferenceStatusResponse
- Pass the flag through all GGUF load and status response paths
Frontend:
- Add reasoningAlwaysOn to the chat runtime store and API types
- When reasoning_always_on is true, show the Think button as lit
(active) but not clickable, with a tooltip explaining the model
always uses thinking
- Force reasoningEnabled=true when the model always reasons
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use pointer-events-none instead of disabled for always-on Think button
The HTML disabled attribute was not fully blocking clicks on the Think
button for always-on reasoning models. Switch to pointer-events-none
CSS class which prevents all mouse interaction at the CSS level.
* Use a static span instead of disabled button for always-on Think
Replace the button element with a plain span when reasoning is
always on. This makes it physically impossible to toggle since
there is no clickable element at all, avoiding any CSS or
disabled-attribute edge cases.
* Simplify always-on Think button to stay lit and remain toggleable
Keep the Think button as a normal toggleable button but ensure it
shows as lit when reasoning_always_on is true. The model always
reasons regardless of the toggle state so there is no need to
block interaction.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Use --no-deps for ALL packages (unsloth, unsloth-zoo, and runtime deps)
since the current PyPI metadata for unsloth still declares torch as a
hard dependency. Runtime deps (typer, pydantic, safetensors,
transformers, etc.) are installed from no-torch-runtime.txt with
--no-deps to prevent transitive torch resolution from accelerate, peft,
trl, and sentence-transformers.
no-torch-runtime.txt now includes unsloth's own direct deps (typer,
pydantic, pyyaml, nest-asyncio) since --no-deps skips those too.
install.sh installs no-torch-runtime.txt directly (via helper function
_find_no_torch_runtime). install.ps1 does the same via
Find-NoTorchRuntimeFile. SKIP_STUDIO_BASE stays at 1 to avoid setup.sh
fast-path issues.
install_python_stack.py NO_TORCH branch does the same for unsloth
studio update, using package_name instead of hardcoded "unsloth".
* Fix inference failing for transformers 5.x models (trust_remote_code)
The training worker in core/training/worker.py auto-enables
trust_remote_code for unsloth/* models that need transformers 5.x
(e.g. NVIDIA-Nemotron-3-Nano-4B). The inference worker did not have
the same logic, so loading these models for chat would fail with
"No config file found" while training worked fine.
Add the same auto-detection to the inference worker so
trust_remote_code is set automatically when needed.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Studio shutdown button
* fix: add auth to shutdown endpoint and improve UX
- Add JWT auth (Depends(get_current_subject)) to POST /api/shutdown
- Use authFetch instead of bare fetch in shutdown dialog
- Only show beforeunload prompt when training is running
- Remove Ctrl+W/Cmd+W interception (browsers don't allow it)
- Store shutdown task on app.state to prevent GC
---------
Co-authored-by: Datta Nimmaturi <venkatadattasainimmaturi@gmail.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix: only kill studio-managed llama-server processes, not user's own servers
_kill_orphaned_servers() checked for "unsloth" anywhere in the process
cmdline, which matched the user's own llama-server when serving models
from unsloth/ HF repos (the model path in -m contains "unsloth"). This
caused the user's server to get SIGKILLed on Studio startup, destroying
their prompt cache and forcing full model re-loads.
Narrow the check to only match processes whose binary path lives under
~/.unsloth/llama.cpp/ (the Studio install directory).
* Address review: cover env var paths, move Path.home() inside try block
- Also check LLAMA_SERVER_PATH and UNSLOTH_LLAMA_CPP_PATH so orphans
from custom install locations are still cleaned up.
- Move studio_dirs construction inside the try/except so a Path.home()
failure (containers without HOME) does not crash the constructor.
* Address reviewer feedback: proper path ancestry, /proc/pid/exe, legacy paths
Changes based on 10-reviewer consensus:
- Use Path.is_relative_to() instead of substring matching to prevent
false positives on sibling paths like ~/.unsloth/llama.cpp-backup/.
- Use /proc/<pid>/exe (symlink to real binary) instead of parsing the
first cmdline token, which breaks on paths with spaces. Falls back
to cmdline parsing on non-Linux or when /proc is unavailable.
- Add legacy in-tree install paths (project_root/llama.cpp/ and
project_root/bin/) so orphans from older setup.sh are still cleaned.
- Treat LLAMA_SERVER_PATH as an exact binary match rather than widening
it to its parent directory, which could match unrelated servers in
shared locations like /usr/local/bin/.
- Keep everything inside the try/except so Path.home() failures in
containers do not crash the constructor.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address review: add Linux platform guard and log cleanup errors
- Guard pgrep fallback with sys.platform check so it does not crash
on Windows/macOS when psutil is unavailable.
- Replace silent except-pass with logger.warning for observability.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
The [huggingfacenotorch] extras only exist in pyproject.toml but are
NOT published on PyPI, so uv pip install "unsloth[huggingfacenotorch]"
fails on fresh installs from the registry.
Fix: add studio/backend/requirements/no-torch-runtime.txt with the
runtime deps (safetensors, transformers, datasets, accelerate, etc.)
that mirror [huggingfacenotorch] from pyproject.toml. In no-torch mode:
1. install.sh/ps1 install unsloth + unsloth-zoo with --no-deps
2. SKIP_STUDIO_BASE=0 so install_python_stack.py's NO_TORCH branch runs
3. install_python_stack.py installs no-torch-runtime.txt
* Guard against late tool_calls after visible content, filter incomplete fragments
1. If visible content was already emitted (_last_emitted is non-empty)
when delta.tool_calls arrives, ignore the tool_calls instead of
reclassifying the turn as a tool call. llama-server never
interleaves content and tool_calls (they are mutually exclusive),
but this guard is defensive for other OpenAI-compatible backends.
2. Filter out incomplete structured tool_calls fragments before
execution. Entries with empty function.name (from truncation by
max_tokens, disconnect, or interruption) are skipped instead of
being passed to execute_tool().
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix: account for KV cache in GGUF GPU fit check and auto-cap context length
The GPU fit check only compared GGUF file size against free VRAM,
ignoring KV cache memory. Models with large native context lengths
(e.g. Qwen3.5-9B at 262k) would pass the fit check since the GGUF
is only 5.6 GB, but the KV cache at 262k context needs ~40 GB at
f16. This caused llama-server to silently fall back to CPU inference.
Changes:
- Parse block_count, head_count_kv, head_count, and embedding_length
from GGUF metadata alongside context_length
- Add KV cache VRAM estimation based on architecture params and the
selected cache quantization type (f16, q8_0, q4_0, etc.)
- Auto-reduce context length to the maximum that fits in available
GPU VRAM when the native context would exceed it
- Include estimated KV cache size in the _select_gpus total so the
fit decision reflects actual runtime memory, not just file size
For the reported scenario (Qwen3.5-9B on RTX 3090 with 22415 MiB
free), context is auto-reduced from 262144 to ~63k with f16 KV cache,
keeping the model fully on GPU. With q4_0 KV cache quantization the
context can reach ~226k.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: resolve 6 bugs in KV cache VRAM estimation and add test harness
- Fix q8_0 BPE constant: 1.125 -> 34/32 (1.0625) to match llama.cpp block size
- Fix _fit_context_to_vram returning min_ctx when weights exceed budget
(should return requested_ctx unchanged, let --fit handle it)
- Fix binary search inflating below-2048 requests (lo=min_ctx=2048 > hi)
- Fix n_ctx=0 regressing to 4096 when metadata unavailable (preserve sentinel)
- Fix multi-GPU auto-cap using single-GPU budget instead of aggregate
- Fix _context_length being overwritten with capped effective value
Add tests/test_gguf_kv_vram.py: 43 cross-platform pytest tests covering
pure logic, integration (monkeypatched load_model), and real GGUF parsing.
Runs in an isolated uv venv with only pytest -- no GPU/torch/structlog needed.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: complete _effective_context_length lifecycle
- Initialize _effective_context_length in __init__ (prevents AttributeError)
- Reset _effective_context_length in unload_model (prevents stale values)
- Update context_length property to return effective (capped) value for
the UI/API, falling back to native _context_length if not set
* fix: multi-GPU selection tries smallest subset first
The previous approach summed all GPUs' memory to cap context, then
selected GPUs afterward. This was overly optimistic for heterogeneous
setups (e.g., 48 GiB + 4 GiB): the context was inflated by the tiny
GPU's contribution, then both GPUs were dragged in.
Now we try GPU subsets from smallest (1 GPU) to largest, capping
context for each. We pick the smallest subset where the model+KV
fits. This prefers single-GPU when possible (simpler, no tensor
split overhead) and avoids pulling in GPUs that barely help.
Add tests: test_multi_gpu_prefers_fewer_gpus,
test_multi_gpu_heterogeneous.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: prefer fewer GPUs over higher context in GPU selection
Multi-GPU inference is slower due to tensor-split overhead, so we
should prefer fewer GPUs with reduced context over more GPUs with
full context. Now the loop stops at the first GPU subset where the
model fits, rather than continuing to find subsets that allow higher
context. Only if the model can't fit on N GPUs do we try N+1.
This preserves the original behavior: use multi-GPU only when the
model doesn't fit on a single GPU.
* fix: make _kill_orphaned_servers cross-platform via psutil
Replace pgrep + os.kill(SIGKILL) with psutil.process_iter() and
proc.kill(), which work on Linux, macOS, and Windows. Build an
allowlist of install roots matching _find_llama_server_binary so
only studio-managed servers are killed.
* fix: skip KV estimation loop when effective context is unknown
When n_ctx=0 and GGUF metadata lacks context_length, effective_ctx
stays 0. _estimate_kv_cache_bytes(0) returns 0, so a GPU could be
selected with no KV headroom. Guard the loop with effective_ctx > 0
to fall back to file-size-only GPU selection in this case.
* chore: temporarily remove test harness (will add back separately)
* refactor: deduplicate UINT32/UINT64 handling in GGUF parser
Replace duplicated if/elif chains for vtype 4 and 10 with a single
block using setattr. No behavioral change.
* fix: honor explicit n_ctx by using multi-GPU before capping
When the user explicitly sets n_ctx, try to fit the full requested
context using _select_gpus (which adds GPUs as needed). Only cap
context if it doesn't fit on any GPU combination.
When n_ctx=0 (auto/native context), keep the existing behavior:
prefer fewer GPUs with reduced context, since multi-GPU is slower
and the user didn't ask for a specific context length.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: context_length property returns native value for frontend slider
The frontend uses context_length as the slider max. Returning the
capped effective value prevented users from requesting higher context
on reload (e.g., after switching to q4_0 KV cache). Revert to
returning the native GGUF metadata value -- the backend auto-caps
at load time regardless.
* revert: context_length returns effective (capped) value
The UI slider should show what the server is actually running at,
not the theoretical maximum. Revert to returning the effective
context length.
* fix: raise minimum context floor from 2048 to 4096
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix ~1.2s TTFT penalty when tools are enabled in Studio
When users enable web search, Python execution, or terminal tools,
every message gets a ~1.2s delay before any text appears -- even when
the model does not call any tool. This happens because
generate_chat_completion_with_tools() does a non-streaming detection
pass (stream: False) first, waits for the complete response, then
checks for tool calls. For the ~90% of messages that don't trigger a
tool call, this blocking wait is entirely wasted.
Root cause: the detection pass payload uses stream: False, forcing
llama-server to generate the entire response before returning any
tokens.
Fix: replace the non-streaming detection pass with a streaming pass
(stream: True) and a speculative buffer state machine that detects
tool signals in the first 1-2 SSE chunks:
- BUFFERING: accumulate content tokens, check first chars for tool
signal prefixes (<tool_call>, <function=)
- STREAMING: no tool detected, yield tokens to caller immediately
- DRAINING: tool signal found, silently accumulate rest of stream
Three detection paths:
1. Structured delta.tool_calls -- detected instantly, transition to
DRAINING, accumulate fragments, assemble at stream end.
2. XML tool markup in content -- buffer holds up to 32 chars checking
for <tool_call> or <function= prefix, then transitions to DRAINING.
3. No tool signal -- first non-whitespace, non-XML char triggers
immediate transition to STREAMING (fast path, ~90% of requests).
Safety net: after any stream ends in STREAMING state, check accumulated
content for XML tool signals. Handles rare "content before tool call"
edge case.
Additional supporting changes:
- Add headers parameter to _stream_with_retry for auth forwarding
- Share _strip_tool_markup and regex patterns between the detection
pass and the final streaming pass (removes duplication)
- Remove the iteration==0 non-streaming content shortcut (no longer
needed since all iterations stream directly)
- Keep the final streaming pass as fallback for max_tool_iterations
exhaustion
Benchmarked on Qwen3.5-4B Q4_K_XL:
- No tools: TTFT ~112ms (unchanged)
- Tools enabled, no call: TTFT ~112ms (was ~1207ms)
- Decode TPS: 226 (unchanged in all cases)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add unit tests for streaming tool detection state machine
16 tests covering every tool call parsing path:
- Plain text (no tool call) streaming
- Structured delta.tool_calls detection and fragment assembly
- XML <tool_call>JSON</tool_call> detection via buffer
- XML <function=name> tag detection via buffer
- Whitespace before tool XML
- Safety net (content then tool XML)
- Parallel multi-tool calls
- Reasoning token bypass (thinking models)
- Reasoning then tool call
- Empty response handling
- Buffer prefix timeout (HTML not mistaken for tool)
- Non-XML first char instant streaming
- False positive rejection (<tool_tip> vs <tool_call>)
- Arguments split across multiple chunks
- auto_heal_tool_calls=False respects the flag
- Metrics accumulation across tool iterations
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix reasoning-only BUFFERING, pre-tool content emission, and code duplication
Addresses review feedback on the streaming tool detection:
1. Reasoning tokens are no longer yielded during BUFFERING/DRAINING
states. The consumer in routes/inference.py tracks prev_text across
tool iterations without resetting it, so yielding reasoning during
a detection pass that resolves to a tool call would corrupt the
delta computation for subsequent iterations. Reasoning is now
silently accumulated during detection (matching the old non-streaming
behavior) and flushed together with content when the buffer resolves
to STREAMING.
2. Handle reasoning-only responses in the BUFFERING resolver. When a
thinking model emits only reasoning_content with no content tokens,
the stream ends while still in BUFFERING state. The resolver now
detects this case and yields reasoning as plain text (without
<think> wrapper), matching the final streaming pass behavior for
models like Qwen3 in always-think mode.
3. Replace duplicated re.sub calls for stripping tool markup with
the existing _strip_tool_markup(content_text, final=True) helper,
removing ~40 lines of redundant regex code.
4. Update tests: adjust reasoning test expectations to match the new
behavior (reasoning batched with content, not streamed individually
during BUFFERING). Add test_reasoning_only_no_content for the
reasoning-only edge case. 17/17 tests pass.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Address remaining reviewer findings: late tool_call IDs and XML speculation
1. Late-arriving tool_calls.id: when a provider sends the real ID on a
later delta chunk (after the initial one with index and function
name), the accumulator now updates the ID instead of keeping the
synthetic "call_{idx}" placeholder. (P2, 2/10 reviewers)
2. XML speculation respects auto_heal_tool_calls: when auto_heal is
explicitly disabled, _TOOL_XML_SIGNALS is empty so the BUFFERING
state never speculatively holds content for XML prefix detection.
Content starting with literal "<tool_call>" or "<function=" text
flows straight through without delay. (P2, 1/10 reviewers)
Skipped: finish_reason="tool_calls" without delta.tool_calls fallback
(P1, 1/10 reviewers). llama-server always sends delta.tool_calls
fragments in streaming mode. A non-streaming fallback for this edge
case would add complexity for a scenario that does not occur in
practice with the supported backend.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Check request.is_disconnected() every 20 tokens instead of every token
The disconnect check is an async round-trip that adds overhead on every
loop iteration. Since the cancel watcher in llama_cpp.py already
handles connection teardown (closes the streaming response on cancel),
this route-layer check is a secondary safety net that does not need to
run on every single token.
Check every 20 tokens across all 4 streaming paths:
- gguf_tool_stream (tool-enabled GGUF)
- gguf_stream_chunks (standard GGUF)
- audio_input_generate (audio/whisper input)
- generic backend stream (non-GGUF fallback)
* Fix safety net, DRAINING metadata, and test import path
1. Safety net no longer retroactively executes tools after visible
content was already emitted to the user. Once _last_emitted is
non-empty, the stream is committed to normal content mode.
Retroactive tool execution after visible output would violate the
streaming contract and corrupt the route-layer cumulative delta
tracker (prev_text). The tool XML is still stripped by
_strip_tool_markup so the user sees clean content.
2. DRAINING false-positive path now merges accumulated metrics from
prior tool iterations instead of dropping them. Uses the same
merge formula as the STREAMING path.
3. Test import path fixed to use repo root instead of hardcoded
sibling directory. Works in clean checkouts and CI.
4. Renamed test_content_then_tool_xml_safety_net to
test_content_then_tool_xml_no_retroactive_execution to reflect
the corrected behavior.
17/17 tests pass.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Redact --api-key value from llama-server startup log
When UNSLOTH_DIRECT_STREAM=1, the generated bearer token was logged
verbatim in the startup command. Replace the secret with <redacted>
before logging.
* Remove test file temporarily
* Revert disconnect throttle, reset prev_text on tool_start, restore XML safety net
Addresses all P1 findings from reviewer round 3 (10 reviewers):
1. Revert disconnect check to every iteration (was every 20th).
All 10 reviewers flagged this as a correctness regression for
short streams and sparse tool event loops. The cancel watcher in
llama_cpp.py is the primary mechanism but the route-layer check
must remain per-iteration for completeness. [10/10]
2. Reset prev_text on tool_start in gguf_tool_stream. When a tool
cycle begins after visible content was already streamed, the
route-layer cumulative delta tracker (prev_text) must be reset
so the post-tool synthesis response is not truncated or dropped.
[9/10]
3. Remove the _last_emitted gate from the XML safety net. The gate
was added to prevent retroactive tool execution after visible
content, but with prev_text now reset on tool_start (#2), the
root cause is fixed and the safety net can correctly handle
content-then-tool-XML responses (matching pre-PR behavior).
[8/10]
* Use None instead of {} for empty auth headers in TTS methods
* Include accumulated metrics in STREAMING metadata check
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* refactor(studio): unify setup terminal output style and add verbose setup mode
* studio(windows): align setup.ps1 banner/steps with setup.sh (ANSI, verbose)
* studio(setup): revert nvcc path reordering to match main
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* studio(setup): restore fail-fast llama.cpp setup flow
* studio(banner): use IPv6 loopback URL when binding :: or ::1
* Fix IPv6 URL bracketing, try_quiet stderr, _step label clamp
- Bracket IPv6 display_host in external_url to produce clickable URLs
- Redirect try_quiet failure log to stderr instead of stdout
- Clamp _step label to column width to prevent negative padding
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add sandbox integration tests for PR #4494 UX fixes
Simulation harness (tests/simulate_pr4494.py) creates an isolated uv
venv, copies the real source files into it, and runs subprocess tests
for all three fixes with visual before/after demos and edge cases.
Standalone bash test (tests/test_try_quiet.sh) validates try_quiet
stderr redirect across 8 scenarios including broken-version contrast.
39 integration tests total (14 IPv6 + 15 try_quiet + 10 _step), all
existing 75 unit tests still pass.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Truncate step() labels in setup.sh to match PS1 and Python
The %-15s printf format pads short labels but does not truncate long
ones. Change to %-15.15s so labels wider than 15 chars are clipped,
matching the PowerShell .Substring(0,15) and Python label[:15] logic.
* Remove sandbox integration tests from PR
These test files are not part of the styling fix and should not
ship with this PR.
* Show error output on failure instead of suppressing it
- install_python_stack.py: restore _red for patch_package_file
warnings (was downgraded to _dim)
- setup.ps1: capture winget output and show on failure for CUDA,
Node, Python, and OpenSSL installs (was piped to Out-Null)
- setup.ps1: always show git pull failure warning, not just in
verbose mode
* Show winget error output for Git and CMake installs on failure
Same capture-and-print-on-failure pattern already used for
Node, Python, CUDA, and OpenSSL winget installs.
* fix: preserve stderr for _run_quiet error messages in setup.sh
The step() helper writes to stdout, but _run_quiet's error header
was originally sent to stderr (>&2). Without the redirect, callers
that separate stdout/stderr would miss the failure headline while
still seeing the log body on stderr. Add >&2 to both step calls
inside _run_quiet to match main's behavior.
* feat: add --verbose flag to setup and update commands
Wire UNSLOTH_VERBOSE=1 through _run_setup_script() so that
'unsloth studio update --verbose' (and the deprecated 'setup')
passes the flag to setup.sh / setup.ps1 / install_python_stack.py.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
The previous --no-deps approach skipped ALL dependencies, not just
torch. This left safetensors, transformers, datasets, accelerate, etc.
missing, causing PackageNotFoundError at runtime.
Fix: in no-torch mode, install unsloth[huggingfacenotorch] (which pulls
all runtime deps except torch), then install unsloth-zoo with --no-deps
(since zoo's published metadata still declares torch as a hard dep).
This gives a working no-torch environment with all non-torch packages.
Applied to all three installer files: install.sh, install.ps1, and
studio/install_python_stack.py.
* fix: install.sh Mac Intel compatibility + Studio no-torch support (#4621)
On Intel Macs (x86_64), PyTorch has no wheels for torch >= 2.3, so the
installer crashes. Even when torch is absent, Studio crashes on startup
because two files have bare top-level torch imports.
Studio's GGUF inference (llama.cpp) does not need PyTorch. Training and
HF-inference already isolate torch to subprocesses. Only 2 files in the
server startup chain had top-level torch imports preventing startup.
Changes:
- install.sh: detect architecture, default to Python 3.12 on Intel Mac,
skip torch install, add Python 3.13.8 guard for arm64, pass
UNSLOTH_NO_TORCH env var to setup.sh
- data_collators.py: remove unused `import torch` (no torch.* refs)
- chat_templates.py: lazy-import IterableDataset into function bodies
- install_python_stack.py: add IS_MACOS/NO_TORCH constants, skip
torch-dependent packages, skip overrides.txt, skip triton on macOS
No existing working flow changes. Linux/WSL and macOS arm64 behavior is
identical.
* tests: add test suite for Mac Intel compat + no-torch mode
Shell tests (test_mac_intel_compat.sh):
- version_ge edge cases (9 tests)
- Architecture detection for Darwin x86_64/arm64, Linux x86_64/aarch64
- get_torch_index_url returns cpu on simulated Darwin
- UNSLOTH_NO_TORCH propagation to both setup.sh branches
Python unit tests (test_no_torch_filtering.py):
- _filter_requirements with NO_TORCH_SKIP_PACKAGES
- NO_TORCH env var parsing (true/1/TRUE/false/0/unset)
- IS_MACOS constant check
- Overrides skip and triton macOS skip guards
Python import tests (test_studio_import_no_torch.py):
- data_collators.py loads in isolated no-torch venv
- chat_templates.py has no top-level torch imports
- Negative control confirms import torch fails without torch
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* tests: add E2E sandbox tests for Mac Intel no-torch mode
Replace static/synthetic test stubs with real sandbox tests:
- Shell: E2E uv venv creation at Python 3.12, mock uv shim to verify
torch install is skipped when MAC_INTEL=true, dynamic env propagation
test for UNSLOTH_NO_TORCH in both local and non-local install paths
- Python filtering: test real extras.txt and extras-no-deps.txt with
NO_TORCH_SKIP_PACKAGES, subprocess mock of install_python_stack() for
5 platform configs (NO_TORCH+macOS, Windows+NO_TORCH, normal Linux,
Windows-only, macOS-only), VCS URL and env marker edge cases
- Python imports: parametrized Python 3.12+3.13 venv fixture, dataclass
instantiation for all 3 collator classes, chat_templates.py exec with
stubs, negative controls proving import torch and torchao install fail
in no-torch venvs
91 total tests, all passing.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: address reviewer findings for Intel Mac no-torch mode
P1 fixes:
- Auto-infer NO_TORCH in install_python_stack.py via platform.machine()
so `unsloth studio update` preserves GGUF-only mode without needing
the UNSLOTH_NO_TORCH env var (6/10 reviewers)
- Add openai-whisper and transformers-cfg to NO_TORCH_SKIP_PACKAGES
since both have unconditional torch dependencies (4/10 reviewers)
- Skip unsloth-zoo on Intel Mac --local installs (depends on torch)
in both migrated and fresh install paths (1/10)
- Recreate stale 3.13 venvs as 3.12 on Intel Mac re-runs (1/10)
- Detect Apple Silicon under Rosetta via sysctl hw.optional.arm64
and warn user to use native arm64 terminal (1/10)
P2 fixes:
- Wire new test files into tests/run_all.sh (4/10 reviewers)
- Add update-path tests (skip_base=False) for Intel Mac
- Add _infer_no_torch tests for platform auto-detection
P3 fixes:
- Fix macOS progress bar total (triton step skipped but was counted)
- Fix temp file leak when Windows + NO_TORCH filters stack
All tests pass: 30 shell, 66 Python (96 total).
* feat: add --python override flag to install.sh
Lets users force a specific Python version, e.g. ./install.sh --python 3.12.
Addresses M2 Mac users whose systems resolve to a problematic 3.13.x patch.
When --python is set, the Intel Mac stale-venv guard and 3.13.8 auto-downgrade
are skipped so the user's choice is respected.
* tests: add comprehensive E2E sandbox tests for no-torch mode
Add test_e2e_no_torch_sandbox.py with 7 test groups (43 tests total)
covering the full no-torch import chain, edge cases, and install logic:
- Group 1: BEFORE vs AFTER import chain comparison (proves the bug
existed and the fix works by synthetically prepending top-level torch
imports)
- Group 2: Dataclass instantiation without torch
- Group 3: Edge cases with broken/fake torch modules on sys.path
- Group 4: Hardware detection fallback to CPU without torch
- Group 5: install.sh flag parsing, version resolution, arch detection
- Group 6: install_python_stack.py NO_TORCH filtering
- Group 7: Live server startup without torch (marked @server, skipped
when studio venv is unavailable)
All 43 tests pass on both Python 3.12 and 3.13 isolated venvs.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* feat: add --no-torch flag to install.sh/ps1, fix lazy import bug in dataset formatting
- Fix chat_templates.py: narrow torch IterableDataset import into inner
try/except ImportError so dataset.map() works without torch installed
- Fix format_conversion.py: same lazy import fix for convert_chatml_to_alpaca
and convert_alpaca_to_chatml
- Add --no-torch flag to install.sh with unified SKIP_TORCH variable
(driven by --no-torch flag OR MAC_INTEL auto-detection)
- Add --no-torch flag to install.ps1 with $SkipTorch variable
- Print CPU hint when no GPU detected and --no-torch not set
- Replace MAC_INTEL guards with SKIP_TORCH in torch install sections
- Update shell tests (40 pass) and Python tests (90 pass)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix: address reviewer findings for --no-torch installer paths
- Fix migrated-env branch in install.sh and install.ps1: check
SKIP_TORCH first, then branch on STUDIO_LOCAL_INSTALL. Previously
SKIP_TORCH+non-local fell into else and installed unsloth-zoo (which
depends on torch), defeating --no-torch mode.
- Fix $env:UNSLOTH_NO_TORCH leak in install.ps1: always set to "true"
or "false" instead of only setting on the true branch. Prevents stale
no-torch state from leaking across runs in the same PS session.
- Fix install_python_stack.py update path: add NO_TORCH guard around
base.txt install so unsloth studio update does not reinstall
unsloth-zoo (which depends on torch) in no-torch mode.
* fix: install unsloth + unsloth-zoo with --no-deps in no-torch mode
Instead of skipping unsloth-zoo entirely (which breaks unsloth's
dependency on it), install both packages with --no-deps so they are
present but torch is not pulled in transitively. Applied consistently
across all no-torch paths: migrated-env, fresh-local, fresh-non-local
in install.sh, install.ps1, and install_python_stack.py.
* chore: temporarily remove test files (will be added in a follow-up)
* refactor: deduplicate SKIP_TORCH conditional branches in installers
Collapse if/else blocks that differ only by --no-deps into a single
branch with a conditional flag variable. Applied to migrated-env and
fresh-local paths in install.sh, install.ps1, and install_python_stack.py.
* fix: apply --no-deps to fresh non-local --no-torch install path
The non-local else branch was missing $_no_deps_arg/$noDepsArg, so
uv pip install unsloth would resolve torch from PyPI metadata (the
published unsloth package still declares torch as a hard dep). Now
--no-deps is applied consistently to all SKIP_TORCH code paths.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Inline querier identity changed every render, forcing useLiveQuery to
resubscribe continuously causing CPU spikes. Store querier in a ref and
only re-subscribe when explicit deps change.
The ChatCompletionRequest Pydantic model defaulted repetition_penalty
to 1.1 when clients omitted the field. This silently forced
llama-server to perform per-token repetition scanning, dropping
streaming throughput from ~225 TPS to ~172 TPS (a 24% penalty).
The Studio frontend always sends repetition_penalty=1.0 explicitly,
so UI users were unaffected. But any API client hitting
/v1/chat/completions without setting the field (curl, third-party
integrations, Open WebUI, etc.) would get the slow path.
Benchmarked on Qwen3.5-4B Q4_K_XL, GPU 0:
- repeat_penalty=1.0: 225.2 TPS
- repeat_penalty=1.1: 172.7 TPS (24% slower)
- LM Studio (which applies rp internally): 170.8 TPS
This aligns the Pydantic default with the frontend default (1.0),
generate_chat_completion's function signature default (1.0), and
llama-server's own default (1.0).
* Allow install_python_stack to run on Colab
The _COLAB_NO_VENV flag was setting _SKIP_PYTHON_DEPS=true, which
skipped both the PyPI version check (needs $VENV_DIR/bin/python) and
install_python_stack (uses sys.executable, works without a venv).
Introduce a separate _SKIP_VERSION_CHECK flag for the version check,
so install_python_stack still runs on Colab. The _SKIP_PYTHON_DEPS
flag remains available for the "versions match" fast path.
* Remove colab.py workarounds that broke transformers/hf-hub compatibility
PR #4601 added _pip_install_backend_deps(), _bootstrap_studio_venv(),
and _is_colab() to colab.py as workarounds for install_python_stack
being skipped on Colab. These workarounds:
- Stripped version constraints from studio.txt and installed into system Python
- Upgraded huggingface-hub to >=1.0, breaking Colab's pre-installed
transformers which requires huggingface-hub<1.0
With install_python_stack now running on Colab (previous commit), these
workarounds are unnecessary — all deps are properly installed by setup.sh.
Restore colab.py to its original PR #4237 structure: just get_colab_url(),
show_link(), and start().
* Remove --local flag from setup.sh in Colab notebook
The --local flag is not needed for the standard Colab flow since
install_python_stack now runs on Colab and installs deps from PyPI.
* studio: humanize ETA display for long training runs
When training takes hours or days, the ETA displayed raw minutes
(e.g. '560m 50s'). This changes the format to:
- Under 1 hour: Xm Ys (unchanged)
- 1-24 hours: Xh Ym Zs
- Over 24 hours: Xd Xh Xm
* Fix formatDuration edge cases and consolidate duplicate for PR #4608
- Guard NaN/Infinity inputs with Number.isFinite() (matches formatNumber in same file)
- Add sub-minute branch so 30s displays as "30s" instead of "0m 30s"
- Accept undefined in type signature to match formatNumber pattern
- Remove duplicate formatDuration from history-card-grid.tsx and import the shared one
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix: avoid _yaml.pyd lock on Windows during dependency overrides
* fix: move pytorch_tokenizers and kernels to no-deps install to avoid Windows _yaml.pyd loc
* fix(studio): align config cards, dynamic height for expanders, LoRA collapsible
* Fix clipping regressions in training, dataset, and params section cards
- training-section: Add hasMessage conditional so the card expands
(min-h) when startError, vision/audio incompatibility, or config
validation messages are present instead of always using fixed height
- dataset-section: Expand card when a local dataset is selected via
upload (datasetSource === "upload" && selectedLocalDataset), not only
when the Advanced panel is open
- params-section: Guard loraOpen behind isLora so switching to full
fine-tune collapses the card instead of staying expanded from stale
React useState
* Fix dataset card clipping for direct file uploads
Use uploadedFile instead of selectedLocalDataset in the card height
condition. selectedLocalDataset is derived from localDatasets.find()
which only resolves for Data Recipe entries, not direct file uploads
(.jsonl, .csv, .parquet, .arrow). The card already renders the Eval
Dataset panel based on uploadedFile (line 750), so the height gate
should match.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Recommended models matching the query were filtered from HF results but the Recommended section was hidden during search, causing them to vanish entirely.
- Show filtered recommended models during search by introducing `filteredRecommendedIds`
- Switch `recommendedSet` to use filtered IDs when searching so dedup against HF results is correct
- Hide empty "Hugging Face" label when recommended matches cover the query
- Add `normalizeForSearch` helper to strip separators (spaces, hyphens, underscores, dots) so queries like "llama 3" match "Llama-3.2-1B" and "qwen 2.5" matches "Qwen2.5-7B" in both the recommended model filter and the LoRA adapter filter
* Fix Colab setup skipping llama.cpp installation
The early exit 0 in the Colab no-venv path prevented setup.sh from
ever reaching the llama.cpp install section. Remove the early exit
and instead guard only the venv-dependent Python deps section, so
execution continues through to the llama.cpp prebuilt/source install.
* Simplify _SKIP_PYTHON_DEPS initialization
* Add --local flag to setup.sh in Colab notebook
* Fix Colab huggingface-hub conflict, ensurepip fallback, bump to 2026.3.14
- colab.py / setup.sh: relax == pins to >= when installing studio.txt
on Colab so huggingface-hub does not clobber Colab's bundled version
(breaks transformers is_offline_mode import)
- install_python_stack.py: when uv is unavailable and pip is missing
(uv-created venvs), bootstrap via ensurepip before attempting upgrade
- Bump version to 2026.3.14
- Bump installer min version pins to 2026.3.14
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix Colab Studio launch and setup.ps1 box alignment
- colab.py: when the Studio venv is missing on Colab, pip-install
backend dependencies (structlog, fastapi, etc.) from studio.txt
into the current Python instead of failing with ModuleNotFoundError
- setup.sh: on Colab without a venv, install backend deps into system
Python and skip venv-dependent sections (Python stack update,
llama.cpp build) that would otherwise fail
- setup.ps1: use PadRight(47) for the done-line so "Setup Complete!"
and "Update Complete!" both align with the box border
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* feat(studio): editable context length with Apply/Reset for GGUF model settings
Previously the Context Length field was read-only and the backend
hardcoded `-c 0`, ignoring custom values entirely. KV Cache Dtype also
triggered an immediate model reload with no way to cancel.
Backend:
- llama_cpp.py: pass the actual n_ctx value to `-c` instead of always 0
- models/inference.py: relax max_seq_length to 0..1048576 (0 = model
default) so GGUF models with large context windows are supported
Frontend:
- chat-runtime-store: add customContextLength and loadedKvCacheDtype
state fields for dirty tracking
- chat-settings-sheet: make Context Length an editable number input,
stop KV Cache Dtype from auto-reloading, show Apply/Reset buttons
when either setting has been changed
- use-chat-model-runtime: send customContextLength as max_seq_length
in the load request, reset after successful load
* fix: preserve maxSeqLength for non-GGUF models in load request
customContextLength ?? 0 sent max_seq_length=0 for non-GGUF models,
breaking the finetuning/inference path that needs the slider value.
Now uses a three-way branch:
- customContextLength set: use it (user edited GGUF context)
- GGUF without custom: 0 (model's native context)
- Non-GGUF: maxSeqLength from the sampling slider
* fix: keep max_seq_length default at 4096 for non-GGUF callers
Only relax the bounds (ge=0 for GGUF's "model default" mode,
le=1048576 for large context windows). The default stays at 4096
so API callers that omit max_seq_length still get a sane value
for non-GGUF models.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix(studio): rename trust remote code toggle and hide when no model selected
- Rename "Trust remote code" to "Enable custom code"
- Shorten subtitle to "Only enable if sure"
- Hide the toggle when no model is loaded (already hidden for GGUFs)
* fix: restore ge=128 for max_seq_length validation
Keep the minimum at 128 so the API rejects nonsensical values.
GGUF path now sends the model's native context length (from
ggufContextLength) instead of 0 when the user has not customized it.
The upper bound stays at 1048576 for large-context GGUF models.
* feat(studio): replace Context Length input with slider
Use a ParamSlider (512 to model's native context, step 512) instead
of a small number input. Shows "Max" when at the model's native
context length. Consistent with the other slider controls in the
settings panel.
* feat(studio): add editable number input alongside Context Length slider
The slider and number input stay synced -- dragging the slider updates
the number, typing a number moves the slider. The input also accepts
values beyond the slider range for power users who need custom context
lengths larger than the model default.
* fix(studio): widen context length input and use 1024 step for slider
Make the number input wider (100px) so large values like 262144 are
fully visible. Change slider step from 512 to 1024 and min from 512
to 1024.
* fix(studio): context length number input increments by 1024
* fix(studio): cap context length input at model's native max
Adds max attribute and clamps typed/incremented values so the context
length cannot exceed the GGUF model's reported context window.
* fix(studio): point "What's new" link to changelog page
Changed from /blog to /docs/new/changelog.
* fix(studio): preserve custom context length after Apply, remove stale subtitle
- After a reload with a custom context length, keep the user's value
in the UI instead of snapping back to the model's native max.
ggufContextLength always reports the model's native metadata value
regardless of what -c was passed, so we need to preserve
customContextLength when it differs from native.
- Remove "Reload to apply." from KV Cache Dtype subtitle since the
Apply/Reset buttons now handle this.
* feat(studio): auto-enable Search and Code tools when model supports them
Previously toolsEnabled and codeToolsEnabled stayed false after loading
a model even if it reported supports_tools=true. Now both toggles are
automatically enabled when the loaded model supports tool calling,
matching the existing behavior for reasoning.
* fix(studio): auto-enable tools in autoLoadSmallestModel path
The suggestion cards trigger autoLoadSmallestModel which bypasses
selectModel entirely. It was hardcoding toolsEnabled: false and
codeToolsEnabled: false even when the model supports tool calling.
Now both are set from the load response, matching the selectModel
behavior. Also sets kvCacheDtype/loadedKvCacheDtype for dirty
tracking consistency.
* fix(studio): re-read tool flags after auto-loading model
The runtime state was captured once at the start of the chat adapter's
run(), before autoLoadSmallestModel() executes. After auto-load enables
tools in the store, the request was still built with the stale snapshot
that had toolsEnabled=false. Now re-reads the store after auto-load so
the first message includes tools.
* fix(studio): re-read entire runtime state after auto-load, not just tools
The runtime snapshot (including params.checkpoint, model id, and all
tool/reasoning flags) was captured once before auto-load. After
autoLoadSmallestModel sets the checkpoint and enables tools, the
request was still built with stale params (empty checkpoint, tools
disabled). Now re-reads the full store state after auto-load so the
first message has the correct model, tools, and reasoning flags.
* feat(studio): add Hugging Face token field in Preferences
Adds a password input under Configuration > Preferences for users to
enter their HF token. The token is persisted in localStorage and
passed to all model validate/load/download calls, replacing the
previously hardcoded null. This enables downloading gated and private
models.
* fix(studio): use model native context for GGUF auto-load, show friendly errors
The auto-load paths and selectModel for GGUF were sending
max_seq_length=4096 which now actually limits the context window
(since we fixed the backend to respect n_ctx). Changed to send 0
for GGUF, which means "use model's native context size".
Also replaced generic "An internal error occurred" messages with
user-friendly descriptions for known errors like context size
exceeded and lost connections.
LoadRequest validation changed to ge=0 to allow the GGUF "model
default" signal. The frontend slider still enforces min=128 for
non-GGUF models.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* fix(studio): filter out FP8 models from model search results
Hide models matching *-FP8-* or *FP8-Dynamic* from both the
recommended list and HF search results. These models are not
yet supported in the inference UI.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>