* fix(studio): prevent small models from stalling on tool-calling tasks
Small GGUF models (< 9B params) in "Think, Search, Code" mode would
often describe what they planned to do ("Let me create this dashboard")
and then stop generating without ever calling a tool.
Three changes:
1. Simplify web_tips for small models: remove the "fetch its full content
by calling web_search with the url parameter" guidance for models < 9B.
This multi-step instruction causes small models to plan elaborate
search-then-fetch-then-code sequences they cannot reliably execute.
2. Add "always call tools directly" imperative to the system prompt nudge
so models act immediately instead of narrating their intentions.
3. Add plan-without-action re-prompt in the agentic loop: when the model
emits planning text (matching patterns like "let me", "I'll", etc.)
without calling any tool, inject a nudge asking it to call the tool
and continue the loop. Capped at 2 re-prompts per request.
Benchmarked with Qwen3.5-4B-GGUF (N=5 trials per variant):
- Baseline: 40% of requests had any tool call
- Combined fix: 100% of requests had at least one tool call
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix shell injection in GGML conversion paths
* Remove test file from security fix PR
Move test_save_shell_injection.py to a separate PR to keep this PR focused on the security fix itself.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Distinguish between actual network downloads and GPU memory loading for cached LoRA adapters in Studio chat.
- Add isCachedLora detection for local LoRA adapter paths using comprehensive cross-platform regex (Unix, Windows, UNC, WSL, tilde)
- Thread isCachedLora through loadInfo to chat-page inline status for proper 3-way distinction (cached / local LoRA / downloading)
- Skip download progress polling for cached LoRA models (no useless /download-progress API calls)
- Fix initial toast state to use isCachedLoad consistently instead of only checking isDownloaded
- Fix cancelLoading toast to not mention background downloads for cached/local loads
- Keep download-specific text ("Downloading model..." / "Download complete") inside the download-only polling block
- Add min-w-0 guards to thread/message/markdown containers to prevent
content overflow past the composer width
- Unify chat typography from Hellix/Space Grotesk to the sans stack,
keeping monospace for code blocks and inline code
- Restructure desktop navbar right-side controls with shrink-0 wrappers
for consistent spacing across HoverCard roots
- Soften tool-call label styling (font-medium + text-foreground/85
instead of bold)
- Add responsive code block sizing via @container queries
- Add horizontal scrolling for wide code blocks within the thread column
- Scope list-item code block alignment CSS to .aui-thread-root
- Preserve useScrollLock in tool-fallback and tool-group collapsibles
- Fall back to bg-background on ViewportFooter when hideComposer is true
- Widen inline code monospace selector to cover th, blockquote, and
heading elements
- Remove unused @fontsource-variable/space-grotesk import
* Fix unbound-variable error in script
* Remove stale test script, add llama.cpp Metal source builds, update tests
* Fix Metal precedence, test sync, and add behavioral tests
- Move macOS arm64 Metal check before CUDA/ROCm in GPU backend
decision chain so Metal is not bypassed when nvcc is in PATH
- Remove RPATH flags from CPU fallback CMAKE_ARGS (only needed
for Metal library linking)
- Update test_llama_pr_force_and_source.py to match _CLONE_ARGS
rename from _CLONE_BRANCH_ARGS in setup.sh
- Add confirm_install_tree guard test for
existing_install_matches_choice
- Add TestMacOSMetalBuildLogic bash subprocess tests verifying
Metal flag selection, nvcc precedence, and CPU fallback behavior
* Fix Metal CPU fallback to also cover cmake build failures and update tests
* Address review findings (reviewer consensus in parentheses):
1. _GPU_BACKEND_FRAGMENT synced -- removed dead CPU_FALLBACK_CMAKE_ARGS= init (6/8)
2. RPATH assertion replaced -- new test_macos_arm64_cpu_fallback_args_exclude_rpath checks the actual runtime CPU_FALLBACK_CMAKE_ARGS output for @loader_path and -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON (6/8)
3. _TRY_METAL_CPU_FALLBACK=false reset after both configure-failure and build-failure fallback branches in setup.sh (4/8)
4. macOS test now removes libmtmd.0.dylib instead of the platform-agnostic convert_hf_to_gguf.py (3/8)
5. Empty-string tag test added -- test_empty_tag_omits_branch_flag for resolved_tag= (2/8)
6. RPATH checks on cmake call logs -- both fallback tests now assert @loader_path and -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON are absent from CPU fallback cmake calls, plus baseline flag preservation (multiple)
* Clean up tests
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix(studio): allow context length slider to reach model's native limit
The context length slider was hard-capped to the VRAM-estimated maximum,
preventing users from requesting higher context even though the backend
already handles it safely (multi-GPU selection, --fit fallback). Expose
the model's native context length from GGUF metadata as a separate API
field and use it as the slider ceiling instead. Add an amber warning
when the selected context exceeds the estimated VRAM capacity.
* Raise VRAM budget to 90% and add native_context_length tests
Increase the GPU memory utilization threshold from 70% to 90% across
_select_gpus and _fit_context_to_vram, allowing longer context lengths
before VRAM capping kicks in.
Add 33 tests for the native_context_length feature covering the backend
property, context value separation invariants, Pydantic models, route
completeness, edge cases, and cross-platform binary I/O.
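The fitting rule above can be sketched as a single function (names and the bytes-per-token parameterization are illustrative, not the shipped _fit_context_to_vram):

```python
def fit_context_to_vram(free_vram_bytes, kv_bytes_per_token, native_ctx,
                        utilization=0.90):
    """Largest context whose KV cache fits the VRAM budget (90% of free
    VRAM, up from 70%), capped at the model's native context length."""
    budget = int(free_vram_bytes * utilization)
    return min(native_ctx, budget // max(kv_bytes_per_token, 1))
```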
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix: add tokenizers to no-torch runtime deps and add TORCH_CONSTRAINT for arm64 macOS py313+
Two installer fixes:
1. Add `tokenizers` to `no-torch-runtime.txt` before `transformers`.
Without it, `from transformers import AutoConfig` crashes on startup
because `--no-deps` skips transitive dependencies.
2. Add `TORCH_CONSTRAINT` variable to `install.sh`. On arm64 macOS with
Python 3.13+, tighten the torch requirement to `>=2.6` since torch
<2.6 has no cp313 arm64 wheels. The variable replaces the previously
hard-coded constraint in the uv pip install line.
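The platform gate in change 2, as a parameterized sketch (install.sh implements this in shell; this form is just for illustration):

```python
def torch_constraint(system, machine, python_version):
    """Pick the pip requirement string for torch based on platform."""
    if system == "Darwin" and machine == "arm64" and python_version >= (3, 13):
        return "torch>=2.6"   # torch <2.6 ships no cp313 arm64 wheels
    return "torch"
```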
Includes 66 tests (42 pytest + 24 bash) covering:
- Structural checks on install.sh, install.ps1, no-torch-runtime.txt
- Shell snippet tests with mocked python for 13 platform/version combos
- Mock uv integration verifying correct constraint string
- E2E venv tests on Python 3.12 and 3.13 confirming AutoConfig works
- Negative control proving AutoConfig fails without tokenizers
- Full no-torch sandbox regression guards (safetensors, huggingface_hub)
* Fix incomplete no-torch manifest and align E2E tests with real --no-deps path
- Add missing transitive deps to no-torch-runtime.txt that are required
under --no-deps: regex, typing_extensions, filelock, httpx, httpcore,
certifi, idna, anyio, sniffio, h11. Without these, `from transformers
import AutoConfig` still fails after install.sh --no-torch.
- Change all E2E tests to use --no-deps (matching what install.sh does)
instead of normal dep resolution. Previous tests passed even with an
incomplete manifest because uv backfilled transitive deps.
- Rewrite negative control to derive from the real no-torch-runtime.txt
with tokenizers stripped, proving the specific fix matters.
- Replace GNU-only sed -i with heredoc in shell test for macOS compat.
- Remove unused os/sys imports from Python test file.
- Quote SKIP_TORCH and mock uv paths in bash -c strings.
* Assert install succeeds before checking import results in E2E tests
Address review feedback: test_torch_not_importable and
test_tokenizers_directly_importable in Group 3 now assert that
uv pip install returns 0 before checking import behavior. This
prevents false positives when the install itself fails silently.
* Assert install succeeds in negative control and tighten error check
- Add missing install-success assertion in test_negative_control_no_tokenizers
to prevent false positives from network/install failures.
- Tighten error message check to look for "tokenizers" in stderr or
ModuleNotFoundError, rather than the generic "No module" substring
which could match unrelated import failures.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
- Fix SSL handshake failures (SSLV3_ALERT_HANDSHAKE_FAILURE, CERTIFICATE_VERIFY_FAILED) when fetching HTTPS pages by introducing _PinnedHTTPSConnection that separates TCP connect (to pinned IP) from TLS handshake (with real hostname for SNI/cert verification)
- Fix SSRF DNS-rebinding vulnerability: the previous implementation swapped conn.host before connect(), causing a fresh DNS resolution; the new subclass keeps TCP pinned to the validated IP
- Fix SPA/JS-rendered doc sites returning empty content by rotating real browser User-Agents (Chrome/Firefox/Safari)
- Strip nav/footer from HTML-to-Markdown output so article content is not buried under navigation chrome
- Increase raw fetch cap from 64KB to 512KB so SSR article content is reached on GitBook/Docusaurus/Next.js pages
- Fix IPv6 address bracketing in URL netloc construction
- Hoist SSL context, handler classes, and stdlib imports to module level (created once, not per-call)
- Use consistent UA across redirect hops to avoid breaking session-aware bot detection
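The pinning technique from the first two bullets, as a minimal sketch (class and attribute names are illustrative, not the shipped _PinnedHTTPSConnection):

```python
import http.client
import socket
import ssl

class PinnedHTTPSConnection(http.client.HTTPSConnection):
    """TCP connects to a pre-validated IP; TLS handshakes with the real
    hostname so SNI and certificate verification still work. Keeping the
    socket pinned also defeats DNS rebinding: no second lookup happens."""

    def __init__(self, real_host, pinned_ip, context=None, **kwargs):
        super().__init__(real_host, **kwargs)
        self._pinned_ip = pinned_ip
        self._context = context or ssl.create_default_context()

    def connect(self):
        # TCP to the pinned IP -- no fresh DNS resolution here.
        self.sock = socket.create_connection((self._pinned_ip, self.port),
                                             self.timeout)
        # TLS with the original hostname for SNI + cert verification.
        self.sock = self._context.wrap_socket(self.sock,
                                              server_hostname=self.host)
```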
Split out from #4741 to keep the main PR focused on installer logic.
- New test_install_llama_prebuilt_logic.py: tests for resolve logic,
fallback behavior, env_int, busy/lock handling
- New test_validate_llama_prebuilt.py: validator tests for staged
release_tag/upstream_tag handling
- New test_llama_pr_force_and_source.py: tests for PR_FORCE and
LLAMA_SOURCE maintainer defaults
- Updated test_selection_logic.py: expanded selection/fallback coverage
- Updated test_pr4562_bugfixes.py: updated bugfix tests for new logic
- Updated smoke_test_llama_prebuilt.py: minor update
Replaces the fixed prebuilt llama.cpp tag with dynamic published-release
resolution, adds bounded fallback across older published releases, and
introduces maintainer-editable defaults for PR/source overrides.
Changes:
- Resolve latest from the latest usable published release in unslothai/llama.cpp
- Use the selected release upstream_tag as the authoritative llama.cpp version
- Prefer Unsloth-published platform assets when available
- Fall back to same-tag upstream ggml-org/llama.cpp assets where allowed
- Keep Linux CUDA anchored to Unsloth-published CUDA bundles only
- Add bounded fallback across older Unsloth published releases
- Add separate busy/in-use install handling (exit code 3)
- Skip reinstall when the installed bundle already matches the selected candidate
- Add maintainer-editable _DEFAULT_LLAMA_PR_FORCE and _DEFAULT_LLAMA_SOURCE
- Harden env parsing so malformed installer env vars do not crash import-time fallback logic
- Honor UNSLOTH_LLAMA_RELEASE_TAG in all resolve steps
- Always sync git remote URL in existing-checkout path
* Fix save_pretrained_merged for full-finetuned models
save_pretrained_merged and push_to_hub_merged silently do nothing when
the model is not a PeftModel (i.e. full finetuning without LoRA).
merge_and_overwrite_lora returns None immediately for non-PeftModel,
and unsloth_generic_save does not check the return value.
Add a non-PeftModel branch in unsloth_generic_save that falls back to
model.save_pretrained / model.push_to_hub. When save_method contains
"16bit", cast weights to bfloat16 (or float16) via a state_dict copy
to honor the user's intent without mutating the live model.
The existing PeftModel (LoRA) code path is unchanged.
* Forward create_pr and revision to tokenizer.push_to_hub
The tokenizer push_to_hub call was missing create_pr and revision,
which could cause the tokenizer to push to the wrong branch or
bypass PR creation when the model push uses them.
* Honor merged_16bit dtype contract for full-finetuned models
Cast state_dict to bfloat16/float16 when save_method contains "16bit"
to match the documented behavior of save_pretrained_merged. Also pass
state_dict and save kwargs consistently to both save_pretrained and
push_to_hub paths.
* Address review feedback for PR #4755
- Simplify PeftModel isinstance check (PeftModelForCausalLM inherits
from PeftModel)
- Add is_main_process guard for distributed training
- Forward variant to save_pretrained
- Set tokenizer padding_side to "left" before saving (matches other
save paths)
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* feat(studio): architecture-aware KV cache VRAM estimation
Replace the single legacy formula (2 * n_kv_heads * head_dim * n_layers
* n_ctx * bpe) with 5-path estimation that reads 8 additional GGUF
metadata fields:
1. MLA (DeepSeek-V2/V3, GLM-4.7, GLM-5, Kimi-K2.5) -- K-only cache
using compressed KV latent + RoPE; no separate V allocation
2. Hybrid Mamba (Qwen3.5-27B, Qwen3.5-35B-A3B) -- only attention
layers (1 in N) carry KV; Mamba layers have none
3. Sliding Window (Gemma-3, gpt-oss) -- SWA layers cache
min(ctx, window) tokens instead of the full context
4. Standard GQA -- uses explicit key_length/value_length from GGUF
instead of embed // n_heads (which is wrong for many models)
5. Legacy fallback -- identical to old formula for old GGUFs
New GGUF fields parsed: attention.key_length, attention.value_length,
attention.sliding_window, full_attention_interval,
attention.kv_lora_rank, attention.key_length_mla, ssm.inner_size,
ssm.state_size.
Validated against 9 real GGUF files (72/72 field checks pass).
The legacy formula was off by +682% for Gemma-3 and -81% for
DeepSeek-V3.1.
* Fix MLA fallback and SWA global/local ratio heuristic
Two fixes based on review findings:
1. MLA fallback now uses key_length_mla from GGUF metadata instead of
hardcoded rope_dim=64. Falls back to 64 only when key_length_mla is
absent. This ensures correct estimates for MLA variants that use
rope dimensions other than 64.
2. SWA global/local layer ratio changed from 50/50 to 1/4 (25% global,
75% SWA). Most sliding window architectures have predominantly local
layers (Gemma-3 uses ~17% global, gpt-oss uses ~50%). The 1/4
heuristic is closer to the common case and still a large improvement
over the legacy formula which ignores SWA entirely.
* Tighten _can_estimate_kv gate and treat sliding_window=0 as disabled
Two additional fixes from review round 1 (5/8 and 4/8 reviewer consensus):
1. _can_estimate_kv now requires BOTH key_length AND value_length for
the explicit-dims path. Previously key_length alone was enough,
which could cause silent fallthrough to the legacy formula with
fabricated defaults (n_kv=1, head_dim=128) when value_length was
absent from the GGUF.
2. SWA path now requires sliding_window > 0. Some GGUFs use 0 as a
disabled sentinel. Without this guard, min(ctx, 0) would zero out
all SWA layer contributions, severely underestimating KV cache.
* Fix MLA n_kv safety and use ceiling division for hybrid path
Addresses Gemini Code Assist review findings:
1. MLA path now uses n_kv_mla = n_kv_heads or 1 (not n_heads). This
prevents a 128x overestimate for DeepSeek-V3 if head_count_kv is
absent from the GGUF (n_heads=128 would have been used instead).
2. Hybrid path now uses ceiling division for attention layer count.
This prevents undercounting by 1 when n_layers is not perfectly
divisible by full_attention_interval.
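Folding the five paths and the review fixes above into one stand-alone sketch (the dict shape mirrors the GGUF field names listed earlier, but the helper name, defaults, and exact formula shapes are assumptions, not the shipped estimator):

```python
import math

BYTES_PER_ELEM = 2  # fp16/bf16 KV cache

def estimate_kv_bytes(meta, n_ctx):
    n_layers = meta["n_layers"]
    n_kv = meta.get("n_kv_heads")

    # 1. MLA: K-only cache of compressed latent + RoPE dims, no V allocation.
    if meta.get("kv_lora_rank"):
        rope_dim = meta.get("key_length_mla") or 64  # 64 only as a last resort
        n_kv_mla = n_kv or 1                         # never fall back to n_heads
        return n_kv_mla * (meta["kv_lora_rank"] + rope_dim) * n_layers * n_ctx * BYTES_PER_ELEM

    kd, vd = meta.get("key_length"), meta.get("value_length")
    if kd and vd and n_kv:  # require BOTH explicit dims (review fix)
        per_tok_layer = n_kv * (kd + vd) * BYTES_PER_ELEM

        # 2. Hybrid Mamba: only 1-in-N layers carry KV; ceil avoids undercounting.
        interval = meta.get("full_attention_interval")
        if interval:
            return math.ceil(n_layers / interval) * per_tok_layer * n_ctx

        # 3. Sliding window: SWA layers cache min(ctx, window); 0 means disabled.
        window = meta.get("sliding_window") or 0
        if window > 0:
            n_global = n_layers // 4                 # 1/4-global heuristic
            n_swa = n_layers - n_global
            return (n_global * n_ctx + n_swa * min(n_ctx, window)) * per_tok_layer

        # 4. Standard GQA with explicit key/value lengths from GGUF.
        return n_layers * per_tok_layer * n_ctx

    # 5. Legacy fallback: identical shape to the old formula.
    return 2 * (n_kv or 1) * meta.get("head_dim", 128) * n_layers * n_ctx * BYTES_PER_ELEM
```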
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Fix forward compatibility with transformers 5.x
Tested on transformers 4.57.6, 5.3.0, and 5.4.0. All changes are no-ops
on transformers 4.x.
1. Skip exec-based config patching for transformers >= 5.0
Config classes in v5 use @strict, @auto_docstring, and interval()
which break exec(inspect.getsource(...)). Those configs already use
rope_parameters (the v5 replacement for rope_scaling).
2. Slice position_ids to last token in fast_forward_inference
Transformers 5.x generate() accumulates position_ids as
[batch, full_seq_len] across decode steps instead of [batch, 1].
cos[position_ids] then produces the wrong shape for rotary
embeddings. Fixed in llama, qwen3, falcon_h1, gemma2, cohere,
granite. No-op on 4.x since position_ids is already [batch, 1].
3. Handle @strict config kwargs for sequence classification
num_labels, max_position_embeddings, id2label etc. are set on the
config object and passed via config= instead of as kwargs.
AutoModelForSequenceClassification routing added to FastModel loader.
4. Exclude modernbert from flex_attention
ModernBERT with flex_attention hits CUDA illegal memory access in
create_block_mask. Falls back to eager attention safely.
5. Propagate token_type_ids and mm_token_type_ids through GRPO VLM path
Gemma3 Vision requires token_type_ids during training. Qwen3VL
requires mm_token_type_ids for M-RoPE. Extract from inputs in
compute_loss, pass to grpo_accumulated_loss, and extend
mm_token_type_ids for completion tokens in
_generate_and_score_completions.
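Item 2's shape fix, shown on nested lists as a stand-alone sketch (the real code slices a torch tensor with `position_ids[:, -1:]` inside fast_forward_inference):

```python
def slice_position_ids(position_ids):
    """transformers 5.x generate() can accumulate position_ids as
    [batch, full_seq_len] across decode steps, so cos[position_ids]
    produces the wrong shape for rotary embeddings. Keep only the last
    position per batch row. No-op for the 4.x [batch, 1] shape."""
    return [row[-1:] for row in position_ids]
```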
* Add try/except safety net around config exec for pre-release transformers versions
* Pop config-level kwargs in seqclass path and use except Exception
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
When searching for a specific publisher model (e.g. `openai/gpt-oss-20b`), the
unsloth search used the full `openai/gpt-oss-20b` string with `author=unsloth`,
which returned zero results because no unsloth model contains the publisher
prefix in its name. Users never discovered unsloth variants.
This PR strips the org prefix for publisher-qualified queries so unsloth variants
surface, then pins the original publisher model after a small batch of unsloth
results. Plain queries (no slash) and unsloth-prefixed queries are unchanged.
- Strict regex (`/^([^/\s]+)\/([^/\s]+)$/`) only triggers on valid `owner/repo`
identifiers; incomplete typeahead, multi-slash, and URL-like inputs are rejected
- Queries for `unsloth/...` models (case-insensitive) keep the full 20-result
prefetch and secondary sort
- Pinned model lookup fires in parallel with the unsloth prefetch
- Canonical-name dedup prevents duplicates when HF normalizes casing
- Publisher detection extracted into a single `useMemo` block
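The shipped logic is TypeScript in the Studio frontend; the same query decision, sketched in Python with the strict regex from above (helper name is hypothetical):

```python
import re

# Only valid owner/repo identifiers trigger stripping; multi-slash,
# whitespace, and incomplete typeahead inputs fall through unchanged.
_OWNER_REPO = re.compile(r"^([^/\s]+)/([^/\s]+)$")

def unsloth_search_query(query):
    """Return the string to send alongside author=unsloth."""
    m = _OWNER_REPO.match(query)
    if not m:
        return query                  # plain or malformed query: unchanged
    owner, repo = m.groups()
    if owner.lower() == "unsloth":
        return query                  # unsloth-prefixed: unchanged
    return repo                       # strip publisher so unsloth variants surface
```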
Replace strikethrough + opacity-50 OOM styling with gray text and red pill badge across all Studio model selectors (chat, training, onboarding).
- Use gray-500/gray-400 for OOM model names (better contrast than strikethrough)
- Red pill badge for OOM indicator with light/dark mode support
- Scope GGUF gray override to quant name only so downloaded/recommended labels keep colors
- Add !important on TIGHT/OOM badges to resist ComboboxItem hover overrides
* Fix Windows "Non-relative patterns are unsupported" when loading local GGUF models
When a user loads a GGUF model from a local Windows path (e.g.
C:\Users\danie\.lmstudio\models\unsloth\functiongemma-270m-it-GGUF),
the model identifier contains backslashes and a drive letter. Both
load_model_defaults() and _has_specific_yaml() constructed a YAML
filename from the full absolute path and passed it to Path.rglob(),
which rejects non-relative patterns on Windows.
Fixed by detecting Windows-style paths (drive letters, UNC paths,
backslashes) in addition to Unix-style paths, and using only the
directory basename for the YAML filename lookup when the identifier
is a local filesystem path.
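A self-contained sketch of the detection and basename fallback (the helper name and exact patterns are illustrative; per the later commits, the shipped code reuses existing path helpers):

```python
import re
from pathlib import PurePosixPath, PureWindowsPath

# Drive letter (C:\ or C:/) or UNC prefix.
_WINDOWS_PATH = re.compile(r"^(?:[A-Za-z]:[\\/]|\\\\)")

def yaml_lookup_name(identifier):
    """For local paths, use only the directory basename for the YAML
    rglob() pattern -- absolute patterns are rejected on Windows."""
    if _WINDOWS_PATH.match(identifier) or "\\" in identifier:
        return PureWindowsPath(identifier).name
    if identifier.startswith(("/", "./", "../", "~")):
        return PurePosixPath(identifier).name
    return identifier  # hub id like unsloth/model-name: keep as-is
```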
* Refactor: reuse is_local_path helper, fix case-sensitive suffix lookup
- Replace inline local-path detection in model_config.py and
inference_config.py with the existing is_local_path() from utils.paths,
which already handles Unix, Windows drive-letter, UNC, and backslash paths
- Fix case-sensitive suffix lookup in load_model_defaults(): the
_REVERSE_MODEL_MAPPING is lowercase-keyed, so suffix comparisons must use
.lower() to match paths like /path/to/Spark-TTS-0.5B/LLM
* Fix WSL path parsing and _has_specific_yaml suffix lookup
- Use normalize_path() before Path() operations so backslash Windows
paths (e.g. C:\Users\...\model) are correctly split on POSIX/WSL hosts
where pathlib treats backslashes as literal characters
- Add suffix-based (2-component and 1-component) lookup to
_has_specific_yaml() so it matches the same resolution rules as
load_model_defaults(), fixing wrong inference params for local
suffix-mapped models like Spark-TTS-0.5B/LLM
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix: clear tool status badge immediately after tool execution
The tool status timer badge (Searching 1s, 2s...) persisted after
tool calls finished because the status clear event was only sent
at the start of the next generation iteration, not after tool
execution completed.
Backend: yield status clear after all tools finish in the agentic
loop iteration, before continue starts the next generation pass.
Frontend: debounce badge visibility by 300ms so sub-second tool
calls don't flash the badge.
* Fix debounce regression for consecutive tool calls
Only apply the 300ms show-delay when transitioning from idle to
tool-active. When switching between consecutive tools in the same
turn (e.g. web_search -> python), keep the badge visible immediately
so it does not flicker or disappear during multi-tool runs.
* Delay wasActiveRef reset to bridge inter-iteration tool gaps
The backend emits a status-clear event between tool iterations,
which was resetting wasActiveRef immediately and causing the next
tool to be re-debounced (300ms hidden gap between consecutive tools
in the same turn). Now the ref reset is delayed by 500ms so a
follow-up tool within the same agentic turn shows the badge
immediately, while a genuinely new turn still gets the debounce.
* Use thread lifecycle to track tool-run boundaries
Replace the 500ms wall-clock timeout with the actual thread.isRunning
state to determine when wasActiveRef should reset. This properly
handles all cases:
- Consecutive tools within the same run stay visible without flicker
- The badge hides only when the thread run actually ends
- New turns always get a fresh 300ms debounce on the first tool
- No heuristic timeout that can misfire on slow or fast inference
* Consolidate wasActiveRef reset into single effect
Removes the separate isThreadRunning effect to avoid a race where
the ref resets before the tool-status effect reads it (when
isThreadRunning flips to false before setToolStatus(null) from
the adapter's finally block). Now wasActiveRef resets only when
both toolStatus is null AND the thread run has ended, eliminating
any flicker on the last tool of a run.
* Simplify debounce: use visible state instead of ref tracking
Drop wasActiveRef entirely and use the visible state as the
debounce gate. When the badge is not yet on screen, debounce
for 300ms before showing. When already visible from a prior tool,
keep showing immediately. This correctly handles all cases:
- All fast tools (<300ms) are suppressed, not just the first
- Consecutive tools after the badge is shown stay visible
- Badge persists across inter-iteration clears while thread runs
- New turns get a fresh debounce after visible resets
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* refactor: move folder management from sidebar into model selector
* Fix folder management: restore LoRA picker sync, error handling, caching
- Restore onFoldersChange callback to keep LoRA adapter picker in sync
when scan folders are added/removed (fixes regression from sidebar move)
- Thread onFoldersChange through ModelSelector -> HubModelPicker prop chain
- Add module-level _scanFoldersCache to prevent folder list flash on re-open
- Surface error toast on folder removal failure instead of silently ignoring
- Guard handleAddFolder against concurrent double-submit via folderLoading
- Clear folderInput on Escape key dismiss to prevent stale input on re-open
- Add refreshLocalModelsList and refreshScanFolders to useEffect dep array
* Fix compare-mode folder sync, Escape key propagation, cancel toggle state
- Wire onFoldersChange through CompareContent/GeneralCompareContent so
compare-mode selectors also refresh local models after folder changes
- Add e.stopPropagation() on Escape key in folder input to prevent
Radix Popover from closing the entire model selector dropdown
- Add e.preventDefault() on Enter key to prevent form submission
- Clear folderInput and folderError when cancel toggle hides the input,
matching the Escape key behavior for consistency
* Fix folder mutation state ordering and touch accessibility
- Use optimistic updates for add/remove so the folder list reflects
changes immediately instead of waiting on a second listScanFolders
round-trip that could silently fail.
- Move refreshScanFolders out of the finally block in handleRemoveFolder
so it runs after the cache update, not after onFoldersChange.
- Make the remove button visible on touch/mobile devices and reachable
via keyboard focus (opacity-100 on small screens, focus-visible).
- Add aria-label to the remove button for screen readers.
* Deduplicate optimistic folder add to match backend behavior
The backend returns the existing ScanFolderInfo row when adding a
path that is already registered. The optimistic update was blindly
appending the returned row, producing duplicate entries and React
key warnings. Now checks by id before appending.
* Add aria-label to folder toggle button and strengthen dedup check
- Add aria-label to the +/cancel icon button for screen readers.
- Extend optimistic dedup check to also compare by path, not just id,
to handle edge cases where the cache is stale.
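The final dedup decision, shown in Python for brevity (the shipped code is a React state update in TypeScript; field names are assumptions):

```python
def optimistic_add(folders, returned):
    """The backend returns the existing row when the path is already
    registered; dedup by id *and* path before appending, since the
    path check guards against a stale cache."""
    for f in folders:
        if f["id"] == returned["id"] or f["path"] == returned["path"]:
            return folders
    return folders + [returned]
```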
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* style(windows): clean installer/setup log output and remove seeded credential banner
* Keep startup credential hint without exposing plaintext password
Print the username and .bootstrap_password file path on first-run
admin creation instead of the raw password. Headless / Docker / SSH
operators still get a startup-time hint for initial sign-in, and the
plaintext credential no longer appears in terminal output or logs.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* feat: add scan_folders table and CRUD functions to studio_db
* feat: add scan folders API endpoints and integrate into model scan
* feat: add scan folders API client and update source types
* feat: add custom source to model filters and selector
* feat: add Model Folders section to chat settings sidebar
* style: fix biome formatting in ModelFoldersSection
* fix: address review findings for custom scan folders
empty string bypass, concurrent delete crash guard,
Windows case normalization, response_model on endpoints,
logging, deduplicated filter/map, module level cache for
custom folder models, consistent source labels, handleRemove
error surfacing, per folder scan cap
* fix: show custom folders section regardless of chatOnly mode
* refactor: extract shared refreshLocalModelsList in pickers
* Harden custom scan folder validation and scanning
- Validate path exists, is a directory, and is readable before persisting
- Apply per-folder model cap during traversal instead of after (avoids
scanning millions of inodes in large directories)
- Wrap per-folder scan in try/except so one unreadable folder does not
break the entire /api/models/local endpoint for all callers
- Normalize case on Windows before storing so C:\Models and c:\models
dedup correctly
- Extend macOS denylist to cover /private/etc and /private/tmp (realpath
resolves /etc -> /private/etc, bypassing the original denylist)
- Add /boot and /run to Linux denylist
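A sketch of the pre-persist checks (the denylist here is a truncated example of the platform lists described above, and the helper name is hypothetical):

```python
import os

def validate_scan_folder(path, denylist=("/etc", "/private/etc", "/boot", "/run")):
    """Validate a folder before persisting it as a scan root."""
    # realpath matters: on macOS /etc resolves to /private/etc, which
    # would otherwise bypass a denylist of unresolved paths.
    real = os.path.realpath(os.path.expanduser(path))
    if any(real == d or real.startswith(d + os.sep) for d in denylist):
        raise ValueError(f"{real} is in a protected system location")
    if not os.path.isdir(real):
        raise ValueError(f"{real} is not a directory")
    if not os.access(real, os.R_OK):
        raise ValueError(f"{real} is not readable")
    return real
```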
* Improve scan robustness and preserve Windows path casing
- Preserve original Windows path casing in DB instead of lowercasing
(normcase used only for dedup comparison, not storage)
- Catch PermissionError per child directory so one unreadable subdirectory
does not skip the entire custom folder scan
- Wrap list_scan_folders() DB call in try/except so a DB issue does not
break the entire /api/models/local endpoint
* fix: scan custom folders for both flat and HF cache layouts
* Fix Windows case-insensitive path dedup with COLLATE NOCASE
Use COLLATE NOCASE on the scan_folders.path column so that the UNIQUE
constraint correctly deduplicates C:\Models and c:\models on Windows
without lowercasing the stored path. Also use COLLATE NOCASE in the
pre-insert lookup query on Windows to catch existing rows with
different casing.
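The collation behavior can be demonstrated with stdlib sqlite3 (minimal one-column schema, not the shipped scan_folders table; the shipped code applies NOCASE only on Windows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# COLLATE NOCASE on the column makes the UNIQUE index case-insensitive
# while the original casing is still what gets stored.
conn.execute("CREATE TABLE scan_folders (path TEXT UNIQUE COLLATE NOCASE)")
conn.execute("INSERT INTO scan_folders VALUES (?)", (r"C:\Models",))
try:
    conn.execute("INSERT INTO scan_folders VALUES (?)", (r"c:\models",))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
stored = conn.execute("SELECT path FROM scan_folders").fetchone()[0]
```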
* Restore early-exit limit in _scan_models_dir for custom folders
Keep the limit parameter so _scan_models_dir stops iterating once
enough models are found, avoiding unbounded traversal of large
directories. The post-traversal slice is still applied after combining
with _scan_hf_cache results.
* feat: scan custom folders with LM Studio layout too
* Fix custom folder models being hidden by dedup
Custom folder entries were appended after HF cache and models_dir
entries. The dedup loop kept the first occurrence of each model id,
so custom models with the same id as an existing HF cache entry were
silently dropped -- they never appeared in the "Custom Folders" UI
section.
Use a separate dedup key for custom-source entries so they always
survive deduplication. This way a model can appear under both
"Downloaded" (from HF cache) and "Custom Folders" (from the
user-registered directory) at the same time.
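The separate dedup key, as a sketch (entry shape and source labels are assumptions):

```python
def dedup_models(entries):
    """Keep the first occurrence per model id, but give custom-source
    entries their own key so they survive alongside an HF-cache twin."""
    seen, out = set(), []
    for e in entries:
        key = (e["id"], "custom" if e["source"] == "custom" else "default")
        if key in seen:
            continue
        seen.add(key)
        out.append(e)
    return out
```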
* Harden LM Studio scan and fix COLLATE NOCASE on Linux
- Add per-child and per-publisher OSError handling in _scan_lmstudio_dir
so one unreadable subdirectory does not discard the entire custom
folder's results
- Only apply COLLATE NOCASE on the scan_folders schema on Windows where
paths are case-insensitive; keep default BINARY collation on Linux
and macOS where /Models and /models are distinct directories
* Use COLLATE NOCASE in post-IntegrityError fallback SELECT on Windows
The fallback SELECT after an IntegrityError race now uses the same
case-insensitive collation as the pre-insert check, so a concurrent
writer that stored the path with different casing does not cause a
false "Folder was concurrently removed" error.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Simplify tool-call dedup: drop hashlib, inline helpers
The duplicate tool-call detector only compares calls within a single
request from the same JSON parser, so dict key order is guaranteed
identical for identical calls (Python 3.7+ insertion-ordered dicts).
- Replace hashlib.md5(json.dumps(...)) with name + str(args)
- Inline _tool_call_key, _is_duplicate_call, _record_tool_call
since each was a one-liner used once
- Remove unused hashlib import
* Remove tool_calling_benchmark_results.md from repo
* Replace html2text with builtin HTML-to-Markdown converter
Drop the external html2text (GPL-3.0) dependency and its regex
fallback. Add _html_to_md.py (~190 lines, stdlib only) using
html.parser.HTMLParser that handles headings, links, bold/italic,
lists, tables, blockquotes, code blocks, and entity decoding.
Strips script/style/head tags entirely.
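A toy illustration of the stdlib-only approach, assuming nothing about the real _html_to_md.py beyond its use of html.parser.HTMLParser; this miniature handles only headings and links:

```python
from html.parser import HTMLParser

class MiniMarkdown(HTMLParser):
    """Heavily simplified sketch: headings and links only."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._href = None
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.out.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "a":
            self._href = dict(attrs).get("href", "")
            self._link_text = []

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3"):
            self.out.append("\n")
        elif tag == "a" and self._href is not None:
            self.out.append(f"[{''.join(self._link_text)}]({self._href})")
            self._href = None

    def handle_data(self, data):
        # Buffer text inside links so it can be emitted as the link label.
        if self._href is not None:
            self._link_text.append(data)
        else:
            self.out.append(data)

p = MiniMarkdown()
p.feed('<h2>Docs</h2><p>See <a href="https://example.com">the site</a>.</p>')
markdown = "".join(p.out).strip()
```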
* Use json.dumps(sort_keys=True) for tool-call dedup key
str(dict) is sensitive to insertion order, so semantically identical
calls with different key ordering would bypass duplicate detection.
Switch to json.dumps with sort_keys=True for a canonical representation.
* Revert dedup key to str(arguments)
json.dumps(sort_keys=True) is unnecessary here -- the arguments dict
always comes from the same JSON parser within a single request, so
key insertion order is deterministic (Python 3.7+). str() is faster
and sufficient for consecutive-call dedup.
* Address review comments on _html_to_md.py
- Remove "hr" from _BLOCK_TAGS so the dedicated hr handler is reachable
- Prefix all newlines with ">" inside blockquotes (multi-line support)
- Emit full `![alt](src)` Markdown for images instead of alt text only
- Replace newlines with spaces inside table cells
- Track header cells per-row (_row_has_th) instead of last-cell-only
- Strip trailing tabs in addition to spaces in cleanup regex
* Fix blockquote rendering, truncated-HTML buffer flush, and dedup key canonicalization
_html_to_md.py:
- Rewrite blockquote handling with stack-based buffer approach so nested
blockquotes, pre blocks inside blockquotes, and multi-paragraph quotes
all render correctly with proper "> " prefix on every line.
- Add flush_pending() to recover content from truncated HTML where closing
tags are missing (common when _fetch_page_text caps the download size).
Flushes open <a>, <td>, <pre>, and blockquote buffers.
- Skip <img> tags to match prior html2text ignore_images=True behavior
and avoid data-URI amplification consuming the output budget.
- Collapse all whitespace (including newlines) in non-pre content per
standard HTML whitespace rules: \s+ -> single space.
- Escape pipe characters in table cell content to prevent column breakage.
- Emit separator row after the first row for tables without <th> headers.
- Guard against IndexError on _ol_counter for orphan <li> elements.
- Normalize CRLF line endings before parsing.
llama_cpp.py:
- Restore canonical dedup key with json.dumps(sort_keys=True) so that
semantically identical tool calls with different JSON key order are
correctly detected as duplicates.
* Fix table optional end tags, inline code whitespace, and link text normalization
_html_to_md.py:
- Extract _finish_cell() and _finish_row() helpers to handle HTML tables
that omit optional </td>, </th>, or </tr> end tags. This is valid HTML
and common on real web pages -- previously the parser would silently
drop earlier cells and entire rows.
- Call _finish_cell()/_finish_row() from handle_starttag for <tr>/<td>/<th>,
handle_endtag for </tr>/<td>/<th>/<table>, and flush_pending() so all
three paths (normal close, implicit close, truncated HTML) use the same
row-finalization logic including header separator emission.
- Add _in_inline_code flag so handle_data() preserves literal whitespace
inside <code> spans instead of collapsing it. Source like
<code>pip  install  unsloth</code> (with doubled spaces) now correctly
renders as `pip  install  unsloth` rather than collapsing to
`pip install unsloth`.
- Extract _finish_link() helper that normalizes accumulated link text with
\s+ -> single space before building the Markdown link. Prevents block-
level content inside <a> tags (e.g. <a><div>one</div><div>two</div></a>)
from producing multiline [one\n\ntwo](href) link labels.
- Empty blockquotes now produce no output instead of a stray ">".
- Remove unused _bq_depth field (all routing uses _bq_stack).
- Flush open cells and rows in handle_endtag("table") for robustness.
* Support <ol start=N>, <dl>/<dt>/<dd>, and preserve code block whitespace
_html_to_md.py:
- Honor <ol start="N"> attribute so ordered lists preserve their original
numbering instead of always restarting from 1. Important for docs/tutorials
that continue numbering across sections.
- Add dl, dt, dd to _BLOCK_TAGS so definition lists (common on MDN, Python
docs, Django docs) produce separated text instead of concatenated blobs.
- Rewrite _cleanup() to be fence-aware: content inside fenced code blocks
is now preserved verbatim (intentional blank lines in <pre> content are
no longer collapsed). Outside code blocks, blank runs are limited to one
and trailing whitespace is stripped.
- Fix _prefix_blockquote() to strip trailing whitespace before collapsing
blank lines, preventing the "\n\n \n\n" pattern from sneaking through.
* Suppress whitespace-only text nodes between table structural elements
Indented HTML tables (as on nearly all real-world pages) produce
whitespace text nodes between <table>, <tr>, </tr>, etc. that land in the
output as leading spaces before table rows, breaking Markdown table
alignment.
Skip whitespace-only text nodes when inside a table but not inside a
cell, so indentation from source HTML does not leak into the output.
* Revert dedup key to str(arguments) with explanatory comment
json.dumps(sort_keys=True) is unnecessary overhead here: arguments
always comes from json.loads on model output within a single request,
so dict insertion order is deterministic in Python 3.7+. A repeated
call from the model produces the same JSON, which parses to the same
dict repr. str() avoids re-serialization on every tool call.
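The tradeoff argued back and forth in these dedup-key commits is easy to demonstrate:

```python
import json

a = {"query": "unsloth", "max_results": 5}
b = {"max_results": 5, "query": "unsloth"}  # same call, different key order

# str() is insertion-order sensitive, so these would NOT match as duplicates:
str_keys_differ = str(a) != str(b)

# json.dumps(sort_keys=True) canonicalizes, so they WOULD match:
canonical_match = json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)

# The commit above settles on str() because within a single request both
# dicts come from the same JSON parser, so key order is deterministic
# (Python 3.7+ insertion-ordered dicts) and canonicalization is unneeded.
```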
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* studio: improve GGUF tool calling accuracy and reliability
- Add URL fetching to web_search tool so models can read full page
content instead of only getting search snippets. Uses html2text for
clean markdown conversion with regex fallback.
- Inject current date and behavioral guidance (URL fetch workflow,
no repeated queries, use code for data processing) into the
tool-use system prompt.
- Append error recovery nudge to tool results that indicate failure,
helping small models avoid looping on the same broken call.
- Strip leaked <tool_call> XML from assistant messages in conversation
history and from the outgoing SSE stream.
- Raise default max tool iterations from 10 to 25 across backend,
model schema, and frontend defaults.
- Increase _MAX_PAGE_CHARS from 4k to 16k so fetched pages contain
enough content for the model to extract useful information.
- Add "IMPORTANT: These are only short snippets" hint to search
results so models know to fetch full pages when needed.
Tested with Qwen3.5-4B-GGUF (UD-Q4_K_XL), 10 runs before/after:
- XML leaks in responses: 10/10 -> 0/10
- URL fetch usage: 0 -> 4/10 runs
- Runs producing actual correct answers: 0/10 -> 2/10
- Average tool calls per query: 5.5 -> 3.8 (more efficient)
- Average response time: 12.3s -> 9.8s
* Add tool calling benchmark results across model sizes and quants
Tested 16 configurations (4 models x 2 quants x 2 KV cache types)
with 10 runs each on NVIDIA B200.
Best config: 27B UD-Q4_K_XL + bf16 KV -- 6/10 runs found all 4
correct songs, 0 XML leaks, 131s average response time.
* Add duplicate tool-call detection and final-answer synthesis
When the model repeats the exact same tool call (same name + arguments)
twice in a row, skip execution and return a redirect message telling it
to try a different approach. This prevents the 8x-repeated-query loops
observed on 27B and 35B models.
When the tool iteration cap (25) is reached, inject a "provide your
final answer now" message before the final streaming pass. This lets
the model synthesize a useful answer from everything it gathered
instead of being silently cut off.
Tested on Qwen3.5-27B UD-Q4_K_XL (10 runs):
- Repeated query runs: 4/10 -> 2/10
- Cap hits: 1/10 -> 0/10
- All 4/4 accuracy: 5/10 -> 7/10
* Fix CodeQL alert: handle whitespace in script/style closing tags
The regex fallback for HTML stripping did not match closing tags
with whitespace before the angle bracket (e.g. </script >).
Use \s* before > in both script and style patterns.
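A sketch of the hardened pattern; the exact regex in the codebase may differ:

```python
import re

# [^>]* tolerates whitespace or stray attributes before the closing ">".
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script[^>]*>", re.I | re.S)

html = 'before<script type="text/javascript">alert(1)</script >after'
stripped = SCRIPT_RE.sub("", html)
```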
* Address reviewer findings: SSRF, timeout crash, XML regex, dedup
- SSRF: resolve hostname via getaddrinfo and reject private, loopback,
link-local, multicast, and reserved addresses before fetching
- Timeout: handle timeout=None (unlimited mode) in URL fetch path
by defaulting to 60s instead of crashing on min(None, 60)
- Download cap: read at most max_chars*4+1 bytes instead of the
full response body before truncating
- XML regex: match both <tool_call> and <function=...> markup in
the history/stream cleanup (inference.py)
- CodeQL: use [^>]* in closing script/style tags to handle any
whitespace or attributes before >
- Dedup: track whether each tool call failed so retries after
transient errors are allowed; only block consecutive identical
calls that both succeeded
- Final-answer synthesis: guard on max_tool_iterations > 0 so
callers who disable tools do not get a false "used all calls" turn
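The SSRF check described in the first bullet might look roughly like this; the function name and exact policy are illustrative:

```python
import ipaddress
import socket

def resolve_and_check(hostname: str) -> str:
    """Resolve via getaddrinfo and reject non-public addresses
    before fetching (hypothetical sketch of the guard)."""
    infos = socket.getaddrinfo(hostname, None)
    ip = ipaddress.ip_address(infos[0][4][0])
    if (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_multicast or ip.is_reserved):
        raise ValueError(f"Blocked: {hostname} resolves to {ip}")
    return str(ip)
```

As the later commits note, checking once is not enough: redirects and DNS rebinding both require re-validating (and pinning) the resolved address at every hop.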
* Fix redirect SSRF, SSE streaming regression, dedup off-by-one
- SSRF redirect bypass: disable auto-redirect in urllib, manually
follow up to 5 hops with host validation at each step. Prevents
public URLs from redirecting to loopback/private targets.
- SSE streaming: track prev_text on the raw cumulative and strip
XML from the delta only, so completed tool_call tags do not cause
the cumulative to shrink and drop trailing real text.
- Dedup off-by-one: check the immediately previous call (window=1)
instead of requiring 2 matching history entries, so the second
identical successful call is blocked rather than the third.
* Fix redirect HTTPError handling and tighten error prefixes
- Redirect fix: urllib raises HTTPError (not a normal response) when
the redirect handler returns None. Catch HTTPError for 3xx codes
and extract the Location header from the exception object.
- Error prefixes: remove overly broad "No " prefix that matched
"No results found." (a valid empty-search outcome, not an error).
Replace with specific prefixes like "Blocked:", "No query provided",
"Failed to resolve". This ensures empty search results are correctly
classified as non-errors for duplicate-call tracking.
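The prefix-based classification can be sketched as follows; the prefix list is partly illustrative ("HTTP " is an assumption based on the "HTTP {code} {reason}" messages introduced later in this PR):

```python
# Specific prefixes per the commit; the broad "No " prefix is gone because
# it wrongly matched "No results found.", a valid empty-search outcome.
_ERROR_PREFIXES = ("Blocked:", "No query provided", "Failed to resolve", "HTTP ")

def is_error_result(text: str) -> bool:
    return text.startswith(_ERROR_PREFIXES)
```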
* Fix SSE cross-chunk XML leaks, cleanup review findings
- SSE streaming: sanitize the full cumulative text before diffing
against the previous sanitized snapshot, so XML tags that span
chunk boundaries are stripped correctly. The previous delta-based
approach leaked split tags.
- DRAINING fallback: use _strip_tool_markup() helper instead of a
manual regex that only handled <tool_call> but not <function=...>.
- Move hashlib import, _TOOL_XML_RE compile, and datetime import to
module level per style guide.
- Remove unused _hit_tool_cap variable.
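The sanitize-then-diff streaming approach can be sketched like this (regexes simplified; the real helper also strips <function=...> markup):

```python
import re

_TOOL_XML_RE = re.compile(r"<tool_call>.*?</tool_call>", re.S)
_PARTIAL_RE = re.compile(r"<tool_call\b.*$", re.S)

def sanitize(text: str) -> str:
    # Drop complete tool_call tags, then hold back an unterminated
    # trailing one until it completes in a later chunk.
    return _PARTIAL_RE.sub("", _TOOL_XML_RE.sub("", text))

# Sanitize the FULL cumulative text on every chunk, then diff against the
# previous sanitized snapshot, so tags split across chunks never leak.
chunks = ["Hello <tool_call>{", '"name": "web_search"}</tool_call> world']
cumulative, prev, emitted = "", "", []
for chunk in chunks:
    cumulative += chunk
    clean = sanitize(cumulative)
    emitted.append(clean[len(prev):])   # emit only the new sanitized suffix
    prev = clean
streamed = "".join(emitted)
```

Diffing sanitized snapshots (rather than sanitizing each delta) is what keeps the cumulative from ever shrinking and dropping trailing real text.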
* Fix DNS rebinding, charset detection, HTTPError handling, dedup double-record
- DNS rebinding: resolve hostname once via getaddrinfo, pin the
returned IP, rewrite the URL to connect to the pinned IP with
a Host header. Each redirect hop re-resolves and re-validates.
Closes the TOCTOU window between validation and connection.
- Charset: use resp.headers.get_content_charset() instead of
hardcoding utf-8, so pages with other encodings decode correctly.
- HTTPError: return descriptive "HTTP {code} {reason}" instead of
re-raising into a generic "Search failed" message.
- Dedup: remove redundant _record_tool_call in the duplicate branch;
the single call at the end of the loop handles all cases.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* fix: auto-retry stalled HF downloads with HF_HUB_DISABLE_XET=1
The heartbeat thread now monitors the HF Hub cache directory for
file-size growth. If no bytes are written for 3 minutes, it sends a
"stall" message to the orchestrator, which kills the subprocess and
retries with HF_HUB_DISABLE_XET=1 (falling back from Xet to standard
HTTPS). If the retry also stalls, it errors out with a clear message.
* fix: include transport type (xet/https) in heartbeat and stall log messages
Makes it clear in backend logs whether the download is using xet or
https transport, and which transport stalled -- helpful for debugging.
* fix: monitor HF Hub .tmp dir to avoid false stall detections
huggingface_hub downloads into .tmp/ before atomically moving to
blobs/. Without monitoring .tmp, a large shard actively downloading
for several minutes would show zero blob growth and trigger a false
stall.
* fix: scope HF cache size check to specific model being loaded
Instead of scanning every models--*/blobs directory (O(N) with cached
models), only check the specific model's blobs dir plus the global
.tmp dir. Much faster on systems with many cached models.
* Fix false stall detection on cached/local models and cleanup issues
- Only fire stall if download activity was observed (cache size changed
at least once). Previously, any model load taking >180s would trigger
a false stall, even for already-cached or local models where no
download is happening.
- Return -1 from _get_hf_cache_size on exception to distinguish
"unable to measure" from "genuinely zero bytes". Skip stall logic
when measurement fails.
- Add _shutdown_subprocess before raising on terminal stall path to
prevent leaking a stuck subprocess.
- Detect pre-existing HF_HUB_DISABLE_XET=1 in the parent environment
to avoid a redundant retry cycle when Xet is already disabled.
- Remove global .tmp directory scanning (not used by modern
huggingface_hub; in-progress downloads use .incomplete files in
blobs/ which are already captured by iterdir).
- Add f.is_file() guard in cache size calculation.
- Replace em dashes with ASCII dashes for Windows terminal compat.
* Harden stall detection edge cases
- Guard -1 to valid value transition: when initial _get_hf_cache_size
returns -1 (error) and later recovers to a real value, do not count
that as download activity. Only set saw_download_activity when the
previous measurement was also valid (>= 0).
- Move os import to top-level in orchestrator.py instead of inline
import os as _os.
- Fix misleading comment about post-download protection.
* Use .incomplete files to detect active downloads for stall detection
Replace the saw_download_activity heuristic with direct .incomplete file
detection. huggingface_hub creates *.incomplete files in blobs/ during
active downloads and removes them on completion. This gives a reliable
signal for whether a download is actually in progress.
Benefits:
- Cached models: no .incomplete files -> no stall fired even after 180s
- Post-download init (quantization, GPU loading): .incomplete files gone
so stall timer resets, long init phases are not killed
- Pre-download hangs (XET handshake stall): .incomplete files are
created at download start, so zero-byte stalls are now detected
- No more false positives from -1 to valid measurement transitions
The _get_hf_download_state function now returns (total_bytes,
has_incomplete) tuple or None on error, replacing _get_hf_cache_size.
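A simplified sketch of the probe; the real function accepts model names and locates their HF cache directories, while this one takes a blobs dir directly:

```python
from pathlib import Path
import tempfile

def get_hf_download_state(blobs_dir: Path):
    """Return (total_bytes, has_incomplete), or None when measurement
    fails so the caller can skip stall logic."""
    try:
        files = [f for f in blobs_dir.iterdir() if f.is_file()]
    except OSError:
        return None
    total = sum(f.stat().st_size for f in files)
    # *.incomplete files exist only while a download is in flight.
    has_incomplete = any(f.name.endswith(".incomplete") for f in files)
    return total, has_incomplete

blobs = Path(tempfile.mkdtemp()) / "blobs"
blobs.mkdir()
(blobs / "abc123").write_bytes(b"x" * 10)            # finished blob
state_cached = get_hf_download_state(blobs)          # no active download
(blobs / "def456.incomplete").write_bytes(b"y" * 4)  # download in flight
state_active = get_hf_download_state(blobs)
```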
* Add debug logging to download state exception handler
Log the exception at debug level when _get_hf_download_state fails,
instead of silently returning None. Helps with troubleshooting cache
measurement issues.
* Watch both adapter and base model repos for LoRA stall detection
When loading a LoRA adapter, the actual download bottleneck is often
the base model, not the adapter itself. Update the heartbeat to watch
both mc.identifier and mc.base_model cache directories so stall
detection works for LoRA loads where the base model stalls on Xet.
Also update _get_hf_download_state to accept multiple model names and
skip names without "/" (local paths) since those do not have HF cache
directories.
* Fix model name filtering for official HF models without org prefix
Models like gpt2 and bert-base-uncased do not contain a slash but are
still valid HF Hub models with cache directories. Replace the "/" check
with a proper local-path detection that checks for path separators and
path-like prefixes instead.
Also fix the base_model watch list to not require "/" in the base model
name, so official models used as LoRA bases are also monitored.
* Fix local path detection that broke all org/model names on Linux
The os.path.sep check matched "/" in HF model IDs like "org/model" on
Linux, causing the stall detector to skip ALL standard HF models.
Replace with a check that only skips names starting with "/" (absolute
paths), "." (relative paths), "~" (home-relative), or containing "\"
(Windows paths). HF model IDs like "org/model" or "gpt2" pass through
correctly on all platforms.
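The final heuristic reduces to something like this (function name hypothetical):

```python
def is_local_path(name: str) -> bool:
    # Only skip names that start like a filesystem path, so HF model IDs
    # such as "org/model" or "gpt2" pass through on all platforms.
    return name.startswith(("/", ".", "~")) or "\\" in name

hf_ids = ["org/model", "gpt2", "bert-base-uncased"]
local = ["/abs/path", "./rel/path", "~/models", r"C:\Models\x"]
```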
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix(studio): change default weight_decay from 0.01 to 0.001
The default weight decay across Studio was 0.01 but should be 0.001.
Updated the default in all backend fallbacks, the Pydantic model, the
frontend config, and every YAML preset/model-default config.
* fix(studio): auto-set learning rate based on training method
Default LR should be 2e-4 for LoRA/QLoRA and 2e-5 for full fine-tuning.
Frontend: track whether the user has manually edited the LR field via a
_learningRateManuallySet flag (same pattern as trainOnCompletions).
When switching training method and the user has not touched the LR,
auto-set it to the appropriate default. Reset the flag on model load.
Backend: change trainer.py start_training default from 5e-5 to 2e-4,
update default.yaml fallback from 5e-5 to 2e-4, and fix
full_finetune.yaml from 0.0002 (2e-4) to 2e-5.
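The method-to-LR mapping amounts to the following (method names illustrative):

```python
ADAPTER_METHODS = {"lora", "qlora"}

def default_lr(method: str) -> float:
    # 2e-4 for adapter methods, 2e-5 for full fine-tuning, per the commit.
    return 2e-4 if method in ADAPTER_METHODS else 2e-5
```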
* refactor(studio): centralize weight_decay and learning rate defaults
Create studio/backend/core/training/constants.py as the single source of
truth for DEFAULT_WEIGHT_DECAY (0.001), DEFAULT_LEARNING_RATE (2e-4),
DEFAULT_LEARNING_RATE_FULL (2e-5), and DEFAULT_LEARNING_RATE_STR ("2e-4").
All backend modules (trainer.py, training.py, worker.py, models/training.py)
now import from constants.py instead of hardcoding values.
On the frontend, add LR_DEFAULT_LORA and LR_DEFAULT_FULL to
config/training.ts and use them in the store instead of magic numbers.
A comment cross-references the backend constants file.
* Fix model-specific LR override, persist migration, and flag resets
- Preserve model-specific learning rates from YAML configs when the
async autoSelectTrainingMethod callback fires (fixes Qwen2.5-1.5B
getting 2e-4 instead of its configured 1e-5, etc.)
- Bump zustand persist version to 9 with migration so existing users
with weightDecay=0.01 get updated to 0.001
- Clear _learningRateManuallySet in reset() and applyConfigPatch()
for consistency with trainOnCompletions flag behavior
- Add DEFAULT_LEARNING_RATE_FULL_STR to constants.py
* Refine applyConfigPatch to only clear LR flag when patch includes LR
Only reset _learningRateManuallySet when the applied config patch
actually provides a learningRate value. This prevents unrelated config
patches from silently disarming the manual-edit guard, which would
cause a subsequent setTrainingMethod call to overwrite the user's
custom LR.
* Preserve model-specific LR when switching between qlora and lora
Only auto-switch the learning rate when the training category changes
(adapter <-> full fine-tuning). Switching between qlora and lora keeps
the current LR since both methods share the same learning rate range.
This preserves curated per-model defaults (e.g. 1e-5 for
Qwen2.5-1.5B-Instruct) when the user toggles between adapter methods.
* Remove constants.py, use YAML configs as the source of truth
The YAML config files (model-specific + default.yaml) are the intended
config layer for training defaults. The Python backend fallbacks now use
inline values that match the YAML configs, rather than importing from a
separate constants module. This keeps the config architecture simple:
YAML files are the single source of truth, and the inline Python
fallbacks are just safety nets that mirror them.
* fix(studio): preserve model-specific LR when switching training method
Stash YAML-provided learning rate and use it to restore the correct
value when switching between adapter and full fine-tune modes.
- qlora <-> lora no longer overwrites the model's LR
- full -> adapter restores the YAML LR instead of a hardcoded constant
- selecting a model while on full fine-tune uses LR_DEFAULT_FULL
instead of applying the YAML adapter LR
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
* fix: throttle and cache HuggingFace modelInfo API calls
The frontend was firing 40 to 60 parallel modelInfo requests on app
startup with zero caching or deduplication, causing HF rate limits.
Adds a caching layer (hf-cache.ts) with TTL cache, inflight request
dedup, and a concurrency limiter. Also debounces the HF token input
so typing a token no longer re-fires all model searches per keystroke.
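hf-cache.ts itself is TypeScript; as a language-neutral sketch, the three mechanisms (TTL cache, inflight request dedup, concurrency limiter) look like this in Python, with illustrative constants and function names:

```python
import asyncio
import time

CACHE_TTL = 300.0    # illustrative; the real hf-cache.ts values may differ
MAX_CONCURRENT = 3
_cache: dict = {}
_inflight: dict = {}
_sem = asyncio.Semaphore(MAX_CONCURRENT)
calls = 0

async def fetch_model_info(name):
    global calls
    calls += 1
    async with _sem:              # concurrency limiter
        await asyncio.sleep(0)    # stand-in for the HF API round-trip
        return {"name": name}

async def cached_model_info(name):
    hit = _cache.get(name)
    if hit and time.monotonic() - hit[0] < CACHE_TTL:  # TTL cache
        return hit[1]
    if name not in _inflight:                          # inflight dedup
        _inflight[name] = asyncio.ensure_future(fetch_model_info(name))
    try:
        result = await _inflight[name]
    finally:
        _inflight.pop(name, None)
    _cache[name] = (time.monotonic(), result)
    return result

async def main():
    # 40 parallel requests for the same model collapse to one fetch.
    return await asyncio.gather(
        *[cached_model_info("unsloth/llama") for _ in range(40)]
    )

results = asyncio.run(main())
```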
* fix: only fetch VRAM info for visible models in chat selector
* Fix cache key isolation and VRAM badge stability for PR #4696
- Cache key now includes a token fingerprint (last 8 chars) instead of a
boolean, so switching HF tokens gives separate cache entries instead of
serving stale data from the previous token.
- Extract token via credentials?.accessToken to match the @huggingface/hub
API surface.
- Extend CachedResult type with safetensors/tags fields so downstream
consumers no longer need unsafe `as` casts.
- Merge VRAM param map with previous state on scroll instead of replacing
it, preventing a brief flash of missing VRAM badges when new models
become visible.
* Fix VRAM badges missing for search-filtered recommended models
When a user types a search query, filteredRecommendedIds can include
models beyond the currently visible page. These models had no VRAM data
because useRecommendedModelVram only received visibleRecommendedIds.
Now we pass the union of visibleRecommendedIds and filteredRecommendedIds
to the VRAM hook, so recommended models surfaced by search also show
their VRAM badges. The hf-cache layer ensures no duplicate network calls.
* Apply biome formatting to hf-cache.ts and use-recommended-model-vram.ts
Auto-formatted with biome check --write to match project lint rules:
- Block statements for single-line if/for bodies
- Import sorting (type imports first)
- Consistent line wrapping
* Fix extractToken to handle both current and deprecated HF auth forms
The @huggingface/hub CredentialsParams type is a union:
- { accessToken: "hf_..." } (current preferred form)
- { credentials: { accessToken: "..." } } (deprecated form)
Previously only checked params.credentials?.accessToken (deprecated path).
Now checks both forms so the cache key is correct regardless of which
calling convention is used.
* Simplify extractToken, map merge, and set construction
- extractToken: remove type assertions, use direct property access with
truthiness checks for cleaner union type handling
- VRAM map merge: use Map spread constructor instead of manual for loop
- idsForVram: use Set spread construction for more concise dedup
* Add rationale comment for MAX_CONCURRENT=3 in hf-cache.ts
* Skip GGUF repos in VRAM fetch and pre-populate cache from listModels
Two changes to reduce redundant HF API calls:
1. Filter GGUF repos from idsForVram before passing to useRecommendedModelVram.
GGUF repos have no safetensors metadata and the render layer already shows
a static "GGUF" badge -- fetching modelInfo for them is a no-op that wastes
a semaphore slot and a network round-trip.
2. Add primeCacheFromListing() to hf-cache.ts and call it from listModels
yield sites in mergedModelIterator and priorityThenListingIterator.
listModels returns the same type (ModelEntry & Pick<ApiModelInfo, T>) as
modelInfo with the same additionalFields, so the data is interchangeable.
Priming only writes if the key is not already fresh, so it never overwrites
a recent modelInfo response.
This means models discovered via listModels are already in cache when
useRecommendedModelVram later calls cachedModelInfo for them, eliminating
duplicate network requests.
* Fix cache key mismatch: prime both token and anonymous slots
The VRAM hook calls cachedModelInfo without credentials (anonymous key),
but listModels results were primed only under the authenticated key.
For authenticated users the priming was a no-op -- cache miss every time.
Fix: prime both the token-specific slot and the anonymous slot when an
access token is present. Public model metadata (safetensors, tags) is
identical regardless of auth so this is safe.
Also add a defensive guard in primeCacheFromListing for empty name.
* Auto-prime anonymous cache slot from authenticated modelInfo fetches
When cachedModelInfo is called with a token, the result was only stored
under the token-specific key (e.g. model::abc12345). The VRAM hook
calls cachedModelInfo without credentials and reads the anonymous slot
(model::anon), causing a cache miss and duplicate fetch for every
priority model.
Now cachedModelInfo also writes to the anonymous slot on success when
a token is present. Public model metadata (safetensors, tags) is
identical regardless of auth, so this is safe and eliminates ~10
duplicate API calls on first page load.
* Guard anonymous cache priming against gated/private models
Only prime the anonymous cache slot for non-gated, non-private models.
Previously, authenticated modelInfo responses and listing results were
unconditionally copied into the anonymous slot, which could briefly
expose gated/private model metadata after clearing the HF token.
Now checks result.gated and result.private before writing the anon slot.
Public unsloth/ models (the common case) still benefit from the
optimization; gated models like meta-llama/* require a fresh fetch
per auth context.
* Extract primeFromListing helper to deduplicate cache priming logic
The cache priming pattern (prime token slot + conditionally prime anon
slot for non-gated models) was duplicated in three places. Extracted
into a single primeFromListing() function for maintainability.
* Export CachedResult type, add isStale helper, simplify primeFromListing
- Export CachedResult so consumers can use it directly instead of
the indirect Parameters<typeof ...> pattern.
- Extract isStale(key) helper to deduplicate the cache freshness
check that was repeated in primeCacheFromListing, cachedModelInfo,
and the anonymous-slot priming logic.
- Simplify primeFromListing to use CachedResult directly for both
the data parameter and the gated/private guard, eliminating the
double cast.
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Revert to balanced for inference
* Remove unused for_inference parameter from get_device_map
Since inference and training both use "balanced" now, the for_inference
flag is dead code. Remove it from the function signature, the call site
in inference.py, and simplify the tests accordingly.
* Remove redundant TestDeviceMapForInference test class
TestGpuAutoSelection already covers the same multi-gpu and single-gpu
device_map assertions. The TestDeviceMapForInference class was left
over from when for_inference had distinct behavior.
* Remove redundant test_get_device_map_multi_gpu_uses_balanced
Its assertions ([0,1] -> balanced, [0] -> sequential) are already
covered by test_get_device_map_uses_explicit_gpu_selection.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix(studio): open tour ReadMore links in new tab
The quick tour "Read more" links navigate away from Studio instead of
opening in a separate tab. Add target="_blank" and rel="noopener
noreferrer" to the ReadMore component so external doc links open in a
new browser tab.
* fix(studio): only open external ReadMore links in new tab
Apply target="_blank" conditionally based on whether the href starts
with "http", so internal links still navigate in the same tab.
* Tighten external-link detection in ReadMore component
Use regex /^https?:\/\// instead of startsWith("http") so the check
requires the full protocol prefix and does not match non-URL strings
that happen to begin with "http".
* Hoist regex to module scope for ReadMore
Move EXTERNAL_URL_RE to top-level constant to satisfy the biome
useTopLevelRegex lint rule and avoid re-creating the RegExp on
every render.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
* studio: gate multimodal incompatibility warning on settled model capabilities
* Also disable Start button during isCheckingVision fallback
When getModelConfig fails and the fallback checkVisionModel is still
in-flight, isLoadingModelDefaults clears before isCheckingVision does.
Without also gating on isCheckingVision the Start button briefly
re-enables with stale capability flags.
Add isCheckingVision to the disabled condition and show "Loading
model..." text while either flag is active.
* Show correct error message for audio dataset incompatibility
The incompatibility warning always said "switch to a vision model"
even when the actual issue was an audio dataset on a non-audio model.
Now shows an audio-specific message when the mismatch is audio.
* Extract isLoadingModel constant for clarity
Pull the combined model-loading condition into a single constant
reused by the settled check, the disabled prop, and the button label.
---------
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
The 180s wall-clock timeout would kill model loads on slow connections
even when the download was actively progressing. Now the worker sends
heartbeat status messages every 30s during loading, and the orchestrator
resets its 300s deadline on each one — so it only times out when the
subprocess goes truly silent.
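The heartbeat-reset behavior can be sketched as a small watchdog (a hypothetical `LoadWatchdog` helper with an injectable clock for testing; names are illustrative, not the orchestrator's actual code):

```python
import time

HEARTBEAT_INTERVAL = 30   # worker sends a status message every 30s
DEADLINE = 300            # orchestrator deadline, reset on each heartbeat

class LoadWatchdog:
    """Times out only when the subprocess goes silent,
    not when a download is slow but progressing."""

    def __init__(self, deadline: float = DEADLINE, clock=time.monotonic):
        self._deadline = deadline
        self._clock = clock
        self._expires_at = clock() + deadline

    def on_heartbeat(self) -> None:
        # Any status message from the worker pushes the deadline forward.
        self._expires_at = self._clock() + self._deadline

    def timed_out(self) -> bool:
        return self._clock() >= self._expires_at
```

With a fake clock: a heartbeat at t=290 extends the deadline to t=590, so a still-downloading load survives past the original 300s mark.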
* fix: skip download progress polling for exported GGUF models
* fix: revert isLocalGgufDir change — exported GGUFs are file paths, not dirs
* fix: set isDownloaded true for all adapters in LoraModelPicker
* fix(studio): replace unicode emoji in print() to avoid cp1252 crash on Windows
On Windows the default console encoding is cp1252, which cannot encode
Unicode characters such as U+2705 or U+26A0. Bare print() calls with these
characters raise a UnicodeEncodeError at runtime.
- run.py: replace emoji with ASCII status prefixes [OK] and [WARNING]
- format_conversion.py: remove duplicate print() that mirrors the
logger.info() call on the next line, and drop the emoji from the
log message since loggers handle encoding separately
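The crash is easy to reproduce without a Windows console, since the failure lives in the codec itself:

```python
# U+2705 (a check-mark emoji) has no cp1252 representation, so the
# encode step that print() performs on a cp1252 console raises.
try:
    "\u2705 done".encode("cp1252")
    raised = False
except UnicodeEncodeError:
    raised = True
assert raised

# ASCII status prefixes round-trip on every console encoding:
assert "[OK] done".encode("cp1252") == b"[OK] done"
assert "[WARNING] degraded".encode("cp1252") == b"[WARNING] degraded"
```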
* fix(studio): apply same emoji/print cleanup to parallel VLM conversion path
The parallel URL-based conversion logic has the same duplicate print()
with emoji that was fixed in the sequential path. Remove the bare
print() and drop the emoji from the logger.info() call.
* Treat install_python_stack.py failure as fatal in setup.ps1
On Linux/Mac, setup.sh runs under set -euo pipefail so a non-zero
exit from install_python_stack.py aborts the installer. On Windows,
setup.ps1 had no exit code check -- if the Python script crashed
(e.g. from the cp1252 UnicodeEncodeError), the installer silently
continued past the dependency loop and reported success. Studio
would then fail at launch with ModuleNotFoundError for structlog,
fastapi, and other deps that were never installed.
Capture $LASTEXITCODE and exit 1 if the dependency installer fails,
matching the error handling pattern already used for PyTorch install.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* fix: scope packages.find to prevent node_modules namespace scanning
The packages.find section had no include filter, so setuptools'
find_namespace_packages discovered all directories as potential Python
packages -- including the 6,557 directories inside
studio/frontend/node_modules/ after the frontend build step.
This caused the editable install overlay step to run 20,000+ glob
operations across 6,619 "packages", which on fast NVMe takes ~5s but
on slower disks can take 7+ minutes.
Adding an explicit include filter scopes discovery to only the packages
we actually ship (unsloth, unsloth_cli, studio, studio.backend), dropping
from 6,619 to 58 discovered packages and the editable build time from
5.4s to 1.2s.
Also removes the broken kernels/moe exclude (used "/" instead of "."
notation so it never matched) and adds a node_modules exclude as a
safety net.
* fix: use precise node_modules exclude patterns
Use "*.node_modules" and "*.node_modules.*" instead of "*.node_modules*"
to avoid accidentally excluding valid packages that might contain
"node_modules" as a substring in their name.
* [WIP] balanced device map for studio
* gpus as a request parameter
* API for multi GPU stuff
* return multi gpu util in new API
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use balanced_low0 instead of balanced
* Use balanced_low0 instead of balanced
* Fix device_map typo, UUID parsing crash, set() filter bug, and broken tests
- balanced_low0 -> balanced_low_0 (transformers/accelerate rejects the old string)
- get_parent_visible_gpu_ids() now handles UUID/MIG CUDA_VISIBLE_DEVICES
gracefully instead of crashing on int() parse
- _get_backend_visible_gpu_info(): fix a `set() or None` bug -- an empty
  set is falsy, so CUDA_VISIBLE_DEVICES=-1 would disable filtering and
  report all GPUs
- test_gpu_selection.py: add missing get_visible_gpu_utilization import and
add required job_id arg to start_training() calls
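The graceful UUID/MIG handling amounts to refusing to guess indices when tokens are non-numeric. A minimal sketch (hypothetical function name and return convention, modeled on the get_parent_visible_gpu_ids behavior described above):

```python
from typing import Optional

def parse_visible_gpu_ids(value: Optional[str]) -> list:
    """Parse CUDA_VISIBLE_DEVICES-style values. Numeric tokens become
    integer ids; any UUID/MIG token (e.g. "GPU-8f4e..." or "MIG-...")
    makes the whole set unresolvable, so return [] instead of
    crashing on int(). Illustrative sketch, not the Studio code."""
    if not value:
        return []
    ids = []
    for token in value.split(","):
        token = token.strip()
        if not token:
            continue
        try:
            ids.append(int(token))
        except ValueError:
            return []  # non-numeric token: indices cannot be resolved
    return ids

assert parse_visible_gpu_ids("0,1") == [0, 1]
assert parse_visible_gpu_ids("GPU-8f4e,GPU-a1b2") == []
assert parse_visible_gpu_ids("MIG-abc/1/0") == []
```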
* Smart GPU determinism using estimates
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* disallow gpu selection for gguf for now
* cleanup
* Slightly larger baseline
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Treat empty list as auto
* Verbose logging/debug
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Cleanup and revert unnecessary deletions
* Cleanup excessive logs and guard against disk/cpu offload
* auth for visibility API. cleanup redundant imports. Adjust QLoRA estimate
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* support for non cuda gpus
* Fix multi-GPU auto-selection memory accounting
The multi_gpu_factor was applied uniformly to all GPUs including the
first one, which unfairly penalizes single-GPU capacity when
transitioning to multi-GPU. This created a discontinuity where a model
that barely fits 1 GPU would suddenly require 2 GPUs because the first
GPU's free memory was discounted by 20%.
Now the first GPU keeps its full free memory, and only additional GPUs
have an overhead factor (0.85) applied to account for inter-GPU
communication and sharding overhead. This gives more accurate
auto-selection and avoids unnecessary multi-GPU for models that
comfortably fit on one device.
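The corrected accounting can be sketched as (an illustrative helper, not the Studio implementation; the 0.85 factor is the one named above):

```python
MULTI_GPU_OVERHEAD = 0.85  # applied only to GPUs beyond the first

def effective_capacity(free_mem_per_gpu: list) -> float:
    """First GPU counts at full free memory; each additional GPU is
    discounted for inter-GPU communication and sharding overhead."""
    if not free_mem_per_gpu:
        return 0.0
    first, *rest = free_mem_per_gpu
    return first + sum(mem * MULTI_GPU_OVERHEAD for mem in rest)

# A model needing 23 GB still fits one 24 GB GPU: no discontinuity
# where discounting the first GPU (24 * 0.85 = 20.4) would suddenly
# force a second GPU into the plan.
assert effective_capacity([24.0]) == 24.0
assert effective_capacity([24.0, 24.0]) == 24.0 + 24.0 * 0.85
```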
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add sandbox tests for multi-GPU selection logic
24 tests covering model size estimation, memory requirements, automatic
GPU selection, device map generation, GPU ID validation, and multi-GPU
overhead accounting. All tests use mocks so they run without GPUs on
Linux, macOS, and Windows.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix reviewer findings: 4bit inference estimate, fallback, GGUF gpu_ids, retry
1. 4-bit inference now uses reduced memory estimate (model_size/3 + buffer)
instead of the FP16 1.3x multiplier. This prevents over-sharding
quantized models across unnecessary GPUs.
2. When model size estimation fails, auto_select_gpu_ids now falls back to
all visible GPUs instead of returning None (which could default to
single-GPU loading for an unknown-size model).
3. GGUF inference route now treats gpu_ids=[] as auto-selection (same as
None) instead of rejecting it as an unsupported explicit request.
4. Training retry path for "could not get source code" now preserves the
gpu_ids parameter so the retry lands on the same GPUs.
5. Updated sandbox tests to cover the new 4-bit inference estimate branch.
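The 4-bit inference estimate from point 1 can be sketched as follows (the buffer size and function signature are assumptions for illustration; only the `model_size/3 + buffer` vs FP16 `1.3x` split comes from the commit):

```python
def inference_vram_estimate_gb(model_size_gb: float,
                               load_in_4bit: bool,
                               buffer_gb: float = 2.0) -> float:
    """4-bit inference uses a reduced estimate (model_size/3 + buffer)
    so quantized models are not over-sharded across GPUs; FP16 keeps
    the 1.3x multiplier. buffer_gb is an assumed value."""
    if load_in_4bit:
        return model_size_gb / 3 + buffer_gb
    return model_size_gb * 1.3

# For a 30 GB model: 12 GB quantized vs 39 GB in FP16.
assert inference_vram_estimate_gb(30.0, load_in_4bit=True) == 12.0
assert abs(inference_vram_estimate_gb(30.0, load_in_4bit=False) - 39.0) < 1e-9
```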
* Remove accidentally added unsloth-zoo submodule
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix UUID/MIG visibility and update test expectations
1. nvidia.py: When CUDA_VISIBLE_DEVICES uses UUID/MIG tokens, the
visibility APIs now return "unresolved" with empty device lists instead
of exposing all physical GPUs. This prevents the UI from showing GPUs
that the backend process cannot actually use.
2. test_gpu_selection.py: Updated test expectations to match the new
multi-GPU overhead accounting (first GPU at full capacity, 0.85x for
additional GPUs) and 4-bit inference memory estimation formula.
All 60 tests now pass.
* Add CPU/disk offload guard to audio inference path
The audio model loading branch returned before the common
get_offloaded_device_map_entries() check, so audio models loaded with a
multi-GPU device_map that spilled layers to CPU/disk would be accepted
instead of rejected. Now audio loads also verify no modules are offloaded.
* Improve VRAM requirement estimates
* Replace balanced_low_0 with balanced
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Refine calculations to produce slightly rounder numbers
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* adjust estimates
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Use plain numbers instead of objects to avoid a serialisation error
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Harden nvidia-smi parsing and fix fallback GPU list
1. nvidia.py: Wrap int() casts for GPU index and memory in try/except
so MIG slices, N/A values, or unexpected nvidia-smi output skip the
unparseable row instead of aborting the entire GPU list.
2. nvidia.py: Handle GPU names containing commas by using the last
field as memory instead of a fixed positional index.
3. hardware.py: fallback_all now uses gpu_candidates (GPUs with verified
VRAM data) instead of raw devices list, which could include GPUs
with null VRAM that were excluded from the ranking.
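Points 1 and 2 together amount to a defensive row parser. A sketch (hypothetical function, assuming an `index, name, memory` CSV query; the real nvidia.py may query different fields):

```python
def parse_smi_row(line: str):
    """Parse one 'index, name, memory' CSV row from nvidia-smi.
    Returns None for unparseable rows (MIG slices, 'N/A' values)
    instead of aborting the whole GPU list. GPU names may themselves
    contain commas, so memory is read from the LAST field rather
    than a fixed positional index."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) < 3:
        return None
    try:
        index = int(fields[0])
        memory_mib = int(fields[-1])
    except ValueError:
        return None  # skip this row, keep the rest of the list
    name = ", ".join(fields[1:-1])
    return {"index": index, "name": name, "memory_mib": memory_mib}

assert parse_smi_row("0, NVIDIA RTX 6000 Ada, 49140")["memory_mib"] == 49140
# Comma inside the GPU name is absorbed into the name field:
assert parse_smi_row("1, Tesla K80, 12GB, 11441")["name"] == "Tesla K80, 12GB"
assert parse_smi_row("N/A, MIG 1g.5gb, N/A") is None
```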
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* cleanup
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* consolidate raise_if_offload
* Improve MoE support. Guard against nvidia-smi failures
* Improve MoE support. Guard against nvidia-smi failures
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Fix shared-expert LoRA undercount, torch VRAM fallback, and apply_gpu_ids edge case
1. vram_estimation.py: compute_lora_params now includes shared experts
(n_shared_experts) alongside routed experts when computing MoE LoRA
adapter parameters. Previously only n_experts were counted, causing
the estimator to undercount adapter, optimizer, and gradient memory
for DeepSeek/GLM-style models with shared experts.
2. hardware.py: _torch_get_per_device_info now uses mem_get_info (which
reports system-wide VRAM usage) instead of memory_allocated (which
only reports this process's PyTorch allocations). This prevents
auto-selection from treating a GPU as mostly free when another
process is consuming VRAM. Falls back to memory_allocated when
mem_get_info is unavailable.
3. hardware.py: apply_gpu_ids([]) now returns early instead of setting
CUDA_VISIBLE_DEVICES="" which would disable CUDA entirely. Empty
list inherits the parent visibility, same as None.
4. hardware.py: Upgraded fallback_all GPU selection log from debug to
warning so operators are notified when the model likely will not fit
in available VRAM.
* Guard nvidia-smi subprocess calls against OSError and TimeoutExpired
get_visible_gpu_utilization and get_backend_visible_gpu_info now catch
OSError (nvidia-smi not found) and TimeoutExpired internally instead
of relying on callers to wrap every invocation. Returns the standard
available=False sentinel on failure so the torch-based fallback in
hardware.py can take over.
* Guard get_primary_gpu_utilization and reset GPU caches between tests
1. nvidia.py: get_primary_gpu_utilization now catches OSError and
TimeoutExpired internally, matching the pattern already used in
get_visible_gpu_utilization and get_backend_visible_gpu_info. All
three nvidia-smi callers are now self-contained.
2. test_gpu_selection.py: Added _GpuCacheResetMixin that resets the
module-level _physical_gpu_count and _visible_gpu_count caches in
tearDown. Applied to all test classes that exercise GPU selection,
device map, or visibility functions. This prevents stale cache
values from leaking between tests and causing flaky results on
machines with real GPUs.
* Fix nvidia-smi fallback regression and physical GPU count validation
1. hardware.py: get_gpu_utilization, get_visible_gpu_utilization, and
get_backend_visible_gpu_info now check result.get("available") before
returning the nvidia-smi result. When nvidia-smi is unavailable or
returns no data (e.g., containers without nvidia-smi, UUID/MIG masks),
the functions fall through to the torch-based fallback instead of
returning an empty result. This fixes a regression where the internal
exception handling in nvidia.py prevented the caller's except block
from triggering the fallback.
2. hardware.py: resolve_requested_gpu_ids now separates negative-ID
validation from physical upper-bound validation. The physical count
check is only enforced when it is plausibly a true physical count
(i.e., higher than the largest parent-visible ID), since
torch.cuda.device_count() under CUDA_VISIBLE_DEVICES returns the
visible count, not the physical total. The parent-visible-set check
remains authoritative in all cases. This prevents valid physical IDs
like [2, 3] from being rejected as "out of range" when nvidia-smi is
unavailable and CUDA_VISIBLE_DEVICES="2,3" makes torch report only
2 devices.
* Fix UUID/MIG torch fallback to enumerate devices by ordinal
When CUDA_VISIBLE_DEVICES uses UUID or MIG identifiers,
get_parent_visible_gpu_ids() returns [] because the tokens are
non-numeric. The torch fallback in get_visible_gpu_utilization() and
get_backend_visible_gpu_info() previously passed that empty list to
_torch_get_per_device_info(), getting nothing back.
Now both functions detect the empty-list case and fall back to
enumerating torch-visible ordinals (0..device_count-1) with
index_kind="relative". This means the UI and auto-selection still
see real device data in Kubernetes, MIG, and Slurm-style UUID
environments where nvidia-smi output cannot be mapped to physical
indices.
Updated test_uuid_parent_visibility to verify the new torch fallback
path returns available=True with relative ordinals.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add type hint for gpu_ids parameter in InferenceOrchestrator.load_model
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Fixes #4670
Separates the GGUF context slider ceiling from the currently active context length so lowering context via Chat Settings no longer locks the slider max to the reduced value.
- Backend: adds `max_context_length` to GGUF load/status responses, computed from the largest VRAM/KV-fit cap across all usable GPU subsets
- Frontend: stores `ggufMaxContextLength` and uses it for Context Length slider/input bounds; hydrates from both `/api/inference/load` and `/api/inference/status`
- Defaults UI ceiling to native context for CPU-only and fallback paths
- Seeds `effective_ctx` and `max_available_ctx` before GPU probing to prevent `UnboundLocalError` on probe failure
- Property fallback uses native `_context_length`, not effective `context_length`
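The seeding fix is the standard initialise-before-try pattern. A sketch with hypothetical names (`probe_gpus` and the tuple shape are illustrative; only the seeded variables come from the change above):

```python
def probe_max_context(native_ctx: int, probe_gpus):
    """Seed effective_ctx / max_available_ctx BEFORE GPU probing so
    a probe exception cannot leave them unbound (the UnboundLocalError
    mentioned above). Illustrative sketch, not the backend code."""
    effective_ctx = native_ctx       # seeded defaults
    max_available_ctx = native_ctx
    try:
        effective_ctx, max_available_ctx = probe_gpus()
    except Exception:
        pass  # probe failed: fall back to the seeded native context
    return effective_ctx, max_available_ctx

def failing_probe():
    raise RuntimeError("no usable GPU subset")

assert probe_max_context(8192, failing_probe) == (8192, 8192)
assert probe_max_context(8192, lambda: (4096, 16384)) == (4096, 16384)
```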
* refactor(studio): unify setup terminal output style and add verbose setup mode
* studio(windows): align setup.ps1 banner/steps with setup.sh (ANSI, verbose)
* studio(setup): revert nvcc path reordering to match main
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* studio(setup): restore fail-fast llama.cpp setup flow
* studio(banner): use IPv6 loopback URL when binding :: or ::1
* Fix IPv6 URL bracketing, try_quiet stderr, _step label clamp
- Bracket IPv6 display_host in external_url to produce clickable URLs
- Redirect try_quiet failure log to stderr instead of stdout
- Clamp _step label to column width to prevent negative padding
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Add sandbox integration tests for PR #4494 UX fixes
Simulation harness (tests/simulate_pr4494.py) creates an isolated uv
venv, copies the real source files into it, and runs subprocess tests
for all three fixes with visual before/after demos and edge cases.
Standalone bash test (tests/test_try_quiet.sh) validates try_quiet
stderr redirect across 8 scenarios including broken-version contrast.
39 integration tests total (14 IPv6 + 15 try_quiet + 10 _step), all
existing 75 unit tests still pass.
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
* Truncate step() labels in setup.sh to match PS1 and Python
The %-15s printf format pads short labels but does not truncate long
ones. Change to %-15.15s so labels wider than 15 chars are clipped,
matching the PowerShell .Substring(0,15) and Python label[:15] logic.
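The three implementations now agree on pad-and-truncate semantics. In Python, the printf `%-15.15s` behavior is a single format spec (a small illustration of the `label[:15]` logic named above):

```python
def clamp_label(label: str, width: int = 15) -> str:
    """Python analogue of printf '%-15.15s': left-justify short labels
    to `width` AND clip labels longer than `width`."""
    return f"{label:<{width}.{width}}"

# Short labels are padded, long ones are clipped to exactly 15 chars.
assert clamp_label("deps") == "deps".ljust(15)
assert clamp_label("a-very-long-step-label") == "a-very-long-ste"
assert len(clamp_label("a-very-long-step-label")) == 15
```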
* Remove sandbox integration tests from PR
These test files are not part of the styling fix and should not
ship with this PR.
* Show error output on failure instead of suppressing it
- install_python_stack.py: restore _red for patch_package_file
warnings (was downgraded to _dim)
- setup.ps1: capture winget output and show on failure for CUDA,
Node, Python, and OpenSSL installs (was piped to Out-Null)
- setup.ps1: always show git pull failure warning, not just in
verbose mode
* Show winget error output for Git and CMake installs on failure
Same capture-and-print-on-failure pattern already used for
Node, Python, CUDA, and OpenSSL winget installs.
* fix: preserve stderr for _run_quiet error messages in setup.sh
The step() helper writes to stdout, but _run_quiet's error header
was originally sent to stderr (>&2). Without the redirect, callers
that separate stdout/stderr would miss the failure headline while
still seeing the log body on stderr. Add >&2 to both step calls
inside _run_quiet to match main's behavior.
* feat: add --verbose flag to setup and update commands
Wire UNSLOTH_VERBOSE=1 through _run_setup_script() so that
'unsloth studio update --verbose' (and the deprecated 'setup')
passes the flag to setup.sh / setup.ps1 / install_python_stack.py.
* fix(studio): honor verbose logging and keep llama.cpp failures non-blocking
* fix(studio): switch installer to 'studio update' and normalize Windows setup logs
* chore(studio): refine localhost tip and remove skip-base setup noise
* fix(studio): align Windows setup logs with Linux style and improve startup tips
* fix(studio): align Windows setup logs with Linux style
* refactor(windows-installer): align install/setup logs with Linux style and silence auto-launch output
* refactor(windows): align installer/setup output with Linux style and reduce default verbosity
* refactor(windows): match install.ps1 output style/colors to setup and quiet default logs
* fix(studio-banner): update personal-computer localhost tip
* fix(setup.sh): restore verbose llama.cpp build output while keeping default quiet mode
* fix(install.sh): align installer logging with setup style and restore POSIX-safe color output
* fix(install.sh): preserve installer reliability and launch visibility
Export verbose mode for child setup processes, harden install command handling under set -e, and keep first-run studio launch non-silent so users can always see URL and port fallback output.
* fix(windows installer): keep exit semantics and degrade status accurate
Use quiet command redirection that preserves native exit codes, keep startup output visible on first launch, and report limited install status when llama.cpp is unavailable.
* fix(setup.sh): improve log clarity and enforce GGUF degraded signaling
Restore clean default setup output, add verbose-only diagnostics, fail fast on Colab dependency install errors, and return non-zero when GGUF prerequisites or llama.cpp artifacts are unavailable.
* fix(installer): harden bash preflight and PowerShell GPU checks
Fail fast when bash is unavailable before invoking setup.sh, and replace remaining nvidia-smi pipeline checks with stream redirection patterns that preserve reliable native exit-code handling.
* fix(windows): keep verbose output visible while preserving exit codes
Ensure PowerShell wrapper helpers in install/update stream native command output to host without returning it as function output, so npm logs no longer corrupt exit-code checks in verbose mode.
* fix(windows): avoid sticky UNSLOTH_VERBOSE and gate studio update verbosity
* Fix degraded llama.cpp exit code, PS verbose stderr, banner URLs, npm verbose
- setup.sh: Do not exit non-zero when llama.cpp is unavailable; the footer
already reports the limitation, and install.sh runs under set -e so a
non-zero exit aborts the entire install including PATH/shortcuts/launch.
- setup.ps1: Remove $? check in Invoke-SetupCommand verbose path; PS 5.1
sets $? = $false when native commands write to stderr even with exit 0.
Merge stderr into stdout with 2>&1 and rely solely on $LASTEXITCODE.
- startup_banner.py: Show the actual bound address when Studio is bound to
a non-loopback interface instead of always showing 127.0.0.1/localhost.
- setup.sh: Use run_quiet_no_exit instead of run_quiet_no_exit_always for
npm install steps so --verbose correctly surfaces npm output.
* Fix install.ps1 verbose stderr, propagate UNSLOTH_VERBOSE, fix git clone verbose
- install.ps1: Apply same Invoke-InstallCommand fix as setup.ps1 -- merge
stderr into stdout with 2>&1 and drop the $? check that misclassifies
successful native commands on PS 5.1.
- install.ps1 + setup.ps1: Export UNSLOTH_VERBOSE=1 to the process env
when --verbose is passed so child processes like install_python_stack.py
also run in verbose mode.
- setup.sh: Use run_quiet_no_exit for git clone llama.cpp so --verbose
correctly surfaces clone diagnostics during source-build fallback.
* Surface prebuilt llama.cpp output in verbose mode, remove dead code, fix banner
- setup.sh: Use tee in verbose mode for prebuilt llama.cpp installer so
users can see download/validation progress while still capturing the log
for structured error reporting on failure.
- setup.ps1: Same fix for Windows -- use Tee-Object in verbose mode.
- setup.sh: Remove run_quiet_no_exit_always() which has no remaining callers.
- startup_banner.py: Avoid printing the same URL twice when Studio is
bound to a specific non-loopback address that matches the display host.
* Fix run_install_cmd exit code after failed if-statement
The previous pattern 'if "$@"; then return 0; fi; _rc=$?' always captured
$? = 0 because $? reflects the if-statement result, not the command's exit
code. Switch to '"$@" && return 0; _rc=$?' which preserves the actual
command exit code on failure. Applies to both verbose and quiet branches.
* Fix _run_quiet exit code, double uv install, missing --local flag
- setup.sh: Fix _run_quiet verbose path that always captured exit code 0
due to $? resetting after if-then-fi with no else. Switch to the same
'"$@" && return 0; exit_code=$?' pattern used in install.sh.
- setup.sh: Consolidate the two uv install branches (verbose + quiet)
into a single attempt with conditional output. Previously, when verbose
mode was on and the install failed, a second silent attempt was made.
- install.ps1: Pass --local flag to 'unsloth studio update' when
$StudioLocalInstall is true. Without this, studio.py's update() command
overwrites STUDIO_LOCAL_INSTALL to "0", which could cause issues if
setup.ps1 or install_python_stack.py later checks that variable.
* Revert SKIP_STUDIO_BASE change for --no-torch, restore install banners
- Revert SKIP_STUDIO_BASE from 0 to 1 for --no-torch. install.sh already
installs unsloth+unsloth-zoo and no-torch-runtime.txt before calling
setup.sh, so letting install_python_stack.py redo it was redundant and
slowed down --no-torch installs for no benefit.
- Restore the "Unsloth Studio installed!" success banner and "starting
Unsloth Studio..." launch message so users get clear install completion
feedback before the server starts.
* Make llama.cpp build failure a hard error with proper cleanup
- setup.sh: Restore exit 1 when _LLAMA_CPP_DEGRADED is true. GGUF
inference requires a working llama.cpp build, so this should be a
hard failure, not a silent degradation.
- install.sh: Catch setup.sh's non-zero exit with '|| _SETUP_EXIT=$?'
instead of letting set -e abort immediately. This ensures PATH setup,
symlinks, and shortcuts still get created so the user can fix the
build deps and retry with 'unsloth studio update'. After post-install
steps, propagate the failure with a clear error message.
* Revert install.ps1 to 'studio setup' to preserve SKIP_STUDIO_BASE
'studio update' pops SKIP_STUDIO_BASE from the environment, which
defeats the fast-path version check added in PR #4667. When called
from install.ps1 (which already installed packages), SKIP_STUDIO_BASE=1
must survive into setup.ps1 so it skips the redundant PyPI check and
package reinstallation. 'studio setup' does not modify env vars.
* Remove deprecation message from 'studio setup' command
install.ps1 uses 'studio setup' (not 'studio update') to preserve
SKIP_STUDIO_BASE. The deprecation message was confusing during first
install since the user never typed the command.
* Fix stale env vars, scope degraded exit, generic error message for PR #4651
- install.ps1: Always set STUDIO_LOCAL_INSTALL and clear STUDIO_LOCAL_REPO
when not using --local, to prevent stale values from a previous --local
run in the same PowerShell session. Fix log messages to say 'setup' not
'update' since we call 'studio setup'.
- setup.sh: Only exit non-zero for degraded llama.cpp when called from the
installer (SKIP_STUDIO_BASE=1). Direct 'unsloth studio update' keeps
degraded installs successful since Studio is still usable for non-GGUF
workflows and the footer already reports the limitation.
- install.sh: Make the setup failure error message generic instead of
GGUF-specific, so unrelated failures (npm, Python deps) do not show
misleading cmake/git recovery advice.
* Show captured output on failure in quiet mode for PR #4651
Both Invoke-InstallCommand (install.ps1) and Invoke-SetupCommand
(setup.ps1) now capture command output in quiet mode and display it
in red when the command fails. This matches the behavior of
run_install_cmd in install.sh where failure output is surfaced even
in quiet mode, making cross-platform error debugging consistent.
* Match degraded llama.cpp exit on Windows, fix --local recovery hint for PR #4651
- setup.ps1: Exit non-zero for degraded llama.cpp when called from
install.ps1 (SKIP_STUDIO_BASE=1), matching setup.sh behavior. Direct
'unsloth studio update' keeps degraded installs successful.
- install.sh: Show 'unsloth studio update --local' in the recovery
message when the install was run with --local, so users retry with
the correct flag instead of losing local checkout context.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* fix: add PyPI version check to setup.ps1 for fast update path
Port the update-flow logic from setup.sh to setup.ps1 so that
`unsloth studio update` on Windows skips Python dependency reinstall
when the installed version already matches PyPI latest.
* fix: clear SKIP_STUDIO_BASE in update command
install.ps1 sets SKIP_STUDIO_BASE=1 which persists in the PowerShell
session. If the user runs `unsloth studio update` in the same terminal,
the env var causes the version check to be skipped. Clear it explicitly
in the update command.
* fix: harden version check and clear stale env vars in update flow
- Normalize $InstalledVer with Out-String + Trim() to avoid array/whitespace
comparison issues in PowerShell 5.1 (python output can be captured as
string[] instead of scalar string)
- Move Fast-Install --upgrade pip inside if (-not $SkipPythonDeps) so the
fast path avoids unnecessary network round-trips
- Clear STUDIO_LOCAL_REPO when --local is not passed to prevent a previous
--local session from leaking into a plain update
---------
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
* Fix blank page on Windows due to broken .js MIME type in registry
* Update studio/backend/main.py
Apply Gemini's defensive suggestion: scope the mimetypes fix to Windows platforms only
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>