Commit graph

4939 commits

Author SHA1 Message Date
Michael Han
e2fd946fe1
Add files via upload 2026-04-02 03:00:10 -07:00
Michael Han
31d6aeb197
Unsloth new logo 2026-04-02 02:58:21 -07:00
Daniel Han
e4d1499230
fix(studio): prevent small models from stalling on tool-calling tasks (#4769)
* fix(studio): prevent small models from stalling on tool-calling tasks

Small GGUF models (< 9B params) in "Think, Search, Code" mode would
often describe what they planned to do ("Let me create this dashboard")
and then stop generating without ever calling a tool.

Three changes:

1. Simplify web_tips for small models: remove the "fetch its full content
   by calling web_search with the url parameter" guidance for models < 9B.
   This multi-step instruction causes small models to plan elaborate
   search-then-fetch-then-code sequences they cannot reliably execute.

2. Add "always call tools directly" imperative to the system prompt nudge
   so models act immediately instead of narrating their intentions.

3. Add plan-without-action re-prompt in the agentic loop: when the model
   emits planning text (matching patterns like "let me", "I'll", etc.)
   without calling any tool, inject a nudge asking it to call the tool
   and continue the loop. Capped at 2 re-prompts per request.

Benchmarked with Qwen3.5-4B-GGUF (N=5 trials per variant):
- Baseline: 40% of requests had any tool call
- Combined fix: 100% of requests had at least one tool call
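The plan-without-action guard in change 3 can be sketched as follows (a minimal illustration; the actual pattern list and function names in the PR may differ):

```python
import re

# Illustrative planning-phrase patterns; the real list in the PR may be longer.
_PLAN_PATTERNS = re.compile(r"\b(let me|i'll|i will)\b", re.IGNORECASE)

MAX_REPROMPTS = 2  # cap from the commit message


def needs_tool_nudge(text, tool_calls, reprompts_used):
    """True when the model narrated a plan without calling any tool."""
    if tool_calls or reprompts_used >= MAX_REPROMPTS:
        return False
    return bool(_PLAN_PATTERNS.search(text))
```

When this returns True, the agentic loop injects the nudge message and continues instead of ending the turn.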

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-02 02:11:07 -07:00
Daniel Han
dc0729aadf
Add regression test for shell injection fix in GGML conversion (#4773)
AST-based test ensures subprocess.Popen calls in GGML conversion functions
use argv lists instead of shell=True. Companion to PR #4768.
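A minimal version of such an AST check might look like this (illustrative; the structure of the real regression test may differ):

```python
import ast


def popen_calls_use_argv_lists(source):
    """Return False if any subprocess.Popen call passes shell=True."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_popen = (
            isinstance(func, ast.Attribute) and func.attr == "Popen"
        ) or (isinstance(func, ast.Name) and func.id == "Popen")
        if not is_popen:
            continue
        for kw in node.keywords:
            # Flag a literal shell=True keyword argument.
            if (kw.arg == "shell"
                    and isinstance(kw.value, ast.Constant)
                    and kw.value.value is True):
                return False
    return True
```

An AST-based check inspects the source statically, so the guarded conversion functions never have to run during the test.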
2026-04-02 00:10:47 -07:00
mateeaaaaaaa
752cef3299
fix(security): shell injection in GGML export conversion (#4768)
* Fix shell injection in GGML conversion paths

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove test file from security fix PR

Move test_save_shell_injection.py to a separate PR to keep this PR focused on the security fix itself.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-04-02 00:10:43 -07:00
AdamPlatin123
ba8081fc96
fix(chat): correct loading text for cached models during inference (#4764)
Distinguish between actual network downloads and GPU memory loading for cached LoRA adapters in Studio chat.

- Add isCachedLora detection for local LoRA adapter paths using comprehensive cross-platform regex (Unix, Windows, UNC, WSL, tilde)
- Thread isCachedLora through loadInfo to chat-page inline status for proper 3-way distinction (cached / local LoRA / downloading)
- Skip download progress polling for cached LoRA models (no useless /download-progress API calls)
- Fix initial toast state to use isCachedLoad consistently instead of only checking isDownloaded
- Fix cancelLoading toast to not mention background downloads for cached/local loads
- Keep download-specific text ("Downloading model..." / "Download complete") inside the download-only polling block
2026-04-01 20:24:48 -07:00
Lee Jackson
ca4ea8b9fb
studio: align composer/code, unify fonts, and remove tool collapse jitter (#4763)
- Add min-w-0 guards to thread/message/markdown containers to prevent
  content overflow past the composer width
- Unify chat typography from Hellix/Space Grotesk to the sans stack,
  keeping monospace for code blocks and inline code
- Restructure desktop navbar right-side controls with shrink-0 wrappers
  for consistent spacing across HoverCard roots
- Soften tool-call label styling (font-medium + text-foreground/85
  instead of bold)
- Add responsive code block sizing via @container queries
- Add horizontal scrolling for wide code blocks within the thread column
- Scope list-item code block alignment CSS to .aui-thread-root
- Preserve useScrollLock in tool-fallback and tool-group collapsibles
- Fall back to bg-background on ViewportFooter when hideComposer is true
- Widen inline code monospace selector to cover th, blockquote, and
  heading elements
- Remove unused @fontsource-variable/space-grotesk import
2026-04-01 19:57:10 -07:00
DoubleMathew
71b934ef9d
Fix custom llama.cpp source builds and macos metal source builds (#4762)
* Fix script unbound variable error

* remove stale test script, add llama.cpp metal source builds, update tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix Metal precedence, test sync, and add behavioral tests

- Move macOS arm64 Metal check before CUDA/ROCm in GPU backend
  decision chain so Metal is not bypassed when nvcc is in PATH
- Remove RPATH flags from CPU fallback CMAKE_ARGS (only needed
  for Metal library linking)
- Update test_llama_pr_force_and_source.py to match _CLONE_ARGS
  rename from _CLONE_BRANCH_ARGS in setup.sh
- Add confirm_install_tree guard test for
  existing_install_matches_choice
- Add TestMacOSMetalBuildLogic bash subprocess tests verifying
  Metal flag selection, nvcc precedence, and CPU fallback behavior

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix Metal CPU fallback to also cover cmake build failures and update tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. Sync _GPU_BACKEND_FRAGMENT: remove the dead CPU_FALLBACK_CMAKE_ARGS= init (6/8 reviewer consensus)
2. Replace the RPATH assertion: new test_macos_arm64_cpu_fallback_args_exclude_rpath checks the actual runtime CPU_FALLBACK_CMAKE_ARGS output for @loader_path and -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON (6/8)
3. Reset _TRY_METAL_CPU_FALLBACK=false after both the configure-failure and build-failure fallback branches in setup.sh (4/8)
4. Change the macOS test to remove libmtmd.0.dylib instead of the platform-agnostic convert_hf_to_gguf.py (3/8)
5. Add an empty-string tag test: test_empty_tag_omits_branch_flag for resolved_tag= (2/8)
6. Add RPATH checks on cmake call logs: both fallback tests now assert @loader_path and -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON are absent from CPU fallback cmake calls, plus baseline flag preservation (multiple)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Clean up tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-01 14:06:39 -05:00
Daniel Han
39fe23ded8
Tests for architecture-aware KV cache estimation (#4760)
* test: add 66 tests for architecture-aware KV cache estimation

Covers all 5 estimation paths (MLA, Hybrid Mamba, Sliding Window,
Standard GQA, Legacy), GGUF parser for 8 new metadata fields,
_can_estimate_kv gate conditions, quantization scaling, edge cases,
path priority ordering, and lifecycle (init/unload/reparse).

Zero external dependencies beyond pytest. No GPU or network required.
Cross-platform (Linux, macOS, Windows, WSL).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-01 06:13:37 -07:00
Daniel Han
653eb3819a
fix(studio): allow context length slider to reach model's native limit (#4746)
* fix(studio): allow context length slider to reach model's native limit

The context length slider was hard-capped to the VRAM-estimated maximum,
preventing users from requesting higher context even though the backend
already handles it safely (multi-GPU selection, --fit fallback). Expose
the model's native context length from GGUF metadata as a separate API
field and use it as the slider ceiling instead. Add an amber warning
when the selected context exceeds the estimated VRAM capacity.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Raise VRAM budget to 90% and add native_context_length tests

Increase the GPU memory utilization threshold from 70% to 90% across
_select_gpus and _fit_context_to_vram, allowing longer context lengths
before VRAM capping kicks in.

Add 33 tests for the native_context_length feature covering the backend
property, context value separation invariants, Pydantic models, route
completeness, edge cases, and cross-platform binary I/O.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-01 06:12:52 -07:00
Daniel Han
d22b2a18f9
fix: add tokenizers to no-torch deps and TORCH_CONSTRAINT for arm64 macOS py313+ (#4748)
* fix: add tokenizers to no-torch runtime deps and add TORCH_CONSTRAINT for arm64 macOS py313+

Two installer fixes:

1. Add `tokenizers` to `no-torch-runtime.txt` before `transformers`.
   Without it, `from transformers import AutoConfig` crashes on startup
   because `--no-deps` skips transitive dependencies.

2. Add `TORCH_CONSTRAINT` variable to `install.sh`. On arm64 macOS with
   Python 3.13+, tighten the torch requirement to `>=2.6` since torch
   <2.6 has no cp313 arm64 wheels. The variable replaces the previously
   hard-coded constraint in the uv pip install line.

Includes 66 tests (42 pytest + 24 bash) covering:
- Structural checks on install.sh, install.ps1, no-torch-runtime.txt
- Shell snippet tests with mocked python for 13 platform/version combos
- Mock uv integration verifying correct constraint string
- E2E venv tests on Python 3.12 and 3.13 confirming AutoConfig works
- Negative control proving AutoConfig fails without tokenizers
- Full no-torch sandbox regression guards (safetensors, huggingface_hub)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix incomplete no-torch manifest and align E2E tests with real --no-deps path

- Add missing transitive deps to no-torch-runtime.txt that are required
  under --no-deps: regex, typing_extensions, filelock, httpx, httpcore,
  certifi, idna, anyio, sniffio, h11. Without these, `from transformers
  import AutoConfig` still fails after install.sh --no-torch.

- Change all E2E tests to use --no-deps (matching what install.sh does)
  instead of normal dep resolution. Previous tests passed even with an
  incomplete manifest because uv backfilled transitive deps.

- Rewrite negative control to derive from the real no-torch-runtime.txt
  with tokenizers stripped, proving the specific fix matters.

- Replace GNU-only sed -i with heredoc in shell test for macOS compat.

- Remove unused os/sys imports from Python test file.

- Quote SKIP_TORCH and mock uv paths in bash -c strings.

* Assert install succeeds before checking import results in E2E tests

Address review feedback: test_torch_not_importable and
test_tokenizers_directly_importable in Group 3 now assert that
uv pip install returns 0 before checking import behavior. This
prevents false positives when the install itself fails silently.

* Assert install succeeds in negative control and tighten error check

- Add missing install-success assertion in test_negative_control_no_tokenizers
  to prevent false positives from network/install failures.

- Tighten error message check to look for "tokenizers" in stderr or
  ModuleNotFoundError, rather than the generic "No module" substring
  which could match unrelated import failures.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-01 06:12:17 -07:00
Daniel Han
76cb48be0b
fix: studio web search SSL failures and empty page content (#4754)
- Fix SSL handshake failures (SSLV3_ALERT_HANDSHAKE_FAILURE, CERTIFICATE_VERIFY_FAILED) when fetching HTTPS pages by introducing _PinnedHTTPSConnection that separates TCP connect (to pinned IP) from TLS handshake (with real hostname for SNI/cert verification)
- Fix SSRF DNS-rebinding vulnerability: previous impl swapped conn.host before connect(), causing fresh DNS resolution; new subclass keeps TCP pinned to validated IP
- Fix SPA/JS-rendered doc sites returning empty content by rotating real browser User-Agents (Chrome/Firefox/Safari)
- Strip nav/footer from HTML-to-Markdown output so article content is not buried under navigation chrome
- Increase raw fetch cap from 64KB to 512KB so SSR article content is reached on GitBook/Docusaurus/Next.js pages
- Fix IPv6 address bracketing in URL netloc construction
- Hoist SSL context, handler classes, and stdlib imports to module level (created once, not per-call)
- Use consistent UA across redirect hops to avoid breaking session-aware bot detection
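The connect/handshake split can be sketched as below (an illustrative subclass; the real _PinnedHTTPSConnection likely differs in details such as SSL context configuration):

```python
import http.client
import socket
import ssl


class PinnedHTTPSConnection(http.client.HTTPSConnection):
    """Connect TCP to a pre-validated IP, but handshake TLS with the
    real hostname so SNI and certificate verification still work."""

    def __init__(self, hostname, pinned_ip, **kwargs):
        super().__init__(hostname, **kwargs)
        self._pinned_ip = pinned_ip

    def connect(self):
        # TCP connect to the pinned IP: no fresh DNS lookup, so a
        # rebinding attacker cannot swap the address after validation.
        sock = socket.create_connection((self._pinned_ip, self.port),
                                        self.timeout)
        context = ssl.create_default_context()
        # TLS handshake against the original hostname for SNI + cert check.
        self.sock = context.wrap_socket(sock, server_hostname=self.host)
```

Swapping `conn.host` before `connect()` (the previous approach) triggers a second DNS resolution inside the stdlib; overriding `connect()` keeps the socket pinned.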
2026-04-01 06:12:02 -07:00
Daniel Han
f84c2d03d3
Add installer test coverage for prebuilt llama.cpp changes (#4756)
Split out from #4741 to keep the main PR focused on installer logic.

- New test_install_llama_prebuilt_logic.py: tests for resolve logic,
  fallback behavior, env_int, busy/lock handling
- New test_validate_llama_prebuilt.py: validator tests for staged
  release_tag/upstream_tag handling
- New test_llama_pr_force_and_source.py: tests for PR_FORCE and
  LLAMA_SOURCE maintainer defaults
- Updated test_selection_logic.py: expanded selection/fallback coverage
- Updated test_pr4562_bugfixes.py: updated bugfix tests for new logic
- Updated smoke_test_llama_prebuilt.py: minor update
2026-04-01 06:06:29 -07:00
DoubleMathew
428efc7d95
Resolve latest usable published llama.cpp release instead of fixed pinned tag (#4741)
Replaces the fixed prebuilt llama.cpp tag with dynamic published-release
resolution, adds bounded fallback across older published releases, and
introduces maintainer-editable defaults for PR/source overrides.

Changes:
- Resolve latest from the latest usable published release in unslothai/llama.cpp
- Use the selected release upstream_tag as the authoritative llama.cpp version
- Prefer Unsloth-published platform assets when available
- Fall back to same-tag upstream ggml-org/llama.cpp assets where allowed
- Keep Linux CUDA anchored to Unsloth-published CUDA bundles only
- Add bounded fallback across older Unsloth published releases
- Add separate busy/in-use install handling (exit code 3)
- Skip reinstall when the installed bundle already matches the selected candidate
- Add maintainer-editable _DEFAULT_LLAMA_PR_FORCE and _DEFAULT_LLAMA_SOURCE
- Harden env parsing so malformed installer env vars do not crash import-time fallback logic
- Honor UNSLOTH_LLAMA_RELEASE_TAG in all resolve steps
- Always sync git remote URL in existing-checkout path
2026-04-01 06:06:17 -07:00
Daniel Han
5d7d882ce6
Fix save_pretrained_merged for full-finetuned models (#4755)
* Fix save_pretrained_merged for full-finetuned models

save_pretrained_merged and push_to_hub_merged silently do nothing when
the model is not a PeftModel (i.e. full finetuning without LoRA).
merge_and_overwrite_lora returns None immediately for non-PeftModel,
and unsloth_generic_save does not check the return value.

Add a non-PeftModel branch in unsloth_generic_save that falls back to
model.save_pretrained / model.push_to_hub. When save_method contains
"16bit", cast weights to bfloat16 (or float16) via a state_dict copy
to honor the user's intent without mutating the live model.

The existing PeftModel (LoRA) code path is unchanged.

* Forward create_pr and revision to tokenizer.push_to_hub

The tokenizer push_to_hub call was missing create_pr and revision,
which could cause the tokenizer to push to the wrong branch or
bypass PR creation when the model push uses them.

* Honor merged_16bit dtype contract for full-finetuned models

Cast state_dict to bfloat16/float16 when save_method contains "16bit"
to match the documented behavior of save_pretrained_merged. Also pass
state_dict and save kwargs consistently to both save_pretrained and
push_to_hub paths.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address review feedback for PR #4755

- Simplify PeftModel isinstance check (PeftModelForCausalLM inherits
  from PeftModel)
- Add is_main_process guard for distributed training
- Forward variant to save_pretrained
- Set tokenizer padding_side to "left" before saving (matches other
  save paths)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-01 06:05:37 -07:00
Daniel Han
77e1a9edc9
feat(studio): architecture-aware KV cache VRAM estimation (#4757)
* feat(studio): architecture-aware KV cache VRAM estimation

Replace the single legacy formula (2 * n_kv_heads * head_dim * n_layers
* n_ctx * bpe) with 5-path estimation that reads 8 additional GGUF
metadata fields:

  1. MLA (DeepSeek-V2/V3, GLM-4.7, GLM-5, Kimi-K2.5) -- K-only cache
     using compressed KV latent + RoPE; no separate V allocation
  2. Hybrid Mamba (Qwen3.5-27B, Qwen3.5-35B-A3B) -- only attention
     layers (1 in N) carry KV; Mamba layers have none
  3. Sliding Window (Gemma-3, gpt-oss) -- SWA layers cache
     min(ctx, window) tokens instead of the full context
  4. Standard GQA -- uses explicit key_length/value_length from GGUF
     instead of embed // n_heads (which is wrong for many models)
  5. Legacy fallback -- identical to old formula for old GGUFs

New GGUF fields parsed: attention.key_length, attention.value_length,
attention.sliding_window, full_attention_interval,
attention.kv_lora_rank, attention.key_length_mla, ssm.inner_size,
ssm.state_size.

Validated against 9 real GGUF files (72/72 field checks pass).
The legacy formula was off by +682% for Gemma-3 and -81% for
DeepSeek-V3.1.
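In rough terms, the legacy formula and the sliding-window path compare like this (a sketch with illustrative parameter names, plus the hybrid-path ceiling division described in a follow-up commit below):

```python
def kv_cache_bytes_legacy(n_kv_heads, head_dim, n_layers, n_ctx, bpe):
    # Old single formula: every layer caches full-context K and V.
    return 2 * n_kv_heads * head_dim * n_layers * n_ctx * bpe


def kv_cache_bytes_swa(n_kv_heads, head_dim, n_layers, n_ctx, bpe,
                       sliding_window, n_global_layers):
    # SWA layers cache only min(ctx, window) tokens; global layers
    # cache the full context. Requires sliding_window > 0 (0 = disabled).
    assert sliding_window > 0
    per_token = 2 * n_kv_heads * head_dim * bpe
    n_swa_layers = n_layers - n_global_layers
    return per_token * (n_global_layers * n_ctx
                        + n_swa_layers * min(n_ctx, sliding_window))


def hybrid_attention_layers(n_layers, full_attention_interval):
    # Ceiling division so a trailing partial group still counts its
    # attention layer (1-in-N layers carry KV in hybrid Mamba models).
    return -(-n_layers // full_attention_interval)
```

For a long context, the SWA estimate is far below the legacy one, which is why the legacy formula overestimated Gemma-3 so badly.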

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix MLA fallback and SWA global/local ratio heuristic

Two fixes based on review findings:

1. MLA fallback now uses key_length_mla from GGUF metadata instead of
   hardcoded rope_dim=64. Falls back to 64 only when key_length_mla is
   absent. This ensures correct estimates for MLA variants that use
   rope dimensions other than 64.

2. SWA global/local layer ratio changed from 50/50 to 1/4 (25% global,
   75% SWA). Most sliding window architectures have predominantly local
   layers (Gemma-3 uses ~17% global, gpt-oss uses ~50%). The 1/4
   heuristic is closer to the common case and still a large improvement
   over the legacy formula which ignores SWA entirely.

* Tighten _can_estimate_kv gate and treat sliding_window=0 as disabled

Two additional fixes from review round 1 (5/8 and 4/8 reviewer consensus):

1. _can_estimate_kv now requires BOTH key_length AND value_length for
   the explicit-dims path. Previously key_length alone was enough,
   which could cause silent fallthrough to the legacy formula with
   fabricated defaults (n_kv=1, head_dim=128) when value_length was
   absent from the GGUF.

2. SWA path now requires sliding_window > 0. Some GGUFs use 0 as a
   disabled sentinel. Without this guard, min(ctx, 0) would zero out
   all SWA layer contributions, severely underestimating KV cache.

* Fix MLA n_kv safety and use ceiling division for hybrid path

Addresses Gemini Code Assist review findings:

1. MLA path now uses n_kv_mla = n_kv_heads or 1 (not n_heads). This
   prevents a 128x overestimate for DeepSeek-V3 if head_count_kv is
   absent from the GGUF (n_heads=128 would have been used instead).

2. Hybrid path now uses ceiling division for attention layer count.
   This prevents undercounting by 1 when n_layers is not perfectly
   divisible by full_attention_interval.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-01 06:04:12 -07:00
Daniel Han
3f3757b143
Fix forward compatibility with transformers 5.x (#4752)
* Fix forward compatibility with transformers 5.x

Tested on transformers 4.57.6, 5.3.0, and 5.4.0. All changes are no-ops
on transformers 4.x.

1. Skip exec-based config patching for transformers >= 5.0

   Config classes in v5 use @strict, @auto_docstring, and interval()
   which break exec(inspect.getsource(...)). Those configs already use
   rope_parameters (the v5 replacement for rope_scaling).

2. Slice position_ids to last token in fast_forward_inference

   Transformers 5.x generate() accumulates position_ids as
   [batch, full_seq_len] across decode steps instead of [batch, 1].
   cos[position_ids] then produces the wrong shape for rotary
   embeddings. Fixed in llama, qwen3, falcon_h1, gemma2, cohere,
   granite. No-op on 4.x since position_ids is already [batch, 1].

3. Handle @strict config kwargs for sequence classification

   num_labels, max_position_embeddings, id2label etc. are set on the
   config object and passed via config= instead of as kwargs.
   AutoModelForSequenceClassification routing added to FastModel loader.

4. Exclude modernbert from flex_attention

   ModernBERT with flex_attention hits CUDA illegal memory access in
   create_block_mask. Falls back to eager attention safely.

5. Propagate token_type_ids and mm_token_type_ids through GRPO VLM path

   Gemma3 Vision requires token_type_ids during training. Qwen3VL
   requires mm_token_type_ids for M-RoPE. Extract from inputs in
   compute_loss, pass to grpo_accumulated_loss, and extend
   mm_token_type_ids for completion tokens in
   _generate_and_score_completions.
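The core of fix 2 reduces to slicing position ids to the last token during decode; a torch-free sketch on nested lists:

```python
def slice_decode_position_ids(position_ids):
    """Keep only the last position per batch row so cos[position_ids]
    keeps the [batch, 1] decode shape; a no-op when already [batch, 1]."""
    return [row[-1:] for row in position_ids]
```

On transformers 4.x the input is already `[batch, 1]`, so the slice changes nothing; on 5.x it trims the accumulated `[batch, full_seq_len]` ids.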

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add try/except safety net around config exec for pre-release transformers versions

* Pop config-level kwargs in seqclass path and use except Exception

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-01 06:04:03 -07:00
Roland Tannous
41df4ec437
feat(studio): strip org prefix in model search to surface unsloth variants (#4749)
When searching for a specific publisher model (e.g. `openai/gpt-oss-20b`), the
unsloth search used the full `openai/gpt-oss-20b` string with `author=unsloth`,
which returned zero results because no unsloth model contains the publisher
prefix in its name. Users never discovered unsloth variants.

This PR strips the org prefix for publisher-qualified queries so unsloth variants
surface, then pins the original publisher model after a small batch of unsloth
results. Plain queries (no slash) and unsloth-prefixed queries are unchanged.

- Strict regex (`/^([^/\s]+)\/([^/\s]+)$/`) only triggers on valid `owner/repo`
  identifiers; incomplete typeahead, multi-slash, and URL-like inputs are rejected
- Queries for `unsloth/...` models (case-insensitive) keep the full 20-result
  prefetch and secondary sort
- Pinned model lookup fires in parallel with the unsloth prefetch
- Canonical-name dedup prevents duplicates when HF normalizes casing
- Publisher detection extracted into a single `useMemo` block
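The strip-and-reject behavior can be sketched in a few lines (Python stand-in for the frontend logic; function name is illustrative, the regex is the one quoted above):

```python
import re

# Strict owner/repo pattern from the commit message: exactly one slash,
# no whitespace, nothing before or after.
_OWNER_REPO = re.compile(r"^([^/\s]+)/([^/\s]+)$")


def strip_org_prefix(query):
    """Return the repo-only search term for publisher-qualified queries;
    leave plain and unsloth-prefixed queries unchanged."""
    m = _OWNER_REPO.match(query)
    if m is None or m.group(1).lower() == "unsloth":
        return query
    return m.group(2)
```

Multi-slash and URL-like inputs fail the regex and pass through untouched, matching the rejection rule above.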
2026-04-01 04:37:28 -07:00
Leo Borcherding
63ad6dbd6d
Fix OOM model styling in Studio model selectors (#4738)
Replace strikethrough + opacity-50 OOM styling with gray text and red pill badge across all Studio model selectors (chat, training, onboarding).

- Use gray-500/gray-400 for OOM model names (better contrast than strikethrough)
- Red pill badge for OOM indicator with light/dark mode support
- Scope GGUF gray override to quant name only so downloaded/recommended labels keep colors
- Add !important on TIGHT/OOM badges to resist ComboboxItem hover overrides
2026-04-01 02:06:49 -07:00
Daniel Han
6c0826a9e4
Fix Windows local GGUF model loading crash (#4730)
* Fix Windows "Non-relative patterns are unsupported" when loading local GGUF models

When a user loads a GGUF model from a local Windows path (e.g.
C:\Users\danie\.lmstudio\models\unsloth\functiongemma-270m-it-GGUF),
the model identifier contains backslashes and a drive letter. Both
load_model_defaults() and _has_specific_yaml() constructed a YAML
filename from the full absolute path and passed it to Path.rglob(),
which rejects non-relative patterns on Windows.

Fixed by detecting Windows-style paths (drive letters, UNC paths,
backslashes) in addition to Unix-style paths, and using only the
directory basename for the YAML filename lookup when the identifier
is a local filesystem path.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Refactor: reuse is_local_path helper, fix case-sensitive suffix lookup

- Replace inline local-path detection in model_config.py and
  inference_config.py with the existing is_local_path() from utils.paths,
  which already handles Unix, Windows drive-letter, UNC, and backslash paths
- Fix case-sensitive suffix lookup in load_model_defaults(): the
  _REVERSE_MODEL_MAPPING is lowercase-keyed, so suffix comparisons must use
  .lower() to match paths like /path/to/Spark-TTS-0.5B/LLM

* Fix WSL path parsing and _has_specific_yaml suffix lookup

- Use normalize_path() before Path() operations so backslash Windows
  paths (e.g. C:\Users\...\model) are correctly split on POSIX/WSL hosts
  where pathlib treats backslashes as literal characters
- Add suffix-based (2-component and 1-component) lookup to
  _has_specific_yaml() so it matches the same resolution rules as
  load_model_defaults(), fixing wrong inference params for local
  suffix-mapped models like Spark-TTS-0.5B/LLM
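The detection-plus-basename idea can be sketched as follows (illustrative helpers; the real is_local_path in utils.paths and normalize_path cover more cases):

```python
import re
from pathlib import PurePosixPath, PureWindowsPath


def is_local_path(identifier):
    """Heuristic local-path check: Unix absolute/relative/tilde paths,
    Windows drive letters, UNC shares, and any backslash path."""
    return (
        identifier.startswith(("/", "~", "./", "../", "\\\\"))
        or bool(re.match(r"[A-Za-z]:[\\/]", identifier))
        or "\\" in identifier
    )


def yaml_lookup_name(identifier):
    # For local paths, use only the directory basename so Path.rglob
    # receives a relative pattern (it rejects absolute ones on Windows).
    if not is_local_path(identifier):
        return identifier
    cls = PureWindowsPath if "\\" in identifier else PurePosixPath
    return cls(identifier).name
```

Using `PureWindowsPath` for backslash identifiers is what makes the split correct on POSIX/WSL hosts, where `pathlib.Path` would treat backslashes as literal name characters.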

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-01 01:38:09 -07:00
Datta Nimmaturi
256c6e4884
Refactor flex attn to prefer flash if possible (#4734)
Replaces prefer_flex_attn_if_supported (which only returned flex_attention or None) with determine_attention_implementation, a centralized hierarchy: FA2 > Flex > SDPA > Eager.

Changes:
- New determine_attention_implementation function in _utils.py with clear priority chain
- _set_attn_impl helper to stamp config consistently
- _FLEX_EXCLUDED_MODELS / _FLEX_EXCLUDED_PREFIXES for model-specific exclusions
- Gemma3N explicit eager override in vision.py (timm vision towers)
- Preserved sdpa fallback for unmapped/remote-code vision configs
- Config re-stamped to eager when supports_sdpa guard fires

Co-authored-by: Datta Nimmaturi <Datta0@users.noreply.github.com>
2026-04-01 00:30:21 -07:00
Wasim Yousef Said
d63cc57e1e
fix: clear tool status badge immediately after tool execution (#4733)
* fix: clear tool status badge immediately after tool execution

The tool status timer badge (Searching 1s, 2s...) persisted after
tool calls finished because the status clear event was only sent
at the start of the next generation iteration, not after tool
execution completed.

Backend: yield status clear after all tools finish in the agentic
loop iteration, before continue starts the next generation pass.

Frontend: debounce badge visibility by 300ms so sub-second tool
calls don't flash the badge.

* Fix debounce regression for consecutive tool calls

Only apply the 300ms show-delay when transitioning from idle to
tool-active. When switching between consecutive tools in the same
turn (e.g. web_search -> python), keep the badge visible immediately
so it does not flicker or disappear during multi-tool runs.

* Delay wasActiveRef reset to bridge inter-iteration tool gaps

The backend emits a status-clear event between tool iterations,
which was resetting wasActiveRef immediately and causing the next
tool to be re-debounced (300ms hidden gap between consecutive tools
in the same turn). Now the ref reset is delayed by 500ms so a
follow-up tool within the same agentic turn shows the badge
immediately, while a genuinely new turn still gets the debounce.

* Use thread lifecycle to track tool-run boundaries

Replace the 500ms wall-clock timeout with the actual thread.isRunning
state to determine when wasActiveRef should reset. This properly
handles all cases:
- Consecutive tools within the same run stay visible without flicker
- The badge hides only when the thread run actually ends
- New turns always get a fresh 300ms debounce on the first tool
- No heuristic timeout that can misfire on slow or fast inference

* Consolidate wasActiveRef reset into single effect

Removes the separate isThreadRunning effect to avoid a race where
the ref resets before the tool-status effect reads it (when
isThreadRunning flips to false before setToolStatus(null) from
the adapter's finally block). Now wasActiveRef resets only when
both toolStatus is null AND the thread run has ended, eliminating
any flicker on the last tool of a run.

* Simplify debounce: use visible state instead of ref tracking

Drop wasActiveRef entirely and use the visible state as the
debounce gate. When the badge is not yet on screen, debounce
for 300ms before showing. When already visible from a prior tool,
keep showing immediately. This correctly handles all cases:
- All fast tools (<300ms) are suppressed, not just the first
- Consecutive tools after the badge is shown stay visible
- Badge persists across inter-iteration clears while thread runs
- New turns get a fresh debounce after visible resets
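The final debounce rule reduces to a small state function (a behavioral sketch in Python, assuming the 300 ms threshold; the real implementation is a React effect):

```python
def badge_visible_next(visible, tool_active, thread_running, elapsed_ms):
    """Next visibility of the tool-status badge under the visible-state gate."""
    if tool_active:
        if visible:
            return True  # already on screen: keep showing, no flicker
        return elapsed_ms >= 300  # debounce the first appearance
    # Status cleared: persist across inter-iteration gaps while the run lives.
    return visible and thread_running
```

A sub-300 ms tool never shows the badge unless it is already visible, and the inter-iteration status-clear no longer hides it mid-run.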

---------

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-04-01 00:28:38 -07:00
Wasim Yousef Said
4fb9778988
feat: move folder management into model selector dropdown (#4731)
* refactor: move folder management from sidebar into model selector

* Fix folder management: restore LoRA picker sync, error handling, caching

- Restore onFoldersChange callback to keep LoRA adapter picker in sync
  when scan folders are added/removed (fixes regression from sidebar move)
- Thread onFoldersChange through ModelSelector -> HubModelPicker prop chain
- Add module-level _scanFoldersCache to prevent folder list flash on re-open
- Surface error toast on folder removal failure instead of silently ignoring
- Guard handleAddFolder against concurrent double-submit via folderLoading
- Clear folderInput on Escape key dismiss to prevent stale input on re-open
- Add refreshLocalModelsList and refreshScanFolders to useEffect dep array

* Fix compare-mode folder sync, Escape key propagation, cancel toggle state

- Wire onFoldersChange through CompareContent/GeneralCompareContent so
  compare-mode selectors also refresh local models after folder changes
- Add e.stopPropagation() on Escape key in folder input to prevent
  Radix Popover from closing the entire model selector dropdown
- Add e.preventDefault() on Enter key to prevent form submission
- Clear folderInput and folderError when cancel toggle hides the input,
  matching the Escape key behavior for consistency

* Fix folder mutation state ordering and touch accessibility

- Use optimistic updates for add/remove so the folder list reflects
  changes immediately instead of waiting on a second listScanFolders
  round-trip that could silently fail.
- Move refreshScanFolders out of the finally block in handleRemoveFolder
  so it runs after the cache update, not after onFoldersChange.
- Make the remove button visible on touch/mobile devices and reachable
  via keyboard focus (opacity-100 on small screens, focus-visible).
- Add aria-label to the remove button for screen readers.

* Deduplicate optimistic folder add to match backend behavior

The backend returns the existing ScanFolderInfo row when adding a
path that is already registered. The optimistic update was blindly
appending the returned row, producing duplicate entries and React
key warnings. Now checks by id before appending.

* Add aria-label to folder toggle button and strengthen dedup check

- Add aria-label to the +/cancel icon button for screen readers.
- Extend optimistic dedup check to also compare by path, not just id,
  to handle edge cases where the cache is stale.

---------

Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-03-31 23:15:50 -07:00
Lee Jackson
2cac3e8e4d
studio: Polish Windows installer/setup logs (#4736)
* style(windows): clean installer/setup log output and remove seeded credential banner

* Keep startup credential hint without exposing plaintext password

Print the username and .bootstrap_password file path on first-run
admin creation instead of the raw password. Headless / Docker / SSH
operators still get a startup-time hint for initial sign-in, and the
plaintext credential no longer appears in terminal output or logs.

---------

Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
2026-03-31 23:12:42 -07:00
Daniel Han
6984e118eb
Bump installer minimum version pin to 2026.3.18 (#4729)
Matches the latest PyPI release.
2026-03-31 07:00:51 -07:00
Daniel Han
cfeb8c3245 Versioning 2026-03-31 06:51:34 -07:00
Wasim Yousef Said
1e8875584d
feat: custom scan folders for GGUF model discovery (#4723)
* feat: add scan_folders table and CRUD functions to studio_db

* feat: add scan folders API endpoints and integrate into model scan

* feat: add scan folders API client and update source types

* feat: add custom source to model filters and selector

* feat: add Model Folders section to chat settings sidebar

* style: fix biome formatting in ModelFoldersSection

* fix: address review findings for custom scan folders

- empty string bypass
- concurrent delete crash guard
- Windows case normalization
- response_model on endpoints
- logging
- deduplicated filter/map
- module-level cache for custom folder models
- consistent source labels
- handleRemove error surfacing
- per-folder scan cap

* fix: show custom folders section regardless of chatOnly mode

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refactor: extract shared refreshLocalModelsList in pickers

* Harden custom scan folder validation and scanning

- Validate path exists, is a directory, and is readable before persisting
- Apply per-folder model cap during traversal instead of after (avoids
  scanning millions of inodes in large directories)
- Wrap per-folder scan in try/except so one unreadable folder does not
  break the entire /api/models/local endpoint for all callers
- Normalize case on Windows before storing so C:\Models and c:\models
  dedup correctly
- Extend macOS denylist to cover /private/etc and /private/tmp (realpath
  resolves /etc -> /private/etc, bypassing the original denylist)
- Add /boot and /run to Linux denylist

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Improve scan robustness and preserve Windows path casing

- Preserve original Windows path casing in DB instead of lowercasing
  (normcase used only for dedup comparison, not storage)
- Catch PermissionError per child directory so one unreadable subdirectory
  does not skip the entire custom folder scan
- Wrap list_scan_folders() DB call in try/except so a DB issue does not
  break the entire /api/models/local endpoint

* fix: scan custom folders for both flat and HF cache layouts

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix Windows case-insensitive path dedup with COLLATE NOCASE

Use COLLATE NOCASE on the scan_folders.path column so that the UNIQUE
constraint correctly deduplicates C:\Models and c:\models on Windows
without lowercasing the stored path. Also use COLLATE NOCASE in the
pre-insert lookup query on Windows to catch existing rows with
different casing.
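The platform-conditional collation can be sketched with stdlib sqlite3. This is an illustrative reconstruction, not the actual studio_db code: the table and column names follow the commit message, but the real schema has more columns and the real code routes through its own DB layer.

```python
# Sketch: UNIQUE path dedup that is case-insensitive only on Windows.
# On Windows, COLLATE NOCASE lets the UNIQUE constraint reject c:\models
# when C:\Models already exists, while the stored path keeps its casing.
# On Linux/macOS the default BINARY collation keeps /Models and /models
# distinct.
import os
import sqlite3

def create_scan_folders_table(conn, is_windows=None):
    if is_windows is None:
        is_windows = os.name == "nt"
    collate = " COLLATE NOCASE" if is_windows else ""
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS scan_folders ("
        f"id INTEGER PRIMARY KEY, path TEXT UNIQUE{collate})"
    )

conn = sqlite3.connect(":memory:")
create_scan_folders_table(conn, is_windows=True)
conn.execute("INSERT INTO scan_folders (path) VALUES (?)", (r"C:\Models",))
try:
    conn.execute("INSERT INTO scan_folders (path) VALUES (?)", (r"c:\models",))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
stored = conn.execute("SELECT path FROM scan_folders").fetchall()
```

Note that SQLite's NOCASE folds ASCII only, which matches the drive-letter/path-casing cases this dedup targets.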

* Restore early-exit limit in _scan_models_dir for custom folders

Keep the limit parameter so _scan_models_dir stops iterating once
enough models are found, avoiding unbounded traversal of large
directories. The post-traversal slice is still applied after combining
with _scan_hf_cache results.

* feat: scan custom folders with LM Studio layout too

* Fix custom folder models being hidden by dedup

Custom folder entries were appended after HF cache and models_dir
entries.  The dedup loop kept the first occurrence of each model id,
so custom models with the same id as an existing HF cache entry were
silently dropped -- they never appeared in the "Custom Folders" UI
section.

Use a separate dedup key for custom-source entries so they always
survive deduplication.  This way a model can appear under both
"Downloaded" (from HF cache) and "Custom Folders" (from the
user-registered directory) at the same time.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Harden LM Studio scan and fix COLLATE NOCASE on Linux

- Add per-child and per-publisher OSError handling in _scan_lmstudio_dir
  so one unreadable subdirectory does not discard the entire custom
  folder's results
- Only apply COLLATE NOCASE on the scan_folders schema on Windows where
  paths are case-insensitive; keep default BINARY collation on Linux
  and macOS where /Models and /models are distinct directories

* Use COLLATE NOCASE in post-IntegrityError fallback SELECT on Windows

The fallback SELECT after an IntegrityError race now uses the same
case-insensitive collation as the pre-insert check, so a concurrent
writer that stored the path with different casing does not cause a
false "Folder was concurrently removed" error.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-03-31 06:40:31 -07:00
Daniel Han
9a8b622306
Studio: simplify tool-call dedup and replace html2text with builtin converter (#4722)
* Simplify tool-call dedup: drop hashlib, inline helpers

The duplicate tool-call detector only compares calls within a single
request from the same JSON parser, so dict key order is guaranteed
identical for identical calls (Python 3.7+ insertion-ordered dicts).

- Replace hashlib.md5(json.dumps(...)) with name + str(args)
- Inline _tool_call_key, _is_duplicate_call, _record_tool_call
  since each was a one-liner used once
- Remove unused hashlib import

* Remove tool_calling_benchmark_results.md from repo

* Replace html2text with builtin HTML-to-Markdown converter

Drop the external html2text (GPL-3.0) dependency and its regex
fallback. Add _html_to_md.py (~190 lines, stdlib only) using
html.parser.HTMLParser that handles headings, links, bold/italic,
lists, tables, blockquotes, code blocks, and entity decoding.
Strips script/style/head tags entirely.
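The stdlib approach can be illustrated with a toy converter. This is not the real _html_to_md.py (which handles tables, blockquotes, lists, and much more); it is a minimal sketch of the html.parser.HTMLParser pattern covering headings, links, and script/style/head stripping.

```python
# Minimal sketch of an HTMLParser-based HTML-to-Markdown converter.
from html.parser import HTMLParser

class TinyMarkdown(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes &amp; etc. for us
        self.out = []
        self._skip = 0        # nesting depth inside <script>/<style>/<head>
        self._href = None     # pending link target while inside <a href=...>
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style", "head"):
            self._skip += 1
        elif tag in ("h1", "h2", "h3"):
            self.out.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._link_text = []
        elif tag == "p":
            self.out.append("\n\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style", "head"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a" and self._href is not None:
            # Normalize whitespace in the accumulated link text.
            text = " ".join("".join(self._link_text).split())
            self.out.append(f"[{text}]({self._href})")
            self._href = None
        elif tag in ("h1", "h2", "h3"):
            self.out.append("\n")

    def handle_data(self, data):
        if self._skip:
            return            # drop script/style/head content entirely
        if self._href is not None:
            self._link_text.append(data)
        else:
            self.out.append(data)

def html_to_md(html):
    p = TinyMarkdown()
    p.feed(html)
    p.close()
    return "".join(p.out).strip()
```

The real converter layers the same callback structure with per-tag buffers for cells, blockquotes, and code fences.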

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use json.dumps(sort_keys=True) for tool-call dedup key

str(dict) is sensitive to insertion order, so semantically identical
calls with different key ordering would bypass duplicate detection.
Switch to json.dumps with sort_keys=True for a canonical representation.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert dedup key to str(arguments)

json.dumps(sort_keys=True) is unnecessary here -- the arguments dict
always comes from the same JSON parser within a single request, so
key insertion order is deterministic (Python 3.7+).  str() is faster
and sufficient for consecutive-call dedup.

* Address review comments on _html_to_md.py

- Remove "hr" from _BLOCK_TAGS so the dedicated hr handler is reachable
- Prefix all newlines with ">" inside blockquotes (multi-line support)
- Emit full ![alt](url) for images instead of alt text only
- Replace newlines with spaces inside table cells
- Track header cells per-row (_row_has_th) instead of last-cell-only
- Strip trailing tabs in addition to spaces in cleanup regex

* Fix blockquote rendering, truncated-HTML buffer flush, and dedup key canonicalization

_html_to_md.py:
- Rewrite blockquote handling with stack-based buffer approach so nested
  blockquotes, pre blocks inside blockquotes, and multi-paragraph quotes
  all render correctly with proper "> " prefix on every line.
- Add flush_pending() to recover content from truncated HTML where closing
  tags are missing (common when _fetch_page_text caps the download size).
  Flushes open <a>, <td>, <pre>, and blockquote buffers.
- Skip <img> tags to match prior html2text ignore_images=True behavior
  and avoid data-URI amplification consuming the output budget.
- Collapse all whitespace (including newlines) in non-pre content per
  standard HTML whitespace rules: \s+ -> single space.
- Escape pipe characters in table cell content to prevent column breakage.
- Emit separator row after the first row for tables without <th> headers.
- Guard against IndexError on _ol_counter for orphan <li> elements.
- Normalize CRLF line endings before parsing.

llama_cpp.py:
- Restore canonical dedup key with json.dumps(sort_keys=True) so that
  semantically identical tool calls with different JSON key order are
  correctly detected as duplicates.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix table optional end tags, inline code whitespace, and link text normalization

_html_to_md.py:
- Extract _finish_cell() and _finish_row() helpers to handle HTML tables
  that omit optional </td>, </th>, or </tr> end tags. This is valid HTML
  and common on real web pages -- previously the parser would silently
  drop earlier cells and entire rows.
- Call _finish_cell()/_finish_row() from handle_starttag for <tr>/<td>/<th>,
  handle_endtag for </tr>/<td>/<th>/<table>, and flush_pending() so all
  three paths (normal close, implicit close, truncated HTML) use the same
  row-finalization logic including header separator emission.
- Add _in_inline_code flag so handle_data() preserves literal whitespace
  inside <code> spans instead of collapsing it. Source like
  <code>pip  install   unsloth</code> now correctly renders as
  `pip  install   unsloth` rather than `pip install unsloth`.
- Extract _finish_link() helper that normalizes accumulated link text with
  \s+ -> single space before building the Markdown link. Prevents block-
  level content inside <a> tags (e.g. <a><div>one</div><div>two</div></a>)
  from producing multiline [one\n\ntwo](href) link labels.
- Empty blockquotes now produce no output instead of a stray ">".
- Remove unused _bq_depth field (all routing uses _bq_stack).
- Flush open cells and rows in handle_endtag("table") for robustness.

* Support <ol start=N>, <dl>/<dt>/<dd>, and preserve code block whitespace

_html_to_md.py:
- Honor <ol start="N"> attribute so ordered lists preserve their original
  numbering instead of always restarting from 1. Important for docs/tutorials
  that continue numbering across sections.
- Add dl, dt, dd to _BLOCK_TAGS so definition lists (common on MDN, Python
  docs, Django docs) produce separated text instead of concatenated blobs.
- Rewrite _cleanup() to be fence-aware: content inside fenced code blocks
  is now preserved verbatim (intentional blank lines in <pre> content are
  no longer collapsed). Outside code blocks, blank runs are limited to one
  and trailing whitespace is stripped.
- Fix _prefix_blockquote() to strip trailing whitespace before collapsing
  blank lines, preventing the "\n\n \n\n" pattern from sneaking through.

* Suppress whitespace-only text nodes between table structural elements

Indented HTML tables (nearly all real-world pages) produce whitespace
text nodes between <table>, <tr>, </tr> etc. that land in the output
as leading spaces before table rows, breaking Markdown table alignment.

Skip whitespace-only text nodes when inside a table but not inside a
cell, so indentation from source HTML does not leak into the output.

* Revert dedup key to str(arguments) with explanatory comment

json.dumps(sort_keys=True) is unnecessary overhead here: arguments
always comes from json.loads on model output within a single request,
so dict insertion order is deterministic in Python 3.7+. A repeated
call from the model produces the same JSON, which parses to the same
dict repr. str() avoids re-serialization on every tool call.
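The consecutive-duplicate logic that this key feeds can be sketched as follows. Names here are hypothetical (the real code lives inline in llama_cpp.py); the sketch shows the final design: a window of one previous call, str(arguments) as the key, and retries allowed after a failed call.

```python
# Sketch of consecutive duplicate tool-call blocking (window = 1).
def make_key(name, arguments):
    # arguments comes from json.loads on model output within one request,
    # so dict insertion order is deterministic (Python 3.7+) and str()
    # is a sufficient canonical form for consecutive-call comparison.
    return f"{name}:{arguments!r}"

class ToolLoop:
    def __init__(self):
        self._prev = None  # (key, failed) of the immediately previous call

    def should_skip(self, name, arguments):
        key = make_key(name, arguments)
        prev = self._prev
        # Block only if the previous call was identical AND succeeded;
        # a retry after a transient failure is allowed through.
        return prev is not None and prev[0] == key and not prev[1]

    def record(self, name, arguments, failed):
        self._prev = (make_key(name, arguments), failed)

loop = ToolLoop()
args = {"query": "unsloth", "max_results": 5}
first_skipped = loop.should_skip("web_search", args)
loop.record("web_search", args, failed=False)
second_skipped = loop.should_skip("web_search", args)
```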

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-03-31 06:15:18 -07:00
Lee Jackson
9451bb1bac
fix(export): preserve selected/manual model on enter and blur (#4726) 2026-03-31 17:05:55 +04:00
Daniel Han
e159b93b97
studio: improve GGUF tool calling accuracy and reliability (#4700)
* studio: improve GGUF tool calling accuracy and reliability

- Add URL fetching to web_search tool so models can read full page
  content instead of only getting search snippets. Uses html2text for
  clean markdown conversion with regex fallback.
- Inject current date and behavioral guidance (URL fetch workflow,
  no repeated queries, use code for data processing) into the
  tool-use system prompt.
- Append error recovery nudge to tool results that indicate failure,
  helping small models avoid looping on the same broken call.
- Strip leaked <tool_call> XML from assistant messages in conversation
  history and from the outgoing SSE stream.
- Raise default max tool iterations from 10 to 25 across backend,
  model schema, and frontend defaults.
- Increase _MAX_PAGE_CHARS from 4k to 16k so fetched pages contain
  enough content for the model to extract useful information.
- Add "IMPORTANT: These are only short snippets" hint to search
  results so models know to fetch full pages when needed.

Tested with Qwen3.5-4B-GGUF (UD-Q4_K_XL), 10 runs before/after:
- XML leaks in responses: 10/10 -> 0/10
- URL fetch usage: 0 -> 4/10 runs
- Runs producing correct answers: 0/10 -> 2/10
- Average tool calls per query: 5.5 -> 3.8 (more efficient)
- Average response time: 12.3s -> 9.8s

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add tool calling benchmark results across model sizes and quants

Tested 16 configurations (4 models x 2 quants x 2 KV cache types)
with 10 runs each on NVIDIA B200.

Best config: 27B UD-Q4_K_XL + bf16 KV -- 6/10 runs found all 4
correct songs, 0 XML leaks, 131s average response time.

* Add duplicate tool-call detection and final-answer synthesis

When the model repeats the exact same tool call (same name + arguments)
twice in a row, skip execution and return a redirect message telling it
to try a different approach. This prevents the 8x-repeated-query loops
observed on 27B and 35B models.

When the tool iteration cap (25) is reached, inject a "provide your
final answer now" message before the final streaming pass. This lets
the model synthesize a useful answer from everything it gathered
instead of being silently cut off.

Tested on Qwen3.5-27B UD-Q4_K_XL (10 runs):
- Repeated query runs: 4/10 -> 2/10
- Cap hits: 1/10 -> 0/10
- All 4/4 accuracy: 5/10 -> 7/10

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix CodeQL alert: handle whitespace in script/style closing tags

The regex fallback for HTML stripping did not match closing tags
with whitespace before the angle bracket (e.g. </script >).
Use \s* before > in both script and style patterns.

* Address reviewer findings: SSRF, timeout crash, XML regex, dedup

- SSRF: resolve hostname via getaddrinfo and reject private, loopback,
  link-local, multicast, and reserved addresses before fetching
- Timeout: handle timeout=None (unlimited mode) in URL fetch path
  by defaulting to 60s instead of crashing on min(None, 60)
- Download cap: read at most max_chars*4+1 bytes instead of the
  full response body before truncating
- XML regex: match both <tool_call> and <function=...> markup in
  the history/stream cleanup (inference.py)
- CodeQL: use [^>]* in closing script/style tags to handle any
  whitespace or attributes before >
- Dedup: track whether each tool call failed so retries after
  transient errors are allowed; only block consecutive identical
  calls that both succeeded
- Final-answer synthesis: guard on max_tool_iterations > 0 so
  callers who disable tools do not get a false "used all calls" turn
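The pre-fetch address check can be sketched with stdlib ipaddress and socket. This is an illustrative sketch only (the real code additionally handles redirects and DNS pinning, covered in later commits), and the function name is hypothetical.

```python
# Sketch: resolve a hostname and reject non-public targets before fetching.
import ipaddress
import socket

def resolve_and_validate(hostname, port=443):
    """Reject private, loopback, link-local, multicast, reserved addrs."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    for *_rest, sockaddr in infos:
        ip = ipaddress.ip_address(sockaddr[0])
        if (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_multicast or ip.is_reserved):
            raise ValueError(f"Blocked: {hostname} resolves to {ip}")
    return infos[0][-1][0]  # first validated address

# IP literals resolve without DNS, so this runs offline:
try:
    resolve_and_validate("10.0.0.1")
    blocked = False
except ValueError:
    blocked = True
allowed = resolve_and_validate("8.8.8.8")
```

Every address returned by getaddrinfo is checked, since a hostname can resolve to both a public and a private record.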

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix redirect SSRF, SSE streaming regression, dedup off-by-one

- SSRF redirect bypass: disable auto-redirect in urllib, manually
  follow up to 5 hops with host validation at each step. Prevents
  public URLs from redirecting to loopback/private targets.
- SSE streaming: track prev_text on the raw cumulative and strip
  XML from the delta only, so completed tool_call tags do not cause
  the cumulative to shrink and drop trailing real text.
- Dedup off-by-one: check the immediately previous call (window=1)
  instead of requiring 2 matching history entries, so the second
  identical successful call is blocked rather than the third.
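The sanitize-the-cumulative-then-diff streaming fix can be sketched like this. The pattern below is simplified (the real _TOOL_XML_RE also matches `<function=...>` markup), and the hold-back of a partially streamed opener is an assumption about how the real code avoids leaking a tag split across chunks.

```python
# Sketch: strip tool-call XML from the full cumulative text, then emit
# only the delta vs. the previous sanitized snapshot, so tags split
# across SSE chunk boundaries never leak.
import re

_TOOL_XML_RE = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)
_TAG = "<tool_call>"

def sanitize(text):
    text = _TOOL_XML_RE.sub("", text)
    idx = text.find("<tool_call")       # unclosed opener still streaming in
    if idx != -1:
        return text[:idx]
    for i in range(len(_TAG) - 1, 0, -1):  # trailing partial like "<tool"
        if text.endswith(_TAG[:i]):
            return text[:-i]
    return text

def stream_deltas(chunks):
    cumulative = ""
    prev_clean = ""
    for chunk in chunks:
        cumulative += chunk
        clean = sanitize(cumulative)    # sanitize the full cumulative,
        delta = clean[len(prev_clean):] # then diff against the previous
        prev_clean = clean              # sanitized snapshot
        if delta:
            yield delta

out = "".join(stream_deltas(
    ["Hello <tool", '_call>{"a":1}</tool_call> world']))
```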

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix redirect HTTPError handling and tighten error prefixes

- Redirect fix: urllib raises HTTPError (not a normal response) when
  the redirect handler returns None. Catch HTTPError for 3xx codes
  and extract the Location header from the exception object.
- Error prefixes: remove overly broad "No " prefix that matched
  "No results found." (a valid empty-search outcome, not an error).
  Replace with specific prefixes like "Blocked:", "No query provided",
  "Failed to resolve". This ensures empty search results are correctly
  classified as non-errors for duplicate-call tracking.

* Fix SSE cross-chunk XML leaks, cleanup review findings

- SSE streaming: sanitize the full cumulative text before diffing
  against the previous sanitized snapshot, so XML tags that span
  chunk boundaries are stripped correctly. The previous delta-based
  approach leaked split tags.
- DRAINING fallback: use _strip_tool_markup() helper instead of a
  manual regex that only handled <tool_call> but not <function=...>.
- Move hashlib import, _TOOL_XML_RE compile, and datetime import to
  module level per style guide.
- Remove unused _hit_tool_cap variable.

* Fix DNS rebinding, charset detection, HTTPError handling, dedup double-record

- DNS rebinding: resolve hostname once via getaddrinfo, pin the
  returned IP, rewrite the URL to connect to the pinned IP with
  a Host header. Each redirect hop re-resolves and re-validates.
  Closes the TOCTOU window between validation and connection.
- Charset: use resp.headers.get_content_charset() instead of
  hardcoding utf-8, so pages with other encodings decode correctly.
- HTTPError: return descriptive "HTTP {code} {reason}" instead of
  re-raising into a generic "Search failed" message.
- Dedup: remove redundant _record_tool_call in the duplicate branch;
  the single call at the end of the loop handles all cases.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-03-31 03:06:44 -07:00
Lee Jackson
815619d972
feat: add update instructions card with OS toggle and mobile expand flow (#4721)
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
2026-03-31 14:05:05 +04:00
Roland Tannous
cc5e4fbf17
fix: auto-retry stalled HF downloads with HF_HUB_DISABLE_XET=1 (#4712)
* fix: auto-retry stalled HF downloads with HF_HUB_DISABLE_XET=1

The heartbeat thread now monitors the HF Hub cache directory for
file-size growth. If no bytes are written for 3 minutes, it sends a
"stall" message to the orchestrator, which kills the subprocess and
retries with HF_HUB_DISABLE_XET=1 (falling back from Xet to standard
HTTPS). If the retry also stalls, it errors out with a clear message.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: include transport type (xet/https) in heartbeat and stall log messages

Makes it clear in backend logs whether the download is using xet or
https transport, and which transport stalled -- helpful for debugging.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: monitor HF Hub .tmp dir to avoid false stall detections

huggingface_hub downloads into .tmp/ before atomically moving to
blobs/. Without monitoring .tmp, a large shard actively downloading
for several minutes would show zero blob growth and trigger a false
stall.

* fix: scope HF cache size check to specific model being loaded

Instead of scanning every models--*/blobs directory (O(N) with cached
models), only check the specific model's blobs dir plus the global
.tmp dir. Much faster on systems with many cached models.

* Fix false stall detection on cached/local models and cleanup issues

- Only fire stall if download activity was observed (cache size changed
  at least once). Previously, any model load taking >180s would trigger
  a false stall, even for already-cached or local models where no
  download is happening.
- Return -1 from _get_hf_cache_size on exception to distinguish
  "unable to measure" from "genuinely zero bytes". Skip stall logic
  when measurement fails.
- Add _shutdown_subprocess before raising on terminal stall path to
  prevent leaking a stuck subprocess.
- Detect pre-existing HF_HUB_DISABLE_XET=1 in the parent environment
  to avoid a redundant retry cycle when Xet is already disabled.
- Remove global .tmp directory scanning (not used by modern
  huggingface_hub; in-progress downloads use .incomplete files in
  blobs/ which are already captured by iterdir).
- Add f.is_file() guard in cache size calculation.
- Replace em dashes with ASCII dashes for Windows terminal compat.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Harden stall detection edge cases

- Guard -1 to valid value transition: when initial _get_hf_cache_size
  returns -1 (error) and later recovers to a real value, do not count
  that as download activity. Only set saw_download_activity when the
  previous measurement was also valid (>= 0).
- Move os import to top-level in orchestrator.py instead of inline
  import os as _os.
- Fix misleading comment about post-download protection.

* Use .incomplete files to detect active downloads for stall detection

Replace the saw_download_activity heuristic with direct .incomplete file
detection. huggingface_hub creates *.incomplete files in blobs/ during
active downloads and removes them on completion. This gives a reliable
signal for whether a download is actually in progress.

Benefits:
- Cached models: no .incomplete files -> no stall fired even after 180s
- Post-download init (quantization, GPU loading): .incomplete files gone
  so stall timer resets, long init phases are not killed
- Pre-download hangs (XET handshake stall): .incomplete files are
  created at download start, so zero-byte stalls are now detected
- No more false positives from -1 to valid measurement transitions

The _get_hf_download_state function now returns (total_bytes,
has_incomplete) tuple or None on error, replacing _get_hf_cache_size.
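The probe can be sketched against the HF cache layout (models--org--name/blobs). This is a reconstruction for illustration: the real function also watches multiple repos and returns None on error, and the model name below is hypothetical.

```python
# Sketch: report (total_bytes, has_incomplete) for one model's blobs dir.
# *.incomplete files in blobs/ mean a download is actively in progress;
# stall detection only fires when has_incomplete is True and total_bytes
# stops growing.
from pathlib import Path
import tempfile

def get_hf_download_state(cache_dir, model_name):
    blobs = (Path(cache_dir)
             / f"models--{model_name.replace('/', '--')}" / "blobs")
    total, has_incomplete = 0, False
    try:
        for f in blobs.iterdir():
            if not f.is_file():
                continue  # guard against stray directories
            total += f.stat().st_size
            if f.name.endswith(".incomplete"):
                has_incomplete = True
    except FileNotFoundError:
        pass  # model not cached yet: zero bytes, no active download
    return total, has_incomplete

# Build a fake cache to exercise the probe:
tmp = Path(tempfile.mkdtemp())
blobs = tmp / "models--unsloth--demo" / "blobs"
blobs.mkdir(parents=True)
(blobs / "abc123").write_bytes(b"x" * 10)
(blobs / "def456.incomplete").write_bytes(b"y" * 5)
state = get_hf_download_state(tmp, "unsloth/demo")
missing = get_hf_download_state(tmp, "other/model")
```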

* Add debug logging to download state exception handler

Log the exception at debug level when _get_hf_download_state fails,
instead of silently returning None. Helps with troubleshooting cache
measurement issues.

* Watch both adapter and base model repos for LoRA stall detection

When loading a LoRA adapter, the actual download bottleneck is often
the base model, not the adapter itself. Update the heartbeat to watch
both mc.identifier and mc.base_model cache directories so stall
detection works for LoRA loads where the base model stalls on Xet.

Also update _get_hf_download_state to accept multiple model names and
skip names without "/" (local paths) since those do not have HF cache
directories.

* Fix model name filtering for official HF models without org prefix

Models like gpt2 and bert-base-uncased do not contain a slash but are
still valid HF Hub models with cache directories. Replace the "/" check
with a proper local-path detection that checks for path separators and
path-like prefixes instead.

Also fix the base_model watch list to not require "/" in the base model
name, so official models used as LoRA bases are also monitored.

* Fix local path detection that broke all org/model names on Linux

The os.path.sep check matched "/" in HF model IDs like "org/model" on
Linux, causing the stall detector to skip ALL standard HF models.

Replace with a check that only skips names starting with "/" (absolute
paths), "." (relative paths), "~" (home-relative), or containing "\"
(Windows paths). HF model IDs like "org/model" or "gpt2" pass through
correctly on all platforms.
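The final heuristic is small enough to sketch directly; the function name is illustrative.

```python
# Sketch: skip only names that are clearly filesystem paths, so HF IDs
# like "org/model" or bare "gpt2" pass through on every platform.
def is_local_path(name):
    return name.startswith(("/", ".", "~")) or "\\" in name

watchable = [n for n in ("unsloth/Llama-3-8B", "gpt2", "/models/x",
                         "./ckpt", "~/models", r"C:\Models\x")
             if not is_local_path(n)]
```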

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-03-31 03:00:46 -07:00
Daniel Han
e164c930ff
fix(studio): correct default weight_decay and learning rate (#4695)
* fix(studio): change default weight_decay from 0.01 to 0.001

The default weight decay across Studio was 0.01 but should be 0.001.
Updated the default in all backend fallbacks, the Pydantic model, the
frontend config, and every YAML preset/model-default config.

* fix(studio): auto-set learning rate based on training method

Default LR should be 2e-4 for LoRA/QLoRA and 2e-5 for full fine-tuning.

Frontend: track whether the user has manually edited the LR field via a
_learningRateManuallySet flag (same pattern as trainOnCompletions).
When switching training method and the user has not touched the LR,
auto-set it to the appropriate default. Reset the flag on model load.

Backend: change trainer.py start_training default from 5e-5 to 2e-4,
update default.yaml fallback from 5e-5 to 2e-4, and fix
full_finetune.yaml from 0.0002 (2e-4) to 2e-5.

* refactor(studio): centralize weight_decay and learning rate defaults

Create studio/backend/core/training/constants.py as the single source of
truth for DEFAULT_WEIGHT_DECAY (0.001), DEFAULT_LEARNING_RATE (2e-4),
DEFAULT_LEARNING_RATE_FULL (2e-5), and DEFAULT_LEARNING_RATE_STR ("2e-4").

All backend modules (trainer.py, training.py, worker.py, models/training.py)
now import from constants.py instead of hardcoding values.

On the frontend, add LR_DEFAULT_LORA and LR_DEFAULT_FULL to
config/training.ts and use them in the store instead of magic numbers.
A comment cross-references the backend constants file.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix model-specific LR override, persist migration, and flag resets

- Preserve model-specific learning rates from YAML configs when the
  async autoSelectTrainingMethod callback fires (fixes Qwen2.5-1.5B
  getting 2e-4 instead of its configured 1e-5, etc.)
- Bump zustand persist version to 9 with migration so existing users
  with weightDecay=0.01 get updated to 0.001
- Clear _learningRateManuallySet in reset() and applyConfigPatch()
  for consistency with trainOnCompletions flag behavior
- Add DEFAULT_LEARNING_RATE_FULL_STR to constants.py

* Refine applyConfigPatch to only clear LR flag when patch includes LR

Only reset _learningRateManuallySet when the applied config patch
actually provides a learningRate value. This prevents unrelated config
patches from silently disarming the manual-edit guard, which would
cause a subsequent setTrainingMethod call to overwrite the user's
custom LR.

* Preserve model-specific LR when switching between qlora and lora

Only auto-switch the learning rate when the training category changes
(adapter <-> full fine-tuning). Switching between qlora and lora keeps
the current LR since both methods share the same learning rate range.
This preserves curated per-model defaults (e.g. 1e-5 for
Qwen2.5-1.5B-Instruct) when the user toggles between adapter methods.
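The two rules (method-dependent defaults, and switching only across categories) can be sketched as below. Constants mirror the commit messages (2e-4 for adapter methods, 2e-5 for full fine-tuning); the function names and the manually-set flag are illustrative stand-ins for the store logic.

```python
# Sketch: LR defaults and the category-change guard.
ADAPTER_METHODS = {"lora", "qlora"}

def default_learning_rate(method):
    return 2e-4 if method in ADAPTER_METHODS else 2e-5

def should_auto_switch_lr(old_method, new_method, manually_set):
    # Never touch a user-edited LR; otherwise only switch when the
    # training category changes (adapter <-> full fine-tuning), so
    # qlora <-> lora toggles keep curated per-model values.
    if manually_set:
        return False
    return (old_method in ADAPTER_METHODS) != (new_method in ADAPTER_METHODS)
```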

* Remove constants.py, use YAML configs as the source of truth

The YAML config files (model-specific + default.yaml) are the intended
config layer for training defaults. The Python backend fallbacks now use
inline values that match the YAML configs, rather than importing from a
separate constants module. This keeps the config architecture simple:
YAML files are the single source of truth, and the inline Python
fallbacks are just safety nets that mirror them.

* fix(studio): preserve model-specific LR when switching training method

Stash YAML-provided learning rate and use it to restore the correct
value when switching between adapter and full fine-tune modes.

- qlora <-> lora no longer overwrites the model's LR
- full -> adapter restores the YAML LR instead of a hardcoded constant
- selecting a model while on full fine-tune uses LR_DEFAULT_FULL
  instead of applying the YAML adapter LR

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
2026-03-31 13:50:25 +04:00
Wasim Yousef Said
28aaf849bf
fix: throttle and cache HuggingFace modelInfo API calls (#4696)
* fix: throttle and cache HuggingFace modelInfo API calls

The frontend was firing 40 to 60 parallel modelInfo requests on app
startup with zero caching or deduplication, tripping HF rate limits.

Adds a caching layer (hf-cache.ts) with TTL cache, inflight request
dedup, and a concurrency limiter. Also debounces the HF token input
so typing a token no longer re-fires all model searches per keystroke.
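The three mechanisms in hf-cache.ts compose like this. The real module is TypeScript; the following is a language-agnostic sketch in Python with hypothetical names, showing how a TTL cache, inflight request dedup, and a concurrency limiter interact.

```python
# Sketch: TTL cache + inflight dedup + concurrency limit for model-info
# fetches. Concurrent callers for the same id share one request.
import asyncio
import time

class HFInfoCache:
    def __init__(self, fetch, ttl=300.0, max_concurrent=3):
        self._fetch = fetch
        self._ttl = ttl
        self._cache = {}     # model_id -> (timestamp, value)
        self._inflight = {}  # model_id -> Task shared by concurrent callers
        self._sem = asyncio.Semaphore(max_concurrent)

    async def model_info(self, model_id):
        hit = self._cache.get(model_id)
        if hit and time.monotonic() - hit[0] < self._ttl:
            return hit[1]                     # fresh cache hit
        task = self._inflight.get(model_id)
        if task is None:                      # first caller starts the fetch
            task = asyncio.ensure_future(self._do(model_id))
            self._inflight[model_id] = task
        try:
            return await task                 # later callers join it
        finally:
            self._inflight.pop(model_id, None)

    async def _do(self, model_id):
        async with self._sem:                 # cap parallel upstream calls
            value = await self._fetch(model_id)
        self._cache[model_id] = (time.monotonic(), value)
        return value

calls = []

async def fake_fetch(model_id):               # stands in for the HF API
    calls.append(model_id)
    await asyncio.sleep(0)
    return {"id": model_id}

async def main():
    cache = HFInfoCache(fake_fetch)
    return await asyncio.gather(
        *(cache.model_info("unsloth/demo") for _ in range(5)))

results = asyncio.run(main())
```

Five concurrent callers result in a single upstream fetch; the semaphore additionally caps how many distinct ids are fetched at once.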

* fix: only fetch VRAM info for visible models in chat selector

* Fix cache key isolation and VRAM badge stability for PR #4696

- Cache key now includes a token fingerprint (last 8 chars) instead of a
  boolean, so switching HF tokens gives separate cache entries instead of
  serving stale data from the previous token.
- Extract token via credentials?.accessToken to match the @huggingface/hub
  API surface.
- Extend CachedResult type with safetensors/tags fields so downstream
  consumers no longer need unsafe `as` casts.
- Merge VRAM param map with previous state on scroll instead of replacing
  it, preventing a brief flash of missing VRAM badges when new models
  become visible.

* Fix VRAM badges missing for search-filtered recommended models

When a user types a search query, filteredRecommendedIds can include
models beyond the currently visible page. These models had no VRAM data
because useRecommendedModelVram only received visibleRecommendedIds.

Now we pass the union of visibleRecommendedIds and filteredRecommendedIds
to the VRAM hook, so recommended models surfaced by search also show
their VRAM badges. The hf-cache layer ensures no duplicate network calls.

* Apply biome formatting to hf-cache.ts and use-recommended-model-vram.ts

Auto-formatted with biome check --write to match project lint rules:
- Block statements for single-line if/for bodies
- Import sorting (type imports first)
- Consistent line wrapping

* Fix extractToken to handle both current and deprecated HF auth forms

The @huggingface/hub CredentialsParams type is a union:
  - { accessToken: "hf_..." }               (current preferred form)
  - { credentials: { accessToken: "..." } }  (deprecated form)

Previously only checked params.credentials?.accessToken (deprecated path).
Now checks both forms so the cache key is correct regardless of which
calling convention is used.

* Simplify extractToken, map merge, and set construction

- extractToken: remove type assertions, use direct property access with
  truthiness checks for cleaner union type handling
- VRAM map merge: use Map spread constructor instead of manual for loop
- idsForVram: use Set spread construction for more concise dedup
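The Map-spread merge and Set dedup have direct Python analogues; a sketch of both patterns (the real code is TypeScript, so values here are illustrative):

```python
# Merge instead of replace, so previously fetched VRAM entries survive:
prev_vram = {"model-a": 8.0, "model-b": 12.0}
newly_visible = {"model-b": 12.0, "model-c": 24.0}
merged = {**prev_vram, **newly_visible}

# Set construction dedups the visible/filtered id union in one expression:
visible = ["model-a", "model-b"]
filtered = ["model-b", "model-c"]
ids_for_vram = sorted({*visible, *filtered})
```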

* Add rationale comment for MAX_CONCURRENT=3 in hf-cache.ts

* Skip GGUF repos in VRAM fetch and pre-populate cache from listModels

Two changes to reduce redundant HF API calls:

1. Filter GGUF repos from idsForVram before passing to useRecommendedModelVram.
   GGUF repos have no safetensors metadata and the render layer already shows
   a static "GGUF" badge -- fetching modelInfo for them yields nothing usable
   while still consuming a semaphore slot and a network round-trip.

2. Add primeCacheFromListing() to hf-cache.ts and call it from listModels
   yield sites in mergedModelIterator and priorityThenListingIterator.
   listModels returns the same type (ModelEntry & Pick<ApiModelInfo, T>) as
   modelInfo with the same additionalFields, so the data is interchangeable.
   Priming only writes if the key is not already fresh, so it never overwrites
   a recent modelInfo response.

   This means models discovered via listModels are already in cache when
   useRecommendedModelVram later calls cachedModelInfo for them, eliminating
   duplicate network requests.

* Fix cache key mismatch: prime both token and anonymous slots

The VRAM hook calls cachedModelInfo without credentials (anonymous key),
but listModels results were primed only under the authenticated key.
For authenticated users the priming was a no-op -- cache miss every time.

Fix: prime both the token-specific slot and the anonymous slot when an
access token is present. Public model metadata (safetensors, tags) is
identical regardless of auth so this is safe.

Also add a defensive guard in primeCacheFromListing for empty name.

* Auto-prime anonymous cache slot from authenticated modelInfo fetches

When cachedModelInfo is called with a token, the result was only stored
under the token-specific key (e.g. model::abc12345). The VRAM hook
calls cachedModelInfo without credentials and reads the anonymous slot
(model::anon), causing a cache miss and duplicate fetch for every
priority model.

Now cachedModelInfo also writes to the anonymous slot on success when
a token is present. Public model metadata (safetensors, tags) is
identical regardless of auth, so this is safe and eliminates ~10
duplicate API calls on first page load.

* Guard anonymous cache priming against gated/private models

Only prime the anonymous cache slot for non-gated, non-private models.
Previously, authenticated modelInfo responses and listing results were
unconditionally copied into the anonymous slot, which could briefly
expose gated/private model metadata after clearing the HF token.

Now checks result.gated and result.private before writing the anon slot.
Public unsloth/ models (the common case) still benefit from the
optimization; gated models like meta-llama/* require a fresh fetch
per auth context.
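A sketch of the guard (cache shape and names are hypothetical; the real logic lives in hf-cache.ts):

```python
def prime_anon_slot(cache: dict, model_id: str, result: dict) -> None:
    # Only copy public metadata into the anonymous slot; gated/private
    # models must be re-fetched per auth context.
    if result.get("gated") or result.get("private"):
        return
    # setdefault: priming never overwrites an existing fresh entry.
    cache.setdefault(f"{model_id}::anon", result)
```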

* Extract primeFromListing helper to deduplicate cache priming logic

The cache priming pattern (prime token slot + conditionally prime anon
slot for non-gated models) was duplicated in three places. Extracted
into a single primeFromListing() function for maintainability.

* Export CachedResult type, add isStale helper, simplify primeFromListing

- Export CachedResult so consumers can use it directly instead of
  the indirect Parameters<typeof ...> pattern.
- Extract isStale(key) helper to deduplicate the cache freshness
  check that was repeated in primeCacheFromListing, cachedModelInfo,
  and the anonymous-slot priming logic.
- Simplify primeFromListing to use CachedResult directly for both
  the data parameter and the gated/private guard, eliminating the
  double cast.

---------

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-03-31 02:21:17 -07:00
Datta Nimmaturi
3b5a49776b
[studio] multi gpu: revert to balanced for inference. (#4698)
* Revert to balanced for inference

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unused for_inference parameter from get_device_map

Since inference and training both use "balanced" now, the for_inference
flag is dead code. Remove it from the function signature, the call site
in inference.py, and simplify the tests accordingly.

* Remove redundant TestDeviceMapForInference test class

TestGpuAutoSelection already covers the same multi-gpu and single-gpu
device_map assertions. The TestDeviceMapForInference class was left
over from when for_inference had distinct behavior.

* Remove redundant test_get_device_map_multi_gpu_uses_balanced

Its assertions ([0,1] -> balanced, [0] -> sequential) are already
covered by test_get_device_map_uses_explicit_gpu_selection.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-03-31 01:24:41 -07:00
Daniel Han
fe6609a624
fix(studio): open tour ReadMore links in new tab (#4694)
* fix(studio): open tour ReadMore links in new tab

The quick tour "Read more" links navigate away from Studio instead of
opening in a separate tab. Add target="_blank" and rel="noopener
noreferrer" to the ReadMore component so external doc links open in a
new browser tab.

* fix(studio): only open external ReadMore links in new tab

Apply target="_blank" conditionally based on whether the href starts
with "http", so internal links still navigate in the same tab.

* Tighten external-link detection in ReadMore component

Use regex /^https?:\/\// instead of startsWith("http") so the check
requires the full protocol prefix and does not match non-URL strings
that happen to begin with "http".

* Hoist regex to module scope for ReadMore

Move EXTERNAL_URL_RE to top-level constant to satisfy the biome
useTopLevelRegex lint rule and avoid re-creating the RegExp on
every render.
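The hoisted check, sketched with Python's re module (the real constant is a JS RegExp in the ReadMore component):

```python
import re

# Module-scope constant, compiled once: require the full protocol prefix.
EXTERNAL_URL_RE = re.compile(r"^https?://")

def is_external(href: str) -> bool:
    return EXTERNAL_URL_RE.match(href) is not None
```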

---------

Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
2026-03-30 23:41:14 -07:00
Lee Jackson
308bb948d1
studio: prevent false multimodal warning during model loading (#4704)
* studio: gate multimodal incompatibility warning on settled model capabilities

* Also disable Start button during isCheckingVision fallback

When getModelConfig fails and the fallback checkVisionModel is still
in-flight, isLoadingModelDefaults clears before isCheckingVision does.
Without also gating on isCheckingVision the Start button briefly
re-enables with stale capability flags.

Add isCheckingVision to the disabled condition and show "Loading
model..." text while either flag is active.

* Show correct error message for audio dataset incompatibility

The incompatibility warning always said "switch to a vision model"
even when the actual issue was an audio dataset on a non-audio model.
Now shows an audio-specific message when the mismatch is audio.

* Extract isLoadingModel constant for clarity

Pull the combined model-loading condition into a single constant
reused by the settled check, the disabled prop, and the button label.

---------

Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
2026-03-30 23:11:20 -07:00
pre-commit-ci[bot]
66f250a614
[pre-commit.ci] pre-commit autoupdate (#4705)
updates:
- [github.com/astral-sh/ruff-pre-commit: v0.15.7 → v0.15.8](https://github.com/astral-sh/ruff-pre-commit/compare/v0.15.7...v0.15.8)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-03-30 21:58:16 -07:00
Roland Tannous
d6d3f59984
fix: replace hard timeout with inactivity timeout for model loading (#4707)
The 180s wall-clock timeout would kill model loads on slow connections
even when the download was actively progressing. Now the worker sends
heartbeat status messages every 30s during loading, and the orchestrator
resets its 300s deadline on each one — so it only times out when the
subprocess goes truly silent.
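A minimal sketch of the inactivity-timeout idea, driven by synthetic timestamps instead of a real clock (function name and event shape are hypothetical, not the actual worker protocol):

```python
def wait_with_inactivity_timeout(events, inactivity_timeout=300.0, start=0.0):
    """Return "loaded" if the load finishes before the worker goes silent
    for longer than inactivity_timeout seconds; "timeout" otherwise.

    events: list of (timestamp_seconds, kind), kind in {"heartbeat", "done"}.
    """
    deadline = start + inactivity_timeout
    for ts, kind in events:
        if ts > deadline:
            return "timeout"                     # truly silent: give up
        if kind == "done":
            return "loaded"
        deadline = ts + inactivity_timeout       # heartbeat resets the deadline
    return "timeout"
```

A slow but steadily progressing download keeps resetting the deadline, while a stalled one still times out.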
2026-03-31 07:35:04 +04:00
Roland Tannous
7f353acfd4
fix: skip download progress polling for exported GGUF models (#4709)
* fix: skip download progress polling for exported GGUF models

* fix: revert isLocalGgufDir change — exported GGUFs are file paths, not dirs

* fix: set isDownloaded true for all adapters in LoraModelPicker
2026-03-31 07:21:23 +04:00
Etherll
34272a796f
Fix/bun windows bin detection (#4703)
* fix(studio): detect bun .exe shims in Windows binary check

* Update setup.sh

* add .bunx checking
2026-03-30 21:58:33 +04:00
Daniel Han
6d83ad9a28
fix(studio): avoid UnicodeEncodeError on Windows cp1252 consoles (#4699)
* fix(studio): replace unicode emoji in print() to avoid cp1252 crash on Windows

On Windows the default console encoding is cp1252 which cannot encode
Unicode emoji like U+2705 or U+26A0. Bare print() calls with these
characters cause a UnicodeEncodeError at runtime.

- run.py: replace emoji with ASCII status prefixes [OK] and [WARNING]
- format_conversion.py: remove duplicate print() that mirrors the
  logger.info() call on the next line, and drop the emoji from the
  log message since loggers handle encoding separately

* fix(studio): apply same emoji/print cleanup to parallel VLM conversion path

The parallel URL-based conversion logic has the same duplicate print()
with emoji that was fixed in the sequential path. Remove the bare
print() and drop the emoji from the logger.info() call.

* Treat install_python_stack.py failure as fatal in setup.ps1

On Linux/Mac, setup.sh runs under set -euo pipefail so a non-zero
exit from install_python_stack.py aborts the installer. On Windows,
setup.ps1 had no exit code check -- if the Python script crashed
(e.g. from the cp1252 UnicodeEncodeError), the installer silently
continued past the dependency loop and reported success. Studio
would then fail at launch with ModuleNotFoundError for structlog,
fastapi, and other deps that were never installed.

Capture $LASTEXITCODE and exit 1 if the dependency installer fails,
matching the error handling pattern already used for PyTorch install.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 06:40:47 -07:00
Daniel Han
a0bca759f3
Fix editable install scanning 6,500+ node_modules dirs (#4697)
* fix: scope packages.find to prevent node_modules namespace scanning

The packages.find section had no include filter, so setuptools'
find_namespace_packages discovered all directories as potential Python
packages -- including the 6,557 directories inside
studio/frontend/node_modules/ after the frontend build step.

This caused the editable install overlay step to run 20,000+ glob
operations across 6,619 "packages", which on fast NVMe takes ~5s but
on slower disks can take 7+ minutes.

Adding an explicit include filter scopes discovery to only the packages
we actually ship (unsloth, unsloth_cli, studio, studio.backend), dropping
from 6,619 to 58 discovered packages and the editable build time from
5.4s to 1.2s.

Also removes the broken kernels/moe exclude (used "/" instead of "."
notation so it never matched) and adds a node_modules exclude as a
safety net.

* fix: use precise node_modules exclude patterns

Use "*.node_modules" and "*.node_modules.*" instead of "*.node_modules*"
to avoid accidentally excluding valid packages that might contain
"node_modules" as a substring in their name.
2026-03-30 02:40:29 -07:00
Datta Nimmaturi
9311df2b29
[Studio] multi gpu finetuning/inference via "balanced_low0/sequential" device_map (#4602)
* [WIP] balanced device map for studio

* gpus as a request parameter

* API for multi GPU stuff

* return multi gpu util in new API

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use balanced_low0 instead of balanced

* Use balanced_low0 instead of balanced

* Fix device_map typo, UUID parsing crash, set() filter bug, and broken tests

- balanced_low0 -> balanced_low_0 (transformers/accelerate rejects the old string)
- get_parent_visible_gpu_ids() now handles UUID/MIG CUDA_VISIBLE_DEVICES
  gracefully instead of crashing on int() parse
- _get_backend_visible_gpu_info(): fix the `set() or None` bug; an empty set
  is falsy, so CUDA_VISIBLE_DEVICES=-1 would disable filtering and report all GPUs
- test_gpu_selection.py: add missing get_visible_gpu_utilization import and
  add required job_id arg to start_training() calls

* Smart GPU determinism using estimates

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* disallow gpu selection for gguf for now

* cleanup

* Slightly larger baseline

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Treat empty list as auto

* Verbose logging/debug

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Cleanup and revert unnecessary deletions

* Cleanup excessive logs and guard against disk/cpu offload

* auth for visibility API. cleanup redundant imports. Adjust QLoRA estimate

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* support for non cuda gpus

* Fix multi-GPU auto-selection memory accounting

The multi_gpu_factor was applied uniformly to all GPUs including the
first one, which unfairly penalizes single-GPU capacity when
transitioning to multi-GPU. This created a discontinuity where a model
that barely fits 1 GPU would suddenly require 2 GPUs because the first
GPU's free memory was discounted by 20%.

Now the first GPU keeps its full free memory, and only additional GPUs
have an overhead factor (0.85) applied to account for inter-GPU
communication and sharding overhead. This gives more accurate
auto-selection and avoids unnecessary multi-GPU for models that
comfortably fit on one device.
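The accounting, as a sketch (the 0.85 factor comes from the message above; everything else is hypothetical):

```python
def usable_free_memory(free_gb_per_gpu, extra_gpu_factor=0.85):
    # First GPU keeps its full free memory; additional GPUs are discounted
    # for inter-GPU communication and sharding overhead.
    return [free if i == 0 else free * extra_gpu_factor
            for i, free in enumerate(free_gb_per_gpu)]
```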

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add sandbox tests for multi-GPU selection logic

24 tests covering model size estimation, memory requirements, automatic
GPU selection, device map generation, GPU ID validation, and multi-GPU
overhead accounting. All tests use mocks so they run without GPUs on
Linux, macOS, and Windows.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix reviewer findings: 4bit inference estimate, fallback, GGUF gpu_ids, retry

1. 4-bit inference now uses reduced memory estimate (model_size/3 + buffer)
   instead of the FP16 1.3x multiplier. This prevents over-sharding
   quantized models across unnecessary GPUs.

2. When model size estimation fails, auto_select_gpu_ids now falls back to
   all visible GPUs instead of returning None (which could default to
   single-GPU loading for an unknown-size model).

3. GGUF inference route now treats gpu_ids=[] as auto-selection (same as
   None) instead of rejecting it as an unsupported explicit request.

4. Training retry path for "could not get source code" now preserves the
   gpu_ids parameter so the retry lands on the same GPUs.

5. Updated sandbox tests to cover the new 4-bit inference estimate branch.
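A sketch of the branch in point 1 (the message only says "model_size/3 + buffer", so the buffer default here is a hypothetical placeholder, not the actual value):

```python
def inference_memory_estimate(model_size_gb, load_in_4bit, buffer_gb=2.0):
    if load_in_4bit:
        return model_size_gb / 3 + buffer_gb   # reduced 4-bit estimate
    return model_size_gb * 1.3                 # FP16 multiplier
```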

* Remove accidentally added unsloth-zoo submodule

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix UUID/MIG visibility and update test expectations

1. nvidia.py: When CUDA_VISIBLE_DEVICES uses UUID/MIG tokens, the
   visibility APIs now return "unresolved" with empty device lists instead
   of exposing all physical GPUs. This prevents the UI from showing GPUs
   that the backend process cannot actually use.

2. test_gpu_selection.py: Updated test expectations to match the new
   multi-GPU overhead accounting (first GPU at full capacity, 0.85x for
   additional GPUs) and 4-bit inference memory estimation formula.
   All 60 tests now pass.

* Add CPU/disk offload guard to audio inference path

The audio model loading branch returned before the common
get_offloaded_device_map_entries() check, so audio models loaded with a
multi-GPU device_map that spilled layers to CPU/disk would be accepted
instead of rejected. Now audio loads also verify no modules are offloaded.

* Improve VRAM requirement estimates

* Replace balanced_low_0 with balanced

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refine calculations for slightly easier nums

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* adjust estimates

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use nums instead of obj to avoid serialisation error

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Harden nvidia-smi parsing and fix fallback GPU list

1. nvidia.py: Wrap int() casts for GPU index and memory in try/except
   so MIG slices, N/A values, or unexpected nvidia-smi output skip the
   unparseable row instead of aborting the entire GPU list.

2. nvidia.py: Handle GPU names containing commas by using the last
   field as memory instead of a fixed positional index.

3. hardware.py: fallback_all now uses gpu_candidates (GPUs with verified
   VRAM data) instead of raw devices list, which could include GPUs
   with null VRAM that were excluded from the ranking.
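Points 1 and 2 can be sketched as a single row parser (function name and column layout are illustrative, not the actual nvidia.py code):

```python
def parse_smi_row(row: str):
    # GPU names may themselves contain commas, so take the last field as
    # memory instead of a fixed positional index; unparseable rows (MIG
    # slices, N/A values) are skipped rather than aborting the whole list.
    fields = [f.strip() for f in row.split(",")]
    try:
        index = int(fields[0])
        memory_mib = int(fields[-1])
    except (ValueError, IndexError):
        return None
    name = ", ".join(fields[1:-1])
    return index, name, memory_mib
```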

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* cleanup

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* consolidate raise_if_offload

* Improve MoE support. Guard against nvidia-smi failures

* Improve MoE support. Guard against nvidia-smi failures

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix shared-expert LoRA undercount, torch VRAM fallback, and apply_gpu_ids edge case

1. vram_estimation.py: compute_lora_params now includes shared experts
   (n_shared_experts) alongside routed experts when computing MoE LoRA
   adapter parameters. Previously only n_experts were counted, causing
   the estimator to undercount adapter, optimizer, and gradient memory
   for DeepSeek/GLM-style models with shared experts.

2. hardware.py: _torch_get_per_device_info now uses mem_get_info (which
   reports system-wide VRAM usage) instead of memory_allocated (which
   only reports this process's PyTorch allocations). This prevents
   auto-selection from treating a GPU as mostly free when another
   process is consuming VRAM. Falls back to memory_allocated when
   mem_get_info is unavailable.

3. hardware.py: apply_gpu_ids([]) now returns early instead of setting
   CUDA_VISIBLE_DEVICES="" which would disable CUDA entirely. Empty
   list inherits the parent visibility, same as None.

4. hardware.py: Upgraded fallback_all GPU selection log from debug to
   warning so operators are notified when the model likely will not fit
   in available VRAM.

* Guard nvidia-smi subprocess calls against OSError and TimeoutExpired

get_visible_gpu_utilization and get_backend_visible_gpu_info now catch
OSError (nvidia-smi not found) and TimeoutExpired internally instead
of relying on callers to wrap every invocation. Returns the standard
available=False sentinel on failure so the torch-based fallback in
hardware.py can take over.

* Guard get_primary_gpu_utilization and reset GPU caches between tests

1. nvidia.py: get_primary_gpu_utilization now catches OSError and
   TimeoutExpired internally, matching the pattern already used in
   get_visible_gpu_utilization and get_backend_visible_gpu_info. All
   three nvidia-smi callers are now self-contained.

2. test_gpu_selection.py: Added _GpuCacheResetMixin that resets the
   module-level _physical_gpu_count and _visible_gpu_count caches in
   tearDown. Applied to all test classes that exercise GPU selection,
   device map, or visibility functions. This prevents stale cache
   values from leaking between tests and causing flaky results on
   machines with real GPUs.

* Fix nvidia-smi fallback regression and physical GPU count validation

1. hardware.py: get_gpu_utilization, get_visible_gpu_utilization, and
   get_backend_visible_gpu_info now check result.get("available") before
   returning the nvidia-smi result. When nvidia-smi is unavailable or
   returns no data (e.g., containers without nvidia-smi, UUID/MIG masks),
   the functions fall through to the torch-based fallback instead of
   returning an empty result. This fixes a regression where the internal
   exception handling in nvidia.py prevented the caller's except block
   from triggering the fallback.

2. hardware.py: resolve_requested_gpu_ids now separates negative-ID
   validation from physical upper-bound validation. The physical count
   check is only enforced when it is plausibly a true physical count
   (i.e., higher than the largest parent-visible ID), since
   torch.cuda.device_count() under CUDA_VISIBLE_DEVICES returns the
   visible count, not the physical total. The parent-visible-set check
   remains authoritative in all cases. This prevents valid physical IDs
   like [2, 3] from being rejected as "out of range" when nvidia-smi is
   unavailable and CUDA_VISIBLE_DEVICES="2,3" makes torch report only
   2 devices.

* Fix UUID/MIG torch fallback to enumerate devices by ordinal

When CUDA_VISIBLE_DEVICES uses UUID or MIG identifiers,
get_parent_visible_gpu_ids() returns [] because the tokens are
non-numeric. The torch fallback in get_visible_gpu_utilization() and
get_backend_visible_gpu_info() previously passed that empty list to
_torch_get_per_device_info(), getting nothing back.

Now both functions detect the empty-list case and fall back to
enumerating torch-visible ordinals (0..device_count-1) with
index_kind="relative". This means the UI and auto-selection still
see real device data in Kubernetes, MIG, and Slurm-style UUID
environments where nvidia-smi output cannot be mapped to physical
indices.

Updated test_uuid_parent_visibility to verify the new torch fallback
path returns available=True with relative ordinals.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add type hint for gpu_ids parameter in InferenceOrchestrator.load_model

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-03-30 02:33:15 -07:00
Michael Han
fbfcbc69f2
Update README.md 2026-03-30 01:34:36 -07:00
Michael Han
d2b8ed8def
Update install.md 2026-03-30 01:33:33 -07:00
Lee Jackson
2f0a5baa87
fix(studio): preserve GGUF context max after apply and refresh (#4691)
Fixes #4670

Separates the GGUF context slider ceiling from the currently active context length so lowering context via Chat Settings no longer locks the slider max to the reduced value.

- Backend: adds `max_context_length` to GGUF load/status responses, computed from the largest VRAM/KV-fit cap across all usable GPU subsets
- Frontend: stores `ggufMaxContextLength` and uses it for Context Length slider/input bounds; hydrates from both `/api/inference/load` and `/api/inference/status`
- Defaults UI ceiling to native context for CPU-only and fallback paths
- Seeds `effective_ctx` and `max_available_ctx` before GPU probing to prevent `UnboundLocalError` on probe failure
- Property fallback uses native `_context_length`, not effective `context_length`
2026-03-30 01:33:16 -07:00
Lee Jackson
5557e1fd27
studio: unify Windows installer/setup logging style, verbosity controls, and startup messaging (#4651)
* refactor(studio): unify setup terminal output style and add verbose setup mode

* studio(windows): align setup.ps1 banner/steps with setup.sh (ANSI, verbose)

* studio(setup): revert nvcc path reordering to match main

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio(setup): restore fail-fast llama.cpp setup flow

* studio(banner): use IPv6 loopback URL when binding :: or ::1

* Fix IPv6 URL bracketing, try_quiet stderr, _step label clamp

- Bracket IPv6 display_host in external_url to produce clickable URLs
- Redirect try_quiet failure log to stderr instead of stdout
- Clamp _step label to column width to prevent negative padding

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add sandbox integration tests for PR #4494 UX fixes

Simulation harness (tests/simulate_pr4494.py) creates an isolated uv
venv, copies the real source files into it, and runs subprocess tests
for all three fixes with visual before/after demos and edge cases.

Standalone bash test (tests/test_try_quiet.sh) validates try_quiet
stderr redirect across 8 scenarios including broken-version contrast.

39 integration tests total (14 IPv6 + 15 try_quiet + 10 _step), all
existing 75 unit tests still pass.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Truncate step() labels in setup.sh to match PS1 and Python

The %-15s printf format pads short labels but does not truncate long
ones.  Change to %-15.15s so labels wider than 15 chars are clipped,
matching the PowerShell .Substring(0,15) and Python label[:15] logic.
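Python's %-formatting follows the same printf width/precision rules, so the difference is easy to check (the label string is a made-up example):

```python
label = "Install llama.cpp deps"     # 22 characters
padded_only = "%-15s" % label        # width pads short strings, never clips
clipped = "%-15.15s" % label         # .15 precision also clips long ones
```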

* Remove sandbox integration tests from PR

These test files are not part of the styling fix and should not
ship with this PR.

* Show error output on failure instead of suppressing it

- install_python_stack.py: restore _red for patch_package_file
  warnings (was downgraded to _dim)
- setup.ps1: capture winget output and show on failure for CUDA,
  Node, Python, and OpenSSL installs (was piped to Out-Null)
- setup.ps1: always show git pull failure warning, not just in
  verbose mode

* Show winget error output for Git and CMake installs on failure

Same capture-and-print-on-failure pattern already used for
Node, Python, CUDA, and OpenSSL winget installs.

* fix: preserve stderr for _run_quiet error messages in setup.sh

The step() helper writes to stdout, but _run_quiet's error header
was originally sent to stderr (>&2). Without the redirect, callers
that separate stdout/stderr would miss the failure headline while
still seeing the log body on stderr. Add >&2 to both step calls
inside _run_quiet to match main's behavior.

* feat: add --verbose flag to setup and update commands

Wire UNSLOTH_VERBOSE=1 through _run_setup_script() so that
'unsloth studio update --verbose' (and the deprecated 'setup')
passes the flag to setup.sh / setup.ps1 / install_python_stack.py.

* fix(studio): honor verbose logging and keep llama.cpp failures non-blocking

* fix(studio): switch installer to 'studio update' and normalize Windows setup logs

* chore(studio): refine localhost tip and remove skip-base setup noise

* fix(studio): align Windows setup logs with Linux style and improve startup tips

* fix(studio): align Windows setup logs with Linux style

* refactor(windows-installer): align install/setup logs with Linux style and silence auto-launch output

* refactor(windows): align installer/setup output with Linux style and reduce default verbosity

* refactor(windows): match install.ps1 output style/colors to setup and quiet default logs

* fix(studio-banner): update personal-computer localhost tip

* fix(setup.sh): restore verbose llama.cpp build output while keeping default quiet mode

* fix(install.sh): align installer logging with setup style and restore POSIX-safe color output

* fix(install.sh): preserve installer reliability and launch visibility

Export verbose mode for child setup processes, harden install command handling under set -e, and keep first-run studio launch non-silent so users can always see URL and port fallback output.

* fix(windows installer): keep exit semantics and degrade status accurate

Use quiet command redirection that preserves native exit codes, keep startup output visible on first launch, and report limited install status when llama.cpp is unavailable.

* fix(setup.sh): improve log clarity and enforce GGUF degraded signaling

Restore clean default setup output, add verbose-only diagnostics, fail fast on Colab dependency install errors, and return non-zero when GGUF prerequisites or llama.cpp artifacts are unavailable.

* fix(installer): harden bash preflight and PowerShell GPU checks

Fail fast when bash is unavailable before invoking setup.sh, and replace remaining nvidia-smi pipeline checks with stream redirection patterns that preserve reliable native exit-code handling.

* fix(windows): keep verbose output visible while preserving exit codes

Ensure PowerShell wrapper helpers in install/update stream native command output to host without returning it as function output, so npm logs no longer corrupt exit-code checks in verbose mode.

* fix(windows): avoid sticky UNSLOTH_VERBOSE and gate studio update verbosity

* Fix degraded llama.cpp exit code, PS verbose stderr, banner URLs, npm verbose

- setup.sh: Do not exit non-zero when llama.cpp is unavailable; the footer
  already reports the limitation, and install.sh runs under set -e so a
  non-zero exit aborts the entire install including PATH/shortcuts/launch.
- setup.ps1: Remove $? check in Invoke-SetupCommand verbose path; PS 5.1
  sets $? = $false when native commands write to stderr even with exit 0.
  Merge stderr into stdout with 2>&1 and rely solely on $LASTEXITCODE.
- startup_banner.py: Show the actual bound address when Studio is bound to
  a non-loopback interface instead of always showing 127.0.0.1/localhost.
- setup.sh: Use run_quiet_no_exit instead of run_quiet_no_exit_always for
  npm install steps so --verbose correctly surfaces npm output.

* Fix install.ps1 verbose stderr, propagate UNSLOTH_VERBOSE, fix git clone verbose

- install.ps1: Apply same Invoke-InstallCommand fix as setup.ps1 -- merge
  stderr into stdout with 2>&1 and drop the $? check that misclassifies
  successful native commands on PS 5.1.
- install.ps1 + setup.ps1: Export UNSLOTH_VERBOSE=1 to the process env
  when --verbose is passed so child processes like install_python_stack.py
  also run in verbose mode.
- setup.sh: Use run_quiet_no_exit for git clone llama.cpp so --verbose
  correctly surfaces clone diagnostics during source-build fallback.

* Surface prebuilt llama.cpp output in verbose mode, remove dead code, fix banner

- setup.sh: Use tee in verbose mode for prebuilt llama.cpp installer so
  users can see download/validation progress while still capturing the log
  for structured error reporting on failure.
- setup.ps1: Same fix for Windows -- use Tee-Object in verbose mode.
- setup.sh: Remove run_quiet_no_exit_always() which has no remaining callers.
- startup_banner.py: Avoid printing the same URL twice when Studio is
  bound to a specific non-loopback address that matches the display host.

* Fix run_install_cmd exit code after failed if-statement

The previous pattern 'if "$@"; then return 0; fi; _rc=$?' always captured
$? = 0 because $? reflects the if-statement result, not the command's exit
code. Switch to '"$@" && return 0; _rc=$?' which preserves the actual
command exit code on failure. Applies to both verbose and quiet branches.

* Fix _run_quiet exit code, double uv install, missing --local flag

- setup.sh: Fix _run_quiet verbose path that always captured exit code 0
  due to $? resetting after if-then-fi with no else. Switch to the same
  '"$@" && return 0; exit_code=$?' pattern used in install.sh.
- setup.sh: Consolidate the two uv install branches (verbose + quiet)
  into a single attempt with conditional output. Previously, when verbose
  mode was on and the install failed, a second silent attempt was made.
- install.ps1: Pass --local flag to 'unsloth studio update' when
  $StudioLocalInstall is true. Without this, studio.py's update() command
  overwrites STUDIO_LOCAL_INSTALL to "0", which could cause issues if
  setup.ps1 or install_python_stack.py later checks that variable.

* Revert SKIP_STUDIO_BASE change for --no-torch, restore install banners

- Revert SKIP_STUDIO_BASE from 0 to 1 for --no-torch. install.sh already
  installs unsloth+unsloth-zoo and no-torch-runtime.txt before calling
  setup.sh, so letting install_python_stack.py redo it was redundant and
  slowed down --no-torch installs for no benefit.
- Restore the "Unsloth Studio installed!" success banner and "starting
  Unsloth Studio..." launch message so users get clear install completion
  feedback before the server starts.

* Make llama.cpp build failure a hard error with proper cleanup

- setup.sh: Restore exit 1 when _LLAMA_CPP_DEGRADED is true. GGUF
  inference requires a working llama.cpp build, so this should be a
  hard failure, not a silent degradation.
- install.sh: Catch setup.sh's non-zero exit with '|| _SETUP_EXIT=$?'
  instead of letting set -e abort immediately. This ensures PATH setup,
  symlinks, and shortcuts still get created so the user can fix the
  build deps and retry with 'unsloth studio update'. After post-install
  steps, propagate the failure with a clear error message.
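
Under `set -e`, the catch-and-propagate flow described above looks roughly like this. The helper names and messages are illustrative, not the real install.sh:

```shell
set -e

setup_stub() { return 3; }   # stands in for a failing setup.sh

run_install() {
    _SETUP_EXIT=0
    setup_stub || _SETUP_EXIT=$?   # left side of || is exempt from set -e

    # Post-install steps (PATH, symlinks, shortcuts) still run here, so the
    # user can fix build deps and retry later.
    echo "post-install steps completed"

    if [ "$_SETUP_EXIT" -ne 0 ]; then
        echo "setup failed (exit $_SETUP_EXIT)" >&2
        return "$_SETUP_EXIT"
    fi
}

run_install || echo "propagated exit code: $?"
```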

* Revert install.ps1 to 'studio setup' to preserve SKIP_STUDIO_BASE

'studio update' pops SKIP_STUDIO_BASE from the environment, which
defeats the fast-path version check added in PR #4667. When called
from install.ps1 (which already installed packages), SKIP_STUDIO_BASE=1
must survive into setup.ps1 so it skips the redundant PyPI check and
package reinstallation. 'studio setup' does not modify env vars.

* Remove deprecation message from 'studio setup' command

install.ps1 uses 'studio setup' (not 'studio update') to preserve
SKIP_STUDIO_BASE. The deprecation message was confusing during first
install since the user never typed the command.

* Fix stale env vars, scope degraded exit, generic error message for PR #4651

- install.ps1: Always set STUDIO_LOCAL_INSTALL and clear STUDIO_LOCAL_REPO
  when not using --local, to prevent stale values from a previous --local
  run in the same PowerShell session. Fix log messages to say 'setup' not
  'update' since we call 'studio setup'.
- setup.sh: Only exit non-zero for degraded llama.cpp when called from the
  installer (SKIP_STUDIO_BASE=1). Direct 'unsloth studio update' keeps
  degraded installs successful since Studio is still usable for non-GGUF
  workflows and the footer already reports the limitation.
- install.sh: Make the setup failure error message generic instead of
  GGUF-specific, so unrelated failures (npm, Python deps) do not show
  misleading cmake/git recovery advice.

* Show captured output on failure in quiet mode for PR #4651

Both Invoke-InstallCommand (install.ps1) and Invoke-SetupCommand
(setup.ps1) now capture command output in quiet mode and display it
in red when the command fails. This matches the behavior of
run_install_cmd in install.sh where failure output is surfaced even
in quiet mode, making cross-platform error debugging consistent.
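
The install.sh behavior being matched can be sketched as below; `run_quiet` and `noisy_fail` are illustrative names, not the actual run_install_cmd:

```shell
# Quiet mode: capture all output; surface it (here, on stderr) only when
# the command fails, preserving the command's real exit code.
run_quiet() {
    _log="$(mktemp)"
    "$@" >"$_log" 2>&1 && { rm -f "$_log"; return 0; }
    _rc=$?
    echo "command failed (exit $_rc); captured output:" >&2
    cat "$_log" >&2
    rm -f "$_log"
    return "$_rc"
}

noisy_fail() { echo "something broke"; return 5; }

run_quiet noisy_fail || true
```

The PowerShell counterparts do the same capture but print the failure output in red via the console host.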

* Match degraded llama.cpp exit on Windows, fix --local recovery hint for PR #4651

- setup.ps1: Exit non-zero for degraded llama.cpp when called from
  install.ps1 (SKIP_STUDIO_BASE=1), matching setup.sh behavior. Direct
  'unsloth studio update' keeps degraded installs successful.
- install.sh: Show 'unsloth studio update --local' in the recovery
  message when the install was run with --local, so users retry with
  the correct flag instead of losing local checkout context.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-03-30 00:53:23 -07:00
Roland Tannous
5bbfabb151
fix: [Studio] setup.ps1 update-flow for windows (#4667)
* fix: add PyPI version check to setup.ps1 for fast update path

Port the update-flow logic from setup.sh to setup.ps1 so that
`unsloth studio update` on Windows skips Python dependency reinstall
when the installed version already matches PyPI latest.

* fix: clear SKIP_STUDIO_BASE in update command

install.ps1 sets SKIP_STUDIO_BASE=1 which persists in the PowerShell
session. If the user runs `unsloth studio update` in the same terminal,
the env var causes the version check to be skipped. Clear it explicitly
in the update command.

* fix: harden version check and clear stale env vars in update flow

- Normalize $InstalledVer with Out-String + Trim() to avoid array/whitespace
  comparison issues in PowerShell 5.1 (python output can be captured as
  string[] instead of scalar string)
- Move Fast-Install --upgrade pip inside if (-not $SkipPythonDeps) so the
  fast path avoids unnecessary network round-trips
- Clear STUDIO_LOCAL_REPO when --local is not passed to prevent a previous
  --local session from leaking into a plain update

---------

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-03-29 21:14:36 -07:00
Roland Tannous
a6c1f893fc
Fix blank page on Windows due to broken .js MIME type (#4674)
* Fix blank page on Windows due to broken .js MIME type in registry

* Update studio/backend/main.py

Add Gemini's defensive suggestion: apply the mimetypes override only on Windows platforms.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-28 22:26:49 +04:00