Commit graph

4939 commits

Author SHA1 Message Date
Michael Han
e2fd946fe1
Add files via upload 2026-04-02 03:00:10 -07:00
Michael Han
31d6aeb197
Unsloth new logo 2026-04-02 02:58:21 -07:00
Daniel Han
e4d1499230
fix(studio): prevent small models from stalling on tool-calling tasks (#4769)
* fix(studio): prevent small models from stalling on tool-calling tasks

Small GGUF models (< 9B params) in "Think, Search, Code" mode would
often describe what they planned to do ("Let me create this dashboard")
and then stop generating without ever calling a tool.

Three changes:

1. Simplify web_tips for small models: remove the "fetch its full content
   by calling web_search with the url parameter" guidance for models < 9B.
   This multi-step instruction causes small models to plan elaborate
   search-then-fetch-then-code sequences they cannot reliably execute.

2. Add "always call tools directly" imperative to the system prompt nudge
   so models act immediately instead of narrating their intentions.

3. Add plan-without-action re-prompt in the agentic loop: when the model
   emits planning text (matching patterns like "let me", "I'll", etc.)
   without calling any tool, inject a nudge asking it to call the tool
   and continue the loop. Capped at 2 re-prompts per request.

Benchmarked with Qwen3.5-4B-GGUF (N=5 trials per variant):
- Baseline: 40% of requests had any tool call
- Combined fix: 100% of requests had at least one tool call
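The plan-without-action guard in change 3 can be sketched as follows (a minimal illustration; the actual pattern list and function names in the PR may differ):

```python
import re

# Illustrative planning-phrase patterns; the real list in the PR may be longer.
_PLAN_PATTERNS = re.compile(r"\b(let me|i'll|i will)\b", re.IGNORECASE)

MAX_REPROMPTS = 2  # cap from the commit message


def needs_tool_nudge(text, tool_calls, reprompts_used):
    """True when the model narrated a plan without calling any tool."""
    if tool_calls or reprompts_used >= MAX_REPROMPTS:
        return False
    return bool(_PLAN_PATTERNS.search(text))
```

When this returns True, the agentic loop injects the nudge message and continues instead of ending the turn.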

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-02 02:11:07 -07:00
Daniel Han
dc0729aadf
Add regression test for shell injection fix in GGML conversion (#4773)
AST-based test ensures subprocess.Popen calls in GGML conversion functions
use argv lists instead of shell=True. Companion to PR #4768.
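A minimal version of such an AST check might look like this (illustrative; the structure of the real regression test may differ):

```python
import ast


def popen_calls_use_argv_lists(source):
    """Return False if any subprocess.Popen call passes shell=True."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_popen = (
            isinstance(func, ast.Attribute) and func.attr == "Popen"
        ) or (isinstance(func, ast.Name) and func.id == "Popen")
        if not is_popen:
            continue
        for kw in node.keywords:
            # Flag a literal shell=True keyword argument.
            if (kw.arg == "shell"
                    and isinstance(kw.value, ast.Constant)
                    and kw.value.value is True):
                return False
    return True
```

An AST-based check inspects the source statically, so the guarded conversion functions never have to run during the test.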
2026-04-02 00:10:47 -07:00
mateeaaaaaaa
752cef3299
fix(security): shell injection in GGML export conversion (#4768)
* Fix shell injection in GGML conversion paths

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove test file from security fix PR

Move test_save_shell_injection.py to a separate PR to keep this PR focused on the security fix itself.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-04-02 00:10:43 -07:00
AdamPlatin123
ba8081fc96
fix(chat): correct loading text for cached models during inference (#4764)
Distinguish between actual network downloads and GPU memory loading for cached LoRA adapters in Studio chat.

- Add isCachedLora detection for local LoRA adapter paths using comprehensive cross-platform regex (Unix, Windows, UNC, WSL, tilde)
- Thread isCachedLora through loadInfo to chat-page inline status for proper 3-way distinction (cached / local LoRA / downloading)
- Skip download progress polling for cached LoRA models (no useless /download-progress API calls)
- Fix initial toast state to use isCachedLoad consistently instead of only checking isDownloaded
- Fix cancelLoading toast to not mention background downloads for cached/local loads
- Keep download-specific text ("Downloading model..." / "Download complete") inside the download-only polling block
2026-04-01 20:24:48 -07:00
Lee Jackson
ca4ea8b9fb
studio: align composer/code, unify fonts, and remove tool collapse jitter (#4763)
- Add min-w-0 guards to thread/message/markdown containers to prevent
  content overflow past the composer width
- Unify chat typography from Hellix/Space Grotesk to the sans stack,
  keeping monospace for code blocks and inline code
- Restructure desktop navbar right-side controls with shrink-0 wrappers
  for consistent spacing across HoverCard roots
- Soften tool-call label styling (font-medium + text-foreground/85
  instead of bold)
- Add responsive code block sizing via @container queries
- Add horizontal scrolling for wide code blocks within the thread column
- Scope list-item code block alignment CSS to .aui-thread-root
- Preserve useScrollLock in tool-fallback and tool-group collapsibles
- Fall back to bg-background on ViewportFooter when hideComposer is true
- Widen inline code monospace selector to cover th, blockquote, and
  heading elements
- Remove unused @fontsource-variable/space-grotesk import
2026-04-01 19:57:10 -07:00
DoubleMathew
71b934ef9d
Fix custom llama.cpp source builds and macos metal source builds (#4762)
* Fix script unbound variable error

* remove stale test script, add llama.cpp metal source builds, update tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix Metal precedence, test sync, and add behavioral tests

- Move macOS arm64 Metal check before CUDA/ROCm in GPU backend
  decision chain so Metal is not bypassed when nvcc is in PATH
- Remove RPATH flags from CPU fallback CMAKE_ARGS (only needed
  for Metal library linking)
- Update test_llama_pr_force_and_source.py to match _CLONE_ARGS
  rename from _CLONE_BRANCH_ARGS in setup.sh
- Add confirm_install_tree guard test for
  existing_install_matches_choice
- Add TestMacOSMetalBuildLogic bash subprocess tests verifying
  Metal flag selection, nvcc precedence, and CPU fallback behavior

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix Metal CPU fallback to also cover cmake build failures and update tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* 1. Sync _GPU_BACKEND_FRAGMENT: remove the dead CPU_FALLBACK_CMAKE_ARGS= init (6/8 reviewer consensus)
2. Replace the RPATH assertion: new test_macos_arm64_cpu_fallback_args_exclude_rpath checks the actual runtime CPU_FALLBACK_CMAKE_ARGS output for @loader_path and -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON (6/8)
3. Reset _TRY_METAL_CPU_FALLBACK=false after both the configure-failure and build-failure fallback branches in setup.sh (4/8)
4. Change the macOS test to remove libmtmd.0.dylib instead of the platform-agnostic convert_hf_to_gguf.py (3/8)
5. Add an empty-string tag test: test_empty_tag_omits_branch_flag for resolved_tag= (2/8)
6. Add RPATH checks on cmake call logs: both fallback tests now assert @loader_path and -DCMAKE_BUILD_WITH_INSTALL_RPATH=ON are absent from CPU fallback cmake calls, plus baseline flag preservation (multiple)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Clean up tests

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-01 14:06:39 -05:00
Daniel Han
39fe23ded8
Tests for architecture-aware KV cache estimation (#4760)
* test: add 66 tests for architecture-aware KV cache estimation

Covers all 5 estimation paths (MLA, Hybrid Mamba, Sliding Window,
Standard GQA, Legacy), GGUF parser for 8 new metadata fields,
_can_estimate_kv gate conditions, quantization scaling, edge cases,
path priority ordering, and lifecycle (init/unload/reparse).

Zero external dependencies beyond pytest. No GPU or network required.
Cross-platform (Linux, macOS, Windows, WSL).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-01 06:13:37 -07:00
Daniel Han
653eb3819a
fix(studio): allow context length slider to reach model's native limit (#4746)
* fix(studio): allow context length slider to reach model's native limit

The context length slider was hard-capped to the VRAM-estimated maximum,
preventing users from requesting higher context even though the backend
already handles it safely (multi-GPU selection, --fit fallback). Expose
the model's native context length from GGUF metadata as a separate API
field and use it as the slider ceiling instead. Add an amber warning
when the selected context exceeds the estimated VRAM capacity.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Raise VRAM budget to 90% and add native_context_length tests

Increase the GPU memory utilization threshold from 70% to 90% across
_select_gpus and _fit_context_to_vram, allowing longer context lengths
before VRAM capping kicks in.

Add 33 tests for the native_context_length feature covering the backend
property, context value separation invariants, Pydantic models, route
completeness, edge cases, and cross-platform binary I/O.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-01 06:12:52 -07:00
Daniel Han
d22b2a18f9
fix: add tokenizers to no-torch deps and TORCH_CONSTRAINT for arm64 macOS py313+ (#4748)
* fix: add tokenizers to no-torch runtime deps and add TORCH_CONSTRAINT for arm64 macOS py313+

Two installer fixes:

1. Add `tokenizers` to `no-torch-runtime.txt` before `transformers`.
   Without it, `from transformers import AutoConfig` crashes on startup
   because `--no-deps` skips transitive dependencies.

2. Add `TORCH_CONSTRAINT` variable to `install.sh`. On arm64 macOS with
   Python 3.13+, tighten the torch requirement to `>=2.6` since torch
   <2.6 has no cp313 arm64 wheels. The variable replaces the previously
   hard-coded constraint in the uv pip install line.

Includes 66 tests (42 pytest + 24 bash) covering:
- Structural checks on install.sh, install.ps1, no-torch-runtime.txt
- Shell snippet tests with mocked python for 13 platform/version combos
- Mock uv integration verifying correct constraint string
- E2E venv tests on Python 3.12 and 3.13 confirming AutoConfig works
- Negative control proving AutoConfig fails without tokenizers
- Full no-torch sandbox regression guards (safetensors, huggingface_hub)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix incomplete no-torch manifest and align E2E tests with real --no-deps path

- Add missing transitive deps to no-torch-runtime.txt that are required
  under --no-deps: regex, typing_extensions, filelock, httpx, httpcore,
  certifi, idna, anyio, sniffio, h11. Without these, `from transformers
  import AutoConfig` still fails after install.sh --no-torch.

- Change all E2E tests to use --no-deps (matching what install.sh does)
  instead of normal dep resolution. Previous tests passed even with an
  incomplete manifest because uv backfilled transitive deps.

- Rewrite negative control to derive from the real no-torch-runtime.txt
  with tokenizers stripped, proving the specific fix matters.

- Replace GNU-only sed -i with heredoc in shell test for macOS compat.

- Remove unused os/sys imports from Python test file.

- Quote SKIP_TORCH and mock uv paths in bash -c strings.

* Assert install succeeds before checking import results in E2E tests

Address review feedback: test_torch_not_importable and
test_tokenizers_directly_importable in Group 3 now assert that
uv pip install returns 0 before checking import behavior. This
prevents false positives when the install itself fails silently.

* Assert install succeeds in negative control and tighten error check

- Add missing install-success assertion in test_negative_control_no_tokenizers
  to prevent false positives from network/install failures.

- Tighten error message check to look for "tokenizers" in stderr or
  ModuleNotFoundError, rather than the generic "No module" substring
  which could match unrelated import failures.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-01 06:12:17 -07:00
Daniel Han
76cb48be0b
fix: studio web search SSL failures and empty page content (#4754)
- Fix SSL handshake failures (SSLV3_ALERT_HANDSHAKE_FAILURE, CERTIFICATE_VERIFY_FAILED) when fetching HTTPS pages by introducing _PinnedHTTPSConnection that separates TCP connect (to pinned IP) from TLS handshake (with real hostname for SNI/cert verification)
- Fix SSRF DNS-rebinding vulnerability: previous impl swapped conn.host before connect(), causing fresh DNS resolution; new subclass keeps TCP pinned to validated IP
- Fix SPA/JS-rendered doc sites returning empty content by rotating real browser User-Agents (Chrome/Firefox/Safari)
- Strip nav/footer from HTML-to-Markdown output so article content is not buried under navigation chrome
- Increase raw fetch cap from 64KB to 512KB so SSR article content is reached on GitBook/Docusaurus/Next.js pages
- Fix IPv6 address bracketing in URL netloc construction
- Hoist SSL context, handler classes, and stdlib imports to module level (created once, not per-call)
- Use consistent UA across redirect hops to avoid breaking session-aware bot detection
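The connect/handshake split can be sketched as below (an illustrative subclass; the real _PinnedHTTPSConnection likely differs in details such as SSL context configuration):

```python
import http.client
import socket
import ssl


class PinnedHTTPSConnection(http.client.HTTPSConnection):
    """Connect TCP to a pre-validated IP, but handshake TLS with the
    real hostname so SNI and certificate verification still work."""

    def __init__(self, hostname, pinned_ip, **kwargs):
        super().__init__(hostname, **kwargs)
        self._pinned_ip = pinned_ip

    def connect(self):
        # TCP connect to the pinned IP: no fresh DNS lookup, so a
        # rebinding attacker cannot swap the address after validation.
        sock = socket.create_connection((self._pinned_ip, self.port),
                                        self.timeout)
        context = ssl.create_default_context()
        # TLS handshake against the original hostname for SNI + cert check.
        self.sock = context.wrap_socket(sock, server_hostname=self.host)
```

Swapping `conn.host` before `connect()` (the previous approach) triggers a second DNS resolution inside the stdlib; overriding `connect()` keeps the socket pinned.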
2026-04-01 06:12:02 -07:00
Daniel Han
f84c2d03d3
Add installer test coverage for prebuilt llama.cpp changes (#4756)
Split out from #4741 to keep the main PR focused on installer logic.

- New test_install_llama_prebuilt_logic.py: tests for resolve logic,
  fallback behavior, env_int, busy/lock handling
- New test_validate_llama_prebuilt.py: validator tests for staged
  release_tag/upstream_tag handling
- New test_llama_pr_force_and_source.py: tests for PR_FORCE and
  LLAMA_SOURCE maintainer defaults
- Updated test_selection_logic.py: expanded selection/fallback coverage
- Updated test_pr4562_bugfixes.py: updated bugfix tests for new logic
- Updated smoke_test_llama_prebuilt.py: minor update
2026-04-01 06:06:29 -07:00
DoubleMathew
428efc7d95
Resolve latest usable published llama.cpp release instead of fixed pinned tag (#4741)
Replaces the fixed prebuilt llama.cpp tag with dynamic published-release
resolution, adds bounded fallback across older published releases, and
introduces maintainer-editable defaults for PR/source overrides.

Changes:
- Resolve latest from the latest usable published release in unslothai/llama.cpp
- Use the selected release upstream_tag as the authoritative llama.cpp version
- Prefer Unsloth-published platform assets when available
- Fall back to same-tag upstream ggml-org/llama.cpp assets where allowed
- Keep Linux CUDA anchored to Unsloth-published CUDA bundles only
- Add bounded fallback across older Unsloth published releases
- Add separate busy/in-use install handling (exit code 3)
- Skip reinstall when the installed bundle already matches the selected candidate
- Add maintainer-editable _DEFAULT_LLAMA_PR_FORCE and _DEFAULT_LLAMA_SOURCE
- Harden env parsing so malformed installer env vars do not crash import-time fallback logic
- Honor UNSLOTH_LLAMA_RELEASE_TAG in all resolve steps
- Always sync git remote URL in existing-checkout path
2026-04-01 06:06:17 -07:00
Daniel Han
5d7d882ce6
Fix save_pretrained_merged for full-finetuned models (#4755)
* Fix save_pretrained_merged for full-finetuned models

save_pretrained_merged and push_to_hub_merged silently do nothing when
the model is not a PeftModel (i.e. full finetuning without LoRA).
merge_and_overwrite_lora returns None immediately for non-PeftModel,
and unsloth_generic_save does not check the return value.

Add a non-PeftModel branch in unsloth_generic_save that falls back to
model.save_pretrained / model.push_to_hub. When save_method contains
"16bit", cast weights to bfloat16 (or float16) via a state_dict copy
to honor the user's intent without mutating the live model.

The existing PeftModel (LoRA) code path is unchanged.

* Forward create_pr and revision to tokenizer.push_to_hub

The tokenizer push_to_hub call was missing create_pr and revision,
which could cause the tokenizer to push to the wrong branch or
bypass PR creation when the model push uses them.

* Honor merged_16bit dtype contract for full-finetuned models

Cast state_dict to bfloat16/float16 when save_method contains "16bit"
to match the documented behavior of save_pretrained_merged. Also pass
state_dict and save kwargs consistently to both save_pretrained and
push_to_hub paths.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Address review feedback for PR #4755

- Simplify PeftModel isinstance check (PeftModelForCausalLM inherits
  from PeftModel)
- Add is_main_process guard for distributed training
- Forward variant to save_pretrained
- Set tokenizer padding_side to "left" before saving (matches other
  save paths)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-01 06:05:37 -07:00
Daniel Han
77e1a9edc9
feat(studio): architecture-aware KV cache VRAM estimation (#4757)
* feat(studio): architecture-aware KV cache VRAM estimation

Replace the single legacy formula (2 * n_kv_heads * head_dim * n_layers
* n_ctx * bpe) with 5-path estimation that reads 8 additional GGUF
metadata fields:

  1. MLA (DeepSeek-V2/V3, GLM-4.7, GLM-5, Kimi-K2.5) -- K-only cache
     using compressed KV latent + RoPE; no separate V allocation
  2. Hybrid Mamba (Qwen3.5-27B, Qwen3.5-35B-A3B) -- only attention
     layers (1 in N) carry KV; Mamba layers have none
  3. Sliding Window (Gemma-3, gpt-oss) -- SWA layers cache
     min(ctx, window) tokens instead of the full context
  4. Standard GQA -- uses explicit key_length/value_length from GGUF
     instead of embed // n_heads (which is wrong for many models)
  5. Legacy fallback -- identical to old formula for old GGUFs

New GGUF fields parsed: attention.key_length, attention.value_length,
attention.sliding_window, full_attention_interval,
attention.kv_lora_rank, attention.key_length_mla, ssm.inner_size,
ssm.state_size.

Validated against 9 real GGUF files (72/72 field checks pass).
The legacy formula was off by +682% for Gemma-3 and -81% for
DeepSeek-V3.1.
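In rough terms, the legacy formula and the sliding-window path compare like this (a sketch with illustrative parameter names, plus the hybrid-path ceiling division described in a follow-up commit below):

```python
def kv_cache_bytes_legacy(n_kv_heads, head_dim, n_layers, n_ctx, bpe):
    # Old single formula: every layer caches full-context K and V.
    return 2 * n_kv_heads * head_dim * n_layers * n_ctx * bpe


def kv_cache_bytes_swa(n_kv_heads, head_dim, n_layers, n_ctx, bpe,
                       sliding_window, n_global_layers):
    # SWA layers cache only min(ctx, window) tokens; global layers
    # cache the full context. Requires sliding_window > 0 (0 = disabled).
    assert sliding_window > 0
    per_token = 2 * n_kv_heads * head_dim * bpe
    n_swa_layers = n_layers - n_global_layers
    return per_token * (n_global_layers * n_ctx
                        + n_swa_layers * min(n_ctx, sliding_window))


def hybrid_attention_layers(n_layers, full_attention_interval):
    # Ceiling division so a trailing partial group still counts its
    # attention layer (1-in-N layers carry KV in hybrid Mamba models).
    return -(-n_layers // full_attention_interval)
```

For a long context, the SWA estimate is far below the legacy one, which is why the legacy formula overestimated Gemma-3 so badly.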

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix MLA fallback and SWA global/local ratio heuristic

Two fixes based on review findings:

1. MLA fallback now uses key_length_mla from GGUF metadata instead of
   hardcoded rope_dim=64. Falls back to 64 only when key_length_mla is
   absent. This ensures correct estimates for MLA variants that use
   rope dimensions other than 64.

2. SWA global/local layer ratio changed from 50/50 to 1/4 (25% global,
   75% SWA). Most sliding window architectures have predominantly local
   layers (Gemma-3 uses ~17% global, gpt-oss uses ~50%). The 1/4
   heuristic is closer to the common case and still a large improvement
   over the legacy formula which ignores SWA entirely.

* Tighten _can_estimate_kv gate and treat sliding_window=0 as disabled

Two additional fixes from review round 1 (5/8 and 4/8 reviewer consensus):

1. _can_estimate_kv now requires BOTH key_length AND value_length for
   the explicit-dims path. Previously key_length alone was enough,
   which could cause silent fallthrough to the legacy formula with
   fabricated defaults (n_kv=1, head_dim=128) when value_length was
   absent from the GGUF.

2. SWA path now requires sliding_window > 0. Some GGUFs use 0 as a
   disabled sentinel. Without this guard, min(ctx, 0) would zero out
   all SWA layer contributions, severely underestimating KV cache.

* Fix MLA n_kv safety and use ceiling division for hybrid path

Addresses Gemini Code Assist review findings:

1. MLA path now uses n_kv_mla = n_kv_heads or 1 (not n_heads). This
   prevents a 128x overestimate for DeepSeek-V3 if head_count_kv is
   absent from the GGUF (n_heads=128 would have been used instead).

2. Hybrid path now uses ceiling division for attention layer count.
   This prevents undercounting by 1 when n_layers is not perfectly
   divisible by full_attention_interval.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-01 06:04:12 -07:00
Daniel Han
3f3757b143
Fix forward compatibility with transformers 5.x (#4752)
* Fix forward compatibility with transformers 5.x

Tested on transformers 4.57.6, 5.3.0, and 5.4.0. All changes are no-ops
on transformers 4.x.

1. Skip exec-based config patching for transformers >= 5.0

   Config classes in v5 use @strict, @auto_docstring, and interval()
   which break exec(inspect.getsource(...)). Those configs already use
   rope_parameters (the v5 replacement for rope_scaling).

2. Slice position_ids to last token in fast_forward_inference

   Transformers 5.x generate() accumulates position_ids as
   [batch, full_seq_len] across decode steps instead of [batch, 1].
   cos[position_ids] then produces the wrong shape for rotary
   embeddings. Fixed in llama, qwen3, falcon_h1, gemma2, cohere,
   granite. No-op on 4.x since position_ids is already [batch, 1].

3. Handle @strict config kwargs for sequence classification

   num_labels, max_position_embeddings, id2label etc. are set on the
   config object and passed via config= instead of as kwargs.
   AutoModelForSequenceClassification routing added to FastModel loader.

4. Exclude modernbert from flex_attention

   ModernBERT with flex_attention hits CUDA illegal memory access in
   create_block_mask. Falls back to eager attention safely.

5. Propagate token_type_ids and mm_token_type_ids through GRPO VLM path

   Gemma3 Vision requires token_type_ids during training. Qwen3VL
   requires mm_token_type_ids for M-RoPE. Extract from inputs in
   compute_loss, pass to grpo_accumulated_loss, and extend
   mm_token_type_ids for completion tokens in
   _generate_and_score_completions.
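The core of fix 2 reduces to slicing position ids to the last token during decode; a torch-free sketch on nested lists:

```python
def slice_decode_position_ids(position_ids):
    """Keep only the last position per batch row so cos[position_ids]
    keeps the [batch, 1] decode shape; a no-op when already [batch, 1]."""
    return [row[-1:] for row in position_ids]
```

On transformers 4.x the input is already `[batch, 1]`, so the slice changes nothing; on 5.x it trims the accumulated `[batch, full_seq_len]` ids.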

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add try/except safety net around config exec for pre-release transformers versions

* Pop config-level kwargs in seqclass path and use except Exception

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-01 06:04:03 -07:00
Roland Tannous
41df4ec437
feat(studio): strip org prefix in model search to surface unsloth variants (#4749)
When searching for a specific publisher model (e.g. `openai/gpt-oss-20b`), the
unsloth search used the full `openai/gpt-oss-20b` string with `author=unsloth`,
which returned zero results because no unsloth model contains the publisher
prefix in its name. Users never discovered unsloth variants.

This PR strips the org prefix for publisher-qualified queries so unsloth variants
surface, then pins the original publisher model after a small batch of unsloth
results. Plain queries (no slash) and unsloth-prefixed queries are unchanged.

- Strict regex (`/^([^/\s]+)\/([^/\s]+)$/`) only triggers on valid `owner/repo`
  identifiers; incomplete typeahead, multi-slash, and URL-like inputs are rejected
- Queries for `unsloth/...` models (case-insensitive) keep the full 20-result
  prefetch and secondary sort
- Pinned model lookup fires in parallel with the unsloth prefetch
- Canonical-name dedup prevents duplicates when HF normalizes casing
- Publisher detection extracted into a single `useMemo` block
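The strip-and-reject behavior can be sketched in a few lines (Python stand-in for the frontend logic; function name is illustrative, the regex is the one quoted above):

```python
import re

# Strict owner/repo pattern from the commit message: exactly one slash,
# no whitespace, nothing before or after.
_OWNER_REPO = re.compile(r"^([^/\s]+)/([^/\s]+)$")


def strip_org_prefix(query):
    """Return the repo-only search term for publisher-qualified queries;
    leave plain and unsloth-prefixed queries unchanged."""
    m = _OWNER_REPO.match(query)
    if m is None or m.group(1).lower() == "unsloth":
        return query
    return m.group(2)
```

Multi-slash and URL-like inputs fail the regex and pass through untouched, matching the rejection rule above.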
2026-04-01 04:37:28 -07:00
Leo Borcherding
63ad6dbd6d
Fix OOM model styling in Studio model selectors (#4738)
Replace strikethrough + opacity-50 OOM styling with gray text and red pill badge across all Studio model selectors (chat, training, onboarding).

- Use gray-500/gray-400 for OOM model names (better contrast than strikethrough)
- Red pill badge for OOM indicator with light/dark mode support
- Scope GGUF gray override to quant name only so downloaded/recommended labels keep colors
- Add !important on TIGHT/OOM badges to resist ComboboxItem hover overrides
2026-04-01 02:06:49 -07:00
Daniel Han
6c0826a9e4
Fix Windows local GGUF model loading crash (#4730)
* Fix Windows "Non-relative patterns are unsupported" when loading local GGUF models

When a user loads a GGUF model from a local Windows path (e.g.
C:\Users\danie\.lmstudio\models\unsloth\functiongemma-270m-it-GGUF),
the model identifier contains backslashes and a drive letter. Both
load_model_defaults() and _has_specific_yaml() constructed a YAML
filename from the full absolute path and passed it to Path.rglob(),
which rejects non-relative patterns on Windows.

Fixed by detecting Windows-style paths (drive letters, UNC paths,
backslashes) in addition to Unix-style paths, and using only the
directory basename for the YAML filename lookup when the identifier
is a local filesystem path.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Refactor: reuse is_local_path helper, fix case-sensitive suffix lookup

- Replace inline local-path detection in model_config.py and
  inference_config.py with the existing is_local_path() from utils.paths,
  which already handles Unix, Windows drive-letter, UNC, and backslash paths
- Fix case-sensitive suffix lookup in load_model_defaults(): the
  _REVERSE_MODEL_MAPPING is lowercase-keyed, so suffix comparisons must use
  .lower() to match paths like /path/to/Spark-TTS-0.5B/LLM

* Fix WSL path parsing and _has_specific_yaml suffix lookup

- Use normalize_path() before Path() operations so backslash Windows
  paths (e.g. C:\Users\...\model) are correctly split on POSIX/WSL hosts
  where pathlib treats backslashes as literal characters
- Add suffix-based (2-component and 1-component) lookup to
  _has_specific_yaml() so it matches the same resolution rules as
  load_model_defaults(), fixing wrong inference params for local
  suffix-mapped models like Spark-TTS-0.5B/LLM
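The detection-plus-basename idea can be sketched as follows (illustrative helpers; the real is_local_path in utils.paths and normalize_path cover more cases):

```python
import re
from pathlib import PurePosixPath, PureWindowsPath


def is_local_path(identifier):
    """Heuristic local-path check: Unix absolute/relative/tilde paths,
    Windows drive letters, UNC shares, and any backslash path."""
    return (
        identifier.startswith(("/", "~", "./", "../", "\\\\"))
        or bool(re.match(r"[A-Za-z]:[\\/]", identifier))
        or "\\" in identifier
    )


def yaml_lookup_name(identifier):
    # For local paths, use only the directory basename so Path.rglob
    # receives a relative pattern (it rejects absolute ones on Windows).
    if not is_local_path(identifier):
        return identifier
    cls = PureWindowsPath if "\\" in identifier else PurePosixPath
    return cls(identifier).name
```

Using `PureWindowsPath` for backslash identifiers is what makes the split correct on POSIX/WSL hosts, where `pathlib.Path` would treat backslashes as literal name characters.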

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-01 01:38:09 -07:00
Datta Nimmaturi
256c6e4884
Refactor flex attn to prefer flash if possible (#4734)
Replaces prefer_flex_attn_if_supported (which only returned flex_attention or None) with determine_attention_implementation, a centralized hierarchy: FA2 > Flex > SDPA > Eager.

Changes:
- New determine_attention_implementation function in _utils.py with clear priority chain
- _set_attn_impl helper to stamp config consistently
- _FLEX_EXCLUDED_MODELS / _FLEX_EXCLUDED_PREFIXES for model-specific exclusions
- Gemma3N explicit eager override in vision.py (timm vision towers)
- Preserved sdpa fallback for unmapped/remote-code vision configs
- Config re-stamped to eager when supports_sdpa guard fires

Co-authored-by: Datta Nimmaturi <Datta0@users.noreply.github.com>
2026-04-01 00:30:21 -07:00
Wasim Yousef Said
d63cc57e1e
fix: clear tool status badge immediately after tool execution (#4733)
* fix: clear tool status badge immediately after tool execution

The tool status timer badge (Searching 1s, 2s...) persisted after
tool calls finished because the status clear event was only sent
at the start of the next generation iteration, not after tool
execution completed.

Backend: yield status clear after all tools finish in the agentic
loop iteration, before continue starts the next generation pass.

Frontend: debounce badge visibility by 300ms so sub-second tool
calls don't flash the badge.

* Fix debounce regression for consecutive tool calls

Only apply the 300ms show-delay when transitioning from idle to
tool-active. When switching between consecutive tools in the same
turn (e.g. web_search -> python), keep the badge visible immediately
so it does not flicker or disappear during multi-tool runs.

* Delay wasActiveRef reset to bridge inter-iteration tool gaps

The backend emits a status-clear event between tool iterations,
which was resetting wasActiveRef immediately and causing the next
tool to be re-debounced (300ms hidden gap between consecutive tools
in the same turn). Now the ref reset is delayed by 500ms so a
follow-up tool within the same agentic turn shows the badge
immediately, while a genuinely new turn still gets the debounce.

* Use thread lifecycle to track tool-run boundaries

Replace the 500ms wall-clock timeout with the actual thread.isRunning
state to determine when wasActiveRef should reset. This properly
handles all cases:
- Consecutive tools within the same run stay visible without flicker
- The badge hides only when the thread run actually ends
- New turns always get a fresh 300ms debounce on the first tool
- No heuristic timeout that can misfire on slow or fast inference

* Consolidate wasActiveRef reset into single effect

Removes the separate isThreadRunning effect to avoid a race where
the ref resets before the tool-status effect reads it (when
isThreadRunning flips to false before setToolStatus(null) from
the adapter's finally block). Now wasActiveRef resets only when
both toolStatus is null AND the thread run has ended, eliminating
any flicker on the last tool of a run.

* Simplify debounce: use visible state instead of ref tracking

Drop wasActiveRef entirely and use the visible state as the
debounce gate. When the badge is not yet on screen, debounce
for 300ms before showing. When already visible from a prior tool,
keep showing immediately. This correctly handles all cases:
- All fast tools (<300ms) are suppressed, not just the first
- Consecutive tools after the badge is shown stay visible
- Badge persists across inter-iteration clears while thread runs
- New turns get a fresh debounce after visible resets
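The final debounce rule reduces to a small state function (a behavioral sketch in Python, assuming the 300 ms threshold; the real implementation is a React effect):

```python
def badge_visible_next(visible, tool_active, thread_running, elapsed_ms):
    """Next visibility of the tool-status badge under the visible-state gate."""
    if tool_active:
        if visible:
            return True  # already on screen: keep showing, no flicker
        return elapsed_ms >= 300  # debounce the first appearance
    # Status cleared: persist across inter-iteration gaps while the run lives.
    return visible and thread_running
```

A sub-300 ms tool never shows the badge unless it is already visible, and the inter-iteration status-clear no longer hides it mid-run.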

---------

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-04-01 00:28:38 -07:00
Wasim Yousef Said
4fb9778988
feat: move folder management into model selector dropdown (#4731)
* refactor: move folder management from sidebar into model selector

* Fix folder management: restore LoRA picker sync, error handling, caching

- Restore onFoldersChange callback to keep LoRA adapter picker in sync
  when scan folders are added/removed (fixes regression from sidebar move)
- Thread onFoldersChange through ModelSelector -> HubModelPicker prop chain
- Add module-level _scanFoldersCache to prevent folder list flash on re-open
- Surface error toast on folder removal failure instead of silently ignoring
- Guard handleAddFolder against concurrent double-submit via folderLoading
- Clear folderInput on Escape key dismiss to prevent stale input on re-open
- Add refreshLocalModelsList and refreshScanFolders to useEffect dep array

* Fix compare-mode folder sync, Escape key propagation, cancel toggle state

- Wire onFoldersChange through CompareContent/GeneralCompareContent so
  compare-mode selectors also refresh local models after folder changes
- Add e.stopPropagation() on Escape key in folder input to prevent
  Radix Popover from closing the entire model selector dropdown
- Add e.preventDefault() on Enter key to prevent form submission
- Clear folderInput and folderError when cancel toggle hides the input,
  matching the Escape key behavior for consistency

* Fix folder mutation state ordering and touch accessibility

- Use optimistic updates for add/remove so the folder list reflects
  changes immediately instead of waiting on a second listScanFolders
  round-trip that could silently fail.
- Move refreshScanFolders out of the finally block in handleRemoveFolder
  so it runs after the cache update, not after onFoldersChange.
- Make the remove button visible on touch/mobile devices and reachable
  via keyboard focus (opacity-100 on small screens, focus-visible).
- Add aria-label to the remove button for screen readers.

* Deduplicate optimistic folder add to match backend behavior

The backend returns the existing ScanFolderInfo row when adding a
path that is already registered. The optimistic update was blindly
appending the returned row, producing duplicate entries and React
key warnings. Now checks by id before appending.

* Add aria-label to folder toggle button and strengthen dedup check

- Add aria-label to the +/cancel icon button for screen readers.
- Extend optimistic dedup check to also compare by path, not just id,
  to handle edge cases where the cache is stale.

---------

Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-03-31 23:15:50 -07:00
Lee Jackson
2cac3e8e4d
studio: Polish Windows installer/setup logs (#4736)
* style(windows): clean installer/setup log output and remove seeded credential banner

* Keep startup credential hint without exposing plaintext password

Print the username and .bootstrap_password file path on first-run
admin creation instead of the raw password. Headless / Docker / SSH
operators still get a startup-time hint for initial sign-in, and the
plaintext credential no longer appears in terminal output or logs.

---------

Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
2026-03-31 23:12:42 -07:00
Daniel Han
6984e118eb
Bump installer minimum version pin to 2026.3.18 (#4729)
Matches the latest PyPI release.
2026-03-31 07:00:51 -07:00
Daniel Han
cfeb8c3245 Versioning 2026-03-31 06:51:34 -07:00
Wasim Yousef Said
1e8875584d
feat: custom scan folders for GGUF model discovery (#4723)
* feat: add scan_folders table and CRUD functions to studio_db

* feat: add scan folders API endpoints and integrate into model scan

* feat: add scan folders API client and update source types

* feat: add custom source to model filters and selector

* feat: add Model Folders section to chat settings sidebar

* style: fix biome formatting in ModelFoldersSection

* fix: address review findings for custom scan folders

- empty string bypass
- concurrent delete crash guard
- Windows case normalization
- response_model on endpoints
- logging
- deduplicated filter/map
- module-level cache for custom folder models
- consistent source labels
- handleRemove error surfacing
- per-folder scan cap

* fix: show custom folders section regardless of chatOnly mode

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refactor: extract shared refreshLocalModelsList in pickers

* Harden custom scan folder validation and scanning

- Validate path exists, is a directory, and is readable before persisting
- Apply per-folder model cap during traversal instead of after (avoids
  scanning millions of inodes in large directories)
- Wrap per-folder scan in try/except so one unreadable folder does not
  break the entire /api/models/local endpoint for all callers
- Normalize case on Windows before storing so C:\Models and c:\models
  dedup correctly
- Extend macOS denylist to cover /private/etc and /private/tmp (realpath
  resolves /etc -> /private/etc, bypassing the original denylist)
- Add /boot and /run to Linux denylist

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Improve scan robustness and preserve Windows path casing

- Preserve original Windows path casing in DB instead of lowercasing
  (normcase used only for dedup comparison, not storage)
- Catch PermissionError per child directory so one unreadable subdirectory
  does not skip the entire custom folder scan
- Wrap list_scan_folders() DB call in try/except so a DB issue does not
  break the entire /api/models/local endpoint

* fix: scan custom folders for both flat and HF cache layouts

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix Windows case-insensitive path dedup with COLLATE NOCASE

Use COLLATE NOCASE on the scan_folders.path column so that the UNIQUE
constraint correctly deduplicates C:\Models and c:\models on Windows
without lowercasing the stored path. Also use COLLATE NOCASE in the
pre-insert lookup query on Windows to catch existing rows with
different casing.
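The platform-conditional collation can be sketched with stdlib sqlite3. This is an illustrative reconstruction, not the actual studio_db code: the table and column names follow the commit message, but the real schema has more columns and the real code routes through its own DB layer.

```python
# Sketch: UNIQUE path dedup that is case-insensitive only on Windows.
# On Windows, COLLATE NOCASE lets the UNIQUE constraint reject c:\models
# when C:\Models already exists, while the stored path keeps its casing.
# On Linux/macOS the default BINARY collation keeps /Models and /models
# distinct.
import os
import sqlite3

def create_scan_folders_table(conn, is_windows=None):
    if is_windows is None:
        is_windows = os.name == "nt"
    collate = " COLLATE NOCASE" if is_windows else ""
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS scan_folders ("
        f"id INTEGER PRIMARY KEY, path TEXT UNIQUE{collate})"
    )

conn = sqlite3.connect(":memory:")
create_scan_folders_table(conn, is_windows=True)
conn.execute("INSERT INTO scan_folders (path) VALUES (?)", (r"C:\Models",))
try:
    conn.execute("INSERT INTO scan_folders (path) VALUES (?)", (r"c:\models",))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
stored = conn.execute("SELECT path FROM scan_folders").fetchall()
```

Note that SQLite's NOCASE folds ASCII only, which matches the drive-letter/path-casing cases this dedup targets.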

* Restore early-exit limit in _scan_models_dir for custom folders

Keep the limit parameter so _scan_models_dir stops iterating once
enough models are found, avoiding unbounded traversal of large
directories. The post-traversal slice is still applied after combining
with _scan_hf_cache results.

* feat: scan custom folders with LM Studio layout too

* Fix custom folder models being hidden by dedup

Custom folder entries were appended after HF cache and models_dir
entries.  The dedup loop kept the first occurrence of each model id,
so custom models with the same id as an existing HF cache entry were
silently dropped -- they never appeared in the "Custom Folders" UI
section.

Use a separate dedup key for custom-source entries so they always
survive deduplication.  This way a model can appear under both
"Downloaded" (from HF cache) and "Custom Folders" (from the
user-registered directory) at the same time.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Harden LM Studio scan and fix COLLATE NOCASE on Linux

- Add per-child and per-publisher OSError handling in _scan_lmstudio_dir
  so one unreadable subdirectory does not discard the entire custom
  folder's results
- Only apply COLLATE NOCASE on the scan_folders schema on Windows where
  paths are case-insensitive; keep default BINARY collation on Linux
  and macOS where /Models and /models are distinct directories

* Use COLLATE NOCASE in post-IntegrityError fallback SELECT on Windows

The fallback SELECT after an IntegrityError race now uses the same
case-insensitive collation as the pre-insert check, so a concurrent
writer that stored the path with different casing does not cause a
false "Folder was concurrently removed" error.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-03-31 06:40:31 -07:00
Daniel Han
9a8b622306
Studio: simplify tool-call dedup and replace html2text with builtin converter (#4722)
* Simplify tool-call dedup: drop hashlib, inline helpers

The duplicate tool-call detector only compares calls within a single
request from the same JSON parser, so dict key order is guaranteed
identical for identical calls (Python 3.7+ insertion-ordered dicts).

- Replace hashlib.md5(json.dumps(...)) with name + str(args)
- Inline _tool_call_key, _is_duplicate_call, _record_tool_call
  since each was a one-liner used once
- Remove unused hashlib import

* Remove tool_calling_benchmark_results.md from repo

* Replace html2text with builtin HTML-to-Markdown converter

Drop the external html2text (GPL-3.0) dependency and its regex
fallback. Add _html_to_md.py (~190 lines, stdlib only) using
html.parser.HTMLParser that handles headings, links, bold/italic,
lists, tables, blockquotes, code blocks, and entity decoding.
Strips script/style/head tags entirely.
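The stdlib approach can be illustrated with a toy converter. This is not the real _html_to_md.py (which handles tables, blockquotes, lists, and much more); it is a minimal sketch of the html.parser.HTMLParser pattern covering headings, links, and script/style/head stripping.

```python
# Minimal sketch of an HTMLParser-based HTML-to-Markdown converter.
from html.parser import HTMLParser

class TinyMarkdown(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes &amp; etc. for us
        self.out = []
        self._skip = 0        # nesting depth inside <script>/<style>/<head>
        self._href = None     # pending link target while inside <a href=...>
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style", "head"):
            self._skip += 1
        elif tag in ("h1", "h2", "h3"):
            self.out.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._link_text = []
        elif tag == "p":
            self.out.append("\n\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style", "head"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a" and self._href is not None:
            # Normalize whitespace in the accumulated link text.
            text = " ".join("".join(self._link_text).split())
            self.out.append(f"[{text}]({self._href})")
            self._href = None
        elif tag in ("h1", "h2", "h3"):
            self.out.append("\n")

    def handle_data(self, data):
        if self._skip:
            return            # drop script/style/head content entirely
        if self._href is not None:
            self._link_text.append(data)
        else:
            self.out.append(data)

def html_to_md(html):
    p = TinyMarkdown()
    p.feed(html)
    p.close()
    return "".join(p.out).strip()
```

The real converter layers the same callback structure with per-tag buffers for cells, blockquotes, and code fences.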

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use json.dumps(sort_keys=True) for tool-call dedup key

str(dict) is sensitive to insertion order, so semantically identical
calls with different key ordering would bypass duplicate detection.
Switch to json.dumps with sort_keys=True for a canonical representation.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert dedup key to str(arguments)

json.dumps(sort_keys=True) is unnecessary here -- the arguments dict
always comes from the same JSON parser within a single request, so
key insertion order is deterministic (Python 3.7+).  str() is faster
and sufficient for consecutive-call dedup.

* Address review comments on _html_to_md.py

- Remove "hr" from _BLOCK_TAGS so the dedicated hr handler is reachable
- Prefix all newlines with ">" inside blockquotes (multi-line support)
- Emit full ![alt](url) for images instead of alt text only
- Replace newlines with spaces inside table cells
- Track header cells per-row (_row_has_th) instead of last-cell-only
- Strip trailing tabs in addition to spaces in cleanup regex

* Fix blockquote rendering, truncated-HTML buffer flush, and dedup key canonicalization

_html_to_md.py:
- Rewrite blockquote handling with stack-based buffer approach so nested
  blockquotes, pre blocks inside blockquotes, and multi-paragraph quotes
  all render correctly with proper "> " prefix on every line.
- Add flush_pending() to recover content from truncated HTML where closing
  tags are missing (common when _fetch_page_text caps the download size).
  Flushes open <a>, <td>, <pre>, and blockquote buffers.
- Skip <img> tags to match prior html2text ignore_images=True behavior
  and avoid data-URI amplification consuming the output budget.
- Collapse all whitespace (including newlines) in non-pre content per
  standard HTML whitespace rules: \s+ -> single space.
- Escape pipe characters in table cell content to prevent column breakage.
- Emit separator row after the first row for tables without <th> headers.
- Guard against IndexError on _ol_counter for orphan <li> elements.
- Normalize CRLF line endings before parsing.

llama_cpp.py:
- Restore canonical dedup key with json.dumps(sort_keys=True) so that
  semantically identical tool calls with different JSON key order are
  correctly detected as duplicates.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix table optional end tags, inline code whitespace, and link text normalization

_html_to_md.py:
- Extract _finish_cell() and _finish_row() helpers to handle HTML tables
  that omit optional </td>, </th>, or </tr> end tags. This is valid HTML
  and common on real web pages -- previously the parser would silently
  drop earlier cells and entire rows.
- Call _finish_cell()/_finish_row() from handle_starttag for <tr>/<td>/<th>,
  handle_endtag for </tr>/<td>/<th>/<table>, and flush_pending() so all
  three paths (normal close, implicit close, truncated HTML) use the same
  row-finalization logic including header separator emission.
- Add _in_inline_code flag so handle_data() preserves literal whitespace
  inside <code> spans instead of collapsing it. Source like
  <code>pip  install   unsloth</code> now correctly renders as
  `pip  install   unsloth` rather than `pip install unsloth`.
- Extract _finish_link() helper that normalizes accumulated link text with
  \s+ -> single space before building the Markdown link. Prevents block-
  level content inside <a> tags (e.g. <a><div>one</div><div>two</div></a>)
  from producing multiline [one\n\ntwo](href) link labels.
- Empty blockquotes now produce no output instead of a stray ">".
- Remove unused _bq_depth field (all routing uses _bq_stack).
- Flush open cells and rows in handle_endtag("table") for robustness.

* Support <ol start=N>, <dl>/<dt>/<dd>, and preserve code block whitespace

_html_to_md.py:
- Honor <ol start="N"> attribute so ordered lists preserve their original
  numbering instead of always restarting from 1. Important for docs/tutorials
  that continue numbering across sections.
- Add dl, dt, dd to _BLOCK_TAGS so definition lists (common on MDN, Python
  docs, Django docs) produce separated text instead of concatenated blobs.
- Rewrite _cleanup() to be fence-aware: content inside fenced code blocks
  is now preserved verbatim (intentional blank lines in <pre> content are
  no longer collapsed). Outside code blocks, blank runs are limited to one
  and trailing whitespace is stripped.
- Fix _prefix_blockquote() to strip trailing whitespace before collapsing
  blank lines, preventing the "\n\n \n\n" pattern from sneaking through.

* Suppress whitespace-only text nodes between table structural elements

Indented HTML tables (nearly all real-world pages) produce whitespace
text nodes between <table>, <tr>, </tr> etc. that land in the output
as leading spaces before table rows, breaking Markdown table alignment.

Skip whitespace-only text nodes when inside a table but not inside a
cell, so indentation from source HTML does not leak into the output.

* Revert dedup key to str(arguments) with explanatory comment

json.dumps(sort_keys=True) is unnecessary overhead here: arguments
always comes from json.loads on model output within a single request,
so dict insertion order is deterministic in Python 3.7+. A repeated
call from the model produces the same JSON, which parses to the same
dict repr. str() avoids re-serialization on every tool call.
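The consecutive-duplicate logic that this key feeds can be sketched as follows. Names here are hypothetical (the real code lives inline in llama_cpp.py); the sketch shows the final design: a window of one previous call, str(arguments) as the key, and retries allowed after a failed call.

```python
# Sketch of consecutive duplicate tool-call blocking (window = 1).
def make_key(name, arguments):
    # arguments comes from json.loads on model output within one request,
    # so dict insertion order is deterministic (Python 3.7+) and str()
    # is a sufficient canonical form for consecutive-call comparison.
    return f"{name}:{arguments!r}"

class ToolLoop:
    def __init__(self):
        self._prev = None  # (key, failed) of the immediately previous call

    def should_skip(self, name, arguments):
        key = make_key(name, arguments)
        prev = self._prev
        # Block only if the previous call was identical AND succeeded;
        # a retry after a transient failure is allowed through.
        return prev is not None and prev[0] == key and not prev[1]

    def record(self, name, arguments, failed):
        self._prev = (make_key(name, arguments), failed)

loop = ToolLoop()
args = {"query": "unsloth", "max_results": 5}
first_skipped = loop.should_skip("web_search", args)
loop.record("web_search", args, failed=False)
second_skipped = loop.should_skip("web_search", args)
```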

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-03-31 06:15:18 -07:00
Lee Jackson
9451bb1bac
fix(export): preserve selected/manual model on enter and blur (#4726) 2026-03-31 17:05:55 +04:00
Daniel Han
e159b93b97
studio: improve GGUF tool calling accuracy and reliability (#4700)
* studio: improve GGUF tool calling accuracy and reliability

- Add URL fetching to web_search tool so models can read full page
  content instead of only getting search snippets. Uses html2text for
  clean markdown conversion with regex fallback.
- Inject current date and behavioral guidance (URL fetch workflow,
  no repeated queries, use code for data processing) into the
  tool-use system prompt.
- Append error recovery nudge to tool results that indicate failure,
  helping small models avoid looping on the same broken call.
- Strip leaked <tool_call> XML from assistant messages in conversation
  history and from the outgoing SSE stream.
- Raise default max tool iterations from 10 to 25 across backend,
  model schema, and frontend defaults.
- Increase _MAX_PAGE_CHARS from 4k to 16k so fetched pages contain
  enough content for the model to extract useful information.
- Add "IMPORTANT: These are only short snippets" hint to search
  results so models know to fetch full pages when needed.

Tested with Qwen3.5-4B-GGUF (UD-Q4_K_XL), 10 runs before/after:
- XML leaks in responses: 10/10 -> 0/10
- URL fetch usage: 0 -> 4/10 runs
- Runs producing correct answers: 0/10 -> 2/10
- Average tool calls per query: 5.5 -> 3.8 (more efficient)
- Average response time: 12.3s -> 9.8s

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add tool calling benchmark results across model sizes and quants

Tested 16 configurations (4 models x 2 quants x 2 KV cache types)
with 10 runs each on NVIDIA B200.

Best config: 27B UD-Q4_K_XL + bf16 KV -- 6/10 runs found all 4
correct songs, 0 XML leaks, 131s average response time.

* Add duplicate tool-call detection and final-answer synthesis

When the model repeats the exact same tool call (same name + arguments)
twice in a row, skip execution and return a redirect message telling it
to try a different approach. This prevents the 8x-repeated-query loops
observed on 27B and 35B models.

When the tool iteration cap (25) is reached, inject a "provide your
final answer now" message before the final streaming pass. This lets
the model synthesize a useful answer from everything it gathered
instead of being silently cut off.

Tested on Qwen3.5-27B UD-Q4_K_XL (10 runs):
- Repeated query runs: 4/10 -> 2/10
- Cap hits: 1/10 -> 0/10
- All 4/4 accuracy: 5/10 -> 7/10

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix CodeQL alert: handle whitespace in script/style closing tags

The regex fallback for HTML stripping did not match closing tags
with whitespace before the angle bracket (e.g. </script >).
Use \s* before > in both script and style patterns.

* Address reviewer findings: SSRF, timeout crash, XML regex, dedup

- SSRF: resolve hostname via getaddrinfo and reject private, loopback,
  link-local, multicast, and reserved addresses before fetching
- Timeout: handle timeout=None (unlimited mode) in URL fetch path
  by defaulting to 60s instead of crashing on min(None, 60)
- Download cap: read at most max_chars*4+1 bytes instead of the
  full response body before truncating
- XML regex: match both <tool_call> and <function=...> markup in
  the history/stream cleanup (inference.py)
- CodeQL: use [^>]* in closing script/style tags to handle any
  whitespace or attributes before >
- Dedup: track whether each tool call failed so retries after
  transient errors are allowed; only block consecutive identical
  calls that both succeeded
- Final-answer synthesis: guard on max_tool_iterations > 0 so
  callers who disable tools do not get a false "used all calls" turn
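The pre-fetch address check can be sketched with stdlib ipaddress and socket. This is an illustrative sketch only (the real code additionally handles redirects and DNS pinning, covered in later commits), and the function name is hypothetical.

```python
# Sketch: resolve a hostname and reject non-public targets before fetching.
import ipaddress
import socket

def resolve_and_validate(hostname, port=443):
    """Reject private, loopback, link-local, multicast, reserved addrs."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    for *_rest, sockaddr in infos:
        ip = ipaddress.ip_address(sockaddr[0])
        if (ip.is_private or ip.is_loopback or ip.is_link_local
                or ip.is_multicast or ip.is_reserved):
            raise ValueError(f"Blocked: {hostname} resolves to {ip}")
    return infos[0][-1][0]  # first validated address

# IP literals resolve without DNS, so this runs offline:
try:
    resolve_and_validate("10.0.0.1")
    blocked = False
except ValueError:
    blocked = True
allowed = resolve_and_validate("8.8.8.8")
```

Every address returned by getaddrinfo is checked, since a hostname can resolve to both a public and a private record.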

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix redirect SSRF, SSE streaming regression, dedup off-by-one

- SSRF redirect bypass: disable auto-redirect in urllib, manually
  follow up to 5 hops with host validation at each step. Prevents
  public URLs from redirecting to loopback/private targets.
- SSE streaming: track prev_text on the raw cumulative and strip
  XML from the delta only, so completed tool_call tags do not cause
  the cumulative to shrink and drop trailing real text.
- Dedup off-by-one: check the immediately previous call (window=1)
  instead of requiring 2 matching history entries, so the second
  identical successful call is blocked rather than the third.
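The sanitize-the-cumulative-then-diff streaming fix can be sketched like this. The pattern below is simplified (the real _TOOL_XML_RE also matches `<function=...>` markup), and the hold-back of a partially streamed opener is an assumption about how the real code avoids leaking a tag split across chunks.

```python
# Sketch: strip tool-call XML from the full cumulative text, then emit
# only the delta vs. the previous sanitized snapshot, so tags split
# across SSE chunk boundaries never leak.
import re

_TOOL_XML_RE = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)
_TAG = "<tool_call>"

def sanitize(text):
    text = _TOOL_XML_RE.sub("", text)
    idx = text.find("<tool_call")       # unclosed opener still streaming in
    if idx != -1:
        return text[:idx]
    for i in range(len(_TAG) - 1, 0, -1):  # trailing partial like "<tool"
        if text.endswith(_TAG[:i]):
            return text[:-i]
    return text

def stream_deltas(chunks):
    cumulative = ""
    prev_clean = ""
    for chunk in chunks:
        cumulative += chunk
        clean = sanitize(cumulative)    # sanitize the full cumulative,
        delta = clean[len(prev_clean):] # then diff against the previous
        prev_clean = clean              # sanitized snapshot
        if delta:
            yield delta

out = "".join(stream_deltas(
    ["Hello <tool", '_call>{"a":1}</tool_call> world']))
```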

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix redirect HTTPError handling and tighten error prefixes

- Redirect fix: urllib raises HTTPError (not a normal response) when
  the redirect handler returns None. Catch HTTPError for 3xx codes
  and extract the Location header from the exception object.
- Error prefixes: remove overly broad "No " prefix that matched
  "No results found." (a valid empty-search outcome, not an error).
  Replace with specific prefixes like "Blocked:", "No query provided",
  "Failed to resolve". This ensures empty search results are correctly
  classified as non-errors for duplicate-call tracking.

* Fix SSE cross-chunk XML leaks, cleanup review findings

- SSE streaming: sanitize the full cumulative text before diffing
  against the previous sanitized snapshot, so XML tags that span
  chunk boundaries are stripped correctly. The previous delta-based
  approach leaked split tags.
- DRAINING fallback: use _strip_tool_markup() helper instead of a
  manual regex that only handled <tool_call> but not <function=...>.
- Move hashlib import, _TOOL_XML_RE compile, and datetime import to
  module level per style guide.
- Remove unused _hit_tool_cap variable.

* Fix DNS rebinding, charset detection, HTTPError handling, dedup double-record

- DNS rebinding: resolve hostname once via getaddrinfo, pin the
  returned IP, rewrite the URL to connect to the pinned IP with
  a Host header. Each redirect hop re-resolves and re-validates.
  Closes the TOCTOU window between validation and connection.
- Charset: use resp.headers.get_content_charset() instead of
  hardcoding utf-8, so pages with other encodings decode correctly.
- HTTPError: return descriptive "HTTP {code} {reason}" instead of
  re-raising into a generic "Search failed" message.
- Dedup: remove redundant _record_tool_call in the duplicate branch;
  the single call at the end of the loop handles all cases.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-03-31 03:06:44 -07:00
Lee Jackson
815619d972
feat: add update instructions card with OS toggle and mobile expand flow (#4721)
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
2026-03-31 14:05:05 +04:00
Roland Tannous
cc5e4fbf17
fix: auto-retry stalled HF downloads with HF_HUB_DISABLE_XET=1 (#4712)
* fix: auto-retry stalled HF downloads with HF_HUB_DISABLE_XET=1

The heartbeat thread now monitors the HF Hub cache directory for
file-size growth. If no bytes are written for 3 minutes, it sends a
"stall" message to the orchestrator, which kills the subprocess and
retries with HF_HUB_DISABLE_XET=1 (falling back from Xet to standard
HTTPS). If the retry also stalls, it errors out with a clear message.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: include transport type (xet/https) in heartbeat and stall log messages

Makes it clear in backend logs whether the download is using xet or
https transport, and which transport stalled -- helpful for debugging.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: monitor HF Hub .tmp dir to avoid false stall detections

huggingface_hub downloads into .tmp/ before atomically moving to
blobs/. Without monitoring .tmp, a large shard actively downloading
for several minutes would show zero blob growth and trigger a false
stall.

* fix: scope HF cache size check to specific model being loaded

Instead of scanning every models--*/blobs directory (O(N) with cached
models), only check the specific model's blobs dir plus the global
.tmp dir. Much faster on systems with many cached models.

* Fix false stall detection on cached/local models and cleanup issues

- Only fire stall if download activity was observed (cache size changed
  at least once). Previously, any model load taking >180s would trigger
  a false stall, even for already-cached or local models where no
  download is happening.
- Return -1 from _get_hf_cache_size on exception to distinguish
  "unable to measure" from "genuinely zero bytes". Skip stall logic
  when measurement fails.
- Add _shutdown_subprocess before raising on terminal stall path to
  prevent leaking a stuck subprocess.
- Detect pre-existing HF_HUB_DISABLE_XET=1 in the parent environment
  to avoid a redundant retry cycle when Xet is already disabled.
- Remove global .tmp directory scanning (not used by modern
  huggingface_hub; in-progress downloads use .incomplete files in
  blobs/ which are already captured by iterdir).
- Add f.is_file() guard in cache size calculation.
- Replace em dashes with ASCII dashes for Windows terminal compat.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Harden stall detection edge cases

- Guard -1 to valid value transition: when initial _get_hf_cache_size
  returns -1 (error) and later recovers to a real value, do not count
  that as download activity. Only set saw_download_activity when the
  previous measurement was also valid (>= 0).
- Move os import to top-level in orchestrator.py instead of inline
  import os as _os.
- Fix misleading comment about post-download protection.

* Use .incomplete files to detect active downloads for stall detection

Replace the saw_download_activity heuristic with direct .incomplete file
detection. huggingface_hub creates *.incomplete files in blobs/ during
active downloads and removes them on completion. This gives a reliable
signal for whether a download is actually in progress.

Benefits:
- Cached models: no .incomplete files -> no stall fired even after 180s
- Post-download init (quantization, GPU loading): .incomplete files gone
  so stall timer resets, long init phases are not killed
- Pre-download hangs (XET handshake stall): .incomplete files are
  created at download start, so zero-byte stalls are now detected
- No more false positives from -1 to valid measurement transitions

The _get_hf_download_state function now returns (total_bytes,
has_incomplete) tuple or None on error, replacing _get_hf_cache_size.
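The probe can be sketched against the HF cache layout (models--org--name/blobs). This is a reconstruction for illustration: the real function also watches multiple repos and returns None on error, and the model name below is hypothetical.

```python
# Sketch: report (total_bytes, has_incomplete) for one model's blobs dir.
# *.incomplete files in blobs/ mean a download is actively in progress;
# stall detection only fires when has_incomplete is True and total_bytes
# stops growing.
from pathlib import Path
import tempfile

def get_hf_download_state(cache_dir, model_name):
    blobs = (Path(cache_dir)
             / f"models--{model_name.replace('/', '--')}" / "blobs")
    total, has_incomplete = 0, False
    try:
        for f in blobs.iterdir():
            if not f.is_file():
                continue  # guard against stray directories
            total += f.stat().st_size
            if f.name.endswith(".incomplete"):
                has_incomplete = True
    except FileNotFoundError:
        pass  # model not cached yet: zero bytes, no active download
    return total, has_incomplete

# Build a fake cache to exercise the probe:
tmp = Path(tempfile.mkdtemp())
blobs = tmp / "models--unsloth--demo" / "blobs"
blobs.mkdir(parents=True)
(blobs / "abc123").write_bytes(b"x" * 10)
(blobs / "def456.incomplete").write_bytes(b"y" * 5)
state = get_hf_download_state(tmp, "unsloth/demo")
missing = get_hf_download_state(tmp, "other/model")
```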

* Add debug logging to download state exception handler

Log the exception at debug level when _get_hf_download_state fails,
instead of silently returning None. Helps with troubleshooting cache
measurement issues.

* Watch both adapter and base model repos for LoRA stall detection

When loading a LoRA adapter, the actual download bottleneck is often
the base model, not the adapter itself. Update the heartbeat to watch
both mc.identifier and mc.base_model cache directories so stall
detection works for LoRA loads where the base model stalls on Xet.

Also update _get_hf_download_state to accept multiple model names and
skip names without "/" (local paths) since those do not have HF cache
directories.

* Fix model name filtering for official HF models without org prefix

Models like gpt2 and bert-base-uncased do not contain a slash but are
still valid HF Hub models with cache directories. Replace the "/" check
with a proper local-path detection that checks for path separators and
path-like prefixes instead.

Also fix the base_model watch list to not require "/" in the base model
name, so official models used as LoRA bases are also monitored.

* Fix local path detection that broke all org/model names on Linux

The os.path.sep check matched "/" in HF model IDs like "org/model" on
Linux, causing the stall detector to skip ALL standard HF models.

Replace with a check that only skips names starting with "/" (absolute
paths), "." (relative paths), "~" (home-relative), or containing "\"
(Windows paths). HF model IDs like "org/model" or "gpt2" pass through
correctly on all platforms.
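The final heuristic is small enough to sketch directly; the function name is illustrative.

```python
# Sketch: skip only names that are clearly filesystem paths, so HF IDs
# like "org/model" or bare "gpt2" pass through on every platform.
def is_local_path(name):
    return name.startswith(("/", ".", "~")) or "\\" in name

watchable = [n for n in ("unsloth/Llama-3-8B", "gpt2", "/models/x",
                         "./ckpt", "~/models", r"C:\Models\x")
             if not is_local_path(n)]
```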

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-03-31 03:00:46 -07:00
Daniel Han
e164c930ff
fix(studio): correct default weight_decay and learning rate (#4695)
* fix(studio): change default weight_decay from 0.01 to 0.001

The default weight decay across Studio was 0.01 but should be 0.001.
Updated the default in all backend fallbacks, the Pydantic model, the
frontend config, and every YAML preset/model-default config.

* fix(studio): auto-set learning rate based on training method

Default LR should be 2e-4 for LoRA/QLoRA and 2e-5 for full fine-tuning.

Frontend: track whether the user has manually edited the LR field via a
_learningRateManuallySet flag (same pattern as trainOnCompletions).
When switching training method and the user has not touched the LR,
auto-set it to the appropriate default. Reset the flag on model load.

Backend: change trainer.py start_training default from 5e-5 to 2e-4,
update default.yaml fallback from 5e-5 to 2e-4, and fix
full_finetune.yaml from 0.0002 (2e-4) to 2e-5.

* refactor(studio): centralize weight_decay and learning rate defaults

Create studio/backend/core/training/constants.py as the single source of
truth for DEFAULT_WEIGHT_DECAY (0.001), DEFAULT_LEARNING_RATE (2e-4),
DEFAULT_LEARNING_RATE_FULL (2e-5), and DEFAULT_LEARNING_RATE_STR ("2e-4").

All backend modules (trainer.py, training.py, worker.py, models/training.py)
now import from constants.py instead of hardcoding values.

On the frontend, add LR_DEFAULT_LORA and LR_DEFAULT_FULL to
config/training.ts and use them in the store instead of magic numbers.
A comment cross-references the backend constants file.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix model-specific LR override, persist migration, and flag resets

- Preserve model-specific learning rates from YAML configs when the
  async autoSelectTrainingMethod callback fires (fixes Qwen2.5-1.5B
  getting 2e-4 instead of its configured 1e-5, etc.)
- Bump zustand persist version to 9 with migration so existing users
  with weightDecay=0.01 get updated to 0.001
- Clear _learningRateManuallySet in reset() and applyConfigPatch()
  for consistency with trainOnCompletions flag behavior
- Add DEFAULT_LEARNING_RATE_FULL_STR to constants.py

* Refine applyConfigPatch to only clear LR flag when patch includes LR

Only reset _learningRateManuallySet when the applied config patch
actually provides a learningRate value. This prevents unrelated config
patches from silently disarming the manual-edit guard, which would
cause a subsequent setTrainingMethod call to overwrite the user's
custom LR.

* Preserve model-specific LR when switching between qlora and lora

Only auto-switch the learning rate when the training category changes
(adapter <-> full fine-tuning). Switching between qlora and lora keeps
the current LR since both methods share the same learning rate range.
This preserves curated per-model defaults (e.g. 1e-5 for
Qwen2.5-1.5B-Instruct) when the user toggles between adapter methods.
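The two rules (method-dependent defaults, and switching only across categories) can be sketched as below. Constants mirror the commit messages (2e-4 for adapter methods, 2e-5 for full fine-tuning); the function names and the manually-set flag are illustrative stand-ins for the store logic.

```python
# Sketch: LR defaults and the category-change guard.
ADAPTER_METHODS = {"lora", "qlora"}

def default_learning_rate(method):
    return 2e-4 if method in ADAPTER_METHODS else 2e-5

def should_auto_switch_lr(old_method, new_method, manually_set):
    # Never touch a user-edited LR; otherwise only switch when the
    # training category changes (adapter <-> full fine-tuning), so
    # qlora <-> lora toggles keep curated per-model values.
    if manually_set:
        return False
    return (old_method in ADAPTER_METHODS) != (new_method in ADAPTER_METHODS)
```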

* Remove constants.py, use YAML configs as the source of truth

The YAML config files (model-specific + default.yaml) are the intended
config layer for training defaults. The Python backend fallbacks now use
inline values that match the YAML configs, rather than importing from a
separate constants module. This keeps the config architecture simple:
YAML files are the single source of truth, and the inline Python
fallbacks are just safety nets that mirror them.

* fix(studio): preserve model-specific LR when switching training method

Stash YAML-provided learning rate and use it to restore the correct
value when switching between adapter and full fine-tune modes.

- qlora <-> lora no longer overwrites the model's LR
- full -> adapter restores the YAML LR instead of a hardcoded constant
- selecting a model while on full fine-tune uses LR_DEFAULT_FULL
  instead of applying the YAML adapter LR

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Roland Tannous <115670425+rolandtannous@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
2026-03-31 13:50:25 +04:00
Wasim Yousef Said
28aaf849bf
fix: throttle and cache HuggingFace modelInfo API calls (#4696)
* fix: throttle and cache HuggingFace modelInfo API calls

The frontend was firing 40 to 60 parallel modelInfo requests on app
startup with zero caching or deduplication, tripping HF rate limits.

Adds a caching layer (hf-cache.ts) with TTL cache, inflight request
dedup, and a concurrency limiter. Also debounces the HF token input
so typing a token no longer re-fires all model searches per keystroke.
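The three mechanisms in hf-cache.ts compose like this. The real module is TypeScript; the following is a language-agnostic sketch in Python with hypothetical names, showing how a TTL cache, inflight request dedup, and a concurrency limiter interact.

```python
# Sketch: TTL cache + inflight dedup + concurrency limit for model-info
# fetches. Concurrent callers for the same id share one request.
import asyncio
import time

class HFInfoCache:
    def __init__(self, fetch, ttl=300.0, max_concurrent=3):
        self._fetch = fetch
        self._ttl = ttl
        self._cache = {}     # model_id -> (timestamp, value)
        self._inflight = {}  # model_id -> Task shared by concurrent callers
        self._sem = asyncio.Semaphore(max_concurrent)

    async def model_info(self, model_id):
        hit = self._cache.get(model_id)
        if hit and time.monotonic() - hit[0] < self._ttl:
            return hit[1]                     # fresh cache hit
        task = self._inflight.get(model_id)
        if task is None:                      # first caller starts the fetch
            task = asyncio.ensure_future(self._do(model_id))
            self._inflight[model_id] = task
        try:
            return await task                 # later callers join it
        finally:
            self._inflight.pop(model_id, None)

    async def _do(self, model_id):
        async with self._sem:                 # cap parallel upstream calls
            value = await self._fetch(model_id)
        self._cache[model_id] = (time.monotonic(), value)
        return value

calls = []

async def fake_fetch(model_id):               # stands in for the HF API
    calls.append(model_id)
    await asyncio.sleep(0)
    return {"id": model_id}

async def main():
    cache = HFInfoCache(fake_fetch)
    return await asyncio.gather(
        *(cache.model_info("unsloth/demo") for _ in range(5)))

results = asyncio.run(main())
```

Five concurrent callers result in a single upstream fetch; the semaphore additionally caps how many distinct ids are fetched at once.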

* fix: only fetch VRAM info for visible models in chat selector

* Fix cache key isolation and VRAM badge stability for PR #4696

- Cache key now includes a token fingerprint (last 8 chars) instead of a
  boolean, so switching HF tokens gives separate cache entries instead of
  serving stale data from the previous token.
- Extract token via credentials?.accessToken to match the @huggingface/hub
  API surface.
- Extend CachedResult type with safetensors/tags fields so downstream
  consumers no longer need unsafe `as` casts.
- Merge VRAM param map with previous state on scroll instead of replacing
  it, preventing a brief flash of missing VRAM badges when new models
  become visible.

* Fix VRAM badges missing for search-filtered recommended models

When a user types a search query, filteredRecommendedIds can include
models beyond the currently visible page. These models had no VRAM data
because useRecommendedModelVram only received visibleRecommendedIds.

Now we pass the union of visibleRecommendedIds and filteredRecommendedIds
to the VRAM hook, so recommended models surfaced by search also show
their VRAM badges. The hf-cache layer ensures no duplicate network calls.

* Apply biome formatting to hf-cache.ts and use-recommended-model-vram.ts

Auto-formatted with biome check --write to match project lint rules:
- Block statements for single-line if/for bodies
- Import sorting (type imports first)
- Consistent line wrapping

* Fix extractToken to handle both current and deprecated HF auth forms

The @huggingface/hub CredentialsParams type is a union:
  - { accessToken: "hf_..." }               (current preferred form)
  - { credentials: { accessToken: "..." } }  (deprecated form)

Previously only checked params.credentials?.accessToken (deprecated path).
Now checks both forms so the cache key is correct regardless of which
calling convention is used.

* Simplify extractToken, map merge, and set construction

- extractToken: remove type assertions, use direct property access with
  truthiness checks for cleaner union type handling
- VRAM map merge: use Map spread constructor instead of manual for loop
- idsForVram: use Set spread construction for more concise dedup
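The Map-spread merge and Set dedup have direct Python analogues; a sketch of both patterns (the real code is TypeScript, so values here are illustrative):

```python
# Merge instead of replace, so previously fetched VRAM entries survive:
prev_vram = {"model-a": 8.0, "model-b": 12.0}
newly_visible = {"model-b": 12.0, "model-c": 24.0}
merged = {**prev_vram, **newly_visible}

# Set construction dedups the visible/filtered id union in one expression:
visible = ["model-a", "model-b"]
filtered = ["model-b", "model-c"]
ids_for_vram = sorted({*visible, *filtered})
```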

* Add rationale comment for MAX_CONCURRENT=3 in hf-cache.ts

* Skip GGUF repos in VRAM fetch and pre-populate cache from listModels

Two changes to reduce redundant HF API calls:

1. Filter GGUF repos from idsForVram before passing to useRecommendedModelVram.
   GGUF repos have no safetensors metadata and the render layer already shows
   a static "GGUF" badge -- fetching modelInfo for them yields nothing usable
   while still consuming a semaphore slot and a network round-trip.

2. Add primeCacheFromListing() to hf-cache.ts and call it from listModels
   yield sites in mergedModelIterator and priorityThenListingIterator.
   listModels returns the same type (ModelEntry & Pick<ApiModelInfo, T>) as
   modelInfo with the same additionalFields, so the data is interchangeable.
   Priming only writes if the key is not already fresh, so it never overwrites
   a recent modelInfo response.

   This means models discovered via listModels are already in cache when
   useRecommendedModelVram later calls cachedModelInfo for them, eliminating
   duplicate network requests.

* Fix cache key mismatch: prime both token and anonymous slots

The VRAM hook calls cachedModelInfo without credentials (anonymous key),
but listModels results were primed only under the authenticated key.
For authenticated users the priming was a no-op -- cache miss every time.

Fix: prime both the token-specific slot and the anonymous slot when an
access token is present. Public model metadata (safetensors, tags) is
identical regardless of auth so this is safe.

Also add a defensive guard in primeCacheFromListing for empty name.

* Auto-prime anonymous cache slot from authenticated modelInfo fetches

When cachedModelInfo is called with a token, the result was only stored
under the token-specific key (e.g. model::abc12345). The VRAM hook
calls cachedModelInfo without credentials and reads the anonymous slot
(model::anon), causing a cache miss and duplicate fetch for every
priority model.

Now cachedModelInfo also writes to the anonymous slot on success when
a token is present. Public model metadata (safetensors, tags) is
identical regardless of auth, so this is safe and eliminates ~10
duplicate API calls on first page load.

* Guard anonymous cache priming against gated/private models

Only prime the anonymous cache slot for non-gated, non-private models.
Previously, authenticated modelInfo responses and listing results were
unconditionally copied into the anonymous slot, which could briefly
expose gated/private model metadata after clearing the HF token.

Now checks result.gated and result.private before writing the anon slot.
Public unsloth/ models (the common case) still benefit from the
optimization; gated models like meta-llama/* require a fresh fetch
per auth context.
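A sketch of the guard (cache shape and names are hypothetical; the real logic lives in hf-cache.ts):

```python
def prime_anon_slot(cache: dict, model_id: str, result: dict) -> None:
    # Only copy public metadata into the anonymous slot; gated/private
    # models must be re-fetched per auth context.
    if result.get("gated") or result.get("private"):
        return
    # setdefault: priming never overwrites an existing fresh entry.
    cache.setdefault(f"{model_id}::anon", result)
```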

* Extract primeFromListing helper to deduplicate cache priming logic

The cache priming pattern (prime token slot + conditionally prime anon
slot for non-gated models) was duplicated in three places. Extracted
into a single primeFromListing() function for maintainability.

* Export CachedResult type, add isStale helper, simplify primeFromListing

- Export CachedResult so consumers can use it directly instead of
  the indirect Parameters<typeof ...> pattern.
- Extract isStale(key) helper to deduplicate the cache freshness
  check that was repeated in primeCacheFromListing, cachedModelInfo,
  and the anonymous-slot priming logic.
- Simplify primeFromListing to use CachedResult directly for both
  the data parameter and the gated/private guard, eliminating the
  double cast.

---------

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-03-31 02:21:17 -07:00
Datta Nimmaturi
3b5a49776b
[studio] multi gpu: revert to balanced for inference. (#4698)
* Revert to balanced for inference

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove unused for_inference parameter from get_device_map

Since inference and training both use "balanced" now, the for_inference
flag is dead code. Remove it from the function signature, the call site
in inference.py, and simplify the tests accordingly.

* Remove redundant TestDeviceMapForInference test class

TestGpuAutoSelection already covers the same multi-gpu and single-gpu
device_map assertions. The TestDeviceMapForInference class was left
over from when for_inference had distinct behavior.

* Remove redundant test_get_device_map_multi_gpu_uses_balanced

Its assertions ([0,1] -> balanced, [0] -> sequential) are already
covered by test_get_device_map_uses_explicit_gpu_selection.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-03-31 01:24:41 -07:00
Daniel Han
fe6609a624
fix(studio): open tour ReadMore links in new tab (#4694)
* fix(studio): open tour ReadMore links in new tab

The quick tour "Read more" links navigate away from Studio instead of
opening in a separate tab. Add target="_blank" and rel="noopener
noreferrer" to the ReadMore component so external doc links open in a
new browser tab.

* fix(studio): only open external ReadMore links in new tab

Apply target="_blank" conditionally based on whether the href starts
with "http", so internal links still navigate in the same tab.

* Tighten external-link detection in ReadMore component

Use regex /^https?:\/\// instead of startsWith("http") so the check
requires the full protocol prefix and does not match non-URL strings
that happen to begin with "http".

* Hoist regex to module scope for ReadMore

Move EXTERNAL_URL_RE to top-level constant to satisfy the biome
useTopLevelRegex lint rule and avoid re-creating the RegExp on
every render.
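The hoisted check, sketched with Python's re module (the real constant is a JS RegExp in the ReadMore component):

```python
import re

# Module-scope constant, compiled once: require the full protocol prefix.
EXTERNAL_URL_RE = re.compile(r"^https?://")

def is_external(href: str) -> bool:
    return EXTERNAL_URL_RE.match(href) is not None
```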

---------

Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
2026-03-30 23:41:14 -07:00
Lee Jackson
308bb948d1
studio: prevent false multimodal warning during model loading (#4704)
* studio: gate multimodal incompatibility warning on settled model capabilities

* Also disable Start button during isCheckingVision fallback

When getModelConfig fails and the fallback checkVisionModel is still
in-flight, isLoadingModelDefaults clears before isCheckingVision does.
Without also gating on isCheckingVision the Start button briefly
re-enables with stale capability flags.

Add isCheckingVision to the disabled condition and show "Loading
model..." text while either flag is active.

* Show correct error message for audio dataset incompatibility

The incompatibility warning always said "switch to a vision model"
even when the actual issue was an audio dataset on a non-audio model.
Now shows an audio-specific message when the mismatch is audio.

* Extract isLoadingModel constant for clarity

Pull the combined model-loading condition into a single constant
reused by the settled check, the disabled prop, and the button label.

---------

Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
2026-03-30 23:11:20 -07:00
pre-commit-ci[bot]
66f250a614
[pre-commit.ci] pre-commit autoupdate (#4705)
updates:
- [github.com/astral-sh/ruff-pre-commit: v0.15.7 → v0.15.8](https://github.com/astral-sh/ruff-pre-commit/compare/v0.15.7...v0.15.8)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-03-30 21:58:16 -07:00
Roland Tannous
d6d3f59984
fix: replace hard timeout with inactivity timeout for model loading (#4707)
The 180s wall-clock timeout would kill model loads on slow connections
even when the download was actively progressing. Now the worker sends
heartbeat status messages every 30s during loading, and the orchestrator
resets its 300s deadline on each one — so it only times out when the
subprocess goes truly silent.
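A minimal sketch of the inactivity-timeout idea, driven by synthetic timestamps instead of a real clock (function name and event shape are hypothetical, not the actual worker protocol):

```python
def wait_with_inactivity_timeout(events, inactivity_timeout=300.0, start=0.0):
    """Return "loaded" if the load finishes before the worker goes silent
    for longer than inactivity_timeout seconds; "timeout" otherwise.

    events: list of (timestamp_seconds, kind), kind in {"heartbeat", "done"}.
    """
    deadline = start + inactivity_timeout
    for ts, kind in events:
        if ts > deadline:
            return "timeout"                     # truly silent: give up
        if kind == "done":
            return "loaded"
        deadline = ts + inactivity_timeout       # heartbeat resets the deadline
    return "timeout"
```

A slow but steadily progressing download keeps resetting the deadline, while a stalled one still times out.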
2026-03-31 07:35:04 +04:00
Roland Tannous
7f353acfd4
fix: skip download progress polling for exported GGUF models (#4709)
* fix: skip download progress polling for exported GGUF models

* fix: revert isLocalGgufDir change — exported GGUFs are file paths, not dirs

* fix: set isDownloaded true for all adapters in LoraModelPicker
2026-03-31 07:21:23 +04:00
Etherll
34272a796f
Fix/bun windows bin detection (#4703)
* fix(studio): detect bun .exe shims in Windows binary check

* Update setup.sh

* add .bunx checking
2026-03-30 21:58:33 +04:00
Daniel Han
6d83ad9a28
fix(studio): avoid UnicodeEncodeError on Windows cp1252 consoles (#4699)
* fix(studio): replace unicode emoji in print() to avoid cp1252 crash on Windows

On Windows the default console encoding is cp1252 which cannot encode
Unicode emoji like U+2705 or U+26A0. Bare print() calls with these
characters cause a UnicodeEncodeError at runtime.

- run.py: replace emoji with ASCII status prefixes [OK] and [WARNING]
- format_conversion.py: remove duplicate print() that mirrors the
  logger.info() call on the next line, and drop the emoji from the
  log message since loggers handle encoding separately

* fix(studio): apply same emoji/print cleanup to parallel VLM conversion path

The parallel URL-based conversion logic has the same duplicate print()
with emoji that was fixed in the sequential path. Remove the bare
print() and drop the emoji from the logger.info() call.

* Treat install_python_stack.py failure as fatal in setup.ps1

On Linux/Mac, setup.sh runs under set -euo pipefail so a non-zero
exit from install_python_stack.py aborts the installer. On Windows,
setup.ps1 had no exit code check -- if the Python script crashed
(e.g. from the cp1252 UnicodeEncodeError), the installer silently
continued past the dependency loop and reported success. Studio
would then fail at launch with ModuleNotFoundError for structlog,
fastapi, and other deps that were never installed.

Capture $LASTEXITCODE and exit 1 if the dependency installer fails,
matching the error handling pattern already used for PyTorch install.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-30 06:40:47 -07:00
Daniel Han
a0bca759f3
Fix editable install scanning 6,500+ node_modules dirs (#4697)
* fix: scope packages.find to prevent node_modules namespace scanning

The packages.find section had no include filter, so setuptools'
find_namespace_packages discovered all directories as potential Python
packages -- including the 6,557 directories inside
studio/frontend/node_modules/ after the frontend build step.

This caused the editable install overlay step to run 20,000+ glob
operations across 6,619 "packages", which on fast NVMe takes ~5s but
on slower disks can take 7+ minutes.

Adding an explicit include filter scopes discovery to only the packages
we actually ship (unsloth, unsloth_cli, studio, studio.backend), dropping
from 6,619 to 58 discovered packages and the editable build time from
5.4s to 1.2s.

Also removes the broken kernels/moe exclude (used "/" instead of "."
notation so it never matched) and adds a node_modules exclude as a
safety net.

* fix: use precise node_modules exclude patterns

Use "*.node_modules" and "*.node_modules.*" instead of "*.node_modules*"
to avoid accidentally excluding valid packages that might contain
"node_modules" as a substring in their name.
2026-03-30 02:40:29 -07:00
Datta Nimmaturi
9311df2b29
[Studio] multi gpu finetuning/inference via "balanced_low0/sequential" device_map (#4602)
* [WIP] balanced device map for studio

* gpus as a request parameter

* API for multi GPU stuff

* return multi gpu util in new API

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use balanced_low0 instead of balanced

* Use balanced_low0 instead of balanced

* Fix device_map typo, UUID parsing crash, set() filter bug, and broken tests

- balanced_low0 -> balanced_low_0 (transformers/accelerate rejects the old string)
- get_parent_visible_gpu_ids() now handles UUID/MIG CUDA_VISIBLE_DEVICES
  gracefully instead of crashing on int() parse
- _get_backend_visible_gpu_info(): fix the `set() or None` bug; an empty set
  is falsy, so CUDA_VISIBLE_DEVICES=-1 would disable filtering and report all GPUs
- test_gpu_selection.py: add missing get_visible_gpu_utilization import and
  add required job_id arg to start_training() calls

* Smart GPU determinism using estimates

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* disallow gpu selection for gguf for now

* cleanup

* Slightly larger baseline

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Treat empty list as auto

* Verbose logging/debug

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Cleanup and revert unnecessary deletions

* Cleanup excessive logs and guard against disk/cpu offload

* auth for visibility API. cleanup redundant imports. Adjust QLoRA estimate

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* support for non cuda gpus

* Fix multi-GPU auto-selection memory accounting

The multi_gpu_factor was applied uniformly to all GPUs including the
first one, which unfairly penalizes single-GPU capacity when
transitioning to multi-GPU. This created a discontinuity where a model
that barely fits 1 GPU would suddenly require 2 GPUs because the first
GPU's free memory was discounted by 20%.

Now the first GPU keeps its full free memory, and only additional GPUs
have an overhead factor (0.85) applied to account for inter-GPU
communication and sharding overhead. This gives more accurate
auto-selection and avoids unnecessary multi-GPU for models that
comfortably fit on one device.
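The accounting, as a sketch (the 0.85 factor comes from the message above; everything else is hypothetical):

```python
def usable_free_memory(free_gb_per_gpu, extra_gpu_factor=0.85):
    # First GPU keeps its full free memory; additional GPUs are discounted
    # for inter-GPU communication and sharding overhead.
    return [free if i == 0 else free * extra_gpu_factor
            for i, free in enumerate(free_gb_per_gpu)]
```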

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add sandbox tests for multi-GPU selection logic

24 tests covering model size estimation, memory requirements, automatic
GPU selection, device map generation, GPU ID validation, and multi-GPU
overhead accounting. All tests use mocks so they run without GPUs on
Linux, macOS, and Windows.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix reviewer findings: 4bit inference estimate, fallback, GGUF gpu_ids, retry

1. 4-bit inference now uses reduced memory estimate (model_size/3 + buffer)
   instead of the FP16 1.3x multiplier. This prevents over-sharding
   quantized models across unnecessary GPUs.

2. When model size estimation fails, auto_select_gpu_ids now falls back to
   all visible GPUs instead of returning None (which could default to
   single-GPU loading for an unknown-size model).

3. GGUF inference route now treats gpu_ids=[] as auto-selection (same as
   None) instead of rejecting it as an unsupported explicit request.

4. Training retry path for "could not get source code" now preserves the
   gpu_ids parameter so the retry lands on the same GPUs.

5. Updated sandbox tests to cover the new 4-bit inference estimate branch.
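A sketch of the branch in point 1 (the message only says "model_size/3 + buffer", so the buffer default here is a hypothetical placeholder, not the actual value):

```python
def inference_memory_estimate(model_size_gb, load_in_4bit, buffer_gb=2.0):
    if load_in_4bit:
        return model_size_gb / 3 + buffer_gb   # reduced 4-bit estimate
    return model_size_gb * 1.3                 # FP16 multiplier
```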

* Remove accidentally added unsloth-zoo submodule

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix UUID/MIG visibility and update test expectations

1. nvidia.py: When CUDA_VISIBLE_DEVICES uses UUID/MIG tokens, the
   visibility APIs now return "unresolved" with empty device lists instead
   of exposing all physical GPUs. This prevents the UI from showing GPUs
   that the backend process cannot actually use.

2. test_gpu_selection.py: Updated test expectations to match the new
   multi-GPU overhead accounting (first GPU at full capacity, 0.85x for
   additional GPUs) and 4-bit inference memory estimation formula.
   All 60 tests now pass.

* Add CPU/disk offload guard to audio inference path

The audio model loading branch returned before the common
get_offloaded_device_map_entries() check, so audio models loaded with a
multi-GPU device_map that spilled layers to CPU/disk would be accepted
instead of rejected. Now audio loads also verify no modules are offloaded.

* Improve VRAM requirement estimates

* Replace balanced_low_0 with balanced

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* refine calculations for slightly easier nums

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* adjust estimates

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Use nums instead of obj to avoid serialisation error

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Harden nvidia-smi parsing and fix fallback GPU list

1. nvidia.py: Wrap int() casts for GPU index and memory in try/except
   so MIG slices, N/A values, or unexpected nvidia-smi output skip the
   unparseable row instead of aborting the entire GPU list.

2. nvidia.py: Handle GPU names containing commas by using the last
   field as memory instead of a fixed positional index.

3. hardware.py: fallback_all now uses gpu_candidates (GPUs with verified
   VRAM data) instead of raw devices list, which could include GPUs
   with null VRAM that were excluded from the ranking.
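Points 1 and 2 can be sketched as a single row parser (function name and column layout are illustrative, not the actual nvidia.py code):

```python
def parse_smi_row(row: str):
    # GPU names may themselves contain commas, so take the last field as
    # memory instead of a fixed positional index; unparseable rows (MIG
    # slices, N/A values) are skipped rather than aborting the whole list.
    fields = [f.strip() for f in row.split(",")]
    try:
        index = int(fields[0])
        memory_mib = int(fields[-1])
    except (ValueError, IndexError):
        return None
    name = ", ".join(fields[1:-1])
    return index, name, memory_mib
```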

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* cleanup

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* consolidate raise_if_offload

* Improve MoE support. Guard against nvidia-smi failures

* Improve MoE support. Guard against nvidia-smi failures

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix shared-expert LoRA undercount, torch VRAM fallback, and apply_gpu_ids edge case

1. vram_estimation.py: compute_lora_params now includes shared experts
   (n_shared_experts) alongside routed experts when computing MoE LoRA
   adapter parameters. Previously only n_experts were counted, causing
   the estimator to undercount adapter, optimizer, and gradient memory
   for DeepSeek/GLM-style models with shared experts.

2. hardware.py: _torch_get_per_device_info now uses mem_get_info (which
   reports system-wide VRAM usage) instead of memory_allocated (which
   only reports this process's PyTorch allocations). This prevents
   auto-selection from treating a GPU as mostly free when another
   process is consuming VRAM. Falls back to memory_allocated when
   mem_get_info is unavailable.

3. hardware.py: apply_gpu_ids([]) now returns early instead of setting
   CUDA_VISIBLE_DEVICES="" which would disable CUDA entirely. Empty
   list inherits the parent visibility, same as None.

4. hardware.py: Upgraded fallback_all GPU selection log from debug to
   warning so operators are notified when the model likely will not fit
   in available VRAM.

* Guard nvidia-smi subprocess calls against OSError and TimeoutExpired

get_visible_gpu_utilization and get_backend_visible_gpu_info now catch
OSError (nvidia-smi not found) and TimeoutExpired internally instead
of relying on callers to wrap every invocation. Returns the standard
available=False sentinel on failure so the torch-based fallback in
hardware.py can take over.

* Guard get_primary_gpu_utilization and reset GPU caches between tests

1. nvidia.py: get_primary_gpu_utilization now catches OSError and
   TimeoutExpired internally, matching the pattern already used in
   get_visible_gpu_utilization and get_backend_visible_gpu_info. All
   three nvidia-smi callers are now self-contained.

2. test_gpu_selection.py: Added _GpuCacheResetMixin that resets the
   module-level _physical_gpu_count and _visible_gpu_count caches in
   tearDown. Applied to all test classes that exercise GPU selection,
   device map, or visibility functions. This prevents stale cache
   values from leaking between tests and causing flaky results on
   machines with real GPUs.

* Fix nvidia-smi fallback regression and physical GPU count validation

1. hardware.py: get_gpu_utilization, get_visible_gpu_utilization, and
   get_backend_visible_gpu_info now check result.get("available") before
   returning the nvidia-smi result. When nvidia-smi is unavailable or
   returns no data (e.g., containers without nvidia-smi, UUID/MIG masks),
   the functions fall through to the torch-based fallback instead of
   returning an empty result. This fixes a regression where the internal
   exception handling in nvidia.py prevented the caller's except block
   from triggering the fallback.

2. hardware.py: resolve_requested_gpu_ids now separates negative-ID
   validation from physical upper-bound validation. The physical count
   check is only enforced when it is plausibly a true physical count
   (i.e., higher than the largest parent-visible ID), since
   torch.cuda.device_count() under CUDA_VISIBLE_DEVICES returns the
   visible count, not the physical total. The parent-visible-set check
   remains authoritative in all cases. This prevents valid physical IDs
   like [2, 3] from being rejected as "out of range" when nvidia-smi is
   unavailable and CUDA_VISIBLE_DEVICES="2,3" makes torch report only
   2 devices.

* Fix UUID/MIG torch fallback to enumerate devices by ordinal

When CUDA_VISIBLE_DEVICES uses UUID or MIG identifiers,
get_parent_visible_gpu_ids() returns [] because the tokens are
non-numeric. The torch fallback in get_visible_gpu_utilization() and
get_backend_visible_gpu_info() previously passed that empty list to
_torch_get_per_device_info(), getting nothing back.

Now both functions detect the empty-list case and fall back to
enumerating torch-visible ordinals (0..device_count-1) with
index_kind="relative". This means the UI and auto-selection still
see real device data in Kubernetes, MIG, and Slurm-style UUID
environments where nvidia-smi output cannot be mapped to physical
indices.

Updated test_uuid_parent_visibility to verify the new torch fallback
path returns available=True with relative ordinals.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add type hint for gpu_ids parameter in InferenceOrchestrator.load_model

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-03-30 02:33:15 -07:00
Michael Han
fbfcbc69f2
Update README.md 2026-03-30 01:34:36 -07:00
Michael Han
d2b8ed8def
Update install.md 2026-03-30 01:33:33 -07:00
Lee Jackson
2f0a5baa87
fix(studio): preserve GGUF context max after apply and refresh (#4691)
Fixes #4670

Separates the GGUF context slider ceiling from the currently active context length so lowering context via Chat Settings no longer locks the slider max to the reduced value.

- Backend: adds `max_context_length` to GGUF load/status responses, computed from the largest VRAM/KV-fit cap across all usable GPU subsets
- Frontend: stores `ggufMaxContextLength` and uses it for Context Length slider/input bounds; hydrates from both `/api/inference/load` and `/api/inference/status`
- Defaults UI ceiling to native context for CPU-only and fallback paths
- Seeds `effective_ctx` and `max_available_ctx` before GPU probing to prevent `UnboundLocalError` on probe failure
- Property fallback uses native `_context_length`, not effective `context_length`
2026-03-30 01:33:16 -07:00
Lee Jackson
5557e1fd27
studio: unify Windows installer/setup logging style, verbosity controls, and startup messaging (#4651)
* refactor(studio): unify setup terminal output style and add verbose setup mode

* studio(windows): align setup.ps1 banner/steps with setup.sh (ANSI, verbose)

* studio(setup): revert nvcc path reordering to match main

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio(setup): restore fail-fast llama.cpp setup flow

* studio(banner): use IPv6 loopback URL when binding :: or ::1

* Fix IPv6 URL bracketing, try_quiet stderr, _step label clamp

- Bracket IPv6 display_host in external_url to produce clickable URLs
- Redirect try_quiet failure log to stderr instead of stdout
- Clamp _step label to column width to prevent negative padding

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add sandbox integration tests for PR #4494 UX fixes

Simulation harness (tests/simulate_pr4494.py) creates an isolated uv
venv, copies the real source files into it, and runs subprocess tests
for all three fixes with visual before/after demos and edge cases.

Standalone bash test (tests/test_try_quiet.sh) validates try_quiet
stderr redirect across 8 scenarios including broken-version contrast.

39 integration tests total (14 IPv6 + 15 try_quiet + 10 _step), all
existing 75 unit tests still pass.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Truncate step() labels in setup.sh to match PS1 and Python

The %-15s printf format pads short labels but does not truncate long
ones.  Change to %-15.15s so labels wider than 15 chars are clipped,
matching the PowerShell .Substring(0,15) and Python label[:15] logic.
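Python's %-formatting follows the same printf width/precision rules, so the difference is easy to check (the label string is a made-up example):

```python
label = "Install llama.cpp deps"     # 22 characters
padded_only = "%-15s" % label        # width pads short strings, never clips
clipped = "%-15.15s" % label         # .15 precision also clips long ones
```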

* Remove sandbox integration tests from PR

These test files are not part of the styling fix and should not
ship with this PR.

* Show error output on failure instead of suppressing it

- install_python_stack.py: restore _red for patch_package_file
  warnings (was downgraded to _dim)
- setup.ps1: capture winget output and show on failure for CUDA,
  Node, Python, and OpenSSL installs (was piped to Out-Null)
- setup.ps1: always show git pull failure warning, not just in
  verbose mode

* Show winget error output for Git and CMake installs on failure

Same capture-and-print-on-failure pattern already used for
Node, Python, CUDA, and OpenSSL winget installs.

* fix: preserve stderr for _run_quiet error messages in setup.sh

The step() helper writes to stdout, but _run_quiet's error header
was originally sent to stderr (>&2). Without the redirect, callers
that separate stdout/stderr would miss the failure headline while
still seeing the log body on stderr. Add >&2 to both step calls
inside _run_quiet to match main's behavior.

* feat: add --verbose flag to setup and update commands

Wire UNSLOTH_VERBOSE=1 through _run_setup_script() so that
'unsloth studio update --verbose' (and the deprecated 'setup')
passes the flag to setup.sh / setup.ps1 / install_python_stack.py.

* fix(studio): honor verbose logging and keep llama.cpp failures non-blocking

* fix(studio): switch installer to 'studio update' and normalize Windows setup logs

* chore(studio): refine localhost tip and remove skip-base setup noise

* fix(studio): align Windows setup logs with Linux style and improve startup tips

* fix(studio): align Windows setup logs with Linux style

* refactor(windows-installer): align install/setup logs with Linux style and silence auto-launch output

* refactor(windows): align installer/setup output with Linux style and reduce default verbosity

* refactor(windows): match install.ps1 output style/colors to setup and quiet default logs

* fix(studio-banner): update personal-computer localhost tip

* fix(setup.sh): restore verbose llama.cpp build output while keeping default quiet mode

* fix(install.sh): align installer logging with setup style and restore POSIX-safe color output

* fix(install.sh): preserve installer reliability and launch visibility

Export verbose mode for child setup processes, harden install command handling under set -e, and keep first-run studio launch non-silent so users can always see URL and port fallback output.

* fix(windows installer): keep exit semantics and degrade status accurate

Use quiet command redirection that preserves native exit codes, keep startup output visible on first launch, and report limited install status when llama.cpp is unavailable.

* fix(setup.sh): improve log clarity and enforce GGUF degraded signaling

Restore clean default setup output, add verbose-only diagnostics, fail fast on Colab dependency install errors, and return non-zero when GGUF prerequisites or llama.cpp artifacts are unavailable.

* fix(installer): harden bash preflight and PowerShell GPU checks

Fail fast when bash is unavailable before invoking setup.sh, and replace remaining nvidia-smi pipeline checks with stream redirection patterns that preserve reliable native exit-code handling.

* fix(windows): keep verbose output visible while preserving exit codes

Ensure PowerShell wrapper helpers in install/update stream native command output to host without returning it as function output, so npm logs no longer corrupt exit-code checks in verbose mode.

* fix(windows): avoid sticky UNSLOTH_VERBOSE and gate studio update verbosity

* Fix degraded llama.cpp exit code, PS verbose stderr, banner URLs, npm verbose

- setup.sh: Do not exit non-zero when llama.cpp is unavailable; the footer
  already reports the limitation, and install.sh runs under set -e so a
  non-zero exit aborts the entire install including PATH/shortcuts/launch.
- setup.ps1: Remove $? check in Invoke-SetupCommand verbose path; PS 5.1
  sets $? = $false when native commands write to stderr even with exit 0.
  Merge stderr into stdout with 2>&1 and rely solely on $LASTEXITCODE.
- startup_banner.py: Show the actual bound address when Studio is bound to
  a non-loopback interface instead of always showing 127.0.0.1/localhost.
- setup.sh: Use run_quiet_no_exit instead of run_quiet_no_exit_always for
  npm install steps so --verbose correctly surfaces npm output.

* Fix install.ps1 verbose stderr, propagate UNSLOTH_VERBOSE, fix git clone verbose

- install.ps1: Apply same Invoke-InstallCommand fix as setup.ps1 -- merge
  stderr into stdout with 2>&1 and drop the $? check that misclassifies
  successful native commands on PS 5.1.
- install.ps1 + setup.ps1: Export UNSLOTH_VERBOSE=1 to the process env
  when --verbose is passed so child processes like install_python_stack.py
  also run in verbose mode.
- setup.sh: Use run_quiet_no_exit for git clone llama.cpp so --verbose
  correctly surfaces clone diagnostics during source-build fallback.

* Surface prebuilt llama.cpp output in verbose mode, remove dead code, fix banner

- setup.sh: Use tee in verbose mode for prebuilt llama.cpp installer so
  users can see download/validation progress while still capturing the log
  for structured error reporting on failure.
- setup.ps1: Same fix for Windows -- use Tee-Object in verbose mode.
- setup.sh: Remove run_quiet_no_exit_always() which has no remaining callers.
- startup_banner.py: Avoid printing the same URL twice when Studio is
  bound to a specific non-loopback address that matches the display host.

* Fix run_install_cmd exit code after failed if-statement

The previous pattern 'if "$@"; then return 0; fi; _rc=$?' always captured
$? = 0 because $? reflects the if-statement result, not the command's exit
code. Switch to '"$@" && return 0; _rc=$?' which preserves the actual
command exit code on failure. Applies to both verbose and quiet branches.

* Fix _run_quiet exit code, double uv install, missing --local flag

- setup.sh: Fix _run_quiet verbose path that always captured exit code 0
  due to $? resetting after if-then-fi with no else. Switch to the same
  '"$@" && return 0; exit_code=$?' pattern used in install.sh.
- setup.sh: Consolidate the two uv install branches (verbose + quiet)
  into a single attempt with conditional output. Previously, when verbose
  mode was on and the install failed, a second silent attempt was made.
- install.ps1: Pass --local flag to 'unsloth studio update' when
  $StudioLocalInstall is true. Without this, studio.py's update() command
  overwrites STUDIO_LOCAL_INSTALL to "0", which could cause issues if
  setup.ps1 or install_python_stack.py later checks that variable.

* Revert SKIP_STUDIO_BASE change for --no-torch, restore install banners

- Revert SKIP_STUDIO_BASE from 0 to 1 for --no-torch. install.sh already
  installs unsloth+unsloth-zoo and no-torch-runtime.txt before calling
  setup.sh, so letting install_python_stack.py redo it was redundant and
  slowed down --no-torch installs for no benefit.
- Restore the "Unsloth Studio installed!" success banner and "starting
  Unsloth Studio..." launch message so users get clear install completion
  feedback before the server starts.

* Make llama.cpp build failure a hard error with proper cleanup

- setup.sh: Restore exit 1 when _LLAMA_CPP_DEGRADED is true. GGUF
  inference requires a working llama.cpp build, so this should be a
  hard failure, not a silent degradation.
- install.sh: Catch setup.sh's non-zero exit with '|| _SETUP_EXIT=$?'
  instead of letting set -e abort immediately. This ensures PATH setup,
  symlinks, and shortcuts still get created so the user can fix the
  build deps and retry with 'unsloth studio update'. After post-install
  steps, propagate the failure with a clear error message.
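
Under `set -e`, the catch-and-propagate flow described above looks roughly like this. The helper names and messages are illustrative, not the real install.sh:

```shell
set -e

setup_stub() { return 3; }   # stands in for a failing setup.sh

run_install() {
    _SETUP_EXIT=0
    setup_stub || _SETUP_EXIT=$?   # left side of || is exempt from set -e

    # Post-install steps (PATH, symlinks, shortcuts) still run here, so the
    # user can fix build deps and retry later.
    echo "post-install steps completed"

    if [ "$_SETUP_EXIT" -ne 0 ]; then
        echo "setup failed (exit $_SETUP_EXIT)" >&2
        return "$_SETUP_EXIT"
    fi
}

run_install || echo "propagated exit code: $?"
```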

* Revert install.ps1 to 'studio setup' to preserve SKIP_STUDIO_BASE

'studio update' pops SKIP_STUDIO_BASE from the environment, which
defeats the fast-path version check added in PR #4667. When called
from install.ps1 (which already installed packages), SKIP_STUDIO_BASE=1
must survive into setup.ps1 so it skips the redundant PyPI check and
package reinstallation. 'studio setup' does not modify env vars.

* Remove deprecation message from 'studio setup' command

install.ps1 uses 'studio setup' (not 'studio update') to preserve
SKIP_STUDIO_BASE. The deprecation message was confusing during first
install since the user never typed the command.

* Fix stale env vars, scope degraded exit, generic error message for PR #4651

- install.ps1: Always set STUDIO_LOCAL_INSTALL and clear STUDIO_LOCAL_REPO
  when not using --local, to prevent stale values from a previous --local
  run in the same PowerShell session. Fix log messages to say 'setup' not
  'update' since we call 'studio setup'.
- setup.sh: Only exit non-zero for degraded llama.cpp when called from the
  installer (SKIP_STUDIO_BASE=1). Direct 'unsloth studio update' keeps
  degraded installs successful since Studio is still usable for non-GGUF
  workflows and the footer already reports the limitation.
- install.sh: Make the setup failure error message generic instead of
  GGUF-specific, so unrelated failures (npm, Python deps) do not show
  misleading cmake/git recovery advice.

* Show captured output on failure in quiet mode for PR #4651

Both Invoke-InstallCommand (install.ps1) and Invoke-SetupCommand
(setup.ps1) now capture command output in quiet mode and display it
in red when the command fails. This matches the behavior of
run_install_cmd in install.sh where failure output is surfaced even
in quiet mode, making cross-platform error debugging consistent.
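
The install.sh behavior being matched can be sketched as below; `run_quiet` and `noisy_fail` are illustrative names, not the actual run_install_cmd:

```shell
# Quiet mode: capture all output; surface it (here, on stderr) only when
# the command fails, preserving the command's real exit code.
run_quiet() {
    _log="$(mktemp)"
    "$@" >"$_log" 2>&1 && { rm -f "$_log"; return 0; }
    _rc=$?
    echo "command failed (exit $_rc); captured output:" >&2
    cat "$_log" >&2
    rm -f "$_log"
    return "$_rc"
}

noisy_fail() { echo "something broke"; return 5; }

run_quiet noisy_fail || true
```

The PowerShell counterparts do the same capture but print the failure output in red via the console host.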

* Match degraded llama.cpp exit on Windows, fix --local recovery hint for PR #4651

- setup.ps1: Exit non-zero for degraded llama.cpp when called from
  install.ps1 (SKIP_STUDIO_BASE=1), matching setup.sh behavior. Direct
  'unsloth studio update' keeps degraded installs successful.
- install.sh: Show 'unsloth studio update --local' in the recovery
  message when the install was run with --local, so users retry with
  the correct flag instead of losing local checkout context.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-03-30 00:53:23 -07:00
Roland Tannous
5bbfabb151
fix: [Studio] setup.ps1 update-flow for windows (#4667)
* fix: add PyPI version check to setup.ps1 for fast update path

Port the update-flow logic from setup.sh to setup.ps1 so that
`unsloth studio update` on Windows skips Python dependency reinstall
when the installed version already matches PyPI latest.

* fix: clear SKIP_STUDIO_BASE in update command

install.ps1 sets SKIP_STUDIO_BASE=1 which persists in the PowerShell
session. If the user runs `unsloth studio update` in the same terminal,
the env var causes the version check to be skipped. Clear it explicitly
in the update command.

* fix: harden version check and clear stale env vars in update flow

- Normalize $InstalledVer with Out-String + Trim() to avoid array/whitespace
  comparison issues in PowerShell 5.1 (python output can be captured as
  string[] instead of scalar string)
- Move Fast-Install --upgrade pip inside if (-not $SkipPythonDeps) so the
  fast path avoids unnecessary network round-trips
- Clear STUDIO_LOCAL_REPO when --local is not passed to prevent a previous
  --local session from leaking into a plain update

---------

Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-03-29 21:14:36 -07:00
Roland Tannous
a6c1f893fc
Fix blank page on Windows due to broken .js MIME type (#4674)
* Fix blank page on Windows due to broken .js MIME type in registry

* Update studio/backend/main.py

Add Gemini's defensive suggestion: apply the mimetypes override only on Windows platforms.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-28 22:26:49 +04:00