unsloth/studio/install_python_stack.py

1151 lines
39 KiB
Python

#!/usr/bin/env python3
2026-03-12 18:28:04 +00:00
2026-03-12 17:23:10 +00:00
# SPDX-License-Identifier: AGPL-3.0-only
# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved. See /studio/LICENSE.AGPL-3.0
"""Cross-platform Python dependency installer for Unsloth Studio.
Called by both setup.sh (Linux / WSL) and setup.ps1 (Windows) after the
virtual environment is already activated. Expects `pip` and `python` on
PATH to point at the venv.
"""
from __future__ import annotations
import os
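The module docstring assumes two things: the virtual environment is already activated, and `pip` and `python` on PATH resolve into it. A caller could check both before installing anything. Below is a minimal sketch of one way to do that. `venv_is_active` and `tool_in_venv` are hypothetical helpers and are not part of this file. The sketch assumes a standard PEP 405 venv.

```python
from __future__ import annotations

import os
import shutil
import sys


def venv_is_active() -> bool:
    # In a PEP 405 virtual environment, sys.prefix points at the venv while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def tool_in_venv(tool: str, prefix: str = sys.prefix, path: str | None = None) -> bool:
    """True when `tool`, as resolved on PATH (or `path`), lives under `prefix`."""
    found = shutil.which(tool, path=path)
    if found is None:
        return False
    # Compare absolute paths without resolving symlinks: a venv's python is
    # often a symlink to the base interpreter, and resolving it would point
    # outside the venv directory.
    prefix_abs = os.path.abspath(prefix)
    try:
        return os.path.commonpath([os.path.abspath(found), prefix_abs]) == prefix_abs
    except ValueError:  # e.g. paths on different drives on Windows
        return False
```

A caller might run `tool_in_venv("pip")` and `tool_in_venv("python")` and abort with a clear message if either returns False.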
fix: install.sh Mac Intel compatibility + Studio no-torch support (#4624)

* fix: install.sh Mac Intel compatibility + Studio no-torch support (#4621)

On Intel Macs (x86_64), PyTorch has no wheels for torch >= 2.3, so the installer crashes. Even when torch is absent, Studio crashes on startup because two files have bare top-level torch imports. Studio's GGUF inference (llama.cpp) does not need PyTorch. Training and HF-inference already isolate torch to subprocesses. Only 2 files in the server startup chain had top-level torch imports preventing startup.

Changes:
- install.sh: detect architecture, default to Python 3.12 on Intel Mac, skip torch install, add Python 3.13.8 guard for arm64, pass UNSLOTH_NO_TORCH env var to setup.sh
- data_collators.py: remove unused `import torch` (no torch.* refs)
- chat_templates.py: lazy-import IterableDataset into function bodies
- install_python_stack.py: add IS_MACOS/NO_TORCH constants, skip torch-dependent packages, skip overrides.txt, skip triton on macOS

No existing working flow changes. Linux/WSL and macOS arm64 behavior is identical.

* tests: add test suite for Mac Intel compat + no-torch mode

Shell tests (test_mac_intel_compat.sh):
- version_ge edge cases (9 tests)
- Architecture detection for Darwin x86_64/arm64, Linux x86_64/aarch64
- get_torch_index_url returns cpu on simulated Darwin
- UNSLOTH_NO_TORCH propagation to both setup.sh branches

Python unit tests (test_no_torch_filtering.py):
- _filter_requirements with NO_TORCH_SKIP_PACKAGES
- NO_TORCH env var parsing (true/1/TRUE/false/0/unset)
- IS_MACOS constant check
- Overrides skip and triton macOS skip guards

Python import tests (test_studio_import_no_torch.py):
- data_collators.py loads in isolated no-torch venv
- chat_templates.py has no top-level torch imports
- Negative control confirms import torch fails without torch

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* tests: add E2E sandbox tests for Mac Intel no-torch mode

Replace static/synthetic test stubs with real sandbox tests:
- Shell: E2E uv venv creation at Python 3.12, mock uv shim to verify torch install is skipped when MAC_INTEL=true, dynamic env propagation test for UNSLOTH_NO_TORCH in both local and non-local install paths
- Python filtering: test real extras.txt and extras-no-deps.txt with NO_TORCH_SKIP_PACKAGES, subprocess mock of install_python_stack() for 5 platform configs (NO_TORCH+macOS, Windows+NO_TORCH, normal Linux, Windows-only, macOS-only), VCS URL and env marker edge cases
- Python imports: parametrized Python 3.12+3.13 venv fixture, dataclass instantiation for all 3 collator classes, chat_templates.py exec with stubs, negative controls proving import torch and torchao install fail in no-torch venvs

91 total tests, all passing.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: address reviewer findings for Intel Mac no-torch mode

P1 fixes:
- Auto-infer NO_TORCH in install_python_stack.py via platform.machine() so `unsloth studio update` preserves GGUF-only mode without needing the UNSLOTH_NO_TORCH env var (6/10 reviewers)
- Add openai-whisper and transformers-cfg to NO_TORCH_SKIP_PACKAGES since both have unconditional torch dependencies (4/10 reviewers)
- Skip unsloth-zoo on Intel Mac --local installs (depends on torch) in both migrated and fresh install paths (1/10)
- Recreate stale 3.13 venvs as 3.12 on Intel Mac re-runs (1/10)
- Detect Apple Silicon under Rosetta via sysctl hw.optional.arm64 and warn user to use native arm64 terminal (1/10)

P2 fixes:
- Wire new test files into tests/run_all.sh (4/10 reviewers)
- Add update-path tests (skip_base=False) for Intel Mac
- Add _infer_no_torch tests for platform auto-detection

P3 fixes:
- Fix macOS progress bar total (triton step skipped but was counted)
- Fix temp file leak when Windows + NO_TORCH filters stack

All tests pass: 30 shell, 66 Python (96 total).

* feat: add --python override flag to install.sh

Lets users force a specific Python version, e.g. ./install.sh --python 3.12. Addresses M2 Mac users whose systems resolve to a problematic 3.13.x patch. When --python is set, the Intel Mac stale-venv guard and 3.13.8 auto-downgrade are skipped so the user's choice is respected.

* tests: add comprehensive E2E sandbox tests for no-torch mode

Add test_e2e_no_torch_sandbox.py with 7 test groups (43 tests total) covering the full no-torch import chain, edge cases, and install logic:
- Group 1: BEFORE vs AFTER import chain comparison (proves the bug existed and the fix works by synthetically prepending top-level torch imports)
- Group 2: Dataclass instantiation without torch
- Group 3: Edge cases with broken/fake torch modules on sys.path
- Group 4: Hardware detection fallback to CPU without torch
- Group 5: install.sh flag parsing, version resolution, arch detection
- Group 6: install_python_stack.py NO_TORCH filtering
- Group 7: Live server startup without torch (marked @server, skipped when studio venv is unavailable)

All 43 tests pass on both Python 3.12 and 3.13 isolated venvs.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* feat: add --no-torch flag to install.sh/ps1, fix lazy import bug in dataset formatting

- Fix chat_templates.py: narrow torch IterableDataset import into inner try/except ImportError so dataset.map() works without torch installed
- Fix format_conversion.py: same lazy import fix for convert_chatml_to_alpaca and convert_alpaca_to_chatml
- Add --no-torch flag to install.sh with unified SKIP_TORCH variable (driven by --no-torch flag OR MAC_INTEL auto-detection)
- Add --no-torch flag to install.ps1 with $SkipTorch variable
- Print CPU hint when no GPU detected and --no-torch not set
- Replace MAC_INTEL guards with SKIP_TORCH in torch install sections
- Update shell tests (40 pass) and Python tests (90 pass)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: address reviewer findings for --no-torch installer paths

- Fix migrated-env branch in install.sh and install.ps1: check SKIP_TORCH first, then branch on STUDIO_LOCAL_INSTALL. Previously SKIP_TORCH+non-local fell into else and installed unsloth-zoo (which depends on torch), defeating --no-torch mode.
- Fix $env:UNSLOTH_NO_TORCH leak in install.ps1: always set to "true" or "false" instead of only setting on the true branch. Prevents stale no-torch state from leaking across runs in the same PS session.
- Fix install_python_stack.py update path: add NO_TORCH guard around base.txt install so unsloth studio update does not reinstall unsloth-zoo (which depends on torch) in no-torch mode.

* fix: install unsloth + unsloth-zoo with --no-deps in no-torch mode

Instead of skipping unsloth-zoo entirely (which breaks unsloth's dependency on it), install both packages with --no-deps so they are present but torch is not pulled in transitively. Applied consistently across all no-torch paths: migrated-env, fresh-local, fresh-non-local in install.sh, install.ps1, and install_python_stack.py.

* chore: temporarily remove test files (will be added in a follow-up)

* refactor: deduplicate SKIP_TORCH conditional branches in installers

Collapse if/else blocks that differ only by --no-deps into a single branch with a conditional flag variable. Applied to migrated-env and fresh-local paths in install.sh, install.ps1, and install_python_stack.py.

* fix: apply --no-deps to fresh non-local --no-torch install path

The non-local else branch was missing $_no_deps_arg/$noDepsArg, so uv pip install unsloth would resolve torch from PyPI metadata (the published unsloth package still declares torch as a hard dep). Now --no-deps is applied consistently to all SKIP_TORCH code paths.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-03-27 09:09:21 +00:00
import platform
import shutil
import subprocess
import sys
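The no-torch commit message above describes two related pieces of logic. First, NO_TORCH is parsed from `UNSLOTH_NO_TORCH` (true/1/TRUE count as set; false/0/unset do not) and is also auto-inferred on Intel Macs via `platform.machine()`. Second, the install branches were collapsed so that `--no-deps` becomes a conditional flag. Below is a minimal Python sketch of both. `infer_no_torch` and `base_install_cmd` are hypothetical names; the commit message refers to the real helper as `_infer_no_torch`, whose body is not shown here. The exact `uv pip install` argument list is illustrative. The commit message does not say whether an explicit `UNSLOTH_NO_TORCH=false` overrides the Intel Mac inference. In this sketch it does not.

```python
import os
import platform
import sys
from typing import List, Mapping, Optional

# Spellings treated as "set", matched case-insensitively (true/1/TRUE).
_TRUTHY = {"1", "true"}


def infer_no_torch(
    env: Optional[Mapping[str, str]] = None,
    sys_platform: str = sys.platform,
    machine: Optional[str] = None,
) -> bool:
    """Hypothetical: decide whether to run in GGUF-only (no-torch) mode."""
    env = os.environ if env is None else env
    if env.get("UNSLOTH_NO_TORCH", "").strip().lower() in _TRUTHY:
        return True
    machine = platform.machine() if machine is None else machine
    # No torch >= 2.3 wheels exist for Intel Macs, so infer no-torch there
    # even when UNSLOTH_NO_TORCH was not propagated (e.g. on an update run).
    return sys_platform == "darwin" and machine == "x86_64"


def base_install_cmd(no_torch: bool) -> List[str]:
    """Hypothetical: a single branch with --no-deps as a conditional flag,
    so torch is not pulled in transitively in no-torch mode."""
    no_deps = ["--no-deps"] if no_torch else []
    return ["uv", "pip", "install", *no_deps, "unsloth", "unsloth-zoo"]
```

Keeping one command builder with an optional flag list avoids two if/else branches that drift apart, which is the duplication the refactor entry describes.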
2026-02-27 20:31:57 +00:00
import tempfile
import urllib.request
from pathlib import Path
from backend.utils.wheel_utils import (
    flash_attn_package_version,
    flash_attn_wheel_url,
    install_wheel,
    probe_torch_wheel_env,
    url_exists,
)
2026-02-27 20:31:57 +00:00
IS_WINDOWS = sys.platform == "win32"
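The no-torch commit message lists several behaviors of the requirement filter:
- `_filter_requirements` runs with `NO_TORCH_SKIP_PACKAGES`.
- The filter has to cope with VCS URL and environment-marker edge cases.
- A temp-file leak occurred when the Windows and NO_TORCH filters were stacked.

The sketch below is a hypothetical stand-in, not the module's implementation. Its names differ from the real ones. The skip set holds only the two packages the commit message names (openai-whisper, transformers-cfg), and the real set may be larger. Name extraction is best-effort rather than a full PEP 508 parser: lines whose name cannot be determined are kept.

```python
import os
import re
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterable, Iterator, List, Optional, Set

# Hypothetical subset: only the two packages the commit message names.
NO_TORCH_SKIP_EXAMPLE = {"openai-whisper", "transformers-cfg"}

_NAME_RE = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)")
_EGG_RE = re.compile(r"[#&]egg=([A-Za-z0-9][A-Za-z0-9._-]*)")


def _normalize(name: str) -> str:
    # PEP 503: compare case-insensitively, treating runs of "-", "_", "." alike.
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(line: str) -> Optional[str]:
    """Best-effort project name for one requirements.txt line, else None."""
    # "#" starts a comment only at line start or after whitespace, so URL
    # fragments such as "#egg=" survive.
    text = re.sub(r"(^|\s)#.*$", "", line).strip()
    if not text or text.startswith("-"):
        return None  # blank, comment, or pip option such as -r / --index-url
    spec = text.split(";", 1)[0].strip()  # drop the environment marker
    egg = _EGG_RE.search(spec)
    if egg:
        return _normalize(egg.group(1))  # VCS URL carrying #egg=<name>
    head = spec.split("@", 1)[0]  # PEP 508 direct reference: "name @ url"
    if "://" in head:
        return None  # bare URL with no name: cannot tell, so keep the line
    match = _NAME_RE.match(head.strip())
    return _normalize(match.group(1)) if match else None


def filter_requirements(lines: Iterable[str], skip: Set[str]) -> List[str]:
    skip_norm = {_normalize(s) for s in skip}
    return [ln for ln in lines if requirement_name(ln) not in skip_norm]


@contextmanager
def filtered_requirements_file(src: Path, skip: Set[str]) -> Iterator[Path]:
    """Write the filtered copy to a temp file and always delete it on exit,
    so nesting one filter inside another leaves no temp files behind."""
    kept = filter_requirements(src.read_text(encoding="utf-8").splitlines(), skip)
    fd, name = tempfile.mkstemp(prefix="reqs-", suffix=".txt")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.writelines(line + "\n" for line in kept)
        yield Path(name)
    finally:
        os.unlink(name)
```

Making each filtered file a context manager ties its lifetime to a `with` block. A Windows-only filter wrapped around a NO_TORCH filter therefore cleans up both files even if the install step in between raises.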
2026-03-27 09:09:21 +00:00
IS_MACOS = sys.platform == "darwin"
IS_MAC_INTEL = IS_MACOS and platform.machine() == "x86_64"
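`IS_MAC_INTEL` relies on `platform.machine()`. For an x86_64 Python running under Rosetta 2 on Apple Silicon, that call also reports `x86_64`. The no-torch commit message says install.sh detects this case with `sysctl hw.optional.arm64` and warns the user to switch to a native arm64 terminal. Below is a Python sketch of the same distinction. `classify_mac` is a hypothetical helper. The sketch assumes the sysctl key reads `1` on Apple Silicon and is absent or non-`1` on Intel Macs. The `sysctl` parameter is injectable so the logic can be exercised off-Mac.

```python
import platform
import subprocess
import sys
from typing import Callable, Optional


def _sysctl(key: str) -> Optional[str]:
    """Return `sysctl -n <key>` output, or None if the key or tool is missing."""
    try:
        result = subprocess.run(
            ["sysctl", "-n", key], capture_output=True, text=True, timeout=5
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def classify_mac(
    sys_platform: str = sys.platform,
    machine: Optional[str] = None,
    sysctl: Callable[[str], Optional[str]] = _sysctl,
) -> str:
    """Return "not-mac", "arm64", "intel", or "rosetta"."""
    if sys_platform != "darwin":
        return "not-mac"
    machine = platform.machine() if machine is None else machine
    if machine == "arm64":
        return "arm64"
    # An x86_64 report means either a real Intel Mac or an x86_64 process
    # translated by Rosetta 2; the hardware capability flag tells them apart.
    return "rosetta" if sysctl("hw.optional.arm64") == "1" else "intel"
```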
Add AMD ROCm/HIP support across installer and hardware detection (#4720)

* Add ROCm detection to install.sh and expand shell tests

Add AMD ROCm GPU detection to get_torch_index_url() in install.sh. When nvidia-smi is not found, probe for ROCm via amd-smi, /opt/rocm version file, hipconfig, dpkg-query, and rpm. Includes validation guard for malformed _rocm_tag, Debian epoch prefix stripping, ROCm 7.2+ cap to rocm7.1 index, bitsandbytes AMD install, and status messaging. Shell tests expanded to 23 cases.

Co-authored-by: Daniel Han <danielhanchen@gmail.com>

* Add ROCm torch reinstall support to install_python_stack.py

Add _detect_rocm_version() and _ensure_rocm_torch() to detect when a Linux host has ROCm but the venv received CPU-only torch, and reinstall with the correct ROCm wheels. Covers ROCm 6.0 through 7.1 with a 30-second timeout on the torch GPU probe subprocess.

Co-authored-by: Daniel Han <danielhanchen@gmail.com>

* Add ROCm support to llama.cpp prebuilt installer

Add has_rocm field to HostInfo, extend detect_host() to probe for ROCm via hipcc/amd-smi/rocm-smi/ROCM_PATH, and route ROCm hosts to upstream prebuilts (Linux ROCm 7.2 prebuilt with source fallback, Windows HIP prebuilt with CPU fallback). Add linux-rocm and windows-hip install kinds to runtime_patterns_for_choice().

Co-authored-by: Daniel Han <danielhanchen@gmail.com>

* Add IS_ROCM hardware flag and fix AMD error message

Add IS_ROCM flag to hardware.py detect_hardware() (set when torch.version.hip is present, DeviceType stays CUDA). Export IS_ROCM from __init__.py. Add "rocm" key to get_package_versions(). Replace "We do not support AMD" error in tokenizer_utils.py with a helpful message pointing to ROCm installation docs.

Co-authored-by: Daniel Han <danielhanchen@gmail.com>

* Add comprehensive ROCm support test suite (68 tests)

Add tests/studio/install/test_rocm_support.py covering all ROCm code paths across install_llama_prebuilt.py, install_python_stack.py, hardware.py, tokenizer_utils.py, and install.sh. All tests use mocks and run without AMD hardware.

Covers: asset selection (11), runtime patterns (5), HostInfo (4), ROCm version detection (9), torch reinstall (9), index mapping (8), hardware flag (8), tokenizer message (2), install.sh structure (10), and live regression (1).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Harden ROCm support: probe error handling, version cap, validation

Address review findings from 8 independent reviewers:
- Wrap _ensure_rocm_torch() torch probe in try/except for TimeoutExpired and OSError so a hung or broken torch import does not crash the installer (8/8 reviewers flagged this)
- Add torch>=2.4,<2.11.0 version cap to the ROCm reinstall path to prevent installing unsupported torch 2.11.0 from the rocm7.1 index
- Use with-statement for file reads in _detect_rocm_version() to avoid resource leaks
- Handle ROCM_PATH="" correctly (use `or "/opt/rocm"` instead of default parameter to avoid relative path resolution)
- Strengthen shell validation guard from rocm[0-9] to rocm[1-9] to reject rocm0.x tags that would produce nonexistent PyTorch index URLs
- Switch shell version cap from blocklist to allowlist (rocm6.*|rocm7.0*|rocm7.1* pass through, everything else caps to rocm7.1) so future ROCm 10+ does not fall through to a nonexistent index
- Add sorted() to _ROCM_TORCH_INDEX lookup for defensive ordering
- Fix test_probe_timeout_handled: replace zero-assertion test with proper assertions verifying reinstall proceeds after timeout

* Clean up rocm_paths list construction in detect_host()

Filter None from the ROCM_PATH env var lookup at list construction time instead of relying on the inline `if p` guard in the any() call.

* Require actual AMD GPU presence before selecting ROCm paths

All 8 reviewers across 2 cycles independently flagged that ROCm detection used toolkit/filesystem hints (hipcc, /opt/rocm, rocm-core) as a proxy for GPU presence, which would misroute CPU-only or NVIDIA hosts that happen to have ROCm tools installed. Now all 3 detection points (install.sh, install_python_stack.py, install_llama_prebuilt.py) probe for an actual AMD GPU before entering the ROCm path:
- install.sh: check rocminfo for gfx* GPU names, or amd-smi list for device rows, before version detection
- install_python_stack.py: new _has_rocm_gpu() function probes rocminfo and amd-smi list before _ensure_rocm_torch() proceeds
- install_llama_prebuilt.py: detect_host() probes rocminfo/amd-smi list instead of just checking tool existence or directory paths

Also:
- Shell test mock amd-smi now handles "list" subcommand
- Python tests updated to mock _has_rocm_gpu where needed
- Added test_no_gpu_with_rocm_tools_skips to verify the new guard
- Test index lookups now use sorted() to match production code

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Harden hipconfig version parsing and torch probe compatibility

- Add parts[1].isdigit() check in hipconfig version parsing to handle versions like "6.3-HIP" where the minor component has non-numeric suffix (strip "-" prefix before int() conversion)
- Use getattr() in torch probe subprocess to safely handle old or custom torch builds that may lack torch.version.hip/cuda attributes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Strengthen AMD GPU detection and add NVIDIA precedence guard

- Change amd-smi list detection from any-non-empty-output to requiring "gpu" marker in output, matching the shell-side NR>1 check. Prevents false positives from header-only amd-smi list output.
- Add nvidia-smi check at the top of _ensure_rocm_torch() so mixed AMD+NVIDIA hosts preserve NVIDIA precedence (matching install.sh and install_llama_prebuilt.py behavior).
- Apply the same amd-smi marker fix to install_llama_prebuilt.py detect_host() for consistency.

* Add Windows-specific ROCm/HIP detection in detect_host()

The previous detect_host() ROCm check used rocminfo and amd-smi list which are Linux-only tools. On Windows, has_rocm would always be False, making the Windows HIP prebuilt path at line 1794 unreachable.

Now detect_host() uses platform-specific detection:
- Linux: rocminfo (check for gfx GPU names) or amd-smi list
- Windows: hipinfo.exe, amd-smi, or amdhip64.dll on PATH

This allows Windows AMD users to get the HIP prebuilt binary instead of silently falling through to the CPU prebuilt.

* Add AMD ROCm gaps: Mamba/SSM source builds, GPU monitoring, Windows messaging, RDNA expansion

- worker.py: Add HIP detection to causal-conv1d/mamba-ssm probe, check for hipcc before ROCm source builds, improve status messages and error reporting, add timeout and uv support for the source build fallback
- amd.py: New AMD GPU monitoring module via amd-smi metric --json, mirroring nvidia.py structure (utilization, temperature, power, VRAM)
- hardware.py: Branch to amd.py when IS_ROCM is True for GPU utilization, visible GPU queries, and physical GPU count
- install_python_stack.py: Detect AMD GPUs on Windows and warn that ROCm-enabled PyTorch must be installed manually
- kernels/utils.py: Expand is_rdna() to cover RDNA2 (gfx1030-1032), RDNA3 (gfx1102-1103), RDNA3.5 (gfx1150-1152) alongside existing entries
- tests: Add 32 new tests covering all changes (95/95 pass)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Harden ROCm detection, fix VRAM heuristic, and expand RDNA2 coverage

- Windows ROCm detection: validate actual GPU presence via hipinfo/amd-smi output markers instead of just checking tool existence on PATH
- _ensure_rocm_torch: validate nvidia-smi actually reports a GPU before giving NVIDIA precedence (fixes AMD-only hosts with stale NVIDIA tools)
- amd.py _parse_numeric: handle dict-shaped metric objects from newer amd-smi versions ({"value": 10, "unit": "W"}) and strip MiB/GiB units
- amd.py VRAM heuristic: raise threshold from 100k to 10M to correctly handle MI300X (192 GB = 196608 MB) and other high-VRAM GPUs
- amd.py visible GPU: use AMD-reported GPU IDs instead of enumerate index so non-dense sets like CUDA_VISIBLE_DEVICES=1,3 report correctly
- install.sh: add ROCm <6.0 minimum version guard (no PyTorch wheels exist for older versions); fix rocm7.1* glob to not match rocm7.10+
- is_rdna: add gfx1033-1036 for RDNA2 mobile GPUs (RX 6600M etc.)
- worker.py: increase ROCm source build timeout from 600s to 1800s; fix success log message for ROCm source builds
- Tests: update mocks for _has_usable_nvidia_gpu, add RDNA2 target asserts

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add HIP_VISIBLE_DEVICES support, unit-aware VRAM parsing, Windows GPU validation

- hardware.py: check HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES on ROCm before falling back to CUDA_VISIBLE_DEVICES, so multi-GPU AMD setups with HIP-specific env vars report the correct visible device set
- amd.py: add _parse_memory_mb() that reads "unit" from dict-shaped amd-smi JSON (e.g. {"value": 192, "unit": "GiB"}) and converts to MB correctly; fixes MI300X VRAM misreported as 0.19 GB instead of 192 GB
- install_python_stack.py: Windows AMD warning now validates actual GPU presence via hipinfo/amd-smi output markers before printing
- install_llama_prebuilt.py: restore amdhip64.dll fallback for Windows HIP detection after tool-based checks, so Windows HIP installs without CLI tools on PATH are still detected
- hardware.py: fix IS_ROCM comment to accurately describe its role

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix HIP_VISIBLE_DEVICES empty-string handling in GPU visibility spec

Use explicit None checks instead of Python `or` operator when reading HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES, so that an empty string ("") is correctly honored as "no visible GPUs" rather than silently falling through to CUDA_VISIBLE_DEVICES on mixed ROCm+CUDA systems.

* Fix IS_ROCM test assertion for multi-line formatting

* Cap torchvision/torchaudio versions, remove amdhip64.dll fallback, fix visible GPU count

- Cap torchvision<0.26.0 and torchaudio<2.11.0 alongside torch<2.11.0 in both install.sh and install_python_stack.py to prevent resolver from selecting incompatible companion packages from ROCm wheel index
- Remove amdhip64.dll fallback in Windows ROCm detection (DLL presence without hipinfo/amd-smi is not proof of GPU existence)
- Fix get_visible_gpu_count() to use _get_parent_visible_gpu_spec() which respects HIP_VISIBLE_DEVICES/ROCR_VISIBLE_DEVICES on ROCm hosts

* Attribute is_rdna() RDNA2/3/3.5/4 expansion to PR #4428

The is_rdna() expansion to cover RDNA2 (gfx1030-1036), RDNA3 (gfx1100-1103), RDNA3.5 (gfx1150-1152), and RDNA4 (gfx1200-1201) architectures is based on the original work from PR #4428.

Co-authored-by: GoldenGrapeGentleman <yueyuan@amd.com>
Co-authored-by: billishyahao <bill.he@amd.com>

* Support AMD Radeon for studio (#4770)

Co-authored-by: Iswarya Alex <iswarya.alex@amd.com>

* Remove ROCm test files from main PR

Move test_rocm_support.py and shell test additions to a separate PR to keep the main ROCm support PR focused on implementation changes.

* Fix installer and hardware detection issues for PR #4720

- Fix empty _tri_arg passed to uv pip install in Radeon path (causes "Empty field is not allowed for PEP508" error)
- Fix Radeon fallback: use ROCm index instead of CPU-only when repo.radeon.com is unreachable (TORCH_INDEX_URL already has ROCm)
- Use $TORCH_CONSTRAINT in fallback paths instead of hardcoded strings
- Fix _pick_radeon_wheel: relax suffix to match manylinux_2_28_x86_64 wheels (AMD Radeon repo does not use bare linux_x86_64 platform tag)
- Fix IS_ROCM export: use __getattr__ so callers always see the live value after detect_hardware() runs
- Fix apply_gpu_ids: set HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES on ROCm so _get_parent_visible_gpu_spec picks up narrowed GPU set
- Fix _parse_memory_mb: distinguish GB (1000 MB) from GiB (1024 MiB)
- Add amd-smi version as a fallback in _detect_rocm_version
- Fix trailing whitespace and missing newline at EOF in install.sh

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix GPU detection false positives and add missing health groups

- Fix _has_rocm_gpu() false positive: require "GPU: <number>" data rows from amd-smi list, not just header containing "gpu"
- Apply same fix in detect_host() in install_llama_prebuilt.py
- Add runtime_payload_health_groups for linux-rocm and windows-hip so partial/corrupt ROCm/HIP prebuilt installs are properly detected
- Add bitsandbytes install to Radeon fallback paths (was only in the success path, skipped when repo.radeon.com was unreachable)
- Keep DEVICE/CHAT_ONLY as direct imports in __init__.py (matching main) and only use __getattr__ for IS_ROCM

* Fix _ensure_rocm_torch and Windows AMD warning false positives

- _ensure_rocm_torch: only skip when HIP is already present, not for CUDA builds (which are unusable on AMD-only hosts). Fixes the case where a venv has a stale CUDA wheel and the repair step is skipped.
- Windows AMD warning: use GPU data row check (same as Linux fix) to avoid false positives from amd-smi list header-only output.

* Fix amd-smi GPU detection for GPU[N] output format

Older amd-smi versions output "GPU[0] : Card series: ..." instead of "GPU: 0". The regex now matches both "GPU: <digit>" and "GPU[<digit>" formats to detect actual GPU data rows.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Harden AMD GPU detection against false positives

- install.sh: replace weak amd-smi list check (awk 'NR>1 && NF') with strict pattern matching GPU data rows (/^GPU[[:space:]]*[:\[]/)
- All files: reject rocminfo gfx000 (CPU HSA agent) by requiring gfx[1-9] instead of gfx[0-9] in the rocminfo GPU probe
- Fixes false positives on hosts with ROCm tools but no AMD GPU

* Remove duplicate comment from pre-commit merge

* Refactor: deduplicate AMD detection, consolidate bitsandbytes, clean up imports

- Extract _has_amd_rocm_gpu() shell function to avoid duplicating the rocminfo/amd-smi GPU detection logic in get_torch_index_url and the Radeon auto-detect block
- Consolidate bitsandbytes install into a single case block after torch install (was duplicated 4 times across Radeon success/fallback paths)
- Move math and re imports to top of amd.py (were inline in functions)
- Add _smi_query() helper in hardware.py to centralize IS_ROCM backend selection for get_gpu_utilization and get_visible_gpu_utilization

Addresses Gemini code review suggestions.

* Fix VRAM parsing for string values and GB/GiB consistency

- Extract unit from string-valued VRAM fields (e.g. "192 GiB") so _parse_memory_mb correctly applies the unit multiplier instead of treating the value as bare MB
- Treat GB and GiB identically (both as binary x1024) since GPU tools including amd-smi use binary units even when labeling them "GB"
- Fixes incorrect VRAM reporting on MI300-class cards (was showing ~0.19 GB instead of 192 GB for string-valued outputs)

* Add --no-cache to uv for ROCm HIP source builds

Avoid stale cache artifacts from partial HIP source builds when uv is used for causal-conv1d/mamba-ssm compilation on ROCm. The pip path already uses --no-cache-dir; this adds the uv equivalent (--no-cache) only when is_hip is True.

* Fix critical: initialize _amd_gpu_radeon before case block

_amd_gpu_radeon was only set inside the */rocm*) case arm, so on NVIDIA/CPU/macOS paths where TORCH_INDEX_URL does not contain "rocm", the variable was unbound. With set -u (nounset) enabled, this crashes the installer for every non-AMD user. Move initialization to before the case block so it is always defined.

* Fix Windows AMD: route has_rocm hosts to HIP prebuilt path

resolve_release_asset_choice was selecting windows-cpu for all Windows x86_64 hosts including those with has_rocm=True. Windows AMD users should fall through to resolve_upstream_asset_choice which tries the HIP prebuilt first. Add "not host.has_rocm" guard to the published windows-cpu selection.

* Harden ROCm detection, Radeon wheel fallback, and HIP visibility

Addresses review findings from parallel reviewers on PR #4720:
- install.sh: add _has_usable_nvidia_gpu() helper requiring nvidia-smi -L to actually list a GPU before treating the host as NVIDIA. Fixes the stale-nvidia-smi-on-PATH regression where AMD-only hosts fell into the CUDA branch.
- install.sh: fix hipconfig awk blocks to propagate a non-zero exit code when the output is not a recognisable version string, so the ||-chain continues to dpkg-query / rpm instead of terminating early.
- install.sh: fail-closed on Radeon wheel fallback. When torch, torchvision or torchaudio is missing from the Radeon repo for the active Python tag, fall back to the standard ROCm index instead of silently mixing Radeon wheels with PyPI defaults. Quote all wheel arguments individually so wheel filenames cannot be word-split or glob-expanded.
- install_llama_prebuilt.py: detect_host() now requires nvidia-smi -L to list a GPU before setting has_physical_nvidia. Routes AMD ROCm hosts with a broken leftover nvidia-smi to the ROCm path instead of misclassifying them as NVIDIA.
- install_llama_prebuilt.py: scan upstream assets for any rocm-<version> prebuilt instead of hard-coding rocm-7.2, so ROCm 6.x / 7.0 / 7.1 / 7.3+ users pick up a matching upstream prebuilt when one exists.
- install_llama_prebuilt.py: validate_server() adds --n-gpu-layers 1 for linux-rocm and windows-hip hosts, so new HIP prebuilts are preflighted on the GPU path instead of passing validation on CPU only.
- install_llama_prebuilt.py: restore the published windows-cpu fallback for AMD Windows hosts without a HIP prebuilt so hash-approved bundles are still preferred over the raw upstream CPU asset.
- install_python_stack.py: drop the /opt/rocm / hipcc gate in _ensure_rocm_torch() and rely on _has_rocm_gpu(). Runtime-only ROCm installs (package-managed minimal installs, Radeon software) that ship amd-smi / rocminfo without hipcc can now repair a CPU-only venv via "unsloth studio update". Adds an explicit IS_WINDOWS / IS_MACOS guard.
- studio/backend/utils/hardware/amd.py: honour HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES in get_primary_gpu_utilization(). A process restricted to GPU 2 now reports metrics for GPU 2 instead of physical GPU 0. Tighten the plain bytes unit detection to an explicit allowlist.
- studio/backend/utils/hardware/hardware.py: route get_backend_visible_gpu_info()'s backend_cuda_visible_devices field through a helper that reads HIP_VISIBLE_DEVICES on ROCm. Drop the unconditional "(rocm=False)" suffix in apply_gpu_ids() logs.

* Fix round 2 regressions: ROCm validate_server and Windows HIP routing

Follow-up to 810b833b addressing review findings on the first round of hardening commits:
- install_llama_prebuilt.py validate_server: gate --n-gpu-layers on the resolved install_kind instead of host.has_rocm. AMD Windows hosts without a HIP prebuilt fall back to windows-cpu and must not be validated with GPU layers; thread install_kind through from the caller.
- install_llama_prebuilt.py resolve_release_asset_choice: reinstate the "not has_rocm" guard on the published windows-cpu bundle so AMD Windows hosts reach resolve_upstream_asset_choice() where the new HIP prebuilt path lives. Prefer a published windows-hip bundle first when one exists, fall through to upstream HIP + upstream CPU otherwise.
- install_llama_prebuilt.py detect_host: also set has_physical_nvidia when the secondary --query-gpu block confirms a working NVIDIA GPU, so older nvidia-smi versions without -L support do not silently skip the Linux diagnostics that key off has_physical_nvidia.
- install_llama_prebuilt.py: drop redundant "import re as _re" / "import re as _re_rocm" local aliases in favour of the existing top-level "import re".
- install_python_stack.py _ensure_rocm_torch: run the AMD bitsandbytes install unconditionally after the HIP-torch probe so "unsloth studio update" on venvs that already have ROCm torch still gains the AMD bitsandbytes build.
- install.sh: add a non-x86_64 early-exit to get_torch_index_url() so aarch64 / arm64 Linux hosts do not hit the ROCm wheel index (PyTorch only publishes ROCm wheels for linux_x86_64).
- install.sh: add bitsandbytes install to the migrated-environment branch so upgrades pick it up for ROCm hosts instead of only the fresh-install path.
- install.sh: in the Radeon wheel path, pass version constraints + --no-index --find-links to uv instead of explicit wheel URLs so a version-compatible torch / torchvision / torchaudio triple is resolved, rather than picking the highest-version wheel for each package independently. - studio/backend/utils/hardware/amd.py _first_visible_amd_gpu_id: fall through to lower-priority visibility env vars when the first entry is malformed (leading comma, all-whitespace first token) instead of silently returning GPU 0. * Fix round 3 findings: x86_64 guard, ROCm version clip, Radeon deps Address issues surfaced by the round 3 reviewers on top of 8636fa63: - install_python_stack.py _ensure_rocm_torch: add the same `x86_64` guard that install.sh already has. Linux aarch64 / arm64 ROCm hosts must skip the repair path entirely; PyTorch only publishes ROCm wheels for linux_x86_64, and without this guard `unsloth studio update` aborts with a missing-wheel error on non x86_64 hosts. - install_llama_prebuilt.py resolve_upstream_asset_choice: add a best-effort _detect_host_rocm_version() helper (reading /opt/rocm/.info/version, amd-smi version, hipconfig --version) and filter rocm_candidates to entries whose major.minor is <= host version. Falls back to the newest candidate only when no compatible one exists, so a ROCm 6.4 host downloads rocm-6.4 instead of being handed the numerically newest rocm-7.2 bundle (which fails preflight and forces a source build). - install.sh: remove the round 2 --no-index switch from the Radeon wheel branch. --no-index forced uv to ignore PyPI entirely, which broke transitive dependency resolution (filelock, sympy, networkx, jinja2, fsspec, setuptools, typing-extensions, ...) on a fresh venv. Restore the round 1 explicit wheel URL invocation but add a torch / torchvision / torchaudio version-pair sanity check so a mismatched trio (e.g. 
torch 2.9.1 + torchvision 0.23.0 + torchaudio 2.9.0) falls back to the standard ROCm index instead of installing a broken combination. - install_python_stack.py _ensure_rocm_torch: restructure the "tag is None" path so it no longer short-circuits the bitsandbytes install. On a ROCm runtime older than anything in _ROCM_TORCH_INDEX, print the "no wheel" warning but still run the AMD bitsandbytes install. - studio/backend/core/training/worker.py: restore the pre-PR "no timeout" behaviour for non-HIP causal-conv1d / mamba-ssm source builds. The round 2 "timeout = 1800 if is_hip else 300" cap aborts slow non-HIP builds (Linux aarch64, unsupported torch/CUDA combos) after 5 minutes; omit timeout for the non-HIP branch so the cap only applies to ROCm source builds. * Fix round 4 findings: apply_gpu_ids env inheritance, Radeon X.Y, bitsandbytes gate Address remaining issues surfaced by the round 4 reviewers: - studio/backend/utils/hardware/hardware.py apply_gpu_ids: mirror the selection into HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES whenever the caller already had a ROCm visibility env var set, not only when IS_ROCM has already been set by detect_hardware(). Training and inference workers call apply_gpu_ids() before detect_hardware() runs, so the old guard would leave a forked ROCm worker with a stale HIP_VISIBLE_DEVICES mask that no longer matched the narrowed CUDA_VISIBLE_DEVICES selection. - install.sh get_radeon_wheel_url: accept X.Y ROCm versions in addition to X.Y.Z. The `/opt/rocm/.info/version` file and some hipconfig versions report only two components, and the Radeon repository publishes both rocm-rel-X.Y.Z/ and rocm-rel-X.Y/ directories, so treating X.Y as invalid caused Radeon hosts to fall back to the generic ROCm index even when a matching AMD wheel set existed. - install_python_stack.py _ensure_rocm_torch: only install the AMD bitsandbytes build when the venv actually has a ROCm-compatible torch (either already present or just installed by this function). 
Previously the bitsandbytes install ran unconditionally, which could leave an AMD bitsandbytes layered on top of a CPU/CUDA torch on hosts where the ROCm runtime is older than any entry in _ROCM_TORCH_INDEX. Also add --force-reinstall so an existing CPU/CUDA bitsandbytes is replaced by the AMD build during upgrades. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix gemini findings: amd-smi metric envelope validation and dict-wrapped GPU id Two medium-severity defensive fixes from the gemini-code-assist review on the AMD monitoring backend: 1. _extract_gpu_metrics may return a dict where every value is None when amd-smi succeeds (zero exit) but the JSON envelope contains no usable fields (error response, unsupported card). The new _has_real_metrics helper lets get_primary_gpu_utilization surface available:False and lets get_visible_gpu_utilization skip ghost device rows so the UI does not render placeholder cards with empty numbers. 2. Newer amd-smi versions wrap scalar fields as {"value": 0, "unit": "none"}, including the per-GPU id. The previous int(raw_id) call silently fell back to the enumeration index in that case, losing the real GPU id. Routing raw_id through the existing _parse_numeric helper handles bare ints, floats, strings, and the dict shape uniformly, with a debug log on parse failure. * Fix gemini round 2 findings: explicit length guard on ROCm version file parser Both _detect_rocm_version (install_python_stack.py) and _detect_host_rocm_version (install_llama_prebuilt.py) read /opt/rocm/.info/version or $ROCM_PATH/lib/rocm_version, split on "." and unconditionally accessed parts[1]. The surrounding broad `except Exception: pass` already swallowed the resulting IndexError, so a one-component file like "6\n" did fall through to the next detection source -- but the control flow relied on exception handling instead of an explicit check. 
Add `if len(parts) >= 2:` guards in both helpers so the loop falls through on its own without raising. Behaviour is unchanged for the common multi- component case; the previously-silent IndexError path becomes an explicit no-op. * Fix gemini round 3: include has_rocm in validate_server fallback path When validate_server is called without an explicit install_kind (older call sites that have not been updated), the fallback was only enabling --n-gpu-layers for NVIDIA and macOS arm64 hosts. AMD ROCm Linux hosts fell through to the CPU validation path even though the prebuilt being exercised was a HIP binary. Add host.has_rocm to the fallback expression so the GPU offload flag is applied consistently with the install_kind=='linux-rocm' / 'windows-hip' branches above. * Fix gemini round 4: remove risky bytes-vs-MB heuristic in _parse_memory_mb The previous heuristic divided any bare number above 10_000_000 by 1024*1024 on the assumption that large unit-less values were bytes. This misclassified small VRAM allocations: 5 MB of used VRAM reported as 5_242_880 bytes without a unit would be taken at face value and render as 5_242_880 MB (~5 TB) in the monitoring UI. Modern amd-smi always provides explicit units (MiB/GiB dict form), and legacy amd-smi returns bare numbers in MB -- the heuristic never had a real workload to handle. Drop it and default to MB for bare numeric input, keeping the existing unit-aware branches for dict / string inputs unchanged. The unrelated gemini suggestion to "default minor to 0" in the amd-smi version awk parser was intentionally NOT applied: rocm7.0 and rocm7.1 ship different wheel sets, so silently substituting 0 for a missing minor could install the wrong wheels. The existing reject-and-fall-through behaviour is safer. * Fix gemini round 5: POSIX compliance and leading-comma visibility parsing Three medium findings from gemini-code-assist addressed in this commit: 1. 
_pick_radeon_wheel used grep -o and sort -V, both GNU extensions that are not in POSIX and break on BSD/BusyBox coreutils. install.sh has a #!/bin/sh shebang so the whole pipeline was rewritten as a single awk script that extracts all href="..." hits on each line, filters to wheels matching the package prefix and python tag, and picks the newest version via zero-padded lexical comparison. No external sort or grep is needed. 2. _first_visible_amd_gpu_id in the AMD monitoring backend treated a leading comma (e.g. HIP_VISIBLE_DEVICES=",1") as "fall through to the next env var", which is surprising given the clear intent to narrow to device 1. Filter empty tokens after the split and return the first real one. An all-commas value ("," / ",,,") still falls through because no real tokens exist; the empty-string and "-1" explicit-zero cases are unchanged. The unrelated amd-smi version awk parser suggestion was not applied (see round 4 commit message for rationale: defaulting a missing minor to 0 could silently install the wrong ROCm wheel set). * Fix 20-reviewer.py findings: base drift, Radeon %2B, dpkg/rpm fallback, bnb, backend label Consolidated fix batch from a 20-parallel reviewer.py run on the current head. Each fix is drawn from a high-consensus finding and addresses a real bug or feature gap, not a stylistic preference. 1. install.sh: bump `unsloth>=2026.4.2` -> `unsloth>=2026.4.4` at five call sites so this branch no longer regresses main's version floor (main bumped to 2026.4.4 in #4876). Without this, merging 4720 would silently downgrade the minimum version pin for fresh installs. 2. install.sh: URL-decode Radeon wheel names before extracting the torch / torchvision / torchaudio version strings. 
Real wheel URLs from repo.radeon.com are percent-encoded ("torch-2.10.0%2Brocm7.2.0...") so the previous `[+-]` terminator in the sed regex never matched, `_torch_ver` stayed empty, `_radeon_versions_match` stayed false, and every Radeon consumer install silently fell back to the generic ROCm index. Now decode %2B -> + first, then extract, then validate. 3. install.sh: the two AMD bitsandbytes install lines were running `uv pip install "bitsandbytes>=0.49.1"` without `--force-reinstall`, so upgrades where the venv already has a CPU/CUDA bitsandbytes satisfying the constraint would keep the stale non-AMD wheel. Add `--force-reinstall --no-cache-dir` to both call sites, matching the pattern already used in install_python_stack.py::_ensure_rocm_torch. 4. install_python_stack.py and install_llama_prebuilt.py: add `dpkg-query -W rocm-core` and `rpm -q rocm-core` fallbacks to the Python-side ROCm version detectors so they match the chain in install.sh::get_torch_index_url. Package-managed ROCm installs (Debian/Ubuntu/RHEL/Fedora distro packages) can expose GPUs via rocminfo/amd-smi but still lack /opt/rocm/.info/version, hipconfig, or amd-smi `version` output -- without these fallbacks, `unsloth studio update` on such hosts returned None and skipped the ROCm torch repair. Also strip the dpkg epoch prefix ("1:6.3.0-1") before parsing so epoch-annotated packages parse correctly. 5. hardware.py: add a `_backend_label(device)` helper that returns "rocm" when IS_ROCM is set and the device is DeviceType.CUDA, and use it for every `"backend": ...` emission in JSON responses served to the Studio frontend. Internally we still represent ROCm hosts as DeviceType.CUDA (ROCm torch reuses the whole torch.cuda.* API surface), but the user-facing API now correctly reports "rocm" on AMD boxes instead of labeling them as "cuda". 
All 250 simulation scenarios pass (was 233 before this batch: added 17 new regression tests covering the version pin, %2B decoding, bnb force-reinstall flags, dpkg/rpm fallback presence, and the _backend_label helper's four-way truth table). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix gemini round 6 + URL audit: amd.py defensive checks, rocm6.5+ clip to 6.4 Two rounds of fixes in one commit, plus a full URL audit of every PyPI / download.pytorch.org / repo.radeon.com reference the PR introduces. amd.py (4 medium gemini findings on commit b3627bc2): 1. _extract_gpu_metrics used `and vram_total_mb` as part of the vram_util gate. The follow-up `vram_total_mb > 0` already handles the division guard, but the truthiness check was redundant and slightly surprising for a 0.0 valid value. Replace with explicit `is not None and > 0` for both vram_util and power_util. 2. get_physical_gpu_count called `data.get("gpu", ...)` without guarding for non-dict envelopes. A scalar / string JSON response from amd-smi would raise AttributeError. Add an isinstance(data, dict) check and return None for unexpected shapes. 3. get_visible_gpu_utilization had the same .get() exposure on the outer envelope. Rewrite the gpu_list extraction as an explicit list/dict/else cascade so a malformed scalar envelope produces gpu_list=[data] and continues without raising. 4. The same function's per-entry loop also called gpu_data.get() on whatever was inside gpu_list. If a scalar ever leaks into the list (directly or via the previous fix's fallback), _extract_gpu_metrics would raise on the first .get() inside the helper. Skip non-dict entries in the loop before extracting metrics. install.sh (URL audit finding, previously flagged by 20-reviewer as #13): 5. 
get_torch_index_url used `rocm6.*` in the rocm tag case statement, which matched rocm6.5 and rocm6.6 and emitted download.pytorch.org/whl/rocm6.5 -- which returns HTTP 403 because PyTorch only publishes rocm 5.7, 6.0-6.4, 7.0-7.2. Enumerate the supported 6.x minors explicitly and add a rocm6.* fallback branch that clips to rocm6.4 (the last supported 6.x wheel set). URL audit results (all URLs PR 4720 references): - 14/14 download.pytorch.org/whl/{cpu,cu118,cu124,cu126,cu128,cu130, rocm6.0..6.4,rocm7.0..7.2} return HTTP 200. - 9/9 repo.radeon.com/rocm/manylinux/rocm-rel-{5.7,6.0,6.1,6.2,6.3, 6.4,7.0,7.1,7.2}/ return HTTP 200. - X.Y.Z patch directories exist for 7.0.2, 7.1.1, 7.2.1 but NOT for 6.3.0, 6.4.0, 6.2.1 -- install.sh already handles this via the X.Y.Z -> X.Y fallback sed in the Radeon wheel install block. - Docs links (rocm.docs.amd.com, docs.unsloth.ai AMD guide) and the llama.cpp GitHub releases API endpoint all return 200. Test suite: 255 -> 258. New regression coverage: - U17: get_physical_gpu_count tolerates scalar amd-smi envelope - U18: get_visible_gpu_utilization tolerates scalar envelope - U19a-c: vram_util / power_util return None on zero total, but vram_total_gb still echoes 0.0 (not None) - A_rocm{6.5,6.6,6.9}_clips_to_rocm64: install.sh clips unsupported 6.x minors to rocm6.4 instead of producing a 403 index URL * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix reviewer.py round 2: tokenizer AMD multi-GPU, --no-torch bnb, main.py backend label Three high-confidence findings from a second 20-parallel reviewer.py run on commit 7effb3ae. Triaged 15 total findings and applied the three that were confirmed as real bugs; the rest were either false positives (e.g. "migrated AMD venv not repaired" -- _ensure_rocm_torch runs downstream via setup.sh regardless), design decisions (e.g. 
visibility mask env vars not consulted in installer detection), or edge cases the existing fallback logic already handles. 1. unsloth/tokenizer_utils.py [6/20]: the multi-GPU guard's shell probe runs `nvidia-smi --query-gpu=memory.used`, catches the failure, then only raises if `torch.cuda.is_available()` is False. On ROCm torch, torch.cuda.is_available() returns True (ROCm reuses the torch.cuda.* API), so the guard becomes dead code on AMD hosts and multi-GPU AMD setups slip through even though unsloth does not support them yet. Add a torch.cuda.device_count() > 1 fallback inside the except so AMD multi-visible-device setups are flagged consistently with the original CUDA memory check. 2. install.sh [1/20]: the fresh-install bitsandbytes block for AMD ROCm ran unconditionally when TORCH_INDEX_URL matched `*/rocm*`, even when SKIP_TORCH=true (from --no-torch or Intel Mac auto-detect). A user running `install.sh --no-torch` on an AMD host would still pull in bitsandbytes despite explicitly asking for GGUF-only mode. Wrap the case block in an outer `[ "$SKIP_TORCH" = false ]` guard. 3. studio/backend/main.py [3/20]: the /api/system endpoint returned `"device_backend": get_device().value`, which is "cuda" on ROCm hosts (because ROCm torch piggybacks on torch.cuda). Other endpoints (hardware.py) already use the _backend_label helper which swaps "cuda" -> "rocm" when IS_ROCM. Route /api/system through the same helper so the Studio UI reports the backend consistently across all endpoints. 4. studio/backend/tests/test_utils.py: update test_backend_matches_device to call _backend_label(get_device()) instead of raw get_device().value so the test matches the new contract and still passes on CUDA hosts. Tests: 258 -> 261. 
New regression coverage: - X08 main.py /api/system uses _backend_label - X09 tokenizer multi-GPU guard has device_count() fallback - X10 fresh-install bnb case block gated on SKIP_TORCH=false * fix: prevent bitsandbytes from overwriting ROCm torch with CUDA wheels During install, bitsandbytes was installed without --no-deps, causing uv to resolve torch from PyPI (CUDA build) and silently overwrite the ROCm wheels that were just installed in the previous step. This happened in three places: - install.sh: bitsandbytes install in both migrated and fresh paths - install_python_stack.py: bitsandbytes install inside _ensure_rocm_torch() Additionally, multiple install steps in install_python_stack.py (extras, overrides, studio deps) can pull in CUDA torch via transitive dependencies. A final _ensure_rocm_torch() call at the end of the install sequence ensures ROCm torch is always in place at runtime. All changes are gated behind ROCm-specific conditions and do not affect NVIDIA, CPU-only, macOS, or Windows install paths. Tested on AMD Instinct MI300X VF with ROCm 7.2.0 -- confirms torch==2.10.0+rocm7.1 with HIP 7.1.25424 after install. * fix: ROCm inference fallback -- skip Unsloth patching and bnb 4-bit on HIP On AMD ROCm (HIP), two issues prevent the normal Unsloth inference path: 1. Unsloth's global monkey-patching of transformers model classes (LlamaRotaryEmbedding, attention modules) triggers _assert_async_cuda_kernel crashes on HIP during generation. Training uses different code paths and works fine. 2. bitsandbytes 4-bit matmul kernels also trigger HIP assertion failures on MI300X (CDNA3 / gfx942), even without Unsloth patching. This commit adds a ROCm-specific inference fallback that: - Skips importing Unsloth at module level (prevents global patching) - Loads models in 16-bit with plain transformers + PEFT instead - Resolves pre-quantized model names (e.g. 
"xxx-bnb-4bit" -> "xxx") since pre-quantized HF repos still trigger bnb codepaths - Guards get_chat_template calls (unavailable without Unsloth import) - Fixes max_seq_length=0 being passed to from_pretrained (GGUF semantics don't apply to transformers path) The NVIDIA path is completely unchanged -- Unsloth import and for_inference() optimization remain active. GGUF inference (via llama-server/HIP) is unaffected since it never imports Python model classes. AMD GPUs typically have large VRAM (e.g. 192GB on MI300X) so 16-bit loading is practical for inference. Tested on AMD Instinct MI300X VF (ROCm 7.2, HIP 7.1.25424): - Simple generation: PASS - Compare mode (base vs finetuned): PASS - GGUF inference + tool calling: PASS (unaffected by this change) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: guard audio/vision inference on ROCm, remove unused import - Add clear RuntimeError for audio/vision model inference on ROCm (these paths use Unsloth's FastModel/FastVisionModel which would crash on HIP; GGUF inference is the supported path on AMD) - Remove unused `import os as _os` from the ROCm changes * fix: amd-smi parsing for newer output format (gpu_data wrapper, mem_usage, temperature) amd-smi on recent ROCm versions (7.x) wraps metric output in a {"gpu_data": [...]} envelope instead of returning a raw list. This caused get_primary_gpu_utilization() and get_visible_gpu_utilization() to fail silently (returning available=False) because the GPU data dict was never unwrapped. Additionally: - VRAM data moved from "vram" to "mem_usage" with "total_vram" / "used_vram" keys. Added fallback key lookup. - Temperature "edge" sensor returns "N/A" on MI300X VF; the previous dict.get() chain returned the "N/A" string instead of falling through to "hotspot". Changed to a loop that checks each key until a parseable value is found. 
Tested on AMD Instinct MI300X VF (ROCm 7.2, amd-smi 24.x): - GPU utilization: 0% (idle), up to 100% during training - Temperature: 40-44C (from hotspot sensor) - VRAM: 0.28/191.69 GB (idle) - Power: 158-211W draw * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Bug fix detecting radeon (#4940) * Bug fix detecting radeon * Expanding GPU target for gfx1100* * Generalize gfx family-prefix filter to cover gfx10/gfx12 as well rocminfo on ROCm 6.1+ emits LLVM generic-family ISA lines alongside the specific GPU (e.g. gfx11-generic next to gfx1100). The outer grep captures the bare family prefix from the generic line, and passing that to -DGPU_TARGETS breaks the HIP build because clang only accepts specific gfxNNN ids. The previous filter only special-cased gfx11. Generalize it so any bare 2-digit family prefix (gfx10, gfx11, gfx12, ...) is dropped whenever a specific sibling target is present in the same list. No real AMD GPU has a 2-digit gfx id, so the filter can only ever drop family prefixes and never a real target. Covers the existing gfx11 cases unchanged, and extends the same fix to gfx10-1-generic / gfx10-3-generic (RDNA1/2) and gfx12-generic (RDNA4), which would otherwise hit the same build failure on newer rocminfo. --------- Co-authored-by: Iswarya Alex <iswarya.alex@amd.com> Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com> --------- Co-authored-by: Eda Z <eda.zhou@amd.com> Co-authored-by: GoldenGrapeGentleman <yueyuan@amd.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: billishyahao <bill.he@amd.com> Co-authored-by: Iswarya Alex <47045679+iswaryaalex@users.noreply.github.com> Co-authored-by: Iswarya Alex <iswarya.alex@amd.com> Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
2026-04-10 08:56:12 +00:00
# ── ROCm / AMD GPU support ─────────────────────────────────────────────────────
# Maps a detected ROCm (major, minor) version to the best-matching PyTorch
# wheel tag on download.pytorch.org. Keys are checked newest-first, and the
# first key that is <= the detected version wins.
# For ROCm 7.2, download.pytorch.org only publishes torch 2.11.0, which exceeds
# the current torch upper bound (<2.11.0), so ROCm 7.2 falls back to rocm7.1
# (torch 2.10.0).
# TODO: uncomment the rocm7.2 entry once the torch upper bound allows 2.11.0.
_ROCM_TORCH_INDEX: dict[tuple[int, int], str] = {
    # (7, 2): "rocm7.2",  # torch 2.11.0 -- requires torch>=2.11
    (7, 1): "rocm7.1",
    (7, 0): "rocm7.0",
    (6, 4): "rocm6.4",
    (6, 3): "rocm6.3",
    (6, 2): "rocm6.2",
    (6, 1): "rocm6.1",
    (6, 0): "rocm6.0",
}
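A minimal sketch of a lookup that follows the newest-first (>=) rule described above: keys are sorted in descending order and the first one that is not newer than the detected ROCm version wins. The helper name `pick_rocm_tag` is hypothetical, and the table is copied so the snippet runs on its own; the installer's actual lookup code is not part of this excerpt.

```python
from typing import Optional

# Copy of the mapping above so this sketch is self-contained.
ROCM_TORCH_INDEX: dict[tuple[int, int], str] = {
    (7, 1): "rocm7.1",
    (7, 0): "rocm7.0",
    (6, 4): "rocm6.4",
    (6, 3): "rocm6.3",
    (6, 2): "rocm6.2",
    (6, 1): "rocm6.1",
    (6, 0): "rocm6.0",
}


def pick_rocm_tag(detected: tuple[int, int]) -> Optional[str]:
    """Hypothetical helper: newest wheel tag whose key is <= the detected version."""
    # Sort descending so the newest supported release is tried first,
    # independent of dict insertion order.
    for key in sorted(ROCM_TORCH_INDEX, reverse=True):
        # Tuple comparison is lexicographic: major first, then minor.
        if detected >= key:
            return ROCM_TORCH_INDEX[key]
    # Detected ROCm is older than every entry (e.g. 5.x): no wheel index.
    return None
```

With this rule a ROCm 7.2 host resolves to `rocm7.1`, an unlisted 6.5 resolves to `rocm6.4`, and anything older than 6.0 yields `None`, which a caller can treat as "no matching ROCm wheel".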
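# Base URL for the PyTorch wheel indexes. Set UNSLOTH_PYTORCH_MIRROR to use a
# mirror; an empty value falls back to the default, and any trailing "/" is
# stripped.
_PYTORCH_WHL_BASE = (
    os.environ.get("UNSLOTH_PYTORCH_MIRROR") or "https://download.pytorch.org/whl"
).rstrip("/")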
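The `_PYTORCH_WHL_BASE` expression can be exercised in isolation with the sketch below. `pytorch_whl_base` and `rocm_index_url` are hypothetical helpers that take an explicit environment mapping for testability, and joining the base and the wheel tag with a single `/` is an assumption about how the final index URL is formed.

```python
import os
from typing import Mapping, Optional

DEFAULT_WHL_BASE = "https://download.pytorch.org/whl"


def pytorch_whl_base(env: Optional[Mapping[str, str]] = None) -> str:
    """Reproduce the _PYTORCH_WHL_BASE expression for a given environment."""
    env = os.environ if env is None else env
    # `or` (rather than a .get() default) also treats an empty
    # UNSLOTH_PYTORCH_MIRROR as unset.
    return (env.get("UNSLOTH_PYTORCH_MIRROR") or DEFAULT_WHL_BASE).rstrip("/")


def rocm_index_url(tag: str, env: Optional[Mapping[str, str]] = None) -> str:
    # Hypothetical: base and tag joined with one slash, e.g. ".../whl/rocm7.1".
    return f"{pytorch_whl_base(env)}/{tag}"
```

Using `or` means an exported-but-empty `UNSLOTH_PYTORCH_MIRROR` still falls back to the official index, and `rstrip("/")` keeps a mirror URL with a trailing slash from producing `//` once a tag is appended.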
Add AMD ROCm/HIP support across installer and hardware detection (#4720) * Add ROCm detection to install.sh and expand shell tests Add AMD ROCm GPU detection to get_torch_index_url() in install.sh. When nvidia-smi is not found, probe for ROCm via amd-smi, /opt/rocm version file, hipconfig, dpkg-query, and rpm. Includes validation guard for malformed _rocm_tag, Debian epoch prefix stripping, ROCm 7.2+ cap to rocm7.1 index, bitsandbytes AMD install, and status messaging. Shell tests expanded to 23 cases. Co-authored-by: Daniel Han <danielhanchen@gmail.com> * Add ROCm torch reinstall support to install_python_stack.py Add _detect_rocm_version() and _ensure_rocm_torch() to detect when a Linux host has ROCm but the venv received CPU-only torch, and reinstall with the correct ROCm wheels. Covers ROCm 6.0 through 7.1 with a 30-second timeout on the torch GPU probe subprocess. Co-authored-by: Daniel Han <danielhanchen@gmail.com> * Add ROCm support to llama.cpp prebuilt installer Add has_rocm field to HostInfo, extend detect_host() to probe for ROCm via hipcc/amd-smi/rocm-smi/ROCM_PATH, and route ROCm hosts to upstream prebuilts (Linux ROCm 7.2 prebuilt with source fallback, Windows HIP prebuilt with CPU fallback). Add linux-rocm and windows-hip install kinds to runtime_patterns_for_choice(). Co-authored-by: Daniel Han <danielhanchen@gmail.com> * Add IS_ROCM hardware flag and fix AMD error message Add IS_ROCM flag to hardware.py detect_hardware() (set when torch.version.hip is present, DeviceType stays CUDA). Export IS_ROCM from __init__.py. Add "rocm" key to get_package_versions(). Replace "We do not support AMD" error in tokenizer_utils.py with a helpful message pointing to ROCm installation docs. 
Co-authored-by: Daniel Han <danielhanchen@gmail.com> * Add comprehensive ROCm support test suite (68 tests) Add tests/studio/install/test_rocm_support.py covering all ROCm code paths across install_llama_prebuilt.py, install_python_stack.py, hardware.py, tokenizer_utils.py, and install.sh. All tests use mocks and run without AMD hardware. Covers: asset selection (11), runtime patterns (5), HostInfo (4), ROCm version detection (9), torch reinstall (9), index mapping (8), hardware flag (8), tokenizer message (2), install.sh structure (10), and live regression (1). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Harden ROCm support: probe error handling, version cap, validation Address review findings from 8 independent reviewers: - Wrap _ensure_rocm_torch() torch probe in try/except for TimeoutExpired and OSError so a hung or broken torch import does not crash the installer (8/8 reviewers flagged this) - Add torch>=2.4,<2.11.0 version cap to the ROCm reinstall path to prevent installing unsupported torch 2.11.0 from the rocm7.1 index - Use with-statement for file reads in _detect_rocm_version() to avoid resource leaks - Handle ROCM_PATH="" correctly (use `or "/opt/rocm"` instead of default parameter to avoid relative path resolution) - Strengthen shell validation guard from rocm[0-9] to rocm[1-9] to reject rocm0.x tags that would produce nonexistent PyTorch index URLs - Switch shell version cap from blocklist to allowlist (rocm6.*|rocm7.0* |rocm7.1* pass through, everything else caps to rocm7.1) so future ROCm 10+ does not fall through to a nonexistent index - Add sorted() to _ROCM_TORCH_INDEX lookup for defensive ordering - Fix test_probe_timeout_handled: replace zero-assertion test with proper assertions verifying reinstall proceeds after timeout * Clean up rocm_paths list construction in detect_host() Filter None from the ROCM_PATH env var lookup at list construction time instead of relying on the inline `if 
p` guard in the any() call. * Require actual AMD GPU presence before selecting ROCm paths All 8 reviewers across 2 cycles independently flagged that ROCm detection used toolkit/filesystem hints (hipcc, /opt/rocm, rocm-core) as a proxy for GPU presence, which would misroute CPU-only or NVIDIA hosts that happen to have ROCm tools installed. Now all 3 detection points (install.sh, install_python_stack.py, install_llama_prebuilt.py) probe for an actual AMD GPU before entering the ROCm path: - install.sh: check rocminfo for gfx* GPU names, or amd-smi list for device rows, before version detection - install_python_stack.py: new _has_rocm_gpu() function probes rocminfo and amd-smi list before _ensure_rocm_torch() proceeds - install_llama_prebuilt.py: detect_host() probes rocminfo/amd-smi list instead of just checking tool existence or directory paths Also: - Shell test mock amd-smi now handles "list" subcommand - Python tests updated to mock _has_rocm_gpu where needed - Added test_no_gpu_with_rocm_tools_skips to verify the new guard - Test index lookups now use sorted() to match production code * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Harden hipconfig version parsing and torch probe compatibility - Add parts[1].isdigit() check in hipconfig version parsing to handle versions like "6.3-HIP" where the minor component has non-numeric suffix (strip "-" prefix before int() conversion) - Use getattr() in torch probe subprocess to safely handle old or custom torch builds that may lack torch.version.hip/cuda attributes * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Strengthen AMD GPU detection and add NVIDIA precedence guard - Change amd-smi list detection from any-non-empty-output to requiring "gpu" marker in output, matching the shell-side NR>1 check. Prevents false positives from header-only amd-smi list output. 
- Add nvidia-smi check at the top of _ensure_rocm_torch() so mixed AMD+NVIDIA hosts preserve NVIDIA precedence (matching install.sh and install_llama_prebuilt.py behavior).
- Apply the same amd-smi marker fix to install_llama_prebuilt.py detect_host() for consistency.

* Add Windows-specific ROCm/HIP detection in detect_host()

The previous detect_host() ROCm check used rocminfo and amd-smi list which are Linux-only tools. On Windows, has_rocm would always be False, making the Windows HIP prebuilt path at line 1794 unreachable.

Now detect_host() uses platform-specific detection:
- Linux: rocminfo (check for gfx GPU names) or amd-smi list
- Windows: hipinfo.exe, amd-smi, or amdhip64.dll on PATH

This allows Windows AMD users to get the HIP prebuilt binary instead of silently falling through to the CPU prebuilt.

* Add AMD ROCm gaps: Mamba/SSM source builds, GPU monitoring, Windows messaging, RDNA expansion

- worker.py: Add HIP detection to causal-conv1d/mamba-ssm probe, check for hipcc before ROCm source builds, improve status messages and error reporting, add timeout and uv support for the source build fallback
- amd.py: New AMD GPU monitoring module via amd-smi metric --json, mirroring nvidia.py structure (utilization, temperature, power, VRAM)
- hardware.py: Branch to amd.py when IS_ROCM is True for GPU utilization, visible GPU queries, and physical GPU count
- install_python_stack.py: Detect AMD GPUs on Windows and warn that ROCm-enabled PyTorch must be installed manually
- kernels/utils.py: Expand is_rdna() to cover RDNA2 (gfx1030-1032), RDNA3 (gfx1102-1103), RDNA3.5 (gfx1150-1152) alongside existing entries
- tests: Add 32 new tests covering all changes (95/95 pass)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Harden ROCm detection, fix VRAM heuristic, and expand RDNA2 coverage

- Windows ROCm detection: validate actual GPU presence via hipinfo/amd-smi output markers instead of just checking tool existence on PATH
- _ensure_rocm_torch: validate nvidia-smi actually reports a GPU before giving NVIDIA precedence (fixes AMD-only hosts with stale NVIDIA tools)
- amd.py _parse_numeric: handle dict-shaped metric objects from newer amd-smi versions ({"value": 10, "unit": "W"}) and strip MiB/GiB units
- amd.py VRAM heuristic: raise threshold from 100k to 10M to correctly handle MI300X (192 GB = 196608 MB) and other high-VRAM GPUs
- amd.py visible GPU: use AMD-reported GPU IDs instead of enumerate index so non-dense sets like CUDA_VISIBLE_DEVICES=1,3 report correctly
- install.sh: add ROCm <6.0 minimum version guard (no PyTorch wheels exist for older versions); fix rocm7.1* glob to not match rocm7.10+
- is_rdna: add gfx1033-1036 for RDNA2 mobile GPUs (RX 6600M etc.)
- worker.py: increase ROCm source build timeout from 600s to 1800s; fix success log message for ROCm source builds
- Tests: update mocks for _has_usable_nvidia_gpu, add RDNA2 target asserts

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add HIP_VISIBLE_DEVICES support, unit-aware VRAM parsing, Windows GPU validation

- hardware.py: check HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES on ROCm before falling back to CUDA_VISIBLE_DEVICES, so multi-GPU AMD setups with HIP-specific env vars report the correct visible device set
- amd.py: add _parse_memory_mb() that reads "unit" from dict-shaped amd-smi JSON (e.g. {"value": 192, "unit": "GiB"}) and converts to MB correctly; fixes MI300X VRAM misreported as 0.19 GB instead of 192 GB
- install_python_stack.py: Windows AMD warning now validates actual GPU presence via hipinfo/amd-smi output markers before printing
- install_llama_prebuilt.py: restore amdhip64.dll fallback for Windows HIP detection after tool-based checks, so Windows HIP installs without CLI tools on PATH are still detected
- hardware.py: fix IS_ROCM comment to accurately describe its role

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix HIP_VISIBLE_DEVICES empty-string handling in GPU visibility spec

Use explicit None checks instead of the Python `or` operator when reading HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES, so that an empty string ("") is correctly honored as "no visible GPUs" rather than silently falling through to CUDA_VISIBLE_DEVICES on mixed ROCm+CUDA systems.

* Fix IS_ROCM test assertion for multi-line formatting

* Cap torchvision/torchaudio versions, remove amdhip64.dll fallback, fix visible GPU count

- Cap torchvision<0.26.0 and torchaudio<2.11.0 alongside torch<2.11.0 in both install.sh and install_python_stack.py to prevent the resolver from selecting incompatible companion packages from the ROCm wheel index
- Remove amdhip64.dll fallback in Windows ROCm detection (DLL presence without hipinfo/amd-smi is not proof of GPU existence)
- Fix get_visible_gpu_count() to use _get_parent_visible_gpu_spec() which respects HIP_VISIBLE_DEVICES/ROCR_VISIBLE_DEVICES on ROCm hosts

* Attribute is_rdna() RDNA2/3/3.5/4 expansion to PR #4428

The is_rdna() expansion to cover RDNA2 (gfx1030-1036), RDNA3 (gfx1100-1103), RDNA3.5 (gfx1150-1152), and RDNA4 (gfx1200-1201) architectures is based on the original work from PR #4428.
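The HIP_VISIBLE_DEVICES empty-string fix comes down to the difference between a truthiness test and an `is not None` test. The sketch below is a minimal, hypothetical illustration modeled loosely on the `_get_parent_visible_gpu_spec()` helper mentioned above. The real function's signature is not shown in this message, so this version takes the environment as a plain mapping plus an `is_rocm` flag (both assumptions).

```python
from typing import Mapping, Optional


def parent_visible_gpu_spec(env: Mapping[str, str], is_rocm: bool) -> Optional[str]:
    """Return the visibility mask that applies to this process, or None if none is set.

    Hypothetical helper: on ROCm hosts the HIP/ROCR variables are consulted
    before CUDA_VISIBLE_DEVICES, following the precedence described above.
    """
    names = ["CUDA_VISIBLE_DEVICES"]
    if is_rocm:
        names = ["HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES"] + names
    for name in names:
        value = env.get(name)
        # Compare against None, not truthiness: "" is a real setting that
        # means "no visible GPUs" and must not fall through to the next var.
        if value is not None:
            return value
    return None


def or_chain_spec(env: Mapping[str, str]) -> Optional[str]:
    # The buggy pattern: `or` treats "" like "unset" and keeps going.
    return (
        env.get("HIP_VISIBLE_DEVICES")
        or env.get("ROCR_VISIBLE_DEVICES")
        or env.get("CUDA_VISIBLE_DEVICES")
    )
```

With `{"HIP_VISIBLE_DEVICES": "", "CUDA_VISIBLE_DEVICES": "0,1"}`, the `or` chain returns `"0,1"` and exposes two GPUs, while the explicit check returns `""` and honors the empty mask.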
Co-authored-by: GoldenGrapeGentleman <yueyuan@amd.com>
Co-authored-by: billishyahao <bill.he@amd.com>

* Support AMD Radeon for studio (#4770)

Co-authored-by: Iswarya Alex <iswarya.alex@amd.com>

* Remove ROCm test files from main PR

Move test_rocm_support.py and shell test additions to a separate PR to keep the main ROCm support PR focused on implementation changes.

* Fix installer and hardware detection issues for PR #4720

- Fix empty _tri_arg passed to uv pip install in Radeon path (causes "Empty field is not allowed for PEP508" error)
- Fix Radeon fallback: use ROCm index instead of CPU-only when repo.radeon.com is unreachable (TORCH_INDEX_URL already has ROCm)
- Use $TORCH_CONSTRAINT in fallback paths instead of hardcoded strings
- Fix _pick_radeon_wheel: relax suffix to match manylinux_2_28_x86_64 wheels (the AMD Radeon repo does not use the bare linux_x86_64 platform tag)
- Fix IS_ROCM export: use __getattr__ so callers always see the live value after detect_hardware() runs
- Fix apply_gpu_ids: set HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES on ROCm so _get_parent_visible_gpu_spec picks up the narrowed GPU set
- Fix _parse_memory_mb: distinguish GB (1000 MB) from GiB (1024 MiB)
- Add amd-smi version as a fallback in _detect_rocm_version
- Fix trailing whitespace and missing newline at EOF in install.sh

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix GPU detection false positives and add missing health groups

- Fix _has_rocm_gpu() false positive: require "GPU: <number>" data rows from amd-smi list, not just a header containing "gpu"
- Apply the same fix in detect_host() in install_llama_prebuilt.py
- Add runtime_payload_health_groups for linux-rocm and windows-hip so partial/corrupt ROCm/HIP prebuilt installs are properly detected
- Add bitsandbytes install to Radeon fallback paths (was only in the success path, skipped when repo.radeon.com was unreachable)
- Keep DEVICE/CHAT_ONLY as direct imports in __init__.py (matching main) and only use __getattr__ for IS_ROCM

* Fix _ensure_rocm_torch and Windows AMD warning false positives

- _ensure_rocm_torch: only skip when HIP is already present, not for CUDA builds (which are unusable on AMD-only hosts). Fixes the case where a venv has a stale CUDA wheel and the repair step is skipped.
- Windows AMD warning: use GPU data row check (same as Linux fix) to avoid false positives from amd-smi list header-only output.

* Fix amd-smi GPU detection for GPU[N] output format

Older amd-smi versions output "GPU[0] : Card series: ..." instead of "GPU: 0". The regex now matches both "GPU: <digit>" and "GPU[<digit>" formats to detect actual GPU data rows.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Harden AMD GPU detection against false positives

- install.sh: replace weak amd-smi list check (awk 'NR>1 && NF') with strict pattern matching GPU data rows (/^GPU[[:space:]]*[:\[]/)
- All files: reject rocminfo gfx000 (CPU HSA agent) by requiring gfx[1-9] instead of gfx[0-9] in the rocminfo GPU probe
- Fixes false positives on hosts with ROCm tools but no AMD GPU

* Remove duplicate comment from pre-commit merge

* Refactor: deduplicate AMD detection, consolidate bitsandbytes, clean up imports

- Extract _has_amd_rocm_gpu() shell function to avoid duplicating the rocminfo/amd-smi GPU detection logic in get_torch_index_url and the Radeon auto-detect block
- Consolidate bitsandbytes install into a single case block after torch install (was duplicated 4 times across Radeon success/fallback paths)
- Move math and re imports to top of amd.py (were inline in functions)
- Add _smi_query() helper in hardware.py to centralize IS_ROCM backend selection for get_gpu_utilization and get_visible_gpu_utilization

Addresses Gemini code review suggestions.

* Fix VRAM parsing for string values and GB/GiB consistency

- Extract unit from string-valued VRAM fields (e.g. "192 GiB") so _parse_memory_mb correctly applies the unit multiplier instead of treating the value as bare MB
- Treat GB and GiB identically (both as binary x1024) since GPU tools including amd-smi use binary units even when labeling them "GB"
- Fixes incorrect VRAM reporting on MI300-class cards (was showing ~0.19 GB instead of 192 GB for string-valued outputs)

* Add --no-cache to uv for ROCm HIP source builds

Avoid stale cache artifacts from partial HIP source builds when uv is used for causal-conv1d/mamba-ssm compilation on ROCm. The pip path already uses --no-cache-dir; this adds the uv equivalent (--no-cache) only when is_hip is True.

* Fix critical: initialize _amd_gpu_radeon before case block

_amd_gpu_radeon was only set inside the */rocm*) case arm, so on NVIDIA/CPU/macOS paths where TORCH_INDEX_URL does not contain "rocm", the variable was unbound. With set -u (nounset) enabled, this crashes the installer for every non-AMD user. Move initialization to before the case block so it is always defined.

* Fix Windows AMD: route has_rocm hosts to HIP prebuilt path

resolve_release_asset_choice was selecting windows-cpu for all Windows x86_64 hosts including those with has_rocm=True. Windows AMD users should fall through to resolve_upstream_asset_choice which tries the HIP prebuilt first. Add "not host.has_rocm" guard to the published windows-cpu selection.

* Harden ROCm detection, Radeon wheel fallback, and HIP visibility

Addresses review findings from parallel reviewers on PR #4720:

- install.sh: add _has_usable_nvidia_gpu() helper requiring nvidia-smi -L to actually list a GPU before treating the host as NVIDIA. Fixes the stale-nvidia-smi-on-PATH regression where AMD-only hosts fell into the CUDA branch.
- install.sh: fix hipconfig awk blocks to propagate a non-zero exit code when the output is not a recognisable version string, so the ||-chain continues to dpkg-query / rpm instead of terminating early.
- install.sh: fail-closed on Radeon wheel fallback. When torch, torchvision or torchaudio is missing from the Radeon repo for the active Python tag, fall back to the standard ROCm index instead of silently mixing Radeon wheels with PyPI defaults. Quote all wheel arguments individually so wheel filenames cannot be word-split or glob-expanded.
- install_llama_prebuilt.py: detect_host() now requires nvidia-smi -L to list a GPU before setting has_physical_nvidia. Routes AMD ROCm hosts with a broken leftover nvidia-smi to the ROCm path instead of misclassifying them as NVIDIA.
- install_llama_prebuilt.py: scan upstream assets for any rocm-<version> prebuilt instead of hard-coding rocm-7.2, so ROCm 6.x / 7.0 / 7.1 / 7.3+ users pick up a matching upstream prebuilt when one exists.
- install_llama_prebuilt.py: validate_server() adds --n-gpu-layers 1 for linux-rocm and windows-hip hosts, so new HIP prebuilts are preflighted on the GPU path instead of passing validation on CPU only.
- install_llama_prebuilt.py: restore the published windows-cpu fallback for AMD Windows hosts without a HIP prebuilt so hash-approved bundles are still preferred over the raw upstream CPU asset.
- install_python_stack.py: drop the /opt/rocm / hipcc gate in _ensure_rocm_torch() and rely on _has_rocm_gpu(). Runtime-only ROCm installs (package-managed minimal installs, Radeon software) that ship amd-smi / rocminfo without hipcc can now repair a CPU-only venv via "unsloth studio update". Adds an explicit IS_WINDOWS / IS_MACOS guard.
- studio/backend/utils/hardware/amd.py: honour HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES in get_primary_gpu_utilization(). A process restricted to GPU 2 now reports metrics for GPU 2 instead of physical GPU 0. Tighten the plain bytes unit detection to an explicit allowlist.
- studio/backend/utils/hardware/hardware.py: route get_backend_visible_gpu_info()'s backend_cuda_visible_devices field through a helper that reads HIP_VISIBLE_DEVICES on ROCm. Drop the unconditional "(rocm=False)" suffix in apply_gpu_ids() logs.

* Fix round 2 regressions: ROCm validate_server and Windows HIP routing

Follow-up to 810b833b addressing review findings on the first round of hardening commits:

- install_llama_prebuilt.py validate_server: gate --n-gpu-layers on the resolved install_kind instead of host.has_rocm. AMD Windows hosts without a HIP prebuilt fall back to windows-cpu and must not be validated with GPU layers; thread install_kind through from the caller.
- install_llama_prebuilt.py resolve_release_asset_choice: reinstate the "not has_rocm" guard on the published windows-cpu bundle so AMD Windows hosts reach resolve_upstream_asset_choice() where the new HIP prebuilt path lives. Prefer a published windows-hip bundle first when one exists, fall through to upstream HIP + upstream CPU otherwise.
- install_llama_prebuilt.py detect_host: also set has_physical_nvidia when the secondary --query-gpu block confirms a working NVIDIA GPU, so older nvidia-smi versions without -L support do not silently skip the Linux diagnostics that key off has_physical_nvidia.
- install_llama_prebuilt.py: drop redundant "import re as _re" / "import re as _re_rocm" local aliases in favour of the existing top-level "import re".
- install_python_stack.py _ensure_rocm_torch: run the AMD bitsandbytes install unconditionally after the HIP-torch probe so "unsloth studio update" on venvs that already have ROCm torch still gains the AMD bitsandbytes build.
- install.sh: add a non-x86_64 early-exit to get_torch_index_url() so aarch64 / arm64 Linux hosts do not hit the ROCm wheel index (PyTorch only publishes ROCm wheels for linux_x86_64).
- install.sh: add bitsandbytes install to the migrated-environment branch so upgrades pick it up for ROCm hosts instead of only the fresh-install path.
- install.sh: in the Radeon wheel path, pass version constraints + --no-index --find-links to uv instead of explicit wheel URLs so a version-compatible torch / torchvision / torchaudio triple is resolved, rather than picking the highest-version wheel for each package independently.
- studio/backend/utils/hardware/amd.py _first_visible_amd_gpu_id: fall through to lower-priority visibility env vars when the first entry is malformed (leading comma, all-whitespace first token) instead of silently returning GPU 0.

* Fix round 3 findings: x86_64 guard, ROCm version clip, Radeon deps

Address issues surfaced by the round 3 reviewers on top of 8636fa63:

- install_python_stack.py _ensure_rocm_torch: add the same `x86_64` guard that install.sh already has. Linux aarch64 / arm64 ROCm hosts must skip the repair path entirely; PyTorch only publishes ROCm wheels for linux_x86_64, and without this guard `unsloth studio update` aborts with a missing-wheel error on non-x86_64 hosts.
- install_llama_prebuilt.py resolve_upstream_asset_choice: add a best-effort _detect_host_rocm_version() helper (reading /opt/rocm/.info/version, amd-smi version, hipconfig --version) and filter rocm_candidates to entries whose major.minor is <= host version. Falls back to the newest candidate only when no compatible one exists, so a ROCm 6.4 host downloads rocm-6.4 instead of being handed the numerically newest rocm-7.2 bundle (which fails preflight and forces a source build).
- install.sh: remove the round 2 --no-index switch from the Radeon wheel branch. --no-index forced uv to ignore PyPI entirely, which broke transitive dependency resolution (filelock, sympy, networkx, jinja2, fsspec, setuptools, typing-extensions, ...) on a fresh venv. Restore the round 1 explicit wheel URL invocation but add a torch / torchvision / torchaudio version-pair sanity check so a mismatched trio (e.g. torch 2.9.1 + torchvision 0.23.0 + torchaudio 2.9.0) falls back to the standard ROCm index instead of installing a broken combination.
- install_python_stack.py _ensure_rocm_torch: restructure the "tag is None" path so it no longer short-circuits the bitsandbytes install. On a ROCm runtime older than anything in _ROCM_TORCH_INDEX, print the "no wheel" warning but still run the AMD bitsandbytes install.
- studio/backend/core/training/worker.py: restore the pre-PR "no timeout" behaviour for non-HIP causal-conv1d / mamba-ssm source builds. The round 2 "timeout = 1800 if is_hip else 300" cap aborts slow non-HIP builds (Linux aarch64, unsupported torch/CUDA combos) after 5 minutes; omit timeout for the non-HIP branch so the cap only applies to ROCm source builds.

* Fix round 4 findings: apply_gpu_ids env inheritance, Radeon X.Y, bitsandbytes gate

Address remaining issues surfaced by the round 4 reviewers:

- studio/backend/utils/hardware/hardware.py apply_gpu_ids: mirror the selection into HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES whenever the caller already had a ROCm visibility env var set, not only when IS_ROCM has already been set by detect_hardware(). Training and inference workers call apply_gpu_ids() before detect_hardware() runs, so the old guard would leave a forked ROCm worker with a stale HIP_VISIBLE_DEVICES mask that no longer matched the narrowed CUDA_VISIBLE_DEVICES selection.
- install.sh get_radeon_wheel_url: accept X.Y ROCm versions in addition to X.Y.Z. The `/opt/rocm/.info/version` file and some hipconfig versions report only two components, and the Radeon repository publishes both rocm-rel-X.Y.Z/ and rocm-rel-X.Y/ directories, so treating X.Y as invalid caused Radeon hosts to fall back to the generic ROCm index even when a matching AMD wheel set existed.
- install_python_stack.py _ensure_rocm_torch: only install the AMD bitsandbytes build when the venv actually has a ROCm-compatible torch (either already present or just installed by this function). Previously the bitsandbytes install ran unconditionally, which could leave an AMD bitsandbytes layered on top of a CPU/CUDA torch on hosts where the ROCm runtime is older than any entry in _ROCM_TORCH_INDEX. Also add --force-reinstall so an existing CPU/CUDA bitsandbytes is replaced by the AMD build during upgrades.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix gemini findings: amd-smi metric envelope validation and dict-wrapped GPU id

Two medium-severity defensive fixes from the gemini-code-assist review on the AMD monitoring backend:

1. _extract_gpu_metrics may return a dict where every value is None when amd-smi succeeds (zero exit) but the JSON envelope contains no usable fields (error response, unsupported card). The new _has_real_metrics helper lets get_primary_gpu_utilization surface available:False and lets get_visible_gpu_utilization skip ghost device rows so the UI does not render placeholder cards with empty numbers.
2. Newer amd-smi versions wrap scalar fields as {"value": 0, "unit": "none"}, including the per-GPU id. The previous int(raw_id) call silently fell back to the enumeration index in that case, losing the real GPU id. Routing raw_id through the existing _parse_numeric helper handles bare ints, floats, strings, and the dict shape uniformly, with a debug log on parse failure.

* Fix gemini round 2 findings: explicit length guard on ROCm version file parser

Both _detect_rocm_version (install_python_stack.py) and _detect_host_rocm_version (install_llama_prebuilt.py) read /opt/rocm/.info/version or $ROCM_PATH/lib/rocm_version, split on "." and unconditionally accessed parts[1]. The surrounding broad `except Exception: pass` already swallowed the resulting IndexError, so a one-component file like "6\n" did fall through to the next detection source -- but the control flow relied on exception handling instead of an explicit check.
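The sketch below shows the difference between relying on a swallowed IndexError and checking the split length explicitly. It is a hypothetical stand-in for the real `_detect_rocm_version` / `_detect_host_rocm_version` helpers, which read several sources (version file, `amd-smi version`, `hipconfig`, package managers). This version only parses text that has already been read. Tolerating a `-` suffix on the minor component is an assumption carried over from the hipconfig `"6.3-HIP"` case described earlier.

```python
from typing import Iterable, Optional, Tuple


def parse_rocm_version(text: str) -> Optional[Tuple[int, int]]:
    """Parse version-file contents such as "6.3.1" into (major, minor)."""
    parts = text.strip().split(".")
    # Explicit length check: a one-component value like "6" is rejected here
    # instead of raising IndexError for a broad `except` to swallow.
    if len(parts) >= 2:
        major = parts[0]
        minor = parts[1].split("-")[0]  # tolerate suffixes such as "3-HIP"
        if major.isdigit() and minor.isdigit():
            return int(major), int(minor)
    return None


def first_version(sources: Iterable[str]) -> Optional[Tuple[int, int]]:
    """Return the first source that parses; later sources act as fallbacks."""
    for text in sources:
        version = parse_rocm_version(text)
        if version is not None:
            return version
    return None
```

`first_version(["6\n", "7.2.0"])` skips the one-component value and returns `(7, 2)` without any exception being raised along the way.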
Add `if len(parts) >= 2:` guards in both helpers so the loop falls through on its own without raising. Behaviour is unchanged for the common multi-component case; the previously-silent IndexError path becomes an explicit no-op.

* Fix gemini round 3: include has_rocm in validate_server fallback path

When validate_server is called without an explicit install_kind (older call sites that have not been updated), the fallback was only enabling --n-gpu-layers for NVIDIA and macOS arm64 hosts. AMD ROCm Linux hosts fell through to the CPU validation path even though the prebuilt being exercised was a HIP binary. Add host.has_rocm to the fallback expression so the GPU offload flag is applied consistently with the install_kind=='linux-rocm' / 'windows-hip' branches above.

* Fix gemini round 4: remove risky bytes-vs-MB heuristic in _parse_memory_mb

The previous heuristic divided any bare number above 10_000_000 by 1024*1024 on the assumption that large unit-less values were bytes. This misclassified small VRAM allocations: 5 MB of used VRAM reported as 5_242_880 bytes without a unit would be taken at face value and render as 5_242_880 MB (~5 TB) in the monitoring UI. Modern amd-smi always provides explicit units (MiB/GiB dict form), and legacy amd-smi returns bare numbers in MB -- the heuristic never had a real workload to handle. Drop it and default to MB for bare numeric input, keeping the existing unit-aware branches for dict / string inputs unchanged.

The unrelated gemini suggestion to "default minor to 0" in the amd-smi version awk parser was intentionally NOT applied: rocm7.0 and rocm7.1 ship different wheel sets, so silently substituting 0 for a missing minor could install the wrong wheels. The existing reject-and-fall-through behaviour is safer.

* Fix gemini round 5: POSIX compliance and leading-comma visibility parsing

Three medium findings from gemini-code-assist addressed in this commit:

1. _pick_radeon_wheel used grep -o and sort -V, both GNU extensions that are not in POSIX and break on BSD/BusyBox coreutils. install.sh has a #!/bin/sh shebang so the whole pipeline was rewritten as a single awk script that extracts all href="..." hits on each line, filters to wheels matching the package prefix and python tag, and picks the newest version via zero-padded lexical comparison. No external sort or grep is needed.
2. _first_visible_amd_gpu_id in the AMD monitoring backend treated a leading comma (e.g. HIP_VISIBLE_DEVICES=",1") as "fall through to the next env var", which is surprising given the clear intent to narrow to device 1. Filter empty tokens after the split and return the first real one. An all-commas value ("," / ",,,") still falls through because no real tokens exist; the empty-string and "-1" explicit-zero cases are unchanged.

The unrelated amd-smi version awk parser suggestion was not applied (see round 4 commit message for rationale: defaulting a missing minor to 0 could silently install the wrong ROCm wheel set).

* Fix 20-reviewer.py findings: base drift, Radeon %2B, dpkg/rpm fallback, bnb, backend label

Consolidated fix batch from a 20-parallel reviewer.py run on the current head. Each fix is drawn from a high-consensus finding and addresses a real bug or feature gap, not a stylistic preference.

1. install.sh: bump `unsloth>=2026.4.2` -> `unsloth>=2026.4.4` at five call sites so this branch no longer regresses main's version floor (main bumped to 2026.4.4 in #4876). Without this, merging 4720 would silently downgrade the minimum version pin for fresh installs.
2. install.sh: URL-decode Radeon wheel names before extracting the torch / torchvision / torchaudio version strings. Real wheel URLs from repo.radeon.com are percent-encoded ("torch-2.10.0%2Brocm7.2.0...") so the previous `[+-]` terminator in the sed regex never matched, `_torch_ver` stayed empty, `_radeon_versions_match` stayed false, and every Radeon consumer install silently fell back to the generic ROCm index. Now decode %2B -> + first, then extract, then validate.
3. install.sh: the two AMD bitsandbytes install lines were running `uv pip install "bitsandbytes>=0.49.1"` without `--force-reinstall`, so upgrades where the venv already has a CPU/CUDA bitsandbytes satisfying the constraint would keep the stale non-AMD wheel. Add `--force-reinstall --no-cache-dir` to both call sites, matching the pattern already used in install_python_stack.py::_ensure_rocm_torch.
4. install_python_stack.py and install_llama_prebuilt.py: add `dpkg-query -W rocm-core` and `rpm -q rocm-core` fallbacks to the Python-side ROCm version detectors so they match the chain in install.sh::get_torch_index_url. Package-managed ROCm installs (Debian/Ubuntu/RHEL/Fedora distro packages) can expose GPUs via rocminfo/amd-smi but still lack /opt/rocm/.info/version, hipconfig, or amd-smi `version` output -- without these fallbacks, `unsloth studio update` on such hosts returned None and skipped the ROCm torch repair. Also strip the dpkg epoch prefix ("1:6.3.0-1") before parsing so epoch-annotated packages parse correctly.
5. hardware.py: add a `_backend_label(device)` helper that returns "rocm" when IS_ROCM is set and the device is DeviceType.CUDA, and use it for every `"backend": ...` emission in JSON responses served to the Studio frontend. Internally we still represent ROCm hosts as DeviceType.CUDA (ROCm torch reuses the whole torch.cuda.* API surface), but the user-facing API now correctly reports "rocm" on AMD boxes instead of labeling them as "cuda".
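The %2B bug in item 2 is easy to reproduce outside the shell. The sketch below is a Python rendering of the same decode-then-extract idea. install.sh does this with sed, and `wheel_version` is a hypothetical name. Only the `torch-2.10.0%2Brocm7.2.0` prefix comes from the text above; the rest of the file names are illustrative. The pattern mirrors the `[+-]` terminator: the version digits must be followed by `+` (local version tag) or `-` (next wheel-name field).

```python
import re
from typing import Optional
from urllib.parse import unquote


def wheel_version(package: str, wheel_name: str, decode: bool = True) -> Optional[str]:
    """Extract the public version from a (possibly percent-encoded) wheel name."""
    # "%2B" is the percent-encoding of "+", the local-version separator.
    name = unquote(wheel_name) if decode else wheel_name
    # Digits and dots, then "+" or "-": the same terminator idea as `[+-]`.
    match = re.match(rf"{re.escape(package)}-([0-9][0-9.]*)[+-]", name)
    return match.group(1) if match else None
```

Without decoding, the character after `2.10.0` is `%`, so the terminator never matches and the function returns `None`. That matches how `_torch_ver` stayed empty in the installer.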
All 250 simulation scenarios pass (was 233 before this batch: added 17 new regression tests covering the version pin, %2B decoding, bnb force-reinstall flags, dpkg/rpm fallback presence, and the _backend_label helper's four-way truth table).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix gemini round 6 + URL audit: amd.py defensive checks, rocm6.5+ clip to 6.4

Two rounds of fixes in one commit, plus a full URL audit of every PyPI / download.pytorch.org / repo.radeon.com reference the PR introduces.

amd.py (4 medium gemini findings on commit b3627bc2):

1. _extract_gpu_metrics used `and vram_total_mb` as part of the vram_util gate. The follow-up `vram_total_mb > 0` already handles the division guard, but the truthiness check was redundant and slightly surprising for a 0.0 valid value. Replace with explicit `is not None and > 0` for both vram_util and power_util.
2. get_physical_gpu_count called `data.get("gpu", ...)` without guarding for non-dict envelopes. A scalar / string JSON response from amd-smi would raise AttributeError. Add an isinstance(data, dict) check and return None for unexpected shapes.
3. get_visible_gpu_utilization had the same .get() exposure on the outer envelope. Rewrite the gpu_list extraction as an explicit list/dict/else cascade so a malformed scalar envelope produces gpu_list=[data] and continues without raising.
4. The same function's per-entry loop also called gpu_data.get() on whatever was inside gpu_list. If a scalar ever leaks into the list (directly or via the previous fix's fallback), _extract_gpu_metrics would raise on the first .get() inside the helper. Skip non-dict entries in the loop before extracting metrics.

install.sh (URL audit finding, previously flagged by 20-reviewer as #13):

5. get_torch_index_url used `rocm6.*` in the rocm tag case statement, which matched rocm6.5 and rocm6.6 and emitted download.pytorch.org/whl/rocm6.5 -- which returns HTTP 403 because PyTorch only publishes rocm 5.7, 6.0-6.4, 7.0-7.2. Enumerate the supported 6.x minors explicitly and add a rocm6.* fallback branch that clips to rocm6.4 (the last supported 6.x wheel set).

URL audit results (all URLs PR 4720 references):
- 14/14 download.pytorch.org/whl/{cpu,cu118,cu124,cu126,cu128,cu130,rocm6.0..6.4,rocm7.0..7.2} return HTTP 200.
- 9/9 repo.radeon.com/rocm/manylinux/rocm-rel-{5.7,6.0,6.1,6.2,6.3,6.4,7.0,7.1,7.2}/ return HTTP 200.
- X.Y.Z patch directories exist for 7.0.2, 7.1.1, 7.2.1 but NOT for 6.3.0, 6.4.0, 6.2.1 -- install.sh already handles this via the X.Y.Z -> X.Y fallback sed in the Radeon wheel install block.
- Docs links (rocm.docs.amd.com, docs.unsloth.ai AMD guide) and the llama.cpp GitHub releases API endpoint all return 200.

Test suite: 255 -> 258. New regression coverage:
- U17: get_physical_gpu_count tolerates scalar amd-smi envelope
- U18: get_visible_gpu_utilization tolerates scalar envelope
- U19a-c: vram_util / power_util return None on zero total, but vram_total_gb still echoes 0.0 (not None)
- A_rocm{6.5,6.6,6.9}_clips_to_rocm64: install.sh clips unsupported 6.x minors to rocm6.4 instead of producing a 403 index URL

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix reviewer.py round 2: tokenizer AMD multi-GPU, --no-torch bnb, main.py backend label

Three high-confidence findings from a second 20-parallel reviewer.py run on commit 7effb3ae. Triaged 15 total findings and applied the three that were confirmed as real bugs; the rest were either false positives (e.g. "migrated AMD venv not repaired" -- _ensure_rocm_torch runs downstream via setup.sh regardless), design decisions (e.g. visibility mask env vars not consulted in installer detection), or edge cases the existing fallback logic already handles.

1. unsloth/tokenizer_utils.py [6/20]: the multi-GPU guard's shell probe runs `nvidia-smi --query-gpu=memory.used`, catches the failure, then only raises if `torch.cuda.is_available()` is False. On ROCm torch, torch.cuda.is_available() returns True (ROCm reuses the torch.cuda.* API), so the guard becomes dead code on AMD hosts and multi-GPU AMD setups slip through even though unsloth does not support them yet. Add a torch.cuda.device_count() > 1 fallback inside the except so AMD multi-visible-device setups are flagged consistently with the original CUDA memory check.
2. install.sh [1/20]: the fresh-install bitsandbytes block for AMD ROCm ran unconditionally when TORCH_INDEX_URL matched `*/rocm*`, even when SKIP_TORCH=true (from --no-torch or Intel Mac auto-detect). A user running `install.sh --no-torch` on an AMD host would still pull in bitsandbytes despite explicitly asking for GGUF-only mode. Wrap the case block in an outer `[ "$SKIP_TORCH" = false ]` guard.
3. studio/backend/main.py [3/20]: the /api/system endpoint returned `"device_backend": get_device().value`, which is "cuda" on ROCm hosts (because ROCm torch piggybacks on torch.cuda). Other endpoints (hardware.py) already use the _backend_label helper which swaps "cuda" -> "rocm" when IS_ROCM. Route /api/system through the same helper so the Studio UI reports the backend consistently across all endpoints.
4. studio/backend/tests/test_utils.py: update test_backend_matches_device to call _backend_label(get_device()) instead of raw get_device().value so the test matches the new contract and still passes on CUDA hosts.

Tests: 258 -> 261.
New regression coverage:
- X08 main.py /api/system uses _backend_label
- X09 tokenizer multi-GPU guard has device_count() fallback
- X10 fresh-install bnb case block gated on SKIP_TORCH=false

* fix: prevent bitsandbytes from overwriting ROCm torch with CUDA wheels

During install, bitsandbytes was installed without --no-deps, causing uv to resolve torch from PyPI (CUDA build) and silently overwrite the ROCm wheels that were just installed in the previous step. This happened in three places:
- install.sh: bitsandbytes install in both migrated and fresh paths
- install_python_stack.py: bitsandbytes install inside _ensure_rocm_torch()

Additionally, multiple install steps in install_python_stack.py (extras, overrides, studio deps) can pull in CUDA torch via transitive dependencies. A final _ensure_rocm_torch() call at the end of the install sequence ensures ROCm torch is always in place at runtime.

All changes are gated behind ROCm-specific conditions and do not affect NVIDIA, CPU-only, macOS, or Windows install paths.

Tested on AMD Instinct MI300X VF with ROCm 7.2.0 -- confirms torch==2.10.0+rocm7.1 with HIP 7.1.25424 after install.

* fix: ROCm inference fallback -- skip Unsloth patching and bnb 4-bit on HIP

On AMD ROCm (HIP), two issues prevent the normal Unsloth inference path:

1. Unsloth's global monkey-patching of transformers model classes (LlamaRotaryEmbedding, attention modules) triggers _assert_async_cuda_kernel crashes on HIP during generation. Training uses different code paths and works fine.
2. bitsandbytes 4-bit matmul kernels also trigger HIP assertion failures on MI300X (CDNA3 / gfx942), even without Unsloth patching.

This commit adds a ROCm-specific inference fallback that:
- Skips importing Unsloth at module level (prevents global patching)
- Loads models in 16-bit with plain transformers + PEFT instead
- Resolves pre-quantized model names (e.g. "xxx-bnb-4bit" -> "xxx") since pre-quantized HF repos still trigger bnb codepaths
- Guards get_chat_template calls (unavailable without Unsloth import)
- Fixes max_seq_length=0 being passed to from_pretrained (GGUF semantics don't apply to transformers path)

The NVIDIA path is completely unchanged -- Unsloth import and for_inference() optimization remain active. GGUF inference (via llama-server/HIP) is unaffected since it never imports Python model classes.

AMD GPUs typically have large VRAM (e.g. 192 GB on MI300X) so 16-bit loading is practical for inference.

Tested on AMD Instinct MI300X VF (ROCm 7.2, HIP 7.1.25424):
- Simple generation: PASS
- Compare mode (base vs finetuned): PASS
- GGUF inference + tool calling: PASS (unaffected by this change)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: guard audio/vision inference on ROCm, remove unused import

- Add clear RuntimeError for audio/vision model inference on ROCm (these paths use Unsloth's FastModel/FastVisionModel which would crash on HIP; GGUF inference is the supported path on AMD)
- Remove unused `import os as _os` from the ROCm changes

* fix: amd-smi parsing for newer output format (gpu_data wrapper, mem_usage, temperature)

amd-smi on recent ROCm versions (7.x) wraps metric output in a {"gpu_data": [...]} envelope instead of returning a raw list. This caused get_primary_gpu_utilization() and get_visible_gpu_utilization() to fail silently (returning available=False) because the GPU data dict was never unwrapped.

Additionally:
- VRAM data moved from "vram" to "mem_usage" with "total_vram" / "used_vram" keys. Added fallback key lookup.
- Temperature "edge" sensor returns "N/A" on MI300X VF; the previous dict.get() chain returned the "N/A" string instead of falling through to "hotspot". Changed to a loop that checks each key until a parseable value is found.
Tested on AMD Instinct MI300X VF (ROCm 7.2, amd-smi 24.x): - GPU utilization: 0% (idle), up to 100% during training - Temperature: 40-44C (from hotspot sensor) - VRAM: 0.28/191.69 GB (idle) - Power: 158-211W draw * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Bug fix detecting radeon (#4940) * Bug fix detecting radeon * Expanding GPU target for gfx1100* * Generalize gfx family-prefix filter to cover gfx10/gfx12 as well rocminfo on ROCm 6.1+ emits LLVM generic-family ISA lines alongside the specific GPU (e.g. gfx11-generic next to gfx1100). The outer grep captures the bare family prefix from the generic line, and passing that to -DGPU_TARGETS breaks the HIP build because clang only accepts specific gfxNNN ids. The previous filter only special-cased gfx11. Generalize it so any bare 2-digit family prefix (gfx10, gfx11, gfx12, ...) is dropped whenever a specific sibling target is present in the same list. No real AMD GPU has a 2-digit gfx id, so the filter can only ever drop family prefixes and never a real target. Covers the existing gfx11 cases unchanged, and extends the same fix to gfx10-1-generic / gfx10-3-generic (RDNA1/2) and gfx12-generic (RDNA4), which would otherwise hit the same build failure on newer rocminfo. --------- Co-authored-by: Iswarya Alex <iswarya.alex@amd.com> Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com> --------- Co-authored-by: Eda Z <eda.zhou@amd.com> Co-authored-by: GoldenGrapeGentleman <yueyuan@amd.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: billishyahao <bill.he@amd.com> Co-authored-by: Iswarya Alex <47045679+iswaryaalex@users.noreply.github.com> Co-authored-by: Iswarya Alex <iswarya.alex@amd.com> Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
2026-04-10 08:56:12 +00:00
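The gfx family-prefix filter described in the #4940 note above can be sketched in Python. This is a minimal, hypothetical illustration: `filter_gpu_targets` is not a name from install.sh, where the real filter is a shell pipeline. It assumes the candidate targets have already been pulled out of rocminfo into a list, with generic-family lines reduced to their bare prefix (e.g. `gfx11`).

```python
import re

# A bare LLVM family prefix is "gfx" followed by exactly two digits
# ("gfx10", "gfx11", "gfx12"). Per the note above, no real AMD GPU has a
# 2-digit id, so this pattern can never match a specific target such as
# "gfx1100", "gfx942" or "gfx90a".
_FAMILY_PREFIX = re.compile(r"^gfx\d{2}$")


def filter_gpu_targets(targets: list[str]) -> list[str]:
    """Drop bare family prefixes when a specific sibling target is listed.

    Hypothetical sketch; the order of the remaining targets is preserved.
    """
    kept = []
    for target in targets:
        has_specific_sibling = any(
            other != target and other.startswith(target) for other in targets
        )
        if _FAMILY_PREFIX.match(target) and has_specific_sibling:
            continue  # e.g. "gfx11" is dropped because "gfx1100" is present
        kept.append(target)
    return kept
```

For example, `filter_gpu_targets(["gfx1100", "gfx11", "gfx942"])` keeps `gfx1100` and `gfx942` and drops the bare `gfx11`. A lone prefix with no specific sibling is left untouched, which matches the "whenever a specific sibling target is present" condition in the note.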
Pin bitsandbytes to continuous-release_main on ROCm (4-bit decode fix) (#4954) * Pin bitsandbytes to continuous-release_main on ROCm for 4-bit decode fix bitsandbytes 0.49.2 on PyPI ships with a broken 4-bit GEMV kernel on every ROCm target: - CDNA (gfx90a / gfx942 / gfx950 = MI210 / MI300X / MI350) via a broken blocksize=32/64 warp64 GEMV kernel whose tests were explicitly skipped with ROCM_WARP_SIZE_64 guards because the code was known broken. - RDNA3 / RDNA3.5 (gfx1100-1103 / gfx1150-1152) via a compile-time BNB_WARP_SIZE macro in the host-side dispatch that resolves to 64 when the multi-arch wheel is compiled with CDNA as the primary target, so num_blocks is wrong on RDNA and half the GEMV output is never written. At decode shape (1, 1, hidden) both bugs produce NaN. Training is unaffected because training shapes are (batch, seq_len > 1, hidden) and never touch the GEMV path. The crash during autoregressive inference surfaces as _assert_async_cuda_kernel in torch.multinomial which on HIP becomes a hard HSA_STATUS_ERROR_EXCEPTION instead of a clean Python error. Both bugs are fixed by bitsandbytes commit 713a3b8 ("[ROCm] Enable blocksize 32 4-bit quantization and GEMV kernels on AMD CDNA", PR #1887, merged 2026-03-09) which replaces BNB_WARP_SIZE with a runtime hipDeviceGetAttribute query and ships a working CDNA warp64 kernel. That commit has not shipped to PyPI yet, but continuous-release_main wheels are published on every push to bnb main via GitHub Releases. Point the ROCm install path at the continuous-release_main x86_64 and aarch64 wheels and fall back to PyPI >=0.49.1 when the pre-release is unreachable (offline installs, firewalled hosts, or architectures not covered by the pre-release wheels). Drop the pin once bnb cuts a 0.50+ tag on PyPI. 
Verified on MI300X (gfx942, ROCm 7.2, torch 2.10.0+rocm7.1): direct bnb GEMV shape test now returns 0.0078 max abs error at seq_len=1 (no NaN) vs NaN on 0.49.2, and full Unsloth + for_inference + 4-bit sampling generation works end-to-end. NVIDIA / CPU / Mac / Windows paths are unaffected -- the helper is gated on the ROCm torch index and platform.machine() respectively. * Drop Studio ROCm 16-bit fallback now that bnb 0.50+ fixes 4-bit decode The 16-bit fallback in studio/backend/core/inference/inference.py was added as a workaround for a bug that this PR already fixes at the install layer: bitsandbytes <= 0.49.2 has a broken 4-bit GEMV kernel on every ROCm target, which NaNs at decode shape (seq_len=1) and crashes autoregressive inference. bnb PR #1887 (commit 713a3b8, in 0.50.0.dev0+, pinned by install.sh / install_python_stack.py in this PR) restores correct 4-bit decode on MI300X and verified working end-to-end with full Unsloth + for_inference + sampling. Revert the dual code path so ROCm and NVIDIA both go through the normal FastLanguageModel.from_pretrained + for_inference flow: - Remove the conditional `from unsloth import` that skipped the import on ROCm. The monkey-patches it was trying to avoid were never the cause of the crash; bnb 4-bit GEMV was. - Remove the `if _hw_module.IS_ROCM:` branch in load_model that loaded with plain transformers + PEFT + bfloat16, and the `_resolve_fp16_base` helper it relied on. - Remove the `get_chat_template is not None` fallback in _load_chat_template_info -- get_chat_template is now always imported. - Refactor the audio/vision ROCm guard to check _hw_module.IS_ROCM directly instead of the removed _IS_ROCM_ENV global. Audio and vision on ROCm still need separate validation (FastVisionModel and the CSM audio codecs were never tested on HIP) so the guard stays for now. 
Add _bnb_rocm_4bit_ok() as a runtime safety net for users who install from this PR before the install.sh bnb pin kicks in, or whose installer fell back to the PyPI pin because the continuous- release wheel was unreachable. When the installed bnb is < 0.50 on ROCm, force load_in_4bit=False and strip any -unsloth-bnb-4bit / -bnb-4bit suffix from the model path so a pre-quantized repo resolves to its FP16 sibling instead of pulling bnb back in via the repo's quantization_config. LoRA adapters whose base is a pre-quantized repo on old bnb will still fail inside Unsloth's loader -- the only real fix there is `unsloth studio update`. Verified on MI300X (gfx942, ROCm 7.2, torch 2.10.0+rocm7.1): - HAPPY path (bnb 0.50.0.dev0, load_in_4bit=True, pre-quantized repo): loads in 4-bit via the fixed GEMV, generation returns "Paris." for greedy and sampling. - SAFETY-NET path (simulated old bnb, suffix-stripped to the FP16 sibling, load_in_4bit=False): loads in bf16, generation returns "Paris." for greedy and sampling. Net diff is ~45 lines smaller than the pre-revert state because the entire plain-transformers 16-bit branch is gone. * Cache _bnb_rocm_4bit_ok() with functools.cache load_model() can be called many times in a single session but the bnb version and hardware state cannot change at runtime, so memoise the check. First call is ~1.9 ms (dominated by the lazy `import bitsandbytes` inside the try block), subsequent calls drop to sub-microsecond dict lookups. Zero behavioral change. * Shorten verbose bnb/ROCm comments Comment-only cleanup across install.sh, studio/install_python_stack.py, and studio/backend/core/inference/inference.py. No behavioral change. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove _bnb_rocm_4bit_ok safety net from inference.py Studio's ROCm support is brand new (PR #4720, merged today) and every fresh install pulls the bnb continuous-release_main wheel via install.sh / install_python_stack.py in this same PR. There are no existing ROCm Studio installs carrying bnb < 0.50, so the defensive version-check fallback is guarding against a scenario that cannot actually occur. Delete the helper, the functools import, and the safety-net block -- inference.py now calls FastLanguageModel.from_pretrained directly with no ROCm branching. * Drop audio/vision ROCm guard in inference.py — verified unblocked by bnb fix Vision inference was blocked by the same bnb 4-bit GEMV bug that affected text inference (vision models use bnb 4-bit for the LM backbone). With bnb 0.50+ pinned in install.sh / install_python_stack.py, vision works end-to-end on MI300X: Llama-3.2-11B-Vision-Instruct-unsloth-bnb-4bit loaded in 4-bit via FastVisionModel + for_inference returns a correct answer to a multimodal prompt. Audio (CSM) was never actually blocked by HIP — on this hardware CSM loads and runs its backbone forward pass fine with bnb 0.50, then fails during generate() with a transformers-level kwarg validation mismatch in generation_csm.py (`backbone_last_hidden_state` rejected). That's a pre-existing transformers/CSM integration bug that reproduces identically on NVIDIA, so the ROCm-gated guard was never actually protecting users from anything HIP-specific. Remove the combined audio/vision guard and the now-unused _hw_module import. Also restore the one-word "Can be" in an inline comment that drifted during the earlier comment-shortening pass, so the inference.py delta vs pre-#4720 is exactly the max_seq_length<=0 crash fix and nothing else. 
* Shorten max_seq_length=0 guard comment to one line --------- Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-10 13:25:39 +00:00
# bitsandbytes continuous-release_main wheels with the ROCm 4-bit GEMV fix
# (bnb PR #1887, post-0.49.2). bnb <= 0.49.2 produces NaNs at decode shape
# (seq_len=1) on every AMD GPU. Drop the pin once bnb 0.50+ ships on PyPI.
_BNB_ROCM_PRERELEASE_URLS: dict[str, str] = {
"x86_64": (
"https://github.com/bitsandbytes-foundation/bitsandbytes/releases/"
"download/continuous-release_main/"
"bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_x86_64.whl"
),
"aarch64": (
"https://github.com/bitsandbytes-foundation/bitsandbytes/releases/"
"download/continuous-release_main/"
"bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_aarch64.whl"
),
}
_BNB_ROCM_PYPI_FALLBACK = "bitsandbytes>=0.49.1"


def _bnb_rocm_prerelease_url() -> str | None:
"""Return the continuous-release_main bnb wheel URL for the current
architecture, or None when no pre-release wheel is available.
"""
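    arch = platform.machine().lower()
    # Normalise alternate spellings ("amd64", "arm64") to the manylinux
    # architecture tags used in the wheel filenames above.
    arch = {"amd64": "x86_64", "arm64": "aarch64"}.get(arch, arch)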
return _BNB_ROCM_PRERELEASE_URLS.get(arch)
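The #4954 commit message describes the install order these constants support: try the continuous-release_main wheel for the current architecture first, then fall back to the PyPI pin when the pre-release is unreachable or no wheel exists for the architecture. The sketch below shows that ordering. It is a hypothetical illustration: `bnb_rocm_install_specs` and `install_first_working` are not functions from this file. The `install` callable stands in for whatever actually runs pip or uv, which, per the #4720 notes above, should pass `--no-deps` so the resolver cannot replace the ROCm torch with a CUDA build.

```python
from collections.abc import Callable
from typing import Optional

# Copied from _BNB_ROCM_PYPI_FALLBACK so this sketch runs on its own.
PYPI_FALLBACK = "bitsandbytes>=0.49.1"


def bnb_rocm_install_specs(prerelease_url: Optional[str]) -> list[str]:
    """Return requirement specs to try, in order (hypothetical helper).

    `prerelease_url` is what _bnb_rocm_prerelease_url() returns: a wheel URL,
    or None when the architecture has no pre-release wheel.
    """
    specs = [prerelease_url] if prerelease_url else []
    specs.append(PYPI_FALLBACK)  # the PyPI pin is always the last resort
    return specs


def install_first_working(
    specs: list[str], install: Callable[[str], bool]
) -> Optional[str]:
    """Call `install` on each spec until one succeeds.

    Returns the spec that installed, or None if every attempt failed.
    """
    for spec in specs:
        if install(spec):
            return spec
    return None


# Usage: simulate a firewalled host where the GitHub release URL fails.
specs = bnb_rocm_install_specs("https://example.invalid/bitsandbytes.whl")
chosen = install_first_working(specs, lambda s: not s.startswith("https://"))
print(chosen)  # bitsandbytes>=0.49.1
```

Keeping the spec list separate from the install loop lets the ordering be checked without network access. On an architecture with no pre-release wheel (None), the list collapses to the PyPI pin alone.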
Add AMD ROCm/HIP support across installer and hardware detection (#4720) * Add ROCm detection to install.sh and expand shell tests Add AMD ROCm GPU detection to get_torch_index_url() in install.sh. When nvidia-smi is not found, probe for ROCm via amd-smi, /opt/rocm version file, hipconfig, dpkg-query, and rpm. Includes validation guard for malformed _rocm_tag, Debian epoch prefix stripping, ROCm 7.2+ cap to rocm7.1 index, bitsandbytes AMD install, and status messaging. Shell tests expanded to 23 cases. Co-authored-by: Daniel Han <danielhanchen@gmail.com> * Add ROCm torch reinstall support to install_python_stack.py Add _detect_rocm_version() and _ensure_rocm_torch() to detect when a Linux host has ROCm but the venv received CPU-only torch, and reinstall with the correct ROCm wheels. Covers ROCm 6.0 through 7.1 with a 30-second timeout on the torch GPU probe subprocess. Co-authored-by: Daniel Han <danielhanchen@gmail.com> * Add ROCm support to llama.cpp prebuilt installer Add has_rocm field to HostInfo, extend detect_host() to probe for ROCm via hipcc/amd-smi/rocm-smi/ROCM_PATH, and route ROCm hosts to upstream prebuilts (Linux ROCm 7.2 prebuilt with source fallback, Windows HIP prebuilt with CPU fallback). Add linux-rocm and windows-hip install kinds to runtime_patterns_for_choice(). Co-authored-by: Daniel Han <danielhanchen@gmail.com> * Add IS_ROCM hardware flag and fix AMD error message Add IS_ROCM flag to hardware.py detect_hardware() (set when torch.version.hip is present, DeviceType stays CUDA). Export IS_ROCM from __init__.py. Add "rocm" key to get_package_versions(). Replace "We do not support AMD" error in tokenizer_utils.py with a helpful message pointing to ROCm installation docs. 
Co-authored-by: Daniel Han <danielhanchen@gmail.com> * Add comprehensive ROCm support test suite (68 tests) Add tests/studio/install/test_rocm_support.py covering all ROCm code paths across install_llama_prebuilt.py, install_python_stack.py, hardware.py, tokenizer_utils.py, and install.sh. All tests use mocks and run without AMD hardware. Covers: asset selection (11), runtime patterns (5), HostInfo (4), ROCm version detection (9), torch reinstall (9), index mapping (8), hardware flag (8), tokenizer message (2), install.sh structure (10), and live regression (1). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Harden ROCm support: probe error handling, version cap, validation Address review findings from 8 independent reviewers: - Wrap _ensure_rocm_torch() torch probe in try/except for TimeoutExpired and OSError so a hung or broken torch import does not crash the installer (8/8 reviewers flagged this) - Add torch>=2.4,<2.11.0 version cap to the ROCm reinstall path to prevent installing unsupported torch 2.11.0 from the rocm7.1 index - Use with-statement for file reads in _detect_rocm_version() to avoid resource leaks - Handle ROCM_PATH="" correctly (use `or "/opt/rocm"` instead of default parameter to avoid relative path resolution) - Strengthen shell validation guard from rocm[0-9] to rocm[1-9] to reject rocm0.x tags that would produce nonexistent PyTorch index URLs - Switch shell version cap from blocklist to allowlist (rocm6.*|rocm7.0* |rocm7.1* pass through, everything else caps to rocm7.1) so future ROCm 10+ does not fall through to a nonexistent index - Add sorted() to _ROCM_TORCH_INDEX lookup for defensive ordering - Fix test_probe_timeout_handled: replace zero-assertion test with proper assertions verifying reinstall proceeds after timeout * Clean up rocm_paths list construction in detect_host() Filter None from the ROCM_PATH env var lookup at list construction time instead of relying on the inline `if 
p` guard in the any() call. * Require actual AMD GPU presence before selecting ROCm paths All 8 reviewers across 2 cycles independently flagged that ROCm detection used toolkit/filesystem hints (hipcc, /opt/rocm, rocm-core) as a proxy for GPU presence, which would misroute CPU-only or NVIDIA hosts that happen to have ROCm tools installed. Now all 3 detection points (install.sh, install_python_stack.py, install_llama_prebuilt.py) probe for an actual AMD GPU before entering the ROCm path: - install.sh: check rocminfo for gfx* GPU names, or amd-smi list for device rows, before version detection - install_python_stack.py: new _has_rocm_gpu() function probes rocminfo and amd-smi list before _ensure_rocm_torch() proceeds - install_llama_prebuilt.py: detect_host() probes rocminfo/amd-smi list instead of just checking tool existence or directory paths Also: - Shell test mock amd-smi now handles "list" subcommand - Python tests updated to mock _has_rocm_gpu where needed - Added test_no_gpu_with_rocm_tools_skips to verify the new guard - Test index lookups now use sorted() to match production code * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Harden hipconfig version parsing and torch probe compatibility - Add parts[1].isdigit() check in hipconfig version parsing to handle versions like "6.3-HIP" where the minor component has non-numeric suffix (strip "-" prefix before int() conversion) - Use getattr() in torch probe subprocess to safely handle old or custom torch builds that may lack torch.version.hip/cuda attributes * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Strengthen AMD GPU detection and add NVIDIA precedence guard - Change amd-smi list detection from any-non-empty-output to requiring "gpu" marker in output, matching the shell-side NR>1 check. Prevents false positives from header-only amd-smi list output. 
- Add nvidia-smi check at the top of _ensure_rocm_torch() so mixed AMD+NVIDIA hosts preserve NVIDIA precedence (matching install.sh and install_llama_prebuilt.py behavior). - Apply the same amd-smi marker fix to install_llama_prebuilt.py detect_host() for consistency. * Add Windows-specific ROCm/HIP detection in detect_host() The previous detect_host() ROCm check used rocminfo and amd-smi list which are Linux-only tools. On Windows, has_rocm would always be False, making the Windows HIP prebuilt path at line 1794 unreachable. Now detect_host() uses platform-specific detection: - Linux: rocminfo (check for gfx GPU names) or amd-smi list - Windows: hipinfo.exe, amd-smi, or amdhip64.dll on PATH This allows Windows AMD users to get the HIP prebuilt binary instead of silently falling through to the CPU prebuilt. * Add AMD ROCm gaps: Mamba/SSM source builds, GPU monitoring, Windows messaging, RDNA expansion - worker.py: Add HIP detection to causal-conv1d/mamba-ssm probe, check for hipcc before ROCm source builds, improve status messages and error reporting, add timeout and uv support for the source build fallback - amd.py: New AMD GPU monitoring module via amd-smi metric --json, mirroring nvidia.py structure (utilization, temperature, power, VRAM) - hardware.py: Branch to amd.py when IS_ROCM is True for GPU utilization, visible GPU queries, and physical GPU count - install_python_stack.py: Detect AMD GPUs on Windows and warn that ROCm-enabled PyTorch must be installed manually - kernels/utils.py: Expand is_rdna() to cover RDNA2 (gfx1030-1032), RDNA3 (gfx1102-1103), RDNA3.5 (gfx1150-1152) alongside existing entries - tests: Add 32 new tests covering all changes (95/95 pass) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Harden ROCm detection, fix VRAM heuristic, and expand RDNA2 coverage - Windows ROCm detection: validate actual GPU presence via hipinfo/amd-smi output markers instead of just checking tool existence 
on PATH - _ensure_rocm_torch: validate nvidia-smi actually reports a GPU before giving NVIDIA precedence (fixes AMD-only hosts with stale NVIDIA tools) - amd.py _parse_numeric: handle dict-shaped metric objects from newer amd-smi versions ({"value": 10, "unit": "W"}) and strip MiB/GiB units - amd.py VRAM heuristic: raise threshold from 100k to 10M to correctly handle MI300X (192 GB = 196608 MB) and other high-VRAM GPUs - amd.py visible GPU: use AMD-reported GPU IDs instead of enumerate index so non-dense sets like CUDA_VISIBLE_DEVICES=1,3 report correctly - install.sh: add ROCm <6.0 minimum version guard (no PyTorch wheels exist for older versions); fix rocm7.1* glob to not match rocm7.10+ - is_rdna: add gfx1033-1036 for RDNA2 mobile GPUs (RX 6600M etc.) - worker.py: increase ROCm source build timeout from 600s to 1800s; fix success log message for ROCm source builds - Tests: update mocks for _has_usable_nvidia_gpu, add RDNA2 target asserts * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add HIP_VISIBLE_DEVICES support, unit-aware VRAM parsing, Windows GPU validation - hardware.py: check HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES on ROCm before falling back to CUDA_VISIBLE_DEVICES, so multi-GPU AMD setups with HIP-specific env vars report the correct visible device set - amd.py: add _parse_memory_mb() that reads "unit" from dict-shaped amd-smi JSON (e.g. 
{"value": 192, "unit": "GiB"}) and converts to MB correctly; fixes MI300X VRAM misreported as 0.19 GB instead of 192 GB - install_python_stack.py: Windows AMD warning now validates actual GPU presence via hipinfo/amd-smi output markers before printing - install_llama_prebuilt.py: restore amdhip64.dll fallback for Windows HIP detection after tool-based checks, so Windows HIP installs without CLI tools on PATH are still detected - hardware.py: fix IS_ROCM comment to accurately describe its role * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix HIP_VISIBLE_DEVICES empty-string handling in GPU visibility spec Use explicit None checks instead of Python `or` operator when reading HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES, so that an empty string ("") is correctly honored as "no visible GPUs" rather than silently falling through to CUDA_VISIBLE_DEVICES on mixed ROCm+CUDA systems. * Fix IS_ROCM test assertion for multi-line formatting * Cap torchvision/torchaudio versions, remove amdhip64.dll fallback, fix visible GPU count - Cap torchvision<0.26.0 and torchaudio<2.11.0 alongside torch<2.11.0 in both install.sh and install_python_stack.py to prevent resolver from selecting incompatible companion packages from ROCm wheel index - Remove amdhip64.dll fallback in Windows ROCm detection (DLL presence without hipinfo/amd-smi is not proof of GPU existence) - Fix get_visible_gpu_count() to use _get_parent_visible_gpu_spec() which respects HIP_VISIBLE_DEVICES/ROCR_VISIBLE_DEVICES on ROCm hosts * Attribute is_rdna() RDNA2/3/3.5/4 expansion to PR #4428 The is_rdna() expansion to cover RDNA2 (gfx1030-1036), RDNA3 (gfx1100-1103), RDNA3.5 (gfx1150-1152), and RDNA4 (gfx1200-1201) architectures is based on the original work from PR #4428. 
Co-authored-by: GoldenGrapeGentleman <yueyuan@amd.com> Co-authored-by: billishyahao <bill.he@amd.com> * Support AMD Radeon for studio (#4770) Co-authored-by: Iswarya Alex <iswarya.alex@amd.com> * Remove ROCm test files from main PR Move test_rocm_support.py and shell test additions to a separate PR to keep the main ROCm support PR focused on implementation changes. * Fix installer and hardware detection issues for PR #4720 - Fix empty _tri_arg passed to uv pip install in Radeon path (causes "Empty field is not allowed for PEP508" error) - Fix Radeon fallback: use ROCm index instead of CPU-only when repo.radeon.com is unreachable (TORCH_INDEX_URL already has ROCm) - Use $TORCH_CONSTRAINT in fallback paths instead of hardcoded strings - Fix _pick_radeon_wheel: relax suffix to match manylinux_2_28_x86_64 wheels (AMD Radeon repo does not use bare linux_x86_64 platform tag) - Fix IS_ROCM export: use __getattr__ so callers always see the live value after detect_hardware() runs - Fix apply_gpu_ids: set HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES on ROCm so _get_parent_visible_gpu_spec picks up narrowed GPU set - Fix _parse_memory_mb: distinguish GB (1000 MB) from GiB (1024 MiB) - Add amd-smi version as a fallback in _detect_rocm_version - Fix trailing whitespace and missing newline at EOF in install.sh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix GPU detection false positives and add missing health groups - Fix _has_rocm_gpu() false positive: require "GPU: <number>" data rows from amd-smi list, not just header containing "gpu" - Apply same fix in detect_host() in install_llama_prebuilt.py - Add runtime_payload_health_groups for linux-rocm and windows-hip so partial/corrupt ROCm/HIP prebuilt installs are properly detected - Add bitsandbytes install to Radeon fallback paths (was only in the success path, skipped when repo.radeon.com was unreachable) - Keep DEVICE/CHAT_ONLY as direct imports in __init__.py 
(matching main) and only use __getattr__ for IS_ROCM * Fix _ensure_rocm_torch and Windows AMD warning false positives - _ensure_rocm_torch: only skip when HIP is already present, not for CUDA builds (which are unusable on AMD-only hosts). Fixes the case where a venv has a stale CUDA wheel and the repair step is skipped. - Windows AMD warning: use GPU data row check (same as Linux fix) to avoid false positives from amd-smi list header-only output. * Fix amd-smi GPU detection for GPU[N] output format Older amd-smi versions output "GPU[0] : Card series: ..." instead of "GPU: 0". The regex now matches both "GPU: <digit>" and "GPU[<digit>" formats to detect actual GPU data rows. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Harden AMD GPU detection against false positives - install.sh: replace weak amd-smi list check (awk 'NR>1 && NF') with strict pattern matching GPU data rows (/^GPU[[:space:]]*[:\[]/) - All files: reject rocminfo gfx000 (CPU HSA agent) by requiring gfx[1-9] instead of gfx[0-9] in the rocminfo GPU probe - Fixes false positives on hosts with ROCm tools but no AMD GPU * Remove duplicate comment from pre-commit merge * Refactor: deduplicate AMD detection, consolidate bitsandbytes, clean up imports - Extract _has_amd_rocm_gpu() shell function to avoid duplicating the rocminfo/amd-smi GPU detection logic in get_torch_index_url and the Radeon auto-detect block - Consolidate bitsandbytes install into a single case block after torch install (was duplicated 4 times across Radeon success/fallback paths) - Move math and re imports to top of amd.py (were inline in functions) - Add _smi_query() helper in hardware.py to centralize IS_ROCM backend selection for get_gpu_utilization and get_visible_gpu_utilization Addresses Gemini code review suggestions. * Fix VRAM parsing for string values and GB/GiB consistency - Extract unit from string-valued VRAM fields (e.g. 
"192 GiB") so _parse_memory_mb correctly applies the unit multiplier instead of treating the value as bare MB - Treat GB and GiB identically (both as binary x1024) since GPU tools including amd-smi use binary units even when labeling them "GB" - Fixes incorrect VRAM reporting on MI300-class cards (was showing ~0.19 GB instead of 192 GB for string-valued outputs) * Add --no-cache to uv for ROCm HIP source builds Avoid stale cache artifacts from partial HIP source builds when uv is used for causal-conv1d/mamba-ssm compilation on ROCm. The pip path already uses --no-cache-dir; this adds the uv equivalent (--no-cache) only when is_hip is True. * Fix critical: initialize _amd_gpu_radeon before case block _amd_gpu_radeon was only set inside the */rocm*) case arm, so on NVIDIA/CPU/macOS paths where TORCH_INDEX_URL does not contain "rocm", the variable was unbound. With set -u (nounset) enabled, this crashes the installer for every non-AMD user. Move initialization to before the case block so it is always defined. * Fix Windows AMD: route has_rocm hosts to HIP prebuilt path resolve_release_asset_choice was selecting windows-cpu for all Windows x86_64 hosts including those with has_rocm=True. Windows AMD users should fall through to resolve_upstream_asset_choice which tries the HIP prebuilt first. Add "not host.has_rocm" guard to the published windows-cpu selection. * Harden ROCm detection, Radeon wheel fallback, and HIP visibility Addresses review findings from parallel reviewers on PR #4720: - install.sh: add _has_usable_nvidia_gpu() helper requiring nvidia-smi -L to actually list a GPU before treating the host as NVIDIA. Fixes the stale-nvidia-smi-on-PATH regression where AMD-only hosts fell into the CUDA branch. - install.sh: fix hipconfig awk blocks to propagate a non-zero exit code when the output is not a recognisable version string, so the ||-chain continues to dpkg-query / rpm instead of terminating early. - install.sh: fail-closed on Radeon wheel fallback. 
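# Illustrative sketch (hypothetical helper, not part of the installer and not
# called anywhere): the rocminfo check in _has_rocm_gpu() further down in this
# module treats output as a real GPU only if it contains "gfx", a nonzero
# digit, then two or three more alphanumerics. This helper isolates that
# pattern so it can be exercised without ROCm installed. The name
# _looks_like_gfx_gpu_id is an assumption made for illustration only.
def _looks_like_gfx_gpu_id(text: str) -> bool:
    import re

    # Same pattern as the rocminfo probe. "gfx000" (CPU agent) fails the
    # nonzero-digit rule, and generic ISA names such as "gfx11-generic" fail
    # because a "-" appears before the 2-3 trailing characters are matched.
    return bool(re.search(r"gfx[1-9][0-9a-z]{2,3}", text.lower()))


# Usage (illustrative): _looks_like_gfx_gpu_id("Name: gfx1100") is True,
# while _looks_like_gfx_gpu_id("Name: gfx11-generic") is False.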
def _detect_rocm_version() -> tuple[int, int] | None:
    """Return (major, minor) of the installed ROCm stack, or None."""
    import re

    # Check /opt/rocm/.info/version or the ROCM_PATH equivalent.
    rocm_root = os.environ.get("ROCM_PATH") or "/opt/rocm"
    for path in (
        os.path.join(rocm_root, ".info", "version"),
        os.path.join(rocm_root, "lib", "rocm_version"),
    ):
        try:
            with open(path) as fh:
                parts = fh.read().strip().split("-")[0].split(".")
            # Explicit length guard avoids relying on the broad except
            # below to swallow IndexError when the version file contains
            # a single component (e.g. "6\n" on a partial install).
            if len(parts) >= 2:
                return int(parts[0]), int(parts[1])
        except Exception:
            pass
    # Try amd-smi version (outputs "... | ROCm version: X.Y.Z").
    amd_smi = shutil.which("amd-smi")
    if amd_smi:
        try:
            result = subprocess.run(
                [amd_smi, "version"],
                stdout = subprocess.PIPE,
                stderr = subprocess.DEVNULL,
                text = True,
                timeout = 5,
            )
            if result.returncode == 0:
                m = re.search(r"ROCm version:\s*(\d+)\.(\d+)", result.stdout)
                if m:
                    return int(m.group(1)), int(m.group(2))
        except Exception:
            pass
    # Try hipconfig --version (outputs a bare version like "6.3.21234.2").
    hipconfig = shutil.which("hipconfig")
    if hipconfig:
        try:
            result = subprocess.run(
                [hipconfig, "--version"],
                stdout = subprocess.PIPE,
                stderr = subprocess.DEVNULL,
                timeout = 5,
            )
            if result.returncode == 0:
                raw = result.stdout.decode().strip().split("\n")[0]
                parts = raw.split(".")
                if (
                    len(parts) >= 2
                    and parts[0].isdigit()
                    and parts[1].split("-")[0].isdigit()
                ):
                    return int(parts[0]), int(parts[1].split("-")[0])
        except Exception:
            pass
    # Distro package-manager fallbacks. Package-managed ROCm installs can
    # expose GPUs via rocminfo / amd-smi but still lack /opt/rocm/.info/version
    # and hipconfig, so probe dpkg (Debian/Ubuntu) and rpm (RHEL/Fedora/SUSE)
    # for the rocm-core package version. Matches the chain in
    # install.sh::get_torch_index_url so `unsloth studio update` behaves
    # the same as a fresh `curl | sh` install.
    for cmd in (
        # Use the long --showformat= form: dpkg-query's short -f option takes
        # the rest of the argument verbatim, so "-f=..." would emit a leading
        # literal "=" and the version would never parse.
        ["dpkg-query", "-W", "--showformat=${Version}\n", "rocm-core"],
        ["rpm", "-q", "--qf", "%{VERSION}\n", "rocm-core"],
    ):
        exe = shutil.which(cmd[0])
        if not exe:
            continue
        try:
            result = subprocess.run(
                [exe, *cmd[1:]],
                stdout = subprocess.PIPE,
                stderr = subprocess.DEVNULL,
                text = True,
                timeout = 5,
            )
        except Exception:
            continue
        if result.returncode != 0 or not result.stdout.strip():
            continue
        raw = result.stdout.strip()
        # dpkg can prepend an epoch ("1:6.3.0-1"); strip it before parsing.
        raw = re.sub(r"^\d+:", "", raw)
        m = re.match(r"(\d+)[.-](\d+)", raw)
        if m:
            return int(m.group(1)), int(m.group(2))
    return None
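

# Illustrative sketch (hypothetical helper, not part of the installer and not
# called anywhere): the package-manager fallback in _detect_rocm_version()
# reduces a raw version string to (major, minor) by stripping a dpkg epoch and
# matching the leading "X.Y". This pure function mirrors that parsing so it
# can be tested without ROCm installed; it also accepts the bare strings the
# version file and hipconfig produce. The name _parse_rocm_version_string is
# an assumption made for illustration only.
def _parse_rocm_version_string(raw: str) -> tuple[int, int] | None:
    import re

    raw = raw.strip()
    # Drop a dpkg epoch prefix such as "1:" before parsing.
    raw = re.sub(r"^\d+:", "", raw)
    # Require at least two numeric components; "6" alone is rejected, which
    # matches the explicit length guard used for the version file.
    m = re.match(r"(\d+)[.-](\d+)", raw)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


# Usage (illustrative): "1:6.3.0-1" and "6.3.0-39" both parse to (6, 3),
# while a one-component string such as "6" returns None.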


def _has_rocm_gpu() -> bool:
    """Return True only if an actual AMD GPU is visible (not just ROCm tools installed)."""
    import re

    for cmd, check_fn in (
        # rocminfo: look for a real gfx GPU id (3-4 chars, nonzero first digit).
        # gfx000 is the CPU agent; ROCm 6.1+ also emits generic ISA lines like
        # "gfx11-generic" or "gfx9-4-generic" which only have 1-2 digits before
        # the dash and must not be treated as a real GPU.
        (
            ["rocminfo"],
            lambda out: bool(re.search(r"gfx[1-9][0-9a-z]{2,3}", out.lower())),
        ),
Co-authored-by: GoldenGrapeGentleman <yueyuan@amd.com> Co-authored-by: billishyahao <bill.he@amd.com> * Support AMD Radeon for studio (#4770) Co-authored-by: Iswarya Alex <iswarya.alex@amd.com> * Remove ROCm test files from main PR Move test_rocm_support.py and shell test additions to a separate PR to keep the main ROCm support PR focused on implementation changes. * Fix installer and hardware detection issues for PR #4720 - Fix empty _tri_arg passed to uv pip install in Radeon path (causes "Empty field is not allowed for PEP508" error) - Fix Radeon fallback: use ROCm index instead of CPU-only when repo.radeon.com is unreachable (TORCH_INDEX_URL already has ROCm) - Use $TORCH_CONSTRAINT in fallback paths instead of hardcoded strings - Fix _pick_radeon_wheel: relax suffix to match manylinux_2_28_x86_64 wheels (AMD Radeon repo does not use bare linux_x86_64 platform tag) - Fix IS_ROCM export: use __getattr__ so callers always see the live value after detect_hardware() runs - Fix apply_gpu_ids: set HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES on ROCm so _get_parent_visible_gpu_spec picks up narrowed GPU set - Fix _parse_memory_mb: distinguish GB (1000 MB) from GiB (1024 MiB) - Add amd-smi version as a fallback in _detect_rocm_version - Fix trailing whitespace and missing newline at EOF in install.sh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix GPU detection false positives and add missing health groups - Fix _has_rocm_gpu() false positive: require "GPU: <number>" data rows from amd-smi list, not just header containing "gpu" - Apply same fix in detect_host() in install_llama_prebuilt.py - Add runtime_payload_health_groups for linux-rocm and windows-hip so partial/corrupt ROCm/HIP prebuilt installs are properly detected - Add bitsandbytes install to Radeon fallback paths (was only in the success path, skipped when repo.radeon.com was unreachable) - Keep DEVICE/CHAT_ONLY as direct imports in __init__.py 
(matching main) and only use __getattr__ for IS_ROCM * Fix _ensure_rocm_torch and Windows AMD warning false positives - _ensure_rocm_torch: only skip when HIP is already present, not for CUDA builds (which are unusable on AMD-only hosts). Fixes the case where a venv has a stale CUDA wheel and the repair step is skipped. - Windows AMD warning: use GPU data row check (same as Linux fix) to avoid false positives from amd-smi list header-only output. * Fix amd-smi GPU detection for GPU[N] output format Older amd-smi versions output "GPU[0] : Card series: ..." instead of "GPU: 0". The regex now matches both "GPU: <digit>" and "GPU[<digit>" formats to detect actual GPU data rows. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Harden AMD GPU detection against false positives - install.sh: replace weak amd-smi list check (awk 'NR>1 && NF') with strict pattern matching GPU data rows (/^GPU[[:space:]]*[:\[]/) - All files: reject rocminfo gfx000 (CPU HSA agent) by requiring gfx[1-9] instead of gfx[0-9] in the rocminfo GPU probe - Fixes false positives on hosts with ROCm tools but no AMD GPU * Remove duplicate comment from pre-commit merge * Refactor: deduplicate AMD detection, consolidate bitsandbytes, clean up imports - Extract _has_amd_rocm_gpu() shell function to avoid duplicating the rocminfo/amd-smi GPU detection logic in get_torch_index_url and the Radeon auto-detect block - Consolidate bitsandbytes install into a single case block after torch install (was duplicated 4 times across Radeon success/fallback paths) - Move math and re imports to top of amd.py (were inline in functions) - Add _smi_query() helper in hardware.py to centralize IS_ROCM backend selection for get_gpu_utilization and get_visible_gpu_utilization Addresses Gemini code review suggestions. * Fix VRAM parsing for string values and GB/GiB consistency - Extract unit from string-valued VRAM fields (e.g. 
"192 GiB") so _parse_memory_mb correctly applies the unit multiplier instead of treating the value as bare MB - Treat GB and GiB identically (both as binary x1024) since GPU tools including amd-smi use binary units even when labeling them "GB" - Fixes incorrect VRAM reporting on MI300-class cards (was showing ~0.19 GB instead of 192 GB for string-valued outputs) * Add --no-cache to uv for ROCm HIP source builds Avoid stale cache artifacts from partial HIP source builds when uv is used for causal-conv1d/mamba-ssm compilation on ROCm. The pip path already uses --no-cache-dir; this adds the uv equivalent (--no-cache) only when is_hip is True. * Fix critical: initialize _amd_gpu_radeon before case block _amd_gpu_radeon was only set inside the */rocm*) case arm, so on NVIDIA/CPU/macOS paths where TORCH_INDEX_URL does not contain "rocm", the variable was unbound. With set -u (nounset) enabled, this crashes the installer for every non-AMD user. Move initialization to before the case block so it is always defined. * Fix Windows AMD: route has_rocm hosts to HIP prebuilt path resolve_release_asset_choice was selecting windows-cpu for all Windows x86_64 hosts including those with has_rocm=True. Windows AMD users should fall through to resolve_upstream_asset_choice which tries the HIP prebuilt first. Add "not host.has_rocm" guard to the published windows-cpu selection. * Harden ROCm detection, Radeon wheel fallback, and HIP visibility Addresses review findings from parallel reviewers on PR #4720: - install.sh: add _has_usable_nvidia_gpu() helper requiring nvidia-smi -L to actually list a GPU before treating the host as NVIDIA. Fixes the stale-nvidia-smi-on-PATH regression where AMD-only hosts fell into the CUDA branch. - install.sh: fix hipconfig awk blocks to propagate a non-zero exit code when the output is not a recognisable version string, so the ||-chain continues to dpkg-query / rpm instead of terminating early. - install.sh: fail-closed on Radeon wheel fallback. 
When torch, torchvision or torchaudio is missing from the Radeon repo for the active Python tag, fall back to the standard ROCm index instead of silently mixing Radeon wheels with PyPI defaults. Quote all wheel arguments individually so wheel filenames cannot be word-split or glob-expanded. - install_llama_prebuilt.py: detect_host() now requires nvidia-smi -L to list a GPU before setting has_physical_nvidia. Routes AMD ROCm hosts with a broken leftover nvidia-smi to the ROCm path instead of misclassifying them as NVIDIA. - install_llama_prebuilt.py: scan upstream assets for any rocm-<version> prebuilt instead of hard-coding rocm-7.2, so ROCm 6.x / 7.0 / 7.1 / 7.3+ users pick up a matching upstream prebuilt when one exists. - install_llama_prebuilt.py: validate_server() adds --n-gpu-layers 1 for linux-rocm and windows-hip hosts, so new HIP prebuilts are preflighted on the GPU path instead of passing validation on CPU only. - install_llama_prebuilt.py: restore the published windows-cpu fallback for AMD Windows hosts without a HIP prebuilt so hash-approved bundles are still preferred over the raw upstream CPU asset. - install_python_stack.py: drop the /opt/rocm / hipcc gate in _ensure_rocm_torch() and rely on _has_rocm_gpu(). Runtime-only ROCm installs (package-managed minimal installs, Radeon software) that ship amd-smi / rocminfo without hipcc can now repair a CPU-only venv via "unsloth studio update". Adds an explicit IS_WINDOWS / IS_MACOS guard. - studio/backend/utils/hardware/amd.py: honour HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES in get_primary_gpu_utilization(). A process restricted to GPU 2 now reports metrics for GPU 2 instead of physical GPU 0. Tighten the plain bytes unit detection to an explicit allowlist. - studio/backend/utils/hardware/hardware.py: route get_backend_visible_gpu_info()'s backend_cuda_visible_devices field through a helper that reads HIP_VISIBLE_DEVICES on ROCm. 
Drop the unconditional "(rocm=False)" suffix in apply_gpu_ids() logs. * Fix round 2 regressions: ROCm validate_server and Windows HIP routing Follow-up to 810b833b addressing review findings on the first round of hardening commits: - install_llama_prebuilt.py validate_server: gate --n-gpu-layers on the resolved install_kind instead of host.has_rocm. AMD Windows hosts without a HIP prebuilt fall back to windows-cpu and must not be validated with GPU layers; thread install_kind through from the caller. - install_llama_prebuilt.py resolve_release_asset_choice: reinstate the "not has_rocm" guard on the published windows-cpu bundle so AMD Windows hosts reach resolve_upstream_asset_choice() where the new HIP prebuilt path lives. Prefer a published windows-hip bundle first when one exists, fall through to upstream HIP + upstream CPU otherwise. - install_llama_prebuilt.py detect_host: also set has_physical_nvidia when the secondary --query-gpu block confirms a working NVIDIA GPU, so older nvidia-smi versions without -L support do not silently skip the Linux diagnostics that key off has_physical_nvidia. - install_llama_prebuilt.py: drop redundant "import re as _re" / "import re as _re_rocm" local aliases in favour of the existing top-level "import re". - install_python_stack.py _ensure_rocm_torch: run the AMD bitsandbytes install unconditionally after the HIP-torch probe so "unsloth studio update" on venvs that already have ROCm torch still gains the AMD bitsandbytes build. - install.sh: add a non-x86_64 early-exit to get_torch_index_url() so aarch64 / arm64 Linux hosts do not hit the ROCm wheel index (PyTorch only publishes ROCm wheels for linux_x86_64). - install.sh: add bitsandbytes install to the migrated-environment branch so upgrades pick it up for ROCm hosts instead of only the fresh-install path. 
- install.sh: in the Radeon wheel path, pass version constraints + --no-index --find-links to uv instead of explicit wheel URLs so a version-compatible torch / torchvision / torchaudio triple is resolved, rather than picking the highest-version wheel for each package independently. - studio/backend/utils/hardware/amd.py _first_visible_amd_gpu_id: fall through to lower-priority visibility env vars when the first entry is malformed (leading comma, all-whitespace first token) instead of silently returning GPU 0. * Fix round 3 findings: x86_64 guard, ROCm version clip, Radeon deps Address issues surfaced by the round 3 reviewers on top of 8636fa63: - install_python_stack.py _ensure_rocm_torch: add the same `x86_64` guard that install.sh already has. Linux aarch64 / arm64 ROCm hosts must skip the repair path entirely; PyTorch only publishes ROCm wheels for linux_x86_64, and without this guard `unsloth studio update` aborts with a missing-wheel error on non x86_64 hosts. - install_llama_prebuilt.py resolve_upstream_asset_choice: add a best-effort _detect_host_rocm_version() helper (reading /opt/rocm/.info/version, amd-smi version, hipconfig --version) and filter rocm_candidates to entries whose major.minor is <= host version. Falls back to the newest candidate only when no compatible one exists, so a ROCm 6.4 host downloads rocm-6.4 instead of being handed the numerically newest rocm-7.2 bundle (which fails preflight and forces a source build). - install.sh: remove the round 2 --no-index switch from the Radeon wheel branch. --no-index forced uv to ignore PyPI entirely, which broke transitive dependency resolution (filelock, sympy, networkx, jinja2, fsspec, setuptools, typing-extensions, ...) on a fresh venv. Restore the round 1 explicit wheel URL invocation but add a torch / torchvision / torchaudio version-pair sanity check so a mismatched trio (e.g. 
torch 2.9.1 + torchvision 0.23.0 + torchaudio 2.9.0) falls back to the standard ROCm index instead of installing a broken combination. - install_python_stack.py _ensure_rocm_torch: restructure the "tag is None" path so it no longer short-circuits the bitsandbytes install. On a ROCm runtime older than anything in _ROCM_TORCH_INDEX, print the "no wheel" warning but still run the AMD bitsandbytes install. - studio/backend/core/training/worker.py: restore the pre-PR "no timeout" behaviour for non-HIP causal-conv1d / mamba-ssm source builds. The round 2 "timeout = 1800 if is_hip else 300" cap aborts slow non-HIP builds (Linux aarch64, unsupported torch/CUDA combos) after 5 minutes; omit timeout for the non-HIP branch so the cap only applies to ROCm source builds. * Fix round 4 findings: apply_gpu_ids env inheritance, Radeon X.Y, bitsandbytes gate Address remaining issues surfaced by the round 4 reviewers: - studio/backend/utils/hardware/hardware.py apply_gpu_ids: mirror the selection into HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES whenever the caller already had a ROCm visibility env var set, not only when IS_ROCM has already been set by detect_hardware(). Training and inference workers call apply_gpu_ids() before detect_hardware() runs, so the old guard would leave a forked ROCm worker with a stale HIP_VISIBLE_DEVICES mask that no longer matched the narrowed CUDA_VISIBLE_DEVICES selection. - install.sh get_radeon_wheel_url: accept X.Y ROCm versions in addition to X.Y.Z. The `/opt/rocm/.info/version` file and some hipconfig versions report only two components, and the Radeon repository publishes both rocm-rel-X.Y.Z/ and rocm-rel-X.Y/ directories, so treating X.Y as invalid caused Radeon hosts to fall back to the generic ROCm index even when a matching AMD wheel set existed. - install_python_stack.py _ensure_rocm_torch: only install the AMD bitsandbytes build when the venv actually has a ROCm-compatible torch (either already present or just installed by this function). 
Previously the bitsandbytes install ran unconditionally, which could leave an AMD bitsandbytes layered on top of a CPU/CUDA torch on hosts where the ROCm runtime is older than any entry in _ROCM_TORCH_INDEX. Also add --force-reinstall so an existing CPU/CUDA bitsandbytes is replaced by the AMD build during upgrades. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix gemini findings: amd-smi metric envelope validation and dict-wrapped GPU id Two medium-severity defensive fixes from the gemini-code-assist review on the AMD monitoring backend: 1. _extract_gpu_metrics may return a dict where every value is None when amd-smi succeeds (zero exit) but the JSON envelope contains no usable fields (error response, unsupported card). The new _has_real_metrics helper lets get_primary_gpu_utilization surface available:False and lets get_visible_gpu_utilization skip ghost device rows so the UI does not render placeholder cards with empty numbers. 2. Newer amd-smi versions wrap scalar fields as {"value": 0, "unit": "none"}, including the per-GPU id. The previous int(raw_id) call silently fell back to the enumeration index in that case, losing the real GPU id. Routing raw_id through the existing _parse_numeric helper handles bare ints, floats, strings, and the dict shape uniformly, with a debug log on parse failure. * Fix gemini round 2 findings: explicit length guard on ROCm version file parser Both _detect_rocm_version (install_python_stack.py) and _detect_host_rocm_version (install_llama_prebuilt.py) read /opt/rocm/.info/version or $ROCM_PATH/lib/rocm_version, split on "." and unconditionally accessed parts[1]. The surrounding broad `except Exception: pass` already swallowed the resulting IndexError, so a one-component file like "6\n" did fall through to the next detection source -- but the control flow relied on exception handling instead of an explicit check. 
Add `if len(parts) >= 2:` guards in both helpers so the loop falls through on its own without raising. Behaviour is unchanged for the common multi- component case; the previously-silent IndexError path becomes an explicit no-op. * Fix gemini round 3: include has_rocm in validate_server fallback path When validate_server is called without an explicit install_kind (older call sites that have not been updated), the fallback was only enabling --n-gpu-layers for NVIDIA and macOS arm64 hosts. AMD ROCm Linux hosts fell through to the CPU validation path even though the prebuilt being exercised was a HIP binary. Add host.has_rocm to the fallback expression so the GPU offload flag is applied consistently with the install_kind=='linux-rocm' / 'windows-hip' branches above. * Fix gemini round 4: remove risky bytes-vs-MB heuristic in _parse_memory_mb The previous heuristic divided any bare number above 10_000_000 by 1024*1024 on the assumption that large unit-less values were bytes. This misclassified small VRAM allocations: 5 MB of used VRAM reported as 5_242_880 bytes without a unit would be taken at face value and render as 5_242_880 MB (~5 TB) in the monitoring UI. Modern amd-smi always provides explicit units (MiB/GiB dict form), and legacy amd-smi returns bare numbers in MB -- the heuristic never had a real workload to handle. Drop it and default to MB for bare numeric input, keeping the existing unit-aware branches for dict / string inputs unchanged. The unrelated gemini suggestion to "default minor to 0" in the amd-smi version awk parser was intentionally NOT applied: rocm7.0 and rocm7.1 ship different wheel sets, so silently substituting 0 for a missing minor could install the wrong wheels. The existing reject-and-fall-through behaviour is safer. * Fix gemini round 5: POSIX compliance and leading-comma visibility parsing Three medium findings from gemini-code-assist addressed in this commit: 1. 
_pick_radeon_wheel used grep -o and sort -V, both GNU extensions that are not in POSIX and break on BSD/BusyBox coreutils. install.sh has a #!/bin/sh shebang so the whole pipeline was rewritten as a single awk script that extracts all href="..." hits on each line, filters to wheels matching the package prefix and python tag, and picks the newest version via zero-padded lexical comparison. No external sort or grep is needed. 2. _first_visible_amd_gpu_id in the AMD monitoring backend treated a leading comma (e.g. HIP_VISIBLE_DEVICES=",1") as "fall through to the next env var", which is surprising given the clear intent to narrow to device 1. Filter empty tokens after the split and return the first real one. An all-commas value ("," / ",,,") still falls through because no real tokens exist; the empty-string and "-1" explicit-zero cases are unchanged. The unrelated amd-smi version awk parser suggestion was not applied (see round 4 commit message for rationale: defaulting a missing minor to 0 could silently install the wrong ROCm wheel set). * Fix 20-reviewer.py findings: base drift, Radeon %2B, dpkg/rpm fallback, bnb, backend label Consolidated fix batch from a 20-parallel reviewer.py run on the current head. Each fix is drawn from a high-consensus finding and addresses a real bug or feature gap, not a stylistic preference. 1. install.sh: bump `unsloth>=2026.4.2` -> `unsloth>=2026.4.4` at five call sites so this branch no longer regresses main's version floor (main bumped to 2026.4.4 in #4876). Without this, merging 4720 would silently downgrade the minimum version pin for fresh installs. 2. install.sh: URL-decode Radeon wheel names before extracting the torch / torchvision / torchaudio version strings. 
Real wheel URLs from repo.radeon.com are percent-encoded ("torch-2.10.0%2Brocm7.2.0...") so the previous `[+-]` terminator in the sed regex never matched, `_torch_ver` stayed empty, `_radeon_versions_match` stayed false, and every Radeon consumer install silently fell back to the generic ROCm index. Now decode %2B -> + first, then extract, then validate. 3. install.sh: the two AMD bitsandbytes install lines were running `uv pip install "bitsandbytes>=0.49.1"` without `--force-reinstall`, so upgrades where the venv already has a CPU/CUDA bitsandbytes satisfying the constraint would keep the stale non-AMD wheel. Add `--force-reinstall --no-cache-dir` to both call sites, matching the pattern already used in install_python_stack.py::_ensure_rocm_torch. 4. install_python_stack.py and install_llama_prebuilt.py: add `dpkg-query -W rocm-core` and `rpm -q rocm-core` fallbacks to the Python-side ROCm version detectors so they match the chain in install.sh::get_torch_index_url. Package-managed ROCm installs (Debian/Ubuntu/RHEL/Fedora distro packages) can expose GPUs via rocminfo/amd-smi but still lack /opt/rocm/.info/version, hipconfig, or amd-smi `version` output -- without these fallbacks, `unsloth studio update` on such hosts returned None and skipped the ROCm torch repair. Also strip the dpkg epoch prefix ("1:6.3.0-1") before parsing so epoch-annotated packages parse correctly. 5. hardware.py: add a `_backend_label(device)` helper that returns "rocm" when IS_ROCM is set and the device is DeviceType.CUDA, and use it for every `"backend": ...` emission in JSON responses served to the Studio frontend. Internally we still represent ROCm hosts as DeviceType.CUDA (ROCm torch reuses the whole torch.cuda.* API surface), but the user-facing API now correctly reports "rocm" on AMD boxes instead of labeling them as "cuda". 
All 250 simulation scenarios pass (was 233 before this batch: added 17 new regression tests covering the version pin, %2B decoding, bnb force-reinstall flags, dpkg/rpm fallback presence, and the _backend_label helper's four-way truth table). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix gemini round 6 + URL audit: amd.py defensive checks, rocm6.5+ clip to 6.4 Two rounds of fixes in one commit, plus a full URL audit of every PyPI / download.pytorch.org / repo.radeon.com reference the PR introduces. amd.py (4 medium gemini findings on commit b3627bc2): 1. _extract_gpu_metrics used `and vram_total_mb` as part of the vram_util gate. The follow-up `vram_total_mb > 0` already handles the division guard, but the truthiness check was redundant and slightly surprising for a 0.0 valid value. Replace with explicit `is not None and > 0` for both vram_util and power_util. 2. get_physical_gpu_count called `data.get("gpu", ...)` without guarding for non-dict envelopes. A scalar / string JSON response from amd-smi would raise AttributeError. Add an isinstance(data, dict) check and return None for unexpected shapes. 3. get_visible_gpu_utilization had the same .get() exposure on the outer envelope. Rewrite the gpu_list extraction as an explicit list/dict/else cascade so a malformed scalar envelope produces gpu_list=[data] and continues without raising. 4. The same function's per-entry loop also called gpu_data.get() on whatever was inside gpu_list. If a scalar ever leaks into the list (directly or via the previous fix's fallback), _extract_gpu_metrics would raise on the first .get() inside the helper. Skip non-dict entries in the loop before extracting metrics. install.sh (URL audit finding, previously flagged by 20-reviewer as #13): 5. 
get_torch_index_url used `rocm6.*` in the rocm tag case statement, which matched rocm6.5 and rocm6.6 and emitted download.pytorch.org/whl/rocm6.5 -- which returns HTTP 403 because PyTorch only publishes rocm 5.7, 6.0-6.4, 7.0-7.2. Enumerate the supported 6.x minors explicitly and add a rocm6.* fallback branch that clips to rocm6.4 (the last supported 6.x wheel set). URL audit results (all URLs PR 4720 references): - 14/14 download.pytorch.org/whl/{cpu,cu118,cu124,cu126,cu128,cu130, rocm6.0..6.4,rocm7.0..7.2} return HTTP 200. - 9/9 repo.radeon.com/rocm/manylinux/rocm-rel-{5.7,6.0,6.1,6.2,6.3, 6.4,7.0,7.1,7.2}/ return HTTP 200. - X.Y.Z patch directories exist for 7.0.2, 7.1.1, 7.2.1 but NOT for 6.3.0, 6.4.0, 6.2.1 -- install.sh already handles this via the X.Y.Z -> X.Y fallback sed in the Radeon wheel install block. - Docs links (rocm.docs.amd.com, docs.unsloth.ai AMD guide) and the llama.cpp GitHub releases API endpoint all return 200. Test suite: 255 -> 258. New regression coverage: - U17: get_physical_gpu_count tolerates scalar amd-smi envelope - U18: get_visible_gpu_utilization tolerates scalar envelope - U19a-c: vram_util / power_util return None on zero total, but vram_total_gb still echoes 0.0 (not None) - A_rocm{6.5,6.6,6.9}_clips_to_rocm64: install.sh clips unsupported 6.x minors to rocm6.4 instead of producing a 403 index URL * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix reviewer.py round 2: tokenizer AMD multi-GPU, --no-torch bnb, main.py backend label Three high-confidence findings from a second 20-parallel reviewer.py run on commit 7effb3ae. Triaged 15 total findings and applied the three that were confirmed as real bugs; the rest were either false positives (e.g. "migrated AMD venv not repaired" -- _ensure_rocm_torch runs downstream via setup.sh regardless), design decisions (e.g. 
visibility mask env vars not consulted in installer detection), or edge cases the existing fallback logic already handles. 1. unsloth/tokenizer_utils.py [6/20]: the multi-GPU guard's shell probe runs `nvidia-smi --query-gpu=memory.used`, catches the failure, then only raises if `torch.cuda.is_available()` is False. On ROCm torch, torch.cuda.is_available() returns True (ROCm reuses the torch.cuda.* API), so the guard becomes dead code on AMD hosts and multi-GPU AMD setups slip through even though unsloth does not support them yet. Add a torch.cuda.device_count() > 1 fallback inside the except so AMD multi-visible-device setups are flagged consistently with the original CUDA memory check. 2. install.sh [1/20]: the fresh-install bitsandbytes block for AMD ROCm ran unconditionally when TORCH_INDEX_URL matched `*/rocm*`, even when SKIP_TORCH=true (from --no-torch or Intel Mac auto-detect). A user running `install.sh --no-torch` on an AMD host would still pull in bitsandbytes despite explicitly asking for GGUF-only mode. Wrap the case block in an outer `[ "$SKIP_TORCH" = false ]` guard. 3. studio/backend/main.py [3/20]: the /api/system endpoint returned `"device_backend": get_device().value`, which is "cuda" on ROCm hosts (because ROCm torch piggybacks on torch.cuda). Other endpoints (hardware.py) already use the _backend_label helper which swaps "cuda" -> "rocm" when IS_ROCM. Route /api/system through the same helper so the Studio UI reports the backend consistently across all endpoints. 4. studio/backend/tests/test_utils.py: update test_backend_matches_device to call _backend_label(get_device()) instead of raw get_device().value so the test matches the new contract and still passes on CUDA hosts. Tests: 258 -> 261. 
        # amd-smi list: require GPU data rows ("GPU: 0" or the older
        # "GPU[0] : ..." format), not just a header line
        (
            ["amd-smi", "list"],
            lambda out: bool(re.search(r"(?im)^gpu\s*[:\[]\s*\d", out)),
        ),
    ):
        exe = shutil.which(cmd[0])
        if not exe:
            continue
        try:
            result = subprocess.run(
                [exe, *cmd[1:]],
                stdout = subprocess.PIPE,
                stderr = subprocess.DEVNULL,
                text = True,
                timeout = 10,
            )
        except Exception:
            continue
        if result.returncode == 0 and result.stdout.strip():
            if check_fn(result.stdout):
                return True
    return False
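

# Illustrative sketch (hypothetical helper, not called by the installer):
# the amd-smi "list" check above, rewritten as a pure function so the row
# pattern can be exercised against captured output without amd-smi
# installed. It accepts "GPU: 0" and older "GPU[0] : ..." data rows but
# rejects a header-only listing. The sample strings below are made up and
# only mimic those two row shapes.
def _sketch_amd_smi_lists_gpu(output: str) -> bool:
    """Return True when *output* contains at least one GPU data row."""
    import re  # local import keeps the sketch self-contained

    # (?im): case-insensitive, and ^ anchors at the start of every line.
    return bool(re.search(r"(?im)^gpu\s*[:\[]\s*\d", output))


# e.g. _sketch_amd_smi_lists_gpu("GPU: 0\n  BDF: 0000:03:00.0") -> True
#      _sketch_amd_smi_lists_gpu("GPU  BDF  UUID") -> False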


def _has_usable_nvidia_gpu() -> bool:
    """Return True only when nvidia-smi exists AND `nvidia-smi -L` lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if not exe:
        return False
    try:
        result = subprocess.run(
            [exe, "-L"],
            stdout = subprocess.PIPE,
            stderr = subprocess.DEVNULL,
            text = True,
            timeout = 10,
        )
    except Exception:
        return False
    return result.returncode == 0 and "GPU " in result.stdout
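

# Illustrative sketch (hypothetical helper, not called by the installer):
# the wheel-tag lookup that _ensure_rocm_torch() performs below, written
# as a pure function. *index* stands in for _ROCM_TORCH_INDEX, whose real
# entries are defined elsewhere in this module; the example map in the
# trailing comment is made up.
from typing import Mapping, Optional, Tuple


def _sketch_pick_rocm_tag(
    ver: Tuple[int, int],
    index: Mapping[Tuple[int, int], str],
) -> Optional[str]:
    """Return the tag of the newest (major, minor) entry <= *ver*, else None."""
    # Walk newest-first; tuples compare element-wise, so (6, 4) >= (6, 2)
    # and (7, 0) > (6, 9).
    for (major, minor), tag in sorted(index.items(), reverse = True):
        if ver >= (major, minor):
            return tag
    return None


# e.g. with {(6, 0): "rocm6.0", (6, 2): "rocm6.2", (7, 1): "rocm7.1"}:
#   (6, 4) -> "rocm6.2", (7, 2) -> "rocm7.1", (5, 7) -> None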


def _ensure_rocm_torch() -> None:
    """Reinstall torch from ROCm wheels when the venv's torch lacks HIP.

    Covers both CPU-only and CUDA-only torch builds, neither of which can
    use an AMD GPU. Runs only on Linux x86_64 hosts where an AMD GPU is
    present and the ROCm runtime is detectable (rocminfo / amd-smi /
    hipconfig / rocm-core package). Returns early on Windows / macOS, on
    non-x86_64 Linux (PyTorch does not publish ROCm wheels for
    aarch64 / arm64), and on hosts with a usable NVIDIA GPU (NVIDIA takes
    precedence on mixed AMD+NVIDIA hosts). When torch already links
    against HIP, only the torch reinstall is skipped; the ROCm
    bitsandbytes install still runs.

    Uses pip_install() to respect uv, constraints, and --python targeting.
    """
    # Explicit OS / architecture guards so the helper is safe to call
    # from any context -- PyTorch only publishes ROCm wheels for
    # linux_x86_64, so aarch64 / arm64 hosts must skip this repair path
    # instead of failing the update with a missing-wheel error.
    if IS_WINDOWS or IS_MACOS:
        return
    if platform.machine().lower() not in {"x86_64", "amd64"}:
        return
    # NVIDIA takes precedence on mixed hosts -- but only if an actual GPU is usable
    if _has_usable_nvidia_gpu():
        return
    # Rely on _has_rocm_gpu() (rocminfo / amd-smi GPU data rows) as the
    # authoritative "is this actually an AMD ROCm host?" signal. The old
    # gate required /opt/rocm or hipcc to exist, which breaks on
    # runtime-only ROCm installs (package-managed minimal installs,
    # Radeon software) that ship amd-smi/rocminfo without /opt/rocm or
    # hipcc, and leaves `unsloth studio update` unable to repair a
    # CPU-only venv on those systems.
    if not _has_rocm_gpu():
        return  # no AMD GPU visible
    ver = _detect_rocm_version()
    if ver is None:
        print(" ROCm detected but version unreadable -- skipping torch reinstall")
        return
    # Probe whether torch already links against HIP (ROCm is already working).
    # Do NOT skip for CUDA-only builds since they are unusable on AMD-only
    # hosts (the NVIDIA check above already handled mixed AMD+NVIDIA setups).
    try:
        probe = subprocess.run(
            [
                sys.executable,
                "-c",
                "import torch; print(getattr(torch.version,'hip','') or '')",
            ],
            stdout = subprocess.PIPE,
            stderr = subprocess.DEVNULL,
            timeout = 30,
        )
    except (OSError, subprocess.TimeoutExpired):
        probe = None
    has_hip_torch = (
        probe is not None
        and probe.returncode == 0
        and probe.stdout.decode().strip() != ""
    )
    rocm_torch_ready = has_hip_torch
    if not has_hip_torch:
        # Select best matching wheel tag (newest ROCm version <= installed)
        tag = next(
            (
                t
                for (maj, mn), t in sorted(_ROCM_TORCH_INDEX.items(), reverse = True)
                if ver >= (maj, mn)
            ),
            None,
        )
        if tag is None:
            print(
                f" No PyTorch wheel for ROCm {ver[0]}.{ver[1]} -- "
                "skipping torch reinstall"
            )
        else:
            index_url = f"{_PYTORCH_WHL_BASE}/{tag}"
            print(f" ROCm {ver[0]}.{ver[1]} -- installing torch from {index_url}")
            pip_install(
                f"ROCm torch ({tag})",
                "--force-reinstall",
                "--no-cache-dir",
                "torch>=2.4,<2.11.0",
                "torchvision<0.26.0",
                "torchaudio<2.11.0",
                "--index-url",
                index_url,
                constrain = False,
            )
            rocm_torch_ready = True
    # Install bitsandbytes only when torch links against ROCm. Prefers the
    # continuous-release_main wheel (bnb PR #1887 4-bit GEMV fix) and falls
    # back to PyPI when the pre-release URL is unreachable.
torch 2.9.1 + torchvision 0.23.0 + torchaudio 2.9.0) falls back to the standard ROCm index instead of installing a broken combination. - install_python_stack.py _ensure_rocm_torch: restructure the "tag is None" path so it no longer short-circuits the bitsandbytes install. On a ROCm runtime older than anything in _ROCM_TORCH_INDEX, print the "no wheel" warning but still run the AMD bitsandbytes install. - studio/backend/core/training/worker.py: restore the pre-PR "no timeout" behaviour for non-HIP causal-conv1d / mamba-ssm source builds. The round 2 "timeout = 1800 if is_hip else 300" cap aborts slow non-HIP builds (Linux aarch64, unsupported torch/CUDA combos) after 5 minutes; omit timeout for the non-HIP branch so the cap only applies to ROCm source builds. * Fix round 4 findings: apply_gpu_ids env inheritance, Radeon X.Y, bitsandbytes gate Address remaining issues surfaced by the round 4 reviewers: - studio/backend/utils/hardware/hardware.py apply_gpu_ids: mirror the selection into HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES whenever the caller already had a ROCm visibility env var set, not only when IS_ROCM has already been set by detect_hardware(). Training and inference workers call apply_gpu_ids() before detect_hardware() runs, so the old guard would leave a forked ROCm worker with a stale HIP_VISIBLE_DEVICES mask that no longer matched the narrowed CUDA_VISIBLE_DEVICES selection. - install.sh get_radeon_wheel_url: accept X.Y ROCm versions in addition to X.Y.Z. The `/opt/rocm/.info/version` file and some hipconfig versions report only two components, and the Radeon repository publishes both rocm-rel-X.Y.Z/ and rocm-rel-X.Y/ directories, so treating X.Y as invalid caused Radeon hosts to fall back to the generic ROCm index even when a matching AMD wheel set existed. - install_python_stack.py _ensure_rocm_torch: only install the AMD bitsandbytes build when the venv actually has a ROCm-compatible torch (either already present or just installed by this function). 
Previously the bitsandbytes install ran unconditionally, which could leave an AMD bitsandbytes layered on top of a CPU/CUDA torch on hosts where the ROCm runtime is older than any entry in _ROCM_TORCH_INDEX. Also add --force-reinstall so an existing CPU/CUDA bitsandbytes is replaced by the AMD build during upgrades. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix gemini findings: amd-smi metric envelope validation and dict-wrapped GPU id Two medium-severity defensive fixes from the gemini-code-assist review on the AMD monitoring backend: 1. _extract_gpu_metrics may return a dict where every value is None when amd-smi succeeds (zero exit) but the JSON envelope contains no usable fields (error response, unsupported card). The new _has_real_metrics helper lets get_primary_gpu_utilization surface available:False and lets get_visible_gpu_utilization skip ghost device rows so the UI does not render placeholder cards with empty numbers. 2. Newer amd-smi versions wrap scalar fields as {"value": 0, "unit": "none"}, including the per-GPU id. The previous int(raw_id) call silently fell back to the enumeration index in that case, losing the real GPU id. Routing raw_id through the existing _parse_numeric helper handles bare ints, floats, strings, and the dict shape uniformly, with a debug log on parse failure. * Fix gemini round 2 findings: explicit length guard on ROCm version file parser Both _detect_rocm_version (install_python_stack.py) and _detect_host_rocm_version (install_llama_prebuilt.py) read /opt/rocm/.info/version or $ROCM_PATH/lib/rocm_version, split on "." and unconditionally accessed parts[1]. The surrounding broad `except Exception: pass` already swallowed the resulting IndexError, so a one-component file like "6\n" did fall through to the next detection source -- but the control flow relied on exception handling instead of an explicit check. 
Add `if len(parts) >= 2:` guards in both helpers so the loop falls through on its own without raising. Behaviour is unchanged for the common multi-component case; the previously-silent IndexError path becomes an explicit no-op. * Fix gemini round 3: include has_rocm in validate_server fallback path When validate_server is called without an explicit install_kind (older call sites that have not been updated), the fallback was only enabling --n-gpu-layers for NVIDIA and macOS arm64 hosts. AMD ROCm Linux hosts fell through to the CPU validation path even though the prebuilt being exercised was a HIP binary. Add host.has_rocm to the fallback expression so the GPU offload flag is applied consistently with the install_kind=='linux-rocm' / 'windows-hip' branches above. * Fix gemini round 4: remove risky bytes-vs-MB heuristic in _parse_memory_mb The previous heuristic divided any bare number above 10_000_000 by 1024*1024 on the assumption that large unit-less values were bytes. This misclassified small VRAM allocations: 5 MB of used VRAM reported as 5_242_880 bytes without a unit would be taken at face value and render as 5_242_880 MB (~5 TB) in the monitoring UI. Modern amd-smi always provides explicit units (MiB/GiB dict form), and legacy amd-smi returns bare numbers in MB -- the heuristic never had a real workload to handle. Drop it and default to MB for bare numeric input, keeping the existing unit-aware branches for dict / string inputs unchanged. The unrelated gemini suggestion to "default minor to 0" in the amd-smi version awk parser was intentionally NOT applied: rocm7.0 and rocm7.1 ship different wheel sets, so silently substituting 0 for a missing minor could install the wrong wheels. The existing reject-and-fall-through behaviour is safer. * Fix gemini round 5: POSIX compliance and leading-comma visibility parsing Three medium findings from gemini-code-assist addressed in this commit: 1.
_pick_radeon_wheel used grep -o and sort -V, both GNU extensions that are not in POSIX and break on BSD/BusyBox coreutils. install.sh has a #!/bin/sh shebang so the whole pipeline was rewritten as a single awk script that extracts all href="..." hits on each line, filters to wheels matching the package prefix and python tag, and picks the newest version via zero-padded lexical comparison. No external sort or grep is needed. 2. _first_visible_amd_gpu_id in the AMD monitoring backend treated a leading comma (e.g. HIP_VISIBLE_DEVICES=",1") as "fall through to the next env var", which is surprising given the clear intent to narrow to device 1. Filter empty tokens after the split and return the first real one. An all-commas value ("," / ",,,") still falls through because no real tokens exist; the empty-string and "-1" explicit-zero cases are unchanged. The unrelated amd-smi version awk parser suggestion was not applied (see round 4 commit message for rationale: defaulting a missing minor to 0 could silently install the wrong ROCm wheel set). * Fix 20-reviewer.py findings: base drift, Radeon %2B, dpkg/rpm fallback, bnb, backend label Consolidated fix batch from a 20-parallel reviewer.py run on the current head. Each fix is drawn from a high-consensus finding and addresses a real bug or feature gap, not a stylistic preference. 1. install.sh: bump `unsloth>=2026.4.2` -> `unsloth>=2026.4.4` at five call sites so this branch no longer regresses main's version floor (main bumped to 2026.4.4 in #4876). Without this, merging 4720 would silently downgrade the minimum version pin for fresh installs. 2. install.sh: URL-decode Radeon wheel names before extracting the torch / torchvision / torchaudio version strings. 
Real wheel URLs from repo.radeon.com are percent-encoded ("torch-2.10.0%2Brocm7.2.0...") so the previous `[+-]` terminator in the sed regex never matched, `_torch_ver` stayed empty, `_radeon_versions_match` stayed false, and every Radeon consumer install silently fell back to the generic ROCm index. Now decode %2B -> + first, then extract, then validate. 3. install.sh: the two AMD bitsandbytes install lines were running `uv pip install "bitsandbytes>=0.49.1"` without `--force-reinstall`, so upgrades where the venv already has a CPU/CUDA bitsandbytes satisfying the constraint would keep the stale non-AMD wheel. Add `--force-reinstall --no-cache-dir` to both call sites, matching the pattern already used in install_python_stack.py::_ensure_rocm_torch. 4. install_python_stack.py and install_llama_prebuilt.py: add `dpkg-query -W rocm-core` and `rpm -q rocm-core` fallbacks to the Python-side ROCm version detectors so they match the chain in install.sh::get_torch_index_url. Package-managed ROCm installs (Debian/Ubuntu/RHEL/Fedora distro packages) can expose GPUs via rocminfo/amd-smi but still lack /opt/rocm/.info/version, hipconfig, or amd-smi `version` output -- without these fallbacks, `unsloth studio update` on such hosts returned None and skipped the ROCm torch repair. Also strip the dpkg epoch prefix ("1:6.3.0-1") before parsing so epoch-annotated packages parse correctly. 5. hardware.py: add a `_backend_label(device)` helper that returns "rocm" when IS_ROCM is set and the device is DeviceType.CUDA, and use it for every `"backend": ...` emission in JSON responses served to the Studio frontend. Internally we still represent ROCm hosts as DeviceType.CUDA (ROCm torch reuses the whole torch.cuda.* API surface), but the user-facing API now correctly reports "rocm" on AMD boxes instead of labeling them as "cuda". 
All 250 simulation scenarios pass (was 233 before this batch: added 17 new regression tests covering the version pin, %2B decoding, bnb force-reinstall flags, dpkg/rpm fallback presence, and the _backend_label helper's four-way truth table). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix gemini round 6 + URL audit: amd.py defensive checks, rocm6.5+ clip to 6.4 Two rounds of fixes in one commit, plus a full URL audit of every PyPI / download.pytorch.org / repo.radeon.com reference the PR introduces. amd.py (4 medium gemini findings on commit b3627bc2): 1. _extract_gpu_metrics used `and vram_total_mb` as part of the vram_util gate. The follow-up `vram_total_mb > 0` already handles the division guard, but the truthiness check was redundant and slightly surprising for a 0.0 valid value. Replace with explicit `is not None and > 0` for both vram_util and power_util. 2. get_physical_gpu_count called `data.get("gpu", ...)` without guarding for non-dict envelopes. A scalar / string JSON response from amd-smi would raise AttributeError. Add an isinstance(data, dict) check and return None for unexpected shapes. 3. get_visible_gpu_utilization had the same .get() exposure on the outer envelope. Rewrite the gpu_list extraction as an explicit list/dict/else cascade so a malformed scalar envelope produces gpu_list=[data] and continues without raising. 4. The same function's per-entry loop also called gpu_data.get() on whatever was inside gpu_list. If a scalar ever leaks into the list (directly or via the previous fix's fallback), _extract_gpu_metrics would raise on the first .get() inside the helper. Skip non-dict entries in the loop before extracting metrics. install.sh (URL audit finding, previously flagged by 20-reviewer as #13): 5. 
get_torch_index_url used `rocm6.*` in the rocm tag case statement, which matched rocm6.5 and rocm6.6 and emitted download.pytorch.org/whl/rocm6.5 -- which returns HTTP 403 because PyTorch only publishes rocm 5.7, 6.0-6.4, 7.0-7.2. Enumerate the supported 6.x minors explicitly and add a rocm6.* fallback branch that clips to rocm6.4 (the last supported 6.x wheel set). URL audit results (all URLs PR 4720 references): - 14/14 download.pytorch.org/whl/{cpu,cu118,cu124,cu126,cu128,cu130,rocm6.0..6.4,rocm7.0..7.2} return HTTP 200. - 9/9 repo.radeon.com/rocm/manylinux/rocm-rel-{5.7,6.0,6.1,6.2,6.3,6.4,7.0,7.1,7.2}/ return HTTP 200. - X.Y.Z patch directories exist for 7.0.2, 7.1.1, 7.2.1 but NOT for 6.3.0, 6.4.0, 6.2.1 -- install.sh already handles this via the X.Y.Z -> X.Y fallback sed in the Radeon wheel install block. - Docs links (rocm.docs.amd.com, docs.unsloth.ai AMD guide) and the llama.cpp GitHub releases API endpoint all return 200. Test suite: 255 -> 258. New regression coverage: - U17: get_physical_gpu_count tolerates scalar amd-smi envelope - U18: get_visible_gpu_utilization tolerates scalar envelope - U19a-c: vram_util / power_util return None on zero total, but vram_total_gb still echoes 0.0 (not None) - A_rocm{6.5,6.6,6.9}_clips_to_rocm64: install.sh clips unsupported 6.x minors to rocm6.4 instead of producing a 403 index URL * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix reviewer.py round 2: tokenizer AMD multi-GPU, --no-torch bnb, main.py backend label Three high-confidence findings from a second 20-parallel reviewer.py run on commit 7effb3ae. Triaged 15 total findings and applied the three that were confirmed as real bugs; the rest were either false positives (e.g. "migrated AMD venv not repaired" -- _ensure_rocm_torch runs downstream via setup.sh regardless), design decisions (e.g.
visibility mask env vars not consulted in installer detection), or edge cases the existing fallback logic already handles. 1. unsloth/tokenizer_utils.py [6/20]: the multi-GPU guard's shell probe runs `nvidia-smi --query-gpu=memory.used`, catches the failure, then only raises if `torch.cuda.is_available()` is False. On ROCm torch, torch.cuda.is_available() returns True (ROCm reuses the torch.cuda.* API), so the guard becomes dead code on AMD hosts and multi-GPU AMD setups slip through even though unsloth does not support them yet. Add a torch.cuda.device_count() > 1 fallback inside the except so AMD multi-visible-device setups are flagged consistently with the original CUDA memory check. 2. install.sh [1/20]: the fresh-install bitsandbytes block for AMD ROCm ran unconditionally when TORCH_INDEX_URL matched `*/rocm*`, even when SKIP_TORCH=true (from --no-torch or Intel Mac auto-detect). A user running `install.sh --no-torch` on an AMD host would still pull in bitsandbytes despite explicitly asking for GGUF-only mode. Wrap the case block in an outer `[ "$SKIP_TORCH" = false ]` guard. 3. studio/backend/main.py [3/20]: the /api/system endpoint returned `"device_backend": get_device().value`, which is "cuda" on ROCm hosts (because ROCm torch piggybacks on torch.cuda). Other endpoints (hardware.py) already use the _backend_label helper which swaps "cuda" -> "rocm" when IS_ROCM. Route /api/system through the same helper so the Studio UI reports the backend consistently across all endpoints. 4. studio/backend/tests/test_utils.py: update test_backend_matches_device to call _backend_label(get_device()) instead of raw get_device().value so the test matches the new contract and still passes on CUDA hosts. Tests: 258 -> 261. 
New regression coverage: - X08 main.py /api/system uses _backend_label - X09 tokenizer multi-GPU guard has device_count() fallback - X10 fresh-install bnb case block gated on SKIP_TORCH=false * fix: prevent bitsandbytes from overwriting ROCm torch with CUDA wheels During install, bitsandbytes was installed without --no-deps, causing uv to resolve torch from PyPI (CUDA build) and silently overwrite the ROCm wheels that were just installed in the previous step. This happened in three places: - install.sh: bitsandbytes install in both migrated and fresh paths - install_python_stack.py: bitsandbytes install inside _ensure_rocm_torch() Additionally, multiple install steps in install_python_stack.py (extras, overrides, studio deps) can pull in CUDA torch via transitive dependencies. A final _ensure_rocm_torch() call at the end of the install sequence ensures ROCm torch is always in place at runtime. All changes are gated behind ROCm-specific conditions and do not affect NVIDIA, CPU-only, macOS, or Windows install paths. Tested on AMD Instinct MI300X VF with ROCm 7.2.0 -- confirms torch==2.10.0+rocm7.1 with HIP 7.1.25424 after install. * fix: ROCm inference fallback -- skip Unsloth patching and bnb 4-bit on HIP On AMD ROCm (HIP), two issues prevent the normal Unsloth inference path: 1. Unsloth's global monkey-patching of transformers model classes (LlamaRotaryEmbedding, attention modules) triggers _assert_async_cuda_kernel crashes on HIP during generation. Training uses different code paths and works fine. 2. bitsandbytes 4-bit matmul kernels also trigger HIP assertion failures on MI300X (CDNA3 / gfx942), even without Unsloth patching. This commit adds a ROCm-specific inference fallback that: - Skips importing Unsloth at module level (prevents global patching) - Loads models in 16-bit with plain transformers + PEFT instead - Resolves pre-quantized model names (e.g. 
"xxx-bnb-4bit" -> "xxx") since pre-quantized HF repos still trigger bnb codepaths - Guards get_chat_template calls (unavailable without Unsloth import) - Fixes max_seq_length=0 being passed to from_pretrained (GGUF semantics don't apply to transformers path) The NVIDIA path is completely unchanged -- Unsloth import and for_inference() optimization remain active. GGUF inference (via llama-server/HIP) is unaffected since it never imports Python model classes. AMD GPUs typically have large VRAM (e.g. 192GB on MI300X) so 16-bit loading is practical for inference. Tested on AMD Instinct MI300X VF (ROCm 7.2, HIP 7.1.25424): - Simple generation: PASS - Compare mode (base vs finetuned): PASS - GGUF inference + tool calling: PASS (unaffected by this change) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: guard audio/vision inference on ROCm, remove unused import - Add clear RuntimeError for audio/vision model inference on ROCm (these paths use Unsloth's FastModel/FastVisionModel which would crash on HIP; GGUF inference is the supported path on AMD) - Remove unused `import os as _os` from the ROCm changes * fix: amd-smi parsing for newer output format (gpu_data wrapper, mem_usage, temperature) amd-smi on recent ROCm versions (7.x) wraps metric output in a {"gpu_data": [...]} envelope instead of returning a raw list. This caused get_primary_gpu_utilization() and get_visible_gpu_utilization() to fail silently (returning available=False) because the GPU data dict was never unwrapped. Additionally: - VRAM data moved from "vram" to "mem_usage" with "total_vram" / "used_vram" keys. Added fallback key lookup. - Temperature "edge" sensor returns "N/A" on MI300X VF; the previous dict.get() chain returned the "N/A" string instead of falling through to "hotspot". Changed to a loop that checks each key until a parseable value is found. 
Tested on AMD Instinct MI300X VF (ROCm 7.2, amd-smi 24.x): - GPU utilization: 0% (idle), up to 100% during training - Temperature: 40-44C (from hotspot sensor) - VRAM: 0.28/191.69 GB (idle) - Power: 158-211W draw * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Bug fix detecting radeon (#4940) * Bug fix detecting radeon * Expanding GPU target for gfx1100* * Generalize gfx family-prefix filter to cover gfx10/gfx12 as well rocminfo on ROCm 6.1+ emits LLVM generic-family ISA lines alongside the specific GPU (e.g. gfx11-generic next to gfx1100). The outer grep captures the bare family prefix from the generic line, and passing that to -DGPU_TARGETS breaks the HIP build because clang only accepts specific gfxNNN ids. The previous filter only special-cased gfx11. Generalize it so any bare 2-digit family prefix (gfx10, gfx11, gfx12, ...) is dropped whenever a specific sibling target is present in the same list. No real AMD GPU has a 2-digit gfx id, so the filter can only ever drop family prefixes and never a real target. Covers the existing gfx11 cases unchanged, and extends the same fix to gfx10-1-generic / gfx10-3-generic (RDNA1/2) and gfx12-generic (RDNA4), which would otherwise hit the same build failure on newer rocminfo. --------- Co-authored-by: Iswarya Alex <iswarya.alex@amd.com> Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com> --------- Co-authored-by: Eda Z <eda.zhou@amd.com> Co-authored-by: GoldenGrapeGentleman <yueyuan@amd.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: billishyahao <bill.he@amd.com> Co-authored-by: Iswarya Alex <47045679+iswaryaalex@users.noreply.github.com> Co-authored-by: Iswarya Alex <iswarya.alex@amd.com> Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
2026-04-10 08:56:12 +00:00
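The ROCm version handling described in this commit (dropping a dpkg epoch such as `1:6.3.0-1`, an explicit `len(parts) >= 2` guard instead of a swallowed `IndexError`, and clipping an unsupported minor such as rocm6.5 down to rocm6.4) can be sketched in a few lines of Python. This is a minimal sketch, not the module's actual `_detect_rocm_version()` or `_ROCM_TORCH_INDEX`: the names `parse_rocm_version`, `pick_rocm_tag` and `SUPPORTED_ROCM_TAGS` are hypothetical, and the table is assumed to contain only the rocm6.0 to 6.4 and rocm7.0 to 7.2 indexes covered by the URL audit (rocm5.7 is deliberately left out).

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical table: (major, minor) -> PyTorch wheel index tag.
# Assumed from the URL audit (rocm6.0..6.4 and rocm7.0..7.2 only).
SUPPORTED_ROCM_TAGS: Dict[Tuple[int, int], str] = {
    (6, 0): "rocm6.0",
    (6, 1): "rocm6.1",
    (6, 2): "rocm6.2",
    (6, 3): "rocm6.3",
    (6, 4): "rocm6.4",
    (7, 0): "rocm7.0",
    (7, 1): "rocm7.1",
    (7, 2): "rocm7.2",
}


def parse_rocm_version(raw: str) -> Optional[Tuple[int, int]]:
    """Parse '6.3.0', '7.2' or dpkg-style '1:6.3.0-1' into (major, minor)."""
    text = raw.strip()
    # dpkg-query may prefix the version with an epoch ("1:"); drop it.
    if ":" in text:
        text = text.split(":", 1)[1]
    parts = text.split(".")
    # Explicit length guard: a one-component value such as "6" is
    # rejected here instead of raising IndexError on parts[1].
    if len(parts) < 2:
        return None
    major = re.match(r"\d+", parts[0])
    minor = re.match(r"\d+", parts[1])
    if major is None or minor is None:
        return None
    return int(major.group()), int(minor.group())


def pick_rocm_tag(version: Tuple[int, int]) -> Optional[str]:
    """Return the newest supported tag that is not newer than the host."""
    for key in sorted(SUPPORTED_ROCM_TAGS, reverse=True):
        if key <= version:
            return SUPPORTED_ROCM_TAGS[key]
    # The host ROCm runtime is older than every published index.
    return None
```

Because the lookup walks the table from newest to oldest, `pick_rocm_tag(parse_rocm_version("6.5.0"))` yields `"rocm6.4"`, and a host older than every entry yields `None`, which corresponds to the "tag is None" path described in the commit message. Mapping hosts newer than the table (for example 7.3) onto its last entry is an assumption of this sketch, not documented installer behaviour.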
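The generalized gfx family-prefix filter from the final item of this commit amounts to a small list transformation. The actual filter is shell code in the installer; this Python rendering and the name `filter_gpu_targets` are hypothetical. "Sibling" is interpreted here as any other target that starts with the bare prefix (so `gfx1100` is a sibling of `gfx11`), and keeping a bare prefix that has no sibling is an assumption, since the commit only says prefixes are dropped when a specific sibling is present.

```python
import re
from typing import List

# A bare LLVM family prefix: "gfx" followed by exactly two digits.
# No real AMD GPU has a two-digit gfx id, so only prefixes match.
_FAMILY_PREFIX = re.compile(r"gfx\d{2}")


def filter_gpu_targets(targets: List[str]) -> List[str]:
    """Drop bare family prefixes (gfx10, gfx11, gfx12, ...) that have a
    specific sibling target in the same list; keep everything else."""
    kept = []
    for target in targets:
        if _FAMILY_PREFIX.fullmatch(target):
            has_sibling = any(
                other != target and other.startswith(target) for other in targets
            )
            if has_sibling:
                # clang only accepts specific gfxNNN ids in GPU_TARGETS.
                continue
        kept.append(target)
    return kept
```

For a rocminfo list that yields both `gfx1100` and the generic-derived `gfx11`, only `gfx1100` survives, while CDNA ids such as `gfx90a` or `gfx942` never match the two-digit pattern and pass through untouched.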
if rocm_torch_ready:
Pin bitsandbytes to continuous-release_main on ROCm (4-bit decode fix) (#4954) * Pin bitsandbytes to continuous-release_main on ROCm for 4-bit decode fix bitsandbytes 0.49.2 on PyPI ships with a broken 4-bit GEMV kernel on every ROCm target: - CDNA (gfx90a / gfx942 / gfx950 = MI210 / MI300X / MI350) via a broken blocksize=32/64 warp64 GEMV kernel whose tests were explicitly skipped with ROCM_WARP_SIZE_64 guards because the code was known broken. - RDNA3 / RDNA3.5 (gfx1100-1103 / gfx1150-1152) via a compile-time BNB_WARP_SIZE macro in the host-side dispatch that resolves to 64 when the multi-arch wheel is compiled with CDNA as the primary target, so num_blocks is wrong on RDNA and half the GEMV output is never written. At decode shape (1, 1, hidden) both bugs produce NaN. Training is unaffected because training shapes are (batch, seq_len > 1, hidden) and never touch the GEMV path. The crash during autoregressive inference surfaces as _assert_async_cuda_kernel in torch.multinomial which on HIP becomes a hard HSA_STATUS_ERROR_EXCEPTION instead of a clean Python error. Both bugs are fixed by bitsandbytes commit 713a3b8 ("[ROCm] Enable blocksize 32 4-bit quantization and GEMV kernels on AMD CDNA", PR #1887, merged 2026-03-09) which replaces BNB_WARP_SIZE with a runtime hipDeviceGetAttribute query and ships a working CDNA warp64 kernel. That commit has not shipped to PyPI yet, but continuous-release_main wheels are published on every push to bnb main via GitHub Releases. Point the ROCm install path at the continuous-release_main x86_64 and aarch64 wheels and fall back to PyPI >=0.49.1 when the pre-release is unreachable (offline installs, firewalled hosts, or architectures not covered by the pre-release wheels). Drop the pin once bnb cuts a 0.50+ tag on PyPI. 
Verified on MI300X (gfx942, ROCm 7.2, torch 2.10.0+rocm7.1): direct bnb GEMV shape test now returns 0.0078 max abs error at seq_len=1 (no NaN) vs NaN on 0.49.2, and full Unsloth + for_inference + 4-bit sampling generation works end-to-end. NVIDIA / CPU / Mac / Windows paths are unaffected -- the helper is gated on the ROCm torch index and platform.machine() respectively. * Drop Studio ROCm 16-bit fallback now that bnb 0.50+ fixes 4-bit decode The 16-bit fallback in studio/backend/core/inference/inference.py was added as a workaround for a bug that this PR already fixes at the install layer: bitsandbytes <= 0.49.2 has a broken 4-bit GEMV kernel on every ROCm target, which NaNs at decode shape (seq_len=1) and crashes autoregressive inference. bnb PR #1887 (commit 713a3b8, in 0.50.0.dev0+, pinned by install.sh / install_python_stack.py in this PR) restores correct 4-bit decode on MI300X and verified working end-to-end with full Unsloth + for_inference + sampling. Revert the dual code path so ROCm and NVIDIA both go through the normal FastLanguageModel.from_pretrained + for_inference flow: - Remove the conditional `from unsloth import` that skipped the import on ROCm. The monkey-patches it was trying to avoid were never the cause of the crash; bnb 4-bit GEMV was. - Remove the `if _hw_module.IS_ROCM:` branch in load_model that loaded with plain transformers + PEFT + bfloat16, and the `_resolve_fp16_base` helper it relied on. - Remove the `get_chat_template is not None` fallback in _load_chat_template_info -- get_chat_template is now always imported. - Refactor the audio/vision ROCm guard to check _hw_module.IS_ROCM directly instead of the removed _IS_ROCM_ENV global. Audio and vision on ROCm still need separate validation (FastVisionModel and the CSM audio codecs were never tested on HIP) so the guard stays for now. 
Add _bnb_rocm_4bit_ok() as a runtime safety net for users who install from this PR before the install.sh bnb pin kicks in, or whose installer fell back to the PyPI pin because the continuous-release wheel was unreachable. When the installed bnb is < 0.50 on ROCm, force load_in_4bit=False and strip any -unsloth-bnb-4bit / -bnb-4bit suffix from the model path so a pre-quantized repo resolves to its FP16 sibling instead of pulling bnb back in via the repo's quantization_config. LoRA adapters whose base is a pre-quantized repo on old bnb will still fail inside Unsloth's loader -- the only real fix there is `unsloth studio update`. Verified on MI300X (gfx942, ROCm 7.2, torch 2.10.0+rocm7.1): - HAPPY path (bnb 0.50.0.dev0, load_in_4bit=True, pre-quantized repo): loads in 4-bit via the fixed GEMV, generation returns "Paris." for greedy and sampling. - SAFETY-NET path (simulated old bnb, suffix-stripped to the FP16 sibling, load_in_4bit=False): loads in bf16, generation returns "Paris." for greedy and sampling. Net diff is ~45 lines smaller than the pre-revert state because the entire plain-transformers 16-bit branch is gone. * Cache _bnb_rocm_4bit_ok() with functools.cache load_model() can be called many times in a single session but the bnb version and hardware state cannot change at runtime, so memoise the check. First call is ~1.9 ms (dominated by the lazy `import bitsandbytes` inside the try block), subsequent calls drop to sub-microsecond dict lookups. Zero behavioral change. * Shorten verbose bnb/ROCm comments Comment-only cleanup across install.sh, studio/install_python_stack.py, and studio/backend/core/inference/inference.py. No behavioral change.
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove _bnb_rocm_4bit_ok safety net from inference.py Studio's ROCm support is brand new (PR #4720, merged today) and every fresh install pulls the bnb continuous-release_main wheel via install.sh / install_python_stack.py in this same PR. There are no existing ROCm Studio installs carrying bnb < 0.50, so the defensive version-check fallback is guarding against a scenario that cannot actually occur. Delete the helper, the functools import, and the safety-net block -- inference.py now calls FastLanguageModel.from_pretrained directly with no ROCm branching. * Drop audio/vision ROCm guard in inference.py — verified unblocked by bnb fix Vision inference was blocked by the same bnb 4-bit GEMV bug that affected text inference (vision models use bnb 4-bit for the LM backbone). With bnb 0.50+ pinned in install.sh / install_python_stack.py, vision works end-to-end on MI300X: Llama-3.2-11B-Vision-Instruct-unsloth-bnb-4bit loaded in 4-bit via FastVisionModel + for_inference returns a correct answer to a multimodal prompt. Audio (CSM) was never actually blocked by HIP — on this hardware CSM loads and runs its backbone forward pass fine with bnb 0.50, then fails during generate() with a transformers-level kwarg validation mismatch in generation_csm.py (`backbone_last_hidden_state` rejected). That's a pre-existing transformers/CSM integration bug that reproduces identically on NVIDIA, so the ROCm-gated guard was never actually protecting users from anything HIP-specific. Remove the combined audio/vision guard and the now-unused _hw_module import. Also restore the one-word "Can be" in an inline comment that drifted during the earlier comment-shortening pass, so the inference.py delta vs pre-#4720 is exactly the max_seq_length<=0 crash fix and nothing else. 
* Shorten max_seq_length=0 guard comment to one line --------- Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-04-10 13:25:39 +00:00
_bnb_url = _bnb_rocm_prerelease_url()
_bnb_installed = False
if _bnb_url is not None:
    _bnb_installed = pip_install_try(
        "bitsandbytes (AMD, pre-release main)",
        "--force-reinstall",
        "--no-cache-dir",
        "--no-deps",
        _bnb_url,
        constrain = False,
    )
    if not _bnb_installed:
        print(
            _red(
                " bnb pre-release unreachable; falling back to PyPI "
                "(4-bit decode will be broken on ROCm)"
            )
        )
if not _bnb_installed:
    pip_install(
        "bitsandbytes (AMD)",
        "--force-reinstall",
        "--no-cache-dir",
        "--no-deps",
        _BNB_ROCM_PYPI_FALLBACK,
        constrain = False,
    )
Add AMD ROCm/HIP support across installer and hardware detection (#4720) * Add ROCm detection to install.sh and expand shell tests Add AMD ROCm GPU detection to get_torch_index_url() in install.sh. When nvidia-smi is not found, probe for ROCm via amd-smi, /opt/rocm version file, hipconfig, dpkg-query, and rpm. Includes validation guard for malformed _rocm_tag, Debian epoch prefix stripping, ROCm 7.2+ cap to rocm7.1 index, bitsandbytes AMD install, and status messaging. Shell tests expanded to 23 cases. Co-authored-by: Daniel Han <danielhanchen@gmail.com> * Add ROCm torch reinstall support to install_python_stack.py Add _detect_rocm_version() and _ensure_rocm_torch() to detect when a Linux host has ROCm but the venv received CPU-only torch, and reinstall with the correct ROCm wheels. Covers ROCm 6.0 through 7.1 with a 30-second timeout on the torch GPU probe subprocess. Co-authored-by: Daniel Han <danielhanchen@gmail.com> * Add ROCm support to llama.cpp prebuilt installer Add has_rocm field to HostInfo, extend detect_host() to probe for ROCm via hipcc/amd-smi/rocm-smi/ROCM_PATH, and route ROCm hosts to upstream prebuilts (Linux ROCm 7.2 prebuilt with source fallback, Windows HIP prebuilt with CPU fallback). Add linux-rocm and windows-hip install kinds to runtime_patterns_for_choice(). Co-authored-by: Daniel Han <danielhanchen@gmail.com> * Add IS_ROCM hardware flag and fix AMD error message Add IS_ROCM flag to hardware.py detect_hardware() (set when torch.version.hip is present, DeviceType stays CUDA). Export IS_ROCM from __init__.py. Add "rocm" key to get_package_versions(). Replace "We do not support AMD" error in tokenizer_utils.py with a helpful message pointing to ROCm installation docs. 
Co-authored-by: Daniel Han <danielhanchen@gmail.com> * Add comprehensive ROCm support test suite (68 tests) Add tests/studio/install/test_rocm_support.py covering all ROCm code paths across install_llama_prebuilt.py, install_python_stack.py, hardware.py, tokenizer_utils.py, and install.sh. All tests use mocks and run without AMD hardware. Covers: asset selection (11), runtime patterns (5), HostInfo (4), ROCm version detection (9), torch reinstall (9), index mapping (8), hardware flag (8), tokenizer message (2), install.sh structure (10), and live regression (1). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Harden ROCm support: probe error handling, version cap, validation Address review findings from 8 independent reviewers: - Wrap _ensure_rocm_torch() torch probe in try/except for TimeoutExpired and OSError so a hung or broken torch import does not crash the installer (8/8 reviewers flagged this) - Add torch>=2.4,<2.11.0 version cap to the ROCm reinstall path to prevent installing unsupported torch 2.11.0 from the rocm7.1 index - Use with-statement for file reads in _detect_rocm_version() to avoid resource leaks - Handle ROCM_PATH="" correctly (use `or "/opt/rocm"` instead of default parameter to avoid relative path resolution) - Strengthen shell validation guard from rocm[0-9] to rocm[1-9] to reject rocm0.x tags that would produce nonexistent PyTorch index URLs - Switch shell version cap from blocklist to allowlist (rocm6.*|rocm7.0* |rocm7.1* pass through, everything else caps to rocm7.1) so future ROCm 10+ does not fall through to a nonexistent index - Add sorted() to _ROCM_TORCH_INDEX lookup for defensive ordering - Fix test_probe_timeout_handled: replace zero-assertion test with proper assertions verifying reinstall proceeds after timeout * Clean up rocm_paths list construction in detect_host() Filter None from the ROCM_PATH env var lookup at list construction time instead of relying on the inline `if 
p` guard in the any() call. * Require actual AMD GPU presence before selecting ROCm paths All 8 reviewers across 2 cycles independently flagged that ROCm detection used toolkit/filesystem hints (hipcc, /opt/rocm, rocm-core) as a proxy for GPU presence, which would misroute CPU-only or NVIDIA hosts that happen to have ROCm tools installed. Now all 3 detection points (install.sh, install_python_stack.py, install_llama_prebuilt.py) probe for an actual AMD GPU before entering the ROCm path: - install.sh: check rocminfo for gfx* GPU names, or amd-smi list for device rows, before version detection - install_python_stack.py: new _has_rocm_gpu() function probes rocminfo and amd-smi list before _ensure_rocm_torch() proceeds - install_llama_prebuilt.py: detect_host() probes rocminfo/amd-smi list instead of just checking tool existence or directory paths Also: - Shell test mock amd-smi now handles "list" subcommand - Python tests updated to mock _has_rocm_gpu where needed - Added test_no_gpu_with_rocm_tools_skips to verify the new guard - Test index lookups now use sorted() to match production code * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Harden hipconfig version parsing and torch probe compatibility - Add parts[1].isdigit() check in hipconfig version parsing to handle versions like "6.3-HIP" where the minor component has non-numeric suffix (strip "-" prefix before int() conversion) - Use getattr() in torch probe subprocess to safely handle old or custom torch builds that may lack torch.version.hip/cuda attributes * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Strengthen AMD GPU detection and add NVIDIA precedence guard - Change amd-smi list detection from any-non-empty-output to requiring "gpu" marker in output, matching the shell-side NR>1 check. Prevents false positives from header-only amd-smi list output. 
- Add nvidia-smi check at the top of _ensure_rocm_torch() so mixed AMD+NVIDIA hosts preserve NVIDIA precedence (matching install.sh and install_llama_prebuilt.py behavior). - Apply the same amd-smi marker fix to install_llama_prebuilt.py detect_host() for consistency. * Add Windows-specific ROCm/HIP detection in detect_host() The previous detect_host() ROCm check used rocminfo and amd-smi list which are Linux-only tools. On Windows, has_rocm would always be False, making the Windows HIP prebuilt path at line 1794 unreachable. Now detect_host() uses platform-specific detection: - Linux: rocminfo (check for gfx GPU names) or amd-smi list - Windows: hipinfo.exe, amd-smi, or amdhip64.dll on PATH This allows Windows AMD users to get the HIP prebuilt binary instead of silently falling through to the CPU prebuilt. * Add AMD ROCm gaps: Mamba/SSM source builds, GPU monitoring, Windows messaging, RDNA expansion - worker.py: Add HIP detection to causal-conv1d/mamba-ssm probe, check for hipcc before ROCm source builds, improve status messages and error reporting, add timeout and uv support for the source build fallback - amd.py: New AMD GPU monitoring module via amd-smi metric --json, mirroring nvidia.py structure (utilization, temperature, power, VRAM) - hardware.py: Branch to amd.py when IS_ROCM is True for GPU utilization, visible GPU queries, and physical GPU count - install_python_stack.py: Detect AMD GPUs on Windows and warn that ROCm-enabled PyTorch must be installed manually - kernels/utils.py: Expand is_rdna() to cover RDNA2 (gfx1030-1032), RDNA3 (gfx1102-1103), RDNA3.5 (gfx1150-1152) alongside existing entries - tests: Add 32 new tests covering all changes (95/95 pass) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Harden ROCm detection, fix VRAM heuristic, and expand RDNA2 coverage - Windows ROCm detection: validate actual GPU presence via hipinfo/amd-smi output markers instead of just checking tool existence 
on PATH - _ensure_rocm_torch: validate nvidia-smi actually reports a GPU before giving NVIDIA precedence (fixes AMD-only hosts with stale NVIDIA tools) - amd.py _parse_numeric: handle dict-shaped metric objects from newer amd-smi versions ({"value": 10, "unit": "W"}) and strip MiB/GiB units - amd.py VRAM heuristic: raise threshold from 100k to 10M to correctly handle MI300X (192 GB = 196608 MB) and other high-VRAM GPUs - amd.py visible GPU: use AMD-reported GPU IDs instead of enumerate index so non-dense sets like CUDA_VISIBLE_DEVICES=1,3 report correctly - install.sh: add ROCm <6.0 minimum version guard (no PyTorch wheels exist for older versions); fix rocm7.1* glob to not match rocm7.10+ - is_rdna: add gfx1033-1036 for RDNA2 mobile GPUs (RX 6600M etc.) - worker.py: increase ROCm source build timeout from 600s to 1800s; fix success log message for ROCm source builds - Tests: update mocks for _has_usable_nvidia_gpu, add RDNA2 target asserts * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add HIP_VISIBLE_DEVICES support, unit-aware VRAM parsing, Windows GPU validation - hardware.py: check HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES on ROCm before falling back to CUDA_VISIBLE_DEVICES, so multi-GPU AMD setups with HIP-specific env vars report the correct visible device set - amd.py: add _parse_memory_mb() that reads "unit" from dict-shaped amd-smi JSON (e.g. 
{"value": 192, "unit": "GiB"}) and converts to MB correctly; fixes MI300X VRAM misreported as 0.19 GB instead of 192 GB - install_python_stack.py: Windows AMD warning now validates actual GPU presence via hipinfo/amd-smi output markers before printing - install_llama_prebuilt.py: restore amdhip64.dll fallback for Windows HIP detection after tool-based checks, so Windows HIP installs without CLI tools on PATH are still detected - hardware.py: fix IS_ROCM comment to accurately describe its role * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix HIP_VISIBLE_DEVICES empty-string handling in GPU visibility spec Use explicit None checks instead of Python `or` operator when reading HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES, so that an empty string ("") is correctly honored as "no visible GPUs" rather than silently falling through to CUDA_VISIBLE_DEVICES on mixed ROCm+CUDA systems. * Fix IS_ROCM test assertion for multi-line formatting * Cap torchvision/torchaudio versions, remove amdhip64.dll fallback, fix visible GPU count - Cap torchvision<0.26.0 and torchaudio<2.11.0 alongside torch<2.11.0 in both install.sh and install_python_stack.py to prevent resolver from selecting incompatible companion packages from ROCm wheel index - Remove amdhip64.dll fallback in Windows ROCm detection (DLL presence without hipinfo/amd-smi is not proof of GPU existence) - Fix get_visible_gpu_count() to use _get_parent_visible_gpu_spec() which respects HIP_VISIBLE_DEVICES/ROCR_VISIBLE_DEVICES on ROCm hosts * Attribute is_rdna() RDNA2/3/3.5/4 expansion to PR #4428 The is_rdna() expansion to cover RDNA2 (gfx1030-1036), RDNA3 (gfx1100-1103), RDNA3.5 (gfx1150-1152), and RDNA4 (gfx1200-1201) architectures is based on the original work from PR #4428. 
Co-authored-by: GoldenGrapeGentleman <yueyuan@amd.com> Co-authored-by: billishyahao <bill.he@amd.com> * Support AMD Radeon for studio (#4770) Co-authored-by: Iswarya Alex <iswarya.alex@amd.com> * Remove ROCm test files from main PR Move test_rocm_support.py and shell test additions to a separate PR to keep the main ROCm support PR focused on implementation changes. * Fix installer and hardware detection issues for PR #4720 - Fix empty _tri_arg passed to uv pip install in Radeon path (causes "Empty field is not allowed for PEP508" error) - Fix Radeon fallback: use ROCm index instead of CPU-only when repo.radeon.com is unreachable (TORCH_INDEX_URL already has ROCm) - Use $TORCH_CONSTRAINT in fallback paths instead of hardcoded strings - Fix _pick_radeon_wheel: relax suffix to match manylinux_2_28_x86_64 wheels (AMD Radeon repo does not use bare linux_x86_64 platform tag) - Fix IS_ROCM export: use __getattr__ so callers always see the live value after detect_hardware() runs - Fix apply_gpu_ids: set HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES on ROCm so _get_parent_visible_gpu_spec picks up narrowed GPU set - Fix _parse_memory_mb: distinguish GB (1000 MB) from GiB (1024 MiB) - Add amd-smi version as a fallback in _detect_rocm_version - Fix trailing whitespace and missing newline at EOF in install.sh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix GPU detection false positives and add missing health groups - Fix _has_rocm_gpu() false positive: require "GPU: <number>" data rows from amd-smi list, not just header containing "gpu" - Apply same fix in detect_host() in install_llama_prebuilt.py - Add runtime_payload_health_groups for linux-rocm and windows-hip so partial/corrupt ROCm/HIP prebuilt installs are properly detected - Add bitsandbytes install to Radeon fallback paths (was only in the success path, skipped when repo.radeon.com was unreachable) - Keep DEVICE/CHAT_ONLY as direct imports in __init__.py 
(matching main) and only use __getattr__ for IS_ROCM * Fix _ensure_rocm_torch and Windows AMD warning false positives - _ensure_rocm_torch: only skip when HIP is already present, not for CUDA builds (which are unusable on AMD-only hosts). Fixes the case where a venv has a stale CUDA wheel and the repair step is skipped. - Windows AMD warning: use GPU data row check (same as Linux fix) to avoid false positives from amd-smi list header-only output. * Fix amd-smi GPU detection for GPU[N] output format Older amd-smi versions output "GPU[0] : Card series: ..." instead of "GPU: 0". The regex now matches both "GPU: <digit>" and "GPU[<digit>" formats to detect actual GPU data rows. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Harden AMD GPU detection against false positives - install.sh: replace weak amd-smi list check (awk 'NR>1 && NF') with strict pattern matching GPU data rows (/^GPU[[:space:]]*[:\[]/) - All files: reject rocminfo gfx000 (CPU HSA agent) by requiring gfx[1-9] instead of gfx[0-9] in the rocminfo GPU probe - Fixes false positives on hosts with ROCm tools but no AMD GPU * Remove duplicate comment from pre-commit merge * Refactor: deduplicate AMD detection, consolidate bitsandbytes, clean up imports - Extract _has_amd_rocm_gpu() shell function to avoid duplicating the rocminfo/amd-smi GPU detection logic in get_torch_index_url and the Radeon auto-detect block - Consolidate bitsandbytes install into a single case block after torch install (was duplicated 4 times across Radeon success/fallback paths) - Move math and re imports to top of amd.py (were inline in functions) - Add _smi_query() helper in hardware.py to centralize IS_ROCM backend selection for get_gpu_utilization and get_visible_gpu_utilization Addresses Gemini code review suggestions. * Fix VRAM parsing for string values and GB/GiB consistency - Extract unit from string-valued VRAM fields (e.g. 
"192 GiB") so _parse_memory_mb correctly applies the unit multiplier instead of treating the value as bare MB - Treat GB and GiB identically (both as binary x1024) since GPU tools including amd-smi use binary units even when labeling them "GB" - Fixes incorrect VRAM reporting on MI300-class cards (was showing ~0.19 GB instead of 192 GB for string-valued outputs) * Add --no-cache to uv for ROCm HIP source builds Avoid stale cache artifacts from partial HIP source builds when uv is used for causal-conv1d/mamba-ssm compilation on ROCm. The pip path already uses --no-cache-dir; this adds the uv equivalent (--no-cache) only when is_hip is True. * Fix critical: initialize _amd_gpu_radeon before case block _amd_gpu_radeon was only set inside the */rocm*) case arm, so on NVIDIA/CPU/macOS paths where TORCH_INDEX_URL does not contain "rocm", the variable was unbound. With set -u (nounset) enabled, this crashes the installer for every non-AMD user. Move initialization to before the case block so it is always defined. * Fix Windows AMD: route has_rocm hosts to HIP prebuilt path resolve_release_asset_choice was selecting windows-cpu for all Windows x86_64 hosts including those with has_rocm=True. Windows AMD users should fall through to resolve_upstream_asset_choice which tries the HIP prebuilt first. Add "not host.has_rocm" guard to the published windows-cpu selection. * Harden ROCm detection, Radeon wheel fallback, and HIP visibility Addresses review findings from parallel reviewers on PR #4720: - install.sh: add _has_usable_nvidia_gpu() helper requiring nvidia-smi -L to actually list a GPU before treating the host as NVIDIA. Fixes the stale-nvidia-smi-on-PATH regression where AMD-only hosts fell into the CUDA branch. - install.sh: fix hipconfig awk blocks to propagate a non-zero exit code when the output is not a recognisable version string, so the ||-chain continues to dpkg-query / rpm instead of terminating early. - install.sh: fail-closed on Radeon wheel fallback. 
When torch, torchvision or torchaudio is missing from the Radeon repo for the active Python tag, fall back to the standard ROCm index instead of silently mixing Radeon wheels with PyPI defaults. Quote all wheel arguments individually so wheel filenames cannot be word-split or glob-expanded. - install_llama_prebuilt.py: detect_host() now requires nvidia-smi -L to list a GPU before setting has_physical_nvidia. Routes AMD ROCm hosts with a broken leftover nvidia-smi to the ROCm path instead of misclassifying them as NVIDIA. - install_llama_prebuilt.py: scan upstream assets for any rocm-<version> prebuilt instead of hard-coding rocm-7.2, so ROCm 6.x / 7.0 / 7.1 / 7.3+ users pick up a matching upstream prebuilt when one exists. - install_llama_prebuilt.py: validate_server() adds --n-gpu-layers 1 for linux-rocm and windows-hip hosts, so new HIP prebuilts are preflighted on the GPU path instead of passing validation on CPU only. - install_llama_prebuilt.py: restore the published windows-cpu fallback for AMD Windows hosts without a HIP prebuilt so hash-approved bundles are still preferred over the raw upstream CPU asset. - install_python_stack.py: drop the /opt/rocm / hipcc gate in _ensure_rocm_torch() and rely on _has_rocm_gpu(). Runtime-only ROCm installs (package-managed minimal installs, Radeon software) that ship amd-smi / rocminfo without hipcc can now repair a CPU-only venv via "unsloth studio update". Adds an explicit IS_WINDOWS / IS_MACOS guard. - studio/backend/utils/hardware/amd.py: honour HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES in get_primary_gpu_utilization(). A process restricted to GPU 2 now reports metrics for GPU 2 instead of physical GPU 0. Tighten the plain bytes unit detection to an explicit allowlist. - studio/backend/utils/hardware/hardware.py: route get_backend_visible_gpu_info()'s backend_cuda_visible_devices field through a helper that reads HIP_VISIBLE_DEVICES on ROCm. 
Drop the unconditional "(rocm=False)" suffix in apply_gpu_ids() logs. * Fix round 2 regressions: ROCm validate_server and Windows HIP routing Follow-up to 810b833b addressing review findings on the first round of hardening commits: - install_llama_prebuilt.py validate_server: gate --n-gpu-layers on the resolved install_kind instead of host.has_rocm. AMD Windows hosts without a HIP prebuilt fall back to windows-cpu and must not be validated with GPU layers; thread install_kind through from the caller. - install_llama_prebuilt.py resolve_release_asset_choice: reinstate the "not has_rocm" guard on the published windows-cpu bundle so AMD Windows hosts reach resolve_upstream_asset_choice() where the new HIP prebuilt path lives. Prefer a published windows-hip bundle first when one exists, fall through to upstream HIP + upstream CPU otherwise. - install_llama_prebuilt.py detect_host: also set has_physical_nvidia when the secondary --query-gpu block confirms a working NVIDIA GPU, so older nvidia-smi versions without -L support do not silently skip the Linux diagnostics that key off has_physical_nvidia. - install_llama_prebuilt.py: drop redundant "import re as _re" / "import re as _re_rocm" local aliases in favour of the existing top-level "import re". - install_python_stack.py _ensure_rocm_torch: run the AMD bitsandbytes install unconditionally after the HIP-torch probe so "unsloth studio update" on venvs that already have ROCm torch still gains the AMD bitsandbytes build. - install.sh: add a non-x86_64 early-exit to get_torch_index_url() so aarch64 / arm64 Linux hosts do not hit the ROCm wheel index (PyTorch only publishes ROCm wheels for linux_x86_64). - install.sh: add bitsandbytes install to the migrated-environment branch so upgrades pick it up for ROCm hosts instead of only the fresh-install path. 
- install.sh: in the Radeon wheel path, pass version constraints + --no-index --find-links to uv instead of explicit wheel URLs so a version-compatible torch / torchvision / torchaudio triple is resolved, rather than picking the highest-version wheel for each package independently. - studio/backend/utils/hardware/amd.py _first_visible_amd_gpu_id: fall through to lower-priority visibility env vars when the first entry is malformed (leading comma, all-whitespace first token) instead of silently returning GPU 0. * Fix round 3 findings: x86_64 guard, ROCm version clip, Radeon deps Address issues surfaced by the round 3 reviewers on top of 8636fa63: - install_python_stack.py _ensure_rocm_torch: add the same `x86_64` guard that install.sh already has. Linux aarch64 / arm64 ROCm hosts must skip the repair path entirely; PyTorch only publishes ROCm wheels for linux_x86_64, and without this guard `unsloth studio update` aborts with a missing-wheel error on non x86_64 hosts. - install_llama_prebuilt.py resolve_upstream_asset_choice: add a best-effort _detect_host_rocm_version() helper (reading /opt/rocm/.info/version, amd-smi version, hipconfig --version) and filter rocm_candidates to entries whose major.minor is <= host version. Falls back to the newest candidate only when no compatible one exists, so a ROCm 6.4 host downloads rocm-6.4 instead of being handed the numerically newest rocm-7.2 bundle (which fails preflight and forces a source build). - install.sh: remove the round 2 --no-index switch from the Radeon wheel branch. --no-index forced uv to ignore PyPI entirely, which broke transitive dependency resolution (filelock, sympy, networkx, jinja2, fsspec, setuptools, typing-extensions, ...) on a fresh venv. Restore the round 1 explicit wheel URL invocation but add a torch / torchvision / torchaudio version-pair sanity check so a mismatched trio (e.g. 
torch 2.9.1 + torchvision 0.23.0 + torchaudio 2.9.0) falls back to the standard ROCm index instead of installing a broken combination. - install_python_stack.py _ensure_rocm_torch: restructure the "tag is None" path so it no longer short-circuits the bitsandbytes install. On a ROCm runtime older than anything in _ROCM_TORCH_INDEX, print the "no wheel" warning but still run the AMD bitsandbytes install. - studio/backend/core/training/worker.py: restore the pre-PR "no timeout" behaviour for non-HIP causal-conv1d / mamba-ssm source builds. The round 2 "timeout = 1800 if is_hip else 300" cap aborts slow non-HIP builds (Linux aarch64, unsupported torch/CUDA combos) after 5 minutes; omit timeout for the non-HIP branch so the cap only applies to ROCm source builds. * Fix round 4 findings: apply_gpu_ids env inheritance, Radeon X.Y, bitsandbytes gate Address remaining issues surfaced by the round 4 reviewers: - studio/backend/utils/hardware/hardware.py apply_gpu_ids: mirror the selection into HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES whenever the caller already had a ROCm visibility env var set, not only when IS_ROCM has already been set by detect_hardware(). Training and inference workers call apply_gpu_ids() before detect_hardware() runs, so the old guard would leave a forked ROCm worker with a stale HIP_VISIBLE_DEVICES mask that no longer matched the narrowed CUDA_VISIBLE_DEVICES selection. - install.sh get_radeon_wheel_url: accept X.Y ROCm versions in addition to X.Y.Z. The `/opt/rocm/.info/version` file and some hipconfig versions report only two components, and the Radeon repository publishes both rocm-rel-X.Y.Z/ and rocm-rel-X.Y/ directories, so treating X.Y as invalid caused Radeon hosts to fall back to the generic ROCm index even when a matching AMD wheel set existed. - install_python_stack.py _ensure_rocm_torch: only install the AMD bitsandbytes build when the venv actually has a ROCm-compatible torch (either already present or just installed by this function). 
Previously the bitsandbytes install ran unconditionally, which could leave an AMD bitsandbytes layered on top of a CPU/CUDA torch on hosts where the ROCm runtime is older than any entry in _ROCM_TORCH_INDEX. Also add --force-reinstall so an existing CPU/CUDA bitsandbytes is replaced by the AMD build during upgrades. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix gemini findings: amd-smi metric envelope validation and dict-wrapped GPU id Two medium-severity defensive fixes from the gemini-code-assist review on the AMD monitoring backend: 1. _extract_gpu_metrics may return a dict where every value is None when amd-smi succeeds (zero exit) but the JSON envelope contains no usable fields (error response, unsupported card). The new _has_real_metrics helper lets get_primary_gpu_utilization surface available:False and lets get_visible_gpu_utilization skip ghost device rows so the UI does not render placeholder cards with empty numbers. 2. Newer amd-smi versions wrap scalar fields as {"value": 0, "unit": "none"}, including the per-GPU id. The previous int(raw_id) call silently fell back to the enumeration index in that case, losing the real GPU id. Routing raw_id through the existing _parse_numeric helper handles bare ints, floats, strings, and the dict shape uniformly, with a debug log on parse failure. * Fix gemini round 2 findings: explicit length guard on ROCm version file parser Both _detect_rocm_version (install_python_stack.py) and _detect_host_rocm_version (install_llama_prebuilt.py) read /opt/rocm/.info/version or $ROCM_PATH/lib/rocm_version, split on "." and unconditionally accessed parts[1]. The surrounding broad `except Exception: pass` already swallowed the resulting IndexError, so a one-component file like "6\n" did fall through to the next detection source -- but the control flow relied on exception handling instead of an explicit check. 
Add `if len(parts) >= 2:` guards in both helpers so the loop falls through on its own without raising. Behaviour is unchanged for the common multi- component case; the previously-silent IndexError path becomes an explicit no-op. * Fix gemini round 3: include has_rocm in validate_server fallback path When validate_server is called without an explicit install_kind (older call sites that have not been updated), the fallback was only enabling --n-gpu-layers for NVIDIA and macOS arm64 hosts. AMD ROCm Linux hosts fell through to the CPU validation path even though the prebuilt being exercised was a HIP binary. Add host.has_rocm to the fallback expression so the GPU offload flag is applied consistently with the install_kind=='linux-rocm' / 'windows-hip' branches above. * Fix gemini round 4: remove risky bytes-vs-MB heuristic in _parse_memory_mb The previous heuristic divided any bare number above 10_000_000 by 1024*1024 on the assumption that large unit-less values were bytes. This misclassified small VRAM allocations: 5 MB of used VRAM reported as 5_242_880 bytes without a unit would be taken at face value and render as 5_242_880 MB (~5 TB) in the monitoring UI. Modern amd-smi always provides explicit units (MiB/GiB dict form), and legacy amd-smi returns bare numbers in MB -- the heuristic never had a real workload to handle. Drop it and default to MB for bare numeric input, keeping the existing unit-aware branches for dict / string inputs unchanged. The unrelated gemini suggestion to "default minor to 0" in the amd-smi version awk parser was intentionally NOT applied: rocm7.0 and rocm7.1 ship different wheel sets, so silently substituting 0 for a missing minor could install the wrong wheels. The existing reject-and-fall-through behaviour is safer. * Fix gemini round 5: POSIX compliance and leading-comma visibility parsing Three medium findings from gemini-code-assist addressed in this commit: 1. 
def _infer_no_torch() -> bool:
    """Determine whether to run in no-torch (GGUF-only) mode.
    Checks UNSLOTH_NO_TORCH env var first. When unset, falls back to
    platform detection so that Intel Macs automatically use GGUF-only
    mode even when invoked from ``unsloth studio update`` (which does
    not inject the env var).
    """
    env = os.environ.get("UNSLOTH_NO_TORCH")
    if env is not None:
        return env.strip().lower() in ("1", "true")
    return IS_MAC_INTEL


NO_TORCH = _infer_no_torch()
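The precedence in `_infer_no_torch()` can be checked without touching the real environment. The sketch below is a hypothetical pure variant (`infer_no_torch` is an illustrative name, not part of the installer) that takes the environment mapping and the Intel-Mac flag as arguments in place of `os.environ` and `IS_MAC_INTEL`.

```python
from typing import Mapping, Optional


def infer_no_torch(env: Mapping[str, str], is_mac_intel: bool) -> bool:
    """Hypothetical pure version of _infer_no_torch() for illustration.

    An explicitly set UNSLOTH_NO_TORCH always wins; only when the variable
    is absent does the platform check decide.
    """
    raw: Optional[str] = env.get("UNSLOTH_NO_TORCH")
    if raw is not None:
        # Only "1" and "true" (trimmed, case-insensitive) enable the mode;
        # any other explicit value, even an empty string, disables it.
        return raw.strip().lower() in ("1", "true")
    return is_mac_intel


print(infer_no_torch({"UNSLOTH_NO_TORCH": "false"}, is_mac_intel = True))  # False
print(infer_no_torch({"UNSLOTH_NO_TORCH": " TRUE "}, is_mac_intel = False))  # True
print(infer_no_torch({}, is_mac_intel = True))  # True
```

Because any explicitly set value short-circuits the platform check, a value other than `1` or `true` (case-insensitive), including an empty string, turns no-torch mode off even on an Intel Mac.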
# -- Verbosity control ----------------------------------------------------------
# By default the installer shows a minimal progress bar (one line, in-place).
# Set UNSLOTH_VERBOSE=1 in the environment to restore full per-step output:
# CLI: unsloth studio setup --verbose
# Linux/Mac: UNSLOTH_VERBOSE=1 ./studio/setup.sh
# Windows: $env:UNSLOTH_VERBOSE="1" ; .\studio\setup.ps1
VERBOSE: bool = os.environ.get("UNSLOTH_VERBOSE", "0") == "1"
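Note that `VERBOSE` is an exact string comparison, unlike the `UNSLOTH_NO_TORCH` parsing above, which trims whitespace and also accepts `true`. A minimal sketch of the same check, using a hypothetical `is_verbose()` helper over a plain dict, shows which values enable verbose output:

```python
def is_verbose(env: dict) -> bool:
    """Hypothetical helper with the same test as the VERBOSE constant."""
    # Only the exact string "1" enables verbose mode; the default is "0".
    return env.get("UNSLOTH_VERBOSE", "0") == "1"


for value in ("1", "true", " 1", "0"):
    print(repr(value), is_verbose({"UNSLOTH_VERBOSE": value}))
print("unset", is_verbose({}))
```

Only the `'1'` row prints `True`, so `UNSLOTH_VERBOSE=true` silently leaves the installer in progress-bar mode.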
# Progress bar state -- updated by _progress() as each install step runs.
# If you add or remove install steps, update the _TOTAL computation in install_python_stack().
_STEP: int = 0
_TOTAL: int = 0 # set at runtime in install_python_stack() based on platform
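A one-line, in-place bar like the one described above is typically drawn by rewriting the same terminal line after a carriage return. The following is a minimal sketch under that assumption; `render_progress()`, `progress()`, the 15-character label width and the step names are hypothetical choices for illustration, not the installer's actual `_progress()` implementation.

```python
import sys
from typing import TextIO


def render_progress(step: int, total: int, label: str, width: int = 20) -> str:
    """Build one in-place progress-bar line (hypothetical helper)."""
    total = max(total, 1)  # guard against division by zero
    step = min(max(step, 0), total)  # clamp so the bar never overflows
    filled = width * step // total
    bar = "#" * filled + "-" * (width - filled)
    # "\r" returns the cursor to column 0 so the next call overwrites this
    # line; the label is padded (and truncated) to 15 chars so a shorter
    # label fully covers a longer one from the previous step.
    return f"\r[{bar}] {step}/{total} {label:<15.15}"


def progress(step: int, total: int, label: str, out: TextIO = sys.stdout) -> None:
    """Write the bar without a newline and flush so it appears immediately."""
    out.write(render_progress(step, total, label))
    out.flush()


for i, name in enumerate(["base", "extras", "overrides"], start = 1):
    progress(i, 3, name)
print()  # finish the line once all steps are done
```

Deriving the total from the same platform checks that decide which steps run keeps the bar from stopping short of 100% when a step is skipped.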
# -- Paths --------------------------------------------------------------
SCRIPT_DIR = Path(__file__).resolve().parent
REQ_ROOT = SCRIPT_DIR / "backend" / "requirements"
SINGLE_ENV = REQ_ROOT / "single-env"
CONSTRAINTS = SINGLE_ENV / "constraints.txt"
LOCAL_DD_UNSTRUCTURED_PLUGIN = (
    SCRIPT_DIR / "backend" / "plugins" / "data-designer-unstructured-seed"
)
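Every path constant is derived from the script's own location (`Path(__file__).resolve().parent`), so the installer does not depend on the working directory it is launched from. The sketch below rebuilds the same layout from an arbitrary file path. `requirement_paths()` and the `/opt/unsloth/...` location are hypothetical, and `PurePosixPath` is used so the example never touches the filesystem (the real code also calls `resolve()`, which makes the path absolute and follows symlinks).

```python
from pathlib import PurePosixPath
from typing import Dict


def requirement_paths(script_file: str) -> Dict[str, PurePosixPath]:
    """Hypothetical helper that rebuilds the path constants from one file path."""
    script_dir = PurePosixPath(script_file).parent  # like SCRIPT_DIR, minus resolve()
    req_root = script_dir / "backend" / "requirements"
    return {
        "REQ_ROOT": req_root,
        "SINGLE_ENV": req_root / "single-env",
        "CONSTRAINTS": req_root / "single-env" / "constraints.txt",
        "LOCAL_DD_UNSTRUCTURED_PLUGIN": (
            script_dir / "backend" / "plugins" / "data-designer-unstructured-seed"
        ),
    }


paths = requirement_paths("/opt/unsloth/studio/install_python_stack.py")
print(paths["CONSTRAINTS"])
# /opt/unsloth/studio/backend/requirements/single-env/constraints.txt
```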
# -- Unicode-safe printing ---------------------------------------------
# On Windows the default console encoding can be a legacy code page
# (e.g. CP1252) that cannot represent Unicode glyphs such as ✅ or ❌.
# _safe_print() gracefully degrades to ASCII equivalents so the
# installer never crashes just because of a status glyph.
_UNICODE_TO_ASCII: dict[str, str] = {
    "\u2705": "[OK]",  # ✅
    "\u274c": "[FAIL]",  # ❌
    "\u26a0\ufe0f": "[!]",  # ⚠️ (warning + variation selector)
    "\u26a0": "[!]",  # ⚠ (warning without variation selector)
}
def _safe_print(*args: object, **kwargs: object) -> None:
    """Drop-in print() replacement that survives non-UTF-8 consoles."""
    try:
        print(*args, **kwargs)
    except UnicodeEncodeError:
        # Stringify, then swap emoji for ASCII equivalents
        text = " ".join(str(a) for a in args)
        for uni, ascii_alt in _UNICODE_TO_ASCII.items():
            text = text.replace(uni, ascii_alt)
        # Final fallback: replace any remaining unencodable chars
        print(
            text.encode(sys.stdout.encoding or "ascii", errors = "replace").decode(
                sys.stdout.encoding or "ascii", errors = "replace"
            ),
            **kwargs,
        )
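The fallback branch of `_safe_print()` is easiest to see as a pure function. This sketch assumes a CP1252 console and uses hypothetical names (`ASCII_FALLBACKS`, `to_console_safe`); it reproduces the emoji substitution and the final `errors = "replace"` round trip, but not the actual `print()` call.

```python
ASCII_FALLBACKS = {
    "\u2705": "[OK]",
    "\u274c": "[FAIL]",
    "\u26a0\ufe0f": "[!]",  # listed before the bare "\u26a0" entry on purpose
    "\u26a0": "[!]",
}


def to_console_safe(text: str, encoding: str) -> str:
    """Hypothetical pure version of the UnicodeEncodeError branch of _safe_print()."""
    # Dicts keep insertion order, so the two-code-point warning sign is
    # replaced before the bare one; otherwise the variation selector
    # would be left behind and end up as "?".
    for uni, ascii_alt in ASCII_FALLBACKS.items():
        text = text.replace(uni, ascii_alt)
    # Any remaining unencodable character becomes "?" instead of raising.
    return text.encode(encoding, errors = "replace").decode(encoding, errors = "replace")


print(to_console_safe("\u2705 extras installed", "cp1252"))  # [OK] extras installed
print(to_console_safe("\u26a0\ufe0f skipped \u2192 next", "cp1252"))  # [!] skipped ? next
```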
# ── Color support ──────────────────────────────────────────────────────
# Same logic as startup_banner: NO_COLOR disables, FORCE_COLOR or TTY enables.
def _stdout_supports_color() -> bool:
    """True if we should emit ANSI colors (matches startup_banner)."""
    if os.environ.get("NO_COLOR", "").strip():
        return False
    if os.environ.get("FORCE_COLOR", "").strip():
        return True
    try:
        if not sys.stdout.isatty():
            return False
    except (AttributeError, OSError, ValueError):
        return False
    if IS_WINDOWS:
        try:
            import ctypes

            kernel32 = ctypes.windll.kernel32
            handle = kernel32.GetStdHandle(-11)  # -11 == STD_OUTPUT_HANDLE
            mode = ctypes.c_ulong()
            kernel32.GetConsoleMode(handle, ctypes.byref(mode))
            # 0x0004 == ENABLE_VIRTUAL_TERMINAL_PROCESSING (console interprets ANSI escapes)
            kernel32.SetConsoleMode(handle, mode.value | 0x0004)
except (ImportError, AttributeError, OSError):
return False
return True
_HAS_COLOR = _stdout_supports_color()
# Column layout — matches setup.sh step() helper:
# 2-space indent, 15-char label (dim), then value.
_LABEL = "deps"
_COL = 15
def _green(msg: str) -> str:
    return f"\033[38;5;108m{msg}\033[0m" if _HAS_COLOR else msg


def _cyan(msg: str) -> str:
    return f"\033[96m{msg}\033[0m" if _HAS_COLOR else msg


def _red(msg: str) -> str:
    return f"\033[91m{msg}\033[0m" if _HAS_COLOR else msg


def _dim(msg: str) -> str:
    return f"\033[38;5;245m{msg}\033[0m" if _HAS_COLOR else msg


def _title(msg: str) -> str:
    return f"\033[38;5;150m{msg}\033[0m" if _HAS_COLOR else msg
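Each helper above wraps its text in an ANSI SGR sequence (`\033[<code>m` to start, `\033[0m` to reset). Here `38;5;N` selects colour N from the 256-colour palette, and `91`/`96` are the bright red/cyan foreground codes. When `_HAS_COLOR` is false, the helper returns the text unchanged. The sketch below expresses the same pattern as a factory. `make_color` and its `enabled` parameter are hypothetical names, not part of this module:

```python
from typing import Callable


def make_color(sgr_code: str, enabled: bool) -> Callable[[str], str]:
    """Return a helper that wraps text in an ANSI SGR colour sequence.

    Hypothetical factory mirroring _green/_red/_dim: when `enabled` is False
    the returned helper passes text through unchanged.
    """

    def colorize(msg: str) -> str:
        # ESC [ <code> m starts the colour; ESC [ 0 m resets all attributes.
        return f"\033[{sgr_code}m{msg}\033[0m" if enabled else msg

    return colorize


green = make_color("38;5;108", enabled = True)  # 256-colour palette index 108
plain_red = make_color("91", enabled = False)  # colour disabled
print(repr(green("ok")))
print(plain_red("failed"))
```

Binding `enabled` when the helper is created mirrors how the module computes `_HAS_COLOR` once, at import time.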
_RULE = "\u2500" * 52


def _step(label: str, value: str, color_fn = None) -> None:
    """Print a single step line in the column format."""
    if color_fn is None:
        color_fn = _green
    padded = label[:_COL]
    print(f"  {_dim(padded)}{' ' * (_COL - len(padded))}{color_fn(value)}")
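`_step` truncates the label to `_COL` characters and then pads it back to that width, so the values line up in a single column no matter how long the label is. The sketch below applies the same formatting without colour and returns the line instead of printing it. `format_step` is a hypothetical helper, and the two-space indent follows the column-layout comment above:

```python
COL = 15  # label column width, same value as _COL


def format_step(label: str, value: str, col: int = COL) -> str:
    """Return one step line: 2-space indent, fixed-width label, then value."""
    padded = label[:col]  # truncate labels longer than the column
    return f"  {padded}{' ' * (col - len(padded))}{value}"


print(format_step("deps", "installed"))
print(format_step("a-very-long-step-label", "done"))
```

Computing the padding from the already-truncated label is the Python counterpart of a printf `%-15.15s` format: short labels get padded and long ones get clipped.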
def _progress(label: str) -> None:
    """Print an in-place progress bar aligned to the step column layout."""
    global _STEP
    _STEP += 1
    if VERBOSE:
        return
    width = 20
    filled = int(width * _STEP / _TOTAL)
    bar = "=" * filled + "-" * (width - filled)
    pad = " " * (_COL - len(_LABEL))
    end = "\n" if _STEP >= _TOTAL else ""
    sys.stdout.write(
        f"\r  {_dim(_LABEL)}{pad}[{bar}] {_STEP:2}/{_TOTAL} {label:<20}{end}"
    )
    sys.stdout.flush()
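`_progress` draws a 20-cell bar whose filled portion is `int(width * _STEP / _TOTAL)`. It rewrites the same terminal line using a leading `\r` and emits a newline only on the final step. The sketch below isolates the bar arithmetic as a pure function, so it can be checked without a terminal. `render_bar` is a hypothetical name, and both the zero-total guard and the clamping of `step` are added assumptions that do not appear in the original:

```python
def render_bar(step: int, total: int, width: int = 20) -> str:
    """Return the '=' / '-' bar for `step` of `total` (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Clamp so an extra call past the last step cannot overflow the bar.
    step = max(0, min(step, total))
    filled = int(width * step / total)  # floor, as in _progress
    return "=" * filled + "-" * (width - filled)


for step in (1, 5, 10):
    print(f"[{render_bar(step, 10)}] {step:2}/10")
```

The line is terminated only when `_STEP >= _TOTAL`, so `_TOTAL` must equal the number of `_progress` calls actually made. If a step is skipped on some platform but still counted, the bar never reaches its final newline.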
def run(
    label: str, cmd: list[str], *, quiet: bool = True
) -> subprocess.CompletedProcess[bytes]:
    """Run a command; on failure print output and exit."""
    if VERBOSE:
        _step(_LABEL, f"{label}...", _dim)
    result = subprocess.run(
        cmd,
        stdout = subprocess.PIPE if quiet else None,
        stderr = subprocess.STDOUT if quiet else None,
    )
    if result.returncode != 0:
        _step("error", f"{label} failed (exit code {result.returncode})", _red)
        if result.stdout:
            print(result.stdout.decode(errors = "replace"))
        sys.exit(result.returncode)
    return result
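In quiet mode, `run` folds stderr into stdout (`stderr=subprocess.STDOUT`), so a single captured buffer holds both streams. That buffer is printed only when the command fails. The sketch below reproduces this capture-and-report pattern in self-contained form. `run_quiet` is hypothetical, it exits through `sys.exit` just as `run` does, and the demo commands use `sys.executable` so they work on any platform:

```python
from __future__ import annotations

import subprocess
import sys


def run_quiet(label: str, cmd: list[str]) -> subprocess.CompletedProcess[bytes]:
    """Run `cmd` with output captured; on failure show it and exit (sketch)."""
    result = subprocess.run(
        cmd,
        stdout = subprocess.PIPE,
        stderr = subprocess.STDOUT,  # fold stderr into the same captured buffer
    )
    if result.returncode != 0:
        print(f"error: {label} failed (exit code {result.returncode})")
        if result.stdout:
            print(result.stdout.decode(errors = "replace"))
        sys.exit(result.returncode)
    return result


ok = run_quiet("echo", [sys.executable, "-c", "print('hello')"])
print(ok.stdout.decode().strip())

try:
    run_quiet("fail", [sys.executable, "-c", "import sys; sys.exit(3)"])
except SystemExit as exc:
    print("exited with", exc.code)
```

When `quiet=False`, the original passes `None` for both streams. Output then goes straight to the terminal and `result.stdout` is `None`, which is why the failure branch checks `if result.stdout` before decoding.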
# Packages to skip on Windows (require special build steps)
WINDOWS_SKIP_PACKAGES = {"open_spiel", "triton_kernels"}
Previously SKIP_TORCH+non-local fell into else and installed unsloth-zoo (which depends on torch), defeating --no-torch mode. - Fix $env:UNSLOTH_NO_TORCH leak in install.ps1: always set to "true" or "false" instead of only setting on the true branch. Prevents stale no-torch state from leaking across runs in the same PS session. - Fix install_python_stack.py update path: add NO_TORCH guard around base.txt install so unsloth studio update does not reinstall unsloth-zoo (which depends on torch) in no-torch mode. * fix: install unsloth + unsloth-zoo with --no-deps in no-torch mode Instead of skipping unsloth-zoo entirely (which breaks unsloth's dependency on it), install both packages with --no-deps so they are present but torch is not pulled in transitively. Applied consistently across all no-torch paths: migrated-env, fresh-local, fresh-non-local in install.sh, install.ps1, and install_python_stack.py. * chore: temporarily remove test files (will be added in a follow-up) * refactor: deduplicate SKIP_TORCH conditional branches in installers Collapse if/else blocks that differ only by --no-deps into a single branch with a conditional flag variable. Applied to migrated-env and fresh-local paths in install.sh, install.ps1, and install_python_stack.py. * fix: apply --no-deps to fresh non-local --no-torch install path The non-local else branch was missing $_no_deps_arg/$noDepsArg, so uv pip install unsloth would resolve torch from PyPI metadata (the published unsloth package still declares torch as a hard dep). Now --no-deps is applied consistently to all SKIP_TORCH code paths. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-03-27 09:09:21 +00:00
# Packages to skip when torch is unavailable (GGUF-only mode: Intel Macs or
# --no-torch installs). These packages either *are* torch extensions or have
# an unconditional ``Requires-Dist: torch`` in their published metadata, so
# installing them would pull torch back into the environment.
NO_TORCH_SKIP_PACKAGES = {
    "torch-stoi",
    "timm",
    "torchcodec",
    "torch-c-dlpack-ext",
    "openai-whisper",
    "transformers-cfg",
}


def _select_flash_attn_version(torch_mm: str) -> str | None:
    return flash_attn_package_version(torch_mm)


def _build_flash_attn_wheel_url(env: dict[str, str]) -> str | None:
    return flash_attn_wheel_url(env)


def _print_optional_install_failure(
    label: str, result: subprocess.CompletedProcess[str]
) -> None:
    _step("warning", f"{label} failed (exit code {result.returncode})", _cyan)
    if result.stdout:
        print(result.stdout.strip())


def _flash_attn_install_disabled() -> bool:
    return os.getenv("UNSLOTH_STUDIO_SKIP_FLASHATTN_INSTALL") == "1"


def _ensure_flash_attn() -> None:
    if NO_TORCH or IS_WINDOWS or IS_MACOS:
        return
    if _flash_attn_install_disabled():
        return
    if (
        subprocess.run(
            [sys.executable, "-c", "import flash_attn"],
            stdout = subprocess.DEVNULL,
            stderr = subprocess.DEVNULL,
        ).returncode
        == 0
    ):
        return
    env = probe_torch_wheel_env()
    wheel_url = _build_flash_attn_wheel_url(env) if env else None
    if wheel_url and url_exists(wheel_url):
        for installer, wheel_result in install_wheel(
            wheel_url,
            python_executable = sys.executable,
            use_uv = USE_UV,
            uv_needs_system = UV_NEEDS_SYSTEM,
        ):
            if wheel_result.returncode == 0:
                return
            _print_optional_install_failure(
                f"Installing flash-attn prebuilt wheel with {installer}",
                wheel_result,
            )
        _step("warning", "Continuing without flash-attn", _cyan)
        return
    if wheel_url is None:
        _step("warning", "No compatible flash-attn prebuilt wheel found", _cyan)
    else:
        _step("warning", "No published flash-attn prebuilt wheel found", _cyan)
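The decision order in `_ensure_flash_attn` can be sketched as a pure function with the side effects injected: skip on no-torch/Windows/macOS or when opted out, skip if `flash_attn` already imports, otherwise try each installer for a prebuilt wheel and fall back to a warning. Everything here is hypothetical. `module_importable`, `ensure_optional_wheel`, and its parameters stand in for `probe_torch_wheel_env`, `url_exists`, and `install_wheel`, whose real signatures are not shown in this excerpt. The sketch returns a status label instead of printing, so each branch is easy to check.

```python
import subprocess
import sys
from typing import Callable, Iterable, Optional


def module_importable(module: str, python: str = sys.executable) -> bool:
    """Return True if `module` imports cleanly in the given interpreter."""
    # Run in a child process so a broken install cannot crash the caller.
    result = subprocess.run(
        [python, "-c", f"import {module}"],
        stdout = subprocess.DEVNULL,
        stderr = subprocess.DEVNULL,
    )
    return result.returncode == 0


def ensure_optional_wheel(
    skip: bool,
    already_installed: Callable[[], bool],
    wheel_url: Optional[str],
    url_exists: Callable[[str], bool],
    install_attempts: Callable[[str], Iterable[tuple[str, int]]],
) -> str:
    """Hypothetical mirror of the flash-attn flow; returns a status label."""
    if skip:
        return "skipped"
    if already_installed():
        return "already-installed"
    if wheel_url and url_exists(wheel_url):
        # Each attempt is (installer name, exit code); stop at the first success.
        for _installer, returncode in install_attempts(wheel_url):
            if returncode == 0:
                return "installed"
        return "install-failed"
    if wheel_url is None:
        return "no-compatible-wheel"
    return "no-published-wheel"


status = ensure_optional_wheel(
    skip = False,
    already_installed = lambda: False,
    wheel_url = "https://example.invalid/flash_attn.whl",
    url_exists = lambda url: True,
    install_attempts = lambda url: [("uv", 1), ("pip", 0)],
)
# status == "installed": the uv attempt failed, the pip attempt succeeded
```

Keeping the flow free of printing and subprocess calls (apart from the import probe) means every branch can be exercised without a GPU or network access.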


# -- uv bootstrap ------------------------------------------------------
USE_UV = False  # Set by _bootstrap_uv() at the start of install_python_stack()
UV_NEEDS_SYSTEM = False  # Set by _bootstrap_uv() via probe
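`_bootstrap_uv` decides whether uv is usable by running a dry-run install that names the interpreter explicitly with `--python sys.executable`. A minimal sketch of that probe with the runner injected, so it can be exercised without uv installed; `build_uv_probe_cmd`, `_run_quietly`, and `uv_probe_ok` are hypothetical names. The sketch only reports whether the dry run succeeded; how the real code derives `UV_NEEDS_SYSTEM` from a failed probe lies outside this excerpt.

```python
import shutil
import subprocess
import sys
from typing import Callable, Optional


def build_uv_probe_cmd(python: str = sys.executable) -> list[str]:
    """Dry-run `uv pip install pip`, pinned to a specific interpreter."""
    return ["uv", "pip", "install", "--dry-run", "--python", python, "pip"]


def _run_quietly(cmd: list[str]) -> int:
    """Run `cmd` with stdout and stderr captured together; return the exit code."""
    return subprocess.run(
        cmd, stdout = subprocess.PIPE, stderr = subprocess.STDOUT
    ).returncode


def uv_probe_ok(
    python: str = sys.executable,
    run: Optional[Callable[[list[str]], int]] = None,
) -> bool:
    """Return True if the uv dry run exits with status 0."""
    if run is None:
        # Only the real runner needs uv on PATH; an injected runner skips this check.
        if shutil.which("uv") is None:
            return False
        run = _run_quietly
    return run(build_uv_probe_cmd(python)) == 0
```

Passing `--python` pins the probe to the interpreter running the installer, which is the same reason the original comment gives for uv otherwise ignoring the activated venv on some platforms.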


def _bootstrap_uv() -> bool:
    """Check if uv is available and probe whether --system is needed."""
    global UV_NEEDS_SYSTEM
    if not shutil.which("uv"):
        return False

    # Probe: try a dry-run install targeting the current Python explicitly.
    # Without --python, uv can ignore the activated venv on some platforms.
    probe = subprocess.run(
        ["uv", "pip", "install", "--dry-run", "--python", sys.executable, "pip"],
        stdout = subprocess.PIPE,
        stderr = subprocess.STDOUT,
    )
    if probe.returncode != 0:
        # Retry with --system (some envs need it when uv can't find a venv)
        probe_sys = subprocess.run(
            ["uv", "pip", "install", "--dry-run", "--system", "pip"],
            stdout = subprocess.PIPE,
            stderr = subprocess.STDOUT,
        )
        if probe_sys.returncode != 0:
            return False  # uv is broken, fall back to pip
        UV_NEEDS_SYSTEM = True
    return True
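The probe above returns `True` or `False` and records the `--system` fallback in the module-level `UV_NEEDS_SYSTEM` flag. The same two-step decision can be sketched as a pure function, so the fallback order is testable without uv installed. `choose_uv_target`, `subprocess_runner`, and the `"python"`/`"system"` return values are hypothetical names, not part of the installer.

```python
import subprocess
import sys
from typing import Callable, Optional, Sequence


# Hypothetical helper: decide how uv should target an interpreter.
# Returns "python" (use --python sys.executable), "system" (use --system),
# or None (uv is unusable, so the caller should fall back to pip).
def choose_uv_target(run: Callable[[Sequence[str]], int]) -> Optional[str]:
    base = ["uv", "pip", "install", "--dry-run"]
    # First try: point uv at the running interpreter explicitly.
    if run(base + ["--python", sys.executable, "pip"]) == 0:
        return "python"
    # Second try: fall back to uv's --system mode.
    if run(base + ["--system", "pip"]) == 0:
        return "system"
    return None


def subprocess_runner(cmd: Sequence[str]) -> int:
    """Run `cmd`, discard its output, and return only the exit code."""
    try:
        completed = subprocess.run(
            list(cmd),
            stdout = subprocess.DEVNULL,
            stderr = subprocess.STDOUT,
        )
    except FileNotFoundError:
        return 127  # executable (e.g. uv) not found on PATH
    return completed.returncode
```

`choose_uv_target(subprocess_runner)` performs the real probe, while tests can pass a lambda that fakes uv's exit codes.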


def _filter_requirements(req: Path, skip: set[str]) -> Path:
    """Return a temp copy of a requirements file with certain packages removed.

    A line is dropped when its stripped, lower-cased text starts with any
    entry in ``skip``. The temp file is created with ``delete = False``, so
    it is not removed automatically.
    """
    lines = req.read_text(encoding = "utf-8").splitlines(keepends = True)
    filtered = [
        line
        for line in lines
        if not any(line.strip().lower().startswith(pkg) for pkg in skip)
    ]
    tmp = tempfile.NamedTemporaryFile(
        mode = "w",
        suffix = ".txt",
        delete = False,
        encoding = "utf-8",
    )
    tmp.writelines(filtered)
    tmp.close()
    return Path(tmp.name)
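`_filter_requirements` matches skip entries as plain string prefixes, so an entry such as `torch` also drops `torchvision` or `torchao` lines; that may well be intended. The sketch below contrasts that rule with a stricter, hypothetical alternative that compares only the leading project name, normalized as PEP 503 describes. `filter_prefix` and `filter_exact` are illustrative names and operate on in-memory lines rather than files.

```python
import re
from typing import Iterable, List

# Leading project name of a requirement line (stops at ==, >=, [, ;, space...).
_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def filter_prefix(lines: Iterable[str], skip: set) -> List[str]:
    # Same rule as _filter_requirements: drop a line whose stripped,
    # lower-cased text starts with any entry in `skip`.
    return [
        ln for ln in lines
        if not any(ln.strip().lower().startswith(p) for p in skip)
    ]


def filter_exact(lines: Iterable[str], skip: set) -> List[str]:
    # Hypothetical stricter variant: compare the PEP 503-normalized name only.
    # Assumes the entries in `skip` are already normalized.
    kept = []
    for ln in lines:
        m = _NAME.match(ln)
        name = re.sub(r"[-_.]+", "-", m.group(1)).lower() if m else ""
        if name not in skip:
            kept.append(ln)  # comments, "-r" includes, etc. are kept
    return kept
```

With `["torch==2.5.1\n", "torchvision\n", "numpy>=1.26\n"]` and `skip = {"torch"}`, the prefix rule keeps only `numpy`, while the exact-name variant also keeps `torchvision`.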


def _translate_pip_args_for_uv(args: tuple[str, ...]) -> list[str]:
    """Translate pip flags to their uv equivalents."""
    translated: list[str] = []
    for arg in args:
        if arg == "--no-cache-dir":
            continue  # uv cache is fast; drop this flag
        elif arg == "--force-reinstall":
            translated.append("--reinstall")
        else:
            translated.append(arg)
    return translated
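`_translate_pip_args_for_uv` grows by one branch per translated flag. As a design alternative only, the same two translations can live in a lookup table. `PIP_TO_UV` and `translate_pip_args` are hypothetical names; flags missing from the table pass through unchanged, as in the original.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical lookup table: None means "drop the flag", a string means
# "replace it with this uv flag". Unlisted flags pass through unchanged.
PIP_TO_UV: Dict[str, Optional[str]] = {
    "--no-cache-dir": None,
    "--force-reinstall": "--reinstall",
}


def translate_pip_args(args: Sequence[str]) -> List[str]:
    out: List[str] = []
    for arg in args:
        if arg in PIP_TO_UV:
            replacement = PIP_TO_UV[arg]
            if replacement is not None:
                out.append(replacement)
        else:
            out.append(arg)
    return out
```

Adding another translation then means adding one dictionary entry rather than another `elif` branch.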
def _build_pip_cmd(args: tuple[str, ...]) -> list[str]:
Consolidate dual venvs and separate install from update (#4530) * refactor: consolidate dual venvs into single ~/.unsloth/studio/unsloth_studio * refactor: separate install.sh (first-time) from setup.sh (smart update with PyPI version check) * fix: install.sh calls setup.sh directly, keep both setup and update CLI commands * fix: use importlib.resources.files() directly without _path attribute * fix: bootstrap uv before pip upgrade to handle uv venvs without pip * fix: frontend 404 when launched via CLI, add global symlink to ~/.local/bin * feat: add --local flag to install.sh and unsloth studio update for branch testing * fix: resolve repo root from script location for --local installs * feat: add --package flag to install.sh for testing with custom package names * feat: add --package flag to unsloth studio update * fix: always nuke venv in install.sh for clean installs * revert: remove Windows changes, will handle in separate PR * fix: error when --package is passed without an argument * revert: restore Windows scripts to current main * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: always explicitly set STUDIO_LOCAL_INSTALL and STUDIO_PACKAGE_NAME env vars * fix: pass explicit STUDIO_LOCAL_REPO env var for --local installs * fix: align banner box for Setup vs Update labels * deprecate: hide 'unsloth studio setup' command, point users to update/install.sh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: check stdout not stdin for auto-launch detection (curl pipe fix) * fix: update install URL to unsloth.ai/install.sh * fix: update install.sh usage comments to unsloth.ai/install.sh * fix: use --upgrade-package for base deps to preserve existing torch/CUDA installs * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: --local install now also installs unsloth-zoo via base.txt before 
editable overlay * fix: don't skip base packages for --local installs (editable needs unsloth-zoo) * refactor: move --local full dep install to install.sh, keep SKIP_STUDIO_BASE for all paths * feat: add migration support for old .venv and CWD-based installs in setup.sh * Revert "feat: add migration support for old .venv and CWD-based installs in setup.sh" This reverts commit 301291d0028b61e15acc064829f48be50c764087. * feat: migrate old .venv layout in install.sh instead of always nuking * feat: validate old .venv with torch CUDA test before migration, recovery message on launch failure * fix: try CUDA then fall back to CPU for migration validation * fix: upgrade unsloth/unsloth-zoo with --reinstall-package on migration to preserve torch * remove: delete unused unsloth ui command (use unsloth studio instead) * Fix Windows venv path mismatch between install.ps1, setup.ps1, and studio.py install.ps1 was creating the venv CWD-relative ($VenvName = "unsloth_studio"), setup.ps1 was using an absolute path to ".unsloth\studio\.venv", and studio.py looks for ".unsloth\studio\unsloth_studio". All three paths were different, so the Windows installer would never produce a working Studio setup. 
install.ps1: - Use absolute $StudioHome + $VenvDir matching the Linux install.sh layout - Add 3-way migration: old .venv at STUDIO_HOME, CWD-relative ~/unsloth_studio from the previous install.ps1, or fresh creation with torch validation - For migrated envs, upgrade unsloth while preserving existing torch/CUDA wheels - Set SKIP_STUDIO_BASE=1 before calling setup.ps1 (matches install.sh behavior) - Fix launch instructions to use the absolute venv path setup.ps1: - Change $VenvDir from ".unsloth\studio\.venv" to ".unsloth\studio\unsloth_studio" - Add SKIP_STUDIO_BASE guard: error out if venv is missing when called from install.ps1 (which should have already created it) - Differentiate "Setup" vs "Update" in banners based on SKIP_STUDIO_BASE * setup.ps1: unconditionally error if venv missing, matching setup.sh setup.sh always errors out if the venv does not exist (line 224-228), telling the user to run install.sh first. setup.ps1 was conditionally creating a bare venv with python -m venv when SKIP_STUDIO_BASE was not set, which would produce an empty venv with no torch or unsloth. Now setup.ps1 matches setup.sh: always error, always point to install.ps1. * Fix --torch-backend=auto CPU solver dead-end on Linux, macOS, and Windows On CPU-only machines, `uv pip install unsloth --torch-backend=auto` falls back to unsloth==2024.8 because the CPU solver cannot satisfy newer unsloth's dependencies. install.ps1 already solved this with a two-step approach; this applies the same fix to install.sh and install_python_stack.py. install.sh: add get_torch_index_url() that detects GPU via nvidia-smi and maps CUDA versions to PyTorch index URLs (matching install.ps1's Get-TorchIndexUrl). Fresh installs now install torch first via explicit --index-url, then install unsloth with --upgrade-package to preserve the pre-installed torch. All 5 --torch-backend=auto removed from primary paths. 
install.ps1: add fallback else-branch when TorchIndexUrl is empty, using --torch-backend=auto as last resort (matching install.sh). install_python_stack.py: remove unconditional --torch-backend=auto from _build_uv_cmd. Torch is pre-installed by install.sh/setup.ps1 by the time this runs. Callers that need it can set UV_TORCH_BACKEND. Both install.sh and install.ps1 now share the same three-branch logic: migrated env (upgrade-package only), normal (torch-first + index-url), and fallback (--torch-backend=auto if URL detection fails). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Use --reinstall-package for migrated envs on both Linux and Windows For migrated environments (moved from legacy venv location), --reinstall-package is better than --upgrade-package because it forces a clean reinstall even if the same version is already installed. This ensures proper .dist-info and .pyc state in the new venv location. --upgrade-package remains correct for the fresh install path where torch is already installed and we just want to add unsloth without re-resolving torch. 
* Address review findings: portability, parity, and stale comments - Replace grep -oP (GNU Perl regex) with POSIX sed in get_torch_index_url() so the script works on BSD grep (macOS is already guarded by the Darwin early-return, but Alpine/BusyBox would silently get the wrong CUDA tag) - Add LC_ALL=C before nvidia-smi invocation to prevent locale-dependent output parsing issues - Add warning on stderr when nvidia-smi output is unparseable, matching install.ps1's [WARN] message - Add explicit unsloth-zoo positional arg to install.ps1 migrated path, matching install.sh (--reinstall-package alone won't install it if it was never present in the migrated env) - Fix stale comment in install_python_stack.py line 392 that still claimed --torch-backend=auto is added by _build_uv_cmd - Add sed to test tools directory (function now uses sed instead of grep) * Add --index-url to migrated env path to prevent CPU torch resolution The migrated path runs uv pip install with --reinstall-package for unsloth/unsloth-zoo. While uv should keep existing torch as satisfied, the resolver could still re-resolve torch as a transitive dependency. Without --index-url pointing at the correct CUDA wheel index, the resolver would fall back to plain PyPI and potentially pull CPU-only torch. Adding --index-url $TORCH_INDEX_URL ensures CUDA wheels are available if the resolver needs them. Applied to both install.sh and install.ps1. * Revert --index-url on migrated env path The original install.ps1 on main already handles the migrated path without --index-url and it works correctly. --reinstall-package only forces reinstall of the named packages while uv keeps existing torch as satisfied. No need for the extra flag. * Fix unsloth studio update --local not installing local checkout studio.py sets STUDIO_LOCAL_REPO when --local is passed, but install_python_stack.py never read it. The update path always installed from PyPI regardless of the --local flag. 
Add a local_repo branch that first updates deps from base.txt (with --upgrade-package to preserve torch), then overlays the local checkout as an editable install with --no-deps. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-03-25 12:24:21 +00:00
"""Build a standard pip install command.
Strips uv-only flags like --upgrade-package that pip doesn't understand.
"""
cmd = [sys.executable, "-m", "pip", "install"]
Consolidate dual venvs and separate install from update (#4530) * refactor: consolidate dual venvs into single ~/.unsloth/studio/unsloth_studio * refactor: separate install.sh (first-time) from setup.sh (smart update with PyPI version check) * fix: install.sh calls setup.sh directly, keep both setup and update CLI commands * fix: use importlib.resources.files() directly without _path attribute * fix: bootstrap uv before pip upgrade to handle uv venvs without pip * fix: frontend 404 when launched via CLI, add global symlink to ~/.local/bin * feat: add --local flag to install.sh and unsloth studio update for branch testing * fix: resolve repo root from script location for --local installs * feat: add --package flag to install.sh for testing with custom package names * feat: add --package flag to unsloth studio update * fix: always nuke venv in install.sh for clean installs * revert: remove Windows changes, will handle in separate PR * fix: error when --package is passed without an argument * revert: restore Windows scripts to current main * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: always explicitly set STUDIO_LOCAL_INSTALL and STUDIO_PACKAGE_NAME env vars * fix: pass explicit STUDIO_LOCAL_REPO env var for --local installs * fix: align banner box for Setup vs Update labels * deprecate: hide 'unsloth studio setup' command, point users to update/install.sh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: check stdout not stdin for auto-launch detection (curl pipe fix) * fix: update install URL to unsloth.ai/install.sh * fix: update install.sh usage comments to unsloth.ai/install.sh * fix: use --upgrade-package for base deps to preserve existing torch/CUDA installs * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: --local install now also installs unsloth-zoo via base.txt before 
editable overlay * fix: don't skip base packages for --local installs (editable needs unsloth-zoo) * refactor: move --local full dep install to install.sh, keep SKIP_STUDIO_BASE for all paths * feat: add migration support for old .venv and CWD-based installs in setup.sh * Revert "feat: add migration support for old .venv and CWD-based installs in setup.sh" This reverts commit 301291d0028b61e15acc064829f48be50c764087. * feat: migrate old .venv layout in install.sh instead of always nuking * feat: validate old .venv with torch CUDA test before migration, recovery message on launch failure * fix: try CUDA then fall back to CPU for migration validation * fix: upgrade unsloth/unsloth-zoo with --reinstall-package on migration to preserve torch * remove: delete unused unsloth ui command (use unsloth studio instead) * Fix Windows venv path mismatch between install.ps1, setup.ps1, and studio.py install.ps1 was creating the venv CWD-relative ($VenvName = "unsloth_studio"), setup.ps1 was using an absolute path to ".unsloth\studio\.venv", and studio.py looks for ".unsloth\studio\unsloth_studio". All three paths were different, so the Windows installer would never produce a working Studio setup. 
install.ps1: - Use absolute $StudioHome + $VenvDir matching the Linux install.sh layout - Add 3-way migration: old .venv at STUDIO_HOME, CWD-relative ~/unsloth_studio from the previous install.ps1, or fresh creation with torch validation - For migrated envs, upgrade unsloth while preserving existing torch/CUDA wheels - Set SKIP_STUDIO_BASE=1 before calling setup.ps1 (matches install.sh behavior) - Fix launch instructions to use the absolute venv path setup.ps1: - Change $VenvDir from ".unsloth\studio\.venv" to ".unsloth\studio\unsloth_studio" - Add SKIP_STUDIO_BASE guard: error out if venv is missing when called from install.ps1 (which should have already created it) - Differentiate "Setup" vs "Update" in banners based on SKIP_STUDIO_BASE * setup.ps1: unconditionally error if venv missing, matching setup.sh setup.sh always errors out if the venv does not exist (line 224-228), telling the user to run install.sh first. setup.ps1 was conditionally creating a bare venv with python -m venv when SKIP_STUDIO_BASE was not set, which would produce an empty venv with no torch or unsloth. Now setup.ps1 matches setup.sh: always error, always point to install.ps1. * Fix --torch-backend=auto CPU solver dead-end on Linux, macOS, and Windows On CPU-only machines, `uv pip install unsloth --torch-backend=auto` falls back to unsloth==2024.8 because the CPU solver cannot satisfy newer unsloth's dependencies. install.ps1 already solved this with a two-step approach; this applies the same fix to install.sh and install_python_stack.py. install.sh: add get_torch_index_url() that detects GPU via nvidia-smi and maps CUDA versions to PyTorch index URLs (matching install.ps1's Get-TorchIndexUrl). Fresh installs now install torch first via explicit --index-url, then install unsloth with --upgrade-package to preserve the pre-installed torch. All 5 --torch-backend=auto removed from primary paths. 
install.ps1: add fallback else-branch when TorchIndexUrl is empty, using --torch-backend=auto as last resort (matching install.sh). install_python_stack.py: remove unconditional --torch-backend=auto from _build_uv_cmd. Torch is pre-installed by install.sh/setup.ps1 by the time this runs. Callers that need it can set UV_TORCH_BACKEND. Both install.sh and install.ps1 now share the same three-branch logic: migrated env (upgrade-package only), normal (torch-first + index-url), and fallback (--torch-backend=auto if URL detection fails). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Use --reinstall-package for migrated envs on both Linux and Windows For migrated environments (moved from legacy venv location), --reinstall-package is better than --upgrade-package because it forces a clean reinstall even if the same version is already installed. This ensures proper .dist-info and .pyc state in the new venv location. --upgrade-package remains correct for the fresh install path where torch is already installed and we just want to add unsloth without re-resolving torch. 
* Address review findings: portability, parity, and stale comments - Replace grep -oP (GNU Perl regex) with POSIX sed in get_torch_index_url() so the script works on BSD grep (macOS is already guarded by the Darwin early-return, but Alpine/BusyBox would silently get the wrong CUDA tag) - Add LC_ALL=C before nvidia-smi invocation to prevent locale-dependent output parsing issues - Add warning on stderr when nvidia-smi output is unparseable, matching install.ps1's [WARN] message - Add explicit unsloth-zoo positional arg to install.ps1 migrated path, matching install.sh (--reinstall-package alone won't install it if it was never present in the migrated env) - Fix stale comment in install_python_stack.py line 392 that still claimed --torch-backend=auto is added by _build_uv_cmd - Add sed to test tools directory (function now uses sed instead of grep) * Add --index-url to migrated env path to prevent CPU torch resolution The migrated path runs uv pip install with --reinstall-package for unsloth/unsloth-zoo. While uv should keep existing torch as satisfied, the resolver could still re-resolve torch as a transitive dependency. Without --index-url pointing at the correct CUDA wheel index, the resolver would fall back to plain PyPI and potentially pull CPU-only torch. Adding --index-url $TORCH_INDEX_URL ensures CUDA wheels are available if the resolver needs them. Applied to both install.sh and install.ps1. * Revert --index-url on migrated env path The original install.ps1 on main already handles the migrated path without --index-url and it works correctly. --reinstall-package only forces reinstall of the named packages while uv keeps existing torch as satisfied. No need for the extra flag. * Fix unsloth studio update --local not installing local checkout studio.py sets STUDIO_LOCAL_REPO when --local is passed, but install_python_stack.py never read it. The update path always installed from PyPI regardless of the --local flag. 
Add a local_repo branch that first updates deps from base.txt (with --upgrade-package to preserve torch), then overlays the local checkout as an editable install with --no-deps. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-03-25 12:24:21 +00:00
    skip_next = False
    for arg in args:
        if skip_next:
            skip_next = False
            continue
        if arg == "--upgrade-package":
            skip_next = True  # skip the flag and its value
            continue
        cmd.append(arg)
    return cmd
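pip has no `--upgrade-package` option, so the loop above drops that flag together with the package name that follows it before the arguments reach pip. A minimal standalone sketch of the same two-token filtering (the name `strip_uv_only_flags` is hypothetical, not a helper in this file):

```python
def strip_uv_only_flags(args: tuple[str, ...]) -> list[str]:
    """Drop uv-only flags (here only --upgrade-package) plus their value."""
    kept: list[str] = []
    skip_next = False
    for arg in args:
        if skip_next:
            # This token is the value of the preceding uv-only flag.
            skip_next = False
            continue
        if arg == "--upgrade-package":
            skip_next = True
            continue
        kept.append(arg)
    return kept


print(strip_uv_only_flags(("-r", "base.txt", "--upgrade-package", "unsloth")))
# ['-r', 'base.txt']
```

Like the loop it mirrors, this only matches the two-token spelling; `--upgrade-package=torch` written as a single token would pass through unchanged.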


def _build_uv_cmd(args: tuple[str, ...]) -> list[str]:
    """Build a uv pip install command with translated flags."""
    cmd = ["uv", "pip", "install"]
    if UV_NEEDS_SYSTEM:
        cmd.append("--system")
    # Always pass --python so uv targets the correct environment.
    # Without this, uv can ignore an activated venv and install into
    # the system Python (observed on Colab and similar environments).
    cmd.extend(["--python", sys.executable])
    cmd.extend(_translate_pip_args_for_uv(args))
    # Torch is pre-installed by install.sh/setup.ps1. Do not add
    # --torch-backend by default -- it can cause solver dead-ends on
    # CPU-only machines. Callers that need it can set UV_TORCH_BACKEND.
    _tb = os.environ.get("UV_TORCH_BACKEND", "")
    if _tb:
        cmd.append(f"--torch-backend={_tb}")
    return cmd
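`_build_uv_cmd` always pins the interpreter with `--python` and only appends `--torch-backend` when `UV_TORCH_BACKEND` is set. The sketch below reproduces that ordering in isolation; `build_uv_cmd_sketch` is a hypothetical name, its `needs_system` parameter stands in for `UV_NEEDS_SYSTEM`, and the `_translate_pip_args_for_uv` step is assumed to be a no-op here.

```python
import os
import sys
from typing import Mapping, Optional


def build_uv_cmd_sketch(
    args: tuple[str, ...],
    needs_system: bool = False,
    env: Optional[Mapping[str, str]] = None,
) -> list[str]:
    """Assemble a `uv pip install` argv in the same order as _build_uv_cmd."""
    if env is None:
        env = os.environ
    cmd = ["uv", "pip", "install"]
    if needs_system:
        # Stand-in for the module's UV_NEEDS_SYSTEM flag.
        cmd.append("--system")
    # Pin the interpreter so uv cannot silently target a system Python.
    cmd.extend(["--python", sys.executable])
    # Assumption: args are already uv-compatible (no flag translation here).
    cmd.extend(args)
    # --torch-backend is opt-in only, driven by UV_TORCH_BACKEND.
    backend = env.get("UV_TORCH_BACKEND", "")
    if backend:
        cmd.append(f"--torch-backend={backend}")
    return cmd


default_cmd = build_uv_cmd_sketch(("unsloth",), env={})
print(any(a.startswith("--torch-backend") for a in default_cmd))  # False

cpu_cmd = build_uv_cmd_sketch(("unsloth",), env={"UV_TORCH_BACKEND": "cpu"})
print(cpu_cmd[-1])  # --torch-backend=cpu
```

Keeping the backend opt-in means the default path never asks uv's resolver to pick a torch build, which is the CPU-only dead-end the comment above warns about.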


def pip_install_try(
    label: str,
    *args: str,
    constrain: bool = True,
) -> bool:
    """Like pip_install but returns False on failure instead of exiting.
    For optional installs with a follow-up fallback.
    """
    constraint_args: list[str] = []
    if constrain and CONSTRAINTS.is_file():
        constraint_args = ["-c", str(CONSTRAINTS)]
    if USE_UV:
        cmd = _build_uv_cmd(args) + constraint_args
    else:
        cmd = _build_pip_cmd(args) + constraint_args
    if VERBOSE:
        _step(_LABEL, f"{label}...", _dim)
    result = subprocess.run(
        cmd,
        stdout = subprocess.PIPE,
        stderr = subprocess.STDOUT,
    )
    if result.returncode == 0:
        return True
    if VERBOSE and result.stdout:
        print(result.stdout.decode(errors = "replace"))
    return False
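`pip_install_try` captures stdout and stderr together, stays silent on failure unless `VERBOSE` is set, and returns a boolean so the caller can decide whether to run a fallback. A minimal sketch of that try-then-fallback pattern plus the constraints-file check, using hypothetical helper names (`run_quiet`, `constraint_args`, `install_with_fallback`) and two throwaway `python -c` commands in place of real installs:

```python
import subprocess
import sys
from pathlib import Path


def run_quiet(cmd: list[str], verbose: bool = False) -> bool:
    """Run cmd with stdout+stderr captured together; True on exit code 0."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    if result.returncode == 0:
        return True
    if verbose and result.stdout:
        # Only surface the captured log when asked to.
        print(result.stdout.decode(errors="replace"))
    return False


def constraint_args(constraints: Path) -> list[str]:
    """Return ["-c", <file>] only if the constraints file exists."""
    return ["-c", str(constraints)] if constraints.is_file() else []


def install_with_fallback(primary: list[str], fallback: list[str]) -> str:
    """Try the optional command first; run the fallback only if it fails."""
    if run_quiet(primary):
        return "primary"
    if run_quiet(fallback):
        return "fallback"
    raise RuntimeError("primary and fallback commands both failed")


# Throwaway commands standing in for real installs.
failing = [sys.executable, "-c", "raise SystemExit(1)"]
passing = [sys.executable, "-c", "pass"]
print(install_with_fallback(failing, passing))  # fallback
print(constraint_args(Path("missing-constraints.txt")))  # []
```

Returning a bool instead of exiting leaves the "what next" decision with the caller, which is exactly the difference from `pip_install` that the docstring describes.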


def pip_install(
    label: str,
    *args: str,
    req: Path | None = None,
    constrain: bool = True,
) -> None:
    """Build and run a pip install command (uses uv when available, falls back to pip)."""
    constraint_args: list[str] = []
    if constrain and CONSTRAINTS.is_file():
        constraint_args = ["-c", str(CONSTRAINTS)]
    actual_req = req
* tests: add comprehensive E2E sandbox tests for no-torch mode Add test_e2e_no_torch_sandbox.py with 7 test groups (43 tests total) covering the full no-torch import chain, edge cases, and install logic: - Group 1: BEFORE vs AFTER import chain comparison (proves the bug existed and the fix works by synthetically prepending top-level torch imports) - Group 2: Dataclass instantiation without torch - Group 3: Edge cases with broken/fake torch modules on sys.path - Group 4: Hardware detection fallback to CPU without torch - Group 5: install.sh flag parsing, version resolution, arch detection - Group 6: install_python_stack.py NO_TORCH filtering - Group 7: Live server startup without torch (marked @server, skipped when studio venv is unavailable) All 43 tests pass on both Python 3.12 and 3.13 isolated venvs. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * feat: add --no-torch flag to install.sh/ps1, fix lazy import bug in dataset formatting - Fix chat_templates.py: narrow torch IterableDataset import into inner try/except ImportError so dataset.map() works without torch installed - Fix format_conversion.py: same lazy import fix for convert_chatml_to_alpaca and convert_alpaca_to_chatml - Add --no-torch flag to install.sh with unified SKIP_TORCH variable (driven by --no-torch flag OR MAC_INTEL auto-detection) - Add --no-torch flag to install.ps1 with $SkipTorch variable - Print CPU hint when no GPU detected and --no-torch not set - Replace MAC_INTEL guards with SKIP_TORCH in torch install sections - Update shell tests (40 pass) and Python tests (90 pass) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: address reviewer findings for --no-torch installer paths - Fix migrated-env branch in install.sh and install.ps1: check SKIP_TORCH first, then branch on STUDIO_LOCAL_INSTALL. 
Previously SKIP_TORCH+non-local fell into else and installed unsloth-zoo (which depends on torch), defeating --no-torch mode. - Fix $env:UNSLOTH_NO_TORCH leak in install.ps1: always set to "true" or "false" instead of only setting on the true branch. Prevents stale no-torch state from leaking across runs in the same PS session. - Fix install_python_stack.py update path: add NO_TORCH guard around base.txt install so unsloth studio update does not reinstall unsloth-zoo (which depends on torch) in no-torch mode. * fix: install unsloth + unsloth-zoo with --no-deps in no-torch mode Instead of skipping unsloth-zoo entirely (which breaks unsloth's dependency on it), install both packages with --no-deps so they are present but torch is not pulled in transitively. Applied consistently across all no-torch paths: migrated-env, fresh-local, fresh-non-local in install.sh, install.ps1, and install_python_stack.py. * chore: temporarily remove test files (will be added in a follow-up) * refactor: deduplicate SKIP_TORCH conditional branches in installers Collapse if/else blocks that differ only by --no-deps into a single branch with a conditional flag variable. Applied to migrated-env and fresh-local paths in install.sh, install.ps1, and install_python_stack.py. * fix: apply --no-deps to fresh non-local --no-torch install path The non-local else branch was missing $_no_deps_arg/$noDepsArg, so uv pip install unsloth would resolve torch from PyPI metadata (the published unsloth package still declares torch as a hard dep). Now --no-deps is applied consistently to all SKIP_TORCH code paths. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-03-27 09:09:21 +00:00
temp_reqs: list[Path] = []
2026-02-27 20:31:57 +00:00
if req is not None and IS_WINDOWS and WINDOWS_SKIP_PACKAGES:
actual_req = _filter_requirements(req, WINDOWS_SKIP_PACKAGES)
fix: install.sh Mac Intel compatibility + Studio no-torch support (#4624) * fix: install.sh Mac Intel compatibility + Studio no-torch support (#4621) On Intel Macs (x86_64), PyTorch has no wheels for torch >= 2.3, so the installer crashes. Even when torch is absent, Studio crashes on startup because two files have bare top-level torch imports. Studio's GGUF inference (llama.cpp) does not need PyTorch. Training and HF-inference already isolate torch to subprocesses. Only 2 files in the server startup chain had top-level torch imports preventing startup. Changes: - install.sh: detect architecture, default to Python 3.12 on Intel Mac, skip torch install, add Python 3.13.8 guard for arm64, pass UNSLOTH_NO_TORCH env var to setup.sh - data_collators.py: remove unused `import torch` (no torch.* refs) - chat_templates.py: lazy-import IterableDataset into function bodies - install_python_stack.py: add IS_MACOS/NO_TORCH constants, skip torch-dependent packages, skip overrides.txt, skip triton on macOS No existing working flow changes. Linux/WSL and macOS arm64 behavior is identical. 
* tests: add test suite for Mac Intel compat + no-torch mode Shell tests (test_mac_intel_compat.sh): - version_ge edge cases (9 tests) - Architecture detection for Darwin x86_64/arm64, Linux x86_64/aarch64 - get_torch_index_url returns cpu on simulated Darwin - UNSLOTH_NO_TORCH propagation to both setup.sh branches Python unit tests (test_no_torch_filtering.py): - _filter_requirements with NO_TORCH_SKIP_PACKAGES - NO_TORCH env var parsing (true/1/TRUE/false/0/unset) - IS_MACOS constant check - Overrides skip and triton macOS skip guards Python import tests (test_studio_import_no_torch.py): - data_collators.py loads in isolated no-torch venv - chat_templates.py has no top-level torch imports - Negative control confirms import torch fails without torch * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tests: add E2E sandbox tests for Mac Intel no-torch mode Replace static/synthetic test stubs with real sandbox tests: - Shell: E2E uv venv creation at Python 3.12, mock uv shim to verify torch install is skipped when MAC_INTEL=true, dynamic env propagation test for UNSLOTH_NO_TORCH in both local and non-local install paths - Python filtering: test real extras.txt and extras-no-deps.txt with NO_TORCH_SKIP_PACKAGES, subprocess mock of install_python_stack() for 5 platform configs (NO_TORCH+macOS, Windows+NO_TORCH, normal Linux, Windows-only, macOS-only), VCS URL and env marker edge cases - Python imports: parametrized Python 3.12+3.13 venv fixture, dataclass instantiation for all 3 collator classes, chat_templates.py exec with stubs, negative controls proving import torch and torchao install fail in no-torch venvs 91 total tests, all passing. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: address reviewer findings for Intel Mac no-torch mode P1 fixes: - Auto-infer NO_TORCH in install_python_stack.py via platform.machine() so `unsloth studio update` preserves GGUF-only mode without needing the UNSLOTH_NO_TORCH env var (6/10 reviewers) - Add openai-whisper and transformers-cfg to NO_TORCH_SKIP_PACKAGES since both have unconditional torch dependencies (4/10 reviewers) - Skip unsloth-zoo on Intel Mac --local installs (depends on torch) in both migrated and fresh install paths (1/10) - Recreate stale 3.13 venvs as 3.12 on Intel Mac re-runs (1/10) - Detect Apple Silicon under Rosetta via sysctl hw.optional.arm64 and warn user to use native arm64 terminal (1/10) P2 fixes: - Wire new test files into tests/run_all.sh (4/10 reviewers) - Add update-path tests (skip_base=False) for Intel Mac - Add _infer_no_torch tests for platform auto-detection P3 fixes: - Fix macOS progress bar total (triton step skipped but was counted) - Fix temp file leak when Windows + NO_TORCH filters stack All tests pass: 30 shell, 66 Python (96 total). * feat: add --python override flag to install.sh Lets users force a specific Python version, e.g. ./install.sh --python 3.12. Addresses M2 Mac users whose systems resolve to a problematic 3.13.x patch. When --python is set, the Intel Mac stale-venv guard and 3.13.8 auto-downgrade are skipped so the user's choice is respected. 
* tests: add comprehensive E2E sandbox tests for no-torch mode Add test_e2e_no_torch_sandbox.py with 7 test groups (43 tests total) covering the full no-torch import chain, edge cases, and install logic: - Group 1: BEFORE vs AFTER import chain comparison (proves the bug existed and the fix works by synthetically prepending top-level torch imports) - Group 2: Dataclass instantiation without torch - Group 3: Edge cases with broken/fake torch modules on sys.path - Group 4: Hardware detection fallback to CPU without torch - Group 5: install.sh flag parsing, version resolution, arch detection - Group 6: install_python_stack.py NO_TORCH filtering - Group 7: Live server startup without torch (marked @server, skipped when studio venv is unavailable) All 43 tests pass on both Python 3.12 and 3.13 isolated venvs. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * feat: add --no-torch flag to install.sh/ps1, fix lazy import bug in dataset formatting - Fix chat_templates.py: narrow torch IterableDataset import into inner try/except ImportError so dataset.map() works without torch installed - Fix format_conversion.py: same lazy import fix for convert_chatml_to_alpaca and convert_alpaca_to_chatml - Add --no-torch flag to install.sh with unified SKIP_TORCH variable (driven by --no-torch flag OR MAC_INTEL auto-detection) - Add --no-torch flag to install.ps1 with $SkipTorch variable - Print CPU hint when no GPU detected and --no-torch not set - Replace MAC_INTEL guards with SKIP_TORCH in torch install sections - Update shell tests (40 pass) and Python tests (90 pass) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: address reviewer findings for --no-torch installer paths - Fix migrated-env branch in install.sh and install.ps1: check SKIP_TORCH first, then branch on STUDIO_LOCAL_INSTALL. 
Previously SKIP_TORCH+non-local fell into else and installed unsloth-zoo (which depends on torch), defeating --no-torch mode. - Fix $env:UNSLOTH_NO_TORCH leak in install.ps1: always set to "true" or "false" instead of only setting on the true branch. Prevents stale no-torch state from leaking across runs in the same PS session. - Fix install_python_stack.py update path: add NO_TORCH guard around base.txt install so unsloth studio update does not reinstall unsloth-zoo (which depends on torch) in no-torch mode. * fix: install unsloth + unsloth-zoo with --no-deps in no-torch mode Instead of skipping unsloth-zoo entirely (which breaks unsloth's dependency on it), install both packages with --no-deps so they are present but torch is not pulled in transitively. Applied consistently across all no-torch paths: migrated-env, fresh-local, fresh-non-local in install.sh, install.ps1, and install_python_stack.py. * chore: temporarily remove test files (will be added in a follow-up) * refactor: deduplicate SKIP_TORCH conditional branches in installers Collapse if/else blocks that differ only by --no-deps into a single branch with a conditional flag variable. Applied to migrated-env and fresh-local paths in install.sh, install.ps1, and install_python_stack.py. * fix: apply --no-deps to fresh non-local --no-torch install path The non-local else branch was missing $_no_deps_arg/$noDepsArg, so uv pip install unsloth would resolve torch from PyPI metadata (the published unsloth package still declares torch as a hard dep). Now --no-deps is applied consistently to all SKIP_TORCH code paths. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-03-27 09:09:21 +00:00
temp_reqs.append(actual_req)
if actual_req is not None and NO_TORCH and NO_TORCH_SKIP_PACKAGES:
actual_req = _filter_requirements(actual_req, NO_TORCH_SKIP_PACKAGES)
temp_reqs.append(actual_req)
req_args: list[str] = []
2026-02-27 20:31:57 +00:00
if actual_req is not None:
req_args = ["-r", str(actual_req)]
2026-02-27 20:31:57 +00:00
try:
if USE_UV:
uv_cmd = _build_uv_cmd(args) + constraint_args + req_args
if VERBOSE:
print(f" {label}...")
result = subprocess.run(
uv_cmd,
stdout = subprocess.PIPE,
stderr = subprocess.STDOUT,
)
if result.returncode == 0:
return
print(_red(f" uv failed, falling back to pip..."))
if result.stdout:
print(result.stdout.decode(errors = "replace"))
pip_cmd = _build_pip_cmd(args) + constraint_args + req_args
run(f"{label} (pip)" if USE_UV else label, pip_cmd)
2026-02-27 20:31:57 +00:00
finally:
fix: install.sh Mac Intel compatibility + Studio no-torch support (#4624) * fix: install.sh Mac Intel compatibility + Studio no-torch support (#4621) On Intel Macs (x86_64), PyTorch has no wheels for torch >= 2.3, so the installer crashes. Even when torch is absent, Studio crashes on startup because two files have bare top-level torch imports. Studio's GGUF inference (llama.cpp) does not need PyTorch. Training and HF-inference already isolate torch to subprocesses. Only 2 files in the server startup chain had top-level torch imports preventing startup. Changes: - install.sh: detect architecture, default to Python 3.12 on Intel Mac, skip torch install, add Python 3.13.8 guard for arm64, pass UNSLOTH_NO_TORCH env var to setup.sh - data_collators.py: remove unused `import torch` (no torch.* refs) - chat_templates.py: lazy-import IterableDataset into function bodies - install_python_stack.py: add IS_MACOS/NO_TORCH constants, skip torch-dependent packages, skip overrides.txt, skip triton on macOS No existing working flow changes. Linux/WSL and macOS arm64 behavior is identical. 
* tests: add test suite for Mac Intel compat + no-torch mode Shell tests (test_mac_intel_compat.sh): - version_ge edge cases (9 tests) - Architecture detection for Darwin x86_64/arm64, Linux x86_64/aarch64 - get_torch_index_url returns cpu on simulated Darwin - UNSLOTH_NO_TORCH propagation to both setup.sh branches Python unit tests (test_no_torch_filtering.py): - _filter_requirements with NO_TORCH_SKIP_PACKAGES - NO_TORCH env var parsing (true/1/TRUE/false/0/unset) - IS_MACOS constant check - Overrides skip and triton macOS skip guards Python import tests (test_studio_import_no_torch.py): - data_collators.py loads in isolated no-torch venv - chat_templates.py has no top-level torch imports - Negative control confirms import torch fails without torch * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * tests: add E2E sandbox tests for Mac Intel no-torch mode Replace static/synthetic test stubs with real sandbox tests: - Shell: E2E uv venv creation at Python 3.12, mock uv shim to verify torch install is skipped when MAC_INTEL=true, dynamic env propagation test for UNSLOTH_NO_TORCH in both local and non-local install paths - Python filtering: test real extras.txt and extras-no-deps.txt with NO_TORCH_SKIP_PACKAGES, subprocess mock of install_python_stack() for 5 platform configs (NO_TORCH+macOS, Windows+NO_TORCH, normal Linux, Windows-only, macOS-only), VCS URL and env marker edge cases - Python imports: parametrized Python 3.12+3.13 venv fixture, dataclass instantiation for all 3 collator classes, chat_templates.py exec with stubs, negative controls proving import torch and torchao install fail in no-torch venvs 91 total tests, all passing. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: address reviewer findings for Intel Mac no-torch mode P1 fixes: - Auto-infer NO_TORCH in install_python_stack.py via platform.machine() so `unsloth studio update` preserves GGUF-only mode without needing the UNSLOTH_NO_TORCH env var (6/10 reviewers) - Add openai-whisper and transformers-cfg to NO_TORCH_SKIP_PACKAGES since both have unconditional torch dependencies (4/10 reviewers) - Skip unsloth-zoo on Intel Mac --local installs (depends on torch) in both migrated and fresh install paths (1/10) - Recreate stale 3.13 venvs as 3.12 on Intel Mac re-runs (1/10) - Detect Apple Silicon under Rosetta via sysctl hw.optional.arm64 and warn user to use native arm64 terminal (1/10) P2 fixes: - Wire new test files into tests/run_all.sh (4/10 reviewers) - Add update-path tests (skip_base=False) for Intel Mac - Add _infer_no_torch tests for platform auto-detection P3 fixes: - Fix macOS progress bar total (triton step skipped but was counted) - Fix temp file leak when Windows + NO_TORCH filters stack All tests pass: 30 shell, 66 Python (96 total). * feat: add --python override flag to install.sh Lets users force a specific Python version, e.g. ./install.sh --python 3.12. Addresses M2 Mac users whose systems resolve to a problematic 3.13.x patch. When --python is set, the Intel Mac stale-venv guard and 3.13.8 auto-downgrade are skipped so the user's choice is respected. 
* tests: add comprehensive E2E sandbox tests for no-torch mode Add test_e2e_no_torch_sandbox.py with 7 test groups (43 tests total) covering the full no-torch import chain, edge cases, and install logic: - Group 1: BEFORE vs AFTER import chain comparison (proves the bug existed and the fix works by synthetically prepending top-level torch imports) - Group 2: Dataclass instantiation without torch - Group 3: Edge cases with broken/fake torch modules on sys.path - Group 4: Hardware detection fallback to CPU without torch - Group 5: install.sh flag parsing, version resolution, arch detection - Group 6: install_python_stack.py NO_TORCH filtering - Group 7: Live server startup without torch (marked @server, skipped when studio venv is unavailable) All 43 tests pass on both Python 3.12 and 3.13 isolated venvs. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * feat: add --no-torch flag to install.sh/ps1, fix lazy import bug in dataset formatting - Fix chat_templates.py: narrow torch IterableDataset import into inner try/except ImportError so dataset.map() works without torch installed - Fix format_conversion.py: same lazy import fix for convert_chatml_to_alpaca and convert_alpaca_to_chatml - Add --no-torch flag to install.sh with unified SKIP_TORCH variable (driven by --no-torch flag OR MAC_INTEL auto-detection) - Add --no-torch flag to install.ps1 with $SkipTorch variable - Print CPU hint when no GPU detected and --no-torch not set - Replace MAC_INTEL guards with SKIP_TORCH in torch install sections - Update shell tests (40 pass) and Python tests (90 pass) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: address reviewer findings for --no-torch installer paths - Fix migrated-env branch in install.sh and install.ps1: check SKIP_TORCH first, then branch on STUDIO_LOCAL_INSTALL. 
Previously SKIP_TORCH+non-local fell into else and installed unsloth-zoo (which depends on torch), defeating --no-torch mode. - Fix $env:UNSLOTH_NO_TORCH leak in install.ps1: always set to "true" or "false" instead of only setting on the true branch. Prevents stale no-torch state from leaking across runs in the same PS session. - Fix install_python_stack.py update path: add NO_TORCH guard around base.txt install so unsloth studio update does not reinstall unsloth-zoo (which depends on torch) in no-torch mode. * fix: install unsloth + unsloth-zoo with --no-deps in no-torch mode Instead of skipping unsloth-zoo entirely (which breaks unsloth's dependency on it), install both packages with --no-deps so they are present but torch is not pulled in transitively. Applied consistently across all no-torch paths: migrated-env, fresh-local, fresh-non-local in install.sh, install.ps1, and install_python_stack.py. * chore: temporarily remove test files (will be added in a follow-up) * refactor: deduplicate SKIP_TORCH conditional branches in installers Collapse if/else blocks that differ only by --no-deps into a single branch with a conditional flag variable. Applied to migrated-env and fresh-local paths in install.sh, install.ps1, and install_python_stack.py. * fix: apply --no-deps to fresh non-local --no-torch install path The non-local else branch was missing $_no_deps_arg/$noDepsArg, so uv pip install unsloth would resolve torch from PyPI metadata (the published unsloth package still declares torch as a hard dep). Now --no-deps is applied consistently to all SKIP_TORCH code paths. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
2026-03-27 09:09:21 +00:00
for temp_req in temp_reqs:
temp_req.unlink(missing_ok = True)
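

# Illustrative sketch (hypothetical helpers, not called by the installer):
# this shows how stacked requirement filters can each write a temporary
# file that is tracked and then removed in a `finally` block, as
# pip_install does above. The matching rule is an assumption: a line is
# dropped when its lowercased bare package name is in `skip`, so `skip`
# names must be lowercase. The real _filter_requirements is defined
# elsewhere in this file and may behave differently.
def _sketch_filter_requirements(req: "Path", skip: set[str]) -> "Path":
    import os
    import re
    import tempfile
    from pathlib import Path

    kept: list[str] = []
    for line in req.read_text().splitlines():
        # Bare name = text before the first version/marker/extras/URL delimiter.
        name = re.split(r"[<>=!~;\[\s@]", line.strip(), maxsplit = 1)[0].lower()
        if name and name in skip:
            continue
        kept.append(line)
    fd, tmp_name = tempfile.mkstemp(suffix = ".txt")
    with os.fdopen(fd, "w") as fh:
        fh.write("\n".join(kept) + "\n")
    return Path(tmp_name)


def _sketch_read_stacked(req: "Path", filters: list[set[str]]) -> list[str]:
    # Feed each filter the previous filter's output, then delete every temp file.
    temp_reqs = []
    current = req
    try:
        for skip in filters:
            current = _sketch_filter_requirements(current, skip)
            temp_reqs.append(current)
        return current.read_text().splitlines()
    finally:
        for temp_req in temp_reqs:
            temp_req.unlink(missing_ok = True)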


def download_file(url: str, dest: Path) -> None:
    """Download a file using urllib (no curl dependency)."""
    urllib.request.urlretrieve(url, dest)
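

# Illustrative sketch (hypothetical helper, not called by the installer):
# a variant of download_file that first downloads into a temporary file
# next to `dest` and only then swaps it in with os.replace. An interrupted
# download therefore cannot leave a truncated file inside the package.
# This temp-file-and-rename pattern is a swapped-in technique, not what
# download_file currently does. It assumes `dest.parent` already exists.
def _sketch_download_file_atomic(url: str, dest: "Path") -> None:
    import os
    import tempfile
    import urllib.request

    fd, tmp_name = tempfile.mkstemp(
        dir = dest.parent, prefix = dest.name + ".", suffix = ".part"
    )
    os.close(fd)
    try:
        urllib.request.urlretrieve(url, tmp_name)
        # A single rename swaps the file in (atomic on POSIX within one filesystem).
        os.replace(tmp_name, dest)
    except BaseException:
        # Leave `dest` untouched and clean up the partial download.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise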


def patch_package_file(package_name: str, relative_path: str, url: str) -> None:
    """Download a file from url and overwrite a file inside an installed package."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "show", package_name],
        capture_output = True,
        text = True,
    )
    if result.returncode != 0:
        _step(_LABEL, f"package {package_name} not found, skipping patch", _red)
        return
    location = None
    for line in result.stdout.splitlines():
        if line.lower().startswith("location:"):
            location = line.split(":", 1)[1].strip()
            break
    if not location:
        _step(_LABEL, f"could not locate {package_name}", _red)
        return
    dest = Path(location) / relative_path
    _step(_LABEL, f"patching {dest.name} in {package_name}...", _dim)
    download_file(url, dest)


# -- Main install sequence ---------------------------------------------
def install_python_stack() -> int:
    global USE_UV, _STEP, _TOTAL
    _STEP = 0

    # When called from install.sh (which already installed unsloth into the venv),
    # SKIP_STUDIO_BASE=1 is set to avoid redundant reinstallation of base packages.
    # When called from "unsloth studio update", it is NOT set so base packages
    # (unsloth + unsloth-zoo) are always reinstalled to pick up new versions.
    skip_base = os.environ.get("SKIP_STUDIO_BASE", "0") == "1"
    # When --package is used, install a different package name (e.g. roland-sloth for testing)
    package_name = os.environ.get("STUDIO_PACKAGE_NAME", "unsloth")
    # When --local is used, overlay a local repo checkout after updating deps
    local_repo = os.environ.get("STUDIO_LOCAL_REPO", "")
    base_total = 10 if IS_WINDOWS else 11
    if IS_MACOS:
        base_total -= 1  # triton step is skipped on macOS
Drop the unconditional "(rocm=False)" suffix in apply_gpu_ids() logs. * Fix round 2 regressions: ROCm validate_server and Windows HIP routing Follow-up to 810b833b addressing review findings on the first round of hardening commits: - install_llama_prebuilt.py validate_server: gate --n-gpu-layers on the resolved install_kind instead of host.has_rocm. AMD Windows hosts without a HIP prebuilt fall back to windows-cpu and must not be validated with GPU layers; thread install_kind through from the caller. - install_llama_prebuilt.py resolve_release_asset_choice: reinstate the "not has_rocm" guard on the published windows-cpu bundle so AMD Windows hosts reach resolve_upstream_asset_choice() where the new HIP prebuilt path lives. Prefer a published windows-hip bundle first when one exists, fall through to upstream HIP + upstream CPU otherwise. - install_llama_prebuilt.py detect_host: also set has_physical_nvidia when the secondary --query-gpu block confirms a working NVIDIA GPU, so older nvidia-smi versions without -L support do not silently skip the Linux diagnostics that key off has_physical_nvidia. - install_llama_prebuilt.py: drop redundant "import re as _re" / "import re as _re_rocm" local aliases in favour of the existing top-level "import re". - install_python_stack.py _ensure_rocm_torch: run the AMD bitsandbytes install unconditionally after the HIP-torch probe so "unsloth studio update" on venvs that already have ROCm torch still gains the AMD bitsandbytes build. - install.sh: add a non-x86_64 early-exit to get_torch_index_url() so aarch64 / arm64 Linux hosts do not hit the ROCm wheel index (PyTorch only publishes ROCm wheels for linux_x86_64). - install.sh: add bitsandbytes install to the migrated-environment branch so upgrades pick it up for ROCm hosts instead of only the fresh-install path. 
- install.sh: in the Radeon wheel path, pass version constraints + --no-index --find-links to uv instead of explicit wheel URLs so a version-compatible torch / torchvision / torchaudio triple is resolved, rather than picking the highest-version wheel for each package independently. - studio/backend/utils/hardware/amd.py _first_visible_amd_gpu_id: fall through to lower-priority visibility env vars when the first entry is malformed (leading comma, all-whitespace first token) instead of silently returning GPU 0. * Fix round 3 findings: x86_64 guard, ROCm version clip, Radeon deps Address issues surfaced by the round 3 reviewers on top of 8636fa63: - install_python_stack.py _ensure_rocm_torch: add the same `x86_64` guard that install.sh already has. Linux aarch64 / arm64 ROCm hosts must skip the repair path entirely; PyTorch only publishes ROCm wheels for linux_x86_64, and without this guard `unsloth studio update` aborts with a missing-wheel error on non-x86_64 hosts. - install_llama_prebuilt.py resolve_upstream_asset_choice: add a best-effort _detect_host_rocm_version() helper (reading /opt/rocm/.info/version, amd-smi version, hipconfig --version) and filter rocm_candidates to entries whose major.minor is <= host version. Falls back to the newest candidate only when no compatible one exists, so a ROCm 6.4 host downloads rocm-6.4 instead of being handed the numerically newest rocm-7.2 bundle (which fails preflight and forces a source build). - install.sh: remove the round 2 --no-index switch from the Radeon wheel branch. --no-index forced uv to ignore PyPI entirely, which broke transitive dependency resolution (filelock, sympy, networkx, jinja2, fsspec, setuptools, typing-extensions, ...) on a fresh venv. Restore the round 1 explicit wheel URL invocation but add a torch / torchvision / torchaudio version-pair sanity check so a mismatched trio (e.g.
torch 2.9.1 + torchvision 0.23.0 + torchaudio 2.9.0) falls back to the standard ROCm index instead of installing a broken combination. - install_python_stack.py _ensure_rocm_torch: restructure the "tag is None" path so it no longer short-circuits the bitsandbytes install. On a ROCm runtime older than anything in _ROCM_TORCH_INDEX, print the "no wheel" warning but still run the AMD bitsandbytes install. - studio/backend/core/training/worker.py: restore the pre-PR "no timeout" behaviour for non-HIP causal-conv1d / mamba-ssm source builds. The round 2 "timeout = 1800 if is_hip else 300" cap aborts slow non-HIP builds (Linux aarch64, unsupported torch/CUDA combos) after 5 minutes; omit timeout for the non-HIP branch so the cap only applies to ROCm source builds. * Fix round 4 findings: apply_gpu_ids env inheritance, Radeon X.Y, bitsandbytes gate Address remaining issues surfaced by the round 4 reviewers: - studio/backend/utils/hardware/hardware.py apply_gpu_ids: mirror the selection into HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES whenever the caller already had a ROCm visibility env var set, not only when IS_ROCM has already been set by detect_hardware(). Training and inference workers call apply_gpu_ids() before detect_hardware() runs, so the old guard would leave a forked ROCm worker with a stale HIP_VISIBLE_DEVICES mask that no longer matched the narrowed CUDA_VISIBLE_DEVICES selection. - install.sh get_radeon_wheel_url: accept X.Y ROCm versions in addition to X.Y.Z. The `/opt/rocm/.info/version` file and some hipconfig versions report only two components, and the Radeon repository publishes both rocm-rel-X.Y.Z/ and rocm-rel-X.Y/ directories, so treating X.Y as invalid caused Radeon hosts to fall back to the generic ROCm index even when a matching AMD wheel set existed. - install_python_stack.py _ensure_rocm_torch: only install the AMD bitsandbytes build when the venv actually has a ROCm-compatible torch (either already present or just installed by this function). 
Previously the bitsandbytes install ran unconditionally, which could leave an AMD bitsandbytes layered on top of a CPU/CUDA torch on hosts where the ROCm runtime is older than any entry in _ROCM_TORCH_INDEX. Also add --force-reinstall so an existing CPU/CUDA bitsandbytes is replaced by the AMD build during upgrades. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix gemini findings: amd-smi metric envelope validation and dict-wrapped GPU id Two medium-severity defensive fixes from the gemini-code-assist review on the AMD monitoring backend: 1. _extract_gpu_metrics may return a dict where every value is None when amd-smi succeeds (zero exit) but the JSON envelope contains no usable fields (error response, unsupported card). The new _has_real_metrics helper lets get_primary_gpu_utilization surface available:False and lets get_visible_gpu_utilization skip ghost device rows so the UI does not render placeholder cards with empty numbers. 2. Newer amd-smi versions wrap scalar fields as {"value": 0, "unit": "none"}, including the per-GPU id. The previous int(raw_id) call silently fell back to the enumeration index in that case, losing the real GPU id. Routing raw_id through the existing _parse_numeric helper handles bare ints, floats, strings, and the dict shape uniformly, with a debug log on parse failure. * Fix gemini round 2 findings: explicit length guard on ROCm version file parser Both _detect_rocm_version (install_python_stack.py) and _detect_host_rocm_version (install_llama_prebuilt.py) read /opt/rocm/.info/version or $ROCM_PATH/lib/rocm_version, split on "." and unconditionally accessed parts[1]. The surrounding broad `except Exception: pass` already swallowed the resulting IndexError, so a one-component file like "6\n" did fall through to the next detection source -- but the control flow relied on exception handling instead of an explicit check. 
Add `if len(parts) >= 2:` guards in both helpers so the loop falls through on its own without raising. Behaviour is unchanged for the common multi-component case; the previously-silent IndexError path becomes an explicit no-op. * Fix gemini round 3: include has_rocm in validate_server fallback path When validate_server is called without an explicit install_kind (older call sites that have not been updated), the fallback was only enabling --n-gpu-layers for NVIDIA and macOS arm64 hosts. AMD ROCm Linux hosts fell through to the CPU validation path even though the prebuilt being exercised was a HIP binary. Add host.has_rocm to the fallback expression so the GPU offload flag is applied consistently with the install_kind=='linux-rocm' / 'windows-hip' branches above. * Fix gemini round 4: remove risky bytes-vs-MB heuristic in _parse_memory_mb The previous heuristic divided any bare number above 10_000_000 by 1024*1024 on the assumption that large unit-less values were bytes. This misclassified small VRAM allocations: 5 MB of used VRAM reported as 5_242_880 bytes without a unit would be taken at face value and render as 5_242_880 MB (~5 TB) in the monitoring UI. Modern amd-smi always provides explicit units (MiB/GiB dict form), and legacy amd-smi returns bare numbers in MB -- the heuristic never had a real workload to handle. Drop it and default to MB for bare numeric input, keeping the existing unit-aware branches for dict / string inputs unchanged. The unrelated gemini suggestion to "default minor to 0" in the amd-smi version awk parser was intentionally NOT applied: rocm7.0 and rocm7.1 ship different wheel sets, so silently substituting 0 for a missing minor could install the wrong wheels. The existing reject-and-fall-through behaviour is safer. * Fix gemini round 5: POSIX compliance and leading-comma visibility parsing Three medium findings from gemini-code-assist addressed in this commit: 1.
_pick_radeon_wheel used grep -o and sort -V, both GNU extensions that are not in POSIX and break on BSD/BusyBox coreutils. install.sh has a #!/bin/sh shebang so the whole pipeline was rewritten as a single awk script that extracts all href="..." hits on each line, filters to wheels matching the package prefix and python tag, and picks the newest version via zero-padded lexical comparison. No external sort or grep is needed. 2. _first_visible_amd_gpu_id in the AMD monitoring backend treated a leading comma (e.g. HIP_VISIBLE_DEVICES=",1") as "fall through to the next env var", which is surprising given the clear intent to narrow to device 1. Filter empty tokens after the split and return the first real one. An all-commas value ("," / ",,,") still falls through because no real tokens exist; the empty-string and "-1" explicit-zero cases are unchanged. The unrelated amd-smi version awk parser suggestion was not applied (see round 4 commit message for rationale: defaulting a missing minor to 0 could silently install the wrong ROCm wheel set). * Fix 20-reviewer.py findings: base drift, Radeon %2B, dpkg/rpm fallback, bnb, backend label Consolidated fix batch from a 20-parallel reviewer.py run on the current head. Each fix is drawn from a high-consensus finding and addresses a real bug or feature gap, not a stylistic preference. 1. install.sh: bump `unsloth>=2026.4.2` -> `unsloth>=2026.4.4` at five call sites so this branch no longer regresses main's version floor (main bumped to 2026.4.4 in #4876). Without this, merging 4720 would silently downgrade the minimum version pin for fresh installs. 2. install.sh: URL-decode Radeon wheel names before extracting the torch / torchvision / torchaudio version strings. 
Real wheel URLs from repo.radeon.com are percent-encoded ("torch-2.10.0%2Brocm7.2.0...") so the previous `[+-]` terminator in the sed regex never matched, `_torch_ver` stayed empty, `_radeon_versions_match` stayed false, and every Radeon consumer install silently fell back to the generic ROCm index. Now decode %2B -> + first, then extract, then validate. 3. install.sh: the two AMD bitsandbytes install lines were running `uv pip install "bitsandbytes>=0.49.1"` without `--force-reinstall`, so upgrades where the venv already has a CPU/CUDA bitsandbytes satisfying the constraint would keep the stale non-AMD wheel. Add `--force-reinstall --no-cache-dir` to both call sites, matching the pattern already used in install_python_stack.py::_ensure_rocm_torch. 4. install_python_stack.py and install_llama_prebuilt.py: add `dpkg-query -W rocm-core` and `rpm -q rocm-core` fallbacks to the Python-side ROCm version detectors so they match the chain in install.sh::get_torch_index_url. Package-managed ROCm installs (Debian/Ubuntu/RHEL/Fedora distro packages) can expose GPUs via rocminfo/amd-smi but still lack /opt/rocm/.info/version, hipconfig, or amd-smi `version` output -- without these fallbacks, `unsloth studio update` on such hosts returned None and skipped the ROCm torch repair. Also strip the dpkg epoch prefix ("1:6.3.0-1") before parsing so epoch-annotated packages parse correctly. 5. hardware.py: add a `_backend_label(device)` helper that returns "rocm" when IS_ROCM is set and the device is DeviceType.CUDA, and use it for every `"backend": ...` emission in JSON responses served to the Studio frontend. Internally we still represent ROCm hosts as DeviceType.CUDA (ROCm torch reuses the whole torch.cuda.* API surface), but the user-facing API now correctly reports "rocm" on AMD boxes instead of labeling them as "cuda". 
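A minimal Python sketch of items 2 and 4 above, assuming wheel filenames shaped like `torch-2.10.0%2Brocm7.2.0-...whl`: percent-decode first so `%2B` becomes `+`, then cut the public version at the first `+` or `-`. For `dpkg-query` output, drop the epoch before splitting and require at least two components, as the `len(parts) >= 2` guard does. The helper names `wheel_version`, `strip_dpkg_epoch` and `major_minor` are hypothetical; install.sh does this with `sed`, and the Python detectors may be structured differently.

```python
from __future__ import annotations

import re
from urllib.parse import unquote


def wheel_version(url: str, package: str) -> str | None:
    """Return the public version (e.g. "2.10.0") of `package` from a wheel URL."""
    # Decode first: "torch-2.10.0%2Brocm7.2.0-..." -> "torch-2.10.0+rocm7.2.0-...".
    filename = unquote(url.rsplit("/", 1)[-1])
    # Stop at "+" (local version label) or "-" (start of the Python tag).
    match = re.match(rf"{re.escape(package)}-([0-9][^+\-]*)", filename)
    return match.group(1) if match else None


def strip_dpkg_epoch(version: str) -> str:
    """Drop a Debian epoch prefix: "1:6.3.0-1" -> "6.3.0-1"."""
    return version.split(":", 1)[1] if ":" in version else version


def major_minor(version: str) -> tuple[int, int] | None:
    """Parse "6.3.0-1" -> (6, 3); return None for one-component input like "6"."""
    parts = re.split(r"[.\-]", version.strip())
    if len(parts) >= 2 and parts[0].isdigit() and parts[1].isdigit():
        return int(parts[0]), int(parts[1])
    return None
```

Without the `unquote` step the same pattern would return `2.10.0%2Brocm7.2.0` for torch, a string that no torch / torchvision / torchaudio version-pair check would accept.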
All 250 simulation scenarios pass (was 233 before this batch: added 17 new regression tests covering the version pin, %2B decoding, bnb force-reinstall flags, dpkg/rpm fallback presence, and the _backend_label helper's four-way truth table). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix gemini round 6 + URL audit: amd.py defensive checks, rocm6.5+ clip to 6.4 Two rounds of fixes in one commit, plus a full URL audit of every PyPI / download.pytorch.org / repo.radeon.com reference the PR introduces. amd.py (4 medium gemini findings on commit b3627bc2): 1. _extract_gpu_metrics used `and vram_total_mb` as part of the vram_util gate. The follow-up `vram_total_mb > 0` already handles the division guard, but the truthiness check was redundant and slightly surprising for a 0.0 valid value. Replace with explicit `is not None and > 0` for both vram_util and power_util. 2. get_physical_gpu_count called `data.get("gpu", ...)` without guarding for non-dict envelopes. A scalar / string JSON response from amd-smi would raise AttributeError. Add an isinstance(data, dict) check and return None for unexpected shapes. 3. get_visible_gpu_utilization had the same .get() exposure on the outer envelope. Rewrite the gpu_list extraction as an explicit list/dict/else cascade so a malformed scalar envelope produces gpu_list=[data] and continues without raising. 4. The same function's per-entry loop also called gpu_data.get() on whatever was inside gpu_list. If a scalar ever leaks into the list (directly or via the previous fix's fallback), _extract_gpu_metrics would raise on the first .get() inside the helper. Skip non-dict entries in the loop before extracting metrics. install.sh (URL audit finding, previously flagged by 20-reviewer as #13): 5. 
get_torch_index_url used `rocm6.*` in the rocm tag case statement, which matched rocm6.5 and rocm6.6 and emitted download.pytorch.org/whl/rocm6.5 -- which returns HTTP 403 because PyTorch only publishes rocm 5.7, 6.0-6.4, 7.0-7.2. Enumerate the supported 6.x minors explicitly and add a rocm6.* fallback branch that clips to rocm6.4 (the last supported 6.x wheel set). URL audit results (all URLs PR 4720 references): - 14/14 download.pytorch.org/whl/{cpu,cu118,cu124,cu126,cu128,cu130, rocm6.0..6.4,rocm7.0..7.2} return HTTP 200. - 9/9 repo.radeon.com/rocm/manylinux/rocm-rel-{5.7,6.0,6.1,6.2,6.3, 6.4,7.0,7.1,7.2}/ return HTTP 200. - X.Y.Z patch directories exist for 7.0.2, 7.1.1, 7.2.1 but NOT for 6.3.0, 6.4.0, 6.2.1 -- install.sh already handles this via the X.Y.Z -> X.Y fallback sed in the Radeon wheel install block. - Docs links (rocm.docs.amd.com, docs.unsloth.ai AMD guide) and the llama.cpp GitHub releases API endpoint all return 200. Test suite: 255 -> 258. New regression coverage: - U17: get_physical_gpu_count tolerates scalar amd-smi envelope - U18: get_visible_gpu_utilization tolerates scalar envelope - U19a-c: vram_util / power_util return None on zero total, but vram_total_gb still echoes 0.0 (not None) - A_rocm{6.5,6.6,6.9}_clips_to_rocm64: install.sh clips unsupported 6.x minors to rocm6.4 instead of producing a 403 index URL * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix reviewer.py round 2: tokenizer AMD multi-GPU, --no-torch bnb, main.py backend label Three high-confidence findings from a second 20-parallel reviewer.py run on commit 7effb3ae. Triaged 15 total findings and applied the three that were confirmed as real bugs; the rest were either false positives (e.g. "migrated AMD venv not repaired" -- _ensure_rocm_torch runs downstream via setup.sh regardless), design decisions (e.g. 
visibility mask env vars not consulted in installer detection), or edge cases the existing fallback logic already handles. 1. unsloth/tokenizer_utils.py [6/20]: the multi-GPU guard's shell probe runs `nvidia-smi --query-gpu=memory.used`, catches the failure, then only raises if `torch.cuda.is_available()` is False. On ROCm torch, torch.cuda.is_available() returns True (ROCm reuses the torch.cuda.* API), so the guard becomes dead code on AMD hosts and multi-GPU AMD setups slip through even though unsloth does not support them yet. Add a torch.cuda.device_count() > 1 fallback inside the except so AMD multi-visible-device setups are flagged consistently with the original CUDA memory check. 2. install.sh [1/20]: the fresh-install bitsandbytes block for AMD ROCm ran unconditionally when TORCH_INDEX_URL matched `*/rocm*`, even when SKIP_TORCH=true (from --no-torch or Intel Mac auto-detect). A user running `install.sh --no-torch` on an AMD host would still pull in bitsandbytes despite explicitly asking for GGUF-only mode. Wrap the case block in an outer `[ "$SKIP_TORCH" = false ]` guard. 3. studio/backend/main.py [3/20]: the /api/system endpoint returned `"device_backend": get_device().value`, which is "cuda" on ROCm hosts (because ROCm torch piggybacks on torch.cuda). Other endpoints (hardware.py) already use the _backend_label helper which swaps "cuda" -> "rocm" when IS_ROCM. Route /api/system through the same helper so the Studio UI reports the backend consistently across all endpoints. 4. studio/backend/tests/test_utils.py: update test_backend_matches_device to call _backend_label(get_device()) instead of raw get_device().value so the test matches the new contract and still passes on CUDA hosts. Tests: 258 -> 261. 
New regression coverage: - X08 main.py /api/system uses _backend_label - X09 tokenizer multi-GPU guard has device_count() fallback - X10 fresh-install bnb case block gated on SKIP_TORCH=false * fix: prevent bitsandbytes from overwriting ROCm torch with CUDA wheels During install, bitsandbytes was installed without --no-deps, causing uv to resolve torch from PyPI (CUDA build) and silently overwrite the ROCm wheels that were just installed in the previous step. This happened in three places: - install.sh: bitsandbytes install in both migrated and fresh paths - install_python_stack.py: bitsandbytes install inside _ensure_rocm_torch() Additionally, multiple install steps in install_python_stack.py (extras, overrides, studio deps) can pull in CUDA torch via transitive dependencies. A final _ensure_rocm_torch() call at the end of the install sequence ensures ROCm torch is always in place at runtime. All changes are gated behind ROCm-specific conditions and do not affect NVIDIA, CPU-only, macOS, or Windows install paths. Tested on AMD Instinct MI300X VF with ROCm 7.2.0 -- confirms torch==2.10.0+rocm7.1 with HIP 7.1.25424 after install. * fix: ROCm inference fallback -- skip Unsloth patching and bnb 4-bit on HIP On AMD ROCm (HIP), two issues prevent the normal Unsloth inference path: 1. Unsloth's global monkey-patching of transformers model classes (LlamaRotaryEmbedding, attention modules) triggers _assert_async_cuda_kernel crashes on HIP during generation. Training uses different code paths and works fine. 2. bitsandbytes 4-bit matmul kernels also trigger HIP assertion failures on MI300X (CDNA3 / gfx942), even without Unsloth patching. This commit adds a ROCm-specific inference fallback that: - Skips importing Unsloth at module level (prevents global patching) - Loads models in 16-bit with plain transformers + PEFT instead - Resolves pre-quantized model names (e.g. 
"xxx-bnb-4bit" -> "xxx") since pre-quantized HF repos still trigger bnb codepaths - Guards get_chat_template calls (unavailable without Unsloth import) - Fixes max_seq_length=0 being passed to from_pretrained (GGUF semantics don't apply to transformers path) The NVIDIA path is completely unchanged -- Unsloth import and for_inference() optimization remain active. GGUF inference (via llama-server/HIP) is unaffected since it never imports Python model classes. AMD GPUs typically have large VRAM (e.g. 192GB on MI300X) so 16-bit loading is practical for inference. Tested on AMD Instinct MI300X VF (ROCm 7.2, HIP 7.1.25424): - Simple generation: PASS - Compare mode (base vs finetuned): PASS - GGUF inference + tool calling: PASS (unaffected by this change) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: guard audio/vision inference on ROCm, remove unused import - Add clear RuntimeError for audio/vision model inference on ROCm (these paths use Unsloth's FastModel/FastVisionModel which would crash on HIP; GGUF inference is the supported path on AMD) - Remove unused `import os as _os` from the ROCm changes * fix: amd-smi parsing for newer output format (gpu_data wrapper, mem_usage, temperature) amd-smi on recent ROCm versions (7.x) wraps metric output in a {"gpu_data": [...]} envelope instead of returning a raw list. This caused get_primary_gpu_utilization() and get_visible_gpu_utilization() to fail silently (returning available=False) because the GPU data dict was never unwrapped. Additionally: - VRAM data moved from "vram" to "mem_usage" with "total_vram" / "used_vram" keys. Added fallback key lookup. - Temperature "edge" sensor returns "N/A" on MI300X VF; the previous dict.get() chain returned the "N/A" string instead of falling through to "hotspot". Changed to a loop that checks each key until a parseable value is found. 
Tested on AMD Instinct MI300X VF (ROCm 7.2, amd-smi 24.x): - GPU utilization: 0% (idle), up to 100% during training - Temperature: 40-44C (from hotspot sensor) - VRAM: 0.28/191.69 GB (idle) - Power: 158-211W draw * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Bug fix detecting radeon (#4940) * Bug fix detecting radeon * Expanding GPU target for gfx1100* * Generalize gfx family-prefix filter to cover gfx10/gfx12 as well rocminfo on ROCm 6.1+ emits LLVM generic-family ISA lines alongside the specific GPU (e.g. gfx11-generic next to gfx1100). The outer grep captures the bare family prefix from the generic line, and passing that to -DGPU_TARGETS breaks the HIP build because clang only accepts specific gfxNNN ids. The previous filter only special-cased gfx11. Generalize it so any bare 2-digit family prefix (gfx10, gfx11, gfx12, ...) is dropped whenever a specific sibling target is present in the same list. No real AMD GPU has a 2-digit gfx id, so the filter can only ever drop family prefixes and never a real target. Covers the existing gfx11 cases unchanged, and extends the same fix to gfx10-1-generic / gfx10-3-generic (RDNA1/2) and gfx12-generic (RDNA4), which would otherwise hit the same build failure on newer rocminfo. --------- Co-authored-by: Iswarya Alex <iswarya.alex@amd.com> Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com> --------- Co-authored-by: Eda Z <eda.zhou@amd.com> Co-authored-by: GoldenGrapeGentleman <yueyuan@amd.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: billishyahao <bill.he@amd.com> Co-authored-by: Iswarya Alex <47045679+iswaryaalex@users.noreply.github.com> Co-authored-by: Iswarya Alex <iswarya.alex@amd.com> Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
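The amd-smi parsing rules described in this commit message (the `{"gpu_data": [...]}` envelope, `mem_usage` with `total_vram` / `used_vram`, the edge-then-hotspot temperature loop that skips `"N/A"`, `{"value": ..., "unit": ...}` wrappers, GB and GiB both treated as x1024, bare numbers taken as MB) can be sketched in Python as below. This is an illustration, not the code in amd.py: all function names are hypothetical, the payload shape is an assumption about amd-smi 7.x output, and the legacy `vram` block is assumed to use the same key names as `mem_usage`.

```python
from __future__ import annotations

import re
from typing import Any

# MiB per unit. GB and GiB are both treated as binary (x1024).
_TO_MB = {
    "b": 1 / (1024 * 1024),
    "bytes": 1 / (1024 * 1024),
    "mb": 1.0,
    "mib": 1.0,
    "gb": 1024.0,
    "gib": 1024.0,
}


def parse_numeric(raw: Any) -> float | None:
    """Bare number, numeric string, or {"value": ..., "unit": ...} -> float."""
    if isinstance(raw, dict):
        raw = raw.get("value")
    if raw is None or isinstance(raw, bool):
        return None
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None  # e.g. "N/A"


def parse_memory_mb(raw: Any) -> float | None:
    """VRAM field -> MiB, honouring an explicit unit; bare numbers are MB."""
    unit = "mb"
    if isinstance(raw, dict):
        unit = str(raw.get("unit") or "mb").lower()
        raw = raw.get("value")
    elif isinstance(raw, str):
        match = re.fullmatch(r"\s*([0-9.]+)\s*([A-Za-z]*)\s*", raw)
        if match is None:
            return None
        raw, unit = match.group(1), (match.group(2) or "mb").lower()
    value = parse_numeric(raw)
    factor = _TO_MB.get(unit)
    return None if value is None or factor is None else value * factor


def unwrap_gpu_list(data: Any) -> list[dict]:
    """Accept {"gpu_data": [...]}, a bare list or a single dict; drop non-dicts."""
    if isinstance(data, dict) and "gpu_data" in data:
        data = data["gpu_data"]
    rows = data if isinstance(data, list) else [data]
    return [row for row in rows if isinstance(row, dict)]


def first_temperature(gpu: dict) -> float | None:
    """Try "edge", then "hotspot"; skip sensors whose value does not parse."""
    temps = gpu.get("temperature")
    if not isinstance(temps, dict):
        return None
    for key in ("edge", "hotspot"):
        value = parse_numeric(temps.get(key))
        if value is not None:
            return value
    return None


def vram_mb(gpu: dict) -> tuple[float | None, float | None]:
    """(used, total) in MiB, preferring the newer "mem_usage" block."""
    mem = gpu.get("mem_usage")
    if not isinstance(mem, dict):
        mem = gpu.get("vram")  # assumed legacy block with the same key names
    if not isinstance(mem, dict):
        return None, None
    return parse_memory_mb(mem.get("used_vram")), parse_memory_mb(mem.get("total_vram"))
```

Returning `None` rather than raising keeps a malformed envelope or a ghost row from breaking the whole metrics response, which is the defensive behaviour the commit aims for.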
2026-04-10 08:56:12 +00:00
if not IS_WINDOWS and not IS_MACOS and not NO_TORCH:
base_total += 3
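The AMD GPU probes tightened in the ROCm commit above reduce to a few regular expressions: an `amd-smi list` data row in either the `GPU: 0` or the older `GPU[0]` layout, a `rocminfo` ISA name matching `gfx[1-9]` so the `gfx000` CPU agent is ignored, and the family-prefix filter that drops a bare `gfx11` (captured from `gfx11-generic`) when a specific sibling such as `gfx1100` is present. A minimal Python sketch; the function names and sample strings are hypothetical, "sibling" is read as "a specific id starting with that prefix", and the real checks live in shell and in the Python installers.

```python
from __future__ import annotations

import re

# Device row from `amd-smi list`: "GPU: 0" (newer) or "GPU[0] : ..." (older).
_AMD_SMI_ROW = re.compile(r"^GPU[ \t]*(?::[ \t]*\d|\[\d)", re.MULTILINE)
# rocminfo ISA name of a real GPU; gfx000 (the CPU HSA agent) is excluded.
_ROCMINFO_GPU = re.compile(r"\bgfx[1-9][0-9a-z]*")
# Bare two-digit family prefix, e.g. "gfx11" captured from "gfx11-generic".
_FAMILY_PREFIX = re.compile(r"gfx\d{2}")


def amd_smi_lists_gpu(output: str) -> bool:
    """True if `amd-smi list` output has a GPU data row, not just a header."""
    return _AMD_SMI_ROW.search(output) is not None


def rocminfo_targets(output: str) -> list[str]:
    """All gfx ids in `rocminfo` output, family prefixes included."""
    return _ROCMINFO_GPU.findall(output)


def filter_gpu_targets(targets: list[str]) -> list[str]:
    """Drop family prefixes that have a specific sibling; de-duplicate."""
    specific = [t for t in targets if not _FAMILY_PREFIX.fullmatch(t)]
    kept: list[str] = []
    for target in targets:
        is_prefix = _FAMILY_PREFIX.fullmatch(target) is not None
        if is_prefix and any(s.startswith(target) for s in specific):
            continue  # clang only accepts specific ids such as gfx1100
        if target not in kept:
            kept.append(target)
    return kept
```

The surviving ids could then be joined with `;` (CMake's list separator) for `-DGPU_TARGETS`. Because no real AMD GPU has a two-digit gfx id, the filter can only ever drop family prefixes.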
Consolidate dual venvs and separate install from update (#4530) * refactor: consolidate dual venvs into single ~/.unsloth/studio/unsloth_studio * refactor: separate install.sh (first-time) from setup.sh (smart update with PyPI version check) * fix: install.sh calls setup.sh directly, keep both setup and update CLI commands * fix: use importlib.resources.files() directly without _path attribute * fix: bootstrap uv before pip upgrade to handle uv venvs without pip * fix: frontend 404 when launched via CLI, add global symlink to ~/.local/bin * feat: add --local flag to install.sh and unsloth studio update for branch testing * fix: resolve repo root from script location for --local installs * feat: add --package flag to install.sh for testing with custom package names * feat: add --package flag to unsloth studio update * fix: always nuke venv in install.sh for clean installs * revert: remove Windows changes, will handle in separate PR * fix: error when --package is passed without an argument * revert: restore Windows scripts to current main * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: always explicitly set STUDIO_LOCAL_INSTALL and STUDIO_PACKAGE_NAME env vars * fix: pass explicit STUDIO_LOCAL_REPO env var for --local installs * fix: align banner box for Setup vs Update labels * deprecate: hide 'unsloth studio setup' command, point users to update/install.sh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: check stdout not stdin for auto-launch detection (curl pipe fix) * fix: update install URL to unsloth.ai/install.sh * fix: update install.sh usage comments to unsloth.ai/install.sh * fix: use --upgrade-package for base deps to preserve existing torch/CUDA installs * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: --local install now also installs unsloth-zoo via base.txt before 
editable overlay * fix: don't skip base packages for --local installs (editable needs unsloth-zoo) * refactor: move --local full dep install to install.sh, keep SKIP_STUDIO_BASE for all paths * feat: add migration support for old .venv and CWD-based installs in setup.sh * Revert "feat: add migration support for old .venv and CWD-based installs in setup.sh" This reverts commit 301291d0028b61e15acc064829f48be50c764087. * feat: migrate old .venv layout in install.sh instead of always nuking * feat: validate old .venv with torch CUDA test before migration, recovery message on launch failure * fix: try CUDA then fall back to CPU for migration validation * fix: upgrade unsloth/unsloth-zoo with --reinstall-package on migration to preserve torch * remove: delete unused unsloth ui command (use unsloth studio instead) * Fix Windows venv path mismatch between install.ps1, setup.ps1, and studio.py install.ps1 was creating the venv CWD-relative ($VenvName = "unsloth_studio"), setup.ps1 was using an absolute path to ".unsloth\studio\.venv", and studio.py looks for ".unsloth\studio\unsloth_studio". All three paths were different, so the Windows installer would never produce a working Studio setup. 
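A minimal sketch of the consolidated layout, assuming `pathlib` and the paths named above: the target venv is always resolved from the home directory, never from the CWD, and the two older layouts are treated as migration sources. The helper names and the order in which legacy paths are tried are hypothetical; install.ps1 itself is PowerShell.

```python
from __future__ import annotations

from pathlib import Path


def studio_venv_dir(home: Path | None = None) -> Path:
    """Canonical venv, resolved from the home directory rather than the CWD."""
    home = Path.home() if home is None else home
    return home / ".unsloth" / "studio" / "unsloth_studio"


def legacy_venv_candidates(home: Path | None = None) -> list[Path]:
    """Older layouts that could be migrated (the order is an assumption)."""
    home = Path.home() if home is None else home
    return [
        home / ".unsloth" / "studio" / ".venv",  # old setup.ps1 / .venv layout
        home / "unsloth_studio",  # CWD-relative install.ps1 run from home
    ]


def pick_venv(home: Path | None = None) -> tuple[Path, Path | None]:
    """Return (target venv, legacy venv to migrate from, or None)."""
    target = studio_venv_dir(home)
    if target.is_dir():
        return target, None
    for candidate in legacy_venv_candidates(home):
        if candidate.is_dir():
            return target, candidate
    return target, None
```

Passing `home` explicitly keeps the helpers testable without touching the real home directory.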
install.ps1: - Use absolute $StudioHome + $VenvDir matching the Linux install.sh layout - Add 3-way migration: old .venv at STUDIO_HOME, CWD-relative ~/unsloth_studio from the previous install.ps1, or fresh creation with torch validation - For migrated envs, upgrade unsloth while preserving existing torch/CUDA wheels - Set SKIP_STUDIO_BASE=1 before calling setup.ps1 (matches install.sh behavior) - Fix launch instructions to use the absolute venv path setup.ps1: - Change $VenvDir from ".unsloth\studio\.venv" to ".unsloth\studio\unsloth_studio" - Add SKIP_STUDIO_BASE guard: error out if venv is missing when called from install.ps1 (which should have already created it) - Differentiate "Setup" vs "Update" in banners based on SKIP_STUDIO_BASE * setup.ps1: unconditionally error if venv missing, matching setup.sh setup.sh always errors out if the venv does not exist (line 224-228), telling the user to run install.sh first. setup.ps1 was conditionally creating a bare venv with python -m venv when SKIP_STUDIO_BASE was not set, which would produce an empty venv with no torch or unsloth. Now setup.ps1 matches setup.sh: always error, always point to install.ps1. * Fix --torch-backend=auto CPU solver dead-end on Linux, macOS, and Windows On CPU-only machines, `uv pip install unsloth --torch-backend=auto` falls back to unsloth==2024.8 because the CPU solver cannot satisfy newer unsloth's dependencies. install.ps1 already solved this with a two-step approach; this applies the same fix to install.sh and install_python_stack.py. install.sh: add get_torch_index_url() that detects GPU via nvidia-smi and maps CUDA versions to PyTorch index URLs (matching install.ps1's Get-TorchIndexUrl). Fresh installs now install torch first via explicit --index-url, then install unsloth with --upgrade-package to preserve the pre-installed torch. All 5 --torch-backend=auto removed from primary paths. 
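The CUDA-to-index mapping that `get_torch_index_url()` performs can be sketched in Python as below. The thresholds are an assumption: pick the newest tag among the cu118 to cu130 indexes listed in the URL audit above that does not exceed the CUDA version `nvidia-smi` reports. Returning `None` for unparseable output stands in for the `--torch-backend=auto` fallback branch. `torch_index_url` is a hypothetical name, not the script's interface.

```python
from __future__ import annotations

import re

_BASE = "https://download.pytorch.org/whl/"
# (minimum driver CUDA version, wheel tag), newest first. Assumed thresholds.
_CUDA_TAGS = [
    ((13, 0), "cu130"),
    ((12, 8), "cu128"),
    ((12, 6), "cu126"),
    ((12, 4), "cu124"),
    ((11, 8), "cu118"),
]


def torch_index_url(nvidia_smi_output: str | None) -> str | None:
    """Map `nvidia-smi` output to a PyTorch wheel index URL.

    No output (no NVIDIA GPU) -> CPU index. Output without a parseable
    "CUDA Version: X.Y" field -> None, so the caller can fall back to
    `--torch-backend=auto`.
    """
    if not nvidia_smi_output:
        return _BASE + "cpu"
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        return None
    driver = (int(match.group(1)), int(match.group(2)))
    for minimum, tag in _CUDA_TAGS:
        if driver >= minimum:
            return _BASE + tag
    return _BASE + "cpu"  # driver too old for any listed CUDA wheel set
```

Comparing `(major, minor)` tuples avoids the string-comparison trap where `"12.10"` would sort before `"12.8"`.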
install.ps1: add fallback else-branch when TorchIndexUrl is empty, using --torch-backend=auto as last resort (matching install.sh). install_python_stack.py: remove unconditional --torch-backend=auto from _build_uv_cmd. Torch is pre-installed by install.sh/setup.ps1 by the time this runs. Callers that need it can set UV_TORCH_BACKEND. Both install.sh and install.ps1 now share the same three-branch logic: migrated env (upgrade-package only), normal (torch-first + index-url), and fallback (--torch-backend=auto if URL detection fails). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Use --reinstall-package for migrated envs on both Linux and Windows For migrated environments (moved from legacy venv location), --reinstall-package is better than --upgrade-package because it forces a clean reinstall even if the same version is already installed. This ensures proper .dist-info and .pyc state in the new venv location. --upgrade-package remains correct for the fresh install path where torch is already installed and we just want to add unsloth without re-resolving torch. 
* Address review findings: portability, parity, and stale comments - Replace grep -oP (GNU Perl regex) with POSIX sed in get_torch_index_url() so the script works on BSD grep (macOS is already guarded by the Darwin early-return, but Alpine/BusyBox would silently get the wrong CUDA tag) - Add LC_ALL=C before nvidia-smi invocation to prevent locale-dependent output parsing issues - Add warning on stderr when nvidia-smi output is unparseable, matching install.ps1's [WARN] message - Add explicit unsloth-zoo positional arg to install.ps1 migrated path, matching install.sh (--reinstall-package alone won't install it if it was never present in the migrated env) - Fix stale comment in install_python_stack.py line 392 that still claimed --torch-backend=auto is added by _build_uv_cmd - Add sed to test tools directory (function now uses sed instead of grep) * Add --index-url to migrated env path to prevent CPU torch resolution The migrated path runs uv pip install with --reinstall-package for unsloth/unsloth-zoo. While uv should keep existing torch as satisfied, the resolver could still re-resolve torch as a transitive dependency. Without --index-url pointing at the correct CUDA wheel index, the resolver would fall back to plain PyPI and potentially pull CPU-only torch. Adding --index-url $TORCH_INDEX_URL ensures CUDA wheels are available if the resolver needs them. Applied to both install.sh and install.ps1. * Revert --index-url on migrated env path The original install.ps1 on main already handles the migrated path without --index-url and it works correctly. --reinstall-package only forces reinstall of the named packages while uv keeps existing torch as satisfied. No need for the extra flag. * Fix unsloth studio update --local not installing local checkout studio.py sets STUDIO_LOCAL_REPO when --local is passed, but install_python_stack.py never read it. The update path always installed from PyPI regardless of the --local flag. 
Add a local_repo branch that first updates deps from base.txt (with --upgrade-package to preserve torch), then overlays the local checkout as an editable install with --no-deps. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-03-25 12:24:21 +00:00
    _TOTAL = (base_total - 1) if skip_base else base_total

    # 1. Try to use uv for faster installs (must happen before pip upgrade
    # because uv venvs don't include pip by default)
    USE_UV = _bootstrap_uv()

    # 2. Ensure pip is available (uv venvs created by install.sh don't include pip)
    _progress("pip bootstrap")
    if USE_UV:
        run(
            "Bootstrapping pip via uv",
            [
                "uv",
                "pip",
                "install",
                "--python",
                sys.executable,
                "pip",
            ],
        )
    else:
        # pip may not exist yet (uv-created venvs omit it). If it is missing,
        # bootstrap it with ensurepip; if it is already present, upgrade it
        # directly with pip.
        _has_pip = (
            subprocess.run(
                [sys.executable, "-m", "pip", "--version"],
                stdout = subprocess.DEVNULL,
                stderr = subprocess.DEVNULL,
            ).returncode
            == 0
        )
        if not _has_pip:
            run(
                "Bootstrapping pip via ensurepip",
                [sys.executable, "-m", "ensurepip", "--upgrade"],
            )
        else:
            run(
                "Upgrading pip",
                [sys.executable, "-m", "pip", "install", "--upgrade", "pip"],
            )

    # 3. Core packages: unsloth-zoo + unsloth (or custom package name)
    if skip_base:
- setup.ps1: Same fix for Windows -- use Tee-Object in verbose mode. - setup.sh: Remove run_quiet_no_exit_always() which has no remaining callers. - startup_banner.py: Avoid printing the same URL twice when Studio is bound to a specific non-loopback address that matches the display host. * Fix run_install_cmd exit code after failed if-statement The previous pattern 'if "$@"; then return 0; fi; _rc=$?' always captured $? = 0 because $? reflects the if-statement result, not the command's exit code. Switch to '"$@" && return 0; _rc=$?' which preserves the actual command exit code on failure. Applies to both verbose and quiet branches. * Fix _run_quiet exit code, double uv install, missing --local flag - setup.sh: Fix _run_quiet verbose path that always captured exit code 0 due to $? resetting after if-then-fi with no else. Switch to the same '"$@" && return 0; exit_code=$?' pattern used in install.sh. - setup.sh: Consolidate the two uv install branches (verbose + quiet) into a single attempt with conditional output. Previously, when verbose mode was on and the install failed, a second silent attempt was made. - install.ps1: Pass --local flag to 'unsloth studio update' when $StudioLocalInstall is true. Without this, studio.py's update() command overwrites STUDIO_LOCAL_INSTALL to "0", which could cause issues if setup.ps1 or install_python_stack.py later checks that variable. * Revert SKIP_STUDIO_BASE change for --no-torch, restore install banners - Revert SKIP_STUDIO_BASE from 0 to 1 for --no-torch. install.sh already installs unsloth+unsloth-zoo and no-torch-runtime.txt before calling setup.sh, so letting install_python_stack.py redo it was redundant and slowed down --no-torch installs for no benefit. - Restore the "Unsloth Studio installed!" success banner and "starting Unsloth Studio..." launch message so users get clear install completion feedback before the server starts. 
* Make llama.cpp build failure a hard error with proper cleanup - setup.sh: Restore exit 1 when _LLAMA_CPP_DEGRADED is true. GGUF inference requires a working llama.cpp build, so this should be a hard failure, not a silent degradation. - install.sh: Catch setup.sh's non-zero exit with '|| _SETUP_EXIT=$?' instead of letting set -e abort immediately. This ensures PATH setup, symlinks, and shortcuts still get created so the user can fix the build deps and retry with 'unsloth studio update'. After post-install steps, propagate the failure with a clear error message. * Revert install.ps1 to 'studio setup' to preserve SKIP_STUDIO_BASE 'studio update' pops SKIP_STUDIO_BASE from the environment, which defeats the fast-path version check added in PR #4667. When called from install.ps1 (which already installed packages), SKIP_STUDIO_BASE=1 must survive into setup.ps1 so it skips the redundant PyPI check and package reinstallation. 'studio setup' does not modify env vars. * Remove deprecation message from 'studio setup' command install.ps1 uses 'studio setup' (not 'studio update') to preserve SKIP_STUDIO_BASE. The deprecation message was confusing during first install since the user never typed the command. * Fix stale env vars, scope degraded exit, generic error message for PR #4651 - install.ps1: Always set STUDIO_LOCAL_INSTALL and clear STUDIO_LOCAL_REPO when not using --local, to prevent stale values from a previous --local run in the same PowerShell session. Fix log messages to say 'setup' not 'update' since we call 'studio setup'. - setup.sh: Only exit non-zero for degraded llama.cpp when called from the installer (SKIP_STUDIO_BASE=1). Direct 'unsloth studio update' keeps degraded installs successful since Studio is still usable for non-GGUF workflows and the footer already reports the limitation. - install.sh: Make the setup failure error message generic instead of GGUF-specific, so unrelated failures (npm, Python deps) do not show misleading cmake/git recovery advice. 
* Show captured output on failure in quiet mode for PR #4651 Both Invoke-InstallCommand (install.ps1) and Invoke-SetupCommand (setup.ps1) now capture command output in quiet mode and display it in red when the command fails. This matches the behavior of run_install_cmd in install.sh where failure output is surfaced even in quiet mode, making cross-platform error debugging consistent. * Match degraded llama.cpp exit on Windows, fix --local recovery hint for PR #4651 - setup.ps1: Exit non-zero for degraded llama.cpp when called from install.ps1 (SKIP_STUDIO_BASE=1), matching setup.sh behavior. Direct 'unsloth studio update' keeps degraded installs successful. - install.sh: Show 'unsloth studio update --local' in the recovery message when the install was run with --local, so users retry with the correct flag instead of losing local checkout context. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com>
2026-03-30 07:53:23 +00:00
pass
elif NO_TORCH:
# No-torch update path: install unsloth + unsloth-zoo with --no-deps
# (the published PyPI metadata still declares torch as a hard dependency),
# then install the runtime deps with --no-deps so torch is not pulled in
# transitively.
_progress("base packages (no torch)")
pip_install(
f"Updating {package_name} + unsloth-zoo (no-torch mode)",
"--no-cache-dir",
"--no-deps",
"--upgrade-package",
package_name,
"--upgrade-package",
"unsloth-zoo",
package_name,
"unsloth-zoo",
)
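# For illustration only: assuming pip_install() forwards its trailing
# arguments unchanged to `uv pip install` (an assumption, not verified
# here), the call above amounts to roughly:
#
#   uv pip install --no-cache-dir --no-deps \
#       --upgrade-package <package_name> --upgrade-package unsloth-zoo \
#       <package_name> unsloth-zoo
#
# --no-deps keeps the resolver from following unsloth's declared torch
# dependency. Each --upgrade-package allows only that named package to be
# upgraded if a newer version exists, instead of leaving an already
# installed copy in place.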
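# Runtime deps for no-torch (GGUF-only) mode. --no-deps keeps torch from
# being pulled in transitively by any of these packages.
pip_install(
"Installing no-torch runtime deps",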
"--no-cache-dir",
"--no-deps",
req = REQ_ROOT / "no-torch-runtime.txt",
)
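# --local: overlay the checkout as an editable install. --no-deps so the
# overlay cannot pull torch back in.
if local_repo: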
pip_install(
"Overlaying local repo (editable)",
"--no-cache-dir",
"--no-deps",
"-e",
local_repo,
constrain = False,
)
elif local_repo:
# Local dev install: update deps from base.txt, then overlay the
# local checkout as an editable install (--no-deps so torch is
# never re-resolved).
_progress("base packages")
pip_install(
"Updating base packages",
"--no-cache-dir",
"--upgrade-package",
"unsloth",
"--upgrade-package",
"unsloth-zoo",
req = REQ_ROOT / "base.txt",
)
pip_install(
"Overlaying local repo (editable)",
"--no-cache-dir",
"--no-deps",
"-e",
local_repo,
constrain = False,
)
elif package_name != "unsloth":
# Custom package name (e.g. roland-sloth for testing): install it directly
_progress("base packages")
pip_install(
f"Installing {package_name}",
"--no-cache-dir",
package_name,
)
else:
# Update path: upgrade only unsloth + unsloth-zoo while preserving the
# existing torch/CUDA installation. Torch is pre-installed by
# install.sh / setup.ps1; --upgrade-package limits upgrades to these two
# packages, so already-satisfied requirements (including torch) are kept.
_progress("base packages")
pip_install(
"Updating base packages",
"--no-cache-dir",
"--upgrade-package",
"unsloth",
"--upgrade-package",
"unsloth-zoo",
req = REQ_ROOT / "base.txt",
)
Co-authored-by: GoldenGrapeGentleman <yueyuan@amd.com> Co-authored-by: billishyahao <bill.he@amd.com> * Support AMD Radeon for studio (#4770) Co-authored-by: Iswarya Alex <iswarya.alex@amd.com> * Remove ROCm test files from main PR Move test_rocm_support.py and shell test additions to a separate PR to keep the main ROCm support PR focused on implementation changes. * Fix installer and hardware detection issues for PR #4720 - Fix empty _tri_arg passed to uv pip install in Radeon path (causes "Empty field is not allowed for PEP508" error) - Fix Radeon fallback: use ROCm index instead of CPU-only when repo.radeon.com is unreachable (TORCH_INDEX_URL already has ROCm) - Use $TORCH_CONSTRAINT in fallback paths instead of hardcoded strings - Fix _pick_radeon_wheel: relax suffix to match manylinux_2_28_x86_64 wheels (AMD Radeon repo does not use bare linux_x86_64 platform tag) - Fix IS_ROCM export: use __getattr__ so callers always see the live value after detect_hardware() runs - Fix apply_gpu_ids: set HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES on ROCm so _get_parent_visible_gpu_spec picks up narrowed GPU set - Fix _parse_memory_mb: distinguish GB (1000 MB) from GiB (1024 MiB) - Add amd-smi version as a fallback in _detect_rocm_version - Fix trailing whitespace and missing newline at EOF in install.sh * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix GPU detection false positives and add missing health groups - Fix _has_rocm_gpu() false positive: require "GPU: <number>" data rows from amd-smi list, not just header containing "gpu" - Apply same fix in detect_host() in install_llama_prebuilt.py - Add runtime_payload_health_groups for linux-rocm and windows-hip so partial/corrupt ROCm/HIP prebuilt installs are properly detected - Add bitsandbytes install to Radeon fallback paths (was only in the success path, skipped when repo.radeon.com was unreachable) - Keep DEVICE/CHAT_ONLY as direct imports in __init__.py 
(matching main) and only use __getattr__ for IS_ROCM * Fix _ensure_rocm_torch and Windows AMD warning false positives - _ensure_rocm_torch: only skip when HIP is already present, not for CUDA builds (which are unusable on AMD-only hosts). Fixes the case where a venv has a stale CUDA wheel and the repair step is skipped. - Windows AMD warning: use GPU data row check (same as Linux fix) to avoid false positives from amd-smi list header-only output. * Fix amd-smi GPU detection for GPU[N] output format Older amd-smi versions output "GPU[0] : Card series: ..." instead of "GPU: 0". The regex now matches both "GPU: <digit>" and "GPU[<digit>" formats to detect actual GPU data rows. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Harden AMD GPU detection against false positives - install.sh: replace weak amd-smi list check (awk 'NR>1 && NF') with strict pattern matching GPU data rows (/^GPU[[:space:]]*[:\[]/) - All files: reject rocminfo gfx000 (CPU HSA agent) by requiring gfx[1-9] instead of gfx[0-9] in the rocminfo GPU probe - Fixes false positives on hosts with ROCm tools but no AMD GPU * Remove duplicate comment from pre-commit merge * Refactor: deduplicate AMD detection, consolidate bitsandbytes, clean up imports - Extract _has_amd_rocm_gpu() shell function to avoid duplicating the rocminfo/amd-smi GPU detection logic in get_torch_index_url and the Radeon auto-detect block - Consolidate bitsandbytes install into a single case block after torch install (was duplicated 4 times across Radeon success/fallback paths) - Move math and re imports to top of amd.py (were inline in functions) - Add _smi_query() helper in hardware.py to centralize IS_ROCM backend selection for get_gpu_utilization and get_visible_gpu_utilization Addresses Gemini code review suggestions. * Fix VRAM parsing for string values and GB/GiB consistency - Extract unit from string-valued VRAM fields (e.g. 
"192 GiB") so _parse_memory_mb correctly applies the unit multiplier instead of treating the value as bare MB - Treat GB and GiB identically (both as binary x1024) since GPU tools including amd-smi use binary units even when labeling them "GB" - Fixes incorrect VRAM reporting on MI300-class cards (was showing ~0.19 GB instead of 192 GB for string-valued outputs) * Add --no-cache to uv for ROCm HIP source builds Avoid stale cache artifacts from partial HIP source builds when uv is used for causal-conv1d/mamba-ssm compilation on ROCm. The pip path already uses --no-cache-dir; this adds the uv equivalent (--no-cache) only when is_hip is True. * Fix critical: initialize _amd_gpu_radeon before case block _amd_gpu_radeon was only set inside the */rocm*) case arm, so on NVIDIA/CPU/macOS paths where TORCH_INDEX_URL does not contain "rocm", the variable was unbound. With set -u (nounset) enabled, this crashes the installer for every non-AMD user. Move initialization to before the case block so it is always defined. * Fix Windows AMD: route has_rocm hosts to HIP prebuilt path resolve_release_asset_choice was selecting windows-cpu for all Windows x86_64 hosts including those with has_rocm=True. Windows AMD users should fall through to resolve_upstream_asset_choice which tries the HIP prebuilt first. Add "not host.has_rocm" guard to the published windows-cpu selection. * Harden ROCm detection, Radeon wheel fallback, and HIP visibility Addresses review findings from parallel reviewers on PR #4720: - install.sh: add _has_usable_nvidia_gpu() helper requiring nvidia-smi -L to actually list a GPU before treating the host as NVIDIA. Fixes the stale-nvidia-smi-on-PATH regression where AMD-only hosts fell into the CUDA branch. - install.sh: fix hipconfig awk blocks to propagate a non-zero exit code when the output is not a recognisable version string, so the ||-chain continues to dpkg-query / rpm instead of terminating early. - install.sh: fail-closed on Radeon wheel fallback. 
When torch, torchvision or torchaudio is missing from the Radeon repo for the active Python tag, fall back to the standard ROCm index instead of silently mixing Radeon wheels with PyPI defaults. Quote all wheel arguments individually so wheel filenames cannot be word-split or glob-expanded. - install_llama_prebuilt.py: detect_host() now requires nvidia-smi -L to list a GPU before setting has_physical_nvidia. Routes AMD ROCm hosts with a broken leftover nvidia-smi to the ROCm path instead of misclassifying them as NVIDIA. - install_llama_prebuilt.py: scan upstream assets for any rocm-<version> prebuilt instead of hard-coding rocm-7.2, so ROCm 6.x / 7.0 / 7.1 / 7.3+ users pick up a matching upstream prebuilt when one exists. - install_llama_prebuilt.py: validate_server() adds --n-gpu-layers 1 for linux-rocm and windows-hip hosts, so new HIP prebuilts are preflighted on the GPU path instead of passing validation on CPU only. - install_llama_prebuilt.py: restore the published windows-cpu fallback for AMD Windows hosts without a HIP prebuilt so hash-approved bundles are still preferred over the raw upstream CPU asset. - install_python_stack.py: drop the /opt/rocm / hipcc gate in _ensure_rocm_torch() and rely on _has_rocm_gpu(). Runtime-only ROCm installs (package-managed minimal installs, Radeon software) that ship amd-smi / rocminfo without hipcc can now repair a CPU-only venv via "unsloth studio update". Adds an explicit IS_WINDOWS / IS_MACOS guard. - studio/backend/utils/hardware/amd.py: honour HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES in get_primary_gpu_utilization(). A process restricted to GPU 2 now reports metrics for GPU 2 instead of physical GPU 0. Tighten the plain bytes unit detection to an explicit allowlist. - studio/backend/utils/hardware/hardware.py: route get_backend_visible_gpu_info()'s backend_cuda_visible_devices field through a helper that reads HIP_VISIBLE_DEVICES on ROCm. 
Drop the unconditional "(rocm=False)" suffix in apply_gpu_ids() logs. * Fix round 2 regressions: ROCm validate_server and Windows HIP routing Follow-up to 810b833b addressing review findings on the first round of hardening commits: - install_llama_prebuilt.py validate_server: gate --n-gpu-layers on the resolved install_kind instead of host.has_rocm. AMD Windows hosts without a HIP prebuilt fall back to windows-cpu and must not be validated with GPU layers; thread install_kind through from the caller. - install_llama_prebuilt.py resolve_release_asset_choice: reinstate the "not has_rocm" guard on the published windows-cpu bundle so AMD Windows hosts reach resolve_upstream_asset_choice() where the new HIP prebuilt path lives. Prefer a published windows-hip bundle first when one exists, fall through to upstream HIP + upstream CPU otherwise. - install_llama_prebuilt.py detect_host: also set has_physical_nvidia when the secondary --query-gpu block confirms a working NVIDIA GPU, so older nvidia-smi versions without -L support do not silently skip the Linux diagnostics that key off has_physical_nvidia. - install_llama_prebuilt.py: drop redundant "import re as _re" / "import re as _re_rocm" local aliases in favour of the existing top-level "import re". - install_python_stack.py _ensure_rocm_torch: run the AMD bitsandbytes install unconditionally after the HIP-torch probe so "unsloth studio update" on venvs that already have ROCm torch still gains the AMD bitsandbytes build. - install.sh: add a non-x86_64 early-exit to get_torch_index_url() so aarch64 / arm64 Linux hosts do not hit the ROCm wheel index (PyTorch only publishes ROCm wheels for linux_x86_64). - install.sh: add bitsandbytes install to the migrated-environment branch so upgrades pick it up for ROCm hosts instead of only the fresh-install path. 
- install.sh: in the Radeon wheel path, pass version constraints + --no-index --find-links to uv instead of explicit wheel URLs so a version-compatible torch / torchvision / torchaudio triple is resolved, rather than picking the highest-version wheel for each package independently. - studio/backend/utils/hardware/amd.py _first_visible_amd_gpu_id: fall through to lower-priority visibility env vars when the first entry is malformed (leading comma, all-whitespace first token) instead of silently returning GPU 0. * Fix round 3 findings: x86_64 guard, ROCm version clip, Radeon deps Address issues surfaced by the round 3 reviewers on top of 8636fa63: - install_python_stack.py _ensure_rocm_torch: add the same `x86_64` guard that install.sh already has. Linux aarch64 / arm64 ROCm hosts must skip the repair path entirely; PyTorch only publishes ROCm wheels for linux_x86_64, and without this guard `unsloth studio update` aborts with a missing-wheel error on non x86_64 hosts. - install_llama_prebuilt.py resolve_upstream_asset_choice: add a best-effort _detect_host_rocm_version() helper (reading /opt/rocm/.info/version, amd-smi version, hipconfig --version) and filter rocm_candidates to entries whose major.minor is <= host version. Falls back to the newest candidate only when no compatible one exists, so a ROCm 6.4 host downloads rocm-6.4 instead of being handed the numerically newest rocm-7.2 bundle (which fails preflight and forces a source build). - install.sh: remove the round 2 --no-index switch from the Radeon wheel branch. --no-index forced uv to ignore PyPI entirely, which broke transitive dependency resolution (filelock, sympy, networkx, jinja2, fsspec, setuptools, typing-extensions, ...) on a fresh venv. Restore the round 1 explicit wheel URL invocation but add a torch / torchvision / torchaudio version-pair sanity check so a mismatched trio (e.g. 
torch 2.9.1 + torchvision 0.23.0 + torchaudio 2.9.0) falls back to the standard ROCm index instead of installing a broken combination. - install_python_stack.py _ensure_rocm_torch: restructure the "tag is None" path so it no longer short-circuits the bitsandbytes install. On a ROCm runtime older than anything in _ROCM_TORCH_INDEX, print the "no wheel" warning but still run the AMD bitsandbytes install. - studio/backend/core/training/worker.py: restore the pre-PR "no timeout" behaviour for non-HIP causal-conv1d / mamba-ssm source builds. The round 2 "timeout = 1800 if is_hip else 300" cap aborts slow non-HIP builds (Linux aarch64, unsupported torch/CUDA combos) after 5 minutes; omit timeout for the non-HIP branch so the cap only applies to ROCm source builds. * Fix round 4 findings: apply_gpu_ids env inheritance, Radeon X.Y, bitsandbytes gate Address remaining issues surfaced by the round 4 reviewers: - studio/backend/utils/hardware/hardware.py apply_gpu_ids: mirror the selection into HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES whenever the caller already had a ROCm visibility env var set, not only when IS_ROCM has already been set by detect_hardware(). Training and inference workers call apply_gpu_ids() before detect_hardware() runs, so the old guard would leave a forked ROCm worker with a stale HIP_VISIBLE_DEVICES mask that no longer matched the narrowed CUDA_VISIBLE_DEVICES selection. - install.sh get_radeon_wheel_url: accept X.Y ROCm versions in addition to X.Y.Z. The `/opt/rocm/.info/version` file and some hipconfig versions report only two components, and the Radeon repository publishes both rocm-rel-X.Y.Z/ and rocm-rel-X.Y/ directories, so treating X.Y as invalid caused Radeon hosts to fall back to the generic ROCm index even when a matching AMD wheel set existed. - install_python_stack.py _ensure_rocm_torch: only install the AMD bitsandbytes build when the venv actually has a ROCm-compatible torch (either already present or just installed by this function). 
Previously the bitsandbytes install ran unconditionally, which could leave an AMD bitsandbytes layered on top of a CPU/CUDA torch on hosts where the ROCm runtime is older than any entry in _ROCM_TORCH_INDEX. Also add --force-reinstall so an existing CPU/CUDA bitsandbytes is replaced by the AMD build during upgrades. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix gemini findings: amd-smi metric envelope validation and dict-wrapped GPU id Two medium-severity defensive fixes from the gemini-code-assist review on the AMD monitoring backend: 1. _extract_gpu_metrics may return a dict where every value is None when amd-smi succeeds (zero exit) but the JSON envelope contains no usable fields (error response, unsupported card). The new _has_real_metrics helper lets get_primary_gpu_utilization surface available:False and lets get_visible_gpu_utilization skip ghost device rows so the UI does not render placeholder cards with empty numbers. 2. Newer amd-smi versions wrap scalar fields as {"value": 0, "unit": "none"}, including the per-GPU id. The previous int(raw_id) call silently fell back to the enumeration index in that case, losing the real GPU id. Routing raw_id through the existing _parse_numeric helper handles bare ints, floats, strings, and the dict shape uniformly, with a debug log on parse failure. * Fix gemini round 2 findings: explicit length guard on ROCm version file parser Both _detect_rocm_version (install_python_stack.py) and _detect_host_rocm_version (install_llama_prebuilt.py) read /opt/rocm/.info/version or $ROCM_PATH/lib/rocm_version, split on "." and unconditionally accessed parts[1]. The surrounding broad `except Exception: pass` already swallowed the resulting IndexError, so a one-component file like "6\n" did fall through to the next detection source -- but the control flow relied on exception handling instead of an explicit check. 
Add `if len(parts) >= 2:` guards in both helpers so the loop falls through on its own without raising. Behaviour is unchanged for the common multi- component case; the previously-silent IndexError path becomes an explicit no-op. * Fix gemini round 3: include has_rocm in validate_server fallback path When validate_server is called without an explicit install_kind (older call sites that have not been updated), the fallback was only enabling --n-gpu-layers for NVIDIA and macOS arm64 hosts. AMD ROCm Linux hosts fell through to the CPU validation path even though the prebuilt being exercised was a HIP binary. Add host.has_rocm to the fallback expression so the GPU offload flag is applied consistently with the install_kind=='linux-rocm' / 'windows-hip' branches above. * Fix gemini round 4: remove risky bytes-vs-MB heuristic in _parse_memory_mb The previous heuristic divided any bare number above 10_000_000 by 1024*1024 on the assumption that large unit-less values were bytes. This misclassified small VRAM allocations: 5 MB of used VRAM reported as 5_242_880 bytes without a unit would be taken at face value and render as 5_242_880 MB (~5 TB) in the monitoring UI. Modern amd-smi always provides explicit units (MiB/GiB dict form), and legacy amd-smi returns bare numbers in MB -- the heuristic never had a real workload to handle. Drop it and default to MB for bare numeric input, keeping the existing unit-aware branches for dict / string inputs unchanged. The unrelated gemini suggestion to "default minor to 0" in the amd-smi version awk parser was intentionally NOT applied: rocm7.0 and rocm7.1 ship different wheel sets, so silently substituting 0 for a missing minor could install the wrong wheels. The existing reject-and-fall-through behaviour is safer. * Fix gemini round 5: POSIX compliance and leading-comma visibility parsing Three medium findings from gemini-code-assist addressed in this commit: 1. 
_pick_radeon_wheel used grep -o and sort -V, both GNU extensions that are not in POSIX and break on BSD/BusyBox coreutils. install.sh has a #!/bin/sh shebang so the whole pipeline was rewritten as a single awk script that extracts all href="..." hits on each line, filters to wheels matching the package prefix and python tag, and picks the newest version via zero-padded lexical comparison. No external sort or grep is needed. 2. _first_visible_amd_gpu_id in the AMD monitoring backend treated a leading comma (e.g. HIP_VISIBLE_DEVICES=",1") as "fall through to the next env var", which is surprising given the clear intent to narrow to device 1. Filter empty tokens after the split and return the first real one. An all-commas value ("," / ",,,") still falls through because no real tokens exist; the empty-string and "-1" explicit-zero cases are unchanged. The unrelated amd-smi version awk parser suggestion was not applied (see round 4 commit message for rationale: defaulting a missing minor to 0 could silently install the wrong ROCm wheel set). * Fix 20-reviewer.py findings: base drift, Radeon %2B, dpkg/rpm fallback, bnb, backend label Consolidated fix batch from a 20-parallel reviewer.py run on the current head. Each fix is drawn from a high-consensus finding and addresses a real bug or feature gap, not a stylistic preference. 1. install.sh: bump `unsloth>=2026.4.2` -> `unsloth>=2026.4.4` at five call sites so this branch no longer regresses main's version floor (main bumped to 2026.4.4 in #4876). Without this, merging 4720 would silently downgrade the minimum version pin for fresh installs. 2. install.sh: URL-decode Radeon wheel names before extracting the torch / torchvision / torchaudio version strings. 
Real wheel URLs from repo.radeon.com are percent-encoded ("torch-2.10.0%2Brocm7.2.0...") so the previous `[+-]` terminator in the sed regex never matched, `_torch_ver` stayed empty, `_radeon_versions_match` stayed false, and every Radeon consumer install silently fell back to the generic ROCm index. Now decode %2B -> + first, then extract, then validate. 3. install.sh: the two AMD bitsandbytes install lines were running `uv pip install "bitsandbytes>=0.49.1"` without `--force-reinstall`, so upgrades where the venv already has a CPU/CUDA bitsandbytes satisfying the constraint would keep the stale non-AMD wheel. Add `--force-reinstall --no-cache-dir` to both call sites, matching the pattern already used in install_python_stack.py::_ensure_rocm_torch. 4. install_python_stack.py and install_llama_prebuilt.py: add `dpkg-query -W rocm-core` and `rpm -q rocm-core` fallbacks to the Python-side ROCm version detectors so they match the chain in install.sh::get_torch_index_url. Package-managed ROCm installs (Debian/Ubuntu/RHEL/Fedora distro packages) can expose GPUs via rocminfo/amd-smi but still lack /opt/rocm/.info/version, hipconfig, or amd-smi `version` output -- without these fallbacks, `unsloth studio update` on such hosts returned None and skipped the ROCm torch repair. Also strip the dpkg epoch prefix ("1:6.3.0-1") before parsing so epoch-annotated packages parse correctly. 5. hardware.py: add a `_backend_label(device)` helper that returns "rocm" when IS_ROCM is set and the device is DeviceType.CUDA, and use it for every `"backend": ...` emission in JSON responses served to the Studio frontend. Internally we still represent ROCm hosts as DeviceType.CUDA (ROCm torch reuses the whole torch.cuda.* API surface), but the user-facing API now correctly reports "rocm" on AMD boxes instead of labeling them as "cuda". 
All 250 simulation scenarios pass (was 233 before this batch: added 17 new regression tests covering the version pin, %2B decoding, bnb force-reinstall flags, dpkg/rpm fallback presence, and the _backend_label helper's four-way truth table). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix gemini round 6 + URL audit: amd.py defensive checks, rocm6.5+ clip to 6.4 Two rounds of fixes in one commit, plus a full URL audit of every PyPI / download.pytorch.org / repo.radeon.com reference the PR introduces. amd.py (4 medium gemini findings on commit b3627bc2): 1. _extract_gpu_metrics used `and vram_total_mb` as part of the vram_util gate. The follow-up `vram_total_mb > 0` already handles the division guard, but the truthiness check was redundant and slightly surprising for a 0.0 valid value. Replace with explicit `is not None and > 0` for both vram_util and power_util. 2. get_physical_gpu_count called `data.get("gpu", ...)` without guarding for non-dict envelopes. A scalar / string JSON response from amd-smi would raise AttributeError. Add an isinstance(data, dict) check and return None for unexpected shapes. 3. get_visible_gpu_utilization had the same .get() exposure on the outer envelope. Rewrite the gpu_list extraction as an explicit list/dict/else cascade so a malformed scalar envelope produces gpu_list=[data] and continues without raising. 4. The same function's per-entry loop also called gpu_data.get() on whatever was inside gpu_list. If a scalar ever leaks into the list (directly or via the previous fix's fallback), _extract_gpu_metrics would raise on the first .get() inside the helper. Skip non-dict entries in the loop before extracting metrics. install.sh (URL audit finding, previously flagged by 20-reviewer as #13): 5. 
get_torch_index_url used `rocm6.*` in the rocm tag case statement, which matched rocm6.5 and rocm6.6 and emitted download.pytorch.org/whl/rocm6.5 -- which returns HTTP 403 because PyTorch only publishes rocm 5.7, 6.0-6.4, 7.0-7.2. Enumerate the supported 6.x minors explicitly and add a rocm6.* fallback branch that clips to rocm6.4 (the last supported 6.x wheel set). URL audit results (all URLs PR 4720 references): - 14/14 download.pytorch.org/whl/{cpu,cu118,cu124,cu126,cu128,cu130, rocm6.0..6.4,rocm7.0..7.2} return HTTP 200. - 9/9 repo.radeon.com/rocm/manylinux/rocm-rel-{5.7,6.0,6.1,6.2,6.3, 6.4,7.0,7.1,7.2}/ return HTTP 200. - X.Y.Z patch directories exist for 7.0.2, 7.1.1, 7.2.1 but NOT for 6.3.0, 6.4.0, 6.2.1 -- install.sh already handles this via the X.Y.Z -> X.Y fallback sed in the Radeon wheel install block. - Docs links (rocm.docs.amd.com, docs.unsloth.ai AMD guide) and the llama.cpp GitHub releases API endpoint all return 200. Test suite: 255 -> 258. New regression coverage: - U17: get_physical_gpu_count tolerates scalar amd-smi envelope - U18: get_visible_gpu_utilization tolerates scalar envelope - U19a-c: vram_util / power_util return None on zero total, but vram_total_gb still echoes 0.0 (not None) - A_rocm{6.5,6.6,6.9}_clips_to_rocm64: install.sh clips unsupported 6.x minors to rocm6.4 instead of producing a 403 index URL * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix reviewer.py round 2: tokenizer AMD multi-GPU, --no-torch bnb, main.py backend label Three high-confidence findings from a second 20-parallel reviewer.py run on commit 7effb3ae. Triaged 15 total findings and applied the three that were confirmed as real bugs; the rest were either false positives (e.g. "migrated AMD venv not repaired" -- _ensure_rocm_torch runs downstream via setup.sh regardless), design decisions (e.g. 
visibility mask env vars not consulted in installer detection), or edge cases the existing fallback logic already handles. 1. unsloth/tokenizer_utils.py [6/20]: the multi-GPU guard's shell probe runs `nvidia-smi --query-gpu=memory.used`, catches the failure, then only raises if `torch.cuda.is_available()` is False. On ROCm torch, torch.cuda.is_available() returns True (ROCm reuses the torch.cuda.* API), so the guard becomes dead code on AMD hosts and multi-GPU AMD setups slip through even though unsloth does not support them yet. Add a torch.cuda.device_count() > 1 fallback inside the except so AMD multi-visible-device setups are flagged consistently with the original CUDA memory check. 2. install.sh [1/20]: the fresh-install bitsandbytes block for AMD ROCm ran unconditionally when TORCH_INDEX_URL matched `*/rocm*`, even when SKIP_TORCH=true (from --no-torch or Intel Mac auto-detect). A user running `install.sh --no-torch` on an AMD host would still pull in bitsandbytes despite explicitly asking for GGUF-only mode. Wrap the case block in an outer `[ "$SKIP_TORCH" = false ]` guard. 3. studio/backend/main.py [3/20]: the /api/system endpoint returned `"device_backend": get_device().value`, which is "cuda" on ROCm hosts (because ROCm torch piggybacks on torch.cuda). Other endpoints (hardware.py) already use the _backend_label helper which swaps "cuda" -> "rocm" when IS_ROCM. Route /api/system through the same helper so the Studio UI reports the backend consistently across all endpoints. 4. studio/backend/tests/test_utils.py: update test_backend_matches_device to call _backend_label(get_device()) instead of raw get_device().value so the test matches the new contract and still passes on CUDA hosts. Tests: 258 -> 261. 
# 2b. AMD ROCm: if the host has an AMD GPU with ROCm but the venv received a
# non-ROCm torch build (CPU-only or CUDA, common when torch is resolved from
# PyPI), reinstall torch with HIP wheels. This must run immediately after the
# base packages are installed so that torch is present for inspection.
if not IS_WINDOWS and not IS_MACOS and not NO_TORCH:
_progress("ROCm torch check")
_ensure_rocm_torch()
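# Illustrative sketch (hypothetical helper, not called by the installer): one
# way to probe whether the venv's torch is a HIP build, in the spirit of the
# _ensure_rocm_torch() probe described in the ROCm commit history. The check
# runs in a subprocess so a hung or broken torch import cannot crash the
# installer, and a timeout or OSError is treated as "unknown" (None). The
# helper name is an assumption; the 30 s default mirrors the probe timeout
# mentioned in that commit history.
def _example_probe_torch_hip(python_exe: str, timeout: float = 30.0) -> "str | None":
    import subprocess

    # getattr() keeps the probe safe on old or custom torch builds that lack
    # torch.version.hip.
    code = "import torch; print(getattr(torch.version, 'hip', None) or '')"
    try:
        result = subprocess.run(
            [python_exe, "-c", code],
            stdout = subprocess.PIPE,
            stderr = subprocess.DEVNULL,
            text = True,
            timeout = timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode != 0:
        # torch is missing or failed to import.
        return None
    # Empty output means a CPU-only or CUDA build (no HIP version string).
    return result.stdout.strip() or None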
# Windows + AMD GPU: PyTorch does not publish ROCm wheels for Windows.
# Detect and warn so users know manual steps are needed for GPU training.
if IS_WINDOWS and not NO_TORCH and not _has_usable_nvidia_gpu():
# Validate actual AMD GPU presence (not just tool existence)
import re as _re_win
def _win_amd_smi_has_gpu(stdout: str) -> bool:
return bool(_re_win.search(r"(?im)^gpu\s*[:\[]\s*\d", stdout))
_win_amd_gpu = False
for _wcmd, _check_fn in (
(["hipinfo"], lambda out: "gcnarchname" in out.lower()),
(["amd-smi", "list"], _win_amd_smi_has_gpu),
):
_wexe = shutil.which(_wcmd[0])
if not _wexe:
continue
try:
_wr = subprocess.run(
[_wexe, *_wcmd[1:]],
stdout = subprocess.PIPE,
stderr = subprocess.DEVNULL,
text = True,
timeout = 10,
)
except Exception:
continue
if _wr.returncode == 0 and _check_fn(_wr.stdout):
_win_amd_gpu = True
break
if _win_amd_gpu:
_safe_print(
_dim(" Note:"),
"AMD GPU detected on Windows. ROCm-enabled PyTorch must be",
)
_safe_print(
" " * 8,
"installed manually. See: https://docs.unsloth.ai/get-started/install-and-update/amd",
)
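# Illustrative sketch (hypothetical helper, not called by the installer): the
# row pattern used by _win_amd_smi_has_gpu above, turned into a counter so its
# behaviour is easy to check. Per the commit history, newer `amd-smi list`
# prints rows like "GPU: 0" and older versions print "GPU[0] : Card series: ...";
# both count as GPU rows, while a header-style line that merely starts with
# "GPU" does not.
def _example_count_amd_smi_gpu_rows(stdout: str) -> int:
    import re

    # (?im): case-insensitive, and ^ matches at the start of every line.
    return len(re.findall(r"(?im)^gpu\s*[:\[]\s*\d", stdout))

# Usage (illustrative):
#   _example_count_amd_smi_gpu_rows("GPU: 0\nGPU: 1\n")             -> 2
#   _example_count_amd_smi_gpu_rows("GPU[0] : Card series: Radeon") -> 1
#   _example_count_amd_smi_gpu_rows("GPU  BDF  UUID\n")             -> 0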
# 3. Extra dependencies
_progress("unsloth extras")
pip_install(
"Installing additional unsloth dependencies",
"--no-cache-dir",
req = REQ_ROOT / "extras.txt",
)
# 3b. Extra dependencies installed with --no-deps (e.g. audio model support).
_progress("extra codecs")
pip_install(
"Installing extras (no-deps)",
"--no-deps",
"--no-cache-dir",
req = REQ_ROOT / "extras-no-deps.txt",
)
# 4. Overrides (torchao, transformers) -- force-reinstall
# Skip entirely when torch is unavailable (e.g. Intel Mac GGUF-only mode)
# because overrides.txt contains torchao which requires torch.
if NO_TORCH:
_progress("dependency overrides (skipped, no torch)")
else:
_progress("dependency overrides")
pip_install(
"Installing dependency overrides",
"--force-reinstall",
"--no-cache-dir",
req = REQ_ROOT / "overrides.txt",
)
# 5. Triton kernels (no-deps, from source)
# Skipped on Windows and macOS, where these kernels are not supported.
if not IS_WINDOWS and not IS_MACOS:
_progress("triton kernels")
pip_install(
"Installing triton kernels",
"--no-deps",
"--no-cache-dir",
req = REQ_ROOT / "triton-kernels.txt",
constrain = False,
)
# flash-attn: skipped on Windows, on macOS, and in no-torch mode.
if not IS_WINDOWS and not IS_MACOS and not NO_TORCH:
_progress("flash-attn")
_ensure_flash_attn()
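# Illustrative sketch (hypothetical helper, not called by the installer): the
# platform gating of the steps from 2b through flash-attn, expressed as a pure
# function that returns the _progress() labels each platform emits in this part
# of the install. This is handy for checking progress-bar totals, since a step
# that is skipped without a _progress() call must not be counted. Note that the
# overrides step still reports progress when skipped, and that triton kernels
# are gated only on the OS, not on NO_TORCH.
def _example_section_steps(is_windows: bool, is_macos: bool, no_torch: bool) -> list:
    # Neither Windows nor macOS (i.e. the Linux / WSL path).
    linux_like = not is_windows and not is_macos
    steps = []
    if linux_like and not no_torch:
        steps.append("ROCm torch check")
    steps += ["unsloth extras", "extra codecs"]
    steps.append(
        "dependency overrides (skipped, no torch)" if no_torch else "dependency overrides"
    )
    if linux_like:
        steps.append("triton kernels")
    if linux_like and not no_torch:
        steps.append("flash-attn")
    return steps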
# # 6. Patch: override llama_cpp.py with fix from unsloth-zoo feature/llama-cpp-windows-support branch
# patch_package_file(
# "unsloth-zoo",
# os.path.join("unsloth_zoo", "llama_cpp.py"),
# "https://raw.githubusercontent.com/unslothai/unsloth-zoo/refs/heads/main/unsloth_zoo/llama_cpp.py",
# )
# # 7a. Patch: override vision.py with fix from unsloth PR #4091
# patch_package_file(
# "unsloth",
# os.path.join("unsloth", "models", "vision.py"),
# "https://raw.githubusercontent.com/unslothai/unsloth/80e0108a684c882965a02a8ed851e3473c1145ab/unsloth/models/vision.py",
# )
# # 7b. Patch: override save.py with fix from feature/llama-cpp-windows-support
# patch_package_file(
# "unsloth",
# os.path.join("unsloth", "save.py"),
# "https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/unsloth/save.py",
# )
# 8. Studio dependencies
_progress("studio deps")
pip_install(
"Installing studio dependencies",
"--no-cache-dir",
req = REQ_ROOT / "studio.txt",
)
# 9. Data-designer dependencies
_progress("data designer deps")
pip_install(
"Installing data-designer base dependencies",
"--no-cache-dir",
req = SINGLE_ENV / "data-designer-deps.txt",
)
# 10. Data-designer packages (no-deps to avoid conflicts)
_progress("data designer")
pip_install(
"Installing data-designer",
"--no-cache-dir",
"--no-deps",
req = SINGLE_ENV / "data-designer.txt",
)
# 11. Local Data Designer seed plugin
if not LOCAL_DD_UNSTRUCTURED_PLUGIN.is_dir():
_safe_print(
_red(
f"❌ Missing local plugin directory: {LOCAL_DD_UNSTRUCTURED_PLUGIN}",
),
)
return 1
_progress("local plugin")
pip_install(
"Installing local data-designer unstructured plugin",
"--no-cache-dir",
"--no-deps",
str(LOCAL_DD_UNSTRUCTURED_PLUGIN),
constrain = False,
)
# 12. Patch metadata for single-env compatibility
_progress("finalizing")
run(
"Patching single-env metadata",
[sys.executable, str(SINGLE_ENV / "patch_metadata.py")],
)
"192 GiB") so _parse_memory_mb correctly applies the unit multiplier instead of treating the value as bare MB - Treat GB and GiB identically (both as binary x1024) since GPU tools including amd-smi use binary units even when labeling them "GB" - Fixes incorrect VRAM reporting on MI300-class cards (was showing ~0.19 GB instead of 192 GB for string-valued outputs) * Add --no-cache to uv for ROCm HIP source builds Avoid stale cache artifacts from partial HIP source builds when uv is used for causal-conv1d/mamba-ssm compilation on ROCm. The pip path already uses --no-cache-dir; this adds the uv equivalent (--no-cache) only when is_hip is True. * Fix critical: initialize _amd_gpu_radeon before case block _amd_gpu_radeon was only set inside the */rocm*) case arm, so on NVIDIA/CPU/macOS paths where TORCH_INDEX_URL does not contain "rocm", the variable was unbound. With set -u (nounset) enabled, this crashes the installer for every non-AMD user. Move initialization to before the case block so it is always defined. * Fix Windows AMD: route has_rocm hosts to HIP prebuilt path resolve_release_asset_choice was selecting windows-cpu for all Windows x86_64 hosts including those with has_rocm=True. Windows AMD users should fall through to resolve_upstream_asset_choice which tries the HIP prebuilt first. Add "not host.has_rocm" guard to the published windows-cpu selection. * Harden ROCm detection, Radeon wheel fallback, and HIP visibility Addresses review findings from parallel reviewers on PR #4720: - install.sh: add _has_usable_nvidia_gpu() helper requiring nvidia-smi -L to actually list a GPU before treating the host as NVIDIA. Fixes the stale-nvidia-smi-on-PATH regression where AMD-only hosts fell into the CUDA branch. - install.sh: fix hipconfig awk blocks to propagate a non-zero exit code when the output is not a recognisable version string, so the ||-chain continues to dpkg-query / rpm instead of terminating early. - install.sh: fail-closed on Radeon wheel fallback. 
When torch, torchvision or torchaudio is missing from the Radeon repo for the active Python tag, fall back to the standard ROCm index instead of silently mixing Radeon wheels with PyPI defaults. Quote all wheel arguments individually so wheel filenames cannot be word-split or glob-expanded. - install_llama_prebuilt.py: detect_host() now requires nvidia-smi -L to list a GPU before setting has_physical_nvidia. Routes AMD ROCm hosts with a broken leftover nvidia-smi to the ROCm path instead of misclassifying them as NVIDIA. - install_llama_prebuilt.py: scan upstream assets for any rocm-<version> prebuilt instead of hard-coding rocm-7.2, so ROCm 6.x / 7.0 / 7.1 / 7.3+ users pick up a matching upstream prebuilt when one exists. - install_llama_prebuilt.py: validate_server() adds --n-gpu-layers 1 for linux-rocm and windows-hip hosts, so new HIP prebuilts are preflighted on the GPU path instead of passing validation on CPU only. - install_llama_prebuilt.py: restore the published windows-cpu fallback for AMD Windows hosts without a HIP prebuilt so hash-approved bundles are still preferred over the raw upstream CPU asset. - install_python_stack.py: drop the /opt/rocm / hipcc gate in _ensure_rocm_torch() and rely on _has_rocm_gpu(). Runtime-only ROCm installs (package-managed minimal installs, Radeon software) that ship amd-smi / rocminfo without hipcc can now repair a CPU-only venv via "unsloth studio update". Adds an explicit IS_WINDOWS / IS_MACOS guard. - studio/backend/utils/hardware/amd.py: honour HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES in get_primary_gpu_utilization(). A process restricted to GPU 2 now reports metrics for GPU 2 instead of physical GPU 0. Tighten the plain bytes unit detection to an explicit allowlist. - studio/backend/utils/hardware/hardware.py: route get_backend_visible_gpu_info()'s backend_cuda_visible_devices field through a helper that reads HIP_VISIBLE_DEVICES on ROCm. 
Drop the unconditional "(rocm=False)" suffix in apply_gpu_ids() logs. * Fix round 2 regressions: ROCm validate_server and Windows HIP routing Follow-up to 810b833b addressing review findings on the first round of hardening commits: - install_llama_prebuilt.py validate_server: gate --n-gpu-layers on the resolved install_kind instead of host.has_rocm. AMD Windows hosts without a HIP prebuilt fall back to windows-cpu and must not be validated with GPU layers; thread install_kind through from the caller. - install_llama_prebuilt.py resolve_release_asset_choice: reinstate the "not has_rocm" guard on the published windows-cpu bundle so AMD Windows hosts reach resolve_upstream_asset_choice() where the new HIP prebuilt path lives. Prefer a published windows-hip bundle first when one exists, fall through to upstream HIP + upstream CPU otherwise. - install_llama_prebuilt.py detect_host: also set has_physical_nvidia when the secondary --query-gpu block confirms a working NVIDIA GPU, so older nvidia-smi versions without -L support do not silently skip the Linux diagnostics that key off has_physical_nvidia. - install_llama_prebuilt.py: drop redundant "import re as _re" / "import re as _re_rocm" local aliases in favour of the existing top-level "import re". - install_python_stack.py _ensure_rocm_torch: run the AMD bitsandbytes install unconditionally after the HIP-torch probe so "unsloth studio update" on venvs that already have ROCm torch still gains the AMD bitsandbytes build. - install.sh: add a non-x86_64 early-exit to get_torch_index_url() so aarch64 / arm64 Linux hosts do not hit the ROCm wheel index (PyTorch only publishes ROCm wheels for linux_x86_64). - install.sh: add bitsandbytes install to the migrated-environment branch so upgrades pick it up for ROCm hosts instead of only the fresh-install path. 
- install.sh: in the Radeon wheel path, pass version constraints + --no-index --find-links to uv instead of explicit wheel URLs so a version-compatible torch / torchvision / torchaudio triple is resolved, rather than picking the highest-version wheel for each package independently. - studio/backend/utils/hardware/amd.py _first_visible_amd_gpu_id: fall through to lower-priority visibility env vars when the first entry is malformed (leading comma, all-whitespace first token) instead of silently returning GPU 0. * Fix round 3 findings: x86_64 guard, ROCm version clip, Radeon deps Address issues surfaced by the round 3 reviewers on top of 8636fa63: - install_python_stack.py _ensure_rocm_torch: add the same `x86_64` guard that install.sh already has. Linux aarch64 / arm64 ROCm hosts must skip the repair path entirely; PyTorch only publishes ROCm wheels for linux_x86_64, and without this guard `unsloth studio update` aborts with a missing-wheel error on non x86_64 hosts. - install_llama_prebuilt.py resolve_upstream_asset_choice: add a best-effort _detect_host_rocm_version() helper (reading /opt/rocm/.info/version, amd-smi version, hipconfig --version) and filter rocm_candidates to entries whose major.minor is <= host version. Falls back to the newest candidate only when no compatible one exists, so a ROCm 6.4 host downloads rocm-6.4 instead of being handed the numerically newest rocm-7.2 bundle (which fails preflight and forces a source build). - install.sh: remove the round 2 --no-index switch from the Radeon wheel branch. --no-index forced uv to ignore PyPI entirely, which broke transitive dependency resolution (filelock, sympy, networkx, jinja2, fsspec, setuptools, typing-extensions, ...) on a fresh venv. Restore the round 1 explicit wheel URL invocation but add a torch / torchvision / torchaudio version-pair sanity check so a mismatched trio (e.g. 
torch 2.9.1 + torchvision 0.23.0 + torchaudio 2.9.0) falls back to the standard ROCm index instead of installing a broken combination. - install_python_stack.py _ensure_rocm_torch: restructure the "tag is None" path so it no longer short-circuits the bitsandbytes install. On a ROCm runtime older than anything in _ROCM_TORCH_INDEX, print the "no wheel" warning but still run the AMD bitsandbytes install. - studio/backend/core/training/worker.py: restore the pre-PR "no timeout" behaviour for non-HIP causal-conv1d / mamba-ssm source builds. The round 2 "timeout = 1800 if is_hip else 300" cap aborts slow non-HIP builds (Linux aarch64, unsupported torch/CUDA combos) after 5 minutes; omit timeout for the non-HIP branch so the cap only applies to ROCm source builds. * Fix round 4 findings: apply_gpu_ids env inheritance, Radeon X.Y, bitsandbytes gate Address remaining issues surfaced by the round 4 reviewers: - studio/backend/utils/hardware/hardware.py apply_gpu_ids: mirror the selection into HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES whenever the caller already had a ROCm visibility env var set, not only when IS_ROCM has already been set by detect_hardware(). Training and inference workers call apply_gpu_ids() before detect_hardware() runs, so the old guard would leave a forked ROCm worker with a stale HIP_VISIBLE_DEVICES mask that no longer matched the narrowed CUDA_VISIBLE_DEVICES selection. - install.sh get_radeon_wheel_url: accept X.Y ROCm versions in addition to X.Y.Z. The `/opt/rocm/.info/version` file and some hipconfig versions report only two components, and the Radeon repository publishes both rocm-rel-X.Y.Z/ and rocm-rel-X.Y/ directories, so treating X.Y as invalid caused Radeon hosts to fall back to the generic ROCm index even when a matching AMD wheel set existed. - install_python_stack.py _ensure_rocm_torch: only install the AMD bitsandbytes build when the venv actually has a ROCm-compatible torch (either already present or just installed by this function). 
Previously the bitsandbytes install ran unconditionally, which could leave an AMD bitsandbytes layered on top of a CPU/CUDA torch on hosts where the ROCm runtime is older than any entry in _ROCM_TORCH_INDEX. Also add --force-reinstall so an existing CPU/CUDA bitsandbytes is replaced by the AMD build during upgrades. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix gemini findings: amd-smi metric envelope validation and dict-wrapped GPU id Two medium-severity defensive fixes from the gemini-code-assist review on the AMD monitoring backend: 1. _extract_gpu_metrics may return a dict where every value is None when amd-smi succeeds (zero exit) but the JSON envelope contains no usable fields (error response, unsupported card). The new _has_real_metrics helper lets get_primary_gpu_utilization surface available:False and lets get_visible_gpu_utilization skip ghost device rows so the UI does not render placeholder cards with empty numbers. 2. Newer amd-smi versions wrap scalar fields as {"value": 0, "unit": "none"}, including the per-GPU id. The previous int(raw_id) call silently fell back to the enumeration index in that case, losing the real GPU id. Routing raw_id through the existing _parse_numeric helper handles bare ints, floats, strings, and the dict shape uniformly, with a debug log on parse failure. * Fix gemini round 2 findings: explicit length guard on ROCm version file parser Both _detect_rocm_version (install_python_stack.py) and _detect_host_rocm_version (install_llama_prebuilt.py) read /opt/rocm/.info/version or $ROCM_PATH/lib/rocm_version, split on "." and unconditionally accessed parts[1]. The surrounding broad `except Exception: pass` already swallowed the resulting IndexError, so a one-component file like "6\n" did fall through to the next detection source -- but the control flow relied on exception handling instead of an explicit check. 
Add `if len(parts) >= 2:` guards in both helpers so the loop falls through on its own without raising. Behaviour is unchanged for the common multi- component case; the previously-silent IndexError path becomes an explicit no-op. * Fix gemini round 3: include has_rocm in validate_server fallback path When validate_server is called without an explicit install_kind (older call sites that have not been updated), the fallback was only enabling --n-gpu-layers for NVIDIA and macOS arm64 hosts. AMD ROCm Linux hosts fell through to the CPU validation path even though the prebuilt being exercised was a HIP binary. Add host.has_rocm to the fallback expression so the GPU offload flag is applied consistently with the install_kind=='linux-rocm' / 'windows-hip' branches above. * Fix gemini round 4: remove risky bytes-vs-MB heuristic in _parse_memory_mb The previous heuristic divided any bare number above 10_000_000 by 1024*1024 on the assumption that large unit-less values were bytes. This misclassified small VRAM allocations: 5 MB of used VRAM reported as 5_242_880 bytes without a unit would be taken at face value and render as 5_242_880 MB (~5 TB) in the monitoring UI. Modern amd-smi always provides explicit units (MiB/GiB dict form), and legacy amd-smi returns bare numbers in MB -- the heuristic never had a real workload to handle. Drop it and default to MB for bare numeric input, keeping the existing unit-aware branches for dict / string inputs unchanged. The unrelated gemini suggestion to "default minor to 0" in the amd-smi version awk parser was intentionally NOT applied: rocm7.0 and rocm7.1 ship different wheel sets, so silently substituting 0 for a missing minor could install the wrong wheels. The existing reject-and-fall-through behaviour is safer. * Fix gemini round 5: POSIX compliance and leading-comma visibility parsing Three medium findings from gemini-code-assist addressed in this commit: 1. 
_pick_radeon_wheel used grep -o and sort -V, both GNU extensions that are not in POSIX and break on BSD/BusyBox coreutils. install.sh has a #!/bin/sh shebang so the whole pipeline was rewritten as a single awk script that extracts all href="..." hits on each line, filters to wheels matching the package prefix and python tag, and picks the newest version via zero-padded lexical comparison. No external sort or grep is needed. 2. _first_visible_amd_gpu_id in the AMD monitoring backend treated a leading comma (e.g. HIP_VISIBLE_DEVICES=",1") as "fall through to the next env var", which is surprising given the clear intent to narrow to device 1. Filter empty tokens after the split and return the first real one. An all-commas value ("," / ",,,") still falls through because no real tokens exist; the empty-string and "-1" explicit-zero cases are unchanged. The unrelated amd-smi version awk parser suggestion was not applied (see round 4 commit message for rationale: defaulting a missing minor to 0 could silently install the wrong ROCm wheel set). * Fix 20-reviewer.py findings: base drift, Radeon %2B, dpkg/rpm fallback, bnb, backend label Consolidated fix batch from a 20-parallel reviewer.py run on the current head. Each fix is drawn from a high-consensus finding and addresses a real bug or feature gap, not a stylistic preference. 1. install.sh: bump `unsloth>=2026.4.2` -> `unsloth>=2026.4.4` at five call sites so this branch no longer regresses main's version floor (main bumped to 2026.4.4 in #4876). Without this, merging 4720 would silently downgrade the minimum version pin for fresh installs. 2. install.sh: URL-decode Radeon wheel names before extracting the torch / torchvision / torchaudio version strings. 
Real wheel URLs from repo.radeon.com are percent-encoded ("torch-2.10.0%2Brocm7.2.0...") so the previous `[+-]` terminator in the sed regex never matched, `_torch_ver` stayed empty, `_radeon_versions_match` stayed false, and every Radeon consumer install silently fell back to the generic ROCm index. Now decode %2B -> + first, then extract, then validate. 3. install.sh: the two AMD bitsandbytes install lines were running `uv pip install "bitsandbytes>=0.49.1"` without `--force-reinstall`, so upgrades where the venv already has a CPU/CUDA bitsandbytes satisfying the constraint would keep the stale non-AMD wheel. Add `--force-reinstall --no-cache-dir` to both call sites, matching the pattern already used in install_python_stack.py::_ensure_rocm_torch. 4. install_python_stack.py and install_llama_prebuilt.py: add `dpkg-query -W rocm-core` and `rpm -q rocm-core` fallbacks to the Python-side ROCm version detectors so they match the chain in install.sh::get_torch_index_url. Package-managed ROCm installs (Debian/Ubuntu/RHEL/Fedora distro packages) can expose GPUs via rocminfo/amd-smi but still lack /opt/rocm/.info/version, hipconfig, or amd-smi `version` output -- without these fallbacks, `unsloth studio update` on such hosts returned None and skipped the ROCm torch repair. Also strip the dpkg epoch prefix ("1:6.3.0-1") before parsing so epoch-annotated packages parse correctly. 5. hardware.py: add a `_backend_label(device)` helper that returns "rocm" when IS_ROCM is set and the device is DeviceType.CUDA, and use it for every `"backend": ...` emission in JSON responses served to the Studio frontend. Internally we still represent ROCm hosts as DeviceType.CUDA (ROCm torch reuses the whole torch.cuda.* API surface), but the user-facing API now correctly reports "rocm" on AMD boxes instead of labeling them as "cuda". 
All 250 simulation scenarios pass (was 233 before this batch: added 17 new regression tests covering the version pin, %2B decoding, bnb force-reinstall flags, dpkg/rpm fallback presence, and the _backend_label helper's four-way truth table). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix gemini round 6 + URL audit: amd.py defensive checks, rocm6.5+ clip to 6.4 Two rounds of fixes in one commit, plus a full URL audit of every PyPI / download.pytorch.org / repo.radeon.com reference the PR introduces. amd.py (4 medium gemini findings on commit b3627bc2): 1. _extract_gpu_metrics used `and vram_total_mb` as part of the vram_util gate. The follow-up `vram_total_mb > 0` already handles the division guard, but the truthiness check was redundant and slightly surprising for a 0.0 valid value. Replace with explicit `is not None and > 0` for both vram_util and power_util. 2. get_physical_gpu_count called `data.get("gpu", ...)` without guarding for non-dict envelopes. A scalar / string JSON response from amd-smi would raise AttributeError. Add an isinstance(data, dict) check and return None for unexpected shapes. 3. get_visible_gpu_utilization had the same .get() exposure on the outer envelope. Rewrite the gpu_list extraction as an explicit list/dict/else cascade so a malformed scalar envelope produces gpu_list=[data] and continues without raising. 4. The same function's per-entry loop also called gpu_data.get() on whatever was inside gpu_list. If a scalar ever leaks into the list (directly or via the previous fix's fallback), _extract_gpu_metrics would raise on the first .get() inside the helper. Skip non-dict entries in the loop before extracting metrics. install.sh (URL audit finding, previously flagged by 20-reviewer as #13): 5. 
get_torch_index_url used `rocm6.*` in the rocm tag case statement, which matched rocm6.5 and rocm6.6 and emitted download.pytorch.org/whl/rocm6.5 -- which returns HTTP 403 because PyTorch only publishes rocm 5.7, 6.0-6.4, 7.0-7.2. Enumerate the supported 6.x minors explicitly and add a rocm6.* fallback branch that clips to rocm6.4 (the last supported 6.x wheel set). URL audit results (all URLs PR 4720 references): - 14/14 download.pytorch.org/whl/{cpu,cu118,cu124,cu126,cu128,cu130, rocm6.0..6.4,rocm7.0..7.2} return HTTP 200. - 9/9 repo.radeon.com/rocm/manylinux/rocm-rel-{5.7,6.0,6.1,6.2,6.3, 6.4,7.0,7.1,7.2}/ return HTTP 200. - X.Y.Z patch directories exist for 7.0.2, 7.1.1, 7.2.1 but NOT for 6.3.0, 6.4.0, 6.2.1 -- install.sh already handles this via the X.Y.Z -> X.Y fallback sed in the Radeon wheel install block. - Docs links (rocm.docs.amd.com, docs.unsloth.ai AMD guide) and the llama.cpp GitHub releases API endpoint all return 200. Test suite: 255 -> 258. New regression coverage: - U17: get_physical_gpu_count tolerates scalar amd-smi envelope - U18: get_visible_gpu_utilization tolerates scalar envelope - U19a-c: vram_util / power_util return None on zero total, but vram_total_gb still echoes 0.0 (not None) - A_rocm{6.5,6.6,6.9}_clips_to_rocm64: install.sh clips unsupported 6.x minors to rocm6.4 instead of producing a 403 index URL * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix reviewer.py round 2: tokenizer AMD multi-GPU, --no-torch bnb, main.py backend label Three high-confidence findings from a second 20-parallel reviewer.py run on commit 7effb3ae. Triaged 15 total findings and applied the three that were confirmed as real bugs; the rest were either false positives (e.g. "migrated AMD venv not repaired" -- _ensure_rocm_torch runs downstream via setup.sh regardless), design decisions (e.g. 
visibility mask env vars not consulted in installer detection), or edge cases the existing fallback logic already handles. 1. unsloth/tokenizer_utils.py [6/20]: the multi-GPU guard's shell probe runs `nvidia-smi --query-gpu=memory.used`, catches the failure, then only raises if `torch.cuda.is_available()` is False. On ROCm torch, torch.cuda.is_available() returns True (ROCm reuses the torch.cuda.* API), so the guard becomes dead code on AMD hosts and multi-GPU AMD setups slip through even though unsloth does not support them yet. Add a torch.cuda.device_count() > 1 fallback inside the except so AMD multi-visible-device setups are flagged consistently with the original CUDA memory check. 2. install.sh [1/20]: the fresh-install bitsandbytes block for AMD ROCm ran unconditionally when TORCH_INDEX_URL matched `*/rocm*`, even when SKIP_TORCH=true (from --no-torch or Intel Mac auto-detect). A user running `install.sh --no-torch` on an AMD host would still pull in bitsandbytes despite explicitly asking for GGUF-only mode. Wrap the case block in an outer `[ "$SKIP_TORCH" = false ]` guard. 3. studio/backend/main.py [3/20]: the /api/system endpoint returned `"device_backend": get_device().value`, which is "cuda" on ROCm hosts (because ROCm torch piggybacks on torch.cuda). Other endpoints (hardware.py) already use the _backend_label helper which swaps "cuda" -> "rocm" when IS_ROCM. Route /api/system through the same helper so the Studio UI reports the backend consistently across all endpoints. 4. studio/backend/tests/test_utils.py: update test_backend_matches_device to call _backend_label(get_device()) instead of raw get_device().value so the test matches the new contract and still passes on CUDA hosts. Tests: 258 -> 261. 
New regression coverage: - X08 main.py /api/system uses _backend_label - X09 tokenizer multi-GPU guard has device_count() fallback - X10 fresh-install bnb case block gated on SKIP_TORCH=false * fix: prevent bitsandbytes from overwriting ROCm torch with CUDA wheels During install, bitsandbytes was installed without --no-deps, causing uv to resolve torch from PyPI (CUDA build) and silently overwrite the ROCm wheels that were just installed in the previous step. This happened in three places: - install.sh: bitsandbytes install in both migrated and fresh paths - install_python_stack.py: bitsandbytes install inside _ensure_rocm_torch() Additionally, multiple install steps in install_python_stack.py (extras, overrides, studio deps) can pull in CUDA torch via transitive dependencies. A final _ensure_rocm_torch() call at the end of the install sequence ensures ROCm torch is always in place at runtime. All changes are gated behind ROCm-specific conditions and do not affect NVIDIA, CPU-only, macOS, or Windows install paths. Tested on AMD Instinct MI300X VF with ROCm 7.2.0 -- confirms torch==2.10.0+rocm7.1 with HIP 7.1.25424 after install. * fix: ROCm inference fallback -- skip Unsloth patching and bnb 4-bit on HIP On AMD ROCm (HIP), two issues prevent the normal Unsloth inference path: 1. Unsloth's global monkey-patching of transformers model classes (LlamaRotaryEmbedding, attention modules) triggers _assert_async_cuda_kernel crashes on HIP during generation. Training uses different code paths and works fine. 2. bitsandbytes 4-bit matmul kernels also trigger HIP assertion failures on MI300X (CDNA3 / gfx942), even without Unsloth patching. This commit adds a ROCm-specific inference fallback that: - Skips importing Unsloth at module level (prevents global patching) - Loads models in 16-bit with plain transformers + PEFT instead - Resolves pre-quantized model names (e.g. 
    # 13. AMD ROCm: final torch repair. Multiple install steps above can
    # pull in CUDA torch from PyPI (base packages, extras, overrides,
    # studio deps, etc.). Running the repair as the very last step
    # ensures ROCm torch is in place at runtime, regardless of which
    # intermediate step clobbered it.
    if not IS_WINDOWS and not IS_MACOS and not NO_TORCH:
        _progress("ROCm torch (final)")
        _ensure_rocm_torch()

    # 14. Final check (silent; third-party conflicts are expected)
    subprocess.run(
        [sys.executable, "-m", "pip", "check"],
        stdout = subprocess.DEVNULL,
        stderr = subprocess.DEVNULL,
    )

    _step(_LABEL, "installed")
    return 0


if __name__ == "__main__":
    sys.exit(install_python_stack())