Add AMD ROCm/HIP support across installer and hardware detection (#4720)

* Add ROCm detection to install.sh and expand shell tests

Add AMD ROCm GPU detection to get_torch_index_url() in install.sh.
When nvidia-smi is not found, probe for ROCm via amd-smi, /opt/rocm
version file, hipconfig, dpkg-query, and rpm.

Includes a validation guard for malformed _rocm_tag values, Debian epoch
prefix stripping, a cap mapping ROCm 7.2+ to the rocm7.1 index, the AMD
bitsandbytes install, and status messaging. Shell tests expanded to 23 cases.

Co-authored-by: Daniel Han <danielhanchen@gmail.com>

* Add ROCm torch reinstall support to install_python_stack.py

Add _detect_rocm_version() and _ensure_rocm_torch() to detect when a
Linux host has ROCm but the venv received CPU-only torch, and reinstall
with the correct ROCm wheels. Covers ROCm 6.0 through 7.1 with a
30-second timeout on the torch GPU probe subprocess.
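The probe-and-repair flow described here can be sketched roughly as follows. This is a minimal illustration; the helper names, the index-tag mapping, and the exact probe command are assumptions drawn from this message, not the actual implementation:

```python
import subprocess
import sys

# Assumed mapping of ROCm major.minor versions to PyTorch wheel index tags
# (covering 6.0 through 7.1 as stated above).
_ROCM_TORCH_INDEX = {
    "6.0": "rocm6.0", "6.1": "rocm6.1", "6.2": "rocm6.2",
    "6.3": "rocm6.3", "6.4": "rocm6.4", "7.0": "rocm7.0", "7.1": "rocm7.1",
}

def venv_torch_hip(python_exe: str, timeout: float = 30.0):
    """Probe the venv's torch for a HIP build; None if CPU-only or broken."""
    code = "import torch; print(getattr(torch.version, 'hip', None) or '')"
    try:
        out = subprocess.run(
            [python_exe, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return None  # a hung or broken torch import must not crash the installer
    # An import failure leaves stdout empty, which also maps to None.
    hip = out.stdout.strip()
    return hip or None

def rocm_index_tag(version: str):
    """Map a detected ROCm version to a wheel index tag, or None if unsupported."""
    major_minor = ".".join(version.split(".")[:2])
    return _ROCM_TORCH_INDEX.get(major_minor)
```

A CPU-only probe result combined with a non-None index tag is what would trigger the reinstall with ROCm wheels.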

Co-authored-by: Daniel Han <danielhanchen@gmail.com>

* Add ROCm support to llama.cpp prebuilt installer

Add has_rocm field to HostInfo, extend detect_host() to probe for ROCm
via hipcc/amd-smi/rocm-smi/ROCM_PATH, and route ROCm hosts to upstream
prebuilts (Linux ROCm 7.2 prebuilt with source fallback, Windows HIP
prebuilt with CPU fallback). Add linux-rocm and windows-hip install
kinds to runtime_patterns_for_choice().

Co-authored-by: Daniel Han <danielhanchen@gmail.com>

* Add IS_ROCM hardware flag and fix AMD error message

Add IS_ROCM flag to hardware.py detect_hardware() (set when
torch.version.hip is present, DeviceType stays CUDA). Export IS_ROCM
from __init__.py. Add "rocm" key to get_package_versions().

Replace "We do not support AMD" error in tokenizer_utils.py with a
helpful message pointing to ROCm installation docs.

Co-authored-by: Daniel Han <danielhanchen@gmail.com>

* Add comprehensive ROCm support test suite (68 tests)

Add tests/studio/install/test_rocm_support.py covering all ROCm code
paths across install_llama_prebuilt.py, install_python_stack.py,
hardware.py, tokenizer_utils.py, and install.sh. All tests use mocks
and run without AMD hardware.

Covers: asset selection (11), runtime patterns (5), HostInfo (4),
ROCm version detection (9), torch reinstall (9), index mapping (8),
hardware flag (8), tokenizer message (2), install.sh structure (10),
and live regression (1).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Harden ROCm support: probe error handling, version cap, validation

Address review findings from 8 independent reviewers:

- Wrap _ensure_rocm_torch() torch probe in try/except for
  TimeoutExpired and OSError so a hung or broken torch import does not
  crash the installer (8/8 reviewers flagged this)
- Add torch>=2.4,<2.11.0 version cap to the ROCm reinstall path to
  prevent installing unsupported torch 2.11.0 from the rocm7.1 index
- Use with-statement for file reads in _detect_rocm_version() to avoid
  resource leaks
- Handle ROCM_PATH="" correctly (use `or "/opt/rocm"` instead of
  default parameter to avoid relative path resolution)
- Strengthen shell validation guard from rocm[0-9] to rocm[1-9] to
  reject rocm0.x tags that would produce nonexistent PyTorch index URLs
- Switch the shell version cap from a blocklist to an allowlist (rocm6.*,
  rocm7.0*, and rocm7.1* pass through; everything else caps to rocm7.1)
  so future ROCm 10+ does not fall through to a nonexistent index
- Add sorted() to _ROCM_TORCH_INDEX lookup for defensive ordering
- Fix test_probe_timeout_handled: replace zero-assertion test with
  proper assertions verifying reinstall proceeds after timeout
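The validation guard and allowlist cap can be sketched in Python roughly as below. Assumptions: the function name is hypothetical, the shell logic is glob-based rather than regex-based, and the minimum-version guard from a later commit in this PR is folded in for completeness:

```python
import re

def cap_rocm_tag(tag: str):
    """Validate a rocmX.Y tag and cap unknown-future versions to rocm7.1."""
    m = re.fullmatch(r"rocm([1-9][0-9]*)\.([0-9]+)", tag)
    if m is None:
        return None            # guard: reject malformed or rocm0.x tags
    major, minor = int(m.group(1)), int(m.group(2))
    if major < 6:
        return None            # no PyTorch ROCm wheels exist below 6.0
    if (major, minor) <= (7, 1):
        return tag             # allowlist: known-good PyTorch index tags
    return "rocm7.1"           # cap anything newer to the newest known index
```

Numeric comparison (rather than a bare `rocm7.1*` glob) also avoids accidentally passing through a hypothetical rocm7.10.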

* Clean up rocm_paths list construction in detect_host()

Filter None from the ROCM_PATH env var lookup at list construction time
instead of relying on the inline `if p` guard in the any() call.

* Require actual AMD GPU presence before selecting ROCm paths

All 8 reviewers across 2 cycles independently flagged that ROCm
detection used toolkit/filesystem hints (hipcc, /opt/rocm, rocm-core)
as a proxy for GPU presence, which would misroute CPU-only or NVIDIA
hosts that happen to have ROCm tools installed.

Now all 3 detection points (install.sh, install_python_stack.py,
install_llama_prebuilt.py) probe for an actual AMD GPU before
entering the ROCm path:

- install.sh: check rocminfo for gfx* GPU names, or amd-smi list
  for device rows, before version detection
- install_python_stack.py: new _has_rocm_gpu() function probes
  rocminfo and amd-smi list before _ensure_rocm_torch() proceeds
- install_llama_prebuilt.py: detect_host() probes rocminfo/amd-smi
  list instead of just checking tool existence or directory paths

Also:
- Shell test mock amd-smi now handles "list" subcommand
- Python tests updated to mock _has_rocm_gpu where needed
- Added test_no_gpu_with_rocm_tools_skips to verify the new guard
- Test index lookups now use sorted() to match production code

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Harden hipconfig version parsing and torch probe compatibility

- Add a parts[1].isdigit() check in hipconfig version parsing to handle
  versions like "6.3-HIP" where the minor component carries a non-numeric
  suffix (take the digits before the "-" separator for the int() conversion)
- Use getattr() in torch probe subprocess to safely handle old or
  custom torch builds that may lack torch.version.hip/cuda attributes

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Strengthen AMD GPU detection and add NVIDIA precedence guard

- Change amd-smi list detection from any-non-empty-output to requiring
  "gpu" marker in output, matching the shell-side NR>1 check. Prevents
  false positives from header-only amd-smi list output.
- Add nvidia-smi check at the top of _ensure_rocm_torch() so mixed
  AMD+NVIDIA hosts preserve NVIDIA precedence (matching install.sh and
  install_llama_prebuilt.py behavior).
- Apply the same amd-smi marker fix to install_llama_prebuilt.py
  detect_host() for consistency.

* Add Windows-specific ROCm/HIP detection in detect_host()

The previous detect_host() ROCm check used rocminfo and amd-smi list
which are Linux-only tools. On Windows, has_rocm would always be False,
making the Windows HIP prebuilt path at line 1794 unreachable.

Now detect_host() uses platform-specific detection:
- Linux: rocminfo (check for gfx GPU names) or amd-smi list
- Windows: hipinfo.exe, amd-smi, or amdhip64.dll on PATH

This allows Windows AMD users to get the HIP prebuilt binary instead
of silently falling through to the CPU prebuilt.

* Add AMD ROCm gaps: Mamba/SSM source builds, GPU monitoring, Windows messaging, RDNA expansion

- worker.py: Add HIP detection to causal-conv1d/mamba-ssm probe, check
  for hipcc before ROCm source builds, improve status messages and error
  reporting, add timeout and uv support for the source build fallback
- amd.py: New AMD GPU monitoring module via amd-smi metric --json,
  mirroring nvidia.py structure (utilization, temperature, power, VRAM)
- hardware.py: Branch to amd.py when IS_ROCM is True for GPU utilization,
  visible GPU queries, and physical GPU count
- install_python_stack.py: Detect AMD GPUs on Windows and warn that
  ROCm-enabled PyTorch must be installed manually
- kernels/utils.py: Expand is_rdna() to cover RDNA2 (gfx1030-1032),
  RDNA3 (gfx1102-1103), RDNA3.5 (gfx1150-1152) alongside existing entries
- tests: Add 32 new tests covering all changes (95/95 pass)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Harden ROCm detection, fix VRAM heuristic, and expand RDNA2 coverage

- Windows ROCm detection: validate actual GPU presence via hipinfo/amd-smi
  output markers instead of just checking tool existence on PATH
- _ensure_rocm_torch: validate nvidia-smi actually reports a GPU before
  giving NVIDIA precedence (fixes AMD-only hosts with stale NVIDIA tools)
- amd.py _parse_numeric: handle dict-shaped metric objects from newer
  amd-smi versions ({"value": 10, "unit": "W"}) and strip MiB/GiB units
- amd.py VRAM heuristic: raise threshold from 100k to 10M to correctly
  handle MI300X (192 GB = 196608 MB) and other high-VRAM GPUs
- amd.py visible GPU: use AMD-reported GPU IDs instead of enumerate index
  so non-dense sets like CUDA_VISIBLE_DEVICES=1,3 report correctly
- install.sh: add ROCm <6.0 minimum version guard (no PyTorch wheels
  exist for older versions); fix rocm7.1* glob to not match rocm7.10+
- is_rdna: add gfx1033-1036 for RDNA2 mobile GPUs (RX 6600M etc.)
- worker.py: increase ROCm source build timeout from 600s to 1800s;
  fix success log message for ROCm source builds
- Tests: update mocks for _has_usable_nvidia_gpu, add RDNA2 target asserts

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add HIP_VISIBLE_DEVICES support, unit-aware VRAM parsing, Windows GPU validation

- hardware.py: check HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES on ROCm
  before falling back to CUDA_VISIBLE_DEVICES, so multi-GPU AMD setups with
  HIP-specific env vars report the correct visible device set
- amd.py: add _parse_memory_mb() that reads "unit" from dict-shaped amd-smi
  JSON (e.g. {"value": 192, "unit": "GiB"}) and converts to MB correctly;
  fixes MI300X VRAM misreported as 0.19 GB instead of 192 GB
- install_python_stack.py: Windows AMD warning now validates actual GPU
  presence via hipinfo/amd-smi output markers before printing
- install_llama_prebuilt.py: restore amdhip64.dll fallback for Windows HIP
  detection after tool-based checks, so Windows HIP installs without CLI
  tools on PATH are still detected
- hardware.py: fix IS_ROCM comment to accurately describe its role

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix HIP_VISIBLE_DEVICES empty-string handling in GPU visibility spec

Use explicit None checks instead of Python `or` operator when reading
HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES, so that an empty string
("") is correctly honored as "no visible GPUs" rather than silently
falling through to CUDA_VISIBLE_DEVICES on mixed ROCm+CUDA systems.
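The explicit None-check pattern described above can be sketched like this (the function name is hypothetical; only the env-var precedence and empty-string semantics come from this message):

```python
import os

def visible_gpu_spec(is_rocm: bool):
    """Read the GPU visibility spec with explicit None checks.

    On ROCm, HIP_VISIBLE_DEVICES takes precedence over ROCR_VISIBLE_DEVICES,
    and an empty string is honored as "no visible GPUs" rather than falling
    through to CUDA_VISIBLE_DEVICES.
    """
    if is_rocm:
        for var in ("HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES"):
            value = os.environ.get(var)
            if value is not None:  # "" is a real answer, not a fallthrough
                return value
    return os.environ.get("CUDA_VISIBLE_DEVICES")
```

With `value or fallback` instead, `HIP_VISIBLE_DEVICES=""` would silently pick up the CUDA spec, which is exactly the bug this commit fixes.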

* Fix IS_ROCM test assertion for multi-line formatting

* Cap torchvision/torchaudio versions, remove amdhip64.dll fallback, fix visible GPU count

- Cap torchvision<0.26.0 and torchaudio<2.11.0 alongside torch<2.11.0 in
  both install.sh and install_python_stack.py to prevent resolver from
  selecting incompatible companion packages from ROCm wheel index
- Remove amdhip64.dll fallback in Windows ROCm detection (DLL presence
  without hipinfo/amd-smi is not proof of GPU existence)
- Fix get_visible_gpu_count() to use _get_parent_visible_gpu_spec() which
  respects HIP_VISIBLE_DEVICES/ROCR_VISIBLE_DEVICES on ROCm hosts

* Attribute is_rdna() RDNA2/3/3.5/4 expansion to PR #4428

The is_rdna() expansion to cover RDNA2 (gfx1030-1036), RDNA3
(gfx1100-1103), RDNA3.5 (gfx1150-1152), and RDNA4 (gfx1200-1201)
architectures is based on the original work from PR #4428.
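A sketch of the expanded check, assuming a set-membership implementation; the gfx target ranges come from this message, but the function signature and the feature-suffix handling are assumptions:

```python
# Assumed RDNA gfx targets, built from the ranges named above.
_RDNA_TARGETS = {
    *(f"gfx10{n}" for n in range(30, 37)),  # RDNA2: gfx1030-gfx1036
    *(f"gfx110{n}" for n in range(4)),      # RDNA3: gfx1100-gfx1103
    *(f"gfx115{n}" for n in range(3)),      # RDNA3.5: gfx1150-gfx1152
    "gfx1200", "gfx1201",                   # RDNA4
}

def is_rdna(gfx_arch: str) -> bool:
    # Strip feature suffixes like "gfx1100:xnack-" before the lookup
    # (suffix handling is an assumption, not confirmed by this commit).
    return gfx_arch.split(":")[0] in _RDNA_TARGETS
```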

Co-authored-by: GoldenGrapeGentleman <yueyuan@amd.com>
Co-authored-by: billishyahao <bill.he@amd.com>

* Support AMD Radeon for studio (#4770)

Co-authored-by: Iswarya Alex <iswarya.alex@amd.com>

* Remove ROCm test files from main PR

Move test_rocm_support.py and shell test additions to a separate PR
to keep the main ROCm support PR focused on implementation changes.

* Fix installer and hardware detection issues for PR #4720

- Fix empty _tri_arg passed to uv pip install in Radeon path (causes
  "Empty field is not allowed for PEP508" error)
- Fix Radeon fallback: use ROCm index instead of CPU-only when
  repo.radeon.com is unreachable (TORCH_INDEX_URL already has ROCm)
- Use $TORCH_CONSTRAINT in fallback paths instead of hardcoded strings
- Fix _pick_radeon_wheel: relax suffix to match manylinux_2_28_x86_64
  wheels (AMD Radeon repo does not use bare linux_x86_64 platform tag)
- Fix IS_ROCM export: use __getattr__ so callers always see the live
  value after detect_hardware() runs
- Fix apply_gpu_ids: set HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES
  on ROCm so _get_parent_visible_gpu_spec picks up narrowed GPU set
- Fix _parse_memory_mb: distinguish GB (1000 MB) from GiB (1024 MiB)
- Add amd-smi version as a fallback in _detect_rocm_version
- Fix trailing whitespace and missing newline at EOF in install.sh

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix GPU detection false positives and add missing health groups

- Fix _has_rocm_gpu() false positive: require "GPU: <number>" data rows
  from amd-smi list, not just header containing "gpu"
- Apply same fix in detect_host() in install_llama_prebuilt.py
- Add runtime_payload_health_groups for linux-rocm and windows-hip so
  partial/corrupt ROCm/HIP prebuilt installs are properly detected
- Add bitsandbytes install to Radeon fallback paths (was only in the
  success path, skipped when repo.radeon.com was unreachable)
- Keep DEVICE/CHAT_ONLY as direct imports in __init__.py (matching main)
  and only use __getattr__ for IS_ROCM

* Fix _ensure_rocm_torch and Windows AMD warning false positives

- _ensure_rocm_torch: only skip when HIP is already present, not for
  CUDA builds (which are unusable on AMD-only hosts). Fixes the case
  where a venv has a stale CUDA wheel and the repair step is skipped.
- Windows AMD warning: use GPU data row check (same as Linux fix) to
  avoid false positives from amd-smi list header-only output.

* Fix amd-smi GPU detection for GPU[N] output format

Older amd-smi versions output "GPU[0] : Card series: ..." instead of
"GPU: 0". The regex now matches both "GPU: <digit>" and "GPU[<digit>"
formats to detect actual GPU data rows.
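A minimal sketch of the dual-format data-row check (function and pattern names are hypothetical; the two output shapes are the ones quoted above):

```python
import re

# Match real device rows: "GPU: 0 ..." (newer amd-smi) or
# "GPU[0] : Card series: ..." (older amd-smi), but not header text.
_GPU_ROW = re.compile(r"^GPU\s*[:\[]\s*\d", re.MULTILINE)

def amd_smi_lists_gpu(output: str) -> bool:
    """True only when amd-smi list output contains an actual GPU data row."""
    return bool(_GPU_ROW.search(output))
```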

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Harden AMD GPU detection against false positives

- install.sh: replace weak amd-smi list check (awk 'NR>1 && NF') with
  strict pattern matching GPU data rows (/^GPU[[:space:]]*[:\[]/)
- All files: reject rocminfo gfx000 (CPU HSA agent) by requiring
  gfx[1-9] instead of gfx[0-9] in the rocminfo GPU probe
- Fixes false positives on hosts with ROCm tools but no AMD GPU
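The gfx000 rejection can be sketched as a one-line filter (the function name is hypothetical; the gfx[1-9] requirement is the rule stated above):

```python
import re

def rocminfo_has_amd_gpu(rocminfo_output: str) -> bool:
    """gfx000 is the CPU HSA agent; require gfx followed by a nonzero digit."""
    return re.search(r"\bgfx[1-9][0-9]*", rocminfo_output) is not None
```

This still accepts real targets like gfx90a and gfx1100 while rejecting the CPU agent row.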

* Remove duplicate comment from pre-commit merge

* Refactor: deduplicate AMD detection, consolidate bitsandbytes, clean up imports

- Extract _has_amd_rocm_gpu() shell function to avoid duplicating the
  rocminfo/amd-smi GPU detection logic in get_torch_index_url and
  the Radeon auto-detect block
- Consolidate bitsandbytes install into a single case block after torch
  install (was duplicated 4 times across Radeon success/fallback paths)
- Move math and re imports to top of amd.py (were inline in functions)
- Add _smi_query() helper in hardware.py to centralize IS_ROCM backend
  selection for get_gpu_utilization and get_visible_gpu_utilization

Addresses Gemini code review suggestions.

* Fix VRAM parsing for string values and GB/GiB consistency

- Extract unit from string-valued VRAM fields (e.g. "192 GiB") so
  _parse_memory_mb correctly applies the unit multiplier instead of
  treating the value as bare MB
- Treat GB and GiB identically (both as binary x1024) since GPU tools
  including amd-smi use binary units even when labeling them "GB"
- Fixes incorrect VRAM reporting on MI300-class cards (was showing
  ~0.19 GB instead of 192 GB for string-valued outputs)

* Add --no-cache to uv for ROCm HIP source builds

Avoid stale cache artifacts from partial HIP source builds when
uv is used for causal-conv1d/mamba-ssm compilation on ROCm.
The pip path already uses --no-cache-dir; this adds the uv equivalent
(--no-cache) only when is_hip is True.

* Fix critical: initialize _amd_gpu_radeon before case block

_amd_gpu_radeon was only set inside the */rocm*) case arm, so on
NVIDIA/CPU/macOS paths where TORCH_INDEX_URL does not contain "rocm",
the variable was unbound. With set -u (nounset) enabled, this crashes
the installer for every non-AMD user.

Move initialization to before the case block so it is always defined.

* Fix Windows AMD: route has_rocm hosts to HIP prebuilt path

resolve_release_asset_choice was selecting windows-cpu for all Windows
x86_64 hosts including those with has_rocm=True. Windows AMD users
should fall through to resolve_upstream_asset_choice which tries the
HIP prebuilt first. Add "not host.has_rocm" guard to the published
windows-cpu selection.

* Harden ROCm detection, Radeon wheel fallback, and HIP visibility

Addresses review findings from parallel reviewers on PR #4720:

- install.sh: add _has_usable_nvidia_gpu() helper requiring nvidia-smi -L
  to actually list a GPU before treating the host as NVIDIA. Fixes the
  stale-nvidia-smi-on-PATH regression where AMD-only hosts fell into the
  CUDA branch.
- install.sh: fix hipconfig awk blocks to propagate a non-zero exit code
  when the output is not a recognisable version string, so the ||-chain
  continues to dpkg-query / rpm instead of terminating early.
- install.sh: fail-closed on Radeon wheel fallback. When torch,
  torchvision or torchaudio is missing from the Radeon repo for the
  active Python tag, fall back to the standard ROCm index instead of
  silently mixing Radeon wheels with PyPI defaults. Quote all wheel
  arguments individually so wheel filenames cannot be word-split or
  glob-expanded.
- install_llama_prebuilt.py: detect_host() now requires nvidia-smi -L to
  list a GPU before setting has_physical_nvidia. Routes AMD ROCm hosts
  with a broken leftover nvidia-smi to the ROCm path instead of
  misclassifying them as NVIDIA.
- install_llama_prebuilt.py: scan upstream assets for any rocm-<version>
  prebuilt instead of hard-coding rocm-7.2, so ROCm 6.x / 7.0 / 7.1 / 7.3+
  users pick up a matching upstream prebuilt when one exists.
- install_llama_prebuilt.py: validate_server() adds --n-gpu-layers 1 for
  linux-rocm and windows-hip hosts, so new HIP prebuilts are preflighted
  on the GPU path instead of passing validation on CPU only.
- install_llama_prebuilt.py: restore the published windows-cpu fallback
  for AMD Windows hosts without a HIP prebuilt so hash-approved bundles
  are still preferred over the raw upstream CPU asset.
- install_python_stack.py: drop the /opt/rocm / hipcc gate in
  _ensure_rocm_torch() and rely on _has_rocm_gpu(). Runtime-only ROCm
  installs (package-managed minimal installs, Radeon software) that ship
  amd-smi / rocminfo without hipcc can now repair a CPU-only venv via
  "unsloth studio update". Adds an explicit IS_WINDOWS / IS_MACOS guard.
- studio/backend/utils/hardware/amd.py: honour HIP_VISIBLE_DEVICES /
  ROCR_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES in
  get_primary_gpu_utilization(). A process restricted to GPU 2 now
  reports metrics for GPU 2 instead of physical GPU 0. Tighten the plain
  bytes unit detection to an explicit allowlist.
- studio/backend/utils/hardware/hardware.py: route
  get_backend_visible_gpu_info()'s backend_cuda_visible_devices field
  through a helper that reads HIP_VISIBLE_DEVICES on ROCm. Drop the
  unconditional "(rocm=False)" suffix in apply_gpu_ids() logs.

* Fix round 2 regressions: ROCm validate_server and Windows HIP routing

Follow-up to 810b833b addressing review findings on the first round of
hardening commits:

- install_llama_prebuilt.py validate_server: gate --n-gpu-layers on the
  resolved install_kind instead of host.has_rocm. AMD Windows hosts
  without a HIP prebuilt fall back to windows-cpu and must not be
  validated with GPU layers; thread install_kind through from the
  caller.
- install_llama_prebuilt.py resolve_release_asset_choice: reinstate the
  "not has_rocm" guard on the published windows-cpu bundle so AMD
  Windows hosts reach resolve_upstream_asset_choice() where the new
  HIP prebuilt path lives. Prefer a published windows-hip bundle first
  when one exists, fall through to upstream HIP + upstream CPU
  otherwise.
- install_llama_prebuilt.py detect_host: also set has_physical_nvidia
  when the secondary --query-gpu block confirms a working NVIDIA GPU,
  so older nvidia-smi versions without -L support do not silently skip
  the Linux diagnostics that key off has_physical_nvidia.
- install_llama_prebuilt.py: drop redundant "import re as _re" /
  "import re as _re_rocm" local aliases in favour of the existing
  top-level "import re".
- install_python_stack.py _ensure_rocm_torch: run the AMD
  bitsandbytes install unconditionally after the HIP-torch probe so
  "unsloth studio update" on venvs that already have ROCm torch still
  gains the AMD bitsandbytes build.
- install.sh: add a non-x86_64 early-exit to get_torch_index_url() so
  aarch64 / arm64 Linux hosts do not hit the ROCm wheel index
  (PyTorch only publishes ROCm wheels for linux_x86_64).
- install.sh: add bitsandbytes install to the migrated-environment
  branch so upgrades pick it up for ROCm hosts instead of only the
  fresh-install path.
- install.sh: in the Radeon wheel path, pass version constraints +
  --no-index --find-links to uv instead of explicit wheel URLs so a
  version-compatible torch / torchvision / torchaudio triple is
  resolved, rather than picking the highest-version wheel for each
  package independently.
- studio/backend/utils/hardware/amd.py _first_visible_amd_gpu_id: fall
  through to lower-priority visibility env vars when the first entry
  is malformed (leading comma, all-whitespace first token) instead of
  silently returning GPU 0.

* Fix round 3 findings: x86_64 guard, ROCm version clip, Radeon deps

Address issues surfaced by the round 3 reviewers on top of 8636fa63:

- install_python_stack.py _ensure_rocm_torch: add the same `x86_64`
  guard that install.sh already has. Linux aarch64 / arm64 ROCm hosts
  must skip the repair path entirely; PyTorch only publishes ROCm
  wheels for linux_x86_64, and without this guard
  `unsloth studio update` aborts with a missing-wheel error on non
  x86_64 hosts.
- install_llama_prebuilt.py resolve_upstream_asset_choice: add a
  best-effort _detect_host_rocm_version() helper (reading
  /opt/rocm/.info/version, amd-smi version, hipconfig --version) and
  filter rocm_candidates to entries whose major.minor is <= host
  version. Falls back to the newest candidate only when no compatible
  one exists, so a ROCm 6.4 host downloads rocm-6.4 instead of being
  handed the numerically newest rocm-7.2 bundle (which fails preflight
  and forces a source build).
- install.sh: remove the round 2 --no-index switch from the Radeon
  wheel branch. --no-index forced uv to ignore PyPI entirely, which
  broke transitive dependency resolution (filelock, sympy, networkx,
  jinja2, fsspec, setuptools, typing-extensions, ...) on a fresh venv.
  Restore the round 1 explicit wheel URL invocation but add a
  torch / torchvision / torchaudio version-pair sanity check so a
  mismatched trio (e.g. torch 2.9.1 + torchvision 0.23.0 + torchaudio
  2.9.0) falls back to the standard ROCm index instead of installing a
  broken combination.
- install_python_stack.py _ensure_rocm_torch: restructure the
  "tag is None" path so it no longer short-circuits the bitsandbytes
  install. On a ROCm runtime older than anything in
  _ROCM_TORCH_INDEX, print the "no wheel" warning but still run the
  AMD bitsandbytes install.
- studio/backend/core/training/worker.py: restore the pre-PR
  "no timeout" behaviour for non-HIP causal-conv1d / mamba-ssm source
  builds. The round 2 "timeout = 1800 if is_hip else 300" cap aborts
  slow non-HIP builds (Linux aarch64, unsupported torch/CUDA combos)
  after 5 minutes; omit timeout for the non-HIP branch so the cap
  only applies to ROCm source builds.
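The prebuilt-version clip described in the second item can be sketched as follows (the function name and bare-version inputs are assumptions; the real code scans rocm-<version> asset names):

```python
def pick_rocm_prebuilt(candidates, host_version):
    """Pick the newest prebuilt whose version is <= the host ROCm version.

    Falls back to the newest candidate overall only when no compatible
    one exists, so a ROCm 6.4 host gets rocm-6.4 rather than rocm-7.2.
    """
    def key(v):
        return tuple(int(x) for x in v.split("."))
    compatible = [c for c in candidates if key(c) <= key(host_version)]
    pool = compatible or candidates  # best-effort fallback: newest overall
    return max(pool, key=key)
```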

* Fix round 4 findings: apply_gpu_ids env inheritance, Radeon X.Y, bitsandbytes gate

Address remaining issues surfaced by the round 4 reviewers:

- studio/backend/utils/hardware/hardware.py apply_gpu_ids: mirror the
  selection into HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES whenever
  the caller already had a ROCm visibility env var set, not only when
  IS_ROCM has already been set by detect_hardware(). Training and
  inference workers call apply_gpu_ids() before detect_hardware()
  runs, so the old guard would leave a forked ROCm worker with a
  stale HIP_VISIBLE_DEVICES mask that no longer matched the
  narrowed CUDA_VISIBLE_DEVICES selection.
- install.sh get_radeon_wheel_url: accept X.Y ROCm versions in
  addition to X.Y.Z. The `/opt/rocm/.info/version` file and some
  hipconfig versions report only two components, and the Radeon
  repository publishes both rocm-rel-X.Y.Z/ and rocm-rel-X.Y/
  directories, so treating X.Y as invalid caused Radeon hosts to fall
  back to the generic ROCm index even when a matching AMD wheel set
  existed.
- install_python_stack.py _ensure_rocm_torch: only install the AMD
  bitsandbytes build when the venv actually has a ROCm-compatible
  torch (either already present or just installed by this function).
  Previously the bitsandbytes install ran unconditionally, which
  could leave an AMD bitsandbytes layered on top of a CPU/CUDA torch
  on hosts where the ROCm runtime is older than any entry in
  _ROCM_TORCH_INDEX. Also add --force-reinstall so an existing
  CPU/CUDA bitsandbytes is replaced by the AMD build during upgrades.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix gemini findings: amd-smi metric envelope validation and dict-wrapped GPU id

Two medium-severity defensive fixes from the gemini-code-assist review on
the AMD monitoring backend:

1. _extract_gpu_metrics may return a dict where every value is None when
   amd-smi succeeds (zero exit) but the JSON envelope contains no usable
   fields (error response, unsupported card). The new _has_real_metrics
   helper lets get_primary_gpu_utilization surface available:False and
   lets get_visible_gpu_utilization skip ghost device rows so the UI
   does not render placeholder cards with empty numbers.

2. Newer amd-smi versions wrap scalar fields as {"value": 0, "unit":
   "none"}, including the per-GPU id. The previous int(raw_id) call
   silently fell back to the enumeration index in that case, losing the
   real GPU id. Routing raw_id through the existing _parse_numeric
   helper handles bare ints, floats, strings, and the dict shape
   uniformly, with a debug log on parse failure.

* Fix gemini round 2 findings: explicit length guard on ROCm version file parser

Both _detect_rocm_version (install_python_stack.py) and
_detect_host_rocm_version (install_llama_prebuilt.py) read /opt/rocm/.info/version
or $ROCM_PATH/lib/rocm_version, split on "." and unconditionally accessed
parts[1]. The surrounding broad `except Exception: pass` already swallowed
the resulting IndexError, so a one-component file like "6\n" did fall
through to the next detection source -- but the control flow relied on
exception handling instead of an explicit check.

Add `if len(parts) >= 2:` guards in both helpers so the loop falls through
on its own without raising. Behaviour is unchanged for the common multi-
component case; the previously-silent IndexError path becomes an explicit
no-op.
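The guarded parser can be sketched like this (the function name is hypothetical; the length guard and the fall-through behaviour are as described above, with the "-" suffix handling from the earlier hipconfig hardening folded in):

```python
def parse_rocm_version_file(text: str):
    """Parse "major.minor" from a ROCm version file, or None to fall through."""
    parts = text.strip().split(".")
    if len(parts) >= 2:  # one-component files like "6\n" fall through cleanly
        major, minor = parts[0], parts[1].split("-")[0]
        if major.isdigit() and minor.isdigit():
            return f"{int(major)}.{int(minor)}"
    return None  # signal the caller to try the next detection source
```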

* Fix gemini round 3: include has_rocm in validate_server fallback path

When validate_server is called without an explicit install_kind (older
call sites that have not been updated), the fallback was only enabling
--n-gpu-layers for NVIDIA and macOS arm64 hosts. AMD ROCm Linux hosts
fell through to the CPU validation path even though the prebuilt being
exercised was a HIP binary.

Add host.has_rocm to the fallback expression so the GPU offload flag is
applied consistently with the install_kind=='linux-rocm' / 'windows-hip'
branches above.

* Fix gemini round 4: remove risky bytes-vs-MB heuristic in _parse_memory_mb

The previous heuristic divided any bare number above 10_000_000 by
1024*1024 on the assumption that large unit-less values were bytes.
This misclassified small VRAM allocations: 5 MB of used VRAM reported
as 5_242_880 bytes without a unit would be taken at face value and
render as 5_242_880 MB (~5 TB) in the monitoring UI.

Modern amd-smi always provides explicit units (MiB/GiB dict form),
and legacy amd-smi returns bare numbers in MB -- the heuristic never
had a real workload to handle. Drop it and default to MB for bare
numeric input, keeping the existing unit-aware branches for dict /
string inputs unchanged.
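The resulting parser behaviour can be sketched as follows. This is an assumed simplification of _parse_memory_mb after this commit: dict and string inputs carry their own units, bare numbers default to MB, and GB is treated as binary like GiB per the earlier GB/GiB-consistency fix:

```python
def parse_memory_mb(raw):
    """Unit-aware VRAM parsing; returns MB as float, or None if unparseable."""
    unit = None
    if isinstance(raw, dict):                 # {"value": 192, "unit": "GiB"}
        unit = raw.get("unit")
        raw = raw.get("value")
    if isinstance(raw, str):
        parts = raw.split()                   # "192 GiB" carries its unit
        if len(parts) == 2:
            raw, unit = parts
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    u = (unit or "MB").upper()
    if u in ("GB", "GIB"):                    # both treated as binary x1024
        return value * 1024.0
    if u in ("KB", "KIB"):
        return value / 1024.0
    return value                              # MB / MiB / bare number
```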

The unrelated gemini suggestion to "default minor to 0" in the
amd-smi version awk parser was intentionally NOT applied: rocm7.0
and rocm7.1 ship different wheel sets, so silently substituting 0
for a missing minor could install the wrong wheels. The existing
reject-and-fall-through behaviour is safer.

* Fix gemini round 5: POSIX compliance and leading-comma visibility parsing

Three medium findings from gemini-code-assist addressed in this commit:

1. _pick_radeon_wheel used grep -o and sort -V, both GNU extensions
   that are not in POSIX and break on BSD/BusyBox coreutils. install.sh
   has a #!/bin/sh shebang, so the whole pipeline was rewritten as a
   single awk script that extracts all href="..." hits on each line,
   filters to wheels matching the package prefix and python tag, and
   picks the newest version via zero-padded lexical comparison. No
   external sort or grep is needed.

2. _first_visible_amd_gpu_id in the AMD monitoring backend treated a
   leading comma (e.g. HIP_VISIBLE_DEVICES=",1") as "fall through to
   the next env var", which is surprising given the clear intent to
   narrow to device 1. Filter empty tokens after the split and return
   the first real one. An all-commas value ("," / ",,,") still falls
   through because no real tokens exist; the empty-string and "-1"
   explicit-zero cases are unchanged.

The unrelated amd-smi version awk parser suggestion was not applied
(see round 4 commit message for rationale: defaulting a missing minor
to 0 could silently install the wrong ROCm wheel set).
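The token filtering from finding 2 can be sketched like this (the function name and return convention are assumptions; the ",1" / ",,," semantics are as described above):

```python
def first_visible_gpu_id(spec: str):
    """First real device token from a visibility spec, or None to fall through.

    ",1" narrows to device 1; an all-commas value like ",,," has no real
    tokens and signals the caller to try the next env var.
    """
    tokens = [t.strip() for t in spec.split(",")]
    real = [t for t in tokens if t]   # drop empty tokens from leading commas
    if not real:
        return None
    first = real[0]
    return int(first) if first.lstrip("-").isdigit() else None
```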

* Fix 20-reviewer.py findings: base drift, Radeon %2B, dpkg/rpm fallback, bnb, backend label

Consolidated fix batch from a 20-parallel reviewer.py run on the current
head. Each fix is drawn from a high-consensus finding and addresses a
real bug or feature gap, not a stylistic preference.

1. install.sh: bump `unsloth>=2026.4.2` -> `unsloth>=2026.4.4` at five
   call sites so this branch no longer regresses main's version floor
   (main bumped to 2026.4.4 in #4876). Without this, merging 4720 would
   silently downgrade the minimum version pin for fresh installs.

2. install.sh: URL-decode Radeon wheel names before extracting the
   torch / torchvision / torchaudio version strings. Real wheel URLs
   from repo.radeon.com are percent-encoded ("torch-2.10.0%2Brocm7.2.0...")
   so the previous `[+-]` terminator in the sed regex never matched,
   `_torch_ver` stayed empty, `_radeon_versions_match` stayed false,
   and every Radeon consumer install silently fell back to the generic
   ROCm index. Now decode %2B -> + first, then extract, then validate.

3. install.sh: the two AMD bitsandbytes install lines were running
   `uv pip install "bitsandbytes>=0.49.1"` without `--force-reinstall`,
   so upgrades where the venv already has a CPU/CUDA bitsandbytes
   satisfying the constraint would keep the stale non-AMD wheel. Add
   `--force-reinstall --no-cache-dir` to both call sites, matching the
   pattern already used in install_python_stack.py::_ensure_rocm_torch.

4. install_python_stack.py and install_llama_prebuilt.py: add
   `dpkg-query -W rocm-core` and `rpm -q rocm-core` fallbacks to the
   Python-side ROCm version detectors so they match the chain in
   install.sh::get_torch_index_url. Package-managed ROCm installs
   (Debian/Ubuntu/RHEL/Fedora distro packages) can expose GPUs via
   rocminfo/amd-smi but still lack /opt/rocm/.info/version, hipconfig,
   or amd-smi `version` output -- without these fallbacks, `unsloth
   studio update` on such hosts returned None and skipped the ROCm
   torch repair. Also strip the dpkg epoch prefix ("1:6.3.0-1") before
   parsing so epoch-annotated packages parse correctly.

5. hardware.py: add a `_backend_label(device)` helper that returns
   "rocm" when IS_ROCM is set and the device is DeviceType.CUDA, and
   use it for every `"backend": ...` emission in JSON responses served
   to the Studio frontend. Internally we still represent ROCm hosts as
   DeviceType.CUDA (ROCm torch reuses the whole torch.cuda.* API
   surface), but the user-facing API now correctly reports "rocm" on
   AMD boxes instead of labeling them as "cuda".
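
As a sketch, the helper's truth table (a standalone approximation: the real
helper takes only the device and reads the module-level IS_ROCM flag rather
than accepting it as a parameter):

```python
def backend_label(device: str, is_rocm: bool) -> str:
    """Return the user-facing backend name. ROCm hosts stay DeviceType.CUDA
    internally (ROCm torch reuses torch.cuda.*); only the label is swapped."""
    if is_rocm and device == "cuda":
        return "rocm"
    return device
```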

All 250 simulation scenarios pass (was 233 before this batch: added 17
new regression tests covering the version pin, %2B decoding, bnb
force-reinstall flags, dpkg/rpm fallback presence, and the
_backend_label helper's four-way truth table).
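
The decode-then-extract step in fix 2 can be sketched in Python (function name
hypothetical; the real implementation is a sed pipeline in install.sh):

```python
import re
from urllib.parse import unquote

def torch_xy_from_wheel_href(href: str):
    """Decode %2B -> + first, then pull the X.Y public version out of a
    torch wheel filename such as torch-2.10.0%2Brocm7.2.0-cp312-....whl."""
    name = unquote(href.rsplit("/", 1)[-1])  # percent-decode the basename
    m = re.match(r"torch-(\d+\.\d+)(?:\.\d+)?[+-]", name)
    return m.group(1) if m else None
```

Applying the old `[+-]` terminator to the still-encoded name never matches,
which is exactly the empty `_torch_ver` failure mode described above.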

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix gemini round 6 + URL audit: amd.py defensive checks, rocm6.5+ clip to 6.4

Two rounds of fixes in one commit, plus a full URL audit of every PyPI /
download.pytorch.org / repo.radeon.com reference the PR introduces.

amd.py (4 medium gemini findings on commit b3627bc2):

1. _extract_gpu_metrics used `and vram_total_mb` as part of the vram_util
   gate. The follow-up `vram_total_mb > 0` already handles the division
   guard, but the truthiness check was redundant and slightly surprising
   for a valid 0.0 value. Replace with explicit `is not None and > 0`
   for both vram_util and power_util.

2. get_physical_gpu_count called `data.get("gpu", ...)` without guarding
   for non-dict envelopes. A scalar / string JSON response from amd-smi
   would raise AttributeError. Add an isinstance(data, dict) check and
   return None for unexpected shapes.

3. get_visible_gpu_utilization had the same .get() exposure on the outer
   envelope. Rewrite the gpu_list extraction as an explicit
   list/dict/else cascade so a malformed scalar envelope produces
   gpu_list=[data] and continues without raising.

4. The same function's per-entry loop also called gpu_data.get() on
   whatever was inside gpu_list. If a scalar ever leaks into the list
   (directly or via the previous fix's fallback), _extract_gpu_metrics
   would raise on the first .get() inside the helper. Skip non-dict
   entries in the loop before extracting metrics.
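
Fixes 2-4 amount to a normalization cascade along these lines (function name
hypothetical; "gpu" is the envelope key named in fix 2):

```python
def normalize_gpu_list(data):
    """Coerce an amd-smi JSON envelope into a list of per-GPU dicts,
    tolerating scalar, dict, and list shapes without raising."""
    if isinstance(data, dict):
        data = data.get("gpu", [data])
    if not isinstance(data, list):
        data = [data]            # scalar envelope -> single-entry list
    return [g for g in data if isinstance(g, dict)]  # skip scalar entries
```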

install.sh (URL audit finding, previously flagged by 20-reviewer as #13):

5. get_torch_index_url used `rocm6.*` in the rocm tag case statement,
   which matched rocm6.5 and rocm6.6 and emitted
   download.pytorch.org/whl/rocm6.5 -- which returns HTTP 403 because
   PyTorch only publishes rocm 5.7, 6.0-6.4, 7.0-7.2. Enumerate the
   supported 6.x minors explicitly and add a rocm6.* fallback branch
   that clips to rocm6.4 (the last supported 6.x wheel set).
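
A Python sketch of the resulting clipping rule (the real logic is the shell
case statement in get_torch_index_url):

```python
def pytorch_rocm_index_tag(tag: str) -> str:
    """Clip a detected rocmX.Y tag to one PyTorch actually publishes."""
    published = {"rocm6.0", "rocm6.1", "rocm6.2", "rocm6.3", "rocm6.4",
                 "rocm7.0", "rocm7.1"}
    if tag in published:
        return tag
    if tag.startswith("rocm6."):
        return "rocm6.4"  # 6.5+: no published wheels, clip to last 6.x
    return "rocm7.1"      # 7.2+: capped until the torch <2.11.0 bound is lifted
```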

URL audit results (all URLs PR 4720 references):
- 14/14 download.pytorch.org/whl/{cpu,cu118,cu124,cu126,cu128,cu130,
  rocm6.0..6.4,rocm7.0..7.2} return HTTP 200.
- 9/9 repo.radeon.com/rocm/manylinux/rocm-rel-{5.7,6.0,6.1,6.2,6.3,
  6.4,7.0,7.1,7.2}/ return HTTP 200.
- X.Y.Z patch directories exist for 7.0.2, 7.1.1, 7.2.1 but NOT for
  6.3.0, 6.4.0, 6.2.1 -- install.sh already handles this via the X.Y.Z
  -> X.Y fallback sed in the Radeon wheel install block.
- Docs links (rocm.docs.amd.com, docs.unsloth.ai AMD guide) and the
  llama.cpp GitHub releases API endpoint all return 200.

Test suite: 255 -> 258. New regression coverage:
- U17: get_physical_gpu_count tolerates scalar amd-smi envelope
- U18: get_visible_gpu_utilization tolerates scalar envelope
- U19a-c: vram_util / power_util return None on zero total, but
  vram_total_gb still echoes 0.0 (not None)
- A_rocm{6.5,6.6,6.9}_clips_to_rocm64: install.sh clips unsupported
  6.x minors to rocm6.4 instead of producing a 403 index URL

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix reviewer.py round 2: tokenizer AMD multi-GPU, --no-torch bnb, main.py backend label

Three high-confidence findings from a second 20-parallel reviewer.py run
on commit 7effb3ae. Triaged 15 total findings and applied the three that
were confirmed as real bugs; the rest were either false positives (e.g.
"migrated AMD venv not repaired" -- _ensure_rocm_torch runs downstream
via setup.sh regardless), design decisions (e.g. visibility mask env
vars not consulted in installer detection), or edge cases the existing
fallback logic already handles.

1. unsloth/tokenizer_utils.py [6/20]: the multi-GPU guard's shell probe
   runs `nvidia-smi --query-gpu=memory.used`, catches the failure, then
   only raises if `torch.cuda.is_available()` is False. On ROCm torch,
   torch.cuda.is_available() returns True (ROCm reuses the torch.cuda.*
   API), so the guard becomes dead code on AMD hosts and multi-GPU AMD
   setups slip through even though unsloth does not support them yet.
   Add a torch.cuda.device_count() > 1 fallback inside the except so
   AMD multi-visible-device setups are flagged consistently with the
   original CUDA memory check.

2. install.sh [1/20]: the fresh-install bitsandbytes block for AMD ROCm
   ran unconditionally when TORCH_INDEX_URL matched `*/rocm*`, even when
   SKIP_TORCH=true (from --no-torch or Intel Mac auto-detect). A user
   running `install.sh --no-torch` on an AMD host would still pull in
   bitsandbytes despite explicitly asking for GGUF-only mode. Wrap the
   case block in an outer `[ "$SKIP_TORCH" = false ]` guard.

3. studio/backend/main.py [3/20]: the /api/system endpoint returned
   `"device_backend": get_device().value`, which is "cuda" on ROCm
   hosts (because ROCm torch piggybacks on torch.cuda). Other endpoints
   (hardware.py) already use the _backend_label helper which swaps
   "cuda" -> "rocm" when IS_ROCM. Route /api/system through the same
   helper so the Studio UI reports the backend consistently across all
   endpoints.

4. studio/backend/tests/test_utils.py: update test_backend_matches_device
   to call _backend_label(get_device()) instead of raw get_device().value
   so the test matches the new contract and still passes on CUDA hosts.
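
The guard logic from fix 1 can be sketched with the probes injected as
callables (a hypothetical standalone form; the real code shells out to
nvidia-smi and consults torch.cuda directly):

```python
def visible_gpu_count(probe_nvidia_smi, cuda_available, cuda_device_count):
    """Count visible GPUs, preferring the nvidia-smi probe but falling back
    to torch's device count when the probe fails on ROCm hosts."""
    try:
        return probe_nvidia_smi()
    except OSError:
        # On ROCm, cuda_available() is True even though nvidia-smi is
        # missing, so consult device_count() instead of assuming zero.
        return cuda_device_count() if cuda_available() else 0
```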

Tests: 258 -> 261. New regression coverage:
- X08 main.py /api/system uses _backend_label
- X09 tokenizer multi-GPU guard has device_count() fallback
- X10 fresh-install bnb case block gated on SKIP_TORCH=false

* fix: prevent bitsandbytes from overwriting ROCm torch with CUDA wheels

During install, bitsandbytes was installed without --no-deps, causing
uv to resolve torch from PyPI (CUDA build) and silently overwrite the
ROCm wheels that were just installed in the previous step.

This happened in three places:
- install.sh: bitsandbytes install in both migrated and fresh paths
- install_python_stack.py: bitsandbytes install inside _ensure_rocm_torch()

Additionally, multiple install steps in install_python_stack.py (extras,
overrides, studio deps) can pull in CUDA torch via transitive
dependencies. A final _ensure_rocm_torch() call at the end of the
install sequence ensures ROCm torch is always in place at runtime.
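
The "did something overwrite ROCm torch?" probe can be sketched like this
(helper name hypothetical; it mirrors the torch.version.hip check the
installer runs in a subprocess against the venv interpreter):

```python
import subprocess

def venv_torch_hip(venv_python: str) -> str:
    """Return torch.version.hip from the given interpreter, or "" when
    torch is missing, is a CPU/CUDA build, or the probe fails."""
    try:
        proc = subprocess.run(
            [venv_python, "-c",
             "import torch; print(getattr(torch.version, 'hip', '') or '')"],
            capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return ""
    return proc.stdout.strip() if proc.returncode == 0 else ""
```

An empty result on a ROCm host is the trigger for the forced torch reinstall
from the ROCm index.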

All changes are gated behind ROCm-specific conditions and do not affect
NVIDIA, CPU-only, macOS, or Windows install paths.

Tested on AMD Instinct MI300X VF with ROCm 7.2.0 -- confirms
torch==2.10.0+rocm7.1 with HIP 7.1.25424 after install.

* fix: ROCm inference fallback -- skip Unsloth patching and bnb 4-bit on HIP

On AMD ROCm (HIP), two issues prevent the normal Unsloth inference path:

1. Unsloth's global monkey-patching of transformers model classes
   (LlamaRotaryEmbedding, attention modules) triggers
   _assert_async_cuda_kernel crashes on HIP during generation.
   Training uses different code paths and works fine.

2. bitsandbytes 4-bit matmul kernels also trigger HIP assertion
   failures on MI300X (CDNA3 / gfx942), even without Unsloth patching.

This commit adds a ROCm-specific inference fallback that:
- Skips importing Unsloth at module level (prevents global patching)
- Loads models in 16-bit with plain transformers + PEFT instead
- Resolves pre-quantized model names (e.g. "xxx-bnb-4bit" -> "xxx")
  since pre-quantized HF repos still trigger bnb codepaths
- Guards get_chat_template calls (unavailable without Unsloth import)
- Fixes max_seq_length=0 being passed to from_pretrained (GGUF
  semantics don't apply to transformers path)
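
The pre-quantized name resolution can be sketched as a suffix strip (the
suffix list here is an assumption based on the "-bnb-4bit" example above):

```python
def resolve_unquantized_name(model_name: str) -> str:
    """Map a pre-quantized repo name to its 16-bit base model name."""
    # Check the longer suffix first so "-unsloth-bnb-4bit" is not
    # partially stripped by the shorter "-bnb-4bit".
    for suffix in ("-unsloth-bnb-4bit", "-bnb-4bit"):
        if model_name.endswith(suffix):
            return model_name[: -len(suffix)]
    return model_name
```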

The NVIDIA path is completely unchanged -- Unsloth import and
for_inference() optimization remain active. GGUF inference (via
llama-server/HIP) is unaffected since it never imports Python model
classes. AMD GPUs typically have large VRAM (e.g. 192GB on MI300X)
so 16-bit loading is practical for inference.

Tested on AMD Instinct MI300X VF (ROCm 7.2, HIP 7.1.25424):
- Simple generation: PASS
- Compare mode (base vs finetuned): PASS
- GGUF inference + tool calling: PASS (unaffected by this change)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: guard audio/vision inference on ROCm, remove unused import

- Add clear RuntimeError for audio/vision model inference on ROCm
  (these paths use Unsloth's FastModel/FastVisionModel which would
  crash on HIP; GGUF inference is the supported path on AMD)
- Remove unused `import os as _os` from the ROCm changes

* fix: amd-smi parsing for newer output format (gpu_data wrapper, mem_usage, temperature)

amd-smi on recent ROCm versions (7.x) wraps metric output in a
{"gpu_data": [...]} envelope instead of returning a raw list. This
caused get_primary_gpu_utilization() and get_visible_gpu_utilization()
to fail silently (returning available=False) because the GPU data
dict was never unwrapped.

Additionally:
- VRAM data moved from "vram" to "mem_usage" with "total_vram" /
  "used_vram" keys. Added fallback key lookup.
- Temperature "edge" sensor returns "N/A" on MI300X VF; the previous
  dict.get() chain returned the "N/A" string instead of falling
  through to "hotspot". Changed to a loop that checks each key until
  a parseable value is found.
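
The sensor fallback behaves like this sketch (helper name and the "junction"
key are assumptions; "edge" and "hotspot" are from the commit):

```python
def first_parseable_temp(metrics: dict, keys=("edge", "hotspot", "junction")):
    """Return the first temperature value that parses as a number,
    skipping "N/A" strings instead of returning them verbatim."""
    for key in keys:
        value = metrics.get(key)
        try:
            return float(value)
        except (TypeError, ValueError):
            continue  # missing key or "N/A" -> try the next sensor
    return None
```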

Tested on AMD Instinct MI300X VF (ROCm 7.2, amd-smi 24.x):
- GPU utilization: 0% (idle), up to 100% during training
- Temperature: 40-44C (from hotspot sensor)
- VRAM: 0.28/191.69 GB (idle)
- Power: 158-211W draw

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Bug fix detecting radeon (#4940)

* Bug fix detecting radeon

* Expanding GPU target for gfx1100*

* Generalize gfx family-prefix filter to cover gfx10/gfx12 as well

rocminfo on ROCm 6.1+ emits LLVM generic-family ISA lines alongside the
specific GPU (e.g. gfx11-generic next to gfx1100). The outer grep captures
the bare family prefix from the generic line, and passing that to
-DGPU_TARGETS breaks the HIP build because clang only accepts specific
gfxNNN ids.

The previous filter only special-cased gfx11. Generalize it so any bare
2-digit family prefix (gfx10, gfx11, gfx12, ...) is dropped whenever a
specific sibling target is present in the same list. No real AMD GPU has
a 2-digit gfx id, so the filter can only ever drop family prefixes and
never a real target.

Covers the existing gfx11 cases unchanged, and extends the same fix to
gfx10-1-generic / gfx10-3-generic (RDNA1/2) and gfx12-generic (RDNA4),
which would otherwise hit the same build failure on newer rocminfo.
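
In Python terms the generalized filter is roughly (a sketch; the real
implementation is a shell pipeline over rocminfo output):

```python
import re

def filter_gpu_targets(targets):
    """Drop bare 2-digit family prefixes (gfx10, gfx11, gfx12, ...) when a
    specific sibling target (gfx1100, gfx1201, ...) is present in the same
    list; no real AMD GPU has a 2-digit gfx id, so only family prefixes
    can ever be dropped."""
    bare = re.compile(r"gfx\d{2}")
    kept = []
    for t in targets:
        if bare.fullmatch(t) and any(
            s != t and s.startswith(t) for s in targets
        ):
            continue  # family prefix with a specific sibling -> drop
        kept.append(t)
    return kept
```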

---------

Co-authored-by: Iswarya Alex <iswarya.alex@amd.com>
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>

---------

Co-authored-by: Eda Z <eda.zhou@amd.com>
Co-authored-by: GoldenGrapeGentleman <yueyuan@amd.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: billishyahao <bill.he@amd.com>
Co-authored-by: Iswarya Alex <47045679+iswaryaalex@users.noreply.github.com>
Co-authored-by: Iswarya Alex <iswarya.alex@amd.com>
Co-authored-by: Daniel Han <danielhanchen@users.noreply.github.com>
Daniel Han 2026-04-10 01:56:12 -07:00 committed by GitHub
parent 33503ea248
commit cad8c6ad05
13 changed files with 1876 additions and 117 deletions


@@ -978,6 +978,37 @@ _find_no_torch_runtime() {
fi
}
# ── AMD ROCm GPU detection helper ──
# Returns 0 (true) if an actual AMD GPU is present, 1 (false) otherwise.
# Checks rocminfo for gfx[1-9]* (excludes gfx000 CPU agent) and
# amd-smi list for GPU data rows (excludes header-only output).
_has_amd_rocm_gpu() {
if command -v rocminfo >/dev/null 2>&1 && \
rocminfo 2>/dev/null | awk '/Name:[[:space:]]*gfx[0-9]/ && !/Name:[[:space:]]*gfx000/{found=1} END{exit !found}'; then
return 0
elif command -v amd-smi >/dev/null 2>&1 && \
amd-smi list 2>/dev/null | awk '/^GPU[[:space:]]*[:\[][[:space:]]*[0-9]/{ found=1 } END{ exit !found }'; then
return 0
fi
return 1
}
# ── NVIDIA usable-GPU helper ──
# Returns 0 (true) only if nvidia-smi is present AND actually lists a GPU.
# Prevents AMD-only hosts with a stale nvidia-smi on PATH from being routed
# into the CUDA branch.
_has_usable_nvidia_gpu() {
_nvsmi=""
if command -v nvidia-smi >/dev/null 2>&1; then
_nvsmi="nvidia-smi"
elif [ -x "/usr/bin/nvidia-smi" ]; then
_nvsmi="/usr/bin/nvidia-smi"
else
return 1
fi
"$_nvsmi" -L 2>/dev/null | awk '/^GPU[[:space:]]+[0-9]+:/{found=1} END{exit !found}'
}
# ── Detect GPU and choose PyTorch index URL ──
# Mirrors Get-TorchIndexUrl in install.ps1.
# On CPU-only machines this returns the cpu index, avoiding the solver
@@ -986,14 +1017,82 @@ get_torch_index_url() {
_base="https://download.pytorch.org/whl"
# macOS: always CPU (no CUDA support)
case "$(uname -s)" in Darwin) echo "$_base/cpu"; return ;; esac
# Try nvidia-smi
# Try nvidia-smi -- require the binary to actually list a usable GPU.
# Presence of the binary alone (container leftovers, stale driver
# packages) is not sufficient: otherwise an AMD-only host would
# silently install CUDA wheels.
_smi=""
if command -v nvidia-smi >/dev/null 2>&1; then
_smi="nvidia-smi"
elif [ -x "/usr/bin/nvidia-smi" ]; then
_smi="/usr/bin/nvidia-smi"
if _has_usable_nvidia_gpu; then
if command -v nvidia-smi >/dev/null 2>&1; then
_smi="nvidia-smi"
elif [ -x "/usr/bin/nvidia-smi" ]; then
_smi="/usr/bin/nvidia-smi"
fi
fi
if [ -z "$_smi" ]; then
# No NVIDIA GPU -- check for AMD ROCm GPU.
# PyTorch only publishes ROCm wheels for linux-x86_64; skip the
# ROCm branch entirely on aarch64 / arm64 / other architectures
# so non-x86_64 Linux hosts fall back cleanly to CPU wheels.
case "$(uname -m)" in
x86_64|amd64) : ;;
*) echo "$_base/cpu"; return ;;
esac
if ! _has_amd_rocm_gpu; then
echo "$_base/cpu"; return
fi
# AMD GPU confirmed -- detect ROCm version
_rocm_tag=""
_rocm_tag=$({ command -v amd-smi >/dev/null 2>&1 && \
amd-smi version 2>/dev/null | awk -F'ROCm version: ' \
'NF>1{gsub(/[^0-9.]/, "", $2); split($2,a,"."); print "rocm"a[1]"."a[2]; ok=1; exit} END{exit !ok}'; } || \
{ [ -r /opt/rocm/.info/version ] && \
awk -F. '{print "rocm"$1"."$2; exit}' /opt/rocm/.info/version; } || \
{ command -v hipconfig >/dev/null 2>&1 && \
hipconfig --version 2>/dev/null | awk 'NR==1 && /^[0-9]/{split($1,a,"."); if(a[1]+0>0){print "rocm"a[1]"."a[2]; found=1}} END{exit !found}'; } || \
{ command -v dpkg-query >/dev/null 2>&1 && \
ver="$(dpkg-query -W -f='${Version}\n' rocm-core 2>/dev/null)" && \
[ -n "$ver" ] && \
printf '%s\n' "$ver" | sed 's/^[0-9]*://' | awk -F'[.-]' '{print "rocm"$1"."$2; exit}'; } || \
{ command -v rpm >/dev/null 2>&1 && \
ver="$(rpm -q --qf '%{VERSION}\n' rocm-core 2>/dev/null)" && \
[ -n "$ver" ] && \
printf '%s\n' "$ver" | awk -F'[.-]' '{print "rocm"$1"."$2; exit}'; }) 2>/dev/null
# Validate _rocm_tag: must match "rocmX.Y" with major >= 1
case "$_rocm_tag" in
rocm[1-9]*.[0-9]*) : ;; # valid (major >= 1)
*) _rocm_tag="" ;; # reject malformed (empty, garbled, or major=0)
esac
if [ -n "$_rocm_tag" ]; then
# Minimum supported: ROCm 6.0 (no PyTorch wheels exist for older)
case "$_rocm_tag" in
rocm[1-5].*) echo "$_base/cpu"; return ;;
esac
# ROCm 7.2 only has torch 2.11.0 which exceeds current bounds
# (<2.11.0). Fall back to rocm7.1 index which has torch 2.10.0.
# Enumerate explicit versions rather than matching rocm6.* so
# a host on ROCm 6.5 or 6.6 (no PyTorch wheels published) is
# clipped down to the last supported 6.x (rocm6.4) instead of
# constructing https://download.pytorch.org/whl/rocm6.5 which
# returns HTTP 403. PyTorch only ships: rocm5.7, 6.0, 6.1, 6.2,
# 6.3, 6.4, 7.0, 7.1, 7.2 (and 5.7 is below our minimum).
# TODO: uncomment rocm7.2 when the torch upper bound is bumped
# to >=2.11.0.
case "$_rocm_tag" in
rocm6.0|rocm6.0.*|rocm6.1|rocm6.1.*|rocm6.2|rocm6.2.*|rocm6.3|rocm6.3.*|rocm6.4|rocm6.4.*|rocm7.0|rocm7.0.*|rocm7.1|rocm7.1.*)
echo "$_base/$_rocm_tag" ;;
rocm6.*)
# ROCm 6.5+ (no published PyTorch wheels): clip down
# to the last supported 6.x wheel set.
echo "$_base/rocm6.4" ;;
*)
# ROCm 7.2+ (including future 10.x+): cap to rocm7.1
echo "$_base/rocm7.1" ;;
esac
return
fi
echo "$_base/cpu"; return
fi
if [ -z "$_smi" ]; then echo "$_base/cpu"; return; fi
# Parse CUDA version from nvidia-smi output (POSIX-safe, no grep -P)
_cuda_ver=$(LC_ALL=C $_smi 2>/dev/null \
| sed -n 's/.*CUDA Version:[[:space:]]*\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' \
@@ -1011,20 +1110,157 @@ get_torch_index_url() {
elif [ "$_major" -ge 11 ]; then echo "$_base/cu118"
else echo "$_base/cpu"; fi
}
get_radeon_wheel_url() {
# Only meaningful on Linux. Picks a repo.radeon.com base URL whose listing
# contains torch wheels. Tries paths like rocm-rel-7.2.1/, rocm-rel-7.2/,
# rocm-rel-7.1.1/, rocm-rel-7.1/ (AMD publishes both M.m and M.m.p dirs).
# Accepts both X.Y and X.Y.Z host versions since /opt/rocm/.info/version
# and hipconfig --version can return either shape.
case "$(uname -s)" in Linux) ;; *) echo ""; return ;; esac
# Detect ROCm version (X.Y or X.Y.Z) -- try amd-smi, then
# /opt/rocm/.info/version, then hipconfig.
_full_ver=""
_full_ver=$({ command -v amd-smi >/dev/null 2>&1 && \
amd-smi version 2>/dev/null | awk -F'ROCm version: ' \
'NF>1{if(match($2,/[0-9]+\.[0-9]+(\.[0-9]+)?/)){print substr($2,RSTART,RLENGTH); ok=1; exit}} END{exit !ok}'; } || \
{ [ -r /opt/rocm/.info/version ] && \
awk 'match($0,/[0-9]+\.[0-9]+(\.[0-9]+)?/){print substr($0,RSTART,RLENGTH); found=1; exit} END{exit !found}' /opt/rocm/.info/version; } || \
{ command -v hipconfig >/dev/null 2>&1 && \
hipconfig --version 2>/dev/null | awk 'NR==1 && match($0,/[0-9]+\.[0-9]+(\.[0-9]+)?/){print substr($0,RSTART,RLENGTH); found=1} END{exit !found}'; }) 2>/dev/null
# Validate: must be X.Y or X.Y.Z with X >= 1
case "$_full_ver" in
[1-9]*.[0-9]*.[0-9]*) : ;; # X.Y.Z
[1-9]*.[0-9]*) : ;; # X.Y
*) echo ""; return ;;
esac
echo "https://repo.radeon.com/rocm/manylinux/rocm-rel-${_full_ver}/"
}
# ── Radeon repo wheel selection helpers ──────────────────────────────────────
# Fetches the Radeon repo directory listing once into _RADEON_LISTING (global).
# _RADEON_PYTAG holds the CPython tag for the running interpreter (e.g. cp312).
# _RADEON_BASE_URL holds the base URL for relative-href resolution.
_RADEON_LISTING=""
_RADEON_PYTAG=""
_RADEON_BASE_URL=""
_radeon_fetch_listing() {
# Usage: _radeon_fetch_listing BASE_URL
# Populates _RADEON_LISTING, _RADEON_PYTAG, _RADEON_BASE_URL.
_RADEON_BASE_URL="$1"
_RADEON_PYTAG=$("$_VENV_PY" -c "
import sys
print('cp{}{}'.format(sys.version_info.major, sys.version_info.minor))
" 2>/dev/null) || return 1
if command -v curl >/dev/null 2>&1; then
_RADEON_LISTING=$(curl -fsSL --max-time 20 "$_RADEON_BASE_URL" 2>/dev/null)
elif command -v wget >/dev/null 2>&1; then
_RADEON_LISTING=$(wget -qO- --timeout=20 "$_RADEON_BASE_URL" 2>/dev/null)
fi
[ -n "$_RADEON_LISTING" ] || return 1
}
_pick_radeon_wheel() {
# Usage: _pick_radeon_wheel PACKAGE_NAME
# Scans $_RADEON_LISTING for the newest wheel whose filename starts exactly
# with PACKAGE_NAME- and matches _RADEON_PYTAG + linux_x86_64.
# Prints the full URL (resolving relative hrefs against _RADEON_BASE_URL).
#
# POSIX-compliant pipeline: all href parsing, filtering, and version
# selection is done inside a single awk script rather than reaching
# for GNU extensions (grep -o, sort -V) that would break under BSD
# or BusyBox coreutils.
_pkg="$1"
[ -n "$_RADEON_LISTING" ] || return 1
[ -n "$_RADEON_PYTAG" ] || return 1
_tag="$_RADEON_PYTAG"
_href=$(printf '%s\n' "$_RADEON_LISTING" \
| awk -v pkg="$_pkg" -v tag="$_tag" '
BEGIN { max_pad = ""; max_url = "" }
{
line = $0
while (match(line, /href="[^"]*"/)) {
# Strip the leading href=" (6 chars) and trailing " (1 char)
url = substr(line, RSTART + 6, RLENGTH - 7)
line = substr(line, RSTART + RLENGTH)
# Extract basename, strip query / fragment
n = split(url, p, "/")
base = p[n]
sub(/[?#].*/, "", base)
prefix = pkg "-"
# Match cpXY-cpXY or cpXY-abi3 with any linux x86_64
# platform tag (linux_x86_64, manylinux_2_28_x86_64,
# manylinux2014_x86_64, etc.)
if (substr(base, 1, length(prefix)) == prefix &&
index(base, "-" tag "-") > 0 &&
match(base, /x86_64\.whl$/)) {
# Extract the version component (first
# dotted-number run) and pad each piece so a
# plain lexical comparison gives us the newest.
if (match(base, /[0-9]+\.[0-9]+(\.[0-9]+)?/)) {
ver = substr(base, RSTART, RLENGTH)
m = split(ver, v, ".")
pad = ""
for (i = 1; i <= m; i++)
pad = pad sprintf("%08d", v[i])
if (pad > max_pad) {
max_pad = pad
max_url = url
}
}
}
}
}
END { if (max_url != "") print max_url }')
[ -z "$_href" ] && return 1
case "$_href" in
http*) printf '%s\n' "$_href" ;;
*) printf '%s\n' "${_RADEON_BASE_URL%/}/${_href#/}" ;;
esac
}
TORCH_INDEX_URL=$(get_torch_index_url)
# ── Auto-detect AMD Radeon GPU for the repo.radeon.com wheel path ──
# get_torch_index_url must have chosen a */rocm* index (gfx agent in
# rocminfo or amd-smi list); additionally require rocminfo to report
# "Marketing Name:.*Radeon".
_amd_gpu_radeon=false
case "$TORCH_INDEX_URL" in
*/rocm*)
if _has_amd_rocm_gpu && command -v rocminfo >/dev/null 2>&1 && \
rocminfo 2>/dev/null | grep -q 'Marketing Name:.*Radeon'; then
_amd_gpu_radeon=true
fi
;;
esac
# ── Print CPU-only hint when no GPU detected ──
case "$TORCH_INDEX_URL" in
*/cpu)
if [ "$SKIP_TORCH" = false ] && [ "$OS" != "macos" ]; then
echo ""
echo " NOTE: No NVIDIA GPU detected (nvidia-smi not found)."
echo " NOTE: No GPU detected (nvidia-smi and ROCm not found)."
echo " Installing CPU-only PyTorch. If you only need GGUF chat/inference,"
echo " re-run with --no-torch for a faster, lighter install:"
echo " curl -fsSL https://unsloth.ai/install.sh | sh -s -- --no-torch"
echo " AMD ROCm users: see https://docs.unsloth.ai/get-started/install-and-update/amd"
echo ""
fi
;;
*/rocm*)
echo ""
if [ "$_amd_gpu_radeon" = true ]; then
echo " AMD Radeon + ROCm detected -- installing PyTorch wheels from repo.radeon.com"
else
echo " AMD ROCm detected -- installing ROCm-enabled PyTorch ($TORCH_INDEX_URL)"
fi
echo ""
;;
esac
# ── Install unsloth directly into the venv (no activation needed) ──
@@ -1054,15 +1290,158 @@ if [ "$_MIGRATED" = true ]; then
substep "overlaying local repo (editable)..."
run_install_cmd "overlay local repo" uv pip install --python "$_VENV_PY" -e "$_REPO_ROOT" --no-deps
fi
# AMD ROCm: install bitsandbytes even in migrated environments so
# existing ROCm installs gain the AMD bitsandbytes build without a
# fresh reinstall.
if [ "$SKIP_TORCH" = false ]; then
case "$TORCH_INDEX_URL" in
*/rocm*)
substep "installing bitsandbytes for AMD ROCm..."
run_install_cmd "install bitsandbytes (AMD)" uv pip install --python "$_VENV_PY" --force-reinstall --no-cache-dir --no-deps "bitsandbytes>=0.49.1"
# Repair ROCm torch if overwritten during migrated install
_has_hip=$("$_VENV_PY" -c "import torch; print(getattr(torch.version,'hip','') or '')" 2>/dev/null || true)
if [ -z "$_has_hip" ]; then
substep "repairing ROCm torch (overwritten by dependency resolution)..."
run_install_cmd "repair ROCm torch" uv pip install --python "$_VENV_PY" \
"$TORCH_CONSTRAINT" torchvision torchaudio \
--index-url "$TORCH_INDEX_URL" \
--force-reinstall
fi
;;
esac
fi
elif [ -n "$TORCH_INDEX_URL" ]; then
# Fresh: Step 1 - install torch from explicit index (skip when --no-torch or Intel Mac)
if [ "$SKIP_TORCH" = true ]; then
substep "skipping PyTorch (--no-torch or Intel Mac x86_64)." "$C_WARN"
elif [ "$_amd_gpu_radeon" = true ]; then
_radeon_url=$(get_radeon_wheel_url)
if [ -n "$_radeon_url" ]; then
_radeon_listing_ok=false
if _radeon_fetch_listing "$_radeon_url" 2>/dev/null; then
_radeon_listing_ok=true
else
# Try shorter X.Y path (AMD publishes both X.Y.Z and X.Y dirs)
_radeon_url_short=$(printf '%s\n' "$_radeon_url" \
| sed 's|rocm-rel-\([0-9]*\)\.\([0-9]*\)\.[0-9]*/|rocm-rel-\1.\2/|')
if [ "$_radeon_url_short" != "$_radeon_url" ] && \
_radeon_fetch_listing "$_radeon_url_short" 2>/dev/null; then
_radeon_listing_ok=true
fi
fi
if [ "$_radeon_listing_ok" = true ]; then
# Require torch, torchvision, torchaudio wheels to all resolve
# from the Radeon listing. If any is missing for this Python
# tag, fall through to the standard ROCm index instead of
# silently mixing Radeon wheels with PyPI defaults.
_torch_whl=$(_pick_radeon_wheel "torch" 2>/dev/null) || _torch_whl=""
_tv_whl=$(_pick_radeon_wheel "torchvision" 2>/dev/null) || _tv_whl=""
_ta_whl=$(_pick_radeon_wheel "torchaudio" 2>/dev/null) || _ta_whl=""
_tri_whl=$(_pick_radeon_wheel "triton" 2>/dev/null) || _tri_whl=""
# Sanity-check torch / torchvision / torchaudio are a
# matching release. The Radeon repo publishes multiple
# generations simultaneously, so picking the highest-version
# wheel for each package independently can assemble a
# mismatched trio (e.g. torch 2.9.1 + torchvision 0.23.0 +
# torchaudio 2.9.0 from the current rocm-rel-7.2.1 index).
# Check that torch and torchaudio share the same X.Y public
# version prefix, and that torchvision's minor correctly
# pairs with torch's minor (torchvision minor = torch minor
# + 15: torch 2.4 -> torchvision 0.19, torch 2.9 ->
# torchvision 0.24).
# URL-decode each wheel name so %2B -> + before version
# extraction. Real Radeon wheel hrefs are percent-encoded
# (torch-2.10.0%2Brocm7.2.0...), so a plain [+-] terminator
# in the sed regex below would never match and
# _radeon_versions_match would stay false for every real
# listing, silently forcing a fallback to the generic
# ROCm index.
_torch_ver=""
_tv_ver=""
_ta_ver=""
if [ -n "$_torch_whl" ]; then
_torch_name=$(printf '%s' "${_torch_whl##*/}" | sed 's/%2[Bb]/+/g')
_torch_ver=$(printf '%s\n' "$_torch_name" | sed -n 's|^torch-\([0-9][0-9]*\.[0-9][0-9]*\)\(\.[0-9][0-9]*\)\{0,1\}[+-].*|\1|p')
fi
if [ -n "$_tv_whl" ]; then
_tv_name=$(printf '%s' "${_tv_whl##*/}" | sed 's/%2[Bb]/+/g')
_tv_ver=$(printf '%s\n' "$_tv_name" | sed -n 's|^torchvision-\([0-9][0-9]*\.[0-9][0-9]*\)\(\.[0-9][0-9]*\)\{0,1\}[+-].*|\1|p')
fi
if [ -n "$_ta_whl" ]; then
_ta_name=$(printf '%s' "${_ta_whl##*/}" | sed 's/%2[Bb]/+/g')
_ta_ver=$(printf '%s\n' "$_ta_name" | sed -n 's|^torchaudio-\([0-9][0-9]*\.[0-9][0-9]*\)\(\.[0-9][0-9]*\)\{0,1\}[+-].*|\1|p')
fi
_radeon_versions_match=false
if [ -n "$_torch_ver" ] && [ -n "$_tv_ver" ] && [ -n "$_ta_ver" ]; then
_torch_major=${_torch_ver%%.*}
_torch_minor=${_torch_ver#*.}
_ta_major=${_ta_ver%%.*}
_ta_minor=${_ta_ver#*.}
_tv_major=${_tv_ver%%.*}
_tv_minor=${_tv_ver#*.}
# torchvision expected minor (e.g. torch 2.9 -> 0.24)
_expected_tv_minor=$((_torch_minor + 15))
if [ "$_torch_major" = "$_ta_major" ] && \
[ "$_torch_minor" = "$_ta_minor" ] && \
[ "$_tv_major" = "0" ] && \
[ "$_tv_minor" = "$_expected_tv_minor" ]; then
_radeon_versions_match=true
fi
fi
if [ -z "$_torch_whl" ] || [ -z "$_tv_whl" ] || [ -z "$_ta_whl" ] || \
[ "$_radeon_versions_match" != true ]; then
substep "[WARN] Radeon repo lacks a compatible wheel set for this Python; falling back to ROCm index ($TORCH_INDEX_URL)" "$C_WARN"
run_install_cmd "install PyTorch" uv pip install --python "$_VENV_PY" \
"$TORCH_CONSTRAINT" torchvision torchaudio \
--index-url "$TORCH_INDEX_URL"
else
substep "installing PyTorch from Radeon repo (${_RADEON_BASE_URL})..."
# Pass explicit wheel URLs so the matched trio is
# installed together. --find-links lets uv discover
# the Radeon listing for any local lookup, and PyPI
# (not disabled) provides transitive deps like
# filelock / sympy / networkx which are not in the
# Radeon listing.
if [ -n "$_tri_whl" ]; then
run_install_cmd "install triton + PyTorch" uv pip install --python "$_VENV_PY" \
--find-links "$_RADEON_BASE_URL" \
"$_tri_whl" "$_torch_whl" "$_tv_whl" "$_ta_whl"
else
run_install_cmd "install PyTorch" uv pip install --python "$_VENV_PY" \
--find-links "$_RADEON_BASE_URL" \
"$_torch_whl" "$_tv_whl" "$_ta_whl"
fi
fi
else
substep "[WARN] Radeon repo unavailable; falling back to ROCm index ($TORCH_INDEX_URL)" "$C_WARN"
run_install_cmd "install PyTorch" uv pip install --python "$_VENV_PY" \
"$TORCH_CONSTRAINT" torchvision torchaudio \
--index-url "$TORCH_INDEX_URL"
fi
else
substep "[WARN] Radeon GPU detected but could not detect full ROCm version; falling back to ROCm index" "$C_WARN"
run_install_cmd "install PyTorch" uv pip install --python "$_VENV_PY" \
"$TORCH_CONSTRAINT" torchvision torchaudio \
--index-url "$TORCH_INDEX_URL"
fi
else
substep "installing PyTorch ($TORCH_INDEX_URL)..."
run_install_cmd "install PyTorch" uv pip install --python "$_VENV_PY" "$TORCH_CONSTRAINT" torchvision torchaudio \
--index-url "$TORCH_INDEX_URL"
fi
# AMD ROCm: install bitsandbytes (once, after torch, for all ROCm paths).
# Gate on SKIP_TORCH=false so a user running with --no-torch on a ROCm
# host stays in GGUF-only mode rather than pulling in bitsandbytes,
# which is only useful once torch is present for training.
if [ "$SKIP_TORCH" = false ]; then
case "$TORCH_INDEX_URL" in
*/rocm*)
substep "installing bitsandbytes for AMD ROCm..."
run_install_cmd "install bitsandbytes (AMD)" uv pip install --python "$_VENV_PY" --force-reinstall --no-cache-dir --no-deps "bitsandbytes>=0.49.1"
;;
esac
fi
# Fresh: Step 2 - install unsloth, preserving pre-installed torch
substep "installing unsloth (this may take a few minutes)..."
if [ "$SKIP_TORCH" = true ]; then
@@ -1088,6 +1467,22 @@ elif [ -n "$TORCH_INDEX_URL" ]; then
run_install_cmd "install unsloth" uv pip install --python "$_VENV_PY" \
--upgrade-package unsloth "$PACKAGE_NAME"
fi
# AMD ROCm: repair torch if the unsloth/unsloth-zoo install pulled in
# CUDA torch from PyPI, overwriting the ROCm wheels installed in Step 1.
if [ "$SKIP_TORCH" = false ]; then
case "$TORCH_INDEX_URL" in
*/rocm*)
_has_hip=$("$_VENV_PY" -c "import torch; print(getattr(torch.version,'hip','') or '')" 2>/dev/null || true)
if [ -z "$_has_hip" ]; then
substep "repairing ROCm torch (overwritten by dependency resolution)..."
run_install_cmd "repair ROCm torch" uv pip install --python "$_VENV_PY" \
"$TORCH_CONSTRAINT" torchvision torchaudio \
--index-url "$TORCH_INDEX_URL" \
--force-reinstall
fi
;;
esac
fi
else
# Fallback: GPU detection failed to produce a URL -- let uv resolve torch
substep "installing unsloth (this may take a few minutes)..."


@@ -5,8 +5,23 @@
Core inference backend - streamlined
"""
from unsloth import FastLanguageModel, FastVisionModel
from unsloth.chat_templates import get_chat_template
# On AMD ROCm, Unsloth's global monkey-patching of transformers model classes
# (LlamaRotaryEmbedding, attention modules, etc.) causes HIP kernel crashes
# (_assert_async_cuda_kernel -> HSA_STATUS_ERROR_EXCEPTION) during inference.
# Training works because it uses different code paths, but generation triggers
# the incompatible patched kernels. Skip the Unsloth import entirely on ROCm
# so transformers classes stay unmodified; the GGUF inference path (llama-server)
# is unaffected since it never imports these Python model classes.
_IS_ROCM_ENV = getattr(__import__("torch").version, "hip", None) is not None
if _IS_ROCM_ENV:
FastLanguageModel = None # Loaded on-demand only on NVIDIA
FastVisionModel = None
get_chat_template = None
else:
from unsloth import FastLanguageModel, FastVisionModel
from unsloth.chat_templates import get_chat_template
from transformers import TextStreamer
from peft import PeftModel, PeftModelForCausalLM
@@ -26,6 +41,7 @@ from utils.hardware import (
raise_if_offloaded,
get_visible_gpu_count,
)
from utils.hardware import hardware as _hw_module
from core.inference.audio_codecs import AudioCodecManager
from io import StringIO
import structlog
@@ -253,6 +269,15 @@ class InferenceBackend:
"""
Load any model: base, LoRA adapter, text, or vision.
"""
# max_seq_length=0 means "model default" for the GGUF/llama.cpp path,
# but Unsloth's FastLanguageModel.from_pretrained treats 0 literally --
# setting the model's context to 0 tokens, which triggers an assertion
# crash during generation (especially on ROCm/HIP where the async
# assert kernel raises a hardware exception instead of a Python error).
# Fall back to 2048 for the Unsloth/transformers path.
if max_seq_length <= 0:
max_seq_length = 2048
try:
model_name = config.identifier
@@ -286,6 +311,12 @@
}
# ── Audio model loading path ──────────────────────────
if (config.is_audio or config.is_vision) and _IS_ROCM_ENV:
raise RuntimeError(
"Audio and vision model inference via Unsloth is not "
"yet supported on AMD ROCm. Use GGUF inference instead."
)
if config.is_audio:
audio_type = config.audio_type
adapter_info = " (LoRA adapter)" if config.is_lora else ""
@@ -516,18 +547,84 @@ class InferenceBackend:
else:
# Text model (or text LoRA adapter)
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = config.path, # Can be base model OR LoRA adapter path
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
device_map = device_map,
token = hf_token if hf_token and hf_token.strip() else None,
trust_remote_code = trust_remote_code,
)
if _hw_module.IS_ROCM:
# On AMD ROCm two issues prevent the normal Unsloth path:
# 1. Unsloth's patched kernels (RoPE, attention) crash on
# HIP (_assert_async_cuda_kernel -> HSA_STATUS_ERROR).
# 2. bitsandbytes 4-bit matmul kernels trigger the same
# HIP assertion on MI300X (CDNA3 / gfx942).
# Fall back to plain transformers + PEFT in 16-bit, which
# works reliably. AMD GPUs typically have large VRAM so
# 16-bit is practical; GGUF inference remains the
# recommended path for memory-constrained setups.
logger.info(
"ROCm detected -- loading in 16-bit with plain "
"transformers (bitsandbytes 4-bit and Unsloth kernels "
"are not yet compatible with HIP)"
)
from transformers import AutoModelForCausalLM, AutoTokenizer
# Apply inference optimization
FastLanguageModel.for_inference(model)
_load_kwargs = dict(
dtype = dtype or torch.bfloat16,
device_map = device_map,
token = hf_token if hf_token and hf_token.strip() else None,
trust_remote_code = trust_remote_code,
)
# Skip 4-bit on ROCm: bnb matmul kernels crash on HIP.
# Also resolve pre-quantized Unsloth model names (e.g.
# "unsloth/xxx-bnb-4bit") to their FP16 originals since
# loading a pre-quantized repo still triggers bnb codepaths.
def _resolve_fp16_base(name: str) -> str:
if not name:
return name
# Strip Unsloth quantization suffixes to get the FP16 model:
# "unsloth/Foo-unsloth-bnb-4bit" -> "unsloth/Foo"
# "unsloth/Foo-bnb-4bit" -> "unsloth/Foo"
# Order matters: try longer suffix first.
for suffix in ("-unsloth-bnb-4bit", "-bnb-4bit"):
if name.lower().endswith(suffix):
resolved = name[: -len(suffix)]
logger.info(
"Resolved pre-quantized base '%s' -> '%s' for ROCm 16-bit inference",
name,
resolved,
)
return resolved
return name
if config.is_lora and config.base_model:
# Load base model then apply adapter
_base = _resolve_fp16_base(config.base_model)
model = AutoModelForCausalLM.from_pretrained(
_base,
**_load_kwargs,
)
from peft import PeftModel
model = PeftModel.from_pretrained(model, config.path)
tokenizer = AutoTokenizer.from_pretrained(config.path)
else:
_path = _resolve_fp16_base(config.path)
model = AutoModelForCausalLM.from_pretrained(
_path,
**_load_kwargs,
)
tokenizer = AutoTokenizer.from_pretrained(config.path)
model.eval()
else:
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = config.path, # Can be base model OR LoRA adapter path
max_seq_length = max_seq_length,
dtype = dtype,
load_in_4bit = load_in_4bit,
device_map = device_map,
token = hf_token if hf_token and hf_token.strip() else None,
trust_remote_code = trust_remote_code,
)
# Apply inference optimization
FastLanguageModel.for_inference(model)
self.models[model_name]["model"] = model
self.models[model_name]["tokenizer"] = tokenizer
@@ -950,10 +1047,13 @@ class InferenceBackend:
)
# This modifies the tokenizer with the correct template
tokenizer = get_chat_template(
tokenizer,
chat_template = template_name,
)
if get_chat_template is not None:
tokenizer = get_chat_template(
tokenizer,
chat_template = template_name,
)
else:
logger.info("Skipping Unsloth chat template (ROCm fallback)")
else:
logger.info(
f"No registered Unsloth template for {self.active_model_name}, using tokenizer default"
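The suffix-stripping logic in `_resolve_fp16_base` above can be exercised in isolation; a minimal sketch, with hypothetical model names:

```python
def resolve_fp16_base(name: str) -> str:
    # Mirrors _resolve_fp16_base: strip Unsloth bnb-4bit quantization
    # suffixes to recover the FP16 repo name, longest suffix first.
    # Matching is case-insensitive; the original casing is preserved.
    for suffix in ("-unsloth-bnb-4bit", "-bnb-4bit"):
        if name.lower().endswith(suffix):
            return name[: -len(suffix)]
    return name

print(resolve_fp16_base("unsloth/Llama-3-8B-unsloth-bnb-4bit"))
print(resolve_fp16_base("unsloth/Llama-3-8B-bnb-4bit"))
print(resolve_fp16_base("unsloth/Llama-3-8B"))  # already FP16, unchanged
```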


@@ -86,6 +86,7 @@ def _probe_causal_conv1d_env() -> dict[str, str] | None:
"'python_tag': f'cp{sys.version_info.major}{sys.version_info.minor}', "
"'torch_mm': torch_mm, "
"'cuda_major': str(int(str(torch.version.cuda).split('.', 1)[0])) if torch.version.cuda else '', "
"'hip_version': str(torch.version.hip) if getattr(torch.version, 'hip', None) else '', "
"'cxx11abi': str(torch._C._GLIBCXX_USE_CXX11_ABI).upper()"
"}))"
),
@@ -237,28 +238,111 @@ def _install_package_wheel_first(
else:
logger.info("No published %s wheel found: %s", display_name, wheel_url)
_send_status(event_queue, f"Installing {display_name} from PyPI...")
pypi_cmd = [
sys.executable,
"-m",
"pip",
"install",
"--no-build-isolation",
"--no-deps",
"--no-cache-dir",
f"{pypi_name}=={pypi_version}",
]
result = _sp.run(
pypi_cmd,
stdout = _sp.PIPE,
stderr = _sp.STDOUT,
text = True,
)
if result.returncode != 0:
logger.error("Failed to install %s from PyPI:\n%s", display_name, result.stdout)
is_hip = env and env.get("hip_version")
if is_hip and not shutil.which("hipcc"):
logger.error(
"%s requires hipcc for source compilation on ROCm. "
"Install the ROCm HIP SDK: https://rocm.docs.amd.com",
display_name,
)
_send_status(
event_queue,
f"{display_name}: hipcc not found (ROCm HIP SDK required)",
)
return
logger.info("Installed %s from PyPI", display_name)
if is_hip:
_send_status(
event_queue,
f"Compiling {display_name} from source for ROCm "
"(this may take several minutes)...",
)
else:
_send_status(event_queue, f"Installing {display_name} from PyPI...")
# Prefer uv for faster dependency resolution when available
if shutil.which("uv"):
pypi_cmd = [
"uv",
"pip",
"install",
"--python",
sys.executable,
"--no-build-isolation",
"--no-deps",
]
# Avoid stale cache artifacts from partial HIP source builds
if is_hip:
pypi_cmd.append("--no-cache")
pypi_cmd.append(f"{pypi_name}=={pypi_version}")
else:
pypi_cmd = [
sys.executable,
"-m",
"pip",
"install",
"--no-build-isolation",
"--no-deps",
"--no-cache-dir",
f"{pypi_name}=={pypi_version}",
]
# Source compilation on ROCm can take 10-30 minutes; use a generous
# timeout. Non-HIP installs preserve the pre-existing "no timeout"
# behaviour so unrelated slow installs (e.g. causal-conv1d source
# build on Linux aarch64 or unsupported torch/CUDA combinations)
# are not aborted at 5 minutes by this PR.
_run_kwargs: dict[str, Any] = {
"stdout": _sp.PIPE,
"stderr": _sp.STDOUT,
"text": True,
}
if is_hip:
_run_kwargs["timeout"] = 1800
try:
result = _sp.run(pypi_cmd, **_run_kwargs)
except _sp.TimeoutExpired:
logger.error(
"%s installation timed out after %ds",
display_name,
_run_kwargs.get("timeout"),
)
_send_status(
event_queue,
f"{display_name} installation timed out after "
f"{_run_kwargs.get('timeout')}s",
)
return
if result.returncode != 0:
if is_hip:
# Surface a clear error for ROCm source build failures
error_lines = (result.stdout or "").strip().splitlines()
snippet = "\n".join(error_lines[-5:]) if error_lines else "(no output)"
logger.error(
"Failed to compile %s for ROCm:\n%s",
display_name,
result.stdout,
)
_send_status(
event_queue,
f"Failed to compile {display_name} for ROCm. "
"Check that hipcc and ROCm development headers are installed.\n"
f"{snippet}",
)
else:
logger.error(
"Failed to install %s from PyPI:\n%s",
display_name,
result.stdout,
)
return
if is_hip:
logger.info("Compiled and installed %s from source for ROCm", display_name)
else:
logger.info("Installed %s from PyPI", display_name)
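The conditional-timeout pattern above (a timeout only for HIP source builds, so slow non-HIP installs are never aborted) can be sketched independently of the installer; `run_install` here is a hypothetical stand-in for the real call site:

```python
import subprocess
import sys

def run_install(cmd, is_hip: bool):
    # Non-HIP installs keep the historical "no timeout" behaviour;
    # ROCm source compilation gets a generous 30-minute cap.
    kwargs = {"stdout": subprocess.PIPE, "stderr": subprocess.STDOUT, "text": True}
    if is_hip:
        kwargs["timeout"] = 1800
    return subprocess.run(cmd, **kwargs)

# Harmless stand-in command instead of a real pip install
r = run_install([sys.executable, "-c", "print('ok')"], is_hip=False)
print(r.returncode, r.stdout.strip())
```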
def _ensure_causal_conv1d_fast_path(event_queue: Any, model_name: str) -> None:


@@ -237,6 +237,7 @@ async def get_system_info():
import platform
import psutil
from utils.hardware import get_device
from utils.hardware.hardware import _backend_label
visibility_info = get_backend_visible_gpu_info()
gpu_info = {
@@ -250,7 +251,10 @@
return {
"platform": platform.platform(),
"python_version": platform.python_version(),
"device_backend": get_device().value,
# Use the centralized _backend_label helper so the /api/system
# endpoint reports "rocm" on AMD hosts instead of "cuda", matching
# the /api/hardware and /api/gpu-visibility endpoints.
"device_backend": _backend_label(get_device()),
"cpu_count": psutil.cpu_count(),
"memory": {
"total_gb": round(memory.total / 1e9, 2),


@@ -191,8 +191,14 @@ class TestGetGpuMemoryInfo:
assert "backend" in get_gpu_memory_info()
def test_backend_matches_device(self):
# The backend field uses _backend_label, which swaps "cuda" for
# "rocm" when running on an AMD host (IS_ROCM=True) so the UI
# can render the correct label. On CUDA / XPU / MLX / CPU hosts
# it is equivalent to `get_device().value`.
from utils.hardware.hardware import _backend_label
result = get_gpu_memory_info()
assert result["backend"] == get_device().value
assert result["backend"] == _backend_label(get_device())
# --- When a GPU IS available ---


@@ -5,6 +5,7 @@
Hardware detection and GPU utilities
"""
from . import hardware as _hardware
from .hardware import (
DeviceType,
DEVICE,
@@ -49,6 +50,7 @@ __all__ = [
"DeviceType",
"DEVICE",
"CHAT_ONLY",
"IS_ROCM",
"detect_hardware",
"get_device",
"is_apple_silicon",
@@ -81,3 +83,11 @@ __all__ = [
"extract_arch_config",
"estimate_training_vram",
]
def __getattr__(name: str):
"""Resolve IS_ROCM at access time so callers always see the live value
after detect_hardware() runs (it flips the flag in hardware.py)."""
if name == "IS_ROCM":
return getattr(_hardware, "IS_ROCM")
raise AttributeError(name)


@@ -0,0 +1,382 @@
# SPDX-License-Identifier: AGPL-3.0-only
# Copyright 2026-present the Unsloth AI Inc. team. All rights reserved. See /studio/LICENSE.AGPL-3.0
"""AMD GPU monitoring via amd-smi.
Mirrors the nvidia.py module structure so hardware.py can swap backends
based on IS_ROCM. All functions return the same dict shapes as their
nvidia.py counterparts.
"""
import json
import math
import os
import re
import subprocess
from typing import Any, Optional
from loggers import get_logger
logger = get_logger(__name__)
def _run_amd_smi(*args: str, timeout: int = 5) -> Optional[Any]:
"""Run amd-smi with the given arguments and return parsed JSON, or None."""
try:
result = subprocess.run(
["amd-smi", *args, "--json"],
capture_output = True,
text = True,
timeout = timeout,
)
except (OSError, subprocess.TimeoutExpired) as e:
logger.warning("amd-smi query failed: %s", e)
return None
if result.returncode != 0 or not result.stdout.strip():
logger.warning("amd-smi returned code %d", result.returncode)
return None
try:
return json.loads(result.stdout)
except json.JSONDecodeError:
logger.warning("Failed to parse amd-smi JSON output")
return None
def _parse_numeric(value: Any) -> Optional[float]:
"""Extract a numeric value from amd-smi output (may be str, int, float, or dict)."""
if value is None:
return None
# Newer amd-smi versions emit {"value": 10, "unit": "W"}
if isinstance(value, dict):
return _parse_numeric(value.get("value"))
if isinstance(value, (int, float)):
f = float(value)
return f if math.isfinite(f) else None
if isinstance(value, str):
# Strip units like "W", "C", "%", "MB", "MiB", "GB", "GiB" etc.
cleaned = re.sub(r"\s*[A-Za-z/%]+$", "", value.strip())
if not cleaned or cleaned.lower() in ("n/a", "none", "unknown"):
return None
try:
return float(cleaned)
except (ValueError, TypeError):
return None
return None
def _parse_memory_mb(value: Any) -> Optional[float]:
"""Parse a memory value from amd-smi output and return MB.
Handles bare numbers (assumed MB -- the amd-smi convention on every
version we have seen), dict-shaped values with explicit units
(``{"value": 192, "unit": "GiB"}`` on newer releases), and plain
strings like ``"8192 MiB"``.
"""
unit = ""
raw_value = value
if isinstance(value, dict):
unit = str(value.get("unit", "")).strip().lower()
raw_value = value.get("value")
elif isinstance(value, str):
# Extract unit suffix from strings like "192 GiB" or "8192 MB"
m = re.match(r"^\s*([\d.]+)\s*([A-Za-z]+)\s*$", value.strip())
if m:
unit = m.group(2).lower()
num = _parse_numeric(raw_value if isinstance(value, dict) else value)
if num is None:
return None
# Unit conversion -- GPU tools (including amd-smi) use binary units even
# when labeling them "GB" or "MB", so treat GB/GiB and MB/MiB the same.
if "gib" in unit or "gb" in unit:
return num * 1024
if "mib" in unit or "mb" in unit:
return num
if "kib" in unit or "kb" in unit:
return num / 1024
if unit in ("b", "byte", "bytes"):
# Plain bytes
return num / (1024 * 1024)
# No explicit unit -- default to MB, which is the amd-smi convention
# for bare numeric values. A previous heuristic assumed values above
# ~10M were bytes, but that misclassifies small VRAM allocations
# (e.g. 5 MB = 5,242,880 reported without a unit) as ~5 TB. Modern
# amd-smi always ships explicit units, so the heuristic branch only
# fired for legacy output where MB was already the convention.
return num
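A simplified standalone sketch of the unit handling in `_parse_memory_mb` above (the plain-bytes branch is omitted for brevity):

```python
import re

def parse_memory_mb(value):
    # Dict values carry an explicit unit, strings may embed one,
    # and bare numbers default to MB (the amd-smi convention).
    unit = ""
    if isinstance(value, dict):
        unit = str(value.get("unit", "")).strip().lower()
        value = value.get("value")
    elif isinstance(value, str):
        m = re.match(r"^\s*([\d.]+)\s*([A-Za-z]+)\s*$", value.strip())
        if not m:
            return None
        value, unit = m.group(1), m.group(2).lower()
    if value is None:
        return None
    num = float(value)
    # GPU tools use binary units even when labelled "GB"/"MB"
    if unit in ("gib", "gb"):
        return num * 1024
    if unit in ("kib", "kb"):
        return num / 1024
    return num  # "mib"/"mb" or no unit: already MB

print(parse_memory_mb({"value": 192, "unit": "GiB"}))  # 196608.0
print(parse_memory_mb("8192 MiB"))                     # 8192.0
print(parse_memory_mb(512))                            # 512.0
```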
def _extract_gpu_metrics(gpu_data: dict) -> dict[str, Any]:
"""Extract standardized metrics from a single GPU's amd-smi data."""
# amd-smi metric output structure varies by version; try common paths
usage = gpu_data.get("usage", gpu_data.get("gpu_activity", {}))
if isinstance(usage, dict):
gpu_util = _parse_numeric(
usage.get("gfx_activity", usage.get("gpu_use_percent"))
)
else:
gpu_util = _parse_numeric(usage)
# Temperature -- try multiple keys in priority order. Some amd-smi
# versions report "N/A" strings for keys that exist, so a single
# dict.get() with a fallback default is not enough: try each key
# and keep the first that parses to a real number.
temp_data = gpu_data.get("temperature", {})
temp = None
if isinstance(temp_data, dict):
for temp_key in ("edge", "temperature_edge", "hotspot", "temperature_hotspot"):
temp = _parse_numeric(temp_data.get(temp_key))
if temp is not None:
break
else:
temp = _parse_numeric(temp_data)
# Power
power_data = gpu_data.get("power", {})
if isinstance(power_data, dict):
power_draw = _parse_numeric(
power_data.get(
"current_socket_power",
power_data.get("average_socket_power", power_data.get("socket_power")),
)
)
power_limit = _parse_numeric(
power_data.get("power_cap", power_data.get("max_power_limit"))
)
else:
power_draw = None
power_limit = None
# VRAM -- unit-aware parsing to handle varying amd-smi output formats.
# Newer releases use "mem_usage" with "total_vram" / "used_vram" keys
# and may wrap values as {"value": 192, "unit": "GiB"}; older versions
# use "vram" or "fb_memory_usage" with "used" / "total".
vram_data = gpu_data.get(
"mem_usage",
gpu_data.get("vram", gpu_data.get("fb_memory_usage", {})),
)
if isinstance(vram_data, dict):
vram_used_mb = _parse_memory_mb(
vram_data.get(
"used_vram", vram_data.get("vram_used", vram_data.get("used"))
)
)
vram_total_mb = _parse_memory_mb(
vram_data.get(
"total_vram", vram_data.get("vram_total", vram_data.get("total"))
)
)
else:
vram_used_mb = None
vram_total_mb = None
# Build the standardized dict (same shape as nvidia._build_gpu_metrics)
vram_used_gb = round(vram_used_mb / 1024, 2) if vram_used_mb is not None else None
vram_total_gb = (
round(vram_total_mb / 1024, 2) if vram_total_mb is not None else None
)
vram_util = (
round((vram_used_mb / vram_total_mb) * 100, 1)
if vram_used_mb is not None and vram_total_mb is not None and vram_total_mb > 0
else None
)
power_util = (
round((power_draw / power_limit) * 100, 1)
if power_draw is not None and power_limit is not None and power_limit > 0
else None
)
return {
"gpu_utilization_pct": gpu_util,
"temperature_c": temp,
"vram_used_gb": vram_used_gb,
"vram_total_gb": vram_total_gb,
"vram_utilization_pct": vram_util,
"power_draw_w": power_draw,
"power_limit_w": power_limit,
"power_utilization_pct": power_util,
}
def _has_real_metrics(metrics: dict[str, Any]) -> bool:
"""Return True when ``metrics`` contains at least one non-None value.
``amd-smi`` can return a zero-exit JSON envelope that is missing every
expected field (error response, unsupported card, hipless container).
In that case ``_extract_gpu_metrics`` produces a dict where every value
is ``None`` -- callers must surface this as ``available: False`` rather
than ``available: True`` with empty data.
"""
return any(value is not None for value in metrics.values())
def get_physical_gpu_count() -> Optional[int]:
"""Return physical AMD GPU count via amd-smi, or None on failure."""
data = _run_amd_smi("list")
if data is None:
return None
if isinstance(data, list):
return len(data)
# Some versions return a dict with a "gpu" / "gpus" key. Guard the
# .get() access with an isinstance check so a malformed scalar /
# string response from amd-smi cannot raise AttributeError.
if not isinstance(data, dict):
return None
gpus = data.get("gpu", data.get("gpus", []))
if isinstance(gpus, list):
return len(gpus)
return None
def _first_visible_amd_gpu_id() -> Optional[str]:
"""Return the physical AMD GPU id that should be treated as 'primary'.
Honours HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES
in that order (HIP respects all three). Returns ``"0"`` when none are
set, and ``None`` when the env var explicitly narrows to zero GPUs
("" or "-1"), so callers can short-circuit to "available: False".
"""
for env_name in (
"HIP_VISIBLE_DEVICES",
"ROCR_VISIBLE_DEVICES",
"CUDA_VISIBLE_DEVICES",
):
raw = os.environ.get(env_name)
if raw is None:
continue
raw = raw.strip()
if raw == "" or raw == "-1":
return None
# Filter out empty tokens after splitting. This tolerates minor
# typos like ``HIP_VISIBLE_DEVICES=",1"`` (leading comma, user
# clearly meant to narrow to device 1) while still falling
# through to the next env var when every token is empty
# (e.g. ``,,,``).
tokens = [t.strip() for t in raw.split(",") if t.strip()]
if tokens:
return tokens[0]
return "0"
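The visibility-env precedence in `_first_visible_amd_gpu_id` above is easiest to see with the environment passed in explicitly; a minimal sketch:

```python
def first_visible_amd_gpu(env: dict):
    # HIP respects all three vars; check ROCm-specific ones first.
    for name in ("HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES"):
        raw = env.get(name)
        if raw is None:
            continue
        raw = raw.strip()
        if raw in ("", "-1"):
            return None  # explicitly narrowed to zero GPUs
        tokens = [t.strip() for t in raw.split(",") if t.strip()]
        if tokens:
            return tokens[0]
    return "0"  # nothing set: default to device 0

print(first_visible_amd_gpu({"HIP_VISIBLE_DEVICES": "1,2"}))   # "1"
print(first_visible_amd_gpu({"ROCR_VISIBLE_DEVICES": ""}))     # None
print(first_visible_amd_gpu({}))                               # "0"
```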
def get_primary_gpu_utilization() -> dict[str, Any]:
"""Return utilization metrics for the primary visible AMD GPU."""
gpu_idx = _first_visible_amd_gpu_id()
if gpu_idx is None:
return {"available": False}
data = _run_amd_smi("metric", "-g", gpu_idx)
if data is None:
return {"available": False}
# amd-smi may return:
# - a list of GPU dicts (older versions)
# - a dict with a "gpu_data" key wrapping a list (newer versions)
# - a single GPU dict (rare)
if isinstance(data, dict) and "gpu_data" in data:
data = data["gpu_data"]
if isinstance(data, list):
if len(data) == 0:
return {"available": False}
gpu_data = data[0]
else:
gpu_data = data
metrics = _extract_gpu_metrics(gpu_data)
if not _has_real_metrics(metrics):
# amd-smi returned a JSON envelope with no usable fields (error
# response or unsupported card). Surface as unavailable rather
# than available-with-empty-data so the UI does not render a
# ghost device.
return {"available": False}
metrics["available"] = True
return metrics
def get_visible_gpu_utilization(
parent_visible_ids: Optional[list[int]],
parent_cuda_visible_devices: Optional[str] = None,
) -> dict[str, Any]:
"""Return utilization metrics for visible AMD GPUs."""
if parent_visible_ids is None:
return {
"available": False,
"backend_cuda_visible_devices": parent_cuda_visible_devices,
"parent_visible_gpu_ids": [],
"devices": [],
"index_kind": "unresolved",
}
data = _run_amd_smi("metric")
if data is None:
return {
"available": False,
"backend_cuda_visible_devices": parent_cuda_visible_devices,
"parent_visible_gpu_ids": parent_visible_ids or [],
"devices": [],
"index_kind": "physical",
}
# Extract a device list from amd-smi's envelope. Newer versions return
# a JSON array directly, older versions return a dict with a "gpus" /
# "gpu" key wrapping the list. Guard non-dict / non-list envelopes
# (scalar / string fallbacks from malformed output) so the .get()
# access cannot raise AttributeError on an unexpected shape.
if isinstance(data, list):
gpu_list = data
elif isinstance(data, dict):
# Newer amd-smi wraps output in {"gpu_data": [...]}
gpu_list = data.get("gpu_data", data.get("gpus", data.get("gpu", [data])))
else:
gpu_list = [data]
visible_set = set(parent_visible_ids)
ordinal_map = {gpu_id: ordinal for ordinal, gpu_id in enumerate(parent_visible_ids)}
devices = []
for fallback_idx, gpu_data in enumerate(gpu_list):
# Skip non-dict entries defensively: if amd-smi ever ships a
# scalar inside its "gpus" array (observed on some malformed
# output), _extract_gpu_metrics would raise AttributeError on
# the first .get() call.
if not isinstance(gpu_data, dict):
continue
# Use AMD-reported GPU ID when available, fall back to enumeration
# index. Newer amd-smi versions wrap scalars as ``{"value": 0,
# "unit": "none"}``, so route raw_id through ``_parse_numeric``
# which already handles bare ints, floats, strings, and that
# dict shape uniformly.
raw_id = gpu_data.get(
"gpu", gpu_data.get("gpu_id", gpu_data.get("id", fallback_idx))
)
parsed_id = _parse_numeric(raw_id)
if parsed_id is None:
logger.debug(
"amd-smi GPU id %r could not be parsed; falling back to "
"enumeration index %d",
raw_id,
fallback_idx,
)
idx = fallback_idx
else:
idx = int(parsed_id)
if idx not in visible_set:
continue
metrics = _extract_gpu_metrics(gpu_data)
if not _has_real_metrics(metrics):
# Skip ghost entries: an amd-smi response that decodes to a
# dict but contains no usable fields (error envelope, etc.)
# would otherwise show up as a device row with all-None
# numbers in the UI.
continue
metrics["index"] = idx
metrics["index_kind"] = "physical"
metrics["visible_ordinal"] = ordinal_map.get(idx, len(devices))
devices.append(metrics)
return {
"available": len(devices) > 0,
"backend_cuda_visible_devices": parent_cuda_visible_devices,
"parent_visible_gpu_ids": parent_visible_ids or [],
"devices": devices,
"index_kind": "physical",
}


@@ -43,6 +43,26 @@ class DeviceType(str, Enum):
DEVICE: Optional[DeviceType] = None
CHAT_ONLY: bool = True # No CUDA GPU -> GGUF chat only (Mac, CPU-only, etc.)
IS_ROCM: bool = (
False # True when running on AMD ROCm (HIP) -- routes GPU monitoring to amd.py
)
def _backend_label(device: DeviceType) -> str:
"""Return the user-facing backend name for API responses.
Internally we still represent ROCm hosts as ``DeviceType.CUDA`` because
ROCm torch sets ``torch.cuda.is_available() = True`` and reuses the whole
``torch.cuda.*`` API surface, so branching on ``DeviceType`` stays
consistent with the rest of the codebase. For the JSON responses served
to the Studio frontend and other clients, however, "cuda" is misleading
on an AMD machine. This helper swaps the label to ``"rocm"`` when the
module-level ``IS_ROCM`` flag is set so the UI can render the correct
backend name without every caller having to duplicate the check.
"""
if IS_ROCM and device == DeviceType.CUDA:
return "rocm"
return device.value
# ========== Detection ==========
@@ -85,10 +105,11 @@ def detect_hardware() -> DeviceType:
2. MLX (Apple Silicon via MLX framework)
3. CPU (fallback)
"""
global DEVICE, CHAT_ONLY
CHAT_ONLY = True # reset -- only CUDA sets it to False
global DEVICE, CHAT_ONLY, IS_ROCM
CHAT_ONLY = True # reset -- only CUDA/ROCm sets it to False
IS_ROCM = False
# --- CUDA: try PyTorch ---
# --- CUDA / ROCm: try PyTorch ---
if _has_torch():
import torch
@@ -96,7 +117,16 @@
DEVICE = DeviceType.CUDA
CHAT_ONLY = False
device_name = torch.cuda.get_device_properties(0).name
print(f"Hardware detected: CUDA — {device_name}")
# Distinguish AMD ROCm (HIP) from NVIDIA CUDA for display purposes.
# DeviceType stays CUDA since torch.cuda.* works on ROCm via HIP.
if getattr(torch.version, "hip", None) is not None:
IS_ROCM = True
print(
f"Hardware detected: ROCm (HIP {torch.version.hip}) -- {device_name}"
)
else:
print(f"Hardware detected: CUDA -- {device_name}")
return DEVICE
# --- XPU: Intel GPU ---
@@ -186,7 +216,7 @@ def get_gpu_memory_info() -> Dict[str, Any]:
return {
"available": True,
"backend": device.value,
"backend": _backend_label(device),
"device": idx,
"device_name": props.name,
"total_gb": total / (1024**3),
@@ -197,7 +227,11 @@
}
except Exception as e:
logger.error(f"Error getting CUDA GPU info: {e}")
return {"available": False, "backend": device.value, "error": str(e)}
return {
"available": False,
"backend": _backend_label(device),
"error": str(e),
}
# ---- XPU path (Intel GPU) ----
if device == DeviceType.XPU:
@@ -213,7 +247,7 @@
return {
"available": True,
"backend": device.value,
"backend": _backend_label(device),
"device": idx,
"device_name": props.name,
"total_gb": total / (1024**3),
@@ -224,7 +258,11 @@
}
except Exception as e:
logger.error("Error getting XPU GPU info: %s", e)
return {"available": False, "backend": device.value, "error": str(e)}
return {
"available": False,
"backend": _backend_label(device),
"error": str(e),
}
# ---- MLX path (Apple Silicon) ----
if device == DeviceType.MLX:
@@ -239,7 +277,7 @@
return {
"available": True,
"backend": device.value,
"backend": _backend_label(device),
"device": 0,
"device_name": f"Apple Silicon ({platform.processor() or platform.machine()})",
"total_gb": total / (1024**3),
@@ -250,7 +288,11 @@
}
except Exception as e:
logger.error(f"Error getting MLX GPU info: {e}")
return {"available": False, "backend": device.value, "error": str(e)}
return {
"available": False,
"backend": _backend_label(device),
"error": str(e),
}
# ---- CPU-only ----
return {"available": False, "backend": "cpu"}
@@ -315,13 +357,15 @@ def get_package_versions() -> Dict[str, Optional[str]]:
except PackageNotFoundError:
versions[name] = None
# CUDA toolkit version bundled with torch
# GPU runtime version bundled with torch
try:
import torch
versions["cuda"] = getattr(torch.version, "cuda", None)
versions["rocm"] = getattr(torch.version, "hip", None)
except Exception:
versions["cuda"] = None
versions["rocm"] = None
return versions
@@ -387,26 +431,50 @@ def _torch_get_per_device_info(device_indices: list[int]) -> list[Dict[str, Any]
# ========== Live GPU Utilization ==========
def _smi_query(func_name: str, *args, **kwargs) -> Optional[Dict[str, Any]]:
"""Run a query against the appropriate SMI backend (amd-smi or nvidia-smi).
Returns the result dict if available, or None on failure/unavailability.
"""
if IS_ROCM:
backend_name = "amd-smi"
try:
from . import amd as _backend
except Exception as e:
logger.warning("%s import failed: %s", backend_name, e)
return None
else:
backend_name = "nvidia-smi"
try:
from . import nvidia as _backend
except Exception as e:
logger.warning("%s import failed: %s", backend_name, e)
return None
try:
func = getattr(_backend, func_name)
result = func(*args, **kwargs)
if result.get("available"):
return result
except Exception as e:
logger.warning("%s %s query failed: %s", backend_name, func_name, e)
return None
def get_gpu_utilization() -> Dict[str, Any]:
"""Return a live snapshot of device utilization information."""
device = get_device()
if device == DeviceType.CUDA:
try:
from . import nvidia
result = nvidia.get_primary_gpu_utilization()
if result.get("available"):
result["backend"] = device.value
return result
except Exception as e:
logger.warning("nvidia-smi utilization query failed: %s", e)
result = _smi_query("get_primary_gpu_utilization")
if result is not None:
result["backend"] = _backend_label(device)
return result
mem = get_gpu_memory_info()
if device != DeviceType.CPU and mem.get("available"):
return {
"available": True,
"backend": device.value,
"backend": _backend_label(device),
"gpu_utilization_pct": None,
"temperature_c": None,
"vram_used_gb": round(mem.get("allocated_gb", 0), 2),
@@ -417,7 +485,7 @@ def get_gpu_utilization() -> Dict[str, Any]:
"power_utilization_pct": None,
}
return {"available": False, "backend": device.value}
return {"available": False, "backend": _backend_label(device)}
def get_visible_gpu_utilization() -> Dict[str, Any]:
@@ -425,18 +493,14 @@ def get_visible_gpu_utilization() -> Dict[str, Any]:
if device == DeviceType.CUDA:
parent_visible_spec = _get_parent_visible_gpu_spec()
try:
from . import nvidia
result = nvidia.get_visible_gpu_utilization(
parent_visible_spec["numeric_ids"],
parent_cuda_visible_devices = parent_visible_spec["raw"],
)
if result.get("available"):
result["backend"] = device.value
return result
except Exception as e:
logger.warning("nvidia-smi visible GPU utilization query failed: %s", e)
result = _smi_query(
"get_visible_gpu_utilization",
parent_visible_spec["numeric_ids"],
parent_cuda_visible_devices = parent_visible_spec["raw"],
)
if result is not None:
result["backend"] = _backend_label(device)
return result
# Torch-based fallback for CUDA (nvidia-smi unavailable, AMD ROCm) and XPU (Intel)
if device in (DeviceType.CUDA, DeviceType.XPU):
@@ -475,7 +539,7 @@
)
return {
"available": True,
"backend": device.value,
"backend": _backend_label(device),
"parent_visible_gpu_ids": parent_ids,
"devices": devices,
"index_kind": index_kind,
@@ -486,14 +550,14 @@
if not mem.get("available"):
return {
"available": False,
"backend": device.value,
"backend": _backend_label(device),
"parent_visible_gpu_ids": [],
"devices": [],
"index_kind": "relative",
}
return {
"available": True,
"backend": device.value,
"backend": _backend_label(device),
"parent_visible_gpu_ids": [0],
"devices": [
{
@@ -515,7 +579,7 @@
return {
"available": False,
"backend": device.value,
"backend": _backend_label(device),
"parent_visible_gpu_ids": [],
"devices": [],
"index_kind": "relative",
@@ -529,7 +593,21 @@ _visible_gpu_count: Optional[int] = None
def _get_parent_visible_gpu_spec() -> Dict[str, Any]:
cuda_visible = os.environ.get("CUDA_VISIBLE_DEVICES")
# ROCm uses HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES in addition to
# CUDA_VISIBLE_DEVICES (which HIP also respects). Check ROCm-specific
# env vars first so multi-GPU AMD setups are handled correctly.
# Use explicit None checks (not `or`) so empty string "" is honoured
# as "no visible GPUs" rather than falling through to CUDA_VISIBLE_DEVICES.
cuda_visible = None
if IS_ROCM:
hip_vis = os.environ.get("HIP_VISIBLE_DEVICES")
rocr_vis = os.environ.get("ROCR_VISIBLE_DEVICES")
if hip_vis is not None:
cuda_visible = hip_vis
elif rocr_vis is not None:
cuda_visible = rocr_vis
if cuda_visible is None:
cuda_visible = os.environ.get("CUDA_VISIBLE_DEVICES")
if cuda_visible is None:
return {
@@ -1109,15 +1187,17 @@ def get_physical_gpu_count() -> int:
if device == DeviceType.CUDA:
try:
from . import nvidia
count = nvidia.get_physical_gpu_count()
if IS_ROCM:
from . import amd as _smi_mod
else:
from . import nvidia as _smi_mod
count = _smi_mod.get_physical_gpu_count()
if count is not None:
_physical_gpu_count = count
return _physical_gpu_count
except Exception:
pass
# nvidia-smi unavailable or failed — fall back to torch
# SMI tool unavailable or failed -- fall back to torch
count = _torch_get_physical_gpu_count()
_physical_gpu_count = count if count is not None else 1
return _physical_gpu_count
@@ -1136,12 +1216,25 @@
return _physical_gpu_count
def _backend_visible_devices_env() -> Optional[str]:
"""Return the raw visibility env string that applies to this backend.
On ROCm, HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES take precedence
over CUDA_VISIBLE_DEVICES; the helper mirrors the resolution logic in
``_get_parent_visible_gpu_spec`` so ``backend_cuda_visible_devices``
reports the value that is actually narrowing the visible device set.
"""
if IS_ROCM:
return _get_parent_visible_gpu_spec().get("raw")
return os.environ.get("CUDA_VISIBLE_DEVICES")
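The precedence chain described above can be sketched as a pure function over an env mapping (a hypothetical standalone helper for illustration, not part of this diff):

```python
def resolve_visible_devices_env(env: dict, is_rocm: bool):
    """Mirror the resolution order used above: on ROCm hosts,
    HIP_VISIBLE_DEVICES wins, then ROCR_VISIBLE_DEVICES, then
    CUDA_VISIBLE_DEVICES. Explicit None checks keep "" meaning
    "no visible GPUs" instead of falling through."""
    if is_rocm:
        for var in ("HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES"):
            value = env.get(var)
            if value is not None:  # "" is a valid hide-everything mask
                return value
    return env.get("CUDA_VISIBLE_DEVICES")
```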
def get_backend_visible_gpu_info() -> Dict[str, Any]:
device = get_device()
if device in (DeviceType.CUDA, DeviceType.XPU):
parent_visible_ids = get_parent_visible_gpu_ids()
# Try nvidia-smi first (NVIDIA only)
if device == DeviceType.CUDA:
# Try native SMI tool first (nvidia-smi for NVIDIA, skipped for ROCm)
if device == DeviceType.CUDA and not IS_ROCM:
try:
from . import nvidia
@@ -1151,7 +1244,7 @@ def get_backend_visible_gpu_info() -> Dict[str, Any]:
parent_visible_spec["raw"],
)
if result.get("available"):
result["backend"] = device.value
result["backend"] = _backend_label(device)
return result
except Exception as e:
logger.warning("Backend GPU visibility query failed: %s", e)
@@ -1180,8 +1273,8 @@ def get_backend_visible_gpu_info() -> Dict[str, Any]:
]
return {
"available": True,
"backend": device.value,
"backend_cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
"backend": _backend_label(device),
"backend_cuda_visible_devices": _backend_visible_devices_env(),
"parent_visible_gpu_ids": parent_visible_ids,
"devices": devices,
"index_kind": index_kind,
@@ -1189,8 +1282,8 @@ def get_backend_visible_gpu_info() -> Dict[str, Any]:
return {
"available": False,
"backend": device.value,
"backend_cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
"backend": _backend_label(device),
"backend_cuda_visible_devices": _backend_visible_devices_env(),
"parent_visible_gpu_ids": parent_visible_ids,
"devices": [],
"index_kind": "physical",
@@ -1201,7 +1294,7 @@ def get_backend_visible_gpu_info() -> Dict[str, Any]:
if not mem.get("available"):
return {
"available": False,
"backend": device.value,
"backend": _backend_label(device),
"backend_cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
"parent_visible_gpu_ids": [],
"devices": [],
@@ -1209,7 +1302,7 @@ def get_backend_visible_gpu_info() -> Dict[str, Any]:
}
return {
"available": True,
"backend": device.value,
"backend": _backend_label(device),
"backend_cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
"parent_visible_gpu_ids": [0],
"devices": [
@@ -1226,7 +1319,7 @@ def get_backend_visible_gpu_info() -> Dict[str, Any]:
return {
"available": False,
"backend": device.value,
"backend": _backend_label(device),
"backend_cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
"parent_visible_gpu_ids": [],
"devices": [],
@@ -1246,17 +1339,20 @@ def get_visible_gpu_count() -> int:
if _visible_gpu_count is not None:
return _visible_gpu_count
cuda_visible = os.environ.get("CUDA_VISIBLE_DEVICES")
if cuda_visible is not None:
# "" means zero GPUs, "0" means 1, "0,1,2" means 3
cuda_visible = cuda_visible.strip()
if cuda_visible == "" or cuda_visible == "-1":
# Use _get_parent_visible_gpu_spec() which already handles
# HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES on ROCm.
visible_spec = _get_parent_visible_gpu_spec()
if visible_spec["raw"] is not None:
raw = visible_spec["raw"].strip()
if raw == "" or raw == "-1":
_visible_gpu_count = 0
elif visible_spec["numeric_ids"] is not None:
_visible_gpu_count = len(visible_spec["numeric_ids"])
else:
_visible_gpu_count = len([x for x in cuda_visible.split(",") if x.strip()])
_visible_gpu_count = len([x for x in raw.split(",") if x.strip()])
return _visible_gpu_count
# CUDA_VISIBLE_DEVICES not set -- try torch, fall back to physical count
# No visibility env var set -- try torch, fall back to physical count
try:
import torch
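The counting rules above ("" and "-1" mean zero GPUs, otherwise count the non-empty comma-separated tokens) reduce to a small helper; this is a sketch for illustration, not code from the diff:

```python
def count_from_raw_mask(raw: str) -> int:
    # "" and "-1" hide all GPUs; otherwise count non-empty tokens,
    # tolerating stray whitespace and empty entries ("0,,1" -> 2).
    raw = raw.strip()
    if raw in ("", "-1"):
        return 0
    return len([tok for tok in raw.split(",") if tok.strip()])
```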
@@ -1288,8 +1384,24 @@ def apply_gpu_ids(gpu_ids) -> None:
value = str(gpu_ids)
os.environ["CUDA_VISIBLE_DEVICES"] = value
# Keep ROCm visibility env vars in sync so _get_parent_visible_gpu_spec()
# picks up the narrowed set on AMD systems. Workers can call
# apply_gpu_ids() before detect_hardware() runs (so IS_ROCM is still
# its default False), so also mirror the selection whenever the
# parent process already set a ROCm visibility variable -- that
# way a downstream ROCm process inherits the narrowed mask even
# before Studio's hardware detection has classified the host.
_inherits_rocm_visibility = (
"HIP_VISIBLE_DEVICES" in os.environ or "ROCR_VISIBLE_DEVICES" in os.environ
)
if IS_ROCM or _inherits_rocm_visibility:
os.environ["HIP_VISIBLE_DEVICES"] = value
os.environ["ROCR_VISIBLE_DEVICES"] = value
_visible_gpu_count = None
logger.info("Applied gpu_ids: CUDA_VISIBLE_DEVICES='%s'", value)
if IS_ROCM or _inherits_rocm_visibility:
logger.info("Applied gpu_ids: CUDA_VISIBLE_DEVICES='%s' (rocm)", value)
else:
logger.info("Applied gpu_ids: CUDA_VISIBLE_DEVICES='%s'", value)
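As a minimal sketch of the mirroring rule above (a hypothetical helper operating on a plain dict rather than `os.environ`):

```python
def mirror_gpu_mask(env: dict, value: str, is_rocm: bool) -> dict:
    # Decide whether to mirror BEFORE writing the ROCm variables, so the
    # check reflects what the parent process set, not our own writes.
    inherits = "HIP_VISIBLE_DEVICES" in env or "ROCR_VISIBLE_DEVICES" in env
    env["CUDA_VISIBLE_DEVICES"] = value
    if is_rocm or inherits:
        env["HIP_VISIBLE_DEVICES"] = value
        env["ROCR_VISIBLE_DEVICES"] = value
    return env
```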
def get_device_map(
@@ -173,6 +173,7 @@ class HostInfo:
visible_cuda_devices: str | None
has_physical_nvidia: bool
has_usable_nvidia: bool
has_rocm: bool = False
@dataclass
@@ -2493,12 +2494,25 @@ def detect_host() -> HostInfo:
has_physical_nvidia = False
has_usable_nvidia = False
if nvidia_smi:
# Require `nvidia-smi -L` to actually list a GPU before treating the
# host as NVIDIA. The banner text "NVIDIA-SMI ..." is printed even
# when the command fails to communicate with the driver (e.g. stale
# container leftovers), which would otherwise misclassify an AMD
# ROCm host as NVIDIA and short-circuit the ROCm path.
try:
listing = run_capture([nvidia_smi, "-L"], timeout = 20)
gpu_lines = [
line for line in listing.stdout.splitlines() if line.startswith("GPU ")
]
if gpu_lines:
has_physical_nvidia = True
has_usable_nvidia = visible_device_tokens != []
except Exception:
pass
try:
result = run_capture([nvidia_smi], timeout = 20)
merged = "\n".join(part for part in (result.stdout, result.stderr) if part)
if "NVIDIA-SMI" in merged:
has_physical_nvidia = True
has_usable_nvidia = visible_device_tokens != []
for line in merged.splitlines():
if "CUDA Version:" in line:
raw = line.split("CUDA Version:", 1)[1].strip().split()[0]
@@ -2538,6 +2552,12 @@
if visible_gpu_rows:
has_usable_nvidia = True
# Older nvidia-smi versions (pre -L support) hit the
# except in the first try block but still succeed here,
# leaving has_physical_nvidia unset. Mirror the -L path
# so downstream diagnostics on line ~4390 still run.
if not has_physical_nvidia:
has_physical_nvidia = True
elif visible_device_tokens == []:
has_usable_nvidia = False
elif supports_explicit_visible_device_matching(visible_device_tokens):
@@ -2547,6 +2567,50 @@
except Exception:
pass
# Detect AMD ROCm (HIP) -- require actual GPU, not just tools installed
def _amd_smi_has_gpu(stdout: str) -> bool:
"""Check for 'GPU: <number>' data rows, not just a table header."""
return bool(re.search(r"(?im)^gpu\s*[:\[]\s*\d", stdout))
has_rocm = False
if is_linux:
for _cmd, _check in (
# rocminfo: look for "gfxNNNN" with nonzero first digit (gfx000 is CPU agent)
(["rocminfo"], lambda out: bool(re.search(r"gfx[1-9]", out.lower()))),
(["amd-smi", "list"], _amd_smi_has_gpu),
):
_exe = shutil.which(_cmd[0])
if not _exe:
continue
try:
_result = run_capture([_exe, *_cmd[1:]], timeout = 10)
except Exception:
continue
if _result.returncode == 0 and _result.stdout.strip():
if _check(_result.stdout):
has_rocm = True
break
elif is_windows:
# Windows: prefer active probes that validate GPU presence
for _cmd, _check in (
(["hipinfo"], lambda out: "gcnarchname" in out.lower()),
(["amd-smi", "list"], _amd_smi_has_gpu),
):
_exe = shutil.which(_cmd[0])
if not _exe:
continue
try:
_result = run_capture([_exe, *_cmd[1:]], timeout = 10)
except Exception:
continue
if _result.returncode == 0 and _result.stdout.strip():
if _check(_result.stdout):
has_rocm = True
break
# Note: amdhip64.dll presence alone is NOT treated as GPU evidence
# since the HIP SDK can be installed without an AMD GPU.
return HostInfo(
system = system,
machine = machine,
@@ -2561,6 +2625,7 @@
visible_cuda_devices = visible_cuda_devices,
has_physical_nvidia = has_physical_nvidia,
has_usable_nvidia = has_usable_nvidia,
has_rocm = has_rocm,
)
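Both detection branches above share the same amd-smi data-row heuristic; pulled out standalone (the helper name here is illustrative) it behaves like this:

```python
import re

def amd_smi_lists_gpu(stdout: str) -> bool:
    # Match "GPU: 0" or "GPU[0]" data rows at line start (any case),
    # but not a bare "GPU" column header with no device index.
    return bool(re.search(r"(?im)^gpu\s*[:\[]\s*\d", stdout))
```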
@@ -2926,9 +2991,168 @@
return None
def _detect_host_rocm_version() -> tuple[int, int] | None:
"""Return (major, minor) of the installed ROCm runtime, or None.
Best-effort read from /opt/rocm/.info/version, amd-smi version, and
hipconfig --version. Used to pick a compatible upstream llama.cpp
ROCm prebuilt rather than always taking the numerically newest one
(which can be newer than the host runtime).
"""
rocm_root = os.environ.get("ROCM_PATH") or "/opt/rocm"
for path in (
os.path.join(rocm_root, ".info", "version"),
os.path.join(rocm_root, "lib", "rocm_version"),
):
try:
with open(path) as fh:
parts = fh.read().strip().split("-")[0].split(".")
# Explicit length guard avoids relying on the broad except
# below to swallow IndexError when the version file contains
# a single component (e.g. "6\n" on a partial install).
if len(parts) >= 2:
return int(parts[0]), int(parts[1])
except Exception:
pass
amd_smi = shutil.which("amd-smi")
if amd_smi:
try:
result = subprocess.run(
[amd_smi, "version"],
stdout = subprocess.PIPE,
stderr = subprocess.DEVNULL,
text = True,
timeout = 5,
)
if result.returncode == 0:
m = re.search(r"ROCm version:\s*(\d+)\.(\d+)", result.stdout)
if m:
return int(m.group(1)), int(m.group(2))
except Exception:
pass
hipconfig = shutil.which("hipconfig")
if hipconfig:
try:
result = subprocess.run(
[hipconfig, "--version"],
stdout = subprocess.PIPE,
stderr = subprocess.DEVNULL,
text = True,
timeout = 5,
)
if result.returncode == 0:
raw = (result.stdout or "").strip().split("\n")[0]
parts = raw.split(".")
if (
len(parts) >= 2
and parts[0].isdigit()
and parts[1].split("-")[0].isdigit()
):
return int(parts[0]), int(parts[1].split("-")[0])
except Exception:
pass
# Distro package-manager fallbacks. Mirrors install.sh::get_torch_index_url
# and _detect_rocm_version() in install_python_stack.py so package-managed
# ROCm hosts without /opt/rocm/.info/version still report a usable version
# and the <= host version filter in resolve_upstream_asset_choice picks
# the correct upstream prebuilt instead of the newest-regardless fallback.
for _cmd in (
["dpkg-query", "-W", "-f=${Version}\n", "rocm-core"],
["rpm", "-q", "--qf", "%{VERSION}\n", "rocm-core"],
):
_exe = shutil.which(_cmd[0])
if not _exe:
continue
try:
_result = subprocess.run(
[_exe, *_cmd[1:]],
stdout = subprocess.PIPE,
stderr = subprocess.DEVNULL,
text = True,
timeout = 5,
)
except Exception:
continue
if _result.returncode != 0 or not _result.stdout.strip():
continue
_raw = _result.stdout.strip()
# dpkg can prepend an epoch ("1:6.3.0-1"); strip it before parsing.
_raw = re.sub(r"^\d+:", "", _raw)
_m = re.match(r"(\d+)[.-](\d+)", _raw)
if _m:
return int(_m.group(1)), int(_m.group(2))
return None
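The dpkg/rpm fallback reduces to a two-step parse (epoch strip, then a major/minor match); a standalone sketch of the same logic:

```python
import re

def parse_rocm_package_version(raw: str):
    # Strip a Debian epoch ("1:6.3.0-1" -> "6.3.0-1"), then accept either
    # "." or "-" between major and minor, mirroring the regex above.
    raw = re.sub(r"^\d+:", "", raw.strip())
    m = re.match(r"(\d+)[.-](\d+)", raw)
    return (int(m.group(1)), int(m.group(2))) if m else None
```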
def resolve_upstream_asset_choice(host: HostInfo, llama_tag: str) -> AssetChoice:
upstream_assets = github_release_assets(UPSTREAM_REPO, llama_tag)
if host.is_linux and host.is_x86_64:
# AMD ROCm: try upstream ROCm prebuilt first, then fall back to source build.
# Source build (via setup.sh) compiles with -DGGML_HIP=ON and auto-detects
# the exact GPU target via rocminfo, which is more reliable for consumer
# GPUs (e.g. gfx1151) that may not be in the prebuilt.
if host.has_rocm and not host.has_usable_nvidia:
# Scan upstream assets for any rocm-<version> prebuilt. When the
# host ROCm runtime version is known, pick the newest candidate
# whose major.minor is <= host version -- otherwise a ROCm 6.4
# host would download the rocm-7.2 tarball, fail preflight, and
# fall back to a source build even though a compatible 6.4
# prebuilt exists. If no compatible candidate matches (e.g. host
# runtime is older than every published prebuilt), fall back to
# the numerically newest so we at least try something.
_rocm_pattern = re.compile(
rf"llama-{re.escape(llama_tag)}-bin-ubuntu-rocm-([0-9]+(?:\.[0-9]+)*)-x64\.tar\.gz"
)
rocm_candidates: list[tuple[tuple[int, ...], str]] = []
for _name in upstream_assets:
_m = _rocm_pattern.match(_name)
if _m is None:
continue
_parts = tuple(int(p) for p in _m.group(1).split("."))
rocm_candidates.append((_parts, _name))
rocm_candidates.sort(reverse = True)
_host_rocm_version = _detect_host_rocm_version()
_compatible: list[tuple[tuple[int, ...], str]] = rocm_candidates
if _host_rocm_version is not None:
_compatible = [
item
for item in rocm_candidates
if item[0][:2] <= _host_rocm_version
]
if rocm_candidates and not _compatible:
# Fall back to the newest candidate so a source build is
# not forced when the host runtime is older than every
# published prebuilt: preflight will still catch a true
# incompatibility and trigger a fallback.
_compatible = rocm_candidates[:1]
if _compatible:
rocm_name = _compatible[0][1]
if _host_rocm_version is not None:
log(
f"AMD ROCm {_host_rocm_version[0]}.{_host_rocm_version[1]} "
f"detected -- trying upstream prebuilt {rocm_name}"
)
else:
log(f"AMD ROCm detected -- trying upstream prebuilt {rocm_name}")
log(
"Note: if your ROCm runtime version differs significantly, "
"this may fail preflight and fall back to a source build (safe)"
)
return AssetChoice(
repo = UPSTREAM_REPO,
tag = llama_tag,
name = rocm_name,
url = upstream_assets[rocm_name],
source_label = "upstream",
install_kind = "linux-rocm",
)
# No ROCm prebuilt available -- fall back to source build
raise PrebuiltFallback(
"AMD ROCm detected but no upstream ROCm prebuilt found; "
"falling back to source build with HIP support"
)
upstream_name = f"llama-{llama_tag}-bin-ubuntu-x64.tar.gz"
if upstream_name not in upstream_assets:
raise PrebuiltFallback("upstream Linux CPU asset was not found")
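The candidate-selection policy in the hunk above can be summarised as a pure function (a sketch, assuming `candidates` is the list of `(version_tuple, asset_name)` pairs built from the asset scan):

```python
def pick_rocm_prebuilt(candidates, host_version):
    """Newest candidate whose major.minor is <= the host ROCm version;
    if none match (host older than every prebuilt), fall back to the
    newest overall and let preflight catch a real incompatibility."""
    candidates = sorted(candidates, reverse=True)
    if not candidates:
        return None
    if host_version is not None:
        compatible = [c for c in candidates if c[0][:2] <= host_version]
        if compatible:
            return compatible[0][1]
    return candidates[0][1]
```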
@@ -2948,6 +3172,25 @@ def resolve_upstream_asset_choice(host: HostInfo, llama_tag: str) -> AssetChoice
return attempts[0]
raise PrebuiltFallback("no compatible Windows CUDA asset was found")
# AMD ROCm on Windows: try HIP prebuilt
if host.has_rocm:
hip_name = f"llama-{llama_tag}-bin-win-hip-radeon-x64.zip"
if hip_name in upstream_assets:
log(
f"AMD ROCm detected on Windows -- trying upstream HIP prebuilt {hip_name}"
)
return AssetChoice(
repo = UPSTREAM_REPO,
tag = llama_tag,
name = hip_name,
url = upstream_assets[hip_name],
source_label = "upstream",
install_kind = "windows-hip",
)
log(
"AMD ROCm detected on Windows but no HIP prebuilt found -- falling back to CPU"
)
upstream_name = f"llama-{llama_tag}-bin-win-cpu-x64.zip"
if upstream_name not in upstream_assets:
raise PrebuiltFallback("upstream Windows CPU asset was not found")
@@ -3029,7 +3272,16 @@ def resolve_release_asset_choice(
published_choice: AssetChoice | None = None
if host.is_windows and host.is_x86_64:
published_choice = published_asset_choice_for_kind(release, "windows-cpu")
# AMD Windows hosts should prefer a hash-approved published
# Windows HIP bundle when one exists, but otherwise fall through
# to resolve_asset_choice() so the upstream HIP prebuilt is
# tried before the CPU fallback. Hard-pinning the published
# windows-cpu bundle here would make the new HIP path
# unreachable.
if host.has_rocm:
published_choice = published_asset_choice_for_kind(release, "windows-hip")
else:
published_choice = published_asset_choice_for_kind(release, "windows-cpu")
elif host.is_macos and host.is_arm64:
published_choice = published_asset_choice_for_kind(release, "macos-arm64")
elif host.is_macos and host.is_x86_64:
@@ -3378,7 +3630,7 @@ def overlay_directory_for_choice(
def runtime_patterns_for_choice(choice: AssetChoice) -> list[str]:
if choice.install_kind in {"linux-cpu", "linux-cuda"}:
if choice.install_kind in {"linux-cpu", "linux-cuda", "linux-rocm"}:
return [
"llama-server",
"llama-quantize",
@@ -3388,11 +3640,12 @@ def runtime_patterns_for_choice(choice: AssetChoice) -> list[str]:
"libmtmd.so*",
"libggml-cpu-*.so*",
"libggml-cuda.so*",
"libggml-hip.so*",
"libggml-rpc.so*",
]
if choice.install_kind in {"macos-arm64", "macos-x64"}:
return ["llama-server", "llama-quantize", "lib*.dylib"]
if choice.install_kind in {"windows-cpu", "windows-cuda"}:
if choice.install_kind in {"windows-cpu", "windows-cuda", "windows-hip"}:
return ["*.exe", "*.dll"]
raise PrebuiltFallback(
f"unsupported install kind for runtime overlay: {choice.install_kind}"
@@ -4117,6 +4370,7 @@ def validate_server(
install_dir: Path,
*,
runtime_line: str | None = None,
install_kind: str | None = None,
) -> None:
last_failure: PrebuiltFallback | None = None
for port_attempt in range(1, SERVER_PORT_BIND_ATTEMPTS + 1):
@@ -4140,7 +4394,33 @@
"--batch-size",
"32",
]
if host.has_usable_nvidia or (host.is_macos and host.is_arm64):
# Only enable GPU offload for assets that actually ship GPU code.
# Gating on `host.has_rocm` alone breaks the intentional CPU
# fallback on AMD Windows hosts without a HIP prebuilt: the CPU
# binary would be launched with `--n-gpu-layers 1` and fail
# validation. Use the resolved install_kind as the source of
# truth and fall back to host detection when the caller did not
# pass one (keeps backwards compatibility with older call sites).
_gpu_kinds = {
"linux-cuda",
"linux-rocm",
"windows-cuda",
"windows-hip",
"macos-arm64",
}
if install_kind is not None:
_enable_gpu_layers = install_kind in _gpu_kinds
else:
# Older call sites that don't pass install_kind: keep ROCm
# hosts in the GPU-validation path so an AMD-only Linux host
# is exercised against the actual hardware rather than the
# CPU fallback. NVIDIA and macOS-arm64 are already covered.
_enable_gpu_layers = (
host.has_usable_nvidia
or host.has_rocm
or (host.is_macos and host.is_arm64)
)
if _enable_gpu_layers:
command.extend(["--n-gpu-layers", "1"])
log_fd, log_name = tempfile.mkstemp(prefix = "llama-server-", suffix = ".log")
@@ -4664,10 +4944,21 @@ def runtime_payload_health_groups(choice: AssetChoice) -> list[list[str]]:
["libggml*.dylib"],
["libmtmd*.dylib"],
]
if choice.install_kind == "linux-rocm":
return [
["libllama.so*"],
["libggml.so*"],
["libggml-base.so*"],
["libggml-cpu-*.so*"],
["libmtmd.so*"],
["libggml-hip.so*"],
]
if choice.install_kind == "windows-cpu":
return [["llama.dll"]]
if choice.install_kind == "windows-cuda":
return [["llama.dll"], ["ggml-cuda.dll"]]
if choice.install_kind == "windows-hip":
return [["llama.dll"], ["*hip*.dll"]]
return []
@@ -4839,6 +5130,7 @@ def validate_prebuilt_choice(
host,
install_dir,
runtime_line = choice.runtime_line,
install_kind = choice.install_kind,
)
log(f"staged prebuilt validation succeeded for {choice.name}")
return server_path, quantize_path
@@ -25,6 +25,281 @@ IS_WINDOWS = sys.platform == "win32"
IS_MACOS = sys.platform == "darwin"
IS_MAC_INTEL = IS_MACOS and platform.machine() == "x86_64"
# ── ROCm / AMD GPU support ─────────────────────────────────────────────────────
# Mapping from detected ROCm (major, minor) to the best PyTorch wheel tag on
# download.pytorch.org. Entries are checked newest-first (>=).
# ROCm 7.2 only has torch 2.11.0 on download.pytorch.org, which exceeds the
# current torch upper bound (<2.11.0). Fall back to rocm7.1 (torch 2.10.0).
# TODO: uncomment rocm7.2 when torch upper bound is bumped to >=2.11.0
_ROCM_TORCH_INDEX: dict[tuple[int, int], str] = {
# (7, 2): "rocm7.2", # torch 2.11.0 -- requires torch>=2.11
(7, 1): "rocm7.1",
(7, 0): "rocm7.0",
(6, 4): "rocm6.4",
(6, 3): "rocm6.3",
(6, 2): "rocm6.2",
(6, 1): "rocm6.1",
(6, 0): "rocm6.0",
}
_PYTORCH_WHL_BASE = "https://download.pytorch.org/whl"
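The mapping is consumed later by a newest-first `>=` scan in `_ensure_rocm_torch()`; extracted as a standalone sketch, the lookup works like this:

```python
_ROCM_TORCH_INDEX = {
    (7, 1): "rocm7.1",
    (7, 0): "rocm7.0",
    (6, 4): "rocm6.4",
    (6, 3): "rocm6.3",
    (6, 2): "rocm6.2",
    (6, 1): "rocm6.1",
    (6, 0): "rocm6.0",
}

def pick_wheel_tag(ver):
    # Newest entry whose (major, minor) <= detected version; a ROCm 7.2
    # host therefore lands on rocm7.1 (the cap described above).
    return next(
        (tag for key, tag in sorted(_ROCM_TORCH_INDEX.items(), reverse=True)
         if ver >= key),
        None,
    )
```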
def _detect_rocm_version() -> tuple[int, int] | None:
"""Return (major, minor) of the installed ROCm stack, or None."""
# Check /opt/rocm/.info/version or ROCM_PATH equivalent
rocm_root = os.environ.get("ROCM_PATH") or "/opt/rocm"
for path in (
os.path.join(rocm_root, ".info", "version"),
os.path.join(rocm_root, "lib", "rocm_version"),
):
try:
with open(path) as fh:
parts = fh.read().strip().split("-")[0].split(".")
# Explicit length guard avoids relying on the broad except
# below to swallow IndexError when the version file contains
# a single component (e.g. "6\n" on a partial install).
if len(parts) >= 2:
return int(parts[0]), int(parts[1])
except Exception:
pass
# Try amd-smi version (outputs "... | ROCm version: X.Y.Z")
amd_smi = shutil.which("amd-smi")
if amd_smi:
try:
result = subprocess.run(
[amd_smi, "version"],
stdout = subprocess.PIPE,
stderr = subprocess.DEVNULL,
text = True,
timeout = 5,
)
if result.returncode == 0:
import re
m = re.search(r"ROCm version:\s*(\d+)\.(\d+)", result.stdout)
if m:
return int(m.group(1)), int(m.group(2))
except Exception:
pass
# Try hipconfig --version (outputs bare version like "6.3.21234.2")
hipconfig = shutil.which("hipconfig")
if hipconfig:
try:
result = subprocess.run(
[hipconfig, "--version"],
stdout = subprocess.PIPE,
stderr = subprocess.DEVNULL,
timeout = 5,
)
if result.returncode == 0:
raw = result.stdout.decode().strip().split("\n")[0]
parts = raw.split(".")
if (
len(parts) >= 2
and parts[0].isdigit()
and parts[1].split("-")[0].isdigit()
):
return int(parts[0]), int(parts[1].split("-")[0])
except Exception:
pass
# Distro package-manager fallbacks. Package-managed ROCm installs can
# expose GPUs via rocminfo / amd-smi but still lack /opt/rocm/.info/version
# and hipconfig, so probe dpkg (Debian/Ubuntu) and rpm (RHEL/Fedora/SUSE)
# for the rocm-core package version. Matches the chain in
# install.sh::get_torch_index_url so `unsloth studio update` behaves
# the same as a fresh `curl | sh` install.
import re as _re_pkg
for cmd in (
["dpkg-query", "-W", "-f=${Version}\n", "rocm-core"],
["rpm", "-q", "--qf", "%{VERSION}\n", "rocm-core"],
):
exe = shutil.which(cmd[0])
if not exe:
continue
try:
result = subprocess.run(
[exe, *cmd[1:]],
stdout = subprocess.PIPE,
stderr = subprocess.DEVNULL,
text = True,
timeout = 5,
)
except Exception:
continue
if result.returncode != 0 or not result.stdout.strip():
continue
raw = result.stdout.strip()
# dpkg can prepend an epoch ("1:6.3.0-1"); strip it before parsing.
raw = _re_pkg.sub(r"^\d+:", "", raw)
m = _re_pkg.match(r"(\d+)[.-](\d+)", raw)
if m:
return int(m.group(1)), int(m.group(2))
return None
def _has_rocm_gpu() -> bool:
"""Return True only if an actual AMD GPU is visible (not just ROCm tools installed)."""
import re
for cmd, check_fn in (
# rocminfo: look for "Name: gfxNNNN" with nonzero first digit (gfx000 is the CPU agent)
(["rocminfo"], lambda out: bool(re.search(r"gfx[1-9]", out.lower()))),
# amd-smi list: require "GPU: <number>" data rows, not just a header
(
["amd-smi", "list"],
lambda out: bool(re.search(r"(?im)^gpu\s*[:\[]\s*\d", out)),
),
):
exe = shutil.which(cmd[0])
if not exe:
continue
try:
result = subprocess.run(
[exe, *cmd[1:]],
stdout = subprocess.PIPE,
stderr = subprocess.DEVNULL,
text = True,
timeout = 10,
)
except Exception:
continue
if result.returncode == 0 and result.stdout.strip():
if check_fn(result.stdout):
return True
return False
def _has_usable_nvidia_gpu() -> bool:
"""Return True only when nvidia-smi exists AND reports at least one GPU."""
exe = shutil.which("nvidia-smi")
if not exe:
return False
try:
result = subprocess.run(
[exe, "-L"],
stdout = subprocess.PIPE,
stderr = subprocess.DEVNULL,
text = True,
timeout = 10,
)
except Exception:
return False
return result.returncode == 0 and "GPU " in result.stdout
def _ensure_rocm_torch() -> None:
"""Reinstall torch with ROCm wheels when the venv received CPU-only torch.
Runs only on Linux x86_64 hosts where an AMD GPU is present and the
ROCm runtime is detectable (rocminfo / amd-smi / hipconfig /
rocm-core package). No-op when torch already links against HIP
(ROCm), on Windows / macOS, on non-x86_64 Linux (PyTorch does not
publish ROCm wheels for aarch64 / arm64), or on mixed AMD+NVIDIA
hosts (NVIDIA takes precedence).
Uses pip_install() to respect uv, constraints, and --python targeting.
"""
# Explicit OS / architecture guards so the helper is safe to call
# from any context -- PyTorch only publishes ROCm wheels for
# linux_x86_64, so aarch64 / arm64 hosts must skip this repair path
# instead of failing the update with a missing-wheel error.
if IS_WINDOWS or IS_MACOS:
return
if platform.machine().lower() not in {"x86_64", "amd64"}:
return
# NVIDIA takes precedence on mixed hosts -- but only if an actual GPU is usable
if _has_usable_nvidia_gpu():
return
# Rely on _has_rocm_gpu() (rocminfo / amd-smi GPU data rows) as the
# authoritative "is this actually an AMD ROCm host?" signal. The old
# gate required /opt/rocm or hipcc to exist, which breaks on
# runtime-only ROCm installs (package-managed minimal installs,
# Radeon software) that ship amd-smi/rocminfo without /opt/rocm or
# hipcc, and leaves `unsloth studio update` unable to repair a
# CPU-only venv on those systems.
if not _has_rocm_gpu():
return # no AMD GPU visible
ver = _detect_rocm_version()
if ver is None:
print(" ROCm detected but version unreadable -- skipping torch reinstall")
return
# Probe whether torch already links against HIP (ROCm is already working).
# Do NOT skip for CUDA-only builds since they are unusable on AMD-only
# hosts (the NVIDIA check above already handled mixed AMD+NVIDIA setups).
try:
probe = subprocess.run(
[
sys.executable,
"-c",
"import torch; print(getattr(torch.version,'hip','') or '')",
],
stdout = subprocess.PIPE,
stderr = subprocess.DEVNULL,
timeout = 30,
)
except (OSError, subprocess.TimeoutExpired):
probe = None
has_hip_torch = (
probe is not None
and probe.returncode == 0
and probe.stdout.decode().strip() != ""
)
rocm_torch_ready = has_hip_torch
if not has_hip_torch:
# Select best matching wheel tag (newest ROCm version <= installed)
tag = next(
(
t
for (maj, mn), t in sorted(_ROCM_TORCH_INDEX.items(), reverse = True)
if ver >= (maj, mn)
),
None,
)
if tag is None:
print(
f" No PyTorch wheel for ROCm {ver[0]}.{ver[1]} -- "
f"skipping torch reinstall"
)
else:
index_url = f"{_PYTORCH_WHL_BASE}/{tag}"
print(f" ROCm {ver[0]}.{ver[1]} -- installing torch from {index_url}")
pip_install(
f"ROCm torch ({tag})",
"--force-reinstall",
"--no-cache-dir",
"torch>=2.4,<2.11.0",
"torchvision<0.26.0",
"torchaudio<2.11.0",
"--index-url",
index_url,
constrain = False,
)
rocm_torch_ready = True
# Install bitsandbytes only when the venv has a ROCm-compatible torch
# (either already present or just installed). Avoids leaving an AMD
# bitsandbytes on top of a CPU/CUDA torch on hosts where the ROCm
# runtime is older than any published torch wheel. Uses
# --force-reinstall so an existing CPU/CUDA bitsandbytes is replaced
# by the AMD build during upgrades.
if rocm_torch_ready:
pip_install(
"bitsandbytes (AMD)",
"--force-reinstall",
"--no-cache-dir",
"--no-deps",
"bitsandbytes>=0.49.1",
constrain = False,
)
def _infer_no_torch() -> bool:
"""Determine whether to run in no-torch (GGUF-only) mode.
@@ -414,6 +689,10 @@ def install_python_stack() -> int:
base_total = 10 if IS_WINDOWS else 11
if IS_MACOS:
base_total -= 1 # triton step is skipped on macOS
# ROCm torch check steps (Linux only, non-macOS, non-no-torch):
# one early check (step 2b) and one final repair (step 13).
if not IS_WINDOWS and not IS_MACOS and not NO_TORCH:
base_total += 2
_TOTAL = (base_total - 1) if skip_base else base_total
# 1. Try to use uv for faster installs (must happen before pip upgrade
@@ -537,6 +816,53 @@ def install_python_stack() -> int:
req = REQ_ROOT / "base.txt",
)
# 2b. AMD ROCm: reinstall torch with HIP wheels if the host has ROCm but the
# venv received CPU-only torch (common when pip resolves torch from PyPI).
# Must come immediately after base packages so torch is present for inspection.
if not IS_WINDOWS and not IS_MACOS and not NO_TORCH:
_progress("ROCm torch check")
_ensure_rocm_torch()
# Windows + AMD GPU: PyTorch does not publish ROCm wheels for Windows.
# Detect and warn so users know manual steps are needed for GPU training.
if IS_WINDOWS and not NO_TORCH and not _has_usable_nvidia_gpu():
# Validate actual AMD GPU presence (not just tool existence)
import re as _re_win
def _win_amd_smi_has_gpu(stdout: str) -> bool:
return bool(_re_win.search(r"(?im)^gpu\s*[:\[]\s*\d", stdout))
_win_amd_gpu = False
for _wcmd, _check_fn in (
(["hipinfo"], lambda out: "gcnarchname" in out.lower()),
(["amd-smi", "list"], _win_amd_smi_has_gpu),
):
_wexe = shutil.which(_wcmd[0])
if not _wexe:
continue
try:
_wr = subprocess.run(
[_wexe, *_wcmd[1:]],
stdout = subprocess.PIPE,
stderr = subprocess.DEVNULL,
text = True,
timeout = 10,
)
except Exception:
continue
if _wr.returncode == 0 and _check_fn(_wr.stdout):
_win_amd_gpu = True
break
if _win_amd_gpu:
_safe_print(
_dim(" Note:"),
"AMD GPU detected on Windows. ROCm-enabled PyTorch must be",
)
_safe_print(
" " * 8,
"installed manually. See: https://docs.unsloth.ai/get-started/install-and-update/amd",
)
# 3. Extra dependencies
_progress("unsloth extras")
pip_install(
@@ -650,7 +976,16 @@ def install_python_stack() -> int:
[sys.executable, str(SINGLE_ENV / "patch_metadata.py")],
)
# 13. Final check (silent; third-party conflicts are expected)
# 13. AMD ROCm: final torch repair. Multiple install steps above can
# pull in CUDA torch from PyPI (base packages, extras, overrides,
# studio deps, etc.). Running the repair as the very last step
# ensures ROCm torch is in place at runtime, regardless of which
# intermediate step clobbered it.
if not IS_WINDOWS and not IS_MACOS and not NO_TORCH:
_progress("ROCm torch (final)")
_ensure_rocm_torch()
# 14. Final check (silent; third-party conflicts are expected)
subprocess.run(
[sys.executable, "-m", "pip", "check"],
stdout = subprocess.DEVNULL,
@@ -928,6 +928,20 @@ else
_valid_gfx=""
for _gfx in $_gfx_list; do
if [[ "$_gfx" =~ ^gfx[0-9]{2,4}[a-z]?$ ]]; then
# Drop bare family-level targets (gfx10, gfx11, gfx12, ...)
# when a specific sibling is present in the same list.
# rocminfo on ROCm 6.1+ emits both the specific GPU and
# the LLVM generic family line (e.g. gfx1100 alongside
# gfx11-generic), and the outer grep above captures the
# bare family prefix from the generic line. Passing that
# bare prefix to -DGPU_TARGETS breaks the HIP/llama.cpp
# build because clang only accepts specific gfxNNN ids.
# No real AMD GPU has a 2-digit gfx id, so this filter
# can only ever drop family prefixes, never real targets.
if [[ "$_gfx" =~ ^gfx[0-9]{2}$ ]] \
&& echo "$_gfx_list" | grep -qE "^${_gfx}[0-9][0-9a-z]?$"; then
continue
fi
_valid_gfx="${_valid_gfx}${_valid_gfx:+;}$_gfx"
fi
done
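The shell filter above is easier to reason about (and test) as a Python sketch of the same rule; the helper name is illustrative:

```python
import re

def filter_gfx_targets(targets):
    # Drop a bare 2-digit family prefix ("gfx11", from a gfx11-generic
    # line) only when a specific sibling such as "gfx1100" is present;
    # real targets like "gfx90a", or a lone "gfx11", are kept.
    kept = []
    for target in targets:
        if re.fullmatch(r"gfx[0-9]{2}", target) and any(
            re.fullmatch(rf"{target}[0-9][0-9a-z]?", other) for other in targets
        ):
            continue
        kept.append(target)
    return kept
```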
@@ -88,10 +88,26 @@ def is_cdna():
@functools.lru_cache(1)
def is_rdna():
"""Detect ROCm-supported RDNA consumer/workstation GPUs (RDNA3, RDNA4)."""
"""Detect ROCm-supported RDNA consumer/workstation GPUs (RDNA2, RDNA3, RDNA3.5, RDNA4)."""
return is_hip() and triton.runtime.driver.active.get_current_target().arch in (
# RDNA2 (Navi 21-24)
"gfx1030",
"gfx1031",
"gfx1032",
"gfx1033",
"gfx1034",
"gfx1035",
"gfx1036",
# RDNA3 (Navi 31-33)
"gfx1100",
"gfx1101",
"gfx1102",
"gfx1103",
# RDNA3.5 (Strix Point / Strix Halo)
"gfx1150",
"gfx1151",
"gfx1152",
# RDNA4 (Navi 48-44)
"gfx1200",
"gfx1201",
)
@@ -1108,7 +1108,16 @@ def patch_sft_trainer_tokenizer():
" a = np.array([int(x.decode('utf-8'))/1024 for x in a])\n"
"except:\n"
" if not torch.cuda.is_available():\n"
" raise RuntimeError('Unsloth: We do not support AMD / Intel machines yet - it is a work in progress!')\n"
" raise RuntimeError('Unsloth: No GPU detected. AMD ROCm users: install ROCm-enabled PyTorch -- see https://docs.unsloth.ai/get-started/install-and-update/amd')\n"
" # nvidia-smi unavailable but torch.cuda IS available -- we are on\n"
" # a ROCm host (ROCm reuses the torch.cuda.* API surface, so\n"
" # device_count() is authoritative) or on a CUDA host without\n"
" # the CLI installed. Use the device count directly as a\n"
" # conservative multi-GPU signal: any configuration with more\n"
" # than one visible device is flagged as unsupported, matching\n"
" # the spirit of the per-device memory check used on CUDA.\n"
" if torch.cuda.device_count() > 1:\n"
" raise RuntimeError('Unsloth currently does not support multi GPU setups - but we are working on it!')\n"
"if ((a - PRE_CHECK) >= 1).sum() > 1:\n"
" raise RuntimeError('Unsloth currently does not support multi GPU setups - but we are working on it!')\n"
"for _ in range(3):\n"