Harden ROCm detection, Radeon wheel fallback, and HIP visibility

Addresses review findings from parallel reviewers on PR #4720:

- install.sh: add _has_usable_nvidia_gpu() helper requiring nvidia-smi -L
  to actually list a GPU before treating the host as NVIDIA. Fixes the
  stale-nvidia-smi-on-PATH regression where AMD-only hosts fell into the
  CUDA branch.
- install.sh: fix hipconfig awk blocks to propagate a non-zero exit code
  when the output is not a recognisable version string, so the ||-chain
  continues to dpkg-query / rpm instead of terminating early.
- install.sh: fail-closed on Radeon wheel fallback. When torch,
  torchvision or torchaudio is missing from the Radeon repo for the
  active Python tag, fall back to the standard ROCm index instead of
  silently mixing Radeon wheels with PyPI defaults. Quote all wheel
  arguments individually so wheel filenames cannot be word-split or
  glob-expanded.
- install_llama_prebuilt.py: detect_host() now requires nvidia-smi -L to
  list a GPU before setting has_physical_nvidia. Routes AMD ROCm hosts
  with a broken leftover nvidia-smi to the ROCm path instead of
  misclassifying them as NVIDIA.
- install_llama_prebuilt.py: scan upstream assets for any rocm-<version>
  prebuilt instead of hard-coding rocm-7.2, so ROCm 6.x / 7.0 / 7.1 / 7.3+
  users pick up a matching upstream prebuilt when one exists.
- install_llama_prebuilt.py: validate_server() adds --n-gpu-layers 1 for
  linux-rocm and windows-hip hosts, so new HIP prebuilts are preflighted
  on the GPU path instead of passing validation on CPU only.
- install_llama_prebuilt.py: restore the published windows-cpu fallback
  for AMD Windows hosts without a HIP prebuilt so hash-approved bundles
  are still preferred over the raw upstream CPU asset.
- install_python_stack.py: drop the /opt/rocm / hipcc gate in
  _ensure_rocm_torch() and rely on _has_rocm_gpu(). Runtime-only ROCm
  installs (package-managed minimal installs, Radeon software) that ship
  amd-smi / rocminfo without hipcc can now repair a CPU-only venv via
  "unsloth studio update". Adds an explicit IS_WINDOWS / IS_MACOS guard.
- studio/backend/utils/hardware/amd.py: honour HIP_VISIBLE_DEVICES /
  ROCR_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES in
  get_primary_gpu_utilization(). A process restricted to GPU 2 now
  reports metrics for GPU 2 instead of physical GPU 0. Tighten the plain
  bytes unit detection to an explicit allowlist.
- studio/backend/utils/hardware/hardware.py: route
  get_backend_visible_gpu_info()'s backend_cuda_visible_devices field
  through a helper that reads HIP_VISIBLE_DEVICES on ROCm. Drop the
  unconditional "(rocm=False)" suffix in apply_gpu_ids() logs.
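The env-resolution order described in the two hardware bullets above can be sketched as a standalone helper. This is a hypothetical, self-contained mirror of the behaviour the bullets describe (the function name and `env` parameter are illustrative, not the actual code in `amd.py`):

```python
import os

# Hypothetical sketch of the visibility resolution the hardware bullets
# describe: HIP_VISIBLE_DEVICES wins over ROCR_VISIBLE_DEVICES, which
# wins over CUDA_VISIBLE_DEVICES; "" or "-1" means zero GPUs visible.
def first_visible_gpu_id(env=None):
    env = os.environ if env is None else env
    for name in ("HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES"):
        raw = env.get(name)
        if raw is None:
            continue  # unset: fall through to the next, lower-precedence variable
        raw = raw.strip()
        if raw in ("", "-1"):
            return None  # explicitly narrowed to zero GPUs: caller reports unavailable
        first = raw.split(",", 1)[0].strip()
        return first or "0"
    return "0"  # no restriction set: physical GPU 0 is primary
```

With this shape, a process launched with `HIP_VISIBLE_DEVICES=2` reports metrics for GPU 2 rather than physical GPU 0, which is the regression the commit fixes.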
Daniel Han 2026-04-08 09:20:14 +00:00
parent c6f5b3af32
commit 810b833b01
5 changed files with 168 additions and 44 deletions


@@ -987,12 +987,28 @@ _has_amd_rocm_gpu() {
rocminfo 2>/dev/null | awk '/Name:[[:space:]]*gfx[1-9]/{found=1} END{exit !found}'; then
return 0
elif command -v amd-smi >/dev/null 2>&1 && \
amd-smi list 2>/dev/null | awk '/^GPU[[:space:]]*[:\[]/{ found=1 } END{ exit !found }'; then
amd-smi list 2>/dev/null | awk '/^GPU[[:space:]]*[:\[][[:space:]]*[0-9]/{ found=1 } END{ exit !found }'; then
return 0
fi
return 1
}
# ── NVIDIA usable-GPU helper ──
# Returns 0 (true) only if nvidia-smi is present AND actually lists a GPU.
# Prevents AMD-only hosts with a stale nvidia-smi on PATH from being routed
# into the CUDA branch.
_has_usable_nvidia_gpu() {
_nvsmi=""
if command -v nvidia-smi >/dev/null 2>&1; then
_nvsmi="nvidia-smi"
elif [ -x "/usr/bin/nvidia-smi" ]; then
_nvsmi="/usr/bin/nvidia-smi"
else
return 1
fi
"$_nvsmi" -L 2>/dev/null | awk '/^GPU[[:space:]]+[0-9]+:/{found=1} END{exit !found}'
}
# ── Detect GPU and choose PyTorch index URL ──
# Mirrors Get-TorchIndexUrl in install.ps1.
# On CPU-only machines this returns the cpu index, avoiding the solver
@@ -1001,12 +1017,17 @@ get_torch_index_url() {
_base="https://download.pytorch.org/whl"
# macOS: always CPU (no CUDA support)
case "$(uname -s)" in Darwin) echo "$_base/cpu"; return ;; esac
# Try nvidia-smi
# Try nvidia-smi -- require the binary to actually list a usable GPU.
# Presence of the binary alone (container leftovers, stale driver
# packages) is not sufficient: otherwise an AMD-only host would
# silently install CUDA wheels.
_smi=""
if command -v nvidia-smi >/dev/null 2>&1; then
_smi="nvidia-smi"
elif [ -x "/usr/bin/nvidia-smi" ]; then
_smi="/usr/bin/nvidia-smi"
if _has_usable_nvidia_gpu; then
if command -v nvidia-smi >/dev/null 2>&1; then
_smi="nvidia-smi"
elif [ -x "/usr/bin/nvidia-smi" ]; then
_smi="/usr/bin/nvidia-smi"
fi
fi
if [ -z "$_smi" ]; then
# No NVIDIA GPU -- check for AMD ROCm GPU
@@ -1021,7 +1042,7 @@ get_torch_index_url() {
{ [ -r /opt/rocm/.info/version ] && \
awk -F. '{print "rocm"$1"."$2; exit}' /opt/rocm/.info/version; } || \
{ command -v hipconfig >/dev/null 2>&1 && \
hipconfig --version 2>/dev/null | awk 'NR==1{split($1,a,"."); if(a[1]+0>0) print "rocm"a[1]"."a[2]}'; } || \
hipconfig --version 2>/dev/null | awk 'NR==1 && /^[0-9]/{split($1,a,"."); if(a[1]+0>0){print "rocm"a[1]"."a[2]; found=1}} END{exit !found}'; } || \
{ command -v dpkg-query >/dev/null 2>&1 && \
ver="$(dpkg-query -W -f='${Version}\n' rocm-core 2>/dev/null)" && \
[ -n "$ver" ] && \
@@ -1087,7 +1108,7 @@ get_radeon_wheel_url() {
{ [ -r /opt/rocm/.info/version ] && \
awk -F'[.-]' 'NF>=3{print $1"."$2"."$3; exit}' /opt/rocm/.info/version; } || \
{ command -v hipconfig >/dev/null 2>&1 && \
hipconfig --version 2>/dev/null | awk 'NR==1 && /^[0-9]+\.[0-9]+\.[0-9]/{print $1}'; }) 2>/dev/null
hipconfig --version 2>/dev/null | awk 'NR==1 && /^[0-9]+\.[0-9]+\.[0-9]/{print $1; found=1} END{exit !found}'; }) 2>/dev/null
# Validate: must be X.Y.Z with X >= 1
case "$_full_ver" in
@@ -1241,18 +1262,31 @@ elif [ -n "$TORCH_INDEX_URL" ]; then
fi
if [ "$_radeon_listing_ok" = true ]; then
substep "installing PyTorch from Radeon repo (${_RADEON_BASE_URL})..."
_torch_arg="torch"; _tv_arg="torchvision"; _ta_arg="torchaudio"; _tri_arg=""
_torch_whl=$(_pick_radeon_wheel "torch" 2>/dev/null) && _torch_arg="$_torch_whl"
_tv_whl=$(_pick_radeon_wheel "torchvision" 2>/dev/null) && _tv_arg="$_tv_whl"
_ta_whl=$(_pick_radeon_wheel "torchaudio" 2>/dev/null) && _ta_arg="$_ta_whl"
_tri_whl=$(_pick_radeon_wheel "triton" 2>/dev/null) && _tri_arg="$_tri_whl"
# Build install args; skip empty _tri_arg to avoid passing "" to uv
_radeon_pkgs="$_torch_arg $_tv_arg $_ta_arg"
[ -n "$_tri_arg" ] && _radeon_pkgs="$_tri_arg $_radeon_pkgs"
run_install_cmd "install triton + PyTorch" uv pip install --python "$_VENV_PY" \
--find-links "$_RADEON_BASE_URL" \
$_radeon_pkgs
# Require torch, torchvision, torchaudio wheels to all resolve
# from the Radeon listing. If any is missing for this Python
# tag, fall through to the standard ROCm index instead of
# silently mixing Radeon wheels with PyPI defaults.
_torch_whl=$(_pick_radeon_wheel "torch" 2>/dev/null) || _torch_whl=""
_tv_whl=$(_pick_radeon_wheel "torchvision" 2>/dev/null) || _tv_whl=""
_ta_whl=$(_pick_radeon_wheel "torchaudio" 2>/dev/null) || _ta_whl=""
_tri_whl=$(_pick_radeon_wheel "triton" 2>/dev/null) || _tri_whl=""
if [ -z "$_torch_whl" ] || [ -z "$_tv_whl" ] || [ -z "$_ta_whl" ]; then
substep "[WARN] Radeon repo lacks a complete wheel set for this Python; falling back to ROCm index ($TORCH_INDEX_URL)" "$C_WARN"
run_install_cmd "install PyTorch" uv pip install --python "$_VENV_PY" \
"$TORCH_CONSTRAINT" torchvision torchaudio \
--index-url "$TORCH_INDEX_URL"
else
substep "installing PyTorch from Radeon repo (${_RADEON_BASE_URL})..."
if [ -n "$_tri_whl" ]; then
run_install_cmd "install triton + PyTorch" uv pip install --python "$_VENV_PY" \
--find-links "$_RADEON_BASE_URL" \
"$_tri_whl" "$_torch_whl" "$_tv_whl" "$_ta_whl"
else
run_install_cmd "install PyTorch" uv pip install --python "$_VENV_PY" \
--find-links "$_RADEON_BASE_URL" \
"$_torch_whl" "$_tv_whl" "$_ta_whl"
fi
fi
else
substep "[WARN] Radeon repo unavailable; falling back to ROCm index ($TORCH_INDEX_URL)" "$C_WARN"
run_install_cmd "install PyTorch" uv pip install --python "$_VENV_PY" \

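The fixed `hipconfig` awk blocks above boil down to one rule: accept only output whose first line starts with a numeric `X.Y` version, and report failure otherwise so the `||` fallback chain continues to the next detection source. A rough Python equivalent of that parse (function name hypothetical, for illustration only):

```python
import re

def rocm_index_suffix(hipconfig_output):
    # Mirrors the fixed awk block: only a leading X.Y version with a
    # non-zero major is accepted; anything else returns None so the
    # caller can try the next source (dpkg-query / rpm) instead of
    # stopping the fallback chain on garbage output.
    lines = hipconfig_output.splitlines() if hipconfig_output else []
    first = lines[0].strip() if lines else ""
    m = re.match(r"^([0-9]+)\.([0-9]+)", first)
    if not m or int(m.group(1)) == 0:
        return None
    return f"rocm{m.group(1)}.{m.group(2)}"
```

The old awk block printed nothing on unrecognised output but still exited 0, which terminated the `||` chain early; propagating a non-zero exit is what keeps the chain alive.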

@@ -10,6 +10,7 @@ nvidia.py counterparts.
import json
import math
import os
import re
import subprocess
from typing import Any, Optional
@@ -93,9 +94,7 @@ def _parse_memory_mb(value: Any) -> Optional[float]:
return num
if "kib" in unit or "kb" in unit:
return num / 1024
if unit and (
"b" in unit and "g" not in unit and "m" not in unit and "k" not in unit
):
if unit in ("b", "byte", "bytes"):
# Plain bytes
return num / (1024 * 1024)
@@ -203,9 +202,34 @@ def get_physical_gpu_count() -> Optional[int]:
return None
def _first_visible_amd_gpu_id() -> Optional[str]:
"""Return the physical AMD GPU id that should be treated as 'primary'.
Honours HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES
in that order (HIP respects all three). Returns ``"0"`` when none are
set, and ``None`` when the env var explicitly narrows to zero GPUs
("" or "-1"), so callers can short-circuit to "available: False".
"""
for env_name in ("HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES"):
raw = os.environ.get(env_name)
if raw is None:
continue
raw = raw.strip()
if raw == "" or raw == "-1":
return None
first = raw.split(",", 1)[0].strip()
if first:
return first
break
return "0"
def get_primary_gpu_utilization() -> dict[str, Any]:
"""Return utilization metrics for the primary AMD GPU."""
data = _run_amd_smi("metric", "-g", "0")
"""Return utilization metrics for the primary visible AMD GPU."""
gpu_idx = _first_visible_amd_gpu_id()
if gpu_idx is None:
return {"available": False}
data = _run_amd_smi("metric", "-g", gpu_idx)
if data is None:
return {"available": False}

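The tightened unit handling above replaces the old substring heuristic (`"b" in unit` with g/m/k exclusions, which misfires on arbitrary strings containing a `b`) with an explicit allowlist for plain bytes. A minimal sketch of the parse, with the MiB/GiB branches assumed from context since the hunk only shows the surrounding lines:

```python
def parse_memory_mb(num, unit):
    # Assumed overall shape; only the kib/kb and plain-bytes branches
    # appear in the diff above, the rest is reconstructed from context.
    unit = unit.strip().lower()
    if "mib" in unit or "mb" in unit:
        return num
    if "gib" in unit or "gb" in unit:
        return num * 1024
    if "kib" in unit or "kb" in unit:
        return num / 1024
    if unit in ("b", "byte", "bytes"):
        # Explicit allowlist: only genuine byte units reach this branch,
        # unlike the old '"b" in unit' substring check.
        return num / (1024 * 1024)
    return None
```

Under the old heuristic a unit like `"blobs"` would have been treated as bytes; the allowlist rejects it.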

@@ -1187,6 +1187,19 @@ def get_physical_gpu_count() -> int:
return _physical_gpu_count
def _backend_visible_devices_env() -> Optional[str]:
"""Return the raw visibility env string that applies to this backend.
On ROCm, HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES take precedence
over CUDA_VISIBLE_DEVICES; the helper mirrors the resolution logic in
``_get_parent_visible_gpu_spec`` so ``backend_cuda_visible_devices``
reports the value that is actually narrowing the visible device set.
"""
if IS_ROCM:
return _get_parent_visible_gpu_spec().get("raw")
return os.environ.get("CUDA_VISIBLE_DEVICES")
def get_backend_visible_gpu_info() -> Dict[str, Any]:
device = get_device()
if device in (DeviceType.CUDA, DeviceType.XPU):
@@ -1232,7 +1245,7 @@ def get_backend_visible_gpu_info() -> Dict[str, Any]:
return {
"available": True,
"backend": device.value,
"backend_cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
"backend_cuda_visible_devices": _backend_visible_devices_env(),
"parent_visible_gpu_ids": parent_visible_ids,
"devices": devices,
"index_kind": index_kind,
@@ -1241,7 +1254,7 @@ def get_backend_visible_gpu_info() -> Dict[str, Any]:
return {
"available": False,
"backend": device.value,
"backend_cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
"backend_cuda_visible_devices": _backend_visible_devices_env(),
"parent_visible_gpu_ids": parent_visible_ids,
"devices": [],
"index_kind": "physical",
@@ -1348,7 +1361,10 @@ def apply_gpu_ids(gpu_ids) -> None:
os.environ["HIP_VISIBLE_DEVICES"] = value
os.environ["ROCR_VISIBLE_DEVICES"] = value
_visible_gpu_count = None
logger.info("Applied gpu_ids: CUDA_VISIBLE_DEVICES='%s' (rocm=%s)", value, IS_ROCM)
if IS_ROCM:
logger.info("Applied gpu_ids: CUDA_VISIBLE_DEVICES='%s' (rocm)", value)
else:
logger.info("Applied gpu_ids: CUDA_VISIBLE_DEVICES='%s'", value)
def get_device_map(


@@ -2494,12 +2494,27 @@ def detect_host() -> HostInfo:
has_physical_nvidia = False
has_usable_nvidia = False
if nvidia_smi:
# Require `nvidia-smi -L` to actually list a GPU before treating the
# host as NVIDIA. The banner text "NVIDIA-SMI ..." is printed even
# when the command fails to communicate with the driver (e.g. stale
# container leftovers), which would otherwise misclassify an AMD
# ROCm host as NVIDIA and short-circuit the ROCm path.
try:
listing = run_capture([nvidia_smi, "-L"], timeout = 20)
gpu_lines = [
line
for line in listing.stdout.splitlines()
if line.startswith("GPU ")
]
if gpu_lines:
has_physical_nvidia = True
has_usable_nvidia = visible_device_tokens != []
except Exception:
pass
try:
result = run_capture([nvidia_smi], timeout = 20)
merged = "\n".join(part for part in (result.stdout, result.stderr) if part)
if "NVIDIA-SMI" in merged:
has_physical_nvidia = True
has_usable_nvidia = visible_device_tokens != []
for line in merged.splitlines():
if "CUDA Version:" in line:
raw = line.split("CUDA Version:", 1)[1].strip().split()[0]
@@ -2981,11 +2996,28 @@ def resolve_upstream_asset_choice(host: HostInfo, llama_tag: str) -> AssetChoice
# the exact GPU target via rocminfo, which is more reliable for consumer
# GPUs (e.g. gfx1151) that may not be in the prebuilt.
if host.has_rocm and not host.has_usable_nvidia:
rocm_name = f"llama-{llama_tag}-bin-ubuntu-rocm-7.2-x64.tar.gz"
if rocm_name in upstream_assets:
# Scan upstream assets for any rocm-<version> prebuilt and prefer
# the newest one. Hardcoding a single rocm-7.2 filename means
# ROCm 6.x / 7.0 / 7.1 / 7.3+ users always fall through to a
# source build even when a matching prebuilt exists upstream.
import re as _re_rocm
_rocm_pattern = _re_rocm.compile(
rf"llama-{_re_rocm.escape(llama_tag)}-bin-ubuntu-rocm-([0-9]+(?:\.[0-9]+)*)-x64\.tar\.gz"
)
rocm_candidates: list[tuple[tuple[int, ...], str]] = []
for _name in upstream_assets:
_m = _rocm_pattern.match(_name)
if _m is None:
continue
_parts = tuple(int(p) for p in _m.group(1).split("."))
rocm_candidates.append((_parts, _name))
rocm_candidates.sort(reverse = True)
if rocm_candidates:
rocm_name = rocm_candidates[0][1]
log(f"AMD ROCm detected -- trying upstream prebuilt {rocm_name}")
log(
"Note: prebuilt is compiled for ROCm 7.2; if your ROCm version differs, "
"Note: if your ROCm runtime version differs significantly, "
"this may fail preflight and fall back to a source build (safe)"
)
return AssetChoice(
@@ -3120,7 +3152,13 @@ def resolve_release_asset_choice(
)
published_choice: AssetChoice | None = None
if host.is_windows and host.is_x86_64 and not host.has_rocm:
if host.is_windows and host.is_x86_64:
# Always try the published Windows CPU bundle, even on AMD ROCm
# hosts. If a windows-hip bundle is added to published releases
# in the future, the upstream resolver below would pick it first
# via resolve_asset_choice; falling back to the hash-approved
# windows-cpu bundle is still better than the upstream CPU
# asset for AMD Windows hosts without a HIP prebuilt.
published_choice = published_asset_choice_for_kind(release, "windows-cpu")
elif host.is_macos and host.is_arm64:
published_choice = published_asset_choice_for_kind(release, "macos-arm64")
@@ -4233,7 +4271,11 @@ def validate_server(
"--batch-size",
"32",
]
if host.has_usable_nvidia or (host.is_macos and host.is_arm64):
if (
host.has_usable_nvidia
or host.has_rocm
or (host.is_macos and host.is_arm64)
):
command.extend(["--n-gpu-layers", "1"])
log_fd, log_name = tempfile.mkstemp(prefix = "llama-server-", suffix = ".log")

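The upstream asset scan in `resolve_upstream_asset_choice()` reduces to a small standalone sketch (helper name hypothetical): collect every `rocm-<version>` match for the active llama tag, parse the version into an integer tuple so `7.10` sorts above `7.2`, and prefer the newest:

```python
import re

def pick_rocm_asset(assets, llama_tag):
    # Match any rocm-<version> prebuilt for this tag instead of
    # hard-coding a single rocm-7.2 filename.
    pattern = re.compile(
        rf"llama-{re.escape(llama_tag)}-bin-ubuntu-rocm-([0-9]+(?:\.[0-9]+)*)-x64\.tar\.gz"
    )
    candidates = []
    for name in assets:
        m = pattern.fullmatch(name)
        if m:
            # Integer tuples compare numerically, so (7, 10) > (7, 2).
            candidates.append((tuple(int(p) for p in m.group(1).split(".")), name))
    return max(candidates)[1] if candidates else None
```

This is why ROCm 6.x / 7.0 / 7.1 / 7.3+ users now pick up a matching prebuilt whenever one exists upstream, rather than always falling through to a source build.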

@@ -157,20 +157,28 @@ def _has_usable_nvidia_gpu() -> bool:
def _ensure_rocm_torch() -> None:
"""Reinstall torch with ROCm wheels when the venv received CPU-only torch.
Runs only on Linux hosts where ROCm is installed and an AMD GPU is
present. No-op when torch already links against HIP (ROCm) or CUDA
(NVIDIA). Skips on Windows/macOS and on mixed AMD+NVIDIA hosts
(NVIDIA takes precedence).
Runs only on Linux hosts where an AMD GPU is present and the ROCm
runtime is detectable (rocminfo / amd-smi / hipconfig / rocm-core
package). No-op when torch already links against HIP (ROCm) or on
Windows/macOS or on mixed AMD+NVIDIA hosts (NVIDIA takes precedence).
Uses pip_install() to respect uv, constraints, and --python targeting.
"""
# Explicit OS guard so the helper is safe to call from any context --
# ROCm wheels are only published for Linux x86_64.
if IS_WINDOWS or IS_MACOS:
return
# NVIDIA takes precedence on mixed hosts -- but only if an actual GPU is usable
if _has_usable_nvidia_gpu():
return
rocm_root = os.environ.get("ROCM_PATH") or "/opt/rocm"
if not os.path.isdir(rocm_root) and not shutil.which("hipcc"):
return # no ROCm toolchain
# Rely on _has_rocm_gpu() (rocminfo / amd-smi GPU data rows) as the
# authoritative "is this actually an AMD ROCm host?" signal. The old
# gate required /opt/rocm or hipcc to exist, which breaks on
# runtime-only ROCm installs (package-managed minimal installs,
# Radeon software) that ship amd-smi/rocminfo without /opt/rocm or
# hipcc, and leaves `unsloth studio update` unable to repair a
# CPU-only venv on those systems.
if not _has_rocm_gpu():
return # ROCm tools present but no AMD GPU
return # no AMD GPU visible
ver = _detect_rocm_version()
if ver is None: