Harden ROCm detection, Radeon wheel fallback, and HIP visibility

Addresses review findings from parallel reviewers on PR #4720:

- install.sh: add _has_usable_nvidia_gpu() helper requiring nvidia-smi -L
  to actually list a GPU before treating the host as NVIDIA. Fixes the
  stale-nvidia-smi-on-PATH regression where AMD-only hosts fell into the
  CUDA branch.
- install.sh: fix hipconfig awk blocks to propagate a non-zero exit code
  when the output is not a recognisable version string, so the ||-chain
  continues to dpkg-query / rpm instead of terminating early.
- install.sh: fail-closed on Radeon wheel fallback. When torch,
  torchvision or torchaudio is missing from the Radeon repo for the
  active Python tag, fall back to the standard ROCm index instead of
  silently mixing Radeon wheels with PyPI defaults. Quote all wheel
  arguments individually so wheel filenames cannot be word-split or
  glob-expanded.
- install_llama_prebuilt.py: detect_host() now requires nvidia-smi -L to
  list a GPU before setting has_physical_nvidia. Routes AMD ROCm hosts
  with a broken leftover nvidia-smi to the ROCm path instead of
  misclassifying them as NVIDIA.
- install_llama_prebuilt.py: scan upstream assets for any rocm-<version>
  prebuilt instead of hard-coding rocm-7.2, so ROCm 6.x / 7.0 / 7.1 / 7.3+
  users pick up a matching upstream prebuilt when one exists.
- install_llama_prebuilt.py: validate_server() adds --n-gpu-layers 1 for
  linux-rocm and windows-hip hosts, so new HIP prebuilts are preflighted
  on the GPU path instead of passing validation on CPU only.
- install_llama_prebuilt.py: restore the published windows-cpu fallback
  for AMD Windows hosts without a HIP prebuilt so hash-approved bundles
  are still preferred over the raw upstream CPU asset.
- install_python_stack.py: drop the /opt/rocm / hipcc gate in
  _ensure_rocm_torch() and rely on _has_rocm_gpu(). Runtime-only ROCm
  installs (package-managed minimal installs, Radeon software) that ship
  amd-smi / rocminfo without hipcc can now repair a CPU-only venv via
  "unsloth studio update". Adds an explicit IS_WINDOWS / IS_MACOS guard.
- studio/backend/utils/hardware/amd.py: honour HIP_VISIBLE_DEVICES /
  ROCR_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES in
  get_primary_gpu_utilization(). A process restricted to GPU 2 now
  reports metrics for GPU 2 instead of physical GPU 0. Tighten the plain
  bytes unit detection to an explicit allowlist.
- studio/backend/utils/hardware/hardware.py: route
  get_backend_visible_gpu_info()'s backend_cuda_visible_devices field
  through a helper that reads HIP_VISIBLE_DEVICES on ROCm. Drop the
  unconditional "(rocm=False)" suffix in apply_gpu_ids() logs.
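The env-resolution order described in the two hardware bullets above can be sketched as a standalone helper. This is a hypothetical, self-contained mirror of the behaviour the bullets describe (the function name and `env` parameter are illustrative, not the actual code in `amd.py`):

```python
import os

# Hypothetical sketch of the visibility resolution the hardware bullets
# describe: HIP_VISIBLE_DEVICES wins over ROCR_VISIBLE_DEVICES, which
# wins over CUDA_VISIBLE_DEVICES; "" or "-1" means zero GPUs visible.
def first_visible_gpu_id(env=None):
    env = os.environ if env is None else env
    for name in ("HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES"):
        raw = env.get(name)
        if raw is None:
            continue  # unset: fall through to the next, lower-precedence variable
        raw = raw.strip()
        if raw in ("", "-1"):
            return None  # explicitly narrowed to zero GPUs: caller reports unavailable
        first = raw.split(",", 1)[0].strip()
        return first or "0"
    return "0"  # no restriction set: physical GPU 0 is primary
```

With this shape, a process launched with `HIP_VISIBLE_DEVICES=2` reports metrics for GPU 2 rather than physical GPU 0, which is the regression the commit fixes.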
Daniel Han 2026-04-08 09:20:14 +00:00
parent c6f5b3af32
commit 810b833b01
5 changed files with 168 additions and 44 deletions


@@ -987,12 +987,28 @@ _has_amd_rocm_gpu() {
rocminfo 2>/dev/null | awk '/Name:[[:space:]]*gfx[1-9]/{found=1} END{exit !found}'; then
return 0
elif command -v amd-smi >/dev/null 2>&1 && \
amd-smi list 2>/dev/null | awk '/^GPU[[:space:]]*[:\[]/{ found=1 } END{ exit !found }'; then
amd-smi list 2>/dev/null | awk '/^GPU[[:space:]]*[:\[][[:space:]]*[0-9]/{ found=1 } END{ exit !found }'; then
return 0
fi
return 1
}
# ── NVIDIA usable-GPU helper ──
# Returns 0 (true) only if nvidia-smi is present AND actually lists a GPU.
# Prevents AMD-only hosts with a stale nvidia-smi on PATH from being routed
# into the CUDA branch.
_has_usable_nvidia_gpu() {
_nvsmi=""
if command -v nvidia-smi >/dev/null 2>&1; then
_nvsmi="nvidia-smi"
elif [ -x "/usr/bin/nvidia-smi" ]; then
_nvsmi="/usr/bin/nvidia-smi"
else
return 1
fi
"$_nvsmi" -L 2>/dev/null | awk '/^GPU[[:space:]]+[0-9]+:/{found=1} END{exit !found}'
}
# ── Detect GPU and choose PyTorch index URL ──
# Mirrors Get-TorchIndexUrl in install.ps1.
# On CPU-only machines this returns the cpu index, avoiding the solver
@@ -1001,12 +1017,17 @@ get_torch_index_url() {
_base="https://download.pytorch.org/whl"
# macOS: always CPU (no CUDA support)
case "$(uname -s)" in Darwin) echo "$_base/cpu"; return ;; esac
# Try nvidia-smi
# Try nvidia-smi -- require the binary to actually list a usable GPU.
# Presence of the binary alone (container leftovers, stale driver
# packages) is not sufficient: otherwise an AMD-only host would
# silently install CUDA wheels.
_smi=""
if command -v nvidia-smi >/dev/null 2>&1; then
_smi="nvidia-smi"
elif [ -x "/usr/bin/nvidia-smi" ]; then
_smi="/usr/bin/nvidia-smi"
if _has_usable_nvidia_gpu; then
if command -v nvidia-smi >/dev/null 2>&1; then
_smi="nvidia-smi"
elif [ -x "/usr/bin/nvidia-smi" ]; then
_smi="/usr/bin/nvidia-smi"
fi
fi
if [ -z "$_smi" ]; then
# No NVIDIA GPU -- check for AMD ROCm GPU
@@ -1021,7 +1042,7 @@ get_torch_index_url() {
{ [ -r /opt/rocm/.info/version ] && \
awk -F. '{print "rocm"$1"."$2; exit}' /opt/rocm/.info/version; } || \
{ command -v hipconfig >/dev/null 2>&1 && \
hipconfig --version 2>/dev/null | awk 'NR==1{split($1,a,"."); if(a[1]+0>0) print "rocm"a[1]"."a[2]}'; } || \
hipconfig --version 2>/dev/null | awk 'NR==1 && /^[0-9]/{split($1,a,"."); if(a[1]+0>0){print "rocm"a[1]"."a[2]; found=1}} END{exit !found}'; } || \
{ command -v dpkg-query >/dev/null 2>&1 && \
ver="$(dpkg-query -W -f='${Version}\n' rocm-core 2>/dev/null)" && \
[ -n "$ver" ] && \
@@ -1087,7 +1108,7 @@ get_radeon_wheel_url() {
{ [ -r /opt/rocm/.info/version ] && \
awk -F'[.-]' 'NF>=3{print $1"."$2"."$3; exit}' /opt/rocm/.info/version; } || \
{ command -v hipconfig >/dev/null 2>&1 && \
hipconfig --version 2>/dev/null | awk 'NR==1 && /^[0-9]+\.[0-9]+\.[0-9]/{print $1}'; }) 2>/dev/null
hipconfig --version 2>/dev/null | awk 'NR==1 && /^[0-9]+\.[0-9]+\.[0-9]/{print $1; found=1} END{exit !found}'; }) 2>/dev/null
# Validate: must be X.Y.Z with X >= 1
case "$_full_ver" in
@@ -1241,18 +1262,31 @@ elif [ -n "$TORCH_INDEX_URL" ]; then
fi
if [ "$_radeon_listing_ok" = true ]; then
substep "installing PyTorch from Radeon repo (${_RADEON_BASE_URL})..."
_torch_arg="torch"; _tv_arg="torchvision"; _ta_arg="torchaudio"; _tri_arg=""
_torch_whl=$(_pick_radeon_wheel "torch" 2>/dev/null) && _torch_arg="$_torch_whl"
_tv_whl=$(_pick_radeon_wheel "torchvision" 2>/dev/null) && _tv_arg="$_tv_whl"
_ta_whl=$(_pick_radeon_wheel "torchaudio" 2>/dev/null) && _ta_arg="$_ta_whl"
_tri_whl=$(_pick_radeon_wheel "triton" 2>/dev/null) && _tri_arg="$_tri_whl"
# Build install args; skip empty _tri_arg to avoid passing "" to uv
_radeon_pkgs="$_torch_arg $_tv_arg $_ta_arg"
[ -n "$_tri_arg" ] && _radeon_pkgs="$_tri_arg $_radeon_pkgs"
run_install_cmd "install triton + PyTorch" uv pip install --python "$_VENV_PY" \
--find-links "$_RADEON_BASE_URL" \
$_radeon_pkgs
# Require torch, torchvision, torchaudio wheels to all resolve
# from the Radeon listing. If any is missing for this Python
# tag, fall through to the standard ROCm index instead of
# silently mixing Radeon wheels with PyPI defaults.
_torch_whl=$(_pick_radeon_wheel "torch" 2>/dev/null) || _torch_whl=""
_tv_whl=$(_pick_radeon_wheel "torchvision" 2>/dev/null) || _tv_whl=""
_ta_whl=$(_pick_radeon_wheel "torchaudio" 2>/dev/null) || _ta_whl=""
_tri_whl=$(_pick_radeon_wheel "triton" 2>/dev/null) || _tri_whl=""
if [ -z "$_torch_whl" ] || [ -z "$_tv_whl" ] || [ -z "$_ta_whl" ]; then
substep "[WARN] Radeon repo lacks a complete wheel set for this Python; falling back to ROCm index ($TORCH_INDEX_URL)" "$C_WARN"
run_install_cmd "install PyTorch" uv pip install --python "$_VENV_PY" \
"$TORCH_CONSTRAINT" torchvision torchaudio \
--index-url "$TORCH_INDEX_URL"
else
substep "installing PyTorch from Radeon repo (${_RADEON_BASE_URL})..."
if [ -n "$_tri_whl" ]; then
run_install_cmd "install triton + PyTorch" uv pip install --python "$_VENV_PY" \
--find-links "$_RADEON_BASE_URL" \
"$_tri_whl" "$_torch_whl" "$_tv_whl" "$_ta_whl"
else
run_install_cmd "install PyTorch" uv pip install --python "$_VENV_PY" \
--find-links "$_RADEON_BASE_URL" \
"$_torch_whl" "$_tv_whl" "$_ta_whl"
fi
fi
else
substep "[WARN] Radeon repo unavailable; falling back to ROCm index ($TORCH_INDEX_URL)" "$C_WARN"
run_install_cmd "install PyTorch" uv pip install --python "$_VENV_PY" \

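The fixed `hipconfig` awk blocks above boil down to one rule: accept only output whose first line starts with a numeric `X.Y` version, and report failure otherwise so the `||` fallback chain continues to the next detection source. A rough Python equivalent of that parse (function name hypothetical, for illustration only):

```python
import re

def rocm_index_suffix(hipconfig_output):
    # Mirrors the fixed awk block: only a leading X.Y version with a
    # non-zero major is accepted; anything else returns None so the
    # caller can try the next source (dpkg-query / rpm) instead of
    # stopping the fallback chain on garbage output.
    lines = hipconfig_output.splitlines() if hipconfig_output else []
    first = lines[0].strip() if lines else ""
    m = re.match(r"^([0-9]+)\.([0-9]+)", first)
    if not m or int(m.group(1)) == 0:
        return None
    return f"rocm{m.group(1)}.{m.group(2)}"
```

The old awk block printed nothing on unrecognised output but still exited 0, which terminated the `||` chain early; propagating a non-zero exit is what keeps the chain alive.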

@@ -10,6 +10,7 @@ nvidia.py counterparts.
import json
import math
import os
import re
import subprocess
from typing import Any, Optional
@@ -93,9 +94,7 @@ def _parse_memory_mb(value: Any) -> Optional[float]:
return num
if "kib" in unit or "kb" in unit:
return num / 1024
if unit and (
"b" in unit and "g" not in unit and "m" not in unit and "k" not in unit
):
if unit in ("b", "byte", "bytes"):
# Plain bytes
return num / (1024 * 1024)
@@ -203,9 +202,34 @@ def get_physical_gpu_count() -> Optional[int]:
return None
def _first_visible_amd_gpu_id() -> Optional[str]:
"""Return the physical AMD GPU id that should be treated as 'primary'.
Honours HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES / CUDA_VISIBLE_DEVICES
in that order (HIP respects all three). Returns ``"0"`` when none are
set, and ``None`` when the env var explicitly narrows to zero GPUs
("" or "-1"), so callers can short-circuit to "available: False".
"""
for env_name in ("HIP_VISIBLE_DEVICES", "ROCR_VISIBLE_DEVICES", "CUDA_VISIBLE_DEVICES"):
raw = os.environ.get(env_name)
if raw is None:
continue
raw = raw.strip()
if raw == "" or raw == "-1":
return None
first = raw.split(",", 1)[0].strip()
if first:
return first
break
return "0"
def get_primary_gpu_utilization() -> dict[str, Any]:
"""Return utilization metrics for the primary AMD GPU."""
data = _run_amd_smi("metric", "-g", "0")
"""Return utilization metrics for the primary visible AMD GPU."""
gpu_idx = _first_visible_amd_gpu_id()
if gpu_idx is None:
return {"available": False}
data = _run_amd_smi("metric", "-g", gpu_idx)
if data is None:
return {"available": False}

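The tightened unit handling above replaces the old substring heuristic (`"b" in unit` with g/m/k exclusions, which misfires on arbitrary strings containing a `b`) with an explicit allowlist for plain bytes. A minimal sketch of the parse, with the MiB/GiB branches assumed from context since the hunk only shows the surrounding lines:

```python
def parse_memory_mb(num, unit):
    # Assumed overall shape; only the kib/kb and plain-bytes branches
    # appear in the diff above, the rest is reconstructed from context.
    unit = unit.strip().lower()
    if "mib" in unit or "mb" in unit:
        return num
    if "gib" in unit or "gb" in unit:
        return num * 1024
    if "kib" in unit or "kb" in unit:
        return num / 1024
    if unit in ("b", "byte", "bytes"):
        # Explicit allowlist: only genuine byte units reach this branch,
        # unlike the old '"b" in unit' substring check.
        return num / (1024 * 1024)
    return None
```

Under the old heuristic a unit like `"blobs"` would have been treated as bytes; the allowlist rejects it.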

@@ -1187,6 +1187,19 @@ def get_physical_gpu_count() -> int:
return _physical_gpu_count
def _backend_visible_devices_env() -> Optional[str]:
"""Return the raw visibility env string that applies to this backend.
On ROCm, HIP_VISIBLE_DEVICES / ROCR_VISIBLE_DEVICES take precedence
over CUDA_VISIBLE_DEVICES; the helper mirrors the resolution logic in
``_get_parent_visible_gpu_spec`` so ``backend_cuda_visible_devices``
reports the value that is actually narrowing the visible device set.
"""
if IS_ROCM:
return _get_parent_visible_gpu_spec().get("raw")
return os.environ.get("CUDA_VISIBLE_DEVICES")
def get_backend_visible_gpu_info() -> Dict[str, Any]:
device = get_device()
if device in (DeviceType.CUDA, DeviceType.XPU):
@@ -1232,7 +1245,7 @@ def get_backend_visible_gpu_info() -> Dict[str, Any]:
return {
"available": True,
"backend": device.value,
"backend_cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
"backend_cuda_visible_devices": _backend_visible_devices_env(),
"parent_visible_gpu_ids": parent_visible_ids,
"devices": devices,
"index_kind": index_kind,
@@ -1241,7 +1254,7 @@ def get_backend_visible_gpu_info() -> Dict[str, Any]:
return {
"available": False,
"backend": device.value,
"backend_cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
"backend_cuda_visible_devices": _backend_visible_devices_env(),
"parent_visible_gpu_ids": parent_visible_ids,
"devices": [],
"index_kind": "physical",
@@ -1348,7 +1361,10 @@ def apply_gpu_ids(gpu_ids) -> None:
os.environ["HIP_VISIBLE_DEVICES"] = value
os.environ["ROCR_VISIBLE_DEVICES"] = value
_visible_gpu_count = None
logger.info("Applied gpu_ids: CUDA_VISIBLE_DEVICES='%s' (rocm=%s)", value, IS_ROCM)
if IS_ROCM:
logger.info("Applied gpu_ids: CUDA_VISIBLE_DEVICES='%s' (rocm)", value)
else:
logger.info("Applied gpu_ids: CUDA_VISIBLE_DEVICES='%s'", value)
def get_device_map(


@@ -2494,12 +2494,27 @@ def detect_host() -> HostInfo:
has_physical_nvidia = False
has_usable_nvidia = False
if nvidia_smi:
# Require `nvidia-smi -L` to actually list a GPU before treating the
# host as NVIDIA. The banner text "NVIDIA-SMI ..." is printed even
# when the command fails to communicate with the driver (e.g. stale
# container leftovers), which would otherwise misclassify an AMD
# ROCm host as NVIDIA and short-circuit the ROCm path.
try:
listing = run_capture([nvidia_smi, "-L"], timeout = 20)
gpu_lines = [
line
for line in listing.stdout.splitlines()
if line.startswith("GPU ")
]
if gpu_lines:
has_physical_nvidia = True
has_usable_nvidia = visible_device_tokens != []
except Exception:
pass
try:
result = run_capture([nvidia_smi], timeout = 20)
merged = "\n".join(part for part in (result.stdout, result.stderr) if part)
if "NVIDIA-SMI" in merged:
has_physical_nvidia = True
has_usable_nvidia = visible_device_tokens != []
for line in merged.splitlines():
if "CUDA Version:" in line:
raw = line.split("CUDA Version:", 1)[1].strip().split()[0]
@@ -2981,11 +2996,28 @@ def resolve_upstream_asset_choice(host: HostInfo, llama_tag: str) -> AssetChoice
# the exact GPU target via rocminfo, which is more reliable for consumer
# GPUs (e.g. gfx1151) that may not be in the prebuilt.
if host.has_rocm and not host.has_usable_nvidia:
rocm_name = f"llama-{llama_tag}-bin-ubuntu-rocm-7.2-x64.tar.gz"
if rocm_name in upstream_assets:
# Scan upstream assets for any rocm-<version> prebuilt and prefer
# the newest one. Hardcoding a single rocm-7.2 filename means
# ROCm 6.x / 7.0 / 7.1 / 7.3+ users always fall through to a
# source build even when a matching prebuilt exists upstream.
import re as _re_rocm
_rocm_pattern = _re_rocm.compile(
rf"llama-{_re_rocm.escape(llama_tag)}-bin-ubuntu-rocm-([0-9]+(?:\.[0-9]+)*)-x64\.tar\.gz"
)
rocm_candidates: list[tuple[tuple[int, ...], str]] = []
for _name in upstream_assets:
_m = _rocm_pattern.match(_name)
if _m is None:
continue
_parts = tuple(int(p) for p in _m.group(1).split("."))
rocm_candidates.append((_parts, _name))
rocm_candidates.sort(reverse = True)
if rocm_candidates:
rocm_name = rocm_candidates[0][1]
log(f"AMD ROCm detected -- trying upstream prebuilt {rocm_name}")
log(
"Note: prebuilt is compiled for ROCm 7.2; if your ROCm version differs, "
"Note: if your ROCm runtime version differs significantly, "
"this may fail preflight and fall back to a source build (safe)"
)
return AssetChoice(
@@ -3120,7 +3152,13 @@ def resolve_release_asset_choice(
)
published_choice: AssetChoice | None = None
if host.is_windows and host.is_x86_64 and not host.has_rocm:
if host.is_windows and host.is_x86_64:
# Always try the published Windows CPU bundle, even on AMD ROCm
# hosts. If a windows-hip bundle is added to published releases
# in the future, the upstream resolver below would pick it first
# via resolve_asset_choice; falling back to the hash-approved
# windows-cpu bundle is still better than the upstream CPU
# asset for AMD Windows hosts without a HIP prebuilt.
published_choice = published_asset_choice_for_kind(release, "windows-cpu")
elif host.is_macos and host.is_arm64:
published_choice = published_asset_choice_for_kind(release, "macos-arm64")
@@ -4233,7 +4271,11 @@ def validate_server(
"--batch-size",
"32",
]
if host.has_usable_nvidia or (host.is_macos and host.is_arm64):
if (
host.has_usable_nvidia
or host.has_rocm
or (host.is_macos and host.is_arm64)
):
command.extend(["--n-gpu-layers", "1"])
log_fd, log_name = tempfile.mkstemp(prefix = "llama-server-", suffix = ".log")

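The upstream asset scan in `resolve_upstream_asset_choice()` reduces to a small standalone sketch (helper name hypothetical): collect every `rocm-<version>` match for the active llama tag, parse the version into an integer tuple so `7.10` sorts above `7.2`, and prefer the newest:

```python
import re

def pick_rocm_asset(assets, llama_tag):
    # Match any rocm-<version> prebuilt for this tag instead of
    # hard-coding a single rocm-7.2 filename.
    pattern = re.compile(
        rf"llama-{re.escape(llama_tag)}-bin-ubuntu-rocm-([0-9]+(?:\.[0-9]+)*)-x64\.tar\.gz"
    )
    candidates = []
    for name in assets:
        m = pattern.fullmatch(name)
        if m:
            # Integer tuples compare numerically, so (7, 10) > (7, 2).
            candidates.append((tuple(int(p) for p in m.group(1).split(".")), name))
    return max(candidates)[1] if candidates else None
```

This is why ROCm 6.x / 7.0 / 7.1 / 7.3+ users now pick up a matching prebuilt whenever one exists upstream, rather than always falling through to a source build.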

@@ -157,20 +157,28 @@ def _has_usable_nvidia_gpu() -> bool:
def _ensure_rocm_torch() -> None:
"""Reinstall torch with ROCm wheels when the venv received CPU-only torch.
Runs only on Linux hosts where ROCm is installed and an AMD GPU is
present. No-op when torch already links against HIP (ROCm) or CUDA
(NVIDIA). Skips on Windows/macOS and on mixed AMD+NVIDIA hosts
(NVIDIA takes precedence).
Runs only on Linux hosts where an AMD GPU is present and the ROCm
runtime is detectable (rocminfo / amd-smi / hipconfig / rocm-core
package). No-op when torch already links against HIP (ROCm) or on
Windows/macOS or on mixed AMD+NVIDIA hosts (NVIDIA takes precedence).
Uses pip_install() to respect uv, constraints, and --python targeting.
"""
# Explicit OS guard so the helper is safe to call from any context --
# ROCm wheels are only published for Linux x86_64.
if IS_WINDOWS or IS_MACOS:
return
# NVIDIA takes precedence on mixed hosts -- but only if an actual GPU is usable
if _has_usable_nvidia_gpu():
return
rocm_root = os.environ.get("ROCM_PATH") or "/opt/rocm"
if not os.path.isdir(rocm_root) and not shutil.which("hipcc"):
return # no ROCm toolchain
# Rely on _has_rocm_gpu() (rocminfo / amd-smi GPU data rows) as the
# authoritative "is this actually an AMD ROCm host?" signal. The old
# gate required /opt/rocm or hipcc to exist, which breaks on
# runtime-only ROCm installs (package-managed minimal installs,
# Radeon software) that ship amd-smi/rocminfo without /opt/rocm or
# hipcc, and leaves `unsloth studio update` unable to repair a
# CPU-only venv on those systems.
if not _has_rocm_gpu():
return # ROCm tools present but no AMD GPU
return # no AMD GPU visible
ver = _detect_rocm_version()
if ver is None: