docs(gpu-windows): adding troubleshooting section (#8807)

* docs(gpu-windows): adding troubleshooting section

Signed-off-by: axel7083 <42176370+axel7083@users.noreply.github.com>

* Apply suggestions from @shipsing

Co-authored-by: Shipra Singh <94683525+shipsing@users.noreply.github.com>
Signed-off-by: axel7083 <42176370+axel7083@users.noreply.github.com>

* fix: prettier

Signed-off-by: axel7083 <42176370+axel7083@users.noreply.github.com>

---------

Signed-off-by: axel7083 <42176370+axel7083@users.noreply.github.com>
Co-authored-by: Shipra Singh <94683525+shipsing@users.noreply.github.com>
axel7083 2024-09-10 10:27:01 +02:00 committed by GitHub
parent 4365ad99fe
commit 82697c701f


@@ -43,12 +43,18 @@ Run the following commands **on the Podman Machine, not the host system**:
```sh
$ curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
-sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo && \
-sudo yum install -y nvidia-container-toolkit && \
-sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml && \
+tee /etc/yum.repos.d/nvidia-container-toolkit.repo && \
+yum install -y nvidia-container-toolkit && \
+nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml && \
nvidia-ctk cdi list
```
:::info
A configuration change might occur when you create or remove Multi-Instance GPU (MIG) devices, or upgrade the Compute Unified Device Architecture (CUDA) driver. In such cases, you must generate a new Container Device Interface (CDI) specification.
:::
#### Verification
To verify that containers can access the GPU, run `nvidia-smi` from within a container that has the NVIDIA drivers installed.
@@ -85,6 +91,31 @@ Fri Aug 16 18:58:14 2024
+---------------------------------------------------------------------------------------+
```
#### Troubleshooting
#### Version mismatch
You might encounter the following error inside the containers:
```
# nvidia-smi
Failed to initialize NVML: N/A
```
This error usually means that the Container Device Interface (CDI) specification no longer matches the installed NVIDIA driver version, for example after a driver upgrade.
To fix this problem, generate a new CDI specification by running the following command inside the Podman machine:
```sh
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```
:::info
You might need to restart your Podman machine.
:::
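The recovery steps above can be combined into a small helper script. This is a minimal sketch, not part of the original documentation: the function names `run`, `refresh_cdi_spec`, and `restart_machine`, and the `DRY_RUN` variable, are hypothetical helpers introduced for illustration. It assumes that `refresh_cdi_spec` is called inside the Podman machine with enough privileges to write to `/etc/cdi`, and that `restart_machine` is called on the host against the default machine.

```sh
#!/bin/sh
# Sketch only: helper names and DRY_RUN are illustrative assumptions.

# Print the command instead of executing it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Run INSIDE the Podman machine: regenerate the CDI specification,
# then list the devices it exposes to confirm the result.
refresh_cdi_spec() {
  run nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml &&
    run nvidia-ctk cdi list
}

# Run ON THE HOST: restart the default Podman machine.
restart_machine() {
  run podman machine stop &&
    run podman machine start
}
```

Setting `DRY_RUN=1` before calling either function prints the commands without running them, which is a low-risk way to check the sequence before applying it.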
#### Additional resources
- [NVIDIA Container Toolkit Installation](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-with-yum-or-dnf)