Self-managed GitLab runner always reports that no CUDA is available. Sometimes it even reports that no Nvidia driver and no Nvidia Container Toolkit are detected.
Steps to reproduce
Install and configure Podman
Install Nvidia Container Toolkit
If it is not generated automatically, manually generate the CDI specification and make sure that the correct driver is selected (see the command sketch after the job log below)
Verify that Podman has access to the Nvidia GPUs and can execute CUDA code (e.g. with the official PyTorch image or the base Nvidia CUDA image)
Install and configure the GitLab runner with the Docker executor
Create a simple CI pipeline that runs the same code used to verify GPU access in Podman
Depending on the image and CUDA version selected, observe:
(always) no CUDA detected, even when the Nvidia driver is:
import torch
print(torch.cuda.is_available())
(sometimes) no Nvidia driver detected
Using effective pull policy of [if-not-present] for container nvidia/cuda:13.0.1-cudnn-devel-ubuntu24.04
Using docker image sha256:92a047cf48371393d2d27c9a696f3afd7548b1b39e27d0696e2ec18c22e41ccc for nvidia/cuda:13.0.1-cudnn-devel-ubuntu24.04 with digest docker.io/nvidia/cuda@sha256:5a2d3b02eb7412847d051d0f2b0f0a5031057a0172d9ca78743cc41cfc5d037f …
==========
== CUDA ==
==========
CUDA Version 13.0.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use the NVIDIA Container Toolkit to start this container with GPU support; see NVIDIA Cloud Native Technologies - NVIDIA Docs .
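For completeness, the CDI generation and verification steps above boil down to roughly the following (a sketch; the exact CUDA image tag should not matter):

sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
nvidia-ctk cdi list
podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable docker.io/nvidia/cuda:13.0.1-cudnn-devel-ubuntu24.04 nvidia-smi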
Configuration
Output from nvidia-ctk cdi list (the full CDI YAML can be provided on request):
INFO[0000] Found 5 CDI devices
nvidia.com/gpu=0
nvidia.com/gpu=1
nvidia.com/gpu=GPU-2aff26da-3664-9eeb-13ba-b78397cace6f
nvidia.com/gpu=GPU-66878602-8286-6421-1ec4-8d097b71be4e
nvidia.com/gpu=all
Here you can also see my futile attempts to adjust the environment variables of the executor (based on Docker's documentation, which is referenced by the gpus option of the GitLab runner, even though I am using Podman). I also tried adding the environment variables NVIDIA_VISIBLE_DEVICES="nvidia.com/gpu=all" (as well as just all) and NVIDIA_DRIVER_CAPABILITIES="all".
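The relevant part of such a config.toml looks roughly like this (a sketch with placeholder name/URL/token rather than my exact file; the commented-out lines correspond to the attempts mentioned above):

[[runners]]
  name = "gpu-runner"                 # placeholder
  url = "https://gitlab.example.com"  # placeholder
  token = "REDACTED"                  # placeholder
  executor = "docker"
  environment = [
    # attempts that made no difference (tried in various combinations):
    # "NVIDIA_VISIBLE_DEVICES=all",
    # "NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all",
    # "NVIDIA_DRIVER_CAPABILITIES=all",
  ]
  [runners.docker]
    host = "unix:///run/podman/podman.sock"
    image = "nvidia/cuda:13.0.1-cudnn-devel-ubuntu24.04"
    gpus = "all"
    privileged = false   # later also tried true, no difference
    pull_policy = "if-not-present"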
The Python script pytorch_check.py used to check whether CUDA is detected:
import torch
print(torch.cuda.is_available())
Versions
Ubuntu Server 24.04 with kernel 6.8.0-1024-oracle (based on output from uname -a)
Nvidia driver 580.82.07 (based on output from nvidia-smi)
CUDA 13.0 (based on output from nvidia-smi)
Podman 4.9.3 (official Ubuntu package, unable to upgrade to 5.x)
Nvidia Container Toolkit
NVIDIA Container Toolkit CLI version 1.18.0
commit: f8daa5e26de9fd7eb79259040b6dd5a52060048c
I used

podman run -it --device nvidia.com/gpu=0 --security-opt=label=disable docker.io/pytorch/pytorch:2.9.0-cuda12.8-cudnn9-runtime /bin/bash

to launch a container with the PyTorch image (note that there is currently no CUDA 13.0 PyTorch image, but given how Nvidia driver and CUDA runtime compatibility works, code built for CUDA 12.8 also runs on a 13.0 setup)
Obtain the container ID:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
115c92bb4d3a docker.io/pytorch/pytorch:2.9.0-cuda12.8-cudnn9-runtime /bin/bash 12 hours ago Up 12 hours jolly_solomon
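From there I run the check inside the container, roughly like this (the paths are arbitrary):

podman cp pytorch_check.py 115c92bb4d3a:/tmp/pytorch_check.py
podman exec -it 115c92bb4d3a python /tmp/pytorch_check.py

which prints True when the GPU passthrough works.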
Looks like the GitLab runner isn't passing GPU devices from Podman. Try setting the runner to privileged mode and adding the Nvidia env vars, like NVIDIA_VISIBLE_DEVICES=all. If it still fails, switch to the Docker socket instead of Podman; that usually fixes CUDA detection.
Thank you for the reply, @ase2356. I thought of that too, but then I checked with a normal user (besides gitlab-runner) and could use Podman with GPU passthrough without issues. I did what you suggested anyway and changed privileged to true, but it made no difference. I also tried both versions of the env var NVIDIA_VISIBLE_DEVICES, all (compatible with the Docker CLI) and nvidia.com/gpu=all (matching the actual Podman call I would make in the terminal when launching a container manually rather than through the runner), both before and after the change to the privileged state. You can see my attempts as comments inside the TOML code snippet in my original question.
As for switching to a Docker socket, I don't know how to do that, so if you can lend a hand I would appreciate it.
I have also seen numerous people using the Podman socket service in user mode, that is systemctl enable --user --now podman.socket, instead of system mode:
systemctl status podman.socket
● podman.socket - Podman API Socket
Loaded: loaded (/usr/lib/systemd/system/podman.socket; enabled; preset: enabled)
Active: active (listening) since Tue 2025-10-28 10:05:57 CET; 3 days ago
Triggers: ● podman.service
Docs: man:podman-system-service(1)
Listen: /run/podman/podman.sock (Stream)
CGroup: /system.slice/podman.socket
Oct 28 10:05:57 gin-vm-gpu systemd[1]: Listening on podman.socket - Podman API Socket.
which should make the socket available to all users. Accordingly, people who use the user-mode socket put a different path for it, namely unix:///run/user/<gitlab-runner-uid>/podman/podman.sock. The GitLab runner documentation is rather confusing in this regard.
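For reference, as far as I understand it, the Docker executor is pointed at a socket via the host key under [runners.docker] in config.toml, roughly:

[runners.docker]
  # system-wide Podman socket:
  host = "unix:///run/podman/podman.sock"
  # or, for a rootless/user socket of the gitlab-runner user:
  # host = "unix:///run/user/<gitlab-runner-uid>/podman/podman.sock"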
I have conducted further investigation. A colleague advised me to use a CUDA + cuDNN image as the base image for the CI job. Since the GitLab runner GPU support article simply says to run nvidia-smi in the script block, that makes sense: it would mean the CI job uses the underlying Nvidia container to run the command. At least in theory.
I used nvidia/cuda:13.0.1-cudnn-runtime-ubuntu24.04 and nvidia/cuda:12.9.1-cudnn-runtime-ubuntu24.04 in two different ways:
Manually started a local container with GPU passthrough, using the same podman run invocation as before (see the sketch right after this list), with <NVIDIA-IMAGE> being nvidia/cuda:13.0.1-cudnn-runtime-ubuntu24.04 and then (just to try a CUDA version lower than the one that comes with my toolkit and driver) nvidia/cuda:12.9.1-cudnn-runtime-ubuntu24.04
Automatically started a local container with (theoretically) GPU passthrough via the GitLab runner Docker executor backed by Podman
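The manual invocation has the same shape as the PyTorch one shown earlier, roughly:

podman run -it --device nvidia.com/gpu=0 --security-opt=label=disable <NVIDIA-IMAGE> /bin/bash

with <NVIDIA-IMAGE> substituted as described above.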
For both Nvidia images I got the following results when trying to execute nvidia-smi:
manually started container: the command worked as expected and printed the usual GPU information
automatically started container (I used podman exec -it <gitlab-runner-started-container> /bin/bash to get a shell inside it and run my test): I get a "command not found" error
It appears that even though the same image is used, the container environment is different. Running podman ps --no-trunc reveals the command that the GitLab-runner-started container is executing, which is just a shell-selection snippet ending in a plain bash session:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a6b3ae83249c23eff18af6855516eaa12163601df15390186c74bd47a3556074 nvcr.io/nvidia/cuda:12.9.1-cudnn-runtime-ubuntu24.04 sh -c if [ -x /usr/local/bin/bash ]; then
exec /usr/local/bin/bash
elif [ -x /usr/bin/bash ]; then
exec /usr/bin/bash
elif [ -x /bin/bash ]; then
exec /bin/bash
elif [ -x /usr/local/bin/sh ]; then
exec /usr/local/bin/sh
elif [ -x /usr/bin/sh ]; then
exec /usr/bin/sh
elif [ -x /bin/sh ]; then
exec /bin/sh
elif [ -x /busybox/sh ]; then
exec /busybox/sh
else
echo shell not found
exit 1
fi
About a minute ago Up About a minute runner-5j-fmjokx-project-87457-concurrent-1-4e3dc2992f3a7b23-build
I can find the CUDA- and cuDNN-related libraries inside the container, but I cannot find nvidia-smi. I tried both recursively listing everything under / and grepping, as well as which:
root@runner-5j-fmjokx-project-87457-concurrent-1:/# which nvidia-smi
root@runner-5j-fmjokx-project-87457-concurrent-1:/# ls -R | grep nvidia-smi
For reference, inside the manually started container I get:
root@77237af1ac7f:/# which nvidia-smi
/usr/bin/nvidia-smi
The environment is indeed different. Inside /usr/lib/x86_64-linux-gnu/ I can see different libraries: the automatically (via GitLab runner) started container has considerably fewer of them and, what is a definite no-go, the libnvidia* and libcuda* ones are missing there entirely.
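Roughly, the library comparison I ran in each container was just:

ls /usr/lib/x86_64-linux-gnu/ | grep -E 'libcuda|libnvidia'

which, consistent with the above, returns several entries in the manually started container and nothing in the runner-started one.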
Quick thought, without validation: check whether the image has an entrypoint configured that runs and installs specific packages at container start. I suspect the manual run does this, whereas the CI runner Docker executor does not.
I will check that. I need to find the Dockerfile for the Nvidia CUDA cuDNN images first.
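A quick way to see the configured entrypoint without hunting for the Dockerfile is probably:

podman image inspect nvcr.io/nvidia/cuda:12.9.1-cudnn-runtime-ubuntu24.04 --format '{{ .Config.Entrypoint }}'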
After all:
the GitLab runner CI job's log clearly states that "The NVIDIA Driver was not detected"
after inspecting the libraries it is clear that the CUDA-related ones are missing while the cuDNN ones are not. The cuDNN libraries ship inside the image, whereas nvidia-smi and the driver-level libraries (libcuda, libnvidia-*) are, as far as I understand, injected from the host by the Nvidia Container Toolkit when a container is started with GPU access; their absence therefore points to the runner-started container not being given the GPU devices at all.
So we are back to square one: why is the runner not detecting the GPU, and/or why does it have some sort of conflict with the Nvidia Container Toolkit, which, as seen previously, works perfectly fine with Podman?
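One more thing worth checking is whether the runner-created container gets any GPU devices attached at all, e.g. by inspecting it while a job is running (the container name comes from podman ps):

podman inspect <runner-created-container> --format '{{ .HostConfig.Devices }}'

If that comes back empty, the executor is presumably not requesting the CDI devices from Podman at all.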
UPDATE: Hah, didn’t know Nvidia was using GitLab and not GitHub for hosting.