Unable to use Nvidia GPU with GitLab runner but works when running containers manually - same executor

Problem to solve

A self-managed GitLab runner always reports that no CUDA is available. Sometimes it even reports that neither the Nvidia driver nor the Nvidia Container Toolkit is detected.

Steps to reproduce

  • Install and configure Podman
  • Install Nvidia Container Toolkit
  • If it is not generated automatically, generate the CDI specification manually and make sure that the correct driver is selected (see the command sketch after this list)
  • Verify and validate that Podman has access to Nvidia GPUs and can execute CUDA code (e.g. the official PyTorch image or the base Nvidia CUDA image)
  • Install and configure the GitLab runner with the Docker executor
  • Create a simple CI pipeline that runs the same code that was used to verify and validate the GPU access in Podman
  • Depending on the image and CUDA version selected, observe:
    • (always) no CUDA detected even if the Nvidia driver is:
      import torch
      print(torch.cuda.is_available())
      
    • (sometimes) no Nvidia driver detected

      Using effective pull policy of [if-not-present] for container nvidia/cuda:13.0.1-cudnn-devel-ubuntu24.04
      Using docker image sha256:92a047cf48371393d2d27c9a696f3afd7548b1b39e27d0696e2ec18c22e41ccc for nvidia/cuda:13.0.1-cudnn-devel-ubuntu24.04 with digest docker.io/nvidia/cuda@sha256:5a2d3b02eb7412847d051d0f2b0f0a5031057a0172d9ca78743cc41cfc5d037f
      ==========
      == CUDA ==
      ==========
      CUDA Version 13.0.1
      Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
      This container image and its contents are governed by the NVIDIA Deep Learning Container License.
      By pulling and using the container, you accept the terms and conditions of this license:
      https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
      A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
      WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
      Use the NVIDIA Container Toolkit to start this container with GPU support; see
      NVIDIA Cloud Native Technologies - NVIDIA Docs .
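
For reference, the CDI generation step mentioned in the list above, as described in Nvidia's Container Toolkit docs, looks roughly like this (the output path is the documented default; adjust it if your distribution uses a different CDI directory):

    sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
    nvidia-ctk cdi list    # should list nvidia.com/gpu=0, nvidia.com/gpu=1, ... and nvidia.com/gpu=all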

Configuration

  • Output from nvidia-ctk cdi list (upon request full CDI YAML can be provided)
    INFO[0000] Found 5 CDI devices                          
    nvidia.com/gpu=0
    nvidia.com/gpu=1
    nvidia.com/gpu=GPU-2aff26da-3664-9eeb-13ba-b78397cace6f
    nvidia.com/gpu=GPU-66878602-8286-6421-1ec4-8d097b71be4e
    nvidia.com/gpu=all
    
  • GitLab runner’s configuration TOML file
    concurrent = 8
    check_interval = 0
    connection_max_age = "15m0s"
    shutdown_timeout = 0
    
    [session_server]
      session_timeout = 1800
    
    [[runners]]
      name = "gpu-runner"
      url = "https://XXXXXXXXXXXx"
      id = 45874
      token = "YYYYYYYYYYYYYYYYYYYYYYYY"
      token_obtained_at = 2025-10-29T15:29:11Z
      token_expires_at = 0001-01-01T00:00:00Z
      executor = "docker"
      [runners.cache]
        MaxUploadedArchiveSize = 0
        [runners.cache.s3]
        [runners.cache.gcs]
        [runners.cache.azure]
      [runners.docker]
        host = "unix:///run/podman/podman.sock"
        tls_verify = false
        image = "ubuntu24.04"
        privileged = false
        disable_entrypoint_overwrite = false
        oom_kill_disable = false
        disable_cache = false
        volumes = ["/cache"]
        shm_size = 0
        network_mtu = 0
    #    devices = [
    #      "nvidia.com/gpu=0",
    #      "nvidia.com/gpu=1",
    #    ]
    #    environment = [
    # https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/1.10.0/user-guide.html#gpu-enumeration
    #      "NVIDIA_VISIBLE_DEVICES=all",
    #      "NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all",
    # https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/1.10.0/user-guide.html#driver-capabilities
    #      "NVIDIA_DRIVER_CAPABILITIES=all"
    #    ]
        gpus = "all"
        service_gpus = "all"
        allowed_pull_policies = ["always", "if-not-present"]
    
    Here you can also see my futile attempts to adjust the environment variables of the executor (based on Docker's documentation, which is referenced by the gpus option for the GitLab runner, even though I am using Podman).
  • .gitlab-ci.yml
    stages:
     - check
    check_cuda:
      stage: check
      tags:
        - gpu
        - ml
        - linux
      image: 
        name: docker.io/pytorch/pytorch:2.9.0-cuda12.8-cudnn9-runtime
        pull_policy: if-not-present
      script:
        - python pytorch_check.py
    
    Here I also tried adding the environment variables NVIDIA_VISIBLE_DEVICES: "nvidia.com/gpu=all" (also tried with just all) and NVIDIA_DRIVER_CAPABILITIES: "all" (sketched after this list).
  • Python script pytorch_check.py used for checking whether CUDA is detected or not
    import torch
    print(torch.cuda.is_available())
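
    A sketch of the variables attempt mentioned under the .gitlab-ci.yml item above, with both values I tried shown (only one NVIDIA_VISIBLE_DEVICES value was active at a time):

    check_cuda:
      stage: check
      variables:
        NVIDIA_VISIBLE_DEVICES: "all"              # also tried "nvidia.com/gpu=all"
        NVIDIA_DRIVER_CAPABILITIES: "all"
      # tags, image and script exactly as in the job above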
    

Versions

Please select whether options apply, and add the version information.

  • Self-managed
  • GitLab.com SaaS
  • Dedicated
  • Self-hosted Runners

Versions

  • Ubuntu Server 24.04 with kernel 6.8.0-1024-oracle (based on output from uname -a)
  • Nvidia driver 580.82.07 (based on output from nvidia-smi)
  • CUDA 13.0 (based on output from nvidia-smi)
  • Podman 4.9.3 (official Ubuntu package, unable to upgrade to 5.x)
  • Nvidia Container Toolkit
    NVIDIA Container Toolkit CLI version 1.18.0
    commit: f8daa5e26de9fd7eb79259040b6dd5a52060048c
    
  • GitLab 18.4
  • GitLab Runner, if self-hosted
    Version:      18.5.0
    Git revision: bda84871
    Git branch:   18-5-stable
    GO version:   go1.24.6 X:cacheprog
    Built:        2025-10-13T19:20:30Z
    OS/Arch:      linux/amd64
    
    

Local check (no runner)

  1. Use
    podman run -it --device nvidia.com/gpu=0 --security-opt=label=disable docker.io/pytorch/pytorch:2.9.0-cuda12.8-cudnn9-runtime /bin/bash
    
    to launch a container with the PyTorch image (note that there is currently no CUDA 13.0 PyTorch image, but thanks to driver backward compatibility, code built against CUDA 12.8 runs on a 13.0 driver as well)
  2. Obtain container ID
    CONTAINER ID  IMAGE                                                    COMMAND     CREATED       STATUS       PORTS       NAMES
    115c92bb4d3a  docker.io/pytorch/pytorch:2.9.0-cuda12.8-cudnn9-runtime  /bin/bash   12 hours ago  Up 12 hours              jolly_solomon
    
    and copy the PyTorch test script into the container
    podman cp pytorch_cuda.py 115c92bb4d3a:/workspace/
    
  3. Execute PyTorch test script inside container and check output
    podman attach 115c92bb4d3a
    
    root@115c92bb4d3a:/workspace# ls
    pytorch_cuda.py
    root@115c92bb4d3a:/workspace# python pytorch_cuda.py 
    CUDA available: True
    

Documentation

  1. Nvidia Container Toolkit (Overview — NVIDIA Container Toolkit)
  2. Podman with GPU (GPU container access | Podman Desktop)
  3. Podman and Nvidia Container Toolkit (Installing Podman and the NVIDIA Container Toolkit — NVIDIA AI Enterprise: Red Hat Enterprise Linux With KVM Deployment Guide)
  4. GitLab runner with GPU (Using Graphical Processing Units (GPUs) | GitLab Docs)

Looks like the GitLab runner isn't passing the GPU devices through from Podman. Try setting the runner to privileged mode and adding Nvidia env vars like NVIDIA_VISIBLE_DEVICES=all. If it still fails, switch to the Docker socket instead of Podman; that usually fixes CUDA detection.

Thank you for the reply, @ase2356. I thought of that too, but then I checked with a normal user (besides gitlab-runner) and I could use Podman with GPU passthrough with no issues. I did what you suggested anyway and changed privileged to true, but it made no difference. I also tried both versions of the env var NVIDIA_VISIBLE_DEVICES - all (compatible with the Docker CLI) and nvidia.com/gpu=all (matching the actual Podman call I make in the terminal when launching a container manually rather than through the runner) - both before and after changing the privileged setting. You can see my attempts as comments inside the TOML snippet in my original question.

As for switching to a Docker socket, I don't know how to do that, so if you can lend a hand I would appreciate it.

I have also seen numerous times people using the Podman socket service in user mode, that is systemctl enable --user --now podman.socket, instead of system mode:

systemctl status podman.socket

● podman.socket - Podman API Socket
     Loaded: loaded (/usr/lib/systemd/system/podman.socket; enabled; preset: enabled)
     Active: active (listening) since Tue 2025-10-28 10:05:57 CET; 3 days ago
   Triggers: ● podman.service
       Docs: man:podman-system-service(1)
     Listen: /run/podman/podman.sock (Stream)
     CGroup: /system.slice/podman.socket

Oct 28 10:05:57 gin-vm-gpu systemd[1]: Listening on podman.socket - Podman API Socket.

which should make the socket available to all users. Accordingly, people who use the user-mode socket put a different path for it, namely unix:///run/user/<gitlab-runner-uid>/podman/podman.sock. The GitLab runner documentation is rather confusing in this regard.
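
A rough sketch of that rootless variant, assuming a dedicated gitlab-runner user; <gitlab-runner-uid> stands for the output of id -u gitlab-runner:

sudo loginctl enable-linger gitlab-runner        # keep the user's systemd instance (and its socket) alive
# then, logged in as the gitlab-runner user:
systemctl --user enable --now podman.socket
systemctl --user status podman.socket            # should listen on /run/user/<gitlab-runner-uid>/podman/podman.sock
# and in config.toml:
#   [runners.docker]
#     host = "unix:///run/user/<gitlab-runner-uid>/podman/podman.sock"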

I have conducted further investigation. A colleague of mine advised me to use a CUDA + cuDNN image as the base image for the CI job. Since the GitLab runner GPU support article simply mentions running nvidia-smi in the script block, this makes sense: the CI job would then be using the underlying Nvidia container to run the command. At least in theory (a minimal job of that kind is sketched below).
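
A minimal job of that kind, assuming the same runner tags and stage as in my pipeline above (the job name smoke_test_gpu is made up):

smoke_test_gpu:
  stage: check
  tags:
    - gpu
    - ml
    - linux
  image:
    name: docker.io/nvidia/cuda:12.9.1-cudnn-runtime-ubuntu24.04
    pull_policy: if-not-present
  script:
    - nvidia-smi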

I used nvidia/cuda:13.0.1-cudnn-runtime-ubuntu24.04 and nvidia/cuda:12.9.1-cudnn-runtime-ubuntu24.04 in two different ways:

  • Manually started local container with GPU passthrough using
    podman run --rm --device nvidia.com/gpu=0 --device nvidia.com/gpu=1 --security-opt=label=disable -it <NVIDIA-IMAGE> /bin/bash
    
    with <NVIDIA-IMAGE> being nvidia/cuda:13.0.1-cudnn-runtime-ubuntu24.04 and then (just to try with a CUDA version lower than the one coming with the toolkit and driver I have on the system) nvidia/cuda:12.9.1-cudnn-runtime-ubuntu24.04
  • Automatically started local container with (theoretically) GPU passthrough through the GitLab runner docker/podman executor

For both Nvidia images I received the following results when trying to execute nvidia-smi

  • manually started container - the command worked as expected and provided the information it is supposed to provide
  • automatically started container (I used podman exec -it <gitlab-runner-started-container> /bin/bash to log into it with a bash shell and run my test) - I receive a "command not found" error (see the exec sketch after this list)
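
The exec sketch mentioned above; the name filter is just an assumption that the runner-generated container names always start with runner- (which they do on my machine, see the podman ps output further below):

BUILD_CTR=$(podman ps --filter name=runner- --format '{{.Names}}' | head -n 1)
podman exec -it "$BUILD_CTR" /bin/bash
# inside the container:
nvidia-smi        # -> bash: nvidia-smi: command not found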

It appears that even though the same image is used, the container environment is different. Running podman ps --no-trunc reveals the command that the container (the GitLab runner one) is executing, which is a simple bash session.

CONTAINER ID                                                      IMAGE                                                 COMMAND     CREATED     STATUS      PORTS       NAMES
a6b3ae83249c23eff18af6855516eaa12163601df15390186c74bd47a3556074  nvcr.io/nvidia/cuda:12.9.1-cudnn-runtime-ubuntu24.04  sh -c if [ -x /usr/local/bin/bash ]; then
                                                                  exec /usr/local/bin/bash 
elif [ -x /usr/bin/bash ]; then
            exec /usr/bin/bash 
elif [ -x /bin/bash ]; then
            exec /bin/bash 
elif [ -x /usr/local/bin/sh ]; then
            exec /usr/local/bin/sh 
elif [ -x /usr/bin/sh ]; then
            exec /usr/bin/sh 
elif [ -x /bin/sh ]; then
            exec /bin/sh 
elif [ -x /busybox/sh ]; then
            exec /busybox/sh 
else
            echo shell not found
            exit 1
fi

            About a minute ago  Up About a minute              runner-5j-fmjokx-project-87457-concurrent-1-4e3dc2992f3a7b23-build

I am able to find the libraries inside the container that are related to CUDA and cuDNN, but I cannot find nvidia-smi. I tried both recursively listing everything in / and grepping, as well as which:

root@runner-5j-fmjokx-project-87457-concurrent-1:/# which nvidia-smi
root@runner-5j-fmjokx-project-87457-concurrent-1:/# ls -R | grep nvidia-smi

For reference inside the manually started container I get

root@77237af1ac7f:/# which nvidia-smi
/usr/bin/nvidia-smi

which is the same output that I get on the host.

The environment is indeed different. Inside /usr/lib/x86_64-linux-gnu/ I can see different libraries: the automatically launched container has considerably fewer, and, what is a definite no-go, the libnvidia and libcuda ones are missing entirely.

  • on the automatically (via GitLab runner) started container
    e2fsprogs                libcap-ng.so.0                               libdb-5.3.so               libhogweed.so.6     libncursesw.so.6       libproc2.so.0.0.2      libsystemd.so.0.38.0
    engines-3                libcap-ng.so.0.0.0                           libdebconfclient.so.0      libhogweed.so.6.8   libncursesw.so.6.4     libpsx.so.2            libtasn1.so.6
    gconv                    libcap.so.2                                  libdebconfclient.so.0.0.0  libidn2.so.0        libnettle.so.8         libpsx.so.2.66         libtasn1.so.6.6.3
    ld-linux-x86-64.so.2     libcap.so.2.66                               libdl.so.2                 libidn2.so.0.4.0    libnettle.so.8.8       libpthread.so.0        libthread_db.so.1
    libBrokenLocale.so.1     libcom_err.so.2                              libdrop_ambient.so.0       libksba.so.8        libnpth.so.0           libreadline.so.8       libtic.so.6
    libacl.so.1              libcom_err.so.2.1                            libdrop_ambient.so.0.0.0   libksba.so.8.14.6   libnpth.so.0.1.2       libreadline.so.8.2     libtic.so.6.4
    libacl.so.1.1.2302       libcrypt.so.1                                libe2p.so.2                liblber.so.2        libnsl.so.1            libresolv.so.2         libtinfo.so.6
    libanl.so.1              libcrypt.so.1.1.0                            libe2p.so.2.3              liblber.so.2.0.200  libnss_compat.so.2     librt.so.1             libtinfo.so.6.4
    libapt-pkg.so.6.0        libcrypto.so.3                               libext2fs.so.2             libldap.so.2        libnss_dns.so.2        libsasl2.so.2          libudev.so.1
    libapt-pkg.so.6.0.0      libcudnn.so.9                                libext2fs.so.2.4           libldap.so.2.0.200  libnss_files.so.2      libsasl2.so.2.0.25     libudev.so.1.7.8
    libapt-private.so.0.0    libcudnn.so.9.10.2                           libffi.so.8                liblz4.so.1         libnss_hesiod.so.2     libseccomp.so.2        libunistring.so.5
    libapt-private.so.0.0.0  libcudnn_adv.so.9                            libffi.so.8.1.4            liblz4.so.1.9.4     libp11-kit.so.0        libseccomp.so.2.5.5    libunistring.so.5.0.0
    libassuan.so.0           libcudnn_adv.so.9.10.2                       libformw.so.6              liblzma.so.5        libp11-kit.so.0.3.1    libselinux.so.1        libutil.so.1
    libassuan.so.0.8.6       libcudnn_cnn.so.9                            libformw.so.6.4            liblzma.so.5.4.5    libpam.so.0            libsemanage.so.2       libuuid.so.1
    libattr.so.1             libcudnn_cnn.so.9.10.2                       libgcc_s.so.1              libm.so.6           libpam.so.0.85.1       libsepol.so.2          libuuid.so.1.3.0
    libattr.so.1.1.2502      libcudnn_engines_precompiled.so.9            libgcrypt.so.20            libmd.so.0          libpam_misc.so.0       libsmartcols.so.1      libxxhash.so.0
    libaudit.so.1            libcudnn_engines_precompiled.so.9.10.2       libgcrypt.so.20.4.3        libmd.so.0.1.0      libpam_misc.so.0.82.1  libsmartcols.so.1.1.0  libxxhash.so.0.8.2
    libaudit.so.1.0.0        libcudnn_engines_runtime_compiled.so.9       libgmp.so.10               libmemusage.so      libpamc.so.0           libsqlite3.so.0        libz.so.1
    libblkid.so.1            libcudnn_engines_runtime_compiled.so.9.10.2  libgmp.so.10.5.0           libmenuw.so.6       libpamc.so.0.82.1      libsqlite3.so.0.8.6    libz.so.1.3
    libblkid.so.1.1.0        libcudnn_graph.so.9                          libgnutls.so.30            libmenuw.so.6.4     libpanelw.so.6         libss.so.2             libzstd.so.1
    libbz2.so.1              libcudnn_graph.so.9.10.2                     libgnutls.so.30.37.1       libmount.so.1       libpanelw.so.6.4       libss.so.2.0           libzstd.so.1.5.5
    libbz2.so.1.0            libcudnn_heuristic.so.9                      libgpg-error.so.0          libmount.so.1.1.0   libpcprofile.so        libssl.so.3            ossl-modules
    libbz2.so.1.0.4          libcudnn_heuristic.so.9.10.2                 libgpg-error.so.0.34.0     libmvec.so.1        libpcre2-8.so.0        libstdc++.so.6         perl-base
    libc.so.6                libcudnn_ops.so.9                            libhistory.so.8            libnccl.so.2        libpcre2-8.so.0.11.2   libstdc++.so.6.0.33    sasl2
    libc_malloc_debug.so.0   libcudnn_ops.so.9.10.2                       libhistory.so.8.2          libnccl.so.2.27.3   libproc2.so.0          libsystemd.so.0        security
    
    
  • on the manually started container (same image!)
      e2fsprogs                         libcudadebugger.so.580.82.07                 liblber.so.2                      libnvidia-glsi.so.580.82.07             libreadline.so.8.2
      engines-3                         libcudnn.so.9                                liblber.so.2.0.200                libnvidia-glvkspirv.so.580.82.07        libresolv.so.2
      gbm                               libcudnn.so.9.10.2                           libldap.so.2                      libnvidia-gpucomp.so.580.82.07          librt.so.1
      gconv                             libcudnn_adv.so.9                            libldap.so.2.0.200                libnvidia-gtk2.so.580.82.07             libsasl2.so.2
      ld-linux-x86-64.so.2              libcudnn_adv.so.9.10.2                       liblz4.so.1                       libnvidia-gtk3.so.580.82.07             libsasl2.so.2.0.25
      libBrokenLocale.so.1              libcudnn_cnn.so.9                            liblz4.so.1.9.4                   libnvidia-ml.so.1                       libseccomp.so.2
      libEGL_nvidia.so.0                libcudnn_cnn.so.9.10.2                       liblzma.so.5                      libnvidia-ml.so.580.82.07               libseccomp.so.2.5.5
      libEGL_nvidia.so.580.82.07        libcudnn_engines_precompiled.so.9            liblzma.so.5.4.5                  libnvidia-ngx.so.1                      libselinux.so.1
      libGLESv1_CM_nvidia.so.1          libcudnn_engines_precompiled.so.9.10.2       libm.so.6                         libnvidia-ngx.so.580.82.07              libsemanage.so.2
      libGLESv1_CM_nvidia.so.580.82.07  libcudnn_engines_runtime_compiled.so.9       libmd.so.0                        libnvidia-nvvm.so.4                     libsepol.so.2
      libGLESv2_nvidia.so.2             libcudnn_engines_runtime_compiled.so.9.10.2  libmd.so.0.1.0                    libnvidia-nvvm.so.580.82.07             libsmartcols.so.1
      libGLESv2_nvidia.so.580.82.07     libcudnn_graph.so.9                          libmemusage.so                    libnvidia-opencl.so.1                   libsmartcols.so.1.1.0
      libGLX_indirect.so.0              libcudnn_graph.so.9.10.2                     libmenuw.so.6                     libnvidia-opencl.so.580.82.07           libsqlite3.so.0
      libGLX_nvidia.so.0                libcudnn_heuristic.so.9                      libmenuw.so.6.4                   libnvidia-opticalflow.so                libsqlite3.so.0.8.6
      libGLX_nvidia.so.580.82.07        libcudnn_heuristic.so.9.10.2                 libmount.so.1                     libnvidia-opticalflow.so.1              libss.so.2
      libacl.so.1                       libcudnn_ops.so.9                            libmount.so.1.1.0                 libnvidia-opticalflow.so.580.82.07      libss.so.2.0
      libacl.so.1.1.2302                libcudnn_ops.so.9.10.2                       libmvec.so.1                      libnvidia-pkcs11-openssl3.so.580.82.07  libssl.so.3
      libanl.so.1                       libdb-5.3.so                                 libnccl.so.2                      libnvidia-present.so.580.82.07          libstdc++.so.6
      libapt-pkg.so.6.0                 libdebconfclient.so.0                        libnccl.so.2.27.3                 libnvidia-ptxjitcompiler.so.1           libstdc++.so.6.0.33
      libapt-pkg.so.6.0.0               libdebconfclient.so.0.0.0                    libncursesw.so.6                  libnvidia-ptxjitcompiler.so.580.82.07   libsystemd.so.0
      libapt-private.so.0.0             libdl.so.2                                   libncursesw.so.6.4                libnvidia-rtcore.so.580.82.07           libsystemd.so.0.38.0
      libapt-private.so.0.0.0           libdrop_ambient.so.0                         libnettle.so.8                    libnvidia-sandboxutils.so.1             libtasn1.so.6
      libassuan.so.0                    libdrop_ambient.so.0.0.0                     libnettle.so.8.8                  libnvidia-sandboxutils.so.580.82.07     libtasn1.so.6.6.3
      libassuan.so.0.8.6                libe2p.so.2                                  libnpth.so.0                      libnvidia-tls.so.580.82.07              libthread_db.so.1
      libattr.so.1                      libe2p.so.2.3                                libnpth.so.0.1.2                  libnvidia-vksc-core.so.1                libtic.so.6
      libattr.so.1.1.2502               libext2fs.so.2                               libnsl.so.1                       libnvidia-vksc-core.so.580.82.07        libtic.so.6.4
      libaudit.so.1                     libext2fs.so.2.4                             libnss_compat.so.2                libnvidia-wayland-client.so.580.82.07   libtinfo.so.6
      libaudit.so.1.0.0                 libffi.so.8                                  libnss_dns.so.2                   libnvoptix.so.1                         libtinfo.so.6.4
      libblkid.so.1                     libffi.so.8.1.4                              libnss_files.so.2                 libnvoptix.so.580.82.07                 libudev.so.1
      libblkid.so.1.1.0                 libformw.so.6                                libnss_hesiod.so.2                libp11-kit.so.0                         libudev.so.1.7.8
      libbz2.so.1                       libformw.so.6.4                              libnvcuvid.so                     libp11-kit.so.0.3.1                     libunistring.so.5
      libbz2.so.1.0                     libgcc_s.so.1                                libnvcuvid.so.1                   libpam.so.0                             libunistring.so.5.0.0
      libbz2.so.1.0.4                   libgcrypt.so.20                              libnvcuvid.so.580.82.07           libpam.so.0.85.1                        libutil.so.1
      libc.so.6                         libgcrypt.so.20.4.3                          libnvidia-allocator.so.1          libpam_misc.so.0                        libuuid.so.1
      libc_malloc_debug.so.0            libgmp.so.10                                 libnvidia-allocator.so.580.82.07  libpam_misc.so.0.82.1                   libuuid.so.1.3.0
      libcap-ng.so.0                    libgmp.so.10.5.0                             libnvidia-cfg.so.1                libpamc.so.0                            libxxhash.so.0
      libcap-ng.so.0.0.0                libgnutls.so.30                              libnvidia-cfg.so.580.82.07        libpamc.so.0.82.1                       libxxhash.so.0.8.2
      libcap.so.2                       libgnutls.so.30.37.1                         libnvidia-egl-gbm.so.1            libpanelw.so.6                          libz.so.1
      libcap.so.2.66                    libgpg-error.so.0                            libnvidia-egl-gbm.so.1.1.2        libpanelw.so.6.4                        libz.so.1.3
      libcom_err.so.2                   libgpg-error.so.0.34.0                       libnvidia-egl-wayland.so.1        libpcprofile.so                         libzstd.so.1
      libcom_err.so.2.1                 libhistory.so.8                              libnvidia-egl-wayland.so.1.1.19   libpcre2-8.so.0                         libzstd.so.1.5.5
      libcrypt.so.1                     libhistory.so.8.2                            libnvidia-eglcore.so.580.82.07    libpcre2-8.so.0.11.2                    nvidia
      libcrypt.so.1.1.0                 libhogweed.so.6                              libnvidia-encode.so               libproc2.so.0                           ossl-modules
      libcrypto.so.3                    libhogweed.so.6.8                            libnvidia-encode.so.1             libproc2.so.0.0.2                       perl-base
      libcuda.so                        libidn2.so.0                                 libnvidia-encode.so.580.82.07     libpsx.so.2                             sasl2
      libcuda.so.1                      libidn2.so.0.4.0                             libnvidia-fbc.so.1                libpsx.so.2.66                          security
      libcuda.so.580.82.07              libksba.so.8                                 libnvidia-fbc.so.580.82.07        libpthread.so.0                         vdpau
      libcudadebugger.so.1              libksba.so.8.14.6                            libnvidia-glcore.so.580.82.07     libreadline.so.8
    
    

The same applies to /usr/bin, with nvidia-smi clearly missing from the container started using the GitLab runner (a quick comparison is sketched below).
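
A quick way to compare the two containers side by side; <ci-container> and <manual-container> are placeholders for the IDs shown by podman ps:

for c in <ci-container> <manual-container>; do
  echo "== $c =="
  podman exec "$c" sh -c 'ls /usr/lib/x86_64-linux-gnu | grep -cE "libcuda|libnvidia"; command -v nvidia-smi || echo nvidia-smi not found'
done
# on this host: the runner-started container reports 0 matches and no nvidia-smi,
# the manually started one reports dozens of matches and /usr/bin/nvidia-smi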

Quick thought without validation - check whether the image has an entrypoint configured that runs at container start and installs specific packages. I suspect that the manual run does this, whereas the CI runner's Docker executor does not.
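
For example, with nothing image-specific assumed (just Podman's built-in image inspect):

podman image inspect \
  --format 'Entrypoint: {{.Config.Entrypoint}}  Cmd: {{.Config.Cmd}}' \
  docker.io/nvidia/cuda:12.9.1-cudnn-runtime-ubuntu24.04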

I will check that. I need to find the Dockerfile for the Nvidia CUDA cuDNN images first. :smiley:

After all this:

  • GitLab runner CI job’s log clearly states that The NVIDIA Driver was not detected
  • after inspecting the libraries it is clear that the CUDA-related ones are missing but the cuDNN ones are not. While the cuDNN libraries do not come with an installer but are a simple download from Nvidia's website and a copy into /usr/lib/..., the CUDA ones come from installing the toolkit. And if the installer fails to detect an Nvidia GPU (this is my experience from setting up CUDA on a normal host), the toolkit will not be installed, hence the missing libraries.

So we are back to square one: why is the runner not detecting the GPU, and/or why does it have some sort of conflict with the Nvidia Container Toolkit, which - as seen previously - works perfectly fine with Podman?

UPDATE: Hah, didn’t know Nvidia was using GitLab and not GitHub for hosting.