Running `docker container run` with the `--gpus` option in jobs fails

Problem to solve

I need to be able to run `docker container run --gpus all [...]` in a job that uses the `docker:latest` image with a `docker:dind` service, on self-hosted containerized GitLab runners.

Steps to reproduce / Configuration

In my case, my runners are custom containerized GitLab runners with NVIDIA GPU support.
My Dockerfile, `gpu-gitlab-runner.dockerfile`, is quite simple:

FROM nvidia/cuda:12.6.3-cudnn-runtime-ubuntu24.04

# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility

# Install dependencies
RUN apt update -y && \
    apt install -y --no-install-recommends \
    apt-transport-https \
    curl \
    ca-certificates \
    gnupg \
    lsb-release \
    git \
    git-lfs \
    wget \
    tzdata \
    openssh-client && \
    rm -rf /var/lib/apt/lists/*

# Install Docker CLI
RUN install -m 0755 -d /etc/apt/keyrings && \
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc && \
    chmod a+r /etc/apt/keyrings/docker.asc && \
    echo \
      "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
      $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
      tee /etc/apt/sources.list.d/docker.list > /dev/null && \
    apt update -y && \
    apt install -y docker-ce-cli && \
    rm -rf /var/lib/apt/lists/*

# Install and configure NVIDIA container toolkit
RUN curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
    && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
RUN apt update -y && apt install -y nvidia-container-toolkit && rm -rf /var/lib/apt/lists/*
RUN nvidia-ctk runtime configure --runtime=docker

# Install and configure Gitlab runner
RUN curl -L "https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh" | bash
RUN apt update -y && apt install -y gitlab-runner && rm -rf /var/lib/apt/lists/*
RUN mkdir -p /home/gitlab-runner && chown gitlab-runner:gitlab-runner /home/gitlab-runner
COPY config.toml /etc/gitlab-runner/config.toml

CMD ["/usr/bin/gitlab-runner", "run", "--user=gitlab-runner", "--working-directory=/home/gitlab-runner"]

The config.toml is the following:

concurrent = 1
check_interval = 0
user = "gitlab-runner"
connection_max_age = "15m0s"
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "linux-docker-executor-gpu"
  url = "XXXXX"
  id = XXX
  token = "XXXXX"
  token_obtained_at = 2024-12-06T00:00:00Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "docker"
  environment = ["DOCKER_TLS_CERTDIR=/certs"]
  [runners.cache]
    MaxUploadedArchiveSize = 0
  [runners.docker]
    tls_verify = false
    gpus = "all"
    image = "alpine:latest"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/certs/client", "/cache"]
    shm_size = 0
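
A quick way to confirm the runner actually loads this config once the container is up (a sketch; the container name gitab-runner-test is taken from the run command below):

# Sketch: list the runners configured in /etc/gitlab-runner/config.toml.
docker container exec gitab-runner-test gitlab-runner list
# Should show "linux-docker-executor-gpu" with Executor=docker.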

I launch my runners on physical machines equipped with NVIDIA GPUs, running Ubuntu 24.04 with the NVIDIA drivers and nvidia-container-toolkit installed, with the following command:

docker container run --gpus all -d --name gitab-runner-test --restart always -v /var/run/docker.sock:/var/run/docker.sock gpu-gitlab-runner
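
Before running any jobs, a quick check that the runner container itself sees the GPUs and can reach the host daemon through the mounted socket (a sketch, reusing the container name from the command above):

# Sketch: with --gpus all and NVIDIA_DRIVER_CAPABILITIES including "utility",
# nvidia-smi should be available inside the runner container.
docker container exec gitab-runner-test nvidia-smi
# Sketch: confirm the mounted /var/run/docker.sock reaches the host daemon.
docker container exec gitab-runner-test docker info --format '{{.ServerVersion}}'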

From there, I have proper access to my GPUs when running a job. For example, the following job works perfectly fine on such a runner:

test-gpu-job:
  stage: test
  image: nvidia/cuda:12.6.3-cudnn-runtime-ubuntu24.04
  script:
    - nvidia-smi

However, the following job fails

test-docker-job-on-gpu-runner:
  stage: test
  image: docker:latest
  services:
    - docker:dind
  script:
    - docker container run --gpus all --rm nvidia/cuda:12.6.3-cudnn-runtime-ubuntu24.04 nvidia-smi

with the following error:

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
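
A diagnostic that might help narrow this down (a sketch; these commands would go in the script of the failing job above and assume nothing beyond the docker:dind service already declared): ask the daemon the job actually talks to which runtimes it has registered.

# Sketch: show which daemon the docker CLI in the job is talking to
# and which runtimes that daemon knows about.
docker info --format 'server={{.ServerVersion}} default_runtime={{.DefaultRuntime}}'
docker info --format '{{json .Runtimes}}'
# If "nvidia" does not appear in the runtimes, the daemon has no NVIDIA
# runtime to select, which would match the error above.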

Additional experimentation

Non-GPU docker job

The following job works fine on my runner:

test-docker-job-on-gpu-runner:
  stage: test
  image: docker:latest
  services:
    - docker:dind
  script:
    - docker container run --rm hello-world

Manual run

I managed to reproduce the behaviour I’m looking for manually, on the same machine and with the same docker image, outside of any gitlab-runner job:

> docker container run --gpus all -it --name gitab-runner-test --rm -v /var/run/docker.sock:/var/run/docker.sock gpu-runner:latest bash
root@XXXXX:/# docker container run --privileged --gpus all --rm -it docker sh
/ # docker container run --gpus all --rm nvidia/cuda:12.6.3-cudnn-runtime-ubuntu24.04 nvidia-smi

==========
== CUDA ==
==========

CUDA Version 12.6.3
[...]

This worked perfectly fine and all the GPUs were visible, so it seems my gpu-gitlab-runner image is capable of doing what I want.

Image without Docker CLI

I also tried removing the Docker CLI installation from my Dockerfile. The behaviour is exactly the same. However, I could not repeat the manual test above in this case, as the Docker CLI is no longer available in my gpu-gitlab-runner image.

My questions

  • How can I make docker container run --gpus [...] work in a gitlab-runner job?
  • Is there something I need to change in my config.toml or .gitlab-ci.yml?

Versions

  • Self-managed GitLab: v17.6.1-ee
  • Self-hosted GitLab Runner: 17.6.0

It seems I am not the only one with this issue: see gitlab-runner issue #36830, “DIND with GPU: Error response from daemon: could not select device driver "nvidia" with capabilities: [[gpu]]”.