Problem to solve
I need to be able to run docker container run --gpus all [...] in a job using the docker:latest image with the docker:dind service, on self-hosted containerized GitLab runners.
Steps to reproduce / Configuration
In my case, my runners are custom containerized GitLab runners with NVIDIA GPU support. My Dockerfile, gpu-gitlab-runner.dockerfile, is quite simple:
FROM nvidia/cuda:12.6.3-cudnn-runtime-ubuntu24.04
# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
# Install dependencies
RUN apt update -y && \
    apt install -y --no-install-recommends \
        apt-transport-https \
        curl \
        ca-certificates \
        gnupg \
        lsb-release \
        git \
        git-lfs \
        wget \
        tzdata \
        openssh-client && \
    rm -rf /var/lib/apt/lists/*
# Install Docker CLI
RUN install -m 0755 -d /etc/apt/keyrings && \
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc && \
    chmod a+r /etc/apt/keyrings/docker.asc && \
    echo \
      "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
      $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
      tee /etc/apt/sources.list.d/docker.list > /dev/null && \
    apt update -y && \
    apt install -y docker-ce-cli && \
    rm -rf /var/lib/apt/lists/*
# Install and configure NVIDIA container toolkit
RUN curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && \
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
RUN apt update -y && apt install -y nvidia-container-toolkit && rm -rf /var/lib/apt/lists/*
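# Note: nvidia-ctk below registers an "nvidia" runtime entry in /etc/docker/daemon.json;
# this only affects a Docker daemon started inside this image, not the host daemon.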
RUN nvidia-ctk runtime configure --runtime=docker
# Install and configure Gitlab runner
RUN curl -L "https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh" | bash
RUN apt update -y && apt install -y gitlab-runner && rm -rf /var/lib/apt/lists/*
RUN mkdir -p /home/gitlab-runner && chown gitlab-runner:gitlab-runner /home/gitlab-runner
COPY config.toml /etc/gitlab-runner/config.toml
CMD ["/usr/bin/gitlab-runner", "run", "--user=gitlab-runner", "--working-directory=/home/gitlab-runner"]
The config.toml is the following:
concurrent = 1
check_interval = 0
user = "gitlab-runner"
connection_max_age = "15m0s"
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "linux-docker-executor-gpu"
  url = "XXXXX"
  id = XXX
  token = "XXXXX"
  token_obtained_at = 2024-12-06T00:00:00Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "docker"
  environment = ["DOCKER_TLS_CERTDIR=/certs"]
  [runners.cache]
    MaxUploadedArchiveSize = 0
  [runners.docker]
    tls_verify = false
    gpus = "all"
    image = "alpine:latest"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/certs/client", "/cache"]
    shm_size = 0
I launch my runners on physical machines equipped with NVIDIA GPUs, running Ubuntu 24.04 with the NVIDIA drivers and nvidia-container-toolkit installed, using the command:
docker container run --gpus all -d --name gitlab-runner-test --restart always -v /var/run/docker.sock:/var/run/docker.sock gpu-gitlab-runner
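As a sanity check, something like the following should print the GPU list from inside the runner container itself (container name taken from the command above):
docker exec gitlab-runner-test nvidia-smi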
From there, I have proper access to my GPUs when running a job. For example, the following job works perfectly fine on such a runner:
test-gpu-job:
  stage: test
  image: nvidia/cuda:12.6.3-cudnn-runtime-ubuntu24.04
  script:
    - nvidia-smi
However, the following job fails
test-docker-job-on-gpu-runner:
  stage: test
  image: docker:latest
  services:
    - docker:dind
  script:
    - docker container run --gpus all --rm nvidia/cuda:12.6.3-cudnn-runtime-ubuntu24.04 nvidia-smi
with the following error:
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
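My understanding is that the gpus = "all" option in config.toml applies to the job container itself, while the docker commands in the script are served by the separate docker:dind daemon, which presumably has no nvidia runtime registered. If that is right, a throwaway job like the following (hypothetical name, otherwise the same image and service as above) should list the runtimes the dind daemon actually knows about:
test-docker-runtimes:
  stage: test
  image: docker:latest
  services:
    - docker:dind
  script:
    # Print the runtimes registered with the daemon serving this job;
    # I would expect runc/io.containerd entries only, with no "nvidia" runtime.
    - docker info --format '{{json .Runtimes}}'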
Additional experimentation
Non-GPU docker job
The following job works fine on my runner:
test-docker-job-on-gpu-runner:
  stage: test
  image: docker:latest
  services:
    - docker:dind
  script:
    - docker container run --rm hello-world
Manual run
I managed to reproduce the behaviour I'm looking for manually, on the same machine and with the same images, outside of any gitlab-runner job:
> docker container run --gpus all -it --name gitlab-runner-test --rm -v /var/run/docker.sock:/var/run/docker.sock gpu-runner:latest bash
root@XXXXX:/# docker container run --privileged --gpus all --rm -it docker sh
/ # docker container run --gpus all --rm nvidia/cuda:12.6.3-cudnn-runtime-ubuntu24.04 nvidia-smi
==========
== CUDA ==
==========
CUDA Version 12.6.3
[...]
This worked perfectly fine and all the GPUs are visible, so it seems my gpu-gitlab-runner image is perfectly capable of doing what I want.
Image without Docker CLI
I also tried removing the Docker CLI installation from my Dockerfile. The behaviour is exactly the same. However, I could not run the manual test above in this case, as the Docker CLI is no longer available in my gpu-gitlab-runner image.
My questions
- How can I make docker container run --gpus [...] work in a gitlab-runner job?
- Is there something I need to change in my config.toml or .gitlab-ci.yml? (One untested idea is sketched just after this list.)
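For the second question, the only direction I can think of so far (untested) is to drop the docker:dind service entirely and bind the host Docker socket into job containers, so that docker commands in jobs are served by the host daemon, which does have the nvidia runtime configured. A sketch of what the [runners.docker] section might become, assuming the security trade-off of exposing the host daemon is acceptable:
  [runners.docker]
    tls_verify = false
    gpus = "all"
    image = "alpine:latest"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    # Untested: bind the host daemon socket into job containers instead of using dind;
    # jobs would then drop the docker:dind service and the DOCKER_TLS_CERTDIR variable.
    volumes = ["/var/run/docker.sock:/var/run/docker.sock", "/cache"]
    shm_size = 0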
Versions
- Self-managed GitLab
- Self-hosted GitLab Runners
- GitLab: v17.6.1-ee
- GitLab Runner: 17.6.0