Problem to solve:
I have a GitLab pipeline that is supposed to train an AI model. The pipeline is executed with a Docker executor, and the training should run in the image `gitlab.lrz.de:5005/messtechnik-labor/barcs/docker/mmdetection3d-training/tmp:0.5.1`. The problem is that during training I only have an `shm_size` of 64 MB in this container, which is not sufficient. I have adjusted the `shm_size` in both the `docker-compose.yml` file and the GitLab Runner's `config.toml` file. When I run `docker inspect`, I can confirm that the `shm_size` has been successfully increased to 20 GB. However, when I start the pipeline, these 20 GB are not available in the training image. What else do I need to adjust, or what else can I do? I would really appreciate any help.
Steps to reproduce:
- I have modified the `shm_size` in the `docker-compose.yml` file (which starts the runner) and in the GitLab Runner's `config.toml` file.
- I have verified the changes by running `docker inspect` and confirmed that the `shm_size` is set to 20 GB.
- I have restarted the GitLab Runner and the associated containers multiple times, but the issue persists.
- I have reviewed the GitLab Runner documentation regarding shared memory allocation and Docker configuration.
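One thing worth checking while a job is running: the long-lived `gitlab-runner` container (started by `docker-compose.yml`) and the per-job containers (created by the Docker executor) are different containers, so `docker inspect` on `gitlab-runner` can show 20 GB while the job containers still get Docker's 64 MB default. A diagnostic sketch, assuming Docker CLI access on the runner host (the `runner-` name filter matches the Docker executor's container naming):

```shell
# Expected shared memory: 20 GiB, expressed in bytes (what ShmSize reports).
EXPECTED=$((20 * 1024 * 1024 * 1024))
echo "expected ShmSize: $EXPECTED"

command -v docker >/dev/null || { echo "docker CLI not available"; exit 0; }

# While a pipeline job is running, inspect the job containers (the Docker
# executor names them with a "runner-" prefix), not the runner container:
for id in $(docker ps --filter "name=runner-" --format '{{.ID}}'); do
  # ShmSize is reported in bytes; 0 means Docker's 64 MB default is in effect.
  docker inspect --format '{{.Name}} ShmSize={{.HostConfig.ShmSize}}' "$id"
done
```

If the job containers report `ShmSize=0`, the runner is not passing the configured value through, which points at the `config.toml` the runner actually loads.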
Configuration:
config.toml:
```toml
[[runners]]
  name = "my-runner"
  url = "https://gitlab.com/"
  token = "YOUR_TOKEN"
  executor = "docker"
  [runners.docker]
    tls_verify = false
    image = "docker:latest"
    privileged = true
    gpus = "all"  # Ensure only one occurrence
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache", "/data:/data"]
    shm_size = 21474836480  # 20 GB, in bytes (a bare "20g" is not valid TOML)
```
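One thing worth double-checking here: unlike `docker-compose.yml`, the `shm_size` option under `[runners.docker]` in `config.toml` is documented as a plain integer number of bytes, so a suffixed value like `20g` will not parse and the runner falls back to the 64 MB default. The byte value for 20 GiB can be computed as:

```shell
# config.toml takes shm_size as an integer number of bytes (no "g" suffix).
# 20 GiB in bytes:
echo $((20 * 1024 * 1024 * 1024))   # prints 21474836480
```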
docker-compose.yml:
```yaml
volumes:
  gitlab-runner-config:
    external: true

services:
  gitlab-runner:
    container_name: gitlab-runner
    restart: always
    image: gitlab/gitlab-runner:latest
    shm_size: 20g  # Set shared memory to 20 GB (for the runner container itself)
    volumes:
      - ${HOME}/data:/data
      - /etc/mysql:/etc/mysql
      - /var/run/docker.sock:/var/run/docker.sock
      - gitlab-runner-config:/etc/gitlab-runner
    hostname: "$(hostname)"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```
Pipeline:
```yaml
stages:
  - run_training

run_training:
  image: gitlab.lrz.de:5005/messtechnik-labor/barcs/docker/mmdetection3d-training/tmp:0.5.1
  tags:
    - ai-worker-3
  stage: run_training
  script:
    - echo "Running training job"
    - echo "Check if GPU is available"
    - nvidia-smi
    - echo "Shared memory size:"
    - df -h /dev/shm
```
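Independent of the runner configuration, the job script itself could guard against a too-small `/dev/shm` so training fails fast with a clear message instead of crashing mid-run. A minimal sketch (the `check_shm` helper and the thresholds are my own, hypothetical additions, not part of the pipeline above):

```shell
# check_shm MIN_KB -> succeeds only if /dev/shm is at least MIN_KB 1K-blocks.
check_shm() {
  min_kb=$1
  # Total size of /dev/shm in 1K blocks (second column of the data row).
  shm_kb=$(df -k /dev/shm | awk 'NR==2 {print $2}')
  echo "/dev/shm: ${shm_kb}K (required: ${min_kb}K)"
  [ "$shm_kb" -ge "$min_kb" ]
}

# In the pipeline, 20 GiB would be required, i.e. 20 * 1024 * 1024 blocks;
# a job script line could then read: check_shm $((20 * 1024 * 1024)) || exit 1
if check_shm 1024; then
  echo "shm check passed for a 1 MiB threshold"
fi
```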