Docker executor failing with socket binding

Hey Folks,

We have a bunch of gitlab-runners using the docker executor on machines we own, and GitLab is self-managed. For legacy reasons, the gitlab-runner service lives in a docker container, but the host OS owns the docker service, so we bind the docker sock into the container that runs the service with -v /var/run/docker.sock:/var/run/docker.sock --network=host
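
Roughly, the runner container is started like this (the image tag and config mount are illustrative, not our exact invocation):

# gitlab-runner in a container, reusing the host's docker daemon
docker run -d --name gitlab-runner --restart always \
  --network=host \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /srv/gitlab-runner/config:/etc/gitlab-runner \
  gitlab/gitlab-runner:latest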

I will sporadically see failures like this when a job is accepted:

ERROR: Preparation failed: adding cache volume: set volume permissions: create permission container for volume "runner-xqtsnmt4-project-21-concurrent-2-cache-3c3f060a0374fc8bc39395164f415a70": Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? (linux_set.go:90:120s) Will be retried in 3s ...

I also occasionally see this after git checkouts:

ERROR: Job failed (system failure): Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? (docker.go:705:120s)

GitLab Runner version is 13.1.0, Docker version is 19.03.11.

The runners themselves do not access the docker executor. For the config.toml, this is what I have in the docker section:

Has anyone seen this before? I see similar errors for folks using the dind service, but I'm unsure if that generalizes to my use case.

Hi,

Can you specify the host OS a little more, including its details from /etc/os-release? Maybe SELinux prevents access here on RHEL/CentOS.
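
For example, something like this run on the host would cover it (assuming the SELinux tools are installed):

# host OS and kernel details
cat /etc/os-release
uname -r

# SELinux status
getenforce
sestatus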

DinD is actually a good hint, since this is exactly what's happening here, but with a different approach. I would follow these issues and their ideas on solving the problem. Here are some from my Google search:



Cheers,
Michael

Hey Michael,

Thanks for your insight here. Our hosts are CentOS 7, with kernel version 3.10. The docker image our jobs run in is Ubuntu 18.04.

Thanks for the links here – the note in "Errors connecting to Docker socket" (#2408) · Issues · GitLab.org / gitlab-runner seems particularly relevant to my use case.

The fix Utku suggests doesn’t seem viable for us though – removing all stopped containers, volumes, etc… after every job slows us down too much.

So from what folks have said previously, it looks like dockerd can become blocked by some massive I/O operation.
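
A quick way to check for that the next time it happens might be something like this on the host (assuming systemd/journald):

# processes stuck in uninterruptible I/O wait (D state), dockerd included
ps -eo state,pid,comm | awk '$1 == "D"'

# what the docker daemon logged around the failure
journalctl -u docker --since "15 minutes ago" --no-pager | tail -n 50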

Assuming the size of our builds is causing these failures, perhaps it's time to trim down our build footprint. To contextualize the sizes: the base docker image we work off is about 12 GB, and we generate roughly another 8 GB between the project repo and built files. From what I can tell, each volume created by the gitlab-runner is about 20 GB in size (which seems to add up, given the above). There are some efforts to shrink this (e.g. shallow clones, a persistent bazel workspace across all gitlab runners), but devops is… hard.
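
For reference, per-volume sizes can be checked with docker system df (the grep pattern for runner-created volumes is a guess at the naming):

# disk usage per volume, filtered to runner-created ones
docker system df -v | grep runner-

# overall image/container/volume usage
docker system df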

I guess all I can ask is…

  • Are there any low-hanging fruit on the gitlab-runner side for shrinking the size of the volumes it generates?

  • Is 12 GB absurdly large for an image to be using for CI?

Cheers,
James

Since updating gitlab-runner from 13.7.0 to 13.9.0 and docker-ce from 5:19.03.13 to 5:20.10.4, we encounter the following docker.sock errors:

related to network

Preparing the "docker" executor
ERROR: Failed to remove network for build
ERROR: Preparation failed: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? (docker.go:974:120s)

related to volumes

ERROR: Preparation failed: adding cache volume: set volume permissions: create permission container for volume "runner-fxzrrdfo-project-7756-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70": Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? (linux_set.go:95:120s)

related to cleanup

Cleaning up file based variables
ERROR: Job failed (system failure): Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? (docker.go:757:120s)

the toml:

concurrent = 15
check_interval = 0
listen_address = ":9252"

[session_server]
  session_timeout = 1800

[[runners]]
  name = "Ubuntu-x86_64"
  url = "https://code.test.com"
  token = "foo"
  output_limit = 102400
  executor = "docker"
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.docker]
    tls_verify = false
    image = "python"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/var/run/docker.sock:/var/run/docker.sock", "/cache"]
    shm_size = 0

The machine has 14 cores / 28 threads (Intel® Xeon® Gold 6132 CPU @ 2.60GHz) and 128 GB RAM.
It seems that these errors come more often when:

  • more than 500 cache volumes are on disk
  • load average is higher than 14

potential workaround:

docker container prune -f && docker volume prune -f && docker network prune -f
systemctl restart docker
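
To avoid pruning after every job, a rough sketch of a threshold-based variant we could run from cron or a systemd timer (the threshold and the "cache" name filter are assumptions):

#!/bin/sh
# prune only once the cache-volume count gets high
THRESHOLD=400
COUNT=$(docker volume ls -q --filter "name=cache" | wc -l)
if [ "$COUNT" -gt "$THRESHOLD" ]; then
  docker container prune -f
  docker volume prune -f
  docker network prune -f
fi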

I started to monitor the load average and iostat to see if there is a real issue with high system load when this occurs.
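
The monitoring itself is just a loop along these lines (assuming sysstat is installed; the log path is illustrative):

# sample load average and disk I/O once a minute
while true; do
  { date; uptime; iostat -x 1 1; } >> /var/log/runner-load.log
  sleep 60
done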

Any hints are welcome.

Another good reference and deep dive into these volume issues: Docker conflict, "already in use by container" (#4327) · Issues · GitLab.org / gitlab-runner · GitLab
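
When that "already in use by container" conflict shows up, the leftover containers holding the volume can usually be found and removed by hand, something like this (the name filter is an assumption about how runner containers are named):

# exited runner-created containers that may still hold cache volumes
docker ps -a --filter "name=runner-" --filter "status=exited"

# remove them (more targeted than a full prune)
docker rm $(docker ps -aq --filter "name=runner-" --filter "status=exited")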