Docker executor failing with socket binding

Hey Folks,

We have a bunch of gitlab-runners using the Docker executor on machines we own, and our GitLab instance is self-managed. For legacy reasons, the gitlab-runner service lives in a Docker container, but the host OS owns the Docker service, so we bind the Docker socket into the container that runs the service with these flags: -v /var/run/docker.sock:/var/run/docker.sock --network=host

I sporadically see failures like this when a job is accepted:

ERROR: Preparation failed: adding cache volume: set volume permissions: create permission container for volume "runner-xqtsnmt4-project-21-concurrent-2-cache-3c3f060a0374fc8bc39395164f415a70": Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? (linux_set.go:90:120s) Will be retried in 3s ...

I also occasionally see the following after git checkouts:

ERROR: Job failed (system failure): Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? (docker.go:705:120s)

GitLab Runner version is 13.1.0, and Docker version is 19.03.11.

The runners themselves do not access the docker executor. For the config.toml, this is what I have in the docker section:

Has anyone seen this before? I see similar errors from folks using the dind service, but I'm unsure whether that generalizes to my use case.

Hi,

Can you describe the host OS in a little more detail, including the contents of /etc/os-release? Maybe SELinux is preventing access here on RHEL/CentOS.

DinD is actually a good hint, since this is exactly what is happening here, just with a different approach. I would follow these issues and their ideas for solving the problem. Here are some from my Google search:



Cheers,
Michael

Hey Michael,

Thanks for your insight here. Our hosts are CentOS 7, with kernel version 3.10. The Docker image in which our jobs run is Ubuntu 18.04.

Thanks for the links. The note in "Errors connecting to Docker socket (#2408) · Issues · GitLab.org / gitlab-runner" seems particularly relevant to my use case.

The fix Utku suggests doesn’t seem viable for us though – removing all stopped containers, volumes, etc… after every job slows us down too much.

So from what folks have said previously, it looks like dockerd can become blocked by some massive I/O operation.
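
One way to check whether dockerd really stalls is to poll the socket and log how long it takes to answer. Here is a minimal sketch, assuming Python 3 is available on the host and using the Docker Engine API's GET /_ping endpoint. The class and function names are my own, and the 5-second timeout is an arbitrary choice.

```python
import http.client
import socket
import time


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that talks over a Unix domain socket instead of TCP."""

    def __init__(self, socket_path, timeout=5.0):
        # The host name is only used for the Host header; no TCP connection is made.
        super().__init__("localhost", timeout=timeout)
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self.socket_path)
        self.sock = sock


def ping_docker(socket_path="/var/run/docker.sock", timeout=5.0):
    """Send one GET /_ping to the daemon and return (ok, elapsed_seconds)."""
    start = time.monotonic()
    conn = UnixHTTPConnection(socket_path, timeout=timeout)
    try:
        conn.request("GET", "/_ping")
        ok = conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        # Missing socket, refused connection, timeout, or a malformed reply.
        ok = False
    finally:
        conn.close()
    return ok, time.monotonic() - start
```

Calling ping_docker() every few seconds and logging the result with a timestamp would show whether slow or failed pings line up with the failed jobs.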

Assuming the size of our build is causing this failure, perhaps it's time to trim down our build footprint. To put the sizes in context: the base Docker image we work from is about 12 GB, and we generate about 8 GB more between the project repo and built files. From what I can tell, each volume created by the gitlab-runner is about 20 GB in size (makes sense?). There are some efforts to shrink this (e.g. shallow clones, a persistent Bazel workspace across all gitlab-runners), but devops is… hard.
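
As a rough sanity check on those numbers, here is a back-of-the-envelope sketch. It assumes a volume ends up holding roughly the image contents plus the build output, which I have not verified, and the number of concurrent job slots is a made-up value for illustration.

```python
def volume_footprint_gb(image_gb, build_gb, concurrent_jobs):
    """Return (per-volume size, total across job slots) in GB."""
    per_volume = image_gb + build_gb
    return per_volume, per_volume * concurrent_jobs


# 12 GB image and 8 GB of repo plus build output, from the figures above.
# concurrent_jobs=3 is a hypothetical value, not taken from our config.
per_volume, total = volume_footprint_gb(12, 8, concurrent_jobs=3)
print(per_volume, total)  # 20 60
```

If that assumption holds, the 20 GB per volume is just the sum of the two sizes, and the total on a host scales with the number of concurrent job slots.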

I guess all I can ask is…

  • Is there any low-hanging fruit on the gitlab-runner side for shrinking the volumes that gitlab-runner generates?

  • Is 12 GB absurdly large for an image to be using for CI?

Cheers,
James