GitLab Runner fails with "Cannot connect to the Docker daemon" when using a custom image from the GitLab registry

I have been running all my pipelines on project-specific runners with the Docker executor, and they have all worked without any issues.

Because the runners take some time to set up the environment and install dependencies, I created a custom image with that setup already done.

This is in a separate project branch, which now uses a registry.gitlab.com// image instead of php:8.1.

The runner is able to connect and pull the image, but the pipeline randomly fails with the common error:

ERROR: Job failed (system failure): Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? (docker.go:659:120s)

Many people report this error, but those reports all seem to involve people trying to use dind (Docker-in-Docker), and that is not my case.

Considerations

  • The runners never fail with this error when using the php image
  • The custom image is 1.5 GB, about 3x the size of the base php image

Environment

  • GitLab version: GitLab.com
  • GitLab Runner version: 15.3.0 (project-specific runner)

Troubleshooting

  • Although I knew it was not a dind-related problem, I tried exposing docker.sock as a volume and setting Docker to connect via a TCP port, without success.
  • Tried setting wait_for_services_timeout=120, with no change in the runner’s behaviour

On the machine running gitlab-runner, I can see that it creates the container, and when the error occurs the container stays there in the Created state.

CONTAINER ID   IMAGE          COMMAND                  CREATED        STATUS     PORTS     NAMES
730cb7c95e9c   e92b9b9879e3   "docker-php-entrypoi…"   43 hours ago   Created             runner-removedid-project-removedid2-concurrent-0-3a0792ab23b639c3-build-2

The container is from our custom image (e92b9b9879e3).

Relevant gitlab-runner configuration

[[runners]]
  name = "supressed"
  url = "https://gitlab.com/"
  token = "supressed-again"
  executor = "docker"
  [runners.docker]
    tls_verify = false
    image = "docker:stable"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/certs/client", "/cache"]
    shm_size = 0
    wait_for_services_timeout = 120

At this point, I don’t know what else to try.

Please feel free to ask for more information or to ask me to run some tests.

Thank you,
Filipe

After digging into the debug messages from gitlab-runner, I found this:

Sep  2 11:09:12 ci01 gitlab-runner[93928]: Executing "step_script" stage of the job script  job=jobid project=projid runner=runnerid
Sep  2 11:09:12 ci01 gitlab-runner[93928]: Looking for image registry.gitlab.com/groupid/project:latest ...  job=jobid project=projid runner=runnerid
Sep  2 11:09:12 ci01 gitlab-runner[93928]: Using docker image sha256:e92b9b9879e37a8d28b63eba4b78434eb8d28b962c021b5d06a63ef8b0b874dd for registry.gitlab.com/groupid/project:latest with digest registry.gitlab.com/groupid/project@sha256:7acea11efee148e7e20b037498a3e831e6cec6d17c22e052b9f3226abee8a635 ...  job=jobid project=projid runner=runnerid
Sep  2 11:09:12 ci01 gitlab-runner[93928]: Removing container runner-runnerid-project-projid-concurrent-0-36289087943770e4-build-2  job=jobid project=projid runner=runnerid
Sep  2 11:09:12 ci01 gitlab-runner[93928]: Disconnecting container runner-runnerid-project-projid-concurrent-0-36289087943770e4-build-2 from networks  job=jobid project=projid runner=runnerid
Sep  2 11:09:12 ci01 gitlab-runner[93928]: Removing container runner-runnerid-project-projid-concurrent-0-36289087943770e4-build-2 finished with error Error: No such container: runner-runnerid-project-projid-concurrent-0-36289087943770e4-build-2 (docker.go:770:0s)  job=jobid project=projid runner=runnerid
Sep  2 11:09:12 ci01 gitlab-runner[93928]: Creating container runner-runnerid-project-projid-concurrent-0-36289087943770e4-build-2 ...  job=jobid project=projid runner=runnerid
Sep  2 11:09:41 ci01 gitlab-runner[93928]: Appending trace to coordinator... ok                code=202 job=jobid job-log=0-76042 job-status=running runner=runnerid sent-log=275-76041 status=202 Accepted update-interval=1m0s

It seems to be complaining about not finding a container, yet it then goes on to create that container.
After the job fails, I can see the container using docker container ls -a

Could this be some timeout waiting for the container to be ready?
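
One way to test the timeout theory would be to measure how long the runner stays silent after each "Creating container" line. The following is only a minimal sketch: it assumes syslog-style timestamps like the ones in the log above (time in the third field, both lines on the same day), and the function name creation_gaps is made up for illustration.

```shell
# Print, for each "Creating container" log line, the number of seconds
# until the next logged line, followed by the container name.
# Assumptions: syslog-style lines ("Sep  2 11:09:12 host prog: ..."),
# and both lines fall on the same day (no midnight rollover handling).
creation_gaps() {
  awk '
    # Convert HH:MM:SS to seconds since midnight.
    function to_seconds(t,   p) {
      split(t, p, ":")
      return p[1] * 3600 + p[2] * 60 + p[3]
    }
    # The line right after a "Creating container" line closes the gap.
    pending {
      print (to_seconds($3) - start), name
      pending = 0
    }
    # Remember when creation started and which container it was.
    /Creating container/ {
      start = to_seconds($3)
      name = $8
      pending = 1
    }
  '
}
```

The function reads the log from standard input, so the runner's log lines (from syslog or the journal, wherever they end up on the host) can be piped into it. Gaps that approach the 120s shown in the error message would be consistent with the Docker daemon being slow to answer the create request, although that alone would not prove it.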