Hi, we’re running into an issue where jobs occasionally (very rarely) fail with this warning:
*** WARNING: Service runner-zw1jf9nkz-project-121-concurrent-0-648e8ddb06bef34a-docker-0 probably didn't start properly.
Health check error:
service "runner-zw1jf9nkz-project-121-concurrent-0-648e8ddb06bef34a-docker-0-wait-for-service" timeout
Health check container logs:
2025-10-29T17:19:38.818022544Z waiting for TCP connection to <IP> on [2375 2376]...
2025-10-29T17:19:38.818047364Z dialing <IP>:2376...
2025-10-29T17:19:38.818089465Z dialing <IP>:2375...
2025-10-29T17:19:39.818330130Z dialing <IP>:2376...
2025-10-29T17:19:39.818347301Z dialing <IP>:2375...
2025-10-29T17:19:40.818779520Z dialing <IP>:2375...
2025-10-29T17:19:40.818841341Z dialing <IP>:2376...
Service container logs:
2025-10-29T17:19:39.851017587Z Certificate request self-signature ok
2025-10-29T17:19:39.851040797Z subject=CN=docker:dind server
2025-10-29T17:19:39.864781346Z /certs/server/cert.pem: OK
2025-10-29T17:19:39.866551966Z chmod: /certs/client/key.pem: Operation not permitted
*********
This ultimately results in the following error:
ERROR: error during connect: Head "https://docker:2376/_ping": dial tcp: lookup docker on 10.69.224.2:53: no such host
Looking at the docs (Use Docker to build Docker images | GitLab Docs) I found this information:
Directories defined in volumes = ["/certs/client", "/cache"] in the Docker-in-Docker with TLS enabled in the Docker executor approach are persistent between builds. If multiple CI/CD jobs using a Docker executor runner have Docker-in-Docker services enabled, then each job writes to the directory path. This approach might result in a conflict.
The proposed solution, however, involves modifying the pipelines, which I’m very reluctant to do: we are a large org, and that kind of change is extremely disruptive.
I asked an LLM, and it seemed pretty adamant that we could instead simply remove "/certs/client" from the Docker executor volumes in the runner config. According to it, everything would keep working, with possibly a longer startup because ephemeral certs would be created for each job, and, more importantly, no more conflicts. The problem is that I can’t find that setup in the docs.
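To make the suggestion concrete, the change would look roughly like this in the runner's config.toml. This is only a sketch modelled on the TLS example in the GitLab docs: it is a fragment (url, token and other settings are omitted), the image tag is a placeholder rather than our actual value, and the active volumes line is the untested LLM suggestion, not a documented setup.

```toml
[[runners]]
  executor = "docker"
  [runners.docker]
    image = "docker:24.0.5"   # placeholder tag, not our real image
    privileged = true
    # Setup from the docs: both paths are runner-level volumes
    # that persist between builds.
    # volumes = ["/certs/client", "/cache"]
    # LLM-suggested variant (untested): drop "/certs/client" so it is
    # no longer a persistent volume. Whether the job container can
    # still get the client certs this way is the open question.
    volumes = ["/cache"]
```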
So, my question is two-fold:
- Do you think the error is caused by conflicts from this shared volume? Or something else entirely?
- Is the solution proposed by the LLM any good?
Additional info: we prune images and volumes once a week because otherwise the disk fills up.