CI pipeline fails randomly when accessing GitLab APIs (Docker registry, Artifacts, ...)

For some time now, our CI pipeline has often failed with access errors against APIs provided by GitLab. Examples of those errors:

$ docker login -u gitlab-ci-token -p $CI_JOB_TOKEN registry.gitlab.com
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
Error response from daemon: Get "https://registry.gitlab.com/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
ERROR: Job failed: failed to pull image "registry.gitlab.com/3ker-grizzzel/candidate-api/app:de5061e6de3cf2506400024cfb5091dfbc017920" with specified policies [always]: Error response from daemon: Get "https://registry.gitlab.com/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers) (manager.go:237:15s)
Downloading artifacts for test:data-migration (5167211972)...
ERROR: Downloading artifacts from coordinator... forbidden  id=5167211972 responseStatus=403 Forbidden status=403 Forbidden token=64_kzLnz
FATAL: permission denied    

The pipeline often fails again if restarted immediately, but works again after waiting a few hours.
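Since an immediate restart often fails too, a retry with a growing delay might ride out the shorter outages. It will not help when the problem lasts for hours. This is a minimal sketch; the `retry` helper and its arguments are hypothetical names, not part of GitLab Runner or Docker:

```shell
#!/bin/sh
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS
# is reached, doubling the delay after every failed attempt.
retry() {
  retry_max=$1
  retry_delay=$2
  retry_attempt=1
  shift 2
  until "$@"; do
    if [ "$retry_attempt" -ge "$retry_max" ]; then
      echo "retry: giving up after $retry_attempt attempts: $*" >&2
      return 1
    fi
    echo "retry: attempt $retry_attempt failed, waiting ${retry_delay}s" >&2
    sleep "$retry_delay"
    retry_delay=$((retry_delay * 2))
    retry_attempt=$((retry_attempt + 1))
  done
}

# Possible use in a job script (assumes CI_JOB_TOKEN is in the environment).
# Reading the token from stdin also avoids the --password warning:
# retry 5 10 sh -c 'printf "%s" "$CI_JOB_TOKEN" | docker login -u gitlab-ci-token --password-stdin registry.gitlab.com'
```

This only covers commands run inside the job script. The image pull and the artifact download are done by the runner itself, so they are outside its reach.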

We use custom CI runners set up with Docker+Machine on Hetzner. Has anyone seen this issue before?

The error in the request to the Docker registry API seems to indicate a network connectivity issue. But it is strange that we see an authentication issue (403 Forbidden) with the artifacts at the same time. Could it be that we are running into some kind of rate limiting, which is handled differently between the services?

I would report this as a bug, but I have no idea how to reproduce it. It just happens from time to time.

I logged into a CI runner server on which I have seen jobs fail due to network connectivity problems with https://registry.gitlab.com/.

$ docker login -u gitlab-ci-token -p $CI_JOB_TOKEN registry.gitlab.com
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
Error response from daemon: Get "https://registry.gitlab.com/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Ping is working fine:

# ping registry.gitlab.com
PING registry.gitlab.com (35.227.35.254) 56(84) bytes of data.
64 bytes from 254.35.227.35.bc.googleusercontent.com (35.227.35.254): icmp_seq=1 ttl=102 time=112 ms
64 bytes from 254.35.227.35.bc.googleusercontent.com (35.227.35.254): icmp_seq=2 ttl=102 time=112 ms
64 bytes from 254.35.227.35.bc.googleusercontent.com (35.227.35.254): icmp_seq=3 ttl=102 time=112 ms
64 bytes from 254.35.227.35.bc.googleusercontent.com (35.227.35.254): icmp_seq=4 ttl=102 time=112 ms

An HTTP request with cURL fails with a network timeout:

# curl --verbose https://registry.gitlab.com/v2/
*   Trying 35.227.35.254:443...
* connect to 35.227.35.254 port 443 failed: Connection timed out
* Failed to connect to registry.gitlab.com port 443 after 131025 ms: Connection timed out
* Closing connection 0
curl: (28) Failed to connect to registry.gitlab.com port 443 after 131025 ms: Connection timed out
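Ping answering while cURL times out suggests that ICMP gets through but the connection to port 443 does not. To check this faster than the default timeout of more than two minutes seen above, something like the following could be run on the runner. It is only a sketch: `probe` and `explain_curl_exit` are hypothetical helper names, and the timeout values are arbitrary.

```shell
#!/bin/sh
# Hypothetical helper: map a curl exit code to the rough stage that failed.
explain_curl_exit() {
  case "$1" in
    0)     echo "ok" ;;
    6)     echo "dns: could not resolve host" ;;
    7)     echo "tcp: failed to connect" ;;
    28)    echo "timeout: no answer within the time limit" ;;
    35|60) echo "tls: handshake or certificate problem" ;;
    *)     echo "other: curl exit code $1" ;;
  esac
}

# Hypothetical helper: probe a URL with short timeouts and print one line
# with the HTTP status (000 if no response arrived) and the failing stage.
probe() {
  probe_url=$1
  probe_code=0
  probe_status=$(curl --silent --output /dev/null \
    --connect-timeout 10 --max-time 20 \
    --write-out '%{http_code}' "$probe_url") || probe_code=$?
  echo "$probe_url http=$probe_status result=$(explain_curl_exit "$probe_code")"
}

# Example call, to be run on both the failing and the working server:
# probe https://registry.gitlab.com/v2/
```

Running it on both servers at the same time would show whether only one source IP is affected.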

Another server at the same provider (Hetzner Cloud) works fine at the same time. I have no idea why.

I assume I’m running into the issue of Hetzner IP addresses being wrongly geolocated by Google as Iranian, with the traffic then being blocked by Cloudflare and others. See the GitLab issue "intermittent registry.gitlab.com client timeouts from hetzner.de VPSes" (#8121) in GitLab.com / GitLab Infrastructure Team / production.

The following updates from GitLab are especially relevant: