Intermittent 403 errors on registry.gitlab.com

We are currently using GitLab SaaS and have been running our own runners for about 6 weeks now.
For the first 4 weeks they worked perfectly fine; however, for roughly the last week and a half we have been having frequent issues where our runners get a 403 when trying to pull the job image from the GitLab registry.

We opened a support ticket on April 25 but haven't gotten a response yet.

The error we receive:

WARNING: Failed to pull image with policy "always": error parsing HTTP 403 response body: invalid character '<' looking for beginning of value: "<?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>Access denied.</Message></Error>" (manager.go:203:2s)
ERROR: Job failed: failed to pull image "registry.gitlab.com/ORG/containers/commitlint:latest" with specified policies [always]: error parsing HTTP 403 response body: invalid character '<' looking for beginning of value: "<?xml version='1.0' encoding='UTF-8'?><Error><Code>AccessDenied</Code><Message>Access denied.</Message></Error>" (manager.go:203:2s)

Last week we also received a few of these:

Your client does not have permission to get URL /docker/registry/v2/blobs/sha256/04/xxxxxxxxxxxxxxxxxxxxx/data from this server.

The runners themselves are auto-scaled and get spun up and destroyed from time to time.
There is no specific configuration set for accessing the registry; I assume the “runner manager” (which is connected to GitLab as a runner) takes care of that.
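For context, this is roughly the manual equivalent of the authentication I understand the runner performs for us (just a sketch, not our actual configuration; $CI_JOB_TOKEN only exists inside a running job, so outside of one you would need a token with read_registry scope instead):

```sh
# Sketch only: the manual equivalent of the registry auth a job normally gets.
# $CI_JOB_TOKEN is only defined while a job is running.
echo "$CI_JOB_TOKEN" | docker login registry.gitlab.com \
  -u gitlab-ci-token --password-stdin
docker pull registry.gitlab.com/ORG/containers/commitlint:latest
```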

These issues started about a week and a half ago, which seems to coincide with the release of GitLab 14.10. We've also upgraded our runner to 14.10.0, to no avail.

Any luck with this? We are seeing this issue happen over and over on random Kubernetes nodes. We have already tried different container runtimes, but no luck. Credentials are fine, and we can issue the same pull commands from other nodes in the cluster as well as from our local machines.

We are using GitLab SaaS with its built-in container registry.

Some guidance would be helpful.

Thanks in advance

The problem we were having back then turned out to be caused by a number of IPs in our cloud provider's IP pool being blocked by Google. Since we used short-lived, auto-scaled runner instances, the problem would intermittently appear and disappear again.

We were able to diagnose this by running a simple curl https://gcping.com/ on the runner nodes. If it came back with the AccessDenied error, we would kill that particular node and hope that the next one would get a clean IP. Eventually we moved to a different region because we kept getting affected IPs.
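In case it helps anyone, this is roughly the check we ran on each node (the string match and the "replace the node" step are just our own workaround, nothing official):

```sh
# Rough sketch of the per-node check. If the node's egress IP is blocked,
# the curl comes back with Google's AccessDenied error page instead of the
# normal gcping response.
if curl -s https://gcping.com/ | grep -q AccessDenied; then
  echo "This node's external IP appears to be blocked by Google."
  # We would then terminate the instance and let autoscaling replace it,
  # hoping the replacement gets a clean IP from the provider's pool.
fi
```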

It could be that the external IP of your K8s node is also being blocked by Google.
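If you want to check this quickly, something like the following should work (illustrative only; it assumes kubectl access and that a curl image can be pulled, and NODE_NAME is a placeholder for one of your nodes):

```sh
# Illustrative: list each node's external IP, then run the same probe from a
# throwaway pod pinned to a suspect node.
kubectl get nodes -o wide   # the EXTERNAL-IP column shows each node's public IP (if any)

kubectl run gcping-check --rm -i --restart=Never \
  --image=curlimages/curl \
  --overrides='{"spec":{"nodeName":"NODE_NAME"}}' \
  -- -s https://gcping.com/
```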
