GitLab on Kubernetes - can't use custom internal runner image for CI jobs

We’ve recently moved from an Omnibus installation on Docker to the kubernetes helm chart, and so far everything is working great, save for one issue: when we run a CI job using a custom image from our own internal registry (also part of the helm chart), we get the following error:

ERROR: Job failed: image pull failed: rpc error: code = Unknown desc = Error response from daemon: error unmarshalling content: invalid character '<' looking for beginning of value

We use split-horizon DNS so that requests from within our cluster go straight to our load balancer IPs while external requests are routed through an IDP. Having fought through an issue earlier today with a similar error, I figured it was routing incorrectly and hitting our IDP and returning some HTML, hence the <. But I’m stumped after trying the following things:

  • Pulling the image as a step in the CI job without using a custom image (works)
  • Pulling the image in a before_script step without using a custom image (works)
  • Checking the output of nslookup of our registry url within the CI job (correct value)
  • Ensuring that $CI_REGISTRY points to the correct location (it does)

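For anyone wanting to reproduce these checks, they can be bundled into one throwaway job that runs without a custom image: line (so the job itself always starts). The job name and the :latest tag below are illustrative, not from my actual config:

```yaml
# Diagnostic job sketch -- no custom `image:` on the job itself, so it
# starts in the runner's default image and probes the registry from inside.
diagnose:
  stage: test
  script:
    - nslookup "$CI_REGISTRY"                       # does in-cluster DNS resolve the registry correctly?
    - wget -qO- "https://$CI_REGISTRY/v2/" || true  # a registry answers with JSON; an IdP login page answers with HTML
    - docker pull "$CI_REGISTRY_IMAGE:latest"       # the manual pull that already works as a script step
```

The /v2/ probe is the quickest way to tell whether the request is actually reaching a Docker registry or being intercepted by something that serves HTML.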
As far as I can tell, from every angle the runner approaches it, it can resolve the registry and in most cases pull the image correctly — but when the image is specified as the image for a CI job to run in, we get the above error. If I use an explicit URL instead of $CI_REGISTRY I get a slightly different error:

ERROR: Job failed: image pull failed: Back-off pulling image ""

I’m using the simplest of CI jobs to test this:

image: docker:latest

services:
  - docker:dind

variables:
  DOCKER_HOST: tcp://localhost:2375

stages:
  - test

test:
  stage: test
  script:
    - env

We’re running GitLab 12.10.2-ee on Kubernetes 1.18.2 via Helm v3.

I’m not expecting a solution as much as any pointers on how to debug this. I haven’t been able to get very meaningful output from the docker client and gitlab logs haven’t shown me much else.

Any help is greatly appreciated!


I haven’t done this in such a scenario, but here are a few things I would check:

  • Does the registry require a login?
  • Is it using TLS?
  • Are there more hints in the runner log (file/syslog)?
  • Does the debug mode provide more insights?

And maybe throwing the error image pull failed: Back-off pulling image into Google turns up something helpful:

It seems that the TrafficPolicy is a good catch, maybe try that first?


Thanks for the suggestions, dnsmichi. To answer some of your questions:

  • Does the registry require login? This is the registry that’s built into the Helm chart. If I remove the image: line from the build job and instead add a step that pulls the same image, it will succeed. I’ve tried adding an imagePullSecret to the default service account in the namespace I built gitlab in, but it didn’t seem to help. It does possibly seem permissions-related but I’m not sure where I’m missing them.
  • Is it using TLS? Yes, it’s hitting the registry at the address fronted by the nginx ingress controller, which is providing TLS. I know there are options to provide the registry its own TLS certs but I’m not sure if that’s required.
  • Are there more hints in the runner log? As this is the runner built into the Helm chart, I’m only aware of the runner pod logs, which don’t provide any more insight than what I’m getting out of the job logs themselves.
  • Does the debug mode provide more insights? I would love to run the runner in debug mode but I’m not entirely sure how using the Helm chart.
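If the bundled runner follows the standalone gitlab-runner chart, I'd guess debug logging is a values.yaml change along these lines — the key path is an assumption on my part, so check against your chart version:

```yaml
# Assumption: the runner deployed by the bundled GitLab chart honors the
# gitlab-runner sub-chart's `logLevel` value (verify for your chart version).
gitlab-runner:
  logLevel: debug
```

followed by a helm upgrade with the updated values file.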

I’ve tried the suggestions in your links but generally they seem to apply to image pulls that happen within the job scripts, which work fine for me. I had tried the externalTrafficPolicy change prior to this, and for good measure tried it again, but it made no difference.
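For reference, the externalTrafficPolicy change I tried looked roughly like this — the key path assumes the bundled nginx-ingress sub-chart, so it may differ in other chart versions:

```yaml
# Sketch of the traffic-policy change (key path assumed, not verified
# against every chart version). `Local` preserves client source IPs.
nginx-ingress:
  controller:
    service:
      externalTrafficPolicy: Local
```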


Small addendum to the above, I did find where to enable debug logging for the runner. It didn’t reveal too much but I do see this at the beginning of the job:

Feeding runners to channel                          builds=0
Checking for jobs... nothing                        runner=96ZnVCnS
Feeding runners to channel                          builds=0
Checking for jobs... received                       job=28093 repo_url=https://gitlab.(domain).com/(me)/test-runner-bug.git runner=96ZnVCnS
Processing chain                                    chain-leaf=[0xc0002d6000 0xc0002d6580 0xc000737b80] context=certificate-chain-build
Certificate doesn't provide parent URL: exiting the loop  Issuer=USERTrust RSA Certification Authority IssuerCertURL=[] Serial=(serial) Subject=USERTrust RSA Certification Authority context=certificate-chain-build
Processing chain                                    chain-leaf=[0xc0002d6000 0xc0002d6580 0xc0002d6b00 0xc0002d9180] context=certificate-chain-build
Certificate doesn't provide parent URL: exiting the loop  Issuer=AddTrust External CA Root IssuerCertURL=[] Serial=1 Subject=AddTrust External CA Root context=certificate-chain-build
Requeued the runner                                 builds=1 runner=96ZnVCnS

Is this possibly relevant? It’s unclear whether these are warnings or whether the job is failing because of them.


I figured it out — in the end it was nothing to do with GitLab itself. We use a local MinIO running in the same cluster for object storage (for GitLab and other things), but we expose it on a public IP (unreachable from the cluster) as well as a local IP provisioned by MetalLB. We hadn’t set up our split-horizon DNS correctly, so MinIO was resolving to the unreachable public IP; a quick hosts entry on the k8s cluster nodes confirmed it, and everything is now resolved.
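For anyone hitting the same thing, a more durable alternative to per-node hosts entries might be a hosts block in the cluster's CoreDNS Corefile. The hostname and IP below are placeholders, and note this covers pod DNS (resolution done via CoreDNS) rather than anything resolving through the node's own resolver:

```yaml
# Hedged sketch: pin the MinIO hostname to the internal (MetalLB) IP for
# in-cluster lookups via the CoreDNS `hosts` plugin. Names/IP are examples.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        hosts {
            10.0.0.50 minio.example.com   # internal LoadBalancer IP (placeholder)
            fallthrough                   # everything else resolves normally
        }
        forward . /etc/resolv.conf
        cache 30
    }
```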

I will say that I got lucky and caught the output of kubectl describe pod for the job runner just before it exited, which led me to the solution. I will look into ways to persist or forward those logs somewhere for future troubleshooting.

Thanks for the help Michael!
