Task fails to start with "remote error: tls: unknown certificate authority"

Problem to solve

Today I started getting the following errors in the logs when running tasks on my local gitlab-runner instance connected to gitlab.com:

Jun 27 15:17:07 vesho docker[379590]: ERROR: Job failed (system failure):
   error during connect: Get "https://184.72.163.13:2376/v1.24/info":
   remote error: tls: unknown certificate authority (docker.go:958:1s)
   duration_s=109.616854734 job=7202449094 project=10176871 runner=KkF5hGxd

Then in the GitLab task log I have this:

Running with gitlab-runner 17.1.0 (fe451d5a)
  on vesho-autoscaler-public KkF5hGxd, system ID: r_Hn18yAGIixfO
Preparing the "docker+machine" executor
01:14
ERROR: Failed to remove network for build
ERROR: Preparation failed: error during connect: Get "https://184.72.163.13:2376/v1.24/info": remote error: tls: unknown certificate authority (docker.go:958:1s)
Will be retried in 3s ...
ERROR: Failed to remove network for build
ERROR: Preparation failed: error during connect: Get "https://54.224.151.177:2376/v1.24/info": remote error: tls: unknown certificate authority (docker.go:958:1s)
Will be retried in 3s ...
ERROR: Failed to remove network for build
ERROR: Preparation failed: error during connect: Get "https://54.224.151.177:2376/v1.24/info": remote error: tls: unknown certificate authority (docker.go:958:1s)
Will be retried in 3s ...
ERROR: Job failed (system failure): error during connect: Get "https://54.224.151.177:2376/v1.24/info": remote error: tls: unknown certificate authority (docker.go:958:1s)

The runner configuration is as follows:

[[runners]]
  name = "vesho-autoscaler-public"
  url = "https://gitlab.com/"
  id = XXXXXXX
  token = "XXXXXXX"
  token_obtained_at = 2022-11-10T15:40:28Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "docker+machine"
  limit = 3
  [runners.cache]
    MaxUploadedArchiveSize = 0
  [runners.docker]
    host = "tcp://docker:2375"
    tls_verify = false
    image = "docker:latest"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    cache_dir = "/cache"
    extra_hosts = ["docker:172.17.0.2"]
    shm_size = 0
    services_limit = 1
  [runners.machine]
    IdleCount = 0
    IdleScaleFactor = 0.0
    IdleCountMin = 0
    IdleTime = 1800
    MaxBuilds = 10
    MachineDriver = "amazonec2"
    MachineName = "cimachine-%s"
    MachineOptions = [
      "engine-opt=bip=172.17.0.1/24",
      "amazonec2-ami=ami-0b9a603c10937a61b",
      "amazonec2-access-key=XXXXXXX",
      "amazonec2-secret-key=XXXXXXX",
      "amazonec2-region=us-east-1",
      "amazonec2-vpc-id=vpc-XXXXXXX",
      "amazonec2-subnet-id=subnet-XXXXXXX",
      "amazonec2-zone=a",
      "amazonec2-use-private-address=false",
      "amazonec2-tags=runner-manager-name,gitlab-aws-autoscaler,gitlab,true,gitlab-runner-autoscale,true",
      "amazonec2-security-group=gitlab-ci-runner",
      "amazonec2-instance-type=c6a.2xlarge",
      "amazonec2-root-size=40",
      "amazonec2-request-spot-instance=false",
      "amazonec2-spot-price="
    ]

This worked well earlier this week (I’m not sure when the previous build was, but it was this week).

Versions

Please select whether options apply, and add the version information.

  • Self-managed
  • GitLab.com SaaS
  • Self-hosted Runners

Versions

  • GitLab: gitlab.com
  • GitLab Runner, if self-hosted: I’m running the official Docker images, which I pulled today while trying to work around the problem; it reports itself in the logs like so:
Runtime platform arch=amd64 os=linux pid=7 revision=fe451d5a version=17.1.0

The addresses that are reported to have an unknown certificate authority are the public IP addresses of the EC2 instances used by the runner’s docker-machine. 2376 is the Docker daemon TLS port, IIRC, and the certificates set up there are self-signed. So this looks like a docker-machine misconfiguration?

I have tls_verify set to false and also have host set to a non-TLS connection, so I don’t understand why I get this error.
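For what it’s worth, the failing request can be tested by hand from the runner manager. A rough check, assuming docker-machine’s default certificate directory inside the gitlab-runner container and using one of the EC2 public addresses from the error as a placeholder:

curl --cacert /root/.docker/machine/certs/ca.pem \
     --cert /root/.docker/machine/certs/cert.pem \
     --key /root/.docker/machine/certs/key.pem \
     https://<machine-ip>:2376/v1.24/info    # same URL and API path the runner calls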

There was an update to the docker dind image a couple of days back (version 27.0.2), so I tried to revert by setting the runners.docker image to a previous version: reverting to 26 (i.e. 26.1.4) did not fix this, nor did 27.0.1, but with docker:27.0.0-dind I did get the old working behavior back.

Interestingly, while the configuration has host = "tcp://docker:2375", the log has this:

Jun 27 19:02:47 vesho docker[923766]: Using existing docker-machine created=2024-06-27 15:54:29.28651096 +0000 UTC m=+4.582039409 docker=tcp://3.92.55.129:2376 job=7205050194 name=runner-kkf5hgxd-gitlab-XXXXX-1719503669-4b1d2ca7 now=2024-06-27 16:02:47.944534527 +0000 UTC m=+503.240062947 project=10176871 runner=KkF5hGxd usedcount=6

So where does the docker=tcp://3.92.55.129:2376 part come from, and can I change it in the configuration?
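My current (and unverified) understanding is that the docker+machine executor ignores the [runners.docker] host for the build itself and connects to the Docker daemon that docker-machine provisioned on the EC2 instance; the URL is recorded in docker-machine’s state and can be inspected from the runner manager, e.g. (machine name taken from the log line above):

docker-machine ls      # lists each machine together with its daemon URL (tcp://<public-ip>:2376)
docker-machine url runner-kkf5hgxd-gitlab-XXXXX-1719503669-4b1d2ca7
docker-machine env runner-kkf5hgxd-gitlab-XXXXX-1719503669-4b1d2ca7    # prints DOCKER_HOST and DOCKER_CERT_PATH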

I was apparently too hasty. Tasks are now failing even with the runners.docker image set to the older DIND docker image.

I have had the same problem since yesterday. The setup is a gitlab-runner manager in AWS creating on-demand EC2 instances for jobs via GitLab’s docker-machine fork.

@odeda : Did you find a solution?

I had TLS activated until now. docker-machine installs Docker 27.0.3 on the runner machines, and according to cli/docs/deprecated.md at v27.0.2 · docker/cli · GitHub, tlsverify=false no longer works.

It looks like DOCKER_TLS_CERTDIR is ignored. At least when I SSH into the runner machines, Docker is running and the only certificates I can find are in /etc/docker.
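In case it is useful to anyone, one way to confirm what a provisioned machine actually got (the machine name is a placeholder; use whatever docker-machine ls reports):

docker-machine ssh <machine-name> 'ls -l /etc/docker'           # server-side certificates written during provisioning
docker-machine ssh <machine-name> 'ps aux | grep "[d]ockerd"'   # shows the --tlsverify/--tlscacert flags dockerd was started with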

Annoyingly, stopping and starting the gitlab-runner service sometimes works, in the sense that it will run some jobs, then lose the connection to gitlab.com (I’m running the service on my laptop, which sometimes sleeps; after it wakes, the runner gets 502 errors from gitlab.com), and then when I restart I continue to get TLS verification errors. A couple of stop/starts later, it starts running things again.

I initially thought I could select an image that would work, but changing the image doesn’t seem to have an effect on this otherwise completely arbitrary behavior.

The tls_verify parameter is still documented in the GitLab Runner Docker executor configuration even though it obviously doesn’t work, and neither does the host parameter that I have in my config (though that one isn’t actually listed in the docs at the moment). :person_shrugging:

The runner log says things like:

Creating CA: /root/.docker/machine/certs/ca.pem
Creating client certificate: /root/.docker/machine/certs/cert.pem

So I added tls_cert_path = "/root/.docker/machine/certs" to the [runners.docker] section, and right now jobs complete - but as I’ve noted above, this may not mean anything, because GitLab Runner is arbitrary and capricious. :face_exhaling:


I got the same issue again today, no configuration change on my side.

I’ve noticed that I’m running a (hard-coded) GitLab Runner Docker image, ubuntu-v17.0.0, while the current version (released last week) is 17.2.0. I updated my runner Docker image, restarted, and now it works fine again.


Having issues with this also: Docker started failing with an infinite timeout. So is the solution to downgrade the GitLab runner to 17.0.0, or to downgrade the Docker version on the host machine to 26.x? In my case I was using semantic-release-docker and it just hangs the pipeline forever, maybe because it is unable to connect to the Docker TCP service.

We started getting the same error today. We were on Docker 26.1.3 and GitLab Runner 17.0.0 when it appeared, and have since updated to Docker 27.3.1 and GitLab Runner 17.5.3, but we still see the same issue. Has anyone resolved this or raised it with GitLab support?

I’m currently running gitlab-runner 17.2.0 on Podman 4.9.3 with an EC2 auto-scale executor, and it has been working well for a while now. I’m loath to update the runner image, as there’s always the fear that this will break everything. My impression, though, is that this problem isn’t about specific versions being problematic or incompatible with something; it’s something that recurs from time to time and could be some kind of race condition or misalignment.

I have updated to runner 17.6.0, using the EC2 autoscale executor (docker+machine; MachineDriver: amazonec2) with the image docker:27.0.3-dind, and so far it works well. This is my runner configuration:

[[runners]]
  name = "name-of-runner"
  url = "https://gitlab.com/"
  id = 0
  token = "MYTOKEN"
  token_obtained_at = 0001-01-01T00:00:00Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "docker+machine"
  [runners.cache]
    MaxUploadedArchiveSize = 0
  [runners.docker]
    host = "tcp://docker:2375"
    tls_verify = false
    tls_cert_path = "/root/.docker/machine/certs"
    image = "docker:27.0.3-dind"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    cache_dir = "/cache"
    extra_hosts = ["docker:172.17.0.2"]
    shm_size = 0
    services_limit = 1
  [runners.machine]
    IdleCount = 0
    IdleScaleFactor = 0.0
    IdleCountMin = 0
    IdleTime = 1800
    MaxBuilds = 10
    MachineDriver = "amazonec2"
    MachineName = "name-of-runner-%s"
    MachineOptions = [
      "engine-opt=bip=172.17.0.1/24",
      "amazonec2-ami=ami-04a81a99f5ec58529",
      "amazonec2-access-key=AKIABCDEFGHIJKLMNOP",
      "amazonec2-secret-key=deadbeefdeadbeefdeadbeef",
      "amazonec2-region=us-east-1",
      "amazonec2-vpc-id=vpc-abcd1234",
      "amazonec2-subnet-id=subnet-abcd1234",
      "amazonec2-zone=a",
      "amazonec2-use-private-address=false",
      "amazonec2-tags=tag-a,val-a,tag-b,val-b,tag-c,val-c",
      "amazonec2-security-group=gitlab-ci-runner",
      "amazonec2-instance-type=c6a.2xlarge",
      "amazonec2-root-size=40",
      "amazonec2-request-spot-instance=false",
      "amazonec2-spot-price="
    ]

We’re seeing this with two runner managers on self-hosted instances.

At least in our case, I’ve tracked the problem to a verification error with the client certificate created by docker-machine. We’re running GitLab’s gitlab/gitlab-runner:alpine3.18-vxxx images, with /root/.docker/machine mapped to /srv/gitlab-runner/machine on the runner-manager host.

On a ‘good’ machine, openssl verify works:

[root@gitlab-runner-good ~]# openssl verify -CAfile /srv/gitlab-runner/machine/certs/ca.pem /srv/gitlab-runner/machine/certs/cert.pem
/srv/gitlab-runner/machine/certs/cert.pem: OK

On a ‘bad’ machine, verify fails:

[root@gitlab-runner-bad ~]# openssl verify -CAfile /srv/gitlab-runner/machine/certs/ca.pem /srv/gitlab-runner/machine/certs/cert.pem
/srv/gitlab-runner/machine/certs/cert.pem: O = "unknown.<bootstrap>"
error 20 at 0 depth lookup:unable to get local issuer certificate

(Note that the certificate subject says unknown because docker-machine looks for the value of the USER environment variable when creating the CA and client certificates. In GitLab’s gitlab-runner image, this environment variable is not set, hence unknown. It’s possible to set that environment variable when creating the container, but we have not observed it making any difference.)
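As an aside, here is a sketch of what setting that variable at container creation could look like in our layout; the config volume path is an assumption and the image tag is left as a placeholder:

docker run -d --name gitlab-runner --restart always \
  -e USER=root \
  -v /srv/gitlab-runner/config:/etc/gitlab-runner \
  -v /srv/gitlab-runner/machine:/root/.docker/machine \
  gitlab/gitlab-runner:alpine3.18-v<version>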

Changing the version of the container image (i.e. upgrading) doesn’t make a difference. But, sometimes recreating the certificates works.

I believe what’s happening here, at least for us, is a race condition. When autoscaling, GitLab fires off multiple independent docker-machine processes, depending on the value of MaxGrowthRate. Each of these processes tries to determine whether CA and/or client certificates already exist, generating them if not. All it would take is a busy system where several processes each decide they need to generate new CA and client certificates: one process generates both, and then another regenerates the CA certificate, overwriting it. If this happens, the CA and client certificates will be out of sync, leading to this failure behavior.

To ‘fix’ this, I changed the runners.limit and runners.machine.IdleCount to 1, waited for all the existing machines to be scaled down, deleted the generated certs, then deleted the final remaining managed machine. That machine was immediately recreated, and during that operation new CA and client certificates were created. Then I was able to reset runners.limit and runners.machine.IdleCount back to their original settings.
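In shell terms, the cleanup amounted to something like this (a sketch with placeholder names rather than the exact commands, after editing config.toml to set limit and IdleCount to 1 and waiting for the scale-down):

rm /srv/gitlab-runner/machine/certs/*        # drop the out-of-sync CA and client certificates
docker-machine ls                            # identify the last remaining managed machine
docker-machine rm -y <last-machine-name>     # remove it; the runner recreates it and regenerates the certificates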


This is an awesome investigation @wsaxon! Do you think that simply setting MaxGrowthRate to 1 could solve the problem? I don’t mind scaling slowly if it’s safer.

Do you think that simply setting MaxGrowthRate to 1 could solve the problem?

If I’m right about what’s going on, and you can tolerate it set to 1, then yeah I think so. Worst case is it doesn’t work.

All you actually need to ensure is that the CA and client certificates are in sync. I didn’t look too hard but another idea might be to pre-generate those. All the relevant paths can be set as environment variables to the docker-machine process, so on a busy system you might be able to sidestep this by controlling the certificates yourself.
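For example, something along these lines could pre-generate a matching CA and client certificate into docker-machine’s default certs directory before the runner ever scales up (an untested sketch; subjects, key sizes and validity are arbitrary, and docker-machine may expect additional certificate extensions):

CERTS=/root/.docker/machine/certs
mkdir -p "$CERTS"
# CA key and self-signed CA certificate
openssl genrsa -out "$CERTS/ca-key.pem" 4096
openssl req -new -x509 -days 1825 -key "$CERTS/ca-key.pem" -subj "/O=gitlab-runner" -out "$CERTS/ca.pem"
# Client key and certificate signed by that CA
openssl genrsa -out "$CERTS/key.pem" 4096
openssl req -new -key "$CERTS/key.pem" -subj "/O=gitlab-runner" -out "$CERTS/cert.csr"
openssl x509 -req -days 1825 -in "$CERTS/cert.csr" -CA "$CERTS/ca.pem" -CAkey "$CERTS/ca-key.pem" \
  -CAcreateserial -out "$CERTS/cert.pem"
# Sanity check, same as the verification above
openssl verify -CAfile "$CERTS/ca.pem" "$CERTS/cert.pem"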
