CI/CD fails to execute jobs on EC2 instances

Hi,

I’m running GitLab 14.10.0-ee with GitLab Runner 14.8.2 on AWS. The runner is configured to spawn a new AWS EC2 instance for each CI/CD job. This worked until yesterday, but today it stopped working. I haven’t made any configuration changes and did not run any updates, so I’m not sure what the cause could be.
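For reference, the relevant part of my config.toml looks roughly like the sketch below (the URL, token, region, instance type and VPC/subnet IDs are placeholders, not my real values):

[[runners]]
  name = "aws-autoscale-runner"
  url = "https://gitlab.example.com/"
  token = "xxxxx"
  executor = "docker+machine"
  [runners.docker]
    image = "alpine:latest"
  [runners.machine]
    IdleCount = 0
    MachineDriver = "amazonec2"
    MachineName = "gitlab-docker-machine-%s"
    MachineOptions = [
      # placeholder AWS settings
      "amazonec2-region=eu-west-1",
      "amazonec2-instance-type=m5.large",
      "amazonec2-vpc-id=vpc-xxxxx",
      "amazonec2-subnet-id=subnet-xxxxx",
    ]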

In the GitLab web interface, the job status is shown as:

Running with gitlab-runner 14.8.2 (c6e7e194) on
Preparing the “docker+machine” executor 10:13
ERROR: Preparation failed: exit status 1
Will be retried in 3s …
ERROR: Preparation failed: exit status 1
Will be retried in 3s …
ERROR: Preparation failed: exit status 1
Will be retried in 3s …
ERROR: Job failed (system failure): exit status 1

In the AWS console, I can see that a new EC2 instance is spawned (as it should be). I also ran netcat against port 22 of the newly spawned EC2 instance to make sure that SSH is available, which it is.

Running journalctl -f on the GitLab Runner host shows the following:

Apr 26 13:01:55 <hostname> gitlab-runner[609]: Running pre-create checks...                        driver=amazonec2 name=runner-yyyyyy-gitlab-docker-machine-9999999-xxxxx operation=create
Apr 26 13:01:55 <hostname> gitlab-runner[609]: Creating machine...                                 driver=amazonec2 name=runner-yyyyyy-gitlab-docker-machine-9999999-xxxxx operation=create
Apr 26 13:01:55 <hostname> gitlab-runner[609]: (runner-yyyyyy-gitlab-docker-machine-9999999-xxxxx) Launching instance...  driver=amazonec2 name=runner-yyyyyy-gitlab-docker-machine-9999999-912d6a0c operation=create
Apr 26 13:01:58 <hostname> gitlab-runner[609]: IdleCount is set to 0 so the machine will be created on demand in job context  creating=1 idle=0 idleCount=0 idleCountMin=0 idleScaleFactor=0 maxMachineCreate=0 maxMachines=1 removing=0 runner=yyyyyy total=1 used=0
Apr 26 13:02:00 <hostname> gitlab-runner[609]: Waiting for machine to be running, this may take a few minutes...  driver=amazonec2 name=runner-yyyyyy-gitlab-docker-machine-9999999-xxxxx operation=create
Apr 26 13:02:01 <hostname> gitlab-runner[609]: Detecting operating system of created instance...   driver=amazonec2 name=runner-yyyyyy-gitlab-docker-machine-9999999-xxxxx operation=create
Apr 26 13:02:01 <hostname> gitlab-runner[609]: Waiting for SSH to be available...                  driver=amazonec2 name=runner-yyyyyy-gitlab-docker-machine-9999999-xxxxx operation=create
Apr 26 13:02:20 <hostname> gitlab-runner[609]: Detecting the provisioner...                        driver=amazonec2 name=runner-yyyyyy-gitlab-docker-machine-9999999-xxxxx operation=create
Apr 26 13:02:21 <hostname> gitlab-runner[609]: Provisioning with ubuntu(systemd)...                driver=amazonec2 name=runner-yyyyyy-gitlab-docker-machine-9999999-xxxxx operation=create
Apr 26 13:02:35 <hostname> gitlab-runner[609]: Installing Docker...                                driver=amazonec2 name=runner-yyyyyy-gitlab-docker-machine-9999999-xxxxx operation=create
Apr 26 13:02:58 <hostname> gitlab-runner[609]: WARNING: Problem while reading command output       error=read |0: file already closed
Apr 26 13:02:58 <hostname> gitlab-runner[609]: WARNING: Problem while reading command output       error=read |0: file already closed
Apr 26 13:02:58 <hostname> gitlab-runner[609]: ERROR: Machine creation failed                      error=exit status 1 name=runner-yyyyyy-gitlab-docker-machine-9999999-xxxxx time=1m3.021323248s
Apr 26 13:02:58 <hostname> gitlab-runner[609]: WARNING: Requesting machine removal                 lifetime=1m3.021582815s name=runner-yyyyyy-gitlab-docker-machine-9999999-xxxxx now=2022-04-26 13:02:58.461046121 +0000 UTC m=+1441.281657463 reason=Failed to create used=1m3.02158386s usedCount=0
Apr 26 13:02:58 <hostname> gitlab-runner[609]: WARNING: Stopping machine                           lifetime=1m3.038607714s name=runner-yyyyyy-gitlab-docker-machine-9999999-xxxxx reason=Failed to create used=16.829134ms usedCount=0
Apr 26 13:02:58 <hostname> gitlab-runner[609]: Stopping "runner-yyyyyy-gitlab-docker-machine-9999999-xxxxx"...  name=runner-yyyyyy-gitlab-docker-machine-9999999-912d6a0c operation=stop

To me, it looks like the runner is able to connect to the EC2 instance via SSH but is then unable to continue for some reason. I don’t know how to interpret this error message:

Problem while reading command output error=read |0: file already closed

I’d be glad to hear any advice on how to interpret this message and how to troubleshoot this problem any further.

I found a workaround for my problem here, so I will just copy-paste the answer for the sake of completeness:

There is an issue with the latest Docker version; see here: docker/machine#4858 (comment)

You can resolve this bug by specifying an older Docker version in the config.toml [on your GitLab Runner]:

MachineOptions = [ "engine-install-url=https://releases.rancher.com/install-docker/19.03.9.sh", ]

I can also confirm that the workaround suggested by @savp worked for me. In case you don’t want to rely on rancher.com, use:

MachineOptions = ["engine-install-url=https://get.docker.com|head -n-1|cat - <(echo -e \"VERSION=19.03.9\\nCHANNEL=stable\\ndo_install\")"],

Doesn’t seem like a solution to me. The question is: why did it work yesterday and not today?
I have the same issue, but I think the Ubuntu base image was changed (maybe to 22.04) and that’s why it stopped working.

I am also getting this issue. It randomly started happening 16 hours ago.

I have the same issue as well. I can confirm that @lpyfm’s solution solved the problem temporarily, but I don’t think it’s a permanent solution to rely on a Docker version that is almost two years old.

Thanks!

I’ve also encountered the same issue on AWS. The worker instance stays stuck in the ‘Initializing’ state, then self-terminates, with another instance subsequently starting up, and the cycle repeats.

My GitLab Runner had been running fine for over a year. This also just started happening two days ago. It would be good to know what the cause is, given that so many people are suddenly affected at the same time.

Pinning the Docker version did resolve it for me too.

I have experienced the same issue since yesterday. It looks like Docker fails to install on the newly spawned AWS EC2 instance due to a missing package. The error seen in the log is:

E: Unable to locate package docker-compose-plugin

Pinning the Docker version as suggested by @lpyfm resolved this issue.

Hi all, unfortunately I don’t have anything useful to add, but I wanted to add another “me too” to the pile. No changes to our config on our end, yet we’re seeing the same issues as described here.

This started happening today for us.

  • GitLab Runner: 14.10.0
  • Host: EC2 instance
  • OS: Amazon Linux 2
  • docker: 20.10.7, build f0df350
  • docker-machine: 0.16.2-gitlab.6, build b17173bc

Log messages:

gitlab-runner[3942]: {"driver":"amazonec2","level":"info","msg":"Installing Docker...","name":"runner-5-xxx-gitlab-runner-docker-machine-xxx-xxx","operation":"create","time":"2022-04-27"}
gitlab-runner[3942]: {"error":"read |0: file already closed","level":"warning","msg":"Problem while reading command output","time":"2022-04-27"}
gitlab-runner[3942]: {"error":"read |0: file already closed","level":"warning","msg":"Problem while reading command output","time":"2022-04-27"}
gitlab-runner[3942]: {"error":"exit status 1","fields.time":xxx,"level":"error","msg":"Machine creation failed","name":"runner-5-xxx-gitlab-runner-docker-machine-xxx-xxx","time":"2022-04-27"}
gitlab-runner[3942]: {"level":"warning","lifetime":xxx,"msg":"Requesting machine removal","name":"runner-5-xxx-gitlab-runner-docker-machine-xxx-xxx","now":"2022-04-27","reason":"Failed to create","time":"2022-04-27","used":xxx,"usedCount":0}
gitlab-runner[3942]: {"level":"warning","lifetime":xxx,"msg":"Stopping machine","name":"runner-5-xxx-gitlab-runner-docker-machine-xxx-xxx","reason":"Failed to create","time":"2022-04-27","used":xxx,"usedCount":0}

This is causing a new EC2 instance to be started and stopped every minute continuously and is preventing CI jobs from running, which is a big problem for us.

Please look into this. Thank you.

This workaround worked for us and saved us from a lot of headaches.

An issue has been created here: Gitlab docker runner on AWS EC2 started failing abruptly (#29032) · Issues · GitLab.org / gitlab-runner · GitLab

We’ve also run into this issue, but instead I’ve pointed engine-install-url to the previous version of the get.docker.com script, which is here: https://github.com/docker/docker-install/blob/e5f4d99c754ad5da3fc6e060f989bb508b26ebbd/install.sh (but using the raw URL in the MachineOptions). You can also see that they did indeed add the installation of the docker-compose-plugin package 3 days ago.
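In other words, something roughly like this in the MachineOptions (I’m assuming the standard raw.githubusercontent.com form of that commit’s URL here, so double-check that it serves the same script before relying on it):

MachineOptions = [
  # pin the install script to the commit just before docker-compose-plugin was added
  "engine-install-url=https://raw.githubusercontent.com/docker/docker-install/e5f4d99c754ad5da3fc6e060f989bb508b26ebbd/install.sh",
]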

Same issue here! Unfortunately, the workaround did not work.

Same issue here; it started about two days ago.

This worked for us without adding a dependency on rancher.com (which would not have held up to our security policies).

In the MachineOptions section of config.toml, we added this:

"engine-install-url=https://get.docker.com|head -n-1|cat - <(echo -e \"VERSION=19.03.9\\nCHANNEL=stable\\ndo_install\")",

I can’t take credit for it, as I got it from this source originally (see the very last post):

Hope that helps.

This, however, seems to have solved our issues. Thanks! :rocket:

Same thing happened here at work, around the same time! We applied what @savp suggested and it worked just fine. Waiting for some feedback on the GitLab issue.

@JorRy, maybe using any version prior to 20.10 would solve the issue too? As @dobrud mentioned, that might be a better option than relying on such an old Docker version.

As @JoyRy mentioned, this is due to a change to the script that is pulled in by default from get.docker.com, which now installs docker-compose-plugin. The issue is that that package isn’t available on the default AMI (Ubuntu 16.04).

I initially got it working by making the same change (pulling in a version of the script from the raw GitHub URL of the commit from the day prior to the change), but I now have a better solution, which is to specify a much newer AMI in the MachineOptions instead. This then works with the default Docker install script (I never realised it was using such an old Ubuntu version by default anyway, so this is a good change to make regardless).
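As a rough sketch of what I mean (the AMI ID is a placeholder, so substitute a current Ubuntu 20.04/22.04 AMI for your region, and amazonec2-ssh-user must match that image’s default user):

  [runners.machine]
    MachineOptions = [
      # placeholder: use a recent Ubuntu AMI ID for your region
      "amazonec2-ami=ami-xxxxxxxxxxxxxxxxx",
      "amazonec2-ssh-user=ubuntu",
      # ...plus the existing amazonec2-* options...
    ]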

Can you advise on the exact change you made and which version etc. you’re using?

Thank you!

Could you let us know which particular AMI you’re using?

I’ve tried upgrading from Ubuntu 16.04 in the past, but had several issues with more recent versions of Ubuntu, so it would be interesting to hear which particular version works with GitLab.