Suddenly failing Docker+Machine autoscaling builds on GCP

Our docker+machine autoscaling builds on GCP started failing several days ago. No changes whatsoever on our side to anything related to CI or GCP infrastructure; just some small C++ code changes that triggered new builds.

The failure appears to be related to the runner service launching the remote docker on the spawned instance using the –storage-driver aufs option, which leads to missing driver errors. Specifically the dockerd launch command remotely executed on the spawned VM includes –storage-driver aufs, and the syslog on that machine states “level=error msg=“AUFS was not found in /proc/filesystems” storage-driver=aufs” before failing. But I could be wrong - am not experienced debugging CI setups.

Has anyone else seen this? I’d be grateful for any ideas.

This is the high level GitLab log for the pipeline:

Running with gitlab-runner 13.9.0 (2ebc4dc4) on gitlab-runner-gce-root-2 9db142ae
Preparing the "docker+machine" executor
05:48
ERROR: Preparation failed: exit status 1
Will be retried in 3s ...
ERROR: Preparation failed: exit status 1
Will be retried in 3s ...
ERROR: Preparation failed: exit status 1
Will be retried in 3s ...
ERROR: Job failed (system failure): exit status 1

The parent runner instance’s log:

Apr 29 01:17:42 gitlab-runner-root-2 gitlab-runner[4282]: Copying certs to the remote machine...              driver=google name=runner-9db142ae-auto-scale-runner-1682731032-4ec325c9 operation=create
Apr 29 01:17:44 gitlab-runner-root-2 gitlab-runner[4282]: Setting Docker configuration on the remote daemon...  driver=google name=runner-9db142ae-auto-scale-runner-1682731032-4ec325c9 operation=create
Apr 29 01:17:45 gitlab-runner-root-2 gitlab-runner[4282]: Feeding runners to channel                          builds=1
Apr 29 01:17:45 gitlab-runner-root-2 gitlab-runner[4282]: Docker Machine Details                              creating=1 idle=0 maxMachineCreate=0 maxMachines=0 minIdleCount=0 removing=0 runner=9db142ae time=2023-04-29 01:17:45.076152166 +0000 UTC m=+474.234273866 total=1 used=0
Apr 29 01:17:45 gitlab-runner-root-2 gitlab-runner[4282]: Failed to request job: runner requestConcurrency meet  builds=1 runner=9db142ae
Apr 29 01:17:46 gitlab-runner-root-2 gitlab-runner[4282]: Submitting job to coordinator... ok                 code=200 job=4200869459 job-status= runner=9db142ae update-interval=0s
Apr 29 01:17:47 gitlab-runner-root-2 gitlab-runner[4282]: ERROR: Error creating machine: Error running provisioning: ssh command error:  driver=google name=runner-9db142ae-auto-scale-runner-1682731032-4ec325c9 operation=create
Apr 29 01:17:47 gitlab-runner-root-2 gitlab-runner[4282]: ERROR: command : sudo systemctl -f start docker     driver=google name=runner-9db142ae-auto-scale-runner-1682731032-4ec325c9 operation=create
Apr 29 01:17:47 gitlab-runner-root-2 gitlab-runner[4282]: ERROR: err     : exit status 1                      driver=google name=runner-9db142ae-auto-scale-runner-1682731032-4ec325c9 operation=create
Apr 29 01:17:47 gitlab-runner-root-2 gitlab-runner[4282]: ERROR: output  : Job for docker.service failed because the control process exited with error code.  driver=google name=runner-9db142ae-auto-scale-runner-1682731032-4ec325c9 operation=create
Apr 29 01:17:47 gitlab-runner-root-2 gitlab-runner[4282]: ERROR: See "systemctl status docker.service" and "journalctl -xe" for details.  driver=google name=runner-9db142ae-auto-scale-runner-1682731032-4ec325c9 operation=create
Apr 29 01:17:47 gitlab-runner-root-2 gitlab-runner[4282]: WARNING: Problem while reading command output       error=read |0: file already closed
Apr 29 01:17:47 gitlab-runner-root-2 gitlab-runner[4282]: WARNING: Problem while reading command output       error=read |0: file already closed
Apr 29 01:17:47 gitlab-runner-root-2 gitlab-runner[4282]: Skipping provision retry on failed machine          error=exit status 1 name=runner-9db142ae-auto-scale-runner-1682731032-4ec325c9
Apr 29 01:17:47 gitlab-runner-root-2 gitlab-runner[4282]: ERROR: Machine creation failed                      error=exit status 1 name=runner-9db142ae-auto-scale-runner-1682731032-4ec325c9 time=34.562191809s
Apr 29 01:17:47 gitlab-runner-root-2 gitlab-runner[4282]: WARNING: Requesting machine removal                 lifetime=34.562447574s name=runner-9db142ae-auto-scale-runner-1682731032-4ec325c9 now=2023-04-29 01:17:47.184745963 +0000 UTC m=+476.342867672 reason=Failed to create used=34.562447918s usedCount=0
Apr 29 01:17:47 gitlab-runner-root-2 gitlab-runner[4282]: WARNING: Stopping machine                           lifetime=34.58787779s name=runner-9db142ae-auto-scale-runner-1682731032-4ec325c9 reason=Failed to create used=24.841869ms usedCount=0
Apr 29 01:17:47 gitlab-runner-root-2 gitlab-runner[4282]: Stopping "runner-9db142ae-auto-scale-runner-1682731032-4ec325c9"...  name=runner-9db142ae-auto-scale-runner-1682731032-4ec325c9 operation=stop

Here is the output of “systemctl status docker.service” on a spawned machines that failed to execute the task:

● docker.service - Docker Application Container Engine
     Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/docker.service.d
             └─10-machine.conf
     Active: failed (Result: exit-code) since Sat 2023-04-29 01:20:32 UTC; 13s ago
TriggeredBy: ● docker.socket
       Docs: https://docs.docker.com
    Process: 2409 ExecStart=/usr/bin/dockerd -H tcp://0.0.0.0:2376 -H unix:///var/run/docker.sock --storage-driver aufs --tlsverify --tlscacert /etc/docker/ca.pem --tlscert /etc/docker/server.pem --tlskey /etc/docker/server-key.pem --label provider=google (code=exited, status=1/FAILURE)
   Main PID: 2409 (code=exited, status=1/FAILURE)
        CPU: 88ms

Apr 29 01:20:32 runner-9db142ae-auto-scale-runner-1682731191-4f4959bc systemd[1]: docker.service: Scheduled restart job, restart counter is at 3.
Apr 29 01:20:32 runner-9db142ae-auto-scale-runner-1682731191-4f4959bc systemd[1]: Stopped Docker Application Container Engine.
Apr 29 01:20:32 runner-9db142ae-auto-scale-runner-1682731191-4f4959bc systemd[1]: docker.service: Start request repeated too quickly.
Apr 29 01:20:32 runner-9db142ae-auto-scale-runner-1682731191-4f4959bc systemd[1]: docker.service: Failed with result 'exit-code'.
Apr 29 01:20:32 runner-9db142ae-auto-scale-runner-1682731191-4f4959bc systemd[1]: Failed to start Docker Application Container Engine.

And here’s “journalctl -xe” on the same machine:

Apr 29 01:23:19 runner-9db142ae-auto-scale-runner-1682731361-a5ffb6e4 systemd[1]: Starting Docker Application Container Engine...
Apr 29 01:23:19 runner-9db142ae-auto-scale-runner-1682731361-a5ffb6e4 dockerd[2409]: time="2023-04-29T01:23:19.618022956Z" level=info msg="Starting up"
Apr 29 01:23:19 runner-9db142ae-auto-scale-runner-1682731361-a5ffb6e4 dockerd[2409]: time="2023-04-29T01:23:19.622244404Z" level=info msg="[graphdriver] trying configured driver: aufs"
Apr 29 01:23:19 runner-9db142ae-auto-scale-runner-1682731361-a5ffb6e4 dockerd[2409]: time="2023-04-29T01:23:19.622410834Z" level=warning msg="[graphdriver] WARNING: the aufs storage-driver is deprecated and will be removed in a future release; visit https://docs.docker.com/go/storage-driver/ for more information"
Apr 29 01:23:19 runner-9db142ae-auto-scale-runner-1682731361-a5ffb6e4 dockerd[2409]: time="2023-04-29T01:23:19.624813559Z" level=error msg="AUFS was not found in /proc/filesystems" storage-driver=aufs
Apr 29 01:23:19 runner-9db142ae-auto-scale-runner-1682731361-a5ffb6e4 dockerd[2409]: failed to start daemon: error initializing graphdriver: driver not supported
Apr 29 01:23:19 runner-9db142ae-auto-scale-runner-1682731361-a5ffb6e4 systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
Apr 29 01:23:19 runner-9db142ae-auto-scale-runner-1682731361-a5ffb6e4 systemd[1]: docker.service: Failed with result 'exit-code'.
Apr 29 01:23:19 runner-9db142ae-auto-scale-runner-1682731361-a5ffb6e4 systemd[1]: Failed to start Docker Application Container Engine.
Apr 29 01:23:21 runner-9db142ae-auto-scale-runner-1682731361-a5ffb6e4 systemd[1]: docker.service: Scheduled restart job, restart counter is at 3.
Apr 29 01:23:21 runner-9db142ae-auto-scale-runner-1682731361-a5ffb6e4 systemd[1]: Stopped Docker Application Container Engine.
Apr 29 01:23:21 runner-9db142ae-auto-scale-runner-1682731361-a5ffb6e4 systemd[1]: docker.service: Start request repeated too quickly.
Apr 29 01:23:21 runner-9db142ae-auto-scale-runner-1682731361-a5ffb6e4 systemd[1]: docker.service: Failed with result 'exit-code'.
Apr 29 01:23:21 runner-9db142ae-auto-scale-runner-1682731361-a5ffb6e4 systemd[1]: Failed to start Docker Application Container Engine.
Apr 29 01:23:21 runner-9db142ae-auto-scale-runner-1682731361-a5ffb6e4 systemd[1]: docker.socket: Failed with result 'service-start-limit-hit'.
Apr 29 01:23:25 runner-9db142ae-auto-scale-runner-1682731361-a5ffb6e4 systemd[1]: Stopping User Manager for UID 1004...

Our config.toml:

concurrent = 3
check_interval = 0
log_level = "debug"

[[runners]]
  name = "gitlab-runner-gce-root-2"
  url = "https://gitlab.com/"
  token = "[REDACTED]"
  executor = "docker+machine"
  environment = ["DOCKER_DRIVER=overlay2"]

  [runners.docker]
    tls_verify = false
    image = "[REDACTED]"
    privileged = false
    disable_cache = true
    shm_size = 0

  [runners.cache]
    Type = "gcs"
    Path = "cache"
    Shared = false
    [runners.cache.gcs]
      CredentialsFile = "[REDACTED]"
      BucketName = "[REDACTED]"

  [runners.machine]
    IdleCount = 0
    IdleTime = 60
    MachineDriver = "google"
    MachineName = "auto-scale-runner-%s"
    MachineOptions = [
      "google-project=[REDACTED]",
      "google-machine-type=e2-standard-32",
      "google-machine-image=[REDACTED]",
      "google-tags=gitlab-ci-slave",
      "google-preemptible=true",
      "google-zone=us-west2-c",
      "google-use-internal-ip=true",
      "google-disk-type=pd-ssd",
      "google-disk-size=50"
    ]

After searching through docker+machine code, this appears to be the fix. Not sure why it suddenly became necessary to specify the dockerd storage driver in the runner config.toml.

[runners.machine]
    MachineOptions = [
      "engine-storage-driver=overlay2"
    ]
1 Like