Docker+Machine Runner Spawns Another Machine Despite Idle Machines Being Available

Context:

In our CI/CD pipeline on GitLab.com, we’ve encountered an issue with GitLab Runner provisioning new Docker+Machine instances for each job, even when idle machines are available. This not only leads to unnecessary resource allocation but also increases our infrastructure costs.

Issue Details:

  • Observed Behavior: Even when idle machines are ready to take jobs, GitLab Runner starts provisioning new instances. The jobs do end up running on the existing idle machines, so the newly created instances are unnecessary.

  • Expected Behavior: GitLab Runner should prioritize assigning jobs to available idle machines before provisioning new instances, to optimize resource utilization and control costs.

  • Impact: This issue leads to increased cloud costs and underutilization of provisioned resources, affecting our CI/CD pipeline’s efficiency.

  • Here is a snippet from the GitLab Runner logs showing the behavior:

Using existing docker-machine ... name=runner-my3irdvjn-autoscale-idle-runner-1708512742-ce11210e
Running pre-create checks... driver=google name=runner-my3irdvjn-autoscale-idle-runner-1708513616-76c02dbc operation=create
Creating machine... driver=google name=runner-my3irdvjn-autoscale-idle-runner-1708513616-76c02dbc operation=create
  • I am using GitLab.com:

    • GitLab version - GitLab Enterprise Edition 16.10.0-pre cdf098fb8df
    • GitLab Runner version - 16.9.0
  • Here’s a snippet from the runner’s config.toml:

concurrent = 20
check_interval = 0
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "autoscale"
  url = "https://gitlab.com"
 ...
  executor = "docker+machine"
  [runners.cache]
    Type = "gcs"
    Shared = true
    MaxUploadedArchiveSize = 0
    [runners.cache.gcs]
      ...
  [runners.docker]
    tls_verify = true
    image = "docker:24.0.6"
    services_limit = -1
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/certs/client", "/cache"]
    pull_policy = ["if-not-present"]
    shm_size = 0
  [runners.machine]
    IdleCount = 0
    MachineDriver = "google"
    MachineName = "autoscale-idle-runner-%s"
    MachineOptions = ["google-project=...", "google-network=...", "google-subnetwork=...", "google-zone=...", "google-machine-type=...", "google-machine-image=projects/cos-cloud/global/images/cos-stable-105-17412-101-4", "google-disk-size=30", "google-preemptible=true", "google-use-internal-ip=true", "google-service-account=...", "google-scopes=...", "engine-install-url=https://releases.rancher.com/install-docker/24.0.6.sh"]
    [[runners.machine.autoscaling]]
      Periods = ["* * 9-18 * * mon-fri *"]
      IdleCount = 5
      IdleTime = 60
      Timezone = "Europe/London"
  • What I’ve tried: I checked the runner’s configuration for apparent misconfigurations, confirmed that the runner version is up to date, and reviewed the GitLab Runner documentation on autoscaling and machine provisioning. I also increased check_interval in case this was a timing issue, but the problem persists.
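
To help reason about the config.toml above, here is a minimal Python sketch of which IdleCount applies at a given moment, and how many machines would be created if the runner tops the idle pool back up after a job claims an idle machine. This is an assumption-laden illustration, not GitLab Runner's actual code: the function names are hypothetical, the Periods expression is hard-coded rather than parsed, the input is assumed to already be Europe/London wall-clock time, and the top-up rule is a simplification of the behavior described in the autoscaling documentation.

```python
from datetime import datetime

# Values copied from the config.toml above.
DEFAULT_IDLE_COUNT = 0   # [runners.machine] IdleCount
PERIOD_IDLE_COUNT = 5    # [[runners.machine.autoscaling]] IdleCount


def effective_idle_count(local_now: datetime) -> int:
    """Return the IdleCount in force at local_now.

    Hard-codes Periods = ["* * 9-18 * * mon-fri *"]; local_now is assumed
    to be Europe/London wall-clock time (no timezone conversion is done).
    """
    is_weekday = local_now.weekday() < 5    # Monday=0 .. Friday=4
    in_hours = 9 <= local_now.hour <= 18    # hour field "9-18" is inclusive
    if is_weekday and in_hours:
        return PERIOD_IDLE_COUNT
    return DEFAULT_IDLE_COUNT


def machines_to_create(idle: int, creating: int, local_now: datetime) -> int:
    """Hypothetical helper: machines needed to restore the idle target.

    Assumes machines already being created count toward the target.
    """
    return max(0, effective_idle_count(local_now) - idle - creating)


if __name__ == "__main__":
    wednesday_morning = datetime(2024, 2, 21, 11, 6)
    # A job has just claimed one of five idle machines, leaving four idle.
    print(machines_to_create(idle=4, creating=0, local_now=wednesday_morning))
```

Under these assumptions, a job taking one of the five idle machines during the configured period leaves four idle, and the sketch reports that one new machine is needed. That would match a log pattern of "Using existing docker-machine" followed by "Creating machine", but whether this is what the runner is doing here has not been confirmed.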