Possible gitlab-runner regression between 16.5.0 and 16.6.x

Problem

I upgraded a local runner machine to v16.6.0 a few weeks ago and started seeing the following error in my CI jobs:

Running with gitlab-runner 16.6.0 (3046fee8)
  on <my-machine> Podman/Docker runner <...>, system ID: <...>
  feature flags: FF_NETWORK_PER_BUILD:true, FF_USE_FASTZIP:true, FF_DISABLE_UMASK_FOR_DOCKER_EXECUTOR:true
Resolving secrets 00:00
Preparing the "docker" executor 00:07
Using Docker executor with image <local gitlab registry>image:latest ...
ERROR: Job failed: adding cache volume: set volume permissions: running permission container "35e3042f0071e22a1b4d9b3ab73d541c9e0941b8dc60101b37d0dabd127ddb1e" for volume "runner-uyekh3z-project-225-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70": waiting for permission container to finish: exit code 127

I also see the same error after upgrading to v16.6.1, the latest gitlab-runner release at this time.
This machine has been set up to run Podman in rootless mode under the gitlab-runner user.
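
For context on the "exit code 127" part: by shell convention, 127 is the status returned when the command to run cannot be found. One possible reading (an assumption on my part, the job log does not confirm it) is that the permission container tried to run something that is missing from the helper image it was started from. A minimal sketch of the convention, where `no-such-command-xyz` is a made-up name:

```shell
#!/bin/sh
# Demonstrates the shell convention behind "exit code 127".
# `no-such-command-xyz` is a made-up name that should not exist on any system.
probe_missing_command() {
    # Run the missing command in a child shell and hide its error message.
    sh -c 'no-such-command-xyz' 2>/dev/null
    # Print the exit status of that child shell.
    echo "$?"
}

probe_missing_command
```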

Two ways I can work around the issue:

Workaround Option 1

Downgrade from 16.6.x to v16.5.0. Running…

sudo dnf install gitlab-runner-16.5.0-1.x86_64

…and then retrying the previously failed job makes it pass.

Workaround Option 2

Override the GitLab helper image with the SAME version the runner would have used by default. With gitlab-runner 16.6.1 installed, that is revision f5da3c5a (see the gitlab-runner repo tags).

/etc/gitlab-runner/config.toml

  [runners.docker]
    helper_image = "registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-f5da3c5a"
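
Instead of looking up the revision in the repo tags by hand, the image reference can probably be built from the installed runner itself. The sketch below assumes that `gitlab-runner --version` prints a line of the form "Git revision: <short sha>" and that the helper tag is always `<arch>-<revision>`, as it is in the config above. `helper_image_ref` is a hypothetical helper name, not part of gitlab-runner.

```shell
#!/bin/sh
# Build the helper image reference from `gitlab-runner --version` output on stdin.
# Assumption: the output contains a line of the form "Git revision: <short sha>".
helper_image_ref() {
    arch="${1:-x86_64}"
    # Keep only the text after "Git revision:" from the first matching line.
    revision=$(sed -n 's/^Git revision:[[:space:]]*//p' | head -n 1)
    if [ -z "$revision" ]; then
        echo "could not find a 'Git revision:' line" >&2
        return 1
    fi
    printf 'registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:%s-%s\n' \
        "$arch" "$revision"
}

# Usage on the runner machine:
#   gitlab-runner --version | helper_image_ref x86_64
```

The printed value is what would go into helper_image under [runners.docker].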

Retrying a previously failed job then looks like this. Note that the “default would be” tag and the overridden helper image tag are the same!

Preparing the "docker" executor 00:02
Using Docker executor with image  <local gitlab registry>image:latest ...
Using helper image:  registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-f5da3c5a  (overridden, default would be  registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-f5da3c5a )
Pulling docker image registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-f5da3c5a ...
Using docker image sha256:561dca7a33f86bf3c2bf1112bbc1d3d12c6962e202e1e985185cc61177a4fdc1 for registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-f5da3c5a with digest registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper@sha256:8c79152aed93973ee94ff532e32dab167ef5ce34ec0aef072f07097d587821a8 ...
Using helper image:  registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-f5da3c5a  (overridden, default would be  registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-f5da3c5a )
Using docker image sha256:561dca7a33f86bf3c2bf1112bbc1d3d12c6962e202e1e985185cc61177a4fdc1 for registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-f5da3c5a with digest registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper@sha256:8c79152aed93973ee94ff532e32dab167ef5ce34ec0aef072f07097d587821a8 ...
Authenticating with credentials from job payload (GitLab Registry)
Pulling docker image  <local gitlab registry>image:latest ...
Using docker image sha256:<...> for  <local gitlab registry>image:latest with digest <...>
Not using umask - FF_DISABLE_UMASK_FOR_DOCKER_EXECUTOR is set!

[...]

Job succeeded

Question

Am I crazy? Anyone else seen this too? Thoughts/ideas?

Runner System Details

All system packages up-to-date as of 2023-11-27

[gitlab-runner@my-machine ~]$ podman info
host:
  arch: amd64
  buildahVersion: 1.31.3
  cgroupControllers: []
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: conmon-2.1.8-1.el9.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.8, commit: 879ca989e09d731947cd8d9cbb41038549bf669d'
  cpuUtilization:
    idlePercent: 98.74
    systemPercent: 0.24
    userPercent: 1.02
  cpus: 16
  databaseBackend: boltdb
  distribution:
    distribution: '"almalinux"'
    version: "9.3"
  eventLogger: file
  freeLocks: 2048
  hostname: epyc-rhino
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1001
      size: 1
    - container_id: 1
      host_id: 165536
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1001
      size: 1
    - container_id: 1
      host_id: 165536
      size: 65536
  kernel: 5.14.0-362.8.1.el9_3.x86_64
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 30848843776
  memTotal: 33241456640
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.7.0-1.el9.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.7.0
    package: netavark-1.7.0-1.el9.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.7.0
  ociRuntime:
    name: crun
    package: crun-1.8.7-1.el9.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.7
      commit: 53a9996ce82d1ee818349bdcc64797a1fa0433c4
      rundir: /run/user/1001/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20230818.g0af928e-4.el9.x86_64
    version: |
      pasta 0^20230818.g0af928e-4.el9.x86_64
      Copyright Red Hat
      GNU Affero GPL version 3 or later <https://www.gnu.org/licenses/agpl-3.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: true
    path: /run/user/1001/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.1-1.el9.x86_64
    version: |-
      slirp4netns version 1.2.1
      commit: 09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.2
  swapFree: 16827543552
  swapTotal: 16827543552
  uptime: 1h 7m 26.00s (Approximately 0.04 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.access.redhat.com
  - registry.redhat.io
  - docker.io
store:
  configFile: /mnt/ci-data/gitlab-runner-home/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /mnt/ci-data/gitlab-runner-home/.local/share/containers/storage
  graphRootAllocated: 502921392128
  graphRootUsed: 35369885696
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 0
  runRoot: /run/user/1001/containers
  transientStore: false
  volumePath: /mnt/ci-data/gitlab-runner-home/.local/share/containers/storage/volumes
version:
  APIVersion: 4.6.1
  Built: 1695842412
  BuiltTime: Wed Sep 27 12:20:12 2023
  GitCommit: ""
  GoVersion: go1.19.10
  Os: linux
  OsArch: linux/amd64
  Version: 4.6.1

/etc/gitlab-runner/config.toml

concurrent = 1
check_interval = 0
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "<my-machine> Podman/Docker runner"
  url = "<...>"
  id = 292
  token = "<...>"
  token_obtained_at = <...>
  token_expires_at = <...>
  executor = "docker"
  environment = ["FF_NETWORK_PER_BUILD=1", "FF_USE_FASTZIP=1", "ARTIFACT_COMPRESSION_LEVEL=fast", "CACHE_COMPRESSION_LEVEL=fast", "FF_DISABLE_UMASK_FOR_DOCKER_EXECUTOR=1"]
  [runners.docker]
    # Explicit helper image to workaround "waiting for permission container to finish exit 127" error encountered 2023-11-20
    helper_image = "registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-f5da3c5a"
    host = "unix:///run/user/1001/podman/podman.sock"
    tls_verify = false
    image = "quay.io/podman/stable"
    privileged = true
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    # 1 GByte
    shm_size = 1073741824
    [runners.docker.tmpfs]
      "/mnt/ramdisk" = "rw,exec,size=2G"

When running v16.6.1, if I

  • run podman system prune --all --volumes to clean up any old cruft
  • retry a CI job

then the CI job log shows the same error as before:

Running with gitlab-runner 16.6.1 (f5da3c5a)
  on <my-machine> Podman/Docker runner <...>, system ID: <...>
  feature flags: FF_NETWORK_PER_BUILD:true, FF_USE_FASTZIP:true, FF_DISABLE_UMASK_FOR_DOCKER_EXECUTOR:true
Resolving secrets 00:00
Preparing the "docker" executor 00:06
Using Docker executor with image <local gitlab registry>image:latest ...
ERROR: Job failed: adding cache volume: set volume permissions: running permission container "f404c36e60fca30ce81978bb9d92247812394ffafb3dadfaca52c1c537e03ada" for volume "runner-<...>-project-225-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70": waiting for permission container to finish: exit code 127

The runner machine also shows that the helper image has been imported into Podman:

$ podman images
REPOSITORY                                                         TAG              IMAGE ID      CREATED         SIZE
registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper  x86_64-f5da3c5a  53b08cbd609b  16 seconds ago  69.1 MB

… so I don’t think it’s issue gitlab-org/gitlab-runner#29576

Notice, however, what happens if I

  • run podman system prune --all --volumes
  • set the following in config.toml:
    • helper_image = "registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-${CI_RUNNER_REVISION}"
  • run sudo gitlab-runner restart
  • retry a job

The helper image that then gets pulled into Podman is:

$ podman images
REPOSITORY                                                         TAG              IMAGE ID      CREATED     SIZE
registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper  x86_64-f5da3c5a  561dca7a33f8  3 days ago  69.1 MB

Why are the image IDs not the same? Isn’t “IMAGE ID” supposed to be the unique hash that we can verify with upstream registries? Why would gitlab-runner default to an image with the same tag but a different hash?
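
To make the comparison concrete, here is a small sketch that checks whether a short ID from `podman images` is a prefix of the full `sha256:` image ID printed in the passing job log. `same_image` is a hypothetical helper name; all IDs are copied from the output above.

```shell
#!/bin/sh
# Returns 0 if the short image ID (as printed by `podman images`) is a prefix
# of the full image ID (as printed in the job log, with or without "sha256:").
same_image() {
    short="${1#sha256:}"
    full="${2#sha256:}"
    # An empty ID would match anything, so treat it as a mismatch.
    [ -n "$short" ] || return 1
    case "$full" in
        "$short"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Full image ID from the passing job (helper image pulled from the registry).
job_log_id="sha256:561dca7a33f86bf3c2bf1112bbc1d3d12c6962e202e1e985185cc61177a4fdc1"

# First ID: default helper image in the failing run. Second ID: overridden one.
for id in 53b08cbd609b 561dca7a33f8; do
    if same_image "$id" "$job_log_id"; then
        echo "$id matches the image used in the passing job"
    else
        echo "$id does NOT match the image used in the passing job"
    fi
done
```

Only the overridden image matches. Note that the job log prints two different hashes for the same image: the image ID (561dca7a…) and the registry digest (8c79152a…). So when checking against an upstream registry, the digest is likely the value to compare. On the runner machine, something like `podman image inspect --format '{{.Id}} {{.Digest}}' <image>` should show both values, though I have not verified the exact output here.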