Confusion around CI docker cache volumes, and sharing across jobs / concurrency

  • What are you seeing, and how does that differ from what you expect to see?

TL;DR of my goal: have a single local cache (or as few as possible) for this configuration.

I have been on a small adventure figuring out how docker runner caching works.
I now have a single runner configured, with a concurrency of 4 and a “global” pipeline cache with a fixed key, but I’m still slightly confused about the volumes in use.

On the runner’s host, looking at the docker volumes after running the pipeline detailed below, I see a whole bunch of volumes.

DRIVER    VOLUME NAME
local     runner-6kaxapsy-project-16-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-6kaxapsy-project-16-concurrent-0-cache-904f6ed42e0fa2b14c1d7a2ed6f1875e
local     runner-6kaxapsy-project-16-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-6kaxapsy-project-16-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-6kaxapsy-project-16-concurrent-1-cache-904f6ed42e0fa2b14c1d7a2ed6f1875e
local     runner-6kaxapsy-project-16-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-6kaxapsy-project-16-concurrent-2-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-6kaxapsy-project-16-concurrent-2-cache-904f6ed42e0fa2b14c1d7a2ed6f1875e
local     runner-6kaxapsy-project-16-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-6kaxapsy-project-16-concurrent-3-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-6kaxapsy-project-16-concurrent-3-cache-904f6ed42e0fa2b14c1d7a2ed6f1875e
local     runner-6kaxapsy-project-16-concurrent-3-cache-c33bcaa1fd2c77edfc3893b41966cea8

I discovered in some docs that these volume names are in the following format:

runner-<short-token>-project-<id>-concurrent-<concurrency-id>-cache-<md5-of-path>

I was expecting different caches per concurrent job, as I had already read about that in other GitLab tickets.

However, I do not understand why the md5-of-path differs, leaving me with 12 volumes, when the same mediawiki path is defined with the same cache key.
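A rough way to see how the 12 break down is to group the volumes by their trailing hash (plain shell, nothing GitLab-specific; assumes GNU tools):

# count volumes per trailing hash; with a concurrency of 4 and three
# distinct path hashes this prints three suffixes, each with a count of 4
docker volume ls -q --filter name=runner- | sed 's/.*-cache-//' | sort | uniq -c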

This is on GitLab Community Edition 13.12.9

This is a TL;DR version of my config.
The full work-in-progress file can be found here

image: docker:19.03.12

cache:
  - key: mediawiki
    paths:
      - mediawiki

services:
  - name: docker:19.03.12-dind
    # Use a registry mirror to avoid hitting docker hub too much as there is a rate limit of 100 pulls per 6 hours
    command: ["--registry-mirror", "https://mirror.gcr.io"]

integration:
    parallel:
      matrix:
        - TEST: docker-mw-extra-commands.sh
        - TEST: docker-mw-install-all-the-dbs.sh
        - TEST: docker-mw-mysql-suspend-resume-destroy.sh
    before_script:
      - ./tests/cache-mediawiki.sh
      - ./tests/setup.sh
    script:
      - ./tests/$TEST

Some general questions I have come up with that, despite much googling, I haven’t found certain answers to:

  • Do these 12 volumes actually mean that separate caches are used? (Testing indicates maybe; see the sketch after this list.)
  • Is the hash at the end of the volume name actually the path? (I have only found this referenced on one doc page so far.)
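One way to probe the first question is to mount each volume into a throwaway container and list its contents from the host (a hedged sketch; assumes the alpine image can be pulled):

# peek inside each runner cache volume; differing contents across the
# concurrent-<n> copies would confirm that separate caches are in use
for VOL in $(docker volume ls -q --filter name=runner-); do
  echo "== $VOL =="
  docker run --rm -v "$VOL":/v:ro alpine ls -la /v
done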

Hi,

I don’t worry about those volumes; I run the script below as a cron job at 5am every day to clean up the rubbish GitLab gets up to. It also deletes any of those concurrent containers that are still around, as long as they are not running.

#!/bin/bash

## SCRIPT: cleanupGitlabRunner.sh
##
## MODIFIED: 2021-07-12 13:30

EXIT_CODE=0

# remove exited runner containers left behind by concurrent jobs
for DOCKERMACHINE in $(docker ps -a --format '{{.Names}} {{.Status}}' | grep 'Exited' | grep -Po '^runner\S*concurrent\S*')
do
  echo "Found Gitlab Runner cache machine : $DOCKERMACHINE"
  echo "  .. deleting machine"
  docker rm "$DOCKERMACHINE"
done

echo ""
echo "Docker images prune"
docker image prune -a -f

if [ "$?" != "0" ]; then
   EXIT_CODE=1
fi

echo ""
echo "Docker volume prune"
docker volume prune -f

if [ "$?" != "0" ]; then
   EXIT_CODE=1
fi

echo "Done"

exit $EXIT_CODE
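For reference, the matching crontab entry looks something like this (the script path here is an assumption):

# run the runner cleanup at 05:00 every day
0 5 * * * /usr/local/bin/cleanupGitlabRunner.sh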

I have runners on two servers. The one server (1x runner, cache enabled) with the cleanup script has no runner cache volumes; the other server (2x runners), with no cleanup script and up for 3 months, has 21 cache volumes. On server 2 one runner has cache disabled and the other enabled, for no particular reason.

Runners can be a nightmare, and if you try to follow GitLab developer advice on forums or defect tickets you can go in circles. Runner setup and operation is good when it works; when it stops working, it can take 5 hours to figure out. On paper it looks so simple.

I rebuilt the 2nd server a while back but did not put the cleanup script on it; I will do that now.

So unless you have a particular reason to need the cache, it makes no difference, from what I can tell. Unless you have to optimise (reduce) runner execution time, skipping the cache may avoid other issues where cached data results in unexpected behaviour. Theoretically.

Good luck.

I am late to the party, but you are mixing two cache configuration concepts:

The one where you define a cache: with a key and paths is specific to the shared cache or the local cache_dir (configured in the runner’s config.toml); this stores a cache.zip which, on the next run, is fetched if available and unzipped.
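For orientation, the local variant of that first mechanism is pointed at by cache_dir in the runner’s config.toml; a minimal sketch (the value is an assumption, and its exact semantics depend on the executor):

[[runners]]
...
  # local cache location used by the cache: key/paths mechanism when no
  # shared (e.g. S3) cache is configured
  cache_dir = "/cache"
...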

The volumes are part of the docker executor configuration, which creates a new volume and mounts it at /cache in your container while executing a new job.
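You can watch that second mechanism happen: while a job is running, inspecting its container shows the volume mounted at /cache (the container name is a placeholder; take the real one from docker ps):

# show each mount's volume name and destination for a job container
docker inspect <job-container> \
  --format '{{range .Mounts}}{{.Name}} -> {{.Destination}}{{"\n"}}{{end}}'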

There is very little documentation on what changes the last part of the naming convention of the gitlab-runner docker volume, i.e. the <md5-of-path> part.

From the code (executors/docker/internal/volumes/manager.go · main · GitLab.org / gitlab-runner · GitLab), it seems to be related to the hash of the volume destination:

func hashPath(path string) string {
  return fmt.Sprintf("%x", md5.Sum([]byte(path)))
}

volumeName := fmt.Sprintf("%s-cache-%s", name, hashPath(destination))

It seems the hashed path is the bind mount destination, as defined in the tests.
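You can replicate hashPath from the shell to test a suspected destination against the suffixes above (printf rather than echo, so no trailing newline sneaks into the hashed string):

# md5 of the literal string "/cache"; compare against the cache-<md5>
# suffix of the volume names listed earlier
printf '%s' '/cache' | md5sum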

Yet it is still a wonder why this hash changes when the runner configuration is something like:

...
  [runners.docker]
...
    disable_cache = false
    volumes = ["/cache"]
...

In that case the volume destination should always be /cache, and the md5 hash should therefore be the same for every volume.

@ajwalker, could you please help us understand this a bit more, as you made the change?