Confusion around CI docker cache volumes, and sharing across jobs / concurrency

  • What are you seeing, and how does that differ from what you expect to see?

TL;DR of my goal: have a single (or as few as possible) local cache volume for this configuration.

I have been on a small adventure figuring out how caching works with the Docker runner executor.
I now have a single runner configured with a concurrency of 4 and a “global” pipeline cache with a fixed key, but I’m still slightly confused about the volumes in use.

When I look at the Docker volumes on the runner host after running the pipeline detailed below, I see a whole bunch of volumes:

DRIVER    VOLUME NAME
local     runner-6kaxapsy-project-16-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-6kaxapsy-project-16-concurrent-0-cache-904f6ed42e0fa2b14c1d7a2ed6f1875e
local     runner-6kaxapsy-project-16-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-6kaxapsy-project-16-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-6kaxapsy-project-16-concurrent-1-cache-904f6ed42e0fa2b14c1d7a2ed6f1875e
local     runner-6kaxapsy-project-16-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-6kaxapsy-project-16-concurrent-2-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-6kaxapsy-project-16-concurrent-2-cache-904f6ed42e0fa2b14c1d7a2ed6f1875e
local     runner-6kaxapsy-project-16-concurrent-2-cache-c33bcaa1fd2c77edfc3893b41966cea8
local     runner-6kaxapsy-project-16-concurrent-3-cache-3c3f060a0374fc8bc39395164f415a70
local     runner-6kaxapsy-project-16-concurrent-3-cache-904f6ed42e0fa2b14c1d7a2ed6f1875e
local     runner-6kaxapsy-project-16-concurrent-3-cache-c33bcaa1fd2c77edfc3893b41966cea8

I discovered in some docs that these volume names are in the following format:

runner-<short-token>-project-<id>-concurrent-<concurrency-id>-cache-<md5-of-path>

I was expecting different caches per concurrent job, as I had already read about that in other GitLab tickets.

However, I do not understand why the md5-of-path differs, leaving me with 12 volumes, when the same path (mediawiki) and the same cache key are defined everywhere.
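
To make the pattern easier to see, a rough sketch like this (run on the runner host; "project-16" is taken from the volume names above) groups the volumes by concurrency slot and path hash, which is how I counted 4 slots × 3 hashes = 12:

# Split each cache volume name into its concurrency slot and path hash
docker volume ls --format '{{.Name}}' \
  | grep 'project-16-concurrent' \
  | sed -E 's/.*concurrent-([0-9]+)-cache-([0-9a-f]+)/slot \1  hash \2/' \
  | sort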

This is on GitLab Community Edition 13.12.9

This is a TL;DR version of my config; the full work-in-progress file can be found here.

image: docker:19.03.12

cache:
  - key: mediawiki
    paths:
      - mediawiki

services:
  - name: docker:19.03.12-dind
    # Use a registry mirror to avoid hitting docker hub too much as there is a rate limit of 100 pulls per 6 hours
    command: ["--registry-mirror", "https://mirror.gcr.io"]

integration:
  parallel:
    matrix:
      - TEST: docker-mw-extra-commands.sh
      - TEST: docker-mw-install-all-the-dbs.sh
      - TEST: docker-mw-mysql-suspend-resume-destroy.sh
  before_script:
    - ./tests/cache-mediawiki.sh
    - ./tests/setup.sh
  script:
    - ./tests/$TEST

Some general questions I have come up with that, despite much googling, I haven’t found definite answers to:

  • Do these 12 volumes actually mean that separate caches are used? (Testing indicates maybe.)
  • Is the hash at the end of the volume name actually derived from the cache path? (I have only found this in one docs page so far; see the check sketched below.)
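
On the second question, a quick way to test the theory (just a sketch; the path below is a placeholder for the in-container build path, not something taken from my setup):

# Hypothetical check, assuming the hash is the md5 of the in-container cache
# directory path: hashing that path should reproduce one of the suffixes above.
printf '%s' '/builds/<namespace>/<project>/mediawiki' | md5sum

# Unique hashes actually present on the host, for comparison
docker volume ls --format '{{.Name}}' | grep -Po 'cache-\K[0-9a-f]{32}' | sort -u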

Hi,

I don’t worry about those volumes. I run the script below via a cron job at 5am every day to clean up the rubbish GitLab leaves behind. It also deletes any of those concurrent containers that are still around, provided they are not running.

#!/bin/bash

## SCRIPT: cleanupGitlabRunner.sh
##
## MODIFIED: 2021-07-12 13:30

EXIT_CODE=0

# Collect any stopped runner "concurrent" containers (name then status per line)
for DOCKERMACHINE in $(docker ps -a --format '{{.Names}} {{.Status}}' | grep -P 'runner.*concurrent.*Exited' | cut -d' ' -f1)
do
  echo "Found Gitlab Runner cache machine : $DOCKERMACHINE"
  echo "  .. deleting machine"
  docker rm $DOCKERMACHINE
done

echo ""
echo "Docker images prune"
docker image prune -a -f

if [ "$?" != "0" ]; then
   EXIT_CODE=1
fi

echo ""
echo "Docker volume prune"
docker volume prune -f

if [ "$?" != "0" ]; then
   EXIT_CODE=1
fi

echo "Done"

exit $EXIT_CODE
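
For reference, the crontab entry for the 5am run looks something like this (the script and log paths are just examples):

# m h dom mon dow  command
0 5 * * * /usr/local/bin/cleanupGitlabRunner.sh >> /var/log/cleanupGitlabRunner.log 2>&1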

I have runners on two servers. The first server (1x runner, cache enabled) has the cleanup script and has no runner cache volumes; the second server (2x runners), with no cleanup script and up for 3 months, has 21 cache volumes. On server 2 one runner has cache disabled and the other enabled, for no particular reason.

Runners can be a nightmare, and if you try to follow GitLab developer advice on forums or defect tickets you can go in circles. Runner setup and operation is good when it works; when it stops working it can take 5 hours to figure out. On paper it looks so simple.

I rebuilt the 2nd server a while back but did not put the cleanup script on it; I will do that now.

So unless you have a particular reason to need the cache, it makes no difference as far as I can tell. Unless you have to optimise (reduce) runner execution time, running without a cache may avoid issues where stale cached data causes unexpected behaviour. Theoretically.
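
If stale cache data is the concern, a middle ground (just a sketch; "project-16" is lifted from the volume names earlier in the thread) is to drop only that project's cache volumes between runs instead of disabling the cache entirely:

# Remove only this project's runner cache volumes; the next job rebuilds them.
# --filter name= does a substring match; volumes still attached to a container
# will refuse to be removed, which is fine.
docker volume ls -q --filter name=project-16-concurrent | xargs -r docker volume rm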

Good luck.