GitLab Runner fails to prepare container because of missing cache

I needed to do some housekeeping to recover disk space, which involved running the docker image prune and docker volume prune commands; neither of those should remove anything that is in use.
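For reference, the housekeeping commands were along these lines (the exact flags may have differed slightly):

$ docker image prune    # removes dangling images only
$ docker volume prune   # removes local volumes not used by any container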

Since doing that, I cannot run any jobs; they fail like this:

Preparing the "docker" executor
Using Docker executor with image docker:20.10.8 ...
ERROR: Preparation failed: adding cache volume: set volume permissions: create permission container for volume "runner-dc7xussp-project-16-concurrent-0-cache-904f6ed42e0fa2b14c1d7a2ed6f1875e": Error response from daemon: exit status 2: "/usr/bin/zfs fs snapshot system/docker/418e78d27d51c2e2628534aaf9f84c5d76748d62e548a4de356328e0fb3a0c31@245476797" => cannot open 'system/docker/418e78d27d51c2e2628534aaf9f84c5d76748d62e548a4de356328e0fb3a0c31': dataset does not exist
usage:
	snapshot [-r] [-o property=value] ... <filesystem|volume>@<snap> ...
For the property list, run: zfs set|get
For the delegated permission list, run: zfs allow|unallow (linux_set.go:95:0s)
Will be retried in 3s ...

The dataset being referred to does not exist. I've seen a similar question on Server Fault, but it has no answer.

Prior to this, the system was running flawlessly, but now I cannot run any CI pipelines. What can I do to fix it?
Thanks!

OK, I have fixed this after a look through the runner's source code. The problem was the runner's failure to create a "permissions container" from an image called registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-8925d9a0 whose ZFS dataset had somehow gone missing:

"GraphDriver": {
     "Data": {
         "Dataset": "system/docker/418e78d27d51c2e2628534aaf9f84c5d76748d62e548a4de356328e0fb3a0c31",
         "Mountpoint": "/var/lib/docker/zfs/graph/418e78d27d51c2e2628534aaf9f84c5d76748d62e548a4de356328e0fb3a0c31"
     },
     "Name": "zfs"
 },
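That snippet comes from docker image inspect on the helper image; something along these lines should reproduce it and confirm that the dataset really is gone (the tag and dataset name are from my system, yours will differ):

$ docker image inspect registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-8925d9a0 --format '{{ json .GraphDriver }}'
$ zfs list system/docker/418e78d27d51c2e2628534aaf9f84c5d76748d62e548a4de356328e0fb3a0c31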

I suspect the dataset got deleted by one of the docker prune commands mentioned above, although how something clearly "in use" gets deleted like that, I don't know. Another possible explanation is that the whole thing started with the disk space issues and (a total guess here) that the image had only partially downloaded in the first place.

Anyway, this is what I did:

$ docker image rm  registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-8925d9a0
Error response from daemon: exit status 1: "/usr/bin/zfs fs destroy -r system/docker/418e78d27d51c2e2628534aaf9f84c5d76748d62e548a4de356328e0fb3a0c31" => cannot open 'system/docker/418e78d27d51c2e2628534aaf9f84c5d76748d62e548a4de356328e0fb3a0c31': dataset does not exist

Despite the error message, the image was deleted. When I then retried a CI job, the image was pulled again and everything has worked fine since.
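If you want to sanity-check the result, something like the following should confirm the helper image is back and that its new dataset actually exists (the tag is the one from my setup; the dataset name printed by the first command is what you feed to the second):

$ docker image inspect registry.gitlab.com/gitlab-org/gitlab-runner/gitlab-runner-helper:x86_64-8925d9a0 --format '{{ .GraphDriver.Data.Dataset }}'
$ zfs list <dataset-printed-above>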

GitLab Runner (docker executor) creates a new container for each pipeline job. If no jobs are running in GitLab, then there are no job containers on your Docker host, so when you prune, any volumes and images they used get deleted because no container is currently using them. Just for explanation.
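You can see this for yourself: job containers only exist while a job is running, so something like

$ docker ps --filter "name=runner-"

shows them during a pipeline run and shows nothing once the jobs have finished.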

Anyway, prune shouldn't affect GitLab Runner (I've done it thousands of times); in fact it is highly recommended to have a cron job (or similar) to prune things on a machine running the GitLab Runner Docker executor. There are plenty of issues on this topic, and there are also a couple of scripts on the web if you search with your favorite search engine. A sketch of such a cron job is below.
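As a rough example only (the schedule, retention window and log path are placeholders to adjust for your machine), a nightly prune via /etc/cron.d could look like this:

# /etc/cron.d/docker-prune -- example: prune unused Docker data older than a week, every night at 03:30
30 3 * * * root /usr/bin/docker system prune --force --filter "until=168h" >> /var/log/docker-prune.log 2>&1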

My guess would be that the image got corrupted because of the disk space issues.