Problem to solve
I recently migrated our self-hosted runner executor from `docker+machine` to `docker-autoscaler`, because the former will be EOL at the end of 2024. Suddenly we were running into a lot of "No space left on device" errors.
Upon closer inspection I found that the 100 GB disk on each VM fills up because a lot of Docker volumes accumulate. This does not happen after a few weeks or months, but after a few hours. Increasing the disk size is not an option: even 150 GB fills up very quickly, and frankly it seems like a waste of money, considering that the total disk space a single job could ever need is about 25 GB (repository plus various caches). The volumes on one instance look like this:
```
3.9G runner-<runner-id>-project-<project-id>-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
3.3G runner-<runner-id>-project-<project-id>-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
73M runner-<runner-id>-project-<project-id>-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
3.3G runner-<runner-id>-project-<project-id>-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
5.6G runner-<runner-id>-project-<project-id>-concurrent-10-cache-3c3f060a0374fc8bc39395164f415a70
4.2G runner-<runner-id>-project-<project-id>-concurrent-10-cache-c33bcaa1fd2c77edfc3893b41966cea8
7.2G runner-<runner-id>-project-<project-id>-concurrent-11-cache-3c3f060a0374fc8bc39395164f415a70
3.5G runner-<runner-id>-project-<project-id>-concurrent-11-cache-c33bcaa1fd2c77edfc3893b41966cea8
7.2G runner-<runner-id>-project-<project-id>-concurrent-12-cache-3c3f060a0374fc8bc39395164f415a70
5.4G runner-<runner-id>-project-<project-id>-concurrent-12-cache-c33bcaa1fd2c77edfc3893b41966cea8
5.0G runner-<runner-id>-project-<project-id>-concurrent-13-cache-3c3f060a0374fc8bc39395164f415a70
4.9G runner-<runner-id>-project-<project-id>-concurrent-13-cache-c33bcaa1fd2c77edfc3893b41966cea8
4.4G runner-<runner-id>-project-<project-id>-concurrent-7-cache-3c3f060a0374fc8bc39395164f415a70
3.4G runner-<runner-id>-project-<project-id>-concurrent-7-cache-c33bcaa1fd2c77edfc3893b41966cea8
1.4G runner-<runner-id>-project-<project-id>-concurrent-8-cache-3c3f060a0374fc8bc39395164f415a70
5.4G runner-<runner-id>-project-<project-id>-concurrent-8-cache-c33bcaa1fd2c77edfc3893b41966cea8
```
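For reference, a listing like the one above can be reproduced on an instance with something like the following (assuming Docker's default data root of `/var/lib/docker`; adjust if yours differs):

```sh
# Show the size of each runner-created volume on the instance
# (assumes the default Docker data root).
sudo du -sh /var/lib/docker/volumes/runner-*/

# The same volumes, as seen by the Docker CLI.
docker volume ls --filter name=runner-
```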
I know there are already some issues discussing this:
- Docker gitlab-runner volumes taking so many storage
- Smart cache cleanup for Docker images & volumes (#27332) · Issues · GitLab.org / gitlab-runner · GitLab
Apparently, at the moment the GitLab Runner does not clean up those volumes by design; however, my question is rather why the cache volumes are recreated at all.
This is not happening with `docker+machine`, but only with the `docker-autoscaler` executor. The config is more or less the same; in particular, nothing concerning the cache was changed. With `docker+machine` the concurrent id in the volume name is always 0.

The volumes contain the same things: the repository and some caches from `gradle`, `pnpm` and the like. Why is the cache suddenly not reused for subsequent jobs? Probably because the concurrent id is incremented, even though there is only ever one job running on any given machine (`capacity_per_instance` is 1). The project id is also always the same; it is just the concurrent id and the hash at the end that change. Presumably the hash identifies the mounted path (one volume for `/builds`, one for `/cache`), which would explain why there are exactly two volumes per concurrency slot.
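As a stopgap (this does not answer the question of why reuse breaks), unused volumes could be pruned periodically on the instances. A minimal sketch, assuming the VM image runs cron and root can reach the Docker daemon; note this also deletes the runner's cache volumes, so jobs would fall back to the GCS cache:

```sh
# Hypothetical /etc/cron.d/docker-volume-prune on the autoscaled VM image:
# once per hour, remove all volumes not used by any container.
0 * * * * root docker volume prune -f >> /var/log/docker-volume-prune.log 2>&1
```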
Configuration
We have no additional configuration regarding the cache in any `.gitlab-ci.yml` file.
Important parts of the runner config
```toml
[[runners]]
  name = "docker-autoscaler-1"
  url = "https://gitlab.com/"
  token = "xxxxxx"
  executor = "docker-autoscaler"
  limit = 240          # Job limit
  output_limit = 30000 # Maximum log size

  # Directories
  cache_dir = "/cache"
  builds_dir = "/builds"

  [runners.docker]
    image = "ubuntu:24.04"
    pull_policy = ["always"]
    tls_verify = false
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    shm_size = 2000000000 # 2 GB
    volumes = [
      "/var/run/docker.sock:/var/run/docker.sock",
      "/cache",
      "/builds",
    ]

  [runners.autoscaler]
    # Manually installed in the Dockerfile
    plugin = "fleeting-plugin-googlecloud"
    max_instances = 240       # Maximum number of instances
    capacity_per_instance = 1 # How many jobs run in parallel on a single VM
    delete_instances_on_shutdown = false

  [runners.cache]
    Type = "gcs"
    Path = "runner-cache"
    Shared = true # Share between runners

    [runners.cache.gcs]
      CredentialsFile = "..."
      BucketName = "..."
```
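The `/cache` and `/builds` entries in `volumes` above (mounts without a host path) appear to be what the runner turns into the per-concurrency cache volumes from the listing. To verify which path a given hash corresponds to, something like this should work on an instance (the volume name is just an example taken from the listing above):

```sh
# Where does this volume live on disk?
docker volume inspect \
  runner-<runner-id>-project-<project-id>-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70 \
  --format '{{ .Mountpoint }}'

# Listing the mountpoint reveals whether the hash maps to /builds or /cache
# (replace <volume-name> with one of the names from the listing).
sudo ls "$(docker volume inspect <volume-name> --format '{{ .Mountpoint }}')"
```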
Versions

Applies to:

- Self-managed
- GitLab.com SaaS
- Self-hosted Runners

Version information:

- GitLab Runner: 17.2.0 (with `docker-autoscaler` executor)
- fleeting-plugin-googlecloud: 1.0.0