GitLab Runner with docker-autoscaler not reusing available cache volumes

Problem to solve

I recently migrated our self-hosted runner executor from docker+machine to docker-autoscaler, because the former will be EOL at the end of 2024.

Suddenly we were running into a lot of “No space left on device” errors.

Upon closer inspection I found that the disk on the single VMs, which is 100 GB, is filling up because there are a lot of Docker volumes. This does not happen after a few weeks or months, but after a few hours. Increasing the disk space is not an option: even 150 GB is full very quickly, and frankly it seems like a waste of money considering that the total amount of disk space a single job could ever need is 25 GB (repository + various caches). A typical breakdown on one VM looks like this:

3.9G	runner-<runner-id>-project-<project-id>-concurrent-0-cache-3c3f060a0374fc8bc39395164f415a70
3.3G	runner-<runner-id>-project-<project-id>-concurrent-0-cache-c33bcaa1fd2c77edfc3893b41966cea8
73M	    runner-<runner-id>-project-<project-id>-concurrent-1-cache-3c3f060a0374fc8bc39395164f415a70
3.3G	runner-<runner-id>-project-<project-id>-concurrent-1-cache-c33bcaa1fd2c77edfc3893b41966cea8
5.6G	runner-<runner-id>-project-<project-id>-concurrent-10-cache-3c3f060a0374fc8bc39395164f415a70
4.2G	runner-<runner-id>-project-<project-id>-concurrent-10-cache-c33bcaa1fd2c77edfc3893b41966cea8
7.2G	runner-<runner-id>-project-<project-id>-concurrent-11-cache-3c3f060a0374fc8bc39395164f415a70
3.5G	runner-<runner-id>-project-<project-id>-concurrent-11-cache-c33bcaa1fd2c77edfc3893b41966cea8
7.2G	runner-<runner-id>-project-<project-id>-concurrent-12-cache-3c3f060a0374fc8bc39395164f415a70
5.4G	runner-<runner-id>-project-<project-id>-concurrent-12-cache-c33bcaa1fd2c77edfc3893b41966cea8
5.0G	runner-<runner-id>-project-<project-id>-concurrent-13-cache-3c3f060a0374fc8bc39395164f415a70
4.9G	runner-<runner-id>-project-<project-id>-concurrent-13-cache-c33bcaa1fd2c77edfc3893b41966cea8
4.4G	runner-<runner-id>-project-<project-id>-concurrent-7-cache-3c3f060a0374fc8bc39395164f415a70
3.4G	runner-<runner-id>-project-<project-id>-concurrent-7-cache-c33bcaa1fd2c77edfc3893b41966cea8
1.4G	runner-<runner-id>-project-<project-id>-concurrent-8-cache-3c3f060a0374fc8bc39395164f415a70
5.4G	runner-<runner-id>-project-<project-id>-concurrent-8-cache-c33bcaa1fd2c77edfc3893b41966cea8
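
For reference, this is roughly how such a listing can be produced on one of the autoscaled VMs (a sketch; it assumes Docker’s default data root and root access on the VM):

# Sum up the runner cache volumes under Docker's default data root
sudo du -sh /var/lib/docker/volumes/runner-* | sort -h

# Or let Docker itself report per-volume sizes
docker system df -v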

I know there are already some issues talking about this.

Apparently the GitLab Runner is not cleaning up those volumes by design at the moment. However, my question is rather why the cache volumes are recreated at all.

This is not happening with docker+machine, but only with the docker-autoscaler. The config is more or less the same; in particular, nothing concerning the cache was changed. With docker+machine the “concurrent-id” is always 0.
The volumes contain the same things: the repository and some caches from Gradle, pnpm and the like. Why is the cache suddenly not reused for subsequent jobs? Probably because the concurrent-id is incremented, even though there is always only one job running on any given machine (capacity_per_instance is 1).
The project-id is also always the same; it’s just the concurrent-id and the hash at the end that change.

Configuration

We have no additional configuration regarding the cache in any .gitlab-ci.yml file.

Important parts of the runner config

[[runners]]
    name = "docker-autoscaler-1"
    url  = "https://gitlab.com/"

    token    = "xxxxxx"
    executor = "docker-autoscaler"

    limit        = 240   # Job limit
    output_limit = 30000 # Maximum log size

    # Directories
    cache_dir  = "/cache"
    builds_dir = "/builds"

    [runners.docker]
        image       = "ubuntu:24.04"
        pull_policy = ["always"]

        tls_verify                   = false
        privileged                   = false
        disable_entrypoint_overwrite = false
        oom_kill_disable             = false
        disable_cache                = false
        shm_size                     = 2000000000 # 2GB

        volumes = [
            "/var/run/docker.sock:/var/run/docker.sock",
            "/cache",
            "/builds",
        ]

    [runners.autoscaler]
        # Manually installed in the Dockerfile
        plugin = "fleeting-plugin-googlecloud"

        max_instances                = 240   # Maximum number of instances
        capacity_per_instance        = 1     # How many jobs in parallel on a single VM
        delete_instances_on_shutdown = false

    [runners.cache]
        Type   = "gcs"
        Path   = "runner-cache"
        Shared = true           # Share between runners

        [runners.cache.gcs]
            CredentialsFile = "..."
            BucketName      = "..."

Versions


  • Self-managed
  • GitLab.com SaaS
  • Self-hosted Runners

Versions

  • GitLab Runner: 17.2.0 (with docker-autoscaler executor)
  • fleeting-plugin-googlecloud: 1.0.0

I cannot answer the questions about cache pruning in this topic, as they require engineering and architecture knowledge about the runner auto-scaling architecture. I’d suggest asking in one of the linked issues, for example Smart cache cleanup for Docker images & volumes (#27332) · Issues · GitLab.org / gitlab-runner · GitLab

Does the mentioned clear-docker-cache script help with pruning the images manually?
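
For reference, this is roughly how that script is invoked on a host where the gitlab-runner package is installed (a sketch; the exact path and options may vary between runner versions):

# Show current Docker disk usage
/usr/share/gitlab-runner/clear-docker-cache space

# Clean up runner-created containers, including unused volumes
/usr/share/gitlab-runner/clear-docker-cache prune-volumes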

Thank you for your input!

I’m not sure my question is completely on topic with the linked ticket. I can try, but I might have to create a separate ticket for that.

Unfortunately I cannot use the clear-docker-cache script you mentioned, as it is only available in the Docker container where the GitLab Runner itself runs, not on the separate instances which are created by the runner.
That’s also why I am hesitant to add a comment to those tickets, as they are mainly concerned with the docker executor, which is a single machine, and not with the fleet created by docker-autoscaler or docker+machine.

Now I either have to create a separate image for the single VMs that contains a cron job cleaning up the volumes after a while (which can still fail if a lot of very short jobs are run), or start every job with a pre_get_sources_script that calls docker volume prune -af (a sketch follows below). The latter also requires a modification to the helper image.
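
A rough sketch of the pre_get_sources_script variant in the runner config, assuming the helper image has been extended with the docker CLI and the host’s docker.sock is mounted as in the volumes list above (the -a/--all flag for docker volume prune also needs a reasonably recent Docker on the VM):

[[runners]]
    # ... existing settings as above ...

    # Runs in the helper image before the sources are fetched; with
    # capacity_per_instance = 1 no other job can be using the volumes.
    # "|| true" keeps a failed prune from failing the job.
    pre_get_sources_script = "docker volume prune -af || true"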

Besides, my main concern is why the docker volumes are actually not re-used, not how I can get rid of them.

Neither is really a good solution, as both still leave room for failure. As it stands now I have to go back to docker+machine, which will no longer be supported at the end of the year.

Ah, thanks, it clicked :slight_smile:

I’d suggest creating a new issue then, and tagging me there (same username). I can help loop in engineering and product team members, but cannot help much with technical details.


Thank you :slight_smile:
Done: GitLab Runner with docker-autoscaler not reusing available cache volumes (#37906) · Issues · GitLab.org / gitlab-runner · GitLab


I frankly don’t understand how the cache volumes are meant to work. If each job gets its own volume, then nothing is cached anyway. Certainly keeping them around when no other pipeline can possibly use them is doubly confusing. Why does GitLab not delete them immediately?

Normally those volumes do get reused for subsequent jobs, but only if the runner determines that they can be.
As for why they are not just deleted:

  • The volumes might be useful in later jobs, not just the next one
  • It’s not done automatically because it’s the domain of the third-party tool, Docker in this case :person_shrugging:; that’s why there’s an old, dusty feature ticket and not a bug ticket.