Trouble with concurrent pipelines and caching

I'm currently trying to resolve an odd caching issue that shows up after rebasing and pushing to a branch.

I define my cache as follows:

cache: &global_cache
    key: "cache-$CI_COMMIT_REF_SLUG-$CI_COMMIT_SHORT_SHA"
    paths:
        - ".gradle/caches"
    policy: pull-push

I have two stages:

stages:
    - Dependencies
    - Lint

gradle_dependencies:
    stage: Dependencies
    cache:
        # Inherit global cache settings
        <<: *global_cache
        # Override policy
        policy: push
    script:
        - gradle customTask
    when: always

gradle_ktlintCheck:
    stage: Lint
    script:
        # Rely on cache from dependencies stage
        - gradle --offline ktlintCheck
    artifacts:
        paths:
            - app/build/reports/ktlint/*
    when: on_success
    needs:
        - job: gradle_dependencies

This pipeline ran successfully for weeks while it only lived on one branch. However, now that we’ve integrated it into our main branch we’re running into a concurrency issue, for example when we rebase one branch and push it at around the same time as a push to another branch.

I can see the Dependencies stage create the correct cache:

Creating cache cache-release-1-0-0-7205d38d-1-non_protected...
.gradle/caches: found 32673 matching artifact files and directories 
No URL provided, cache will not be uploaded to shared cache server. Cache will be stored only locally. 
Created cache

but the Lint stage will intermittently fail to find that exact same cache:

Checking cache for cache-release-1-0-0-7205d38d-1-non_protected...
No URL provided, cache will not be downloaded from shared cache server. Instead a local version of cache will be extracted. 
WARNING: Cache file does not exist                 
Failed to extract cache

If I manually run the pipeline again while no other pipeline is running, it works completely fine. Is the cache getting deleted or overwritten by another pipeline? It seems strange, because I would have thought cache-$CI_COMMIT_REF_SLUG-$CI_COMMIT_SHORT_SHA would be unique enough to avoid any conflicts with other pipelines and branches.
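
For what it’s worth, the key does expand the way I expect; the runner just appends its own suffix, which is where the name in the log comes from:

# Key template from the cache definition:
#   cache-$CI_COMMIT_REF_SLUG-$CI_COMMIT_SHORT_SHA
# Expanded for this pipeline (per the job log above):
#   cache-release-1-0-0-7205d38d
# The runner then appends its own suffix, giving:
#   cache-release-1-0-0-7205d38d-1-non_protected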

My only other thought was to add a resource_group to serialize the jobs and avoid the concurrency entirely, but I don’t feel like the pipeline should be failing the way it’s currently set up.
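
If I did go the resource_group route, I believe it would look something like the following (just a sketch; the group name gradle-cache is an arbitrary placeholder):

gradle_dependencies:
    stage: Dependencies
    # Jobs that share this resource_group never run at the same time,
    # even across pipelines, so the cache can't be raced
    resource_group: gradle-cache
    cache:
        <<: *global_cache
        policy: push
    script:
        - gradle customTask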

I see another post similar to this: When two pipelines are executed simultaneously, do they access the cache concurrently? - #2 by dnsmichi

If jobs run concurrently, they might download the same cache, but later override the cache upload, depending which job finishes first. To prevent that, you can use resource_groups to lock the job, not being run in parallel. Resource group | GitLab

Alternatively, configure a cache key that is not global, but per branch. Caching in GitLab CI/CD | GitLab
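
If I understand the second suggestion correctly, it means dropping the commit SHA from the key so that every pipeline on a branch shares (and overwrites) a single cache, roughly:

cache: &global_cache
    # Keyed on the branch only; re-runs and later pipelines on the
    # same branch reuse one cache instead of one per commit
    key: "cache-$CI_COMMIT_REF_SLUG"
    paths:
        - ".gradle/caches"
    policy: pull-push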

But I feel like the cache key I already have should avoid that problem, shouldn’t it?

Versions

Self-hosted runner: 17.0.0 on Ubuntu 24.04

We were able to figure out the underlying problem: it is essentially a problem of concurrent runners.

This project was using a shared GitLab Runner that was configured to execute up to 6 concurrent jobs at a time. Each concurrency slot ends up with its own local cache, such as the following:

  • runner-_______-project-_____-concurrent-0
  • runner-_______-project-_____-concurrent-1
  • runner-_______-project-_____-concurrent-2
  • runner-_______-project-_____-concurrent-3
  • runner-_______-project-_____-concurrent-4
  • runner-_______-project-_____-concurrent-5
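
For context, the relevant part of the shared runner’s config.toml looked roughly like this (name, url, and executor are placeholders):

# Global setting in the shared runner's config.toml (sketch)
concurrent = 6    # up to 6 jobs at once, each assigned a slot concurrent-0 .. concurrent-5

[[runners]]
  name = "shared-runner"               # placeholder
  url = "https://gitlab.example.com"   # placeholder
  executor = "docker"                  # placeholder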

When kicking off Pipeline 1, it would create its cache under the concurrent-0 directory. We would then kick off Pipeline 2 at the same time (or before Pipeline 1’s next stage), and it would start on concurrent-1 and place its cache in that separate directory.

By the time Pipeline 1 got to its next stage, that job would land on a different concurrency slot (e.g. concurrent-1 or concurrent-2), try to access the cache, and find nothing, because the cache lived under a completely different slot.

From what I found, a truly shared cache across concurrent runners is only available through distributed caching, or by setting up some kind of shared directory that every runner points at as a single cache location.

I’m not sure why a single shared cache folder with a Shared = true setting isn’t an option here, to avoid this weird edge case.
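
For reference, the distributed-cache option is configured in the runner’s config.toml under [runners.cache]; a rough sketch with a placeholder endpoint and credentials would be:

[[runners]]
  # ...existing runner settings...
  [runners.cache]
    Type = "s3"
    Shared = true    # let runners/projects share the same cache storage
    [runners.cache.s3]
      ServerAddress = "minio.example.com"   # placeholder object-storage endpoint
      AccessKey = "REPLACE_ME"              # placeholder
      SecretKey = "REPLACE_ME"              # placeholder
      BucketName = "runner-cache"           # placeholder
      Insecure = false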

Our solution, unfortunately, was to set up a dedicated project runner for this project and set limit = 1 for it in config.toml, as pointed out here. If we had instead used the group runner with limit = 1 and there was another project occupying concurrent-0, we could still run into a similar issue.
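
Concretely, the dedicated project runner’s entry in config.toml ended up looking roughly like this (name, url, and executor are placeholders):

[[runners]]
  name = "project-runner"              # placeholder
  url = "https://gitlab.example.com"   # placeholder
  executor = "docker"                  # placeholder
  limit = 1    # this registration only ever takes one job at a time,
               # so every job for this project lands on concurrent-0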

We verified the fix with the following steps, which used to fail in the past but now work since everything runs on a single runner slot:

  1. I made pipeline #1 (ran on runner-concurrent-0)
  2. Waited for it to start building the Dependencies stage
  3. Made a new pipeline, #2 (ran on runner-concurrent-0)
  4. It was paused until the first pipeline finished (no concurrency)
  5. Once pipeline 2 started, I triggered a re-run of the Lint stage on pipeline 1, which fetches the cache
  6. When pipeline 2 finished, the Lint stage re-ran and successfully fetched the appropriate cache from the original pipeline 1 (ran on runner-concurrent-0)

Are we missing something here that would make caching more consistent across concurrent jobs? I’m basically aiming to have the same cache shared between each stage of a pipeline, but kept separate per branch. The first stage’s cache gets used further down the pipeline for manual building and deployment, which does not happen automatically.
