Question of understanding caching and file exchange between stages

Hi @all,

I am new to GitLab and experimenting with GitLab CI. Currently I am confused about how file exchange between stages works, and I didn’t find any explanation about it. As described in Best practices | GitLab, each runner has its own working directory inside the build directory. It seems that all created files are kept in the build directory and remain there for the following stage(s).

For a better understanding, I’ve created the following gitlab-ci file. The first stage contains two jobs that prepare some files:
The first job

  • creates a file “localfile.txt” - this file must exist in the following stage
  • creates a “job.txt” file with content “1”

The second job

  • creates a file “localfile2.txt” - this file must exist in the following stage
  • creates a “job.txt” file with content “2” (which one wins?)

The second stage shows the results (the content of the working directory and of the job.txt file).

To be sure there is really no interaction between jobs in the same stage, I use “>>” to append content to the job.txt file; if there were an interaction, the file would contain both values instead of being overwritten.

.gitlab-ci.yml

stages:
    - build image
    - build
    - deploy

build:
    stage: build
    before_script:
        - sleep 10 # Simulate that this job takes longer than the second job
    script:
        - date
        - "echo 1 > localfile.txt"
        - "echo 1 >> job.txt"

build2:
    stage: build
    script:
        - date
        - "echo 2 > localfile2.txt"
        - "echo 2 >> job.txt"

deploy:
    stage: deploy
    only:
        - build-deploy-example
    before_script:
        - ls -lhA
    script:
        - test -f job.txt && cat job.txt

After running the pipeline, the content of job.txt is “2”, although the first job in the build stage finishes later:

Job "build"

$ date
Wed Dec 8 04:51:59 UTC 2021
$ echo 1 > localfile.txt
$ echo 1 >> job.txt

Job "build2"

$ date
Wed Dec 8 04:51:53 UTC 2021
$ echo 2 > localfile.txt
$ echo 2 >> job.txt

Job "deploy"

$ test -f localfile.txt && cat localfile.txt
2
$ test -f job.txt && cat job.txt
2

Since the jobs of a stage can run in parallel and normally nobody knows which job finishes first, it should actually be random which number ends up in the job.txt file. But since this is not the case (every time I run the pipeline, the file contains “2”), I conclude that the order in the gitlab-ci file plays a role?

I realize that this is not a sensible or reliable approach and that it should not be implemented this way in practice. But for understanding, it certainly doesn’t hurt to know how the result comes about.

Therefore, in my opinion, caching is only interesting if “git clean” is called in the before_script. Is “git clean” also important to get a clean working (build) directory?

Regards
Mathias

Hi @bytecounter

In terms of your mental model of how pipelines work, I wouldn’t worry about things like “build directories”, which are only important if you want to do some complicated moving of files in a deployment (and usually not even then). I would think of a pipeline as automating the actions of a developer – the pipeline clones the repo, cds to the root directory of the repo, runs some commands, and exits. Any files you need to worry about when building your config will be relative to the repo root, and you can consider the rest of the filesystem to be an abstraction that’s mostly irrelevant to you.

Your issue with passing files between stages is to do with the difference between caching and artifacts.

Caching should be used to speed up your pipelines by saving the results of operations that manage dependencies. So, you might want to cache a vendor directory, or similar. The cache is restored automatically before each job runs, so you can define what needs to be cached and assume the caching happens in the background.
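For instance, a cache for a Composer-managed vendor directory might look like this (the cache key, paths, and job are illustrative, not taken from your config):

```yaml
# Illustrative only: cache a Composer vendor/ directory between pipelines.
cache:
  key:
    files:
      - composer.lock      # reuse the same cache until the lockfile changes
  paths:
    - vendor/

build:
  stage: build
  script:
    - composer install     # fast when vendor/ is restored from the cache
```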

Artifacts are used to pass files that are generated by the pipeline between pipeline stages, and the paths that define your artifacts are relative to the root of the repo.

For example:

stages:
    - stage1
    - stage2

job1:
    stage: stage1
    script:
        - "echo 2 >> job.txt"
    artifacts:
        paths:
            - job.txt

job2:
    stage: stage2
    script:
        - cat job.txt

Hi @snim2,

thank you for the detailed explanation! I noticed that my build contains some files from previous stages that I no longer need (e.g. PHPUnit test coverage files). I want to be able to download these via GitLab, but not have them in the production build. Also, my build job aborts when packing with tar; the error message says that a file changed while it was being read.
That’s why I started to study the behaviour and how the questions come up. :wink:
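What I have in mind is roughly this (job names, scripts, and paths below are made up for illustration): publish the coverage files as a downloadable artifact of the test job, and let the packaging job skip artifacts it does not need via dependencies:

```yaml
# Sketch only - job names, scripts, and paths are hypothetical.
test:
  stage: test
  script:
    - ./run-tests.sh           # assumed to produce coverage/
  artifacts:
    paths:
      - coverage/              # downloadable via GitLab, not part of the build

package:
  stage: deploy
  dependencies: []             # download no artifacts from earlier stages
  script:
    - tar -czf build.tar.gz src/
```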

Your example contains the artifacts keyword, which keeps job.txt explicitly. But even if you don’t use it, the “cat job.txt” in job2 will also work. “artifacts” makes the file available for download for a certain time (expire_in).
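To make that explicit, your example could be extended like this (the expire_in value is just an example):

```yaml
job1:
    stage: stage1
    script:
        - "echo 2 >> job.txt"
    artifacts:
        paths:
            - job.txt
        expire_in: 1 week    # artifact is deleted after this period
```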

For me it is important to do things explicitly :upside_down_face: