Is it possible to have a kind of persistent Docker image for GitLab-CI?

Thanks to GitLab Pages, I’m building a (French LaTeX) FAQ with Sphinx-doc on a gitlab.com instance. Here is my repo and here is the relevant part of my .gitlab-ci.yml file:

pages:
  dependencies: []  # avoid accidentally caching artifacts
  cache:
    untracked: true
  stage: deploy
  script:
  - pip3 install --verbose --upgrade pip
  - pip3 install --verbose --force-reinstall "sphinx==6.2.1"
  - pip3 install --verbose sphinx_design "sphinx==6.2.1"
  - pip3 install --verbose --force-reinstall "numpy==1.19.4"
  - pip3 install --verbose sphinxext.opengraph "sphinx==6.2.1"
  - pip3 install --verbose sphinx_comments "sphinx==6.2.1"
  - pip3 install --verbose linkify-it-py "sphinx==6.2.1"
  - pip3 install --verbose furo "sphinx==6.2.1"
  - pip3 install --verbose myst-parser "sphinx==6.2.1"
  - sphinx-build --version
  - sphinx-build source build/html
  - cp -rf build/html public
  artifacts:
    paths:
    - build/
    - public
  only:
  - master

The problem is that, even if only a single comma is changed in a single Sphinx (.md) source file (out of 1229), sphinx-build source build/html (run in the Docker image) rebuilds all 1229 HTML pages, and this takes too long (more than 15 minutes). By contrast, on my local machine, if only a single source file is changed, sphinx-build source build/html (run in the terminal, without Docker) rebuilds only the corresponding HTML page, and this takes only a few seconds.

As you can see above, I tried to make use of cache and artifacts (quite blindly: I don’t fully understand these features), but with no success.

Is there a way, on GitLab, to deal with some “permanent” Docker image that would be aware that not all the HTML pages have to be rebuilt each time a commit happens?

You are going in the right direction (almost); it should work by simply adding the Sphinx cache directory to the job’s cache. This should work:

pages:
  cache:
    key: pages-sphinx # optional just in case you have different jobs in the pipeline
    paths:
    - .doctrees
  dependencies: []
  stage: deploy
  script:
  - pip3 install --verbose --upgrade pip
  - pip3 install --verbose --force-reinstall "sphinx==6.2.1"
  - pip3 install --verbose sphinx_design "sphinx==6.2.1"
  - pip3 install --verbose --force-reinstall "numpy==1.19.4"
  - pip3 install --verbose sphinxext.opengraph "sphinx==6.2.1"
  - pip3 install --verbose sphinx_comments "sphinx==6.2.1"
  - pip3 install --verbose linkify-it-py "sphinx==6.2.1"
  - pip3 install --verbose furo "sphinx==6.2.1"
  - pip3 install --verbose myst-parser "sphinx==6.2.1"
  - sphinx-build --version
  - sphinx-build source build/html
  - cp -rf build/html public
  only:
  - master

Thanks for your answer. Nevertheless, with your setup:

  1. the pages job succeeded but the pages:deploy one failed (“missing pages artifacts”);
  2. still all the HTML pages were regenerated.

pages:deploy failed because I forgot to copy the

  artifacts:
    paths:
    - public

Just add it back.

Try adding the -d parameter to sphinx-build, like this: sphinx-build -d .doctrees source build/html

@balonik But maybe the problem comes from a wrong location of the .doctrees folder. As you can see here, there isn’t such a folder at the top level of the project but only in the build subfolder. Does it mean I should replace:

paths:
    - .doctrees

by:

paths:
    - build/.doctrees

What is strange, though, is that both build/.doctrees and build/doctrees have remained unchanged for months, whereas several .md source files and generated HTML files have changed in the meantime.

So:

sphinx-build -d .doctrees source build/html

or:

sphinx-build -d build/.doctrees source build/html

?

I haven’t used Sphinx for a while, so you might be correct. If you also have it on your local workstation, just check where the .doctrees directory is and adjust the path in cache accordingly.

The .doctrees directory is where Sphinx normally stores its cache.
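
A minimal sketch of how the suggestions so far might fit together, assuming the doctrees directory turns out to be build/.doctrees (the location mentioned earlier in the thread). The point is that the cache path and the -d argument refer to the same directory. The pip3 steps are elided and stay as in the original job. This is an untested illustration, not a confirmed fix.

```yaml
pages:
  stage: deploy
  dependencies: []
  cache:
    key: pages-sphinx
    paths:
    - build/.doctrees   # assumed location; must match the -d argument below
  script:
  # ... pip3 install steps unchanged from the original job ...
  - sphinx-build --version
  - sphinx-build -d build/.doctrees source build/html
  - cp -rf build/html public
  artifacts:
    paths:
    - public            # needed by pages:deploy
  only:
  - master
```

If the doctrees are kept elsewhere, both the cache path and the -d argument would need to change together.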

The location of .doctrees, which can also be checked on the gitlab.com instance, is indeed build/.doctrees. But, as said above and as you can see with the previous link:

Isn’t it strange that build/.doctrees is several months behind build/html?

Good to know: thanks!

I don’t know whether anyone is running Sphinx locally, which would update the cache, and then pushing it to the Git repo. If you are running sphinx-build only in CI, the copy of the cache in the repository can’t really get updated.

sphinx-build is expected to be run only in CI. I guess .doctrees was last updated 7 months ago, when files were pushed from my local repo.

Too bad! Do you know why? Isn’t there a magic option to pass to sphinx-build in CI that would force the cache to be updated?

The cache will be updated in the jobs, but not pushed to the repository. I think it’s better to remove these directories from the repository so they don’t cause any conflicts.

Isn’t it possible to push this cache to the repository?

But if they are removed, there is no hope of regenerating only the HTML pages of the changed .md source files, is there?

You could push it to the repository, but I don’t see a reason for it. Unless someone needs to run it locally with cache.

You do not need to keep the cache in the repository; actually, you shouldn’t. It will work.
The first CI run will run without a cache; during this first run the cache will be created and stored internally by the GitLab Runners. All subsequent jobs will fetch this stored cache and use it, and each run will update it. You won’t see it in the repository, but it’s there.

After some time the cache might expire and be created again, so don’t be surprised if, after some months, one job runs longer.
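
The lifecycle described above corresponds to GitLab’s default cache policy, which can also be spelled out explicitly. A hedged fragment, assuming the cache key from earlier in the thread and build/.doctrees as the doctrees location; only the cache stanza is shown, and the rest of the job stays unchanged.

```yaml
pages:
  cache:
    key: pages-sphinx     # fixed key, so later pipelines look up the same cache
    paths:
    - build/.doctrees     # assumed doctrees location
    policy: pull-push     # the default: restore before the script, upload after it
```

With policy: pull instead, the job would only read the cache and never update it.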

OK, so I guess I should add the following to my .gitignore file:

.doctrees

What about the doctrees folder (without any leading dot)? Should it be ignored as well?

You can delete both from the repo and add them to .gitignore.

(Since I reached the minutes quota on gitlab.com, I had to use another instance of GitLab, here. As you can see, the repository no longer contains the build directory, so there is no chance of being disturbed by the cache directories :) )

Unfortunately, it doesn’t work either: the last modification, although tiny, triggers the regeneration of all the 1229 HTML pages.

If you are running self-hosted GitLab Runners, you need to ask your GitLab admin team whether the Runners have cache configured, and how. I see from the job logs that you most likely have local Runners on OKD (Kubernetes) and that cache is not configured. In that case the Runner needs to be configured for distributed cache.

This is pre-configured already on the GitLab SaaS Runners so it will work there.

Could you tell me where you see that?

For the Kubernetes executor, that means inside the Pod, which dies after the job is completed. So there is no cache.

(Unfortunately, I can’t investigate the GitLab instance any further because the minutes quota is still exhausted. In the meantime,) I compared the logs between two instances:

  1. the GitLab one,
  2. another one,

the latter also regenerating all HTML pages. I see, respectively:

  1. “Initialized empty Git repository in /builds/dbitouze/test-faq-fr/.git/”
  2. “Reinitialized existing Git repository in /builds/dbitouze/faq-latex-sous-sphinx-doc/.git/” and, 2 lines below, “Removing build/”.

How do I interpret this “Initialized empty Git repository” vs “Reinitialized existing Git repository” and could it have anything to do with my problem?