Using cache for pip/npm dependencies in Gitlab CI

mlorant · May 18, 2020, 8:55am

Hi,

We’ve been using Gitlab.com (not self-managed) for the last few weeks. We want to use the shared runners to execute our CI, and I managed to set up a config that runs our existing test suite.

The main stage passes, but it takes about 22 minutes compared to 10-12 minutes on our legacy CI, for one main reason: PyPI and npm packages are downloaded and re-installed/compiled in every pipeline, which takes minutes (definitely most of the 10 extra minutes, maybe all of them).

Our .gitlab-ci.yml looks like this right now. Sorry for the long paste, but I prefer to give as much context as possible:

image: "python:3.7-alpine"

variables:
  [... some db/tokens variables...]
  # Set pip's cache inside the project directory since we can only cache local items
  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"

stages:
  - test
  - coverage

cache:
  key: pip-and-npm-global-cache
  paths:
    - $CI_PROJECT_DIR/.cache/pip
    - $CI_PROJECT_DIR/.cache/npm

before_script:
  - mkdir -p $CI_PROJECT_DIR/.cache/pip $CI_PROJECT_DIR/.cache/npm

django_tests:
  stage: test

  services:
    - postgres:9.6-alpine
    - mongo:3.6-xenial

  cache:
    key: "coverage-$CI_COMMIT_REF_SLUG"
    paths:
      - .coverage

  script:
    # Various packages required to run dependencies below
    - apk add [...]
    - pip install pip --upgrade
    - pip install -r requirements.txt
    - coverage ...  # execute tests here

js_tests:
  stage: test
  image: "node:alpine"
  cache:
    key: "$CI_COMMIT_REF_SLUG"
    paths:
      - node_modules/

  script:
    - npm ci --cache $CI_PROJECT_DIR/.cache/npm --prefer-offline
    - npm install && npm run build
    - npm run test

coverage:
  stage: coverage
  script:
    - pip install coverage==4.5.3 django_coverage_plugin==1.6.0
    - coverage report -i -m [...]

First, the “test” stage always re-installs the packages, even between two builds on the same branch without any new commits. The stage passes though (as said before), but the coverage one doesn’t, because some pip requirements installed earlier are no longer available.
I have the same problem with a local runner on my machine and with the shared runners of Gitlab.com.

I added some ls calls to the script, and it seems $CI_PROJECT_DIR/.cache is always empty at the start of a job (django_tests and coverage). Did I miss something? Does any of my cache declarations overlap another one?

mlorant · June 8, 2020, 1:15pm

Allow me to resurrect my question. Has anyone already encountered this problem? Does anyone have a working config example for pip?

luciojb · July 6, 2020, 7:08pm

Same thing here. I searched every corner of the internet for someone with a working example or application. I have a successful case with Maven, but for Python it’s a different template, and I have tried everything within my knowledge without any success.

adietrich · July 23, 2020, 5:11pm

Hi @mlorant and @luciojb,

the Python example from the official documentation has worked pretty well for me in the past:
https://docs.gitlab.com/ee/ci/caching/#caching-python-dependencies

My understanding of the cache configuration is that job-level directives override global ones, so each of your jobs seems to be using a different cache, and only the last job is using the global one. If all jobs are supposed to use the same cache, try using only a global cache configuration and see if that helps.
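
To illustrate the override behaviour described above, here is a minimal sketch based on the config from the first post. It assumes PIP_CACHE_DIR still points at $CI_PROJECT_DIR/.cache/pip as in that config; the key name "django-$CI_COMMIT_REF_SLUG" is made up for the example.

```yaml
# Global default: only used by jobs that do not define their own cache.
cache:
  key: pip-and-npm-global-cache
  paths:
    - .cache/pip
    - .cache/npm

django_tests:
  stage: test
  # A job-level cache replaces the global block; the two are not merged.
  # To keep the pip cache in this job, its path has to be listed again.
  cache:
    key: "django-$CI_COMMIT_REF_SLUG"   # hypothetical key name
    paths:
      - .cache/pip
      - .coverage
  script:
    - pip install -r requirements.txt
```

With a job-level block that lists only .coverage, as in the original config, .cache/pip would be neither restored nor saved for that job.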

One thing I would like to point out about the Python example above is that it caches the venv directory it installs packages to in addition to the Pip package cache. This should prevent jobs from re-installing the same packages every time.
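
A rough sketch of that approach, adapted from memory of the documentation example rather than copied from it. The image tag, the cache key and the requirements.txt file name are assumptions; adjust them to the project.

```yaml
image: python:3.7

variables:
  # pip's download cache must live inside the project directory to be cacheable
  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"

# Single global cache, with no job-level cache blocks overriding it
cache:
  key: "$CI_COMMIT_REF_SLUG"   # assumed key: one cache per branch
  paths:
    - .cache/pip   # downloaded packages and built wheels
    - venv/        # the installed packages themselves

before_script:
  - python -m venv venv        # reuses the restored directory if the cache was hit
  - . venv/bin/activate
  - pip install -r requirements.txt
```

The pip cache alone only saves downloads and wheel builds; the venv directory is what lets a later job find the packages already installed.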

Kind regards,
Alexander

fleXible · September 20, 2021, 5:40pm

Hi @mlorant,
I used to have the very same problem until a couple of weeks ago, when I finally found a webpage that sets it up the right way.
First of all, keep in mind that when a job creates a cache, even with the broadest matching key, it stays local to the runner it was created on; with a bit of luck, a future job might pull it. Watch out: the default policy is pull-push, so every job can rewrite the cache and potentially wipe content you want to preserve. In the case of “pip-and-npm-global-cache”, the npm and Python jobs are trashing each other’s content. The most important thing I did was to set up a MinIO container to provide a shared cache. Next, I changed the policy to pull and had the first job take care of filling the pip cache and installing the venv. The whole thing works best, though, when the venv is saved as an artifact!

variables:
  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"
  XDG_CACHE_HOME: "$CI_PROJECT_DIR/.cache"

image: python:3.8-slim-buster

## These paths are cached between test runs, which saves download time.
cache: &package_cache
  key:
    files:
      - poetry.lock
    prefix: poetry
  paths:
    - .cache
  policy: pull

stages:
  - Prepare
  - Static Analysis

before_script:
  - apt-get update &&
      apt-get install -qqy --no-install-recommends --no-install-suggests
        make
  - pip install poetry

dependencies:
  stage: Prepare
  cache:
    <<: *package_cache
    policy: pull-push
  script:
    # Create the venv at .venv so it matches the artifact paths below
    - python -m venv --copies .venv
    - source .venv/bin/activate
    - python -m pip install --upgrade pip
    - poetry export --without-hashes -n |
        tee requirements.txt
    - poetry export --dev --without-hashes -n |
        tee requirements-dev.txt |
        pip install -r /dev/stdin
    # pip cache filled with all package downloads
    - poetry build
  artifacts:
    paths:
      - requirements.txt
      - requirements-dev.txt
      - dist/
      - .venv/
    exclude:
      - .venv/**/__pycache__/*
    when: on_success
    # venv gets extracted for each job and is immutable

# The uncompromising Python code formatter
black:
  stage: Static Analysis
  script:
    - make lint-test

# Simple and scalable tests for Python code
pytest:
  stage: Static Analysis
  script:
    - poetry install
    - make test
  artifacts:
    when: on_success
    reports:
      cobertura: coverage.xml
      junit: report.xml
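
As a side note on the “pip-and-npm-global-cache” clash mentioned above, one possible way to stop the Python and npm jobs from overwriting each other is to give each job its own key. This is only a sketch: the job names are hypothetical, and it assumes requirements.txt and package-lock.json exist in the repository.

```yaml
python_job:                    # hypothetical job name
  image: python:3.8-slim-buster
  variables:
    PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"
  cache:
    key:
      files:
        - requirements.txt     # new cache whenever the requirements change
      prefix: pip
    paths:
      - .cache/pip
  script:
    - pip install -r requirements.txt

node_job:                      # hypothetical job name
  image: node:alpine
  cache:
    key:
      files:
        - package-lock.json    # new cache whenever the lock file changes
      prefix: npm
    paths:
      - .cache/npm
  script:
    - npm ci --cache .cache/npm --prefer-offline
```

Because the keys differ, a pull-push in one job never replaces the archive the other job relies on.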

I found the webpage again that gave me the idea: GitLab CI: Cache and Artifacts explained by example

I hope that helps

vdespa · January 30, 2025, 8:35am

I came across this old post while researching a problem with a misconfigured Python/pip cache.

Based on the experiments I’ve run, I’ve noticed that using PIP_CACHE_DIR to set the cache dir in the project directory and caching “.cache/pip” does not actually bring any performance gains (which is counter-intuitive, I know!).

I’ve documented my findings in this longer blog post: How to Cache Python Dependencies in GitLab CI/CD
