GitLab shared runner gets stuck in pytest job
We're running our test and build pipelines on GitLab.com using the shared runners.
We have an integration test job defined as follows.
What it does:
- Set up PostgreSQL and MongoDB (with a replica set)
- Upgrade the SQL schema using Alembic migrations
- Run the tests:
py.test --cov=src \
    --cov-report xml:coverage.xml \
    --cov-report html:coverage.html \
    --cov-branch \
    --verbose \
    -m "not plot and not readonly" \
    tests/server
server_test:
  image: $CI_REGISTRY_IMAGE:$TAG_NAME
  stage: test
  services:
    - postgres:latest
    - redis:latest
    - name: mongo:5.0.13
      command: ["--bind_ip_all", "--replSet", "rs0"]
  variables:
    MONGO_HOST: mongo
    # Explicitly empty, because no auth is used in gitlab-ci environment
    MONGO_USER: ""
    MONGO_PASSWORD: ""
    MONGO_PROJECT_DB: "project_db_test"
    DB_HOST: postgres
    DB_USER: $POSTGRES_USER
    DB_PASSWORD: $POSTGRES_PASSWORD
    DB_DATABASE: $POSTGRES_DB
  before_script:
    - apt-get update
    # Hack to fix time-zone specification required:
    # https://grigorkh.medium.com/fix-tzdata-hangs-docker-image-build-cdb52cc3360d
    - export TZ=Europe/Berlin
    - ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
    - apt-get update && apt-get install -y gnupg2
    - wget -qO - https://www.mongodb.org/static/pgp/server-4.4.asc | apt-key add -
    - echo "deb http://repo.mongodb.org/apt/debian buster/mongodb-org/4.4 main" | tee /etc/apt/sources.list.d/mongodb-org-4.4.list
    - apt-get update
    - apt-get update && apt-get install -y gnupg2
    - apt-get install -y mongodb-org
    - mongo --host $MONGO_HOST --eval 'rs.initiate()'
  script:
    # Reset DB schema for assembly-db
    - cd src/md_core/db && alembic upgrade head && cd ../../..
    - celery -A md_core.worker.celery_app worker --loglevel=INFO --detach
    - sh scripts/run_server_tests.sh
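Since the job depends on three service containers, one thing that could be ruled out is a test starting before a service is reachable. The following is only a sketch of such a check: `wait_for_port` is a hypothetical helper, not something in our repository, and the ports in the usage note are the services' default ports, which I am assuming apply here.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while the port is closed
            # or the host name cannot be resolved yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

It could be called as `wait_for_port("postgres", 5432)` or `wait_for_port("mongo", 27017)` before the test run. An open port only shows that the service accepts connections; it does not prove that the MongoDB replica set has finished electing a primary.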
This works most of the time. However, every once in a while we add an integration test that causes the job to freeze on the shared runner.
The job proceeds normally up to some point, then the log output stops and the job sits there until the project timeout limit (2h) is reached. Normally, the job takes ~15 minutes.
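To see where the job is sitting when the log stops, one option would be to have Python dump the stack of every thread once a test runs far longer than expected. Below is a minimal sketch using the standard-library `faulthandler` module. The `hang_watchdog` name and the 600-second threshold are assumptions made for illustration, not part of our current setup.

```python
import faulthandler
import sys
from contextlib import contextmanager

# Assumed threshold: well above a healthy test, far below the 2h job timeout.
HANG_TIMEOUT_SECONDS = 600


@contextmanager
def hang_watchdog(timeout=HANG_TIMEOUT_SECONDS, stream=sys.stderr):
    """Dump the tracebacks of all threads to `stream` if the body takes
    longer than `timeout` seconds. The process keeps running afterwards."""
    faulthandler.dump_traceback_later(timeout, repeat=False, file=stream, exit=False)
    try:
        yield
    finally:
        # Disarm the timer once the guarded block has finished.
        faulthandler.cancel_dump_traceback_later()
```

Wrapping each test in `with hang_watchdog(): ...` (for example from an autouse fixture in `conftest.py`) should print the stuck frame into the job log. pytest also ships a `faulthandler_timeout` ini option that does something similar without extra code.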
Locally (both in a native environment and using gitlab-runner exec docker) the job succeeds, so it does not seem to be a problem in the test itself (e.g. an infinite loop).
The only way to get the pipeline through is to not add the specific test. Then for a while we can continue adding different tests without a problem.
I have not been able to find a pattern in the kinds of tests that lead to the pipeline being stuck.
Our Dockerfile looks like this:
FROM condaforge/mambaforge:latest
RUN mkdir /app
# Install conda env
COPY environment.yaml /app/environment.yaml
RUN mamba env create -f /app/environment.yaml \
    && mamba clean -ay
RUN echo "source activate modugen-core-2" > ~/.bashrc
ENV PATH /opt/conda/envs/modugen-core-2/bin:$PATH
COPY . /app
RUN pip install --no-build --no-deps -e /app/
WORKDIR /app
RUN /bin/bash -c "source activate modugen-core-2"
ENTRYPOINT
I looked at things like:
- Shell runner freezes on long running job (#4285) · Issues · GitLab.org / gitlab-runner · GitLab
- Local GitLab runner freezes while Shared GitLab.com runner succeeds
- or most promising: Shell runner freezes on long running job (#4285) · Issues · GitLab.org / gitlab-runner · GitLab
But none seemed to contain reproducible behaviour or fixes that seemed applicable to our setup.
Since it is very hard, if not impossible, to reproduce, I suspect it might be related to the resources on the runner, for example running out of memory.
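To check the memory suspicion, the test process could log its peak memory after each test, so the job log shows whether usage climbs before the freeze. This is a rough sketch using the standard-library `resource` module (Unix only); the function names and the log format are made up for illustration.

```python
import resource
import sys


def peak_rss_mib():
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def format_memory_line(test_id):
    """Build a log line tying a test ID to the peak memory seen so far."""
    return f"[mem] {test_id}: peak RSS {peak_rss_mib():.1f} MiB"
```

Printing `format_memory_line(...)` from a fixture after each test would give one line per test. This only covers the pytest process itself; the detached Celery worker and the service containers would need to be measured separately.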