GitLab shared runner gets stuck in pytest job
Hey all,
we’re running our test and build pipelines on GitLab.com using the shared runners.
We have an integration test job defined as follows.
What it does:
- Set up PostgreSQL and MongoDB (with a replica set)
- Upgrade the SQL schema using Alembic migrations
- Run the tests:
py.test --cov=src \
  --cov-report xml:coverage.xml \
  --cov-report html:coverage.html \
  --cov-branch \
  --verbose \
  -m "not plot and not readonly" \
  tests/server
server_test:
  image: $CI_REGISTRY_IMAGE:$TAG_NAME
  stage: test
  services:
    - postgres:latest
    - redis:latest
    - name: mongo:5.0.13
      command: ["--bind_ip_all", "--replSet", "rs0"]
  variables:
    MONGO_HOST: mongo
    # Explicitly empty, because no auth is used in gitlab-ci environment
    MONGO_USER: ""
    MONGO_PASSWORD: ""
    MONGO_PROJECT_DB: "project_db_test"
    DB_HOST: postgres
    DB_USER: $POSTGRES_USER
    DB_PASSWORD: $POSTGRES_PASSWORD
    DB_DATABASE: $POSTGRES_DB
  before_script:
    - apt-get update
    # Hack to fix time-zone specification required:
    # https://grigorkh.medium.com/fix-tzdata-hangs-docker-image-build-cdb52cc3360d
    - export TZ=Europe/Berlin
    - ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
    - apt-get update && apt-get install -y gnupg2
    - wget -qO - https://www.mongodb.org/static/pgp/server-4.4.asc | apt-key add -
    - echo "deb http://repo.mongodb.org/apt/debian buster/mongodb-org/4.4 main" | tee /etc/apt/sources.list.d/mongodb-org-4.4.list
    - apt-get update
    - apt-get update && apt-get install -y gnupg2
    - apt-get install -y mongodb-org
    - mongo --host $MONGO_HOST --eval 'rs.initiate()'
  script:
    # Reset DB schema for assembly-db
    - cd src/md_core/db && alembic upgrade head && cd ../../..
    - celery -A md_core.worker.celery_app worker --loglevel=INFO --detach
    - sh scripts/run_server_tests.sh
This works most of the time. However, every once in a while we add an integration test that causes the job to freeze on the shared runner.
The job proceeds as normal up to some point, then the log output stops and the job stays stuck there until the project timeout limit (2h) is reached. Normally, the job takes ~15 minutes.
Locally (both in a native environment and using gitlab-runner exec docker) the job succeeds, so the cause does not seem to be a problem in the test itself (e.g. infinite loop, …).
The only way to get the pipeline through is to not add the specific test. Then for a while we can continue adding different tests without a problem.
I have not been able to find a pattern in the kinds of tests that lead to the pipeline being stuck.
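One way to see where the job is stuck when the log stops would be to have Python dump the tracebacks of all threads at a fixed interval. The following is only a minimal sketch using the standard-library faulthandler module; the helper names and the 20-minute interval are assumptions chosen for illustration (the interval just needs to exceed the normal ~15 minute runtime), not part of our actual setup.

```python
import faulthandler
import sys

# Assumption: 20 minutes is safely above the normal ~15 minute job runtime.
HANG_DUMP_INTERVAL_S = 20 * 60


def arm_hang_dump(interval_s: float = HANG_DUMP_INTERVAL_S, stream=sys.stderr) -> None:
    """Dump the tracebacks of all threads to `stream` every `interval_s` seconds.

    `stream` must be a real file (it needs a file descriptor), because
    faulthandler writes to the descriptor directly.
    """
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=stream)


def disarm_hang_dump() -> None:
    """Cancel the periodic dump, e.g. once the test session has finished."""
    faulthandler.cancel_dump_traceback_later()
```

These helpers could be called from pytest_configure and pytest_unconfigure in a conftest.py. Since pytest may capture stderr, its built-in faulthandler_timeout ini option might be the simpler route. Either way, this only shows where the pytest process is waiting, not what the detached celery worker or the service containers are doing.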
Our Dockerfile looks like this:
FROM condaforge/mambaforge:latest
RUN mkdir /app
# Install conda env
COPY environment.yaml /app/environment.yaml
RUN mamba env create -f /app/environment.yaml \
    && mamba clean -ay
RUN echo "source activate modugen-core-2" > ~/.bashrc
ENV PATH /opt/conda/envs/modugen-core-2/bin:$PATH
COPY . /app
RUN pip install --no-build --no-deps -e /app/
WORKDIR /app
RUN /bin/bash -c "source activate modugen-core-2"
ENTRYPOINT []
My search
I looked at things like:
- Shell runner freezes on long running job (#4285) · Issues · GitLab.org / gitlab-runner · GitLab
- Local GitLab runner freezes while Shared GitLab.com runner succeeds
- or most promising: Shell runner freezes on long running job (#4285) · Issues · GitLab.org / gitlab-runner · GitLab
But none seemed to contain reproducible behaviour or fixes that seemed applicable to our setup.
My Hunch
Since the problem is very hard, if not impossible, to reproduce, I suspect it is related to the resources available on the runner, for example running out of memory.
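To check the memory hunch, the peak memory of the test process could be printed to the job log after each test, so that the last line before a freeze shows how much memory was in use. This is a rough sketch using the standard-library resource module (Unix only); the function names and the log format are made up for illustration.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return the peak resident set size of the current process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_peak_rss(label: str, stream=sys.stderr) -> None:
    """Write one flushed line with the current peak RSS to `stream`."""
    print(f"[mem] {label}: peak RSS {peak_rss_mib():.1f} MiB", file=stream, flush=True)
```

log_peak_rss could be called from an autouse fixture in conftest.py after every test. It only covers the pytest process itself; the celery worker and the postgres, redis and mongo services would have to be checked separately.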