GitLab shared runner gets stuck in pytest job

Hey all,

we’re running our test and build pipelines on GitLab using the shared runners.

We have an integration test job defined as follows.

What it does:

  • Set up PostgreSQL and MongoDB (with a replica set)
  • Upgrade the SQL schema using Alembic migrations
  • Run the tests:

    py.test --cov=src \
      --cov-report xml:coverage.xml \
      --cov-report html:coverage.html \
      --cov-branch \
      --verbose \
      -m "not plot and not readonly" \
      tests/server
```yaml
  stage: test
  services:
    - postgres:latest
    - redis:latest
    - name: mongo:5.0.13
      command: ["--bind_ip_all", "--replSet", "rs0"]
  variables:
    MONGO_HOST: mongo
    # Explicitly empty, because no auth is used in gitlab-ci environment
    MONGO_USER: ""
    MONGO_PROJECT_DB: "project_db_test"
    DB_HOST: postgres
  script:
    - apt-get update
    # Hack to fix time-zone specification required:
    - export TZ=Europe/Berlin
    - ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
    - apt-get update && apt-get install -y gnupg2
    - wget -qO - | apt-key add -
    - echo "deb buster/mongodb-org/4.4 main" | tee /etc/apt/sources.list.d/mongodb-org-4.4.list
    - apt-get update
    - apt-get install -y mongodb-org
    - mongo --host $MONGO_HOST --eval 'rs.initiate()'
    # Reset DB schema for assembly-db
    - cd src/md_core/db && alembic upgrade head && cd ../../..
    - celery -A md_core.worker.celery_app worker --loglevel=INFO --detach
    - sh scripts/
```
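As an aside on this setup: `rs.initiate()` returns before the replica set has actually elected a primary, so code that connects immediately afterwards can block. A sketch of a wait loop that could go into the `script:` section right after the initiate call (the `$MONGO_HOST` variable is from the job above; the 30×2 s retry budget is an arbitrary choice):

```yaml
    # Wait until the replica set has elected a primary before continuing;
    # db.isMaster().ismaster flips to true once election is done
    - |
      for i in $(seq 1 30); do
        mongo --host $MONGO_HOST --quiet --eval 'db.isMaster().ismaster' | grep -q true && break
        sleep 2
      done
```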

This works most of the time. However, every once in a while we add an integration test that causes the shared runner to freeze the job: it proceeds normally up to some point, then the log output stops and the runner hangs until the project timeout limit (2 h) is reached. Normally the job takes ~15 minutes.

Locally (both in a native environment and using `gitlab-runner exec docker`) the job succeeds, so the problem is not in the test itself (e.g. an infinite loop, …).

The only way to get the pipeline through is to not add the specific test. Then for a while we can continue adding different tests without a problem.

I have not been able to find a pattern in the kinds of tests that lead to the pipeline being stuck.
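One way to turn such a silent hang into a diagnosable failure is the `pytest-timeout` plugin — a sketch, assuming the plugin can be installed in the job image; the 600 s per-test limit is an arbitrary choice:

```yaml
    # With --timeout-method=thread, pytest-timeout dumps the stack traces of
    # all threads when a single test exceeds the limit and then aborts the
    # run, instead of letting the job hang until the 2 h project timeout
    - pip install pytest-timeout
    - py.test --timeout=600 --timeout-method=thread -m "not plot and not readonly" tests/server
```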

Our Dockerfile looks like this:

```dockerfile
FROM condaforge/mambaforge:latest

RUN mkdir /app

# Install conda env
COPY environment.yaml /app/environment.yaml
RUN mamba env create -f /app/environment.yaml \
 && mamba clean -ay
RUN echo "source activate modugen-core-2" > ~/.bashrc
ENV PATH /opt/conda/envs/modugen-core-2/bin:$PATH

COPY . /app
RUN pip install --no-build-isolation --no-deps -e /app/

# Note: this only activates the env for this one RUN layer
RUN /bin/bash -c "source activate modugen-core-2"
```

My Search

I looked at things like:

But none seemed to contain reproducible behaviour or fixes that seemed applicable to our setup.

My Hunch

Since it is very hard, if not impossible, to reproduce, my hunch is that it has something to do with resources on the runner — running out of memory or something similar.
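One cheap way to test this hunch would be to log memory usage in the background during the job — a sketch for the `script:` section, assuming `free` is available in the Debian-based job image (the 10-second interval is arbitrary):

```yaml
    # Log memory usage every 10 s in the background; the last lines printed
    # before a freeze would show whether the runner was close to out of RAM
    - (while true; do free -m; sleep 10; done) &
```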