GitLab Runner can execute jobs with specified docker compose files in series, but not concurrently (Solved, it totally can)

The question:

We’re trying to become more modern with our CI/CD, and as part of that I’ve recently re-worked the testing for one of our applications to use docker compose. It’s a Ruby app with a test suite that takes about 15 minutes to run, and sometimes, if Selenium is having a sleepy day, a test fails intermittently and the whole rspec run needs to be repeated. To get around this, I wanted to run the tests concurrently in smaller batches - a job perfect for containerization. I wasn’t able to find much on the web about my problem, mostly because I’m not quite sure how to put it into words.

So far I have not optimised my workflow, but essentially I have a separate rspec run for every folder under ./spec in my application.

The plan is to have the .gitlab-ci.yml file execute, in the pipeline’s “test” stage, many different test jobs so that I only need to rerun the part that fails.


The way I have done this is to have a block in my .gitlab-ci.yml file like this:

"Specs - Acceptance":
  stage: test
  script:
    - docker compose -f tests/compose-acceptance.yaml build
    - docker compose -f tests/compose-acceptance.yaml up -d
    - docker compose -f tests/compose-acceptance.yaml exec -it rails "/bin/bash" -c "RAILS_ENV=test bundle exec rake db:migrate; RAILS_ENV=test CI=true bundle exec rspec 'spec/acceptance' --format=documentation"
    - docker compose -f tests/compose-acceptance.yaml down --volumes
  except:
    - tags
    - all-tests   

The gitlab-ci.yml is this verbose so that I can debug the issue; it was smaller and more petite before, back when I presumed this would work. The executor is “shell” because I don’t have this working reliably yet; if I can get it working, or if that’s what’s required to fully commit, then I’ll give that a go.

There is an earlier build stage that optionally builds the local container images if any changes to the Dockerfile or Gemfile are detected.

GitLab Runner has no problem running through these tests in series, but as soon as they start to run concurrently, things get very confused. I wish I could word it better than that, but I am at a loss as to why it is doing what it is doing.

The example I can most easily recreate: if I set the gitlab-runner concurrency to >1 (I would like to set it to 5) and two jobs begin processing at the same time, say “Acceptance” running at the same time as “Features”, the output for Features clearly shows that it’s trying to mess with the Acceptance containers.

A relevant tail of the “features” job log:

 $ docker compose -f tests/compose-features.yaml up -d
 Network tests_features  Creating
 Network tests_features  Created
 Container db-test-acceptance  Recreate
Error response from daemon: removal of container efdaec5461259046a3c2c4ceeaa0eddc55dc37f145c80cba45d86e7d30b5d06c is already in progress

In the log above, it should absolutely not be recreating the db-test-acceptance container; it should be creating or recreating the db-test-features container! db-test-acceptance has already been created by gitlab-runner in the other job, hence the “already in progress” error…

When these jobs are run in series, this issue is not present.

I would expect that even when running concurrently these commands are completely separate, and that these containers do not rely on one another (they are built from a common base image, but that’s it).

I suspect that the catalyst for the failure is one of the other jobs finishing early (not all test suites take equally long; some finish in 30 seconds) and its compose down causing some issues.

In the example above, the only job to finish before the cascading failure was the “Support” test suite; a tail from its job log:

 10 examples, 0 failures
$ docker compose -f tests/compose-support.yaml down --volumes
 Container rails-test-support  Stopping
 Container rails-test-support  Stopped
 Container rails-test-support  Removing
 Container rails-test-support  Removed
 Container db-test-support  Stopping
 Container db-test-support  Stopped
 Container db-test-support  Removing
 Container db-test-support  Removed
 Network tests_support  Removing
 Network tests_support  Removed
Cleaning up project directory and file based variables
00:00
Job succeeded

So it cleans up after itself and doesn’t remove anything that other containers rely on - and yet… something is not quite right.

I can re-run this pipeline with no issues simply by changing the gitlab-runner concurrency to “1” and restarting gitlab-runner.

I have tried:
→ splitting the docker compose setup into individual files, with individual networks and no shared volumes

I haven’t tried (because I don’t think it’ll matter):
→ putting the database in the same container as the app

I am about to try (because I just thought of it):
→ logging into the gitlab-runner host via shell with two sessions and running the commands by hand to see how that plays out. I’ll report back here.

I think the crux of the issue is in how gitlab-runner is making use of docker compose, and the docker engine is getting handsy with containers that are meant to be outside of its purview.

Now I have to post some shameful stuff - namely the ancient version of my self-managed GitLab; I haven’t had the time to upgrade, please be gentle.

Versions of stuff

  • GitLab Community Edition [12.7.5]
  • gitlab-runner-ubuntu-2023 [16.5.0]

The gitlab-ci.yml and docker compose files are obscenely verbose at the moment because of troubleshooting, so I will post some excerpts with context. Also please be kind; I’m transitioning a very old-school application and trying to modernise it without going full scorched earth, so there’s some … tech debt.

gitlab-ci

before_script:
after_script:
stages:
  - byebug_check
  - build
  - test
  - deploy_staging
  - smoke
  - package_release
  - make_badge
  - deploy_production

# Find any Instances of byebug, and fail if it finds it:
"Byebug Check":
  stage: byebug_check
  before_script: []
  script: bin/find-byebug.sh
  after_script: []
  except:
    - tags
    - master

"Build Docker test and db Images":
  stage: build
  before_script:
    - cp config/secrets.yml.example config/secrets.yml
    - cp config/database.yml.example config/database.yml
  script:
# This build uses the Dockerfile in root, should only be for Gem/RubyVersion changes
    - docker build --tag 'client-area-rails' .
  after_script: []
  rules:
    - changes:
        - Dockerfile

# Run this branch (all-tests) to see the full coverage value
"Specs - All":
  stage: test
  before_script: []
  script:
# This compose build should push the updated files using tests/Dockerfile
    - docker compose -f tests/compose-alltests.yaml build
    - docker compose -f tests/compose-alltests.yaml up -d
    - docker compose -f tests/compose-alltests.yaml exec -it rails "/bin/bash" -c "RAILS_ENV=test bundle exec rake db:migrate; RAILS_ENV=test CI=true bundle exec rspec --format=documentation"
    - docker compose -f tests/compose-alltests.yaml down --volumes

  after_script: []
# Run only on the "all-tests" branch
  only:
    - all-tests   

# Broken-down tests won't show the full coverage value but that's fine. This is more robust.
"Specs - Acceptance":
  stage: test
  script:
    - docker compose -f tests/compose-acceptance.yaml build
    - docker compose -f tests/compose-acceptance.yaml up -d
    - docker compose -f tests/compose-acceptance.yaml exec -it rails "/bin/bash" -c "RAILS_ENV=test bundle exec rake db:migrate; RAILS_ENV=test CI=true bundle exec rspec 'spec/acceptance' --format=documentation"
    - docker compose -f tests/compose-acceptance.yaml down --volumes
  except:
    - tags
    - all-tests  

...
# truncated: all 13 test jobs are the same - replace "acceptance" with a different word; otherwise they're identical

# never gets this far
Deploy to staging:
  stage: deploy_staging
  #when: on_success
  when: manual

# it never gets past here because tests are failing

The Dockerfile used for the image build stage

Dockerfile

FROM ruby:slim AS certsbase

# Copy Certificates
WORKDIR /var/www/app/current
COPY m21.crt config/keys/m21.crt
COPY m21.key config/keys/m21.key
COPY m21.ca config/keys/m21.ca

FROM certsbase as chromebase

# Chrome dependency installation
RUN apt-get update && apt-get install -y \
#...shortened this for the forum post
    libvulkan1 \
    git \
    jq
# Instead of automating Chrome, here's the manual workaround.
RUN export CHROMEURL="https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/114.0.5735.90/linux64/chrome-linux64.zip"; \
    wget -P /chrome $CHROMEURL ;\
    unzip /chrome/chrome* -d /chrome ;\
    PATH="/chrome/chrome-linux64:$PATH"
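# Note: the PATH assignment above only lives for the duration of that RUN layer; the symlink below is what actually puts chrome on the PATH.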

RUN ln -s /chrome/chrome-linux64/chrome /usr/local/bin/google-chrome; \
    chmod +x /chrome/chrome-linux64/chrome; \
    chmod +x /usr/local/bin/google-chrome; \
    echo "Chrome: " && google-chrome --version;

FROM certsbase as gembase

# Nokogiri's and other gem's build dependencies
RUN apt-get update && apt-get install -y \
  build-essential \
  libxml2-dev     \
  libxslt-dev     \
  libpq-dev       \
  nodejs          \
  libsqlite3-dev  \
  tzdata          \
  git

# Nokogiri, yikes. Example of a rubygems source for the build, in case I need it later
# RUN echo 'source "https://rubygems.org"; gem "nokogiri"' > Gemfile

COPY ./ /var/www/app/current
COPY Gemfile Gemfile.lock ./
RUN bundle update rails   --jobs=10 --retry=3
RUN bundle install  --jobs=10 --retry=3 --full-index

FROM gembase as precompile

RUN RAILS_ENV=test bundle exec rake assets:precompile

# The final image: we start clean
FROM chromebase as testbase

COPY --from=precompile /usr/local/bundle/ /usr/local/bundle/
COPY --from=precompile /var/www/app/current/public/assets/ /var/www/app/current/public/assets/
COPY Gemfile Gemfile.lock ./
COPY ./ /var/www/app/current
COPY --chmod=777 ./entrypoint.sh /start.sh

FROM testbase as testinstallbundler

RUN gem install bundler --no-document ;\
    bundle install --full-index

FROM testinstallbundler as testentrypoint

ENTRYPOINT ["/start.sh"]

# Start.sh runs the test suite as modified by the Dockerfile used at compose up; it can get replaced by the other Dockerfile used by docker compose

The compose file used by the test stage - this example is acceptance

compose-acceptance.yaml

include:
  - compose-includes.yaml
services:
  database:
    container_name: db-test-acceptance
    networks:
      - acceptance
    environment:
      - POSTGRES_DB=client_area_acceptance
      - POSTGRES_USER=gitlabci
      - POSTGRES_PASSWORD=gitlabci
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U gitlabci -d client_area_acceptance"]
      interval: 10s
      timeout: 5s
      retries: 5
    image: postgres:9.6
  rails:
    container_name: rails-test-acceptance
    depends_on:
      database:
        condition: service_healthy
    build:
      context: ../
      dockerfile: ./tests/Dockerfile-acceptance
    ports:
      - "3000:3000"
    links:
      - database
    networks:
      - acceptance
    secrets:
      - m21.key
      - m21.crt
      - m21.ca
    volumes:
      - type: bind
        source: ${HOME}/capybara
        target: /var/www/app/current/tmp

and its compose-includes.yaml file:

networks:
  app:
  acceptance:
  alltests:
  commands:
  fabricators:
  features:
  helpers:
  jobs:
  lib:
  mailers:
  models:
  requests:
  services:
  support:
  controllers:

volumes:
  database:
  capybara:

secrets:
  m21.key:
    file: /etc/ssl/m21.key
  m21.crt:
    file: /etc/ssl/m21.crt
  m21.ca:
    file: /etc/ssl/m21.ca

and its Dockerfile-acceptance:

FROM client-area-rails

WORKDIR /var/www/app/current
COPY Gemfile Gemfile.lock ./
COPY ./ /var/www/app/current
COPY --chmod=777 ./entrypoint.sh /start.sh
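# Replace the #CHANGEDBNAME placeholder in start.sh with a sed that points config/database.yml at the acceptance database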
RUN sed -i 's|#CHANGEDBNAME|sed -i '\''s/client_area_test/client_area_acceptance/g'\'' config/database.yml|' /start.sh
RUN echo "sleep 600" >> /start.sh
ENTRYPOINT ["/start.sh"]

# Start.sh runs the test suite at compose up

Coming back to this after the weekend with a fresh outlook and a slight increase in willingness to live:

→ I tested running two of the test suites “side by side” on the command line to see whether I could reproduce the crash outside of gitlab-runner, and I did find a potential issue.

During my testing I had added bound ports to the compose files (you can see it in my examples above: ports 3000:3000).

The error log isn’t clear, but I suspect that when Docker brings the stack up, one of the containers can’t start because port 3000 is already bound on the container host, the service never becomes healthy, and then it gets removed; the “removal is already in progress” failure from docker compose down is a red herring.

I’m sure if I trawled through my CI logs I might have seen an error like:

gitlab-runner@gitlab-runner:~/builds/6PicqP5W/0/micron21/client-area$ docker compose -f tests/compose-fabricators.yaml up -d
[+] Running 3/3
 ✔ Network test-fabricators_fabricators  Created                                                                                                                                                    0.1s
 ✔ Container db-test-fabricators         Healthy                                                                                                                                                    0.0s
 ✔ Container rails-test-fabricators      Created                                                                                                                                                    0.0s
Error response from daemon: driver failed programming external connectivity on endpoint rails-test-fabricators (226c38f1da4b7de96f33ff677ad1d496e623d4293a14dc2317f84d7b7370a066): Bind for 0.0.0.0:3000 failed: port is already allocated

but as it’s only the external bind that fails, the service still seemed to get created okay.

I’ve removed the port conflict and will reattempt concurrency.
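For anyone following along, here’s a sketch of what removing the conflict can look like on the compose side (this is the idea, not my exact diff, against the rails service from compose-acceptance.yaml above): either drop the published port entirely, or publish only the container port so Docker assigns a free ephemeral host port to each suite.

services:
  rails:
    container_name: rails-test-acceptance
    networks:
      - acceptance
    # ports:
    #   - "3000:3000"   # fixed host port - only one running suite can hold it at a time
    ports:
      - "3000"          # container port only; Docker picks a free host port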

With concurrent = 2 set in /etc/gitlab-runner/config.toml, two test jobs are now running concurrently. I’m going to scale it up, but if I don’t post back here, the issues all along were just a string of conflicts in my docker compose files.

This is less of a GitLab thing and more of a Docker thing, but just in case anyone comes here looking for a similar problem, my solution was (a compose sketch follows below):
→ Remove any includes from your compose files; when compose runs “down” it also acts on the included file, so it will clobber resources that other compose projects include too.
→ Give each compose file a project name (with compose v2 this can be done with a top-level “name: foo” above your services block).
→ Give each compose file unique network names. If you do this, service names do not need to be unique.
→ Ensure there are no host binds that can come into conflict, such as file locks from bind mounts or port conflicts from port binds.

Following through on the above resolved this problem for me.
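For anyone who ends up here with a similar problem, here’s a rough sketch of the shape a per-suite compose file takes once the four points above are applied. It isn’t the exact file (the healthcheck, secrets and capybara bind mount from the earlier acceptance example are left out), just the structure:

name: test-acceptance                # point 2: explicit project name, so up/down only ever touches this stack

services:
  database:
    image: postgres:9.6
    container_name: db-test-acceptance
    environment:
      - POSTGRES_DB=client_area_acceptance
      - POSTGRES_USER=gitlabci
      - POSTGRES_PASSWORD=gitlabci
    networks:
      - acceptance

  rails:
    container_name: rails-test-acceptance
    build:
      context: ../
      dockerfile: ./tests/Dockerfile-acceptance
    depends_on:
      - database
    networks:
      - acceptance
    # point 4: no host port binds, so nothing on the host to fight over

networks:
  acceptance:                        # points 1 and 3: declared locally (no include) and unique to this suite

With a distinct project name per suite, docker compose down --volumes only tears down that suite’s own containers, network and volumes, no matter how many jobs are running at once.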

EDIT// hadn’t closed this tab, so editing this to confirm that concurrent = 5 worked a dream and I’ve more than halved my CI/CD execution times.