The question:
We’re trying to become more modern with our CI/CD, and as part of that I’ve recently re-worked the testing for one of our applications to use docker compose. It’s a Ruby app with a test suite that takes about 15 minutes to run, and sometimes, if Selenium is having a sleepy day, a test fails transiently and the whole rspec suite needs to be re-run. To get around this, I wanted to run the tests concurrently in smaller batches - a job perfect for containerization. I wasn’t able to find much on the web about my problem, mostly because I’m not quite sure how to put it into words.
So far I have not optimised my workflow, but essentially I have an rspec job for every folder in ./spec in my application.
The plan is to have the .gitlab-ci.yml file execute, in the “test” stage of the pipeline, many different test jobs, so that I can re-run only the part that fails.
The way I have done this is to have a block in my .gitlab-ci.yml file like this:
"Specs - Acceptance":
stage: test
script:
- docker compose -f tests/compose-acceptance.yaml build
- docker compose -f tests/compose-acceptance.yaml up -d
- docker compose -f tests/compose-acceptance.yaml exec -it rails "/bin/bash" -c "RAILS_ENV=test bundle exec rake db:migrate; RAILS_ENV=test CI=true bundle exec rspec 'spec/acceptance' --format=documentation"
- docker compose -f tests/compose-acceptance.yaml down --volumes
except:
- tags
- all-tests
The gitlab-ci.yml is this verbose so that I can debug the issue; it was smaller and more petite before, back when I presumed this would work. The executor is “shell” because I don’t have this working reliably yet; if I can get it working, or if the docker executor is what’s required to fully commit, then I’ll give that a go.
There is an earlier build stage that optionally builds the local container images if any changes to the Dockerfile or Gemfile are detected.
Gitlab runner has no problem running through these tests in series, but as soon as they start to run concurrently, it gets very confused. I wish I could word it better than that, but I am at a loss as to why it is doing what it does.
The example I can most easily recreate: if I set the gitlab-runner concurrency to >1 (I would like to set it to 5) and two jobs begin processing at the same time - say the “Acceptance” job runs at the same time as “Features” - the output for Features clearly shows that it’s trying to mess with the Acceptance containers.
A relevant tail of the “features” job log:
$ docker compose -f tests/compose-features.yaml up -d
Network tests_features Creating
Network tests_features Created
Container db-test-acceptance Recreate
Error response from daemon: removal of container efdaec5461259046a3c2c4ceeaa0eddc55dc37f145c80cba45d86e7d30b5d06c is already in progress
In the log above, it should absolutely not be recreating the db-test-acceptance container; it should be creating or recreating the db-test-features container! db-test-acceptance has already been created by gitlab-runner for the other job, hence the removal that is “already in progress”.
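One thing I plan to check: as far as I understand it, compose v2 decides which containers belong to a run from the com.docker.compose.project label on each container, not from the file passed with -f. A hypothetical debug job like this (a sketch, not something in my pipeline) should show which project each container was recorded under:
# Sketch only: print each container's name and the compose project label it carries
"Debug - Compose Ownership":
  stage: test
  script:
    - docker ps --all --format '{{.Names}}\t{{.Label "com.docker.compose.project"}}'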
When these jobs are run in series, this issue is not present.
I would expect that even when running concurrently, these commands are completely separate, and that these containers do not rely on one another (they are built from a common base image, but that’s it).
I suspect that the catalyst for failure is one of the other jobs finishing early (not all suites take equally long to run; some finish in 30 seconds) and its compose down causing some issues.
In the example above, the only job to finish before the cascading failure was the “Support” test suite; a tail from its job log:
10 examples, 0 failures
$ docker compose -f tests/compose-support.yaml down --volumes
Container rails-test-support Stopping
Container rails-test-support Stopped
Container rails-test-support Removing
Container rails-test-support Removed
Container db-test-support Stopping
Container db-test-support Stopped
Container db-test-support Removing
Container db-test-support Removed
Network tests_support Removing
Network tests_support Removed
Cleaning up project directory and file based variables
00:00
Job succeeded
So it cleans up after itself and doesn’t remove anything that other containers rely on - and yet… something is not quite right.
I can re-run this pipeline with no issues simply by changing the gitlab-runner concurrency to “1” and restarting gitlab-runner.
I have tried:
→ splitting the docker compose setup out into individual files, with individual networks and no shared volumes.
I haven’t tried (because I don’t think it’ll matter):
→ putting the database in the same container.
I am about to try (because I just thought of it):
→ logging into the gitlab-runner host in two shell sessions and running the commands by hand to see how that plays out. I’ll report back here.
I think the crux of the issue is how gitlab-runner is making use of docker compose, and that the docker engine is getting handsy with containers that are meant to be outside of its purview.
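If that’s the mechanism, then every job may be landing in the same compose “project”: the project name defaults to the name of the directory the first compose file lives in, which is tests for every one of these jobs, and the Network tests_features / tests_support lines in the logs above fit that pattern - the prefix is the project name, and it’s the same for every suite. A sketch of what I’ll try next, assuming that’s what is happening - pinning a unique project name per job via COMPOSE_PROJECT_NAME (CI_JOB_ID is unique per job):
# Sketch, untested: give each job its own compose project namespace
"Specs - Acceptance":
  stage: test
  variables:
    COMPOSE_PROJECT_NAME: "acceptance-${CI_JOB_ID}"
  script:
    - docker compose -f tests/compose-acceptance.yaml build
    - docker compose -f tests/compose-acceptance.yaml up -d
    # ...exec and down steps unchanged...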
Now I have to post some shameful stuff - namely the ancient version of my self-managed GitLab. I haven’t had the time to upgrade; please be gentle.
Versions of stuff:
- GitLab Community Edition [12.7.5]
- gitlab-runner-ubuntu-2023 [16.5.0]
The gitlab-ci.yml and docker compose files are obscenely verbose at the moment because of the troubleshooting, so I will post some excerpts with context. Also, please be kind; I’m transitioning from a very old-school application and trying to modernise it without going full scorched earth, so there’s some … tech debt.
gitlab-ci
before_script:
after_script:

stages:
  - byebug_check
  - build
  - test
  - deploy_staging
  - smoke
  - package_release
  - make_badge
  - deploy_production

# Find any instances of byebug, and fail if any are found:
"Byebug Check":
  stage: byebug_check
  before_script: []
  script: bin/find-byebug.sh
  after_script: []
  except:
    - tags
    - master

"Build Docker test and db Images":
  stage: build
  before_script:
    - cp config/secrets.yml.example config/secrets.yml
    - cp config/database.yml.example config/database.yml
  script:
    # This build uses the Dockerfile in root; should only be needed for Gem/RubyVersion changes
    - docker build --tag 'client-area-rails' .
  after_script: []
  rules:
    - changes:
        - Dockerfile

# Run this branch (all-tests) to see the full coverage value
"Specs - All":
  stage: test
  before_script: []
  script:
    # This compose build should push the updated files using tests/Dockerfile
    - docker compose -f tests/compose-alltests.yaml build
    - docker compose -f tests/compose-alltests.yaml up -d
    - docker compose -f tests/compose-alltests.yaml exec -it rails "/bin/bash" -c "RAILS_ENV=test bundle exec rake db:migrate; RAILS_ENV=test CI=true bundle exec rspec --format=documentation"
    - docker compose -f tests/compose-alltests.yaml down --volumes
  after_script: []
  # Run only on branch "all-tests"
  only:
    - all-tests

# Broken-down tests won't show the full coverage value, but that's fine. This is more robust.
"Specs - Acceptance":
  stage: test
  script:
    - docker compose -f tests/compose-acceptance.yaml build
    - docker compose -f tests/compose-acceptance.yaml up -d
    - docker compose -f tests/compose-acceptance.yaml exec -it rails "/bin/bash" -c "RAILS_ENV=test bundle exec rake db:migrate; RAILS_ENV=test CI=true bundle exec rspec 'spec/acceptance' --format=documentation"
    - docker compose -f tests/compose-acceptance.yaml down --volumes
  except:
    - tags
    - all-tests

...
# truncated the other 13 test jobs; they're all the same - replace "acceptance" with a different word, otherwise identical

# never gets this far
Deploy to staging:
  stage: deploy_staging
  #when: on_success
  when: manual
# it never gets past here because tests are failing
The Dockerfile used for the image build stage
Dockerfile
FROM ruby:slim AS certsbase
# Copy certificates
WORKDIR /var/www/app/current
COPY m21.crt config/keys/m21.crt
COPY m21.key config/keys/m21.key
COPY m21.ca config/keys/m21.ca

FROM certsbase AS chromebase
# Chrome dependency installation
RUN apt-get update && apt-get install -y \
    # ...shortened this for the forum post
    libvulkan1 \
    git \
    jq
# Instead of automating chrome, here's the manual workaround.
RUN export CHROMEURL="https://edgedl.me.gvt1.com/edgedl/chrome/chrome-for-testing/114.0.5735.90/linux64/chrome-linux64.zip"; \
    wget -P /chrome $CHROMEURL ;\
    unzip /chrome/chrome* -d /chrome ;\
    PATH="/chrome/chrome-linux64:$PATH"
RUN ln -s /chrome/chrome-linux64/chrome /usr/local/bin/google-chrome; \
    chmod +x /chrome/chrome-linux64/chrome; \
    chmod +x /usr/local/bin/google-chrome; \
    echo "Chrome: " && google-chrome --version;

FROM certsbase AS gembase
# Nokogiri's and other gems' build dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    libxml2-dev \
    libxslt-dev \
    libpq-dev \
    nodejs \
    libsqlite3-dev \
    tzdata \
    git
# Nokogiri, yikes. Example of a rubygems source for a build, in case I need it later:
# RUN echo 'source "https://rubygems.org"; gem "nokogiri"' > Gemfile
COPY ./ /var/www/app/current
COPY Gemfile Gemfile.lock ./
RUN bundle update rails --jobs=10 --retry=3
RUN bundle install --jobs=10 --retry=3 --full-index

FROM gembase AS precompile
RUN RAILS_ENV=test bundle exec rake assets:precompile

# The final image: we start clean
FROM chromebase AS testbase
COPY --from=precompile /usr/local/bundle/ /usr/local/bundle/
COPY --from=precompile /var/www/app/current/public/assets/ /var/www/app/current/public/assets/
COPY Gemfile Gemfile.lock ./
COPY ./ /var/www/app/current
COPY --chmod=777 ./entrypoint.sh /start.sh

FROM testbase AS testinstallbundler
RUN gem install bundler --no-document ;\
    bundle install --full-index

FROM testinstallbundler AS testentrypoint
ENTRYPOINT ["/start.sh"]
# start.sh runs the test suite as modified by the Dockerfile used at compose up; it can get replaced by the other Dockerfile used by docker compose
The compose file used by the test stage - this example is acceptance
compose-acceptance.yaml
include:
  - compose-includes.yaml

services:
  database:
    container_name: db-test-acceptance
    networks:
      - acceptance
    environment:
      - POSTGRES_DB=client_area_acceptance
      - POSTGRES_USER=gitlabci
      - POSTGRES_PASSWORD=gitlabci
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U gitlabci -d client_area_acceptance"]
      interval: 10s
      timeout: 5s
      retries: 5
    image: postgres:9.6
  rails:
    container_name: rails-test-acceptance
    depends_on:
      database:
        condition: service_healthy
    build:
      context: ../
      dockerfile: ./tests/Dockerfile-acceptance
    ports:
      - "3000:3000"
    links:
      - database
    networks:
      - acceptance
    secrets:
      - m21.key
      - m21.crt
      - m21.ca
    volumes:
      - type: bind
        source: ${HOME}/capybara
        target: /var/www/app/current/tmp
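One more thing I’ve noticed while writing this up: even with unique project names, the hardcoded container_name and the fixed 3000:3000 host port binding would presumably still collide if the same suite ever ran twice concurrently on this runner. A sketch of the change I’d make, untested:
# Sketch, untested: drop the fixed names so compose can prefix them with the project name
services:
  database:
    # container_name removed; compose would generate <project>-database-1
    image: postgres:9.6
  rails:
    # container_name removed, and the container port published to an ephemeral host port
    ports:
      - "3000"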
and its compose-includes.yaml file:
networks:
  app:
  acceptance:
  alltests:
  commands:
  fabricators:
  features:
  helpers:
  jobs:
  lib:
  mailers:
  models:
  requests:
  services:
  support:
  controllers:
volumes:
  database:
  capybara:
secrets:
  m21.key:
    file: /etc/ssl/m21.key
  m21.crt:
    file: /etc/ssl/m21.crt
  m21.ca:
    file: /etc/ssl/m21.ca
and its Dockerfile-acceptance:
FROM client-area-rails
WORKDIR /var/www/app/current
COPY Gemfile Gemfile.lock ./
COPY ./ /var/www/app/current
COPY --chmod=777 ./entrypoint.sh /start.sh
RUN sed -i 's|#CHANGEDBNAME|sed -i '\''s/client_area_test/client_area_acceptance/g'\'' config/database.yml|' /start.sh
RUN echo "sleep 600" >> /start.sh
ENTRYPOINT ["/start.sh"]
# Start.sh runs the test suite at compose up