GitLab Pipeline randomly failing using Docker Runners - suspect caching issue

ScottFred · August 26, 2021, 6:24pm

After migrating from group-specific GitLab Runners to 2 newly installed and registered Shared Runners, pipelines across all projects started failing (about 60% of the time) in random jobs. Retrying a failed job sometimes makes it pass (without any other changes), but the pipeline may then fail in the next job. What is the best way to resolve or troubleshoot possible GitLab Runner caching issues? The GitLab Docker Runner cache? The Python cache within the build? Something else?

What are you seeing, and how does that differ from what you expect to see?
GitLab Runner jobs randomly fail, but sometimes pass if we “Retry” the job. I expect not to have to retry the job. If they don’t pass on retry, we can get the pipeline to work by re-running the entire pipeline.
What version are you on? Are you using self-managed or GitLab.com?
We are on the self-hosted, community edition (CE) of GitLab, running 14.2.1
The GitLab Runners are also at version 14.2 running as Docker Containers
We are building Python and Java packages and bundling them into a Docker Container using Kaniko.
Add the CI configuration from .gitlab-ci.yml and other configuration if relevant (e.g. docker-compose.yml)

config.toml

concurrent = 1
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "shared-runner-2"
  url = "https://<external_url>"
  token = "<hidden_token>"
  executor = "docker"
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.docker]
    tls_verify = false
    image = "centos7.9"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0

.gitlab-ci.yml

# Disable the Gradle daemon for Continuous Integration servers as correctness
# is usually a priority over speed in CI environments. Using a fresh
# runtime for each build is more reliable since the runtime is completely
# isolated from any previous builds.
variables:
  GRADLE_OPTS: "-Dorg.gradle.daemon=false"
  SCRIPTS_REPO: ssh://git@$URL

stages:
  - .pre
  - build
  - publish-snapshot
  - dockerize
  - deploy-staging
  - version
  - publish-release
  - deploy-prod

before_script:
  - export GRADLE_USER_HOME=`pwd`/.gradle

cache:
  key: "$CI_COMMIT_REF_NAME" #optional: per branch caching, can be omitted
  paths:
    - .gradle/caches/
    - .gradle/wrapper/
    - .gradle/build-cache/

build:
  image: gradle:jdk11
  stage: build
  only:
    - branches
  script:
    - ./gradlew ${INIT_SCRIPT_ARGS} --build-cache assemble
    - ./gradlew ${INIT_SCRIPT_ARGS} bootJar
  artifacts:
    paths:
    - sgs-tman-simulator/build/libs/sgs-tman-simulator*.jar
    - mock-snapglass-cr-service/build/libs/mock-snapglass-cr-service*.jar
    - mock-ibs-service/build/libs/mock-ibs-service*.jar
    expire_in: 1 week

publish-snapshot:
  image: gradle:jdk11
  stage: publish-snapshot
  only:
    - branches
  except:
    refs:
      - master
  script:
    - ./gradlew ${INIT_SCRIPT_ARGS} publish

dockerize:
  stage: dockerize
  only:
    refs:
      - master
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    - mkdir -p /kaniko/.docker
    - echo "{\"auths\":{\"$CI_REGISTRY\":{\"username\":\"$CI_REGISTRY_USER\",\"password\":\"$CI_REGISTRY_PASSWORD\"}}}" > /kaniko/.docker/config.json
    - echo "Building sgs-tman-simulator image"
    - /kaniko/executor --context $CI_PROJECT_DIR/sgs-tman-simulator/ --dockerfile $CI_PROJECT_DIR/sgs-tman-simulator/Dockerfile --destination $CI_REGISTRY_IMAGE/sgs-tman-simulator:$CI_COMMIT_SHORT_SHA --cleanup
    - echo "Building mock-snapglass-cr-service image"
    - /kaniko/executor --context $CI_PROJECT_DIR/mock-snapglass-cr-service/ --dockerfile $CI_PROJECT_DIR/mock-snapglass-cr-service/Dockerfile --destination $CI_REGISTRY_IMAGE/mock-snapglass-cr-service:$CI_COMMIT_SHORT_SHA --cleanup
    - echo "Building activemq-high image"
    - /kaniko/executor --context $CI_PROJECT_DIR/activemq-high/ --dockerfile $CI_PROJECT_DIR/activemq-high/Dockerfile --destination $CI_REGISTRY_IMAGE/activemq-high:$CI_COMMIT_SHORT_SHA --cleanup
    - echo "Building mock-ibs-service image"
    - /kaniko/executor --context $CI_PROJECT_DIR/mock-ibs-service/ --dockerfile $CI_PROJECT_DIR/mock-ibs-service/Dockerfile --destination $CI_REGISTRY_IMAGE/mock-ibs-service:$CI_COMMIT_SHORT_SHA --cleanup
    - echo "Building xocomm-simulator image $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
    - export build_date=`date -u +"%Y-%m-%dT%H:%M:%SZ"`
    - /kaniko/executor --context $CI_PROJECT_DIR/xocomm-simulator/ --dockerfile $CI_PROJECT_DIR/xocomm-simulator/Dockerfile --destination $CI_REGISTRY_IMAGE/$XOCOMM_SIMULATOR_IMAGE_NAME --build-arg=VCS_REF=$CI_COMMIT_SHORT_SHA --build-arg=BUILD_DATE=$build_date --cleanup

deploy-staging:
  stage: deploy-staging
  image: kroniak/ssh-client
  only:
    refs:
      - master
  script:
    - mkdir -p ~/.ssh
    - echo "$SSH_KNOWN_HOSTS_ip_10_1_1_20" >> ~/.ssh/known_hosts
    - echo "$SSH_KNOWN_HOSTS_ip_10_1_1_22" >> ~/.ssh/known_hosts
    - chmod 644 ~/.ssh/known_hosts
    - eval "$(ssh-agent -s)"
    - ssh-add <(echo "$SSH_PRIVATE_KEY_ip_10_1_1_20")
    - ssh-add <(echo "$SSH_PRIVATE_KEY_ip_10_1_1_22")
    - export ACTIVE_PROFILE=int
    - printf "COMPOSE_HTTP_TIMEOUT=300\nMOCK_IBS_IMAGE_NAME=$MOCK_IBS_IMAGE_NAME\nACTIVEMQ_IMAGE_NAME=$ACTIVEMQ_IMAGE_NAME\nACTIVEMQ_HIGH_IMAGE_NAME=$ACTIVEMQ_HIGH_IMAGE_NAME\nTMAN_SIMULATOR_IMAGE_NAME=$TMAN_SIMULATOR_IMAGE_NAME\nMOCK_SG_CR_IMAGE_NAME=$MOCK_SG_CR_IMAGE_NAME\nSTORE_PASSWORD=$STORE_PASSWORD\nPROJECT_PORT=$PROJECT_PORT\nPROJECT_SERVER=$IP_ADDR_ip_10_1_1_20\nCI_COMMIT_SHORT_SHA=$CI_COMMIT_SHORT_SHA\n" > .env
    - printf "ACTIVE_PROFILE=$ACTIVE_PROFILE\nACTIVEMQ_BROKER=$ACTIVEMQ_BROKER\nACTIVEMQ_PASSWORD=$ACTIVEMQ_PASSWORD\nACTIVEMQ_USER=$ACTIVEMQ_USER\nINT_BROKER_URL=$INT_BROKER_URL\nINT_HIGH_BROKER_URL=$INT_HIGH_BROKER_URL\nPROD_BROKER_URL=$PROD_BROKER_URL\nPROD_HIGH_BROKER_URL=$PROD_HIGH_BROKER_URL\nDOCKER_NETWORK=$DOCKER_NETWORK\nDOCKER_REPO=$DOCKER_REPO\n" >> .env
    - printf "MOCK_IMAGE_SERVER=$MOCK_IMAGE_SERVER\nMOCK_IMAGE_PORT=$MOCK_IMAGE_PORT\n" >> .env
    - printf "CDS_LOW_HOST=$CDS_LOW_HOST\nCDS_HIGH_HOST=$CDS_HIGH_HOST\nCDS_DESTINATION_TASKING_PATH=$CDS_DESTINATION_TASKING_PATH\nCDS_DESTINATION_TASKING_STATUS_PATH=$CDS_DESTINATION_TASKING_STATUS_PATH\nCDS_DESTINATION_UPDATE_PATH=$CDS_DESTINATION_UPDATE_PATH\n" >> .env
    - printf "CDS_SOURCE_TASKING_PATH=$CDS_SOURCE_TASKING_PATH\nCDS_SOURCE_TASKING_STATUS_PATH=$CDS_SOURCE_TASKING_STATUS_PATH\nCDS_SOURCE_UPDATE_PATH=$CDS_SOURCE_UPDATE_PATH\nCDS_SOURCE_POSITION_UPDATE_PATH=$CDS_SOURCE_POSITION_UPDATE_PATH\nCDS_DESTINATION_POSITION_UPDATE_PATH=$CDS_DESTINATION_POSITION_UPDATE_PATH\nSIM_PROXY_PORT=$SIM_PROXY_PORT\n" >> .env
    - printf "XOCOMM_SIMULATOR_IMAGE_NAME=$XOCOMM_SIMULATOR_IMAGE_NAME\n" >> .env
    - echo "$(cat .env)"
##    - ssh -Tvv deployer@$IP_ADDR_ip_10_1_1_20
    - scp -r ./.env ./docker-compose-sgs-mocks.yml deployer@$IP_ADDR_ip_10_1_1_20:~/
    - scp -r ./.env ./docker-compose-sgs-mocks-high.yml deployer@$IP_ADDR_ip_10_1_1_22:~/
    - ssh deployer@$IP_ADDR_ip_10_1_1_20 ". .env; docker login -u gitlab-ci-token -p $CI_BUILD_TOKEN $CI_REGISTRY; docker-compose -f docker-compose-sgs-mocks.yml stop; docker-compose -f docker-compose-sgs-mocks.yml rm --force; docker-compose -f docker-compose-sgs-mocks.yml pull; docker-compose -f docker-compose-sgs-mocks.yml up -d;"
    - ssh deployer@$IP_ADDR_ip_10_1_1_22 ". .env; docker login -u gitlab-ci-token -p $CI_BUILD_TOKEN $CI_REGISTRY; docker-compose -f docker-compose-sgs-mocks-high.yml stop; docker-compose -f docker-compose-sgs-mocks-high.yml rm --force; docker-compose -f docker-compose-sgs-mocks-high.yml pull; docker-compose -f docker-compose-sgs-mocks-high.yml up -d;"

version:
  image: python:3.7-stretch
  stage: version
  only:
    refs:
      - master
  before_script:
    - export SCRIPTS_DIR=$(mktemp -d)
    - echo $SCRIPTS_DIR
  script:
    - mkdir -p ~/.ssh && chmod 700 ~/.ssh
    - ssh-keyscan -p 10023 $URL >> ~/.ssh/known_hosts && chmod 644 ~/.ssh/known_hosts
    - eval $(ssh-agent -s)
    - ssh-add <(echo "$SSH_PRIVATE_KEY_SEMVER")
    - pip install semver
    # Force an update to the Git Origin URL with the correct port
    - git remote remove origin
    - git remote add origin git@$URL
    - git clone -q --depth 1 ${SCRIPTS_REPO} ${SCRIPTS_DIR}
    - git remote -v
    - $SCRIPTS_DIR/ci-scripts/common/gen-semver.py | tee tag_version
  artifacts:
    paths:
      - tag_version

# Clean and rebuild jars with new release tag in name then publish
publish-release:
  image: gradle:jdk11
  stage: publish-release
  only:
    refs:
      - master
  script:
    - ./gradlew ${INIT_SCRIPT_ARGS} clean
    - ./gradlew ${INIT_SCRIPT_ARGS} --build-cache assemble
    - ./gradlew ${INIT_SCRIPT_ARGS} bootJar
    - ./gradlew ${INIT_SCRIPT_ARGS} publish
  artifacts:
    paths:
      - sgs-orchestrator/build/libs/sgs-orchestrator*.jar
      - uci-service/build/libs/uci-service*.jar
      - request-repo/build/libs/request-repo*.jar
      - sentient-request-status-normalizer/build/libs/sentient-request-status-normalizer*.jar
    expire_in: 1 week

deploy-prod:
  stage: deploy-prod
  environment: production
  image: kroniak/ssh-client
  only:
    refs:
      - master
  when: manual
  script:
    - mkdir -p ~/.ssh
    - echo "$SSH_KNOWN_HOSTS_ip_10_1_1_21" >> ~/.ssh/known_hosts
    - echo "$SSH_KNOWN_HOSTS_ip_10_1_1_23" >> ~/.ssh/known_hosts
    - chmod 644 ~/.ssh/known_hosts
    - eval "$(ssh-agent -s)"
    - ssh-add <(echo "$SSH_PRIVATE_KEY_ip_10_1_1_21")
    - ssh-add <(echo "$SSH_PRIVATE_KEY_ip_10_1_1_23")
    - printf "COMPOSE_HTTP_TIMEOUT=300\nMOCK_IBS_IMAGE_NAME=$MOCK_IBS_IMAGE_NAME\nACTIVEMQ_IMAGE_NAME=$ACTIVEMQ_IMAGE_NAME\nACTIVEMQ_HIGH_IMAGE_NAME=$ACTIVEMQ_HIGH_IMAGE_NAME\nDB_NAME=$DB_NAME\nDB_USERNAME=$DB_USERNAME\nDB_SERVER=$IP_ADDR_ip_10_1_1_21\nDB_PORT=$DB_PORT\nDB_PASSWORD=$DB_PASSWORD\nTMAN_SIMULATOR_IMAGE_NAME=$TMAN_SIMULATOR_IMAGE_NAME\nMOCK_SG_CR_IMAGE_NAME=$MOCK_SG_CR_IMAGE_NAME\nSTORE_PASSWORD=$STORE_PASSWORD\nPROJECT_PORT=$PROJECT_PORT\nPROJECT_SERVER=$IP_ADDR_ip_10_1_1_20\nCI_COMMIT_SHORT_SHA=$CI_COMMIT_SHORT_SHA\n" > .env
    - printf "ACTIVE_PROFILE=$ACTIVE_PROFILE\nACTIVEMQ_BROKER=$ACTIVEMQ_BROKER\nACTIVEMQ_PASSWORD=$ACTIVEMQ_PASSWORD\nACTIVEMQ_USER=$ACTIVEMQ_USER\nINT_BROKER_URL=$INT_BROKER_URL\nINT_HIGH_BROKER_URL=$INT_HIGH_BROKER_URL\nPROD_BROKER_URL=$PROD_BROKER_URL\nPROD_HIGH_BROKER_URL=$PROD_HIGH_BROKER_URL\nDOCKER_NETWORK=$DOCKER_NETWORK\nDOCKER_REPO=$DOCKER_REPO\n" >> .env
#    NOTE: if we stand a prod imagery server, will we need to account for that here
    - printf "MOCK_IMAGE_SERVER=$MOCK_IMAGE_SERVER\nMOCK_IMAGE_PORT=$MOCK_IMAGE_PORT\n" >> .env
    - printf "CDS_LOW_HOST=$CDS_LOW_HOST\nCDS_HIGH_HOST=$CDS_HIGH_HOST\nCDS_DESTINATION_TASKING_PATH=$CDS_DESTINATION_TASKING_PATH\nCDS_DESTINATION_TASKING_STATUS_PATH=$CDS_DESTINATION_TASKING_STATUS_PATH\nCDS_DESTINATION_UPDATE_PATH=$CDS_DESTINATION_UPDATE_PATH\n" >> .env
    - printf "CDS_SOURCE_TASKING_PATH=$CDS_SOURCE_TASKING_PATH\nCDS_SOURCE_TASKING_STATUS_PATH=$CDS_SOURCE_TASKING_STATUS_PATH\nCDS_SOURCE_UPDATE_PATH=$CDS_SOURCE_UPDATE_PATH\nCDS_SOURCE_POSITION_UPDATE_PATH=$CDS_SOURCE_POSITION_UPDATE_PATH\nCDS_DESTINATION_POSITION_UPDATE_PATH=$CDS_DESTINATION_POSITION_UPDATE_PATH\nSIM_PROXY_PORT=$SIM_PROXY_PORT\n" >> .env
    - printf "XOCOMM_SIMULATOR_IMAGE_NAME=$XOCOMM_SIMULATOR_IMAGE_NAME\n" >> .env
    - printf "SNAPGLASS_IP=$SNAPGLASS_IP\n" >> .env
    - echo "$(cat .env)"
    - scp -r ./.env ./docker-compose-sgs-mocks.yml deployer@$IP_ADDR_ip_10_1_1_21:~/
    - scp -r ./.env ./docker-compose-sgs-mocks-high.yml deployer@$IP_ADDR_ip_10_1_1_23:~/
    - ssh deployer@$IP_ADDR_ip_10_1_1_21 ". .env; docker login -u gitlab-ci-token -p $CI_BUILD_TOKEN $CI_REGISTRY; docker-compose -f docker-compose-sgs-mocks.yml stop; docker-compose -f docker-compose-sgs-mocks.yml rm --force; docker-compose -f docker-compose-sgs-mocks.yml pull; docker-compose -f docker-compose-sgs-mocks.yml up -d"
    - ssh deployer@$IP_ADDR_ip_10_1_1_23 ". .env; docker login -u gitlab-ci-token -p $CI_BUILD_TOKEN $CI_REGISTRY; docker-compose -f docker-compose-sgs-mocks-high.yml stop; docker-compose -f docker-compose-sgs-mocks-high.yml rm --force; docker-compose -f docker-compose-sgs-mocks-high.yml pull; docker-compose -f docker-compose-sgs-mocks-high.yml up -d;"

build-dal:
  stage: .pre
  image: python:3.7-stretch
  when: manual
  script:
    # build data ascension list
    - echo | find . -type d  -name '.git' -prune -o -type f  -name '.git*' -prune -o -type f -printf "%TY-%Tm-%Td %.8TT\t%s\t%p\n" | numfmt --field=3 --to=iec --padding=-4 > sgs_mock-services_dal.txt

  artifacts:
    paths:
      - sgs_mock-services_dal.txt
    expire_in: 1 week

What troubleshooting steps have you already taken? Can you link to any docs or other resources so we know where you have been?

We’ve turned on debug to watch the details of the runner.
We’ve watched the build progress and added echo debug output to the .gitlab-ci.yml file.

ScottFred · August 26, 2021, 8:50pm

So, it seems that turning off caching in each of the GitLab Runner config.toml files resolved the intermittent failures, but I’m not sure why. Can someone suggest why changing disable_cache to “true” seems to work? Is there something more appropriate that I could do instead?

[[runners]]
   [runners.docker]
...
    disable_cache = true

Thanks

ricardoamarilla · August 26, 2021, 8:58pm

Hi ScottFred,

I am not sure what exact error you are experiencing in your failed job.

In my opinion, the best way to resolve or troubleshoot this is to follow the documentation, which usually has a Troubleshooting section. For the cache, it is Caching in GitLab CI/CD | GitLab.

Based on the documentation, to share caches between jobs in the same branch you need to use this key:

key: $CI_COMMIT_REF_SLUG

And you are using key: "$CI_COMMIT_REF_NAME" in your gitlab-ci.yml, not sure if this will help, I hope so.

ScottFred · August 26, 2021, 9:24pm

Thanks @ricardoamarilla … for the link to the Cache Troubleshooting. I saw it and read it, but it wasn’t clear what sections were applicable to my issue. Looking at it again.

Thanks for noticing that the key we are using ($CI_COMMIT_REF_NAME) is different from the recommended one ($CI_COMMIT_REF_SLUG).

Since turning off Docker caching, things appear to be working. But we may be able to cut down on network bandwidth if I can take advantage of the Docker caching. So, I’ll turn it back on and try your suggestion to see if that works. Thanks again.

Scott

ScottFred · August 26, 2021, 9:34pm

BTW @ricardoamarilla, I discovered that $CI_COMMIT_REF_SLUG is the same as $CI_COMMIT_REF_NAME except: CI_COMMIT_REF_NAME in lowercase, shortened to 63 bytes, and with everything except 0-9 and a-z replaced with - . No leading / trailing - . Use in URLs, host names and domain names.
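
For reference, the slug rule quoted above can be sketched in shell. This is a rough approximation for ASCII branch names, not GitLab's own implementation, and the function name ref_slug is made up for illustration.

```shell
# Approximate CI_COMMIT_REF_SLUG from a ref name (hypothetical helper):
# lowercase, replace everything except a-z and 0-9 with "-",
# keep the first 63 characters, then drop leading/trailing "-".
ref_slug() {
  printf '%s\n' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9]/-/g' \
    | cut -c1-63 \
    | sed -e 's/^-*//' -e 's/-*$//'
}
```

For example, `ref_slug 'Feature/My_Branch'` prints `feature-my-branch`. For a branch name that is already lowercase and alphanumeric, the slug and the ref name are identical, so switching the cache key between the two would not change anything.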

So, it doesn’t give me a lot of confidence that this will make a difference, but I’ll give it a try. Later tonight.

ricardoamarilla · August 26, 2021, 11:28pm

Yes, it is worth trying. There are some character restrictions in the cache:key.

You can also set the CI_DEBUG_TRACE variable to true and re-run the failing pipeline to get more details.

What errors do you have in the failing jobs?

ScottFred · August 27, 2021, 6:23pm

@ricardoamarilla Here’s an example of a “static analysis” job that fails. The previous build job, which ran successfully, did a pip install for this job to use. I’m sure there’s a subtlety that I’m missing, but the intent is for the static analysis job to reuse the results of the “build” job, which handles Python packaging and runs “pip install -r requirements.txt”, which installs flake8. However, the pip list command in this job clearly shows that the Python packages we installed previously are not available in the “static analysis” job.

The output from the previous build job (same pipeline) clearly shows flake8 is installed and the Runner suggests that it is going to save the cache to “side-load-ossim-1” that is then going to be made available locally.

$ pip list
Package                  Version
------------------------ ----------
aiofiles                 0.7.0
...
flake8                   3.9.2
...
Saving cache for successful job
01:55
Creating cache side-load-ossim-1...
.cache/pip: found 1106 matching files and directories 
venv/: found 19434 matching files and directories  
No URL provided, cache will be not uploaded to shared cache server. Cache will be stored only locally. 
Created cache
Cleaning up file based variables
00:01
Job succeeded

However, the next job (static analysis) clearly shows that flake8 is not available…even though it shows that it knew to look for a cache called “side-load-ossim-1” and indicates that it was able to successfully extract that cache.

...
Restoring cache
00:01
Checking cache for side-load-ossim-1...
No URL provided, cache will not be downloaded from shared cache server. Instead a local version of cache will be extracted. 
Successfully extracted cache
Executing "step_script" stage of the job script
00:09
Using docker image sha256:1e76b28bfd4e5803f6f6176a35567bd6347e552d6221d5fc62133888c8caf496 for python:3.9 with digest python@sha256:2bd64896cf4ff75bf91a513358457ed09d890715d9aa6bb602323aedbee84d14 
...
$ pip --version
pip 21.2.3 from /builds/hawkeye/hawkeye/venv/lib/python3.9/site-packages/pip (python 3.9)
++ echo '$ pip list'
++ pip list
$ pip list
Package    Version
---------- -------
pip        21.2.3
setuptools 57.4.0
wheel      0.37.0
WARNING: You are using pip version 21.2.3; however, version 21.2.4 is available.
You should consider upgrading via the '/builds/hawkeye/hawkeye/venv/bin/python -m pip install --upgrade pip' command.
++ echo '$ uname -a'
++ uname -a
$ uname -a
Linux runner-fxvwypur-project-36-concurrent-0 3.10.0-1160.31.1.el7.x86_64 #1 SMP Wed May 26 20:18:08 UTC 2021 x86_64 GNU/Linux
++ echo '$ flake8 --version'
++ flake8 --version
$ flake8 --version
/bin/bash: line 150: flake8: command not found
Cleaning up file based variables
00:01
+ set -eo pipefail
+ set +o noclobber
+ eval '$'\''rm'\'' "-f" "/builds/hawkeye/hawkeye.tmp/CI_SERVER_TLS_CA_FILE"
'
+ :
++ rm -f /builds/hawkeye/hawkeye.tmp/CI_SERVER_TLS_CA_FILE
+ exit 0
ERROR: Job failed: exit code 1
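
One way to make a job tolerate a missing or stale cache is to treat the restored venv as best effort and reinstall when the expected tool is absent. A minimal sketch, assuming a POSIX shell in the job image; the helper name ensure_tool is hypothetical, and the install command passed to it would be whatever the build job already runs.

```shell
# Run the given install command only if the tool is not on PATH
# after the cache has been restored (hypothetical helper).
ensure_tool() {
  tool=$1
  shift
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool found in restored environment"
  else
    echo "$tool missing after cache restore; running: $*"
    "$@"
  fi
}

# Example use in a job script, with the venv already activated:
#   ensure_tool flake8 pip install -r requirements.txt
```

This does not explain why the cache came back without flake8; it only keeps the job from failing when that happens, at the cost of a slower run.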

ScottFred · August 29, 2021, 1:14pm

@ricardoamarilla After more searching, I think my issue is related to the issue described in this GitLab Forum post.

I tried the solution described there of adding cache parameters to my docker executor GitLab Runner config.toml files and I restarted the docker containers. (see config.toml below)

The pipeline now completes IF I have only ONE shared runner enabled. In other words, when I have two shared runners available (one running on one EC2 instance and another running on another EC2 instance), when the runners start executing in parallel (during the static-analysis and test jobs), one of the runners always fails. During 4 pipeline runs, shared-runner-1 always failed, regardless of which job it ran (static-analysis or test). I’m guessing that’s just a coincidence because running with shared-runner-1 by itself is able to successfully complete the pipeline.

— UPDATE —
After I paused and then restarted each of the GitLab Runners in turn (and ran my pipeline which completed successfully with only one shared-runner available), I then enabled both shared runners again and ran a couple more tests. AND THIS TIME THE Pipeline completed successfully with both shared-runner-1 and shared-runner-2 participating in the pipeline! Why was I not able to use both GitLab Runners in the pipeline until I paused and restarted the shared runners? Should I have cleared the “Runner Caches”?
— end UPDATE —
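
A possible explanation, which this thread does not confirm: the job logs say the cache is "stored only locally", and the two runners live on different EC2 instances, so each host would keep its own copy of side-load-ossim-1. The toy simulation below uses temporary directories standing in for the two hosts (host1, host2, and the file packages.txt are made up) to show how a job landing on the other host could extract an older cache successfully and still lack flake8.

```shell
#!/usr/bin/env bash
set -eu

work=$(mktemp -d)   # stands in for the disks of the two runner hosts

# Save a cache under a key on one host only, as a local (non-shared) cache would.
save_cache() {      # usage: save_cache HOST KEY PACKAGES...
  host=$1; key=$2; shift 2
  mkdir -p "$work/$host/cache/$key"
  printf '%s\n' "$@" > "$work/$host/cache/$key/packages.txt"
}

# Restore whatever this host has stored under the key.
restore_cache() {   # usage: restore_cache HOST KEY
  cat "$work/$1/cache/$2/packages.txt" 2>/dev/null || echo "(no cache)"
}

# An earlier pipeline left a stale cache on host2 with only the base packages.
save_cache host2 side-load-ossim-1 pip setuptools wheel
# The current build job runs on host1 and saves a cache that includes flake8.
save_cache host1 side-load-ossim-1 pip setuptools wheel flake8

same_host=$(restore_cache host1 side-load-ossim-1 | tr '\n' ' ')
other_host=$(restore_cache host2 side-load-ossim-1 | tr '\n' ' ')
echo "static-analysis on host1 sees: $same_host"
echo "static-analysis on host2 sees: $other_host"

rm -rf "$work"
```

If this is what happens, whether a job passes depends on which runner picks it up, which would match the intermittent failures. The usual ways around it would be a shared cache (the [runners.cache.s3] section in config.toml), passing the build output as artifacts, or pinning dependent jobs to one runner with tags.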

Is there a gitlab-runner command I can execute on the running gitlab-runner to see what configuration it thinks it has? I tried a couple variations of the following command, but failed to be able to display the gitlab-runner config:

$ docker exec -it gitlab-runner gitlab-runner verify
Runtime platform                                    arch=amd64 os=linux pid=61 revision=58ba2b95 version=14.2.0
Running in system-mode.

Verifying runner... is alive                        runner=EwF7xhTi
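
`gitlab-runner verify` only checks that the registered runners can reach GitLab. `gitlab-runner list` prints the runners defined in the config file, and in system mode that file normally lives at /etc/gitlab-runner/config.toml, so reading it directly is one option. A small sketch that filters a config for the cache-related keys discussed in this thread; the function name show_cache_settings is hypothetical, and the container name gitlab-runner is taken from the command above.

```shell
# Print only the cache-related keys (with line numbers) from a runner
# config read on stdin (hypothetical helper).
show_cache_settings() {
  grep -nE '^[[:space:]]*(cache_dir|disable_cache|volumes)[[:space:]]*='
}

# Example use against the running container:
#   docker exec gitlab-runner cat /etc/gitlab-runner/config.toml | show_cache_settings
#   docker exec gitlab-runner gitlab-runner list
```

Running the filter on both runner hosts and comparing the output is a quick way to confirm that the two config.toml files really match.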

Here’s the config.toml on the failing shared runner - docker executor (the working shared runner config.toml is the same):

concurrent = 1
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "shared-runner-1"
  url = <hidden URL>
  token = "<hidden token>"
  executor = "docker"
  cache_dir = "/cache"
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.docker]
    tls_verify = false
    image = "centos:7.9"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    cache_dir = ""
    volumes = ["/cache"]
    shm_size = 0

ricardoamarilla · August 30, 2021, 12:21pm

I am glad to hear you have solved the issue.

There is a complete runner command list.

ScottFred · August 31, 2021, 3:03pm

Thanks for your assistance…
