After migrating from group-specific GitLab Runners to two newly installed and registered Shared Runners, pipelines across all projects started failing about 60% of the time, in random jobs within the pipeline. Retrying a failed job sometimes makes it pass (without any other changes), but the pipeline may then fail in the next job. What is the best way to troubleshoot or resolve possible GitLab Runner caching issues? Could it be the GitLab Docker Runner cache? The Python cache within the build? Something else?
- What are you seeing, and how does that differ from what you expect to see?
Jobs on the GitLab Runners fail at random, but sometimes pass if we “Retry” the job. I expect not to have to retry the job. If they don’t pass on retry, we can get the pipeline to work by re-running the entire pipeline.
- What version are you on? Are you using self-managed or GitLab.com?
We are on the self-hosted, community edition (CE) of GitLab, running 14.2.1
The GitLab Runners are also at version 14.2 running as Docker Containers
We are building Python and Java packages and bundling them into a Docker Container using Kaniko.
- Add the CI configuration from .gitlab-ci.yml and other configuration if relevant (e.g. docker-compose.yml)
config.toml
concurrent = 1
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "shared-runner-2"
  url = "https://<external_url>"
  token = "<hidden_token>"
  executor = "docker"
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.docker]
    tls_verify = false
    image = "centos7.9"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0
.gitlab-ci.yml
# Disable the Gradle daemon for Continuous Integration servers as correctness
# is usually a priority over speed in CI environments. Using a fresh
# runtime for each build is more reliable since the runtime is completely
# isolated from any previous builds.
variables:
  GRADLE_OPTS: "-Dorg.gradle.daemon=false"
  SCRIPTS_REPO: ssh://git@$URL
stages:
  - .pre
  - build
  - publish-snapshot
  - dockerize
  - deploy-staging
  - version
  - publish-release
  - deploy-prod
before_script:
  - export GRADLE_USER_HOME=`pwd`/.gradle
cache:
  key: "$CI_COMMIT_REF_NAME" #optional: per branch caching, can be omitted
  paths:
    - .gradle/caches/
    - .gradle/wrapper/
    - .gradle/build-cache/
build:
  image: gradle:jdk11
  stage: build
  only:
    - branches
  script:
    - ./gradlew ${INIT_SCRIPT_ARGS} --build-cache assemble
    - ./gradlew ${INIT_SCRIPT_ARGS} bootJar
  artifacts:
    paths:
      - sgs-tman-simulator/build/libs/sgs-tman-simulator*.jar
      - mock-snapglass-cr-service/build/libs/mock-snapglass-cr-service*.jar
      - mock-ibs-service/build/libs/mock-ibs-service*.jar
    expire_in: 1 week
publish-snapshot:
  image: gradle:jdk11
  stage: publish-snapshot
  only:
    - branches
  except:
    refs:
      - master
  script:
    - ./gradlew ${INIT_SCRIPT_ARGS} publish
dockerize:
  stage: dockerize
  only:
    refs:
      - master
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    - mkdir -p /kaniko/.docker
    - echo "{\"auths\":{\"$CI_REGISTRY\":{\"username\":\"$CI_REGISTRY_USER\",\"password\":\"$CI_REGISTRY_PASSWORD\"}}}" > /kaniko/.docker/config.json
    - echo "Building sgs-tman-simulator image"
    - /kaniko/executor --context $CI_PROJECT_DIR/sgs-tman-simulator/ --dockerfile $CI_PROJECT_DIR/sgs-tman-simulator/Dockerfile --destination $CI_REGISTRY_IMAGE/sgs-tman-simulator:$CI_COMMIT_SHORT_SHA --cleanup
    - echo "Building mock-snapglass-cr-service image"
    - /kaniko/executor --context $CI_PROJECT_DIR/mock-snapglass-cr-service/ --dockerfile $CI_PROJECT_DIR/mock-snapglass-cr-service/Dockerfile --destination $CI_REGISTRY_IMAGE/mock-snapglass-cr-service:$CI_COMMIT_SHORT_SHA --cleanup
    - echo "Building activemq-high image"
    - /kaniko/executor --context $CI_PROJECT_DIR/activemq-high/ --dockerfile $CI_PROJECT_DIR/activemq-high/Dockerfile --destination $CI_REGISTRY_IMAGE/activemq-high:$CI_COMMIT_SHORT_SHA --cleanup
    - echo "Building mock-ibs-service image"
    - /kaniko/executor --context $CI_PROJECT_DIR/mock-ibs-service/ --dockerfile $CI_PROJECT_DIR/mock-ibs-service/Dockerfile --destination $CI_REGISTRY_IMAGE/mock-ibs-service:$CI_COMMIT_SHORT_SHA --cleanup
    - echo "Building xocomm-simulator image $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
    - export build_date=`date -u +"%Y-%m-%dT%H:%M:%SZ"`
    - /kaniko/executor --context $CI_PROJECT_DIR/xocomm-simulator/ --dockerfile $CI_PROJECT_DIR/xocomm-simulator/Dockerfile --destination $CI_REGISTRY_IMAGE/$XOCOMM_SIMULATOR_IMAGE_NAME --build-arg=VCS_REF=$CI_COMMIT_SHORT_SHA --build-arg=BUILD_DATE=$build_date --cleanup
deploy-staging:
  stage: deploy-staging
  image: kroniak/ssh-client
  only:
    refs:
      - master
  script:
    - mkdir -p ~/.ssh
    - echo "$SSH_KNOWN_HOSTS_ip_10_1_1_20" >> ~/.ssh/known_hosts
    - echo "$SSH_KNOWN_HOSTS_ip_10_1_1_22" >> ~/.ssh/known_hosts
    - chmod 644 ~/.ssh/known_hosts
    - eval "$(ssh-agent -s)"
    - ssh-add <(echo "$SSH_PRIVATE_KEY_ip_10_1_1_20")
    - ssh-add <(echo "$SSH_PRIVATE_KEY_ip_10_1_1_22")
    - export ACTIVE_PROFILE=int
    - printf "COMPOSE_HTTP_TIMEOUT=300\nMOCK_IBS_IMAGE_NAME=$MOCK_IBS_IMAGE_NAME\nACTIVEMQ_IMAGE_NAME=$ACTIVEMQ_IMAGE_NAME\nACTIVEMQ_HIGH_IMAGE_NAME=$ACTIVEMQ_HIGH_IMAGE_NAME\nTMAN_SIMULATOR_IMAGE_NAME=$TMAN_SIMULATOR_IMAGE_NAME\nMOCK_SG_CR_IMAGE_NAME=$MOCK_SG_CR_IMAGE_NAME\nSTORE_PASSWORD=$STORE_PASSWORD\nPROJECT_PORT=$PROJECT_PORT\nPROJECT_SERVER=$IP_ADDR_ip_10_1_1_20\nCI_COMMIT_SHORT_SHA=$CI_COMMIT_SHORT_SHA\n" > .env
    - printf "ACTIVE_PROFILE=$ACTIVE_PROFILE\nACTIVEMQ_BROKER=$ACTIVEMQ_BROKER\nACTIVEMQ_PASSWORD=$ACTIVEMQ_PASSWORD\nACTIVEMQ_USER=$ACTIVEMQ_USER\nINT_BROKER_URL=$INT_BROKER_URL\nINT_HIGH_BROKER_URL=$INT_HIGH_BROKER_URL\nPROD_BROKER_URL=$PROD_BROKER_URL\nPROD_HIGH_BROKER_URL=$PROD_HIGH_BROKER_URL\nDOCKER_NETWORK=$DOCKER_NETWORK\nDOCKER_REPO=$DOCKER_REPO\n" >> .env
    - printf "MOCK_IMAGE_SERVER=$MOCK_IMAGE_SERVER\nMOCK_IMAGE_PORT=$MOCK_IMAGE_PORT\n" >> .env
    - printf "CDS_LOW_HOST=$CDS_LOW_HOST\nCDS_HIGH_HOST=$CDS_HIGH_HOST\nCDS_DESTINATION_TASKING_PATH=$CDS_DESTINATION_TASKING_PATH\nCDS_DESTINATION_TASKING_STATUS_PATH=$CDS_DESTINATION_TASKING_STATUS_PATH\nCDS_DESTINATION_UPDATE_PATH=$CDS_DESTINATION_UPDATE_PATH\n" >> .env
    - printf "CDS_SOURCE_TASKING_PATH=$CDS_SOURCE_TASKING_PATH\nCDS_SOURCE_TASKING_STATUS_PATH=$CDS_SOURCE_TASKING_STATUS_PATH\nCDS_SOURCE_UPDATE_PATH=$CDS_SOURCE_UPDATE_PATH\nCDS_SOURCE_POSITION_UPDATE_PATH=$CDS_SOURCE_POSITION_UPDATE_PATH\nCDS_DESTINATION_POSITION_UPDATE_PATH=$CDS_DESTINATION_POSITION_UPDATE_PATH\nSIM_PROXY_PORT=$SIM_PROXY_PORT\n" >> .env
    - printf "XOCOMM_SIMULATOR_IMAGE_NAME=$XOCOMM_SIMULATOR_IMAGE_NAME\n" >> .env
    - echo "$(cat .env)"
    ## - ssh -Tvv deployer@$IP_ADDR_ip_10_1_1_20
    - scp -r ./.env ./docker-compose-sgs-mocks.yml deployer@$IP_ADDR_ip_10_1_1_20:~/
    - scp -r ./.env ./docker-compose-sgs-mocks-high.yml deployer@$IP_ADDR_ip_10_1_1_22:~/
    - ssh deployer@$IP_ADDR_ip_10_1_1_20 ". .env; docker login -u gitlab-ci-token -p $CI_BUILD_TOKEN $CI_REGISTRY; docker-compose -f docker-compose-sgs-mocks.yml stop; docker-compose -f docker-compose-sgs-mocks.yml rm --force; docker-compose -f docker-compose-sgs-mocks.yml pull; docker-compose -f docker-compose-sgs-mocks.yml up -d;"
    - ssh deployer@$IP_ADDR_ip_10_1_1_22 ". .env; docker login -u gitlab-ci-token -p $CI_BUILD_TOKEN $CI_REGISTRY; docker-compose -f docker-compose-sgs-mocks-high.yml stop; docker-compose -f docker-compose-sgs-mocks-high.yml rm --force; docker-compose -f docker-compose-sgs-mocks-high.yml pull; docker-compose -f docker-compose-sgs-mocks-high.yml up -d;"
version:
  image: python:3.7-stretch
  stage: version
  only:
    refs:
      - master
  before_script:
    - export SCRIPTS_DIR=$(mktemp -d)
    - echo $SCRIPTS_DIR
  script:
    - mkdir -p ~/.ssh && chmod 700 ~/.ssh
    - ssh-keyscan -p 10023 $URL >> ~/.ssh/known_hosts && chmod 644 ~/.ssh/known_hosts
    - eval $(ssh-agent -s)
    - ssh-add <(echo "$SSH_PRIVATE_KEY_SEMVER")
    - pip install semver
    # Force an update to the Git Origin URL with the correct port
    - git remote remove origin
    - git remote add origin git@$URL
    - git clone -q --depth 1 ${SCRIPTS_REPO} ${SCRIPTS_DIR}
    - git remote -v
    - $SCRIPTS_DIR/ci-scripts/common/gen-semver.py | tee tag_version
  artifacts:
    paths:
      - tag_version
# Clean and rebuild jars with new release tag in name then publish
publish-release:
  image: gradle:jdk11
  stage: publish-release
  only:
    refs:
      - master
  script:
    - ./gradlew ${INIT_SCRIPT_ARGS} clean
    - ./gradlew ${INIT_SCRIPT_ARGS} --build-cache assemble
    - ./gradlew ${INIT_SCRIPT_ARGS} bootJar
    - ./gradlew ${INIT_SCRIPT_ARGS} publish
  artifacts:
    paths:
      - sgs-orchestrator/build/libs/sgs-orchestrator*.jar
      - uci-service/build/libs/uci-service*.jar
      - request-repo/build/libs/request-repo*.jar
      - sentient-request-status-normalizer/build/libs/sentient-request-status-normalizer*.jar
    expire_in: 1 week
deploy-prod:
  stage: deploy-prod
  environment: production
  image: kroniak/ssh-client
  only:
    refs:
      - master
  when: manual
  script:
    - mkdir -p ~/.ssh
    - echo "$SSH_KNOWN_HOSTS_ip_10_1_1_21" >> ~/.ssh/known_hosts
    - echo "$SSH_KNOWN_HOSTS_ip_10_1_1_23" >> ~/.ssh/known_hosts
    - chmod 644 ~/.ssh/known_hosts
    - eval "$(ssh-agent -s)"
    - ssh-add <(echo "$SSH_PRIVATE_KEY_ip_10_1_1_21")
    - ssh-add <(echo "$SSH_PRIVATE_KEY_ip_10_1_1_23")
    - printf "COMPOSE_HTTP_TIMEOUT=300\nMOCK_IBS_IMAGE_NAME=$MOCK_IBS_IMAGE_NAME\nACTIVEMQ_IMAGE_NAME=$ACTIVEMQ_IMAGE_NAME\nACTIVEMQ_HIGH_IMAGE_NAME=$ACTIVEMQ_HIGH_IMAGE_NAME\nDB_NAME=$DB_NAME\nDB_USERNAME=$DB_USERNAME\nDB_SERVER=$IP_ADDR_ip_10_1_1_21\nDB_PORT=$DB_PORT\nDB_PASSWORD=$DB_PASSWORD\nTMAN_SIMULATOR_IMAGE_NAME=$TMAN_SIMULATOR_IMAGE_NAME\nMOCK_SG_CR_IMAGE_NAME=$MOCK_SG_CR_IMAGE_NAME\nSTORE_PASSWORD=$STORE_PASSWORD\nPROJECT_PORT=$PROJECT_PORT\nPROJECT_SERVER=$IP_ADDR_ip_10_1_1_20\nCI_COMMIT_SHORT_SHA=$CI_COMMIT_SHORT_SHA\n" > .env
    - printf "ACTIVE_PROFILE=$ACTIVE_PROFILE\nACTIVEMQ_BROKER=$ACTIVEMQ_BROKER\nACTIVEMQ_PASSWORD=$ACTIVEMQ_PASSWORD\nACTIVEMQ_USER=$ACTIVEMQ_USER\nINT_BROKER_URL=$INT_BROKER_URL\nINT_HIGH_BROKER_URL=$INT_HIGH_BROKER_URL\nPROD_BROKER_URL=$PROD_BROKER_URL\nPROD_HIGH_BROKER_URL=$PROD_HIGH_BROKER_URL\nDOCKER_NETWORK=$DOCKER_NETWORK\nDOCKER_REPO=$DOCKER_REPO\n" >> .env
    # NOTE: if we stand a prod imagery server, will we need to account for that here
    - printf "MOCK_IMAGE_SERVER=$MOCK_IMAGE_SERVER\nMOCK_IMAGE_PORT=$MOCK_IMAGE_PORT\n" >> .env
    - printf "CDS_LOW_HOST=$CDS_LOW_HOST\nCDS_HIGH_HOST=$CDS_HIGH_HOST\nCDS_DESTINATION_TASKING_PATH=$CDS_DESTINATION_TASKING_PATH\nCDS_DESTINATION_TASKING_STATUS_PATH=$CDS_DESTINATION_TASKING_STATUS_PATH\nCDS_DESTINATION_UPDATE_PATH=$CDS_DESTINATION_UPDATE_PATH\n" >> .env
    - printf "CDS_SOURCE_TASKING_PATH=$CDS_SOURCE_TASKING_PATH\nCDS_SOURCE_TASKING_STATUS_PATH=$CDS_SOURCE_TASKING_STATUS_PATH\nCDS_SOURCE_UPDATE_PATH=$CDS_SOURCE_UPDATE_PATH\nCDS_SOURCE_POSITION_UPDATE_PATH=$CDS_SOURCE_POSITION_UPDATE_PATH\nCDS_DESTINATION_POSITION_UPDATE_PATH=$CDS_DESTINATION_POSITION_UPDATE_PATH\nSIM_PROXY_PORT=$SIM_PROXY_PORT\n" >> .env
    - printf "XOCOMM_SIMULATOR_IMAGE_NAME=$XOCOMM_SIMULATOR_IMAGE_NAME\n" >> .env
    - printf "SNAPGLASS_IP=$SNAPGLASS_IP\n" >> .env
    - echo "$(cat .env)"
    - scp -r ./.env ./docker-compose-sgs-mocks.yml deployer@$IP_ADDR_ip_10_1_1_21:~/
    - scp -r ./.env ./docker-compose-sgs-mocks-high.yml deployer@$IP_ADDR_ip_10_1_1_23:~/
    - ssh deployer@$IP_ADDR_ip_10_1_1_21 ". .env; docker login -u gitlab-ci-token -p $CI_BUILD_TOKEN $CI_REGISTRY; docker-compose -f docker-compose-sgs-mocks.yml stop; docker-compose -f docker-compose-sgs-mocks.yml rm --force; docker-compose -f docker-compose-sgs-mocks.yml pull; docker-compose -f docker-compose-sgs-mocks.yml up -d"
    - ssh deployer@$IP_ADDR_ip_10_1_1_23 ". .env; docker login -u gitlab-ci-token -p $CI_BUILD_TOKEN $CI_REGISTRY; docker-compose -f docker-compose-sgs-mocks-high.yml stop; docker-compose -f docker-compose-sgs-mocks-high.yml rm --force; docker-compose -f docker-compose-sgs-mocks-high.yml pull; docker-compose -f docker-compose-sgs-mocks-high.yml up -d;"
build-dal:
  stage: .pre
  image: python:3.7-stretch
  when: manual
  script:
    # build data ascension list
    - echo | find . -type d -name '.git' -prune -o -type f -name '.git*' -prune -o -type f -printf "%TY-%Tm-%Td %.8TT\t%s\t%p\n" | numfmt --field=3 --to=iec --padding=-4 > sgs_mock-services_dal.txt
  artifacts:
    paths:
      - sgs_mock-services_dal.txt
    expire_in: 1 week
- What troubleshooting steps have you already taken? Can you link to any docs or other resources so we know where you have been?
We’ve turned on debug to watch the details of the runner.
We’ve watched the build progress and added echo debug output to the .gitlab-ci.yml file.
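One further step we are considering is tallying failures by runner, to see whether they cluster on one of the two Shared Runners. Below is a minimal sketch of that tally, not something we have run against real data. It assumes job results have been collected by hand into records with hypothetical `runner` and `status` keys; the sample records are made up for illustration, and the name `shared-runner-1` is an assumption (only `shared-runner-2` appears in our config.toml).

```python
from collections import Counter


def failure_rate_by_runner(jobs):
    """Return {runner_name: fraction of that runner's jobs that failed}."""
    total = Counter()
    failed = Counter()
    for job in jobs:
        total[job["runner"]] += 1
        if job["status"] == "failed":
            failed[job["runner"]] += 1
    return {runner: failed[runner] / total[runner] for runner in total}


# Made-up sample records for illustration only; not real pipeline data.
# "shared-runner-1" is a hypothetical name for the other Shared Runner.
sample_jobs = [
    {"runner": "shared-runner-1", "status": "success"},
    {"runner": "shared-runner-1", "status": "failed"},
    {"runner": "shared-runner-2", "status": "success"},
    {"runner": "shared-runner-2", "status": "success"},
]

rates = failure_rate_by_runner(sample_jobs)
for runner, rate in sorted(rates.items()):
    print(f"{runner}: {rate:.0%} failed")
```

If the failures turned out to be concentrated on one runner, that would point at something local to that host (for example its local cache volume) rather than at the .gitlab-ci.yml itself.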