Some jobs in my CI pipeline fail randomly, and the log gives no useful indication of what causes the failure. It seems to happen during the container setup for the step script, but I cannot find anything useful even with trace mode turned on:
Executing "step_script" stage of the job script 00:04
Using docker image sha256:dfdcd5fd09c3db7ecf3277645dc9a569942a2deb26dff4f98597e35657eae388 for git.herd.cloud.infn.it:5050/herd/herd-docker:ubuntu20.04_master with digest git.herd.cloud.infn.it:5050/herd/herd-docker@sha256:68e859570e923fc2a95ee5d5d646f86f7a3c289a1fedd40cddc8e9410c717646 ...
Cleaning up project directory and file based variables 00:04
+ grep pipefail
+ set -o
+ set -o pipefail
+ set -o errexit
+ set +o noclobber
+ :
+ eval '$'\''rm'\'' -f /builds/herd/HerdSoftware.tmp/CI_SERVER_TLS_CA_FILE '
++ rm -f /builds/herd/HerdSoftware.tmp/CI_SERVER_TLS_CA_FILE
+ exit 0
ERROR: Job failed: exit code 1
It seems to me that something goes wrong while setting up the container. However, the very same job sometimes runs correctly, and the problem also appears randomly in other jobs of the same pipeline. I have not been able to reproduce it consistently: it appears only during overnight scheduled pipelines, and manually re-triggering the failed job always leads to a successful run. I'd appreciate some hints on how to troubleshoot this issue.
I'm running a self-hosted GitLab 15.6.1 instance with a self-managed shared runner machine running gitlab-runner 15.6.1, configured as follows:
concurrent = 4
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "infn-cloud-runner-01"
  url = "https://git.herd.cloud.infn.it/"
  id = 0
  token = "<value>"
  token_obtained_at = 0001-01-01T00:00:00Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "docker"
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.docker]
    tls_verify = false
    image = "ruby:2.7"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache", "/sys/fs/cgroup:/sys/fs/cgroup:ro"]
    shm_size = 0
    devices = ["/dev/fuse"]
    cap_add = ["SYS_ADMIN"]
    security_opt = ["apparmor:unconfined"]
    [runners.docker.tmpfs]
      "/run" = "rw"
      "/tmp" = "rw"
The image used for executing the job runs systemd as ENTRYPOINT.
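One thing I could try is starting the image by hand, outside the runner, with options that approximate the [runners.docker] settings above, to see whether the container itself occasionally fails to come up. The following is only a minimal sketch: the mapping from config.toml keys to docker flags is my own assumption and not what gitlab-runner actually executes, and the container names (herd-repro-N) and the ATTEMPTS variable are hypothetical. By default it only prints the commands (dry run).

```shell
#!/usr/bin/env bash
# Sketch: approximate the [runners.docker] settings with plain docker flags.
# ASSUMPTION: this flag mapping is my own translation of config.toml,
# not the invocation gitlab-runner really performs.

IMAGE="git.herd.cloud.infn.it:5050/herd/herd-docker:ubuntu20.04_master"

# Print the docker command for one container (hypothetical name in $1).
docker_cmd() {
  local name="$1"
  printf '%s ' docker run -d --name "$name" \
    --tmpfs /run:rw --tmpfs /tmp:rw \
    -v /cache \
    -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
    --device /dev/fuse \
    --cap-add SYS_ADMIN \
    --security-opt apparmor:unconfined \
    "$IMAGE"
  printf '\n'
}

# Dry run: print one command per attempt (ATTEMPTS is a hypothetical knob).
for i in $(seq 1 "${ATTEMPTS:-3}"); do
  docker_cmd "herd-repro-$i"
done
```

Piping the output to sh would actually start the containers; afterwards, `docker inspect -f '{{.State.ExitCode}}' herd-repro-1` and `docker logs herd-repro-1` should show whether systemd started cleanly in each one.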