GitLab CI failing randomly with no log at "Executing step_script"

nicolamori · December 3, 2022, 10:00am

Some jobs in my CI pipeline fail randomly, with no useful log output indicating what causes the failure. It seems to occur during the container setup for the step script, but I cannot find anything useful even with trace mode turned on:

Executing "step_script" stage of the job script 00:04
Using docker image sha256:dfdcd5fd09c3db7ecf3277645dc9a569942a2deb26dff4f98597e35657eae388 for git.herd.cloud.infn.it:5050/herd/herd-docker:ubuntu20.04_master with digest git.herd.cloud.infn.it:5050/herd/herd-docker@sha256:68e859570e923fc2a95ee5d5d646f86f7a3c289a1fedd40cddc8e9410c717646 ...
Cleaning up project directory and file based variables 00:04
+ grep pipefail
+ set -o
+ set -o pipefail
+ set -o errexit
+ set +o noclobber
+ :
+ eval '$'\''rm'\'' -f /builds/herd/HerdSoftware.tmp/CI_SERVER_TLS_CA_FILE
'
++ rm -f /builds/herd/HerdSoftware.tmp/CI_SERVER_TLS_CA_FILE
+ exit 0
ERROR: Job failed: exit code 1

It seems to me that there is some problem in setting up the container; however, the very same job sometimes runs correctly, and the problem also appears randomly in other jobs of the same pipeline. I have not been able to reproduce it consistently: it appears only during overnight scheduled pipelines, and manually re-triggering the failed job consistently leads to a successful run. I’d appreciate some hints on how to troubleshoot this issue.

I am running a self-hosted GitLab 15.6.1 instance, with a self-managed shared runner machine running gitlab-runner 15.6.1, configured as follows:

concurrent = 4
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "infn-cloud-runner-01"
  url = "https://git.herd.cloud.infn.it/"
  id = 0
  token = "<value>"
  token_obtained_at = 0001-01-01T00:00:00Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "docker"
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
  [runners.docker]
    tls_verify = false
    image = "ruby:2.7"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache", "/sys/fs/cgroup:/sys/fs/cgroup:ro"]
    shm_size = 0
    devices = ["/dev/fuse"]
    cap_add = ["SYS_ADMIN"]
    security_opt = ["apparmor:unconfined"]
    [runners.docker.tmpfs]
      "/run" = "rw"
      "/tmp" = "rw"

The image used for executing the job runs systemd as ENTRYPOINT.
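
For anyone trying to reproduce the container setup outside of CI, the [runners.docker] options above can be roughly approximated as docker run flags. This is only a sketch: the runner creates its containers differently in detail (entrypoint and command handling, cache volumes, networking are not reproduced here), the flag mapping is my own assumption, and runner_like_cmd is a hypothetical helper name. The image name is the one from the job log. The script only prints the command; it does not start anything.

```shell
#!/bin/sh
# Image name taken from the job log above.
IMAGE="git.herd.cloud.infn.it:5050/herd/herd-docker:ubuntu20.04_master"

# Hypothetical helper: print a `docker run` command that approximates the
# [runners.docker] settings. Assumed mapping, not what gitlab-runner does verbatim:
#   cap_add      -> --cap-add
#   devices      -> --device
#   security_opt -> --security-opt (written here in the key=value form)
#   tmpfs        -> --tmpfs
#   volumes      -> -v (the /cache volume is left out on purpose)
runner_like_cmd() {
  printf '%s ' docker run --rm \
    --cap-add SYS_ADMIN \
    --device /dev/fuse \
    --security-opt apparmor=unconfined \
    --tmpfs /run:rw --tmpfs /tmp:rw \
    -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
    "$IMAGE"
  printf '\n'
}

# Dry run: show the command so it can be copied and run on the runner host.
runner_like_cmd
```

Running the printed command repeatedly on the runner host, ideally around the time the nightly pipelines start, might show whether the systemd entrypoint itself fails intermittently under these options.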

HasanZaki · November 26, 2023, 9:47am

Did you find any solution? I encountered the same error.

nicolamori · November 26, 2023, 5:07pm

Unfortunately not. I disabled the nightlies several months ago and honestly I lost track of this problem.