From time to time we see jobs that have completed successfully and uploaded their artifacts exit with exit status 137 (SIGKILL). My suspicion is that there is some grace period (10 s?) which, if exceeded, causes the runner to send SIGKILL to the job. Our logs can be quite large: they contain docker logs for the containers under test, test results and test coverage. I can imagine that collecting and uploading all of these files, along with the normal shutdown of the job, could take some time.
The after_script has completed successfully with no errors. The make target we run in it reports exit status 0. The artifacts also seem to have been uploaded correctly.
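For reference, the reading of 137 as SIGKILL follows the common shell convention that a process terminated by signal N is reported with exit status 128 + N. The snippet below is only a minimal sketch of that decoding; the helper name `describe_exit_status` is made up for illustration, and it assumes a POSIX system where signal 9 is SIGKILL. A process can also exit with a plain code above 128, so the decoding is a hint rather than proof.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Decode a shell-style exit status (hypothetical helper, for illustration)."""
    if status > 128:
        # Convention: 128 + N means "terminated by signal N".
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The signal number is not known on this platform.
            name = f"unknown signal {signum}"
        return f"terminated by {name} (signal {signum})"
    return f"exited normally with code {status}"


print(describe_exit_status(137))
print(describe_exit_status(0))
```

On Linux the first call reports SIGKILL (signal 9), which is consistent with something killing the job process rather than the job script itself failing.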
The tail end of the job log is:
logger nrg-vpp docker_down completed
24.37user 2.06system 1:14.65elapsed 35%CPU (0avgtext+0avgdata 57608maxresident)k
128inputs+1171664outputs (2major+300679minor)pagefaults 0swaps
section_end:1588759815:after_script
section_start:1588759815:upload_artifacts_on_failure
Uploading artifacts for failed job
Uploading artifacts...
Runtime platform arch=amd64 os=linux pid=99182 revision=ce065b93 version=12.10.1
target/logs: found 15 matching files
api/target/surefire-reports: found 70 matching files
api/target/failsafe-reports: found 91 matching files
tradepublisher/target/surefire-reports: found 13 matching files
billing/target/surefire-reports: found 24 matching files
handelsmengen/target/surefire-reports: found 16 matching files
sftp/target/surefire-reports: found 4 matching files
Uploading artifacts to coordinator... ok id=65186 responseStatus=201 Created token=5-Nqinfh
section_end:1588759816:upload_artifacts_on_failure
ERROR: Job failed: exit status 137
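The section_start and section_end markers in the log carry Unix timestamps, so the time spent in each section can be read off directly. The following is a rough sketch that pairs the markers and prints the elapsed seconds; the function name `section_times` is hypothetical, and the marker format is assumed to be exactly what appears in the excerpt above (`section_start:<timestamp>:<name>`).

```python
import re

# Matches markers such as "section_end:1588759815:after_script".
# The pattern is unanchored so leftover escape-sequence text before a marker is ignored.
SECTION_RE = re.compile(r"section_(start|end):(\d+):(\w+)")


def section_times(log_text):
    """Return {section_name: {"start": ts, "end": ts}} for the markers found."""
    sections = {}
    for kind, ts, name in SECTION_RE.findall(log_text):
        sections.setdefault(name, {})[kind] = int(ts)
    return sections


# Marker lines copied from the job log above.
log_tail = """\
section_end:1588759815:after_script
section_start:1588759815:upload_artifacts_on_failure
section_end:1588759816:upload_artifacts_on_failure
"""

for name, t in section_times(log_tail).items():
    if "start" in t and "end" in t:
        print(f"{name}: {t['end'] - t['start']}s")
    else:
        # Only one marker of the pair is present in this excerpt.
        print(f"{name}: incomplete markers {t}")
```

For the excerpt above, the upload section starts at 1588759815 and ends at 1588759816, i.e. about one second, and the failure is reported right after it.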
In syslog I see:
May 6 10:10:15 runner007 gitlab-runner: nrg-vpp docker_down completed
May 6 10:10:15 runner007 systemd[1]: Stopping User Manager for UID 998...
May 6 10:10:15 runner007 systemd[96745]: Stopped target Default.
May 6 10:10:15 runner007 systemd[96745]: Stopped target Basic System.
May 6 10:10:15 runner007 systemd[96745]: Stopped target Sockets.
May 6 10:10:15 runner007 systemd[96745]: Closed GnuPG cryptographic agent and passphrase cache (access for web browsers).
May 6 10:10:15 runner007 systemd[96745]: Closed GnuPG cryptographic agent (ssh-agent emulation).
May 6 10:10:15 runner007 systemd[96745]: Closed REST API socket for snapd user session agent.
May 6 10:10:15 runner007 systemd[96745]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
May 6 10:10:15 runner007 systemd[96745]: Closed GnuPG cryptographic agent and passphrase cache.
May 6 10:10:15 runner007 systemd[96745]: Stopped target Timers.
May 6 10:10:15 runner007 systemd[96745]: Closed GnuPG network certificate management daemon.
May 6 10:10:15 runner007 systemd[96745]: Stopped target Paths.
May 6 10:10:15 runner007 systemd[96745]: Reached target Shutdown.
May 6 10:10:15 runner007 systemd[96745]: Starting Exit the Session...
May 6 10:10:15 runner007 systemd[1]: Started Session c628 of user gitlab-runner.
May 6 10:10:15 runner007 systemd[96745]: Received SIGRTMIN+24 from PID 99154 (kill).
My questions are therefore:
- Is there such a grace period?
- If it exists, is it configurable?
- If so, where?
Any advice is welcome.