Grace period for job completion: is it configurable?

We are seeing, from time to time, jobs which have completed successfully and uploaded their artefacts, yet exit with status 137 (SIGKILL). My suspicion is that there is some grace period (10s?) which may be exceeded, after which the runner sends a SIGKILL to the job. Our logs can be quite large, containing docker logs for the containers under test as well as test results and test coverage, so I can imagine that all of this file collection and uploading, along with the normal shutdown of the job, could take some time.
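For context on how I read the 137: exit statuses above 128 conventionally encode 128 plus the number of the signal that killed the process, which is why 137 points at SIGKILL. A quick shell check:

```shell
# Exit statuses above 128 mean "killed by signal (status - 128)".
echo $((137 - 128))   # prints 9
kill -l 9             # prints KILL, i.e. SIGKILL
```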

The after_script has completed successfully with no errors. The make target we run there reports exit status 0. The artifacts also appear to have been uploaded correctly.

The tail end of the job log is:

logger nrg-vpp docker_down completed
24.37user 2.06system 1:14.65elapsed 35%CPU (0avgtext+0avgdata 57608maxresident)k
128inputs+1171664outputs (2major+300679minor)pagefaults 0swaps
section_end:1588759815:after_script
section_start:1588759815:upload_artifacts_on_failure
Uploading artifacts for failed job
Uploading artifacts...
Runtime platform                                    arch=amd64 os=linux pid=99182 revision=ce065b93 version=12.10.1
target/logs: found 15 matching files
api/target/surefire-reports: found 70 matching files
api/target/failsafe-reports: found 91 matching files
tradepublisher/target/surefire-reports: found 13 matching files
billing/target/surefire-reports: found 24 matching files
handelsmengen/target/surefire-reports: found 16 matching files
sftp/target/surefire-reports: found 4 matching files
Uploading artifacts to coordinator... ok            id=65186 responseStatus=201 Created token=5-Nqinfh
section_end:1588759816:upload_artifacts_on_failure
ERROR: Job failed: exit status 137

In syslog I see:

May  6 10:10:15 runner007 gitlab-runner: nrg-vpp docker_down completed
May  6 10:10:15 runner007 systemd[1]: Stopping User Manager for UID 998...
May  6 10:10:15 runner007 systemd[96745]: Stopped target Default.
May  6 10:10:15 runner007 systemd[96745]: Stopped target Basic System.
May  6 10:10:15 runner007 systemd[96745]: Stopped target Sockets.
May  6 10:10:15 runner007 systemd[96745]: Closed GnuPG cryptographic agent and passphrase cache (access for web browsers).
May  6 10:10:15 runner007 systemd[96745]: Closed GnuPG cryptographic agent (ssh-agent emulation).
May  6 10:10:15 runner007 systemd[96745]: Closed REST API socket for snapd user session agent.
May  6 10:10:15 runner007 systemd[96745]: Closed GnuPG cryptographic agent and passphrase cache (restricted).
May  6 10:10:15 runner007 systemd[96745]: Closed GnuPG cryptographic agent and passphrase cache.
May  6 10:10:15 runner007 systemd[96745]: Stopped target Timers.
May  6 10:10:15 runner007 systemd[96745]: Closed GnuPG network certificate management daemon.
May  6 10:10:15 runner007 systemd[96745]: Stopped target Paths.
May  6 10:10:15 runner007 systemd[96745]: Reached target Shutdown.
May  6 10:10:15 runner007 systemd[96745]: Starting Exit the Session...
May  6 10:10:15 runner007 systemd[1]: Started Session c628 of user gitlab-runner.
May  6 10:10:15 runner007 systemd[96745]: Received SIGRTMIN+24 from PID 99154 (kill).

My questions are therefore

  1. Is there such a grace period?
  2. If it exists, is it configurable?
  3. If so, where?

Any advice is welcome

Hi,

It seems the SIGKILL originates from a job timeout. You can configure this with the `timeout` keyword for jobs. See the docs for reference.
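For reference, a minimal sketch of the `timeout` keyword in `.gitlab-ci.yml` (the job name and script are made up for illustration):

```yaml
# Hypothetical job showing the per-job timeout override
integration-test:
  timeout: 2h
  script:
    - make integration-test
```

Note that a per-job `timeout` is still capped by the runner's maximum job timeout, if one is configured on the runner.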

That said, exit code 137 can indicate more than just a timeout. Can you share a little more about your setup and the versions involved? Maybe one of the host systems or containers is causing trouble here; the same goes for filesystem caches. Or the jobs run out of memory and the containers are killed by the host system.

  • GitLab & runner versions
  • Host system (cat /etc/os-release)
  • Docker version
  • System specifics, e.g. filesystem or SELinux
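A quick way to gather these on the runner host (the commands that may not be present, such as the SELinux tooling, are guarded):

```shell
# Collect the requested details; commands that may be missing are guarded.
gitlab-runner --version 2>/dev/null || echo "gitlab-runner: not installed"
cat /etc/os-release 2>/dev/null || echo "/etc/os-release not found"
docker --version 2>/dev/null || echo "docker: not installed"
getenforce 2>/dev/null || echo "SELinux tooling not installed"
```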

Cheers,
Michael

Thanks Michael.

The job timeout for the project is 1.5 hrs; the duration of the job was 49 minutes. The runner has the default timeout of 1 hr (so I have to reduce the project timeout).

When a job times out, that will AFAIK be reported in the log like:

WARNING: Timed out waiting for the build to finish
ERROR: Job failed: execution took longer than 1h0m0s seconds

There’s no OOM-killer activity in syslog. The only pertinent information in syslog was already shown above. The thing is that we see this on many jobs, and it looks like everything has succeeded: the SIGKILL comes in after the job and the after_script have completed successfully.

You mention something about the filesystem cache. This issue only appears to happen on jobs which produce a large number of potentially large logs. Where does the filesystem cache fit into this equation?

Requested system details:

noumena@runner001:~$ gitlab-runner --version
Version:      12.10.1
Git revision: ce065b93
Git branch:   12-10-stable
GO version:   go1.13.8
Built:        2020-04-22T21:29:52+0000
OS/Arch:      linux/amd64
noumena@runner001:~$ cat /etc/os-release
NAME="Ubuntu"
VERSION="18.04.4 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.4 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

Docker is not in use at this point in the build

Docker version 19.03.6, build 369ce74a3c

RAM 16 GB, disk 30 GB, 4 vCPUs

Hi,

Docker uses specific filesystems for container management, overlay2 being one example. You can make mistakes and configure the wrong one there, resulting in odd errors.

Since you are not using Docker here, this guess falls short.

Can you share the .gitlab-ci.yml to give an idea of the task being run for this job? Are there any network-related tasks involved, such as downloading/uploading artifacts or packages? Did you add an `echo 'done.'` statement as the last array item in the script section to pinpoint where it really fails?

Also, I’d recommend the “timing point” strategy of tracing the time spans. One insight may come from the debug logging/tracing of the runner. See here for how to enable this.
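For the record, a sketch of raising the runner's log level (assuming the standard config path on a Linux install):

```toml
# /etc/gitlab-runner/config.toml, global section: raise the log level
log_level = "debug"
```

Alternatively, the runner can be started in the foreground with `gitlab-runner --debug run` to get the same verbose output without editing the config.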

Cheers,
Michael

Thanks I’ll wire up the debug log and post the requested stuff tomorrow.

Turns out it was an over-zealous cron job :frowning: The signal in the syslog was unrelated, just confusing…

Thanks for your help Michael.
