Hello everyone!
I'm looking for help with a rather specific issue that is perhaps a borderline defect. It’s a long one; any help or thoughts would be greatly appreciated.
Setup
I have configured two very basic GitLab runners:
- a Linux runner with a shell executor (Bash);
- a Windows runner with a shell executor (also Bash, available via Mintty).
Here’s the Windows runner config. Not adding the one for Linux as it seems irrelevant (no problems on Linux).
concurrent = 1
check_interval = 0
shutdown_timeout = 0
[session_server]
session_timeout = 1800
[[runners]]
name = "WindowsRunner"
url = "....REDACTED..."
id = 8
token = "...REDACTED..."
token_obtained_at = 2023-03-26T11:21:35Z
token_expires_at = 0001-01-01T00:00:00Z
executor = "shell"
shell = "bash"
builds_dir = "/c/gitlab-runner/builds/"
cache_dir = "/c/gitlab-runner/cache/"
[runners.cache]
MaxUploadedArchiveSize = 0
I am not sure if it is good practice to use Bash on a Windows runner, but it does work, at least as long as the CI scripts are not too complex.
For testing purposes I have this simple script:
#!/bin/bash
pid=$$
echo "My PID is $pid"
# This line detects a SIGTERM sent to the process running this script,
# creates a file as a proof and terminates the process using SIGTERM
trap 'echo "Caught SIGTERM, terminating" > xxx.log; trap - TERM; kill $pid' TERM
# Just do this forever so that the process never exits on its own...
while true; do
    echo ok
    sleep 1
done
I am running this script as a single-stage test pipeline on both runners. Here’s my .gitlab-ci.yml:
stages:
  - Test
Job Linux:
  stage: Test
  tags:
    - linux
  script: scripts/test.sh
Job Windows:
  stage: Test
  tags:
    - windows
  script: scripts/test.sh
Problem
To reproduce the issue:
- run the pipeline; two jobs are created - one running on Windows and the other one on Linux; both jobs run the same script;
- cancel the pipeline (or the jobs one-by-one).
Result - the process is terminated on the Linux runner but not on the Windows runner.
Exactly the same happens when the jobs time out.
Investigation
I have found a few seemingly unresolved problems that may or may not be related, such as this one: Processes on the Runner host are not terminated when a CI/CD job is cancelled in the UI (&10469) · Epics · GitLab.org · GitLab. However, most of these issue reports mention specifically Bash on Linux, which in my case works fine, so presumably they are not related.
Nevertheless, here is one MR, which is definitely related: Make commander start process group for each process (!1743) · Merge requests · GitLab.org / gitlab-runner · GitLab. Quote:
Update the unix killer to be aware of process groups and send the singal to each process instead of just the main process to prevent any orphan processes.
With this change, any processes launched by the runner for a given job will belong to the same group. In case of a job cancellation, all processes belonging to this group should be terminated using SIGTERM. The code in the MR is pretty straightforward.
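To make the mechanism concrete, here is a minimal Bash sketch of the same idea: start a job in its own process group, then send SIGTERM to the whole group through a negative PID. This is not the runner's actual code (the runner is written in Go); it only illustrates group signalling, and it assumes a Linux host. The inner `bash -c` command is a made-up stand-in for a CI job, and its trap reaps the child so that the final check is not confused by zombie processes.

```shell
#!/bin/bash
# Sketch only: illustrates signalling a process group, not the runner's code.
set -m   # job control: each background job gets its own process group

# Hypothetical stand-in for a CI job: a leader shell with one child (sleep).
# The leader traps TERM, reaps its child, then exits with 128 + 15.
bash -c 'trap "wait; exit 143" TERM; sleep 300 & wait' &
pgid=$!   # under job control, the group ID equals the leader's PID
echo "started job in process group $pgid"

sleep 1   # give the leader time to install its trap and start the child

# A negative PID addresses every member of the group; "--" ends the options.
kill -s TERM -- "-$pgid"

status=0
wait "$pgid" || status=$?
echo "leader exit status: $status"

# Signal 0 only checks whether the target exists; it delivers nothing.
if kill -s 0 -- "-$pgid" 2>/dev/null; then
    group_alive=yes
else
    group_alive=no
fi
echo "group still alive: $group_alive"
```

If the group signal did not reach the `sleep` child, the leader's `wait` inside the trap would block, so the script returning promptly is itself a sign that every member received SIGTERM.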
And it definitely works on Linux. I can tell by looking at the processes on the runner.
Linux
Before the job:
# ps -o pid,time,comm,pgid
PID TIME COMMAND PGID
1 0:00 dumb-init 1
7 1:05 gitlab-runner 7
30848 0:00 sh 30848
34328 0:00 ps 34328
During the job:
PID TIME COMMAND PGID
1 0:00 dumb-init 1
7 1:06 gitlab-runner 7
30848 0:00 sh 30848
34491 0:00 bash 34491
34495 0:00 bash 34491
34499 0:00 test.sh 34491
36332 0:00 sleep 34491
36333 0:00 ps 36333
Here you can clearly see 4 processes in the same group 34491. Something I don’t understand, given my script and CI config: why are there 4 processes for this single job? More specifically, why are there two instances of Bash instead of just one?
After the job is cancelled:
PID TIME COMMAND PGID
1 0:00 dumb-init 1
7 1:06 gitlab-runner 7
30848 0:00 sh 30848
36367 0:00 ps 36367
All 4 processes with PGID 34491 were terminated.
Windows
Sadly, the same thing does not seem to work on the Windows runner. And it’s pretty weird.
Also, ps on Mintty Bash does not sort the results, sorry about that.
Before the job:
PID PPID PGID WINPID TTY UID STIME COMMAND
3218 3217 3218 8536 pty0 197612 18:35:52 /usr/bin/bash
9122 3218 9122 1276 pty0 197612 21:57:11 /usr/bin/ps
3217 1 3217 19492 ? 197612 18:35:52 /usr/bin/mintty
During the job:
PID PPID PGID WINPID TTY UID STIME COMMAND
3218 3217 3218 8536 pty0 197612 18:35:52 /usr/bin/bash
3217 1 3217 19492 ? 197612 18:35:52 /usr/bin/mintty
9220 9216 9195 12212 ? 197612 21:57:31 /usr/bin/bash
9195 1 9195 6992 ? 197612 21:57:30 /usr/bin/bash
9222 3218 9222 18064 pty0 197612 21:57:31 /usr/bin/ps
9216 9195 9195 11540 ? 197612 21:57:30 /usr/bin/bash
9221 9220 9195 8452 ? 197612 21:57:31 /usr/bin/sleep
As you can see, we now have four additional processes, all in the 9195 group. Now, I know that Windows does not have process groups, but Mintty seems to emulate everything.
Also, notice that the process tree is as follows:
1 > 9195 > 9216 > 9220 > 9221
After the job is cancelled:
PID PPID PGID WINPID TTY UID STIME COMMAND
3218 3217 3218 8536 pty0 197612 18:35:52 /usr/bin/bash
9349 9220 9195 7712 ? 197612 21:59:00 /usr/bin/sleep
3217 1 3217 19492 ? 197612 18:35:52 /usr/bin/mintty
9220 9216 9195 12212 ? 197612 21:57:31 /usr/bin/bash
9350 3218 9350 21196 pty0 197612 21:59:01 /usr/bin/ps
This is where things get interesting - the first two processes in the group 9195 (PID 9195 and 9216) were terminated, but the last two (9220 and 9349 - a different PID, but it’s just a new sleep process) remained.
So it’s like the solution implemented in the MR works only halfway. The behavior is consistent - always the first two processes are terminated.
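For anyone who wants to repeat the check, here is a small sketch of how the survivors of a group can be picked out of the unsorted ps listing. The helper name `group_members` is made up for illustration, and it assumes the column layout shown above (PGID in the third column, no leading status character). The sample data is the "after the job is cancelled" listing from this post.

```shell
#!/bin/bash
# Hypothetical helper: keep the header line plus every row whose third
# column (PGID in the listing above) matches the requested group ID.
group_members() {
    awk -v pgid="$1" 'NR == 1 || $3 == pgid'
}

# Sample input copied from the "after the job is cancelled" listing.
sample='PID PPID PGID WINPID TTY UID STIME COMMAND
3218 3217 3218 8536 pty0 197612 18:35:52 /usr/bin/bash
9349 9220 9195 7712 ? 197612 21:59:00 /usr/bin/sleep
3217 1 3217 19492 ? 197612 18:35:52 /usr/bin/mintty
9220 9216 9195 12212 ? 197612 21:57:31 /usr/bin/bash
9350 3218 9350 21196 pty0 197612 21:59:01 /usr/bin/ps'

survivors=$(printf '%s\n' "$sample" | group_members 9195)
printf '%s\n' "$survivors"
```

On the runner itself the same filter would be fed from the live listing, for example `ps | group_members 9195`.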
One final note. The command kill -- -9195 should send a SIGTERM to all processes belonging to the 9195 group, and it does, and it works - it kills all four processes, as it should! So it seems like the Mintty Bash works well enough.
Question
If I had to ask a single question: why are some processes not terminated on my Windows runner?
Thanks for reading! Thoughts?