Windows runner with Mintty Bash: task cancellation problems

juona · April 25, 2024, 7:11pm

Hello everyone!

Looking for help with a rather specific issue that is perhaps a borderline defect. It’s a long one, any help / thoughts would be greatly appreciated.

Setup

I have configured two very basic GitLab runners:

a Linux runner with a shell executor (Bash);
a Windows runner with a shell executor (also Bash, available via Mintty).

Here’s the Windows runner config. Not adding the one for Linux as it seems irrelevant (no problems on Linux).

concurrent = 1
check_interval = 0
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "WindowsRunner"
  url = "....REDACTED..."
  id = 8
  token = "...REDACTED..."
  token_obtained_at = 2023-03-26T11:21:35Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "shell"
  shell = "bash"
  builds_dir = "/c/gitlab-runner/builds/"
  cache_dir = "/c/gitlab-runner/cache/"
  [runners.cache]
    MaxUploadedArchiveSize = 0

I am not sure if it is a good practice to use Bash on a Windows runner, but it does work, I guess as long as the CI scripts are not too complex.

For testing purposes I have this simple script:

#!/bin/bash

pid=$$

echo "My PID is $pid"

# This line detects a SIGTERM sent to the process running this script,
# creates a file as a proof and terminates the process using SIGTERM
trap 'echo "Caught SIGTERM, terminating" > xxx.log; trap - TERM; kill $pid' TERM

# Just do this forever so that the process never exits on its own...
while true; do
    echo ok
    sleep 1
done

I am running this script as a single-stage test pipeline on both runners. Here’s my .gitlab-ci.yml:

stages:
    - Test

Job Linux:
    stage: Test
    tags:
        - linux
    script: scripts/test.sh

Job Windows:
    stage: Test
    tags:
        - windows
    script: scripts/test.sh

Problem

To reproduce the issue:

run the pipeline; two jobs are created - one running on Windows and the other one on Linux; both jobs run the same script;
cancel the pipeline (or the jobs one-by-one).

Result - the process is terminated on the Linux runner but not on the Windows runner.

Exactly the same happens when the jobs time out.

Investigation

I have found a few seemingly unresolved problems that may or may not be related, such as the GitLab epic “Processes on the Runner host are not terminated when a CI/CD job is cancelled in the UI” (&10469). However, most of these reports specifically mention Bash on Linux, which works fine in my case, so presumably they are not related.

Nevertheless, here is one MR that is definitely related: “Make commander start process group for each process” (gitlab-runner !1743). Quote:

Update the unix killer to be aware of process groups and send the singal to each process instead of just the main process to prevent any orphan processes.

With this change, all processes launched by the runner for a given job belong to the same process group. If a job is cancelled, every process in this group should be terminated using SIGTERM. The code in the MR is pretty straightforward.
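To illustrate what "signal the whole group" means, here is a minimal Bash sketch. It is not the runner's actual code (that is written in Go); it only assumes that `set -m` puts each background job into its own process group, and the variable names are made up for the example.

```shell
#!/bin/bash
# Sketch: start a job in its own process group, then terminate the whole group.
set -m                      # job control: each background job gets its own process group

# A parent shell with one child, mimicking the job script and its sleep
bash -c 'sleep 300 & wait' &
leader=$!                   # with job control on, the group ID equals this PID

sleep 1                     # give the child time to start
kill -TERM -- "-$leader"    # a negative PID means "every process in that group"
wait "$leader" 2>/dev/null
status=$?                   # 128 + 15 = 143 if the leader was terminated by SIGTERM
echo "leader exit status: $status"
```

The important part is the negative argument to `kill`: a plain `kill "$leader"` would signal only the group leader and could leave the `sleep` behind.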

And it definitely works on Linux. I can tell by looking at the processes on the runner.

Linux

Before the job:

# ps -o pid,time,comm,pgid
PID   TIME  COMMAND          PGID
    1  0:00 dumb-init            1
    7  1:05 gitlab-runner        7
30848  0:00 sh               30848
34328  0:00 ps               34328

During the job:

PID   TIME  COMMAND          PGID
    1  0:00 dumb-init            1
    7  1:06 gitlab-runner        7
30848  0:00 sh               30848
34491  0:00 bash             34491
34495  0:00 bash             34491
34499  0:00 test.sh          34491
36332  0:00 sleep            34491
36333  0:00 ps               36333

Here you can clearly see 4 processes in the same group 34491. Something I don’t understand, given my script and CI config: why are there 4 processes for this single job? More specifically, why are there two instances of Bash instead of just one?

After the job is cancelled:

PID   TIME  COMMAND          PGID
    1  0:00 dumb-init            1
    7  1:06 gitlab-runner        7
30848  0:00 sh               30848
36367  0:00 ps               36367

All 4 processes with PGID 34491 were terminated.

Windows

Sadly, the same thing does not seem to work on the Windows runner. And it’s pretty weird.

Also, ps on Mintty Bash does not sort the results, sorry about that.

Before the job:

      PID    PPID    PGID     WINPID   TTY         UID    STIME COMMAND
     3218    3217    3218       8536  pty0      197612 18:35:52 /usr/bin/bash
     9122    3218    9122       1276  pty0      197612 21:57:11 /usr/bin/ps
     3217       1    3217      19492  ?         197612 18:35:52 /usr/bin/mintty

During the job:

      PID    PPID    PGID     WINPID   TTY         UID    STIME COMMAND
     3218    3217    3218       8536  pty0      197612 18:35:52 /usr/bin/bash
     3217       1    3217      19492  ?         197612 18:35:52 /usr/bin/mintty
     9220    9216    9195      12212  ?         197612 21:57:31 /usr/bin/bash
     9195       1    9195       6992  ?         197612 21:57:30 /usr/bin/bash
     9222    3218    9222      18064  pty0      197612 21:57:31 /usr/bin/ps
     9216    9195    9195      11540  ?         197612 21:57:30 /usr/bin/bash
     9221    9220    9195       8452  ?         197612 21:57:31 /usr/bin/sleep

As you can see, we now have four additional processes, all in the 9195 group. Now, I know that Windows does not have native process groups, but the Bash environment behind Mintty seems to emulate them.

Also, notice that the process tree is as follows:

1 > 9195 > 9216 > 9220 > 9221
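For anyone who wants to derive this chain from the ps output rather than by eye, here is a small sketch. `chain_of` is a hypothetical helper name, and the input is just the PID and PPID columns copied from the listing above.

```shell
#!/bin/bash
# Sketch: walk the parent chain of a PID using "PID PPID" pairs read from stdin.
# chain_of is a hypothetical helper, not part of any tool.
chain_of() {
    awk -v pid="$1" '
        { parent[$1] = $2 }
        END {
            chain = pid
            # Follow PPID links until we reach a PID that is not in the table
            while ((pid in parent) && parent[pid] != pid) {
                pid = parent[pid]
                chain = pid " > " chain
            }
            print chain
        }'
}

# PID/PPID pairs copied from the "During the job" listing
pairs='9220 9216
9195 1
9216 9195
9221 9220'

tree=$(printf '%s\n' "$pairs" | chain_of 9221)
echo "$tree"
```

On the runner itself, the pairs could come from something like `ps | awk 'NR > 1 { print $1, $2 }'`, assuming the column layout shown above.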

After the job is cancelled:

      PID    PPID    PGID     WINPID   TTY         UID    STIME COMMAND
     3218    3217    3218       8536  pty0      197612 18:35:52 /usr/bin/bash
     9349    9220    9195       7712  ?         197612 21:59:00 /usr/bin/sleep
     3217       1    3217      19492  ?         197612 18:35:52 /usr/bin/mintty
     9220    9216    9195      12212  ?         197612 21:57:31 /usr/bin/bash
     9350    3218    9350      21196  pty0      197612 21:59:01 /usr/bin/ps

This is where things get interesting: the first two processes in the group 9195 (PID 9195 and 9216) were terminated, but the last two remained (9220 and 9349; the latter has a different PID than before only because the loop has started a new sleep process in the meantime).

So it’s like the solution implemented in the MR works only halfway. The behavior is consistent - always the first two processes are terminated.

One final note. The command kill -- -9195 should send a SIGTERM to all processes belonging to the 9195 group, and it does, and it works - it kills all four processes, as it should! So it seems like the Mintty Bash works well enough.
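To check which processes of a group survive a cancellation, the ps listing can be filtered by its PGID column. The sketch below uses a hypothetical helper, `group_members`, and assumes the column layout shown in the listings above (PID, PPID, PGID, ...), with a header line first.

```shell
#!/bin/bash
# Sketch: list the PIDs that still belong to a given process group.
# group_members is a hypothetical helper; it expects ps output on stdin,
# header line first, with PGID in the third column.
group_members() {
    awk -v pgid="$1" 'NR > 1 && $3 == pgid { print $1 }'
}

# Rows copied from the "After the job is cancelled" listing
listing='      PID    PPID    PGID     WINPID   TTY         UID    STIME COMMAND
     3218    3217    3218       8536  pty0      197612 18:35:52 /usr/bin/bash
     9349    9220    9195       7712  ?         197612 21:59:00 /usr/bin/sleep
     3217       1    3217      19492  ?         197612 18:35:52 /usr/bin/mintty
     9220    9216    9195      12212  ?         197612 21:57:31 /usr/bin/bash
     9350    3218    9350      21196  pty0      197612 21:59:01 /usr/bin/ps'

leftovers=$(printf '%s\n' "$listing" | group_members 9195)
echo "$leftovers"

# On the runner one would pipe live output instead and then signal the group:
#   ps | group_members 9195
#   kill -TERM -- -9195
```

With the listing above this prints the two surviving PIDs, which are exactly the ones that `kill -- -9195` then cleans up.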

Question

If I had to ask a single question: why are some processes not terminated on my Windows runner?

Thanks for reading! Thoughts?