Question about massive I/O performance issues in GitLab runners (via Docker)

I need your help to track down a very weird I/O-related performance issue in our CI. We are running:

  • GitLab: 15.10.2-ee (self-managed)
  • Runner: 15.10.1

Unfortunately I don’t have an MWE, but let me describe the problem. I think someone experienced with the architecture of GitLab CI, including the way GitLab Runners utilise Docker, should be able to point me in the right direction.

We started using the CI for some heavy processing and generation of simulation data, since we have set up some quite beefy runners at our institute (128 cores, 512 GB RAM, fast SSDs). We already discovered a weird bug related to using tmpfs as a build folder in Docker-based GitLab runners, where we observe basically no speed-up whatsoever. This seems to be related to the fact that although the tmpfs RAM-disk is mounted properly, the build directory of the runner is mounted as a regular disk mount on top of it (see GitLab runner - tmpfs/ramdisk extremely slow (no speed-up compared to HDD) (#29651) · Issues · GitLab.org / gitlab-runner · GitLab), but that’s another story.
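
For context, such a tmpfs mount is requested through the runner’s config.toml roughly like this (a minimal sketch with placeholder path and mount options, not our exact configuration); the linked issue is about the build directory then being mounted on top of such a tmpfs:

[[runners]]
  executor = "docker"
  [runners.docker]
    # Mount a RAM-backed tmpfs inside the job containers.
    # Path and options below are placeholders.
    [runners.docker.tmpfs]
      "/builds" = "rw,exec"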

Anyway, the problem here is the following: we need to launch many processes in parallel to generate simulation data, but for some reason all these processes are basically stuck, consume only 10-30% CPU, and iotop shows about 3 MB/s of DISK READ per process, although these processes do not read from disk at all. They generate the data completely in memory, and only after several minutes do they open a file and dump a few hundred megabytes at most.

We are currently running the processes in parallel via xargs -a COMMANDS.txt -d '\n' -n 1 -P 128 bash -c, where COMMANDS.txt is a long list of commands to be executed in parallel, one per line (see the sketch below).
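
For illustration, a minimal sketch of that setup; the example commands follow the pattern visible in the iotop output further down, and the real COMMANDS.txt is of course much longer:

# COMMANDS.txt: one self-contained command per line, e.g.
#   JMakePDF -F2 -A 1.0 -S 1.0 -o /Jpp/data/J2p.dat -d 2
#   JMakePDF -F4 -A 1.0 -S 1.0 -R 0.1 -o /Jpp/data/J4p-0.1.dat -d 2

# Run each line as its own `bash -c '<line>'`, at most 128 in parallel:
xargs -a COMMANDS.txt -d '\n' -n 1 -P 128 bash -c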

If I run this locally on the machine (with the natively installed software) or inside a Docker container (same image as used in the CI), I get full CPU utilisation and no disk read I/O at all, only occasionally a short burst of write I/O when a process finishes and dumps its results into a file.

If I run the exact same command via a GitLab CI job, every process simply hangs, using about 20% of the CPU, and no files are produced (very likely a massive I/O bottleneck). I let it run for several days without any files being produced, although the whole processing should only take a few hours.

Here is the output of iotop while the GitLab Runner job is running. As you can see, there is a lot of DISK READ, which should be 0 B/s (and is, when running the processes locally and not in the runner):

Total DISK READ:       161.84 M/s | Total DISK WRITE:         8.22 K/s
Current DISK READ:     161.68 M/s | Current DISK WRITE:     673.89 K/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
   1521 be/3 root        0.00 B/s    0.00 B/s  0.00 %  0.04 % [jbd2/dm-0-8]
2857579 be/4 root        3.45 M/s    0.00 B/s 49.98 %  0.01 % JMakePDF -F4 -A 1.0 -S 1.0 -R 0.1 -o /Jpp/data/J4p-0.1.dat -d 2
2857581 be/4 root        3.55 M/s    0.00 B/s 53.35 %  0.01 % JMakePDF -F4 -A 1.0 -S 1.0 -R 0.3 -o /Jpp/data/J4p-0.3.dat -d 2
2857577 be/4 root        3.63 M/s    0.00 B/s 54.40 %  0.00 % JMakePDF -F2 -A 1.0 -S 1.0 -o /Jpp/data/J2p.dat -d 2
2857583 be/4 root        2.81 M/s    0.00 B/s 39.08 %  0.00 % JMakePDF -F4 -A 1.0 -S 1.0 -R 0.5 -o /Jpp/data/J4p-0.5.dat -d 2
2857585 be/4 root        3.71 M/s    0.00 B/s 54.24 %  0.00 % JMakePDF -F4 -A 1.0 -S 1.0 -R 1.0 -o /Jpp/data/J4p-1.0.dat -d 2
2857587 be/4 root        3.54 M/s    0.00 B/s 52.34 %  0.00 % JMakePDF -F4 -A 1.0 -S 1.0 -R 2.0 -o /Jpp/data/J4p-2.0.dat -d 2
2857589 be/4 root        2.79 M/s    0.00 B/s 38.69 %  0.00 % JMakePDF -F4 -A 1.0 -S 1.0 -R 3.0 -o /Jpp/data/J4p-3.0.dat -d 2
2857591 be/4 root        3.69 M/s    0.00 B/s 54.67 %  0.00 % JMakePDF -F4 -A 1.0 -S 1.0 -R 4.0 -o /Jpp/data/J4p-4.0.dat -d 2
...
...
...

I also checked which files the process has open, and the list is short; it reveals nothing but a pipe on stdin, which might be a hint that something in Docker or the GitLab Runner is overloading it (maybe?). I believe the (deleted) and No such file or directory entries are displayed because lsof, running on the host, tries to stat files that only exist inside the Docker container.

# lsof -p 3066092
COMMAND      PID USER   FD   TYPE DEVICE SIZE/OFF      NODE NAME
JMakePDF 3066092 root  cwd    DIR   0,98     4096  22413318 /Jpp
JMakePDF 3066092 root  rtd    DIR   0,98     4096  25048707 /
JMakePDF 3066092 root  txt    REG   0,98   774856  22417443 /Jpp/out/Linux/bin/JMakePDF
JMakePDF 3066092 root  mem    REG   0,98            2103520 /usr/lib64/libc-2.17.so (stat: No such file or directory)
JMakePDF 3066092 root  mem    REG   0,98            2103677 /usr/lib64/libpthread-2.17.so (stat: No such file or directory)
JMakePDF 3066092 root  mem    REG   0,98            2103576 /usr/lib64/libgcc_s-4.8.5-20150702.so.1 (stat: No such file or directory)
JMakePDF 3066092 root  mem    REG   0,98            2103625 /usr/lib64/libm-2.17.so (stat: No such file or directory)
JMakePDF 3066092 root  mem    REG   0,98            2103696 /usr/lib64/libstdc++.so.6.0.19 (stat: No such file or directory)
JMakePDF 3066092 root  mem    REG   0,98            2100485 /usr/lib64/ld-2.17.so (stat: No such file or directory)
JMakePDF 3066092 root    0r  FIFO   0,13      0t0 237557915 pipe
JMakePDF 3066092 root    1u   REG   0,98        0  25048939 /tmp/parFtgVD.par (deleted)
JMakePDF 3066092 root    2u   REG   0,98       50  25048940 /tmp/parGT2Ki.par (deleted)

Does anyone have a hint as to why this happens? Why are these jobs stuck with such massive disk reads, although they don’t do anything like that when running locally?
By the way, there is also no STDOUT or any logging produced by these processes.

Since the processing works absolutely fine when running locally or in a Docker container from the same image used in the CI, I am very confident that this is related to the GitLab Runner or the way GitLab communicates with the runner and the CI job.
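
For what it’s worth, the manual reproduction inside a plain Docker container boils down to something like this (the image name is a placeholder, and I’m omitting how COMMANDS.txt is generated):

docker run --rm -w /Jpp <ci-image> \
    xargs -a COMMANDS.txt -d '\n' -n 1 -P 128 bash -c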

I am totally lost…

Case solved: the jobs were exceeding the container memory limit (a typo in the GitLab Runner config set it way too low). That presumably also explains the constant DISK READ: with far too little memory available, the kernel keeps evicting and re-reading the file-backed pages of the binaries and libraries instead of letting the processes run from RAM.
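
For reference, a minimal sketch of where that limit lives in the runner’s config.toml (names and values here are placeholders, not our actual configuration):

concurrent = 128

[[runners]]
  name = "beefy-runner"
  executor = "docker"
  [runners.docker]
    image = "registry.example.com/ci-image"
    # Docker memory limit applied to each job container; a typo here
    # (e.g. "512m" instead of "512g") starves the jobs of RAM.
    memory = "512g"
    memory_swap = "512g"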

Glad you figured it out, @tamasgal! :tada:

Thanks for sharing the solution here. :handshake:
