I need your help to track down a very weird I/O-related performance issue in our CI. We are running:
- GitLab (Hint: /help): 15.10.2-ee (self-managed)
- Runner (Hint: /admin/runners): 15.10.1
Unfortunately I don’t have an MWE, but let me describe the problem. I think that someone experienced with the architecture of GitLab CI, including the way GitLab Runners utilise Docker, should be able to point me in the right direction.
We started using the CI to do some heavy processing and generation of simulation data, since we have set up some quite beefy runners at our institute (128 cores, 512 GB RAM, fast SSD). We already discovered a weird bug related to the usage of tmpfs as a build folder in Docker-based GitLab runners, where we observe basically no speed-up whatsoever. This seems to be related to the fact that although a tmpfs RAM disk is mounted properly, the build directory of the runner is mounted as a regular disk mount on top of it (see gitlab-runner issue #29651, "GitLab runner - tmpfs/ramdisk extremely slow (no speed-up compared to HDD)"), but that’s another story.
Anyway, in this issue the problem is the following: we need to launch many processes in parallel to generate simulation data, but for some reason all these processes are basically stuck. Each one consumes only 10-30% CPU and iotop shows about 3 M/s DISK READ per process, although these processes do not read from disk. They generate the data completely in memory, and only after several minutes do they open a file and dump a few hundred megabytes at most.
We are currently running the processes in parallel via xargs -a COMMANDS.txt -d '\n' -n 1 -P 128 bash -c where COMMANDS.txt is a long list of commands to be executed in parallel.
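For reference, the launch pattern can be reproduced in miniature like this. It is only a minimal sketch, assuming GNU xargs (needed for -a and -d); the file name commands.txt and the echo commands are placeholders, not our real workload, and -P is reduced to 4.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for COMMANDS.txt: one shell command per line.
workdir=$(mktemp -d)
cat > "$workdir/commands.txt" <<'EOF'
echo job-1
echo job-2
echo job-3
echo job-4
EOF

# -a FILE : read items from FILE; stdin is passed through to the commands unchanged
# -d '\n' : treat every line as exactly one item
# -n 1    : pass one item per invocation
# -P 4    : run up to 4 invocations in parallel
# Each line becomes the command string given to `bash -c`.
xargs -a "$workdir/commands.txt" -d '\n' -n 1 -P 4 bash -c > "$workdir/out.txt"

# Completion order is not deterministic, so sort before printing.
sort "$workdir/out.txt"
```

Since the commands inherit stdin from whatever started xargs, the children see a different fd 0 in an interactive shell than in a CI job.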
If I run this locally on the machine (with the natively installed software) or inside a Docker container (same image as used in the CI), I get full CPU utilisation and no disk read I/O at all, only an occasional short write when a process finishes and dumps its results into a file.
If I run the exact same command via a GitLab CI job, every process simply hangs, using about 20% of the CPU, and no files are produced (very likely a massive I/O bottleneck). I let it run for several days without any files being produced, although the whole processing should only take a few hours.
Here is the output of iotop while the GitLab runner is working. As you can see, there is a lot of DISK READ, which should be 0 B/s (as it is when running the processes locally and not in the runner):
Total DISK READ: 161.84 M/s | Total DISK WRITE: 8.22 K/s
Current DISK READ: 161.68 M/s | Current DISK WRITE: 673.89 K/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
1521 be/3 root 0.00 B/s 0.00 B/s 0.00 % 0.04 % [jbd2/dm-0-8]
2857579 be/4 root 3.45 M/s 0.00 B/s 49.98 % 0.01 % JMakePDF -F4 -A 1.0 -S 1.0 -R 0.1 -o /Jpp/data/J4p-0.1.dat -d 2
2857581 be/4 root 3.55 M/s 0.00 B/s 53.35 % 0.01 % JMakePDF -F4 -A 1.0 -S 1.0 -R 0.3 -o /Jpp/data/J4p-0.3.dat -d 2
2857577 be/4 root 3.63 M/s 0.00 B/s 54.40 % 0.00 % JMakePDF -F2 -A 1.0 -S 1.0 -o /Jpp/data/J2p.dat -d 2
2857583 be/4 root 2.81 M/s 0.00 B/s 39.08 % 0.00 % JMakePDF -F4 -A 1.0 -S 1.0 -R 0.5 -o /Jpp/data/J4p-0.5.dat -d 2
2857585 be/4 root 3.71 M/s 0.00 B/s 54.24 % 0.00 % JMakePDF -F4 -A 1.0 -S 1.0 -R 1.0 -o /Jpp/data/J4p-1.0.dat -d 2
2857587 be/4 root 3.54 M/s 0.00 B/s 52.34 % 0.00 % JMakePDF -F4 -A 1.0 -S 1.0 -R 2.0 -o /Jpp/data/J4p-2.0.dat -d 2
2857589 be/4 root 2.79 M/s 0.00 B/s 38.69 % 0.00 % JMakePDF -F4 -A 1.0 -S 1.0 -R 3.0 -o /Jpp/data/J4p-3.0.dat -d 2
2857591 be/4 root 3.69 M/s 0.00 B/s 54.67 % 0.00 % JMakePDF -F4 -A 1.0 -S 1.0 -R 4.0 -o /Jpp/data/J4p-4.0.dat -d 2
...
...
...
I also checked which files one of the processes has open. The list is short and reveals nothing but a pipe, which might be a hint that something in Docker or the GitLab runner is overloading it (maybe?). I believe the (deleted) and No such file or directory notes are displayed because lsof tries to access those files from the host, whereas they actually live inside the Docker container.
# lsof -p 3066092
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
JMakePDF 3066092 root cwd DIR 0,98 4096 22413318 /Jpp
JMakePDF 3066092 root rtd DIR 0,98 4096 25048707 /
JMakePDF 3066092 root txt REG 0,98 774856 22417443 /Jpp/out/Linux/bin/JMakePDF
JMakePDF 3066092 root mem REG 0,98 2103520 /usr/lib64/libc-2.17.so (stat: No such file or directory)
JMakePDF 3066092 root mem REG 0,98 2103677 /usr/lib64/libpthread-2.17.so (stat: No such file or directory)
JMakePDF 3066092 root mem REG 0,98 2103576 /usr/lib64/libgcc_s-4.8.5-20150702.so.1 (stat: No such file or directory)
JMakePDF 3066092 root mem REG 0,98 2103625 /usr/lib64/libm-2.17.so (stat: No such file or directory)
JMakePDF 3066092 root mem REG 0,98 2103696 /usr/lib64/libstdc++.so.6.0.19 (stat: No such file or directory)
JMakePDF 3066092 root mem REG 0,98 2100485 /usr/lib64/ld-2.17.so (stat: No such file or directory)
JMakePDF 3066092 root 0r FIFO 0,13 0t0 237557915 pipe
JMakePDF 3066092 root 1u REG 0,98 0 25048939 /tmp/parFtgVD.par (deleted)
JMakePDF 3066092 root 2u REG 0,98 50 25048940 /tmp/parGT2Ki.par (deleted)
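One way to cross-check this without lsof's path resolution problems is to read /proc directly. The following is a minimal sketch, assuming Linux with /proc mounted; inspect_pid is a hypothetical helper name, and it defaults to the current shell's PID so it can be tried anywhere before pointing it at a JMakePDF PID.

```shell
#!/usr/bin/env bash

# inspect_pid is a hypothetical helper, not part of any tool.
inspect_pid() {
    local pid=${1:-$$}

    # Where does stdin (fd 0) point? Typically a tty, a file, or pipe:[inode].
    echo "fd0 -> $(readlink "/proc/$pid/fd/0" 2>/dev/null || echo unknown)"

    # Cumulative I/O counters for the process:
    #   rchar      = bytes passed to read()-like syscalls (includes pipes, ttys)
    #   read_bytes = bytes actually fetched from the storage layer
    if [ -r "/proc/$pid/io" ]; then
        grep -E '^(rchar|read_bytes):' "/proc/$pid/io"
    else
        echo "io counters not readable for $pid"
    fi
}

inspect_pid "$$"
```

Running it twice a few seconds apart against a stuck process and comparing rchar with read_bytes should show whether the reads go through a pipe or really hit the disk.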
Does anyone have a hint why this happens? Why are these jobs stuck with such massive I/O reads although they don’t do anything like that when running locally?
There is also no STDOUT or logging produced by these processes btw.
Since the processing works absolutely fine when running locally or in a Docker container from the same image used in the CI, I am very confident that this is related to the GitLab Runner or the way GitLab communicates with the runner and the CI job.
I am totally lost…