Puma workers are periodically blocking my GitLab instance

Hi, I’m experiencing serious slowdowns on a GitLab instance I manage. I’ve tried hard to find a solution, but still can’t get to the bottom of it.

I’m currently running GitLab v16.5.1-ee on Ubuntu 18. The server has 8 CPU cores and 30 GB of RAM.
Projects are built and deployed mostly using runners in Docker containers on the target machines.

What has been happening for many months now is that, at various times each day, GitLab becomes unresponsive: the web GUI times out, and Git commands and pipelines fail as well. When this happens, if I check the processes on the machine, I can see “puma: cluster worker” processes, each one taking 100% of a CPU core.
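
For what it’s worth, I usually spot them with something along these lines (just an example invocation, I normally eyeball top):

ps aux --sort=-%cpu | grep 'puma: cluster worker'

During an incident the top entries each sit at around 100% CPU.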

I initially noticed the Puma workers were being killed frequently, so I changed the per-worker memory limit:

puma['per_worker_max_memory_mb'] = 1600

But the problem continued.
Over the following months I played around with concurrency, reducing the number of parallel workers:

puma['worker_processes'] = 2
sidekiq['concurrency'] = 2
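
(For completeness: this is an Omnibus install, so all of these settings go in /etc/gitlab/gitlab.rb and each change was applied with sudo gitlab-ctl reconfigure.)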

But still no joy.
Today I also tried disabling Puma cluster mode (Document how and when to run Puma in Single mode, limitations (#300651) · Issues · GitLab.org / GitLab · GitLab), but the results seemed even worse.
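
For anyone else reading, as far as I understand it single mode is simply a matter of setting the worker count to zero in gitlab.rb (per the issue linked above) and reconfiguring:

puma['worker_processes'] = 0

In my case it made things worse, so I reverted it.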

As far as I can see it is not a matter of server resources: even when only 2 Puma workers are taking up just 2 vCPUs, the whole GitLab instance is unusable, yet I still get a responsive SSH shell on the same machine.
Each time this happens, if I wait 20–30 minutes the issue resolves itself, i.e. the Puma workers terminate and GitLab runs fast again.

Can anyone give me some pointers for finding the root cause and a solution, please? I’m really lost; as far as I can tell, there is nothing useful in the logs.
Thank you very much in advance.

Hi!
Did you ever find a solution for this? If so, can you share it?

Hi Alexander,
yes, I eventually found a solution, although the problem was somewhat “external” to the GitLab instance, so the chances that my solution fits your case are rather low.
We have another server running Verdaccio, a private NPM registry. While monitoring the GitLab server during the slowdowns, I noticed (e.g. with netstat or ss) hundreds of connections coming from the Verdaccio server’s IP. I investigated with the developers, and it turned out that every time a build was triggered on GitLab, manual or automatic, all Node packages and dependencies, even public ones, were requested from our private registry. Verdaccio, in turn, called back to GitLab to authenticate the user against its internal database for EVERY package request, generating lots of simultaneous authentication requests that flooded GitLab and then failed because of the overload, triggering new retries, in a vicious loop that took quite a while to die down.
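
If you want to check for something similar on your side, a quick way to count established connections from a suspect host is something like this (the address below is just a placeholder):

ss -tn | grep 192.0.2.10 | wc -l

Whenever GitLab ground to a halt, that count was in the hundreds for the Verdaccio server’s IP.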

To work around this, the developers changed something (don’t ask me what) in their project files so that public Node packages are no longer fetched through Verdaccio, and we haven’t experienced the issue since.
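
I can’t tell you exactly what they changed, but my understanding is that it was something along the lines of a scoped-registry entry in the project’s .npmrc, so that only our private packages go through Verdaccio and public ones are fetched straight from npmjs. Purely as an illustration (the scope and URL are made up):

# hypothetical .npmrc: route only the private scope through Verdaccio
@mycompany:registry=https://verdaccio.example.internal/
# everything else goes directly to the public registry
registry=https://registry.npmjs.org/

Don’t take that as the exact change; the point is simply that public packages stopped being proxied through (and authenticated by) our private registry.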

Hope this helps.
Regards