Puma threads are periodically blocking GitLab instance

Hi, I’m seeing serious slowdowns on a GitLab instance I manage. I have tried hard to find a solution, but have not been able to resolve the problem.

I’m currently running v16.5.1-ee on Ubuntu 18. The server has 8 CPUs and 30 GB of RAM.
Projects are built and deployed mostly using runners in Docker containers on the target machines.

For many months now, GitLab has been becoming unresponsive at various times every day: the web GUI times out, and git commands and pipelines fail. While this is happening, if I check the processes on the machine, I can see “puma: cluster worker” processes, each one using 100% of a CPU core.
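A minimal sketch of how the busy workers could be recorded while a stall is in progress, so the timestamps can later be matched against the logs. The helper name `busy_puma_workers` and the 90% threshold are hypothetical. Note that `ps` reports CPU usage averaged over the lifetime of the process, so this is only a rough indicator compared to `top`.

```ruby
# Parse the output of `ps -eo pid,pcpu,args` and return the Puma cluster
# workers whose reported CPU usage is at or above `threshold` percent.
# Hypothetical helper, not part of GitLab.
def busy_puma_workers(ps_output, threshold: 90.0)
  ps_output.each_line.filter_map do |line|
    pid, pcpu, args = line.strip.split(/\s+/, 3)
    # Skip the header line and every process that is not a Puma worker.
    next unless args&.include?("puma: cluster worker")

    cpu = pcpu.to_f
    { pid: pid.to_i, cpu: cpu, title: args } if cpu >= threshold
  end
end

# Example (run on the GitLab server):
#   busy_puma_workers(%x(ps -eo pid,pcpu,args)).each do |w|
#     puts "#{Time.now} pid=#{w[:pid]} cpu=#{w[:cpu]}% #{w[:title]}"
#   end
```

Running the example from cron every minute would give a simple timeline of when each worker starts and stops spinning.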

I initially noticed that the puma workers were being killed frequently, so I changed the maximum memory per worker:

puma['per_worker_max_memory_mb'] = 1600

But the problem continued.
In the following months I experimented with concurrency, reducing the number of parallel workers:

puma['worker_processes'] = 2
sidekiq['concurrency'] = 2

But still no joy.
Today I also tried disabling Puma cluster mode (see “Document how and when to run Puma in Single mode, limitations”, issue #300651 in GitLab.org / GitLab), but the results seemed to be even worse.

As far as I can see, it is not a matter of server resources: even when only 2 puma workers are busy, using just 2 vCPUs, the whole GitLab instance is unusable, yet an SSH shell on the same machine stays responsive.
Each time this happens, if I wait 20-30 minutes the issue resolves itself, i.e. the puma workers terminate and GitLab runs fast again.

Can anyone provide any pointers for finding the root cause and a solution, please? I’m really lost; as far as I can tell, there is nothing useful in the logs.
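In case it helps with narrowing this down, here is a minimal sketch of how the Rails request log could be scanned for unusually slow requests around the time of a stall. It assumes an Omnibus install, where requests are logged as one JSON object per line in /var/log/gitlab/gitlab-rails/production_json.log with a `duration_s` field. The helper names `parse_event` and `slowest_requests` are hypothetical.

```ruby
require "json"

# Parse one log line, returning nil for anything that is not valid JSON.
def parse_event(line)
  JSON.parse(line)
rescue JSON::ParserError
  nil
end

# Return the `limit` slowest requests (by "duration_s"), slowest first.
# Lines without a numeric duration are ignored.
def slowest_requests(lines, limit: 10)
  events = lines.filter_map { |line| parse_event(line) }
  timed = events.select { |e| e.is_a?(Hash) && e["duration_s"].is_a?(Numeric) }
  timed.max_by(limit) { |e| e["duration_s"] }
end

# Example (path assumes an Omnibus install):
#   log = "/var/log/gitlab/gitlab-rails/production_json.log"
#   slowest_requests(File.foreach(log)).each do |e|
#     puts "#{e['time']} #{e['duration_s']}s #{e['method']} #{e['path']}"
#   end
```

If the same path or controller keeps showing up at the top during each stall, that would at least point to which requests are keeping the workers busy.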
Thank you very much in advance.