I’m running self-hosted Omnibus GitLab v16.2.4 on an Ubuntu 20.04 instance (16 CPUs, 64 GB of RAM), which far exceeds the 1K-user reference architecture. Roughly 250 users are active on the instance.
Users have been reporting that the merge request UI is slow to refresh after a push, and that CI pipelines are delayed. The slowness is sporadic and predates the upgrade to v16.2.4 (it was happening on v15 as well).
These symptoms closely match what I’d expect from a Sidekiq backlog, so I set up a cron job to capture Sidekiq worker stats (a sketch of the capture script is included after the output below). After capturing data for several hours, I did find periods with 200+ jobs enqueued, but users did not notice any slowness in the merge request interface or delayed CI jobs during those windows:
sidekiqcheck_20230822-13:41.out
---- Processes (1) ----
Threads: 20 (17 busy)
Busy: 15
Enqueued: 208

sidekiqcheck_20230822-13:42.out
---- Processes (1) ----
Threads: 20 (15 busy)
Busy: 18
Enqueued: 218

sidekiqcheck_20230822-13:43.out
---- Processes (1) ----
Threads: 20 (18 busy)
Busy: 17
Enqueued: 196

sidekiqcheck_20230822-13:44.out
---- Processes (1) ----
Threads: 20 (17 busy)
Busy: 20
Enqueued: 181
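For context, the capture is driven by a one-minute cron entry wrapping a small gitlab-rails runner snippet. The sketch below is a reconstruction rather than the exact script (file paths are approximate; the stats come from Sidekiq’s Ruby API):

# /etc/cron.d/sidekiqcheck (sketch)
# * * * * * root /usr/local/bin/sidekiqcheck.sh

# /usr/local/bin/sidekiqcheck.sh (sketch)
#!/bin/bash
# Dump a point-in-time snapshot of Sidekiq thread usage and queue depth
out="/root/sidekiqcheck_$(date +%Y%m%d-%H:%M).out"
gitlab-rails runner '
  require "sidekiq/api"
  ps    = Sidekiq::ProcessSet.new
  stats = Sidekiq::Stats.new
  puts "---- Processes (#{ps.size}) ----"
  ps.each { |p| puts "Threads: #{p["concurrency"]} (#{p["busy"]} busy)" }
  puts "Busy: #{stats.workers_size}"   # jobs currently being worked across all processes
  puts "Enqueued: #{stats.enqueued}"   # jobs waiting across all queues
' > "$out"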
The Sidekiq config on my instance is all default, so I’m only running one Sidekiq process. Given that the reported symptoms match those described in the Sidekiq troubleshooting guide, I could increase the number of Sidekiq processes to try to fix this problem, but I’d like to be able to reproduce it first.
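If I do end up going that route, my understanding from the multi-process Sidekiq documentation is that it would be a change along these lines in /etc/gitlab/gitlab.rb (untested on my instance; the process count is just a placeholder):

# /etc/gitlab/gitlab.rb
sidekiq['queue_groups'] = ['*'] * 2   # run two Sidekiq processes, each listening on all queues

followed by a sudo gitlab-ctl reconfigure.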
Also, while troubleshooting, I took a look at the Sidekiq logs and noticed that almost all of the failed jobs are erroring with PG::ConnectionBad: PQsocket() can't get socket descriptor:
root@gitlab# cat /var/log/gitlab/sidekiq/current | jq '."exception.message" | select( . != null )' | sort | uniq -c
1 "4:Deadline Exceeded. debug_error_string:{UNKNOWN:Error received from peer {created_time:\"2023-08-22T12:44:36.843424225-07:00\", grpc_status:4, grpc_message:\"Deadline Exceeded\"}}"
1 "4:Deadline Exceeded. debug_error_string:{UNKNOWN:Error received from peer {created_time:\"2023-08-22T12:47:33.460745873-07:00\", grpc_status:4, grpc_message:\"Deadline Exceeded\"}}"
1 "4:Deadline Exceeded. debug_error_string:{UNKNOWN:Error received from peer {grpc_message:\"Deadline Exceeded\", grpc_status:4, created_time:\"2023-08-22T12:50:38.106787423-07:00\"}}"
10 "Failed to obtain a lock"
405 "PG::ConnectionBad: PQsocket() can't get socket descriptor"
I tried increasing the maximum number of Postgres connections with postgresql['max_connections'] = 1000 in /etc/gitlab/gitlab.rb and reconfiguring GitLab, but I still see the same number of Postgres socket errors.
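For reference, this is roughly how I’d confirm the new limit is live and check how many connections are actually in use (the psql queries are my own ad hoc checks, not something from the docs):

sudo gitlab-ctl reconfigure
sudo gitlab-psql -c "SHOW max_connections;"                   # confirm the new limit took effect
sudo gitlab-psql -c "SELECT count(*) FROM pg_stat_activity;"  # connections currently in use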
I’m not sure whether or how these Postgres socket errors relate to the Sidekiq backlog or to the reported merge request/CI slowness, but it doesn’t seem normal, so I thought I’d mention it.
Any thoughts on the interface slowness/pipeline delays, Sidekiq slowness, and Postgres socket errors? Thanks!