Sidekiq Queue Backlog and PG::ConnectionBad: PQsocket() can't get socket descriptor Errors

I’m running self-hosted Omnibus GitLab v16.2.4 on an Ubuntu 20.04 instance (16 CPUs, 64 GB RAM), which far exceeds the 1K-user reference architecture. Roughly 250 users use the GitLab instance.

Users have been reporting slow user interface refresh/updates after pushing to a merge request and CI pipeline delays. That slowness is sporadic and predates the upgrade to v16.2.4 (it was happening in v15 as well).

These symptoms sound exactly like slow Sidekiq performance. So, I set up a cron job to capture Sidekiq worker stats. After capturing data for several hours, I did find times when there were 200+ jobs enqueued, but users did not notice any slowness in the merge request interface or delayed CI jobs during that time:

---- Processes (1) ----
  Threads: 20 (17 busy)
sidekiqcheck_20230822-13:41.out
       Busy: 15
   Enqueued: 208
---- Processes (1) ----
  Threads: 20 (15 busy)
sidekiqcheck_20230822-13:42.out
       Busy: 18
   Enqueued: 218
---- Processes (1) ----
  Threads: 20 (18 busy)
sidekiqcheck_20230822-13:43.out
       Busy: 17
   Enqueued: 196
---- Processes (1) ----
  Threads: 20 (17 busy)
sidekiqcheck_20230822-13:44.out
       Busy: 20
   Enqueued: 181
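For reference, the capture was just a per-minute cron job calling a small script. This is only a sketch of roughly what it looked like: the script path and output filename pattern are made up for illustration, and the Ruby snippet assumes the Sidekiq API bundled with `gitlab-rails` (`Sidekiq::Stats` / `Sidekiq::ProcessSet`).

```shell
#!/bin/bash
# /usr/local/bin/sidekiqcheck.sh -- hypothetical capture script.
# Dumps a snapshot of Sidekiq stats via the bundled Rails runner
# (assumes default Omnibus paths).
out="/root/sidekiqcheck_$(date +%Y%m%d-%H:%M).out"
/usr/bin/gitlab-rails runner '
  require "sidekiq/api"
  stats = Sidekiq::Stats.new
  puts "       Busy: #{stats.workers_size}"
  puts "   Enqueued: #{stats.enqueued}"
  ps = Sidekiq::ProcessSet.new
  puts "---- Processes (#{ps.size}) ----"
  ps.each do |p|
    puts "  Threads: #{p["concurrency"]} (#{p["busy"]} busy)"
  end
' > "$out" 2>&1

# crontab entry:
# * * * * * root /usr/local/bin/sidekiqcheck.sh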

The Sidekiq config in my instance is all default, so I’m only running one Sidekiq worker process. Given that the reported symptoms match those described in the Sidekiq troubleshooting guide, I could increase the number of workers to try to fix this problem, but I’d like to be able to reproduce it first.

Also, while troubleshooting, I took a look at the Sidekiq logs and noticed that almost all of the failed jobs are failing with PG::ConnectionBad: PQsocket() can't get socket descriptor:

root@gitlab# cat /var/log/gitlab/sidekiq/current | jq '."exception.message" | select( . != null )' | sort | uniq -c
      1 "4:Deadline Exceeded. debug_error_string:{UNKNOWN:Error received from peer  {created_time:\"2023-08-22T12:44:36.843424225-07:00\", grpc_status:4, grpc_message:\"Deadline Exceeded\"}}"
      1 "4:Deadline Exceeded. debug_error_string:{UNKNOWN:Error received from peer  {created_time:\"2023-08-22T12:47:33.460745873-07:00\", grpc_status:4, grpc_message:\"Deadline Exceeded\"}}"
      1 "4:Deadline Exceeded. debug_error_string:{UNKNOWN:Error received from peer  {grpc_message:\"Deadline Exceeded\", grpc_status:4, created_time:\"2023-08-22T12:50:38.106787423-07:00\"}}"
     10 "Failed to obtain a lock"
    405 "PG::ConnectionBad: PQsocket() can't get socket descriptor"
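To see whether those errors cluster around the backlog spikes, bucketing them by minute can help. A rough pipeline over the same log, assuming each line is JSON with an ISO 8601 `time` field alongside `exception.message` as in the output above:

```shell
# Count PQsocket errors per minute from the structured Sidekiq log.
# .time[0:16] truncates "2023-08-22T12:44:36.843..." to "2023-08-22T12:44".
jq -r 'select(."exception.message" != null)
       | select(."exception.message" | test("PQsocket"))
       | .time[0:16]' /var/log/gitlab/sidekiq/current \
  | sort | uniq -c | sort -rn | head
```

If the busiest minutes here line up with the 200+ enqueued samples, that would suggest the socket errors and the backlog share a cause.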

I tried increasing the maximum number of Postgres connections with postgresql['max_connections'] = 1000 and reconfiguring GitLab, but I still see the same number of Postgres socket errors.
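One way to rule max_connections in or out is to compare the configured limit with actual usage. A quick check via the bundled psql wrapper (assuming the default Omnibus database name) might look like:

```shell
# Compare in-use connections against the configured limit.
gitlab-psql -d gitlabhq_production \
  -c "SELECT count(*) AS in_use,
             current_setting('max_connections') AS max_conn
      FROM pg_stat_activity;"
```

If in_use stays well below the limit while the socket errors keep appearing, the errors are probably dropped or stale client-side connections rather than connection exhaustion.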

I’m not sure if or how these Postgres socket errors relate to the Sidekiq slowness or the reported merge request/CI slowness, but it doesn’t seem normal, so I thought I would mention it.

Any thoughts on the interface slowness/pipeline delays, Sidekiq slowness, and Postgres socket errors? Thanks!

Users continued to report interface slowness and pipeline delays so I followed this documentation to increase the number of Sidekiq workers to 4.
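For anyone following along, if I understand that guide correctly the change boils down to a single gitlab.rb setting, applied with `gitlab-ctl reconfigure` (the `['*']` entries mean each process listens to all queues):

```ruby
# /etc/gitlab/gitlab.rb -- run four Sidekiq processes, each on all queues.
sidekiq['queue_groups'] = ['*'] * 4
```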

Since increasing the number of Sidekiq workers, users have reported that the interface is much more responsive and that CI jobs begin running immediately. There have also been no enqueued jobs since adding the workers. A few jobs are still dying with the same “PG::ConnectionBad: PQsocket() can’t get socket descriptor” error, though not nearly as many as before.

I’m still confused about the failing Sidekiq jobs and Postgres errors, and about why a few errors persist even though the job queues have been empty or low and the workers are usually not busy.

I’m experimenting with an Omnibus v16.3.0-ee instance on AlmaLinux 8, provisioned with only 1 core (i.e. way below the recommendation) and 4 GB RAM, and I see something similar, except the errors come from the gitlab-exporter service. I have exactly zero user load since I’m the only one experimenting, though, and I would not expect socket errors to the database under these conditions.

Example log:

==> /var/log/gitlab/gitlab-exporter/current <==
2023-08-25_13:09:37.92166 ::1 - - [25/Aug/2023:15:09:37 CEST] "GET /ruby HTTP/1.1" 200 1074
2023-08-25_13:09:37.92168 - -> /ruby
2023-08-25_13:09:46.82081 E, [2023-08-25T15:09:46.810133 #14320] ERROR -- : Error connecting to the database: PQsocket() can't get socket descriptor
2023-08-25_13:09:46.82083 E, [2023-08-25T15:09:46.810355 #14320] ERROR -- : Error connecting to the database: PQsocket() can't get socket descriptor
2023-08-25_13:09:46.82083 E, [2023-08-25T15:09:46.810450 #14320] ERROR -- : Error connecting to the database: PQsocket() can't get socket descriptor
2023-08-25_13:09:46.82790 ::1 - - [25/Aug/2023:15:09:46 CEST] "GET /database HTTP/1.1" 200 0
2023-08-25_13:09:46.82792 - -> /database
2023-08-25_13:09:50.23234 ::1 - - [25/Aug/2023:15:09:49 CEST] "GET /sidekiq HTTP/1.1" 200 140209
2023-08-25_13:09:50.23235 - -> /sidekiq
2023-08-25_13:09:52.90619 ::1 - - [25/Aug/2023:15:09:52 CEST] "GET /ruby HTTP/1.1" 200 1074
2023-08-25_13:09:52.90621 - -> /ruby
2023-08-25_13:10:01.83165 E, [2023-08-25T15:10:01.821261 #14320] ERROR -- : Error connecting to the database: PQsocket() can't get socket descriptor
2023-08-25_13:10:01.83168 E, [2023-08-25T15:10:01.821472 #14320] ERROR -- : Error connecting to the database: PQsocket() can't get socket descriptor
2023-08-25_13:10:01.83168 E, [2023-08-25T15:10:01.821611 #14320] ERROR -- : Error connecting to the database: PQsocket() can't get socket descriptor
2023-08-25_13:10:01.84421 ::1 - - [25/Aug/2023:15:10:01 CEST] "GET /database HTTP/1.1" 200 0
2023-08-25_13:10:01.84424 - -> /database
2023-08-25_13:10:05.45278 ::1 - - [25/Aug/2023:15:10:04 CEST] "GET /sidekiq HTTP/1.1" 200 140209
2023-08-25_13:10:05.45279 - -> /sidekiq
2023-08-25_13:10:07.91292 ::1 - - [25/Aug/2023:15:10:07 CEST] "GET /ruby HTTP/1.1" 200 1074
2023-08-25_13:10:07.91293 - -> /ruby

I’m still seeing PG::ConnectionBad: PQsocket() can't get socket descriptor errors, but I haven’t had any enqueued jobs since adding more Sidekiq workers.