Sidekiq Queue Backlog and PG::ConnectionBad: PQsocket() can't get socket descriptor Errors

cspencer1 · August 22, 2023, 11:05pm

I’m running self-hosted Omnibus GitLab v16.2.4 on an Ubuntu 20.04 instance (16CPUs, 64GB of RAM), which far exceeds the 1K user reference architecture. I have roughly 250 users using the GitLab instance.

Users have been reporting slow user interface refresh/updates after pushing to a merge request and CI pipeline delays. That slowness is sporadic and predates the upgrade to v16.2.4 (it was happening in v15 as well).

These symptoms sound exactly like the symptoms of slow Sidekiq performance. So, I set up a cron job to capture Sidekiq worker stats. After capturing data for several hours, I did find times where there were 200+ jobs enqueued, but users did notice any slowness in the merge request interface or delayed CI jobs during that time:

---- Processes (1) ----
  Threads: 20 (17 busy)
sidekiqcheck_20230822-13:41.out
       Busy: 15
   Enqueued: 208
---- Processes (1) ----
  Threads: 20 (15 busy)
sidekiqcheck_20230822-13:42.out
       Busy: 18
   Enqueued: 218
---- Processes (1) ----
  Threads: 20 (18 busy)
sidekiqcheck_20230822-13:43.out
       Busy: 17
   Enqueued: 196
---- Processes (1) ----
  Threads: 20 (17 busy)
sidekiqcheck_20230822-13:44.out
       Busy: 20
   Enqueued: 181

The Sidekiq config in my instance is all default, so I’m only running one Sidekiq worker. Given that the reported symptoms match those described in the Sidekiq troubleshooting guide, I could increase the number of workers to try to fix this problem but I’d like to be able to reproduce it first.

Also, while troubleshooting, I took a look at the Sidekiq logs and noticed that almost all of the job errors are failing with PG::ConnectionBad: PQsocket() can't get socket descriptor:

root@gitlab# cat /var/log/gitlab/sidekiq/current | jq '."exception.message" | select( . != null )' | sort | uniq -c
      1 "4:Deadline Exceeded. debug_error_string:{UNKNOWN:Error received from peer  {created_time:\"2023-08-22T12:44:36.843424225-07:00\", grpc_status:4, grpc_message:\"Deadline Exceeded\"}}"
      1 "4:Deadline Exceeded. debug_error_string:{UNKNOWN:Error received from peer  {created_time:\"2023-08-22T12:47:33.460745873-07:00\", grpc_status:4, grpc_message:\"Deadline Exceeded\"}}"
      1 "4:Deadline Exceeded. debug_error_string:{UNKNOWN:Error received from peer  {grpc_message:\"Deadline Exceeded\", grpc_status:4, created_time:\"2023-08-22T12:50:38.106787423-07:00\"}}"
     10 "Failed to obtain a lock"
    405 "PG::ConnectionBad: PQsocket() can't get socket descriptor"

I tried increasing the number of max Postgres connections with postgresql['max_connections'] = 1000 and re-configuring GitLab but I still see the same number of Postgres socket errors.

I’m not sure if/how these Postgres socket errors could be related to Sidekiq slowness or the merge request/CI slowness that has been reported but it doesn’t seem good/normal so I thought I would mention it.

Any thoughts on the interface slowness/pipeline delays, Sidekiq slowness, and Postgres socket errors? Thanks!

cspencer1 · August 23, 2023, 10:32pm

Users continued to report interface slowness and pipeline delays so I followed this documentation to increase the number of Sidekiq workers to 4.

Since increasing the number of Sidekiq workers, users have reported that the interface became much more responsive and that CI just began running immediately. There also have been no enqueued jobs since adding workers. There are still a few jobs dying with the same “PG::ConnectionBad: PQsocket() can’t get socket descriptor” error not near as many as before.

I’m still confused about the failing Sidekiq jobs/Postgres errors and why there are still a few errors even though the job queues have been empty or low and the workers are usually not busy.

dsjoho · August 25, 2023, 12:42pm

I’m experimenting with an omnibus v16.3.0-ee instance, on AlmaLinux 8 provisioned with only 1 core (i.e. way below recommendation) and 4gb ram, but I see something similar except from the gitlab-exporter service. I have exactly 0 user load since I’m the only one experimenting though, and I would not expect to get socket errors to the database under these conditions.

Example log:

==> /var/log/gitlab/gitlab-exporter/current <==
2023-08-25_13:09:37.92166 ::1 - - [25/Aug/2023:15:09:37 CEST] “GET /ruby HTTP/1.1” 200 1074
2023-08-25_13:09:37.92168 - → /ruby
2023-08-25_13:09:46.82081 E, [2023-08-25T15:09:46.810133 #14320] ERROR – : Error connecting to the database: PQsocket() can’t get socket descriptor
2023-08-25_13:09:46.82083 E, [2023-08-25T15:09:46.810355 #14320] ERROR – : Error connecting to the database: PQsocket() can’t get socket descriptor
2023-08-25_13:09:46.82083 E, [2023-08-25T15:09:46.810450 #14320] ERROR – : Error connecting to the database: PQsocket() can’t get socket descriptor
2023-08-25_13:09:46.82790 ::1 - - [25/Aug/2023:15:09:46 CEST] “GET /database HTTP/1.1” 200 0
2023-08-25_13:09:46.82792 - → /database
2023-08-25_13:09:50.23234 ::1 - - [25/Aug/2023:15:09:49 CEST] “GET /sidekiq HTTP/1.1” 200 140209
2023-08-25_13:09:50.23235 - → /sidekiq
2023-08-25_13:09:52.90619 ::1 - - [25/Aug/2023:15:09:52 CEST] “GET /ruby HTTP/1.1” 200 1074
2023-08-25_13:09:52.90621 - → /ruby
2023-08-25_13:10:01.83165 E, [2023-08-25T15:10:01.821261 #14320] ERROR – : Error connecting to the database: PQsocket() can’t get socket descriptor
2023-08-25_13:10:01.83168 E, [2023-08-25T15:10:01.821472 #14320] ERROR – : Error connecting to the database: PQsocket() can’t get socket descriptor
2023-08-25_13:10:01.83168 E, [2023-08-25T15:10:01.821611 #14320] ERROR – : Error connecting to the database: PQsocket() can’t get socket descriptor
2023-08-25_13:10:01.84421 ::1 - - [25/Aug/2023:15:10:01 CEST] “GET /database HTTP/1.1” 200 0
2023-08-25_13:10:01.84424 - → /database
2023-08-25_13:10:05.45278 ::1 - - [25/Aug/2023:15:10:04 CEST] “GET /sidekiq HTTP/1.1” 200 140209
2023-08-25_13:10:05.45279 - → /sidekiq
2023-08-25_13:10:07.91292 ::1 - - [25/Aug/2023:15:10:07 CEST] “GET /ruby HTTP/1.1” 200 1074
2023-08-25_13:10:07.91293 - → /ruby

cspencer1 · August 29, 2023, 5:46pm

Still seeing PG::ConnectionBad: PQsocket() can’t get socket descriptor errors but haven’t had any enqueued jobs since adding more Sidekiq workers.

Topic		Replies	Views
Triple question about sidekiq background jobs Self-managed enterprise	0	780	January 16, 2020
Sidekip problems : queue growing, merge requesting or forking too long How to Use GitLab	1	637	September 14, 2018
My Gitlab Version is 14.5.4 all of sudden my Sidekiq Enqueued queue growing and growing How to Use GitLab	9	504	April 22, 2023
Sidekiq Enqueued queue growing and growing How to Use GitLab	7	4460	December 20, 2016
Understanding background jobs in self-hosted Enterprise Edition of Gitlab Self-managed enterprise	6	6000	January 13, 2020

Sidekiq Queue Backlog and PG::ConnectionBad: PQsocket() can't get socket descriptor Errors

Related topics