Postgres DB high CPU utilization when a Events API is called

Hi GitLab community,

I upgraded recently my GitLab following suggested upgrade path to 14.9.5 (upgraded database as well to 13.7), since then Postgres database is reaching very high levels of CPU utilization 60%-70% like never before causing service interruptions, I traced utilization to a particular SQL query which is run whenever Events API endpoint is hit (“/api/v4/events?scope=all&per_page=1”). Service is fine as long as do not make these Events APi calls, but
I see the same requests are handled without any issue in earlier GitLab version 14.1.7 (database 12.8). Which new feature/fix might be causing this since 14.1.7? How to fix this so that Events API calls goes sommothly. Appreciate any help in this regard.

Find the details below:

Upgrade path: 14.1.7 => 14.3.6 => 14.9.5
GitLab Events API requests: “/api/v4/events?scope=all&per_page=1”
Database: AWS RDS Postgres (version 13.7)
SQL query and it’s execution plan (each query taking almost 20minutes):

2022-10-14 12:56:00 UTC:10.255.99.243(37722):xxxxxx@gitlabhq_production:[381]:LOG: duration: 1256875.584 ms plan:
Query Text: /application:web,correlation_id:01GFB7VB1RQZPYWNVCW0DT4FFJ,endpoint_id:HEAD /api/:version/events,db_config_name:main/ SELECT COUNT() FROM ((SELECT “events”. FROM (WITH RECURSIVE “array_cte” AS MATERIALIZED (SELECT “projects”.“id” FROM “projects” INNER JOIN “project_authorizations” ON “projects”.“id” = “project_authorizations”.“project_id” WHERE “project_authorizations”.“user_id” = 12), “recursive_keyset_cte” AS ((SELECT NULL::events AS records, array_cte_id_array, events_id_array, 0::bigint AS count FROM (SELECT ARRAY_AGG(“array_cte”.“id”) AS array_cte_id_array, ARRAY_AGG(“events”.“id”) AS events_id_array FROM (SELECT “array_cte”.“id” FROM array_cte) array_cte LEFT JOIN LATERAL (SELECT “events”.“id” AS id FROM “events” WHERE “events”.“project_id” = “array_cte”.“id” ORDER BY “events”.“id” DESC LIMIT 1) events ON TRUE WHERE “events”.“id” IS NOT NULL) array_scope_lateral_query LIMIT 1)
UNION ALL
(SELECT (SELECT events FROM “events” WHERE “events”.“id” = recursive_keyset_cte.events_id_array[position] LIMIT 1), array_cte_id_array, recursive_keyset_cte.events_id_array[:position_query.position-1]||next_cursor_values.id||recursive_keyset_cte.events_id_array[position_query.position+1:], recursive_keyset_cte.count + 1 FROM recursive_keyset_cte,
LATERAL (SELECT id, position FROM UNNEST(events_id_array) WITH ORDINALITY AS u(id, position) WHERE id IS NOT NULL ORDER BY 1 DESC LIMIT 1) AS position_query,
LATERAL (SELECT “record”.“id” FROM (VALUES (NULL)) AS nulls LEFT JOIN (SELECT “events”.“id” AS id FROM “events” WHERE “events”.“project_id” = recursive_keyset_cte.array_cte_id_array[position] AND (“events”.“id” < recursive_keyset_cte.events_id_array[position]) ORDER BY “events”.“id” DESC LIMIT 1) record ON TRUE LIMIT 1) AS next_cursor_values
)) SELECT (records).* FROM “recursive_keyset_cte” AS “events” WHERE (count <> 0)) events)) events INNER JOIN “projects” ON “projects”.“id” = “events”.“project_id” WHERE (EXISTS (SELECT 1 FROM “project_authorizations” WHERE “project_authorizations”.“user_id” = 12 AND (project_authorizations.project_id = projects.id)) OR projects.visibility_level IN (0,10,20))
Aggregate (cost=17976.73…17976.74 rows=1 width=8)
→ Nested Loop (cost=16710.61…17976.48 rows=100 width=0)
→ CTE Scan on recursive_keyset_cte events (cost=16710.32…16712.60 rows=100 width=110)
Filter: (count <> 0)
CTE array_cte
→ Hash Join (cost=382.68…4207.29 rows=2733 width=4)
Hash Cond: (project_authorizations_2.project_id = projects_1.id)
→ Bitmap Heap Scan on project_authorizations project_authorizations_2 (cost=65.61…3883.03 rows=2733 width=4)
Recheck Cond: (user_id = 12)
→ Bitmap Index Scan on project_authorizations_pkey (cost=0.00…64.92 rows=2733 width=0)
Index Cond: (user_id = 12)
→ Hash (cost=254.81…254.81 rows=4981 width=4)
→ Seq Scan on projects projects_1 (cost=0.00…254.81 rows=4981 width=4)
CTE recursive_keyset_cte
→ Recursive Union (cost=11217.16…12503.04 rows=101 width=104)
→ Limit (cost=11217.16…11217.18 rows=1 width=104)
→ Subquery Scan on array_scope_lateral_query (cost=11217.16…11217.18 rows=1 width=104)
→ Aggregate (cost=11217.16…11217.17 rows=1 width=64)
→ Nested Loop (cost=0.43…11203.49 rows=2733 width=12)
→ CTE Scan on array_cte (cost=0.00…54.66 rows=2733 width=4)
→ Subquery Scan on events_2 (cost=0.43…4.07 rows=1 width=8)
Filter: (events_2.id IS NOT NULL)
→ Limit (cost=0.43…4.06 rows=1 width=8)
→ Index Only Scan Backward using index_events_on_project_id_and_id on events events_3 (cost=0.43…3102.02 rows=854 width=8)
Index Cond: (project_id = array_cte.id)
→ Nested Loop (cost=0.58…128.38 rows=10 width=104)
→ Nested Loop (cost=0.15…1.95 rows=10 width=80)
→ WorkTable Scan on recursive_keyset_cte (cost=0.00…0.20 rows=10 width=72)
→ Limit (cost=0.15…0.16 rows=1 width=16)
→ Sort (cost=0.15…0.18 rows=10 width=16)
Sort Key: u.id DESC
→ Function Scan on unnest u (cost=0.00…0.10 rows=10 width=16)
Filter: (id IS NOT NULL)
→ Limit (cost=0.43…4.16 rows=1 width=8)
→ Nested Loop Left Join (cost=0.43…4.16 rows=1 width=8)
→ Result (cost=0.00…0.01 rows=1 width=0)
→ Limit (cost=0.43…4.13 rows=1 width=8)
→ Index Only Scan Backward using index_events_on_project_id_and_id on events events_4 (cost=0.43…1053.76 rows=285 width=8)
Index Cond: ((project_id = (recursive_keyset_cte.array_cte_id_array)[u.“position”]) AND (id < (recursive_keyset_cte.events_id_array)[u.“position”]))
SubPlan 4
→ Limit (cost=0.43…8.45 rows=1 width=112)
→ Index Scan using events_pkey on events events_1 (cost=0.43…8.45 rows=1 width=112)
Index Cond: (id = (recursive_keyset_cte.events_id_array)[u.“position”])
→ Index Scan using projects_pkey on projects (cost=0.28…12.63 rows=1 width=4)
Index Cond: (id = ((events.records).project_id))
Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) OR (visibility_level = ANY (‘{0,10,20}’::integer)))
SubPlan 1
→ Index Only Scan using index_unique_project_authorizations_on_project_id_user_id on project_authorizations (cost=0.42…8.45 rows=1 width=0)
Index Cond: ((project_id = projects.id) AND (user_id = 12))
SubPlan 2
→ Bitmap Heap Scan on project_authorizations project_authorizations_1 (cost=65.61…3883.03 rows=2733 width=4)
Recheck Cond: (user_id = 12)
→ Bitmap Index Scan on project_authorizations_pkey (cost=0.00…64.92 rows=2733 width=0)
Index Cond: (user_id = 12)

Dear community - after this I have upgraded my deployment to 15.0.5 (via 14.3.6 => 14.9.5 => 14.10.5 => 15.0.5) which solved the problem, does anyone know how could I identify ahead and prevent such issues during the GitLab upgrades?