pg_upgrade failure when upgrading CE 16.7.7 to 16.11.0

Problem to solve

After running an older version for a while, I recently upgraded to 16.7.7 without apparent issues. Today I tried to upgrade to the latest gitlab-ce 16.11.0 (Omnibus package on Ubuntu), but the PostgreSQL upgrade step failed.

/var/log/gitlab/postgresql/current contains the following errors:

2024-04-19_05:14:29.59665 LOG:  database system was not properly shut down; automatic recovery in progress
2024-04-19_05:14:29.60398 LOG:  redo starts at 253/8BFAFE88
2024-04-19_05:14:29.62591 LOG:  invalid record length at 253/8BFCDB68: wanted 24, got 0
2024-04-19_05:14:29.62592 LOG:  redo done at 253/8BFCDAF0
2024-04-19_05:14:29.69751 LOG:  database system is ready to accept connections
2024-04-19_05:19:51.53218 ERROR:  relation "namespace_descendants" does not exist at character 491
2024-04-19_05:19:51.53220 STATEMENT:  SELECT a.attname, format_type(a.atttypid, a.atttypmod),
2024-04-19_05:19:51.53220              pg_get_expr(d.adbin, d.adrelid), a.attnotnull, a.atttypid, a.atttypmod,
2024-04-19_05:19:51.53220              c.collname, col_description(a.attrelid, a.attnum) AS comment,
2024-04-19_05:19:51.53221              attgenerated as attgenerated
2024-04-19_05:19:51.53221         FROM pg_attribute a
2024-04-19_05:19:51.53221         LEFT JOIN pg_attrdef d ON a.attrelid = d.adrelid AND a.attnum = d.adnum
2024-04-19_05:19:51.53221         LEFT JOIN pg_type t ON a.atttypid = t.oid
2024-04-19_05:19:51.53222         LEFT JOIN pg_collation c ON a.attcollation = c.oid AND a.attcollation <> t.typcollation
2024-04-19_05:19:51.53222        WHERE a.attrelid = '"namespace_descendants"'::regclass
2024-04-19_05:19:51.53222          AND a.attnum > 0 AND NOT a.attisdropped
2024-04-19_05:19:51.53223        ORDER BY a.attnum
2024-04-19_05:19:51.53223

I’ve rolled back the server to the pre-upgrade snapshot, so I won’t be able to get any more logs from it. Is there something I can do to avoid this in a future attempt, or is this a bug?

Some earlier entries from the same log:

2024-04-19_05:12:35.17262 received TERM from runit, sending INT instead to force quit connections
2024-04-19_05:12:35.20022 LOG:  received fast shutdown request
2024-04-19_05:12:35.20796 LOG:  aborting any active transactions
2024-04-19_05:12:35.21732 FATAL:  terminating connection due to administrator command
...
2024-04-19_05:12:35.22474 LOG:  background worker "logical replication launcher" (PID 1049356) exited with exit code 1
2024-04-19_05:12:35.22543 LOG:  shutting down
2024-04-19_05:12:35.26831 PANIC:  could not open file "/var/opt/gitlab/postgresql/data/global/pg_control": Operation not permitted
2024-04-19_05:12:38.30992 FATAL:  the database system is shutting down
...
2024-04-19_05:12:38.74882 LOG:  checkpointer process (PID 1049351) was terminated by signal 6: Aborted
2024-04-19_05:12:38.74885 LOG:  terminating any other active server processes
2024-04-19_05:12:38.79812 LOG:  abnormal database system shutdown
2024-04-19_05:12:38.85032 LOG:  database system is shut down
2024-04-19_05:13:16.98491 LOG:  starting PostgreSQL 13.14 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, 64-bit
2024-04-19_05:13:16.99131 LOG:  listening on Unix socket "/var/opt/gitlab/postgresql/.s.PGSQL.5432"
2024-04-19_05:13:17.05162 LOG:  database system was interrupted; last known up at 2024-04-19 04:31:28 GMT
2024-04-19_05:13:17.80268 FATAL:  the database system is starting up
...

I suspect the log entries above are not the root cause but a consequence of the upgrade to PostgreSQL 14 failing and being rolled back. The error message at the time only said to “check logs” without saying which ones, and the above is all I could find. (I also didn’t manage to capture the original error: I ran gitlab-ctl tail and its output flooded the scrollback.)
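
For a future attempt, here is a minimal sketch of how I could pull only the severe entries out of the PostgreSQL log instead of relying on gitlab-ctl tail. The function name severe_lines is my own, and the pattern assumes the severity appears as ERROR:, FATAL: or PANIC: right after the timestamp, as in the excerpts above.

```shell
#!/bin/sh
# severe_lines (hypothetical helper): read a PostgreSQL log on stdin and
# print only the lines whose severity is ERROR, FATAL or PANIC.
severe_lines() {
    grep -E ' (ERROR|FATAL|PANIC): '
}

# Example usage (log path taken from this post):
#   sudo cat /var/log/gitlab/postgresql/current | severe_lines
```

This only filters by severity, so multi-line STATEMENT continuations like the ones in the first excerpt are dropped.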

After rolling back to 16.7.7, I ran sudo gitlab-ctl pg-upgrade on its own (without a package upgrade). The result was essentially the same as in the original upgrade attempt:

Checking for an omnibus managed postgresql: OK
Checking if postgresql['version'] is set: OK
Checking if we already upgraded: NOT OK
Checking for a newer version of PostgreSQL to install
Upgrading PostgreSQL to 14.10
Checking if disk for directory /var/opt/gitlab/postgresql/data has enough free space for PostgreSQL upgrade: OK
Checking if PostgreSQL bin files are symlinked to the expected location: OK
Waiting 30 seconds to ensure tasks complete before PostgreSQL upgrade.
See https://docs.gitlab.com/omnibus/settings/database.html#upgrade-packaged-postgresql-server for details
If you do not want to upgrade the PostgreSQL server at this time, enter Ctrl-C and see the documentation for details

Please hit Ctrl-C now if you want to cancel the operation.
Toggling deploy page:cp /opt/gitlab/embedded/service/gitlab-rails/public/deploy.html /opt/gitlab/embedded/service/gitlab-rails/public/index.html
Toggling deploy page: OK
Toggling services:ok: down: alertmanager: 1s, normally up
ok: down: gitaly: 1s, normally up
ok: down: gitlab-exporter: 1s, normally up
ok: down: gitlab-kas: 0s, normally up
ok: down: logrotate: 1s, normally up
ok: down: node-exporter: 0s, normally up
ok: down: postgres-exporter: 0s, normally up
ok: down: prometheus: 0s, normally up
ok: down: redis-exporter: 0s, normally up
ok: down: sidekiq: 0s, normally up
Toggling services: OK
Running stop on postgresql:timeout: run: postgresql: (pid 1049349) 608986s, want down
Running stop on postgresql: OK
Symlink correct version of binaries: OK
Creating temporary data directory: OK
Initializing the new database: OK
Upgrading the data:Error upgrading the data to version 14.10
STDOUT: Performing Consistency Checks
-----------------------------
Checking cluster versions                                   ok

The source cluster was not shut down cleanly.
Failure, exiting
STDERR:
Upgrading the data: NOT OK
== Fatal error ==
Error running pg_upgrade, please check logs
== Reverting ==
ok: down: postgresql: 23s, normally up
Symlink correct version of binaries: OK
ok: run: postgresql: (pid 2837483) 1s
== Reverted ==
== Reverted to 13.13. Please check output for what went wrong ==
Toggling deploy page:rm -f /opt/gitlab/embedded/service/gitlab-rails/public/index.html
Toggling deploy page: OK
Toggling services:ok: run: alertmanager: (pid 2837834) 0s
ok: run: gitaly: (pid 2837854) 1s
ok: run: gitlab-exporter: (pid 2837892) 0s
ok: run: gitlab-kas: (pid 2837907) 1s
ok: run: logrotate: (pid 2837936) 0s
ok: run: node-exporter: (pid 2837964) 1s
ok: run: postgres-exporter: (pid 2837971) 0s
ok: run: prometheus: (pid 2837982) 1s
ok: run: redis-exporter: (pid 2837995) 0s
ok: run: sidekiq: (pid 2838021) 1s
Toggling services: OK
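
The pg_upgrade output above stops at "The source cluster was not shut down cleanly." One check I could run before retrying is to look at the cluster state recorded in pg_control. This is a rough sketch, not a verified procedure: the helper name cluster_state is hypothetical, and the pg_controldata path in the usage comment is an assumption about where the Omnibus package puts the binary.

```shell
#!/bin/sh
# cluster_state (hypothetical helper): read `pg_controldata` output on stdin
# and print the value of its "Database cluster state" field,
# for example "shut down" or "in production".
cluster_state() {
    sed -n 's/^Database cluster state:[[:space:]]*//p'
}

# Example usage (binary path is an assumption for an Omnibus install;
# the data directory is the one shown in the output above):
#   sudo -u gitlab-psql /opt/gitlab/embedded/bin/pg_controldata \
#       /var/opt/gitlab/postgresql/data | cluster_state
```

As far as I understand, pg_upgrade expects the stopped source cluster to report "shut down" here, so any other value after stopping PostgreSQL would match the failure above.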

The logs in postgresql/current were essentially the same as above, except that they did not include the “relation does not exist” error, only the entries before it.

Also, maybe related:

$ ls -l /var/opt/gitlab/postgresql/data/global/pg_control
-rw------- 1 gitlab-psql gitlab-psql 8192 Apr 19 18:29 /var/opt/gitlab/postgresql/data/global/pg_control

This attempt failed more gracefully: the instance was still operational afterwards, whereas after the original upgrade attempt PostgreSQL refused to start and the instance stayed down. I rolled it back anyway.