GitLab CE upgrade from 16.3.7 to 16.7.7: background migrations stuck

Problem to solve

I’d like to upgrade my self-managed GitLab from 16.3.3 to 16.9.2.
I’m following the upgrade path: from 16.3.3 to 16.3.7, then to 16.7.7, and then to 16.9.2.

But I found that a background migration was not finished on 16.7.7, which could cause errors when upgrading to 16.9.2. I then waited more than 12 hours for the job to finish, but it is stuck.

Has anyone run into this issue before?
Please help. Thank you!
The steps I took are listed below.

Steps to reproduce

Starting from GitLab version 16.3.7, I used wget to download the 16.7.7 deb package (I could not find that version via apt):

wget --content-disposition https://packages.gitlab.com/gitlab/gitlab-ce/packages/ubuntu/bionic/gitlab-ce_16.7.7-ce.0_amd64.deb/download.deb

then installed it:

dpkg -i gitlab-ce_16.7.7-ce.0_amd64.deb

After about 15 minutes the output said the upgrade was complete. I then confirmed the version was 16.7.7 by running gitlab-rake gitlab:env:info.
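
For anyone else hitting the same thing, the apt route would normally look something like the following; it did not list 16.7.7 for me, which is why I fell back to wget (the exact version string is taken from the deb filename above):

sudo apt-get update
apt-cache madison gitlab-ce | grep 16.7.7        # check whether the version is visible in the repo
sudo apt-get install gitlab-ce=16.7.7-ce.0       # pin the exact version if it is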

Then I went to the UI and saw some queued background migration jobs. The next day (after more than 12 hours) they were still stuck.
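
I mainly checked this in the Admin Area UI, but as far as I know the same status can also be checked from the command line (I have not verified this task name against 16.7.7, so treat it as an assumption):

sudo gitlab-rake gitlab:background_migrations:status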

So I ran the following to find unfinished jobs:

sudo gitlab-psql

SELECT
  job_class_name,
  table_name,
  column_name,
  job_arguments
FROM batched_background_migrations
WHERE status <> 3;

I got this SQL output:

job_class_name | CopyColumnUsingBackgroundMigrationJob
table_name     | ci_builds
column_name    | id
job_arguments  | [["auto_canceled_by_id", "commit_id", "erased_by_id", "project_id", "runner_id", "trigger_request_id", "upstream_pipeline_id", "user_id"], ["auto_canceled_by_id_convert_to_bigint", "commit_id_convert_to_bigint", "erased_by_id_convert_to_bigint", "project_id_convert_to_bigint", "runner_id_convert_to_bigint", "trigger_request_id_convert_to_bigint", "upstream_pipeline_id_convert_to_bigint", "user_id_convert_to_bigint"]]
(1 row)
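
To dig a little deeper into why this one migration is stuck, queries like the following might help. This is only a sketch: the status integers (3 = finished, 6 = finalized) and the batched_background_migration_jobs table and columns are my assumptions about the schema in this version.

sudo gitlab-psql

-- overall state of the most recent batched migrations (status meanings are assumed)
SELECT id, job_class_name, table_name, status, total_tuple_count
FROM batched_background_migrations
ORDER BY id DESC
LIMIT 10;

-- individual batches for the stuck migration; replace <migration_id> with the id from above
SELECT id, status, attempts, min_value, max_value
FROM batched_background_migration_jobs
WHERE batched_background_migration_id = <migration_id>
ORDER BY id DESC
LIMIT 10;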

Then, following this page: https://docs.gitlab.com/ee/update/background_migrations.html#finish-a-failed-migration-manually

I ran:

sudo gitlab-rake gitlab:background_migrations:finalize[CopyColumnUsingBackgroundMigrationJob,ci_builds,id,'[["auto_canceled_by_id"\, "commit_id"\, "erased_by_id"\, "project_id"\, "runner_id"\, "trigger_request_id"\, "upstream_pipeline_id"\, "user_id"]\, ["auto_canceled_by_id_convert_to_bigint"\, "commit_id_convert_to_bigint"\, "erased_by_id_convert_to_bigint"\, "project_id_convert_to_bigint"\, "runner_id_convert_to_bigint"\, "trigger_request_id_convert_to_bigint"\, "upstream_pipeline_id_convert_to_bigint"\, "user_id_convert_to_bigint"]]']

After a few minutes, I got this error:

rake aborted!
ActiveRecord::QueryCanceled: PG::QueryCanceled: ERROR:  canceling statement due to statement timeout
CONTEXT:  while locking tuple (363868,5) in relation "ci_builds"
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:127:in `public_send'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:127:in `block in write_using_load_balancer'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/load_balancer.rb:141:in `block in read_write'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/load_balancer.rb:228:in `retry_with_backoff'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/load_balancer.rb:130:in `read_write'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:126:in `write_using_load_balancer'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:61:in `block (2 levels) in <class:ConnectionProxy>'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/background_migration/copy_column_using_background_migration_job.rb:25:in `block in perform'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/background_migration/batched_migration_job.rb:105:in `block (2 levels) in each_sub_batch'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/background_migration/batch_metrics.rb:22:in `instrument_operation'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/background_migration/batched_migration_job.rb:104:in `block in each_sub_batch'
/opt/gitlab/embedded/service/gitlab-rails/app/models/concerns/each_batch.rb:99:in `block (2 levels) in each_batch'
/opt/gitlab/embedded/service/gitlab-rails/app/models/concerns/each_batch.rb:99:in `block in each_batch'
/opt/gitlab/embedded/service/gitlab-rails/app/models/concerns/each_batch.rb:69:in `step'
/opt/gitlab/embedded/service/gitlab-rails/app/models/concerns/each_batch.rb:69:in `each_batch'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/background_migration/batched_migration_job.rb:103:in `each_sub_batch'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/background_migration/copy_column_using_background_migration_job.rb:24:in `perform'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/background_migration/batched_migration_wrapper.rb:81:in `execute_batched_migration_job'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/background_migration/batched_migration_wrapper.rb:63:in `execute_job'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/background_migration/batched_migration_wrapper.rb:50:in `execute_batch'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/background_migration/batched_migration_wrapper.rb:25:in `perform'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/background_migration/batched_migration_runner.rb:30:in `run_migration_job'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/background_migration/batched_migration_runner.rb:140:in `run_migration_while'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/background_migration/batched_migration_runner.rb:80:in `finalize'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/background_migration/batched_migration_runner.rb:10:in `finalize'
/opt/gitlab/embedded/service/gitlab-rails/lib/tasks/gitlab/background_migrations.rake:72:in `finalize_migration'
/opt/gitlab/embedded/service/gitlab-rails/lib/tasks/gitlab/background_migrations.rake:18:in `block (3 levels) in <top (required)>'
/opt/gitlab/embedded/bin/bundle:25:in `load'
/opt/gitlab/embedded/bin/bundle:25:in `<main>'

Caused by:
PG::QueryCanceled: ERROR:  canceling statement due to statement timeout
CONTEXT:  while locking tuple (363868,5) in relation "ci_builds"
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:127:in `public_send'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:127:in `block in write_using_load_balancer'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/load_balancer.rb:141:in `block in read_write'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/load_balancer.rb:228:in `retry_with_backoff'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/load_balancer.rb:130:in `read_write'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:126:in `write_using_load_balancer'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/load_balancing/connection_proxy.rb:61:in `block (2 levels) in <class:ConnectionProxy>'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/background_migration/copy_column_using_background_migration_job.rb:25:in `block in perform'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/background_migration/batched_migration_job.rb:105:in `block (2 levels) in each_sub_batch'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/background_migration/batch_metrics.rb:22:in `instrument_operation'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/background_migration/batched_migration_job.rb:104:in `block in each_sub_batch'
/opt/gitlab/embedded/service/gitlab-rails/app/models/concerns/each_batch.rb:99:in `block (2 levels) in each_batch'
/opt/gitlab/embedded/service/gitlab-rails/app/models/concerns/each_batch.rb:99:in `block in each_batch'
/opt/gitlab/embedded/service/gitlab-rails/app/models/concerns/each_batch.rb:69:in `step'
/opt/gitlab/embedded/service/gitlab-rails/app/models/concerns/each_batch.rb:69:in `each_batch'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/background_migration/batched_migration_job.rb:103:in `each_sub_batch'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/background_migration/copy_column_using_background_migration_job.rb:24:in `perform'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/background_migration/batched_migration_wrapper.rb:81:in `execute_batched_migration_job'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/background_migration/batched_migration_wrapper.rb:63:in `execute_job'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/background_migration/batched_migration_wrapper.rb:50:in `execute_batch'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/background_migration/batched_migration_wrapper.rb:25:in `perform'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/background_migration/batched_migration_runner.rb:30:in `run_migration_job'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/background_migration/batched_migration_runner.rb:140:in `run_migration_while'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/background_migration/batched_migration_runner.rb:80:in `finalize'
/opt/gitlab/embedded/service/gitlab-rails/lib/gitlab/database/background_migration/batched_migration_runner.rb:10:in `finalize'
/opt/gitlab/embedded/service/gitlab-rails/lib/tasks/gitlab/background_migrations.rake:72:in `finalize_migration'
/opt/gitlab/embedded/service/gitlab-rails/lib/tasks/gitlab/background_migrations.rake:18:in `block (3 levels) in <top (required)>'
/opt/gitlab/embedded/bin/bundle:25:in `load'
/opt/gitlab/embedded/bin/bundle:25:in `<main>'
Tasks: TOP => gitlab:background_migrations:finalize
(See full trace by running task with --trace)

And in the UI the migration’s status changed to finalizing, but it still did not finish.
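
Since the error is a statement timeout while locking rows in ci_builds, one workaround I’m considering is to raise the timeout temporarily and retry the finalize task. This is only a sketch; the gitlab.rb keys below are my assumptions based on the Omnibus template, so please verify them for your version:

# /etc/gitlab/gitlab.rb (assumed key names; values are in milliseconds)
# gitlab_rails['db_statement_timeout'] = 600000
# postgresql['statement_timeout'] = '600000'

sudo gitlab-ctl reconfigure
# then re-run the same gitlab:background_migrations:finalize[...] command as above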

Configuration

Based on https://docs.gitlab.com/ee/update/versions/gitlab_16_changes.html#linux-package-installations-2

I’ve added

postgresql['version'] = 13

to /etc/gitlab/gitlab.rb before upgrading from 16.3.7 to 16.7.7.
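
For reference, this is roughly how I applied and verified the pin afterwards (the SHOW query is just a generic PostgreSQL check):

sudo gitlab-ctl reconfigure
sudo gitlab-psql -c "SHOW server_version;"   # should report a 13.x version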

Versions


  • from GitLab CE 16.3.7 to 16.7.7
root@git-testing:~# gitlab-rake gitlab:env:info

System information
System:		Ubuntu 18.04
Current User:	git
Using RVM:	no
Ruby Version:	3.1.4p223
Gem Version:	3.4.22
Bundler Version:2.4.22
Rake Version:	13.0.6
Redis Version:	7.0.15
Sidekiq Version:6.5.12
Go Version:	unknown

GitLab information
Version:	16.7.7
Revision:	5fb02de437c
Directory:	/opt/gitlab/embedded/service/gitlab-rails
DB Adapter:	PostgreSQL
DB Version:	13.13
URL:		<git url>
HTTP Clone URL:	http://<git url>/some-group/some-project.git
SSH Clone URL:	git@<git url>:some-group/some-project.git
Using LDAP:	yes
Using Omniauth:	yes
Omniauth Providers:

GitLab Shell
Version:	14.32.0
Repository storages:
- default: 	unix:/var/opt/gitlab/gitaly/gitaly.socket
GitLab Shell path:		/opt/gitlab/embedded/service/gitlab-shell

Gitaly
- default Address: 	unix:/var/opt/gitlab/gitaly/gitaly.socket
- default Version: 	16.7.7
- default Git Version: 	2.42.0

Other information

I had already tried upgrading to 16.9.2 anyway, without checking the background migration jobs on 16.7.7.
The reconfigure step then took a long time and timed out after 1 hour.
It got stuck at the bash_hide_env[migrate gitlab-rails database] action run step.

My guess is that I need to wait for the background migration jobs on 16.7.7 to finish so that I won’t have issues on 16.9.2. But is there any workaround that would let me upgrade to 16.9.2 and get past this issue?

What is the output of gitlab-rake db:migrate:status?

You could try forcing it again with gitlab-rake db:migrate.

As for trying to force it, it looks like it can’t lock the ci_builds table to migrate it. Are you running a lot of pipelines at the moment?

And sometimes the migrations can take a few days. How big is your db?

Thank you for the reply!
I did all this on a test machine, so no pipelines were running and no users were using GitLab at that time.
I also realize the migrations can take a few days to complete, but I’m afraid performance will be poor in the meantime.

And indeed, the DB size may be the root cause of the stuck migration.

[The steps below were executed before I deleted any DB records]
I ran gitlab-rake db:migrate:status while the CopyColumnUsingBackgroundMigrationJob:ci_builds job was stuck at 99.00% (one day after I upgraded GitLab to 16.7.7).
The output showed every item as ‘up’.
But the job still showed as unfinished in the web UI.

I ran gitlab-rake db:migrate and got this:

main: == [advisory_lock_connection] object_id: 114800, pg_backend_pid: 15361
main: == [advisory_lock_connection] object_id: 114800, pg_backend_pid: 15361
INFO:  analyzing "public.p_ci_runner_machine_builds" inheritance tree
INFO:  analyzing "gitlab_partitions_dynamic.ci_runner_machine_builds_100"
INFO:  "ci_runner_machine_builds_100": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows
INFO:  analyzing "gitlab_partitions_dynamic.ci_runner_machine_builds_101"
INFO:  "ci_runner_machine_builds_101": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows
INFO:  analyzing "public.p_ci_job_annotations" inheritance tree
INFO:  analyzing "gitlab_partitions_dynamic.ci_job_annotations_100"
INFO:  "ci_job_annotations_100": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows
INFO:  analyzing "gitlab_partitions_dynamic.ci_job_annotations_101"
INFO:  "ci_job_annotations_101": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows
INFO:  analyzing "public.p_ci_builds_metadata" inheritance tree
INFO:  "ci_builds_metadata": scanned 300000 of 546993 pages, containing 5050063 live rows and 84029 dead rows; 300000 rows in sample, 9207830 estimated total rows
INFO:  analyzing "public.ci_builds_metadata"
INFO:  "ci_builds_metadata": scanned 300000 of 546993 pages, containing 5030953 live rows and 84463 dead rows; 300000 rows in sample, 9172987 estimated total rows
INFO:  analyzing "gitlab_partitions_dynamic.ci_builds_metadata_101"
INFO:  "ci_builds_metadata_101": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows
INFO:  analyzing "public.p_ci_builds" inheritance tree
INFO:  "ci_builds": scanned 300000 of 964214 pages, containing 2857863 live rows and 260880 dead rows; 300000 rows in sample, 9185305 estimated total rows
INFO:  analyzing "public.ci_builds"
INFO:  "ci_builds": scanned 300000 of 964214 pages, containing 2856856 live rows and 260473 dead rows; 300000 rows in sample, 9182069 estimated total rows
INFO:  analyzing "gitlab_partitions_dynamic.ci_builds_101"
INFO:  "ci_builds_101": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows

[After I deleted DB records]
It looks like the issue is related to the ci_builds table. I manually deleted old data from ci_builds, and after that the background migration finished much faster when upgrading to 16.7.7.
BTW, I deleted 8.6M records and 0.5M records remained.

I’m not sure this is the right way to do it, and I’m still looking for suggestions or guidance on deleting old data, or on what to do when the DB is this large.
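
For anyone weighing the same decision, a couple of read-only queries like these can show how much of ci_builds is old data and how big the table is before deleting anything (the one-year cutoff is just an example):

sudo gitlab-psql -c "SELECT count(*) FROM ci_builds WHERE created_at < now() - interval '1 year';"
sudo gitlab-psql -c "SELECT pg_size_pretty(pg_total_relation_size('ci_builds'));"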

I forget where it’s set, but I wonder what the statement_timeout is for PostgreSQL. You could connect to the DB with psql and run show statement_timeout;, though I would think that when it runs the migration it should already be sufficiently high.

du -sh /var/opt/gitlab/postgresql should show your DB size for reference.
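
Something like this should show both the timeout and the database size (generic PostgreSQL queries, nothing GitLab-specific):

sudo gitlab-psql -c "SHOW statement_timeout;"
sudo gitlab-psql -c "SELECT pg_size_pretty(pg_database_size(current_database()));"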

And from what I’ve seen, the DB could sometimes use a full vacuum to clear out all the dead tuples (note: this locks the tables exclusively while it works). I’m not sure I’d be deleting records from those tables wholesale, simply because they’re all interlinked, and I would hazard that any upgrade after this one could cause issues because a foreign key target is missing (i.e. one that was present in ci_builds).
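
A rough sketch of what I mean, to be run only in a maintenance window and with a backup in place:

# show which tables carry the most dead tuples
sudo gitlab-psql -c "SELECT relname, n_live_tup, n_dead_tup FROM pg_stat_user_tables ORDER BY n_dead_tup DESC LIMIT 10;"

# full vacuum of a single table; this takes an exclusive lock while it runs
sudo gitlab-psql -c "VACUUM (FULL, VERBOSE) ci_builds;"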

Yes, the tables are linked, so deleting records directly from the DB may not be a good solution.
But this time I had already deleted the records manually. I also deleted records from ci_pipelines and ran VACUUM FULL to reduce the DB size.
After all these actions were done, the upgrade to 16.7.7 looked fine.

And I plan to follow the approach described here: https://docs.gitlab.com/ee/user/storage_management_automation.html
to delete old records periodically.
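
As far as I understand that page, the cleanup boils down to API calls roughly like these (the token, project ID, pipeline ID, and cutoff date are placeholders, and I haven’t scripted it yet):

# list pipelines for one project that were last updated before a cutoff date
curl --header "PRIVATE-TOKEN: <token>" \
  "https://<git url>/api/v4/projects/<project_id>/pipelines?updated_before=2023-01-01T00:00:00Z&per_page=100"

# delete a single pipeline by ID (this also removes its jobs and artifacts)
curl --request DELETE --header "PRIVATE-TOKEN: <token>" \
  "https://<git url>/api/v4/projects/<project_id>/pipelines/<pipeline_id>"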

Thanks again for the reply!