Upgrade 16.3.7 to 16.7.6 gives database error. Bug?

I upgraded from 16.3.7 to 16.7.6, and 12 hours later it is still working through the background database migrations:


gitlabhq_production=# select job_class_name, table_name, column_name, job_arguments from batched_background_migrations where status <> 3;
                   job_class_name                   |         table_name          | column_name |                                job_arguments
----------------------------------------------------+-----------------------------+-------------+------------------------------------------------------------------------------
 CopyColumnUsingBackgroundMigrationJob              | ci_namespace_monthly_usages | id          | [["shared_runners_duration"], ["shared_runners_duration_convert_to_bigint"]]
 BackfillUsersWithDefaults                          | users                       | id          | []
 BackfillUserPreferencesWithDefaults                | user_preferences            | id          | []
 CreateComplianceStandardsAdherence                 | projects                    | id          | []
 BackfillProjectStatisticsStorageSizeWithRecentSize | project_statistics          | project_id  | []
 UpdateUsersSetExternalIfServiceAccount             | users                       | id          | []
 CopyColumnUsingBackgroundMigrationJob              | ci_project_monthly_usages   | id          | [["shared_runners_duration"], ["shared_runners_duration_convert_to_bigint"]]
(7 rows)

I have done 50+ upgrades over the years, and the background migrations usually take about 30 minutes.

Is there a way to debug what’s causing this to hang?

I am unable to run gitlab-ctl reconfigure until this has completed.

I have seen migrations take a long time to complete, some up to a day if not longer. This has been mentioned either elsewhere on the forum, or maybe even in the GitLab documentation.

Check your CPU usage using top or htop and see if it’s being utilised; migrations normally cause high CPU usage while they are running. Also, what are the specs of your server? How many CPUs? How much RAM?

Hi,

8 cores, 16GB RAM.

Checked this morning and it is still stuck.

Some debugging notes below if it helps:

Run: gitlab-psql

SELECT id, job_class_name, table_name, column_name, job_arguments FROM batched_background_migrations WHERE status <> 3;

This lists all the unfinished migrations and their IDs. Note the IDs:

> gitlabhq_production=# SELECT id, job_class_name, table_name, column_name, job_arguments FROM batched_background_migrations WHERE status <> 3;
>  id  |                   job_class_name                   |         table_name          | column_name |                                job_arguments
> -----+----------------------------------------------------+-----------------------------+-------------+------------------------------------------------------------------------------
>  157 | CopyColumnUsingBackgroundMigrationJob              | ci_namespace_monthly_usages | id          | [["shared_runners_duration"], ["shared_runners_duration_convert_to_bigint"]]
>  158 | BackfillUsersWithDefaults                          | users                       | id          | []
>  159 | BackfillUserPreferencesWithDefaults                | user_preferences            | id          | []
>  160 | CreateComplianceStandardsAdherence                 | projects                    | id          | []
>  162 | BackfillProjectStatisticsStorageSizeWithRecentSize | project_statistics          | project_id  | []
>  163 | UpdateUsersSetExternalIfServiceAccount             | users                       | id          | []
>  156 | CopyColumnUsingBackgroundMigrationJob              | ci_project_monthly_usages   | id          | [["shared_runners_duration"], ["shared_runners_duration_convert_to_bigint"]]
> (7 rows)

Then run this SQL for each ID:


SELECT started_at, finished_at, finished_at - started_at AS duration, min_value, max_value, batch_size, sub_batch_size FROM batched_background_migration_jobs WHERE batched_background_migration_id = PUT-ID-HERE ORDER BY id DESC limit 10;
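To avoid editing the placeholder by hand seven times, the per-ID statements can be generated. This is only a sketch: jobs_query is a hypothetical helper name, and the ID list is simply the one returned by the query above.

```ruby
# IDs of the unfinished migrations reported earlier in this thread.
MIGRATION_IDS = [156, 157, 158, 159, 160, 162, 163].freeze

# Build the diagnostic query for one batched background migration.
def jobs_query(migration_id)
  id = Integer(migration_id) # raises unless the value is a whole number
  <<~SQL
    SELECT started_at, finished_at, finished_at - started_at AS duration,
           min_value, max_value, batch_size, sub_batch_size
    FROM batched_background_migration_jobs
    WHERE batched_background_migration_id = #{id}
    ORDER BY id DESC
    LIMIT 10;
  SQL
end

# Print one statement per migration ID.
MIGRATION_IDS.each { |id| puts jobs_query(id) }
```

The printed statements can then be pasted into the gitlab-psql session one after another.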

In my case, IDs 156, 157 and 158 all give the same result except for max_value and sub_batch_size:
started_at: blank
finished_at: blank
duration: blank
min_value: 1
max_value: ID 156: 56; ID 157: 59; ID 158: 61
batch_size: 20000
sub_batch_size: ID 156: 250; ID 157: 250; ID 158: 200

e.g.:

 started_at | finished_at | duration | min_value | max_value | batch_size | sub_batch_size
------------+-------------+----------+-----------+-----------+------------+----------------
            |             |          |         1 |        56 |       3000 |            200

The other IDs (159, 160, 162 and 163) return no rows at all.

Also, on the server, opening the log file /var/log/gitlab/gitlab-rails/gitlab-rails-db-migrate-YYYY-MM-DD-HH-MM-SS.log

shows:

StandardError: An error has occurred, all later migrations canceled:

PG::CheckViolation: ERROR:  no partition of relation "batched_background_migration_job_transition_logs" found for row

Caused by:
ActiveRecord::StatementInvalid: PG::CheckViolation: ERROR:  no partition of relation "batched_background_migration_job_transition_logs" found for row

Check the status of migrations using this command:

gitlab-rake db:migrate:status

just in case they are all OK, and they are just dangling in the database.

That command returned a big list.

All DOWN from 20230923094438 onwards:

  down    20230923094438  Ensure backfill for shared runners duration is finished
  down    20230924095357  Swap columns for ci project monthly usages shared runners duration
  down    20230924134300  Finalize uuid backfilling
  down    20230924134453  Cleanup uuid type migration on vulnerability occurrences
  down    20230924154419  Drop temporary index on uuid for type migration
  down    20230925024201  Add foreign key for ci pipelines pipeline id bigint
  down    20230925062516  Add foreign key for ci stages pipeline id bigint
  down    20230925062800  Async validate foreign key for ci stages pipeline id bigint
  down    20230925095300  Remove deprecated delete container repository worker job instances
  down    20230925095357  Swap columns for ci namespace monthly usages shared runners duration
  down    20230925170448  Add index on okr reminder frequency
  down    20230926024201  Async validate foreign key for ci pipelines pipeline id bigint
  down    20230926040722  Add foreign key for ci sources pipelines pipeline id bigint
  down    20230926040755  Async validate foreign key for ci sources pipelines pipeline id bigint
  down    20230926092914  Add approval group rules
  down    20230926092944  Add approval group rules groups
  down    20230926093004  Add approval group rules users
  down    20230926093025  Add approval group rules protected branches
  down    20230926093101  Add fk to approval rule on approval group rules users
  down    20230926093144  Add fk to user on approval group rules users
  down    20230926093211  Add fk to approval rule on approval group rules groups
  down    20230926093251  Add fk to group on approval group rules groups
  down    20230926105440  Add fk to approval rule on approval group rules protected branches
  down    20230926105908  Add index to add on purchases on last assigned users refreshed at and add on
  down    20230926105931  Add fk to protected branch on approval group rules protected branches
  down    20230926113518  Remove application settings ai access token column
  down    20230926115744  Add vertex ai access token to application settings
  down    20230926133801  Create value stream analytics settings
  down    20230926201357  Drop index namespaces on type and visibility and parent
  down    20230927045103  Async idx vulnerability occurences on prim iden
  down    20230927124202  Add mastodon to user details
  down    20230927141237  Add index on pages deployments deleted at
  down    20230928024357  Drop index namespaces on runners token
  down    20230928073320  Add applicable post merge column to mr approval rules
  down    20230928104015  Sync foreign key for ci stages pipeline id bigint
  down    20230928145555  Add fk to security orchestration policy configuration on approval group rules
  down    20230928145637  Add fk to scan result policy on approval group rules
  down    20230929063124  Sync foreign key for ci sources pipelines pipeline id bigint
  down    20230929063406  Sync foreign key for ci sources pipelines source pipeline id bigint
  down    20230929095008  Drop application settings product analytics cluster settings
  down    20230929095728  Drop project settings product analytics cluster settings
  down    20230929151451  Add math rendering limits enabled
  down    20230929155123  Migrate disable merge trains value
  down    20230930094139  Add related link restrictions
  down    20231001105945  Requeue backfill finding id in vulnerabilities
  down    20231002023318  Prepare removal index deployments on project id and ref
  down    20231002162941  Add enable artifact external redirect warning page to application settings
  down    20231003003241  Drop index btree namespaces traversal ids
  down    20231003034711  Sync foreign key for ci pipelines auto canceled by id bigint
  down    20231003045342  Migrate sidekiq namespaced jobs
  down    20231003073437  Create abuse report user mentions
  down    20231003073505  Add abuse reports foreign key to abuse report user mentions
  down    20231003073526  Add notes foreign key to abuse report user mentions
  down    20231003083900  Swap columns for ci pipeline messages pipeline id bigint
  down    20231003142534  Add build timeout index
  down    20231003142706  Lower project build timeout to respect max validation
  down    20231003145757  Remove build timeout index
  down    20231004053341  Add index for group vulnerabilities aysnc
  down    20231004080224  Swap columns for ci stages pipeline id bigint
  down    20231004091113  Swap columns for ci sources pipelines pipeline id bigint
  down    20231004100000  Create container registry protection rules
  down    20231004120426  Change workspaces force include all resources default
  down    20231005131445  Add work items related link restrictions
  down    20231005145648  Add uuid and version to vs code setting
  down    20231005151816  Add created at to status check responses
  down    20231006154748  Replace value stream project ids filter constraint
  down    20231009104202  Add holder name hash index on credit card validations
  down    20231009104325  Add partial match index of hashes on credit card validations
.
.
.
.
  down    20231207221056  Finalize backfill uuid conversion column in vulnerability occurrences
  down    20231207221119  Finalize cleanup personal access tokens with nil expires at
  down    20231207221140  Finalize delete orphaned transferred project approval rules
  down    20231207221159  Finalize fix allow descendants override disabled shared runners
  down    20231207221219  Finalize mark duplicate npm packages for destruction
  down    20231207221241  Finalize populate vulnerability dismissal fields
  down    20231207221300  Finalize remove invalid deploy access level groups
  down    20231208103049  Drop index users on id and last activity
  down    20231211100717  Add source package name to sbom component versions
  down    20231212132322  Prepare ci pipeline variables primary key for partitioning
  down    20231213112726  Add trigram index to compliance management frameworks on name
  down    20231214064934  Add arkose labs data exchange key to application settings
  down    20231214164411  Add code added at to onboarding progresses
  down    20231218062442  Remove max workspaces from remote development agent configs
  down    20231218062505  Remove max workspaces per user from remote development agent configs
  down    20231219120134  Add token to chat names
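With a list this long it can be easier to let a script pick out the pending entries. A minimal Ruby sketch, assuming each row of the status output has the form "status, 14-digit version, name" as in the lines above. parse_status and pending are hypothetical helper names, and the "up" row in the sample is made up for illustration.

```ruby
# One row of `gitlab-rake db:migrate:status` output.
Migration = Struct.new(:status, :version, :name)

# Assumed row layout: "  down    20230923094438  Some migration name"
STATUS_LINE = /\A\s*(up|down)\s+(\d{14})\s+(.+?)\s*\z/

# Turn the raw text into Migration records, skipping non-matching lines.
def parse_status(text)
  text.each_line.filter_map do |line|
    match = STATUS_LINE.match(line)
    Migration.new(match[1], match[2], match[3]) if match
  end
end

# Migrations that have not been applied yet, oldest first.
def pending(migrations)
  migrations.select { |m| m.status == 'down' }.sort_by(&:version)
end

sample = <<~OUTPUT
  up      20230922000000  Hypothetical earlier migration (made-up row)
  down    20230923094438  Ensure backfill for shared runners duration is finished
  down    20230924095357  Swap columns for ci project monthly usages shared runners duration
OUTPUT

first_down = pending(parse_status(sample)).first
puts "#{first_down.version} #{first_down.name}"
```

On a real server the text could come from saving the command's output to a file and reading it with File.read. The first pending entry is the one worth looking at, since every later migration is blocked behind it.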

The migrations marked down correspond to .rb files in the folder: /opt/gitlab/embedded/service/gitlab-rails/db/post_migrate

The first one in the list is file name: 20230923094438_ensure_backfill_for_shared_runners_duration_is_finished.rb
This file has this code:

# frozen_string_literal: true

class EnsureBackfillForSharedRunnersDurationIsFinished < Gitlab::Database::Migration[2.1]
  restrict_gitlab_migration gitlab_schema: :gitlab_ci
  disable_ddl_transaction!

  TABLE_NAMES = %i[ci_project_monthly_usages ci_namespace_monthly_usages]

  def up
    TABLE_NAMES.each do |table_name|
      ensure_batched_background_migration_is_finished(
        job_class_name: 'CopyColumnUsingBackgroundMigrationJob',
        table_name: table_name,
        column_name: 'id',
        job_arguments: [
          %w[shared_runners_duration],
          %w[shared_runners_duration_convert_to_bigint]
        ]
      )
    end
  end

  def down
    # no-op
  end
end

It’s definitely a problem with migration ID 157, CopyColumnUsingBackgroundMigrationJob (the same job class as ID 156).

I tried to manually finalise it:

gitlab-rake gitlab:background_migrations:finalize[CopyColumnUsingBackgroundMigrationJob,ci_namespace_monthly_usages,id,'[["shared_runners_duration"]\, ["shared_runners_duration_convert_to_bigint"]]']
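The JSON argument in that command is easy to get wrong by hand, so here is a small Ruby sketch that builds the command from the values in a batched_background_migrations row. finalize_command is a hypothetical helper, not part of GitLab. It relies only on the escaping visible in the command above: commas inside the JSON are backslash-escaped so that rake does not split the argument on them.

```ruby
require 'json'

# Build the finalize command for one batched background migration.
# job_arguments is the Ruby array form of the job_arguments column.
def finalize_command(job_class_name, table_name, column_name, job_arguments)
  # JSON.generate emits compact JSON; escape each comma for rake.
  json = JSON.generate(job_arguments).gsub(',') { '\,' }
  'gitlab-rake gitlab:background_migrations:finalize' \
    "[#{job_class_name},#{table_name},#{column_name},'#{json}']"
end

puts finalize_command(
  'CopyColumnUsingBackgroundMigrationJob',
  'ci_namespace_monthly_usages',
  'id',
  [%w[shared_runners_duration], %w[shared_runners_duration_convert_to_bigint]]
)
```

The generated string differs from the hand-written one only in having no space after the escaped comma, which is still the same JSON.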

Running it gives the same error found in /var/log/gitlab/gitlab-rails/gitlab-rails-db-migrate-YYYY-MM-DD-HH-MM-SS.log:

no partition of relation "batched_background_migration_job_transition_logs" found for row

So it’s a problem with the database and this mysterious batched_background_migration_job_transition_logs table.
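For context, PostgreSQL raises "no partition of relation ... found for row" when a row is inserted into a partitioned table and none of the existing partitions covers the row's partition key. Below is a rough Ruby model of that routing rule. It assumes range partitions by month on a date column, which is an assumption about this table and not something verified here; the partition names and dates are made up.

```ruby
require 'date'

# A range partition covering [from, to), like
# FOR VALUES FROM (from) TO (to) in PostgreSQL.
Partition = Struct.new(:name, :from, :to) do
  def cover?(date)
    date >= from && date < to
  end
end

# Pick the partition for a row, or fail the way an INSERT would.
def route(partitions, created_at)
  partition = partitions.find { |p| p.cover?(created_at) }
  raise KeyError, "no partition found for row dated #{created_at}" if partition.nil?

  partition
end

# Hypothetical partitions: nothing exists after October 2023.
partitions = [
  Partition.new('logs_202309', Date.new(2023, 9, 1), Date.new(2023, 10, 1)),
  Partition.new('logs_202310', Date.new(2023, 10, 1), Date.new(2023, 11, 1))
]

puts route(partitions, Date.new(2023, 10, 15)).name

begin
  route(partitions, Date.new(2024, 1, 20))
rescue KeyError => e
  puts "insert rejected: #{e.message}"
end
```

If the real table works like this, the failure would mean that no partition exists for the rows the migration is trying to log, rather than that the table itself is missing.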

I found this issue: Error migrating from 14.0.12 to 14.8.2 (#353927, GitLab.org / GitLab). It says to run this command to create the missing entry:
gitlab-rake db:migrate:up VERSION=20211123135255

However, I don’t have this migration. The error is:
No migration with version number 20211123135255.
The oldest migration script I have in /opt/gitlab/embedded/service/gitlab-rails/db/post_migrate is from 2022-08.

I went looking for the corresponding rb file for 20211123135255. Found it here:

I put the file 20211123135255_create_batched_background_migration_jobs_status_changes.rb into the folder /opt/gitlab/embedded/service/gitlab-rails/db/migrate

Then I re-ran the command: gitlab-rake db:migrate:up VERSION=20211123135255

This time it completed. Great!

So now let’s re-run the stuck job, CopyColumnUsingBackgroundMigrationJob:

gitlab-rake gitlab:background_migrations:finalize[CopyColumnUsingBackgroundMigrationJob,ci_namespace_monthly_usages,id,'[["shared_runners_duration"]\, ["shared_runners_duration_convert_to_bigint"]]']

aaaaaaand…

ERROR: no partition of relation "batched_background_migration_job_transition_logs" found for row

Oh well.