Failed omnibus upgrade from 15.4.6 to 15.11.13: some pages lead to error 500

Hi,

I am moving from 14.0.4 to 16.1.5.
14.0.4 => 14.0.5 => 15.0.5 => 15.1.6 => 15.4.6 went well.
But after moving from 15.4.6 to 15.11.13, I got some errors:

  • the main page returns an error 500
  • in one group, no projects are shown anymore, and its subgroups are marked as archived
  • when listing projects in the admin panel, the page shows an error 500 whenever a failing project is about to be displayed

I have the following error in gitlab-ctl tail:

==> /var/log/gitlab/gitlab-rails/production.log <==

Gitlab::Git::CommandError (13:get default branch: EOF.):

lib/gitlab/git/wraps_gitaly_errors.rb:24:in `rescue in wrapped_gitaly_errors'
lib/gitlab/git/wraps_gitaly_errors.rb:6:in `wrapped_gitaly_errors'
lib/gitlab/git/repository.rb:95:in `root_ref'
app/models/repository.rb:555:in `root_ref'
lib/gitlab/repository_cache_adapter.rb:95:in `block (2 levels) in cache_method_asymmetrically'
lib/gitlab/repository_cache.rb:44:in `fetch_without_caching_false'
lib/gitlab/repository_cache_adapter.rb:190:in `block (2 levels) in cache_method_output_asymmetrically'
lib/gitlab/safe_request_store.rb:12:in `fetch'
lib/gitlab/repository_cache.rb:25:in `fetch'
lib/gitlab/repository_cache_adapter.rb:189:in `block in cache_method_output_asymmetrically'
lib/gitlab/utils/strong_memoize.rb:34:in `strong_memoize'
lib/gitlab/repository_cache_adapter.rb:203:in `block in memoize_method_output'
lib/gitlab/repository_cache_adapter.rb:212:in `no_repository_fallback'
lib/gitlab/repository_cache_adapter.rb:202:in `memoize_method_output'
lib/gitlab/repository_cache_adapter.rb:188:in `cache_method_output_asymmetrically'
lib/gitlab/repository_cache_adapter.rb:94:in `block in cache_method_asymmetrically'
app/models/repository.rb:696:in `tree'
app/models/repository.rb:1084:in `file_on_head'
app/models/repository.rb:607:in `block in avatar'
lib/gitlab/gitaly_client.rb:336:in `allow_n_plus_1_calls'
app/models/repository.rb:606:in `avatar'
lib/gitlab/repository_cache_adapter.rb:21:in `block (2 levels) in cache_method'
lib/gitlab/repository_cache.rb:25:in `fetch'
lib/gitlab/repository_cache_adapter.rb:163:in `block in cache_method_output'
lib/gitlab/utils/strong_memoize.rb:34:in `strong_memoize'
lib/gitlab/repository_cache_adapter.rb:203:in `block in memoize_method_output'
lib/gitlab/repository_cache_adapter.rb:212:in `no_repository_fallback'
lib/gitlab/repository_cache_adapter.rb:202:in `memoize_method_output'
lib/gitlab/repository_cache_adapter.rb:162:in `cache_method_output'
lib/gitlab/repository_cache_adapter.rb:20:in `block in cache_method'
app/models/project.rb:1763:in `avatar_in_git'
app/models/project.rb:1767:in `avatar_url'
app/models/concerns/avatarable.rb:36:in `avatar_url'
app/serializers/base_serializer.rb:16:in `represent'
app/serializers/concerns/with_pagination.rb:19:in `represent'
app/serializers/group_child_serializer.rb:22:in `represent'
app/controllers/groups/children_controller.rb:38:in `block (2 levels) in index'
app/controllers/groups/children_controller.rb:32:in `index'
app/controllers/application_controller.rb:500:in `set_current_admin'
lib/gitlab/session.rb:11:in `with_session'
app/controllers/application_controller.rb:491:in `set_session_storage'
lib/gitlab/i18n.rb:107:in `with_locale'
lib/gitlab/i18n.rb:113:in `with_user_locale'
app/controllers/application_controller.rb:482:in `set_locale'
app/controllers/application_controller.rb:475:in `set_current_context'
lib/gitlab/middleware/memory_report.rb:13:in `call'
lib/gitlab/middleware/speedscope.rb:13:in `call'
lib/gitlab/database/load_balancing/rack_middleware.rb:23:in `call'
lib/gitlab/jira/middleware.rb:19:in `call'
lib/gitlab/middleware/go.rb:20:in `call'
lib/gitlab/etag_caching/middleware.rb:21:in `call'
lib/gitlab/middleware/query_analyzer.rb:11:in `block in call'
lib/gitlab/database/query_analyzer.rb:37:in `within'
lib/gitlab/middleware/query_analyzer.rb:11:in `call'
lib/gitlab/middleware/multipart.rb:173:in `call'
lib/gitlab/middleware/read_only/controller.rb:50:in `call'
lib/gitlab/middleware/read_only.rb:18:in `call'
lib/gitlab/middleware/same_site_cookies.rb:27:in `call'
lib/gitlab/middleware/basic_health_check.rb:25:in `call'
lib/gitlab/middleware/handle_malformed_strings.rb:21:in `call'
lib/gitlab/middleware/handle_ip_spoof_attack_error.rb:25:in `call'
lib/gitlab/middleware/request_context.rb:15:in `call'
lib/gitlab/middleware/webhook_recursion_detection.rb:15:in `call'
config/initializers/fix_local_cache_middleware.rb:11:in `call'
lib/gitlab/middleware/compressed_json.rb:44:in `call'
lib/gitlab/middleware/rack_multipart_tempfile_factory.rb:19:in `call'
lib/gitlab/middleware/sidekiq_web_static.rb:20:in `call'
lib/gitlab/metrics/requests_rack_middleware.rb:79:in `call'
lib/gitlab/middleware/release_env.rb:13:in `call'

==> /var/log/gitlab/gitlab-workhorse/current <==
{"content_type":"text/html; charset=utf-8","correlation_id":"01HA6WBV2P06GESZWBTQBYV5PE","duration_ms":122,"host":"10.2.200.42","level":"info","method":"GET","msg":"access","proto":"HTTP/1.1","referrer":"https://10.2.200.42/<failing_group>","remote_addr":"127.0.0.1:0","remote_ip":
"127.0.0.1","route":"","status":500,"system":"http","time":"2023-09-13T11:01:13+02:00","ttfb_ms":122,"uri":"/groups/<failing_group>/-/children.json","user_agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0","written_bytes":3054}

==> /var/log/gitlab/nginx/gitlab_access.log <==
172.23.2.2 - - [13/Sep/2023:11:01:13 +0200] "GET /groups/<failing_group>/-/children.json HTTP/2.0" 500 3054 "https://10.2.200.42/<failing_group>" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0" -

Where can I start looking for the issue?

I found some projects with a kind of corrupted default branch using the gitlab-rails console and the following script:

# Run in gitlab-rails console: print the full path of every project
# whose default branch lookup raises (like the Gitaly error above).
# find_each loads projects in batches instead of all at once.
Project.find_each do |p|
  begin
    p.default_branch
  rescue
    puts p.full_path
  end
end

Then I restored a backup of the VM and ran the following Python script (it uses the python-gitlab library) before upgrading to 15.11.13:

import subprocess

import gitlab  # pip install python-gitlab

token = '####'
host = '###'

gl = gitlab.Gitlab('https://' + host, ssl_verify=False, private_token=token)
gl.auth()

plist = [
        # fill with the project paths printed by the previous command
        ]

for pname in plist:
    project = gl.projects.get(pname)
    branches = project.branches.list()
    branchNames = [b.name for b in branches]
    if len(branches) == 1:
        print("===========================  ", pname)
    currentDefault = project.default_branch
    # Flip the default branch to any other existing branch, then flip it
    # back: this forces GitLab to rewrite HEAD so it points at a real ref.
    for b in branchNames:
        if b != currentDefault:
            print(b, " => ", currentDefault)
            subprocess.run(["curl", "-k", "--request", "PUT", "--header", "PRIVATE-TOKEN:" + token, "--url", "https://{}/api/v4/projects/{}".format(host, project.id), "--data", "default_branch={}".format(b)])
            break
    subprocess.run(["curl", "-k", "--request", "PUT", "--header", "PRIVATE-TOKEN:" + token, "--url", "https://{}/api/v4/projects/{}".format(host, project.id), "--data", "default_branch={}".format(currentDefault)])

Then the upgrade worked!!


Hi @VivienDelmon, we’re having the same 500 error after migrating from 15.4.6 to 15.11.13!

I’ll try your solution on our test environment to see how it goes. By the way, I’m curious how you came up with that solution. Did it come through GitLab enterprise support?

Regards,

Hi @RaphaelCoelho,
As far as I remember:

  • I saw that my GitLab instance was only failing when certain projects were involved.
  • I read the GitLab logs with sudo gitlab-ctl tail while loading these pages, and saw that Gitaly was involved.
  • I wrote some Ruby loops over the projects and saw that some exceptions were raised.
  • I went to look at the bare repositories: some git commands like status were failing, and the HEAD file contained a non-existing ref.
  • Finally, I ended up with the solution I gave here.

Recently I had a similar issue again and fixed it by directly editing HEAD on the failing repositories, so I did not need to get my backup out. I am not 100% sure that it is the same issue, but I fixed it this way:
For each failing repo, go to the bare directory (something like /var/opt/gitlab/git-data/repositories/@hashed/9f/1f/9f1f9dce319c4700ef28ec8c53bd3cc8e6abe64c68385479ab89215806a5bdd6.git) on your instance, then:

  • cat HEAD
  • if you have my issue, it contains a hash instead of your default branch ref
  • check that the file refs/heads/<your_default_branch> exists
  • if it does not exist, create it with the right hash (find it in another clone, for instance):
mkdir -p refs/heads
echo <the_hash_you_found> >> refs/heads/master
chown -R git:git refs

Then, when Gitaly looks up info about your repo, it no longer ends prematurely. This solution is a bit more manual, but it does not require bringing back a backup.
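
A side note: you don’t have to hunt for the right bare directory by hand. Hashed storage derives the path from the SHA-256 of the project ID, so you can compute it. A minimal Python sketch (the project ID 1234 and the default omnibus storage root are only examples):

import hashlib

def hashed_repo_path(project_id, root="/var/opt/gitlab/git-data/repositories"):
    # Hashed storage layout: @hashed/<2 hex chars>/<2 hex chars>/<full sha256 of the ID>.git
    h = hashlib.sha256(str(project_id).encode()).hexdigest()
    return "{}/@hashed/{}/{}/{}.git".format(root, h[:2], h[2:4], h)

print(hashed_repo_path(1234))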

And no, I don’t have GitLab enterprise support. I somehow like debugging :expressionless:

I don’t know what is leading to this issue. Looking at which repositories are impacted on our side, I feel it happens when some “advanced” git users do nasty things like force-pushing or rewriting history.
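
If you want to survey every repository in this state at once instead of checking HEAD files one by one, something like this should work. It is only a sketch: it assumes direct filesystem access on the Gitaly node and the default hashed-storage root, and since a branch can also live in packed-refs, treat hits as candidates to inspect rather than certain corruption:

import os

REPO_ROOT = "/var/opt/gitlab/git-data/repositories/@hashed"

for dirpath, dirnames, filenames in os.walk(REPO_ROOT):
    if not dirpath.endswith(".git"):
        continue
    dirnames[:] = []  # do not descend inside the bare repo itself
    try:
        with open(os.path.join(dirpath, "HEAD")) as f:
            head = f.read().strip()
    except OSError:
        continue
    if head.startswith("ref: "):
        ref = head[len("ref: "):]
        packed = os.path.join(dirpath, "packed-refs")
        packed_content = ""
        if os.path.exists(packed):
            with open(packed) as f:
                packed_content = f.read()
        # The ref is fine if it exists as a loose file or appears in packed-refs
        if not os.path.exists(os.path.join(dirpath, ref)) and ref not in packed_content:
            print(dirpath, "-> HEAD points at a missing ref:", ref)
    else:
        # A raw hash in HEAD instead of "ref: ..." is the symptom described above
        print(dirpath, "-> HEAD contains a raw hash:", head)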


Awesome, @VivienDelmon. Thanks for the detailed reply =)

We’re getting the error 500 on the main page already, so I can’t navigate the panel at all. It seems our issue here is a bit worse than the one you had.

Regarding the script you mentioned before, I saved it as a Ruby file (.rb) and tried running it with a command like:
gitlab-rails runner /tmp/<my_script>.rb

Fired from my Unix prompt, the above command didn’t print anything and kept running without giving me back the prompt. Am I missing something here? I did change the ownership of said Ruby script to git:git before running it.

We have over 56k projects in our instance; I wonder if it will take a while to finish running…

I was also having the issue on the main page, but I could browse other pages by using URLs from my browser history directly :wink:

I ran my script in gitlab-rails console interactively, with copy/paste. You can do the same, and add some puts calls to see if everything goes as expected and to monitor progress.

Something that was not clear in my explanation: the Ruby part that finds the failing repositories has to be run on the upgraded instance, whereas the fixing part (which is a Python script, by the way) must be run on a non-upgraded instance, to fix things before upgrading it.
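
If you’d rather not copy the list between the two parts by hand, one small glue sketch (failing.txt is a hypothetical file holding the console output, one project full path per line):

with open("failing.txt") as f:
    plist = [line.strip() for line in f if line.strip()]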
