Upgrade to GitLab 13.4.0 (b0481767fe4) killed all repositories

Hi all, I will look at all the information you provided here and track the problem in https://gitlab.com/gitlab-org/gitlab/-/issues/259605. This is a high-priority issue to get fixed, as it shouldn’t have caused problems in the first place. The way the migration was coded makes it very hard to lose data, so for those of you who ended up in an inconsistent state, it may be that the database has flagged the storage as migrated while the repositories are still in the legacy storage format, or the opposite: the repositories were migrated but the database update failed for some reason. In either case, it’s possible to get it back to normal.

Please follow the issue to get notified of any solution (we will probably have a patch release for 13.4.x with a fix, but I will also provide instructions on how to fix it manually, so you don’t have to wait).

The OpenSSL::Cipher::CipherError means some encrypted data in the database couldn’t be read with the existing keys in /etc/gitlab/gitlab-secrets.json. The only way I can think of this happening is when you have GitLab installed in an HA or sort-of-HA setup where you are running on multiple nodes to spread the load. In that scenario, if you have Sidekiq on a different node and you have forgotten to copy the secrets there, you may end up with this kind of issue.
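
If that is the scenario, the usual fix is to copy the same secrets file from the primary node to every other node and reconfigure; for example (sidekiq-01 is just a placeholder hostname):

scp /etc/gitlab/gitlab-secrets.json root@sidekiq-01:/etc/gitlab/gitlab-secrets.json
ssh root@sidekiq-01 'gitlab-ctl reconfigure'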

@arhi Could you please provide additional insights here? https://gitlab.com/gitlab-org/gitlab/-/issues/259605#note_423592490

For those of you having issues related to OpenSSL::Cipher::CipherError, please look at the documentation here: https://docs.gitlab.com/ee/administration/raketasks/doctor.html#verify-database-values-can-be-decrypted-using-the-current-secrets
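
If you want to check whether your current /etc/gitlab/gitlab-secrets.json can decrypt everything in the database, the doctor Rake task from that page can be run directly (VERBOSE=1 lists the affected rows):

sudo gitlab-rake gitlab:doctor:secrets
sudo gitlab-rake gitlab:doctor:secrets VERBOSE=1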

Hi, that was not the case :frowning:

I had GitLab running on a single node, a simple yum install of gitlab-ce, and it ran fine for a while. Then upgrades started having issues (I think the registry was added and the HTTPS keys were not loading properly), and after every upgrade I had to manually “fix” the config by pointing the registry to the proper HTTPS keys. Then I decided to move GitLab from the host to a VM for easier backup, as the whole system had become super important. So what we did was backup-etc + backup, installed it on the VM, restored /etc and the backup (so the .rb files were copied to the new system), and everything worked OK. I did a few upgrades, everything still worked OK, and then this upgrade came along trying to do a filesystem migration that crashed big time.

I had ~25 repos there, 2 of them had wiki pages, 2-3 were already on the hashed system and all the others were on the old system. After the migration, all repos in the “old system” were empty (empty folders), and no repos were visible except those that were on hashed storage before. Starting the migration again kept crashing with this crypto error, so I then ran

UPDATE projects SET runners_token = null, runners_token_encrypted = null;
UPDATE namespaces SET runners_token = null, runners_token_encrypted = null;
UPDATE application_settings SET runners_registration_token_encrypted = null;
UPDATE ci_runners SET token = null, token_encrypted = null;

and started the migration again.
All repos became visible now, but the 2 repos with wikis had empty wikis.
I checked the filesystem: those 2 repos now had an (empty) repo.wiki folder in the old filesystem repo format. I tried a few things to restore the wikis but could not do it; the repos were also still in the old format and the migration was not moving them to hashed. Then I deleted the .wiki folder and ran the migration again, the repos migrated successfully to hashed, and the wikis automagically got restored…
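
For reference, the storage Rake tasks involved here are the documented ones (task names are from the Rake task docs; commands assume an Omnibus install):

sudo gitlab-rake gitlab:storage:list_legacy_projects   # list projects still on legacy storage
sudo gitlab-rake gitlab:storage:migrate_to_hashed      # enqueue the migration to hashed storage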

So I’m now fully functional, but I can’t say I’m very confident in GitLab after the whole ordeal :frowning: so I’m manually backing up the whole VM non-stop and storing those backups… not very optimal, but…

When you back up and restore, you also need to copy the secrets, as they are not part of the backup: https://docs.gitlab.com/ee/raketasks/backup_restore.html#storing-configuration-files
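
In practice that means grabbing at least these two files from the old host before restoring on the new one (the destination directory below is just an example):

sudo cp /etc/gitlab/gitlab-secrets.json /mnt/backup-target/   # encryption keys, not included in the backup tarball
sudo cp /etc/gitlab/gitlab.rb /mnt/backup-target/             # main configuration file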

I think we should consider a solution to be able to store them inside the backup files as well. I will create an issue to follow up on that.

@brodock if there’s anything else I can help with, let me know.
Not sure what to add to the issue there; everything I could add is already there.

Isn’t backup-etc doing just that?!

If you copied what’s inside /etc/gitlab you are good, but the secrets are not part of the backup bundle, for security reasons.

I do

gitlab-ctl backup-etc
gitlab-backup create

I understand the first line backs up the secrets and the second one backs up the repos, wikis, database…
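
For anyone else reading: by default (on an Omnibus install, and both locations are configurable) those two commands write to different places:

gitlab-ctl backup-etc    # copies /etc/gitlab (including gitlab-secrets.json) to /etc/gitlab/config_backup/
gitlab-backup create     # writes a timestamped tarball of repos, uploads and the database to /var/opt/gitlab/backups/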

You are right, with backup-etc this should be covered. We have two separate issues here. One is that the storage migration was left in an inconsistent state when an exception occurred, which is not what we want to happen, so this should be fixed and I will investigate it in the original issue.

The other is how you got your system in a situation where OpenSSL::Cipher::CipherError triggered. I’ve created another issue to follow up on an idea on how we could prevent that: https://gitlab.com/gitlab-org/gitlab/-/issues/262040

@arhi did you see any errors in the logs regarding those two? During the hashed storage migration on gitlab.com we saw a few cases where repositories had the wrong permissions, and because of that the migration scripts couldn’t move them from one folder to another.
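
If anyone wants to rule out permissions on an Omnibus install, the repositories should be owned by the git user; something like this (default storage path, adjust if you changed it) lists anything that is not:

ls -ld /var/opt/gitlab/git-data/repositories
sudo find /var/opt/gitlab/git-data/repositories -maxdepth 2 ! -user git -print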

The migration (in my case) failed with the cipher error; some others had it fail with other errors. The failed migration is a big fail that should never happen… and to be honest, an automated migration without an explicit question to migrate is a huge fail!!! Just like gitlab-ce will not update if it cannot make a backup, I’d prevent it from doing anything if gitlab:doctor:secrets fails too.

Also, “auto migration”… hm… just like you don’t continue the upgrade when you can’t make a backup, you could prevent the update if there is still old-style storage and require a manual migration to hashed before upgrading. The docs now state hashed storage will be mandatory in 14.0, so there’s no need to migrate in 13.x; why force it anyhow?!

In any case, a migration of the storage system is not something I expect in the upgrade procedure :frowning: … it’s something I’ll do manually: first I’ll back up the %$#^#^+ out of the system, test that the backup is good by restoring it to a staging system, and only then try to migrate the storage. Doing it like this… well, not the first gray hair I got nor the last, but I could really have skipped this stress :smiley: :smiley: :smiley:

Now, how this cipher error happened: no idea. We copied /etc and restored the backup and everything worked OK for a few updates… there’s no way to turn back time to see what got messed up, so I don’t know what info I can add about it.

Nope, no errors. The migration script was finishing without reporting any errors anywhere, but the repos were not being migrated (file permissions were OK, I checked that before I decided to delete .wiki to see what would happen). I’m not a Ruby person so I don’t know how this all works; I tried to find out what the migration actually does but could not find the script where it is done.

It looked to me like, once the crypto thing was solved, the migration script returned the repo files from hashed back to the old layout but for some reason did not move the .wiki files, so the wiki data was left alone in the @hashed structure while the repo was stored old-style, with an “excess” .wiki folder in the old layout as well. This .wiki then prevented migration of the repo, so when I deleted the .wiki the repo was migrated and the old wiki data that was there from the first migration attempt got joined with the repo again. It should be possible to track this through the script but… I don’t do Ruby.
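
For anyone trying to spot the same leftover wiki directories, a find along these lines (default Omnibus path, and assuming the standard <project>.wiki.git naming on disk) lists legacy-style wiki repos that are still outside @hashed:

sudo find /var/opt/gitlab/git-data/repositories -maxdepth 3 -type d -name '*.wiki.git' -not -path '*/@hashed/*'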

I understand your frustration, and I’m sorry this caused issues for you. Let me explain the decisions we made and why:

In the 13.0 release post we warned that legacy storage was discouraged from that point on, and that we would automatically migrate repositories two releases later: https://about.gitlab.com/releases/2020/05/22/gitlab-13-0-released/#planned-removal-of-legacy-storage-in-14.0 (we actually allowed 2 extra releases to be sure).

From 13.0 onward, gitlab-rake gitlab:check would tell you that you have an issue that needs to be fixed.

The rationale for the auto migration being triggered 4 releases after the first warning was that everyone should already have migrated, or at least attempted it and reported any exceptions we hadn’t covered so we could fix them.

The automatic migration was intended to move everyone who either missed the warnings or still had any unmigrated repositories left over to hashed storage, so in a sense it was a mandatory one. (Please understand that this whole migration involved a ton of work, as we had to change and fix things that go back to probably the very first versions.) Having hashed storage allows us to fix other problems that can occur when running GitLab at scale.

For your curiosity, this is what the migration does (after the multiple hops to schedule it at scale):

and related code in:
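
Roughly speaking (this is only an illustration of the resulting layout, not the actual GitLab code referenced here), hashed storage moves each repository from its namespace-based path to a path derived from the SHA-256 of the project ID, with the wiki stored next to it:

id=93                                                      # example project ID
hash=$(printf '%s' "$id" | sha256sum | cut -d' ' -f1)      # SHA-256 of the ID as a string
echo "@hashed/${hash:0:2}/${hash:2:2}/${hash}.git"         # repository path
echo "@hashed/${hash:0:2}/${hash:2:2}/${hash}.wiki.git"    # its wiki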

Don’t get me wrong, I was soooooo pissed you cannot imagine, it happened at the worst possible moment, etc. etc. … but I do know I’m getting a mega turbo uber giga best system out there for free, and I appreciate that big time… I also appreciate all the work you guys are putting in; working on a big open source project myself, I’ve had my share of angry customers and crazy decisions :smiley: … so don’t get me wrong, I still think you guys are the best.


What I’d do (not important anymore, but just for the sake of argument):

  • change the messaging so it’s not “deprecated in 13 and removed in 14” but “removed in 13.4”, so people know they must migrate
  • change the upgrade to 13.4 so that it checks whether there is unmigrated data and fails to upgrade if there is, requiring the user to migrate manually before upgrading to 13.4

I’m pretty sure that would prevent a whole bunch of issues and gray hairs :smiley:

gitlab-rake gitlab:check

Is this something that’s run during the upgrade? I did not notice that. In any case, IMHO this is something I should be seeing on the /admin page, right under the “update asap” warning :smiley:

Thanks, I’ll take a look. Reading .rb files is not a problem when I know where to look; the structure is still a bit strange to me :slight_smile:

keep up the good work :slight_smile: and thanks again

@bjelline can you provide any additional information, logs, or other errors you can find, and paste them here? https://gitlab.com/gitlab-org/gitlab/-/issues/259605#note_423733035

We still need to get you onto hashed storage, so we need to figure out why you ended up in this half-migrated state.

Just checking in here. Has anyone found a workaround for the 2:NoMethodError: undefined method 'relative_path' variant of this error while we wait for an official fix?

Hi,
I tried “sudo gitlab-rake gitlab:storage:rollback_to_legacy” and I do not get the repositories back from @hashed. I get the following message: “Enqueuing rollback of 4 projects in batches of 200. Done!” Yet I do not see any repositories in the legacy storage; I still see them in the @hashed folder.
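
Note that the rollback task only enqueues Sidekiq jobs, so the actual move happens in the background. To see what the database currently thinks, a query along these lines should work on an Omnibus install (storage_version NULL or 0 means legacy, 1 or higher means hashed):

sudo gitlab-psql -c "SELECT id, path, storage_version FROM projects ORDER BY id;"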

We still have the same case without a solution over here: Error on Gitlab CE version 13.6 No repository - #2 by vhristev

Nothing worked for me:

  • Clearing the cache
  • Migrating legacy storage to hashed
  • Clearing registration tokens

I still cannot see the repos in the UI, but the data is on the server.
