Hi all, I will look at all the information you provided here and track the problem in https://gitlab.com/gitlab-org/gitlab/-/issues/259605. This is a high-priority issue to get fixed, as it shouldn’t have caused problems in the first place. The way the migration was coded makes it very hard to lose data, so for those of you who ended up in an inconsistent state, it may be that the database has flagged the storage as migrated while the repositories are still on the legacy storage format, or the opposite: the repositories were migrated but the database update failed for some reason. In either case, it’s possible to get things back to normal.
Please follow the issue to get notified of any solution. (We will probably have a patch release for 14.4.x with a fix, but I will also provide instructions on how to fix it manually, so you don’t have to wait.)
The OpenSSL::Cipher::CipherError means some encrypted data in the database couldn’t be read with the existing keys in /etc/gitlab/gitlab-secrets.json. The only reason I can think of for this happening is when you have GitLab installed in an HA (or sort-of HA) setup, running on multiple nodes to spread the load. In that scenario, if Sidekiq runs on a different node and you have forgotten to copy the secrets there, you may end up with this kind of issue.
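Conceptually, the failure mode is a key mismatch: values sealed under one node’s secrets cannot be unsealed under another node’s secrets. A minimal Python sketch of that idea (using stdlib `hmac` as a stand-in for the real AES-GCM encryption Rails uses; all names and values here are illustrative, not GitLab’s actual scheme):

```python
import hashlib
import hmac

def seal(key: bytes, plaintext: bytes) -> bytes:
    # "Encrypt" by prefixing an authentication tag (illustration only, no real secrecy).
    tag = hmac.new(key, plaintext, hashlib.sha256).digest()
    return tag + plaintext

def unseal(key: bytes, blob: bytes) -> bytes:
    tag, data = blob[:32], blob[32:]  # SHA-256 tag is 32 bytes
    expected = hmac.new(key, data, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        # Analogous to OpenSSL::Cipher::CipherError: the stored value is intact,
        # but the key on this node is not the one that produced it.
        raise ValueError("key mismatch")
    return data

blob = seal(b"key-from-gitlab-secrets.json", b"runner registration token")
print(unseal(b"key-from-gitlab-secrets.json", blob))  # the matching key reads it back
```

With any other key, `unseal` raises, which is roughly what a Sidekiq node hits when /etc/gitlab/gitlab-secrets.json was never copied to it.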
I had GitLab running on a single node (a simple yum install of gitlab-ci). It ran for a while, then upgrades started having some issues (I think “registry” was added, so the HTTPS keys were not loading properly), and after every upgrade I had to manually “fix” the config by pointing the registry to the proper HTTPS keys. Then I decided to move GitLab from the host to a VM for easier backups, as the whole system had become super important. So we ran backup-etc plus a regular backup, installed GitLab on the VM, restored /etc and the backup (so the rb files were copied to the new system), and everything worked OK through a few more upgrades. Then this upgrade came along, tried to do a filesystem migration, and crashed big time. I had ~25 repos there, 2 of them with wiki pages; 2-3 of them were already on the hashed system and all the others on the old system. After the migration, all repos on the “old system” were empty (empty folders), and no repos except those that were on hashed before were visible. Starting the migration again crashed with this crypto error, so I then ran:
UPDATE projects SET runners_token = null, runners_token_encrypted = null;
UPDATE namespaces SET runners_token = null, runners_token_encrypted = null;
UPDATE application_settings SET runners_registration_token_encrypted = null;
UPDATE ci_runners SET token = null, token_encrypted = null;
and started the migration again.
All repos became visible now, but the 2 repos with wikis had empty wikis.
I checked the filesystem: those 2 repos that had wikis now had an (empty) repo.wiki folder in the old filesystem repo format. I tried a few things to restore the wikis but could not do it; the repos were also still in the old format and the migration was not moving them to hashed. Then I deleted the .wiki folder and ran the migration again: the repos migrated successfully to hashed and the wikis automagically got restored…
So I’m now fully functional, but I can’t say I’m very confident in GitLab after the whole ordeal, so I’m manually backing up the whole VM nonstop and storing those backups… not very optimal, but…
You are right, with backup-etc this should be covered. We have two separate issues here; one is that the storage migration was left in an inconsistent state when an exception occurred, which is not what we want to happen, so this should be fixed and I will investigate it on the original issue.
@arhi did you see any errors in the logs regarding those two? During the hashed storage rollout on gitlab.com we’ve seen a few cases where the repositories had wrong permissions, and because of that the migration scripts couldn’t move them from one folder to another.
The migration (in my case) failed with the cipher error; some others had it fail with other errors. A failed migration is a big fail that should never happen. To be honest, an automated migration without an explicit question to migrate is a huge fail!!! Just like gitlab-ci will not update if it cannot make a backup, I’d prevent it from doing anything if gitlab:doctor:secrets fails too. Also, “auto migration”… hm… just like you do not continue an upgrade when you can’t make a backup, you could refuse to update while there is still old-style storage and request a manual upgrade to hashed. The docs now state hashed will be mandatory in 14.0, so there’s no need to migrate in 13.x; why force it anyhow?! In any case, a migration of the storage system is not something I expect in the upgrade procedure. It’s something I’ll do manually: first I’ll back up the %$#^#^+ out of the system, test that the backup is good by restoring it to a stage system, and only then try to migrate the storage. Doing it like this… well, not the first gray hair I got nor the last, but I could really have skipped this stress.
Now, how this cipher error happened: no idea. We copied /etc and restored the backup, and everything worked OK for a few updates. There’s no way to turn back time to see what got messed up, so I don’t know what info I can add about it.
Nope, no errors; the migrate script was finishing without reporting any errors anywhere, but the repos were not being migrated (file permissions were OK, I checked that before I decided to delete .wiki to see what would happen). I’m not a Ruby person, so I don’t know how this all works; I tried to find out what the migration actually does but could not find the script where it is done. But it looked to me like, once the crypto thing was solved, the migration script returned the repo files from hashed back to the old layout but did not move the .wiki files for some reason. So in the @hashed structure the wiki data was left alone, and in the old-style layout the repo was stored, but there was an “excess” .wiki in the old style. This .wiki then prevented migration of the repo, so when I deleted the .wiki, the repo was migrated and the old wiki data that was there from the first migration attempt was now joined with the repo. It should be possible to track this through the script, but… I don’t do Ruby.
From 13.0 onward, gitlab-rake gitlab:check would tell you that you have an issue that needs to be fixed.
The rationale for the auto migration being triggered 4 releases after the first warning was that everyone should have migrated already, or at least attempted it and reported any exception we didn’t cover so we could fix it.
The automatic migration was intended to get everyone who either missed it, or who had any unmigrated repository left, onto hashed storage. So in a sense this was a mandatory one. (Please understand that this whole migration involved a ton of work, as we had to change and fix things that go back to probably the very first versions.) Having hashed storage allows us to fix other problems that may occur when running GitLab at scale.
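For reference, the hashed layout derives each project’s on-disk path from the SHA-256 of its numeric ID, with the wiki stored alongside as `<hash>.wiki.git`; this is why a stray `.wiki` folder in the legacy layout can collide with data already under @hashed. A small Python sketch of the path scheme (directory names follow GitLab’s documented layout; the `project_id` value is just an example):

```python
import hashlib

def hashed_storage_paths(project_id: int) -> tuple:
    """Return (repository, wiki) relative paths under the hashed storage layout."""
    h = hashlib.sha256(str(project_id).encode()).hexdigest()
    # First two hex-digit pairs become nesting directories to spread projects out.
    base = "@hashed/{}/{}/{}".format(h[:2], h[2:4], h)
    return base + ".git", base + ".wiki.git"

repo, wiki = hashed_storage_paths(1)
print(repo)  # @hashed/6b/86/6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b.git
```

Because both paths hang off the same hash, a project and its wiki always land in the same directory, so they are expected to migrate together.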
Don’t get me wrong, I was soooooo pissed, you cannot imagine; it happened at the worst possible moment, etc. etc. But I do know I’m getting a mega turbo uber giga best system out there for free, and I do appreciate that big time. I also appreciate all the work you guys are putting in; working on a big open source project myself, I’ve had my share of angry customers and crazy decisions. So don’t get me wrong, I still think you guys are the best.
I tried “sudo gitlab-rake gitlab:storage:rollback_to_legacy” and I do not get the repositories back from @hashed. I get the following message: “Enqueuing rollback of 4 projects in batches of 200. Done!” Yet I do not see any repositories in the legacy storage; I still see them in the @hashed folder.