16.8.0 Backup/Restore Issues - how to repair?

Summary: I can’t get a backup of my main Gitlab server to restore properly on a warm standby. This started after upgrading to 16.8.0. I’m now looking at how to recover from this and make it work again.

Long story:
I’ve got a main Gitlab (CE) server running on a private network, which has been through multiple in-place upgrades, the latest from 16.5.0 via 16.7.3 to 16.8.0. All upgrades have been successful (or at least, have claimed to be).

I also have a ‘warm standby’ which takes the latest backup of the main server and restores it. Following some unrelated issues, I’ve completely rebuilt this server, with a direct install of 16.8.0. I’ve taken the latest backup from the upgraded main server and restored it - it says it’s successful, but it contains no actual repositories (they all say “The repository for this project does not exist.” - which is different from a ‘not found’, so it half knows about it!).

To investigate, I spun up a brand new gitlab server (directly to 16.8.0), created a repository in it and backed it up. I have been able to restore this to my warm standby successfully. I therefore conclude that it seems likely the actual backup of the main server is problematic and can’t be properly restored (although I can’t really prove that any further).

The question is then, how can I ‘refresh’ my main installation? My guess is some data isn’t properly migrated, or some files or symlinks aren’t in place. I obviously now only have one of these servers, so I can’t really just play about to see what works (unless there’s a way to do some sort of manual backup/restore?). Any ideas?

1 Like

What command are you using to create the backup? For example on mine I do:

gitlab-backup create

which creates a backup that includes the repositories. Also, be aware that some of the backups created under /var/opt/gitlab/backups during upgrade processes are not full backups and do not contain repository data - just in case you are restoring from one of those.

1 Like

Good point - I should probably try some variants. For the last 2-3 years, we’ve used this:

gitlab-backup create SKIP=registry,artifacts GITLAB_BACKUP_MAX_CONCURRENCY=4 BACKUP=dump

(I used the same when I tried the backup/restore of my ‘fresh’ server).

For restore, we do:

gitlab-backup restore force=yes (‘force’ because it’s usually run by a script, so it can’t go interactive)

I’m NOT using the upgrade backups - these are ‘full’ backups, which say they DO contain the repositories. Size-wise, they’re about the right GB to contain our repositories.
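
(For the record, this is roughly how I sanity-check what is actually inside one of these tars - just a sketch; the filename assumes the BACKUP=dump naming from the command above, and the exact internal layout varies a bit between Gitlab versions:)

tar -tf /var/opt/gitlab/backups/dump_gitlab_backup.tar | less                    # browse what the tar actually contains
tar -tf /var/opt/gitlab/backups/dump_gitlab_backup.tar | grep -c 'repositories/' # rough count of repository-related entries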

OK, good to know - just wanted to rule that out. Seems strange though that it’s not restoring. I’m assuming that the backup is from the same version that you are installing on your new server to restore to? Eg: both are 16.8.0 and both are CE?

Usually when I do something like this I do the following (I’m using gitlab-ce, so adapt where necessary); a rough command sketch follows the list.

  1. Install Gitlab-CE with same version number on my new server as the old one, eg: 16.8.0.
  2. Copy /etc/gitlab/gitlab.rb and /etc/gitlab/gitlab-secrets.json from the old server to the new one.
  3. Run gitlab-ctl reconfigure to get a basic empty installation on the new server that uses the config from the recently copied gitlab.rb and gitlab-secrets.json. This ensures postgres and the other services required for the restore are running.
  4. Copy backup from old server to /var/opt/gitlab/backups and ensure correct permissions on the file.
  5. Run restore process.
  6. Finally another reconfigure and restart of Gitlab to make sure everything comes up OK.
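
Roughly, as commands, it looks something like this (just a sketch - "old-server" and TIMESTAMP are placeholders, and stopping puma/sidekiq before the restore follows the Gitlab restore docs):

# steps 2-3: copy config across and reconfigure the fresh install
scp old-server:/etc/gitlab/gitlab.rb old-server:/etc/gitlab/gitlab-secrets.json /etc/gitlab/
gitlab-ctl reconfigure
# step 4: copy the backup into place and fix ownership
scp old-server:/var/opt/gitlab/backups/TIMESTAMP_gitlab_backup.tar /var/opt/gitlab/backups/
chown git:git /var/opt/gitlab/backups/TIMESTAMP_gitlab_backup.tar
# step 5: stop the services that talk to the database, then restore
gitlab-ctl stop puma
gitlab-ctl stop sidekiq
gitlab-backup restore BACKUP=TIMESTAMP
# step 6: reconfigure and restart
gitlab-ctl reconfigure
gitlab-ctl restart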

I expect you are also probably doing that, but just wanted to make sure and clarify. If the backup contains the repositories it should restore them. Not something I’ve experienced personally, and I generally do test this process every six months to ensure I can recover in the event of total failure.

1 Like

Thanks for checking - yes, I do (mostly) do that. The only exception is that my standby server has a slightly different gitlab.rb - but even with the exact same one, the problem still exists.

Looking at the backup logs, it seems we’re doing things correctly (as far as I can tell). However, on restore of the repositories, I see this (grepped for one repository name):

{"command":"restore","gl_project_path":"devops/ansible","level":"info","msg":"started restore","pid":88307,"relative_path":"@hashed/4b/22/4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a.git","storage_name":"default","time":"2024-01-24T10:15:46.129Z"}
{"command":"restore","gl_project_path":"devops/ansible.wiki","level":"info","msg":"started restore","pid":88307,"relative_path":"@hashed/4b/22/4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a.wiki.git","storage_name":"default","time":"2024-01-24T10:15:46.134Z"}
{"command":"restore","gl_project_path":"devops/ansible","level":"warning","msg":"skipped restore","pid":88307,"relative_path":"@hashed/4b/22/4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a.git","storage_name":"default","time":"2024-01-24T10:15:46.136Z"}
{"command":"restore","gl_project_path":"devops/ansible.design","level":"info","msg":"started restore","pid":88307,"relative_path":"@hashed/4b/22/4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a.design.git","storage_name":"default","time":"2024-01-24T10:15:46.136Z"}
{"command":"restore","gl_project_path":"devops/ansible.wiki","level":"warning","msg":"skipped restore","pid":88307,"relative_path":"@hashed/4b/22/4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a.wiki.git","storage_name":"default","time":"2024-01-24T10:15:46.140Z"}
{"command":"restore","gl_project_path":"devops/ansible.design","level":"warning","msg":"skipped restore","pid":88307,"relative_path":"@hashed/4b/22/4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a.design.git","storage_name":"default","time":"2024-01-24T10:15:46.146Z"}

Ignoring the wiki and design, the repository itself (devops/ansible) says “started restore”, but then says “warning” and “skipped restore”.

I’m struggling to get more information about that particular restore though, so I’m not sure how to debug this further.

A little more information - I found the gitaly log has this about that repo (during a restore):

{"component":"gitaly.UnaryServerInterceptor","correlation_id":"01HMXHGCX8GNAF63MJ9A8HD5VP","diskcache":"3680e2d1-994d-4426-ac2a-746112a604a1","grpc.meta.deadline_type":"none","grpc.meta.method_operation":"mutator","grpc.meta.method_scope":"repository","grpc.meta.method_type":"unary","grpc.method":"RemoveRepository","grpc.request.fullMethod":"/gitaly.RepositoryService/RemoveRepository","grpc.request.glProjectPath":"devops/ansible","grpc.request.glRepository":"","grpc.request.repoPath":"@hashed/4b/22/4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a.git","grpc.request.repoStorage":"default","grpc.service":"gitaly.RepositoryService","grpc.start_time":"2024-01-24T10:50:58.344","level":"info","msg":"diskcache state change","pid":62395,"span.kind":"server","system":"grpc","time":"2024-01-24T10:50:58.352Z"}
{"component":"gitaly.UnaryServerInterceptor","correlation_id":"01HMXHGCX8GNAF63MJ9A8HD5VP","error":"repository does not exist","grpc.code":"NotFound","grpc.meta.deadline_type":"none","grpc.meta.method_operation":"mutator","grpc.meta.method_scope":"repository","grpc.meta.method_type":"unary","grpc.method":"RemoveRepository","grpc.request.fullMethod":"/gitaly.RepositoryService/RemoveRepository","grpc.request.glProjectPath":"devops/ansible","grpc.request.glRepository":"","grpc.request.payload_bytes":111,"grpc.request.repoPath":"@hashed/4b/22/4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a.git","grpc.request.repoStorage":"default","grpc.response.payload_bytes":0,"grpc.service":"gitaly.RepositoryService","grpc.start_time":"2024-01-24T10:50:58.344","grpc.time_ms":8.357,"level":"info","msg":"finished unary call with code NotFound","pid":62395,"span.kind":"server","system":"grpc","time":"2024-01-24T10:50:58.353Z"}
{"component":"gitaly.UnaryServerInterceptor","correlation_id":"01HMXHGCXHDCZF3MMYD6E3N95T","diskcache":"7c25e160-d3b4-4c89-8784-ced7529b4c44","grpc.meta.deadline_type":"none","grpc.meta.method_operation":"mutator","grpc.meta.method_scope":"repository","grpc.meta.method_type":"unary","grpc.method":"RemoveRepository","grpc.request.fullMethod":"/gitaly.RepositoryService/RemoveRepository","grpc.request.glProjectPath":"devops/ansible.wiki","grpc.request.glRepository":"","grpc.request.repoPath":"@hashed/4b/22/4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a.wiki.git","grpc.request.repoStorage":"default","grpc.service":"gitaly.RepositoryService","grpc.start_time":"2024-01-24T10:50:58.353","level":"info","msg":"diskcache state change","pid":62395,"span.kind":"server","system":"grpc","time":"2024-01-24T10:50:58.358Z"}
{"component":"gitaly.UnaryServerInterceptor","correlation_id":"01HMXHGCXHDCZF3MMYD6E3N95T","error":"repository does not exist","grpc.code":"NotFound","grpc.meta.deadline_type":"none","grpc.meta.method_operation":"mutator","grpc.meta.method_scope":"repository","grpc.meta.method_type":"unary","grpc.method":"RemoveRepository","grpc.request.fullMethod":"/gitaly.RepositoryService/RemoveRepository","grpc.request.glProjectPath":"devops/ansible.wiki","grpc.request.glRepository":"","grpc.request.payload_bytes":121,"grpc.request.repoPath":"@hashed/4b/22/4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a.wiki.git","grpc.request.repoStorage":"default","grpc.response.payload_bytes":0,"grpc.service":"gitaly.RepositoryService","grpc.start_time":"2024-01-24T10:50:58.353","grpc.time_ms":5.977,"level":"info","msg":"finished unary call with code NotFound","pid":62395,"span.kind":"server","system":"grpc","time":"2024-01-24T10:50:58.359Z"}
{"component":"gitaly.UnaryServerInterceptor","correlation_id":"01HMXHGCXJQAM9RRM7PK73YVAY","diskcache":"0d6e196e-f1b0-4760-90cd-da8a545408f0","grpc.meta.deadline_type":"none","grpc.meta.method_operation":"mutator","grpc.meta.method_scope":"repository","grpc.meta.method_type":"unary","grpc.method":"RemoveRepository","grpc.request.fullMethod":"/gitaly.RepositoryService/RemoveRepository","grpc.request.glProjectPath":"devops/ansible.design","grpc.request.glRepository":"","grpc.request.repoPath":"@hashed/4b/22/4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a.design.git","grpc.request.repoStorage":"default","grpc.service":"gitaly.RepositoryService","grpc.start_time":"2024-01-24T10:50:58.354","level":"info","msg":"diskcache state change","pid":62395,"span.kind":"server","system":"grpc","time":"2024-01-24T10:50:58.363Z"}
{"component":"gitaly.UnaryServerInterceptor","correlation_id":"01HMXHGCXJQAM9RRM7PK73YVAY","error":"repository does not exist","grpc.code":"NotFound","grpc.meta.deadline_type":"none","grpc.meta.method_operation":"mutator","grpc.meta.method_scope":"repository","grpc.meta.method_type":"unary","grpc.method":"RemoveRepository","grpc.request.fullMethod":"/gitaly.RepositoryService/RemoveRepository","grpc.request.glProjectPath":"devops/ansible.design","grpc.request.glRepository":"","grpc.request.payload_bytes":125,"grpc.request.repoPath":"@hashed/4b/22/4b227777d4dd1fc61c6f884f48641d02b4d121d3fd328cb08b5531fcacdabf8a.design.git","grpc.request.repoStorage":"default","grpc.response.payload_bytes":0,"grpc.service":"gitaly.RepositoryService","grpc.start_time":"2024-01-24T10:50:58.354","grpc.time_ms":9.481,"level":"info","msg":"finished unary call with code NotFound","pid":62395,"span.kind":"server","system":"grpc","time":"2024-01-24T10:50:58.364Z"}
{"component":"gitaly.UnaryServerInterceptor","correlation_id":"01HMXHGD7FJXKKAX1R5RKHSWHJ","diskcache":"9b058293-bc51-4945-b78d-a6afef4e5145","grpc.meta.deadline_type":"none","grpc.meta.method_operation":"mutator","grpc.meta.method_scope":"repository","grpc.meta.method_type":"unary","grpc.method":"RemoveRepository","grpc.request.fullMethod":"/gitaly.RepositoryService/RemoveRepository","grpc.request.glProjectPath":"devops/field-ansible","grpc.request.glRepository":"","grpc.request.repoPath":"@hashed/c6/f3/c6f3ac57944a531490cd39902d0f777715fd005efac9a30622d5f5205e7f6894.git","grpc.request.repoStorage":"default","grpc.service":"gitaly.RepositoryService","grpc.start_time":"2024-01-24T10:50:58.671","level":"info","msg":"diskcache state change","pid":62395,"span.kind":"server","system":"grpc","time":"2024-01-24T10:50:58.680Z"}
{"component":"gitaly.UnaryServerInterceptor","correlation_id":"01HMXHGD7FJXKKAX1R5RKHSWHJ","error":"repository does not exist","grpc.code":"NotFound","grpc.meta.deadline_type":"none","grpc.meta.method_operation":"mutator","grpc.meta.method_scope":"repository","grpc.meta.method_type":"unary","grpc.method":"RemoveRepository","grpc.request.fullMethod":"/gitaly.RepositoryService/RemoveRepository","grpc.request.glProjectPath":"devops/field-ansible","grpc.request.glRepository":"","grpc.request.payload_bytes":117,"grpc.request.repoPath":"@hashed/c6/f3/c6f3ac57944a531490cd39902d0f777715fd005efac9a30622d5f5205e7f6894.git","grpc.request.repoStorage":"default","grpc.response.payload_bytes":0,"grpc.service":"gitaly.RepositoryService","grpc.start_time":"2024-01-24T10:50:58.671","grpc.time_ms":9.989,"level":"info","msg":"finished unary call with code NotFound","pid":62395,"span.kind":"server","system":"grpc","time":"2024-01-24T10:50:58.681Z"}
{"component":"gitaly.UnaryServerInterceptor","correlation_id":"01HMXHGD7M226QAKXN2C7FZF3Y","diskcache":"f0e5ade5-e677-4cc1-9da7-3b2ddfe7bfad","grpc.meta.deadline_type":"none","grpc.meta.method_operation":"mutator","grpc.meta.method_scope":"repository","grpc.meta.method_type":"unary","grpc.method":"RemoveRepository","grpc.request.fullMethod":"/gitaly.RepositoryService/RemoveRepository","grpc.request.glProjectPath":"devops/field-ansible.wiki","grpc.request.glRepository":"","grpc.request.repoPath":"@hashed/c6/f3/c6f3ac57944a531490cd39902d0f777715fd005efac9a30622d5f5205e7f6894.wiki.git","grpc.request.repoStorage":"default","grpc.service":"gitaly.RepositoryService","grpc.start_time":"2024-01-24T10:50:58.676","level":"info","msg":"diskcache state change","pid":62395,"span.kind":"server","system":"grpc","time":"2024-01-24T10:50:58.684Z"}
{"component":"gitaly.UnaryServerInterceptor","correlation_id":"01HMXHGD7M226QAKXN2C7FZF3Y","error":"repository does not exist","grpc.code":"NotFound","grpc.meta.deadline_type":"none","grpc.meta.method_operation":"mutator","grpc.meta.method_scope":"repository","grpc.meta.method_type":"unary","grpc.method":"RemoveRepository","grpc.request.fullMethod":"/gitaly.RepositoryService/RemoveRepository","grpc.request.glProjectPath":"devops/field-ansible.wiki","grpc.request.glRepository":"","grpc.request.payload_bytes":127,"grpc.request.repoPath":"@hashed/c6/f3/c6f3ac57944a531490cd39902d0f777715fd005efac9a30622d5f5205e7f6894.wiki.git","grpc.request.repoStorage":"default","grpc.response.payload_bytes":0,"grpc.service":"gitaly.RepositoryService","grpc.start_time":"2024-01-24T10:50:58.676","grpc.time_ms":9.257,"level":"info","msg":"finished unary call with code NotFound","pid":62395,"span.kind":"server","system":"grpc","time":"2024-01-24T10:50:58.686Z"}
{"component":"gitaly.UnaryServerInterceptor","correlation_id":"01HMXHGD7SYG0FPJ1W65ZFVKH8","diskcache":"1257564e-b7f2-4c4c-965b-17a530e9012e","grpc.meta.deadline_type":"none","grpc.meta.method_operation":"mutator","grpc.meta.method_scope":"repository","grpc.meta.method_type":"unary","grpc.method":"RemoveRepository","grpc.request.fullMethod":"/gitaly.RepositoryService/RemoveRepository","grpc.request.glProjectPath":"devops/field-ansible.design","grpc.request.glRepository":"","grpc.request.repoPath":"@hashed/c6/f3/c6f3ac57944a531490cd39902d0f777715fd005efac9a30622d5f5205e7f6894.design.git","grpc.request.repoStorage":"default","grpc.service":"gitaly.RepositoryService","grpc.start_time":"2024-01-24T10:50:58.681","level":"info","msg":"diskcache state change","pid":62395,"span.kind":"server","system":"grpc","time":"2024-01-24T10:50:58.689Z"}
{"component":"gitaly.UnaryServerInterceptor","correlation_id":"01HMXHGD7SYG0FPJ1W65ZFVKH8","error":"repository does not exist","grpc.code":"NotFound","grpc.meta.deadline_type":"none","grpc.meta.method_operation":"mutator","grpc.meta.method_scope":"repository","grpc.meta.method_type":"unary","grpc.method":"RemoveRepository","grpc.request.fullMethod":"/gitaly.RepositoryService/RemoveRepository","grpc.request.glProjectPath":"devops/field-ansible.design","grpc.request.glRepository":"","grpc.request.payload_bytes":132,"grpc.request.repoPath":"@hashed/c6/f3/c6f3ac57944a531490cd39902d0f777715fd005efac9a30622d5f5205e7f6894.design.git","grpc.request.repoStorage":"default","grpc.response.payload_bytes":0,"grpc.service":"gitaly.RepositoryService","grpc.start_time":"2024-01-24T10:50:58.681","grpc.time_ms":7.846,"level":"info","msg":"finished unary call with code NotFound","pid":62395,"span.kind":"server","system":"grpc","time":"2024-01-24T10:50:58.689Z"}

I also find nothing much in /var/opt/gitlab/git-data, whereas on the main server I do find my repositories there. It really has not restored anything on the backup server - so it’s definitely a git/gitaly issue, rather than a database migrations problem.

The backups DO contain the repositories, and I see during the restore process that these are extracted to disk (in their hashed directory names) - so I’m slightly happier that I am making useful backups, even if I can’t (yet) restore them correctly (my guess is I could probably do something with a restore if I absolutely had to).

Are your permissions on git-data OK? On my server it looks like this:

drwx------  3 git               git        4.0K Sep 25  2017 git-data

the same for repositories which should exist inside git-data:

drwxrws--- 11 git git 4.0K Jan 22 03:18 repositories

further down I also have:

drwxr-sr-x  4 git root 4.0K Jan 24 12:12 +gitaly
drwxr-sr-x 32 git root 4.0K Nov 21 13:44 @hashed
drwxr-s--- 17 git root 4.0K Jan  2 14:43 @snippets

I’ve filtered out other directories, as I believe they are remnants on my system from before Gitlab started to use Gitaly, since repos were previously outside of @hashed.

If we can rule out a permissions issue, at least we know that isn’t what’s blocking the files from being written to disk during the restore.
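
To check quickly on your side (a sketch, assuming the default Omnibus paths and the git user - the modes should match the listings above):

ls -ld /var/opt/gitlab/git-data /var/opt/gitlab/git-data/repositories
ls -la /var/opt/gitlab/git-data/repositories
# only if ownership/modes look wrong - these correspond to drwxrws--- git:git
chown -R git:git /var/opt/gitlab/git-data/repositories
chmod 2770 /var/opt/gitlab/git-data/repositories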

1 Like

Another good thought - yes, it all seems to be owned by git:git. This is the entire contents of my /var/opt/gitlab/git-data directory on the restore server:

/var/opt/gitlab/git-data/repositories:  drwxrws--- 3 git git 4096 Jan 24 10:50
/var/opt/gitlab/git-data/repositories/+gitaly:  drwxr-sr-x 3 git git 4096 Jan 24 11:57 .
/var/opt/gitlab/git-data/repositories/+gitaly/tmp:  drwxr-sr-x 2 git git 4096 Jan 24 10:50

So not much to see, but it all looks correct.

I just tried zipping up the /var/opt/gitlab/git-data directory from my running server and putting it onto my (half restored) warm standby. Right away, repositories look like they’re working (they’re certainly browsable, and clone-able).

I’ve tried the same using the repositories data from inside the backup tar file - but this does NOT work. The repositories are found, but they can’t be viewed properly (the UI says “An error occurred while fetching folder content.”). Looking at it, the files buried in the hashed pathnames are bundles, so not immediately usable.

In the short term, I’ll work out some way to zip up the repositories separately - just so I have something I can actually rebuild a server from.
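
(Something like this is what I have in mind for that separate archive - a rough sketch, and it needs a short outage so the copy is consistent; the archive name/location is arbitrary:)

gitlab-ctl stop
tar -czf /var/opt/gitlab/backups/repositories-manual.tar.gz -C /var/opt/gitlab/git-data repositories
gitlab-ctl start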

You can run some sanity checks from here: Maintenance Rake tasks | GitLab

And you can check migration status with:

gitlab-rake db:migrate:status

to see if something weird is going on with the old server.
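
For example (the exact task names are on the Maintenance Rake tasks page linked above; these are the ones I’d start with):

gitlab-rake gitlab:check SANITIZE=true    # overall application health checks
gitlab-rake gitlab:git:fsck               # integrity check of the repositories on disk
gitlab-rake gitlab:doctor:secrets         # verify encrypted DB values against gitlab-secrets.json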

If you wanted to get just the repository content, eg: files/commits, then you could use repository mirroring. I expect issues won’t get mirrored over though, or the wiki. Here: Repository mirroring | GitLab - push mirroring is available for free, so you can push from the old server to the new one. The mirror push is triggered automatically whenever changes are pushed to the old server.
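
If you have a lot of projects, the mirror can also be created per project via the API rather than clicking through the UI for each one (a sketch only - the token, project ID, hostnames and credentials are placeholders):

curl --request POST --header "PRIVATE-TOKEN: <access-token>" \
     --data "url=https://username:password@new-server.example.com/devops/ansible.git" \
     --data "enabled=true" \
     "https://old-server.example.com/api/v4/projects/<project-id>/remote_mirrors"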

There used to be a method to import repositories from data on disk, but I’m not sure if that has been removed in current versions. There are posts on here about importing data that way when people had issues. I cannot think of any other way you can recover this, unless the gitlab-rails logs (or potentially other log files relating to repository data) show anything useful.

1 Like

Not sure if this will help, they gave a way of how they managed to recover: Migrate data from dead Gitlab-CE instance using only local files, backup corrupted! - #2 by iwalker

1 Like

Thanks for the ideas. The rake status says “ok” for everything - I’m inclined to agree, because when the repository files are put into place, it all seems to be working (so the indexes and repositories obviously match up).

As for longer term options, I guess I’m left waiting for Gitlab to find a solution. It seems like something new in a recent release, so hopefully it’ll get fixed. In the interim, I’ll look to see if the backup is kind enough to scoop up files in the repository directory - if it is, I can zip up the repositories before I run the Gitlab backup - it still won’t restore properly, but at least the backup file will have everything in it to recover the server if I ever need to. Failing that, I’ll do a two step backup (not at all ideal, and risks inconsistent repository backups, but it’s better than no repositories at all).

My warm standby is somewhat in jeopardy here though - if the repository backups thing works, I can probably automate putting them back into the partially recovered server. Not sure how good that idea is really - may have to think that one over a bit more.

Meanwhile, thanks again for the ideas - you gave me the nudges I needed to figure this out a bit further - even if I haven’t yet got to a solution.

If the OS on the new server is the same version as the old one, something else worth a try (rough commands are sketched after the list):

  1. Install Gitlab on new server.
  2. Copy across gitlab.rb, gitlab-secrets.json.
  3. Run gitlab-ctl reconfigure so that all the users required by Gitlab are created in /etc/passwd and /etc/group.
  4. Ensure Gitlab is stopped: gitlab-ctl stop followed by systemctl stop gitlab-runsvdir.
  5. Move /var/opt/gitlab to /var/opt/gitlab.old or whatever else you want to rename it to.
  6. Use rsync to copy /var/opt/gitlab from the old server to the same location on the new server.
  7. Run gitlab-ctl reconfigure and ensure this goes through OK.
  8. Run sanity checks from the link I provided earlier to make sure no issues.
  9. Ensure Gitlab is started, or restart it gitlab-ctl restart.
  10. Check all services came up OK gitlab-ctl status.
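
As commands, points 4-10 look roughly like this (a sketch - "old-server" is a placeholder and the rsync flags can be tuned):

gitlab-ctl stop
systemctl stop gitlab-runsvdir
mv /var/opt/gitlab /var/opt/gitlab.old
rsync -aAHX old-server:/var/opt/gitlab/ /var/opt/gitlab/
systemctl start gitlab-runsvdir     # you may need runit back before reconfigure can manage services
gitlab-ctl reconfigure
gitlab-ctl restart
gitlab-ctl status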

Then check your repositories and see how it looks. Rsync should preserve ownership by user and group name, even if the uid/gid differ between the old and new servers. Theoretically you could even, before point 4, change the uid and gid of the users and groups to match the old server, but it shouldn’t be necessary.

Could be time consuming, but it might be a more successful route. The only thing afterwards, if it does work, would be to try a backup from the new server and restore it on yet another new server to see if the problem remains. If so, at that point you can look at a different backup strategy.

You may wish to take a look at a really cool backup tool called restic (https://restic.net/) - I use this with something else entirely, but it even has the ability to take online backups without stopping the services. At least it does for the Zimbra mail server, which would usually block access to the LDAP (slapd) database (which, as it happens, is also a sparse file, just to complicate things further). I’ve not tested it yet with Gitlab to see if I can make complete backups without taking it offline, but I will be doing that shortly just for my own personal curiosity. In this instance the only things you would need to back up are /etc/gitlab and /var/opt/gitlab, and potentially /var/log/gitlab just for log archive info (although that won’t be necessary for recovering the server).

There are probably other backup tools that can be used, commercial or whatever. Restic backs up to disk, compresses, and also de-duplicates, so subsequent backups only store the changed files and complete far quicker than the initial sync. For that reason you’ll want to keep at least two snapshots at any one time: removing every snapshot would mean losing the de-duplication benefit, and the next backup would take a lot longer than one that has a previous snapshot to compare against.

Just some extra thoughts for potential ways forward.

2 Likes

So, I have managed to recover my system using restic. The backup/snapshot was taken whilst Gitlab was running. This does mean some caveats to getting it running again, but it is doable. Browsing the web interface, I do see my repositories, and I do see the files listed in them, and I can view them.

My process for the backup was:

restic backup /etc/gitlab /var/opt/gitlab

This is then shown in the output below:

root@gitlab:~# restic snapshots
repository e07706e0 opened (version 2, compression level auto)
ID        Time                 Host        Tags        Paths
----------------------------------------------------------------------
e074be8d  2024-01-24 18:57:19  gitlab                  /etc/gitlab
                                                       /var/opt/gitlab
----------------------------------------------------------------------
1 snapshots

I then rsynced this across to my new server. I stored the snapshots in /var/restic, but you can choose the location yourself when you run the restic init command. Once on the new server:

root@debian:~# restic snapshots
repository e07706e0 opened (version 2, compression level auto)
ID        Time                 Host        Tags        Paths
----------------------------------------------------------------------
e074be8d  2024-01-24 18:57:19  gitlab                  /etc/gitlab
                                                       /var/opt/gitlab
----------------------------------------------------------------------
1 snapshots

My first step was to install gitlab-ce-16.8.0 which is what I was running on the old server. Once this was installed, I then ran the restic restore procedure:

root@debian:/var/restic# restic restore e074be8d --target /
repository e07706e0 opened (version 2, compression level auto)
[0:00] 100.00%  1 / 1 index files loaded
restoring <Snapshot e074be8d of [/etc/gitlab /var/opt/gitlab] at 2024-01-24 18:57:19.557248311 +0100 CET by root@gitlab> to /
Summary: Restored 10810 files/dirs (1.153 GiB) in 0:02

At this point, I was able to run gitlab-ctl reconfigure and get my Gitlab install sane enough in terms of configuration. Some pid files needed to be removed from the postgresql and gitaly directories, because they had been captured in the backup while the services were running.

After this you will be debugging a few things in terms of permissions, for example access to git-data, postgresql, redis. I also used the gitlab-rake gitlab:check command to ensure everything had restored - this hinted at fixing authorized_keys and one or two other things as well.

Depending on which service is failing to start, you can check the log file output and it will usually be throwing permission errors. I had that for postgres, gitaly, redis and some issues with nginx since I changed the hostname of the server, but not the names of the certificate files.

Also, depending on whether you can afford downtime overnight, you could script it to stop Gitlab before running the restic backup. That would at least help with the PID file issue. Then, once the backup/snapshot has finished, start Gitlab again. I decided on the harder route, just to prove it can be done as an online backup - at least in my tests, but your mileage may vary.
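
Scripted, that would be something along these lines (a sketch, assuming the restic repository already exists at /var/restic):

gitlab-ctl stop                                           # take Gitlab fully offline
restic -r /var/restic backup /etc/gitlab /var/opt/gitlab  # restic prompts for the repo password unless RESTIC_PASSWORD or RESTIC_PASSWORD_FILE is set
gitlab-ctl start                                          # bring everything back up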

The version of restic I used was not the one installable from the distro repositories; I went to their downloads page and got 0.16.3, which is the latest as of this post.

1 Like

All good information - thanks. I will try backing up my hand-restored server and see if it’ll restore somewhere else. I don’t have any services failing to start - in fact, everything says “succeeded”, even though it’s skipped nearly all of the repository restore steps (that feels like a bug to me…).

As for file-based backups (like restic, Bacula, or countless others) - these look fine, but they can all fail if a repository is being written to at the time the backup runs: you risk getting an inconsistent backup which likely won’t be restorable. You’ll be okay 99 times out of 100, but of course that 100th time is the one you’ll need. You’d need to more or less fully stop Gitlab before the backup and start it up again afterwards, thus preventing any concurrent repository access during the backup.

Gitlab’s own backup solves this problem by making a Git bundle of each repo and then backing up the bundles. A bundle is created as a git transaction, so it happens either before or after any other interaction, and so will be consistent. As such, I really do need Gitlab’s own solution to work.

I’m going to look into the bundles in the backup a little more closely - that may lead me to better information to talk to Gitlab with (maybe to open a bug or whatever).
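
(For anyone else poking at this: the bundle files extracted from the backup can at least be inspected with plain git - a sketch with an illustrative filename:)

git clone --mirror repo.bundle /tmp/restored.git   # rebuild a bare repository from the bundle
git -C /tmp/restored.git show-ref                  # list the branches/tags that came across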

Sadly, not fixed in 16.8.1 either :frowning:

Got things to a repeatable point, so logged these two bugs:

  1. No error even if we didn’t actually restore anything: gitlab-restore skips repositories but doesn't fail the restore (#439407) · Issues · GitLab.org / GitLab · GitLab

  2. Can’t restore a backup: Backup/Restore fails to restore repositories (versions 16.8.0-ce - 16.8.1-ce) (#439468) · Issues · GitLab.org / GitLab · GitLab (it turns out you can make a fresh system, back it up and then restore it to show the problem quite repeatably)

I find it strange, since I’ve just installed Gitlab 16.8.1 and restored a backup, and I don’t have the same problem as you.

You mention you use gitlab-backup restore and answer the questions, but that isn’t very clear. My restore looks like this:

root@gitlab-restore:/home/ian# ls /var/opt/gitlab/backups/
1706223688_2024_01_25_16.8.1_gitlab_backup.tar

root@gitlab-restore:/home/ian# gitlab-backup restore BACKUP=1706223688_2024_01_25_16.8.1

I’m assuming you also specify the file you are restoring from (and that you did chown git:git on it before starting the restore). I then say yes to some questions later about overwriting tables in the database, etc. I don’t use the BACKUP=dump option, but I don’t think that would be the cause of the problem. I don’t use any skip options either, but you are only skipping registry and artifacts, so any committed files in your repositories should still be there after restore.

Other than that, the only difference is I’m using Debian 12, and not Ubuntu 22.04. My gitlab.rb and gitlab-secrets.json were put in place after installing the gitlab-ce package, changing the external_url and running reconfigure. Then the restore was done as per the Gitlab docs. I ran a reconfigure after restore, and then restarted Gitlab.

I can only think something is weird with either Ubuntu 22.04 or the way your Gitlab server is configured or being backed up and restored. I haven’t been able to replicate your problem unfortunately.

My git-data directory:

root@gitlab-restore:/var/opt/gitlab/git-data# du -sh *
2.7G	repositories

and further:

root@gitlab-restore:/var/opt/gitlab/git-data/repositories# du -sh *
1.2M	+gitaly
2.7G	@hashed
964K	@snippets

Hmm… that is interesting. I tried the procedure twice, and got the same result both times. I’ll maybe try with Ubuntu 20 and maybe Debian too.

Given yours works with a backup you’ve taken previously, this maybe smells like my actual backup is bad (and so can’t be restored). I’d assumed the opposite because the files inside the backup I’ve got look okay to me - but that doesn’t exactly make it true.

My backup uses BACKUP=dump rather than a timestamped name, so it generates a dump_gitlab_backup.tar (which gitlab puts in /var/opt/gitlab/backups). My restore doesn’t specify a filename, so it picks up that same file. You have to say “yes” to proceeding, and another “yes” to overwriting the authorized_keys file, but that’s all.

Hmm… still no joy - running through the procedure in the bug ticket, I have tried Ubuntu 22, 20 and Debian 12 - none of them produce a backup/restore that works.

Given the world isn’t screaming about this, and since @iwalker seems to have a working system, I assume it’s not a problem for everyone (or people aren’t aware, because the restore process fails silently). I’m left wondering what the difference could be that makes this work?

Meanwhile, I’ll try some previous versions - that might at least tell us when the problem started.