Migrating instance with large LFS store, tar: file changed as we read it

I’m trying to migrate my org’s GitLab instance to a new server; however, I cannot currently make a backup (our normal backup process takes a snapshot of the VM rather than using gitlab-backup, but that isn’t applicable in this situation). The problem occurs during processing of the LFS objects. These are large (~100GB) compared to the rest of the instance and are stored on a CIFS mount. The backup is going into a different directory on the same mount (the server that the instance runs on doesn’t have the disk space for the backup). I’m using the procedure outlined here. I have tried the STRATEGY=copy option, without any change. Any suggestions?
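
For reference, the backup is being run roughly like this (a sketch; the CIFS mount path is a placeholder for our real directory, not the actual config):

```sh
# /etc/gitlab/gitlab.rb - backups go to a directory on the same CIFS mount
# (placeholder path for illustration)
gitlab_rails['backup_path'] = '/mnt/cifs-share/gitlab-backups'
```

```sh
# apply the config change, then run the backup with the copy strategy
sudo gitlab-ctl reconfigure
sudo gitlab-backup create STRATEGY=copy
```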

Edit
Can I safely exclude the LFS store and move it separately? That would, I think, solve the issue.

What version are you on (Hint: /help)? And are you using self-managed or gitlab.com?
13.2.0-ee (core)

What troubleshooting steps have you already taken? Can you link to any docs or other resources so we know where you have been?

  • Use STRATEGY=copy
  • Stop puma and sidekiq while the backup runs, roughly as sketched after this list (I can’t really try this again, as it makes the instance unavailable overnight)
  • I can’t use the suggestion here because of disk space limitations.
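
The stop-services attempt looked roughly like this (a sketch, assuming only puma and sidekiq need to be stopped, as in the usual backup/restore guidance):

```sh
# stop the processes that write to repository/LFS data during the backup
sudo gitlab-ctl stop puma
sudo gitlab-ctl stop sidekiq

# confirm they are down, then back up with the copy strategy
sudo gitlab-ctl status
sudo gitlab-backup create STRATEGY=copy

# bring everything back afterwards
sudo gitlab-ctl start
```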

I’m not sure this is really a GitLab problem per se - it looks like general sysadmin issues to me…?

I have no direct experience of what you’re doing, but “tar: file changed as we read it” means tar was partway through reading a file when its contents changed. That suggests the file is being written to by some other running process.

You say you can’t stop puma/sidekiq, but I’d imagine it’s these processes that are modifying the files you’re trying to back up. If you can’t stop the processes, you can’t use tar to do the backups - you’re then only really left with filesystem snapshots - but if a snapshot takes place during a file write, you still may not get a consistent file in the snapshot - so even though it looks good, you won’t be able to build a working system from it.

I’d say you pretty much have to take the downtime, stop all application processes and do your backup. Then you’ll be guaranteed a consistent backup that you can actually restore successfully.

Thanks for the reply. I should have phrased that better: I can stop puma and sidekiq, and I have tried that, with exactly the same result. I meant that I can’t keep trying it night after night in the hope that it works. If that solved the issue, we could take the downtime, but it doesn’t seem to.
The reason I think this is a GitLab issue is that the docs specifically state that the copy strategy was added to avoid exactly this error. I’m not sure if the reason it isn’t working is an error on my part, or a bug.

Ahh okay - got it. In which case, I’d just stop all processes on the box (and any others that may be using the CIFS share) and then do the backup.

If you’re not sure you’ve got them all, try unmounting the CIFS share - if it unmounts, then nothing’s using it (and you could even remount it somewhere different so nothing else can get at it).
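
Something along these lines should show whether anything still has files open on the share (the mount point is a placeholder):

```sh
# see which processes have open files on the mount (placeholder path)
sudo fuser -vm /mnt/cifs-share
# or, more verbosely
sudo lsof +f -- /mnt/cifs-share

# if nothing is using it, this will succeed
sudo umount /mnt/cifs-share
```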

Hmm, the gitlab-backup command fails if the postgresql service is stopped - I found that out when I initially tried just running gitlab-ctl stop before creating the backup. But unmounting the share to see what’s accessing it is a good shout. Thanks.
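
Next time I’ll probably try stopping everything and then bringing the database back up before the backup, something like this (a sketch - I haven’t verified which other services the backup task actually needs beyond postgresql):

```sh
# stop all GitLab services, then restart just the database before the backup
sudo gitlab-ctl stop
sudo gitlab-ctl start postgresql
sudo gitlab-ctl start redis   # assumption: the backup task may also want redis

sudo gitlab-backup create STRATEGY=copy

# restart everything once the backup completes
sudo gitlab-ctl start
```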

That makes some sense - GitLab needs to back up the Postgres data properly, so needs it running to do that (pg_dump needs postgres running, and I assume it’s using that or similar).

I suspect in your case the postgres data files and the other data files are on the same share though, so when GitLab does a tar of the data files, it’s also getting the postgres data files (which are changing, because postgres is running). There’s no solution to this - other than to move Postgres to some other location.

Thanks for your input on this. We solved it in the end by rsyncing the LFS objects across to the new server and then doing a backup excluding LFS.
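
For anyone who finds this later, the working approach was roughly the following (a sketch - the source path, the `newserver` host, and the destination path are placeholders; the destination shown is the default Omnibus LFS location, so adjust to wherever your LFS objects actually live):

```sh
# copy the LFS objects directly to the new server, resumable and verifiable
rsync -avh --progress /mnt/cifs-share/lfs-objects/ \
    newserver:/var/opt/gitlab/gitlab-rails/shared/lfs-objects/

# then create a backup that skips the LFS store, which keeps it small enough to handle
sudo gitlab-backup create SKIP=lfs STRATEGY=copy
```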