Backup script is creating multiple tar files because of timeouts

Dear community,

I run GitLab in a Kubernetes cluster, and every night the backup script creates archives of my data. I noticed some very weird behavior: in the final stage of the backup, after the tarball is created, the upload to storage times out. This triggers a retry, which eventually succeeds after 2-6 attempts.

In the end, all of the uploads go through, and I am left with duplicate files in object storage.
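
You can see the duplicates by listing the bucket; a rough sketch, assuming s3cmd as the client (it matches the .s3cfg mounted into the container, mentioned below) and a placeholder bucket name:

    # List the nightly backup archives; each timed-out-and-retried
    # upload leaves an extra copy of the tarball behind
    s3cmd ls s3://gitlab-backups/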

Has anyone had this issue and been able to find a solution?

I’m assuming your backup script is something that doesn’t come with GitLab by default. Can you show the backup script commands that upload the file to your storage, so we can take a look?

Usually, for example, if wget is being used, you need to pass the -c parameter so that it continues the partially transferred file it failed on instead of starting over. Otherwise there will be filename.tar.1, filename.tar.2 and so on, because each retry saves a fresh copy next to the partial one. But without seeing the backup script, I cannot say whether this is the situation or not.
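
Roughly, the difference looks like this (the URL is just a placeholder):

    # Without -c, each retry keeps the partial file and saves the
    # new attempt under a numbered name, so failures pile up
    wget https://example.com/backup.tar     # -> backup.tar (partial)
    wget https://example.com/backup.tar     # -> backup.tar.1

    # With -c, wget resumes the existing partial file instead
    wget -c https://example.com/backup.tar  # -> continues backup.tar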

The backup script comes with GitLab and is installed automatically by Helm. It simply calls the GitLab backup-utility like this:

    /bin/bash -c 'cp /etc/gitlab/.s3cfg $HOME/.s3cfg && backup-utility'

within a gitlab-task-runner-ce container.

The job is created by a Kubernetes CronJob.
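
If you want to double-check what it runs and when, the CronJob can be inspected directly; the namespace and resource name below are assumptions and may differ per Helm release:

    # Find the backup CronJob created by the chart
    kubectl get cronjobs -n gitlab

    # Show its schedule and the container command it runs
    kubectl describe cronjob gitlab-task-runner-backup -n gitlab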

Yep, I found it: it’s using curl. So you could edit this script:

    curl --retry 6 --progress-bar -o $output_path $1

and replace it with:

    curl --retry 6 --progress-bar -C - -o $output_path $1

The -C - allows you to continue a download/upload which has failed, at least according to the man page. It’s worked for me when testing a download via an HTTP/HTTPS URL. Don’t forget the additional space and minus sign after -C; the lone minus tells curl to figure out by itself where to continue from.
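
A quick way to see it in action (the URL is just a placeholder): start a download, interrupt it with Ctrl-C partway through, then rerun it with -C -:

    # First attempt, interrupted before it finishes
    curl -o archive.tar https://example.com/archive.tar

    # Second attempt resumes from the existing partial file
    # (requires the server to support range requests)
    curl -C - -o archive.tar https://example.com/archive.tar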

That’s all I can think of right now that might help.

Thanks, iwalker.

I did an upgrade from chart version 4.5.3 (GitLab 13.5.3) to 4.6.6 (GitLab 13.6.6) and did two backup runs without hitting this problem.

I will definitely try the curl flag next if the problem comes back.

Thanks

Were those two backup runs done manually or also via cron? If manually, then perhaps something is happening during the night with the Kubernetes cluster or the network that is causing your retry issues. Worth checking out, as it seems strange otherwise.

They were run manually. I will observe what happens under normal conditions and let you know.
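
For reference, a manual run can be kicked off from the existing CronJob, so it uses exactly the same spec as the nightly job; the job, CronJob, and namespace names here are placeholders:

    # Create a one-off Job from the CronJob's template
    kubectl create job backup-manual-test \
        --from=cronjob/gitlab-task-runner-backup -n gitlab

    # Follow its log to watch the upload stage
    kubectl logs -f job/backup-manual-test -n gitlab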

Well, the problem still remains. You are probably right that there is something going on with my servers at night that causes the slow or dropped connections.
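
One way to confirm that would be to time a test upload during the backup window and compare it against a daytime run; a rough sketch, with a placeholder bucket name:

    # Generate a reasonably large test file
    dd if=/dev/urandom of=/tmp/upload-test bs=1M count=100

    # Time the upload at night vs. during the day; a large
    # difference points at the network, not the backup script
    time s3cmd put /tmp/upload-test s3://gitlab-backups/upload-test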