Reduce Binary File Storage Repo Size?

Hi, I have a GitLab Repo where I am storing some binary files. Over time, with multiple commits for each binary file, the size of the repo has grown substantially. In an attempt to reduce the size of the repo by getting rid of all the old versions of the files (I am only interested in the latest version of each) I followed this guide: https://docs.gitlab.com/ee/user/project/repository/reducing_the_repo_size_using_git.html.

While the repo size appeared to be smaller, the indicated storage used did not change. I also thought that performing the actions in the guide would keep the latest version of each binary file; however, it removed them all, including the latest versions.

Am I doing something wrong?

P.S. I know that it is bad practice to store binary files in the first place.

Hi @rph, welcome to the GitLab Community Forum!

The instructions you linked to may not display expected results instantly because of a delay between Repository cleanup and git gc. There’s a relevant issue here: https://gitlab.com/gitlab-org/gitlab/-/issues/220104

Can you please manually trigger housekeeping on the repository? You can find this option under your project’s Settings > General > Advanced. You’ll receive an email once Housekeeping has completed with the updated repository size - hopefully minus the files you removed.
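If the UI button is inconvenient, housekeeping can also be started through the GitLab REST API (`POST /projects/:id/housekeeping`). The snippet below is only a sketch of building that request: the host name, project path and token in the usage comment are placeholders, and `housekeeping_request` is a hypothetical helper name, not part of any GitLab library.

```python
import urllib.parse
import urllib.request


def housekeeping_request(base_url, project, token):
    """Build (but do not send) a POST request that starts housekeeping.

    `project` may be a numeric ID or a "group/project" path; paths must be
    URL-encoded, so "/" becomes "%2F".
    """
    project_id = urllib.parse.quote(str(project), safe="")
    url = f"{base_url.rstrip('/')}/api/v4/projects/{project_id}/housekeeping"
    return urllib.request.Request(
        url, method="POST", headers={"PRIVATE-TOKEN": token}
    )


# Hypothetical usage (placeholder host, project and token):
#   req = housekeeping_request("https://gitlab.example.com", "group/repo", "<token>")
#   with urllib.request.urlopen(req) as resp:
#       print(resp.status)
```

The request is built separately from sending it so the URL can be inspected before anything is triggered on the server.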

Let us know how it goes either way!

Hi, @gitlab-greg I forgot to mention that I did run housekeeping afterwards. Made no difference.

Hey @rph,

Could you show a screenshot?
For example, I’m wondering if you mean the right number here didn’t decrease:

image
The left number is the git repo size (small), the right is the total storage (rather large).

Reducing repo size and gc wouldn’t affect the right number which also includes artifacts in the pipelines among other things.

If you do find this is your issue, please be aware you need to delete the jobs before the pipelines to ensure statistics don’t end up wrong:

@n-hebert Hi, the size of my “Files” and “Storage” is the same. Both nearly 3GB.

@rph Did you make any commits or updates to the project following the Repository Cleanup process?

Can you try exporting the project and verifying how large the export archive is? I suspect that if repository cleanup worked, the export will be significantly smaller than the Files/Storage size you see in the UI.

@n-hebert thanks for helping out, welcome to the GitLab Community forum! In this specific case, the problem is related to repository file storage and not artifact storage. It was a good call to check this, as artifacts can also take up a lot of space. For problems or questions about deleting or cleaning up artifact storage, keep an eye on the issue here: https://gitlab.com/gitlab-org/gitlab/-/issues/224151

Thanks for the welcome, @gitlab-greg. Checking the export is a good idea that should be pursued.

@rph, an additional aside to that which came to mind for me, having done this before, is to confirm that you did delete all the old branches on the repo.
The old branches (& tags) all need to be replaced by new ones (smaller ones) or else the repository is still using all the old files. :slight_smile:

An easier path might be to push to an empty repo once you confirm the filter worked well locally to see it from afar without having to clobber any old work. If you like what you see you can proceed forwards towards the original project’s namespace in various fashions.

@gitlab-greg Yes, I exported the project right after performing all the actions in the guide and its file size was very small (greatly reduced). I have since made new commits on the binary files and the indicated file size on GitLab has continued to increase…

@n-hebert I’m not an expert when it comes to this, so I only followed everything that was in the guide. :laughing: If there is anything else that needs to be done, I am happy to do it if I can find some instructions…

So is it possible to perform the cleanup tasks without removing the latest version of each binary file?

Various clean up tasks will likely work, but probably not ones you’re worried about.

If you’re looking to gain significant storage space back, you need every reference to any large object you want deleted to be completely removed from your git branches’ current commits and full history; otherwise those objects simply remain part of your git repo’s basic size.

When you clone the repo (full repository including all depth), is it the same size on disk? There will be some deviation due to compression and other factors, but you should be in the ballpark of how big it will be on GitLab.
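To put a number on that comparison, something like the following could be pointed at the `.git` directory of a full clone. It is a rough sketch: `dir_size_bytes` and `human` are made-up helper names, the path in the usage comment is a placeholder, and the result will not match GitLab's figure exactly for the reasons mentioned above.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under `path` (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def human(n):
    """Format a byte count using binary units, capped at GiB."""
    size = float(n)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024 or unit == "GiB":
            return f"{size:.1f} {unit}"
        size /= 1024


# Hypothetical usage (placeholder path to a full clone):
#   print(human(dir_size_bytes("my-repo/.git")))
```

Measuring `.git` rather than the whole working directory keeps the checked-out copies of the files out of the total, so only the stored history is counted.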

@n-hebert It took a while but the size is now showing correctly.

But I am still wondering if it is possible to perform the cleanup tasks without removing the latest version of each binary file?

Answering more simply than my last post which discussed this - yes, of course.

You can run clean up tasks at any time; they will simply do nothing if you don’t give them something to do.

As an example, for a fresh repo pushed with no dangling commits (commits not on branches), clean up does literally nothing.
My questions and theories above were trying to point out that running clean up while there are still any references to old versions of your binary objects will in fact do nothing, hence I was asking if they could be hanging around on old branches, tags, etc.
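One way to see what might still be holding on to old objects is to list every ref in the repository and group them by kind. This is a minimal sketch assuming `git` is on the PATH; `group_refs` and `list_refs` are illustrative names, and only the standard `git for-each-ref` command is used. In a mirror clone of a GitLab project you may also see non-branch refs, which is why everything under `refs/` is grouped rather than just heads and tags.

```python
import subprocess
from collections import defaultdict


def group_refs(for_each_ref_output):
    """Group ref names by namespace (heads, tags, remotes, ...).

    Expects one ref name per line, as printed by
    `git for-each-ref --format=%(refname)`.
    """
    groups = defaultdict(list)
    for line in for_each_ref_output.splitlines():
        ref = line.strip()
        if not ref:
            continue
        parts = ref.split("/")
        # "refs/heads/main" -> "heads"; anything unusual goes under "other".
        kind = parts[1] if len(parts) > 2 and parts[0] == "refs" else "other"
        groups[kind].append(ref)
    return dict(groups)


def list_refs(repo_dir):
    """Run git inside `repo_dir` and return its refs grouped by namespace."""
    out = subprocess.run(
        ["git", "for-each-ref", "--format=%(refname)"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout
    return group_refs(out)


# Hypothetical usage (placeholder path):
#   for kind, refs in list_refs("my-repo").items():
#       print(kind, len(refs))
```

Any branch or tag in that list that predates the rewrite is a candidate for still pointing at the old, large history.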

If the size only dropped after deleting the latest version of the file (and you mean to say that the binary objects are not in the repo anymore) then that means that git was doing a better job at keeping the size under control than expected (potentially as files may have been completely or partly identical or compressed very well). In other words, the old versions were not providing much of a storage bump in addition to the new versions.
That wouldn’t be very surprising – git is quite smart. :slight_smile:

@n-hebert Thanks, I think I see what you’re saying. Is there a way to tell the clean-up to leave the latest version of the file regardless? Or is it automatically removing them because they form part of the history of the file and have to be removed since the previous commits are being removed…

That I don’t know.
There’s lots more info on how you can use it here: https://github.com/newren/git-filter-repo

Whatever you ask to be cleaned up will be cleaned up. Sub-specifying “latest version” like that is beyond my knowledge of filter-repo and GitLab’s clean up routine.
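One possible approach, offered as an untested sketch rather than a confirmed recipe: work out which blobs exist somewhere in history but are not part of the current `HEAD` tree, and hand only those IDs to git-filter-repo via its `--strip-blobs-with-ids` option. The helper names below are made up for illustration, and the sketch assumes a single branch whose tip holds the versions you want to keep; files that only exist on other branches or tags would be treated as stale. The optional `min_size` is there so that small text files keep their history.

```python
def blobs_in_history(batch_check_output):
    """Map blob id -> (size, path) from lines like 'blob <id> <size> <path>'."""
    blobs = {}
    for line in batch_check_output.splitlines():
        fields = line.split(" ", 3)
        if len(fields) == 4 and fields[0] == "blob":
            _kind, blob_id, size, path = fields
            blobs[blob_id] = (int(size), path)
    return blobs


def blobs_at_head(ls_tree_output):
    """Collect blob ids from `git ls-tree -r HEAD` lines: '<mode> blob <id>\\t<path>'."""
    ids = set()
    for line in ls_tree_output.splitlines():
        meta, _tab, _path = line.partition("\t")
        fields = meta.split()
        if len(fields) == 3 and fields[1] == "blob":
            ids.add(fields[2])
    return ids


def stale_blob_ids(batch_check_output, ls_tree_output, min_size=0):
    """Blob ids present in history, absent from HEAD, and at least `min_size` bytes."""
    keep = blobs_at_head(ls_tree_output)
    history = blobs_in_history(batch_check_output)
    return sorted(
        blob_id
        for blob_id, (size, _path) in history.items()
        if blob_id not in keep and size >= min_size
    )


# Hypothetical usage, in a fresh clone (the last command rewrites history):
#   git rev-list --objects --all \
#     | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' > all.txt
#   git ls-tree -r HEAD > head.txt
#   ...write the ids from stale_blob_ids(all_text, head_text) to stale.txt, one per line...
#   git filter-repo --strip-blobs-with-ids stale.txt
```

Because this rewrites commits, it would still need a force push and the GitLab cleanup steps afterwards, and it is worth trying against a throwaway clone pushed to an empty project first.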