Filesystem performance issue with lots of sub-directories for attachments

somarandos · October 7, 2022, 1:25pm

Hi, all attachments from issues are put into a @hashed directory, and each attachment gets its own hashed sub-directory. Over time this can create a lot of sub-directories in a single directory: a project with 100K issues averaging 10 images each would end up with 1 million of them. This can be a performance bottleneck. What does GitLab suggest self-hosting users do about this? Thanks

iwalker · October 7, 2022, 2:55pm

Why would it be a performance bottleneck? Are you going to have hundreds of thousands of people attempting to open 100K issues all at the same time, and attempting to open all attachments at the same time? I don’t think so. It’s no different than any server with hundreds of thousands of files that are stored on the disk.

What you should really be looking at is what kind of server and what kind of storage you have: SATA vs SAS vs SSD. The biggest hit on performance will be due to incorrect server configuration. Sure, filesystem choice will also play a part, e.g. EXT4 is better for smaller files than XFS is. But hardware choice will have far more impact. Do some research using Google; there’s tons of info out there on filesystem performance.

Also take a look at the GitLab docs on filesystem performance: File system performance benchmarking | GitLab. You can find a lot more information in the GitLab docs, and they are the first place to start.

somarandos · October 7, 2022, 3:10pm

Thanks for the reply @iwalker. A million directories in a single directory can be slow, right? The web (Stack Overflow and Reddit, basically) suggests distributing them evenly, for example by month or year. I am trying to read more on it, but that's basically what I have learnt so far.
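To make the "distribute evenly" idea concrete, here is a minimal Python sketch that spreads entries over two levels of sub-directories using a hash prefix. This is a generic illustration, not GitLab's code: the `uploads` root, the `attachment-N` names, and the helper functions are all hypothetical.

```python
import hashlib
from collections import Counter
from pathlib import PurePosixPath


def sharded_path(root: str, name: str) -> PurePosixPath:
    """Place `name` under root/<aa>/<bb>/ using the first four hex digits of its SHA-256."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return PurePosixPath(root) / digest[:2] / digest[2:4] / name


def bucket_sizes(names):
    """Count how many names land in each two-level bucket."""
    return Counter(sharded_path("uploads", n).parent for n in names)


if __name__ == "__main__":
    # Hypothetical attachment directory names, for illustration only.
    names = [f"attachment-{i}" for i in range(100_000)]
    sizes = bucket_sizes(names)
    print("buckets used:", len(sizes))
    print("largest bucket:", max(sizes.values()))
```

With two hex digits per level there are at most 256 x 256 = 65,536 buckets, so even a million entries would average roughly 15 per directory instead of sitting in one place.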

iwalker · October 7, 2022, 3:20pm

This is how the docs explain how it’s distributed: File Storage in GitLab | GitLab

However, if you are storing millions of attachments on a Gitlab server, I seriously think you need to decide if that is the correct thing to do or not and whether it was designed for such a thing. I know plenty of different solutions out there or business practices that state not to do such a thing, and that attachments or whatever should be stored in some other way.

Rather than discussing such limits, perhaps you should concentrate on what you are attempting to achieve, or what you want to do with your server?

somarandos · October 7, 2022, 3:35pm

The GitLab core project has 125K issues on it. So the issues alone would create about 625K directories in a single directory, one for every attachment, if 5 images per issue is taken as an assumption. I am just trying to find out what happens when that level is reached. I would not worry if it was on the SaaS. I am evaluating the hardware needs and the project in general. Filesystem performance looks like a worry to me, hence the question.
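For reference, the back-of-the-envelope numbers used in this thread can be reproduced with a few lines of Python. The per-issue attachment counts are the assumptions stated in the posts, not measured values, and `attachment_dirs` is just an illustrative helper.

```python
def attachment_dirs(issues: int, attachments_per_issue: int) -> int:
    """One sub-directory per attachment, as assumed in the posts above."""
    return issues * attachments_per_issue


if __name__ == "__main__":
    # Scenarios taken from the thread; the averages are assumptions.
    scenarios = {
        "100K issues, 10 images each": (100_000, 10),
        "125K issues, 5 images each": (125_000, 5),
    }
    for label, (issues, per_issue) in scenarios.items():
        print(f"{label}: {attachment_dirs(issues, per_issue):,} directories")
```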

gitlab-greg · October 7, 2022, 4:20pm

Hi @somarandos

Usually bottlenecks are caused by excessive read/write operations where input/output is the limiting factor. Attachments usually take up minimal storage space, so it’s actually much easier and more common for someone to hit max I/O by downloading/pushing a lot of Git Repo, LFS, or (Package|Container) Registry data in parallel. Alternatively, if storage space is filled up (90%+ full), you might see some performance issues.
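As a rough way to check the "90%+ full" condition mentioned above, a small Python sketch using the standard library's `shutil.disk_usage` might look like this. The path to check and the helper names are assumptions; point it at whatever mount actually holds your GitLab data.

```python
import shutil


def used_fraction(path: str) -> float:
    """Return the fraction of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def is_nearly_full(path: str, threshold: float = 0.90) -> bool:
    """True when usage is at or above `threshold` (0.90 mirrors the 90% figure above)."""
    return used_fraction(path) >= threshold


if __name__ == "__main__":
    # "/" is only a placeholder; substitute the mount point of your storage.
    path = "/"
    print(f"{path}: {used_fraction(path):.1%} used, nearly full: {is_nearly_full(path)}")
```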

I’ve not heard of any situations where a high number of image uploads or a large number of sub-directories in the @hashed directory has caused a performance bottleneck in itself. GitLab uses a database to keep track of where issue attachments are stored, so it’s not like it’s crawling or looking through every directory in @hashed every time an upload or image is requested.
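The difference between looking a path up and crawling directories can be sketched in Python as follows. This is only an illustration of the idea, not how GitLab is implemented: the dictionary stands in for the database, and the directory and file names are made up.

```python
import os
import tempfile


def build_store(root: str, count: int) -> dict:
    """Create `count` sub-directories with one file each; return an id -> path index."""
    index = {}
    for i in range(count):
        directory = os.path.join(root, f"dir-{i:06d}")
        os.mkdir(directory)
        path = os.path.join(directory, "image.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(f"attachment {i}")
        index[i] = path
    return index


def read_by_index(index: dict, attachment_id: int) -> str:
    """Open the file straight from its recorded path; the parent is never enumerated."""
    with open(index[attachment_id], encoding="utf-8") as handle:
        return handle.read()


def read_by_scan(root: str, attachment_id: int) -> str:
    """Contrast case: list the parent directory until the wanted entry turns up."""
    wanted = f"dir-{attachment_id:06d}"
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.name == wanted:
                with open(os.path.join(entry.path, "image.txt"), encoding="utf-8") as handle:
                    return handle.read()
    raise FileNotFoundError(wanted)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        index = build_store(root, 1000)
        print(read_by_index(index, 999))
        print(read_by_scan(root, 999))
```

Both functions return the same content, but only `read_by_scan` does work proportional to the number of sub-directories. With a recorded path, the application asks the filesystem for one specific name.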

I’ve worked with GitLab admins who have (tens of) thousands of users and I’ve never seen subdirectories for attachments result in a performance bottleneck. If you find it does cause a bottleneck, please create an issue.

iwalker · October 7, 2022, 4:29pm

What @gitlab-greg writes is excellent, and some of it goes back to what I mentioned about disks and storage type. Choosing SATA storage would be a big mistake, due to the disks being 7200rpm or slower. SAS would give you 10K or 15K, which means your read/write is going to be better. Enterprise storage is going to be better than using cheaper disks that were meant for desktop computers. SSD is an option too, depending on what type you choose as well as its read/write capacity and speed.

Obviously you need to spec a server that is going to be capable of dealing with the situation that you have. That’s totally independent of GitLab though, or for that matter any application that would run on said server. I guess you’ll need to google such scenarios to find out how other people have dealt with it. It’s impossible to say what would happen in such a situation unless someone has that kind of experience. Anybody can theorise about what will happen, but a lot of it will depend on how good your server is.

somarandos · October 7, 2022, 4:34pm

@gitlab-greg @iwalker, thank you both for the help. This is great.