GitLab HA is thoroughly miserable

I'm not sure where to start with this, but I hope it starts a real dialogue. GitLab has so many moving parts it's practically a Rube Goldberg machine of Machiavellian preconditions and byzantine server requirements. A few of the biggest concerns lie in the services themselves…

  • Gitaly: the NFS backend was a strong idea until the time-honored waltz of “NFS performance issues” cropped up. Now we have multiple replica Gitaly servers with an arcane sync method and numerous points of failure. Object storage could perhaps be the future here, but moving off NFS was arguably a bad idea.

  • Postgres: PgBouncer is not HA; it is the bolt-on arithmetic required to handle hot-standby servers, a caching layer at best. Whoever thought this up was clearly a programmer, not a system administrator. Upselling PgBouncer in the primo tiers is bad comedy. You should be using Postgres-XL.

But perhaps my biggest complaint is that GitLab doesn't seem to understand the fundamentals of HA in the 21st century. HA isn't just a feature upsell; it is the product. HA is what Git itself promises at every level, and competent HA should scale from zero users. What do I mean by this?

In 1966 Margaret Hamilton led a team of 350 people to design the software behind the Apollo 11 mission. Had she had a GitLab instance, not a single sysadmin would have dared to stand before her and so much as suggest she did not deserve real HA for her project. Power outages, OS updates, server outages, and network outages are all part of this user's daily life.

GitLab doesn't begin to offer REAL HA until the 3,000-user reference architecture, and it requires TWENTY-FIVE SERVERS to do it. Why does GitLab bother to break down reference architectures by number of users instead of by implementation (like Postfix, Dovecot, or BIND)? Simple: because no sane person could justify filling up to an entire rack of servers for a single application in 2020.

The alternative strategy is to split your workload across different GitLab servers (prod, test, stage, dev, etc…) and spread the availability around to reduce downtime for Margaret.

The reference architectures should start by answering the fundamental question: how does every admin, regardless of head count, achieve HA? Can we consolidate some of these services onto the same deployment on each node? Can we have GitLab HA at zero users on day one?


Hi @nimbius, and welcome to the GitLab community.

I'm not sure how many users you have in total, but user count does play a part in HA, since it dictates how many servers you need to deal with the load (load balancing, etc.). Sending 100 users to one server, for example, is going to be far quicker than sending 1,000 users to one server. Anyway, that aside, let me continue.

Let's assume you just want HA to safeguard your data, that you don't have a huge number of users, and that sending them all to a single server isn't going to be a performance issue. What I write below is theory, but it should generally work; it's more or less comparable to what I did utilising a LAMP setup. We would need 7 servers for this. Technically it would be possible with 5, provided the 3 remaining servers have a high enough spec to deal with the combined workload.

For the first part, we need 2 servers to act as load balancers. Why will become apparent a little later on.

The second part would be to run GitLab on 2 servers in active/standby (in case of server failure). These 2 servers would mount /etc/gitlab, /opt/gitlab and /var/opt/gitlab via the GlusterFS client from the 3 additional servers. So 7 in total.

Alternatively: 2 servers load-balancing connections, and 3 servers running both GlusterFS and GitLab, with the GlusterFS volumes for /etc/gitlab, /opt/gitlab and /var/opt/gitlab mounted locally. This would require beefier hardware, though, to deal with the I/O of running GitLab AND replicating data between the 3 servers.

For GlusterFS you need 3 servers for quorum, so it cannot be done with 2. Technically it can, but when one node fails you will have a problem: the surviving node carries on fine until you attempt to restart it, because the GlusterFS services will not come up unless at least a second node is available. I have tried it; it does not work happily like this, so it's not worth attempting.

But let's assume the 7-server setup.

First Layer (LB) - 2 x LB, active/standby. The VIP address will float between the two depending on what happens with the servers, e.g. service failure, inaccessibility for some reason, reboots. Therefore connections will always get routed to the second layer.
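For the VIP itself, something like keepalived is the usual tool. A minimal sketch, assuming interface eth0 and an example VIP of 192.0.2.10 (all names and addresses here are placeholders):

```
# /etc/keepalived/keepalived.conf on lb01 (the MASTER).
# lb02 gets the same block with "state BACKUP" and a lower
# priority, e.g. 90, so the VIP fails over automatically.
vrrp_instance VI_GITLAB {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass changeme
    }
    virtual_ipaddress {
        192.0.2.10/24    # the floating VIP users connect to
    }
}
```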

Second Layer (GITLAB-LB) - 2 x servers with GitLab installed, with the GlusterFS client mounting /etc/gitlab, /opt/gitlab and /var/opt/gitlab. Obviously with CPU/RAM specs that match your GitLab user count. These also work in active/standby mode, so the first layer (LB) will route to whichever server is active. This layer also has a VIP address, like the first layer.
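On the first-layer LBs, the routing itself could be plain HAProxy in TCP mode; the `backup` keyword gives the active/standby behaviour. A sketch with made-up hostnames and IPs:

```
# /etc/haproxy/haproxy.cfg (excerpt). TCP mode passes TLS and SSH
# straight through. Move the LB's own sshd off port 22 first.
defaults
    mode    tcp
    timeout connect 5s
    timeout client  1h
    timeout server  1h

frontend gitlab_https
    bind *:443
    default_backend be_gitlab_https

frontend gitlab_ssh
    bind *:22
    default_backend be_gitlab_ssh

backend be_gitlab_https
    server gitlab01 10.0.0.11:443 check
    server gitlab02 10.0.0.12:443 check backup  # idle until gitlab01 fails

backend be_gitlab_ssh
    server gitlab01 10.0.0.11:22 check
    server gitlab02 10.0.0.12:22 check backup
```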

Third Layer (GlusterFS) - 3 x servers replicating data. Obviously a 1Gbps network connection isn't going to cut it, so at least 10GbE will be needed here, or perhaps fibre. You will have to replicate the GitLab repository data as well as the PostgreSQL data, because both live under /var/opt/gitlab; /etc/gitlab and /opt/gitlab don't see that many changes.

Here we also run LB services with a VIP address, because we need it to mount the GlusterFS partitions on the second layer where GitLab will be running. While you could do this with fstab entries pointing at gluster01, gluster02 or gluster03 (assuming those are their names), you would have a problem mounting later if fstab points at gluster01 and it's unavailable. You can edit fstab of course, but the LB/VIP solution simplifies things, since the VIP address floats between the three machines. That way you can put, e.g., gluster-vip in fstab, and the mount works irrespective of which server currently holds the VIP. Maybe there is a better way, but this is theory anyway; I have tested it this way with my LAMP setup.
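A rough sketch of the Gluster side, assuming hypothetical node names gluster01..03, a brick path of /bricks/gitlab, and one volume per mount point (only /var/opt/gitlab shown; the other two follow the same pattern):

```
# Run once, on gluster01, after all three nodes are installed:
gluster peer probe gluster02
gluster peer probe gluster03
gluster volume create gitlab_var replica 3 \
    gluster01:/bricks/gitlab/var \
    gluster02:/bricks/gitlab/var \
    gluster03:/bricks/gitlab/var
# Enforce the quorum behaviour described above:
gluster volume set gitlab_var cluster.server-quorum-type server
gluster volume set gitlab_var cluster.quorum-type auto
gluster volume start gitlab_var

# /etc/fstab on the second-layer GitLab servers, mounting via the
# floating name (the native client only uses this name to fetch the
# volume layout, then talks to all three bricks directly):
# gluster-vip:/gitlab_var  /var/opt/gitlab  glusterfs  defaults,_netdev  0 0
```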

In theory it's possible, and provided the hardware is up to scratch it should be fine performance-wise. So, communication-wise:

Users --> First Layer (LB) --> Second Layer (GITLAB-LB) --> Third Layer (GlusterFS Data replication)

First Layer - HTTPS and SSH redirected to the Second Layer
Second Layer - GitLab services over HTTPS/SSH, just like a single-server setup
Third Layer - your data is replicated, so that either of the two servers in the second layer can mount the partitions and continue working in the event of service/hardware failure.

Redirecting HTTPS and SSH, for example, is generally enough for browsing the GitLab web UI as well as pushing commits via HTTPS or SSH.

But again, this is all theory; it depends on what hardware is available, and it would seriously need testing to see what performance would be like. For PostgreSQL in particular, certain Gluster settings need to be taken into account to ensure we don't get bottlenecks.
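I haven't benchmarked these against GitLab specifically, but volume options along these lines (shown on the hypothetical gitlab_var volume from earlier) are the usual starting points to experiment with for database-style workloads:

```
# Caching/prefetch translators that tend to confuse databases;
# benchmark before and after rather than taking these as gospel.
gluster volume set gitlab_var performance.write-behind off
gluster volume set gitlab_var performance.stat-prefetch off
gluster volume set gitlab_var performance.quick-read off
gluster volume set gitlab_var performance.read-ahead off
```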

So if you are not worried about the number of users and just want the HA equivalent of a single GitLab server servicing them, this could potentially do it.