HA Options for GitLab Runners

Using GitLab (Ultimate licences) at work, connected to on-premises runners. As part of our risk and compliance requirements, the runners have been identified as single points of failure.

We use Podman as the engine for executing our containers. Recently we had runners offline due to a bug that filled the tmpfs filesystem of the Podman user (non-root execution); Red Hat has since resolved the bug. That incident prompted us to look for ways to ensure scheduled jobs do not fail when a single runner is offline.

If we were to implement HA, at a high level I believe we would:

  • Deploy an additional Linux VM
  • Ensure that Podman is installed
  • Register this new VM as a runner with GitLab (a sketch of the registration command is included after the note below)

Note: we do not have a dedicated container platform in place (e.g. Kubernetes or Docker Swarm) at this time, just traditional VM infrastructure.
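For illustration, the registration step might look something like the following (URL, token, image, and tag are placeholders for our own values, and this uses the registration-token flow; newer GitLab versions instead use a runner authentication token created in the UI):

```shell
# On the new VM, after installing gitlab-runner and Podman (rootless).
# Registering with the same tag as the existing runner means any job
# requesting that tag can be picked up by either machine.
sudo gitlab-runner register \
  --non-interactive \
  --url "https://gitlab.example.com" \
  --registration-token "<project-or-group-registration-token>" \
  --executor "docker" \
  --docker-image "registry.access.redhat.com/ubi9/ubi" \
  --description "on-prem-runner-2" \
  --tag-list "onprem"
```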

Questions:

  1. Can GitLab runners operate in an HA model? At a high level I believe they can through load balancing, though many posts suggest it doesn't work that well, e.g. the second runner doesn't do anything until the primary is 100% utilised. Possibly this depends on the method used: load balancing vs round robin?

  2. Are there any better ways? I don't believe we have a huge number of pipelines or jobs, so any sort of autoscaling seems unnecessary; we simply want to add some redundancy to what we currently do.

Do you need HA across the fleet of runners, so GitLab can (almost) always execute jobs, or do you need HA for each individual runner? You seem to be describing the first option, which is also the one I know something about.

If you just set up multiple runners, GitLab will spread the jobs among them (strictly speaking, the runners control the distribution of jobs: each runner polls GitLab for pending work, so for one runner to take all the jobs you would need jobs started at very specific times and the runners configured inappropriately). If one runner goes down in a scenario like that, you will lose the jobs that were executing on it when it crashed, and you will probably have to restart those jobs manually; if you can't live with that, you are probably in the second category mentioned above.
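For reference, the knobs that control how much work a given runner takes live in that host's config.toml. Something like the following on each VM (illustrative values only, and the rootless Podman socket path is an assumption to adjust for your service user):

```toml
# /etc/gitlab-runner/config.toml on each runner VM (illustrative values).
# `concurrent` caps how many jobs this host runs at once; `check_interval`
# controls how often it polls GitLab for pending jobs. A host that polls far
# more often and allows much more concurrency will tend to pick up most of
# the work, which is the "inappropriate configuration" case mentioned above.
concurrent = 2
check_interval = 3

[[runners]]
  name = "on-prem-runner-1"
  url = "https://gitlab.example.com"
  token = "<runner-token>"
  executor = "docker"
  limit = 2   # cap for this runner entry; 0 means unlimited
  [runners.docker]
    image = "registry.access.redhat.com/ubi9/ubi"
    # Point the Docker executor at the rootless Podman socket.
    host = "unix:///run/user/1001/podman/podman.sock"
```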

Without diving into technical speak or the correct terminology, we want to:

  • Load balance pipelines / jobs between runners
  • Ensure pipelines / jobs can continue to run in the event of a runner failure

As you said, I am happy to lose a job that was running; I just want to ensure subsequent jobs don't also fail.
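That said, if restarting a crashed job by hand ever becomes a pain, my understanding is the `retry` keyword in .gitlab-ci.yml can re-queue a job automatically when the failure came from the runner rather than the job itself (sketch only; the job name, script, and tag are placeholders):

```yaml
# .gitlab-ci.yml sketch. Both runners register with the same tag, so a
# pending job runs on whichever one is online; `retry` re-queues a job
# automatically if the runner it was on dies mid-run.
nightly-task:
  tags:
    - onprem
  script:
    - ./run-nightly-task.sh
  retry:
    max: 2
    when: runner_system_failure
```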