I want to autoscale my gitlab runners (kubernetes and helm)

EKS + Gitlab + Kubernetes + Autoscale

Here is my scenario. I have gitlab ce (13.12.12) running on an AWS EC2 node.
I have EKS (1.19) set up with a managed node group for the GitLab runners. These hosts are labeled pool:runner.

I launch my GitLab runners using Helm. (This creates runner managers, which in turn spawn the pods that actually run the builds.) They are configured to run only on hosts with the label “pool:runner”. Initially it seems fine. I have two runner managers, each configured to run 3 concurrent jobs, so there is a max of 6 jobs altogether. I have to cap this because some of the jobs are big and take a long time to run.
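For reference, this is roughly what my Helm values look like. This is a sketch: the `concurrent` and `nodeSelector` values come from the gitlab-runner chart, and the `node_selector` table in the embedded runner TOML pins the build pods to the same pool; exact field names can vary by chart version, so verify against your chart's values file.

```yaml
# values.yaml sketch for the gitlab-runner Helm chart (field names assumed; check your chart version)
concurrent: 3          # max concurrent jobs per runner manager
nodeSelector:          # schedule the manager pod itself onto the runner pool
  pool: runner
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        # pin the spawned build pods to the runner pool as well
        [runners.kubernetes.node_selector]
          pool = "runner"
```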

Here is the issue: my node group is configured to use spot instances, and it also scales down at night. So when it scales up in the morning, or recovers from a lost instance, both of my runner managers can end up on a single host. Developers push, jobs get spawned, and all the build pods run on one host while the other sits there idle.

So here are the problems I need to solve.

  1. There should be one runner manager per host (maybe I can use a pod anti-affinity rule?)
  2. When a scale-up event happens, a new runner manager should be spawned on the new host.
  3. I can then set my managed node group to autoscale based on CPU. So if a third host comes online, I now have three runner managers; if a 4th host comes online, I have four runner managers, and so on.
  4. When the node group scales down, the number of runner managers scales down as well.
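For problem 1, a hard pod anti-affinity rule would keep two manager pods from landing on the same node. Below is a sketch using the chart's `affinity` value; the `app: gitlab-runner` label is an assumption and must match whatever labels your Helm release actually puts on the manager pods (check with `kubectl get pods --show-labels`).

```yaml
# values.yaml fragment: hard anti-affinity so no two runner-manager pods share a node
# (label selector is assumed; adjust to the labels your release sets)
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: gitlab-runner
        topologyKey: kubernetes.io/hostname
```

Note this only prevents co-location; it does not by itself create a new manager per new node (problems 2–4). Something that tracks node count, such as running the manager as one-pod-per-node, would be needed for that part.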

Is this even possible? And please don’t just post a link to the GitLab or Kubernetes documentation. I have read that until I am blue in the face and it’s no help.

I used to use the docker-machine solution. This scaled nicely, but it had its own quirks and did not fit in with our plan to move to kubernetes.

We are running into a similar issue. I used Helm to install the runner. When a build job starts, a new runner pod is created, but it stays Pending and doesn’t trigger a scale-up. It keeps complaining of insufficient memory, and the cluster stays at 3 nodes. The job eventually times out.
However, when additional applications are deployed on the cluster, a 4th node gets added, so it seems like the node pool is autoscaling normally.
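One thing worth checking: the cluster autoscaler only scales up if the Pending pod would actually fit on a new node of your instance type, so a build pod whose memory request exceeds a single node's allocatable memory will sit Pending forever. A sketch of setting explicit build-pod requests in the runner config (the `memory_request`/`memory_limit` keys are from the Kubernetes executor config; the "2Gi"/"4Gi" values are placeholders to size against your instance type):

```yaml
# values.yaml fragment: give build pods explicit, schedulable memory requests
runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        memory_request = "2Gi"   # placeholder; keep below one node's allocatable memory
        memory_limit = "4Gi"     # placeholder upper bound per build pod
```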