I’m having a problem where the GitLab Kubernetes runner autoscales and leaves abandoned pods in the cluster.
Under even moderate load (a couple hundred CI jobs), the runner autoscales to two (or more) replicas, and once it scales back down, a number of pods are left abandoned and never get cleaned up. It’s reliable enough that every time the runner scales up, I know things have broken and jobs will end up abandoned.
I used to give the runner 2 CPUs and 2 GB of RAM; now I’m up to 3 CPUs and 3 GB of RAM, but I’m having the same problem. This is running on GKE Autopilot in GCP.
I am using the latest standard GitLab Runner Helm chart. I have tried this both with the hpa parameter left undefined and with min/max replicas set to 1. Neither seems to make a difference.
Is there a way around this problem? I would be happy with either of two solutions, so long as I don’t end up with abandoned pods:
1. Pin the runner so that it will never autoscale.
2. Permit autoscaling, but ensure that no runner scales down while it still has active jobs, with a minimum of 1 runner pod.
Is there a way to get to either of these two solutions?
To achieve #1: if you do not set hpa in your values, there is no autoscaling. Set replicas: to your desired number of controller pods.
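For example, a minimal values.yaml sketch (assuming the standard chart’s replicas and hpa values; adjust to your release):
```
# values.yaml (sketch): pin the runner manager to a single replica and
# leave hpa unset so the chart creates no HorizontalPodAutoscaler.
replicas: 1

# hpa: {}   # intentionally left undefined
```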
#2 is a bit harder, because the HPA is controlled by Kubernetes, which does not know whether the controller still has any job pods active. I don’t know if GitLab Runner has anything in place to mitigate this; you have a better chance of getting a qualified response if you ask in the GitLab Runner issue tracker: Issues · GitLab.org / gitlab-runner · GitLab
#1 would have been my understanding as well. replicas is set to 1 for me, and I have never set the hpa parameter, yet I am seeing this autoscaling behavior even under a slight load. I don’t know why it is happening or how to stop it.
When I look at the number of CPUs allocated to the gitlab-runner pod over time, I see that under a slight load, the pod’s CPU allocation doubles from the baseline I requested (e.g., from 3 to 6). When it scales back down, worker pods (either previously allocated, or ones allocated since the scale-up) end up abandoned. I don’t know how to stop GKE Autopilot (or wherever this is coming from) from scaling up.
Horizontal Pod Autoscaling is about scaling out by increasing the number of Pods. What you are describing now is vertical scaling, which is increasing CPU/memory.
I am lost now: what are we trying to solve, horizontal or vertical?
A second pod is scaled up, so what is happening is more in line with HPA (though I don’t know if that is what is actually happening, because I believed I had disabled HPA). It’s definitely not vertical.
The way I know pods are being scaled is that the pod count increases by one and the total workload CPU/RAM is doubled.
Here is a plot of the total workload (the sum of all resources for the gitlab-runner deployment) in a dashboard log view of GKE Autopilot.
The base CPU/RAM allocation in this view is 3 CPU / 3 GB. Notice that even under a very slight load, something (HPA, maybe?) caused a second pod to be allocated. This happened twice between 1 and 2 PM and again just before 4 PM. Each time, when one of the two pods (I’m not sure which) is scaled back down, we end up with abandoned worker pods. I want to stop this behavior: there are plenty of resources still available on the allocated pod, so I need it to stop scaling the pods (or, if it must scale them, ensure that no runner still managing pods is scaled down).
The easiest way to check whether there is any HPA is to run kubectl get hpa in the GitLab Runner namespace. While you’re in there, I would also look at the events with kubectl get events, or check your monitoring tool, to understand what’s really happening.
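For example (the namespace name is just a placeholder):
```
# Check whether any HorizontalPodAutoscaler exists for the runner
kubectl get hpa -n gitlab-runner

# Recent events (evictions, preemptions, failed scheduling, ...)
kubectl get events -n gitlab-runner --sort-by=.lastTimestamp
```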
As I suspected, there is no HPA defined. The recent events don’t show anything either, since this happened some days back.
I have a hunch as to what is going on, though. I think it’s a pod priority preemption issue: when many worker pods are queued, they compete with the runner manager pod and sometimes preempt it. I found pod preemption logs at exactly the times the pods are being cycled in the graph above.
I checked the pod priority status of the runner manager pod, and it was set to 0.
I think this is the issue. I tried updating the deployment by setting priorityClassName="system-node-critical" in the chart’s values.yaml (at the line linked earlier), but unfortunately it did not seem to make a difference. The runner manager’s pod priority is still 0.
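For reference, this is roughly what I put in values.yaml (a sketch; you could equally point it at a user-defined PriorityClass instead of the built-in one):
```
# values.yaml (sketch): run the runner manager pod with an elevated
# scheduling priority so job pods cannot preempt it.
priorityClassName: system-node-critical
```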
I found out that the specific version of Helm I was using (3.9.4) has some incompatibility with that priorityClassName parameter: the rendered YAML template didn’t contain it at all. I have no idea why this happens; the lines dealing with priorityClassName in the deployment template of the GitLab Runner Helm chart look fine to me.
I patched it manually and was able to get a runner manager pod with the appropriate priority. Just for fun, I tried Helm on a different system (GCP Cloud Shell), which uses a slightly older version (3.9.3) that doesn’t have this issue and renders the priorityClassName parameter just fine. Very peculiar.
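For the record, the manual patch was along these lines (a sketch; the deployment name and namespace depend on your Helm release):
```
kubectl -n gitlab-runner patch deployment gitlab-runner --type merge \
  -p '{"spec":{"template":{"spec":{"priorityClassName":"system-node-critical"}}}}'
```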
I’ll update this thread with details on whether this solves the scaling (or, as I now suspect, eviction) issue in the coming days.
Just to close the loop on this thread, I have confirmed that the issue is indeed pod eviction. I tried all sorts of mechanisms to increase priority or prevent eviction, but nothing worked. I set the priority of the runner job pods to 0 with preemptionPolicy=Never. I tried putting the runner manager into a different namespace. Nothing worked.
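For reference, the non-preempting priority class I pointed the job pods at looked roughly like this (a sketch; the class name is hypothetical):
```
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ci-jobs-no-preempt    # hypothetical name
value: 0                      # same priority as the default
preemptionPolicy: Never       # job pods queue instead of preempting others
globalDefault: false
description: "CI job pods must never preempt other pods"
```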
It’s quite unfortunate, as being able to protect critical pods from eviction seems fundamentally necessary.
The one thing I did not try is fiddling with pod disruption budgets. It is possible that by setting something like maxUnavailable to 0 on the runner manager pod, I could prevent the eviction. However, I had given up on toying with this and went with a different approach.
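For anyone who wants to try it, the idea would look something like this (untested sketch; the label selector must match whatever labels your manager pod actually carries, and note that a PDB only guards against voluntary disruptions, so it may not block every eviction):
```
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: gitlab-runner-manager-pdb    # hypothetical name
spec:
  maxUnavailable: 0                  # never allow the manager pod to be disrupted
  selector:
    matchLabels:
      app: gitlab-runner             # assumed label on the manager pod
```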
My solution instead was to set up a standalone VM in the GCP project outside of Kubernetes, configure kubectl to access the Kubernetes cluster through the API, and run the GitLab runner on that VM to spool job pods up and down. It appears to be much more stable now. I may need to tweak some resource/limit settings on the VM to make sure I don’t hit max-open-file-handle limits, fill network buffers, and the like.
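The VM setup itself is just a stock gitlab-runner install using the Kubernetes executor against the cluster’s API; roughly something like this (a sketch, with placeholder cluster name, token, and namespace):
```
# On the VM: point kubectl/kubeconfig at the cluster...
gcloud container clusters get-credentials CLUSTER_NAME --region REGION

# ...then register a runner that launches its job pods in the cluster
gitlab-runner register \
  --non-interactive \
  --url "https://gitlab.com/" \
  --registration-token "REDACTED" \
  --executor "kubernetes" \
  --kubernetes-namespace "gitlab-ci"
```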
Last follow-up here. The issue was that GKE Autopilot deploys more kube-dns pods as the number of workload pods in the cluster grows. kube-dns has a higher priority than anything a user can define, so the GitLab runner manager was consistently being evicted when the number of jobs reached several hundred.
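You can see the mismatch directly (a sketch; the k8s-app=kube-dns label and the gitlab-runner namespace are assumptions based on a default GKE/chart setup):
```
# The built-in system classes sit around 2 billion, far above the
# 1,000,000,000 ceiling for user-defined PriorityClasses.
kubectl get priorityclasses

# Compare the effective priority of kube-dns vs. the runner manager pod
kubectl -n kube-system get pods -l k8s-app=kube-dns \
  -o custom-columns=NAME:.metadata.name,PRIORITY:.spec.priority
kubectl -n gitlab-runner get pods \
  -o custom-columns=NAME:.metadata.name,PRIORITY:.spec.priority
```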
The correct solution was to set a node taint/toleration for the runner manager, defining workload separation and isolating it from everything else, including kube-dns. This solved the runner manager preemption problem.
This can be accomplished with the following settings in the values.yaml file (around line 650ish in the template):
```
nodeSelector:
  group: runner-manager

tolerations:
  - key: group
    operator: Equal
    value: runner-manager
    effect: NoSchedule
```