Configuring autoscaling for on-demand runners

Problem to solve

What are the best practices for configuring the ASG scale-in rules for on-demand runners? Our runners scale out fine when they hit the CPU threshold. The problem is that sometimes one of those instances picks up a job near the end of its timeout threshold and then terminates the job. We’re using CPU as the scale-in metric and have it set to something ridiculously low, like 2% CPU utilization, and it still terminates the job. Is there a better way to do this so that we don’t get terminated jobs, or do we need to go to static instances?

  • What are you seeing, and how does that differ from what you expect to see?
    Jobs are being terminated when they run on instances that were created via scale-out and have since timed out. I’d expect the instance to persist until the job has stopped running.

Which troubleshooting steps have you already taken? Can you link to any docs or other resources so we know where you have been?

Other than extending the instance timeout and lowering the CPU scale in threshold, nothing much.

Add the infrastructure-as-code or cloud-native configuration relevant to the question.
Not really relevant, as this is more about configuring the ASG. But here’s what we’re using for the scale-in/scale-out parameters:

5ASGAutoScalingMetricTypeToMonitor                       CPU
5ASGAutoScalingSetScaleInUtilizationThreshold            2
5ASGAutoScalingSetScaleInUtilizationThresholdSeconds     600
5ASGAutoScalingSetScaleOutUtilizationThreshold           50
5ASGAutoScalingSetScaleOutUtilizationThresholdSeconds    300
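
To make those numbers concrete, here is a minimal Python sketch of how I understand the thresholds to be evaluated. It is only an illustration: the names `ScalingParams` and `decide` are made up, and it assumes the metric is a CPU percentage sampled at a fixed period and that a rule fires only when every sample in its window is at or past the threshold. The real alarm logic may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScalingParams:
    # Defaults mirror the values listed above; field names are hypothetical.
    scale_in_threshold: float = 2.0    # ...SetScaleInUtilizationThreshold
    scale_in_seconds: int = 600        # ...SetScaleInUtilizationThresholdSeconds
    scale_out_threshold: float = 50.0  # ...SetScaleOutUtilizationThreshold
    scale_out_seconds: int = 300       # ...SetScaleOutUtilizationThresholdSeconds


def decide(samples: List[float], period_seconds: int, params: ScalingParams) -> str:
    """Return 'scale_out', 'scale_in', or 'hold'.

    samples: CPU utilization in percent, oldest first, one per period.
    Assumption: a rule fires only if ALL samples in its window breach.
    """

    def sustained(window_seconds: int, breached) -> bool:
        needed = window_seconds // period_seconds
        if needed == 0 or len(samples) < needed:
            return False  # not enough history to cover the window
        return all(breached(s) for s in samples[-needed:])

    if sustained(params.scale_out_seconds, lambda s: s >= params.scale_out_threshold):
        return "scale_out"
    if sustained(params.scale_in_seconds, lambda s: s <= params.scale_in_threshold):
        return "scale_in"
    return "hold"


if __name__ == "__main__":
    params = ScalingParams()
    # Ten one-minute samples at 1.5% CPU cover the 600-second scale-in window.
    print(decide([1.5] * 10, 60, params))
    # Five one-minute samples at 60% CPU cover the 300-second scale-out window.
    print(decide([60.0] * 5, 60, params))
```

The point of the sketch is that the decision only ever sees CPU numbers. Nothing in it knows whether a job is running, so a job that sits near idle CPU for the whole window looks the same as an empty instance.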

Versions

  • Self-managed
  • GitLab.com SaaS
  • Self-hosted Runners

Versions
gitlab-runner 17.0.0