GitLab 13.12, self-hosted on a GCP k8s cluster. The main GitLab application and the runners have their own namespace in a node pool shared with other applications. The jobs run in a separate pool with auto-scaling. The jobs use our own custom image, which is quite large (747.98 MiB) because it contains everything the application needs; it is hosted in the on-site GitLab registry.
The error happens randomly but is related to the number of pipelines running.
I am looking for first steps to identify the bottleneck or pinch point, so we can start working on a resolution.
So far, increasing the number of runner pods does not seem to make a difference, and adding logging on some runners has not shown a single cause.
Placing the helper image locally in the GitLab repo does not seem to have had any effect.
Is there any solution for this?
Bump! Same issue here >:|
On investigation, the issue seems to be K8s node IO and a problem in PLEG.
Our system is GKE based, with scaling node pools.
In GKE, disk IO is a function of disk type and size (see GKE DISK IO).
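To illustrate how disk type and size drive IO, here is a rough sketch, not an official calculator. The per-GB IOPS figures are assumptions taken from the GCP persistent disk documentation as I remember it, and the sketch ignores the per-instance caps that depend on machine type and vCPU count, so check the current docs before sizing anything from it. The function name `estimate_iops` is made up for this example.

```python
# Assumed baseline IOPS per GB for zonal persistent disks.
# These figures are NOT from this thread; verify them against current GCP docs.
ASSUMED_IOPS_PER_GB = {
    "pd-standard": {"read": 0.75, "write": 1.5},
    "pd-balanced": {"read": 6.0, "write": 6.0},
    "pd-ssd": {"read": 30.0, "write": 30.0},
}


def estimate_iops(disk_type, size_gb):
    """Estimate read/write IOPS from disk type and size alone.

    Ignores per-instance limits, so real numbers can be lower.
    """
    per_gb = ASSUMED_IOPS_PER_GB[disk_type]
    return {op: rate * size_gb for op, rate in per_gb.items()}


if __name__ == "__main__":
    for disk_type in ASSUMED_IOPS_PER_GB:
        print(disk_type, estimate_iops(disk_type, 100))
```

Under these assumptions a 100 GB standard disk is far slower than an SSD of the same size, which is why both the disk type and the size of the node boot disk matter when many pods pull a large image at once.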
See this article, which is similar to what we were experiencing.
We monitored the PLEG and found it reaching a limit, which caused the node to go unhealthy.
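For anyone who wants to watch PLEG themselves, here is a minimal sketch of one way to do it, not necessarily what we ran. It assumes you have the kubelet metrics as Prometheus text (for example from `kubectl get --raw /api/v1/nodes/NODE_NAME/proxy/metrics`) and that the kubelet exposes the `kubelet_pleg_relist_duration_seconds` histogram. The helper name and the sample values are made up for illustration.

```python
import re

# Matches the _sum and _count series of the PLEG relist histogram,
# with or without a label set.
_PLEG_LINE = re.compile(
    r"^kubelet_pleg_relist_duration_seconds_(sum|count)(?:\{[^}]*\})?\s+(\S+)"
)


def mean_relist_seconds(metrics_text):
    """Return the mean PLEG relist duration in seconds, or None if not found."""
    values = {}
    for line in metrics_text.splitlines():
        match = _PLEG_LINE.match(line)
        if match:
            values[match.group(1)] = float(match.group(2))
    if "sum" in values and values.get("count"):
        return values["sum"] / values["count"]
    return None


# Illustrative sample only; these numbers are not from our cluster.
SAMPLE = """\
kubelet_pleg_relist_duration_seconds_bucket{le="1"} 90
kubelet_pleg_relist_duration_seconds_bucket{le="+Inf"} 100
kubelet_pleg_relist_duration_seconds_sum 250
kubelet_pleg_relist_duration_seconds_count 100
"""

if __name__ == "__main__":
    print(mean_relist_seconds(SAMPLE))
```

A mean that climbs while pipelines are starting is a hint that the node is struggling to relist containers. The buckets give a better picture than the mean if you have Prometheus scraping the kubelet already.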
The solution for us was:
- Update to the latest GKE 1.21 version, which has fixes for pod creation in COS and other fixes for containerd.
- Change the disk type to SSD.
- Change the node disk size to the best break point for GKE IO.
- Change the node limit to 15. This gives us 10 usable slots per node, as 5 slots are taken by GKE and Prometheus/Loki monitoring.
After these changes we have not seen the issue, even with bursts of up to 60 runner pods created to run simulations.
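The slot arithmetic from the node limit change can be sketched as follows. This assumes the node limit is a per-node pod limit and that the 5 reserved slots are the same on every node; the function names are made up for this example.

```python
import math


def usable_slots(pod_limit, reserved):
    """Slots left for runner job pods on one node."""
    return pod_limit - reserved


def nodes_for_burst(burst_pods, pod_limit, reserved):
    """Nodes the pool must scale to so a burst of job pods fits."""
    slots = usable_slots(pod_limit, reserved)
    if slots <= 0:
        raise ValueError("no usable slots per node")
    return math.ceil(burst_pods / slots)


if __name__ == "__main__":
    # 15-slot limit, 5 taken by system and monitoring pods, burst of 60 jobs.
    print(usable_slots(15, 5), nodes_for_burst(60, 15, 5))
```

With these numbers a 60-pod burst spreads over at least 6 nodes instead of piling onto a few, which limits how many image pulls hit one node disk at the same time.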