ContainersNotReady: "containers with unready status: [build helper]"

Techno-wizard · October 27, 2021, 8:24am

GitLab 13.12, self-hosted on a GCP k8s cluster. The main GitLab application and the runners each have their own namespace in a shared node pool, alongside other applications. The jobs run in a separate pool with auto-scaling. The job uses our own custom image, which is quite large because it contains everything the application needs (747.98 MiB); it is hosted in the on-site GitLab registry.

The error happens randomly but is related to the number of pipelines running.

Looking for first steps to identify the bottleneck or pinch point, so we can start working on a resolution.

Any thoughts?

So far, increasing the number of runner pods does not seem to make a difference, and adding logging on some runners has not shown a single cause.

Placing the helper image locally in the GitLab repo does not seem to have had any effect.

Manish · February 23, 2022, 2:46pm

Do we have any solution for this?

Qark-dev · March 12, 2022, 3:44pm

Bump! Same issue here >:|

Techno-wizard · March 22, 2022, 1:42pm

On investigation, the issue seems to be K8s node IO and an issue in PLEG.

Our system is GKE-based with scaling node pools.
In GKE, disk IO is a function of disk type and size (see GKE Disk IO).
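To make the "function of disk type and size" point concrete, here is a rough sketch of the arithmetic. It assumes IOPS scale linearly with disk size up to a cap. The per-GB rates and the cap in the example are illustrative assumptions only, not figures from this thread, so check the current GCP persistent disk documentation for your disk and machine type.

```python
def estimated_iops(size_gb: float, iops_per_gb: float, cap: float) -> float:
    """Estimate disk IOPS as size * per-GB rate, limited by a cap.

    All three inputs are assumptions supplied by the caller; the real
    limits depend on disk type, machine type, and current GCP quotas.
    """
    if size_gb <= 0 or iops_per_gb <= 0 or cap <= 0:
        raise ValueError("size_gb, iops_per_gb and cap must be positive")
    return min(size_gb * iops_per_gb, cap)


# Illustrative, ASSUMED per-GB read rates; verify against GCP docs.
ASSUMED_IOPS_PER_GB = {"pd-standard": 0.75, "pd-ssd": 30.0}
ASSUMED_CAP = 15000.0  # hypothetical per-instance ceiling

if __name__ == "__main__":
    for disk_type, rate in ASSUMED_IOPS_PER_GB.items():
        for size in (100, 200, 500):
            print(disk_type, size, estimated_iops(size, rate, ASSUMED_CAP))
```

With numbers like these, a small standard disk on a job node gets very little IO, which would fit a node struggling when many large images are pulled at once.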

See this article, which is similar to what we were experiencing.

We monitored the PLEG and found it reaching a limit, which caused the node to go unhealthy.
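For anyone wanting a first check, here is a minimal sketch of one way to spot affected nodes. It is not the exact monitoring we used. It assumes you feed it the output of `kubectl get nodes -o json` and that the kubelet reports the problem in the node's Ready condition with a message containing "PLEG is not healthy". The function name is made up for this example.

```python
import json
from typing import List


def nodes_with_pleg_problems(nodes_json: str) -> List[str]:
    """Return names of nodes whose Ready condition mentions an unhealthy PLEG.

    Expects the JSON document printed by `kubectl get nodes -o json`.
    """
    data = json.loads(nodes_json)
    flagged = []
    for node in data.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        for cond in conditions:
            is_ready_cond = cond.get("type") == "Ready"
            not_ready = cond.get("status") != "True"
            mentions_pleg = "PLEG is not healthy" in cond.get("message", "")
            if is_ready_cond and not_ready and mentions_pleg:
                flagged.append(node["metadata"]["name"])
                break  # one match per node is enough
    return flagged


if __name__ == "__main__":
    import sys

    # Usage: kubectl get nodes -o json | python pleg_check.py
    print(nodes_with_pleg_problems(sys.stdin.read()))
```

This only catches nodes that are already NotReady. To see the trend before a node tips over, you would need to watch kubelet metrics over time instead.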

The solution for us was:

Update to the latest GKE 1.21 version, which has fixes for pod creation in COS and other fixes for containerd.
Change the disk type to SSD.
Change the node disk size to the best break point for GKE IO.
Change the node limit to 15. This gives us 10 per node, as 5 slots are taken by GKE and Prometheus/Loki monitoring.
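The slot arithmetic from the last item can be sketched as below. This assumes the limit of 15 is a per-node pod limit and that 5 pods per node are reserved for GKE and monitoring. The function names are hypothetical and only illustrate the calculation.

```python
import math


def job_slots_per_node(max_pods_per_node: int, reserved_pods: int) -> int:
    """Pods left for CI jobs on one node after system/monitoring pods."""
    slots = max_pods_per_node - reserved_pods
    if slots <= 0:
        raise ValueError("no pod slots left for jobs on this node")
    return slots


def nodes_needed(job_pods: int, max_pods_per_node: int, reserved_pods: int) -> int:
    """Nodes the autoscaler must provide to fit a burst of job pods."""
    slots = job_slots_per_node(max_pods_per_node, reserved_pods)
    return math.ceil(job_pods / slots)


if __name__ == "__main__":
    print(job_slots_per_node(15, 5))   # job slots per node
    print(nodes_needed(60, 15, 5))     # nodes for a burst of 60 job pods
```

Capping the pods per node this way limits how many large image pulls and container starts hit one node's disk at the same time.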

After these changes we have not seen the issue, even with bursts of up to 60 runner pods created running simulations.