GitLab 13.12, self-hosted on a GCP k8s cluster. The main GitLab application and the runners each have their own namespace in a node pool shared with other applications. The jobs run in a separate node pool with auto-scaling. The jobs use our own custom image, which is quite large (747.98 MiB) because it contains everything the application needs; it is hosted in the on-site GitLab registry.
The error occurs intermittently, but how often it happens appears to be related to the number of pipelines running.
Looking for first steps to identify the bottleneck or pinch point, so we can start working on a resolution.
So far, increasing the number of runner pods does not seem to make a difference, and adding logging on some runners has not shown a single cause.
Placing the helper image locally in the GitLab repo does not seem to have had any effect either.
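To put a number on the link between failures and load mentioned above, one possible first step is to group jobs by how many jobs were running when each one started and compare failure rates. The following is only a rough sketch, not something we have in place: the `Job` record and its fields are hypothetical, and it assumes job start/end times and a failed flag have already been exported from GitLab by some other means.

```python
from collections import defaultdict
from datetime import datetime
from typing import NamedTuple


class Job(NamedTuple):
    """Hypothetical record for one CI job (fields are assumptions)."""
    start: datetime
    end: datetime
    failed: bool


def concurrency_at(jobs, moment):
    """Count the jobs that were running at the given moment."""
    return sum(1 for j in jobs if j.start <= moment < j.end)


def failure_rate_by_concurrency(jobs):
    """Map 'jobs running when a job started' to (failures, total, rate)."""
    buckets = defaultdict(lambda: [0, 0])  # level -> [failures, total]
    for job in jobs:
        # The job itself is included in the count at its own start time.
        level = concurrency_at(jobs, job.start)
        buckets[level][1] += 1
        if job.failed:
            buckets[level][0] += 1
    return {
        level: (failures, total, failures / total)
        for level, (failures, total) in sorted(buckets.items())
    }
```

If the failure rate climbs sharply above a certain concurrency level, that level is a reasonable place to start looking for a shared limit (for example registry throughput or node pool scale-up time); if it stays flat, load is probably not the main factor.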