Builds stopped working yesterday - gitlab runner instances stuck in Pending state

We’ve had a gitlab.com project using a specific gitlab-runner deployed to our Kubernetes cluster. It ran for ~6 months with no problems, but as of yesterday all of our pipelines fail with the same error on every stage. The devops engineer who set everything up left a couple of weeks ago, so I’m on the steep learning curve of figuring out what was done without much in the way of documentation.

I can see what is failing, but I don’t really understand where to start applying fixes to resolve this issue.

The job output from one of our builds:
Running with gitlab-runner 13.3.1 (738bbe5a)
  on runner-gitlab-runner-594484c775-v2zxb w7WiGHhk
Preparing the "kubernetes" executor                                  00:00
Using Kubernetes namespace: gitlab-managed-apps
Using Kubernetes executor with image gcr.io/kaniko-project/executor:debug
...
Preparing environment                                                03:03
Waiting for pod gitlab-managed-apps/runner-w7wighhk-project-20213218-concurrent-0xsr7z to be running, status is Pending
Waiting for pod gitlab-managed-apps/runner-w7wighhk-project-20213218-concurrent-0xsr7z to be running, status is Pending
<same message for another 3-4 minutes>
ERROR: Job failed (system failure): prepare environment: timed out waiting for pod to start. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information

Then the build reports as failed. The linked shell documentation doesn’t provide any good info that helps me resolve this. I’ve logged into the gitlab-runner instance and don’t see any of the referenced shell files. All the builds show similar output. We can get our builds to run on the shared runners, but we have a lot of them and don’t think that will work long-term.
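One thing worth knowing about (an assumption on my part about the runner config, since nothing was documented): the ~3 minute wait before the failure matches the Kubernetes executor’s default poll_timeout of 180 seconds in the runner’s config.toml. If pods were only slow to schedule rather than blocked outright, a fragment like this would extend the wait (values illustrative, not our actual config):

[[runners]]
  executor = "kubernetes"
  [runners.kubernetes]
    namespace = "gitlab-managed-apps"
    # seconds to wait for the build pod to reach Running; the
    # default of 180 matches the 3m03s failures in the job output
    poll_timeout = 600

It’s just a first knob to check; it won’t help if the pod can’t be scheduled at all.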

I can see the gitlab-runner deployment running:
$ kubectl get pods -n gitlab-managed-apps
NAME                                    READY   STATUS    RESTARTS   AGE
runner-gitlab-runner-594484c775-v2zxb   1/1     Running   0          161d

If I run a build, I can see where it spins up a pod to run the build:
$ kubectl get pods -n gitlab-managed-apps
NAME                                                 READY   STATUS    RESTARTS   AGE
runner-gitlab-runner-594484c775-v2zxb                1/1     Running   0          161d
runner-w7wighhk-project-20213218-concurrent-024xhx   0/2     Pending   0          8s
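The most direct way to find out why a pod sticks in Pending (these are standard kubectl commands, nothing specific to this setup) is to look at its events, using the pod name from the output above:

$ kubectl describe pod -n gitlab-managed-apps runner-w7wighhk-project-20213218-concurrent-024xhx
$ kubectl get events -n gitlab-managed-apps --sort-by=.metadata.creationTimestamp

The Events section at the bottom of the describe output is where the scheduler reports reasons like failed scheduling or exceeded quotas.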

The logs show a lot of the same errors as the build screen, but there’s some additional info that might point in a helpful direction. Googling these errors and searching the forum didn’t turn up anything useful. Here are some excerpts from the logs:

WARNING: Failed to process runner builds=1 error=prepare environment: timed out waiting for pod to start. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information executor=kubernetes runner=w7WiGHhk
ERROR: Job failed (system failure): prepare environment: timed out waiting for pod to start. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information duration=3m3.679822327s job=1051575746 project=20213218 runner=w7WiGHhk
WARNING: Appending trace to coordinator... aborted code=403 job=1052542445 job-log= job-status=canceled runner=w7WiGHhk sent-log=6733-6934 status=403 Forbidden update-interval=0s

This is about all I’ve figured out so far. Someone with a similar setup reported that their logs bloated and stopped the container from responding, but as best I can tell I don’t have space issues on my gitlab-runner instance.
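For anyone checking the same thing, this is roughly how I ruled out disk space (the df invocation is my own approach, not something from the original setup):

$ kubectl exec -n gitlab-managed-apps runner-gitlab-runner-594484c775-v2zxb -- df -h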

Can anyone point me in the right direction? I’m new to the gitlab-runner/kubernetes/docker world but not to servers and software; I’m just at a loss for where to start.

Thanks!

My deployment versions in case it helps:
$ kubectl exec --stdin --tty -n gitlab-managed-apps runner-gitlab-runner-594484c775-v2zxb -- gitlab-runner --version
Version: 13.3.1
Git revision: 738bbe5a
Git branch: 13-3-stable
GO version: go1.13.8
Built: 2020-08-25T12:29:06+0000
OS/Arch: linux/amd64

It turns out that we had the GitLab runner installed through the integrations page on a cluster that is managed outside of GitLab (in AWS). The Kubernetes screen is only set up through one of our repositories, and not the one I expected, even though we have 8-9 repos in the same project running through the runner. Now I’m trying to uninstall and reinstall, because it looks like a version mismatch could be a candidate for the issues we’re having. I tried the uninstall once and it timed out, so I’ll have to see if I need to remove things from the command line instead. Has anyone else had a similar issue?
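If the command-line route turns out to be necessary: I believe the one-click integration installs the runner as a Helm release, so something like the following might reveal and remove it (a guess on my part; the release name is assumed, and older GitLab versions used Helm v2 with Tiller inside the cluster, in which case the flags differ):

$ helm ls -n gitlab-managed-apps
$ helm uninstall runner-gitlab-runner -n gitlab-managed-apps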

$ kubectl get pods -n gitlab-managed-apps
NAME                                    READY   STATUS    RESTARTS   AGE
runner-gitlab-runner-594484c775-v2zxb   1/1     Running   0          162d
uninstall-runner                        0/1     Pending   0          16m

:frowning: It seems the uninstaller is having the same issue as the build containers.

Well this was a fun introduction to kubernetes. :slight_smile:

Deep in the describe output of one of the pods stuck in Pending was an error stating that I had reached the pod limit. One of my namespaces has several cronjobs that were misconfigured: I thought their restart policy was “never”, but because the config file had errors they were left with the default of “always”. I also had errors preventing those jobs from fully starting, so I ended up with too many pods running, which prevented any new pods from starting. I deleted the deployments for those and we’re back in business.
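For reference, this is the shape of the fix on the cronjob side (a minimal sketch with invented names, not our actual manifests). The key part is setting restartPolicy explicitly in the job’s pod template, since a plain pod spec defaults to “Always”, and for Job pods it must be “Never” or “OnFailure”:

apiVersion: batch/v1beta1   # batch/v1 on Kubernetes 1.21+
kind: CronJob
metadata:
  name: example-cleanup     # invented name for illustration
spec:
  schedule: "0 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cleanup
              image: busybox
              command: ["sh", "-c", "echo cleaning up"]
          restartPolicy: Never   # prevents failed runs from piling up restarted pods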

Hope this helps someone else down the line. Be sure to check the describe output, logs, and events for any pod or job that is having trouble.

Very happy that you worked it out, @jlillest! Thank you so much for posting updates so others in a similar situation can benefit from it. :bowing_man: