"Kubernetes cluster unreachable" intermittent error when using Gitlab Agent for Kubernetes

Summary

  • When deploying to a Kubernetes cluster within a CI/CD pipeline using the Gitlab Agent for Kubernetes, we occasionally get this error:
    Error: Kubernetes cluster unreachable: an error on the server ("unknown") has prevented the request from succeeding

Steps to reproduce

Intermittent problem

What is the current bug behavior?

Deployment job exits as failed because it cannot interact with the cluster.

What is the expected correct behavior?

Deployment succeeds by properly interacting with the Kubernetes API using the Gitlab Agent for Kubernetes as its medium.

Relevant logs and/or screenshots

From the last lines of the job log:

Error: Kubernetes cluster unreachable: an error on the server ("unknown") has prevented the request from succeeding
Cleaning up project directory and file based variables
ERROR: Job failed: command terminated with exit code 1

/label ~"type::bug"


+1 we have the exact same issue with Kubernetes 1.26 (EKS) and the Gitlab Agent 16.2.0

+1 also happening in 15.9.0

Is there more info / are there more specifics on the setup? Where is it hosted? What resources are there (CPU / RAM / OS, etc.)?

  • Hosted in Digital Ocean using DOKS.
  • The Gitlab agent is configured to allow access for the projects in question via a configuration file in the separate repository where the Gitlab agent was registered (a minimal config sketch follows this list).
  • OS is Linux.
  • Anywhere from 4 to 6 nodes per cluster, each with 8 CPUs and 16 GB of RAM.
  • Node autoscaling is enabled.
  • Both the agent and the cluster are deployed and maintained using Terraform.
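
For context, the agent's authorization for those projects lives in the agent config file in that separate repository. A minimal sketch of what that looks like (the agent name and project path below are placeholders, not our real values):

  # .gitlab/agents/<agent-name>/config.yaml in the agent's registration project
  # Grants the listed projects CI/CD access (ci_access) to the cluster through this agent.
  ci_access:
    projects:
      - id: my-group/my-app    # placeholder project path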

Thank you, interesting. Is there any kind of pattern to this intermittency? A specific time window during the day? Does it coincide with any releases going on internally? Do you have monitoring / graphs (Grafana / Prometheus) where you can visualise what exactly is happening in your environment during this "unreachable" behaviour?
Intermittency can be hard to troubleshoot; the best approach is to use monitoring data to visualise what is happening in the whole environment overall and which component is doing what.

We do have some monitoring in place, but the problem has been so intermittent and short-lived that we haven't prioritised setting up dedicated monitoring for it. However, it happens often enough that our app devs have had to ping us multiple times when their first pipeline run failed to deploy to the cluster. Our current workaround has been to add a retry script to the pipeline that verifies the cluster is reachable before continuing with the deployment (sketch below).
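
For anyone who runs into the same thing, the workaround looks roughly like this in the deploy job. This is a sketch only; the job name, image, context variable and retry counts are made up for illustration, and it assumes kubectl in the job image is pointed at the agent's kubecontext:

  deploy:
    image: bitnami/kubectl:latest                      # assumes an image with kubectl available
    variables:
      KUBE_CONTEXT: my-group/agent-project:my-agent    # placeholder agent context
    before_script:
      - kubectl config use-context "$KUBE_CONTEXT"
      # Retry until the API server answers instead of failing on the first "unreachable" error
      - |
        for i in $(seq 1 10); do
          kubectl get --raw=/readyz >/dev/null 2>&1 && break
          echo "Cluster not reachable yet (attempt $i), retrying in 15s..."
          sleep 15
        done
    script:
      - kubectl apply -f manifests/                    # the actual deployment step

If the cluster never becomes reachable, the subsequent kubectl apply still fails, so the job does not silently pass.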

I guess I'm wondering whether there would be logs in the Gitlab Agent that could give insight into these problems, if it is in fact a Gitlab Agent problem. Also, I didn't specify earlier that we are not self-hosting our Gitlab instance; we are using the cloud version.
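
For reference, the agentk pods running in the cluster do produce logs that can be checked around the timestamp of a failed job. Something like the following, assuming the agent was installed via the Helm chart into a namespace such as gitlab-agent (the namespace and pod names here are placeholders and depend on how the chart was installed):

  # Find the agent pods
  kubectl -n gitlab-agent get pods
  # Tail the agent logs around the time of the failed pipeline job
  kubectl -n gitlab-agent logs <agentk-pod-name> --since=1h

Errors there (connection drops, TLS problems, restarts triggered by the node autoscaler) would help narrow down whether it is the agent, the cluster, or the connection to Gitlab that goes away intermittently.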