"Kubernetes cluster unreachable" intermittent error when using Gitlab Agent for Kubernetes

Summary

  • When deploying to a Kubernetes cluster within a CI/CD pipeline using the Gitlab Agent for Kubernetes, we occasionally get this error:
    Error: Kubernetes cluster unreachable: an error on the server ("unknown") has prevented the request from succeeding

Steps to reproduce

Intermittent problem

What is the current bug behavior?

Deployment job exits as failed because it cannot interact with the cluster.

What is the expected correct behavior?

Deployment succeeds by properly interacting with the Kubernetes API using the Gitlab Agent for Kubernetes as its medium.

Relevant logs and/or screenshots

From the last lines of the job log:

Error: Kubernetes cluster unreachable: an error on the server ("unknown") has prevented the request from succeeding
Cleaning up project directory and file based variables
ERROR: Job failed: command terminated with exit code 1

/label ~"type::bug"


+1 we have the exact same issue with Kubernetes 1.26 (EKS) and the Gitlab Agent 16.2.0

+1 also happening in 15.9.0

Is there more info / are there more specifics on the setup? Where is it hosted? What resources are there (CPU / RAM / OS, etc.)?

  • Hosted in Digital Ocean using DOKS.
  • The Gitlab agent is configured to allow access for the projects in question via a configuration file in the separate repository where the Gitlab agent was registered (a minimal config sketch follows this list).
  • OS is Linux.
  • Anywhere from 4 to 6 nodes per cluster, each with 8 CPUs and 16 GB of RAM.
  • Node autoscaling is enabled.
  • Both the agent and the cluster are deployed and maintained using Terraform.
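
For context, the agent's authorization for those projects lives in the agent config file in that separate repository. A minimal sketch of what that looks like (the agent name and project path below are placeholders, not our real values):

  # .gitlab/agents/<agent-name>/config.yaml in the agent's registration project
  # Grants the listed projects CI/CD access (ci_access) to the cluster through this agent.
  ci_access:
    projects:
      - id: my-group/my-app    # placeholder project path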

Thank you, interesting. Is there any kind of pattern to this intermittency? A specific time window during the day? Does it coincide with any releases going on internally? Do you have monitoring / graphs (Grafana / Prometheus) where you can visualise what exactly is happening in your environment during this "unreachable" behaviour?
Intermittency can be hard to troubleshoot; the best approach is to use monitoring data to visualise what is happening in the whole environment overall and which component is doing what.

We do have some monitoring in place, but the problem has been so intermittent and short-lived that we haven't prioritised setting up dedicated monitoring for it. However, it happens often enough that our app devs have had to ping us multiple times when their first pipeline run failed to deploy to the cluster. Our current workaround has been to add a retry script to the pipeline that verifies the cluster is reachable before continuing with the deployment (sketch below).
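
For anyone who runs into the same thing, the workaround looks roughly like this in the deploy job. This is a sketch only; the job name, image, context variable and retry counts are made up for illustration, and it assumes kubectl in the job image is pointed at the agent's kubecontext:

  deploy:
    image: bitnami/kubectl:latest                      # assumes an image with kubectl available
    variables:
      KUBE_CONTEXT: my-group/agent-project:my-agent    # placeholder agent context
    before_script:
      - kubectl config use-context "$KUBE_CONTEXT"
      # Retry until the API server answers instead of failing on the first "unreachable" error
      - |
        for i in $(seq 1 10); do
          kubectl get --raw=/readyz >/dev/null 2>&1 && break
          echo "Cluster not reachable yet (attempt $i), retrying in 15s..."
          sleep 15
        done
    script:
      - kubectl apply -f manifests/                    # the actual deployment step

If the cluster never becomes reachable, the subsequent kubectl apply still fails, so the job does not silently pass.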

I guess I'm wondering whether there would be logs in the Gitlab Agent that could give insight into these problems, if it is in fact a Gitlab Agent problem. Also, I didn't specify earlier that we are not self-hosting our Gitlab instance; we are using the cloud version.
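
For reference, the agentk pods running in the cluster do produce logs that can be checked around the timestamp of a failed job. Something like the following, assuming the agent was installed via the Helm chart into a namespace such as gitlab-agent (the namespace and pod names here are placeholders and depend on how the chart was installed):

  # Find the agent pods
  kubectl -n gitlab-agent get pods
  # Tail the agent logs around the time of the failed pipeline job
  kubectl -n gitlab-agent logs <agentk-pod-name> --since=1h

Errors there (connection drops, TLS problems, restarts triggered by the node autoscaler) would help narrow down whether it is the agent, the cluster, or the connection to Gitlab that goes away intermittently.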