Hey there
I reworked our runner structure and need some metrics to optimize resource assignment. My goal is a dashboard where I can group different statistics and, for example, see the load caused by certain jobs.
We have several hosts with k3s clusters. Our Argo CD deploys a set of GitLab runners with different properties to each cluster, using the official Helm chart. The basic difference between these runners is whether they use the Docker or the Kubernetes executor to run the actual jobs. Additionally, we have a Prometheus in each cluster, also deployed via Helm and Argo CD.
The gitlab-runner service monitor is enabled. The runner config (template) contains
...
service:
  enabled: true
  {{- if eq .executor "kubernetes" }}
  annotations:
    external-dns.alpha.kubernetes.io/hostname: "{{ .cluster }}-{{ .executor }}.{{ .location }}.something.cloud"
  clusterIP: None
  {{- end }}
{{ .additionalValues -}}
runners:
  ...
My idea was to attach CI job variables to the job containers as labels and then have Prometheus collect these as metric labels. Then I can combine them to get metrics for a certain pipeline stage or similar.
This works perfectly for Docker executor jobs, using cAdvisor (also deployed via Helm chart and Argo CD), with the following in the Docker executor's config.toml
...
[runners.docker.container_labels]
"com.gitlab.gitlab.runner.job.id" = "$CI_JOB_ID"
"com.gitlab.gitlab.runner.job.stage" = "$CI_JOB_STAGE"
"com.gitlab.gitlab.runner.job.name" = "$CI_JOB_NAME"
"com.gitlab.gitlab.runner.pipeline.url" = "$CI_PIPELINE_URL"
"com.gitlab.gitlab.runner.pipeline.name" = "$CI_PIPELINE_NAME"
"com.gitlab.gitlab.runner.project.path" = "$CI_PROJECT_PATH"
...
and the scrape config
prometheus:
  server:
    ingress:
      enabled: true
      hosts:
        - prometheus.runnerhost.inhouse.something.cloud
  # scrape the cadvisor for docker jobs
  extraScrapeConfigs: |
    - job_name: cadvisor
      static_configs:
        - targets:
            - cadvisor.cadvisor.svc.cluster.local:8080
and an example PromQL query
sum(
  rate(
    container_cpu_usage_seconds_total{
      container_label_com_gitlab_gitlab_runner_job_id!="",
      container_label_com_gitlab_gitlab_runner_job_name!=""
    }[$__rate_interval]
  )
)
by (
  container_label_com_gitlab_gitlab_runner_job_id
)
* 100
I get all the metadata I need.
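To illustrate what I mean by combining labels, here is a sketch of a per-stage variant of the query above. It assumes cAdvisor exposes the stage label under the same naming pattern as the job id label, i.e. as container_label_com_gitlab_gitlab_runner_job_stage; I have not shown that label name elsewhere in this post, so treat it as an assumption.

```promql
# CPU usage (in percent of one core) summed per pipeline stage.
# Assumption: the Docker label "com.gitlab.gitlab.runner.job.stage" shows up
# as container_label_com_gitlab_gitlab_runner_job_stage, like the job id label.
sum by (container_label_com_gitlab_gitlab_runner_job_stage) (
  rate(
    container_cpu_usage_seconds_total{
      container_label_com_gitlab_gitlab_runner_job_stage!=""
    }[$__rate_interval]
  )
) * 100
```

Swapping the grouping label for the project path or pipeline name label should work the same way.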
I hoped to achieve the same with the Kubernetes executor and the kubernetes-nodes-cadvisor scrape job, so I tried pod_labels and pod_annotations
...
[runners.kubernetes.pod_annotations]
"job.runner.gitlab.com/stage" = "$CI_JOB_STAGE"
...
[runners.kubernetes.pod_labels]
"com.gitlab.gitlab.runner.job.stage" = "$CI_JOB_STAGE"
...
...
But metrics such as container_cpu_usage_seconds_total from the kubernetes-nodes-cadvisor job do not carry any labels derived from the annotations or labels attached to the executor pods. The pods themselves do have the annotations/labels I defined.
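For reference, a query along these lines shows which identifying labels those series do carry. This is only a sketch: it assumes the scrape job is named kubernetes-nodes-cadvisor (as above) and that the series have the namespace, pod and container labels that the kubelet's cAdvisor endpoint normally sets.

```promql
# One result per container series scraped from the kubelet cAdvisor endpoint.
# Assumption: the series carry namespace/pod/container labels; none of the
# custom pod labels or annotations appear here.
count by (namespace, pod, container) (
  container_cpu_usage_seconds_total{job="kubernetes-nodes-cadvisor"}
)
```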
I tried many different scrape configs. I tried some custom extraScrapeConfigs that should monitor pods, but I got 404s.
prometheus:
  server:
    ...
  serverFiles:
    prometheus.yml:
      scrape_configs:
        - job_name: 'kubernetes-nodes-cadvisor'
          relabel_configs:
            - action: labelmap
              regex: __meta_kubernetes_(.*)_label_(.+)
            - action: labelmap
              regex: __meta_kubernetes_(.*)_annotation_(.+)
I also got 404s when I added
[runners.kubernetes.pod_annotations]
"prometheus.io/scrape" = "true"
"prometheus.io/path" = "metrics"
"prometheus.io/port" = "9252"
Any ideas or suggestions? I can provide more config details; I just didn't want to bloat the first post.