[SOLVED] kubernetes executor runner job metrics: scrape pod labels/annotations with prometheus

Hey there :slight_smile:

I reworked our runner structure and need some metrics to optimize resource assignment. My goal is to have a dashboard where I can group different statistics and, for example, get the load of certain jobs.

We have several hosts with a k3s cluster. Our Argo CD deploys a set of GitLab runners with different properties to each cluster, using the official Helm chart. The basic difference between these runners is whether they use the docker or the kubernetes executor to run the actual jobs. Additionally, we have a Prometheus in each cluster, also deployed via Helm and Argo CD.

The gitlab-runner service monitor is enabled. The runner config (template) contains:

...
service:
  enabled: true
  {{- if eq .executor "kubernetes" }}
  annotations:
    external-dns.alpha.kubernetes.io/hostname: "{{ .cluster }}-{{ .executor }}.{{ .location }}.something.cloud"
  clusterIP: None
  {{- end }}
{{ .additionalValues -}}
runners:
...

My idea was to attach CI job variables to the containers as labels and then have Prometheus pick these labels up as metric labels. Then I can combine them to get metrics for a certain pipeline stage or similar.

This works perfectly for docker executor jobs, using cAdvisor (also deployed via the Helm chart and Argo CD), by having this in the docker executor config.toml:

...
[runners.docker.container_labels]
  "com.gitlab.gitlab.runner.job.id" = "$CI_JOB_ID"
  "com.gitlab.gitlab.runner.job.stage" = "$CI_JOB_STAGE"
  "com.gitlab.gitlab.runner.job.name" = "$CI_JOB_NAME"
  "com.gitlab.gitlab.runner.pipeline.url" = "$CI_PIPELINE_URL"
  "com.gitlab.gitlab.runner.pipeline.name" = "$CI_PIPELINE_NAME"
  "com.gitlab.gitlab.runner.project.path" = "$CI_PROJECT_PATH"
...

and the scrape config

prometheus:
  server:
    ingress:
      enabled: true
      hosts:
        - prometheus.runnerhost.inhouse.something.cloud
  # scrape the cadvisor for docker jobs
  extraScrapeConfigs: |
    - job_name: cadvisor
      static_configs:
        - targets:
            - cadvisor.cadvisor.svc.cluster.local:8080

and an example PromQL query

sum by (container_label_com_gitlab_gitlab_runner_job_id) (
    rate(
        container_cpu_usage_seconds_total{
            container_label_com_gitlab_gitlab_runner_job_id!="",
            container_label_com_gitlab_gitlab_runner_job_name!=""
        }[$__rate_interval]
    )
)
* 100

I get all the metadata I need.
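
For anyone wondering where the label names in the query come from: the Docker label keys from config.toml show up with a container_label_ prefix and with the dots turned into underscores. Here is a minimal Python sketch of that mapping. It assumes that every character not allowed in a Prometheus label name is simply replaced by an underscore, and the function name is made up for illustration:

```python
import re


def to_prometheus_label(key: str, prefix: str = "container_label_") -> str:
    """Approximate how a Docker label key becomes a Prometheus label name."""
    # Assumption: characters outside [a-zA-Z0-9_] are replaced by "_".
    return prefix + re.sub(r"[^a-zA-Z0-9_]", "_", key)


# Label key taken from the config.toml snippet above.
print(to_prometheus_label("com.gitlab.gitlab.runner.job.id"))
```

This is handy for working out which label to filter on in a query before the first job has run.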

I hoped to achieve the same with the kubernetes executor and the kubernetes-nodes-cadvisor job, so I tried pod_labels and pod_annotations:

...
[runners.kubernetes.pod_annotations]
  "job.runner.gitlab.com/stage" = "$CI_JOB_STAGE"
  ...
[runners.kubernetes.pod_labels]
  "com.gitlab.gitlab.runner.job.stage" = "$CI_JOB_STAGE" 
  ...
...

But metrics like container_cpu_usage_seconds_total, coming from the kubernetes-nodes-cadvisor job, do not contain any labels derived from the annotations or labels attached to the executor pods. The pods themselves, however, do have the annotations/labels I defined.

I tried many different scrape configs, including some custom extraScrapeConfigs that should monitor pods, but I got 404s.

prometheus:
  server:
    ...
  serverFiles:
    prometheus.yml:
      scrape_configs:
        - job_name: 'kubernetes-nodes-cadvisor'
          relabel_configs:
            - action: labelmap
              regex: __meta_kubernetes_(.*)_label_(.+)
            - action: labelmap
              regex: __meta_kubernetes_(.*)_annotation_(.+)

I also got 404s when I added

[runners.kubernetes.pod_annotations]
  "prometheus.io/scrape" = "true"
  "prometheus.io/path" = "metrics"
  "prometheus.io/port" = "9252"

Any ideas or suggestions? I can certainly provide more config details, I just didn’t want to bloat the first post.

I managed to solve it with the help of a k8s magician colleague :o)
In case it helps others, I’ll share some details:

  1. There seems to be a bug that may have prevented the pod labels from being added to the metrics.
  2. The new approach was to enable kube-state-metrics in the Prometheus chart and tell it to expose pod labels/annotations.
    a. prometheus values.yaml
prometheus:
  server:
    ingress:
      enabled: true
      hosts:
        - prometheus.runnerfarm-0.inhouse.platform.reservix.cloud
  kube-state-metrics:
    enabled: true
#    metricAllowlist:
#      - kube_pod_annotations
    metricAnnotationsAllowList:
     - pods=[*]
#     - namespaces=[gitlab]
  3. Use PromQL to join the labels of kubernetes-nodes-cadvisor (resource usage metrics) and kube-state-metrics (containing the desired labels). A simple query would be:
container_cpu_usage_seconds_total{pod=~"runner-.*", container!=""}
*
on(pod) 
group_left(annotation_job_runner_gitlab_com_id)
(kube_pod_annotations)
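
To make the join a bit less magical: as far as I understand it, kube_pod_annotations is an info-style metric whose value is always 1, so multiplying by it leaves the CPU value untouched and only copies the requested label over. Below is a rough Python model of what `* on(pod) group_left(...)` does. The sample data and the helper name are made up, and this only models the matching, not how Prometheus actually implements it:

```python
def join_group_left(left, right, on, copy_labels):
    """Model of `left * on(<on>) group_left(<copy_labels>) right`."""
    # Index the right-hand side by the matching labels; each key must be unique.
    index = {}
    for labels, value in right:
        key = tuple(labels[name] for name in on)
        if key in index:
            raise ValueError(f"duplicate series on the right side for {key}")
        index[key] = (labels, value)

    result = []
    for labels, value in left:
        key = tuple(labels[name] for name in on)
        if key not in index:
            continue  # no matching series on the right: the sample is dropped
        right_labels, right_value = index[key]
        merged = dict(labels)
        for name in copy_labels:
            if name in right_labels:
                merged[name] = right_labels[name]
        result.append((merged, value * right_value))
    return result


# Made-up sample data: two containers of one runner pod, one annotation series.
cpu = [
    ({"pod": "runner-abc", "container": "build"}, 0.42),
    ({"pod": "runner-abc", "container": "helper"}, 0.03),
]
annotations = [
    ({"pod": "runner-abc", "annotation_job_runner_gitlab_com_id": "1234"}, 1),
]

joined = join_group_left(
    cpu, annotations, ["pod"], ["annotation_job_runner_gitlab_com_id"]
)
for labels, value in joined:
    print(labels, value)
```

Both container series keep their value and gain the job id label, which is exactly what I want for grouping in the dashboard.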

And for reference, this is how I calculate the CPU utilization (see the related GitHub discussion):

round(
  100 * 
  sum(
    rate(
        container_cpu_usage_seconds_total{pod=~"runner-.*", container!=""}[5m]
    )
  )
  by (
    pod,
    container,
    annotation_job_runner_gitlab_com_id,
    ...
  )
) 
/ 
sum by (
    pod,
    container,
    annotation_job_runner_gitlab_com_id,
    ...
    )
    (
        container_spec_cpu_quota{pod=~"runner-.*", container!=""}
        /
        container_spec_cpu_period{pod=~"runner-.*", container!=""}
    )
*
on(pod)
group_left(
    annotation_job_runner_gitlab_com_id,
    ...
)
(
    max by(
        pod,
        container,
        annotation_job_runner_gitlab_com_id,
        ...
    )
(kube_pod_annotations)
)

where I add the desired labels to the by() expressions so that I can use them as graph labels in Grafana. The max by() was necessary to ensure that the series on the right-hand side of the join are unique.
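
As a small illustration of the uniqueness point: if the right-hand side has more than one series per pod (for example, because the same info metric shows up twice and differs only in some extra label), a many-to-one join fails. Aggregating with max by() drops the extra labels and collapses the duplicates. Here is a Python model of that with made-up data. The `instance` label values and the function name are assumptions for illustration only:

```python
from collections import defaultdict


def max_by(series, by):
    """Model of `max by(<by>) (series)`: keep only the `by` labels, take the max."""
    groups = defaultdict(list)
    for labels, value in series:
        # Labels missing on a series are treated as empty strings here.
        key = tuple(labels.get(name, "") for name in by)
        groups[key].append(value)
    return [(dict(zip(by, key)), max(values)) for key, values in groups.items()]


# Made-up data: the same pod annotation reported twice, differing only in `instance`.
kube_pod_annotations = [
    ({"pod": "runner-abc",
      "annotation_job_runner_gitlab_com_id": "1234",
      "instance": "10.0.0.1:8080"}, 1),
    ({"pod": "runner-abc",
      "annotation_job_runner_gitlab_com_id": "1234",
      "instance": "10.0.0.2:8080"}, 1),
]

unique = max_by(kube_pod_annotations, ["pod", "annotation_job_runner_gitlab_com_id"])
print(unique)
```

Since the metric value is 1 for every series, taking the max does not change anything except the number of series.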