GitLab CI with GKE leaves workloads hanging around after failure

MikeCongdon1 · August 7, 2020, 12:38pm


I have a GitLab-managed cluster running in GKE.
I’m getting a lot of job failures (timeouts, the mysql command not working, etc.). I think some of these are due to a lack of resources in the cluster (it has HPA and vertical autoscaling on).
**How do I make that better?**
**Why are there so many workloads left over from jobs that have failed?**
In some cases they stick around for days and have to be manually deleted.
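
To make "stick around for days" concrete, here is a minimal sketch of how leftover job pods could be picked out of the JSON that `kubectl get pods -o json` prints. It assumes the runner's job pods have names starting with `runner-` and that anything older than two hours is stale (longer than the 90 minute job timeout); the pod names and timestamps in `sample` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(ts: str) -> datetime:
    """Parse a Kubernetes creationTimestamp such as 2020-08-05T09:00:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stale_pods(pod_list: dict, now: datetime, max_age: timedelta,
               name_prefix: str = "runner-") -> list:
    """Return names of pods with the given prefix that are older than max_age.

    name_prefix is an assumption about how the job pods are named.
    """
    stale = []
    for pod in pod_list.get("items", []):
        meta = pod["metadata"]
        if not meta["name"].startswith(name_prefix):
            continue  # not a CI job pod, leave it alone
        age = now - parse_k8s_time(meta["creationTimestamp"])
        if age > max_age:
            stale.append(meta["name"])
    return stale


# Hypothetical data in the shape of `kubectl get pods -o json` output.
sample = {
    "items": [
        {"metadata": {"name": "runner-abc-project-1-concurrent-0",
                      "creationTimestamp": "2020-08-05T09:00:00Z"}},
        {"metadata": {"name": "runner-abc-project-1-concurrent-1",
                      "creationTimestamp": "2020-08-07T12:30:00Z"}},
        {"metadata": {"name": "kube-dns-xyz",
                      "creationTimestamp": "2020-08-01T00:00:00Z"}},
    ]
}

now = datetime(2020, 8, 7, 12, 38, tzinfo=timezone.utc)
result = stale_pods(sample, now, timedelta(hours=2))
print(result)
```

The function only reports names; deleting the pods is left as a separate, deliberate step.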

Describe your question in as much detail as possible:

What are you seeing, and how does that differ from what you expect to see?
I expect the workloads to only exist for the life of the job they are running.
Consider including screenshots, error messages, and/or other helpful visuals
What version are you on? Are you using self-managed or GitLab.com?
GitLab.com
- GitLab (Hint: /help): 13.3.0-pre 52083dab1f2
- Runner (Hint: /admin/runners): GKE gitlab-runner
Add the CI configuration from .gitlab-ci.yml and other configuration if relevant (e.g. docker-compose.yml)

I’ve manually set `concurrent` to 7 in config.toml.

Here’s my config.toml:

concurrent = 7 
cpu_limit = "3"
memory_limit = "4Gi"
service_cpu_limit = "2"
service_memory_limit = "2Gi"
helper_cpu_limit = "500m" 
helper_memory_limit = "500Mi" 
check_interval = 3 
log_level = "info" 
listen_address = ':9252'
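
As a rough check on the "lack of resources" theory, the limits above can be added up. This is only a sketch: it assumes each job pod runs one build container, one service container (mysql:5.7) and one helper container, that all 7 concurrent jobs run at once, and that the limits from config.toml are the ones actually applied. When a container has a limit but no request, Kubernetes uses the limit as the request, so these totals are what the scheduler would need to find room for in that case.

```python
def cpu_to_millicores(value: str) -> int:
    """Convert a Kubernetes CPU quantity such as "3" or "500m" to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return round(float(value) * 1000)


def memory_to_mib(value: str) -> int:
    """Convert a memory quantity to MiB. Only the Gi and Mi suffixes are handled."""
    if value.endswith("Gi"):
        return round(float(value[:-2]) * 1024)
    if value.endswith("Mi"):
        return int(value[:-2])
    raise ValueError(f"unsupported memory quantity: {value}")


CONCURRENT = 7
# build, service and helper limits from config.toml
CPU_LIMITS = ["3", "2", "500m"]
MEMORY_LIMITS = ["4Gi", "2Gi", "500Mi"]

per_job_cpu_m = sum(cpu_to_millicores(v) for v in CPU_LIMITS)
per_job_mem_mib = sum(memory_to_mib(v) for v in MEMORY_LIMITS)
total_cpu_m = per_job_cpu_m * CONCURRENT
total_mem_mib = per_job_mem_mib * CONCURRENT

print(f"per job: {per_job_cpu_m / 1000} CPU, {per_job_mem_mib} MiB")
print(f"all {CONCURRENT} jobs: {total_cpu_m / 1000} CPU, {total_mem_mib} MiB")
```

That works out to 5.5 CPU and 6644 MiB per job, or 38.5 CPU and 46508 MiB with all 7 slots busy.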

gitlab-ci.yml:

image: gcr.io/$PROJECT/$CUSTOMUBUNTUIMAGE:$TAG

variables:
  GIT_DEPTH: 10
  GIT_STRATEGY: fetch
  REPO_NAME: $REPO
  MYSQL_DATABASE: reboot_tests
  MYSQL_ROOT_PASSWORD: docker
  REF_DESLASHED: ${CI_COMMIT_REF_NAME////_}
  PROJECT_NAME: $BUILD_IMAGE
  GCP_PROJECT_ID: $PROJECT 
  SHORT_SHA: $(echo $CI_COMMIT_SHA | cut -c1-7)
  KUBERNETES_CPU_REQUEST: "1"
  KUBERNETES_CPU_LIMIT: "1.5"
  KUBERNETES_MEMORY_REQUEST: "2Gi"
  KUBERNETES_MEMORY_LIMIT: "2Gi"

stages:
  - backend
  - frontend
  - build 

.job_template: &kube_backend  # Hidden key that defines an anchor named 'kube_backend'
    timeout: "30 minute"
    retry: 2
    services:
        - mysql:5.7
    stage: backend
    only:
        - external_pull_requests
    tags:
        - kube
    variables:
       DBNAME: "reboot_tests"

    before_script:
        #- echo "y" | cpanm --notest --skip-satisfied Date::Business
        #- cpanm Alt::Crypt::RSA::BigInt App::cpanoutdated --quiet --notest --skip-satisfied
        #- cpanm --quiet --notest --skip-satisfied --installdeps .
        #- perl Makefile.PL
        #- cpanm SOAP::Lite --skip-satisfied
        - apt-get install mariadb-client -y --quiet
        - sleep 10
        - mysql -h mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "SET GLOBAL sql_mode = '';"
        - export FRESH_DB=1

.job_template: &kube_frontend  # Hidden key that defines an anchor named 'kube_frontend'
    image: gcr.io/$PROJECT/$IMAGE:$TAG
    services:
        - mysql:5.7
    stage: frontend 
    timeout: "90 minute"
    retry: 1
    only:
        - external_pull_requests
    tags:
        - devel 
    variables:
       DBNAME: "reboot_tests"

    before_script:
        - sudo /usr/bin/mysqld_safe --basedir=/usr &
        - echo "y" | cpanm --notest --skip-satisfied Date::Business
        - cpanm Alt::Crypt::RSA::BigInt App::cpanoutdated --quiet --notest --skip-satisfied
        - cpanm --quiet --notest --skip-satisfied --installdeps .
        - perl Makefile.PL
        - cpanm SOAP::Lite --skip-satisfied

        - export PATH=/usr/local/lib/nodejs/node-v10.9.0-linux-x64/bin:$PATH
        - sudo npm ci
        - npm version
        - mysql -h mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "SET GLOBAL sql_mode = '';"
        - export FRESH_DB=1
        - prove -vl t/t_admin/sitemap.t
        - script/cetec_reboot_web_server.pl -rp 3001 &
        - sleep 15


frontend_testday:
    <<: *kube_frontend
    script:
        - $(npm bin)/cypress run --spec cypress/integration/testday/* --config baseUrl="http://localhost:3001"
    
    artifacts:
        paths:
        - cypress/videos/testday/
        - cypress/screenshots/testday/
        when: on_failure
        expire_in: 2 days    

backend_part:
    <<: *kube_backend
    script:
        - prove -vl t/t_part/*

What troubleshooting steps have you already taken? Can you link to any docs or other resources so we know where you have been?

I’ve changed the number of concurrent jobs, added the vertical autoscaler, and added the KUBERNETES_* resource variables to gitlab-ci.yml.
