GitLab CI with GKE leaves workloads hanging around after failure


I have a GitLab-managed cluster running in GKE.
I’m getting a lot of job failures (timeouts, the mysql command not working, etc.). I think some of these are due to a lack of resources in the cluster, even though it has HPA and vertical autoscaling turned on.
**How do I make that better?**
**Why are there so many workloads left over from jobs that have failed?**
In some cases they stick around for days and have to be deleted manually.
Describe your question in as much detail as possible:

  • What are you seeing, and how does that differ from what you expect to see?
    I expect the workloads to only exist for the life of the job they are running.

  • Consider including screenshots, error messages, and/or other helpful visuals

  • What version are you on? Are you using self-managed or GitLab.com?
    GitLab.com

    • GitLab (Hint: /help): 13.3.0-pre 52083dab1f2
    • Runner (Hint: /admin/runners): GKE gitlab-runner
  • Add the CI configuration from .gitlab-ci.yml and other configuration if relevant (e.g. docker-compose.yml)

I’ve manually set concurrent to 7 in config.toml.

Here’s my config.toml:

concurrent = 7 
cpu_limit = "3"
memory_limit = "4Gi"
service_cpu_limit = "2"
service_memory_limit = "2Gi"
helper_cpu_limit = "500m" 
helper_memory_limit = "500Mi" 
check_interval = 3 
log_level = "info" 
listen_address = ':9252'

.gitlab-ci.yml:

image: gcr.io/$PROJECT/$CUSTOMUBUNTUIMAGE:$TAG

variables:
  GIT_DEPTH: 10
  GIT_STRATEGY: fetch
  REPO_NAME: $REPO
  MYSQL_DATABASE: reboot_tests
  MYSQL_ROOT_PASSWORD: docker
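  # NOTE: values under variables: are not passed through a shell, so this
  # bash-style substitution is likely stored literally rather than evaluated.
  REF_DESLASHED: ${CI_COMMIT_REF_NAME////_}
  PROJECT_NAME: $BUILD_IMAGE
  GCP_PROJECT_ID: $PROJECT
  # NOTE: $(...) is not executed here either; the predefined
  # CI_COMMIT_SHORT_SHA variable may be an alternative.
  SHORT_SHA: $(echo $CI_COMMIT_SHA | cut -c1-7)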
  KUBERNETES_CPU_REQUEST: "1"
  KUBERNETES_CPU_LIMIT: "1.5"
  KUBERNETES_MEMORY_REQUEST: "2Gi"
  KUBERNETES_MEMORY_LIMIT: "2Gi"

stages:
  - backend
  - frontend
  - build 

.kube_backend_template: &kube_backend  # Hidden key that defines an anchor named 'kube_backend'
    timeout: "30 minute"
    retry: 2
    services:
        - mysql:5.7
    stage: backend
    only:
        - external_pull_requests
    tags:
        - kube
    variables:
       DBNAME: "reboot_tests"

    before_script:
        #- echo "y" | cpanm --notest --skip-satisfied Date::Business
        #- cpanm Alt::Crypt::RSA::BigInt App::cpanoutdated --quiet --notest --skip-satisfied
        #- cpanm --quiet --notest --skip-satisfied --installdeps .
        #- perl Makefile.PL
        #- cpanm SOAP::Lite --skip-satisfied
        - apt-get install mariadb-client -y --quiet
        - sleep 10
        - mysql -h mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "SET GLOBAL sql_mode = '';"
        - export FRESH_DB=1

.kube_frontend_template: &kube_frontend  # Hidden key that defines an anchor named 'kube_frontend'
    image: gcr.io/$PROJECT/$IMAGE:$TAG
    services:
        - mysql:5.7
    stage: frontend 
    timeout: "90 minute"
    retry: 1
    only:
        - external_pull_requests
    tags:
        - devel 
    variables:
       DBNAME: "reboot_tests"

    before_script:
        - sudo /usr/bin/mysqld_safe --basedir=/usr &
        - echo "y" | cpanm --notest --skip-satisfied Date::Business
        - cpanm Alt::Crypt::RSA::BigInt App::cpanoutdated --quiet --notest --skip-satisfied
        - cpanm --quiet --notest --skip-satisfied --installdeps .
        - perl Makefile.PL
        - cpanm SOAP::Lite --skip-satisfied

        - export PATH=/usr/local/lib/nodejs/node-v10.9.0-linux-x64/bin:$PATH
        - sudo npm ci
        - npm version
        - mysql -h mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "SET GLOBAL sql_mode = '';"
        - export FRESH_DB=1
        - prove -vl t/t_admin/sitemap.t
        - script/cetec_reboot_web_server.pl -rp 3001 &
        - sleep 15


frontend_testday:
    <<: *kube_frontend
    script:
        - $(npm bin)/cypress run --spec cypress/integration/testday/* --config baseUrl="http://localhost:3001"
    
    artifacts:
        paths:
        - cypress/videos/testday/
        - cypress/screenshots/testday/
        when: on_failure
        expire_in: 2 days    

backend_part:
    <<: *kube_backend
    script:
        - prove -vl t/t_part/*
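
For the intermittent mysql failures, the fixed sleep 10 in the backend before_script could perhaps be replaced with a loop that waits until the service actually answers. This is a minimal, untested sketch: the hidden job name .wait_for_mysql and the 60-attempt cap are hypothetical, and it assumes mysqladmin is installed by the mariadb-client package in this image.

```yaml
.wait_for_mysql:  # hypothetical hidden job, not part of the current config
    before_script:
        - apt-get install mariadb-client -y --quiet
        - |
          # Poll the mysql service once per second, for up to about 60 seconds.
          for i in $(seq 1 60); do
            if mysqladmin ping -h mysql -uroot -p"$MYSQL_ROOT_PASSWORD" --silent; then
              break
            fi
            sleep 1
          done
        # If the service never came up, this command fails and the job stops here.
        - mysql -h mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e "SET GLOBAL sql_mode = '';"
```

The last command is the existing sql_mode line; only the sleep is replaced by the loop.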
  • What troubleshooting steps have you already taken? Can you link to any docs or other resources so we know where you have been?

I’ve changed the number of concurrent jobs, added the vertical autoscaler, and added the KUBERNETES_* resource variables to .gitlab-ci.yml.