GitLab K8s Runner fails for unknown reasons

We’re using the GitLab Kubernetes Runner, deployed from the official Helm chart on Azure Kubernetes Service, for our CI/CD pipelines. Until a few days ago everything worked fine and all projects built successfully. Now we suddenly have an issue: for one project the build works only some of the time (re-running the same pipeline sometimes succeeds), and for another project it fails consistently.

Looking at the logs, the spawned runner pod just stops without showing any error message at all. Please see logs below.

In the actual GitLab Runner pod, there isn’t much to see in the logs:

Checking for jobs... received                       job=610 repo_url=https://<project-url> runner=ztdc_eMB
WARNING: Error streaming logs gitlab-runner/runner-ztdcemb-project-14-concurrent-0nqlm6/helper:/logs-14-610/output.log: command terminated with exit code 137. Retrying...  job=610 project=14 runner=ztdc_eMB
WARNING: Error while executing file based variables removal script  error=pod "runner-ztdcemb-project-14-concurrent-0nqlm6" (on namespace "gitlab-runner") is not running and cannot execute commands; current phase is "Failed" job=610 project=14 runner=ztdc_eMB
WARNING: Job failed: pod "runner-ztdcemb-project-14-concurrent-0nqlm6" status is "Failed"  duration_s=241.875380283 job=610 project=14 runner=ztdc_eMB
WARNING: Failed to process runner                   builds=0 error=pod "runner-ztdcemb-project-14-concurrent-0nqlm6" status is "Failed" executor=kubernetes runner=ztdc_eMB
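
A note on the "exit code 137" in the runner log above: by shell convention, an exit code above 128 means the process was ended by a signal, with the signal number being the code minus 128. So 137 would be 128 + 9, i.e. SIGKILL, which on Kubernetes can point to something outside the build (for example the kernel OOM killer or a kubelet eviction) rather than the build script itself. The following is only a minimal sketch of that arithmetic; the function name decode_exit_code is made up for illustration.

```shell
#!/bin/sh
# Decode a process exit code into the signal that ended it, if any.
# Convention: exit code = 128 + signal number when a signal killed the process.
# decode_exit_code is a hypothetical helper name, not part of any tool.
decode_exit_code() {
    code=$1
    if [ "$code" -gt 128 ]; then
        sig=$((code - 128))
        # "kill -l <number>" prints the signal name without the SIG prefix
        echo "killed by signal $sig (SIG$(kill -l "$sig"))"
    else
        echo "exited on its own with status $code"
    fi
}

decode_exit_code 137
```

This does not say who sent the signal; it only narrows the search to whatever is allowed to kill containers on the node.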

Once the job is received, a new pod for concurrent execution is spawned, containing two containers: build and helper. The build container’s logs confuse me the most: everything seems normal until the container simply goes into the “Failed” state without any kind of error indication:

$ echo "{\"auths\":{\"$CI_REGISTRY\":{\"username\":\"$CI_REGISTRY_USER\",\"password\":\"$CI_REGISTRY_PASSWORD\"}}}" > /kaniko/.docker/config.json
$ /kaniko/executor --context $CI_PROJECT_DIR --dockerfile $CI_PROJECT_DIR/Dockerfile.dev --destination $CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG
E0924 07:44:57.941086      11 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated.
	For verbose messaging see aws.Config.CredentialsChainVerboseErrors
INFO[0000] Resolved base name node:16 to build-deps     
INFO[0000] Retrieving image manifest node:16            
INFO[0000] Retrieving image node:16 from registry index.docker.io 
INFO[0001] Retrieving image manifest nginx:1.19-alpine  
INFO[0001] Retrieving image nginx:1.19-alpine from registry index.docker.io 
INFO[0002] Built cross stage deps: map[0:[/usr/src/app/build]] 
INFO[0002] Retrieving image manifest node:16            
INFO[0002] Returning cached image manifest              
INFO[0002] Executing 0 build triggers                   
INFO[0002] Unpacking rootfs as cmd COPY package.json yarn.lock ./ requires it. 
INFO[0021] WORKDIR /usr/src/app                         
INFO[0021] cmd: workdir                                 
INFO[0021] Changed working directory to /usr/src/app    
INFO[0021] Creating directory /usr/src/app              
INFO[0021] Taking snapshot of files...                  
INFO[0022] COPY package.json yarn.lock ./               
INFO[0022] Taking snapshot of files...                  
INFO[0022] RUN yarn                                     
INFO[0022] Taking snapshot of full filesystem...        
INFO[0031] cmd: /bin/sh                                 
INFO[0031] args: [-c yarn]                              
INFO[0031] Running: [/bin/sh -c yarn]                   
yarn install v1.22.5
[1/4] Resolving packages...
[2/4] Fetching packages...
[3/4] Linking dependencies...
warning " > @n8tb1t/use-scroll-position@2.0.3" has unmet peer dependency "@types/react@*".
warning "react-scripts > @typescript-eslint/eslint-plugin > tsutils@3.17.1" has unmet peer dependency "typescript@>=2.8.0 || >= 3.2.0-dev || >= 3.3.0-dev || >= 3.4.0-dev || >= 3.5.0-dev || >= 3.6.0-dev || >= 3.6.0-beta || >= 3.7.0-dev || >= 3.7.0-beta".
warning "react-wow > react-addons-css-transition-group@15.6.2" has incorrect peer dependency "react@^15.4.2".
warning " > styled-components@5.2.1" has unmet peer dependency "react-is@>= 16.8.0".
warning " > @testing-library/user-event@12.2.0" has unmet peer dependency "@testing-library/dom@>=7.21.4".
[4/4] Building fresh packages...
Done in 88.58s.
INFO[0120] Taking snapshot of full filesystem...        
INFO[0225] COPY . ./                                    
INFO[0226] Taking snapshot of files...                  
INFO[0226] RUN yarn build                               
INFO[0226] cmd: /bin/sh                                 
INFO[0226] args: [-c yarn build]                        
INFO[0226] Running: [/bin/sh -c yarn build]             
yarn run v1.22.5
$ react-scripts build
(node:124) [DEP0148] DeprecationWarning: Use of deprecated folder mapping "./" in the "exports" field module resolution of the package at /usr/src/app/node_modules/postcss-safe-parser/node_modules/postcss/package.json.
Update this package.json to use a subpath pattern like "./*".
(Use `node --trace-deprecation ...` to show where the warning was created)
Creating an optimized production build...
Browserslist: caniuse-lite is outdated. Please run:
npx browserslist@latest --update-db

Why you should do it regularly:
https://github.com/browserslist/browserslist#browsers-data-updating

A few seconds after the last message, the container simply crashed. No more logs, no more information, it’s just gone. The last message about the browser data is only an informational warning. It also appears when building locally, and there are a few seconds between this message being printed and the container crashing.

The helper logs, for completeness, although they don’t contain anything useful as far as I can tell:

# kubectl -n gitlab-runner logs -f runner-ztdcemb-project-14-concurrent-0h95vm helper
Running on runner-ztdcemb-project-14-concurrent-0h95vm via gitlab-runner-gitlab-runner-6dcf9969b6-x9s9d...

{"command_exit_code": 0, "script": "/scripts-14-611/prepare_script"}
Fetching changes with git depth set to 50...
Initialized empty Git repository in /builds/ztdc_eMB/0/<project>/.git/
Created fresh repository.
Checking out 53493b66 as development...

Skipping Git submodules setup

{"command_exit_code": 0, "script": "/scripts-14-611/get_sources"}

Do you have any idea how I could go on and troubleshoot from here? The lack of any error indication is driving me a bit crazy.

And just for completeness, the full logs from the web interface job view:

Running with gitlab-runner 14.2.0 (58ba2b95)
  on gitlab-runner-gitlab-runner-6dcf9969b6-x9s9d ztdc_eMB
section_start:1632469492:prepare_executor
Preparing the "kubernetes" executor
Using Kubernetes namespace: gitlab-runner
Using Kubernetes executor with image gcr.io/kaniko-project/executor:debug ...
Using attach strategy to execute scripts...
section_end:1632469492:prepare_executor
section_start:1632469492:prepare_script
Preparing environment
Waiting for pod gitlab-runner/runner-ztdcemb-project-14-concurrent-0h95vm to be running, status is Pending
Running on runner-ztdcemb-project-14-concurrent-0h95vm via gitlab-runner-gitlab-runner-6dcf9969b6-x9s9d...

section_end:1632469496:prepare_script
section_start:1632469496:get_sources
Getting source from Git repository
Fetching changes with git depth set to 50...
Initialized empty Git repository in /builds/ztdc_eMB/0/<project>/.git/
Created fresh repository.
Checking out 53493b66 as development...

Skipping Git submodules setup

section_end:1632469497:get_sources
section_start:1632469497:step_script
Executing "step_script" stage of the job script
$ echo "{\"auths\":{\"$CI_REGISTRY\":{\"username\":\"$CI_REGISTRY_USER\",\"password\":\"$CI_REGISTRY_PASSWORD\"}}}" > /kaniko/.docker/config.json
$ /kaniko/executor --context $CI_PROJECT_DIR --dockerfile $CI_PROJECT_DIR/Dockerfile.dev --destination $CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG
E0924 07:44:57.941086      11 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated.
	For verbose messaging see aws.Config.CredentialsChainVerboseErrors
INFO[0000] Resolved base name node:16 to build-deps
INFO[0000] Retrieving image manifest node:16
INFO[0000] Retrieving image node:16 from registry index.docker.io
INFO[0001] Retrieving image manifest nginx:1.19-alpine
INFO[0001] Retrieving image nginx:1.19-alpine from registry index.docker.io
INFO[0002] Built cross stage deps: map[0:[/usr/src/app/build]]
INFO[0002] Retrieving image manifest node:16
INFO[0002] Returning cached image manifest
INFO[0002] Executing 0 build triggers
INFO[0002] Unpacking rootfs as cmd COPY package.json yarn.lock ./ requires it.
INFO[0021] WORKDIR /usr/src/app
INFO[0021] cmd: workdir
INFO[0021] Changed working directory to /usr/src/app
INFO[0021] Creating directory /usr/src/app
INFO[0021] Taking snapshot of files...
INFO[0022] COPY package.json yarn.lock ./
INFO[0022] Taking snapshot of files...
INFO[0022] RUN yarn
INFO[0022] Taking snapshot of full filesystem...
INFO[0031] cmd: /bin/sh
INFO[0031] args: [-c yarn]
INFO[0031] Running: [/bin/sh -c yarn]
yarn install v1.22.5
[1/4] Resolving packages...
[2/4] Fetching packages...
info fsevents@2.2.1: The platform "linux" is incompatible with this module.
info "fsevents@2.2.1" is an optional dependency and failed compatibility check. Excluding it from installation.
info fsevents@1.2.13: The platform "linux" is incompatible with this module.
info "fsevents@1.2.13" is an optional dependency and failed compatibility check. Excluding it from installation.
info fsevents@2.1.3: The platform "linux" is incompatible with this module.
info "fsevents@2.1.3" is an optional dependency and failed compatibility check. Excluding it from installation.
[3/4] Linking dependencies...
warning " > @n8tb1t/use-scroll-position@2.0.3" has unmet peer dependency "@types/react@*".
warning "react-scripts > @typescript-eslint/eslint-plugin > tsutils@3.17.1" has unmet peer dependency "typescript@>=2.8.0 || >= 3.2.0-dev || >= 3.3.0-dev || >= 3.4.0-dev || >= 3.5.0-dev || >= 3.6.0-dev || >= 3.6.0-beta || >= 3.7.0-dev || >= 3.7.0-beta".
warning "react-wow > react-addons-css-transition-group@15.6.2" has incorrect peer dependency "react@^15.4.2".
warning " > styled-components@5.2.1" has unmet peer dependency "react-is@>= 16.8.0".
warning " > @testing-library/user-event@12.2.0" has unmet peer dependency "@testing-library/dom@>=7.21.4".
[4/4] Building fresh packages...
Done in 88.58s.
INFO[0120] Taking snapshot of full filesystem...
INFO[0225] COPY . ./
INFO[0226] Taking snapshot of files...
INFO[0226] RUN yarn build
INFO[0226] cmd: /bin/sh
INFO[0226] args: [-c yarn build]
INFO[0226] Running: [/bin/sh -c yarn build]
yarn run v1.22.5
$ react-scripts build
(node:124) [DEP0148] DeprecationWarning: Use of deprecated folder mapping "./" in the "exports" field module resolution of the package at /usr/src/app/node_modules/postcss-safe-parser/node_modules/postcss/package.json.
Update this package.json to use a subpath pattern like "./*".
(Use `node --trace-deprecation ...` to show where the warning was created)
Creating an optimized production build...
Browserslist: caniuse-lite is outdated. Please run:
npx browserslist@latest --update-db

Why you should do it regularly:
https://github.com/browserslist/browserslist#browsers-data-updating
section_end:1632469737:step_script
section_start:1632469737:cleanup_file_variables
Cleaning up file based variables
section_end:1632469737:cleanup_file_variables
ERROR: Job failed: pod "runner-ztdcemb-project-14-concurrent-0h95vm" status is "Failed"

Have you found a solution? I’m experiencing a similar issue, and without additional info in the logs it’s hard to debug. I’m inclined to try the runner on a plain vanilla VM to see if the problem goes away.

Hi, unfortunately no solution yet. I did try a vanilla VM with a runner using the Docker executor, and that works fine. I also redeployed the K8s runner using the Helm chart, but this didn’t make any difference; it’s still failing.

Did you or anyone else figure out the cause / fix this?

I wonder if it’s related to the Kubernetes version. I’m using DigitalOcean K8s. Maybe there was an update around October that messed things up?

I ran into this issue today, and the reason turned out to be disk space. I checked RAM first, as this seems to be the most commonly suggested cause, but for me it was fine. Then I came across a comment saying it could be disk space, so I started the build and ran watch df -h on the Kubernetes node where it was running. The free space was rather low from the beginning and kept shrinking gradually, and the build got killed when there was about 1GB left (maybe less, as watch refreshes at 2-second intervals). After increasing the disk size, my builds now run successfully.
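
For anyone who wants to repeat this check with a record they can scroll back through, here is a rough sketch of the same idea as watch df -h, but logging one timestamped line per sample. The function names are made up, and the path to watch is an assumption: on a node, the relevant mount is whichever filesystem holds the kubelet and container runtime data, which varies by setup. The example uses / only so that it runs anywhere.

```shell
#!/bin/sh
# Print the used-space percentage of the filesystem that holds the given path.
# "df -P" gives stable POSIX columns; column 5 is the capacity, e.g. "42%".
# Both function names here are hypothetical helpers for illustration.
disk_used_pct() {
    df -P "$1" | awk 'NR==2 { sub("%", "", $5); print $5 }'
}

# Log usage <count> times, <interval> seconds apart, one timestamped line each.
log_disk_usage() {
    path=$1; interval=$2; count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "$(date +%H:%M:%S) $path $(disk_used_pct "$path")% used"
        i=$((i + 1))
        # Sleep between samples, but not after the last one
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Assumption: "/" stands in for the node's container storage mount.
log_disk_usage / 1 3
```

Started on the node just before the pipeline is triggered, with a larger count, this shows whether usage climbs toward full right before the pod goes to "Failed".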


The command below shows the events, including the reason why the GitLab Runner pod was terminated:
kubectl get events -n <namespace>
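
As a slightly more targeted variant, the events can be filtered to a single job pod and sorted by time, and the pod status itself can be queried for a termination reason. This is only a sketch: the namespace gitlab-runner and the pod name are taken from the logs earlier in this thread and will differ elsewhere, and the two function names are made up. Job pods are normally removed once the job ends, so the second function probably has to be run while the pod still exists; events are kept by the cluster for a limited time as well.

```shell
#!/bin/sh
# Assumption: the runner namespace from this thread; override via NAMESPACE.
NAMESPACE=${NAMESPACE:-gitlab-runner}

# List events that refer to one pod, sorted by last timestamp (oldest first).
# pod_events and pod_termination_reasons are hypothetical helper names.
pod_events() {
    kubectl get events -n "$NAMESPACE" \
        --field-selector "involvedObject.name=$1" \
        --sort-by=.lastTimestamp
}

# Print the pod-level reason (e.g. Evicted) and, per container, the
# termination reason (e.g. OOMKilled, Error) if the container has terminated.
pod_termination_reasons() {
    kubectl get pod "$1" -n "$NAMESPACE" \
        -o jsonpath='{.status.reason}{"\n"}{range .status.containerStatuses[*]}{.name}{": "}{.state.terminated.reason}{"\n"}{end}'
}

# Usage (pod name from the logs above):
#   pod_events runner-ztdcemb-project-14-concurrent-0nqlm6
#   pod_termination_reasons runner-ztdcemb-project-14-concurrent-0nqlm6
```

An eviction for low disk space would be expected to show up in the events as well as in the pod-level reason, while a memory kill would more likely appear as the container's termination reason.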
