Gitlab-runner fails all jobs when running with systemd

Hi there,

I run an up-to-date Debian Buster on a VPS. I installed gitlab-ce and gitlab-runner from packages.gitlab.com, and therefore have the following versions:

# apt show gitlab-ce
Package: gitlab-ce
Version: 12.0.3-ce.0

# apt show gitlab-runner
Package: gitlab-runner
Version: 12.0.1

gitlab-runner is launched by systemd, and the associated file in /etc/systemd/system/gitlab-runner.service contains this:

[Unit]
Description=GitLab Runner
After=syslog.target network.target
ConditionFileIsExecutable=/usr/lib/gitlab-runner/gitlab-runner

[Service]
StartLimitInterval=5
StartLimitBurst=10
ExecStart=/usr/lib/gitlab-runner/gitlab-runner "--debug" "run" "--working-directory" "/home/gitlab-runner" "--config" "/etc/gitlab-runner/config.toml" "--service" "gitlab-runner" "--syslog" "--user" "gitlab-runner"
Restart=always
RestartSec=120

[Install]
WantedBy=multi-user.target

I successfully registered a shared runner, but all jobs fail. For example, one repository contains the following .gitlab-ci.yml:

---
stages:
    - precompile

precompile:
    stage: precompile
    tags:
        - base
    before_script:
        - docker --version
    script:
        - docker build . -t cv:${CI_PIPELINE_ID}

The job is received by gitlab-runner, but immediately fails. Here is the log from systemd:

gitlab-runner[30858]: Checking for jobs... received                       job=106 repo_url=https://gitlab.dunatotatos.com/dunatotatos/cv.git runner=bUxm5K1S
gitlab-runner[30858]: Checking for jobs... received                       job=106 repo_url=https://gitlab.dunatotatos.com/dunatotatos/cv.git runner=bUxm5K1S
gitlab-runner[30858]: Failed to requeue the runner:                       builds=1 runner=bUxm5K1S
gitlab-runner[30858]: Running with gitlab-runner 12.0.1 (0e5417a3)        job=106 project=26 runner=bUxm5K1S
gitlab-runner[30858]:   on Azazel bUxm5K1S                                job=106 project=26 runner=bUxm5K1S
gitlab-runner[30858]: Shell configuration: environment: []
gitlab-runner[30858]: dockercommand:
gitlab-runner[30858]: - sh
gitlab-runner[30858]: - -c
gitlab-runner[30858]: - "if [ -x /usr/local/bin/bash ]; then\n\texec /usr/local/bin/bash --login\nelif [
gitlab-runner[30858]:   -x /usr/bin/bash ]; then\n\texec /usr/bin/bash --login\nelif [ -x /bin/bash ]; then\n\texec
gitlab-runner[30858]:   /bin/bash --login\nelif [ -x /usr/local/bin/sh ]; then\n\texec /usr/local/bin/sh
gitlab-runner[30858]:   --login\nelif [ -x /usr/bin/sh ]; then\n\texec /usr/bin/sh --login\nelif [ -x /bin/sh
gitlab-runner[30858]:   ]; then\n\texec /bin/sh --login\nelif [ -x /busybox/sh ]; then\n\texec /busybox/sh
gitlab-runner[30858]:   --login\nelse\n\techo shell not found\n\texit 1\nfi\n\n"
gitlab-runner[30858]: command: su
gitlab-runner[30858]: arguments:
gitlab-runner[30858]: - -s
gitlab-runner[30858]: - /bin/bash
gitlab-runner[30858]: - gitlab-runner
gitlab-runner[30858]: - -c
gitlab-runner[30858]: - bash --login
gitlab-runner[30858]: passfile: false
gitlab-runner[30858]: extension: ""
gitlab-runner[30858]:   job=106 project=26 runner=bUxm5K1S
gitlab-runner[30858]: Using Shell executor...                             job=106 project=26 runner=bUxm5K1S
gitlab-runner[30858]: Waiting for signals...                              job=106 project=26 runner=bUxm5K1S
gitlab-runner[30858]: Executing build stage                               build_stage=prepare_script job=106 project=26 runner=bUxm5K1S
su[2599]: (to gitlab-runner) root on none
su[2599]: pam_unix(su:session): session opened for user gitlab-runner by (uid=0)
su[2599]: pam_unix(su:session): session closed for user gitlab-runner
gitlab-runner[30858]: Executing build stage                               build_stage=upload_artifacts_on_failure job=106 project=26 runner=bUxm5K1S
su[2619]: (to gitlab-runner) root on none
su[2619]: pam_unix(su:session): session opened for user gitlab-runner by (uid=0)
su[2619]: pam_unix(su:session): session closed for user gitlab-runner
gitlab-runner[30858]: WARNING: Job failed: exit status 1                  duration=146.489509ms job=106 project=26 runner=bUxm5K1S
gitlab-runner[30858]: WARNING: Job failed: exit status 1                  duration=146.489509ms job=106 project=26 runner=bUxm5K1S
gitlab-runner[30858]: Appending trace to coordinator... ok                code=202 job=106 job-log=0-20037 job-status=running runner=bUxm5K1S sent-log=0-20036 status=202 Accepted
gitlab-runner[30858]: Submitting job to coordinator... ok                 code=200 job=106 job-status= runner=bUxm5K1S
gitlab-runner[30858]: ERROR: Failed to process runner                     builds=0 error=exit status 1 executor=shell runner=bUxm5K1S
gitlab-runner[30858]: ERROR: Failed to process runner                     builds=0 error=exit status 1 executor=shell runner=bUxm5K1S

The output on gitlab itself is not of much help:

Running with gitlab-runner 12.0.1 (0e5417a3)
  on Azazel bUxm5K1S
Using Shell executor...
Running on azazel...
ERROR: Job failed: exit status 1

Running with CI_DEBUG_TRACE gives more info, but I’m a bit confused. (I would happily post the whole log, but formatting is a pain, and I cannot post attachments.)

Running with gitlab-runner 12.0.1 (0e5417a3)
  on Azazel bUxm5K1S
Using Shell executor...

+ set -eo pipefail
+ set +o noclobber
+ :
+ eval 'echo "Running on $(hostname)..."
'
+++ hostname
++ echo 'Running on azazel...'
Running on azazel...
+ exit 0
++ '[' 1 = 1 ']'
++ '[' -x /usr/bin/clear_console ']'
++ /usr/bin/clear_console -q

+ set -eo pipefail
+ set +o noclobber
+ :
+ eval 'export FF_CMD_DISABLE_DELAYED_ERROR_LEVEL_EXPANSION=$'\''false'\''
export FF_USE_LEGACY_BUILDS_DIR_FOR_DOCKER=$'\''false'\''
[...]
++ export CI_RUNNER_EXECUTABLE_ARCH=linux/amd64
++ CI_RUNNER_EXECUTABLE_ARCH=linux/amd64
++ cd /home/gitlab-runner/builds/bUxm5K1S/0/dunatotatos/cv
+ exit 0
++ '[' 1 = 1 ']'
++ '[' -x /usr/bin/clear_console ']'
++ /usr/bin/clear_console -q
ERROR: Job failed: exit status 1

Why would the job fail when executing clear_console, while it went perfectly fine during the first step? I guess the error is not there…

HOWEVER, everything works as expected when gitlab-runner is running from the command line through an SSH connection, with the following command:

/usr/lib/gitlab-runner/gitlab-runner "--debug" "run" "--working-directory" "/home/gitlab-runner" "--config" "/etc/gitlab-runner/config.toml" "--service" "gitlab-runner" "--syslog" "--user" "gitlab-runner"

This is the exact same command launched by systemd. I’ve tried to set a matching $PATH and $PWD, and the behavior is still the same. Jobs crash with systemd, and run successfully when launched from the command line. I’ve also tried to use a runner dedicated to the repository instead of shared, to observer still the same behavior. Last but not least, I’ve used strace on the process launched by systemd, but the output is a bit too large for a detailed analysis.

Is this a bug, or (most likely) a misconfiguration from my side?
Please let me know if anything is missing in my post.

Thanks in advance for your help!
Duna

I will try to dig deeper based on this post: https://unix.stackexchange.com/a/339645/147672

Bump? I need your help, I’m far from an expert…

EDIT: I just tried to run it with nohup, and the job fails as well. It really looks like a problem with the non-interactive shell.

EDIT: But nohup clear_console -q & does not crash, and the background process exits with 0.

EDIT: Removing the --user option fixes the problem. I think I’m on a good track.

EDIT: Multiple issues refer to bash --login executed from gitlab-runner. I seriously suspect this command to be the problem.

# nohup "su" "-s" "/bin/bash" "gitlab-runner" "-c" "bash --login"
nohup: ignoring input and appending output to 'nohup.out'
# echo $?
2

EDIT: I found by chance that clear_console -q is not executed by gitlab-runner, but is a default configuration in ~/.bash_logout when creating a new user under Debian. Commenting this command fixes the problem.
Also, found afterwards that the problem has already been reported, and a solution has been provided:
Fixed with : https://gitlab.com/gitlab-org/gitlab-runner/issues/3849#note_134536963

Marking as solved. Thanks all for your help!

4 Likes

I’m seeing what appears to be the same symptoms, though I’m starting the runner as a user via a cron job (no root privileges). But, I don’t have a .bash_logout, and there’s no /etc/bash_logout on this system. (also, it’s SLES-based, not Debian - there’s no clear_console that I can find).

I’m starting the runner by calling a script that sets up a working directory, and then starts gitlab-runner via nohup. It works fine when started from a terminal, so I don’t think the nohup is to blame. I’m not using the --user flag.

CI_DEBUG_TRACE doesn’t seem to show anything extra. I’m starting the runner with this line (in a script started from cron):

nohup $prefix/gitlab-runner --debug run --config $prefix/config.$(hostname).toml --working-directory $workdir > $workdir/runner.log-$now 2>&1 &

When I try to run a pipeline, the log shows:

No referees configured                              job=7559 project=353 runner=KoouGDRd
Executing build stage                               build_stage=prepare_script job=7559 project=353 runner=KoouGDRd
Preparing environment                   job=7559 project=353 runner=KoouGDRd
Feeding runners to channel                          builds=1
Appending trace to coordinator... ok                code=202 job=7559 job-log=0-346 job-status=running runner=KoouGDRd sent-log=0-345 status=202 Accepted update-interval=3s
Submitting job to coordinator... ok                 code=200 job=7559 job-status= runner=KoouGDRd
...
Skipping referees execution                         job=7559 project=353 runner=KoouGDRd
WARNING: Job failed: canceled                       duration=2m33.255789094s job=7559 project=353 runner=KoouGDRd
WARNING: Failed to parse "X-GitLab-Trace-Update-Interval" header  error=strconv.Atoi: parsing "": invalid syntax header-value= job=7559 runner=KoouGDRd
WARNING: Appending trace to coordinator... aborted  code=403 job=7559 job-log= job-status=canceled runner=KoouGDRd sent-log=346-595 status=403 Forbidden update-interval=0s
WARNING: Submitting job to coordinator... aborted   code=403 job=7559 job-status=canceled runner=KoouGDRd
WARNING: Failed to process runner                   builds=0 error=canceled executor=shell runner=KoouGDRd

My workaround for now is to start the runner from the command line in a detached screen session.

Any ideas about why clear_console causes the problem for Debian? Maybe my setup fails for the same underlying reason.

thanks!
Steve