GitLab pipeline broken after upgrade to 16.4.1

I recently updated my GitLab Runner to 16.4.1 and my pipeline suddenly stopped working. I have no clue what causes this. Before the upgrade, everything worked fine.

The pipeline is really simple: it just connects to a server via SSH and runs a script as a specific user. I have tried running the script as that user on the machine and it worked fine. Now it seems that GitLab completely ignores the script part of the pipeline. All variables are properly set and accessible.

my-pipeline:
  stage: deploy
  rules:
    - if: '$PROJECT_IDENTIFIER != null && $CI_PIPELINE_SOURCE == "web"'
      when: always
    - when: never
  before_script:
    - apk add --update --no-cache openssh
    - install -m 600 -D /dev/null ~/.ssh/id_rsa
    - echo "$SSH_PRIVATE_KEY" | base64 -d > ~/.ssh/id_rsa
    - ssh-keyscan -H $SSH_HOST > ~/.ssh/known_hosts
  script:
    - echo "start"
    - ssh $SSH_USER@$SSH_HOST "bash /scripts/myScript.sh $PROJECT_IDENTIFIER && exit"
    - echo "stop"
  after_script:
    - rm -rf ~/.ssh

The pipeline output looks like this:

Running with gitlab-runner 16.4.1 (XXX) on docker-runner XXX, system ID: XXX
Preparing the "docker" executor
Using Docker executor with image alpine:latest ...
Pulling docker image alpine:latest ...
Using docker image XXX for alpine:latest with digest XXXX ...
Preparing environment
Running on runner XXX
Getting source from Git repository
Fetching changes with git depth set to 20...
Reinitialized existing Git repository in XXX
Checking out 46e2c9a7 as detached HEAD (ref is main)...
Skipping Git submodules setup
Executing "step_script" stage of the job script
Using docker image XXX for alpine:latest with digest alpine@sha256:XXX ...
$ apk add --update --no-cache openssh
fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.18/community/x86_64/APKINDEX.tar.gz
(1/10) Installing openssh-keygen (9.3_p2-r0)
(2/10) Installing ncurses-terminfo-base (6.4_p20230506-r0)
(3/10) Installing libncursesw (6.4_p20230506-r0)
(4/10) Installing libedit (20221030.3.1-r1)
(5/10) Installing openssh-client-common (9.3_p2-r0)
(6/10) Installing openssh-client-default (9.3_p2-r0)
(7/10) Installing openssh-sftp-server (9.3_p2-r0)
(8/10) Installing openssh-server-common (9.3_p2-r0)
(9/10) Installing openssh-server (9.3_p2-r0)
(10/10) Installing openssh (9.3_p2-r0)
Executing busybox-1.36.1-r2.trigger
OK: 14 MiB in 25 packages
$ install -m 600 -D /dev/null ~/.ssh/id_rsa
$ echo "$SSH_PRIVATE_KEY" | base64 -d > ~/.ssh/id_rsa
$ ssh-keyscan -H $SSH_HOST > ~/.ssh/known_hosts
Running after_script
Running after script...
$ rm -rf ~/.ssh
Cleaning up project directory and file based variables
ERROR: Job failed: exit code 1

I have even tried removing the SSH command to see whether it was what prevented at least echo "start" from running, but with no success. It seems that GitLab completely ignores the script part of my pipeline.

Possibly ssh-keyscan -H $SSH_HOST > ~/.ssh/known_hosts fails now. A job is usually terminated after the first command that exits with a code > 0.
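To illustrate why nothing from the script section shows up in the log: GitLab documents that when a script command returns a non-zero exit code, the job fails and the remaining commands do not run. The sketch below only models that behaviour with a plain `sh -c` and `set -e`; it is an analogy, not the runner's actual generated script, and `false` is just a stand-in for the failing ssh-keyscan call.

```shell
# A stand-in "job": three commands run under `set -e`.
script='set -e
echo "step 1"
false
echo "step 2"'

# Run it in a separate shell and record the exit code without
# aborting the calling shell.
out=$(sh -c "$script") && rc=0 || rc=$?

echo "output: $out"      # only "step 1" was printed
echo "exit code: $rc"    # non-zero, "step 2" was never reached
```

This matches the log above: the last command echoed is the one that failed, and after_script still runs afterwards.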

You are probably right. I moved the whole before_script into script and added an echo after each step, but the echo after the line you mentioned is not printed.

But I am unable to find out why. I am running the alpine Docker image on my gitlab-runner, and the systemctl logs for gitlab-runner contain only the same information I see in the pipeline (that the process ended with code 1).

You can print the previous command's exit status using echo $?. I suggest adding this line to the script section where you suspect the errors.

ssh-keyscan also supports -v (verbose), see https://linux.die.net/man/1/ssh-keyscan. Note that the original command redirects its output into ~/.ssh/known_hosts. I suggest changing the script steps to

  before_script:
    - apk add --update --no-cache openssh
    - install -m 600 -D /dev/null ~/.ssh/id_rsa
    - echo "$SSH_PRIVATE_KEY" | base64 -d > ~/.ssh/id_rsa
    - ssh-keyscan -v -H $SSH_HOST
    - echo $?
    - ssh-keyscan -H $SSH_HOST > ~/.ssh/known_hosts

and see if there are errors.
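One caveat: since a job stops at the first command that returns a non-zero exit code, a separate `echo $?` line may never run when the command before it fails. A minimal sketch of capturing the status on the same line instead is shown here. `report_status` is a hypothetical helper, not part of GitLab or OpenSSH, and `true`/`false` stand in for the real command.

```shell
# report_status (hypothetical helper): runs the given command, prints its
# exit status, and always returns 0 so the surrounding script keeps going.
report_status() {
  rc=0
  "$@" || rc=$?
  echo "exit status of '$1': $rc"
  return 0
}

report_status true
report_status false

# In the job this could be used as:
#   report_status ssh-keyscan -v -H "$SSH_HOST"
```

Without a helper, the same idea fits on one YAML line, for example `- ssh-keyscan -v -H "$SSH_HOST" || echo "ssh-keyscan exit status $?"`. The message deliberately avoids a colon followed by a space, which would otherwise require quoting the whole YAML item.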

Hi, thanks for reaching out.

I've tried what you suggested, but with no success: the output does not print anything.

$ install -m 600 -D /dev/null ~/.ssh/id_rsa
$ echo "$SSH_PRIVATE_KEY" | base64 -d > ~/.ssh/id_rsa
$ ssh-keyscan -v -H $SSH_HOST
Running after_script
Running after script...
$ rm -rf ~/.ssh

In the journal there is just

Oct 17 14:12:39 johnczekVPS gitlab-runner[2360945]: WARNING: Job failed: exit code 1
Oct 17 14:12:39 johnczekVPS gitlab-runner[2360945]:                    duration_s=10.885858581 job=5311384781 project=44995680 runner=nZt7k9z86

As usual.

If you have access to the server hosting GitLab Runner, or can run an Alpine container yourself, I would try running ssh-keyscan manually to see what happens.

An alternative is to add the following to ~/.ssh/config and remove the ssh-keyscan step:

Host *
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null
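As a rough sketch of how such a config could be written from a job step: `write_ssh_config` and the temporary demo directory below are hypothetical names used only for illustration; in a real job the target would be `~/.ssh`. Keep in mind that turning off host key checking means the client no longer verifies which server it is talking to, so this trades safety for convenience.

```shell
# write_ssh_config (hypothetical helper): creates the given directory and
# writes an ssh client config that skips host key verification.
write_ssh_config() {
  ssh_dir=$1
  mkdir -p "$ssh_dir"
  chmod 700 "$ssh_dir"
  cat > "$ssh_dir/config" <<'EOF'
Host *
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null
EOF
  chmod 600 "$ssh_dir/config"
}

# Demo against a throwaway directory instead of the real ~/.ssh.
demo_dir=$(mktemp -d)
write_ssh_config "$demo_dir/.ssh"
cat "$demo_dir/.ssh/config"
```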

Running ssh-keyscan -v -H $SSH_HOST as the user that is used for GitLab pipelines, with the variable properly set, passes without any error. There is no config for this user.

This pipeline worked like a charm on the same machine with no changes; it stopped working after updating GitLab Runner.

Have you tried to echo $SSH_HOST before the ssh-keyscan just to check it is not mangled somehow?
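A plain echo can hide trailing spaces or a stray carriage return, so a slightly stricter check may help. The sketch below is only an illustration: `check_host` is a hypothetical helper, and the set of allowed characters (letters, digits, `.`, `:`, `_`, `-`) is an assumption that may need adjusting for your host names.

```shell
# check_host (hypothetical helper): prints the value between brackets so
# leading/trailing whitespace is visible, and flags unexpected characters.
check_host() {
  printf 'SSH_HOST=[%s]\n' "$1"
  case $1 in
    '' | *[!A-Za-z0-9.:_-]*)
      echo "suspicious value"
      return 1
      ;;
  esac
  echo "looks plain"
}

check_host "example.com"
check_host "example.com " || true   # trailing space is flagged
```

In the job, the call would be `check_host "$SSH_HOST"` placed right before the ssh-keyscan line.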

If you feel this is a bug, you can raise a new issue in the GitLab Runner issue tracker. They might be better able to help you debug the issue.

I tried to reproduce the problem on GitLab.com SaaS with a minimal example.

image: alpine:latest

variables:
  SSH_HOST: "dnsmichi.at"
  SSH_USER: "doesnotexist"
  SSH_PRIVATE_KEY: "somethingxyzhash"

my-job:
  before_script:
    - apk add --update --no-cache openssh
    - install -m 600 -D /dev/null ~/.ssh/id_rsa
    - echo "$SSH_PRIVATE_KEY" | base64 -d > ~/.ssh/id_rsa
    - ssh-keyscan -H $SSH_HOST > ~/.ssh/known_hosts
  script:
    - echo "Doing some SSH to $SSH_HOST"
    - ssh $SSH_USER@$SSH_HOST "bash /scripts/myScript.sh $PROJECT_IDENTIFIER && exit"
  after_script:
    - rm -rf ~/.ssh

I did fill all variables with values, leading to expected results. For your problem, I would suggest printing SSH_HOST etc. and verifying that these environment variables are filled with values. SSH commands sometimes behave weirdly when value strings are missing.
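One way to make that check explicit is to fail early with a readable message when a variable is empty. This is a minimal sketch under the assumption that the job runs in a POSIX shell; `require_vars` and the `DEMO_` variables are hypothetical names, not GitLab features.

```shell
# require_vars (hypothetical helper): takes variable NAMES and reports
# every one that is unset or empty. Returns 1 if any is missing.
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing or empty: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

DEMO_SSH_HOST="example.com"
DEMO_SSH_USER=""
require_vars DEMO_SSH_HOST DEMO_SSH_USER || echo "refusing to continue"
```

In the pipeline this would become something like `require_vars SSH_HOST SSH_USER SSH_PRIVATE_KEY PROJECT_IDENTIFIER` as the first step, so that an empty value is reported before ssh or ssh-keyscan is called.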

It could also be a bug; I cannot test with a local GitLab Runner 16.4.1 at the moment.

Thanks for your time. I have tried echoing all variables (including printing the private key at its .ssh destination) and everything showed the right data.

After a few attempts I gave up and rewrote the pipeline body like this:

  before_script:
    - apk add --no-cache openssh-client bash
    - mkdir -p ~/.ssh
    - echo -e "Host *\n\tStrictHostKeyChecking no\n\n" > ~/.ssh/config
    - echo $SSH_PRIVATE_KEY | base64 -d > id_rsa
    - chmod 700 id_rsa
    - mv id_rsa ~/.ssh/id_rsa
  script:
    - ssh $SSH_USER@$SSH_HOST "bash /scripts/myScript.sh $PROJECT_IDENTIFIER"
  after_script:
    - rm -rf ~/.ssh

And that seems to work. I have no clue why the previous version did not work, but I honestly have no will to waste more time on it. So if anyone faces a similar issue, I hope my edited pipeline will help.
