GitLab runner seems to stall when executing the 2nd, 3rd or n-th ssh-based command

Hi folks,

I hit a really mysterious problem.
I use GitLab Runner with a Ruby image in the .yml file.
The whole setup of the image, SSH and the scripts works fine.
Then a job “deploy_stage” runs three steps on the remote host.
The 1st one also always runs fine.
The 2nd is a simple rsync command - this one currently fails.

It seems that the job always runs the first SSH command and then stalls / times out on the 2nd or 3rd.

After the timeout there is no further error message anywhere.
I even tried logging on the remote side. There is nothing.

# define possible stages

stages:
  - deploy

variables:

  REMOTE_SCRIPT_PATH: "/files/scripts"
  REMOTE_BACKUP_SCRIPT: "backup_typo3.bsh"
  REMOTE_CLEANUP_SCRIPT: "cleanup.bsh"
  REMOTE_LOG_PATH: "/files/logs"

  REMOTE_DEPLOY_SCRIPT_PATH: "/files/scripts/remote-target"
  REMOTE_DEPLOY_SCRIPT: "deploy.bsh"
  # SSH_PRIVATE_KEY_GITLAB_CI is set in GitLab CI/CD variables

.deploy_stage_template:

  before_script:
    # configure SSH-keys
    - . ./_gitlab-ci/Scripts/prepare-ssh-key.bsh
    - prepareSshKey "$STAGE_SERVER" "$SSH_PRIVATE_KEY_GITLAB_CI"

    ##
    ## copy scripts for remote-target-server
    - ssh $STAGE_SERVER_USER@$STAGE_SERVER "if [ ! -d $REMOTE_DEPLOY_SCRIPT_PATH ]; then mkdir -p $REMOTE_DEPLOY_SCRIPT_PATH; echo Created directory $REMOTE_DEPLOY_SCRIPT_PATH; fi"
    - scp -r _gitlab-ci/remote-target/. $STAGE_SERVER_USER@$STAGE_SERVER:$REMOTE_DEPLOY_SCRIPT_PATH/
    - ssh $STAGE_SERVER_USER@$STAGE_SERVER "chmod ug+x $REMOTE_DEPLOY_SCRIPT_PATH/*"

    ## add logging directory when needed
    - ssh $STAGE_SERVER_USER@$STAGE_SERVER "if [ ! -d $REMOTE_LOG_PATH ]; then mkdir -p $REMOTE_LOG_PATH; [ -d $REMOTE_LOG_PATH ] && echo Created directory $REMOTE_LOG_PATH; fi"

  after_script:
    # configure SSH-keys
    - . ./_gitlab-ci/Scripts/prepare-ssh-key.bsh
    - prepareSshKey "$STAGE_SERVER" "$SSH_PRIVATE_KEY_GITLAB_CI"

    ##
    ## cleanup remote-scripts from remote-target-server
    - ssh $STAGE_SERVER_USER@$STAGE_SERVER "if [ -d $REMOTE_DEPLOY_SCRIPT_PATH ] && [ $REMOTE_DEPLOY_SCRIPT_PATH != '/' ]; then rm -rf $REMOTE_DEPLOY_SCRIPT_PATH; echo Removed directory $REMOTE_DEPLOY_SCRIPT_PATH; fi;"



deploy_stage:
  # inherit before_script / after_script from .deploy_stage_template
  extends: .deploy_stage_template
  ######################################################################
  #
  # deploy to stage STAGE_SERVER
  #
  ######################################################################

  stage: deploy

  image: ruby:2.5

  only:
    refs:
    # - master
    - /^release\/.*$/
    # -  feature\/ci-test
    # -  /^gitlab-ci-dev$/

  # define which job artifacts should be available in this job
  dependencies:
#   - building_vbfrontend_gulp-frontend

  environment:
    name: $STAGE_SERVER
    url: $STAGE_SERVER_URL

  when: on_success

  variables:

    STAGE_SERVER_USER: "08154711"
    STAGE_SERVER: "domain.tld"
    STAGE_SERVER_URL: "https://domain.tld"
    STAGE_SERVER_ROOT_PATH: "/html/my-cms"

  script:
    ## call script showing env vars first and enforcing return code true on grep
    - which bash
    - chmod ug+x ./_gitlab-ci/Scripts/*
    - env | grep -E '(REMOTE|SERVER)' || true
    - . ./_gitlab-ci/Scripts/deploy-stages.bsh

This is the shortened .yml file.
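For context, prepare-ssh-key.bsh itself is not included here; a helper like that typically just does the usual ssh-agent / known_hosts setup. The following is only a rough sketch of such a function, not the actual script:

## rough sketch of a prepareSshKey helper (not the real prepare-ssh-key.bsh):
## start an agent, load the key from the CI/CD variable, trust the host key
prepareSshKey() {
  local host="$1"
  local private_key="$2"

  eval "$(ssh-agent -s)"
  echo "$private_key" | tr -d '\r' | ssh-add -

  mkdir -p ~/.ssh
  chmod 700 ~/.ssh
  ssh-keyscan "$host" >> ~/.ssh/known_hosts 2>/dev/null
  chmod 644 ~/.ssh/known_hosts
}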
Also, here is the part of the script that is executed on the GitLab Runner and runs the commands on the remote server:

## takes all deploy steps into account
## runs on gitlab-runner
## o backup remote target
## o rsync repository to remote target
## o execute deployment on remote target
## o cleanup backup dir

## make backup of target system
ssh $STAGE_SERVER_USER@$STAGE_SERVER "cd $REMOTE_SCRIPT_PATH; ./$REMOTE_BACKUP_SCRIPT"

## copy repository to remote
echo "copy relevant repository files to remote target server ($STAGE_SERVER_USER@$STAGE_SERVER:$STAGE_SERVER_ROOT_PATH)"
ssh $STAGE_SERVER_USER@$STAGE_SERVER
rsync -avz --exclude '.*' --exclude '_*' ./* $STAGE_SERVER_USER@$STAGE_SERVER:$STAGE_SERVER_ROOT_PATH
ret=$?
echo "copy of relevant repository files to remote target server ($STAGE_SERVER_USER@$STAGE_SERVER:$STAGE_SERVER_ROOT_PATH) ended with $ret"


## execute remote target deploy script
echo "executing remote deploy script (${REMOTE_DEPLOY_SCRIPT_PATH}/${REMOTE_DEPLOY_SCRIPT})"
ssh $STAGE_SERVER_USER@$STAGE_SERVER
ssh $STAGE_SERVER_USER@$STAGE_SERVER "cd $STAGE_SERVER_ROOT_PATH; pwd; ${REMOTE_DEPLOY_SCRIPT_PATH}/${REMOTE_DEPLOY_SCRIPT}"
ret=${?}
echo "remote deploy script (${REMOTE_DEPLOY_SCRIPT_PATH}/${REMOTE_DEPLOY_SCRIPT}) return code $ret"

## clean up backup directory: files older than 90 days are deleted
echo "executing remote cleanup script (${REMOTE_DEPLOY_SCRIPT_PATH}/${REMOTE_CLEANUP_SCRIPT})"
ssh $STAGE_SERVER_USER@$STAGE_SERVER
ssh $STAGE_SERVER_USER@$STAGE_SERVER "${REMOTE_DEPLOY_SCRIPT_PATH}/${REMOTE_CLEANUP_SCRIPT} -d=/files/backups -a=90"
ret=${?}
echo "remote cleanup script (${REMOTE_DEPLOY_SCRIPT_PATH}/${REMOTE_CLEANUP_SCRIPT}) return code $ret"

I would really appreciate any hints, comments or solutions for this ‘hidden’ problem / behavior.

Many thanks,

Thomas

Hello dear GitLab experts,

The problem still persists. It seems to be a known issue that shows up in several places:

I have been able to narrow the problem down further and am now torn between two possible causes:

The bash script “hangs” forever: https://stackoverflow.com/questions/7114990/pseudo-terminal-will-not-be-allocated-because-stdin-is-not-a-terminal

  • Actually, the first SSH command works every time. After that, the next one starts hanging. Apparently the connection to the remote already fails, since the remote does not produce any output.

or: ssh_exchange_identification: https://unix.stackexchange.com/questions/151860/ssh-exchange-identification-read-connection-reset-by-peer

=> This has many causes, like a blocked IP or a wrong key (which I can prove is correct), …
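One way to tell the two causes apart would be to run one of the later ssh commands with full debug output and stdin explicitly closed - a sketch, reusing the variables from the job above:

## first ssh command as in the script (this one always works)
ssh $STAGE_SERVER_USER@$STAGE_SERVER "cd $REMOTE_SCRIPT_PATH; ./$REMOTE_BACKUP_SCRIPT"

## second connection with debug output ("-vvv") and stdin closed ("-n"):
## if it already stalls during the banner exchange, that points to the
## ssh_exchange_identification / connection-limit side; if the session opens
## and only then hangs, that points to the stdin / pseudo-terminal side
ssh -vvv -n $STAGE_SERVER_USER@$STAGE_SERVER "echo second connection ok"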

Any hints are super appreciated.

Thx, Thomas

We have the same issue over here :-/
Do the scp’ed files end up on the destination? In our case they can be found in one of the tmp folders (it seems it’s not getting cleaned up).

Hi Basti,
I found out that it MAY have to do with both of these:

  • ssh does not necessarily play well with IPv6. If both servers are connected via IPv6, you can enforce IPv4 by using the “-4” option on rsync, scp and ssh;

  • my target host has a restriction of max. 150 connections per 3 min.; the hosting support told me that the CI job gets close to that limit. I could not confirm this, since I only had a few ssh commands. But I switched to persistent (multiplexed) ssh connections. You can use an ssh config file for that, containing:

    ControlMaster auto
    ControlPath ~/.ssh/ssh-%r@%h:%p.sock
    ControlPersist 60s

Use “ssh -O stop …” (capital O) to close the control socket at the end.
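Put together, a rough sketch of how this can look inside the CI job (shell commands for the before_script and the end of the job; the Host pattern and the test command are just examples). The “AddressFamily inet” line is the config-file equivalent of the “-4” option:

## before_script: enable connection sharing and force IPv4
mkdir -p ~/.ssh
cat >> ~/.ssh/config <<'EOF'
Host *
    AddressFamily inet
    ControlMaster auto
    ControlPath ~/.ssh/ssh-%r@%h:%p.sock
    ControlPersist 60s
EOF

## all following ssh / scp / rsync calls reuse a single TCP connection
ssh $STAGE_SERVER_USER@$STAGE_SERVER "echo connection established"

## end of job / after_script: close the shared master connection again
ssh -O stop $STAGE_SERVER_USER@$STAGE_SERVER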

Hope it helps, Th.

Hi Thomas,

the ssh connection is made to a v4 address; no hostnames are involved that could interfere with v6 (even though the target server is dual-stacked)

we (I’m from the hosting company) do not set up any restrictions like that on the servers. I’ll ask the customer to use multiplexed sessions and see if it gets better

cheers
basti