GitLab jobs stall while running SSH commands

Hi

After upgrading to GitLab 14 last week, I have been seeing global job timeouts on some pipelines.

What those pipelines have in common is that they use a Docker executor to run Ansible. The jobs start, and after the global timeout of 1 hour GitLab marks them as failed. What actually happens is that the SSH session goes stale and its return is never noticed by GitLab.

I started digging and stripped it down to this (sleep is just an example of a long-running command):

  • I run an SSH command via the GitLab script section:
ssh <..> random.host  'sleep 40'
  • The command is executed on the remote host. In the auth.log on the remote host I can see that the SSH session is terminated after the given time, and I see the sleep in the process list, which disappears after the given time. So everything works on that side.
  • GitLab still thinks the job is running and eventually hits the 1 h timeout.
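
One way to make this visible in the job log itself is a small debug wrapper around the call (just a sketch; random.host and ${ssh_user} are the placeholders used elsewhere in this post). If the second echo never appears in the job output, the job shell never saw the SSH command return:

# hypothetical debug wrapper for the job's script section
echo "ssh start: $(date -u +%H:%M:%S)"
ssh -o StrictHostKeyChecking=no -l "${ssh_user}" random.host 'sleep 40'
echo "ssh returned with exit code $? at $(date -u +%H:%M:%S)"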

Any long-running SSH command I trigger manually works fine:

  • from the host where GitLab is running
  • from a container on that host
  • even when I exec into the running job container:
/ $ ps axf
PID USER TIME COMMAND
1 root 0:00 /bin/sh
8 root 0:00 /bin/sh
13 root 0:00 ssh-agent -s
27 root 0:00 ssh -vvv -o ControlMaster=auto -o ControlPersist=yes -o StrictHostKeyChecking=no -l random_user controlplane01-kube-dev.random.hoster sleep 25
34 root 0:00 sh
53 root 0:00 ps axf
/ $ ssh -i ssh.priv -o ControlMaster=auto -o ControlPersist=yes -o StrictHostKeyChecking=no -l random_user controlplane01-kube-dev.random.hoster sleep 100
/ $ echo $?
0

The GitLab CI debug trace just shows this and then hits the global timeout 1 h later:

$ ssh -o StrictHostKeyChecking=no -l ${ssh_user} controlplane01-kube-dev.random.hoster 'sleep 100'
+ echo '$ ssh -o StrictHostKeyChecking=no -l ${ssh_user} controlplane01-kube-dev.random.hoster '"'"'sleep 100'"'"''
+ ssh -o 'StrictHostKeyChecking=no' -l ansible_devops controlplane01-kube-dev.random.hoster 'sleep 100'
Warning: Permanently added 'controlplane01-kube-dev.random.hoster,10.1.101.10' (ECDSA) to the list of known hosts.

The GitLab Runner trace log just shows the following (I isolated the job on a dedicated runner):

Submitting job to coordinator... ok                 code=200 job=27033 job-status= runner=ujy-Ls7N update-interval=0s
Submitting job to coordinator... ok                 code=200 job=27033 job-status= runner=ujy-Ls7N update-interval=0s
Submitting job to coordinator... ok                 code=200 job=27033 job-status= runner=ujy-Ls7N update-interval=0s
Submitting job to coordinator... ok                 code=200 job=27033 job-status= runner=ujy-Ls7N update-interval=0s
Submitting job to coordinator... ok                 code=200 job=27033 job-status= runner=ujy-Ls7N update-interval=0s
Submitting job to coordinator... ok                 code=200 job=27033 job-status= runner=ujy-Ls7N update-interval=0s

On the target host I see the connection for exactly as long as it should exist:

4157972 ?        Ss     0:00  \_ sshd: ansible_devops [priv]
4158002 ?        S      0:00      \_ sshd: ansible_devops@notty
4158003 ?        Ss     0:00          \_ sleep 45

I tried downgrading the GitLab Runner image to 13.10 without any success.
I think GitLab somehow loses track of the command and never receives the “command finished” return of the SSH command, but I cannot figure out why.

I’m a little lost as to how to get this working again. A downgrade to GitLab 13 is not an option, and neither is using an SSH executor, as there are more steps before and after the Ansible run.

Any help appreciated :wink:

Alex

— EDIT
These are the app components we use. As host OS we use Ubuntu 20.04:

GitLab 14.1.3
GitLab Shell 13.19.1
GitLab Workhorse v14.1.3
GitLab API v4
Ruby 2.7.2p137
Rails 6.1.3.2
PostgreSQL 12.6
Redis 6.0.14

After some days of debugging I finally found the problem.

It’s neither GitLab nor Ubuntu, so I’m going to close this issue. It’s caused by OpenNebula tooling that affects the conntrack table of Linux netfilter.

bye


In case somebody stumbles upon this, here are my findings:

Investigating the stale SSH sessions, I saw that the conntrack entries get removed after roughly 30 seconds. It looks like this in the console:

    [NEW] ipv4     2 tcp      6 120 SYN_SENT src=GITLAB-IP dst=TARGET-IP sport=56882 dport=22 [UNREPLIED] src=TARGET-IP dst=10.0.89.38 sport=22 dport=56882
 [UPDATE] ipv4     2 tcp      6 60 SYN_RECV src=GITLAB-IP dst=TARGET-IP sport=56882 dport=22 src=TARGET-IP dst=10.0.89.38 sport=22 dport=56882
 [UPDATE] ipv4     2 tcp      6 432000 ESTABLISHED src=GITLAB-IP dst=TARGET-IP sport=56882 dport=22 src=TARGET-IP dst=10.0.89.38 sport=22 dport=56882 [ASSURED]
[DESTROY] ipv4     2 tcp      6 src=GITLAB-IP dst=TARGET-IP sport=56882 dport=22 src=TARGET-IP dst=10.0.89.38 sport=22 dport=56882 [ASSURED]
    [NEW] ipv4     2 tcp      6 300 ESTABLISHED src=TARGET-IP dst=10.0.89.38 sport=22 dport=56882 [UNREPLIED] src=10.0.89.38 dst=TARGET-IP sport=56882 dport=22
[DESTROY] ipv4     2 tcp      6 src=TARGET-IP dst=10.0.89.38 sport=22 dport=56882 [UNREPLIED] src=10.0.89.38 dst=TARGET-IP sport=56882 dport=22

Notice the first “DESTROY” entry that deletes the conntrack entry.
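
For reference, events like the ones above can be watched live on the runner host with conntrack-tools while a job is running, roughly like this (a sketch; it needs root and the conntrack package installed):

# follow conntrack events and keep only SSH connections
conntrack -E -p tcp | grep --line-buffered 'dport=22'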

After digging around and searching for a process that deletes the conntrack entries, I found the systemd service one-context, which syncs our VMs with the underlying OpenNebula instance. I noticed a temporal coincidence while looking at my syslog: every time one-context reconfigures the VM, my SSH sessions go stale.
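
A rough way to see that coincidence (a sketch, assuming the unit really is named one-context as on our VMs) is to follow the service log in one terminal and the conntrack DESTROY events in another, and compare the timestamps:

# terminal 1: follow the one-context service log
journalctl -u one-context -f
# terminal 2: show only conntrack DESTROY events for SSH connections
conntrack -E | grep --line-buffered 'DESTROY.*dport=22'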

Looking into the one-context config I found the root cause: this snippet is executed every 30 seconds and flushes all our interfaces.

deactivate_network()
{
    IFACES=`/sbin/ifquery --list -a`

    for i in $IFACES; do
        if [ $i != 'lo' ]; then
            /sbin/ifdown $i
            /sbin/ip addr flush dev $i
        fi
    done
}

Ultimately this causes the conntrack entries to disappear and the session to go stale. The return packets sent by the SSH target no longer reach the originating process, so it stalls forever.
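
For anyone who wants to verify this on their own setup, running the same commands by hand should have the same effect (a sketch; eth0 is a placeholder, and since this interrupts networking it should only be done on a disposable test VM). Start a long SSH command from a job container, run the following on the host, and the DESTROY event shown above should appear in conntrack -E while the SSH command never returns:

# CAUTION: interrupts networking - disposable test VM only
# mimic what the one-context snippet does for a single interface,
# then bring the interface back up
/sbin/ifdown eth0
/sbin/ip addr flush dev eth0
/sbin/ifup eth0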