When does a runner get displayed as offline?

Problem to solve

After setting up my first runner in a container, I set out to turn it into a managed service, which involved stopping the runner process. After it had been stopped as a standalone process but before it was started as a service (i.e. it was not running at all), I checked the GitLab UI, which still reported the runner as online, with the last contact shown as up to an hour in the past. I would have expected the GitLab instance to notice the lost connection to the runner process and mark it as offline within a reasonable time frame (say 5-10 minutes at most).
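
For context, turning a podman container into a managed service typically looks something like the following sketch (this uses podman's systemd unit generation; the container and unit names are placeholders and may differ from my actual setup):

    # stop the standalone runner container
    podman stop gitlab-runner

    # generate a systemd unit from the existing container and install it for the current user
    podman generate systemd --name gitlab-runner --files
    mv container-gitlab-runner.service ~/.config/systemd/user/

    # between the stop above and the start below, the runner is not running at all
    systemctl --user daemon-reload
    systemctl --user enable --now container-gitlab-runner.service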

Searching the web for this only turned up the inverse case, where a runner was supposed to be online but was shown as offline, so I'm asking here: why didn't GitLab mark the runner as offline? Was the elapsed time too short for the lost connection to register? Was it because of a mistake I made when shutting down the runner? What or where should I check to learn more about this topic?
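
One thing that might help narrow this down is asking the REST API what the server itself has recorded for the runner; a sketch (host, token, and runner ID are placeholders, and the token needs sufficient permissions):

    # fetch the runner's details, including contacted_at and status
    curl --header "PRIVATE-TOKEN: <your_access_token>" \
      "https://gitlab.example.com/api/v4/runners/<runner_id>"

The response should include fields like contacted_at and status, which is presumably what the online/offline display is based on.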

Steps to reproduce

  1. Set up a runner
  2. Register the runner and start it in a podman container (roughly podman run gitlab/gitlab-runner register --url https://some.local.url --token glrt-secret-token, then podman run gitlab/gitlab-runner; see the fuller sketch after this list)
  3. Verify that GitLab correctly displays the runner as online while the container is running
  4. Stop the container (podman stop gitlab-runner)
  5. Check the runner in the GitLab interface
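
For completeness, a fuller version of the step-2 commands would look roughly like this (the volume name, URL, and token are placeholders; the shared config volume is what lets the second container pick up the registration):

    # one-off registration, writing config.toml into a named volume
    podman run --rm -it -v gitlab-runner-config:/etc/gitlab-runner \
      gitlab/gitlab-runner register \
      --url https://some.local.url --token glrt-secret-token

    # start the runner itself with the registered configuration
    podman run -d --name gitlab-runner \
      -v gitlab-runner-config:/etc/gitlab-runner \
      gitlab/gitlab-runner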

Expected result: GitLab reports the runner as offline after at most a few minutes.
Observed result: GitLab still reports the runner as online, even after an hour.

Versions

  • Self-managed, GitLab Enterprise Edition v17.0.2-ee

Same issue here, on Self-managed GitLab CE v17.6.1

We noticed that one of our servers running gitlab-runner has been gone for more than an hour now (a physical Linux server that appears completely frozen and can't even be pinged anymore), but the runner still appears online in GitLab - and the job that was running on that host is also still in 'running' state, with GitLab seemingly still tailing its log.

I'm also very interested in knowing what knobs there are to control or tune this, so that the GitLab server recognizes and handles such situations within a defined time frame of a few minutes - and especially so that it then also treats such "hanging" jobs as failed.

Our runners are configured with a very high maximum job timeout (more than a day) because of long-running jobs. I suppose that exceeding this threshold would eventually make GitLab recognize the failed runner, but surely there should be some other mechanism to detect that a runner is gone outside of those bounds…?
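
As a side note, one way to at least see the server-side view is to list the jobs GitLab still believes that runner is processing, e.g. via the runners API (placeholders for host, token, and runner ID; requires sufficient permissions):

    # list jobs GitLab currently considers to be running on this runner
    curl --header "PRIVATE-TOKEN: <your_access_token>" \
      "https://gitlab.example.com/api/v4/runners/<runner_id>/jobs?status=running"

That only reflects the server-side state, though; it does not by itself fail or cancel a hanging job.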