Runners look like they've lost contact with GitLab CE 8.12.3, because GitLab only records contacts once an hour now

We are seeing an intermittent issue with GitLab CE Omnibus 8.12.3, with the following symptoms:

  1. All five of our runners seem to “lose contact” with GitLab simultaneously. If you go to https://gitlab.yourcompany.biz/admin/runners you can see that the last contact was “28 minutes” ago and climbing.

  2. I have a gitlab runner running interactively so I can see its console (it’s on my own desktop PC so I can study this), and there’s NO output to indicate anything is wrong.

  3. I can read /var/log/gitlab/gitlab-rails/production.log, and it contains lines like these, which seem to indicate that register.json is being hit repeatedly from a number of correct-looking IP addresses:

    Started POST "/ci/api/v1/builds/register.json" for 192.168.215.221 at 2016-10-13 09:40:05 -0500
    Started POST "/ci/api/v1/builds/register.json" for 192.168.215.35 at 2016-10-13 09:40:06 -0500
    … and more similar

  4. And yet, no jobs are being run, and the time value “since last communication” keeps going up.

  5. This condition PERSISTED even after I ran “sudo gitlab-ctl stop” and then “sudo gitlab-ctl start”. Restarting the runners also has no effect.

  6. This condition only went away when I rebooted the Ubuntu 14.04 VM. GitLab itself seems to be up: you can push and pull git repos and use the whole GitLab web user interface. Only CI is affected.
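
To make the log check in item 3 less of an eyeball exercise, here is a rough sketch that counts register.json polls per client IP. It assumes the "Started POST ..." line format quoted above and nothing else about the log; the function name and the third sample line are made up for illustration.

```python
import re
from collections import Counter

# Matches request lines like the ones quoted in item 3.
# Accepts straight or curly quotes, since pasted logs often pick up "smart" quotes.
REGISTER_RE = re.compile(
    r'Started POST ["“]/ci/api/v1/builds/register\.json["”] for (\S+) at '
)

def count_register_polls(lines):
    """Return a Counter mapping each client IP to its number of register.json POSTs."""
    counts = Counter()
    for line in lines:
        match = REGISTER_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# The first two lines are from the log excerpt above; the third is an invented
# non-matching line to show that other requests are ignored.
sample = [
    'Started POST "/ci/api/v1/builds/register.json" for 192.168.215.221 at 2016-10-13 09:40:05 -0500',
    'Started POST "/ci/api/v1/builds/register.json" for 192.168.215.35 at 2016-10-13 09:40:06 -0500',
    'Started GET "/admin/runners" for 192.168.215.10 at 2016-10-13 09:40:07 -0500',
]

for ip, hits in sorted(count_register_polls(sample).items()):
    print(ip, hits)
```

To run it against the real log, pass an open file handle for /var/log/gitlab/gitlab-rails/production.log instead of `sample`. If every runner IP shows up with a steadily growing count, the runners are still polling, whatever the admin page says.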

I’m thinking it may be a problem in the OS itself, and I plan to update this VM to Ubuntu 16.x LTS to see if things get more stable.

Hello,

There was a change in 8.12.3 so that the status of runners is only saved once per hour, which means you only have to worry once the last contact is more than an hour old. We have a custom monitoring script which failed because of this :-).
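
For anyone adjusting a similar monitoring check, a minimal sketch of the threshold logic might look like this. It assumes the once-per-hour update interval described above; the grace period and the function name are hypothetical, and how you obtain the last-contact timestamp is left out on purpose.

```python
from datetime import datetime, timedelta, timezone

# Assumption taken from this thread: 8.12.3 saves a runner's last-contact time
# only about once per hour, so a younger timestamp says nothing about health.
UPDATE_INTERVAL = timedelta(hours=1)
# Hypothetical safety margin so a slightly late update does not raise an alert.
GRACE = timedelta(minutes=10)

def runner_looks_stale(last_contact, now):
    """Return True only when last_contact is older than the interval plus grace."""
    return now - last_contact > UPDATE_INTERVAL + GRACE

now = datetime(2016, 10, 13, 15, 8, tzinfo=timezone.utc)
# "28 minutes ago and climbing" is within the update interval, so no alert.
print(runner_looks_stale(now - timedelta(minutes=28), now))
# An hour and a half without an update is worth a look.
print(runner_looks_stale(now - timedelta(minutes=90), now))
```

The point is only that the alert threshold has to sit above the update interval; the exact margin is a matter of taste.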

Oh I see. I’m looking at noise, not at a signal. I expected this to be up to date as of the last time a job started running, because, why would it not be…

That explains a lot. What’s the issue number or MR for this change? This is a dev-ops regression IMHO.