Triggering a specific job causes the GitLab machine to freeze

We have a very confusing issue with our GitLab server.
The server:

  • runs GitLab 14.9.3 with mostly default settings.
  • runs Ubuntu 22.04, kernel 5.13.0.
  • is virtualized on VMware.
  • is configured with 4 CPUs and 8 GB of RAM.
  • is not heavily used: ~10 users and a handful of active repos.

Since upgrading from 13.12 we have started getting “freezes” in network connectivity.
Essentially, all network traffic halts completely for around 60s before continuing as normal.
This includes traffic to and from the machine, but also between services on the machine (e.g. between workhorse and gitlab-rails).
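For anyone trying to pin down when such freezes start and stop, here is a minimal sketch of a probe that could run on another machine during a CI job. It is an assumption about how one might measure this, not something we ran; the function names, the 10-second threshold, and the host and port in the usage comment are all hypothetical.

```python
import socket
import time


def probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def stall_windows(samples, min_seconds=10.0):
    """Group consecutive failed probes into (start, end) windows.

    samples is a list of (timestamp, ok) tuples in time order. Only runs of
    failures spanning at least min_seconds are reported.
    """
    windows, start, last = [], None, None
    for ts, ok in samples:
        if not ok:
            if start is None:
                start = ts
            last = ts
        elif start is not None:
            if last - start >= min_seconds:
                windows.append((start, last))
            start = None
    # A run of failures may still be open when the samples end.
    if start is not None and last - start >= min_seconds:
        windows.append((start, last))
    return windows


# Hypothetical usage: probe once per second while the CI job runs.
# samples = []
# for _ in range(300):
#     samples.append((time.time(), probe("gitlab.example.com", 443)))
#     time.sleep(1)
# print(stall_windows(samples))
```

Comparing the reported windows with the job log timestamps should show whether the freeze lines up with a particular step, such as the cache upload.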

This happens when running a few specific CI jobs.
The job that triggers the issue just runs gradle clean.
It uploads/downloads a cache, but that is, as far as I know, handled by the runner, not the server.
The git repo is only around 1MB in size.

We have checked for memory/disk/cpu issues but can’t find any obvious problems with resource constraints.

The only clues we have are these errors reported by gitlab-workhorse:
badgateway: failed to receive response: context canceled
badgateway: failed to receive response: dial unix /var/opt/gitlab/gitlab-rails/sockets/gitlab.socket: connect: no such file or directory
But we are fairly sure these are just symptoms of the network stack being “frozen”.

We think this is not directly GitLab’s “fault”, but want to check here whether anyone else has had similar issues.

Unrelated to GitLab, but I have had problems with Debian and Ubuntu on VMware when using vmxnet3 as the network card in the VM config: intermittent or poor network connectivity, or even the VM restarting.

I changed to e1000 in the VM config and it has been stable since. That means deleting the existing network card from the VM, adding a new one, choosing e1000 as the adapter type, and saving.
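To check which driver a guest is currently using, before and after swapping the adapter, something like the sketch below may help. It assumes a Linux guest that exposes the usual /sys/class/net layout; the function name nic_drivers is made up for illustration.

```python
import os


def nic_drivers(sysfs_root="/sys/class/net"):
    """Map each network interface name to its kernel driver name.

    Interfaces without a device/driver symlink (for example lo) map to None.
    Returns an empty dict if sysfs_root does not exist.
    """
    if not os.path.isdir(sysfs_root):
        return {}
    drivers = {}
    for iface in sorted(os.listdir(sysfs_root)):
        link = os.path.join(sysfs_root, iface, "device", "driver")
        if os.path.islink(link):
            # The symlink points at the driver directory; its last path
            # component is the driver name (e.g. vmxnet3 or e1000).
            drivers[iface] = os.path.basename(os.path.realpath(link))
        else:
            drivers[iface] = None
    return drivers


if __name__ == "__main__":
    for iface, driver in nic_drivers().items():
        print(f"{iface}: {driver or 'no driver (virtual interface)'}")
```

If the guest has ethtool installed, ethtool -i followed by the interface name reports the same driver name.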

Thanks!

I was already leaning towards something with VMware being iffy.
Replacing the NIC worked like a charm.
