Docker+machine autoscale jobs randomly get interrupted with "Cannot connect to the Docker daemon"

Our current setup uses docker+machine with on-demand runners on AWS. This setup worked like a charm for more than a year, but in July our pipelines suddenly started failing at random during the build with the following error message:

WARNING: Failed to pull image with policy "always": Cannot connect to the Docker daemon at tcp://172.31.38.228:2376. Is the docker daemon running? (manager.go:205:0s)

Roughly 80% of jobs complete normally, but about 20% are interrupted at some point with this error message. When that happens I usually restart the job, and the retry works.

I’ve tried to:

  • disable spot instances and use normal ones
  • downgrade docker to an older version
  • restart gitlab-runner and docker

but the issue keeps occurring. Any ideas what could cause it, or how it could be debugged in more detail?
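
To narrow this down, a small probe along these lines could be run from the runner manager while jobs are executing, to see whether the daemon port on a worker really becomes unreachable at the moment a job fails. This is only a sketch: daemon_reachable and probe are hypothetical helper names, and it checks plain TCP reachability on port 2376 (the port from the error message) only, not the TLS handshake or the Docker API itself.

```python
import socket
import time
from typing import List


def daemon_reachable(host: str, port: int = 2376, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def probe(host: str, port: int = 2376, attempts: int = 5,
          interval: float = 1.0) -> List[bool]:
    """Check the port several times and print a timestamped line per attempt."""
    results = []
    for i in range(attempts):
        ok = daemon_reachable(host, port)
        print(f"{time.strftime('%H:%M:%S')} {host}:{port} reachable={ok}")
        results.append(ok)
        if i < attempts - 1:
            time.sleep(interval)
    return results
```

Calling probe("172.31.38.228") with the worker address from the failing job would print one line per attempt. If the port stays reachable while jobs still fail, the cause is more likely on the TLS or daemon side than in the network path.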