Get net/http TLS handshake timeout

Getting an intermittent error during my CI pipeline.
Setup:

  • 2+ runners with docker executor. Docker v18 and v19, gitlab-runner v12.0 and v12.5
  • 1 registry
  • I have scheduled pipelines to run a couple of times a day.

Error:

  • Every day, out of 5 pipeline executions, at least one fails, usually the one in the morning.
    The error message shown when doing a docker push my-registry.some.co is generic.

  • journalctl -u docker shows:
    dockerd[29874]: time="2020-01-13T07:22:08.328674059-05:00" level=error msg="Handler for POST /v1.39/images/create returned error: Get https://registry.mine.mine/v2/: net/http: TLS handshake timeout"

    dockerd[10544]: time="2020-01-12T07:24:48.573875158-05:00" level=info msg="Attempting next endpoint for push after error: Get https://registry.mine.mine/v2/: net/http: TLS handshake timeout"

I use a proxy and have it configured. I believe it is configured correctly, because my pipelines for other projects work fine and the pipeline for this project works most of the time (about 80%).

Any suggestions?

Hi,

from a network perspective, where are the runners and the registry located? Since you’ve mentioned a proxy: which connections are proxied, and which proxy and configuration are you using?

TLS handshake timeouts are caused not only by high-latency network connections, but also by limited CPU resources on the end performing the handshake. If the registry host, for example, is overloaded with other tasks or connections, the cryptographic calculations can pile up and block, so the handshake request from the other end times out.
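
To tell the two causes apart, it can help to time the TCP connect and the TLS handshake separately: a slow connect points at the network, while a fast connect followed by a slow or timed-out handshake more likely points at a busy peer. A minimal Python sketch follows. The function name `probe`, the state labels and the default timeout are made up for illustration, and it uses Python's ssl module rather than Docker's Go client, so the numbers are indicative only.

```python
import socket
import ssl
import time


def probe(host, port=443, timeout=10.0):
    """Open one connection and time TCP connect and TLS handshake separately."""
    result = {"host": host, "state": "ok", "connect_s": None, "handshake_s": None}
    ctx = ssl.create_default_context()

    # Phase 1: plain TCP connect (network path only, no cryptography yet).
    start = time.monotonic()
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        result["state"] = "connect_timeout"
        return result
    except OSError:
        result["state"] = "connect_error"
        return result
    result["connect_s"] = time.monotonic() - start

    # Phase 2: TLS handshake on the already connected socket.
    # wrap_socket() performs the handshake and inherits the socket timeout.
    start = time.monotonic()
    try:
        tls = ctx.wrap_socket(sock, server_hostname=host)
    except socket.timeout:
        result["state"] = "tls_timeout"
    except OSError:
        # Includes ssl.SSLError, e.g. certificate verification failures.
        result["state"] = "tls_error"
    else:
        result["handshake_s"] = time.monotonic() - start
        tls.close()
    finally:
        sock.close()
    return result
```

Calling `probe("registry.mine.mine")` from a runner around the time of the morning pipeline would show which phase is slow. Note that this connects directly and does not go through any configured HTTP proxy.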

Which TLS versions and ciphers are offered/used by the registry host? You can check that e.g. with sslscan.
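
If sslscan is not at hand, the version and cipher that one client actually negotiates can be read with Python's standard ssl module. This is only a sketch: unlike sslscan it does not enumerate everything the server offers, the function name `negotiated_tls` is made up for illustration, and it assumes the registry certificate is trusted by the system CA store (otherwise verification fails before anything is reported). Docker's Go client may also negotiate a different cipher than Python does.

```python
import socket
import ssl


def negotiated_tls(host, port=443, timeout=10.0):
    """Return the TLS version and cipher negotiated for a single connection."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            # cipher() returns (cipher name, protocol version, secret bits).
            name, _protocol, bits = tls.cipher()
            return {"version": tls.version(), "cipher": name, "bits": bits}
```

For example, `print(negotiated_tls("registry.mine.mine"))` run from one of the runners.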

Cheers,
Michael

Hi Michael,

Thanks for the prompt reply!
The runners, gitlab, and the registry are all in the same vlan, so I have http_proxy and no_proxy configured.
The CPU should not be a problem, because the failures happen outside business hours, but I will keep an eye on that.
I’ll try sslscan. What should I expect to see?

Hi,

sslscan should return the TLS versions and the cipher suites offered by the registry. Some of these are more CPU-intensive than others, so it was a quick shot in the dark.

Typically, I’d rather monitor the CPU load on the system, and especially that of the Docker registry daemon. You can then correlate these graphs with the TLS timeouts, e.g. by running an HTTP check against the registry periodically, à la curl registryurl.com or docker login registryurl.com, and collecting the HTTP response time metrics and state.
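
Such a check could be sketched in Python as below, in the spirit of the curl example. The function names `check_once` and `collect`, the state labels and the CSV layout are assumptions for illustration, not part of any existing tool. Python's urllib honours the http_proxy, https_proxy and no_proxy environment variables by default, which may or may not match how the Docker daemon is configured.

```python
import csv
import sys
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone


def check_once(url, timeout=10.0):
    """Issue one GET and return (UTC timestamp, state, HTTP status, seconds)."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            state, status = "ok", resp.status
    except urllib.error.HTTPError as exc:
        # The server answered (e.g. 401 from a registry that requires login),
        # so the TLS handshake itself succeeded.
        state, status = "http_error", exc.code
    except OSError:
        # URLError is an OSError: connection failures and timeouts,
        # including TLS handshake timeouts, end up here.
        state, status = "unreachable", None
    return stamp, state, status, round(time.monotonic() - start, 3)


def collect(url, samples, interval_s, out=sys.stdout):
    """Write one CSV row per check so the rows can be graphed next to CPU load."""
    writer = csv.writer(out)
    for i in range(samples):
        writer.writerow(check_once(url))
        if i + 1 < samples:
            time.sleep(interval_s)
```

For example, `collect("https://registryurl.com/v2/", samples=60, interval_s=60)` would take one sample per minute for an hour. Rows with state "unreachable" or with unusually long durations are the ones to line up against the CPU graphs.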

Cheers,
Michael