I’ve got quite a scenario here and I’ve been fighting with it for several days full-time so this will probably be a lot to read. Very high level: gitlab latest version, upgraded during the course of this fight just to be sure. Two Kubernetes clusters, one bare-metal for dev purposes and one at Digital Ocean for prod. Just doing the integration and starting to work with autodeployment and builds of my test project fail on the dev cluster and succeed on the prod cluster. Doing the cluster setup in gitlab went completely painlessly, and installing the gitlab runners and all was easy. (Prometheus doesn’t like k8s 1.18 that I’m running in dev, but that’s a different topic.)
Dev cluster: ubuntu 18.04 running k8s 1.18, 5 nodes, docker 19.03.6
Prod cluster: Digital Ocean hosted, k8s 1.16.6, … whatever the heck the rest is DO hides from me.
Test build: I’m working from the instructions at Create a Continuous Integration Pipeline with GitLab and Kubernetes to get this all set up, and I’m using his test node.js Hello, World! app. I’ve tried upgrading it to node 12, that made no difference.
What I expect to happen: container builds no matter which cluster it picks to do the build
What is happening: container fails to build on dev cluster, succeeds on prod cluster.
The build fires up, grabs all the parts it needs, and then hangs for a while at the “npm install” step until it times out and fails. Making it an “npm install --verbose” showed me that it’s timing out getting to https://registry.npmjs.org/express. That works from my browser and from DO, I know it’s not the site. So I added a “sleep 3600” to the Dockerfile and logged in to the pod. From there I can wget https://registry.npmjs.org/express just fine. Discovered that the svc-0 container in the pod has a docker container within it, so I logged in to THAT. curl -v https://registry.npmjs.org/express hangs immediately after the TLS 1.3 Client Hello.
This is where it gets “fun”.
Dropped tcpdump on the node on the outside interface for port 443 and on the inside cali interface, on the gitlab-runner svc-0 container, and on the container it’s building all the way in, and ran the curl. I see the TCP connection get set up, the TLS negotiation start, and then it dies. I see the Server Hello coming back in at the host’s external interface, but it doesn’t get forwarded to the cali interface to head in further.
Did the same thing with https://www.google.com and it works fine.
Tested more stuff, found that registry.npmjs.org is returning the entire certificate chain and google is not, and it’s doing it as a single 3+KB packet that is getting fragmented somewhere along the way in to 3 pieces. Google is not sending the whole chain, just their cert, and that fits in one normal-sized packet. Tested the curl against another site I know is sending the whole chain… HANG. So for some reason it is not correctly handling fragmented packets, apparently within the Ubuntu network stack or possibly the cali drivers, but only when those packets are supposed to go from the node → docker container → docker container. If it’s just one docker layer in it works fine.
I’m sending this here first, as opposed to Ubuntu, Kubernetes, Docker, Alpine (the gitlab-runner), or Debian (the node.js image) because the problem is manifesting in the gitlab-runner build process. If we come up with any sort of credible reason to point at any particular of the products in the stack, I’ll happily go chase them down, too.
Any ideas? This one has me pretty well stumped.