Runner cannot push all layers to registry (504 gateway timeout) Self-hosted, Kubernetes, cloud native helm chart

Environment:

  • K8s v1.14.1 on bare metal, deployed via Kubespray v2.10.0 on a single Debian 9.9 node. All default inventory settings except: flannel instead of Calico, kube_proxy_mode set to iptables, and Helm enabled
  • Rook installed via the latest Helm chart (v1.0.1): helm install --namespace rook-ceph rook-release/rook-ceph --set agent.flexVolumeDirPath=/var/lib/kubelet/volume-plugins, then cluster-test.yml and storageclass-test.yml were applied. The resulting storage class was made the default.
  • GitLab installed via helm (11.11)
  • nginx-ingress service changed from LoadBalancer to an externalIP (see the sketch after this list)
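
For reference, the externalIP change corresponds roughly to chart values like the following. This is only a sketch: it assumes the bundled nginx-ingress subchart accepts the standard controller.service keys, and <node-ip> stands in for the node's actual address.

# values.yaml excerpt (sketch): put the bundled nginx-ingress on the node's IP
# instead of waiting for a LoadBalancer that bare metal can't provision
nginx-ingress:
  controller:
    service:
      type: ClusterIP            # assumption: any non-LoadBalancer type; adjust as needed
      externalIPs:
        - <node-ip>              # placeholder for the node's real IP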

job log:

$ docker push "$BUILD_IMAGE_NAME"
The push refers to repository [registry.mydomain.com/web/auto-build-image/master]
0a667c142b26: Preparing
1e8ec32b2f91: Preparing
a21c0a6873db: Preparing
c895bf09456a: Preparing
968d46c1d20e: Preparing
b87598efb2f0: Preparing
f1b5933fe4b5: Preparing
b87598efb2f0: Waiting
f1b5933fe4b5: Waiting
1e8ec32b2f91: Layer already exists
968d46c1d20e: Layer already exists
b87598efb2f0: Layer already exists
a21c0a6873db: Layer already exists
f1b5933fe4b5: Layer already exists
0a667c142b26: Pushed
c895bf09456a: Retrying in 5 seconds
c895bf09456a: Retrying in 4 seconds
c895bf09456a: Retrying in 3 seconds
c895bf09456a: Retrying in 2 seconds
c895bf09456a: Retrying in 1 second
c895bf09456a: Retrying in 10 seconds
c895bf09456a: Retrying in 9 seconds
c895bf09456a: Retrying in 8 seconds
c895bf09456a: Retrying in 7 seconds
c895bf09456a: Retrying in 6 seconds
c895bf09456a: Retrying in 5 seconds
c895bf09456a: Retrying in 4 seconds
c895bf09456a: Retrying in 3 seconds
c895bf09456a: Retrying in 2 seconds
c895bf09456a: Retrying in 1 second
c895bf09456a: Retrying in 15 seconds
c895bf09456a: Retrying in 14 seconds
c895bf09456a: Retrying in 13 seconds
c895bf09456a: Retrying in 12 seconds
c895bf09456a: Retrying in 11 seconds
c895bf09456a: Retrying in 10 seconds
c895bf09456a: Retrying in 9 seconds
c895bf09456a: Retrying in 8 seconds
c895bf09456a: Retrying in 7 seconds
c895bf09456a: Retrying in 6 seconds
c895bf09456a: Retrying in 5 seconds
c895bf09456a: Retrying in 4 seconds
c895bf09456a: Retrying in 3 seconds
c895bf09456a: Retrying in 2 seconds
c895bf09456a: Retrying in 1 second
c895bf09456a: Retrying in 20 seconds
c895bf09456a: Retrying in 19 seconds
c895bf09456a: Retrying in 18 seconds
c895bf09456a: Retrying in 17 seconds
c895bf09456a: Retrying in 16 seconds
c895bf09456a: Retrying in 15 seconds
c895bf09456a: Retrying in 14 seconds
c895bf09456a: Retrying in 13 seconds
c895bf09456a: Retrying in 12 seconds
c895bf09456a: Retrying in 11 seconds
c895bf09456a: Retrying in 10 seconds
c895bf09456a: Retrying in 9 seconds
c895bf09456a: Retrying in 8 seconds
c895bf09456a: Retrying in 7 seconds
c895bf09456a: Retrying in 6 seconds
c895bf09456a: Retrying in 5 seconds
c895bf09456a: Retrying in 4 seconds
c895bf09456a: Retrying in 3 seconds
c895bf09456a: Retrying in 2 seconds
c895bf09456a: Retrying in 1 second
received unexpected HTTP status: 504 Gateway Time-out
ERROR: Job failed: command terminated with exit code 1

registry pod log:

time="2019-05-28T02:19:02.052602654Z" level=error msg="client disconnected during blob PATCH" auth.user.name=fury contentLength=-1 copied=24731056 error="http: unexpected EOF reading trailer" go.version=go1.11.2 http.request.host=registry.mydomain.com http.request.id=7ab88f1b-4e47-491f-aaeb-387ce10a70ce http.request.method=PATCH http.request.remoteaddr=10.233.64.1 http.request.uri="/v2/web/auto-build-image/master/blobs/uploads/705390eb-eb88-423e-9c72-d39433f35ac4?_state=BJG7VilD0PITqxhJw36_4PnZhiYu_56crVZFlxjFkUZ7Ik5hbWUiOiJ3ZWIvYXV0by1idWlsZC1pbWFnZS9tYXN0ZXIiLCJVVUlEIjoiNzA1MzkwZWItZWI4OC00MjNlLTljNzItZDM5NDMzZjM1YWM0IiwiT2Zmc2V0IjowLCJTdGFydGVkQXQiOiIyMDE5LTA1LTI4VDAyOjE2OjI5LjI3MjUyOTM1NFoifQ%3D%3D" http.request.useragent="docker/18.09.6 go/go1.10.8 git-commit/481bc77 kernel/4.9.0-9-amd64 os/linux arch/amd64 UpstreamClient(Docker-Client/18.09.6 \(linux\))" vars.name="web/auto-build-image/master" vars.uuid=705390eb-eb88-423e-9c72-d39433f35ac4 
...
10.233.64.117 - - [28/May/2019:02:30:19 +0000] "PATCH /v2/web/auto-build-image/master/blobs/uploads/1c334efa-e7df-43d1-a82b-e92fc7d67de8?_state=5vY-ETO868QRnHC5SwBmCUXvFcQDNzBSUj-8cye0aTp7Ik5hbWUiOiJ3ZWIvYXV0by1idWlsZC1pbWFnZS9tYXN0ZXIiLCJVVUlEIjoiMWMzMzRlZmEtZTdkZi00M2QxLWE4MmItZTkyZmM3ZDY3ZGU4IiwiT2Zmc2V0IjowLCJTdGFydGVkQXQiOiIyMDE5LTA1LTI4VDAyOjMwOjA5LjEzODczNzk0MloifQ%3D%3D HTTP/1.1" 500 89 "" "docker/18.09.6 go/go1.10.8 git-commit/481bc77 kernel/4.9.0-9-amd64 os/linux arch/amd64 UpstreamClient(Docker-Client/18.09.6 \\(linux\\))"
time="2019-05-28T02:35:59.851231763Z" level=error msg="response completed with error" auth.user.name=fury err.code=unknown err.detail="client disconnected" err.message="unknown error" go.version=go1.11.2 http.request.host=registry.mydomain.com http.request.id=d5280657-f0f2-4420-9fed-c34ab26caa04 http.request.method=PATCH http.request.remoteaddr=10.233.64.1 http.request.uri="/v2/web/auto-build-image/master/blobs/uploads/1c334efa-e7df-43d1-a82b-e92fc7d67de8?_state=5vY-ETO868QRnHC5SwBmCUXvFcQDNzBSUj-8cye0aTp7Ik5hbWUiOiJ3ZWIvYXV0by1idWlsZC1pbWFnZS9tYXN0ZXIiLCJVVUlEIjoiMWMzMzRlZmEtZTdkZi00M2QxLWE4MmItZTkyZmM3ZDY3ZGU4IiwiT2Zmc2V0IjowLCJTdGFydGVkQXQiOiIyMDE5LTA1LTI4VDAyOjMwOjA5LjEzODczNzk0MloifQ%3D%3D" http.request.useragent="docker/18.09.6 go/go1.10.8 git-commit/481bc77 kernel/4.9.0-9-amd64 os/linux arch/amd64 UpstreamClient(Docker-Client/18.09.6 \(linux\))" http.response.contenttype="application/json; charset=utf-8" http.response.duration=5m40.543384288s http.response.status=500 http.response.written=89 vars.name="web/auto-build-image/master" vars.uuid=1c334efa-e7df-43d1-a82b-e92fc7d67de8 
2019/05/28 02:36:46 http: multiple response.WriteHeader calls
10.233.64.112 - - [28/May/2019:02:31:55 +0000] "PATCH /v2/web/auto-build-image/master/blobs/uploads/a949ebbb-cbd2-4b94-81e7-3b009ea95af1?_state=omQnXuq6SJvTunGR1PM8dBp7E4bYWu3nIs-bsJ3-lNN7Ik5hbWUiOiJ3ZWIvYXV0by1idWlsZC1pbWFnZS9tYXN0ZXIiLCJVVUlEIjoiYTk0OWViYmItY2JkMi00Yjk0LTgxZTctM2IwMDllYTk1YWYxIiwiT2Zmc2V0IjowLCJTdGFydGVkQXQiOiIyMDE5LTA1LTI4VDAyOjMxOjQ5LjkzMjM2NzA1WiJ9 HTTP/1.1" 500 89 "" "docker/18.09.6 go/go1.10.8 git-commit/481bc77 kernel/4.9.0-9-amd64 os/linux arch/amd64 UpstreamClient(Docker-Client/18.09.6 \\(linux\\))"
time="2019-05-28T02:36:46.646680713Z" level=error msg="response completed with error" auth.user.name=fury err.code=unknown err.detail="client disconnected" err.message="unknown error" go.version=go1.11.2 http.request.host=registry.mydomain.com http.request.id=2300b4cc-7f3c-44d5-b370-c210b36fc50e http.request.method=PATCH http.request.remoteaddr=10.233.64.1 http.request.uri="/v2/web/auto-build-image/master/blobs/uploads/a949ebbb-cbd2-4b94-81e7-3b009ea95af1?_state=omQnXuq6SJvTunGR1PM8dBp7E4bYWu3nIs-bsJ3-lNN7Ik5hbWUiOiJ3ZWIvYXV0by1idWlsZC1pbWFnZS9tYXN0ZXIiLCJVVUlEIjoiYTk0OWViYmItY2JkMi00Yjk0LTgxZTctM2IwMDllYTk1YWYxIiwiT2Zmc2V0IjowLCJTdGFydGVkQXQiOiIyMDE5LTA1LTI4VDAyOjMxOjQ5LjkzMjM2NzA1WiJ9" http.request.useragent="docker/18.09.6 go/go1.10.8 git-commit/481bc77 kernel/4.9.0-9-amd64 os/linux arch/amd64 UpstreamClient(Docker-Client/18.09.6 \(linux\))" http.response.contenttype="application/json; charset=utf-8" http.response.duration=4m51.363115069s http.response.status=500 http.response.written=89 vars.name="web/auto-build-image/master" vars.uuid=a949ebbb-cbd2-4b94-81e7-3b009ea95af1

nginx-ingress log (where XXX is the actual IP of the node):

XXX - [XXX] - - [28/May/2019:02:22:19 +0000] "PUT /registry/docker/registry/v2/repositories/web/auto-build-image/master/_uploads/397a7c95-4c6d-446a-ba2f-0702e5855150/startedat HTTP/1.1" 200 0 "-" "aws-sdk-go/1.15.11 (go1.11.2; linux; amd64)" 1093 0.030 [default-gitlab-minio-svc-9000] 10.233.64.139:9000 0 0.028 200 21d1666fff24c3894bf07d2971164211

XXX - [XXX] - - [28/May/2019:02:22:21 +0000] "PUT /registry/docker/registry/v2/repositories/web/auto-build-image/master/_uploads/397a7c95-4c6d-446a-ba2f-0702e5855150/hashstates/sha256/0 HTTP/1.1" 200 0 "-" "aws-sdk-go/1.15.11 (go1.11.2; linux; amd64)" 1192 0.009 [default-gitlab-minio-svc-9000] 10.233.64.139:9000 0 0.008 200 6ae2a7087337387128b6a182e37ae273

XXX - [XXX] - - [28/May/2019:02:22:38 +0000] "POST /api/v4/jobs/request HTTP/1.1" 204 0 "-" "gitlab-runner 11.11.0 (11-11-stable; go1.8.7; linux/amd64)" 917 0.041 [default-gitlab-unicorn-8181] 10.233.64.114:8181 0 0.044 204 f22af74c035dfffce71006e65e200457
2019/05/28 02:23:48 [error] 2252#2252: *28463 upstream timed out (110: Connection timed out) while sending request to upstream, client: 10.233.64.1, server: registry.mydomain.com, request: "PATCH /v2/web/auto-build-image/master/blobs/uploads/397a7c95-4c6d-446a-ba2f-0702e5855150?_state=wL6Un-taQENi2-I9YMYXPyiXM6sqB8t-Yzdi5E1dUPJ7Ik5hbWUiOiJ3ZWIvYXV0by1idWlsZC1pbWFnZS9tYXN0ZXIiLCJVVUlEIjoiMzk3YTdjOTUtNGM2ZC00NDZhLWJhMmYtMDcwMmU1ODU1MTUwIiwiT2Zmc2V0IjowLCJTdGFydGVkQXQiOiIyMDE5LTA1LTI4VDAyOjIyOjE4LjE1MTkzMzgwMloifQ%3D%3D HTTP/1.1", upstream: "http://10.233.64.110:5000/v2/web/auto-build-image/master/blobs/uploads/397a7c95-4c6d-446a-ba2f-0702e5855150?_state=wL6Un-taQENi2-I9YMYXPyiXM6sqB8t-Yzdi5E1dUPJ7Ik5hbWUiOiJ3ZWIvYXV0by1idWlsZC1pbWFnZS9tYXN0ZXIiLCJVVUlEIjoiMzk3YTdjOTUtNGM2ZC00NDZhLWJhMmYtMDcwMmU1ODU1MTUwIiwiT2Zmc2V0IjowLCJTdGFydGVkQXQiOiIyMDE5LTA1LTI4VDAyOjIyOjE4LjE1MTkzMzgwMloifQ%3D%3D", host: "registry.mydomain.com"

10.233.64.1 - [10.233.64.1] - - [28/May/2019:02:23:48 +0000] "PATCH /v2/web/auto-build-image/master/blobs/uploads/397a7c95-4c6d-446a-ba2f-0702e5855150?_state=wL6Un-taQENi2-I9YMYXPyiXM6sqB8t-Yzdi5E1dUPJ7Ik5hbWUiOiJ3ZWIvYXV0by1idWlsZC1pbWFnZS9tYXN0ZXIiLCJVVUlEIjoiMzk3YTdjOTUtNGM2ZC00NDZhLWJhMmYtMDcwMmU1ODU1MTUwIiwiT2Zmc2V0IjowLCJTdGFydGVkQXQiOiIyMDE5LTA1LTI4VDAyOjIyOjE4LjE1MTkzMzgwMloifQ%3D%3D HTTP/1.1" 504 160 "-" "docker/18.09.6 go/go1.10.8 git-commit/481bc77 kernel/4.9.0-9-amd64 os/linux arch/amd64 UpstreamClient(Docker-Client/18.09.6 \x5C(linux\x5C))" 24096768 86.710 [default-gitlab-registry-5000] 10.233.64.110:5000 0 86.707 504 219162a5049d01db9fa5677b9ae0cf68

I’ve tried installing the runner as a Docker runner on two different machines to no avail.
In this project the failure is consistently on that one particular layer; in the other project I've tried building, it is consistently some other layer.

The external DNS for the cluster runs through Cloudflare, but because Cloudflare has a 100 MB upload limit I've set up /etc/hosts and the GitLab Runner extra_hosts so that gitlab.mydomain.com and registry.mydomain.com point directly to the node's IP (a rough excerpt is below). I suspected traffic was still going through Cloudflare, but according to the nginx log the upload only makes it through 22-24 MB before 504ing (after 1 minute: a terrible transfer rate, or something just locking up?).
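
Concretely, the bypass looks something like the excerpts below. These are sketches: <node-ip> is a placeholder, and the extra_hosts entries are the [runners.docker] setting that the Docker executor passes to job containers as --add-host entries.

# /etc/hosts on the runner host (sketch)
<node-ip>  gitlab.mydomain.com registry.mydomain.com

# GitLab Runner config.toml excerpt (sketch)
[[runners]]
  [runners.docker]
    extra_hosts = ["gitlab.mydomain.com:<node-ip>", "registry.mydomain.com:<node-ip>"]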

I’m leaning toward some MinIO problem, as I don’t have this issue on my cluster at work, which is deployed via the old omnibus Helm chart but is otherwise very similar (and there’s no Cloudflare at all there).

Any ideas?

To try to rule out Cloudflare as a possibility, I added hostAliases to the gitlab-registry deployment so that gitlab.mydomain.com and minio.mydomain.com resolve directly to the node (the same IP as the nginx-ingress externalIP); a rough excerpt is below. A couple of the projects that were running into this problem started working after that, but then I set up a Rails project from the template, and that one is having the same issue.
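
For reference, the hostAliases addition to the registry pod spec looked roughly like this (a sketch; <node-ip> is again a placeholder for the node / nginx-ingress externalIP):

# gitlab-registry deployment, spec.template.spec excerpt (sketch)
hostAliases:
  - ip: "<node-ip>"              # placeholder: the node's IP / ingress externalIP
    hostnames:
      - gitlab.mydomain.com
      - minio.mydomain.com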

The transfer rate is really slow, even though the traffic stays on the local machine and the storage is on an NVMe drive.

Here’s a test push from the command line; it’s frozen like this:

75d4c78adb87: Pushed 
d8a926acf044: Pushing [================>                                  ]  44.33MB/132.9MB
02665233a680: Pushed 
ac788347f35d: Pushed 
b6b132e47ed1: Pushed 
de242e95f9e6: Pushing [==================================================>]  55.05MB
eab3cc012638: Pushed 
eb8c19b0dfbc: Pushing [==================================================>]  45.85MB
97cee2b72194: Pushed 
ebf12965380b: Pushing [==================================================>]  4.464MB

After several retries, the smaller layers end up pushing, but the 132.9 MB one keeps failing.

This is still happening, but I guess those hostAliases don’t actually work, so instead I’ve dropped minio from the Cloudflare-proxied domains.


@fury Thanks for jumping in and sharing a solution that works for you. This will be super helpful to those experiencing the same thing! :blush: