502 timeout when manually creating new pipeline after 13.10.3 upgrade

nwalter · April 16, 2021, 4:23am

Hi, I’m running GitLab 13.10.3-ee on GKE using the official Helm Chart. I upgraded from 13.7.9-ee last night and all went smoothly. The only issue I’m seeing occurs when I try to manually run a pipeline from one of my projects.

Steps to reproduce:

1. CI/CD → Pipelines → Run Pipeline
2. Choose a branch, then click Run Pipeline
3. About 30 seconds later, the UI informs me that “Pipeline cannot be run. Something went wrong on our end. Please try again.”

The curious thing is that, although the UI indicates an error, if I head back to CI/CD → Pipelines I can see my pipeline running.

I’m only seeing this behaviour on one of my projects. This project has thousands of previous pipeline runs, so doing anything with it is typically slow. About 12 months ago I overcame a similar timeout issue by setting the GITLAB_RAILS_RACK_TIMEOUT environment variable to “120”. That setting is still in place, but I suspect it has no bearing on this problem: I believe the default GITLAB_RAILS_RACK_TIMEOUT is “60”, and the request is failing at about 30 seconds, which is below either value.

I’m seeing this in the webservice logs (I have obfuscated some of the details):

{"correlation_id":"01F3BZAB24X31VXWA5158T253C","duration_ms":29999,"error":"badgateway: failed to receive response: context canceled","level":"error","method":"POST","msg":"","time":"2021-04-15T23:55:02Z","uri":"/mygroup/myproject/-/pipelines"}

{"content_type":"text/html; charset=utf-8","correlation_id":"01F3BZAB24X31VXWA5158T253C","duration_ms":30000,"host":"gitlab.example.com","level":"info","method":"POST","msg":"access","proto":"HTTP/1.1","referrer":"https://gitlab.example.com/mygroup/myproject/-/pipelines/new","remote_addr":"1.2.3.4:28191","remote_ip":"1.2.3.4","route":"","status":502,"system":"http","time":"2021-04-15T23:55:02Z","ttfb_ms":30000,"uri":"/mygroup/myproject/-/pipelines","user_agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36","written_bytes":2940}

So it looks almost certain that I’m hitting a 30-second timeout somewhere. I’d really appreciate any guidance on how to override whatever setting is capping the response time at 30 seconds. Note that I’m using GKE Ingress, not NGINX.
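
For anyone wanting to check their own logs for the same ceiling, here is a rough sketch of how the entries can be grouped by correlation_id and compared against a suspected cutoff. It assumes the webservice logs are available as one JSON object per line; the function name `summarize_requests`, the 500 ms tolerance, and the trimmed sample entries are illustrative choices of mine, not anything provided by GitLab.

```python
import json
from collections import defaultdict


def summarize_requests(lines, cutoff_ms=30000, tolerance_ms=500):
    """Group JSON log lines by correlation_id and flag requests ending near cutoff_ms.

    cutoff_ms and tolerance_ms are assumptions for illustration only.
    """
    by_request = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank separator lines
        entry = json.loads(line)
        by_request[entry.get("correlation_id")].append(entry)

    report = {}
    for correlation_id, entries in by_request.items():
        durations = [e["duration_ms"] for e in entries if "duration_ms" in e]
        longest = max(durations, default=0)
        report[correlation_id] = {
            "max_duration_ms": longest,
            "statuses": [e["status"] for e in entries if "status" in e],
            "errors": [e["error"] for e in entries if "error" in e],
            # True when the request ended within tolerance_ms of the suspected cutoff
            "near_cutoff": abs(longest - cutoff_ms) <= tolerance_ms,
        }
    return report


# Trimmed copies of the two log entries above (only the fields used here).
sample = [
    '{"correlation_id":"01F3BZAB24X31VXWA5158T253C","duration_ms":29999,'
    '"error":"badgateway: failed to receive response: context canceled","method":"POST"}',
    '{"correlation_id":"01F3BZAB24X31VXWA5158T253C","duration_ms":30000,'
    '"status":502,"ttfb_ms":30000}',
]

report = summarize_requests(sample)
for correlation_id, summary in report.items():
    print(correlation_id, summary)
```

If every failing request is flagged as near the cutoff regardless of how slow the project is, that points to a fixed timeout in front of the webservice rather than to the Rails-level GITLAB_RAILS_RACK_TIMEOUT.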

Thanks!