Embedded nginx cannot access workhorse socket after Omnibus 14.9.3 -> 14.10.5 upgrade

I’ve got a Linode running Ubuntu 20.04 and gitlab-ce. For the first time ever, I ran into an upgrade issue: the web interface is no longer accessible.

After the minor version upgrade from 14.9.3 to 14.10.5, the root domain of my GitLab instance returns a blank HTTP 200 response. All other paths just return 404. I changed no configuration files before this upgrade.

The upgrade itself completed with no errors, and all the status-checking commands I’ve tried report none either. The only errors I’ve found come from nginx, and this is where it gets mysterious. This one keeps getting written to /var/log/gitlab/nginx/gitlab_error.log:

2022/07/01 09:43:03 [crit] 2586243#0: *116949 connect() to unix:/var/opt/gitlab/gitlab-workhorse/sockets/socket failed (2: No such file or directory) while connecting to upstream, client: <ip redacted>, server: <my domain>, request: "POST /api/v4/jobs/request HTTP/1.1", upstream: "http://unix:/var/opt/gitlab/gitlab-workhorse/sockets/socket:/api/v4/jobs/request", host: "<my domain>"

And this gets printed in /var/log/gitlab/nginx/registry_gitlab_error.log:

2022/07/01 11:35:57 [error] 4406#0: *34 connect() failed (111: Connection refused) while connecting to upstream, client: <ip redacted>, server: <my registry domain>, request: "POST /api/v4/jobs/request HTTP/1.1", upstream: "http://[::1]:5000/api/v4/jobs/request", host: "<my domain>"

The path /var/opt/gitlab/gitlab-workhorse/sockets/socket exists, and if I curl it directly, I get the correct HTML content:

curl --unix-socket /var/opt/gitlab/gitlab-workhorse/sockets/socket http://<my domain>/users/sign_in

But for some reason, the nginx embedded in Omnibus GitLab is not able to connect to the socket.
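Since the socket exists from my shell but nginx reports "No such file or directory", one thing worth ruling out is that the nginx process sees a different filesystem than my shell does (a chroot or a separate mount namespace). This is only a minimal sketch, assuming Linux's /proc/<pid>/root view; the function name seen_by_pid and the PROC_ROOT override are made up for illustration, and reading another user's /proc entry generally needs root.

```shell
#!/bin/sh
# Hypothetical helper: report whether a path exists from the point of
# view of a given process. /proc/<pid>/root is that process's root
# directory, so it reflects a chroot or a separate mount namespace.
# PROC_ROOT is an illustrative override, mainly so the function can be
# tried without a live process.
seen_by_pid() {
    pid=$1
    target=$2
    proc=${PROC_ROOT:-/proc}
    if [ -e "$proc/$pid/root$target" ]; then
        echo "visible"
    else
        echo "missing"
    fi
}

# Example (as root), with <pid> being an nginx worker's PID, i.e. the
# number before "#" in the error log line:
#   seen_by_pid <pid> /var/opt/gitlab/gitlab-workhorse/sockets/socket
```

If this printed "missing" for a worker PID while the same path exists in a normal shell, that would point at the process's environment rather than at file permissions.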

Below are the relevant files/dirs and their permissions:

/var/opt/gitlab/gitlab-workhorse:

drwxr-x--- 3 git               gitlab-www 4096 2022-07-01 10:59 gitlab-workhorse

/var/opt/gitlab/gitlab-workhorse/sockets:

drwxr-x--- 2 git  gitlab-www 4096 2022-07-01 11:28 sockets

/var/opt/gitlab/gitlab-workhorse/sockets/socket:

srwxrwxrwx 1 git git 0 2022-07-01 11:28 socket

All the permissions seem fine for the gitlab-www user to reach the socket. The nginx workers are running as that user. I even temporarily set a shell for gitlab-www in /etc/passwd and curled the socket successfully after doing su gitlab-www.
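For anyone wanting to repeat the permission check: connecting to a Unix socket needs search (x) permission on every parent directory and write permission on the socket itself, so the whole chain matters, not just the last three entries. Here is a rough sketch that prints mode, owner and group for each component from the target up to /. It assumes GNU stat (as shipped with Ubuntu), and walk_path is just a name I picked.

```shell
#!/bin/sh
# Hypothetical helper: print mode, owner:group and name for a path and
# each of its parent directories, from the target up to the root.
# Assumes GNU coreutils stat (the -c format option).
walk_path() {
    p=$1
    while :; do
        stat -c '%A %U:%G %n' "$p" || return 1
        # Stop at the filesystem root (or "." for a relative path).
        case $p in
            /|.) break ;;
        esac
        p=$(dirname "$p")
    done
}

# Example:
#   walk_path /var/opt/gitlab/gitlab-workhorse/sockets/socket
```

Running it as the nginx worker user (for example through sudo -u gitlab-www) would additionally show whether that user can stat each component at all.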

gitlab-ctl reconfigure, gitlab-ctl restart, restarting the whole server, and even upgrading further to 15.x have not helped. I do have backups from which I could restore the full server to its pre-upgrade state if needed, but I’d rather not go there unless absolutely necessary.

Any ideas?

I just did the downgrade + restore process back to 14.9.3, but the same issue persists.

I ended up restoring this morning’s pre-upgrade backup of the full VPS into a new server instance, and everything’s running fine again. However, what actually happened is still a mystery, and I’m not going to attempt to upgrade GitLab again without extensive precautions (such as an LVM snapshot of the drive) unless a cause for this is found.

I did a quick comparison of the two instances by md5sums of key config files. Before and after the 14.10.5 upgrade, the config files in /var/opt/gitlab/nginx/conf are identical. All the permissions on the socket file and its parent directories are identical. gitlab.rb is identical. /var/opt/gitlab/gitlab-rails/etc/gitlab.yml is identical. Both instances have ample free disk space.
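The comparison was along these lines. This is a sketch rather than the exact commands I ran: it assumes both config trees are reachable as local directories (for example, one copied over from the other server), and the function names checksum_tree and compare_trees are illustrative.

```shell
#!/bin/sh
# Hypothetical helpers for comparing two directory trees by md5sum.

# Print "md5  ./relative/path" for every regular file under a directory,
# sorted by path so that two listings line up.
checksum_tree() {
    ( cd "$1" && find . -type f -exec md5sum {} + | sort -k 2 )
}

# Diff the checksum listings of two trees.
# Returns 0 if they match, non-zero otherwise (diff's exit status).
compare_trees() {
    list_a=$(mktemp)
    list_b=$(mktemp)
    checksum_tree "$1" > "$list_a"
    checksum_tree "$2" > "$list_b"
    diff "$list_a" "$list_b"
    status=$?
    rm -f "$list_a" "$list_b"
    return $status
}

# Example, with the broken instance's tree copied to /tmp/broken-conf:
#   compare_trees /var/opt/gitlab/nginx/conf /tmp/broken-conf
```

No output and a zero exit status mean every file has the same checksum in both trees. Note that this only compares file contents, not ownership or modes.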

I’m leaving the old, busted instance up for later forensics, but this error doesn’t make any sense whatsoever.