We’ve been running GitLab CE on AWS since June 2017 (so, probably starting at version 9.2.5?). To help with resiliency, we run the DB components on RDS (PostgreSQL flavor) and place the repositories on an EFS share. We designed our deployment automation so that we can “upgrade” the GitLab installation simply by executing a CloudFormation update. Basically:
- a new EC2 instance is spawned
- CFn-init scripts install GitLab CE via the omnibus RPM
- CFn-init scripts copy a templated gitlab.rb file from S3 to /etc/gitlab
- CFn-init scripts finish out by calling `gitlab-ctl reconfigure` (a rough sketch of the whole sequence follows)
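In condensed form, the sequence amounts to something like the following (the S3 bucket and repo-setup details here are illustrative stand-ins, not our exact template):

```bash
#!/bin/bash
# Condensed, illustrative version of the cfn-init steps -- the bucket
# name and paths are placeholders, not our actual template.
set -euo pipefail

# Install GitLab CE via the omnibus RPM (packagecloud repo setup + yum)
curl -s https://packages.gitlab.com/install/repositories/gitlab/gitlab-ce/script.rpm.sh | bash
yum install -y gitlab-ce

# Copy the templated gitlab.rb rendered into S3 down to /etc/gitlab
aws s3 cp s3://example-bucket/gitlab/gitlab.rb /etc/gitlab/gitlab.rb

# Final step -- this is the call that hangs under 11.x
gitlab-ctl reconfigure
```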
Up through version 10.8.7, this worked like a charm. However, with the 11.x versions, the final step (running `gitlab-ctl reconfigure`) hangs. If I configure CFn to not roll back on failure, I can log in after CFn has signaled a failure (due to having exceeded the configured launch timeout) and check on the instance’s status. When I look at /var/log/gitlab/reconfigure/NNNNNNNNNN.log, I consistently see output similar to:
```
[2018-11-15T17:28:19+00:00] INFO: execute[systemctl enable gitlab-runsvdir] ran successfully
[2018-11-15T17:28:19+00:00] INFO: cookbook_file[/usr/lib/systemd/system/gitlab-runsvdir.service] sending run action to execute[systemctl start gitlab-runsvdir] (immediate)
```
When I check on the systemd service, I typically find:
```
# systemctl status -l gitlab-runsvdir
● gitlab-runsvdir.service - GitLab Runit supervision process
   Loaded: loaded (/usr/lib/systemd/system/gitlab-runsvdir.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
```
Attempting to manually start the service simply hangs.
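To demonstrate that without losing yet another shell, bounding the start attempt is enough (the 60-second cap is arbitrary):

```bash
# The start never returns on its own; bound it so the shell survives.
timeout 60 systemctl start gitlab-runsvdir
echo "systemctl exit code: $?"   # 124 = killed by timeout, i.e. still hung
```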
If I reboot, none of the gitlab-ctl utilities function. I can’t simply uninstall and reinstall, either: the yum job hangs when it calls gitlab-ctl. The failed replacement EC2 is just totally wedged.
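My guess (untested) is that something like the following would at least get the package off the box, by skipping the scriptlets that shell out to gitlab-ctl; in practice I just terminate the instance:

```bash
# Untested guess: remove the RPM without running its pre/post scriptlets,
# since those are what invoke gitlab-ctl and hang.
rpm -e --noscripts gitlab-ce
```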
Interestingly, if I launch an EC2 using the same CFn template but prevent step 4 from running, then log in and execute the step-4 script from an interactive shell, everything completes successfully. No hang. No errors. The resulting service comes up as expected (with all of the RDS-/EFS-hosted persisted data served out).
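In other words, the exact same payload succeeds by hand; the script path below is a stand-in for wherever our template drops the step-4 script:

```bash
# Same AMI, same instance profile, same rendered gitlab.rb -- run by hand:
sudo bash /root/gitlab-step4.sh   # hypothetical path to the step-4 script
# ...which boils down to just:
sudo gitlab-ctl reconfigure       # completes cleanly from an interactive shell
```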
I’m assuming something changed in 11.x around its expectations of the invoking shell/session, but I haven’t been able to isolate what that might be yet.
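In case it rings a bell for anyone, these are the sorts of things I plan to poke at next (pure guesses stemming from the invoking-shell theory):

```bash
# Guesswork, not known fixes: try divorcing reconfigure from cfn-init's session
setsid gitlab-ctl reconfigure       # run in a fresh session, detached from the caller
bash -lc 'gitlab-ctl reconfigure'   # run under a full login-shell environment
```

Any help would be much appreciated.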