GitLab VM becoming unresponsive every day around same time

Hi all! I’m having the following issue with my GitLab omnibus installation on Debian 11 (plain, no Docker).

Every day around the same time (roughly between 5:30 pm and 6 pm UTC), my entire GitLab host becomes so unresponsive that I can’t even SSH into it anymore. The only thing that helps is hard-rebooting the VM. It very much feels like the machine is running out of memory, but I’d assume 8 GB RAM should be enough for a small GitLab installation?

I’m using Prometheus + Grafana (both running on a different host) to pull system metrics from the GitLab VM via node-exporter, and all the graphs (CPU, memory, etc.) look completely normal right up until the crash.

I also poked around the log files (nginx, postgres, sidekiq, …) and couldn’t find anything suspicious around the time of the crash. I also checked the system logs (journalctl --system), but no hints there either. What confuses me a bit is that the last log line is from the moment the instance became inaccessible, and the next one is from after I rebooted it - nothing in between. Apparently, the OS couldn’t even write to syslog anymore.
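
For reference, this is roughly what I check after each reboot to rule out the OOM killer (assuming journald defaults on Debian 11; the -b -1 option only works if the journal is persistent):

  # is the journal persistent across reboots? (Storage= in journald.conf)
  grep -i 'Storage' /etc/systemd/journald.conf

  # list previous boots and read the tail of the boot that crashed
  journalctl --list-boots
  journalctl -b -1 -e

  # look for OOM-killer traces in the previous boot's kernel messages
  journalctl -k -b -1 | grep -iE 'out of memory|oom'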

Any ideas how to go about debugging this further? Any hints for me?

It could even be an issue with the host system, but before I get in touch with my hosting provider’s tech support, I wanted to check all possibilities within my own responsibility.

I repeatedly see the following error in gitlab-rails/exceptions.log, but I reckon it’s unrelated, and it also happens at a different time than the crash.

{"severity":"ERROR","time":"2022-11-15T16:15:47.152Z","correlation_id":"01GHY16FY17790MB1Q0XAGC4Q5","exception.class":"ActiveRecord::RecordNotFound","exception.message":"Couldn't find User without an ID","exception.backtrace":["app/controllers/uploads_controller.rb:41:in `find_model'","app/controllers/concerns/uploads_actions.rb:166:in `block in model'","lib/gitlab/utils/strong_memoize.rb:44:in `strong_memoize'","app/controllers/concerns/uploads_actions.rb:166:in `model'","app/controllers/uploads_controller.rb:73:in `authorize_create_access!'","ee/lib/gitlab/ip_address_state.rb:10:in `with'","ee/app/controllers/ee/application_controller.rb:45:in `set_current_ip_address'","app/controllers/application_controller.rb:530:in `set_current_admin'","lib/gitlab/session.rb:11:in `with_session'","app/controllers/application_controller.rb:521:in `set_session_storage'","lib/gitlab/i18n.rb:107:in `with_locale'","lib/gitlab/i18n.rb:113:in `with_user_locale'","app/controllers/application_controller.rb:515:in `set_locale'","app/controllers/application_controller.rb:509:in `set_current_context'","lib/gitlab/metrics/elasticsearch_rack_middleware.rb:16:in `call'","lib/gitlab/middleware/memory_report.rb:13:in `call'","lib/gitlab/middleware/speedscope.rb:13:in `call'","lib/gitlab/database/load_balancing/rack_middleware.rb:23:in `call'","lib/gitlab/middleware/rails_queue_duration.rb:33:in `call'","lib/gitlab/metrics/rack_middleware.rb:16:in `block in call'","lib/gitlab/metrics/web_transaction.rb:46:in `run'","lib/gitlab/metrics/rack_middleware.rb:16:in `call'","lib/gitlab/jira/middleware.rb:19:in `call'","lib/gitlab/middleware/go.rb:20:in `call'","lib/gitlab/etag_caching/middleware.rb:21:in `call'","lib/gitlab/middleware/query_analyzer.rb:11:in `block in call'","lib/gitlab/database/query_analyzer.rb:37:in `within'","lib/gitlab/middleware/query_analyzer.rb:11:in `call'","lib/gitlab/middleware/multipart.rb:173:in `call'","lib/gitlab/middleware/read_only/controller.rb:50:in `call'","lib/gitlab/middleware/read_only.rb:18:in `call'","lib/gitlab/middleware/same_site_cookies.rb:27:in `call'","lib/gitlab/middleware/handle_malformed_strings.rb:21:in `call'","lib/gitlab/middleware/basic_health_check.rb:25:in `call'","lib/gitlab/middleware/handle_ip_spoof_attack_error.rb:25:in `call'","lib/gitlab/middleware/request_context.rb:21:in `call'","lib/gitlab/middleware/webhook_recursion_detection.rb:15:in `call'","config/initializers/fix_local_cache_middleware.rb:11:in `call'","lib/gitlab/middleware/compressed_json.rb:26:in `call'","lib/gitlab/middleware/rack_multipart_tempfile_factory.rb:19:in `call'","lib/gitlab/middleware/sidekiq_web_static.rb:20:in `call'","lib/gitlab/metrics/requests_rack_middleware.rb:77:in `call'","lib/gitlab/middleware/release_env.rb:13:in `call'"],"user.username":null,"tags.program":"web","tags.locale":"en","tags.feature_category":"not_owned","tags.correlation_id":"01GHY16FY17790MB1Q0XAGC4Q5"}

Hi,

4 CPUs and 8 GB RAM are enough for a small installation. I have about the same with maybe 10-15 repos and 10-12 users, and I don’t see such symptoms. I also run on Debian 11.

The fact that it didn’t write any logs anymore means CPU usage or I/O went through the roof, i.e. some process went crazy. How many CPUs does your server have? We know it has 8 GB RAM, but what about the rest? How much disk space is free? Is swap configured, and if so, how much?
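
A quick way to collect all of that in one go (standard Debian tools, nothing GitLab-specific):

  nproc            # number of CPUs
  free -h          # RAM and swap usage
  swapon --show    # swap devices and their sizes
  df -h /          # free disk space on the root filesystem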

Thanks for the reply! This is the host:

  • 2 CPUs
  • 8 GB RAM
  • 22 GB free disk space on / (53 %)
  • 8 GB Swap

Is there a way to tell which background / cron jobs are running at what time?

I think you’re cutting it a bit fine with 2 CPUs, especially with all the other stuff enabled like Prometheus, Grafana, etc. Can you upgrade it to 4 CPUs? I assume it’s a VM at a hosting company?

In the admin web interface you can go to Monitoring → Background Jobs, where you can see the Sidekiq/cron jobs.
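
For jobs scheduled at the OS level rather than inside GitLab, something like this covers cron and systemd timers (plain Debian tooling, not GitLab-specific):

  systemctl list-timers --all    # systemd timers and their next run time
  crontab -l                     # the current user's crontab (repeat as root)
  ls /etc/cron.d /etc/cron.hourly /etc/cron.daily /etc/cron.weekly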

Also, I’m not sure which GitLab version you are on, but there were 13.x and 14.x versions affected by vulnerabilities, and some users here posted about crypto miners being dropped on their servers. Not saying this is the reason, but it’s a possibility if you are running a vulnerable version.

Thanks a lot, very helpful!

I think you’re cutting it a bit fine with 2 CPUs, especially with all the other stuff enabled like Prometheus, Grafana, etc. Can you upgrade it to 4 CPUs?

Yes, migrating to a 4 CPU instance is an option. But what are the chances that this would fix my problem? Intuitively, even if some heavy background tasks occupied both cores, wouldn’t the system still be at least somewhat responsive (e.g. able to write system logs)?

Also, I’m not sure which GitLab version you are on […]

I’m on version 15.3.3.

Is there a config option for Sidekiq that I could adapt (less concurrency or so?) to try to mitigate that?

Take a look at this article: Running on a Raspberry Pi | GitLab. It has a section on setting the Sidekiq max concurrency.
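
The relevant part of /etc/gitlab/gitlab.rb would look roughly like this (values taken from that article as an example; double-check the exact key names against the docs for your GitLab version, as they have changed between releases):

  # /etc/gitlab/gitlab.rb - reduce resource usage on a small host
  puma['worker_processes'] = 0              # single-process Puma
  sidekiq['max_concurrency'] = 10           # cap Sidekiq worker threads
  prometheus_monitoring['enable'] = false   # optional; note this also disables the bundled node_exporter

  # apply with: sudo gitlab-ctl reconfigure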

Not necessarily. I recently had a system with 30 GB RAM that was incorrectly configured with Oracle Database. All logging completely stopped, and the server was completely unresponsive like yours and had to be hard-rebooted. It’s fixed now that the issues have been addressed, but it sounds like you have a similar situation.

GitLab has a lot of components (PostgreSQL, etc.), so there are a few things to look at.

I adapted the config according to the Raspberry Pi article, but I’m still getting those crashes. I also realized that the crash happens pretty much exactly 24 hours after the system’s last reboot.

The kernel log doesn’t show any messages after the time the system crashed. The last log line was one of a series of [UFW BLOCK] (firewall) notifications.
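
Given the 24-hours-after-boot pattern, something scheduled relative to boot time seems worth ruling out, e.g. (standard systemd tooling, nothing GitLab-specific):

  last reboot | head -n 5        # confirm the reboot times
  systemctl list-timers --all    # timers; OnBootSec/OnUnitActiveSec ones fire relative to boot
  grep -rE 'OnBootSec|OnUnitActiveSec' /etc/systemd/system /lib/systemd/system 2>/dev/null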

Any more ideas?

If your VM is crashing anyway, why not issue a ‘gitlab-ctl stop’ via cron at maybe 5 pm UTC? If the VM still crashes, it’s probably not a GitLab issue. Just an idea…
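
A crontab entry for that could look something like this (assuming the server clock runs on UTC; gitlab-ctl is normally symlinked into /usr/bin by the Omnibus package):

  # root's crontab (crontab -e as root): stop all GitLab services at 17:00
  0 17 * * * /usr/bin/gitlab-ctl stop >> /var/log/gitlab-ctl-stop.log 2>&1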

Yes, I’ll try that next time. It’s super annoying to debug when I only get a chance every 24 hours :smile:.

If you can, stop GitLab at some point during the day, then get a list of all running processes and post it here. That way we can see if something else is left running that might be killing the machine:

ps aux

If stopping GitLab at or just before 17:00 UTC makes the problem disappear, that means something weird is going on with your GitLab install. Is this server dedicated to GitLab, or are there other apps running on it as well?
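
A variant of the process listing that puts the biggest memory and CPU consumers first is often more telling than the raw output (plain procps options):

  ps aux --sort=-%mem | head -n 25    # top 25 processes by memory usage
  ps aux --sort=-%cpu | head -n 25    # top 25 processes by CPU usage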

We had a similar problem (unresponsiveness), although it didn’t crash every 24 hours, only now and then, and in our case it was caused by the machine size. After I increased CPU and memory, it didn’t crash anymore. If that works for you but you still want to know which jobs are running on the server around the problematic time, you can run an analyser such as GitlabSOS to get performance statistics: GitLab.com / GitLab Support Team / toolbox / fast-stats · GitLab. Or try to schedule the analyser to run just around the time it normally crashes.

Very interesting, thanks a lot for this!

I’m very sure by now, though, that it’s not a GitLab-related issue, as the VM keeps crashing even with all GitLab services stopped.

Will give that analyzer tool a try anyway!