Understanding background jobs in self-hosted GitLab Enterprise Edition

Good afternoon,

We have a large on-premises GitLab setup with hundreds of projects, but one main codebase project.

The main codebase project runs ~60 pipelines per day, and each pipeline has over 1,000 jobs.

This week we’ve had a backlog of work from the Christmas holiday period, so we’ve got more pipelines running than usual. We limit these to 5 concurrent pipelines, which means we’ve got quite a lot of pipelines queued up waiting to start.

As the day has gone on, the GitLab UI has become flakier and flakier: recent commits aren’t showing on MRs, pipelines aren’t appearing after code has been pushed, etc.

I’ve had a look at the Background Jobs page in the Admin area, and our stats look something like this:

  • 89,904,561 Processed
  • 17,700 Failed
  • 11 Busy
  • 37,894 Enqueued

That sounds like a serious concern to me - over 37,000 background jobs “enqueued”? Why are only 11 jobs “busy”? Does this mean “in progress”?

My two questions are therefore:

  1. What do these different terms mean, and where can I find documentation for these background jobs?
  2. Is there a way of improving the performance of these background jobs? We’ve got loads of CPU cores we can throw at these tasks, and plenty of RAM.

Thanks in advance

Hi,

GitLab is a Ruby on Rails app which uses Unicorn and Sidekiq. The latter is the application’s background job processor and has no relation to CI jobs.

Any event or longer-running task triggered by clients in the browser is run asynchronously in Sidekiq. Its web interface is embedded into the Admin area (Monitoring > Background Jobs).

There you can inspect the failed jobs; you’ll likely see a pattern. Sometimes this is caused by e.g. not being able to send emails, or network communication going wrong.

Tuning: https://docs.gitlab.com/ee/administration/operations/extra_sidekiq_processes.html
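
If you prefer a console over the UI for that, here’s an untested, read-only sketch that groups the jobs currently in the retry set by worker class from a gitlab-rails console - a single dominant class usually points at the pattern:

    # sudo gitlab-rails console   (sketch only, nothing is modified)
    require 'sidekiq/api'

    retries = Sidekiq::RetrySet.new
    counts  = retries.group_by(&:klass).transform_values(&:count)

    # Top 10 failing worker classes
    counts.sort_by { |_klass, count| -count }.first(10).each do |klass, count|
      puts "#{klass}: #{count} retries"
    end

    # Peek at one error message for the most common failing class
    top_klass = counts.max_by { |_klass, count| count }&.first
    sample    = retries.find { |job| job.klass == top_klass }
    puts sample.item['error_message'] if sample

Note that this iterates the whole retry set, so it can take a moment if there are many failed jobs.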

Cheers,
Michael

Thanks for jumping in, @dnsmichi!

@dunczzz - let us know what other questions you have. :blush:

I learned about Sidekiq when hosting Discourse. I love the interface where you can debug and reschedule jobs, e.g. when adding new badges or having problems with task schedules.

Be aware, though, that this is working as the root user on an open heart - don’t force things through here. Better to use it read-only for root cause analysis.

Cheers,
Michael

Thanks for this @dnsmichi. I have some follow-up questions if you don’t mind :slight_smile:

  1. What does Unicorn do? I’ve just read the documentation here, but it doesn’t really explain it.
  2. (related to #1) I’ve noticed that we’ve set unicorn['worker_processes'] = 9 in our config, yet the documentation indicates a recommended value of (CPU cores * 1.5) + 1. It’s a 32-core box, so according to this we should be setting a value of 49, not 9 :slight_smile: How will this be affecting us? If this isn’t your area of expertise, I’ll open another post.
  3. In our Background Jobs -> Busy page in the Admin area, I can see each of our servers with their busy jobs. All 3 servers appear to be processing every queue. Our primary server has 9 Threads / 9 Busy and our two worker servers have 50 Threads / 50 Busy. In total, the Background Jobs page is stating that we have 111 Busy jobs and 20,320 enqueued. How do I interpret this?
  4. In the Background Jobs -> Busy page, we also have 7 additional rows for queues like Queues: pipeline_processing:pipeline_update and Queues: pipeline_processing:stage_update. Presumably this was an attempt by a colleague (no longer with the organisation) to process pipelines more rapidly. They’re all sitting at 2 Threads / 0 Busy - would I be correct in interpreting this to mean they’re not being used? If they’re not, any idea why? Are the other processes (the aforementioned 9/9, 50/50 rows on this page) taking priority or something?

Thanks again,
Duncan

Hi,

Unicorn is not a GitLab product, but an HTTP server generally used to serve Ruby on Rails applications. The Wikipedia article is short, but explains this a bit.

Unicorn works with a single control process, which runs single-threaded. It spawns worker processes which then do the actual work (handling requests). That’s a similar pattern to Apache httpd with the prefork model; AFAIK PostgreSQL follows the same approach.

Other servers in the Ruby world are Thin (used by Dashing) and Puma. In the past years I have seen larger Ruby on Rails applications shifting from Unicorn to Puma, mostly for performance reasons.

GitLab is moving to Puma as well; you can follow the progress in this epic.

Scaling Unicorn Workers

I haven’t done much here other than evaluate an arbitrary number of Unicorn workers. The docs are correct about using 1.5 or 2 as a factor for the number of CPU cores. Keep in mind, though, that this increases the virtual memory allocation on the system itself.

I wouldn’t immediately set this to 48 workers, considering that other applications also require CPU resources on the same host (PostgreSQL, Redis, NodeJS).

Start by setting the value to 16, then raise it to 32. Try to measure whether this helps with performance, or just increases memory usage and load on the system.
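
In /etc/gitlab/gitlab.rb that would look roughly like the sketch below (the numbers are just the steps from above, not a recommendation), followed by a reconfigure:

    # /etc/gitlab/gitlab.rb - sketch only; raise in steps and watch memory/load
    unicorn['worker_processes'] = 16
    # later, if the host still has RAM headroom:
    # unicorn['worker_processes'] = 32

    # apply with: sudo gitlab-ctl reconfigure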

Busy Background Jobs

9 Threads / 9 Busy and 50 Threads / 50 Busy means that every Sidekiq thread is occupied - the job pipeline is saturated and the backlog is becoming huge.

The things I would analyze:

  • Do these running jobs finish at some point?
  • How long is the (average) execution time of such a job? Do they always reach a timeout?
  • How many jobs are executed per minute, hour, and day?

Raising the Unicorn worker count will definitely help with the issue, but if there’s e.g. a job which fails with an email send timeout, and you multiply that by 1,000 user emails, it will still block.

Therefore, identify blocking jobs first.
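
To see which queues and workers are the blocking ones, here’s an untested, read-only sketch using Sidekiq’s Ruby API from a gitlab-rails console:

    # sudo gitlab-rails console   (read-only inspection)
    require 'sidekiq/api'

    # Largest queues first, with how long their oldest job has been waiting
    Sidekiq::Queue.all.sort_by(&:size).reverse.first(10).each do |queue|
      puts "#{queue.name}: #{queue.size} enqueued, latency #{queue.latency.round}s"
    end

    # What the busy threads are working on right now, and for how long
    Sidekiq::Workers.new.each do |_process_id, _thread_id, work|
      puts "#{work['queue']}: busy for #{Time.now.to_i - work['run_at']}s"
    end

A queue with high latency, combined with threads that have been busy on that same queue for many minutes, is a good candidate for such a blocking job.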

Queue Management

To my knowledge, this is controlled by Omnibus GitLab, and you must not modify the Sidekiq scheduler on your own. GitLab itself defines the queues it needs for the workers; searching for the queue names will show you where they are defined.

From an application developer’s perspective, this works far better than having just a single queue where one blocking job halts everything.
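
Regarding the extra pipeline_processing rows from your question 4: dedicated Sidekiq processes for specific queues are normally configured in /etc/gitlab/gitlab.rb via the extra Sidekiq processes feature linked above. I don’t know how your colleague set it up, but a sketch could look like this - the exact key depends on your GitLab version, older Omnibus releases use sidekiq_cluster['enable'] = true together with sidekiq_cluster['queue_groups'], newer ones use sidekiq['queue_groups'] (see the linked docs):

    # /etc/gitlab/gitlab.rb - sketch only, queue names taken from your post
    sidekiq['queue_groups'] = [
      '*',                                                                      # catch-all process
      'pipeline_processing:pipeline_update,pipeline_processing:stage_update'    # dedicated process
    ]

Whether such a dedicated process actually picks up work depends on whether the queue names in its group match the queues that are filling up.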

Monitoring

The most interesting part is not the idle queues, but the ones which have lots of items in them. Speaking of that, monitoring and collecting metrics for a better “over time” visualization will help here.

I’m not sure whether the default Prometheus metrics also cover Unicorn/Sidekiq in depth, but there are possibilities to integrate that into either the bundled Prometheus service or your own monitoring.

Here’s a good blog post on the matter: https://samsaffron.com/archive/2018/02/02/instrumenting-rails-with-prometheus
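
If you just want some numbers over time before setting up proper scraping, here’s an untested sketch you could run periodically via sudo gitlab-rails runner (or from a console) and append to a log file:

    # sketch only - dumps overall Sidekiq stats plus the ten largest queues
    require 'sidekiq/api'

    stats = Sidekiq::Stats.new
    puts "#{Time.now} processed=#{stats.processed} failed=#{stats.failed} " \
         "enqueued=#{stats.enqueued} busy=#{stats.workers_size}"

    stats.queues.sort_by { |_name, size| -size }.first(10).each do |name, size|
      puts "  #{name}: #{size}"
    end

Graphing those values over the day tells you quickly whether the backlog drains at all or just keeps growing.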

In terms of EE, I’d also look into scaling in other directions, such as Elasticsearch for the search backend. Maybe the problem is not only related to job processing performance, but also influenced by other components generating too many jobs, or jobs that take too long to complete.

Cheers,
Michael

Massively helpful response, Michael. Thank you.
