Sudden High CPU / Unresponsiveness

Hello,

This morning, with no intentional changes, our self-hosted internal GitLab server (Omnibus v13.8.4, old I know) on a 4 CPU / 8 GB Fedora virtual machine started showing a high CPU load average, after working great for years. The high load was not caused by I/O wait. The web server began returning 500 and 502 errors. Repeated runs of gitlab-ctl reconfigure and gitlab-ctl restart did not resolve the issue; the CPU would begin to churn immediately on startup, and the load average in top quickly climbed past 5.00. gitlab-rake gitlab:check reported no problems, and gitlab-ctl tail showed no obvious errors. When a 500 or 502 was thrown, the log showed only a generic timeout with no obvious googleable error.
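For anyone wanting to repeat the I/O wait check, here is a minimal sketch, assuming a Linux host where the first line of /proc/stat is the aggregate "cpu" line. It takes two samples and reports what share of CPU time went to each state in between. The function names are my own and not part of any GitLab tooling.

```python
import time

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu' line into a dict of cumulative tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_breakdown(before, after):
    """Percentage of CPU time spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def sample(path="/proc/stat"):
    """Read one sample of the aggregate counters (Linux only)."""
    with open(path) as f:
        return parse_cpu_line(f.readline())


def main(interval=5):
    """Print the breakdown over `interval` seconds."""
    first = sample()
    time.sleep(interval)
    second = sample()
    for name, pct in cpu_breakdown(first, second).items():
        print(f"{name:8s} {pct:5.1f}%")
```

Calling main() on the affected host prints one line per state. A high user or system share with a low iowait share matches what is described above. On a virtual machine the steal share is also worth a look, since it counts time the hypervisor spent running something other than this guest.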

Eventually I realized the problem process (or so I believe) is Sidekiq. Running gitlab-rake cache:clear made the website reachable again, but it is still very slow, with the load average in top hovering around 1.50 consistently for hours. One process continues to eat CPU: top shows it at 80%+ most of the time, dropping to 10% for a few seconds every minute. A ps on the process shows:
sidekiq 5.2.9 queues:authorized_project_update:authorized_projec

Now that I can access the website, I went to Admin/Background Jobs and see the following under Processes on the Busy tab:

gitlab.novalocal:11218 queues:authorized_project_update:authorized_project_update_project_create,authorized_project_update:authorized_project_update_project_group_link_create,authorized_project_update:authorized_project_update_user_refresh_over_user_range,authorized_project_update:authorized_project_update_user_refresh_with_low_urgency,auto_devops:auto_devops_disable,auto_merge:auto_merge_process,chaos:chaos_cpu_spin,chaos:chaos_db_spin,chaos:chaos_kill,chaos:chaos_leak_mem,chaos:chaos_sleep,container_repository:cleanup_container_repository,container_repository:container_expiration_policies_cleanup_container_repository,container_repository:delete_container_repository,cronjob:admin_email,cronjob:analytics_instance_statistics_count_job_trigger,cronjob:authorized_project_update_periodic_recalculate,cronjob:ci_archive_traces_cron,cronjob:ci_pipeline_artifacts_expire_artifacts,cronjob:ci_platform_metrics_update_cron,cronjob:ci_schedule_delete_objects_cron,cronjob:container_expiration_policy,cronjob:environments_auto_stop_cron,cronjob:expire_build_artifacts,cronjob:gitlab_usage_ping,cronjob:import_export_project_cleanup,cronjob:import_stuck_project_import_jobs,cronjob:issue_due_scheduler,cronjob:jira_import_stuck_jira_import_jobs,cronjob:member_invitation_reminder_emails,cronjob:metrics_dashboard_schedule_annotations_prune,cronjob:namespaces_prune_aggregation_schedules,cronjob:pages_domain_removal_cron,cronjob:pages_domain_ssl_renewal_cron,cronjob:pages_domain_verification_cron,cronjob:partition_creation,cronjob:personal_access_tokens_expired_notification,cronjob:personal_access_tokens_expiring,cronjob:pipeline_schedule,cronjob:prune_old_events,cronjob:prune_web_hook_logs,cronjob:releases_create_evidence,cronjob:releases_manage_evidence,cronjob:remove_expired_group_links,cronjob:remove_expired_members,cronjob:remove_unaccepted_member_invites,cronjob:remove_unreferenced_lfs_objects,cronjob:repository_archive_cache,cronjob:repository_check_dispatch,cronjob:requests_profiles
,cronjob:schedule_merge_request_cleanup_refs,cronjob:schedule_migrate_external_diffs,cronjob:stuck_ci_jobs,cronjob:stuck_export_jobs,cronjob:stuck_merge_jobs,cronjob:trending_projects,cronjob:update_container_registry_info,cronjob:users_create_statistics,cronjob:x509_issuer_crl_check,dependency_proxy:purge_dependency_proxy_cache,deployment:deployments_drop_older_deployments,deployment:deployments_execute_hooks,deployment:deployments_finished,deployment:deployments_forward_deployment,deployment:deployments_link_merge_request,deployment:deployments_success,deployment:deployments_update_environment,gcp_cluster:cluster_configure_istio,gcp_cluster:cluster_install_app,gcp_cluster:cluster_patch_app,gcp_cluster:cluster_provision,gcp_cluster:cluster_update_app,gcp_cluster:cluster_upgrade_app,gcp_cluster:cluster_wait_for_app_installation,gcp_cluster:cluster_wait_for_app_update,gcp_cluster:cluster_wait_for_ingress_ip_address,gcp_cluster:clusters_applications_activate_service,gcp_cluster:clusters_applications_deactivate_service,gcp_cluster:clusters_applications_uninstall,gcp_cluster:clusters_applications_wait_for_uninstall_app,gcp_cluster:clusters_cleanup_app,gcp_cluster:clusters_cleanup_project_namespace,gcp_cluster:clusters_cleanup_service_account,gcp_cluster:wait_for_cluster_creation,github_importer:github_import_import_diff_note,github_importer:github_import_import_issue,github_importer:github_import_import_lfs_object,github_importer:github_import_import_note,github_importer:github_import_import_pull_request,github_importer:github_import_import_pull_request_merged_by,github_importer:github_import_import_pull_request_review,github_importer:github_import_refresh_import_jid,github_importer:github_import_stage_finish_import,github_importer:github_import_stage_import_base_data,github_importer:github_import_stage_import_issues_and_diff_notes,github_importer:github_import_stage_import_lfs_objects,github_importer:github_import_stage_import_notes,github_importer:github_import_stage_
import_pull_requests,github_importer:github_import_stage_import_pull_requests_merged_by,github_importer:github_import_stage_import_pull_requests_reviews,github_importer:github_import_stage_import_repository,hashed_storage:hashed_storage_migrator,hashed_storage:hashed_storage_project_migrate,hashed_storage:hashed_storage_project_rollback,hashed_storage:hashed_storage_rollbacker,incident_management:clusters_applications_check_prometheus_health,incident_management:incident_management_add_severity_system_note,incident_management:incident_management_pager_duty_process_incident,incident_management:incident_management_process_alert,incident_management:incident_management_process_prometheus_alert,jira_connect:jira_connect_sync_branch,jira_connect:jira_connect_sync_builds,jira_connect:jira_connect_sync_deployments,jira_connect:jira_connect_sync_feature_flags,jira_connect:jira_connect_sync_merge_request,jira_connect:jira_connect_sync_project,jira_importer:jira_import_advance_stage,jira_importer:jira_import_import_issue,jira_importer:jira_import_stage_finish_import,jira_importer:jira_import_stage_import_attachments,jira_importer:jira_import_stage_import_issues,jira_importer:jira_import_stage_import_labels,jira_importer:jira_import_stage_import_notes,jira_importer:jira_import_stage_start_import,mail_scheduler:mail_scheduler_issue_due,mail_scheduler:mail_scheduler_notification_service,object_pool:object_pool_create,object_pool:object_pool_destroy,object_pool:object_pool_join,object_pool:object_pool_schedule_join,object_storage:object_storage_background_move,object_storage:object_storage_migrate_uploads,package_repositories:packages_nuget_extraction,pipeline_background:archive_trace,pipeline_background:ci_build_report_result,pipeline_background:ci_build_trace_chunk_flush,pipeline_background:ci_daily_build_group_report_results,pipeline_background:ci_pipeline_artifacts_coverage_report,pipeline_background:ci_pipeline_success_unlock_artifacts,pipeline_background:ci_ref_delete_unlock_
artifacts,pipeline_background:ci_test_failure_history,pipeline_cache:expire_job_cache,pipeline_cache:expire_pipeline_cache,pipeline_creation:create_pipeline,pipeline_creation:run_pipeline_schedule,pipeline_default:build_coverage,pipeline_default:build_trace_sections,pipeline_default:ci_create_cross_project_pipeline,pipeline_default:ci_pipeline_bridge_status,pipeline_default:pipeline_metrics,pipeline_default:pipeline_notification,pipeline_hooks:build_hooks,pipeline_hooks:pipeline_hooks,pipeline_processing:build_finished,pipeline_processing:build_queue,pipeline_processing:build_success,pipeline_processing:ci_build_prepare,pipeline_processing:ci_build_schedule,pipeline_processing:ci_resource_groups_assign_resource_from_resource_group,pipeline_processing:pipeline_process,pipeline_processing:pipeline_update,pipeline_processing:stage_update,pipeline_processing:update_head_pipeline_for_merge_request,repository_check:repository_check_batch,repository_check:repository_check_clear,repository_check:repository_check_single_repository,todos_destroyer:todos_destroyer_confidential_issue,todos_destroyer:todos_destroyer_entity_leave,todos_destroyer:todos_destroyer_group_private,todos_destroyer:todos_destroyer_private_features,todos_destroyer:todos_destroyer_project_private,unassign_issuables:members_destroyer_unassign_issuables,update_namespace_statistics:namespaces_root_statistics,update_namespace_statistics:namespaces_schedule_aggregation,analytics_instance_statistics_counter_job,approve_blocked_pending_approval_users,authorized_keys,authorized_projects,background_migration,bulk_import,bulk_imports_entity,chat_notification,ci_delete_objects,create_commit_signature,create_note_diff_file,default,delete_diff_files,delete_merged_branches,delete_stored_files,delete_user,design_management_copy_design_collection,design_management_new_version,destroy_pages_deployments,detect_repository_languages,disallow_two_factor_for_group,disallow_two_factor_for_subgroups,email_receiver,emails_on_push,
environments_canary_ingress_update,error_tracking_issue_link,experiments_record_conversion_event,expire_build_instance_artifacts,export_csv,external_service_reactive_caching,file_hook,flush_counter_increments,git_garbage_collect,github_import_advance_stage,gitlab_performance_bar_stats,gitlab_shell,group_destroy,group_export,group_import,import_issues_csv,invalid_gpg_signature_update,irker,issuable_export_csv,issue_placement,issue_rebalancing,mailers,merge,merge_request_cleanup_refs,merge_request_mergeability_check,metrics_dashboard_prune_old_annotations,metrics_dashboard_sync_dashboards,migrate_external_diffs,namespaceless_project_destroy,namespaces_onboarding_pipeline_created,namespaces_onboarding_user_added,new_issue,new_merge_request,new_note,pages,pages_domain_ssl_renewal,pages_domain_verification,pages_remove,pages_transfer,pages_update_configuration,phabricator_import_import_tasks,post_receive,process_commit,project_cache,project_daily_statistics,project_destroy,project_export,project_schedule_bulk_repository_shard_moves,project_service,project_update_repository_storage,prometheus_create_default_alerts,propagate_integration,propagate_integration_group,propagate_integration_inherit,propagate_integration_inherit_descendant,propagate_integration_project,propagate_service_template,reactive_caching,rebase,remote_mirror_notification,repository_cleanup,repository_fork,repository_import,repository_remove_remote,repository_update_remote_mirror,self_monitoring_project_create,self_monitoring_project_delete,service_desk_email_receiver,snippet_schedule_bulk_repository_shard_moves,snippet_update_repository_storage,system_hook_push,update_external_pull_requests,update_highest_role,update_merge_requests,update_project_statistics,upload_checksum,web_hook,web_hooks_destroy,x509_certificate_revoke

I don’t know if this is a normal process. I’m guessing it is some internal mechanism that handles essentially all GitLab operations. No “Jobs” appear under the Busy tab except for a cronjob:expire_build_artifacts job, but that seems to start and end quickly.
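The long string shown for the process reads as the list of queues that one Sidekiq process is subscribed to, not a list of jobs it is running. As far as I know, a single process listening on every queue is the usual layout for a small Omnibus install, but treat that as an assumption. The sketch below groups such a string by queue namespace so it is easier to read. The function name and the "(no namespace)" label are mine.

```python
from collections import defaultdict


def group_queues(process_title):
    """Group the comma-separated queue list of a Sidekiq process title.

    Entries look like 'namespace:worker' or just 'worker'. Everything
    before the first 'queues:' marker (host, PID, version) is ignored.
    """
    marker = "queues:"
    start = process_title.find(marker)
    if start == -1:
        raise ValueError("no 'queues:' marker in process title")
    queue_list = process_title[start + len(marker):]

    groups = defaultdict(list)
    for entry in queue_list.split(","):
        entry = entry.strip()
        if not entry:
            continue
        namespace, sep, worker = entry.partition(":")
        if sep:
            groups[namespace].append(worker)
        else:
            # Queues without a namespace prefix, e.g. 'default' or 'post_receive'.
            groups["(no namespace)"].append(entry)
    return dict(groups)


def summarize(process_title):
    """Print how many queues fall under each namespace."""
    for namespace, workers in sorted(group_queues(process_title).items()):
        print(f"{namespace}: {len(workers)} queue(s)")
```

Passing the full string from the Busy tab to summarize() gives a count per namespace (cronjob, pipeline_processing, and so on). Seeing the chaos:* queues in the list only means the process would pick such jobs up if any were enqueued. It does not by itself show that any are running.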

Web pages take almost a minute to load with no one using the server: no CI/CD jobs running, no merge requests, etc. I’m really struggling to understand why all of this sluggishness appeared seemingly at random. Does anyone have any tips?

Thanks

I ended up taking a snapshot of the VM and loading it on a new machine, and it now performs normally. Same specs (4 CPU / 8 GB memory) but different hardware. The data center team has since informed me that a power surge occurred the morning the odd behavior began. Maybe that affected the underlying CPUs, though I have no proof that was the issue.