Infinite loop in ContainerRegistry::DeleteContainerRepositoryWorker

I noticed the VM running my GitLab CE (16.11) container had a nearly pegged CPU. I checked the logs and saw this spewing over and over as fast as it can:

==> /var/log/gitlab/gitlab-rails/application_json.log <==
{"severity":"INFO","time":"2024-05-03T05:10:52.094Z","meta.caller_id":"ContainerRegistry::DeleteContainerRepositoryWorker","correlation_id":"b3cc39cfff4d3c2553fb73cea1680888","meta.root_caller_id":"Cronjob","meta.feature_category":"container_registry","meta.client_id":"ip/","container_repository_id":1,"container_repository_path":"home/clarkweb","project_id":4,"third_party_cleanup_tags_service":true}
{"severity":"ERROR","time":"2024-05-03T05:10:52.151Z","meta.caller_id":"ContainerRegistry::DeleteContainerRepositoryWorker","correlation_id":"b3cc39cfff4d3c2553fb73cea1680888","meta.root_caller_id":"Cronjob","meta.feature_category":"container_registry","meta.client_id":"ip/","service_class":"Projects::ContainerRepository::DeleteTagsService","container_repository_id":1,"project_id":4,"message":"could not delete tags: latest"}

==> /var/log/gitlab/sidekiq/current <==
{"severity":"INFO","time":"2024-05-03T05:10:52.151Z","class":"ContainerRegistry::DeleteContainerRepositoryWorker","project_id":4,"container_repository_id":1,"container_repository_path":"home/clarkweb","tags_size_before_delete":1,"deleted_tags_size":null,"meta.caller_id":"ContainerRegistry::DeleteContainerRepositoryWorker","correlation_id":"b3cc39cfff4d3c2553fb73cea1680888","meta.root_caller_id":"Cronjob","meta.feature_category":"container_registry","meta.client_id":"ip/","job_status":"running","queue":"default","jid":"c77069ad12c1b287315366ff","retry":0}
{"severity":"INFO","time":"2024-05-03T05:10:52.164Z","retry":0,"queue":"default","version":0,"store":null,"status_expiration":1800,"queue_namespace":"container_repository_delete","args":[],"class":"ContainerRegistry::DeleteContainerRepositoryWorker","jid":"c77069ad12c1b287315366ff","created_at":"2024-05-03T05:10:52.072Z","meta.caller_id":"ContainerRegistry::DeleteContainerRepositoryWorker","correlation_id":"b3cc39cfff4d3c2553fb73cea1680888","meta.root_caller_id":"Cronjob","meta.feature_category":"container_registry","meta.client_id":"ip/","worker_data_consistency":"always","size_limiter":"validated","enqueued_at":"2024-05-03T05:10:52.073Z","job_size_bytes":2,"pid":503,"message":"ContainerRegistry::DeleteContainerRepositoryWorker JID-c77069ad12c1b287315366ff: done: 0.090109 sec","job_status":"done","scheduling_latency_s":0.001062,"redis_calls":9,"redis_duration_s":0.002142,"redis_read_bytes":10,"redis_write_bytes":1358,"redis_queues_calls":2,"redis_queues_duration_s":0.000264,"redis_queues_read_bytes":2,"redis_queues_write_bytes":614,"redis_queues_metadata_calls":2,"redis_queues_metadata_duration_s":0.000807,"redis_queues_metadata_read_bytes":3,"redis_queues_metadata_write_bytes":105,"redis_shared_state_calls":5,"redis_shared_state_duration_s":0.001071,"redis_shared_state_read_bytes":5,"redis_shared_state_write_bytes":639,"db_count":10,"db_write_count":3,"db_cached_count":1,"db_txn_count":1,"db_replica_txn_count":0,"db_primary_txn_count":0,"db_main_txn_count":1,"db_ci_txn_count":0,"db_main_replica_txn_count":0,"db_ci_replica_txn_count":0,"db_replica_count":0,"db_primary_count":10,"db_main_count":10,"db_ci_count":0,"db_main_replica_count":0,"db_ci_replica_count":0,"db_replica_cached_count":0,"db_primary_cached_count":1,"db_main_cached_count":1,"db_ci_cached_count":0,"db_main_replica_cached_count":0,"db_ci_replica_cached_count":0,"db_replica_wal_count":0,"db_primary_wal_count":0,"db_main_wal_count":0,"db_ci_wal_count":0,"db_main_replica_wal_count":0,"db_ci_replica_wal_count":0,"db_replica_wal_cached_count":0,"db_primary_wal_cached_count":0,"db_main_wal_cached_count":0,"db_ci_wal_cached_count":0,"db_main_replica_wal_cached_count":0,"db_ci_replica_wal_cached_count":0,"db_replica_txn_max_duration_s":0.0,"db_primary_txn_max_duration_s":0.0,"db_main_txn_max_duration_s":0.006,"db_ci_txn_max_duration_s":0.0,"db_main_replica_txn_max_duration_s":0.0,"db_ci_replica_txn_max_duration_s":0.0,"db_replica_txn_duration_s":0.0,"db_primary_txn_duration_s":0.0,"db_main_txn_duration_s":0.006,"db_ci_txn_duration_s":0.0,"db_main_replica_txn_duration_s":0.0,"db_ci_replica_txn_duration_s":0.0,"db_replica_duration_s":0.0,"db_primary_duration_s":0.008,"db_main_duration_s":0.008,"db_ci_duration_s":0.0,"db_main_replica_duration_s":0.0,"db_ci_replica_duration_s":0.0,"external_http_count":3,"external_http_duration_s":0.007039506017463282,"cpu_s":0.06638,"mem_objects":11710,"mem_bytes":762920,"mem_mallocs":2780,"mem_total_bytes":1231320,"worker_id":"sidekiq_0","rate_limiting_gates":[],"duration_s":0.090109,"completed_at":"2024-05-03T05:10:52.164Z","load_balancing_strategy":"primary","db_duration_s":0.009304,"urgency":"low","target_duration_s":300,"target_scheduling_latency_s":60}
{"severity":"INFO","time":"2024-05-03T05:10:52.166Z","retry":0,"queue":"default","version":0,"store":null,"status_expiration":1800,"queue_namespace":"container_repository_delete","args":[],"class":"ContainerRegistry::DeleteContainerRepositoryWorker","jid":"50f5fed453b70bdc30159220","created_at":"2024-05-03T05:10:52.159Z","meta.caller_id":"ContainerRegistry::DeleteContainerRepositoryWorker","correlation_id":"b3cc39cfff4d3c2553fb73cea1680888","meta.root_caller_id":"Cronjob","meta.feature_category":"container_registry","meta.client_id":"ip/","worker_data_consistency":"always","size_limiter":"validated","enqueued_at":"2024-05-03T05:10:52.160Z","job_size_bytes":2,"pid":503,"message":"ContainerRegistry::DeleteContainerRepositoryWorker JID-50f5fed453b70bdc30159220: start","job_status":"start","scheduling_latency_s":0.005232}

I’d marked some containers for deletion earlier today. Maybe something to do with that? I have no idea how to make it stop. Restarting the container doesn’t fix it.
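
The log does at least include container_repository_id and project_id, so the stuck record can be inspected from a Rails console (gitlab-rails console). Here's a minimal sketch of what I'd look at; the status values and the idea of flipping one to delete_failed are my assumptions, not a documented fix, so check ContainerRepository.statuses on your own version first:

    # gitlab-rails console
    repo = ContainerRepository.find(1)   # container_repository_id from the log lines above

    repo.path                            # registry path GitLab derives for this repository
    repo.status                          # assumption: a delete_scheduled-style status keeps the cron re-picking it
    ContainerRepository.statuses         # whatever status values actually exist in this version

    # Untested, last-resort idea: move the record out of the scheduled state so the cron job
    # stops re-enqueuing it. Anything left in registry storage would still need cleaning up
    # separately (e.g. registry garbage collection).
    # repo.update!(status: :delete_failed)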

Thoughts?

I think the order of steps to reproduce this is:

  1. Have Project 1 in Group A
  2. Have some containers in Project 1
  3. Schedule some containers for deletion in Project 1
  4. Move Project 1 to Group B
  5. Wait for the cleanup task to come around
  6. The cleanup task fails to find the group, dies, and repeats (see the console sketch after this list)
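
If anyone wants to sanity-check that theory on their own instance, this is roughly what I'd compare in gitlab-rails console. The idea that a path/group mismatch is the trigger is my guess, not something confirmed in the GitLab source:

    # gitlab-rails console
    repo    = ContainerRepository.find(1)   # id from the log
    project = repo.project

    project.full_path                        # where GitLab thinks the project lives now
    repo.path                                # path the worker logged as "home/clarkweb"
    project.group&.full_path                 # nil here would line up with step 6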

I was never able to clean this job up. I had to migrate all the groups in this install to another instance.

I encountered the same problem. Did you manage to fix it?
If not, perhaps you can file a bug for it in the issue tracker. That might be a better place to get it fixed.

FWIW, the issue I wrote up is here. A robot told me to “self triage”, which I attempted despite the sound of it. During that process, different robots indicated I was triaging it all wrong, so I gave up. Someone else came in with (in my opinion) an unrelated permissions problem, and now the whole thing is parked in the backlog.

I’ve been a happy camper keeping my registry entirely separate from GitLab. Less to go wrong.