GitLab CI jobs queued on retry

Hello GitLab community!

I have a pipeline that is configured to retry failing jobs. It worked perfectly for a long time, but in the last few weeks we noticed that something changed. Now, when a job fails in the pipeline, its retry stays pending with the message:
This job is in pending state and is waiting to be picked by a runner
We created a couple of extra runners because we thought the problem was a lack of them, but it's not. The only jobs stuck in pending are the ones with the retry option. I configure retry using the configuration below:

.runner_tags: &runner_tags
  image: ${ANSIBLE_DOCKER_IMAGE}:${ANSIBLE_DOCKER_TAG}
  retry: 1
  extends: .ansible_run_tags

I’m using public GitLab with my own custom runners (Docker executor). One additional note: I can’t prove it, but I think the pending time is related to how long the failing job took.

I hope someone can help me find a solution. It’s annoying because we trigger these pipelines via the API, and another process waits (with a timeout) until the pipeline is finished. So the pipeline is still running, but the main process reports that it took too long :frowning:
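For context, our orchestration is roughly like the sketch below. This is a minimal, hypothetical example rather than our exact script: PROJECT_ID, TRIGGER_TOKEN, API_TOKEN and the timeout are placeholders, and it only uses the standard pipeline trigger and pipeline status endpoints of the GitLab API.

import time
import requests

GITLAB = "https://gitlab.com/api/v4"
PROJECT_ID = "12345"          # placeholder: our project ID
TRIGGER_TOKEN = "glptt-..."   # placeholder: pipeline trigger token
API_TOKEN = "glpat-..."       # placeholder: token used to read pipeline status
TIMEOUT = 3600                # overall timeout (seconds) the caller is willing to wait

# Trigger the pipeline on master through the pipeline trigger API.
resp = requests.post(
    f"{GITLAB}/projects/{PROJECT_ID}/trigger/pipeline",
    data={"token": TRIGGER_TOKEN, "ref": "master"},
)
resp.raise_for_status()
pipeline_id = resp.json()["id"]

# Poll the pipeline until it reaches a terminal state or the timeout expires.
deadline = time.time() + TIMEOUT
while time.time() < deadline:
    status = requests.get(
        f"{GITLAB}/projects/{PROJECT_ID}/pipelines/{pipeline_id}",
        headers={"PRIVATE-TOKEN": API_TOKEN},
    ).json()["status"]
    if status in ("success", "failed", "canceled", "skipped"):
        print(f"pipeline {pipeline_id} finished: {status}")
        break
    time.sleep(30)
else:
    # This is the annoying case: the pipeline is still running (for example a
    # retried job is stuck in pending), but the caller gives up and reports
    # that it took too long.
    print(f"pipeline {pipeline_id} did not finish within {TIMEOUT}s")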


My job is queued, and in the runner logs I can see:

Nov 23 09:15:24 integrationgitlabrunner-1 gitlab-runner[416730]: Acquiring request slot              builds=0 runner=cxs9Mjg-
Nov 23 09:15:26 integrationgitlabrunner-1 gitlab-runner[416730]: Feeding runners to channel          builds=0
Nov 23 09:15:26 integrationgitlabrunner-1 gitlab-runner[416730]: Feeding runner to channel           builds=0 runner=2nAJXMiK
Nov 23 09:15:26 integrationgitlabrunner-1 gitlab-runner[416730]: Checking for jobs...nothing         runner=2nAJXMiK
Nov 23 09:15:26 integrationgitlabrunner-1 gitlab-runner[416730]: Processing runner                   builds=0 runner=2nAJXMiK
Nov 23 09:15:26 integrationgitlabrunner-1 gitlab-runner[416730]: Acquiring executor from provider    builds=0 runner=2nAJXMiK
Nov 23 09:15:26 integrationgitlabrunner-1 gitlab-runner[416730]: Acquiring job slot                  builds=0 runner=2nAJXMiK
Nov 23 09:15:26 integrationgitlabrunner-1 gitlab-runner[416730]: Acquiring request slot              builds=0 runner=2nAJXMiK
Nov 23 09:15:27 integrationgitlabrunner-1 gitlab-runner[416730]: Feeding runner to channel           builds=0 runner=cxs9Mjg-
Nov 23 09:15:27 integrationgitlabrunner-1 gitlab-runner[416730]: Checking for jobs...nothing         runner=cxs9Mjg-
Nov 23 09:15:27 integrationgitlabrunner-1 gitlab-runner[416730]: Processing runner                   builds=0 runner=cxs9Mjg-
Nov 23 09:15:27 integrationgitlabrunner-1 gitlab-runner[416730]: Acquiring executor from provider    builds=0 runner=cxs9Mjg-
Nov 23 09:15:27 integrationgitlabrunner-1 gitlab-runner[416730]: Acquiring job slot                  builds=0 runner=cxs9Mjg-
Nov 23 09:15:27 integrationgitlabrunner-1 gitlab-runner[416730]: Acquiring request slot              builds=0 runner=cxs9Mjg-
Nov 23 09:15:29 integrationgitlabrunner-1 gitlab-runner[416730]: Feeding runners to channel          builds=0
Nov 23 09:15:29 integrationgitlabrunner-1 gitlab-runner[416730]: Feeding runner to channel           builds=0 runner=2nAJXMiK
Nov 23 09:15:41 integrationgitlabrunner-1 gitlab-runner[416730]: Checking for jobs...nothing         runner=2nAJXMiK
Nov 23 09:15:41 integrationgitlabrunner-1 gitlab-runner[416730]: Processing runner                   builds=0 runner=2nAJXMiK
Nov 23 09:15:41 integrationgitlabrunner-1 gitlab-runner[416730]: Acquiring executor from provider    builds=0 runner=2nAJXMiK
Nov 23 09:15:41 integrationgitlabrunner-1 gitlab-runner[416730]: Acquiring job slot                  builds=0 runner=2nAJXMiK
Nov 23 09:15:41 integrationgitlabrunner-1 gitlab-runner[416730]: Acquiring request slot              builds=0 runner=2nAJXMiK
Nov 23 09:15:42 integrationgitlabrunner-1 gitlab-runner[416730]: Feeding runner to channel           builds=0 runner=cxs9Mjg-

Could this be related to the GitLab Runner version?

And when the job finally executes, there is nothing about it in the runner logs at all. So is it possible that the job is picked up but GitLab CI is not aware of that?

One additional thing: the job stays pending and queued until it runs on the same runner, even if that runner is available; when I pause the runner on the project and enable it again, the job immediately starts running on it. API call responses:

  "id": 3368503264,
  "status": "pending",
  "stage": "healthcheck-2",
  "name": "hvves",
  "ref": "master",
  "tag": false,
  "coverage": null,
  "allow_failure": false,
  "created_at": "2022-11-23T12:31:45.443Z",
  "started_at": null,
  "finished_at": null,
  "duration": null,
  "queued_duration": 103.90318713,
  "user": {
    
  },
  "commit": {
    
  },
  "pipeline": {

  },
  "web_url": "https://gitlab.com/Orange-OpenSource/lfn/onap/xtesting-onap/-/jobs/3368503264",
  "project": {
    "ci_job_token_scope_enabled": false
  },
  "artifacts": [],
  "runner": null,
  "artifacts_expire_at": null,
  "tag_list": [
    "kubernetes"
  ]
}

running

{
  "id": 3368503264,
  "status": "running",
  "stage": "healthcheck-2",
  "name": "hvves",
  "ref": "master",
  "tag": false,
  "coverage": null,
  "allow_failure": false,
  "created_at": "2022-11-23T12:31:45.443Z",
  "started_at": "2022-11-23T12:37:27.524Z",
  "finished_at": null,
  "duration": 2.918878457,
  "queued_duration": 341.971386,
  "user": {

  },
  "commit": {

  },
  "pipeline": {

  },
  "web_url": "https://gitlab.com/Orange-OpenSource/lfn/onap/xtesting-onap/-/jobs/3368503264",
  "project": {
    "ci_job_token_scope_enabled": false
  },
  "artifacts": [],
  "runner": {
    "id": 19198544,
    "description": "ONAP integration runner #1",
    "active": true,
    "paused": false,
    "is_shared": false,
    "runner_type": "project_type",
    "name": "gitlab-runner",
    "online": true,
    "status": "online"
  },
  "artifacts_expire_at": null,
  "tag_list": [
    "kubernetes"
  ]
}

Since you are running your own runners, it always helps to actually include their version in the post.

Hi @balonik ,

Sorry, I missed that, but to be honest I'm not sure it really matters in this case, as we checked on a lot of runners and saw the same issue on each of them. To be clear, we tested on runners with the following versions:

  • 15.6.0
  • 15.5.1
  • 14.3.0

Right now our interim solution is to use the GitLab API to get the pending jobs and, if there are any, pause and re-activate one of the runners; that unlocks the pending jobs.
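For anyone interested, the workaround looks roughly like the sketch below. Again a minimal, hypothetical example rather than our exact script: PROJECT_ID and API_TOKEN are placeholders, it assumes the token is allowed to manage the project's runners, and it uses the legacy active flag of the Runners API (newer GitLab versions expose this as paused).

import requests

GITLAB = "https://gitlab.com/api/v4"
PROJECT_ID = "12345"        # placeholder: our project ID
API_TOKEN = "glpat-..."     # placeholder: token that may manage the project runners
HEADERS = {"PRIVATE-TOKEN": API_TOKEN}

# Look for jobs stuck in the pending state.
pending_jobs = requests.get(
    f"{GITLAB}/projects/{PROJECT_ID}/jobs",
    headers=HEADERS,
    params={"scope": "pending"},
).json()

if pending_jobs:
    # Take one of the project-specific runners and toggle it: pause it,
    # then activate it again. In our case that is enough for the stuck
    # retried jobs to be picked up.
    runners = requests.get(
        f"{GITLAB}/projects/{PROJECT_ID}/runners",
        headers=HEADERS,
        params={"type": "project_type"},
    ).json()
    runner_id = runners[0]["id"]
    for state in ("false", "true"):
        requests.put(
            f"{GITLAB}/runners/{runner_id}",
            headers=HEADERS,
            data={"active": state},
        )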

Hey!

We are having the same issue. The queue/retry system used to work, but not anymore. Do we know anything more about this? The interim solution doesn't work in our org :frowning:

Thanks for the help! :slight_smile:


We are facing the same issue as well. Any progress on this?

We have the same issue as well, with public GitLab and custom Docker-based runners on a Linux machine (no Kubernetes involved).

Hi there, we also have the same issue with the following deployment:

GitLab Server:
Version: 15.7.0-ee
Type: On Premise (Self Hosted)
Installation Method: Omnibus
OS: RHEL 8

GitLab Runner Server:
Version: 15.7.1
Type: On Premise (Self Hosted)
Installation Method: Omnibus (from GitLab repo)
OS: RHEL 8
Runner Engine: Docker

Feel free to ping me if any other information is needed.

Thank you.

Hi, is there any solution for this bug?

It's weird, since the jobs fail to run only on the Docker runner (inside a GitLab Runner VM); when I use a GitLab Runner inside Kubernetes with the Kubernetes executor, the queue works.

Is there a bug in the GitLab Runner Docker executor?

cc GitLab team: @tnir @dnsmichi

Hi,

I solved the problem by commenting out listen_address = "[::]:8093" in /etc/gitlab-runner/config.toml.

It seems related to these gitlab-runner issues: "409 Conflict" causes Runner to not run any jobs, and give up checking for new jobs for half an hour (#4360) · Issues · GitLab.org / gitlab-runner · GitLab, and Runner doesn't pick up jobs, receives 409 conflict on each attempt. (#29466) · Issues · GitLab.org / gitlab-runner · GitLab.

Please try this on your runner.

There is another workaround described here: gitlab-runner sometimes ignores jobs (#22088) · Issues · GitLab.org / GitLab · GitLab.

I believe this issue will be fixed by Start pipeline in after_commit callback when retrying jobs (!116480) · Merge requests · GitLab.org / GitLab · GitLab.