Errors: 403+500 Random Pipeline Fails with multiple runners

Hi there kind reader,

I’m having an issue with random pipeline failures that I cannot find an explanation for.
Runner, cache, and GitLab all run on the same server, and disk space is not full.

  1. Sometimes jobs fail but upon retrying they work without a problem.
  2. Sometimes all jobs fail on the first try and the second works all the time.

The pipelines fail with the following message:

Running on runner--via server...
Getting source from Git repository
Fetching changes with git depth set to 50...
Reinitialized existing Git repository in /builds/-zLyZxzq/1/group/repo/random_repo/.git/
fatal: unable to access 'https://myserver.group/repo/random_repo/.git/': 
The requested URL returned error: 403
ERROR: Job failed: exit code 1

Others fail with this message:

Fetching changes with git depth set to 50...
Reinitialized existing Git repository in /builds/-zLyZxzq/1/group/repo/random_repo/.git/
error: RPC failed; HTTP 403 curl 22 The requested URL returned error: 403
fatal: the remote end hung up unexpectedly
Cleaning up project directory and file based variables
ERROR: Job failed: exit code 1

In the syslog I found the following error:

gitlab-runner[1008]: WARNING: Checking for jobs... failed  runner=-zLyZxzq status=POST https://git.server.com/api/v4/jobs/request: **409 Conflict**
gitlab-runner[1008]: Checking for jobs... received  job=60932 repo_url=https://git.server.com/random_repo.git runner=riiNiSJC
gitlab-runner[1008]: ERROR: Could not create cache adapter  error=cache factory not found: **factory for cache adapter "" was not registered**

At the same time, Workhorse logs a 500 and the user is presented with an Error 500 page:

/var/log/gitlab/gitlab-workhorse/current:
{"content_type":"text/html; ","duration_ms":22,"host":"git.server.com","level":"info","method":"GET","msg":"access","route":"","status":500}
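To get a feel for how often Workhorse returns 5xx responses, entries like the one above can be counted. A minimal sketch, assuming the JSON field names from the snippet; the sample file here is made up so the command is runnable anywhere (on a real server, point it at `/var/log/gitlab/gitlab-workhorse/current` instead):

```shell
# Write a fabricated sample of Workhorse access-log lines (assumed format).
LOG=/tmp/workhorse-sample.log
cat > "$LOG" <<'EOF'
{"host":"git.server.com","level":"info","method":"GET","msg":"access","route":"","status":500}
{"host":"git.server.com","level":"info","method":"GET","msg":"access","route":"","status":200}
EOF

# Count lines whose status field is 500.
grep -c '"status":500' "$LOG"
# prints 1
```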

I’m wondering whether the 403 and the cache error are related or two separate problems.
Does anyone have an idea what the problem could be, how I can investigate further, or a potential fix?

Best regards

It could be a permission problem, i.e. the user who triggered the pipeline does not have permission to actually run pipelines in this project. When you manually retrigger the jobs with your own account, which has Owner permissions, they work. Or it’s a timing problem, with the server being slow to respond.

I’ve googled the error and found ERROR in 11.5 - Could not create cache adapter (#3802) · Issues · GitLab.org / gitlab-runner · GitLab, where it mentions that the runners.cache configuration is the source of the problem. Maybe the cache problem stems from performance issues or from accessing shared resources on the host (server: Git repo, runner: cloned repo).
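For reference, the error message suggests the runner parsed a `[runners.cache]` section with an empty `Type`. A hypothetical config.toml sketch (all names, addresses, and tokens are placeholders, not your actual values):

```toml
concurrent = 2

[[runners]]
  name = "my-runner"
  url = "https://git.server.com/"
  token = "REDACTED"
  executor = "shell"
  # An empty [runners.cache] section with no Type set can produce
  # 'factory for cache adapter "" was not registered' on some runner
  # versions. Either remove the section entirely or set a concrete type:
  [runners.cache]
    Type = "s3"
    [runners.cache.s3]
      ServerAddress = "minio.example.com"
      AccessKey = "REDACTED"
      SecretKey = "REDACTED"
      BucketName = "runner-cache"
```

If you don’t use a distributed cache at all, deleting the `[runners.cache]` block is the simpler option.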

Can you share the config.toml for the runner (with sensitive details removed), along with the versions of the GitLab server and Runner involved?

Maybe there is a performance problem with all components on a single host, either CPU or I/O. I’d suggest monitoring the resource usage and correlating spikes/alerts to failed pipelines.
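A minimal sketch of such resource sampling (Linux-only, since it reads `/proc/loadavg`; the output path, sample count, and interval are arbitrary choices — in practice you’d run this continuously, or use proper tooling like `sar`/`vmstat`, and compare the timestamps against failed pipelines):

```shell
# Sample the 1/5/15-minute load averages with a timestamp, so spikes
# can later be correlated with the times of failed jobs.
OUT=/tmp/resource-samples.log
: > "$OUT"
for i in 1 2 3; do
  printf '%s load=%s\n' "$(date -Is)" "$(cut -d' ' -f1-3 /proc/loadavg)" >> "$OUT"
  sleep 1
done
cat "$OUT"
```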

Another way to mitigate is to spin up a separate VM and install the GitLab runner, making it the default (temporarily disable the other runner). Do the performance problems continue, or are they gone?

Last but not least, depending on the performance dashboards of the server, add more CPU/memory or faster disks to it. Which sizing are you using at the moment?

Great suggestions. Thank you. I will investigate and will give an update.

This is likely the issue. I’m using a server with an HDD and am now actively monitoring resources to hopefully catch it in the act.

Permission errors are not the issue, since retriggering the job as the same user works.
I also checked the “Could not create cache adapter” error, and it seems unrelated to the issue, since it sometimes appears even when the jobs run successfully.


After updating everything to the latest version, the problem went away.


Interesting, thank you for sharing. 🙂

I’d suggest keeping the additional monitoring for resources, to detect problems faster if they come back.