I’m having an issue with random pipeline failures that I cannot find an explanation for.
Runner, cache, and GitLab are all on the same server. Disk space is not full.
Sometimes jobs fail but work without a problem on retry.
Sometimes all jobs fail on the first try and the second attempt always works.
The pipelines fail with the following message:
```
Running on runner--via server...
Getting source from Git repository
Fetching changes with git depth set to 50...
Reinitialized existing Git repository in /builds/-zLyZxzq/1/group/repo/random_repo/.git/
fatal: unable to access 'https://myserver.group/repo/random_repo/.git/': The requested URL returned error: 403
ERROR: Job failed: exit code 1
```
Other jobs fail with this message:
```
Fetching changes with git depth set to 50...
Reinitialized existing Git repository in /builds/-zLyZxzq/1/group/repo/random_repo/.git/
error: RPC failed; HTTP 403 curl 22 The requested URL returned error: 403
fatal: the remote end hung up unexpectedly
Cleaning up project directory and file based variables
ERROR: Job failed: exit code 1
```
In the syslog I found the following errors:
```
gitlab-runner[1008]: WARNING: Checking for jobs... failed  runner=-zLyZxzq status=POST https://git.server.com/api/v4/jobs/request: 409 Conflict
gitlab-runner[1008]: Checking for jobs... received  job=60932 repo_url=https://git.server.com/random_repo.git runner=riiNiSJC
gitlab-runner[1008]: ERROR: Could not create cache adapter  error=cache factory not found: factory for cache adapter "" was not registered
```
At the same time, Workhorse logs a 500 and the user is presented with an Error 500 page.
I’m wondering if the 403 and the cache error are related, or if they are two separate problems.
Does anyone have an idea what the problem could be, how I can investigate further, or a potential fix?
Could be a permission problem, i.e. the user who triggered the pipeline does not have permission to actually run the pipeline in this project, and when you manually retrigger the jobs with your own account that has Owner permissions, it works. Or it’s a timing problem, with the server being slow to respond.
Can you share the config.toml for the runner (with sensitive details removed), and also the versions of the GitLab server and Runner involved here?
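For reference, a redacted config.toml for a single runner on the same host might look roughly like this (name, executor, and token are placeholders):

```toml
# /etc/gitlab-runner/config.toml -- illustrative placeholders only
concurrent = 4
check_interval = 3

[[runners]]
  name = "local-runner"               # placeholder
  url = "https://git.server.com/"     # GitLab instance URL from the logs
  token = "REDACTED"
  executor = "docker"
  [runners.docker]
    image = "alpine:latest"
  [runners.cache]
    Type = ""   # an empty Type here is one known way to get the cache adapter "" warning
```

In particular, a leftover [runners.cache] section with an empty Type would be worth checking against the “Could not create cache adapter” message.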
Maybe there is a performance problem with all components on a single host, either CPU or I/O. I’d suggest monitoring the resource usage and correlating spikes/alerts with failed pipelines, for example with something like the sketch below.
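A minimal sketch, assuming the sysstat package is installed, that samples CPU, memory, and disk pressure so you can line the timestamps up with failed-job times afterwards:

```sh
# Sample CPU/memory and per-device I/O every 5 seconds, in the background.
sar -u -r 5 > cpu_mem.log &                     # CPU and memory utilization (sysstat)
iostat -x 5 > disk.log &                        # I/O wait and device utilization
journalctl -u gitlab-runner -f > runner.log &   # runner messages as they happen
```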
Another way to mitigate is to spin up a separate VM and install GitLab Runner there, making it the default (temporarily disable the other runner); see the sketch after this paragraph. Do the performance problems continue, or are they gone?
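On a Debian/Ubuntu VM that could look something like this (the registration token and description are placeholders, and the exact flags depend on your Runner version):

```sh
# Install GitLab Runner from the official package repository.
curl -L "https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh" | sudo bash
sudo apt-get install -y gitlab-runner

# Register against the existing GitLab instance; REDACTED is the
# registration token from the runner settings page.
sudo gitlab-runner register \
  --non-interactive \
  --url "https://git.server.com/" \
  --registration-token "REDACTED" \
  --executor "shell" \
  --description "temp-vm-runner"
```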
Last but not least, depending on what the server’s performance dashboards show, add more CPU/memory or faster disks. What sizing are you using at the moment?
This is likely the issue. I’m using a server with an HDD and am now actively monitoring resources, hoping to catch it in the act.
Permission errors are not the issue, since retriggering the job as the same user makes it work.
I also checked the “Could not create cache adapter” error, and it seems unrelated to the issue, since it sometimes appears even when the jobs run successfully.
I have another update on this. The problem occurred again after a few weeks, and the real cause was finally found: our own GitLab instance had blocked the runner’s IP because it made too many requests. Removing the rate limit for the runner solved the problem once and for all.
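For anyone hitting the same thing: depending on the GitLab version, the relevant settings live under Admin Area > Settings > Network (IP rate limits), or, on older Omnibus installs, in the Rack Attack block in /etc/gitlab/gitlab.rb. A sketch of allowlisting the runner’s IP there, with example values (run gitlab-ctl reconfigure afterwards):

```ruby
# /etc/gitlab/gitlab.rb -- example values; 192.0.2.10 stands in for the runner host
gitlab_rails['rack_attack_git_basic_auth'] = {
  'enabled' => true,
  'ip_whitelist' => ["127.0.0.1", "192.0.2.10"],  # IPs never rate-limited or banned
  'maxretry' => 10,    # attempts allowed per findtime window
  'findtime' => 60,    # window in seconds
  'bantime' => 3600    # seconds a blocked IP stays banned
}
```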