Problem with multiple versions of GitLab Runner and Docker executor errors

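My scenario is as follows. I have been using a gitlab-ci file for CI/CD with docker as the runner executor, and it always worked fine. After a recent GitLab upgrade (the runner was upgraded to the same version), I noticed some problems.
Jobs that used to run fine now fail immediately with: ERROR: Failed to remove network for build ERROR: Preparation failed: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? (docker.go:951:0s). Looking closely, I found that the job had been assigned to a runner with an older version. After I click retry several times, the job gets picked up by the runner with the correct version and then runs normally.
Error screenshot: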


Screenshot of a successful run:

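In addition, when I open the runners page in the web UI as an administrator, it returns a data error. Runners that used to be displayed are now blank.
Have you seen a similar problem, and how can I fix it?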

Screenshot of the error when opening the runners page in the web UI:

Screenshot of a successful run on the correct runner:


As you can see, the version of the runner that succeeded differs from the version of the runner that failed earlier.

Hello,
If this is a dedicated runner, please check whether Docker is installed and running.
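One way to run that check from the runner host is sketched below. It only tests whether something accepts connections on the Unix socket named in the error message (`/var/run/docker.sock`); it does not talk to the Docker API. The function name `docker_socket_reachable` is made up for this illustration. It is best run as the same user the runner service uses, since socket permissions can differ between users.

```python
import socket


def docker_socket_reachable(path="/var/run/docker.sock", timeout=2.0):
    """Return True if a Unix stream socket at `path` accepts a connection."""
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            s.connect(path)
        return True
    except OSError:
        # Covers a missing socket file, a refused connection,
        # a permission error, and a timeout.
        return False


if __name__ == "__main__":
    print("docker socket reachable:", docker_socket_reachable())
```

If this reports `False` on one runner host but `True` on another, the failing jobs are probably landing on the host where the socket is not reachable.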

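Hello. My runner is installed directly on the GitLab host and is managed by systemd; it does not run in Docker.
Also, Docker is definitely installed and running on my GitLab server. Otherwise jobs on the correct runner would not succeed.
Wouldn't you agree?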

Which executor did you choose when registering the runner? Was it Docker? If so, the Docker daemon needs to stay running. Please check this.

Yes, it is docker, and my Docker daemon is running.
Otherwise the later retries would not succeed.

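Is this still happening, or does it only happen occasionally? If it is the latter, it may just be a minor glitch. I ran into this once because the Docker daemon was not running after the instance restarted. Thanks.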

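Yes. Every automatically triggered CI job now fails immediately, and only after I click retry several times does it get picked up by the correct runner and succeed. I suspect the runner data in the database is corrupted. I have tried restarting the server, restarting the runner, restarting GitLab, and so on, and none of that fixes the problem.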

Mind changing the language to English so that more folks can contribute? Thanks.

My scenario is as follows. I have been using a gitlab-ci file for CI/CD with docker as the runner executor. It worked normally, but after updating GitLab (the runner was also updated to the same version), I found some problems. Jobs that previously ran normally now fail immediately with: ERROR: Failed to remove network for build ERROR: Preparation failed: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? (docker.go:951:0s). When I looked carefully, I found that the job had been assigned to a runner with an old version. When I retried the build several times, the runner with the correct version picked it up, and the job ran normally. Error screenshot:


Correct screenshot:

I have translated my post to English; please see above.

Hello, based on my description, do you have any ideas?

Looks like there are dangling/outdated runners registered to the GitLab instance, which occasionally take over the jobs and lack permissions or availability of the Docker container engine, leading to the errors you are seeing.

I’d suggest investigating the runner fleet dashboard and disabling or deleting all runners that are stale or unhealthy. Alternatively, assign a tag to the CI/CD jobs that need a specific runner, so that they always run on a specific runner pool.
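If the dashboard is hard to read, the same triage can be done on the JSON that the Runners API returns. The sketch below assumes a list of runner records shaped like the response of `GET /api/v4/runners/all` (which needs an administrator token), and uses only the `id`, `description`, and `status` fields. The helper name and the sample records are hypothetical, and the set of possible `status` values depends on the GitLab version.

```python
def runners_needing_attention(runners):
    """Return (id, description, status) for every runner that is not 'online'."""
    flagged = []
    for runner in runners:
        status = runner.get("status", "unknown")
        if status != "online":
            flagged.append((runner["id"], runner.get("description", ""), status))
    return flagged


# Hypothetical sample data, not taken from the instance discussed here.
sample = [
    {"id": 1, "description": "current-runner", "status": "online"},
    {"id": 2, "description": "old-runner", "status": "stale"},
    {"id": 3, "status": "offline"},
]

for runner_id, description, status in runners_needing_attention(sample):
    print(f"runner {runner_id} ({description or 'no description'}): {status}")
```

Every runner this flags is a candidate for pausing or deletion, after checking that nothing still depends on it.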


First of all, thank you for your reply. I logged into the cicd interface as an administrator and got the following error. I think it was caused by an error in the database.

I don’t speak the localized language in the UI, so I don’t know what it says. Can you translate the error to English?

The numbers in the middle of the screen say 28 / 8 / 10 where I assume that 10 runners are stale. I would investigate them first and identify the runner that causes problems with execution (the docker error in your logs above). Disable all runners that are not needed.

This GitLab version may also be outdated and need an update. Can you share what the /help endpoint shows in your browser, e.g. https://example.gitlab.com/help?

First of all, thank you for your reply. Since last time, I deleted all the CI data in the server's database and restarted. The data error in the web UI is now gone, but there is a new problem: registering a runner fails immediately with a 500 error. production.log shows “Gitlab::Auth::UnauthorizedError (Gitlab::Auth::UnauthorizedError):”. The runner version matches the GitLab version. I don’t know how to proceed. Screenshots are attached below.

The web UI no longer shows the data error:


Registering the runner:



error log:

Screenshots of gitlab version and runner version:

After I manually repaired the database, the runner was created successfully, but I encountered the following error when executing the task:


production.log:

api-log:

Please don’t delete data manually in the database unless support/engineering ask you to. It can break your setup and the data integrity of the application. Unregistering or deleting runners can be done through the UI or the API.
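For the API route, a minimal sketch is shown below. It only builds the `DELETE /api/v4/runners/:id` request and prints it, so nothing is removed when it runs. The base URL, the runner ID `42`, and the token placeholder are hypothetical; a real call needs a personal access token with sufficient rights, sent in the `PRIVATE-TOKEN` header.

```python
import urllib.request


def build_delete_runner_request(base_url, runner_id, token):
    """Build (but do not send) a request that deletes one runner by ID."""
    url = f"{base_url.rstrip('/')}/api/v4/runners/{int(runner_id)}"
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"PRIVATE-TOKEN": token},
    )


# Hypothetical values for illustration only.
req = build_delete_runner_request("https://example.gitlab.com", 42, "<your-token>")
print(req.get_method(), req.full_url)

# Sending is left out on purpose: urllib.request.urlopen(req) would
# perform the deletion on a real instance.
```

Going through the API keeps the application's own cleanup logic in the loop, which a direct database delete bypasses.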

What happens if you follow the suggestion to go to the pipeline editor? Can you share the CI/CD configuration that is involved? Most likely it still refers to runners that are broken or not fully registered.

Since the setup seems to be in a broken state (database deletes, unknown runner errors), you may want to consider a reinstall on a new server, and then migrate all groups/projects.

First of all, thank you for your guidance and help.
The reason I deleted the data was to confirm my direction. I thought that the reason why the runner kept reporting errors was because of data problems. Now it has been confirmed.
Then how can I restore all my data on the new server while leaving out the bad runner data? If I simply restore from the tar archive produced by gitlab-backup create, I will end up with the same multi-runner execution errors as on the old server. Does my description make sense?