Need to restart docker service to get docker network working again

Env

  • gitlab-ce 12.7.2-ce.0 amd64
  • gitlab-runner 12.6.0 amd64
  • Debian Stretch

Problem

I have to restart the Docker service on the Debian Stretch host to get container networking working again. Otherwise, connections from inside the CI container to outside resources fail with “Connection unreachable”.

  • What are you seeing, and how does that differ from what you expect to see?
 Running with gitlab-runner 12.6.0 (ac8e767a)
   on Shared Docker runner vJRY-fex
Using Docker executor with image git.example.com:5555/internal/container/stretch/build:v1 ...
00:34
 Starting service docker:dind ...
 Pulling docker image docker:dind ...
 Using docker image sha256:8489eeb24a264b6bcdb17f3da00140cebe92ee36bd22365f37d07d59390df4ee for docker:dind ...
 Waiting for services to be up and running...
 *** WARNING: Service runner-vJRY-fex-project-120-concurrent-0-docker-0 probably didn't start properly.
 Health check error:
 service "runner-vJRY-fex-project-120-concurrent-0-docker-0-wait-for-service" timeout
 Health check container logs:
 Service container logs:
 2020-01-29T10:44:59.621826740Z time="2020-01-29T10:44:59.621556665Z" level=info msg="Starting up"
 2020-01-29T10:44:59.625563991Z time="2020-01-29T10:44:59.625412983Z" level=warning msg="could not change group /var/run/docker.sock to docker: group docker not found"
 2020-01-29T10:44:59.626422790Z failed to load listeners: can't create unix socket /var/run/docker.sock: device or resource busy
 *********
 Authenticating with credentials from job payload (GitLab Registry)
 Pulling docker image git.example.com:5555/internal/container/stretch/build:v1 ...
 Using docker image sha256:5223d58b17d2138d64951fb738ddd44b71bf8734477ac05f7874db64922532cb for git.example.com:5555/internal/container/stretch/build:v1 ...
Running on runner-vJRY-fex-project-120-concurrent-0 via git...
00:01
Fetching changes...
00:06
 Reinitialized existing Git repository in /builds/internal/backoffice_ui/.git/
 From https://git.example.com/internal/backoffice_ui
  * [new ref]         refs/pipelines/565 -> refs/pipelines/565
    535d10e..7e5d9ca  develop            -> origin/develop
 Checking out 7e5d9caf as develop...
 *********

… a lot of other output, but then the important part:

$ apt update
 WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
 Err:1 http://repos.example.com/debian stretch InRelease
   Could not connect to repos.example.com:80 (172.21.1.124), connection timed out
 Err:2 http://repos.example.com/debian nodejs_11 InRelease
   Unable to connect to repos.example.com:http:
 Reading package lists...
 Building dependency tree...
 Reading state information...
 All packages are up to date.

In the end the build breaks because none of the required packages can be installed.

Workaround

~# service docker restart
  • After restart
 Running with gitlab-runner 12.6.0 (ac8e767a)
   on Shared Docker runner vJRY-fex
Using Docker executor with image git.example.com:5555/internal/container/stretch/build:v1 ...
00:33
 Starting service docker:dind ...
 Pulling docker image docker:dind ...
 Using docker image sha256:8489eeb24a264b6bcdb17f3da00140cebe92ee36bd22365f37d07d59390df4ee for docker:dind ...
 Waiting for services to be up and running...
 *** WARNING: Service runner-vJRY-fex-project-120-concurrent-0-docker-0 probably didn't start properly.
 Health check error:
 service "runner-vJRY-fex-project-120-concurrent-0-docker-0-wait-for-service" timeout
 Health check container logs:
 Service container logs:
 2020-02-04T09:01:09.326839247Z time="2020-02-04T09:01:09.326549028Z" level=info msg="Starting up"
 2020-02-04T09:01:09.330228908Z time="2020-02-04T09:01:09.330098855Z" level=warning msg="could not change group /var/run/docker.sock to docker: group docker not found"
 2020-02-04T09:01:09.331162134Z failed to load listeners: can't create unix socket /var/run/docker.sock: device or resource busy
 *********
 Authenticating with credentials from job payload (GitLab Registry)
 Pulling docker image git.example.com:5555/internal/container/stretch/build:v1 ...
 Using docker image sha256:5223d58b17d2138d64951fb738ddd44b71bf8734477ac05f7874db64922532cb for git.example.com:5555/internal/container/stretch/build:v1 ...
Running on runner-vJRY-fex-project-120-concurrent-0 via git...
00:02
Fetching changes...
00:02
 Reinitialized existing Git repository in /builds/internal/backoffice_ui/.git/
 Checking out 7e5d9caf as develop...
 Removing .filename
 Removing backoffice-ui-build-deps_0.0.21+0~20200129104742_all.deb
 Skipping Git submodules setup
Authenticating with credentials from job payload (GitLab Registry)
05:04
 $ echo 'deb http://repos.example.com/debian/ nodejs_11 main' >> /etc/apt/sources.list
 $ apt update
 WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
 Get:1 http://repos.example.com/debian stretch InRelease [3982 B]
....

Everything runs fine now. Just restarting Docker is enough. It then works for a few hours or days … not sure.

Configs

  • .gitlab-ci.yml
variables:
   TOOL_ARGS: apt-get -o Debug::pkgProblemResolver=yes --no-install-recommends --yes  --allow-unauthenticated
   DEB_PACKAGE_NAME: "backoffice-ui"
stages: 
    - build
    - publish
    - deploy
 
build:stretch: &build
   stage: build
   tags:
     - docker
   image: git.example.com:5555/internal/container/stretch/build:v1
   before_script:
   - echo 'deb http://repos.example.com/debian/ nodejs_11 main' >> /etc/apt/sources.list
   - apt update
   - git reset --hard
   - git clean -fd
   - git checkout $CI_COMMIT_REF_NAME
...
  • Dockerfile
FROM debian:stretch
LABEL maintainer="me@example.com"
ENV LANG=C.UTF-8 \
    DEBIAN_FRONTEND=noninteractive
RUN mkdir -p /usr/share/man/man1 \
	&& apt-get update \
	&& apt-get -qy upgrade \
	&& apt-get -qy dist-upgrade \
	&& export build_deps=' \
		build-essential \
		ca-certificates \
		fakeroot \
		git-buildpackage \
		lintian \
		pristine-tar' \
	&& apt-get -qy install --no-install-recommends $build_deps \
		autodep8 \
		autopkgtest \
		git \
	&& apt-get -qy autoremove --purge \
	&& apt-get clean \
	&& apt-mark auto $build_deps \
	&& rm -rf /var/lib/apt/lists/*
ADD overlay /
ENTRYPOINT ["/usr/bin/gitlab-ci-entrypoint"]
  • Runner config
[[runners]]
  name = "Shared Docker runner"
  url = "https://git.example.com/"
  token = "secret"
  executor = "docker"
  clone_url = "https://git.example.com"
  environment = ["DOCKER_TLS_CERTDIR=","GIT_SSL_NO_VERIFY=1","DOCKER_DRIVER=overlay2"]
  [runners.custom_build_dir]
  [runners.docker]
    cap_add = ["NET_ADMIN"]
    tls_verify = false
    image = "docker:stable"
    privileged = true
    disable_entrypoint_overwrite = false
    network_mode = "bridge"
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/certs/client", "/var/run/docker.sock:/var/run/docker.sock", "/srv/gitlab-runner/data:/srv/gitlab-runner/data", "/cache", "/opt/aptly/incoming:/publish:rw"]
    extra_hosts = ["git.example.com:192.168.43.18","repos.example.com:172.1.1.1"]
    shm_size = 0
    services = ["docker:dind"]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]

Final

I tried a lot to understand how to get it working again. In tcpdump I could see the connection trying to get out, but nothing happened. Then I noticed that the CI worked again after a reboot, so I tested restarting just the Docker service and voilà, it works (again).

Hi,

This sounds odd. Which Docker version is involved here? Please post the output of docker info or similar.

Cheers,
Michael

Hi @dnsmichi

I had to check first that I’m not posting on the wrong Discourse ^^

root@git:[~]: grep docker /etc/passwd
root@git:[~]: grep docker /etc/group
docker:x:991:git,gitlab-runner
git:~$ dpkg -l |grep docker
ii  docker-ce                            5:19.03.5~3-0~debian-stretch      amd64        Docker: the open-source application container engine
ii  docker-ce-cli                        5:19.03.5~3-0~debian-stretch      amd64        Docker CLI: the open-source application container engine
  • Before starting build job
root@git:[~]: brctl show
bridge name	bridge id		STP enabled	interfaces
docker0		8000.02429314d0f1	no

root@git:[~]: grep vethb3c5a6d /var/log/kern.log
Feb  5 08:22:10 git kernel: [1717825.480910] docker0: port 2(vethb3c5a6d) entered blocking state
Feb  5 08:22:10 git kernel: [1717825.480912] docker0: port 2(vethb3c5a6d) entered disabled state
Feb  5 08:22:10 git kernel: [1717825.481032] device vethb3c5a6d entered promiscuous mode
Feb  5 08:22:10 git kernel: [1717825.481248] IPv6: ADDRCONF(NETDEV_UP): vethb3c5a6d: link is not ready
Feb  5 08:22:10 git kernel: [1717825.481251] docker0: port 2(vethb3c5a6d) entered blocking state
Feb  5 08:22:10 git kernel: [1717825.481253] docker0: port 2(vethb3c5a6d) entered forwarding state
Feb  5 08:22:10 git kernel: [1717825.846404] IPv6: ADDRCONF(NETDEV_CHANGE): vethb3c5a6d: link becomes ready
Feb  5 08:22:41 git kernel: [1717856.215753] docker0: port 2(vethb3c5a6d) entered disabled state
Feb  5 08:22:41 git kernel: [1717856.258977] docker0: port 2(vethb3c5a6d) entered disabled state
Feb  5 08:22:41 git kernel: [1717856.261804] device vethb3c5a6d left promiscuous mode
Feb  5 08:22:41 git kernel: [1717856.261831] docker0: port 2(vethb3c5a6d) entered disabled state


  • While running the build job
root@git:[~]: brctl show
bridge name	bridge id		STP enabled	interfaces
docker0		8000.02429314d0f1	no		vethb3c5a6d
  • Uh, it changes while running:
root@git:[~]: brctl show
bridge name	bridge id		STP enabled	interfaces
docker0		8000.02429314d0f1	no		vethc31f2b0

root@git:[~]: grep vethc31f2b0 /var/log/kern.log
Feb  5 08:22:50 git kernel: [1717865.191863] docker0: port 1(vethc31f2b0) entered blocking state
Feb  5 08:22:50 git kernel: [1717865.191865] docker0: port 1(vethc31f2b0) entered disabled state
Feb  5 08:22:50 git kernel: [1717865.191943] device vethc31f2b0 entered promiscuous mode
Feb  5 08:22:50 git kernel: [1717865.192725] IPv6: ADDRCONF(NETDEV_UP): vethc31f2b0: link is not ready
Feb  5 08:22:50 git kernel: [1717865.192728] docker0: port 1(vethc31f2b0) entered blocking state
Feb  5 08:22:50 git kernel: [1717865.192729] docker0: port 1(vethc31f2b0) entered forwarding state
Feb  5 08:22:50 git kernel: [1717865.192781] docker0: port 1(vethc31f2b0) entered disabled state
Feb  5 08:22:50 git kernel: [1717865.441820] IPv6: ADDRCONF(NETDEV_CHANGE): vethc31f2b0: link becomes ready
Feb  5 08:22:50 git kernel: [1717865.441854] docker0: port 1(vethc31f2b0) entered blocking state
Feb  5 08:22:50 git kernel: [1717865.441855] docker0: port 1(vethc31f2b0) entered forwarding state
Feb  5 08:27:40 git kernel: [1718155.605209] docker0: port 1(vethc31f2b0) entered disabled state
Feb  5 08:27:40 git kernel: [1718155.664816] docker0: port 1(vethc31f2b0) entered disabled state
Feb  5 08:27:40 git kernel: [1718155.667455] device vethc31f2b0 left promiscuous mode
Feb  5 08:27:40 git kernel: [1718155.667479] docker0: port 1(vethc31f2b0) entered disabled state

I have to wait until it stops working … maybe it is something with forwarding … or the bridge. Something that gets fixed by restarting the Docker service.
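
To rule the forwarding theory in or out the next time it breaks, the kernel's IPv4 forwarding switch can be checked on the host. This is only a sketch: `ip_forward_enabled` is a made-up helper name, and it assumes the usual Linux procfs path. Containers on the default bridge need this setting to be `1` to reach anything beyond the host.

```shell
# Succeeds if the given sysctl file contains "1" (forwarding enabled).
# The optional path argument only exists so the check can be tried on a
# copy; by default it reads the live kernel setting.
ip_forward_enabled() {
  file=${1:-/proc/sys/net/ipv4/ip_forward}
  [ "$(cat "$file" 2>/dev/null)" = "1" ]
}

# Usage on the Docker host:
#   ip_forward_enabled && echo "forwarding on" || echo "forwarding OFF"
```

If this reports forwarding as off while the jobs are failing, that would point at the host's sysctl settings rather than at the bridge itself.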

In the end … it could be a udev problem.

cu denny

Hi Denny,

no, this isn’t the Icinga Discourse :wink: I wanted to learn more about GitLab, and imho the best way to do so is to look at how people use it and help them solve their problems :slight_smile:

As for the problem: maybe this is related to the bridge and its IP address assignment. This issue suggests using IPv6 in the end, which is the default on your end. Maybe it is possible to force Docker's network routing to IPv4, as also suggested here.

Or go this route and investigate the bridge, its MAC addresses and the iptables rules.

Cheers,
Michael

hi @dnsmichi,

I have to wait until the problem appears again and then try the IPv6 fixes. I think it is the Docker-in-Docker part that stops working.

hi,

I’ve found the problem: Puppet :slight_smile: Puppet was removing the firewall rules created by Docker, so the containers had no network connection. I’ve added a flag to the base Puppet class so that it no longer drops unknown rules.
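
For anyone hitting the same symptom, one quick way to see whether something has flushed Docker's rules is to look for the `DOCKER` chains in `iptables-save` output. The sketch below is not the fix described above, just a check under assumptions: it assumes a default bridge setup in which Docker creates a `DOCKER` chain in both the `nat` and `filter` tables, and `missing_docker_chains` is a hypothetical helper name.

```shell
# Reads `iptables-save` output on stdin and prints one line per table
# (nat, filter) in which no DOCKER chain is declared.
# Prints nothing if the chain is present in both tables.
missing_docker_chains() {
  awk '
    /^\*/       { table = substr($0, 2) }   # "*nat" -> "nat"
    /^:DOCKER / { seen[table] = 1 }         # chain declaration line
    END {
      n = split("nat filter", want, " ")
      for (i = 1; i <= n; i++)
        if (!(want[i] in seen))
          print "missing DOCKER chain in table " want[i]
    }'
}

# Usage on the Docker host (as root):
#   iptables-save | missing_docker_chains
```

If this prints anything while jobs are failing, and prints nothing right after `service docker restart`, that fits the picture of a config management run removing the rules in between.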

Hi,

thanks for coming back and sharing your solution! :heart:

Cheers,
Michael

Hi, sorry to jump in on this, but what about this warning:

*** WARNING: Service runner-vJRY-fex-project-120-concurrent-0-docker-0 probably didn't start properly.
 Health check error:
 service "runner-vJRY-fex-project-120-concurrent-0-docker-0-wait-for-service" timeout
 Health check container logs:
 Service container logs:
 2020-02-04T09:01:09.326839247Z time="2020-02-04T09:01:09.326549028Z" level=info msg="Starting up"
 2020-02-04T09:01:09.330228908Z time="2020-02-04T09:01:09.330098855Z" level=warning msg="could not change group /var/run/docker.sock to docker: group docker not found"
 2020-02-04T09:01:09.331162134Z failed to load listeners: can't create unix socket /var/run/docker.sock: device or resource busy
 *********

I have the same thing going on. In a different thread someone suggested that the docker.sock volume is the cause and should be removed from the TOML config, but that just made things worse for me.

Any idea?

Thanks

EDIT:
Btw, this is what I’m running on a self-hosted CE:

Running with gitlab-runner 14.8.2 (c6e7e194)
  on Prod APP eus305 TQ-VuMm-
Preparing the "docker" executor
00:35
Using Docker executor with image docker:19.03.12 ...
Starting service docker:19.03.12-dind ...
Pulling docker image docker:19.03.12-dind ...
Using docker image sha256:66dc2d45749a48592f4348fb3d567bdd65c9dbd5402a413b6d169619e32f6bd2 for docker:19.03.12-dind with digest docker@sha256:674f1f40ff7c8ac14f5d8b6b28d8fb1f182647ff75304d018003f1e21a0d8771 ...

and this is the warning I get

*** WARNING: Service runner-tq-vumm--project-5-concurrent-0-58fa0c06e49069c1-docker-0 probably didn't start properly.
...
2022-04-08T09:40:45.536662749Z failed to load listeners: can't create unix socket /var/run/docker.sock: device or resource busy
*********
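
The runner config posted earlier in this thread both bind-mounts the host's `/var/run/docker.sock` and starts a `docker:dind` service. A plausible reading of the warning, stated here as an assumption and not confirmed anywhere in the thread, is that the runner's volumes are also mounted into the service container, so the dind daemon finds the socket path already occupied. The sketch below only detects that combination. `sock_and_dind_conflict` is a hypothetical helper, it does a naive line match and not a real TOML parse, and it only covers the `services = [...]` form shown in the config above.

```shell
# Returns 0 (and prints a note) if the given runner config both bind-mounts
# the host Docker socket and lists a dind service; returns 1 otherwise.
sock_and_dind_conflict() {
  config=$1
  if grep -Eq '^[[:space:]]*volumes[[:space:]]*=.*/var/run/docker\.sock:' "$config" &&
     grep -Eq '^[[:space:]]*services[[:space:]]*=.*dind' "$config"; then
    echo "$config: docker.sock bind mount and dind service are both configured"
    return 0
  fi
  return 1
}

# Usage (the path is the common default and may differ on your host):
#   sock_and_dind_conflict /etc/gitlab-runner/config.toml
```

If the check matches, the two approaches (host socket versus dind) are probably worth choosing between. Dropping the socket mount alone can make things look worse if the jobs still expect to reach a daemon through that socket.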