Issue: CI builds using Docker are failing conditionally.
GitLab type: Self-Hosted (paid)
GitLab version: 12.6.2-ee
Docker image: Locally built Docker image (Debian Buster w/ PHP 7.4)
Notes: Previous CI/Docker images continue to work properly.
I’m working on a new CI/Deployment, which requires we upgrade our (local) Docker images we use.
I’ve built a new Debian Buster image the same way we’ve done previously for older OS versions.
Added a new runner, similar to all our others (30+). It didn’t work initially, failing with the same connection refused on port 443 as described below.
While attempting to debug the issue, I noticed if I was actively running the container while the job processed, it succeeded.
Example: gitlab-host:~# docker run -it --rm buster-php74 bash
Retrying the pipeline that had been failing then results in success.
Now, log out of the container mentioned above and retry the pipeline; it will fail with connection refused on port 443:
fatal: unable to access 'https://gitlab.example.com/project/project-api.git/': Failed to connect to gitlab.example.com port 443: Connection refused
Since you are not specifying the image keyword here, I’d say that this is run via the shell executor, and not in Docker itself. The GitLab runner config.toml would be interesting here.
If you start your Docker container once, it most likely maps ports (80, 443) to the host system, and then the CI pipeline succeeds. If you stop the container, those service ports are gone.
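For illustration, that would be the case if the container were started with published ports, something along these lines (using the image name from above):

    docker run -d -p 80:80 -p 443:443 buster-php74   # publishes 80/443 on the host for as long as the container runs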
If you add the following to your script section, it shows where this is executed.
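For example, something like this in the script section of .gitlab-ci.yml (the <gitlab host> placeholder stands in for your real hostname):

    script:
      - echo "Running on $(hostname) via <gitlab host>..."
      - exit 0   # stop here so nothing else in the job runs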
Hmm, we don’t specify “image” in any of our others, but I went ahead and added it for good measure (it is already specified in the config.toml); no change in the behavior.
FWIW, this is identical to our other runners with the exception of name/token/image
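A minimal sketch of the relevant config.toml entry, with the real name/token/image replaced by placeholders:

    [[runners]]
      name = "buster-php74-runner"        # placeholder
      url = "https://gitlab.example.com/"
      token = "xxxxxxxx"                  # redacted
      executor = "docker"
      [runners.docker]
        image = "buster-php74"
        privileged = false
        volumes = ["/cache"]

With the suggested echo added to the script section, the tail of the job output looks like this: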
+ eval 'echo "Running on $(hostname) via <gitlab host>..."
'
+++ hostname
++ echo 'Running on runner-xxxxxxx-project-759-concurrent-0 via <gitlab hostname>...'
Running on runner-xxxxxxx-project-759-concurrent-0 via <gitlab hostname>...
+ exit 0
When I see the message below I would straight away guess there is a connection issue. Have you checked the DNS? Maybe just run a curl or ping to the repository URL.
fatal: unable to access 'https://gitlab.example.com/project/project-api.git/': Failed to connect to gitlab.example.com port 443: Connection refused
Maybe something changed in Debian Buster compared to the older releases?
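For example, something like this at the top of the job (the exact commands are just a sketch):

    before_script:
      - getent hosts gitlab.example.com                   # does DNS resolve inside the container?
      - ping -c 3 gitlab.example.com || true              # basic reachability (ICMP may be blocked)
      - curl -svo /dev/null https://gitlab.example.com/   # can the HTTPS port be reached?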
Yeah, I did check them; they are set to our defaults. Furthermore, when running the containers manually, I can clone any repo just fine and do anything else network-related that I would expect to work, such as curl-ing the main gitlab.example.com page.
The Connection refused would indicate at the least that DNS is working, but the host is refusing, so I also double-checked our firewall to make sure we’re not blocking any of the traffic (though this host has been running for years and no one has made any firewall changes recently).
Just some extra info: while we were debugging using some echo commands, we found that we could hold the runner container open if we left a never-ending ping running. I did have to start a container manually (it doesn’t seem to matter which one I start) to get the build to progress far enough to reach the ping command, but from there I was able to grab the CI token info from the ENV and confirm I was able to clone the repo without issue.
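That check was roughly the following (URL taken from the error message above; CI_JOB_TOKEN is the job token the runner exports into the environment):

    git clone https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.example.com/project/project-api.git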
I’m trying to add a before_script to the failing builds so it will stay open long enough to examine, but so far a simple ping at the top isn’t keeping it open.
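Roughly what I’m attempting (a sketch; the target and duration are arbitrary):

    before_script:
      - ping -c 600 gitlab.example.com   # try to hold the container open for ~10 minutes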
In the interest of no one wasting time, we just tcpdump-ed the docker0 interface while one of the doomed containers was starting up and we were able to capture the following:
15:41:13.122317 IP (tos 0xc0, ttl 64, id 23994, offset 0, flags [none], proto ICMP (1), length 88)
<gitlab host> > 192.168.0.3: ICMP <gitlab host> tcp port https unreachable, length 68
Which sure seems to suggest there is some form of networking issue, though why a manually run container would resolve it is still unclear.
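For reference, the capture was done with something along these lines (exact flags and filter are approximate):

    tcpdump -vni docker0 icmp or port 443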
Does anyone know if the “runner-helper” containers do anything special with network?
@nightman68 yeah, it all fails before anything starts building
If I understood right, you set up a new runner with a new image. Have you tried running the jobs on the new runner with one of the old images that were working on the old runner?
If the git clone/fetch fails I would enable some git debugging in the CI file:
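Something along these lines (a sketch; GIT_TRACE and GIT_CURL_VERBOSE are standard git variables, and CI_DEBUG_TRACE is GitLab’s own debug switch):

    variables:
      GIT_TRACE: "1"           # low-level git activity
      GIT_CURL_VERBOSE: "1"    # HTTP/TLS details for the clone/fetch
      CI_DEBUG_TRACE: "true"   # full shell trace of the whole job (very noisy)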
It seems something changed with our docker0 interface and its traffic was running into our firewall (we use FireHOL, FWIW). Strange, because we didn’t make any actual changes to the firewall, so it is possibly related to the GitLab Runner upgrade we did earlier that week?
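In case it helps anyone hitting the same thing, a quick way to spot this kind of reject on the host (a sketch, not the exact commands we ran):

    iptables -nvL | grep -iE 'docker0|reject'   # look for REJECT rules that catch docker0 traffic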
@nightman68 thanks for suggesting the verbose options, it did help illustrate where the breakdown was happening.
@dnsmichi thanks for looking into it, it was a great learning experience with GitLab CI
Glad you could figure it out by yourself. I’ve learned new things with the great help from @nightman68 myself. Maybe you’ll stay here for a bit and try to help others too? Or you could throw in some likes showing your appreciation, or mark one reply as the solution.
In case you want to share your CI experience, there’s a great epic/issue for doing so. I added my feedback there too.