GitLab CI shared runner suddenly unable to resolve AWS deployment host

kevpfowler · August 21, 2019, 9:21pm

Beginning today (8/21/2019), a .gitlab-ci.yml deploy stage script that was previously working fine started to fail. During this deploy stage, the script uses ssh to execute a few simple commands on a server in our AWS VPC. That server has a public IP, and we maintain a domain name (aliased here as MYHOSTNAME.com) that points to it. I have been able to use that hostname as usual from several other locations.

Here is the log:

Pulling docker image docker:latest ...
Using docker image sha256:0cecfefe921f22fc898f7a0055358380c8870ab6f05b01999367911714fe9d00 for docker:latest ...
Running on runner-fa6cab46-project-13833214-concurrent-0 via runner-fa6cab46-srm-1566409267-884419bc...
Fetching changes with git depth set to 50...
Initialized empty Git repository in /builds/XXXXXXXXX/.git/
Created fresh repository.
From https://gitlab.com/XXXXXXXXXXXXXX
 * [new branch]      ___testing -> origin/___testing
Checking out 056e8479 as ___testing...
Skipping Git submodules setup
$ which ssh-agent || ( apk update; apk add --no-cache openssh-client )
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/community/x86_64/APKINDEX.tar.gz
v3.10.2-4-gca30a4d858 [http://dl-cdn.alpinelinux.org/alpine/v3.10/main]
v3.10.2-4-gca30a4d858 [http://dl-cdn.alpinelinux.org/alpine/v3.10/community]
OK: 10335 distinct packages available
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/community/x86_64/APKINDEX.tar.gz
(1/6) Installing openssh-keygen (8.0_p1-r0)
(2/6) Installing ncurses-terminfo-base (6.1_p20190518-r0)
(3/6) Installing ncurses-terminfo (6.1_p20190518-r0)
(4/6) Installing ncurses-libs (6.1_p20190518-r0)
(5/6) Installing libedit (20190324.3.1-r0)
(6/6) Installing openssh-client (8.0_p1-r0)
Executing busybox-1.30.1-r2.trigger
OK: 17 MiB in 21 packages
$ eval $(ssh-agent -s)
Agent pid 22
$ echo "$SSH_PRIVATE_KEY" | tr -d '\r' | ssh-add -
Identity added: (stdin) ((stdin))
$ mkdir -p ~/.ssh
$ [[ -f /.dockerenv ]] && echo -e "Host *\n\tStrictHostKeyChecking no\n\n" > ~/.ssh/config
$ scp -P $SSH_PORT ./docker-compose.yml ubuntu@${SERVER}:
ssh: Could not resolve hostname MYHOSTNAME.com: Name does not resolve
lost connection

I checked the /etc/resolv.conf in the dind container that is running:

nameserver 169.254.169.254
search c.gitlab-ci-155816.internal google.internal

Again, this script has worked fine until today. Any ideas? Suggestions for debugging?

Kevin

bartj · August 22, 2019, 8:10am

I believe that IP isn’t correct; see the Stack Overflow question “What's special about 169.254.169.254 IP address for AWS?”. At least, I can’t connect to it on port 53 over TCP or UDP, which it would need to answer on if it is being used as a nameserver.

kevpfowler · August 22, 2019, 7:51pm

This turned out to be a problem on our end (well, our service provider’s): hostnames were getting dropped from the Google public DNS service. This was resolved today and the deploy stage is working again.
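
For anyone who hits the same symptom, here is a rough sketch of a check that could run in the job before the scp step. It asks each nameserver listed in /etc/resolv.conf, plus Google Public DNS (8.8.8.8), whether it can resolve the deploy host, which helps show whether the runner's resolver or the DNS record is at fault. The function names and the script name dns-check.sh are made up for illustration. It also assumes the image's nslookup exits non-zero when a lookup fails, which is worth verifying for the busybox build in use.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of GitLab or Alpine.

# Print the nameserver addresses found in a resolv.conf-style file
# (defaults to /etc/resolv.conf when no path is given).
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Ask a single resolver for a hostname and report the outcome.
# Assumes nslookup returns a non-zero exit status on failure.
check_resolver() {
    host_name=$1
    resolver=$2
    if nslookup "$host_name" "$resolver" >/dev/null 2>&1; then
        echo "ok   $resolver resolves $host_name"
    else
        echo "FAIL $resolver does not resolve $host_name"
    fi
}

# Run the checks only when a hostname is passed as the first argument.
if [ -n "${1:-}" ]; then
    for ns in $(list_nameservers) 8.8.8.8; do
        check_resolver "$1" "$ns"
    done
fi
```

In the deploy stage this could be called as `sh dns-check.sh "$SERVER"`. If every resolver reports FAIL while the name still resolves from other networks, the record itself is the more likely culprit than the runner.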