Rebuilt CI runner and now I get "host key verification failed"

I have a private gitlab server and two runners. They are all running the latest gitlab omnibus versions. I started running out of space on my runners in docker-in-docker builds. I brought one of the runners down, rebuilt it with double the disk space (they are all virtual anyway) and brought it back up with the same name as it had previously. Now, docker packaging runs fine (there is lots of space and they use artifacts from previous builds), but when I check out dependent projects in the actual build, I get this error on the rebuilt runner:

> git clone git@xxx.yyy.com:ggg/project.git …/project/
Cloning into '…/project'...
Host key verification failed.
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
ERROR: Build failed: exit code 1

I do not, however, get this error on the runner that I didn't touch. It builds fine but cannot package docker due to the same space limitations. This tells me my project configuration is fine. Deploy keys were set up using gitlab's setup instructions. I've tried deleting and re-registering the new runner to no avail. (It doesn't unregister for some reason, but I can delete and re-register it just fine.)

I tried launching the docker image from the runner and running the git clone command. It asks me for the password for git@xxx.yyy.com, and I don’t know it, but it clearly can resolve the gitlab server fine. If it didn’t, it wouldn’t have asked for the password. I can also ping the gitlab server by name just fine. I don’t think connectivity is the issue.

My thought is that the gitlab server has the ssh fingerprint of the old runner cached somewhere (similar to .ssh/known_hosts), but I cannot find it for the life of me. None of the users on the runner have anything of consequence in the known_hosts, btw.
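For reference, host keys are cached by the ssh *client* on the machine that initiates the connection, not by the server, so the place to look would be the build container's ~/.ssh/known_hosts (or the system-wide /etc/ssh/ssh_known_hosts). A minimal sketch of what an entry looks like and how you'd search for one, using git.example.com as a stand-in hostname and a fake key:

```shell
# known_hosts is plain text: one "hostname key-type base64-key" per line.
# This builds a throwaway file and searches it the way you would search
# ~/.ssh/known_hosts on the runner (git.example.com is a placeholder).
kh=$(mktemp)
printf '%s\n' 'git.example.com ssh-ed25519 AAAAC3Nza...fake' > "$kh"
if grep -q '^git.example.com ' "$kh"; then
  echo "host key cached for git.example.com"
else
  echo "no cached host key"
fi
rm -f "$kh"
# prints: host key cached for git.example.com
```

One caveat: if the client has `HashKnownHosts yes` set, hostnames are stored hashed and grep will find nothing; `ssh-keygen -F git.example.com` searches the file correctly in either case.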

Anyone have any thoughts on this? Or know where that fingerprint would be stored?

Dave

Can you post your gitlab runner config? Have you tried, instead of using git clone, just doing a simple ssh gitserver -p port? Of course you cannot log in, but any problems with known_hosts should pop up there. (You do run your tests with the git clone URL, which is ssh instead of https, of course :slight_smile: )

@riemers Thanks for replying!

config (pretty vanilla):

concurrent = 1
check_interval = 0
[[runners]]
  name = "runner01.mydomain.com"
  url = "https://git.mydomain.com/ci"
  token = "XXXXXXXXXXXXXXXXXXXXXXX"
  executor = "docker"
  [runners.docker]
    tls_verify = false
    image = "myrepo/image"
    privileged = true
    disable_cache = false
    volumes = ["/cache"]
  [runners.cache]
    Insecure = false

If I try to ssh from the runner to the git server, it just prompts for a password. No warnings.

I think I failed to mention previously that it's a public project on the server.

I thought maybe gitlab implemented their own ssh server (with a separate known_hosts) since there are ssh gems in the gitlab directory.

I’m baffled…

Edited: 8/9/2016 to describe solution
I’m adding this in case someone else encounters the same error.

I previously thought that reconfiguring SSH (e.g. dpkg-reconfigure openssh-server) fixed the issue. As far as I can tell, it did not. Later the same day, after I thought I had it sorted, another project that had previously built fine failed. The final solution? Changing my gitlab-ci.yml file.

I had implemented ssh in my gitlab-ci.yml files before there was much official documentation from gitlab, and I used various questions and answers people had posted in various places to derive a working ci config file. I had skimmed the docs and it looked basically the same, but obviously I missed the filename change, as described below.

In the end, two particular changes were made to the yml file to enable building again. First:
- echo "$SSH_PRIVATE_KEY" > $SSH_TMP
- ssh-add $SSH_TMP
- rm $SSH_TMP
Changed to
- ssh-add <(echo "$SSH_PRIVATE_KEY")
I like this more anyway because the key is never written to disk. It seems like it should be functionally equivalent (other than the tmp file), though, so I doubt this was the show-stopper.
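As a side note on the `<(…)` form: bash process substitution expands to a file-descriptor path (e.g. /dev/fd/63) whose contents are the command's stdout, so ssh-add reads the key without it ever touching the filesystem. A quick sketch with cat standing in for ssh-add and a dummy string in place of the real key:

```shell
# <(echo ...) expands to a /dev/fd/NN path; reading from it yields the
# echoed text, but nothing is ever written to disk.
DUMMY_KEY="-----BEGIN FAKE KEY-----"
cat <(echo "$DUMMY_KEY")
# prints: -----BEGIN FAKE KEY-----
```

One thing to watch: process substitution is a bash feature, so if the job's script runs under plain sh this line fails with a syntax error.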

The second change, and more likely to be the fix, is this:
- '[[ -f /.dockerinit ]] && echo -e "Host *\n\tStrictHostKeyChecking no\n\n" > ~/.ssh/config'
Changed to
- '[[ -f /.dockerenv ]] && echo -e "Host *\n\tStrictHostKeyChecking no\n\n" > ~/.ssh/config'
Please note that the difference is that .dockerinit changed to .dockerenv. It makes me wonder if this changed on the docker side, but regardless, I can see that if StrictHostKeyChecking no was not added to the ssh config file, it would cause issues consistent with what I was seeing.
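For anyone landing here later: rather than disabling host key checking for every host, a sketch of an alternative that pins the server's key instead (assuming ssh-keyscan is available in the build image; git.example.com is a placeholder for your gitlab host):

```yaml
before_script:
  - mkdir -p ~/.ssh
  # Fetch and trust only this server's host key, instead of turning
  # StrictHostKeyChecking off for every host.
  - ssh-keyscan git.example.com >> ~/.ssh/known_hosts
```

This is still trust-on-first-use (the key is fetched over the network unverified), but it limits the exposure to a single host and a single moment, rather than accepting any key from any host for the life of the container.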

This does make me wonder why it worked for months but a new runner build made it fail. My only guess is that the newer versions of docker, openssh, and the kernel that were included in the new build directly or indirectly impacted the behavior. I'm not sure and will likely never have the time to investigate.

There it is. Thanks @riemers for your help!

Dave


Thank you @dpankros !! You saved my day !! :slight_smile: