GitLab runner has stopped working for no apparent reason

Not sure what happened, it worked fine a few days ago when I last used it. It’s as if it doesn’t execute the job scripts and quits or something.

Running with gitlab-runner 14.6.0 (5316d4ac)
  on LAPTOP-NSD23I7N khE_F_t7
  feature flags: FF_USE_FASTZIP:true
Preparing the "shell" executor
00:00
Using Shell executor...
Preparing environment
00:01
Running on LAPTOP-NSD23I7N...
Getting source from Git repository
00:18
Fetching changes with git depth set to 50...
Reinitialized existing Git repository in C:/GitLab-Runner/builds/khE_F_t7/0/emrys90/project-cards/.git/
Checking out 745abf5d as dev...
Removing Library/
git-lfs/3.0.2 (GitHub; windows amd64; go 1.17.2)
Skipping Git submodules setup
Restoring cache
01:53
Version:      14.6.0
Git revision: 5316d4ac
Git branch:   14-6-stable
GO version:   go1.13.8
Built:        2021-12-17T17:35:49+0000
OS/Arch:      windows/amd64
Checking cache for dev-android-applab...
Runtime platform                                    arch=amd64 os=windows pid=23420 revision=5316d4ac version=14.6.0
No URL provided, cache will not be downloaded from shared cache server. Instead a local version of cache will be extracted. 
Successfully extracted cache
Executing "step_script" stage of the job script
00:01
$ bash "ci/build.sh"


Cleaning up project directory and file based variables
00:00
ERROR: Job failed: exit status 1

Hi @emrys90

It looks like this line is the one that’s failing:

$ bash "ci/build.sh"

Personally, I’d add set -x and set -e to the top of this file, so that you can see which lines are being executed, and which one is failing.

It already has that, though; I normally get logs when anything in there fails. It’s as if it’s not even executing the script.

Interesting. If you have access to the runner, what happens if you go to that directory, su to gitlab-runner and run the script by hand?

It begins running the script, but fails because the environment variables that the runner normally sets are missing. Here are the first few lines of the script; as you can see from the logs, none of their output shows up there:

#!/usr/bin/env bash

set -e
set -x

echo "Building for $BUILD_TARGET"

export BUILD_PATH=./Builds/$BUILD_TARGET/

OK, so BUILD_TARGET is missing? Is that set in the .gitlab-ci.yml file or via the web ui?

It’s in the .yml file. Everything worked fine in this whole pipeline a few days ago. If the script was the issue I would still at least get that first echo statement.

OK, so if you write:

script:
    - echo "$BUILD_TARGET"
    - bash "ci/build.sh"

in the relevant job, do you get a sensible answer for BUILD_TARGET? If not, is BUILD_TARGET made up of other variables, and if so are they set?

It logs it correctly:

Checking cache for dev-android-applab...
Runtime platform                                    arch=amd64 os=windows pid=21916 revision=5316d4ac version=14.6.0
No URL provided, cache will not be downloaded from shared cache server. Instead a local version of cache will be extracted. 
Successfully extracted cache
Executing "step_script" stage of the job script
00:01
$ echo "$BUILD_TARGET"
Android
$ bash "ci/build.sh"


Cleaning up project directory and file based variables
00:00
ERROR: Job failed: exit status 1

OK, so if you have the variable available in the YAML file, but it’s just not getting to the script, why not change the script so that it accepts a command-line argument? So in the YAML file you’d have:

script:
    - bash "ci/build.sh" "$BUILD_TARGET"

and in the Bash script, something like:

BUILD_TARGET="${1}"
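As a slightly fuller sketch of that idea (not something from the original script), the lookup could prefer the command-line argument, fall back to the environment variable, and fail with a clear message if neither is set. The function name resolve_build_target is hypothetical and only used for illustration:

```shell
#!/usr/bin/env bash
set -e

# Hypothetical helper: prefer the first argument, fall back to the
# BUILD_TARGET environment variable, and fail loudly if neither is set.
resolve_build_target() {
    local target="${1:-${BUILD_TARGET:-}}"
    if [ -z "$target" ]; then
        echo "BUILD_TARGET was not passed as an argument or set in the environment" >&2
        return 1
    fi
    printf '%s\n' "$target"
}
```

In ci/build.sh this might be used as BUILD_TARGET="$(resolve_build_target "$@")", so the script behaves the same whether it is started by the runner or by hand.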

I think you might be misunderstanding what the issue is? My script isn’t even running. If it was running, I would get the first echo in that script. Even if it wasn’t getting the variable, the echo would still happen.

Sorry, you’re right.

But your log does show that the runner is executing $ bash "ci/build.sh", so it is trying to execute that line, and you have seen that you can run the script correctly manually as gitlab-runner and get some output.

These are just random ideas for debugging, but I would be inclined to change the shebang line to #!/bin/bash and see what happens.

Clearly, it’s not a permissions issue because you’re calling bash directly from the YAML file, but you could also change that to chmod +x ci/build.sh && ./ci/build.sh just to avoid the extra shell invocation, and see if you get a different result.

Changing it to #!/bin/bash had no effect. This is running on Windows, so I don’t think it’s any kind of file permissions error. Plus, this whole thing worked as of a few days ago and I can’t think of anything I would have changed to break it.

Well, it’s likely that something has changed in your environment. It is odd though that you can run the script manually as the gitlab-runner user but not automatically.

A longer-term solution, which would prevent this from happening again, would be to use a Docker runner and pin the version of the image that you use. It will mean changing your infrastructure a bit, but that would give you some assurance that the environment is stable.

I’m at a loss as to what could have caused it or how to fix it now… I have no idea what could possibly prevent a script from executing through the runner.

I would guess that the thing that changed was something in your wider environment: maybe a package update or a Windows update, something that might have changed a setting or an env var somewhere.

Any ideas what I need to do to fix it? It’s worked fine for a long time and I do not want to have to redo the whole process by moving it all to Docker as a workaround. Plus, I need to build it on Windows for Unity, so I can’t just use a Linux image in Docker for it.

I think you need to gather more information here about exactly what is causing the problem. You might also try creating a new runner and seeing whether that gives you different results (but I’d guess not?).
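One thing worth ruling out, purely as an assumption, is that the bash found on the runner's PATH is no longer the one that used to run the script. On Windows there can be more than one bash.exe (Git for Windows ships one, and the Windows Subsystem for Linux adds a launcher of the same name), and which one wins depends on PATH order. A minimal diagnostic sketch, to be run by hand in the shell that normally executes ci/build.sh; the function name describe_bash is made up for this example:

```shell
#!/usr/bin/env bash

# Hypothetical diagnostic: report which bash is interpreting this script
# and list every bash executable on PATH, in lookup order.
describe_bash() {
    echo "Shell running this script: ${BASH:-unknown} (version ${BASH_VERSION:-unknown})"
    echo "Kernel reported by uname: $(uname -s)"
    echo "Every bash on PATH, in lookup order:"
    # type -aP prints the path of every matching executable found on PATH.
    type -aP bash || echo "(none found)"
}

describe_bash
```

The PATH seen here may differ from the one the runner service uses, so it helps to compare this list against the PATH of the account the runner runs under.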

However, if it’s Unity that you are building, I use this image which is made specifically for building Unity apps on GitLab, and can target any of the relevant platforms, including Windows and Android. There’s an example repo with a simple Unity app that shows how the image is intended to be used.

More information would be helpful for sure, but I honestly have no idea what to look for. I don’t have any idea what could possibly have caused this. The only thing I recently changed on my computer was installing Docker, which I had never used before. I think some recent Windows updates may also have been installed.

Does anyone have any other ideas on how to fix this issue? I would like to avoid switching to Docker, as that would be a time-consuming process that takes me away from other development needs. This is a production product with lots of ongoing work, so I am effectively shut down until this is solved. Also, I had used the gableroux image a while back and switched away from it because of an issue I was having. I don’t remember what the issue was, but it makes me more hesitant to try switching to Docker.

This has worked fine for over a year, so I would like to just get this functional again.

I’ve tried removing Docker from the PATH and uninstalling the recent Windows updates, but it’s still not working.