GitLab.com runner job never finishes after the script completes

I have a runner on Windows 11 whose job never finishes. The bash script ends with an echo, and I see that echo in the CI/CD log, so I know the script reached the end. But after the script, the cache step never starts, and the job just hangs until it times out. Any ideas how to fix this?

Here’s the .gitlab-ci.yml entry for it:
build-OculusQuestAppLab:
  stage: deploy
  cache:
    key: "$CI_COMMIT_REF_SLUG-android-applab"
    paths:
      - Library/
  script:
    - bash "ci/build.sh"
    - bash "ci/deploy-oculus-quest.sh"
  variables:
    BUILD_TARGET: Android
    SCRIPTING_DEFINES: OCULUS;DISABLESTEAMWORKS
  tags:
    - unity
  rules:
    - if: '$CI_COMMIT_BRANCH == "master" || $CI_COMMIT_BRANCH == "qa" || $CI_COMMIT_BRANCH == "dev"'
      when: always
    - when: manual
      allow_failure: true

It reaches the last line of ci/deploy-oculus-quest.sh, echo "Deploy finished", and then just freezes there.

Do you also store any artifacts? Is the runner cache accessed over the network? If so, the artifact/cache upload may take long enough that the job times out (or that you stop it manually because you think it is hanging).

No artifacts, only the cache, and the cache is stored locally. The CI/CD log has no entry for the cache step starting, although other pipelines that run correctly do log one. It’s only this one pipeline that has the issue. It also doesn’t happen every time, maybe 90% of the time at the moment. It’s as if the runner thinks the previous script is still running and never starts the cache step.

Can you check whether the runner’s CPU usage is high when the job appears to hang?
I suppose it could be the runner compressing the cache so it can be stored locally.

Also, consider not caching packages (node_modules, for example); in most cases it is faster to just download them every time.

It’s not an issue with the cache taking a long time to finish. The cache step never even starts: there are no log entries for it at all, whereas every other pipeline logs an entry when the cache step starts.

Is there anything in the GitLab Runner logs? See “Install GitLab Runner on Windows” in the GitLab docs.

The final message in the logs is just the “Deploy finished” from the last script. Nothing else shows until the job times out a few hours later.

Since you’re using Windows with an .sh script, I’d like to ask which bash interpreter you’re using. Could that be the root cause of your issue?
Maybe the interpreter hangs when a script ends? (Though in that case the first script should hang before we even reach the second one, so this is probably not it.)

Can you add another script line to your .gitlab-ci.yml, like so:

build-OculusQuestAppLab:
  script:
    - bash "ci/build.sh"
    - bash "ci/deploy-oculus-quest.sh"
    - echo "third step of the script"

Also, please do use code blocks next time :wink:

The runner config uses PowerShell as the shell for the shell executor. I just tried it with the extra echo, and it did reach that in the log, but then it froze and never started the cache step as usual.

Alright, so the scripts are not the problem… but you’re on Windows. Do you know how to see the whole process tree on Windows? I have no clue. Something like htop: start the job, and when it gets stuck, look at where it hangs and hopefully get some useful info.
Maybe a specific command that runs right after the script finishes is failing.

Maybe it is a permission problem or a more generic OS error. I’d suggest correlating all events around the date and time when the CI job runs and fails. Press Win+R and run eventvwr to open the Event Viewer.

Also, can you share the contents of deploy-oculus-quest.sh and build.sh so we can see exactly what they do? They may start background services or similar processes that block the job itself from terminating.

Other jobs that also use build.sh complete fine, so if it’s script-related, I imagine the deploy script is the cause.

Here’s the build script:

#!/usr/bin/env bash

set -e
set -x

echo "Building for $BUILD_TARGET"

export BUILD_PATH=./Builds/$BUILD_TARGET/
mkdir -p "$BUILD_PATH"

# Under set -e a non-zero exit from Unity.exe would abort the script before
# "UNITY_EXIT_CODE=$?" ran, so capture the exit code on the same line instead.
UNITY_EXIT_CODE=0
"C:\Program Files\Unity\Hub\Editor\\${UNITY_VERSION}\Editor\Unity.exe" \
  -projectPath "$(pwd)" \
  -quit \
  -batchmode \
  -gitBranch "$CI_COMMIT_BRANCH" \
  -buildTarget "$BUILD_TARGET" \
  -customBuildTarget "$BUILD_TARGET" \
  -customBuildName "$BUILD_NAME" \
  -customBuildPath "$BUILD_PATH" \
  -customScriptingDefines "$SCRIPTING_DEFINES" \
  -customZipPrefix "$ZIP_PREFIX" \
  -executeMethod BuildCommand.PerformBuild \
  -logFile - || UNITY_EXIT_CODE=$?

if [ $UNITY_EXIT_CODE -eq 0 ]; then
  echo "Run succeeded, no failures occurred";
elif [ $UNITY_EXIT_CODE -eq 2 ]; then
  echo "Run succeeded, some tests failed";
elif [ $UNITY_EXIT_CODE -eq 3 ]; then
  echo "Run failure (other failure)";
else
  echo "Unexpected exit code $UNITY_EXIT_CODE";
fi

# Any non-zero exit code from Unity fails the job.
[ $UNITY_EXIT_CODE -eq 0 ] || exit $UNITY_EXIT_CODE

ls -la "$BUILD_PATH"
[ -n "$(ls -A "$BUILD_PATH")" ] # fail job if build folder is empty

Here’s the deploy script:

#!/usr/bin/env bash

set -e
set -x

if [ "$CI_COMMIT_REF_SLUG" = "master" ]; then
	channel="RC"
elif [ "$CI_COMMIT_REF_SLUG" = "qa" ]; then
	channel="BETA"
elif [ "$CI_COMMIT_REF_SLUG" = "dev" ]; then
	channel="ALPHA"
else
	channel=$CI_COMMIT_REF_SLUG
fi

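gameVersion=$(cat GameVersion.txt)
echo "Deploying Oculus Quest build for channel $channel version $gameVersion"

# set -x echoes this command with the app secret expanded, so make sure the
# OCULUS_QUEST_APP_SECRET CI/CD variable is masked to keep it out of the job log.
C:/OculusDeploy/ovr-platform-util.exe upload-quest-build -a "$OCULUS_QUEST_APP_ID" --app_secret "$OCULUS_QUEST_APP_SECRET" --apk "C:/OculusDeploy/Build/Cards & Tankards.apk" -c "$channel"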

echo "Deploy finished"
exit 0

Does anyone have any other ideas to try? It’s also a bit random: sometimes the job completes, and other times it hangs until it times out or I cancel it manually. Only this one deploy to Oculus ever has issues. I have a separate deploy to Steam that never freezes like this.

I figured it out. Some of the executables I call in my scripts spawn child processes that keep running after the executable itself exits. As a workaround, after each executable exits, I find all of its child processes and kill them.
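
One plausible mechanism for this, sketched in bash. This is an assumption: the thread only establishes that a leftover process blocks the job, not how. A background child inherits the script's stdout, so the pipe carrying the output stays open until the child exits, and the reader (cat here, standing in for the runner) keeps waiting even though the script has already printed its last line. Redirecting the child's output removes the wait. The time_until_eof helper is hypothetical; the sketch only needs bash and sleep (Linux, macOS, or Git Bash).

```shell
#!/usr/bin/env bash
# Sketch: a leftover background process that inherits stdout keeps a pipe
# open, so the reader (cat, standing in for the runner) waits for it.

# Hypothetical helper: runs a bash snippet with its stdout piped into cat and
# prints how many whole seconds passed before the pipeline (and cat) finished.
time_until_eof() {
  local start=$SECONDS
  bash -c "$1" | cat >/dev/null
  echo $((SECONDS - start))
}

# The snippet returns at once, but "sleep 2" inherits the pipe as its stdout.
inherited=$(time_until_eof 'sleep 2 & echo "Deploy finished"')

# Same snippet, but the background child's output goes to /dev/null instead.
detached=$(time_until_eof 'sleep 2 >/dev/null 2>&1 & echo "Deploy finished"')

echo "child inherits stdout: reader waited about ${inherited}s"
echo "child output detached: reader waited about ${detached}s"
```

Under that assumption, a lingering adb server plays the role of sleep here, except that it never exits on its own, which would match a job that hangs until it times out.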

Would you be able to share the code for that? I’m looking at pretty much the same issue: building for Android with Unity, the job does not finish.

Greg

My previous attempt actually didn’t work 100% of the time. I narrowed it down further: adb keeps running after Unity exits and is never killed. This is the PowerShell code I use for it, and it now works 100% of the time for me:

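if ( $Env:BUILD_TARGET -eq "Android" )
{
  # adb.exe outlives Unity.exe after an Android build and keeps the job from finishing.
  echo "Killing adb.exe processes as it prevents the CICD process from exiting"

  try
  {
    # -Name takes the process name without ".exe"; SilentlyContinue suppresses
    # the "Cannot find a process" error when adb is not running.
    Stop-Process -Name "adb" -ErrorAction SilentlyContinue
  }
  catch
  {
    Write-Host "An error occurred:"
    Write-Host $_
  }
}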

Awesome, thank you emrys90!

For me, this seems to work as well:
taskkill //IM "adb.exe" //F

For future reference: I did not have much luck finding child processes (nothing showed up after the build step finished). However, running tasklist, I could see that adb was still running, and killing it allowed the CI job to finish.
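
Combining the two fixes above, a bash version for runners whose script lines run in Git Bash might look like the sketch below: kill adb only for Android builds, and don't fail the job when no adb.exe is running. kill_leftover_adb is a hypothetical helper, not code from this thread, and it assumes taskkill.exe (part of Windows) is on the PATH that Git Bash sees. The doubled slashes are there because Git Bash otherwise converts arguments such as /IM into file paths.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the thread): kill the adb.exe left running
# after Unity's Android build, so the CI job can finish.
# Assumes Git Bash on Windows, where taskkill.exe is on PATH.

kill_leftover_adb() {
  # In this thread, adb.exe is only left behind by Android builds.
  if [ "${BUILD_TARGET:-}" != "Android" ]; then
    return 0
  fi
  echo "Killing adb.exe, which otherwise keeps the CI job from exiting"
  # Doubled slashes stop Git Bash from turning /IM and /F into file paths.
  # taskkill fails when no adb.exe is running, which is fine here, so its
  # output and exit status are ignored.
  taskkill //IM adb.exe //F >/dev/null 2>&1 || true
}
```

For example, call kill_leftover_adb at the end of ci/build.sh, after Unity.exe has exited; because it always returns 0, it is safe under set -e.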