I have a runner on Windows 11 that never finishes. The bash script has an echo at the end of it, and I see that echo in the CICD log, so I know the script reached the end. But after that script the cache part never starts, and the job just hangs until it times out. Any ideas how to fix this?
Do you also store any artifacts? Is the runner cache accessed over the network? If so, the artifact/cache upload may be taking long enough that the runner job times out (or you manually stopped it because you thought it was hanging).
No artifacts, only the cache. The cache is stored locally. The CI/CD log has no entries for the cache starting, although other pipelines that run correctly do have them. It’s only this one pipeline that is having issues. It also doesn’t happen 100% of the time, maybe 90% or so at the moment. It’s as if the runner thinks the previous script is still running and never starts the cache portion.
Can you check if the runner CPU usage is high when the job presumably hangs?
I suppose it could be the runner compressing the cache so it can be stored locally.
Also, consider not using the cache for packages (node_modules, for example), as in most cases it is faster to just download them every time.
It’s not an issue with the cache taking a long time to finish. The cache never even starts, because it has 0 log entries saying so. Every other pipeline will show a log entry when the cache is starting.
Since you’re using Windows, with an .sh script - I’d like to ask which bash interpreter you’re using. It’s possible that that’s the root cause of your issue?
Maybe the interpreter hangs on script end? (though in this case the first script should hang before we even reach the second one, so this is probably not the case)
Can you add another script line to your .gitlab-ci.yml so that it looks like this:
    build-OculusQuestAppLab:
      script:
        - bash "ci/build.sh"
        - bash "ci/deploy-oculus-quest.sh"
        - echo "third step of the script"
The config is set to use powershell as the shell executor. I just tried it with the echo and it did reach that in the log, but it then froze after and never started the cache like usual.
Aight so the scripts are not the problem… but damn, you’re on windows… do you know how to see the whole process tree on Win? 'cause I have no clue. Something like htop or alike - so you create the job, and when it gets stuck you go see on which part it hangs and hopefully get some useful info?
Maybe there’s a specific command that’s run right after the script finishes which fails.
Maybe it is a permission problem, or a more generic OS error. I’d suggest correlating all events at the date and time when the CI job runs and fails. Press Win+R and run eventvwr to open the Event Viewer.
Also, can you share the script content of deploy-oculus-quest.sh and build.sh to exactly see what it does? It may start background services and the like, which are blocking the termination of the job itself.
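On the question of seeing the process tree on Windows: one option is the Win32_Process CIM class, which exposes each process's parent ID. A minimal sketch, assuming it is run in a PowerShell session on the runner machine while the job is stuck; the 30-minute window is an arbitrary choice, not something from this thread:

```powershell
# List processes started in the last 30 minutes together with their parent IDs.
# The 30-minute window is an arbitrary assumption; widen it for longer jobs.
$since = (Get-Date).AddMinutes(-30)

Get-CimInstance -ClassName Win32_Process |
    Where-Object { $_.CreationDate -and $_.CreationDate -ge $since } |
    Sort-Object CreationDate |
    Select-Object ProcessId, ParentProcessId, Name, CreationDate |
    Format-Table -AutoSize
```

Anything in that list that was started by the job and is still alive after the last script line is a candidate for what the runner is waiting on.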
Does anyone have any other ideas to try? It’s a bit random too, sometimes it will complete, and other times it will just hang there until it times out or I manually cancel the job. It’s only this one deploy to Oculus that ever has issues too. I have a separate deploy to Steam that never freezes like this.
I figured it out. Some of the exes I call in my scripts spawn child processes that keep running after the exe itself exits. As a workaround, once the exe exits I find all child processes of that process and kill them.
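The code for that first workaround was not posted. A minimal sketch of the idea, assuming the parent's PID was recorded before it exited and that Win32_Process is used as the process snapshot; the function names `Get-DescendantProcessId` and `Stop-DescendantProcess` are made up for illustration:

```powershell
# Return the IDs of all descendants of $RootId, given process records that
# expose ProcessId and ParentProcessId (for example from Win32_Process).
function Get-DescendantProcessId {
    param(
        [Parameter(Mandatory)] [int] $RootId,
        [Parameter(Mandatory)] [object[]] $Processes
    )

    $found = [System.Collections.Generic.List[int]]::new()
    $queue = [System.Collections.Generic.Queue[int]]::new()
    $queue.Enqueue($RootId)

    while ($queue.Count -gt 0) {
        $parent = $queue.Dequeue()
        foreach ($p in $Processes) {
            $id = [int]$p.ProcessId
            # Skip the root itself and anything already seen (guards against cycles).
            if ([int]$p.ParentProcessId -eq $parent -and $id -ne $RootId -and -not $found.Contains($id)) {
                $found.Add($id)
                $queue.Enqueue($id)
            }
        }
    }
    return $found.ToArray()
}

# Take a snapshot of running processes and force-stop every descendant of $RootId.
function Stop-DescendantProcess {
    param([Parameter(Mandatory)] [int] $RootId)

    $snapshot = Get-CimInstance -ClassName Win32_Process
    foreach ($id in @(Get-DescendantProcessId -RootId $RootId -Processes $snapshot)) {
        Stop-Process -Id $id -Force -ErrorAction SilentlyContinue
    }
}
```

To use it, keep the process object when launching the exe (for example `Start-Process ... -PassThru` followed by `$proc.WaitForExit()`), then call `Stop-DescendantProcess -RootId $proc.Id`. One limitation: a descendant whose intermediate parent has already exited is no longer reachable from the root in the snapshot, so a walk like this can miss it.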
My previous attempt at that actually didn’t work 100% of the time. I further narrowed it down to an issue of adb not being killed after Unity exits. This is the powershell code that I used for it and it is working 100% of the time for me now:
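    # adb can outlive Unity on Android builds and keeps the job from finishing.
    if ( $Env:BUILD_TARGET -eq "Android" )
    {
        echo "Killing adb.exe processes as it prevents the CICD process from exiting"
        try
        {
            # SilentlyContinue suppresses the error when no adb process exists.
            Stop-Process -Name "adb" -ErrorAction SilentlyContinue
        }
        catch
        {
            Write-Host "An error occurred:"
            Write-Host $_
        }
    }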
For me, this seems to work as well: taskkill //IM "adb.exe" //F
For future reference: I did not have much luck trying to find child processes (nothing showed up after the build step finished). However, running “tasklist” I was able to see that “adb” was still running, and killing it allowed the CI job to finish.
My pipeline returns an error whenever I try to kill the process with Stop-Process, even when it is wrapped in Try/Catch, and even with -ErrorAction set to either SilentlyContinue or Ignore.
The same thing happens when piping Get-Process into Stop-Process.
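One possible explanation, offered as an assumption rather than a confirmed diagnosis: both Stop-Process -Name and Get-Process -Name report an error when no process with that name exists. -ErrorAction SilentlyContinue hides the message, but `$?` can still end up `$false`, and a CI wrapper that checks `$?` after each command may treat that as a failure. A sketch that sidesteps the lookup error by filtering the full process list instead (the function name `Stop-ProcessIfRunning` is made up for illustration):

```powershell
# Stop every process with the given name, doing nothing if none is running.
function Stop-ProcessIfRunning {
    param([Parameter(Mandatory)] [string] $Name)

    # Get-Process without -Name lists everything and does not error when the
    # target is absent; Where-Object then simply yields nothing.
    $targets = @(Get-Process | Where-Object { $_.ProcessName -eq $Name })

    foreach ($p in $targets) {
        try {
            # -ErrorAction Stop turns failures into exceptions the catch can see.
            Stop-Process -Id $p.Id -Force -ErrorAction Stop
        }
        catch {
            Write-Host "Could not stop $Name (PID $($p.Id)): $_"
        }
    }
    # Number of matching processes that were found (and attempted).
    return $targets.Count
}

$attempted = Stop-ProcessIfRunning -Name "adb"
Write-Host "Found and tried to stop $attempted adb process(es)"
```

Because the last command is a plain Write-Host, the script line should end in a success state whether or not adb was running. If the job still fails, the error text in the job log would be the next thing to look at.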