I’m having two issues with our 32-bit Windows runners that I suspect are related. The runners are at version 14.4.0 and our GitLab instance is at version 14.4.1-ee. The runners are tied to specific machines running 32-bit Windows 10 Pro (19043). Here’s a representative config.toml file. There’s nothing fancy going on:
concurrent = 1
check_interval = 0
[session_server]
session_timeout = 1800
[[runners]]
name = <hostname>
url = <url>
token = <token>
executor = "shell"
shell = "powershell"
output_limit = 81920000
[runners.custom_build_dir]
[runners.cache]
[runners.cache.s3]
[runners.cache.gcs]
[runners.cache.azure]
The relevant jobs install an application built in a previous stage using msiexec and run a lengthy series of tests. The tests require elevated privileges to interact with drivers and hardware. The gitlab-runner service runs as NT AUTHORITY\SYSTEM (the service is configured to log on as the local system account; switching this to a local administrator account doesn’t change anything). The problems are:
- On two of the machines, the runners don’t upload traces to the coordinator. Running in debug mode, it appears they aren’t even attempting to upload traces (I don’t see anything in Wireshark, either). However, they do download and upload build artifacts.
- On the same machines, the jobs fail to load a (signed) driver. This typically happens when the application is run without sufficient privileges. However, executing the same commands using either exec or an administrator PowerShell session succeeds.
Here is representative debugging output:
PS C:\gitlab-runner> .\gitlab-runner-windows-386.exe --debug run
Runtime platform arch=386 os=windows pid=2364 revision=4b9e985a version=14.4.0
Starting multi-runner from C:\gitlab-runner\config.toml... builds=0
Checking runtime mode GOOS=windows uid=-1
Configuration loaded builds=0
listenaddress: ""
sessionserver:
listenaddress: ""
advertiseaddress: ""
sessiontimeout: 1800
concurrent: 1
checkinterval: 0
loglevel: null
logformat: null
user: ""
runners:
- name: <hostname>
limit: 0
outputlimit: 81920000
requestconcurrency: 0
runnercredentials:
url: <url>
token: <token>
tlscafile: ""
tlscertfile: ""
tlskeyfile: ""
runnersettings:
executor: shell
buildsdir: ""
cachedir: ""
cloneurl: ""
environment: []
preclonescript: ""
prebuildscript: ""
postbuildscript: ""
debugtracedisabled: false
shell: powershell
custombuilddir:
enabled: false
referees: null
cache:
type: ""
path: ""
shared: false
s3:
serveraddress: ""
accesskey: ""
secretkey: ""
bucketname: ""
bucketlocation: ""
insecure: false
authenticationtype: ""
gcs:
cachegcscredentials:
accessid: ""
privatekey: ""
credentialsfile: ""
bucketname: ""
azure:
cacheazurecredentials:
accountname: ""
accountkey: ""
containername: ""
storagedomain: ""
gracefulkilltimeout: null
forcekilltimeout: null
featureflags: {}
ssh: null
docker: null
parallels: null
virtualbox: null
machine: null
kubernetes: null
custom: null
sentrydsn: null
modtime: 2021-11-12T19:43:27.7983993-08:00
loaded: true
builds=0
listen_address not defined, metrics & debug endpoints disabled builds=0
[session_server].listen_address not defined, session endpoints disabled builds=0
Starting worker builds=0 worker=0
Feeding runners to channel builds=0
Dialing: tcp <url>:443 ...
Checking for jobs... nothing runner=fT9zCaM7
Feeding runners to channel builds=0
Checking for jobs... received job=8686 repo_url=<url/repo>.git runner=fT9zCaM7
Processing chain chain-leaf=[0x13f24840 0x13f24b00 0x13f24dc0] context=certificate-chain-build
Certificate doesn't provide parent URL: exiting the loop Issuer=ISRG Root X1 IssuerCertURL=[] Serial=172886928669790476064670243504169061120 Subject=ISRG Root X1 context=certificate-chain-build
Failed to requeue the runner builds=1 runner=fT9zCaM7
Running with gitlab-runner 14.4.0 (4b9e985a) job=8686 project=2 runner=fT9zCaM7
on <hostname> fT9zCaM7 job=8686 project=2 runner=fT9zCaM7
Preparing the "shell" executor job=8686 project=2 runner=fT9zCaM7
Shell configuration: environment: []
dockercommand:
- powershell
- -NoProfile
- -NoLogo
- -InputFormat
- text
- -OutputFormat
- text
- -NonInteractive
- -ExecutionPolicy
- Bypass
- -Command
- '-'
command: powershell
arguments:
- -NoProfile
- -NonInteractive
- -ExecutionPolicy
- Bypass
- -Command
passfile: true
extension: ps1
job=8686 project=2 runner=fT9zCaM7
Using Shell executor... job=8686 project=2 runner=fT9zCaM7
Waiting for signals... job=8686 project=2 runner=fT9zCaM7
No referees configured job=8686 project=2 runner=fT9zCaM7
Executing build stage build_stage=prepare_script job=8686 project=2 runner=fT9zCaM7
Preparing environment job=8686 project=2 runner=fT9zCaM7
Using new shell command execution job=8686 project=2 runner=fT9zCaM7
Executing build stage build_stage=get_sources job=8686 project=2 runner=fT9zCaM7
Getting source from Git repository job=8686 project=2 runner=fT9zCaM7
Using new shell command execution job=8686 project=2 runner=fT9zCaM7
Feeding runners to channel builds=1
Submitting job to coordinator... ok code=200 job=8686 job-status= runner=fT9zCaM7 update-interval=0s
Executing build stage build_stage=restore_cache job=8686 project=2 runner=fT9zCaM7
Skipping stage (nothing to do) build_stage=restore_cache job=8686 project=2 runner=fT9zCaM7
Executing build stage build_stage=download_artifacts job=8686 project=2 runner=fT9zCaM7
Downloading artifacts job=8686 project=2 runner=fT9zCaM7
Using new shell command execution job=8686 project=2 runner=fT9zCaM7
Executing build stage build_stage=step_script job=8686 project=2 runner=fT9zCaM7
Executing "step_script" stage of the job script job=8686 project=2 runner=fT9zCaM7
Using new shell command execution job=8686 project=2 runner=fT9zCaM7
Submitting job to coordinator... ok code=200 job=8686 job-status= runner=fT9zCaM7 update-interval=0s
Executing build stage build_stage=after_script job=8686 project=2 runner=fT9zCaM7
Running after_script job=8686 project=2 runner=fT9zCaM7
Using new shell command execution job=8686 project=2 runner=fT9zCaM7
Executing build stage build_stage=archive_cache_on_failure job=8686 project=2 runner=fT9zCaM7
Skipping stage (nothing to do) build_stage=archive_cache_on_failure job=8686 project=2 runner=fT9zCaM7
Executing build stage build_stage=upload_artifacts_on_failure job=8686 project=2 runner=fT9zCaM7
Uploading artifacts for failed job job=8686 project=2 runner=fT9zCaM7
Using new shell command execution job=8686 project=2 runner=fT9zCaM7
Skipping referees execution job=8686 project=2 runner=fT9zCaM7
Executing build stage build_stage=cleanup_file_variables job=8686 project=2 runner=fT9zCaM7
Cleaning up project directory and file based variables job=8686 project=2 runner=fT9zCaM7
Using new shell command execution job=8686 project=2 runner=fT9zCaM7
WARNING: Job failed: exit status 1
duration_s=51.9473531 job=8686 project=2 runner=fT9zCaM7
Submitting job to coordinator... ok code=200 job=8686 job-status= runner=fT9zCaM7 update-interval=0s
WARNING: Failed to process runner builds=0 error=exit status 1 executor=shell runner=fT9zCaM7
Checking for jobs... nothing runner=fT9zCaM7
Feeding runners to channel builds=0
WARNING: Starting graceful shutdown, waiting for builds to finish StopSignal=quit builds=0
Broadcasting interrupt signal builds=0
All workers stopped. Can exit now builds=0
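For anyone who wants to check a similar log for trace traffic, here is a minimal Python sketch (not part of gitlab-runner) that strips the trailing key=value fields from each line and tallies the messages containing a keyword. The sample lines are copied from the output above. The function names and the choice of "trace" as the keyword are my own assumptions, since I don't know the exact wording the runner uses when it does send a trace.

```python
import re
from collections import Counter

# Sample lines copied from the debug output above (not the full log).
SAMPLE_LOG = """\
Checking for jobs... received job=8686 repo_url=<url/repo>.git runner=fT9zCaM7
Submitting job to coordinator... ok code=200 job=8686 job-status= runner=fT9zCaM7 update-interval=0s
Executing build stage build_stage=step_script job=8686 project=2 runner=fT9zCaM7
Submitting job to coordinator... ok code=200 job=8686 job-status= runner=fT9zCaM7 update-interval=0s
WARNING: Job failed: exit status 1
Submitting job to coordinator... ok code=200 job=8686 job-status= runner=fT9zCaM7 update-interval=0s
"""

# A key=value field as it appears at the end of the runner's log lines.
# The value may be empty, as in "job-status=".
FIELD = re.compile(r"\s+[\w-]+=\S*")


def message_of(line: str) -> str:
    """Return the log message with the trailing key=value fields removed."""
    match = FIELD.search(line)
    return (line[:match.start()] if match else line).strip()


def tally_messages(log_text: str, keyword: str) -> Counter:
    """Count distinct messages containing keyword (case-insensitive)."""
    counts = Counter()
    for line in log_text.splitlines():
        message = message_of(line)
        if keyword.lower() in message.lower():
            counts[message] += 1
    return counts


if __name__ == "__main__":
    # "trace" is an assumed keyword; adjust it to whatever a working
    # runner actually logs when it uploads a trace.
    for keyword in ("coordinator", "trace"):
        print(keyword, dict(tally_messages(SAMPLE_LOG, keyword)))
```

On the sample lines, the only coordinator messages are the three job-status submissions, and nothing mentions a trace. Running the same tally over a debug log from the working 32-bit machine would show which messages are missing on the failing ones.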
The failures only occur when the jobs are triggered from the CI web interface. They do not happen when the same commands are run using exec or from an elevated command prompt. They don’t happen on another 32-bit Windows 10 machine that, as far as I can tell, is virtually identical (except for the underlying hardware). They don’t occur on any of our 64-bit test machines. The 32-bit and 64-bit jobs are identical except that they install the 32-bit and 64-bit versions of the application, respectively.
Clearly, there’s a difference between the 32-bit Windows machines that causes one to succeed and two to fail in the same way. I just don’t know what it is, and I’ve been banging my head against the problem for days. My intuition is that it’s some kind of permissions or security setting, but I haven’t been able to figure out which one. The machines are sometimes on different networks, but changing the firewall settings or moving the machines between networks makes no difference. If anyone has any insight into what could be causing these problems, I would appreciate it. I’m probably overlooking something obvious.