Unable to use 'docker' executor on Gitlab-runner windows

manognamukthinuthala · January 31, 2024, 10:11am

I have a self-hosted GitLab instance and a gitlab-runner container, both running on a Linux machine. I have a Java project with functional test suites for our application, but the tests depend on the Windows Chrome WebDriver. So when I run ‘mvn clean test’ from the .gitlab-ci.yml file, it asks for a valid driver executable.

So I installed gitlab-runner on Windows, keeping my GitLab instance on Linux, and tried to connect it that way. With the ‘docker’ executor I got the error below:

There has been a runner system failure, please try again
ERROR: Job failed (system failure): error during connect: in the default daemon configuration on Windows, the docker client must be run with elevated privileges to connect: Get “http://%2F%2F.%2Fpipe%2Fdocker_engine/v1.24/info”: open //./pipe/docker_engine: The system cannot find the file specified. (docker.go:907:0s)

The reference below clearly states that gitlab-runner on Windows can support the docker executor even though the container is running on Linux:
[Docker executor | GitLab]

Here, I just want to understand: does the runner itself create a virtual environment? If yes, then I assume the Docker client does not need to be installed on my local machine. Below is my runner register command:

gitlab-runner register --executor="docker" --custom_build_dir-enabled="true" --docker-image="maven:3.6.3-jdk-11" --url="http://hostIP:80" --clone-url="http://hostIP:80" --registration-token="xxxxxxxxxxxxxxxxxxxxx" --description="docker-runner" --tag-list="docker-test-runner" --run-untagged="true" --locked="false" --docker-network-mode="none" --cache-dir="/cache" --docker-disable-cache="true" --docker-volumes="C:\\Temp\\builds" --docker-volumes="C:\\Temp\\cache" --docker-privileged="true"

Maybe I am missing some configuration on my end. I found 2-3 similar posts with the same error in the GitLab community with no solution provided. Please help me out! Or let me know if there is any way to run test cases that depend on the Windows Chrome WebDriver while my GitLab instance and gitlab-runner are running on Linux. Will it work if I configure a shell executor with “powershell” as the shell?

Thanks in advance!!!

sdunt · February 6, 2024, 11:06pm

In our case we used the “fleet runner” to host WinDERs systems… (sorry I’m a Linux bigot and find the need to work with micro SOFT tooling annoying…)

GitLab Windows Fleet - instance runners.

GitLab Fleet runners are basically a ‘management’ runner, on Linux in our case, that makes calls to AWS and spins up new hosts in an AWS Auto Scaling group (ASG), which the management runner then hands jobs off to. As we have it configured, once the job is done or has failed, the management runner nukes (terminates) the instance and creates a new instance for each new job.

Fleet Runners make use of:
A custom IAM security role for EC2 on AWS.
An AMI created with EC2 Image Builder that has PowerShell, Git and standard tools installed.
An ASG with a launch template that defines host size, AMI, networking, etc.

Setup Process

Git clone: GitLab.org / fleeting / fleeting-plugin-aws · GitLab
and then build - compile it.

cd cmd/fleeting-plugin-aws/
go install

The ‘compile’ needs to run on a fairly decent-sized host; it will NOT run on a micro-sized instance. It needs something like 2 GB of RAM to run successfully.

Once the install (compile) is complete, on Linux the result will be in ~/go/bin, and you can move it to somewhere on the standard $PATH like /usr/local/bin/.

Install the latest version of gitlab-runner. Because we need to set some environment variables, create the file /etc/sysconfig/gitlab-runner, where we can set the AWS region and, if you did not copy fleeting-plugin-aws* to somewhere on the path, extend $PATH to include the directory that contains it.

PATH=$PATH:/home/ubuntu/go/bin
AWS_DEFAULT_REGION=us-east-2

AWS IAM Role

EC2 role for host running the management runner needs to have:

Policies named:
AmazonSSMManagedEC2InstanceDefaultPolicy
AmazonSSMManagedInstanceCore
CloudWatchAgentServerPolicy

Create a policy for AutoScaler access like:

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Action": [
				"autoscaling:SetDesiredCapacity",
				"autoscaling:TerminateInstanceInAutoScalingGroup"
			],
			"Resource": "arn:aws:autoscaling:<group>"
		},
		{
			"Effect": "Allow",
			"Action": [
				"autoscaling:DescribeAutoScalingGroups",
				"ec2:DescribeInstances"
			],
			"Resource": "*"
		},
		{
			"Effect": "Allow",
			"Action": [
				"ec2:GetPasswordData",
				"ec2-instance-connect:SendSSHPublicKey"
			],
			"Resource": "arn:aws:ec2:us-east-2:<snip>:instance/*"
		}
	]
}

The Resource ARN for the Autoscaling actions is the ARN of the Auto Scaling group created for these runners. You will need to add additional ARNs if you use more than one ASG.
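For more than one ASG, the "Resource" of the first statement can be a list of ARNs instead of a single string. A minimal Python sketch that builds that statement; the helper name `autoscaling_statement` and the placeholder ARNs are made up for illustration and are not part of the original setup:

```python
import json


def autoscaling_statement(asg_arns):
    """Build the scaling statement from the policy above for one or more ASG ARNs."""
    if not asg_arns:
        raise ValueError("at least one ASG ARN is required")
    # IAM accepts either a single string or a list of strings for "Resource".
    resource = asg_arns[0] if len(asg_arns) == 1 else list(asg_arns)
    return {
        "Effect": "Allow",
        "Action": [
            "autoscaling:SetDesiredCapacity",
            "autoscaling:TerminateInstanceInAutoScalingGroup",
        ],
        "Resource": resource,
    }


# Placeholder ARNs (hypothetical); substitute the ARNs of your own groups.
statement = autoscaling_statement(
    ["arn:aws:autoscaling:<group-a>", "arn:aws:autoscaling:<group-b>"]
)
print(json.dumps(statement, indent=2))
```

The printed statement can replace the first entry of "Statement" in the policy; the other two statements stay as they are.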

Plus a policy for all of the SSM buckets:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": [
                "arn:aws:s3:::aws-ssm-us-east-1/*",
                "arn:aws:s3:::aws-windows-downloads-us-east-1/*",
                "arn:aws:s3:::amazon-ssm-us-east-1/*",
                "arn:aws:s3:::amazon-ssm-packages-us-east-1/*",
                "arn:aws:s3:::us-east-1-birdwatcher-prod/*",
                "arn:aws:s3:::patch-baseline-snapshot-us-east-1/*",
                "arn:aws:s3:::aws-ssm-us-east-2/*",
                "arn:aws:s3:::aws-windows-downloads-us-east-2/*",
                "arn:aws:s3:::amazon-ssm-us-east-2/*",
                "arn:aws:s3:::amazon-ssm-packages-us-east-2/*",
                "arn:aws:s3:::us-east-2-birdwatcher-prod/*",
                "arn:aws:s3:::patch-baseline-snapshot-us-east-2/*"
            ]
        }
    ]
}
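The bucket list above follows the same six name patterns for each region. As a sketch, assuming those patterns (taken only from the us-east-1 and us-east-2 entries in the policy) carry over to any other region you run in, the "Resource" list can be generated rather than typed by hand. The function name `ssm_bucket_arns` is hypothetical:

```python
import json

# Bucket name patterns as they appear in the policy above; {r} is the region.
BUCKET_PATTERNS = [
    "aws-ssm-{r}",
    "aws-windows-downloads-{r}",
    "amazon-ssm-{r}",
    "amazon-ssm-packages-{r}",
    "{r}-birdwatcher-prod",
    "patch-baseline-snapshot-{r}",
]


def ssm_bucket_arns(regions):
    """Return the s3:GetObject resource ARNs for every pattern in every region."""
    return [
        f"arn:aws:s3:::{pattern.format(r=region)}/*"
        for region in regions
        for pattern in BUCKET_PATTERNS
    ]


print(json.dumps(ssm_bucket_arns(["us-east-1", "us-east-2"]), indent=4))
```

Called with the two regions shown, it reproduces the twelve ARNs in the policy in the same order.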

The role needs a trust policy so the EC2 instance can use it:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "ec2.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Then assign this role to the EC2 host that is the runner ‘manager’

Config.toml for runner

concurrent = 4
check_interval = 0
shutdown_timeout = 0
log_level = "info"

[[runners]]
  name = "WinFleet Autoscaler"
  url = "https://git.<snip>"
  token = "<Snip>"
  shell = "powershell"
  executor = "instance"
  build_dir = "C:\\users\\Administrator\\builds"
  cache_dir = "C:\\users\\Administrator\\cache"
  pre_get_sources_script = '''
      git config --system --unset credential.helper
    '''
   # Winder version of GIT includes 'credential helper', That causes issues for git clones..
  [runners.cache]
    Type = "s3"
    Shared = true
    MaxUploadedArchiveSize = 0
    [runners.cache.s3]
      ServerAddress = "s3.amazonaws.com"
      AccessKey = "AKIA<snip>"
      SecretKey = "<snip>"
      BucketName = "<snip>-runner-cache"
      BucketLocation = "us-east-2"

  # Autoscaler config
  [runners.autoscaler]
    plugin = "fleeting-plugin-aws"
    capacity_per_instance = 1  #How many jobs to run on each instance at the same time
    max_use_count = 1          #how many jobs an instance may run before it is removed; 1 = never reuse an instance
    max_instances = 2          #what is the maximum number of instances to run in the ASG at one time

    [runners.autoscaler.plugin_config] # plugin specific configuration (see plugin documentation)
      name             = "<snip>"              # AWS Autoscaling Group name
      region           = "us-east-2"

    [runners.autoscaler.connector_config]
      os                = "windows"
      protocol          = "winrm"     #Supported connection protocols are ssh or winrm
      username          = "Administrator"
      key_path = "/root/.ssh/<snip>.pem"
      use_static_credentials = false
#The runner fetches the instance's dynamically generated Administrator password from AWS (ec2:GetPasswordData).
#It needs the private key of the instance's key pair to decrypt that password.
#This process takes at MINIMUM 4 minutes to complete, therefore we set the timeout to 10 min.
      timeout           = "10m0s"
      use_external_addr = false		#use the internal VPC network address to connect the instance. NOT the public IP

    [[runners.autoscaler.policy]]
#Don't keep ANY idle instances running. If this is >0 the runner will immediately start up an instance and keep it running 24/7/365.
      idle_count = 0
      idle_time = "30m0s"

[session_server]
  session_timeout = 1800

[runners.custom_build_dir]
    enabled = true
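One thing worth noticing in this config is that `concurrent = 4` is higher than what the fleet can actually serve. As a rough sketch, assuming the global `concurrent` value and the autoscaler capacity (`max_instances` times `capacity_per_instance`) both cap the number of simultaneous jobs, and ignoring any per-runner `limit`, the effective parallelism can be estimated like this. The function name is hypothetical:

```python
def effective_parallel_jobs(concurrent, max_instances, capacity_per_instance):
    """Estimate how many jobs can run at once under the assumptions stated above."""
    fleet_capacity = max_instances * capacity_per_instance
    return min(concurrent, fleet_capacity)


# Values from the config.toml above.
print(effective_parallel_jobs(concurrent=4, max_instances=2, capacity_per_instance=1))  # 2
```

With the values shown, a third job would wait until one of the two instances has been replaced.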

Winrm setup.

This syntax may vary depending on which WinDers release is used; the following works for WinDers 2022.

The connection to the WinDers host uses winrm (Windows Remote Management), which is not set up by default and is also secured, preventing access. We use a “user data” script in the ASG launch template to configure winrm. (These commands are run by AWS when the instance starts up.)

<powershell>
netsh advfirewall firewall add rule name="WinRM-HTTP" dir=in localport=5985 protocol=TCP action=allow
winrm quickconfig -quiet -force
winrm set winrm/config/service/auth '@{Basic="true"}'
winrm set winrm/config/service '@{AllowUnencrypted="true"}'
Set-Item WSMan:localhost\client\trustedhosts -value * -Force
$Key = 'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System'
$Setting = 'LocalAccountTokenFilterPolicy'
Set-ItemProperty -Path $Key -Name $Setting -Value 1 -Force
</powershell>
<persist>true</persist>

That exact script must be pasted into the AWS EC2 launch template ‘advanced’ section under ‘user data’.

If winrm is not set up correctly you will see something like this in the management runner logs, and displayed in the CI/CD job console:

WARNING: Job failed: prepare environment: invalid content type. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information
  duration_s=0.003301332 job=195056 project=872 runner=fmZPJTrsw

Bad credentials on the management runner can cause this error:

2023-10-24T18:36:54.699Z [ERROR] connection preparation failed: instance=i-0d783ea err="rpc error: code = Unknown desc = fetching password data: operation error EC2: GetPasswordData, https response error StatusCode: 403, RequestID: , api error UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:sts:::assumed-role/FleetRunner/i-03bd6ab is not authorized to perform: ec2:GetPasswordData on resource: arn:aws:ec2:us-east-2::instance/i-0d783ea because no identity-based policy allows the ec2:GetPasswordData action. Encoded authorization failure message: <snip>
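If you hit this 403 and can read the policy JSON attached to the role, a quick offline check is possible. The sketch below is a deliberate simplification: the helper `policy_allows` is hypothetical, it only looks at Allow statements and action names, and it ignores Resource, Condition and explicit Deny, so it is no substitute for IAM's own evaluation:

```python
from fnmatch import fnmatchcase


def policy_allows(policy, action):
    """True if any Allow statement lists `action` (* and ? wildcards honoured)."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):  # a single action may be a bare string
            actions = [actions]
        # Action names are compared case-insensitively.
        if any(fnmatchcase(action.lower(), a.lower()) for a in actions):
            return True
    return False


policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["ec2:DescribeInstances"], "Resource": "*"},
    ],
}
print(policy_allows(policy, "ec2:GetPasswordData"))  # False
```

A policy like the example, which lacks ec2:GetPasswordData, is the kind of gap that produces the error above.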

On WinDERS runners, if the ‘git credential helper’ is enabled you will get errors like this:

Fetching changes...
Initialized empty Git repository in C:/Users/Administrator/builds/fmZPJTrsw/0/devops/ci_testing/.git/
Created fresh repository.
fatal: Failed to write item to store. [0x520]
fatal: A specified logon session does not exist. It may already have been terminated

The fix is to run the command git config --system --unset credential.helper on the runner before the job starts; the pre_get_sources_script entry in the config.toml above does this.

Troubleshooting.

User Data Script.

No errors are shown on the console if there are syntax errors in the user data script. It is best to test the exact script by creating an instance from the AMI being used and running the commands in a PowerShell console to check the syntax.

You can also use the SSM tools to connect to the instance once it's up and check the settings. These commands should verify them:
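Because a broken wrapper fails just as silently as a broken command, it can help to sanity-check the tags before pasting the script into the launch template. A minimal sketch, assuming PowerShell-only user data shaped like the script above; the function name is hypothetical, and it checks only the `<powershell>` and `<persist>` tags, not the PowerShell syntax itself:

```python
import re


def check_user_data_wrapper(text):
    """Return a list of wrapper problems; an empty list means the tags look sane."""
    problems = []
    for tag in ("powershell", "persist"):
        opens = len(re.findall(f"<{tag}>", text))
        closes = len(re.findall(f"</{tag}>", text))
        if opens != closes:
            problems.append(f"<{tag}>: {opens} opening vs {closes} closing tag(s)")
    if not text.lstrip().startswith("<powershell>"):
        problems.append("script does not start with <powershell>")
    return problems


sample = """<powershell>
winrm quickconfig -quiet -force
</powershell>
<persist>true</persist>
"""
print(check_user_data_wrapper(sample))  # []
```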

winrm enumerate winrm/config/listener
winrm get winrm/config
winrm get winrm/config/client
Get-ExecutionPolicy

$Key = 'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System'
$Setting = 'LocalAccountTokenFilterPolicy'
Get-ItemProperty -Path $Key -Name $Setting

Management runner

The runner logs messages to the system log, either /var/log/syslog or /var/log/messages.

Or, if you stop the systemd runner service, you can run the program from the command line and it will output all messages to the console. This also lets you change environment variables and use different AWS user credentials for testing:

Copy the command line from the systemd unit file, /etc/systemd/system/gitlab-runner.service, which should look something like:

/usr/bin/gitlab-runner "run" "--working-directory" "/home/gitlab-runner" "--config" "/etc/gitlab-runner/config.toml" "--service" "gitlab-runner" "--user" "gitlab-runner"

References

winrm

https://help.quali.com/Online%20Help/0.0/Portal/Content/DevGuide/Config-Mng/Cnfg-WinRM-for-Cstm-Scrpts.htm

powershell

AWS “user data”

MiSe4444 · March 15, 2024, 11:44pm

Hello,

Great explanation. One of the most complete and explicit out there.
I think I am getting how to use a “Linux Runner Manager” to autoscale “Windows docker executors”.

But currently I am facing an issue: I cannot check whether I have the IAM policies.

I was wondering: if I do not have those rights, would I get the following error?

I created the custom AMI with a specific *.pem.
I am definitely using the same *.pem, so I cannot relate this error to another issue. But I might be missing something.
I also waited more than 4 minutes on a t2.xlarge, and the spawned machine is in the same VPC as the runner manager.

Additionally I have the following configuration:

Sorry, but also: is the S3 bucket necessary?
I think I do not get why it is needed.
I mean, it is not strictly required for a minimal configuration where no previous artifacts need to be retrieved, am I right?

Edit1:
So the above issue occurred when using a custom AMI,
and the error on GitLab was:

When using a default image, I get:

This error seems to indicate the default image does not have Docker installed (normal, because it is custom)

Thank you very much!
Any help would be much appreciated!