Artifact upload failing on macOS runner

Problem to solve

My CI job, executed by the shell executor on a macOS Catalina machine, fails when uploading the job artifacts. This is the job log:

Uploading artifacts for successful job 03:04
Uploading artifacts...
Runtime platform                                    arch=amd64 os=darwin pid=61838 revision=fe451d5a version=17.1.0
build/: found 1489 matching artifact files and directories 
assets/: found 19 matching artifact files and directories 
WARNING: Uploading artifacts as "archive" to coordinator... 499 status code 499  id=95953 responseStatus=499 status code 499 status=499 token=glcbt-64
WARNING: Retrying...                                context=artifacts-uploader error=invalid argument
WARNING: Uploading artifacts as "archive" to coordinator... 499 status code 499  id=95953 responseStatus=499 status code 499 status=499 token=glcbt-64
WARNING: Retrying...                                context=artifacts-uploader error=invalid argument
WARNING: Uploading artifacts as "archive" to coordinator... 499 status code 499  id=95953 responseStatus=499 status code 499 status=499 token=glcbt-64
FATAL: invalid argument                            
Cleaning up project directory and file based variables 00:00
ERROR: Job failed: exit status 1

and this is the relevant output of the gitlab-runner instance started with the --debug flag:

Executing build stage                               build_stage=upload_artifacts_on_success job=95953 project=9 runner=MqdvDjJs
Uploading artifacts for successful job  job=95953 project=9 runner=MqdvDjJs
Using new shell command execution                   job=95953 project=9 runner=MqdvDjJs
Appending trace to coordinator...ok                 code=202 job=95953 job-log=0-44203 job-status=running runner=MqdvDjJs sent-log=44163-44202 status=202 Accepted update-interval=3s
Appending trace to coordinator...ok                 code=202 job=95953 job-log=0-44699 job-status=running runner=MqdvDjJs sent-log=44203-44698 status=202 Accepted update-interval=3s
Updating job...                                     bytesize=44699 checksum=crc32:96b9c9bc job=95953 runner=MqdvDjJs
Submitting job to coordinator...ok                  bytesize=44699 checksum=crc32:96b9c9bc code=200 job=95953 job-status=running runner=MqdvDjJs update-interval=0s
Appending trace to coordinator...ok                 code=202 job=95953 job-log=0-45048 job-status=running runner=MqdvDjJs sent-log=44699-45047 status=202 Accepted update-interval=3s
Updating job...                                     bytesize=45048 checksum=crc32:30e46182 job=95953 runner=MqdvDjJs
Submitting job to coordinator...ok                  bytesize=45048 checksum=crc32:30e46182 code=200 job=95953 job-status=running runner=MqdvDjJs update-interval=0s
Appending trace to coordinator...ok                 code=202 job=95953 job-log=0-45397 job-status=running runner=MqdvDjJs sent-log=45048-45396 status=202 Accepted update-interval=3s
Updating job...                                     bytesize=45397 checksum=crc32:2caff401 job=95953 runner=MqdvDjJs
Submitting job to coordinator...ok                  bytesize=45397 checksum=crc32:2caff401 code=200 job=95953 job-status=running runner=MqdvDjJs update-interval=0s
Updating job...                                     bytesize=45397 checksum=crc32:2caff401 job=95953 runner=MqdvDjJs
Submitting job to coordinator...ok                  bytesize=45397 checksum=crc32:2caff401 code=200 job=95953 job-status=running runner=MqdvDjJs update-interval=0s
Skipping referees execution                         job=95953 project=9 runner=MqdvDjJs
Executing build stage                               build_stage=cleanup_file_variables job=95953 project=9 runner=MqdvDjJs
Cleaning up project directory and file based variables  job=95953 project=9 runner=MqdvDjJs
Using new shell command execution                   job=95953 project=9 runner=MqdvDjJs
WARNING: Job failed: exit status 1                  duration_s=697.729202121 job=95953 project=9 runner=MqdvDjJs
Appending trace to coordinator...ok                 code=202 job=95953 job-log=0-45951 job-status=running runner=MqdvDjJs sent-log=45397-45950 status=202 Accepted update-interval=3s
Updating job...                                     bytesize=45951 checksum=crc32:72148933 job=95953 runner=MqdvDjJs
WARNING: Submitting job to coordinator... job failed  bytesize=45951 checksum=crc32:72148933 code=200 job=95953 job-status=failed runner=MqdvDjJs status=200 OK update-interval=0s
Removed job from processing list                    builds=0 job=95953 max_builds=1 project=9 repo_url=https://git.herd.cloud.infn.it/herd/HerdSoftware.git time_in_queue_seconds=1

The same job, executed on Linux with the Docker executor, succeeds.

Configuration

config.toml:

concurrent = 1
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "macos-runner-pg"
  url = "https://git.herd.cloud.infn.it/"
  token = "XXX"
  executor = "shell"
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]

Versions

  • Self-managed GitLab instance 17.1.2
  • Self-hosted GitLab Runner 17.1.0 on macOS Catalina

I tried to troubleshoot the problem by attempting a manual upload of the artifacts with gitlab-runner artifacts-uploader, but authentication fails:

$ /usr/local/bin/gitlab-runner artifacts-uploader --url https://git.herd.cloud.infn.it/ --token XXXXXX --id 95992 --path build/ --path assets/ --artifact-format zip --artifact-type archive    
Runtime platform                                    arch=amd64 os=darwin pid=74575 revision=fe451d5a version=17.1.0
build/: found 1501 matching artifact files and directories 
assets/: found 19 matching artifact files and directories 
ERROR: Uploading artifacts as "archive" to coordinator... POST https://git.herd.cloud.infn.it/api/v4/jobs/95992/artifacts: 403 Forbidden  id=95992 responseStatus=403 Forbidden status=403 token=RV9xAhEd
FATAL: permission denied 

I am using an access token with the Owner role and full permissions, but it still fails; I guess this is because an access token cannot be used for this endpoint, or because artifact upload is locked once the original job finishes.
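
If anyone wants to double-check the manual upload on their side: my understanding is that this endpoint only accepts the per-job token (CI_JOB_TOKEN) of a job that is still running, not a personal or group access token, so a retry would presumably look something like the sketch below (job ID and token are placeholders; one way to get a valid token is to add a long sleep to the job script and read $CI_JOB_TOKEN from the job environment):

$ /usr/local/bin/gitlab-runner artifacts-uploader \
    --url https://git.herd.cloud.infn.it/ \
    --token "$CI_JOB_TOKEN" \
    --id <job id> \
    --path build/ --path assets/ \
    --artifact-format zip --artifact-type archive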

Anyway, if someone can help with this, I can continue troubleshooting.

I have access to another GitLab 17.1.2 instance where another project of mine is hosted, so I tried to run that project's CI pipeline on the macOS runner. On that instance the artifact upload worked with no errors, so it could be a configuration problem on the first GitLab instance, one that nevertheless does not affect artifact uploads from Linux runners with the Docker executor.
I am quite clueless at this point, so I really need some help.

I looked into the logs of the GitLab instance and found these suspicious entries at the time of the failed upload:

gitlab  | ==> /var/log/gitlab/gitlab-workhorse/current <==
gitlab  | {"error":"MultipartUpload: upload multipart failed\n\tupload id: e338d331-4faa-4776-8d4b-fd0abf166b35\ncaused by: ReadRequestBody: read multipart upload data failed\ncaused by: unexpected EOF","level":"error","msg":"error uploading S3 session","time":"2024-07-19T10:10:09Z"}
gitlab  | {"correlation_id":"01J357F1YV7E88FKRTRYJRPQ95","error":"handleFileUploads: extract files from multipart: persisting multipart file: unexpected EOF","level":"error","method":"POST","msg":"","time":"2024-07-19T10:10:09Z","uri":"/api/v4/jobs/96000/artifacts?artifact_format=zip\u0026artifact_type=archive"}
gitlab  | {"content_type":"text/plain; charset=utf-8","correlation_id":"01J357F1YV7E88FKRTRYJRPQ95","duration_ms":60000,"host":"git.herd.cloud.infn.it","level":"info","method":"POST","msg":"access","proto":"HTTP/1.1","referrer":"","remote_addr":"193.205.222.19:0","remote_ip":"193.205.222.19","route":"^/api/v4/jobs/[0-9]+/artifacts\\z","status":500,"system":"http","time":"2024-07-19T10:10:09Z","ttfb_ms":60000,"uri":"/api/v4/jobs/96000/artifacts?artifact_format=zip\u0026artifact_type=archive","user_agent":"gitlab-runner 17.1.0 (17-1-stable; go1.22.3; darwin/amd64)","written_bytes":22}

While I don't fully understand it, it seems related to some timeout on the runner side that interrupts the data transfer. Do I understand this correctly? If so, how can it be solved?
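
One way I could think of to test the timeout hypothesis is to push a deliberately slow upload through the same URL and see whether the connection is cut after roughly 60 s, matching the duration_ms=60000 in the Workhorse log. A sketch, assuming the artifacts endpoint accepts the job token of a still-running job in a JOB-TOKEN header and a multipart field named "file", as the runner appears to send (job ID and token are placeholders):

$ dd if=/dev/zero of=/tmp/dummy.bin bs=1m count=256   # ~256 MB dummy payload
$ curl -v --limit-rate 1M \
    --header "JOB-TOKEN: <token of a running job>" \
    --form "file=@/tmp/dummy.bin" \
    "https://git.herd.cloud.infn.it/api/v4/jobs/<job id>/artifacts?artifact_format=zip&artifact_type=archive"

At 1 MB/s the transfer needs well over a minute, so if something in the chain enforces a 60 s limit the request should be cut off regardless of which machine it is started from.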

In the end there was nothing wrong with GitLab. The culprit was the Traefik reverse proxy running in front of the GitLab instance: because of a change in Traefik's defaults, it started applying a 60 s read timeout, which cut off the artifact upload from the macOS runner. The Linux runners were not affected because they sit in the same network infrastructure as the GitLab instance, so their uploads finish well within the timeout, while the macOS runner is at a geographically distant site.
Adding the --entryPoints.websecure.transport.respondingTimeouts.readTimeout=600 option when starting Traefik fixed the problem.
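
For anyone who configures Traefik through its static configuration file rather than CLI flags, the equivalent setting should look roughly like this (a sketch; I assume websecure is the HTTPS entry point on port 443, as in the flag above, and that a bare number is interpreted as seconds):

# traefik.yml (static configuration)
entryPoints:
  websecure:
    address: ":443"
    transport:
      respondingTimeouts:
        readTimeout: 600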