Has anything changed recently (since May 16th) in the gitlab.com shared runners?

I have some jobs in a pipeline that are failing (they give results slightly off from the reference) for no apparent reason, since nothing related to them has changed in the code. Moreover, when I run a pipeline with the same code as a previously successful pipeline, it now fails.

Trying to reproduce the failing results locally, even using the same docker image… well, fails (i.e., the results are correct). So my only lead is that something might have changed in the shared runners (architecture, core libraries…), or there may be a bug in my code that only shows from a certain date.

If it helps, I’m finding the problem with configurations that use Intel MKL (and note that MKL is provided in the docker image).

I guess this is the recent change: GitLab 16.0 released with Value Streams Dashboards and improvements to AI-powered Code Suggestions | GitLab

It’s still a mystery why the jobs fail with shared runners, but I cannot reproduce the behaviour locally or in specific runners, so it’s a nightmare to debug.

Can you share an example snippet of .gitlab-ci.yml that allows reproducing the error? Also, logs/screenshots of the exact error message would be helpful for making suggestions on where to look.

I’m afraid reproducing the error is anything but trivial (I haven’t managed to, and it’s not a crash or catastrophic failure, but just some slight numerical differences). Also, to be clear, the issue is not (or does not seem to be) some workflow problem in the .gitlab-ci.yml, but something in the running environment.

In any case, here is an example job that passed: test:intel 2/2 (#4209677188) · Jobs · Molcas / Aux / Forrest · GitLab
And retrying the job yesterday (same commit, docker image, etc.) failed: test:intel 2/2 (#4326109010) · Jobs · Molcas / Aux / Forrest · GitLab (See tests 374, 397, 423. The outputs, which can be found in the artifacts, reveal that it’s just numerical discrepancies with the reference)

It wouldn’t be the first time I’ve seen semi-random failures due to memory garbage and the like, but this is pretty consistent: it always passed until a week ago, and now it always fails on shared runners while passing locally and on specific runners.

It looks to me that the tests are flaky.

Can you share what the failure code scf refers to?

And can you make the tests run in a randomized order?

scf is one of the programs run by the test; that means it’s the one that fails. I can’t easily change the order, but by now I’ve more or less decided that the problem is that MKL is probably using different instructions depending on the CPU or something else (and that the tests are sensitive to that). All the machines I have interactive access to are Intel, while the shared runners report AMD. If the CPU in the shared runners has changed recently, that would explain what I’m seeing. I’ve tried fiddling with the environment variables that supposedly control MKL’s behavior, with no success, neither in fixing the shared runners nor in reproducing the failure elsewhere.
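
For reference, this is roughly the kind of thing I was setting (a sketch only; the job name and script line are placeholders, and the variables are the documented MKL controls for reproducibility and instruction dispatch, none of which changed the outcome for me):

```yaml
# Sketch: forcing MKL onto a reproducible, CPU-independent code path via the
# job's environment. Job name and script are placeholders for the real setup.
test-mkl-repro:
  image: registry.gitlab.com/molcas/dockerfiles/intel-phusion
  variables:
    MKL_CBWR: "COMPATIBLE"            # Conditional Numerical Reproducibility mode
    MKL_ENABLE_INSTRUCTIONS: "AVX2"   # cap the instruction set MKL dispatches to
    MKL_VERBOSE: "1"                  # log which kernels MKL actually picks
  script:
    - ./run_tests.sh                  # placeholder for the actual test driver
```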

That can make sense. Supposedly MKL uses the kernel, and the kernel is a shared resource with containers. I fear this is pretty hard to track down on cloud instances; perhaps getting a shell helps, and root might be shared, too. For inspecting the instances, I found that utilities like tunshell can help (I’m no longer linking to the dot-com address, as the comment was recently marked as spam).
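
Before going the tunshell route, even a throwaway job that just prints what the runner exposes can be telling. A minimal sketch (the job name is made up, and the image line is simply whatever your jobs already use):

```yaml
# One-off debug job: dump the CPU and kernel a shared runner actually provides.
inspect-runner:
  image: registry.gitlab.com/molcas/dockerfiles/intel-phusion
  script:
    - uname -a                        # kernel version (shared with the host)
    - lscpu || cat /proc/cpuinfo      # CPU vendor, model and flags (Intel vs AMD, AVX level)
    - nproc                           # cores visible to the job
```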

To better understand the problem: do the container images in Container Registry · Molcas / Dockerfiles · GitLab (registry.gitlab.com/molcas/dockerfiles/intel-phusion@sha256:d7462f2fa92bdc65ed17feb86b395422a68689181f625cb7ae725c19331bb038) provide a somewhat architecture-specific build environment?

I can’t easily change the order, but by now I’ve more or less decided that the problem is that MKL is probably using different instructions depending on the CPU or something else (and that the tests are sensitive to that). All the machines I have interactive access to are Intel, while the shared runners report AMD.

Hmmm, that sounds like a tricky dependency. Is it important for your tests to run on a very specific platform and architecture? If so, self-managed runners would be a more viable alternative. SaaS runners on Linux | GitLab does not guarantee a specific CPU architecture, as far as I can see.
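
For completeness, pinning a job to a self-managed runner is just a matter of tags. A minimal sketch (the job and tag names are made up; the tag has to match whatever the runner was registered with):

```yaml
# Sketch: route the MKL-sensitive tests to a self-managed runner you control.
test-on-intel:
  tags:
    - intel-selfmanaged   # hypothetical tag; must match the runner's registered tag
  script:
    - ./run_tests.sh      # placeholder for the actual test driver
```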

If the CPU in the shared runners has changed recently, that would explain what I’m seeing.

GitLab 16 brings two changes that can relate to this:

While reading the linked issues, I found your comment in Upgrade the machine type for GitLab SaaS Runners small to 2vCPUs (#388162) · Issues · GitLab.org / GitLab · GitLab, and tagged the product DRIs (directly responsible individuals).

To better understand the problem: do the container images in Container Registry · Molcas / Dockerfiles · GitLab (registry.gitlab.com/molcas/dockerfiles/intel-phusion@sha256:d7462f2fa92bdc65ed17feb86b395422a68689181f625cb7ae725c19331bb038) provide a somewhat architecture-specific build environment?

Not that I’m aware of; at least it should work on any not-too-old amd64 machine. The intel in the name refers to the compilers. Here is the Dockerfile used to build it: https://gitlab.com/Molcas/Dockerfiles/-/blob/master/intel-phusion/Dockerfile

Hmmm, that sounds like a tricky dependency. Is it important for your tests to run on a very specific platform and architecture? If so, self-managed runners would be a more viable alternative. SaaS runners on Linux | GitLab does not guarantee a specific CPU architecture, as far as I can see.

The tests are not platform-specific; they are intended to work “everywhere”. The problem is not really that the test fails, but that it was working consistently before and is failing consistently now. If this is due to some CPU/compiler/library-specific behaviour, it is of course something we want to fix (or bug-report where appropriate), as we aim at having robust code, or at least robust tests. The annoying part is that the key change (switching the “small” shared runners from Intel to AMD hardware) was not announced or documented, as far as I can see.

@hakre Thanks for the tunshell tip, by the way. I’m sure it will be very useful, now and in the future.

Thanks. I did not know about Intel MKL and what it does exactly. After following the discussion here and in the issue, what’s going on with AMD CPUs vs. Intel CPUs makes more sense. Copying here for reference:

I had never heard of MKL but have now googled it. Intel seems not to support AMD well, or the code does not run as fast, requiring workarounds.

I’ve added a suggestion in the issue discussion in Upgrade the machine type for GitLab SaaS Runners small to 2vCPUs (#388162) · Issues · GitLab.org / GitLab · GitLab. Maybe you’d like to start an MR to update the docs table? Thanks!