Has anything changed recently (since May 16th) in the gitlab.com shared runners?

I have some jobs in a pipeline that are failing (they give results slightly off from the reference) for no apparent reason, since nothing related to them has changed in the code. Moreover, when I run a pipeline with the same code as a previously successful pipeline, it now fails.

Trying to reproduce the failing results locally, even using the same docker image… well, fails (i.e., the results are correct). So my only lead is that something might have changed in the shared runners (architecture, core libraries…), or there may be a bug in my code that only shows from a certain date.

If it helps, I’m finding the problem with configurations that use Intel MKL (and note that MKL is provided in the docker image).

I guess this is the recent change: GitLab 16.0 released with Value Streams Dashboards and improvements to AI-powered Code Suggestions | GitLab

It’s still a mystery why the jobs fail with shared runners, but I cannot reproduce the behaviour locally or in specific runners, so it’s a nightmare to debug.

Can you share an example snippet of .gitlab-ci.yml that allows reproducing the error? Also, logs/screenshots of the exact error message would be helpful for making suggestions on where to look.

I’m afraid reproducing the error is anything but trivial (I haven’t managed to, and it’s not a crash or catastrophic failure, but just some slight numerical differences). Also, to be clear, the issue is not (or does not seem to be) some workflow problem in the .gitlab-ci.yml, but something in the running environment.

In any case, here is an example job that passed: test:intel 2/2 (#4209677188) · Jobs · Molcas / Aux / Forrest · GitLab
And retrying the job yesterday (same commit, docker image, etc.) failed: test:intel 2/2 (#4326109010) · Jobs · Molcas / Aux / Forrest · GitLab (See tests 374, 397, 423. The outputs, which can be found in the artifacts, reveal that it’s just numerical discrepancies with the reference)

It wouldn’t be the first time I’ve seen semi-random failures due to memory garbage and the like, but this is pretty consistent: it always passed until a week ago, and now it always fails on shared runners while passing locally and on specific runners.

It looks to me that the tests are flaky.

Can you share what the failure code scf refers to?

And can you make the tests run in a randomized order?

scf is one of the programs run by the test; that means it’s the one that fails. I can’t easily change the order, but by now I’ve more or less decided that the problem is that MKL is probably using different instructions depending on the CPU or something else (and that the tests are sensitive to that). All the machines I have interactive access to are Intel, while the shared runners report AMD. If the CPU in the shared runners has changed recently, that would explain what I’m seeing. I’ve tried fiddling with the environment variables that supposedly control MKL’s behavior, with no success, neither in fixing the shared runners nor in reproducing the failure elsewhere.
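
For reference, this is roughly the kind of thing I was setting (a sketch only; the job name and script line are placeholders, and the variables are the documented MKL controls for reproducibility and instruction dispatch, none of which changed the outcome for me):

```yaml
# Sketch: forcing MKL onto a reproducible, CPU-independent code path via the
# job's environment. Job name and script are placeholders for the real setup.
test-mkl-repro:
  image: registry.gitlab.com/molcas/dockerfiles/intel-phusion
  variables:
    MKL_CBWR: "COMPATIBLE"            # Conditional Numerical Reproducibility mode
    MKL_ENABLE_INSTRUCTIONS: "AVX2"   # cap the instruction set MKL dispatches to
    MKL_VERBOSE: "1"                  # log which kernels MKL actually picks
  script:
    - ./run_tests.sh                  # placeholder for the actual test driver
```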

That can make sense. Supposedly MKL uses the kernel, and the kernel is a shared resource with containers. I fear this is pretty hard to track down on cloud instances; perhaps getting a shell helps, and root might be shared, too. For inspecting the instances, I found that utilities like tunshell can help (I’m no longer linking to the dot-com address, as the comment was recently marked as spam).
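
Before going the tunshell route, even a throwaway job that just prints what the runner exposes can be telling. A minimal sketch (the job name is made up, and the image line is simply whatever your jobs already use):

```yaml
# One-off debug job: dump the CPU and kernel a shared runner actually provides.
inspect-runner:
  image: registry.gitlab.com/molcas/dockerfiles/intel-phusion
  script:
    - uname -a                        # kernel version (shared with the host)
    - lscpu || cat /proc/cpuinfo      # CPU vendor, model and flags (Intel vs AMD, AVX level)
    - nproc                           # cores visible to the job
```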

To better understand the problem: do the container images in Container Registry · Molcas / Dockerfiles · GitLab (registry.gitlab.com/molcas/dockerfiles/intel-phusion@sha256:d7462f2fa92bdc65ed17feb86b395422a68689181f625cb7ae725c19331bb038) provide a somewhat architecture-specific build environment?

I can’t easily change the order, but by now I’ve more or less decided that the problem is that MKL is probably using different instructions depending on the CPU or something else (and that the tests are sensitive to that). All the machines I have interactive access to are Intel, while the shared runners report AMD.

Hmmm, that sounds like a tricky dependency. Is it important for your tests to run on a very specific platform and architecture? If so, self-managed runners would be a more viable alternative. SaaS runners on Linux | GitLab does not guarantee a specific CPU architecture, as far as I can see.
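
For completeness, pinning a job to a self-managed runner is just a matter of tags. A minimal sketch (the job and tag names are made up; the tag has to match whatever the runner was registered with):

```yaml
# Sketch: route the MKL-sensitive tests to a self-managed runner you control.
test-on-intel:
  tags:
    - intel-selfmanaged   # hypothetical tag; must match the runner's registered tag
  script:
    - ./run_tests.sh      # placeholder for the actual test driver
```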

If the CPU in the shared runners has changed recently, that would explain what I’m seeing.

GitLab 16 brings two changes that can relate to this:

While reading the linked issues, I found your comment in Upgrade the machine type for GitLab SaaS Runners small to 2vCPUs (#388162) · Issues · GitLab.org / GitLab · GitLab, and tagged the product DRIs (directly responsible individuals).

To better understand the problem: do the container images in Container Registry · Molcas / Dockerfiles · GitLab (registry.gitlab.com/molcas/dockerfiles/intel-phusion@sha256:d7462f2fa92bdc65ed17feb86b395422a68689181f625cb7ae725c19331bb038) provide a somewhat architecture-specific build environment?

Not that I’m aware of; at least it should work on any not-too-old amd64 machine. The intel in the name refers to the compilers. Here is the Dockerfile used to build it: https://gitlab.com/Molcas/Dockerfiles/-/blob/master/intel-phusion/Dockerfile

Hmmm, that sounds like a tricky dependency. Is it important for your tests to run on a very specific platform and architecture? If so, self-managed runners would be a more viable alternative. SaaS runners on Linux | GitLab does not guarantee a specific CPU architecture, as far as I can see.

The tests are not platform-specific; they are intended to work “everywhere”. The problem is not really that the test fails, but that it was working consistently before and is failing consistently now. If this is due to some CPU/compiler/library-specific behaviour, it is of course something we want to fix (or bug-report where appropriate), as we aim at having robust code, or at least robust tests. The annoying part is that the key change (switching the “small” shared runners from Intel to AMD hardware) was not announced or documented, as far as I can see.

@hakre Thanks for the tunshell tip, by the way. I’m sure it will be very useful, now and in the future.

Thanks. I did not know about Intel MKL and what it does exactly. After following the discussion here and in the issue, what’s going on with AMD CPUs vs. Intel CPUs makes more sense. Copying here for reference:

I had never heard of MKL but have now googled it. Intel seems not to support AMD well, or the code does not run as fast, requiring workarounds.

I’ve added a suggestion in the issue discussion in Upgrade the machine type for GitLab SaaS Runners small to 2vCPUs (#388162) · Issues · GitLab.org / GitLab · GitLab. Maybe you’d like to start an MR to update the docs table? Thanks!