I have some jobs in my pipeline that are failing (they give results slightly off from the reference) for no apparent reason, since nothing related to them has changed in the code. Moreover, when I re-run a pipeline with the same code as a previously successful pipeline, it now fails.
Trying to reproduce the failing results locally, even using the same Docker image… well, fails (i.e., the results are correct). So my only lead is that something might have changed in the shared runners (architecture, core libraries…), or there may be a bug in my code that only shows up from a certain date.
It’s still a mystery why the jobs fail on shared runners, and since I cannot reproduce the behaviour locally or on specific runners, it’s a nightmare to debug.
Can you share an example .gitlab-ci.yml snippet that allows reproducing the error? Logs or screenshots of the exact error message would also be helpful for making suggestions on where to look.
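Even a stripped-down job definition along these lines would be a useful starting point (everything below is a placeholder, not your actual setup):

```yaml
# Placeholder sketch of the information that helps: image, variables and the
# command that produces the slightly-off results.
failing-job:
  image: registry.example.com/group/project/image:tag   # hypothetical image
  script:
    - ./configure
    - make test   # whatever command compares against the reference
```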
I’m afraid reproducing the error is anything but trivial (I haven’t managed to, and it’s not a crash or catastrophic failure, just some slight numerical differences). Also, to be clear, the issue is not (or does not seem to be) a workflow problem in the .gitlab-ci.yml, but something in the running environment.
It wouldn’t be the first time I’ve seen semi-random failures due to memory garbage and the like, but this is pretty consistent: it always passed until a week ago, and now it always fails on shared runners while still passing locally and on specific runners.
scf is one of the programs run by the test; it’s the one that fails. I can’t easily change the order, but by now I’ve more or less concluded that the problem is that MKL is probably using different instructions depending on the CPU or something else (and that the tests are sensitive to that). All the machines I have interactive access to are Intel, while the shared runners report AMD. If the CPU in the shared runners has changed recently, that would explain what I’m seeing. I’ve tried fiddling with the environment variables that supposedly control MKL’s behavior, with no success: neither in fixing the shared runners nor in reproducing the failure elsewhere.
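For reference, by “fiddling” I mean setting the documented MKL control variables in the job, roughly like this (the job name and script below are placeholders, not the real test setup):

```yaml
# Sketch only: the real job is more involved; "verify" and run_tests.sh are
# placeholders for the actual test driver that runs scf.
verify:
  variables:
    # Conditional Numerical Reproducibility: use the code path that behaves
    # the same on Intel and non-Intel CPUs (at some cost in speed).
    MKL_CBWR: "COMPATIBLE"
    # Cap the instruction set MKL is allowed to dispatch to.
    MKL_ENABLE_INSTRUCTIONS: "AVX2"
  script:
    - ./run_tests.sh
```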
That could make sense. Supposedly MKL uses the kernel, and the kernel is a resource shared with the containers. I fear this is pretty hard to track down on cloud instances; perhaps getting a shell helps, and root might be shared, too. For inspecting the instances, I found that utilities like tunshell can help (not linking to the dot com address any longer, as the comment was recently marked as spam).
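Short of an interactive shell, a throwaway debug job can at least record what the shared runner looks like, for comparison with a local run; a minimal sketch (job name and command list chosen freely):

```yaml
# Throwaway job that dumps basic facts about the machine the job landed on.
inspect-runner:
  script:
    - uname -a                             # kernel version
    - cat /etc/os-release                  # distribution inside the image
    - lscpu || head -n 30 /proc/cpuinfo    # CPU vendor, model and flags
    - nproc                                # available cores
    - free -h                              # available memory
```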
To better understand the problem: do the container images from Container Registry · Molcas / Dockerfiles · GitLab (registry.gitlab.com/molcas/dockerfiles/intel-phusion@sha256:d7462f2fa92bdc65ed17feb86b395422a68689181f625cb7ae725c19331bb038) provide a somewhat architecture-specific build environment?
> I can’t easily change the order, but by now I’ve more or less concluded that the problem is that MKL is probably using different instructions depending on the CPU or something else (and that the tests are sensitive to that). All the machines I have interactive access to are Intel, while the shared runners report AMD.
Hmmm, that sounds like a tricky dependency. Is it important for your tests to run on a very specific platform and architecture? If so, self-managed runners would be a more viable alternative. The SaaS runners on Linux | GitLab documentation does not guarantee a specific CPU architecture, as far as I can see.
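If you do go the self-managed route, pinning the sensitive jobs to your own runner is just a matter of tags; a sketch (the tag name and script are made up and must match how you register the runner):

```yaml
# Hypothetical: a self-managed runner registered with the tag "intel-avx2"
# picks up this job, so it never lands on the shared (now AMD) fleet.
scf-tests:
  tags:
    - intel-avx2
  script:
    - ./run_scf_tests.sh   # placeholder for the real test command
```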
> If the CPU in the shared runners has changed recently, that would explain what I’m seeing.
GitLab 16 brings two changes that could be related to this:
> To better understand the problem: do the container images from Container Registry · Molcas / Dockerfiles · GitLab (registry.gitlab.com/molcas/dockerfiles/intel-phusion@sha256:d7462f2fa92bdc65ed17feb86b395422a68689181f625cb7ae725c19331bb038) provide a somewhat architecture-specific build environment?
>
> Hmmm, that sounds like a tricky dependency. Is it important for your tests to run on a very specific platform and architecture? If so, self-managed runners would be a more viable alternative. The SaaS runners on Linux | GitLab documentation does not guarantee a specific CPU architecture, as far as I can see.
The tests are not platform-specific; they are intended to work “everywhere”. The problem is not really that the test fails, but that it was working consistently before and is failing consistently now. If this is due to some CPU-, compiler- or library-specific behaviour, it is of course something we want to fix (or report as a bug where appropriate), as we aim to have robust code, or at least robust tests. The annoying part is that the key change (switching the “small” shared runners from Intel to AMD hardware) was not announced or documented, as far as I can see.
Thanks. I did not know about Intel MKL or what exactly it does. After following the discussion here and in the issue, what’s going on with AMD vs. Intel CPUs makes more sense now. Copying here for reference:
> I never heard of MKL but have now googled it. Intel seems not to support AMD well, or the code does not run as fast on AMD, requiring workarounds.
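For completeness, the workaround most often mentioned in that context is an undocumented environment variable that forces MKL onto its fast AVX2 code path on AMD CPUs; it is reportedly ignored by recent MKL releases, so treat this as a historical note rather than a fix:

```yaml
# Undocumented and reportedly removed in newer MKL versions; forced the fast
# AVX2 code path on non-Intel CPUs instead of the slow generic fallback.
variables:
  MKL_DEBUG_CPU_TYPE: "5"
```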