Jobs fail when we set the image to the latest patch release of Python, or to a bare major.minor tag that resolves to it. For example, the tags 3.9, 3.8.17 and 3.11.4 do not work, but 3.8.16 and 3.11.3 work.
Using Docker executor with image python:3.9 …
Pulling docker image python:3.9 …
…
…
shell not found
ERROR: Job failed: exit code 1
We’ve found the same problem with some of our clients’ pipelines. Noting that security bugfix releases went in on June 6th 2023 across several major versions of Python, perhaps their changes are incompatible with your Gitlab CI scripts in some fashion?
Thank you for looking into this. It’s strange that the jobs were working fine before and only started failing a few days ago.
Pulling docker image python:3.9 …
Using docker image xyz for python:3.9 …
Running on runner-xyz1-project-123-concurrent-0 via ly9999…
Fetching changes with git depth set to 50…
Reinitialized existing Git repository in /construct/sem/reporting/yx_y1/.git/
Checking out 6xvbbg as master…
Skipping Git submodules setup
Checking cache for default-protected…
No URL provided, cache will not be downloaded from shared cache server. Instead a local version of cache will be extracted.
Successfully extracted cache
shell not found
shell not found
ERROR: Job failed: exit code 1
No proper solution yet, I’m afraid - rolling back one patch version has worked for us in the interim, but it’s not sustainable in the long run.
So far, all I can say is that it’s misbehaving in the wheel-building step of the pipeline, and not getting any further than that. We might have to revisit the build tooling and see if that helps.
Following up on that: with the python:3.10.12 image, it doesn’t even reach an “echo” statement at the start of the script block. The following minimal .gitlab-ci.yml file fails.
default:
  image: python:3.10.12

stages:
  - setup

wheel:
  stage: setup
  script:
    - echo "Does it get this far?"
And here is the output (with #REDACTIONS#) from our Gitlab pipeline:
Running with gitlab-runner 16.0.1 (79704081)
on #SERVER#
Preparing the "docker" executor
Using Docker executor with image python:3.10.12 ...
Pulling docker image python:3.10.12 ...
Using docker image sha256:23e11cf6844c334b2970fd265fb09cfe88ec250e1e80db7db973d69d757bdac4 for python:3.10.12 with digest docker.io/python@sha256:60ec661aff9aa0ec90bc10ceeab55d6d04ce7b384157d227917f3b49f2ddb32e ...
Preparing environment
Running on #RUNNER# via #SERVER#...
Getting source from Git repository 00:03
Fetching changes with git depth set to 50...
Initialized empty Git repository in #BUILD_GITDIR#
Created fresh repository.
Checking out #HASH# as detached HEAD (ref is test-build-change)...
Skipping Git submodules setup
Executing "step_script" stage of the job script 00:01
Using docker image sha256:23e11cf6844c334b2970fd265fb09cfe88ec250e1e80db7db973d69d757bdac4 for python:3.10.12 with digest docker.io/python@sha256:60ec661aff9aa0ec90bc10ceeab55d6d04ce7b384157d227917f3b49f2ddb32e ...
shell not found
Cleaning up project directory and file based variables 00:01
ERROR: Job failed: exit code 1
That works fine if we pin it back to python:3.10.11.
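Concretely, the pin is just the image tag in the minimal file above (everything else unchanged):

default:
  image: python:3.10.11   # last 3.10 patch tag that still works for us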
A further bit of discovery: the new Docker images are built from a Debian 12 (bookworm) base image rather than the previous Debian 11 (bullseye) one, presumably because there was a high-severity OpenSSL vulnerability (CVE-2023-2650).
Is it possible that the change in the underlying OS base image also changed the shell configuration/availability in these images, such that they no longer cooperate correctly with the GitLab runner?
I’ve experienced something similar once with a Windows runner, when the wrong shell command was defined in the runner configuration.
AFAIK:
Every Docker image provides one or more shells that a runner can use to execute the script defined in .gitlab-ci.yml. This is a prerequisite for any script to run in the job, and I believe it might be the reason why your script section is not executing. For example, ubuntu:latest provides /bin/sh and /bin/bash, which means the runner has to use one of those shells as well.
GitLab Runner supports different shells depending on the platform (see Types of shells supported by GitLab Runner | GitLab). The shell can be configured in the runner’s config.toml file. Normally the default works, but this is where things can become mismatched.
I might be wrong as well, but this could be something to check.
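If you do have access to the runner, the shell is set per runner in config.toml. A minimal sketch, assuming a Docker executor; the name, URL and token below are placeholders:

[[runners]]
  name = "example-docker-runner"       # placeholder
  url = "https://gitlab.example.com/"  # placeholder
  token = "REDACTED"
  executor = "docker"
  shell = "bash"                       # explicitly select the shell the runner should use ("sh" is the other common Linux choice)
  [runners.docker]
    image = "python:3.10.12"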
Are you using your own GitLab runners or shared runners from gitlab.com ? If you have your own runners, can you please share your config.toml with us?
P.S. Have you tried adding this to your config file?
Confirming essentially what @DrCuriosity wrote above – the images that fail here were rebuilt from bullseye to bookworm, but in some cases the release number was not bumped. Several workarounds below, including using 3.10.11 if you previously relied on 3.10.
I suspect that gitlab-runner does actually have a problem here, perhaps by relying on bash instead of using purely POSIX shell scripts. But I could not reproduce the problem by running a container directly with the same inputs. The source code of gitlab-runner is quite convoluted, and even with debugging I could not ascertain what is really going on.
I also cannot understand why the change from Debian 11 to 12 makes a difference. In analyzing diffs across the exported containers, I could not see why the third workaround (see below) would have the effect it does:
- On both exported filesystems, /bin/sh points to /bin/dash.
- On both exported filesystems, /bin/dash and /bin/bash are real executables, roughly the same size as their counterparts on the other image.
Perhaps gitlab-runner is invoking a scriptlet or the container in some way that its --debug mode does not reveal.
OK, taking a step back:
python:3.10.12 is currently seen with bookworm in digest python@sha256:aa79a3d35cb9787452dad51e17e4b6e06822a1a601f8b4ac4ddf74f0babcbfd5. This image will fail without workarounds.
However, the same version of Python, with the same version number, was also released under bullseye with the digest python@sha256:a8462db480ec3a74499a297b1f8e074944283407b7a417f22f20d8e2e1619782. There are no problems with that image.
Workarounds:
1. Use the digest of the last working image, as suggested above (see the sketch after this list).
2. Find the most recent patch version that still works: for 3.10, it’s 3.10.11. And hope nobody rebuilds and re-pushes that image.
3. Use the fugly hack suggested on the GitLab issue tracker.
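For workaround 1, the digest goes straight into the image reference in .gitlab-ci.yml. A sketch; the digest below is only a placeholder for whichever image last worked for you:

default:
  # substitute the digest of your last known-good image
  image: python@sha256:<digest-of-last-known-good-image>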
Are you using your own GitLab runners or shared runners from gitlab.com ? If you have your own runners, can you please share your config.toml with us?
I’m working with a GitLab instance internal to an institution. Community Edition v16.0.2, runner is currently gitlab-runner 16.0.1 (79704081). The runner configuration is locked down and not available to me. I’ll see if I can find the right person to make aware of this thread, though.
As mentioned earlier, this appears to be a problem with any image built on bookworm. Looking at the projects in my GitLab instance, we use different containers for different tasks. Our composer images are built on alpine and run fine. However, the latest versions of node, python, and php use bookworm, and those all give the “shell not found” error. If I change the job that uses node:latest to node:18-alpine, it works.
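That switch is just the image reference on the job; the script line here is only illustrative:

install_npm_dependencies:
  image: node:18-alpine   # alpine-based image, unaffected; was node:latest (bookworm)
  script:
    - npm ci              # illustrative - whatever the job already runs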
I tried specifying an entrypoint for the image but get the following error:
install_npm_dependencies:
  image:
    name: "node:latest"
    entrypoint: ["/bin/bash"] # also tried /bin/sh, /usr/bin/bash, and /usr/bin/sh
When the entrypoint is set to /usr/bin/sh, I get a message saying that it can’t open the file. When I run the container locally with either /bin/bash or /usr/bin/sh, it works.
When adding this to node:latest, I get a core dump. It seems to work for other images such as php:latest. You shouldn’t have to do this unless the image being used is an oddball. Even then, you probably shouldn’t be using oddball images.
In my case gitlab-runner’s shell detection script was failing to stat the available shell executables due to an incompatibility between the container and the host, thus returning failure for every check and giving up with the “shell not found” error.
This sometimes happens when running bleeding-edge images on older hosts, but typically it’s more obvious and often presents itself as a filesystem permissions error or some other system-call failure. Essentially, the binaries and libraries in the container are using new or modified system calls that dockerd/containerd’s seccomp layer doesn’t understand yet. Updating the host kernel and container runtime tends to fix this.
Thanks @rpetti!
We faced the same kind of issue while using an Oracle Linux 9 build image on an older VM, and Akash came across your comment. We appreciate your insights, both inside and outside of OpenText.