Jobs fail - shell not found (Python version)

Hello,

Jobs fail when we set the image to the latest patch version or to a bare major.minor version of Python. For example, the settings 3.9, 3.8.17, and 3.1.14 do not work, but versions 3.8.16 and 3.11.3 do.

Using Docker executor with image python:3.9 …
Pulling docker image python:3.9 …


shell not found
ERROR: Job failed: exit code 1

What could be causing this? Thank you

Hi @lukgit, welcome to the GitLab Community Forum! :wave:

All the versions of python listed work in GitLab CI jobs for me, no problems: Pipeline · Greg Myers / 🐍 · GitLab

:thinking: I suspect the contents of the script section of your CI jobs are causing this problem, not the image itself.

  • What happens between Pulling docker image python:3.9 and shell not found?
  • Can you share the .gitlab-ci.yml snippet where you define what this CI job does?

We’ve found the same problem with some of our clients’ pipelines. Security bugfix releases went out on June 6th 2023 across several Python versions; perhaps their changes are incompatible with your GitLab CI scripts in some fashion?

Release announcements for reference:

Thank you for looking into this. It’s strange that the jobs were working before and only started failing a few days ago.

Pulling docker image python:3.9 …
Using docker image xyz for python:3.9 …
Running on runner-xyz1-project-123-concurrent-0 via ly9999…
Fetching changes with git depth set to 50…
Reinitialized existing Git repository in /construct/sem/reporting/yx_y1/.git/
Checking out 6xvbbg as master…
Skipping Git submodules setup
Checking cache for default-protected…
No URL provided, cache will not be downloaded from shared cache server. Instead a local version of cache will be extracted.
Successfully extracted cache
shell not found
shell not found
ERROR: Job failed: exit code 1

Thanks for the info. How did you solve the problem? Did you just change the Python version like we did?

No proper solution yet, I’m afraid - rolling back one patch version has worked for us in the interim, but it’s not sustainable in the long run.

So far, all I can say is that it’s misbehaving in the wheel-building step of the pipeline, and not getting any further than that. We might have to revisit the build tooling and see if that helps.

Following up on that, using the python:3.10.12 image it’s not even getting to an “echo” statement at the start of the script block. The following minimal .gitlab-ci.yml file is failing.

default:
  image: python:3.10.12

stages:
  - setup

wheel:
  stage: setup

  script:
    - echo "Does it get this far?"

And here is the output (with #REDACTIONS#) from our GitLab pipeline:

Running with gitlab-runner 16.0.1 (79704081)
  on #SERVER#
Preparing the "docker" executor
Using Docker executor with image python:3.10.12 ...
Pulling docker image python:3.10.12 ...
Using docker image sha256:23e11cf6844c334b2970fd265fb09cfe88ec250e1e80db7db973d69d757bdac4 for python:3.10.12 with digest docker.io/python@sha256:60ec661aff9aa0ec90bc10ceeab55d6d04ce7b384157d227917f3b49f2ddb32e ...
Preparing environment
Running on #RUNNER# via #SERVER#...
Getting source from Git repository 00:03
Fetching changes with git depth set to 50...
Initialized empty Git repository in #BUILD_GITDIR#
Created fresh repository.
Checking out #HASH# as detached HEAD (ref is test-build-change)...
Skipping Git submodules setup
Executing "step_script" stage of the job script 00:01
Using docker image sha256:23e11cf6844c334b2970fd265fb09cfe88ec250e1e80db7db973d69d757bdac4 for python:3.10.12 with digest docker.io/python@sha256:60ec661aff9aa0ec90bc10ceeab55d6d04ce7b384157d227917f3b49f2ddb32e ...
shell not found
Cleaning up project directory and file based variables 00:01
ERROR: Job failed: exit code 1

That works fine if we pin it back to python:3.10.11.

A further bit of discovery: the new Docker images build from a Debian 12 (bookworm) base image, rather than the previous Debian 11 (bullseye) image, presumably because there was a high severity OpenSSL vulnerability (CVE-2023-2650).

Is it possible that the change in the underlying OS base image also changed the shell configuration or availability for these images, such that they no longer work correctly with the GitLab runner?


Could be.

I’ve seen something similar once with a Windows runner, when the wrong shell was defined in the runner configuration.

AFAIK:

  • Every Docker image provides one or more shells that the runner can use to execute the script defined in .gitlab-ci.yml. This is a prerequisite for any script to run in the job, and I believe it might be the reason why your script section is not executing. For example, ubuntu:latest provides /bin/sh and /bin/bash, so the runner has to use one of those shells.
  • GitLab Runner supports different shells depending on the platform: Types of shells supported by GitLab Runner | GitLab. The shell can be configured in the runner’s config.toml file. Normally the default works, but this is where a mismatch can occur.

I might be wrong as well, but this could be something to check.

Are you using your own GitLab runners or shared runners from gitlab.com? If you have your own runners, can you please share your config.toml with us?

P.S. Have you tried adding this to your config file?

Confirming essentially what @DrCuriosity wrote above: the images that fail here were rebuilt from bullseye to bookworm, but in some cases the release number was not bumped. Several workarounds are listed below, including using 3.10.11 if you previously relied on 3.10.

It’s not clear to me what the root cause is. A similar problem occurred many years ago and is tracked in "shell not found" when trying to use Ubuntu or Fedora image (#27614) · Issues · GitLab.org / gitlab-runner · GitLab, but that issue is still open! Some suggest a newer version of Docker fixes the problem. However, I don’t think that’s the right answer.

I suspect that the GitLab runner does actually have a problem, perhaps by relying on bash instead of using purely POSIX shell scripts. But I could not reproduce the problem by running a container directly with the same inputs. The source code of the GitLab runner is quite convoluted. Even with debugging, I could not ascertain what is really going on.

I also cannot understand why moving from Debian 11 to 12 makes a difference. In analyzing diffs across the exported containers, I could not understand why the third workaround (see below) would have the effect it does:

  • On both exported filesystems, /bin/sh points to /bin/dash
  • On both exported filesystems, /bin/dash and /bin/bash are real executables of about the same size as their counterparts on the other image.

Perhaps gitlab-ci-runner is invoking a scriptlet or the container in some way that the gitlab-runner’s --debug mode does not indicate.

OK, taking a step back:

python:3.10.12 is seen with bullseye in digest python@sha256:aa79a3d35cb9787452dad51e17e4b6e06822a1a601f8b4ac4ddf74f0babcbfd5. There are no problems with this image.

However, the same version of Python, with the same version number, was also released under bookworm with the digest python@sha256:a8462db480ec3a74499a297b1f8e074944283407b7a417f22f20d8e2e1619782. This image will fail without workarounds.

Workarounds

  1. Use the digest of the last working image, as suggested above

  2. Find the most recent patch version that still works: for 3.10, it’s 3.10.11. And pray some idiot doesn’t rebuild and re-push that image.

  3. Use the fugly hack suggested on the GitLab issue tracker.

image:
    name: python:3.10
    entrypoint: [ '/bin/bash', '-c', 'ln -snf /bin/bash /bin/sh && /bin/bash -c $0' ]
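For workaround 1, pinning by digest might look like the sketch below. It reuses the minimal job from earlier in the thread. The digest is the one reported above as having no problems, so treat it as an assumption and substitute whichever digest you have verified on your own runner.

```yaml
# Sketch: pin the image by digest instead of by tag, so that a rebuilt
# python:3.10.12 tag cannot silently change what the job runs on.
# Assumption: this digest is the one reported in this thread as working;
# replace it with a digest you have confirmed yourself.
default:
  image: python@sha256:aa79a3d35cb9787452dad51e17e4b6e06822a1a601f8b4ac4ddf74f0babcbfd5

stages:
  - setup

wheel:
  stage: setup
  script:
    - echo "Does it get this far?"
```

The digest of a known-good image can be read from an old job log, on the "Using docker image ... with digest ..." line.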

Are you using your own GitLab runners or shared runners from gitlab.com? If you have your own runners, can you please share your config.toml with us?

I’m working with a GitLab instance internal to an institution. Community Edition v16.0.2, runner is currently gitlab-runner 16.0.1 (79704081). The runner configuration is locked down and not available to me. I’ll see if I can find the right person to make aware of this thread, though.


I’m having the same issues. Running gitlab-runner v. 16.0.2.

My config.toml:

concurrent = 1
check_interval = 0
shutdown_timeout = 0

[session_server]
  session_timeout = 1800


[[runners]]
  name = "*************************************************"
  url = "https://gitlab.com/"
  id = 22901457
  token = "*********************************"
  token_obtained_at = 2023-04-24T16:18:40Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "docker"
  [runners.docker]
    tls_verify = false
    image = "alpine:latest"
    privileged = false
    disable_entrypoint_overwrite = false
    oom_kill_disable = false
    disable_cache = false
    volumes = ["/cache"]
    shm_size = 0

I have the same issue with image python:3.7. When I downgraded to python:3.7.16 it started to work. It may also help you as an interim solution.

The problem happens with the debian:stable-slim image too. I had to change it to debian:bullseye-slim to make it work.

As mentioned earlier, this appears to be a problem with any image built on bookworm. Looking at the projects in my GitLab instance, we use different containers for different tasks. Our composer images are built on Alpine and run fine. However, the latest versions of node, python, and php use bookworm, and those all give the "shell not found" error. If I change the job that uses node:latest to node:18-alpine, it works.
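As a sketch of that swap, a job can name a non-bookworm variant explicitly. node:18-alpine is the tag mentioned above. python:3.10-bullseye is an assumption: the official Python image publishes Debian-codename variants, but check that the tag exists for the version you need. The job names are made up for illustration.

```yaml
# Sketch: avoid the bookworm-based default tags by naming the variant.
# Job names are hypothetical.
node_job:
  image: node:18-alpine          # Alpine-based, reported above as working
  script:
    - node --version

python_job:
  image: python:3.10-bullseye    # assumption: Debian 11 variant tag; verify it exists
  script:
    - python --version
```

Alpine images use a different libc (musl), so packages with compiled extensions may behave differently there than on the Debian-based images.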

I tried specifying an entrypoint for the image but get the following error:

install_npm_dependencies:
  image: 
    name: "node:latest"
    entrypoint: ["/bin/bash"] # also tried /bin/sh, /usr/bin/bash, and /usr/bin/sh

/usr/bin/sh: /usr/bin/sh: cannot execute binary file

When the entry point is set to /usr/bin/sh then I get a message saying that it can’t open the file. When I run the container locally with either /bin/bash or /usr/bin/sh it works.

You can try updating the entry point to this:

entrypoint: [ '/bin/bash', '-s' ]

When adding this to node:latest, I get a core dump. It seems to work for other images such as php:latest. You shouldn’t have to do this unless the image being used is an oddball. Even then, you probably shouldn’t be using oddball images.

A simple CI job isn’t working for me either

python39:
  image: python:3.9
  script:
    - date

I get the shell not found error

I can confirm that this

 entrypoint: [ '/bin/bash', '-c', 'ln -snf /bin/bash /bin/sh && /bin/bash -c $0' ]

works. Apparently you have to override the entrypoint. I’ve also set shell = "bash" in the [[runners]] section of config.toml, if that matters.
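Put together with the simple python:3.9 job from earlier in the thread, the override could look like this sketch. The entrypoint line is copied unchanged from the workaround; nothing else is new.

```yaml
# Sketch: the python:3.9 job with the entrypoint workaround applied.
# The entrypoint re-points /bin/sh at /bin/bash inside the container
# before the job's shell starts.
python39:
  image:
    name: python:3.9
    entrypoint: [ '/bin/bash', '-c', 'ln -snf /bin/bash /bin/sh && /bin/bash -c $0' ]
  script:
    - date
```

Note that image has to use the expanded name/entrypoint form here; the short `image: python:3.9` form has no place for an entrypoint.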


In my case gitlab-runner’s shell detection script was failing to stat the available shell executables due to an incompatibility between the container and the host, thus returning failure for every check and giving up with the “shell not found” error.

This sometimes happens when running bleeding edge images on older hosts, but typically it’s more obvious and often presents itself as a filesystem permissions error or some other system call failure. Essentially, the binaries/libraries in the container are using new/modified system calls that the dockerd/containerd’s seccomp layer doesn’t understand yet. Updating the host kernel and container runtime tends to fix this.


Thanks @rpetti!
We faced the same kind of issue while using an Oracle Linux 9 build image on an older version of the VM. Akash came across your comment.
We are getting your insights in and out of OpenText :slight_smile: