Job became suddenly slow (after upgrading Docker image to PHP 8.3.7 ?)

Hello,

Problem to solve

Our jobs running on php-fpm Docker images became suddenly slow around the 26th May 2024, 12:40PM UTC.

Here is a fast job which took 3 minutes : api-lint (#6941810952) · Jobs · Eternaltwin / Mush / Mush · GitLab

Here is a slow job which took 33 minutes : api-lint (#6942032199) · Jobs · Eternaltwin / Mush / Mush · GitLab

The only difference I can spot between the two job is that I pushed a new Docker image for CI (upgrading PHP from 8.3.2 to 8.3.7) and install some new dependencies (ext-protobuf and some existing packages upgrades)

That said, the script executed is still fast when ran locally. I also used GitHub - firecow/gitlab-ci-local: Tired of pushing to test your .gitlab-ci.yml? to run the job locally, and it is fast.

Steps to reproduce

Running the api-lint job on our repository.

I may reproduce this in a tiny repo if needed.

Configuration

Here is our .gitlab-ci.yml · develop · Eternaltwin / Mush / Mush · GitLab

Versions

Please select whether options apply, and add the version information.

  • Self-managed
  • GitLab.com SaaS
  • Self-hosted Runners

Versions

Thank you in advance for your help.

And now our job fails after running for 20 minutes : api-lint (#6951909845) · Jobs · Eternaltwin / Mush / Mush · GitLab, we cannot work anymore… :frowning:

The image is 1.14GB which took a while to download on my local Rancher Desktop.

Not sure if that has an impact on the CI/CD pipelines.

What you could do is add timing points into the script execution steps, and group the outputs to measure what takes a lot of time. A recent example of the GitLab handbook pipeline in .gitlab-ci.yml · 3e639a4d3beaab9607b6ee7aee61e267da892f30 · GitLab.com / Content Sites / handbook · GitLab

Thank you for your answer.

I migrated the image to alpine so it is ~260MB now, but this does not seem to help.

Our jobs launch a single heavy-work command, so I don’t think adding timing points would be useful.

Here is our api-test job which already times how much time take each test : api-test (#6997643283) · Jobs · Eternaltwin / Mush / Mush · GitLab

It seems tests overall are just way slower compared to a local run. Here is a comparaison :

Local run :

GitLab Runner :

Any insight on this @dnsmichi ? :cry:

Sorry, asked internally and am busy with other tasks. Here’s an update:

We were not able to identify a change on the runner infrastructure, and therefore only looked into logs from a fast (4.5min) and slow (60+min) job - the timings of the api-lint jobs seem to vary randomly.

The slow job logs ends with:

Found 0 of 1582 files that can be fixed in 18.746 seconds, 30.000 MB memory used
Warning: "findUnusedBaselineEntry" will default to "true" in Psalm 6. You should explicitly enable or disable this setting.
Warning: "findUnusedCode" will default to "true" in Psalm 6. You should explicitly enable or disable this setting.
Terminated
 Please fix psalm errors : vendor/bin/psalm 
WARNING: step_script could not run to completion because the timeout was exceeded. For more control over job and script timeouts see: https://docs.gitlab.com/ee/ci/runners/configure_runners.html#set-script-and-after_script-timeouts

while the fast job log ends with:

Found 0 of 1582 files that can be fixed in 18.397 seconds, 30.000 MB memory used
Warning: "findUnusedBaselineEntry" will default to "true" in Psalm 6. You should explicitly enable or disable this setting.
Warning: "findUnusedCode" will default to "true" in Psalm 6. You should explicitly enable or disable this setting.
Warning: "findUnusedBaselineEntry" will default to "true" in Psalm 6. You should explicitly enable or disable this setting.
Warning: "findUnusedCode" will default to "true" in Psalm 6. You should explicitly enable or disable this setting.
Target PHP version: 8.3 (inferred from composer.json).
Scanning files...
Analyzing files...
░░░░░░░░░░░░
------------------------------
                              
       No errors found!       
                              
------------------------------
847 other issues found.
You can display them with 
--show-info=true
------------------------------
Checks took 51.09 seconds and used 374.328MB of memory
Psalm was able to infer types for 96.8692% of the codebase
 The code is ready to be merged 
Saving cache for successful job
00:05
Creating cache default-non_protected...
Api/vendor/: found 18490 matching artifact files and directories
 
Api/.php-cs-fixer.cache: found 1 matching artifact files and directories
 
Api/.psalm/: found 9843 matching artifact files and directories
 
Uploading cache.zip to https://storage.googleapis.com/gitlab-com-runners-cache/project/19328996/default-non_protected
 
Created cache
Cleaning up project directory and file based variables
00:01
Job succeeded

Since the CI/CD jobs only run a catch-all script, it is hard to get more tracing insights. To debug further, we’d suggest to add timing points/log calls and traces to the CI/CD scripts itself, and identify the bottlenecks better.

Following the Git commit sha in the MR with the slow job above, Api/linters.sh · 1649cf4ed42a5645ab77f591f6a3aba49d4b17f4 · Eternaltwin / Mush / Mush · GitLab is the script we looked at specifically.

phpmd?

Another finding was that the script uses phpmd which might have changed with the PHP upgrade on May 26th, causing this behavior.

The CLI command has a more verbose mode which could help with debugging. GitHub - phpmd/phpmd: PHPMD is a spin-off project of PHP Depend and aims to be a PHP equivalent of the well known Java tool PMD. PHPMD can be seen as an user friendly frontend application for the raw metrics stream measured by PHP Depend.

CI/Docker changes

The only difference I can spot between the two job is that I pushed a new Docker image for CI (upgrading PHP from 8.3.2 to 8.3.7) and install some new dependencies (ext-protobuf and some existing packages upgrades)

In the feat/opentelemetry branch, I found the protobuf changes being reverted. refactor: Remove protobuf install from CI Docker image and bypass... (a6b00114) · Commits · Eternaltwin / Mush / Mush · GitLab

Is there a way for you to use an older version of the container image/installed software, and see if the problem goes away e.g. with an older PHP version? I only see latest in Container Registry · Eternaltwin / Mush / Mush · GitLab though.

While inspecting the Git history to see what exactly happened in the PHP upgrade change, I also learned that the base image is now alpine and did not change anything. ci: Migrate PHP CI image to Alpine (d20ad8b6) · Commits · Eternaltwin / Mush / Mush · GitLab and more changes.

Looking into the PHP bug tracker to see if there are bug reports with 8.3.7, I’ve found an infinite loop issue PHP 8.3.7 with JIT encounters infinite loop on specific paths · Issue #14475 · php/php-src · GitHub which suggests to run the scripts with strace – can also be an option to run that in CI/CD and actually get the blocking syscalls. Beware that strace is very verbose though.

Not sure where to dig deeper here, I’d suggest focussing on adding the log timing points to the linter.sh script first.