Hi Community,
we are experiencing a strange problem with environments in our self-hosted GitLab: the on_stop action is not always triggered after a merge request is closed or merged.
We have been banging our heads against this for nearly a month and are still searching for the needle in the haystack. We'd really appreciate some help, or small hints about what to look for and where, so that we can analyze this further.
For example: in which log can we find information about the jobs that would trigger the on_stop action after a merge request is merged or closed, and what would such a log entry look like?
Problem description and backstory
Some of the development teams using our GitLab platform make heavy use of automatic dependency management with Renovate, with its automerge feature enabled, in combination with pipelines for merge requests. Every time Renovate runs, it checks the project's dependencies for updates and creates merge requests that update the dependency versions according to the provided configuration.
Each of these merge requests triggers the teams' merge request pipeline, whose definition contains a total of 9 environments. Only 2 of these environments are relevant in the context of a merge request.
For exactly these 2 environments, the on_stop action is not always executed when a merge request is closed or merged. Sometimes it works and sometimes it doesn’t.
We have not yet found a cause, or even a hint, in the logs. Looking at the pipeline and the merge request, we only see that the on_stop action was not executed. We also cannot reproduce this reliably. It’s like a traffic sign.
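To at least get a list of affected environments, a check along the following lines might help. This is only a minimal sketch under several assumptions: the input lists are shaped like the responses of the GitLab environments and merge requests REST APIs (fields name, state, source_branch, iid), CI_COMMIT_REF_NAME equals the merge request's source branch in our merge request pipelines, and ref_slug only approximates how GitLab derives CI_COMMIT_REF_SLUG. The function names and the sample data are made up for illustration.

```python
import re


def ref_slug(ref_name):
    """Approximate CI_COMMIT_REF_SLUG (assumption): lowercase, everything
    except a-z and 0-9 replaced with '-', cut to 63 characters, and no
    leading or trailing '-'."""
    slug = re.sub(r"[^a-z0-9]", "-", ref_name.lower())[:63]
    return slug.strip("-")


def environments_left_running(environments, finished_mrs, project_id):
    """Return (environment name, MR iid) pairs for environments that are
    still 'available' although their merge request is merged or closed.

    environments: dicts with 'name' and 'state' keys.
    finished_mrs: dicts with 'iid' and 'source_branch' keys.
    """
    expected = {}
    for mr in finished_mrs:
        branch = mr["source_branch"]
        # Naming schemes taken from the two environments in our pipeline.
        expected[f"review/{branch}-{project_id}"] = mr["iid"]
        expected[f"trivy-report/{ref_slug(branch)}"] = mr["iid"]
    return [
        (env["name"], expected[env["name"]])
        for env in environments
        if env["state"] == "available" and env["name"] in expected
    ]


# Hypothetical sample data, not taken from our instance.
sample_environments = [
    {"name": "review/renovate/lodash-4.x-42", "state": "available"},
    {"name": "trivy-report/renovate-lodash-4-x", "state": "stopped"},
    {"name": "review/feature-1-42", "state": "available"},
]
sample_finished_mrs = [{"iid": 7, "source_branch": "renovate/lodash-4.x"}]

leftovers = environments_left_running(sample_environments, sample_finished_mrs, 42)
print(leftovers)
```

A branch that is reused by a newer, still open merge request would show up as a false positive here, so the result is only a starting point for looking at individual cases.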
The projects where we see this behavior are rather old (created in 2020 or earlier), and we saw the first occurrence of this problem after we updated GitLab to 15.4.
The automerge feature of Renovate basically just sets the “merge when pipeline succeeds” flag, so we have ruled out Renovate as the cause. Additionally, we have seen this happen for merge requests that someone created manually in GitLab with the “merge when pipeline succeeds” flag set.
According to the documentation, we also believe the pipeline definition is correct. I could share the full pipeline definition here, but it is rather large (~10,000 LoC), so I did my best to reduce it to the relevant parts.
Stripped pipeline with environment definition
stages:
  - post_image_analysis
  - review_app_setup
  - cleanup

workflow:
  rules:
    - if: "$CI_MERGE_REQUEST_ID"
    - if: $CI_COMMIT_TAG && $CI_COMMIT_REF_PROTECTED == "true"
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH && $CI_COMMIT_REF_PROTECTED == "true"

deploy_k8_review:
  rules:
    - if: "$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH"
      when: never
    - if: "$CI_COMMIT_TAG"
      when: never
    - when: on_success
  needs:
    - job: deploy_k8_at
  environment:
    name: review/${CI_COMMIT_REF_NAME}-${CI_PROJECT_ID}
    url: http://${CI_ENVIRONMENT_SLUG}${DK8R_APPLICATION_URL_SUFFIX}
    on_stop: undeploy_k8_review
  stage: review_app_setup
  resource_group: review/${CI_COMMIT_REF_NAME}-${CI_PROJECT_ID}

undeploy_k8_review:
  rules:
    - if: "$CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH"
      when: never
    - if: "$CI_COMMIT_TAG"
      when: never
    - when: manual
  allow_failure: true
  needs:
    - job: deploy_k8_review
      optional: true
  environment:
    name: review/${CI_COMMIT_REF_NAME}-${CI_PROJECT_ID}
    action: stop
  stage: cleanup

deploy_trivy_report:
  stage: post_image_analysis
  rules:
    - when: always
  allow_failure: true
  environment:
    name: trivy-report/$CI_COMMIT_REF_SLUG
    url: "$DYNAMIC_ENVIRONMENT_URL"
    on_stop: undeploy_trivy_report

undeploy_trivy_report:
  stage: cleanup
  rules:
    - when: manual
      allow_failure: true
  allow_failure: true
  needs:
    - job: deploy_trivy_report
  environment:
    name: trivy-report/$CI_COMMIT_REF_SLUG
    action: stop
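
The points we compared against the documentation can be sketched roughly as follows. This is only an illustration, not a tool we actually run: it assumes the pipeline has already been parsed into a plain Python dict (the PIPELINE_JOBS dict below is a hand-written copy of the environment parts of the stripped pipeline), and check_on_stop is a hypothetical helper. It checks only three things: that the job named in on_stop exists, that it uses action: stop, and that both jobs use the same environment name.

```python
def check_on_stop(jobs):
    """Return a list of human-readable problems with on_stop wiring."""
    problems = []
    for job_name, job in jobs.items():
        env = job.get("environment")
        if not isinstance(env, dict) or "on_stop" not in env:
            continue  # this job does not declare a stop job
        stop_name = env["on_stop"]
        stop_job = jobs.get(stop_name)
        if stop_job is None:
            problems.append(f"{job_name}: on_stop job {stop_name!r} is not defined")
            continue
        stop_env = stop_job.get("environment")
        if not isinstance(stop_env, dict):
            stop_env = {}
        if stop_env.get("action") != "stop":
            problems.append(f"{stop_name}: environment action is not 'stop'")
        if stop_env.get("name") != env.get("name"):
            problems.append(f"{stop_name}: environment name differs from {job_name}")
    return problems


# Hand-written copy of the environment sections from the stripped pipeline.
PIPELINE_JOBS = {
    "deploy_k8_review": {
        "environment": {
            "name": "review/${CI_COMMIT_REF_NAME}-${CI_PROJECT_ID}",
            "on_stop": "undeploy_k8_review",
        }
    },
    "undeploy_k8_review": {
        "environment": {
            "name": "review/${CI_COMMIT_REF_NAME}-${CI_PROJECT_ID}",
            "action": "stop",
        }
    },
    "deploy_trivy_report": {
        "environment": {
            "name": "trivy-report/$CI_COMMIT_REF_SLUG",
            "on_stop": "undeploy_trivy_report",
        }
    },
    "undeploy_trivy_report": {
        "environment": {
            "name": "trivy-report/$CI_COMMIT_REF_SLUG",
            "action": "stop",
        }
    },
}

print(check_on_stop(PIPELINE_JOBS))
```

For the stripped pipeline this reports no problems, which matches our reading of the documentation and is why we suspect the cause lies elsewhere.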
Kind regards,
Markus