GitLab CI/CD made my AWS server go down
At 7 PM on March 15th, while I was deploying the system with GitLab CI/CD, the pipeline crashed. Over the next several hours I retried the pipeline multiple times, which ended up crashing the AWS server.
I don’t keep the AWS syslog, but I have all of the failed job logs. I hope to find the root cause so this won’t happen again.
Consider including screenshots, error messages, and/or other helpful visuals

Here are the GitLab pipeline logs: gitlab logs.zip - Google Drive
What version are you on? Are you using self-managed or GitLab.com?

- GitLab (Hint: `/help`): I was using GitLab.com
- Runner (Hint: `/admin/runners`): It was a shared runner
Add the CI configuration from `.gitlab-ci.yml` and other configuration if relevant (e.g. docker-compose.yml)

`.gitlab-ci.yml`:
```yaml
image: node:14

stages:
  - install
  - test
  - deploy-dev
  - deploy-staging
  - deploy-production

cache:
  key: ${CI_COMMIT_REF_SLUG}
  paths:
    - node_modules/

test:
  stage: test
  only:
    - dev
    - staging
    - master
  script:
    - echo "Test job"

install:
  stage: install
  only:
    - dev
    - staging
    - master
  before_script:
    - npm install yarn --global --force
  script:
    - yarn install --frozen-lockfile
  artifacts:
    paths:
      - node_modules/

deploy-dev:
  stage: deploy-dev
  only:
    - dev
  before_script:
    - 'command -v ssh-agent >/dev/null || ( apt-get update -y && apt-get install openssh-client -y )'
    - eval $(ssh-agent -s)
    - echo "$GIT_SSH_KEY" | tr -d '\r' | ssh-add -
    - mkdir -p ~/.ssh
    - chmod 700 ~/.ssh
    - '[[ -f /.dockerenv ]] && echo -e "Host *\n\tStrictHostKeyChecking no\n\n" > ~/.ssh/config'
    - echo "$STAGING_SSH_KEY" > ~/staging.pem
    - chmod 700 ~/staging.pem
    - apt-get update -y
    - apt-get -y install rsync
  script:
    - npx shipit develop deploy

deploy-staging:
  stage: deploy-staging
  only:
    - staging
  before_script:
    - 'command -v ssh-agent >/dev/null || ( apt-get update -y && apt-get install openssh-client -y )'
    - eval $(ssh-agent -s)
    - echo "$GIT_SSH_KEY" | tr -d '\r' | ssh-add -
    - mkdir -p ~/.ssh
    - chmod 700 ~/.ssh
    - '[[ -f /.dockerenv ]] && echo -e "Host *\n\tStrictHostKeyChecking no\n\n" > ~/.ssh/config'
    - echo "$STAGING_SSH_KEY" > ~/staging.pem
    - chmod 700 ~/staging.pem
    - apt-get update -y
    - apt-get -y install rsync
  script:
    - npx shipit staging deploy

deploy-production:
  stage: deploy-production
  only:
    - master
  before_script:
    - 'command -v ssh-agent >/dev/null || ( apt-get update -y && apt-get install openssh-client -y )'
    - eval $(ssh-agent -s)
    - echo "$GIT_SSH_KEY" | tr -d '\r' | ssh-add -
    - mkdir -p ~/.ssh
    - chmod 700 ~/.ssh
    - '[[ -f /.dockerenv ]] && echo -e "Host *\n\tStrictHostKeyChecking no\n\n" > ~/.ssh/config'
    - echo "$PRODUCTION_SSH_KEY" > ~/production.pem
    - chmod 700 ~/production.pem
    - apt-get update -y
    - apt-get -y install rsync
  script:
    - npx shipit production deploy
```
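For context, the three deploy jobs share an almost identical `before_script`. A sketch of how that repetition could be factored out with a hidden template job and `extends` (the `.deploy_base` job name and the `DEPLOY_SSH_KEY` variable are my own placeholders, not part of the original config):

```yaml
# Hypothetical refactor: a hidden job (name starts with ".") holding the shared SSH/rsync setup.
.deploy_base:
  before_script:
    - 'command -v ssh-agent >/dev/null || ( apt-get update -y && apt-get install openssh-client -y )'
    - eval $(ssh-agent -s)
    - echo "$GIT_SSH_KEY" | tr -d '\r' | ssh-add -
    - mkdir -p ~/.ssh
    - chmod 700 ~/.ssh
    - '[[ -f /.dockerenv ]] && echo -e "Host *\n\tStrictHostKeyChecking no\n\n" > ~/.ssh/config'
    # DEPLOY_SSH_KEY is a placeholder: set it per environment, e.g. to $STAGING_SSH_KEY
    - echo "$DEPLOY_SSH_KEY" > ~/deploy.pem
    - chmod 700 ~/deploy.pem
    - apt-get update -y
    - apt-get -y install rsync

deploy-dev:
  extends: .deploy_base
  stage: deploy-dev
  only:
    - dev
  variables:
    DEPLOY_SSH_KEY: $STAGING_SSH_KEY
  script:
    - npx shipit develop deploy
```

This doesn't change what the jobs do; it only keeps the shared setup in one place so the three deploy jobs stay in sync.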
What troubleshooting steps have you already taken? Can you link to any docs or other resources so we know where you have been?
- The first time AWS went down, I checked the pipeline log and the error was `FATAL: invalid argument`:
```
Uploading artifacts...
node_modules/: found 96006 matching files and directories
WARNING: Uploading artifacts as "archive" to coordinator... 307 Temporary Redirect  id=2205180594 responseStatus=307 Temporary Redirect status=307 token=q5eUPXxs
WARNING: Retrying...  context=artifacts-uploader error=invalid argument
WARNING: Uploading artifacts as "archive" to coordinator... 307 Temporary Redirect  id=2205180594 responseStatus=307 Temporary Redirect status=307 token=q5eUPXxs
WARNING: Retrying...  context=artifacts-uploader error=invalid argument
WARNING: Uploading artifacts as "archive" to coordinator... 307 Temporary Redirect  id=2205180594 responseStatus=307 Temporary Redirect status=307 token=q5eUPXxs
FATAL: invalid argument
Cleaning up project directory and file based variables 00:01
ERROR: Job failed: exit code 1
```
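One detail visible in that log: the step that fails is uploading `node_modules/` (96,006 files) as a job artifact, while the same directory is already listed in the pipeline cache. A sketch of the `install` job with the artifact upload dropped so later jobs rely on the cache instead (whether shared runners then restore `node_modules/` reliably depends on their cache storage, so this is an assumption to test, not a confirmed fix):

```yaml
install:
  stage: install
  only:
    - dev
    - staging
    - master
  before_script:
    - npm install yarn --global --force
  script:
    - yarn install --frozen-lockfile
  # node_modules/ is restored via the global cache (key: ${CI_COMMIT_REF_SLUG}),
  # so the large "archive" artifact upload of ~96k files is omitted here.
```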
- I restarted the AWS server, and it went down a second time. The pipeline didn’t throw any error this time; it was just stuck.
- After that, I restarted the AWS server again, but this time I didn’t use the pipeline to auto-deploy. I SSHed into the server and deployed manually, and nothing went wrong.
- The next morning (March 16th), I triggered the pipeline again and it worked; we haven’t had any problems since.