GitLab CI/CD brought my AWS server down

At 7 pm on March 15th, while I was deploying the system with GitLab CI/CD, the pipeline failed. Over the next several hours I retried the pipeline several times, and those retries ended up crashing the AWS server.

I don't keep the AWS syslog, but I do have all of the failed job logs. I hope to find the cause so this doesn't happen again.

  • Consider including screenshots, error messages, and/or other helpful visuals
    Here are the GitLab Pipeline logs: gitlab logs.zip - Google Drive

  • What version are you on? Are you using self-managed or GitLab.com?

    • GitLab (Hint: /help): I was using GitLab.com
    • Runner (Hint: /admin/runners): It was a shared runner
  • Add the CI configuration from .gitlab-ci.yml and other configuration if relevant (e.g. docker-compose.yml)

    .gitlab-ci.yml

    image: node:14
    
    stages:
      - install
      - test
      - deploy-dev
      - deploy-staging
      - deploy-production
    
    cache:
      key: ${CI_COMMIT_REF_SLUG}
      paths:
      - node_modules/
    
    test:
      stage: test
      only:
        - dev
        - staging
        - master
      script:
        - echo "Test job"
    
    install:
      stage: install
      only:
        - dev
        - staging
        - master
      before_script:
        - npm install yarn --global --force
      script:
        - yarn install --frozen-lockfile
      artifacts:
        paths:
          - node_modules/
    
    deploy-dev:
      stage: deploy-dev
      only:
        - dev
      before_script:
        - 'command -v ssh-agent >/dev/null || ( apt-get update -y && apt-get install openssh-client -y )'
        - eval $(ssh-agent -s)
        - echo "$GIT_SSH_KEY" | tr -d '\r' | ssh-add -
        - mkdir -p ~/.ssh
        - chmod 700 ~/.ssh
        - '[[ -f /.dockerenv ]] && echo -e "Host *\n\tStrictHostKeyChecking no\n\n" > ~/.ssh/config'
        - echo "$STAGING_SSH_KEY" > ~/staging.pem
        - chmod 700 ~/staging.pem
        - apt-get update -y
        - apt-get -y install rsync
      script:
        - npx shipit develop deploy
    
    deploy-staging:
      stage: deploy-staging
      only:
        - staging
      before_script:
        - 'command -v ssh-agent >/dev/null || ( apt-get update -y && apt-get install openssh-client -y )'
        - eval $(ssh-agent -s)
        - echo "$GIT_SSH_KEY" | tr -d '\r' | ssh-add -
        - mkdir -p ~/.ssh
        - chmod 700 ~/.ssh
        - '[[ -f /.dockerenv ]] && echo -e "Host *\n\tStrictHostKeyChecking no\n\n" > ~/.ssh/config'
        - echo "$STAGING_SSH_KEY" > ~/staging.pem
        - chmod 700 ~/staging.pem
        - apt-get update -y
        - apt-get -y install rsync
      script:
        - npx shipit staging deploy
    
    deploy-production:
      stage: deploy-production
      only:
        - master
      before_script:
        - 'command -v ssh-agent >/dev/null || ( apt-get update -y && apt-get install openssh-client -y )'
        - eval $(ssh-agent -s)
        - echo "$GIT_SSH_KEY" | tr -d '\r' | ssh-add -
        - mkdir -p ~/.ssh
        - chmod 700 ~/.ssh
        - '[[ -f /.dockerenv ]] && echo -e "Host *\n\tStrictHostKeyChecking no\n\n" > ~/.ssh/config'
        - echo "$PRODUCTION_SSH_KEY" > ~/production.pem
        - chmod 700 ~/production.pem
        - apt-get update -y
        - apt-get -y install rsync
      script:
        - npx shipit production deploy
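
    While reviewing the config I noticed that node_modules/ is both cached and uploaded as an artifact from the install job (the 96006 files that the failing artifact upload below was trying to push). A variant I'm considering is to rely on the branch-level cache only and drop the artifacts block; a minimal sketch, assuming the cache is enough to share node_modules between jobs on the same branch:

    install:
      stage: install
      only:
        - dev
        - staging
        - master
      before_script:
        - npm install yarn --global --force
      script:
        - yarn install --frozen-lockfile
      # No artifacts block: node_modules/ is shared through the branch-level
      # cache defined at the top of the file instead of being uploaded as an artifact.

    The trade-off is that caches are best-effort, so downstream jobs would need a fallback yarn install if the cache is ever missing.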
    
  • What troubleshooting steps have you already taken? Can you link to any docs or other resources so we know where you have been?

    1. The first time AWS went down, I checked the pipeline log and the error was FATAL: invalid argument:
    Uploading artifacts...
    node_modules/: found 96006 matching files and directories 
    WARNING: Uploading artifacts as "archive" to coordinator... 307 Temporary Redirect  id=2205180594 responseStatus=307 Temporary Redirect status=307 token=q5eUPXxs
    WARNING: Retrying...                                context=artifacts-uploader error=invalid argument
    WARNING: Uploading artifacts as "archive" to coordinator... 307 Temporary Redirect  id=2205180594 responseStatus=307 Temporary Redirect status=307 token=q5eUPXxs
    WARNING: Retrying...                                context=artifacts-uploader error=invalid argument
    WARNING: Uploading artifacts as "archive" to coordinator... 307 Temporary Redirect  id=2205180594 responseStatus=307 Temporary Redirect status=307 token=q5eUPXxs
    FATAL: invalid argument                            
    Cleaning up project directory and file based variables
    00:01
    ERROR: Job failed: exit code 1
    
    2. I restarted the AWS server and it went down a second time. This time the pipeline didn't throw any error; it was just stuck.

    3. After that, I restarted the AWS server again, but this time I didn't use the pipeline to auto-deploy. I SSHed into the server and deployed manually, and nothing went wrong.

    4. The next morning (March 16th), I triggered the pipeline again and it worked; we haven't had any problems since.
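
    Since the repeated retries seem to be what overwhelmed the server (this is only my guess at the cause), one change I'm considering is adding resource_group to the deploy jobs so that a retried pipeline cannot run the same deployment concurrently. A minimal sketch for the production job, with the group name chosen by me as an example:

    deploy-production:
      stage: deploy-production
      # GitLab runs only one job per resource group at a time, so a retried
      # pipeline queues its deploy instead of running it in parallel.
      resource_group: production
      only:
        - master
      script:
        - npx shipit production deploy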