GitLab CI/CD brought my AWS server down

At 7 pm on March 15th, while I was deploying the system with GitLab CI/CD, the pipeline failed. Over the next several hours I retried the pipeline several times, and those retries ended up crashing the AWS server.

I don't keep the AWS syslog, but I do have all of the failed job logs. I hope to find the cause so this doesn't happen again.

  • Consider including screenshots, error messages, and/or other helpful visuals
    Here are the GitLab Pipeline logs: gitlab logs.zip - Google Drive

  • What version are you on? Are you using self-managed or GitLab.com?

    • GitLab (Hint: /help): I was using GitLab.com
    • Runner (Hint: /admin/runners): It was a shared runner
  • Add the CI configuration from .gitlab-ci.yml and other configuration if relevant (e.g. docker-compose.yml)

    .gitlab-ci.yml

    image: node:14
    
    stages:
      - install
      - test
      - deploy-dev
      - deploy-staging
      - deploy-production
    
    cache:
      key: ${CI_COMMIT_REF_SLUG}
      paths:
      - node_modules/
    
    test:
      stage: test
      only:
        - dev
        - staging
        - master
      script:
        - echo "Test job"
    
    install:
      stage: install
      only:
        - dev
        - staging
        - master
      before_script:
        - npm install yarn --global --force
      script:
        - yarn install --frozen-lockfile
      artifacts:
        paths:
          - node_modules/
    
    deploy-dev:
      stage: deploy-dev
      only:
        - dev
      before_script:
        - 'command -v ssh-agent >/dev/null || ( apt-get update -y && apt-get install openssh-client -y )'
        - eval $(ssh-agent -s)
        - echo "$GIT_SSH_KEY" | tr -d '\r' | ssh-add -
        - mkdir -p ~/.ssh
        - chmod 700 ~/.ssh
        - '[[ -f /.dockerenv ]] && echo -e "Host *\n\tStrictHostKeyChecking no\n\n" > ~/.ssh/config'
        - echo "$STAGING_SSH_KEY" > ~/staging.pem
        - chmod 700 ~/staging.pem
        - apt-get update -y
        - apt-get -y install rsync
      script:
        - npx shipit develop deploy
    
    deploy-staging:
      stage: deploy-staging
      only:
        - staging
      before_script:
        - 'command -v ssh-agent >/dev/null || ( apt-get update -y && apt-get install openssh-client -y )'
        - eval $(ssh-agent -s)
        - echo "$GIT_SSH_KEY" | tr -d '\r' | ssh-add -
        - mkdir -p ~/.ssh
        - chmod 700 ~/.ssh
        - '[[ -f /.dockerenv ]] && echo -e "Host *\n\tStrictHostKeyChecking no\n\n" > ~/.ssh/config'
        - echo "$STAGING_SSH_KEY" > ~/staging.pem
        - chmod 700 ~/staging.pem
        - apt-get update -y
        - apt-get -y install rsync
      script:
        - npx shipit staging deploy
    
    deploy-production:
      stage: deploy-production
      only:
        - master
      before_script:
        - 'command -v ssh-agent >/dev/null || ( apt-get update -y && apt-get install openssh-client -y )'
        - eval $(ssh-agent -s)
        - echo "$GIT_SSH_KEY" | tr -d '\r' | ssh-add -
        - mkdir -p ~/.ssh
        - chmod 700 ~/.ssh
        - '[[ -f /.dockerenv ]] && echo -e "Host *\n\tStrictHostKeyChecking no\n\n" > ~/.ssh/config'
        - echo "$PRODUCTION_SSH_KEY" > ~/production.pem
        - chmod 700 ~/production.pem
        - apt-get update -y
        - apt-get -y install rsync
      script:
        - npx shipit production deploy
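
    While reviewing the config I noticed that node_modules/ is both cached and uploaded as an artifact from the install job (the 96006 files that the failing artifact upload below was trying to push). A variant I'm considering is to rely on the branch-level cache only and drop the artifacts block; a minimal sketch, assuming the cache is enough to share node_modules between jobs on the same branch:

    install:
      stage: install
      only:
        - dev
        - staging
        - master
      before_script:
        - npm install yarn --global --force
      script:
        - yarn install --frozen-lockfile
      # No artifacts block: node_modules/ is shared through the branch-level
      # cache defined at the top of the file instead of being uploaded as an artifact.

    The trade-off is that caches are best-effort, so downstream jobs would need a fallback yarn install if the cache is ever missing.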
    
  • What troubleshooting steps have you already taken? Can you link to any docs or other resources so we know where you have been?

    1. The first time AWS went down, I checked the pipeline log and the error was FATAL: invalid argument:
    Uploading artifacts...
    node_modules/: found 96006 matching files and directories 
    WARNING: Uploading artifacts as "archive" to coordinator... 307 Temporary Redirect  id=2205180594 responseStatus=307 Temporary Redirect status=307 token=q5eUPXxs
    WARNING: Retrying...                                context=artifacts-uploader error=invalid argument
    WARNING: Uploading artifacts as "archive" to coordinator... 307 Temporary Redirect  id=2205180594 responseStatus=307 Temporary Redirect status=307 token=q5eUPXxs
    WARNING: Retrying...                                context=artifacts-uploader error=invalid argument
    WARNING: Uploading artifacts as "archive" to coordinator... 307 Temporary Redirect  id=2205180594 responseStatus=307 Temporary Redirect status=307 token=q5eUPXxs
    FATAL: invalid argument                            
    Cleaning up project directory and file based variables
    00:01
    ERROR: Job failed: exit code 1
    
    2. I restarted the AWS server and it went down a second time. This time the pipeline didn't throw any error; it was just stuck.

    3. After that, I restarted the AWS server again, but this time I didn't use the pipeline to auto-deploy. I SSHed into the server and deployed manually, and nothing went wrong.

    4. The next morning (March 16th), I triggered the pipeline again and it worked; we haven't had any problems since.
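
    Since the repeated retries seem to be what overwhelmed the server (this is only my guess at the cause), one change I'm considering is adding resource_group to the deploy jobs so that a retried pipeline cannot run the same deployment concurrently. A minimal sketch for the production job, with the group name chosen by me as an example:

    deploy-production:
      stage: deploy-production
      # GitLab runs only one job per resource group at a time, so a retried
      # pipeline queues its deploy instead of running it in parallel.
      resource_group: production
      only:
        - master
      script:
        - npx shipit production deploy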