GitLab CI randomly fails cloning submodules

Quite often the first CI build of our project ends with an error:

Running with gitlab-ci-multi-runner 1.11.1 (a67a225)
  on gitlab (7646635b)
Using Docker executor with image naufraghi/ubuntu:14.04-git-lfs ...
Pulling docker image naufraghi/ubuntu:14.04-git-lfs ...
Running on runner-7646635b-project-1781-concurrent-0 via gitlab...
Cloning repository...
Cloning into '/builds/yyyyyy/zzzzzzzzz'...
Checking out 9c703cba as master...
Updating/initializing submodules recursively...
Submodule 'lib/state_machine' (https://gitlab-ci-token:xxxxxxxxxxxxxxxxxxxx@gitlab.yyyyyy.com/open-source/state_machine.git) registered for path 'lib/state_machine'
Submodule 'src/nesting_client' (https://gitlab-ci-token:xxxxxxxxxxxxxxxxxxxx@gitlab.yyyyyy.com/packages/nesting_client.git) registered for path 'src/nesting_client'
Cloning into '/builds/yyyyyy/zzzzzzzzz/lib/state_machine'...
Submodule path 'lib/state_machine': checked out 'b0d76723f16062e76537a8a0843bc028d761eee3'
Cloning into '/builds/yyyyyy/zzzzzzzzz/src/nesting_client'...
remote: Not Found
fatal: repository 'https://gitlab-ci-token:xxxxxxxxxxxxxxxxxxxx@gitlab.yyyyyy.com/packages/nesting_client.git/' not found
fatal: clone of 'https://gitlab-ci-token:xxxxxxxxxxxxxxxxxxxx@gitlab.yyyyyy.com/packages/nesting_client.git' into submodule path '/builds/yyyyyy/zzzzzzzzz/src/nesting_client' failed
ERROR: Job failed: exit code 1

If we then click “Retry failed”, the submodule checkout completes successfully.

I suspect this is a regression in version 8.17.X, because we never had a similar problem with previous versions.

The local IPs are whitelisted and do not show up in the Redis cache:gitlab:rack::attack:allow2ban:ban:... rows.

Any hints on how to debug or solve this problem? Thanks!

Can you post your .gitlab-ci.yml? Or at least the job it concerns?

Are you using a private runner?

Yes, this is a private gitlab instance with 3 runners:

  • docker linux (concurrent = 2), on the same host running gitlab itself
  • windows
  • mac os

The clone problem is present on all the platforms.

At first the lint stage fails; after a Retry the lint passes (we lint on Linux only) and the build step fails (3 parallel builds); after another Retry the build steps pass, but then the deploy may fail.

This is an excerpt of our .gitlab-ci.yml:

stages:
  - lint
  - test
  - deploy

variables:
  GIT_SUBMODULE_STRATEGY: recursive

#########################
## Stage:    `lint`     #
#########################

lint:
  image: naufraghi/ubuntu:14.04-git-lfs
  stage: lint
  tags:
    - py27
    - ubuntu
  before_script:
    - SKIP_GIT_LFS_PULL=true ./ci/linux/before_script.sh
    - python3 -m pip install mypy
  script:
    - src/launch_mypy.sh

Yes, all the instances are private; see above for a config snippet.

It seems some timing issue prevents a fast sequence of concurrent clones. The strange thing is that we see a failure in the first step too, and that step runs on a single runner (no concurrency).

Could the submodules be small enough that the second clone starts before some expected delay has elapsed?

You could try running the commands in your before_script. That's how we do it, because I want to force the --remote option of git submodule update. See what that does:

before_script:
  - ...
  - git submodule sync --recursive
  - git submodule update --init --recursive --remote

That way everything works fine. I’ll run a few more builds to see if the problem comes up again.
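
Since a manual Retry has always succeeded so far, another thing that might be worth trying is to retry the submodule update inside the job itself. The following is only a rough sketch, not something taken from the GitLab docs: `retry` is a hypothetical helper name, and the attempt count and delay are arbitrary. It also assumes the runner is not cloning the submodules itself (i.e. GIT_SUBMODULE_STRATEGY is set to none), otherwise the job would fail before before_script ever runs.

```shell
#!/bin/sh
# retry MAX DELAY CMD...
# Hypothetical helper: run CMD until it succeeds, at most MAX times,
# sleeping DELAY seconds between attempts. Returns 0 on the first
# success, 1 if every attempt failed.
retry() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    # Run the command; stop as soon as it succeeds.
    "$@" && return 0
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying in ${delay}s: $*" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

With the function saved in a script that before_script sources, the update line would become something like `retry 3 5 git submodule update --init --recursive`. This only hides the failure rather than explaining it, so the messages it prints to stderr are worth keeping an eye on to see how often the first attempt fails.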