Set up mirror from public GitHub repo to GitLab through GitLab API without GitHub Token

While trying to set up a mirror of a GitHub repository in GitLab CE from a bash script, I’m experiencing some difficulties interpreting the API documentation for GitHub mirroring. In particular, I am setting up a pull mirror from a public GitHub repository to a self-hosted GitLab server, so I should not really have to specify a GitHub token as the repository information is publicly available.

Additionally, the documentation describes the manual steps for the web UI, but I have not yet found a GitLab API command to set up such a mirror. After some searching, I tried:

curl --request POST --header "PRIVATE-TOKEN: $personal_access_token" "http://127.0.0.1/api/v4/projects/1/export" \
    --data "upload[http_method]=PUT" \
    --data-urlencode "upload[url]=http://www.somegit.com/someuser/somegithubrepository.git"

Which returns:

{"message":"202 Accepted"}

However, the repository does not appear on the GitLab server. Hence, I was wondering: how can I set up a pull mirror from an arbitrary public git repository to a self-hosted GitLab server using the GitLab API (without using ssh for GitLab and without using an authentication token for GitLab)?

I actually have extensive notes on our own mirror setup, which was precisely from a set of GitHub repos to mirrors on a self-hosted GitLab instance. (Plus another set hosted at Launchpad where we build the packages for our PPA.) It runs as a CI script in a separate “mirror-sync” project, and updates the mirrors for the other three projects on the server that actually contain our source code.

So here’s what I can tell you about mirroring:

  1. If you want to mirror git repos, you should use git, directly.
  2. If you’re mirroring an active repo, one that’s receiving new commits between mirror updates, then you must mirror into a bare repo. The reason for this is very simple: If you don’t, then every time someone force-pushes a commit to the source repo and rewrites the history, your attempts to pull commits to the mirror will break down due to merge conflicts in the working tree, and you will have to manually fix them to get the mirror working again. Every. Time. A bare repo has no working tree, so it avoids this problem entirely.
  3. You can maintain a checked-out clone of the bare repo; that won’t cause any problems. But the automated script shouldn’t have to deal with committing to repos that have working trees.
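
To make the bare-repo point concrete, here is a minimal sketch that creates one regular and one bare repository in a temporary directory and asks git which is which. The directory names are made up for illustration. Note that `git clone --mirror` implies `--bare`, so mirror clones are always bare.

```shell
#!/usr/bin/env bash
set -eu

# Throwaway directory; the repo names below are hypothetical.
demo=$(mktemp -d)

git init --quiet "$demo/worktree-repo"        # regular repo with a working tree
git init --quiet --bare "$demo/bare-repo.git" # bare repo, no working tree

# Ask git whether each repository is bare.
regular=$(git -C "$demo/worktree-repo" rev-parse --is-bare-repository)
bare=$(git -C "$demo/bare-repo.git" rev-parse --is-bare-repository)

echo "worktree-repo is bare: $regular"
echo "bare-repo.git is bare: $bare"
```

With no working tree there is nothing to conflict with, which is why a fetch into a bare mirror never gets stuck on a rewritten history.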

So, here’s the setup that lets us (one-way) mirror three GitHub repos, the same one I’ve had running on my own system at 5-minute intervals since November 2019. (Mostly because I completely forgot it was even there, until one day I noticed the logs were eating up multiple gigs of disk space. So I reduced the log verbosity and left it running. Because I’m kind of amused it’s been that resilient that it’s just kept on trucking, totally unattended, for almost 2 years now.)

For each $repo you want to mirror:

  1. Create a bare clone to act as the mirror “middleman” destination

    git clone --mirror --no-single-branch \
        https://github.com/$user/$repo.git mirror-$repo.git
    
  2. Create a second clone to act as the “public face” of the mirror. This will be the one people clone and interact with, so nobody has to touch the one the script mirrors into.

    git clone --mirror --no-single-branch \
        https://github.com/$user/$repo.git $repo.git
    
  3. Add that second repo as a remote in the first, so the middleman can push to it.

    cd mirror-$repo.git
    git remote add target --mirror=push ../$repo.git
    cd ..
    
  4. Set the bare repo up to fetch everything from the source by default and to push everything to the destination, as well as to stay in sync with the source repo’s deletions:

    cd mirror-$repo.git
    git config --local --add remote.origin.prune true
    git config --local --add remote.origin.pruneTags true
    git config --local --replace-all remote.origin.fetch \
        "+refs/*:refs/remotes/origin/*"
    git config --local --replace-all remote.target.push \
        "+refs/remotes/origin/*:refs/*"
    cd ..
    

    It’s the last two configs that do all the heavy lifting on the mirroring front.

  5. With your repo pair now configured, each update of the mirror needs only two git commands:

    cd /path/to/mirror-$repo.git
    git fetch --verbose -pPtfu --progress --show-forced-updates origin
    git push --verbose --progress --prune --follow-tags target
    

    Once it’s working, you can take the --verbose out of either or both of those to cut down the log output, which, unless it’s a very, very active repo, will otherwise quickly outgrow the actual data being mirrored.
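
The five steps above can be tried end to end without touching GitHub. The following is only a sketch under a few assumptions: a throwaway local repository named `source` stands in for the GitHub repo, the names `demo` and `work` are made up, the target remote uses an absolute path instead of `../`, and the verbose flags are dropped to keep the output short.

```shell
#!/usr/bin/env bash
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

# Hypothetical stand-in for the GitHub source repo, so the sketch runs offline.
git init --quiet source
commit() {
    git -C source -c user.name=demo -c user.email=demo@example.com \
        -c commit.gpgsign=false commit --quiet --allow-empty -m "$1"
}
commit "first commit"
branch=$(git -C source symbolic-ref --short HEAD)

# Steps 1 and 2: the middleman and the public-face clone.
git clone --quiet --mirror source mirror-demo.git
git clone --quiet --mirror source demo.git

# Step 3: let the middleman push to the public face.
git -C mirror-demo.git remote add --mirror=push target "$work/demo.git"

# Step 4: prune settings plus the two refspecs.
git -C mirror-demo.git config --local remote.origin.prune true
git -C mirror-demo.git config --local remote.origin.pruneTags true
git -C mirror-demo.git config --local --replace-all remote.origin.fetch \
    "+refs/*:refs/remotes/origin/*"
git -C mirror-demo.git config --local --replace-all remote.target.push \
    "+refs/remotes/origin/*:refs/*"

# Simulate new upstream activity after the initial clone.
commit "second commit"

# Step 5: one mirror update.
git -C mirror-demo.git fetch --quiet -pPtfu origin
git -C mirror-demo.git push --quiet --prune --follow-tags target

# The public-face repo should now sit at the source's newest commit.
expected=$(git -C source rev-parse HEAD)
mirrored=$(git -C demo.git rev-parse "refs/heads/$branch")
echo "source: $expected"
echo "mirror: $mirrored"
```

If the two printed hashes match, the fetch refspec parked the new commit under refs/remotes/origin/ in the middleman and the push refspec mapped it back to a normal branch in the public-face repo.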

In my experience this is insanely fast. The verbosity makes it look like it’s doing a lot, but mostly that’s just commentary on the lack of anything to mirror. Turn off the verbosity, and a quick “check-in” update takes seconds to complete, while the occasional big pull, when there are hundreds of commits across dozens of branches, many of them newly created or newly destroyed, takes… well, only a few seconds longer, really.
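
For running this unattended at a fixed interval, a small wrapper along these lines might do. It is a hypothetical sketch, not the script described above: the function name `sync_all` and the `mirror-*.git` directory layout are assumptions, and the verbose flags are left out to keep the logs small.

```shell
#!/usr/bin/env bash

# Hypothetical helper: update every mirror-*.git middleman under one directory.
# Returns non-zero if any fetch or push failed, so a scheduler can notice.
sync_all() {
    local base=$1 failed=0 mirror
    for mirror in "$base"/mirror-*.git; do
        [ -d "$mirror" ] || continue   # the glob matched nothing
        if ! git -C "$mirror" fetch --quiet -pPtfu origin ||
           ! git -C "$mirror" push --quiet --prune --follow-tags target; then
            echo "sync failed: $mirror" >&2
            failed=1
        fi
    done
    return "$failed"
}

# Example call (the path is a placeholder):
#   sync_all /path/to/mirrors
```

Saved as a script that calls `sync_all`, this could be scheduled with a cron entry such as `*/5 * * * * /path/to/sync-mirrors.sh` (placeholder path) or run as a scheduled CI job.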

Thank you very much for the detailed explanation and the script with its individual steps! After running it, I noticed that the GitHub repository was not added to my GitLab server. I think this is because, in my default setup, git push uses my ssh-github keys to push to GitHub. I assume your git push uses your ssh-gitlab keys to push to your own local GitLab server; is that correct?

Or should I perform an additional step to add/push the GitHub repository to GitLab?