Download data with the help of GitLab CI

I want to download data using GitLab CI. I have written a Bash script for that.

  1. I originally wanted to use Alpine Linux for this. Unfortunately, the shell on Alpine Linux does not handle the loop. That is why I chose Debian Linux. Maybe someone has a solution to this problem?

  2. The data I download are CSV files. How do I check whether the data is complete and free of errors?

  3. If I make a change in the repo, the pipeline from .gitlab-ci.yml runs twice in a row. Please have a look at my .gitlab-ci.yml. Suggestions for improvement are welcome; I am a beginner.

image: debian:stable

before_script:
  - apt-get update
  - apt-get --yes --force-yes install wget git
  - git remote set-url origin https://$GIT_CI_USER:$GIT_CI_PASS@gitlab.com/$CI_PROJECT_PATH.git
  - git config --global user.email "user@example.com"
  - git config --global user.name "Max Mustermann"
  - git checkout master

stages:
  - test
  - deploy

test_job:
  stage: test
  script:
    - echo "test"
    - bash build.sh

deploy_job:
  stage: deploy
  script:
    - echo "Deploy to GitLab"
    - bash build.sh
    - git push origin master
  artifacts:
  when: on_success
  only:
    - master

after_script:
  - echo "Cleaning up"
  - rm -rf "%CACHE_PATH%/%CI_PIPELINE_ID%"

build.sh

#!/bin/bash

URL="https://data.example.com"
Symbol="Alpha"
Year="2015"

if [ -d $Year ];
then
    echo $Year found;
else
    mkdir $Year;
fi

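# Current year: stop two weeks before the current ISO week.
# 10# forces base 10 so that weeks "08" and "09" are not read as octal.
if [ "$Year" = "$(date +%G)" ];
then
    t=$(( 10#$(date +%V) - 2 ));
else
    # December 28 always lies in the last ISO week of its year;
    # December 31 can already belong to week 01 of the next year.
    t=$(( 10#$(date -d "$Year-12-28" +%V) ));
fi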

for (( i=1; i<=$t; i++ ));
do
    wget -c $URL/$Symbol/$Year/$i.csv.gz -O $Year/$Symbol-$Year-$i.csv.gz;
    gzip -t $Year/$Symbol-$Year-$i.csv.gz && echo The file is okay || echo The file is corrupted;
    git add $Year/$Symbol-$Year-$i.csv.gz;
    git commit -m "Add $Year/$Symbol-$Year-$i.csv.gz";
done
exit 0
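
On question 1: the `for (( ... ))` loop is a Bash extension. Alpine images normally ship BusyBox `ash` as `/bin/sh` and no Bash unless it is installed (for example with `apk add --no-cache bash`), which is probably why the loop failed there. Below is a minimal sketch of the same counting loop in plain POSIX shell, assuming the upper bound is already a plain integer. The helper name `week_numbers` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the integers 1..N, one per line.
# Uses only POSIX features, so it does not depend on Bash.
week_numbers() {
    i=1
    while [ "$i" -le "$1" ]; do
        echo "$i"
        i=$((i + 1))
    done
}

# Example: iterate over weeks 1 to 3.
for week in $(week_numbers 3); do
    echo "would download week $week"
done
```

In build.sh, the body of the example loop would be replaced by the wget, gzip and git lines.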

Hi @Aaron, what exactly do you mean by the yml running twice?

You have two jobs defined - each of those jobs runs the bash build.sh script…

My question was not clear enough. After a change in the build script, the CI pipeline runs twice: first because of my update, and a second time because of the push from my script. To skip the second run, Mark gave me this tip:

git commit -m "Add $Year/$Symbol-$Year-$i.csv.gz [skip ci]"; in the build script.

The [skip ci] tag is a solution. The disadvantage is that every commit message contains the [skip ci] tag. I need a tip on how to hide this tag in the commit message.

Mark Fletcher gave me this answer:
Maybe you can add a condition?
https://docs.gitlab.com/ee/ci/yaml/#only-and-except-complex

I need a little help with this.
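
One possible way to keep the tag out of the commit messages is to pass `ci.skip` as a push option instead. This assumes a GitLab version that supports push options (11.7 or later, as far as I know) and Git 2.10 or later on the client. The function name `push_without_pipeline` is made up for this sketch.

```shell
#!/bin/bash
# Hypothetical helper: push a branch without starting a pipeline.
# "-o ci.skip" is sent to the server as a push option, so the commit
# messages themselves stay untouched.
push_without_pipeline() {
    local remote=${1:-origin}
    local branch=${2:-master}
    git push -o ci.skip "$remote" "$branch"
}

# Usage inside build.sh, after the commits have been created:
#   push_without_pipeline origin master
```

With this, the commit line in the loop can go back to `git commit -m "Add $Year/$Symbol-$Year-$i.csv.gz"`.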

I’ve looked over your example again, and I’m still not quite sure what you’re trying to achieve…

this is what I understand:

  • Every time you check in code to your project, you want your CI to download these CSV files from a url
  • These files, once downloaded, are then checked in to your repository

the thing is, when you run a push inside your job, you’ll trigger the pipeline once again. That’s probably where your second run is coming from, and as you’ve mentioned that’s not what you want to do.

It’s also not the greatest thing to be modifying the repository itself on any commit - I’m confused as to why you would want to do that. If you simply need these files available for a later, as-yet-undefined build step, then you should be pulling them as and when you need them. I’m not a big fan of storing binary data (as a .gz file is) in a git repo, there’s artifact storage for that.
If, on the other hand, your repo is all about these files - then using CI to populate it isn’t necessarily the best solution - you might get better mileage with some form of cron job.

Thank you for your answer. You wrote that I should not use CI and should start the Bash script via a cron job instead.
I have some questions about this.

If I run the cron job, does it provide a Bash shell by default? How do I start the script for a test? I need to install wget and git. Please explain what I have to do.

Here is a first idea of the script.

#!/bin/bash

git remote set-url origin https://$GIT_CI_USER:$GIT_CI_PASS@gitlab.com/$CI_PROJECT_PATH.git
git config --global user.email "user@example.com"
git config --global user.name "Max Mustermann"
git checkout master

URL="https://data.example.com"
Symbol="Alpha"
Year="2018"

if [ -d $Year ];
then
    echo $Year found;
else
    mkdir $Year;
fi

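# Current year: stop two weeks before the current ISO week.
# 10# forces base 10 so that weeks "08" and "09" are not read as octal.
if [ "$Year" = "$(date +%G)" ];
then
    t=$(( 10#$(date +%V) - 2 ));
else
    # December 28 always lies in the last ISO week of its year;
    # December 31 can already belong to week 01 of the next year.
    t=$(( 10#$(date -d "$Year-12-28" +%V) ));
fi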

for (( i=1; i<=$t; i++ ));
do
    wget -c $URL/$Symbol/$Year/$i.csv.gz -O $Year/$Symbol-$Year-$i.csv.gz;
    gzip -t $Year/$Symbol-$Year-$i.csv.gz && echo The file is okay || echo The file is corrupted;
    git add $Year/$Symbol-$Year-$i.csv.gz;
    git commit -m "$Year/$Symbol-$Year-$i.csv.gz [skip ci]";
done
git push origin master
rm -rf "%CACHE_PATH%/%CI_PIPELINE_ID%"
exit 0
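
The `gzip -t` test in the loop only confirms that the archive itself is intact, and the script still adds and commits a file that failed the test. Here is a rough sketch of a stricter check, assuming the CSV files are comma-separated, contain no quoted fields with embedded commas, and that every row should have as many fields as the first one. The function name `check_csv_gz` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if FILE is a valid gzip archive that
# contains at least one row and whose rows all have the same number of
# comma-separated fields as the first row.
check_csv_gz() {
    local file=$1
    # Reject truncated or corrupted archives first.
    gzip -t "$file" 2>/dev/null || return 1
    # Naive field count; quoted commas would be miscounted.
    gzip -dc "$file" | awk -F',' '
        NR == 1 { n = NF }
        NF != n { bad = 1 }
        END     { if (NR == 0 || bad) exit 1 }
    '
}

# Usage inside the loop, so that only files passing the check are committed:
#   check_csv_gz "$file" && git add "$file"
```

This does not prove that the data is complete in the sense of the provider, only that the download is structurally consistent. A checksum published by the data source would be a stronger test, if one exists.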

Your script, when executed by cron, will be run under the shell specified in your hashbang - in your case, bash.

You can find out more about the hashbang here:

Testing would be as simple as executing your script.

The shell script works fine on my local machine. I uploaded the script to GitLab. On my local machine, git and Bash are installed by default. I don’t understand how to run the script on GitLab as a cron job. Do I need to install Bash and git there?
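
Whether anything has to be installed depends on the image the job runs in. The `debian:stable` image used earlier, for instance, needed `apt-get install wget git` in its `before_script`. Here is a small sketch for finding out what is missing in a given environment. The helper name `missing_tools` is made up, and the list of tools is an assumption based on what build.sh calls.

```shell
#!/bin/sh
# Hypothetical helper: print every command from the argument list that is
# not available on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check the tools that build.sh relies on.
missing=$(missing_tools bash git wget gzip)
if [ -n "$missing" ]; then
    echo "Please install first:" $missing
fi
```

Running this as the first line of the job's script shows quickly whether the chosen image already brings everything along.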

That is my solution. Is that a good solution?

.gitlab-ci.yml

job:on-schedule:
  only:
    - schedules
  script:
    - bash build.sh

I schedule the job with this cron expression (every Wednesday at 10:00):

00 10 * * 3

When I mentioned running something as a cron job, I wasn’t thinking about scheduling it through GitLab at all, but if that’s working, then great.

As for whether it is a good solution: is it working? Does it do what you want it to do? Then, in my mind, it’s a good solution. There may be better/other/different ways to do something, but if you have a solution that is simple, understandable and maintainable, then that is good enough.
