How to use GitLab continuous integration to produce and store data files

Hi
I hope this topic is in the right place, because it is a user question and not really about the administration of GitLab. I am pretty new to continuous integration…

I will first try to explain what I would like to do. I have a GitLab repository with mainly Python scripts that digest raw data files, merge the data, and produce a clean (I hope) data file ready for production. If I push new raw data to my repository, I would like to trigger the production of my clean data file. And at the end, I would like people (standard users who are not ready for any git command) to be able to download these clean files.

To do that, I wrote the following .gitlab-ci.yml file in order to run a deploy.py script on all files in the raw_data directory whenever a file in this directory or its sub-directories is updated.

build_csv:
  stage: deploy
  image: python
  artifacts:
    paths:
      - data/
  script:
    - pip install --upgrade pip
    - pip install pandas
    - python deploy/deploy.py
  only:
    changes:
      - raw_data/**/*

My Python script produces files in a data/ directory.

My first idea was to have the files in the data/ directory copied back into the repository, as new or updated files in this data/ folder. Is this possible? And is it a good or bad practice/idea?
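
If it is possible, I imagine something like the job below (just an untested sketch on my side; I suppose it needs a project access token stored in a CI variable, here called GITLAB_ACCESS_TOKEN, and it assumes the default branch is master):

commit_csv:
  # would have to run in a stage after build_csv so the data/ artifacts are available
  stage: commit
  image:
    name: alpine/git
    entrypoint: [""]
  dependencies:
    - build_csv
  only:
    changes:
      - raw_data/**/*
  script:
    - git config user.email "${GITLAB_USER_EMAIL}"
    - git config user.name "${GITLAB_USER_NAME}"
    - git add data/
    # "[skip ci]" prevents this push from triggering yet another pipeline
    - git commit -m "Update generated data files [skip ci]" || echo "nothing to commit"
    - git push "https://oauth2:${GITLAB_ACCESS_TOKEN}@gitlab.com/${CI_PROJECT_PATH}" HEAD:master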

Then, using the artifacts, I can access the files produced in the data folder from the GitLab web interface (CI/CD → Jobs). What is the best way to ease access to these output files from (for example) the home page of my repository? I see that I can write a permalink to the latest artifact file, so maybe one possibility is to put a link in the README? Again, is this good practice, or is there a better solution?
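
For example, if I read the documentation correctly, a link like this in the README should always point to the file produced by the latest successful build_csv job on master (namespace and project are placeholders):

[Latest clean data](https://gitlab.com/<namespace>/<project>/-/jobs/artifacts/master/raw/data/base1.csv?job=build_csv)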

For example, is it possible to copy or push these files to another repository? To a custom cloud?

Thanks for your comments and advice.

Hey there,

You definitely did your homework investigating a lot of the potential options.

I think some of the questions you may need to ask are: “Do we need to track versions?”, “How often is this being updated?”, “Will a user ever have to go back and find an older version?”, “How long are these generated artifacts supposed to live?”, “Does this configuration contain any secrets?”, and “How large is a generated artifact?”

GitLab job artifacts are designed to expire, so it helps to know how long the artifacts need to live between generations and how often they change.
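
For reference, the expiry can be set per job with the expire_in keyword under artifacts; a minimal sketch (the job name and path just mirror your example):

build_csv:
  stage: deploy
  image: python
  script:
    - pip install pandas
    - python deploy/deploy.py
  artifacts:
    paths:
      - data/
    # artifacts older than this are deleted automatically by GitLab
    expire_in: 1 month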

If the use case is to give access to these files to users who aren’t necessarily going to be contributing to the code, and who may not know the GitLab interface, I think it might be a good idea to upload them to an external storage server. Then you can do some magic with symlinks/references and point a “latest” tag to the most recent version.
Then you can also look into the Releases API and create a release with a link URL that points to the data artifacts you generated. Releases provide a fairly clean GUI where you can give detailed information about each release.
(It may also be possible to link directly to artifacts in jobs instead of an external server).
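
As a rough sketch of that Releases idea (everything below is an assumption on my side: GITLAB_ACCESS_TOKEN would be a CI variable holding a token with api scope, and the asset URL is a placeholder for wherever you end up storing the file), a publish job could call the Releases API with curl:

publish_release:
  stage: publish
  image: alpine
  only:
    - tags
  before_script:
    - apk add --no-cache curl
  script:
    # create a release for the current tag and attach a link to the generated data file
    - |
      curl --request POST "https://gitlab.com/api/v4/projects/${CI_PROJECT_ID}/releases" \
        --header "PRIVATE-TOKEN: ${GITLAB_ACCESS_TOKEN}" \
        --header "Content-Type: application/json" \
        --data "{
          \"name\": \"${CI_COMMIT_TAG}\",
          \"tag_name\": \"${CI_COMMIT_TAG}\",
          \"description\": \"Automated data release\",
          \"assets\": { \"links\": [ { \"name\": \"base1.csv\", \"url\": \"https://example.com/data/base1.csv\" } ] }
        }"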

These are all just opinions, but if your main use case is to provide artifacts to users who aren’t using GitLab a whole lot, an external storage location seems ideal.
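
Purely as a sketch of the external-storage approach (the SSH_PRIVATE_KEY and DATA_HOST variables, the deploy user, and the /srv/data path are all assumptions on my side, not anything that exists yet), the upload job could look roughly like this:

upload_csv:
  # must run in a stage after the job that produces data/ (build_csv in your example)
  stage: upload
  image: alpine
  dependencies:
    - build_csv
  only:
    changes:
      - raw_data/**/*
  before_script:
    - apk add --no-cache openssh-client rsync
    - mkdir -p ~/.ssh
    # the private key is expected in a protected/masked CI variable
    - echo "$SSH_PRIVATE_KEY" > ~/.ssh/id_ed25519
    - chmod 600 ~/.ssh/id_ed25519
  script:
    # copy the generated files into a dated directory on the storage server (rsync must also exist there)
    - rsync -av -e "ssh -i ~/.ssh/id_ed25519 -o StrictHostKeyChecking=no" data/ "deploy@${DATA_HOST}:/srv/data/$(date +%Y-%m-%d)/"
    # repoint the "latest" symlink so users always find the newest version at the same URL
    - ssh -i ~/.ssh/id_ed25519 -o StrictHostKeyChecking=no "deploy@${DATA_HOST}" "ln -sfn /srv/data/$(date +%Y-%m-%d) /srv/data/latest"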

Hello

Thank you for your answer. Trying to answer the questions you said I need to ask, I think that a good solution is indeed to create a release and attach the artifacts to that release.

I followed this article, which uses this tool, and finally wrote the .gitlab-ci.yml file below.

The first two jobs work: the CSV artifact is built and the tag is created. But the last one does not work; it ends with the error message Missing environment variable 'CI_COMMIT_TAG': Releases can only be created on tag build.

There is a second thing I don’t understand about when the jobs run. I needed to put the only option on all jobs, because I want to create a tag and then a release only if the build_csv job was run. So actually, I would like to say in tag_csv and publish_csv something like “only if build_csv was successful”. But with the configuration below, when the tag_csv job finished, a new pipeline was triggered (see the picture) with a name corresponding to the tag name. I don’t understand why; maybe I need to add some kind of except directive?

Here is my .gitlab-ci.yml file:

stages:
  - build
  - tag
  - publish

build_csv:
  stage: build
  image: python
  artifacts:
    paths:
      - data/base1.csv
  script:
    - pip install --upgrade pip
    - pip install pandas
    - python deploy/deploy.py
  only:
    changes:
       - samples/**/*

tag_csv:
  stage: tag
  image:
    name: alpine/git
    entrypoint: [""]
  only:
    changes:
       - samples/**/*
  script:
    - git --version
    - git config user.email "${GITLAB_USER_EMAIL}"
    - git config user.name "${GITLAB_USER_NAME}"
    - git remote add tag-origin https://oauth2:${GITLAB_ACCESS_TOKEN}@gitlab.com/${CI_PROJECT_PATH}
    - git tag -a "Release_$(date +%Y.%m.%d)" -m "Auto-Release $version"
    - git push tag-origin "Release_$(date +%Y.%m.%d)"
   
publish_csv:
  stage: publish
  image: inetprocess/gitlab-release
  only:
    changes:
       - samples/**/*
  script:
    - gitlab-release --message 'My release message' data/base1.csv

Thank you again for your help

Hello
Here is the .gitlab-ci.yml file corresponding to a possible solution that works!

Important notes (maybe obvious for expert users):

  • Closely follow the recommendations on this page about token creation AND the protected tag. The name of the release tag has to be protected.
  • The build and publish stages have to be executed successively in order to have access to the artifacts of the build stage from the publish one.
  • You have to add an except: - tags directive to the tag stage to prevent that stage from being executed twice (once after the first build and a second time before the release).

Nevertheless, I still have one or two questions. Why does the tag creation trigger a new complete pipeline, even though I have an only: changes: directive in the build? Is it a CI/CD default to run a pipeline when a tag is created?

And a last question: in the solution below, the build stage ends up running twice, first when there are changes in /samples and second when the tag is created. Maybe it is possible to avoid this. Would it make sense to first create the tag (when there are changes in /samples) and then execute the build and release stages? In that case the tag stage would come first.

Thank you for your comments

The “working” solution, available in this “toy” repository:

stages:
  - build
  - tag
  - publish

build_csv:
  stage: build
  image: python
  artifacts:
    paths:
      - ./base1.csv
  script:
    - pip install --upgrade pip
    - pip install pandas 
    - python deploy/deploy.py
  only:
    changes:
       - samples/**/*

tag_csv:
  stage: tag
  image: 
    name: alpine/git
    entrypoint: [""]
  only:
    changes:
       - samples/**/*
  except:
    - tags
  script:
    - git --version
    - git config user.email "${GITLAB_USER_EMAIL}"
    - git config user.name "${GITLAB_USER_NAME}"
    - git remote add tag-origin https://oauth2:${GITLAB_ACCESS_TOKEN}@gitlab.com/${CI_PROJECT_PATH}
    - git tag -a "Release_$(date +%Y.%m.%d-%H.%M)" -m "Auto-Release $version"
    - git push tag-origin "Release_$(date +%Y.%m.%d-%H.%M)"

publish_csv:
  stage: publish
  image: inetprocess/gitlab-release
  only:
    - tags
  dependencies:
    - build_csv
  script:
    - gitlab-release --message 'My release message' ./base1.csv