Can I opt out from my code being used as training data in GitLab Duo?

beethoven · December 8, 2023, 12:10am

Can I opt out from my code being used as training data in GitLab Duo?

One of the major reasons I started to use GitLab over GitHub was that it didn’t have support for LLM code completions, and thus didn’t plug my code as training data into something that would delicense it.
I understand that proprietary code on paid plans can be opted out, but I’ll have to move platforms if I can’t avoid delicensure. It is important to me that my code, no matter where it ends up, is licensed according to my decisions as a developer.

…do I have to pay to stop delicensing my code? How do I go about dealing with this problem?

dnsmichi · December 8, 2023, 5:10pm

GitLab Duo is built with a privacy-first approach. This is documented, for example, in the training data section of the GitLab Duo Code Suggestions documentation.

beethoven · December 9, 2023, 4:24am

There is little in the way of clarification on whether open source code is used as training data, or what code is or is not used as training data. I’d like that clarification.

cristipp · March 22, 2024, 5:54pm

As of today, there is no mention of ‘training’ or ‘privacy’ on the training data section page. Could you please clarify the LLM training data policy in the Terms of Service and/or the documentation? This concerns GitLab Duo using hosted repo code for training (open source and/or free private and/or paid private), and also hosted repo data being sublicensed to third parties, including but not limited to LLM training purposes.

tmccaslin · March 25, 2024, 7:10pm

Hi folks, GitLab PM for AI/ML here.

We will not use customer data to train models without explicit opt-in consent from customers. With that said, if you have public open source repositories, it is possible for AI vendors unrelated to GitLab to scrape your repository and train with it. However, if your repository is private on GitLab, your data will never be used to train models unless you opt in. You can see this reflected in our recently updated Privacy Policy, in the section titled "Information Processed by AI-Powered Features".

We currently do not have features that use models trained on customer data; however, it is something we are exploring as an opt-in feature only. Additionally, this direction is focused only on customer-specific private models, not shared multi-customer models. GitLab considers your code and repositories your IP.

Today we leverage non-customized foundation models from Google and Anthropic. You can learn about how your data is used and those vendors’ policies in our Duo docs (training data section).