I have similar questions (ie, is it working, where are the docs),
@eclipsewebmaster , the docs for SpamCheck are under Redirecting...
but my experience is different. gitlab-ctl status indicates spamcheck and spam-classifier services are running. In the Admin panel, there’s a section for Spam Logs and it says “There are no Spam Logs” yet I’ve deleted many spam users (I find them searching for terms, such as video, streaming, football, casino, etc). They create a personal project then hundreds/thousands of issues with links.
Feel free to reach out in our issue tracker with your use-case and how you would intend to leverage SpamCheck Issues · GitLab.org / gl-security / Security Engineering / Security Automation / Spam / SpamCheck · GitLab
My current workaround is to manually run searches for a few key terms, and if a spammer is found, I grab the user’s numeric ID and with a token that has API access
If this is an approach that works for you, I’d suggest relying on the python-gitlab package to more reliably automate diverse tasks such as the one you mention: python-gitlab v4.1.1
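As a rough illustration of that suggestion, here is a minimal sketch of the search-then-review workflow using python-gitlab. It assumes that the Users API `search` parameter (which matches name, username and public email) is close enough to the searches described above, which may not be the case. The function names, the environment variables and the choice to block rather than delete are all my own assumptions, not something from the original workflow.

```python
from typing import Callable, Iterable, List


def find_candidates(gl, terms: Iterable[str]) -> List[object]:
    """Collect users matching any search term, de-duplicated by numeric ID."""
    seen = {}
    for term in terms:
        # users.list(search=...) maps to GET /users?search=<term>
        for user in gl.users.list(search=term, iterator=True):
            seen.setdefault(user.id, user)
    return list(seen.values())


def block_confirmed(users, confirm: Callable[[object], bool]) -> List[int]:
    """Block only the users a human reviewer confirms; return their IDs."""
    blocked = []
    for user in users:
        if confirm(user):
            user.block()  # blocking is reversible, unlike a hard delete
            blocked.append(user.id)
    return blocked


def main() -> None:
    # Hypothetical entry point: GITLAB_URL and GITLAB_TOKEN are assumed names.
    import os

    import gitlab  # pip install python-gitlab

    gl = gitlab.Gitlab(os.environ["GITLAB_URL"],
                       private_token=os.environ["GITLAB_TOKEN"])
    suspects = find_candidates(gl, ["casino", "streaming", "football"])
    block_confirmed(
        suspects,
        lambda u: input(f"Block {u.username} (id {u.id})? [y/N] ").lower() == "y",
    )
```

Calling `main()` with an admin token walks through each match and asks before blocking, so a person stays in the loop for every account.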
Now that you mention this, and based on our work internally, there might be a case for adding the ability to DISALLOW (as per the definition in app/spammable/__init__.py · main · GitLab.org / gl-security / Security Engineering / Security Automation / Spam / SpamCheck · GitLab) certain keywords chosen by the administrator. I’ll do some thinking on this and raise it with others to see what they think.
@divido
On top of only checking issues, the tool was getting a lot of false positives.
We’ve been making progress on this front, but there’s a balance between transparency and confidentiality we are forced to keep in order to be able to deliver on our task. We’ll have updates in this regard, so I encourage you to engage with us or keep an eye on GitLab.com / GitLab Security Department / Security Engineering Sub-department / Security Automation Team (SecAuto) / spam / SpamCheck · GitLab
Way more false positives than true positives. The Spam Log, by the way, is only going to show things that it detected – not things you manually detected and deleted (even if you Report as Abuse first).
This is a valid concern. Can you open an issue under https://gitlab.com/gitlab-org/gitlab/-/issues and tag me (@jdsalaro)?
For issues and snippets, we use Akismet. That costs $$, but only a tiny bit. It has the ability to report snippets as spam, and the ability to report issues and snippets as ham. The false positive rate has declined over time – which means it is learning – but it is still non-zero.
Akismet continues to be a valid low-effort, low-cost and potentially high-value solution for GitLab administrators. We don’t foresee deprecating its support and encourage you to continue to use it, whether in tandem with SpamCheck or on its own.
It’s frustrating to our legitimate users when they are blocked from posting an issue, but that’s the best I can do at the moment.
When you do mark an issue as ham in the Spam Logs, it will not automatically post that issue/snippet text. But, at least with Akismet, if the user posts the exact same text back it will work (or at least, it has for me).
Why is this the case? Is reCAPTCHA enabled on your instance? We’ve worked hard to make it possible for legitimate users to rescue their submissions by solving a reCAPTCHA once either Akismet or SpamCheck generates a CONDITIONAL_ALLOW verdict.
I’d like a feature in GitLab to allow-list the named developers+maintainers – and don’t even pass the text to Akismet – but I don’t think that exists, and I haven’t posted the feature request to GitLab, yet.
We’ve performed spikes where we explored this. Unfortunately, and as you’ve pointed out, abusers will create namespaces and repositories where they themselves are developers or maintainers. SpamCheck does support namespace, repository and domain allow-listing, see config/config.example.yml · main · GitLab.org / gl-security / Security Engineering / Security Automation / Spam / SpamCheck · GitLab . This was also the case before our rewrite from Go to Python.
Akismet won’t catch everything, so I also periodically review all snippets to see if they look “spam-y”. 99.9% are.
We also observed this behavior and rely on other internal tooling, including SpamCheck and Akismet, to mitigate spam effectively. Please note that, as every organization has different use-cases, needs and user on-boarding workflows, addressing all types of abuse in a centralized manner from within the product is not straightforward.
When I find such a user, I hard delete them. Note that GitLab will block the user immediately when you hard-delete, then it schedules a background job to truly delete. That’s why there’s a delay between the hard-delete and the actual removal of the content. This gets longer when your server is under more load, as you’d imagine. My legitimate users have too many issues for me to review them all, so I rely on them to report abuse for any Issue that Akismet missed.
It’s worth noting that GitLab now supports not only deletion but also blocking and banning accounts, which vary in the degree to which content is maintained and serve different use-cases. You can read more about approaches to user moderation under Redirecting...
For projects and groups, I ended up just disabling this for new users. The number of personal projects made by spammers was too hard to manage, and they found lots of ways around things. Most just created a blank project with spam in the description, but some would put legitimate OSI licenses on their spam. Others forked existing projects and then put spam in new issues. It got hard to stay on top of, so new users are disabled from creating personal projects & groups.
This is a sound approach; a partially self-serve on-boarding process is likely the best option when the resources dedicated to moderating the platform are limited.
For user bios, I scan for keywords – likely the same kind of ones you are using. If a user has an “objectionable” word in their bio, it gets flagged for admin review, then deletion.
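For anyone wanting to try the bio scan described above, a minimal sketch might look like the following. It assumes user records shaped like the GitLab Users API response (`id`, `username`, `bio`). The term list and function names are purely illustrative. It only flags accounts for review and deletes nothing.

```python
import re
from typing import Dict, Iterable, List


def compile_terms(terms: Iterable[str]) -> "re.Pattern[str]":
    """Build one case-insensitive, whole-word pattern from the term list."""
    alternatives = "|".join(re.escape(t) for t in terms)
    return re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)


def flag_bios(users: Iterable[Dict], pattern: "re.Pattern[str]") -> List[Dict]:
    """Return the users whose bio contains at least one listed term."""
    flagged = []
    for user in users:
        bio = user.get("bio") or ""  # bio may be missing or null
        hits = sorted({m.group(1).lower() for m in pattern.finditer(bio)})
        if hits:
            flagged.append(
                {"id": user["id"], "username": user["username"], "hits": hits}
            )
    return flagged
```

Whole-word matching keeps a term like "video" from flagging "videography", at the cost of missing deliberately obfuscated spellings.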
Have you considered enabling admin approval for new accounts? Depending on the volume of sign-ups this might be impractical, but it’s worth mentioning.
For sleeper accounts – spammers that create accounts and then don’t do anything evil for a few months to get account age first – I use some metrics specific to my circumstances. Basically, I’m looking for accounts with unknown / public email domains that haven’t contributed to anything for months.
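The exact metrics are specific to that instance, but the general idea can be sketched roughly as below. This assumes user records with `email`, `created_at` and `last_activity_on` fields, as returned by the Users API to administrators. Note that `last_activity_on` is only a stand-in for "has contributed", and the trusted-domain set and 90-day threshold are hypothetical values.

```python
from datetime import date, timedelta
from typing import Dict, Set


def is_sleeper(user: Dict, today: date, trusted_domains: Set[str],
               idle_days: int = 90) -> bool:
    """Flag accounts on untrusted email domains with no recent activity."""
    domain = user["email"].rsplit("@", 1)[-1].lower()
    if domain in trusted_domains:
        return False
    last = user.get("last_activity_on")  # "YYYY-MM-DD" or None
    if last:
        last_seen = date.fromisoformat(last)
    else:
        # No recorded activity: fall back to the date part of created_at.
        last_seen = date.fromisoformat(user["created_at"][:10])
    return (today - last_seen) >= timedelta(days=idle_days)
```

Anything this returns `True` for is a candidate for manual review only, since plenty of legitimate users sign up with a public email address and then go quiet.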
I can see the case for implementing an admin configuration flag which forces re-verification of the account (either the e-mail or the admin’s approval) after a certain period of inactivity.
You’re welcome to peruse the scripts I use for these reviews: scan-snippets.js, scan-users-for-spammers.js, delete-objectionable-users.js, delete-unconfirmed-users.js. These are all TaskLemon scripts, which may be of limited use to you directly; but the concepts may be useful to port to your favored scripting language. Some of them can be put into CI pipelines – but a recurring theme in my particular scripts is that nobody gets hard-deleted by a computer, there’s always a human in the loop.
I’ll share these internally to see whether we can derive useful insights and workflows which should eventually make it into GitLab.
Word of warning on the Spam Logs, too – the process of deleting a user there is mind-numbing. And it’s a hard delete. If you get in the habit of using that to delete spammers, you’ll need to click Delete a bunch of times per day, and don’t ever accidentally click on one of your real contributors. If you do, it will pull all their posted content. To avoid that, I now parse the Spam Logs myself – run gitlab-rails runner "puts SpamLog.all.to_json" on your GitLab server, and that outputs the data in JSON. Then, use your own scripts to parse it and prevent yourself from deleting one of your real contributors by mistake.
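A minimal sketch of that parsing step, assuming the JSON is an array of objects that each carry a `user_id` field (which should match the SpamLog model, but is worth checking against your own output). The protected-ID set and the function name are hypothetical.

```python
import json
from collections import Counter
from typing import List, Set, Tuple


def deletion_candidates(spam_log_json: str, protected_user_ids: Set[int],
                        min_entries: int = 1) -> List[Tuple[int, int]]:
    """Return (user_id, entry_count) pairs, skipping protected contributors."""
    entries = json.loads(spam_log_json)
    counts = Counter(
        e["user_id"] for e in entries if e.get("user_id") is not None
    )
    candidates = [
        (uid, n) for uid, n in counts.items()
        if uid not in protected_user_ids and n >= min_entries
    ]
    # Most-flagged users first; ties broken by user ID for stable output.
    return sorted(candidates, key=lambda pair: (-pair[1], pair[0]))
```

You could redirect the runner output to a file, read it in, and pass the numeric IDs of your known contributors as `protected_user_ids`, so they never show up in the list you act on.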
As mentioned above, GitLab supports deletion, de-activation, blocking and banning accounts, which vary in the degree to which content is maintained and serve different use-cases: Redirecting...
Sorry for the massive stream-of-consciousness post – but even so there are a lot more details on managing spam. If you want to know more about any of these processes, see them running in action, start using them on your servers, or just talk about the topic in general, let me know. I’m also keenly interested in any new techniques that you discover.
Redirecting... and https://about.gitlab.com/handbook/security/security-operations/trustandsafety/diy.html might be relevant to your use-cases.
CONTINUED Consolidate SpamCheck Documentation, Links, Messaging and Other Resources (#20) · Issues · GitLab.org / gl-security / Security Engineering / Security Automation / Spam / SpamCheck · GitLab