Using SpamCheck

I operate a self-hosted GitLab instance with public signups. As you can imagine, I run into spam problems frequently – including the occasional spam issue being opened. I recently turned on the SpamCheck tool, and I’m looking for more information on how to use it. Here’s what I know so far:

I found this GitLab documentation page, which describes how to enable it. That seems to have worked well. After following the steps, my Admin page has a “Spam Logs” entry. Sounds like it’s working.

A couple of days later, one of my legitimate users got a message – “Your issue has been recognized as spam and has been discarded.” Sure enough, when I go into the Spam Logs, I see several entries from that user with the issue they were trying to create. The available action buttons are “Remove User”, “Block User”, and “Remove Log”. Removing the Log didn’t make the issue appear. I don’t want to remove or block the user (they are known to be legitimate). I expected something like “Not Spam” / “Allow Issue” / “Mark as Ham” / etc.

I found an issue (gitlab-com/gl-security/engineering-and-research/automation-team/spam/spamcheck#190), which doesn’t have a lot of details, but appears to be related to making a “Submit as Ham” button work for SpamCheck. Does this imply there is no way to mark a flagged post as legitimate right now?

Does anybody know how to use this service, or have documentation links for it, or know if it is a young/experimental tool that isn’t ready for real use yet?

I have similar questions (i.e., is it working, where are the docs), but my experience is different. gitlab-ctl status indicates the spamcheck and spam-classifier services are running. In the Admin panel, there’s a section for Spam Logs and it says “There are no Spam Logs”, yet I’ve deleted many spam users (I find them by searching for terms such as video, streaming, football, casino, etc.). They create a personal project, then hundreds or thousands of issues with links.

My current workaround is to manually run searches for a few key terms, and if a spammer is found, I grab the user’s numeric ID and delete them with a token that has API access:

curl -I -H "Content-Type: application/json" -H "PRIVATE-TOKEN: YOUR_TOKEN" -X DELETE "https://gitlab.example.org/api/v4/users/NUMERIC_ID?hard_delete=yes"

Thanks, @eclipsewebmaster, for filing this (and hello, @divido! I now recognize your handle!). I’ll circulate with a few teammates and see what I’m able to dig up for you. Meanwhile, maybe @a.conrad has some advice for these two open source program members?

Update from our end – maybe some of these techniques will be of use to you, @eclipsewebmaster.

I’ve abandoned SpamCheck. On top of only checking issues, the tool was getting a lot of false positives. Way more false positives than true positives. The Spam Log, by the way, is only going to show things that it detected – not things you manually detected and deleted (even if you Report as Abuse first).

I have a combination of approaches. For issues and snippets, we use Akismet. That costs $$, but only a tiny bit. It has the ability to report snippets as spam, and the ability to report issues and snippets as ham. The false positive rate has declined over time – which means it is learning – but it is still non-zero. It’s frustrating to our legitimate users when they are blocked from posting an issue, but that’s the best I can do at the moment. When you do mark an issue as ham in the Spam Logs, it will not automatically post that issue/snippet text. But, at least with Akismet, if the user posts the exact same text back it will work (or at least, it has for me). I’d like a feature in GitLab to allow-list the named developers+maintainers – and don’t even pass the text to Akismet – but I don’t think that exists, and I haven’t posted the feature request to GitLab yet.

Akismet won’t catch everything, so I also periodically review all snippets to see if they look “spam-y”. 99.9% are. When I find such a user, I hard delete them. Note that GitLab will block the user immediately when you hard-delete, then it schedules a background job to truly delete. That’s why there’s a delay between the hard-delete and the actual removal of the content. This gets longer when your server is under more load, as you’d imagine. My legitimate users have too many issues for me to review them all, so I rely on them to report abuse for any Issue that Akismet missed.

For projects and groups, I ended up just disabling creation for new users. The number of personal projects made by spammers was too hard to manage, and they found lots of ways around things. Most just created a blank project with spam in the description, but some would put legitimate OSI licenses on their spam. Others forked existing projects and then put spam in new issues. It got hard to stay on top of, so new users are disabled from creating personal projects & groups.
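
If you want to replicate this, a minimal sketch using python-gitlab against the application settings API might look like the following – the URL and token are placeholders, and an admin token is required:

import gitlab

# connect as an administrator (placeholder URL and token)
gl = gitlab.Gitlab("https://gitlab.example.org", private_token="ADMIN_TOKEN")

settings = gl.settings.get()
settings.default_projects_limit = 0  # new users get no personal-project quota
settings.can_create_group = False    # new users cannot create top-level groups
settings.save()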

For user bios, I scan for keywords – likely the same kind of ones you are using. If a user has an “objectionable” word in their bio, it gets flagged for admin review, then deletion.

For sleeper accounts – spammers that create accounts and then don’t do anything evil for a few months to get account age first – I use some metrics specific to my circumstances. Basically, I’m looking for accounts with unknown / public email domains that haven’t contributed to anything for months.
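
For illustration, here’s a rough Python sketch of both checks above (keyword-in-bio and long-inactive accounts) via python-gitlab – the keyword list and inactivity window are placeholders, and the bio and last_activity_on fields assume you are authenticated as an admin:

import datetime
import re

import gitlab

KEYWORDS = re.compile(r"casino|streaming|football", re.IGNORECASE)  # placeholder terms
STALE_AFTER = datetime.date.today() - datetime.timedelta(days=180)  # placeholder window

gl = gitlab.Gitlab("https://gitlab.example.org", private_token="ADMIN_TOKEN")

for user in gl.users.list(iterator=True):
    # flag objectionable bios for human review - nothing is deleted automatically
    bio = getattr(user, "bio", "") or ""
    if KEYWORDS.search(bio):
        print(f"bio keyword hit, review: {user.id} {user.username}")

    # flag possible sleeper accounts: no recorded activity for months
    last_active = getattr(user, "last_activity_on", None)
    if last_active and datetime.date.fromisoformat(last_active) < STALE_AFTER:
        print(f"long inactive, review: {user.id} {user.username}")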

You’re welcome to peruse the scripts I use for these reviews: scan-snippets.js, scan-users-for-spammers.js, delete-objectionable-users.js, delete-unconfirmed-users.js. These are all TaskLemon scripts, which may be of limited use to you directly; but the concepts may be useful to port to your favored scripting language. Some of them can be put into CI pipelines – but a recurring theme in my particular scripts is that nobody gets hard-deleted by a computer, there’s always a human in the loop.

Word of warning on the Spam Logs, too – the process of deleting a user there is mind-numbing. And it’s a hard delete. If you get in the habit of using that to delete spammers, you’ll need to click Delete a bunch of times per day, and don’t ever accidentally click on one of your real contributors. If you do, it will pull all their posted content. To avoid that, I now parse the Spam Logs myself – run gitlab-rails runner "puts SpamLog.all.to_json" on your GitLab server, and that outputs the data in JSON. Then, use your own scripts to parse it and prevent yourself from deleting one of your real contributors by mistake.
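
As a sketch of that last step, the script below reads the JSON dump and refuses to nominate anyone on a hand-maintained allow-list of real contributors. The user_id and title fields are what the dump contains on my instance; adjust to whatever yours emits:

import json
import sys

# numeric IDs of known-legitimate contributors who must never be deleted
KNOWN_CONTRIBUTORS = {42, 1337}  # placeholder IDs

# usage: gitlab-rails runner "puts SpamLog.all.to_json" > spam_logs.json
#        python triage_spam_logs.py spam_logs.json
with open(sys.argv[1]) as f:
    logs = json.load(f)

for entry in logs:
    user_id = entry.get("user_id")
    title = entry.get("title")
    if user_id in KNOWN_CONTRIBUTORS:
        print(f"SKIPPING known contributor {user_id}: {title}")
        continue
    print(f"candidate for manual review: user {user_id} - {title}")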

Sorry for the massive stream-of-consciousness post – but even so, there are a lot more details on managing spam. If you want to know more about any of these processes, see them running in action, start using them on your servers, or just talk about the topic in general, let me know. I’m also keenly interested in any new techniques that you discover.


Thanks @bbehr for raising this with the team :+1:

@divido, @eclipsewebmaster I’ll address some of your comments in this thread; others I’ll move to an issue I’m preparing so we can act on some of the items resulting from this conversation.

@divido, regarding your comment about issues and other spammables being discarded once they have been identified as spam: SpamCheck works with four different thresholds for the BLOCK, DISALLOW, CONDITIONAL_ALLOW and ALLOW actions. These can be tweaked in the service itself, and GitLab will allow the user to “rescue” their submission by solving a reCAPTCHA.

_inference_scores = {
    0.9: SpamVerdict.BLOCK,
    0.5: SpamVerdict.DISALLOW,
    0.4: SpamVerdict.CONDITIONAL_ALLOW,
    0.0: SpamVerdict.ALLOW,
}
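
Conceptually, the verdict applied is the one attached to the highest threshold the classifier’s score reaches. As an illustrative sketch only – this is not SpamCheck’s actual code:

# illustrative re-implementation of the threshold table above
_inference_scores = {
    0.9: "BLOCK",
    0.5: "DISALLOW",
    0.4: "CONDITIONAL_ALLOW",
    0.0: "ALLOW",
}

def verdict_for(score: float) -> str:
    # walk thresholds from highest to lowest; the first one the score meets wins
    for threshold in sorted(_inference_scores, reverse=True):
        if score >= threshold:
            return _inference_scores[threshold]
    return "ALLOW"

print(verdict_for(0.45))  # CONDITIONAL_ALLOW - the user may rescue via reCAPTCHA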

You can find out more about the evolution of these features and the extensive work needed to support them in Stop overriding spamcheck verdicts !=ALLOW to allow rescuing via reCAPTCHA. (!71496) · Merge requests · GitLab.org / GitLab · GitLab

Regarding issue #190 (now found here, as the project has been moved): yes, the idea behind this feature would be for SpamCheck to receive, process and incorporate items submitted as ham into its training, as Akismet does. It hasn’t been a priority until now, as we train SpamCheck internally and work closely with Trust and Safety to do so. However, work is underway to further improve the abuse and flagging capabilities of GitLab, and it’s likely that this feature will be part of those efforts (see more in this spike epic: User Abuse Framework (Proof of Concept) (&9029) · Epics · GitLab.org · GitLab).

Does anybody know how to use this service, or have documentation links for it, or know if it is a young/experimental tool that isn’t ready for real use yet?

At the time of your comment, SpamCheck was very much an experimental and internal effort, made public in the spirit of transparency but without the intent to support it or make it generally available or easily customizable for the community. As the project evolves this will surely change, and your feedback as well as your involvement is very much encouraged. Feel free to engage with us in our issue trackers :+1:


I have similar questions (i.e., is it working, where are the docs),

@eclipsewebmaster, the docs for SpamCheck are in the GitLab administration documentation.

but my experience is different. gitlab-ctl status indicates the spamcheck and spam-classifier services are running. In the Admin panel, there’s a section for Spam Logs and it says “There are no Spam Logs”, yet I’ve deleted many spam users (I find them by searching for terms such as video, streaming, football, casino, etc.). They create a personal project, then hundreds or thousands of issues with links.

Feel free to reach out in our issue tracker with your use-case and how you intend to leverage SpamCheck: Issues · GitLab.org / gl-security / Security Engineering / Security Automation / Spam / SpamCheck · GitLab

My current workaround is to manually run searches for a few key terms, and if a spammer is found, I grab the user’s numeric ID and delete them with a token that has API access

If this is an approach that works for you, I’d suggest relying on the python-gitlab package to more reliably automate diverse tasks such as the one you mention: python-gitlab v4.1.1
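
As a rough sketch of what that might look like – the search terms and the numeric ID are placeholders, and the hard-delete is deliberately left commented out so a human stays in the loop:

import gitlab

gl = gitlab.Gitlab("https://gitlab.example.org", private_token="YOUR_TOKEN")

# surface accounts matching the same keywords you already search for manually
for term in ("casino", "streaming", "football"):
    for user in gl.users.list(search=term, iterator=True):
        print(f"{term}: {user.id} {user.username} ({user.name})")

# after reviewing the output, hard-delete a confirmed spammer by numeric ID:
# gl.users.delete(NUMERIC_ID, hard_delete=True)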

Now that you mention this, and based on our internal work, there might be a case for adding the ability to DISALLOW (as per the definition in app/spammable/__init__.py · main · GitLab.org / gl-security / Security Engineering / Security Automation / Spam / SpamCheck · GitLab) certain keywords chosen by the administrator. I’ll do some thinking on this and raise it with others to see what they think.

@divido

On top of only checking issues, the tool was getting a lot of false positives.

We’ve been making progress on this front, but there’s a balance between transparency and confidentiality that we have to keep in order to deliver on our task. We’ll have updates in this regard, so I encourage you to engage with us or keep an eye on GitLab.com / GitLab Security Department / Security Engineering Sub-department / Security Automation Team (SecAuto) / spam / SpamCheck · GitLab

Way more false positives than true positives. The Spam Log, by the way, is only going to show things that it detected – not things you manually detected and deleted (even if you Report as Abuse first).

This is a valid concern. Can you open an issue under https://gitlab.com/gitlab-org/gitlab/-/issues and tag me (@jdsalaro)?

For issues and snippets, we use Akismet. That costs $$, but only a tiny bit. It has the ability to report snippets as spam, and the ability to report issues and snippets as ham. The false positive rate has declined over time – which means it is learning – but it is still non-zero.

Akismet continues to be a valid low-effort, low-cost and potentially high-value solution for GitLab administrators. We don’t foresee deprecating its support, and we encourage you to continue using it, whether in tandem with SpamCheck or on its own.

It’s frustrating to our legitimate users when they are blocked from posting an issue, but that’s the best I can do at the moment.

When you do mark an issue as ham in the Spam Logs, it will not automatically post that issue/snippet text. But, at least with Akismet, if the user posts the exact same text back it will work (or at least, it has for me).

Why is this the case? Is reCAPTCHA enabled on your instance? We’ve worked hard to make it possible for legitimate users to rescue their submissions by solving a reCAPTCHA once either Akismet or SpamCheck generates a CONDITIONAL_ALLOW verdict.

I’d like a feature in GitLab to allow-list the named developers+maintainers – and don’t even pass the text to Akismet – but I don’t think that exists, and I haven’t posted the feature request to GitLab yet.

We’ve performed spikes where we explored this. Unfortunately, and as you’ve pointed out, abusers will create namespaces and repositories where they themselves are developers or maintainers. SpamCheck does support namespace, repository and domain allow-listing: config/config.example.yml · main · GitLab.org / gl-security / Security Engineering / Security Automation / Spam / SpamCheck · GitLab. This was also the case before our rewrite from Go to Python.

Akismet won’t catch everything, so I also periodically review all snippets to see if they look “spam-y”. 99.9% are.

We also observed this behavior and rely on other internal tooling, including SpamCheck and Akismet, to mitigate spam effectively. Please note that, as every organization has different use-cases, needs and user on-boarding workflows, addressing all types of abuse in a centralized manner from within the product is not straightforward.

When I find such a user, I hard delete them. Note that GitLab will block the user immediately when you hard-delete, then it schedules a background job to truly delete. That’s why there’s a delay between the hard-delete and the actual removal of the content. This gets longer when your server is under more load, as you’d imagine. My legitimate users have too many issues for me to review them all, so I rely on them to report abuse for any Issue that Akismet missed.

It’s worth noting that GitLab now supports not only deleting but also blocking and banning accounts, which vary in the degree to which content is retained and serve different use-cases. You can read more about approaches to user moderation in GitLab’s user moderation documentation.

For projects and groups, I ended up just disabling creation for new users. The number of personal projects made by spammers was too hard to manage, and they found lots of ways around things. Most just created a blank project with spam in the description, but some would put legitimate OSI licenses on their spam. Others forked existing projects and then put spam in new issues. It got hard to stay on top of, so new users are disabled from creating personal projects & groups.

This is a sound approach; a partially self-serve on-boarding process is likely the best option when the resources dedicated to moderating the platform are limited.

For user bios, I scan for keywords – likely the same kind of ones you are using. If a user has an “objectionable” word in their bio, it gets flagged for admin review, then deletion.

Have you considered enabling admin approval for new accounts? Depending on the volume of sign-ups this might be impractical, but it’s worth mentioning.
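
If you want to try it, a minimal sketch with python-gitlab – the attribute name comes from the application settings API, and an admin token is assumed:

import gitlab

gl = gitlab.Gitlab("https://gitlab.example.org", private_token="ADMIN_TOKEN")

settings = gl.settings.get()
# new sign-ups remain pending until an administrator approves them
settings.require_admin_approval_after_user_signup = True
settings.save()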

For sleeper accounts – spammers that create accounts and then don’t do anything evil for a few months to get account age first – I use some metrics specific to my circumstances. Basically, I’m looking for accounts with unknown / public email domains that haven’t contributed to anything for months.

I can see the case for implementing an admin configuration flag which forces re-verification of the account (either the e-mail or the admin’s approval) after a certain period of inactivity.

You’re welcome to peruse the scripts I use for these reviews: scan-snippets.js, scan-users-for-spammers.js, delete-objectionable-users.js, delete-unconfirmed-users.js. These are all TaskLemon scripts, which may be of limited use to you directly; but the concepts may be useful to port to your favored scripting language. Some of them can be put into CI pipelines – but a recurring theme in my particular scripts is that nobody gets hard-deleted by a computer, there’s always a human in the loop.

I’ll share these internally to see whether we can derive useful insights and workflows which should eventually make it into GitLab.

Word of warning on the Spam Logs, too – the process of deleting a user there is mind-numbing. And it’s a hard delete. If you get in the habit of using that to delete spammers, you’ll need to click Delete a bunch of times per day, and don’t ever accidentally click on one of your real contributors. If you do, it will pull all their posted content. To avoid that, I now parse the Spam Logs myself – run gitlab-rails runner "puts SpamLog.all.to_json" on your GitLab server, and that outputs the data in JSON. Then, use your own scripts to parse it and prevent yourself from deleting one of your real contributors by mistake.

As mentioned above, GitLab supports deleting, deactivating, blocking and banning accounts, which vary in the degree to which content is retained and serve different use-cases; see GitLab’s user moderation documentation.

Sorry for the massive stream-of-consciousness post – but even so, there are a lot more details on managing spam. If you want to know more about any of these processes, see them running in action, start using them on your servers, or just talk about the topic in general, let me know. I’m also keenly interested in any new techniques that you discover.

GitLab’s Trust and Safety handbook, including https://about.gitlab.com/handbook/security/security-operations/trustandsafety/diy.html, might be relevant to your use-cases.

CONTINUED :point_right: Consolidate SpamCheck Documentation, Links, Messaging and Other Resources (#20) · Issues · GitLab.org / gl-security / Security Engineering / Security Automation / Spam / SpamCheck · GitLab


Thanks for those responses. That’s some good information to have.

Why is this the case? Is reCAPTCHA enabled on your instance? We’ve worked hard to make it possible for legitimate users to rescue their submissions by solving a reCAPTCHA once either Akismet or SpamCheck generates a CONDITIONAL_ALLOW verdict.

No, it wasn’t. It is now, and I’ll experiment again.

We had used the reCAPTCHA approach before to try to stop spammers from creating accounts. But after a few weeks, the number of new accounts per day didn’t change at all. I turned it off when one of my real humans was unable to pass the test and create an account.

I hadn’t put together that this same approach was used to “save” a post falsely marked as spam. Since most real humans can pass those, that may be the ticket to removing their frustration. The CAPTCHA only shows for some spam items, right? Is there a way to tell from the spam logs or similar whether or not the CAPTCHA was presented? I’m curious if any spammers actually fail these, or if I’m effectively turning off spam checking by allowing CAPTCHAs.

SpamCheck does support namespace, repository and domain allow-listing

That’s interesting. That’s SpamCheck only, though, right? Any way to do this with Akismet?

In my instance, I found a config.toml that has an empty allowList. What’s the format of those entries? $GROUP_ID = "$GROUP_NAME"?

Thanks for all the responses, folks. In the end, I’ve disabled SpamCheck as well, as it was doing more harm than good. For now, it takes me exactly one minute per day to clean up the spam accounts based on a few keywords, so I will likely implement some form of automation based on the suggestions from @divido and @jdsalaro.

We had used the reCAPTCHA approach before to try to stop spammers from creating accounts. But after a few weeks, the number of new accounts per day didn’t change at all. I turned it off when one of my real humans was unable to pass the test and create an account.

In general, we’d suggest keeping reCAPTCHA enabled, as not every spam actor is capable enough or willing to automate reCAPTCHA solving.

I hadn’t put together that this same approach was used to “save” a post falsely marked as spam. Since most real humans can pass those, that may be the ticket to removing their frustration. The CAPTCHA only shows for some spam items, right?

reCAPTCHA is shown for every spammable object for which a CONDITIONAL_ALLOW verdict was issued.

Is there a way to tell from the spam logs or similar whether or not the CAPTCHA was presented?

There’s active work going on to develop new anti-abuse capabilities within GitLab, and further features in this regard have been discussed.

At the moment, however, it’s already possible to check whether a request that triggered the creation of a spam log was verified using reCAPTCHA by going to https://gitlab.example.com/admin/spam_logs and looking at the “Recaptcha Verified?” column.

I’m curious if any spammers actually fail these, or if I’m effectively turning off spam checking by allowing CAPTCHAs.

That’s interesting. That’s SpamCheck only, though, right? Any way to do this with Akismet?

Correct, this is only SpamCheck.

In my instance, I found a config.toml that has an empty allowList. What’s the format of those entries? $GROUP_ID = "$GROUP_NAME"?

SpamCheck’s previous version used a TOML file for its allow-list:

[allowList]
# numeric group/project ID = "namespace/path"
000001 = "mygroup/myproject"

SpamCheck’s current version uses a YAML file for its allow-list: config/config.example.yml · main · GitLab.org / gl-security / Security Engineering / Security Automation / Spam / SpamCheck · GitLab
