I’ve been following the bearblog developer’s struggle with the current war between bot scrapers and people who are trying to keep the internet safe and human-oriented. What is Lemmy doing about bot scrapers?
Some context from the bearblog dev:
The great scrape
https://herman.bearblog.dev/the-great-scrape/
LLMs feed on data. Vast quantities of text are needed to train these models, which are in turn receiving valuations in the billions. This data is scraped from the broader internet, from blogs, websites, and forums, without the author’s permission and all content being opt-in by default.
Needless to say, this is unethical. But as Meta has proven, it’s much easier to ask for forgiveness than permission. It is unlikely they will be ordered to “un-train” their next generation models due to some copyright complaints.
Aggressive bots ruined my weekend
https://herman.bearblog.dev/agressive-bots/
It’s more dangerous than ever to self-host, since simple mistakes in configurations will likely be found and exploited. In the last 24 hours I’ve blocked close to 2 million malicious requests across several hundred blogs.
What’s wild is that these scrapers rotate through thousands of IP addresses during their scrapes, which leads me to suspect that the requests are being tunnelled through apps on mobile devices, since the ASNs tend to be cellular networks. I’m still speculating here, but I think app developers have found another way to monetise their apps by offering them for free, and selling tunnel access to scrapers
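A minimal sketch of how one might spot the pattern described in the quote above, where many distinct IPs cluster in a handful of ASNs. The record shape, the sample data (documentation IP ranges and private-use ASNs), and the threshold are all hypothetical, and a real setup would also need an IP-to-ASN lookup, which is not shown here.

```python
from collections import defaultdict

def distinct_ips_per_asn(requests):
    """Map each ASN to the set of distinct client IPs seen for it.

    `requests` is an iterable of (ip, asn) pairs; this shape is an assumption.
    """
    ips_by_asn = defaultdict(set)
    for ip, asn in requests:
        ips_by_asn[asn].add(ip)
    return ips_by_asn

def suspicious_asns(requests, min_distinct_ips=3):
    """Return ASNs with at least `min_distinct_ips` distinct IPs (hypothetical threshold)."""
    ips_by_asn = distinct_ips_per_asn(requests)
    return sorted(asn for asn, ips in ips_by_asn.items() if len(ips) >= min_distinct_ips)

# Made-up sample: one ASN rotating through several IPs, one ordinary visitor.
sample_requests = [
    ("203.0.113.1", 64512),
    ("203.0.113.2", 64512),
    ("203.0.113.3", 64512),
    ("203.0.113.1", 64512),  # repeat IP, counted once
    ("198.51.100.7", 64513),
]

print(suspicious_asns(sample_requests))
```

Counting distinct IPs per ASN rather than requests per IP matters here, because a scraper that rotates addresses keeps each individual IP under any per-IP limit.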


It’s a version of the age-old question of how you keep someone from stealing your images while still being able to show them. No one can see an image without having downloaded it already. The best you can do is layer in things like watermarks so that cleaning it into a “pure” version isn’t worth the trouble. Same with text: poison it so it’s less valuable without a lot of extra work.
You can’t poison text in a way that’s meaningful to LLMs without making it indecipherable to humans.
That’s what satire was invented for. /s
I mean, Reddit text is poisoned by virtue of being highly unhinged. It’s probably one of the best reasons not to use AI right now, since its dataset is being formed from literal redditors.
Maybe we just gotta toxify it up here a bit
Doesn’t seem to have negatively impacted AI much.
Even accepting that your premise holds for individual texts, the number of poisoned docs needed seems to stay fairly flat regardless of total training corpus size. So the question is how to sneak that many docs into the corpus.
https://arxiv.org/abs/2510.07192
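A back-of-the-envelope sketch of why a flat count matters: if the number of poisoned docs needed stays roughly constant, the share of the corpus an attacker has to control shrinks as the corpus grows. The count and the corpus sizes below are placeholder assumptions for illustration, not figures from this thread.

```python
# Hypothetical numbers for illustration only.
POISONED_DOCS_NEEDED = 250  # assumed flat count, independent of corpus size

def poisoned_fraction(corpus_size: int, poisoned_docs: int = POISONED_DOCS_NEEDED) -> float:
    """Share of the corpus that the poisoned docs make up."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return poisoned_docs / corpus_size

# The same absolute number of docs is a smaller and smaller slice of a bigger corpus.
for corpus_size in (1_000_000, 100_000_000, 10_000_000_000):
    share = poisoned_fraction(corpus_size)
    print(f"{corpus_size:>14,} docs -> {share:.8%} poisoned")
```

So under this assumption, a bigger corpus doesn’t dilute the attack; it only makes the poisoned slice harder to notice.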
That’s DRM, and it only works if everyone is accessing the information on devices they don’t fully control.