I’ve been following the Bearblog developer’s struggle with the ongoing war between bot scrapers and people trying to keep a safe, human-oriented internet. What is Lemmy doing about bot scrapers?

Some context from the Bearblog dev:

The great scrape

https://herman.bearblog.dev/the-great-scrape/

LLMs feed on data. Vast quantities of text are needed to train these models, which are in turn receiving valuations in the billions. This data is scraped from the broader internet, from blogs, websites, and forums, without the authors’ permission, with all content being opt-in by default.

Needless to say, this is unethical. But as Meta has proven, it’s much easier to ask for forgiveness than permission. It is unlikely they will be ordered to “un-train” their next generation models due to some copyright complaints.

Aggressive bots ruined my weekend

https://herman.bearblog.dev/agressive-bots/

It’s more dangerous than ever to self-host, since simple mistakes in configurations will likely be found and exploited. In the last 24 hours I’ve blocked close to 2 million malicious requests across several hundred blogs.

What’s wild is that these scrapers rotate through thousands of IP addresses during their scrapes, which leads me to suspect that the requests are being tunnelled through apps on mobile devices, since the ASNs tend to be cellular networks. I’m still speculating here, but I think app developers have found another way to monetise their apps by offering them for free and selling tunnel access to scrapers.

  • mesa@piefed.social · +4 · 8 hours ago

    I have a Python script that works with fail2ban to block anyone who clicks a certain link 3 or more times. The link literally says “don’t click this unless you are a bot”, and offenders get timed out for a day. It’s worked well on simple sites.
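
    A rough sketch of the idea (the Flask route and log path here are hypothetical; fail2ban watches the log and does the actual banning):

        # Hypothetical honeypot: humans never click this, bots that ignore
        # robots.txt and "don't click" warnings do. fail2ban tails trap.log
        # and bans an IP after 3 hits (the jail/filter config is not shown).
        from flask import Flask, request
        import logging

        app = Flask(__name__)
        logging.basicConfig(filename="trap.log", format="%(asctime)s %(message)s")

        @app.route("/do-not-click")
        def trap():
            # Log the offending IP in a format a fail2ban filter regex can match.
            logging.warning("HONEYPOT hit from %s", request.remote_addr)
            return "Don't click this unless you are a bot.", 403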

  • irelephant [he/him]@lemmy.dbzer0.com · +15 · 14 hours ago

    With ActivityPub, all the posts are easy to scrape (just add an extra header: Accept: application/activity+json), but most scrapers won’t bother to do that, and scrape the frontend of instances instead.
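
    Roughly like this, for example (a sketch using Python requests; the post URL and printed field names are illustrative):

        # Fetch a post as ActivityPub JSON instead of scraping the HTML frontend.
        import requests

        resp = requests.get(
            "https://lemmy.dbzer0.com/post/12345",            # hypothetical post URL
            headers={"Accept": "application/activity+json"},
            timeout=10,
        )
        page = resp.json()
        # Field names assumed from ActivityPub Page objects.
        print(page.get("type"), page.get("content", "")[:200])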

    A lot of instances have deployed Anubis or Cloudflare to block scrapers. My instance has iocaine set up, IIRC.

  • Phoenixz@lemmy.ca · +4 · 11 hours ago

    Scrapers like these usually use proxy providers like Storm Proxies to appear to come from hundreds of thousands of different IP addresses, which makes them enormously difficult to block.

    • Tollana1234567@lemmy.today · +1 · edited · 2 hours ago

      Reddit flags the most-used datacenter proxies, so people get blocked that way (spammers, bot accounts). The more expensive mobile proxies use other evasion methods, and they aren’t the issue; it’s the ones using cheap methods who don’t care about their accounts getting banned. Maybe Lemmy could block the most common datacenter proxy ranges.
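
      A crude version of that might look something like this (a sketch; the CIDR ranges below are placeholders, and a real deployment would pull them from ASN/GeoIP databases):

          # Reject or challenge requests whose IP falls inside a known
          # hosting-provider range. The two ranges here are placeholders.
          import ipaddress

          DATACENTER_RANGES = [
              ipaddress.ip_network("203.0.113.0/24"),   # placeholder (TEST-NET-3)
              ipaddress.ip_network("198.51.100.0/24"),  # placeholder (TEST-NET-2)
          ]

          def is_datacenter_ip(addr: str) -> bool:
              ip = ipaddress.ip_address(addr)
              return any(ip in net for net in DATACENTER_RANGES)

          print(is_datacenter_ip("203.0.113.7"))  # True -> block or challenge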

  • unalivejoy@lemmy.zip · +22 · edited · 17 hours ago

    I’m reminded of the joke where someone explains their plan to rob a bank, and is then told that’s called having a job.

    Anyway, the best way to scrape Lemmy is to launch your own instance, and the other instances will just send you all the posts.

  • Zak@lemmy.world · +36 · 20 hours ago

    If you’re concerned about bots ingesting the content, that’s impossible to prevent in an open federated system.

    • Krudler@lemmy.world · +1 · 7 hours ago

      I’m not entirely sure that’s the concern. I think the writer is describing such an obscene influx of bot traffic that it must be a nightmare to maintain and pay for.

    • radix@lemmy.world · +23 · 17 hours ago

      It’s weird that this has become such a controversial opinion. The internet is supposed to be open and available. “Information wants to be free.” It’s the big gatekeepers who want to keep all their precious data locked away in their own hoard behind paywalls and logins.

      If some clanker is going to read my words, it’s a very small price to pay for people being able to do the same.

      • 1984@lemmy.today · +1 · edited · 2 hours ago

        It was open and free until big tech stole the software, packaged it as their own services under a different name, and made billions from it.

        Now they are scraping all content on the web to fuel another round of billions from AI.

        We are seeing the web die: bots produce most of the content, and people will eventually stop using it, just like cable TV.

        It was a nice run though. I really liked growing up with the web and computers. But the end result is Enshittification. :)

      • FaceDeer@fedia.io · +15 / -2 · 17 hours ago

        It’s a classic case of people being all for freedom until all of a sudden they think it negatively impacts them personally in some vague abstract way.

        An AI training off of my words costs me nothing. It doesn’t harm me at all. Frankly, I like the notion that future AIs are in some small part aligned based off of my views as expressed through my writing.

    • Rhaedas@fedia.io · +5 / -1 · 19 hours ago

      It’s a version of the age-old question of how you keep someone from stealing your images while still being able to show them. No one can see an image without having already downloaded it. The best you can do is layer in things like watermarks so that cleaning it into a “pure” version isn’t worth the trouble. Same with text: poison it so it’s less valuable without a lot of extra work.

      • Zak@lemmy.world · +3 · 15 hours ago

        That’s DRM, and it only works if everyone is accessing the information on devices they don’t fully control.

        • sad_detective_man@sopuli.xyz · +2 · 19 hours ago

          I mean, Reddit text is poisoned by virtue of being highly unhinged. It’s probably one of the best reasons not to use AI right now, since its dataset is being formed from literal redditors.

          Maybe we just gotta toxify it up here a bit

        • bcovertigo@lemmy.world · +2 / -1 · 18 hours ago

          Accepting that your premise is true for individual texts, there seems to be a fairly flat number of poisoned docs needed regardless of total training corpus size. So the question is how to sneak that many docs into the corpus.

          https://arxiv.org/abs/2510.07192

  • ℕ𝕖𝕞𝕠@slrpnk.net · +36 · 20 hours ago

    My primary instance, slrpnk.net, has Anubis set up. I’m not quite sure how it works, but it seems to force some kind of delay that is hardly noticeable to human users but times out automatic requests.

    • henfredemars@infosec.pub · +27 · 20 hours ago

      It works by asking your system for a small computation before handling the request. It’s not too intrusive for normal users, but it drives up the costs for bot farms.
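
      The rough shape of the proof-of-work idea, as a sketch (this is not Anubis’s actual scheme, just an illustration of why one request is cheap while millions are not):

          # Toy proof-of-work: the client searches for a nonce whose hash has a
          # required prefix; the server verifies with a single hash. Anubis does
          # something in this spirit in the browser via JavaScript.
          import hashlib

          def solve(challenge: str, difficulty: int = 4) -> int:
              nonce = 0
              while not hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest().startswith("0" * difficulty):
                  nonce += 1
              return nonce

          def verify(challenge: str, nonce: int, difficulty: int = 4) -> bool:
              return hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest().startswith("0" * difficulty)

          nonce = solve("server-issued-challenge")
          print(verify("server-issued-challenge", nonce))  # True; the cost is per request, which is what hurts scrapers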

      • ewigkaiwelo@lemmy.world · +13 / -4 · 19 hours ago

        So it doesn’t stop LLMs from data farming, it just makes them spend more energy doing so? If that’s the case, it sounds like it’s making things even worse.

        • AMoistGrandpa@lemmy.ca · +15 · edited · 19 hours ago

          As I understand it, Anubis doesn’t make the user do anything. Instead, it runs some JavaScript in the client’s browser that does the calculations, and then sends the result back to the server. In order for an LLM to get through Anubis, the LLM would need to be running a real JavaScript engine (since the requested calculation is too complicated for an LLM to do natively), and that would be prohibitively expensive for bot farms at any real scale.

          Since all real people accessing the site will be doing so through a browser, which has JavaScript built in, and most bots will just download the website and send the source code right into the LLM without being able to execute it, real people will be able to get through Anubis while bots won’t. The total amount of extra energy consumed by adding Anubis isn’t actually that high, since bot farms aren’t doing the extra work.

          Take that all with a grain of salt; that info is based on a blog post which I read like 6 months ago, and I may be remembering incorrectly.

          • henfredemars@infosec.pub · +9 · edited · 19 hours ago

            Your understanding is consistent with mine. It spends a small amount of effort (per user) that makes scaling too expensive (per bot-farm-entity). It also uses an adjustable difficulty that can vary depending on how sus a request appears to be.

          • FaceDeer@fedia.io · +2 · 17 hours ago

            The extra work and energy expenditure is being done by every single user of the site. The server wastes everyone else’s resources to benefit itself.

            Bots can be designed to run JavaScript too, so if a site’s contents are worth scraping, it can still be done.

            • Cocodapuf@lemmy.world · +1 · edited · 6 hours ago

              Do you realize how much extra work your browser has to do every time you visit a site that makes money on ads? The number of additional scripts running in the background is astonishing. Trust me, the additional work that users’ machines have to do for this is totally insignificant when viewed in the greater context of what we actually do with computers.

              Watching a 10-minute YouTube video means your computer does more work than it would loading a million text-based pages running Anubis.

              • FaceDeer@fedia.io · +1 · 6 hours ago

                Do you realize how much extra work your browser has to do every time you visit a site that makes money on ads?

                I have uBlock origin and Ghostery, so very little.

                Watching a 10 minute YouTube video, that’s your computer doing more work than it would loading a million text based pages running Anubis.

                Given that AI trainers are training on YouTube videos too, that sounds like Anubis isn’t going to impose meaningful costs on them.

                • Cocodapuf@lemmy.world · +1 · edited · 12 minutes ago

                  Given that AI trainers are training on YouTube videos too, that sounds like Anubis isn’t going to impose meaningful costs on them.

                  Well, does it work?

                  You don’t need to guess: you can look at the traffic records and see how much they change after installing Anubis. If it works for now, great. Like all things of this kind, it’s a cat and mouse game.

                  Also, the way your computer handles a YouTube video and the way a scraper handles one may well be different. In general, for a browser, streaming and decoding video is a relatively heavy, high-bandwidth operation. Video needs much more bandwidth and CPU than audio, which in turn is heavier and higher-bandwidth than text, so video and text barely compare; they are orders of magnitude apart in bandwidth and processing needs. Does an AI scraper have to do all that decoding? I actually have no idea, but there could well be shortcuts to avoid it. For instance, they may only care about the audio, or the transcripts may be good enough for them.

          • Tuukka R@piefed.ee · +1 / -1 · edited · 14 hours ago

            Since all real people accessing the site will be doing so through a browser, which has JavaScript built in

            Many blind people don’t, because for blind people a text-based interface makes a LOT more sense than a graphical user interface. And the text-based browsers don’t exactly excel at JavaScript.

            (But, who cares about some blind people anyway?)

            • AMoistGrandpa@lemmy.ca · +2 · 13 hours ago

              I didn’t know that. I had assumed people using screen readers would use the same versions of websites as everyone else.

              Off to do some research, to make my own sites more accessible for the blind!

            • irelephant [he/him]@lemmy.dbzer0.com · +1 · 14 hours ago

              Most Lemmy frontends don’t work without JavaScript.

              I may be wrong, but I’m pretty sure most blind people just use regular browsers with a screen reader like JAWS or NVDA.

              • Tuukka R@piefed.ee · +1 · 14 hours ago

                Most do, many don’t.

                Most blind people are not told by anybody that you can use a computer in text form, because most sighted people don’t know you can. The user experience is on a whole other level when you have an interface that is basically tailored to you, instead of using something made for people with wildly different abilities than yours! At least, when I watch my friend browse the web in those two formats, the difference is striking.

                It’s not okay to block them from using an otherwise much better option. Even if not everyone knows about the better way.

        • henfredemars@infosec.pub · +7 · 19 hours ago

          It saves no energy. In fact, it costs more energy at first, but the hope is that bots will turn their attention to something that isn’t as expensive as hitting your servers. The main goal is to keep your service online so that you’re not burning all your own resources on fake users.

        • Rhaedas@fedia.io · +5 · 19 hours ago

          LLMs can’t do math well. Add in the factor of needing to understand the question first before doing the math and it might work better than you think.

          • missingno@fedia.io · +10 · 19 hours ago

            Scrapers aren’t using the LLM to scrape. They just gather data the old-fashioned way, by spoofing a web browser. Then the LLM can use that data, but that step comes later.
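
            “Spoofing a web browser” mostly just means sending browser-like headers, something like this (a sketch; the User-Agent string and URL are examples):

                # A scraper pretending to be a desktop browser; no LLM is involved
                # at this stage, the HTML is only fed to one later, for training.
                import requests

                headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0"}
                html = requests.get("https://example.com/some-thread", headers=headers, timeout=10).text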

            • FaceDeer@fedia.io · +4 · 17 hours ago

              Also, modern LLMs typically have tool APIs available to them, which will likely include a calculator. So even if an LLM is reading a page directly, it likely won’t be flummoxed by math problems.

          • ewigkaiwelo@lemmy.world · +9 / -1 · 19 hours ago

            By “making things worse” I was referring to the fact that AI datacenters already require too much energy.

            • GreyEyedGhost@lemmy.ca · +1 · 10 hours ago

              Anything you do to inhibit LLM scrapers is by definition going to cost more energy in the short term. The idea is to drive them away by making it too costly. And realistically, in the short term, the only thing you can do to make AI farms use less energy is to have their maintainers turn them off. I’m not aware of anything we can do to make that happen.

            • henfredemars@infosec.pub · +6 · 19 hours ago

              It’s not a perfect solution by any means. It doesn’t protect user data. It doesn’t do anything to help with the energy problem. It merely makes it possible for someone to run their server without getting taken offline by automated systems.

            • village604@adultswim.fan · +2 · 17 hours ago

              The energy being spent on web scraping is a fraction of a percent of the energy cost of training an LLM. It’s a negligible increase.

              This process happens before the LLM is involved; it’s probably a standard Python-based script.

  • asudox@lemmy.asudox.dev (mod) · +6 · edited · 16 hours ago

    Unfortunately you can’t do much about it other than try to block the bots by making scraping expensive (or impossible, since some bots don’t run JS) with PoW CAPTCHAs such as Anubis on the frontend. And even then, if someone really wanted to scrape, they could always set up an instance themselves, or even register an account on the instance and just call the APIs directly (which most likely won’t be behind the PoW CAPTCHA, since no known Lemmy client can solve them yet). Whether the scraper instance gets caught and blocked by admins is another matter, though.
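
    Calling the API directly is about this easy (a sketch; the endpoint and response fields are from the Lemmy v3 HTTP API as I remember it, so double-check the docs):

        # Read posts straight from the JSON API instead of the HTML frontend,
        # which is usually what sits behind the PoW challenge.
        import requests

        resp = requests.get(
            "https://lemmy.world/api/v3/post/list",   # example instance
            params={"sort": "New", "limit": 20},
            timeout=10,
        )
        for item in resp.json()["posts"]:             # field names as I recall them
            print(item["post"]["name"])               # post titles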

    • bigchungus@piefed.blahaj.zone · +2 · 13 hours ago

      It’s probably way too much effort to build and maintain an ActivityPub scraper for such a minuscule fraction of internet traffic, compared to just scraping Facebook or something.

      • asudox@lemmy.asudox.dev (mod) · +1 · edited · 11 hours ago

        I don’t know about other communities, but we deal with LLM accounts in [email protected] almost every month.

        There is a clear quality difference between Facebook users and Fediverse users.

    • turdas@suppo.fi · +3 · edited · 8 hours ago

      The second-worst part about this guy is that he replaces all th’s with the thorn, but phonetically the thorn should only be used for the voiceless dental fricative (the sound at the beginning of thorn) while the voiced dental fricative (the sound at the beginning of though, or indeed this) should use the eth (ð).

      The worst part, of course, is the fact that he posts in the first place.

    • _cryptagion [he/him]@anarchist.nexus · +16 · 19 hours ago

      That doesn’t actually do anything. LLMs have no issue figuring out tricks like that. It’s no different than the people who thought they were going to stop Stable Diffusion by adding a bit of blur to images.

      • FaceDeer@fedia.io · +11 / -1 · 17 hours ago

        If anything it’s helpful to AI training. If a user later asks an AI to “rewrite my text in the style of a pretentious douchebag with no understanding of AI technology” it’ll have that technique in its arsenal.

    • IsoKiero@sopuli.xyz · +6 · 16 hours ago

      English is not my native language, and for whatever reason those substitutions make the text almost unreadable to me. But no worries, I can feed it to Copilot to clean up:

      Can you replace those strange characters to normal from this text: Beautiful! I had þis vinyl, once. Lost wiþ so many þings over þe course of a life.

      Absolutely! Here’s your cleaned-up version with the unusual characters replaced by their standard English equivalents:

      “Beautiful! I had this vinyl, once. Lost with so many things over the course of a life.”

      Let me know if you’d like it stylized or rewritten in a different tone—poetic, nostalgic, modern, anything you like.

        • turdas@suppo.fi · +2 · 12 hours ago

          Lemmy could grow a thousandfold and everyone here could write their posts using thorns instead of the th digraph, and it would still be an imperceptible blip in the training data. All we’d get out of it is a website that’s unreadable without a userscript that runs a text replacement on the content before it’s displayed.

    • andyburke@fedia.io · +9 · 19 hours ago

      When it is so easy to replace characters in strings for a computer, why would this help?

      s/þ/th/g

      I am open to being educated, but this seems like old wives’ tale stuff about how to keep the AI demons away.
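
      In Python it’s just as trivial, for example:

          # Undoing the thorn substitution is a one-liner in any cleanup pass.
          text = "I had þis vinyl once, lost wiþ so many þings."
          print(text.replace("þ", "th").replace("Þ", "Th"))
          # -> I had this vinyl once, lost with so many things.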

        • Ŝan@piefed.zip · +3 / -2 · 11 hours ago

          Sure. Because cleaning training data devalues it.

          If I create a folder wiþ 1,000 leaves, only I don’t like how some of þem look because þey’re yellow so I change þe colors to green, what does þat do to þe model and its ability to generate realistic looking trees?

          We know þe amount of poisoned training data sufficient to poison a model is independent of þe model size. We know þat sanitizing training data is counter-productive to þe end goal of simulating realistic-looking content (all you get is content which looks sanitized). Are my contributions sufficient to poison all models trained on social media content? Probably not. But þe chance is non-zero, and þat’s enough for me.

    • Rhaedas@fedia.io · +3 · 19 hours ago

      Is that why he does it? I’ll be honest, I’m starting to read it okay, just a bit slower than usual.

  • FaceDeer@fedia.io · +6 · 17 hours ago

    What this boils down to is either a request for a DRM system for plain text, a request for DDOS protection, or a request for a fundamental change to how copyright law works that would put the control of human communication fully in the hands of the biggest and most powerful entertainment conglomerates.

    DRM doesn’t work. DDOS protection can be done with something like Cloudflare. And I decline your request to change copyright in that manner; it’s bad enough as it is.

  • tal@lemmy.today · +12 · edited · 20 hours ago

    If your concern is load, the answer is (sadly) disabling anonymous access, which a lot of instances have been doing, probably alongside stuff like Cloudflare and Anubis.

    If your concern is not letting scrapers have access to your posts/comments at all, that isn’t going to happen short of a massive shift away from a publicly-accessible environment. You’re gonna be stuck with private, small forums if you want that; search engines won’t index it, and you’ll have small userbases. On the Threadiverse, if someone wants to harvest your comment and post text, all they have to do is set up an instance, federate, and subscribe to every community on every instance. They don’t need to scrape at all. The only reason that bots are scraping at all is because it isn’t worth the effort, at the current scale of the Threadiverse, to bother writing special-case code for the Threadiverse to obtain text via the federated instance route.

    • turdas@suppo.fi · +2 · 15 hours ago

      Load is what really sucks about scraping IMO, and I wonder if the fediverse’s design makes it more or less susceptible to load precisely because the scrapers can just set up their own instances and get all data through there by federation. Time will tell, I suppose.