I’ve been following the bearblog developer’s struggle to manage the current war between bot scrapers and people trying to keep a safe, human-oriented internet. What is Lemmy doing about bot scrapers?

Some context from the bearblog dev:

The great scrape

https://herman.bearblog.dev/the-great-scrape/

LLMs feed on data. Vast quantities of text are needed to train these models, which are in turn receiving valuations in the billions. This data is scraped from the broader internet, from blogs, websites, and forums, without the authors’ permission, with all content treated as opted in by default.

Needless to say, this is unethical. But as Meta has proven, it’s much easier to ask for forgiveness than permission. It is unlikely they will be ordered to “un-train” their next generation models due to some copyright complaints.

Aggressive bots ruined my weekend

https://herman.bearblog.dev/agressive-bots/

It’s more dangerous than ever to self-host, since simple mistakes in configurations will likely be found and exploited. In the last 24 hours I’ve blocked close to 2 million malicious requests across several hundred blogs.

What’s wild is that these scrapers rotate through thousands of IP addresses during their scrapes, which leads me to suspect that the requests are being tunnelled through apps on mobile devices, since the ASNs tend to be cellular networks. I’m still speculating here, but I think app developers have found another way to monetise their apps by offering them for free, and selling tunnel access to scrapers.

  • ewigkaiwelo@lemmy.world · 1 day ago

    So it doesn’t stop LLMs from data farming, but makes them spend more energy doing so? If that’s the case, it sounds like it’s making things even worse.

    • AMoistGrandpa@lemmy.ca · 1 day ago

      As I understand, Anubis doesn’t make the user do anything. Instead, it runs some JavaScript in the client’s browser that does the calculations, and then sends the result back to the server. In order for an LLM to get through Anubis, the LLM would need to be running a real JavaScript engine (since the requested calculation is too complicated for an LLM to do natively), and that would be prohibitively expensive for bot farms at any real scale. Since all real people accessing the site will be doing so through a browser, which has JavaScript built in, and most bots will just download the website and send the source code right into the LLM without being able to execute it, real people will be able to get through Anubis while bots won’t. The total amount of extra energy consumed by adding Anubis isn’t actually that high since bot farms aren’t doing the extra work.

      Take that all with a grain of salt; that info is based on a blog post which I read like 6 months ago, and I may be remembering incorrectly.
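      For anyone wondering what that browser-side calculation actually looks like, here’s a minimal sketch of the proof-of-work idea as I understand it. The function names and the difficulty value are invented for illustration; this is not Anubis’s real code, just the general “hash until you find enough leading zero bits” pattern.

      ```typescript
      // Hypothetical proof-of-work sketch (not Anubis's actual implementation).
      // The client grinds through nonces until SHA-256(challenge + nonce) starts
      // with `difficulty` zero bits, then sends the winning nonce to the server,
      // which only needs a single hash to verify it.
      const encoder = new TextEncoder();

      // Count the leading zero bits of a hash.
      function leadingZeroBits(hash: Uint8Array): number {
        let bits = 0;
        for (const byte of hash) {
          if (byte === 0) { bits += 8; continue; }
          bits += Math.clz32(byte) - 24; // clz32 works on 32-bit values; a byte uses the low 8
          break;
        }
        return bits;
      }

      async function solveChallenge(challenge: string, difficulty: number): Promise<number> {
        for (let nonce = 0; ; nonce++) {
          const data = encoder.encode(challenge + nonce);
          const digest = new Uint8Array(await crypto.subtle.digest("SHA-256", data));
          if (leadingZeroBits(digest) >= difficulty) return nonce;
        }
      }

      // Each extra bit of difficulty roughly doubles the client's expected work,
      // while the server's verification cost stays at one hash.
      solveChallenge("example-challenge-token", 16).then((n) => console.log("nonce:", n));
      ```

      The asymmetry is the whole point: a real browser pays a fraction of a second once, a bot farm replaying millions of requests pays it millions of times, and the server pays almost nothing to check the answer.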

      • henfredemars@infosec.pub · 1 day ago

        Your understanding is consistent with mine. It spends a small amount of effort (per user) that makes scaling too expensive (per bot-farm-entity). It also uses an adjustable difficulty that can vary depending on how sus a request appears to be.
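        To illustrate what an adjustable difficulty could mean in practice, here’s a purely hypothetical sketch; the heuristics and numbers are invented for illustration, not taken from Anubis.

        ```typescript
        // Invented example of scaling proof-of-work difficulty by how suspicious
        // a request looks. Real deployments will use their own signals.
        interface RequestInfo {
          userAgent: string;
          requestsLastMinute: number; // from the server's own rate tracking
        }

        function chooseDifficulty(req: RequestInfo): number {
          let difficulty = 12; // baseline: a fraction of a second for a normal browser
          if (!req.userAgent.includes("Mozilla")) difficulty += 4; // doesn't even claim to be a browser
          if (req.requestsLastMinute > 60) difficulty += 4;        // hammering the server
          return difficulty; // each +1 roughly doubles the expected client work
        }
        ```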

      • FaceDeer@fedia.io · 1 day ago

        The extra work and energy expenditure is being done by every single user of the site. The server wastes everyone else’s resources to benefit itself.

        Bots can be designed to run JavaScript too, so if a site’s contents are worth scraping, it can still be done.

        • henfredemars@infosec.pub · 10 hours ago

          Looking at real-world services today, it is effective at discouraging bots, but you have indeed found the primary downside. It does impose costs on users, even if those costs fall disproportionately on bots.

        • Cocodapuf@lemmy.world · 18 hours ago

          Do you realize how much extra work your browser has to do every time you visit a site that makes money on ads? The number of additional scripts running in the background is astonishing. Trust me, the additional work users’ machines have to do for this is totally insignificant in the greater context of what we actually do with computers.

          Watching a 10-minute YouTube video makes your computer do more work than loading a million text-based pages running Anubis would.

          • FaceDeer@fedia.io · 18 hours ago

            Do you realize how much extra work your browser has to do every time you visit a site that makes money on ads?

            I have uBlock Origin and Ghostery, so very little.

            Watching a 10 minute YouTube video, that’s your computer doing more work than it would loading a million text based pages running Anubis.

            Given that AI trainers are training on YouTube videos too, that sounds like Anubis isn’t going to impose meaningful costs on them.

            • Cocodapuf@lemmy.world · 12 hours ago

              Given that AI trainers are training on YouTube videos too, that sounds like Anubis isn’t going to impose meaningful costs on them.

              Well, does it work?

              You don’t need to guess; you can simply look at the traffic records and see how much traffic changes after installing Anubis. If it works for now, great. Like everything in this space, it’s a cat-and-mouse game.

              Also, the way your computer interprets a YouTube video and the way a scraper interprets one may well be different. But in general, for a browser, streaming and decoding video is a relatively heavy, high-bandwidth operation. Video requires much more bandwidth and CPU than audio, which in turn is heavier and higher-bandwidth than text. As a result, video and text barely compare; they’re entirely different orders of magnitude in bandwidth and processing needs. So does an AI scraper have to do all that decoding? I actually have no idea, but there could well be shortcuts, ways to avoid it entirely. For instance, they may only care about the audio, or perhaps the transcripts are good enough for them.
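              To put rough numbers on that gap (these are assumed ballpark figures, not measurements), compare the bandwidth alone:

              ```typescript
              // Back-of-envelope comparison; bitrates and page sizes are assumptions.
              const videoBitsPerSecond = 5_000_000;                  // ~5 Mbps for 1080p streaming
              const videoBytes = (videoBitsPerSecond / 8) * 10 * 60; // 10-minute video ≈ 375 MB
              const textPageBytes = 50_000;                          // a generous 50 KB HTML text page
              console.log(videoBytes / textPageBytes);               // ≈ 7500x, before any decoding cost
              ```

              And that’s only bandwidth; decoding ten minutes of 30 fps video also means chewing through roughly 18,000 frames on top of it.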

      • Tuukka R@piefed.ee · 1 day ago

        Since all real people accessing the site will be doing so through a browser, which has JavaScript built in

        Many blind people don’t, because for blind people a text-based interface makes a LOT more sense than a graphical user interface. And text-based browsers don’t exactly excel at JavaScript.

        (But, who cares about some blind people anyway?)

        • AMoistGrandpa@lemmy.ca · 1 day ago

          I didn’t know that. I had assumed people using screen readers would use the same versions of websites as everyone else.

          Off to do some research, to make my own sites more accessible for the blind!

        • irelephant [he/him]@lemmy.dbzer0.com · 1 day ago

          Most Lemmy frontends don’t work without JavaScript.

          I may be wrong, but I’m pretty sure most blind people just use regular browsers with a screen reader like JAWS or NVDA.

          • Tuukka R@piefed.ee · 1 day ago

            Most do, many don’t.

            Most blind people are never told that you can use a computer in text form, because most sighted people don’t know you can. The user experience is on a whole other level when you have an interface basically tailored to you, instead of something made for people with wildly different abilities than yours! At least, when I watch my friend browse the web in those two formats, the difference is striking.

            It’s not okay to block them from using an otherwise much better option, even if not everyone knows about it.

    • henfredemars@infosec.pub · 1 day ago

      It saves no energy. In fact, it costs more energy at first, but the hope is that bots will turn their attention to something that isn’t as expensive as hitting your servers. The main goal is to keep your service online so that you’re not burning all your own resources on fake users.

    • Rhaedas@fedia.io · 1 day ago

      LLMs can’t do math well. Add in the fact that they need to understand the question before doing the math, and it might work better than you think.

      • missingno@fedia.io · 1 day ago

        Scrapers aren’t using the LLM to scrape. They just gather data the old-fashioned way, by spoofing a web browser. Then the LLM can use that data, but that step comes later.
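        In other words, something like this (the URL and header values are just examples, not any particular scraper’s code):

        ```typescript
        // "Old-fashioned" scraping: fetch the raw HTML while pretending to be a
        // desktop browser. No JavaScript is ever executed, which is exactly why a
        // JavaScript proof-of-work gate like Anubis trips these bots up.
        async function scrapePage(url: string): Promise<string> {
          const response = await fetch(url, {
            headers: {
              // Spoofed desktop-browser User-Agent
              "User-Agent":
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
            },
          });
          return response.text(); // raw HTML, stored now and used for training later
        }

        scrapePage("https://example.com/some-blog-post").then((html) =>
          console.log(html.length, "bytes scraped")
        );
        ```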

        • FaceDeer@fedia.io · 1 day ago

          Also, modern LLMs typically have tool APIs available to them, which will likely include a calculator. So even if LLMs are reading a page directly, they likely won’t be flummoxed by math problems.
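          As a generic illustration of that tool-call pattern (not any specific vendor’s API), the model emits a structured call and the host application does the exact arithmetic:

          ```typescript
          // Hypothetical tool dispatch; names and shapes are invented for illustration.
          type ToolCall =
            | { name: "calculator"; expression: string }
            | { name: "web_search"; query: string };

          function runTool(call: ToolCall): string {
            switch (call.name) {
              case "calculator":
                // A real host would use a safe expression parser rather than Function/eval.
                return String(Function(`"use strict"; return (${call.expression});`)());
              default:
                throw new Error(`tool not implemented: ${call.name}`);
            }
          }

          // The model answers a math question by emitting a structured call like this,
          // and gets the exact result back instead of guessing at the digits.
          console.log(runTool({ name: "calculator", expression: "1234 * 5678" })); // "7006652"
          ```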

      • ewigkaiwelo@lemmy.world · 1 day ago

        By “making things worse” I was referring to the fact that AI data centers already require too much energy.

        • henfredemars@infosec.pub · 1 day ago

          It’s not a perfect solution by any means. It doesn’t protect user data. It doesn’t do anything to help with the energy problem. It merely makes it possible for someone to run their server without getting taken offline by automated systems.

        • GreyEyedGhost@lemmy.ca · 22 hours ago

          Anything you do to inhibit LLM scrapers is by definition going to cost more energy in the short term. The idea is to drive them away by making it too costly. And realistically, in the short term, the only thing you can do to make AI farms use less energy is to have their maintainers turn them off. I’m not aware of anything we can do to make that happen.

        • village604@adultswim.fan · 1 day ago

          The energy being spent on web scraping is a fraction of a percent of the energy costs to train an LLM. It’s a negligible increase.

          This process happens before the LLM is involved; it’s probably a standard Python-based script.