I’ve been following the Bear Blog developer’s struggle to manage the current war between bot scrapers and people who are trying to keep a safe, human-oriented internet. What is Lemmy doing about bot scrapers?

Some context from the Bear Blog dev:

The great scrape

https://herman.bearblog.dev/the-great-scrape/

LLMs feed on data. Vast quantities of text are needed to train these models, which are in turn receiving valuations in the billions. This data is scraped from the broader internet, from blogs, websites, and forums, without the author’s permission and all content being opt-in by default.

Needless to say, this is unethical. But as Meta has proven, it’s much easier to ask for forgiveness than permission. It is unlikely they will be ordered to “un-train” their next generation models due to some copyright complaints.

Aggressive bots ruined my weekend

https://herman.bearblog.dev/agressive-bots/

It’s more dangerous than ever to self-host, since simple mistakes in configurations will likely be found and exploited. In the last 24 hours I’ve blocked close to 2 million malicious requests across several hundred blogs.

What’s wild is that these scrapers rotate through thousands of IP addresses during their scrapes, which leads me to suspect that the requests are being tunnelled through apps on mobile devices, since the ASNs tend to be cellular networks. I’m still speculating here, but I think app developers have found another way to monetise their apps by offering them for free, and selling tunnel access to scrapers.
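
To illustrate why that kind of IP rotation is hard to catch with per-IP limits, here is a minimal sketch that groups requests by ASN instead of by address. It is not the Bear Blog author's method: the function names, the thresholds, and the assumption that every log record already carries an ASN label (looked up elsewhere) are all hypothetical, and the sample data uses documentation-reserved IP ranges and AS numbers.

```python
from collections import Counter


def summarize_requests(records):
    """Summarize request records given as (ip, asn) tuples, one per request.

    Assumption: the ASN for each IP has already been resolved elsewhere.
    Returns (requests per IP, requests per ASN, distinct IP count per ASN).
    """
    per_ip = Counter()
    per_asn = Counter()
    ips_by_asn = {}
    for ip, asn in records:
        per_ip[ip] += 1
        per_asn[asn] += 1
        ips_by_asn.setdefault(asn, set()).add(ip)
    distinct = {asn: len(ips) for asn, ips in ips_by_asn.items()}
    return per_ip, per_asn, distinct


def suspicious_asns(records, min_requests, min_distinct_ips):
    """Return ASNs with many requests spread over many distinct IPs.

    Both thresholds are hypothetical tuning knobs, not recommended values.
    """
    _, per_asn, distinct = summarize_requests(records)
    return sorted(
        asn
        for asn in per_asn
        if per_asn[asn] >= min_requests and distinct[asn] >= min_distinct_ips
    )


# Made-up sample: 50 requests from 50 different addresses in one network,
# plus 5 requests from a single address in another network.
sample = [(f"192.0.2.{i}", "AS64500") for i in range(50)]
sample += [("198.51.100.7", "AS64501")] * 5
```

In this sample no single address in the first network sends more than one request, so a per-IP rate limit would never trigger, while the per-ASN view shows the concentration. Blocking a whole cellular ASN would also hit real mobile users, so a signal like this is probably better suited to flagging traffic for a challenge than to outright blocking.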

  • Zak@lemmy.world
    44 upvotes · 1 day ago

    If you’re concerned about bots ingesting the content, that’s impossible to prevent in an open federated system.

    • radix@lemmy.world
      26 upvotes · 1 day ago

      It’s weird that this has become such a controversial opinion. The internet is supposed to be open and available. “Information wants to be free.” It’s the big gatekeepers who want to keep all their precious data locked away in their own hoard behind paywalls and logins.

      If some clanker is going to read my words, it’s a very small price to pay for people being able to do the same.

      • 1984@lemmy.today
        3 upvotes · 15 hours ago (edited)

        It was open and free until big tech stole the software, packaged it as their own services under a different name, and made billions from it.

        Now they are scraping all content on the web to fuel another round of billions from AI.

        We are seeing how the web is dying: bots produce most of the content, and people will eventually stop using it, just like cable TV.

        It was a nice run though. I really liked growing up with the web and computers. But the end result is Enshittification. :)

      • FaceDeer@fedia.io
        15 upvotes, 3 downvotes · 1 day ago

        It’s a classic case of people being all for freedom until all of a sudden they think it negatively impacts them personally in some vague abstract way.

        An AI training off of my words costs me nothing. It doesn’t harm me at all. Frankly, I like the notion that future AIs are in some small part aligned based off of my views as expressed through my writing.

        • GreyEyedGhost@lemmy.ca
          5 upvotes · 22 hours ago

          It will harm the owner of the server, who will be serving a large amount of data, at his own expense, to someone he may not want to serve.

            • GreyEyedGhost@lemmy.ca
              2 upvotes · 10 hours ago

              Ads also cost many users something when they are served to their clients, and generally the more invasive and obnoxious the ad, the more it costs. If they don’t want to respect me, why should I respect them?

    • Krudler@lemmy.world
      2 upvotes · 20 hours ago

      I’m not entirely sure that’s what the concern is. I think it’s that the writer is describing such an obscene influx of bot traffic that it must be a nightmare to maintain and pay for.

    • Rhaedas@fedia.io
      5 upvotes, 1 downvote · 1 day ago

      It’s a version of the age-old question of how you keep someone from stealing your images while still being able to show them. No one can see an image without having downloaded it already. The best you can do is layer in things like watermarks so that cleaning it into a “pure” version isn’t worth the trouble. Same with text: poison it so it’s less valuable without a lot of extra work.