Content Independence Day: no AI crawl without compensation

(blog.cloudflare.com)

49 points | by kotk 2 days ago ago

35 comments

  • agentultra a day ago ago

    A nice attempt, and another layer in the swiss cheese of technology it will take to ease the burden AI companies are putting on people trying to run websites.

    I'd be cautious about relying on just the good will of Cloudflare.

    It's unfortunate that we need honeypots and tarpits to trap AI scrapers just so that our hosting bills don't get hosed. It's taking a good chunk of value out of running a site on the Internet.

    • OutOfHere a day ago ago

      Feel free to waste your expensive outgoing bandwidth running malware. It really is a genius idea from the cloud companies to pad their own balance sheets.

      Definitely don't rewrite your web server more efficiently in Rust instead. /s

      • Retric a day ago ago

        Serving poisoned text can be so cheap it’s effectively free as long as you don’t give them a lot of links.

        • Mars008 a day ago ago

          Yeah, and say goodbye to Google search. You didn't want to be there anyway, right?

          • Retric 19 hours ago ago

            Google makes it easy to identify their bot. Often people want to do this to give them more access.

            People care about AI companies because they’re ignoring robots.txt etc.

        • OutOfHere 14 hours ago ago

          Another thing that doesn't make sense is why it has to be poisoned text. Why can't it just be a mix of whitespace? I doubt anyone is using LLMs with streaming inputs to determine whether to continue reading the page.

          • Retric 14 hours ago ago

            Companies actively harming you should be discouraged, preferably by running them out of business. Whitespace doesn’t do that, and it makes it easy for the operator to spot when the crawl has failed.

            Swapping meaning poisons the LLM but makes it really difficult for a preprocessing step to understand the difference between good and bad inputs.
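
            As a rough sketch of the swap idea (the word pairs here are made up, and a real poisoner would be subtler and seeded per page):

              package main

              import (
                  "fmt"
                  "strings"
              )

              // Pairs of words to silently swap. A preprocessing step still sees
              // ordinary prose; the statements just quietly mean the wrong thing.
              var swaps = map[string]string{
                  "increases": "decreases", "decreases": "increases",
                  "before": "after", "after": "before",
                  "safe": "unsafe", "unsafe": "safe",
              }

              // poison rewrites text served to suspected crawlers.
              func poison(text string) string {
                  words := strings.Fields(text)
                  for i, w := range words {
                      if repl, ok := swaps[strings.ToLower(w)]; ok {
                          words[i] = repl
                      }
                  }
                  return strings.Join(words, " ")
              }

              func main() {
                  fmt.Println(poison("Raising the dose increases risk, so test before release."))
                  // Prints: Raising the dose decreases risk, so test after release.
              }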

      • techjamie a day ago ago

        Many of these tarpits deliberately serve the data at an excruciatingly low speed, which eases the burden on server resources. It's cheaper than constantly serving the same crawlers your entire website at full speed.
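
        The core of such a tarpit is tiny; a toy sketch in Go (the path and timings are made up):

          package main

          import (
              "net/http"
              "time"
          )

          // tarpit drips a few bytes per second so a crawler ties up its own
          // connection for minutes while costing the server almost nothing.
          func tarpit(w http.ResponseWriter, r *http.Request) {
              flusher, ok := w.(http.Flusher)
              if !ok {
                  http.NotFound(w, r)
                  return
              }
              w.Header().Set("Content-Type", "text/html")
              for i := 0; i < 600; i++ { // roughly ten minutes per visit
                  if _, err := w.Write([]byte("<p>lorem ipsum</p>\n")); err != nil {
                      return // client gave up
                  }
                  flusher.Flush()
                  time.Sleep(time.Second)
              }
          }

          func main() {
              http.HandleFunc("/trap/", tarpit)
              http.ListenAndServe(":8080", nil)
          }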

        • OutOfHere a day ago ago

          If we are going for cheaper, how is it cheaper than an HTTP 429 error? It's not.

          • DamonHD 21 hours ago ago

            Virtually nothing I have observed pays attention to 429s. More things pay attention to 500s and 503s; some, however, use those as a trigger to re-poll immediately.

  • TekMol a day ago ago

    Currently, what I do is this: when an IP requests an insane number of URLs on my server (especially when it's all broken URLs causing 404s), I look up the IP and then block the whole organization.

    For example, today some bot from the range 14.224.0.0-14.255.255.255 went crazy and caused a storm of 404s, dozens per second for hours on end. So I blocked the range like this:

    iptables -A INPUT -m iprange --src-range 14.224.0.0-14.255.255.255 -j DROP

    That's probably not the best way and might block significant parts of whole countries. But at least it keeps my service alive for now.
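
    Spotting those storms can also be automated. A rough sketch that tallies 404s per client IP (the log path, the combined-log-format field positions, and the threshold are all assumptions):

      package main

      import (
          "bufio"
          "fmt"
          "log"
          "os"
          "strings"
      )

      func main() {
          f, err := os.Open("/var/log/nginx/access.log")
          if err != nil {
              log.Fatal(err)
          }
          defer f.Close()

          // Combined log format: client IP is the 1st field, status code the 9th.
          counts := map[string]int{}
          scanner := bufio.NewScanner(f)
          for scanner.Scan() {
              fields := strings.Fields(scanner.Text())
              if len(fields) > 8 && fields[8] == "404" {
                  counts[fields[0]]++
              }
          }
          for ip, n := range counts {
              if n > 1000 { // candidates for a range block
                  fmt.Println(ip, n)
              }
          }
      }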

    What do others here do to protect their servers?

    • PaulDavisThe1st a day ago ago

      At git.ardour.org, we block any attempt to retrieve a specific commit. Trying to do so triggers fail2ban putting the IP into blocked status for 24hrs. They also get a 404 response.

      We wouldn't mind if bots simply cloned the repo every week or something. But instead they crawl through the entire reflog. Fucking stupid behavior, and one that has cost us an extra $50/month even with just the 404.
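
      The same idea can be sketched without fail2ban; a very rough Go version (the /commit/ path match and the in-memory ban list are illustrative, not the actual git.ardour.org setup):

        package main

        import (
            "net"
            "net/http"
            "strings"
            "sync"
            "time"
        )

        // Ban any client that tries to fetch an individual commit page;
        // well-behaved users clone the repo instead of crawling its history.
        var (
            mu     sync.Mutex
            banned = map[string]time.Time{} // never evicted; fine for a sketch
        )

        func handler(w http.ResponseWriter, r *http.Request) {
            ip, _, _ := net.SplitHostPort(r.RemoteAddr)

            mu.Lock()
            if strings.Contains(r.URL.Path, "/commit/") {
                banned[ip] = time.Now().Add(24 * time.Hour) // trip (or extend) the ban
            }
            blockedUntil := banned[ip]
            mu.Unlock()

            if time.Now().Before(blockedUntil) {
                http.NotFound(w, r) // banned clients only ever see 404s
                return
            }
            w.Write([]byte("ok\n")) // normal request handling would go here
        }

        func main() {
            http.HandleFunc("/", handler)
            http.ListenAndServe(":8080", nil)
        }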

    • GGO a day ago ago

      I like rate limiting. I know none of my users will need more than 10 qps, so I set that for all routes, and all bots get throttled. I can also set a much higher rate limit for authenticated users. I haven't had bots slamming me - they just get 429s.
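
      Roughly this, in Go with golang.org/x/time/rate (10 qps with a burst of 20 is illustrative; the per-route and authenticated-user tiers are left out):

        package main

        import (
            "net"
            "net/http"
            "sync"

            "golang.org/x/time/rate"
        )

        var (
            mu       sync.Mutex
            limiters = map[string]*rate.Limiter{} // per-IP buckets, no eviction
        )

        // limiterFor returns a 10 qps token bucket for the given client IP.
        func limiterFor(ip string) *rate.Limiter {
            mu.Lock()
            defer mu.Unlock()
            l, ok := limiters[ip]
            if !ok {
                l = rate.NewLimiter(10, 20)
                limiters[ip] = l
            }
            return l
        }

        // limit wraps a handler and answers over-quota clients with 429.
        func limit(next http.Handler) http.Handler {
            return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                ip, _, _ := net.SplitHostPort(r.RemoteAddr)
                if !limiterFor(ip).Allow() {
                    http.Error(w, "too many requests", http.StatusTooManyRequests)
                    return
                }
                next.ServeHTTP(w, r)
            })
        }

        func main() {
            http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
                w.Write([]byte("ok\n"))
            })
            http.ListenAndServe(":8080", limit(http.DefaultServeMux))
        }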

  • rorylaitila a day ago ago

    It's unfortunate but I think the ship has sailed. Good on them for trying but I don't see it working.

    I am advising all my clients away from informational content which is easily remixed by LLMs. And I'm not bothering anymore with targeting informational search queries on my own sites.

    I'm doubling down on community and interaction: finding ways to engage smaller audiences with original content, rather than producing information for a global search audience.

  • azangru a day ago ago

    > Cloudflare, along with a majority of the world's leading publishers and AI companies, is changing the default to block AI crawlers

    How is this done, technically? User agent checking? IP range blocking?

    • yablak a day ago ago
      • grg0 a day ago ago

        This requires good faith on the part of the crawler? Then it's DOA; why even bother implementing it?

        Also, what a piece of zero-trust shit the web is becoming thanks to a couple of shit heads who really need to extract monetary value out of everything. Even if this non-solution were to work, the prospect of putting every website behind Cloudsnare is not a good one anyway.

        What the web needs right now, to be honest, is machetes. In ample quantity. Tell me who's running that crawler that is bothering you and I will put them to the sword. They won't even need to present a JWK in the header.

        • xg15 a day ago ago

          Maybe I didn't understand the proposal completely yet, but wouldn't the crawler only have to cooperate (send the right headers, implement that auth framework, etc) if they want to pay?

          The standard response to a crawler is a 402 Payment Required, presumably triggered by aggressive bot detection.

          So essentially, it's turning a site's entire content into an API: Either sign up for an API key or get blocked.

          The question remains, though, how well they will be able to distinguish bot traffic from humans - and will they make an exception for search engines?

          • grg0 a day ago ago

            That is not what I understood, and it sounds terrible. What if you're not a crawler but random Joe surfing the internet? Clearly Joe should see content without payment? So they need some way to tell the crawler and Joe apart, and presumably they require the crawler to set certain request headers. The headers aren't just to issue the payment; they're to identify the crawler in the first place?

            • AkshatM a day ago ago

              Joe will be fine. Cloudflare is pretty good at differentiating humans from bot traffic - see how we do it here: https://developers.cloudflare.com/turnstile/

              The idea behind the headers is to let bots bypass automatic bot filtering, not to blockade all regular traffic. In other words:

              - we block bots (the website owner can configure how aggressively we block)
              - unless they say they're from an AI crawler we've vetted, as attested by the signature headers
              - in which case we let them pay
              - and then they get to access the content

              (Disclosure: I wrote the web bot auth implementation Cloudflare uses for pay per crawl)
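
              From the outside, the flow looks roughly like the sketch below (the signature check and bot heuristic are placeholders and the price header name is invented, not Cloudflare's actual API):

                package main

                import "net/http"

                // verifySignature stands in for real verification of the
                // web bot auth signature headers against vetted crawler keys.
                func verifySignature(r *http.Request) bool {
                    return r.Header.Get("Signature") != "" // placeholder only
                }

                // looksLikeBot stands in for whatever bot detection the CDN runs.
                func looksLikeBot(r *http.Request) bool {
                    return r.Header.Get("User-Agent") == "" // placeholder heuristic
                }

                func handler(w http.ResponseWriter, r *http.Request) {
                    if looksLikeBot(r) && !verifySignature(r) {
                        // Unknown bot: quote a price instead of the content.
                        w.Header().Set("X-Crawl-Price", "0.01 USD") // invented header name
                        http.Error(w, "payment required", http.StatusPaymentRequired)
                        return
                    }
                    // Humans, and vetted crawlers that agreed to pay, get the page.
                    w.Write([]byte("<html>the actual content</html>"))
                }

                func main() {
                    http.HandleFunc("/", handler)
                    http.ListenAndServe(":8080", nil)
                }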

              • xg15 20 hours ago ago

                Thanks for replying! Do you have some provision for false positives as well, like sending a captcha in the body of the 402 response? (So in case the client was a human and not a bot, they could still try to solve the captcha)

              • grg0 a day ago ago

                Ok, well, thanks for the clarification.

            • xg15 a day ago ago

              The writeup doesn't say much about actively misbehaving crawlers, but this bit suggests to me that the headers are for the "happy path", i.e. crawlers that want to pay:

              > Each time an AI crawler requests content, they either present payment intent via request headers for successful access (HTTP response code 200), or receive a 402 Payment Required response with pricing.

              I don't see how it would make sense otherwise, as the requirements for crawlers include registering with Cloudflare.

              Who in their right mind would jump through registration hoops only so they can not access a site? This wouldn't even keep away the crawlers that are operating today.

              I agree there has to be some way to distinguish crawlers from regular users, but the only way I can see how this could be done is with bot detection algorithms.

              ...which are imperfect and will likely flag some legitimate human users as bots. So yes, this will probably lead to web browsing becoming even more unpleasant.

        • cryptonector a day ago ago

          It's Cloudflare. That means they are good at DoS and DDoS protection, and AI crawlers are basically DoS agents. I think CF can start with an honor system that carries the implied threat of blocking crawlers from all CF-hosted content, and that is a pretty big hammer to hit abusers with.

          So I'm cautiously optimistic. Well, I suppose pessimistic too: if this works, it will mean that all content ends up moving to big-player hosting like CF.

  • mhuffman a day ago ago

    So are they going to try to IP-gate them, or trust that AI companies that literally stole the info they used to make the base models will now respect robots.txt entries?

    • trhway a day ago ago

      Everyone likes net neutrality when they benefit from it, yet they immediately jump at the opportunity to break net neutrality on their own services if it lets them increase profit through price discrimination (which may take the shape of extracting rent from some subset of consumers, as seems to be the case here).

  • mzs a day ago ago

    Is there a cut that Cloudflare gets, or is that behind an NDA?

  • yladiz a day ago ago

    Would this be preferable to something like Anubis?

  • jmole a day ago ago

    > Imagine an AI engine like a block of swiss cheese. New, original content that fills one of the holes in the AI engine’s block of cheese is more valuable than repetitive, low-value content that unfortunately dominates much of the web today.

    Great statement in theory - but in practice, the whole people-as-a-service industry for AI data generation is IMO more damaging to the knowledge ecosystem than open data. e.g. companies like pareto.ai

    "Proprietary data for pennies on the dollar" is the late-stage capitalism equivalent of the postdoctoral research trap.

  • ChrisArchitect a day ago ago

    Discussion:

    Cloudflare to introduce pay-per-crawl for AI bots

    https://news.ycombinator.com/item?id=44432385

  • tiahura a day ago ago

    I thought the web was supposed to be free and open?

    • Havoc a day ago ago

      That ship has sailed

  • ramesh31 a day ago ago

    This will play out precisely like the "do not track" header; bad actors will create an arms race that makes anyone respecting it into a chump.

    • teeray a day ago ago

      If the only way to escape a proof of work tarpit is to pay the toll, you’re either going to pay in money or time & compute.
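
      The time-and-compute toll works roughly like this: the server hands out a challenge and only serves content once the client has found a nonce whose hash clears a difficulty bar. A minimal sketch of the check (parameters are illustrative):

        package main

        import (
            "crypto/sha256"
            "encoding/binary"
            "fmt"
            "strings"
        )

        // valid reports whether sha256(challenge || nonce) starts with the
        // required number of zero hex digits: cheap to check, costly to find.
        func valid(challenge string, nonce uint64, difficulty int) bool {
            buf := make([]byte, 8)
            binary.BigEndian.PutUint64(buf, nonce)
            sum := sha256.Sum256(append([]byte(challenge), buf...))
            return strings.HasPrefix(fmt.Sprintf("%x", sum), strings.Repeat("0", difficulty))
        }

        func main() {
            challenge, difficulty := "example-challenge", 4

            // The client grinds nonces before it gets any content...
            var nonce uint64
            for !valid(challenge, nonce, difficulty) {
                nonce++
            }
            // ...while the server re-checks the winning nonce in microseconds.
            fmt.Println("found nonce", nonce)
        }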

  • accountforih 2 days ago ago

    I don’t understand: companies that want to crawl can pay for services like brightdata or crawlbase, so the barriers don’t apply to them.

    This ends up hurting individuals and small companies that are harmless and cannot afford to pay