Few people realise that virtually everything we do online has, until this point, been free training data to make OpenAI, Anthropic, etc. richer while cutting humans, the ones who produced the value, out of the loop.
It might be too little, too late, at this juncture, and this particular solution doesn't seem too innovative. However, it is directionally 100% correct, and let's hope for massively more innovation in defending against AI parasitism.
"User-agent: CCBot disallow: /"
Is Common Crawl exclusively for "AI"?
CCBot was already in so many robots.txt files prior to this.
How is CC supposed to know or control how people use the archive contents?
What if CC is relying on fair use?
If the operator has no intellectual property rights in the material, then do they need permission from the rights holders to license such materials?
Is it common for website terms and conditions to permit site operators to sublicense other people's ("users") work for use in creating LLMs for a fee?
Is this fee shared with the rights holders?
Read a ToS and notice that you grant the site operators an essentially unlimited license to reproduce or distribute your works on almost any site. It's required to host and show the content, essentially.
This is interesting. The reasoning and response don't line up.
This seems to be targeted at taxing training of language models, but why an exclusion for the RAG stuff? That seems like it has a much greater immediate impact for online content creators, for whom the bots are obviating a click.
With that opinion, are you also suggesting that we ban ad blockers? Because it's better I not click & consume resources than click and not be served ads, basically just costing the host money.
It makes sense to allow RAG in the same way that search engines provide a snippet of an important chunk of the page.
A blog author can hardly complain that their blog is getting RAG'd when they're very likely searching Google (or whatever) all day themselves, consuming others' content in exactly the same way they're trying to disparage.
I don't think we should ban ad blockers, but I also think it's fair to suggest that the loss of organic traffic could be affecting the incentive to create new digital content, at least as much as the fear of having your content absorbed into an LLM's training data.
IMO the backlash against LLMs is more philosophical; a lot of people don’t like them or the idea of one learning from their content. Unless your website has some unique niche information unavailable anywhere else, there’s no direct personal risk. RAG would be a more direct threat, if anything.
It's really about who gets the value from the work behind the content. If content creators of all sorts have their work consumed by LLMs, and LLM orgs charge for it and capture all the value, why should people create just to have their work vacuumed up for the robots' benefit? For exposure? You can't eat or pay rent with exposure. Humans must get paid, and LLMs (foundational models and RAG-based output alike) cannot improve without a stream of works and data that humans create.
Whether you call it training or something else is irrelevant; it's really exploitation of human work and effort for AI shareholder returns and tech worker comp (if those who create aren't compensated). And based on the evidence, the technocracy has not been a great steward of the power it obtains through this. Pay the humans for their work.
What I want to know is whether the flood of scraping everyone has been complaining about is coming from people scraping for training or from bots doing RAG search.
I get that everyone wants data, but presumably the big players have already scraped the web. Do they really need to do it again? Or is it bit players reproducing data that's likely already in the training set? Or is it really that valuable to have your own scraped copy of internet-scale data?
I feel like I'm missing something here. My expectation is that RAG traffic is going to be orders of magnitude higher than scraping for training. Not that it would be easy to measure from the outside.
I wondered about this, too.
Cloudflare have some recent data about traffic from bots (https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-cr...) which indicates that, for the time being, the overwhelming majority of the bot requests are for AI training and not for RAG.
I wonder… Google scrapes for indexing and for AI, right? I wonder if they will eventually say: ok, it's all or nothing; if you don’t want to help train my AI, you won’t get my searches either. That’s a tough deal, but it is sort of self-consistent.
Very few people seem to be complaining that Google crashes their sites. Google also publishes its crawlers' IP ranges, but you really don't need to rate-limit Google; they know how to back off and not overload sites.
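For anyone who does want to tell real Googlebot traffic from impostors, the check Google documents is a reverse-DNS lookup confirmed by a forward lookup (or matching against their published IP range files). A rough sketch of that check in Python; the function name and structure are my own illustration:

    import socket

    def is_verified_googlebot(ip: str) -> bool:
        # Reverse lookup: a genuine Googlebot IP resolves to a hostname
        # under googlebot.com or google.com ...
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)
        except OSError:
            return False
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        # ... and a forward lookup of that hostname returns the same IP.
        try:
            return ip in socket.gethostbyname_ex(hostname)[2]
        except OSError:
            return False

Anything claiming to be Googlebot in its User-Agent but failing that check is fair game for rate limiting.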
"Embrace, Extend, Extinguish" Google's mantra. And yes, I know about Microsoft's history with that phrase ;) But Google has done this with email, browsers (Google has web apps that run fine on Firefox but request you use Chrome), Linux (Android), and I'm sure there's others I am forgetting about.
So yeah, I too could see them doing this.
So in addition to updating the robots.txt file (which really only blocks a small number of them), it seems CF has been gathering data and profiling these malicious agents.
This post by CF elaborates a bit further: https://blog.cloudflare.com/declaring-your-aindependence-blo...
It basically becomes a game of cat and mouse.
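The user-agent half of that game is the easy part, since it only catches crawlers that announce themselves. A rough sketch of what that naive check looks like; the token list and names are my own illustration, not how CF actually does it:

    # Well-known self-identifying AI crawler tokens (illustrative, not exhaustive).
    AI_CRAWLER_TOKENS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider",
                         "PerplexityBot", "Amazonbot")

    def looks_like_ai_crawler(user_agent: str) -> bool:
        # True only if the User-Agent header admits to being an AI crawler.
        ua = (user_agent or "").lower()
        return any(token.lower() in ua for token in AI_CRAWLER_TOKENS)

A scraper that sends a browser-like User-Agent sails straight past this, which is why serious blocking ends up relying on IP reputation and fingerprinting, hence the cat and mouse.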