Few people realise that virtually everything we do online has, until this point, been free training data to make OpenAI, Anthropic, etc. richer while cutting humans, the ones who produced the value, out of the loop.
It might be too little, too late, at this juncture, and this particular solution doesn't seem too innovative. However, it is directionally 100% correct, and let's hope for massively more innovation in defending against AI parasitism.
"User-agent: CCBot disallow: /"
Is Common Crawl exclusively for "AI"?
CCBot was already in so many robots.txt files prior to this.
How is CC supposed to know or control how people use the archive contents?
What if CC is relying on fair use?
If the operator has no intellectual property rights in the material, then do they need permission from the rights holders to license such materials?
Is it common for website terms and conditions to permit site operators to sublicense other people's ("users") work for use in creating LLMs for a fee?
Is this fee shared with the rights holders?
Read a ToS and notice that you grant the site operators an essentially unlimited license to reproduce or distribute your works on almost any site. It's required to host and show the content, essentially.
This is interesting. The reasoning and response don't line up.
This seems to be targeted at taxing training of language models, but why an exclusion for the RAG stuff? That seems like it has a much greater immediate impact for online content creators, for whom the bots are obviating a click.
With that opinion, are you also suggesting that we ban ad blockers? Because it's better I not click & consume resources than click and not be served ads, basically just costing the host money.
It makes sense to allow RAG in the same way that search engines provide a snippet of an important chunk of the page.
A blog author can hardly complain that their blog is getting RAG'd when they're very likely searching Google (or whatever) all day themselves, consuming others' content in exactly the same way they're trying to disparage.
I don't think we should ban ad blockers, but I also think it's fair to suggest that the loss of organic traffic could be affecting the incentive to create new digital content, at least as much as the fear of having your content absorbed into an LLM's training data.
IMO the backlash against LLMs is more philosophical; a lot of people don’t like them or the idea of one learning from their content. Unless your website has some unique niche information unavailable anywhere else, there’s no direct personal risk. RAG would be a more direct threat, if anything.
It's really about who gets the value from the work behind the content. If content creators of all sorts have their work consumed by LLMs, and LLM orgs charge for it and capture all the value, why should people create just to have their work vacuumed up for the robots' benefit? For exposure? You can't eat or pay rent with exposure. Humans must get paid, and LLMs (foundational models and RAG-based output alike) cannot improve without a stream of works and data that humans create.
Whether you call it training or something else is irrelevant; it's really exploitation of human work and effort for AI shareholder returns and tech worker comp (if those who create aren't compensated). And based on the evidence, the technocracy has not been a great steward of the power it obtains through this. Pay the humans for their work.
What I want to know is whether the flood of scraping everyone has been complaining about is coming from people scraping for training or from bots doing RAG search.
I get that everyone wants data, but presumably the big players have already scraped the web. Do they really need to do it again? Or is it bit players reproducing data that's likely already in the training set? Or is it really that valuable to have your own scraped copy of internet-scale data?
I feel like I'm missing something here. My expectation is that RAG traffic is going to be orders of magnitude higher than scraping for training. Not that it would be easy to measure from the outside.
I wondered about this, too.
Cloudflare have some recent data about traffic from bots (https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-cr...) which indicates that, for the time being, the overwhelming majority of the bot requests are for AI training and not for RAG.
I wonder… Google scrapes for indexing and for AI, right? I wonder if they will eventually say: ok, it's all or nothing; if you don’t want to help train my AI, you won’t get my searches either. That’s a tough deal, but it is sort of self-consistent.
Very few people seem to be complaining that Google crashes their sites. Google also publishes its crawlers' IP ranges, but you really don't need to rate-limit Google; they know how to back off and not overload sites.
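For anyone who does want to tell real Googlebot traffic from impostors, the check Google documents is a reverse-DNS lookup confirmed by a forward lookup (or matching against their published IP range files). A rough sketch of that check in Python; the function name and structure are my own illustration:

    import socket

    def is_verified_googlebot(ip: str) -> bool:
        # Reverse lookup: a genuine Googlebot IP resolves to a hostname
        # under googlebot.com or google.com ...
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)
        except OSError:
            return False
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        # ... and a forward lookup of that hostname returns the same IP.
        try:
            return ip in socket.gethostbyname_ex(hostname)[2]
        except OSError:
            return False

Anything claiming to be Googlebot in its User-Agent but failing that check is fair game for rate limiting.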
"Embrace, Extend, Extinguish" Google's mantra. And yes, I know about Microsoft's history with that phrase ;) But Google has done this with email, browsers (Google has web apps that run fine on Firefox but request you use Chrome), Linux (Android), and I'm sure there's others I am forgetting about.
So yeah, I too could see them doing this.
So in addition to updating the robots.txt file (which really only blocks a small number of them), it seems CF has been gathering data and profiling these malicious agents.
This post by CF elaborates a bit further: https://blog.cloudflare.com/declaring-your-aindependence-blo...
It basically becomes a game of cat and mouse.
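The user-agent half of that game is the easy part, since it only catches crawlers that announce themselves. A rough sketch of what that naive check looks like; the token list and names are my own illustration, not how CF actually does it:

    # Well-known self-identifying AI crawler tokens (illustrative, not exhaustive).
    AI_CRAWLER_TOKENS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider",
                         "PerplexityBot", "Amazonbot")

    def looks_like_ai_crawler(user_agent: str) -> bool:
        # True only if the User-Agent header admits to being an AI crawler.
        ua = (user_agent or "").lower()
        return any(token.lower() in ua for token in AI_CRAWLER_TOKENS)

A scraper that sends a browser-like User-Agent sails straight past this, which is why serious blocking ends up relying on IP reputation and fingerprinting, hence the cat and mouse.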