I spent a good part of my career (nearly a decade) at Google working on getting Clang to build the linux kernel. https://clangbuiltlinux.github.io/
This LLM did it in (checks notes):
> Over nearly 2,000 Claude Code sessions and $20,000 in API costs
It may build, but does it boot? (Booting was also a significant and distinct next milestone. Also, will it blend?) Looks like yes!
> The 100,000-line compiler can build a bootable Linux 6.9 on x86, ARM, and RISC-V.
The next milestone is:
Is the generated code correct? The jury is still out on that one for production compilers. And then you have performance of generated code.
> The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
Still a really cool project!
One thing people have pointed out is that well-specified (even if huge and tedious) projects are an ideal fit for AI, because the loop can be fully closed and it can test and verify the artifact by itself with certainty. Someone was saying they had it generate a rudimentary JS engine because the available test suite is so comprehensive
Not to invalidate this! But it's toward the "well-suited for AI" end of the spectrum
Yes - the gcc "torture test suite" that is mentioned must have been one of the enablers for this.
It's notable that the article says Claude was unable to build a working assembler (& linker), which is nominally a much simpler task than building a compiler. I wonder if this was at least in part due to not having a test suite, although it seems one could be auto generated during bootstrapping with gas (GNU assembler) by creating gas-generated (asm, ELF) pairs as the necessary test suite.
It does beg the question of how they got the compiler to point of correctness of generating a valid C -> asm mapping, before tackling the issue of gcc compatibility, since the generated code apparently has no relation to what gcc generates. I wonder which compilers' source code Claude has been trained on, and how closely this compiler's code generation and attempted optimizations compares to those?
i'm sure claude has been trained on every open source compiler
> Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase
Does it really boot...?
> Does it really boot...?
They don't need 16b x86 support for the RISCV or ARM ports, so yes, but depends on what 'it' we're talking about here.
Also, FWIW, GCC doesn't directly assemble to machine code either; it shells out to GAS (GNU Assembler). This blog post calls it "GCC assembler and linker" but to be more precise the author should edit this to "GNU binutils assembler and linker." Even then GNU binutils contains two linkers (BFD and GOLD), or did they excise GOLD already (IIRC, there was some discussion a few years ago about it)?
Yeah, didn't mention gas or ld, for similar reasons. I agree that a compiler doesn't necessarily "need" those.
I don't agree that all the claims are backed up by their own comments, which means there are probably other places where it falls down.
It's... misrepresentation.
Like Chicken is a Scheme compiler. But they're very up front that it depends on a C compiler.
Here, they wrote a C compiler that is at least sometimes reliant on having a different C compiler around. So is the project at 50%? 75%?
Even if it's 99%, that's not the same story as the one they tried to write. And if they had written that tale instead, it would be more impressive, rather than "There are some holes. How many?"
Their C compiler is not reliant on having another C compiler around. Compiling the 16-bit real mode bootstrap for the Linux kernel on x86(-64) requires another C compiler; you certainly don't need another compiler to compile the kernel for another architecture, or to compile another piece of software not subject to the 32k constraint.
The compiler itself is entirely functional; it just can't generate code optimal enough to fit within the constraints for that very specific (tiny!) part of the system, so another compiler is required to do that step.
The assembler seems like nearly the easiest part. Slurp arch manuals and knock it out, it’s fixed and complete.
I am surprised by the number of comments that say the assembler is trivial - it is admittedly perhaps simpler than some other parts of the compiler chain, but it’s not trivial.
What you are doing is kinda serialising a self-referential graph structure of machine code entries that reference each other's addresses, but you don't know the addresses because the (x86) instructions are variable-length, so you can't know them until you generate the machine code: a chicken-and-egg problem.
Personally I find writing parsers much much simpler than writing assemblers.
assembler is far from trivial at least for x86 where there are many possible encodings for a given instruction. emitting the most optimal encoding that does the correct thing depends on surrounding context, and you'd have to do multiple passes over the input.
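For the curious, here's a minimal hand-written sketch (mine, not from the article) of the fixed-point "relaxation" loop this implies for an x86 assembler: a jump can be encoded in 2 bytes (rel8) or 5 bytes (rel32), but picking a size moves every later label, which can change whether other jumps still fit in rel8, so you iterate until the sizes stop changing.

```
#include <stdio.h>

struct insn { int is_jump, target, size; };

int main(void)
{
    /* Toy stream: "jmp L2; <200 bytes of other code>; L2: nop".
       The jump is optimistically encoded short (2 bytes) at first. */
    struct insn prog[] = {
        { 1, 2, 2 },     /* jump to insn 2 */
        { 0, 0, 200 },   /* 200 bytes of straight-line code in between */
        { 0, 0, 1 },     /* the target instruction */
    };
    int n = 3, addr[4], changed = 1;

    while (changed) {                        /* iterate to a fixed point */
        changed = 0;
        addr[0] = 0;
        for (int i = 0; i < n; i++)          /* lay out with current size guesses */
            addr[i + 1] = addr[i] + prog[i].size;
        for (int i = 0; i < n; i++) {
            if (!prog[i].is_jump) continue;
            int disp = addr[prog[i].target] - addr[i + 1]; /* relative to next insn */
            int size = (disp >= -128 && disp <= 127) ? 2 : 5;
            if (size != prog[i].size) { prog[i].size = size; changed = 1; }
        }
    }
    for (int i = 0; i < n; i++)
        printf("insn %d: %d bytes at offset %d\n", i, prog[i].size, addr[i]);
    return 0;
}
```

In this toy stream the jump gets widened from 2 to 5 bytes once layout shows the target is more than 127 bytes away; real assemblers do the same dance for every span-dependent instruction, plus relocations for symbols they can't resolve at all.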
Huh. A second person mentioning the assembler. Don't think I ever referred to one...?
> Still a really cool project!
Yeah. This test sorta definitely proves that AI is legit. Despite the millions of people still insisting it's a hoax.
The fact that the optimizations aren't as good as the 40 year gcc project? Eh - I think people who focus on that are probably still in some serious denial.
It's amazing that it "works", but viability is another issue.
It cost $20,000 and it worked, but it's also totally possible to spend $20,000 and have Claude shit out a pile of nonsense. You won't know until you've finished spending the money whether it will fail or not. Anthropic doesn't sell a contract that says "We'll only bill you if it works" like you can get from a bunch of humans.
Do catastrophic bugs exist in that code? Who knows, it's 100,000 lines, it'll take a while to review.
On top of that, Anthropic is losing money on it.
All of those things combined, viability remains a serious question.
> You won't know until you've finished spending the money whether it will fail or not.
How do you conclude that? You start off with a bunch of tests and build these things incrementally, why would you spend 20k before realizing there’s a problem?
Because literally no real-world non-research project starts with "we have an extremely comprehensive test suite and a specification complete down to the finest detail" and then searches for a way to turn it into code.
I’ve spent nearly 20 years working as a consultant writing software, I know that. How do you think humans solve that problem?
Precisely. Figuring out what the specification is supposed to look like is often the hardest part.
> It cost $20,000
I'm curious - do you have ANY idea what it costs to have humans write 100,000 lines of code???
You should look it up. :)
That's irrelevant in this context, because it's not "get the humans to make a working product OR get the AI to make a working product"
The problem is you may pay $20K for gibberish, then try a second time, fail again, and then hire humans.
Coincidentally yes, I am aware, my last contract was building out a SCADA module the AI failed to develop at the company that contracted me.
I'm using that money to finance a new software company, and so far, AI hasn't been much help getting us off the ground.
Edit: oh yeah, and on top of paying Claude to fuck it up, you still have to also pay the salary of the guy arguing with Claude.
> The problem is you may pay $20K for gibberish, then try a second time, fail again, and then hire humans.
You can easily pay humans $20k a day and get gibberish as output. Heck, this happens all the time. It's happening right now in multiple companies.
Yes, sometimes humans produce nice code. That happens from time to time...
> > It cost $20,000
> I'm curious - do you have ANY idea what it costs to have humans write 100,000 lines of code???
I'll bite - I can write you an unoptimised C compiler that emits assembly for $20k, and it won't be 100k lines of code (maybe 15k, the last time I did this?).
It won't take me a week, though.
I think this project is a good frame of reference and matches my experience - vibing with AI is sometimes more expensive than doing it myself, and always results in much more code than necessary.
Does it support x64, x86_64, arm64 and riscv? (sorry, just trolling - we don't know the quality of any backend other than x86_64, which is supposed to be able to build a bootable Linux.)
It's not hard to build a compiler just for a bootable linux.
I see no test criteria that actually runs that built linux through various test plans, so, yeah emitting enough asm just to boot is doable.
> I can write you an unoptimised C compiler that emits assembly for $20k
You may be willing to sell your work at that price, but that’s not the market rate, to put it very mildly. Even 10 times that would be seriously lowballing in the realm of contract work, regardless of whether it’s “optimised” or not (most software isn’t).
> You may be willing to sell your work at that price, but that’s not the market rate, to put it very mildly.
It is now.
At any rate, this is my actual rate. I live in South Africa, and that's about 4 weeks of work for me, without an AI.
That’s a VERY nice rate for SA; approximately what I charge in the UK. I assume these are not local companies who hire you.
> That’s a VERY nice rate for SA; approximately what I charge in the UK. I assume these are not local companies who hire you.
A local Fintech needing PCI work pays that, but that's not long-term contracts.
Deal. I'll pay you IF you can achieve the same level of performance. Heck, I'll double it.
You must provide the entire git history with small commits.
I won't be holding my breath.
> Deal. I'll pay you IF you can achieve the same level of performance. Heck, I'll double it.
> You must provide the entire git history with small commits.
> I won't be holding my breath.
Sure; I do this often (I operate as a company because I am a contractor) - money to be held in escrow, all the usual contracts, etc.
It's a big risk for you, though - the level of performance isn't stated in the linked article so a parser in Python is probably sufficient.
TCC, which has in the past compiled bootable Linux images, was only around 15k LoC in C!
For reference, for an engraved-in-stone spec, producing a command-line program (i.e. no tech stack other than a programming language with its standard library), a coder could reasonably produce 5,000+ LoC per week.
Adding the necessary extensions to support booting isn't much either, because the 16-bit stuff can be done just the same as CC did it - shell out to GCC (thereby not needing many of the extensions).
Are you *really* sure that a simple C compiler will cost more than 4 weeks f/time to do? It takes 4 weeks or so in C, are you really sure it will take longer if I switch to (for example) Python?
> the level of performance isn't stated in the linked article so a parser in Python is probably sufficient.
No, you'll have to match the performance of the actual code, regardless of what happens to be written in the article. It is a C compiler written in Rust.
Obviously. Your games reveal your malign intent.
EDIT: And good LORD. Who writes a C compiler in python. Do you know any other languages?!?
> No, you'll have to match the performance of the actual code, regardless of what is in the article. It is a C compiler written in Rust.
Look, it's clear that you don't hire s/ware developers very much - your specs are vague and open to interpretation, and it's also clear that I do get hired often, because I pointed out that your spec isn't clear.
As far as "playing games" goes, I'm not allowing you to change your single-sentence spec which, very importantly, has "must match performance", which I shall interpret to as "performance of emitted code" and not "performance of compiler".
> Your games reveal your intent.
It should be obvious to you by now that I've done this sort of thing before. The last C compiler I wrote was 95% compliant with the (at the time, new) C99 standard, and came to around 7000-8000 LoC of C89.
> EDIT: And good LORD. Who writes a C compiler in python. Do you know any other languages?!?
Many. The last language I implemented (in C99) took about two weeks after hours (so, maybe 40 hours total?), was interpreted, and was a dialect of Lisp. It's probably somewhere on Github still, and that was (IIRC) only around 2000LoC.
What you appear to not know (maybe you're new to C) is that C was specifically designed for ease of implementation.
1. It was designed to be quick and easy to implement.
2. The extensions in GCC to allow building bootable Linux images are minimal, TBH.
3. The actual 16-bit emission necessary for booting was not done by CC, but by shelling out to GCC.
4. The 100kLoC does not include the tests; it used the GCC tests.
I mean, this isn't arcane and obscure knowledge, you know. You can search the net right now and find 100s of undergrad CS projects where they implement enough of C to compile many compliant existing programs.
I'm wondering; what languages did you write an implementation for? Any that you designed and then implemented?
You seem to have doubled down on a bluff that was already called.
Naw. I got him to reveal himself, which was the whole point.
It's amazing what you can get people to do.
No, you're overestimating how complex it is to write an unoptimized C compiler. C is (in the grand scheme of things) a very simple language to implement a compiler for.
The rate probably goes up if you ask for more and more standards (C11, C17, C23...) but it's still a lot easier than compilers for almost any other popular language.
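To make the "toy C compiler" claim concrete, here's a hypothetical ~30-line sketch of the usual starting point (integer +/- expressions only, no whitespace handling), in the spirit of incremental toy compilers like chibicc; a real C compiler is this idea scaled up through precedence, declarations, types, and codegen.

```
#include <stdio.h>
#include <stdlib.h>

static const char *p;   /* cursor into the source expression */

static void number(void)
{
    long v = strtol(p, (char **)&p, 10);
    printf("    mov $%ld, %%rax\n", v);       /* result lives in %rax */
}

static void expr(void)
{
    number();
    while (*p == '+' || *p == '-') {
        char op = *p++;
        printf("    push %%rax\n");           /* save left operand */
        number();                             /* right operand into %rax */
        printf("    mov %%rax, %%rdi\n");
        printf("    pop %%rax\n");
        printf("    %s %%rdi, %%rax\n", op == '+' ? "add" : "sub");
    }
}

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    p = argv[1];
    printf("    .globl main\nmain:\n");
    expr();
    printf("    ret\n");                      /* expression value is the exit code */
    return 0;
}
```

Assuming you compile the sketch to a binary named `toycc` (name is mine), `./toycc '5+20-4' > tmp.s && cc tmp.s -o tmp && ./tmp; echo $?` prints 21.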
That feels like a Silicon-Valley-centric point of view. Plus, who would really spend $20k building any C compiler today, in the current landscape of software?
All that this is saying is that license laundering of a code-base is now $20k away through automated processes, at least if the original code base is fully available. Well, with current state-of-the-art you’ll actually end up with a code-base which is not as good as the original, but that’s it.
If my devs are writing that much code they're doing something wrong. Lines of code is an anti metric. That used to be commonly accepted knowledge.
You wouldn’t pay a human to write 100k LOC. Or at least you shouldn’t. You’d pay a human to write a working useful compiler that isn’t riddled with copyright issues.
If you didn’t care about copying code, usefulness, or correctness you could probably get a human to whip you up a C compiler for a lot less than $20k.
Are you trolling me? Companies (made of humans) write 100,000 LOC all the time.
And it's really expensive, despite your suspicions.
No, companies don’t pay people to write 100k LOC. They pay people to write useful software.
We figured out that LOC was a useless productivity metric in the 80s.
Dude.
Microsoft paid my company a lot of money to write code. And in the end you were able to count it, and the LOC is a perfectly fine metric which is still used today to measure complexity of a project.
If you actually work in software you know this.
I have no idea what point you're trying to make - but I've grown very tired of all the trolls attacking me. Good night.
EDIT: OH. Maybe you mean that people don't cite LOC in contract deliverables. Yeah, I know. I never said that, and it's irrelevant to my point.
Without questioning the LOC metric itself, I'll propose a different problem: LOC for human and AI projects are not necessarily comparable for judging their complexity.
For a human, writing 100k LOC to do something that might only really need 15k would be a bit surprising and unexpected - a human would probably reconsider what they were doing well before they typed 100k LOC. Whereas an AI doesn't necessarily have that concern - it can just keep generating code and doesn't care how long it will take, so it doesn't have the same practical pressure to produce concise code.
The result is that while for large enough human-written programs there's probably an average "density" they reach in relation of LOC vs. complexity of the original problem, AI-generated programs probably average out at an entirely different "density" number.
I can't stress enough how much LOC is not a measure of anything.
Yep. I’ve seen people copy hundreds of lines instead of adding an if statement.
OK, well, the people in MY software industry use LOC as an informal measure of complexity.
LIKE THE WHOLE WORLD DOES.
But hey, maybe it's just the extremely high profile projects I've worked on.
Your first post specifically stated:
"I'm curious - do you have ANY idea what it costs to have humans write 100,000 lines of code???"
which any reasonable reading would take to mean "paid-by-line", which we all know doesn't happen. Otherwise, I could type out 30,000 lines of gibberish and take my fat paycheck.
> which any reasonable reading would take to mean
Nope. Try again.
> you could probably get a human to whip you up a C compiler for a lot less than $20k
I fork Clang or GCC and rename it. I'll take only $10k.
It really depends on the human and the code it outputs.
I can get my 2y old child to output 100k LoC, but it won't be very good.
Your 2yr old can't build a C compiler in Rust that builds Linux.
Sorry mate, I think you're tripping.
Well, if these humans can cheat by taking whatever degree of copycat liberty is needed to fit the budget, I guess a simple `git clone https://gcc.gnu.org/git/gcc.git SomeLocalDir` is as close to $0 as one can hope to get. And it would end up being far more functional and reliable. But I get that big-corp overlords and their wanna-match-KPI minions will prefer a "clean-roomed" code base.
100k lines of clean, bug free, optimized, and vulnerability free code or 100k lines of outsourced slop? Two very different price points.
A compiler that can build linux.
That level of quality should be sufficient.
Do you know any low quality programmers that write C compilers in rust THAT CAN BUILD LINUX?
No you don't. They do not exist.
Yep. Building a working C compiler that compiles Linux is an impossible task for all but the top 1% of developers. And the ones that could do it have better things to do, plus they’d want a lot more than 20K for the trouble.
What's so hard about it? Compiler construction is a well-researched topic and taught in universities. I made a toy language compiler as a student. Maybe I'm underestimating the task, but I think I could build a simple C compiler that outputs trivial assembly. Given my salary of $2,500, that would probably take me around a year, so that's pretty close, lol.
You can one shot prompt a toy C compiler. Getting one that can compile Linux in a bootable way is significantly harder.
It's a bit more nuanced. You can build a simple compiler without too many issues. But once you want it to do optimisations, control-flow protection, good and fast register allocation, inlining, autovectorisation, etc., that's going to take multiples of the original time.
> Building a working C compiler ... is an impossible task
I think you might be thinking of C++
I’m not. I’ve been working with C on and off for 30 years. Linux requires GNU extensions beyond standard C. Once you get the basics done, there’s still a lot more work to do. Compiling a trivial program might work. But you’ll hit an edge case or 50 in the millions of lines in Linux.
I also should’ve qualified my message with “in 2 weeks”, or even “in 2 months.” Given more time it’s obviously possible for more people.
Interesting, why impossible? We studied compiler construction at uni. I might have to dig out a few books, but I’m confident I could write one. I can’t imagine anyone on my course of 120 nerds being unable to do this.
You are underestimating the complexity of the task, as do other people in the thread. It's not trivial to implement a working C compiler, much less one that proves its worth by successfully compiling one of the largest open-source code repositories ever, which, btw, is not even written in a plain ISO C dialect.
I didn’t say it was trivial. Just that I thought my course mates would be able to do it.
You thought your course mates would be able to write a C compiler that builds the Linux kernel?
Huh. Interesting. Like the other guy pointed out, compiler classes often get students to write toy C compilers. I think a lot of students don't understand the meaning of the word "toy". I think this thread is FULL of people like that.
I took a compilers course 30 years ago. I have near zero confidence anyone (including myself) could do it. The final project was some sort of toy language for programming robots with an API we were given. Lots of yacc, bison, etc.
Lots of segfaults, too.
Hey! I built a Lego technic car once 20 years ago. I am fully confident that I can build an actual road worthy electric vehicle. It's just a couple of edge cases and a bit bigger right? /s
https://news.ycombinator.com/item?id=46909529
I’ll be shocked if they are able to do it in 4 months, never mind 4 weeks.
Do you think this was guided by a low quality Anthropic developer?
You can give a developer the GCC test suite and have them build the compiler backwards, which is how this was done. They literally brute forced it, most developers can brute force. It also literally uses GCC in the background... Maybe try reading the article.
HEH... I'm sorry man but I truly don't understand what point you're trying to make, other than change the subject and get nasty.
You take care now.
The trick to not be confused is to read more than the title of the article.
I love how I'm getting downvoted for calling him out for saying "read the article".
The people in here have gone entirely nuts.
It's $20,000 in 2026; with the price of tokens halving every year (at a given perf level), this will be around $1,000 in 2030.
Progress can be reviewed over time, and I'd think that'd take a lot of the risk out.
> On top of that, Anthropic is losing money on it.
It seems they are *not* losing money on inference: https://bsky.app/profile/steveklabnik.com/post/3mdirf7tj5s2e
That's for the API right? The subs are still a loss. I don't know which one of the two is larger.
no, and that is widely known. the actual problem is that the margins are not sufficient at that scale to make up for the gargantuan training costs to train their SOTA model.
Source on that?
Because inference revenue is outpacing training cost based on OpenAI’s report and intuition.
Net inference revenue would need to be outpacing it to go against his thinking about margins.
That's a good point! Here claude opus wrote a C compiler. Outrageously cool.
Earlier today, I couldn't get opus to replace useEffect-triggered-redux-dispatch nonsense with react-query calls. I already had a very nice react-query wrapper with tons of examples. But it just couldn't make sense of the useEffect rube goldberg machine.
To be fair, it was a pretty horrible mess of useEffects. But just another data point.
Also I was hoping opus would finally be able to handle complex typescript generics, but alas...
> On top of that, Anthropic is losing money on it
This has got to be my favorite one of them all, and it keeps coming up in too many comments… You know who else was losing money in the beginning?! Every successful company that ever existed! Some, like Uber, were losing billions for a decade. And when was the last time you rode in a taxi? (I still do; my kid never will.) Not sure how old you are and whether you remember “facebook will never be able to monetize on mobile…” - they all lose money, until they don't.
Anyone remember the dotcom bust?
Oh yeah, I do. That whole internet thing was a total HOAX. I can't believe people bought into that.
Can you imagine if Amazon, eBay, PayPal, or Salesforce existed today?
Well, how is your Solaris installation going?
I also remember having gone into research, because there were no jobs available, and even though I was employed at the time, our salaries weren't being paid.
What does this even mean?
> Anyone remember the dotcom bust?
Yeah. That's not an explanation. But thanks. You're awesome.
Remember that thing that caused it? That "Internet" thing? After those companies went bust it pretty much disappeared didn't it.
Completely detached from reality, brainwashed SV VC's who have made dumping the norm in their bubble.
I can guarantee you that 90% of successful businesses in the world made a profit their first year.
1 year seems aggressive. Successful restaurants average around a year to break even, with the vast majority landing between 6 and 18 months.
They are making a profit on each sale, but there are fixed costs to running a business.
I’ll bite. Share your data?
Companies that were not profitable in their first year: Microsoft, Google, SpaceX, airBnB, Uber, Apple, FedEx, Amazon.
If the vast majority of companies are immediately profitable, why do we have VC and investment at all? Shouldn't the founders just start making money right away?
> Companies that were not profitable in their first year: Microsoft, Google, SpaceX, airBnB, Uber, Apple, FedEx, Amazon.
US Big Tech, US Big Tech, US Tech-adjacent, US Big Tech, US Big Tech, US Big Tech, FedEx, US Tech-adjacent.
In other words, exactly what I was getting at.
Also, a basic search shows Microsoft to have been profitable first year. I'd be very surprised if they weren't. Apple also seems to have taken less than 2 years. And unsurprisingly, these happen to be the only two among the tech companies you named that launched before 1995.
Check out the Forbes Global 5000. Then go think about the hypothetical Forbes Global 50,000. Is the 50,000th most successful company in the world not successful? Of course not, it's incredibly successful.
> why do we have VC and investment at all
Out of all companies started in 2024 I can guarantee you that <0.01% have received VC investment by now (Feb 2026) and <1% of tech companies did. I'll bet my house on it.
I love how your comment is getting downvoted.
Like it's a surprise that startups burn through money. I get the feeling that people really have no idea what they're talking about in here anymore.
It's a shame.
Then you are misunderstanding the downvoting. It's not the fact that they are burning money; it's the fact that this costs $20k today, but that is not the real cost once you factor in that they are losing money at this price.
So tomorrow, when this "startup" needs to come out of its money-burning phase, like every startup has to sooner or later, that cost will increase, because there is no other monetising avenue, at least not for Anthropic, which "will never use ads".
At $20k this "might" be a reasonable cost for "the project"; at $200k it might not.
That would be insightful if the cost of inference weren’t declining at roughly 90% per year. Source: https://epoch.ai/data-insights/llm-inference-price-trends
According to that article, the data they analyzed was API prices from LLM providers, not their actual cost to perform the inference. From that perspective, it's entirely possible to make "the cost of inference" appear to decline by simply subsidizing it more. The authors even hint at the same possibility in the overview:
> Note that while the data insight provides some commentary on what factors drive these price drops, we did not explicitly model these factors. Reduced profit margins may explain some of the drops in price, but we didn’t find clear evidence for this.
Source that they’re losing money on each token?
Are we forgetting that sometimes, they just go bankrupt?
Name one with a comparable number of users and revenue? Not saying you are wrong, but I would bet against that outcome.
I'll be able to do just that in 36mo or so after the IPOs and subsequent collapse, I think.
Enron
I should have guessed someone would answer this question in this thread with Enron :)
I did not ask for random company that went under for any reason but specific question related to users and revenue.
Well there are lots and lots of examples that don't end in bankruptcy, just a very large loss of capital for investors. The majority of the stars of the dotcom bubble just as one example: Qualcomm, pets.com, Yahoo!, MicroStrategy etc etc.
Uber, which you cite as a success, is only just starting to make any money, and any original investors are very unlikely to see a return given the huge amounts ploughed in.
MicroStrategy has transformed itself, same company, same founder, similar scam 20 years later, only this time they're peddling bitcoin as the bright new future. I'm surprised they didn't move on to GAI.
Qualcomm is now selling itself as an AI first company, is it, or is it trying to ride the next bubble?
Even if GAI becomes a roaring success, the prominent companies now are unlikely to be those with lasting success.
The "out of distribution" test would be like "implement (self-bootstrapping, Linux kernel compatible) C compiler in J." J is different enough from C and I know of no such compiler.
> This is an "in distribution" test. There are a lot of C compilers out there, including ones with git history, implemented from scratch. "In distribution" tests do not test generalization.
It's still really, really impressive though.
Like, economics aside this is amazing progress. I remember GPT3 not being able to hold context for more than a paragraph, we've come a long way since then.
Hell, I remember bag of words being state of the art when I started my career. We have come a really, really, really long way since then.
Much like quoting Quake code almost verbatim not so long ago.
> Do we know how many attempts were done to create such compiler before during previous tests? Would Anthropic report on the failed attempt? Can this "really, really impressive" thing be a result of a luck?
No we don't and yeah we would expect them to only report positive results (this is both marketing and investigation). That being said, they provide all the code et al for people to review.
I do agree that an out-of-distribution test would be super helpful, but given what we know about LLMs it would almost certainly fail, so I'm not too pushed about it.
Look, I'm pretty sceptical about AI boosting, but this is a much better attempt than the Windsurf browser thing from a few months back, and it's interesting to know that one can get this to work.
I do note that the article doesn't talk much about all the harnesses needed to make this work, which assuming that this approach is plausible, is the kind of thing that will be needed to make demonstrations like this more useful.
[1] https://en.wikipedia.org/wiki/Leakage_(machine_learning)
This question is extremely important because test set leakage leads to impressive-looking results that do not generalize to anything at all.
> This is matter of methodology. If they train models on that task or somewhat score/select models on their progress on that task, then we have test set leakage [1].
I am quite familiar with leakage, having been building statistical models for maybe 15+ years at this point.
However, that's not really relevant in this particular case given that LLMs are trained on approximately the entire internet, so leakage is not really a concern (as there is no test set, apart from the tasks they get asked to do in post-training).
I think it's impressive that this works at all, even if it's just predicting tokens (which is basically what they're trained to do), as it's a pointer towards potentially more useful tasks (convert this COBOL code base to Java, for instance).
I think the missing bit here is that this only works for cases where there's a really large test set (the HTML spec, the Linux kernel). I'm not convinced the models would be able to maintain coherence without this, so maybe that's what we need to figure out how to build to make this actually work.
Here's, for example, the VHDL test suite for GHDL, an open-source VHDL compiler and simulator: https://github.com/ghdl/ghdl/tree/master/testsuite
The GHDL test suite is sufficient and general enough to develop a pretty capable clone, to my knowledge. To my knowledge, there is only one open-source VHDL compiler, and it is written in Ada. And, again, the expertise to implement another one from scratch to train an LLM on is very, very scarce - VHDL, being a highly parallel variant of Ada, is quirky as hell.
So someone can test your hypothesis on the VHDL - agent-code a VHDL compiler and simulator in Rust so that it passes GHDL test suite. Would it take two weeks and $20,000 as with C? I don't know but I really doubt so.
There are two compilers that can handle the Linux kernel. GCC and LLVM. Both are written in C, not Rust. It's "in distribution" only if you really stretch the meaning of the term. A generic C compiler isn't going to be anywhere near the level of rigour of this one.
There is tinycc, that makes it three compilers.
There is a C compiler implemented in Rust from scratch: https://github.com/PhilippRados/wrecc/commits/master/?after=... (the very beginning of commit history)
There are several C compilers written in Rust from scratch of comparable quality.
We do not know whether Anthropic has a closed source C compiler written in Rust in their training data. We also do not know whether Anthropic validated their models on their ability to implement C compiler from scratch before releasing this experiment.
That language J I proposed does not have any C compiler implemented in it at all. Idiomatic J expertise is scarce and expensive so that it would be a significant expense for Anthropic to have C compiler in J for their training data. Being Turing-complete, J can express all typical compiler tips and tricks from compiler books, albeit in an unusual way.
TinyCC can't compile a modern linux kernel. It doesn't support a ton of the extensions they use. That Rust compiler similarly can't do it.
How does 20K to replicate code available in the thousands online (toy C compilers) prove anything? It requires a bunch of caveats about things that don't work, it requires a bunch of other tools to do stuff, and an experienced developer had to guide it pretty heavily to even get that lackluster result.
HEY ALL!
I have to stop participating in this conversation. Some helpful people from the internet have begun to send me threatening email messages.
Thanks HN. You're PURE AWESOME!
Only if we take them at their word. I remember thinking things were in a completely different state when Amazon had their shop and go stores, but then finding out it was 1000s of people in Pakistan just watching you via camera.
I will write you a C compiler by hand for $19k and it will be better than what Claude made.
Writing a toy C compiler isn't that hard. Any decent programmer can write one in a few weeks or months. The optimizations are the actually interesting part and Claude fails hard at that.
> optimizations aren't as good as the 40 year gcc project
with all optimizations disabled:
> Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
That distinction doesn't change my point. I am not surprised that a 40 year old project generates better code than this brand new one.
Not only is it new. There has been 0 performance optimization done. Well none prompted for at least. Once you give the agents a profiler and start a loop focusing on performance you'll see it start improving it.
We are talking about a compiler here, and the "performance" referred to above is the performance of the generated code.
When you are optimizing a program, you have a specific part of code to improve. The part can be found with profiler.
When you are optimizing a compiler generated code, you have many similar parts of code in many programs and not-so-specific part of compiler that can be improved.
Yes, performance of the generated code. You have some benchmark using a handful of common programs going through common workflows, and you measure the performance of the generated code. As tweaks are made you see how the different performance experiments affect the overall performance. Some strategies are always a win, but things like how you lay out different files and functions in memory have different trade-offs and are hard to know up front without doing actual real-world testing.
https://en.wikipedia.org/wiki/Privatization_(computer_progra...
To correctly apply privatization one has to have correct dependency analysis. This analysis uses results of many other analyses, for example, value range analysis, something like Fourier-Motzkin algorithm, etc.
So this agentic-optimized compiler has a program where privatization is not applied, what tweaks should agents apply?
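For concreteness, a hand-written sketch (mine, not from the article) of scalar privatization: in the first loop the temporary `t` is a single shared location, creating loop-carried output/anti dependences; dependence analysis must prove every read of `t` is preceded by a write in the same iteration before the compiler may rewrite it as the second form.

```
/* Before: `t` is shared across iterations, so the loop can't safely be
   parallelized as written, even though each iteration only uses its own value. */
void scale_shared(int n, const int *a, int *b)
{
    int t;
    for (int i = 0; i < n; i++) {
        t = a[i] * a[i];
        b[i] = t + 1;
    }
}

/* After privatization: each iteration gets its own `t`, the loop-carried
   output/anti dependences disappear, and the loop is trivially parallel. */
void scale_private(int n, const int *a, int *b)
{
    for (int i = 0; i < n; i++) {
        int t = a[i] * a[i];
        b[i] = t + 1;
    }
}
```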
It cost $20,000 to reinvent a wheel that it was probably trained on. If that's your definition of legit, sure.
Well, if today it is a matter of cost, tomorrow it won't be anymore. 4GB of RAM in the 80s would have cost tens of millions of dollars; now even your car runs 4GB of memory just for the infotainment system, and dozens of GBs of RAM for the most complex assistants. So I would see this achievement more as a warning: the final result is not what's concerning, it is what it portends.
GCC had a 40-year head start.
I’m excited and waiting for the team that shows with $20k in credits they can substantially speed up the generated code by improving clang!
i'm sorry but that will take another $20 billion in AI capex to train our latest SOTA model so that it will cost $20k to improve the code.
Claude did not write it. You wrote it, with PREVIOUS EXPERIENCE, with 20,000 long commands yelling at it exactly what to do.
Real usable AI would create it from a simple prompt: 'make a C99 compiler faster than GCC'.
AI usage should be banned in general. It takes jobs faster than it creates new ones...
That's actually pretty funny. They're patting it on the back for using, in all likelihood, some significant portions of code that they actually wrote, which was stolen from them without attribution so that it could be used as part of a very expensive parlour trick.
Did you do diffs to confirm the code was stolen, or are you just speculating?
> AI usage should be banned in general. It takes jobs faster than creating new ones ..
I don't have a strong opinion about that in either direction, but I'm curious: do you feel the same about everything, or is it just about this specific technology? For example, should the nail gun have been forbidden if it were invented today, since one person with a nail gun could probably replace 3-4 people with normal "manual" hammers?
You feel the same about programmers who are automating others out of work without the use of AI too?
> It takes jobs faster than creating new ones ..
You think compiler engineer from Google gives a single shit about this?
They’ll automate millions out of career existence for their amusement while cashing out stock money and retiring early comfortably.
> It takes jobs faster than creating new ones ..
I have no problem with tech making some jobs obsolete; that's normal. The problem is, the work being done by the current generation of LLMs is, at least for now, mostly of inferior quality.
The tools themselves are quite useful as helpers in several domains if used wisely though.
Businesses do not exist to create jobs; jobs are a byproduct.
Even that is underselling it; jobs are a necessary evil that should be minimised. If we can have more stuff with fewer people needing to spend their lives providing it, why would we NOT want that?
Because we've built a system where if you don't have a job, you die.
This is already hyperbolic; in most countries where software engineers or similar knowledge workers are widely employed there are welfare programmes.
To add to that, if there is such mass unemployment in this scenario it will be because fewer people are needed to produce and therefore everything will become cheaper... This is the best kind of unemployment.
So at best: none of us have to work again and will get everything we need for free. At worst, certain professions will need a career switch which I appreciate is not ideal for those people but is a significantly weaker argument for why we should hold back new technology.
If you were to rank all of the C compilers in the world and then rank all of the welfare systems in the world, this vibe-coded mess would be at approximately the same rank as the American welfare system. Especially if you extrapolate this narcissistic, hateful kleptocracy out a few more years.
Did we build it or did nature?
We did.
Jobs are a means, not a goal.
Jobs are the only way that you survive in this society (food, shelter). Look how we treat unhoused people without jobs. AI is taking jobs away and that is putting people's survival at risk.
This is getting close to a Ken Thompson "Trusting Trust" era -- AI could soon embed itself into the compilers themselves.
A pay to use non-deterministic compiler. Sounds amazing, you should start.
Application-specific AI models can be much smaller and faster than the general purpose, do-everything LLM models. This allows them to run locally.
They can also be made to be deterministic. Some extra care is required to avoid computation paths that lead to numerical differences on different machines, but this can be accomplished reliably with small models that use integer math and use kernels that follow a specific order of operations. You get a lot more freedom to do these things on the small, application-specific models than you do when you're trying to run a big LLM across different GPU implementations in floating point.
> They can also be made to be deterministic.
Yeah, in the same way how pseudo-random number generators are "deterministic." They generate the exact same sequence of numbers every time given the seeds are the same!
But that's not the "determinism" people are referring to when they say LLMs aren't deterministic.
Some people care more about compile times than the performance of generated code. Perhaps even the correctness of generated code. Perhaps more so than determinism of the generated code. Different people in different contexts can have different priorities. Trying to make everyone happy can sometimes lead to making no one happy. Thus dichotomies like `-O2` vs `-Os`.
EDIT (since HN is preventing me from responding):
> Some people care more about compiler speed than the correctness?
Yeah, I think plenty of people writing code in languages that have concepts like Undefined Behavior technically don't really care as much about correctness as they may claim otherwise, as it's pretty hard to write large volumes of code without indirectly relying on UB somewhere. What is correct in such case was left up to interpretation of the implementer by ISO WG14.
Some people care more about compiler speed than the correctness? I would love to meet these imaginary people that are fine with a compiler that is straight up broken. Emitting working code is the baseline, not some preference slider.
> I would love to meet these imaginary people that are fine with a compiler that is straight up broken.
That's not what I said; you're attacking a strawman.
My point was more that some people prefer the madness that is -funsafe-math-optimizations, or happen to rely on UB (intentionally or otherwise). What even is "correct" in the presence of UB? What is correct in such cases was left up to the interpretation of the implementer by ISO WG14.
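A minimal sketch of what "relying on UB" looks like in practice (my example, not from the thread): signed overflow is undefined, so a conforming compiler may assume `x + 1` never wraps and fold the whole check to 0 at -O2, surprising code that expected two's-complement wraparound.

```
#include <limits.h>

/* Looks like an overflow check, but relies on UB: signed overflow is
   undefined, so GCC and Clang may (and at -O2 typically do) fold this
   whole function to `return 0;`. */
int will_overflow(int x)
{
    return x + 1 < x;
}

/* A defined way to ask the same question. */
int will_overflow_defined(int x)
{
    return x == INT_MAX;
}
```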
You might have not run Gentoo. Most Gentooers will begrudgingly but eventually admit to cooking their own gonads when updating a laptop.
Anyway, please define: "correctness".
Let's pretend, for just a second, that the people who do, having been able to learn how to program, are not absolute fucking morons. Straight up broken is obviously not useful, so maybe the conclusions you've jumped to could use some reexamination.
a compiler introducing bugs into code it compiles is a nightmare thankfully few have faced. The only thing worse would be a CPU bug like the legendary Pentium bug. Imagine you compile something like Postgres only to have it crash in some unpredictable way. How long do you stare at Postgres source before suspecting the compiler? What if this compiler was used to compile code in software running all over cloud stacks? Bugs in compilers are very bad news, they have to be correct.
> a compiler introducing bugs into code it compiles is a nightmare thankfully few have faced
Is this true? It’s not an everyday thing, but when using less common flags, or code structures, or targets… every few years I run into a codegen issue. It’s hard to imagine going through a career without a handful…
It's not that uncommon if you work in massive lowish level systems. Clang/LLVM being relatively bug free is the result of many corporate big tech low level compiler swes working with the application swes to debug why XYZ isn't working properly and then writing the appropriate fix. But compiler bugs still come up every so often, I've seen it on multiple occasions.
Yeah, my current boss spent time weeding out such hardware bugs: https://arxiv.org/abs/2110.11519 (EDIT: maybe https://x.com/Tesla_AI/status/1930686196201714027 is a more relevant citation)
They found a bimodal distribution in failures over the lifetime of chips. Infant mortality was well understood. Silicon aging over time was much less well understood, and I still find surprising.
What I want to know is when we get AI decompilers
Intuitively it feels like it should be a straightforward training setup - there's lots of code out there, so compile it with various compilers, flags etc and then use those pairs of source+binary to train the model.
We're already starting to see people experimenting with applying AI towards register allocation and inlining heuristics. I think that many fields within a compiler are still ripe for experimentation.
https://llvm.org/docs/MLGO.html
Reminds me of https://www.teamten.com/lawrence/writings/coding-machines/
Hmm, well, there are already embedded in fonts: https://hackaday.com/2024/06/26/llama-ttf-is-ai-in-a-font/
Sorry, clang 26.0 requires an Nvidia B200 to run.
The asymmetry will be between the frontier AI's ability to create exploits vs find them.
would be hard to miss gigantic kv cache matrix multiplications
Then i'll be left wondering why my program requires 512TB of RAM to open
> $20,000 of tokens.
> Less efficient than existing compilers.
what is the ecological cost of producing this piece of software that nobody will ever use?
If you evaluate the cost/benefit in isolation? It’s net negative.
If you see this as part of a bigger picture to improve human industrial efficiency and bring us one step closer to the singularity? Most likely net positive.
With that way of thinking you would just move into a cave.
> I spent a good part of my career (nearly a decade) at Google working on getting Clang to build the linux kernel
Did this come down to making Clang 100% gcc-compatible (extensions, UB, bugs and all), or were there any issues that might be considered specific to the Linux kernel?
Did you end up building a gcc compatability test suite as a part of this? Did the gcc project themselves have a regression/test suite that you were able to use as a starting point?
> extensions
Some were necessary (asm goto), some were not (nested functions, flexible array members not at the end of structs).
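For readers who haven't met these extensions, minimal sketches (my own, not kernel code): `asm goto` lets an inline-assembly block branch to a C label, which the compiler has to model in its control-flow graph, and nested functions are a GCC-only extension.

```
/* asm goto: the label list comes after the fourth colon, and %l[yes]
   refers to it from the assembly template. */
static inline int probe(void)
{
    asm goto("jmp %l[yes]" : : : : yes);
    return 0;
yes:
    return 1;
}

/* Nested function: a GCC extension, not standard C (and, per the above,
   not something the kernel actually needs). */
int outer(int x)
{
    int add_one(void) { return x + 1; }
    return add_one();
}
```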
> UB, bugs and all
Luckily, the kernel didn't intentionally rely on GCC specifics this way. Where it did unintentionally, we fixed the kernel sources properly with detailed commit messages explaining why.
> or were there any issues that might be considered as specific to the linux kernel?
Yes, https://github.com/ClangBuiltLinux/linux/issues is our issue tracker. We use tags extensively to mark if we triage the issue to be kernel-side vs toolchain-side.
> Did you end up building a gcc compatability test suite as a part of this?
No, but some tricky cases LLVM got wrong were distilled from kernel sources using either:
- creduce
- cvise (my favorite)
- bugpoint
- llvm-reduce
and then added to LLVM's existing test suite. Many such tests were also simply manually written.
> Did the gcc project themselves have a regression/test suite that you were able to use as a starting point?
GCC and binutils have their own test suites. Folks in the LLVM community have worked on being able to test clang against GCC's test suite. I personally have never run GCC's test suite or looked at its sources.
Being just a grunt engineer in a product firm I can't imagine being able to spend multiple years on one project. If it's something you're passionate about, that sounds like a dream!
This work originally wasn't my 100% project, it was my 20% project (or as I prefer to call it, 120% project).
I had to move teams twice before a third team was able to say: this work is valuable to us, please come work for us and focus just on that.
I had to organize multiple internal teams, then build an external community of contributors to collaborate on this shared common goal.
Having carte blanche to contribute to open source projects made this feasible at all; I can see that being a non-starter at many employers, sadly. Having low friction to change teams also helped a lot.
I want to verify the claim that it builds the Linux kernel. It quickly runs into errors, but yeah, still pretty cool!
make O=/tmp/linux/x86 ARCH=x86_64 CC=/tmp/p/claudes-c-compiler/target/release/ccc -j30 defconfig all
```
/home/ray/Dev/linux/arch/x86/include/asm/preempt.h:44:184: error: expected ';' after expression before 'pto_tmp__'
  do { u32 pto_val__ = ((u32)(((unsigned long) ~0x80000000) & 0xffffffff)); if (0) { __typeof_unqual__((__preempt_count)) pto_tmp__; pto_tmp__ = (~0x80000000); (void)pto_tmp__; } asm ("and" "l " "%[val], " "%" "[var]" : [var] "+m" (((__preempt_count))) : [val] "ri" (pto_val__)); } while (0);
  ^~~~~~~~~
  fix-it hint: insert ';'
/home/ray/Dev/linux/arch/x86/include/asm/preempt.h:49:183: error: expected ';' after expression before 'pto_tmp__'
  do { u32 pto_val__ = ((u32)(((unsigned long) 0x80000000) & 0xffffffff)); if (0) { __typeof_unqual__((__preempt_count)) pto_tmp__; pto_tmp__ = (0x80000000); (void)pto_tmp__; } asm ("or" "l " "%[val], " "%" "[var]" : [var] "+m" (((__preempt_count))) : [val] "ri" (pto_val__)); } while (0);
  ^~~~~~~~~
  fix-it hint: insert ';'
/home/ray/Dev/linux/arch/x86/include/asm/preempt.h:61:212: error: expected ';' after expression before 'pao_tmp__'
```
They said it builds Linux 6.9, maybe you are trying to compile a newer version there?
git switch v6.9
The riscv build succeeded. For the x86-64 build I ran into
There are many other errors. tinyconfig and allnoconfig have fewer errors.
Still very impressive. I feel like I could have done this in a much shorter time, for far fewer tokens, but still very impressive!
They said that it wasn't able to support 16 bit real mode. Needs to call gcc for that.
Isn't the AI basing what it does heavily on the publicly available source code for compilers in C though? Without that work it would not be able to generate this would it? Or in your opinion is it sufficiently different from the work people like you did to be classed as unique creation?
I'm curious on your take on the references the GAI might have used to create such a project and whether this matters.
> I spent a good part of my career (nearly a decade) at Google working on getting Clang to build the linux kernel.
How much of that time was spent writing the tests that they found to use in this experiment? You (or someone like you) were a major contributor to this. All Opus had to do here was keep brute forcing a solution until the tests passed.
It is amazing that it is possible at all, but it remains an impossibility without a heavy human hand. One could easily still spend a good part of their career reproducing this if they first had to rewrite all of the tests from scratch.
What were the challenges, out of interest? Was some of it the use of gcc extensions, which needed an equivalent in Clang and porting the kernel over to that equivalent?
`asm goto` was the big one. The x86_64 maintainers broke the clang builds very intentionally just after we had gotten x86_64 building (with necessary patches upstreamed) by requiring compiler support for that GNU C extension. This was right around the time of meltdown+spectre, and the x86_64 maintainers didn't want to support fallbacks for older versions of GCC (and ToT Clang at the time) that lacked `asm goto` support for the initial fixes shipped under duress (embargo). `asm goto` requires plumbing throughout the compiler, and I've learned more about register allocation than I particularly care...
Fixing some UB in the kernel sources, lots of plumbing to the build system (particularly making it more hermetic).
Getting the rest of the LLVM binutils substitutes to work in place of GNU binutils was also challenging. Rewriting a fair amount of 32b ARM assembler to be "unified syntax" in the kernel. Linker bugs are hard to debug. Kernel boot failures are hard to debug (thank god for QEMU+gdb protocol). Lots of people worked on many different parts here, not just me.
Evangelism and convincing upstream kernel developers why clang support was worth anyones while.
https://github.com/ClangBuiltLinux/linux/issues for a good historical perspective. https://github.com/ClangBuiltLinux/linux/wiki/Talks,-Present... for talks on the subject. Keynoting LLVM conf was a personal highlight (https://www.youtube.com/watch?v=6l4DtR5exwo).
i mean… your work also went into the training set, so it's not entirely surprising that it spat a version back out!
Anthropic's version is in Rust though, so at least a little different.
There are parts of LLVM's architecture that are long in the tooth, IMO (as is the language it's implemented in, IMO).
I had hoped one day to re-implement parts of LLVM itself in Rust; in particular, I've been curious whether we can concurrently compile C (and parse C in parallel, or lazily) in ways that haven't been explored in LLVM, and that I think might be safer to do in Rust. I don't know enough about grammars to know whether it's technically impossible, but a healthy dose of ignorance can sometimes lead to breakthroughs.
LLVM is pretty well designed for testing. I was able to implement a lexer for C in Rust that could lex the Linux kernel, and use clang to cross-check my implementation (I would compare my interpretation of the token stream against clang's). Just having a standard module system with reusable pieces seems like perhaps a better way to compose a toolchain, but maybe folks with more experience with rustc have scars that say otherwise?
> I had hoped one day to re-implement parts of LLVM itself in Rust
Heh, earlier this day, I was just thinking how crazy a proposal would it actually be to have a Rust dependency (specifically, the egg crate, since one of the things I'm banging my head against right now might be better solved with egraphs).
One thing LLMs are really good at is translation. I haven’t tried porting projects from one language to another, but it wouldn’t surprise me if they were particularly good at that too.
As someone who has done that in a professional setting, it really does work well, at least for straightforward things like data classes/initializers and average biz logic with if/else statements etc... things like code annotations and other more opaque stuff can get more unreliable though, because there are fewer 1-to-1 representations... it would be interesting to train an LLM for each newly encountered pattern and slowly build up a reliable conversion workflow.
It's not really important in latent space / conceptually.
This is the proper deep critique / skepticism (or sophisticated goal-post moving, if you prefer) here. Yes, obviously this isn't just reproducing C compiler code in the training set, since this is Rust, but it is much less clear how much of the generated Rust code can (or can not) be accurately seen as being translated from C code in the training set.
Clang is not written in Rust tho
jinx
>Is the generated code correct? The jury is still out on that one for production compilers. And then you have performance of generated code.
It's worth noting that this was developed by compiling Linux and running tests, so at least that is part of the training set and not the testing set.
But at least for linux, I'm guessing the tests are very robust and I'm guessing that will work correctly. That said, if any bugs pop up, it will show weak points in the linux tests.
> This LLM did it
You do realize the LLM had access to (via its training set) and "reused" (not as-is, of course) your own work, right?
It's cool, but there's a good chance it's just copying someone else's homework, albeit in an elaborate, roundabout way.
I would claim that LLMs desperately need proprietary code in their training, before we see any big gains in quality.
There's some incredible source-available code out there. Statistically, I think there's a LOT more not-so-great source-available code out there, because the majority of the output of seasoned/high-skill developers is proprietary.
To me, a surprising portion of Claude 4.5 output definitely looks like student homework answers, because I think that's closer to the mean of the code population.
This is dead wrong: essentially the entirety of the huge gains in coding performance in the past year have come from RL, not from new sources of training data.
I echo the other commenters that proprietary code isn’t any better, plus it doesn’t matter because when you use LLMs to work on proprietary code, it has the code right there.
Author attributes past year's degradation of code generation by LLMs to excessive use of new source of training data, namely, users' code generation conversations.
Yeah, this is a bullshit article. There is no such degradation, and it’s absurd to say so on the basis of a single problem which the author describes as technically impossible. It is a very contrived under-specified prompt.
And their “explanation” blaming the training data is just a guess on their part, one that I suspect is wrong. There is no argument given that that’s the actual cause of the observed phenomenon. It’s a just-so story: something that sounds like it could explain it but there’s no evidence it actually does.
My evidence that RL is more relevant is that that's what every single researcher and frontier-lab employee I've heard speak about LLMs in the past year has said. I have never once heard any of them mention new sources of pretraining data, except maybe synthetic data they generate and verify themselves, which contradicts the author's story because it's not shitty code grabbed off the internet.
[1] https://openreview.net/forum?id=4OsgYD7em5
The sources of training data were already the subject of allegations, even leading to lawsuits. So I would suspect that no engineer from any LLM company would disclose anything about their sources of training data besides innocent-sounding "synthetic data verified by ourselves." From the days when I worked on blockchains, I am very skeptical about any company riding any hype. They face enormous competition, and they will buy, borrow, or steal their way to keep from going down even a little. So, until Anthropic opens up the way they train their model so that we can reproduce their results, I will suspect they leaked the test set into it and used users' code-generation conversations as a new source of training data.
That is not what No True Scotsman is. I’m pointing out a bad argument with weak evidence.
The article about degradation is a case study (a single prompt), the weakest kind of study in the hierarchy of evidence. Case studies are the basis for further, more rigorous studies. And the author took the time to test his assumptions and presented fairly clear evidence that such degradation might be present and that we should investigate.
> it doesn’t matter because when you use LLMs to work on proprietary code, it has the code right there
The quality of the existing code base makes a huge difference. On a recent greenfield effort, Claude emitted an MVP that matched the design semantics, but the code was not up to standards. For example, it repeatedly loaded a large file into memory in each place it was needed (rather than loading it once and passing a reference around).
However, after an early refactor, the subsequently generated code vastly improved. It honors the testing and performance paradigms, and it's so clean there's nothing for the linter to do.
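As a toy illustration of the kind of refactor I mean (the file names and helpers here are hypothetical, not the actual project):

    use std::fs;

    // Before (roughly): every helper re-read the same large file on its own.
    fn line_count_bad(path: &str) -> usize {
        let data = fs::read_to_string(path).expect("read failed"); // loaded here...
        data.lines().count()
    }
    fn word_count_bad(path: &str) -> usize {
        let data = fs::read_to_string(path).expect("read failed"); // ...and again here
        data.split_whitespace().count()
    }

    // After: load once, pass a reference to whoever needs it.
    fn line_count(data: &str) -> usize { data.lines().count() }
    fn word_count(data: &str) -> usize { data.split_whitespace().count() }

    fn main() {
        let data = fs::read_to_string("big_input.txt").expect("read failed");
        println!("{} lines, {} words", line_count(&data), word_count(&data));
    }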
Progress with RL is very interesting, but it's still too inefficient. Current models do OK on simple boring linear code. But they output complete nonsense when presented with some compact but mildly complex code, e.g. a NumPyro model with some nesting and einsums.
For this reason, to be truly useful, model outputs need to be verifiable. Formal verification with languages like Dafny, F*, or Isabelle might offer some solutions [1]. Otherwise, a gigantic software artifact such as a compiler is going to have critical correctness bugs with far-reaching consequences if deployed in production.
Right now, treating an LLM as something other than a very useful information-retrieval system with excellent semantic capabilities is not something I'm comfortable with.
[1] https://risemsr.github.io/blog/2026-02-04-nik-agentic-pop
Human-written compilers have bugs too! It takes decades of use to iron them out, and we’re introducing new ones all the time.
I will say many closed-source repos are probably just as poor as open-source ones.
Even worse in many cases, because they are so over-engineered that nobody understands how they work.
I firmly agree with your first sentence. I can just think about the various modders that have created patches and performance enhancing mods for games with budgets of tens to hundreds of millions of dollars.
But to give other devs and myself some grace, I do believe plenty of bad code can likely be explained by bad deadlines. After all, what's the Russian idiom? "There is nothing more permanent than the temporary."
I'd bet, on average, the quality of proprietary code is worse than open-source code. There have been decades of accumulated slop generated by human agents with wildly varied skill levels, all vibe-coded by ruthless, incompetent corporate bosses.
There are only a few very niche fields where closed-source code quality is often better than open-source code.
Exploits and HFT are the two examples I can think of. Both are usually closed source because of the financial incentives.
Here we can start debating what "better code" means.
I haven't seen HFT code, but I have seen examples of exploit code, and most of it is amateur hour when it comes to building large systems.
They are of course efficient at getting to the goal. But exploits are one-off code that isn't there to be maintained.
Not to mention, a team member is (surprise!) fired or let go, and no knowledge transfer exists. Womp, womp. Codebase just gets worse as the organization or team flails.
Seen this way too often.
Developers are often treated as cogs. Anyone should be able to step in and pick things up instantly. It's just typing, right? /s
In my time, I have potentially written code that some legal jurisdictions might classify as a "crime against humanity" due to the quality.
It doesn't matter what the average is, though. If 1% of software is open source, there is significantly more closed-source software out there, and given normal skill distributions, that means there is at least as much high-quality closed-source software out there, if not significantly more. The trick is skipping the 95% that is crap.
Yeah, but isn't the whole point of Claude Code to get people to provide preference/telemetry data to Anthropic (unless you opt out)? Same with other providers.
I'm guessing most of the gains we've seen recently are from post-training rather than pretraining.
Yes, but you have the problem that a good portion of that is going to be AI generated.
But, I naively assume most orgs would opt out. I know some orgs have a proxy in place that will prevent certain proprietary code from passing through!
This makes me curious if, in the allow case, Anthropic is recording generated output, to maybe down-weight it if it's seen in the training data (or something similar)?
Let's start with the source code for the Flash IDE :)
This is cool and actually demonstrates real utility. Using AI to take something that already exists and create it for a different library / framework / platform is cool. I'm sure there's a lot of training data in there for just this case.
But I wonder how it would fare if given a language specification for a non-existent, non-trivial language and asked to build a compiler for that instead?
If you come up with a realistic language spec and wait maybe six months, by then it'll probably be approaching cheap enough that you could test the scenario yourself!
It looks like a much more progressed/complete version of https://github.com/kidoz/smdc-toolchain/tree/master/crates/s... . But that one is only a month old. So a bit confused there. Maybe that was also created via LLM?
I see that as the point that all this is proving - most people, most of the time, are essentially reinventing the wheel at some scope and scale or another, so we’d all benefit from being able to find and copy each others’ homework more efficiently.
And the goal post shifts.
A small thing, but it won't compile the RISC-V version of hello.c if the source isn't installed on the machine it's running on.
It is standing on the shoulders of giants (all of the compilers of the past baked into its training data, plus the recent learnings about getting these agents to break up tasks) to get itself going. Still fairly impressive.
On a side quest, I wonder where Anthropic is getting their power from. The whole energy debacle in the US at the moment probably means this made some CO2 in the process. Would be hard to avoid?
Also: a large number of folks seem to think Claude Code is losing a ton of money. I have no idea where the final numbers land; however, if the $20,000 figure is accurate, then based on some of the estimates I've seen, they could've hired 8 senior-level developers at a quarter million a year for the same amount of money spent internally.
Granted, marketing sucks up far too much money at any startup, and again, we don't know the actual numbers in play; however, this is something to keep in mind. (The very same marketing that likely also wrote the blog post, FWIW.)
This doesn't add up. The $20k is in API costs; people talk about CC losing money because it's way more efficient than the API, i.e., the same work with efficient use of CC might have cost ~$5k.
But regardless, hiring is difficult and high-end talent is limited. If the costs were anywhere close to equivalent, the agents are a no-brainer.
> hiring is difficult and high-end talent is limited.
Not only that, but firing talent is also a pain. You can't "hire" 10 devs for 2 weeks and fire them afterwards. At least, you can't keep doing that; people talk, and no one would apply.
CC hits their APIs, and internally I'm sure Anthropic tracks those calls, which is what they seem to be referencing here. What exactly did Anthropic do in this test to have "inefficient use of CC" vs. your proposed "efficient use of CC"?
Or do you mean that if an external user replicated this experience they might get billed less than $20k due to CC being sold at lower rates than per-API-call metered billing?
Even if the dollar cost for product created was the same, the flexibility of being able to spin a team up and down with an API call is a major advantage. That AI can write working code at all is still amazing to me.
This thing was done in 2 weeks. In the orgs I've worked in, you'd be lucky to get HR approval to create a job posting within 2 weeks.