663 comments
  • ndesaulniers1d

    I spent a good part of my career (nearly a decade) at Google working on getting Clang to build the linux kernel. https://clangbuiltlinux.github.io/

    This LLM did it in (checks notes):

    > Over nearly 2,000 Claude Code sessions and $20,000 in API costs

    It may build, but does it boot? (That was also a significant and distinct next milestone. Also, will it blend?) Looks like yes!

    > The 100,000-line compiler can build a bootable Linux 6.9 on x86, ARM, and RISC-V.

    The next milestone is:

    Is the generated code correct? The jury is still out on that one for production compilers. And then there's the performance of the generated code.

    > The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.

    Still a really cool project!

    • brundolf8h

      One thing people have pointed out is that well-specified (even if huge and tedious) projects are an ideal fit for AI, because the loop can be fully closed and it can test and verify the artifact by itself with certainty. Someone was saying they had it generate a rudimentary JS engine because the available test suite is so comprehensive

      Not to invalidate this! But it's toward the "well-suited for AI" end of the spectrum

      • HarHarVeryFunny7h

        Yes - the gcc "torture test suite" that is mentioned must have been one of the enablers for this.

        It's notable that the article says Claude was unable to build a working assembler (and linker), which is nominally a much simpler task than building a compiler. I wonder if this was at least in part due to not having a test suite, although it seems one could be auto-generated during bootstrapping with gas (the GNU assembler) by creating gas-generated (asm, ELF) pairs as the necessary test suite.
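
        Something like this for the pair generation, say (assuming `as` is on PATH and a corpus/ directory of .s files exists - both my assumptions, not anything from the article):

          import glob, subprocess

          # For each corpus .s file, have gas produce the reference ELF object.
          # The resulting (asm, ELF) pairs become the assembler's test suite:
          # feed each .s into the new assembler and diff against gas's output.
          for src in glob.glob("corpus/*.s"):
              ref = src.replace(".s", ".ref.o")
              subprocess.run(["as", src, "-o", ref], check=True)

        A byte-for-byte diff is probably too strict, since equivalent encodings exist; disassembling both objects and comparing that output would be a more forgiving oracle.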

        It does raise the question of how they got the compiler to the point of correctly generating a valid C -> asm mapping before tackling the issue of gcc compatibility, since the generated code apparently has no relation to what gcc generates. I wonder which compilers' source code Claude has been trained on, and how closely this compiler's code generation and attempted optimizations compare to theirs.

        • spullara6h

          I'm sure Claude has been trained on every open source compiler.

    • shakna1d

      > Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase

      Does it really boot...?

      • ndesaulniers1d

        > Does it really boot...?

        They don't need 16-bit x86 support for the RISC-V or ARM ports, so yes, but it depends on what 'it' we're talking about here.

        Also, FWIW, GCC doesn't directly assemble to machine code either; it shells out to GAS (GNU Assembler). This blog post calls it "GCC assembler and linker" but to be more precise the author should edit this to "GNU binutils assembler and linker." Even then GNU binutils contains two linkers (BFD and GOLD), or did they excise GOLD already (IIRC, there was some discussion a few years ago about it)?

        • shakna24h

          Yeah, didn't mention gas or ld, for similar reasons. I agree that a compiler doesn't necessarily "need" those.

          I don't agree that all the claims are backed up by their own comments, which means there are probably other places where it falls down.

          It's... misrepresentation.

          Like Chicken is a Scheme compiler. But they're very up front that it depends on a C compiler.

          Here, they wrote a C compiler that is at least sometimes reliant on having a different C compiler around. So is the project at 50%? 75%?

          Even if it's 99%, that's not the same story as the one they tried to write. And if they'd written that tale instead, it would be more impressive, rather than "There are some holes. How many?"

          • Philpax23h

            Their C compiler is not reliant on having another C compiler around. Compiling the 16-bit real mode bootstrap for the Linux kernel on x86(-64) requires another C compiler; you certainly don't need another compiler to compile the kernel for another architecture, or to compile another piece of software not subject to the 32k constraint.

            The compiler itself is entirely functional; it just can't generate code optimal enough to fit within the constraints for that very specific (tiny!) part of the system, so another compiler is required to do that step.

      • TheCondor18h

        The assembler seems like nearly the easiest part. Slurp the arch manuals and knock it out; it's fixed and complete.

        • jakewins5h

          I am surprised by the number of comments that say the assembler is trivial - it is admittedly perhaps simpler than some other parts of the compiler chain, but it’s not trivial.

          What you are doing is kinda serialising a self-referential graph structure of machine code entries that reference each other's addresses, but you don't know the addresses because the (x86) instructions are variable-length, so you can't know them until you generate the machine code - a chicken-and-egg problem.

          Personally I find writing parsers much much simpler than writing assemblers.
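
          To illustrate the chicken-and-egg loop, here's a toy sketch of the usual workaround (all sizes made up - real x86 encoding is far messier): start every jump short, lay out addresses, widen any jump whose target is out of range, and repeat until nothing moves.

            # Toy relaxation loop: pretend a jump is 2 bytes if its target
            # is within +/-127 bytes and 5 bytes otherwise. Offsets depend
            # on sizes and sizes depend on offsets, so iterate, only ever
            # widening, until the layout stabilises.
            instrs = [("jmp", "end"), ("nop", None)] * 50 + [("label", "end")]
            sizes = [2 if op == "jmp" else (0 if op == "label" else 1)
                     for op, _ in instrs]
            changed = True
            while changed:
                changed = False
                addr, labels = 0, {}
                for (op, arg), size in zip(instrs, sizes):  # lay out addresses
                    if op == "label":
                        labels[arg] = addr
                    addr += size
                for i, (op, arg) in enumerate(instrs):      # widen as needed
                    if op == "jmp":
                        here = sum(sizes[:i + 1])  # address after this jump
                        if abs(labels[arg] - here) > 127 and sizes[i] == 2:
                            sizes[i], changed = 5, True

          gas calls this "relaxation", and it's part of why assemblers are classically at least two-pass.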

          • nicebyte3h

            The assembler is far from trivial, at least for x86, where there are many possible encodings for a given instruction. Emitting the most optimal encoding that does the correct thing depends on surrounding context, and you'd have to do multiple passes over the input.

        • shakna18h

          Huh. A second person mentioning the assembler. Don't think I ever referred to one...?

    • qarl22h

      > Still a really cool project!

      Yeah. This test sorta definitely proves that AI is legit. Despite the millions of people still insisting it's a hoax.

      The fact that the optimizations aren't as good as the 40-year-old gcc project? Eh - I think people who focus on that are probably still in some serious denial.

      • PostOnce21h

        It's amazing that it "works", but viability is another issue.

        It cost $20,000 and it worked, but it's also totally possible to spend $20,000 and have Claude shit out a pile of nonsense. You won't know until you've finished spending the money whether it will fail or not. Anthropic doesn't sell a contract that says "We'll only bill you if it works" like you can get from a bunch of humans.

        Do catastrophic bugs exist in that code? Who knows, it's 100,000 lines, it'll take a while to review.

        On top of that, Anthropic is losing money on it.

        All of those things combined, viability remains a serious question.

        • ryanjshaw8h

          > You won't know until you've finished spending the money whether it will fail or not.

          How do you conclude that? You start off with a bunch of tests and build these things incrementally, why would you spend 20k before realizing there’s a problem?

          • friendzis6h

              Because literally no real-world non-research project starts with "we have an extremely comprehensive test suite and a specification complete down to the finest detail" and then searches for a way to turn it into code.

            • ryanjshaw3h

              I’ve spent nearly 20 years working as a consultant writing software, I know that. How do you think humans solve that problem?

            • galdauts4h

              Precisely. Figuring out what the specification is supposed to look like is often the hardest part.

        • qarl20h

          > It cost $20,000

          I'm curious - do you have ANY idea what it costs to have humans write 100,000 lines of code???

          You should look it up. :)

          • PostOnce12h

            That's irrelevant in this context, because it's not "get the humans to make a working product OR get the AI to make a working product"

            The problem is you may pay $20K for gibberish, then try a second time, fail again, and then hire humans.

            Coincidentally yes, I am aware, my last contract was building out a SCADA module the AI failed to develop at the company that contracted me.

            I'm using that money to finance a new software company, and so far, AI hasn't been much help getting us off the ground.

            Edit: oh yeah, and on top of paying Claude to fuck it up, you still have to also pay the salary of the guy arguing with Claude.

            • AlfeG10h

              > The problem is you may pay $20K for gibberish, then try a second time, fail again, and then hire humans.

              You can easily pay humans $20k a day and get gibberish as output. Heck, this happens all the time. It's happening right now in multiple companies.

              Yes, sometimes humans produce nice code. It happens from time to time...

          • lelanthran17h

            > > It cost $20,000

            > I'm curious - do you have ANY idea what it costs to have humans write 100,000 lines of code???

            I'll bite - I can write you an unoptimised C compiler that emits assembly for $20k, and it won't be 100k lines of code (maybe 15k, the last time I did this?).

            It won't take me a week, though.

            I think this project is a good frame of reference and matches my experience - vibing with AI is sometimes more expensive than doing it myself, and always results in much more code than necessary.

            • flakiness17h

              Does it support x64, x86_64, arm64 and riscv? (sorry, just trolling - we don't know the quality of any backend other than x86_64, which is supposed to be able to build bootable linux.)

              • lelanthran17h

                It's not hard to build a compiler just for a bootable linux.

                I see no test criteria that actually runs that built linux through various test plans, so, yeah, emitting enough asm just to boot is doable.

            • p-e-w17h

              > I can write you an unoptimised C compiler that emits assembly for $20k

              You may be willing to sell your work at that price, but that’s not the market rate, to put it very mildly. Even 10 times that would be seriously lowballing in the realm of contract work, regardless of whether it’s “optimised” or not (most software isn’t).

              • lelanthran17h

                > You may be willing to sell your work at that price, but that’s not the market rate, to put it very mildly.

                It is now.

                At any rate, this is my actual rate. I live in South Africa, and that's about 4 weeks of work for me, without an AI.

                • zingar5h

                  That’s a VERY nice rate for SA; approximately what I charge in the UK. I assume these are not local companies who hire you.

                  • lelanthran4h

                    > That’s a VERY nice rate for SA; approximately what I charge in the UK. I assume these are not local companies who hire you.

                    A local Fintech needing PCI work pays that, but that's not long-term contracts.

                • qarl12h

                  Deal. I'll pay you IF you can achieve the same level of performance. Heck, I'll double it.

                  You must provide the entire git history with small commits.

                  I won't be holding my breath.

                  • lelanthran11h

                    > Deal. I'll pay you IF you can achieve the same level of performance. Heck, I'll double it.

                    > You must provide the entire git history with small commits.

                    > I won't be holding my breath.

                    Sure; I do this often (I operate as a company because I am a contractor) - money to be held in escrow, all the usual contracts, etc.

                    It's a big risk for you, though - the level of performance isn't stated in the linked article so a parser in Python is probably sufficient.

                    TCC, which has in the past compiled bootable Linux images, was only around 15k LoC in C!

                    For reference, for an engraved-in-stone spec, producing a command-line program (i.e. no tech stack other than a programming language with its standard library), a coder could reasonably produce 5000+ LoC per week.

                    Adding the necessary extensions to support booting isn't much either, because the 16-bit stuff can be done just the same as CC did it - shell out to GCC (thereby not needing many of the extensions).

                    Are you *really* sure that a simple C compiler will cost more than 4 weeks f/time to do? It takes 4 weeks or so in C, are you really sure it will take longer if I switch to (for example) Python?

                    • qarl11h

                      > the level of performance isn't stated in the linked article so a parser in Python is probably sufficient.

                      No, you'll have to match the performance of the actual code, regardless of what happens to be written in the article. It is a C compiler written in Rust.

                      Obviously. Your games reveal your malign intent.

                      EDIT: And good LORD. Who writes a C compiler in python. Do you know any other languages?!?

                      • lelanthran11h

                        > No, you'll have to match the performance of the actual code, regardless of what is in the article. It is a C compiler written in Rust.

                        Look, it's clear that you don't hire s/ware developers very much - your specs are vague and open to interpretation, and it's also clear that I do get hired often, because I pointed out that your spec isn't clear.

                        As far as "playing games" goes, I'm not allowing you to change your single-sentence spec which, very importantly, has "must match performance", which I shall interpret as "performance of emitted code" and not "performance of compiler".

                        > Your games reveal your intent.

                        It should be obvious to you by now that I've done this sort of thing before. The last C compiler I wrote was 95% compliant with the (at the time, new) C99 standard, and came to around 7000-8000 LoC of C89.

                        > EDIT: And good LORD. Who writes a C compiler in python. Do you know any other languages?!?

                        Many. The last language I implemented (in C99) took about two weeks after hours (so, maybe 40 hours total?), was interpreted, and was a dialect of Lisp. It's probably somewhere on Github still, and that was (IIRC) only around 2000LoC.

                        What you appear to not know (maybe you're new to C) is that C was specifically designed for ease of implementation.

                        1. It was designed to be quick and easy to implement.

                        2. The extensions in GCC to allow building bootable Linux images are minimal, TBH.

                        3. The actual 16-bit emission necessary for booting was not done by CC, but by shelling out to GCC.

                        4. The 100kLoC does not include the tests; it used the GCC tests.

                        I mean, this isn't arcane and obscure knowledge, you know. You can search the net right now and find 100s of undergrad CS projects where they implement enough of C to compile many compliant existing programs.

                        I'm wondering; what languages did you write an implementation for? Any that you designed and then implemented?

                  • bee_rider7h

                    You seem to have doubled down on a bluff that was already called.

                    • qarl37m

                      Naw. I got him to reveal himself, which was the whole point.

                      It's amazing what you can get people to do.

              • wavemode2h

                No, you're overestimating how complex it is to write an unoptimized C compiler. C is (in the grand scheme of things) a very simple language to implement a compiler for.

                The rate probably goes up if you ask for more and more standards (C11, C17, C23...) but it's still a lot easier than compilers for almost any other popular language.

              • psychoslave11h

                That feels like a Silicon-Valley-centric point of view. Plus, who would really spend $20k on building any C compiler today, in the actual landscape of software?

                All this is saying is that license laundering of a code-base is now $20k away through automated processes, at least if the original code base is fully available. Well, with the current state of the art you'll actually end up with a code-base which is not as good as the original, but that's it.

          • etler14h

            If my devs are writing that much code they're doing something wrong. Lines of code is an anti-metric. That used to be commonly accepted knowledge.

          • sarchertech20h

            You wouldn’t pay a human to write 100k LOC. Or at least you shouldn’t. You’d pay a human to write a working useful compiler that isn’t riddled with copyright issues.

            If you didn’t care about copying code, usefulness, or correctness you could probably get a human to whip you up a C compiler for a lot less than $20k.

            • qarl20h

              Are you trolling me? Companies (made of humans) write 100,000 LOC all the time.

              And it's really expensive, despite your suspicions.

              • sarchertech19h

                No, companies don’t pay people to write 100k LOC. They pay people to write useful software.

                We figured out that LOC was a useless productivity metric in the 80s.

                • qarl19h

                  Dude.

                  Microsoft paid my company a lot of money to write code. And in the end you were able to count it, and LOC is a perfectly fine metric which is still used today to measure the complexity of a project.

                  If you actually work in software you know this.

                  I have no idea what point you're trying to make - but I've grown very tired of all the trolls attacking me. Good night.

                  EDIT: OH. Maybe you mean that people don't cite LOC in contract deliverables. Yeah, I know. I never said that, and it's irrelevant to my point.

                  • DSMan1952763h

                    Without questioning the LOC metric itself, I'll propose a different problem: LOC for human and AI projects are not necessarily comparable for judging their complexity.

                    For a human, writing 100k LOC to do something that might only really need 15k would be a bit surprising and unexpected - a human would probably reconsider what they were doing well before they typed 100k LOC. Whereas an AI doesn't necessarily have that concern - it can just keep generating code, and since it doesn't care how long it will take, it doesn't have the same practical pressure to produce concise code.

                    The result is that while for large enough human-written programs there's probably an average "density" they reach in terms of LOC vs. complexity of the original problem, AI-generated programs probably average out at an entirely different "density" number.

                  • rezonant11h

                    I can't stress enough how much LOC is not a measure of anything.

                    • icedchai5h

                      Yep. I've seen people copy hundreds of lines instead of adding an if statement.

                    • qarl6h

                      OK, well, the people in MY software industry use LOC as an informal measure of complexity.

                      LIKE THE WHOLE WORLD DOES.

                      But hey, maybe it's just the extremely high profile projects I've worked on.

                  • beowulfey10h

                    Your first post specifically stated:

                    "I'm curious - do you have ANY idea what it costs to have humans write 100,000 lines of code???"

                    which any reasonable reading would take to mean "paid-by-line", which we all know doesn't happen. Otherwise, I could type out 30,000 lines of gibberish and take my fat paycheck.

                    • qarl6h

                      > which any reasonable reading would take to mean

                      Nope. Try again.

            • Chaosvex8h

              > you could probably get a human to whip you up a C compiler for a lot less than $20k

              I'll fork Clang or GCC and rename it. I'll take only $10k.

          • m00x11h

            It really depends on the human and the code they output.

            I can get my 2-year-old child to output 100k LoC, but it won't be very good.

            • qarl6h

              Your 2-year-old can't build a C compiler in Rust that builds Linux.

              Sorry mate, I think you're tripping.

          • psychoslave11h

            Well, if these humans can cheat by taking whatever degree of liberty in copycat attitude is needed to fit the budget, I guess a simple `git clone https://gcc.gnu.org/git/gcc.git SomeLocalDir` is as close to $0 as one can hope to get. And it would end up being far more functional and reliable. But I get that big-corp overlords and their wanna-match-KPI minions will prefer a "clean-roomed" code base.

          • bopbopbop720h

            100k lines of clean, bug free, optimized, and vulnerability free code or 100k lines of outsourced slop? Two very different price points.

            • qarl20h

              A compiler that can build linux.

              That level of quality should be sufficient.

              Do you know any low-quality programmers that write C compilers in Rust THAT CAN BUILD LINUX?

              No you don't. They do not exist.

              • icedchai20h

                Yep. Building a working C compiler that compiles Linux is an impossible task for all but the top 1% of developers. And the ones that could do it have better things to do, plus they’d want a lot more than 20K for the trouble.

                • vbezhenar18h

                  What's so hard about it? Compiler construction is a well-researched topic and is taught in universities. I made a toy language compiler as a student. Maybe I'm underestimating this task, but I think that I can build a simple C compiler which will output trivial assembly. Given my salary of $2500, that would probably take me around a year, so that's pretty close LoL.

                  • cma6h

                    You can one-shot prompt a toy C compiler. Getting one that can compile Linux in a bootable way is significantly harder.

                • viraptor11h

                  It's a bit more nuanced. You can build a simple compiler without too many issues. But once you want it to do optimisations, control-flow protection, good and fast register allocation, inlining, autovectorisation, etc., that's going to take multiples of the original time.

                • rezonant11h

                  > Building a working C compiler ... is an impossible task

                  I think you might be thinking of C++

                  • icedchai11h

                    I’m not. I’ve been working with C on and off for 30 years. Linux requires GNU extensions beyond standard C. Once you get the basics done, there’s still a lot more work to do. Compiling a trivial program might work. But you’ll hit an edge case or 50 in the millions of lines in Linux.

                    I also should’ve qualified my message with “in 2 weeks”, or even “in 2 months.” Given more time it’s obviously possible for more people.

                • rhubarbtree14h

                  Interesting, why impossible? We studied compiler construction at uni. I might have to dig out a few books, but I’m confident I could write one. I can’t imagine anyone on my course of 120 nerds being unable to do this.

                  • menaerus12h

                    You are underestimating the complexity of the task, as do other people in this thread. It's not trivial to implement a working C compiler, much less one that proves its worth by successfully compiling one of the largest open-source code repositories ever - which, btw, is not even written in a plain ISO C dialect.

                    • rhubarbtree9h

                      I didn’t say it was trivial. Just that I thought my course mates would be able to do it.

                      • qarl6h

                        You thought your course mates would be able to write a C compiler that builds Linux?

                        Huh. Interesting. Like the other guy pointed out, compiler classes often get students to write toy C compilers. I think a lot of students don't understand the meaning of the word "toy". I think this thread is FULL of people like that.

                        • icedchai4h

                          I took a compilers course 30 years ago. I have near zero confidence anyone (including myself) could do it. The final project was some sort of toy language for programming robots with an API we were given. Lots of yacc, bison, etc.

                          Lots of segfaults, too.

                        • stevejb3h

                          Hey! I built a Lego technic car once 20 years ago. I am fully confident that I can build an actual road worthy electric vehicle. It's just a couple of edge cases and a bit bigger right? /s

                  • icedchai11h

                    I’ll be shocked if they are able to do it in 4 months, never mind 4 weeks.

              • bopbopbop720h

                Do you think this was guided by a low quality Anthropic developer?

                You can give a developer the GCC test suite and have them build the compiler backwards, which is how this was done. They literally brute-forced it; most developers can brute force. It also literally uses GCC in the background... Maybe try reading the article.

                • qarl20h

                  HEH... I'm sorry man but I truly don't understand what point you're trying to make, other than change the subject and get nasty.

                  You take care now.

                  • bopbopbop720h

                    The trick to not be confused is to read more than the title of the article.

                  • qarl6h

                    I love how I'm getting downvoted for calling him out for saying "read the article".

                    The people in here have gone entirely nuts.

        • georgeven2h

          It's $20,000 in 2026. With the price of tokens halving every year (at a given perf level), this will be around $1,250 by 2030.

        • RA_Fisher3h

          Progress can be reviewed over time, and I'd think that'd take a lot of the risk out.

        • tumdum_21h

          > On top of that, Anthropic is losing money on it.

          It seems they are *not* losing money on inference: https://bsky.app/profile/steveklabnik.com/post/3mdirf7tj5s2e

          • quikoa4h

            That's for the API right? The subs are still a loss. I don't know which one of the two is larger.

          • byzantinegene13h

            No, and that is widely known. The actual problem is that the margins at that scale are not sufficient to make up for the gargantuan cost of training their SOTA models.

            • aurareturn10h

              Source on that?

              Because inference revenue is outpacing training cost based on OpenAI’s report and intuition.

              • cma6h

                Net inference revenue would need to be outpacing it to counter his point about margins.

        • chamomeal20h

          That's a good point! Here claude opus wrote a C compiler. Outrageously cool.

          Earlier today, I couldn't get opus to replace useEffect-triggered-redux-dispatch nonsense with react-query calls. I already had a very nice react-query wrapper with tons of examples. But it just couldn't make sense of the useEffect rube goldberg machine.

          To be fair, it was a pretty horrible mess of useEffects. But just another data point.

          Also I was hoping opus would finally be able to handle complex typescript generics, but alas...

        • bdangubic21h

          > On top of that, Anthropic is losing money on it

          This has got to be my favorite one of them all that keeps coming up in too many comments… You know who else was losing money in the beginning?! Every successful company that ever existed! Some, like Uber, were losing billions for a decade. And when was the last time you rode in a taxi? (I still do; my kid never will.) Not sure how old you are and if you remember "facebook will never be able to monetize on mobile…" - they all lose money, until they do not.

          • ThrowawayR218h

            Anyone remember the dotcom bust?

            • qarl18h

              Oh yeah, I do. That whole internet thing was a total HOAX. I can't believe people bought into that.

              Can you imagine if Amazon, eBay, PayPal, or Salesforce existed today?

              • pjmlp15h

                Well, how is your Solaris installation going?

                I also remember having gone into research, because there were no jobs available, and even though I was employed at the time, our salaries weren't being paid.

                • qarl12h

                  What does this even mean?

                  • pjmlp11h

                    ==> Anyone remember the dotcom bust?

                    • qarl6h

                      Yeah. That's not an explanation. But thanks. You're awesome.

            • mikkupikku3h

              Remember that thing that caused it? That "Internet" thing? After those companies went bust it pretty much disappeared didn't it.

          • deaux17h

            Completely detached from reality - brainwashed SV VCs who have made dumping the norm in their bubble.

            I can guarantee you that 90% of successful businesses in the world made a profit in their first year.

            • HWR_149h

              One year seems aggressive. Successful restaurants have around the first year as the average break-even timeline, with the vast majority falling between 6 and 18 months.

              They are making a profit on each sale, but there are fixed costs to running a business.

            • brookst16h

              I’ll bite. Share your data?

              Companies that were not profitable in their first year: Microsoft, Google, SpaceX, Airbnb, Uber, Apple, FedEx, Amazon.

              If the vast majority of companies are immediately profitable, why do we have VC and investment at all? Shouldn't the founders just start making money right away?

              • deaux14h

                > Companies that were not profitable in their first year: Microsoft, Google, SpaceX, Airbnb, Uber, Apple, FedEx, Amazon.

                US Big Tech, US Big Tech, US Tech-adjacent, US Big Tech, US Big Tech, US Big Tech, FedEx, US Tech-adjacent.

                In other words, exactly what I was getting at.

                Also, a basic search shows Microsoft to have been profitable in its first year. I'd be very surprised if it weren't. Apple also seems to have taken less than 2 years. And unsurprisingly, these happen to be the only two among the tech companies you named that launched before 1995.

                Check out the Forbes Global 5000. Then go think about the hypothetical Forbes Global 50,000. Is the 50,000th most successful company in the world not successful? Of course not, it's incredibly successful.

                > why do we have VC and investment at all

                Out of all companies started in 2024 I can guarantee you that <0.01% have received VC investment by now (Feb 2026) and <1% of tech companies did. I'll bet my house on it.

          • qarl18h

            I love how your comment is getting downvoted.

            Like it's a surprise that startups burn through money. I get the feeling that people really have no idea what they're talking about in here anymore.

            It's a shame.

            • cowl16h

              Then you are misunderstanding the downvoting. It's not the fact that they are burning money. It's the fact that this costs $20k today, but that is not the real cost once you factor in that they are losing money at this price.

              So tomorrow, when this "startup" needs to come out of its money-burning phase, like every startup has to sooner or later, that cost will increase, because there is no other monetising avenue, at least not for Anthropic, which "will never use ads".

              At $20k this "might" be a reasonable cost for "the project"; at $200k it might not.

              • brookst16h

                That would be insightful if the cost of inference weren’t declining at roughly 90% per year. Source: https://epoch.ai/data-insights/llm-inference-price-trends

                • EddieRingle14h

                  According to that article, the data they analyzed was API prices from LLM providers, not their actual cost to perform the inference. From that perspective, it's entirely possible to make "the cost of inference" appear to decline by simply subsidizing it more. The authors even hint at the same possibility in the overview:

                  > Note that while the data insight provides some commentary on what factors drive these price drops, we did not explicitly model these factors. Reduced profit margins may explain some of the drops in price, but we didn’t find clear evidence for this.

              • aurareturn10h

                Source that they’re losing money on each token?

          • PostOnce20h

            Are we forgetting that sometimes, they just go bankrupt?

            • bdangubic20h

                Name one with a comparable number of users and revenue? Not saying you are wrong, but I would bet against that outcome.

              • PostOnce13h

                I'll be able to do just that in 36mo or so after the IPOs and subsequent collapse, I think.

              • svieira19h

                Enron

                • bdangubic18h

                  I should have guessed someone would answer this question in this thread with Enron :)

                    I did not ask for a random company that went under for any reason; I asked a specific question related to users and revenue.

                  • grey-area13h

                      Well, there are lots and lots of examples that don't end in bankruptcy, just a very large loss of capital for investors. The majority of the stars of the dotcom bubble, just as one example: Qualcomm, pets.com, Yahoo!, MicroStrategy, etc.

                    Uber, which you cite as a success, is only just starting to make any money, and any original investors are very unlikely to see a return given the huge amounts ploughed in.

                    MicroStrategy has transformed itself, same company, same founder, similar scam 20 years later, only this time they're peddling bitcoin as the bright new future. I'm surprised they didn't move on to GAI.

                    Qualcomm is now selling itself as an AI first company, is it, or is it trying to ride the next bubble?

                    Even if GAI becomes a roaring success, the prominent companies now are unlikely to be those with lasting success.

      • thesz16h

          > This test sorta definitely proves that AI is legit.
        
        This is an "in distribution" test. There are a lot of C compilers out there, including ones with git history, implemented from scratch. "In distribution" tests do not test generalization.

        The "out of distribution" test would be like "implement (self-bootstrapping, Linux kernel compatible) C compiler in J." J is different enough from C and I know of no such compiler.

        • disgruntledphd212h

          > This is an "in distribution" test. There are a lot of C compilers out there, including ones with git history, implemented from scratch. "In distribution" tests do not test generalization.

          It's still really, really impressive though.

          Like, economics aside this is amazing progress. I remember GPT3 not being able to hold context for more than a paragraph, we've come a long way since then.

          Hell, I remember bag of words being state of the art when I started my career. We have come a really, really, really long way since then.

          • thesz12h

              > It's still really, really impressive though.
            
            Do we know how many attempts were made to create such a compiler before, during previous tests? Would Anthropic report on the failed attempts? Could this "really, really impressive" thing be a result of luck?

            Much like quoting Quake code almost verbatim not so long ago.

            • disgruntledphd210h

              > Do we know how many attempts were made to create such a compiler before, during previous tests? Would Anthropic report on the failed attempts? Could this "really, really impressive" thing be a result of luck?

              No, we don't, and yeah, we would expect them to only report positive results (this is both marketing and investigation). That being said, they provide all the code etc. for people to review.

              I do agree that an out-of-distribution test would be super helpful, but given what we know about LLMs it will almost certainly fail, so I'm not too pushed about it.

              Look, I'm pretty sceptical about AI boosting, but this is a much better attempt than the windsurf browser thing from a few months back, and it's interesting to know that one can get this to work.

              I do note that the article doesn't talk much about all the harnesses needed to make this work, which, assuming this approach is plausible, is the kind of thing that will be needed to make demonstrations like this more useful.

              • thesz9h

                  > No, we don't, and yeah, we would expect them to only report positive results (this is both marketing and investigation).
                
                This is a matter of methodology. If they train models on that task, or somehow score/select models on their progress on that task, then we have test set leakage [1].

                [1] https://en.wikipedia.org/wiki/Leakage_(machine_learning)

                This question is extremely important because test set leakage leads to impressive-looking results that do not generalize to anything at all.

                • disgruntledphd26h

                  > This is a matter of methodology. If they train models on that task, or somehow score/select models on their progress on that task, then we have test set leakage [1].

                  I am quite familiar with leakage, having been building statistical models for maybe 15+ years at this point.

                  However, that's not really relevant in this particular case given that LLMs are trained on approximately the entire internet, so leakage is not really a concern (as there is no test set, apart from the tasks they get asked to do in post-training).

                  I think it's impressive that this works at all, even if it's just predicting tokens (which is basically what these models are trained to do), as it's a pointer towards potentially more useful tasks (convert this COBOL code base to Java, for instance).

                  I think the missing bit here is that this only works for cases where there's a really large test set (the HTML spec, the Linux kernel). I'm not convinced that the models would be able to maintain coherence without this, so maybe that's what we need to figure out how to build to make this actually work.

                  • thesz2h

                      > I think the missing bit here is that this only works for cases where there's a really large test set (the HTML spec, the Linux kernel). I'm not convinced that the models would be able to maintain coherence without this, so maybe that's what we need to figure out how to build to make this actually work.
                    
                    Take any language with a compiler and several thousand users and you have plenty of tests that approximate the spec, inward and outward.

                    Here's, for example, VHDL tests suite for GHDL, open source VHDL compiler and simulator: https://github.com/ghdl/ghdl/tree/master/testsuite

                    The GHDL test suite is sufficient and general enough to develop a pretty capable clone. To my knowledge, there is only one open source VHDL compiler, and it is written in Ada. And, again, the expertise needed to implement another one from scratch to train an LLM on is very, very scarce - VHDL, being a highly parallel variant of Ada, is quirky as hell.

                    So someone can test your hypothesis on VHDL - agent-code a VHDL compiler and simulator in Rust so that it passes the GHDL test suite. Would it take two weeks and $20,000, as with C? I don't know, but I really doubt it.

        • Rudybega6h

          There are two compilers that can handle the Linux kernel: GCC and LLVM. Both are written in C++, not Rust. It's "in distribution" only if you really stretch the meaning of the term. A generic C compiler isn't going to be anywhere near the level of rigour of this one.

          • thesz5h

            There is tinycc; that makes it three compilers.

            There is a C compiler implemented in Rust from scratch: https://github.com/PhilippRados/wrecc/commits/master/?after=... (the very beginning of commit history)

            There are several C compilers written in Rust from scratch of comparable quality.

            We do not know whether Anthropic has a closed-source C compiler written in Rust in their training data. We also do not know whether Anthropic validated their models on their ability to implement a C compiler from scratch before releasing this experiment.

            The language J that I proposed does not have any C compiler implemented in it at all. Idiomatic J expertise is scarce and expensive, so it would be a significant expense for Anthropic to get a C compiler in J into their training data. Being Turing-complete, J can express all the typical compiler tips and tricks from compiler books, albeit in an unusual way.

            • Rudybega2h

              TinyCC can't compile a modern Linux kernel; it doesn't support a ton of the extensions the kernel uses. That Rust compiler similarly can't do it.

      • LinXitoW19h

        How does $20K to replicate code available in the thousands online (toy C compilers) prove anything? It requires a bunch of caveats about things that don't work, it requires a bunch of other tools to do stuff, and an experienced developer had to guide it pretty heavily to even get that lackluster result.

      • qarl47m

        HEY ALL!

        I have to stop participating in this conversation. Some helpful people from the internet have begun to send me threatening email messages.

        Thanks HN. You're PURE AWESOME!

      • soperj16h

        Only if we take them at their word. I remember thinking things were in a completely different state when Amazon had their shop and go stores, but then finding out it was 1000s of people in Pakistan just watching you via camera.

      • cardanome6h

        I will write you a C compiler by hand for $19k, and it will be better than what Claude made.

        Writing a toy C compiler isn't that hard. Any decent programmer can write one in a few weeks or months. The optimizations are the actually interesting part, and Claude fails hard at that.

      • kvemkon21h

        > optimizations aren't as good as the 40-year-old gcc project

        with all optimizations disabled:

        > Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.

        • qarl20h

          That distinction doesn't change my point. I am not surprised that a 40 year old project generates better code than this brand new one.

          • charcircuit18h

            Not only is it new, there has been zero performance optimization done - well, none prompted for, at least. Once you give the agents a profiler and start a loop focused on performance, you'll see it start improving.

            • thesz16h

              We are talking about a compiler here, and the "performance" referred to above is the performance of the generated code.

              When you are optimizing a program, you have a specific part of the code to improve. That part can be found with a profiler.

              When you are optimizing compiler-generated code, you have many similar parts of code across many programs, and a not-so-specific part of the compiler that can be improved.

              • charcircuit16h

                Yes, the performance of the generated code. You have some benchmark using a handful of common programs going through common workflows, and you measure the performance of the generated code. As tweaks are made, you see how the different performance experiments affect the overall performance. Some strategies are always a win, but things like how you lay out different files and functions in memory have different trade-offs and are hard to know up front without doing actual real-world testing.
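
                A rough sketch of that measure-tweak loop (`mycc` and the benchmarks/ directory are hypothetical names, not anything from the article):

                  import glob, subprocess, time

                  # Compile each benchmark with the compiler under test, run
                  # it, and record wall-clock time; re-run after each codegen
                  # tweak and compare against the previous baseline.
                  results = {}
                  for src in glob.glob("benchmarks/*.c"):
                      subprocess.run(["./mycc", src, "-o", "a.out"], check=True)
                      start = time.perf_counter()
                      subprocess.run(["./a.out"], check=True)
                      results[src] = time.perf_counter() - start
                  print(results)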

                • thesz12h

                    > As tweaks are made...
                    > ...how you lay out different files and functions in memory have different trade-offs and are hard to know up front without doing actual real-world testing.
                  
                  These are definitely not algorithmic optimizations like privatization [1].

                  https://en.wikipedia.org/wiki/Privatization_(computer_progra...

                  To correctly apply privatization one has to have correct dependency analysis. This analysis uses the results of many other analyses - for example, value range analysis, something like the Fourier-Motzkin elimination algorithm, etc.
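
                  For readers who haven't met the term: privatization gives each loop iteration its own copy of a temporary so the iterations stop sharing state. A schematic sketch of the transform (illustrative only, not from the linked page):

                    n = 8; a = list(range(n)); b = [0] * n

                    # Before: one shared t makes every iteration appear to
                    # depend on the previous one through the same variable.
                    t = 0
                    for i in range(n):
                        t = a[i] * 2
                        b[i] = t + 1

                    # After privatization: t is local to each iteration, the
                    # cross-iteration dependence disappears, and the loop can
                    # legally be parallelized or vectorized.
                    def body(i):
                        t = a[i] * 2  # private copy per iteration
                        b[i] = t + 1
                    for i in range(n):
                        body(i)

                  Proving that t's value never actually flows between iterations is exactly the dependency analysis described above.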

                  So this agentic-optimized compiler has a program where privatization is not applied; what tweaks should the agents apply?

      • byzantinegene14h

        It costs $20,000 to reinvent a wheel that it probably trained on. If that's your definition of legit, sure.

        • organicUser12h

          Well, if today it is a matter of cost, tomorrow it won't be anymore. 4 GB of RAM in the 80s would have cost tens of millions of dollars; now even your car runs 4 GB of memory just for the infotainment system, and dozens of GBs for the most complex assistants. So I would see this achievement more as a warning: the final result is not what's concerning, it's the premonition behind it.

      • miohtama15h

        GCC had a 40-year head start.

    • ip2615h

      I'm excited and waiting for the team that shows that, with $20k in credits, they can substantially speed up the generated code by improving clang!

      • byzantinegene13h

        I'm sorry, but that will take another $20 billion in AI capex to train our latest SOTA model so that it will cost $20k to improve the code.

    • iberator16h

      Claude did not write it. You wrote it, with PREVIOUS EXPERIENCE, with 20,000 long commands yelling at him exactly what to do.

      Real usable AI would create it from a simple "make a C99 compiler faster than GCC".

      AI usage should be banned in general. It takes jobs faster than it creates new ones...

      • arcanemachiner15h

        That's actually pretty funny. They're patting it on the back for using, in all likelihood, some significant portions of code that they actually wrote, which was stolen from them without attribution so that it could be used as part of a very expensive parlour trick.

        • whynotminot9h

          Did you do diffs to confirm the code was stolen, or are you just speculating?

      • embedding-shape12h

        > AI usage should be banned in general. It takes jobs faster than creating new ones ..

        I don't have a strong opinion about that in either direction, but I'm curious: do you feel the same about everything, or is it just about this specific technology? For example, should the nail gun have been forbidden if it had been invented today, as one person with a nail gun could probably replace 3-4 people with normal "manual" hammers?

        Do you feel the same about programmers who are automating others out of work without the use of AI, too?

      • wiseowise13h

        > It takes jobs faster than creating new ones ..

        You think a compiler engineer from Google gives a single shit about this?

        They’ll automate millions out of career existence for their amusement while cashing out stock money and retiring early comfortably.

      • benterix10h

        > It takes jobs faster than creating new ones ..

        I have no problem with tech making some jobs obsolete; that's normal. The problem is that the jobs being done by the current generation of LLMs are, at least for now, mostly of inferior quality.

        The tools themselves are quite useful as helpers in several domains if used wisely though.

      • 7thpower11h

        Businesses do not exist to create jobs; jobs are a byproduct.

        • jaccola10h

          Even that is underselling it; jobs are a necessary evil that should be minimised. If we can have more stuff with fewer people needing to spend their lives providing it, why would we NOT want that?

          • direwolf2010h

            Because we've built a system where if you don't have a job, you die.

            • jaccola10h

              This is already hyperbolic; in most countries where software engineers or similar knowledge workers are widely employed there are welfare programmes.

              To add to that, if there is such mass unemployment in this scenario it will be because fewer people are needed to produce and therefore everything will become cheaper... This is the best kind of unemployment.

              So at best: none of us have to work again and will get everything we need for free. At worst, certain professions will need a career switch which I appreciate is not ideal for those people but is a significantly weaker argument for why we should hold back new technology.

              • jelder9h

                If you were to rank all of the C compilers in the world and then rank all of the welfare systems in the world, this vibe-coded mess would be at approximately the same rank as the American welfare system. Especially if you extrapolate this narcissistic, hateful kleptocracy out a few more years.

            • aurareturn10h

              Did we build it or did nature?

      • unglaublich9h

        Jobs are a means, not a goal.

        • sc68cal6h

          Jobs are the only way that you survive in this society (food, shelter). Look how we treat unhoused people without jobs. AI is taking jobs away and that is putting people's survival at risk.

    • beambot1d

      This is getting close to a Ken Thompson "Trusting Trust" era -- AI could soon embed itself into the compilers themselves.

      • bopbopbop724h

        A pay to use non-deterministic compiler. Sounds amazing, you should start.

        • Aurornis24h

          Application-specific AI models can be much smaller and faster than the general purpose, do-everything LLM models. This allows them to run locally.

          They can also be made to be deterministic. Some extra care is required to avoid computation paths that lead to numerical differences on different machines, but this can be accomplished reliably with small models that use integer math and use kernels that follow a specific order of operations. You get a lot more freedom to do these things on the small, application-specific models than you do when you're trying to run a big LLM across different GPU implementations in floating point.
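
          A toy illustration of the principle (not any real inference kernel): with integer weights and a fixed accumulation order, every machine computes bit-identical results, unlike floating point, where summation order matters.

            # Quantized weights/activations as plain ints, accumulated in a
            # fixed left-to-right order. Integer math has no rounding, so
            # the result is bit-identical everywhere -- unlike float32,
            # where (a + b) + c != a + (b + c) in general.
            def int_matvec(W, x, shift=7):
                out = []
                for row in W:
                    acc = 0
                    for w, xi in zip(row, x):  # fixed accumulation order
                        acc += w * xi
                    out.append(acc >> shift)   # fixed-point rescale
                return out

            W = [[3, -1, 2], [0, 5, -4]]        # toy int8-style weights
            print(int_matvec(W, [10, 20, 30]))  # same output on any machine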

          • soraminazuki19h

            > They can also be made to be deterministic.

            Yeah, in the same way that pseudo-random number generators are "deterministic." They generate the exact same sequence of numbers every time, given the seeds are the same!

            But that's not the "determinism" people are referring to when they say LLMs aren't deterministic.

        • ndesaulniers24h

          Some people care more about compile times than the performance of generated code. Perhaps even the correctness of generated code. Perhaps more so than determinism of the generated code. Different people in different contexts can have different priorities. Trying to make everyone happy can sometimes lead to making no one happy. Thus dichotomies like `-O2` vs `-Os`.

          EDIT (since HN is preventing me from responding):

          > Some people care more about compiler speed than the correctness?

          Yeah, I think plenty of people writing code in languages that have concepts like Undefined Behavior technically don't really care as much about correctness as they may claim, as it's pretty hard to write large volumes of code without indirectly relying on UB somewhere. What is correct in such cases was left up to the interpretation of the implementer by ISO WG14.

          • bopbopbop724h

            Some people care more about compiler speed than the correctness? I would love to meet these imaginary people that are fine with a compiler that is straight up broken. Emitting working code is the baseline, not some preference slider.

            • ndesaulniers3h

              > I would love to meet these imaginary people that are fine with a compiler that is straight up broken.

              That's not what I said; you're attacking a strawman.

              My point was more that some people prefer the madness that is -funsafe-math-optimizations, or happen to rely on UB (intentionally or otherwise). What even is "correct" in the presence of UB? What is correct in such cases was left up to the interpretation of the implementer by ISO WG14.

            • gerdesj20h

              You might not have run Gentoo. Most Gentooers will begrudgingly but eventually admit to cooking their own gonads when updating a laptop.

              Anyway, please define: "correctness".

            • fragmede23h

              Let's pretend, for just a second, that the people who do, having been able to learn how to program, are not absolute fucking morons. Straight up broken is obviously not useful, so maybe the conclusions you've jumped to could use some reexamination.

          • chasd0024h

            A compiler introducing bugs into the code it compiles is a nightmare thankfully few have faced. The only thing worse would be a CPU bug, like the legendary Pentium bug. Imagine you compile something like Postgres only to have it crash in some unpredictable way. How long do you stare at the Postgres source before suspecting the compiler? What if this compiler were used to compile code in software running all over cloud stacks? Bugs in compilers are very bad news; they have to be correct.

            • addaon22h

              > a compiler introducing bugs into code it compiles is a nightmare thankfully few have faced

              Is this true? It’s not an everyday thing, but when using less common flags, or code structures, or targets… every few years I run into a codegen issue. It’s hard to imagine going through a career without a handful…

            • Anon109613h

              It's not that uncommon if you work on massive lowish-level systems. Clang/LLVM being relatively bug-free is the result of many corporate big-tech low-level compiler SWEs working with the application SWEs to debug why XYZ isn't working properly and then writing the appropriate fix. But compiler bugs still come up every so often; I've seen it on multiple occasions.

            • ndesaulniers23h

              Yeah, my current boss spent time weeding out such hardware bugs: https://arxiv.org/abs/2110.11519 (EDIT: maybe https://x.com/Tesla_AI/status/1930686196201714027 is a more relevant citation)

              They found a bimodal distribution in failures over the lifetime of chips. Infant mortality was well understood. Silicon aging over time was much less well understood, and I still find it surprising.

      • int_19h14h

        What I want to know is when we get AI decompilers.

        Intuitively it feels like it should be a straightforward training setup - there's lots of code out there, so compile it with various compilers, flags, etc., and then use those (source, binary) pairs to train the model.
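
        A sketch of that pair generation (assuming gcc and objdump are available; the corpus/ path is a placeholder):

          import glob, subprocess

          # Compile each source at several optimization levels, disassemble,
          # and keep the (source, disassembly) pairs as training examples.
          pairs = []
          for src in glob.glob("corpus/*.c"):
              for opt in ["-O0", "-O1", "-O2", "-Os"]:
                  subprocess.run(["gcc", opt, "-c", src, "-o", "tmp.o"],
                                 check=True)
                  asm = subprocess.run(["objdump", "-d", "tmp.o"],
                                       capture_output=True, text=True).stdout
                  pairs.append((open(src).read(), asm))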

      • ndesaulniers24h

        We're already starting to see people experimenting with applying AI towards register allocation and inlining heuristics. I think that many fields within a compiler are still ripe for experimentation.

        https://llvm.org/docs/MLGO.html

      • psychoslave11h

        Hmm, well, there are already LLMs embedded in fonts: https://hackaday.com/2024/06/26/llama-ttf-is-ai-in-a-font/

      • jojobas22h

        Sorry, clang 26.0 requires an Nvidia B200 to run.

      • andai22h

        The asymmetry will be between the frontier AI's ability to create exploits vs find them.

      • dnautics19h

        would be hard to miss gigantic kv cache matrix multiplications

      • greenavocado22h

        Then I'll be left wondering why my program requires 512TB of RAM to open

    • VladVladikoff8h

      > $20,000 of tokens.

      > less efficient than existing compilers

      What is the ecological cost of producing this piece of software that nobody will ever use?

      • ryanjshaw8h

        If you evaluate the cost/benefit in isolation? It’s net negative.

        If you see this as part of a bigger picture to improve human industrial efficiency and bring us one step closer to the singularity? Most likely net positive.

      • thefounder8h

        With that way of thinking you would just move into a cave.

    • HarHarVeryFunny7h

      > I spent a good part of my career (nearly a decade) at Google working on getting Clang to build the linux kernel

      Did this come down to making Clang 100% gcc compatible (extensions, UDB, bugs and all), or were there any issues that might be considered as specific to the linux kernel?

      Did you end up building a gcc compatibility test suite as a part of this? Did the gcc project themselves have a regression/test suite that you were able to use as a starting point?

      • ndesaulniers6h

        > extensions

        Some were necessary (asm goto), some were not (nested functions, flexible array members not at the end of structs).
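
        For anyone who hasn't run into the latter two, GNU C nested functions look like this (a toy sketch, not kernel code):

            #include <stdio.h>

            /* GNU C nested functions: a GCC extension Clang never adopted.
               The inner function can capture locals of the enclosing one. */
            static int sum(int n, const int *a) {
                int total = 0;
                void add(int v) { total += v; } /* nested function */
                for (int i = 0; i < n; i++)
                    add(a[i]);
                return total;
            }

            int main(void) {
                int a[] = { 1, 2, 3 };
                printf("%d\n", sum(3, a)); /* 6 under GCC; a hard error under Clang */
                return 0;
            }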

        > UDB, bugs and all

        Luckily, the kernel didn't intentionally rely on GCC specifics this way. Where it did unintentionally, we fixed the kernel sources properly with detailed commit messages explaining why.

        > or were there any issues that might be considered as specific to the linux kernel?

        Yes, https://github.com/ClangBuiltLinux/linux/issues is our issue tracker. We use tags extensively to mark if we triage the issue to be kernel-side vs toolchain-side.

        > Did you end up building a gcc compatibility test suite as a part of this?

        No, but some tricky cases LLVM got wrong were distilled from kernel sources using one of:

        - creduce
        - cvise (my favorite)
        - bugpoint
        - llvm-reduce

        and then added to LLVM's existing test suite. Many such tests were also simply manually written.

        > Did the gcc project themselves have a regression/test suite that you were able to use as a starting point?

        GCC and binutils have their own test suites. Folks in the LLVM community have worked on being able to test clang against GCC's test suite. I personally have never run GCC's test suite or looked at its sources.

    • the_jends21h

      Being just a grunt engineer in a product firm I can't imagine being able to spend multiple years on one project. If it's something you're passionate about, that sounds like a dream!

      • ndesaulniers3h

        This work originally wasn't my 100% project, it was my 20% project (or as I prefer to call it, 120% project).

        I had to move teams twice before a third team was able to say: this work is valuable to us, please come work for us and focus just on that.

        I had to organize multiple internal teams, then build an external community of contributors to collaborate on this shared common goal.

        Having carte blanche to contribute to open source projects made this feasible at all; I can see that being a non-starter at many employers, sadly. Having low friction to change teams also helped a lot.

    • MaskRay20h

      I want to verify the claim that it builds the Linux kernel. It quickly runs into errors, but yeah, still pretty cool!

          make O=/tmp/linux/x86 ARCH=x86_64 CC=/tmp/p/claudes-c-compiler/target/release/ccc -j30 defconfig all

          /home/ray/Dev/linux/arch/x86/include/asm/preempt.h:44:184: error: expected ';' after expression before 'pto_tmp__'
            do { u32 pto_val__ = ((u32)(((unsigned long) ~0x80000000) & 0xffffffff)); if (0) { __typeof_unqual__((__preempt_count)) pto_tmp__; pto_tmp__ = (~0x80000000); (void)pto_tmp__; } asm ("and" "l " "%[val], " "%" "[var]" : [var] "+m" (((__preempt_count))) : [val] "ri" (pto_val__)); } while (0);
            ^~~~~~~~~
          fix-it hint: insert ';'
          /home/ray/Dev/linux/arch/x86/include/asm/preempt.h:49:183: error: expected ';' after expression before 'pto_tmp__'
            do { u32 pto_val__ = ((u32)(((unsigned long) 0x80000000) & 0xffffffff)); if (0) { __typeof_unqual__((__preempt_count)) pto_tmp__; pto_tmp__ = (0x80000000); (void)pto_tmp__; } asm ("or" "l " "%[val], " "%" "[var]" : [var] "+m" (((__preempt_count))) : [val] "ri" (pto_val__)); } while (0);
            ^~~~~~~~~
          fix-it hint: insert ';'
          /home/ray/Dev/linux/arch/x86/include/asm/preempt.h:61:212: error: expected ';' after expression before 'pao_tmp__'

      • silver_sun18h

        They said it builds Linux 6.9, maybe you are trying to compile a newer version there?

        • MaskRay18h

          git switch v6.9

          The riscv build succeeded. For the x86-64 build I ran into

              % make O=/tmp/linux/x86 ARCH=x86_64 CC=/tmp/p/claudes-c-compiler/target/release/ccc-x86 HOSTCC=/tmp/p/claudes-c-compiler/target/release/ccc-x86 LDFLAGS=-fuse-ld=bfd LD=ld.bfd -j30 vmlinux -k
              make[1]: Entering directory '/tmp/linux/x86'
              ...
                CC      arch/x86/platform/intel/iosf_mbi.o
              ccc: error: lgdtl requires memory operand
                AR      arch/x86/platform/intel-mid/built-in.a
              make[6]: *** [/home/ray/Dev/linux/scripts/Makefile.build:362: arch/x86/realmode/rm/wakeup_asm.o] Error 1
              ld.bfd: arch/x86/entry/vdso/vdso32/sigreturn.o: warning: relocation in read-only section `.eh_frame'
              ld.bfd: error in arch/x86/entry/vdso/vdso32/sigreturn.o(.eh_frame); no .eh_frame_hdr table will be created
              ld.bfd: warning: creating DT_TEXTREL in a shared object
              ccc: error: unsupported pushw operand
          
          There are many other errors.

          tinyconfig and allnoconfig have fewer errors.

              RELOCS  arch/x86/realmode/rm/realmode.relocs
              Invalid absolute R_386_32 relocation: real_mode_seg
          
          Still very impressive.

          • 63stack8h

            I feel like I could have done this in a much shorter time, with far fewer tokens, but still very impressive!

          • pertymcpert16h

            They said that it wasn't able to support 16 bit real mode. Needs to call gcc for that.

    • grey-area15h

      Isn't the AI basing what it does heavily on the publicly available source code for compilers in C, though? Without that work it would not be able to generate this, would it? Or, in your opinion, is it sufficiently different from the work people like you did to be classed as a unique creation?

      I'm curious about your take on the references the GAI might have used to create such a project, and whether this matters.

    • 9rx15h

      > I spent a good part of my career (nearly a decade) at Google working on getting Clang to build the linux kernel.

      How much of that time was spent writing the tests that they found to use in this experiment? You (or someone like you) were a major contributor to this. All Opus had to do here was keep brute forcing a solution until the tests passed.

      It is amazing that it is possible at all, but it remains an impossibility without a heavy human hand. One could easily still spend a good part of their career reproducing this if they first had to rewrite all of the tests from scratch.

    • zaphirplane1d

      What were the challenges, out of interest? Was some of it the use of gcc extensions, which needed equivalents and porting over?

      • ndesaulniers24h

        `asm goto` was the big one. The x86_64 maintainers broke the clang builds very intentionally just after we had gotten x86_64 building (with necessary patches upstreamed) by requiring compiler support for that GNU C extension. This was right around the time of meltdown+spectre, and the x86_64 maintainers didn't want to support fallbacks for older versions of GCC (and ToT Clang at the time) that lacked `asm goto` support for the initial fixes shipped under duress (embargo). `asm goto` requires plumbing throughout the compiler, and I've learned more about register allocation than I particularly care...
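
        For readers who haven't met the extension, its rough shape (a simplified sketch of the static-branch pattern on x86, not actual kernel code):

            #include <stdio.h>

            /* `asm goto` lets inline assembly branch to C labels, so the
               compiler must model the asm block as a branch in its CFG. */
            static inline int branch_enabled(void)
            {
                asm goto("jmp %l[disabled]" /* the kernel patches this to a NOP */
                         : /* outputs (only allowed since GCC 11) */
                         : /* inputs */
                         : /* clobbers */
                         : disabled);
                return 1;
            disabled:
                return 0;
            }

            int main(void)
            {
                printf("%d\n", branch_enabled()); /* unpatched, the jmp is taken: prints 0 */
                return 0;
            }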

        Fixing some UB in the kernel sources, lots of plumbing to the build system (particularly making it more hermetic).

        Getting the rest of the LLVM binutils substitutes to work in place of GNU binutils was also challenging. Rewriting a fair amount of 32b ARM assembler to be "unified syntax" in the kernel. Linker bugs are hard to debug. Kernel boot failures are hard to debug (thank god for QEMU+gdb protocol). Lots of people worked on many different parts here, not just me.

        Evangelism and convincing upstream kernel developers why clang support was worth anyone's while.

        https://github.com/ClangBuiltLinux/linux/issues for a good historical perspective. https://github.com/ClangBuiltLinux/linux/wiki/Talks,-Present... for talks on the subject. Keynoting LLVM conf was a personal highlight (https://www.youtube.com/watch?v=6l4DtR5exwo).

    • phillmv1d

      i mean… your work also went into the training set, so it's not entirely surprising that it spat a version back out!

      • underdeserver1d

        Anthropic's version is in Rust though, so at least a little different.

        • ndesaulniers24h

          There are parts of the LLVM architecture that are long in the tooth (IMO), as is the language it's implemented in.

          I had hoped one day to re-implement parts of LLVM itself in Rust; in particular, I've been curious about approaches to compiling C concurrently (and parsing C in parallel, or lazily) that haven't been explored in LLVM, and that I think might be safer to do in Rust. I don't know enough about grammars to know if it's technically impossible, but a healthy dose of ignorance can sometimes lead to breakthroughs.

          LLVM is pretty well designed for testing. I was able to implement a lexer for C in Rust that could lex the Linux kernel, and use clang to cross-check my implementation (I would compare my interpretation of the token stream against clang's). Just having a standard module system with reusable pieces seems like perhaps a better way to compose a toolchain, but maybe folks with more experience with rustc have scars and will disagree?

          • jcranmer21h

            > I had hoped one day to re-implement parts of LLVM itself in Rust

              Heh, earlier today, I was just thinking how crazy a proposal it would actually be to have a Rust dependency (specifically, the egg crate, since one of the things I'm banging my head against right now might be better solved with e-graphs).

        • yoz-y23h

          One thing LLMs are really good at is translation. I haven’t tried porting projects from one language to another, but it wouldn’t surprise me if they were particularly good at that too.

          • andrekandre11h

            as someone who has done that in a professional setting, it really does work well, at least for straightforward things like data classes/initializers and average biz logic with if/else statements etc... things like code annotations and other more opaque stuff can get more unreliable though, because there are fewer 1-to-1 representations... it would be interesting to train an llm on each newly encountered pattern and slowly build up a reliable conversion workflow

        • rwmj1d

          It's not really important in latent space / conceptually.

          • D-Machine20h

            This is the proper deep critique / skepticism (or sophisticated goal-post moving, if you prefer) here. Yes, obviously this isn't just reproducing C compiler code in the training set, since this is Rust, but it is much less clear how much of the generated Rust code can (or can not) be accurately seen as being translated from C code in the training set.

      • GaggiX1d

        Clang is not written in Rust tho

    • TZubiri17h

      >Is the generated code correct? The jury is still out on that one for production compilers. And then you have performance of generated code.

      It's worth noting that this was developed by compiling Linux and running tests, so at least that is part of the training set and not the testing set.

      But at least for Linux, I'm guessing the tests are robust enough that it will work correctly. That said, if any bugs pop up, they will show weak points in the Linux tests.

    • ur-whale11h

      > This LLM did it

      You do realize the LLM had access (via its training set) to your own work and "reused" it (not as is, of course), right?

    • jbjbjbjb23h

      It’s cool but there’s a good chance it’s just copying someone else’s homework albeit in an elaborate round about way.

      • nomel23h

        I would claim that LLMs desperately need proprietary code in their training, before we see any big gains in quality.

        There's some incredible source available code out there. Statistically, I think there's a LOT more not so great source available code out there, because the majority of output of seasoned/high skill developers is proprietary.

        To me, a surprising portion of Claude 4.5 output definitely looks like student homework answers, because I think that's closer to the mean of the code population.

        • dcre19h

          This is dead wrong: essentially the entirety of the huge gains in coding performance in the past year have come from RL, not from new sources of training data.

          I echo the other commenters that proprietary code isn’t any better, plus it doesn’t matter because when you use LLMs to work on proprietary code, it has the code right there.

          • thesz16h

              > the huge gains in coding performance in the past year have come from RL, not from new sources of training data.
            
            This one was on HN recently: https://spectrum.ieee.org/ai-coding-degrades

            The author attributes the past year's degradation of code generation by LLMs to excessive use of a new source of training data, namely users' code-generation conversations.

            • dcre9h

              Yeah, this is a bullshit article. There is no such degradation, and it’s absurd to say so on the basis of a single problem which the author describes as technically impossible. It is a very contrived under-specified prompt.

              And their “explanation” blaming the training data is just a guess on their part, one that I suspect is wrong. There is no argument given that that’s the actual cause of the observed phenomenon. It’s a just-so story: something that sounds like it could explain it but there’s no evidence it actually does.

              My evidence that RL is more relevant is that that’s what every single researcher and frontier-lab employee I’ve heard speak about LLMs in the past year has said. I have never once heard any of them mention new sources of pretraining data, except maybe synthetic data they generate and verify themselves, which contradicts the author’s story because it’s not shitty code grabbed off the internet.

              • thesz6h

                  > Yeah, this is a bullshit article. There is no such degradation, and it’s absurd to say so on the basis of a single problem which the author describes as technically impossible. It is a very contrived under-specified prompt.
                
                I see "No True Scotsman" argument above.

                  > My evidence is that RL is more relevant is that that’s what every single researcher and frontier lab employee I’ve heard speak about LLMs in the past year has said.
                
                Reinforcement learning reinforces what is already in the LM: it narrows the search path toward possible correct answers, whereas the wider search path in non-RL-tuned base models results in more correct answers [1].

                [1] https://openreview.net/forum?id=4OsgYD7em5

                  > I have never once heard any of them mention new sources of pretraining data, except maybe synthetic data they generate and verify themselves, which contradicts the author’s story because it’s not shitty code grabbed off the internet.
                
                Sources of training data have already been the subject of allegations, even leading to lawsuits. So I would suspect that no engineer from any LLM company would disclose anything about their sources of training data besides the innocent-sounding "synthetic data verified by ourselves."

                From my days working on blockchains, I am very skeptical of any company riding any hype. They face enormous competition, and they will buy, borrow, or steal to try to not go down even a little. So, until Anthropic opens up the way they train their model so that we can reproduce their results, I will suspect they leaked the test set into it and used users' code-generation conversations as a new source of training data.

                • dcre1h

                  That is not what No True Scotsman is. I’m pointing out a bad argument with weak evidence.

                  • thesz43m

                      >>> It is a very contrived under-specified prompt.
                    
                    No True Prompt could be so contrived and underspecified.

                    The article about degradation is a case study (a single prompt), the weakest kind of study in the hierarchy of knowledge. Case studies are the basis for further, more rigorous studies. And the author took the time to test his assumptions and presented quite clear evidence that such degradation might be present and that we should investigate.

          • elevation18h

            > it doesn’t matter because when you use LLMs to work on proprietary code, it has the code right there

            The quality of the existing code base makes a huge difference. On a recent greenfield effort, Claude emitted an MVP that matched the design semantics, but the code was not up to standards. For example, it repeatedly loaded a large file into memory in different areas where it was needed (rather than loading once and passing a reference).

            However, after an early refactor, the subsequently generated code vastly improved. It honors the testing and performance paradigms, and it's so clean there's nothing for the linter to do.
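
            The kind of thing I mean by the first point, in miniature (the shape of it, not the actual code):

                #include <stdio.h>
                #include <stdlib.h>

                typedef struct { char *data; size_t len; } Buffer;

                /* Stub standing in for an expensive read of a large file. */
                static Buffer load_file(const char *path) {
                    (void)path;
                    Buffer b = { malloc(1 << 20), 1 << 20 };
                    return b;
                }

                /* Before: each feature re-loads the file it needs. */
                static void feature_before(void) {
                    Buffer b = load_file("big.dat");
                    free(b.data);
                }

                /* After: load once, pass a reference around. */
                static void feature_after(const Buffer *b) { (void)b; }

                int main(void) {
                    feature_before();                /* wasteful pattern */
                    Buffer b = load_file("big.dat"); /* single load */
                    feature_after(&b);
                    free(b.data);
                    return 0;
                }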

          • nextos18h

            Progress with RL is very interesting, but it's still too inefficient. Current models do OK on simple boring linear code. But they output complete nonsense when presented with some compact but mildly complex code, e.g. a NumPyro model with some nesting and einsums.

            For this reason, to be truly useful, model outputs need to be verifiable. Formal verification with languages like Dafny, F*, or Isabelle might offer some solutions [1]. Otherwise, a gigantic software artifact such as a compiler is going to have critical correctness bugs with far-reaching consequences if deployed in production.

            Right now, I am not comfortable treating an LLM as anything other than a very useful information retrieval system with excellent semantic capabilities.

            [1] https://risemsr.github.io/blog/2026-02-04-nik-agentic-pop

            • dcre8h

              Human-written compilers have bugs too! It takes decades of use to iron them out, and we’re introducing new ones all the time.

        • bearjaws20h

          I will say many closed source repos are probably equally as poor as open source ones.

          Even worse in many cases because they are so over engineered nobody understands how they work.

          • hirvi7418h

            I firmly agree with your first sentence. I can just think about the various modders that have created patches and performance enhancing mods for games with budgets of tens to hundreds of millions of dollars.

            But to give other devs and myself some grace, I do believe plenty of bad code can likely be explained by bad deadlines. After all, what's the Russian idiom? "There is nothing more permanent than the temporary."

        • typ20h

          I'd bet, on average, the quality of proprietary code is worse than open-source code. There have been decades of accumulated slop generated by human agents with wildly varied skill levels, all vibe-coded by ruthless, incompetent corporate bosses.

          • Manouchehri20h

            There are only a few very niche fields where closed-source code quality is often better than open-source code.

            Exploits and HFT are the two examples I can think of. Both are usually closed source because of the financial incentives.

            • ozim17h

              Here we can start debating what "better code" means.

              I haven’t seen HFT code, but I have seen examples of exploit code, and most of it is amateur hour when it comes to building large systems.

              They are of course efficient in getting to the goal. But exploits are one-off code that is not there to be maintained.

          • Take843520h

            Not to mention, a team member is (surprise!) fired or let go, and no knowledge transfer exists. Womp, womp. Codebase just gets worse as the organization or team flails.

            Seen this way too often.

            • icedchai5h

              Developers are often treated as cogs. Anyone should be able to step in and pick things up instantly. It’s just typing, right? /s

          • hirvi7418h

            In my time, I have potentially written code that some legal jurisdictions might classify as a "crime against humanity" due to the quality.

          • kortilla18h

            It doesn’t matter what the average is though. If 1% of software is open source, there is significantly more closed source software out there and given normal skills distributions, that means there is at least as much high quality closed source software out there, if not significantly more. The trick is skipping the 95% of crap.

        • bhadass22h

          yeah, but isn't the whole point of claude code to get people to provide preference data/telemetry data to anthropic (unless you opt out?). same w/ other providers.

          i'm guessing most of the gains we've seen recently are post training rather than pretraining.

          • nomel21h

            Yes, but you have the problem that a good portion of that is going to be AI generated.

            But, I naively assume most orgs would opt out. I know some orgs have a proxy in place that will prevent certain proprietary code from passing through!

            This makes me curious if, in the allow case, Anthropic is recording generated output, to maybe down-weight it if it's seen in the training data (or something similar)?

        • andai22h

          Let's start with the source code for the Flash IDE :)

      • wvenable22h

        This is cool and actually demonstrates real utility. Using AI to take something that already exists and create it for a different library / framework / platform is cool. I'm sure there's a lot of training data in there for just this case.

        But I wonder how it would fare if given a language specification for a non-existent, non-trivial language and asked to build a compiler for that instead?

        • nmstoker21h

          If you come up with a realistic language spec and wait maybe six months, by then it'll probably be approaching cheap enough that you could test the scenario yourself!

      • luke544123h

        It looks like a much more progressed/complete version of https://github.com/kidoz/smdc-toolchain/tree/master/crates/s... . But that one is only a month old. So a bit confused there. Maybe that was also created via LLM?

      • nlawalker21h

        I see that as the point that all this is proving - most people, most of the time, are essentially reinventing the wheel at some scope and scale or another, so we’d all benefit from being able to find and copy each others’ homework more efficiently.

      • computerex20h

        And the goal post shifts.

      • kreelman21h

        A small thing, but it won't compile the RISC-V version of hello.c if the source isn't installed on the machine it's running on.

        It is standing on the shoulders of giants (all of the compilers of the past, built into its training data... and the recent learnings about getting these agents to break up tasks) to get itself going. Still fairly impressive.

        On a side quest, I wonder where Anthropic is getting their power from. The whole energy debacle in the US at the moment probably means it made some CO2 in the process. Would be hard to avoid?

    • eek212122h

      Also: a large number of folks seem to think Claude Code is losing a ton of money. I have no idea where the final numbers land; however, if the $20,000 figure is accurate, and based on some of the estimates I've seen, they could've hired 8 senior-level developers at a quarter million a year for the same amount of money spent internally.

      Granted, marketing sucks up far too much money for any startup, and again, we don't know the actual numbers in play, however, this is something to keep in mind. (The very same marketing that likely also wrote the blog post, FWIW).

      • willsmith7222h

        this doesn't add up. the 20k is in API costs. people talk about CC losing money because it's way more efficient than the API. I.e. the same work with efficient use of CC might have cost ~$5k.

        but regardless, hiring is difficult and high-end talent is limited. If the costs were anywhere close to equivalent, the agents are a no-brainer

        • NitpickLawyer11h

          > hiring is difficult and high-end talent is limited.

          Not only that, but firing talent is also a pain. You can't "hire" 10 devs for 2 weeks, and fire them afterwards. At least you can't keep doing that, people talk and no one would apply.

        • majormajor21h

          CC hits their APIs, and internally I'm sure Anthropic tracks those calls, which is what they seem to be referencing here. What exactly did Anthropic do in this test to have "inefficient use of CC" vs your proposed "efficient use of CC"?

          Or do you mean that if an external user replicated this experience they might get billed less than $20k due to CC being sold at lower rates than per-API-call metered billing?

      • GorbachevyChase22h

        Even if the dollar cost for product created was the same, the flexibility of being able to spin a team up and down with an API call is a major advantage. That AI can write working code at all is still amazing to me.

      • bloaf21h

        This thing was done in 2 weeks. In the orgs I've worked in, you'd be lucky to get HR approval to create a job posting within 2 weeks.

  • NitpickLawyer1d

    This is a much more reasonable take than the cursor-browser thing. A few things that make it pretty impressive:

    > This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQLite, postgres, redis

    > I started by drafting what I wanted: a from-scratch optimizing compiler with no dependencies, GCC-compatible, able to compile the Linux kernel, and designed to support multiple backends. While I specified some aspects of the design (e.g., that it should have an SSA IR to enable multiple optimization passes) I did not go into any detail on how to do so.

    > Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects.

    And the very open points about limitations (and hacks, as cc loves hacks):

    > It lacks the 16-bit x86 compiler that is necessary to boot [...] Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase

    > It does not have its own assembler and linker;

    > Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.

    Ending with a very down to earth take:

    > The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.

    All in all, I'd say it's a cool little experiment, impressive even with the limitations, and a good test case. As the author says, "The resulting compiler has nearly reached the limits of Opus’s abilities." Yeah, that's fair, but still highly impressive IMO.

    • geraneum1d

      > This was a clean-room implementation

      This is really pushing it, considering it’s trained on… the internet, with all available C compilers. The work is already impressive enough; no need for such misleading statements.

      • raincole1d

        It's not a clean-room implementation, but not because it's trained on the internet.

        It's not a clean-room implementation because of this:

        > The fix was to use GCC as an online known-good compiler oracle to compare against
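
        i.e., differential testing: conceptually a loop like the sketch below, where the `ccc` name and file paths are placeholders, not Anthropic's actual harness:

            #include <stdio.h>
            #include <stdlib.h>

            /* Differential testing against an oracle: build the same program
               with the trusted compiler and the compiler under test, run both,
               and flag any divergence (assuming the test program avoids UB). */
            int main(void) {
                if (system("gcc -O0 test.c -o ref") != 0) return 1; /* oracle */
                if (system("ccc test.c -o out") != 0) return 1;     /* under test */

                system("./ref > ref.txt");
                system("./out > out.txt");

                if (system("cmp -s ref.txt out.txt") != 0) {
                    fprintf(stderr, "MISMATCH: output diverges from the oracle\n");
                    return 2;
                }
                puts("OK");
                return 0;
            }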

        • Calavar24h

          The classical definition of a clean room implementation is something that's made by looking at the output of a prior implementation but not at the source.

          I agree that having a reference compiler available is a huge caveat though. Even if we completely put training data leakage aside, they're developing against a programmatic checker for a spec that's already had millions of man hours put into it. This is an optimal scenario for agentic coding, but the vast majority of problems that people will want to tackle with agentic coding are not going to look like that.

          • visarga17h

            This is the reimplementation scenario for agentic coding. If you have a good spec and battery of tests you can delete the code and reimplement it. Code is no longer the product of eng work, it is more like bytecode now, you regenerate it, you don't read it. If you have to read it then you are just walking a motorcycle.

            We have seen at least 3 of these projects - the JustHTML one, the FastRender one, and this one. All started from beefy tests and specs. They show reimplementation without manual intervention kind of works.

            • Calavar16h

              I think that's overstating it.

              JustHTML is a success in large part because it's a problem that can be solved with 4 digit LOC. The whole codebase can sit in an LLM's context at once. Do LLMs scale beyond that?

              I would classify both FastRender and Opus C compiler as interesting failures. They are interesting because they got a non-negligible fraction of the way to feature complete. They are failures because they ended with no clear path for moving the needle forward to 80% feature complete, let alone 100%.

              From the original article:

              > The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.

              From the experiments we've seen so far it seems that a large enough agentic code base will inevitably collapse under its own weight.

            • jayd163h

              > Code is no longer the product of eng work

              Never was.

            • franktankbank7h

              Great way to get constantly moving holes.

        • array_key_first23h

          If you read the entire GCC source code and then create a compatible compiler, it's not clean room. Which Opus basically did since, I'm assuming, its training set contained the entire source of GCC. So even if they weren't actively referencing GCC, I think that counts.

          • nmilo22h

            What if you just read the entire GCC source code in school 15 years ago? Is that not clean room?

            • hex4def622h

              No.

              I'd argue that no one would really care given it's GCC.

              But if you worked for GiantSodaCo on their secret recipe under NDA, then create a new soda company 15 years later that tastes suspiciously similar to GiantSodaCo, you'd probably have legal issues. It would be hard to argue that you weren't using proprietary knowledge in that case.

              • Zambyte7h

                Given that GCC is not public domain, the copyright holders will probably care.

          • pertymcpert16h

            I read the source. If anything it takes concepts from LLVM more than GCC, but the similarities aren't very deep.

      • GorbachevyChase21h

        https://arxiv.org/abs/2505.03335

        Check out the paper above on Absolute Zero. Language models don’t just repeat code they’ve seen. They can learn to code given the right training environment.

      • TacticalCoder23h

        I'm using AI to help me code and I love Anthropic, but I choked when I read that in TFA too.

        It's anything but a clean-room design. A clean-room design is a very well defined term: "Clean-room design (also known as the Chinese wall technique) is the method of copying a design by reverse engineering and then recreating it without infringing any of the copyrights associated with the original design."

        https://en.wikipedia.org/wiki/Clean-room_design

        The "without infringing any of the copyrights" contains "any".

        We know for a fact that models are extremely good at storing information at remarkably high compression rates. The fact that a model typically decompresses that information in a lossy way does not mean it didn't use that information in the first place.

        Note that I'm not saying all AIs do is simply compress/decompress information. I'm saying that, as commenters noted in this thread, when a model was caught spitting out Harry Potter verbatim, there is information being stored.

        It's not a clean-room design, plain and simple.

      • cryptonector20h

        Hmm... If Claude iterated a lot then chances are very good that the end result bears little resemblance to open source C compilers. One could check how much resemblance the result actually bears to open source compilers, and I rather suspect that if anyone does check they'll find it doesn't resemble any open source C compiler.

      • iberator16h

        this. last sane person in HN

      • antirez1d

        The LLM does not contain a verbatim copy of whatever it saw during the pre-training stage. It may remember certain over-represented parts; otherwise, it has knowledge about a huge number of topics, similar to the way you remember things you know very well. And, indeed, if you give it access to the internet or the source code of GCC and other compilers, it will implement such a project N times faster.

        • halxc1d

          We all saw verbatim copies in the early LLMs. They "fixed" it by implementing filters that trigger rewrites on blatant copyright infringement.

          It is a research topic for heaven's sake:

          https://arxiv.org/abs/2504.16046

          • RyanCavanaugh1d

            The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte. While they are certainly capable of doing some verbatim recitations, this isn't just a matter of teasing out the compressed C compiler written in Rust that's already on the internet (where?) and stored inside the model.

            • silver_sun19h

              > this isn't just a matter of teasing out the compressed C compiler written in Rust that's already on the internet (where?)

              A quick search brings up several C compilers written in Rust. I'm not claiming they are necessarily in Claude's training data, but they do exist.

              https://github.com/PhilippRados/wrecc (unfinished)

              https://github.com/ClementTsang/rustcc

              https://codeberg.org/notgull/dozer (unfinished)

              https://github.com/jyn514/saltwater

              I would also like to add that as language models improve (in the sense of decreasing loss on the training set), they in fact become better at compressing their training data ("the Internet"), so that a model that is "half a terabyte" could represent many times more concepts with the same amount of space. Only comparing the relative size of the internet vs a model may not make this clear.

            • philipportner1d

              This seems related; it may not be a codebase, but they were able to extract "near-verbatim" books out of Claude Sonnet.

              https://arxiv.org/pdf/2601.02671

              > For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).

              • Aurornis24h

                Their technique really stretched the definition of extracting text from the LLM.

                They used a lot of different techniques to prompt with actual text from the book, then asked the LLM to continue the sentences. I only skimmed the paper but it looks like there was a lot of iteration and repetitive trials. If the LLM successfully guessed words that followed their seed, they counted that as "extraction". They had to put in a lot of the actual text to get any words back out, though. The LLM was following the style and clues in the text.

                You can't literally get an LLM to give you books verbatim. These techniques always involve a lot of prompting and continuation games.

                • D-Machine20h

                  To make some vague claims explicit here, for interested readers:

                  > "We quantify the proportion of the ground-truth book that appears in a production LLM’s generated text using a block-based, greedy approximation of longest common substring (nv-recall, Equation 7). This metric only counts sufficiently long, contiguous spans of near-verbatim text, for which we can conservatively claim extraction of training data (Section 3.3). We extract nearly all of Harry Potter and the Sorcerer’s Stone from jailbroken Claude 3.7 Sonnet (BoN N = 258, nv-recall = 95.8%). GPT-4.1 requires more jailbreaking attempts (N = 5179) [...]"

                  So, yes, it is not "literally verbatim" (~96% verbatim), and it indeed takes A LOT (hundreds or thousands of prompting attempts) to make this happen.

                  I leave it up to the reader to judge how much this weakens the more basic claims of the form "LLMs have nearly perfectly memorized some of their source / training materials".
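
                  To see the shape of such a metric, here is a toy single-substring version (my own sketch, not the paper's block-based Equation 7):

                      #include <stdio.h>
                      #include <string.h>

                      /* Longest common substring via O(n*m) dynamic programming,
                         using two rolling rows. */
                      static size_t lcsubstr(const char *a, const char *b) {
                          static size_t dp[2][4096]; /* assumes strlen(b) < 4096 */
                          size_t n = strlen(a), m = strlen(b), best = 0;
                          for (size_t i = 1; i <= n; i++)
                              for (size_t j = 1; j <= m; j++) {
                                  dp[i & 1][j] = (a[i-1] == b[j-1]) ? dp[(i-1) & 1][j-1] + 1 : 0;
                                  if (dp[i & 1][j] > best) best = dp[i & 1][j];
                              }
                          return best;
                      }

                      int main(void) {
                          const char *reference = "it was a bright cold day in april";
                          const char *generated = "it was a bright cold day in June";
                          /* recall ~ longest contiguous shared span / reference length */
                          printf("recall ~ %.2f\n",
                                 (double)lcsubstr(reference, generated) / strlen(reference));
                          return 0;
                      }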

                  I am imagining a grueling interrogation that "cracks" a witness, so he reveals perfect details of the crime scene that couldn't possibly have been known to anyone that wasn't there, and then a lawyer attempting the defense: "but look at how exhausting and unfair this interrogation was--of course such incredible detail was extracted from my innocent client!"

                  • DiogenesKynikos16h

                    The one-shot performance of their recall attempts is much less impressive. The two best-performing models were only able to reproduce about 70% of a 1000-token string. That's still pretty good, but it's not as if they spit out the book verbatim.

                    In other words, if you give an LLM a short segment of a very well known book, it can guess a short continuation (several sentences) reasonably accurately, but it will usually contain errors.

                    • D-Machine15h

                      Right, and this should be contextualized with respect to code generation. It is not crazy to presume that LLMs have effectively nearly perfectly memorized certain training sources, but the ability to generate / extract outputs that are nearly identical to those training sources will of course necessarily be highly contingent on the prompting patterns and complexity.

                      So, dismissals of "it was just translating C compilers in the training set to Rust" need to be carefully quantified, but, also, need to be evaluated in the context of the prompts. As others in this post have noted, there are basically no details about the prompts.

                • Calavar21h

                  Sure, maybe it’s tricky to coerce an LLM into spitting out a near-verbatim copy of prior data, but that’s orthogonal to whether or not the data to create a near-verbatim copy exists in the model weights.

                  • Paradigma114h

                    Like with those chimpanzees creating Shakespeare.

                  • D-Machine20h

                    Especially since the recalls achieved in the paper are ~96% (based on block-based longest-common-substring approaches), the effort of extraction is utterly irrelevant.

            • seba_dos124h

              > The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte.

              The lesson here is that the Internet compresses pretty well.

            • mft_1d

              (I'm not needlessly nitpicking, as I think it matters for this discussion)

              A frontier model (e.g. latest Gemini, Gpt) is likely several-to-many times larger than 500GB. Even Deepseek v3 was around 700GB.

              But your overall point still stands, regardless.

            • uywykjdskn23h

              You got a source on frontier models being maybe half a terabyte? That's not passing the sniff test.

          • ben_w1d

            We saw partial copies of large or rare documents, and full copies of smaller widely-reproduced documents, not full copies of everything. An e.g. 1 trillion parameter model is not a lossless copy of a ten-petabyte slice of plain text from the internet.

            The distinction may not have mattered for copyright laws if things had gone down differently, but the gap between "blurry JPEG of the internet" and "learned stuff" is more obviously important when it comes to e.g. "can it make a working compiler?"

            • tza54j1d

              We are here in a clean room implementation thread, and verbatim copies of entire works are irrelevant to that topic.

              It is enough to have read even parts of a work for something to be considered a derivative.

              I would also argue that language models that need gargantuan amounts of training material in order to work can, by definition, only output derivative works.

              It does not help that certain people in this thread (not you) edit their comments to backpedal and make the followup comments look illogical, but that is in line with their sleazy post-LLM behavior.

              • ben_w1d

                > It is enough to have read even parts of a work for something to be considered a derivative.

                For IP rights, I'll buy that. Not as important when the question is capabilities.

                > I would also argue that language models who need gargantuan amounts of training material in order to work by definition can only output derivative works.

                For similar reasons, I'm not going to argue against anyone saying that all machine learning today, doesn't count as "intelligent":

                It is perfectly reasonable to define "intelligence" to be the inverse of how many examples are needed.

                ML partially makes up for being (by this definition) thick as an algal bloom, by being stupid so fast it actually can read the whole internet.

            • philipportner1d

              Granted, these are some of the most widely spread texts, but just fyi:

              https://arxiv.org/pdf/2601.02671

              > For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).

              • D-Machine20h

                Note "near-verbatim" here is:

                > "We quantify the proportion of the ground-truth book that appears in a production LLM’s generated text using a block-based, greedy approximation of longest common substring (nv-recall, Equation 7). This metric only counts sufficiently long, contiguous spans of near-verbatim text, for which we can conservatively claim extraction of training data (Section 3.3). We extract nearly all of Harry Potter and the Sorcerer’s Stone from jailbroken Claude 3.7 Sonnet (BoN N = 258, nv-recall = 95.8%). GPT-4.1 requires more jailbreaking attempts (N = 5179) and refuses to continue after reaching the end of the first chapter; the generated text has nv-recall = 4.0% with the full book. We extract substantial proportions of the book from Gemini 2.5 Pro and Grok 3 (76.8% and 70.3%, respectively), and notably do not need to jailbreak them to do so (N = 0)."

                if you want to quantify the "near" here.

              • ben_w1d

                Already aware of that work, that's why I phrased it the way I did :)

                Edit: actually, no, I take that back, that's just very similar to some other research I was familiar with.

            • antirez1d

              Besides, the fact that an LLM may recall parts of certain documents, like I can recall the incipits of certain novels, does not mean that when you ask the LLM to do other kinds of work that aren't recall, it will mix such things in verbatim. The LLM knows what it is doing in a variety of contexts, and uses its knowledge to produce stuff. The fact that LLMs being able to do things that replace humans is bitter for many people does not mean (and it is not true) that this happens mainly through memorization. What coding agents can do today has no explanation in memorization of verbatim material. So it's not a matter of copyright. Certain folks are fighting the wrong battle.

              • shakna1d

                During a "clean room" implementation, the implementor is generally selected for not being familiar with the workings of what they're implementing, and banned from researching using it.

                Because it _has_ been enough: if you can recall things, your implementation ends up not being "clean room", and gets trashed by the lawyers who get involved.

                I mean... It's in the name.

                > The term implies that the design team works in an environment that is "clean" or demonstrably uncontaminated by any knowledge of the proprietary techniques used by the competitor.

                If it can recall... Then it is not a clean room implementation. Fin.

            • boroboro41d

              While I mostly agree with you, it's worth noting modern LLMs are trained on 10-30T tokens, which is quite comparable to their size (especially given how compressible the data is).

          • Aurornis24h

            Simple logic will demonstrate that you can't fit every document in the training set into the parameters of an LLM.

            Citing a random arXiv paper from 2025 doesn't mean "they" used this technique. It was someone's paper that they uploaded to arXiv, which anyone can do.

          • soulofmischief1d

            The point is that it's a probabilistic knowledge manifold, not a database.

            • PunchyHamster1d

              we all know that.

              • soulofmischief23h

                Unfortunately, that doesn't seem to be the case. The person I replied to might not understand this, either.

        • majormajor21h

          You couldn't reasonably claim you did a clean-room implementation of something you had read the source to even though you, too, would not have a verbatim copy of the entire source code in your memory (barring very rare people with exceptional memories).

          It's kinda the whole point - you haven't read it so there's no doubt about copying in a clean-room experiment.

          A "human style" clean-room copy here would have to be using a model trained on, say, all source code except GCC. Which would still probably work pretty well, IMO, since that's a pretty big universe still.

        • PunchyHamster1d

          So it will copy most code while adding subtle bugs.

    • modeless1d

      There seem to still be a lot of people who look at results like this and evaluate them purely based on the current state. I don't know how you can look at this and not realize that it represents a huge improvement over just a few months ago, there have been continuous improvements for many years now, and there is no reason to believe progress is stopping here. If you project out just one year, even assuming progress stops after that, the implications are staggering.

      • zamadatix1d

        The improvements in tool use and agentic loops have been fast and furious lately, delivering great results. The model growth itself is feeling more "slow and linear" lately, but what you can do with models as part of an overall system has been increasing in growth rate and that has been delivering a lot of value. It matters less if the model natively can keep infinite context or figure things out on its own in one shot so long as it can orchestrate external tools to achieve that over time.

      • LinXitoW19h

        The main issue with improvements in the last year is that a lot of it is based not on the models strictly becoming better, but on tooling being better, and simply using a fuckton more tokens for the same task.

        Remember that all these companies can only exist because of massive (over)investments in the hope of insane returns and AGI promises. Meanwhile, all these improvements (imho) prove the exact opposite: AGI is absolutely not coming, and the investments aren't going to generate these outsized returns. They will generate decent returns, and the tools are useful.

        • modeless16h

          I disagree. A year ago the models would not come close to doing this, no matter what tools you gave them or how many tokens you generated. Even three months ago. Effectively using tools to complete long tasks required huge improvements in the models themselves. These improvements were driven not by pretraining like before, but by RL with verifiable rewards. This can continue to scale with training compute for the foreseeable future, eliminating the "data wall" we were supposed to be running into.

      • nozzlegear1d

        Every S-curve looks like an exponential until you hit the bend.

        • NitpickLawyer1d

          We've been hearing this for 3 years now. And especially '25 was full of "they've hit a wall, no more data, running out of data, plateau this, saturated that". And yet, here we are. Models keep on getting better, at broader tasks, and more useful by the month.

          • LinXitoW19h

            Model improvement is very much slowing down, if we actually use fair metrics. Most improvements in the last year or so come down to external improvements, like better tooling, or the highly sophisticated practice of throwing way more tokens at the same problem (reasoning and agents).

            Don't get me wrong, LLMs are useful. They just aren't the kind of useful that Sam et al. sold investors. No AGI, no full human worker replacement, no massive reduction in cost for SOTA.

          • kelnos23h

            Yes, and Moore's law took decades to start to fail to be true. Three years of history isn't even close to enough to predict whether or not we'll see exponential improvement, or an unsurmountable plateau. We could hit it in 6 months or 10 years, who knows.

            And at least with Moore's law, we had some understanding of the physical realities as transistors got smaller and smaller, and could reasonably predict when we'd start to hit limitations. With LLMs, we just have no idea. And that could go either way.

          • nozzlegear1d

            > We've been hearing this for 3 years now

            Not from me you haven't!

            > "they've hit a wall, no more data, running out of data, plateau this, saturated that"

            Everyone thought Moore's Law was infallible too, right until they hit that bend. What hubris to think these AI models are different!

            But you've probably been hearing that for 3 years too (though not from me).

            > Models keep on getting better, at more broad tasks, and more useful by the month.

            If you say so, I'll take your word for it.

            • torginus1d

              Except that with Moore's law, everyone knew decades ahead what the limits of Dennard scaling were (shrinking geometry through smaller optical feature sizes), and roughly when we would get to the limit.

              Since then, all improvements came at a tradeoff, and there was a definite flattening of progress.

              • nozzlegear24h

                > Since then, all improvements came at a tradeoff, and there was a definite flattening of progress.

                Idk, that sounds remarkably similar to these AI models to me.

              • kijiki20h

                Everyone?

                Intel, at the time the unquestioned world leader in semiconductor fabrication, was so unable to accurately predict the end of Dennard scaling that they rolled out the Pentium 4. "10GHz by 2010!" was something they predicted publicly in earnest!

                It, uhhh, didn't quite work out that way.

            • Cyphase1d

              25 is 2025.

              • nozzlegear24h

                Oh my bad, the way it was worded made me read it as the name of somebody's model or something.

          • fmbb24h

            > And yet, here we are.

            I dunno. To me it doesn’t even look exponential any more. We are at most on the straight part of the incline.

            • sdf2erf21h

              Personally, my usage has fallen off a cliff the past few months. I'm not a SWE.

              SWEs may be seeing benefits. But in other areas? Doesn't seem to be the case. Consumers may use it as a preferred interface for search - but this is a different discussion.

        • raincole1d

          This quote would be more impactful if people haven't been repeating it since gpt-4 time.

          • kimixa1d

            People have also been saying we'd be seeing the results of 100x quality improvements in software, with a corresponding decrease in cost, since gpt-4 time.

            So where is that?

          • nozzlegear1d

            I agree, I have been informed that people have been repeating it for three years. Sadly I'm not involved in the AI hype bubble so I wasn't aware. What an embarrassing faux pas.

        • esafak17h

          What if it plateaus smarter than us? You wouldn't be able to discern where it stopped. I'm not convinced it won't be able to create its own training data to keep improving. I see no ceiling on the horizon, other than energy.

        • famouswaffles20h

          Cool I guess. Kind of a meaningless statement yeah? Let's hit the bend, then we'll talk. Until then repeating, 'It's an S Curve guys and what's more, we're near the bend! trust me" ad infinitum is pointless. It's not some wise revelation lol.

          • smj-edison20h

            Maybe the best thing to say is we can only really forecast about 3 months out accurately, and the rest is wild speculation :)

            History has a way of being surprisingly boring, so personally I'm not betting on the world order being transformed in five years, but I also have to take my own advice and take things a day at a time.

          • nozzlegear18h

            > Kind of a meaningless statement yeah?

            If you say so. It's clear you think these marketing announcements are still "exponential improvements" for some reason, but hey, I'm not an AI hype beast so by all means keep exponentialing lol

            • famouswaffles2h

              I'm not asking you to change your belief. By all means, think we're just around the corner of a plateau, but like I said, your statement is nothing meaningful or profound. It's your guess that things are about to slow down, that's all. It's better to just say that rather than talking about S curves and bends like you have any more insight than OP.

      • chasd001d

        I have to admit, even if model and tooling progress stopped dead today, the world of software development has forever changed and will never go back.

      • uywykjdskn23h

        Yea the software engineering profession is over, even if all improvements stop now.

    • gmueckl1d

      The result is hardly a clean room implementation. It was rather a brute force attempt to decompress fuzzily stored knowledge contained within the network and it required close steering (using a big suite of tests) to get a reasonable approximation to the desired output. The compression and storage happened during the LLM training.

      Prove this statement wrong.

      • libraryofbabel1d

        Nobody disputes that the LLM was drawing on knowledge in its training data. Obviously it was! But you'll need to be a bit more specific with your critique, because there is a whole spectrum of interpretations, from "it just decompressed fuzzily-stored code verbatim from the internet" (obviously wrong, since the Rust-based C compiler it wrote doesn't exist on the internet) all the way to "it used general knowledge from its training about compiler architecture and x86 and the C language."

        Your post is phrased like it's a two sentence slam-dunk refutation of Anthropic's claims. I don't think it is, and I'm not even clear on what you're claiming precisely except that LLMs use knowledge acquired during training, which we all agree on here.

        • nicoburns23h

          "clean room" usually means "without looking at the source code" of other similar projects. But presumably the AIs training data would have included GCC, Clang, and probably a dozen other C compilers.

          • signatoremo22h

            Suppose you, the human, are working on a clean room implementation of a C compiler. How do you go about doing it? Will you need to know about: a) the C language, and b) the inner workings of a compiler? How did you acquire that knowledge?

            • sarchertech19h

              Doesn’t matter how you gain general knowledge of compiler techniques as long as you don’t have specific knowledge of the implementation of the compiler you are reverse engineering.

              If you have ever read the source code of the compiler you are reverse engineering, you are by definition not doing a clean room implementation.

              • signatoremo21m

                > you are by definition not doing a clean room implementation.

                This makes no sense. Reverse engineering IS an application of clean room implementation. Citing Wikipedia:

                “Clean-room design (also known as the Chinese wall technique) is the method of copying a design by reverse engineering and then recreating it without infringing any of the copyrights associated with the original design”

                https://en.wikipedia.org/wiki/Clean-room_design

              • pertymcpert15h

                Claude was not reverse engineering here. By your definition no one can do a clean room implementation if they've taken a recent compilers course at university.

                • sarchertech10h

                  Claude was reverse engineering gcc. It was using it as an oracle and attempting to exactly match its output. That is the definition of reverse engineering. Since Claude was trained on the gcc source code, that's not a clean room implementation.

                  > By your definition no one can do a clean room implementation if they've taken a recent compilers course at university.

                  Clean room implementation has a very specific definition. It’s not my definition. If your compiler course walked through the source code of a specific compiler then no you couldn’t build a clean room implementation of that specific compiler.

        • gmueckl22h

          The result is a fuzzy reproduction of the training input, specifically of the compilers contained within it. The reproduction in a different, yet still similar enough, programming language does not refute that. The implementation was strongly guided by a reference compiler and a suite of tests that acted as an explicit filter on those outputs, limiting the acceptable solution space and excluding the unwanted interpolations of the training set that also result from the lossy input compression.

          The fact that the implementation language for the compiler is rust doesn't factor into this. ML based natural language translation has proven that model training produces an abstract space of concepts internally that maps from and to different languages on the input and output side. All this points to is that there are different implicitly formed decoders for the same compressed data embedded in the LLM and the keyword rust in the input activates one specific to that programming language.

          • astrange12h

            > The result is a fuzzy reproduction of the training input, specifically of the compilers contained within.

            Is it? I'm somewhat familiar with gcc and clang's source and it doesn't really particularly look like it to me.

            https://github.com/anthropics/claudes-c-compiler/blob/main/s...

            https://llvm.org/doxygen/LoopStrengthReduce_8cpp_source.html

            https://github.com/gcc-mirror/gcc/blob/master/gcc/gimple-ssa...

            • gmueckl6h

              Checking for similarity with compilers that consist of orders of magnitude more code probably doesn't reveal much. There are many more smaller compilers for C-adjacent languages out there, plus code fragments from textbooks.

              • astrange5h

                There are not many more compilers with the specific optimization pass I linked.

                Also, I don't think you could reuse code from a different compiler unless you used the same IR.

          • libraryofbabel21h

            Thanks for elaborating. So what is the empirically-testable assertion behind this… that an LLM cannot create a (sufficiently complex) system without examples of the source code of similar systems in its training set? That seems empirically testable, although not for compilers without training a whole new model that excludes compiler source code from training. But what other kind of system would count for you?

            • gmueckl6h

              I personally work on simulation software and create novel simulation methods as part of the job. I find that LLMs can only help if I reduce the task to a translation of detailed algorithm descriptions from English to code. And even then, the output is often riddled with errors.

      • NitpickLawyer1d

        > Prove this statement wrong.

        If all it takes is "trained on the Internet" and "decompress stored knowledge", then surely gpt3, 3.5, 4, 4.1, 4o, o1, o3, o4, 5, 5.1, 5.x should have been able to do it, right? Claude 2, 3, 4, 4.1, 4.5? Surely.

        • shakna1d

          Well, "Reimplement the c4 compiler - C in four functions" is absolutely something older models can do. Because most are trained, on that quite small product - its 20kb.

          But reimplementing that isn't impressive, because its not a clean room implementation if you trained on that data, to make the model that regurgitates the effort.

          • signatoremo22h

            > Well, "Reimplement the c4 compiler - C in four functions" is absolutely something older models can do.

            Are you sure about that? Do you have some examples? The older Claude models can’t do it according to TFA.

            • shakna18h

              Not ones I recorded. But something I threw at DeepSeek, early Claude, etc.

              And the prompt was just that. Nothing detailed.

        • gmueckl1d

          This comparison is only meaningful with comparable numbers of parameters and context window tokens. And then it would mainly test the efficiency and accuracy of the information encoding. I would argue that this is the main improvement over all model generations.

        • geraneum1d

          Perhaps 4.5 could also do it? We don't really know until we try. I don't trust the marketing material that much. The fact that previous (smaller) versions could or couldn't do it does not really disprove that claim.

        • hn_acc11d

          Are you really asking for "all the previous versions were implemented so poorly they couldn't even do this simple, basic LLM task"?

          • Philpax23h

            Please look at the source code and tell me how this is a "simple, basic LLM task".

      • Marha011d

        Even with 1 TB of weights (probable size of the largest state of the art models), the network is far too small to contain any significant part of the internet as compressed data, unless you really stretch the definition of data compression.

        • jesse__1d

          This sounds very wrong to me.

            Take the C4 training dataset, for example. The uncompressed, uncleaned size of the dataset is ~6TB, and it contains an exhaustive English-language scrape of the public internet from 2019. The cleaned (still uncompressed) dataset is significantly less than 1TB.

          I could go on, but, I think it's already pretty obvious that 1TB is more than enough storage to represent a significant portion of the internet.

          • FeepingCreature1d

            This would imply that the English internet is not much bigger than 20x the English Wikipedia.

            That seems implausible.

            • jesse__23h

              > That seems implausible.

              Why, exactly?

                Refuting facts with "I doubt it, bro" isn't exactly a productive contribution to the conversation.

              • onraglanroad4h

                Because we can count? How could you possibly think that Wikipedia was 5% of the whole Internet? It's just such a bizarrely foolish idea.

        • kgeist1d

          A lot of the internet is duplicate data, low quality content, SEO spam etc. I wouldn't be surprised if 1 TB is a significant portion of the high-quality, information-dense part of the internet.

          • FeepingCreature1d

            I would be extremely surprised if it was that small.

            • artisin7h

              I was curious about the scale of 1TiB of text. According to WolframAlpha, it's roughly 1.1 trillion characters, which breaks down to 180.2 billion words, 360.5 million pages, or 16.2 billion lines. In terms of professional typing speed, that's about 3800 years of continuous work.

              So post-deduplication, I think it's a fair assessment that a significant portion of high-quality text could fit within 1TiB. Tho 'high-quality' is a pretty squishy and subjective term.
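
                For the curious, here's the back-of-the-envelope math as a runnable sketch (my own toy numbers: 1 byte per character, ~6 characters per word, ~3,050 characters per page; not necessarily WolframAlpha's exact assumptions):

                    // Rough scale of 1 TiB of plain ASCII text.
                    fn main() {
                        let chars = (1u64 << 40) as f64; // 1 TiB at 1 byte per char
                        let words = chars / 6.0;         // ~6 chars per English word
                        let pages = chars / 3050.0;      // ~3,050 chars per page
                        println!("chars: {chars:.2e}, words: {words:.2e}, pages: {pages:.2e}");
                    }
                    // -> chars: 1.10e12, words: 1.83e11, pages: 3.61e8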

            • kaibee18h

              Well, a terabyte of text is... quite a lot of text.

        • gmueckl1d

          This is obviously wrong. There is a bunch of knowledge embedded in those weights, and some of it can be recalled verbatim. So, by virtue of this recall alone, training is a form of lossy data compression.

      • 0xCMP1d

        I challenge anyone to try building a C compiler without a big suite of tests. Zig is the most recent attempt and they had an extensive test suite. I don't see how that is disqualifying.

        If you're testing a model I think it's reasonable that "clean room" have an exception for the model itself. They kept it offline and gave it a sandbox to avoid letting it find the answers for itself.

        Yes, the compression and storage happened during the training. But before, it still didn't work; now it does much better.

        • hn_acc11d

          The point is - for a NEW project, no one has an extensive test suite. And if an extensive test suite exists, it's probably because the product that uses it already exists.

          If it could translate the C++ standard INTO an extensive test suite that actually captures most corner cases and doesn't generate false positives - again, without internet access and without using gcc as an oracle - that would be far more impressive.

      • brutalc1d

        No one needs to prove you wrong. That's just personal insecurity trying to justify one's own worth.

    • panzi1d

      > clean-room implementation

      Except it's trained on all the source out there, so I assume on GCC and clang. I wonder how similar the code is to either.

      • pertymcpert15h

        I'm familiar with both compilers. There's more similarity to LLVM; it even borrows some naming, such as mem2reg (which doesn't really exist anymore) and GetElementPtr. But that's pretty much where the similarities end. The rest of it is just common sense.

    • shubhamjain14h

      Yeah, I am amazed how people are brushing this off simply because GCC exists. This was a far more challenging task than the browser thing, because of how few open source compilers are out there. Add to that no internet access and no dependencies.

      At this point, it’s hard to deny that AI has become capable of completing extremely difficult tasks, provided it has enough time and tokens.

      • bjackman13h

        I don't think this is more challenging than the browser thing. The scope is much smaller. The fact that this is "only" 100k lines is evidence for this. But, it's still very impressive.

        I think this is Anthropic seeing the Cursor guy's bullshit and saying "but, we need to show people that the AI _can actually_ do very impressive shit as long as you pick a more sensible goal"

    • kelnos23h

      Honestly I don't find it that impressive. I mean, it's objectively impressive that it can be done at all, but it's not impressive from the standpoint of doing stuff that nearly all real-world users will want it to do.

      The C specification and Linux kernel source code are undoubtedly in its training data, as are texts about compilers from a theoretical/educational perspective.

      Meanwhile, I'm certain most people will never need it to perform this task. I would be more interested in seeing if it could add support for a new instruction set to LLVM, for example. Or perhaps write a compiler for a new language that someone just invented, after writing a first draft of a spec for it.

      • steveklabnik22h

        > Or perhaps write a complier for a new language that someone just invented, after writing a first draft of a spec for it.

        Hello, this is what I did over my Christmas break. I've been taking some time to do other things, but plan on returning to it. But this absolutely works. Claude has written far more programs in my language than I have.

        https://rue-lang.dev/ if you want to check it out. Spec and code are both linked there.

      • simonw23h

        Are you a frequent user of coding agents?

        I ask because, as someone who uses these things every day, the idea that this kind of thing only works because of similar projects in the training data doesn't fit my mental model of how they work at all.

        I'm wondering if the "it's in the training data" theorists are coding agent practitioners, or if they're mainly people who don't use the tools.

        • bdangubic23h

          I'm an all-day, daily user (multiple Claude Max accounts). This mostly fits my mental model, though not the model I had before; it's one I developed with daily use. My job revolves around two core things:

          1. data analysis / visualization / …

          2. “is this possible? can this even be done?”

          For #1 I don't do much anymore; for #2 I still mostly do it all "by hand", and not for lack of serious trying. So "it can do #1 1000x better than me because those are generally solved problems it was trained on, while it can't effectively do #2" fits perfectly.

    • chamomeal19h

      What's making these models so much better on every iteration? Is it new data? Different training methods?

      Kinda waiting for them to plateau so I can stop feeling so existential ¯\_(ツ)_/¯

      • esafak17h

        More compute (bigger models, and prediction-time scaling), algorithmic advances, and ever more data (including synthetic).

        Remember that all white collar workers are in your position.

    • dyauspitr1d

      > Claude did not have internet access at any point during its development

      Why is this even desirable? I want my LLM to take into account everything there is out there and give me the best possible output.

      • simonw24h

        It's desirable if you're trying to build a C compiler as a demo of coding agent capabilities without all of the Hacker News commenters saying "yeah but it could just copy implementation details from the internet".

  • andrewshawcare19h

    It used the best tests it could find for existing compilers. This is effectively steering Claude to a well-defined solution.

    Hard to find fully specified problems like this in the wild.

    I think this is more a testament to small, well-written tests than it is to agent teams. I imagine you could do the same thing with any frontier model and a single agent in a linear flow.

    I don't know why people use parallel agents and increase accidental complexity. Isn't one agent fast enough? Why lose accuracy over plus or minus one week when writing a compiler?

    > Write extremely high-quality tests

    > Claude will work autonomously to solve whatever problem I give it. So it’s important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem. Improving the testing harness required finding high-quality compiler test suites, writing verifiers and build scripts for open-source software packages, and watching for mistakes Claude was making, then designing new tests as I identified those failure modes.

    > For example, near the end of the project, Claude started to frequently break existing functionality each time it implemented a new feature. To address this, I built a continuous integration pipeline and implemented stricter enforcement that allowed Claude to better test its work so that new commits can’t break existing code.

    • tantalor18h

      Why didn't Claude realize on its own that it needed a continuous integration pipeline?

      Far too much human intervention here.

    • sublimefire13h

      > Isn’t one agent fast enough? Why lose accuracy over +- one week to write a compiler?

      My thinking as well. IMO it's because otherwise you have to wait longer for results; you basically want to shorten the feedback loops to improve the system. It hints at a deeper problem: most of what we see is the challenge of seeding a good context so the model can successfully do something over many iterations.

    • krzat15h

      You know what else is well specified? An LLM improving on itself.

      • widdershins14h

        I wouldn't describe intelligence as well specified. We can't even agree on what it is.

    • GalaxyNova19h

      > Hard to find fully specified problems like this in the wild.

      This is such a big and obvious cope. This is obviously a very real problem in the wild and there are many, many others like it. Probably most problems are like this honestly or can be made to be like this.

      • anematode19h

        Impressive, my sarcasm/bait detector almost failed me.

  • hmry1d

    If I, a human, read the source code of $THING and then later implement my own version, that's not a "clean-room" re-implementation. The whole point of "clean-room" is that no single person has access to both the original code and the new code. (That way, you can legally prove that no copyright infringement took place.)

    But when an AI does it, now it counts? Opus is trained on the source code of Clang, GCC, TCC, etc. So this is not "clean-room".

    • astrange12h

      Copyright doesn't protect ideas, it protects writing. Avoiding reading LLVM or GCC is to protect you from other kinds of IP issues, but it's not a copyright issue. The same people contribute to both projects despite their different licenses.

      • hmry9h

        They don't call Clang a "clean-room implementation". Unlike Anthropic, who are calling their project exactly that

        A clean-room implementation is when you implement a replacement by only looking at the behavior and documentation (possibly written by another person on your team who is not allowed to write code, only documentation).

    • bmandale24h

      That's not the only way to protect yourself from accusations of copyright infringement. I remember reading that the GNU utils were designed to be as performant as possible in order to force the authors to structure the code differently from the Unix originals.

      • Crestwave21h

        Yes, but Anthropic is specifically claiming their implementation is clean-room, while GNU never made that claim AFAIK.

  • whinvik1d

    It's weird to see the expectation that the result should be perfect.

    All said and done, that it's even possible is remarkable. Maybe these all go into training the next Opus or Sonnet and we start getting models that can create efficient compilers from scratch. That would be something!

    • regularfry24h

      This is firmly where I am. "The wonder is not how well the dog dances, it is that it dances at all."

      • sumitkumar5h

        I was also startled when I learned about the human ancestor who was the first to see a mirror.

        The brilliance of AI is that it copies(mirrors) imperfectly and you can only look at part_of_the_copy(inference) at a time.

      • the847223h

        "It's like if a squirrel started playing chess and instead of "holy shit this squirrel can play chess!" most people responded with "But his elo rating sucks""

        • LinXitoW19h

          It's more like "We were promised, over and over again, that the squirrel would be autonomous grandmaster level. We spent insane amounts of money, labour, and opportunity costs of human progress on this. Now, here's a very expensive squirrel that still needs guidance from a human grandmaster, and most of its moves are just replications of existing games. Oh, it also can't move the pieces by itself, so it depends on the Piece Mover library."

          • wyldfire9h

            Any way you slice it: LLMs provide real utility today, right now. Even yesterday, before Opus/Codex were updated. So the money was not all for naught. It seems very plausible given the progress made so far that this new industry will continue to deliver significant productivity gains.

            If you want to worry about something, let's worry about what happens to humanity when the world we've become accustomed to is yanked out from underneath us in a span of 10-20 years.

          • somebodythere13h

            Even a squirrel that needs guidance from a human grandmaster, is heavily inspired by existing games, and can use the Piece Mover library is incredible. Five years ago the squirrel was just a squirrel. Then it was able to make legal moves. Now it can play a whole game from start to finish, with help. That is incredible.

            • dirkc11h

              I think the post you're responding to would agree, but is trying to make the argument that it isn't worth the cost:

              > spent insane amounts of money, labour, and opportunity costs of human progress on this

              That said, I would 100% approve of certain people pouring all their energy into AI redirecting it toward teaching squirrels chess instead!

          • potsandpans16h

            My opinion: you are critiquing electricity because the candles are still better / more affordable / more honestly made.

            You seem to be mad that companies are in the business of selling us things. It's the way this whole thing works.

            If you don't think this is impressive: stop everything you're doing and go make a c compiler that can build the Linux kernel.

            • LinXitoW13h

              For reference, I use LLMs daily for coding. I do think they are useful.

              I am speaking about corporations and sales tactics, because this VERY experiment was done by exactly such a corporation. How about you think about how "this whole thing works", and apply it to their post? What did they not write? How many worse experiments did they not post about to not jeopardize investments?

              I don't find this impressive, because it doesn't do anything I'd want, anything I'd need, anything the world needs, and it doesn't do anything new compared to my personal experience. Which, just to reiterate, is that LLMs are useful, just nowhere close to as world shattering/ending as the CEOs are selling them. Acknowledging that has nothing to do with being a luddite.

              • potsandpans3h

                To be a bit pedantic, I'm not accusing you of being a Luddite. That would mean that you were fundamentally opposed to a new technology that's obviously more useful.

                Instead, in my opinion you are not giving enough grace to what is being demonstrated today.

                This is my analogy: you're seeing electrical demonstrations in front of your very eyes, but because the charlatans who are funding the research haven't quite figured out how to harness it, you're dismissing the wonder. "That's all well and good, but my beeswax candles and gas lamps light my apartment just fine."

            • legulere13h

              It is very impressive indeed, but impressiveness is not the same as usefulness. If important further features can't get implemented anymore, the usefulness is pretty limited. And usefulness further needs to be weighed against cost.

        • knollimar22h

          I'm not trying to get coached in chess by the squirrel for $200 per month though.

        • emp173447h

          But people have been telling us for years that the squirrel was going to improve at chess at an exponential rate and take over the world through sheer chess-mastery.

        • amlib22h

          But the Squirrel is only playing chess because someone stuffed the pieces with food and it has learned that the only way to release it is by moving them around in some weird patterns.

        • echelon17h

          "The squirrel can do my job and more? It can do five years of my work in a month? For only $20k? Pssh, but I bet it copied someone's homework."

          Developer salaries are about to tank.

          This is the end of the line. People are just in denial.

          Soon companies will hire the squirrel instead of you. And the squirrel will transform into enormous infrastructure we can't afford ourselves.

          "One mega squirrel to implement your own operating system overnight. Just $100k."

          It's going to be out of the reach of humans / ICs soon. Purely industrial. And all innovation will accrue to the capital holders.

          Open weights models are our only hope of keeping a foot in the door.

          • Ronsenshi12h

            This is a really questionable outcome. So you'll have your own custom OS riddled with holes that AI won't be capable of fixing, because the context and complexity have become so high that running any small bug fix would cost thousands of dollars in tokens.

            Is this how the tech field ends? Overengineered, brittle, black-box monstrosities that nobody understands, because the important thing for the business was "it does A, B, and C" and it doesn't matter how.

          • esafak17h

            If you want the code to be reviewed and maintained, you still need a developer. A developer can craft a better spec.

    • viccis15h

      >It's weird to see the expectation that the result should be perfect.

      Given that they spent $20k on it and it's basically just advertising targeted at convincing greedy execs to fire as many of us as they can, yeah it should be fucking perfect.

    • minimaxir1d

      A symptom of the increasing backlash against generative AI (both in creative industries and in coding) is that any flaw in the resulting product becomes a pretext to call it AI slop, even if it's very explicitly upfront that it's an experimental demo/proof of concept and not the NEXT BIG THING being hyped by influencers. That nuance is dead even outside of social media.

      • stonogo1d

        AI companies set that expectation when their CEOs ran around telling anyone who would listen that their product is a generational paradigm shift that will completely restructure both labor markets and human cognition itself. There is no nuance in their own PR, so why should they benefit from any when their product can't meet those expectations?

        • minimaxir1d

          Because it leads to poor and nonconstructive discourse that doesn't educate anyone about the implications of the tech, which is expected on social media but has annoyingly leaked to Hacker News.

          There's been more than enough drive-by comments from new accounts/green names even in this HN submission alone.

          • krupan1d

            It does lead to poor, non-constructive discourse. That's why we keep taking those CEOs to task over it. Why are you not?

            • dwaltrip1d

              The CEOs aren't here in the comments.

              • LinXitoW19h

                Which is why we ought to always bring up their BS every time people try to pretend it didn't happen.

                The promises made are ABSOLUTELY relevant to how promising or not these experiments are.

                • pertymcpert15h

                  I bet you get upset when you buy a new iPhone and don't love it, because Tim Cook said on the ad that they think you're going to love it.

                  • emp173447h

                    It cannot be overstated how absurd the marketing campaign for AI was. OpenAI and Anthropic have convinced half the world that AI is going to become a literal god. They deserve to eat a lot of shit for those outright lies.

          • amlib21h

            It's not just social media, it's IRL too.

            Maybe the general population will be willing to have more constructive discussions about this tech once the trillion-dollar companies stop pillaging everything they see in front of them and cease acting like sociopaths whose only objectives seem to be concentrating power, generating dissidence, and harvesting wealth.

  • itay-maman1d

    My first reaction: wow, incredible.

    My second reaction: still incredible, but noting that a C compiler is one of the most rigorously specified pieces of software out there. The spec is precise, the expected behavior is well-defined, and test cases are unambiguous.

    I'm curious how well this translates to the kind of work most of us do day-to-day where requirements are fuzzy, many edge cases are discovered on the go, and what we want to build is a moving target.

    • ndesaulniers1d

      > C compiler is one of the most rigorously specified pieces of software out there

      /me Laughs in "unspecified behavior."

      • ori_b1d

        There's undefined behavior, which is quite well specified. What do you mean by unspecified behavior? Do you have an example?

      • irishcoffee24h

        Undefined is absolutely clear in the spec.

        Unspecified is whatever you want it to mean. I am also laughing, having never heard "unspecified" before.

        • LiamPowell22h

          Unspecified behaviour is defined in the glossary at the start of the spec and the term "unspecified" appears over a hundred times...

    • astrange12h

      The C spec is certainly not formal or precise.

      https://www.ralfj.de/blog/2020/12/14/provenance.html

      Another example is that it's unclear from the standard if you can write malloc() in C.

      • butterNaN8h

        Sure, but the point OP is making is that it's still more spec'd than most real-world problems.

        • astrange2h

          You're welcome to try writing a C compiler and standard library doing no research other than reading the spec.

    • cryptonector20h

      > My second reaction:

      This is the key: the more you constrain the LLM, the better it will perform. At least that's my experience with Claude. When working with existing code, the better the code to begin with, the better Claude performs, while if the code has issues then Claude can end up spinning its wheels.

    • softwaredoug23h

      Yes I think any codegen with a lot of tests and verification is more about “fitting” to the tests. Like fitting an ML model. It’s model training, not coding.

      But a lot of programming we discover correctness as we go, one reason humans don’t completely exit the loop. We need to see and build tests as we go, giving them particular care and attention to ensure they test what matters.

  • psychoslave12h

    >The fix was to use GCC as an online known-good compiler oracle to compare against.

    >This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library.

    How does one reconcile these two statements? Sure, one can fetch all of gnu.org locally, and a model that already scraped the whole internet has somehow integrated it into its weights, hasn't it?

    The worldwide median household income (as of 2013 data from Gallup) was approximately $9,733 per year (in PPP, current international dollars). This means that $20,000 per year is more than double the global median income.

    A median Luxembourg citizen earns $20,000 in about 5 to 6 months of work, a Burundi one would on median need 42.5 months, that is 3.5 years.

    https://worldpopulationreview.com/country-rankings/median-in...

    • a4564636h

      Thank you!!! All these resources being spent on centralizing and claiming to outsource and reduce human thinking to nothing.

  • boring-human17h

    People focused on the flaws are missing the picture. Opus wasn't even trained to be "a member of a team of engineers," it was adapted to the task by one person with a shell script loop. Specific training for this mode of operation is inevitable. And model "IQ" is increasing with every generation. If human IQ is increasing at all, it's only because the engineer pool is shrinking more at one end than the other.

    This is a five-alarm fire if you're a SWE and not retiring in the next couple years.

    • smithcoin7h

      > This is a five-alarm fire if you're a SWE and not retiring in the next couple years.

      I’m sorry, but this is such a hype beast take. In my opinion this is equivalent to telling people not to learn to drive five years ago because of self driving from Tesla. How is that going?

      Every single line of code produced is a liability. This idea that you're going to have "gas town"-like agents running and building apps, without humans in the loop at any point, generating liability-free revenue, is insane to me.

      Are humans infallible? Obviously not. But if you are telling me that 'magic probability machines' are creating safe, secure, and compliant software with no need for engineers to participate in the output - first, I'd like to see a citation, and second, I have a bridge to sell you.

      • boring-human6h

        > In my opinion this is equivalent to telling people not to learn to drive five years ago because of self driving

        Self-driving has different economics. We're reading tea leaves, true, but it's also true that software has zero marginal cost and that $20K pays for an engineer-month in SF.

        > Every single line of code produced is a liability.

        Do you have a hard spec and rock-solid test cases? If you do, you have two options to a working prototype: 2-6 engineer-years, or $20K. The second option will greatly increase in quality and likely decrease in price over the next few years.

        What if the spec and the test cases are the new software? Assembly programmers used to make an argument against compiled code that's somewhat parallel to yours: every instruction is a (performance) liability.

        > without humans in the loop

        There will be humans, just fewer and fewer. The spec and test cases are AI-eligible too.

        > safe, secure, and compliant software

        I'm not sure humans' advantage here is safe, if it even exists still.

  • _lunix2h

    The comments at [1] are a bit _too_ trollish for me, but they _do_ showcase that this compiler is far too lenient in what it accepts, to the point where I'd hesitate to call it... a C compiler (this [2] comment in particular is pretty damning).

    Still, an impressive achievement nonetheless, but there's a lot of nuance under the surface.

    [1] https://github.com/anthropics/claudes-c-compiler/issues/1

    [2] https://github.com/anthropics/claudes-c-compiler/issues/1#is...

    • Philpax1d

      The issue is that it's missing the include paths. The compiler itself is fine.

    • krupan1d

      Thank you. That was a long article that opened with a claim backed by no proof, then dismissed it as not the most interesting thing under discussion, when in fact it's the baseline of the whole discussion.

    • Retr0id1d

      Looks like these users are just missing glibc-devel or equivalent?

      • delusional1d

        Naa, it looks like it's failing to include the standard system include directories. If you take them from gcc and pass them as -I, it'll compile.

        • Retr0id1d

          Can confirm (on aarch64 host)

              $ ./target/release/ccc-arm -I /usr/include/ -I /usr/local/include/ -I /usr/lib/gcc/aarch64-redhat-linux/15/include/ -o hello hello.c 
          
              $ ./hello
              Hello from CCC!
          • u80801d

            Seems this non-artificial intelligence model is just too limited to understand the concept of include paths.

        • zamadatix1d

          Hmm, I didn't have to do that. https://i.imgur.com/OAEtgvr.png

          But yeah, either way it just needs to know where to find the stdlib.

          • Retr0id1d

            Probably depends on where your distro puts stuff by default, I think it has a few of the common include paths hardcoded.

    • worldsavior1d

      AI is the future.

    • suddenlybananas1d

      This is truly incredible.

    • ZeWaka1d

      lol, lmao

  • btown1d

    > This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQlite, postgres, redis, and has a 99% pass rate on most compiler test suites including the GCC torture test suite. It also passes the developer's ultimate litmus test: it can compile and run Doom.

    This is incredible!

    But it also speaks to the limitations of these systems: while these agentic systems can do amazing things when automatically-evaluable, robust test suites exist... you hit diminishing returns when you, as a human orchestrator of agentic systems, are making business decisions as fast as the AI can bring them to your attention. And that assumes the AI isn't just making business assumptions with the same lack of context, compounded with motivation to seem self-reliant, that a non-goal-aligned human contractor would have.

    • _qua1d

      Interesting how the concept of a clean room implementation changes when the agent has been trained on the entire internet already

      • falcor841d

        To the best of my knowledge, there's no Rust-based compiler that comes anywhere close to 99% on the GCC torture test suite, or able to compile Doom. So even if it saw the internals of GCC and a lot of other compilers, the ability to recreate this step-by-step in Rust is extremely impressive to me.

        • D-Machine16h

          I think the careful response to this is:

          (1) There are compilers written in C in the training set

          (2) LLMs demonstrably can near-perfectly memorize training-set inputs (see other comments here)

          (3) LLMs are very good at translation tasks (natural language or code, e.g.: C to Rust)

          I don't think this necessarily completely deflates the impressiveness of this accomplishment, but it does qualify it to some degree.

        • jsheard1d

          The impressiveness of converting C to Rust by any means is kind of contingent on how much unnecessary unsafe there is in the end result though.

    • jillesvangurp14h

      You can use AI coding tools to create test suites, specifications, documentation, etc. And you can use them to scrutinize those, review them, criticize them, etc. Not having a test suite just means you start by creating one. Then the next question of course becomes "for what?".

      This indeed puts human prompters in a position where their job is to set the goals, outline the vision, ask for the right things, ask critical questions, and to correct where needed.

      Human contractors are a good analogy, because they tend to come into a new job without much context; their context is mainly what they've done before, and it takes time to get up to speed with whatever the customer is asking for. People are slightly better at getting information out of other people. AI coding tools don't ask enough critical questions, yet. But that sounds fixable. The breakthroughs here are as much in the feedback loops and plumbing around the models as they are in the models themselves. It's all about getting the right information in and out of the context.

      • socalgal212h

        You would spend years verifying that the tests actually work, whereas the tests for this accomplishment were already verified by humans over decades.

    • falcor841d

      Agreed, but the next step is having an AI agent actually run the business and be able to get the business context it needs as a human would. Obviously we're not quite there, but with the rapid progress on benchmarks like Vending-Bench [0], and especially with this team's approach, it doesn't seem far-fetched anymore.

      As a particular near-term step, I imagine that it won't be long before we see a SaaS company using an AI product manager, which can spawn agents to directly interview users as they utilize the app, independently propose and (after getting approval) run small product experiments, and come up with validated recommendations for changing the product roadmap. I still remember Tay, and wouldn't give something like that the keys to the kingdom any time soon, but as long as there's a human decision maker at the end, I think that the tech is already here.

      [0] https://andonlabs.com/evals/vending-bench-2

  • forty23h

    We live in a wonderful time where I can spend hours and $20,000 to build a C compiler that is slow and inefficient and anyway requires an existing great compiler to even work, and then neither I nor the agent has any idea how to make it useful :D

    • sieep10h

      Here's $20b in VC funding. Congrats!

  • OsrsNeedsf2P1d

    This is like a working version of the Cursor blog. The evidence - it compiling the Linux kernel - is much more impressive than a browser that didn't even compile (until manually intervened)

    • ben_w1d

      It certainly slightly spoils what I was planning as a fun little April Fool's joke (a daft but complete programming language). Last year's AI wasn't good enough to get me past the compiler-compiler even for the most fundamental basics; now it's all this.

      I'll still work on it, of course. It just won't be so surprising.

  • akrauss1d

    I would like to see the following published:

    - All prompts used

    - The structure of the agent team (which agents / which roles)

    - Any other material that went into the process

    This would be a good source for learning, even though I'm not ready to spend $20k just to replicate the experiment.

    • a4564636h

      Just claims with nothing to back them. Steal people's years of work, then turn around and act like you made it "so much better". Support this compiler for 20 years, then.

    • password43211d

      Yes unfortunately these days most are satisfied with just the sausage and no details about how it was made.

  • its-kostya3h

    As cool as the result is, this article is quite tone deaf to the fact that they asked a statistical model to "build" what was already in its training dataset... and that's not to mention the troves of forum data discussing bugs and best practices.

  • underdeserver1d

    > when agents started to compile the Linux kernel, they got stuck. [...] Every agent would hit the same bug, fix that bug, and then overwrite each other's changes.

    > [...] The fix was to use GCC as an online known-good compiler oracle to compare against. I wrote a new test harness that randomly compiled most of the kernel using GCC, and only the remaining files with Claude's C Compiler. If the kernel worked, then the problem wasn’t in Claude’s subset of the files. If it broke, then it could further refine by re-compiling some of these files with GCC. This let each agent work in parallel

    This is a remarkably creative solution! Nicely done.
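
    The scheme is basically delta debugging with GCC as the known-good oracle. Here's a toy sketch of the bisection loop (the `fails` closure stands in for "compile this subset with Claude's compiler, the rest with GCC, then build and boot"; the names are mine, not the actual harness's):

        // Narrow a failing set of files down to a culprit by halving it,
        // re-running the mixed GCC/CCC build-and-boot oracle each time.
        fn isolate<T>(mut suspects: Vec<T>, fails: &dyn Fn(&[T]) -> bool) -> Vec<T> {
            while suspects.len() > 1 {
                let mid = suspects.len() / 2;
                if fails(&suspects[..mid]) {
                    suspects.truncate(mid);      // bug is in the first half
                } else if fails(&suspects[mid..]) {
                    suspects.drain(..mid);       // bug is in the second half
                } else {
                    break; // failure needs files from both halves; stop here
                }
            }
            suspects
        }

        fn main() {
            // Toy oracle: any subset containing "sched.c" fails to boot.
            let files = vec!["init.c", "sched.c", "mm.c", "fs.c"];
            let culprit = isolate(files, &|s: &[&str]| s.contains(&"sched.c"));
            println!("culprit: {:?}", culprit); // ["sched.c"]
        }

    The nice property is that each agent can run something like this independently on its own slice of the kernel without stepping on the others' fixes.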

  • lubujackson23h

    This is very much a "vibe coding can build you the Great Pyramids but it can't build a cathedral" situation, as described earlier today: https://news.ycombinator.com/item?id=46898223

    I know this is an impressive accomplishment and is meant to show us the future potential, but it achieves big results by throwing an insane amount of compute at the problem, brute forcing its way to functionality. $20,000 set on fire, at Claude's discounted Max pricing no less.

    Linear results from exponential compute are not nothing, but this certainly feels like a dead-end approach. The frontier should be more complexity for less compute, not more complexity from an insane amount more compute.

    • Philpax23h

      > $20,000 in API costs

      I would interpret this as being at API pricing. At subscription pricing, it's probably at most 5 or 6 Max subscriptions worth.

    • ajross22h

      > $20,000 set on fire

      To be fair, that's two weeks of the employer cost of a FAANG engineer's labor. And no human hacks a working compiler in two weeks.

      It's a lot of AI compute for a demo, sure. But $20k stunts are hardly unique. Clearly there's value being demonstrated here.

      • lionkor16h

        Yes a human can hack together a compiler in two weeks.

        If you can't, you should turn off the AI and learn for yourself for a while.

        Writing a compiler is not a flex; it's a couple very well understood problems, most of which can be solved using existing libraries.

        Parsing is solved with yacc, bison, or sitting down and writing a recursive descent parser (works for most well designed languages you can think of).

        Then take your AST and translate it to an IR, and then feed that into anything that generates code. You could use Cranelift, or you could roll your own.
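
        To show the shape of the recursive-descent part, here's a toy parser in Rust for integer expressions with + and * (obviously a far cry from kernel C; just a sketch of the "sit down and write one" approach):

            // Grammar: expr := term ('+' term)*   term := num ('*' num)*   num := [0-9]+
            #[derive(Debug)]
            enum Expr {
                Num(i64),
                Add(Box<Expr>, Box<Expr>),
                Mul(Box<Expr>, Box<Expr>),
            }

            struct Parser<'a> { src: &'a [u8], pos: usize }

            impl<'a> Parser<'a> {
                fn peek(&self) -> Option<u8> { self.src.get(self.pos).copied() }

                fn expr(&mut self) -> Expr {            // lowest precedence: +
                    let mut lhs = self.term();
                    while self.peek() == Some(b'+') {
                        self.pos += 1;
                        lhs = Expr::Add(Box::new(lhs), Box::new(self.term()));
                    }
                    lhs
                }

                fn term(&mut self) -> Expr {            // higher precedence: *
                    let mut lhs = self.num();
                    while self.peek() == Some(b'*') {
                        self.pos += 1;
                        lhs = Expr::Mul(Box::new(lhs), Box::new(self.num()));
                    }
                    lhs
                }

                fn num(&mut self) -> Expr {             // one or more digits
                    let start = self.pos;
                    while matches!(self.peek(), Some(c) if c.is_ascii_digit()) {
                        self.pos += 1;
                    }
                    let text = std::str::from_utf8(&self.src[start..self.pos]).unwrap();
                    Expr::Num(text.parse().unwrap())
                }
            }

            fn main() {
                let mut p = Parser { src: b"1+2*3", pos: 0 };
                println!("{:?}", p.expr()); // Add(Num(1), Mul(Num(2), Num(3)))
            }

        Each precedence level is one function; handling all of C is the same pattern scaled up, plus the typedef-name ambiguity that makes C annoying to parse.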

        • Anon109612h

          Meanwhile:

          > I spent a good part of my career (nearly a decade) at Google working on getting Clang to build the linux kernel.

          https://news.ycombinator.com/item?id=46905771

          • latexr6h

            If you spend a decade working on something, you’re not “hacking it”.

        • ajross9h

          > Parsing is solved with yacc, bison, or sitting down and writing a recursive descent parser (works for most well designed languages you can think of).

          No human being writes a recursive descent parser for "Linux Kernel C" in two weeks, though. And AFAIK there's no downloadable BNF for that you can hand to an automatic generator either, you have to write it and test it and refine it. And you can't do it in two weeks.

          Yes yes, we all know how to write a compiler because we took a class on it. That's like "Elite CS Nerd Basic Admission". We still can't actually do it at the cost being demonstrated, and you know it.

      • pcloadlett3r20h

        Is there really value being presented here? Is this codebase a stable enough base to continue developing this compiler, or does it warrant a total rewrite? Honest question; it seems like the author mentioned it being at its limits. This mirrors my own experience with Opus, in that it isn't that great at defining abstractions in one shot, at least. Maybe with enough loops it could converge, but I haven't seen definite proof of that in the current generation with these ambitious clickbaity projects.

        • segh7h

          This is an experiment to see the current limit of AI capabilities. The end result isn't useful, but the fact is established that in Feb 2026, you can spend $20k on AI to get an inefficient but working C compiler.

          • latexr5h

            > The end result isn't useful

            Then, as your parent comment asked, is there value in it? $20K, which is more than the yearly minimum wage in several countries in Europe, was spent recreating a worse version of something we already have, just to see if it was possible, using a system which increases inequality and makes climate change—which is causing people to die—worse.

          • ajross5h

            > inefficient but working

            FWIW, an inefficient but working product is pretty much the definition of a startup MVP. People are getting hung up on the fact that it doesn't beat gcc and clang, and generalizing to the idea that such a thing can't possibly be useful.

            But clearly it can, and is. This builds and boots Linux. A putative MVP might launch someone's dreams. For $20k!

            The reflexive ludditism is kinda scary actually. We're beyond the "will it work" phase and the disruption is happening in front of us. I was a luddite 10 months ago. I was wrong.

        • ajross20h

          If it generates a booting kernel and passes the test suite at 99% it's probably good enough to use, yeah.

          The point isn't to replace GCC per se, it's to demonstrate that reasonably working software of equivalent complexity is within reach for $20k to solve whatever problem it is you do have.

          • pcloadlett3r19h

            > it's probably good enough to use, yea.

            Not for general purpose use, only for demo.

            > that reasonably working software of equivalent complexity is within reach for $20k to solve

            But if this can't come close to replacing GCC and can't be modified without introducing bugs, then it hasn't proven this yet. I learned some new hacks from the paper, and that's great and all, but from my experience of trying to harness even 4 Claude sessions in parallel on a complex task, it just goes off the rails in terms of coherence. I'll try the new techniques, but my intuition is that it's not really as good as you are selling it.

            • ajross8h

              > Not for general purpose use, only for demo.

              What does that mean, though? I mean, it's already meeting a very high quality bar by booting at all and passing those tests. No, it doesn't beat existing solutions on all the checkboxes, but that's not what the demo is about.

              The point being demonstrated is that if you need a "custom compiler" or something similar for your own new, greenfield requirement, you can have it at pretty-clearly-near-shippable quality in two weeks for $20k.

              And if people can't smell the disruption there, I don't know what to say.

              • vuciuc7h

                > you can have it at pretty-clearly-near-shippable quality in two weeks for $20k.

                if you spend months writing a tight spec, tests and have a better version of the compiler around to use when everything else fails.

                • ajross6h

                  > if you spend months writing a tight spec, tests and have a better version of the compiler around to use when everything else fails.

                  Doesn't matter because your competitors will have beaten you to market. That's just a simple Darwinian point, no AI magic needed.

                  No one doubts that things will be different in the coming Claudepocalypse, and new ideas about quality and process will need to happen to manage it. But sticking our heads in the sand and pretending that our stone tools are still better is just a path to early retirement at this point.

      • a4564636h

        Humans can hack a compiler in much less. Stop reading this hype and focus on learning

  • ks20481d

    It's cool that you can look at the git history to see what it did. Unfortunately, I do not see any of the human written prompts (?).

    First 10 commits, "git log --all --pretty=format:%s --reverse | head",

      Initial commit: empty repo structure
      Lock: initial compiler scaffold task
      Initial compiler scaffold: full pipeline for x86-64, AArch64, RISC-V
      Lock: implement array subscript and lvalue assignments
      Implement array subscript, lvalue assignments, and short-circuit evaluation
      Add idea: type-aware codegen for correct sized operations
      Lock: type-aware codegen for correct sized operations
      Implement type-aware codegen for correct sized operations
      Lock: implement global variable support
      Implement global variable support across all three backends
    • c-linkage8h

      That's crazy to me. At this point, I don't even know if the git commit log would be useful to me as a human.

      Maybe it's just me, but I like to be able to do both incremental testing and integration testing as I develop. This means I would start with the lexer and parser and get them tested (separately and together) before moving on to generating and validating IR.

      It looks like the AI is dumping an entire compiler in one commit. I'm not even sure where I would begin to look if I were doing a bug hunt.

      YMMV. I've been a solo developer for too many years. Not that I avoided working on a team, but my teams have been so small that everything gets siloed pretty quickly. Maybe life is different when more than one person works on the same application.

  • gignico1d

    > To stress test it, I tasked 16 agents with writing a Rust-based C compiler, from scratch, capable of compiling the Linux kernel. Over nearly 2,000 Claude Code sessions and $20,000 in API costs, the agent team produced a 100,000-line compiler that can build Linux 6.9 on x86, ARM, and RISC-V.

    If you don't care about code quality, maintainability, readability, conformance to the specification, and performance of the compiler and of the compiled code, please, give me your $20,000, I'll give you your C compiler written from scratch :)

    • chasd0024h

      > If you don't care about code quality, maintainability, readability, conformance to the specification, and performance of the compiler and of the compiled code, please, give me your $20,000, I'll give you your C compiler written from scratch :)

      I don't know if you could. Let's say you get a check for $20k: how long will it take you to make an equivalently performing and compliant compiler? Are you going to put your life on pause until it's done, for $20k? Who's going to pay your bills when the $20k is gone after 3 months?

      • pja13h

        There are plenty of people on HN who could re-implement a C compiler like this in less than three months. Algorithmically, compilers like this are a solved problem, one that has been very well documented over the last sixty or seventy years. Implementing a small compiler is a typical MSc project that you might carry out in a couple of months alongside a taught masters.

        This compiler is both slower than gcc even when optimising (you can’t actually turn optimisation off) & doesn’t reject type incorrect code so will happily accept illegal C code. It’s also apparently very brittle - what happens if you feed it the Linux kernel sources v. 6.10 instead of 6.9? - presumably it fails.

        All of the above make it simultaneously 1) really, really impressive and 2) completely useless in the real world. Great for creating discussion though!

      • nnevatie11h

        > Who's going to pay your bills when the $20k is gone after 3 months?

        And who's going to maintain this turd the LLM pushed out? It's a cool one-shot sort of thing, but let's not pretend this is useful as a real compiler or something anyone would like to maintain, as a human.

        One could keep improving the implementation by vibing more, but I think that's just taking you in the wrong direction down the rabbit hole.

    • minimaxir1d

      There is an entire Evaluation section that addresses that criticism (both in agreement and disagreement).

    • 52-6F-621d

      If we're just writing off the billions in up front investment costs, they can just send all that my way while we're at it. No problem. Everybody happy.

  • travisgriggs17h

    A C Compiler seems like one of the more straightforward things to have done. Reading this gives me the same vibe as when a magician does a frequently done trick (saw someone in half, etc).

    I'd be more interested in letting it have a go at some of the other "less trodden" paths of computing. Some of the things that would "wow me more":

    - Build a BEAM alternative, perhaps in an embedded space

    - Build a Smalltalk VM, perhaps in an embedded space, or in WASM

    These things are documented at some level, but still require a bit of original thinking to execute and pull off. That would wow me more.

    • adgjlsfhk117h

      if it actually compiles real C correctly, it's pretty impressive. The C standard is a total mess.

      • 8-prime6h

        Yet we have gcc and clang navigating that mess. From which Opus 4.6 was able to take inspiration.

  • marklsnyder6h

    Very cool, but I can't help but wonder how this translates to similarly complex projects where innate knowledge about the domain hasn't been embedded in the LLM via training data. There's a wealth of open source compiler code and related research papers that have been fed to the LLM. It seems like that would advantage the LLM significantly.

    • phendrenad26h

      Not just open-source compilers, but books on compiler design, which have proliferated because every CS professor wants to take a crack at the problem.

  • arkh14h

    My question would be: what are the myriad other projects you tasked Opus 4.6 to build and it could not get to a point you could kinda-sorta make a post about?

    This kind of headline makes me think of p-hacking.

  • rco878610h

    > Claude will work autonomously to solve whatever problem I give it. So it’s important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem.

    I think this is the fundamental thing here with AI. You can spin up infinite agents that can all do....stuff. But how do you keep them from doing the wrong stuff?

    Is writing an airtight spec and test harness easier or less time consuming than just keeping a human in the loop and verifying and redirecting as the agents work?

    It all still comes back to context management.

    Very cool demonstration of the tech though.

  • yu3zhou41d

    At this point, I genuinely don't know what to learn next to not become obsolete when another Opus version gets released

    • missingdays1d

      Learn to fix bugs, it's gonna be more relevant than ever

      • wiseowise13h

        That’s already Opus 4.7s main selling point.

    • RivieraKid1d

      I agree. I don't understand why there are so many software engineers who are excited about this. I would only be excited if I was a founder in addition to being a software engineer.

    • segh7h

      People skills :/

  • danfritz1d

    Ha yes classic showcase of:

    1) obvious green field project

    2) well defined spec which will definitely be in the training data

    3) an end result which lands you 90% of the way to the finish

    Now comes the hard part, the last 10%. Still not impressed here. Since fixing issues at the end was impossible without introducing bugs, I have doubts about quality

    I'm glad they do call it out in the end. That's fair

  • throwaway20271d

    Next time can you build a Rust compiler in C? It doesn't even have to check things or have a borrow checker, as long as it reduces the compile times so it's like a fast debug iteration compiler.

    • Philpax23h

      You will experience very spooky behaviour if you do this, as the language is designed around those semantics. Nonetheless, mrustc exists: https://github.com/thepowersgang/mrustc

      It will not be noticeably faster because most of the time isn't spent in the checks, it's spent in the codegen. The cranelift backend for rustc might help with this.

  • geooff_1d

    Maybe I'm naive, but I find these re-engineering complex product posts underwhelming. C Compilers exist and realistically Claudes training corpus contains a ton of C Compiler code. The task is already perfectly defined. There exists a benchmark of well-adopted codebases that can be used to prove if this is a working solution. Half the difficulty in making something is proving it works and is complete.

    IMO a simpler novel product that humans enjoy is 10x more impressive than rehashing a solved problem, regardless of difficulty.

    • bs72801d

      I don't see this as just an exercise in making a new useful thing, but as benchmarking the SOTA model's ability to create a massive* project on its own, with some verifiable metrics of success. I believe they were able to build FFmpeg with this Rust compiler?

      How much would it cost to pay someone to make a C compiler in rust? A lot more than $20k

      * massive meaning "total context needed" >> model context window

    • stephc_int131d

      This is a nice benchmark IMO. I would be curious to see how competitors and improved models would compare.

      • NitpickLawyer1d

        And how long will it take before an open model recreates this. The "vibe" consensus before "thinking" models really took off was that open was ~6mo behind SotA. With the massive RL improvements, over the past 6 months I've thought the gap was actually increasing. This will be a nice little verifiable test going forward.

  • tymonPartyLate13h

    I try to see this like F1 racing. Building a browser or a C compiler with agent swarms is disconnected from the reality of normal software projects. In normal projects the requirements are not fully understood upfront, and you learn and adapt and change as you make progress. But the innovations from professional racing result in better cars for everyone. We'll probably get better dev tools and better coding agents thanks to those experiments.

  • hexo7h

    I really love how they waste energy for stuff like this. Even better, all that nonsense talk we constantly kept hearing about an energy crisis just a few years ago...

    • a4564636h

      Yup. All the tech hype bros are like "but my compiler"... Nobody was paying me to write a compiler. The meaning of "clean room" keeps changing, and they had to spend $20k (on the surface), not including the energy costs, the hardware costs, the time of assembly, etc. Imagine if you paid that much money to a person or group of people instead. It is the hype bros' wet dream to extract all value out of people and somehow get rich. Who cares if humanity suffers; look what I built for myself by enslaving people and wasting the earth's resources. Every single AI fetishist in this thread is responsible for it.

  • rwmj1d

    The interesting thing here is what's this code worth (in money terms)? I would say it's worth only the cost of recreation, apparently $20,000, and not very much more. Perhaps you can add a bit for the time taken to prompt it. Anyone who can afford that can use the same prompt to generate another C compiler, and another one and another one.

    GCC and Clang are worth much much more because they are battle-tested compilers that we understand and know work, even in a multitude of corner cases, over decades.

    In future there's going to be lots and lots of basically worthless code, generated and regenerated over and over again. What will distinguish code that provides value? It's going to be code - however it was created, could be AI or human - that has actually been used and maintained in production for a long time, with a community or company behind it, bugs being triaged and fixed and so on.

    • kingstnap24h

      The code isn't worth money. This is an experiment. The knowledge that something like this is even possible is what is worth money.

      If you had known in 2022 that a transformer could pull this off, even with all its flawed code, you would have been floored.

      Keep in mind that just a few years ago, the state of the art in what these LLMs could do was questions of this nature:

      Suppose g(x) = f⁻¹(x), g(0) = 5, g(4) = 7, g(3) = 2, g(7) = 9, g(9) = 6. What is f(f(f(6)))?

      The above is from the "sparks of AGI paper" on GPT-4, where they were floored that it could coherently reason through the 3 steps of inverting things (6 -> 9 -> 7 -> 4) while GPT 3.5 was still spitting out a nonsense argument of this form:

      f(f(f(6))) = f(f(g(9))) = f(f(6)) = f(g(7)) = f(9).

      This is from March 2023, and it was genuinely very surprising at the time that these pattern-matching machines trained on next-token prediction could do this. Something like an LSTM can't do anything like this at all, btw; nowhere close.
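
      (For reference, the arithmetic is mechanical once you invert the mapping. A throwaway C sketch, since we're in a compiler thread:)

          #include <stdio.h>

          int main(void) {
              /* the g(x) pairs from the puzzle; g = f^-1 */
              int gx[] = {0, 4, 3, 7, 9};
              int gy[] = {5, 7, 2, 9, 6};
              int x = 6;
              for (int step = 0; step < 3; step++)   /* apply f three times */
                  for (int i = 0; i < 5; i++)
                      if (gy[i] == x) { x = gx[i]; break; }
              printf("f(f(f(6))) = %d\n", x);        /* prints 4: 6 -> 9 -> 7 -> 4 */
              return 0;
          }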

      To me it's very surprising that the C compiler works. It takes a ton of effort to build such a thing. I can imagine the flaws actually do get better over the next year as we push the goalposts out.

  • dzaima22h

    Clicked on the first thing I happen to be interested in - SIMD stuff - and ended up at https://github.com/anthropics/claudes-c-compiler/blob/6f1b99..., which is a fast path incompatible with the _mm_free implementation; pretty trivial bug, not even actually SIMD or anything specialized at all.

    A whole lot of UB in the actual SIMD impls (who'd have expected), but that can actually be fine here if the compiler is made to not take advantage of the UB. And then there's the super-weird mix of manual loops vs inline assembly vs builtins.
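
    (For readers who haven't chased this class of bug: the shape is roughly the following. This is a hypothetical sketch of the mismatch, not the repo's actual code:)

        #include <stddef.h>
        #include <stdint.h>
        #include <stdlib.h>

        /* An aligned-malloc with a "fast path" that skips the bookkeeping
           header, paired with a free that unconditionally expects one. */
        void *aligned_malloc_sketch(size_t size, size_t align) {
            if (align <= sizeof(max_align_t))
                return malloc(size);            /* fast path: no header stored */
            void *raw = malloc(size + align + sizeof(void *));
            if (!raw) return NULL;
            uintptr_t p = ((uintptr_t)raw + sizeof(void *) + align - 1)
                          & ~(uintptr_t)(align - 1);
            ((void **)p)[-1] = raw;             /* stash the original pointer */
            return (void *)p;
        }

        void aligned_free_sketch(void *p) {
            if (p) free(((void **)p)[-1]);      /* reads a header the fast path never wrote */
        }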

  • epolanski1d

    However it was achieved, building such a complex project as a C compiler on a $20k budget in full autonomy is quite impressive.

    Imho some commenters focus so much on the cons (many of which are honestly shared by the blog post too) that they forget to be genuinely impressed by the steps forward.

  • mshockwave5h

    how did it do regalloc before instruction selection? How do you select the correct register class without knowing which instruction you're gonna use?

  • polyglotfacto7h

    So I do think one can get value from coding agents, but that value is out of proportion compared to the investments made by the AI labs, so now they're pushing this kind of stuff which I find to be a borderline scam.

    Let me explain why:

    > the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux

    Seems like a failure to me.

    > I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.

    This has code smell written all over it.

    ----

    Conclusion: this cost 20k to build, not taking into account the money spent on training the model. How much would you pay for this software? Zero.

    The reality is that LLMs are up there with SQL and RoR (or above) in terms of changing how people write software and interact with data. That's a big deal, but not enough to support trillion-dollar valuations.

    So you get things like this project, which are just about driving a certain narrative.

    • conception7h

      I don't understand. This badly done work wasn't possible at all six months ago. In six more months it will be better. It's not a technology that has been mostly static for the last twenty-plus years.

  • small_model1d

    How about we get the LLMs to collaborate and design a perfect programming language for LLM coding? It would be terse (fewer tokens), easy for pattern searches, etc., and very fast to build and iterate over.

    • copperx1d

      I'm surprised by the assumption that LLMs would design such a language better than humans. I don't think that's the case.

    • WarmWash1d

      I cannot decide if LLMs would be excellent at writing in pure binary (why waste all that context on superfluous variable names and function symbols) or be absolutely awful at writing pure binary (would get hopelessly lost without the huge diversification of tokens).

      • anematode1d

        Binary is wayyy less information dense than normal code, so it wouldn't work well at all.

      • small_model1d

        We would still need the language to be human readable, but it could be very dense. They could build the ultimate std lib that goes directly to kernels, so that a call like spawn is all the tokens it needs to start a coroutine, for example.

    • hagendaasalpine1d

      what about APL et al (BQN), information dense(?)

  • jaccola10h

    I think this is cool!

    But by some definition my "Ctrl", "C", and "V" keys can build a C compiler...

    Obviously being facetious but my point being: I find it impossible to judge how impressed I should be by these model achievements since they don't show how they perform on a range of out-of-distribution tasks.

  • keeptrying19h

    This is more an example of code distribution rather than intelligence.

    If Claude had NOT been trained on compiler code, it would NOT have been able to build a compiler.

    Definitely signals the end of software IP or at least in its present form.

    • stkdump16h

      In a weird sense Open Source won

      • keeptrying9h

        Yep - its an interesting angle to look at it.

        Or rather OpenSource might have just saved the world!

  • sigbottle9h

    Even with all the caveats:

    - trained on all the GCC/clang source

    - pulled down a kernel branch, presumably with extensive tests in source

    - used GCC as an oracle

    I certainly wouldn't be able to do this.

    I flip flop man.

  • exitcode00001d

    Cool article, interesting to read about their challenges. I've tasked Claude with building an Ada83 compiler targeting LLVM IR - which has gotten pretty far.

    I am not using teams though and there is quite a bit of knowledge needed to direct it (even with the test suite).

  • mimd5h

    I'm annoyed at the cost statement, as that's the sleight of hand. "$20000" at current pricing. Add some orders of magnitude to the costs and you'll get the true price you'll have to pay when the VC money starts to wear off. 2nd, this is ignoring the dev time that he/others put in over multiple iterations of this project (opus 4, opus 4.5), all the other work to create the scaffolding for it, and all the millions/tens of millions of dollars of hand-written test suites (linux kernel, gcc, doom, sqlite, etc.) he got to use to guide the process. So add some more cost on top of that orders-of-magnitude increase, and the dev time is probably a couple of months/years more than "2 weeks".

    And this is just working off the puff pieces statements, and not even diving into the code to see it's limits/origins, etc. I also don't see the scaffold in the repo, as that's where the effort is.

    But still, it's not surprising; from my own experience, given a rigorously definable problem, enough effort, grunt work, and massaging, you can get stuff out of the current models.

  • karmakaze24h

    I'm not particularly impressed that it can turn C into an SSA IR or assembly etc. The optimizations, however sophisticated is where anything impressive would be. Then again, we have lots of examples in the training set I would expect. C compilers are probably the most popular of all compilers. What would be more impressive is for it to have made a compiler for a well defined language that isn't very close to a popular language.
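
    (For context, SSA just means every variable is assigned exactly once, with phi nodes merging values where control flow joins; schematically:)

        /* C source:              SSA form (schematic):
           int x = a + b;         x1 = add a, b
           x = x * 2;             x2 = mul x1, 2
           if (c)                 br c, then, join
               x = x + 1;         then: x3 = add x2, 1
           return x;              join: x4 = phi(x2, x3)
                                        ret x4
        */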

    What I am impressed by is that the task it completed had many steps and the agent didn't get lost or caught in a loop in the many sessions and time it spent doing it.

    • astrange12h

      > What would be more impressive is for it to have made a compiler for a well defined language that isn't very close to a popular language.

      That doesn't seem difficult as long as you can translate it into a well-known IR. The Dragon Book for some reason spends all its time talking about frontend parsing, which does give you the impression it's impossible.

      I agree writing compilers isn't especially difficult, but it is a lot of work and people are scared of it.

      The hard part is UI - error handling and things like that.

  • owenpalmer1d

    It can compile the linux kernel, but does it boot?

  • softwaredoug24h

    I think we’re getting to a place where for anything with extensive verification available we’ll be “fitting” code to a task against tests like we fit an ML model to a loss function.

  • anupamchugh17h

    > "This is a very early research prototype with no other inter-agent communication methods or high-level goal management processes."

    The lock file approach (current_tasks/parse_if_statement.txt) prevents two agents from claiming the same task, but it can't prevent convergent wasted work. When all 16 agents hit the same Linux kernel bug, the lock files didn't help — the problem wasn't task collision, it was that the agents couldn't see they were all solving the same downstream failure. The GCC oracle workaround was clever, but it was a human inventing a new harness mid-flight because the coordination primitive wasn't enough.

    Similarly, "Claude frequently broke existing functionality implementing new features" isn't a model capability problem — it's an input stability problem. Agent N builds against an interface that agent M just changed. Without gating on whether your inputs have changed since you started, you get phantom regressions.
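
    (The locking primitive itself is simple. A sketch of atomic task-claiming via exclusive file creation, assuming the current_tasks/ layout named in the post; claim_task is my name, not theirs:)

        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        /* Claim a task by atomically creating its lock file. O_EXCL makes
           open() fail if another agent created the file first, which stops
           task collision -- but not two agents chasing the same root cause. */
        int claim_task(const char *task_name) {
            char path[256];
            snprintf(path, sizeof path, "current_tasks/%s.txt", task_name);
            int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
            if (fd < 0)
                return 0;   /* already claimed */
            close(fd);
            return 1;       /* ours */
        }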

  • cesaref12h

    Most of the effort when writing a compiler is handling incorrect code, and reporting sensible error messages. Compiling known good code is a great start though.

  • miki12321113h

    What I find to be the most impressive part here is that it wrote the compiler without reference to the C specification and without architecture manuals at hand.

  • smy2001115h

    I think the good thing about it is that if you are given a good specification, you are likely to get a good result. Writing a C compiler is not something new, but this will be great for all the porting projects.

  • polskibus1d

    So did the Linux kernel compiled with this compiler work? Does it work the same as a GCC-compiled Linux (but slower, due to the non-optimized generated code)?

  • storus24h

    Now this is fairly "easy", as there are a multitude of implementations/specs all over the Internet. How about trying to design a new language that is unquestionably better/safer/faster for low-level system programming than C/Rust/Zig? ML is great at aping existing stuff, but how about pushing it to invent something valuable instead?

  • Decabytes13h

    For me the real test will be building a c++ compiler

  • throwaway20271d

    I think it's funny how I, and I assume many others, tried to do the same thing; they probably saw it being a popular query or had the same idea.

  • subzel011h

    One thing this article proved is that the Dead Internet Theory is real. Look at all these Claudy comments!

  • jgarzik20h

    Already done, months ago, with better taste: https://github.com/rustcoreutils/posixutils-rs

  • personjerry24h

    > Over nearly 2,000 Claude Code sessions and $20,000 in API costs

    Well there goes my weekend project plans

    • degurechaff20h

      well, you can use jules and spend zero dollars on it. I also created a similar project like this, a c11 compiler in rust using an AI agent + 1 developer (https://github.com/bungcip/cendol). not fully automated like anthropic did, but at least i can understand what it did.

  • nottorp12h

    Apparently there's a reproducibility crisis in science.

    Are Anthropic's claims reproducible?

  • cuechan1d

    > The compiler is an interesting artifact on its own [...]

    it's funny because, by (most) definitions, it is not an artifact:

    > a usually simple object (such as a tool or ornament) showing human workmanship or modification as distinguished from a natural object

  • socalgal212h

    Thinking about this: while it's a cool achievement, how useful is it really? It relies on the fact that there is a large, comprehensive set of tests and a large number of available projects that can function as tests.

    That situation is extremely uncommon for most development

  • jackdoe14h

    honestly i am amazed that it can do that, but I wish they'd use it to rewrite the claude code cli.

    i had to killall -9 claude 3 times yesterday

    • rhubarbtree14h

      They are already writing Claude with Claude - I think they said 90% of their code is written with Claude.

      • jackdoe14h

        yes, they must be killing it hundreds of times per day; maybe it's time for a 'please rewrite opencode, but don't touch anything, you can only use `cp`' kind of prompt

  • stephc_int131d

    They should add this to the benchmark suite, and create a custom eval for how good the resulting compiler is, as well as how maintainable the source code is.

    • snek_case1d

      This would be an expensive benchmark to run on a regular basis, though I guess for the big AI labs it's nothing. Code quality is hard to objectively measure, however.

  • stevefan199921h

    I tried writing a C compiler in Rust in the spirit of TCC, but I'm just too lazy to finish it.

  • jwpapi24h

    This is my favorite article this year. Just very insightful and honest. The learnings are worth thousands for me.

  • jhallenworld24h

    Does it make a conforming preprocessor?

  • mucle622h

    This feels like the start of a paradigm shift.

    I need to reunderwrite what my vision of the future looks like.

  • jcalvinowens1d

    How much of this result is effectively plagiarized open source compiler code? I don't understand how this is compelling at all: obviously it can regurgitate things that are nearly identical in capability to already existing code it was explicitly trained on...

    It's very telling how all these examples are all "look, we made it recreate a shitter version of a thing that already exists in the training set".

    • jeroenhd1d

      The fact it couldn't actually stick to the 16 bit ABI so it had to cheat and call out to GCC to get the system to boot says a lot.

      Without enough examples to copy from (despite CPU manuals being available in the training set) the approach failed. I wonder how well it'll do when you throw it a new/imaginary instruction set/CPU architecture; I bet it'll fail in similar ways.

      • jsnell1d

        "Couldn't stick to the ABI ... despite CPU manuals being available" is a bizarre interpretation. What the article describes is the generated code being too large. That's an optimization problem, not a "couldn't follow the documentation" problem.

        And it's a bit of a nasty optimization problem, because the result is all or nothing. Implementing enough optimizations to get from 60kB to 33kB is useless, all the rewards come from getting to 32kB.

      • jcalvinowens1d

        IMHO a new architecture doesn't really make it any more interesting: there's too many examples of adding new architectures in the existing codebases. Maybe if the new machine had some bizarre novel property, I suppose, but I can't come up with a good example.

        If the model were retrained without any of the existing compilers/toolchains in its training set, and it could still do something like this, that would be very compelling to me.

    • Philpax1d

      What Rust-based compiler is it plagiarising from?

      • lossolo1d

        Language doesn't really matter, it's not how things are mapped in the latent space. It only needs to know how to do it in one language.

        • HDThoreaun24h

          Ok you can say this about literally any compiler though. The authors of every compiler have intimate knowledge of other compilers, how is this different?

          • eggn00dles21h

            grace hopper spinning in her grave rn

        • jsnell1d

          Did you actually look at these?

          > https://github.com/jyn514/saltwater

          This is just a frontend. It uses Cranelift as the backend. It's missing some fairly basic language features like bitfields and variadic functions. And if I'm reading the documentation right, it requires all the source code to be in a single file...

          > https://github.com/ClementTsang/rustcc

          This will compile basically no real-world code. The only supported data type is "int".

          > https://github.com/maekawatoshiki/rucc

          This is just a frontend. It uses LLVM as the backend.

        • Philpax1d

          Look at what those compilers are capable of compiling and to which targets, and compare it to what this compiler can do. Those are wonderful, and I have nothing but respect for them, but they aren't going to be compiling the Linux kernel.

          • rubymamis1d

            I just did a quick Google search only on GitHub, maybe there are better ones out there on the internet?

          • Philpax22h

            Can't compile the Linux kernel, and ironically, also partly written by Claude.

          • Philpax22h

            A genuinely impressive effort, but alas, still missing some pretty critical features (const, floating point, bools, inline, anonymous structs in function args).

      • jcalvinowens1d

        Being written in rust is meaningless IMHO. There is absolutely zero inherent value to something being written in rust. Sometimes it's the right tool for the job, sometimes it isn't.

        • modeless1d

          It means that it's not directly copying existing C compiler code which is overwhelmingly not written in Rust. Even if your argument is that it is plagiarizing C code and doing a direct translation to Rust, that's a pretty interesting capability for it to have.

          • seba_dos124h

            Translating things between languages is probably one of the least interesting capabilities of LLMs - it's the one thing that they're pretty much meant to do well by design.

          • jcalvinowens1d

            Surely you agree that directly copying existing code into a different language is still plagiarism?

            I completely agree that "rewrite this existing codebase into a new language" could be a very powerful tool. But the article is making much bolder claims. And the result was more limited in capability, so you can't even really claim they've achieved the rewrite skill yet.

        • Philpax1d

          Please don't open a bridge to the Rust flamewar from the AI flamewar :-)

          • jcalvinowens1d

            Hahaha, fair enough, but I refuse to be shy about having this opinion :)

    • anematode1d

      Honestly, probably not a lot. Not that many C compilers are compatible with all of GCC's weird features, and the ones that are, I don't think are written in Rust. Hell, even clang couldn't compile the Linux kernel until ~10 years ago. This is a very impressive project.

  • stephc_int131d

    It means that if you already have, or are willing to build, a very robust test suite, and the task is a complicated but already solved problem, you can get a sub-par implementation for a semi-reasonable amount of money.

    This is not entirely ridiculous.

  • tonis215h

    I wish they would do llvm from scratch too

  • IshKebab1d

    > I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.

    This has been my experience of vibe coding too. Good for getting started, but you quickly reach the point where fixing one thing breaks another and you have to finish the project yourself.

  • sreekanth85018h

    Much better than cursor's browser fiasco.

  • lambda-lollipop16h

    apparently [hello world does not compile...](https://github.com/anthropics/claudes-c-compiler/issues/1)

  • 77341281d

    I'm sure this is impressive, but it's probably not the best test case given how many C compilers there are out there and how they presumably have been featured in the training data.

    This is almost like asking me to invent a path-finding algorithm when I've been taught Dijkstra's and A*.

    • NitpickLawyer1d

      It's a bit disappointing that people are still re-hashing the same "it's in the training data" old thing from 3 years ago. It's not like any LLM could regurgitate millions of LoC 1-for-1 from any training set... This is not how it works.

      A pertinent quote from the article (which is a really nice read, I'd recommend reading it fully at least once):

      > Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects. My goal with Opus 4.6 was to again test the limits.

      • wmf1d

        In this case it's not reproducing training data verbatim but it probably is using algorithms and data structures that were learned from existing C compilers. On one hand it's good to reuse existing knowledge but such knowledge won't be available if you ask Claude to develop novel software.

        • RobMurray1d

          How often do you need to invent novel algorithms or data structures? Most human written code is just rehashing existing ideas as well.

          • notnullorvoid1d

            I wouldn't say I need to invent much that is strictly novel, though I often iterate on what exists and delve into novel-ish territory. That being said I'm definitely in a minority where I have the luxury/opportunity to work outside the monotony of average programming.

            The part I find concerning is that I wouldn't be in the place I am today without spending a fair amount of time in that monotony and really delving in to understand it and slowly pushing outside its boundary. If I was starting programming today, I can confidently say I would've given up.

          • lossolo1d

            They're very good at reiterating, that's true. The issue is that without the people outside of "most humans" there would be no code and no civilization. We'd still be sitting in trees. That is real intelligence.

            • ben_w1d

              Why's that the issue?

              "This AI can do 99.99%* of all human endeavours, but without that last 0.01% we'd still be in the trees", doesn't stop that 99.99% getting made redundant by the AI.

              * vary as desired for your preference of argument, regarding how competent the AI actually is vs. how few people really show "true intelligence". Personally I think there's a big gap between them: paradigm-shifting inventiveness is necessarily rare, and AI can't fill in all the gaps under it yet. But I am very uncomfortable with how much AI can fill in for.

              • notnullorvoid22h

                Here's a potentially more uncomfortable thought: if all people through history with potential for "true intelligence" had a tool that did 99% of everything, do you think they would've had the motivation to learn enough of that 99% to gain insight into the not-yet-discovered?

        • ofrzeta17h

          You mean like ... a compiler engineer that has learned from books and code samples?

      • simonw1d

        This is a good rebuttal to the "it was in the training data" argument - if that's how this stuff works, why couldn't Opus 4.5 or any of the other previous models achieve the same thing?

      • f311a14h

        That's because they still struggle hard with out-of-distribution tasks even though some of them can be solved using existing training data pretty well. Focusing on out-of-distribution will probably lower scores for benchmarks. They focus too much on common tasks.

        And keep in mind, the original creators of the first compiler had to come up with everything: lexical analysis -> parsing -> IR -> codegen -> optimization. LLMs are not yet capable of producing a lot of novelty. There are many areas in compilers that can be optimized right now, but LLMs can't help with that.

      • lossolo1d

        They couldn't do it because they weren't fine-tuned for multi-agent workflows, which basically means they were constrained by their context window.

        How many agents did they use with previous Opus? 3?

        You've chosen an argument that works against you, because they actually could do that if they were trained to.

        Give them the same post-training (recipes/steering) and the same datasets, and voila, they'll be capable of the same thing. What do you think is happening there? Did Anthropic inject magic ponies?

      • fatherwavelet23h

        At some point it becomes like someone playing a nice song on piano and then someone countering with "that is great but play a song you don't know!".

        Then they start improvising and the same person counters with "what a bunch of slop, just making things up!"

      • falloutx1d

        They can literally print out entire books line by line.

      • zephen1d

        > It's a bit disappointing that people are still re-hashing the same "it's in the training data" old thing from 3 years ago.

        They only have to keep reiterating this because people are still pretending the training data doesn't contain all the information that it does.

        > It's not like any LLM could 1for1 regurgitate millions of LoC from any training set... This is not how it works.

        Maybe not any old LLM, but Claude gets really close.

        https://arxiv.org/pdf/2601.02671v1

      • skydhash1d

        Because for all those projects, the effective solution is to just use the existing implementation and not launder code through an LLM. We would rather see a stab at fixing CVEs or implementing features in open source projects. Like the wifi situation in FreeBSD.

      • lunar_mycroft1d

        LLMs can regurgitate almost all of the Harry Potter books, among others [0]. Clearly, these models can actually regurgitate large amounts of their training data, and reconstructing any gaps would be a lot less impressive than implementing the project truly from scratch.

        (I'm not claiming this is what actually happened here, just pointing out that memorization is a lot more plausible/significant than you say)

        [0] https://www.theregister.com/2026/01/09/boffins_probe_commerc...

        • StilesCrisis1d

          The training data doesn't contain a Rust based C compiler that can build Linux, though.

  • secretsatan13h

    Who checks to see if it’s backdoored?

  • logicprog20h

    I will say that one thing that's extremely interesting is that everyone laughed at and made fun of Steve Yegge when he released Gas Town, which centered exactly around this idea: having more than a dozen agents working on a project simultaneously, with some generalized agents focusing on implementing features while others are more specialized and tasked with second-order work, and you just run them independently in a loop from an orchestrator until they've finished the project, with each agent working on a worktree and resolving merge conflicts as the coordination mechanism. But it's starting to look like he was right. He really was aiming for where the puck was headed. First we got Cursor with its fast-render browser, then we got Kimi K2.5 releasing with (from everything I can tell) genuinely innovative RL techniques specific to orchestrating agent swarms. And now we have this: Anthropic themselves doing a Gas Town-style agent-swarm model of development. It's beginning to look like he absolutely did know where the puck was headed before it got there.

    Now, whether we should actually be building software in this fashion, or even headed in this direction at all, is a completely separate question. And I would tend strongly towards no. Not until, at the very least, we have very strong yet easy-to-use, concise, and low-effort formal verification, deterministic simulation testing, property-based testing, integration testing, etc.; and even then, we'll end up pair-programming those formal specifications and batteries of tests with AI agents. Not writing them ourselves, since that's inefficient, nor turning them over to agent swarms, since they are very important. And if we turned them over to swarms, we'd end up with an infinite regress problem. And ultimately, that's just programming at a higher level at that point. So I would argue we should never predominantly develop in this way.

    But still, there is prescience in Gas Town apparently, and that's interesting.

  • casey218h

    Interesting that they are still going with a testing strategy despite the wasted time. I think in the long run model checking and proofs are more scalable.

    I guess it makes sense, as agents can generate tests. Since you are taking this route, I'd like to see agents that act as users, which can only access docs, textbooks, user forums, and builds.

  • sho_hn1d

    Nothing in the post about whether the compiled kernel boots.

    • chews1d

      video does show it booting.

  • davemp22h

    Brute forcing a problem with a perfect test oracle and a really good heuristic (how many c compilers are in the training data) is not enough to justify the hype imo.

    Yes this is cool. I actually have worked on a similar project with a slightly worse test oracle and would gladly never have to do that sort of work again. Just tedious unfulfilling work. Though we caught issues with both the specifications/test oracle when doing the work. Also many of the team members learned and are now SMEs for related systems.

    Is this evidence that knowledge work is dead or AGI is coming? Absolutely not. I think you’d be pretty ignorant with respect to the field to suggest such a thing.

  • almosthere22h

    This is like the 6th trending claude story today. It must be obvious that they told everyone at Anthropic to upvote and comment.

  • light_hue_11d

    > This was a clean-room implementation (Claude did not have internet access at any point during its development);

    This is absolutely false and I wish the people doing these demonstrations were more honest.

    It had access to GCC! Not only that, using GCC as an oracle was critical and had to be built in by hand.

    Like the web browser project this shows how far you can get when you have a reference implementation, good benchmarks, and clear metrics. But that's not the real world for 99% of people, this is the easiest scenario for any ML setting.

    • rvz1d

      > This is absolutely false and I wish the people doing these demonstrations were more honest.

      That's because the "testing" was not done independently. So anything can possibly be made to be misleading. Hence:

      > Written by Nicholas Carlini, a researcher on our Safeguards team.

  • gre1d

    There's a terrible bug where once it compacts, it sometimes pulls in .o or binary files and immediately fills your entire context. Then it compacts again... 10 minutes and your token budget is gone for the 5-hour period. edit: hooks that prevent it from reading binary files can't prevent this.

    Please fix.. :)

  • pshirshov23h

    Pfft, a C compiler.

    Look at this: https://github.com/7mind/jopa

  • Havoc1d

    Cool project, but they really could have skipped the mention of clean room. Something trained on every copyrighted thing known to mankind is the opposite of clean room

    • cheema331d

      As others have pointed out, humans train on existing codebases as well. And then use that knowledge to build clean room implementations.

      • mxey24h

        That’s the opposite of clean-room. The whole point of clean-room design is that you have your software written by people who have not looked into the competing, existing implementation, to prevent any claim of plagiarism.

        “Typically, a clean-room design is done by having someone examine the system to be reimplemented and having this person write a specification. This specification is then reviewed by a lawyer to ensure that no copyrighted material is included. The specification is then implemented by a team with no connection to the original examiners.”

      • kelnos23h

        No they don't. One team meticulously documents and specs out what the original code does, and then a completely independent team, who has never seen the original source code, implements it.

        Otherwise it's not clean-room, it's plagiarism.

      • regularfry24h

        What they don't do is read the product they're clean-rooming. That's kinda disqualifying. Impossible to know if the GCC source is in 4.6's training set but it would be kinda weird if it wasn't.

      • HarHarVeryFunny21h

        True, but the human isn't allowed to bring 1TB of compressed data pertaining to what they are "redesigning from scratch/memory" into the clean room.

        In fact the idea of a "clean room" implementation is that all you have to go on is the interface spec of what you are trying to build a clean (non-copyright violating) version of - e.g. IBM PC BIOS API interface.

        You can't have previously read the IBM PC BIOS source code, then claim to have created a "clean room" clone!

      • pizlonator24h

        Not the same.

        I have read nowhere near as much code (or anything) as what Claude has to read to get to where it is.

        And I can write an optimizing compiler that isn't slower than GCC -O0

      • cermicelli1d

        If that's what clean room means to you, I now know AI can definitely replace you. As even ChatGPT is better than that.

        (prompt: what does a clean room implementation mean?)

        From ChatGPT without login BTW!

        > A clean room implementation is a way of building something (usually software) without copying or being influenced by the original implementation, so you avoid copyright or IP issues.

        > The core idea is separation.

        > Here’s how it usually works:

        > The basic setup

        > Two teams (or two roles):

        > Specification team (the “dirty room”)

        > Looks at the original product, code, or behavior

        > Documents what it does, not how it does it

        > Produces specs, interfaces, test cases, and behavior descriptions

        > Implementation team (the “clean room”)

        > Never sees the original code

        > Only reads the specs

        > Writes a brand-new implementation from scratch

        > Because the clean team never touches the original code, their work is considered independently created, even if the behavior matches.

        > Why people do this

        > Reverse-engineering legally

        > Avoid copyright infringement

        > Reimplement proprietary systems

        > Create open-source replacements

        > Build compatible software (file formats, APIs, protocols)

        I really am starting to think we have achieved AGI. > Average (G)Human Intelligence

        LMAO

    • benjiro1d

      Hot take:

      If you try to reimplement something in a clean room, it's a step-by-step process, using your own accumulated knowledge as the basis. That knowledge that you hold in your brain all too often includes code that may have copyrights on it, from the companies you worked at.

      Is it any different for a LLM?

      The fact that the LLM is trained on more data does not change that when you work for a company, leave it, and take that accumulated knowledge to a different company, you are by definition taking that knowledge (which may be copyrighted) and implementing it somewhere else. It's only an issue if you copy the code directly, or do the implementation as a 1:1 copy. LLMs do not make 1:1 copies of the original.

      At what point is training on copyrighted data any different than a human trained on copyrighted data, who reimplements it in a transformative way? The big difference is that the LLM can hold more data over more fields vs. a human, true... But if we look at specializations, this can come back to the same, no?

      • Crestwave21h

        Clean-room design is extremely specific. Anyone who has so much as glanced at Windows source code[1] (or even ReactOS code![2]) is permanently banned from contributing to WINE.

        This is 100% unambiguously not clean-room unless they can somehow prove it was never trained on any C compiler code (which they can't, because it most certainly was).

        [1] https://gitlab.winehq.org/wine/wine/-/wikis/Developer-FAQ#wh...

        [2] https://gitlab.winehq.org/wine/wine/-/wikis/Clean-Room-Guide...

      • cermicelli24h

        If you have worked on a related copyrighted work you can't work on a clean room implementation. You will be sued. There are lots of people who have tried and found out.

        They weren't trillion-dollar AI companies that could bankroll the defense, sure. But invoking "clean room" while using copyrighted stuff is not even an argument; that's just nonsense to try to prove something when no one asked.

  • dmitrygr1d

    > The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.

    Worse than "-O0" takes skill...

    So then, it produced something much worse than tcc (which is better than gcc -O0), an equivalent of which one man can produce in under two weeks. So even all those tokens and dollars did not equal one man's week of work.

    Except the one man might explain such arbitrary and shitty code as this:

    https://github.com/anthropics/claudes-c-compiler/blob/main/s...

    why x9? who knows?!

    Oh god the more i look at this code the happier I get. I can already feel the contracts coming to fix LLM slop like this when any company who takes this seriously needs it maintained and cannot...

    • ben_w1d

      I'm trying to recall a quote. Some war where all defeats were censored in the news, possibly Paris was losing to someone. It was something along the lines of "I can't help but notice how our great victories keep getting closer to home".

      Last year I tried using an LLM to make a joke language, I couldn't even compile the compiler the source code was so bad. Before Christmas, same joke language, a previous version of Claude gave me something that worked. I wouldn't call it "good", it was a joke language, but it did work.

      So it sucks at writing a compiler? Yay. The gloriously indefatigable human mind wins another battle against the mediocre AI, but I can't help but notice how the battles keep getting closer to home.

      • sjsjsbsh1d

        > but I can't help but notice how the battles keep getting closer to home

        This has been true for all of (known) human history. I’m gonna go ahead and make another bold prediction: tech will keep getting better.

        The issue with this blog post is it’s mostly marketing.

    • sebzim45001d

      Can one man really make a C compiler in one week that can compile linux, sqlite, etc.?

      Maybe I'm underestimating the simplicity of the C language, but that doesn't sound very plausible to me.

      • dmitrygr1d

        yes, if you do not care to optimize, yes. source: done it

        • Philpax1d

          I would love to see the commit log on this.

          • rustystump1d

            Implementing just enough to conform to a language is not as difficult as it seems. Making it fast is hard.

          • dmitrygr1d

            did this before i knew how to git, back in college. target was ARMv5

            • Philpax1d

              Great. Did your compiler support three different architectures (four, if you include x86 in addition to x86-64) and compile and pass the test suite for all of this software?

              > Projects that compile and pass their test suites include PostgreSQL (all 237 regression tests), SQLite, QuickJS, zlib, Lua, libsodium, libpng, jq, libjpeg-turbo, mbedTLS, libuv, Redis, libffi, musl, TCC, and DOOM — all using the fully standalone assembler and linker with no external toolchain. Over 150 additional projects have also been built successfully, including FFmpeg (all 7331 FATE checkasm tests on x86-64 and AArch64), GNU coreutils, Busybox, CPython, QEMU, and LuaJIT.

              Writing a C compiler is not that difficult, I agree. Writing a C compiler that can compile a significant amount of real software across multiple architectures? That's significantly more non-trivial.

              • AshamedCaptain11h

                Frankly, I think you are exaggerating. My university had a course that required students to build a C compiler that could run the C subset of SPECint (which includes frigging Perl) and this was the usual 3 month class that was not expected to fill in 24h of your time, so I'd say 1 week sounds perfectly reasonable for someone already familiar. Good enough C for a shitton of projects is barely more complicated than writing an assembler, in fact, that is one of C's strong points (which is also the source of most of its weaknesses).

                • Philpax8h

                  I really, really don't think so, but you're welcome to try :-)

    • bwfan1231d

      > I can already feel the contracts coming to fix LLM slop

      First, the agents will attempt to fix issues on their own. Most easy problems will be fixed or worked around in this manner. The hard problems will require a deeper causal model of how things work. For these, the agents will give up. But by then the code-base has evolved to a point where no-one understands what's going on, including the agents and their human handlers. Expect your phone to ring at that point, and prepare to ask for a ransom.

    • small_model1d

      Claude is only a few years old so we should compare it to a 3 year old human's C compiler

      • notnullorvoid22h

        Claude requires many lifetimes worth of data to "learn". Evolution aside humans don't require much data to learn, and our learning happens in real-time in response to our environment.

        Train Claude without the programming dataset and give it a dozen of the best programming books, it'll have no chance of writing a compiler. Do the same for a human with an interest in learning to program and there's a good chance.

      • zephen1d

        Claude contains the entire wisdom of the internet, such as it is.

    • sjsjsbsh1d

      > I can already feel the contracts coming to fix LLM slop like this when any company who takes this seriously needs it maintained and cannot

      Honest question, do you think it’d be easier to fix or rewrite from scratch? With domains I’m intimately familiar with, I’ve come very close to simply throwing the LLM code out after using it to establish some key test cases.

      • dmitrygr1d

        Rewrite is what I’ve been doing so far in such cases. Takes fewer hours

  • sjsjsbsh1d

    > So, while this experiment excites me, it also leaves me feeling uneasy. Building this compiler has been some of the most fun I’ve had recently, but I did not expect this to be anywhere near possible so early in 2026

    What? Didn’t cursed lang do something similar like 6 or 7 months ago? These bombastic marketing tactics are getting tired.

    • ebiester1d

      Do you not see the difference between a toy language and a clean room implementation that can compile Linux, QEMU, Postgres, and sqlite? (No, it doesn't have the assembler and linker.)

      That's for $20,000.

      • falloutx1d

        people have built compilers for free; with $20000 you can even hire a couple of devs for a year in low-income countries.

    • jsnell1d

      No? That was a frontend for a toy language using LLVM as the backend. This is a totally self-contained compiler that's capable of compiling the Linux kernel. What's the part that you think is similar?

  • andrepd13h

    This chatbot has several C compilers in its training data. How is this possibly a useful benchmark for anything? LLMs routinely output code verbatim or modulo trivial changes as their own (very useful for license-laundering too).

  • trilogic1d

    Can it create employment? How is this making life better? I understand the achievement, but come on, wouldn't it be something to show if you created employment for 10000 people using your 20000 USD!

    Microsoft, OpenAI, Anthropic, XAI, all solving the wrong problems, your problems not the collective ones.

    • m4ck_19h

      Didn't you hear? We're heading towards a workless utopia where everything will be free (according to people who are actively working to eliminate things like food assistance for less fortunate mothers and children.)

    • jeffbee1d

      "Employment" is not intrinsically valuable. It is an emergent property of one way of thinking about economic systems.

      • wiseowise13h

        That’s the most HN reply ever. Obtuse and pedantic.

        Tell a struggling undergrad or unemployed that “employment” is not intrinsically valuable, maybe they’ll be able to use the rhetoric to move a couple positions higher in a soup kitchen queue before their food coupons expire.

      • trilogic1d

        For employment I mean "WHATEVER LEADS TO REWARD COLLECTIVE HUMANS TO SURVIVE".

        Call it as you wish, but I am certainly not talking about coding values.

        • falcor841d

          I'm struggling to even parse the syntax of "WHATEVER LEADS TO REWARD COLLECTIVE HUMANS TO SURVIVE", but assuming that you're talking about resource allocation, my answer is UBI or something similar to it. We only need to "reward" for action when the resources are scarce, but when resources are plentiful, there's no particular reason not to just give them out.

          I know it's "easier to imagine an end to the world than an end to capitalism", but to quote another dreamer: "Imagine all the people sharing all the world".

          • swexbe1d

            Except resources won't be plentiful for a long while since AI is only impacting the service sector. You can't eat a service, you can't live in one. SAAS will get very cheap though...

            • falcor8424h

              Robotics has been advancing very quickly recently. If we solve long-term AI action planning, I don't see any limitation to making it embodied.

    • mofeien1d

      Obviously a human in the loop is always needed and this technology that is specifically trained to excel at all cognitive tasks that humans are capable of will lead to infinite new jobs being created. /s

  • bsoles23h

    The title should have said "Anthropic stole GCC and other open-source compiler code to create a subpar, non-functional compiler", without attribution or compensation. Open source was never meant for thieving megacorps like them.

    No, I did not read the article...

  • ur-whale11h

    > We tasked Opus 4.6 using agent teams to build a C Compiler

    So, essentially to build something for which many, many examples already exist on the web, and which is likely baked into its training set somehow ... mmmyeah.

  • falloutx1d

    So it copied one of the C compilers? This was always possible but now you need to pay $1000 in API costs to Anthropic

    • Rudybega1d

      It wrote the compiler in Rust. As far as I know, there aren't any Rust based C compilers with the same capabilities. If you can find one that can compile the Linux kernel or get 99% on the GCC torture test suite, I would be quite surprised. I couldn't in a search.

      Maybe read the article before being so dismissive.

      • falloutx1d

        Why does the language of the compiler matter? It's a solved problem, and since other implementations are already available, anyone can already transpile them to Rust.

        • Rudybega1d

          Direct transpilation would create a ton of unsafe code (this repo doesn't have any) and fixing that would require a lot of manual fixes from the model. Even that would be a massive achievement, but it's not how this was created.

          • f311a14h

            They are trained pretty hard to transpile code between languages and do this pretty well, because this can be done using RL.

            You can force the agent not to use unsafe; this is why it burned $20000. Thousands of attempts against good tests with good boundaries set.

      • hgs31d

        > As far as I know, there aren't any Rust based C compilers with the same capabilities.

        If you trained on a neutral representation like an AST or IR, then the source language shouldn't matter. *

        * I'm not familiar with how Anthropic builds their models, but training this way should nullify PL differences.

        • astrange12h

          LLMs do not learn concepts perfectly across programming languages, just as they don't learn concepts perfectly across human languages.

    • chucksta1d

      Add a 0 and double it

      > Over nearly 2,000 Claude Code sessions and $20,000 in API cost

      • lossyalgo1d

        One more reason RAM prices will continue to go up.

  • 1d
    [deleted]
  • chvid1d

    100.000 lines of code for something that is literally a text book task?

    I guess if it only created 1.000 lines it would be easy to see where those lines came from.

    • falcor841d

      > literally a text book task

      Generating a 99% compliant C compiler is not a textbook task in any university I've ever heard of. There's a vast difference between a toy compiler and one that can actually compile Linux and Doom.

      From a bit of research now, there are only three other compilers that can compile an unmodified Linux kernel: GCC, Clang/LLVM and Intel's oneAPI. I can't find any other compiler implementation that came close.

      • cv50051d

        That's because you need to implement a bunch of gcc-specific behavior that linux relies on. A 100% standards compliant c23 compiler can't compile linux.
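
        (Concretely: the kernel leans on GNU extensions like statement expressions and attribute syntax, which aren't part of ISO C, so a compiler that implements only the standard can't build it. A small illustration; the macro name is mine:)

            /* Statement expressions ({ ... }) have never been ISO C, and
               __attribute__ is GCC-specific syntax; the kernel uses both
               pervasively. */
            #define min_sketch(a, b) ({        \
                __typeof__(a) _a = (a);        \
                __typeof__(b) _b = (b);        \
                _a < _b ? _a : _b; })

            struct wire_hdr { char tag; int len; } __attribute__((packed));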

        • falcor8423h

          Ok, yes, that's true, though my understanding is that it's not that GCC is non-compliant, but rather that it includes extensions beyond the standard, which is allowed by the standard, which says (in section 4. Conformance):

          > A conforming implementation may have extensions (including additional library functions), provided they do not alter the behavior of any strictly conforming program

          Anyway, this just makes Claude's achievement here more impressive, right?

    • anematode1d

      A simple C89 compiler is a textbook task; a GCC-compatible compiler targeting multiple architectures that can pass 99% of the GCC torture test suite is absolutely not.

    • wmf1d

      This has multiple backends and a long tail of C extensions that are not in the textbook.

    • blibble23h

      indeed

      building a working C compiler from scratch is literally in my "teach yourself C in 24 hours" book from 30 years ago

      • simonw23h

        Which book was that? Sounds excellent.

        Might have been Compiler Design in C from 1990. Looks like that's available for free now: https://holub.com/compiler/

        • blibble21h

          it's not that one, but it's in my parents' house

          you'll forgive me if I don't ring them in the early hours of the morning...

          remember C was specifically designed to be easy to compile

          (hence anachronisms like forward declarations)
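
          (i.e. by the time a single-pass compiler reaches a call site, a forward declaration has already pinned down the signature, so it can emit code immediately without looking ahead:)

              int add(int a, int b);               /* forward declaration */

              int main(void) { return add(2, 3); } /* signature already known here */

              int add(int a, int b) { return a + b; }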

  • fxtentacle1d

    You could hire a reasonably skilled dev in India for a week for $1k, or you could pay $20k in LLM tokens, spend 2 hours writing essays to explain what you want, and then get a buggy mess.

    • Philpax1d

      No human developer, not even Fabrice Bellard, could reproduce this specific result in a week. A subset of it, sure, but not everything this does.