I’ve been mostly holding off on learning any of the tools that do this because it seemed so obvious that it’ll be built natively. Will definitely give this a go at some point!
To the folks comparing this to GasTown: keep in mind that Steve Yegge explicitly pitched agent orchestrators to, among others, Anthropic months ago:
> I went to senior folks at companies like Temporal and Anthropic, telling them they should build an agent orchestrator, that Claude Code is just a building block, and it’s going to be all about AI workflows and “Kubernetes for agents”. I went up onstage at multiple events and described my vision for the orchestrator. I went everywhere, to everyone. (from "Welcome to Gas Town" https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16d...)
That Anthropic releases Agent Teams now (as rumored a couple of weeks back), after they've already adopted a tiny bit of beads in the form of Tasks, means that either they'd been building them already back when Steve pitched orchestrators or they've decided that he's been right and it's time to scale the agents. Or they've arrived at the same conclusions independently -- it won't matter in the larger scale of things. I think Steve greatly appreciates it existing; if anything, this is a validation of his vision. We'll probably be herding polecats in a couple of months officially.
While I appreciate Anthropic making a proof of concept, like they did with the Claude Code CLI, on which they can then do RL to optimise the patterns that work, I expect this to be as unusable as the CLI itself. It's a big difference whether a model provider internalises something like thinking mode, which mainly depends on context and text, or tries to grab a part of the agent loop, which has to run on the side of the systems we build and use.
We cannot allow model providers to own the browsers, CLIs, memory, IDEs, extensions and other tooling. It's not just a matter of power; they also just suck at it, as I experience every time I have to use Claude Code instead of amp.
I truly hope we get the pattern of innovation that looks like:
- some dude vibecodes a really cool idea
- model providers build into their reference implementations
- model providers optimize models to work optimally
- startup and/or open source projects step in and build something that is actually usable and opens a new market segment
We saw this play out beautifully with amp, kilo, roo, cline, continue
Another aspect is that we do not want interfaces just made for agents to work in teams, we want software made for humans and agents, that are true platforms for these agent teams to collaborate in.
This is great and all but, who can actually afford to let these agents run on tasks all day long? Is anyone here actually using this or are these rollouts aimed at large companies?
I'm burning through so many tokens on Cursor that I've had to upgrade to Ultra recently - and I'm convinced they're tweaking the burn rate behind the scenes - the usage allowance doesn't seem proportional.
Thank god the open source/local LLM world isn't far behind.
Real numbers from today. FastAPI codebase, ~50k LOC. 4 agents, 6 tasks, ~6 min wall clock vs ~18-20 min sequential. 24 tests, 0 file conflicts. Token cost: roughly 4x a single session.
To your cost question — agent teams are sprinters, not marathon runners. You use them for a 6-minute burst of parallel work, not all day. A 6-minute burst at 4x cost is still cheaper than 20 minutes at 1x if your time matters more than tokens.
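A rough way to sanity-check the "cheaper if your time matters more than tokens" claim is a break-even calculation. The sketch below is illustrative only: the dollar cost of one sequential session is a made-up placeholder and the function names are hypothetical; only the 4x token multiplier and the 6 vs ~20 minute timings come from the numbers above.

```python
def total_costs(seq_minutes, par_minutes, token_cost_seq, cost_multiplier, hourly_rate):
    """Return (sequential_total, parallel_total) in dollars.

    Each total is token spend plus the value of the human's waiting time.
    """
    seq_total = token_cost_seq + hourly_rate * seq_minutes / 60
    par_total = token_cost_seq * cost_multiplier + hourly_rate * par_minutes / 60
    return seq_total, par_total


def break_even_hourly_rate(seq_minutes, par_minutes, token_cost_seq, cost_multiplier):
    """Hourly value of your time above which the parallel burst is cheaper."""
    minutes_saved = seq_minutes - par_minutes
    if minutes_saved <= 0:
        raise ValueError("the parallel run must be faster than the sequential run")
    extra_token_cost = token_cost_seq * (cost_multiplier - 1)
    return extra_token_cost / (minutes_saved / 60)


if __name__ == "__main__":
    # ASSUMPTION: a single sequential session costs $2 in tokens (placeholder value).
    rate = break_even_hourly_rate(seq_minutes=20, par_minutes=6,
                                  token_cost_seq=2.0, cost_multiplier=4)
    print(f"break-even: ${rate:.2f}/hour")
```

With that placeholder $2 session, the burst pays for itself once your time is worth more than roughly $25.71 an hour; plug in your own token spend to get a real figure.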
The constraint nobody mentions: tasks must be file-disjoint. Two agents editing the same file means overwrites. Plan decomposition matters more than the agents themselves.
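One way to enforce the file-disjoint rule before spawning anything is to check the plan itself. This is a minimal sketch, assuming you can write down which files each task is expected to touch; the plan format and the `find_file_conflicts` helper are hypothetical, not something Claude Code provides.

```python
from itertools import combinations


def find_file_conflicts(task_files):
    """Return (task_a, task_b, shared_files) for every pair of tasks that overlap.

    task_files maps a task name to the file paths that task is expected to edit.
    An empty result means the plan is file-disjoint.
    """
    conflicts = []
    for (name_a, files_a), (name_b, files_b) in combinations(task_files.items(), 2):
        shared = set(files_a) & set(files_b)
        if shared:
            conflicts.append((name_a, name_b, sorted(shared)))
    return conflicts


if __name__ == "__main__":
    # Hypothetical decomposition of a FastAPI codebase into three tasks.
    plan = {
        "auth": ["app/auth.py", "tests/test_auth.py"],
        "billing": ["app/billing.py"],
        "docs": ["app/auth.py", "README.md"],
    }
    for task_a, task_b, shared in find_file_conflicts(plan):
        print(f"{task_a} and {task_b} both touch {shared}")
```

Any pair it reports either needs to be merged into one task or sequenced with a dependency.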
One thing to watch: Claude Code crashed mid-session with a React reconciler error (#23555). 4 agents + MCP servers pushes the UI past its limits.
Need it be actually disjoint? Interested in learning about the limitation here because apparently the agents can coordinate.
Otherwise what’s the difference between what they are providing vs me creating two independent pull requests using agents and having an agent resolve merge conflicts?
It does need to be disjoint. The docs (https://code.claude.com/docs/en/agent-teams) are explicit: "Two teammates editing the same file leads to overwrites. Break the work so each teammate owns a different set of files."
Locking is for task claiming (preventing two agents from grabbing the same task), not for file writes:
"Task claiming uses file locking to prevent race conditions when multiple teammates try to claim the same task simultaneously."
The coordination layer (TaskList, blockedBy, SendMessage) handles logical task sequencing, not concurrent file access. You can make agent B wait for agent A via dependencies, but that serializes the work and kills the parallelism benefit.
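To see how dependencies eat into parallelism, you can group tasks into "waves" that could run at the same time. A minimal sketch using the standard library's `graphlib`; the dict shape is an assumption for illustration (it borrows the `blockedBy` idea but is not the actual task format), and `parallel_waves` is a hypothetical helper.

```python
from graphlib import TopologicalSorter


def parallel_waves(blocked_by):
    """Group tasks into waves; every task in a wave can run concurrently.

    blocked_by maps a task to the set of tasks that must finish before it starts.
    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    sorter = TopologicalSorter(blocked_by)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything unblocked right now
        waves.append(ready)
        sorter.done(*ready)  # pretend the whole wave finished
    return waves


if __name__ == "__main__":
    independent = {"a": set(), "b": set(), "c": set(), "d": set()}
    chained = {"a": set(), "b": {"a"}, "c": {"b"}, "d": {"c"}}
    print(len(parallel_waves(independent)), "wave(s) when tasks are independent")
    print(len(parallel_waves(chained)), "wave(s) when each task waits on the previous one")
```

Four independent tasks fit in a single wave; chain them with dependencies and you get four waves of one task each, which is just sequential execution with extra overhead.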
Anthropic themselves were able to write a C compiler using teams all at the same time
https://www.anthropic.com/engineering/building-c-compiler
Here is the relevant excerpt:
"To prevent two agents from trying to solve the same problem at the same time, the harness uses a simple synchronization algorithm:
Claude takes a "lock" on a task by writing a text file to current_tasks/ (e.g., one agent might lock current_tasks/parse_if_statement.txt, while another locks current_tasks/codegen_function_definition.txt). If two agents try to claim the same task, git's synchronization forces the second agent to pick a different one. Claude works on the task, then pulls from upstream, merges changes from other agents, pushes its changes, and removes the lock. Merge conflicts are frequent, but Claude is smart enough to figure that out."
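The lock-file idea in that excerpt can be sketched on a single machine. Note the swap: the harness described above relies on git to arbitrate between agents, whereas this sketch uses atomic exclusive file creation on a local filesystem instead. The `try_claim` and `release` helpers are hypothetical names, not part of any Anthropic tooling.

```python
import os
import tempfile


def try_claim(task_dir, task_name, agent_id):
    """Try to claim a task by creating its lock file; return True on success.

    O_CREAT | O_EXCL makes creation fail if the file already exists, so only
    one claimant can win even if several try at the same moment.
    """
    path = os.path.join(task_dir, f"{task_name}.txt")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # someone else holds the lock; pick a different task
    with os.fdopen(fd, "w") as lock_file:
        lock_file.write(agent_id)  # record the owner for debugging
    return True


def release(task_dir, task_name):
    """Remove the lock file once the task's changes have been merged."""
    os.remove(os.path.join(task_dir, f"{task_name}.txt"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as current_tasks:
        print(try_claim(current_tasks, "parse_if_statement", "agent-1"))
        print(try_claim(current_tasks, "parse_if_statement", "agent-2"))
        release(current_tasks, "parse_if_statement")
        print(try_claim(current_tasks, "parse_if_statement", "agent-2"))
```

This only guards who works on which task. It does nothing about two agents editing the same file, which is why the compiler harness still has to merge and resolve conflicts.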
A Claude max 20x plan and you’ll be fine. I’d been doing my normal process of running 4 Claude sessions in parallel because that was about the right amount of concurrent sessions for me to watch what’s going on and approve/deny plans and code… and this blows it out of the water. With an agent swarm it’s so fast at executing and testing I’m limited by my idea and review capabilities now. I tried running 2 and I can’t keep up, I’m defining specs and the other window is done, tested, validated and waiting for me.
Many many companies can afford to hire a junior engineer for $150k/year (plus employer payroll taxes, employee benefits etc.).
Are you spending more than $150k per year on AI?
(Also, you're talking about the cost of your Cursor subscription, when the article is about Claude Code. Maybe try Claude Max instead?)
If it could do anything that a junior dev could, that’d be a valid point of comparison. But it continually, wildly performs slower and falls short every time I’ve tried.
> But it continually, wildly performs slower and falls short every time I’ve tried.
If it falls short every time you've tried, it's likely that one or more of these is true:
A. You're working on some really deep thing that only world-class experts can do, like optimizing graphics engines for AAA games.
B. You're using a language that isn't in the top ~10 most popular in AI models' training sets.
C. You have an opportunity to improve your ability to use the tools effectively.
How many hours have you spent using Claude Code?
Trying to make a media player and media server, all using ffmpeg and a pre-built media streaming engine as its core. Python and SQLite. About a week's worth of effort every time until it begins to go too far off the rails to be reliable to continue to develop with. It never did get the ffmpeg commands right; I had to go back to crafting those by hand. It never did get the streaming engine to play in the browser's video player in the supported HLS and DASH formats. Asked it to build a file and file metadata caching layer and then had to continue to re-prompt it to poll the caching layers before trying to get values from the database. Never even got to the library, metadata, or library image functionality. Had to ask it to create the RBAC permissions model I wanted despite it being very junior-level common sense (super-admin, user-admin, metadata admin, image admin).
Not exactly world-class software.
I'm curious which harness and which model(s) you've been using.
And whether you have a decent PRD or spec. Are you trying to prompt the harness with one bit at a time, or did you give it a complete spec and ask it to analyze it and break it down into individual issues with dependencies (e.g. using beads and beads_viewer)?
I'm not looking for reasons to criticize your approach or question your experience, but your answers may point to opportunities for you to get more out of these tools.
If you're using Claude Code and you have a friend who has had more success with these tools, consider exporting your transcripts and letting them have a look: https://simonwillison.net/2025/Dec/25/claude-code-transcript...
I recently built something in the same universe - using ffmpeg to receive streams from obs to capture audio and video - don't want to get into details beyond except to say it involved a fairly involved pipeline of ray actors and a significant admin interface with nicegui. I had no problem doing this with claude. You need to give it access to look up how do things, like context7. If you are doing something very specific, you need to have a session that does research to build a skill so it doesn't need to redo that research every time. And yes, you do need to tell it the architecture and be fairly detailed with something like how you want rbac.
Using these tools takes quite a bit of effort, but even after doing all those steps to use the tool well, I still got this project done in a few days when it otherwise would have taken me 1-2 months and likely simply would never have happened at all.
> A. You're working on some really deep thing that only world-class experts can do, like optimizing graphics engines for AAA games.
This is a relatively common skill. One thing I always notice about the video game industry is it's much more globally distributed than the rest of the software industry.
Being bad at writing software is Japan's whole thing but they still make optimized video games.
It’s a simple compiler optimization over bayesian statistics. It’s masters-level stuff at best, given that I’m on it instead of some expert. The codebase is mixed python and rust, neither of which are uncommon.
The issues I ran into are primarily “tail-chasing” ones - it gets into some attractor that doesn’t suit the test case and fails to find its way out. I re-benchmark every few months, but so far none of the frontier models have been able to make changes that have solved the issue without bloating the codebase and failing the perf tests.
It’s fine for some boilerplate dedup or spinning up some web api or whatever, but it’s still not suitable for serious work.
Would you expect a junior engineer to perform better than this?
> like optimizing graphics engines for AAA games.
Claude would be worse than an expert at this, but this is a benchmarkable task. Claude can do experiments a lot quicker than a human can. The hard part would be ensuring that the results aren't just gaming your benchmark.
The possibility that the performance of these tools still isn't at the level some people need it to be is not an option?
It's insulting that criticism is often met with superficial excuses and insinuation that the user lacks the required skills.
When really solid programmers who started skeptical (and even have a ban policy if PR submitters don’t disclose they used AI) now show how their workflows have been improved by AI agents, it may be worth trying to understand what they are doing and you are not.
https://mitchellh.com/writing/my-ai-adoption-journey
My experience mirrors that of Mitchell. It absolutely is at the level now where AI can free up time to do the really interesting stuff.
That possibility is covered by A and B.
GP said 'falls short every time I’ve tried'. Note the word 'every'.
Companies are not comparing it straight to juniors. They're more making a comparison between a Senior with the assistance of one or more juniors, vs a Senior with the assistance of AI Agents.
I feel like comparison just to a junior developer is also becoming a fairly outdated comparison. Yes, it is worse in some ways, but also VASTLY superior in others.
It’s funny so many companies making people RTO and spending all this money on offices to get “hallway” moments of innovation, while emptying those offices of the people most likely to have a new perspective.
I am way more productive with $200/month of AI than I would be with $5,000/month of junior developer. And it isn’t close.
What if you are going to spend 5400 either way, you go all agent or get an apprentice and an agent for them too.
I can't even get through my Claude Max quota, and that's only 200/mo. And I code every day and use it for various other pretty-intensive tasks.
only $200/mo…$200 a month is a used car payment.
I guarantee you that price will double by 2027. Then it’ll be a new car payment!
I’m really not saying this to be snarky, I’m saying this to point out that we’re really already in the enshittification phase before the rapid growth phase has even ended. You’re paying $200 and acting like that’s a cheap SaaS product for an individual.
I pay less for Autocad products!
This whole product release is about maximizing your bill, not maximizing your productivity.
I don’t need agents to talk to each other. I need one agent to do the job right.
$200/month is peanuts when you are a business paying your employees $200k/year. I think LLMs make me at least 10% more effective and therefore the cost to my employer is very worth it. Lots of trades have much more expensive tools (including cars).
> I think LLMs make me at least 10% more effective
I know this was last year but...
https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
Honestly, that is a “skill issue” as the kids these days say. When used properly and with skill, agents can increase your productivity. Like any tool, use it wrong and your life will be worse off. The logically consistent view if you want to believe this study and my experience is that the average person is hindered by using AI because they do not have the skills, but there are people out there who gain a net benefit.
It drives me nuts that people take the mean of AI code generation results and use that to make claims about what AI code generation is capable of. It's like using the mean basketball player to argue that people like LeBron and Jordan don't exist.
No, we just want to point out not everybody utilizing agents ends up like LeBron or Jordan - most are Brian Scalabrine.
For sure. I like having discussions with nuanced takes, these are tools with strengths and weaknesses and being a good tool user includes knowing when not to pick it up.
It’s a skill issue, which means you can’t fire any of your highly skilled employees, which means it has the same value as any other business organization tool like Jira or Microsoft Excel, approximately $10-20 per user per month.
Autodesk Fusion for manufacturing costs less than Claude Max and you literally can’t do your job without it.
So Autodesk takes you from 0 to 100% productivity for under $200 a month and companies are expected to pay $200+ to gain an extra 10-20%?
That math isn’t how it works with any other business logic tools.
I don’t need external research to validate or invalidate my own experience.
One of the outcomes of that study is that your own productivity estimate might not match up with reality.
Maybe for the developers who weren't very productive to begin with, and got even lazier now.
I think it depends on the tasks you use it for. Bootstrapping or translating projects between languages is amazing. New feature development? Questionable.
I don’t write frontend stuff, but sometimes need to fix a frontend bug.
Yesterday I fed claude very surgical instructions on how the bug happens, and what I want to happen instead, and it oneshot the fix. I had a solution in about 5 minutes, whereas it would have taken me at least an hour, but most likely more time to get to that point.
Literally an hour or two of my day was saved yesterday. I am salaried at around $250/hour, so in that one interaction AI saved my employer $250-500 in wages.
AI allows me to be a T shaped developer, I have over a decade of deep experience in infrastructure, but know fuck all about front end stuff. But having access to AI allows me as an individual who generally knows how computers work to fix a simple problem which is not in my domain.
Maybe this is a gray area, but that's kind of my experience with it too. I understand what I want to happen, but don't understand the language and it produces a language specific result that is close enough, maybe even one-shot, for me to use. I categorize this under translation.
It also depends upon how you manage it
My process, which probably wouldn't work with concurrent agents because I'm keeping an eye on it, is basically:
- "Read these files and write some documentation on how they work - put the documentation in the docs folder" (putting relevant files into the context and giving it something to refer to later on)
- "We need to make change X, give me some options on how to do it" (making it plan based on that context)
- "I like option 2 - but we also need to take account of Y - look at these other files and give me some more options" (make sure it hasn't missed anything important)
- "Revised option 4 is great - write a detailed to-do list in the docs/tasks folder" (I choose the actual design, instead of blindly accepting what it proposes)
- I read the to-do list and get it rewritten if there's anything I'm not happy with
- I clear the context window
- "Read the document in the docs folder and then this to-do list in the docs/tasks folder - then start on phase 1"
- I watch what it's doing and stop if it goes off on one (rare, because the context window should be almost empty)
- Once done, I give the git diffs a quick review - mainly the tests to make sure it's checking the right things
- Then I give it feedback and ask it to fix the bits I'm not happy with
- Finally commit, clear context and repeat until all phases are done
Most of the time this works really well.
Yesterday I gave it a deep task that touched many aspects of the app. This was a Rails app with a comprehensive test suite - so it had lots of example code to read, plus it could give itself definite end points (they often don't know when to stop). I estimated it would take me 3-4 days to complete the feature by hand. It made a right mess of the UI but it completed the task in about 6 hours, and I spent another 2 hours tidying it up and making it consistent with the visuals elsewhere (the logic and back-end code was fine).
So either my original estimate is way off, or it has saved me a good amount of time there.
When you say "it" completed the task in 6 hours, do you mean with you in the loop or running autonomously for hours after a certain point?
New feature development in web and mobile apps is absolutely 10% more productive with these tools, and anyone who says otherwise is coping. That's a large fraction of software development.
The flat earther argument.
“The research is wrong.”
I proposed a logically consistent perspective where both my experience and the study are true at the same time? What is your response to that other than comparing me to a flat earther? Do you have something useful to contribute?
Not saying $200/mo isn't a lot, but I think you're underestimating used car payments these days. The average US used car payment is above $500 now.
As company owner the math is simple:
If I pay $3k/month to a developer and a $200/month tool makes them 10% more productive I will pay it without thinking.
I pay $200/month, don’t come near the limits (yet), and if they raised the price to $1000/month for the exact same product I’d gladly pay it this afternoon (Don’t quote me on this Anthropic!)
If you’re not able to get US$thousands out of these models right now either your expectations are too high or your usage is too low, but as a small business owner and part/most-time SWE, the pricing is a rounding error on value delivered.
As a business expense to make profit, I can understand being ok with this price point.
But as an individual with no profit motive, no way.
I use these products at work, but not as much personally because of the bill. And even if I decided I wanted to pursue a for-profit side project, I’d have to validate its viability before even considering a $200 monthly subscription.
I'm paying $100 per month even though I don't write code professionally. It is purely personal use. I've used the subscription to have Claude create a bunch of custom apps that I use in my daily life.
This did require some amount of effort on my part, to test and iterate and so on, but much less than if I needed to write all the code myself. And, because these programs are for personal use, I don't need to review all the code, I don't have security concerns and so on.
$100 every month for a service that writes me custom applications... I don't know, maybe I'm being stupid with my money, but at the moment it feels well worth the price.
You can do it for $40 a month. What I'm doing:
- $20 for Claude Pro (Claude Code)
- $20 for ChatGPT Plus (Codex)
- Amp Free Plan (with ads and you get about $10 of daily value)
So you get to use 3 of the top coding agents for $40 a month.
Some tools are not meant for individuals. That 100k software defined radio isn’t meant for you either.
We’re gonna see an economic boom any minute.
"Rounding error" lol, you can hire an actual full time human in India for $1000/month.
Will they be better than Opus though?
wouldn’t hire one for $15/month…
with the US salaries for SWEs $1000/month is not a rounding error for all but definitely for some. say you make $100/hr and CC saves you say 30hrs / month? not rounding error but no brainer. if you make $200+/hr it starts to become a rounding error. I have multiple max accounts at my disposal and at this point would for sure pay $1000/month for max plan. it comes down to simple math
I'm curious: what concrete value have you extracted using these tools that is worth US$thousands?
That's one of 3 possible futures.
1. 1-3 LLM vendors are substantially higher quality than other vendors and none of those are open source. This is an oligarchy and the scenario you described will play out.
2. >3 LLM vendors are all high quality and suitable for the tasks. At least one of these is open source. This is the "commodity" scenario, and we'll end up paying roughly the cost of inference. This still might be hundreds per month, though.
3. Somewhere in between. We've got >3 vendors, but 1-3 of them are somewhat better than the others, so the leaders can charge more. But not as much more as they can in scenario #1.
It's clear what's gonna play out. Chinese open source labs are slowly closing the gap, and as American frontier labs hit diminishing return on various tasks, the Chinese models are going to be good enough for the vast majority of use cases. This is going to strip American labs ability to do monopoly plays, and force them into open behavior.
The only place frontier labs will be able to profit take is niche models for specific purposes where they can control who has access to traces tightly. Any general purpose LLM with highly available traces is gonna get distilled down instantly.
> I’m saying this to point out that we’re really already in the enshittification phase before the rapid growth phase has even ended. You’re paying $200 and acting like that’s a cheap SaaS product for an individual.
Traditional SaaS products don't write code for me. They also cost much less to run.
I'm having a lot of trouble seeing this as enshittification. I'm not saying it won't happen some day, but I don't think we're there. $200 per month is a lot, but it depends on what you're getting. In this case, I'm getting a service that writes code for me on demand.
Traditional SaaS products literally “write code” for you (they implement business logic). See: Zapier, Excel.
The enshittification is that the costs are going up faster than inflation and companies like OpenAI are talking about adding advertisements.
https://www.fintechweekly.com/magazine/articles/cursor-prici...
https://hostbor.com/claude-ai-max-plan-explained/
We can see especially in the case of Claude AI Max that while it sounds like you’re getting better value than the cheaper plans, the company is now encouraging less efficient use of the tool (having multiple agents talking to each other, rather than improving models so that one agent is doing work correctly).
> Traditional SaaS products literally “write code” for you (they implement business logic). See: Zapier, Excel.
Eh, I'd call those a sort of programming language. The user is still writing code, albeit in a "friendlier" manner. You can't just ask for what you want in English.
> The enshittification is that the costs are going up faster than inflation and companies like OpenAI are talking about adding advertisements.
In 1980, IT would have cost $0 at most companies. It's okay for costs to go up if you're getting a service you were not getting before.
If you can’t get $200 of value out of Claude Code Max, then you need to really step up your game. That’s user error.
I could write an essay about how almost everything you wrote either is extremely incorrect or is extremely likely to be incorrect. I am too lazy to, though, so I will just have to wait for another commenter to do the equivalent.
Why not make your AI tool do it for you?
Because, while I have been a huge AI optimist for decades, I generally don't like their current writing output. And even if I did, it would feel like plagiarism unless I prepended it with "an AI responded with this:", which would make me seem lazy. (Though I did already just admit I am very lazy in my first post, so perhaps that is what I will do going forward once they become better writers.)
Especially for what’s basically an experiment. Gas town didn’t really work, so there’s no guarantee this will even produce anything of value.
You know those VC funded startups with just two founders… them.
I mean, what you get for Claude Code Max is insane: it’s 30x on the token price. If you don’t spend all of that, it’s your own fault. That must be below electricity cost.
I wonder if my $20/mo subscription will last 10 minutes.
At this point, if you're paying out of pocket you should use Kimi or GLM for it to make sense
GLM is OK (haven't used it heavily but seems alright so far), a bit slow with ZAI's coding plan, amazingly fast on Cerebras but their coding plan is sold out.
Haven't tried Kimi, hear good things.
These are super slow to run locally, though, unless you've got some great hardware - right?
At least, my M1 Pro seems to struggle and take forever using them via Ollama.
Ah ok, same. I keep wondering about how this would ever accomplish anything.
I've had good results with Haiku for certain tasks.
Seems similar to Gas Town
I'm not anti-whimsy, but if your project goes too hard on the whimsy (and weird AI-generated animal art), it's kind of inevitable that someone else is going to create a whimsy-free clone, and their version will win because it's significantly less embarrassing to explain to normal people.
Where are the polecats, though? What about the mayor's dog?
I don't know what Gas Town is, but Claude Code Agent Teams is what I was doing for a while now. You use your main conversation only to spawn sub agents to plan and execute, allowing you to work for a long time without losing context or compacting, because all token-heavy work is done by sub agents in their own context. Claude Code Agent Teams just streamlines this workflow as far as I can tell.
yeah, seems like a much simpler design though (i.e. only seems like one 'special/leader' agent, and the rest are all workers vs gastown having something like 8 different roles mayor, polecat, witnesses, etc).
Wonder how they compare?
i would have to imagine the gastown design isn't optimal though? why 8, and why does there need to be multiple hops of agent communications before two arbitrary agents communicate with each other as opposed to single shared filespace?
I've been using Gas Town a decent bit since it was released. I'd agree with you that its design is sub-optimal, but I believe that's more due to the way the actual agents/harnesses have been designed as opposed to optimal software design. The problem you often run into is that agents will sometimes hang thinking they need human input for a problem they are on, or they think they're at a natural stopping point. If you're trying to do fully orchestrated agentic coding where you don't look at the code at all (putting aside whether that's good or not for a second) then this is sub-optimal behavior, and so these extra roles have been designed to 'keep the machine going' as it were.
Often times if I'm only working on a single project or focus, then I'm not using most of those roles at all and it's as you describe, one agent divvying out tasks to other agents and compiling reports about them. But due to the fact that my velocity with this type of coding is now based on how fast I can tell that agent what I want, I'm often working on 3 or 4 projects simultaneously, and Gas Town provides the perfect orchestration framework for doing this.
the problem with gastown is it tries to use agents for supervision when it should be possible to use much simpler and deterministic approaches to supervision, and also being a lot more token efficient
I strongly believe we will need both agentic and deterministic approaches. Agentic to catch edge cases & the like, deterministic as those problems (along with the simpler ones early on) are continually turned into hard coded solutions to the maximum extent possible.
Ideally you could eventually remove the agentic supervisor. But for some cases you would want to keep it around, or at least a smaller model which suffices.
yegge's article does come off as complicated design for the sake of complication
Yeah but worse
No polecats smh
>"Seems similar to Gas Town"
I love that we are in this world where the crazy mad scientists are out there showing the way that the rest of us will end up at, but ahead of time and a bit rough around the edges, because all of this is so new and unprecedented. Watching these wholly new abstractions be discovered and converged upon in real time is the most exciting thing I've seen in my career.
The action is hot, no doubt. This reminds me of Spacewar! -> Galaxy Game / Computer Space.
I absolutely cannot trust Claude code to independently work on large tasks. Maybe other people work on software that's not significantly complex, but for me to maintain code quality I need to guide more of the design process. Teams of agents just sounds like adding a lot more review and refactoring that can just be avoided by going slower and thinking carefully about the problem.
You write a generic architecture document on how you want your code base to be organized, when to use pattern x vs pattern y, examples of what that looks like in your code base, and you encode this as a skill.
Then, in your prompt you tell it the task you want, then you say, supervise the implementation with a sub agent that follows the architecture skill. Evaluate any proposed changes.
There are people who maximize this, and this is how you get things like teams. You make agents for planning, design, qa, product, engineering, review, release management, etc. and you get them to operate and coordinate to produce an outcome.
That's what this is supposed to be, encoded as a feature instead of a best practice.
Aren't you just moving the problem a little bit further? If you can't trust it will implement carefully specified features, why would you believe it would properly review those?
It's hard to explain, but I've found LLMs to be significantly better in the "review" stage than the implementation stage.
So the LLM will do something and not catch at all that it did it badly. But the same LLM asked to review against the same starting requirement will catch the problem almost always
The missing thing in these tools is that automatic feedback loop between the two LLMs: one in review mode, one in implementation mode.
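For what it's worth, that feedback loop can be sketched in a few lines. This is only an illustration: `implement` and `review` are hypothetical callables standing in for the two LLM calls (not any real agent API), and the toy versions below just simulate a forgotten requirement being caught on review.

```python
from typing import Callable, List, Tuple


def review_loop(
    requirement: str,
    implement: Callable[[str, List[str]], str],
    review: Callable[[str, str], List[str]],
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Alternate implementation and review until the reviewer has no findings."""
    findings: List[str] = []
    draft = ""
    for _ in range(max_rounds):
        # The implementer sees the requirement plus the last round's findings.
        draft = implement(requirement, findings)
        # The reviewer checks the draft against the same starting requirement.
        findings = review(requirement, draft)
        if not findings:
            break
    return draft, findings


# Toy stand-ins for the two LLM calls (hypothetical, for illustration only).
def toy_implement(requirement: str, findings: List[str]) -> str:
    # The first pass "forgets" error handling; it is added once a finding exists.
    return "parse + handle errors" if findings else "parse"


def toy_review(requirement: str, draft: str) -> List[str]:
    return [] if "handle errors" in draft else ["missing error handling"]


draft, open_findings = review_loop(
    "parse input and handle errors", toy_implement, toy_review
)
```

The `max_rounds` cap is there because nothing guarantees the two sides converge, and every extra round costs tokens.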
I've noticed this too and am wondering why this hasn't been baked into the popular agents yet. Or maybe it has and it just hasn't panned out?
AFAICT this is already baked into the GitHub Copilot agent. I read its sessions pretty often and reviewing/testing after writing code is a standard part of its workflow almost every time. It's kind of wild seeing how diligent it is even with the most trivial of changes.
Anecdotally I think this is in Claude Code. It's pretty frequent to see it implement something, then declare it "forgot" a requirement and go back and alter or add to the implementation.
You have to dump the context window for the review to work well.
How does this not use up tokens incredibly fast though? I have a Pro subscription and bang up against the limits pretty regularly.
It _does_ use up tokens incredibly fast, which is probably why Anthropic is developing this feature. This is mostly for corporations using the API, not individuals on a plan.
I'd love to see a breakdown of the token consumption of inaccurate/errored/unused task branches for claude code and codex. It seems like a great revenue source for the model providers.
Yeah, that's what I was thinking. They do have an incentive to not get everything right on the first try, as long as they don't overdo it... I also feel like they try to get more token usage by asking unnecessary follow-up questions that the user may say yes to, etc.
I had to go to Max, Pro is more like a taster.
At work tho we use Claude Code thru a proxy that uses the model hosted on AWS bedrock. It’s slower than consumer direct-to-Anthropic and you have to wait a bit for the latest models (Opus 4.5 took a while to get), but if our stats are to be believed it’s much much cheaper.
I don't know, all I can say is with API-based billing, doing multi-thousand line refactors that would take days to do costs like $4. In terms of value : effort, it's incredible.
It does use tokens faster, yes.
I agree, but I've found that making an "adversarial" model within claude helps with the quality a lot. One agent makes the change, the other picks holes in it, and cycle. In the end, I'm left with less to review.
This sounds more like an automation of that idea than just N-times the work.
Glad I'm not the only one. I do the same, but I tend to have gemini be the one that critiques.
Do you do this manually? Or some abstraction above that? skills, some light orchestration, etc?
I just tell it to do so, but you could even add that as a requirement to CLAUDE.md
Exactly, one out of three or four prompts requires tuning, nudging or just stopping it. However it takes seniority to see where it goes astray. I suspect that lots of folks don't even notice that CC is off. It works, it passes the tests, so it is good.
Humans can't handle large tasks either, which is why you break them into manageable chunks.
Just ask claude to write a plan and review/edit it yourself. Add success criteria/tests for better results.
There is research[0] currently being done on how to divide tasks among LLMs and combine their answers. This approach allows LLMs to reach outcomes (solving a problem that requires 1 million steps) which would be impossible otherwise.
All they did was prompt an LLM over and over again to execute one iteration of a towers of hanoi algorithm. Literally just using it as a glorified scripting language:
```
Rules:
- Only one disk can be moved at a time.
- Only the top disk from any stack can be moved.
- A larger disk may not be placed on top of a smaller disk.
For all moves, follow the standard Tower of Hanoi procedure: If the previous move did not move disk 1, move disk 1 clockwise one peg (0 -> 1 -> 2 -> 0).
If the previous move did move disk 1, make the only legal move that does not involve moving disk1.
Use these clear steps to find the next move given the previous move and current state.
Previous move: {previous_move} Current State: {current_state} Based on the previous move and current state, find the single next move that follows the procedure and the resulting next state.
```
This is buried down in the appendix while the main paper is full of agentic swarms this and millions of agents that and plenty of fancy math symbols and graphs. Maybe there is more to it, but the fact that they decided to publish with such a trivial task which could be much more easily accomplished by having an llm write a simple python script is concerning.
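To make the "simple python script" point concrete, here is a minimal sketch that follows the same two rules from the prompt above deterministically. The function name and the list-based peg representation are my own choices, not anything from the paper.

```python
from typing import List, Tuple


def solve_hanoi(n: int) -> Tuple[List[Tuple[int, int, int]], List[List[int]]]:
    """Return (moves, pegs); each move is (disk, from_peg, to_peg)."""
    # Each peg is a stack; the end of the list is the top disk.
    pegs: List[List[int]] = [list(range(n, 0, -1)), [], []]
    moves: List[Tuple[int, int, int]] = []
    moved_disk_1 = False  # nothing has moved yet, so disk 1 goes first

    # Stop once the whole tower sits on a peg other than the starting one.
    while not any(len(p) == n for p in pegs[1:]):
        if not moved_disk_1:
            # Rule 1: move disk 1 clockwise one peg (0 -> 1 -> 2 -> 0).
            src = next(i for i, p in enumerate(pegs) if p and p[-1] == 1)
            dst = (src + 1) % 3
        else:
            # Rule 2: the only legal move that does not involve disk 1 is
            # between the other two pegs, smaller top disk onto the other.
            a, b = [i for i, p in enumerate(pegs) if not (p and p[-1] == 1)]
            if not pegs[a] or (pegs[b] and pegs[b][-1] < pegs[a][-1]):
                src, dst = b, a
            else:
                src, dst = a, b
        disk = pegs[src].pop()
        pegs[dst].append(disk)
        moves.append((disk, src, dst))
        moved_disk_1 = disk == 1
    return moves, pegs


moves, pegs = solve_hanoi(3)
```

For 3 disks this produces 7 moves. In general it takes 2^n - 1 moves, so a 20-disk run is the roughly one-million-step problem, with no model calls at all.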
Good lord, I can only imagine the wasted electricity.
No offense to the academic profession, but they're not a good source of advice for best practices in commercial software development. They don't have the experience or the knowledge sufficient to understand my workplace and tasks. Their skill set and job is orthogonal to the corporate world.
Yes, the problem solved in the paper (Tower of Hanoi) is far more easily defined than 99% of actual problems you would find in commercial software development. Still proof of "theoretically possible" and seems like an interesting area of research.
You definitely have to create some sort of PLAN.md and PROGRESS.md via a command and an implement command that delegates work. That is the only way that I can get bigger things done no matter how „good“ their task feature is.
You run out of context so quickly and if you don’t have some kind of persistent guidance things go south
It's not sufficient, especially if I am not learning about the problem by being part of the implementation process. The models are still very weak reasoners, writing code faster doesn't accelerate my understanding of the code the model wrote. Even with clear specs I am constantly fighting with it duplicating methods, writing ineffective tests, or implementing unnecessarily complex solutions. AI just isn't a better engineer than me, and that makes it a weak development partner.
>AI just isn't a better engineer than me, and that makes it a weak development partner.
This would also be true of Junior Engineers. Do you find them impossible to work with as well?
I tried doing that and it didn't work. It still adds "fallbacks" that just hide errors or the fact that there is no actual implementation and "In a real app, we would do X, just return null for now"
you need a reviewer agent for every step of the process - review the plan generated by the planner, the update made by the task worker subagent, and a final reviewer once all tasks are done.
this does eat up tokens _very_ quickly though :(
Are people using Claude max 20x plan for personal pet projects? Are these expensed? Have you liquidated all other hobbies to fund this? Asking for a friend.
With stuff like this, might be that all the infra build-out is insufficient. Inference demand will go up like crazy.
Unlocking the next order of magnitude of software inefficiency!
Though I do hope the generated code will end up being better than what we have right now. It mustn't get much worse. Can't afford all that RAM.
Dunno, it's probably less energy efficient than a human brain, but being able to turn electricity into intelligence is pretty amazing. RAM and power generation are engineering problems to be solved for civilization to benefit from this.
It'd be nice if CC could figure out all the required permissions upfront and then let you queue the job to run overnight
Anyone paying attention has known that demand for all types of compute that can run LLMs (i.e. GPUs, TPUs, hell even CPUs) was about to blow up, and will remain extremely large for years to come.
It's just HN that's full of "I hate AI" or wrong contrarian types who refuse to acknowledge this. They will fail to reap what they didn't sow and will starve in this brave new world.
Agreed, agent scaling and orchestration indicates that demand for compute is going to blow up, if it hasn't already. The rationale for building all those datacenters they can't build fast enough is finally making sense.
This reads like a weird cult-ish revenge fantasy.
And what about you? Show your "I used AI today" badge, right now!
Oh yeah I mean if you're a webdev and you haven't built several data centres already you're basically asking to be homeless.
I’m looking for something like this, with opus in the driver seat, but the subagents should be using different LLMs, such as Gemini or Codex. Anyone know of such a tool? just-every/code almost does this, but the lead/orchestrator is always codex, which feels too slow compared to opus or Gemini.
I use opus for coding and codex for reviews. I trigger the reviews in each work task with a review skill that calls out to codex[0]
I don't need anything more complicated than that and it works fine - also run greptile[1] on PR's
These two basically do what you want, let Claude be the manager and Codex/Gemini be the worker. Many say that Coder-Codex-Gemini is easier to understand than CCG-Workflow, which has too many commands to start with.
https://github.com/FredericMN/Coder-Codex-Gemini https://github.com/fengshao1227/ccg-workflow
This one also seems promising, but I haven't tried it yet.
https://github.com/bfly123/claude_code_bridge
All of them are made by Chinese devs. I know some people are hesitant when they see Chinese products, so I'll address that first. But I have tried all of them, and they have all been great.
You can accomplish this with https://github.com/AgentWorkforce/relay and make the Lead/Orchestrator any harness you want. At the core agent-relay is agent to agent communication but it unlocks quite a few multi agent orchestration paradigms. I wrote about some learnings here as well https://x.com/khaliqgant/status/2019124627860050109?s=46
I think this is where future cursor features will be great - to coordinate across many different model providers depending on the sub-jobs to be done
What I want is something else: I want them to work in parallel on the same problem, and the orchestrator to then evaluate and consolidate their responses. I’m currently doing this manually, but it’s tedious.
At Augment we've been working on this. Multi-agent orchestration, spec driven, different models for different tasks, etc.
https://www.augmentcode.com/product/intent
can use the code AUGGIE to skip the queue. Bring your own agent (powered by codex, CC, etc) coming to it next week.
You can run an ensemble of LLMs (Opus, Gemini, Codex) in Claude Code Router via OpenRouter or any Agent CLI that supports Subagents and not tied to a single LLM like Opencode. I have an example of this in Pied-Piper, a subagent orchestrator that runs in Claude Code or ClaudeCodeRouter and uses distinct model/roles for each Subagent:
1. GPT-5.2 Codex Max for planning
2. Opus 4.5 for implementation
3. Gemini for reviews
It’s easy to swap models or change responsibilities. Doc and steps here: https://github.com/sathish316/pied-piper/blob/main/docs/play...
This sounds very promising. Using multiple CC instances (or mix of CLI-agents) across tmux panes has always been a workflow of mine, where agents can use the tmux-cli [1] skill/tool to delegate/collaborate with others, or review/debug/validate each others work.
This new orchestration feature makes it much more useful since they share a common task list and the main agent coordinates across them.
[1] https://github.com/pchalasani/claude-code-tools?tab=readme-o...
Yeah, I've been using your tools for a while. They've been nice.
I was working on my own alternative to Beads... then I realized I could do exactly this with something similar to Beads. I'm planning on open sourcing it soon because I like what I have so far. I also made it so I can sync my tasks directly to my GitHub projects as well. I think it's more useful to have agent tasks eventually synced back up to real ticketing systems for historical reasons. Besides, it's better to have alternatives that are agent agnostic.
Been using these types of flows across agent harnesses for a while. Check out https://github.com/tmc/it2
I personally have no use for this type of workflow. I like parallel claude code instances in worktrees but nothing beyond that
Am not a fan of dealing with worktrees. Maybe for larger longer lived tasks, but the time spent on merges from different agents is definitely a big headwind for parallel work.
This seems handled by this new agent which is cool.
I gave up on worktrees and hacked together a solution with fine-grained lockfiles for editing, running builds, etc that worked surprisingly well for what it was
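A rough sketch of what a lockfile like that could look like, for anyone curious. This is a guess at the general technique, not the commenter's actual setup: the `file_lock` helper, the lock path and the agent names are all made up for illustration.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def file_lock(lock_path: str, owner: str):
    """Hold an exclusive lock by atomically creating lock_path."""
    # O_CREAT | O_EXCL makes creation fail with FileExistsError if the
    # file already exists, so only one agent can hold the lock at a time.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    with os.fdopen(fd, "w") as f:
        f.write(owner)  # record who holds the lock, for debugging
    try:
        yield
    finally:
        os.remove(lock_path)  # release the lock on exit


# Demo: a second agent is refused while the first one holds the build lock.
lock_dir = tempfile.mkdtemp()
build_lock = os.path.join(lock_dir, "build.lock")
second_agent_blocked = False
with file_lock(build_lock, "agent-a"):
    try:
        with file_lock(build_lock, "agent-b"):
            pass
    except FileExistsError:
        second_agent_blocked = True
released = not os.path.exists(build_lock)
```

One lock per file or per build step keeps it fine-grained. The obvious weak spot is a crashed agent leaving a stale lock behind, so a real version would probably want a timeout or a liveness check on the recorded owner.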
I just built a quick plugin to automatically add agents & skills then fire off a team with them, depending on your task: https://github.com/drbscl/dream-team
Been waiting for this to drop and excited to test it out. We've been building something in this space - https://github.com/AgentWorkforce/relay, a real-time messaging layer that lets AI coding agents talk to each other across any CLI.
Assign roles to different models and have them coordinate: Claude as the lead, Codex on backend, Gemini on frontend, etc.
I wrote about my experiences with multi-agent orchestration here: https://x.com/khaliqgant/status/2019124627860050109?s=46
Subagents are out, put it all on agent teams!
something i really like from trying it out over the last 10 minutes is that the main agent will continue talking to you while other agents are working, so you don't have to queue a message
Clean up the team
Claude Town
Excited to try this out. I've seen a lot of working systems on my own computer that share files to talk between different Claude Code agents and I think this could work similarly to that.
(i thought gas town was satire? people in comments here seem to be saying that gas town also had multi-agent file sharing for work tracking)
A cynical read of this is that it’s all a ploy to maximize usage.
Why do agents need to speak to each other if they’re just doing the work correctly the first time?
Is it an admission that a single agent is not useful and reliable enough?
I run a loop where I have 4 agents review in parallel after each implementation phase. It just increases the odds of finding issues.
I've switched this over to a team of 4 now that talk to each other to discuss issues they find and it's amazing. They confirm between themselves and if they wrongly identified something the others correct them.
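Roughly, the parallel version of that loop can be sketched like this. The reviewers here are hypothetical callables standing in for agent calls, and the "confirm between themselves" step is simplified to a plain vote count, which is my assumption and much cruder than agents actually discussing findings.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Reviewer = Callable[[str], List[str]]


def parallel_review(
    diff: str, reviewers: List[Reviewer], min_votes: int = 2
) -> Dict[str, List[str]]:
    """Run all reviewers concurrently and split findings by agreement."""
    with ThreadPoolExecutor(max_workers=len(reviewers)) as pool:
        results = list(pool.map(lambda r: r(diff), reviewers))
    # Count each finding at most once per reviewer.
    votes = Counter(f for findings in results for f in set(findings))
    confirmed = sorted(f for f, n in votes.items() if n >= min_votes)
    disputed = sorted(f for f, n in votes.items() if n < min_votes)
    return {"confirmed": confirmed, "disputed": disputed}


# Four toy reviewers (hypothetical stand-ins for four review agents).
reviewers = [
    lambda d: ["unchecked null"],
    lambda d: ["unchecked null", "slow loop"],
    lambda d: [],
    lambda d: ["unchecked null"],
]
report = parallel_review("...diff text...", reviewers)
```

Findings only one reviewer raised land in `disputed` rather than being dropped, since a lone reviewer can still be right.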
Gas Town decimated by Claude bomb from orbit
"finish Claude tokens quota in 3 minutes, largely over delegation and result messages instead of code writing"
I find it amusing that the innovation in this space for the past year+ has been mostly centered around engineering: MCP, "agents", "skills", etc. Now "agent" orchestration is the new hotness.
Meanwhile, the same issues that have plagued these tools since their inception are largely ignored: hallucination, inaccuracy, context collapse, etc. These won't be solved by engineering, but by new research and foundational improvements.
On one hand, solid engineering was sorely needed, and can extract a lot of value from the current tech. But on the other, all these announcements and improvements feel like companies grasping at straws to keep the hype cycle going by any means necessary. Charts must go up and to the right, or investors get antsy.
It's all adding to the mountain of signs that suggest that this isn't the path to artificial intelligence. It's interesting tech, with possibly many valuable applications, but the "AI" narrative is frankly tiring. I wish I could fast forward on this speculative phase, go past the inevitable crash, and arrive at a timeframe where we've figured out what this tech is actually good for, and where we hopefully use it more for good than evil.
Any self respecting engineer should recognize that these tools and models only serve to lower the value of your labor. They aren't there to empower you, they aren't going to enable you to join the ruling class with some vibe-rolled slop SaaS.
Using these things will fry your brain's ability to think through hard solutions. It will give you a disease we haven't even named yet. Your brain will atrophy. Do you want your competency to be correlated 1:1 to the quality and quantity of tokens you can afford (or be loaned!!)?
Their main purpose is to convince C-suite suits that they don't need you, or they should be justified in paying you less. This will of course backfire on them, but in the meantime, why give them the training data, why give them the revenue??
I'd bet anything these new models / agentic-tools are designed to optimize for token consumption. They need the revenue BADLY. These companies are valued at 200 X Revenue.. Google IPO'd at 10-11 x lmfao . Wtf are we even doing? Can't wait to watch it crash and burn :) Soon!
People often compare working with AI agents to being something like a project manager.
I've been a project manager for years. I still work on some code myself, but most of it is done by the rest of the team.
On one hand, I have more bandwidth to think about how the overall application is serving the users, how the various pieces of the application fit together, overall consistency, etc. I think this is a useful role.
On the other hand, I definitely have felt mental atrophy from not working in the code. I still think; I still do things and write things and make decisions. But I feel mentally out of shape; I lack a certain sharpness that I perceived when I was more directly in tune with the code.
And I'm talking, all orthogonal to AI. This is just me as a project manager with other humans on the project.
I think there is truth to, well, operate at a higher level! Be more systems-minded, architecture-minded, etc. I think that's true. And there are surely interesting new problems to solve if we can work not on the level of writing programs, but wielding tools that write programs for us.
But I think there's also truth to the risk of losing something by giving up coding. Whether what might be lost is important to you is your own decision, but I think the risk is real.
I do think there’s a real risk of Brain Atrophy when you rely on AI coding tools for everything and while learning something new. About a year ago, I dealt with this problem by using Neovim and having shortcuts like below to easily toggle GitHub Copilot on/off. Now that AI is baked into almost every part of the toolchain in VSCode, Cursor, ClaudeCode, Intellij, I don't know how the newer engineers will learn without AI assistance.
I think in-line autocomplete is likely not that dangerous, if it's used in this manner responsibly, it's the large agentic tools that are problematic for your brain imo. But in-line autocompletes aren't going to raise billions of dollars and aren't flashy.
I'd say autocomplete introduces a certain level of fuzziness into the code we work with, though to a lower degree. I used autocomplete for over a year, and initially it did feel like a productivity boost, yet when I later stopped using them, it never felt like my productivity decreased. I stopped because something about losing explicit intent of my code feels uncomfortable to me.
It's very difficult to operate effectively at a higher level for a continued period of time without periodically getting back into the lower levels to try new things and learn new approaches or tools.
That doesn't even have to be writing a ton of code, but reading the code, getting intimately familiar with the metrics, querying the logs, etc.
I definitely think what you're losing is extremely important, and can't be compensated with LLMs once it's gone.
Back when automatic piano players came out, if all the world's best piano players had stopped playing and mostly just composed/wrote music instead, would the quality of the music have increased or decreased? I think the latter.
From an economic standpoint this is basically machines doing work humans used to do. We’ve already gone through this many times. We built machines that can make stuff orders of magnitude faster than humans, and nobody really argues we should preserve obsolete tools and techniques as a valued human craft. Obviously automation messes with jobs and identity for some people, but historically a large chunk of human labor just gets automated as the tech gets better. So I feel that arguing about whether automation is good or bad in the abstract is a bit beside the point. The more interesting question imho is how people and companies adapt to it, because it’s probably going to happen either way.
I had to create a new account, because HN is protecting their investments and basically making it impossible to post for anyone who is critical of LLMs (said I was crawling, I'm on a dedicated proxy that definitely hasn't ever crawled HN lol).
Automation can be good overall for society, but you also can't ignore the fact that basically all automation has decreased the value of the labor it replaced or subsidized.
This automation isn't necessarily adding value to society. I don't see any software being built that's increasing the quality of people's life, I don't see research being accelerated. There is no economic data to support this either. The economic gains are only reflected in the values of companies who are selling tokens, or have been able to decrease their employee-counts with token allowances.
All I see is people sharing CRUD apps on twitter, 50 clones of the same SaaS, people constantly complaining about how their favorite software/OS has more bugs, the cost of hardware and electricity going up and people literally going into psychosis. (I have a list of 70+ people on twitter that I've been adding to that are literally manic and borderline insane because of these tools). I can see LLMs being genuinely useful to society, like helping the blind and disabled in real time, but no one is doing that! It doesn't make money, automation is for the capital-owning class, not for the working class.
But hey, at least your favorite LLM shill from that podcast you loved can afford the $20,000/night resort this summer...
I'd be more okay with these mostly useless automation tools if the models were open source and didn't require $500k to run locally, but until then they basically only serve to make existing billionaires pad unnecessary zeros onto their net worth, and help prevent anyone from catching up with them.
I recommend people read this essay by Thomas Pynchon, actually read it, don't judge it by the title: https://www.nytimes.com/1984/10/28/books/is-it-ok-to-be-a-lu...
Of course it's to save businesses money (and not to empower programmers)! Software engineers for years automated jobs of other people, but when it's SEs that are getting automated, suddenly progress becomes bad?
So because those people didn't defend their livelihoods we shouldn't either?
I'd say there are very few jobs that SWEs automated away outside of SOME data entry; SWEs built abstractions on top of existing processes. LLM companies want to abstract away the human entirely.
The crash and burn can't come soon enough.
When I use Google maps, I learn faster.
And I haven't had to solve really hard problems for ages.
Some people will have problems some will not.
Future will tell.
Honestly my job is to ensure code quality and to protect the customer. I love working with claude code, it makes my life easier, but in no way would a team of agents improve code quality or speed up development. I would spend far too much time reviewing and fixing laziness and bad design decisions.
When you hear execs talking about AI, it's like listening to someone talk about how they bought some magic beans that will solve all their problems. IMO the only thing we have managed to do is spend a lot more money on accelerated compute.
It would be tragically ironic if this post is AI generated.
I agree on all parts. I do not understand why anyone in the software industry would bend over backwards to show their work is worth less now.
>I'd bet anything these new models / agentic-tools are designed to optimize for token consumption.
You would think, but Claude Code has gotten incredibly more efficient over time. They are doing so much dogfooding with these things at this point that it makes more sense to optimize.
How Butlerian of you.
Shaking fist at clouds!!
Wow, a bunch of NFT people used to say the same thing.
lmao, please explain to me why these companies should be valued at 200x revenue.. They are providing autocomplete APIs.
How come Google's valuation hasn't increased 100-200x? They provide foundation models + a ton more services as well and are profitable. None of this makes sense, it's destined to fail.
I like your name, it suggests you're here for a good debate.
Let me start by conceding on the company value front; they should not have such value. I will also concede that these models lower your value of labor and quality of craft.
But what they give in return is the ability to scale your engineering impact to new highs - Talented engineers know which implementation patterns work better, how to build debuggable and growable systems. While each file in the code may be "worse" (by whichever metric you choose), the final product has more scope and faster delivery. You can likewise choose to hone in the scope and increase quality, if that's your angle.
LLMs aren't a blanket improvement - They come with tradeoffs.
(I had to create a new account, because HN doesn't like LLM haters (don't mess with the bag ig))
the em dashes in your reply scare me, but I'll assume you're a real person lol.
I think your opinion is valid, but tell that to the C Suite who've laid off 400k tech workers in the last 16 months in the USA. These tools don't seem to be used to empower high quality engineering, only to naively increase the bottom line by decreasing the number of engineers, and increasing workloads on those remaining.
Full disclosure, I haven't been laid off ever, but I see what's happening. I think when the trade-off is that your labor is worth a fraction of what it used to be and you're also expected to produce more, then that trade-off isn't worth it.
It would be a lot different if the signaling from business leaders was the reverse. If they believed these tools empowered labor's impact to a business, and planned on rewarding on that, it would be a different story. That's not what we are seeing, and they are very open about their plans for the future of our profession.
Automation can be good overall for society, but you also can't ignore the fact that basically all automation has decreased the value of the labor it replaced or subsidized.
This automation isn't necessarily adding value to society. I don't see any software being built that's increasing the quality of people's life, I don't see research being accelerated. There is no economic data to support this either. The economic gains are only reflected in the values of companies who are selling tokens, or have been able to decrease their employee-counts with token allowances.
All I see is people sharing CRUD apps on twitter, 50 clones of the same SaaS, people constantly complaining about how their favorite software/OS has more bugs, the cost of hardware and electricity going up and people literally going into psychosis. (I have a list of 70+ people on twitter that I've been adding to that are literally manic and borderline insane because of these tools).
But hey, at least your favorite AI evangelist from that podcast you loved can afford the $20,000/night resort this summer...
Google is valued at 4T. Up from 1.2T in 2022.
it's too late to hate AI!
i hate Ai too
username checks out
> Any self respecting engineer should recognize that these tools and models only serve to lower the value of your labor.
Depends on what the aim of your labor is. Is it typing on a keyboard, memorizing (or looking up) whether that function was verb_noun() or noun_verb(), etc? Then, yeah, these tools will lower your value. If your aim is to get things done, and generate value, then no, I don't think these tools will lower your value.
This isn't all that different from CNC machining. A CNC machinist can generate a whole lot more value than someone manually jogging X/Y/Z axes on an old manual mill. If you absolutely love spinning handwheels, then it sucks to be you. CNC definitely didn't lower the value of my brother's labor -- there's no way he'd be able to manually machine enough of his product (https://www.trtvault.com/) to support himself and his family.
> Using these things will fry your brain's ability to think through hard solutions.
CNC hasn't made machinists forget about basic principles, like when to use conventional vs climb milling, speeds and feeds, or whatever. Same thing with AI. Same thing with induction cooktops. Same thing with any tool. Lazy, incompetent people will do lazy, incompetent things with whatever they are given. Yes, an idiot with a power tool is dangerous, as that tool magnifies and accelerates the messes they were already destined to make. But that doesn't make power tools intrinsically bad.
> Do you want your competency to be correlated 1:1 to the quality and quantity of tokens you can afford (or be loaned!!)?
We are already dependent on electricity. If the power goes out, we work around that as best as we can. If you can't run your power tool, but you absolutely need to make progress on whatever it is you're working on, then you pick up a hand tool. If you're using AI and it stops working for whatever reason, you simply continue without it.
I really dislike this anti-AI rhetoric. Not because I want to advocate for AI, but because it distracts from the real issue: if your work is crap, that's on you. Blaming a category of tool as inherently bad (with guaranteed bad results) suggests that there are tools that are inherently good (with guaranteed good results). No. That's absolutely incorrect. It is people who fall on the spectrum of mediocrity-to-greatness, and the tools merely help or hinder them. If someone uses AI and generates a bunch of slop, the focus should be on that person's ineptitude and/or poor judgement.
We'd all be a lot better off if we held each other to higher standards, rather than complaining about tools as a way to signal superiority.
Your brother's livelihood is not safe from AI, nor is any other livelihood. A small slice of lucky, smart, well-placed, protected individuals will benefit from AI, and I presume many unlucky people with substantial disabilities or living in poverty will benefit as well. Technology seems to continue to improve the outcomes at the very top and very bottom, while sacrificing the biggest group in the middle. Many HN Software Engineers here immensely benefitted from Big Tech over the past 15 years -- they were a part of that lucky privileged group winning 300k+ USD salaries plus equity for a long time. AI has completely disrupted this space and drastically decreased the value of their work, and it largely did this by stealing open source code for training data. These Software Engineers are right to feel upset and threatened and oppose these AI tools, since they are their replacement. I believe that is why you see so much AI hate in HN.
I'm not trying to signal superiority, I'm legitimately worried about the value of my livelihood and skills I'm passionate about. What if McDonalds went around telling chefs that they're cooking wrong, that there's no reason to cook food in a traditional manner when you can increase profit and speed with their methods?
It would be insulting, they'd get screamed out of the kitchen. Now imagine they're telling those chefs they're going to enforce those methods on them regardless whether they like it or not.
It's not like he was the only one who came up with this idea. I built something like that without knowing about GasTown or Beads. It's just an obvious next step
https://github.com/mohsen1/claude-code-orchestrator
I also share your confusion about him somehow managing to dominate credit in this space, when it doesn't even seem like Gastown ended up being very effective as a tool relative to its insane token usage. Everyone who's used an agentic tool for longer than a day will have had the natural desire for them to communicate and coordinate across context windows effectively. I'm guessing he just wrote the punchiest article about it and left an impression on people who had hitherto been ignoring the space entirely.
It was a fun article!
Exactly! I built something similar. These are such low hanging fruit ideas that no one company/person should be credited for coming up with them.
Seriously, I thought that was what langchain was for back in 2023.
Seriously, what is langchain? It’s so completely useless. Clearly none of the new agents care about it or need it. Irrelevant.
Agree, langchain was useless then and completely irrelevant now, but the idea that we need to orchestrate different LLM loops is extremely obvious.
> what is langchain?
an incantation you put on your resume to double your salary for a few months before the company you jumped ship to gets obsoleted by the foundational model
Compare both approaches to mature actor frameworks and they don’t seem to be breaking much new ground. These kinds of supervisor trees and hierarchies aren’t new for actor-based systems and they’re obvious applications of LLM agents working in concert.
The fact that Anthropic and OpenAI have been going on this long without such orchestration, considering the unavoidable issues of context windows and unreliable self-validation, without matching the basic system maturity you get from a default Akka installation shows us that these leading LLM providers (with more money, tokens, deals, access, and better employees than any of us), are learning in real time. Big chunks of the next gen hype machine wunder-agents are fully realizable with cron and basic actor based scripting. Deterministically, write once run forever, no subscription needed.
Kubernetes for agents is, speaking as a krappy kubernetes admin, not some leap, it’s how I’ve been wiring my local doom-coding agents together. I have a hypothesis that people at Google (who are pretty ok with kubernetes and maybe some LLM stuff), have been there for a minute too.
Good to see them building this out, excited to see whether LLM cluster failures multiply (like repeating bad photocopies), or nullify (“sorry Dave, but we’re not going to help build another Facebook, we’re not supposed to harm humanity and also PHP, so… no.”).
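For anyone who hasn't used an actor framework, the supervisor idea mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Akka, Erlang, or any vendor's agent tooling implements it: `supervise` and `make_flaky_agent` are hypothetical names, and the "agent" is a plain function standing in for an LLM run.

```python
import time
from typing import Callable


def supervise(worker: Callable[[], str], max_restarts: int = 3,
              backoff_s: float = 0.0) -> str:
    """Run `worker`, restarting it on failure up to `max_restarts` times.

    Hypothetical helper: a one-for-one restart strategy in miniature.
    """
    restarts = 0
    while True:
        try:
            return worker()
        except Exception as exc:
            # Give up (escalate to whoever called us) once the budget is spent.
            if restarts >= max_restarts:
                raise RuntimeError(
                    f"worker failed after {restarts} restarts") from exc
            restarts += 1
            time.sleep(backoff_s)


def make_flaky_agent(failures: int) -> Callable[[], str]:
    """Build a stand-in agent that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def agent() -> str:
        state["calls"] += 1
        if state["calls"] <= failures:
            raise RuntimeError("agent run failed")
        return f"done after {state['calls']} attempts"

    return agent
```

Calling `supervise(make_flaky_agent(2))` retries past the two failures; a worker that keeps failing exhausts the restart budget and the error is raised to the caller, which is roughly where a parent supervisor would take over in a real tree.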
If it was so obvious and easy, why didn't we have this a year ago? Models were mature enough back then to make this work.
The high level idea is obvious but doing it is not easy. "Maybe agents should work in teams like humans with different roles and responsibilities and be optimized for those" isn't exactly mind bending. I experimented with it too when LLM coding became a thing.
As usual, the hard part is the actual doing and producing a usable product.
Orchestration definitely wasn't possible a year ago. The only tool that even produced decent results that far back was Aider, which wasn't fully agentic and didn't really shine until Gemini 2.5 03-25.
The truth is that people are doing experiments on most of this stuff, and a lot of them are even writing about it, but most of the time you don't see that writing (or the projects that get made) unless someone with an audience already (like Steve Yegge) makes it.
Roo Code in VSCode was working fine a year ago, even back in November 2024 with Sonnet 3.5 or 3.7
Because gathering training data and doing post-training takes time. I agree with OP that this is the obvious next step given context length limitations. Humans work the same way in organizations, you have different people specializing in different things because everyone has a limited "context length".
Because they are not good engineers [1]
Also, because they are stuck in a language and an ecosystem that cannot reliably build supervisors, hierarchies of processes etc. You need Erlang/Elixir for that. Or similar implementations like Akka that they mention.
[1] Yes, they claim their AI-written slop in Claude Code is "a tiny game engine" that takes 16ms to output a couple hundred characters on screen: https://x.com/trq212/status/2014051501786931427
what mature actor frameworks do you recommend?
They did mention Akka in their post, so I would assume that's one of them.
Elixir/Erlang. It's table stakes for them.
There seems to be a lot of convergent evolution happening in the space. Days before the gas town hype hit, I made a (less baroque, less manic) "agent team" setup: a shell script to kick off a ralph wiggum loop, and CLAUDE-MESSAGE-BUS.md for inter-ralph communication (Thread safety was hacked into this with a .claude.lock file).
The main claude instance is instructed to launch as many ralph loops as it wants, in screen sessions. It is told to sleep for a certain amount of time to periodically keep track of their progress.
It worked reasonably well, but I don't prefer this way of working... yet. Right now I can't write spec (or meta-spec) files quick enough to saturate the agent loops, and I can't QA their output well enough... mostly a me thing, i guess?
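The message-bus setup described above (a shared CLAUDE-MESSAGE-BUS.md guarded by a .claude.lock file) can be sketched roughly like this in Python. The original was a shell script that isn't shown, so everything here is an assumption: the function names `file_lock`, `post_message`, and `read_messages` are hypothetical, as is the one-line-per-message format. The lock relies on exclusive file creation being atomic, which is the usual lock-file trick.

```python
import os
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def file_lock(lock_path: Path, timeout_s: float = 5.0, poll_s: float = 0.05):
    """Hold a lock file for the duration of the `with` block (hypothetical helper)."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # O_EXCL makes creation fail if another process holds the lock.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll_s)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)


def post_message(workdir: Path, sender: str, text: str) -> None:
    """Append one message line to the shared bus file while holding the lock."""
    bus = workdir / "CLAUDE-MESSAGE-BUS.md"
    with file_lock(workdir / ".claude.lock"):
        with bus.open("a", encoding="utf-8") as f:
            f.write(f"- [{sender}] {text}\n")


def read_messages(workdir: Path) -> list[str]:
    """Return all message lines posted so far (empty if the bus does not exist)."""
    bus = workdir / "CLAUDE-MESSAGE-BUS.md"
    if not bus.exists():
        return []
    return bus.read_text(encoding="utf-8").splitlines()
```

Each loop would call something like `post_message(Path("."), "ralph-1", "started task A")`, and the main instance would wake up periodically and call `read_messages` to check progress. A crashed writer leaves a stale lock behind, which this sketch does not handle.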
Not a you thing. Fancy orchestration is mostly a waste, validation is the bottleneck. You can do E2E tests and all sorts of analytic guardrails but you need to make sure the functionality matches intent rather than just being "functional" which is still a slow analog process.
> Right now I can't write spec (or meta-spec) files quick enough to saturate the agent loops, and I can't QA their output well enough... mostly a me thing, i guess?
Same for me, however, the velocity of the whole field is astonishing and things change as we get used to them. We are not talking that much about hallucinating anymore, just 4-5 months ago you couldn't trust coding agents with extracting functionality to a separate file without typos, now splitting Git commits works almost without a hitch. The more we get used to agents getting certain things right 100% of the time, the more we'll trust them. There are many many things that I know I won't get right, but I'm absolutely sure my agent will. As soon as we start trusting e.g. a QA agent to do its job, our "project management" velocity will increase too.
Interestingly enough, the infamous "bowling score card" text on how XP works has demonstrated inherently agentic behaviour in more ways than one (they just didn't know what "extreme" was back then). You were supposed to implement a failing test and then implement just enough functionality for this test to not fail anymore, even if the intended functionality was broader -- which is exactly what agents reliably do in a loop. Also, you were supposed to be pair-driving a single machine, which has been incomprehensible to me for almost decades -- after all, every person has their own shortcuts, hardware, IDEs, window managers and what not. Turns out, all you need is a centralized server running a "team manager agent" and multiple developers talking to him to craft software fast (see tmux requirement in Gas Town).
Sorry, are you saying that engineers at Anthropic who work on coding models every day hadn’t thought of multiple of them working together until someone else suggested it?
I remember having conversations about this when the first ChatGPT launched and I don’t work at an AI company.
Claude Code has already had subagent support. Mostly because you have to do very aggressive context window management with Claude or it gets distracted.
This is nothing new, folks have been doing this since 2023. Lots of papers on arXiv and lots of code on GitHub with implementations of multi-agent systems.
... the "limit" was that agents were not as smart then, context windows were much smaller, and RLVR wasn't a thing, so agents were trained just for function calling, not for agent calling/coordination.
we have been doing it since then, the difference really is that the models have gotten smart enough to handle it.
Honestly this is one of plenty of ideas I also have.
But this shows how much stuff there is still to do in the AI space.
Why is Yegge so.... loud?
Like, who cares? Judging from his blog recount of this it doesn't seem like anybody actually does. He's an unnecessarily loud and enthused engineer inserting himself into AI conversations instead of just playing office politics to join the AI automation effort inside of a big corporation?
"wow he was yelling about agent orchestration in March 2025", I was about 5 months behind him, the company I was working for had its now seemingly obligatory "oh fuck, hackathon" back in August 2025
and we all came to the same conclusions. conferences had everyone having the same conclusion, I went to the local AWS Invent, all the panels from AWS employees and Developer Relations guys were about that
it stands to reason that any company working on foundational models and an agentic coding framework would also have talent thinking about that sooner than the rest of us
so why does Yegge want all of this attention and think it's important at all? it seems like it would have been a waste of energy to bother with, like everyone should have been able to know that in advance. "Anthropic! what are you doing! listen to meeeehhhh let me innnn!"
doesn't make sense, and gastown's branding is further unhinged goofiness
yeah I can't really play the attribution games on this one, can't really get behind who cares. I'm glad it's available in a more benign format now