Single-agent LLMs suck at long-running complex tasks.
We’ve open-sourced a multi-agent orchestrator that we’ve been using to handle long-running LLM tasks. We found that single LLM agents tend to stall, loop, or generate non-compiling code, so we built a harness for agents to coordinate over shared context while work is in progress.
How it works:
1. An orchestrator agent that manages task decomposition
2. Sub-agents for parallel work
3. Subscriptions to task state and progress
4. Real-time sharing of intermediate discoveries between agents
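To make the four pieces above concrete, here is a minimal sketch of the pattern using plain asyncio. It is not the project's actual code: `SharedContext`, `sub_agent`, `decompose`, and `orchestrate` are hypothetical names, and the LLM calls are replaced with placeholders.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class SharedContext:
    """Hypothetical shared state that agents read and write while work is in progress."""
    task_state: dict = field(default_factory=dict)   # subtask -> "running" / "done"
    discoveries: list = field(default_factory=list)  # (subtask, finding) pairs
    subscribers: list = field(default_factory=list)  # one queue per subscriber

    def subscribe(self):
        # A subscriber gets its own queue and receives every later event.
        queue = asyncio.Queue()
        self.subscribers.append(queue)
        return queue

    def _publish(self, event):
        for queue in self.subscribers:
            queue.put_nowait(event)

    def set_state(self, subtask, state):
        self.task_state[subtask] = state
        self._publish(("state", subtask, state))

    def share(self, subtask, finding):
        # Intermediate discoveries are visible to other agents immediately.
        self.discoveries.append((subtask, finding))
        self._publish(("discovery", subtask, finding))


async def sub_agent(subtask, ctx):
    ctx.set_state(subtask, "running")
    await asyncio.sleep(0)  # placeholder for an LLM call
    ctx.share(subtask, f"partial result for {subtask}")
    ctx.set_state(subtask, "done")
    return f"{subtask}: ok"


def decompose(task):
    # Placeholder: a real orchestrator would ask an LLM to split the task.
    return [f"{task} / part {i}" for i in (1, 2, 3)]


async def orchestrate(task):
    ctx = SharedContext()
    events = ctx.subscribe()  # the orchestrator watches state and progress
    subtasks = decompose(task)
    # Sub-agents run concurrently and coordinate only through ctx.
    results = await asyncio.gather(*(sub_agent(s, ctx) for s in subtasks))
    log = []
    while not events.empty():
        log.append(events.get_nowait())
    return results, log, ctx
```

Running `asyncio.run(orchestrate("refactor"))` returns the sub-agent results, the event log the orchestrator saw, and the final shared context. In a real harness the orchestrator would react to events as they arrive rather than draining the queue at the end.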
We tested this on a Putnam-level math problem, but the pattern generalizes to things like refactors, app builds, and long research. It’s packaged as a Claude Code skill and designed to be small, readable, and modifiable.
Use it, break it, tell me about what workloads we should try and run next!
Even Anthropic research articles consistently demonstrate they themselves use one agent, and just tune the harness around it.
I ignore all Skills and MCPs and view them as distractions that consume context, which leads to worse performance. It's better to observe what the agent is doing and where it needs help, and just throw a few bits of helpful, sometimes persistent context at it.
You can't observe what 20 agents are doing.
For most tasks, I agree. One agent with a good harness wins. The case for multiple agents is when the context required to solve the problem exceeds what one agent can hold. This Putnam problem needed more working context than fits in a single window. Decomposing into subgoals lets each agent work with a focused context instead of one agent suffocating on state. Ideally, multi-agent approaches shouldn't add more overall complexity, but there needs to be better tooling for observation etc, as you describe.
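A rough way to picture that threshold, as a sketch only: estimate the working context each subgoal needs, keep one agent if the total fits in the window, and otherwise group subgoals so each agent stays under budget. The function name, the subgoal names, and every token figure below are made up for illustration. They are not measurements from the Putnam run, and the grouping ignores whatever shared context the agents exchange.

```python
def plan_contexts(subgoals, window_tokens):
    """Group subgoals so each group's estimated context fits one window.

    subgoals: dict mapping subgoal name -> estimated tokens of working context.
    Returns a list of groups (lists of names), one group per agent.
    """
    if sum(subgoals.values()) <= window_tokens:
        return [list(subgoals)]  # everything fits: one agent is enough

    groups, loads = [], []
    # First-fit decreasing: place the largest subgoals first.
    for name, cost in sorted(subgoals.items(), key=lambda kv: -kv[1]):
        if cost > window_tokens:
            raise ValueError(f"{name} needs further decomposition")
        for i, load in enumerate(loads):
            if load + cost <= window_tokens:
                groups[i].append(name)
                loads[i] += cost
                break
        else:
            groups.append([name])
            loads.append(cost)
    return groups


# Hypothetical estimates, in tokens.
estimates = {
    "lemma_a": 60_000,
    "lemma_b": 50_000,
    "case_split": 90_000,
    "final_assembly": 40_000,
}
plan = plan_contexts(estimates, window_tokens=100_000)
```

With these made-up numbers the total is 240,000 tokens against a 100,000-token window, so the work is split across three agents; with a 300,000-token window the same call returns a single group, which is the "one agent with a good harness" case.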
That's the other thing, you hit the nail on the head: I don't want 20 agents unless they're doing research and scouring code. Claude can do that just fine. I want Claude Code doing as much as I can handle, and something like Beads does it for me.
Yes, but you can observe the agent observing what 20 agents are doing! /s
Now I see why Grey Walter made artificial tortoises in the 50s - he foresaw that it would be turtles all the way down.
Yeah, I have seen those camps too. I think there will always be a set of problems whose complexity, measured by the amount of context that has to be kept in working RAM, needs more than one agent to reach a workable or optimal result. In single-player mode (dev + Claude Code) you'll come up against these less frequently, but bigger, cross-team, cross-codebase problems will need more complex agent coordination.