I think the real value here isn’t “planning vs not planning,” it’s forcing the model to surface its assumptions before they harden into code.
LLMs don’t usually fail at syntax. They fail at invisible assumptions about architecture, constraints, invariants, etc. A written plan becomes a debugging surface for those assumptions.
Sub agent also helps a lot in that regard. Have an agent do the planning, have an implementation agent do the code and have another one do the review. Clear responsabilities helps a lot.
There also blue team / red team that works.
The idea is always the same: help LLM to reason properly with less and more clear instructions.
A huge part of getting autonomy as a human is demonstrating that you can be trusted to police your own decisions up to a point that other people can reason about. Some people get more autonomy than others because they can be trusted with more things.
All of these models are kinda toys as long as you have to manually send a minder in to deal with their bullshit. If we can do it via agents, then the vendors can bake it in, and they haven't. Which is just another judgement call about how much autonomy you give to someone who clearly isn't policing their own decisions and thus is untrustworthy.
If we're at the start of the Trough of Disillusionment now, which maybe we are and maybe we aren't, that'll be part of the rebound that typically follows the trough. But the Trough is also typically the end of the mountains of VC cash, so the costs per use goes up which can trigger aftershocks.
This sounds very promising. Any link to more details?
This runs counter to the advice in the fine article: one long continuous session building context.
Since the phases are sequential, what’s the benefit of a sub agent vs just sequential prompts to the same agent? Just orchestration?
Context pollution, I think. Just because something is sequential in a context file doesn’t mean it’ll happen sequentially, but if you use subagents there is a separation of concerns. I also feel like one bloated context window feels a little sloppy in the execution (and costs more in tokens).
YMMV, I’m still figuring this stuff out
It's also great to describe the full use case flow in the instructions, so you can clearly understand that LLM won't do some stupid thing on its own
> LLMs don’t usually fail at syntax?
Really? My experience has been that it’s incredibly easy to get them stuck in a loop on a hallucinated API and burn through credits before I’ve even noticed what it’s done. I have a small rust project that stores stuff on disk that I wanted to add an s3 backend too - Claude code burned through my $20 in a loop in about 30 minutes without any awareness of what it was doing on a very simple syntax issue.
Except that merely surfacing them changes their behavior, like how you add that one printf() call and now your heisenbug is suddenly nonexistent
Did you just write this with ChatGPT?
I've never seen an LLM use "etc" but the rest gives a strong "it's not just X, it's Y" vibe.
I really hope the fine-tuning of our slop detectors can help with misinformation and bullshit detection.