Hacker News

intellectronica•7d

GPT-5 vs. Sonnet: Complex Agentic Coding elite-ai-assisted-coding.dev

78 comments

chromejs10•7d
This should have been compared with Opus... I know OP says he didn't because of cost but if you're comparing who is better then you need to compare the best to the best... if Claude Opus 4.1 is significantly better than GPT 5 then that could offset the extra expense. Not saying it will... but forget cost if we want to compare solely the quality
- nearbuy•7d
  For what it's worth, I've been trying Opus 4.1 in VS Code through GitHub Copilot and it's been really bad. Maybe worse than Sonnet and GPT 4.1. I'm not sure why it was doing so poorly.
  In one instance, I asked it to optimize a roughly 80 line C# method that matches some object positions by object ID and delta encodes their positions from the previous frame. It seemed to be confused about how all this should work and output completely wrong code. It has all the context it needs in the file and the method is fairly self-contained. Other models did much better. GPT-5 understood what to do immediately.
  I tried a few other tasks/questions that also had underwhelming results. Now I've switched to using GPT-5.
  If you have a quick prompt you'd like me to try, I can share the results.
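  For anyone unfamiliar with the task, here is a minimal Python sketch of that kind of delta encoding. The original method was C# and isn't shown here, so the function names, the integer position tuples, and the moved/added/removed output layout are all assumptions for illustration. Objects are matched by ID, ones present in both frames store only the per-axis difference, and new or removed IDs are tracked separately.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, ...]  # assumed: integer coordinates, e.g. (x, y, z)


def delta_encode(prev: Dict[int, Position], curr: Dict[int, Position]):
    """Encode curr relative to prev, matching objects by ID (hypothetical layout)."""
    moved: Dict[int, Position] = {}  # ID -> per-axis delta for objects in both frames
    added: Dict[int, Position] = {}  # ID -> absolute position for new objects
    for obj_id, pos in curr.items():
        old = prev.get(obj_id)
        if old is None:
            added[obj_id] = pos
        elif old != pos:
            # Unchanged objects are skipped entirely; only real movement is stored.
            moved[obj_id] = tuple(c - o for c, o in zip(pos, old))
    removed: List[int] = [obj_id for obj_id in prev if obj_id not in curr]
    return moved, added, removed


def delta_decode(prev, moved, added, removed):
    """Rebuild the current frame from the previous frame plus the encoded delta."""
    gone = set(removed)
    curr = {obj_id: pos for obj_id, pos in prev.items() if obj_id not in gone}
    for obj_id, delta in moved.items():
        curr[obj_id] = tuple(p + d for p, d in zip(prev[obj_id], delta))
    curr.update(added)
    return curr
```

  A round trip (encode, then decode against the same previous frame) should give back the current frame, which is a cheap way to check whatever a model outputs for a task like this.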
  - cpursley•7d
    Use Claude Code, the rest aren't worth the bother.
  - bongodongobob•7d
    To me it seems that Opus is really good at writing code if you give it a spec. The other day I had GPT come up with a spec for a DnD text game that uses the GPT API. It one-shotted a 1k line program.
    However, if I'm not detailed with it, it does seem to make weird choices that end up being unmaintainable. It's like it has poor creative instincts but is really good at following the directions you give it.
  - muzani•7d
    Opus seems to need more babysitting IME, which is great if you are going to actually pair program. Terrible if you like leaving it to do its own thing or try to do multiple things at once.
- intellectronica•7d
  Opus costs 10X more. Maybe it's better, but I can't afford to use it, so who cares.
- runako•7d
  re: the comments that Opus is not cost effective...The whole sales pitch behind these tools, and quite specifically the pitch OpenAI made yesterday, is that they will replace people, specifically programmers. Opus is cheaper than a US-based engineer. It's totally reasonable to use it as the benchmark if it's best.
  Also keep in mind that many employees are not paying out of pocket for LLM use at work. A $1,000 monthly bill for LLM usage is high for an individual but not so much for a company that employs engineers.
  - michaelt•7d
    My experience with coding agents is they need a lot of hand-holding.
    They're impressive despite that. But if Sonnet is $20/month and I have to intervene every 3 minutes, while Opus is $100/month and I have to intervene every 5 minutes? ¯\_(ツ)_/¯
- qeternity•7d
  > but forget cost if we want to compare solely the quality
  I think this is the whole reason not to compare it to Opus...
- sergiotapia•7d
  You compare what can be used by most engineers. Most engineers are not going to spend that insane price of Opus. It's extremely high compared to all other models, so even if it is slightly better, it's a non-starter for engineering workloads.
- fouc•7d
  gpt-5 isn't supposed to be the best, it's supposed to be cost effective
h4ny•7d
I have been seeing different people reporting different results with different tasks. Watched a live stream that compared GPT-5, Gemini Pro 2.5, Claude 4 Sonnet, and GLM 4.5, and GPT-5 appeared to not follow instructions as well as the other three.
At the moment it feels like most people's "reviews" of models depend on their beliefs and agendas, and there are no objective ways to evaluate and compare models (many benchmarks can be gamed).
The blurring boundaries between technical overview, news, opinions and marketing is truly concerning.
- epolanski•7d
  I will also state another semi-obvious thing that people seem to consistently forget: models are non deterministic.
  You are not going to get the same output from GPT5 or Sonnet every time.
  And this obviously compounds across many different steps.
  E.g. give GPT5 the code to a feature (by pointing to some files and tests) and tell it to review it, find improvement opportunities and write them down: depending on the size of the code, etc., the answers will be slightly different each time.
  I often do it in Cursor by having multiple agents review a PR. Each of them (1) has to write down its own pr-number-review-model.md (e.g. pr-15-review-sonnet4.md) and (2) has to review the reviews in the other files.
  Then I review it myself and try to decide what's valuable in there and what's not. And to my disappointment (towards myself), they (1) often point to valid flaws I wouldn't have thought about, but (2) miss the "end-to-end" or general view of how the code fits in a program/process/business. What do I mean: sometimes the real feedback would be that we don't need it at all. But you need to have these conversations with AI earlier.
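  The bookkeeping for that cross-review setup can be sketched in a few lines of Python. This is only an illustration of the file-naming convention described above; the function names and the plan structure are made up here, and nothing in it calls Cursor or any model.

```python
from typing import Dict, List


def review_filename(pr_number: int, model: str) -> str:
    """Follow the pr-number-review-model.md pattern from the comment."""
    return f"pr-{pr_number}-review-{model}.md"


def cross_review_plan(pr_number: int, models: List[str]) -> Dict[str, Dict[str, object]]:
    """For each model: the file it writes, and the other models' files it must review."""
    plan: Dict[str, Dict[str, object]] = {}
    for model in models:
        plan[model] = {
            "writes": review_filename(pr_number, model),
            "reads": [
                review_filename(pr_number, other)
                for other in models
                if other != model
            ],
        }
    return plan


if __name__ == "__main__":
    # Model labels are hypothetical short names, not real API identifiers.
    for model, task in cross_review_plan(15, ["sonnet4", "gpt5"]).items():
        print(model, "writes", task["writes"], "and reviews", task["reads"])
```

  Because the models are non-deterministic, rerunning the same plan will not produce identical review files, which is part of why the human pass at the end still matters.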
- x187463•7d
  This has been ubiquitous for a while. Even here on HN every thread about these models (even this one, I'm sure) features an inordinate amount of disagreement between people vehemently declaring one model more useful than another. There truly seems to be no objective measurement of quality that can discern the difference between frontier models.
- vineyardmike•7d
  > At the moment it feels like most people's "reviews" of models depend on their beliefs and agendas, and there are no objective ways to evaluate and compare models
  I think you’ll always have some disagreement generally in life, but especially for things like this. Code has a level of subjectivity. Good variable names, correct amount of abstraction, verbosity, over complexity, etc are at least partially opinions. That makes benchmarking something subjective tough. Furthermore, LLMs aren’t deterministic, and sometimes you just get a bad seed in the RNG.
  Not only that, but the harness and prompt used to guide the model make a difference. Claude responds to the word “ultrathink”, but if GPT-5 uses “think harder”, then what should be in the prompt?
  Anecdotally, I’ve had the best luck with agentic coding when using Claude Code with Sonnet. Better than Sonnet with other tools, and better than Claude Code with other models. But I mostly use Go and Dart and I aggressively manage the context. I’ve found GPTs can’t write Zig at all, but Gemini can, and they can both write Python excellently. All that said, if I didn’t like an answer, I’d prompt again, but if I liked the answer, I never tried again with a different model to see if I’d like it even more. So it’s hard to know what could’ve been.
  I’ve used a ton of models and harnesses. Cursor is good too, and I’ve been impressed with more models in cursor. I don’t get the hype of Qwen though because I’ve found it makes lots of small(er) changes in a loop, and that’s noisy and expensive. Gemini is also very smart but worse at following my instructions, but I never took the time to experiment with prompting.
- jjfoooo4•7d
  There's certainly a symbiosis between blog publishers and small startups wanting to be perceived as influential, and big companies releasing models and wanting favorable coverage.
  I heavily discount same day commentary, there's a quid pro quo on early access vs favorable reviews (and yes, folks publishing early commentary aren't explicitly agreeing to write favorable things, but there's obvious bias baked in).
  I don't think it's all particularly concerning; you can discount reviews that are coming out so quickly that it's unlikely the reviewer has really used it very much.
- muzani•7d
  If you were to objectively rank things, durian would be the best fruit in the world, python would be the best programming language, and the Tesla Model Y is the best car. Everyone has multiple inconsistent opinions on everything because everything is not the same.
  Just pick something and use it. AI models are interchangeable. It's not as big a decision as buying a car or even a durian.
- isaacremuant•7d
  > The blurring boundaries between technical overview, news, opinions and marketing is truly concerning.
  Can't help but laugh at this. It's like you just discovered skepticism and how the world actually works.
- qsort•7d
  Thankfully that isn't a problem: we have scientific and reliable benchmarks to cut through the nonsense! Oh wait...
jjani•7d
Is it really this easy now to get your article high on HN with 100 comments? The findings are completely meaningless.
"Agenticness" depends so much on the specific tooling (harness) and system prompts. It mentions Copilot - did it use this for both? Given it's created by Microsoft there's good reason to believe it'd be built to do especially well with GPT (they'll have had 5 available in preview for months by now). Or it could be the opposite and be tuned towards Sonnet. At the very minimum you'd need to try a few different harnesses, preferably ones not closely related to either OpenAI/MS or Anthropic.
This article even mentions things like "Sonnet is much faster" which is very dependent on the specific load at the time of usage. Today everyone is testing GPT-5 so it's slow and Sonnet is much faster.
- intellectronica•7d
  OP here. I actually agree with you that the "findings" here are meaningless. This is pure vibe.
  Also regarding "Sonnet is faster" I did explicitly mention that I believe this is because GPT-5 is in preview and hours from the release. The speed I experienced doesn't say anything about the model performance you can expect.
- debarshri•7d
  I guess agents are voting it up
- ramesh31•7d
  >Is it really this easy now to get your article high on HN with 100 comments?
  Everyone wants to know the answer to GPT5 vs Claude without wasting the tokens personally because we can all more or less guess what the result will be.
arcticfox•7d
> Note that Claude 4 Sonnet isn’t the strongest model from Anthropic’s Claude series. Claude Opus is their most capable model for coding, but it seemed inappropriate to compare it with GPT-5 because it costs 10 times as much.
Well - I would have been interested in GPT-5 vs. Opus. Claude Code Max is affordable with Opus.
- swader999•7d
  You're absolutely right!
- qeternity•7d
  > Claude Code Max is affordable with Opus
  Because Anthropic is presumably massively subsidizing the usage.
Nizoss•7d
I have been using Claude Code with TDD through hooks, which significantly improved my workflow for production code.
Watching the ChatGPT 5 demo yesterday, I noticed most of the code seemed oriented towards one-off scripts rather than maintainable codebases which limits its value for me.
Does anyone know if ChatGPT 5 or Copilot have similar extensibility to enforce practices like TDD?
For context on the approach: https://github.com/nizos/tdd-guard
I use pre/post operation commands to enforce TDD rules.
- MrGreenTea•7d
  I just recently stumbled upon your tdd-guard when looking for inspiration for Claude hooks. I've been so impressed with how much it allowed me to improve my workflow and quality. Then I was somewhat disappointed that almost no one seems to talk about this potential and how they're using hooks. Yours was the only interesting project I found in this regard and I hope to give it a spin this weekend.
  You don't happen to have a short video where you go into a bit more detail on how you use it though?
- ethan_smith•7d
  GPT-5 supports custom function calling which you could use to build similar TDD hooks via the API, though nothing as streamlined as your Claude Code implementation exists out-of-the-box yet.
patcon•7d
> One continuous difference: while GPT-5 would do lots of thinking then do something right the first time, Claude frantically tried different things — writing code, executing commands, making pretty dumb mistakes [...], but then recovering. This meant it eventually got to correct implementation with many more steps.
Sounds like Claude muddles. I consider that the stronger tactic.
I sure hope GPT-5 is muddling on the backend, else I suspect it will be very brittle.
Re: https://contraptions.venkateshrao.com/p/massed-muddler-intel...
> Lindblom’s paper identifies two patterns of agentic behavior, “root” (or rational-comprehensive) and “branch” (or successive limited comparisons), and argues that in complicated messy circumstances requiring coordinated action at scale, the way actually effective humans operate is the branch method, which looks like “muddling through” but gradually gets there, where the root ["godding through"] method fails entirely.
- quijoteuniv•7d
  Today I used GPT-5 for some OpenTelemetry Collector configs that both Claude and OpenAI models struggled with before and it was surprisingly impressive. It got the replies right on the first try. Previously, both had been tripped up by outdated or missing docs (OTel changes so quickly).
  For home projects, I wish I could have GPT-5 plugged into Claude Code's CLI interface. Iteration just works! Looking forward to less babysitting in the future!
- sudohalt•7d
  Isn't the issue with that the prohibitive costs, it can easily be 5 to 10 (maybe even more for long running tasks). Currently they are probably subsidizing the compute costs to some extent.
- kiitos•7d
  yeah gpt-5 does lots of thinking and then does something -- but it's rarely the right thing, at least in my experience over the past day
macawfish•7d
Claude is just so well rounded and considerate. A lot of this probably comes down to prompt and context engineering, though surely there's something magical about Anthropic's principled training methodologies. They invented constitutional AI and I can only imagine that behind the scenes they're doing really cool stuff. Can't wait to see Claude 5!
SV_BubbleTime•7d
> but when I'd point out the missing implementation, it would give its usual "you're absolutely right" and try to fix it.
I'm really trying not to be annoyed by Claude’s “You’re absolutely right” because I know I cannot control it, but this is an increasingly difficult task.
- CamperBob2•7d
  You can control it in the chat page, at least (User name at lower left->Settings). I use this:
```
    Answer concisely when appropriate, more 
    extensively when necessary.  Avoid rhetorical 
    flourishes, bonhomie, and (above all) cliches.  
    Take a forward-thinking view. OK to be mildly 
    positive and encouraging but NEVER sycophantic 
    or cloying.  Above all, NEVER use the phrase 
    "You're absolutely right."  Rather than "Let 
    me know if..." style continuations, list a 
    set of prompts to explore further topics.
```
  That last bit causes some clutter at the end of each response, not sure if I'm going to keep it. But it does do a good job at following these guidelines in my experience. The same basic instructions also work well in ChatGPT and Gemini.
  Does Claude Code not support anything like this?
- jpalawaga•7d
  I think it's because "you're right!" somehow presupposes it knew the answer and was just testing you.
  an intern never says that. they say "oh, I see."
- AlecSchueler•7d
  Does it also seem to be getting worse this way?
mewpmewp2•7d
So far from my testing I have found Claude Code with Sonnet 4 better than Cursor + GPT-5 still. I started the exact same projects at the same time, and it seemed Claude Code was just more reliable. It was just much slower in terms of setting up the project and didn't set the project up as scalably (despite them highlighting that in the demo), and when I tried to instruct it to set it up DRY, modular, etc. it kind of just didn't go where I wanted it to, while Claude Code did.
It was a game involving OOP, three.js. I think both are probably great at good design and CRUD things.
- natiman1000•7d
  I was initially excited about GPT5, and I quickly switched to it, but I still can't use it for some reason. It is clearly smart but not useful.
- nextworddev•7d
  GPT-5 is much cheaper though
anotheryou•7d
I think we need to stop testing models raw.
Claude is trained for claude code and that's how it's used in the field too.
- nightshift1•7d
  unless you use it through copilot
stitched2gethr•7d
This take rings true for me after admittedly only a couple of hours of use of gpt-5. I had an issue I had been working with Claude on but it was difficult to give it real-time feedback so it floundered. gpt-5 struggled in the same areas but after about $2 of tokens it did fix the issue. It was far from a one-shot like I might have expected from the hype, but it did get the job done in about an hour where Claude could not in 3.
For reference my Claude usage was mostly Sonnet, but with consulting from Opus.
- 0xfaded•7d
  Would you be comfortable sharing a brief description of what the issue was?
ctbellmar•4d
I know it's been mentioned a few times, but worth repeating: these LLMs tend to do noticeably better in their own native environments. Claude (Opus or Sonnet) in Copilot != Claude in Claude Code. Same applies to Cursor, Windsurf, Augment, etc. This likely has a lot to do with context manipulation (and compression), which affects the resulting output. I imagine that GPT-5 likewise will do better in Codex vs 3rd party plugin/VS Code fork.
- fragmede•4d
  The system prompts aren't shared either, and probably account for quite a bit of difference as well.
lherron•7d
Did I miss the total cost for each run in the article? Can't seem to find it.
If Sonnet is more expensive AND more chatty/requires more attempts for the same result, seems like that would favor GPT5 for daily driver.
nojs•7d
This pretty much matches my experience today. GPT5 (in Cursor) feels smarter in isolation, but CC with Opus is faster and better at real tasks involving a large codebase.
DrNosferatu•7d
Then instruct GPT5 to write more structured and annotated code.
endorphine•7d
What is the way to use this agentic stuff with neovim? Do I have to resort to OpenAI's Codex, or is an nvim plugin sufficient? Or Claude Code?
- Cyphus•7d
  I've been using codecompanion.nvim[0] combined with mcp-hub.nvim[1]. Code Companion works well for interactive chat but falls short for agentic coding. It's limited to some pre-configured and user-defined "workflows" which are basically templated steps with prompts, actions, and loops.
  I've been meaning to give avante.nvim[2] a try since it aims to provide a "Cursor like" experience, but for now I've been alternating between Code Companion for simple prompts and Claude CLI (in a tmux pane next to Neovim) for agentic stuff.
  [0] https://codecompanion.olimorris.dev/
  [1] https://ravitemer.github.io/mcphub.nvim/
  [2] https://github.com/yetone/avante.nvim
- mkozlows•7d
  Claude Code or cursor-cli or Codex or any of the command-line tools should be good. (Claude Code seems so far to be the option people like best of those, though.)
- roguesherlock•7d
  I've found these two to be really good
  https://github.com/dlants/magenta.nvim
  https://github.com/NickvanDyke/opencode.nvim
olddustytrail•7d
From reactions I've seen it appears that GPT-5 hallucinates less than previous models but the flip side is that it's worse for creative tasks.
This makes logical sense: you don't want a model to get creative if you need functioning code, but if you want a story idea it should basically be all hallucination.
I think it makes sense to have different models for these tasks.
doctoboggan•7d
I really like Claude Code's context engineering and prompt engineering. Is it possible to plug GPT-5 into Claude Code? I think that would be a more apples-to-apples test as it's just testing the models and not the agentic framework around them.
- koakuma-chan•7d
  I imagine Claude Code is optimized for Claude specifically, and GPT-5 would not be great there. You should probably use Codex if you want to use GPT-5.
mvATM99•7d
The manual approval of commands in GHCP can be circumvented, there's an experimental setting that allows you to accept all commands automatically.
I wish you could be a bit more specific though, you can't set which commands you want to auto-accept in detail.
- typpilol•7d
  Pretty sure you can set a terminal whitelist and blacklist for it.
Surac•7d
TypeScript to Rust. I mostly test models on C code. C has much less boilerplate and more code per word. Models need to be really smart to see all the pointer magic and misuse of lib functions. Claude really makes a very competent C coder in my tests.
bn-l•7d
Github Copilot is utter garbage. The diffing crawls along at a snail’s pace. I think it’s coming up on two years and this most criticised aspect of it still isn’t fixed, even with all the reverse engineering of how Cursor did it. I wish I could find an alternative to Cursor (which has other issues). Honestly, that company just threw away a golden opportunity as the first mover.
- bredren•7d
  I've done evaluations of Github Copilot, Sourcegraph Cody and Gitlab Duo, and Copilot is not garbage, but rather by far the leader among these options.
- sourcecodeplz•7d
  Why did they throw it away? Because of the new opaque pricing?
- cebert•7d
  I’m not sure why you’re getting downvoted. I agree that Copilot seems like complete rubbish.
carterparks•7d
I'm getting an SSL error in Chrome: ERR_SSL_PROTOCOL_ERROR
- OJFord•7d
  I get 'unable to connect' in Firefox Android for this and many little blogs on HN lately, idk what's going on. Cloudflare blocking me (but not for all sites)? Geo-restriction (UK)?
animex•7d
Wonderful, timely article. It sounds like a hybrid approach might produce good results: Using ChatGPT-5 for planning/analysis and using Claude for execution.
- macawfish•7d
  Another option would be to add something to your AGENTS.md or whatever giving examples of the kind of code organization you want. You could ask Claude to explain its approach in terms explicitly that GPT-5 can understand. GPT-5 seems much more sensitive in its responsiveness to instructions. My sense is that in the long run this will be really nice, but that the current prompts in these mainstream LLM coding tools are designed for models with a different style of responsiveness to instructions.
indigodaddy•7d
What does the 1x and .33x mean on the list of models in copilot? (Never used but thinking about trying on the free tier)
- commandar•7d
  They're multipliers against your quota of requests. GPT-4.1 is "free" with a copilot sub, and then the premium models would burn credits against a multiplier. So higher multipliers count more against your monthly quota.
  GPT5, Sonnet 4, and Gemini Pro 2.5 are all 1x. Opus is 10x, for comparison.
  https://docs.github.com/en/copilot/reference/ai-models/suppo...
  Also worth keeping in mind that Copilot has reduced context windows even for the premium models, which has a very real impact on agentic performance.
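  To make the multiplier arithmetic concrete, here is a rough Python sketch. The multipliers are the ones stated in this comment and may not match GitHub's current docs; the model keys, the function names, and the quota of 300 in the example are hypothetical placeholders.

```python
from typing import Dict

# Multipliers as stated in this comment; not checked against GitHub's current docs.
# The dictionary keys are made-up labels, not official model identifiers.
MULTIPLIERS: Dict[str, float] = {
    "gpt-4.1": 0.0,         # "free" with a subscription, so it burns no quota
    "gpt-5": 1.0,
    "sonnet-4": 1.0,
    "gemini-2.5-pro": 1.0,
    "opus": 10.0,
}


def premium_requests_used(requests_by_model: Dict[str, int]) -> float:
    """Each request counts against the quota as 1 x the model's multiplier."""
    return sum(MULTIPLIERS[model] * count for model, count in requests_by_model.items())


def requests_left(monthly_quota: float, requests_by_model: Dict[str, int]) -> float:
    """Remaining quota after the given usage (monthly_quota is an assumed input)."""
    return monthly_quota - premium_requests_used(requests_by_model)


if __name__ == "__main__":
    usage = {"gpt-4.1": 50, "gpt-5": 20, "opus": 3}
    print(premium_requests_used(usage))   # 0*50 + 1*20 + 10*3
    print(requests_left(300, usage))      # 300 is a hypothetical quota
```

  So under these numbers three Opus requests cost more quota than twenty GPT-5 requests, which is the practical meaning of the 10x label.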
- PufPufPuf•7d
  It's pricing / usage modifiers, basically price expressed as a ratio of the default model.
chisleu•7d
How was he doing "complex agentic coding" when the APIs have such extreme context and throughput limitations?
ramoz•7d
Not using Claude code is a crime.