This should have been compared with Opus... I know OP says he didn't because of cost, but if you're comparing which is better then you need to compare the best to the best... if Claude Opus 4.1 is significantly better than GPT-5 then that could offset the extra expense. Not saying it will... but forget cost if we want to compare solely the quality.
For what it's worth, I've been trying Opus 4.1 in VS Code through GitHub Copilot and it's been really bad. Maybe worse than Sonnet and GPT-4.1. I'm not sure why it was doing so poorly.
In one instance, I asked it to optimize a roughly 80-line C# method that matches object positions by object ID and delta-encodes them against the previous frame. It seemed confused about how all this should work and output completely wrong code. It had all the context it needed in the file, and the method is fairly self-contained. Other models did much better. GPT-5 understood what to do immediately.
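To illustrate the kind of task described (this is not the actual C# method, which isn't shown in the thread), here is a minimal Python sketch of matching objects by ID and delta-encoding positions against the previous frame. The names `delta_encode` and `delta_decode`, the integer 3-tuples, and the choice to encode new objects against the origin are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int, int]
ORIGIN: Position = (0, 0, 0)  # assumed baseline for objects with no previous position


def delta_encode(prev: Dict[int, Position],
                 curr: Dict[int, Position]) -> List[Tuple[int, Position]]:
    """Return (object_id, delta) pairs for every object in the current frame.

    Objects are matched to the previous frame by ID; an object that did not
    exist in the previous frame is encoded relative to ORIGIN.
    """
    deltas = []
    for obj_id, pos in sorted(curr.items()):
        base = prev.get(obj_id, ORIGIN)
        deltas.append((obj_id, tuple(c - b for c, b in zip(pos, base))))
    return deltas


def delta_decode(prev: Dict[int, Position],
                 deltas: List[Tuple[int, Position]]) -> Dict[int, Position]:
    """Rebuild the current frame from the previous frame plus the deltas."""
    return {
        obj_id: tuple(b + d for b, d in zip(prev.get(obj_id, ORIGIN), delta))
        for obj_id, delta in deltas
    }


prev_frame = {1: (10, 20, 30), 2: (0, 0, 0)}
curr_frame = {1: (11, 20, 29), 3: (5, 5, 5)}

encoded = delta_encode(prev_frame, curr_frame)
decoded = delta_decode(prev_frame, encoded)
```

Objects that disappear between frames (ID 2 here) are simply absent from the output; a real implementation would likely need an explicit removal marker.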
I tried a few other tasks/questions that also had underwhelming results. Now I've switched to using GPT-5.
If you have a quick prompt you'd like me to try, I can share the results.
Use Claude Code, the rest aren't worth the bother.
To me it seems that Opus is really good at writing code if you give it a spec. The other day I had GPT come up with a spec for a DnD text game that uses the GPT API. Opus one-shotted a 1k-line program from it.
However, if I'm not detailed with it, it does seem to make weird choices that end up being unmaintainable. It's like it has poor creative instincts but is really good at following the directions you give it.
Opus seems to need more babysitting IME, which is great if you are going to actually pair program. Terrible if you like leaving it to do its own thing or try to do multiple things at once.
Opus costs 10X more. Maybe it's better, but I can't afford to use it, so who cares.
re: the comments that Opus is not cost effective... The whole sales pitch behind these tools, and quite specifically the pitch OpenAI made yesterday, is that they will replace people, specifically programmers. Opus is cheaper than a US-based engineer. It's totally reasonable to use it as the benchmark if it's the best.
Also keep in mind that many employees are not paying out of pocket for LLM use at work. A $1,000 monthly bill for LLM usage is high for an individual but not so much for a company that employs engineers.
My experience with coding agents is they need a lot of hand-holding.
They're impressive despite that. But if Sonnet is $20/month and I have to intervene every 3 minutes, while Opus is $100/month and I have to intervene every 5 minutes? ¯\_(ツ)_/¯
> but forget cost if we want to compare solely the quality
I think this is the whole reason not to compare it to Opus...
You compare what can be used by most engineers. Most engineers are not going to pay Opus's insane price. It's extremely high compared to all other models, so even if it is slightly better, it's a non-starter for engineering workloads.
GPT-5 isn't supposed to be the best, it's supposed to be cost effective.