> The Claude C Compiler illustrates the other side: it optimizes for
> passing tests, not for correctness. It hard-codes values to satisfy
> the test suite. It will not generalize.
This is one of the pain points I am suffering at work: workers ask coding agents to generate some code, and then to generate test coverage for the code. The LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code. At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
The icing on the cake is that LLMs are producing so much code that humans are just rubber stamping all of it. Off to merge and build it goes.
I have no constructive recommendations; I feel the industry will keep their foot on the pedal until something catastrophic happens.
This is why you write the tests first and then the code. Especially when fixing bugs, since you can be sure that the test properly fails when the bug is present.
When fixing bugs, yes. When designing an app, not so much, because you realize many unexpected things while writing the code and seeing how it behaves. Often the original test code would test something that is never built. It's obvious for integration tests, but it happens for tests of API calls and even for unit tests. One could start writing unit tests for a module or class and eventually realize that it must be implemented in a totally different way. I prefer experimenting with the implementation and writing tests only when it settles on something I'm confident will go to production.
Where I'm at currently (which may change) is that I lay down the foundation of the program and its initial tests first. That initial bit is completely manual. Then when I'm happy that the program is sufficiently "built up", I let the LLM go crazy. I still audit the tests though personally auditing tests is the part of programming I like the very least. This also largely preserves the initial architectural patterns that I set so it's just much easier to read LLM code.
In a team setting I try to do the same thing and invite team members to start writing the initial code by hand only. I suspect if an urgent deliverable comes up though, I will be flexible on some of my ideas.
> When fixing bugs, yes.
One thing I want to mention here is that you should try to write a test that not only prevents this bug, but also similar bugs.
In our own codebase we saw that regression on fixed bugs is very low, so writing a specific test for each one isn't the best way to spend your resources. Writing a broad test, when possible, is.
Not sure how LLMs handle that case to come up with a proper test.
I'd argue the AI writing the tests shouldn't even know about the implementation at all. You only want to pass it the interface (or function signatures) together with javadocs/docstrings/equivalent.
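For illustration (function and behaviour invented for the sketch): everything the test author sees is the signature and the docstring; the implementation below merely stands in for code they never get to read.

```python
import re

def parse_duration(text: str) -> int:
    """Parse a duration like '1h30m' or '45s' into total seconds.

    Raises ValueError on empty or malformed input.
    """
    # (Implementation the test author never sees.)
    m = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", text)
    if not m or not text:
        raise ValueError(f"malformed duration: {text!r}")
    h, mi, s = (int(g) if g else 0 for g in m.groups())
    return h * 3600 + mi * 60 + s

# Tests derived purely from the contract in the docstring:
assert parse_duration("45s") == 45
assert parse_duration("1h30m") == 5400
try:
    parse_duration("90x")
except ValueError:
    pass
else:
    raise AssertionError("malformed input should raise ValueError")
```

Tests written this way can only encode the contract, not accidentally re-encode the implementation's quirks.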
Agreed 1000%. But that can be a lot of work; creating a good set of tests takes nearly as much, and often even more, effort than implementing the thing being tested.
When LLMs can assist with writing useful tests before having seen any implementation, then I’ll be properly impressed.
from experience, AI is bad at TDD. it can infer tests from written code, but it's bad at writing generalised tests unless a clear requirement is given, so you, the engineer, end up doing most of the work anyway.
My day job has me working on code that is split between two different programming languages. I'd say LLMs are pretty good at TDD in one of those languages and a hot mess in the other.
Which, funny enough, is a pretty good reflection of how I thought of the people writing in those languages before LLMs: One considers testing a complete afterthought and in the wild it is rare to find tests at all, and when they are present they often aren't good. Whereas the other brings testing as a first-class feature and most codebases I've seen generally contain fairly decent tests.
No doubt LLM training has picked up on that.
I don't think it addresses the problem.
Writing the tests first and then writing code to pass the tests is no better than writing the code first and then writing tests that pass. What matters is that both the code and the tests are written independently, from specs, not from one another.
I think it is better not to have access to the tests when first writing code, to make sure you code to the specs and not to the tests that test the specs, since something may be lost in translation. That means I have a preference for code first, but the ideal case would be for different people to do both in parallel.
Anyway, about AI: if an AI writes both the tests and the code, it will make sure they match no matter which comes first. It may even go back and forth between the tests and the code, but that doesn't mean either is correct.
Tests are your spec. You write them first because that is the stage when you are still figuring out what you need to write.
Although TDD says that you should only write one test before implementing it, encouraging spec writing to be an iterative process.
Writing the spec after implementation means that you are likely to have forgotten the nuance that went into what you created. That is why specs are written first. Then the nuance is captured up front as it comes to mind.
Tests are not any more or any less of a spec than the code. If you are implementing a HTTP server for instance, RFC 7231 are your specs, not your tests, not your code.
I would say that which comes first between specs and code depends on the context. If you are implementing a standard, the specs of the standard obviously come first, but if you are iterating, maybe on a user interface, it can make sense to start with the code so that you can have working prototypes. You can then write formal documents and tests later, when you are done prototyping, for regression control.
But I think that leaning on tests is not always a good idea. For example, let's continue with the HTTP server. You write a test suite, but there is a bug in your tests; say you confuse errors 404 and 403. Then you write your code, correctly, run the tests, and see that one of them fails, telling you that you returned 404 and not 403. You don't think much, after all "the tests are the specs", and you change the code. Congratulations, you are now making sure your code is wrong.
Of course, the opposite can and does happen: writing the code wrong and making the tests pass without thinking about what you are actually testing, and I believe that's why people came up with the idea of TDD. But for me, test-first flips the problem; it doesn't solve it. I'd say the only advantage, if it is one, is that it prevents taking a shortcut and releasing untested code, by moving tests out of the critical path.
But outside of that, I'd rather focus on the code, so if something is to be "the spec", that's it. It is the most important, because it is the actual product; everything else is secondary. I don't mean unimportant; I mean that from the point of view of users, it is better for the test suite to be broken than for the code to be broken.
> RFC 7231 are your specs
It is more like a meta spec. You still have to write a final spec that applies to your particular technical constraints, business needs, etc. RFC 7231 specifies the minimum amount necessary to interface with the world, but an actual program to be deployed into the wild requires much, much more consideration.
And for that, since you have the full picture that isn't available to a meta spec, logically you will write it in a language that both humans and computers can understand. For the best results, that means something like Lean, Rocq, etc. However, in the real world you likely have to deal with middling developers straight out of learn-to-code bootcamps, so tests are the practical middle ground.
> I don't know, you confuse error 404 and 403.
Just like you would when writing RFC 7231? But that's what the RFC process is for. You don't have to skip the RFC process just because the spec also happens to be machine readable. If you are trying to shortcut the process, then you're going to have this problem no matter what.
But, even when shortcutting the process, it is still worthwhile to have written your spec in a machine-readable format as that means any changes to the spec automatically identify all the places you need to change in implementation.
> writing the code wrong and making the tests pass without thinking about what you are actually testing
The much more likely scenario is that the code is right, but a mistake in the test leads it to not test anything. Then, years down the road after everyone has forgotten or moved on, when someone needs to do some refactoring there is no specification to define what the original code was actually supposed to do. Writing the test first means that you have proven that it can fail. That's not the only reason TDD suggests writing a test first, but it is certainly one of them.
> It is the most important, because it is the actual product
Nah. The specification is the actual product; it is what lives for the lifetime of the product. It defines the contract with the user. Implementation is throwaway. You can change the implementation code all day long and as long as the user contract remains satisfied the visible product will remain exactly the same.
> The much more likely scenario is that the code is right, but a mistake in the test leads it to not test anything.
What I usually do to prevent this situation is to write a passing test, then modify the code to make it fail, then revert the change. It also gives an occasion to read the code again, kind of like a review.
I have never seen this practice formalized though. Good for me; this is the kind of thing I do because I care, and turning it into a process with Jira and such is a good way to make me stop caring.
> I have never seen this practice formalized though
Isn't that what is oft known as mutation testing? It is formalized to the point that we have automation to do the mutation for you automatically.
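A toy sketch of the mutation idea (made-up function): deliberately break the code and check that at least one test notices. Tools such as mutmut automate generating the mutants.

```python
def is_adult(age: int) -> bool:
    return age >= 18

def test_is_adult():
    assert is_adult(18) is True
    assert is_adult(17) is False

# Mutant: flip >= to >. If the whole suite still passes against the
# mutant, no test is actually pinning down the boundary.
def is_adult_mutant(age: int) -> bool:
    return age > 18

try:
    assert is_adult_mutant(18) is True  # same check test_is_adult makes
    print("mutant survived: the boundary is untested")
except AssertionError:
    print("mutant killed: a test noticed the change")
```

Here the mutant is killed, because `test_is_adult` checks the boundary value 18 explicitly; a suite that only tested 30 and 5 would let it survive.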
Also, if you find after implementation that the spec wasn't specific enough, go ahead and refresh the spec and have the LLM redo the code, from scratch if necessary. Writing code is so cheap right now, it takes a different mindset in general.
try this for a UI
the test generation loop is the real trap. you ask the agent to write code, then ask it to write tests for that code. of course the tests pass. they're testing what the code does, not what it should do.
we ran into this building a task manager. the PUT endpoint set completed=true but never set the completion timestamp. the agent-written tests all passed because they tested "does it set completed to true" not "does it record when it was completed." 59 tasks in production with null timestamps before a downstream report caught it.
the fix was trivial. the gap in verification wasn't.
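The missing assertion from the anecdote above could look something like this (all names invented to match the story; the real code is unknown):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Task:
    completed: bool = False
    completed_at: Optional[datetime] = None

def complete_task(task: Task) -> Task:
    task.completed = True
    task.completed_at = datetime.now(timezone.utc)  # the line the agent omitted
    return task

def test_complete_records_timestamp():
    task = complete_task(Task())
    assert task.completed is True         # the agent-written tests stopped here
    assert task.completed_at is not None  # the assertion that encodes the requirement

test_complete_records_timestamp()
```

The second assertion comes from the requirement, not from reading the code, which is exactly what the generated tests never did.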
Once upon a time people advocated writing tests first…
once upon a time 'engineering' in software had some meaning attached to it...
no other engineering profession would accept the standards (or rather the lack of them) on which software engineering is running.
> no other engineering profession would accept the standards (or rather the lack of them) on which software engineering is running.
I have bad news for you: they are pushing those "standards" (Agile, ASPICE) also in hardware and mechanical engineering.
The results can be seen already. Testing is expensive, and that is the field where the most savings can be made.
Agile isn't a coding standard or approach.
Once upon a time people were thinking about what they're doing. LLMs absolve people from thinking
Engineers aren't paid to think. They are paid to be replacable cogs who can be fired the moment they show independent thought.
i don't think that would help. the agent would hard-code the test details into the code.
I haven't been able to force the agent to write failing tests yet, although I'm sure it should be possible.
I do that all the time with Claude. What part is not working?
I don't really use Anthropic models. But when I tried it with others, they could write tests, but they never confirmed that the tests fail before proceeding to write the implementation that makes them pass. Maybe I didn't prompt forcibly enough.
I haven’t tried this (yet), but I’ve heard of people disabling write access to test code while the agent is writing the implementation, and vice versa. I imagine “disabling” could be done via prompting, or just a quick one-liner like: chmod -R a-w ./tests
The magic word is "use red/green testing", that makes it create the tests first, confirm they fail (as they should), then it writes the code to match.
At my job we have a requirement for 100% test coverage. So everyone just uses AI to generate 10,000 line files of unit tests and nobody can verify anything.
Exactly! It's frustrating how much developers get blamed for the outcomes of incompetent management.
> everyone just uses AI to generate 10,000 line files of unit tests and nobody can verify anything
This is not a guaranteed outcome of requiring 100% coverage. Not that that's a good requirement, but responding badly to a bad requirement is just as bad.
> The icing on the cake is that LLMs are producing so much code that humans are just rubber stamping all of it.
I don't understand the value of that much code. What features are worth that much more than stability?
> At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
Obvious question: why not? Let’s say you have competent devs, fair assumption. Maybe it’s because they don’t have enough time for solid QA? Lots of places are feature factories. In my personal projects I have more lines of code doing testing than implementation.
It’s because people will do what they’re incentivized to do. And if no one cares about anything but whether the next feature goes out the door, that’s what programmers will focus on.
Honestly I think the other thing that is happening is that a lot of people who know better are keeping their mouths shut and waiting for things to blow up.
We’re at the very peak of the hype cycle right now, so it’s very hard to push back and tell people that maybe they should slow down and make sure they understand what the system is actually doing and what it should be doing.
Or if you say we should slow down your competence is questioned by others who are going very fast (and likely making mistakes we won't find until later).
And there is an element of uncertainty. Am I just bad at using these new tools? To some degree probably, but does that mean I'm totally wrong and we should be going this fast?
There is a saying: slow is smooth and smooth is fast.
I have personally outpaced some of my more impatient colleagues by spending extra time up front setting up test harnesses, reading specifications, etcetera. When done judiciously it pays off in time scales of weeks or less.
oh yeah, let them dig a hole and charge sweet consultant rates to fix it. then the healing can begin
Developers aren't given time to test and aren't rewarded if they do, but management will rain down hellfire upon their heads if they don't churn out code quickly enough.
How about a subsequent review where a separate agent analyzes the original issue and resultant code and approves it if the code meets the intent of the issue. The principle being to keep an eye out for manual work that you can describe well enough to offload.
Depending on your success rate with agents, you can have one that validates multiple criteria or separate agents for different review criteria.
You are fighting nondeterministic behavior with more nondeterministic behavior, or in other words, fighting probability with probability. That doesn't necessarily make things any better.
In my experience, an agent with "fresh eyes", i.e., without the context of being told what to write and writing it, does have a different perspective and is able to be more critical. Chatbots tend to take the entire previous conversational history as a sort of canonical truth, so removing it seems to get rid of any bias the agent has towards the decisions that were made while writing the code.
I know I'm psychologizing the agent. I can't explain it in a different way.
Fresh eyes, some context, and another LLM.
The problem is information fatigue from all the agents + the code itself.
I think of it as additive bias, i.e. "don't think about the pink elephant". Not only does this not help LLMs avoid pink elephants, it guarantees that pink-elephant information is now being considered in inference when it was not before.
I fear that thinking about problem solving in this manner to make LLMs work is damaging to critical-thinking skills.
Aren't human coders also nondeterministic?
Assigning different agents to have different focuses has worked for me. Especially when you task a code reviewer agent with the goal of critically examining the code. The results will normally be much better than asking the coder agent who will assure you it's "fully tested and production ready"
Human coders are far more reliable. The only downside is speed, and therefore cost
Probably true
(Sorry.)
Slop on slop. Who watches the watchmen?
How long till the industry discovers TDD?
> At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
it's fun having LLMs because it makes it quite clear that a lot of testing has been cargo culting. did people ever actually check that the tests verify anything meaningful?
15 years ago, I had a tester writing "UI tests" / "user tests" that matched whatever the software was cranking out. At the time I had just joined to carry on work on the client side, so I hadn't really worked on anything yet.
I had a fun discussion when the client tried to change values... Why is it still 0? Didn't you test?
That was when I had to dive into the code base and cry.
Test automation is kind of like a religion. It is comforting to believe that the solution to code is more code.
Property testing could've helped
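A hand-rolled sketch of the idea (a library like Hypothesis would generate and shrink the cases for you): assert an invariant over many random inputs instead of a few hand-picked examples.

```python
import random

def run_length_encode(s: str) -> list:
    # Encode "aab" as [("a", 2), ("b", 1)].
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def run_length_decode(pairs: list) -> str:
    return "".join(ch * n for ch, n in pairs)

# Property: decode(encode(s)) == s for ANY s, not just the examples
# a test author (human or LLM) happened to think of.
for _ in range(1000):
    s = "".join(random.choice("ab") for _ in range(random.randint(0, 20)))
    assert run_length_decode(run_length_encode(s)) == s
print("roundtrip property held for 1000 random inputs")
```

A hard-coded implementation that memorised a few example strings would fail this almost immediately, which is the point.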
Building a C compiler should not have this problem. There are probably a million test suites from outside the LLM that it can use to verify correctness.
I think it boils down to how companies view LLMs and their engineers.
Some companies will do as you say - have (mostly clueless) engineers feed high level "wishes" to (entirely clueless) LLMs, and hope that everyone kind of gets it. And everyone will kind of get it. And everyone will kind of get it wrong.
Other companies will have their engineers explicitly treat the LLMs as collaborators / pair programmers, not independent developers. As an engineer in such a company, YOU are still the author of the code even if you "prompted" it instead of typing it. You can't just "fix this high level thing for me brah" and get away with it, but instead need to continuously interact with the LLM as you define and it implements the detailed wanted behaviors. That forces you to know _exactly_ what you want and ask for _exactly_ what you want without ambiguity, like in any other kind of programming. The difference is that the LLM is a heck of a lot quicker at typing code than you are.
Honestly, unit tests (at least on the front-end) are largely wasted time in the current state of software development. Taking the time that would have been spent on writing unit tests and instead using it to write functionally pure, immutable code would do much more to prevent bugs.
There's also the problem that when stack rank time comes around each year no one cares about your unit tests. So using AI to write unit tests gives me time to work on things that will actually help me avoid getting arbitrarily fired.
I wish that software engineers were given the time to write both clean code and unit tests, and I wish software engineers weren't arbitrarily judged by out of touch leadership. However, that's not the world we live in so I let AI write my unit tests in order to survive.
You are overvaluing “clean code.” Code is code: it either works within spec or it doesn’t; or it does, but there are errors, more or less catastrophic, waiting to show themselves at any moment. But even in the latter case, no single individual can know for certain, no matter how much work they put in, that their code is perfect. But they can know it’s usable, and someone else can check to make sure it doesn’t blow something else up, and that is the most important thing.
I like unit tests when I have to modify code that someone made years ago, as a basic sanity check.
Yeah this is the exact kind of ridiculousness I've noticed as well - everything that comes out of an LLM is optimized to give you what you want to hear, not what's correct.
> The LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code
This is true for humans too. Tests should not be written or performed by the same person that writes the code
That's a complete fantasy world where companies have twice the engineers they actually need instead of half.
> [Reviews] should not be written or performed by the same person that writes the code
> That's a complete fantasy world where companies have twice the engineers they actually need instead of half.
Agreed, but then companies shouldn't complain about the consequences of understaffing their teams.
My only hope is that all of this push leads in the end to the adoption of more formal verification languages and tools.
If people are having to specify things in TLA+ etc -- even with the help of an LLM to write that spec -- they will then have something they can point the LLM at in order for it to verify its output and assumptions.
A long time ago in France, the mainstream view among computer people was that neither code nor compute was what mattered when dealing with computers: it is information that matters, and how you process it in a sensible way (hence the French name for computer science, informatique, and the name for a computer, “ordinateur”, literally: that which sets things in order).
As a result, computer science students were taught a lot (too much for most people's taste, it seems) about data modeling and not much about code itself, which was viewed as mundane and uninteresting, until the US hacker culture finally took over in the late 2000s.
Turns out that the French were just right too early, like with the Minitel.
"Computer science is no more about computers than astronomy is about telescopes." -Dijkstra
> The LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code.
I always felt like that's the main issue with unit testing. That's why I used it very rarely.
Maybe keeping the tests in a separate module, not letting the agent see the source while writing tests, and not letting it see the tests while writing the implementation, would help? They could share just the API and the spec.
And if tests fail, another agent with the full context could decide whether the fix should be delegated to the coding agent or to the testing agent.
This hits hard. I’m getting hit with so much slop at work that I’ve quietly stopped being all that careful with reviews.
>LLM happily churns out unit tests which are simply reinforcing the existing behaviour of the code. At no point does anyone stop and ask whether the generated code implements the desired functional behaviour for the system ("business logic").
You can use spec-driven development and TDD: write the tests first, confirm they fail, then modify the code until they pass.
Um, you're supposed to write the tests first. The agents can't do this?
Actually, they're extremely bad at that. All the training data contains code + tests, even when the tests were created first. So far, every model I've tried has failed to write tests for interfaces without access to the actual code.
They can, but they should be explicitly told to. Otherwise they just do everything in batches. Anyway, pure TDD or not, the tests catch only what you tell the AI to write. The AI does not know what is right; it does what you told it to do. The above problem wouldn't be solved by pure TDD.
> I have no constructive recommendations; I feel the industry will keep their foot on the pedal until something catastrophic happens
I can't wait. Maybe when shitty vibe coded software starts to cause real pain for people we can return to some sensible software engineering
I'm not holding my breath though