I must be the dumbest "prompt engineer" ever. Each time I ask an AI to fix something or, even worse, create something from scratch, it rarely returns the right answer, and when asked for modifications it struggles even more.
All the incredible performance and success stories always come from these Twitter posts. I do find value in asking for simple but tedious tasks, like a small refactor or generating commands, but this "AI takes the wheel" level does not feel real.
I think it's probably the difference between "code" and "programming". An LLM can produce code, and if you're willing to surrender to the LLM's version of whatever it is you ask for, then you can have a great and productive time. If you're opinionated about programming, LLMs fall short. Most people (software engineers, developers, whatever) are not "programmers", they're "coders", which is why they have a positive impression of LLMs: they produce code, LLMs produce code... so LLMs can do a lot of their work for them.
Coders used to be more productive by using libraries (e.g., don't write your own function for finding the intersection of arrays, use intersection from Lodash), whereas now libraries have been replaced by LLMs. Programmers laughed at the absurdity of left-pad[1] ("why use a dependency for 16 lines of code?") whereas coders thought left-pad was great ("why write 16 lines of code myself?").
If you think about code as a means to an end, and focus on the end, you'll get much closer to the magical experience you see spoken about on Twitter, because their acceptance criteria is "good enough" not "right". Of course, if you're a programmer who cares about the artistry of programming, that feels like a betrayal.
[1] https://en.wikipedia.org/wiki/Npm_left-pad_incident
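To make the left-pad point concrete, this is roughly the scale of function we're arguing about. A minimal sketch in Python (the real left-pad is a similarly tiny piece of JavaScript):

    def left_pad(s, length, ch=" "):
        """Pad s on the left with ch until it is at least `length` characters long."""
        s = str(s)
        while len(s) < length:
            s = ch + s
        return s

    left_pad("42", 5, "0")  # -> "00042"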
Oh, this captures my experience perfectly.
I've been using Claude Code a lot recently, and it's doing amazing work, but it's not exactly what I want it to do.
I had to push it hard to refactor and simplify, as the code it generated was often far more complicated than it needed to be.
To be honest though, most of the code it generated I would accept if I was reviewing another developer's work.
I think that's the way we need to look at it. It's a junior developer that will complete our tasks, not always in our preferred way, but at 10x the speed, and frequently make mistakes that we need to point out in CR. It's not a tool which will do exactly what we would.
My experience so far on Claude 3.7 has been over-engineered solutions that are brittle. Sometimes they work, but usually not precisely the way I prompted it to, and often attempts to modify them require more refactoring due to the unnecessary complexity.
This has been the case so far in both js for web (svelte, react) and python automation.
I feel like 3.5 generally came up "short" more often than 3.7, but in practical usage it meant I could more easily modify and build on top of. 3.7 has led to a lot of deconstructing, reprompting, starting over.
All I really care about is the end result and, so far, LLMs are nice for code completion, but basically useless for anything else.
They write as much code as you want, and it often sorta works, but it's a bug-filled mess. It's painstaking work to fix everything, on par with writing it yourself. Now, you can just leave it as-is, but what's the use of releasing software that crappy?
I suppose it's a revolution for that in-house crapware company IT groups create and foist on everyone who works there. But the software isn't better, it just takes a day rather than 6 months (or 2 years or 5 years) to create. Come to think of it, it may not be useful for that either… I think the end-purpose is probably some kind of brag for the IT manager/exec, and once people realize how little effort is involved it won't serve that purpose.
I love the subtle mistakes that get introduced in strings for example that then take me all the time I saved to fix.
Do you have an example of this?
Can’t remember account login so created a new account to respond.
I recently used Claude with something along the lines of “Ruby on rails 8, Hotwire, stimulus, turbo, show me how to do client side validations that don’t require a page refresh”
I am new to prompt engineering, so feel free to critique. Anyway, it generated a Stimulus controller called validations_controller.js and then proceeded to print out all of the remaining connected files, but in all of them it referred to the string "validation", not "validations". The solution it provided worked great and did exactly what I wanted (though I expected a Turbo Frame based solution rather than a Stimulus solution, but whatever, it did what I asked it to do), with the exception of having to change all of the places where it put the string "validation" where it needed to put "validations" to match the name it used in the provided Stimulus controller.
Say you hire a developer and ask him to directly debug an issue by simply skimming through the codebase. Do you think he can complete this task in, say, 5-10 minutes? No, right? In Claude Code (CC), do the following:
1. Run /init, which acts as a project guide.
2. Ask it to summarize the project and save it as summary.md.
3. Make the prompt clear and detailed. Here's an example: https://imgur.com/a/RJyp3f9
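A hypothetical sketch of what such a session might look like (the prompts and the bug are made up for illustration; /init is the real Claude Code command that generates a CLAUDE.md project guide):

    /init    # generates CLAUDE.md as a project guide
    "Summarize this project's architecture, main modules, and data flow, and save it as summary.md"
    "Using summary.md as context, investigate why the CSV export times out for large accounts.
     List the files and functions involved and propose a fix before changing any code."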
I remember reading the origin article for that prompt example and laughing at how long it likely took to write that essay when typing "hikes near San Francisco" into your favoured search engine will do the same thing, minus the hallucinations.
You can ask AI to help with your prompt
are you in the habit of saving bad LLM output to later reference in future Internet disputes?
??? A zillion LLM tools maintain history for you automatically. As long as you remember what the chat was about, it's only a search away.
Have you tried using Cursor rules? [1]
Creating a standard library (stdlib) with many (potentially thousands) of rules, and then iteratively adding to and amending the rules as you go, is one of the best practices for successful AI coding.
[1] https://docs.cursor.com/context/rules-for-ai
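For anyone who hasn't tried them, a rule is just a short, scoped instruction the model sees on every relevant request. The exact file layout depends on your Cursor version (older releases used a single .cursorrules file; newer ones use a .cursor/rules directory), so treat this as a hypothetical sketch of the content rather than the precise format:

    # Hypothetical rule for code under src/api/ (paths, helper names, and conventions are examples, not Cursor's)
    When generating or editing code in src/api/:
    - Return errors through the existing ApiError helper; never raise bare exceptions.
    - Every new endpoint needs a request-validation schema and a unit test.
    - Name new handler modules <resource>_handler.py.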
> …many (potentially thousands) of rules, and then iteratively adding to and amending the rules as you go…
Is this an especially better (easier, more efficient) route to a working, quality app/system than conventional programming?
I'm skeptical if the answer to achieving 10x results is 10x more effort.
It's such a fast moving space, perhaps the need for 'rules' is just a temporary thing, but right now the rules will help you to achieve more predictable results and higher quality code.
You could easily end up with a lot of rules if you are working with a reasonably large codebase.
And as you work on your code, every time you have to deal with an issue of the code generation you ask Cursor to create a new rule so that next time it does it correctly.
In terms of AI programming vs conventional programming, the writing's on the wall: AI assistance is only getting better and now is a good time to jump on the train. Knowing how to program and configure your AI assistants and tools is now a key software engineering skill.
Or it's just a bubble and after diminishing returns they'll go in the bin with all the blockchain startups lol
10x more effort once, 10x faster programming forever. Also once you got examples of the rules files, LLM can write most of them for next projects.
I think specifying rules could be very useful in the same way as types, documentation, coding styles, advanced linting, semgrep etc.
We could use it for LLM driven coding style linting. Generate PRs for refactoring. Business logic bug detector
Also, you can just tell copilot to write rules for you.
at that point aren't you just replacing regular programming with creating the thousands of rules? I suppose the rules are reusable so it might be a form of meta-programming or advanced codegen
People endlessly creating "rules" for their proompts is the new version of constantly tweaking Vim or Emacs configurations
Born too late to explore the world.
Born too early to explore the galaxy.
Born at the right time to write endless proompts for LLMs.
Right on point. The same principle applies when deciding whether to use a framework or not. Coders often marvel at the speed with which they can build something using a framework they don’t fully understand. However, a true programmer seeks to know and comprehend what’s happening under the hood.
I'll preface this with the fact that I agree there is a difference between using a framework and being curious enough to dig into the details, but I think you're veering into No True Scotsman territory here.
IMO, the vast majority of programmers wouldn't meet the definition you've put forward here. I don't know many that dig into the operating system, platform, or hardware all that much, though I work in streaming media so that might just be an industry bias.
I'm the one who feels uncomfortable when something is magical. I'm trying to dig whenever possible. I don't always have success. I know very little about Linux kernel internals, for example.
That said, it rarely pays off. Quite the opposite: I often spend lots of time digging unnecessarily. I'm not complaining; I am who I am and I don't want to change. Digging into internals makes me happier and satisfied. And sometimes it is useful. So Linux kernel internals are on the roadmap, and hopefully I'll dig into them some day.
I agree that the absolute majority of people I've met are not of that kind. And probably they should not be. Learning the business side of the software is what makes someone really useful. I hate the business side; I only love the computer side.
I find that my understanding of different layers is usually not needed, but every now and then it is important to my ability to solve thorny issues, figure out ways to do things, or debug issues that others are not even trying to touch.
I'm working in areas of systems programming where such knowledge and willingness to go deeper into the stack is helpful, even if rarely so.
I can't say I understand the kernel; I've only scratched the surface of most of it, and I only dug deeper where I really needed to, and only as deep as time allowed and the need required.
That just means they are not programmers, but coders as defined here.
This aligns with my experience. I've seen LLMs produce "code" that the person requesting is unable to understand or debug. It usually almost works. It's possible the person writing the prompt didn't actually understand the problem, so they got a half baked solution as a result. Either way, they need to go to a human with more experience to figure it out.
Tbh, if I don't understand the generated code perfectly, meaning it uses something I don't quite know, I usually spend approximately the same time understanding the generated code as I would writing it myself.
I'm waiting for artisan programming to become a thing.
by 100% organic, free range and fair trade programmers
Replace programmers with 'intelligence' to contrast with artificial
Artisanal code has been a thing for a long while.
If we're the luddite artisans, LLMs seem to represent the knitting frames which replaced their higher quality work with vastly cheaper, far crappier merchandise. There is a historical rhyme here.
You didn't have to spend time debugging a piece of cloth, and cloth defects are obvious.
There's a lot of code out there written for people who are far more concerned with cost and speed than quality - analogous to the "fast fashion" consumer segment.
I've worked on all sorts of code bases filled to the brim with bugs which end users just worked around or ignored or didn't even encounter. Pre-product-market-fit startups, boring crappy CRUD for routine admin, etc.
It was horrible shit for end users and developers (me) but demand is very high. I expect demand from this segment will increase as LLMs drive the cost of supply to nearly zero.
I wouldn't be surprised if high-end software devs (e.g. >1 million hit/day webapps where quality is critical) barely do anything different while the demand for devs at the low end of the market craters.
> I wouldn't be surprised if high-end software devs (e.g. >1 million hit/day webapps where quality is critical) barely do anything different while the demand for devs at the low end of the market craters.
Over the last few decades the prevalence of more and more frameworks to handle more and more of what used to be boilerplate that everyone had to do for GUI apps has not significantly impacted the bottom end of the job market.
It's only, if anything, caused even higher demand for "just one more feature".
When's the last time you worked on a product in that space where the users or product managers didn't have any pending feature requests? Didn't add new ideas to the backlog faster than the team could implement?
Someone's gonna have to know just enough to make sure the LLM is producing shit that passes the (low) bar of adequateness. But that job will probably be even further away from the CS-knowing, performance-critical, correctness-matters roles than it is today. It's already a big gap, I just don't see the shit-shoveler job going away. The corporate world loves its shit shoveling.
Artisanal firmware is the future (or the past? or both?): https://www.youtube.com/watch?v=vBXsRC64hw4
From before people even knew what llms were: https://handmade.network
Like, writing binaries directly? Is assembly code too much of an abstraction?
People stress about good system design because of maintainability. No one cares about binary code because that's just the end result. What matters is the code that generates it, as that’s what needs to be maintained.
We have not yet reached a point where LLM generated code can also be maintained by LLMs and the tooling is not there. Once that happens, your argument will hold more weight. But for now, it doesn’t. Injecting unreadable, likely bug-ridden code into your application increases technical debt tenfold.
> If you think about code as a means to an end, and focus on the end
The problem with this is that you will never be able to modify the code in a meaningful way after it crosses a threshold, so either you'll have a prompt only modification ability, or you will just have to rewrite things from scratch.
I wrote my first application ever (equivalent to an education CMS today) in the very early 2000s with barely any notion of programming fundamentals. It was probably a couple hundred thousand lines of code by the time I abandoned it.
I wrote most of it in HTML, JS, ASP and SQL. I was in high school. I didn't know what common data structures were. I once asked a professor when I got into late high school "why arrays are necessary in loops".
We called this cookbook coding back in the day.
I was pretty much laughed at when I finally showed people my code, even though it was a completely functional application. I would say an LLM probably can do better, but it really doesn't seem like something we should be chasing.
I tried LLMs for my postgraduate "programming" tasks to create lower-level data structures and algorithms that it is possible to write detailed requirements for - they failed miserably. When I pushed in certain directions, I got student-level replies like "collision probability is so low we can just ignore it", while the same LLM accurately estimated that in my dataset there would be collisions.
And I won't believe it until I see an LLM use a real debugger to figure out the root cause of a sophisticated, cascading bug.
This surrendering to the LLM has been going around a lot lately. I can only guess it is from people that haven't tried it very much themselves but love to repeat experiences from other people.
I’m a software developer by trade but also program art creation tools as a hobby. Funny thing is, at work, code is definitely a means to an end. But when I’m doing it for an art project, I think of the code as part of the art :) the process of programming and coming up with the code is ultimately a part of the holistic artistic output. The two are not separate, just as the artists paint and brushes are also a part of the final work of a painting.
> LLMs version of whatever it is you ask for, then you can have a great and productive time
Sure, but man are there bugs.
This is untrue.
You can over-specify your prompts and say exactly what types and algorithms you want if you're opinionated.
I often write giant page long specs to get exactly the code I want.
It’s only 2x as fast as coding, but thinking in English is way better than coding.
Also, if you cannot tell the difference between code written by an LLM or a human, what is the difference? This whole post is starting to feel like people with very strong (gatekeeper-ish) views on hi-fi stereo equipment, coffee, wine, ... and programming. Or should I say "code-as-craft" <cringe>?
My comment isn't intended to pit "programmers" against "coders" or suggest that one is better than the other. I think the distinction is useful to help people understand why LLMs can be game-changing for some, and useless for others, because our attitudes towards programming/code can be so different.
If you go through and read the left-pad posts here on hn, you'll find people at both extremes: some fiercely defend left-pad as useful and worthwhile and think writing left-pad yourself is dumb-as-hell, and then at the other end you'll find some fiercely deride using left-pad as absurd when they could just write the code themselves. Here's a good thread to start with: https://news.ycombinator.com/item?id=11348798
Personally, I'd rather hire a "coder" than a "programmer" and consider myself more "coder" than "programmer" :)
Thank you for eloquently saying what I've been trying hard to express.
Interesting, but it seems ridiculous to disambiguate “Programmer” vs “Coder”.
They’re synonymous words and mean the same thing right?
Person who writes logic for machines
The different mindsets exist, but I agree these are bad words to differentiate them. Back when I started in software in the 80s a common expression was: there are two types of programmers, nerds and hippies. The distinction falling along similar lines - nerds needed to taste the metal, while hippies were more interested in getting stuff done.
There may never be a perfect taxonomy of programmer archetypes.
I imagine most of us here can agree that some elevate the craft and ritual to great effect while others avoid such high-minded conceits in favor of shipping whatever hits the expected target this week.
I’ve been both at different points in my career. Usually it’s a response to my environment.
Who cares? Substitute whatever labels you prefer, but the distinction between the two groups is certainly real.
Obviously you do - enough to respond.
What about those two vs. the concept of "software engineering"? - there, the "code" or "program" is even _less_ important, just a tool in an ecosystem where you are asking bigger questions like "is this maintainable / testable / readable / etc.?" "is this the right language / framework for my org / others to use" and so on and so on. These questions and context quite literally represent billions of context tokens in LLM world and why I continually do not worry about 99% of the fear mongering of them replacing anybody who writes code.
Yeah this one is dumb.
The real distinction is programmer vs software engineer, or IOW do you want to write code or solve problems?
Some hints for people stuck like this:
Consider using Aider. It's a great tool and cheaper to use than Code.
Look at Aider's LLM leaderboard to figure out which LLMs to use.
Use its architect mode (although you can get quite fast without it - I personally haven't needed it).
Work incrementally.
I use at least 3 branches. My main one, a dev one and a debug one. I develop on dev. When I encounter a bug I switch to debug. The reason is it can produce a lot of code to fix a bug. It will write some code to fix it. That won't work. It will try again and write even more code. Repeat until fixed. But in the end I only needed a small subset of the new code. So you then revert all the changes and have it fix it again telling it the correct fix.
Don't debug on your dev branch. (A rough git sketch of this three-branch flow is below, after the links.)
Aider's auto committing is scary but really handy.
Limit your context to 25k.
Only add files that you think are necessary.
Combining the two: Don't have large files.
Add a Readme.md file. It will then update the file as it makes code changes. This can give you a glimpse of what it's trying to do and if it writes something problematic you know it's not properly understanding your goal.
Accept that it is not you and will write code differently from you. Think of it as a moderately experienced coder who is modifying the codebase. It's not going to follow all your conventions.
https://aider.chat/
https://aider.chat/docs/leaderboards/
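A rough sketch of the three-branch flow described above (branch names and the cherry-pick are just illustrative; the important part is that the LLM's thrashing never lands on dev):

    git checkout -b dev main        # day-to-day LLM-assisted work happens here
    git checkout -b debug dev       # hit a bug: switch and let Aider iterate freely
    # ...Aider auto-commits attempt after attempt until the bug is finally fixed...
    git checkout dev                # back on dev once the root cause is known
    git cherry-pick <sha-of-minimal-fix>   # or simply re-prompt with the correct fix
    git branch -D debug             # throw away the noisy debug history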
> I use at least 3 branches. My main one, a dev one and a debug one. I develop on dev. When I encounter a bug I switch to debug. The reason is it can produce a lot of code to fix a bug. It will write some code to fix it. That won't work. It will try again and write even more code. Repeat until fixed. But in the end I only needed a small subset of the new code. So you then revert all the changes and have it fix it again telling it the correct fix.
how big/complex does the codebase have to be for this to actually save you time compared to just using a debugger and fixing it yourself directly? (I'm assuming here that bugs in smaller codebases are that much easier for a human to identify quickly)
So far I've used Aider for only a few projects - almost all where it starts from scratch. And virtually always for personal use - not work. As such, the focus on quality is not as high (i.e. there's no downside to me letting it run wild).
So I hope you can understand it when I say: Why should I waste my brain cells debugging when I can just tell it to fix its own problems?
Say you want something done (for personal use), and don't have the time to develop it yourself. Someone volunteers to write it for you. You run it, and it fails. Would you spend much time reading code someone else wrote to find the bug, or just go back to the guy with the error?
Yes, I have had to debug a few things myself occasionally, but I do it only when it's clear that the LLM isn't able to solve it.
In other cases, I'm writing something in a domain I'm not fully knowledgeable (or using a library I'm only mildly familiar with). So I lack the knowledge to debug quickly. I would have to read the docs or Google. Why Google when the LLM is more effective at figuring it out? Certainly in a few cases, the solution turned out to require knowledge I did not have, and I appreciate that the LLM solved it (and I learned something as a result).
The point with all this is: the experience is not binary. It's the full spectrum. For the main codebase I'm responsible for at work, I haven't bothered using an LLM (and I have access to Copilot). I need to ensure the quality of the code, and I don't want to spend my time understanding the code the LLM wrote - to the level I would need to feel comfortable pushing to production.
Thanks, that's a helpful set of hints!
Can you provide a ballpark of what kind of $ costs we are talking here for using Aider with, say, Claude? (or any other provider that you think is better at the moment).
Say a run-of-the-mill bug-fixing session from your experience vs the most expensive one off the top of your head?
I've used it only a few times - mostly for projects written from scratch (not existing codebases). And so far only with Claude Sonnet.
Twice I had a "production ready" throwaway script in under $2 (the first was under a dollar). Both involved some level of debugging. But I can't overstate how awesome it is to have a single use script be so polished (command line arguments, extensive logging, etc). If I didn't make it polished, it would probably have been $0.15.
Another one I wrote - I probably spent $5-8 total, because I actually had it do it 3 times from scratch. The first 2 times there were things I wasn't happy with, or the code got too littered with attempts to debug (and I was not using 3 branches). When I finally figured everything out, I started again for the third time and it was relatively quick to get something working.
Now if I did this daily, it's a tad expensive - $40-60/month. But I do this only once or twice a week - still cheaper than paying for Copilot. If I plan to use it more, I'd likely switch to DeepSeek. If you look at the LLM leaderboard (https://aider.chat/docs/leaderboards/), you'll see that R1 is not far behind Sonnet, and is a third of the cost.
When I was using aider with Claude 3.5 api it cost about $0.01 per action.
The three-branch thing is so smart.
It took a while for me to realize it, and frankly, it's kind of embarrassing that I didn't think of it immediately.
It is, after all, what many of us would do in our manual SW development. But when using an LLM that seems pretty good, we just assume we don't need to follow all the usual good practices.
Does the LLM make commits along the way? I think I'm missing why you need all these branches vs git reset --hard once it figures out the bug?
Aider, by default, makes commits after each change (so that you can easily tell it to "undo"). Once a feature is done, you manually squash the commits if desired. Some people love it, some hate it.
You can configure it not to autocommit, although I suppose the "undo" command won't work in that case.
That just sounds like ⌘-Z with extra steps.
Aider doesn't run in your editor. Undo in editor won't undo Aider's changes.
do you have a special prompt to instruct aider to log file changes in the repo's README? I've used aider in repos with a README.md but it has not done this update. (granted, i've never /add the readme into aider's context window before either...)
Take a look at conventions.md in the aider.chat documentation.
I have the same experience.
Where AI shines for me is as a form of a semantic search engine or even a tutor of sorts. I can ask for the information that I need in a relatively complex way, and more often than not it will give me a decent summary and a list of "directions" to follow-up on. If anything, it'll give me proper technical terms, that I can feed into a traditional search engine for more info. But that's never the end of my investigation and I always try to confirm the information that it gives me by consulting other sources.
Exactly the same experience: since the early-access GPT-3 days, I played out various scenarios, and the most useful case has always been to use generative AI as semantic search. Its generative features are just lacking in quality (for anything other than a toy project), and the main issues since the early GPT days remain: even though it gets better, it's still too unreliable for serious work on mid-complex systems. Also, if you don't pay attention, it messes up other parts of the code.
Yeah, I have had some "magic" moments where I knew "what" I needed and had an idea of "how it would look", but no idea how to do it, and AI helped me understand how I should do it instead of the hacky, very stupid way I would have done it.
Same here. In some cases, brainstorming even kinda works – I mean, it usually gives very bad responses, but it serves as a good duck.
Code? Nope.
I've done code interviews with hundreds of candidates recently. The difference between those who are using LLMs effectively and those who are not is stark. I honestly think engineers who think like OP are going to get left behind. Take a weekend to work on getting your head around this by building a personal project (or learning a new language).
A few things to note:
a) Use the "Projects" feature in Claude web. The context makes a significant amount of difference in the output. Curate what it has in the context; prune out old versions of files and replace them. This is annoying UX, yes, but it'll give you results.
b) Use the project prompt to customize the response. E.g. I usually tell it not to give me redundant code that I already have. (Claude can otherwise be overly helpful and go on long riffs spitting out related code, quickly burning through your usage credits).
c) If the initial result doesn't work, give it feedback and tell it what's broken (build messages, descriptions of behavior, etc).
d) It's not perfect. Don't give up if you don't get perfection.
Hundreds of candidates? That's significant if not an exaggeration. What are the stark differences you have seen? Did you inquire about the candidate's use of language models?
Yes. I do async video interviews in round 1 of my interview process in order to narrow the candidate funnel. Candidates get a question at the start of the interview, with a series of things to work through in their own IDE while sharing their screen. I review all recordings (though I will skip around, and if candidates don't get very far I won't spend a lot of time watching at 1x speed.) The question as laid out encourages them to use all of the tools they usually rely on while coding (including google, stackoverflow, LLMs, ...).
Candidates who use LLMs generally get through 4 or 5 steps in the interview question. Candidates who don't are usually still on step 2 by the end of the interview (with rare exceptions), without their code quality being significantly better.
(I end up in 1:1 interviews with perhaps 10-15% of candidates who take round 1).
So you’re not _interviewing_ them, you’re having them complete expensive work-sample tests. And your evaluation metric is “completes lots of steps in a small time box.”
Seems more like trying to find the most proficient LLM users than anything else. I've never done interviews but I imagine I'd be hard pressed to skip candidates solely because they aren't using LLMs.
Each to their own and maybe their method works out, but it does seem whack.
The thing is, when you're doing frontend, a human programmer can't write 4,000 lines of React code in 1 hour. A properly configured LLM system can.
This is why I wouldn't hire a person who doesn't know how to do this.
What are you doing where 4000 lines of LLM-generated code per hour is a net positive? Sounds like a techdebt machine to me.
UIs in React are very verbose. I'm not saying this is running 24/7.
Is the question actually difficult, though? If you ask for some standard task, then of course those who are leaning heavily on LLMs will do well, as that's exactly where they work best. That doesn't tell you anything about the performance of those candidates in situations where the LLM won't help them.
I suppose, if you are specifically looking for coders to perform routine tasks, then you'll get what you need.
Of course, you could argue that ~90% of a programmer's work day is performing standard tasks, and even brilliant programmers who don't use LLMs will lose so much productivity that they are not worth hiring... Counterpoint: IMO, the amount of code you bash out in a given time bears no relation to your usefulness as a programmer. In fact, producing lots of code is often a problem.
No, I'm not doing leetcode or algorithm questions - it's basically "build a [tiny] product to specs", in a series of steps. I'm evaluating candidates on their process, their effectiveness, their communication (I ask for narration), and their attention to detail. I do review code afterwards. And, bear in mind that this is only round 1 - once I talk with the ones who do well, I'll go deep on a number of topics to understand how well rounded they are.
I think it's a reasonably balanced interview process. Take home tests are useless now that LLMs exist. Code interviews are very time consuming on the hiring side. I'm a firm believer that hiring without some sort of evaluation of practical competence is a very bad idea - as is often discussed on here, the fizzbuzz problem is real.
> it's basically "build a [tiny] product to specs", in a series of steps
That seems like exactly what the person you're replying to is saying - that sounds like basic standard product-engineering stuff, but simpler, like any of a million examples out there that an LLM has seen a million times. "Here's a problem LLMs are good at, wow, the people using the LLMs do best at it." Tautology.
So it's great for finding people who can use an LLM to do tiny product things.
In the same way takehomes had all the same limitations. More power to you if those are the people you are looking for, though.
But it also sounds like a process that most people with better options are gonna pass on most of the time. (Also like with takehomes.)
> "Here's a problem LLMs are good at, wow, the people using the LLMs do best at it."
Yes, product engineering, the thing that 90% of developers do most of their time.
But what haven't LLMs looked at?
LLMs have looked at everything that exists so far. If all you're creating is yet another "Uber for Dogs", LLMs will do fine.
But if you want to create a paradigm shift, you need to do something new. Something LLMs don't yet know about.
> I ask for narration
That's a mistake. There are plenty of people who are not good multitaskers and cannot effectively think AND talk at the same time, myself included.
IMHO, for programming, monotaskers tend to do better than multitaskers.
Haven't coded for a couple years (but have been a dev for two decades) and haven't used LLM's myself for coding (not against this), so am really just curious, wouldn't you want to know if a dev can solve and understand problems themselves?
Because it seems like tougher real-world technical problems (where there are tons of dependencies on other systems in addition to technical and business requirements) need the dev to have an understanding of how things work, and if you rely on an LLM, you may not gain enough of an understanding of what's going on to solve problems like this...
... Although, I could see how devs that are more focused on application development and knowing the business domain is their key skill, wouldn't need to have as strong an understanding of the technical (no judgement here, have been in this role myself at times).
> Haven't coded for a couple years (but have been a dev for two decades) and haven't used LLM's myself for coding (not against this), so am really just curious, wouldn't you want to know if a dev can solve and understand problems themselves?
Yes, definitely, though I lean more on the 1:1 interviews for that. I understand the resistance to this from developers, but there's a lot of repetition across the industry in product engineering, and so of course it can be significantly optimized with automation. But, you still need good human brains to make most architectural decisions, and to develop IP.
Ah, I see, round 1 is just the initial weeder, while on top of this, you'd like devs that are using LLM's for automation. Sounds like a good balance:)
Are you concerned that eventual LLM price hikes in the near future reflecting their real cost might explode your cost or render your workforce ineffective?
If it's real, that person interviewed at least one candidate per day last year. Idk what kind of engineering role, in what kind of org, you'd even do that in.
I suspect he doesn't do much engineering, which would explain why he's impressed by candidates who can quickly churn out small rote sample projects with AI. Anyone who actually writes software for a living knows that working on a large production code base has little in common with this.
When I've had an open req for my team at a California tech company I've had days where I would interview (remotely) 2-3 candidates in a single day, several days a week for several weeks straight. It's not impossible to interview 100 people in a few months at that rate.
So... do you really think you were doing it right?
Sorry this is a little harsh, but how do you get anywhere near 100 people before realizing the approach must be horribly flawed, and devising and implementing a better one? Surely it behooves you to not waste your employer's time, your own time, and the time of all those people you're interviewing (mostly pointlessly).
These were 25 minute phone screens for candidates that had been sourced by recruiting, a few years ago at a company that was going through hyper-growth (hiring hundreds of engineers in a year). Phone screening several dozen people for 2-4 eventual hires doesn't feel too inefficient to me.
There are companies whose product is high-quality mock interviews. I wouldn't be surprised by that number of interviews in just a year and it can easily be more than one candidate per day.
Edit: there are also recruitment agencies with ex-engineers that do coding interviews, too.
I'd add to that that the best results are with clear spec sheets, which you can create using Claude (web) or another model like ChatGPT or Grok. Telling them what you want and what tech you're using helps them create a technical description with clear segments and objectives, and in my experience works wonders in getting Claude Code on the right track, where it has full access to the entire context of your code base.
> The difference between those who are using LLMs effectively and those who are not is stark.
Same here. Most candidates I interviewed said they did not use AI for development work. And it showed. These guys were not well informed on modern tooling and frameworks. Many of them seemed stuck in/comfortable with their old way of doing things and resistant to learning anything new.
I even hired a couple of them, thinking that they could probably pick up these skills. That did not happen. I learned my lesson.
Isn't that more correlation than causation, though? The kind of person who's not keeping up with the current new tech hotness isn't going to be looking at AI or modern frameworks; and conversely the kind of person who's dabbling with AI is also likely to be looking at other leading-edge tech stuff in their field. That seems to me more likely to be the cause of what you're seeing than that the act of using AI/LLMs itself resulting in candidates improving their knowledge and framework awareness.
My workflow for that kind of thing goes something like this (I use Sonnet 3.7 Thinking in Cursor):
1. 1st prompt is me describing what I want to build, what I know I want and any requirements or restrictions I'm aware of. Based on these requirements, ask a series of questions to produce a complete specification document.
2. Workshop the specification back and forward until I feel it's complete enough.
3. Ask the agent to implement the specification we came up with.
4. Tell the agent to implement Cursor Rules based on the specifications to ensure consistent implementation details in future LLM sessions.
I'd say it's pretty good 80% of the time. You definitely still need to understand the problem domain and be able to validate the work that's been produced but assuming you had some architectural guidelines you should be able to follow the code easily.
The Cursor Rules step makes all the difference in my experience. I picked most of this workflow up from here: https://ghuntley.com/stdlib/
Edit: A very helpful rule is to tell Cursor to always check out a new branch based on the latest HEAD of master/main for all of its work.
I need to steal the specification idea.
Cursor w/ Claude has a habit of running away on tangents instead of solving just the one problem, then I need to reject its changes and even roll back to a previous version.
With a proper specification as guideline it might stay on track a bit better.
Copilot supports this somewhat natively:
https://docs.github.com/en/copilot/customizing-copilot/addin...
The first thing I do for a new project is ask Copilot to create a custom-instructions.md for me and then as I work on my projects, I ask it to update the instructions every now and then based on the current state of my project.
Much less misses this way in my experience.
Cool!
I just spent last night working with Cursor and Claude Code, both support different styles of custom rules.
Then I got to work and was lamenting what my corp sponsored Copilot can't do the same - but turns out it can! :D
EDIT: Just tried it and Copilot is completely clueless. It can explain the file, but doesn't know how to generate one. Claude generates its own CLAUDE.md on startup. Cursor can create its own rulefiles pretty well, even better if you add a rule for generating rules.
I have had success with having copilot generate the file for me with no issues. I use sonnet 3.5 or 3.7. I start by writing a paragraph describing the project and it does pretty well.
I decided to try seriously the Sonnet 3.7. I started with a simple prompt on claude.ai ("Do you know claude code ? Can you do a simple implementation for me ?"). After minimal tweaking from me, it gave me this : https://gist.github.com/sloonz/3eb7d7582c33e95f2b000a0920016...
After interacting with this tool, I decided it would be nice if the tool could edit itself, so I asked (him ? it ?) to create its next version. It came up with a non-working version of this https://gist.github.com/sloonz/3eb7d7582c33e95f2b000a0920016.... I fixed the bug manually, and that started an interactive loop: I could now describe what I wanted, describe the bugs, and the tool would add the features/fix the bugs itself.
I decided to rewrite it in Typescript (by that I mean: can you rewrite yourself in typescript). And then add other tools (by that: create tools and unit tests for the tools). https://gist.github.com/sloonz/3eb7d7582c33e95f2b000a0920016... and https://gist.github.com/sloonz/3eb7d7582c33e95f2b000a0920016... have been created by the tool itself, without any manual fix from me. Setting up the testing/mock framework ? Done by the tool itself too.
In one day (and $20), I essentially had recreated claude-code. That I could improve just by asking "Please add feature XXX". $2 a feature, with unit tests, on average.
So you’re telling me you spent 20 dollars and an entire day for 200 lines of JavaScript and 75 lines of python and this to you constitutes a working re-creation of Claude Code?
This is why expectations are all out of whack.
That amount of output is comparable to what many professional engineers produce in a given day, and they are a lot more expensive.
Keep in mind this is the commenters first attempt. And I'm surprised he paid so much.
Using Aider and Sonnet I've on multiple occasions produced 100+ lines of code in 1-2 hours, for under $2. Most of that time is hunting down one bug it couldn't fix by itself (reflective of real world programming experience).
There were many other bugs, but I would just point out the failures I was seeing and it would fix it itself. For particularly difficult bugs it would at times even produce a full new script just to aid with debugging. I would run it and it would spit out diagnostics which I fed back into the chat.
The code was decent quality - better than what some of my colleagues write.
I could probably have it be even more productive if I didn't insist on reading the code it produced.
The lines of code isn’t the point. Op claimed they asked Claude to recreate Claude code and it was successful. This is obviously an extreme exaggeration. I think this is the crux of a lot of these posts. This code generator output a very basic utility. To some this is a revelation, but it leaves others wondering what all the fuss is about.
It seems to me people’s perspective on code gen has largely to do with their experience level of actually writing code.
It's a very narrow reading of his comment. What he meant to say was it quickly created a rudimentary version of an AI code editor.
Just as a coworker used it to develop an AI code review tool in a day. It's not fancy - no bells and whistles, but it's still impressive to do it in a day with almost no manual coding.
> In one day (and $20), I essentially had recreated claude-code.
Not sure it’s a narrow reading. This is my point, if it’s a basic or rudimentary version people should be explicit about that. Otherwise these posts read like hype and only lead to dissatisfaction and disappointment for others.
s/reading/interpretation/
Reading something literally is by definition the narrowest interpretation.
> Using Aider and Sonnet I've on multiple occasions produced 100+ lines of code in 1-2 hours, for under $2. Most of that time is hunting down one bug it couldn't fix by itself (reflective of real world programming experience).
Was this using technologies you aren't familiar with? If not, the output rate seems pretty low (very human-paced, just with an extra couple bucks spent.)
By 100+ I mean 100-300 lines. I think most people aren't churning out 100 lines of code per hour unless it involves boilerplate.
More importantly, the 100-300 lines was very low effort for me. That does have its downsides (skills atrophy).
Remember that input tokens are quadratic with the length of the conversation, since you re-upload the n previous messages to get the (n+1)-th message. When Claude completes a task in 3-4 shots, that's cents. When he goes down a rabbit hole, however…
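A rough back-of-the-envelope sketch of why that adds up (message size and turn count are made-up numbers for illustration):

    # Each turn re-sends the whole conversation, so billed input tokens grow quadratically with turns.
    m = 1_000                       # assume ~1k tokens added per message
    turns = 30
    total_input = sum(m * k for k in range(1, turns + 1))   # = m * turns * (turns + 1) / 2
    print(total_input)              # 465000 input tokens billed, vs only 30000 tokens of new text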
In Aider there's a command to "reset" so it doesn't send any prior chat. Whenever I complete a mini feature I invoke the command. It helpfully shows the size of the current context in tokens and the cost, so I keep an eye on it.
Doesn't Code have a similar option?
It does — /clear. It also has /compact to summarize previous tasks to preserve some situational awareness while reducing context bulk.
2200 lines. Half of them unit tests I would probably have been too lazy to write myself even for a "more real" project. Yes, I consider $20 cheap for that, considering:
1. It's a learning experience.
2. Looking at the chat transcripts, many of those dollars are burned for stupid reasons (Claude often fails with the insertLines/replaceLines functions and breaks files due to off-by-one offsets) that are probably fixable.
3. Remember that Claude started from a really rudimentary base with few tools — the bootstrapping was especially inefficient.
Next experiment will be on an existing codebase, but that’s probably for next weekend.
Thanks for writing up your experience and sharing the real code. It is fascinating to see how close these tools can now get to producing useful, working software by themselves.
That said - I'm wary of reading too much into results at this scale. There isn't enough code in such a simple application to need anything more sophisticated than churning out a few lines of boilerplate that produce the correct result.
It probably won't be practical for the current state of the art in code generators to write large-scale production applications for a while anyway just because of the amount of CPU time and RAM they'd need. But assuming we solve the performance issues one way or another eventually it will be interesting to see whether the same kind of code generators can cope with managing projects at larger scales where usually the hard problems have little to do with efficiently churning out boilerplate code.
aider has this great visualisation of "self written code" - https://aider.chat/HISTORY.html
I suspect it would be somewhat challenging to do, but I'd love to see something like this where the contributions are bucketed into different levels of difficulty. It is often the case for me that a small percentage of the lines of code I write take a large percentage of the time I spend coding (and I assume this is true for most people).
LLM are replacing Google for me when coding. When I want to get something implemented, let's say make a REST request in Java using a specific client library, I previously used Google to find example of using that library.
Google has gotten worse (or the internet has more garbage) so finding a code example is more difficult than it used to be. Now I ask an LLM for an example. Sometimes I have to ask for a refinement, and usually something is broken in the example, but it takes less time to get the LLM-produced example to work than it does to find a functional example using Google.
But the LLM has only replaced my previous Google usage, I didn't expect Google to develop my applications and I don't with LLMs.
This has been my experience of successful usage as well. It's not writing code for me, but pulling together the equivalent of a Stack Overflow example and some explaining sentences that I can follow up on. Not perfect and I don't blindly copy paste it, same as Stack Overflow ever was, but faster and more interactive. It's helpful for wayfinding, but not producing the end result.
I used the Kagi free trial when I was doing Advent of Code in a somewhat unfamiliar language (Swift) last year, as well as ChatGPT occasionally.
The LLM was obviously much faster and the information was much higher density, but it had quite literally about a 20% rate of just making up APIs from my limited experiment. But I was very impressed with Kagi’s results and ended up signing up, now using it as my primary search engine.
It is really a double-edged sword. Some APIs I would not have found myself. In some ways an AI works like my mind fuzzily associating memory fragments: there should be an option for this command to do X, because similar commands have this option and it would be possible to provide it. But in reality the library is less than perfectly engineered and the option is not there. The AI also guesses the option is there. But I do not need a guess when I ask the AI - I need reliable facts. If the cost of an error is not high I still ask the AI, and if it fails it is back to RTFM, but if the cost of failure is high then everything that comes out of an LLM needs checking.
I did the Kagi trial in the fall of 2023 and tried to hobble along with the cheapest tier.
Then I got hooked by having a search engine that actually finds the stuff I need, and I've been a subscriber for a bit over a year now.
Wouldn't go back to Google lightly.
In order to use a library, I need to (this is my opinion) be able to reason about the library’s behavior, based on a specification of its interface contract. The LLM may help with coming up with suitable code, but verifying that the application logic is correct with respect to the library’s documented interface contract is still necessary. It’s therefore still a requirement to read and understand the library’s documentation. For example, for the case of a REST client, you need to understand how the possible failure modes of the HTTP protocol and REST API are translated by the library.
I wonder how good Google could be if they had a charge per query model that these LLMs do. AI or not, dropping the ad incentive would be nice.
I've found AI to be useful on precisely-scoped tasks I might assign to a junior programmer to take a day to do, like "convert this exact bash script to a Powershell script".
But in my own work, those tasks are pretty rare, like 3 times a month? Often I start working on something, and the scope and definition of success changes while I'm in the midst of it. Or it turns out to be harder than expected and it makes sense to timebox it and do a quick search for workarounds.
As much as we joke about StackOverflow commenters sometimes telling a question-asker they shouldn't be doing what they're trying to do, you do actually want that (soft) pushback some of the time. Most modern LLMs will gleefully come up with a complete plan for how we're gonna get the square peg in the round hole.
> you do actually want that (soft) pushback some of the time. Most modern LLMs will gleefully come up with a complete plan for how we're gonna get the square peg in the round hole.
I once accidentally asked a local DeepSeek distilled model to do the wrong thing by accidentally copy pasting the wrong variable name.
It told me how to do it, and then asked me if I was sure.
My local DeepSeek R1 model (deepseek-r1:7b) saw me trying to do something stupid (I was working with the wrong variable). It told me how to do what I asked and then asked:
> _Is this modification part of a larger change you're making to the code? I'd like to make sure we're not modifying something important that might have side effects._
Looking at its thought process:
> _The user wants to modify this string by replacing "-input" with "-kb". But looking at the ARN structure, I recall that AWS S3 ARNs are usually in the form arn:aws:s3:::bucket_name/path. The part after the bucket name is typically s3:// followed by the object key._

> _Wait, maybe the user has a specific reason to make this change. Perhaps they're formatting or structuring something for a different purpose. They might not realize that directly modifying ARNs can cause issues elsewhere if not done correctly._
That's nice!
I am a slow adopter of new tech but plan to spend a significant amount of time in 2025 using AI tools when coding. I am net negative on AI simply replacing programmers, but I think the practice of development is undergoing a seismic shift at this point.
My recent usage is oriented towards using pseudocode descriptions that closely map to Python to produce Python functions. I am very impressed with Claude 3.7's syntactic correctness when given a chunk of pseudocode that looks "python-y" to begin with.
My one concern is that much of my recent code requirements lack novelty. So there is a somewhat reasonable chance that the tool is just spitting out code it slurped somewhere on GitHub or elsewhere on the larger Internet. Just this week, I gave Claude a relatively "anonymous" function in pseudocode, meaning variable names were not particularly descriptive, with one tiny exception. However, Claude generated a situationally appropriate comment as part of the function definition. This was . . . surprising to me, unless the model somehow had in its training set a very close match to my pseudocode description that included enough context to add the comment.
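For what it's worth, here is the flavor of "python-y" pseudocode I mean, and the kind of function that comes back. This particular example is invented for illustration, not the function I described above:

    # Pseudocode handed to the model (hypothetical):
    #   for each record in records:
    #       skip if record["ts"] is older than cutoff
    #       keep the max score per user_id
    #   return (user_id, score) pairs sorted by score, descending
    def top_scores(records, cutoff):
        best = {}
        for r in records:
            if r["ts"] < cutoff:
                continue
            uid = r["user_id"]
            if uid not in best or r["score"] > best[uid]:
                best[uid] = r["score"]
        return sorted(best.items(), key=lambda kv: kv[1], reverse=True)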
At this point very little code is "novel". Everyone is simply rewriting code that has already been written in a similar form. The LLM isn't slurping up and restating code verbatim. It is taking code that it has seen thousands of times and generating a customized version for your needs. It's hubris to think that anyone here is generating "novel" code.
I have seen the argument that very little code is novel, but I find it inherently unsatisfying and lacking in nuance. I think what bugs me about it is that if you squint hard enough, all programming reduces to "take some data, do something to it." That "something" is doing a lot of heavy lifting in the argument that "something" is or isn't novel.
Heck, if we think about it from the programming language perspective, all code is "simply" using already existing language functions to cobble together a solution to some specific set of requirements. Is no program novel?
There is probably a consideration here that maybe boils down to the idea of engineering vs artisanal craftsmanship and where a specific project falls in that spectrum . . .
Yeah, it's so bad now I only trust my eyes. Everyone is faking posts, tweets, and benchmarks to the point that the truth no longer exists.
I'm using Claude 3.7 now and while it improved on certain areas, it degraded on others (ie: it randomly removes/changes things more now).
It's clear to anyone paying attention that LLMs hit a wall a while back. RAG is just expert systems with extra steps. 'Reasoning' is just burning more tokens in hopes it somehow makes the results better. And lately we've seen that a small blanket is being pulled one way or another.
LLMs are cool, machine learning is cooler. Still no 'AI' in sight.
I initially had the same experience. My codebase is super opinionated, with a specific way to handle things. Initially it kept wanting to do things its way. I then changed my approach and documented the way the codebase is structured, how things should be done, and all the conventions used, and on every prompt I make sure to tell him to use these documents as reference. I also have a central document that keeps track of dependencies of modules and the global data model. Since I made these reference documents, developing new features has been a breeze. I created the architecture, documented it, and now it uses it.
The way I prompt it is: first I write the documentation of the module I want, following the format I detailed in the master documents, and then ask him to follow the documentation and specs.
I use cursor as well, but more as an assistant when I work on the architecture pieces.
But I would never let an AI the driver seat for building the architecture and making tech decisions.
What I've noticed from my extensive use over the past couple weeks has been Claude Code really sucks at thinking things through enough to understand the second and third order consequences of the code that it's writing. That said, it's easy enough to work around its deficiencies by using a model with extended thinking (Grok, GPT4.5, Sonnet 3.7 in thinking mode) to write prompts for it and use Claude Code as basically a dumb code-spewing minion. My workflow has been: give Grok enough context on the problem with specific code examples, ask it to develop an implementation plan that a junior developer can follow, and paste the result into Claude Code, asking it to diligently follow the implementation plan and nothing else.
"Claude Code really sucks at thinking things through enough to understand the second and third order consequences of the code that it's writing"
Yup, that's our job as software engineers.
> Yup, that's our job as software engineers
The more seasoned you are as a SWE, the higher the orders you consider, and not just on the technical aspect, but the human and business sides as well.
In all of these posts I fail to see how this is engineering anymore. It seems like we are one step away from taking ourselves out of the picture completely.
I don’t write binaries, assembly, or C. If I don’t have to write an application, I’m okay with that.
I still have to write the requirements, design, and acceptance criteria.
I still have to gather the requirements from stakeholders, figure out why those will or will not work, provision infra, figure out how to glue said infra together, test and observe and debug the whole thing, get feedback from stakeholders…
I have plenty of other stuff to do.
And if you automate 99% of the above work?
Then the requirements are going to get 100Xed. Put all the bells and whistles in. Make it break the laws of physics. Make it never ever crash and always give incredibly detailed feedback to the end users. Make it beautiful and faster than thought itself.
I’m not worried about taking myself out of the loop.
I have to say that I am worried that, by taking myself out of the loop for the 99%, I'm going to get worse at the 1% of things that occasionally fall into my lap because the LLM can't seem to do them. I think software engineering is a skill that is "use it or lose it", like many others.
There's also the question of whether I will enjoy my craft if it is reduced to, say, mostly being a business analyst and requirements gatherer. Though the people paying me probably don't care very much about that question.
Reading some of the comments in the "Layoffs don't work" thread right before reading the comments here was one of the more surreal experiences I've had :)
The takes are as different as (paraphrasing): "if a person can't create something with an empty text editor, I fail them" and "if a person can't speed-run through an unrealistically large set of goals because they don't use AI-assisted development, I fail them".
I guess one should keep their skills at both honed at all times, even if neither are particularly useful at most real jobs, because you never know when you're going to be laid off and interviewing.
It's very specialized already, though.
How many devs could debug both a K8s network configuration issue and a bug in an Android app caused by a weird vendor's OS tweak? Not most of us.
Some people will be better at pushing the LLM tools to generate the right crap for the MVP. Some people will be better at using these tools for testing and debugging. Some people will be better at incident response. They'll probably all be using tools with some level of AI "magic" in them, but the specialization will still be somewhat recognizable from what it's been for the past decade.
If you're on the business side you still want a team of people running that stuff until there's a step-change in the ability to trust these things and they get so good you'd be able to give over control of all your cloud/datacenter/network/whatever infrastructure and spending.
And at THAT point... the unemployed software engineers can team up with the unemployed lawyers and doctors and blue-collar workers who were replaced by embodied-LLM-powered robots and ... go riot and ransack some billionaires' houses until they decide that these new magical productivity machines should let everyone have more free time and luxury, not less.
Thanks for sharing. I hope you are right. It's hard to stay objective as things are changing so quickly.
This has been my experience as well. Breaking problems into smaller problems where you can easily verify correctness works much better than having it solve the whole problem on its own.
you just described how a good developer works.
Hey, I've been hearing about this issue that programmers have on HN a lot.
But I'm in the more 'bad programmer/hacker' camp and think that LLMs are amazing and really helpful.
I know that one can post a link to the chat history. Can you do that for an example that you are comfortable sharing? I know that it may not be possible though or very time consuming.
What I'm trying to get at is: I suck at programming, I know that. And you probably suck a lot less. And if you say that LLMs are garbage, and I say they are great, I want to know where I'm getting the disconnect.
I'm sincerely not trying to be a troll here, and I really do want to learn more.
Others are welcome to post examples and walk through them too.
Thanks for any help here.
>and think that LLMs are amazing and really helpful
Respectfully, are you understanding what it produces, or do you think it's amazing because it produces something that 'maybe' works?
Here's an example I was just futzing with. I did a refactor of my code (TypeScript) and my test code broke (vitest): for some reason it said 'mockResolvedValue()' is not a function. I've done this a gazillion times.
I let it try to fix the error over 3-4 iterations (I was being lazy and wanted the error to go away), and the amount of crap it was producing, rewriting tests and the referenced code, was beyond ridiculous. (I was using GitHub Copilot.)
Eventually I said "f. that for a game of soldiers" and used my brain. I had forgotten to uncomment a vi.mock() during the refactor.
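For anyone curious, here's a minimal sketch of the failure mode (the module and test names are hypothetical, not my actual code):

```typescript
// user.test.ts -- minimal sketch; module names are hypothetical
import { vi, test, expect } from 'vitest';
import { fetchUser } from './api';     // the real dependency
import { getUserName } from './user';  // code under test

// vi.mock('./api');  // <-- this line was commented out during the refactor.
// Without it, fetchUser stays the real (un-mocked) function, so the call
// below throws "mockResolvedValue is not a function".

test('returns the user name', async () => {
  vi.mocked(fetchUser).mockResolvedValue({ id: 1, name: 'Ada' });
  await expect(getUserName(1)).resolves.toBe('Ada');
});
```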
I DO use it to fix stupid TypeScript errors (the error blob it dumps on you can be a real pita to process) and appreciate it when it gives me a simple solution.
So I agree with quite a few comments here. I'm not ready to bend the knee to our AI Overlords.
Yeah, so I'm a 'hacker' (MIT definition here). I've only taken a single class in JavaScript, on Sun workstations, about 20 years ago now (god, I'm old). I hated it.
All my work is in Python and SQL now, and though I've watched a lot of YouTube videos and plunked away at StackOverflow for ~15 years, I've never had formal education in either language. Like, it takes me about as long to set up the libraries and dependencies in Python as it does to write my code. My formal job titles have never had 'programmer' in them.
My code, as such, is just to get something done. Mostly this is hardware interfacing stuff, but a little software too. I've had my code get passed up the chain and incorporated into codebases, but that's only happened a handful of times. The vast majority of the code I write is never seen by others. My code hasn't had to be maintainable as I've never really had to maintain it for more than 5 years max. I've used git on projects before, but I don't really see the need these days. The largest program I've written is ~1M lines of code. Most code is about 100 lines long now. I almost always know what 'working' means, in that I know the output that I want to see (again, mostly working in hardware). I almost never run tests of the code.
I've had the same issues you have had with LLMs, where they get stuck and I have to try to restart the process. Usually this happens to me in about 20 back and forths. I'm mostly just pasting relevant code snippets and the errors back into the LLM for a while until things work for me. Again, I know what 'working' looks like from the start.
Typically, I'll start the session with an LLM by telling it the problem I have, what I want the code skeleton to look like, and then what I want the output to look like. It gives me the pseudocode, and I walk it through each portion of that pseudocode. Then I get to errors and debugging. Usually about half of this is just libraries and versions in Python. Then I get to the errors in the code itself. I can typically find which line of code is causing the error just from the terminal output. I'll talk with the LLM about that line of code, trying to understand it from the error. Then on to the next error. Repeat that process until I get the working output I desire. I'm never expecting the right code out of the first LLM interaction, but I am expecting (and seeing) that the time it takes to get to working code is faster.
The time it would usually take me to get through all this before LLMs was about 2 weeks of work (~80 hours) per project. Now it takes me about half a day (~4 hours), and it's getting faster.
I'm not in the camp of thinking that AI is going to take my job. I am in the camp of thinking that AI is going to finally let me burn down the list of things that we really need to do around here.
Thank you for the reply!
I can give you an example here. We had to do some basic local VAT validation for EU countries, and because the API you can use for that has issues with some countries (it checks against the national databases), we also wanted a basic local check. So using Claude 3.7 I asked for some basic VAT validation. In general the answer and solution were good, you would be impressed, but here comes the fun part. The basic solution was just some regular expressions, and then it went further on its own and created specific validations for certain countries. These validations looked like credit card number validations: sums, check digits, quite nice you would say. But in a lot of these countries the numbers are basically assigned randomly and have no algorithm, so it hallucinated validations that don't exist, producing a solution that looks nice but basically doesn't work in most cases.
Then I went on GitHub and found that it had used some code written by someone in JS 7 years ago and just converted and extended it for my language, but that code was wrong and simply useless. We'll end up with people publishing exploits and various other security flaws on GitHub, these LLMs will get trained on that, and people who have no clue what they are doing will push out code based on it. We're in for fun times ahead.
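For contrast, the sane version of a purely local check is tiny. Here's a hedged sketch (the country patterns are illustrative only and would need checking against the official formats) that stays syntactic and leaves real validation to the official service:

```typescript
// localVatCheck.ts -- a minimal sketch, not the code Claude produced.
// The patterns below are illustrative only; verify each against the
// official national format before trusting them.
const VAT_PATTERNS: Record<string, RegExp> = {
  DE: /^DE\d{9}$/,            // Germany: "DE" + 9 digits (illustrative)
  FR: /^FR[0-9A-Z]{2}\d{9}$/, // France: "FR" + 2 characters + 9 digits (illustrative)
};

export function looksLikeValidVat(countryCode: string, vat: string): boolean {
  const pattern = VAT_PATTERNS[countryCode.toUpperCase()];
  // Only do a syntactic check locally; leave check digits (where they truly
  // exist) and registry lookups to the official VIES service.
  return pattern ? pattern.test(vat.replace(/\s+/g, '')) : false;
}
```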
Here is one solution I'm helping out with which was very, very easy to create using Claude: https://www.youtube.com/watch?v=R72TvoXCimg&t=2s
Maybe it's the shot-up-plane effect (survivorship bias): we only see the winners and rarely the failures, which leads us to incorrect conclusions.
Finding the right prompt to have current generation AI create the magic depicted in twitter posts may be a harder problem than most anticipate.
"wild", "insane" keywords usually are a good filter for marketing spam.
Influencer would be another term...
I don't believe in those either, and I never see compelling YouTube videos showing that in action.
For small stuff LLMs are actually great and often a lifesaver on legacy codebases, but that's more or less where it stops.
I'm in the same boat. I've found it useful in micro contexts but in larger programs, it's like a "yes man" that just agrees with what I suggest and creates an implementation without considering the larger ramifications. I don't know if it's just me.
I have a challenging, repetitive developer task that I need to do ~200 times. It’s for scraping a site and getting similar pieces of data.
I wrote a worksheet for Cursor and give it specific notes on how to accomplish the task in each particular case. Then I let it run, and it's fairly successful.
Keep in mind…it’s never truly “hands off” for me. I still need to clean things up after it’s done. But it’s very good at figuring out how to filter the HTML down and parse out the data I need. Plus it writes good tests.
So my success story is that it takes 75% of the energy out of a task I find particularly tedious.
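To give a feel for it, here's a hedged sketch of the kind of per-page extraction helper it writes for me (the selectors and the cheerio dependency are hypothetical; each site gets its own variant plus tests):

```typescript
// extractProduct.ts -- illustrative sketch; selectors are hypothetical
import * as cheerio from 'cheerio';

export interface Product {
  title: string;
  price: number | null;
}

export function extractProduct(html: string): Product {
  const $ = cheerio.load(html);
  const title = $('h1.product-title').first().text().trim();
  // Strip currency symbols and separators before parsing the price.
  const rawPrice = $('[data-testid="price"]').first().text().replace(/[^\d.]/g, '');
  return { title, price: rawPrice ? Number(rawPrice) : null };
}
```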
I haven't found LLM code gen to be very good except in cases like the one you mention here: large, boilerplate-y code with a lot of hardcoded values or parameters. The kind of thing you could probably write a code generator for yourself if you cared enough to do it. Thankfully LLMs can save us from some of that.
it rarely returns the right answer
One of the biggest difficulties AI will face is getting developers to unlearn the idea that there's a single right answer, and to accept that, of the many thousands of possible right answers, 'the code I would have written myself' is just one (or a few, if you're one of the rare great devs who don't stop thinking about approaches after the first attempt).
I spent a few hours trying cursor. I was impressed at first, I liked the feel of it and I tried to vibe code, whatever that means.
I tried to get it to build a very simple version of an app I’ve been working on. But the basics didn’t work, and as I got it to fix some functionality other stuff broke. It repeatedly nuked the entire web app, then rolled back again and again. It tried quick and dirty solutions that would lead to dead ends in just a few more features. No sense of elegance or foundational abstractions.
The code it produced was actually OK, and I could have fixed the bugs given enough time, but overall the results were far inferior to every programmer I’ve ever worked with.
On the design side, the app was ugly as hell and I couldn’t get it to fix that at all.
Autocomplete on a local level seems far more useful.
Gene and I would like to invite you to review our book, if you're up for it. It should be ready for early review in about 7-10 days.
It seems like you would be the perfect audience for it. We're hoping the book can teach you what you need in order to have all those success stories yourself.
How can I follow this book? I’m interested too.
I can definitely save time, but I find I need to be very precise about the exact behaviour, a skill I learned as… a regular programmer. The speed-up is higher in languages I'm not familiar with, where I know what needs doing but not necessarily the standard way to do it.
I had Claude prototype a few things and for that it's really enjoyable.
Like a single-page HTML/JS app which does a few things and saves its state in local storage, with a JSON backup feature (download the JSON).
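The core of that save/backup bit is small; here's a minimal sketch of the pattern (the storage key and function names are made up for illustration):

```typescript
// Local-storage persistence plus a "download the JSON" backup.
// STORAGE_KEY and the function names are made up for illustration.
const STORAGE_KEY = 'app-state';

function saveState(state: unknown): void {
  localStorage.setItem(STORAGE_KEY, JSON.stringify(state));
}

function downloadBackup(): void {
  const json = localStorage.getItem(STORAGE_KEY) ?? '{}';
  const blob = new Blob([json], { type: 'application/json' });
  const link = document.createElement('a');
  link.href = URL.createObjectURL(blob);
  link.download = 'backup.json';
  link.click();
  URL.revokeObjectURL(link.href);
}
```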
I also enjoy it for things I don't care much about but that make a project more polished. Like, I hate my basically empty README with two commands. It looks ugly, and when I come back to stuff like this a few days or weeks later I always hate it.
Claude just generates really good readmes.
I'm trying out Claude code right now and like it so far.
Funny, because I have the same feeling toward the "I never get it to work" comments. You don't need any special prompt engineering so that's definitely not it.
Yeah I gave Claude Code a try at about 5 different things, with miserable results on all of them (insult to injury -- each time it charged me about a buck!). I wonder if because it was C# with Unity code, maybe not so heavily represented in the training set?
I still find lots of use for LLMs authoring stuff at more like the function level. "I know I need exactly this."
Edit: I did however find it amazing for asking questions about sections of the code I did not write.
I’ve dug into this a few times.
Every single time they were doing something simple.
Just because someone has decades of experience or is a SME in some niche doesn’t mean they’re actually good… engineers.
> I do find value in asking simple but tedious task like a small refactor or generate commands,
This is already a productivity boost. I'm more and more impressed by what I can get out of these tools (as you said, simple but tedious things). ChatGPT-4o (provided by my company) does pretty complex things for me, and I use it more and more.
Actually, I noticed that when I can't use it (e.g. internal tools/languages), I'm pretty frustrated.
Are you concerned that these tools will soon replace the need for engineers?
Yes, I used to be skeptical about the hype, but now I'm somewhat concerned. I don't think they will replace engineers but they do increase their productivity. I'm not able to quantify by how much though. In my case, maybe it increases my productivity by 5-10%, saving me a few hours of work each week. Very rough estimate.
Does it mean that we'll need fewer engineers to perform the same amount of work, or that we'll produce better products? In my company there's no shortage of things to do, so I don't think we'll hire fewer people if engineers suddenly become a bit more productive. But who knows how it'll impact the industry as a whole.
I am willing to say I am a good prompt engineer, and "AI takes the wheel" is only ever my experience when my task is a very easy one. AI is fantastic for a few elements of the coding process: building unit tests, error checking, deciphering compile errors, and autocompleting trivially repetitive sections. But I have not been able to get it to "take the wheel".
This space is moving really fast, so before forming a definitive opinion I suggest trying the best tool, such as the latest Claude model, and using "agentic" mode or its equivalent in your client. For example, on Copilot this mode is brand new and only available in VS Code Insiders. Cursor and other tools have had it for a little longer.
People have been saying it writes amazing code that works for far longer than that setup has been available though. Your comment makes me think the product is still trying to catch up to these expectations people are setting.
That being said I appreciate your suggestion and will consider giving that a shot.
You have to learn and figure out how to prompt it. My experience with Claude Code is this: one time it produces an incredible result; another time it's an utter failure. There are prompt tips and tricks which have enormous influence on the end result.
Can you give us some of these tips?
Not that I have anything concrete in mind yet. I'm learning as we all are. But after some usage I've developed a bit of a hunch for which prompts work and which don't.
For example, I mindlessly asked Claude Code, over a large codebase, "where is the desktop app version stored and how is it presented on the site". I expected a useless answer given how vague the question was. Instead I got a truly exceptional and extremely clear report that fully covered the question.
Another example. I asked Claude Code to come up with a script to figure out funding rate time intervals on a given exchange, and it ended up in an almost endless loop running small test scripts in Node.js, eventually producing a super suboptimal and complicated solution. Turns out my prompt was too verbose and detailed: I had specifically asked Claude Code to figure out the time intervals, not just get them. So it did. Instead of just querying the exchange via the API and printing the list in the terminal (a 3-line script), it actually, truly tried to figure them out in various ways.
You should also try the same prompt multiple times to see how this works.
Sometimes you will get better or worse answers completely by chance.
I think Claude is pretty good if you have it write a function and give it the inputs, the output, and a data example. You can also tell it to ask clarifying questions as needed, because there is a good chance some aspects of the prompt are ambiguous.
My prompts are always better if I write them in a separate text file and then paste them in. I think I just take my time and think things out more that way, instead of trying to get to the answer as fast as possible.
I agree it feels very different from my experience.
I'm curious when we'll start seeing verifiable results like live code streams with impressive results or companies dominating the competition with AI built products.
Are you actually using claude? There's an enormous difference between claude code and copilot, with the latter being a bigger burden these days than a help.
Can you clarify what tools and programming language you use? I find that the issue is often wrong tooling, or exotic programming languages or frameworks.
I would consider frontend tasks using Typescript and React quite standard.
React in my experience sucks with AI. In fact I have not yet encountered a "heavy" framework which works well. Use something light like svelte.
Typed programming languages like TypeScript, Scala, Haskell and so on will produce more errors -> you need to fix stuff manually. However, they will also reduce bugs, so it's a mixed bag. For an error-free experience, Python and JavaScript work very well.
When it comes to tooling, if you haven't used cline, roocode or aider (not as good) yet, you haven't seen what an AI can do.
A good start would be starting fresh: create a README which describes the whole application you want to build in detail, and let the AI decide the tech stack. You can most certainly build complex applications with an AI at blazing speed.
From the creators of static open-source marketing benchmarks: twitter PR posts.
Is this true even for Claude 3.7 Sonnet/3.7 Sonnet Thinking ?
Oh. That's because he's clearly lying.
It's not.
A lot -- and I mean a lot -- of people who hype it up are hobby or aspirational coders.
If you drill down on what exactly they use it for, they invariably don't write code in professional settings that will be maintained and which other humans have to read.
Everyone who does goes "eh, it's good for throwaway code or one offs and it's decent at code completion".
Then there's the "AGI will doom us all" cult weirdos, but we don't talk about them.
I have the same experience
What model are you using?
You've got to do piecemeal validation steps yourself, especially for models like Sonnet 3.7 that tend to over-generate code and bury themselves in complexity. Windsurf seems to be onto something. Running Sonnet 3.7 in thinking mode will sometimes reveal bits and pieces about the prompts they're injecting when it mentions "ephemeral messages" reminding it about what files it recently visited. That's all external scaffolding and context built around the model to keep it on track.