Google's Gemini in search just makes up whatever appears to support the query, with no care for context or accuracy. Pure confabulation. Try it for yourself. Ridiculous. It works as memory support if you already know the result you're looking for, but if you don't, you can't trust it as far as you can throw it.
If you look carefully at Google Veo output, it's similarly full of holes.
It's plain there's no reasoning whatsoever informing the output.
Veo output with goofy wrongness
https://arstechnica.com/ai/2025/05/ai-video-just-took-a-star...
Tesla FSD goes crazy
https://electrek.co/2025/05/23/tesla-full-self-driving-veers...
This baffles me like no other tech has before. Google is betting its core business on a pivot that relies on a massively faulty piece of technology. And as Ben Evans also says, promising that it will get better only gets you so far; it's an empty promise. Yesterday AI Overview made up an entire album by a dead Italian musician when I searched for a tribute event happening at a Berlin venue. It just took the name of the venue and claimed it was the artist's most important work.
Funnily enough (not for Google), I copy-pasted that answer into ChatGPT and it roasted AI Overview's mistakes so badly, and with such sarcasm, that it even made me chuckle.
It's the unfounded promises that this will be solved because the tech will only get better that really upset me. Because sure, it will get better, I'm pretty certain of that. They'll have additional capabilities, they'll have access to more-recent data, etc. But "better" does not necessarily equate to "will fix the lying problem." That's a problem that is BAKED INTO the technology, and requires some kind of different approach to solve -- you can't just keep making a hammer bigger and bigger in the hopes that one day it'll turn into a screwdriver.
Before LLMs really took off, we were in the middle of an "AI winter", where there just weren't any promising techs, at least none with sufficient funding attached to them. And it's WORSE now. LLMs have sucked all the air out of the room, and all of the funding out of other avenues of research. Technologies that were "10-20" years away might now be 30-40, because there are fewer people researching them, with less money, and they might even be completely different people trying to restart the research after the old ones got recruited away to work on LLMs!
I really don’t understand the whole AI winter talk all the time. We haven’t had anything of the sort since 2008. There were tons of major RL advancements before ChatGPT that were stunning.
I would challenge anyone to find data that actually supports any of these claims. ML spending has been up year over year since deep learning took off, and the models just keep getting better.
And suddenly this type of quality is becoming "normal" and acceptable now? Nobody really complains.
That is very worrying. Normally this would never fly, but nowadays it's kind of OK?
Why should false and/or inaccurate results be accepted?
We lost that battle back when we collectively decided that sales and marketing is respectable work.
Hah. Good observation.
I often get in arguments about how I tend to avoid brands that put too much into marketing. Of course, theoretically, the amount of money a company puts into marketing doesn't automatically lower the quality of its products, but in my experience the correlation is there. Whiskas, Coca-Cola, McDonald's, etc.
How would products get known, let alone sold, without this?
How would you give your neighbor a warm welcome without setting their house on fire?
Scale and intent matter.
> And suddenly this type of quality is becoming "normal" and acceptable now?
The notion that "computers are never wrong" has been engrained in society for at least a century now, starting with scifi, and spreading to the rest of culture.
It's an idea that has caused more harm than good.
> Normally this would never fly, but nowadays it's kind of OK?
We started down this path ever since obvious bugs were reframed as "hallucinations".
Complain to it enough times, remain resilient and you’ll eventually figure it out (that’s a wild card though). Or find someone who has and take their word for it (except you can’t because they’re probably indistinguishable from the ‘bot’ now according to the contradictory narrative). Iterate. Spiral. No one should have to go through that though. Be merciful.
I've recently started wondering what the long-term impacts of AI slop are going to be. Will people get so sick of the sub-par quality that there will be a widespread backlash, and a renewed focus on handmade, artisanal products? Or will we go the other way, where everyone accepts the status quo, everything just gets shittier, and we have cycle after cycle of AI slop trained on AI slop?
I'm already seeing screen-free summer camps in my area. There's going to be a subset of the population that does not want to play along with calling hallucinations and deepfakes "progress," kids will be homeschooled more as parents lose their jobs and traditional classroom instruction loses effectiveness.
I thought the movie "The Creator" was pretty neat. It envisions a future where AI gets blamed for accidentally nuking Los Angeles, so America bans it and reignites a kind of cold war with Asia, which has embraced GAI and transcended the need for central governance. Really it's a film about war and how it can be started with a lie but continue out of real existential fear.
I'll guess it will be both at the same time with a far greater number of people going for the easier (latter) option, but still a real chunk of people going for what's real, and also a spectrum in between.
This is how it already is for most aspects of life that have, for many, been enshittified by progress. Sadly the shitty part is not entirely avoidable by choice.
Suddenly? That's the level of quality that is standard in all software projects I've ever seen since I've started working in IT.
Enshittification is all around us and is unstoppable, because we have deadlines to hit and goals to show the VP we reached. We broke everything and the software is only half working? Come on, that's an issue for the support and ops teams. On to the next beautiful feature we can put on marketing slides!
Sadly you are absolutely right.
> Why should false and/or inaccurate results be accepted?
The typical response is "because humans are just as bad, if not worse."
And how quickly the bar is being lowered
When were search results 100% fact checked and accurate??
For example, in the times of "lectures", when transmitted information was literally read (as the term says) in real time from the source to the audience.
But in general, the (mis-)information that spinach contained so much iron as to be interchangeable with nails came from a typo so rare that it became anecdotal and generated cultural phenomena like Popeye.
I use uBlock to remove Gemini responses from search, because even glancing at them is liable to bias my assumptions about whatever I'm looking for.
Information hygiene is a skill which started out important but is going to become absolutely critical.
Half my browser extensions have the sole purpose of removing shit from sites I visit.
HN is like a unicorn that hasn't made me block a single thing yet.
Ironically that's an AI tool I would use - one that can dynamically filter content from sites according to my preferences, to counter the algorithmic spam. It wouldn't be generative AI though, and that's the only kind of AI that matters right now apparently.
Could be a good use of structured output. LLMs work okay as one-shot classifiers; you could define a schema that's just, like, array[]{xpath: true/false} and tell the bot what you want to see and what you don't want to see.
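A minimal sketch of that idea in Python, assuming pydantic for the schema; llm.complete() is a made-up stand-in for whatever structured-output call your provider actually exposes:

    # Sketch: one-shot element classification via structured output.
    # Assumes pydantic v2; llm.complete() is a hypothetical stand-in for a
    # provider call that accepts a JSON schema and returns matching JSON.
    from pydantic import BaseModel

    class ElementDecision(BaseModel):
        xpath: str  # locator of the page element being judged
        keep: bool  # True if it matches the user's stated preferences

    class FilterResult(BaseModel):
        decisions: list[ElementDecision]

    PROMPT = (
        "You are a content filter. Given these preferences: {prefs}\n"
        "Decide, for each element below, whether it should be shown.\n"
        "Elements:\n{elements}"
    )

    def classify_elements(llm, prefs: str, elements: list[str]) -> FilterResult:
        """Ask the model for one keep/drop verdict per element, in one shot."""
        raw = llm.complete(
            PROMPT.format(prefs=prefs, elements="\n".join(elements)),
            response_schema=FilterResult.model_json_schema(),
        )
        return FilterResult.model_validate_json(raw)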
We can't expect the vast majority of regular users to have any of that skill.
What is this going to lead to? Fascinating times.
It's very easy though: Right click > Block element > Create. Overlays show which blocks you are removing. Sliders can be used to increase or refine the selection.
How can we make it even easier and visual? Customizing pages by removing elements should be part of the default browser experience to be honest. Like in the initial web where you would tweak the color of links, visited links, etc.
I had a question about my car, so I googled '[year] [make] [model] [feature]'. This seems like the sort of thing that Google had always absolutely nailed. But now, 90% of the page was ai slop about wrong model, wrong year, even the wrong make. (There was one youtube which was sorta informative, so some credit.)
But way, way down at the very bottom of the page, there was the classic Google search answer, on a totally unrelated car forum. Thanks CamaroZ28.com!
This is a very, very good point. If this were happening with queries we never used before, or with a new type of question, then I would have some patience. But it happens exactly with the formulations that used to give the best results in the SERP!
I'm a member of a few car groups on Facebook and the misinformation coming from Google is infuriating, because people treat it as gospel and then you have to explain to them that the AI slop they were shown as the top result in Google is not - in fact - correct.
As a simple example - someone googled "how to reset sensus system in Volvo xc60" and Google told them to hold the button under the infotainment screen for 20 seconds and they came to the group confused why it doesn't work. And it doesn't work because that's not the way to do it, but Google told them so, so of course it must be true.
Exactly, the "AI" opiateturd results are often for the completely wrong year/model or just obviously false. I'm certain Google used to be really good at this kind of thing.
I wonder this too--whether LLMs will actually create more work, via a whole new genre of customer support that not only has to know the "material" but also has to provide secondary support resolving the issues customers have from incorrect nonsense.
Ad supported search has been awful for a few years now, just buy a Kagi subscription and you'll be like me: horrified but mildly amused with a dash of "oh that explains a lot" when people complain about Google
That was true before AI too (I know, I did such searches myself). Google results have been drowning in slop for over a decade now - it was just human-generated slop, aka content marketing and SEO stuff.
I'm not defending the AI feature here, just trying to frame the problem: the lies and hallucinations were already there, but nobody cared because apparently people don't mind being constantly lied to by other people.
No, I'm not complaining about SEO shit...
The thing is, the primordial Google had the answer, but Google themselves buried it under 100+ links of Google-generated slopdiarrhea, most of which didn't even factually fit the question and wasn't at all relevant to my automobile.
Indeed, this is part of the current monopoly abuse case they're facing — did Google deliberately choose to make search worse, because that causes people to return to the search results page and spend more time looking at ads, and they knew they could get away with it?
e.g. bottom of first page, penultimate paragraph https://www.justice.gov/d9/2023-11/417557.pdf
Yeah they definitely did that.
But this AI diarrhea is so awful, I honestly can't see any angle in giving me tons of bad results about a minor feature of my car. (I should sell it and use waymo??) Maybe the really sharp monopolists ran for the hills when the DOJ sheriffs showed up, and now Google Search is being run by former Yahoo execs.
FWIW, two days later, I kinda changed my mind about AI in search - or at least, "AI answers". They're going to (and probably already are) cause a lot of issues for a lot of people.
https://news.ycombinator.com/item?id=44157837
TL;DR: it's not the correctness that's the problem, but that AI is too good at making bad queries work. A regular person making an underspecified query out of ignorance will get a sensible (and possibly correct) answer to the wrong question, and might try to act on it - whereas pre-AI, they'd get a mix of confusing results that would hopefully make them realize their query (and understanding) was wrong.
Late reply, but '[year] [make] [model] [feature]' has never been a bad query for as long as I've been using google. They used to be really good at that kind of thing, but now they are really bad. I can ignore the "AI results" at the top, the rest of the page is maybe more just Yahootardation than AI. But now they have some real competition.
Perhaps for cars.
For other topics, say apps? Well…
A while back I was looking for interesting apps for my Apple Watch to see if I could find some real benefit to keeping it. Even before GenAI I was finding listicles like these that always have date stamps as if they are recent news: https://techidroid.com/10-best-apps-for-apple-watch/#5_ViewR... (as I read it, it says "March 6, 2025" and lists ViewRanger app as number 5 on their list; I have no idea what it will say to anyone else).
Now here's the important thing: I used to work at Augmentra, who made the ViewRanger app. They merged with OutdoorActive years ago and shut down the ViewRanger app in 2022.
So, I've not trusted Google for a while now.
HN had a story a few years ago about Google removing the guy who cared about search quality. Since then it's all about stuffing Google links at the top (like Yahoo did), plus whatever SEO results have the most profitable Google ads on them. A shame, because it used to be good. Now the government is after them, and they might get their lunch eaten.
Yep, I was looking up a hint for the Blue Prince game the other day for the (spoiler alert?) casino room.
Google’s AI results proceeded to tell me all about the games available at the Blue Prince Casino down the road from here, where I know for a fact there’s only a prison, a Costco, a few rural properties and a whole lot of fuck-all.
It’s amazing to watch it fill in absolute false, fabricated tripe at the top of their search page. It also frequently returns bad information on subjects like employment law and whatever else I look up.
It would be hilarious if people weren’t actually relying on it.
I have had a lot of luck with Copilot conversations for researching stocks and trading strategies. I am always skeptical of the results and verify everything with various sources, but it does help me find things and get on the right track.
Yeah, it feels like we've crossed into a weird uncanny valley where AI outputs sound smarter than ever, but the underlying logic (or lack thereof) hasn't caught up
I think it's just much easier for an LLM to learn how to be convincing than it is to actually be accurate. It just has to convince RLHF trainers that it's right, not actually be right. And the first one is a general skill that can be learned and applied to anything.
https://arxiv.org/html/2409.12822v1
I'm honestly so confused how people use LLMs as a replacement for search. All chatbots can ever find is data tangential to the stuff I want (e.g. I ask for a source, it gives me a quote). Maybe I was just holding search wrong?
> eg i ask for a source, it gives me a quote
It should give you both - the quote should be attributed to where it was found. That's, generally, what people mean when they ask or search for "a source" of some claim.
As for the general point - using LLMs as "better search" doesn't really look like those Google quick AI answers. It looks like what Perplexity does, or what o3 in ChatGPT does when asked a question or given a problem to solve. I recommend checking out the latter; it's not perfect, but good enough to be my default for nontrivial searches, and more importantly, it shows how "LLMs for search" should work to be useful.
> LLMs as a replacement for search
Some people expect LLMs as part of a better "search".
LLMs should be integrated into search as a natural application: search results can depend heavily on happy phrasing, search engines work through sparse keywords, and LLMs let you use structured natural language (not "foo bar baz" but "Which foo did a bar baz?" - which should be resistant to term variation and exclude different semantics attached to those otherwise sparse terms).
But it has to be done properly - understand the question, find material, verify the material, produce a draft reply, verify the draft vis-a-vis the material, maybe iterate...
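Something like this, as a rough sketch; every helper here is an injected placeholder for real retrieval/verification machinery, not an existing API:

    # Sketch of a search-integrated answer loop: understand, retrieve,
    # verify sources, draft, check the draft against the material, iterate.
    from typing import Callable

    def answer(
        question: str,
        understand: Callable,      # restate the query as structured intent
        search: Callable,          # keyword/semantic retrieval
        verify_source: Callable,   # drop low-trust material
        draft_reply: Callable,     # draft an answer from the kept documents
        verify_against: Callable,  # every claim must trace back to a document
        refine: Callable,          # fold what was learned back into the intent
        max_rounds: int = 3,
    ) -> str:
        intent = understand(question)
        for _ in range(max_rounds):
            docs = [d for d in search(intent) if verify_source(d)]
            draft = draft_reply(intent, docs)
            if verify_against(draft, docs):
                return draft
            intent = refine(intent, draft, docs)
        return "No grounded answer found."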
DuckDuckGo AI Assist is going in the right direction, imo. It pulls info from Wikipedia and uses math and map tools plus other web sources, and it has been mostly accurate for me on the search page.
The chat option uses gpt-4o with web search and was able to provide links to colonial map resources I was curious about after falling down that rabbit hole. It also gave me general (& proper) present-day map links to the places I was looking for, on the map sites I asked for.
It did get confused a few times when I was trying to get present-day names of old places I had forgotten; like the Charles River in VA, where it kept trying to send me to Boston, or to Charles City Co. on the James River, telling me to look around there...
The York River wiki page clearly says it was once the Charles River. Maybe I wasn't asking the right questions. For more unique things it was pretty helpful though, and saved the endless searching-with-100-tabs adventure.
Some chatbots plan a query and summarize what a search returns instead of trying to produce an answer on their own; I use perplexity a lot which always performs a search, I think ChatGPT et al have some kind of classifier to decide if web search is necessary. I especially use it when I want a suggestion without sifting through pages of top ten affiliate listicles (why is there a list of top 10 microwaves? I only need one microwave!)
It's good to be shown a direction. When I only have a vague idea of what I want, AI usually helps me frame it into searchable terms I had no clue existed.
I find LLMs are often better for X vs Y questions where search results were already choked by content farm chaff. Or at least LLMs present more concise answers, surrounded by fewer ads and less padding. Still have to double check the claims of course.
I think I'm discovering that I just don't tend to think in terms of questions, but rather in terms of content.
Maybe that's because we're conditioned by the UX of search.
But another thing I find even more surprising is that, at least initially, many expected that the LLMs would give them access to some form of higher truth.
I think you might be on to something. I've found myself irritated that I can't just chuck keywords at LLMs.
I’ve had good results with Brave search, which self reports to use: Meta Llama 3, Mistral / Mixtral, and CodeLLM. It’s not always 100% accurate, but it’s almost always done the trick and saved me digging through more docs than necessary.
Gemini is the worst LLM I've used, whether directly or through search. As in your experience, it regularly makes stuff up - language/application features, command flags (including for Google products) - and provides helpful references to sources which do not say what it cites from them.
in my case, it does so roughly half the time, which is the worst proportion, because that means I can't even slightly rely upon the truth being the opposite of the output.
Gemini was underwhelming until 2.5 Pro came along, which is very good. But in my experience all of the Google models are far worse than everything else when it comes to hallucination.
As a corollary, though, the chatbots are probably the most creative.
Professional creatives do measure their intuitions against a number of constraints...
Google recently started showing me their AI bullshit. This made me pull the trigger and switch to DuckDuckGo as the primary search engine.
That said, some niche stuff has significantly better results on Google. But not in the AI bullshit. I searched for a very niche train-related term, and the bullshit response said condescendingly "this word does not exist, maybe you meant [similarly sounding but completely different word], which in the context of trains means ...". The first real result? Turns out that word does exist.
I switched to DDG over seven years ago and just realized it had been that long when I read your comment. Google started wasting my time and I had to shift.
fyi, you can remove any and all “ai” assistant bs etc from DDG if you use the noai subdomain (in case you wanna avoid their stuff, although it’s much less prominent anyway) https://noai.duckduckgo.com/
What’s the word?