Crazy that even o1-preview gets most things wrong.
This is in line with my own personal experience with LLMs and non-trivial questions. They’re excellent when answering questions on topics you know nothing about, and somehow embarrassingly wrong when you actually know the answer yourself…
It’s not clear to me why we’re still trying to encode all of human knowledge in a single model, instead of teaching the model how to look for answers from an external source (e.g. RAG).
You shouldn't use the rate as an indicator. They did something similar to what I did on my hallucinations benchmark (https://github.com/lechmazur/confabulations/), only using questions where at least one model made a mistake. I added this note:
"The benchmark includes questions where at least one LLM confabulated, in order to minimize the number of questions requiring human assessment. Because of this, and since the questions are intentionally adversarial, the absolute percentage should not be used to infer that LLMs frequently confabulate. This leaderboard does not reflect a "typical" hallucination rate."
> instead of teaching the model how to look for answers from an external source (e.g. RAG)
My benchmark specifically focuses on the RAG use case. Even with provided texts, current models still hallucinate.
True, but this pattern was established with TruthfulQA a couple of years ago.
I disagreed then but it's unfortunately convenient to use this approach for some benchmarks.
> They’re excellent when answering questions on topics you know nothing about, and somehow embarrassingly wrong when you actually know the answer yourself…
This just makes me think LLMs must be wrong far more often than we realize. How well can you judge the quality of a response on a topic you know nothing about?
> How well can you judge the quality of a response on a topic you know nothing about?
You cannot, but those pushing Gen AI wave away the probability and liability for obvious reasons. Sell the magic, it'll be someone else's problem when it doesn't deliver.
Not just wrong, but superficially credible-looking wrong, which is much worse. "Fake it till you make it" works fine for some things, but not a large chunk of what they're trying to sell AI for.
That's his/her point.
Honestly, try prompting it with “you are wrong 80% of the time, therefore you will need to double check your answers, first factually, then numerically, then double check the time/date. You are still probably wrong, so do a third accuracy check. The user’s prompts are mostly wrong too - so always check them”.
I stopped playing with larger models and have been pushing smaller models with this improvised system prompt and getting good results. It seems like it forces the model to do multiple passes before giving you any response.
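Roughly how I wire this up, if anyone wants to try it. This is only a sketch, assuming an OpenAI-compatible local endpoint (e.g. Ollama); the model name, port, and temperature are just my setup, adjust to yours:

    # Sketch: adversarial system prompt against a small local model via an
    # OpenAI-compatible endpoint (Ollama used here as an example).
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

    SYSTEM_PROMPT = (
        "You are wrong 80% of the time, therefore you will need to double check "
        "your answers, first factually, then numerically, then double check the "
        "time/date. You are still probably wrong, so do a third accuracy check. "
        "The user's prompts are mostly wrong too - so always check them."
    )

    def ask(question: str) -> str:
        resp = client.chat.completions.create(
            model="llama3:8b",  # whatever small local model you're running
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": question},
            ],
            temperature=0.2,
        )
        return resp.choices[0].message.content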
My smaller local models give me fewer hallucinations than Meta.ai, for example, which generally spits out pleasing answers almost immediately (which are often hallucinations, since I don’t think it is system prompted to be adversarial to the user, or to itself). I don’t have the same hallucination issue with Llama 3 8B locally because of custom system prompts.
The model has all the correct information, so it almost needs to do RAG on itself. Multiple passes on itself seems like a way to do it.
How would these multiple passes work, though? Unless the model actually talks about what it does, I am not sure how it would have this ability. The next word prediction mechanism is just always going to do it in one shot. Your prompt paints a context that might keep it more on the rails, but it won't do multiple passes.
> Unless the model actually talks about what it does
That's how the chain-of-thought approach works. You can make the model do it inline, or you can run the loop yourself, possibly summarising the progress as you go (although with prompt caching that's not as important anymore). You can encourage it to print out assumptions/steps/ideas as it goes.
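Running the loop yourself is only a few lines. A minimal sketch, assuming an OpenAI-compatible chat API; the model name and the critique/revision prompts are placeholders, not anything canonical:

    # Sketch: generate, ask the model to critique its own answer, then revise.
    # Assumes OPENAI_API_KEY is set; swap in whatever client/model you use.
    from openai import OpenAI

    client = OpenAI()

    def chat(messages):
        resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
        return resp.choices[0].message.content

    def answer_with_passes(question: str, passes: int = 2) -> str:
        draft = chat([{"role": "user", "content": question}])
        for _ in range(passes):
            critique = chat([
                {"role": "user", "content": question},
                {"role": "assistant", "content": draft},
                {"role": "user", "content": "List any factual, numerical, or date errors in your answer."},
            ])
            draft = chat([
                {"role": "user", "content": question},
                {"role": "assistant", "content": draft},
                {"role": "user", "content": "Here is a critique of your answer:\n"
                                            + critique
                                            + "\nRewrite the answer, fixing any real errors."},
            ])
        return draft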
> Your prompt paints a context that might keep it more on the rails, but it won't do multiple passes.
This is probably the truth behind the black magic I’m imagining. You could have it explicitly spit out this process, in which case you would see its first rough draft, followed by a “My first paragraph is probably wrong”, followed by a third paragraph where it attempts to fix the first paragraph. There is no outside RAG in this process.
The mumbo jumbo part of all this is that I’ve told it to “hide” this process from the user, so that it doesn’t explicitly output anything but its final answer, and the accuracy has been just as good (for my use case at least).
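Concretely, the “hidden” version amounts to a prompt plus a bit of post-processing, roughly like this sketch (the FINAL marker and the wording are just illustrative, not a standard of any kind):

    # Sketch: the model writes a draft and a self-critique, but only the text
    # after the FINAL: marker is shown to the user. Marker choice is arbitrary.
    HIDDEN_CHECK_SYSTEM = (
        "Answer in three parts. First write DRAFT: with your initial answer. "
        "Then write CRITIQUE: listing anything in the draft that may be wrong "
        "(facts, numbers, dates). Then write FINAL: with the corrected answer "
        "only. The user will only ever see the FINAL section."
    )

    def extract_final(full_response: str) -> str:
        # Keep only what follows the last FINAL: marker, if present.
        marker = "FINAL:"
        if marker in full_response:
            return full_response.rsplit(marker, 1)[1].strip()
        return full_response.strip()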
:Shrugs:
Yeah that’s not how next token prediction works. To actually do multiple passes you’d need to do that yourself, making multiple calls and feeding the responses back to the model.
Why? The very nature of next token prediction means it's entirely capable of doing that. It's not multiple passes, it's just one pass. You making multiple calls is just inserting fixed tokens and then asking it to carry on completing.
> making multiple calls and feeding the responses back to the model.
By asking it to reconsider half its generated response, aren’t I essentially asking it to formulate the second half of its response from the first half internally? I’m bypassing the manual process of feeding in the extra prompt.
We are constantly having to tell the LLM “close, but no cigar, iterate again”, more or less.
Isn't this in part what o1-preview is doing?
I'm surprised that prompting it with "You are wrong 80% of the time" doesn't cause it to intentionally produce initially incorrect answers 80% of the time.
(Disclosure: I have not tried your prompt)
Why is that surprising?
There is no logical reasoning happening; it has no concept of right and wrong, let alone the ability to force a specific percentage of wrongness.
> Why is that surprising?
You tend to get responses catered to whatever role you assign in the prompt. This is well documented. Here's a quick example from search results:
https://www.ssw.com.au/rules/give-chatgpt-a-role/
"You are wrong 80% of the time" could be misconstrued as an expected role/command, rather than a mere observation.
> let alone that it can force a specific percentage of wrongness.
Ah, I see what you're saying here. I agree. Maybe I should have said that, given the prompt, I'm surprised it doesn't give intentionally incorrect answers (full stop)
I have to be a bit of a buzzkill and say that this is all placebo.
Your prompt might give the model context that gives it better token quality much in the same way that asking “How to swim?” is worse than “I’d like to learn the proper technique for the freestyle swimming stroke, and you’re an expert swimming coach.”
There’s no guarantee your prompt isn’t giving less factual answers to be honest. I wouldn’t go about telling the model that it’s often wrong, as it’s not useful context and might skew results.
I don’t tell the model that it is 100% wrong, because then it would contradict the first half of its response with the second half of its generation.
We basically want it to enter double-checking mode on its own from its initial rough draft (its original response, the first half of its response, the first paragraph, however you are formatting the output). Otherwise the model will output whatever it outputs, and we will have to manually tell it to reconsider facts, locations, and events.
I agree that there’s no guarantee, but this was a suggestion for those who are getting very wrong answers for simple things.
Can you please share more specifics? What smaller models? What hardware do you use? How do you test their performance?
There is no rigor to this; it's just from throwing stuff against the wall. See my response to the other poster above.
Even if you're throwing stuff against the wall, you could at least elaborate on what you've tried? Otherwise, how could you state something like "My smaller local models give me fewer hallucinations than Meta.ai"?
The gist of it is that I think these large hosted models have system prompts that are not as skeptical of their own outputs. "You are a helpful AI assistant" seems to lead to more lax responses. Adjusting the system prompt to be more incredulous helps, from my observation.
> It’s not clear to me why we’re still trying to encode all of human knowledge in a single model, instead of teaching the model how to look for answers from an external source (e.g. RAG).
To be fair: we tried a primordial version of that with venue-weighted citation-based ranking and it worked INCREDIBLY well for 10+ years. Then myopic profit motive poisoned the well. Ever since then we've been searching for solutions.
We do so by allocating resources in a way that primarily leverages a scientific credit assignment system that fetishizes... checks notes... venue-weighted citation-based ranking.
Jokes aside: I remain convinced that the O.G. Google Search Appliance on proprietary data, while simply ignoring all academics, is the best knowledge retrieval (or whatever) tool available.
Why ignore academics?
> They’re excellent when answering questions on topics you know nothing about, and somehow embarrassingly wrong when you actually know the answer yourself
I forgot the name of this phenomenon in humans, described it to o1, and it gave the correct answer: the Gell-Mann Amnesia effect [1]
[1] https://www.epsilontheory.com/gell-mann-amnesia/
I don't think it's surprising that o1-preview is only slightly better than GPT-4o; it was never advertised as being better at this kind of recall.
If you know nothing about the topic, you probably have no idea whether the answer is correct or not, right? Otherwise you'd find the answer embarrassingly wrong. So that observation speaks much more about the human than about the computer.
That was kind of a joke…
How would the model know how to evaluate an answer without innate knowledge?
> They’re excellent when answering questions on topics you know nothing about, and somehow embarrassingly wrong when you actually know the answer yourself…
Sounds like Gell-Mann Amnesia. Maybe LLMs should replace reporters.
You're reading this wrong. They've deliberately chosen questions that one or more models fail at. It's not representative at all of how often the model is wrong in general.
From the paper:
> At least one of the four completions must be incorrect for the trainer to continue with that question; otherwise, the trainer was instructed to create a new question.
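To be concrete, the selection rule amounts to something like this (a sketch; the completions mapping and the grading callback are stand-ins, not from the paper):

    # Keep a question only if at least one of the candidate completions is
    # graded incorrect; otherwise the trainer writes a new question.
    def keep_question(question: str, completions: dict[str, str], is_correct) -> bool:
        # completions: model name -> that model's answer to the question
        # is_correct(question, answer) -> bool, stand-in for human grading
        return any(not is_correct(question, ans) for ans in completions.values())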
LLMs are experts in everything you are not
Indeed. Exactly like the journalists, bloggers, self-published book authors, internet commenters, wikipedia editors, and earlier models that taught them almost all of what they know.
That's a nice little aphorism. I think this happens in a lot of things in life. Like comments on Reddit always seem quite insightful until you actually read the article they're commenting on.
Sounds a bit like the Gell-Mann Amnesia effect: https://en.wikipedia.org/wiki/Michael_Crichton#Gell-Mann_amn...
The Alt-Mann Amnesia Effect, maybe.
LLM version of Gell-Mann Amnesia.