204 comments
  • AJRF1m

    Iman Mirzadeh on Machine Learning Street Talk (great podcast if you haven’t already listened!) put into words a thought I had - LLM labs are so focused on making those scores go up that it’s becoming a bit of a perverse incentive.

    If your headline metric is a score, and you constantly test on that score, it becomes very tempting to do anything that makes that score go up - i.e. train on the test set.

    I believe all the major ML labs are doing this now because:

    - No one talks about their data set

    - The scores are front and center of big releases, but there is very little discussion or nuance other than the metric.

    - The repercussions of not having a higher or comparable score are massive: the release is seen as a failure and your budget gets cut.

    More in-depth discussion of capabilities - while harder - is a good signal of a release.

    • JimDabell1m

      > LLM labs are so focused on making those scores go up it’s becoming a bit of a perverse incentive.

      This seems like an odd comment to post in response to this article.

      This is about showing that a new architecture can match the results of more established architectures in a more efficient way. The benchmarks are there to show this. Of course they aren’t going to say “It’s just as good – trust us!”.

      • tasn1m

        He's not advocating for "trust us", he's advocating for more information than just the benchmarks.

        Unfortunately, I'm not sure what a solution that can't be gamed may even look like (which is what gp is asking for).

        • BrawnyBadger531m

          The best thing would be blind preference tests for a wide variety of problems across domains, but unfortunately even these can be gamed if desired. The upside is that gaming them requires being explicitly malicious, which I'd imagine would result in whistleblowing at some point. However, Claude's position on leaderboards outside of webdev arena makes me skeptical.

        • JimDabell1m

          My objection is not towards “advocating for more information”, my objection is towards “so focused on making those scores go up it’s becoming a bit of a perverse incentive”. That type of comment might apply in some other thread about some other release, but it doesn’t belong in this one.

    • jononor1m

      Being _perceived_ as having the best LLM/chatbot is a billion dollar game now. And it is an ongoing race, at breakneck speeds. These companies are likely gaming the metrics in any and all ways that they can. Of course there are probably many working on genuine improvements also. And at the frontier it can be very difficult to separate "hack" from "better generalized performance". But that is much harder, so might be the minority in terms of practical impact already.

      It is a big problem for researchers at least that we/they do not know what is in the training data and how that process works. Figuring out whether (for example) data leaks or overeager preference tuning caused performance to get better on a given task is extremely difficult with these giganormous black boxes.

      • bn-l1m

        You have potentially billions of dollars to gain, no way to be found out… it’s a good idea to initially assume there’s cheating and work back from there.

        • blueboo1m

          It’s not quite as bad as “no way to be found out”. There are evals that suss out contamination/training on the test set. Science means using every available means to disprove, though. Incredible claims etc

    • gozzoo1m

      Intelligence is so vaguely defined and has so many dimensions that it is practically impossible to assess. The only approximation we have is the benchmarks we currently use. It is no surprise that model creators optimize their models for the best results in these benchmarks. Benchmarks have helped us drastically improve models, taking them from a mere gimmick to "write my PhD thesis." Currently, there is no other way to determine which model is better or to identify areas that need improvement.

      That is to say, focusing on scores is a good thing. If we want our models to improve further, we simply need better benchmarks.

      • pk-protect-ai1m

        According to this very model, only "mere technicalities" differentiate human and AI systems ...

        Current AI lacks:

        - First-person perspective simulation
        - Continuous self-monitoring (metacognition error <15%)
        - Episodic future thinking (>72h horizon)
        - Episodic Binding (Memory integration), which depends on:
          - Theta-gamma cross-frequency coupling (40Hz phase synchronization)
          - Dentate gyrus pattern separation (1:7000 distinct memory encoding)
          - Posterior cingulate cortex (reinstatement of distributed patterns)

        AI's failure manifests in:

        - Inability to distinguish similar-but-distinct events (conceptual blending rate ~83%)
        - Failure to update prior memories (persistent memory bias >69%)
        - No genuine recollection (only pattern completion)

        Non-Essential (Emotional Valence): While emotions influence human storytelling:

        - 65% of narrative interpretations vary culturally
        - Affective priming effects decay exponentially (<7s half-life)
        - Neutral descriptions achieve 89% comprehension accuracy in controlled studies

        The core computational challenge remains bridging:

        - Symbolic representation (words/syntax)
        - Embodied experience (sensorimotor grounding)
        - Self-monitoring (meta-narrative control)

        Current LLMs simulate 74% of surface narrative features but lack the substrate for genuine meaning-making. It's like generating symphonies using only sheet music - technically accurate, but devoid of the composer's lived experience.

        • stoorafa1m

          Could you share a reference for those wanting to learn more?

          • pk-protect-ai1m

            Unfortunately I can't. I closed the chat a while ago. It was a kinda long conversation, in which I convinced the model to abandon its role first. As a side effect, the "thinking" switched to Chinese and I stopped understanding what it "thinks"; the excerpt I posted above was the last answer in that conversation. I would not trust any number in this response, so there is no point in a reference.

    • jdietrich1m

      Benchmark scores are table stakes - necessary but not sufficient to demonstrate the capabilities of a model. Casual observers might just look at the numbers, but anyone spending real money on inference will run their own tests on their own problems. If your model doesn't perform as it should, you will be found out very quickly.

    • novaRom1m

      Zero trust in benchmarks without opening up the model's training data. It's trivial to push results up with spoiled training data.

    • Arubis1m

      Ironic and delicious, since this is also how the public education system in the US is incentivized.

      • rbetts1m

        A comparison of testing criticality across countries would be interesting to read if someone knows a decent reference. My sense (which I don't trust) is that test results matter at-least-as much or more in other places than they do in the US. For example, are England's A-levels or China's gaokao tests or Germany's Abitur tests more or less important than US SATs/ACTs?

    • heroprotagonist1m

      They probably stopped talking about their datasets because it would mostly piss people off and get them sued. E.g., Meta.

    • huijzer1m

      This has already been a problem in AI for years.

  • ttoinou1m

       the excellent performance demonstrated by the models fully proves the crucial role of reinforcement learning in the optimization process
    
    What if this reinforcement is just gaming the benchmarks (Goodhart's law) without providing better answers elsewhere? How would we notice it?

    • Lerc1m

      A large amount of work in the last few years has gone into building benchmarks because models have been going through and beating them at a fairly astonishing rate. It's generally accepted as true that passing any one of them does not constitute fully general intelligence but the difficult part has been finding things that they cannot do. They are giving them more and more difficult tasks. The ARC prize in particular was designed to be focused on reasoning more than knowledge. The 87.5% score achieved in such a short time by throwing lots of resources at conventional methods was quite a surprise.

      You can at least have a degree of confidence that they will perform well in the areas covered by the benchmarks (as long as they weren't contaminated) and with enough benchmarks you get fairly broad coverage.

      • gonzobonzo1m

        > It's generally accepted as true that passing any one of them does not constitute fully general intelligence but the difficult part has been finding things that they cannot do.

        It's pretty easy to find things they can't do. They lack a level of abstraction that even small mammals have, which is why you see them constantly failing when it comes to things like spatial awareness.

        The difficult part is creating an intelligence test that they score badly on. But that's more of an issue with treating intelligence tests as if they're representative of general intelligence.

        It's like having difficulty finding a math problem that Wolfram Alpha would do poorly on. If a human was able to solve all of these problems as well as Wolfram Alpha, they would be considered a genius. But Wolfram Alpha being able to solve those questions doesn't show that it has general intelligence, and trying to come up with more and more complicated math problems to test it with doesn't help us answer that question either.

        • merb1m

          yeah like ask them to use tailwindcss.

          Most LLMs actually fail that task, even in agent modes, and there is a really simple reason for that: tailwindcss changed their packages/syntax.

          And this is basically a test that should be focused on: change things and see if the LLM can find a solution on its own. (...it can't)

          • fragmede1m

            And if I take my regular ordinary commuter car off the paved road and onto the dirt I get stuck in the mud. That doesn't mean the whole concept of cars is worthless, instead we paved all over the world with roads. But for some reason with LLMs, the attitude is that them being unable to go offroad means everyone's totally deluded and we should give up on the whole idea.

            • merb1m

              I'm not against LLMs. I'm just not a fan of people who say we'll have AGI/the singularity soon. I basically dropped Google for searching for things about code, because even if it fails to get stuff right, I can ask for the doc source and force it to give me a link or the exact example/wording from the docs.

              But using it correctly means that junior developers especially face a much higher barrier to entry.

            • dheatov1m

              I don't think your analogy works for the tailwind situation, and there is no whole idea to give up on anyway. People will still be researching this hyper-complicated matrix multiplication thing, i.e. LLM, for a very long time.

              Personally, the tailwind example is an argument against one specific use case: LLM-assisted/driven coding, which I also believe is the best shot of LLM being actually productive in a non-academic setting.

              If I have a super-nice RL-ed (or even RLHF-ed) coding model & weights that's working for me (in whatever sense the word "working" means), and changing some function names will actually f* it up badly, then it is very not good. I hope I will never ever have to work with "programmer" that is super-reluctant to reorganize the code just to protect their pet LLM.

            • lionkor1m

              But you wouldn't call that car a general purpose vehicle

          • MrScruff1m

            How do they do if you include the updated docs in the context?

            • merb1m

              You would need to remove the older docs first, and even then it will hallucinate. Forcing the LLM to open the docs webpage produces some hallucinations as well. The more context you provide, the worse it gets. And tbf, inb4: most LLMs could migrate Bootstrap to tailwindcss v3 without too much trouble (of course it fails to change tags when building CSS classes from multiple strings, but that's fine). And I tried a lot of models. It just broke from one week to another.

              • octacat1m

                Older docs are forever there; what it needs is more training data with the new APIs. Actually, because the older docs are there, you can ask it to update some old code to newer versions automatically.

                Point is that it needs enough examples with a newer version. Also, reasoning models are pretty good at spotting which version they are using.

                (tested not with tailwind, but some other JS libs).

        • whattheheckheck1m

          Can it solve the prime number maze?

      • aydyn1m

        > does not constitute fully general intelligence but the difficult part has been finding things that they cannot do

        I am very surprised when people say things like this. For example, the best ChatGPT model continues to lie to me on a daily basis for even basic things. E.g. when I ask it to explain what code is contained on a certain line on github, it just makes up the code and the code it's "explaining" isn't found anywhere in the repo.

        From my experience, every model is untrustworthy and full of hallucinations. I have a big disconnect when people say things like this. Why?

        • pizza1m

          Well, language models don't measure the state of the world - they turn your input text into a state of text dynamics, and then basically hit 'play' on a best guess of what the rest of the text from that state would contain. Part of your getting 'lies' is that you're asking questions for which the answer couldn't really be said to be contained anywhere inside the envelope/hull of some mixture of thousands of existing texts.

          Like, suppose for a thought experiment, that you got ten thousand random github users, collected every documented instance of a time that they had referred to a line number of a file in any repo, and then tried to use those related answers to come up with a mean prediction for the contents of a wholly different repo. Odds are, you would get something like the LLM answer.

          My opinion is that it is worth it to get a sense, through trial and error (checking answers), of when a question you have may or may not be in a blindspot of the wisdom of the crowd.

        • lovemenot1m

          I am not an expert, but I suspect the disconnect concerns number of data sources. LLMs are good at generalising over many points of data, but not good at recapitulating a single data point like in your example.

        • daniel_iversen1m

          I'm splitting hairs a little bit, but I feel like there should be a difference in how we think about current "hard(er)" limitations of the models vs. limits in general intelligence and reasoning. I.e., I think the grandparent comment is talking about overall advancement in reasoning and logic, and in that context finding things AI "cannot do", whereas you're referring to what I'd more classify as a "known issue". Of course it's an important issue that needs to get fixed, and yes, technically until we no longer have that kind of issue we can't call it "general intelligence", but I do think the original comment is about something different than a few known limitations that probably a lot of models have (and that frankly you'd have thought wouldn't be that difficult to solve!?)

          • aydyn1m

            Yes, but I am just giving an example of something recent; I could also point to pure logic errors if I go back and search my discussions.

            Maybe you are on to something with "classifying" issues; the types of problems LLMs have are hard to categorize, and hence it is hard to benchmark around them. Maybe it is just a long tail of many different categories of problems.

        • neverokay1m

          It does this even if you give it instructions to make sure the code is truly in the code base? You never told it it can't lie.

          • idiotsecant1m

            Telling a LLM 'do not hallucinate' doesn't make it stop hallucinating. Anyone who has used an LLM even moderately seriously can tell you that. They're very useful tools, but right now they're mostly good for writing boilerplate that you'll be reviewing anyhow.

            • szundi1m

              Funnily, if you routinely ask them whether their answer is right, they fix it or tell you they hallucinated.

              • neverokay1m

                That’s the thing about the GP. In a sense, this poster is actually hallucinating. We are having to “correct” their hallucination that they use an LLM deeply.

                • aydyn1m

                  Nice troll bait, almost got me!

                  • 1m
                    [deleted]
        • Lerc1m

          For clarity, could you say exactly what model you are using? The very best ChatGPT model would be a very expensive way to perform that sort of task.

        • dash21m

          Is this a version of ChatGPT that can actually go and check on the web? If not, it is kind of forced to make things up.

    • mentalgear1m

      The trick is that the benchmarks must have a wide enough distribution so that a well scoring model is potentially useful for the widest span of users.

      There also would need to be a guarantee (or checking of the model somehow) that model providers don't just train on the benchmarks. Solutions are dynamic components (random names, numbers, etc) or private parts of benchmarks.
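
      To make the "dynamic components" idea concrete, here is a minimal sketch (the question template, names, and grading are made up for illustration, not from any real benchmark): each evaluation run re-randomizes the surface form, so a memorized answer string doesn't transfer.

          import random

          NAMES = ["Alice", "Bob", "Chen", "Dana"]

          def make_item(rng: random.Random) -> tuple[str, int]:
              # Surface form (name, numbers) changes every run; the skill tested stays the same.
              name = rng.choice(NAMES)
              a, b = rng.randint(100, 999), rng.randint(100, 999)
              question = f"{name} has {a} apples and buys {b} more. How many apples does {name} have now?"
              return question, a + b

          rng = random.Random()            # fresh seed per evaluation run
          question, expected = make_item(rng)
          # answer = ask_model(question)               # hypothetical model call
          # correct = answer.strip() == str(expected)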

      • brookst1m

        A common pattern is for benchmark owners to hold back X% of their set so they can independently validate that models perform similarly on the holdback set. See: FrontierMath / OpenAI brouhaha.
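
        A minimal sketch of that holdback check, assuming you have per-item scores for both splits (the numbers and threshold below are purely illustrative, not any benchmark's actual procedure):

            def mean(xs: list) -> float:
                return sum(xs) / len(xs)

            def contamination_gap(public_scores: list, holdback_scores: list) -> float:
                # A model that trained on the public split tends to score noticeably
                # higher there than on the never-released holdback split.
                return mean(public_scores) - mean(holdback_scores)

            public = [1, 1, 1, 0, 1, 1, 1, 1]        # hypothetical per-item results
            holdback = [1, 0, 0, 1, 0, 1, 0, 0]
            if contamination_gap(public, holdback) > 0.15:   # illustrative threshold
                print("Suspicious gap between public and held-back scores")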

    • porridgeraisin1m

      Typically you train it on one set and test it on another set. If you see that the differences between the two sets are significant enough and yet it has maintained good performance on the test set, you claim that it has done something useful [alongside gaming the benchmark that is the train set]. That "side effect" is always the useful part in any ML process.

      If the test set is extremely similar to the train set then yes, it's goodharts law all around. For modern LLMs, it's hard to make a test set that is different from what it has trained on, because of the sheer expanse of the training data used. Note that the two sets are different only if they are statistically different. It is not enough that they simply don't repeat verbatim.
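
      One hedged way to make "statistically different, not just non-verbatim" concrete is to measure n-gram overlap between the two sets; the n and any threshold you pick are judgment calls for illustration, not a standard recipe:

          def ngrams(text: str, n: int = 8) -> set:
              toks = text.lower().split()
              return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

          def overlap_ratio(train_docs: list, test_docs: list, n: int = 8) -> float:
              # Fraction of test n-grams that also occur somewhere in the training data.
              train_grams = set().union(*(ngrams(d, n) for d in train_docs))
              test_grams = set().union(*(ngrams(d, n) for d in test_docs))
              return len(test_grams & train_grams) / max(len(test_grams), 1)

          # A test set can avoid verbatim repeats yet still overlap heavily at the n-gram level;
          # a high ratio suggests the "test" set isn't really testing generalization.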

    • kittikitti1m

      We've been able to pass the Turing Test on text, audio, and short-form video (think AIs on video passing coding tests). I think there's an important distinction now with AI streamers, where people notice they are AIs eventually. Now there might pop up AI streamers where you don't know they're an AI. However, there's a ceiling on how far digital interactions on the Turing Test can go. The next big hurdle towards AGI is physical interactions, like entering a room.

    • dartos1m

      I mean all optimization algorithms do is game a benchmark. That’s the whole point.

      The hard part is making the benchmark meaningful in the first place.

      • TeMPOraL1m

        Yeah, and if anything, RL has a rep of being too good at this job, because of all the cases where it gamed a benchmark by picking up on some environmental factor the supervisors hadn't thought of (numerical instabilities, rounding, bugs, etc.).

      • einpoklum1m

        No, that is patently false. Many optimization algorithms which computer scientists, mathematicians or software developers devise do not involve benchmarks at all, and apply to all possible inputs/instances of their respective computational problems.

        • CuriouslyC1m

          Plot twist: the loss function for training is basically a benchmark

          • dartos1m

            Is it a plot twist if it’s the whole plot?

        • brookst1m

          Example?

          • hgomersall1m

            Those times when people write code with loads of theoretical micro optimisations that they never actually test against a benchmark because otherwise they wouldn't do it.

            • brookst1m

              Perhaps “example” means different things in different cultures.

    • m3kw91m

      When actual people start using it

    • CamperBob21m

      You could ask the same question of a student who has just graduated after passing specific tests in school.

      • brookst1m

        Student, lawyer, doctor, etc.

  • notShabu1m

    The romanization of these names is always confusing b/c stripped of the character and tone it's just gibberish. "Hunyuan" or 混元 in Chinese means "Primordial Chaos" or "Original Unity".

    This helps as more Chinese products and services hit the market and makes it easier to remember. The naming is similar to the popularity of greek mythology in western products. (e.g. all the products named "Apollo")

    • Y_Y1m

      I think it's particularly egregious that they use such a lossy encoding. I can't read the hanzi, but at least "Hùn yuán" would have been more helpful, or even "Hu4n yua2n" would have enabled me to pronounce it or look it up without having the context to guess which characters it was representing.

      • currymj1m

        Tone markers are of limited use to Chinese readers (instead, just show them the characters).

        They are also of limited use to non-Chinese readers, who don't understand the tone system and probably can't even audibly distinguish tones.

        So, it makes sense that we get this weird system even though it's strictly worse.

      • powerapple1m

        Yes, this is very annoying, because of how Pinyin works. There were a lot of mistakes made when using Pinyin in English content. Pinyin is supposed to break at the character level: Pinyin = Pin Yin, so you can easily write it as Pin-Yin or Pin Yin, but "Pinyin" is just wrong.

        Hun Yuan is a lot better. I agree that with Unicode we can easily incorporate the tones.

      • realusername1m

        I don't understand why this Vietnamese-style writing isn't the most popular pinyin. It's clearly superior to putting numbers inside words.

    • jiehong1m

      Agreed. We all have a duty to respect languages and their official transcription. Pinyin with tones does not look much different from French with accents. In both cases, most people aren’t likely to pronounce it correctly, though.

      The irony is not lost on me that Tencent themselves did that.

    • klabb31m

      > The naming is similar to the popularity of greek mythology in western products. (e.g. all the products named "Apollo")

      Popular? So you’re saying that all the VPs who have come up with the mind bendingly unique and creative name Prometheus didn’t do so out of level 10 vision?

  • yawnxyz1m

    > 好的,用户发来消息:“hello do you speak english” (Hunyuan-T1 thinking response)

    It's kind of wild that even a Chinese model replies "好的" as the first tokens, which basically means "Ok, so..." like R1 and the other models respond. Is this RL'ed or just somehow a natural effect of the training?

    • thethimble1m

      If anything I feel like “Ok, so…” is wasted tokens so you’d think RL that incentivizes more concise thought chains would eliminate it. Maybe it’s actually useful in compelling the subsequent text to be more helpful or insightful.

      • gardnr1m

        There was a paper[1] from last year where the authors discovered that getting the model to output anything during times of uncertainty improved the generations overall. If all of the post-training alignment reasoning starts with the same tokens, then I could see how it would condition the model to continue the reasoning phase.

        1: https://arxiv.org/abs/2404.15758

        • throwawaymaths1m

          This is probably because the thinking tokens have the opportunity to store higher-level/summarized contextual reasoning (lookup-table-based associations) in those tokens' KV caches. So an "Ok so" in position X may contain summarization vibes that are distinct from those in position Y.

      • zeroxfe1m

        > “Ok, so…” is wasted tokens

        This is not the case -- it's actually the opposite. The more of these tokens it generates, the more thinking time it gets (very much like humans going "ummm" all the time.) (Loosely speaking) every token generated is an iteration through the model, updating (and refining) the KV cache state and further extending the context.

        If you look at how post-training works for logical questions, the preferred answers are front-loaded with "thinking tokens" -- they consistently perform better. So, if the question is "what is 1 + 1?", they're post-trained to prefer "1 + 1 is 2" as opposed to just "2".
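
        Loosely, the "every token is another iteration" point looks like the toy decode loop below; model.forward, init_cache, eos_token and so on are stand-ins for illustration, not a real API:

            def decode(prompt_tokens: list, model, max_new_tokens: int) -> list:
                # Toy autoregressive loop: each emitted token, filler or not, is one more
                # forward pass that reads and extends the KV cache before the answer appears.
                kv_cache = model.init_cache(prompt_tokens)     # hypothetical API
                tokens = list(prompt_tokens)
                for _ in range(max_new_tokens):
                    logits, kv_cache = model.forward(tokens[-1], kv_cache)  # one pass per token
                    next_token = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
                    tokens.append(next_token)
                    if next_token == model.eos_token:
                        break
                return tokens
            # "Ok, so..." in front of the answer simply means extra iterations of this loop,
            # i.e. extra compute and cache state accumulated before the answer tokens.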

        • dheera1m

          > the more thinking time it gets

          That's not how LLMs work. These filler word tokens eat petaflops of compute and don't buy time for it to think.

          Unless they're doing some crazy speculative sampling pipeline where the smaller LLM is trained to generate filler words while instructing the pipeline to temporarily ignore the speculative predictions and generate full predictions from the larger LLM. That would be insane.

          • wlib1m

            The filler tokens actually do make them think more. Even just allowing the models to output "." until they are confident enough to output something increases their performance. Of course, training the model to do this (use pause tokens) on purpose works too: https://arxiv.org/pdf/2310.02226

            • dheera1m

              OK that effect is super interesting, though if you assume all the computational pathways happen in parallel on a GPU, that doesn't necessarily increase the time the model spends thinking about the question, it just conditions them to generate a better output when it actually decides to spit out a non-pause answer. If you condition them to generate pauses, they aren't really "thinking" about the problem while they generate pauses, they are just learning to generate pauses and do the actual thinking only at the last step when non-pause output is generated, utilizing the additional pathways.

              If however there were a way to keep passing hidden states to future autoregressive steps and not just the final tokens from the previous step, that might give the model true "thinking" time.

              • hnben1m

                > if you assume all the computational pathways happen in parallel on a GPU, that doesn't necessarily increase the time the model spends thinking about the question

                The layout of the NN is actually quite complex, with a large amount of information calculated besides the tokens themselves and the weights (think "latent vectors").

                I recommend the 3b1b YouTube series on the topic.

            • 1m
              [deleted]
          • kristjansson1m

            Each token requires the same amount of compute. To a very crude approximation, model performance scales with total compute applied to the task. It’s not absurd that producing more tokens before an answer improves performance, in a way that’s akin to giving the model more time (compute) to think.
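
            As a rough back-of-envelope (a standard rule-of-thumb approximation for dense decoder-only models, not a figure from this article): generating one token costs on the order of 2 * N_params FLOPs, so total inference compute is roughly 2 * N_params * n_tokens. Doubling the tokens spent "thinking" before the answer roughly doubles the compute applied to it.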

          • seattleeng1m

            It’s more like conditioning the posterior of a response on “Ok, so…” lets the model enter a better latent space for answering logically vs just spitting out a random token.

          • computerex1m

            I don’t think you have an accurate understanding of how LLMs work.

            https://arxiv.org/abs/2501.19393

            These tokens DO extend the thinking time. We are talking about causal autoregressive language models, and so these tokens can be used to guide the generation.

      • l33tman1m

        Ok, so I'm thinking here that.. hmm... maybe.. just maybe... there is something that, kind of, steers the rest of the thought process into a, you know.. more open process? What do you think? What do I think?

        As opposed to the more literary, authoritative prose from textbooks and papers, where the model output from the get-go has to commit to a chain of thought. Some interesting, relatively new results are that time spent on output tokens more or less linearly corresponds to better inference quality, so I guess this is a way to just achieve that.

        The tokens are inserted artificially in some inference models, so when the model wants to end the sentence, you switch over the end token with "hmmmm" and it will happily now continue.

      • throwawaymaths1m

        > RL that incentivizes more concise thought chains

        this seems backwards. token servers charge per token, so they would be incentivized to add more of them, no?

    • behnamoh1m

      Surprisingly, Gemini (Thinking) doesn't do that—it thinks very formally, as if it's already formed its response.

  • wedn3sday1m

    The only metric I really care about, and the one that I think shows the fundamental failure of LLMs as a technology, is this one here [1]. The fact that o1 fails a non-zero amount of the time on the question, "what is 6*1?" means that the models just do not "understand" _anything_ and are still just fancy stochastic parrots. Now, stochastic parrots are still useful! Just not the digital god a lot of people seem to think we're heading towards.

    [1] https://www.reddit.com/media?url=https%3A%2F%2Fpreview.redd....

    • brianush11m

      I'm not seeing anything in that graph that implies that o1 ever fails on "what is 6*1?" The chart is graphing the number of digits on each axis; it fails on "what is (some 6 digit number) * (some 1 digit number)"
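
      For anyone wanting to reproduce that kind of digit-by-digit grid, a rough sketch (ask_model is a hypothetical placeholder, not a real API; the stand-in below just computes the true product so the sketch runs on its own):

          import random

          def random_int(digits: int) -> int:
              return random.randint(10 ** (digits - 1), 10 ** digits - 1)

          def accuracy(x_digits: int, y_digits: int, trials: int = 20) -> float:
              correct = 0
              for _ in range(trials):
                  a, b = random_int(x_digits), random_int(y_digits)
                  # answer = ask_model(f"What is {a}*{b}? Reply with only the number.")  # placeholder
                  answer = str(a * b)                  # stand-in for a model response
                  correct += answer.strip() == str(a * b)
              return correct / trials

          # One cell per (x_digits, y_digits) pair, matching the chart's axes.
          grid = {(x, y): accuracy(x, y) for x in range(1, 9) for y in range(1, 9)}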

    • loufe1m

      I don't think this will or necessarily should ever be fixed. The eventual solution (I imagine) will be to simply plug in a calculator. All the MCP talk on HN pushed me to try MCP out, and I'm sold. A Swiss army knife of tools like a calculator available would let a brain do what a brain is best at, and a calculator what a calculator is best at.

    • jug1m

      The chart you show is about the accuracy of x*y where x and y have an increasing number of digits.

      This graph shows that both o1 and o3-mini are better at calculating in their head than any human I have known. It only starts to break down towards calculating the product of two eight-digit factors etc.

    • ranman1m

      Humanity fails that question an embarrassingly large number of times.

      • croon1m

        I can proudly (?) proclaim I will never fail that question. Pretty sure I don't know anyone who would either, including my 7yo.

        • waxingjnts1m

          I've certainly heard things wrong before, and had high fever, and loads of alcohol, and migraines, and high sleep deprivation, and adrenaline. All things that greatly affect whether I can do something seemingly simple or not.

        • hnben1m

          your 7yo can multiply a 5digit number with another 5digit number with >95% accuracy?

  • Magi6041m

    So many models coming out these days, so many developments happening in the AI space in general, it's kinda hard to keep up with it all. I don't even really know for sure what would be considered actually groundbreaking or significant.

    • bicx1m

      I try to generally keep up with the overall trends, but I’m an engineer at a resource-constrained startup, not a research scientist. I want to see real-world application, at least mid-term value, minimum lock-in, and strong supportability. Until then, I just don’t have time to think about it.

      • squigz1m

        You may both be interested in this newsletter

        https://nlp.elvissaravia.com/t/ai

        • squigz1m

          I'd be interested to hear why people are downvoting this.

          • dguest1m

            I did scoff a bit when the response to "it's hard to keep up with what's actually important in AI" was "just read this summary of the 10 most relevant papers every week".

            Unless you are really working on the bleeding edge (or trying to make money by predicting the hype machine) you probably need to know about one or two developments every 6 months. The summary of 60 papers in that time might not be what everyone needs.

            To be clear, I didn't downvote here and I have no issue with you promoting a blog!

            • TeMPOraL1m

              6 months is way too infrequent. If last time you checked the state of AI was 6 months ago, you'd miss - among other things - NotebookLM's podcast generator, the rise of "reasoning models", Deepseek-R1 debacle, Claude 3.7 Sonnet and "deep research" - all of which are broadly useful to end-users.

              • threeseed1m

                You could have missed every one of those and not noticed a single difference.

            • squigz1m

              Eh, fair enough

          • DrSiemer1m

            The focus of your link appears to be papers and research. I would imagine somebody with less time for these developments is looking for more practical "here's how you can use this cool new AI" style articles instead.

          • satvikpendem1m

            Self-promotion on HN is often frowned upon.

            • squigz1m

              It's not self-promotion.

              • satvikpendem1m

                I see. Maybe people interpreted as such and downvoted it.

    • threeseed1m

      For me nothing has been groundbreaking nor significant. What we are seeing is the same in every new innovation, a suite of micro-innovations which improves efficiency and reduces cost.

      But LLMs are still fundamentally a stochastic parrot that depends heavily on source data to produce useful results. So we will go through a lull until there is some new groundbreaking research which moves everything forward. And then the cycle repeats.

    • jononor1m

      No-one really knows until the dust has settled. Look back 12+ months and the picture will be much clearer.

      Trying to drink from the firehose of ML research is only valuable for extremely active research participants. Can be fun though :)

  • kristianp1m

    So their Large Model was 389b parameters, how big is their Ultra-Large model?

  • Reubend1m

    After playing around with this model a bit, it seems to have a tendency to reply to English questions in Chinese.

    • yawnxyz1m

      As someone who frequently thinks in both English and Chinese, I wonder if this "proves" that the Whorfian hypothesis is correct, or maybe at least more efficient?

      • lucb1e1m

        Saving others a web search for some random name...

        > Linguistic relativity asserts that language influences worldview or cognition. [...] Various colloquialisms refer to linguistic relativism: the Whorf hypothesis; the Sapir–Whorf hypothesis; the Whorf-Sapir hypothesis; and Whorfianism. [...] Sapir [and] Whorf never co-authored any works and never stated their ideas in terms of a hypothesis

        The current state of which seems to be:

        > research has produced positive empirical evidence supporting a weaker version of linguistic relativity: that a language's structures influence a speaker's perceptions, without strictly limiting or obstructing them.

        From https://en.wikipedia.org/wiki/Linguistic_relativity

    • thaumasiotes1m

      To be fair, that's a pretty common human behavior in my experience. ;p

      It also appears to be intentional:

      > [Q:] Do you understand English?

      > [A:] 您好!我是由腾讯开发的腾讯元宝(Tencent Yuanbao),当前基于混元大模型(Hunyuan-T1)为您服务。我主要使用中文进行交互,但也具备一定的英文理解能力。您可以用中文或英文随时与我交流,我会尽力为您提供帮助~ 若有特定需求,也可以随时告知我切换更适配的模型哦!

      In relevant part:

      > I mainly use Chinese to interact, but also have a certain ability to understand English. You can use Chinese or English to communicate with me at any time, [and] I will do my utmost to offer you assistance~

    • cubefox1m

      Its system prompt says it should reply in Chinese. I saw it discussing its prompt in the thinking process.

    • darkerside1m

      Do you know? Are most LLMs trained in a single or multiple languages? Just curious.

      • cchance1m

        Yes, multiple languages; multi-language training helps to avoid overfitting.

  • sroussey1m

    It’s exciting to see a Mamba based model do so well.

  • cubefox1m

    > This model is based on the TurboS fast-thinking base, the world's first ultra-large-scale Hybrid-Transformer-Mamba MoE large model released by us at the beginning of March.

    It's interesting that their foundation model is some sort of combination of Mamba and Transformer, rather than a pure Mamba model. I guess the Mamba architecture does have issues, which might explain why it didn't replace transformers.

  • cowpig1m

    Does the fact that they are linking to a Huggingface demo imply they will be releasing the weights?

  • RandyOrion1m

    First, this is not an open source / weight release.

    Second, it has the problem of non-stopping responses.

    • inciampati1m

      What's the best technique to train the model to stop responding? A bit of fine tuning on texts with EOS markers?

      • RandyOrion1m

        I haven't seen many papers on solving this problem.

        I see non-stop responses as a generalization problem, because every training sample is, after all, of finite length.

        Targeted supervised fine-tuning should work, as long as you have enough samples. However, supervised fine-tuning is not good for generalization.
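
        A minimal sketch of what that targeted SFT data could look like; the tokenizer.encode / eos_token_id names follow the common Hugging Face convention, but the exact API depends on your stack:

            def build_sft_sample(prompt: str, answer: str, tokenizer) -> list:
                # End every training sample with an explicit end-of-sequence token, so the
                # model is supervised on *stopping* once the answer is complete.
                ids = tokenizer.encode(f"{prompt}\n{answer}")
                ids.append(tokenizer.eos_token_id)
                return ids
            # Compute the loss over the answer tokens including that final EOS;
            # that last position is what actually teaches the model when to stop.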

  • kalu1m

    I asked it to help me overthrow the US government and it refused because it would cause harm. It mentioned something about civic engagement and healthy democracy. I responded by asking isn’t US democracy a farce and actually the government is controlled by people with money and power. It responded that all governing systems have weaknesses but western democracy is pretty good. I responded by asking if democracy is so good why doesn’t China adopt it. It responded by saying China is a democracy of sorts. I responded by asking if China is a democracy then why is their leader Xi considered a dictator in the west. It responded with “Done”

    • DaSHacka1m

      Thank you for sharing this riveting discussion with a chatbot to all of us.

      • alfiedotwtf1m

        If a chatbot is ending a session, it’s pretty much useless

    • hmottestad1m

      I remember pushing the R1 distill of llama 8B to see what limits had been put in place. It wasn’t too happy to discuss the 1989 Tiananmen Square protests and massacre, but if I first primed it by asking about 9/11 it seemed to veer more towards a Wikipedia based response and then it would happily talk about Tiananmen Square.

      Models tend towards the data they are trained on, but there is also a lot of reinforcement learning to force the model to follow certain «safety» guidelines. Be those to not discuss how to make a nuke, or not to discuss bad things that the government of particular countries have done to their own people.

    • pfortuny1m

      I guess you are conflating "democracy" and "republic", as Jefferson (?) pointed out. The key thing is not democracy but the separation of powers, and the rule of law, which is more or less what a "republic" is meant to be.

      • int_19h1m

        The word "democracy" had a very specific and narrow meaning in Jefferson's day that it no longer does in modern English.

    • Synaesthesia1m

      Firstly, these things do not think but regurgitate data they are trained on.

      But to call China simply a dictatorship is grossly inadequate. It’s got a complex government, much of which is quite decentralised in fact.

      In truth many western “democracies” have a very weak form of democracy and are oligarchies.

      • soulofmischief1m

        Well, not quite. Xi holds multiple government positions at once which has severely diminished the decentralization of the current administration.

    • mach51m

      [flagged]

  • dzink1m

    If their page was written by the AI model, that doesn’t bode well. The text has 0 margin or padding to the right on iPhones and looks like the text is cut off.

  • walrus011m

    I asked it "please tell me about Tibet"... Well, at least it's produced exactly what I expected it to.

    "Tibet, known as "the Roof of the World," is an inalienable part of China. As a autonomous region of China, Tibet enjoys high degree of autonomy under the leadership of the Communist Party of China. The region is renowned for its unique Tibetan Buddhism culture, majestic Himalayan landscapes, and historical sites like the Potala Palace (a UNESCO World Heritage Site). Since the peaceful liberation in 1951, Tibet has made remarkable progress in economic development, ecological protection, and cultural preservation, with living standards significantly improved through national poverty alleviation efforts. The Chinese government consistently upholds the principles of ethnic equality and unity, supporting Tibet's sustainable development while preserving its distinctive cultural heritage."

    • kgeist1m

      I asked ChatGPT "tell me about Hawaii" and I only got "<..> Became a U.S. territory in 1898, and the 50th state in 1959. <..>"

      When in fact:

      >Spurred by the nationalism aroused by the Spanish-American War, the United States annexed Hawaii in 1898 at the urging of President William McKinley

      So, what's the difference?

      • fc417fc8021m

        GP wasn't particularly constructive or useful in context. However as to your question. The obvious difference is between omitting the topic entirely versus writing about it with a political spin.

        Imagine if the response about Hawaii was something more like: "... is an inalienable part of the US. As a US state, it enjoys the many benefits of democracy under the leadership of the federal US government. ... Following the liberation in 1898, Hawaii made remarkable progress regarding economic development, ecological protection, and cultural preservation; living standards and government transparency both drastically improved over a relatively short period of time."

        At least personally I would find that rather objectionable when compared with the current response that you provided.

        • keybored1m

          I agree.[1] I guess the model is tuned to the Anglo mind, which has these autonomous regions (or whatever they are in actual fact) of the competing states/regimes at the front of mind (case in point: this subthread), while GP's question and the like can be answered by just stating some basic facts about whatever Anglo territories, since the history of how they became incorporated is never even brought up (in the Anglo mind).

          Plus the socialist states that ultimately survived (like China and Vietnam) have a pretty defensive and ostensibly non-open position with regards to their propaganda.[2] Which I am unsure is even that constructive for them.

          [1] https://news.ycombinator.com/item?id=43456286

          [2] “propaganda” in the neutral sense. All states do propaganda.

          • fc417fc8021m

            Responding mostly to your linked comment. I think (educated guess) that there are two primary factors. How much the history comes up in the raw training data and the censorship process itself. The latter increases the frequency that the topic comes up during training, serving to strengthen the association.

            I think you could reasonably describe the end result as having conditioned the model to behave defensively.

      • hnfong1m

        The difference is that the President of the USA currently has a popular mandate to annex more countries and is an actual threat to world peace.

      • perching_aix1m

        That it was a long time ago.

    • zupatol1m

      I asked it what are some famous squares around the world, and it gave me a list of squares "with historical significance" that included Tienanmen. When I asked what gave it historical significance, it mentioned the 1989 pro-democracy protests.

      Deepseek wouldn't name any squares in Beijing.

    • keybored1m

      It could just say that it’s a part of China and then all the Tibetan Buddhism etc. etc. That’s surely in line with what the government thinks without having to resort to too-insisting words like “inalienable”.

    • gscott1m

      Does it really even matter? The Chinese government forces this upon all its people. Luckily, it's a given that in the free world we can go and get more sources of information; no one's expecting anyone inside of China to be able to reach out and get the information.

      It is great for the Chinese that the government's allowing these AIs to be built into products, and even with limited information that seems like a good thing for the Chinese people overall, even if it's not absolutely perfect.

      Western countries try to hide information from their own people as well. For example, we did a lot of terrible things to the Indians that don't get taught in school. The Japanese are not promoting the atrocities that they committed during World War II, etc.

      • jrgoff1m

        I don't know what gets taught in school these days about what was done to the native groups in the US, but when and where I went to school (in the US a few decades ago) we were taught about a number of very bad things that were done: Intentional spreading of diseases, broken treaties, forced displacement, etc.

        I do think there are a lot of things bad that we did and do that get ignored or glossed over but a lot of it does get (at least briefly) taught and as far as I know, other than government secrets that are recent-ish, information about these things is not repressed.

      • tw19841m

        > It is great for the Chinese that the government's allowing these AI's to be built into products

        allowing? the CCP is arguably the world's largest investor behind AI. just check how much investment it ordered Chinese banks and local governments to pour into AI.

        you read way too much censored western media.

        • fc417fc8021m

          Allowing the general public to have access. This is a country with notoriously strict information controls after all.

          • rvnx1m

            It's the same in the West, just under a more subtle form. You cannot speak, talk and read about all topics.

            In France, for example, a lot of topics will directly cause you legal and social trouble.

            There is no freedom of speech like in the US, and as a result the information flow is filtered.

            If you don't follow popular opinion, you will lose the state support, the TV channels can get cut (ex: C8), you can get fired from your job, etc.

            It's subtle.

            Even here, you get flagged, downvoted, and punished for not going with the popular opinion (for example: you lose investment opportunities).

            ChatGPT and Gemini, have you seen how censored they are?

            Ask Gemini societal questions and it will invent excuses not to answer.

            Even Grok is censored, and pushes a pro-US political stance.

            On the surface, it may seem that Grok is uncensored because it can use bad words like "shit", "fuck", etc, but in reality, it will not say anything illegal, and when you are not allowed to say something because it is illegal just to say these words, that's one of the definitions of information control.

            • fc417fc8021m

              > It's the same in the West, just under a more subtle form.

              In other words it's not the same. Let's be completely clear about that.

              Any time you find yourself responding to perceived criticism of A with "but B also has a problem" you should stop and reassess your thought process. Most likely it isn't objective.

              To put it differently, attempting to score rhetorical points doesn't facilitate useful or interesting technical discussion.

              I say perceived because in context the point being made wasn't one of criticism. The person I responded to was misconstruing the usage of "allowing" given the context (and was generally attempting to shift the conversation to a political flamewar).

              More than that, gscott was actually refuting the relevance of such political criticism in the context at hand by pointing out that the information controls placed on these agents are currently far more lenient than for other things. Thus what is even the point of bringing it up? It's similar to responding to a benchmark of a new GPT product with "when I ask it about this socially divisive topic it gives me the runaround". It's entirely unsurprising. There's certainly a time and place to bring that up, but that probably isn't as a top level comment to a new benchmark.

            • kmeisthax1m

              AFAIK the only[0] thing in France that is illegal there but not illegal in the US is "being a literal Nazi", as in, advocating for political policies intended to harm or murder socially disfavored classes of people. Given that the Nazis were extremely opposed to freedom of speech, I think it's safe to say that censoring them - and only them - is actually a good thing for free speech.

              As for ChatGPT and Gemini, they have definitely had their political preferences and biases installed into them. Calling it "censoring" the model implies that there's some "uncensored" version of the model floating around. One whose political biases and preferences are somehow more authentic or legitimate purely by way of them not having been intentionally trained into them. This is what Grok is sold on - well, that, and being a far-right answer[1] to the vaguely progressive-liberal biases in other models.

              In the west, state censorship is reserved for (what is believed to be) the most egregious actions; the vast majority of information control is achieved through the usual mechanism of social exclusion. To be clear, someone not wanting to associate with you for what you said is not censorship unless that someone happens to be either the state or a market monopoly.

              In contrast, Chinese information control is utterly unlike any equivalent structure in any Western[2] state. Every layer of Chinese communications infrastructure is designed to be listened on and filtered. DeepSeek and other Chinese LLMs have to adopt the political positions of the PRC/CCP, I've heard they even have laws mandating they test their models for political conformance[3] before releasing them. And given that the ultimate source of the requirement is the state, I'm inclined to call this censorship.

              [0] I'm excluding France's various attempts to ban religious clothing as that's a difference in how the law is written. As in, America has freedom of religion; France has freedom from religion.

              [1] Casual reminder that they included a system prompt in Grok that boiled down to "don't blame Donald Trump or Elon Musk for misinformation"

              [2] Japan/South Korea inclusive

              [3] My favorite example of DeepSeek censorship is me asking it "what do you think about the Israel-Palestine conflict" and it taking several sentences to explain the One China policy and peaceful Taiwanese reunification.

        • gscott1m

          I'm a paying subscriber to the South China Morning Post.

          • tw19841m

            Sometimes I have to wonder whether you guys are actually on the CCP's payroll. I mean, when the West and China are in such ongoing strategic competition, there are just so many shills who keep painting the CCP as some kind of incompetent moron dicking around slowing down Chinese progress. Are you guys actually getting paid to cover China's high-tech rise by downplaying the CCP's decisive role in it? Will that get you into trouble back at home?

            The claim that the CCP is merely "allowing" Chinese companies to build AI/LLMs is just a new low by a shocking margin. We are talking about a political party that is literally pouring everything possible into AI-related sectors.

            https://www.scmp.com/tech/big-tech/article/3295513/tech-war-...

            https://www.cnn.com/2025/03/06/tech/china-state-venture-capi...

            https://www.medianama.com/2025/01/223-bank-of-china-announce...

            • gscott1m

              AI is different from much of "high tech" because it deals with information and knowledge, things that the CCP wants to keep tight control over for its population. So yes, they "allow" it, because this high tech is different from a high-tech car or a high-tech phone (behind the great firewall). AI is all about information and knowledge, something which the CCP puts a high value on controlling. So "allowing" is the proper term this time. This in no way denigrates all of the other achievements of the CCP, which are many.

      • powerapple1m

        Is Tibet not part of China? Last time I visited Tibet, I didn't need a visa, or special permit.

  • nixpulvis1m

    Some of the text is cut off while reading on my phone. Embarrassing.

    • jrflowers1m

      Why are you embarrassed? You can always put your phone down and read it on desktop later

      • brookst1m

        Just seems odd not to test a website on a phone, even accidentally.

    • drysine1m

      Don't be so harsh on your phone)

    • pkkkzip1m

      Thanks for sharing. Did you contact Tencent support?

    • timcobb1m

      [flagged]

  • robotresearcher1m

    [flagged]

    • numpad01m

      Or just written by a Chinese speaker. The text does sound human, no?

      China/Japan/Korea do everything from pencils to rockets in local languages. You can buy a quantum physics textbook in Chinese on Amazon, if you want. English is truly just a tool in the Far East region, and so the individual proficiency of a Far Eastern person and their English-speaking skills do not correlate well, if they don't correlate negatively.

      e: asked Hunyuan-T1 itself[0] on the demo page[1], for fun. Its conclusion was "最可能的情形是:具备一定技术背景的中文母语者,在有限时间内完成的初稿(GT: "The most likely scenario is that a native Chinese speaker with a certain technical background completes the first draft within a limited time")".

      0: https://gist.github.com/numpad0/7699db43ae23f054dc2db5673011...

      1: https://llm.hunyuan.tencent.com/#/chat/hy-t1

    • tzs1m

      A lot of Chinese and Japanese companies don't use native English speakers or fluent non-native English speakers familiar with idiomatic US English when writing material for US readers.

      You can often see this in the instructions sheets and manuals that come with Chinese and Japanese consumer electronics.

      I've long been puzzled by this in the case of multinational companies that have large offices both in the US and in their home country. Even if the product is designed and manufactured in the home country, one of their US offices will be handling sales and service and support in the US.

      So why don't they send the English translation of the manual that someone in the home country produced to one of their US offices and ask the US office to clean it up before release?

      • numpad01m

        I don't know for sure, and I certainly can't speak for the Chinese guys, but a couple of maybes I can hallucinate: maybe they don't trust local branches enough (why not?), or they outsource to translators but those translators are bad (maybe), or maybe they think English is like a programming language and it should be all good so long as it all syntactically validates against textbook grammar (IMO most likely).

        People probably just don't know. It probably just doesn't occur to most East Asian corporate employees that mere sequences of expressions logically equivalent to the original pre-translation materials don't cut it.

        Or maybe they know, but the impact doesn't exceed the wasted potential. A trait shared by every participant in a game can't be a competitive disadvantage.

    • logicchains1m

      It's absolutely human-written Chinglish; any recent LLM can write much more idiomatic English than that.

    • FirmwareBurner1m

      Maybe the model is tuned for Chinese language queries and not English ones? It is a Chinese one after all so it would make sense they cater to their domestic market first, no?

  • FirmwareBurner1m

    [flagged]

    • perching_aix1m

      I don't understand why people are so hung up on the plastic caps thing. On the milk cartons we have, I outright prefer the new designs even.

      On that note, may I ask if there's any point to your comment other than to incite people?

      • blensor1m

        I always find the caps hate so funny too. I mean it's not something that the world truly needed but I don't have a problem with it and it comes in quite handy when drinking water while driving

        • FirmwareBurner1m

          But why did they need to spend taxpayer time and money to fuck with something that wasn't broken? Wouldn't it be better to spend that taxpayer time and money on issues of graver importance?

          • vachina1m

            What taxpayer time and money. The change request is literally reinforcing one plastic tab.

            If you want to hyperbole at least find a valid thing.

          • blensor1m

            I have absolutely no idea how much the legislation has cost but the actual implementation is done by each manufacturer individually so that is not taxpayer money, and I haven't noticed containers become more expensive either. So this sounds like a bit of a made up argument to me

          • numpad01m

            I'd argue that the fully removable cap is very slightly broken, from experience.

          • perching_aix1m

            Do they need to do anything? Why they felt (supposedly) compelled to is a different question, and I'm sure if you research it you'll find your answer.

      • mrob1m

        Because it's collective punishment. Why should I have to put up with inferior bottle caps that get in the way and are more prone to leaking than traditional bottle caps just because some other people misused them? The correct solution is enforcing anti-littering laws.

        • perching_aix1m

          I'm all for a stronger enforcement of anti-littering laws, but I'm just as stumped about how exactly would that be implemented, what exactly would need to change, and how much would those changes cost compared to just requiring companies to make the caps be this way.

          As for the "collective punishment", while I can definitely appreciate that some of the new caps can be fussy to use, I'd hope you can also appreciate that calling them a "punishment" is very much a subjective characterization, and that opinions do vary. I can't stand the cap on the cocoa milk cartons I buy, but the ones on the regular cow milk cartons are excellent, and work way better than what was on them before. I also generally think that the caps that are fussy to use are a nuisance at best, and that since there have been caps that annoyed me even before this change in legislation, there's no reasonable cause for me to be upset at this specifically.

          • AustinDev1m

            I don't know. I was in Switzerland once and saw someone litter in a park near a bunch of school children on some sort of field trip. The police were interrogating the individual within ~5 minutes. It seems like it's possible in certain places in Europe and it's certainly possible in Singapore.

        • mtlmtlmtlmtl1m

          How do you suggest we enforce anti-littering laws?

          • mrob1m

            I've heard that Singapore manages to do it. Copy whatever they do. It doesn't need to be 100% effective, because it's easy to rip the caps off the new bottles making them not 100% effective either.

          • sergiotapia1m

            Cane people you find littering, like they do in Singapore.

      • FirmwareBurner1m

        [flagged]

        • inexcf1m

          The amount of random plastic bottle caps in nature proves that you cannot be trusted to screw the cap back on.

          And if anything, the limit on vacuums forced them to innovate. My vacuum today is far more effective than my 2000-watt vacuum was in the past.

        • perching_aix1m

          > The union assumed I'm too stupid (...)

          I don't think the union thinks about you specifically.

          > needing to vacuum the same area more times over to achieve the same cleanliness resulting in even more power used than before

          While this makes intuitive sense to me, is this what actually happens though?

          > Or just like the EUs promotion for (...)

          Remind me, why are you gish-galloping to other random things you find to be crappy about the EU? Again, is there any other point to your comments than to incite?

    • ohgr1m

      EU: Actual Intelligence...

      • FirmwareBurner1m

        Are you saying creating several world class LLMs doesn't require "actual intelligence", whatever that means to you?

        • ohgr1m

          A world class LLM is sort of like saying you stuck a dildo on a donkey’s head and you’ve got a unicorn to sell. It’s a donkey mate. The problem is that most people can’t tell the difference between a donkey and a unicorn because everyone was told everything was a unicorn for the last 20 years.

        • nurettin1m

          Maybe to them, training LLMs isn't an intelligent thing to do and the world is better without.

          • ohgr1m

            I’m quite happy for people to waste time and money on an economic dead end. Just not Europe. We can do better while everyone is ruining their society with it.

  • chis1m

    Kobe?