I'd say with confidence: we're living in the early days. AI has made jaw-dropping progress in two major domains: language and vision. With large language models (LLMs) like GPT-4 and Claude, and vision models like CLIP and DALL·E, we've seen machines that can generate poetry, write code, describe photos, and even hold eerily humanlike conversations.
But as impressive as this is, it’s easy to lose sight of the bigger picture: we’ve only scratched the surface of what artificial intelligence could be — because we’ve only scaled two modalities: text and images.
That’s like saying we’ve modeled human intelligence by mastering reading and eyesight, while ignoring touch, taste, smell, motion, memory, emotion, and everything else that makes our cognition rich, embodied, and contextual.
Human intelligence is multimodal. We make sense of the world through:
Touch (the texture of a surface, the feedback of pressure, the warmth of skin); smell and taste (deeply tied to memory, danger, pleasure, and even creativity); proprioception (the sense of where your body is in space — how you move and balance); and emotional and internal states (hunger, pain, comfort, fear, motivation).
None of these are captured by current LLMs or vision transformers. Not even close. And yet, our cognitive lives depend on them.
Language and vision are just the beginning — the parts we were able to digitize first, not necessarily the most central to intelligence.
The real frontier of AI lies in the messy, rich, sensory world where people live. We’ll need new hardware (sensors), new data representations (beyond tokens), and new ways to train models that grow understanding from experience, not just patterns.
> Language and vision are just the beginning — the parts we were able to digitize first, not necessarily the most central to intelligence.
I respectfully disagree. Touch gives pretty cool skills, but language, video and audio are all that are needed for all online interactions. We use touch for typing and pointing, but that is only because we don't have a more efficient and effective interface.
Now I'm not saying that all other senses are uninteresting. Integrating touch, extensive proprioception, and olfaction is going to unlock a lot of 'real world' behavior, but your comment was specifically about intelligence.
Compare humans to apes and other animals and the thing that sets us apart is definitely not in the 'remaining' senses, but firmly in the realm of audio, video and language.
Language is literally an abstraction of sensory inputs and cognitive processes. One can make similar arguments about image generation. These abstractions might characterize the higher cognitive abilities of humans, but it makes no sense to ignore "lower level" cognition. Embodiment is the foundation of our rich internal world models, in particular spacetime, causality, etc.
Current generative models merely mimic the output, with a fuzzy abstract linguistic mess in place of any physical/causal models. It's unsurprising that their capacity to "reason" is so brittle.
> Language is literally an abstraction of sensory inputs and cognitive processes.
Language can exist entirely independently from senses and cognition. It is an encoding of patterns in the world where the only thing that matters is if anybody or anything wielding it can map the encodings to and from the patterns they encode for (which is more of a sociological/synchronisation challenge).
Does C, or Java, 'make no sense' because it 'ignores lower level cognition'?
There are many parts of non-programming languages that similarly have nothing to do with embodiment. Some of them are even about incredibly abstract things impossible in our universe. One could argue that for many fields genius lies in being able to mentally model what is so foreign to the intuition our embodiment has imbued us with or to be able to find a mapping to facilitate that intuition. Said otherwise: the experience our embodiment has given us might limit how well we can understand the world (Quantum Mechanics anyone?).
Again, embodiment is interesting and worth pursuing, but far from a requirement for far-reaching intelligence.
> Does C, or Java, 'make no sense' because it 'ignores lower level cognition'?
It makes sense in context, but that context includes the machine on which the compiled code runs. Without the underlying machine, there's no real purpose for C or Java. I'm open to the idea that 'lower level cognition' may be as relevant to language as the machine is to C or Java.
> Without the underlying machine, there's no real purpose for C or Java.
They do express algorithms, don't they?
> Language can exist entirely independently from senses and cognition.
Helen Keller begs to disagree. Language and cognition were clearly linked for her.
> It wasn't until April 5, 1887, when Anne took Helen to an old pump house, that Helen finally understood that everything has a name. Sullivan put Helen’s hand under the stream and began spelling “w-a-t-e-r” into her palm, first slowly, then more quickly.
> Keller later wrote in her autobiography, “As the cool stream gushed over one hand she spelled into the other the word water, first slowly, then rapidly. I stood still, my whole attention fixed upon the motions of her fingers. Suddenly I felt a misty consciousness as of something forgotten—a thrill of returning thought; and somehow the mystery of language was revealed to me. I knew then that ‘w-a-t-e-r’ meant the wonderful cool something that was flowing over my hand. That living word awakened my soul, gave it light, hope, joy, set it free! There were barriers still, it is true, but barriers that could in time be swept away.”
I said language can exist independently, not that all language exists independently.
"one plus one equals two" can be understood and worked with without ever feeling water over your hand. It is a priori knowledge (see Hume's fork for an explanation).
You have to understand that the richness of language linked to cognition comes from your experience with that part of language and the resulting romanticization of it. It doesn't mean that it is a core defining feature of language, even though it feels that way (and as touching as that anecdote is).
> "one plus one equals two" can be understood and worked with without ever feeling water over your hand. It is a priori knowledge
"Understood" and "worked with" are completely different.
The complete absence of embodiment is several degrees removed from "feeling water over your hand." LLMs have no sensory apparatus to relate the word "one" to actual, discrete, singular objects. The most rudimentary calculator can represent and compute "1+1=2", but I doubt any philosophical tradition or even an educated layperson would claim calculators "understand what 1+1=2 means." The "understanding" part has nothing to do with the accuracy or truthfulness of the computation; it comes from the relation of the abstract statement to counting of actual objects.
> Language can exist entirely independently from senses and cognition.
Maybe? Are you just outlining the thesis, or saying this should be self-evident?
> It is an encoding of patterns in the world where the only thing that matters is if anybody or anything wielding it can map the encodings to and from the patterns they encode for (which is more of a sociological/synchronisation challenge).
Yes, and my point is, current genAI utterly fails in unpredictable / bizarre ways because it only mimics the abstract encodings, ignorant of patterns in the world. Obviously some people argue "next token prediction is all you need," but that's a claim that is far from self-evident.
> There are many parts of non-programming languages that similarly have nothing to do with embodiment. Some of them are even about incredibly abstract things impossible in our universe. One could argue that for many fields genius lies in being able to mentally model what is so foreign to the intuition our embodiment has imbued us with or to be able to find a mapping to facilitate that intuition.
I would say this misses the point. The meaningfulness of abstractions, even ones that are unintuitive or unphysical or illogical, comes from our embodied experience. Our enjoyment of even the most absurd fiction comes from our ability to simultaneously comprehend what it is "about" and what is "possible." Both relate to experience-of-reality and mean nothing in a vacuum.
> Said otherwise: the experience our embodiment has given us might limit how well we can understand the world (Quantum Mechanics anyone?).
I agree that we are limited in some ways by our particular embodiment. For example, there's a huge spectrum of sensory experiences (colors, sounds, smells, ...) which we know other animals have and we do not.
As I understand, where we disagree is on the why. I would say our capacity for understanding comes from our embodiment, therefore it's only natural that the limits of our embodiment also limit our understanding. After all we could imagine that if we had direct sensory experience of quantum effects, we would understand QM better or at least easier. In some fuzzy way, (no embodiment => poor understanding) and (embodiment => better understanding) is evidence for (embodiment <=> understanding). I suppose your argument is a counterfactual that we might be able to imagine a (no embodiment & some understanding) so (no embodiment =/> poor understanding), but I don't see the evidence that this is not just imaginable but actually possible in reality.
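Spelling out the shape of that argument in symbols (reading the fuzzy arrows as plain material implication, which is admittedly a simplification):

```latex
% E = embodied, U = understands (well).
% The two observations together are exactly the biconditional:
(E \Rightarrow U) \wedge (\lnot E \Rightarrow \lnot U) \;\equiv\; (E \Leftrightarrow U)
% The counterfactual on offer is a claimed possible case of
\lnot E \wedge U
% which, if it can actually occur, falsifies (\lnot E \Rightarrow \lnot U) and with it the biconditional.
```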
> Language and vision are just the beginning — the parts we were able to digitize first, not necessarily the most central to intelligence.
I probably made a mistake when I asserted that -- should have thought it over. Vision is evolutionarily older and more “primitive”, while language is uniquely human [or maybe, more broadly, primate, cetacean, cephalopod, avian...], symbolic, and abstract — arguably a different order of cognition altogether. But I maintain that each and every sense is important as far as human cognition -- and its replication -- are concerned.
People who lack one of those senses, or even two of them, tend to do just fine.
Mostly thanks to other humans helping them.
If all humans lacked vision, the human race would definitely not do just fine.
I think we need to think about vision and world modelling somewhat separately. We could construct an artificial (tech enhanced) society where sight was not available. People would still "model the world in their minds" with the "abstract model" part of the vision/world system.
Vision is interesting in that it leverages the maximum speed with which it is easily possible to gather information about our surroundings in this universe. I believe that is what makes it special and very valuable. I also believe this aspect makes it a strong attractor for convergent evolution.
Language allows encoding and compression of information about the world, which is of course incredibly powerful and increases communication bandwidth enormously (as well as tons of other stuff).
I'd say that for high level cognitive processes, hearing and speaking were an important stepping stone because for some reason evolving organs that can generate relatively high bandwidth signals in audio seems to be easier than evolving something that does that for visuals (very few Teletubby screens on tummies in the natural world).
Interesting games to think about in this sense: Pictionary/drawing games and charades.
Regarding visual communication, I think you undersell posturing, gesturing and facial expressions a little. They may not be as high bandwidth as talking, but they are very low latency and pretty stealthy if necessary.
Well, there is sign language, so I guess you're right. It would be interesting to see how high bandwidth gesturing can be compared to speaking.
I thought about this some more and I think the prevalence of making sounds rather than gesturing etc. is due to sound being a broadcasting mechanism that works over long distances and without line of sight.
Visually indicating that you've claimed some territory is pretty hard.
I was thinking more about "low level" communication which can be gleaned from body language, frowns, smiles, gaze, winks, pointing etc. Perhaps not very information dense, but very fast.
> Touch gives pretty cool skills, but language, video and audio are all that are needed for all online interactions. We use touch for typing and pointing, but that is only because we don't have a more efficient and effective interface.
It may be that we are not using touch for anything important as adults. But babies rely on touch to explore their surroundings. They stick anything into their mouth. Why? Because the tongue is the most touch-sensitive organ. They are exploring things by touching them with their tongues.
I can only guess what people get from that, but my guess is they get an understanding of geometry and of the surface properties of objects, which you'd have trouble getting by processing photos or text.
> your comment was specifically about intelligence.
Talking about intelligence, I do not believe that LLMs can match humans without a deep understanding of 3D space and material science^W intuition. It needs touch and temperature sensitivity at least. Perhaps you can replace it with billions of words of text describing these things, but I doubt it.
It is trivial to train AI on 3D representations. In fact, that already happens in cases where robot algorithms are trained in simulations.
Another thing to remember is that the senses we have aren't the only ones in biology and far from the only ones possible. In fact, anything that gives you another type of information about the world (you're modeling) is a different sense. In that sense (ha), AI has access to an incredibly vast and varied array of senses that is inaccessible to humans. Lidar is a very simple example of that.
I don't think touch and temperature sensitivity are needed to achieve it, but I do agree that training with senses specifically for understanding 3D space is very important. At the very least binocular video.
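To make the binocular point concrete: once two views are rectified, metric depth falls straight out of per-pixel disparity. A toy sketch with made-up focal length and baseline, not tied to any particular system:

```python
# Toy illustration of why binocular input carries 3D information: for a
# rectified stereo pair, depth = focal_length * baseline / disparity.
# The numbers below are illustrative, not calibrated values.

def depth_from_disparity(disparity_px: float,
                         focal_length_px: float = 1000.0,   # assumed intrinsics
                         baseline_m: float = 0.06) -> float:  # roughly eye-spacing
    """Depth in metres of a point seen with the given pixel disparity."""
    return focal_length_px * baseline_m / disparity_px

if __name__ == "__main__":
    for d in (100.0, 20.0, 5.0):  # large disparity = near, small disparity = far
        print(f"disparity {d:5.1f} px -> depth {depth_from_disparity(d):5.2f} m")
```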
> It is trivial to train AI on 3D representations.
So AI developers understand the limitations and are trying to remove them. It will help, but it will not put AI vision on par with a human's.
> In that sense (ha), AI has access to an incredibly vast and varied array of senses that is inaccessible to humans. Lidar is a very simple example of that.
I don't think that current uses of lidars have anything to do with intelligence. Not every neural net is about intelligence.
> I don't think touch and temperature sensitivity are needed to achieve it,
I'm sure they are. To understand forms you need to explore them with touch. The ability to understand forms by just looking at them is an acquired skill. Maybe it is possible to train these abilities without touch, but how? I believe it will take a shitload of training data, and I'm not sure it will be good enough.
Temperature sensitivity is a big thing, because it allows you to guess the thermal conductivity of a thing just by looking at it. It lets you guess how wet a thing is. It lets you guess the temperature of things by looking at them: you see the sun shining, fire burning, people touching things and yanking their hands away from hot ones. Or take a person cautiously trying to learn the temperature of a thing: first sensing the infrared radiation, then a quick touch, then a longer touch, and finally sustained contact. How could you understand all these proceedings without your own experience of grasping the hot thing, crying from the pain and dropping it on your feet?
These are just obvious ideas off the top of my head. What else comes from temperature sensitivity I don't know, and no one does, because no one really knows how people learn to use their senses and to think. There are theories about it, but they are mostly descriptive: they describe what is known without much predictive power. Because of this the optimism of the AI crowd seems overinflated. They don't know what they are trying to do, and still they believe in their eventual success.
Probably you can learn it by thinking, but can LLMs think while training? You can learn it as a pattern of behaviour, without understanding its meaning, but then you'll hallucinate this pattern all the time, just because some of the movements were close enough.
> At the very least binocular video.
I'm not sure that people can learn 3D by looking. At least they do not rely on binocular vision alone to learn it. They touch, they lick. They measure things in different ways (by sticking them in their mouths, by grasping, by climbing on top of them or falling from them, by hugging them), and they measure distances by crawling or walking along them. They find a spot where they can see what happens behind a group of trees, or maybe behind something else. People don't just use more senses, they also act, which lets them learn causal relationships. Watching binocular video is not acting, so you get only correlations, with no hope of learning to distinguish correlation from causation, and at the same time the available data is much more limited.
Science says that 80 or 90% of the information people get comes from vision? I'm skeptical about this, because I don't know how they measure "information", but in any case human vision was trained with support from the other senses. I wouldn't be surprised if at certain stages of a baby's development other senses are more advanced and are used to get labelled data to train vision.
Humans are known for their exceptional sensitivity in their hands and fingers. There are only few animals that come close to our ability to manipulate objects.
Only octopuses, elephants and apes are in a similar league with regards to dexterity and finesse.
You can be born without hands and have zero cognitive deficits, so sensory info and action-feedback from hands, vision, and hearing isn't key to intelligence. Being born without vision and hearing can cause developmental issues, but even if you lose both before age two you can develop normally, like Helen Keller.
Actually this is wrong. There's a connection between bodily sensation and emotion so profound that quadriplegics can develop flat affect, which in turn leads to decision paralysis and cognitive deficits. Emotions are regulated somatically, and they inform decision making and other aspects of motivation and reasoning.
Source: https://pmc.ncbi.nlm.nih.gov/articles/PMC2633768/
Spinal cord injury implies that you were born with those senses and then abruptly lost them.
In that case, all the processes and pathways in your brain are relying on and tied to those senses, so losing them might also disrupt or affect those pathways and how they function.
Organic adaptation and persistence of memory are, I would say, the two major advancements that need to happen.
Human neural networks are dynamic: they change and rearrange, grow and sever connections. An LLM is fixed and relies on context; if you give it the right answer, it won't "learn" that it is the correct answer unless it is fed back into the system and retrained over months. What if it's only the right answer for a limited period of time?
To build an intelligent machine, it must be able to train itself in real time and remember.
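For contrast, here is roughly what "train itself in real time" means at the mechanical level: a single gradient step taken at interaction time instead of in a scheduled offline run. A toy sketch with a stand-in model; it deliberately ignores the genuinely hard part, catastrophic forgetting:

```python
# Minimal sketch of learning from a correction immediately, rather than waiting
# for the next offline training run. The tiny model is a stand-in; a real system
# would also need to avoid overwriting everything it already knows.
import torch
import torch.nn as nn

model = nn.Linear(16, 4)  # placeholder for a much larger network
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def learn_from_feedback(x: torch.Tensor, correct_label: int) -> float:
    """Take one gradient step on a single corrected example, at interaction time."""
    opt.zero_grad()
    loss = loss_fn(model(x).unsqueeze(0), torch.tensor([correct_label]))
    loss.backward()
    opt.step()  # the weights change now, not months from now
    return loss.item()

# e.g. the user says "no, the right answer was class 2":
print(learn_from_feedback(torch.randn(16), correct_label=2))
```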
Yes and: and forget.
Why is forgetting important? Things can have an end time after which they are no longer applicable, or things we thought were true turn out to be false, but it's still useful to see where we went wrong.
I imagine humans are limited by the number of synapses we have, so it's useful to forget, but maybe machines can move the useless stuff to deep storage until it's dug out, in the same way certain things can trigger a deep memory in humans.
Which is cheaper, among these two logically equivalent things: reducing one weight, or increasing every other weight?
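One way to see the equivalence concretely: if the weights compete through a normalization step (softmax here, purely as an assumption), suppressing one entry by some amount yields exactly the same distribution as boosting every other entry by that amount, yet the first touches one number and the second touches N-1:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift-invariant, which is the whole point here
    return e / e.sum()

rng = np.random.default_rng(0)
w = rng.normal(size=8)
c = 1.5

suppress_one = w.copy()
suppress_one[3] -= c                    # one write

boost_rest = w.copy()
boost_rest[np.arange(8) != 3] += c      # N-1 writes

print(np.allclose(softmax(suppress_one), softmax(boost_rest)))  # True
```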
Do you remove dead code? Get rid of clutter? Ever try to change a habit?
Yeah, I delete dead code but it's important to remember why I wrote it that way in the first place and why I'm deleting it now. Doomed to repeat past mistakes and all that.
I think one counterpoint to this idea is the compute cost.
To a great extent, it's not AI research that is the primary driver behind the huge advances in AI, either in terms of techniques (transformers) or data sets. Instead, the biggest single factor responsible for this huge boost are advances in compute hardware and compute power in general. Even if we had known about the Transformer architecture 20 years earlier, and we had had the datasets that OpenAI and Google amassed 20 years earlier, we still would not have been able to get anywhere close to training an LLM on hardware from 20 years ago.
And given this, and given that LLMs have already pushed this compute power to the limit, it's very possible that we'll stagnate at more or less the current level unless and until a new 10x or even 100x boost in compute power happens. It's very unlikely that you could train a model on 100x as much data as you get today without that, which is what you would likely require to add multiple modalities and then combine them.
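For a rough sense of the scale involved, a common back-of-the-envelope estimate is training FLOPs ≈ 6 × parameters × tokens. The numbers below are illustrative, not any lab's actual figures:

```python
# Rule-of-thumb training cost: FLOPs ~ 6 * N_params * N_tokens.

def training_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

base = training_flops(n_params=70e9, n_tokens=2e12)   # a 70B model on 2T tokens
print(f"baseline run:    {base:.2e} FLOPs")
print(f"100x more data:  {training_flops(70e9, 2e14):.2e} FLOPs")

# At a sustained 1e15 FLOP/s per accelerator (a rough ballpark), the baseline
# alone already costs on the order of:
print(f"~{base / 1e15 / 86400:.0f} accelerator-days")
```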
Tick-tock model. Research switches between adding new ideas and pouring more power into good ideas.
> Language and vision are just the beginning..
Based on the architectures we have, they may also be the ending. There’s been a lot of news in the past couple of years about LLMs, but have there been any breakthroughs making headlines anywhere else in AI?
> There’s been a lot of news in the past couple of years about LLMs, but have there been any breakthroughs making headlines anywhere else in AI?
Yeah, lots of stuff tied to robotics, for instance; this overlaps with vision, but the advances go beyond vision.
Audio has seen quite a bit. And I imagine there is stuff happening in niche areas that just aren't as publicly interesting as language, vision/imagery, audio, and robotics.
Two Nobel prizes, in chemistry and physics: https://www.nature.com/articles/s41746-024-01345-9
prizes != breakthroughs
progress != breakthroughs
Sure. In physics, math, chemistry, biology. To name a few.
> The real frontier of AI lies in the messy, rich, sensory world where people live. We’ll need new hardware (sensors), new data representations (beyond tokens), and new ways to train models that grow understanding from experience, not just patterns.
Like Dr. Who said: DALEKs aren't brains in a machine, they are the machine!
Same is true for humans. We really are the whole body, we're not just driving it around.
There are many people who mentally developed while paralyzed that literally drive around their bodies via motorized wheelchair. I don't think there's any evidence that a brain couldn't exist or develop in a jar, given only the inputs modern AI now has (text, video, audio).
> any evidence that a brain couldn't exist or develop in a jar
The brain could. Of course it could. It's just a signals processing machine.
But would it be missing anything we consider core to the way humans think? Would it struggle with parts of cognition?
For example: experiments were done with cats growing up in environments with vertical lines only. They were then put in a normal room and had a hard time understanding flat surfaces.
https://computervisionblog.wordpress.com/2013/06/01/cats-and...
This isn't remotely a hypothetical, so I imagine there are some examples out there, especially from back when polio was a problem. Although, for practical reasons, they might have had limited exposure to novelty, which could have negative consequences.
I agree it’s not hypothetical, and as a layperson I don’t know how much the impact on cognition has been studied. Would be cool if it has!
I do know of studies that showed blind people start using their visual cortex to process sounds. That is pretty cool imo
> modeled human intelligence
That's not what these models do.
Yeah, but are there new ideas or only wishes?
It’s pure magical thinking that would be correctly dismissed if it didn’t have AI attached to it. Imagine talking this way about anything else.
“We’ve barely scratched the surface with Rust, so far we’re only focused on code and haven’t even explored building mansions or ending world hunger”
AI has some real possibilities of building mansions and ending hunger in a way that Rust doesn't.
How is this ending anyones hunger? As long as humans are steering the ship, the commodity will be limited to those that control it and they will make all the money. If anything, it has a big potential to cause more hunger.
In a way, until AI systems can feel the weight of a cup or flinch from heat, we're not close to modeling anything like embodied cognition.
The big horizon isn't just incorporating another sensory modality, it's what Heidegger called being-in-the-world: living among us as a human-like social being. That advancement depends on robotics to provide embodied experience.
> has made jaw-dropping progress
They took 1970s dead tech and deployed it on machines 1 million times more powerful. I'm not sure I'd qualify this as progress. I'd also need an explanation of what systemic improvements in models and computation are planned that would give exponential growth in performance.
I don't see anything.
> They took 1970s dead tech and deployed it on machines 1 million times more powerful. I’m not sure I’d qualify this as progress
If this isn’t meant to be sarcasm or irony, you’ve got some really exciting research and learning ahead of you! At the moment it reads very “computers are just addition and multiplication and we’ve had that for thousands of years!”
> you’ve got some really exciting research and learning ahead of you
I've done the research. Which is why I made the point I did. You're being dismissive and rude instead of putting forth any sort of argument. It's the paper hat of fake intellect. Yawn.
> At the moment it reads very “computers are just addition and multiplication and we’ve had that for thousands of years!”
Let's be specific then. The problem with the models is they require exponential cost growth for model generation while giving only linear increases in output performance. This cost curve is currently a factor or two stronger than the curve of increasing hardware performance. That puts the technology, absent any actual fundamental algorithmic improvements, which do /not/ seem forthcoming despite billions in speculative funding, into a strict coffin corner. In short: AI winter 2.0.
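To illustrate the shape of that claim with made-up numbers (treating capability as growing with the log of training compute, so each extra unit of performance costs a constant multiple more compute):

```python
import math

# If each +1 "performance step" needs 10x more training compute, but the
# affordable compute budget only doubles per hardware generation, the reachable
# level grows by a small constant (log10(2) ~ 0.3 steps) per generation.
# Purely illustrative, to show the claimed dynamic rather than settle it.
cost_multiplier_per_step = 10.0
hardware_gain_per_gen = 2.0

budget = 1.0
for gen in range(1, 9):
    budget *= hardware_gain_per_gen
    steps = math.log(budget, cost_multiplier_per_step)
    print(f"gen {gen}: compute budget x{budget:6.0f} -> +{steps:.2f} performance steps")
```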
Got any plans for that? Any specific research that deals with that? Any thoughts of your own on this matter?
> I've done the research
Great. What's the 1970s equivalent of word2vec or embeddings, that we've simply scaled up? Where are the papers about the transformer architecture or attention from the 1970s? Sure feels like you think LLMs are just big perceptrons.
> The problem with the models is they require exponential cost growth
Let's stick to the assertion I was disputing instead.
A linear increase in technology can easily lead to a greater than linear increase in economic gain. Sometimes even small performance gains can overturn whole industries.
Winning two Nobel prizes wasn't enough progress?
Is progress measured in Nobel prizes? My understanding is those are put to a vote by an institutional committee.
Putting that aside, the shared prize in 2024 was given for work done in the 1970s and 1980s. Was this meant to be a confirmation of my point? You've done so beautifully.
In 2022 they saw fit to award Ben Bernanke. Yep. That one. For, I kid you not, work on the impacts of financial crises. Ironically also work originally done in the 1970s and 80s.
AlphaFold uses transformers. That is definitely not from the 70s and 80s.
Progress for me includes both small iterative refinements and big leaps. It also includes trying old techniques in new domains with new technology. So I think we just have differing definitions for progress.