512GB of unified memory is truly breaking new ground. I was wondering when Apple would overcome memory constraints, and now we're seeing a half-terabyte of unified memory. This is incredibly practical for running large AI models locally ("600 billion parameters"), and Apple's approach of integrating this much efficient memory on a single chip is fascinating compared to NVIDIA's solutions. I'm curious how this design of "fusing" two M3 Max chips performs in terms of heat dissipation and power consumption, though.
They didn't increase the memory bandwidth. You get the same memory bandwidth that's already available on the M2 Studio. Yes, yes, of course you can get 512 gigabytes of uRAM for 10 grand.
The question is whether an LLM will run with usable performance at that scale. The point is that there are diminishing returns: even with enough uRAM and the increased AI processing speed of the new chip, you're stuck with the same amount of memory bandwidth.
So there must be a min-max performance ratio between memory bandwidth and the size of the memory pool in relation to the processing power.
Since no one specifically answered your question yet: yes, you should be able to get usable performance. A Q4_K_M GGUF of DeepSeek-R1 is 404GB. This is a 671B MoE that "only" has 37B activations per pass. You'd probably expect in the ballpark of 20-30 tok/s for text generation (depending on how much of the MBW can actually be utilized).
From my napkin math, the M3 Ultra TFLOPs is still relatively low (around 43 FP16 TFLOPs?), but it should be more than enough to handle bs=1 token generation (should be way <10 FLOPs/byte for inference). Now as far as its prefill/prompt processing speed goes... well, that's another matter.
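For anyone who wants to redo the napkin math, here's a minimal sketch; the bandwidth, quant density, and utilization figures below are assumptions, not measurements of this machine:

    # Rough bs=1 decode estimate for DeepSeek-R1 Q4_K_M on an M3 Ultra-class machine.
    bandwidth_gb_s  = 800      # assumed usable memory bandwidth, GB/s
    active_params   = 37e9     # MoE: parameters actually read per generated token
    bytes_per_param = 0.6      # rough average for a Q4_K_M quant

    gb_per_token = active_params * bytes_per_param / 1e9   # ~22 GB read per token
    ceiling      = bandwidth_gb_s / gb_per_token            # ~36 tok/s theoretical
    realistic    = ceiling * 0.6                            # guessing ~60% MBW utilization

    print(f"ceiling ~{ceiling:.0f} tok/s, realistic ~{realistic:.0f} tok/s")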
I actually think it’s not a coincidence and they specifically built this M3 Ultra for DeepSeek R1 4-bit. They also highlight in their press release that they tested it with 600B class LLMs (DeepSeek R1 without referring to it by name). And they specifically did not stop at 256 GB RAM to make this happen. Maybe I’m reading too much into it.
Pretty sure this has absolutely nothing to do with DeepSeek and even local LLMs at large, which have been a thing for a while and an obvious use case since the original Llama leak and llama.cpp came around.
Fact is, Mac Pros in the Intel days supported 1.5TB of RAM in some configurations[1], and that was the expectation of their high-end customer base six years ago. They needed to address the gap for those customers, so they would have shipped such a product regardless. Local LLM is the cherry on top. DeepSeek in particular almost certainly had nothing to do with it. They will still need to double the supported RAM in their SoC to get there. Perhaps in a Mac Pro or a different quad-Max-glued chip.
[1]: https://support.apple.com/en-us/101639
The thing that people are excited about here is unified memory that the GPU can address. Mac Pro had discrete GPUs with their own memory.
I understand why they are excited about it—just pointing out it is a happy coincidence. They would have and should have made such a product to address the need of RAM users alone, not VRAM in particular, before they have a credible case to cut macOS releases on Intel.
Intel integrated graphics technically also used unified memory, with standard DRAM.
Those also have terrible performance and worse bandwidth. I am not sure they are really relevant, to be honest.
Did the Xeons in the Mac Pro even have integrated graphics?
So did the Amiga, almost 40 years ago...
You mean this? ;) http://de.wikipedia.org/wiki/Datei:Amiga_1000_PAL.jpg
RIP Jay Miner who watched his unified memory daughters Agnus, Denise and Paula be slowly murdered by Jack Tramiel's vengeance against Irving Gould. [Why couldn't the shareholders have stormed their boardroom 180 days before the company ran out of cash, installed interim management who, in turn, would have brought back the megalomaniac Founder that would, until his dying breath, keep spreading their cash to the super brilliant geniuses that made all the magic chips happen and then turn the resulting empire over to ops people to make their workplace so uncomfortable they all retire early and live happily ever after on tropical islands and snowy mountain tops?]
Yep! Though one could argue the Amiga wasn't true unified memory due to the chip RAM limitations. Depending on the Agnus revision, you'd be limited to 512, 1 meg, or 2 meg max of RAM addressable by the custom chips ("chip RAM".)
Fun fact: M-series machines that are configured to use more than 75% of shared memory for the GPU can make the system go boom... something to do with assumptions macOS makes that could be fixed by someone with a "private key" to access kernel mode (maybe not a hardware limit).
I messed around with that setting on one of my Macs. I wanted to load a large LLM model and it needed more than 75% of shared memory.
That or it's the luckiest coincidence! In all seriousness, Apple is fairly consistent about not pushing specs that don't matter and >256GB is just unnecessary for most other common workloads. Factors like memory bandwidth, core count and consumption/heat would have higher impact.
That said, I doubt it was explicitly for R1, but rather based on where the industry was a few years ago, when GPT-3's 175B was SOTA and the industry was still looking larger. "As much memory as possible" is the name of the game for AI in a way that's not true for other workloads. It may not be true for AI forever either.
The high end Intel Macs supported over a TB of RAM, over 5 years ago. It's kinda crazy Apple's own high end chips didn't support more RAM. Also, the LLM use case isn't new... Though DeepSeek itself may be. RAM requirements always go up.
Just to clarify. There is an important difference between unified memory, meaning accessible by both CPU and GPU, and regular RAM that is only accessible by CPU.
As mentioned elsewhere in this thread, unified memory has existed long before Apple released the M1 CPU, and in fact many Intel processors that Apple used before supported it (though the Mac pros that supported 1.5TB of RAM did not, as they did not have integrated graphics).
The presence of unified memory does not necessarily make a system better. It’s a trade off: the M-series systems have high memory bandwidth thanks to the large number of memory channels, and the integrated GPUs are faster than most others. But you can’t swap in a faster GPU, and when using large LLMs even a Mac Studio is quite slow compared to using discrete GPUs.
Design work on the Ultra would have started 2-3 years ago, and specs for memory at least 18 months ago. I’m not sure they had that kind of inside knowledge for what Deepseek specifically was doing that far in advance. Did Deepseek even know that long ago?
> they specifically built this M3 Ultra for DeepSeek R1 4-bit
Which came out in what, mid January? Yeah, there's no chance Apple (or anyone) has built a new chip in the last 45 days.
Don't they build these Macs just-in-time? The bandwidth doesn't change with the RAM, so surely it couldn't have been that hard to just... use higher capacity RAM modules?
"No chance?" But it has been reported that the next generation of Apple Silicon started production a few weeks ago. Those deliveries may enable Apple to release its remaining M3 Ultra SKUs for sale to the public (because it has something Better for its internal PCC build-out).
It also may point to other devices ᯅ depending upon such new Apple Silicon arriving sooner, rather than later. (Hey, I should start a YouTube channel or religion or something. /s)
No one is saying they built a new chip.
But the decision to come to market with a 512GB sku may have changed from not making sense to “people will buy this”.
Dies are designed in years.
This was just a coincidence.
What part of “no one is saying they designed a new chip” is lost here?
Sorry, none of us are fanboys trying to shape "Apple is great" narratives.
I don’t think you understand hardware timelines if you think this product had literally anything to do with anything DeepSeek.
Chip? Yes. Product? Not necessarily...
It's not completely out of the question that the 512gb version of M3 Ultra was built for their internal Apple silicon servers powering Private Compute Cloud, but not intended for consumer release, until a compelling use case suddenly arrived.
I don't _think_ this is what happened, but I wouldn't go as far as to call it impossible.
DeepSeek R1 came out Jan 20.
Literally impossible.
The scenario is that the 512gb M3 Ultra was validated for the Mac Studio, and in volume production for their servers, but a business decision was made to not offer more than a 256gb SKU for Mac Studio.
I don't think this happened, but it's absolutely not "literally impossible". Engineering takes time, artificial segmentation can be changed much more quickly.
From “internal only” to “delivered to customers” in 6 weeks is literally impossible.
This change is mostly just using higher density ICs on the assembly line and printing different box art with a SKU change. It does not take much time, especially if they had planned it as a possible product just in case management changed its mind.
That's absurd. Fabbing custom silicon is not something anybody does for a few thousand internal servers. The unit economics simply don't work. Plus, Apple is using OpenAI to provide its larger models anyway, so the need never even existed.
Apple is positively building custom servers, and quantities are closer to the 100k range than 1000 [0]
But I agree they are not using m3 ultra for that. It wouldn’t make any sense.
0. https://www.theregister.com/AMP/2024/06/11/apple_built_ai_cl...
That could be why they're also selling it as the Mac Studio M3 Ultra
My thoughts too. This product was in the pipeline maybe 2-3 years ago. Maybe with LLMs getting popular a year ago they tried to fit more memory, but it's almost impossible to do that so close to a launch, especially when the memory is fused on, not just a module you can swap.
Your conclusion is correct but to be clear the memory is not "fused." It's soldered close to the main processor. Not even a Package-on-Package (two story) configuration.
See photo without heatspreader here: https://wccftech.com/apple-m2-ultra-soc-delidded-package-siz...
I think by "fused" I meant it's stuck onto the SoC package, not part of the SoC die, as I may have worded it. While you could maybe still add DRAM packages later in the manufacturing process, it's probably not easy, especially if you need more packages and a larger module, which might cause more design problems. The memory sits close because the controller is in the SoC. So the memory controller would probably also change with higher memory sizes, which would mean this cannot be a last-minute change.
Sheesh, the...comments on that link.
$10k to run a 4 bit quantized model. Ouch.
That's today. What about tomorrow?
The M4 MacBook Pro with 128GB can run a 32B parameter model at 8-bit quantization just fine.
I'm downvoting you because your use of language is so annoying, not because I work for Apple.
So, Microsoft?
what?
Sorry, an apostrophe got lost in "PO's"
are you comparing the same models? How did you calculate the TOPS for M3 Ultra?
An M3 Ultra is two M3 Max chips connected via fabric, so physics.
Did not mean to shit on anyone's parade, but it's a trap for novices, with the caveat that you reportedly can't buy a GB10 until "May 2025" and the expectation that it will be severely supply constrained. For some (overfunded startups running on AI monkey code? Youtube Influencers?), that timing is an unacceptable risk, so I do expect these things to fly off the shelves and then hit eBay this Summer.
> they specifically built this M3 Ultra for DeepSeek R1 4-bit.
This makes sense. They started gluing M* chips together to make Mac Studios three years ago, which must have been in anticipation of DeepSeek R1 4-bit
Any ideas on power consumption? I wonder how much power that would use. It looks like it would be more efficient than everything else that currently exists.
Looks like up to 480W listed here
https://www.apple.com/mac-studio/specs/
Thanks!!
The M2 Ultra Mac Pro could reach a maximum of 330W according to Apple:
https://support.apple.com/en-us/102839
I assume it is similar.
I would be curious what context window size could be expected when generating ballpark 20 to 30 tokens per second using DeepSeek-R1 Q4 on this hardware?
Probably helps that models like DeepSeek are mixture-of-experts. Having all weights in VRAM means you don't have to unload/reload. Memory bandwidth usage should be limited to the 37B active parameters.
> Probably helps that models like DeepSeek are mixture-of-experts. Having all weights in VRAM means you don't have to unload/reload. Memory bandwidth usage should be limited to the 37B active parameters.
"Memory bandwidth usage should be limited to the 37B active parameters."
Can someone do a deep dive on the above quote? I understand having the entire model loaded into RAM helps with response times. However, I don't quite understand the relationship between memory bandwidth and active parameters.
Context window?
How much of the model can actually be processed, despite being fully loaded into memory, given the memory bandwidth?
With a mixture of experts model you only need to read a subset of the weights from memory to compute the output of each layer. The hidden dimensions are usually smaller as well so that reduces the size of the tensors you write to memory.
What people who haven't actually worked with this stuff in practice don't realize is that the above statement only holds for batch size 1, sequence size 1. For processing the prompt you will need to read all the weights (which isn't a problem, because prefill is compute-bound, which, in turn, is a problem on a weak machine like this Mac or an "EPYC build" someone else mentioned). Even for inference, batch size greater than 1 (more than one inference at a time) or sequence size greater than 1 (speculative decoding) could require you to read the entire model, repeatedly. MoE is beneficial, but there's a lot of nuance here, which people usually miss.
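To make that nuance concrete, here is a back-of-envelope decode vs. prefill comparison; all figures are assumptions (Q4-ish quant, M3 Ultra-class specs), not benchmarks:

    # bs=1 decode: only the active experts' weights are read, once per generated token.
    total_params    = 671e9    # full MoE model
    active_params   = 37e9     # parameters activated per token
    bytes_per_param = 0.6      # ~Q4_K_M
    flops_per_param = 2        # one multiply-add per active weight per token
    bandwidth       = 800e9    # assumed ~800 GB/s
    compute         = 43e12    # assumed ~43 FP16 TFLOPs

    mem_per_token  = active_params * bytes_per_param / bandwidth  # ~0.028 s (dominates)
    comp_per_token = active_params * flops_per_param / compute    # ~0.002 s

    # Prefill of a 4096-token prompt: the whole model gets read, but each weight is
    # reused across all prompt tokens, so the FLOPs dominate instead.
    prompt_len   = 4096
    mem_prefill  = total_params * bytes_per_param / bandwidth              # ~0.5 s
    comp_prefill = active_params * flops_per_param * prompt_len / compute  # ~7 s (dominates)

So decode speed tracks memory bandwidth while prompt processing tracks raw compute, which is exactly where a machine like this is weakest.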
No one should be buying this for batch inference obviously.
I remember right after OpenAI announced GPT-3, I had a conversation with someone where we tried to predict how long it would be before GPT-3 could run on a home desktop. This Mac Studio has enough VRAM to run the full 175B parameter GPT-3 at 16-bit precision, and I think that's pretty cool.
Sure, nuance.
This is why Apple makes so much fucking money: people will craft the wildest narratives about how they’re going to use this thing. It’s part of the aesthetics of spending $10,000. For every person who wants a solution to the problem of running a 400b+ parameter neural network, there are 19 who actually want an exciting experience of buying something, which is what Apple really makes. It has more in common with a Birkin bag than a server.
Birkin bags appreciate in value. This is more like a Lexus. It's a well-crafted luxury good that will depreciate relatively slowly.
Have you seen prices on Lexus LFAs now? They haven't depreciated ha ha. And for those that don't know: https://www.youtube.com/watch?v=fWdXLF9unOE
Computers don't usually depreciate slowly
Relatively, as in a Mac or a Lexus will depreciate slower than other computers/cars.
It used to be very true, but with Apple's popularity the second-hand market is quite saturated (especially since there are many people buying them impulsively).
Unless you have a specific configuration, depreciation isn't much better than an equivalently priced PC. In fact, my experience is that the long tail value of the PC is better if you picked something that was high-end.
I don't know. Can't imagine it's easy to sell a used Windows laptop directly to begin with, and those big resellers probably offer very little. Even refurbished Dell Latitudes seem to go for cheap on eBay. I've had an easy time selling old Macs, and the high-end desktop market might be simple too.
Macs are easy to sell if they are BTO with a custom configuration; in that case you may not lose too much. But the depreciation hits hard on the base models: the market is flooded because people who buy those machines tend to change them often, or are people who were just trying them out, confused, etc.
Low-end PCs (mostly laptops) don't keep value very well, but then again you probably got them cheap on a deal or something like that, so your depreciation might actually not be as bad as an equivalent Mac. The units you are talking about are enterprise stuff that gets swapped every 3 years or so, for accounting reasons mostly, but it's not the type of stuff I would advise anyone to buy brand new (the strategy would actually be to pick up a second-hand unit).
High-end PCs, laptops or big desktops keep their value pretty well because they are niche by definition and very rare. Depending on your original choice you may actually have a better depreciation than an equivalently priced Mac because there are fewer of them on sale at any given time.
It all depends on your personal situation, strength of local market, ease of reselling through platforms that provides trust and many other variables.
What I meant is that it's not the early 2000's anymore, where you could offload a relatively new Mac (2-3 years) very easily; while not being hit by big depreciation because they were not very common.
In my medium sized town, there is a local second-hand electronic shop where they have all kinds of Mac at all kind of price points. High-end Razers sell for more money and are a rare sight. It's pretty much the same for iPhones, you see 3 years old models hit very hard with depreciation while some niche Android phones take a smaller hit.
Apple went through a weird strategy where they went for luxury pricing by overcharging for the things that make the experience much better (RAM/storage), but at the same time tried to make it affordable to the masses (by largely compromising on things they shouldn't have).
Apple's behavior created a shady second-hand market with lots of moving parts (things being shipped in and out of China), and this is all their doing.
Well these listed prices are asks, not bids, so they only give an upper bound on the value. I've tried to sell obscure things before where there are few or 0 other sellers, and no matter what you list it for, you might never find the buyer who wants that specific thing.
And the electronic shop is probably going to fetch a higher price than an individual seller would, due to trust factor as you mentioned. So have you managed to sell old Windows PCs for decent prices in some local market?
Pretty much. In addition, PyTorch on the Mac is abysmally bad. As is Jax. Idk why Apple doesn't implement proper support, seems important. There's MLX which is pretty good, but you can't really port the entire ecosystem of other packages to MLX this far along in the game. Apple's best bet to credibly sell this as "AI hardware" is to make PyTorch support on the Mac excellent. Right now, as far as AI workloads are concerned, this is only suitable for Ollama.
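For what it's worth, the MPS backend does ship in stock PyTorch; the complaint is about patchy op coverage and performance rather than it being absent. A minimal smoke test, assuming a recent PyTorch build:

    import torch

    if torch.backends.mps.is_available():
        device = torch.device("mps")
        x = torch.randn(4096, 4096, device=device, dtype=torch.float16)
        y = x @ x                       # matmul dispatched to the GPU via Metal
        print(y.float().mean().item())  # cast back to fp32 for the reduction
    else:
        print("MPS backend not available; running on CPU instead")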
This is true. Not sure why you are getting downvoted. I say this as someone who ordered a maxed out model. I know I will never have a need to run a model locally, I just want to know I can.
I run Mistral Large locally on two A6000's, in 4 bits. It's nice, but $10K in GPUs buys a lot of subscriptions. Plus some of the strongest LLMs are now free (Grok, DeepSeek) for web use.
I hear you. I make these decisions for a public company.
When engineers tell me they want to run models on the cloud, I tell them they are free to play with it, but that isn’t a project going into the roadmap. OpenAI/Anthropic and others are much cheaper in terms of token/dollar thanks to economies of scale.
There is still value in running your models for privacy issues however, and that’s the reason why I pay attention to efforts in reducing the cost to run models locally or in your cloud provider.
No one who is using this for home use cares about anything except batch size 1 sequence size 1.
What if you're doing bulk inference? The efficiency and throughput of bs=1 s=1 is truly abysmal.
People want to talk to their computer, not service requests for a thousand users.
For decode, MoE is nice for either bs=1 (decoding for a single user), or bs=<very large> (do EP to efficiently serve a large amount of users).
Anything in between suffers.
Just to add onto this point, you expect different experts to be activated for every token, so not having all of the weights in fast memory can still be quite slow as you need to load/unload memory every token.
Probably better to be moving things from fast memory to faster memory than from slow disk to fast memory.
Agree. Finally I can have several hundred browser tabs open simultaneously with no performance degradation.
I'm not sure that unified memory is particularly relevant for that-- so e.g. on zen4/zen5 epyc there is more than enough arithmetic power that LLM inference is purely memory bandwidth limited.
On dual (SP5) Epyc I believe the memory bandwidth is somewhat greater than this apple product too... and at apple's price points you can have about twice the ram too.
Presumably the apple solution is more power efficient.
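As a rough sketch of those bandwidth ceilings for a Q4-ish, 37B-active MoE at bs=1 (peak numbers assumed; sustained figures, especially across two NUMA domains, will be lower):

    bytes_per_token = 37e9 * 0.6    # ~22 GB read per generated token

    systems = {
        "M3 Ultra (assumed ~800 GB/s)":                   800e9,
        "Dual SP5 EPYC, 24ch DDR5-4800 (~920 GB/s peak)": 920e9,
    }
    for name, bw in systems.items():
        print(f"{name}: ~{bw / bytes_per_token:.0f} tok/s ceiling")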
Is this on chip memory? From the 800GB/s I would guess more likely a 512bit bus (8 channel) to DDR5 modules. Doing it on a quad channel would just about be possible, but really be pushing the envelope. Still a nice thing.
As for practicality: which mainstream applications would benefit from this much memory paired with nice but relatively mid compute? At this price point ($14K for a fully specced system), would you prefer it over, e.g., a couple of NVIDIA Project DIGITS units (assuming that arrives on time and for around the announced $3K price point)?
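On the bus-width guess: working backwards from the advertised bandwidth points to a wide on-package LPDDR interface rather than 8-channel DDR5. The pin rate and width below are assumptions based on the M3 Max being a 512-bit LPDDR5 design:

    transfer_rate_gts = 6.4      # assumed LPDDR5-6400, GT/s per pin
    bus_width_bits    = 1024     # two 512-bit M3 Max memory interfaces fused together
    bandwidth_gb_s    = bus_width_bits / 8 * transfer_rate_gts
    print(bandwidth_gb_s, "GB/s")  # -> 819.2, in line with the ~800 GB/s figure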
Is putting RAM on the same chip as processing economical?
I would have assumed you'd want to save the best process/node for processing, and could use a less expensive process for RAM.
It's a game changer for sure.... 512GB of unified memory really pushes the envelope, especially for running complex AI models locally. That said, the real test will be in how well the dual-chip design handles heat and power efficiency
The same thing could be designed with greater memory bandwidth, and so it's just a matter of time (for NVIDIA) until Apple decides to compete.
It will cost 4X what it costs to get 512GB on an x86 server motherboard.
I think the other big thing is that the base model finally starts at a normal amount of memory for a production machine. You can't get less than 96GB. Although an extra $4000 for the 512GB model seems Tim Apple levels of ridiculous. There is absolutely no way the difference costs anywhere near that much at the fab.
And the storage solution still makes no sense of course, a machine like this should start at 4TB for $0 extra, 8TB for $500 more, and 16TB for $1000 more. Not start at a useless 1TB, with the 8TB version costing an extra $2400 and 16TB a truly idiotic $4600. If Sabrent can make and sell 8TB m.2 NVMe drives for $1000, SoC storage should set you back half that, not over double that.
Nvidia has had the Grace Hoppers for a while now. Is this not like that?
This is just Apple disrespecting their customer base.
still not ECC
"unified memory"
funny that people think this is so new, when CRAY had Global Heap eons ago...
Why does it matter if you can run the LLM locally, if you're still running it on someone else's locked down computing platform?
For enterprise markets, this is table stakes. A lot of datacenter customers will probably ignore this release altogether since there isn't a high-bandwidth option for systems interconnect.