512GB of unified memory is truly breaking new ground. I was wondering when Apple would overcome memory constraints, and now we're seeing a half-terabyte level of unified memory. This is incredibly practical for running large AI models locally ("600 billion parameters"), and Apple's approach of integrating this much efficient memory on a single chip is fascinating compared to NVIDIA's solutions. I'm curious how this design of "fusing" two M3 Max chips performs in terms of heat dissipation and power consumption, though.
They didn't increase the memory bandwidth; it's the same memory bandwidth that was available on the M2 Studio. Yes, yes, of course you can get 512 gigabytes of uRAM for 10 grand.
The question is: will an LLM run with usable performance at that scale? The point is that there are diminishing returns: even with enough uRAM and the new chip's increased AI processing speed, the memory bandwidth is unchanged.
So there must be an optimal ratio between memory bandwidth and the size of the memory pool, in relation to the processing power.
Since no one specifically answered your question yet, yes, you should be able to get usable performance. A Q4_K_M GGUF of DeepSeek-R1 is 404GB. This is a 671B MoE that "only" has 37B activations per pass. You'd probably expect in the ballpark of 20-30 tok/s (depending on how much of the MBW can actually be utilized) for text generation.
From my napkin math, the M3 Ultra's TFLOPs are still relatively low (around 43 FP16 TFLOPs?), but that should be more than enough to handle bs=1 token generation (which should need way <10 FLOPs/byte for inference). Now as far as its prefill/prompt processing speed goes... well, that's another matter.
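If anyone wants to redo that napkin math, here's a rough sketch. The 819 GB/s figure matches Apple's spec for the M3 Ultra, and the 404GB / 671B / 37B numbers are from the quant above; the "2 FLOPs per active param per token" rule of thumb and the utilization range are my own assumptions:

    # Napkin math for DeepSeek-R1 Q4 decode on an M3 Ultra (all figures approximate).
    total_params    = 671e9
    active_params   = 37e9
    model_bytes     = 404e9                        # Q4_K_M GGUF size
    bytes_per_param = model_bytes / total_params   # ~0.6 bytes/param at Q4_K_M

    mbw    = 819e9     # memory bandwidth, bytes/s
    tflops = 43e12     # rough FP16 FLOP/s estimate

    # Per generated token, roughly stream the active weights once.
    active_bytes_per_token = active_params * bytes_per_param   # ~22 GB/token
    print(f"theoretical ceiling: {mbw / active_bytes_per_token:.0f} tok/s")
    print(f"at 60-80% MBW utilization: "
          f"{0.6 * mbw / active_bytes_per_token:.0f}-{0.8 * mbw / active_bytes_per_token:.0f} tok/s")

    # Is compute the bottleneck? Decode needs ~2 FLOPs per active param per token.
    flops_per_token = 2 * active_params
    print(f"arithmetic intensity needed: {flops_per_token / active_bytes_per_token:.1f} FLOPs/byte")
    print(f"chip balance point:          {tflops / mbw:.0f} FLOPs/byte -> decode is memory-bound")

The ceiling works out to ~37 tok/s, and realistic bandwidth utilization lands you right in that 20-30 tok/s ballpark, while the compute side has headroom to spare for bs=1 decode.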
I actually think it’s not a coincidence and they specifically built this M3 Ultra for DeepSeek R1 4-bit. They also highlight in their press release that they tested it with 600B class LLMs (DeepSeek R1 without referring to it by name). And they specifically did not stop at 256 GB RAM to make this happen. Maybe I’m reading too much into it.
Pretty sure this has absolutely nothing to do with DeepSeek, or even local LLMs at large, which have been a thing for a while and an obvious use case since the original Llama leak and llama.cpp coming around.
Fact is, Mac Pros in the Intel days supported 1.5TB of RAM in some configurations[1], and that was the expectation of their high-end customer base six years ago. They needed to address the gap for those customers, so they would have shipped such a product regardless. Local LLM is the cherry on top. DeepSeek in particular almost certainly had nothing to do with it. They will still need to double the RAM supported by their SoC to get there. Perhaps in a Mac Pro or a different quad-Max-glued chip.
[1]: https://support.apple.com/en-us/101639
The thing that people are excited about here is unified memory that the GPU can address. Mac Pro had discrete GPUs with their own memory.
I understand why they are excited about it; I'm just pointing out it is a happy coincidence. They would have, and should have, made such a product to address the needs of RAM users alone, not VRAM in particular, before they had a credible case to cut macOS releases on Intel.
Intel integrated graphics technically also used unified memory, with standard DRAM.
Those also have terrible performance and worse bandwidth. I am not sure they are really relevant, to be honest.
Did the Xeons in the Mac Pro even have integrated graphics?
So did the Amiga, almost 40 years ago...
You mean this? ;) http://de.wikipedia.org/wiki/Datei:Amiga_1000_PAL.jpg
RIP Jay Miner who watched his unified memory daughters Agnus, Denise and Paula be slowly murdered by Jack Tramiel's vengeance against Irving Gould. [Why couldn't the shareholders have stormed their boardroom 180 days before the company ran out of cash, installed interim management who, in turn, would have brought back the megalomaniac Founder that would, until his dying breath, keep spreading their cash to the super brilliant geniuses that made all the magic chips happen and then turn the resulting empire over to ops people to make their workplace so uncomfortable they all retire early and live happily ever after on tropical islands and snowy mountain tops?]
Yep! Though one could argue the Amiga wasn't true unified memory due to the chip RAM limitations. Depending on the Agnus revision, you'd be limited to 512, 1 meg, or 2 meg max of RAM addressable by the custom chips ("chip RAM".)
fun fact: M-series that are configured to use more than 75% of shared memory for GPU can make the system go boom...something to do with assumptions macOS makes that can be fixed by someone with a "private key" to access kernel mode (maybe not a hardware limit).
I messed around with that setting on one of my Macs. I wanted to load a large LLM model and it needed more than 75% of shared memory.
That or it's the luckiest coincidence! In all seriousness, Apple is fairly consistent about not pushing specs that don't matter and >256GB is just unnecessary for most other common workloads. Factors like memory bandwidth, core count and consumption/heat would have higher impact.
That said, I doubt it was explicitly for R1, but rather based on where the industry was a few years ago, when GPT-3's 175B was SOTA but models were still trending larger. "As much memory as possible" is the name of the game for AI in a way that's not true for other workloads. It may not be true for AI forever either.
The high end Intel Macs supported over a TB of RAM, over 5 years ago. It's kinda crazy Apple's own high end chips didn't support more RAM. Also, the LLM use case isn't new... Though DeepSeek itself may be. RAM requirements always go up.
Just to clarify. There is an important difference between unified memory, meaning accessible by both CPU and GPU, and regular RAM that is only accessible by CPU.
As mentioned elsewhere in this thread, unified memory has existed long before Apple released the M1 CPU, and in fact many Intel processors that Apple used before supported it (though the Mac pros that supported 1.5TB of RAM did not, as they did not have integrated graphics).
The presence of unified memory does not necessarily make a system better. It’s a trade off: the M-series systems have high memory bandwidth thanks to the large number of memory channels, and the integrated GPUs are faster than most others. But you can’t swap in a faster GPU, and when using large LLMs even a Mac Studio is quite slow compared to using discrete GPUs.
Design work on the Ultra would have started 2-3 years ago, and specs for memory at least 18 months ago. I’m not sure they had that kind of inside knowledge for what Deepseek specifically was doing that far in advance. Did Deepseek even know that long ago?
> they specifically built this M3 Ultra for DeepSeek R1 4-bit
Which came out in what, mid January? Yeah, there's no chance Apple (or anyone) has built a new chip in the last 45 days.
Don't they build these Macs just-in-time? The bandwidth doesn't change with the RAM, so surely it couldn't have been that hard to just... use higher capacity RAM modules?
"No chance?" But it has been reported that the next generation of Apple Silicon started production a few weeks ago. Those deliveries may enable Apple to release its remaining M3 Ultra SKUs for sale to the public (because it has something Better for its internal PCC build-out).
It also may point to other devices ᯅ depending upon such new Apple Silicon arriving sooner, rather than later. (Hey, I should start a YouTube channel or religion or something. /s)
No one is saying they built a new chip.
But the decision to come to market with a 512GB sku may have changed from not making sense to “people will buy this”.
Dies are designed in years.
This was just a coincidence.
What part of “no one is saying they designed a new chip” is lost here?
Sorry, none of us are fanboys trying to shape "Apple is great" narratives.
I don’t think you understand hardware timelines if you think this product had literally anything to do with DeepSeek.
Chip? Yes. Product? Not necessarily...
It's not completely out of the question that the 512gb version of M3 Ultra was built for their internal Apple silicon servers powering Private Compute Cloud, but not intended for consumer release, until a compelling use case suddenly arrived.
I don't _think_ this is what happened, but I wouldn't go as far as to call it impossible.
DeepSeek R1 came out Jan 20.
Literally impossible.
The scenario is that the 512gb M3 Ultra was validated for the Mac Studio, and in volume production for their servers, but a business decision was made to not offer more than a 256gb SKU for Mac Studio.
I don't think this happened, but it's absolutely not "literally impossible". Engineering takes time, artificial segmentation can be changed much more quickly.
From “internal only” to “delivered to customers” in 6 weeks is literally impossible.
This change is mostly just using higher density ICs on the assembly line and printing different box art with a SKU change. It does not take much time, especially if they had planned it as a possible product just in case management changed its mind.
That's absurd. Fabbing custom silicon is not something anybody does for a few thousand internal servers. The unit economics simply don't work. Plus Apple is using OpenAI to provide its larger models anyway, so the need never even existed.
Apple is positively building custom servers, and quantities are closer to the 100k range than 1000 [0]
But I agree they are not using m3 ultra for that. It wouldn’t make any sense.
0. https://www.theregister.com/AMP/2024/06/11/apple_built_ai_cl...
That could be why they're also selling it as the Mac Studio M3 Ultra
My thoughts too. This product was in the pipeline maybe 2-3 years ago. Maybe with LLMs getting popular a year ago they tried to fit more memory, but it's almost impossible to do that so close to a launch. Especially when the memory is fused, not just a module you can swap.
Your conclusion is correct but to be clear the memory is not "fused." It's soldered close to the main processor. Not even a Package-on-Package (two story) configuration.
See photo without heatspreader here: https://wccftech.com/apple-m2-ultra-soc-delidded-package-siz...
I think by "fused" I meant it's stuck onto the SoC package, not part of the SoC die, as I may have worded it. While you could maybe still add memory chips later in the manufacturing process, it's probably not easy, especially if you need more chips and a larger module, which might cause more design problems. The memory is close because the controller is in the SoC. So the memory controller would probably also change with higher memory sizes, which means this couldn't have been a last-minute change.
Sheesh, the...comments on that link.
$10k to run a 4 bit quantized model. Ouch.
That's today. What about tomorrow?
The M4 MacBook Pro with 128GB can run a 32B parameter model at 8-bit quantization just fine.
[flagged]
I'm downvoting you because your use of language is so annoying, not because I work for Apple.
So, Microsoft?
what?
Sorry, an apostrophe got lost in "PO's"
[flagged]
are you comparing the same models? How did you calculate the TOPS for M3 Ultra?
An M3 Ultra is two M3 Max chips connected via fabric, so physics.
Did not mean to shit on anyone's parade, but it's a trap for novices, with the caveat that you reportedly can't buy a GB10 until "May 2025" and the expectation that it will be severely supply constrained. For some (overfunded startups running on AI monkey code? Youtube Influencers?), that timing is an unacceptable risk, so I do expect these things to fly off the shelves and then hit eBay this Summer.
> they specifically built this M3 Ultra for DeepSeek R1 4-bit.
This makes sense. They started gluing M* chips together to make Mac Studios three years ago, which must have been in anticipation of DeepSeek R1 4-bit
Any ideas on power consumption? I wonder how much power that would use. It looks like it would be more efficient than everything else that currently exists.
Looks like up to 480W listed here
https://www.apple.com/mac-studio/specs/
Thanks!!
The M2 Ultra Mac Pro could reach a maximum of 330W according to Apple:
https://support.apple.com/en-us/102839
I assume it is similar.
[dead]
I would be curious about the context window size that would be expected when generating a ballpark 20 to 30 tokens per second using DeepSeek-R1 Q4 on this hardware?
Probably helps that models like DeepSeek are mixture of experts. Having all weights in VRAM means you don't have to unload/reload. Memory bandwidth usage should be limited to the 37B active parameters.
> Probably helps that models like DeepSeek are mixture of experts. Having all weights in VRAM means you don't have to unload/reload. Memory bandwidth usage should be limited to the 37B active parameters.
"Memory bandwidth usage should be limited to the 37B active parameters."
Can someone do a deep dive on the above quote? I understand having the entire model loaded into RAM helps with response times. However, I don't quite understand the relationship between memory bandwidth and active parameters.
Context window?
How much of the model can actively be processed, given the memory bandwidth, even though it's fully loaded into memory?
With a mixture of experts model you only need to read a subset of the weights from memory to compute the output of each layer. The hidden dimensions are usually smaller as well so that reduces the size of the tensors you write to memory.
What people who haven't actually worked with this stuff in practice don't realize is that the above statement only holds for batch size 1, sequence size 1. For processing the prompt you will need to read all the weights (which isn't a problem, because prefill is compute-bound, which, in turn, is a problem on a weak machine like this Mac or an "EPYC build" someone else mentioned). Even for inference, batch size greater than 1 (more than one inference at a time) or sequence size greater than 1 (speculative decoding) could require you to read the entire model, repeatedly. MoE is beneficial, but there's a lot of nuance here, which people usually miss.
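To put rough numbers on that batching effect, here's a sketch assuming tokens pick experts independently (they don't, routing is correlated, so treat it as an illustration) and DeepSeek-V3/R1-style expert counts (256 routed experts, 8 active per token):

    # Expected fraction of a MoE layer's routed experts that get touched per step
    # as batch size grows, under an independence assumption (a simplification).
    num_experts = 256
    active_per_token = 8

    for batch in (1, 4, 16, 64, 256):
        # Probability a given expert is hit by at least one token in the batch.
        p_hit = 1 - (1 - active_per_token / num_experts) ** batch
        print(f"batch={batch:4d}: read ~{100 * p_hit:5.1f}% of routed expert weights per layer")

At batch size 1 you touch ~3% of the routed experts per layer, but by batch 64 you're reading most of the model every step, which is the point above.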
No one should be buying this for batch inference obviously.
I remember right after OpenAI announced GPT-3, I had a conversation with someone where we tried to predict how long it would be before GPT-3 could run on a home desktop. This Mac Studio has enough VRAM to run the full 175B-parameter GPT-3 at 16-bit precision, and I think that's pretty cool.
Sure, nuance.
This is why Apple makes so much fucking money: people will craft the wildest narratives about how they’re going to use this thing. It’s part of the aesthetics of spending $10,000. For every person who wants a solution to the problem of running a 400b+ parameter neural network, there are 19 who actually want an exciting experience of buying something, which is what Apple really makes. It has more in common with a Birkin bag than a server.
Birkin bags appreciate in value. This is more like a Lexus. It's a well-crafted luxury good that will depreciate relatively slowly.
Have you seen prices on Lexus LFAs now? They haven't depreciated ha ha. And for those that don't know: https://www.youtube.com/watch?v=fWdXLF9unOE
Computers don't usually depreciate slowly
Relatively, as in a Mac or a Lexus will depreciate slower than other computers/cars.
It used to be very true, but with Apple's popularity the second-hand market is quite saturated (especially since there are many people buying them impulsively).
Unless you have a specific configuration, depreciation isn't much better than an equivalently priced PC. In fact, my experience is that the long tail value of the PC is better if you picked something that was high-end.
I don't know. Can't imagine it's easy to sell a used Windows laptop directly to begin with, and those big resellers probably offer very little. Even refurbished Dell Latitudes seem to go for cheap on eBay. I've had an easy time selling old Macs, and the high-end desktop market might be simple too.
Macs are easy to sell if they are BtO with custom configuration, in that case you may not lose too much. But the depreciation hit hard on the base models, the market is flooded because people who buy those machines tend to change them often or are people who were trying, confused, etc.
Low end PCs (mostly laptops) don't keep value very well, but then again you probably got them for cheap on a deal or something like that, so your depreciation might actually not be as bad as an equivalent Mac. The units you are talking about are enterprise stuff that gets swapped every 3 years or so, for accounting reasons mostly, but it's not the type of stuff I would advise anyone to buy brand new (the strategy would actually be to pick up a second-hand unit).
High-end PCs, laptops or big desktops keep their value pretty well because they are niche by definition and very rare. Depending on your original choice you may actually have a better depreciation than an equivalently priced Mac because there are fewer of them on sale at any given time.
It all depends on your personal situation, strength of local market, ease of reselling through platforms that provides trust and many other variables.
What I meant is that it's not the early 2000s anymore, when you could offload a relatively new Mac (2-3 years old) very easily without being hit by big depreciation, because they were not very common.
In my medium sized town, there is a local second-hand electronic shop where they have all kinds of Mac at all kind of price points. High-end Razers sell for more money and are a rare sight. It's pretty much the same for iPhones, you see 3 years old models hit very hard with depreciation while some niche Android phones take a smaller hit.
Apple went through a weird strategy where at the same time they went for luxury pricing by overcharging for things that makes the experience much better (RAM/storage) but also tried to make it affordable to the mass (by largely compromising on things they shouldn't have).
Apple's behavior created a shady second-hand market with lots of moving part (things being shipped in and out of china) and this is all their doing.
Well these listed prices are asks, not bids, so they only give an upper bound on the value. I've tried to sell obscure things before where there are few or 0 other sellers, and no matter what you list it for, you might never find the buyer who wants that specific thing.
And the electronic shop is probably going to fetch a higher price than an individual seller would, due to trust factor as you mentioned. So have you managed to sell old Windows PCs for decent prices in some local market?
Pretty much. In addition, PyTorch on the Mac is abysmally bad. As is Jax. Idk why Apple doesn't implement proper support, seems important. There's MLX which is pretty good, but you can't really port the entire ecosystem of other packages to MLX this far along in the game. Apple's best bet to credibly sell this as "AI hardware" is to make PyTorch support on the Mac excellent. Right now, as far as AI workloads are concerned, this is only suitable for Ollama.
This is true. Not sure why you are getting downvoted. I say this as someone who ordered a maxed out model. I know I will never have a need to run a model locally, I just want to know I can.
I run Mistral Large locally on two A6000's, in 4 bits. It's nice, but $10K in GPUs buys a lot of subscriptions. Plus some of the strongest LLMs are now free (Grok, DeepSeek) for web use.
I hear you. I make these decisions for a public company.
When engineers tell me they want to run models on the cloud, I tell them they are free to play with it, but that isn’t a project going into the roadmap. OpenAI/Anthropic and others are much cheaper in terms of token/dollar thanks to economies of scale.
There is still value in running your models for privacy issues however, and that’s the reason why I pay attention to efforts in reducing the cost to run models locally or in your cloud provider.
No one who is using this for home use cares about anything except batch size 1 sequence size 1.
What if you're doing bulk inference? The efficiency and throughput of bs=1 s=1 is truly abysmal.
People want to talk to their computer, not service requests for a thousand users.
For decode, MoE is nice for either bs=1 (decoding for a single user), or bs=<very large> (do EP to efficiently serve a large amount of users).
Anything in between suffers.
Just to add onto this point, you expect different experts to be activated for every token, so not having all of the weights in fast memory can still be quite slow as you need to load/unload memory every token.
Probably better to be moving things from fast memory to faster memory than from slow disk to fast memory.
> The question is: will an LLM run with usable performance at that scale?
This is the big question to have answered. Many people claim Apple can now reliably be used as an ML workstation, but from the benchmark numbers I've seen, the models may fit in memory, yet the tok/sec performance is so slow that it doesn't feel worth it compared to running it on NVIDIA hardware.
Although it'd be expensive as hell to get 512GB of VRAM with NVIDIA today, maybe moves like this from Apple could push down the prices at least a little bit.
It is much slower than nVidia, but for a lot of personal-use LLM scenarios, it's very workable. And it doesn't need to be anywhere near as fast considering it's really the only viable (affordable) option for private, local inference, besides building a server like this, which is no faster: https://news.ycombinator.com/item?id=42897205
It's fast enough for me to cancel monthly AI services on a mac mini m4 max.
Could you maybe share a lightweight benchmark where you share the exact model (+ quantization if you're using that) + runtime + used settings and how much tokens/second you're getting? Or just like a log of the entire run with the stats, if you're using something like llama.cpp, LMDesktop or ollama?
Also, would be neat if you could say what AI services you were subscribed to, there is a huge difference between paid Claude subscription and the OpenAI Pro subscription for example, both in terms of cost and the quality of responses.
Hm, five years of those AI services cost half of a minimal M4 Max configuration, which can barely run a severely lobotomized LLaMA 70B. And they provide significantly better models.
Sure, with something like Kagi you even get many models to choose from for a relatively low price, but not everybody likes to send over their codebase and documents to OpenAI.
It's probably much worse than that, with the falling prices of compute.
Smaller, dumber models are faster than bigger, slower ones.
What model do you find fast enough and smart enough?
Not OP but I am finding the Qwen 2.5 32b distilled with DeepSeek R1 model to be a good speed/smartness ratio on the M4 Pro Mac Mini.
I'm running the same exact models.
How much RAM?
It takes between 22GB-37GB depending on the context size etc. from what I've observed.
Thanks!
I presume you're using the Pro, not the Max.
Anyways, what ram config, and what model are you using?
How much RAM are you running on?
Do we know if it is slower because the hardware is not as well suited for the task, or is it mostly a software issue -- the code hasn't been optimized to run on Apple Silicon?
AFAICT the neural engine has accelerators for CNNs and integer math, but not the exact tensor operations in popular LLM transformer architectures that are well-supported in GPUs.
The neural engine is perfectly capable of accelerating matmults. It's just that autoregressive decoding in single batch LLM inference is memory bandwidth constrained, so there are no performance benefits to using the ANE for LLM inference (although, there's a huge power efficiency benefit). And the only way to use the neural engine is via CoreML. Using the GPU with MLX or MPS is often easier.
I have to assume they’re doing something like that in the lab for 4 years from now.
Memory bandwidth is the issue
> The question is: will an LLM run with usable performance at that scale?
For the self-attention mechanism, memory bandwidth requirements scale ~quadratically with the sequence length.
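To make the per-token cost concrete: during decode, each new token reads the whole KV cache, which grows linearly with context, so the total traffic over a long generation is where the ~quadratic scaling shows up. A hedged sizing sketch for a hypothetical dense Llama-70B-like config (80 layers, 8 GQA KV heads, head_dim 128, fp16); DeepSeek itself uses MLA, which compresses the cache substantially:

    # KV-cache bytes that must be read per generated token for a generic dense transformer.
    layers, kv_heads, head_dim, bytes_per_val = 80, 8, 128, 2

    def kv_cache_bytes(context_len: int) -> int:
        # Keys and values, for every layer, for every cached token.
        return 2 * layers * kv_heads * head_dim * bytes_per_val * context_len

    for ctx in (8_192, 32_768, 131_072):
        gb = kv_cache_bytes(ctx) / 1e9
        print(f"context {ctx:>7,}: ~{gb:5.1f} GB of KV cache to read per generated token")

At 128K context that's ~43 GB of cache reads per token on top of the weights, so long contexts eat into the tok/s budget quickly on bandwidth-limited hardware.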
Someone has got to be working on a better method than that. Hundreds of billions are at stake.
Guess what? I'm on a mission to completely max out all 512GB of mem...maybe by running DeepSeek on it. Pure greed!
You could always just open a few Chrome tabs…
It may not be Firefox in terms of hundreds or thousands of tabs but Chrome has gotten a lot more memory efficient since around 2022.
[flagged]
I downvote all Reddit-style memes, jokes, reference humor, catchphrases, and so on. It’s low-effort content that doesn’t fit the vibe of HN and actively makes the site worse for its intended purpose.
>Edit: WTF, someone downvoted "Enjoy the upvotes?" Pathetic.
You should read the HN posting guidelines if you want to understand why. Although I guess in this case it is mostly someone's fat-thumbed downvote.
Give Cities Skylines 2 a try.
It doesn't support Macs yet
[dead]
Any idea what the sRAM to uRAM ratio is on these new GPUs? If they have meaningfully higher sRAM than the Hopper GPUs, it could lead to meaningful speedups in large model training.
If they didn't increase the memory bandwidth, then 512GB will enable longer context lengths and that's about it right? No speedups
For any speedups, you may need some new variant of FlashAttention 3, or something along similar lines purpose-built for Apple GPUs.
I don't know what you mean by s and u, but there is only one kind of memory in the machine, that's what unified memory means.
I assume they mean SRAM versus unified (D)RAM?
Yeah they did? The M4 has a max memory bandwidth of 546GBps, the M3 Ultra bumps that up to a max of 819GBps.
(and the 512GB version is $4,000 more rather than $10,000 - that's still worth mocking, but it's nowhere near as much)
Not that dramatic of an increase actually - the M2 Max already had 400GB/s and M2 Ultra 800GB/s memory bandwidth, so the M3 Ultra's 819GB/s is just a modest bump. Though the M4's additional 146GB/s is indeed a more noticeable improvement.
Also should note that 800/819GB/s of memory bandwidth is actually VERY usable for LLMs. Consider that a 4090 is just a hair above 1000GB/s
Does it work like that though at this larger scale? 512GB of VRAM would be across multiple NVIDIA cards, so the bandwidth and access is parallelized.
But here it looks more of a bottleneck from my (admittedly naive) understanding.
For inference the bandwidth is generally not parallelized, because the data has to pass through the model layer by layer. The most common model-splitting method assigns each GPU a subset of the LLM's layers, and it doesn't take much bandwidth to send the activations over PCIe to the GPU holding the next layers.
My understanding is that the GPU must still load its assigned layers from VRAM into registers and L2 cache for every token, because those aren't large enough to hold a significant portion. So naively, for 24GB of layers, you'd need to move up to 24GB for every token.
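Right, and the gap between those two kinds of traffic is enormous. A rough sketch for a hypothetical dense 70B model in fp16, split layer-wise across 4 GPUs with hidden size 8192 (all assumed numbers):

    # Per-token memory traffic inside each GPU vs. data handed to the next pipeline stage.
    params          = 70e9
    bytes_per_param = 2        # fp16
    num_gpus        = 4
    hidden_size     = 8192
    batch           = 1

    vram_read_per_token = params * bytes_per_param / num_gpus    # each GPU streams its own shard
    pcie_xfer_per_token = batch * hidden_size * bytes_per_param  # one activation vector per boundary

    print(f"VRAM read per GPU per token : {vram_read_per_token / 1e9:.1f} GB")
    print(f"PCIe handoff per boundary   : {pcie_xfer_per_token / 1e3:.1f} KB")

Each GPU streams tens of GB of its own weights from VRAM per token, while only a ~16 KB activation vector crosses PCIe per boundary, which is why the interconnect isn't the bottleneck for batch-1 decode.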
But the memory bandwidth is only part of the equation; the 4090 is at least several times faster at compute compared to the fastest Apple CPU/GPU.
Agree. Finally I can have several hundred browser tabs open simultaneously with no performance degradation.
Well at least 20
New update just came in, make that 15
My M1 Max regularly pushes 1000+ tabs without breaking a sweat, I feel like this particular metric is no longer useful now that background tab memory is almost always unloaded by the browser.
I'm not sure that unified memory is particularly relevant for that-- so e.g. on zen4/zen5 epyc there is more than enough arithmetic power that LLM inference is purely memory bandwidth limited.
On dual (SP5) Epyc I believe the memory bandwidth is somewhat greater than this apple product too... and at apple's price points you can have about twice the ram too.
Presumably the apple solution is more power efficient.
Is this on chip memory? From the 800GB/s I would guess more likely a 512bit bus (8 channel) to DDR5 modules. Doing it on a quad channel would just about be possible, but really be pushing the envelope. Still a nice thing.
As for practicality, which mainstream applications would benefit from this much memory paired with nice but relatively mid compute? At this price point ($14K for a fully specced system), would you prefer it over e.g. a couple of NVIDIA Project DIGITS (assuming that arrives on time and for around the announced $3K price point)?
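On the bus-width question above: it's on-package LPDDR rather than socketed DDR5 modules, and the quoted ~819 GB/s only works out with a very wide interface. A quick sanity check, assuming the LPDDR5-6400 used in prior Ultras carries over:

    # Bandwidth = (bus width in bits / 8) * transfer rate.
    transfer_rate = 6400e6   # LPDDR5-6400, transfers per second per pin (assumed)
    for bus_bits in (512, 1024):
        gb_s = bus_bits / 8 * transfer_rate / 1e9
        print(f"{bus_bits:4d}-bit bus: {gb_s:6.1f} GB/s")

A 512-bit interface at those speeds lands around the Max's ~410 GB/s, so the Ultra's figure implies a 1024-bit interface.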
NVIDIA project DIGITS has 128 GB LPDDR5x coherent unified system memory at a 273 Gb/s memory bus speed.
It would be 273 GB/s (gigabytes, not gigabits). But in reality we don't know the bandwidth. Some ex employee said 500 GB/s.
Your source is a reddit post in which they try to match the size to existing chips, without realizing that it's very likely that NVIDIA is using custom memory here produced by Micron. Like Apple uses custom memory chips.
Yes, but for the price of that single M3 ultra I could have 4 of those GB10's running in a 2x2 cluster with the full NVIDIA stack supported (which is still a big thing)
So the M3 preference will depend on whether a niche can significantly benefit from a monolithic lower-compute, high-memory setup vs a higher-compute but distributed one.
Unless something has changed, it's on package, but not the same die.
Is putting RAM on the same chip as processing economical?
I would have assumed you’d want to save the best process/node for processing, and could use a less expensive processes for RAM.
It's a game changer for sure.... 512GB of unified memory really pushes the envelope, especially for running complex AI models locally. That said, the real test will be in how well the dual-chip design handles heat and power efficiency
The same thing could be designed with greater memory bandwidth, and so it's just a matter of time (for NVIDIA) until Apple decides to compete.
It will cost 4X what it costs to get 512GB on an x86 server motherboard.
What would it cost to get 512GB of VRAM on an Nvidia card? That’s the real comparison.
Apples to oranges. NVIDIA cards have an order of magnitude more horsepower for compute than this thing. A B100 has 8 TB/s of memory bandwidth, 10 times more than this. If NVIDIA made a card with 512GB of HBM I'd expect it to cost $150K.
The compute and memory bandwidth of the M3 Ultra is more in-line with what you'd get from a Xeon or Epyc/Threadripper CPU on a server motherboard; it's just that the x86 "way" of doing things is usually to attach a GPU for way more horsepower rather than squeezing it out of the CPU.
This will be good for local LLM inference, but not so much for training.
This prompts an "old guy anecdote"; forgive me.
When I was much younger, I got to work on compilers at Cray Computer Corp., which was trying to bring the Cray-3 to market. (This was basically a 16-CPU Cray-2 implemented with GaAs parts; it never worked reliably.)
Back then, HPC performance was measured in mere megaflops. And although the Cray-2 had peak performance of nearly 500MF/s/CPU, it was really hard to attain, since its memory bandwidth was just 250M words/s/CPU (2GB/s/CPU); so you had to have lots of operand re-use to not be memory-bound. The Cray-3 would have had more bandwidth, but it was split between loads and stores, so it was still quite a ways away from the competing Cray X-MP/Y-MP/C-90 architecture, which could load two words per clock, store one, and complete an add and a multiply.
So I asked why the Cray-3 didn't have more read bandwidth to/from memory, and got a lesson from the answer that has stuck. You could actually see how much physical hardware in that machine was devoted to the CPU/memory interconnect, since the case was transparent -- there was a thick nest of tiny blue & white twisted wire pairs between the modules, and the stacks of chips on each CPU devoted to the memory system were a large proportion of the total. So the memory and the interconnect constituted a surprising (to me) majority of the machine. Having more floating-point performance in the CPUs than the memory could sustain meant that the memory system was oversubscribed, and that meant that more of the machine was kept fully utilized. (Or would have been, had it worked...)
In short, don't measure HPC systems with just flops. Measure the effective bandwidth over large data, and make sure that the flops are high enough to keep it utilized.
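A modern way to phrase that lesson is "machine balance": peak FLOP/s divided by memory bandwidth, i.e. how many FLOPs of reuse per byte a kernel needs before it stops being memory-bound. A quick sketch using numbers quoted elsewhere in this thread (all approximate, and mixing precisions, so take the comparison loosely):

    # Machine balance = peak FLOP/s / memory bandwidth (bytes/s).
    # Kernels with fewer FLOPs per byte than this are memory-bound on that machine.
    machines = {
        "Cray-2 (per CPU)":       (0.488e9, 2e9),        # ~488 MFLOP/s, ~2 GB/s
        "M3 Ultra (fp16 est.)":   (43e12,   819e9),
        "H100 SXM (fp32)":        (67e12,   3.35e12),
        "H100 SXM (tf32 tensor)": (989e12,  3.35e12),
    }

    for name, (flops, bw) in machines.items():
        print(f"{name:24s} balance ~ {flops / bw:6.1f} FLOPs/byte")

The balance point has drifted from a fraction of a FLOP per byte on the Cray-2 to tens or hundreds today, which is exactly why single-stream LLM decode (a few FLOPs per byte) is bandwidth-bound on all of these machines.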
That is a great story. Please never hesitate to drop these in.
Do you have a blog?
> so you had to have lots of operand re-use to not be memory-bound
Looking at Nvidia's spec sheet, an H100 SXM can do 989 tf32 teraflops (or 67 non-tensor core fp32 teraflops?) and 3.35 TB/s memory (HBM) bandwidth, so ... similar problem?
There is caching today.
The cache hitrate is effectively 0 for LLMs since the datasets are so huge.
Yep, it's apples to oranges. But sometimes you want apples, and sometimes you want oranges, so it's all good!
There's a wide spectrum of potential requirements between memory capacity, memory bandwidth, compute speed, compute complexity, and compute parallelism. In the past, a few GB was adequate for tasks that we assigned to the GPU, you had enough storage bandwidth to load the relevant scene into memory and generate framebuffers, but now we're running different workloads. Conversely, a big database server might want its entire contents to be resident in many sticks of ECC DIMMs for the CPU, but only needed a couple dozen x86-64 threads. And if your workload has many terabytes or petabytes of content to work with, there are network file systems with entirely different bandwidth targets for entire racks of individual machines to access that data at far slower rates.
There's a lot of latency between the needs of programmers and the development and shipping of hardware to satisfy those needs, I'm just happy we have a new option on that spectrum somewhere in the middle of traditional CPUs and traditional GPUs.
As you say, if Nvidia made a 512 GB card it would cost $150k, but this costs an order of magnitude less than that. Even high-end consumer cards like a 5090 have 16x less memory than this does (average enthusiasts on desktops have maybe 8 GB) and just over double the bandwidth (1.7 TB/s).
Also, nit pick FTA:
> Starting at 96GB, it can be configured up to 512GB, or over half a terabyte.
512 GB is exactly half of a terabyte, which is 1024 GB. It's too late for hard drives - the marketing departments have redefined storage to use multipliers of 1000 and invented "tebibytes" - but in memory we still work with powers of two. Please.
Sure, if you want to do training get an NVIDIA card. My point is that it's not worth comparing either Mac or CPU x86 setup to anything with NVIDIA in it.
For inference setups, my point is that instead of paying $10000-$15000 for this Mac you could build an x86 system for <$5K (Epyc processor, 512GB-768GB RAM in 8-12 channels, server mobo) that does the same thing.
The "+$4000" for 512GB on the Apple configurator would be "+$1000" outside the Apple world.
But this is how it wonderfully works. +$4000 does two things: 1. Make Apple very very rich. 2. Make people think this is better than a $10k EPYC. Win-win for Apple. Once you have convinced people that you are the best, a higher price just means they think you are even better.
> The "+$4000" for 512GB on the Apple configurator would be "+$1000" outside the Apple world.
That requires an otherwise equivalent PC to exist. I haven’t seen anyone name a PC with a half-TB of unified memory in this thread.
Yeah it’s $4k. Yeah that’s nuts. But it’s the only game in town like that. If the replacement is a $40k setup from Nvidia or whatever that’s a bargain.
An X86 server comparable in performance to M3 Ultra will likely be a few times more energy hungry, no?
> we still work with powers of two. Please.
We do. Common people don't. It's easier to write "over half a terabyte" than explain (again) to millions of people what the power of two is.
Anyone who calls 512 gigs "over half a terabyte" is bullshitting. No, thank you.
Wasn't me.
[flagged]
Since the GH200 has over a terabyte of VRAM at $343,000 and the H100 has 80GB, that makes that $195,993 with a bit over 512GB of VRAM. You could beat the price of the Apple M3 Ultra with an AMD EPYC build.
GH200 is nowhere near $343,000 number. You can get a single server order around 45k (with inception discount). If you are buying bulk, it goes down to sub-30k ish. This comes with a H100's performance and insane amount of high bandwith memory.
They probably meant 8xH200 for $343,000 which is in the ballpark.
Yes this is what I meant since 8 would cover 512GB of Ram
About $12k when Project Digits comes out.
Apple is shipping today. No future promises.
That will only have 128GB of unified memory
128GB for 3K; per the announcement their ConnectX networking allows two Project Digits devices to be plugged into eachother and work together as one device giving you 256GB for $6k, and, AFAIK, existing frameworks can split models across devices, as well, hence, presumably, the upthread suggestion that Project Digits would provide 512GB for $12k, though arguably the last step is cheating.
the reason Nvidia only talk about two machines over the network is I think they only have one network port, so you need to add costs for a switch.
It clearly has two ports. Just look at the right side of the picture:
https://www.storagereview.com/wp-content/uploads/2025/01/Sto...
You will however get half of the bandwidth and a lot more latency if you have to go through multiple systems.
If you want to split tensorwise yes. Layerwise splits could go over Ethernet.
I would be interested to see how feasible hybrid approaches would be, e.g. connect each pair up directly via ConnectX and then connect the sets together via Ethernet.
You can build an x86 machine that can fully run DeepSeek R1 with 512GB VRAM for ~$2,500?
You will have to explain to me how.
https://digitalspaceport.com/how-to-run-deepseek-r1-671b-ful...
Is that a CPU based inference build? Shouldn't you be able to get more performance out of the M3's GPU?
Inference is about memory bandwidth and some CPUs have just as much bandwidth as a GPU.
https://news.ycombinator.com/item?id=42897205
How would you compare the tok/sec between this setup and the M3 Max?
3.5 - 4.5 tokens/s on the $2,000 AMD Epyc setup. Deepseek 671b q4.
The AMD Epyc build is severely bandwidth and compute constrained.
~40 tokens/s on M3 Ultra 512GB by my calculation.
IMO, it would be more interesting to have a 3-way comparison of price/performance between DeepSeek 671b running on:

1. M3 Ultra 512
2. AMD Epyc (which gen? AVX-512 and DDR5 might make a difference in both performance and cost; Gen 4 or Gen 5 get 8 or 9 t/s: https://github.com/ggml-org/llama.cpp/discussions/11733 )
3. AMD Epyc + 4090 or 5090 running KTransformers (over 10 t/s decode? https://github.com/kvcache-ai/ktransformers/blob/main/doc/en...)
Thanks!
If the M3 can run 24/7 without overheating it's a great deal to run agents. Especially considering that it should run only using 350W... so roughly $50/mo in electricity costs.
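Quick sanity check on that figure (assuming continuous 350 W draw and a $0.20/kWh rate, both of which are assumptions; idle draw will be far lower):

    # Rough monthly electricity cost for a box drawing 350 W around the clock.
    watts, hours_per_month, price_per_kwh = 350, 24 * 30, 0.20
    kwh = watts / 1000 * hours_per_month
    print(f"{kwh:.0f} kWh/month -> ${kwh * price_per_kwh:.0f}/month")

That comes out to ~252 kWh, or about $50/month at that rate.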
Out of curiosity, if you dont mind: what kind of an agent would you run 24/7 locally?
I'd assume this thing peaks at 350W (or whatever) but idles at around 40w tops?
I'm guessing they might be thinking of long training jobs, as opposed to model use in an end product of some sort.
What kind of Nvidia-based rig would one need to achieve 40 tokens/sec on Deepseek 671b? And how much would it cost?
Around 5x Nvidia A100 80GB can fit 671b Q4. $50k just for the GPUs and likely much more when including cooling, power, motherboard, CPU, system RAM, etc.
So the M3 Ultra is amazing value then. And from what I could tell, an equivalent AMD Epyc would still be so constrained that we're talking 4-5 tokens/s. Is this a fair assumption?
No. The advantage of Epyc is you get 12 channels of RAM, so it should be ~6x faster than a consumer CPU.
I realize that but apparently people are still getting very low tokens/sec on Epyc. Why is that? I don't get it, as on paper it should be fast.
The Epyc would only set you back $2000 though, so it’s only a slightly worse price/return.
How many tokens/s would that be though?
That's what I'm trying to get to. Looking to set up a rig, and AMD Epyc seems reasonable but I'd rather go Mac if it's giving many more tokens per second. It does sound like the Mac with M3 Ultra will easily give 40 tokens/s, where as the Epyc is just internally constrained too much, giving 4-5 tokens/s but I'd like someone to confirm that, instead of buying the HW and finding out myself. :)
Probably a lot more. Those are server-grade GPUs. We're talking prosumer grade Macs.
I don't know how to calculate tokens/s for H100s linked together. ChatGPT might help you though. :)
Well, ChatGPT quotes 25k-75k tokens/s with 5 H100 (so very very far from the 40 tokens/s), but I doubt this is accurate (e.g. it completly ignored the fact they are linked together and instead just multiplied the estimation of the tokens/s for one H100 by 5).
If this is remotely accurate though it's still at least an order of magnitude more convenient than the M3 Ultra, even after factoring in all the other costs associated with the infrastructure.
Not really like for like.
The pricing isn't as insane as you'd think; 96 to 256GB is $1,500, which isn't 'cheap', but it could be worse.
All in, $5,500 gets you an Ultra with 256GB memory, 28 cores, 60 GPU cores, and 10Gb networking - I think you'd be hard pushed to build a server for less.
5,500 easily gets me either vastly more CPU cores if I care more about that or a vastly faster GPU if I care more about that. Or for both a 9950x + 5090 (assuming you can actually find one in stock) is ~$3000 for the pair + motherboard, leaving a solid $2500 for whatever amount of RAM, storage, and networking you desire.
The M3 strikes a very particular middle ground for AI of lots of RAM but a significantly slower GPU which nothing else matches, but that also isn't inherently the right balance either. And for any other workloads, it's quite expensive.
You'll need a couple of 32GB 5090s to run a quantized 70B model, maybe 4 to run a 70b model without quantization, forget about anything larger than that. A huge model might run slow on a M3 Ultra, but at least you can run it all.
I have a Max M3 (the non-binned one), and I feel like 64GB or 96GB is within the realm of enabling LLMs that run reasonable fast on it (it is also a laptop, so I can do things on planes or trips). I thought about the Ultra, if you have 128GB for a top line M3 Ultra, the models that you could fit into memory would run fairly fast. For 512GB, you could run the bigger models, but not very quickly, so maybe not much point (at least for my use cases).
That config would also use about 10x the power, and you still wouldn't be able to run a model over 32GB whereas the studio can easily cope with 70B llama and plenty of space to grow.
I think it actually is perfect for local inference in a way that that build, or any other PC build in this price range, wouldn't be.
The M3 Ultra studio also wouldn't be able to run path traced Cyberpunk at all no matter how much RAM it has. Workloads other than local inference LLMs exist, you know :) After all, if the only thing this was built to do was run LLMs then they wouldn't have bothered adding so many CPU cores or video engines. CPU cores (along with networking) being 2 of the specs highlighted by the person I was responding to, so they were obviously valuing more than just LLM use cases.
Bad game example because cyberpunk with raytracing is coming to macOS and will run on this.
The core customer market for this thing remains Video Editors. That’s why they talk about simultaneous 8K encoding streams.
Apple’s Pro segment has been video editors since the 90s.
Well that's what (s)he meant, the Mac Studio fits the AI use case but not other ones so much.
Consumer hardware is cheap, if 192 GB of RAM is enough for you. But if you want to go beyond that, the Mac Studio is very competitively priced. A minimal Threadripper workstation with 256 GB is ~$7400 from Puget Systems. If you increase the memory to 512 GB, the price goes up to ~$10900. Mostly because 128 GB modules are about as expensive as what Apple charges for RAM. A Threadripper Pro workstation can use cheaper 8x64 GB for the same capacity, but because the base system is more expensive, you'll end up paying ~$11600.
The Mac almost fits in the palm of your hand, and runs, if not silently, practically so. It doesn't draw excessive power or generate noticeable heat.
None of those will be true for any PC/Nvidia build.
It's hard to put a price on quality of life.
That’s not going to yield the same bandwidth or memory latency though, right?
You'd need a chip with 8 memory channels. 16 DIMM slots, IIRC.
I think the other big thing is that the base model finally starts at a normal amount of memory for a production machine. You can't get less than 96GB. Although an extra $4000 for the 512GB model seems Tim Apple levels of ridiculous. There is absolutely no way that the difference costs anywhere near that much at the fab.
And the storage solution still makes no sense of course, a machine like this should start at 4TB for $0 extra, 8TB for $500 more, and 16TB for $1000 more. Not start at a useless 1TB, with the 8TB version costing an extra $2400 and 16TB a truly idiotic $4600. If Sabrent can make and sell 8TB m.2 NVMe drives for $1000, SoC storage should set you back half that, not over double that.
> There is absolutely no way that the difference costs anywhere near that much at the fab.
price premium probably, but chip lithography errors (thus, yields) at the huge memory density might be partially driving up the cost for huge memory.
> but chip lithography errors (thus, yields) at the huge memory density might be partially driving up the cost for huge memory.
Apple's not having TSMC fab a massive die full of memory. They're buying a bunch of small dies of commodity memory and putting them in a package with a pair of large compute dies. How many of those small commodity memory dies they use has nothing to do with yield.
Is there a teardown link available for what you wrote? If so, that’s interesting.
This has been pretty clear about all Apple chip designs, going back to some of the first A series afaik. They are "unified memory" but not "memory on die", they've always been "memory on package"-- ie. the ram is packaged together with the CPU, often under a single heat spreader, but they are separate components.
Apple's own product shots have shown this. Here's a bunch of links that clearly show the memory as separate. Lots of these modules you can make out the serial or model numbers and look up the manufacturer of them from directly :)
- Side-by-side teardown of M1 Pro vs M2 Pro laptop motherboards showing separate ram chips with discussion on how apple is moving to different type of ram configurations: https://www.ifixit.com/News/71442/tearing-down-the-14-macboo...
- M2 teardown with the chip + ram highlighted: https://www.macrumors.com/2022/07/18/macbook-air-m2-chip-tea...
- Photo of the A12 with separate ram chips on a single "package": https://en.wikipedia.org/wiki/Apple_A12X
- M1 Ultra with heat spreader removed, clearly showing 3rd party ram chips onpackage: https://iphone-mania.jp/news-487859/
neat! thanks
This is also a niche product. The number they sell is going to be very tiny compared to the base model MacBook, let alone the iPhone.
Apple absolutely loves to gouge for upgrades, but the chips in this have got to be expensive. I almost wonder if the absolute base model of this machine has noticeably lower margins than a normal Apple product because of that. But they expect/know that most everyone who buys one is going to spec it up.
It's Apple, price premium is a given.
Nvidia has had the Grace Hoppers for a while now. Is this not like that?
This is cheap compared to GB200, which has a street price of >$70k for just the chip alone if you can even get one. Also GB200 technically has only 192GB per GPU and access to more than that happens over NVLink/RDMA, whereas here it’s just one big flat pool of unified memory without any tiered access topology.
We finally encountered the situation where an Apple computer is cheaper than its competition ;-)
All joking aside, I don't think Apples are that expensive compared to similar high-end gear. I don't think there is any other compact desktop computer with half a terabyte of RAM accessible to the GPU.
I mean expensive relative to who, Nvidia? Both are enjoying little to no competition in their respective niche and are using that monopoly power to extract massive margins. I have no doubt it could be much cheaper if there was actual competition in the market.
Fortunately it seems like AMD is finally catching on and working towards producing a viable competitor to the M series chips.
And yet all that cash still just goes to TSMC
They are selling the shovels for this gold rush. Also, ASML, who sells machines to make shovels.
This is just Apple disrespecting their customer base.
still not ECC
"unified memory"
funny that people think this is so new, when CRAY had Global Heap eons ago...
The real hardware needed for artificial intelligence wasn't NVIDIA, it was a CRAY XMP from 1982 all along
When I was with Mirantis, I flew to Austin TX to meet a client in a non-descript multi-tenant office building...
we walked in and, while getting our bearings, came upon a CRAY office. WTF?!
I tried the doors, locked - and it was clearly empty... but damn did I want to steal their office door signage.
It's new for mainstream PCs to have it.
Nope, it was common in 8- and 16-bit home computers, and with respect to PCs themselves, graphics memory was mapped into main memory until the arrival of dedicated 3D cards.
And even with 3D, integrated GPUs have existed for years.
The CPUs with iGPUs didn't also have the memory on-chip. The Nintendo 64 did. Not sure about the old home computers, but I thought those had separate memory usually.
Of course not, because they are not designed as SOCs, the only memory on chip is cache, it doesn't change the fact the memory is one whole block shared between CPU and iGPU.
Apple does not have the memory on-chip (on the same die as the CPU) either.
Like pretty much every game console.
New for performance machines maybe. I remember "integrated graphics" when that meant some shitty co-processor and 16 or 32MB of semi-reserved system RAM.
It's not new for PC to block user ram upgrade
You mean the room sized super computer than sold tens of units?
Yes, but now its in my pocket.
Why did it take so long for us to get here?
Some possible groups of reasons:

1. Until recently, RAM amount was something the end user liked to configure, so little market demand.
2. Technically, building such a large system on a chip or collection of chiplets was not possible.
3. RAM speed wasn't a bottleneck for most tasks, it was IO or CPU. LLMs changed this.
M1 came out before the LLM rush, though
The M1 is in a product segment where discrete GPUs have been gone for decades, in favor of integrated graphics that shares one pool of RAM with the CPU. The better question to ask is why Apple kept using that unified memory design even when moving up to larger chips like the M1 Max and M1 Ultra.
The GPU is built into the same physical die as the CPU.
So if you wanted to give it a second ram pool you would have to add an entire second memory interface just for the on-die GPU.
Now all you’ve done is make it more complicated, slower because now you have to move things between the two pools, and gained what exactly?
I think it was a very clear and obvious decision to make. It's an outgrowth of how the base chips were designed, and it turned out to be extremely handy for some things. Plus, since all their modern devices now work this way, that probably simplifies the software.
I'm not saying it's genius foresight, but it certainly worked out rather well. There's nothing stopping them from supporting discrete GPUs too if they wanted to. They just clearly don't.
I'd guess that they inherited it from the iPhone chips. It was nice and fast and also makes Apple a lot of profit as no third party RAM is possible.
They put the M1 into the desktops too
Apple debuted dedicated machine learning hardware in 2017 with the Neural Engine on iPhones. While I don’t think they predicted the LLM explosion in particular, they knew machine learning was important and they have been allowing that to influence hardware design.
Apple has always liked to integrate as much as possible on the same chip. It was only natural that they would come to this conclusion, with the improved perf the cherry on top.
Well also these chips originated in phones, where they kinda had to integrate it. And the quicker RAM and disk access are pretty nice.
Laptops have had unified memory for ten years or more. For desktops very few apps benefit from unified memory.
And game consoles that use similar parts as laptops.
Just a guess, but fabricating this can't be easy. Yield is probably higher if you have less memory per chip.
It's regular memory on separate chips.
Why does it matter if you can run the LLM locally, if you're still running it on someone else's locked down computing platform?
Running locally, your data is not sent outside of your security perimeter off to a remote data center.
If you are going to argue that the OS or even below that the hardware could be compromised to still enable exfiltration, that is true, but it is a whole different ballgame from using an external SaaS no matter what the service guarantees.
For enterprise markets, this is table stakes. A lot of datacenter customers will probably ignore this release altogether since there isn't a high-bandwidth option for systems interconnect.
The Mac Studio isn’t meant for data centers anyway? It’s a small and silent desktop form factor — in every respect the opposite of a design you’d want to put in a rack.
A long time ago Apple had a rackmount server called Xserve, but there’s no sign that they’re interested in updating that for the AI age.
It's the Ultra chip, the same one that goes into the rackmount Mac Pro. I don't think there's much confusion as to who this is for.
> there’s no sign that they’re interested in updating that for the AI age.
https://security.apple.com/blog/private-cloud-compute/
The rackmount Mac Pro is for A/V studios, not datacenters.
Don't forget CI/CD farms for iOS builds, although I think it's much more cost effective to just make Minis or Studios work, despite their nonstandard formfactor
Google and Facebook have vast fleets of Minis in custom chassis for this purpose.
I genuinely forgot the Mac Pro still exists. It’s been so long since I even saw one.
And I’ve had every previous Mac tower design since 1999: G4, G5, the excellent dual Xeon, the horrible black trash can… But Apple Silicon delivers so much punch in the Studio form factor, the old school Pro has become very niche.
Edit - looks like the new M3 Ultra is only available in Mac Studio anyway? So the existence of the Pro is moot here.
never understood the hate on the trash can. Isn't the mac studio basically the same idea as the trash can but even less upgradeable?
The Mac Studio hit a sweet spot in 2023 that the trash can Mac Pro couldn't ten years earlier. It's mostly thanks to the high integration of Apple Silicon and improved device availability and speed of Thunderbolt.
The 2013 Mac Pro was stuck forever with its original choice of Intel CPU and AMD GPU. And it was unfortunately prone to overheating due to these same components.
The trash can also suffered from hitting the market right around when the industry gave up on making dual-GPU work.
Yep. It was designed for CPU grunt, and came out right when people swapped to wanting tons of GPU grunt.
The cooling solution wasn’t designed for huge GPUs. So it couldn’t really be upgraded in ways most people wanted.
Folks that want to keep the customisation aspect of Mac Pro hardly see that.
In fact a very famous podcaster is still holding on to his.
The Studio also hits a sweet spot for home users like me that want tons of IO and no built in input devices.
Outside of extremely niche use cases, who is racking apple products in 2025?
There's MacMiniVault (nee MacMiniColo) https://www.macminivault.com/
Not sure if they count as niche or not.
Every provider who offers MacOS in the cloud.
So MacOS is still not allowed to be virtualized per the EULA? Wow if that's true...
MacOS is permitted to be virtualized... as long as the host is a Mac. :)
AWS
GitHub for their macOS runners (pretty sure they're M1 Minis)
Apple recently announced they’re building a new plant in Texas to produce servers. Yes, they need servers for their Private Compute Cloud used by Apple Intelligence, but it doesn’t only need to be for that.
From https://www.apple.com/newsroom/2025/02/apple-will-spend-more...
As part of its new U.S. investments, Apple will work with manufacturing partners to begin production of servers in Houston later this year. A 250,000-square-foot server manufacturing facility, slated to open in 2026, will create thousands of jobs.
Thunderbolt 5 can do bi-directional 80 Gbps....and Mac Studio Ultra has 6 ports...
That's still not even competitive with 100G Ethernet on a per-port basis. An overall bandwidth of 480 Gbps pales in comparison with, for example, the 3200 Gbps you get with a P5 instance on EC2.
A 3 year reservation of a P5 is over a million dollars though? Not sure how that's comparable....
To add to this GPU servers like supermicro have a 400GBe port per GPU plus more for the CPU.
Cost competitive though?
You can use Thunderbolt 5 interconnect (80Gbps) to run LLMs distributed across 4 or 5 Mac Studios.
But 80Gbit/s is way slower than even regular dual channel RAM, or am I missing something here? That would mean the LLM would be excruciatingly slow. You could get an old EPYC for a fraction of that price and have more performance.
The weights don't go over the network so performance is OK.
If I'm not mistaken, each token produced roughly equals the whole model in memory transfers (the exception being MoE models). That's why memory bandwidth is so important in the first place, or not?
My understanding is that if you can store 1/Nth of the weights in RAM on each of the N nodes then there's no need to send the weights over the network.
You're correct about the weights: each machine could in fact store all of the weights. However I think you still have to transfer the activations and the KV-Cache while performing inference.
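For a layer-wise split, the per-token traffic over the link is tiny compared to the time budget (and each machine can keep the KV cache for its own layers, so mostly just activations cross the link). A rough sketch, assuming one fp16 activation vector per pipeline boundary per token (DeepSeek-V3/R1's hidden size is 7168) and the ~25 tok/s ballpark from earlier in the thread:

    # Per-token handoff over Thunderbolt 5 vs. the per-token time budget.
    hidden_size      = 7168
    bytes_per_val    = 2            # fp16
    link_bytes_per_s = 80e9 / 8     # 80 Gbit/s -> bytes/s
    tok_per_s        = 25           # ballpark decode speed

    xfer_bytes   = hidden_size * bytes_per_val
    xfer_time_us = xfer_bytes / link_bytes_per_s * 1e6
    print(f"activation handoff: {xfer_bytes / 1e3:.1f} KB -> {xfer_time_us:.1f} us over TB5")
    print(f"per-token budget at {tok_per_s} tok/s: {1e3 / tok_per_s:.0f} ms")

So for batch-1 decode the 80 Gbps link is nowhere near the bottleneck; per-hop latency and prompt processing, where much more data and compute are in flight, are the bigger concerns.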
why would you ever want to do that remains an open question
Probably some kind of local LLM server. 1TB of 1.6 TB/s memory if you link 2 together. $20k total. Half the price of a single Blackwell chip.
with a vanishingly small fraction of flops and a small fraction of memory bandwidth
It's good enough to run whatever local model you want. 2x 80core GPU is no joke. Linking them together gives it effectively 1.6 TB/s of bandwidth. 1TB of total memory.
You can run the full Deepseek 671b q8 model at 40 tokens/s. Q4 model at 80 tokens/s. 37B active params at a time because R1 is MoE.
Linking 2 of these together lets you run a model more capable than GPT-4o (R1) at a comfortable speed at home. That was simply fantasy a year ago.
> with a vanishingly small fraction of flops and a small fraction of memory bandwidth
Is it though?
Wikipedia says [1] an M3 Max can do 14 TFLOPS of FP32, so an M3 Ultra ought to do 28 TFLOPS. nVidia claims [2] a Blackwell GPU does 80 TFLOPs of FP32. So M3 Ultra is 1/3 the speed of a Blackwell.
Calling that "a vanishingly small fraction" seems like a bit of an exaggeration.
I mean, by that metric, a single Blackwell GPU only has "a vanishingly small fraction" of the memory of an M3 Ultra. And the M3 Ultra is only burning "a vanishingly small fraction" of a Blackwell's electrical power.
nVidia likes throwing around numbers like "20 petaFLOPs" for FP4, but that's not real floating point... it's just 1990's-vintage uLaw/aLaw integer math.
[1] https://en.wikipedia.org/wiki/Apple_silicon#Comparison_of_M-...
[2] https://resources.nvidia.com/en-us-blackwell-architecture/da...
Edit: Further, most (all?) of the TFLOPs numbers you see on nVidia datasheets for "Tensor FLOPs" have a little asterisk next to them saying they are "effective" TFLOPs using the sparsity feature, where half the elements of the matrix multiplication are zeroed.
TFLOPS are teraflops not “tensor flops”.
Blackwell and modern AI chips are built for fp16. B100 has 1750 tflops of fp16. M3 ultra has ~80tflops of fp16 or about 4% that of b100
That article says you can connect them through the Thunderbolt 5 somehow to form clusters.
I wonder if that’s something new, or just the same virtual network interface that’s been around since the TB1 days (a new network interface appears when you connect two Macs with a TB cable)
It's the same host-to-host USB network, I believe.
I'm super interested in the clustering capability. At launch people said they were only getting like 11Gbps from their TB4 drive arrays, which was really way less than expected.
Apple does kind of advertise that each TB port has its own controller. Which gives me hope that whatever one port can do, six ports can do 6x better.
AMD's Strix Halo victory feels much more shallow today. Eventually 48GB or 64GB sticks will probably expand Strix Halo to 192 then 256GB. But Strix Halo is super IO-starved, with basically a desktop's worth of IO and no easy way to do host-to-host, and Apple absolutely understands that the use of a chip is bounded by what it can connect to. 6x TB5, if even half true, will be utterly outstanding.
It's been so so so so cool to see Non-Transparent Bridging atop Thunderbolt, so one host can act like a device. Since it's PCIe, that hypothetically would allow amazing RDMA over TB. USB4 mandates host-to-host networking, but I have no idea how it is implemented and I suspect it's nowhere near as close to the metal.
In 2017 I was working for a company that was trying to develop foundation models, and I was developing a framework for training what were then large neural networks [1] and other models.
It was "yet another mac-oriented startup" but I had them get me an Alienware laptop because I could get one with a 1070 mobile card that meant I could train on my laptop whereas the data sci's had to do everything on our DGX-1. [2]
Today it is the other way around, the Mac Studio looks like the best AI development workstation you can get.
[1] I was really partial to a character-level CNN model we had
[2] CEO presented next to Jensen Huang at a NVIDIA conference, his favorite word was "incredible". I thought it was "incredible" when I heard they got bought by Nike, but it was true.
Well already it is faster than GigE...
https://arstechnica.com/gadgets/2013/10/os-x-10-9-brings-fas...
Thunderbolt is PCIe-based and I could imagine it being extended to do what https://en.wikipedia.org/wiki/Compute_Express_Link and https://en.wikipedia.org/wiki/InfiniBand do.