506 comments
  • cxie3m

    512GB of unified memory is truly breaking new ground. I was wondering when Apple would overcome memory constraints, and now we're seeing a half-terabyte level of unified memory. This is incredibly practical for running large AI models locally ("600 billion parameters"), and Apple's approach of integrating this much efficient memory on a single chip is fascinating compared to NVIDIA's solutions. I'm curious about how this design of "fusing" two M3 Max chips performs in terms of heat dissipation and power consumption though

    • FloatArtifact3m

      They didn't increase the memory bandwidth; you get the same memory bandwidth that's available on the M2 Studio. Yes, yes, of course you can get 512 gigabytes of uRAM for 10 grand.

      The question is whether an LLM will run with usable performance at that scale. The point is that there are diminishing returns: even with the new chip's increased processing speed for AI, you have enough uRAM but the same memory bandwidth.

      So there must be a min-max performance ratio between memory bandwidth and the size of the memory pool in relation to the processing power.

      • lhl3m

        Since no one has specifically answered your question yet: yes, you should be able to get usable performance. A Q4_K_M GGUF of DeepSeek-R1 is 404GB. This is a 671B MoE that "only" has 37B activations per pass. You'd probably expect something in the ballpark of 20-30 tok/s for text generation (depending on how much of the MBW can actually be utilized).

        From my napkin math, the M3 Ultra's TFLOPS are still relatively low (around 43 FP16 TFLOPS?), but that should be more than enough to handle bs=1 token generation (inference should be way under 10 FLOPs/byte). As for its prefill/prompt processing speed... well, that's another matter.
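
        If anyone wants to redo that napkin math with their own numbers, here's a rough sketch of the bandwidth-bound estimate (the utilization fraction and bytes-per-parameter are my own assumptions, not measurements):

            # Rough, bandwidth-bound estimate of bs=1 decode speed for a MoE model.
            # Inputs are spec-sheet numbers plus assumptions, not benchmarks.
            mem_bw_gbs = 819             # M3 Ultra advertised memory bandwidth, GB/s
            bw_utilization = 0.6         # assumed fraction of peak bandwidth actually achieved
            active_params = 37e9         # DeepSeek-R1 active parameters per token
            bytes_per_param = 404 / 671  # ~0.6 bytes/param for the Q4_K_M GGUF (404GB / 671B params)
            bytes_per_token = active_params * bytes_per_param              # ~22GB read per token
            tok_per_s = mem_bw_gbs * 1e9 * bw_utilization / bytes_per_token
            print(f"~{tok_per_s:.0f} tok/s")                               # lands in the 20-30 range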

        • lynguist3m

          I actually think it’s not a coincidence and they specifically built this M3 Ultra for DeepSeek R1 4-bit. They also highlight in their press release that they tested it with 600B class LLMs (DeepSeek R1 without referring to it by name). And they specifically did not stop at 256 GB RAM to make this happen. Maybe I’m reading too much into it.

          • tgma3m

            Pretty sure this has absolutely nothing to do with DeepSeek, or even local LLMs at large, which have been a thing for a while and an obvious use case since the original Llama leak and llama.cpp came around.

            Fact is, Mac Pros in the Intel days supported 1.5TB of RAM in some configurations[1], and that was the expectation of their high-end customer base 6 years ago. They needed to address the gap for those customers, so they would have shipped such a product regardless. Local LLM is a cherry on top. DeepSeek in particular almost certainly had nothing to do with it. They will still need to double their supported RAM in their SoC to get there. Perhaps in a Mac Pro or a different quad-Max-glued chip.

            [1]: https://support.apple.com/en-us/101639

            • saagarjha3m

              The thing that people are excited about here is unified memory that the GPU can address. Mac Pro had discrete GPUs with their own memory.

              • tgma3m

                I understand why they are excited about it; I'm just pointing out it is a happy coincidence. They would have, and should have, made such a product to address the needs of RAM users alone, not VRAM in particular, before they could make a credible case to cut macOS releases on Intel.

              • water93m

                Intel integrated graphics technically also used unified memory with standard DRAM.

                • kergonath3m

                  Those also have terrible performance and worse bandwidth. I am not sure they are really relevant, to be honest.

                • McDaveNZ3m

                  Did the Xeons in the Mac Pro even have integrated graphics?

                • icedchai3m

                  So did the Amiga, almost 40 years ago...

                  • vaxman3m

                    You mean this? ;) http://de.wikipedia.org/wiki/Datei:Amiga_1000_PAL.jpg

                    RIP Jay Miner who watched his unified memory daughters Agnus, Denise and Paula be slowly murdered by Jack Tramiel's vengeance against Irving Gould. [Why couldn't the shareholders have stormed their boardroom 180 days before the company ran out of cash, installed interim management who, in turn, would have brought back the megalomaniac Founder that would, until his dying breath, keep spreading their cash to the super brilliant geniuses that made all the magic chips happen and then turn the resulting empire over to ops people to make their workplace so uncomfortable they all retire early and live happily ever after on tropical islands and snowy mountain tops?]

                    • icedchai3m

                      Yep! Though one could argue the Amiga wasn't true unified memory due to the chip RAM limitations. Depending on the Agnus revision, you'd be limited to 512K, 1MB, or 2MB max of RAM addressable by the custom chips ("chip RAM").

                      • vaxman3m

                        fun fact: M-series that are configured to use more than 75% of shared memory for GPU can make the system go boom...something to do with assumptions macOS makes that can be fixed by someone with a "private key" to access kernel mode (maybe not a hardware limit).

                        • icedchai3m

                          I messed around with that setting on one of my Macs. I wanted to load a large LLM model and it needed more than 75% of shared memory.

          • kmacdough3m

            That or it's the luckiest coincidence! In all seriousness, Apple is fairly consistent about not pushing specs that don't matter and >256GB is just unnecessary for most other common workloads. Factors like memory bandwidth, core count and consumption/heat would have higher impact.

            That said, I doubt it was explicitly for R1, but rather based on where the industry was a few years ago, when GPT-3's 175B was SOTA and the industry was still looking larger. "As much memory as possible" is the name of the game for AI in a way that's not true for other workloads. It may not be true for AI forever either.

            • icedchai3m

              The high end Intel Macs supported over a TB of RAM, over 5 years ago. It's kinda crazy Apple's own high end chips didn't support more RAM. Also, the LLM use case isn't new... Though DeepSeek itself may be. RAM requirements always go up.

              • teknologist3m

                Just to clarify. There is an important difference between unified memory, meaning accessible by both CPU and GPU, and regular RAM that is only accessible by CPU.

                • angoragoats3m

                  As mentioned elsewhere in this thread, unified memory has existed long before Apple released the M1 CPU, and in fact many Intel processors that Apple used before supported it (though the Mac Pros that supported 1.5TB of RAM did not, as they did not have integrated graphics).

                  The presence of unified memory does not necessarily make a system better. It’s a trade off: the M-series systems have high memory bandwidth thanks to the large number of memory channels, and the integrated GPUs are faster than most others. But you can’t swap in a faster GPU, and when using large LLMs even a Mac Studio is quite slow compared to using discrete GPUs.

          • brookst3m

            Design work on the Ultra would have started 2-3 years ago, and specs for memory at least 18 months ago. I’m not sure they had that kind of inside knowledge for what Deepseek specifically was doing that far in advance. Did Deepseek even know that long ago?

          • happyopossum3m

            > they specifically built this M3 Ultra for DeepSeek R1 4-bit

            Which came out in what, mid January? Yeah, there's no chance Apple (or anyone) has built a new chip in the last 45 days.

            • tempaccount4203m

              Don't they build these Macs just-in-time? The bandwidth doesn't change with the RAM, so surely it couldn't have been that hard to just... use higher capacity RAM modules?

            • vaxman3m

              "No chance?" But it has been reported that the next generation of Apple Silicon started production a few weeks ago. Those deliveries may enable Apple to release its remaining M3 Ultra SKUs for sale to the public (because it has something Better for its internal PCC build-out).

              It also may point to other devices ᯅ depending upon such new Apple Silicon arriving sooner, rather than later. (Hey, I should start a YouTube channel or religion or something. /s)

            • SV_BubbleTime3m

              No one is saying they built a new chip.

              But the decision to come to market with a 512GB sku may have changed from not making sense to “people will buy this”.

              • cyanydeez3m

                Dies are designed in years.

                This was just a coincidence.

                • SV_BubbleTime3m

                  What part of “no one is saying they designed a new chip” is lost here?

                  • cyanydeez3m

                    Sorry, none of us are fanboys trying to shape "Apple is great" narratives.

              • 3m
                [deleted]
          • forrestthewoods3m

            I don’t think you understand hardware timelines if you think this product had literally anything to do with anything DeepSeek.

            • reitzensteinm3m

              Chip? Yes. Product? Not necessarily...

              It's not completely out of the question that the 512gb version of M3 Ultra was built for their internal Apple silicon servers powering Private Compute Cloud, but not intended for consumer release, until a compelling use case suddenly arrived.

              I don't _think_ this is what happened, but I wouldn't go as far as to call it impossible.

              • forrestthewoods3m

                DeepSeek R1 came out Jan 20.

                Literally impossible.

                • reitzensteinm3m

                  The scenario is that the 512gb M3 Ultra was validated for the Mac Studio, and in volume production for their servers, but a business decision was made to not offer more than a 256gb SKU for Mac Studio.

                  I don't think this happened, but it's absolutely not "literally impossible". Engineering takes time, artificial segmentation can be changed much more quickly.

                  • forrestthewoods3m

                    From “internal only” to “delivered to customers” in 6 weeks is literally impossible.

                    • ryao3m

                      This change is mostly just using higher density ICs on the assembly line and printing different box art with a SKU change. It does not take much time, especially if they had planned it as a possible product just in case management changed its mind.

              • jahewson3m

                That's absurd. Fabbing custom silicon is not something anybody does for a few thousand internal servers; the unit economics simply don't work. Plus, Apple is using OpenAI to provide its larger models anyway, so the need never even existed.

            • bustling-noose3m

              My thoughts too. This product was in the pipeline maybe 2-3 years ago. Maybe with LLMs getting popular a year ago they tried to fit more memory, but it's almost impossible to do that this close to a launch, especially when the memory is fused on, not just a module you can swap.

              • tgma3m

                Your conclusion is correct but to be clear the memory is not "fused." It's soldered close to the main processor. Not even a Package-on-Package (two story) configuration.

                See photo without heatspreader here: https://wccftech.com/apple-m2-ultra-soc-delidded-package-siz...

                • bustling-noose3m

                  I think by "fused" I meant it's stuck onto the SoC module, not part of the SoC as I may have worded it. While you could maybe still add NANDs later in the manufacturing process, it's probably not easy, especially if you need more NANDs and a larger module, which might cause more design problems. The NAND is closer because the controller is in the SoC. So the memory controller would probably also change with higher memory sizes, which means this couldn't have been a last-minute change.

                • fennecfoxy3m

                  Sheesh, the...comments on that link.

          • nightski3m

            $10k to run a 4 bit quantized model. Ouch.

            • OriginalMrPink3m

              That's today. What about tomorrow?

            • water93m

              The M4 MacBook Pro with 128GB can run a 32B parameter model at 8-bit quantization just fine.

            • vaxman3m

              [flagged]

              • titanomachy3m

                I'm downvoting you because your use of language is so annoying, not because I work for Apple.

              • fredoliveira3m

                what?

                • vaxman3m

                  Sorry, an apostrophe got lost in "PO's"

              • vaxman3m

                [flagged]

                • 1R0533m

                  are you comparing the same models? How did you calculate the TOPS for M3 Ultra?

                  • vaxman3m

                    An M3 Ultra is two M3 Max chips connected via fabric, so physics.

                    Did not mean to shit on anyone's parade, but it's a trap for novices, with the caveat that you reportedly can't buy a GB10 until "May 2025" and the expectation that it will be severely supply constrained. For some (overfunded startups running on AI monkey code? Youtube Influencers?), that timing is an unacceptable risk, so I do expect these things to fly off the shelves and then hit eBay this Summer.

          • jrflowers3m

            > they specifically built this M3 Ultra for DeepSeek R1 4-bit.

            This makes sense. They started gluing M* chips together to make Mac Studios three years ago, which must have been in anticipation of DeepSeek R1 4-bit

          • a1o3m

            Any ideas on power consumption? I wonder how much power would that use. It looks like it would be more efficient than everything else that currently exists.

          • khana3m

            [dead]

        • drited3m

          I would be curious what context window size could be expected when generating a ballpark 20 to 30 tokens per second using DeepSeek-R1 Q4 on this hardware.

        • 3m
          [deleted]
      • valine3m

        Probably helps that models like DeepSeek are mixture-of-experts. Having all weights in VRAM means you don't have to unload/reload. Memory bandwidth usage should be limited to the 37B active parameters.

        • FloatArtifact3m

          > Probably helps that models like DeepSeek are mixture-of-experts. Having all weights in VRAM means you don't have to unload/reload. Memory bandwidth usage should be limited to the 37B active parameters.

          "Memory bandwidth usage should be limited to the 37B active parameters."

          Can someone do a deep dive on the quote above? I understand that having the entire model loaded into RAM helps with response times. However, I don't quite understand the relationship between memory bandwidth and active parameters.

          Context window?

          How much of the model can actively be processed, despite being fully loaded into memory, given the memory bandwidth?

          • valine3m

            With a mixture of experts model you only need to read a subset of the weights from memory to compute the output of each layer. The hidden dimensions are usually smaller as well so that reduces the size of the tensors you write to memory.

            • ein0p3m

              What people who haven't actually worked with this stuff in practice don't realize is that the above statement only holds for batch size 1, sequence size 1. For processing the prompt you will need to read all the weights (which isn't a problem, because prefill is compute-bound, which in turn is a problem on a weak machine like this Mac or the "EPYC build" someone else mentioned). Even for inference, a batch size greater than 1 (more than one inference at a time) or a sequence size greater than 1 (speculative decoding) could require you to read the entire model, repeatedly. MoE is beneficial, but there's a lot of nuance here, which people usually miss.
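
              To put toy numbers on that nuance: the expected fraction of expert weights you have to read per layer grows quickly with batch size. A sketch under the (unrealistic) assumption of independent, uniform routing, just to show the trend:

                  # Toy model: expected fraction of experts read per MoE layer vs. batch size.
                  # Assumes each token independently picks k of E experts uniformly at random;
                  # real routers are not uniform, so treat this as illustrative only.
                  E, k = 256, 8  # routed experts per layer / experts active per token (DeepSeek-V3-style)
                  for batch in (1, 2, 4, 8, 16, 64):
                      frac = 1 - (1 - k / E) ** batch  # P(a given expert is hit by at least one token)
                      print(f"batch={batch:3d}  ~{frac:6.1%} of expert weights read per layer")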

              • valine3m

                No one should be buying this for batch inference obviously.

                I remember right after OpenAI announced GPT-3, I had a conversation with someone where we tried to predict how long it would be before GPT-3 could run on a home desktop. This Mac Studio has enough VRAM to run the full 175B-parameter GPT-3 at 16-bit precision, and I think that's pretty cool.

              • doctorpangloss3m

                Sure, nuance.

                This is why Apple makes so much fucking money: people will craft the wildest narratives about how they’re going to use this thing. It’s part of the aesthetics of spending $10,000. For every person who wants a solution to the problem of running a 400b+ parameter neural network, there are 19 who actually want an exciting experience of buying something, which is what Apple really makes. It has more in common with a Birkin bag than a server.

                • jonfromsf3m

                  Birkin bags appreciate in value. This is more like a Lexus. It's a well-crafted luxury good that will depreciate relatively slowly.

                  • fennecfoxy3m

                    Have you seen prices on Lexus LFAs now? They haven't depreciated ha ha. And for those that don't know: https://www.youtube.com/watch?v=fWdXLF9unOE

                  • hot_gril3m

                    Computers don't usually depreciate slowly

                    • km3r3m

                      Relatively, as in a Mac or a Lexus will depreciate slower than other computers/cars.

                      • seec3m

                        It used to be very true, but with Apple's popularity the second-hand market is quite saturated (especially since there are many people buying them impulsively).

                        Unless you have a specific configuration, depreciation isn't much better than an equivalently priced PC. In fact, my experience is that the long tail value of the PC is better if you picked something that was high-end.

                        • hot_gril3m

                          I don't know. Can't imagine it's easy to sell a used Windows laptop directly to begin with, and those big resellers probably offer very little. Even refurbished Dell Latitudes seem to go for cheap on eBay. I've had an easy time selling old Macs, or high-end desktop market might be simple too.

                          • seec3m

                            Macs are easy to sell if they are BTO with a custom configuration; in that case you may not lose too much. But depreciation hits hard on the base models; the market is flooded because people who buy those machines tend to change them often, or were just trying one out, got confused, etc.

                            Low-end PCs (mostly laptops) don't keep their value very well, but then again you probably got them cheap on a deal or something like that, so your depreciation might actually not be as bad as on an equivalent Mac. The units you are talking about are enterprise stuff that gets swapped every 3 years or so, mostly for accounting reasons, but it's not the type of thing I would advise anyone to buy brand new (the strategy would actually be to pick up a second-hand unit).

                            High-end PCs, laptops or big desktops, keep their value pretty well because they are niche by definition and very rare. Depending on your original choice you may actually see less depreciation than an equivalently priced Mac, because there are fewer of them on sale at any given time.

                            It all depends on your personal situation, the strength of the local market, the ease of reselling through platforms that provide trust, and many other variables.

                            What I meant is that it's not the early 2000s anymore, when you could offload a relatively new Mac (2-3 years old) very easily while not being hit by big depreciation, because they were not very common.

                            In my medium-sized town, there is a local second-hand electronics shop with all kinds of Macs at all kinds of price points. High-end Razers sell for more money and are a rare sight. It's pretty much the same for iPhones: you see 3-year-old models hit very hard by depreciation, while some niche Android phones take a smaller hit.

                            Apple went through a weird strategy where they went for luxury pricing by overcharging for the things that make the experience much better (RAM/storage), while also trying to make it affordable to the masses (by largely compromising on things they shouldn't have).

                            Apple's behavior created a shady second-hand market with lots of moving parts (things being shipped in and out of China), and this is all their doing.

                            • hot_gril3m

                              Well these listed prices are asks, not bids, so they only give an upper bound on the value. I've tried to sell obscure things before where there are few or 0 other sellers, and no matter what you list it for, you might never find the buyer who wants that specific thing.

                              And the electronic shop is probably going to fetch a higher price than an individual seller would, due to trust factor as you mentioned. So have you managed to sell old Windows PCs for decent prices in some local market?

                • ein0p3m

                  Pretty much. In addition, PyTorch on the Mac is abysmally bad. As is Jax. Idk why Apple doesn't implement proper support, seems important. There's MLX which is pretty good, but you can't really port the entire ecosystem of other packages to MLX this far along in the game. Apple's best bet to credibly sell this as "AI hardware" is to make PyTorch support on the Mac excellent. Right now, as far as AI workloads are concerned, this is only suitable for Ollama.

                • DevKoala3m

                  This is true. Not sure why you are getting downvoted. I say this as someone who ordered a maxed out model. I know I will never have a need to run a model locally, I just want to know I can.

                  • ein0p3m

                    I run Mistral Large locally on two A6000's, in 4 bits. It's nice, but $10K in GPUs buys a lot of subscriptions. Plus some of the strongest LLMs are now free (Grok, DeepSeek) for web use.

                    • DevKoala3m

                      I hear you. I make these decisions for a public company.

                      When engineers tell me they want to run models in the cloud, I tell them they are free to play with it, but that isn't a project going onto the roadmap. OpenAI/Anthropic and others are much cheaper in terms of tokens per dollar thanks to economies of scale.

                      There is still value in running your own models for privacy reasons, however, and that's why I pay attention to efforts to reduce the cost of running models locally or in your own cloud provider.

                • 3m
                  [deleted]
              • Der_Einzige3m

                No one who is using this for home use cares about anything except batch size 1 sequence size 1.

                • ein0p3m

                  What if you're doing bulk inference? The efficiency and throughput of bs=1 s=1 is truly abysmal.

                  • saagarjha3m

                    People want to talk to their computer, not service requests for a thousand users.

              • rfoo3m

                For decode, MoE is nice for either bs=1 (decoding for a single user) or bs=<very large> (doing EP to efficiently serve a large number of users).

                Anything in between suffers.

            • bick_nyers3m

              Just to add onto this point, you expect different experts to be activated for every token, so not having all of the weights in fast memory can still be quite slow as you need to load/unload memory every token.

              • valine3m

                Probably better to be moving things from fast memory to faster memory than from slow disk to fast memory.

    • sudoshred3m

      Agree. Finally I can have several hundred browser tabs open simultaneously with no performance degradation.

    • nullc3m

      I'm not sure that unified memory is particularly relevant for that-- so e.g. on zen4/zen5 epyc there is more than enough arithmetic power that LLM inference is purely memory bandwidth limited.

      On dual (SP5) Epyc I believe the memory bandwidth is somewhat greater than this apple product too... and at apple's price points you can have about twice the ram too.

      Presumably the apple solution is more power efficient.

    • PeterStuer3m

      Is this on-chip memory? From the 800GB/s I would guess it's more likely a 512-bit bus (8 channels) to DDR5 modules. Doing it with quad channel would just about be possible, but would really be pushing the envelope. Still a nice thing.

      As for practicality, which mainstream applications would benefit from this much memory paired with nice but relatively mid compute? At this price point ($14K for a fully specced system), would you prefer it over e.g. a couple of NVIDIA Project DIGITS boxes (assuming those arrive on time and for around the announced $3K price point)?

    • rlt3m

      Is putting RAM on the same chip as processing economical?

      I would have assumed you'd want to save the best process/node for processing, and could use a less expensive process for RAM.

    • 3m
      [deleted]
    • RataNova3m

      It's a game changer for sure.... 512GB of unified memory really pushes the envelope, especially for running complex AI models locally. That said, the real test will be in how well the dual-chip design handles heat and power efficiency

    • resters3m

      The same thing could be designed with greater memory bandwidth, and so it's just a matter of time (for NVIDIA) until Apple decides to compete.

    • dheera3m

      It will cost 4X what it costs to get 512GB on an x86 server motherboard.

    • TheRealPomax3m

      I think the other big thing is that the base model finally starts at a normal amount of memory for a production machine. You can't get less than 96GB. Although an extra $4,000 for the 512GB model seems Tim Apple levels of ridiculous. There is absolutely no way that the difference costs anywhere near that much at the fab.

      And the storage solution still makes no sense of course, a machine like this should start at 4TB for $0 extra, 8TB for $500 more, and 16TB for $1000 more. Not start at a useless 1TB, with the 8TB version costing an extra $2400 and 16TB a truly idiotic $4600. If Sabrent can make and sell 8TB m.2 NVMe drives for $1000, SoC storage should set you back half that, not over double that.

    • tempest_3m

      Nvidia has had the Grace Hoppers for a while now. Is this not like that?

    • ProAm3m

      This is just Apple disrespecting their customer base.

    • asdffdasy3m

      still not ECC

    • samstave3m

      "unified memory"

      funny that people think this is so new, when CRAY had Global Heap eons ago...

    • amelius3m

      Why does it matter if you can run the LLM locally, if you're still running it on someone else's locked down computing platform?

    • bigyabai3m

      For enterprise markets, this is table stakes. A lot of datacenter customers will probably ignore this release altogether since there isn't a high-bandwidth option for systems interconnect.

  • InTheArena3m

    Whoa. M3 instead of M4. I wonder if this was basically binning, but I thought I had read somewhere that the interposer that enabled this for the M1 chips was not available.

    That said, 512GB of unified RAM with access to the NPU is absolutely a game changer. My guess is that Apple developed this chip for their internal AI efforts, and they are now at the point where they are releasing it publicly for others to use. They really need a 2U rack form factor for this though.

    This hardware is really being held back by the operating system at this point.

    • exabrial3m

      If Apple supported Linux (headless) natively, and we could rack m4 pros, I absolutely would use them in our Colo.

      The CPUs have zero competition in terms of speed and memory bandwidth. I'm still blown away that no other company has been able to produce Arm server chips that can compete.

    • stego-tech3m

      > This hardware is really being held back by the operating system at this point.

      It really is. Even if they themselves won't bring back their old XServe OS variant, I'd really appreciate it if they at least partnered with a Linux or BSD (good callout, ryao) dev to bring a server OS to the hardware stack. The consumer OS, while still better (to my subjective tastes) than Windows, is increasingly hampered by bloat and cruft that make it untenable for production server workloads, at least to my subjective standards.

      A server OS that just treats the underlying hardware like a hypervisor would, making the various components attachable or shareable to VMs and Containers on top, would make these things incredibly valuable in smaller datacenters or Edge use cases. Having an on-prem NPU with that much RAM would be a godsend for local AI acceleration among a shared userbase on the LAN.

    • klausa3m

      >I had read somewhere that the interposer that enabled this for the M1 chips was not available.

      With all my love and respect for "Apple rumors" writers; this was always "I read five blogposts about CPU design and now I'm an expert!" territory.

      The speculation was based on the M3 Max die shots not having the interposer visible, which... implies basically nothing about whether that _could have_ been supported in an M3 Ultra configuration, as evidenced by today's announcement.

    • kokada3m

      > This hardware is really being held back by the operating system at this point.

      Apple could either create 2U rack hardware and support Linux (and I mean Apple supporting it, not hobbyists), or ship a headless build of Darwin that could run on that hardware. But in the latter case, we probably wouldn't have much software available (though I am sure people would eventually start porting software to it; there are already MacPorts and Homebrew, and I am sure they could be adapted to eventually run on that platform).

      But Apple is also not interested in that market, so this will probably never happen.

    • GeekyBear3m

      I also wondered about binning, so I pulled together how heavily Apple's Max chips were binned in shipping configurations.

      M1 Max - 24 to 32 GPU cores

      M2 Max - 30 to 38 GPU cores

      M3 Max - 30 to 40 GPU cores

      M4 Max - 32 to 40 GPU cores

      I also looked up the announcement dates for the Max and the Ultra variant in each generation.

      M1 Max - October 18, 2021

      M1 Ultra - March 8, 2022

      M2 Max - January 17, 2023

      M2 Ultra - June 5, 2023

      M3 Max - October 30, 2023

      M3 Ultra - March 12, 2025

      M4 Max - October 30, 2024

      > My guess is that Apple developed this chip for their internal AI efforts

      As good a guess as any, given the additional delay between the M3 Max and Ultra being made available to the public.

    • AlchemistCamp3m

      Keep in mind the minimum configuration that has 512GB of unified RAM is $9,499.

    • jmyeet3m

      I've been looking at the potential for Apple to make really interesting LLM hardware. Their unified memory model could be a real game-changer because NVidia really forces market segmentation by limiting memory.

      It's worth adding that the M3 Ultra has 819GB/s memory bandwidth [1]. For comparison, the RTX 5090 is 1800GB/s [2]. That's still less, but the M4 Mac minis have 120-300GB/s, which limits token throughput, so 819GB/s is a vast improvement.

      For $9500 you can buy a M3 Ultra Mac Studio with 512GB of unified memory. I think that has massive potential.

      [1]: https://www.apple.com/mac-studio/specs/

      [2]: https://www.nvidia.com/en-us/geforce/graphics-cards/50-serie...

    • hedora3m

      Other than the NPU, it’s not really a game changer; here’s a 512GB AMD deepseek build for $2000:

      https://digitalspaceport.com/how-to-run-deepseek-r1-671b-ful...

    • hajile3m

      One of the leakers who got this Mac Studio right claims Apple is reserving the M4 ultra for the Mac Pro to differentiate the products a bit more.

    • darthrupert3m

      Yeah, if only Apple at least semi-supported Linux, their computers would have no competition.

    • hinkley3m

      Given that the M1 Ultra and M2 Ultra also exist, I'd expect either straight binning, or two designs that reuse mostly the same core designs but with more of them and a few extra features.

      I love Apple, but they love to speak in half-truths at product launches. Are they saying the M3 Ultra is their first Thunderbolt 5 computer? I don't recall seeing any previous announcements.

    • intrasight3m

      It certainly is held back and that is unfortunate. But if you can run your workloads on this amazing machine, then that's a lot of compute for the buck.

      I assume that there's a community of developers focusing on leveraging this hardware instead of complaining about the operating system.

    • Teever3m

      FTFA

      > Apple’s custom-built UltraFusion packaging technology uses an embedded silicon interposer that connects two M3 Max dies across more than 10,000 signals, providing over 2.5TB/s of low-latency interprocessor bandwidth, and making M3 Ultra appear as a single chip to software.

    • jagged-chisel3m

      > This hardware is really being held back by the operating system at this point.

      Please elucidate.

    • behnamoh3m

      > My guess is that Apple developed this chip for their internal AI efforts

      what internal AI efforts?

      Apple Intelligence is bonkers, and the Apple MLX framework remains a hobby project for Apple.

  • ksec3m

    The previous M2 Ultra model had a max memory of 192GB, or 128GB for the Pro and some other M3 models, which I think is plenty for even 99.9% of professional tasks.

    They've now bumped it to 512GB, along with an insane price tag of $9,499 for the 512GB Mac Studio. I am pretty sure this is some AI gold rush.

    • InTheArena3m

      Every single AI shop on the planet is trying to figure out whether there is enough compute to make this a reasonable AI path. If the answer is yes, that $10K is an absolute bargain.

    • HPsquared3m

      LLMs easily use a lot of RAM, and these systems are MUCH, MUCH cheaper (though slower) than a GPU setup with the equivalent RAM.

      A 4-bit quantization of Llama-3.1 405b, for example, should fit nicely.
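
      As a rough sanity check on that claim (assuming ~4.5 bits per weight effective for a typical "4-bit" quant once you include scales):

          # Back-of-envelope: does a 4-bit Llama-3.1 405B fit in 512GB of unified memory?
          params = 405e9
          bits_per_weight = 4.5                    # assumed effective size of a "4-bit" quant
          weights_gb = params * bits_per_weight / 8 / 1e9
          print(f"weights: ~{weights_gb:.0f} GB")  # ~228GB, leaving room for KV cache and the OS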

    • segmondy3m

      The question will be how it performs. I suspect DeepSeek and Llama 405B demonstrated the need for larger memory. Right now folks could build an Epyc system with that much RAM or more to run DeepSeek at about 6 tokens/sec for a fraction of that cost. However, not everyone is a tinkerer, so there's a market for this among those who don't want to be bothered. You say "AI gold rush" like it's a bad thing; it's not.

    • MR4D3m

      Remember, that RAM is also VRAM, so 1/2 terabyte of VRAM ain’t cheap. By comparison, Apple is a downright bargain!

    • bloppe3m

      Big question is: Does the $10k price already reflect Trump's tariffs on China? Or will the price rise further still..

    • dwighttk3m

      Maybe .1% of tasks need this RAM, why are they charging so much?

  • jjuliano3m

    Currently, Docker does not support Metal GPUs.

    When running LLMs on Docker with an Apple M3 or M4 chip, they will operate in CPU mode regardless of the chip's class, as Docker only supports Nvidia and Radeon GPUs.
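
    A quick way to see this in practice (assuming PyTorch is installed) is to check for the MPS backend: run natively on macOS it reports True, while inside a Linux container on the same machine it reports False and everything falls back to the CPU.

        import torch
        # On bare-metal macOS with Apple Silicon this prints True (Metal via MPS);
        # inside a Docker/Linux container on the same Mac it prints False.
        print("MPS available:", torch.backends.mps.is_available())
        device = "mps" if torch.backends.mps.is_available() else "cpu"
        x = torch.randn(1024, 1024, device=device)
        print((x @ x).device)  # 'mps:0' natively, 'cpu' inside the container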

    If you're developing LLMs on Docker, consider getting a Framework laptop with an Nvidia or Radeon GPU instead.

    Source: I develop an AI agent framework that runs LLMs inside Docker on an M3 Max (https://kdeps.com).

  • lauritz3m

    They're updating the Studio to the M3 Ultra now, so the M4 Ultra can presumably go directly into the Mac Pro at WWDC? Interesting timing. Maybe they'll change the form factor of the Mac Pro, too?

    Additionally, I would assume this is a very low-volume product, so it being on N3B isn't a dealbreaker. At the same time, these chips must be very expensive to make, so pairing them with luxury-priced RAM makes some kind of sense.

    • lauritz3m

      Interestingly, Apple apparently confirmed to a French website that M4 lacks the interconnect required to make an "Ultra" [0][1], so contrary to what I originally thought, they maybe won't make this after all? I'll take this report with a grain of salt, but apparently it's coming directly from Apple.

      Makes it even more puzzling what they are doing with the M2 Mac Pro.

      [0] https://www.numerama.com/tech/1919213-m4-max-et-m3-ultra-let...

      [1] More context on Macrumors: https://www.macrumors.com/2025/03/05/apple-confirms-m4-max-l...

    • layer83m

      Apple says that not every generation will get an “Ultra” variant: https://arstechnica.com/apple/2025/03/apple-announces-m3-ult...

    • agloe_dreams3m

      My understanding was that Apple wanted to figure out how to build systems with multi-SOCs to replace the Ultra chips. The way it is currently done means that the Max chips need to be designed around the interconnect. Theoretically speaking, a multi-SOC setup could also scale beyond two chips and into a wider set of products.

    • raydev3m

      Honestly I don't think we'll see the M4 Ultra at all this year. That they introduced the Studio with an M3 Ultra tells me M4 Ultras are too costly or they don't have capacity to build them.

      And anyway, I think the M2 Mac Pro was Apple asking customers "hey, can you do anything interesting with these PCIe slots? because we can't think of anything outside of connectivity expansion really"

      RIP Mac Pro unless they redesign Apple Silicon to allow for upgradeable GPUs.

    • jsheard3m

      > Maybe they'll change the form factor of the Mac Pro, too?

      Either that or kill the Mac Pro altogether, the current iteration is such a half-assed design and blatantly terrible value compared to the Studio that it feels like an end-of-the-road product just meant to tide PCIe users over until they can migrate everything to Thunderbolt.

      They recycled a design meant to accommodate multiple beefy GPUs even though GPUs are no longer supported, so most of the cooling and power delivery is vestigial. Plus the PCIe expansion was quietly downgraded, Apple Silicon doesn't have a ton of PCIe lanes so the slots are heavily oversubscribed with PCIe switches.

    • 3m
      [deleted]
  • TheTxT3m

    512GB unified memory is absolutely wild for AI stuff! Compared to how many NVIDIA GPUs you would need, the pricing looks almost reasonable.

    • InTheArena3m

      A server with 512GB of high-bandwidth, GPU-addressable RAM is probably a six-figure expenditure. If memory is your constraint, this is absolutely the server for you.

      (Sorry, I should have specified that the NPU and GPU cores need to access that RAM with reasonable performance. I specified it above, but people didn't read that :-))

    • jeroenhd3m

      If you're going to overthrow your entire AI workflow to use a different API anyway, surely the AMD Instinct accelerator cards make more sense. They're expensive, but also a lot faster, and you don't need to deal with making your code work on macOS.

    • chakintosh3m

      14k for a maxed out Mac Studio

  • mrtksn3m

    Let's say you want the absolute maximum memory (512GB) to run AI models, and let's say that you are OK with plugging in a drive to archive your model weights; then you can get this for a little shy of $10K. What a dream machine.

    Compared to Nvidia's Project DIGITS, which is supposed to cost $3K and be available "soon", you can get a spec-matching 128GB & 4TB version of this Mac for about $4,700, and the difference is that you can actually get it in a week and it will run macOS (no idea how much performance difference to expect).

    I can't wait to see someone testing the full DeepSeek model on this. Maybe this would be the first little companion AI device that you can fully own and do whatever you like with, hassle-free.

    • bloomingkales3m

      There's an argument that replaceable PC parts are what you want at that price point, but Apple usually provides multi-year durability on their PCs. An Apple AI brick should last a while.

    • NightlyDev3m

      The full DeepSeek R1 model needs more than 512GB of memory; the model is 720GB alone. You can run a quantized version on it, but not the full model.

    • behnamoh3m

      > I can't wait to see someone testing the full DeepSeek model on this

      at 819 GB per second bandwidth, the experience would be terrible

      • coder5433m

        DeepSeek-R1 only has 37B active parameters.

        A back of the napkin calculation: 819GB/s / 37GB/tok = 22 tokens/sec.

        Realistically, you’ll have to run quantized to fit inside of the 512GB limit, so it could be more like 22GB of data transfer per token, which would yield 37 tokens per second as the theoretical limit.

        It is likely going to be very usable. As other people have pointed out, the Mac Studio is also not the only option at this price point… but it is neat that it is an option.

      • mrtksn3m

        How many t/s would you expect? I think I feel perfectly fine when its over 50.

        Also, people have figured out a way to run these things in parallel easily. The device is pretty small; I think for someone who doesn't mind the price tag, stacking 2-3 of those wouldn't be that bad.

        • yk3m

          I think I've seen 800 GB/s memory bandwidth, so a q4 quant of a 400 B model should be 4 t/s if memory bound.

        • behnamoh3m

          I know you're referring to the exolabs app, but the t/s is really not that good. It uses Thunderbolt instead of NVLink.

      • bearjaws3m

        Not sure why you are being downvoted; we already know the performance numbers due to memory bandwidth constraints on the M4 Max chips, and they would apply here as well.

        Going from 525GB/s to 1000GB/s will double the TPS at best, which is still quite low for large LLMs.

        • lanceflt3m

          Deepseek R1 (full, Q1) is 14t/s on an M2 Ultra, so this should be around 20t/s

  • teleforce3m

    Thunderbolt 5 (TB 5) is pretty handy, you can have a very thin and lightweight laptop, then can get access to external GPU or eGPU via TB 5 if needed [1]. Now you can have your cake (lightweight laptop) and eat it too (potent GPU).

    [1] Asus just announced the world’s first Thunderbolt 5 eGPU:

    https://www.theverge.com/24336135/asus-thunderbolt-5-externa...

    • ben-schaaf3m

      Except that you're stuck with macOS, so there aren't any drivers for NVIDIA, AMD or Intel GPUs.

      • iamtheworstdev3m

        and that no one is developing games for MacOS.

        • rafram3m

          (1) That’s obviously not actually true.

          (2) “No one” is developing games for Linux either, but the Steam Deck works great. Why? Wine, which you can run on macOS too.

          • iamtheworstdev3m

            Ah, yes, you got me. You identified that other market that also no one cares about.

        • smilebot3m

          Valve supports games on MacOS

    • wpm3m

      Apple Silicon does not work with eGPU.

    • emp_3m

      eGPU has a ton of issues on macOS - I've used it for years, and now on Apple Silicon it's probably much worse - but let me give a shout out to the amazing (somewhat new) High Performance screen sharing mode added in Sonoma.

      When I connect to my Mac Studio via Macbook I can select that mode, then change the Displays setting to Dynamic Resolution and then my 'thin client':

      - Is fullscreen using the entire 16:10 Macbook screen

      - Gets 60 fps low latency performance (including on actual games)

      - Transfers audio, I can attend meetings in this mode

      - Blanks the host Mac Studio screen

      All things that were impossible via VNC - RDP is much better but this new High Performance Screen Share is even more powerful.

      The thin lightweight laptop that remotes into a loaded machine has always been my idea of high mobility instead of suffering a laptop running everything locally. This works via LTE as well with some firewall setup.

  • bustling-noose3m

    I wonder if Apple needs to reconsider Xserve. While Apple probably has some kind of server infrastructure team, making their own server infrastructure out of their own hardware and software sounds like something they could explore. The app ecosystem, coupled with Apple servers offered in the cloud or ones you could buy, would be a very interesting service business to get into. Apple's App Store needs better apps given how much the hardware is capable of now, especially with iPads using M chips. A cloud-backed hardware and software service specially designed for the app ecosystem sounds very tempting.

    The hardware has evolved faster than the software at Apple. It's usually the opposite with most tech companies, where hardware is unable to keep up with software.

  • c0deR3D3m

    When will Apple silicon natively support OSes such as Linux? Apple seems reluctant to release a detailed technical reference manual for the M-series SoCs, which makes running Linux natively on Apple silicon challenging.

    • bigyabai3m

      Probably never. We don't have official Linux support for the iPhone or iPad; I wouldn't hold out hope for Apple to change their tune.

      • dylan6043m

        That makes sense to me though. If you don’t run iOS, you don’t have App Store and that means a loss of revenue.

        • bigyabai3m

          Right. Same goes for macOS and all of its convenient software services. Apple might stand to sell more units with a friendlier stance towards Linux, but unless it sells more Apple One subscriptions or increases hardware margins on the Mac, I doubt Cook would consider it.

          If you sit around expecting selflessness from Apple you will waste an enormous amount of time, trust me.

        • AndroTux3m

          If you don't run macOS, you don't have Apple iCloud Drive, Music, Fitness, Arcade, TV+ and News and that means a loss of revenue.

          • dylan6043m

            As I replied in else where here, I do not run any Apple Services on my Mac hardware. I do on my iDevices though, but that's a different topic. Again, I could be the edge case

            • bigyabai3m

              > I do not run any Apple Services on my Mac hardware

              Not even OCSP?

              • dylan6043m

                I have no idea what that is, so ???

                But if you're being pedantic, I meant Apple SaaS requiring monthly payments or any other form of using something from Apple where I give them money outside the purchase of their hardware.

                If you're talking background services as part of macOS, then you're being intentionally obtuse to the point and you know it

        • jobs_throwaway3m

          You lose out on revenue from people who require OS freedom though

          • orangecat3m

            All seven of them. I kid, I have a lot of sympathy for that position, but as a practical matter running Linux VMs on an M4 works great, you even get GPU acceleration.

    • dylan6043m

      That's what's weird to me too. It's not like they would lose sales of macOS, as it is given away with the hardware. So if someone wants to buy Apple hardware to run Linux, it does not have a negative effect on AAPL.

      • bigfishrunning3m

        Except the linux users won't be buying Apple software, from the app store or elsewhere. They won't subscribe to iCloud.

        • dylan6043m

          I have Mac hardware and have spent $0 through the Mac App Store. I do not use iCloud on it either. I do on iDevices though. I must be an edge case.

          • xp843m

            All of us on HN are basically edge cases. The main target market of Macs is super dependent on Apple service subscriptions.

            Maybe that's why they ship with insultingly-small SSDs by default, so that as people's photo libraries, Desktop and Documents folders fill up, Apple can "fix your problem" for you by selling you the iCloud/Apple One plan to offload most of the stuff to only live in iCloud.

            Either they spend the $400 up front to go two notches up on the SSD upgrade, to match what a reasonable device would come with, or they spend that $400 at $10 a month over the likely 40-month lifetime of the computer. Apple wins either way.

            • seec3m

              Of course this is the reason. And this is why Apple has become so bad for tech enthusiasts: no matter how good the OS/software can be, you have to pay a tax that is way too big, given that you already have the competence that should allow you to bypass it.

              It's like learning to grow vegetables in your garden, but then having to pay much more for the seeds because you actually know how to produce value with them.

              The philosophy at Apple has changed from premium tools for professionals to luxury devices for normies that make them pay for their incompetence.

          • c0deR3D3m

            Same here.

        • tgv3m

          You also lose out on developers. The more macOS users, the more attractive it is to develop for. Supporting Linux would be a loss for the macOS ecosystem, and we all know what that leads to.

        • cosmic_cheese3m

          Those buying the hardware to run Linux also aren’t writing software for macOS to help make the platform more attractive.

          • dylan6043m

            There are a large number of macOS users who are not app software devs. There's a large base of creative users that couldn't code their way out of a wet paper bag, yet spend lots of money on Mac hardware.

            This forum loses track of the world outside this echo chamber.

            • cosmic_cheese3m

              I’m among them, even if creative works aren’t my bread and butter (I’m a dev with a bit of an artistic bent).

              That said, attracting creative users also adds value to the platform by creating demand for creative software for macOS, which keeps existing packages for macOS maintained and brings new ones on board every so often.

              • dylan6043m

                I'm a mix of both, however, my dev time does not create macOS or iDevice apps. My dev is still focused on creative/media workflows, while I still get work for photo/video. I don't even use Xcode any further than running the CLI command to install the necessary tools to have CLI be useful.

        • jeroenhd3m

          While I don't think Apple wants to change course from its services-oriented profit model, surely someone within Apple has run the calculations for a server-oriented M3/M4 device. They're not far behind server CPUs in terms of performance while running a lot cooler AND having accelerated amd64 support, which Ampere lacks.

          Whatever the profit margin on a Mac Studio is these days, surely improving non-consumer options becomes profitable at some point if you start selling them by the thousands to data centers.

      • amelius3m

        But then they'd have to open up their internal documentation of their silicon, which could possibly be a legal disaster (patents).

      • re-thc3m

        > So if someone wants to buy Apple hardware to run Linux, it does not have a negative affect to AAPL

        It does. Support costs. How do you prove it's a hardware failure or software? What should they do? Say it "unofficially" supports Linux? People would still try to get support. Eventually they'd have to test it themselves etc.

        • dylan6043m

          Apple has already been in this spot. With the trashcan Mac Pro, there was an issue with DaVinci Resolve under OS X at the time where the GPU was causing render issues. If you then rebooted into Windows with Boot Camp using the exact same hardware and opened the exact same Resolve project with the exact same footage, the render errors disappeared. Apple blamed Resolve. DaVinci blamed the GPU drivers. The GPU vendor blamed Apple.

          • re-thc3m

            > Apple has already been in this spot.

            Has been. That's important: past tense. Maybe that's the point; they gave up on it, acknowledging the extra costs and issues.

        • k8sToGo3m

          We used to have bootcamp though.

          • dylan6043m

            There you go using logical arguments in an emotional illogical debate.

    • WillAdams3m

      Is it not an option to run Darwin? What would Linux offer that that would not?

      • internetter3m

        Darwin is a terrible server operating system. Even getting a process to run at server boot reliably is a nightmare.

      • kbolino3m

        I don't think Darwin has been directly distributed in bootable binary format for many years now. And, as far as I know, it has never been made available in that format for Apple silicon.

  • _alex_3m

    apple keeps talking about the Neural Engine. Does anything actually use it? Seems like all the current LLM and Stable Diffusion packages (including MLX) use the GPU.

    • gield3m

      Face ID, taking pictures, Siri, ARKit, voice-to-text transcription, face recognition and OCR in photos, noise filtering, ...

      • cubefox3m

        These have been possible in much smaller smartphone chips for years.

        • stouset3m

          Possible != energy efficient, which is important for mobile devices.

          • cubefox3m

            If the energy efficiency of things like Face ID was really so bad that you need a more efficient M3 Ultra, how come Face ID was integrated into smartphones years ago, apparently without a significant negative impact on battery life?

            • xu_ituairo3m

              FaceID was just one example they gave (which is probably faster and more energy efficient now).

              Image recognition, OCR, AR and more are applications of the NPU that didn't exist at all on older iPhones because they would have been too intensive for the chips and batteries.

              • cubefox3m

                That's false. Face ID is in fact a complex form of image recognition, so image recognition was definitely possible on older NPUs. OCR is the simplest form of image recognition (OCR was literally the first application of LeCun's CNN), so this was definitely possible as well. "AR" is an extremely vague term. If you refer to Snapchat style video overlays, those have been possible for a long time as well.

            • KerrAvon3m

              You seem to be arguing with a strawman here -- who said you need an M3 Ultra for energy efficient Face ID?

              • cubefox3m

                "stouset" implied that those are merely possible but not energy efficient on the older mobile hardware.

                • badc0ffee3m

                  The original question was asking what features have taken advantage of a NPU. Face ID was introduced with Apple's first "Neural Engine" CPU, the A11 Bionic.

                  You're confusing this with what features/enhancements new generations of NPUs bring, which nobody else was talking about. Everyone else in the conversation is comparing pre- and post-NPU.

                  • cubefox3m

                    The original question was clearly about the NPU of the currently discussed M3 Ultra, which is twice as large as the previous one. The question is what this one is good for, not what much, much smaller NPUs are good for which have nothing to do with the M3 Ultra topic.

        • KerrAvon3m

          Yes, they have.

          > September 12, 2017; 7 years ago

          https://en.wikipedia.org/wiki/Apple_A11#Neural_Engine

        • gield3m

          Indeed, but the neural engine does this faster and using heavier models. For example, on-device Siri was not possible until the introduction of the neural engine in 2017.

    • dcchambers3m

      Historically no, Ollama and the like have only used the CPU+GPU.

      That said, there are efforts being made to use the NPU. See: https://github.com/Anemll/Anemll - you can now run small models directly on your Apple Silicon Mac's NPU.

      It doesn't give better performance but it's massively more power efficient than using the GPU.

    • anentropic3m

      Yeah I agree.

      The Neural Engine is useful for a bunch of Apple features, but seems weirdly useless for any LLM stuff... I've been wondering if they'd address it in any of these upcoming products. AI is so hyped right now that it seems odd they have a specialised processor that doesn't get used for the kind of AI people are doing. I can see in the latest release:

      > Mac Studio is a powerhouse for AI, capable of running large language models (LLMs) with over 600 billion parameters entirely in memory, thanks to its advanced GPU

      https://www.apple.com/newsroom/2025/03/apple-unveils-new-mac...

      i.e. LLMs still run on the GPU not the NPU

  • rjeli3m

    Wow, incredible. I told myself I’d stop waffling and just buy the next 800gb/s mini or studio to come out, so I guess I’m getting this.

    Not sure how much storage to get. I was floating the idea of getting less storage, and hooking it up to a TB5 NAS array of 2.5” SSDs, 10-20tb for models + datasets + my media library would be nice. Any recommendations for the best enclosure for that?

    • kridsdale13m

      It depends on your bandwidth needs.

      I also want to build the thing you want. There are no multi-SSD M.2 TB5 bays. I made one that holds 4 drives (16TB) at TB3, and even there the underlying drives are far faster than the cable.

      My stuff is in OWC Express 4M2.

  • cynicalpeace3m

    Can someone explain what it would take for Apple to overtake NVIDIA as the preferred solution for AI shops?

    This is my understanding (probably incorrect in some places)

    1. NVIDIA's big advantage is that they design the hardware (chips) and software (CUDA). But Apple also designs the hardware (chips) and software (Metal and MacOS).

    2. CUDA has native support by AI libraries like PyTorch and Tensorflow, so works extra well during training and inference. It seems Metal is well supported by PyTorch, but not well supported by Tensorflow.

    3. NVIDIA uses Linux rather than MacOS, making it easier in general to rack servers.

    • bigyabai3m

      It's still boiling down to hardware and software differences.

      In terms of hardware - Apple designs their GPUs for graphics workloads, whereas Nvidia has a decades-old lead on optimizing for general-purpose compute. They've gotten really good at pipelining and keeping their raster performance competitive while also accelerating AI and ML. Meanwhile, Apple is directing most of their performance to just the raster stuff. They could pivot to an Nvidia-style design, but that would be pretty unprecedented (even if a seemingly correct decision).

      And then there's CUDA. It's not really appropriate to compare it to Metal, both in feature scope and ease of use. CUDA has expansive support for AI/ML primitives and deeply integrated tensor/SM compute. Metal does boast some compute features, but you're expected to write most of the support yourself in the form of compute shaders. This is a pretty radical departure from the pre-rolled, almost "cargo cult" CUDA mentality.

      The Linux shtick matters a tiny bit, but it's mostly a matter of convenience. If Apple hardware started getting competitive, there would be people considering the hardware regardless of the OS it runs.

      • cynicalpeace3m

        > keeping their raster performance competitive while also accelerating AI and ML. Meanwhile, Apple is directing most of their performance to just the raster stuff. They could pivot to an Nvidia-style design, but that would be pretty unprecedented (even if a seemingly correct decision).

        Isn't Apple also focusing on the AI stuff? How has it not already made that decision? What would prevent Apple from making that decision?

        > Metal does boast some compute features, but you're expected to write most of the support yourself in the form of compute shaders. This is a pretty radical departure from the pre-rolled, almost "cargo cult" CUDA mentality.

        Can you give an example of where Metal wants you to write something yourself whereas CUDA is pre-rolled?

        • bigyabai3m

          > Isn't Apple also focusing on the AI stuff?

          Yes, but not with their GPU architecture. Apple's big bet was on low-power NPU hardware, assuming the compute cost of inference would go down as the field progressed. This was the wrong bet - LLMs and other AIs have scaled up better than they scaled down.

          > How has it not already made that decision? What would prevent Apple from making that decision?

          I mean, for one, Apple is famously stubborn. They're the last ones to admit they're wrong whenever they make a mistake; presumably, admitting that the NPU is wasted silicon would be a mea culpa for their AI stance. It's also easier to wait for a new generation of Apple Silicon to overhaul the architecture, rather than driving a generational split as soon as the problem is identified.

          As for what's preventing them, I don't think there's anything insurmountable. But logically it might not make sense to adopt Nvidia's strategy even if it's better. Apple can't necessarily block Nvidia from buying the same nodes they get from TSMC, so they'd have to out-design Nvidia if they wanted to compete on their merits. Even then, since Apple doesn't support OpenCL it's not guaranteed that they would replace CUDA. It would just be another proprietary runtime for vendors to choose from.

          > Can you give an example of where Metal wants you to write something yourself whereas CUDA is pre-rolled?

          Not exhaustively, no. Some of them are performance-optimized kernels like cuSPARSE, some others are primitive sets like cuDNN, others yet are graph and signal processing libraries with built-out support for industrial applications.

          To Apple's credit, they've definitely started hardware-accelerating the important stuff like FFT and ray tracing. But Nvidia still has a decade of lead time that Apple spent shopping around with AMD for other solutions. The head-start CUDA has is so great that I don't think Apple can seriously respond unless the executives light a fire under their ass to make some changes. It will be an "immovable rock versus an unstoppable force" decision for Apple's board of directors.

          • aldonius3m

            I think betting on low-power NPU hardware wasn't necessarily wrong - if you're Apple you're trying to optimise performance/watt across the system as a whole. So in a context where you're shipping first-party bespoke on-device ML features it can make sense to have a modestly sized dedicated accelerator.

            I'd say the biggest problem with the NPU is that you can only use it from Core ML. Even MLX can't access it!

            As you say the big world-changing LLMs are scaling up, not down. At the same time (at least so far) LLM usage is intermittent - we want to consume thousands of tokens in seconds, but a couple of times a minute. That's a client-server timesharing model for as long as the compute and memory demand can't fit on a laptop.

  • submeta3m

    I am confused. I got an M4 with 64 GB of RAM. Did I buy something from the future? :) Why M3 now, and not an M4 Ultra?

    • seanmcdirmid3m

      It took them a while to develop their Ultra chip and this is what they had ready? I’m sure they are working on the M4 Ultra, but they are just slow at it.

      I bought a refurbished M3 Max to run LLMs (it can only go up to 70b with 4-bit quant), and it is only slightly slower than the more expensive M4 Max.

    • opan3m

      Haven't the Max/Ultra type chips always come much later, close to when the next number of standard chips came out? M2 Max was not available when M2 launched, for example.

      • SirMaster3m

        An Ultra has never before come out after the next-gen base model, let alone after the next-gen Pro/Max models.

        M1: November 10, 2020

        M1 Pro: October 18, 2021

        M1 Max: October 18, 2021

        M1 Ultra: March 8, 2022

        -------------------------

        M2: June 6, 2022

        M2 Pro: January 17, 2023

        M2 Max: January 17, 2023

        M2 Ultra: June 5, 2023

        -------------------------

        M3: October 30, 2023

        M3 Pro: October 30, 2023

        M3 Max: October 30, 2023

        -------------------------

        M4: May 7, 2024

        M4 Pro: October 30, 2024

        M4 Max: October 30, 2024

        -------------------------

        M3 Ultra: March 5, 2025

        • ellisv3m

          I'd also point out that there was a rather awkward situation with M1/M2 chips where lower end devices were getting newer chips before the higher end devices. For example, the 14 and 16-inch MacBooks Pro didn't get an M2 series chip until about 6 months after the 13 and 15-inch MacBooks Air. This left some professionals and power users frustrated.

          The M3 Ultra might perform as well as the M4 Max - I haven't seen benchmarks yet - but the newer series is in the higher end devices which is what most people expect.

        • kridsdale13m

          So about a year and a half delay for Ultra, but the M2 was an anomaly.

    • 3m
      [deleted]
  • narrator3m

    Not to rain on the Apple parade, but cloud video editing with the models running on H100s that can edit videos based on prompts is going to be vastly more productive than anything running locally. This will be useful for local development with the big Deepseek models though. Not sure if it's worth the investment unless Deepseek is close to the capability of cloud models, or privacy concerns overwhelm everything.

  • raydev3m

    I know it's basically nitpicking competing luxury sports cars at this point, but I am very bothered that existing benchmarks for the M3 show single core perf that is approximately 70% of M4 single core perf.

    I feel like I should be able to spend all my money to both get the fastest single core performance AND all the cores and available memory, but Apple has decided that we need to downgrade to "go wide". Annoying.

    • xp843m

      > both get the fastest single core performance AND all the cores

      I'm a major Apple skeptic myself, but hasn't there always been a tradeoff between "fastest single core" vs "lots of cores" (and thus best multicore)?

      For instance, I remember when you could buy an iMac with an i9 or whatever, with a higher clock speed and faster single core, or you could buy an iMac Pro with a Xeon with more cores, but the iMac (non-Pro) would beat it in a single core benchmark. Note: Though I used Macs as the example due to the simple product lines, I thought this was pretty much universal among all modern computers.

      • raydev3m

        > hasn't there always been a tradeoff between "fastest single core" vs "lots of cores" (and thus best multicore)?

        Not in the Apple Silicon line. The M2 Ultra has the same single core performance as the M2 Max and Pro. No benchmarks for the M3 Ultra yet but I'm guessing the same vs M3 Max and Pro.

        • xp843m

          Okay, good to know. Interesting change then.

          • LPisGood3m

            I think the traditional reason for this is that other chips like to use complex scheduling logic to have more logical cores than physical cores. This costs single threaded speed but allows you to run more threads faster.

  • divan3m

    Model with 512GB VRAM costs $9500, if anyone wonders.

  • ferguess_k3m

    Ah, if only we could have the hardware and the freedom of installing a good Linux distro on top of it. How is Asahi? Is it good enough? I assume that since Asahi is focused on Apple hardware, it should have an easier time figuring out drivers etc.?

    • bigyabai3m

      > How is Asahi?

      For M3 and M4 machines, hardware support is pretty derelict: https://asahilinux.org/docs/M3-Series-Feature-Support/

      • ferguess_k3m

        Thanks, looks like even M1 support has some gaps:

        https://asahilinux.org/docs/M1-Series-Feature-Support/#table...

        I assume anything that doesn't have "linux-asahi" is not supported -- or any WIP is not supported.

        I wish I had the skills to help them. Targeting just one architecture, I think Asahi has a better chance of success.

        • bigyabai3m

          It's just not an easy task. I can't help but compare it to the Nouveau project spending years of effort to reverse-engineer just a few GPU designs. Then Nvidia changed their software and hardware architecture, and things went from "relatively hopeful" to "there is no chance" overnight.

          • ferguess_k3m

            I agree, it's a lot of work, plus Apple is definitely not going to help with the project. Maybe an alternative is something like Framework -- find some good enough hardware and support it.

  • crest3m

    Too bad it lacks even the streaming mode SVE2 found in M4 cores. If only Apple would provide a full SVE2 implementation to put pressure on ARM to make it non-optional so AArch64 isn't effectively restricted to NEON for SIMD.

    • vlovich1233m

      This is for AI which is going to benefit more from use of metal / NPU than SIMD.

      • bigyabai3m

        Sure, but larger models that fit in that 512gb memory are going to take a long time to tokenize/detokenize without hardware-accelerated BLAS.

        • danieldk3m

          Why would you need BLAS for tokenization/detokenization? Pretty much everyone still uses BBPE which amounts to iteratively applying merges.

          (Maybe I'm missing something here.)
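
          For anyone unfamiliar, a toy sketch (in Python, purely illustrative -- not any real tokenizer's implementation) of what "iteratively applying merges" means; note there is no linear algebra anywhere in it:

              # Toy byte-pair-encoding merge loop. Real BBPE tokenizers work on bytes
              # and use large learned merge tables, but the control flow is the same.
              def bpe_encode(text, merges):
                  tokens = list(text)  # start from single characters (bytes in real BBPE)
                  while True:
                      pairs = {(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)}
                      ranked = [(merges[p], p) for p in pairs if p in merges]
                      if not ranked:
                          break  # no learned merge applies anymore
                      _, (a, b) = min(ranked)  # apply the highest-priority (lowest-rank) merge
                      out, i = [], 0
                      while i < len(tokens):
                          if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                              out.append(a + b)
                              i += 2
                          else:
                              out.append(tokens[i])
                              i += 1
                      tokens = out
                  return tokens

              merges = {("l", "o"): 0, ("lo", "w"): 1}  # pair -> rank (lower = applied first)
              print(bpe_encode("lower", merges))  # ['low', 'e', 'r']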

        • ryao3m

          Tokenization/detokenization does not use BLAS.

    • stouset3m

      Hell I’m just sitting here hoping the future M5 adopts SVE. Not even SVE2.

  • enether3m

    Can anyone with older Mac Studios/Minis comment - do you also notice a "throttling" of the hardware?

    I'm not sure if this is me not maintaining it properly (e.g fans having dust block them) - but I've always got this sense that Apple throttles their older devices in some indirect ways. I experience it the most with iPhones - my old iPhone is pretty slow doing basic things despite nothing really changing on it (just the OS updating?)

    So my only concern with this is - how many years until it's slow enough to annoy you into buying a new one?

    • BobAliceInATree3m

      > (just the OS updating?)

      "Just the OS updating" is not insignificant. Software developers, in general, are not known for making sure latest versions of their software run smoothly on older hardware.

      Also, performance on iPhones is throttled when your battery is very old. There was a whole class-action lawsuit about it.

      https://support.apple.com/en-us/106348

      • scarface_743m

        You realize the alternative is the iPhone shutting down completely?

        • bigyabai3m

          No, I don't. There are multiple alternatives including (though not limited to) opt-in battery protection, removing battery DRM to enable hardware repairs, or declaring a first-party recall on the faulty units to replace components that are damaging to the hardware.

          Conspicuously, Apple just so happened to pick the one that encouraged people to upgrade the entire phone. You know, an entire phone that is otherwise functional without arbitrary restrictions by the OEM.

  • iambateman3m

    People who know more than me: they’re talking a lot about RAM and not much about GPU.

    Do you expect this will be able to handle AI workloads well?

    All I’ve heard for the past two years is how important a beefy GPU is. Curious if that holds true here too.

    • gatienboquet3m

      LLMs are primarily "memory-bound" rather than "compute-bound" during normal use.

      The model weights (billions of parameters) must be loaded into memory before you can use them.

      Think of it like this: Even with a very fast chef (powerful CPU/GPU), if your kitchen counter (VRAM) is too small to lay out all the ingredients, cooking becomes inefficient or impossible.

      Processing power still matters for speed once everything fits in memory, but it's secondary to having enough VRAM in the first place.
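
      As a rough back-of-envelope sketch (using the 671B-parameter model discussed elsewhere in the thread; real quantized files carry per-block overhead, and the KV cache needs room on top of this):

          # Approximate weight footprint at different quantization levels, and
          # whether it fits in the memory pools mentioned in this thread.
          PARAMS = 671e9

          def weights_gb(bits_per_param):
              return PARAMS * bits_per_param / 8 / 1e9

          for name, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
              print(f"{name}: ~{weights_gb(bits):,.0f} GB")  # ~1,342 / ~671 / ~336 GB

          for device, mem_gb in [("RTX 4090 (24 GB)", 24), ("M3 Ultra (512 GB)", 512)]:
              ok = "fits" if weights_gb(4) <= mem_gb else "does not fit"
              print(f"Q4 weights: {ok} on {device}")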

      • whimsicalism3m

        Transformers are typically memory-bandwidth bound during decoding. This chip is going to have a much worse memory b/w than the nvidia chips.

        My guess is that these chips could be compute-bound though given how little compute capacity they have.

        • Gracana3m

          It's pretty close. A 3090 or 4090 has about 1TB/s of memory bandwidth, while the top Apple chips have a bit over 800GB/s. Where you'll see a big difference is in prompt processing. Without the compute power of a pile of GPUs, chewing through long prompts, code, documents etc is going to be slower.

          • whimsicalism3m

            nobody in industry is using a 4090, they are using H100s which have 3TB/s. Apple also doesn’t have any equivalent to nvlink.

            I agree that compute is likely to become the bottleneck for these new Apple chips, given they only have like ~0.1% the number of flops

            • Gracana3m

              I chose the 3090/4090 because it seems to me that this machine could be a replacement for a workstation or a homelab rig at a similar price point, but not a $100-250k server in a datacenter. It's not really surprising or interesting that the datacenter GPUs are superior.

              FWIW I went the route of "bunch of GPUs in a desktop case" because I felt having the compute oomph was worth it.

            • _zoltan_3m

              4.8TB/s on H200, 8TB/s on B200, pretty insane.

              • Gracana3m

                That’s wild, somehow I hadn’t seen the B200 specs before now. I wish I could have even a fraction of that!

        • gatienboquet3m

          VRAM capacity is the initial gatekeeper, then bandwidth becomes the limiting factor.

          • whimsicalism3m

            i suspect that compute actually might be the limiter for these chips before b/w, but not certain

        • cubefox3m

          > Transformers are typically memory-bandwidth bound during decoding.

          Not in the case of language models, which are typically bound by memory size rather than bandwidth.

          • whimsicalism3m

            nope

            • cubefox3m

              I assume even this one won't run on an RTX 5090 due to constrained memory size: https://news.ycombinator.com/item?id=43270843

              • whimsicalism3m

                sure on consumer GPUs but that is not what is constraining the model inference in most actual industry setups. technically even then, you are CPU-GPU memory bandwidth bound more than just GPU memory, although that is maybe splitting hairs

                • cubefox3m

                  Why are industry setups considered actual while others are not?

    • qwertox3m

      A beefy GPU which can't hold models in VRAM is of very limited use. You'll see 16 GB of VRAM on gamer Nvidia cards, the RTX 5090 being an exception with 32 GB VRAM. The professional cards have around 96 GB of VRAM.

      The thing with these Apple chips is that they have unified memory, where CPU and GPU use the same memory chips, which means that you can load huge models into RAM (no longer VRAM, because that doesn't exist on those devices). And while Apple's integrated GPU isn't as powerful as an Nvidia GPU, it is powerful enough for non-professional workloads and has the huge benefit of access to lots of memory.

    • simlevesque3m

      What's more important isn't how beefy it is, it's how much memory it has.

      These are unified memory. The M3 Ultra with 512GB has as much VRAM as sixteen 5090s.

    • lynndotpy3m

      VRAM is what takes a model from "can not run at all" to "can run" (even if slowly), hence the emphasis.

      • vlovich1233m

        No, with limited VRAM you could offload the model partially or split it across CPU and GPU. And since the CPU has swap, you could run the absolute largest model. It's just really, really slow.

        • jeffhuys3m

          Really, really, really, really, really, REALLY REALLY slow.

        • Espressosaurus3m

          The difference between Deepseek-r1:70b (edit: actually 32b) running on an M4 Pro (48 GB unified RAM, 14 CPU cores, 20 GPU cores) and on an AMD box (64 GB DDR4, 16 core 5950X, RTX 3080 with 10 GB of RAM) is more than a factor of 2.

          The M4 pro was able to answer the test prompt twice--once on battery and once on mains power--before the AMD box was able to finish processing.

          The M4's prompt parsing took significantly longer, but token generation was significantly faster.

          Having the memory to the cores that matter makes a big difference.

          • vlovich1233m

            You're adding detail that's not relevant to anything I said. I was saying this statement:

            > VRAM is what takes a model from "can not run at all" to "can run" (even if slowly), hence the emphasis.

            Is false. Regardless of how much VRAM you have, if the criteria is "can run even if slowly", all machines can run all models because you have swap. It's unusably slow but that's not what OP was claiming the difference is.

            • Espressosaurus3m

              The criteria for purchase for anybody trying to use it is "run slowly but acceptably" vs. "run so slow as to be unusable".

              My memory is wrong, it was the 32b. I'm running the 70b against a similar prompt and the 5950X is probably going to take over an hour for what the M4 managed in about 7 minutes.

              edit: an hour later and the 5950 isn't even done thinking yet. Token generation is generously around 1 token/s.

              edit edit: final statistics. M4 Pro managing 4 tokens/s prompt eval, 4.8 tokens/s token generation. 5950X managing 150 tokens/s prompt eval, and 1 token/s generation.

              Perceptually I can live with the M4's performance. It's a set prompt, do something else, come back sort of thing. The 5950/RTX3080's is too slow to be even remotely usable with the 70b parameter model.

              • vlovich1233m

                I don't disagree. I'm just taking OP at the literal statement they made.

            • lynndotpy3m

              Sure, this is technically correct, but somewhere there's a line of practicality. Running off a CPU (especially with swap) will be past that line.

              Otherwise, you don't even need a computer. Pen and paper is plenty.

              For all practical purposes, VRAM is a limiting factor.

      • dartos3m

        You can say the same about GPU clock speed as well…

    • Retr0id3m

      When it comes to LLMs in particular, it comes down to memory size+bandwidth more than anything else.

    • matwood3m

      I was able to run and use the DeepSeek distilled 24gb on an M1 Max with 64gb of ram. It wasn't speedy, but it was usable. I imagine the M3/4s are much faster, especially on smaller, more specific models.

  • alok-g3m

    Two questions for the fellow HNers:

    1. What are various average joe (as opposed to researchers, etc.) use cases for running powerful AI models locally vs. just using cloud AI. Privacy of course is a benefit, but it by itself may not justify upgrades for an average user. Or are we expecting that new innovation will lead to much more proliferation of AI and use cases that will make running locally more feasible?

    2. With the amount of memory used jumping up, would there be significant growth for companies making memory? If so, which ones would be best positioned?

    Thanks.

    • theshrike793m

      For 1: censorship

      A local model will do anything you ask it to, as far as it "knows" about it. It doesn't need to please investors or be afraid of bad press.

      LM Studio + a group of select models from huggingface and you can do whatever you want.

      For generic coding assistance and knowledge, online services are still better quality.

    • zamalek3m

      I don't think there's a huge use case locally yet, if you're happy with the subscription cost and privacy. Give it maybe 2 years and someone will probably invent something that local inference would seriously benefit from. I'm anticipating inference for home appliances (something Mac mini form factor that plugs into your router), but that's based on what would make logical sense for consumers, not what consumers would fall for.

      Apple seems to be using LPDDR, but HBM will also likely be a key tech. SK Hynix and Samsung are the most reputable for both.

      • alok-g3m

        Thanks.

        >> Apple seems to be using LPDDR, but HBM will also likely be a key tech. SK Hynix and Samsung are the most reputable for both.

        So not much Micron? Any US based stocks to invest in? :-)

        • zamalek3m

          I forgot about Micron, absolutely. TSMC is the supplier for all of these, so you're covering both memory and compute if that's your strategy (the risk is that US TSMC is over provisioning manufacturing based on the pandemic hardware boom).

    • christiangenco3m

      IMO it's all about privacy. Perhaps also availability if the main LLM providers start pulling shenanigans but it seems like that's not going to be a huge problem with how many big players are in the space.

      I think a great use case for this would be in a company that doesn't want all of their employees sending LLM queries about what they're working on outside the company. Buy one or two of these and give everybody a client to connect to it and hey presto you've got a secure private LLM everybody in the company can use while keeping data private.

      • chamomeal3m

        I’ll add to this that while I couldn’t care less about OpenAI seeing my general coding questions, I wouldn’t run actual important data through ChatGPT.

        With a local model, I could toss anything in there. Database query outputs, private keys, stuff like that. This’ll probably become more relevant as we give LLM’s broader use over certain systems.

        Like right now I still mostly just type or paste stuff into ChatGPT. But what about when I have a little database copilot that needs to read query results, and maybe even run its own subset of queries like schema checks? Or some open source computer-use type thingy needs to click around in all sorts of places I don’t want openAI going, like my .env or my bash profile? That’s the kinda thing I’d only use a local model for

      • user39393823m

        Hopefully homomorphic encryption can solve this rather than building a new hardware layer everywhere.

    • lvturner3m

      Risk of shut-out.

      I'm in Hong Kong, I can't even subscribe to OpenAI or Claude directly, though granted this doesn't so much apply to the already "open" models

    • JadedBlueEyes3m

      One important one that I haven't seen mentioned is simply working without an internet connection. It was quite important for me when I was using AI whilst travelling through the countryside, where there is very limited network access.

    • archagon3m

      I don’t currently use AIs, but if I did, they would be local. Simply put: I can’t build my professional career around tools that I do not own.

      • alok-g3m

        >> ... around tools that I do not own.

        That just may be dependent on how much trust you have on the providers you use. Or do you do your own electricity generation?

        • archagon3m

          That's quite a reductio ad absurdum. No, I don't generate my own electricity (though I could). But I don't use tools for work that can change out from under me at any moment, or that can increase 10x in price on a corporate whim.

          • hobofan3m

            And why would that require running AI models locally? You can be in essentially full control by using open source (/open weight) models (DeepSeek etc.) running on exchangable cloud providers that are as replaceable as your electricity provider.

            • archagon3m

              Sure, I guess you can do that as long as you use an open weight model. (Offline support is a nice perk, however.)

          • alok-g3m

            We align.

            I tend to do the same thing. I do not consider myself as a good representative of an average user though.

        • globular-toast3m

          In my country things like electricity and water supply are considered a right and a supplier has to go to court to get a supply shut off. Unfortunately we don't yet consider an internet connection in the same way, despite the government essentially requiring it these days.

      • globular-toast3m

        Even then you're still dependent on someone training new models on sketchy data sources that they don't own.

    • piotrpdev3m

      1. Lower latency for real time tasks e.g. transcription + translation?

  • tempodox3m

    I could salivate over the hardware no end, if only Apple software (including the OS) weren't that shoddy.

  • joshhart3m

    This is pretty exciting. Now an organization could produce an open weights mixture of experts model that has 8-15b active parameters but could still be 500b+ parameters and it could be run locally with INT4 quantization with very fast performance. DeepSeek R1 is a similar model but over 30b active parameters which makes it a little slow.

    I do not have a good sense of how well quality scales with narrow MoEs but even if we get something like Llama 3.3 70b in quality at only 8b active parameters people could do a ton locally.
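
    A quick sketch of why the active-parameter count matters so much, assuming token generation is memory-bandwidth bound and (optimistically) that the full advertised bandwidth is usable:

        # Upper-bound decode speed: each generated token streams the active
        # parameters through the memory bus roughly once.
        def max_tok_per_s(active_params, bytes_per_param, bandwidth_gb_s):
            return bandwidth_gb_s * 1e9 / (active_params * bytes_per_param)

        BW = 800  # ~GB/s quoted for the M3 Ultra elsewhere in this thread
        for active in (8e9, 15e9, 37e9):          # 8-15B narrow MoE vs R1's 37B
            tps = max_tok_per_s(active, 0.5, BW)  # INT4 ~ 0.5 bytes/param
            print(f"{active/1e9:.0f}B active: ~{tps:.0f} tok/s ceiling")
        # ~200, ~107, ~43 tok/s respectively -- theoretical ceilings, not benchmarks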

  • 827a3m

    Very curiously: They upgraded the Mac Studio but not the Mac Pro today.

  • screye3m

    How does the 500GB of VRAM compare with 8x A100s? ($15/hr rentals)

    If it is equivalent, then the machine pays for itself in 300 hours. That's incredible value.

  • dlachausse3m

    Interesting that they’re releasing M3 Ultra after the M4 Macs have already shipped.

    I wonder if the plan is to only release Ultras for odd number generations.

    • jmull3m

      I'm guessing it's more because "Ultra" versions, which "fuse" multiple chips, take significant additional engineering work. So we might expect an Ultra M4 next year, possibly after non-Ultra M5s are released.

    • pier253m

      They released the M2 Ultra

      • dlachausse3m

        Good point, I forgot about that. Maybe it just got really delayed in production.

        • ryao3m

          Reportedly Apple is using its own silicon in data centers to run “Apple Intelligence” and other things like machine translation in safari. I suspect that the initial supply was sent to Apple’s datacenters.

    • _alex_3m

      m2 ultra tho

  • VVilhelmsen3m

    Who is this made for? Who needs a personal computer this powerful? Not trying to be funny - it's a genuine question.

    Gamers don't generally use a mac because of the lack of games and I'm guessing those who are really into LLMs use Linux for the flexibility. Video editing can be done on much cheaper hardware.

    Very rich LLM enthusiasts who want to try out a Mac?

    • eric-burel3m

      I don't think people into LLMs necessarily use Linux; most devs I see around use a Mac, and I think I'll buy one and move off Ubuntu if I start using them more seriously. The performance is the key selling point, as many professional use cases benefit from running LLMs locally.

      • VVilhelmsen3m

        I used to run Ubuntu with i3 and recently switched to a macbook air. I thought I would hate it coming from i3, but honestly when using the stage manager + tmux & nvim and betterTouchTool for keybinds it feels just as effective. The fact that fullscreen applications get their own desktop is nice too.

    • rperez3333m

      Content creation usually - mostly audio, photo, and video editing. The hardware and software integration makes video editing on a Mac really superior, especially with their ProRes codec.

      You can get a good experience on a Windows or Linux machine with DaVinci Resolve, but that’s mostly because of the way better GPUs like the 4090/RTX series you’ve got at your disposal.

    • kalleboo3m

      Modern video editing includes tasks such as AI upscaling and subtitle generation that can use lots of power

    • perfmode3m

      I would love 32 cores for audio editing in RX 11.

    • tomovo3m

      Swift developers.

  • tap-snap-or-nap3m

    All this hardware, but I don't know how to best utilize it because 1) I am not a pro, and 2) the apps are not as helpful at making complex jobs easier, which is what the old Apple used to do really well.

  • Sharlin3m

    > it can be configured up to 512GB, or over half a terabyte.

    Hah, I see what they did there.

    • kridsdale13m

      If they added 1 byte, it counts.

      • Sharlin3m

        The "what they did here" part is they mean base-10 terabytes, and 512 GiB = 512·1024^3 bytes is indeed strictly more than 0.5 TB = 500·1000^3 bytes.

  • epolanski3m

    Can anybody ELI5 why there aren't multi-GPU builds to run LLMs locally?

    It feels like one should be able to build a good machine for $3-4k, if not less, with six 16GB mid-level gaming GPUs.

    • snitty3m

      Reddit's LocalLLama has a lot of these. 3090s are pretty popular for these purposes. But they're not trivial to build and run at home. Among other issues are that you're drawing >1kW for just the GPUs if you have four of them at 100% usage.

    • risho3m

      6 * 16 is still nowhere near 512GB of VRAM. On top of that, the monster you create requires hyper-specific server-grade hardware, will be huge and loud, and will pull down enough power to trip a circuit breaker. I'm sure most people would rather pay a 30 percent premium to get twice the RAM and have a power-sipping device that you can hold in the palm of your hand.

  • martin_a3m

    > Apple today announced M3 Ultra, the highest-performing chip it has ever created

    Well, duh, it would be a shame if you made a step backwards, wouldn't it? I hate that stupid phrase...

  • 3m
    [deleted]
  • bredren3m

    Apart from enabling a 120Hz update to the XDR Pro, does TB5 offer a viable pathway for eGPUs on Apple Silicon MacBooks?

    This is a cool computer, but not something I'd want to lug around.

    • mohsen13m

      For AI stuff, 120Gb/s is not really that useful...

  • DidYaWipe3m

    How do people feel about the value of the M3 Ultra vs. the M4 Max for general computing, assuming that you max out the RAM on the M4 version of the Studio?

    • 827a3m

      The kinds of workloads that could truly leverage the M2 Ultra over the M2 Max were vanishingly small. When comparing the M3 Ultra to the M4 Max, that number gets even smaller, because the M4 Max will have ~15% higher single core perf. The insane memory available on the M3 Ultra is its only interesting capability, but it's still not big enough to run the largest open source LLMs.

      Hot take: You can tie yourself into six knots trying to spin a yarn about why the M3 Ultra spec is super awesome for some AI use-case, meanwhile you could buy a Mac Mini and like 200 million GPT-4o tokens for the cost of this machine that can't even run R1.

      • LeoPanthera3m

        I suspect most people running LLMs locally are unable to use the big cloud models for either legal or ethical reasons. If you could use gpt4, you would, it's just not that expensive.

  • aurareturn3m

    You can run the full Deepseek 671b q4 model at 40 tokens/s. 37B active params at a time because R1 is MoE.

    • KingOfCoders3m

      In another of your comments it was "by my calculation". Now it's just fact?

      • aurareturn3m

        Easiest calculation.

        • KingOfCoders3m

          Me pulling numbers out of my hat, easiest calculation too.

          • aurareturn3m

            Troll.

            • KingOfCoders3m

              Exactly. Sub-species Apple fanboy.

              • aurareturn3m

                Wannabe CTO. Engineering manager at some random Wordpress real estate website and an ex-CTO at an "eBay subsidiary". Looks to be a misinformed AMD fanboy as well. Trolls HN trying to advertise a book that was probably written by ChatGPT.

                • KingOfCoders3m

                  [flagged]

                  • aurareturn3m

                    You ain't owning anything. I've seen countless "coaches" like yourself with questionable experience. Looks more like a BS artist honestly.

  • wewewedxfgdf3m

    Computers these days - the more appealing, exciting, cool and desirable, the higher the price, off into the stratosphere.

    $9499

    Whatever happened to competition in computing?

    Computing hardware competition used to be cut-throat, drop-dead, knife-fight, last-man-standing brutal. Now it's just a massive gold-rush cash grab.

    • niek_pas3m

      The Macintosh Plus, released in 1986, cost $2,600 at the time, or $7,460 adjusted for inflation.

      • bigyabai3m

        It even came with an official display! Nowadays that's a $1,600-$6,000 accessory, depending on whether you own a VESA mount.

    • WXLCKNO3m

      You take the top price of the top of the line newest pro chip apple produces and then make this argument?

    • Underphil3m

      You're thinking about this as if the average Joe is interested in this. They're not. The tech folk salivating in this discussion are a rounding error when it comes to computing. The vast majority has no need for this.

    • bustling-noose3m

      The MacBook Air with the M4 chip, 16GB RAM, and an amazing display and camera is just $999. During the back-to-school promotion that will happen soon, it's $899 with free AirPods. That's really great value given how good the hardware is.

    • hu33m

      It doesn't even run Linux properly.

      Could cost half of that and it would still be uninteresting for my use cases.

      For AI, on-demand cloud processing is magnitudes better in speed and software compatibility anyway.

      • dcchambers3m

        There are legitimate use cases for local LLMs.

        For example, I'll happily feed my entire directory of private notes/diary entries into an LLM running offline on my laptop. I would never do that with someone else's LLM running in the cloud.

  • ein0p3m

    That's all nice, but if they are to be considered a serious AI hardware player, they will need to invest in better support of their hardware in deep learning frameworks such as PyTorch and Jax. Currently the support is rather poor, and is not suitable for any serious work.

  • maz1b3m

    Surprised they didn't do an M4 Ultra. I really hope they don't do an M4 Ultra for the Mac Pro and add in this very undesirable kind of product matrix just for the sake of differentiation. I would be ok with an M3 Extreme in the Mac Pro, however.

  • ballooney3m

    I'm from the dark ages and am interested in this for non-AI things like CFD. What is the state of SDK support for these chips? Is there a nice rust or C++ library that abstracts the hardware and lets you just do very big Matrix multiplications?
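
    Not Rust or C++, but as one data point: Apple's MLX (mentioned elsewhere in this thread) exposes a NumPy-like array API over the GPU and unified memory, and it also ships a C++ API. A minimal sketch, assuming the mlx package is installed:

        import mlx.core as mx

        # Big matmul dispatched to the Apple GPU; MLX is lazy, so mx.eval()
        # forces the computation to actually run.
        a = mx.random.normal((8192, 8192))
        b = mx.random.normal((8192, 8192))
        c = a @ b
        mx.eval(c)
        print(c.shape)  # (8192, 8192)

    Accelerate's BLAS/LAPACK is the other obvious route if you want to stay in C or C++.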

  • 3m
    [deleted]
  • FloatArtifact3m

    So the question is whether the M1/M2 Ultra was limited by GPU/NPU compute or by memory bandwidth at this point?

    I'm curious what instruction sets may have been included with the M3 chip that the other two lack for AI.

    So far the candidates seem to be NVIDIA DIGITS, the Framework Desktop, or an M1 64GB / M2/M3 128GB Studio/Ultra.

    The GPU market isn't competitive enough for the amount of VRAM needed. I was hoping for a Battlemage GPU model with 24GB that would be reasonably priced and available.

    For the Framework Desktop and similar devices, I think a second generation will be significantly better than what's currently on offer today. Rationale below...

    For a max-spec processor with RAM at $2,000, this seems like a decent deal given today's market. However, this might age very fast, for three reasons.

    Reason 1: LPDDR6 may debut in the next year or two; this could bring massive improvements to memory bandwidth and capacity for soldered-on memory.

    LPDDR6 vs LPDDR5:

    - Data bus width: 24 bits vs 16 bits

    - Burst length: 24 bits vs 15 bits

    - Memory bandwidth: up to 38.4 GB/s vs up to 6.7 GB/s

    - CAMM RAM may or may not maintain signal integrity as memory bandwidth increases. Until I see it implemented for an AI use case in a cost-effective manner, I am skeptical.

    Reason 2: It's a laptop chip with limited PCIe lanes and a reduced power envelope. Theoretically, a desktop chip could have better performance, more lanes, and be socketable (although I don't think I've seen a socketed CPU with soldered RAM).

    Reason 3: What does this hardware look like when repurposed in the future, compared to alternatives?

    - Unlike desktop or server counterparts, which can have higher CPU core counts and PCIe/IO expansion, this processor and its motherboard are limited when it comes to repurposing them later down the line as a server to self-host other software besides AI. I suppose it could be turned into an overkill NAS with ZFS and an HBA controller card in a new case.

    - Buying into the Framework Desktop is pretty limited based on the form factor. The next generation might be able to include a fully populated x16 slot and a 10G NIC. That seems about it if they're going to maintain the backward-compatibility philosophy given the case form factor.

  • k_sze3m

    Why did Apple call it M3 Ultra when it's supposed to be more performant than the current top of the line M4 Max? Why not just call it M4 Ultra?

    • gnabgib3m

      Because it's two M3 Max chips fused with a high-speed link? The M4 Ultra will presumably be comprised of two M4 Max processors fused similarly (next year).

  • casey23m

    They are going to have more memory than storage at this rate.

  • gigatexal3m

    16TB storage, 512GB RAM, M3 Ultra: $15k+ USD. Wow.

    Did they say why there’s not an m4 ultra?

  • teknologist3m

    Something nobody mentions is the power draw. This kind of compute horsepower for so little power usage is outstanding.

  • xyst3m

    I might like Apple again if the SoC could be sold separately and opened up. It would be interesting to see a PC with Asahi or Windows running on Apple’s chips.

  • m3kw93m

    Instantly HIPAA-compliant high-end models running locally.

  • api3m

    Half a terabyte could run 8-bit quantized versions of some of those full-size Llama and DeepSeek models. Looking forward to seeing some benchmarks on that.

    • zamadatix3m

      Deepseek would need Q5ish level quantization to fit.

  • gpapilion3m

    I think this will eventually morph into Apple's server fleet. This, in conjunction with the AI server factory they are opening, makes a lot of sense.

  • nottorp3m

    > support for more than half a terabyte of unified memory

    Soldered?

    • universenz3m

      Is there a single Apple SoC where they’ve provided removable ram? Not that I can recall.

      • danpalmer3m

        Is there even an existing replaceable memory standard that would meet the current needs of Apple's "Unified Memory" architecture? I'm not an expert but I'd suspect probably not. The bus probably looks a lot more like VRAM on GPUs, and I've never seen a GPU with replaceable RAM.

        • jsheard3m

          CAMM2 could kinda work, but each module is only 128-bit so I think the furthest you could possibly push it is a 512-bit M Max equivalent with CAMM2 modules north, east, west and south of the SOC. There just isn't room to put eight modules right next to the SOC for a 1024-bit bus like the M Ultra.

          • eigenspace3m

            Framework said that when they built a Strix Halo machine, AMD assigned an engineer to work with them on seeing if there's a way to get CAMM2 memory working with it, and after a bunch of back and forth it was decided that CAMM2 still made the traces too long to maintain proper signal integrity due to the 256 bit interface.

            These machines have a 512 bit interface, so presumably even worse.

            • zamadatix3m

              Current (individual, not counting dual socketed) AMD Epyc CPUs have 576 GB/s over a 768 bit bus using socketed DIMMs.

              • eigenspace3m

                My understanding is that works out due to the lower clock speeds of those RAM modules though right?

                It's getting that bandwdith by going very wide on very very very many channels, rather than trying to push a gigantic amount of bandwidth through only a few channels.

                • zamadatix3m

                  Yeah, "channels" are just a roundabout way to say "wider bus" and you can't get too much past 128 GB/s of memory bandwidth without leaning heavily into a very wide bus (i.e. more than the "standard" 128 bit we're used to on consumer x86) regardless who's making the chip. Looking at it from the bus width perspective:

                  - The AI Max+ 395 is a 256 bit bus ("4 channels") of 8000 MHz instead of 128 bits ("2 channels") of 16000 MHz because you can't practically get past 9000 MHz in a consumer device, even if you solder the RAM, at the moment. Max capacity 128 GB.

                  - 5th Gen Epyc is a 768 bit bus ("12 channels") of 6000 MHz because that lets you use a standard socketed setup. Max capacity 6 TB.

                  - M3 Ultra is a 1024 bit bus ("16 channels") of "~6266 MHz" as it's 2x the M3 Max (which is 512 bits wide) and we know the final bandwidth is ~800 GB/s. Max capacity 512 GB.

                  Note: "Channels" is in quotes because the number of bits per channel isn't actually the same per platform (and DDR5 is actually 2x32 bit channels per DIMM instead of 1x64 per DIMM like older DDR... this kind of shit is why just looking at the actual bit width is easier :p).

                  So really the frequencies aren't that different even though these are completely different products across completely different segments. The overwhelming factor is bus width (channels) and the rest is more or less design choice noise from the perspective of raw performance.
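
                  Putting those three examples through the same formula (peak GB/s ≈ bus width in bits ÷ 8 × MT/s ÷ 1000):

                      def peak_gb_s(bus_bits, mt_per_s):
                          return bus_bits / 8 * mt_per_s / 1000  # bytes/transfer * transfers/s -> GB/s

                      print(peak_gb_s(256, 8000))   # AI Max+ 395: 256.0 GB/s
                      print(peak_gb_s(768, 6000))   # 5th gen Epyc: 576.0 GB/s
                      print(peak_gb_s(1024, 6266))  # M3 Ultra: ~802 GB/s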

            • jsheard3m

              Yeah, but AMD's memory controllers are really finicky. That might have been more of a Strix Halo issue than a CAMM2 issue.

              • eigenspace3m

                Entirely possible. Obviously Apple wouldn't have been interested in letting you upgrade the RAM even if it was doable.

                I'd love to have more points of comparison available, but Strix Halo is the most analogous chip to an M-series chip on the market right now from a memory point of view, so it's hard to really know anything.

                I very much hope CAMM2 or something else can be made to work with a Strix-like setup in the future, but I have my doubts.

        • nottorp3m

          I thought so too when they launched the M1, but I soon got corrected.

          The memory bus is the same as for modules, it's just very short. The higher end SoCs have more memory bandwidth because the bus is wider (i.e. more modules in parallel).

          You could blame DDR5 (who thought having a speed negotiation that can go over a minute at boot is a good idea?), but I blame the obsession with thin and the ability to overcharge your customers.

          > I've never seen a GPU with replaceable RAM

          I still have one :) It's an ISA Trident TVGA 8900 that I personally upgraded from 512k VRAM to one full megabyte!!!

        • hoseja3m

          It's really unfortunate that GPUs aren't fully customizable daughterboards, isn't it.

    • klausa3m

      It's not soldered, it's _on the package_ with the SoC.

      • eigenspace3m

        It is _not_ on die. It's soldered onto the package.

        There's a good reason it's soldered, i.e. the wide memory interface and huge bandwidth mean that the extra trace lengths needed for an upgradable RAM slot would screw up the memory timings too much, but there's no need to make false claims like saying it's on-die.

        • sschueller3m

          > RAM slot would screw up the memory timings

          Existing ones, possibly, but why not build something that lets you snap in a BGA package just like we snap in CPUs on full-sized PC mainboards?

          • eigenspace3m

            The longer traces are the problem. They want these modules as physically close as possible to the CPU to make the timings work out and maintain signal integrity.

            It's the same reason nobody sells GPUs that have user upgradable non-soldered GDDR VRAM modules.

      • georgeburdell3m

        Probably on package at best

        • klausa3m

          Right, yes, sorry for imprecise language!

          • riidom3m

            Thanks for clarifying

    • ZekeSulastin3m

      Not even Framework has escaped from soldered RAM for this kind of thing.

    • simlevesque3m

      As are all Apple M devices.

    • reaperducer3m

      Soldered?

      Figure out a way to make it unified without also soldering it, and you'll be a billionaire.

      Or are you just grinding a tired, 20-year-old axe.

      • rsynnott3m

        _That_, in itself, wouldn't be that difficult, and there are shared-memory setups that do use modular memory. Where you'd really run into trouble is making it _fast_; this is very, very high bandwidth memory.

      • jonjojojon3m

        Like all Intel/AMD integrated graphics that use the system's RAM as VRAM?

    • varispeed3m

      You know that memory can be "easily" de-soldered and soldered at home?

      The issue is availability of chips and most likely you have to know which components to change so the new memory is recognised. For instance that could be changing a resistor to different value or bridging certain pads.

      • A4ET8a8uTh0_v23m

        This viewpoint is interesting. It is not exactly inaccurate, but it does appear to be missing a point. Soldering in itself is a valuable and useful skill, but I can't say you can just get in and start de-soldering willy-nilly as opposed to opening a box and upgrading ram by plopping stuff in a designated spot.

        What if both are an issue?

        • varispeed3m

          Do you know that "plopping stuff in a designated spot" can also be out of reach for some people? I know plenty who would give their computer to a tech to do the upgrade for them even if they are shown in person how to do all the steps. Soldering is just one step (albeit a fairly big one) above that. But the fact that this can be done at home with fairly inexpensive tools means a tech person with reasonable skill could do it, so such an upgrade could be accessible in a computer/phone repair shop if parts were available. What I am trying to say is that soldering is not the barrier.

  • datadrivenangel3m

    Unclear what devices this will be in outside of the mac studio. Also most of the comparisons were with M1 and M2 chips, not M4.

    • reaperducer3m

      > most of the comparisons were with M1 and M2 chips, not M4.

      Is anyone other than a vanishingly small number of hardcore hobbyists going to upgrade from an M4 to an M4 Ultra?

      • nordsieck3m

        > Is anyone other than a vanishingly small number of hard core hobbiests going to upgrade from an M4 to an M4 Ultra?

        I expect that the 2 biggest buyers of M4 Ultra will be people who want to run LLMs locally, and people who want the highest performance machine they can get (professionals), but are wedded to mac-only software.

        • bredren3m

          Anecdotal, and reasonable criticisms of the release aside, OpenAI's gpt-4.5 introduction video was done from a hard-to-miss Apple laptop.

          It is reasonable to say many folks in the field prefer to work on mac hardware.

    • dlachausse3m

      It is a bit misleading to do that, but in fairness to Apple, almost nobody is upgrading to this from an M4 Mac, so those are probably more useful comparisons.

  • tuananh3m

    But is it actually usable for anything if it's too slow?

    Does anyone have a ballpark number for how many tokens per second we can get with this?

  • datadeft3m

    > Apple today announced M3 Ultra, the highest-performing chip it has ever created

    I thought that was a few weeks ago, when the M4 Max came out.

  • wewewedxfgdf3m

    The good news is that AMD and Intel are both in good positions to develop similar products.

  • crowcroft3m

    Kinda curious to see how many tok/sec it can crush. Could be a fun way to host AI apps.

  • pier253m

    So weird they released the Mac Studio with an M4 Max and M3 Ultra.

    Why? Do they have too many M3 chips in stock?

    • bigfishrunning3m

      The M4 Max is faster, the M3 Ultra supports more unified memory -- So pick whichever meets your requirements

      • pier253m

        Yes but why not release an M4 Ultra?

        • 3m
          [deleted]
        • wpm3m

          Because the M4 architecture doesn't have the interconnects needed to fuse two Max SoCs together.

  • desertmonad3m

    Time to upgrade the M1 Ultra I guess! The M1 Ultra has been pretty good with DeepSeek locally.

    • _alex_3m

      what flavor of deepseek are you running? what kind of performance are you seeing?

  • apatheticonion3m

    God I wish Linux ran on Apple Silicon (with first class hardware support).

    • walterbell3m

      We need more mainline Linux devs to work on Apple Silicon feature gaps, instead of expecting tiny AsahiLinux to roll support up the mountain alone.

      • apatheticonion3m

        For sure. The work that Asahi have done is unbelievable, especially given the size of the team and the challenges they've faced. Would be amazing if there was more support from mainline devs (or better still, Apple directly).

  • runeks3m

    Is the new Mac Studio the only product this chip will go into?

  • ozten3m

    We've come a long way since beowulf clusters of smart toasters.

  • behnamoh3m

    819GB/s bandwidth...

    what's the point of 512GB RAM for LLMs on this Mac Studio if the speed is painfully slow?

    it's as if Apple doesn't want to compete with Nvidia... this is really disappointing in a Mac Studio. FYI: M2 Ultra already has 800GB/s bandwidth

    • gatienboquet3m

      NVIDIA RTX 4090: ~1,008 GB/s

      NVIDIA RTX 4080: ~717 GB/s

      AMD Radeon RX 7900 XTX: ~960 GB/s

      AMD Radeon RX 7900 XT: ~800 GB/s

      How's that slow exactly ?

      You can have 10000000Gb/s and without enough VRAM it's useless.

      • ttul3m

        I have a 4090 and, out of curiosity, I looked up the FLOPS in comparison with Apple chips.

        Nvidia RTX 4090 (Ada Lovelace)

        FP32: Approximately 82.6 TFLOPS

        FP16: When using its 4th‑generation Tensor Cores in FP16 mode with FP32 accumulation, it can deliver roughly 165.2 TFLOPS (in non‑tensor mode, the FP16 rate is similar to FP32).

        FP8: The Ada architecture introduces support for an FP8 format; using this mode (again with FP32 accumulation), the RTX 4090 can achieve roughly 330.3 TFLOPS (or about 660.6 TOPS, depending on how you count operations).

        Apple M1 Ultra (The previous‑generation top‑end Apple chip)

        FP32: Around 15.9 TFLOPS (as reported in various benchmarks)

        FP16: By similar scaling, FP16 performance would be roughly double that value—approximately 31.8 TFLOPS (again, an estimate based on common patterns in Apple’s GPU designs)

        FP8: Like the M3 family, the M1 Ultra does not support a dedicated FP8 precision mode.

        So a $2000 Nvidia 4090 gives you about 5x the FLOPS, but with far less high speed RAM (24GB vs. 512GB from Apple in the new M3 Ultra). The RAM bandwidth on the Nvidia card is over 1TBps, compared with 800GBps for Apple Silicon.

        Apple is catching up here and I am very keen for them to continue doing so! Anything that knocks Nvidia down a notch is good for humanity.

        • bigyabai3m

          > Anything that knocks Nvidia down a notch is good for humanity.

          I don't love Nvidia a whole lot but I can't understand where this sentiment comes from. Apple abandoned their partnership with Nvidia, tried to support their own CUDA alternative with blackjack and hookers (OpenCL), abandoned that, and began rolling out a proprietary replacement.

          CUDA sucks for the average Joe, but Apple abandoned any chance of taking the high road when they cut ties with Khronos. Apple doesn't want better AI infrastructure for humanity; they envy the control Nvidia wields and want it for themselves. Metal versus CUDA is the type of competition where no matter who wins, humanity loses. Bring back OpenCL, then we'll talk about net positives again.

        • ricebunny3m

          Uhm, we can expect close to 8 FP32 TFLOPS from the CPUs alone on the M3 Ultra. It comes with 4 tensor engines (AMX) each capable of about 2 TFLOPs.

          M3 Max GPU benchmarks around 14 TFLOPs, so the Ultra should score around 28 TFLOPs.

          Double the numbers for FP16.
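
          Spelled out (a rough sketch using the per-unit estimates above; none of these are official figures):

              amx_fp32 = 4 * 2.0               # 4 AMX/matrix engines x ~2 TFLOPS each
              gpu_fp32 = 2 * 14.0              # two fused M3 Max GPUs -> ~28 TFLOPS
              print(amx_fp32 + gpu_fp32)       # ~36 FP32 TFLOPS combined
              print(2 * (amx_fp32 + gpu_fp32)) # ~72 TFLOPS if FP16 really runs at 2x the FP32 rate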

      • whimsicalism3m

        h100 sxm - 3TB/s

        vram is not really the limiting factor for serious actors in this space

        • gatienboquet3m

          If my grandmother had wheels, she’d be a bicycle

    • aurareturn3m

        what's the point of 512GB RAM for LLMs on this Mac Studio if the speed is painfully slow?
      
      You can fit the entire Deepseek 671B q4 into this computer and get 41 tokens/s because it's an MoE model.
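
      The napkin math behind a number in that ballpark looks roughly like this (a sketch; it assumes ~37B active parameters per token, a ~4.5-bit average for a Q4-style quant, and that most of the 819GB/s peak is actually achievable):

          active_params   = 37e9    # MoE: only the active experts' weights are read per token
          bits_per_weight = 4.5     # rough average for a Q4-style quant
          bandwidth       = 819e9   # bytes/s, theoretical peak

          bytes_per_token = active_params * bits_per_weight / 8  # ~20.8 GB read per token
          print(bandwidth / bytes_per_token)                     # ~39 tok/s, a bandwidth-bound ceiling
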
      • KingOfCoders3m

        Your comments went from

        "40 tokens/s by my calculations"

        to

        "40 tokens/s"

        to

        "41 tokens/s"

        Is there a dice involved in "your calculations?"

        • aurareturn3m

          41 was when I learned it has a little over 800GB/s.

          Doesn't matter. All theorized because no one has publicly tested one.

  • dangoodmanUT3m

    800GB/s and 512GB of unified RAM is going to go stupid for LLMs

  • ummonk3m

    Is the Mac Pro dead, or are they waiting for an M4 Ultra to refresh it?

    • mrcwinn3m

      The Mac Pro died a long, long time ago my friend.

  • fintechie3m

    IMO this is a bigger blow to the AI big boys than Deepseek's release. This is massive for local inference. Exciting times ahead for open source AI.

    • kcb3m

      The market for local inference and $10k+ Macs is not nearly significant enough to affect the big boys.

    • whimsicalism3m

      it is absolutely not

    • bigyabai3m

      I don't think you understand what the "AI big boys" are in the market for.

  • mlboss3m

    $14K with 512GB memory and 16TB storage

    • maverwa3m

      I cannot believe I'm saying this, but: for Apple, that's rather cheap. Threadripper boxes with that amount of memory do not come a lot cheaper. Considering Apple's pricing for memory in other devices, $4K for the 96GB to 512GB upgrade is a bargain.

      • jltsiren3m

        It's not that much cheaper than with earlier comparable models. Apple memory prices have been $25/GB for the base and Pro chips and $12.5/GB for the Max and Ultra chips. With the new Studios, we get $12.5/GB until 128 GB and $9.375/GB beyond that.

        If you configure a Threadripper workstation at Puget Systems, memory price seems to be ~$6/GB. Except if you use 128 GB modules, which are almost $10/GB. You can get 768 GB for a Threadripper Pro cheaper than 512 GB for a Threadripper, but the base cost of a Pro system is much higher.
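
        For what it's worth, those per-GB figures line up with the $4,000 upgrade price mentioned elsewhere in the thread (a quick check, assuming the tier boundary sits at 128 GB):

            base_gb, max_gb = 96, 512
            tier1 = (128 - base_gb) * 12.5   # 32 GB at $12.5/GB   = $400
            tier2 = (max_gb - 128) * 9.375   # 384 GB at $9.375/GB = $3,600
            print(tier1 + tier2)             # 4000.0 -- matches the 96GB -> 512GB upgrade price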

        • aurareturn3m

          A Puget Systems Threadripper 32-core + 512GB of system RAM + RTX 4060 Ti is already $12,000.49.

          Meanwhile, this thing has a faster CPU, GPU, and 512GB of 800GB/s VRAM for $9,500.

  • minton3m

    + $4,000 to bump to 512GB from 96GB.

  • anArbitraryOne3m

    Can't wait to run asahi on it

  • gatienboquet3m

    No benchmarks yet for the LLMs :(

  • universenz3m

    96GB on the baseline M3 Ultra model, with a max of 512GB! Looks like they're leaning in hard with the AI crowd.

  • okamiueru3m

    Don't know what prior extreme Apple is alluding to here. But Apple marketing is what it is.

  • ntqvm3m

    Disappointing announcement. M4 brings a significant uplift over M3, and the ST performance of the M3 Ultra will be significantly worse than the M4 Max.

    Even for its intended AI audience, the ISA additions in M4 brought significant uplift.

    Are they waiting to put M4 Ultra into the Mac Pro?

  • 1attice3m

    Now with Ultra-class backdoors? https://news.ycombinator.com/item?id=43003230

    • saagarjha3m

      It's unlikely that was a backdoor.

  • perfmode3m

    32 core, 512GB RAM, 8TB SSD

    please take my money now

  • xedrac3m

    Now let me run Linux on it natively without having to jump through hoops. That would be something...

  • daft_pink3m

    Really? M4 Max or M3 Ultra instead of M4 Ultra?

  • chvid3m

    Now make a data center version.

  • varjag3m

    Call me a unit fundamentalist but calling 512Gb "over half a terabyte memory" irks me to no end.

    • klausa3m

      It's over half a _tera_byte; exactly half a _tebi_byte if you wanna be a fundamentalist.
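
      The conversion, assuming the 512GB is really 512 GiB, as RAM sizes normally are:

          ram_bytes = 512 * 2**30     # 512 GiB
          print(ram_bytes / 10**12)   # ~0.55 -> just over half a terabyte
          print(ram_bytes / 2**40)    # 0.5   -> exactly half a tebibyte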

      • varjag3m

        It is exactly the opposite. Every computer architecture in production addresses memory in the powers of two.

        SI has no business in memory size nomenclature as it is not derived from fundamental physical units. The whole klownbyte change was pushed through by hard drive marketers in 1990s.

        • umanwizard3m

          > Every computer architecture in production addresses memory in the powers of two.

          What does it mean to "address memory in powers of two" ? There are certainly machines with non-power-of-two memory quantities; 96 GiB is common for example.

          > The whole klownbyte change was pushed through by hard drive marketers in 1990s.

          The metric prefixes based on powers of 10 have been around since the 1790s.

          • varjag3m

            > What does it mean to "address memory in powers of two" ? There are certainly machines with non-power-of-two memory quantities; 96 GiB is common for example.

            I challenge you to show me any SKU from any memory manufacturer that has a power of 10 capacity. Or a CPU whose address space is a power of 10. This is an unavoidable artefact of using a binary address bus.

            > The metric prefixes based on powers of 10 have been around since the 1790s.

            And the Babylonians used powers of 60, what gives?

        • kstrauser3m

          *bibytes are a practical joke played on computer scientists by the salespeople to make it sound like we’re drunk. “Tell us more about your mebibytes, Fred elbows colleague, listen to this”.

          If Donald Knuth and Gordon Bell say we use base-2 for RAM, that’s good enough for me.

        • jltsiren3m

          It's more complicated than that. Data storage sizes are not connected to fundamental physical units, but data transfer rates are. Things get annoying when a 1 MB/s connection cannot transfer a megabyte in a second.

          • varjag3m

            Line discipline rarely has sequences of bytes without any service information (parity, delimiters, preambles etc). So I don't see it as a practical issue.

        • 3m
          [deleted]
        • esafak3m

          Do SSD companies do the same thing? We ought to go back to referring to storage capacity in powers of two.

          • jl63m

            SSDs have added weirdness like 3-bit TLC cells and overprovisioning. Usable storage size of an SSD is typically not an exact power of 10 or 2.

    • kissiel3m

      You're nitpicking, but then you use lowercase b for a byte ;)

    • transcriptase3m

      Perhaps they’re including the CPU cache and rounding down for brevity.

  • merillecuz563m

    [dead]

  • johntitorjr3m

    Lots of AI HW is focused on RAM (512GB!). I have a cost-sensitive application that needs speed (300+ TOPS), but only 1GB of RAM. Are there any HW companies focused on that space?

    • xyzsparetimexyz3m

      Isn't that just any discrete (Nvidia, AMD) GPU?

    • jms553m

      Like others have said, basically traditional GPUs (RTX 40/50 series in particular, 20/30 series have much weaker tensor cores).

      In terms of software, recent NVIDIA and AMD research has focused on fast evaluation of small ~4 layer MLPs using FP8 weights for things like denoising, upscaling, radiance caching, and texture and material BRDF compression/decompression.

      NVIDIA has just put out some new graphics API extensions and samples/demos for loading a chunk of neural net weights and performing inference from within a shader.
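
      Not the NVIDIA shader-side API, but to give a feel for the scale being discussed, here's a hypothetical toy version of such a network in plain Python: a ~4-layer MLP at this width is only a few thousand parameters, small enough to keep entirely on-chip.

          import numpy as np

          def tiny_mlp(x, layers):
              # A small fully-connected network: linear layer + ReLU per stage.
              for w, b in layers:
                  x = np.maximum(x @ w + b, 0.0)
              return x

          rng = np.random.default_rng(0)
          dims = [64, 64, 64, 64, 3]  # 4 layers, hidden width 64, small output
          layers = [(rng.standard_normal((i, o)).astype(np.float16),
                     np.zeros(o, dtype=np.float16)) for i, o in zip(dims, dims[1:])]

          print(sum(w.size + b.size for w, b in layers), "parameters")  # ~12.7k -> a few KB at FP8
          print(tiny_mlp(np.ones(64, dtype=np.float16), layers).shape)  # (3,)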

    • stefan_3m

      Just buy any gaming card? Even something like the Jetson AGX Orin boasts 275 TOPS (but they add in all kinds of different subsystems to reach that number).

      • johntitorjr3m

        The Jetson is interesting!

        Can you elaborate on how the TOPS value is inflated? What GPU would be the equivalent of the Jetson AGX Orin?

        • stefan_3m

          The problem with the TOPS is that they add in ~100 TOPS from the "Deep Learning Accelerator" coprocessors, but they have a lot of awkward limitations on what they can do (and software support is terrible). The GPU is an Ampere generation, but there is no strict consumer GPU equivalent.

    • NightlyDev3m

      Most recent GPUs will do. An older RTX 4070 is over 400 TOPS, the new RTX 5070 is around 1000 TOPS, and the RTX 5090 is around 3600 TOPS.

      • johntitorjr3m

        Yeah, that's basically where I'm at with options. Not ideal for a cost sensitive application.

    • Havoc3m

      Grayskull cards might be a fit. Think they're not entirely plug and play, though.

  • Acelar03m

    [dead]

  • junglistguy3m

    [dead]

  • JacksCracked3m

    [dead]

  • cytocync3m

    [dead]

  • catlover763m

    [dead]

  • lincpa3m

    [dead]

  • GypsyKing7163m

    [flagged]

  • mythz3m

    Ultra disappointing, they waited 2 years just to push out a single gen bump, even my last year's iPad Pro runs M4.

    • heeton3m

      For AI workflows that's quite a lot cheaper than the alternative in GPUs.

      • mythz3m

        Yeah VRAM option is good (if it performs well), just sad we'd have to drop 10K to access it tied to a prev gen M3 when they'll likely have M5 by the end of the year.

        Hard to drop that much cash on an outdated chip.

  • ldng3m

    Well, a shame for Apple; a lot of the rest of the world is going to boycott American products after this level of treachery.

  • giancarlostoro3m

    At 9 grand, I would certainly hope they support the device software-wise longer than they supported my 2017 MacBook Air. I see no reason to be forced to cough up 10 grand to Apple essentially every 7 years; that's ridiculous.

  • NorwegianDude3m

    The memory amount is fantastic, memory bandwidth is half decent (~800 GB/s), and the compute capabilities are terrible (36 TOPS).

    For comparison, a single consumer card like the RTX 5090 has only 32 GB of memory, but 1792 GB/s of memory bandwidth and 3593 TOPS of compute.

    The use cases will be limited. While you can't run a 600B model directly like Apple says (because you need more memory for that), you can run a quantized version, but it will be very slow unless it's an MoE architecture.
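
    Rough numbers on the "quantized version" point (a sketch that ignores KV cache and activation memory):

        params = 671e9
        for bits in (16, 8, 4.5):  # FP16, FP8, ~Q4-style quant
            size_gb = params * bits / 8 / 1e9
            print(f"{bits} bits: {size_gb:,.0f} GB, fits in 512 GB: {size_gb < 512}")
        # 16 bits: 1,342 GB (no); 8 bits: 671 GB (no); 4.5 bits: ~377 GB (yes)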

    • dagmx3m

      You're comparing two different things.

      The compute level you’re talking about on the M3 Ultra is the neural engine. Not including the GPU.

      I expect the GPU here will be behind a 5090 for compute but not by the unrelated numbers you’re quoting. After all, the 5090 alone is multiple times the wattage of this SoC.

      • llm_nerd3m

        Using the NPU numbers grossly overstates the AI performance of the Apple Silicon hardware, so they're actually giving Apple the benefit of the doubt.

          Most AI training and inference (including generative AI) is bound by large-scale matrix MACs. That's why nvidia fills their devices with enormous numbers of tensor cores and Apple / Qualcomm et al are adding NPUs, filling largely the same gap. Except nvidia's are not only a magnitude+ more performant, they're also massively more flexible (in types and applications) and usable for training and inference, while Apple's is only useful for a limited set of inference tasks (due to architecture and type limits).

        Apple can put the effort in and making something actually competitive with nvidia, but this isn't it.

        • dagmx3m

          Care to share the TOPs numbers for the Apple GPUs and show how this would “grossly overstate” the numbers?

          Apple won’t compete with NVIDIA, I’m not arguing that. But your opening line will only make sense if you can back up the numbers and the GPU performance is lower than the ANE TOPS.

          • llm_nerd3m

            Tensor / neural cores are very easy to benchmark and give a precise number because they do a single well-defined thing at a large scale. So GPU numbers are less common and much more use-specific.

            However the M2 Ultra GPU is estimated, with every bit of compute power working together, at about 26 TOPS.

            • dagmx3m

              Could you provide a link for that TOPS count? (And specifically TOPs with comparable unit sizes since NVIDIA and Apple did not use the same units till recently)

              The only similar number I can find is for TFLOPS vs TOPS

              Again I’m not saying the GPU will be comparable to an NVIDIA one, but that the comparison point isn’t sensible in the comments I originally replied to.

      • bigyabai3m

        > After all, the 5090 alone is multiple times the wattage of this SoC.

          FWIW, normalizing the wattages (or even underclocking the GPU) will still give you an Nvidia advantage most days. Apple's GPU designs are closer to AMD's designs than Nvidia's, which means they omit a lot of AI accelerators to focus on a less-LLM-relevant raster performance figure.

        Yes, the GPU is faster than the NPU. But Apple's GPU designs haven't traditionally put their competitors out of a job.

        • dagmx3m

          M2 Ultra is ~250W (averaging various reports since Apple don’t publish) for the entire SoC.

          5090 is 575W without the CPU.

          You’d have to cut the Nvidia to a quarter and then find a comparable CPU to normalize the wattage for an actual comparison.

          I agree that Apple GPUs aren’t putting the dedicated GPU companies in danger on the benchmarks, but they’re also not really targeting it? They’re in completely different zones on too many fronts to really compare.

          • bigyabai3m

            Well, select your hardware of choice and see for yourself then: https://browser.geekbench.com/opencl-benchmarks

            > but they’re also not really targeting it?

            That's fine, but it's not an excuse to ignore the power/performance ratio.

            • dagmx3m

              But I’m not ignoring the power/performance ratio? If anything, you are doing that by handwaving away the difference.

              Give me a comparable system build where the NVIDIA GPU + any CPU of your choice is running at the same wattage as an M2 Ultra, and outperforms it on average. You’d get 150W for the GPU and 150W for the CPU.

              Again, you can’t really compare the two. They’re inherently different systems unless you only care about singular metrics.

      • NorwegianDude3m

        No, I'm not. I'm comparing the TOPS of the M3 Ultra and the tensor cores of the RTX 5090.

        If not, what is the TOPS of the GPU, and why isn't Apple talking about it if there is more performance hidden somewhere? Apple states 18 TOPS for the M3 Max. And why do you think Apple added the neural engine, if not to accelerate compute?

        The power draw is quite a bit higher, but it's still much more efficient as the performance is much higher.

        • dagmx3m

          The ANE and tensor cores are not comparable though. One is literally meant for low-cost inference while the other is meant for accelerating training.

          If you squint, yeah they look the same, but so does the microcontroller on the GPU and a full blown CPU. They’re fundamentally different purposes, architectures and scale of use.

          The ANE can’t even really be used directly. Apple heavily restricts the use via CoreML APIs for inference. It’s only usable for smaller, lightweight models.

          If you’re comparing to the tensor cores, you really need to compare against the GPU which is what gets used by apples ml frameworks such as MLX for training etc.

          It will still be behind the NVIDIA GPU, but not by anywhere near the same numbers.

          • llm_nerd3m

            >The ANE and tensor cores are not comparable though

            They're both built to do the most common computation in AI (both training and inference), which is multiply and accumulate of matrices - A * B + C. The ANE is far more limited because they decided to spend a lot less silicon space on it, focusing on low-power inference of quantized models. It is fantastically useful for a lot of on-device things like a lot of the photo features (e.g. subject detection, text extraction, etc).
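
            In code, the primitive both kinds of hardware are built around is just that fused matrix multiply-accumulate (a toy illustration; the real units do this on small, low-precision tiles per instruction):

                import numpy as np

                rng = np.random.default_rng(0)
                A = rng.standard_normal((16, 16)).astype(np.float16)  # low-precision inputs
                B = rng.standard_normal((16, 16)).astype(np.float16)
                C = np.zeros((16, 16), dtype=np.float32)              # higher-precision accumulator

                # D = A * B + C, with FP16 multiplies accumulated in FP32
                D = A.astype(np.float32) @ B.astype(np.float32) + C
                print(D.shape)  # (16, 16)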

            And yes, you need to use CoreML to access it because it's so limited. In the future Apple will absolutely, with 100% certainty, make an ANE that is as flexible and powerful as tensor cores, and they force you through CoreML because it will automatically switch to using it (where now you submit a job to CoreML and for many it will opt to use the CPU/GPU instead, or a combination thereof. It's an elegant, forward thinking implementation). Their AI performance and credibility will greatly improve when they do.

            >you really need to compare against the GPU

            From a raw performance perspective, the ANE is capable of more matrix multiply/accumulates than the GPU is on Apple Silicon, it's just limited to types and contexts that make it unsuitable for training, or even for many inference tasks.

          • NorwegianDude3m

            So now the TOPS are not comparable because M3 is much slower than an Nvidia GPU? That's not how comparisons work.

            My numbers are correct, the M3 Ultra has around 1 % of the TOPS performance of a RTX 5090.

            Comparing against the GPU would look even worse for apple. Do you think Apple added the neural engine just for fun? This is exactly what the neural engine is there for.

            • dagmx3m

              You’re completely missing the point. The ANE is not equivalent as a component to the tensor cores. It has nothing to do with comparison of TOPs but as what they’re intended for.

              Try and use the ANE in the same way you would use the tensor cores. Hint: you can’t, because the hardware and software will actively block you.

              They’re meant for fundamentally different use cases and power loads. Even apples own ML frameworks do not use the ANE for anything except inference.

    • Havoc3m

      >36 TOPS

      That's going to be the NPU specifically. Pretty much nothing on the LLM front seems to use NPUs at this stage (Copilot Snapdragon laptops aside), so not sure the low number is a problem.

    • llm_nerd3m

      I do think people are going a little overboard with all the commentary about AI in this discussion, and you rightly cite some of the empirical reasons. People are trying to rationalize buying one of these, but they're deluding themselves.

      It's nice that these devices have loads of memory, but they don't have remotely the necessary level of compute to be competitive in the AI space. As a fun thing to run a local LLM as a hobbyist, sure, but this presents zero threat to nvidia.

      Apple hardware is irrelevant in the AI space, outside of making YouTube "I ran a quantized LLM on my 128GB Mac Mini" type content for clicks, and this release doesn't change that.

      Looks like a great desktop chip though.

      It would be nice if nvidia could start giving their less expensive offerings more memory, though they're currently in the realm Intel was in 15 years ago, thinking that their biggest competition is themselves.

    • BonoboIO3m

      A factor of 100 faster in compute … wow.

      It will be interesting when somebody upgrades the RAM of the 5090 like they did with the 4090s.

      • bilbo0s3m

        They’re a bit confused and not comparing the same compute.

        Pretty sure they’re comparing Nvidia’s gpu to Apple’s npu.

        • NorwegianDude3m

          I'm not confused at all. It's the real numbers. Feel free to provide anything that suggests the TOPS of the GPU in M chips is higher than that of the dedicated hardware for this. But you can't, because it's not true. If you think Apple added the neural engine just for fun, then I don't know what to tell you.

          You have a fundamental flaw in your understanding of how both chips work. Not using the tensor cores would be slower, and the same goes for apples neural engine. The numbers are both for the hardware both have implemented for maximum performance for this task.

  • moondev3m

    > support for more than half a terabyte of unified memory — the most ever in a personal computer

    The AMD Ryzen Threadripper PRO 3995WX was released over four years ago and supports 2TB (64c/128t)

    > Take your workstation's performance to the next level with the AMD Ryzen Threadripper PRO 3995WX 2.7 GHz 64-Core sWRX8 Processor. Built using the 7nm Zen Core architecture with the sWRX8 socket, this processor is designed to deliver exceptional performance for professionals such as artists, architects, engineers, and data scientists. Featuring 64 cores and 128 threads with a 2.7 GHz base clock frequency, a 4.2 GHz boost frequency, and 256MB of L3 cache, this processor significantly reduces rendering times for 8K videos, high-resolution photos, and 3D models. The Ryzen Threadripper PRO supports up to 128 PCI Express 4.0 lanes for high-speed throughput to compatible devices. It also supports up to 2TB of eight-channel ECC DDR4 memory at 3200 MHz to help efficiently run and multitask demanding applications.

    • Shank3m

      > unified memory

      So unified memory means that the memory is accessible to the GPU and the CPU in a shared pool. AMD does not have that.

    • aaronmdjones3m

      > It also supports up to 2TB of eight-channel ECC DDR4 memory at 3200 MHz (sic) to help efficiently run and multitask demanding applications.

      8 channels at 3200 MT/s (1600 MHz) is only 204.8 GB/sec; less than a quarter of what the M3 Ultra can do. It's also not GPU-addressable, meaning it's not actually unified memory at all.
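
      The arithmetic behind that figure (each DDR4 channel is 64 bits wide, i.e. 8 bytes per transfer):

          channels, bytes_per_transfer, transfers_per_sec = 8, 8, 3200e6
          print(channels * bytes_per_transfer * transfers_per_sec / 1e9)  # 204.8 GB/s theoretical peak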

    • JamesSwift3m

      > unified memory

      It's a very specific claim that isn't comparing itself to DIMMs.

    • lowercased3m

      I don't think that's "unified memory" though.

    • 3m
      [deleted]
    • ryao3m

      I suspect that they do not consider workstations to be personal computers.

      • agloe_dreams3m

        No, the comment misunderstood the difference between CPU memory and unified memory. This can dedicate 500GB of high-bandwidth memory to the GPU: ~3.5x that of an H200.