73 comments
  • Luker888m

    I basically only buy AMD, but I want to point out how ROCm still doesn't fully support the 780M.

    I have a laptop with a 680M and a mini PC with a 780M, both beefy enough to play around with small LLMs. You basically have to force the GPU detection to an older version, and I get tons of GPU resets on both.
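
    For anyone hitting the same thing, the workaround I mean is forcing ROCm to treat the iGPU as an officially supported gfx target via an environment override - a rough sketch, and the exact values depend on your chip:

      # 780M (gfx1103): report it as gfx1100, which ROCm supports
      export HSA_OVERRIDE_GFX_VERSION=11.0.0
      # 680M (gfx1035): report it as gfx1030 instead
      # export HSA_OVERRIDE_GFX_VERSION=10.3.0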

    AMD your hardware is good please give the software more love.

    • dannyw8m

      AMD doesn't realise the wide penetration and availability of CUDA is what makes the ecosystem so strong. Developers can develop and test on their personal devices which are prevalent, and that's what creates such a big software ecosystem for the expensive chips.

      When I raised this feedback with our AMD Rep, they said it was intentional and that consumer GPUs are primarily meant for gaming. Absolutely shortsighted.

      • brookst8m

        I can forgive AMD for not seeing how important CUDA was ten years ago. Nvidia was both smart and lucky.

        But failing to see it five years ago is inexcusable. Missing it two years ago is insane. And still failing to treat ML as an existential threat is, IDK, I’ve got no words.

        • KeplerBoy8m

          That's beside the point. They are offering ML solutions. I believe PyTorch and most other stuff works decently well on their datacenter/HPC GPUs these days. They just haven't managed to offer something attractive to small-scale enterprises and hobbyists, which costs them a lot of mindshare in discussions like these.

          But they're definitely aware of AI/ML stuff, pitching it to their investors, acquiring other companies in the field and so on.

          • mrguyorama8m

            Meanwhile the complete lack of enthusiast ML software for their consumer grade cards means they could put gobs of GPU memory on their GPUs without eating into their HPC business line.

            I feel like that's something they would be explaining to their investors if it was intentional though.

            • KeplerBoy8m

              Not sure which complete lack you're talking about. You can run the SotA open source image and text generation models on the 7900 xtx. They might be one or two iterations behind their nvidia counterparts and you will run into more issues, but there is a community.

      • carlmr8m

        It's either strategic incompetence, technical incompetence, or both at this point.

        • ahartmetz8m

          At least they seem to be seriously trying to fix it now.

          https://www.techpowerup.com/324171/amd-is-becoming-a-softwar...

        • keyringlight8m

          One thing I wonder about with AMD: they know the history of how CUDA got to its current position, but even if you say competing in that market is fighting yesterday's war and they don't want to dedicate many resources to it, they don't seem to have much vision for starting, and committing long-term to, whatever could be the next big thing. What projects do they have that boost their strengths and can't be copied easily? The examples I would point to are Zen (a massive course correction after K10-Bulldozer) and early HSA after acquiring ATI.

        • anaisbetts8m

          I suspect that it is legal fears, tbh - it is almost certain that if AMD or anyone else tried to make some kind of CUDA compatibility layer, Nvidia would pretty fiercely sue them into the ground. This is almost certainly why both Intel and AMD bailed on ZLUDA.

          • carlmr8m

            They don't need compatibility, they need functionality in the first place.

            Most AI workloads use an abstraction layer anyway, e.g. pytorch.
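
            As a rough sanity check (assuming the ROCm build of PyTorch, which exposes the HIP backend through the usual torch.cuda API):

              # prints True plus the HIP version on a working ROCm install
              python3 -c 'import torch; print(torch.cuda.is_available(), torch.version.hip)'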

      • dboreham8m

        Obviously anything that's known on this thread is known to AMD management, or at least their assistants.

    • hedora8m

      I recently tried to set up Linux on a few machines with Nvidia and AMD GPUs, and, while AMD could improve, they're way ahead of Nvidia on all fronts except machine learning.

      Nvidia's drivers are still uniformly garbage across the board (as they have been for the last 20 years), but they do work sometimes, and I guess they're better for machine learning. I have a pile of "supported" Nvidia cards that can't run most OpenGL/GLX software, even after installing DKMS, recompiling the planet, etc, etc, etc.

      Since AMD upstreamed their stuff into the kernel, everything just works out of the box, but you're stuck with ROCm.

      So, for all use cases except machine learning, AMD's software blows Nvidia's out of the water for me. This includes running Windows games, which works better under Linux than Windows (the last time I checked), thanks to Steam.

      On my 780M, I installed current Devuan (~= Debian) stable, and had a few xscreensaver crashes and reboots. I checked dmesg, and it had clear errors about IRQ state machines being wrong for some of the Radeon stuff. So, even when running hardware newer than the kernel, their error logs are great.

      After enabling backports and upgrading the kernel, the dmesg errors went away, and it's a 100% uptime machine.
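
      For reference, the fix was roughly this (plain Debian shown; Devuan uses its own suite names, so adjust accordingly):

        echo 'deb http://deb.debian.org/debian bookworm-backports main' | sudo tee /etc/apt/sources.list.d/backports.list
        sudo apt update
        sudo apt install -t bookworm-backports linux-image-amd64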

      The remaining problem is that PulseAudio is still terrible after all these years, so I have to repeatedly switch the audio output back to HDMI.
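
      What I end up running each time is roughly this (the sink name is just an example; use whatever `pactl list short sinks` reports for your HDMI output):

        pactl list short sinks
        pactl set-default-sink alsa_output.pci-0000_c1_00.1.hdmi-stereo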

      • colordrops8m

        Use pipewire instead of pulseaudio. Much better.

        • hedora8m

          I would have already switched, but, apparently, the pipewire authors decided it shouldn't daemonize properly:

          https://dev1galaxy.org/viewtopic.php?id=5867

          Taking a dependency on systemd is a strange choice for a project whose entire point is ripping out Poettering's second-to-last train wreck.

          • aktau8m

            That doesn't read like taking a dependency on systemd. Rather, Pipewire doesn't have custom code to "daemonize" by itself.

            Is there any reason why all individual tools should learn how to daemonize (in addition to, or in replacement of, running in the foreground)? There are external tools that can take care of that uniformly, using the latest/greatest syscalls for it. That seems better than every application including this code. As highlighted in the thread, there are other programs that can launch and daemonise another process (like the aptly named [daemon(1)](https://manpages.debian.org/unstable/daemon/daemon.1.en.html) tool). That seems more like the UNIX way.
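
            As a sketch of what that looks like with libslack's daemon(1) (untested with PipeWire specifically, and the binary path is an assumption):

              # background and supervise pipewire, restarting it if it exits
              daemon --respawn -- /usr/bin/pipewire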

            • hedora8m

              That tool’s RSS is somehow 170KB (vs zero for a self-daemonizing process).

              Also, it’s incredibly complicated. (I looked at the source.)

              Here’s a writeup of a simple daemon: https://pavaka.github.io/simple-linux-daemon-tutorial/

              Given that it’s typed once (by the daemon author, and not the end user), it seems like a big win vs. daemon(1) to me.

              • aktau8m

                > That tool’s RSS is somehow 170KB (vs zero for a self-daemonizing process).

                Why is the RSS relevant? I assume it doesn't need to keep on running. Also, even if it kept running, 170KB is not the end of the world.

                > Also, it’s incredibly complicated. (I looked at the source.) Here’s a writeup of a simple daemon: https://pavaka.github.io/simple-linux-daemon-tutorial/

                Maybe it's complicated, but perhaps it's trying to replicate daemon(3) without bugs, and for different processes. See the BUGS section in the daemon(3) man page.

                > Given that it’s typed once (by the daemon author, and not the end user), it seems like a big win vs. daemon(1) to me.

                This seems like a false comparison. It's not the case that the end user writes the code to daemonise in the non-included case. The user would just use daemon(1) or systemd(8) or something else that can daemonise. Or perhaps a service manager that doesn't need to daemonise, like runit(8) (https://smarden.org/runit/) and its ilk.

                The more I read about this, the more I want to know why it's so important that pipewire is running "daemonized" (whether it does it itself or not). Can you explain the advantages and disadvantages?

    • stuaxo8m

      Having had AMD Ryzen laptops for the last 6-plus years, so much this.

      Right now I'm messing around trying to get PyTorch Vulkan support compiling just so I can avoid switching to ROCm.
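
      For anyone else trying this, the relevant build flag is roughly the following - treat it as a sketch, since the Vulkan backend is still experimental and mostly mobile-focused:

        git clone --recursive https://github.com/pytorch/pytorch && cd pytorch
        USE_VULKAN=1 python3 setup.py develop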

  • rishav_sharan8m

    I love how well Intel's Arc iGPUs and AMD's Strix Point iGPUs are doing. I am planning to get an iGPU laptop with 64 GB of RAM. I plan on using local LLMs and image generators, and hopefully with that much shared RAM that shouldn't be too much of a problem. But I am worried that all LLM tools today are pretty much Nvidia-specific, and I wouldn't be able to get my local setup going.

    • replete8m

      I've noticed some BIOSes do not allow the full capacity of unified memory to be allocated, so check this if you go that route: some let you allocate 16GB, while others are limited to 2 or 4GB, seemingly unnecessarily.

      • ComputerGuru8m

        Apparently this is a legacy holdover and you should choose the smallest size in the BIOS. Fully unified memory is the norm; you don't need to do the memory splitting that way.

      • 8m
        [deleted]
    • aappleby8m

      You'll be limited by memory bandwidth more than compute.

      • imtringued8m

        Anyone who uses a CPU for inference is severely compute constrained. Nobody cares about tokens per second the moment inference is faster than you can read, but staring down a blank screen for 5 minutes? Yikes.

        • lhl8m

          Just as a point of reference, this is what a 65W power-limited 7940HS (Radeon 780M) with 64GB of DDR5-5600 looks like w/ a 7B Q4_K_M model atm w/ llama.cpp. While it's not amazing, at 240 t/s prefill it means that at 4K context you'll wait about 17 seconds before token generation starts, which isn't awful. The 890M should have about 20% better compute, so about 300 t/s prefill, and with LPDDR5X-7500/8000 you should get to about 20 t/s.

            ./llama-bench -m /data/ai/models/llm/gguf/mistral-7b-instruct-v0.1.Q4_K_M.gguf
            ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
            ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
            ggml_cuda_init: found 1 ROCm devices:
            Device 0: AMD Radeon 780M, compute capability 11.0, VMM: no
            | model                          |       size |     params | backend    | ngl |          test |              t/s |
            | ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
            | llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | ROCm       |  99 |         pp512 |    242.69 ± 0.99 |
            | llama 7B Q4_K - Medium         |   4.07 GiB |     7.24 B | ROCm       |  99 |         tg128 |     15.33 ± 0.03 |
          
            build: e11bd856 (3620)
          • ComputerGuru8m

            > you'll wait about 17 seconds before token generation starts, which isn't awful

            Let’s be honest, it might not be awful but it’s a nonstarter for encouraging local LLM adoption, and most will prefer to pay pennies for API access instead (friction aside).

            • lhl8m

              I don't know why anyone would think a meh performing iGPU would encourage local LLM adoption at all? A 7B local model is already not going to match frontier models for many use cases - if you don't care about using a local model (don't have privacy or network concerns) then I'd argue you probably should use an API. If you care about using a capable local LLM comfortably, then you should get as powerful a dGPU as your power/dollar budget allows. Your best bang/buck atm will probably be Nvidia consumer Ada GPUs (or used Ampere models).

              However, for anyone that is looking to use a local model on a chip with the Radeon 890M:

              - look into implementing (or waiting for) NPU support - XDNA2's 50 TOPS should provide more raw compute than the 890M for tensor math (w/ Block FP16)

              - use a smaller, more appropriate model for your use case (3Bs or smaller can fulfill most simple requests), which of course will also be faster

              - don't use long conversations - when your conversations start they will have 0 context and no prefill; no waiting for context

              - use `cache_prompt` - for bs=1 interactive use you can save input/generations to the cache so repeated context isn't reprocessed (see the sketch below)
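
              A minimal sketch of that last one against llama.cpp's server (`llama-server`) - field names from memory, so double-check against your build:

                # with `llama-server -m model.gguf` running on the default port
                curl http://localhost:8080/completion -d '{
                  "prompt": "You are a helpful assistant.\nUser: hello\nAssistant:",
                  "n_predict": 64,
                  "cache_prompt": true
                }'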

          • szundi8m

            For a lot of use cases it is actually awful

        • jodleif8m

          The problem is memory bandwidth. There is a reason Apple MacBooks do relatively well with LLMs: it's not that the GPU is any better than Zen 5's, but 4-6x the memory bandwidth is huge (80-ish GB/s vs 400 GB/s).

        • aurareturn8m

          >Nobody cares about tokens per second the moment inference is faster than you can read, but staring down a blank screen for 5 minutes? Yikes.

          I don't think so. Humans scan for keywords very often; nobody really reads every word. Faster-than-reading-speed inference is definitely beneficial.

          • brookst8m

            And thank you for making me conscious of my reading while reading your comment. May you become aware of your breathing.

    • dagmx8m

      Can they access the full RAM? Afaik they get capped to a portion of total available RAM.

      But to your other point, very little of the current popular ML stack does more than CUDA and MPS. Some will do rocm but I don’t know if the AMD iGPUs are guaranteed to support it? There’s not much for Intel GPUs.

      • hedgehog8m

        It depends on the API used, and on whether the data is in the region considered "GPU memory" or shared with the compute API from the app's memory space. Support is somewhat in flux and I haven't been following closely, but if you're curious, this is my bookmarked jumping-off point (a PyTorch ticket about this):

        https://github.com/pytorch/pytorch/issues/107605

        • slavik818m

          My understanding is that as of Linux 6.10, the driver will now dynamically allocate more memory for the iGPU [1]. The driver team apparently reused a strategy that had been developed for MI300A.

          I'm hoping that in combination with the gfx11-generic ISA introduced in LLVM 18, this will make it straightforward to enable compute applications on both Phoenix and Strix (even if they are not officially supported by ROCm).

          [1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

          • lhl8m

            One issue is that even if you are using GTT (dynamically allocated memory), this is still limited as a percentage of total RAM. Eg, currently on my 7940HS, I have 64GB of memory, 8GB of dedicated VRAM (GART), and then a limit of 28GB of GTT - there is an amdgpu.gttsize parameter to "Restrict the size of GTT domain in MiB for testing. The default is -1 (It’s VRAM size if 3GB < VRAM < 3/4 RAM, otherwise 3/4 RAM size)", but I'm not sure what the practical/effective limits are.
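
            If you want to experiment with raising it, this is roughly how (sizes are in MiB; the value here is just an example, not a recommendation):

              # either amdgpu.gttsize=49152 on the kernel command line, or:
              echo 'options amdgpu gttsize=49152' | sudo tee /etc/modprobe.d/amdgpu-gtt.conf
              sudo update-initramfs -u && sudo reboot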

          • dagmx8m

            FWIW, it’s one of the patches linked from the issue above.

            However the commits do have some caveats called out, as do the techniques they use to achieve the higher allocations.

        • dagmx8m

          I appreciate the link. The GTT and SDMA tricks mentioned there don’t really increase the shared ram use imho. They just increase the virtual memory the GPU can address, but with several tradeoffs in terms of allocation and copy operations.

          As an aside, it just feels like a lot of hacks that AMD and Intel should have handled ages ago for their iGPUs instead of letting them languish.

      • sillystuff8m

        > Some will do rocm but I don’t know if the AMD iGPUs are guaranteed to support it?

        If you only care about inference, llama.cpp supports Vulkan on any iGPU with Vulkan drivers. On my laptop with a crap BIOS that does not allow changing any video RAM settings, reserved "vram" is 2GB, but llama.cpp-vulkan can access 16GB of "vram" (half of physical RAM). 16GB of vram is sufficient to run any model that has even remotely practical execution speed on my bottom-of-the-line Ryzen 3 3250U (Picasso/Raven 2); you can always offload some layers to the CPU to run even larger models.

        (on Debian stable) Vulkan support:

          apt install libvulkan1 mesa-vulkan-drivers vulkan-tools
        
        Build deps for llama.cpp:

          apt install libshaderc-dev glslang-dev libvulkan-dev
        
        Build llama.cpp with vulkan back-end:

          make clean  # I added this, in case you previously built with a different back-end
        
          make LLAMA_VULKAN=1
        
        If you have more than one GPU: when running, you have to set GGML_VK_VISIBLE_DEVICES to the indices of the devices you want, e.g.,

          export GGML_VK_VISIBLE_DEVICES=0,1,2
        
        The indices correspond to the device order in

          vulkaninfo --summary
        
        By default llama.cpp will only use the first device it finds.

        llama.cpp-vulkan has worked really well for me. But, per benchmarks from back when Vulkan support was first released, the CUDA back-end was faster than the Vulkan back-end on NVIDIA GPUs; probably the same for ROCm vs Vulkan on AMD too. But zero non-free / binary blobs are required for Vulkan, and Vulkan supports more devices (e.g., my iGPU is not supported by ROCm) - haven't tried, but you can probably mix GPUs from different manufacturers using Vulkan.

      • guilamu8m

        Be careful: most BIOSes will let you use only 1/4 of the total RAM for the integrated GPU. Some really bad BIOSes even limit it to 2GB, totally ignoring how much RAM is available.

    • allen_fisher8m

      I set up both Stable Diffusion and LLMs on my desktop without an Nvidia GPU. Everything works well. Stable Diffusion runs on the ONNX backend on my AMD GPU, and LLMs run in GGUF format through Ollama on the CPU, though model scale and speed are limited.

    • jodleif8m

      The problem here is the slow memory… the iGPU is really already limited by slow ram, and with LLMs memory bandwidth is king

  • bornfreddy8m

    Interesting:

    > With Strix Point, AMD’s mobile iGPU has a newer graphics architecture than its desktop counterparts. It’s an unprecedented situation, but not a surprising one. Since the DX11 era, AMD has never been able to take and hold the top spot in the discrete GPU market. Nvidia has been building giant chips where cost is no object for a long time, and they’re good at it. Perhaps AMD sees lower power gaming as a market segment where they can really excel. Strix Point seems to be a reflection of that.

    Did AMD figure out that this market segment is underserved by NVidia? If so, good for them, laptops could use better GPUs.

    • dagmx8m

      I doubt Strix Point is gunning for NVIDIA.

      It’s more than likely this is just a stronger play to get ahead of Intel in market share.

      That’s a much more tangible competitor in that space.

      Whether it means more games optimize for AMD as a side effect is tangential at best. Otherwise there’s no real reason to treat this as competing with NVIDIA. It’s an integrated GPU so it’s not moving any extra units.

    • mmaniac8m

      Nvidia can't really enter this segment unless Windows on ARM takes off, and they don't want to be the one to put the first foot forward.

      If Snapdragon X Elite is a success, you can bet Nvidia will be producing laptop SoCs with passable CPUs and great iGPUs.

  • torrance8m

    These results are promising and hopefully carry over to the upcoming Strix Halo, which I'm eagerly awaiting. With a rumoured 40 compute units and performance on par with a low-power (<95W) mobile RTX 4070, it would make an exciting small form factor gaming box.

    • jauntywundrkind8m

      I've been super excited for Strix Halo, but I'm also nervous. Strix Halo is a multi-chip design, and I'm not sure AMD can pull that off in a mobile form factor while still delivering a good mobile chip.

      Strix Point can be brought down to 15W and still do awesome, and go up to 55W+ and be fine. Nice idles. But it's monolithic, and I'm not sure AMD & TSMC have really brought the power penalty of multi-chip designs down enough.

      • luyu_wu8m

        Very valid concerns! AMD's current die-to-die interconnects have some pretty abysmal energy/bit. I really hope they can pull off something similar to Intel's EMIB.

        • wtallis8m

          The rumors saying Strix Halo will be a multi-chip product claim it's re-using the existing Zen 5 CPU chiplets from the desktop and server parts and just replacing the IO die with one that has a vastly larger iGPU and double the DRAM controllers. So they might be able to save a bit of power with more advanced packaging that puts the chiplets right next to each other, but it'll still be the same IF links that are power-hungry on the desktop parts.

      • asmor8m

        The 7945HX3D needs 55W minimum, if that's any indicator.

    • KingOfCoders8m

      I hope Strix Halo gets a desktop motherboard (no socket :-() for the memory bandwidth, for faster compiles (Go). That, or a 9950X3D (like the 7950X3D).

      • naoru8m

        Me too. There's at least one manufacturer that makes a pretty sweet mini-ITX motherboard with the R9 7945HX; I hope they follow up with Strix Halo once it's released.

    • layer88m

      That kind of performance will still require significant cooling, which, if you want it to be quiet, is helped by a larger box.

  • aurareturn8m

    Some comparisons:

    4k Aztec High GFX

    * AMD 890M: 39.1fps

    * M3: 51.8fps

    3DMark Wild Life Extreme

    * AMD 890M: 7623

    * M3: 8286

    Power:

    * AMD 890M: 46w

    * M3: 17w

    M3 about ~253% more efficient.

    But of course, if your goal is gaming, AMD's GPU will still be better because of Vulkan, DirectX, and Windows support. In pure architecture, AMD is quite a bit behind Apple.

    • adrian_b8m

      The "170%" number is bogus.

      Reducing the power of 890M to 17 W, the same as quoted for M3, would reduce the performance much less than the reduction in power consumption, improving the energy efficiency.

      For a valid comparison of the energy efficiency, both systems must be configured for the same power consumption.

      Moreover, by themselves those performance values do not prove that AMD is behind Apple in GPU architecture.

      The better performance of the Apple GPU could be entirely caused by the much higher memory bandwidth and by the better CMOS process used for the Apple GPU.

      For any conclusions about architecture, much more detailed tests would be needed, to separate the effects of the other differences that exist between these systems.

      • aurareturn8m

        >The "170%" number is bogus.

        Actually, it's 253%. I made a mistake assuming the 890M was limited to 35w. It was actually 46w as measured by Notebookcheck.[0]

        >Reducing the power of 890M to 17 W, the same as quoted for M3, would reduce the performance much less than the reduction in power consumption, improving the energy efficiency.

        That depends. Sure, give almost any chip less power and it will be more efficient. I'm not arguing against that.

        The problem with reducing power for the 890M is that it's already slower than the M3 by 26% while using 2.7x the power.

        If you give the 890M 17w, yes, it will be more efficient than at 46w. It will just be even slower than the M3.

        >The better performance of the Apple GPU could be entirely caused by the much higher memory bandwidth

        M3's bandwidth is 102.4 GB/s. AMD Strix Point uses LPDDR5X-7500 in dual channel mode, so it should be around 120 GB/s (7500 MT/s x 16 bytes per transfer on a 128-bit bus).

        >and by the better CMOS process used for the Apple GPU.

        AMD's Strix Point is manufactured on TSMC's N4P. M3 is on N3B, which is roughly 10% more power efficient than N4P. It doesn't explain the huge discrepancy in efficiency.

        [0]https://www.notebookcheck.net/AMD-Zen-5-Strix-Point-iGPU-ana...

    • 8m
      [deleted]
    • 8m
      [deleted]
  • mastax8m

    How fortunate for Intel that, as soon as they ruin their CPU naming scheme, AMD follows suit.

    • mmaniac8m

      AMD's mobile CPUs have had confusing names since 2022 with the 7000 series, and are now completely bonkers: the unnecessary insertion of "AI" into every name, the TDP suffix now an optional prefix, and the poorly justified generation counter starting at 300.

      Intel, on the other hand, were fairly sensible: i7 becomes Ultra 7 and the numbering restarts from 100 (Meteor Lake). That's easy to follow.

  • setgree8m

    So what was AMD thinking with its release of the 8700G and 8600G APUs, and is it planning to phase them out?

    They come with the 780M and 760M iGPUs, respectively, and both are outperformed by the 890M at a lower power draw [0]. Theoretically a consumer can't put the laptop part directly in a PC, but there's already a mini-PC built around the 890M [1]. The 8700G sometimes shows up in mid-range and high-end gaming PCs with discrete graphics cards [2], which makes so little sense that I wonder if AMD quietly offloaded them in bulk at a steep discount to vendors.

    I've commented on this before [3], can anyone shed light on the situation?

    [0] https://www.anandtech.com/show/21485/the-amd-ryzen-ai-hx-370...

    [1] https://www.tomshardware.com/desktops/mini-pcs/soyos-upcomin...

    [2] https://www.tomshardware.com/desktops/gaming-pcs/hp-omen-35l...

    [3] https://news.ycombinator.com/item?id=41140287

    • 8m
      [deleted]
    • luyu_wu8m

      APUs have their own small niche in the DIY market! From what I understand, some people build their own computer without a dGPU first, then later purchase a GPU, and they prefer to be able to change the CPU at that point as well. Hopefully this rationale is what you're asking for!

      • setgree8m

        In theory this makes sense, but reviews all suggested that the price proposition for the APUs didn't really work, and now AMD has released a laptop part that fills the same niche. DIY folks won't be able to get their hands on the laptop part right away, but presumably soon? It just seems kind of odd to release something that outperforms your own product in the same segment.

        • burmanm8m

          The 8600G/8700G are installed in a normal AM5 socket and can be swapped. The laptop parts are soldered (BGA), not socketed, and are not going to be available on the AM5 platform.

          • setgree8m

            This isn't my niche but this makes sense at face value. I couldn't find any sales data for either processor but maybe the DIY market is bigger than I know.

  • Shorel8m

    Similar performance to Nvidia 1080 dedicated GPU.

    Would I get it? Absolutely yes. A full desktop in a small form factor is a very convenient, nice thing.

    • dagmx8m

      Where do you see a performance comparison for the 1080?

      The only mention of NVIDIA in the post is of the 1050 which is a considerable step away from a 1080.

      > It also moves ahead of Nvidia’s Pascal based GTX 1050 3 GB