Using the Matrix Cores of AMD RDNA 4 architecture GPUs

semessier 13 hours ago

ROCm, PyTorch?

DiabloD3 13 minutes ago

What about them?
RDNA4 is officially supported on ROCm (release for that came out shortly after the drivers shipped), and PyTorch officially supports ROCm and AMD officially supports PyTorch's ROCm target.
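For anyone who wants to check what their own install is doing, here is a minimal sketch, assuming a ROCm build of PyTorch may or may not be present. The helper name `describe_pytorch_backend` is made up for illustration; `torch.version.hip` and the `torch.cuda` calls are the ones ROCm builds expose.

```python
def describe_pytorch_backend():
    """Return a short description of the GPU backend this PyTorch build targets."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    # torch.version.hip is a version string on ROCm builds and None otherwise.
    hip_version = getattr(torch.version, "hip", None)
    if hip_version is None:
        return "PyTorch build without ROCm/HIP support"

    # ROCm builds expose AMD GPUs through the torch.cuda namespace.
    if torch.cuda.is_available():
        return f"ROCm/HIP {hip_version}, device 0: {torch.cuda.get_device_name(0)}"
    return f"ROCm/HIP {hip_version}, but no usable GPU was detected"


print(describe_pytorch_backend())
```

This only tells you which backend the build targets and whether a device is visible; it says nothing about stability under load.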
roenxi 6 hours ago

People go a bit crazy about CUDA, ROCm and PyTorch, but I've been watching for a few years and have seen no evidence whatsoever that they are serious blockers. PyTorch does work on AMD cards, and whatever ROCm can't do doesn't seem to be important, because no one in my line of sight has articulated why they need it. By far AMD's biggest problem is that their Linux kernel drivers historically don't seem to be able to handle GEMM workloads without kernel panics.
Having some senior engineers taking a public interest in putting up this sort of article is rather exciting. I'm not going to give AMD the benefit of the doubt after their horrific performance in the 2010s and early 2020s but observing from a safe distance - they do look like they're on the right track and possibly even a fair way down the path to getting into the game.
- fancyfredbot 4 hours ago
  
  You seem to be saying AMD GPUs can run PyTorch but can't run GEMM? Can you explain? I thought PyTorch used GEMM extensively.
  I also don't understand the comment on "whatever ROCm can't do doesn't seem to be important because no-one has articulated why they need it in my line of sight". Isn't the problem with ROCm the lack of support? It's only officially supported on a tiny proportion of AMD's product line?
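For readers unfamiliar with the term: GEMM is the general matrix multiply `C = alpha * A @ B + beta * C`, which is what a linear layer's forward pass boils down to. A minimal pure-Python sketch of the operation, purely illustrative (real frameworks hand this to a tuned BLAS library, and the function name `gemm` here is just for the example):

```python
def gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for matrices given as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("c must have shape (rows of a) x (columns of b)")
    return [
        [
            alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
            for j in range(n)
        ]
        for i in range(m)
    ]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
print(gemm(1.0, a, b, 0.0, c))
```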
  - benreesman 3 hours ago
    
    Parent seems to be saying that stability issues on consumer RDNA cards are the issue as opposed to ROCm support in PyTorch.
    
    imtringued 2 hours ago
    
    My pet theory is that the scheduler can't handle desktop graphics + ML workloads simultaneously, which leads to a deadlock in the firmware.
    
    DiabloD3 3 minutes ago
    
    Your pet theory needs work.
    1) I have no issues throwing ROCm (which still uses its own path in the driver) at my Radeon at the same time I'm hammering it with "normal" path APIs (Legacy D3D, D3D12, OpenGL, Vulkan, etc), they are scheduled normally and compete for GPU resources normally.
    2) ROCm memory allocation is weird in the driver. I have gotten my GPU to hardlock the entire system by allocating about 2x my VRAM, because I suspect it's misusing/overusing mprotect().
    
    compsciphd 22 minutes ago
    
    For those who are using it for ML loads, what's the point of using it for desktop graphics (at least at the same time)?
    I.e., I'd argue that unless one is getting server chips (which negates the desktop graphics comment), the vast majority of modern CPUs come with iGPUs that are sufficient for running a desktop environment. Unless one is planning to game on the same machine (but again, probably not at the same time), if the above is the problem, why not use the iGPU for the desktop and the dGPU for your ML workloads?
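A rough sketch of the compute side of that split, assuming the HIP runtime's `HIP_VISIBLE_DEVICES` environment variable is used to hide everything except the discrete card. The helper `restrict_to_gpu` and the index value are hypothetical; which index the dGPU gets is machine-specific.

```python
import os


def restrict_to_gpu(device_index, env=None):
    """Return a copy of env in which only the given HIP device index is visible."""
    if device_index < 0:
        raise ValueError("device_index must be non-negative")
    new_env = dict(os.environ if env is None else env)
    new_env["HIP_VISIBLE_DEVICES"] = str(device_index)
    return new_env


# Assumption: index 0 is the discrete card on this machine; check with rocminfo.
DGPU_INDEX = 0
os.environ.update(restrict_to_gpu(DGPU_INDEX))
# Import torch (or launch the training subprocess) only after this point.
print(os.environ["HIP_VISIBLE_DEVICES"])
```

The variable has to be set before the HIP runtime initializes, which is why it goes ahead of any framework import. Routing the desktop itself to the iGPU is a separate display configuration step not shown here.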

ROCm support for RDNA4 cards in Ubuntu is very poor. Worse yet, it's looking like Ubuntu 25.10 isn't going to make anything better.

benreesman 2 hours ago

There's a fundamental tension between "conservative desktop-origin mass appeal Linux distribution" and "extreme performance hardware accelerated numerics coprocessor". In the places where a mass market need is obvious (local LLM) solutions are emerging with vibrant and diverse back end options (gguf).
I think it's OK to install stuff to get extreme scientific compute performance.
I use NixOS BTW.
- incomingpain 2 hours ago
  
  >There's a fundamental tension between "conservative desktop-origin mass appeal Linux distribution" and "extreme performance hardware accelerated numerics coprocessor". In the places where a mass market need is obvious (local LLM) solutions are emerging with vibrant and diverse back end options (gguf).
  /me urgently awaits devstral 2507 on ollama
  >I use NixOS BTW.
  As a desktop environment? Tell me more! Please!
veber-alex 2 hours ago

I thought you can just get all the stuff you need in a docker container for both AMD and NVIDIA.
What does the OS even matter?
hardolaf 3 hours ago

I'm going to say this in a not nice way: that's a you problem.
You willingly use a distribution which purposely ships out of date software based on some misguided philosophical belief that such a behavior makes the system better or more stable. In reality, it just means that you're running out of date software with security vulnerabilities, bad driver support, and even worse distribution maintainer half-assed patches to fix the aforementioned vulnerabilities.
I'm not saying that you should switch to Arch Linux, but there is a wide gap between RHEL and Debian based distributions and a continuously rolling distribution. There are distributions that update weekly, biweekly, monthly, quarterly, etc.
- cpgxiii 2 hours ago
  
  AMD officially supports precisely three Linux platforms for current ROCm:
  1. Ubuntu 24.04.4 with kernel 6.11
  2. Ubuntu 22.04.5 with kernel 6.8
  3. RHEL 9.6 with kernel 5.14
  Anything else, like your preferred rolling release distribution, is entirely on your own.
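  To make that concrete, here is a small sketch that checks a machine against the three combinations listed above. The table is copied from this comment rather than from AMD's documentation, point releases are dropped because `VERSION_ID` in os-release normally carries only major.minor, and the names `SUPPORTED` and `is_listed` are made up for the example.

```python
import platform

# Combinations quoted in the comment above; not checked against AMD's documentation.
SUPPORTED = {
    ("ubuntu", "24.04"): "6.11",
    ("ubuntu", "22.04"): "6.8",
    ("rhel", "9.6"): "5.14",
}


def is_listed(distro_id, version_id, kernel_release):
    """True if distro, version and kernel major.minor match a listed combination."""
    expected_kernel = SUPPORTED.get((distro_id.lower(), version_id))
    if expected_kernel is None:
        return False
    kernel_major_minor = ".".join(kernel_release.split(".")[:2])
    return kernel_major_minor == expected_kernel


try:
    # freedesktop_os_release() needs Python 3.10+ and an os-release file.
    info = platform.freedesktop_os_release()
    print(is_listed(info.get("ID", ""), info.get("VERSION_ID", ""), platform.release()))
except (OSError, AttributeError):
    print("no os-release information available")
```

  A `False` here only means the combination is not on the list quoted above, not that ROCm will fail to run.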
  - hardolaf 2 hours ago
    
    Sure that makes sense from a support perspective. My FPGA tools also only support a small number of OSes. But they, like ROCm, run fine on pretty much anything as up to date or newer.
- incomingpain 2 hours ago
  
  >I'm going to say this in a not nice way: that's a you problem.
  I always prefer this.
  >I'm not saying that you should switch to Arch Linux,
  Especially when Arch isn't supported at all by any version, and the card is quite likely to not even work as a video card. Manjaro is also not supported.
  >but there is a wide gap between RHEL and Debian based distributions and a continuously rolling distribution. There are distributions that update weekly, biweekly, monthly, quarterly, etc.
  RHEL seems to be up to date; the RHEL from May is well supported. I have tested out Alma as VMs, but I haven't used even Fedora or CentOS in ages.
  - hardolaf 2 hours ago
    
    I will agree that RHEL has gotten better about upgrading software when they do minor releases but I'm still painfully aware of the pre-9.X days when they would release a new version and the software was already a year out of date.
    I personally used Fedora for a long time at the same time as I ran Arch Linux on servers. I honestly couldn't really tell the difference as long as I was updating Fedora every time a version bump came out. The release cadence was fast enough that it never caused problems. I ended up switching to it for my home devices entirely. Though now I run SteamOS and CachyOS because they're Arch without the headaches of Arch.