How AMD GPUs actually compute.
Short, dense decks on the silicon underneath hipfire. Two
compute families, two matrix paths, one Rust engine that has to span
both. The HIP/ROCm side of the conversation that brrr-style
CUDA explainers don't cover.
RDNA & WMMA
wave32 silicon, GDDR6 / system-RAM, 16×16 matrix tiles via WMMA, and a dispatch story that has to span five generations from RDNA1 (no matrix unit) to RDNA4 (WMMA32, 1.5× the register file). With empirical hipfire numbers across gfx1010 → gfx1201.
CDNA & MFMA
The Instinct lineage. wave64, HBM3 / HBM3E / HBM4, v_mfma_*
16×16 / 32×32 matrix tiles, and a transistor budget that
throws away the graphics pipeline. From the roofline up through
CU internals to rocprofv3 metrics.
Why hipfire — RDNA vs NVIDIA
7900 XTX vs RTX 4090, head-to-head. Where the chips look alike, where they diverge (matrix-tile shape, cache topology, register file, scheduling), and why hipified-CUDA kernels reach only 37% of GDDR6 bandwidth while hand-written hipfire kernels reach 69%. The empirical case for RDNA-native silicon targeting.
why this exists
The CUDA side of the world has a healthy supply of explainers: roofline plots, tensor-core diagrams, Nsight walkthroughs. The HIP/ROCm side has reference docs but very little of the same kind of load-bearing pictorial intuition. Worse, most CUDA explainers conflate “GPU” with “NVIDIA datacenter GPU,” which silently bakes in 32-thread warps, HBM, tensor cores, and CUDA cores as if those were universal. On AMD silicon, two of those four are wrong most of the time.
These decks are the load-bearing pictures we wanted while writing hipfire. The numbers are measured on the cards in the same drawer that runs the engine — not slide-deck nameplate TFLOPs, not vendor marketing, not synthetic micro-benches.
Each deck is a self-contained read; they cross-reference each other. Start with whichever family you own.
the divergence, in one table
The single most useful framing — same parent company, same HIP source, two fundamentally different chips.
| | CDNA (Instinct) | RDNA (Radeon / Ryzen AI) |
|---|---|---|
| wavefront | 64 threads | 32 threads |
| matrix unit | v_mfma_* | v_wmma_* (RDNA3+) |
| memory | HBM3 / HBM3E / HBM4 | GDDR6 / system RAM |
| graphics pipeline | none | full |
| LLVM target | gfx908 → gfx950 | gfx1010 → gfx1201 |
| typical user | cloud / HPC / lab | desktop / workstation / iGPU |
The same .hip source compiles for either — but ported
kernels hit three landmines: wavefront width, matrix intrinsic, and
memory tier sizing. The decks walk each one.
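
As a rough illustration of the first two landmines, the table above can be turned into a small lookup keyed on the LLVM target name. This is a minimal sketch, not hipfire's actual dispatch code: `MatrixPath`, `TargetTraits`, `classify`, and `waves_per_block` are hypothetical names invented here, the CDNA list is deliberately not exhaustive, and the RDNA width is the default wave32 mode. Memory tier sizing is left out because it depends on the specific card rather than the target name.

```rust
/// Matrix instruction family exposed by a target (mirrors the table above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MatrixPath {
    /// CDNA: v_mfma_* instructions.
    Mfma,
    /// RDNA3 and later: v_wmma_* instructions.
    Wmma,
    /// RDNA1 / RDNA2: no matrix unit.
    None,
}

/// The two properties a ported kernel most often gets wrong.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TargetTraits {
    /// Threads per wavefront (default mode for the family).
    pub wavefront: u32,
    pub matrix: MatrixPath,
}

/// Hypothetical helper: classify an LLVM gfx target name such as "gfx1100"
/// or "gfx90a:xnack-". Returns None for names it does not recognise.
pub fn classify(arch: &str) -> Option<TargetTraits> {
    // Drop any feature suffix after the first ':'.
    let base = arch.split(':').next().unwrap_or(arch);
    match base {
        // Assumption: a representative, non-exhaustive list of CDNA targets.
        "gfx908" | "gfx90a" | "gfx942" | "gfx950" => Some(TargetTraits {
            wavefront: 64,
            matrix: MatrixPath::Mfma,
        }),
        // gfx10xx covers RDNA1 / RDNA2: wave32, no matrix unit.
        _ if base.starts_with("gfx10") => Some(TargetTraits {
            wavefront: 32,
            matrix: MatrixPath::None,
        }),
        // gfx11xx / gfx12xx cover RDNA3 through RDNA4: wave32 plus WMMA.
        _ if base.starts_with("gfx11") || base.starts_with("gfx12") => Some(TargetTraits {
            wavefront: 32,
            matrix: MatrixPath::Wmma,
        }),
        _ => None,
    }
}

/// Number of wavefronts needed to cover `threads` threads in one block.
pub fn waves_per_block(threads: u32, traits: &TargetTraits) -> u32 {
    (threads + traits.wavefront - 1) / traits.wavefront
}
```

Under these assumptions, a 256-thread block maps to 4 wavefronts on gfx942 and 8 on gfx1201, which is why any reduction or shuffle that hard-codes the wavefront width breaks when the same source is rebuilt for the other family.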
related reading
- /docs/architecture — the engine layout these decks describe
- /docs/benchmarks — the full measured per-arch perf table
- /docs/quantization — what the WMMA & MFMA kernels actually consume
- github.com/Kaden-Schutt/hipfire — engine source