/learn

How AMD GPUs actually compute.

Short, dense decks on the silicon underneath hipfire. Two compute families, two matrix paths, one Rust engine that has to span both. The HIP/ROCm side of the conversation that brrr-style CUDA explainers don't cover.

wave32 vs wave64 · WMMA vs MFMA · GDDR vs HBM · measured on real silicon

deck 01 · consumer / pro / APU

RDNA & WMMA

wave32 silicon, GDDR6 / system-RAM, 16×16 matrix tiles via WMMA, and a dispatch story that has to span five generations from RDNA1 (no matrix unit) to RDNA4 (WMMA32, 1.5× the register file). With empirical hipfire numbers across gfx1010 → gfx1201.

~11 slides • gfx10xx — gfx12xx • 2.94× WMMA speedup, measured

deck 02 · data center

CDNA & MFMA

The Instinct lineage. wave64, HBM3 / HBM3E / HBM4, v_mfma_* 16×16 / 32×32 matrix tiles, and a transistor budget that throws away the graphics pipeline. From the roofline up through CU internals to rocprofv3 metrics.

~24 slides • gfx908 — gfx950 • MI300X: 304 CUs · 192 GB HBM3

deck 03 · the thesis

Why hipfire — RDNA vs NVIDIA

7900 XTX vs RTX 4090, head-to-head. Where the chips look alike, where they diverge (matrix-tile shape, cache topology, register file, scheduling), and why hipified-CUDA kernels reach only 37 % of the 7900 XTX's GDDR6 bandwidth while hand-written hipfire kernels reach 69 %. The empirical case for RDNA-native silicon targeting.

~10 sections • gfx1100 vs sm_89 • 132 vs 71 tok/s · same card, same model

why this exists

The CUDA side of the world has a healthy supply of explainers: roofline plots, tensor-core diagrams, Nsight walkthroughs. The HIP/ROCm side has reference docs but very little of the same kind of load-bearing pictorial intuition. Worse, most CUDA explainers conflate “GPU” with “NVIDIA datacenter GPU,” which silently bakes in 32-thread warps, HBM, tensor cores, and CUDA cores as if those were universal. On AMD silicon, two of those four are wrong most of the time.

These decks are the load-bearing pictures we wanted while writing hipfire. The numbers are measured on the cards in the same drawer that runs the engine — not slide-deck nameplate TFLOPs, not vendor marketing, not synthetic micro-benches.

Each deck is a self-contained read; they cross-reference each other. Start with whichever family you own.

the divergence, in one row

The single most useful framing — same parent company, same HIP source, two fundamentally different chips.

	CDNA — Instinct	RDNA — Radeon / Ryzen AI
wavefront	64 threads	32 threads
matrix unit	`v_mfma_*`	`v_wmma_*` (RDNA3+)
memory	HBM3 / HBM3E / HBM4	GDDR6 / system RAM
graphics pipeline	none	full
LLVM target	`gfx908 → gfx950`	`gfx1010 → gfx1201`
typical user	cloud / HPC / lab	desktop / workstation / iGPU

The same .hip source compiles for either — but ported kernels hit three landmines: wavefront width, matrix intrinsic, and memory tier sizing. The decks walk each one.
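The first two landmines can be made concrete with a small lookup from LLVM target name to wavefront width and matrix path. What follows is a minimal sketch built only from the table above, not hipfire's actual dispatch code: the names `MatrixPath`, `ArchTraits`, and `classify` are hypothetical, the CDNA target list is an assumption about which `gfx9xx` names fall in the `gfx908 → gfx950` range, and the wave size shown for RDNA is the native width from the table.

```rust
/// Matrix instruction family a target exposes (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MatrixPath {
    Mfma, // v_mfma_* on CDNA
    Wmma, // v_wmma_* on RDNA3 and later
    None, // no matrix unit (RDNA1, RDNA2)
}

/// The two per-target facts a ported kernel most often gets wrong.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ArchTraits {
    wave_size: u32,
    matrix: MatrixPath,
}

/// Classify an LLVM target string such as "gfx942" or "gfx1100".
/// Returns None for anything outside the two ranges in the table.
fn classify(target: &str) -> Option<ArchTraits> {
    let suffix = target.strip_prefix("gfx")?;
    match suffix {
        // Assumed CDNA targets within gfx908 → gfx950.
        "908" | "90a" | "940" | "941" | "942" | "950" => Some(ArchTraits {
            wave_size: 64,
            matrix: MatrixPath::Mfma,
        }),
        // gfx10xx: RDNA1 / RDNA2, wave32, no WMMA.
        s if s.len() == 4 && s.starts_with("10") => Some(ArchTraits {
            wave_size: 32,
            matrix: MatrixPath::None,
        }),
        // gfx11xx / gfx12xx: RDNA3 and later, wave32 with WMMA.
        s if s.len() == 4 && (s.starts_with("11") || s.starts_with("12")) => Some(ArchTraits {
            wave_size: 32,
            matrix: MatrixPath::Wmma,
        }),
        _ => None,
    }
}
```

Calling `classify("gfx1100")` yields wave32 with WMMA, `classify("gfx942")` yields wave64 with MFMA, and `classify("gfx1010")` yields wave32 with no matrix path, so a kernel that hardcodes either the width or the intrinsic is correct on at most one family.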