hipfire
/learn

How AMD GPUs actually compute.

Short, dense decks on the silicon underneath hipfire. Two compute families, two matrix paths, one Rust engine that has to span both. The HIP/ROCm side of the conversation that brrr-style CUDA explainers don't cover.

wave32 vs wave64 · WMMA vs MFMA · GDDR vs HBM · measured on real silicon

why this exists

The CUDA side of the world has a healthy supply of explainers — roofline plots, tensor-core diagrams, NSight walkthroughs. The HIP/ROCm side has reference docs but very little of the same kind of load-bearing pictorial intuition. Worse, most CUDA explainers conflate “GPU” with “NVIDIA datacenter GPU,” which silently bakes in wave64, HBM, tensor cores, and CUDA cores as if those were universal. On AMD silicon, two of those four are wrong most of the time.

These decks are the load-bearing pictures we wanted while writing hipfire. The numbers are measured on the cards in the same drawer that runs the engine — not slide-deck nameplate TFLOPs, not vendor marketing, not synthetic micro-benches.

Each deck is a self-contained read; they cross-reference each other. Start with whichever family you own.

the divergence, in one row

The single most useful framing — same parent company, same HIP source, two fundamentally different chips.

family              CDNA (Instinct)        RDNA (Radeon / Ryzen AI)
wavefront           64 threads             32 threads
matrix unit         v_mfma_*               v_wmma_* (RDNA3+)
memory              HBM3 / HBM3E / HBM4    GDDR6 / GDDR6X / system RAM
graphics pipeline   none                   full
LLVM target         gfx908 → gfx950        gfx1010 → gfx1201
typical user        cloud / HPC / lab      desktop / workstation / iGPU
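
The wavefront, matrix-unit, and LLVM-target rows above can be read as a lookup keyed on the gfx target string. The following is a minimal Rust sketch of that lookup, not hipfire's actual code: the names `Family`, `MatrixPath`, `ArchTraits`, and `classify` are hypothetical, and the explicit list of CDNA ids is an assumption that fills in the table's gfx908 → gfx950 range (a plain numeric range would also catch non-CDNA gfx9 parts).

```rust
/// Compute family, as named in the comparison table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Family {
    Cdna,
    Rdna,
}

/// Matrix instruction family a kernel can target, if any.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MatrixPath {
    Mfma,        // v_mfma_* on CDNA
    Wmma,        // v_wmma_* on RDNA3 and later
    Unavailable, // RDNA1 / RDNA2: neither path
}

/// Hypothetical per-target summary; field names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ArchTraits {
    family: Family,
    wavefront: u32, // width from the table (64 for CDNA, 32 for RDNA)
    matrix: MatrixPath,
}

/// Classify an LLVM target such as "gfx942" or "gfx90a:sramecc+:xnack-".
/// Returns None for anything outside the two families in the table.
fn classify(target: &str) -> Option<ArchTraits> {
    // Drop feature suffixes such as ":sramecc+:xnack-".
    let base = target.split(':').next()?;
    let id = base.strip_prefix("gfx")?;
    match id {
        // Assumption: explicit CDNA ids inside the gfx908 → gfx950 range.
        "908" | "90a" | "940" | "941" | "942" | "950" => Some(ArchTraits {
            family: Family::Cdna,
            wavefront: 64,
            matrix: MatrixPath::Mfma,
        }),
        // gfx10xx: RDNA1 / RDNA2, which predate WMMA.
        _ if id.len() == 4 && id.starts_with("10") => Some(ArchTraits {
            family: Family::Rdna,
            wavefront: 32,
            matrix: MatrixPath::Unavailable,
        }),
        // gfx11xx / gfx12xx: RDNA3 and later, with WMMA.
        _ if id.len() == 4 && (id.starts_with("11") || id.starts_with("12")) => {
            Some(ArchTraits {
                family: Family::Rdna,
                wavefront: 32,
                matrix: MatrixPath::Wmma,
            })
        }
        _ => None,
    }
}
```

Calling `classify("gfx1100")` yields an RDNA entry with a 32-wide wavefront and the WMMA path, while `classify("gfx906")` returns `None` because that target sits outside both ranges in the table.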

The same .hip source compiles for either — but ported kernels hit three landmines: wavefront width, matrix intrinsic, and memory tier sizing. The decks walk each one.
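
To make the first landmine concrete, here is a small Rust sketch of the host-side arithmetic that changes with wavefront width. It is an illustration under stated assumptions rather than code from the engine: `wavefronts_per_block` and `full_lane_mask` are hypothetical helper names, and the 256-thread block is an arbitrary example size.

```rust
/// Number of wavefronts needed to cover `block_threads` threads
/// (ceiling division). A block reduction typically needs one
/// partial-result slot per wavefront, so this count sizes the scratch.
fn wavefronts_per_block(block_threads: u32, wavefront: u32) -> u32 {
    (block_threads + wavefront - 1) / wavefront
}

/// Mask with one bit set per lane of a full wavefront.
/// A 32-bit mask type is enough for 32 lanes but cannot hold 64,
/// so the result is always returned as u64.
fn full_lane_mask(wavefront: u32) -> u64 {
    if wavefront >= 64 {
        u64::MAX
    } else {
        (1u64 << wavefront) - 1
    }
}
```

With a 256-thread block, `wavefronts_per_block(256, 64)` is 4 and `wavefronts_per_block(256, 32)` is 8, so a kernel that hardcodes either count indexes its per-wavefront scratch incorrectly on the other family. The same applies to lane masks: a value that fits in 32 bits on a 32-wide wavefront needs 64 bits on a 64-wide one.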

related reading