RDNA & WMMA — A Distil
Companion to AMD Distil — CDNA / MFMA. This deck zooms in on the other AMD compute family: RDNA (consumer / pro / APU), the WMMA matrix path that arrived with RDNA3, and the empirical performance findings from the hipfire LLM-inference engine on real RDNA silicon (gfx1010 / gfx1030 / gfx1100 / gfx1151 / gfx12xx).
Where the CDNA deck spends its budget on data-center matrix throughput (MFMA, HBM, XCDs), this deck spends its budget on the consumer-silicon realities: wave32, GDDR6 / system-RAM, 32×16 matrix tiles via WMMA, and a dispatch story that has to span five generations from RDNA1 (no matrix unit at all) up to RDNA4 (WMMA32 with 1.5× the register file).
01 The RDNA family at a glance
hipfire targets the entire RDNA family with one Rust
binary — the matrix-path feature gating is the load-bearing
piece of the dispatch layer.
| arch | LLVM target | wave | VGPRs / SIMD | waves / SIMD | LDS / CU | matrix path |
|---|---|---|---|---|---|---|
| Navi 10 (5700 XT) | gfx1010 | 32 | 1024 | 20 | 64 KB | dp4a only |
| Navi 21 (6900 XT) | gfx1030 | 32 | 1024 | 16 | 64 KB | dp4a only |
| Navi 31 (7900 XT/XTX) | gfx1100 | 32 | 1024 | 16 | 64 KB | WMMA |
| Navi 32 / 33 | gfx1101 / gfx1102 | 32 | 1024 | 16 | 64 KB | WMMA |
| Strix Halo APU | gfx1150 / gfx1151 | 32 | 1024 | 16 | 64 KB | WMMA |
| Navi 48 / 44 (RX 9070) | gfx1200 / gfx1201 | 32 | 1536 | 16 | 128 KB | WMMA32 |
Two structural breaks worth memorizing:
- RDNA1/2 → RDNA3 is where the matrix unit
appears. Before
gfx1100, the fastest f16-mac path isv_dot2_f32_f16(a 2-wide packed-FMA) orv_dot4_i32_i8(dp4a). Aftergfx1100, you havev_wmma_*with a 16×16 accumulator per wave. - RDNA3 → RDNA4 keeps the wave32 / WMMA shape but
widens it: 1.5× the VGPR file per SIMD, 2× the LDS per
CU, and a 32-element WMMA tile (
WMMA32). Most kernels need their own.gfx12.hipvariant to get the speedup.
02 CDNA vs RDNA — the divergence summary
| CDNA — Instinct MFMA path, data center | RDNA — Radeon / Ryzen AI WMMA path, consumer | |
|---|---|---|
| Wavefront | 64 threads | 32 threads |
| Matrix unit | MFMA (v_mfma_*) — 16×16×16 fp16, 32×32×8 bf16, FP8 on CDNA 3 | WMMA (v_wmma_*) — 16×16×16 fp16 (RDNA3) / WMMA32 (RDNA4) |
| Memory | HBM3 / HBM3E / HBM4 (192–288 GB) | GDDR6 / GDDR6X (8–24 GB) or shared system RAM (Strix Halo) |
| Cache | Large L2, compute-optimized | Smaller L1/L2 + Infinity Cache (256 MB+ on dGPU) |
| Graphics | None (no rasterizer, no RT, no display) | Full graphics + RT pipeline |
| Transistor budget | More CUs, bigger matrix engines, more HBM controllers | Display engines, rasterizer, RT cores compete with compute |
| Typical port | gfx908 / gfx90a / gfx942 / gfx950 | gfx1010 → gfx1201 |
Why this matters for code: the same .hip
file compiles for either family — but a CDNA kernel ported to
RDNA hits three landmines: wavefront width (64 → 32 halves the
lanes seen by __shfl_*), matrix intrinsic (v_mfma_*
doesn't exist on RDNA — you must call v_wmma_*),
and memory-tier sizing (256 MB Infinity Cache changes your tiling
math vs HBM-direct).
03 WMMA — one instruction, one 16×16 tile
The WMMA family is RDNA3+'s answer to CDNA's MFMA. Same shape —
a fused D = A × B + C over a 16×16 tile
— but reachable from wave32 (half the lanes) and from a
fundamentally consumer-silicon caches.
The canonical RDNA3 builtin (used in hipfire's gemm_f16_wmma.hip):
// WMMA layout: __builtin_amdgcn_wmma_f32_16x16x16_f16_w32(a, b, c)
// computes D[i][j] = sum_k a[i][k] * b[j][k] + c[i][j]
// which is D = A @ B^T + C.
//
// A is half16_t (16 fp16 values per lane)
// B is half16_t
// C / D are float8_t (8 fp32 accumulators per lane on wave32)
typedef _Float16 __attribute__((ext_vector_type(16))) half16_t;
typedef float __attribute__((ext_vector_type(8))) float8_t;
__launch_bounds__(32, 2)
extern "C" __global__ void gemm_f16_wmma(
const _Float16* __restrict__ W, const float* __restrict__ X,
float* __restrict__ Y, int M, int K, int N)
{
// Grid: [ceil(M/16), ceil(N/16)], Block: [32]
float8_t acc = {0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f};
for (int k0 = 0; k0 < K; k0 += 16) {
half16_t a_reg, b_reg;
// ...load 16x16 tiles into a_reg and b_reg...
acc = __builtin_amdgcn_wmma_f32_16x16x16_f16_w32(a_reg, b_reg, acc);
}
// ...store acc back to Y...
} - tile shape
- M=16, N=16, K=16
- inputs · output
- fp16 × fp16 → fp32 accumulator
- per issue
- 256 fma ops · 16-element dot product per output cell
- theoretical peak
- ~123 TFLOPs · 7900 XTX (gfx1100)
- notes
- Training-class precision. Default WMMA path on RDNA3.
What WMMA gives you (RDNA3, gfx1100 / 1151):
-
One
v_wmma_f32_16x16x16_f16issue per CU per cycle group; ~256 fp16 MACs per cycle per CU = the headline TFLOPs (e.g. 7900 XTX ≈ 123 TFLOPs fp16 WMMA, vs ≈ 41 TFLOPs from packed dot2). -
Wave32 native —
_w32suffix on the builtin. The wave64 sibling exists for CDNA-style code paths but RDNA3 issues wave32. -
The accumulator (
float8_ton wave32,float16_ton wave64) lives in the VGPR file, so a fused-GEMM kernel typically lands at 80–120 VGPRs/wave even without spills — see § 06 for what that does to occupancy.
What WMMA32 (RDNA4, gfx1200+) adds:
-
32-element matrix tiles (the
WMMA32family) — bigger tile, same wave32 shape. Most existing 16-tile kernels need their own.gfx12.hipoverride to switch to the wider builtin. - 1.5× the register file (1536 VGPRs/SIMD) means the bigger accumulator footprint doesn't cost occupancy proportionally.
04 The roofline — three matrix paths, one chip
Because RDNA3 ships both v_wmma_* and
v_dot2_f32_f16 (and the scalar fallback), it has
three different compute ceilings on the same chip.
Which one bounds you is decided by your kernel's matrix path, not by
the GPU's nameplate TFLOPs.
- Decode GEMV (batch=1) lives far left on the AI axis. Even with WMMA available, you can't feed a 16×16 matrix unit from a length-1 batch — you fall back to dot2/scalar and end up memory-bound on GDDR or system RAM. Decode is a memory-bandwidth problem on RDNA for the same reason it is on CDNA.
- Batched prefill (B=8) sits mid-AI. WMMA starts helping but you haven't filled the matrix-unit pipeline yet.
- Wide prefill (B=32+) is where WMMA wins. The arithmetic intensity climbs above the dot2 roof's ridge point and the WMMA roof is the only ceiling left.
This is exactly the regime split the hipfire empirical bench in § 07 confirms.
05 Why CDNA “throws away” graphics and RDNA doesn't
Mirror of slide 40 from the CDNA deck, but inverted:
- Rasterizers, ROPs, display engines, video encode/decode.
- Ray-tracing accelerators (since RDNA2).
- Mesh shaders, primitive shaders.
- Smaller, latency-optimized L0 / L1.
- Fewer CUs (96 on 7900 XTX vs 304 on MI300X — about 3×).
- No HBM (24 GB GDDR6X vs 192 GB HBM3 — about 8×).
- No FP8 / FP6 / FP4 MFMA — RDNA3 WMMA is fp16 / bf16 / i8 / i4 only; RDNA4 adds FP8.
- No XCD-level multi-die packaging — RDNA3 chiplets are MCDs (memory) not compute.
The result for an LLM workload: an RDNA3 board costs ≈ 1/10 of an MI300X. The matrix-unit throughput per dollar is similar if your kernel actually saturates WMMA. The headline divergence is that one of them runs a display.
06 The wave32 occupancy table
WMMA-fused GEMMs are register-hungry. The 8-float accumulator alone costs 8 VGPRs/lane; with tile-staging and A/B register tiles you quickly land in the 80–120 VGPR range. That has to fit the wave32 occupancy curve — drag the slider:
full lookup table
| VGPRs / wave | RDNA3 max waves / SIMD 1024 VGPRs/SIMD | RDNA4 max waves / SIMD 1536 VGPRs/SIMD |
|---|---|---|
| ≤ 64 | 16 | 16 |
| 65–84 | 12 | 16 |
| 85–96 | 10 | 16 |
| 97–112 | 9 | 13 |
| 113–128 | 8 | 12 |
| 129–168 | 6 | 9 |
| 169–204 | 5 | 7 |
| 205–256 | 4 | 6 |
Two rules of thumb that show up repeatedly in the hipfire kernel notes:
- WMMA / MFMA kernels run hot on VGPRs. Common allocation is 80–120 VGPR/wave with zero spills.
- High theoretical occupancy + low VALUBusy = memory-bound. More occupancy won't help; you need more in-flight HBM/GDDR transactions per wave (multi-quad interleave, half-wave splits, prefetch).
.private_segment_fixed_size: 0is the reliable “no spills” indicator from the AMDGPU note section — thevgpr_spill_countfield is sometimes elided by the toolchain when zero.
Reading those numbers out of a compiled .hsaco (from the gfx-kernel-metadata skill):
ARCH=gfx1100
# 1. Unbundle the offload container into a real ELF
/opt/rocm/llvm/bin/clang-offload-bundler --type=o --unbundle \
--input=kernel.hsaco --output=/tmp/kernel.elf \
--targets=hipv4-amdgcn-amd-amdhsa--$ARCH
# 2. Read AMDGPU notes
/opt/rocm/llvm/bin/llvm-readelf --notes /tmp/kernel.elf
# → .vgpr_count, .group_segment_fixed_size, .private_segment_fixed_size,
# .wavefront_size: 32 (RDNA) / 64 (CDNA) 07 Empirical — what WMMA is actually worth on real silicon
The headline empirical anchor from hipfire's May-2026 RDNA investigation. Same 1100-token NIAH-style prompt, 9B mq4, asym3 KV, three fresh-process warm runs per card.
| card | arch | WMMA | dp4a | prefill tok/s | decode tok/s | TTFT @ 1100 tok |
|---|---|---|---|---|---|---|
| RX 5700 XT | gfx1010 | ✗ | ✗ | 190.4 ± 2.5 | 54.7 | 5.78 s |
| RX 6950 XT | gfx1030 | ✗ | ✓ | 328.1 ± 6.8 | 71.5 | 3.35 s |
| Strix Halo iGPU | gfx1151 | ✓ | ✓ | 965.3 ± 2.7 | 45.0 | 1.14 s |
Speedup ratios — what each lever is worth
gfx1151 vs gfx1030
gfx1151 vs gfx1010
gfx1030 vs gfx1010
The pre-registered “strong win” threshold was 2×. WMMA cleared it by 2.5×.
The asymmetry is the lesson. Strix Halo's 256 GB/s shared system RAM is a quarter of the 5700 XT's 448 GB/s GDDR6. Yet it wins prefill 5×, because prefill is compute-bound and WMMA is the lever. The same card loses decode by 18%, because decode (batch=1 GEMV) is BW-bound and GDDR > system RAM. Specialize each tier — this is what “WMMA-downcasting prefill node + dedicated-GDDR decode node” means in practice on a heterogeneous RDNA cluster.
08 Memory hierarchy — RDNA edition
Same layered story as CDNA, with consumer-silicon numbers. The absolute bandwidths are smaller across the board, but Infinity Cache (256 MB+ on RDNA3 dGPU) is a uniquely RDNA tier — it sits between L2 and DRAM and dramatically widens the effective LLM-friendly working set.
tier scope cap BW lat
──── ───── ─── ── ───
VGPRs per CU 8 KB/wave ~100 TB/s 0 c
LDS per CU 64 / 128 KB ~10 TB/s ~20 c
L1 per CU 32 KB ~5 TB/s ~50 c
L2 per array 6 MB ~3 TB/s ~150 c
Infinity Cache device-wide 96–256 MB ~17 TB/s ~200 c ← RDNA-only
GDDR / sys RAM device-wide 8–96+ GB 0.5–1.0 ~400 c
The standard CDNA-deck rule still holds: every step down the
ladder is at least an order of magnitude slower, so a kernel
that touches the same byte twice should get the second touch from
LDS/L1, not DRAM. The RDNA twist is that Infinity Cache
makes the device-wide working set ~256 MB instead of ~32 MB of
L2 — a Qwen-9B asym3-KV slice for short context lives
entirely in IC, which is one of the reasons hipfire's decode tok/s
on gfx1100 is competitive with much wider-memory parts.
09 Dispatch — fast paths first, baseline last
rdna-compute::dispatch is the kernel-selection hot path
in hipfire. Every GEMM / GEMV / norm / fused op routes through here.
The shape is always the same: arch-feature predicates, fast paths
first, baseline (scalar / wave64-RDNA1) last.
pub fn gemm_qkv_hfq4g256(&self, ...) -> HipResult<()> {
if has_wmma_f16_gfx12(&self.arch) { // gfx1200/1201
return self.gemm_qkv_hfq4g256_wmma_gfx12(...);
}
if has_wmma_f16(&self.arch) { // gfx11xx, gfx1150/51
return self.gemm_qkv_hfq4g256_wmma(...);
}
if has_dot2_f32_f16(&self.arch) { // RDNA2 fallback
return self.gemm_qkv_hfq4g256_dot2(...);
}
self.gemm_qkv_hfq4g256_baseline(...) // RDNA1, fallback
}
// Predicates live at the top of dispatch.rs:
fn has_wmma_f16(arch: &str) -> bool { arch.starts_with("gfx11") }
fn has_wmma_f16_gfx12(arch: &str) -> bool { arch.starts_with("gfx12") }
fn has_wmma_fp8_gfx12(arch: &str) -> bool { arch.starts_with("gfx12") } Two contributor rules from the architecture doc:
- Predicates are arch-feature checks, not inline
arch.starts_with(...)chains. New silicon = update one predicate, not 40 call sites. - No unreachable branches. When a new arch absorbs a
check that was matched by an older
|| starts_with("gfxN")clause, drop the redundant clause in the same diff.
The per-arch kernel-file naming convention follows the dispatch:
kernels/src/gemm_qkv_hfq4g256_wmma.hip # default WMMA (RDNA3)
kernels/src/gemm_qkv_hfq4g256_wmma.gfx12.hip # gfx12xx override (WMMA32)
kernels/src/gemv_hfq4g256.gfx1030.v4.hip # chip-specific versioned
kernels/src/gemv_hfq6g256_residual_wave64.hip # wave64 (CDNA / RDNA1 fall-back) scripts/compile-kernels.sh resolves chip →
family → default in that order;
.gfx12.hip covers both gfx1200 and
gfx1201 with one file.
10 hipfire headline numbers — what this all adds up to
The reason any of the above engineering effort matters is the end-to-end inference throughput on consumer RDNA silicon.
7900 XTX (gfx1100), default config (asym3 KV,
FlashAttention auto):
| model | decode tok/s | peak prefill tok/s | vs ollama Q4_K_M |
|---|---|---|---|
| Qwen 3.5 0.8B | 391 | 7383 | 2.10× |
| Qwen 3.5 4B | 180 | 2487 | 1.78× |
| Qwen 3.5 9B | 132 | 1663 | 1.71× |
| Qwen 3.5 27B | 47 | 478 | — |
DFlash speculative decode adds another lever on top: 217 tok/s peak on 27B HumanEval/53 (4.8× over AR), 129 tok/s mean across HumanEval N=33 (2.9× AR mean). On 9B: 372 tok/s peak. The DFlash gain is genre-conditional; see benchmarks for the per-genre table.
27B AR across the RDNA family — same model, same prompt sweep,
kv-mode asym3, --no-chatml:
| card | arch | 27B AR decode tok/s | prefill tok/s | TTFT |
|---|---|---|---|---|
| 7900 XTX | gfx1100 | 44.9 | 462 | 328 ms |
| R9700 | gfx1201 | 35.1 | — | — |
| Strix Halo iGPU | gfx1151 | 14.95 | 161 | 950 ms |
HumanEval N=33 sweep, deterministic to ±1.3 % across reruns. The 3.0× spread from 7900 XTX to Strix Halo is the gap between 24 GB GDDR6X (~960 GB/s) and shared system RAM (~256 GB/s) for a memory-bound decode workload — consistent with the §07 prefill numbers flipping the other direction.
And on the APU — Strix Halo (gfx1151) solo, 27B mq4
with the 27B-DFlash sidecar, canonical CODE prompt, default adaptive-b
(the selector self-clamps to B=16):
| mode | tok/s | vs AR |
|---|---|---|
| AR baseline (no draft) | 14.95 | — |
| hipfire DFlash | 82.0 | 5.5× |
| lucebox DFlash · llama.cpp fork, same model | 27.4 | 1.8× |
hipfire DFlash row: 3-cell median, τ=10.0, accept rate 0.67. Lucebox is a separate llama.cpp fork running DFlash-style spec decode on the same hardware and model — hipfire's tuned WMMA kernels and adaptive-b selector together deliver 3.0× the lucebox throughput. HumanEval N=33 DFlash sweep on this card is currently rerunning.
That 1.84× DFlash gain on a system-RAM iGPU is a real silicon win attributable to the WMMA unit accelerating the verify-batch step, even when system-memory bandwidth limits the AR baseline.
11 Summary — what to remember
- RDNA wave width is 32, not 64.
__shfl_*, occupancy math, WMMA_w32builtins all key off this. - WMMA is the RDNA3+ matrix unit. Uses
v_wmma_*intrinsics, 16×16 tiles on RDNA3 (gfx1100/1101/1102/1150/1151), 32-wide tiles on RDNA4 (gfx1200/1201viaWMMA32). RDNA1/2 (gfx10xx) have no matrix unit — the best you get isv_dot2_f32_f16orv_dot4_i32_i8. - WMMA is worth 2.94× prefill on hipfire's actual kernels, measured apples-to-apples against a dp4a-only RDNA2 card (gfx1151 vs gfx1030, 9B mq4, 1100-token prompt). It is not worth anything for batch=1 decode, which stays memory-bound.
- Specialize each tier in a hetero cluster. Prefill on a WMMA card, decode on a high-BW dGPU. The same workload runs 5× faster end-to-end vs solo RDNA1.
- Dispatch is feature-gated, not arch-string-matched. New silicon costs one predicate update, not 40 call sites.
- WMMA fused-GEMMs cost ~80–120 VGPRs/wave. Fine on RDNA3 (occupancy stays at 10–12 waves/SIMD); RDNA4's 1.5× VGPR file keeps occupancy higher at the same pressure.
- Infinity Cache is a uniquely RDNA tier, sitting between L2 and DRAM at ~256 MB. It widens the effective LLM working set enough that a 24 GB consumer card competes with HBM-class parts at short context.
Source repository: github.com/Kaden-Schutt/hipfire · Companion: CDNA / MFMA deck · Further reading: /docs/architecture, /docs/benchmarks, /docs/quantization.