/learn · deck 01 · RDNA / WMMA

RDNA & WMMA — A Distil

Companion to AMD Distil — CDNA / MFMA. This deck zooms in on the other AMD compute family: RDNA (consumer / pro / APU), the WMMA matrix path that arrived with RDNA3, and the empirical performance findings from the hipfire LLM-inference engine on real RDNA silicon (gfx1010 / gfx1030 / gfx1100 / gfx1151 / gfx12xx).

Where the CDNA deck spends its budget on data-center matrix throughput (MFMA, HBM, XCDs), this deck spends its budget on the consumer-silicon realities: wave32, GDDR6 / system-RAM, 32×16 matrix tiles via WMMA, and a dispatch story that has to span five generations from RDNA1 (no matrix unit at all) up to RDNA4 (WMMA32 with 1.5× the register file).

01 The RDNA family at a glance

hipfire targets the entire RDNA family with one Rust binary — the matrix-path feature gating is the load-bearing piece of the dispatch layer.

interactive · matrix-path generations · what each arch adds

RDNA3 (gfx1100) First gen with the WMMA matrix unit. 16×16 fp16 tile per issue, ~123 TFLOPs theoretical. The 2.94× prefill speedup measured in §07 starts here.

arch	LLVM target	wave	VGPRs / SIMD	waves / SIMD	LDS / CU	matrix path
Navi 10 (5700 XT)	`gfx1010`	32	1024	20	64 KB	dp4a only
Navi 21 (6900 XT)	`gfx1030`	32	1024	16	64 KB	dp4a only
Navi 31 (7900 XT/XTX)	`gfx1100`	32	1024	16	64 KB	WMMA
Navi 32 / 33	`gfx1101` / `gfx1102`	32	1024	16	64 KB	WMMA
Strix Halo APU	`gfx1150` / `gfx1151`	32	1024	16	64 KB	WMMA
Navi 48 / 44 (RX 9070)	`gfx1200` / `gfx1201`	32	1536	16	128 KB	WMMA32

Two structural breaks worth memorizing:

RDNA1/2 → RDNA3 is where the matrix unit appears. Before gfx1100, the fastest f16-mac path is v_dot2_f32_f16 (a 2-wide packed-FMA) or v_dot4_i32_i8 (dp4a). After gfx1100, you have v_wmma_* with a 16×16 accumulator per wave.
RDNA3 → RDNA4 keeps the wave32 / WMMA shape but widens it: 1.5× the VGPR file per SIMD, 2× the LDS per CU, and a 32-element WMMA tile (WMMA32). Most kernels need their own .gfx12.hip variant to get the speedup.

02 CDNA vs RDNA — the divergence summary

	CDNA — Instinct MFMA path, data center	RDNA — Radeon / Ryzen AI WMMA path, consumer
Wavefront	64 threads	32 threads
Matrix unit	MFMA (`v_mfma_*`) — 16×16×16 fp16, 32×32×8 bf16, FP8 on CDNA 3	WMMA (`v_wmma_*`) — 16×16×16 fp16 (RDNA3) / WMMA32 (RDNA4)
Memory	HBM3 / HBM3E / HBM4 (192–288 GB)	GDDR6 / GDDR6X (8–24 GB) or shared system RAM (Strix Halo)
Cache	Large L2, compute-optimized	Smaller L1/L2 + Infinity Cache (256 MB+ on dGPU)
Graphics	None (no rasterizer, no RT, no display)	Full graphics + RT pipeline
Transistor budget	More CUs, bigger matrix engines, more HBM controllers	Display engines, rasterizer, RT cores compete with compute
Typical port	`gfx908 / gfx90a / gfx942 / gfx950`	`gfx1010 → gfx1201`

Why this matters for code: the same .hip file compiles for either family — but a CDNA kernel ported to RDNA hits three landmines: wavefront width (64 → 32 halves the lanes seen by __shfl_*), matrix intrinsic (v_mfma_* doesn't exist on RDNA — you must call v_wmma_*), and memory-tier sizing (256 MB Infinity Cache changes your tiling math vs HBM-direct).

03 WMMA — one instruction, one 16×16 tile

The WMMA family is RDNA3+'s answer to CDNA's MFMA. Same shape — a fused D = A × B + C over a 16×16 tile — but reachable from wave32 (half the lanes) and from a fundamentally consumer-silicon caches.

The canonical RDNA3 builtin (used in hipfire's gemm_f16_wmma.hip):

// WMMA layout: __builtin_amdgcn_wmma_f32_16x16x16_f16_w32(a, b, c)
// computes  D[i][j] = sum_k a[i][k] * b[j][k] + c[i][j]
// which is  D = A @ B^T + C.
//
// A is half16_t (16 fp16 values per lane)
// B is half16_t
// C / D are float8_t (8 fp32 accumulators per lane on wave32)

typedef _Float16 __attribute__((ext_vector_type(16))) half16_t;
typedef float    __attribute__((ext_vector_type(8)))  float8_t;

__launch_bounds__(32, 2)
extern "C" __global__ void gemm_f16_wmma(
    const _Float16* __restrict__ W, const float* __restrict__ X,
    float*          __restrict__ Y, int M, int K, int N)
{
    // Grid: [ceil(M/16), ceil(N/16)],  Block: [32]
    float8_t acc = {0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f};
    for (int k0 = 0; k0 < K; k0 += 16) {
        half16_t a_reg, b_reg;
        // ...load 16x16 tiles into a_reg and b_reg...
        acc = __builtin_amdgcn_wmma_f32_16x16x16_f16_w32(a_reg, b_reg, acc);
    }
    // ...store acc back to Y...
}

interactive · v_wmma_f32_NxNxN_f16 · one issue, one tile

A · fp16 · 16×16

B · fp16 · 16×16

C · fp32 · accumulator

D · fp32 · result

wave32 lanes contributing:

1 issue · 256 fma per issue — ~123 TFLOPs on 7900 XTX

tile shape: M=16, N=16, K=16
inputs · output: fp16 × fp16 → fp32 accumulator
per issue: 256 fma ops · 16-element dot product per output cell
theoretical peak: ~123 TFLOPs · 7900 XTX (gfx1100)
notes: Training-class precision. Default WMMA path on RDNA3.

What WMMA gives you (RDNA3, gfx1100 / 1151):

One v_wmma_f32_16x16x16_f16 issue per CU per cycle group; ~256 fp16 MACs per cycle per CU = the headline TFLOPs (e.g. 7900 XTX ≈ 123 TFLOPs fp16 WMMA, vs ≈ 41 TFLOPs from packed dot2).
Wave32 native — _w32 suffix on the builtin. The wave64 sibling exists for CDNA-style code paths but RDNA3 issues wave32.
The accumulator (float8_t on wave32, float16_t on wave64) lives in the VGPR file, so a fused-GEMM kernel typically lands at 80–120 VGPRs/wave even without spills — see § 06 for what that does to occupancy.

What WMMA32 (RDNA4, gfx1200+) adds:

32-element matrix tiles (the WMMA32 family) — bigger tile, same wave32 shape. Most existing 16-tile kernels need their own .gfx12.hip override to switch to the wider builtin.
1.5× the register file (1536 VGPRs/SIMD) means the bigger accumulator footprint doesn't cost occupancy proportionally.

04 The roofline — three matrix paths, one chip

Because RDNA3 ships both v_wmma_* and v_dot2_f32_f16 (and the scalar fallback), it has three different compute ceilings on the same chip. Which one bounds you is decided by your kernel's matrix path, not by the GPU's nameplate TFLOPs.

interactive · roofline · 7900 XTX · drag the batch-size slider

batch size B=1

arithmetic intensity ~3.0 FLOPs/byte

currently bound by GDDR bandwidth

attained perf ~2.5 TFLOPs/s

Decode GEMV (batch=1) lives far left on the AI axis. Even with WMMA available, you can't feed a 16×16 matrix unit from a length-1 batch — you fall back to dot2/scalar and end up memory-bound on GDDR or system RAM. Decode is a memory-bandwidth problem on RDNA for the same reason it is on CDNA.
Batched prefill (B=8) sits mid-AI. WMMA starts helping but you haven't filled the matrix-unit pipeline yet.
Wide prefill (B=32+) is where WMMA wins. The arithmetic intensity climbs above the dot2 roof's ridge point and the WMMA roof is the only ceiling left.

This is exactly the regime split the hipfire empirical bench in § 07 confirms.

05 Why CDNA “throws away” graphics and RDNA doesn't

Mirror of slide 40 from the CDNA deck, but inverted:

RDNA spends transistors on ( + )

Rasterizers, ROPs, display engines, video encode/decode.
Ray-tracing accelerators (since RDNA2).
Mesh shaders, primitive shaders.
Smaller, latency-optimized L0 / L1.

What it gives up vs CDNA ( − )

Fewer CUs (96 on 7900 XTX vs 304 on MI300X — about 3×).
No HBM (24 GB GDDR6X vs 192 GB HBM3 — about 8×).
No FP8 / FP6 / FP4 MFMA — RDNA3 WMMA is fp16 / bf16 / i8 / i4 only; RDNA4 adds FP8.
No XCD-level multi-die packaging — RDNA3 chiplets are MCDs (memory) not compute.

The result for an LLM workload: an RDNA3 board costs ≈ 1/10 of an MI300X. The matrix-unit throughput per dollar is similar if your kernel actually saturates WMMA. The headline divergence is that one of them runs a display.

06 The wave32 occupancy table

WMMA-fused GEMMs are register-hungry. The 8-float accumulator alone costs 8 VGPRs/lane; with tile-staging and A/B register tiles you quickly land in the 80–120 VGPR range. That has to fit the wave32 occupancy curve — drag the slider:

full lookup table

VGPRs / wave	RDNA3 max waves / SIMD 1024 VGPRs/SIMD	RDNA4 max waves / SIMD 1536 VGPRs/SIMD
≤ 64	16	16
65–84	12	16
85–96	10	16
97–112	9	13
113–128	8	12
129–168	6	9
169–204	5	7
205–256	4	6

Two rules of thumb that show up repeatedly in the hipfire kernel notes:

WMMA / MFMA kernels run hot on VGPRs. Common allocation is 80–120 VGPR/wave with zero spills.
High theoretical occupancy + low VALUBusy = memory-bound. More occupancy won't help; you need more in-flight HBM/GDDR transactions per wave (multi-quad interleave, half-wave splits, prefetch).
.private_segment_fixed_size: 0 is the reliable “no spills” indicator from the AMDGPU note section — the vgpr_spill_count field is sometimes elided by the toolchain when zero.

Reading those numbers out of a compiled .hsaco (from the gfx-kernel-metadata skill):

ARCH=gfx1100
# 1. Unbundle the offload container into a real ELF
/opt/rocm/llvm/bin/clang-offload-bundler --type=o --unbundle \
    --input=kernel.hsaco --output=/tmp/kernel.elf \
    --targets=hipv4-amdgcn-amd-amdhsa--$ARCH
# 2. Read AMDGPU notes
/opt/rocm/llvm/bin/llvm-readelf --notes /tmp/kernel.elf
# → .vgpr_count, .group_segment_fixed_size, .private_segment_fixed_size,
#   .wavefront_size: 32 (RDNA) / 64 (CDNA)

07 Empirical — what WMMA is actually worth on real silicon

The headline empirical anchor from hipfire's May-2026 RDNA investigation. Same 1100-token NIAH-style prompt, 9B mq4, asym3 KV, three fresh-process warm runs per card.

fig · per-card prefill tok/s · 1100-token prompt, 9B mq4

gfx1010 · 5700 XT

gfx1030 · 6950 XT

gfx1151 · Strix Halo

card	arch	WMMA	dp4a	prefill tok/s	decode tok/s	TTFT @ 1100 tok
RX 5700 XT	`gfx1010`	✗	✗	190.4 ± 2.5	54.7	5.78 s
RX 6950 XT	`gfx1030`	✗	✓	328.1 ± 6.8	71.5	3.35 s
Strix Halo iGPU	`gfx1151`	✓	✓	965.3 ± 2.7	45.0	1.14 s

Speedup ratios — what each lever is worth

2.94×

WMMA alone
gfx1151 vs gfx1030

5.07×

WMMA + arch gen
gfx1151 vs gfx1010

1.72×

dp4a alone
gfx1030 vs gfx1010

The pre-registered “strong win” threshold was 2×. WMMA cleared it by 2.5×.

The asymmetry is the lesson. Strix Halo's 256 GB/s shared system RAM is a quarter of the 5700 XT's 448 GB/s GDDR6. Yet it wins prefill 5×, because prefill is compute-bound and WMMA is the lever. The same card loses decode by 18%, because decode (batch=1 GEMV) is BW-bound and GDDR > system RAM. Specialize each tier — this is what “WMMA-downcasting prefill node + dedicated-GDDR decode node” means in practice on a heterogeneous RDNA cluster.

08 Memory hierarchy — RDNA edition

Same layered story as CDNA, with consumer-silicon numbers. The absolute bandwidths are smaller across the board, but Infinity Cache (256 MB+ on RDNA3 dGPU) is a uniquely RDNA tier — it sits between L2 and DRAM and dramatically widens the effective LLM-friendly working set.

fig · memory hierarchy, 7900 XTX class

 tier             scope         cap          BW          lat
 ────             ─────         ───          ──          ───
 VGPRs            per CU        8 KB/wave    ~100 TB/s   0 c
 LDS              per CU        64 / 128 KB  ~10 TB/s    ~20 c
 L1               per CU        32 KB        ~5 TB/s     ~50 c
 L2               per array     6 MB         ~3 TB/s     ~150 c
 Infinity Cache   device-wide   96–256 MB    ~17 TB/s    ~200 c   ← RDNA-only
 GDDR / sys RAM   device-wide   8–96+ GB     0.5–1.0     ~400 c

The standard CDNA-deck rule still holds: every step down the ladder is at least an order of magnitude slower, so a kernel that touches the same byte twice should get the second touch from LDS/L1, not DRAM. The RDNA twist is that Infinity Cache makes the device-wide working set ~256 MB instead of ~32 MB of L2 — a Qwen-9B asym3-KV slice for short context lives entirely in IC, which is one of the reasons hipfire's decode tok/s on gfx1100 is competitive with much wider-memory parts.

09 Dispatch — fast paths first, baseline last

rdna-compute::dispatch is the kernel-selection hot path in hipfire. Every GEMM / GEMV / norm / fused op routes through here. The shape is always the same: arch-feature predicates, fast paths first, baseline (scalar / wave64-RDNA1) last.

interactive · dispatch tree · click an arch to see which kernel branch wins

gemm_qkv_hfq4g256(...)

has_wmma_f16_gfx12 → WMMA32 path gfx1200 / 1201

has_wmma_f16 → WMMA16 path gfx1100 / 1101 / 1102 / 1150 / 1151

has_dot2_f32_f16 → dot2 path gfx1030

(default) → scalar baseline gfx1010, wave64 fallback

gfx1100 takes the WMMA16 branch — ~123 TFLOPs ceiling.

pub fn gemm_qkv_hfq4g256(&self, ...) -> HipResult<()> {
    if has_wmma_f16_gfx12(&self.arch) {                 // gfx1200/1201
        return self.gemm_qkv_hfq4g256_wmma_gfx12(...);
    }
    if has_wmma_f16(&self.arch) {                       // gfx11xx, gfx1150/51
        return self.gemm_qkv_hfq4g256_wmma(...);
    }
    if has_dot2_f32_f16(&self.arch) {                   // RDNA2 fallback
        return self.gemm_qkv_hfq4g256_dot2(...);
    }
    self.gemm_qkv_hfq4g256_baseline(...)                // RDNA1, fallback
}

// Predicates live at the top of dispatch.rs:
fn has_wmma_f16(arch: &str)        -> bool { arch.starts_with("gfx11") }
fn has_wmma_f16_gfx12(arch: &str)  -> bool { arch.starts_with("gfx12") }
fn has_wmma_fp8_gfx12(arch: &str)  -> bool { arch.starts_with("gfx12") }

Two contributor rules from the architecture doc:

Predicates are arch-feature checks, not inline arch.starts_with(...) chains. New silicon = update one predicate, not 40 call sites.
No unreachable branches. When a new arch absorbs a check that was matched by an older || starts_with("gfxN") clause, drop the redundant clause in the same diff.

The per-arch kernel-file naming convention follows the dispatch:

kernels/src/gemm_qkv_hfq4g256_wmma.hip            # default WMMA (RDNA3)
kernels/src/gemm_qkv_hfq4g256_wmma.gfx12.hip      # gfx12xx override (WMMA32)
kernels/src/gemv_hfq4g256.gfx1030.v4.hip          # chip-specific versioned
kernels/src/gemv_hfq6g256_residual_wave64.hip     # wave64 (CDNA / RDNA1 fall-back)

scripts/compile-kernels.sh resolves chip → family → default in that order; .gfx12.hip covers both gfx1200 and gfx1201 with one file.

10 hipfire headline numbers — what this all adds up to

The reason any of the above engineering effort matters is the end-to-end inference throughput on consumer RDNA silicon.

7900 XTX (gfx1100), default config (asym3 KV, FlashAttention auto):

model	decode tok/s	peak prefill tok/s	vs ollama Q4_K_M
Qwen 3.5 0.8B	391	7383	2.10×
Qwen 3.5 4B	180	2487	1.78×
Qwen 3.5 9B	132	1663	1.71×
Qwen 3.5 27B	47	478	—

DFlash speculative decode adds another lever on top: 217 tok/s peak on 27B HumanEval/53 (4.8× over AR), 129 tok/s mean across HumanEval N=33 (2.9× AR mean). On 9B: 372 tok/s peak. The DFlash gain is genre-conditional; see benchmarks for the per-genre table.

27B AR across the RDNA family — same model, same prompt sweep, kv-mode asym3, --no-chatml:

card	arch	27B AR decode tok/s	prefill tok/s	TTFT
7900 XTX	`gfx1100`	44.9	462	328 ms
R9700	`gfx1201`	35.1	—	—
Strix Halo iGPU	`gfx1151`	14.95	161	950 ms

HumanEval N=33 sweep, deterministic to ±1.3 % across reruns. The 3.0× spread from 7900 XTX to Strix Halo is the gap between 24 GB GDDR6X (~960 GB/s) and shared system RAM (~256 GB/s) for a memory-bound decode workload — consistent with the §07 prefill numbers flipping the other direction.

And on the APU — Strix Halo (gfx1151) solo, 27B mq4 with the 27B-DFlash sidecar, canonical CODE prompt, default adaptive-b (the selector self-clamps to B=16):

mode	tok/s	vs AR
AR baseline (no draft)	14.95	—
hipfire DFlash	82.0	5.5×
lucebox DFlash · llama.cpp fork, same model	27.4	1.8×

hipfire DFlash row: 3-cell median, τ=10.0, accept rate 0.67. Lucebox is a separate llama.cpp fork running DFlash-style spec decode on the same hardware and model — hipfire's tuned WMMA kernels and adaptive-b selector together deliver 3.0× the lucebox throughput. HumanEval N=33 DFlash sweep on this card is currently rerunning.

That 1.84× DFlash gain on a system-RAM iGPU is a real silicon win attributable to the WMMA unit accelerating the verify-batch step, even when system-memory bandwidth limits the AR baseline.

11 Summary — what to remember

RDNA wave width is 32, not 64. __shfl_*, occupancy math, WMMA _w32 builtins all key off this.
WMMA is the RDNA3+ matrix unit. Uses v_wmma_* intrinsics, 16×16 tiles on RDNA3 (gfx1100/1101/1102/1150/1151), 32-wide tiles on RDNA4 (gfx1200/1201 via WMMA32). RDNA1/2 (gfx10xx) have no matrix unit — the best you get is v_dot2_f32_f16 or v_dot4_i32_i8.
WMMA is worth 2.94× prefill on hipfire's actual kernels, measured apples-to-apples against a dp4a-only RDNA2 card (gfx1151 vs gfx1030, 9B mq4, 1100-token prompt). It is not worth anything for batch=1 decode, which stays memory-bound.
Specialize each tier in a hetero cluster. Prefill on a WMMA card, decode on a high-BW dGPU. The same workload runs 5× faster end-to-end vs solo RDNA1.
Dispatch is feature-gated, not arch-string-matched. New silicon costs one predicate update, not 40 call sites.
WMMA fused-GEMMs cost ~80–120 VGPRs/wave. Fine on RDNA3 (occupancy stays at 10–12 waves/SIMD); RDNA4's 1.5× VGPR file keeps occupancy higher at the same pressure.
Infinity Cache is a uniquely RDNA tier, sitting between L2 and DRAM at ~256 MB. It widens the effective LLM working set enough that a 24 GB consumer card competes with HBM-class parts at short context.

Source repository: github.com/Kaden-Schutt/hipfire · Companion: CDNA / MFMA deck · Further reading: /docs/architecture, /docs/benchmarks, /docs/quantization.