hipfire
/learn · deck 01 · RDNA / WMMA

RDNA & WMMA — A Distil

Companion to AMD Distil — CDNA / MFMA. This deck zooms in on the other AMD compute family: RDNA (consumer / pro / APU), the WMMA matrix path that arrived with RDNA3, and the empirical performance findings from the hipfire LLM-inference engine on real RDNA silicon (gfx1010 / gfx1030 / gfx1100 / gfx1151 / gfx12xx).

Where the CDNA deck spends its budget on data-center matrix throughput (MFMA, HBM, XCDs), this deck spends its budget on the consumer-silicon realities: wave32, GDDR6 / system-RAM, 32×16 matrix tiles via WMMA, and a dispatch story that has to span five generations from RDNA1 (no matrix unit at all) up to RDNA4 (WMMA32 with 1.5× the register file).


01 The RDNA family at a glance

hipfire targets the entire RDNA family with one Rust binary — the matrix-path feature gating is the load-bearing piece of the dispatch layer.

interactive · matrix-path generations · what each arch adds
matrix unit arrives WMMA32 + 1.5× regs RDNA1 gfx1010 scalar RDNA2 gfx1030 scalar dp4a (i8) RDNA3 gfx1100 scalar dp4a dot2 fp16 v_wmma_* 16×16 tile RDNA3.5 gfx1151 scalar dp4a dot2 fp16 v_wmma_* APU / iGPU RDNA4 gfx1201 scalar dp4a dot2 fp16 v_wmma_* 32×32 tile
RDNA3 (gfx1100) First gen with the WMMA matrix unit. 16×16 fp16 tile per issue, ~123 TFLOPs theoretical. The 2.94× prefill speedup measured in §07 starts here.
arch LLVM target wave VGPRs / SIMD waves / SIMD LDS / CU matrix path
Navi 10 (5700 XT)gfx10103210242064 KBdp4a only
Navi 21 (6900 XT)gfx10303210241664 KBdp4a only
Navi 31 (7900 XT/XTX)gfx11003210241664 KBWMMA
Navi 32 / 33gfx1101 / gfx11023210241664 KBWMMA
Strix Halo APUgfx1150 / gfx11513210241664 KBWMMA
Navi 48 / 44 (RX 9070)gfx1200 / gfx120132153616128 KBWMMA32

Two structural breaks worth memorizing:

  1. RDNA1/2 → RDNA3 is where the matrix unit appears. Before gfx1100, the fastest f16-mac path is v_dot2_f32_f16 (a 2-wide packed-FMA) or v_dot4_i32_i8 (dp4a). After gfx1100, you have v_wmma_* with a 16×16 accumulator per wave.
  2. RDNA3 → RDNA4 keeps the wave32 / WMMA shape but widens it: 1.5× the VGPR file per SIMD, 2× the LDS per CU, and a 32-element WMMA tile (WMMA32). Most kernels need their own .gfx12.hip variant to get the speedup.

02 CDNA vs RDNA — the divergence summary

CDNA — Instinct
MFMA path, data center
RDNA — Radeon / Ryzen AI
WMMA path, consumer
Wavefront64 threads32 threads
Matrix unitMFMA (v_mfma_*) — 16×16×16 fp16, 32×32×8 bf16, FP8 on CDNA 3WMMA (v_wmma_*) — 16×16×16 fp16 (RDNA3) / WMMA32 (RDNA4)
MemoryHBM3 / HBM3E / HBM4 (192–288 GB)GDDR6 / GDDR6X (8–24 GB) or shared system RAM (Strix Halo)
CacheLarge L2, compute-optimizedSmaller L1/L2 + Infinity Cache (256 MB+ on dGPU)
GraphicsNone (no rasterizer, no RT, no display)Full graphics + RT pipeline
Transistor budgetMore CUs, bigger matrix engines, more HBM controllersDisplay engines, rasterizer, RT cores compete with compute
Typical portgfx908 / gfx90a / gfx942 / gfx950gfx1010 → gfx1201

Why this matters for code: the same .hip file compiles for either family — but a CDNA kernel ported to RDNA hits three landmines: wavefront width (64 → 32 halves the lanes seen by __shfl_*), matrix intrinsic (v_mfma_* doesn't exist on RDNA — you must call v_wmma_*), and memory-tier sizing (256 MB Infinity Cache changes your tiling math vs HBM-direct).

03 WMMA — one instruction, one 16×16 tile

The WMMA family is RDNA3+'s answer to CDNA's MFMA. Same shape — a fused D = A × B + C over a 16×16 tile — but reachable from wave32 (half the lanes) and from a fundamentally consumer-silicon caches.

The canonical RDNA3 builtin (used in hipfire's gemm_f16_wmma.hip):

// WMMA layout: __builtin_amdgcn_wmma_f32_16x16x16_f16_w32(a, b, c)
// computes  D[i][j] = sum_k a[i][k] * b[j][k] + c[i][j]
// which is  D = A @ B^T + C.
//
// A is half16_t (16 fp16 values per lane)
// B is half16_t
// C / D are float8_t (8 fp32 accumulators per lane on wave32)

typedef _Float16 __attribute__((ext_vector_type(16))) half16_t;
typedef float    __attribute__((ext_vector_type(8)))  float8_t;

__launch_bounds__(32, 2)
extern "C" __global__ void gemm_f16_wmma(
    const _Float16* __restrict__ W, const float* __restrict__ X,
    float*          __restrict__ Y, int M, int K, int N)
{
    // Grid: [ceil(M/16), ceil(N/16)],  Block: [32]
    float8_t acc = {0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f};
    for (int k0 = 0; k0 < K; k0 += 16) {
        half16_t a_reg, b_reg;
        // ...load 16x16 tiles into a_reg and b_reg...
        acc = __builtin_amdgcn_wmma_f32_16x16x16_f16_w32(a_reg, b_reg, acc);
    }
    // ...store acc back to Y...
}
interactive · v_wmma_f32_NxNxN_f16 · one issue, one tile
A · fp16 · 16×16
×
B · fp16 · 16×16
+
C · fp32 · accumulator
=
D · fp32 · result
wave32 lanes contributing:
1 issue · 256 fma per issue — ~123 TFLOPs on 7900 XTX
tile shape
M=16, N=16, K=16
inputs · output
fp16 × fp16 → fp32 accumulator
per issue
256 fma ops · 16-element dot product per output cell
theoretical peak
~123 TFLOPs · 7900 XTX (gfx1100)
notes
Training-class precision. Default WMMA path on RDNA3.

What WMMA gives you (RDNA3, gfx1100 / 1151):

What WMMA32 (RDNA4, gfx1200+) adds:

04 The roofline — three matrix paths, one chip

Because RDNA3 ships both v_wmma_* and v_dot2_f32_f16 (and the scalar fallback), it has three different compute ceilings on the same chip. Which one bounds you is decided by your kernel's matrix path, not by the GPU's nameplate TFLOPs.

interactive · roofline · 7900 XTX · drag the batch-size slider
arithmetic intensity (FLOPs / byte) → performance (TFLOPs/s) → GDDR6 roof dot2 roof · 41 TF WMMA roof · 123 TF B=1 · decode GEMV
B=1
arithmetic intensity ~3.0 FLOPs/byte
currently bound by GDDR bandwidth
attained perf ~2.5 TFLOPs/s

This is exactly the regime split the hipfire empirical bench in § 07 confirms.

05 Why CDNA “throws away” graphics and RDNA doesn't

Mirror of slide 40 from the CDNA deck, but inverted:

RDNA spends transistors on ( + )
  • Rasterizers, ROPs, display engines, video encode/decode.
  • Ray-tracing accelerators (since RDNA2).
  • Mesh shaders, primitive shaders.
  • Smaller, latency-optimized L0 / L1.
What it gives up vs CDNA ( − )
  • Fewer CUs (96 on 7900 XTX vs 304 on MI300X — about 3×).
  • No HBM (24 GB GDDR6X vs 192 GB HBM3 — about 8×).
  • No FP8 / FP6 / FP4 MFMA — RDNA3 WMMA is fp16 / bf16 / i8 / i4 only; RDNA4 adds FP8.
  • No XCD-level multi-die packaging — RDNA3 chiplets are MCDs (memory) not compute.

The result for an LLM workload: an RDNA3 board costs ≈ 1/10 of an MI300X. The matrix-unit throughput per dollar is similar if your kernel actually saturates WMMA. The headline divergence is that one of them runs a display.

06 The wave32 occupancy table

WMMA-fused GEMMs are register-hungry. The 8-float accumulator alone costs 8 VGPRs/lane; with tile-staging and A/B register tiles you quickly land in the 80–120 VGPR range. That has to fit the wave32 occupancy curve — drag the slider:

interactive · wave32 occupancy explorer
96
RDNA3 1024 VGPRs / SIMD · gfx1100 / 1151
0481216
10 waves / SIMD
RDNA4 1536 VGPRs / SIMD · gfx1200 / 1201
0481216
16 waves / SIMD
Typical WMMA fused GEMM (~96 VGPRs/wave): RDNA4 keeps full occupancy; RDNA3 already loses 6 waves/SIMD.
full lookup table
VGPRs / wave RDNA3 max waves / SIMD
1024 VGPRs/SIMD
RDNA4 max waves / SIMD
1536 VGPRs/SIMD
≤ 64 1616
65–84 1216
85–96 1016
97–112 9 13
113–1288 12
129–1686 9
169–2045 7
205–2564 6

Two rules of thumb that show up repeatedly in the hipfire kernel notes:

Reading those numbers out of a compiled .hsaco (from the gfx-kernel-metadata skill):

ARCH=gfx1100
# 1. Unbundle the offload container into a real ELF
/opt/rocm/llvm/bin/clang-offload-bundler --type=o --unbundle \
    --input=kernel.hsaco --output=/tmp/kernel.elf \
    --targets=hipv4-amdgcn-amd-amdhsa--$ARCH
# 2. Read AMDGPU notes
/opt/rocm/llvm/bin/llvm-readelf --notes /tmp/kernel.elf
# → .vgpr_count, .group_segment_fixed_size, .private_segment_fixed_size,
#   .wavefront_size: 32 (RDNA) / 64 (CDNA)

07 Empirical — what WMMA is actually worth on real silicon

The headline empirical anchor from hipfire's May-2026 RDNA investigation. Same 1100-token NIAH-style prompt, 9B mq4, asym3 KV, three fresh-process warm runs per card.

fig · per-card prefill tok/s · 1100-token prompt, 9B mq4
gfx1010 · 5700 XT
0
gfx1030 · 6950 XT
0
gfx1151 · Strix Halo
0
card arch WMMA dp4a prefill tok/s decode tok/s TTFT @ 1100 tok
RX 5700 XTgfx1010 190.4 ± 2.5 54.7 5.78 s
RX 6950 XTgfx1030 328.1 ± 6.8 71.5 3.35 s
Strix Halo iGPUgfx1151 965.3 ± 2.7 45.0 1.14 s

Speedup ratios — what each lever is worth

2.94×
WMMA alone
gfx1151 vs gfx1030
5.07×
WMMA + arch gen
gfx1151 vs gfx1010
1.72×
dp4a alone
gfx1030 vs gfx1010

The pre-registered “strong win” threshold was 2×. WMMA cleared it by 2.5×.

The asymmetry is the lesson. Strix Halo's 256 GB/s shared system RAM is a quarter of the 5700 XT's 448 GB/s GDDR6. Yet it wins prefill 5×, because prefill is compute-bound and WMMA is the lever. The same card loses decode by 18%, because decode (batch=1 GEMV) is BW-bound and GDDR > system RAM. Specialize each tier — this is what “WMMA-downcasting prefill node + dedicated-GDDR decode node” means in practice on a heterogeneous RDNA cluster.

08 Memory hierarchy — RDNA edition

Same layered story as CDNA, with consumer-silicon numbers. The absolute bandwidths are smaller across the board, but Infinity Cache (256 MB+ on RDNA3 dGPU) is a uniquely RDNA tier — it sits between L2 and DRAM and dramatically widens the effective LLM-friendly working set.

fig · memory hierarchy, 7900 XTX class
 tier             scope         cap          BW          lat
 ────             ─────         ───          ──          ───
 VGPRs            per CU        8 KB/wave    ~100 TB/s   0 c
 LDS              per CU        64 / 128 KB  ~10 TB/s    ~20 c
 L1               per CU        32 KB        ~5 TB/s     ~50 c
 L2               per array     6 MB         ~3 TB/s     ~150 c
 Infinity Cache   device-wide   96–256 MB    ~17 TB/s    ~200 c   ← RDNA-only
 GDDR / sys RAM   device-wide   8–96+ GB     0.5–1.0     ~400 c
        

The standard CDNA-deck rule still holds: every step down the ladder is at least an order of magnitude slower, so a kernel that touches the same byte twice should get the second touch from LDS/L1, not DRAM. The RDNA twist is that Infinity Cache makes the device-wide working set ~256 MB instead of ~32 MB of L2 — a Qwen-9B asym3-KV slice for short context lives entirely in IC, which is one of the reasons hipfire's decode tok/s on gfx1100 is competitive with much wider-memory parts.

09 Dispatch — fast paths first, baseline last

rdna-compute::dispatch is the kernel-selection hot path in hipfire. Every GEMM / GEMV / norm / fused op routes through here. The shape is always the same: arch-feature predicates, fast paths first, baseline (scalar / wave64-RDNA1) last.

interactive · dispatch tree · click an arch to see which kernel branch wins
gemm_qkv_hfq4g256(...)
has_wmma_f16_gfx12 WMMA32 path gfx1200 / 1201
has_wmma_f16 WMMA16 path gfx1100 / 1101 / 1102 / 1150 / 1151
has_dot2_f32_f16 dot2 path gfx1030
(default) scalar baseline gfx1010, wave64 fallback
gfx1100 takes the WMMA16 branch — ~123 TFLOPs ceiling.
pub fn gemm_qkv_hfq4g256(&self, ...) -> HipResult<()> {
    if has_wmma_f16_gfx12(&self.arch) {                 // gfx1200/1201
        return self.gemm_qkv_hfq4g256_wmma_gfx12(...);
    }
    if has_wmma_f16(&self.arch) {                       // gfx11xx, gfx1150/51
        return self.gemm_qkv_hfq4g256_wmma(...);
    }
    if has_dot2_f32_f16(&self.arch) {                   // RDNA2 fallback
        return self.gemm_qkv_hfq4g256_dot2(...);
    }
    self.gemm_qkv_hfq4g256_baseline(...)                // RDNA1, fallback
}

// Predicates live at the top of dispatch.rs:
fn has_wmma_f16(arch: &str)        -> bool { arch.starts_with("gfx11") }
fn has_wmma_f16_gfx12(arch: &str)  -> bool { arch.starts_with("gfx12") }
fn has_wmma_fp8_gfx12(arch: &str)  -> bool { arch.starts_with("gfx12") }

Two contributor rules from the architecture doc:

  1. Predicates are arch-feature checks, not inline arch.starts_with(...) chains. New silicon = update one predicate, not 40 call sites.
  2. No unreachable branches. When a new arch absorbs a check that was matched by an older || starts_with("gfxN") clause, drop the redundant clause in the same diff.

The per-arch kernel-file naming convention follows the dispatch:

kernels/src/gemm_qkv_hfq4g256_wmma.hip            # default WMMA (RDNA3)
kernels/src/gemm_qkv_hfq4g256_wmma.gfx12.hip      # gfx12xx override (WMMA32)
kernels/src/gemv_hfq4g256.gfx1030.v4.hip          # chip-specific versioned
kernels/src/gemv_hfq6g256_residual_wave64.hip     # wave64 (CDNA / RDNA1 fall-back)

scripts/compile-kernels.sh resolves chip → family → default in that order; .gfx12.hip covers both gfx1200 and gfx1201 with one file.

10 hipfire headline numbers — what this all adds up to

The reason any of the above engineering effort matters is the end-to-end inference throughput on consumer RDNA silicon.

7900 XTX (gfx1100), default config (asym3 KV, FlashAttention auto):

model decode tok/s peak prefill tok/s vs ollama Q4_K_M
Qwen 3.5 0.8B39173832.10×
Qwen 3.5 4B 18024871.78×
Qwen 3.5 9B 13216631.71×
Qwen 3.5 27B 47 478

DFlash speculative decode adds another lever on top: 217 tok/s peak on 27B HumanEval/53 (4.8× over AR), 129 tok/s mean across HumanEval N=33 (2.9× AR mean). On 9B: 372 tok/s peak. The DFlash gain is genre-conditional; see benchmarks for the per-genre table.

27B AR across the RDNA family — same model, same prompt sweep, kv-mode asym3, --no-chatml:

card arch 27B AR decode tok/s prefill tok/s TTFT
7900 XTXgfx1100 44.9 462 328 ms
R9700gfx1201 35.1
Strix Halo iGPUgfx1151 14.95 161 950 ms

HumanEval N=33 sweep, deterministic to ±1.3 % across reruns. The 3.0× spread from 7900 XTX to Strix Halo is the gap between 24 GB GDDR6X (~960 GB/s) and shared system RAM (~256 GB/s) for a memory-bound decode workload — consistent with the §07 prefill numbers flipping the other direction.

And on the APU — Strix Halo (gfx1151) solo, 27B mq4 with the 27B-DFlash sidecar, canonical CODE prompt, default adaptive-b (the selector self-clamps to B=16):

modetok/svs AR
AR baseline (no draft)14.95
hipfire DFlash82.05.5×
lucebox DFlash · llama.cpp fork, same model27.41.8×

hipfire DFlash row: 3-cell median, τ=10.0, accept rate 0.67. Lucebox is a separate llama.cpp fork running DFlash-style spec decode on the same hardware and model — hipfire's tuned WMMA kernels and adaptive-b selector together deliver 3.0× the lucebox throughput. HumanEval N=33 DFlash sweep on this card is currently rerunning.

That 1.84× DFlash gain on a system-RAM iGPU is a real silicon win attributable to the WMMA unit accelerating the verify-batch step, even when system-memory bandwidth limits the AR baseline.

11 Summary — what to remember


Source repository: github.com/Kaden-Schutt/hipfire · Companion: CDNA / MFMA deck · Further reading: /docs/architecture, /docs/benchmarks, /docs/quantization.