/learn · deck 02 · CDNA / MFMA

CDNA & MFMA — A Distil

Companion to RDNA & WMMA. This deck walks the data-center half of the AMD compute family: the Instinct lineage, the v_mfma_* matrix engine, the HBM3 memory hierarchy, the roofline that decides whether your kernel is memory-bound or compute-bound, and the rocprofv3 counters that actually answer that question.

Where the RDNA deck spent its budget on consumer realities (wave32, GDDR, dispatch over five generations), this deck spends its budget on the data-center invariants: wave64, MFMA tiles, HBM, and a transistor budget that throws away every rasterizer.

01 The roofline model

Two hardware ceilings — compute throughput and memory bandwidth — place an upper bound on any kernel. Arithmetic intensity (FLOPs per byte) decides which one you hit.

fig · roofline, MI300X class

AI = FLOPs ÷ bytes transferred. Left of the ridge = memory-bound → optimize data reuse. Right of the ridge = compute-bound → optimize ALU use.
Algorithm AI scaling: SAXPY O(1) · FFT O(log N) · SGEMM O(N).
The roofline is the only diagnostic that tells you which optimization to try first. Profile, plot, then tune.

02 Profiling — which metrics, and why

rocprofv3 collects hundreds of counters. Three families diagnose a kernel: timing, compute, memory.

rocprofv3 --kernel-trace \
  --pmc SQ_WAVES VALUBusy MemUnitBusy FetchSize \
  -- ./my_kernel
# writes  <pid>_kernel_trace.csv          (timing)
#         <pid>_counter_collection.csv    (counters)

family	counters	what it tells you
timing	`Start_Timestamp`, `End_Timestamp`, `Kernel_Name`, `VGPR_Count`	Which kernel runs longest, and is its occupancy register-limited?
compute	`VALUBusy`, `VALUUtilization`, `SQ_WAVES`	Low VALUBusy → CU idle → memory- or latency-bound. Low VALUUtilization → divergence.
memory	`MemUnitBusy`, `L2CacheHit`, `FetchSize`	High MemUnitBusy + low L2CacheHit → hitting HBM not cache. Poor coalescing or no reuse.

03 ROCm — the software stack underneath

ROCm = Radeon Open Compute — AMD's open-source software stack for GPU computing.

what ships ( + )

Open source. Entire stack on GitHub. No vendor lock-in — inspect, patch, or fork any layer you depend on.
CUDA compatible. HIP mirrors the CUDA runtime API. hipify automates most of the port rewrite.
Production proven. Meta (Llama), Microsoft Azure, Oracle OCI, OpenAI. Supercomputers: Frontier, El Capitan, LUMI.

core SDK groups

Frameworks: JAX, ONNX-RT, PyTorch, TensorFlow.
ML / CV: CK, MIGraphX, MIOpen, MIVisionX, RPP, rocAL, rocDecode, rocJPEG.
Comm: RCCL, rocSHMEM.
Math: hipBLAS, hipBLASLt, hipFFT, hipRAND, hipSOLVER, hipSPARSE, rocBLAS, rocFFT, rocSOLVER, rocWMMA, Tensile, hipfort, half.
Primitives: hipCUB, hipTensor, rocPRIM, rocThrust.
Tools: AMD SMI, ROCm SMI, rocminfo, rocprofv3, Compute Profiler, HIPIFY, ROCgdb.
Compilers: HIPCC, ROCm compilers, FLANG.
Runtime: HIP runtime.

04 ROCm releases & compatibility

Which product runs which architecture, and how that maps to an LLVM target.

product	architecture	LLVM target
MI355X · MI350X	CDNA 4	`gfx950`
MI325X · MI300X · MI300A	CDNA 3	`gfx942`
MI250X · MI250 · MI210	CDNA 2	`gfx90a`
MI100	CDNA 1	`gfx908`
RX 9070 / 9060 · PRO R9700	RDNA 4	`gfx1200 / 1201`
RX 7900 / 7800 / 7700 · PRO W7900 / W7800	RDNA 3	`gfx1100 / 1101`
Ryzen AI / iGPU	RDNA 3.5	`gfx1150 / 1151`

05 HIP — one source, two targets

Compiler that targets both AMD and NVIDIA GPUs from a single source file.

fig · HIP compilation flow

                       my_kernel.hip
                            │
                          hipcc
                       ┌────┴────┐
                       ▼         ▼
                   amdclang     nvcc
                       │         │
                       ▼         ▼
                   AMD ISA    NVIDIA PTX

Same source: same .hip compiles on either platform — no #ifdefs, no per-vendor branches.
CUDA-like syntax: __global__, <<<grid,block>>>, threadIdx, blockIdx.
Most CUDA code ports: hipify-perl renames CUDA → HIP automatically.

06 HIP vs CUDA — side by side

Minimal vector-add kernel in both. The shape is identical — only the API prefix differs.

HIP (AMD)

#include <hip/hip_runtime.h>
__global__ void add(float* a, float* b, float* c) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  c[i] = a[i] + b[i];
}
int main() {
  float *a, *b, *c;
  hipMalloc(&a, N*sizeof(float));
  hipMalloc(&b, N*sizeof(float));
  add<<<blocks, threads>>>(a,b,c);
  hipDeviceSynchronize();
}

CUDA (NVIDIA)

#include <cuda_runtime.h>
__global__ void add(float* a, float* b, float* c) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  c[i] = a[i] + b[i];
}
int main() {
  float *a, *b, *c;
  cudaMalloc(&a, N*sizeof(float));
  cudaMalloc(&b, N*sizeof(float));
  add<<<blocks, threads>>>(a,b,c);
  cudaDeviceSynchronize();
}

Kernel body is identical — only the API prefix differs.

07 ROCm tool tour

Six tools cover the full cycle: monitor → query → compile → profile → trace → compute.

tool	what it does	when you use it
`amd-smi`	GPU dashboard: temp, clocks, memory, power, utilization.	Every session — is it busy? overheating?
`rocminfo`	Lists GPUs: name, CU count, ISA, wavefront, clocks.	First thing after install — does ROCm see the GPU?
`hipcc`	HIP compiler — auto-detects GPU, emits native code.	Every compile: `hipcc -O3 -o out in.cpp`
`rocprofv3`	Kernel times, HW counters, occupancy metrics.	Profiling: `rocprofv3 --stats -- ./app`
`rocBLAS`	BLAS / GEMM — matrix-multiply engine.	PyTorch calls it for every linear layer.
`MIOpen`	DL primitives: conv, batchnorm, pooling, RNNs.	PyTorch / TF call it (`miopenConv*`).

08 How your code maps to the chip

The kitchen analogy — how software entities land on the hardware:

kitchen analogy	software	hardware
recipe	KERNEL — the function code	loaded into every CU's instruction cache
all the meals	GRID — all threads launched	dispatched across the whole GPU
a team of cooks	BLOCK — 64–1024 threads	runs on ONE CU (shares that CU's LDS)
a squad of 64 in sync	WAVEFRONT — 64 threads of a block	runs on ONE SIMD = 64 ALUs lockstep
one cook	THREAD — one work-item	executes on ONE ALU lane
the fridge	MEMORY — global arrays	stored in HBM (fetched via L2 → L1)

09 SIMT — the GPU execution model

Single Instruction, Multiple Threads — write code for ONE thread; the GPU runs thousands of copies, grouped into 64-thread wavefronts on CDNA.

fig · one wavefront, 64 threads in lockstep

 kernel pc → ┐
             ▼
 T0  T1  T2  T3  T4  ... T60 T61 T62 T63   ← 64 lanes, all advancing one
 ▏   ▏   ▏   ▏   ▏        ▏   ▏   ▏   ▏      instruction per cycle
 ───────────── 1 wavefront ─────────────

Write code for one thread. The kernel describes work for a single thread; the hardware spawns thousands of copies — one per work-item.
64 threads = 1 wavefront. The scheduler issues 64 threads together. They share a program counter and advance one instruction per cycle on a SIMD.
MI300X scale: 304 CUs × 64 ALUs — the GPU can keep ~20,000 threads in flight at once.

10 Wavefronts & latency hiding

A wavefront is the GPU's unit of scheduling: 64 threads in lockstep. Keeping many wavefronts in flight hides memory latency — on memory stall, the CU scheduler switches to another wavefront with zero context-switch cost.

interactive · latency hiding · three wavefronts, one stall

WF A

running

WF B

queued

WF C

queued

running on SIMD stalled on HBM queued / done CU: busy

11 Three bottleneck types

Every kernel sits in one of three regimes. The optimization recipe is different for each:

regime	limit	symptom	on roofline	fix
MEMORY-BOUND	data movement HBM ↔ cache / LDS	Low AI · mem pipes near peak · ALUs idle waiting	left of ridge	Coalesce, stage reuse in LDS, raise AI via fusion, smaller precision
COMPUTE-BOUND	ALU / MFMA throughput	ALUs or matrix engines near peak · memory mostly idle	at the compute roof	Use MFMA, reduce precision (FP8 / BF16), increase ILP, fewer redundant ops
OVERHEAD-BOUND	host scheduling, launch, tiny arrays	GPU mostly idle · few wavefronts active · many tiny launches	far below both roofs	Larger workgroups, fewer fatter kernels, HIP graphs, batch small launches

12 MI300 — package, XCD, compute unit

From the full board (package + HBM + I/O) down to one CU's internals. Three zoom levels:

fig · MI300X zoom

 [ MI300X package ]
 ┌──────────────────────────────────────────────────────┐
 │  HBM3        XCD       XCD       XCD       XCD  HBM3 │
 │  HBM3        XCD       XCD       XCD       XCD  HBM3 │
 │  HBM3            PCIe 5  +  Infinity Fabric I/O  HBM3│
 │  HBM3                                             HBM3│
 └──────────────────────────────────────────────────────┘
        8 XCDs · 8 HBM3 stacks · 192 GB total HBM3

 [ one XCD ]                          [ one CU ]
 ┌────────────────────┐               ┌─────────────────┐
 │ ▢ ▢ ▢ ▢ ▢ ▢ ▢ ▢   │               │ SIMD0 SIMD1     │
 │ ▢ ▢ ▢ ▢ ▢ ▢ ▢ ▢   │   ────→       │ SIMD2 SIMD3     │
 │ ▢ ▢ ▢ ▢ ▢ ▢ ▢ ▢   │               │ + MFMA  + scalar│
 │ ▢ ▢ ▢ ▢ ▢ ▢       │               │ LDS 64 KB · L1  │
 └────────────────────┘               └─────────────────┘
   ~38 CUs / XCD                       4 SIMDs × 16 ALUs = 64 lanes
   304 active CUs total                + matrix engine + LDS + L1

13 MFMA — one instruction, one full matrix tile

The matrix engine computes a 16×16 tile multiply-accumulate in a single instruction — the building block of every GEMM.

interactive · v_mfma_f32_16x16x16_f16 · one instruction, one tile

A · fp16 · 16×16

B · fp16 · 16×16

C · fp32 · accumulator

D · fp32 · result

1 issue · ~256 fp16 MACs per cycle, per CU

What it is: fused matrix op — one instruction does a full tile MMA.
How it works: each CU has its own MFMA unit, pipelined alongside SIMD ALUs.
Why it matters: 70–90% of AI compute is GEMM — rocBLAS / MIOpen use MFMA automatically.
Variants: 16×16×16 fp16, 32×32×8 bf16, FP8 / FP6 / FP4 on CDNA 3+.

14 CDNA vs RDNA — two architectures

Same parent, different goals. The differences matter because porting CUDA code or tuning across families hits real pitfalls.

	CDNA — Instinct data-center AI & HPC	RDNA — Radeon consumer / workstation graphics
Wavefront	64 threads	32 threads
Memory	HBM3 / HBM3E / HBM4	GDDR6 / GDDR6X
Cache	large L2, compute-optimized	graphics-optimized L1 + GL1
ISA family	`gfx908, gfx90a, gfx942, gfx950`	`gfx1100, gfx1200, gfx1201`
Examples	MI300X, MI350X, MI400	RX 9070, Radeon PRO

Same HIP source, different machine code: hipcc reads the ISA family.

15 Why CDNA has no graphics pipeline

Every transistor on a CDNA die goes to compute. The result is a radically different chip from a gaming GPU — even though both are “GPUs.”

transistor budget — what it spends on ( + )

More compute units.
Bigger matrix engines.
Larger caches.
More HBM controllers.

what it gives up ( − )

No rasterizers.
No ray-tracing units.
No display outputs.

The result:

304 CUs

MI300X (CDNA 3)
192 GB HBM3

96 CUs

RX 7900 XTX (RDNA 3)
24 GB GDDR6X

3× / 8×

more CUs / more memory
because no silicon is wasted on graphics

16 Memory hierarchy — CDNA edition

The 10–50× bandwidth gap from registers to HBM drives every optimization strategy. Keep hot data as close to the CU as possible.

level	capacity	bandwidth	latency	scope	tier
Registers (VGPRs)	~256 / thread	~100 TB/s	0 c	per CU	user · fastest
LDS (shared)	64 KB	~10 TB/s	~20 c	per CU	user
L1	32 KB	~5 TB/s	~50 c	per CU	HW
L2	8–32 MB	~3 TB/s	~200 c	per XCD	HW
Infinity Cache	256 MB+	~17 TB/s	~150 c	device-wide	HW
HBM (global)	64–288 GB	1.6–8 TB/s	~400 c	device-wide	user · slowest

Every step down the ladder is at least an order of magnitude slower. Keep hot data as close to the CU as possible. If a kernel touches the same byte twice, it should come from LDS or L1 the second time — not HBM.

17 Memory optimization recipes

technique	aspect	rule	scope
Coalescing	global memory access pattern	Make consecutive threads in a wavefront read consecutive addresses.	CU → L2 → HBM
LDS bank conflicts	shared-memory access pattern	Pad shared-memory arrays so threads in a wavefront hit different banks.	CU → LDS
Cache behaviour	L1 / L2 / Infinity Cache reuse	If you reuse data, keep it cached; if you stream once, bypass with `glc/slc`.	CU → L1 → L2 → HBM
Vectorised loads	instructions-per-byte	When threads read consecutive floats, use `float4` to fetch 16 bytes per instruction.	instruction-level
Data layout	row-major / col-major / tiling	Lay your arrays out so neighbouring threads naturally land on neighbouring addresses.	structure-time

Workflow: profile first (rocprofv3) → identify the bottleneck → apply the matching technique.

18 Summary — what to remember

CDNA wave width is 64. Every __shfl_*, occupancy calculation, and MFMA builtin assumes 64 lanes. The number flips to 32 on RDNA.
MFMA is the CDNA matrix unit. One instruction does a full 16×16 (or 32×32) tile MAC. 70–90% of AI compute is GEMM, so MFMA is the load-bearing piece.
HBM3 is the only DRAM tier on CDNA. 192 GB on MI300X, ~5 TB/s aggregate. No GDDR, no Infinity Cache on most CDNA parts — latency hiding via occupancy is mandatory.
The roofline tells you which optimization to try first. Memory-bound → coalesce / fuse / reuse. Compute-bound → MFMA / lower precision / ILP. Overhead-bound → fewer fatter kernels / HIP graphs.
rocprofv3 gives you the three counters that matter: VALUBusy, MemUnitBusy, L2CacheHit. Three numbers and you can place a kernel on the roofline.
ROCm is open source, top to bottom. HIP is CUDA-portable, hipify-perl auto-renames most CUDA code, and the entire stack lives on GitHub.
The same HIP source compiles for either family — but a CDNA kernel ported to RDNA needs new matrix intrinsics and a wavefront-width audit. That's the whole reason the companion deck exists.

Source repository: github.com/Kaden-Schutt/hipfire · Companion: RDNA / WMMA deck · Further reading: /docs/architecture.