CDNA & MFMA — A Distil
Companion to RDNA & WMMA. This deck
walks the data-center half of the AMD compute family: the Instinct
lineage, the v_mfma_* matrix engine, the HBM3 memory
hierarchy, the roofline that decides whether your kernel is
memory-bound or compute-bound, and the
rocprofv3 counters that actually answer that question.
Where the RDNA deck spent its budget on consumer realities (wave32, GDDR, dispatch over five generations), this deck spends its budget on the data-center invariants: wave64, MFMA tiles, HBM, and a transistor budget that throws away every rasterizer.
01 The roofline model
Two hardware ceilings — compute throughput and memory bandwidth — place an upper bound on any kernel. Arithmetic intensity (FLOPs per byte) decides which one you hit.
- AI = FLOPs ÷ bytes transferred. Left of the ridge = memory-bound → optimize data reuse. Right of the ridge = compute-bound → optimize ALU use.
- Algorithm AI scaling: SAXPY O(1) · FFT O(log N) · SGEMM O(N).
- The roofline is the only diagnostic that tells you which optimization to try first. Profile, plot, then tune.
02 Profiling — which metrics, and why
rocprofv3 collects hundreds of counters. Three families
diagnose a kernel: timing, compute, memory.
rocprofv3 --kernel-trace \
--pmc SQ_WAVES VALUBusy MemUnitBusy FetchSize \
-- ./my_kernel
# writes <pid>_kernel_trace.csv (timing)
# <pid>_counter_collection.csv (counters) | family | counters | what it tells you |
|---|---|---|
| timing | Start_Timestamp, End_Timestamp, Kernel_Name, VGPR_Count | Which kernel runs longest, and is its occupancy register-limited? |
| compute | VALUBusy, VALUUtilization, SQ_WAVES | Low VALUBusy → CU idle → memory- or latency-bound. Low VALUUtilization → divergence. |
| memory | MemUnitBusy, L2CacheHit, FetchSize | High MemUnitBusy + low L2CacheHit → hitting HBM not cache. Poor coalescing or no reuse. |
03 ROCm — the software stack underneath
ROCm = Radeon Open Compute — AMD's open-source software stack for GPU computing.
- Open source. Entire stack on GitHub. No vendor lock-in — inspect, patch, or fork any layer you depend on.
- CUDA compatible. HIP mirrors the CUDA runtime API.
hipifyautomates most of the port rewrite. - Production proven. Meta (Llama), Microsoft Azure, Oracle OCI, OpenAI. Supercomputers: Frontier, El Capitan, LUMI.
- Frameworks: JAX, ONNX-RT, PyTorch, TensorFlow.
- ML / CV: CK, MIGraphX, MIOpen, MIVisionX, RPP, rocAL, rocDecode, rocJPEG.
- Comm: RCCL, rocSHMEM.
- Math: hipBLAS, hipBLASLt, hipFFT, hipRAND, hipSOLVER, hipSPARSE, rocBLAS, rocFFT, rocSOLVER, rocWMMA, Tensile, hipfort, half.
- Primitives: hipCUB, hipTensor, rocPRIM, rocThrust.
- Tools: AMD SMI, ROCm SMI, rocminfo, rocprofv3, Compute Profiler, HIPIFY, ROCgdb.
- Compilers: HIPCC, ROCm compilers, FLANG.
- Runtime: HIP runtime.
04 ROCm releases & compatibility
Which product runs which architecture, and how that maps to an LLVM target.
| product | architecture | LLVM target |
|---|---|---|
| MI355X · MI350X | CDNA 4 | gfx950 |
| MI325X · MI300X · MI300A | CDNA 3 | gfx942 |
| MI250X · MI250 · MI210 | CDNA 2 | gfx90a |
| MI100 | CDNA 1 | gfx908 |
| RX 9070 / 9060 · PRO R9700 | RDNA 4 | gfx1200 / 1201 |
| RX 7900 / 7800 / 7700 · PRO W7900 / W7800 | RDNA 3 | gfx1100 / 1101 |
| Ryzen AI / iGPU | RDNA 3.5 | gfx1150 / 1151 |
05 HIP — one source, two targets
Compiler that targets both AMD and NVIDIA GPUs from a single source file.
my_kernel.hip
│
hipcc
┌────┴────┐
▼ ▼
amdclang nvcc
│ │
▼ ▼
AMD ISA NVIDIA PTX
- Same source: same
.hipcompiles on either platform — no#ifdefs, no per-vendor branches. - CUDA-like syntax:
__global__,<<<grid,block>>>,threadIdx,blockIdx. - Most CUDA code ports:
hipify-perlrenames CUDA → HIP automatically.
06 HIP vs CUDA — side by side
Minimal vector-add kernel in both. The shape is identical — only the API prefix differs.
#include <hip/hip_runtime.h>
__global__ void add(float* a, float* b, float* c) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
c[i] = a[i] + b[i];
}
int main() {
float *a, *b, *c;
hipMalloc(&a, N*sizeof(float));
hipMalloc(&b, N*sizeof(float));
add<<<blocks, threads>>>(a,b,c);
hipDeviceSynchronize();
} #include <cuda_runtime.h>
__global__ void add(float* a, float* b, float* c) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
c[i] = a[i] + b[i];
}
int main() {
float *a, *b, *c;
cudaMalloc(&a, N*sizeof(float));
cudaMalloc(&b, N*sizeof(float));
add<<<blocks, threads>>>(a,b,c);
cudaDeviceSynchronize();
} Kernel body is identical — only the API prefix differs.
07 ROCm tool tour
Six tools cover the full cycle: monitor → query → compile → profile → trace → compute.
| tool | what it does | when you use it |
|---|---|---|
amd-smi | GPU dashboard: temp, clocks, memory, power, utilization. | Every session — is it busy? overheating? |
rocminfo | Lists GPUs: name, CU count, ISA, wavefront, clocks. | First thing after install — does ROCm see the GPU? |
hipcc | HIP compiler — auto-detects GPU, emits native code. | Every compile: hipcc -O3 -o out in.cpp |
rocprofv3 | Kernel times, HW counters, occupancy metrics. | Profiling: rocprofv3 --stats -- ./app |
rocBLAS | BLAS / GEMM — matrix-multiply engine. | PyTorch calls it for every linear layer. |
MIOpen | DL primitives: conv, batchnorm, pooling, RNNs. | PyTorch / TF call it (miopenConv*). |
08 How your code maps to the chip
The kitchen analogy — how software entities land on the hardware:
| kitchen analogy | software | hardware |
|---|---|---|
| recipe | KERNEL — the function code | loaded into every CU's instruction cache |
| all the meals | GRID — all threads launched | dispatched across the whole GPU |
| a team of cooks | BLOCK — 64–1024 threads | runs on ONE CU (shares that CU's LDS) |
| a squad of 64 in sync | WAVEFRONT — 64 threads of a block | runs on ONE SIMD = 64 ALUs lockstep |
| one cook | THREAD — one work-item | executes on ONE ALU lane |
| the fridge | MEMORY — global arrays | stored in HBM (fetched via L2 → L1) |
09 SIMT — the GPU execution model
Single Instruction, Multiple Threads — write code for ONE thread; the GPU runs thousands of copies, grouped into 64-thread wavefronts on CDNA.
kernel pc → ┐
▼
T0 T1 T2 T3 T4 ... T60 T61 T62 T63 ← 64 lanes, all advancing one
▏ ▏ ▏ ▏ ▏ ▏ ▏ ▏ ▏ instruction per cycle
───────────── 1 wavefront ─────────────
- Write code for one thread. The kernel describes work for a single thread; the hardware spawns thousands of copies — one per work-item.
- 64 threads = 1 wavefront. The scheduler issues 64 threads together. They share a program counter and advance one instruction per cycle on a SIMD.
- MI300X scale: 304 CUs × 64 ALUs — the GPU can keep ~20,000 threads in flight at once.
10 Wavefronts & latency hiding
A wavefront is the GPU's unit of scheduling: 64 threads in lockstep. Keeping many wavefronts in flight hides memory latency — on memory stall, the CU scheduler switches to another wavefront with zero context-switch cost.
11 Three bottleneck types
Every kernel sits in one of three regimes. The optimization recipe is different for each:
| regime | limit | symptom | on roofline | fix |
|---|---|---|---|---|
| MEMORY-BOUND | data movement HBM ↔ cache / LDS | Low AI · mem pipes near peak · ALUs idle waiting | left of ridge | Coalesce, stage reuse in LDS, raise AI via fusion, smaller precision |
| COMPUTE-BOUND | ALU / MFMA throughput | ALUs or matrix engines near peak · memory mostly idle | at the compute roof | Use MFMA, reduce precision (FP8 / BF16), increase ILP, fewer redundant ops |
| OVERHEAD-BOUND | host scheduling, launch, tiny arrays | GPU mostly idle · few wavefronts active · many tiny launches | far below both roofs | Larger workgroups, fewer fatter kernels, HIP graphs, batch small launches |
12 MI300 — package, XCD, compute unit
From the full board (package + HBM + I/O) down to one CU's internals. Three zoom levels:
[ MI300X package ]
┌──────────────────────────────────────────────────────┐
│ HBM3 XCD XCD XCD XCD HBM3 │
│ HBM3 XCD XCD XCD XCD HBM3 │
│ HBM3 PCIe 5 + Infinity Fabric I/O HBM3│
│ HBM3 HBM3│
└──────────────────────────────────────────────────────┘
8 XCDs · 8 HBM3 stacks · 192 GB total HBM3
[ one XCD ] [ one CU ]
┌────────────────────┐ ┌─────────────────┐
│ ▢ ▢ ▢ ▢ ▢ ▢ ▢ ▢ │ │ SIMD0 SIMD1 │
│ ▢ ▢ ▢ ▢ ▢ ▢ ▢ ▢ │ ────→ │ SIMD2 SIMD3 │
│ ▢ ▢ ▢ ▢ ▢ ▢ ▢ ▢ │ │ + MFMA + scalar│
│ ▢ ▢ ▢ ▢ ▢ ▢ │ │ LDS 64 KB · L1 │
└────────────────────┘ └─────────────────┘
~38 CUs / XCD 4 SIMDs × 16 ALUs = 64 lanes
304 active CUs total + matrix engine + LDS + L1
13 MFMA — one instruction, one full matrix tile
The matrix engine computes a 16×16 tile multiply-accumulate in a single instruction — the building block of every GEMM.
- What it is: fused matrix op — one instruction does a full tile MMA.
- How it works: each CU has its own MFMA unit, pipelined alongside SIMD ALUs.
- Why it matters: 70–90% of AI compute is GEMM — rocBLAS / MIOpen use MFMA automatically.
- Variants: 16×16×16 fp16, 32×32×8 bf16, FP8 / FP6 / FP4 on CDNA 3+.
14 CDNA vs RDNA — two architectures
Same parent, different goals. The differences matter because porting CUDA code or tuning across families hits real pitfalls.
| CDNA — Instinct data-center AI & HPC | RDNA — Radeon consumer / workstation graphics | |
|---|---|---|
| Wavefront | 64 threads | 32 threads |
| Memory | HBM3 / HBM3E / HBM4 | GDDR6 / GDDR6X |
| Cache | large L2, compute-optimized | graphics-optimized L1 + GL1 |
| ISA family | gfx908, gfx90a, gfx942, gfx950 | gfx1100, gfx1200, gfx1201 |
| Examples | MI300X, MI350X, MI400 | RX 9070, Radeon PRO |
Same HIP source, different machine code: hipcc reads the ISA family.
15 Why CDNA has no graphics pipeline
Every transistor on a CDNA die goes to compute. The result is a radically different chip from a gaming GPU — even though both are “GPUs.”
- More compute units.
- Bigger matrix engines.
- Larger caches.
- More HBM controllers.
- No rasterizers.
- No ray-tracing units.
- No display outputs.
The result:
192 GB HBM3
24 GB GDDR6X
because no silicon is wasted on graphics
16 Memory hierarchy — CDNA edition
The 10–50× bandwidth gap from registers to HBM drives every optimization strategy. Keep hot data as close to the CU as possible.
| level | capacity | bandwidth | latency | scope | tier |
|---|---|---|---|---|---|
| Registers (VGPRs) | ~256 / thread | ~100 TB/s | 0 c | per CU | user · fastest |
| LDS (shared) | 64 KB | ~10 TB/s | ~20 c | per CU | user |
| L1 | 32 KB | ~5 TB/s | ~50 c | per CU | HW |
| L2 | 8–32 MB | ~3 TB/s | ~200 c | per XCD | HW |
| Infinity Cache | 256 MB+ | ~17 TB/s | ~150 c | device-wide | HW |
| HBM (global) | 64–288 GB | 1.6–8 TB/s | ~400 c | device-wide | user · slowest |
Every step down the ladder is at least an order of magnitude slower. Keep hot data as close to the CU as possible. If a kernel touches the same byte twice, it should come from LDS or L1 the second time — not HBM.
17 Memory optimization recipes
| technique | aspect | rule | scope |
|---|---|---|---|
| Coalescing | global memory access pattern | Make consecutive threads in a wavefront read consecutive addresses. | CU → L2 → HBM |
| LDS bank conflicts | shared-memory access pattern | Pad shared-memory arrays so threads in a wavefront hit different banks. | CU → LDS |
| Cache behaviour | L1 / L2 / Infinity Cache reuse | If you reuse data, keep it cached; if you stream once, bypass with glc/slc. | CU → L1 → L2 → HBM |
| Vectorised loads | instructions-per-byte | When threads read consecutive floats, use float4 to fetch 16 bytes per instruction. | instruction-level |
| Data layout | row-major / col-major / tiling | Lay your arrays out so neighbouring threads naturally land on neighbouring addresses. | structure-time |
Workflow: profile first (rocprofv3) → identify the bottleneck → apply the matching technique.
18 Summary — what to remember
- CDNA wave width is 64. Every
__shfl_*, occupancy calculation, and MFMA builtin assumes 64 lanes. The number flips to 32 on RDNA. - MFMA is the CDNA matrix unit. One instruction does a full 16×16 (or 32×32) tile MAC. 70–90% of AI compute is GEMM, so MFMA is the load-bearing piece.
- HBM3 is the only DRAM tier on CDNA. 192 GB on MI300X, ~5 TB/s aggregate. No GDDR, no Infinity Cache on most CDNA parts — latency hiding via occupancy is mandatory.
- The roofline tells you which optimization to try first. Memory-bound → coalesce / fuse / reuse. Compute-bound → MFMA / lower precision / ILP. Overhead-bound → fewer fatter kernels / HIP graphs.
rocprofv3gives you the three counters that matter:VALUBusy,MemUnitBusy,L2CacheHit. Three numbers and you can place a kernel on the roofline.- ROCm is open source, top to bottom. HIP is CUDA-portable,
hipify-perlauto-renames most CUDA code, and the entire stack lives on GitHub. - The same HIP source compiles for either family — but a CDNA kernel ported to RDNA needs new matrix intrinsics and a wavefront-width audit. That's the whole reason the companion deck exists.
Source repository: github.com/Kaden-Schutt/hipfire · Companion: RDNA / WMMA deck · Further reading: /docs/architecture.