hipfire
/learn · deck 02 · CDNA / MFMA

CDNA & MFMA — A Distil

Companion to RDNA & WMMA. This deck walks the data-center half of the AMD compute family: the Instinct lineage, the v_mfma_* matrix engine, the HBM3 memory hierarchy, the roofline that decides whether your kernel is memory-bound or compute-bound, and the rocprofv3 counters that actually answer that question.

Where the RDNA deck spent its budget on consumer realities (wave32, GDDR, dispatch over five generations), this deck spends its budget on the data-center invariants: wave64, MFMA tiles, HBM, and a transistor budget that throws away every rasterizer.


01 The roofline model

Two hardware ceilings — compute throughput and memory bandwidth — place an upper bound on any kernel. Arithmetic intensity (FLOPs per byte) decides which one you hit.

fig · roofline, MI300X class
arithmetic intensity (FLOPs / byte) → performance (FLOPs/s) → HBM3 BW roof peak compute roof (MFMA) ridge MEMORY-BOUND COMPUTE-BOUND SAXPY (O(1)) FFT (O(log N)) SGEMM (O(N))

02 Profiling — which metrics, and why

rocprofv3 collects hundreds of counters. Three families diagnose a kernel: timing, compute, memory.

rocprofv3 --kernel-trace \
  --pmc SQ_WAVES VALUBusy MemUnitBusy FetchSize \
  -- ./my_kernel
# writes  <pid>_kernel_trace.csv          (timing)
#         <pid>_counter_collection.csv    (counters)
familycounterswhat it tells you
timing Start_Timestamp, End_Timestamp, Kernel_Name, VGPR_Count Which kernel runs longest, and is its occupancy register-limited?
compute VALUBusy, VALUUtilization, SQ_WAVES Low VALUBusy → CU idle → memory- or latency-bound. Low VALUUtilization → divergence.
memory MemUnitBusy, L2CacheHit, FetchSize High MemUnitBusy + low L2CacheHit → hitting HBM not cache. Poor coalescing or no reuse.

03 ROCm — the software stack underneath

ROCm = Radeon Open Compute — AMD's open-source software stack for GPU computing.

what ships ( + )
  • Open source. Entire stack on GitHub. No vendor lock-in — inspect, patch, or fork any layer you depend on.
  • CUDA compatible. HIP mirrors the CUDA runtime API. hipify automates most of the port rewrite.
  • Production proven. Meta (Llama), Microsoft Azure, Oracle OCI, OpenAI. Supercomputers: Frontier, El Capitan, LUMI.
core SDK groups
  • Frameworks: JAX, ONNX-RT, PyTorch, TensorFlow.
  • ML / CV: CK, MIGraphX, MIOpen, MIVisionX, RPP, rocAL, rocDecode, rocJPEG.
  • Comm: RCCL, rocSHMEM.
  • Math: hipBLAS, hipBLASLt, hipFFT, hipRAND, hipSOLVER, hipSPARSE, rocBLAS, rocFFT, rocSOLVER, rocWMMA, Tensile, hipfort, half.
  • Primitives: hipCUB, hipTensor, rocPRIM, rocThrust.
  • Tools: AMD SMI, ROCm SMI, rocminfo, rocprofv3, Compute Profiler, HIPIFY, ROCgdb.
  • Compilers: HIPCC, ROCm compilers, FLANG.
  • Runtime: HIP runtime.

04 ROCm releases & compatibility

Which product runs which architecture, and how that maps to an LLVM target.

productarchitectureLLVM target
MI355X · MI350XCDNA 4gfx950
MI325X · MI300X · MI300ACDNA 3gfx942
MI250X · MI250 · MI210CDNA 2gfx90a
MI100CDNA 1gfx908
RX 9070 / 9060 · PRO R9700RDNA 4gfx1200 / 1201
RX 7900 / 7800 / 7700 · PRO W7900 / W7800RDNA 3gfx1100 / 1101
Ryzen AI / iGPURDNA 3.5gfx1150 / 1151

05 HIP — one source, two targets

Compiler that targets both AMD and NVIDIA GPUs from a single source file.

fig · HIP compilation flow
                       my_kernel.hip
                            │
                          hipcc
                       ┌────┴────┐
                       ▼         ▼
                   amdclang     nvcc
                       │         │
                       ▼         ▼
                   AMD ISA    NVIDIA PTX
        

06 HIP vs CUDA — side by side

Minimal vector-add kernel in both. The shape is identical — only the API prefix differs.

HIP (AMD)
#include <hip/hip_runtime.h>
__global__ void add(float* a, float* b, float* c) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  c[i] = a[i] + b[i];
}
int main() {
  float *a, *b, *c;
  hipMalloc(&a, N*sizeof(float));
  hipMalloc(&b, N*sizeof(float));
  add<<<blocks, threads>>>(a,b,c);
  hipDeviceSynchronize();
}
CUDA (NVIDIA)
#include <cuda_runtime.h>
__global__ void add(float* a, float* b, float* c) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  c[i] = a[i] + b[i];
}
int main() {
  float *a, *b, *c;
  cudaMalloc(&a, N*sizeof(float));
  cudaMalloc(&b, N*sizeof(float));
  add<<<blocks, threads>>>(a,b,c);
  cudaDeviceSynchronize();
}

Kernel body is identical — only the API prefix differs.

07 ROCm tool tour

Six tools cover the full cycle: monitor → query → compile → profile → trace → compute.

toolwhat it doeswhen you use it
amd-smiGPU dashboard: temp, clocks, memory, power, utilization.Every session — is it busy? overheating?
rocminfoLists GPUs: name, CU count, ISA, wavefront, clocks.First thing after install — does ROCm see the GPU?
hipccHIP compiler — auto-detects GPU, emits native code.Every compile: hipcc -O3 -o out in.cpp
rocprofv3Kernel times, HW counters, occupancy metrics.Profiling: rocprofv3 --stats -- ./app
rocBLASBLAS / GEMM — matrix-multiply engine.PyTorch calls it for every linear layer.
MIOpenDL primitives: conv, batchnorm, pooling, RNNs.PyTorch / TF call it (miopenConv*).

08 How your code maps to the chip

The kitchen analogy — how software entities land on the hardware:

kitchen analogy software hardware
recipeKERNEL — the function codeloaded into every CU's instruction cache
all the mealsGRID — all threads launcheddispatched across the whole GPU
a team of cooksBLOCK — 64–1024 threadsruns on ONE CU (shares that CU's LDS)
a squad of 64 in syncWAVEFRONT — 64 threads of a blockruns on ONE SIMD = 64 ALUs lockstep
one cookTHREAD — one work-itemexecutes on ONE ALU lane
the fridgeMEMORY — global arraysstored in HBM (fetched via L2 → L1)

09 SIMT — the GPU execution model

Single Instruction, Multiple Threads — write code for ONE thread; the GPU runs thousands of copies, grouped into 64-thread wavefronts on CDNA.

fig · one wavefront, 64 threads in lockstep
 kernel pc → ┐
             ▼
 T0  T1  T2  T3  T4  ... T60 T61 T62 T63   ← 64 lanes, all advancing one
 ▏   ▏   ▏   ▏   ▏        ▏   ▏   ▏   ▏      instruction per cycle
 ───────────── 1 wavefront ─────────────
        

10 Wavefronts & latency hiding

A wavefront is the GPU's unit of scheduling: 64 threads in lockstep. Keeping many wavefronts in flight hides memory latency — on memory stall, the CU scheduler switches to another wavefront with zero context-switch cost.

interactive · latency hiding · three wavefronts, one stall
WF A
running
WF B
queued
WF C
queued
running on SIMD stalled on HBM queued / done CU: busy

11 Three bottleneck types

Every kernel sits in one of three regimes. The optimization recipe is different for each:

regime limit symptom on roofline fix
MEMORY-BOUND data movement HBM ↔ cache / LDS Low AI · mem pipes near peak · ALUs idle waiting left of ridge Coalesce, stage reuse in LDS, raise AI via fusion, smaller precision
COMPUTE-BOUND ALU / MFMA throughput ALUs or matrix engines near peak · memory mostly idle at the compute roof Use MFMA, reduce precision (FP8 / BF16), increase ILP, fewer redundant ops
OVERHEAD-BOUND host scheduling, launch, tiny arrays GPU mostly idle · few wavefronts active · many tiny launches far below both roofs Larger workgroups, fewer fatter kernels, HIP graphs, batch small launches

12 MI300 — package, XCD, compute unit

From the full board (package + HBM + I/O) down to one CU's internals. Three zoom levels:

fig · MI300X zoom
 [ MI300X package ]
 ┌──────────────────────────────────────────────────────┐
 │  HBM3        XCD       XCD       XCD       XCD  HBM3 │
 │  HBM3        XCD       XCD       XCD       XCD  HBM3 │
 │  HBM3            PCIe 5  +  Infinity Fabric I/O  HBM3│
 │  HBM3                                             HBM3│
 └──────────────────────────────────────────────────────┘
        8 XCDs · 8 HBM3 stacks · 192 GB total HBM3

 [ one XCD ]                          [ one CU ]
 ┌────────────────────┐               ┌─────────────────┐
 │ ▢ ▢ ▢ ▢ ▢ ▢ ▢ ▢   │               │ SIMD0 SIMD1     │
 │ ▢ ▢ ▢ ▢ ▢ ▢ ▢ ▢   │   ────→       │ SIMD2 SIMD3     │
 │ ▢ ▢ ▢ ▢ ▢ ▢ ▢ ▢   │               │ + MFMA  + scalar│
 │ ▢ ▢ ▢ ▢ ▢ ▢       │               │ LDS 64 KB · L1  │
 └────────────────────┘               └─────────────────┘
   ~38 CUs / XCD                       4 SIMDs × 16 ALUs = 64 lanes
   304 active CUs total                + matrix engine + LDS + L1
        

13 MFMA — one instruction, one full matrix tile

The matrix engine computes a 16×16 tile multiply-accumulate in a single instruction — the building block of every GEMM.

interactive · v_mfma_f32_16x16x16_f16 · one instruction, one tile
A · fp16 · 16×16
×
B · fp16 · 16×16
+
C · fp32 · accumulator
=
D · fp32 · result
1 issue · ~256 fp16 MACs per cycle, per CU

14 CDNA vs RDNA — two architectures

Same parent, different goals. The differences matter because porting CUDA code or tuning across families hits real pitfalls.

CDNA — Instinct
data-center AI & HPC
RDNA — Radeon
consumer / workstation graphics
Wavefront64 threads32 threads
MemoryHBM3 / HBM3E / HBM4GDDR6 / GDDR6X
Cachelarge L2, compute-optimizedgraphics-optimized L1 + GL1
ISA familygfx908, gfx90a, gfx942, gfx950gfx1100, gfx1200, gfx1201
ExamplesMI300X, MI350X, MI400RX 9070, Radeon PRO

Same HIP source, different machine code: hipcc reads the ISA family.

15 Why CDNA has no graphics pipeline

Every transistor on a CDNA die goes to compute. The result is a radically different chip from a gaming GPU — even though both are “GPUs.”

transistor budget — what it spends on ( + )
  • More compute units.
  • Bigger matrix engines.
  • Larger caches.
  • More HBM controllers.
what it gives up ( − )
  • No rasterizers.
  • No ray-tracing units.
  • No display outputs.

The result:

304 CUs
MI300X (CDNA 3)
192 GB HBM3
96 CUs
RX 7900 XTX (RDNA 3)
24 GB GDDR6X
3× / 8×
more CUs / more memory
because no silicon is wasted on graphics

16 Memory hierarchy — CDNA edition

The 10–50× bandwidth gap from registers to HBM drives every optimization strategy. Keep hot data as close to the CU as possible.

level capacity bandwidth latency scope tier
Registers (VGPRs)~256 / thread~100 TB/s0 cper CUuser · fastest
LDS (shared) 64 KB ~10 TB/s~20 cper CUuser
L1 32 KB ~5 TB/s~50 cper CUHW
L2 8–32 MB ~3 TB/s~200 cper XCDHW
Infinity Cache 256 MB+ ~17 TB/s~150 cdevice-wideHW
HBM (global) 64–288 GB 1.6–8 TB/s~400 cdevice-wideuser · slowest

Every step down the ladder is at least an order of magnitude slower. Keep hot data as close to the CU as possible. If a kernel touches the same byte twice, it should come from LDS or L1 the second time — not HBM.

17 Memory optimization recipes

technique aspect rule scope
Coalescingglobal memory access patternMake consecutive threads in a wavefront read consecutive addresses.CU → L2 → HBM
LDS bank conflictsshared-memory access patternPad shared-memory arrays so threads in a wavefront hit different banks.CU → LDS
Cache behaviourL1 / L2 / Infinity Cache reuseIf you reuse data, keep it cached; if you stream once, bypass with glc/slc.CU → L1 → L2 → HBM
Vectorised loadsinstructions-per-byteWhen threads read consecutive floats, use float4 to fetch 16 bytes per instruction.instruction-level
Data layoutrow-major / col-major / tilingLay your arrays out so neighbouring threads naturally land on neighbouring addresses.structure-time

Workflow: profile first (rocprofv3) → identify the bottleneck → apply the matching technique.

18 Summary — what to remember


Source repository: github.com/Kaden-Schutt/hipfire · Companion: RDNA / WMMA deck · Further reading: /docs/architecture.