Why hipfire — the RDNA-native thesis
Take the same workload — Qwen 3.5 9B, MQ4 weights, decode — and run it on two GPUs that, on paper, look almost identical: Radeon 7900 XTX and RTX 4090. Both 24 GB. Both ~1 TB/s GDDR6X. Both same generation, same year, same price tier. Both have a matrix unit. Both schedule 32-thread waves.
Yet the same kernel hipified from a CUDA original reaches ≈ 37% of the GDDR6X bandwidth ceiling on the 7900 XTX, while hipfire's hand-written equivalent reaches ≈ 69%. That's not a ROCm bug. That's an architectural mismatch that hipify can't paper over. This page is the proof — what's different, why hipified kernels leave throughput on the table, and where RDNA actually shines.
01 Two chips, same job
Spec sheets, post-marketing-gloss. Both are 2022-era consumer flagships; both nominally target the same workloads.
| Radeon 7900 XTX RDNA 3 · gfx1100 | GeForce RTX 4090 Ada Lovelace · sm_89 | |
|---|---|---|
| Compute units | 96 CUs (48 WGPs) | 128 SMs |
| Wave / warp width | 32 (wave32) | 32 (warp32) |
| Matrix unit | v_wmma_* — 16×16×16 fp16 | Tensor Core 4th gen — m16n8k16 / m16n16k16 / many shapes |
| Matrix throughput (fp16, fp32 acc) | ~123 TFLOPs | ~330 TFLOPs |
| FP32 (scalar) throughput | ~61 TFLOPs | ~83 TFLOPs |
| VRAM | 24 GB GDDR6X | 24 GB GDDR6X |
| Memory bandwidth | 960 GB/s | 1008 GB/s |
| L2 cache | 6 MB | 72 MB |
| L3 / Infinity Cache | 96 MB | none |
| Shared memory (LDS / shmem) per CU/SM | 64 KB | 128 KB |
| Register file per SIMD / scheduler | 1024 VGPRs (32 KB) | 64 KB (per-scheduler) |
| Power (board, peak) | 355 W | 450 W |
| Launch price (Q4 2022) | $999 | $1599 |
Sources: AMD RDNA 3 architecture whitepaper, NVIDIA Ada Lovelace
whitepaper, public ISA references, on-card rocminfo /
nvidia-smi dumps. Theoretical TFLOPs at boost clocks; the
important comparisons in this deck are the architectural ratios, not
the absolute peaks.
02 Where they look alike
Real talk first: a CUDA developer porting code to HIP isn't entering an alien architecture. The mental model carries over surprisingly far.
- Wave32 ↔ warp32. Both GPUs lockstep 32 threads as the scheduling unit.
__shfl_*, ballot, vote — same semantics, same lane count. - Explicit shared memory. Both expose a per-CU / per-SM scratchpad (LDS on AMD, shmem on NVIDIA) with explicit allocation, single-cycle access from the SIMT array, and bank-conflict semantics.
- SIMT + barrier programming model.
__syncthreads()/__syncwarpbehave the same; cooperative thread arrays / blocks map to workgroups one-to-one. - Matrix-fused-multiply-accumulate is a thing on both. RDNA3 has WMMA, Ada has Tensor Cores. Both compute a tile of
D = A×B + Cin one issue, both consume fp16/bf16/int8. - HIP syntax is CUDA-shaped.
__global__,<<<grid, block>>>,threadIdx,blockIdx— same.hipify-perlcan mechanically renamecuda*tohip*for most runtime API calls.
If the workload is a pure compute-bound stencil with no matrix-unit intrinsics and no precision tricks, a hipified CUDA kernel can land within a few percent of native. That's the surface area where HIP portability genuinely delivers.
Everywhere a high-perf inference kernel actually spends its cycles, though, the chips diverge.
03 Where they diverge — the four mismatches that matter
3.1 Matrix tile shape and binding
RDNA3's only matrix shape is 16×16×16: one wave produces a 16×16 fp32 accumulator from two 16×16 fp16 inputs. Wave32 cooperative, accumulator lives in 8 VGPRs per lane.
Ada's Tensor Cores expose many shapes: m16n8k16, m16n16k16,
m16n8k8, m8n8k16, plus the FP8 / TF32 variants. The fastest path for
most CUDA inference kernels is mma.sync.aligned.m16n8k16
— a half-wide tile that produces 16×8 per warp. Tensor
Cores can issue these at ≈4× the per-SM rate of a
comparable WMMA on RDNA3.
RDNA3 (gfx1100) Ada Lovelace (sm_89)
─────────────── ────────────────────
v_wmma_f32_16x16x16_f16_w32 mma.sync.m16n8k16.f32.f16.f16.f32 ← favored
mma.sync.m16n16k16.f32.f16.f16.f32
mma.sync.m16n8k8.f32.f16.f16.f32
mma.sync.m8n8k16.f32.f16.f16.f32
mma.sync.m16n8k16.f32.bf16.bf16.f32
mma.sync.m16n8k32.s32.s8.s8.s32 ← int8
mma.sync.m16n8k16.f32.e4m3.e4m3.f32 ← FP8
ONE shape on RDNA3. ~10+ shapes on Ada.
Consequence for hipified kernels. A CUDA kernel that
issues m16n8k16 mma operations has no direct
equivalent on RDNA. rocWMMA (the C++ shim) can bridge
the C++ nvcuda::wmma:: API with the AMD WMMA builtins, but
the underlying tile shapes don't line up: a 16×8×16 logical
tile has to be expressed as two halves of a 16×16×16 WMMA
issue, with explicit zeroing in the unused half. That's wasted compute
— you're paying for a full WMMA and using half of it. Hand-written
hipfire kernels just tile the GEMM directly to 16×16 and
get the full issue rate.
3.2 Cache hierarchy: Infinity Cache is RDNA-only
Ada has a single large L2 (72 MB) shared across all SMs and DRAM. RDNA3 has a two-tier on-die cache: 6 MB L2 per shader array plus 96 MB Infinity Cache (device-wide L3) sitting between L2 and DRAM. The bandwidths are roughly comparable at peak, but the working-set landing zones are completely different.
| working-set size | 7900 XTX (RDNA3) | RTX 4090 (Ada) |
|---|---|---|
| < 6 MB | hits L2 | hits L2 |
| 6–72 MB | hits Infinity Cache (~17 TB/s) | hits L2 (~5 TB/s) |
| 72–96 MB | hits Infinity Cache | spills to DRAM |
| > 96 MB | spills to DRAM | spills to DRAM |
Two kernels written for the same algorithm but tuned for different targets pick different tile sizes:
- A CUDA kernel tuned for Ada will pick a tile that fits in ~50 MB of L2 and re-reads aggressively. Same tile on RDNA spills out of L2 (6 MB) and lives at IC bandwidth (~17 TB/s — still good, but the kernel's been written assuming 5 TB/s L2 access patterns, so it under-utilizes the IC BW).
- A hipfire kernel tunes for the IC: tile small enough to fit in 6 MB L2 for hot reuse, sized so the full pass fits in 96 MB IC. A short-context Qwen 9B asym3 KV slice fits entirely in IC — one of the reasons hipfire's 9B decode hits 654 GiB/s sustained against a 960 GB/s GDDR6X ceiling.
3.3 Register file: different occupancy math
Both chips have similar total register storage per SM/CU. The occupancy curve is different.
| VGPRs / regs per thread | RDNA3: max waves / SIMD | Ada: max warps / SM |
|---|---|---|
| ≤ 32 | 16 | 48 |
| 64 | 16 | 32 |
| 96 | 10 | 21 |
| 128 | 8 | 16 |
| 168 | 6 | 12 |
| 256 | 4 | 8 |
WMMA-fused GEMMs land at 80–120 VGPRs/wave, so RDNA3 occupancy sits at 8–10 waves/SIMD — fine, but the constraint is tight enough that spills are catastrophic. A CUDA kernel that uses 130 registers/thread runs comfortably on Ada (16 warps/SM × 4 schedulers = 64 active warps) but collapses on RDNA to 6 waves/SIMD with high register pressure. Hipfire kernels are budgeted against the wave32 curve, not the warp32 curve.
3.4 Wave scheduling: dual-issue vs SM-level multi-scheduler
Ada SMs have 4 independent warp schedulers, each able to issue one instruction per cycle — so a single SM can have 4 warps active doing 4 different things (one FMA, one INT, one load, one SFU). RDNA3 has 4 SIMD32 lanes per WGP, each running one wave at a time, with RDNA3-introduced dual-issue capability: under specific instruction pairings, one SIMD can retire two instructions per cycle from the same wave.
That's a meaningfully different scheduling story. Ada amortizes latency by having many warps in flight per SM; RDNA3 amortizes by having multiple SIMDs per WGP plus ILP within a single wave through dual-issue. A kernel written for Ada's "lots of small warps, SM picks" topology doesn't naturally exploit RDNA's dual-issue — you have to deliberately schedule independent FMA + memory-op pairs in the kernel body. Hipfire kernels do this consciously; hipified CUDA kernels rarely do.
04 What hipify produces vs what hipfire writes
A concrete example. The inner loop of a GEMM with WMMA / Tensor Core on each side. Both compute the same math; the binding to the matrix unit is what differs.
CUDA — tuned for Ada Tensor Cores (PTX inline)
// 16x8x16 fp16 → fp32 mma, the fast Ada path.
// One warp produces a 16x8 output tile per issue.
asm volatile(
"mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
"{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
: "+f"(d[0]), "+f"(d[1]), "+f"(d[2]), "+f"(d[3])
: "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
"r"(b[0]), "r"(b[1])
); Hipified output — what hipify-perl emits
// The asm-block is preserved verbatim by hipify-perl. It targets
// PTX, which hipcc → amdclang doesn't accept. Compile fails:
//
// error: invalid instruction mnemonic 'mma.sync.aligned.m16n8k16...'
//
// Best-case outcome: the kernel author had a #ifdef __CUDA_ARCH__
// fallback to a scalar fp16 fma loop. Then the kernel COMPILES on
// AMD — but issues no WMMA, runs at ~3% of peak.
//
// Worst-case: the author used wmma:: C++ API and rocWMMA bridges
// the call to AMD's intrinsic — but the shape is m16n8k16, which
// RDNA only emulates by issuing a m16n16k16 with half the B-tile
// zeroed. Half the matrix unit's work is thrown away. hipfire — native RDNA3 WMMA
// Real code from kernels/src/gemm_f16_wmma.hip.
// One wave produces a 16x16 output tile per issue. No shape
// mismatch, no PTX, no fallback. Tile dims match the only WMMA
// shape RDNA3 offers.
typedef _Float16 __attribute__((ext_vector_type(16))) half16_t;
typedef float __attribute__((ext_vector_type(8))) float8_t;
__launch_bounds__(32, 2)
extern "C" __global__ void gemm_f16_wmma(...) {
float8_t acc = {0};
for (int k0 = 0; k0 < K; k0 += 16) {
half16_t a_reg, b_reg;
// ... load 16x16 tiles into a_reg / b_reg from LDS ...
acc = __builtin_amdgcn_wmma_f32_16x16x16_f16_w32(a_reg, b_reg, acc);
}
// ... store acc to global ...
}
This pattern repeats across 66 WMMA kernels,
29 sdot4 (dp4a) kernels, and 7 fdot2 packed-fp16
fma kernels in kernels/src/ — all
hand-written, all binding directly to AMD intrinsics, no PTX, no
rocWMMA shim, no shape conversion.
05 The hipfire approach — tuned to the silicon
The kernel-count numbers tell the story. As of current master:
RDNA3+, gfx1100/1151/1201
RDNA2 fallback path
Hipfire's RDNA-native moves:
- Per-arch dispatch. Every GEMM / GEMV / attention op routes through
rdna-compute::dispatch, which picks WMMA32 (gfx12xx), WMMA16 (gfx11xx), dot2 (gfx1030), or scalar baseline (gfx1010) at runtime. Feature predicates:has_wmma_f16,has_wmma_f16_gfx12,has_dot2_f32_f16,has_wave64_native. - Custom quant formats designed for AMD's matrix path.
MQ4(mode-quant 4-bit, Lloyd-Max centroids),HFQ4G256(half-precision quantized, 256-element groups, matches WMMA K=16 tiles 16-way),HFP4G32(FP4 with E2M1 + UE8M0 per-group scale — tuned for RDNA4'sv_wmma_f32_16x16x16_fp8_fp8_w32_gfx12path). These formats are designed against AMD intrinsics, not borrowed from CUDA's K-tile conventions. - Wave32-native dispatch. 91 kernels use LDS; nearly all set
__launch_bounds__(32, ...)to lock the workgroup at one wave wide. No wave64 trampolines, no warp-divergence handling lifted from CUDA. - Chip-specific overrides where they matter. 20 per-chip
*.gfxNNNN.hipvariants —fused_qkv_mq3g256_lloyd.gfx1100.hipuses an RDNA3-specific K4 unroll + LDS-resident codebook; gfx1030's dp4a path is its own file family; gfx1201 unlocks WMMA32 with its own kernel set. - Two
dlopencalls into ROCm at runtime. Nothing else. No rocBLAS, no MIOpen, no hipBLASLt, no Composable Kernel.libamdhip64.sois the runtime,librocblas.soloads lazily only when MI300X-class hardware is in play. Absence is recoverable.
06 The empirical proof
Same hardware, same model, same quantization-equivalent format, different kernel paths.
Qwen 3.5 9B decode on 7900 XTX, both daemons configured to use the full card:
| engine | kernel path | decode tok/s | effective BW | % of GDDR6X peak |
|---|---|---|---|---|
| hipfire | hand-written WMMA + MQ4 | 132 | ~654 GiB/s | ~69% |
| llama.cpp ROCm | hipified Q4_K_M / MMQ kernels | 71 | ~352 GiB/s | ~37% |
| ollama (Q4_K_M) | llama.cpp under the hood | 77 | ~381 GiB/s | ~40% |
The gap is the thesis. Decode is bandwidth-bound on 7900 XTX — the 9B weights have to be read every token, ~5 GB at 4-bit quant. The hardware ceiling is 960 GB/s GDDR6X bandwidth. Hipfire achieves 69% of that ceiling. The hipified-CUDA kernels in llama.cpp's ROCm backend, even with the engine actively engineered for AMD, get 37%.
That 32-percentage-point gap is what hand-written kernels buy you: WMMA hits the right shape on the first issue, LDS layouts avoid AMD's 32-bank conflicts (different bank-stride from NVIDIA), and dual-issue opportunities are scheduled rather than left to chance.
llama.cpp/ollama numbers from docs/BENCHMARKS.md: Qwen3 9B Q4_K_M on the same 7900 XTX, ROCm 6, default flags. The 2.10× / 1.78× / 1.71× vs-ollama deltas on the homepage are this same comparison across 0.8B / 4B / 9B respectively.
07 Where RDNA shines
- Single-stream inference. The smaller matrix unit means lower fixed cost per matrix issue. For batch=1 decode, where you're filling a 16×16 tile once per layer per token, RDNA's WMMA isn't bottlenecked by under-fed Tensor Cores. Hipfire hits 132 tok/s on 9B at B=1; a Tensor-Core-optimized kernel often loses throughput at B=1 because the unit wants larger tiles.
- Working sets that fit Infinity Cache. Asym3 KV for a 9B model at short context: ~80 MB. Fits IC. Lives at ~17 TB/s instead of 960 GB/s. Ada has no equivalent tier — you get L2 (72 MB, ~5 TB/s) and then a 7× cliff to GDDR6X. RDNA3's IC softens that cliff dramatically for the LLM working-set range.
- Predictable bandwidth, no SM-scheduling quirks. RDNA waves are scheduled simply: one per SIMD per cycle, dual-issue opportunistic. There's no equivalent of NVIDIA's per-SM contention for the SFU / load/store unit between four warp schedulers. A carefully written kernel can predict its instruction issue cadence on RDNA to within a couple percent.
- Per-watt for inference. 355 W board power for 7900 XTX vs 450 W for RTX 4090. On a memory-bound decode workload, the 7900 XTX is within striking distance of the 4090 on tok/s while drawing ~21% less power. (Not a hipfire claim, just the chip.)
08 Where NVIDIA shines
Honest comparisons require honest concessions. Ada has real advantages.
- Large matrix workloads. Training, batch=64+ inference, attention with long KV at high concurrency — anything where the matrix unit can be kept saturated benefits from Tensor Cores' raw throughput. RDNA's 123 TFLOPs WMMA vs Ada's 330 TFLOPs Tensor (fp16 with fp32 acc) is a 2.7× gap that no amount of kernel tuning closes.
- Mixed-precision training. TF32 is an Ada-native throughput tier (mantissa-truncated fp32 that runs at fp16 speed through Tensor Cores). RDNA3 has no equivalent. FP8 support is farther along on Ada too (the FP4 path on gfx1201 is recent and relatively narrow).
- Large-L2 workloads. Kernels with 50–72 MB working sets that fit Ada's L2 but don't benefit from RDNA's higher-latency Infinity Cache (e.g., latency-bound search / graph algorithms) run faster on Ada.
- Software ecosystem. cuBLAS, cuDNN, FlashAttention, TensorRT — mature, well-tuned vendor libraries. ROCm equivalents exist but coverage is uneven on consumer RDNA. (This is exactly why hipfire doesn't use the ROCm libraries — they're not the path to competitive perf on consumer cards.)
09 How hipfire masks RDNA's weaknesses
The thesis is “tune to RDNA's strengths, and where you can't, cover the weaknesses in algorithm space, not kernel space.”
- Smaller matrix throughput → speculative decode. DFlash batches a sidecar drafter's K-token guess into a single verification pass through the target model. The verify step is a B=K wide GEMM — one WMMA-batched call that amortizes the per-issue fixed cost. Result: 4.8× over AR on code prompts, 5.5× on Strix Halo APU. We extract throughput Ada gets from raw matrix width by extracting it from batched amortization instead.
- Smaller L2 → aggressive KV compression. Asym2 / asym3 / asym4 KV cache compression keeps the active KV slice in the 96 MB Infinity Cache for 4K–16K context windows. Hipfire's asym3 path is 2.7× smaller than fp16 KV at acceptable quality loss.
- Tighter register budget → LDS-resident codebooks. Hipfire's quant formats keep their dequant codebooks in LDS (256 B to 4 KB) rather than spilling into VGPRs as constants. RDNA's 64 KB LDS/CU is plenty; the result is 80–120 VGPR fused-GEMM kernels with zero register spills, holding 8–10 waves/SIMD occupancy.
- Lower fp16 ALU throughput → quantize harder. MQ4 / HFQ4G256 mean the GEMMs run at 4-bit weight precision, with dequant fused into the dot product. The dot2 / dp4a / WMMA unit isn't blocked waiting on dequant; the weight-stream BW is the binding constraint, which is the constraint we wanted to be on anyway (since BW is the bottleneck the engineering effort can move most).
- No vendor BLAS → deterministic kernels. rocBLAS / hipBLASLt make autotuning decisions at runtime that hipfire can't reproduce across boots. Hipfire's hand-written kernels are deterministic by file — same prompt, same flags, same tok/s within ±1.3%. That's a debuggability and benchmarking posture, not a perf claim, but it's the posture that makes the rest of the engineering possible.
10 Summary — the thesis in one paragraph
RDNA and Ada are not the same chip with different vendor logos. The waves are the same width and the API surface looks alike, but the matrix-unit tile shape, the cache topology, the register-file occupancy curve, and the wave-scheduling story are all different in ways that compound. A kernel written against Ada's strengths leaves ≈30 percentage points of GDDR6X bandwidth on the floor when it lands on RDNA — not because ROCm is bad, but because the kernel was written to a different machine. Hipfire's 321 hand-written HIP kernels exist because the gap is real, the work is RDNA-shaped, and the empirical proof is on the same card any reader can buy: 132 tok/s vs 71 on a 7900 XTX running the same Qwen 9B.
Source repository: github.com/Kaden-Schutt/hipfire · Companion decks: RDNA / WMMA, CDNA / MFMA · Further reading: /docs/architecture, /docs/benchmarks.