hipfire
/learn · deck 03 · the thesis

Why hipfire — the RDNA-native thesis

Take the same workload — Qwen 3.5 9B, MQ4 weights, decode — and run it on two GPUs that, on paper, look almost identical: Radeon 7900 XTX and RTX 4090. Both 24 GB. Both ~1 TB/s GDDR6X. Both same generation, same year, same price tier. Both have a matrix unit. Both schedule 32-thread waves.

Yet the same kernel hipified from a CUDA original reaches ≈ 37% of the GDDR6X bandwidth ceiling on the 7900 XTX, while hipfire's hand-written equivalent reaches ≈ 69%. That's not a ROCm bug. That's an architectural mismatch that hipify can't paper over. This page is the proof — what's different, why hipified kernels leave throughput on the table, and where RDNA actually shines.


01 Two chips, same job

Spec sheets, post-marketing-gloss. Both are 2022-era consumer flagships; both nominally target the same workloads.

Radeon 7900 XTX
RDNA 3 · gfx1100
GeForce RTX 4090
Ada Lovelace · sm_89
Compute units96 CUs (48 WGPs)128 SMs
Wave / warp width32 (wave32)32 (warp32)
Matrix unitv_wmma_* — 16×16×16 fp16Tensor Core 4th gen — m16n8k16 / m16n16k16 / many shapes
Matrix throughput (fp16, fp32 acc)~123 TFLOPs~330 TFLOPs
FP32 (scalar) throughput~61 TFLOPs~83 TFLOPs
VRAM24 GB GDDR6X24 GB GDDR6X
Memory bandwidth960 GB/s1008 GB/s
L2 cache6 MB72 MB
L3 / Infinity Cache96 MBnone
Shared memory (LDS / shmem) per CU/SM64 KB128 KB
Register file per SIMD / scheduler1024 VGPRs (32 KB)64 KB (per-scheduler)
Power (board, peak)355 W450 W
Launch price (Q4 2022)$999$1599

Sources: AMD RDNA 3 architecture whitepaper, NVIDIA Ada Lovelace whitepaper, public ISA references, on-card rocminfo / nvidia-smi dumps. Theoretical TFLOPs at boost clocks; the important comparisons in this deck are the architectural ratios, not the absolute peaks.

02 Where they look alike

Real talk first: a CUDA developer porting code to HIP isn't entering an alien architecture. The mental model carries over surprisingly far.

If the workload is a pure compute-bound stencil with no matrix-unit intrinsics and no precision tricks, a hipified CUDA kernel can land within a few percent of native. That's the surface area where HIP portability genuinely delivers.

Everywhere a high-perf inference kernel actually spends its cycles, though, the chips diverge.

03 Where they diverge — the four mismatches that matter

3.1   Matrix tile shape and binding

RDNA3's only matrix shape is 16×16×16: one wave produces a 16×16 fp32 accumulator from two 16×16 fp16 inputs. Wave32 cooperative, accumulator lives in 8 VGPRs per lane.

Ada's Tensor Cores expose many shapes: m16n8k16, m16n16k16, m16n8k8, m8n8k16, plus the FP8 / TF32 variants. The fastest path for most CUDA inference kernels is mma.sync.aligned.m16n8k16 — a half-wide tile that produces 16×8 per warp. Tensor Cores can issue these at ≈4× the per-SM rate of a comparable WMMA on RDNA3.

fig · matrix-unit tile shapes per arch
 RDNA3 (gfx1100)              Ada Lovelace (sm_89)
 ───────────────              ────────────────────
 v_wmma_f32_16x16x16_f16_w32  mma.sync.m16n8k16.f32.f16.f16.f32   ← favored
                              mma.sync.m16n16k16.f32.f16.f16.f32
                              mma.sync.m16n8k8.f32.f16.f16.f32
                              mma.sync.m8n8k16.f32.f16.f16.f32
                              mma.sync.m16n8k16.f32.bf16.bf16.f32
                              mma.sync.m16n8k32.s32.s8.s8.s32     ← int8
                              mma.sync.m16n8k16.f32.e4m3.e4m3.f32 ← FP8

   ONE shape on RDNA3.            ~10+ shapes on Ada.
        

Consequence for hipified kernels. A CUDA kernel that issues m16n8k16 mma operations has no direct equivalent on RDNA. rocWMMA (the C++ shim) can bridge the C++ nvcuda::wmma:: API with the AMD WMMA builtins, but the underlying tile shapes don't line up: a 16×8×16 logical tile has to be expressed as two halves of a 16×16×16 WMMA issue, with explicit zeroing in the unused half. That's wasted compute — you're paying for a full WMMA and using half of it. Hand-written hipfire kernels just tile the GEMM directly to 16×16 and get the full issue rate.

3.2   Cache hierarchy: Infinity Cache is RDNA-only

Ada has a single large L2 (72 MB) shared across all SMs and DRAM. RDNA3 has a two-tier on-die cache: 6 MB L2 per shader array plus 96 MB Infinity Cache (device-wide L3) sitting between L2 and DRAM. The bandwidths are roughly comparable at peak, but the working-set landing zones are completely different.

working-set size 7900 XTX (RDNA3) RTX 4090 (Ada)
< 6 MB hits L2 hits L2
6–72 MB hits Infinity Cache (~17 TB/s)hits L2 (~5 TB/s)
72–96 MBhits Infinity Cachespills to DRAM
> 96 MB spills to DRAM spills to DRAM

Two kernels written for the same algorithm but tuned for different targets pick different tile sizes:

3.3   Register file: different occupancy math

Both chips have similar total register storage per SM/CU. The occupancy curve is different.

VGPRs / regs per thread RDNA3: max waves / SIMD Ada: max warps / SM
≤ 32 1648
64 1632
96 1021
128 816
168 612
256 48

WMMA-fused GEMMs land at 80–120 VGPRs/wave, so RDNA3 occupancy sits at 8–10 waves/SIMD — fine, but the constraint is tight enough that spills are catastrophic. A CUDA kernel that uses 130 registers/thread runs comfortably on Ada (16 warps/SM × 4 schedulers = 64 active warps) but collapses on RDNA to 6 waves/SIMD with high register pressure. Hipfire kernels are budgeted against the wave32 curve, not the warp32 curve.

3.4   Wave scheduling: dual-issue vs SM-level multi-scheduler

Ada SMs have 4 independent warp schedulers, each able to issue one instruction per cycle — so a single SM can have 4 warps active doing 4 different things (one FMA, one INT, one load, one SFU). RDNA3 has 4 SIMD32 lanes per WGP, each running one wave at a time, with RDNA3-introduced dual-issue capability: under specific instruction pairings, one SIMD can retire two instructions per cycle from the same wave.

That's a meaningfully different scheduling story. Ada amortizes latency by having many warps in flight per SM; RDNA3 amortizes by having multiple SIMDs per WGP plus ILP within a single wave through dual-issue. A kernel written for Ada's "lots of small warps, SM picks" topology doesn't naturally exploit RDNA's dual-issue — you have to deliberately schedule independent FMA + memory-op pairs in the kernel body. Hipfire kernels do this consciously; hipified CUDA kernels rarely do.

04 What hipify produces vs what hipfire writes

A concrete example. The inner loop of a GEMM with WMMA / Tensor Core on each side. Both compute the same math; the binding to the matrix unit is what differs.

CUDA — tuned for Ada Tensor Cores (PTX inline)

// 16x8x16 fp16 → fp32 mma, the fast Ada path.
// One warp produces a 16x8 output tile per issue.
asm volatile(
  "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
  "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
  : "+f"(d[0]), "+f"(d[1]), "+f"(d[2]), "+f"(d[3])
  : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
    "r"(b[0]), "r"(b[1])
);

Hipified output — what hipify-perl emits

// The asm-block is preserved verbatim by hipify-perl. It targets
// PTX, which hipcc → amdclang doesn't accept. Compile fails:
//
//   error: invalid instruction mnemonic 'mma.sync.aligned.m16n8k16...'
//
// Best-case outcome: the kernel author had a #ifdef __CUDA_ARCH__
// fallback to a scalar fp16 fma loop. Then the kernel COMPILES on
// AMD — but issues no WMMA, runs at ~3% of peak.
//
// Worst-case: the author used wmma:: C++ API and rocWMMA bridges
// the call to AMD's intrinsic — but the shape is m16n8k16, which
// RDNA only emulates by issuing a m16n16k16 with half the B-tile
// zeroed. Half the matrix unit's work is thrown away.

hipfire — native RDNA3 WMMA

// Real code from kernels/src/gemm_f16_wmma.hip.
// One wave produces a 16x16 output tile per issue. No shape
// mismatch, no PTX, no fallback. Tile dims match the only WMMA
// shape RDNA3 offers.

typedef _Float16 __attribute__((ext_vector_type(16))) half16_t;
typedef float    __attribute__((ext_vector_type(8)))  float8_t;

__launch_bounds__(32, 2)
extern "C" __global__ void gemm_f16_wmma(...) {
    float8_t acc = {0};
    for (int k0 = 0; k0 < K; k0 += 16) {
        half16_t a_reg, b_reg;
        // ... load 16x16 tiles into a_reg / b_reg from LDS ...
        acc = __builtin_amdgcn_wmma_f32_16x16x16_f16_w32(a_reg, b_reg, acc);
    }
    // ... store acc to global ...
}

This pattern repeats across 66 WMMA kernels, 29 sdot4 (dp4a) kernels, and 7 fdot2 packed-fp16 fma kernels in kernels/src/ — all hand-written, all binding directly to AMD intrinsics, no PTX, no rocWMMA shim, no shape conversion.

05 The hipfire approach — tuned to the silicon

The kernel-count numbers tell the story. As of current master:

321
total HIP kernel files
66
use WMMA builtins
RDNA3+, gfx1100/1151/1201
36
use dp4a / dot2
RDNA2 fallback path

Hipfire's RDNA-native moves:

06 The empirical proof

Same hardware, same model, same quantization-equivalent format, different kernel paths.

Qwen 3.5 9B decode on 7900 XTX, both daemons configured to use the full card:

engine kernel path decode tok/s effective BW % of GDDR6X peak
hipfire hand-written WMMA + MQ4 132 ~654 GiB/s ~69%
llama.cpp ROCm hipified Q4_K_M / MMQ kernels 71 ~352 GiB/s ~37%
ollama (Q4_K_M) llama.cpp under the hood 77 ~381 GiB/s ~40%

The gap is the thesis. Decode is bandwidth-bound on 7900 XTX — the 9B weights have to be read every token, ~5 GB at 4-bit quant. The hardware ceiling is 960 GB/s GDDR6X bandwidth. Hipfire achieves 69% of that ceiling. The hipified-CUDA kernels in llama.cpp's ROCm backend, even with the engine actively engineered for AMD, get 37%.

That 32-percentage-point gap is what hand-written kernels buy you: WMMA hits the right shape on the first issue, LDS layouts avoid AMD's 32-bank conflicts (different bank-stride from NVIDIA), and dual-issue opportunities are scheduled rather than left to chance.

llama.cpp/ollama numbers from docs/BENCHMARKS.md: Qwen3 9B Q4_K_M on the same 7900 XTX, ROCm 6, default flags. The 2.10× / 1.78× / 1.71× vs-ollama deltas on the homepage are this same comparison across 0.8B / 4B / 9B respectively.

07 Where RDNA shines

08 Where NVIDIA shines

Honest comparisons require honest concessions. Ada has real advantages.

09 How hipfire masks RDNA's weaknesses

The thesis is “tune to RDNA's strengths, and where you can't, cover the weaknesses in algorithm space, not kernel space.”

10 Summary — the thesis in one paragraph

RDNA and Ada are not the same chip with different vendor logos. The waves are the same width and the API surface looks alike, but the matrix-unit tile shape, the cache topology, the register-file occupancy curve, and the wave-scheduling story are all different in ways that compound. A kernel written against Ada's strengths leaves ≈30 percentage points of GDDR6X bandwidth on the floor when it lands on RDNA — not because ROCm is bad, but because the kernel was written to a different machine. Hipfire's 321 hand-written HIP kernels exist because the gap is real, the work is RDNA-shaped, and the empirical proof is on the same card any reader can buy: 132 tok/s vs 71 on a 7900 XTX running the same Qwen 9B.


Source repository: github.com/Kaden-Schutt/hipfire · Companion decks: RDNA / WMMA, CDNA / MFMA · Further reading: /docs/architecture, /docs/benchmarks.