LLM inference for AMD RDNA GPUs.
A Rust engine over hand-written HIP kernels — two
dlopen calls into ROCm at runtime, nothing else.
No PyTorch. No Python in the hot path. Single binary.
headline numbers
Decode tok/s on a Radeon 7900 XTX (gfx1100), default config
(asym3 KV, FlashAttention auto). vs. ollama Q4_K_M on the
same hardware.
| model | decode tok/s | prefill (peak) | vs ollama |
|---|---|---|---|
| Qwen 3.5 0.8B | 391 | 7383 | 2.10× |
| Qwen 3.5 4B | 180 | 2487 | 1.78× |
| Qwen 3.5 9B | 132 | 1663 | 1.71× |
| Qwen 3.5 27B | 47 | 478 | — |
DFlash speculative decode lifts code prompts much further. On the
canonical merge_sort code prompt:
577 tok/s on 9B
and 254 tok/s on 27B
(7900 XTX, MQ4, q8 KV, τ=13.18). Strix Halo APU on system RAM hits
105 tok/s on 27B;
RDNA 4 R9700 hits
372 tok/s on 9B.
DFlash speedup is genre-conditional — code wins big, prose can be a net loss. See DFlash explained for the τ=13.18 cross-arch invariant and the genre matrix, and Benchmarks for live per-arch numbers (this page is built from localmaxxing).
quickstart
Linux with ROCm 6+ and an RDNA GPU.
$ curl -L https://raw.githubusercontent.com/Kaden-Schutt/hipfire/master/scripts/install.sh | bash $ hipfire pull qwen3.5:9b $ hipfire run qwen3.5:9b "What is the capital of France?" $ hipfire serve -d # background daemon, OpenAI-compatible API on :11435
Windows, source builds, and NixOS: see Getting started and NixOS.
hardware support
| arch | example | status |
|---|---|---|
| gfx1100 (RDNA 3) | Radeon 7900 XTX / XT | primary target — tuned |
| gfx1201 (RDNA 4) | Radeon 9070 XT, R9700 | supported |
| gfx1151 (RDNA 3.5) | Strix Halo APU | supported |
| gfx1030 (RDNA 2) | Radeon 6950 XT | supported |
| gfx1010 (RDNA 1) | Radeon 5700 XT | experimental |
| gfx906 (Vega 20) | MI50 / MI60 | community-driven |
| gfx942 (CDNA 3) | MI300X | rocBLAS path, partial |
Per-arch kernel variants live in kernels/src/*.gfxNNNN.hip; dispatch is
feature-gated (has_wmma_f16, has_dot2_f32_f16),
not chip-string matching.
HIP, not ROCm-the-stack
llama.cpp + ROCm works on RDNA, but it leans on a userspace
stack — rocBLAS, MIOpen, hipBLASLt, Tensile — that AMD
officially supports on a handful of datacenter cards. Consumer RDNA is
a second-class citizen there.
hipfire takes a different path: a Rust orchestrator that dlopens
libamdhip64.so directly, plus an in-tree set of HIP C++
kernels that we tune per chip. librocblas.so is loaded
lazily and only used on MI300X-class hardware; absence is recoverable.
We don't call MIOpen, RCCL, hipBLASLt, or Composable Kernel.
The pattern is borrowed from ncdrone/rustane's approach to Apple's Neural Engine: safe Rust over a thin FFI to whatever the driver runtime actually exposes.
what's in the box
321 HIP kernels
Hand-written for RDNA, with per-chip variants for gfx1010, gfx1030, gfx1100, gfx1151, gfx1201, and gfx906.
Custom quant formats
MQ4, HFQ4G256, HFQ6G256, MFP4G32, HFP4G32 — designed for RDNA's dot-product and WMMA instructions.
DFlash speculative decode
Own implementation, inspired by Lucebox. Up to 4.45× over AR on code prompts.
FlashAttention & asym KV
Asym2 / Asym3 / Asym4 KV-cache compression keeps long contexts cheap on consumer VRAM budgets.
Multi-GPU pipeline parallel
Including mixed-arch — e.g. gfx1010 + gfx1030 + gfx1151 + gfx1201 in one rig.
OpenAI-compatible API
hipfire serve -d exposes /v1/chat/completions on :11435.
MoE & hybrid models
Qwen 3.5 MoE routing (35B-A3B) and DeltaNet linear-attention for hybrid layers.
BYO models
Quantize from HuggingFace safetensors or llama.cpp GGUF via hipfire quantize — CPU-side, no GPU required.
NixOS first-class
Flake, dev shell, and a services.hipfire NixOS module.
learn
Short, dense decks on the silicon underneath. The HIP/ROCm side of the
conversation that brrr-style CUDA explainers don't cover.
RDNA & WMMA
wave32, GDDR / system-RAM, 16×16 matrix tiles via WMMA, with empirical hipfire numbers across gfx1010 → gfx1201.
CDNA & MFMA
The Instinct lineage. wave64, HBM3, v_mfma_* tiles,
and the rocprofv3 counters that diagnose where cycles go.
See the full index at /learn.
documentation
- Getting started — install, first run
- CLI reference — every subcommand
- Models — curated tags & BYO
- Quantize — HF / safetensors / GGUF → hipfire formats
- Serve — OpenAI-compatible HTTP API
- Architecture — engine layout, dispatch, two model paths
- Quantization design — MQ4 / HFQ4 / asym KV / FWHT math
- Multi-GPU — pipeline-parallel, memory budget
- Benchmarks — measured perf per arch
Looking for the “how GPUs actually work” pictures? See /learn — RDNA / WMMA and CDNA / MFMA decks.