LLM inference for AMD RDNA GPUs.

2.10×

ollama on 7900 XTX Qwen 3.5 0.8B · 391 vs 187 tok/s

7×

DFlash on Strix Halo 27B code · 105 vs 15 tok/s on the APU

321

hand-written HIP kernels per-chip variants gfx906 → gfx1201

A Rust engine over hand-written HIP kernels — two dlopen calls into ROCm at runtime, nothing else. No PyTorch. No Python in the hot path. Single binary.

no PyTorch no Python in the hot path single binary Ollama-style UX

see the numbers >>> get started github discord

headline numbers

Decode tok/s on a Radeon 7900 XTX (gfx1100), default config (asym3 KV, FlashAttention auto). vs. ollama Q4_K_M on the same hardware.

model	decode tok/s	prefill (peak)	vs ollama
Qwen 3.5 0.8B	391	7383	2.10×
Qwen 3.5 4B	180	2487	1.78×
Qwen 3.5 9B	132	1663	1.71×
Qwen 3.5 27B	47	478	—

DFlash speculative decode lifts code prompts much further. On the canonical merge_sort code prompt: 577 tok/s on 9B and 254 tok/s on 27B (7900 XTX, MQ4, q8 KV, τ=13.18). Strix Halo APU on system RAM hits 105 tok/s on 27B; RDNA 4 R9700 hits 372 tok/s on 9B.

DFlash speedup is genre-conditional — code wins big, prose can be a net loss. See DFlash explained for the τ=13.18 cross-arch invariant and the genre matrix, and Benchmarks for live per-arch numbers (this page is built from localmaxxing).

quickstart

Linux with ROCm 6+ and an RDNA GPU.

$ curl -L https://raw.githubusercontent.com/Kaden-Schutt/hipfire/master/scripts/install.sh | bash

$ hipfire pull qwen3.5:9b
$ hipfire run  qwen3.5:9b "What is the capital of France?"
$ hipfire serve -d   # background daemon, OpenAI-compatible API on :11435

Windows, source builds, and NixOS: see Getting started and NixOS.

hardware support

arch	example	status
gfx1100 (RDNA 3)	Radeon 7900 XTX / XT	primary target — tuned
gfx1201 (RDNA 4)	Radeon 9070 XT, R9700	supported
gfx1151 (RDNA 3.5)	Strix Halo APU	supported
gfx1030 (RDNA 2)	Radeon 6950 XT	supported
gfx1010 (RDNA 1)	Radeon 5700 XT	experimental
gfx906 (Vega 20)	MI50 / MI60	community-driven
gfx942 (CDNA 3)	MI300X	rocBLAS path, partial

Per-arch kernel variants live in kernels/src/*.gfxNNNN.hip; dispatch is feature-gated (has_wmma_f16, has_dot2_f32_f16), not chip-string matching.

HIP, not ROCm-the-stack

llama.cpp + ROCm works on RDNA, but it leans on a userspace stack — rocBLAS, MIOpen, hipBLASLt, Tensile — that AMD officially supports on a handful of datacenter cards. Consumer RDNA is a second-class citizen there.

hipfire takes a different path: a Rust orchestrator that dlopens libamdhip64.so directly, plus an in-tree set of HIP C++ kernels that we tune per chip. librocblas.so is loaded lazily and only used on MI300X-class hardware; absence is recoverable. We don't call MIOpen, RCCL, hipBLASLt, or Composable Kernel.

The pattern is borrowed from ncdrone/rustane's approach to Apple's Neural Engine: safe Rust over a thin FFI to whatever the driver runtime actually exposes.

what's in the box

321 HIP kernels

Hand-written for RDNA, with per-chip variants for gfx1010, gfx1030, gfx1100, gfx1151, gfx1201, and gfx906.

Custom quant formats

MQ4, HFQ4G256, HFQ6G256, MFP4G32, HFP4G32 — designed for RDNA's dot-product and WMMA instructions.

DFlash speculative decode

Own implementation, inspired by Lucebox. Up to 4.45× over AR on code prompts.

FlashAttention & asym KV

Asym2 / Asym3 / Asym4 KV-cache compression keeps long contexts cheap on consumer VRAM budgets.

Multi-GPU pipeline parallel

Including mixed-arch — e.g. gfx1010 + gfx1030 + gfx1151 + gfx1201 in one rig.

OpenAI-compatible API

hipfire serve -d exposes /v1/chat/completions on :11435.

MoE & hybrid models

Qwen 3.5 MoE routing (35B-A3B) and DeltaNet linear-attention for hybrid layers.

BYO models

Quantize from HuggingFace safetensors or llama.cpp GGUF via hipfire quantize — CPU-side, no GPU required.

NixOS first-class

Flake, dev shell, and a services.hipfire NixOS module.

learn

Short, dense decks on the silicon underneath. The HIP/ROCm side of the conversation that brrr-style CUDA explainers don't cover.

deck 01 · consumer / pro / APU

RDNA & WMMA

wave32, GDDR / system-RAM, 16×16 matrix tiles via WMMA, with empirical hipfire numbers across gfx1010 → gfx1201.

2.94× WMMA speedup, measured

deck 02 · data center

CDNA & MFMA

The Instinct lineage. wave64, HBM3, v_mfma_* tiles, and the rocprofv3 counters that diagnose where cycles go.

MI300X: 304 CUs · 192 GB HBM3

See the full index at /learn.

documentation

Getting started — install, first run
CLI reference — every subcommand
Models — curated tags & BYO
Quantize — HF / safetensors / GGUF → hipfire formats
Serve — OpenAI-compatible HTTP API
Architecture — engine layout, dispatch, two model paths
Quantization design — MQ4 / HFQ4 / asym KV / FWHT math
Multi-GPU — pipeline-parallel, memory budget
Benchmarks — measured perf per arch

Looking for the “how GPUs actually work” pictures? See /learn — RDNA / WMMA and CDNA / MFMA decks.