hipfire
hipfire

LLM inference for AMD RDNA GPUs.

2.10×
ollama on 7900 XTX Qwen 3.5 0.8B · 391 vs 187 tok/s
DFlash on Strix Halo 27B code · 105 vs 15 tok/s on the APU
321
hand-written HIP kernels per-chip variants gfx906 → gfx1201

A Rust engine over hand-written HIP kernels — two dlopen calls into ROCm at runtime, nothing else. No PyTorch. No Python in the hot path. Single binary.

no PyTorch no Python in the hot path single binary Ollama-style UX

headline numbers

Decode tok/s on a Radeon 7900 XTX (gfx1100), default config (asym3 KV, FlashAttention auto). vs. ollama Q4_K_M on the same hardware.

model decode tok/s prefill (peak) vs ollama
Qwen 3.5 0.8B39173832.10×
Qwen 3.5 4B18024871.78×
Qwen 3.5 9B13216631.71×
Qwen 3.5 27B47478

DFlash speculative decode lifts code prompts much further. On the canonical merge_sort code prompt: 577 tok/s on 9B and 254 tok/s on 27B (7900 XTX, MQ4, q8 KV, τ=13.18). Strix Halo APU on system RAM hits 105 tok/s on 27B; RDNA 4 R9700 hits 372 tok/s on 9B.

DFlash speedup is genre-conditional — code wins big, prose can be a net loss. See DFlash explained for the τ=13.18 cross-arch invariant and the genre matrix, and Benchmarks for live per-arch numbers (this page is built from localmaxxing).

quickstart

Linux with ROCm 6+ and an RDNA GPU.

$ curl -L https://raw.githubusercontent.com/Kaden-Schutt/hipfire/master/scripts/install.sh | bash

$ hipfire pull qwen3.5:9b
$ hipfire run  qwen3.5:9b "What is the capital of France?"
$ hipfire serve -d   # background daemon, OpenAI-compatible API on :11435

Windows, source builds, and NixOS: see Getting started and NixOS.

hardware support

archexamplestatus
gfx1100 (RDNA 3)Radeon 7900 XTX / XTprimary target — tuned
gfx1201 (RDNA 4)Radeon 9070 XT, R9700supported
gfx1151 (RDNA 3.5)Strix Halo APUsupported
gfx1030 (RDNA 2)Radeon 6950 XTsupported
gfx1010 (RDNA 1)Radeon 5700 XTexperimental
gfx906 (Vega 20)MI50 / MI60community-driven
gfx942 (CDNA 3)MI300XrocBLAS path, partial

Per-arch kernel variants live in kernels/src/*.gfxNNNN.hip; dispatch is feature-gated (has_wmma_f16, has_dot2_f32_f16), not chip-string matching.

HIP, not ROCm-the-stack

llama.cpp + ROCm works on RDNA, but it leans on a userspace stack — rocBLAS, MIOpen, hipBLASLt, Tensile — that AMD officially supports on a handful of datacenter cards. Consumer RDNA is a second-class citizen there.

hipfire takes a different path: a Rust orchestrator that dlopens libamdhip64.so directly, plus an in-tree set of HIP C++ kernels that we tune per chip. librocblas.so is loaded lazily and only used on MI300X-class hardware; absence is recoverable. We don't call MIOpen, RCCL, hipBLASLt, or Composable Kernel.

The pattern is borrowed from ncdrone/rustane's approach to Apple's Neural Engine: safe Rust over a thin FFI to whatever the driver runtime actually exposes.

what's in the box

321 HIP kernels

Hand-written for RDNA, with per-chip variants for gfx1010, gfx1030, gfx1100, gfx1151, gfx1201, and gfx906.

Custom quant formats

MQ4, HFQ4G256, HFQ6G256, MFP4G32, HFP4G32 — designed for RDNA's dot-product and WMMA instructions.

DFlash speculative decode

Own implementation, inspired by Lucebox. Up to 4.45× over AR on code prompts.

FlashAttention & asym KV

Asym2 / Asym3 / Asym4 KV-cache compression keeps long contexts cheap on consumer VRAM budgets.

Multi-GPU pipeline parallel

Including mixed-arch — e.g. gfx1010 + gfx1030 + gfx1151 + gfx1201 in one rig.

OpenAI-compatible API

hipfire serve -d exposes /v1/chat/completions on :11435.

MoE & hybrid models

Qwen 3.5 MoE routing (35B-A3B) and DeltaNet linear-attention for hybrid layers.

BYO models

Quantize from HuggingFace safetensors or llama.cpp GGUF via hipfire quantize — CPU-side, no GPU required.

NixOS first-class

Flake, dev shell, and a services.hipfire NixOS module.

learn

Short, dense decks on the silicon underneath. The HIP/ROCm side of the conversation that brrr-style CUDA explainers don't cover.

See the full index at /learn.

documentation

Looking for the “how GPUs actually work” pictures? See /learn — RDNA / WMMA and CDNA / MFMA decks.