
DFlash speculative decode

hipfire's spec-decode path. A small drafter model proposes a block of tokens; the target verifies them in one fused pass. When draft and target agree, you get N tokens for the cost of one decode step.

what is DFlash

Speculative decoding in the standard shape: a small drafter model emits a block of K candidate tokens, the large target model verifies them in a single batched forward pass, and the longest draft prefix that matches the target's argmax is committed. The acceptance length (τ) is the average number of tokens committed per target step.
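The accept rule described above can be sketched in a few lines. This is a minimal illustration of greedy prefix acceptance, not hipfire's kernel code: the function names are hypothetical, and tokens are plain integers.

```python
from typing import List, Sequence


def committed_tokens(draft: Sequence[int], target_argmax: Sequence[int]) -> List[int]:
    """Return the longest prefix of `draft` that the target agrees with.

    target_argmax[i] is the target's greedy choice at draft position i,
    assumed to come from one batched verify pass over the whole block.
    """
    accepted: List[int] = []
    for drafted, wanted in zip(draft, target_argmax):
        if drafted != wanted:
            break  # the first disagreement ends the accepted prefix
        accepted.append(drafted)
    return accepted


def mean_acceptance_length(committed_per_step: Sequence[int]) -> float:
    """tau: average number of tokens committed per target step."""
    return sum(committed_per_step) / len(committed_per_step)
```

Many speculative decoders also commit the target's own token at the first mismatch. Whether hipfire counts that token in τ is not stated here, so the sketch leaves it out.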

DFlash is hipfire's implementation: the drafter is a z-lab Qwen3.5-DFlash native head (MQ4 weights, single transformer block trained on the target's hidden states), and the target's verify pass is a fused flash-attention kernel over the K-tree on RDNA's WMMA path. The adaptive block_size sweeps 8..16 with a mean B=16 when the drafter is hot.

the τ=13.18 invariant

DFlash acceptance is a property of (target distribution, draft distribution, prompt) — not the hardware. On the canonical merge_sort_thinking_off prompt (md5 253c7ac50857fe6d0e10fb0d2c5e35c0) with greedy decode (--temp 0.0), hipfire produces byte-identical token streams across every RDNA arch we test. The acceptance numbers come out the same to four decimals:

arch gen model τ accept_rate decode tok/s

That hardware-invariant τ is a strong correctness signal: it means the target's argmax sequence on this prompt is bit-identical across gfx1010, gfx1030, gfx1100, gfx1151, and gfx1201, so five different kernel paths (scalar-FMA, dp4a, WMMA16, WMMA32, mixed) pick the same token at every position. A numerical difference large enough to flip an argmax would show up as extra draft rejections, and τ would drift from cell to cell.
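One way to check the byte-identical claim on your own hardware is to dump the emitted token ids from two runs and look for the first position where they differ. The helper below is a hypothetical sketch; how you capture the token ids from each run is left open.

```python
from typing import Optional, Sequence


def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Index of the first position where two token streams differ.

    Returns None when the streams are identical.
    """
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One stream is a strict prefix of the other.
        return min(len(a), len(b))
    return None
```

A result of None for every pair of archs is what the invariant predicts under greedy decode; any integer result points at the first token worth inspecting.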

genre-conditional speedup

DFlash speedup is genre-conditional. Code prompts win big: syntactic structure makes per-position predictions high-confidence, so draft and target agree often. On prose, where continuations are high-entropy, the draft's top-1 frequently disagrees with the target's, τ drops to ~1, and the K-token verify pass costs more than decoding a single token.

model          genre                   AR tok/s   DFlash tok/s   speedup   τ
Qwen 3.5 27B   code (HumanEval/53)     44.1       196.0          4.45×     9.82
Qwen 3.5 27B   code (merge_sort)       122.9      576.9          4.70×     13.18
Qwen 3.5 27B   prose (Rome essay)      44.0       49.6           1.13×     1.67
Qwen 3.5 27B   instruct (sky-color)    44.6       44.7           1.00×     1.39
Qwen 3.5 9B    code (HumanEval/0)      121.9      372.9          3.06×     8.23
Qwen 3.5 9B    instruct (sky-color)    124.4      246.9          1.99×     4.76
Qwen 3.5 9B    prose (Federalist)      125.3      99.4           0.79×     1.20

Measured on a 7900 XTX with MQ4 weights and q8 KV cache; AR is plain autoregressive decode. Code rows are real wins; prose can be a net loss when draft and target diverge.
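The table above also lets you back out how expensive one draft-plus-verify cycle is relative to a plain decode step. The sketch below assumes each cycle commits τ tokens on average, so speedup = τ / cycle_cost. That model and the function names are illustrative assumptions, not hipfire's own accounting.

```python
# (model, genre, AR tok/s, DFlash tok/s, tau), copied from the table above.
ROWS = [
    ("Qwen 3.5 27B", "code (HumanEval/53)", 44.1, 196.0, 9.82),
    ("Qwen 3.5 27B", "code (merge_sort)", 122.9, 576.9, 13.18),
    ("Qwen 3.5 27B", "prose (Rome essay)", 44.0, 49.6, 1.67),
    ("Qwen 3.5 27B", "instruct (sky-color)", 44.6, 44.7, 1.39),
    ("Qwen 3.5 9B", "code (HumanEval/0)", 121.9, 372.9, 8.23),
    ("Qwen 3.5 9B", "instruct (sky-color)", 124.4, 246.9, 4.76),
    ("Qwen 3.5 9B", "prose (Federalist)", 125.3, 99.4, 1.20),
]


def speedup(ar_tok_s: float, dflash_tok_s: float) -> float:
    """Throughput ratio of DFlash over plain autoregressive decode."""
    return dflash_tok_s / ar_tok_s


def relative_cycle_cost(ar_tok_s: float, dflash_tok_s: float, tau: float) -> float:
    """Cost of one draft+verify cycle, in units of one plain decode step.

    Assumes every cycle commits `tau` tokens on average.
    """
    return tau / speedup(ar_tok_s, dflash_tok_s)


for model, genre, ar, df, tau in ROWS:
    print(
        f"{model:13} {genre:22} "
        f"speedup={speedup(ar, df):.2f}x "
        f"cycle_cost={relative_cycle_cost(ar, df, tau):.2f}"
    )
```

Under that assumption DFlash beats plain decode only when τ exceeds the cycle cost, which is why rows with τ close to 1 break even or lose.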

The runtime auto-detects with dflash_mode=auto: it turns DFlash on for dense Qwen 3.5+ targets and off where it historically loses. Per-model overrides: hipfire config qwen3.5:27b set dflash_mode on.

drafter mismatch — Qwen 3.6 27B

On the canonical code prompt, Qwen 3.5 27B + the z-lab Qwen3.5 drafter gets τ=13.18. The Qwen 3.6 27B target + the z-lab Qwen3.6 drafter gets τ=10.93 (accept_rate 0.73 vs 0.88) — a 17% τ drop translating to ~16% lower DFlash throughput (88.3 vs 104.5 tok/s on Strix Halo). Same hardware, same prompt, same engine.

That's draft-target distribution mismatch, not a kernel issue. The Qwen 3.6 drafter is younger and trained against a different target distribution snapshot than the deployed 3.6 weights. As the 3.6 drafter matures, that gap should close.
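The percentages quoted for the 3.5 vs 3.6 comparison follow directly from the τ and throughput figures. A small check, using only the numbers given above:

```python
def relative_drop(baseline: float, value: float) -> float:
    """Fractional decrease of `value` relative to `baseline`."""
    return (baseline - value) / baseline


tau_drop = relative_drop(13.18, 10.93)        # Qwen 3.5 tau vs Qwen 3.6 tau
throughput_drop = relative_drop(104.5, 88.3)  # DFlash tok/s on Strix Halo

print(f"tau drop:        {tau_drop:.1%}")         # 17.1%
print(f"throughput drop: {throughput_drop:.1%}")  # 15.5%
```

The throughput drop is slightly smaller than the τ drop, which is what you would expect if part of each cycle's cost does not depend on how many drafted tokens are accepted.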

how to enable

From the CLI:

$ hipfire pull qwen3.5:9b
$ hipfire pull qwen3.5:9b-dflash   # downloads the drafter sidecar
$ hipfire config qwen3.5:9b set dflash_mode on

From the daemon's HTTP API or the example binary:

$ ./target/release/examples/dflash_spec_demo \
    --target  ~/.hipfire/models/qwen3.5-9b.mq4 \
    --draft   ~/.hipfire/models/qwen35-9b-dflash-mq4.hf4 \
    --prompt-file benchmarks/prompts/merge_sort_thinking_off.txt \
    --max 256 --temp 0.0 --no-chatml --kv-mode q8 --ctx 4096

caveats

  • Default off on v0.1.x (genre-conditional regression risk on prose). dflash_mode=auto is the recommended opt-in.
  • KV cache mode matters. --kv-mode q8 holds drafter cross-attention quality on prose; asym3 can collapse prose τ to ~1.
  • Prompt-shape sensitivity. One newline can swing τ by 17%. Pin your prompt md5 if you're comparing across sessions.
  • Drafter required per target family. 3.5 drafter ≠ 3.6 drafter. Mixing silently degrades acceptance — hipfire warns if the base_model_revision headers don't match.
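Pinning the prompt md5, as the prompt-shape caveat suggests, can be done with a short pre-flight check. This sketch assumes the digest is taken over the raw bytes of the prompt file; the helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def prompt_md5(path: str) -> str:
    """Hex md5 of the file's raw bytes, so a stray newline changes the digest."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def check_prompt(path: str, expected_md5: str) -> None:
    """Raise ValueError if the prompt file no longer matches the pinned digest."""
    actual = prompt_md5(path)
    if actual != expected_md5:
        raise ValueError(f"prompt drifted: expected {expected_md5}, got {actual}")


# Example call, using the canonical prompt and digest quoted earlier:
# check_prompt(
#     "benchmarks/prompts/merge_sort_thinking_off.txt",
#     "253c7ac50857fe6d0e10fb0d2c5e35c0",
# )
```

Running the check before each benchmark session keeps τ comparisons from being skewed by an editor that silently adds or strips a trailing newline.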