what is DFlash
Speculative decoding in the standard shape: a small drafter model emits a block of K candidate tokens, the large target model verifies them in a single batched forward pass, and the longest draft prefix that matches the target's argmax is committed. The acceptance length (τ) is the average number of tokens committed per target step.
DFlash is hipfire's implementation: the drafter is a z-lab Qwen3.5-DFlash native head (MQ4 weights, single transformer block trained on the target's hidden states), and the target's verify pass is a fused flash-attention kernel over the K-tree on RDNA's WMMA path. The adaptive block_size sweeps 8..16 with a mean B=16 when the drafter is hot.
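The greedy acceptance rule described above can be sketched in a few lines. This is an illustrative sketch, not hipfire's code: the function names and token ids are made up, and whether hipfire's τ also counts the target's own token at the first mismatch is not stated here, so the sketch counts matching draft tokens only.

```python
from typing import List, Sequence


def accepted_prefix_len(draft: Sequence[int], target_argmax: Sequence[int]) -> int:
    """Count leading draft tokens that equal the target's greedy choice."""
    n = 0
    for d, t in zip(draft, target_argmax):
        if d != t:
            break  # everything after the first mismatch is discarded
        n += 1
    return n


def mean_acceptance(per_step_committed: List[int]) -> float:
    """Average number of tokens committed per target step (tau)."""
    return sum(per_step_committed) / len(per_step_committed)


# Toy token ids, not real model output.
draft = [11, 42, 7, 99, 5]
target = [11, 42, 7, 13, 5]
committed = accepted_prefix_len(draft, target)
print(committed)
```

Note that the last position matches again (5 == 5) but is not committed, because it follows a rejected token and was drafted on a context the target disagrees with.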
the τ=13.18 invariant
DFlash acceptance is a property of (target distribution, draft
distribution, prompt) — not the hardware. On the canonical
merge_sort_thinking_off prompt
(md5 253c7ac50857fe6d0e10fb0d2c5e35c0) with greedy
decode (--temp 0.0), hipfire produces byte-identical
token streams across every RDNA arch we test. The acceptance
numbers come out the same to four decimals:
| arch | gen | model | τ | accept_rate | decode tok/s |
|---|---|---|---|---|---|
That hardware-invariant τ is a strong correctness signal: it means the target's argmax sequence on this prompt is bit-identical across gfx1010, gfx1030, gfx1100, gfx1151, and gfx1201, so five different kernel paths (scalar-FMA, dp4a, WMMA16, WMMA32, mixed) select the same token at every position. A numerical difference large enough to flip an argmax would show up as rejected draft tokens, and τ would drift from cell to cell.
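One way to check this kind of invariant is to compare the generated token streams directly and report the first position where any arch departs from a reference. The sketch below assumes you have already collected one token-id list per arch. The helper name and the toy data are hypothetical and not part of hipfire.

```python
from typing import Dict, List, Optional, Tuple


def first_divergence(streams: Dict[str, List[int]]) -> Optional[Tuple[str, int]]:
    """Return (arch, index) of the first mismatch against the reference stream.

    The reference is the alphabetically first arch. Returns None when every
    stream is identical to it.
    """
    names = sorted(streams)
    ref = streams[names[0]]
    for name in names[1:]:
        other = streams[name]
        for i, (a, b) in enumerate(zip(ref, other)):
            if a != b:
                return (name, i)
        if len(ref) != len(other):
            # One stream is a strict prefix of the other.
            return (name, min(len(ref), len(other)))
    return None


# Toy streams, not real hipfire output.
same = {"gfx1030": [1, 2, 3], "gfx1100": [1, 2, 3], "gfx1201": [1, 2, 3]}
drifted = {"gfx1030": [1, 2, 3], "gfx1100": [1, 2, 4], "gfx1201": [1, 2, 3]}
print(first_divergence(same), first_divergence(drifted))
```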
genre-conditional speedup
DFlash speedup is genre-conditional. Code prompts win big: syntactic structure makes per-position predictions high-confidence, so draft and target agree often. In prose, continuations are high-entropy, the draft's top-1 frequently disagrees with the target's, τ drops to ~1, and the K-token verify pass costs more than decoding one token at a time.
| model | genre | AR tok/s | DFlash tok/s | × | τ |
|---|---|---|---|---|---|
| Qwen 3.5 27B | code (HumanEval/53) | 44.1 | 196.0 | 4.45× | 9.82 |
| Qwen 3.5 27B | code (merge_sort) | 122.9 | 576.9 | 4.70× | 13.18 |
| Qwen 3.5 27B | prose (Rome essay) | 44.0 | 49.6 | 1.13× | 1.67 |
| Qwen 3.5 27B | instruct (sky-color) | 44.6 | 44.7 | 1.00× | 1.39 |
| Qwen 3.5 9B | code (HumanEval/0) | 121.9 | 372.9 | 3.06× | 8.23 |
| Qwen 3.5 9B | instruct (sky-color) | 124.4 | 246.9 | 1.99× | 4.76 |
| Qwen 3.5 9B | prose (Federalist) | 125.3 | 99.4 | 0.79× ✗ | 1.20 |
7900 XTX, MQ4, q8 KV. Code rows are real wins; prose can be a net loss on draft-target divergence.
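A rough way to read the table is a one-parameter cost model: if one draft-plus-verify cycle costs `cycle_cost` autoregressive decode steps and commits τ tokens, the speedup is τ / `cycle_cost`, and DFlash breaks even when τ equals that cost. This model is an assumption made for illustration, not hipfire's own accounting. It ignores, among other things, that the adaptive block size changes the cycle cost from prompt to prompt.

```python
def modeled_speedup(tau: float, cycle_cost: float) -> float:
    """Speedup over autoregressive decode under the simple cost model."""
    if cycle_cost <= 0:
        raise ValueError("cycle_cost must be positive")
    return tau / cycle_cost


def implied_cycle_cost(tau: float, measured_speedup: float) -> float:
    """Invert the model: AR-step cost of one cycle implied by a measured row."""
    if measured_speedup <= 0:
        raise ValueError("measured_speedup must be positive")
    return tau / measured_speedup


# (tau, measured speedup) taken from two rows of the table above.
prose_9b = implied_cycle_cost(1.20, 0.79)
code_27b = implied_cycle_cost(13.18, 4.70)
print(f"9B prose cycle cost: {prose_9b:.2f} AR steps")
print(f"27B merge_sort cycle cost: {code_27b:.2f} AR steps")
```

Under this model the 9B prose row pays roughly 1.5 decode steps per cycle to commit only 1.2 tokens, which is why it lands below 1×.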
The runtime auto-detects with dflash_mode=auto: it
turns DFlash on for dense Qwen 3.5+ targets and off where it
historically loses. Per-model overrides:
hipfire config qwen3.5:27b set dflash_mode on.
drafter mismatch — Qwen 3.6 27B
On the canonical code prompt, Qwen 3.5 27B + the z-lab Qwen3.5 drafter gets τ=13.18. The Qwen 3.6 27B target + the z-lab Qwen3.6 drafter gets τ=10.93 (accept_rate 0.73 vs 0.88) — a 17% τ drop translating to ~16% lower DFlash throughput (88.3 vs 104.5 tok/s on Strix Halo). Same hardware, same prompt, same engine.
That's draft-target distribution mismatch, not a kernel issue. The Qwen 3.6 drafter is younger and trained against a different target distribution snapshot than the deployed 3.6 weights. As the 3.6 drafter matures, that gap should close.
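The percentages quoted above follow directly from the reported numbers. A short arithmetic check, using only figures from this section:

```python
def relative_drop(baseline: float, observed: float) -> float:
    """Fractional decrease of observed relative to baseline."""
    return (baseline - observed) / baseline


tau_drop = relative_drop(13.18, 10.93)   # Qwen 3.5 vs Qwen 3.6 acceptance length
tput_drop = relative_drop(104.5, 88.3)   # DFlash tok/s on Strix Halo
print(f"tau drop: {tau_drop:.1%}, throughput drop: {tput_drop:.1%}")
```

The throughput drop is slightly smaller than the τ drop, which is what you would expect if part of each cycle's cost does not depend on how many tokens are accepted.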
how to enable
From the CLI:
$ hipfire pull qwen3.5:9b
$ hipfire pull qwen3.5:9b-dflash   # downloads the drafter sidecar
$ hipfire config qwen3.5:9b set dflash_mode on
From the daemon's HTTP API or the example binary:
$ ./target/release/examples/dflash_spec_demo \
--target ~/.hipfire/models/qwen3.5-9b.mq4 \
--draft ~/.hipfire/models/qwen35-9b-dflash-mq4.hf4 \
--prompt-file benchmarks/prompts/merge_sort_thinking_off.txt \
--max 256 --temp 0.0 --no-chatml --kv-mode q8 --ctx 4096
caveats
- Default off on v0.1.x (genre-conditional regression risk on prose). dflash_mode=auto is the recommended opt-in.
- KV cache mode matters. --kv-mode q8 holds drafter cross-attention quality on prose; asym3 can collapse prose τ to ~1.
- Prompt-shape sensitivity. One newline can swing τ by 17%. Pin your prompt md5 if you're comparing across sessions.
- Drafter required per target family. 3.5 drafter ≠ 3.6 drafter. Mixing silently degrades acceptance; hipfire warns if the base_model_revision headers don't match.
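Pinning the prompt md5, as the prompt-shape caveat suggests, takes only the standard library. The sketch below hashes the raw prompt bytes and shows that a single trailing newline changes the digest. The helper names and the toy prompt are illustrative. In practice you would read the bytes from your prompt file and compare against the md5 you recorded.

```python
import hashlib


def prompt_md5(prompt_bytes: bytes) -> str:
    """Hex md5 of the exact prompt bytes, whitespace included."""
    return hashlib.md5(prompt_bytes).hexdigest()


def check_pinned(prompt_bytes: bytes, expected_md5: str) -> bool:
    """True when the prompt still matches the digest recorded earlier."""
    return prompt_md5(prompt_bytes) == expected_md5.lower()


# Toy prompt, not the canonical benchmark file.
base = b"Write merge sort in Python.\n"
with_extra_newline = base + b"\n"

pinned = prompt_md5(base)
print(check_pinned(base, pinned))                # unchanged prompt
print(check_pinned(with_extra_newline, pinned))  # one extra newline
```

Hash the bytes rather than a decoded string so that line-ending or encoding changes made by an editor are caught as well.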