Skip to content

Kaden-Schutt/hipfire

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1,185 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

hipfire

LLM inference for AMD RDNA GPUs. Rust + HIP. Single binary. No Python in the hot path. Ollama-style UX.

hipfire pull qwen3.5:9b
hipfire run  qwen3.5:9b "What is the capital of France?"
hipfire serve -d        # background daemon, OpenAI-compatible API on 0.0.0.0:11435

Current release: v0.1.20 — engine modularization. See CHANGELOG.md.

Discord: https://discord.gg/F3BaywB8Rs

Why

llama.cpp + ROCm works on RDNA but is painful: upstream ROCm officially supports only a handful of datacenter cards; consumer RDNA is a second-class citizen. hipfire targets the entire RDNA family (RDNA1 → RDNA4, consumer + pro + APU) with a single Rust binary that ships pre-compiled kernel blobs when possible and JIT-compiles the rest through HIP. No Python, no PyTorch, no ROCm userspace stack at runtime.

Headline numbers — 7900 XTX (gfx1100)

Decode tok/s, default config (asym3 KV, FlashAttention auto):

Model hipfire decode hipfire prefill (peak) vs ollama Q4_K_M
Qwen 3.5 0.8B 391 7383 2.10× decode
Qwen 3.5 4B 180 2487 1.78× decode
Qwen 3.5 9B 132 1663 1.71× decode
Qwen 3.5 27B 47 478

DFlash speculative decode lifts code prompts further: 218 tok/s peak on 27B HumanEval/53 (4.45× over AR), 372 tok/s peak on 9B. DFlash speedup is genre-conditional — see docs/BENCHMARKS.md for the full per-genre table and the cross-arch matrix (RDNA1 / RDNA2 / APU / MI300X).

Install

Linux with ROCm 6+:

curl -L https://raw.githubusercontent.com/Kaden-Schutt/hipfire/master/scripts/install.sh | bash

For Windows, source builds, and verifying the install: docs/GETTING_STARTED.md.

NixOS

First-class support via Nix flake. See docs/NIXOS.md.

nix develop github:Kaden-Schutt/hipfire  # dev shell with Rust + ROCm + bun
nix build github:Kaden-Schutt/hipfire    # build package

NixOS module:

{
  inputs.hipfire.url = "github:Kaden-Schutt/hipfire";
  # then in configuration.nix:
  services.hipfire.enable = true;
  services.hipfire.gpuTargets = [ "gfx1100" ];
}

Inspiration: Lucebox

hipfire's DFlash work was substantially shaped by Davide Ciffa's Lucebox DFlash on ggml — a standalone C++/ggml/CUDA DFlash for Qwen 3.5-27B on a single RTX 3090. Different stack, different vendor — but Lucebox's blog gave us concrete published numbers to target, n_gen-aware bench methodology, and pointers at where the fat is. Cached snapshot at .research-cache/lucebox-dflash27b.html for forensic reproducibility.

Inspiration: gfx906 (MI50/MI60) optimizations

hipfire's gfx906 prefill MMQ kernel and AR-decode optimizations were shaped by two community forks of llama.cpp that target Vega 20:

  • iacopPBK/llama.cpp-gfx906 — the original fork that ported and tuned gfx906-specific code paths (warp-cooperative GEMV via half-wave split, Y-tile prefetch via inline-asm global_load_dword, __builtin_amdgcn_readfirstlane-based SGPR hoisting, separate HBM-load → register-cache → LDS-store pipelining in the MMQ body). The "2602.01 version" commit eec153c086df6a9e7a69499bea3639597c085fff was the canonical reference we audited against.
  • skyne98/llama.cpp-gfx906 — fork-of-fork that propagates iacop's optimizations (commit 42c298c "port iacop optimizations") and tracks upstream more aggressively. The accompanying skyne98/wiki-gfx906 is the best public reference for gfx906 ISA quirks (LDS bank-conflict patterns at stride 32, dp4a issue-rate ceiling, Q8_1 activation layout) — we used it as a sanity-check for several PMC-driven redesign decisions.

And of course an extra shout-out to ggml-org/llama.cpp itself: the templated mmq_x body in mul_mat_q.cu was the architectural scaffold we ported to gfx906 (templated mmq_x ladder, per-thread accumulator layout, MMQ_TILE_NE_K=32 sub-block factoring, Q8_1 quantize math). The inner loop is gfx906-specific; the outer shape is descendant.

A standalone gfx906 perf investigation log is at docs/perf-checkpoints/2026-05-05-gfx906-decode-investigation.md; the prefill MMQ redesign log is at docs/perf-checkpoints/2026-05-05-gfx906-mmq-redesign-final.md.

Documentation

Page Topic
GETTING_STARTED.md Install, first run, what to read next
NIXOS.md NixOS flake, module, dev shell
CLI.md Every subcommand, flags, file locations
MODELS.md Curated tags, BYO models, file extensions
QUANTIZE.md hipfire quantize for HF / safetensors / GGUF
CONFIG.md Every config key, env overrides
SERVE.md OpenAI-compatible HTTP API
BENCHMARKS.md Measured perf per arch, vs ollama
ARCHITECTURE.md Engine layout, dispatch, two model paths
QUANTIZATION.md MQ4 / HF4 design, asym KV cache, FWHT math
multi-gpu.md Pipeline-parallel (pp≥2) — memory budget, deployment, refusals
methodology/perf-benchmarking.md Bench protocol — read before claiming a perf win

License

hipfire is dual-licensed under MIT or Apache-2.0 at your option. See LICENSE (dual-license pointer), LICENSE-MIT, LICENSE-APACHE, and NOTICE for details.

New contributions default to Apache-2.0 via DCO sign-off; existing contributors' MIT-licensed contributions remain MIT unless they opt in. Each source file carries an SPDX-License-Identifier reflecting actual authorship (MIT, Apache-2.0, or MIT OR Apache-2.0). See CONTRIBUTING.md for the contributor side and docs/governance/relicense-2026-05.md for the decision record (including the 2026-05-19 course correction from a unilateral Apache-2.0 relicense to dual licensing).

Original architectural innovations originating in hipfire are catalogued in PRIOR-ART.md; derivative works (including reimplementations informed by hipfire's design) should attribute the corresponding inventions per AGENTS.md.

Contributing

See CONTRIBUTING.md. Any change to kernels, quant formats, dispatch, fusion, rotation, rmsnorm, or the spec-decode path must pass ./scripts/coherence-gate-dflash.sh before commit. The canonical correctness gate is per-arch channel-test; the speed-gate catches regressions on the baseline arch. Don't bypass either with --no-verify — see methodology/perf-benchmarking.md.