hipfire

LLM inference for AMD RDNA GPUs. Rust + HIP. Single binary. No Python in the hot path. Ollama-style UX.

hipfire pull qwen3.5:9b
hipfire run  qwen3.5:9b "What is the capital of France?"
hipfire serve -d        # background daemon, OpenAI-compatible API on 0.0.0.0:11435

Current release: v0.1.20 — engine modularization. See CHANGELOG.md.

Discord: https://discord.gg/F3BaywB8Rs

Why

llama.cpp + ROCm works on RDNA but is painful: upstream ROCm officially supports only a handful of datacenter cards; consumer RDNA is a second-class citizen. hipfire targets the entire RDNA family (RDNA1 → RDNA4, consumer + pro + APU) with a single Rust binary that ships pre-compiled kernel blobs when possible and JIT-compiles the rest through HIP. No Python, no PyTorch, no ROCm userspace stack at runtime.

Headline numbers — 7900 XTX (gfx1100)

Decode tok/s, default config (asym3 KV, FlashAttention auto):

Model	hipfire decode	hipfire prefill (peak)	vs ollama Q4_K_M
Qwen 3.5 0.8B	391	7383	2.10× decode
Qwen 3.5 4B	180	2487	1.78× decode
Qwen 3.5 9B	132	1663	1.71× decode
Qwen 3.5 27B	47	478	—

DFlash speculative decode lifts code prompts further: 218 tok/s peak on 27B HumanEval/53 (4.45× over AR), 372 tok/s peak on 9B. DFlash speedup is genre-conditional — see docs/BENCHMARKS.md for the full per-genre table and the cross-arch matrix (RDNA1 / RDNA2 / APU / MI300X).

Install

Linux with ROCm 6+:

curl -L https://raw.githubusercontent.com/Kaden-Schutt/hipfire/master/scripts/install.sh | bash

For Windows, source builds, and verifying the install: docs/GETTING_STARTED.md.

NixOS

First-class support via Nix flake. See docs/NIXOS.md.

nix develop github:Kaden-Schutt/hipfire  # dev shell with Rust + ROCm + bun
nix build github:Kaden-Schutt/hipfire    # build package

NixOS module:

{
  inputs.hipfire.url = "github:Kaden-Schutt/hipfire";
  # then in configuration.nix:
  services.hipfire.enable = true;
  services.hipfire.gpuTargets = [ "gfx1100" ];
}

Inspiration: Lucebox

hipfire's DFlash work was substantially shaped by Davide Ciffa's Lucebox DFlash on ggml — a standalone C++/ggml/CUDA DFlash for Qwen 3.5-27B on a single RTX 3090. Different stack, different vendor — but Lucebox's blog gave us concrete published numbers to target, n_gen-aware bench methodology, and pointers at where the fat is. Cached snapshot at .research-cache/lucebox-dflash27b.html for forensic reproducibility.

Inspiration: gfx906 (MI50/MI60) optimizations

hipfire's gfx906 prefill MMQ kernel and AR-decode optimizations were shaped by two community forks of llama.cpp that target Vega 20:

iacopPBK/llama.cpp-gfx906 — the original fork that ported and tuned gfx906-specific code paths (warp-cooperative GEMV via half-wave split, Y-tile prefetch via inline-asm global_load_dword, __builtin_amdgcn_readfirstlane-based SGPR hoisting, separate HBM-load → register-cache → LDS-store pipelining in the MMQ body). The "2602.01 version" commit eec153c086df6a9e7a69499bea3639597c085fff was the canonical reference we audited against.
skyne98/llama.cpp-gfx906 — fork-of-fork that propagates iacop's optimizations (commit 42c298c "port iacop optimizations") and tracks upstream more aggressively. The accompanying skyne98/wiki-gfx906 is the best public reference for gfx906 ISA quirks (LDS bank-conflict patterns at stride 32, dp4a issue-rate ceiling, Q8_1 activation layout) — we used it as a sanity-check for several PMC-driven redesign decisions.

And of course an extra shout-out to ggml-org/llama.cpp itself: the templated mmq_x body in mul_mat_q.cu was the architectural scaffold we ported to gfx906 (templated mmq_x ladder, per-thread accumulator layout, MMQ_TILE_NE_K=32 sub-block factoring, Q8_1 quantize math). The inner loop is gfx906-specific; the outer shape is descendant.

A standalone gfx906 perf investigation log is at docs/perf-checkpoints/2026-05-05-gfx906-decode-investigation.md; the prefill MMQ redesign log is at docs/perf-checkpoints/2026-05-05-gfx906-mmq-redesign-final.md.

Documentation

Page	Topic
GETTING_STARTED.md	Install, first run, what to read next
NIXOS.md	NixOS flake, module, dev shell
CLI.md	Every subcommand, flags, file locations
MODELS.md	Curated tags, BYO models, file extensions
QUANTIZE.md	`hipfire quantize` for HF / safetensors / GGUF
CONFIG.md	Every config key, env overrides
SERVE.md	OpenAI-compatible HTTP API
BENCHMARKS.md	Measured perf per arch, vs ollama
ARCHITECTURE.md	Engine layout, dispatch, two model paths
QUANTIZATION.md	MQ4 / HF4 design, asym KV cache, FWHT math
multi-gpu.md	Pipeline-parallel (pp≥2) — memory budget, deployment, refusals
methodology/perf-benchmarking.md	Bench protocol — read before claiming a perf win

License

hipfire is dual-licensed under MIT or Apache-2.0 at your option. See LICENSE (dual-license pointer), LICENSE-MIT, LICENSE-APACHE, and NOTICE for details.

New contributions default to Apache-2.0 via DCO sign-off; existing contributors' MIT-licensed contributions remain MIT unless they opt in. Each source file carries an SPDX-License-Identifier reflecting actual authorship (MIT, Apache-2.0, or MIT OR Apache-2.0). See CONTRIBUTING.md for the contributor side and docs/governance/relicense-2026-05.md for the decision record (including the 2026-05-19 course correction from a unilateral Apache-2.0 relicense to dual licensing).

Original architectural innovations originating in hipfire are catalogued in PRIOR-ART.md; derivative works (including reimplementations informed by hipfire's design) should attribute the corresponding inventions per AGENTS.md.

Contributing

See CONTRIBUTING.md. Any change to kernels, quant formats, dispatch, fusion, rotation, rmsnorm, or the spec-decode path must pass ./scripts/coherence-gate-dflash.sh before commit. The canonical correctness gate is per-arch channel-test; the speed-gate catches regressions on the baseline arch. Don't bypass either with --no-verify — see methodology/perf-benchmarking.md.

Name		Name	Last commit message	Last commit date
Latest commit History 1,185 Commits
.agents/skills		.agents/skills
.githooks		.githooks
.github		.github
bench		bench
benchmarks		benchmarks
cli		cli
crates		crates
docker		docker
docs		docs
findings		findings
kernels/src		kernels/src
nix		nix
scripts		scripts
tests		tests
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
CLAUDE.md		CLAUDE.md
CONTRIBUTING.md		CONTRIBUTING.md
CREDITS.md		CREDITS.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
LICENSE-APACHE		LICENSE-APACHE
LICENSE-MIT		LICENSE-MIT
MANUAL_REVIEW.md		MANUAL_REVIEW.md
NOTICE		NOTICE
PR-115-lloyd-max-cb-plan-rev-gemini.md		PR-115-lloyd-max-cb-plan-rev-gemini.md
PRIOR-ART.md		PRIOR-ART.md
README.md		README.md
flake.lock		flake.lock
flake.nix		flake.nix
harness.sh		harness.sh
test-serve.sh		test-serve.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

hipfire

Why

Headline numbers — 7900 XTX (gfx1100)

Install

NixOS

Inspiration: Lucebox

Inspiration: gfx906 (MI50/MI60) optimizations

Documentation

License

Contributing

About

Licenses found

Uh oh!

Releases 19

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

hipfire

Why

Headline numbers — 7900 XTX (gfx1100)

Install

NixOS

Inspiration: Lucebox

Inspiration: gfx906 (MI50/MI60) optimizations

Documentation

License

Contributing

About

Topics

Resources

License

Licenses found

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 19

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages