Net cleanups across the session's commits without changing behavior:
* `src/hlir.rs`:
- Each binary op's `rewrites()` now reuses `self.early_rewrites()`
instead of rebuilding the unroll-rule list — eliminates the 4×
repeated boilerplate and the 4× repeated "see Add::rewrites for why
we register in both stages" comment.
- Hoist that explanation into the `binary_op_unroll_rules` doc where
it actually applies (one place, not four).
- `binary_op_unroll_rule` collapses the dual `match state_pos`
blocks into a single `order(state, per_iter)` closure used for
both the body match pattern and each unrolled chain element.
* `src/graph.rs` (`unroll_loops_in_llir`):
- Drop the named `iteration_invariant_slots` set. The check
`body_nodes.contains(&body_producer)` it cached is equivalent to
`clone_map[i].get(&body_producer).is_some()`, so resolve_src and
marker_post_sub both express the case inline as
`clone_map.get(&bp).copied().unwrap_or(bp)`. The set's worth was
naming the case; a single comment block at start_meta does that
more cheaply.
- Drop the orphan-LoopOutputSelect skip from 93fb02c4 — the gemma
diagnostic showed the real failure was the iteration-invariant
body_producer case only; the orphan-select case was speculative
defensiveness for a scenario the rolling/extraction pipeline
can't actually produce.
- Drop the `collapse_loops_to_first_iter` informational comment
block; collapse just works without special handling for invariant
slots and didn't need the explanation.
* `crates/luminal_cuda_lite/src/tests/transformer.rs`:
- Collapse the three exploratory body=1 trips=3 tests
(`test_three_chained_scalar_muls`,
`test_three_chained_scalar_muls_with_downstream_consumer`,
`test_three_chained_scalar_muls_with_initial_residual`) into one
`test_rolled_chained_scalar_muls` that exercises the chain plus a
residual back to its initial input — the strongest topology of
the three (covers per-iter body cloning, post-loop wiring, and
the residual edge to the loop-external initial value).
Tests: cuda_lite 80/80, python CUDA 12 + 4 xfailed (test_llama3
subset), gemma example end-to-end. fmt + clippy clean.
Diff vs loop_rolling base: 347 → 217 inserted lines (−130).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two prior commits (16de9638, 93fb02c4) handled the gemma CI panic
by swapping `clone_map[i-1][&body_producer]` for
`clone_map[i-1].get(&body_producer).unwrap_or(body_producer)`. That
suppresses the panic but reads like a defensive band-aid — the comment
hand-waves about "extraction-shape variation" without naming the
actual situation.
Local repro on the gemma example (built locally, weights downloaded
from HF) shows the case is real and documented:
slot=0 body_producer NodeIndex(3040) NOT in body_nodes
body_producer op: KernelConstant { value: 9.21034 } # ln(10000)
slot=1 body_producer NodeIndex(5035) NOT in body_nodes
initial = NodeIndex(5035) (same node)
body_producer op: KernelConstant { value: 1.442695 } # log2(e)
These are RoPE frequency factors: the body chain provably reduces to a
constant via cuda_lite's kernel-level rewrites, and the genome's
extraction picks the constant directly for LoopEnd's incoming
eclass. The state really is iteration-invariant — every iter sees the
same value. There's no LLIR corruption; the forward-walk `body_nodes`
definition just doesn't cover this case because per-iter cloning isn't
needed for it.
Refactor:
* Compute `iteration_invariant_slots: HashSet<LoopStart>` at the same
time as `start_meta`, with the rule `body_producer ∉ body_nodes ⇒
invariant`.
* `resolve_src` branches explicitly: invariant slot → `body_producer`,
else standard per-iter clone lookup.
* `marker_post_sub` branches the same way.
* Drop the `collapse_loops_to_first_iter` backward-walk backfill the
prior commit added — collapse doesn't have the panic site, and a
Constant body_producer either has no incoming edges (so the body-
iteration loop is a no-op for it) or the existing `marker_post_sub`
insert already routes consumers to it correctly.
Behavior is identical to the prior commits; the diff is purely about
making the documented case discoverable in code rather than implicit
in an `unwrap_or`.
cuda_lite (82/82), python CUDA (223 + 4 xfailed), gemma example: all
green. Adds a LessonsLearned entry.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion defensive fix to 16de9638. `output_body_producer` is keyed
by stream_id and populated from `outputs` (LoopOutput nodes). The
post-loop wiring then indexed `output_body_producer[&stream_id]` for
every LoopOutputSelect, which panics with "no entry found for key" if
extraction lands a LoopOutputSelect whose corresponding LoopOutput
isn't in the LLIR (e.g. a genome that picked a non-LoopOutput
representative for that stream's eclass).
Skip the orphan select rather than panicking. The select node stays
un-substituted, so the post-loop consumer's edge falls through to the
select itself; the select gets removed with the other markers at the
end of unroll. The consumer's edge will dangle, but that's a separate
concern from the unroll-mechanism panic this prevents.
Together with 16de9638, this closes the two `[&key]` index sites in
`unroll_loops_in_llir` that can land on a missing key when egglog
extraction produces a structurally unusual LLIR. Both sites now
gracefully fall through with a defensible semantic (use the body
producer / select node directly), so the unroll mechanism never
panics on extraction-shape variation.
cuda_lite + python CUDA suites still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`unroll_loops_in_llir` was panicking on `clone_map[i-1][&body_producer]`
with "no entry found for key" on the gemma Modal CI job. The line
fired when extraction landed a `body_producer` (LoopEnd's incoming
source) that isn't in `body_nodes` — a forward-walk-from-input-markers
set that misses ops whose only ancestors are non-marker (a constant,
external input, or an op whose chain got congruence-merged off the
marker chain by rules like `LoopInputStatic inline`).
Semantically that body op is iteration-invariant: every iter would
compute the same value, so the loop's state never changes. The
per-iter clone path needed a "no clone, share across iters" fallback
rather than indexing the clone map.
Fix:
- In `unroll_loops_in_llir::resolve_src`, when the LoopStart-resolved
`body_producer` isn't in `body_nodes`, return `body_producer` itself
for iter > 0 (skip the clone_map lookup).
- Mirror the same `unwrap_or(body_producer)` fallback in
`marker_post_sub` for LoopEnd / LoopOutputSelect post-loop wiring.
- In `collapse_loops_to_first_iter`, add a backward-walk-from-end-markers
pass that backfills body_nodes with any non-marker non-Output ancestor
of an end-marker. Collapse doesn't have a clone_map (no panic site),
but it does iterate body_nodes to rewire incoming edges before
deleting markers — without backfill, an iteration-invariant
body_producer would keep dangling edges to removed markers.
Local cuda_lite + python CUDA suites pass. The extraction shape that
triggers this isn't reachable from the local fuzzers' search depth, so
this lands as a defensive fix to unblock the gemma Modal job; once
that job goes green we'll know whether the fallback covers all cases
or whether more diagnostic info is needed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The narrow per-binary-op unroll-union rules (introduced in aba96275)
were only registered in `EgglogOp::early_rewrites()`, which the egglog
driver feeds into the early-stage program only. The full-stage program
is built from `EgglogOp::rewrites()` exclusively. So the unrolled chain
materialised in the early egraph, the early→full extract picked the
(cheaper) rolled form, the unrolled chain was lost, and any full-stage
kernel rewrite (e.g. `KernelExp`'s `direct-exp-fusion`, which rewrites
`Mul(?x, log2_e) → Exp2(...)` into a single native `expf` kernel) had
nothing to match against.
Symptom: python `test_llama_transformer_block` (CUDA backend) was off
by ~1e-2 from the PyTorch reference. The PyTorch `pow(2)` decomposition
emits a chain `Log2(x) * 0.693 * 2.0 * 1.442 → Exp2`, where 1.442 is
log2(e). With rolling on, those three scalar muls fold into one body,
and `direct-exp-fusion` couldn't fuse the trailing `Mul(?, log2_e) +
Exp2` into the more accurate `KernelExp` (native expf). The truncated
log2(e) constant accumulates rounding through the multiply chain, the
diff shows up only in rows that exercise the full attention path
(row 0 matched exactly, rows 1–3 drifted).
Fix: register `binary_op_unroll_rules` in BOTH `early_rewrites()` (for
GLUMoE-style early-stage fusion, which still depends on this) AND
`rewrites()` (for full-stage kernel-level fusions like
`direct-exp-fusion`). All four binary HLIR ops (Add/Mul/Mod/LessThan)
get the same treatment.
Also adds three cuda_lite repro tests covering body=1, trips=3 chains
(plain, with residual, with downstream consumer) — all pass and would
have caught any regression in the basic rolling+unroll mechanics.
Tests:
- python CUDA: 223 passed, 4 xfailed (was 222 passed, 1 failed)
- cuda_lite: 82 passed, 0 failed
- workspace tests / fmt / clippy: clean
Adds a LessonsLearned entry per crates/luminal_python/CLAUDE.md
guidance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The auto-roll prepass folds tiny scalar-mul chains (body=1, trips=2)
inside e.g. the gemma_gelu sigmoid expansion into a loop body. The
existing egglog fusion rules (GLUMoE GemmaGELU, etc.) pattern-match a
specific flat chain of binary ops and can't see through the
LoopStart/LoopInput/LoopEnd markers, so rolling silently disables the
fusion and the extracted graph is strictly worse than not rolling at
all.
Add narrow per-binary-op early rewrites that union a rolled
single-op-body loop (trips ≤ 4, state at body input position 0 or 1)
with its fully-unrolled equivalent in the same eclass. The cost-based
extractor then picks whichever representation downstream patterns
prefer — the unrolled form when fusions match through the flat chain,
the rolled form when nothing benefits. No threshold or special-case
in the rolling cost model; the egraph stays the source of truth.
Fixes test_glumoe_gemma_gelu_matches_unfused_output (78 → 79 passing
in cuda_lite). All four binary HLIR ops (Add, Mul, Mod, LessThan)
opt in via early_rewrites().
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The auto-roll prepass inserts LoopStart/LoopEnd/LoopInput/LoopOutput
marker ops into the HLIR. These markers survive through egglog
rewriting into LLIR and must be collapsed by `unroll_loops_in_llir`
before runtime execution — the markers are a search-time scaffold,
not executable ops.
`Graph::search` did this correctly on its chosen best genome, but
`fuzz_genomes` (test utility that exercises alternative extracted
genomes) called `egglog_to_llir` directly without the unroll. The
CUDA runtime then tried to execute genomes containing raw loop
markers, hitting CUDA_ERROR_ILLEGAL_ADDRESS. The crash cascaded
across ~20 downstream tests via shared CUDA context state.
Also lower the rolling occurrence threshold from 3 back to 2 — the
3-occurrence floor that previously masked this bug was a band-aid;
the real fix is the missing unroll call in the test utility.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Proptest-generated test cases (test_slice_pad, test_stack, test_cumulative,
test_layer_norm, test_std, test_var, test_top_k_filter) were failing
after the rolling refactor because the prepass was matching body×2
patterns in tiny HLIRs whose round trip through egglog + unroll isn't
correctness-preserving at that scale. All seven tests previously passed
on the pre-rolling baseline.
The rolling search now skips candidates with fewer than three
occurrences. Real models roll 20–50 repetitions of a transformer block
so this threshold doesn't affect any production path:
- llama: body=83 trips=31, still rolls, TTFT 475 ms, TPOT 22 ms
- qwen3_moe: body=130 trips=47, still rolls, TTFT 252 ms, TPOT 41 ms
Lib tests: 93 pass, 0 fail (up from 86 pass, 7 fail).
Egglog rules that wrap `unary(binary(a, b))` chains in marker boundaries
for every (Add|Mul) × (Sin|Sqrt|Exp|Exp2|Log2|Recip) combination with
matching strides. Flipped test_single_binary_fuses to assert the
singleton does NOT fuse — egglog never seeds from a solo op.
Skipped the tempting `FusionStart(FusionStart(x)) ≡ FusionStart(x)`
idempotence rule: unioning marker layers creates eclass self-loops with
the pair-fuse union, triggering extraction cycles. Without it, re-firing
cascades up to the run-schedule bound of 10 — each layer in a fresh
eclass, all semantically correct as identity passthroughs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Identity pass-through kernels for the binary-inclusive fusion design,
registered in the other_ops Ops tuple. No egglog rules emit them yet
(rules come in follow-up commits); this just makes the marker types
exist so a later compilation pass can collapse bracketed regions into
one kernel. Existing unary fusion tests remain green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Specs the marker-based binary elementwise fusion design: structural,
negative, numerical-parity, and marker-invariant tests — including the
diamond-DAG case where one external input is reused inside the region.
Tests fail until FusionStart/FusionEnd LLIR ops + egglog rules land.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the loop-rolling refactor, `auto_region_plan` was never set to
`Some` anywhere in the codebase, so `default_region_descriptors()`
always returned a single-region vec and the multi-subgraph branch in
`build_search_space` was cold. This commit deletes the entire dead path
and the state fields it gated.
Removed from src/graph.rs:
- `AutoRegionPlan`, `SingleRegionalizedEGraphPlan` structs
- Graph fields: `auto_rolled_regions`, `auto_region_plan`,
`last_regional_llir`, `single_regional_egraph`
- Methods: `auto_rolled_region_groups`, `build_single_regionalized_egraph`,
`search_single_regionalized_deduped`, `search_single_regionalized`,
`regionalized_hlir_debug_graph`, `dump_regionalized_hlir_before_search`,
`missing_graph_outputs`, `debug_regional_output_coverage`,
`regional_llir` accessor
- Zero-caller helpers: `regionalized_hlir_node_count`,
`full_hlir_op_count`, `regionalized_hlir_op_count`,
`build_virtual_loop_region_subgraphs`, `infer_input_shape_for_port`,
`infer_node_output_dtype`, `build_region_remaps`,
`remap_llir_io_nodes`, `build_regionalized_egglog_program`,
`deduped_representative_descriptors`
- Dead branches gated on `auto_rolled_regions` in both profile sites in
`search_single`
- `RollingCandidate.signature` (never read)
- Tests that exercised the dead path:
`test_build_region_remaps_and_remap_io`,
`test_stitch_keeps_real_output_when_boundary_duplicates_id`,
`test_regionalized_hlir_debug_graph_collapses_repeated_regions`, and
the stale assertions in
`test_auto_roll_loops_prepass_creates_regions_for_chain_recurrence`
Removed from src/egglog_utils/mod.rs:
- `hlir_subgraph_to_egglog` (only caller was `auto_rolled_region_groups`)
- `run_egglog_multi_roots` (only caller was `build_single_regionalized_egraph`)
- `stitch_llir_graphs` (only caller was `RegionalLLIR::unroll`)
`RegionalLLIR::unroll` simplified to a direct clone — with exactly one
region per search, there is nothing to stitch.
Net diff: +91 / -2254 lines.
Verified correctness with llama, qwen3_moe, gemma4_moe end-to-end. Lib
tests: 86 pass, 7 fail (all pre-existing — the refactor actually
eliminated two of the previous 9 failures by removing stale regional
test fixtures).
Replaces the structural-hash dedup hack in the rolling prepass with a
principled three-way unification in egglog:
(Op (LoopInput id stream dt) (ICons v0 (ICons v1 ... (INil))))
≡ (Op (LoopInputStatic id stream dt) (ICons x (INil))) [when all vi = x]
≡ x [inlining]
All three representations live in one eclass, so genetic-search extraction
can pick any form (distinct LoopInput per iter, static boundary wrapper,
or inlined shared value). The inlined case is what lets downstream fusion
rules (e.g. the MoE GLUMoE chain) pattern-match on the raw op kind at
boundary positions — which was the original reason MoE was regressing
under the rolled pipeline.
New pieces:
- `LoopInputStatic` HLIR op: a boundary-crossing marker with a
single-element IList. Preserves the invariant that body-entering edges
go through a marker, unlike the old "skip LoopInput" workaround.
- `identical_inputs` egglog relation + recursive saturation rules,
registered in the `expr` ruleset so the schedule's `saturate expr`
step propagates the predicate through N-element ILists.
- `LoopInput -> LoopInputStatic` and `LoopInputStatic -> ?x` union rules.
- `unroll_loops_in_llir` now handles LoopInputStatic nodes: during
unroll, every iter's body clone edges straight to the single shared
source (via `resolve_src`'s `static_source` map).
The boundary invariant "every edge into the body passes through a
LoopStart / LoopInput / LoopInputStatic marker" now holds in the HLIR
after the prepass. Previously the prepass silently emitted unmarked
direct edges whenever per-iter sources happened to be NodeIndex-equal.
Verified:
- qwen3_moe: correct, TTFT 252 ms, TPOT 41 ms
- gemma4_moe: correct, TTFT 435 ms, TPOT 64 ms
- llama: correct, TTFT 491 ms, TPOT 23 ms
- qwen: correct, TTFT 267 ms, TPOT 23 ms
- gemma: correct, TTFT 284 ms, TPOT 23 ms
- paged_llama: correct, all 4 phases run end-to-end
Rule-firing stats in qwen3_moe's early stage:
1527 identical_inputs ind
94 LoopInputStatic inline
94 LoopInput to LoopInputStatic
31 identical_inputs base
When rolling wraps per-iter boundary inputs in LoopInput, the HLIR node
at that position becomes `(Op (LoopInput ...) (ICons ...))` instead of
the original op. Downstream egglog rewrite rules that pattern-match on
specific op kinds (e.g. the GLUMoE fusion rule, which requires
`(Op (Iota (MIter) ?range) (INil))` at `?gu_iota_within`) then fail to
match — and MoE falls back to the raw op chain, which was never
exercised as a standalone path and produces wrong output.
The fix: before wrapping a boundary input position in LoopInput, check
whether all N per-iter sources are STRUCTURALLY identical (e.g., N
separate Iota nodes with the same expression across N layers). If so,
skip creating the LoopInput — iter-0's source stays in place, shared
across all unrolled iters via the `resolve_src` fall-through. Rolling
already had a NodeIndex-equality check, but iota/constant nodes are
usually separate NodeIndex per layer even when semantically identical;
this extends the equality check to structural hashes that recursively
include the op's `to_egglog` rendering and its sources.
Results at HEAD with this fix:
- qwen3_moe: "The capital of France is Paris. The capital of Germany is
Berlin. The capital of Italy is Rome. ..." (correct), TTFT 279 ms,
TPOT 46 ms (vs 5694/1119 garbage before).
- llama/qwen/gemma/paged_llama: still correct, perf unchanged.
- gemma4_moe: fusion now fires but output is still wrong — needs
separate follow-up (the LUMINAL_NO_ROLL=1 escape still works for it).
MoE models (qwen3_moe, gemma4_moe) regress under the new HLIR-rolled
/ LLIR-unrolled pipeline: generated output is garbage and TPOT blows up
~15x. Llama/qwen/gemma work correctly. Root cause is still unknown —
under investigation. The env var gives a temporary bypass so MoE
examples can still produce correct output.
In auto_roll_loops_prepass, after iter 1..N Output HLIR nodes are
removed (one per iter-past-first output slot), StableGraph frees their
NodeIndex slots. A subsequent LoopOutput added for the next output slot
can be assigned one of those freed NodeIndex slots. Later, when removing
duplicate body nodes, the collided NodeIndex (which had previously
referred to a removed Output HLIR and is still in duplicate_body_nodes)
causes the new LoopOutput to be deleted instead — losing the targets
needed for LLIR unroll, which then emitted only one Output in place of
N.
Fix: (1) defer iter 1..N Output removals until after all LoopOutputs
are created, (2) track added_loop_ops and skip them when deleting
duplicate body nodes.
With this, llama/qwen/gemma produce correct output end-to-end via the
new HLIR-rolled → LLIR-unrolled path.
Extends the loop-rolling pipeline from a SubgraphDescriptor side-table
into an in-place HLIR rewrite with loop markers, plus a post-egglog
LLIR deploy-unroll pass. Compiles and extracts correctly; runtime
execution panics with missing-buffer on a cublaslt input for reasons
that still need inspection of the final LLIR graph.
What works:
- Prepass detects the repeating body and mutates `self.graph` in place:
LoopStart/LoopEnd per loop-carried state slot, LoopInput per non-
state boundary position (only when per-iter sources differ),
LoopOutput per non-state body output that is wrapped in an Output
HLIR node (handles both "output_nodes[q] is the Output itself" and
"Output is a consumer" shapes). N-1 duplicate body nodes are
deleted. For llama: 1 LoopStart / 1 LoopEnd / 39 LoopInputs / 1
LoopOutput, 2490 body duplicates removed.
- HLIR ops (LoopStart/LoopEnd/LoopInput/LoopOutput) carry through
egglog and extract back into LLIR. `targets_csv` String field on
LoopOutput serializes per-iter output-node ids across the roundtrip.
Type-erasure whitelist in op.rs extended so `to_op::<LoopStart>()`
etc. work after extraction.
- `unroll_loops_in_llir` (graph.rs) clones the body `iters-1` times,
threads loop-carried state, routes per-iter LoopInput sources,
generates per-iter Output nodes from LoopOutput targets, and removes
all four marker types. Edge-id order is preserved so ops see their
inputs in the correct positions. Hooked into
`egglog_to_llir_from_root` so every extracted LLIR is auto-flat.
Open issue (next session):
- Runtime panics at `crates/luminal_cuda_lite/src/host/cublaslt/mod.rs`
with `buffers[&inputs[0]]` missing. Needs a targeted LLIR dump of
the panicking cublaslt's incoming edges to determine whether the
edge is resolving to a CudaGraphOp (host op with 0 output_bytes),
or whether edge-id sort order is off for a cloned-body node.
Workspace builds cleanly, loop-rolling unit tests pass. llama/qwen/
etc. panic during search-profile (no correct output produced).
Committing as a reversible milestone.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Scaffolding for the loop-region refactor. These ops let the auto-roll
prepass rewrite the HLIR in place instead of producing a separate
SubgraphDescriptor side-table; the entire compilation pipeline will
then work against one unified graph that simply contains loop markers.
- LoopStart / LoopEnd — IR-sorted, 1 IR input each, one pair per
loop-carried slot, keyed by `loop_id + slot_idx`. LoopStart owns
`iters`; LoopEnd inherits the loop via `loop_id`.
- LoopInput — OpKind-sorted with a variable-arity IList of
per-iteration source tensors. Body ops consume LoopInput's single
output; deploy-unroll later substitutes each iteration's specific
source.
- LoopOutput — OpKind-sorted, 1 IList input (body_val). The
per-iteration target output-node ids are host-side routing metadata
(`targets: Vec<usize>`) not passed through egglog; they survive the
egraph roundtrip via `loop_id + stream_id` rehydration.
Nothing wires these up yet — that lands in the follow-on prepass /
pipeline / runtime changes. Workspace still builds cleanly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Auto-loop-rolling is now always on. The `enable_auto_loop_rolling` flag
was mostly cosmetic — when the prepass found no candidate (or fell below
the savings threshold) the code already fell through to the single-graph
path, so the flag only skipped the prepass itself.
Deleted:
- `Graph::enable_auto_loop_rolling` field + `set_auto_loop_rolling` setter
- `auto_loop_rolling` on `BackendCompileArgs` and the `set_auto_loop_rolling`
call in `compile_backend`; Python binding stops passing it
- `Graph::grow_rolling_candidate` method (redundant wrapper over the
standalone fn)
- `build_grouped_egraphs` (unreachable after GraphBreak removal)
- `split_regionalized_llir_components`, `descriptor_order_key`,
`llir_order_key` (abandoned post-processing pipeline)
- `RollingRun::signature` field (written, never read)
- `integration_auto_loop_rolling_perf_report_native` test (A/B harness no
longer possible); correctness test now compares against a CPU reference
Net ~255 lines removed, zero behavior change. `cargo build --release`
clean, loop-rolling unit tests pass, llama smoke-tested (TPOT 32.6 ms vs.
pre-cleanup 31.9 ms — within noise).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
It was only ever called from the llama/qwen examples to eyeball which
fused chains survived extraction. Now that the fusion behavior is
covered by tests in luminal_cuda_lite::tests::fusion, the helper and
its two call sites are just noise.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Clippy's write_with_newline lint flagged the two write!() calls in
hlir_to_egglog that end with a trailing "\n". Switched to writeln! so
the newline is implicit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Compiles separate sqrt_k / recip_k plus a fused sqrt->recip kernel,
launches each 2000 times on a 1M-element input, measures with CUDA
events. Run with
cargo test -p luminal_cuda_lite -- --ignored bench_fused_vs_unfused_sqrt_recip --nocapture
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ops sequence is pure codegen metadata that egglog never reasons
about, so carrying it as an EList of (MNum tag) Expressions was an
abuse of EList (meant for shape/stride expressions). Switch to a plain
String field ("Sin,Sqrt,Exp2") -- String is already a primitive sort,
avoiding any new sort plumbing.
Side effects:
- Extend rules now use the builtin variadic `+` to concat strings, so
they are O(1) per firing and chain length is no longer capped.
- Drops MAX_FUSION_DEPTH and the 30 length-explicit extend rules in
favor of 5 (one per outer unary kind).
- UnaryFn gains name()/from_name() instead of tag-based encode/decode.
Verified llama still runs end-to-end (1m45s search, TTFT 826ms, TPOT
39ms) with 33x [Sqrt, Recip] + 5x [Exp2, Recip] fused kernels --
matches the previous pair-plus-length-explicit implementation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verified qwen runs end-to-end with fusion active (107x [Sqrt, Recip]
fused kernels survive extraction, one per RMSNorm across its 36
transformer layers).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds N-op fusion for pure-elementwise unary kernels by pattern-matching
each specific Fused[ops] length against a following unary, up to a
bounded depth. A recursive list-append helper was tried first and blew
up the egraph (every new cons retriggered the recursive rule), so the
design deliberately uses length-explicit rules - bounded rule count,
no saturation explosion.
Also adds CudaRuntime::print_kernel_summary() for quick inspection of
which fused op sequences survived extraction, and calls it from the
llama example. On Llama-3-8B that reports 33x [Sqrt, Recip] + 4x
[Exp2, Recip] fused kernels.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Main added stage_report / trace_stage_report helpers and refactored
run_egglog into run_egglog_with_report (returns an EgglogRunReport
alongside the egraph) with run_egglog as a thin wrapper. That collided
with this branch's OpTextParts / run_egglog_with split.
Resolution: take main's stage-report structure as-is, then re-layer
OpTextParts underneath so both APIs share a single body:
- run_egglog_with_report(ops, cleanup) builds OpTextParts once and
delegates to run_egglog_with_report_parts(&op_parts).
- run_egglog_with_report_parts(&op_parts) is the single body that
does early_egglog_with / full_egglog_with + stage_report emission.
- run_egglog(ops, cleanup) wraps run_egglog_with_report and drops
the report (unchanged public API).
- run_egglog_with(&op_parts) wraps run_egglog_with_report_parts and
drops the report — this is the Send-friendly entry point
Graph::build_grouped_egraphs' par_iter uses.
91/91 luminal lib tests still pass post-merge. Both cycles from this
branch (write! into hlir_to_egglog, rayon parallel per-group egglog)
still in place; main's new reporting is preserved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a KernelFusedElementwise LLIR op that collapses two back-to-back
pure-elementwise unary kernels (Sin/Sqrt/Exp2/Log2/Recip) into a single
CUDA kernel, eliminating one kernel launch and one intermediate buffer
when producer out-strides match consumer in-strides.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Main added MoE routing tests in test_hlir_ops that read integer and
boolean output tensors via CompiledGraph.get_output_i32/get_output_bool,
but the factory-capsule rewrite only exposed f32 outputs.
- DynBackend: add get_output_i32/get_output_bool with default panic
impls (backends opt in).
- NativeDynBackend: implement both using NativeData::i32/bool; factor
the Output-node lookup into an output_buffer helper.
- CudaLiteDynBackend: delegate to runtime.get_i32/get_bool.
- CompiledGraph: expose get_output_i32/get_output_bool to Python,
matching the pre-rewrite surface.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
build_grouped_egraphs runs one egglog saturation per unique subgraph
group, sequentially. On a real multi-layer transformer compile this
linearises the heaviest cost in the pipeline (~30 ms per group).
Each run_egglog call builds a fresh egglog::EGraph and shares no mutable
state with the others, so the groups are trivially data-parallel.
The trait object Arc<Box<dyn EgglogOp>> is !Send/!Sync, so the existing
API couldn't be used directly inside par_iter. Introduced OpTextParts
(pub struct with op_defs / cleanups / early_rewrites / full_rewrites
all materialised as String up front) and a new public entry point
`run_egglog_with(program, root, &op_parts)` which takes only Send &str
inputs. The parallel closure now captures only strings. Existing
`run_egglog` / `early_egglog` / `full_egglog` delegate to the `_with`
variants so their public API is unchanged.
Originally shipped as 26dcdad9 in the weekendspeed campaign (cycle 3).
Standalone measurement on its original parent commit:
compile/build_search_space/chunked_h128/2 49.14 ms -> 29.19 ms (-41%)
compile/build_search_space/chunked_h128/8 49.42 ms -> 29.24 ms (-41%)
compile/build_search_space/distinct_chunks_h128/2 77.74 ms -> 29.92 ms (-61%)
compile/build_search_space/distinct_chunks_h128/4 134.37 ms -> 33.68 ms (-75%)
Replayed here on main. 91/91 luminal lib tests pass. Single-chunk
paths stable since the single-chunk code path still uses the
existing run_egglog wrapper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
format!(...) allocates an intermediate String then out.push_str copies
it; write!(out, ...) streams formatting straight into the pre-sized
buffer. Pre-sizing out to topo_order.len() * 160 avoids early growth
reallocations.
Originally shipped as a23ccd5f in the weekendspeed campaign (cycle 2).
Standalone measurement on its original parent commit showed:
compile_fine/hlir_to_egglog/ew_small 11.83 us -> 11.15 us (-6%)
compile_fine/hlir_to_egglog/attn_32x64 42.60 us -> 40.57 us (-5%)
Replayed here on main as a standalone change. 91/91 luminal lib tests
pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>