luminal

mirror of https://git.teahaven.kr/Rust-related/luminal.git synced 2026-06-04 08:39:48 +09:00

Author	SHA1	Message	Date
Joe Fioti	61904fbc76	img	2026-04-26 18:38:30 -07:00
Joe Fioti	f461fca3da	Simplify loop-rolling diff: -130 lines, same functionality Net cleanups across the session's commits without changing behavior: * `src/hlir.rs`: - Each binary op's `rewrites()` now reuses `self.early_rewrites()` instead of rebuilding the unroll-rule list — eliminates the 4× repeated boilerplate and the 4× repeated "see Add::rewrites for why we register in both stages" comment. - Hoist that explanation into the `binary_op_unroll_rules` doc where it actually applies (one place, not four). - `binary_op_unroll_rule` collapses the dual `match state_pos` blocks into a single `order(state, per_iter)` closure used for both the body match pattern and each unrolled chain element. * `src/graph.rs` (`unroll_loops_in_llir`): - Drop the named `iteration_invariant_slots` set. The check `body_nodes.contains(&body_producer)` it cached is equivalent to `clone_map[i].get(&body_producer).is_some()`, so resolve_src and marker_post_sub both express the case inline as `clone_map.get(&bp).copied().unwrap_or(bp)`. The set's worth was naming the case; a single comment block at start_meta does that more cheaply. - Drop the orphan-LoopOutputSelect skip from `93fb02c4` — the gemma diagnostic showed the real failure was the iteration-invariant body_producer case only; the orphan-select case was speculative defensiveness for a scenario the rolling/extraction pipeline can't actually produce. - Drop the `collapse_loops_to_first_iter` informational comment block; collapse just works without special handling for invariant slots and didn't need the explanation. 
* `crates/luminal_cuda_lite/src/tests/transformer.rs`: - Collapse the three exploratory body=1 trips=3 tests (`test_three_chained_scalar_muls`, `test_three_chained_scalar_muls_with_downstream_consumer`, `test_three_chained_scalar_muls_with_initial_residual`) into one `test_rolled_chained_scalar_muls` that exercises the chain plus a residual back to its initial input — the strongest topology of the three (covers per-iter body cloning, post-loop wiring, and the residual edge to the loop-external initial value). Tests: cuda_lite 80/80, python CUDA 12 + 4 xfailed (test_llama3 subset), gemma example end-to-end. fmt + clippy clean. Diff vs loop_rolling base: 347 → 217 inserted lines (−130). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 21:47:34 +00:00
Joe Fioti	5f199e94c6	Refactor iteration-invariant state slots as a named first-class case The two prior commits (`16de9638`, `93fb02c4`) handled the gemma CI panic by swapping `clone_map[i-1][&body_producer]` for `clone_map[i-1].get(&body_producer).unwrap_or(body_producer)`. That suppresses the panic but reads like a defensive band-aid — the comment hand-waves about "extraction-shape variation" without naming the actual situation. Local repro on the gemma example (built locally, weights downloaded from HF) shows the case is real and documented: slot=0 body_producer NodeIndex(3040) NOT in body_nodes body_producer op: KernelConstant { value: 9.21034 } # ln(10000) slot=1 body_producer NodeIndex(5035) NOT in body_nodes initial = NodeIndex(5035) (same node) body_producer op: KernelConstant { value: 1.442695 } # log2(e) These are RoPE frequency factors: the body chain provably reduces to a constant via cuda_lite's kernel-level rewrites, and the genome's extraction picks the constant directly for LoopEnd's incoming eclass. The state really is iteration-invariant — every iter sees the same value. There's no LLIR corruption; the forward-walk `body_nodes` definition just doesn't cover this case because per-iter cloning isn't needed for it. Refactor: * Compute `iteration_invariant_slots: HashSet<LoopStart>` at the same time as `start_meta`, with the rule `body_producer ∉ body_nodes ⇒ invariant`. * `resolve_src` branches explicitly: invariant slot → `body_producer`, else standard per-iter clone lookup. * `marker_post_sub` branches the same way. * Drop the `collapse_loops_to_first_iter` backward-walk backfill the prior commit added — collapse doesn't have the panic site, and a Constant body_producer either has no incoming edges (so the body- iteration loop is a no-op for it) or the existing `marker_post_sub` insert already routes consumers to it correctly. 
Behavior is identical to the prior commits; the diff is purely about making the documented case discoverable in code rather than implicit in an `unwrap_or`. cuda_lite (82/82), python CUDA (223 + 4 xfailed), gemma example: all green. Adds a LessonsLearned entry. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 16:37:52 +00:00
Joe Fioti	93fb02c495	Skip orphan LoopOutputSelect when its LoopOutput is missing Companion defensive fix to `16de9638`. `output_body_producer` is keyed by stream_id and populated from `outputs` (LoopOutput nodes). The post-loop wiring then indexed `output_body_producer[&stream_id]` for every LoopOutputSelect, which panics with "no entry found for key" if extraction lands a LoopOutputSelect whose corresponding LoopOutput isn't in the LLIR (e.g. a genome that picked a non-LoopOutput representative for that stream's eclass). Skip the orphan select rather than panicking. The select node stays un-substituted, so the post-loop consumer's edge falls through to the select itself; the select gets removed with the other markers at the end of unroll. The consumer's edge will dangle, but that's a separate concern from the unroll-mechanism panic this prevents. Together with `16de9638`, this closes the two `[&key]` index sites in `unroll_loops_in_llir` that can land on a missing key when egglog extraction produces a structurally unusual LLIR. Both sites now gracefully fall through with a defensible semantic (use the body producer / select node directly), so the unroll mechanism never panics on extraction-shape variation. cuda_lite + python CUDA suites still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 15:28:13 +00:00
Joe Fioti	16de9638fc	Handle iteration-invariant body producers in loop unroll `unroll_loops_in_llir` was panicking on `clone_map[i-1][&body_producer]` with "no entry found for key" on the gemma Modal CI job. The line fired when extraction landed a `body_producer` (LoopEnd's incoming source) that isn't in `body_nodes` — a forward-walk-from-input-markers set that misses ops whose only ancestors are non-marker (a constant, external input, or an op whose chain got congruence-merged off the marker chain by rules like `LoopInputStatic inline`). Semantically that body op is iteration-invariant: every iter would compute the same value, so the loop's state never changes. The per-iter clone path needed a "no clone, share across iters" fallback rather than indexing the clone map. Fix: - In `unroll_loops_in_llir::resolve_src`, when the LoopStart-resolved `body_producer` isn't in `body_nodes`, return `body_producer` itself for iter > 0 (skip the clone_map lookup). - Mirror the same `unwrap_or(body_producer)` fallback in `marker_post_sub` for LoopEnd / LoopOutputSelect post-loop wiring. - In `collapse_loops_to_first_iter`, add a backward-walk-from-end-markers pass that backfills body_nodes with any non-marker non-Output ancestor of an end-marker. Collapse doesn't have a clone_map (no panic site), but it does iterate body_nodes to rewire incoming edges before deleting markers — without backfill, an iteration-invariant body_producer would keep dangling edges to removed markers. Local cuda_lite + python CUDA suites pass. The extraction shape that triggers this isn't reachable from the local fuzzers' search depth, so this lands as a defensive fix to unblock the gemma Modal job; once that job goes green we'll know whether the fallback covers all cases or whether more diagnostic info is needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 07:45:05 +00:00
Joe Fioti	f08d24e73f	Register loop unroll-union rules in full egglog stage too The narrow per-binary-op unroll-union rules (introduced in `aba96275`) were only registered in `EgglogOp::early_rewrites()`, which the egglog driver feeds into the early-stage program only. The full-stage program is built from `EgglogOp::rewrites()` exclusively. So the unrolled chain materialised in the early egraph, the early→full extract picked the (cheaper) rolled form, the unrolled chain was lost, and any full-stage kernel rewrite (e.g. `KernelExp`'s `direct-exp-fusion`, which rewrites `Mul(?x, log2_e) → Exp2(...)` into a single native `expf` kernel) had nothing to match against. Symptom: python `test_llama_transformer_block` (CUDA backend) was off by ~1e-2 from the PyTorch reference. The PyTorch `pow(2)` decomposition emits a chain `Log2(x) * 0.693 * 2.0 * 1.442 → Exp2`, where 1.442 is log2(e). With rolling on, those three scalar muls fold into one body, and `direct-exp-fusion` couldn't fuse the trailing `Mul(?, log2_e) + Exp2` into the more accurate `KernelExp` (native expf). The truncated log2(e) constant accumulates rounding through the multiply chain, the diff shows up only in rows that exercise the full attention path (row 0 matched exactly, rows 1–3 drifted). Fix: register `binary_op_unroll_rules` in BOTH `early_rewrites()` (for GLUMoE-style early-stage fusion, which still depends on this) AND `rewrites()` (for full-stage kernel-level fusions like `direct-exp-fusion`). All four binary HLIR ops (Add/Mul/Mod/LessThan) get the same treatment. Also adds three cuda_lite repro tests covering body=1, trips=3 chains (plain, with residual, with downstream consumer) — all pass and would have caught any regression in the basic rolling+unroll mechanics. Tests: - python CUDA: 223 passed, 4 xfailed (was 222 passed, 1 failed) - cuda_lite: 82 passed, 0 failed - workspace tests / fmt / clippy: clean Adds a LessonsLearned entry per crates/luminal_python/CLAUDE.md guidance. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 06:15:29 +00:00
Joe Fioti	aba9627563	Union small rolled loops with their unrolled form in egglog The auto-roll prepass folds tiny scalar-mul chains (body=1, trips=2) inside e.g. the gemma_gelu sigmoid expansion into a loop body. The existing egglog fusion rules (GLUMoE GemmaGELU, etc.) pattern-match a specific flat chain of binary ops and can't see through the LoopStart/LoopInput/LoopEnd markers, so rolling silently disables the fusion and the extracted graph is strictly worse than not rolling at all. Add narrow per-binary-op early rewrites that union a rolled single-op-body loop (trips ≤ 4, state at body input position 0 or 1) with its fully-unrolled equivalent in the same eclass. The cost-based extractor then picks whichever representation downstream patterns prefer — the unrolled form when fusions match through the flat chain, the rolled form when nothing benefits. No threshold or special-case in the rolling cost model; the egraph stays the source of truth. Fixes test_glumoe_gemma_gelu_matches_unfused_output (78 → 79 passing in cuda_lite). All four binary HLIR ops (Add, Mul, Mod, LessThan) opt in via early_rewrites(). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 04:26:24 +00:00
Joe Fioti	7d68b62aa8	Fix CUDA crash in fuzz_genomes after loop rolling prepass The auto-roll prepass inserts LoopStart/LoopEnd/LoopInput/LoopOutput marker ops into the HLIR. These markers survive through egglog rewriting into LLIR and must be collapsed by `unroll_loops_in_llir` before runtime execution — the markers are a search-time scaffold, not executable ops. `Graph::search` did this correctly on its chosen best genome, but `fuzz_genomes` (test utility that exercises alternative extracted genomes) called `egglog_to_llir` directly without the unroll. The CUDA runtime then tried to execute genomes containing raw loop markers, hitting CUDA_ERROR_ILLEGAL_ADDRESS. The crash cascaded across ~20 downstream tests via shared CUDA context state. Also lower the rolling occurrence threshold from 3 back to 2 — the 3-occurrence floor that previously masked this bug was a band-aid; the real fix is the missing unroll call in the test utility. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 03:45:31 +00:00
Joe Fioti	13c870de86	fmt and clippy	2026-04-26 02:42:51 +00:00
Joe Fioti	f8b742d718	fixed conflicts	2026-04-26 02:30:32 +00:00
Joe Fioti	3555d169bd	generalized loop rolling	2026-04-26 02:19:05 +00:00
Joe Fioti	be74153c12	loop rolling improvements	2026-04-26 01:36:01 +00:00
Joe Fioti	75535c93f0	Print region partition (inside vs outside) in rolling prepass output Rolled prepass now also reports how many post-roll HLIR nodes live inside the rolled region (body + markers) versus outside it (embedding, weights, post-loop / lm-head): Rolled region partition: 126 inside (83 body + 43 markers) / 3695 outside Examples: llama: 126 inside (83 body + 43 markers) / 3695 outside (3821 total) qwen3_moe: 194 inside (130 body + 64 markers) / 6830 outside (7024 total)	2026-04-26 00:20:30 +00:00
Joe Fioti	84f13cae00	Print before/after HLIR node counts in rolling prepass output Rolled lines now show the explicit reduction: Rolled rolled HLIR: 6268 -> 3821 nodes (43 loop ops inserted, 2490 duplicate body nodes deleted) Examples: llama: 6268 -> 3821 nodes (~39% reduction) qwen3_moe: 12940 -> 7024 nodes (~46% reduction)	2026-04-26 00:09:10 +00:00
Joe Fioti	703c2d9ea4	Require trips >= 3 for loop-rolling prepass Proptest-generated test cases (test_slice_pad, test_stack, test_cumulative, test_layer_norm, test_std, test_var, test_top_k_filter) were failing after the rolling refactor because the prepass was matching body×2 patterns in tiny HLIRs whose round trip through egglog + unroll isn't correctness-preserving at that scale. All seven tests previously passed on the pre-rolling baseline. The rolling search now skips candidates with fewer than three occurrences. Real models roll 20–50 repetitions of a transformer block so this threshold doesn't affect any production path: - llama: body=83 trips=31, still rolls, TTFT 475 ms, TPOT 22 ms - qwen3_moe: body=130 trips=47, still rolls, TTFT 252 ms, TPOT 41 ms Lib tests: 93 pass, 0 fail (up from 86 pass, 7 fail).	2026-04-24 04:19:20 +00:00
Matthew Gunton	44324f1c2d	Add Binary→Unary pair-fuse rules emitting FusionStart/End markers Egglog rules that wrap `unary(binary(a, b))` chains in marker boundaries for every (Add|Mul) × (Sin|Sqrt|Exp|Exp2|Log2|Recip) combination with matching strides. Flipped test_single_binary_fuses to assert the singleton does NOT fuse — egglog never seeds from a solo op. Skipped the tempting `FusionStart(FusionStart(x)) ≡ FusionStart(x)` idempotence rule: unioning marker layers creates eclass self-loops with the pair-fuse union, triggering extraction cycles. Without it, re-firing cascades up to the run-schedule bound of 10 — each layer in a fresh eclass, all semantically correct as identity passthroughs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 00:02:46 +00:00
Matthew Gunton	f6845011d8	Scaffold FusionStart/FusionEnd marker ops Identity pass-through kernels for the binary-inclusive fusion design, registered in the other_ops Ops tuple. No egglog rules emit them yet (rules come in follow-up commits); this just makes the marker types exist so a later compilation pass can collapse bracketed regions into one kernel. Existing unary fusion tests remain green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 23:44:19 +00:00
Matthew Gunton	6e7ee5581d	Add binary-fusion test suite (FusionStart/FusionEnd markers) Specs the marker-based binary elementwise fusion design: structural, negative, numerical-parity, and marker-invariant tests — including the diamond-DAG case where one external input is reused inside the region. Tests fail until FusionStart/FusionEnd LLIR ops + egglog rules land. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 23:36:29 +00:00
Joe Fioti	2e3158c48e	Delete the regionalized search pipeline (~2100 LOC) After the loop-rolling refactor, `auto_region_plan` was never set to `Some` anywhere in the codebase, so `default_region_descriptors()` always returned a single-region vec and the multi-subgraph branch in `build_search_space` was cold. This commit deletes the entire dead path and the state fields it gated. Removed from src/graph.rs: - `AutoRegionPlan`, `SingleRegionalizedEGraphPlan` structs - Graph fields: `auto_rolled_regions`, `auto_region_plan`, `last_regional_llir`, `single_regional_egraph` - Methods: `auto_rolled_region_groups`, `build_single_regionalized_egraph`, `search_single_regionalized_deduped`, `search_single_regionalized`, `regionalized_hlir_debug_graph`, `dump_regionalized_hlir_before_search`, `missing_graph_outputs`, `debug_regional_output_coverage`, `regional_llir` accessor - Zero-caller helpers: `regionalized_hlir_node_count`, `full_hlir_op_count`, `regionalized_hlir_op_count`, `build_virtual_loop_region_subgraphs`, `infer_input_shape_for_port`, `infer_node_output_dtype`, `build_region_remaps`, `remap_llir_io_nodes`, `build_regionalized_egglog_program`, `deduped_representative_descriptors` - Dead branches gated on `auto_rolled_regions` in both profile sites in `search_single` - `RollingCandidate.signature` (never read) - Tests that exercised the dead path: `test_build_region_remaps_and_remap_io`, `test_stitch_keeps_real_output_when_boundary_duplicates_id`, `test_regionalized_hlir_debug_graph_collapses_repeated_regions`, and the stale assertions in `test_auto_roll_loops_prepass_creates_regions_for_chain_recurrence` Removed from src/egglog_utils/mod.rs: - `hlir_subgraph_to_egglog` (only caller was `auto_rolled_region_groups`) - `run_egglog_multi_roots` (only caller was `build_single_regionalized_egraph`) - `stitch_llir_graphs` (only caller was `RegionalLLIR::unroll`) `RegionalLLIR::unroll` simplified to a direct clone — with exactly one region per search, there is nothing to 
stitch. Net diff: +91 / -2254 lines. Verified correctness with llama, qwen3_moe, gemma4_moe end-to-end. Lib tests: 86 pass, 7 fail (all pre-existing — the refactor actually eliminated two of the previous 9 failures by removing stale regional test fixtures).	2026-04-23 23:26:25 +00:00
Joe Fioti	8af22776aa	Introduce LoopInputStatic + identical_inputs egglog rules Replaces the structural-hash dedup hack in the rolling prepass with a principled three-way unification in egglog: (Op (LoopInput id stream dt) (ICons v0 (ICons v1 ... (INil)))) ≡ (Op (LoopInputStatic id stream dt) (ICons x (INil))) [when all vi = x] ≡ x [inlining] All three representations live in one eclass, so genetic-search extraction can pick any form (distinct LoopInput per iter, static boundary wrapper, or inlined shared value). The inlined case is what lets downstream fusion rules (e.g. the MoE GLUMoE chain) pattern-match on the raw op kind at boundary positions — which was the original reason MoE was regressing under the rolled pipeline. New pieces: - `LoopInputStatic` HLIR op: a boundary-crossing marker with a single-element IList. Preserves the invariant that body-entering edges go through a marker, unlike the old "skip LoopInput" workaround. - `identical_inputs` egglog relation + recursive saturation rules, registered in the `expr` ruleset so the schedule's `saturate expr` step propagates the predicate through N-element ILists. - `LoopInput -> LoopInputStatic` and `LoopInputStatic -> ?x` union rules. - `unroll_loops_in_llir` now handles LoopInputStatic nodes: during unroll, every iter's body clone edges straight to the single shared source (via `resolve_src`'s `static_source` map). The boundary invariant "every edge into the body passes through a LoopStart / LoopInput / LoopInputStatic marker" now holds in the HLIR after the prepass. Previously the prepass silently emitted unmarked direct edges whenever per-iter sources happened to be NodeIndex-equal. 
Verified: - qwen3_moe: correct, TTFT 252 ms, TPOT 41 ms - gemma4_moe: correct, TTFT 435 ms, TPOT 64 ms - llama: correct, TTFT 491 ms, TPOT 23 ms - qwen: correct, TTFT 267 ms, TPOT 23 ms - gemma: correct, TTFT 284 ms, TPOT 23 ms - paged_llama: correct, all 4 phases run end-to-end Rule-firing stats in qwen3_moe's early stage: 1527 identical_inputs ind 94 LoopInputStatic inline 94 LoopInput to LoopInputStatic 31 identical_inputs base	2026-04-23 21:46:34 +00:00
Joe Fioti	cd8c01f620	Fix MoE regression: dedupe structurally-identical per-iter boundary inputs When rolling wraps per-iter boundary inputs in LoopInput, the HLIR node at that position becomes `(Op (LoopInput ...) (ICons ...))` instead of the original op. Downstream egglog rewrite rules that pattern-match on specific op kinds (e.g. the GLUMoE fusion rule, which requires `(Op (Iota (MIter) ?range) (INil))` at `?gu_iota_within`) then fail to match — and MoE falls back to the raw op chain, which was never exercised as a standalone path and produces wrong output. The fix: before wrapping a boundary input position in LoopInput, check whether all N per-iter sources are STRUCTURALLY identical (e.g., N separate Iota nodes with the same expression across N layers). If so, skip creating the LoopInput — iter-0's source stays in place, shared across all unrolled iters via the `resolve_src` fall-through. Rolling already had a NodeIndex-equality check, but iota/constant nodes are usually separate NodeIndex per layer even when semantically identical; this extends the equality check to structural hashes that recursively include the op's `to_egglog` rendering and its sources. Results at HEAD with this fix: - qwen3_moe: "The capital of France is Paris. The capital of Germany is Berlin. The capital of Italy is Rome. ..." (correct), TTFT 279 ms, TPOT 46 ms (vs 5694/1119 garbage before). - llama/qwen/gemma/paged_llama: still correct, perf unchanged. - gemma4_moe: fusion now fires but output is still wrong — needs separate follow-up (the LUMINAL_NO_ROLL=1 escape still works for it).	2026-04-23 18:18:37 +00:00
Joe Fioti	461b746937	Add LUMINAL_NO_ROLL env-var escape to bypass loop rolling prepass MoE models (qwen3_moe, gemma4_moe) regress under the new HLIR-rolled / LLIR-unrolled pipeline: generated output is garbage and TPOT blows up ~15x. Llama/qwen/gemma work correctly. Root cause is still unknown — under investigation. The env var gives a temporary bypass so MoE examples can still produce correct output.	2026-04-23 07:11:48 +00:00
Joe Fioti	38e467aa6c	Fix LoopOutput NodeIndex collision with freed duplicate body slots In auto_roll_loops_prepass, after iter 1..N Output HLIR nodes are removed (one per iter-past-first output slot), StableGraph frees their NodeIndex slots. A subsequent LoopOutput added for the next output slot can be assigned one of those freed NodeIndex slots. Later, when removing duplicate body nodes, the collided NodeIndex (which had previously referred to a removed Output HLIR and is still in duplicate_body_nodes) causes the new LoopOutput to be deleted instead — losing the targets needed for LLIR unroll, which then emitted only one Output in place of N. Fix: (1) defer iter 1..N Output removals until after all LoopOutputs are created, (2) track added_loop_ops and skip them when deleting duplicate body nodes. With this, llama/qwen/gemma produce correct output end-to-end via the new HLIR-rolled → LLIR-unrolled path.	2026-04-23 06:14:28 +00:00
Joe Fioti	7429ac163b	WIP: HLIR loop mutation + LLIR unroll (runtime-exec broken) Extends the loop-rolling pipeline from a SubgraphDescriptor side-table into an in-place HLIR rewrite with loop markers, plus a post-egglog LLIR deploy-unroll pass. Compiles and extracts correctly; runtime execution panics with missing-buffer on a cublaslt input for reasons that still need inspection of the final LLIR graph. What works: - Prepass detects the repeating body and mutates `self.graph` in place: LoopStart/LoopEnd per loop-carried state slot, LoopInput per non- state boundary position (only when per-iter sources differ), LoopOutput per non-state body output that is wrapped in an Output HLIR node (handles both "output_nodes[q] is the Output itself" and "Output is a consumer" shapes). N-1 duplicate body nodes are deleted. For llama: 1 LoopStart / 1 LoopEnd / 39 LoopInputs / 1 LoopOutput, 2490 body duplicates removed. - HLIR ops (LoopStart/LoopEnd/LoopInput/LoopOutput) carry through egglog and extract back into LLIR. `targets_csv` String field on LoopOutput serializes per-iter output-node ids across the roundtrip. Type-erasure whitelist in op.rs extended so `to_op::<LoopStart>()` etc. work after extraction. - `unroll_loops_in_llir` (graph.rs) clones the body `iters-1` times, threads loop-carried state, routes per-iter LoopInput sources, generates per-iter Output nodes from LoopOutput targets, and removes all four marker types. Edge-id order is preserved so ops see their inputs in the correct positions. Hooked into `egglog_to_llir_from_root` so every extracted LLIR is auto-flat. Open issue (next session): - Runtime panics at `crates/luminal_cuda_lite/src/host/cublaslt/mod.rs` with `buffers[&inputs[0]]` missing. Needs a targeted LLIR dump of the panicking cublaslt's incoming edges to determine whether the edge is resolving to a CudaGraphOp (host op with 0 output_bytes), or whether edge-id sort order is off for a cloned-body node. 
Workspace builds cleanly, loop-rolling unit tests pass. llama/qwen/ etc. panic during search-profile (no correct output produced). Committing as a reversible milestone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 03:36:39 +00:00
Joe Fioti	07c151dd70	Add LoopStart/LoopEnd/LoopInput/LoopOutput HLIR ops Scaffolding for the loop-region refactor. These ops let the auto-roll prepass rewrite the HLIR in place instead of producing a separate SubgraphDescriptor side-table; the entire compilation pipeline will then work against one unified graph that simply contains loop markers. - LoopStart / LoopEnd — IR-sorted, 1 IR input each, one pair per loop-carried slot, keyed by `loop_id + slot_idx`. LoopStart owns `iters`; LoopEnd inherits the loop via `loop_id`. - LoopInput — OpKind-sorted with a variable-arity IList of per-iteration source tensors. Body ops consume LoopInput's single output; deploy-unroll later substitutes each iteration's specific source. - LoopOutput — OpKind-sorted, 1 IList input (body_val). The per-iteration target output-node ids are host-side routing metadata (`targets: Vec<usize>`) not passed through egglog; they survive the egraph roundtrip via `loop_id + stream_id` rehydration. Nothing wires these up yet — that lands in the follow-on prepass / pipeline / runtime changes. Workspace still builds cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 22:22:55 +00:00
Joe Fioti	c0f7f1f054	Remove non-rolling flag and dead rolling helpers Auto-loop-rolling is now always on. The `enable_auto_loop_rolling` flag was mostly cosmetic — when the prepass found no candidate (or fell below the savings threshold) the code already fell through to the single-graph path, so the flag only skipped the prepass itself. Deleted: - `Graph::enable_auto_loop_rolling` field + `set_auto_loop_rolling` setter - `auto_loop_rolling` on `BackendCompileArgs` and the `set_auto_loop_rolling` call in `compile_backend`; Python binding stops passing it - `Graph::grow_rolling_candidate` method (redundant wrapper over the standalone fn) - `build_grouped_egraphs` (unreachable after GraphBreak removal) - `split_regionalized_llir_components`, `descriptor_order_key`, `llir_order_key` (abandoned post-processing pipeline) - `RollingRun::signature` field (written, never read) - `integration_auto_loop_rolling_perf_report_native` test (A/B harness no longer possible); correctness test now compares against a CPU reference Net ~255 lines removed, zero behavior change. `cargo build --release` clean, loop-rolling unit tests pass, llama smoke-tested (TPOT 32.6 ms vs. pre-cleanup 31.9 ms — within noise). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 20:48:00 +00:00
Joe Fioti	df96fe5110	loop rolling fixed for all examples	2026-04-22 20:22:21 +00:00
Joe Fioti	18a550dd15	loop rolling working with llama	2026-04-22 16:27:31 +00:00
Joe Fioti	254680001d	loop rolling working with llama	2026-04-22 05:21:25 +00:00
Joe Fioti	2920011897	Implement regional loop rolling prepass and remove GraphBreak path	2026-04-21 15:30:56 -07:00
Joe Fioti	d879376697	Merge pull request #274 from luminal-ai/elementwise-fusion Elementwise fusion for adjacent unary kernels in cuda_lite	2026-04-21 14:37:39 -07:00
Joe Fioti	2be30c18cd	Merge pull request #275 from luminal-ai/worktree-weekendspeed Worktree weekendspeed	2026-04-21 14:36:54 -07:00
Matthew Gunton	48f921d2a1	Remove print_kernel_summary debug helper It was only ever called from the llama/qwen examples to eyeball which fused chains survived extraction. Now that the fusion behavior is covered by tests in luminal_cuda_lite::tests::fusion, the helper and its two call sites are just noise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 20:50:57 +00:00
Tucker Morgan	f55e7e0589	fix clippy: use writeln! in hlir_to_egglog buffer writes Clippy's write_with_newline lint flagged the two write!() calls in hlir_to_egglog that end with a trailing "\n". Switched to writeln! so the newline is implicit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 20:35:05 +00:00
Matthew Gunton	db2027d345	Add ignored microbench for sqrt->recip fusion Compiles separate sqrt_k / recip_k plus a fused sqrt->recip kernel, launches each 2000 times on a 1M-element input, measures with CUDA events. Run with cargo test -p luminal_cuda_lite -- --ignored bench_fused_vs_unfused_sqrt_recip --nocapture Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 19:02:50 +00:00
Matthew Gunton	9a5032bfc9	Use egglog String for the fused ops list The ops sequence is pure codegen metadata that egglog never reasons about, so carrying it as an EList of (MNum tag) Expressions was an abuse of EList (meant for shape/stride expressions). Switch to a plain String field ("Sin,Sqrt,Exp2") -- String is already a primitive sort, avoiding any new sort plumbing. Side effects: - Extend rules now use the builtin variadic `+` to concat strings, so they are O(1) per firing and chain length is no longer capped. - Drops MAX_FUSION_DEPTH and the 30 length-explicit extend rules in favor of 5 (one per outer unary kind). - UnaryFn gains name()/from_name() instead of tag-based encode/decode. Verified llama still runs end-to-end (1m45s search, TTFT 826ms, TPOT 39ms) with 33x [Sqrt, Recip] + 5x [Exp2, Recip] fused kernels -- matches the previous pair-plus-length-explicit implementation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 18:41:02 +00:00
Matthew Gunton	c665b01c4e	cargo fmt and kernel summary in qwen example Verified qwen runs end-to-end with fusion active (107x [Sqrt, Recip] fused kernels survive extraction, one per RMSNorm across its 36 transformer layers). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 18:14:07 +00:00
Matthew Gunton	883508e682	Extend elementwise fusion to chains up to 8 unaries Adds N-op fusion for pure-elementwise unary kernels by pattern-matching each specific Fused[ops] length against a following unary, up to a bounded depth. A recursive list-append helper was tried first and blew up the egraph (every new cons retriggered the recursive rule), so the design deliberately uses length-explicit rules - bounded rule count, no saturation explosion. Also adds CudaRuntime::print_kernel_summary() for quick inspection of which fused op sequences survived extraction, and calls it from the llama example. On Llama-3-8B that reports 33x [Sqrt, Recip] + 4x [Exp2, Recip] fused kernels. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 17:56:44 +00:00
Tucker Morgan	080b99b69e	Merge branch 'main' into perf/compile-write-rayon Main added stage_report / trace_stage_report helpers and refactored run_egglog into run_egglog_with_report (returns an EgglogRunReport alongside the egraph) with run_egglog as a thin wrapper. That collided with this branch's OpTextParts / run_egglog_with split. Resolution: take main's stage-report structure as-is, then re-layer OpTextParts underneath so both APIs share a single body: - run_egglog_with_report(ops, cleanup) builds OpTextParts once and delegates to run_egglog_with_report_parts(&op_parts). - run_egglog_with_report_parts(&op_parts) is the single body that does early_egglog_with / full_egglog_with + stage_report emission. - run_egglog(ops, cleanup) wraps run_egglog_with_report and drops the report (unchanged public API). - run_egglog_with(&op_parts) wraps run_egglog_with_report_parts and drops the report — this is the Send-friendly entry point Graph::build_grouped_egraphs' par_iter uses. 91/91 luminal lib tests still pass post-merge. Both cycles from this branch (write! into hlir_to_egglog, rayon parallel per-group egglog) still in place; main's new reporting is preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 17:15:45 +00:00
Matthew Gunton	0bd19289ea	Add elementwise fusion for adjacent unary kernels in cuda_lite Adds a KernelFusedElementwise LLIR op that collapses two back-to-back pure-elementwise unary kernels (Sin/Sqrt/Exp2/Log2/Recip) into a single CUDA kernel, eliminating one kernel launch and one intermediate buffer when producer out-strides match consumer in-strides. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 17:00:41 +00:00
Joe Fioti	a3b7f6ecc1	add profile limiting	2026-04-21 05:13:14 +00:00
Joe Fioti	438ae460bf	Merge pull request #271 from luminal-ai/dyn-backend-plugin-system Add DynBackend trait and plugin system for external backends	2026-04-20 14:55:24 -07:00
Tucker Morgan	da440fdef0	Add get_output_i32/bool to DynBackend + CompiledGraph Main added MoE routing tests in test_hlir_ops that read integer and boolean output tensors via CompiledGraph.get_output_i32/get_output_bool, but the factory-capsule rewrite only exposed f32 outputs. - DynBackend: add get_output_i32/get_output_bool with default panic impls (backends opt in). - NativeDynBackend: implement both using NativeData::i32/bool; factor the Output-node lookup into an output_buffer helper. - CudaLiteDynBackend: delegate to runtime.get_i32/get_bool. - CompiledGraph: expose get_output_i32/get_output_bool to Python, matching the pre-rewrite surface. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 21:25:48 +00:00
Tucker Morgan	586365be4d	perf: parallelize per-group egglog compile with rayon build_grouped_egraphs runs one egglog saturation per unique subgraph group, sequentially. On a real multi-layer transformer compile this linearises the heaviest cost in the pipeline (~30 ms per group). Each run_egglog call builds a fresh egglog::EGraph and shares no mutable state with the others, so the groups are trivially data-parallel. The trait object Arc<Box<dyn EgglogOp>> is !Send/!Sync, so the existing API couldn't be used directly inside par_iter. Introduced OpTextParts (pub struct with op_defs / cleanups / early_rewrites / full_rewrites all materialised as String up front) and a new public entry point `run_egglog_with(program, root, &op_parts)` which takes only Send &str inputs. The parallel closure now captures only strings. Existing `run_egglog` / `early_egglog` / `full_egglog` delegate to the `_with` variants so their public API is unchanged. Originally shipped as 26dcdad9 in the weekendspeed campaign (cycle 3). Standalone measurement on its original parent commit: compile/build_search_space/chunked_h128/2 49.14 ms -> 29.19 ms (-41%) compile/build_search_space/chunked_h128/8 49.42 ms -> 29.24 ms (-41%) compile/build_search_space/distinct_chunks_h128/2 77.74 ms -> 29.92 ms (-61%) compile/build_search_space/distinct_chunks_h128/4 134.37 ms -> 33.68 ms (-75%) Replayed here on main. 91/91 luminal lib tests pass. Single-chunk paths stable since the single-chunk code path still uses the existing run_egglog wrapper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 21:08:10 +00:00
Joe Fioti	3c962a9df8	Merge branch 'main' into dyn-backend-plugin-system	2026-04-20 14:06:10 -07:00
tucker-luminal	1a460bac96	Merge pull request #265 from alityb/feat/luminal-python-moe-routing-support build MoE routing support in luminal_python	2026-04-20 14:05:05 -07:00
tucker-luminal	ce06a901cc	Update mod.rs	2026-04-20 13:28:26 -07:00
Tucker Morgan	c97288cdae	perf: write! directly into hlir_to_egglog output buffer format!(...) allocates an intermediate String then out.push_str copies it; write!(out, ...) streams formatting straight into the pre-sized buffer. Pre-sizing out to topo_order.len() * 160 avoids early growth reallocations. Originally shipped as a23ccd5f in the weekendspeed campaign (cycle 2). Standalone measurement on its original parent commit showed: compile_fine/hlir_to_egglog/ew_small 11.83 us -> 11.15 us (-6%) compile_fine/hlir_to_egglog/attn_32x64 42.60 us -> 40.57 us (-5%) Replayed here on main as a standalone change. 91/91 luminal lib tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:25:10 +00:00
tucker-luminal	d66b3f2643	Merge branch 'main' into feat/luminal-python-moe-routing-support	2026-04-20 13:16:43 -07:00
Joe Fioti	66b0807462	Merge pull request #272 from luminal-ai/gemma Gemma	2026-04-19 09:02:30 -07:00
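
Commit 9a5032bfc9 above describes storing the fused ops list as a plain comma-separated String ("Sin,Sqrt,Exp2") and giving UnaryFn name()/from_name() helpers. The log does not show the real signatures, so the following is only a minimal sketch of that encoding. The enum variants are taken from the list in commit 0bd19289ea; the method signatures and the encode_ops/decode_ops helpers are assumptions made for illustration and are not luminal's API.

```rust
/// Illustrative stand-in for the unary kinds named in the log
/// (Sin/Sqrt/Exp2/Log2/Recip). Not the actual luminal definition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum UnaryFn {
    Sin,
    Sqrt,
    Exp2,
    Log2,
    Recip,
}

impl UnaryFn {
    /// Name used inside the comma-separated ops string (assumed signature).
    pub fn name(self) -> &'static str {
        match self {
            UnaryFn::Sin => "Sin",
            UnaryFn::Sqrt => "Sqrt",
            UnaryFn::Exp2 => "Exp2",
            UnaryFn::Log2 => "Log2",
            UnaryFn::Recip => "Recip",
        }
    }

    /// Inverse of `name`; returns None for an unknown name (assumed signature).
    pub fn from_name(s: &str) -> Option<Self> {
        match s {
            "Sin" => Some(UnaryFn::Sin),
            "Sqrt" => Some(UnaryFn::Sqrt),
            "Exp2" => Some(UnaryFn::Exp2),
            "Log2" => Some(UnaryFn::Log2),
            "Recip" => Some(UnaryFn::Recip),
            _ => None,
        }
    }
}

/// Hypothetical helper: join op names into one string such as "Sqrt,Recip".
pub fn encode_ops(ops: &[UnaryFn]) -> String {
    ops.iter().map(|op| op.name()).collect::<Vec<_>>().join(",")
}

/// Hypothetical helper: parse the string back, failing on any unknown name.
pub fn decode_ops(s: &str) -> Option<Vec<UnaryFn>> {
    if s.is_empty() {
        // An empty string encodes an empty chain.
        return Some(Vec::new());
    }
    s.split(',').map(UnaryFn::from_name).collect()
}
```

With these helpers, encode_ops(&[UnaryFn::Sqrt, UnaryFn::Recip]) yields "Sqrt,Recip", and decode_ops on that string returns the same two variants in order. Extending a chain is then a string append, which is in the spirit of what the commit says about using egglog's builtin string concatenation.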

2830 Commits
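
Commit c97288cdae above replaces format!(...) followed by out.push_str with write!(out, ...) into a pre-sized buffer. A small sketch of the two patterns follows. The Node type and the emitted text format are hypothetical and only stand in for whatever hlir_to_egglog actually emits; the 160-bytes-per-node capacity estimate is the figure quoted in the commit message.

```rust
use std::fmt::Write;

/// Hypothetical stand-in for a graph node; not a luminal type.
pub struct Node {
    pub id: usize,
    pub op: &'static str,
}

/// Before: format! allocates a temporary String per node,
/// and push_str then copies it into the output.
pub fn emit_with_format(nodes: &[Node]) -> String {
    let mut out = String::new();
    for n in nodes {
        out.push_str(&format!("(let t{} ({}))\n", n.id, n.op));
    }
    out
}

/// After: formatting streams straight into one pre-sized buffer.
pub fn emit_with_write(nodes: &[Node]) -> String {
    // Pre-size the buffer so early pushes do not trigger reallocation.
    let mut out = String::with_capacity(nodes.len() * 160);
    for n in nodes {
        // fmt::Write for String never returns an error, so unwrap is safe here.
        writeln!(out, "(let t{} ({}))", n.id, n.op).unwrap();
    }
    out
}
```

Both functions produce identical text; the second only avoids the per-node temporary allocation and copy. Any speedup depends on the workload, so the percentages in the commit message should be read as measurements of that codebase rather than of this sketch.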