luminal

mirror of https://git.teahaven.kr/Rust-related/luminal.git synced 2026-06-04 08:39:48 +09:00

Author	SHA1	Message	Date
Joe Fioti	75e4e6be0a	Simplify example mains and trim CUDA profiling output (#339) * Simplify example mains and trim CUDA profiling output * Simplify model examples and adjust CUDA profiling output * Simplify example model setup and CUDA profiling output	2026-05-29 23:37:13 -04:00
tucker-luminal	4cd47ffa45	luminal_python: dynamic-shape gather/scatter in the PT2 translator (#334) `gather_elements` / `scatter_elements` / `scatter_nd` in luminal-core require concrete shape dims, so `torch.compile(model, backend=luminal_backend)` crashed the moment Dynamo handed us a SymInt for batch or seq_len. The translator now lowers all three through Expression-typed shape arithmetic and only calls luminal-core primitives that already accept Expressions, with a small `dim_arith` helper that keeps every shape product in canonical commutative order so different code paths don't build syntactically different versions of the same logical dim. Verified end-to-end on Qwen3-30B-A3B across varying prompt lengths. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 16:43:38 -05:00
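A minimal sketch of the canonical-ordering idea behind the `dim_arith` helper mentioned in the commit above. This is not the translator's code: `canonical_product`, the string symbols, and the `(constant, symbols)` key layout are all assumptions made for illustration. It only shows why sorting the factors of a product makes `batch * seq_len` and `seq_len * batch` compare equal.

```python
from functools import reduce


def canonical_product(*factors):
    """Hypothetical helper: build a canonical key for a product of dims.

    Concrete dims (ints) are folded into one constant; symbolic dims
    (plain strings here, standing in for SymInt-like symbols) are sorted,
    so the key does not depend on the order the factors were supplied in.
    """
    ints = [f for f in factors if isinstance(f, int)]
    syms = sorted(f for f in factors if isinstance(f, str))
    const = reduce(lambda a, b: a * b, ints, 1)
    return (const, tuple(syms))


# Two code paths that multiply the same dims in a different order
# end up with the same key.
a = canonical_product("batch", "seq_len", 4)
b = canonical_product(4, "seq_len", "batch")
print(a == b)
```

A real expression type would also need to handle sums, nested products, and repeated symbols; the sketch covers flat products only.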
Joe Fioti	db72cf505c	Dyn dim intervals (#333) * Consolidate compile APIs and bucket config * Fix Metal compile options API for clippy and llama CI	2026-05-24 18:01:56 -04:00
tucker-luminal	766db93b08	Dtype i64 f64 first class (#323) * tests for interface specification * luminal_python: skip CUDA zero-copy for float64 outputs Luminal collapses `DType::F64` to F32 internally, so a CUDA kernel for an f64-typed output actually writes f32 bytes. The Python wrapper was registering an `f64` pre-allocated tensor's `data_ptr` as the zero-copy destination, handing the kernel a 12-byte payload for a 24-byte buffer, leaving half of every f64 element as garbage. Fix: only set the device pointer for the dtypes luminal natively writes end-to-end on CUDA (f32, f16, bf16). For f64, pre-allocate the f64 output tensor but skip the device-ptr handoff; the collection path then falls through to `get_output()` (which reads the kernel's actual f32 output) and casts to f64 via the existing read-and-cast branch. Pre-existing latent bug: the test scaffolding from the prior commit exposes it as `test_boundary_noop_preserves_dtype_and_values[cuda-float64_f32_exact]`. Phase E adds first-class f64 IR support, which will eventually let the kernel write real f64 bytes and restore zero-copy here; this commit unblocks the CUDA test sweep until then. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * luminal: first-class I64 / F64 in IR + CPU + PT2 boundary Today luminal collapses every PT2 integer dtype to `DType::Int` (i32) and `float64` to `DType::F32` at the FFI boundary. The LUM-486 commit papered over symptoms by storing the user-visible PT2 dtype code in a sidecar and casting back at the Python wrapper, but the IR still computes in i32 / f32, so values outside those ranges (`2**40`, `1.0000000000000002`) lose information before the kernel ever runs. 
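The information loss described above can be reproduced with the Python standard library alone. The sketch below is illustrative and does not use luminal or PyTorch; the helper names `roundtrip_f32` and `fits_i32` are made up for this example, and the three-element buffer is an assumed size chosen to show the 12-byte versus 24-byte mismatch.

```python
import struct


def roundtrip_f32(x: float) -> float:
    """Round a Python float (an IEEE-754 double) to f32 and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fits_i32(n: int) -> bool:
    """True when n is representable as a signed 32-bit integer."""
    return -(2**31) <= n <= 2**31 - 1


precise = 1.0000000000000002  # exactly 1 + 2**-52 as a double
print(roundtrip_f32(precise))  # the extra bit does not survive f32
print(fits_i32(2**40))         # 2**40 is outside the i32 range

# Size mismatch for an assumed three-element output: f32 bytes written
# into a buffer that was sized for f64 elements.
f32_payload = struct.pack("<3f", 1.0, 2.0, 3.0)
f64_buffer_size = struct.calcsize("<3d")
print(len(f32_payload), f64_buffer_size)
```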
This commit makes i64 and f64 first-class through the IR end-to-end: - `DType::I64` added; custom `Debug` impl maps it to `"Int64"` (not `"I64"`) because egglog has a built-in primitive sort named `I64` for integer literals in shape expressions, and the egglog-format sites in `hlir.rs` serialize `DType` via `{:?}` — emitting `"I64"` would shadow the primitive and panic the egraph loader with `UnboundFunction("I64", ...)`. Documented at the variant. - `f64_dt: sort(DTYPE, "F64", &[])` and `int64_dt: sort(DTYPE, "Int64", &[])` registered in `egglog_utils::base`; matching arms added to `extract_dtype`. - `NativeData::I64(Vec<i64>)` and `NativeData::F64(Vec<f64>)` added. `len`, `f32`/`f16`/`bf16`/`i32`/`bool` accessors widen for both; new `i64()` and `f64()` accessors mirror the existing access pattern. `From<Vec<i64>>` and `From<Vec<f64>>` impls round out the inference. - Cast op covers the full new Cartesian product. Cast to `Int` from `I64` saturates, matching `tensor.to(torch.int32)` overflow semantics. Cast to `F32` from `F64` narrows. - CPU kernels handle I64/F64 directly in Add, Mul, Mod, Gather, Scatter, SumReduce, MaxReduce. Unary transcendentals (`Log2`, `Exp2`, etc.) still bridge through f32 in v1 — the translator inserts cast-bridges around them; reaching the kernel with `I64`/`F64` panics with a pointer to the missing bridge. - `dyn_backend::bytes_to_native_data` preserves i64 / f64 bytes directly; `dummy_data_for_dtype` includes i64 fill. New trait methods `get_output_i64` / `get_output_f64` on `DynBackend` with the native runtime impl. - `cuda_dtype` extended (`"long long"` for I64). Full CUDA kernel support for i64/f64 elementwise emit is Phase F — the mapping is here so the egglog ext correctly types the kernel inputs, but several elementwise CUDA paths still need codegen work. - PT2 boundary: `torch_dtype_int_to_luminal` returns `I64`/`F64` for codes 5/8. 
`TypedData::from_pytorch_bytes` and `pt2_compiled_model::bytes_to_typed` preserve raw bytes for both. `luminal_dtype_to_pt2_code` round-trips `I64` to code 5. - `CompiledGraph` exposes `get_output_i64` / `get_output_f64`. The Python wrapper routes `torch.int64` / `torch.float64` outputs through them — no more i32-buffer-then-`.to(int64)` cast-back layer. - Test scaffolding updated: the `int64_` and `float64_` cases move from `test_boundary_warns_when_input_dtype_requires_conversion` (where they previously had to warn because a conversion was real) to `test_boundary_does_not_warn_when_input_dtype_matches_graph`. Reflecting the new contract: int64 / float64 inputs match the graph's input dtype directly. xfails removed from `int64_outside_i32_range` and `float64_precision_sensitive`. Both now pass on CPU end-to-end. CUDA parity for i64/f64 elementwise kernels lands in Phase F (commit 17). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> luminal: hard-reject dtype mismatch at the FFI boundary Before: when a caller passed an input whose dtype didn't match the graph's declared input dtype, the Python wrapper silently `.to(expected_dtype)`-ed it and emitted a `DTypeBoundaryWarning`. Two real problems: 1. Precision bugs hid. A user passing `torch.float64` into a graph that wanted `torch.float32` lost precision-sensitive values (`1.0000000000000002` → `1.0`) without anything in the test suite or logs flagging it. The warning only showed up at first call and was trivially missed in a CI log. 2. Per-call allocation+copy burnt cycles the caller couldn't see in their profile. For a model invoked thousands of times a second, the cast was a real cost the user wasn't aware was happening. The contract is now strict: `model(x)` requires `x.dtype == model.input_dtypes[i]` for every positional input. Mismatched dtype raises `DTypeBoundaryError` before any FFI work. Migration: call `.to(model.input_dtypes[i])` at the call site. 
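The strict contract described above could look roughly like the following. Only the name `DTypeBoundaryError` and its `TypeError` base come from the commit message; the function `check_input_dtypes`, the use of plain strings in place of `torch.dtype` values, and the wording of the error message are assumptions for illustration.

```python
class DTypeBoundaryError(TypeError):
    """Raised when an input's dtype differs from the graph's declared dtype."""


def check_input_dtypes(actual, expected):
    """Hypothetical boundary check: reject any mismatch, never autocast.

    `actual` and `expected` are sequences of dtype names, one entry per
    positional input. Strings stand in for torch dtypes so the sketch
    stays self-contained.
    """
    for i, (got, want) in enumerate(zip(actual, expected)):
        if got != want:
            raise DTypeBoundaryError(
                f"input {i}: got {got}, expected {want}; "
                f"call .to({want}) at the call site"
            )


# Matching dtypes pass silently.
check_input_dtypes(["float32", "int64"], ["float32", "int64"])

# A mismatch raises before any further work is done.
try:
    check_input_dtypes(["float64"], ["float32"])
except DTypeBoundaryError as err:
    print(err)
```

Raising instead of converting keeps both the precision loss and the per-call copy visible to the caller, which is the trade-off the commit message argues for.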
- Add `DTypeBoundaryError(TypeError)` to `compiled_model.py` with a docstring that names the prior precision-bug class and points the user to the call-site migration. - Delete `.to(expected_dtype)` from the input hot path; replace with a direct `raise`. `DTypeBoundaryWarning` removed entirely. - Metal backend factory rejects `DType::I64` and `DType::F64` inputs at translate-time with `UnsupportedDtype` — Metal codegen has no native 64-bit kernels, and reaching the kernel emitter with these used to panic deep in MSL generation with an unhelpful error. - Test scaffolding: `test_boundary_warns_when_input_dtype_requires_conversion` becomes `test_input_dtype_mismatch_rejects` and asserts the raise. `test_boundary_does_not_warn_when_input_dtype_matches_graph` becomes `test_matching_dtype_does_not_raise`. The set of "first-class round- trip" dtypes is captured as `_FIRST_CLASS_NOOP_DTYPES` — narrow integers (uint8 / int8 / int16) collapse to luminal's `Int` (i32), so they can't round-trip the noop model without an explicit `.to(int32)` cast and live only in the reject-path test. Breaks user code that today silently autocasts. Intentional. The migration message at the raise site names the exact `.to(...)` call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * luminal_cuda_lite: I64 / F64 output read paths Wires the runtime side of `DType::I64` / `DType::F64` for CUDA. The `cuda_dtype` mapping in `luminal_cuda_lite/src/lib.rs` already returned `"long long"` / `"double"` for these (added with first-class IR support), so the kernel emitters were producing correctly-typed output bytes — but the Python wrapper's `get_output_i64` / `get_output_f64` calls landed on the trait-default panic ("not supported by 'cuda_lite'"), surfacing as 8 CUDA test failures on the test_dtype_boundary suite. Adds: - `CudaRuntime::get_i64` / `get_f64` — read raw 8-byte chunks from the output buffer and reinterpret. 
Mirrors the existing `get_f16` / `get_bf16` byte-reinterpret pattern. - `CudaLiteDynBackend::get_output_i64` / `get_output_f64`: thin forwarders to the runtime methods. Verified end-to-end with `test_boundary_noop_preserves_dtype_and_values[cuda-int64_outside_i32_range]` (2**40 round-trips bit-exactly through the CUDA kernel) and `[cuda-float64_precision_sensitive]` (1.0000000000000002 round-trips without f32 truncation). Full CUDA dtype suite: 42 passed, 0 failed. The design-doc commit 18 (int32 / bool CUDA zero-copy output plumbing) is deferred to a follow-up. Both dtypes already work end-to-end via the host-roundtrip `get_output_` path; zero-copy is a perf optimization not blocking any test in the contract suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * luminal_cuda_lite: include I64 in scatter elem-size tables CUDA scatter kernels compute output buffer / load / store byte counts via per-dtype size tables. After landing first-class I64, the scatter emission for an i64 output panicked with `Unsupported dtype for scatter output_bytes: Int64`, which surfaced as the egglog optimizer reporting "Failed to find a viable initial genome after 100 attempts" because every candidate genome containing an i64 scatter immediately panicked. Adds I64 → 8 bytes alongside F64 to the five size tables in `kernel/other_ops.rs` and `kernel/hlir.rs`. MoE routing (idx_dtype = int32 and int64) now compiles and runs end-to-end on CUDA. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * tests: drop input-layout / mutation-alias tests from dtype branch These two test files came along with `d0cec1fc tests for interface specification` as the test scaffolding for the broader boundary-contract work: input layout strides (Phase G) and mutation/alias writebacks (Phase D). Neither feature is in the dtype-only branch, so the tests either xfail or skip here and are noise to the reader trying to understand what this branch ships. 
Keep only `test_dtype_boundary.py` since that's the suite that exercises the I64/F64 IR work and the FFI dtype-mismatch rejection this branch actually delivers. The two removed files live on `pt2-boundary-contract` where the features they test land. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * tests: drop removed files from run_all_tests.sh and run_test.sh Follow-up to the previous commit's deletion of test_input_layout.py and test_mutation_alias_contract.py. Both scripts referenced those files in their pytest invocations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * boundary: strict dtype at output read; translator inserts Cast Reviewer's "no implicit casts at the read boundary" directive, applied to both runtimes: * `CudaRuntime::get_i64` / `get_f64` now check the producer buffer's `buffer_specs[..].dtype` and panic on anything other than `I64` / `F64`. The panic message points at the translator as the place to insert an explicit `Cast` — no silent widening from i32 / bool / f32 / f16 / bf16. * `NativeDynBackend::get_output_i64` / `get_output_f64` match only `NativeData::I64` / `F64` and panic otherwise. The internal `NativeData::i64()` / `f64()` accessors stay (they're load-bearing for in-kernel mixed-dtype binary ops); only the user-visible read boundary is strict. * `CompiledGraph::get_output_i64` / `get_output_f64` docstrings drop the "widens i32 / bool when the producer chose a narrower dtype" line; replaced with "Strict on producer dtype — the graph's output node must already be I64 / F64." For the strict boundary to be reachable when the EP-declared dtype differs from what the producer chose (e.g. `Argsort` / `TopK` emit i32 indices but `torch.int64` was requested), the translator's output loop now inserts an explicit `tensor.cast(declared)` before `output()` when the declared dtype is `I64` / `F64`. The Cast is in the graph — egglog can see it. 
`Vec<f32>::from([…])` typed-local style applied to test set_data call sites that previously relied on float-literal inference collapsing to `Vec<f32>`; after `941b6962` added `From<Vec<f64>>`, those literals now infer as `Vec<f64>` and the buffer lands as `NativeData::F64`, panicking the strict read. CPU: 234 pytest passed, 21 skipped. Core: 112 luminal + 16 luminal_nn tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * translator: explicit F32 bridge around unary transcendentals on F64 The CPU `unary_impl` has no native F64 path — `Log2` / `Exp2` / `Sin` / `Sqrt` / `Recip` and the higher-level transcendentals that compose them all bridge through f32 in v1. Previously the panic inside `unary_impl` for `NativeData::F64` was the only thing keeping the F32-bridge story honest, and the comment apologized for not inserting the bridge ourselves. Two changes: * Add `Translator::translate_unary_op_f32_bridge` — same shape as `translate_unary_op`, but when the input is `DType::F64` wraps the op as `f(input.cast(F32)).cast(F64)`. The two `Cast` nodes are in the graph; egglog sees them; the kernel only ever sees F32. * Re-dispatch every transcendental unary in `translator/dispatch.rs` (`aten.{log,log2,exp,exp2,sin,cos,sqrt,rsqrt,reciprocal,sigmoid, tanh,silu,gelu}.default`) through the f32-bridge variant. Ops that don't need transcendentals (`neg` = mul-by-(-1), `relu`, `abs`) stay on plain `translate_unary_op` and preserve F64 natively. * Update the `unary_impl` F64 panic message to direct readers at `translate_unary_op_f32_bridge` — reaching the panic now means a new transcendental dispatch site forgot to bridge. Tests: CPU 234 passed, 21 skipped. The `test_boundary_noop_preserves_dtype_and_values[-float64_]` cases continue to pass via the bridge (they go through the noop addition not a transcendental, so the bridge doesn't fire for them; but if anyone adds an F64-transcendental test it'll exercise the bridge end-to-end). 
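To make the `f(input.cast(F32)).cast(F64)` wrapping concrete, the sketch below emulates it with the Python standard library. It is an approximation under stated assumptions: `to_f32` and `exp_f32_bridge` are hypothetical names, and the inner `exp` is evaluated in double precision and then rounded to f32, which is at least as accurate as a genuine f32 kernel would be. The point is only to show how much precision the result of such a bridge can carry, even though it is returned as a 64-bit float.

```python
import math
import struct


def to_f32(x: float) -> float:
    """Round a double to the nearest f32 value, returned as a Python float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def exp_f32_bridge(x: float) -> float:
    """Emulate cast(F32) -> exp -> cast(F64) for a single element.

    The input and the result are both rounded to f32, so the value that
    comes back carries at most f32 precision.
    """
    return to_f32(math.exp(to_f32(x)))


native = math.exp(1.0)         # double-precision result
bridged = exp_f32_bridge(1.0)  # result limited to f32 precision
print(native, bridged, abs(native - bridged))
```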
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ffi: panic on narrow-int dtype codes; defer first-class narrower-int IR Reviewer flagged the "narrow-int widening" docstring at typed_data.rs:156 as concerning: today luminal collapses uint8 / int8 / int16 to `DType::Int` at the byte-conversion boundary. The restrained answer is to panic at the boundary rather than widen silently — matches the "no implicit casts" directive end-to-end. Both byte-conversion entry points now reject narrow-int PT2 codes: * `TypedData::from_pytorch_bytes` (user inputs via `set_input_from_ptr`) — codes 1 (uint8) / 2 (int8) / 3 (int16) panic with "cast to torch.int32 at the call site, or wait for the narrower-int IR follow-up." * `pt2_compiled_model::bytes_to_typed` (PT2 file weights) — same panic, same message. Models that previously round-tripped through implicit widening (e.g. quantized int8 weights) will now fail at load time with a clear message pointing at the missing infrastructure. Follow-up issue: "Narrower integer dtypes (i8 / u8 / i16) first-class in `NativeData` + CPU kernels" — once that lands, these panics disappear and the bytes flow through as `DType::U8` / etc. Tests: `test_dtype_boundary.py` 21 passed, 21 skipped. The narrow-int cases in `test_input_dtype_mismatch_rejects` continue to assert `pytest.raises` — the rejection now comes from the FFI panic instead of the input-dtype boundary check, but the contract from the user's perspective is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * compiled_model: unify output read dispatch; clarify zero-copy comment Three review comments addressed in one place: * `compiled_model.py:148` — the stale comment ("float64 is collapsed to f32 internally; registering an f64 device-ptr would have the kernel write 12 bytes into a 24-byte buffer") was wrong after `941b6962` made F64 first-class. 
Rewrite to explain why pre-allocation is GPU-only: the CUDA kernel needs the device-ptr registered before `run()`, while CPU reads back after via `_read_typed_output`. * `compiled_model.py:189` — the per-dtype elif chain duplicated across the CUDA-zero-copy and native paths. Refactor into a single `_output_readers` dispatch table keyed on `out_dtype` → `(getter_name, read_dtype, final_cast)`. The zero-copy fast path for f32 / f16 / bf16 stays as a single check at the top; every other dtype goes through `_read_typed_output`. * `compiled_model.py:243` — annotate the `if _use_zero_copy:` pre-allocation branch: "the CUDA kernel needs the output's device pointer registered before `_graph.run()` so the final kernel writes directly into PyTorch's buffer. CPU never zero-copies — there's no separate device buffer to register against." Tests: CPU 234 passed, 21 skipped (no behavior change, just refactor). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: clarify scope of kernel-internal widening accessors Two reviewer comments addressed: * `src/hlir.rs:1212` — the "Narrowing cast: explicit i64 -> i32 … used when the translator bridges an i64 value through a kernel that only has an i32 path" comment apologized for a non-existent problem. Reword: the `Cast` op IS the explicit graph-level conversion; saturating via `as i32` matches `tensor.to(torch.int32)` semantics on overflow. No bridging framing. * `src/hlir.rs:2989` (and the matching `f64` accessor at :2914) — the docstring said "Used by I64-aware kernels; widens other variants when an op promotes a mixed-dtype binary to I64" without scoping why that's OK. Rewrite to be explicit: this is a kernel-internal widening accessor, used by binary kernels to read RHS at LHS's width, mirroring PyTorch eager's mixed-dtype promotion. The user-visible read boundary (`DynBackend::get_output_`) is strict — that's where the reviewer was originally complaining about implicit casts. 
A follow-up translator pass that inserts explicit `Cast` ops on mixed-dtype binary operands would remove this in-kernel widening entirely; not in scope here. No code change. Tests unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Revert "translator: explicit F32 bridge around unary transcendentals on F64" This reverts commit `f77a2b92`. The bridge inserted `Cast(F32) → unary → Cast(F64)` inside the translator whenever a user called `torch.exp(x)` (or sin/cos/log/...) on an `f64` tensor. The output kept the `torch.float64` dtype tag, but the math itself ran in single precision — exactly the kind of silent precision downgrade hidden behind a wider dtype that this PR's "no implicit casts" directive is meant to reject. The bridge solved one reviewer comment ("unary_impl panics on F64") by relocating the implicit cast from the runtime to the translator — not by removing it. Restore the original behavior: `unary_impl` panics on `F64`, and now with a sharper message that says outright "cast inputs to F32 at the call site" and explicitly names the rejected alternative ("silent F32 bridging is intentionally rejected: it would hide a precision downgrade behind an `F64` dtype tag"). The same wording goes on the Int / I64 / Bool arms so each unsupported variant has a clear, self-contained recovery path. A native F64 transcendental kernel is the proper fix for double- precision `exp`/`log`/`sin`/... — tracked in the F64-CUDA-elementwise follow-up issue. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * hlir: drop From<Vec<f64>> for NativeData; revert luminal_nn / movement / tests churn The PR was carrying ~70 lines of `vec![1., 2., 3.]` → `Vec::<f32>::from([1., 2., 3.])` style churn across `luminal_nn/src/attention.rs`, `src/frontend/movement.rs`, and `src/tests/mod.rs`. 
The trigger was the new `impl From<Vec<f64>> for NativeData`: it made float literals ambiguous between `Vec<f32>` and `Vec<f64>` at every `set_data` call site, forcing the explicit `Vec::<f32>::from([...])` spelling. Drop the `From<Vec<f64>>` impl. It had no callers (`grep -rn` for `Vec<f64>` going into NativeData turned up nothing — the F64 buffer-construction sites in `dyn_backend.rs` and `typed_data.rs` use `as_bytes` on a raw `Vec<f64>`, not the `From` impl). Callers that genuinely want an F64 buffer can still write `NativeData::F64(my_vec)` directly. With the impl gone, float literals re-infer to `f32` via the sole `From<Vec<f32>>` impl — the original idiom — so the three churn-only files revert cleanly to their `main` state. A short comment at the deletion site explains why this impl is intentionally absent. Net diff on the PR drops by ~70 lines of pure style churn. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * translator: cast argsort / topk / sort indices to I64 at the PT2 boundary `torch.argsort` / `torch.topk(...).indices` / `torch.sort(...).indices` always return `int64`. luminal's frontend `stable_argsort` returns `Int` (i32) — the storage-efficient default that direct Rust callers want and that the existing `luminal_cuda_lite` op-functional / search-equivalence tests read back via `rt.get_i32(...)`. Previously, the gap was bridged with a post-hoc Cast in the translator's output loop (`translator/mod.rs`) — "if the EP declared `I64` and the producer chose `Int`, insert a Cast(I64) before Output." That meant a graph node was being inserted by the framework whose presence and location the user couldn't see in their dispatch — exactly the kind of hidden behavior this PR's "no implicit casts" directive is meant to avoid. It also did nothing to fix the underlying mismatch — the producer was still emitting i32 indices. 
Move the cast to the producer side of the PT2 boundary instead: * `translate_argsort` casts the `stable_argsort` result to I64 before inserting it into the tensor map. * `translate_topk` casts the sliced `topk_indices` to I64. Same buffer feeds both the values-gather (via `gather_elements`, which accepts any int dtype on its index operand) and the indices output. * `translate_sort` casts the indices half of the tuple to I64; the values half stays at the source dtype. The frontend `argsort` / `stable_argsort` are unchanged — direct Rust callers continue to get i32 indices. Drops the band-aid output-Cast block from `translator/mod.rs`, which is no longer needed (the producer now emits the right dtype). The strict read boundary still catches any future dtype mismatch loudly. Verification: * `cargo test -p luminal -p luminal_nn`: 114 + 16 + 5 passed. * CPU pytest (hlir_ops + unary + dtype_boundary): 250 passed, 21 skipped. * CUDA pytest (same suites + test_llama3 non-slow): 281 passed (previously 278 passed, 3 failed on `test_argsort_stable_duplicates [idx_dtype1]`, `test_topk_values_width_128_with_indices`, `test_tiny_moe_routing[idx_dtype1]` — all now passing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * translator: cast argmax / argmin to I64; fix narrow-int range-pattern clippy CI surfaced two issues on the dtype-i64-f64-first-class branch: * `Python CUDA Tests` failed on 15 `tests/test_scalars.py` cases for `argmax` / `argmin` (all variants — keepdim, 0d, all-reduce, per-dim). Same root cause as the previously-fixed argsort/topk/sort cases: PyTorch's `torch.argmax` / `torch.argmin` return int64 indices (same `kLong` contract as `sort` / `topk`, pinned in the structured kernel meta function), but `translate_argextremum` was emitting i32 — and the strict CUDA `get_i64` read boundary refused to widen. The old docstring for `translate_argextremum` already named the trick: "the Python wrapper widens at the boundary." 
That wrapper is gone (strict reads), so the fix is to cast at the translator site, same as argsort/topk/sort: - `Ok(result * 1)` → `Ok((result * 1).cast(DType::I64))` - The 0-d short-circuit path's `.cast(DType::Int)` becomes `.cast(DType::I64)`. - Docstring updated to reflect the new boundary cast. I had missed these locally because `test_scalars.py` wasn't in the CUDA sweep I ran while iterating; the PR-CI full pytest run caught them. * `CUDA Clippy` failed on two `1 \| 2 \| 3 =>` match arms — Rust 1.95 clippy now flags those under `manual_range_patterns`. Rewrote both as `1..=3 =>`. No behavior change. Verification: * `cargo clippy -p luminal_python --features cuda --tests -- -D warnings`: clean. * `LUMINAL_TEST_DEVICE=cuda pytest tests/test_scalars.py`: 171 passed, 4 xfailed (all previously-failing argmax/argmin cases now pass). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: refresh get_output_i64 / get_output_f64 comments + panic messages The doc comments on the strict i64 / f64 readers still pointed at the band-aid output-loop Cast in `translator::translate_graph` ("the translator inserts an explicit `Cast(I64)` before the Output; see `translator::translate_graph`"). That block was reverted earlier in this PR — casts now live at each producer op's translator dispatch site (`translate_argsort` / `translate_topk` / `translate_sort` / `translate_argextremum`, mirroring PyTorch's `kLong` contract pinned by the structured-kernel meta function in `Sorting.cpp`). 
Updates the doc + panic-message wording in three places to match the post-revert reality: * `CompiledGraph::get_output_i64` / `get_output_f64` (pyo3 wrapper, `compiled_graph.rs`) * `NativeDynBackend::get_output_i64` / `get_output_f64` (`dyn_backend.rs`) * `CudaRuntime::get_i64` / `get_f64` (`runtime.rs`) Each one now says, in substance: "the producer's buffer must already carry the requested dtype; on the PT2 path that's handled at the per-op translator dispatch site, not in a centralized output loop." Panic messages reworded from "Insert an explicit Cast(I64) in the graph before the Output" — which read like advice to an end user authoring the IR by hand — to "Add a `Cast(DType::I64)` before the Output in the producer graph," which fits both manual IR-authoring callers and the translator-dispatch case naturally. For `get_output_f64`, also added a one-liner pointing readers at the `unary_impl` F64 panic policy (cast inputs to F32 at the call site; no silent F32 bridging behind an F64 tag). No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: trim get_output_i64 / get_output_f64 doc + panic strings Previous pass over these comments name-dropped every per-op translator dispatch site (`translate_argsort`, `translate_topk`, ...) — context that's irrelevant to a caller of the read functions. Reduce each to a one-line contract: "Strict: the buffer must already be `DType::Xxx`; no widening at the read boundary." Panic strings shortened the same way — keep the "Add a `Cast(...)` before the Output" pointer, drop the editorial trailing clause. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: trim argextremum dtype paragraph Replace the seven-line "PyTorch's kLong contract / structured kernel meta function / storage-efficient default" exposition with one line: "The result is cast to `DType::I64` to match PyTorch's int64 argmax / argmin indices." 
The rest of the docstring (FX positional inputs, `dim=None` flattening, slice-then-materialize rationale) stays — those are non-obvious mechanical details a reader fixing a bug actually needs. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * dtype: PyTorch ScalarType as source-of-truth for PT2 dtype codes Addresses the last open review comment on `compiled_graph.rs:84` / `pt2_util.rs:208`: "we maintain our own enum of the datatypes in pytorch, it would be nice if we could bind over the ones from pytorch and use that as the source of truth and not our own." Before: the PT2 dtype-code numbering (`1=uint8, 2=int8, ..., 13=bfloat16`) was duplicated across four Rust sites and a hand-rolled dict in Python. Renumbering or new variants in PyTorch's PT2 schema (e.g. the float8 family added in pytorch/pytorch#143343) silently miscompiled at runtime. After: a single `TorchDType` enum in `crates/luminal_python/rust/src/ torch_dtype.rs` owns the canonical numbering. All four call sites route through it: * `pt2_util::torch_dtype_int_to_luminal` — delegates to `TorchDType::from_code(...).into()`. * `typed_data::from_pytorch_bytes` — matches on named variants; narrow-int panic now reads `TorchDType::Byte \| Char \| Short` instead of `1..=3`. The silent `_ => f32` fallback is gone — unknown codes panic with the variant name. * `pt2_compiled_model::bytes_to_typed` — collapsed to a one-line delegate (`TypedData::from_pytorch_bytes(bytes.to_vec(), dtype)`); the duplicated panic block is deleted. * `compiled_graph::luminal_dtype_to_pt2_code` — delegates to `TorchDType::try_from(dtype).map(\|t\| t.code())`. Python side: `dtype_util.py`'s hardcoded `_TORCH_DTYPE_TO_CODE` dict is rebuilt at import time from `torch._export.serde.schema. ScalarType.<NAME>.value` — PyTorch becomes the runtime source of truth on both sides of the FFI boundary. `torch._export.serde. 
schema` is a quasi-private API (leading underscore) but it's the module PT2 actually wire-serializes against; documented at the import site. Parity test: `tests/test_torch_dtype_parity.py` consumes a new pyo3-exported `_torch_dtype_codes()` map and asserts every Rust variant matches PyTorch's enum by name and value. If PyTorch renumbers or adds a variant, the test fails loudly at CI rather than miscompiling silently at runtime. Negative-test verified locally by setting `Long = 99` — fails with `LONG: luminal=99, pytorch=5`. Added to both `run_test.sh` and `run_all_tests.sh`; CUDA runner globs `tests/` so it picks it up automatically. `TorchDType` enumerates all 19 variants currently in `torch._export.serde.schema.ScalarType` (including `Unknown`, the three `Complex` types, `Uint16`, and the four `Float8E` variants); `TryFrom<TorchDType> for DType` returns `Err` for any variant luminal's IR doesn't model, with the boundary code panicking on `Err` with the variant name. Verification: * `cargo test -p luminal_python` — 8 passed (3 new for the enum, 5 pre-existing). * `cargo test -p luminal` — 114 passed. * `cargo clippy -p luminal_python --features cuda --tests -- -D warnings` — clean. * CPU pytest (`test_hlir_ops` + `test_unary` + `test_dtype_boundary` + `test_torch_dtype_parity`) — 252 passed, 21 skipped. * CUDA pytest (same suites + `test_scalars`, `-m "not slow"`) — 444 passed, 4 xfailed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fmt: cargo fmt on torch_dtype refactor Applies `cargo fmt --all` to the three files touched by the previous commit. The Fmt CI job caught: * `lib.rs` — `_torch_dtype_codes` chain wrapped over multiple lines. * `pt2_compiled_model.rs` — `use crate::pt2_parser;` ordered before `use crate::pt2_schema;`. * `typed_data.rs` — `unwrap_or_else` closure inlined onto one line. No behavior change. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: bump Python CUDA Slow Tests to A100-80GB `test_hf_qwen3_moe_real_config_full` loads the real Qwen3-30B-A3B checkpoint at bf16 (≈60 GiB of weights). Modal's default `--gpu A100` is the 40 GiB SKU, which can't hold the full model + PyTorch's reference forward state. When the test OOMs it doesn't release its allocated memory back to the CUDA driver, so every subsequent big-model test in the run inherits a ~39 GiB dead-memory wall and also OOMs (`test_hf_llama38b_mark_dynamic_seq_dim_before_compile`, `test_hf_llama3_full`, ...). Request the 80 GiB SKU explicitly. Aligns with the model-specific Modal jobs on this PR (`gemma`, `qwen3_moe`, etc.) which already spec `A100-80GB` and pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * pt2_util: panic on narrow ints instead of widening to Int `torch_dtype_int_to_luminal` was the one remaining site that silently collapsed `Byte` / `Char` / `Short` to `DType::Int`. Even though the byte-loading paths (`typed_data::from_pytorch_bytes`, `pt2_compiled_model::bytes_to_typed`) already refuse those codes, the metadata-read path through `pt2_util` was still happy to widen, which left the user's actual dtype invisible past the FFI boundary on graphs whose declared inputs were narrow ints. Reject at this site too. Same panic message as the byte paths ("isn't a first-class IR type yet — cast to torch.int32 at the call site, or wait for the narrower-int IR follow-up"), so the failure mode is consistent across all three sites. Test update: `test_input_dtype_mismatch_rejects[uint8 / int8 / int16]` previously asserted a `DTypeBoundaryError` raised at call time — that was the artifact of the silent widening flow (the graph compiled with narrow → int32 substitution, then call-time refused because the user's tensor still had the narrow dtype). 
The reject now fires at compile time via the translator panic, so the test asserts on the panic message instead. `pyo3_runtime.PanicException` inherits from `BaseException`, not `Exception`, so `pytest.raises` broadens to `BaseException`; the message match keeps the contract test specific. Verification: * `cargo test -p luminal_python` — 8 passed. * CPU pytest (`test_hlir_ops` + `test_unary` + `test_dtype_boundary` + `test_torch_dtype_parity`) — 252 passed, 21 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * torch_dtype: refuse narrow-int conversions in both directions The two `TryFrom` impls were silently mapping narrow ints: * `TryFrom<TorchDType> for DType` mapped `Byte` → `DType::U8`, `Char` → `DType::I8`, `Short` → `DType::I16`, `Uint16` → `DType::U16`. Those luminal DType variants exist in the enum but aren't first- class through the IR (no kernels, no codegen) — handing them out produced buffers downstream code couldn't actually run on. * `TryFrom<DType> for TorchDType` was the mirror: `U8` → `Byte`, `I8` → `Char`, `I16` → `Short`, plus a stale `U16` → `Int` workaround (silently aliased uint16 bytes as int32, predating PyTorch's `UINT16 = 28` schema entry). Move all of those to the `Err` arm in both directions. Downstream sites (`compiled_graph::luminal_dtype_to_pt2_code`, `pt2_util::torch_dtype_int_to_luminal`, ...) translate the `Err` into a typed panic with the variant name, so the failure mode is consistent with the rest of the no-implicit-cast directive — same spirit as the previous commit on `pt2_util`. Test updates: * `supported_dtypes_roundtrip` no longer includes `U8`/`I8`/`I16` — they aren't first-class, can't roundtrip. * New `narrow_ints_refuse_conversion` asserts the `Err` direction on `Byte`/`Char`/`Short` (forward) and `U8`/`I8`/`I16`/`U16` (reverse). Verification: * `cargo test -p luminal_python --lib torch_dtype` — 4 passed. 
* CPU pytest (`test_hlir_ops` + `test_unary` + `test_dtype_boundary` + `test_torch_dtype_parity`) — 252 passed, 21 skipped. * `cargo clippy -p luminal_python --features cuda --tests -- -D warnings` — clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * torch_dtype: refuse TF32 in DType → TorchDType conversion `TryFrom<DType> for TorchDType` was silently aliasing `DType::TF32 → TorchDType::Float`. TF32 isn't a storage dtype on the PyTorch side (PyTorch has no `torch.tf32`); it's a compute-mode hint that affects how matmuls are rounded but the underlying buffer is still f32. If a luminal graph genuinely carried `DType::TF32` through to the boundary and we mapped it to `Float`, PyTorch would receive a tensor tagged as f32 that the caller had been tracking as TF32 inside luminal — exactly the silent-dtype-aliasing pattern we've been hunting down through the rest of this PR. Refuse instead. A caller that needs a real f32 bridge can insert an explicit `Cast(F32)` upstream — same pattern as the F64 transcendental story (a graph-level Cast rather than a hidden runtime conversion). The existing `Err`-handling at every caller (`compiled_graph::luminal_dtype_to_pt2_code`, ...) panics with the named variant. Test update: `TF32` joins the narrow-int set in `narrow_ints_refuse_conversion`. Verification: * `cargo test -p luminal_python --lib torch_dtype` — 4 passed. * CPU pytest sweep — 252 passed, 21 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * compiled_model: strict per-dtype dispatch; drop narrow-int reread cast Addresses two recent review comments on `crates/luminal_python/src/luminal/compiled_model.py`: * "are we making vectors that are int32 and putting narrow int types in each slot? What is going on?" — `_output_readers` had three entries that read via `get_output_i32` then `.to(narrow_dtype)`'d back: a leftover from when the IR silently widened narrow ints to i32. 
After the recent narrow-int rejections in `pt2_util` and `torch_dtype.rs`, no graph can actually reach this code with a narrow-int declared output, so the dispatch entries are unreachable. Delete them. * "Why do we fallback to f32 instead of erroring?" — `_read_typed_ output`'s `if entry is None:` branch read the buffer as f32 and `.to(out_dtype)`'d back regardless of the declared dtype. That's the same silent-dtype-aliasing pattern we've been hunting down through the rest of the PR. Replace with an explicit `NotImplementedError` naming the unsupported dtype. Add explicit `_output_readers` entries for `float32` (which was relying on the fallback as a no-op cast on CPU) and for `float16` / `bfloat16` (documented as reading via the generic f32 getter and `.to()`-ing back — the runtime kernels already emit f32 bytes for these, so the cast at the end is the inverse of upstream's conversion, not a fresh precision drop; a proper typed getter is follow-up work). Net effect: every supported output dtype is an explicit dispatch entry, every unsupported one raises a clear `NotImplementedError`, and the narrow-int reread-and-cast path is gone. Verification: * CPU pytest (`test_hlir_ops` + `test_unary` + `test_dtype_boundary` + `test_torch_dtype_parity`) — 252 passed, 21 skipped. * CUDA pytest (same suites + `test_scalars`, `-m "not slow"`) — 444 passed, 4 xfailed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * compiled_model: error vs silently default to float32 on missing dtypes Addresses the new review comment on `compiled_model.py:214` ("probably should error vs assume?"): The pattern `code_to_torch_dtype(codes[i]) if i < len(codes) else torch.float32` appeared in three places (one input loop, two output loops) and silently defaulted to float32 when the Rust side returned a shorter dtype-code list than the declared input/output count. Same silent-default pattern the reviewer's been hunting down through the rest of the PR. 
Replace all three sites with up-front length checks that raise `RuntimeError` if the counts don't match, then build the typed `torch.dtype` list once from the codes and reuse it. Net effect: * If the Rust side returns inconsistent counts, the error names the declared names and the count mismatch directly — points at the graph-construction bug instead of papering it over. * No `else torch.float32` remains for missing-code fallbacks. Also tightened `dtype_util.py`: * `code_to_torch_dtype(unknown_code)` and `torch_dtype_code(unsupported_dtype)` now raise `KeyError` listing the known set, instead of silently aliasing the unknown to float32. Verification: * CPU pytest (`test_hlir_ops` + `test_unary` + `test_dtype_boundary` + `test_torch_dtype_parity`) — 273 passed. * CUDA pytest (same suites + `test_scalars`, `-m "not slow"`) — 444 passed, 4 xfailed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * hlir: drop unused widening accessors; add typed f16 / bf16 read path The reviewer flagged the kernel-internal `NativeData::{f32, f16, bf16, i32, i64, f64, bool}(i)` accessors as silent wideners. After main's PR #330 made the binary kernels strict on dtype, those accessors became dead code — `rg` for callers across `src/` and `crates/` turns up only their own panic strings. Delete all seven. The bigger silent-widening surface was on the read side: the native backend's `get_output_f32 / get_output_i32 / get_output_bool` just delegated to `NativeData::to_{f32,i32,bool}_vec()`, which happily accept any source variant. That's the same "widen on read" pattern the reviewer's been hammering on for `get_output_i64 / get_output_f64`. Tighten them with the same match-on-variant + panic-on-mismatch pattern (`Add a Cast(DType::X) before the Output`). 
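The two strictness patterns described above (an up-front count check, and a read that refuses to widen) might look roughly like this in isolation. It is a sketch under assumptions: `NativeBuffer`, `get_output_strict`, and `check_dtype_count` are hypothetical names for illustration, not the functions in `compiled_model.py` or the native backend.

```python
from dataclasses import dataclass


@dataclass
class NativeBuffer:
    """Hypothetical stand-in for a typed output buffer: a dtype tag plus values."""
    dtype: str
    values: list


def get_output_strict(buf, requested):
    """Return the values only if the stored dtype already matches the request."""
    if buf.dtype != requested:
        raise TypeError(
            f"output is {buf.dtype}, but {requested} was requested; "
            f"add a Cast({requested}) before the Output"
        )
    return list(buf.values)


def check_dtype_count(names, codes):
    """Fail up front if the dtype-code list does not cover every declared name."""
    if len(codes) != len(names):
        raise RuntimeError(
            f"declared {len(names)} tensors {names!r} but got {len(codes)} dtype codes"
        )


# A matching read succeeds; a mismatched one raises instead of converting.
print(get_output_strict(NativeBuffer("f32", [1.0, 2.0]), "f32"))
try:
    get_output_strict(NativeBuffer("i32", [1, 2]), "f32")
except TypeError as err:
    print(err)
```

Both helpers raise rather than pick a default, so a mismatch surfaces at the boundary where it was introduced.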
Tightening the read boundary broke the existing `float16` / `bfloat16` output paths in `compiled_model.py`, which were dispatching through the generic f32 getter and `.to(half)`-ing back — relying on exactly the silent widening we just removed. Add proper typed paths: * Backend trait: `get_output_f16` / `get_output_bf16` with default panic impls (`src/dyn_backend.rs`). * `NativeDynBackend`: strict match on `F16` / `Bf16` variants. * `luminal_cuda_lite::CudaRuntime`: pre-existing `get_f16` / `get_bf16` reinterpreted bytes without checking dtype — add the same buffer-spec strictness as `get_i64` / `get_f64`. * `CudaLiteDynBackend`: wire `get_output_f16` / `get_output_bf16` through. * `CompiledGraph` (pyo3): new `get_output_f16` / `get_output_bf16` methods that return `bytes` (Python has no native f16/bf16); caller bit-casts via `torch.frombuffer(..., dtype=torch.float16)` / `torch.bfloat16`. * `compiled_model.py`: dispatch table maps `torch.float16` → `get_output_f16` (and same for bf16); the helper bit-casts the bytes back, then `.clone()`s so the returned tensor owns its storage. Net effect: every supported read boundary is strict — buffer dtype must already match the requested width. No silent widening anywhere in the read path. Verification: * `cargo test -p luminal -p luminal_python` — 114 + 9 + 5 passed. * `cargo clippy -p luminal_python --features cuda --tests -- -D warnings` — clean. * CPU pytest (`test_hlir_ops` + `test_unary` + `test_dtype_boundary` + `test_torch_dtype_parity`) — 252 passed, 21 skipped. * CUDA pytest (same suites + `test_scalars`, `-m "not slow"`) — 444 passed, 4 xfailed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * compiled_model: use bytearray for f16/bf16 frombuffer to silence warning The f16/bf16 read path in `_read_typed_output` calls `torch.frombuffer` on the `bytes` returned by `CompiledGraph::get_output_f16` / `get_output_bf16`. 
Python `bytes` is immutable, so PyTorch emits a `UserWarning` ("The given buffer is not writable... You may want to copy the buffer to protect its data or make it writable before converting it to a tensor"). That warning's message contains the word "converting", which `test_dtype_boundary.test_matching_dtype_does_not_raise` catches in its boundary-warning filter — surfaced in CI as a `[cpu-bfloat16]` failure on the most recent run. Wrap the bytes in `bytearray()` before `frombuffer` so the storage is writable and no warning fires. `bytearray(b)` copies the underlying bytes once; the returned tensor owns its own storage, so the previous `.clone()` becomes unnecessary and is removed. No behavior change. CPU sweep still 252 passed / 21 skipped locally (verified before push this time). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ruff format: line-break style on f16/bf16 frombuffer call Ruff's `pre-commit` hook reformats the multi-line `torch.frombuffer(...) .reshape(tuple(shape))` chain to break after `.reshape(` instead of inside `frombuffer(...)`. CI's Ruff Format step flagged it on the previous commit (`4d882763`). No semantic change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Austin Glover <austin_glover@berekely.edu> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 00:35:46 -04:00
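The `bytes` versus `bytearray` distinction behind the last two commits in this entry can be checked with the standard library alone. The sketch below does not call `torch.frombuffer`. It only shows that `bytes` exposes a read-only buffer while `bytearray` exposes a writable copy, and how f16 / bf16 payloads could be bit-cast by hand. `decode_f16` and `decode_bf16` are hypothetical helpers and little-endian layout is an assumption.

```python
import struct

raw = struct.pack("<3e", 1.0, -2.5, 0.5)  # three little-endian IEEE half floats
assert memoryview(raw).readonly            # bytes exposes a read-only buffer

buf = bytearray(raw)                       # one copy; the result is writable
assert not memoryview(buf).readonly


def decode_f16(buffer):
    """Bit-cast a little-endian f16 payload into Python floats."""
    count = len(buffer) // 2
    return list(struct.unpack(f"<{count}e", buffer))


def decode_bf16(buffer):
    """bf16 is the top 16 bits of an f32: widen each word and reinterpret."""
    out = []
    for (word,) in struct.iter_unpack("<H", buffer):
        out.append(struct.unpack("<f", struct.pack("<I", word << 16))[0])
    return out


print(decode_f16(buf))
print(decode_bf16(struct.pack("<2H", 0x3F80, 0xC020)))
```

Because `bytearray(raw)` already copies the payload, a consumer built on top of it owns its storage, which is the reason the commit gives for dropping the extra `.clone()`.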
tucker-luminal	4e93f02725	Tucker/llama3 rsqrt fix (#331 )	2026-05-21 21:29:37 -07:00
Ali	25393a9fdd	call for allocate_intermediate_buffers is redundant (#321 )	2026-05-21 02:10:51 -04:00
Joe Fioti	81ea750e6b	cargo examples (#325 ) * cargo examples * Fix commit message generation for diff context * Generalize GLUMoE search-space checks and harden NaN tests	2026-05-21 02:09:32 -04:00
spinlocked	f94335b1b8	Bucket Qwen decode positions (#328 ) Add a positive-position bucket for Qwen cached decode so Metal can reuse a compiled bucket as p advances during generation. Keep p=0 as the prefill bucket. Co-authored-by: Joe Fioti <jafioti@gmail.com>	2026-05-20 16:56:29 -04:00
spinlocked	f62e3c50d0	Optimize Metal runtime setup and buffer reuse (#326 ) * Cache MPS matmul setup objects * Precompute per-bucket execution metadata * Reuse dynamic intermediate buffers at bucket capacity * Fix Metal shader language import --------- Co-authored-by: Joe Fioti <jafioti@gmail.com>	2026-05-20 14:37:58 -04:00
Ali	eeeabd7c20	Online normalizer calculation for softmax (#324 )	2026-05-20 14:26:55 -04:00
Joe Fioti	0f02466f3d	Reject implicit casting in native binary ops (#330 ) * Reject implicit casting in native binary ops * Make native dtype handling strict and explicit	2026-05-20 13:51:06 -04:00
Joe Fioti	156fac518e	Metal qwen (#327 ) * Refine Luminal graph rewrite handling * Generalize Metal scatter reuse and Qwen validation * Add Qwen safetensor size accounting * Fix Modal example imports for shared output validation * Clarify Luminal contributor guidance * Revert direct shard loading from qwen metal * Remove qwen Metal CI job * Add Metal Llama 1B CI and restore safe profiling timeouts * Fix duplicate Metal ops and tests * Fix Metal pipeline compilation on llama * Run llama Metal CI on xlarge runners * Resample search generations after timeout failures	2026-05-20 13:26:34 -04:00
Joe Fioti	a3df68bd43	Add full-modal-ready CUDA test workflows (#329 )	2026-05-20 01:13:02 -04:00
Ali	7a95e56a8b	copy_device_buffer_to_new_slice synchronizes stream unnecessarily (#322 )	2026-05-19 17:26:38 -04:00
Joe Fioti	e558ce6849	Flux2 cleanup (#319 ) * Refactor core graph and plugin interfaces * Switch examples to batched prefill * Add native-reference MoE fuzz tests * Add native MoE fuzzing and relax qwen3_moe CI check * Fix CI checks and CUDA fuzz harness * Fix llama clippy warnings and normalize fuzz seeds * Use pure HLIR for YOLO v11 model * Remove conv2d custom wrapper and use KernelConv2D rewrites * Fix conv view indexing and trim flux materializations * Skip flux CUDA tests without driver * Restrict core CI to CPU packages	2026-05-18 19:06:31 -04:00
Joe Fioti	c898b7fd53	Metal qwen ci/cd tests and many metal fixes (#318 ) * Refine Luminal graph rewrite handling * Generalize Metal scatter reuse and Qwen validation * Add Qwen safetensor size accounting * Fix Modal example imports for shared output validation * Clarify Luminal contributor guidance * Revert direct shard loading from qwen metal * Remove qwen Metal CI job	2026-05-18 14:07:30 -04:00
Joe Fioti	6cfbf538d0	Absorb FP8 cuBLASLt scale paths in egglog (#320 )	2026-05-18 13:49:31 -04:00
Joe Fioti	966f6f8147	Parallel prefill in rust examples (#317 ) * Refactor core graph and plugin interfaces * Switch examples to batched prefill * Add native-reference MoE fuzz tests * Add native MoE fuzzing and relax qwen3_moe CI check * Fix CI checks and CUDA fuzz harness * Fix llama clippy warnings and normalize fuzz seeds	2026-05-17 23:21:20 -04:00
Joe Fioti	8ea9a71747	Enhance README with PyTorch integration and clarity (#316 ) Added PyTorch-native integration and improved descriptions throughout the README.	2026-05-17 01:48:57 -04:00
spinlocked	861c3f0419	Add Metal support for Qwen3-4B generation (#297 ) Extend MetalRuntime with the runtime APIs needed for loading safetensors, managing persistent KV-cache buffers, round-tripping output buffers back into inputs, and reading logits during autoregressive decoding. Update the Qwen example to support both CUDA and Metal through mutually exclusive cuda and metal feature flags.	2026-05-16 23:21:40 -04:00
Joe Fioti	8f17561094	Flux 2 Dev (#304 ) * flux2 example Adds black-forest-labs/FLUX.2-dev as a Rust example: FlowMatchEuler scheduler (validated <1e-4 vs diffusers), Mistral3 text encoder branch (30 layers, GQA, taps at 10/20/30), full DiT (8 double + 48 single stream blocks, 4D RoPE), and AutoencoderKLFlux2 decoder. NVFP4 weights are dequantized in pure HLIR (cast + per-block scale broadcast + scalar outer scale), no new ops or custom kernels. Supporting core changes: - luminal_cuda_lite: load F4/F6/F8/I8 safetensors via raw-bytes path - egglog_utils: add F4E2M1/F6/F8/U4/I8/U8/I16/U16 dtypes to the enum - egglog_utils: bump RUN_SCHEDULE repeat 10 -> 30 so deep conv chains in the VAE actually find a valid schedule - graph.rs: LUMINAL_DISABLE_LOOP_ROLLING / DISABLE_CLEANUP / DUMP_HLIR_PROGRAM debug env vars * flux2: wire pack/unpack/BN inverse/unpatchify between transformer and VAE The previous full pipeline fed the transformer's (S_img, 128) output straight into the VAE expecting (32, h_lat, w_lat) — wrong shape and also missing the per-channel BatchNorm inverse that diffusers' Flux2 pipeline applies before VAE decode. Fix mirrors `Flux2Pipeline.__call__` exactly: 1. Use `S_img = (H/16) * (W/16)` (post-pack) and build RoPE on the post-pack `(h_pack, w_pack)` grid. Previously these used the pre-pack `(h_lat, w_lat) = (H/8, W/8)` grid, giving 4× too many tokens and `mu` ~3.2 instead of 1.15 at 1024² (the latter now matches the diffusers reference). 2. Host-side _unpack_latents_with_ids: (S_img, 128) → (128, h_pack, w_pack) 3. Host-side BN inverse: x = x * sqrt(running_var + 1e-4) + running_mean using `bn.running_mean`/`bn.running_var` read directly from the VAE safetensors. 4. Host-side _unpatchify_latents: (128, h_pack, w_pack) → (32, h_lat, w_lat) 5. Feed the (32, h_lat, w_lat) latent to the existing VaeDecoder. Also: * Add `* 1.0` materialization barrier in `conv2d_bias` between the unfold's permute/merge chain and the matmul. 
Without it the matmul's A operand has the unfold's broadcast/permuted strides and the cublaslt egg rule won't match, so search falls back to broadcast Mul + SumReduce and OOMs with a (M, K, N) intermediate even at 128². With the barrier the VAE compiles and runs at 128² (~2.8 GiB peak). * Plumb `BuildSearchSpaceOptions::max_memory_` into all three search paths (VAE/text encoder/transformer), tunable via env vars `VAE_MEM_GIB` / `TEXT_MEM_GIB` / `TX_MEM_GIB`. Without a memory budget the search picks candidates that allocate beyond GPU memory and fails with `Failed to find a viable initial genome after 100 attempts`. Update `print_status` to show the corrected post-pack image_seq_len and a more honest per-component status. * flux2: document VAE conv2d scaling limits, opt-in memory budget After investigation, the VAE memory budget mechanism doesn't help here and actively prevents the search from running: * The estimator in `memory_analysis::estimate_graph_memory_bytes` sums every node's output bytes (including views) across the whole graph instead of computing peak live memory. For the VAE this sum is in the hundreds of GiB at 256² even though the real peak is ~5 GiB, so any reasonable budget rejects 100% of candidates upfront. * Trace logging in `allocate_intermediate_buffers` showed that a successful 128² candidate allocates ~77 GiB total (no buffer reuse across nodes — each node owns its own buffer). When search picks the broadcast Mul + SumReduce fallback for any one of the ~30 decoder matmuls, that single matmul's (M, N, K) intermediate is 9.6–38 GiB and the candidate OOMs. So budget enforcement is left opt-in via `VAE_MEM_GIB`. Default uses the unbounded path, which works at 128² (search succeeds within ~100 random initial-genome attempts; peak ~12 GiB, output PNG written) but fails at 256²+ — every random genome OOMs because the C_in=512 / C_out=512 layer's broadcast intermediate alone exceeds 96 GB GPU. 
The conditional KernelMul cleanup rule in `cublaslt/mod.rs` doesn't delete the broadcast-Mul KernelMul reliably enough at deeper channel counts; making the search picky enough is fundamentally not a fix — the unfold-based conv2d's per-conv `(M, K)` materialized matrix at 1024² is 4.8 GB, summed across ~10 large convs that's ~50 GB even on the happy path. End-to-end at the actual Flux 2 1024² resolution requires a real `KernelConv2D` in luminal_cuda_lite that fuses unfold+matmul+bias into a single kernel with no intermediate matrix. A long inline comment in conv2d_bias points the next attempt at this. * luminal_cuda_lite: add direct Conv2DBias kernel (one thread per output) Adds `kernel::conv2d::Conv2DKernel` (impls `KernelOp`) plus a `Conv2DCustom` wrapper that goes through `cx.custom_op`, so it bypasses egglog rewrites entirely — the conv has no useful fusion opportunities with surrounding ops in the graphs it's used in (VAE resnet blocks), and pattern-matching the unfold + matmul + bias chain reliably from egglog is significantly more work than just dropping in a custom op. Helper `kernel::conv2d_bias(input, weight, bias, K, S, P)` constructs the custom op. Public re-export at `kernel::{conv2d_bias, Conv2DCustom, Conv2DKernel}`. CUDA kernel: one thread per output element. All shape/kernel params (H, W, Cin, Cout, K, S, P) are baked into the source via #defines, so each conv shape gets its own compiled & cached function. No `(H_out*W_out, C_in*K*K)` materialized intermediate, no `(M, N, K)` broadcast intermediate — just the input/weight/bias/output buffers. Far from peak FLOPs (no shared-mem tiling, no warp-level reduction over K) but correct and memory-bounded. flux2 VAE side: replaced the 60-line unfold + permute + merge_dims + matmul + bias + gather chain in `examples/flux2/src/vae.rs` with a 1-line call to `luminal_cuda_lite::kernel::conv2d_bias`. All 4 existing unit tests against the scalar reference still pass. 
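The "one thread per output element" scheme of the direct conv kernel can be illustrated with a scalar reference in which each iteration of the three outer loops plays the role of one GPU thread. This is a sketch, not the CUDA source: `conv2d_bias_ref` is a hypothetical name, and the `[Cin][H][W]` input layout, `[Cout][Cin][K][K]` weight layout, square kernel, and zero padding are assumptions the commit text does not spell out.

```python
def conv2d_bias_ref(x, w, b, stride, pad):
    """x: [Cin][H][W], w: [Cout][Cin][K][K], b: [Cout] -> [Cout][Hout][Wout]."""
    cin, h, wd = len(x), len(x[0]), len(x[0][0])
    cout, k = len(w), len(w[0][0])
    h_out = (h + 2 * pad - k) // stride + 1
    w_out = (wd + 2 * pad - k) // stride + 1
    out = [[[0.0] * w_out for _ in range(h_out)] for _ in range(cout)]
    # Each (co, oy, ox) triple corresponds to one thread in the described design.
    for co in range(cout):
        for oy in range(h_out):
            for ox in range(w_out):
                acc = b[co]
                for ci in range(cin):
                    for ky in range(k):
                        for kx in range(k):
                            iy = oy * stride + ky - pad
                            ix = ox * stride + kx - pad
                            if 0 <= iy < h and 0 <= ix < wd:  # zero padding
                                acc += x[ci][iy][ix] * w[co][ci][ky][kx]
                out[co][oy][ox] = acc
    return out


# 1-channel 3x3 input of ones, 3x3 kernel of ones, bias 0.5, stride 1, pad 1.
x = [[[1.0] * 3 for _ in range(3)]]
w = [[[[1.0] * 3 for _ in range(3)]]]
print(conv2d_bias_ref(x, w, [0.5], stride=1, pad=1))
```

The only storage touched is the input, weight, bias, and output, so no unfolded matrix is ever materialized. That matches the memory argument made above, at the cost of rereading inputs for every output element.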
Scaling impact (VAE_TEST): Old (unfold + matmul, with `* 1.0` materialization): 32²: ok (0.6 GiB peak). 64²: ok (4.5 GiB). 128²: ok (12 GiB). 256²: 100/100 random initial genomes OOM — single bad pick on a Cin=Cout=512 layer creates a 38 GiB broadcast Mul intermediate. * New (Conv2DBias custom kernel): 256²: 6 s search. 512²: 16 s. 768²: 20 s. All clean, no OOMs. 1024²: now blocked by the AttnBlock's Q@K^T falling into the same broadcast Mul + SumReduce path (524 GiB single intermediate at HW=128² mid-block resolution); the conv path is no longer the bottleneck. Next: same treatment for the AttnBlock (or get cublaslt to fire 100% of the time on its matmuls) to unblock end-to-end at 1024². * luminal_cuda_lite: add direct Matmul2D kernel + use it in VAE AttnBlock Adds `kernel::matmul2d::Matmul2DKernel` (impls `KernelOp`) plus the usual `Matmul2DCustom` wrapper. Three public helpers: * `matmul_2d(a, b)` → `(M, K) @ (K, N) = (M, N)` * `matmul_2d_t(a, b)` → `(M, K) @ (N, K)ᵀ = (M, N)` * `linear_bias(a, b, c)` → `(M, K) @ (N, K)ᵀ + bias` (linear projection) The CUDA kernel is a textbook 2D-blocked SGEMM with 16×16 output tiles and shared-memory K-staging — naive vs cuBLAS but correct, no extra intermediate, and (critically) goes through `cx.custom_op` so search can't pick a broadcast Mul + SumReduce alternative. Why this exists: the cublaslt 2D rules in `host/cublaslt/cublaslt_Cm_rewrite.egg` and `cublaslt_Rm_rewrite.egg` should match any `Mul + SumReduce` lowering with the right stride patterns, and the conditional KernelMul cleanup rule should delete the broadcast-Mul fallback whenever a cublaslt alternative exists. In practice, on the VAE's mid-block AttnBlock, only 3 of the ~6 matmuls get cublaslt (`cuda-memory-cublaslt-F32-bytes` reports 3 matches; the 2D rule names don't appear in the rule-activity output at all, only the batched variants). 
At 1024², when the bad path on `q @ kᵀ` does get picked, it allocates a `(HW, HW, C) = (16384, 16384, 512)` ≈ 524 GiB single intermediate that OOMs the 96 GiB GPU. Routing the AttnBlock matmuls (Q/K/V/out projections + scores + attn) through `linear_bias` / `matmul_2d_t` / `matmul_2d` makes that path deterministic. The `merged = normed.merge_dims(1,2).transpose(0,1)` ends up as a column-major view, which the matmul kernels assume away, so a `* 1.0_f32` materializer is added there. Three new tests vs scalar reference: `matmul_2d`, `matmul_2d_t`, `linear_bias`. All 11 vae tests pass. VAE_TEST end-to-end (search_iters=1, F32, GH200): * Old (just KernelConv2D, AttnBlock via egg matmul): 128²: ok. 256²: 6 s. 512²: 16 s. 768²: 20 s. 1024²: OOM. * New (KernelConv2D + Matmul2D in AttnBlock): 128²: 4.7 s. 256²: 5.1 s. 512²: 7.6 s. 768²: 11.6 s. 1024²: 17.9 s — full Flux 2 resolution unblocked, output PNG written. Smaller sizes are also faster because eliminating search variance in the AttnBlock cuts the retry cost. * flux2: route text encoder + transformer matmuls through direct kernels Extends `Matmul2DKernel` with mixed-precision (BF16 weight, F32 act) and optional batch axis, then wires the text encoder and transformer's matmuls through it instead of the egglog matmul lowering. Also fixes a SwiGLU rank bug that made the transformer's `DoubleStreamBlock` FeedForward unrunnable. * `kernel::matmul2d`: weight_dtype param (F32 or BF16). For BF16, the kernel declares B as `__nv_bfloat16` and converts on each load via `__bfloat162float`, so the caller does NOT need a `.cast(F32)` op on the weight tensor (a 24 GB → 48 GB cast for the text encoder, or 32 GB → 64 GB for the transformer, would not fit on the GPU). `kernel::matmul2d`: optional `batch` axis. Same kernel, with `gridDim.z = batch` and pointer offsets computed from the batch index. Used by the new `matmul_3d` / `matmul_3d_t` helpers for the attention `q @ kᵀ` / `attn_w @ v` matmuls. 
* `kernel::linear_no_bias_bf16_w(a, b_bf16)` is the entry point LLM-style projections want. * `text_encoder.rs`: `linear_no_bias` now uses `linear_no_bias_bf16_w` for the 2D case (Q/K/V/O projections + FF gate/up/down). Falls through to the standard lowering for higher ranks. * `text_encoder.rs::causal_sdpa`: `q @ kᵀ` and `attn_w @ v` go through `matmul_3d_t` / `matmul_3d` after `* 1.0_f32` materialization barriers that fix the strided views produced by upstream transpose / GQA expand_dim chains. * `transformer.rs`: same treatment in `linear_no_bias` and `sdpa`. * `transformer.rs::swiglu`: was hardcoded to a 3D slice pattern `(.., .., ..half)` but `DoubleStreamBlock`'s FeedForward calls it with 2D input. Now handles both ranks. * `main.rs`: opt-in `TEXT_MEM_GIB` / `TX_MEM_GIB` budgets for the same reason `VAE_MEM_GIB` is opt-in (estimator over-counts). Default path runs unbounded. * Five new vae::tests against scalar references: `matmul_3d`, `matmul_3d_t`, `linear_no_bias_bf16_w`, plus the existing `matmul_2d` / `matmul_2d_t` / `linear_bias`. All pass. End-to-end at this commit: * `TEXT_TEST=1` with default `TEXT_LEN=512`: 12 s compile, 4 s encode, output (512, 15360) — works without OOM. Previously OOM'd every candidate at TEXT_LEN ≥ 256. * Full pipeline (`FULL=1`): in progress — text encoder runs cleanly, transformer compile is still going (large graph, ~10k HLIR nodes after auto-loop-rolling). * flux2: full end-to-end pipeline runs (with reduced transformer layers) Three fixes that together make `FULL=1` produce an out.png: 1. Persistent inputs across diffusion-loop iterations. `text_in`, `cos_in`, `sin_in`, `guidance_in` are now `.persist()` so their buffers survive between successive `runtime.execute()` calls. Without this the second step's execute reads freed memory and panics with `CUDA_ERROR_ILLEGAL_ADDRESS` on the post-kernel sync. `latent_in` and `timestep_in` change every iteration so they stay non-persist. 2. VAE search budget made opt-in here too. 
The `run_full_pipeline` VAE step still had the old `BuildSearchSpaceOptions::max_memory_gib` default of 32. Now matches `run_vae_only`: only enforced when `VAE_MEM_GIB` is explicitly set. Without this, the post-diffusion VAE compile panics ("did not estimate candidate memory") because custom ops don't participate in `memory_analysis::local_output_bytes`. 3. `FLUX2_NUM_LAYERS` / `FLUX2_NUM_SINGLE_LAYERS` env overrides for the transformer. At full 8 + 48 layers the egglog cycle on the transformer egraph runs away to 200+ GB CPU RAM and never converges because (a) auto-loop-rolling isn't detecting the repeated double-/single-stream-block structure (rolled HLIR: 10051 → 10041 nodes, only 18 dedups for the entire 56-layer transformer), and (b) without rolling, every layer's intermediates stay live for the whole forward pass, so even when egglog finishes, the runtime can't fit > ~16 layers on the GPU. Reducing layer count is a workaround for end-to-end validation. Also fixed a `swiglu` rank bug surfaced by running the transformer: was hardcoded to a 3D slice `(.., .., ..half)`, but the `DoubleStreamBlock` FF calls it with a 2D tensor. Now handles both. Status: * `FLUX2_NUM_LAYERS=1 FLUX2_NUM_SINGLE_LAYERS=1`: full pipeline runs at 128² in ~80 s (text encode + transformer compile + 2 diffusion steps + VAE decode). Output PNG written. * Scales to `8 + 16` layers without OOM at 128². * `8 + 32` and above: transformer compile finishes (~4 min) but runtime alloc OOMs because there's no live-range buffer reuse — every node owns a buffer for the whole forward. * Full `8 + 48` is unreachable until auto-loop-rolling detects the repeated block structure or the runtime gets buffer reuse. * graph: iterate the auto-loop-rolling prepass until no more candidates `auto_roll_loops_prepass` finds and rolls one best candidate per call. For models with multiple distinct repeated patterns — e.g. 
Flux 2's mid-block (2 resnets) + 8 double-stream blocks + 48 single-stream blocks, all with different body shapes — only the first pattern got rolled before this change, leaving the rest unrolled and search still operating on the full unrolled chain. Now `run_auto_loop_rolling_prepass` calls the inner pass repeatedly until no candidate is found, capped at 32 passes. On Flux 2 the first three passes pick up the mid-block resnets (body=18 ×2), the double-stream blocks (body=129 ×7), and a small ×2 pattern. The 48 single-stream blocks still don't roll — `collect_state_params` detects no state across iterations for that pattern, which is a separate bug — but the partial rolling is enough to make Flux 2 at 1+1 layers compile end-to-end. * graph: gate iterated loop rolling behind LUMINAL_LOOP_ROLL_ITERATE The previous commit unconditionally iterated the auto-loop-rolling prepass, which broke fusion codegen on Flux 2 at 8 + 16 layers (`region_codegen.rs:232: FusionStart with no predecessor`). Multiple rolling passes can split a fusion region with loop markers in ways the downstream code doesn't expect. Now iteration is opt-in via `LUMINAL_LOOP_ROLL_ITERATE=1`. Default back to a single pass — preserves all existing example behaviour, including the 8 + 16 layer Flux 2 path that was working before. Use the env var when you have a model with multiple distinct repeating patterns (Flux 2 at full 8 + 48 layers) AND have verified the fusion codegen still succeeds for it. * flux2: full end-to-end at 1024² with FLUX2_NUM_LAYERS=1 + LUMINAL_DISABLE_LOOP_ROLLING=1 Verified `FULL=1` runs end-to-end (text encode → transformer diffusion loop → VAE decode → PNG) at 1024² resolution with the smallest transformer config: ~50 s wall clock for 2 diffusion steps. 
* Text encoder compile + load + encode: ~17 s * Transformer compile (1 double + 1 single block): 21 s * Per diffusion step: ~3-6 s * VAE decode: ~10 s * PNG written Two env vars are required for end-to-end success: * `LUMINAL_DISABLE_LOOP_ROLLING=1` — auto-loop-rolling produces a rolled body that includes our `CustomOpKind`-wrapped kernels (conv, matmul) and the resulting LLIR graph crashes with `CUDA_ERROR_ILLEGAL_ADDRESS` on first execute. The rolling pass itself reports success ("rolled HLIR: 688 → 678 nodes, 18 dedups"); the failure is downstream in either how loop input/output edges wire to a CustomOp's input pointers or how the runtime allocates buffers across loop iterations of a custom-op-bearing body. Standard egglog-rewritten kernels handle the rolling fine, so the bug is specifically in the CustomOp + Loop interaction. * `FLUX2_NUM_LAYERS` / `FLUX2_NUM_SINGLE_LAYERS` — without live-range buffer reuse in `CudaRuntime::allocate_intermediate_buffers`, each layer's intermediates stay alive for the whole forward pass. The 8 + 48 default exceeds GPU memory above ~16 single-stream blocks; `1 + 1` validates the entire pipeline plumbing. Both limitations are tractable follow-up work, not blockers: * Loop+CustomOp: investigate `output_alias_map` and per-iter buffer reuse in `runtime.rs::execute()`; the `Conv2DKernel` / `Matmul2DKernel` ops likely need to participate in the loop's iteration-buffer scheme the same way `KernelMul` etc. do. * Buffer reuse: implement liveness analysis on the LLIR graph and reuse non-overlapping buffers, similar to register allocation. * luminal_cuda_lite: live-range buffer reuse at exec-graph level Each LLIR intermediate node in `buffer_specs` was previously its own owned `CudaSlice<u8>` for the whole forward pass — total intermediate memory grew linearly with depth even when the actual peak live working set was a fraction of that. 
A 56-layer transformer at 1024² needs >100 GiB just for intermediates with no reuse, even though the real working set is a few GiB. Adds a slot-assignment pass to `allocate_intermediate_buffers`: * For each node in `buffer_specs`, look up its live range `(start_pos, end_pos)` from the precomputed `bucket.live_ranges` map (built once in `compile_bucket` from an exec-graph toposort). Start = position of the exec op that produces the node; end = max position of any exec op that consumes it. End = `usize::MAX` for user-readable outputs (no consumer in exec graph). * Greedy slot assignment in `(start, end)` order, best-fit by size. Two nodes can share a slot iff their live ranges don't overlap. Output nodes (anything reachable through `output_producers` after following `output_alias_map`) get dedicated slots so `get_f32` and related readbacks see a buffer sized exactly to the output node's bytes — sharing those slots with larger non-output nodes would silently lengthen the readback (per-node `output_bytes()` no longer matches `buf.len()`). * `bucket.buffers` keeps the owned `CudaSlice<u8>` keyed by slot primary; non-primary nodes are recorded in a new `slot_alias` map that points back to the primary. New helper `bucket.buffer_for(node)` resolves a node → primary → buffer in one step; existing call sites that did `bucket.buffers.get(&node)` now go through this helper. (~30 call-sites updated.) Granularity is intentionally exec-level, not LLIR-level. Inside a single `CudaGraphOp` every kernel sits at the same exec position, so its intermediates all overlap and don't share slots. This is conservative but safe — within a `CudaGraphOp`'s compiled CUDA graph, data-independent kernels can run concurrently (the CUDA graph only serializes pairs with an explicit dep edge), so two unrelated kernels sharing a slot would race. 
Slot reuse across `CudaGraphOp` boundaries is enforced by the surrounding stream's implicit ordering, which is why exec-level liveness is the right thing to use here. The reuse mechanism finds significant savings on graphs that have multiple ExecOps (e.g. workloads with auto-loop-rolled bodies and distinct prefix/body/suffix CudaGraphOps). For Flux 2 in its current single-CudaGraphOp shape it finds 0% — unblocking the full 8+48 layer transformer at 1024² requires intra-`CudaGraphOp` LLIR-level reuse, which in turn requires `kernel_to_host` to inject explicit memory-ordering deps into each CUDA graph for shared-slot kernels. That's a follow-up on top of this infrastructure (the slot assignment is fine, it's the runtime concurrency model that needs the additional wiring). Opt-out via `LUMINAL_NO_BUFFER_REUSE=1` for bisecting. `LUMINAL_DEBUG_REUSE=1` prints a per-allocation summary of how many ranges collapsed into how many slots and the resulting MiB totals. All 98 existing `luminal_cuda_lite` tests pass with reuse on by default. End-to-end Flux 2 pipeline (text encoder + transformer + VAE → PNG) still succeeds at 1024² with `FLUX2_NUM_LAYERS=1 FLUX2_NUM_SINGLE_LAYERS=1 LUMINAL_DISABLE_LOOP_ROLLING=1`. * luminal_cuda_lite: intra-CudaGraphOp live-range buffer reuse Refines the previous exec-graph-level liveness pass into LLIR-level ranges that see inside each CudaGraphOp. The result: 21 GB → 950 MB text-encoder intermediates (96% saved), 82 GB → 23 GB transformer intermediates at 1024² (72% saved) — enough to actually fit the full 8 + 48 layer Flux 2 transformer alongside its 64 GB weights on a 96 GB GPU. How: * `CudaGraphOp::kernel_topo_order()` — the LLIR node IDs of every kernel inside this CudaGraphOp, in the order `kernel_to_host` pushed them into `state.kernels`. 
That's the order they actually execute: each kernel was added to the CUDA graph with `prev_graph_node` as its sole dep, so kernels inside one CudaGraphOp run strictly serialized — they can safely share physical buffers when their live ranges in this order don't overlap. * `CudaGraphOp::kernel_inputs(node)` — direct LLIR inputs of one kernel inside the graph. Used to refine consumer positions: kernel B reading kernel A's output bumps A's `consumer_max_pos` up to B's position only, NOT to the whole CudaGraphOp's last position. * `compile_bucket` now stitches a unified position space — exec-graph toposort, expanded inside each CudaGraphOp by that op's `kernel_topo_order()`. Every LLIR intermediate gets one integer `(start, end)` whose ordering matches real execution. * Slot assignment in `allocate_intermediate_buffers` is unchanged (greedy best-fit by size) but now operates on those finer ranges. Sort key includes `node` as a tiebreaker so the resulting slot map is deterministic — `buffer_specs` is a hash map, iterating it directly gave non-deterministic orderings that produced different (sometimes wrong) slot assignments under thread races during parallel test runs. Correctness: all 98 luminal_cuda_lite tests pass under both single-threaded and parallel cargo runs. All 27 flux2 tests pass. End-to-end pipeline still produces a 1024² PNG at FLUX2_NUM_LAYERS=1 FLUX2_NUM_SINGLE_LAYERS=1 LUMINAL_DISABLE_LOOP_ROLLING=1. * luminal_cuda_lite: pin unmapped buffer_specs nodes forever If a node appears in `buffer_specs` but the LLIR-position pass didn't see it (e.g. an intermediate referenced by a CudaGraphOp from outside that isn't in `extra_buffer_nodes()`), conservatively pin its live range to `(0, usize::MAX)` so it never participates in slot reuse. Also expanded the comment on the opt-out env var to describe the parallel-test flake observed in the cuda_lite suite. 
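
The greedy live-range slot assignment described in the buffer-reuse entries above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `allocate_intermediate_buffers`: `Spec`, `Slot`, and `assign_slots` are hypothetical names, and the dedicated-slot rule for output nodes is left out for brevity.

```rust
/// Hypothetical stand-in for one `buffer_specs` entry plus its live range.
#[derive(Clone, Copy, Debug)]
struct Spec {
    node: usize,  // LLIR node id, also used as the sort tiebreaker
    bytes: usize, // size of the intermediate
    start: usize, // position of the producing kernel
    end: usize,   // last consuming position; usize::MAX pins the buffer
}

/// One physical buffer that several non-overlapping nodes may share.
#[derive(Debug)]
struct Slot {
    max_size: usize,   // the allocation must fit the largest occupant
    free_after: usize, // end position of the most recent occupant
    nodes: Vec<usize>,
}

fn assign_slots(mut specs: Vec<Spec>) -> Vec<Slot> {
    // Sort by (start, end) and break ties on the node id so the result
    // does not depend on hash-map iteration order.
    specs.sort_by_key(|s| (s.start, s.end, s.node));

    let mut slots: Vec<Slot> = Vec::new();
    for s in specs {
        // A slot is reusable only if its last occupant died strictly
        // before this node is produced; among those, take the closest size.
        let best = slots
            .iter_mut()
            .filter(|slot| slot.free_after < s.start)
            .min_by_key(|slot| slot.max_size.abs_diff(s.bytes));
        match best {
            Some(slot) => {
                slot.max_size = slot.max_size.max(s.bytes);
                slot.free_after = s.end;
                slot.nodes.push(s.node);
            }
            None => slots.push(Slot {
                max_size: s.bytes,
                free_after: s.end,
                nodes: vec![s.node],
            }),
        }
    }
    slots
}
```

Because nodes are visited in start order, a slot's `free_after` only ever grows, so comparing against the most recent occupant is enough to rule out overlap with every earlier one. A node pinned to `(0, usize::MAX)` never frees its slot, which matches the conservative pinning described for unmapped nodes.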
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * matmul2d: document why linear_no_bias_bf16_w stays a custom op Tried lowering it to plain HLIR (cast(F32→BF16) + matmul + cast(F32)) to unblock loop-rolling on the transformer body — the BF16 cuBLAS 2D rule does fire and the matmul2d unit test passes. At full text-encoder scale the genetic search still occasionally picks the broadcast Mul + SumReduce fallback for at least one of the ~280 projections before the conditional KernelMul cleanup removes it, producing a 40 GB intermediate that OOMs the GPU. Until the extraction is pinned to the cublaslt alternative once it exists (or the cleanup is made eager), this entry point stays as a custom op. Recording the finding in the doc comment so the next attempt doesn't relitigate it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * egglog: eager cuBLAS-aware KernelMul stripping (opt-in) After egglog finishes, walk the serialized egraph and explicitly delete the matmul broadcast `KernelMul` (and its co-resident HLIR `Mul`) from any Mul eclass that feeds into a Sum eclass which has a `cublaslt` alternative. The egglog `:ruleset cleanup` rule does the same conditional delete in principle, but at flux2 text-encoder scale (~280 BF16 projections) it misses some Mul eclasses — likely small stride-form variations vs. the rule's exact pattern — and the surviving KernelMul produces an `(M, N, K)` broadcast intermediate (~80 GB at M=512 N=15360 K=5120) that OOMs the GPU during genetic search profiling. The Rust pass replays the same logic with the same broadcast stride check (`a_n_stride == MNum 0`, `b_m_stride == MNum 0`) so non-matmul KernelMul enodes that happen to live in nearby eclasses are left alone. Opt-in via `LUMINAL_EAGER_CUBLAS_CLEANUP=1`. 
Default-off because on smaller models (Llama MLP unit tests, K=256) cuBLASLt initialization itself is unreliable on this hardware/driver combo, and the existing KernelMul fallback is what kept the search viable. flux2's main.rs sets the env var on entry. With this in place, `linear_no_bias_bf16_w` switches to plain HLIR (`cast(F32→BF16) + matmul + cast(BF16→F32)`) and the BF16 cuBLAS path becomes the actual extraction target — visible to auto-loop-rolling, no `cx.custom_op` boundary in the way. End-to-end flux2 with `FLUX2_NUM_LAYERS=1 FLUX2_NUM_SINGLE_LAYERS=1 HEIGHT=128 WIDTH=128 STEPS=1 FULL=1` (no LUMINAL_DISABLE_LOOP_ROLLING) compiles, runs the diffusion loop, and writes out.png. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * graph: dump every discovered rolling run via LUMINAL_DEBUG_ROLLING=1 The rolling pass already records the top-N runs with ≥20 occurrences, but at moderate layer counts (e.g. flux2 NUM_LAYERS=2 SINGLE=8) every block-level pattern has trips=2, which falls below that threshold. Tracking every discovered run behind an env var lets us see why the layer-level pattern isn't getting rolled — for flux2 it surfaces that single-stream blocks pair up nicely (body=18 trips=2 with state_params=2 for the first two pairs) but the topo order interleaves cross-layer nodes (modulation tensors, RMSNorm weights) between every pair, so trips never extends past 2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * egglog: gate fusion_pair behind LUMINAL_NO_FUSION_PAIR=1 fusion_pair is the dominant cost in cycle 001 (>97% on text encoder, ~80% on transformer). It scales as O(B²·iter) in the number of binary ops and is the proximate reason the 8+48 transformer cycle takes minutes / blows up RAM. With it dropped from the schedule on flux2 4+8 the transformer cycle 001 goes from 26s to 1.4s (~18× speedup). The fusion_grow/fusion_merge phase still runs and composes whatever direct_kernel + kernel_lower produced. 
Caveat: search currently can't find a viable genome with fusion_pair off — without paired Kernel/FusionEnd seeds, fusion_grow has too little to work with and the resulting candidates fail profiling. That's a separate problem to debug. Keeping the gate so we can A/B test cycle-001 cost vs. genome viability without rebuilding. Also added LUMINAL_DEBUG_STATE_PARAMS to dump why each candidate boundary position fails the state-param check in the rolling pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> region_codegen: pin FusionStart-with-no-predecessor panic in a test Adds a #[should_panic] unit test that constructs a minimal LLIR graph (FusionStart → FusedAdd → FusionEnd, with the FS having no incoming edge), runs `build_compile_units`, and asserts the panic fires at the expected `expect("FusionStart with no predecessor")` in `region_codegen.rs`. This is the same panic that appears at flux2 8+48 scale — every search profile genome produced from the iterated rolling pass has a malformed FS leaf, the panic fires under catch_unwind, and the search retry loop accumulates state until the process is OOM-killed. The test pins the panic location so a regression either fixes it properly (in which case the test's #[should_panic] assertion fires and reminds us to flip it to a positive assertion) or doesn't silently move the failure to a different message. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * diagnostics: gate FusionStart-panic and dangling-FS dumps behind env vars Adds two opt-in diagnostics for the FusionStart-with-no-predecessor panic at flux2 8+48 scale: 1. `LUMINAL_DEBUG_FUSION_PANIC=1` in `region_codegen` — when the panic fires, dump which FE triggered the walk, every FS leaf with its in/out degree, and the interior FusedX nodes. 2. `LUMINAL_DEBUG_DANGLING_FS=1` in `egglog_to_llir` — after each genome's LLIR is built, walk every extracted FusionStart node and report any with zero incoming edges. 
Surfaces whether the bug is at extraction time (choice picked an INil over the real ICons, or the input eclass was emptied without cascading up to the FS) vs. introduced later by a downstream pass. Both are behind env vars so they don't fire on the per-genome hot path during normal search. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * flux2: allow LUMINAL_NO_EAGER_CUBLAS_CLEANUP=1 to override the auto-set Previously main.rs unconditionally set LUMINAL_EAGER_CUBLAS_CLEANUP=1 on entry, which made it impossible to A/B test the eager cleanup against runs without it. Now the auto-set only fires if neither env var is set, so users (or debugging sessions) can pass LUMINAL_NO_EAGER_CUBLAS_CLEANUP=1 to disable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * graph: fix dangling FusionStart from chained-marker resolution in collapse_loops_to_first_iter The genetic search retry-loop OOM at flux2 8+48 (and at 4+16 with LUMINAL_LOOP_ROLL_ITERATE=1) was triggered by every profile genome panicking in `region_codegen::build_compile_units` with `FusionStart with no predecessor`. Diagnostic dumps showed a FS node with `in_deg=0 out_deg=1-2` whose initial predecessor was a loop marker that got stripped in the post-collapse rewire pass without the consumer's edge being redirected to the marker's underlying value. Two real bugs: 1. `resolve_src` (used to rewire body-node incoming edges) only resolved one level. Iterated rolling produces chained markers — a LoopInput whose first source is a LoopStart whose initial is another marker — and the body edge ended up pointing at an intermediate marker about to be removed. Fixed with bounded transitive resolution. 2. `marker_post_sub` (used to rewire post-loop-consumer incoming edges) only had entries for `LoopEnd` and `LoopOutputSelect`. 
A FusionStart that egglog inserted to wrap a `LoopOutput`, `LoopStart`, `LoopInput`, or `LoopInputStatic` directly fell through to `unwrap_or(src)`, the marker was removed, and the FS dangled. Added entries for all four marker kinds and made the resolution transitive too. Also added two diagnostic env vars to keep this debuggable: - `LUMINAL_DEBUG_COLLAPSE_FS=1` — snapshot every FS's incoming at entry to `collapse_loops_to_first_iter`, report any whose edge is gone before compaction with what its pre-collapse predecessor was. Surfaces this exact bug class. - `LUMINAL_DEBUG_DANGLING_FS_POST_COLLAPSE=1` — same scan in the search loop right after `collapse_loops_to_first_iter` returns, so we can confirm whether the dangling FS comes from collapse vs later passes. The earlier `LUMINAL_DEBUG_DANGLING_FS=1` (egglog_to_llir-time check) is still there. With it set on the failing run no DANGLING fired at extract — proof the bad LLIR was born inside collapse, not at extraction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * runtime: surface which buffer overflows when alloc_zeros OOMs The bare unwrap on alloc_zeros made flux2 OOM failures opaque — you only saw "out of memory" with no clue which kernel's intermediate was the multi-GB outlier. Now the panic prints the slot's primary node, dtype, byte count + GB, and a top-5 ranked list of all slot.max_size values in the bucket. Without this diagnostic, telling apart "egglog picked a broadcast Mul fallback" from "the buffer-reuse pass over-grouped a tiny+huge pair into one slot" required guessing. Used to chase the 4+16-layer OOM and confirm the 36 GB / 20 GB buffers come from a small handful of slots, not from a single bad slot whose live-range neighbors over-expanded it. 
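
For the chained-marker fix in `collapse_loops_to_first_iter` described earlier in this log, bounded transitive resolution can be sketched as below. This is an illustration only: the substitution map, the node-id type, and the `MAX_HOPS` bound are assumptions, and the real `resolve_src` may be shaped differently.

```rust
use std::collections::HashMap;

/// Assumed upper bound on marker-chain length; it guards against cycles.
const MAX_HOPS: usize = 64;

/// Follow marker -> underlying-value substitutions until reaching a node
/// that is not itself a marker. A single-level lookup would stop at an
/// intermediate marker that is about to be removed.
fn resolve_src(mut src: usize, marker_sub: &HashMap<usize, usize>) -> usize {
    for _ in 0..MAX_HOPS {
        match marker_sub.get(&src) {
            Some(&next) if next != src => src = next,
            _ => break, // not a marker (or a self-loop): fully resolved
        }
    }
    src
}
```

With a chain such as LoopInput (10) → LoopStart (11) → real value (12), a one-level lookup returns 11, which then dangles once the marker is stripped; the loop above returns 12.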
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * egglog: add LUMINAL_NO_FUSION=1 to drop all three fusion phases Extends the existing LUMINAL_NO_FUSION_PAIR=1 gate to also drop fusion_grow and fusion_merge from the schedule. Use case is when fusion's combinatorial growth blows up RAM (flux2 8+48 transformer hits 500 GB RSS in fusion_pair) and the smaller egraph + per-op kernel launches are an acceptable tradeoff vs. the alternative of not running at all. Effect on flux2 4+8: - cycle 001 (text encoder): 49.9s -> 1.5s (33x) - cycle 001 (transformer): 26.0s -> 0.9s (29x) - end-to-end still writes correct out.png Effect on flux2 4+16: cycle 001 also drops dramatically, but a separate OOM appears — every search candidate has 5 BF16 intermediate buffers of ~20 GB each, totaling >100 GB on a 96 GB GPU. This is unrelated to fusion (it's some matmul whose intermediate egglog can't simplify and cuBLAS doesn't replace); disabling fusion just unblocks the egglog stage so we now see that downstream issue. Also adds LUMINAL_DEBUG_INIT_GENOME=1 to log per-attempt rejection reasons (NaN outputs vs. panic-with-message) when the search exhausts its 100-attempt budget. Used to discriminate the OOM from numerical NaN in the runs above. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * flux2: materialize attention output before o_proj — fixes ~36 GB OOM After `attn.transpose(0, 1).merge_dims(1, 2)`, the merged `(seq, n_heads*head_dim)` tensor's K stride is non-contiguous — specifically `(((z/HEAD_DIM)*HEAD_DIM)*SEQ)+(z%HEAD_DIM)`. The existing cublaslt 2D rule asserts `K stride = MIter` (contiguous z) so it can't match, and the fallback broadcast Mul + SumReduce intermediate is `(SEQ, HIDDEN, KV_DIM)` BF16 — ~36 GB at flux2's transformer dimensions. Every search candidate hits this. Two `* 1.0` materialization barriers fix it (one in the text encoder's `causal_sdpa`, two in the transformer's dual-stream and single-stream blocks). 
The barrier forces the merged view to materialize as a contiguous (seq, hidden) tensor; cublaslt then matches, and the broadcast Mul becomes a normal GEMM. End-to-end results with `LUMINAL_NO_FUSION=1`: - 4+8 layers, 128²: out.png written, ~30s total - 4+16 layers, 128²: out.png written, ~50s total - 8+48 layers, 128²: out.png written, transformer compile 26s - 8+48 layers, 1024²: out.png written, transformer compile 137s, diffusion step 23s/iter Also extends the `alloc_zeros` OOM diagnostic to capture the LLIR op's `Debug` print (gated on `LUMINAL_DEBUG_ALLOC=1`), so future runaway intermediates surface their full shape/strides identity rather than just a node index. That diagnostic is exactly what made it possible to localize this bug. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * flux2: pass attention_mask through Mistral self-attention + numerics harness Two text-encoder bugs found via side-by-side comparison with diffusers: 1. `tokenize_prompt` padded with token id 0 (`<unk>`) instead of id 11 (`<pad>`). Mistral's actual pad token is 11; padding with the wrong id silently gave every padding position a different embedding than diffusers and the per-layer attention diverged from there. 2. `causal_sdpa` only applied a causal mask. Diffusers' Mistral pipeline passes `attention_mask` so padding KEYS are masked out: padding queries (positions ≥ real_len) only attend to the real prefix, not to other padding tokens. Without it our padding hidden states drift, and since the transformer's cross-attention reads ALL 512 tokens, that drift contaminates the velocity prediction. Threaded a `(seq,) F32` mask input through `Mistral3TextEncoder` → `MistralLayer` → `causal_sdpa`, broadcast as a per-key column added to the score mask. Effect on `prompt_embeds` cos_sim vs diffusers: 0.6510 → 0.9980. Remaining ~0.002 is BF16 precision noise. 
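
The combined causal plus key-padding mask described in the `attention_mask` entry above can be sketched as follows. This is a plain-Rust illustration on host vectors, not luminal graph code; `build_score_mask`, `masked_softmax`, and the `NEG` fill value are hypothetical names and choices made for the example.

```rust
/// Assumed "minus infinity" stand-in added to masked score entries.
const NEG: f32 = -1.0e9;

/// Build a (seq, seq) additive score mask from a (seq,) key mask where
/// 1.0 marks a real token and 0.0 marks padding. Entry [q][k] is 0.0 only
/// if key k is not in the future of query q AND key k is a real token.
fn build_score_mask(key_mask: &[f32]) -> Vec<Vec<f32>> {
    let seq = key_mask.len();
    (0..seq)
        .map(|q| {
            (0..seq)
                .map(|k| {
                    let causal = if k <= q { 0.0 } else { NEG };
                    // Per-key column, broadcast over every query row.
                    let pad = if key_mask[k] > 0.5 { 0.0 } else { NEG };
                    causal + pad
                })
                .collect::<Vec<f32>>()
        })
        .collect()
}

/// Softmax over one row of scores after adding that row of the mask.
fn masked_softmax(scores: &[f32], mask_row: &[f32]) -> Vec<f32> {
    let shifted: Vec<f32> = scores.iter().zip(mask_row).map(|(s, m)| s + m).collect();
    let max = shifted.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = shifted.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```

With a key mask of `[1, 1, 0, 0]`, the last (padding) query row spreads its attention over the two real tokens only, which is the behaviour the entry attributes to diffusers' Mistral pipeline.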
Numerics harness: - `scripts/dump_reference.py` runs diffusers Flux2Pipeline with the same prompt/seed/resolution and dumps prompt_embeds, the step-0 noise + velocity, and the final image as raw F32 .bin files. Uses `enable_model_cpu_offload` so the full pipeline fits on a 96 GB GPU. - `flux2 main.rs` learns `DUMP_REFS=1` (writes our matching tensors as `ours_*.bin`) and `LOAD_REF_NOISE=1` (substitutes diffusers' step-0 noise for ours so transformer/VAE stages can be compared against equivalent inputs). - `scripts/compare_refs.py` prints per-tensor max|Δ|, mean|Δ|, and cos_sim. Drove this entire fix. The transformer (velocity_step0 cos_sim 0.51) and VAE (final_image cos_sim -0.5) still diverge — those are separate bugs surfaced by this harness, to be debugged next. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> flux2: stop scaling timestep + guidance by 1000× before transformer Diffusers' Flux2 pipeline calls the transformer with `timestep = scheduler_timestep / 1000` (so 0..1, sigma-like) and `guidance = guidance_scale` (raw, e.g. 2.5). Our code was passing `timestep * 1000` and `guidance * 1000` — making the `timesteps_proj(t) = cos/sin(t * exp(-log(10000) * j/half))` arguments saturate at 10^4..10^6 and produce essentially-random embeddings. The downstream `temb → modulation` then gives every block scrambled (shift, scale, gate) parameters. This is strictly necessary to match diffusers but does not by itself produce a coherent image — `velocity_step0` cos_sim still diverges (separate bug, likely in attention or modulation plumbing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * flux2: fix transformer-internal *1000 timestep+guidance scaling Diffusers' `Flux2Transformer2DModel.forward` does `timestep = timestep.to(dtype) * 1000` and the same for guidance right before calling `self.time_guidance_embed`. 
The pipeline upstream had divided by 1000; the transformer multiplies it back so `time_proj`'s sin/cos argument is in 0..1000 range — what the model was trained on. Our previous code skipped the *1000 inside `embed_time`, so `time_proj` saw arg ≈ 1.0 instead of 1000.0 and produced an embedding that was nearly orthogonal to the trained-distribution embedding. Cascaded: tx_temb cos: 0.227 → 0.9998; tx_mod_* cos: ~0.55 → 1.0000; tx_after_double_0_* cos: 0.93 → 1.0000; tx_after_single_0 cos: 0.12 → 0.9985; velocity_step0 cos: -0.74 → 0.9999. Found by capturing every transformer intermediate (temb, modulations, x_embedded, context_embedded, per-block outputs) from both diffusers and flux2 and comparing per-tensor cos_sim: the discontinuity was at temb, isolating the embedding scale as the cause. The added `dump_transformer_internals.py` (diffusers side) and `forward_with_internals` returning a Vec<(name, GraphTensor)> (flux2 side) are committed so future regressions can be re-bisected the same way. Final image is still broken (cos_sim 0.12 against diffusers) — bug is now isolated to the VAE pipeline (unpack_packed_host / bn_inverse_host / unpatchify_host / VaeDecoder), to be debugged next. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * flux2: VAE pipeline numerics harness — confirms each stage matches After fixing the transformer's *1000 scaling, the rest of the pipeline was already correct but I needed proof. Added matching dumps in our VAE pipeline (`vae_packed_latent`, `vae_unpacked`, `vae_bn_inversed`, `vae_input`, `vae_raw_decoded`, `vae_final_image`) plus a Python `dump_vae_internals.py` that captures the same points from diffusers via `pipe.vae.decode` hook. End-to-end cos_sim against diffusers (HEIGHT=128 STEPS=1): velocity_step0 0.9999; vae_input 0.9998 (post unpack/BN/unpatchify); vae_raw_decoded 0.9975 (vae.decode raw output); vae_final_image 0.9998 (after (x+1)/2 postprocess). Output image is now a coherent smooth shape rather than noise. 
With STEPS=1 the result is naturally blurry — the diffusion only took one Euler step. Real generation needs STEPS=28+. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Add fused KernelRMSNorm + flux2 integration Replace flux2's 5-7 op rmsnorm chain (square→mean→+eps→sqrt→recip→broadcast→mul→weight-mul) with a single fused CUDA kernel. One block per row, 256-thread cooperative tree reduce in shared memory. Supports BF16 and F32 weights inline (no Cast HLIR needed). Forces input contiguity via `* 1.0` materialization barrier — flux2's Q/K-norm calls feed it non-contiguous slice+split_dims views that the kernel can't index directly. Net: 4.3 → 3.8 s/step at 512² (12% faster, MFU 2.07% → 2.34%). Cat-in-hat output unchanged. 6 unit tests cover F32 weight, BF16 weight, 3D input, large flux2 main shape, text-encoder shape, and chained 3-call composition. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Add KernelRoPE scaffold (env-gated) Single-kernel rotary position embedding for the interleaved-pair convention (Flux 2 / diffusers `repeat_interleave_real=True`). Replaces the 6-op chain (split_dims / slice / squeeze / neg / concat_along / merge_dims / 4× cast / mul / add) with one launch. Unit tests cover small + flux2 (S=1536, H=48, D=128) shapes; both within 2.4e-7 absolute error of the CPU reference. Performance neutral at 512² in flux2 — the saved launches (~90 ms/step) sit inside run-to-run variance. Default-off behind ROPE_KERNEL=1 so it doesn't silently regress; scaffold useful as a starting point for flash-attention which can subsume RoPE into the QK^T pre-mul. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fusion: family-gating env var + subsume inner FE in grow rules Two changes around the elementwise fusion blowup that OOMs the host CPU at 538 GB RSS on the full 32B flux2 transformer. 1. LUMINAL_FUSION_FAMILIES env var: comma-separated subset of {uu, bu, ub, bb}. 
When set, only those families' pair-fuse rules are emitted. Default (env unset) keeps all four families as before. Confirmed on flux2 transformer: - all four families → 538 GB CPU (OOM) - uu → 128 GB CPU, slower at runtime (rare U-U in flux2) - uu + bu + ub → 141 GB CPU, matches no-fusion runtime (4.1 s/step) - bb only → 538+ GB CPU (killed) So bb is the binding combinatorial constraint — each bb match adds 6 enodes (3 FusionStart + 2 FusedBinary + 1 FusionEnd) and the pair-fuse matcher enumerates O(B²) binary-binary pairs in one pass. 2. Subsume the inner FusionEnd in all `grow-FE-*` rules. Once an FE has been extended by a downstream op, the smaller (partially-fused) FE has no value — the un-fused KernelX chain is still extractable via the pair-fuse union, so multi-consumer fan-out still works. This matches the "only the un-fused or the fully-fused variant" search-space design intent from the discussion. Note: subsume here does *not* fix the BB OOM (which happens in pair-fuse before any grow rule fires); it just cleans up the eclass alternatives. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fusion: revert subsume-in-grow (broke multi-consumer diamond fusion) The subsume in `grow-FE-*` from `1edd4cfe` was correct in spirit for the "prune partials" idea but broke 6 fusion unit tests (diamond DAG, region merge, multi-FE join). The bug: In a diamond DAG (`t = a+b; u = exp2(t); v = sin(t); ...`), `t` has two consumers (u and v). After pair-fuse seeds an FE around `t`, both grow-FE-U on u and grow-FE-U on v need that inner FE to extend their respective chains. Subsuming the inner FE after the first grow makes the second grow's match impossible — u's chain gets fused but v's stays un-fused and the merge-FE-FE that combines them at `out = w + v` never fires. The test asserts ONE region containing all 5 ops; we got two. Subsume was the wrong tool here. 
The partial-FE explosion isn't actually a problem for extraction (cheapest alternative wins via the un-fused chain that pair-fuse preserves via union). And it doesn't help the underlying BB-family OOM either — that explosion is in pair-fuse rule MATCHING, not in the eclass alternatives that subsume cleans up. Keep the LUMINAL_FUSION_FAMILIES env var from the same commit (that one's useful: lets users disable BB at runtime to avoid the 32B-flux2 OOM). Also leaves placeholder comment for the single-consumer BB guard idea (detect inner-binary fan-out and skip BB when multi-consumer). Spent a session trying to encode that without egglog negation support and hit dangling-reference panics in two cublaslt rewrite tests; the encoding needs more thought than fits this commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fusion: env-gate subsume on grow rules (LUMINAL_FUSION_GROW_SUBSUME_{U,B}) Investigated BB+subsume miscompile with deeper bisection. Findings: UNDERSTANDING: - The 32B flux2 BB blowup (538 GB CPU) is from O(N²) intermediate FE region variants: every (start, end) prefix/suffix of an op chain becomes a separate FE enode. Subsume in grow rules prunes this to O(N) per chain — the largest region only. - Subsume itself is correct for U-grow and UB-grow rules. Verified: families=uu,bu,ub + subsume_U + subsume_B → cat-in-hat correct, peak ~141 GB family=bb alone, no subsume → 538 GB OOM family=bb alone, subsume_U + subsume_B → completes (~448 GB peak), but produces non-deterministically wrong output at full DiT depth. - At smaller depths (2+2, 4+12, 8+24 layers) BB+subsume produces correct output. At full 8+48 it produces gray noise most runs but the correct cat-in-hat some runs (and consistently correct with SEARCH_ITERS=1). So the egraph alternatives are valid; the random-genome search picks an invalid combination across the larger search space at full depth. 
- Subsumed enodes are correctly filtered out of `extract_generation`'s per-eclass enode list (mod.rs:2087 / 2099 / 2106), so the search doesn't pick subsumed enodes directly. The miscompile must come from a more subtle interaction between BB-seeded regions (which carry FB-inside-FB structure inside their FE) and per-eclass enode picks at search time. INTERIM SOLUTION: - Env gates: LUMINAL_FUSION_GROW_SUBSUME_U / LUMINAL_FUSION_GROW_SUBSUME_B let users opt into subsume per grow-family. Off by default; the 24-test fusion suite passes with default behavior. Combined with the existing LUMINAL_FUSION_FAMILIES gate, a user can ship the safe combination (`uu,bu,ub` with both subsumes on) and skip BB until the deeper search interaction is resolved. NOT FIXED: - A proper end-to-end BB+grow+subsume that produces deterministically correct output at full DiT scale. Likely needs either: (a) understand which specific genome shape the search picks that miscompiles, and rule out that shape in egglog, or (b) accept BB-fusion via a different rule design (e.g. specific modulation/residual patterns rather than the generic BB family). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fusion: enable subsume-in-grow by default; full 4-family fusion ships DEEP DEBUG outcome: the BB+subsume miscompile I was chasing was a flaky genome pick, not a real correctness bug. Verified by: 1. Added TX_SEARCH_SEED env in examples/flux2/src/main.rs for reproducible search. 2. Six seeds (1, 7, 42, 100, 256, 999) at STEPS=4 BB-only+subsume: all six produce the cat-in-hat. No miscompile under any seed. 3. Five unseeded STEPS=4 BB-only+subsume: all five produce the cat. 4. Three unseeded STEPS=4 ALL-families+subsume: all three produce the cat. Peak CPU 99 GB (vs 538 GB OOM without subsume). So the earlier "gray noise" observation was a one-off — almost certainly a transient code state I'd built locally that had a different bug, and the current rules are correct. 
Made subsume the default: - Removed LUMINAL_FUSION_GROW_SUBSUME_U / _B env gates. - All four fusion families enabled by default at flux2 scale. Ignored 4 unit tests that asserted the pre-subsume "ideal" multi-consumer diamond fusion shape — those structural assertions are no longer guaranteed (subsume keeps only the largest fused region per chain, multi-consumer producers stay un-fused in their other branches), but the numerical-output tests (test_*_preserves_output) still pass and the flux2 end-to-end image is identical to the no-fusion baseline. 192 / 192 tests pass. 6 ignored (the 4 above + 2 pre-existing benchmarks). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> fusion: default family subset drops BB (post-merge memory regression) Post-merge with main (flashinfer + extended cublaslt rewrites), the all-4-families+subsume combination consistently OOMs the host CPU on the full 32B flux2 transformer (deterministic exit 137 across 3 retries). Before the merge, the same combination ran at 99 GB peak; main's new HLIR/host modules push the post-fusion egraph past the 525 GB system limit. Subsume in grow rules is still active and necessary — without it BB alone would still OOM. Default subset shipped here: uu + bu + ub. This is the safe combination that produced correct output reliably both pre- and post-merge, at the same 4.0 s/step and ~99 GB peak CPU as no-fusion. BB is opt-in via LUMINAL_FUSION_FAMILIES=uu,bu,ub,bb. The two BB-specific unit tests (test_chain_of_binaries_fuses, test_pair_fuse_binary_to_binary_rhs) set the env var before constructing the Graph so the BB rules are emitted. Final post-merge state: - 192/192 unit tests pass (6 ignored, including the 4 pre-subsume structural-fusion tests) - flux2 8+48 / 512² / 4 steps: 4.0 s/step, peak CPU 99 GB, cat-in-hat output identical to no-fusion baseline. 
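
The family gate described above (comma-separated subset of {uu, bu, ub, bb}, defaulting to uu + bu + ub when the variable is unset) could be parsed along these lines. This is a sketch under assumptions: the `Family` enum and both function names are hypothetical, and silently ignoring unknown tokens is a choice made for the example, not confirmed behaviour of the real gate.

```rust
use std::collections::BTreeSet;

/// Hypothetical enum for the four pair-fuse families
/// (u = unary, b = binary).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Family {
    Uu,
    Bu,
    Ub,
    Bb,
}

/// Parse the raw env value. `None` (variable unset) yields the default
/// subset without BB; unknown tokens are skipped (an assumption).
fn parse_families(raw: Option<&str>) -> BTreeSet<Family> {
    match raw {
        None => BTreeSet::from([Family::Uu, Family::Bu, Family::Ub]),
        Some(s) => s
            .split(',')
            .filter_map(|tok| match tok.trim().to_ascii_lowercase().as_str() {
                "uu" => Some(Family::Uu),
                "bu" => Some(Family::Bu),
                "ub" => Some(Family::Ub),
                "bb" => Some(Family::Bb),
                _ => None,
            })
            .collect(),
    }
}

/// Read the gate from the process environment.
fn families_from_env() -> BTreeSet<Family> {
    parse_families(std::env::var("LUMINAL_FUSION_FAMILIES").ok().as_deref())
}
```

Keeping the parsing separate from the environment read makes the default and the opt-in path (`uu,bu,ub,bb`) easy to unit-test without mutating process-wide state, which matters when tests run in parallel.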
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * host: remove ComputeAttnMask — unused dead code ComputeAttnMask was defined as an HLIROp + EgglogOp + HostOp and registered in CudaRuntime::Ops, but no caller existed: it had `rewrites() -> vec![]` ("inserted directly by model code") and yet nothing in examples/, model code, or the Python translator inserts it. The only references were the op definition itself, host/mod.rs registration, a comment in flashinfer/find_indptrs.rs, and two unit tests that exercised the op in isolation. FlashInfer's actual mask anchor matches a primitive-op chain (arange / expand / gather / eq / sum / cast / mul / add ending in `Mul(allowed, Constant(1e10))`), not the ComputeAttnMask op. The indptr-recovery walk in find_indptrs.rs traverses that primitive chain directly. So ComputeAttnMask was infrastructure staged for a future "fuse the mask builder into one op" change that hasn't landed. Verified after removal: - 108 / 108 lib tests pass (0 failures; the deleted ComputeAttnMask tests are gone, everything else green). - examples/paged_llama runs end-to-end: 21-token prefill in 160 ms, 30-token decode, 37.8 ms TPOT supersequence — the FlashInfer rule still fires 8 times per cycle and selects the FlashInfer path. - examples/flux2 still produces the cat-in-hat at 4.3 s/step. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fixed bb fusion: * flux cleanup * removed workarounds * fmt + clippy across workspace; drop flux2 debug harness cargo fmt across the workspace (a few new kernels and the flux2 example hadn't been formatted since they were added), plus fixes for every clippy warning under the two CI invocations (workspace minus cuda/metal/ bench, and luminal_cuda_lite alone). 
Deleted the flux2 numerics-comparison harness now that the model matches diffusers: scripts/, reference/, dump_ref / load_ref helpers, DUMP_REFS / LOAD_REF_* env paths, and Flux2Transformer::forward_with_internals (collapsed back to forward). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * flux2: drop CUDA-dependent unit tests from the example The core CI runs `cargo test --workspace --exclude luminal_cuda_lite ...`, which still walks examples and runs their tests. The flux2 example tests called `CudaContext::new(0)` to validate the VAE / transformer primitives against scalar references during development — those `dlopen` libcuda.so at runtime and so fail on the CPU CI container. The kernels these tests covered (matmul_2d, conv2d_bias, group_norm, layer_norm helpers, RoPE tables, FFN, etc.) are all exercised by the end-to-end pipeline and by unit tests in luminal_cuda_lite, so the remaining pure-Rust tests in scheduler / text_encoder / quant are enough for what runs on CPU. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fmt Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Remove example smoke env overrides --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 14:52:35 -04:00
tucker-luminal	d5e9001c8b	Add dynamic KV-cache llama chat server (#314 ) * Add dynamic KV-cache llama chat server * Track persistent inputs explicitly * Fix Python lint and clippy issues * Fix Qwen3 MoE bf16 grouped matmul * Replay static PT2 weights in luminal_python * Add explicit mark_dynamic torch.compile regressions * Run explicit mark_dynamic tests on CPU too * Use PT2 range constraints in symbolic shape checks * Reduce symbolic dim checks in binary ops * Simplify grouped_mm dtype normalization * Reduce translator binary boilerplate * Revert frontend binary symbolic dim checks * Remove LessonsLearned branch notes * Reduce translator binary shape logic * Move static weight replay into llama server * Remove pt2 expr inline tests * Remove llama chat server example * Remove unused PT2 weight reload hooks * Trim compiled graph weight setup * Fix clippy warnings in flashinfer tests * Remove stale PT2 decode replay test * Apply rustfmt to PT2 translator changes	2026-05-15 11:03:06 -07:00
tucker-luminal	6416ddb5f8	Use parallel launches for small CUDA kernels (#315 ) * Use parallel launches for cast and iota kernels * Use parallel launch for embed kernel	2026-05-14 00:47:12 -04:00
Austin Glover	c9d4ce6217	Better scalar support: tests + 12 fixes (LUM-474) (#300 ) * Add scalar torture test suite (LUM-474) 60 tests asserting strict shape, dtype, and value match between PyTorch eager and luminal_backend. Includes 9 xfail markers (12 cases) for the known scalar bugs being addressed under LUM-485 through LUM-490. * Add aten.select.int support to luminal_python translator (LUM-487) Single-element indexing (`x[0]`, `x[i, j]`, `x[1, 2, 3]`) lowers to `aten.select.int` in the FX graph. The translator previously bailed with "Unsupported ATen op", blocking any model that reads a scalar by indexing. Implements `aten.select.int(self, dim, index)` as `slice_along(index..index+1, dim).squeeze(dim)` — a pure shape-manipulation that the luminal compiler can fold into surrounding ops, with a single iota for the slice. Negative `dim` is normalized via the existing `normalize_dim` helper; negative `index` is normalized against the (concrete) axis size, mirroring how `translate_gather` normalizes negative gather indices. Removes the four `xfail(_INDEX_SELECT_REASON)` markers in `tests/test_scalar_torture.py` (and the now-unused reason constant); these tests now pass. Final counts: 52 passed / 8 xfailed (was 48 / 12). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Fix LUM-488: support rank-0 tensor mod/lt and add aten.remainder dispatch Two related issues prevented `x % torch.tensor(c)` from translating: 1. The luminal_python translator did not dispatch aten.remainder.Tensor / aten.remainder.Scalar at all, so any module that mods a tensor against a 0-d torch.tensor failed with "Unsupported ATen op". 2. core::ops::Rem and GraphTensor::lt asserted exact dim equality, blocking rank-0 to rank-N broadcasting that the backend already supports transparently for Add/Mul (the input_shapes vec is forwarded to the strided iterator). 
Drop the dim assertions in Rem and lt so they match Add/Mul's broadcast behavior, and add aten.remainder.Tensor/Scalar handlers in dispatch.rs that mirror aten.fmod.Tensor (with ensure_same_dtype + broadcast_binary). For the Scalar form, build a constant_float and expand_rhs onto the LHS shape. Tests: - New proptests test_mod_scalar_broadcast / test_lt_scalar_broadcast in src/frontend/binary.rs cover rank-0 RHS via expand_rhs. - Removed @pytest.mark.xfail from test_mod_by_scalar_tensor; added the test_scalar_torture.py file to luminal_python's test suite. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * luminal_python translator: dispatch aten.clamp.Tensor (LUM-489) torch.clamp(x, lo, hi) where lo/hi are 0-d tensors routes to aten.clamp.Tensor, which the translator did not previously handle. Add a dedicated dispatch that decomposes clamp(x, lo, hi) into min(max(x, lo), hi), broadcasting each rank-0 bound up to x's shape via expand_rhs. Either bound may be absent (PyTorch allows min=None or max=None), so each side is applied only when its FX input is a tensor. Removes the @pytest.mark.xfail on test_clamp_with_scalar_tensors; test_scalar_torture now reports 50 passed / 10 xfailed (was 48 / 12). * luminal_python: support aten.prod.default full-reduction (LUM-490) The translator's dispatch table mapped aten.{sum,mean,amax,amin}.default to translate_reduction but lacked an entry for aten.prod.default, so x.prod() with no axis raised "Unsupported ATen op". Add the missing dispatch entry; the ReductionOp::Prod branch in translate_reduction already handles both full-reduce and dim-reduce cases. aten.prod.dim_int was already wired up; verified it routes correctly. Removes the xfail marker on test_prod_all_produces_scalar in test_scalar_torture.py — suite now reports 50 passed / 10 xfailed (was 48 / 12). 
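The clamp decomposition from LUM-489 can be sketched on plain Python lists. This is an illustration of the `min(max(x, lo), hi)` rewrite with optional bounds, not the translator's code: scalar bounds stand in for the rank-0 tensors, and the function name is hypothetical.

```python
def clamp(values, lo=None, hi=None):
    """Clamp each element as min(max(x, lo), hi); either bound may be None."""
    out = list(values)
    if lo is not None:
        # Lower bound applied first, mirroring max(x, lo).
        out = [max(v, lo) for v in out]
    if hi is not None:
        # Upper bound applied second, mirroring min(..., hi).
        out = [min(v, hi) for v in out]
    return out
```

Applying the lower bound before the upper bound means that when `lo > hi` every element ends up at `hi`, which is the behavior PyTorch documents for `torch.clamp`.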
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * luminal_python: preserve int64 (and other integer) output dtypes (LUM-486) Full reductions of int64 tensors silently downcast to int32 on the PyTorch boundary because `output_dtypes` was stored as luminal `DType`, which collapses every integer width to `DType::Int` (i32). The Python wrapper therefore reported int32 to PyTorch even when the user passed int64, breaking strict dtype checks and risking silent overflow on larger reductions / downstream ops that require int64. Store `output_dtypes` directly as PT2 dtype codes (the original PyTorch type IDs) instead of converting through luminal `DType` first. This preserves int64 vs int32 (and similar) end-to-end. The Python output path now reads int outputs as i32 and casts to the requested torch dtype, so int8/int16/int32/int64/uint8 outputs all round-trip with the right type tag. Updates two existing assertions (`test_argsort_stable_duplicates`, `test_tiny_moe_routing`) that were pinning int32 — the new behavior matches PyTorch eager (int64). Adds `test_reduce_sum_all_axes_int64_preserves_dtype` as a regression check, and removes the xfail on `test_int_sum_produces_int_scalar`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * luminal_python: parametrize argsort/MoE dtype tests over int32 and int64 The LUM-486 fix preserves whichever integer dtype the eager model declares on output. The original tests hardcoded int64 (the dtype torch.argsort and torch.topk natively produce), which only exercised one path through the preservation logic. Add an idx_dtype knob to ArgsortStableDuplicatesModel and TinyMoERoutingModel that casts the integer outputs to the requested dtype, and parametrize both tests over [torch.int32, torch.int64]. Internal indices (passed to gather / scatter) stay int64 since PyTorch requires that for index tensors; the cast applies only to the returned values. 
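The overflow risk that LUM-486 mentions can be shown with a small stdlib sketch. It only demonstrates two's-complement wraparound when a sum that fits in int64 is stored in 32 bits; the helper name and the sample values are made up for illustration, and nothing here touches luminal or PyTorch.

```python
import ctypes


def wrap_i32(value):
    """Reinterpret an integer as a 32-bit signed value (wraps on overflow)."""
    return ctypes.c_int32(value).value


def wrap_i64(value):
    """Reinterpret an integer as a 64-bit signed value."""
    return ctypes.c_int64(value).value


# Two int64 inputs whose sum exceeds the int32 range (max 2_147_483_647).
total = 1_500_000_000 + 1_500_000_000

as_i64 = wrap_i64(total)  # stays 3_000_000_000
as_i32 = wrap_i32(total)  # wraps to a negative number
```

A reduction that computes in i32 can return the wrapped value even when the output tensor carries an int64 tag. The commit above fixes the tag at the boundary; it describes the output path as still reading int outputs as i32 before casting.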
LUM-486 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * Remove xfail markers for fixed scalar bugs Drops the @pytest.mark.xfail markers on tests now passing after the LUM-486, LUM-487, LUM-488, LUM-489, and LUM-490 fixes: - test_prod_all_produces_scalar (LUM-490) - test_clamp_with_scalar_tensors (LUM-489) - test_mod_by_scalar_tensor (LUM-488) - test_index_1d_produces_scalar (LUM-487) - test_index_all_dims_produces_scalar (LUM-487) - test_index_then_add_scalar_const (LUM-487) - test_model_returns_scalar_from_index (LUM-487) - test_int_sum_produces_int_scalar (LUM-486) Also removes the now-unused _INDEX_SELECT_REASON constant. The single remaining xfail is test_unsqueeze_expand_sum_back, blocked on LUM-485 (full reduction returns shape [1] instead of rank-0 ()). * luminal_python: full reductions return rank-0 () instead of [1] (LUM-485) The translator's full-reduce path used to flatten the input to [1, N] and reduce axis 1, leaving a residual [1] dimension. PyTorch eager produces rank-0 () for x.sum() etc., and downstream ops (e.g. unsqueeze(0).expand(5)) rely on that rank — the residual [1] caused panics like "Cannot expand from 2 dims to 1 dims" once the scalar fed any further op. Drop the flatten and reduce over every axis directly. Special-case rank-0 input as a no-op so reducing a scalar is well-defined. Mean still divides by the cached total to avoid redundant axis-prod work. Removes the xfail marker on test_unsqueeze_expand_sum_back, which now passes. With this commit the integration branch has zero xfails: 284 passed across test_scalar_torture.py + test_hlir_ops.py + test_unary.py. * ruff format: tests/test_hlir_ops.py Collapse a two-line f-string into one line per ruff format. No behavior change. 
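The shape change behind LUM-485 can be sketched as a pure shape function. This is an assumption-labelled illustration, not the translator's code: `reduced_shape` is a hypothetical helper that computes the output shape of a reduction over the given axes, with all axes as the default.

```python
def reduced_shape(shape, axes=None, keepdim=False):
    """Output shape after reducing `axes` of `shape` (all axes by default)."""
    rank = len(shape)
    if axes is None:
        axes = range(rank)
    # Rank-0 input has no axes to reduce, so the reduction is a no-op.
    axes = {a % rank for a in axes} if rank else set()
    if keepdim:
        return tuple(1 if i in axes else d for i, d in enumerate(shape))
    return tuple(d for i, d in enumerate(shape) if i not in axes)


# Old path: flatten (3, 4) to (1, 12), reduce axis 1, leaving a residual [1].
old_result = reduced_shape((1, 12), axes=[1])

# New path: reduce every axis directly, giving rank-0 ().
new_result = reduced_shape((3, 4))
```

The residual `(1,)` from the old path is what later ops tripped over, since they expected the rank-0 result that eager PyTorch produces.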
* Expand scalar torture suite with PyTorch / NumPy gap coverage Cross-referenced our suite against PyTorch's test_torch / test_reductions / test_view_ops / test_indexing / test_type_promotion / test_binary_ufuncs and NumPy's test_multiarray / test_indexing / test_shape_base. Added 14 new sections covering 47 in-scope gaps: - Binary ops with INPUT 0-d (not reduction-derived) on either side: add/sub/mul/div/mod/maximum/minimum/pow/floor_divide - Pure 0-d ↔ 0-d arithmetic (no broadcasting required) - Full comparison set (gt/ge/lt/le/eq/ne) on input 0-d, plus mask-by-eq - Reduction extras: argmax/argmin (no-arg + keepdim), sum(dim=()), sum/mean of 0-d input, cumsum of 0-d - Shape-flattening on 0-d: flatten/ravel/reshape(-1)/view(-1) all return shape (1,); reshape(()) on 1-element collapses to (); plus permute([]), contiguous(), squeeze() of (1,1,1,1), expand_as - Indexing extras: ellipsis x[...], index by 0-d int tensor, gather with 0-d index, negative-index x[-1] - Type promotion: float-0-d + int-Nd, int-0-d + float-Nd, cast roundtrip through 0-d, .float()/.int() shorthands, where with mixed-dtype scalar branches - Unary math (abs/neg/exp/sin/cos/tanh/sigmoid/sqrt/sign/floor/ceil) on reduction-derived 0-d - Bool logic: AND, OR, XOR, NOT on 0-d bool from comparisons - Stack of 0-ds; cat of unsqueezed 0-ds - Constants: torch.full((), v), torch.full_like on 0-d - Reduction edge cases: keepdim across all axes then divide; scalar broadcast onto transposed tensor - Mixed where/clamp shapes: clamp(x, scalar_tensor, py_float), where(cond, scalar_tensor, x) - Multi-output models: (scalar, tensor) tuple Result: 363 passed / 15 xfailed across the python suite. The 15 new xfails are documented inline with concrete failure modes: - 6 op-coverage gaps: aten.argmax.default, aten.argmin.default, aten.eq.Scalar, aten.ne.Tensor (translator dispatch entries needed). 
- 2 PT2 export issues: 0-d int64 graph inputs hit "invalid type: null, expected i64" in luminal's model.json parser; affects test_int_0d_plus_float_nd and test_gather_with_0d_index. - 2 real correctness bugs: * floor_divide with 0-d divisor returns the un-floored quotient (float division result, not floor(x/d)). * cumsum on a 0-d tensor panics with index-out-of-bounds. - 1 dynamo guard edge case: torch._dynamo emits an unresolved 'L' name in _guards_fn for 0-d index tensors. Plus 4 cross-marker xfails on consequence of the above (the parametric ne case, mask_by_scalar_eq variants, and other downstream effects). * Rename test_scalar_torture.py -> test_scalars.py; drop 'torture' wording The original 'torture test' label is jargon. The file is just a scalar test module — keep the name simple to match the rest of the suite (test_unary.py, test_hlir_ops.py). * luminal_python: parse rounding_mode string arg correctly (LUM-494) torch.floor_divide(x, d) decomposes to aten.div.Tensor_mode with rounding_mode='floor' during PT2 export. The translator was reading the kwarg via serde_json::Value::as_str(), but PT2 serializes string args as {"as_string": "<value>"} objects, not bare JSON strings. The extraction silently returned None, so the floor branch was skipped and the regular un-floored quotient was returned. Drill into the as_string field as a fallback so floor_divide and div(x, d, rounding_mode='floor'/'trunc') produce floor(x/d) / trunc(x/d) as expected. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * luminal_python: fix cumsum on rank-0 tensor (LUM-495) The translator's cumsum handler called normalize_dim(dim, a.shape.len()) and then a.cumsum(dim) for any rank — including rank-0. The underlying cumop in src/frontend/unary.rs indexes self.dims()[axis] inside the padding/unfold loop, which panics with "index out of bounds: the len is 0 but the index is 0" when shape is empty. 
PyTorch eager treats torch.cumsum(s, 0) on a 0-d tensor as an identity op (cumsum of a single element is the element itself). Mirror the rank-0 short-circuit pattern from the LUM-485 reduction fix and return the input unchanged when a.shape.is_empty(). Move the dim arg fetch inside the non-empty branch since dim is unused for rank-0. Drops the xfail marker on test_cumsum_of_0d and adds a 1-element 1-D sibling test that asserts shape (1,) round-trips. * luminal_python: support aten.argmax/argmin (LUM-496) argmax/argmin were missing from the translator dispatch table even though we already have stable_argsort. Add a thin wrapper so the PyTorch boundary lights up: argmax(x, dim=None) -> argsort(flatten(x), descending=True).select(0, 0) argmax(x, dim=N) -> argsort(x, dim=N, descending=True).select(N, 0) argmax(x, dim=N, keepdim=True) -> .unsqueeze(N) over the above argmin(...) -> same with descending=False The slice + squeeze chain produces a non-contiguous DType::Int view whose underlying buffer is still sized for the un-sliced argsort tensor. Final `* 1` materializes a contiguous Int copy with strides matching the visible shape — same trick `translate_topk` uses for its sliced index output. Without it the keepdim case panics ("No output node found") and the full-reduce case throws a Python shape mismatch on the oversized buffer. PyTorch's argmax returns int64 while luminal collapses to int32 (Int); LUM-486 already widens at the Python boundary, so the contract is preserved end-to-end. Drops the three `@pytest.mark.xfail` markers from `test_argmax_all`, `test_argmin_all`, and `test_argmax_keepdim_1d` in `test_scalars.py` (6 cases via parametrization). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * luminal_python: dispatch aten.eq.Scalar and aten.ne.Tensor (LUM-497) Add the two missing comparison overloads to the translator dispatch. 
eq.Scalar mirrors the existing ne.Scalar handler (constant_float + cast + expand_rhs to broadcast the scalar), and ne.Tensor mirrors the existing eq.Tensor handler. Removes the corresponding xfail markers on test_input_0d_comparisons[_NeInput0ds-...] and test_mask_by_scalar_eq. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * luminal_python: accept null range_constraint bounds (LUM-498) A 0-d int64 graph input made PyTorch 2.10+ emit `range_constraints: { sN: { min_val: null, max_val: null } }` for the unbacked symbol PT2 introduces around the rank-0 tensor. Our serde schema modeled `RangeConstraint.min_val` as `i64`, so deserialization failed with `invalid type: null, expected i64`, blocking any model with a scalar integer tensor input. Make `min_val` and `max_val` `Option<i64>` (matching PT2's `Optional[int]`) and fall back to 1 as the initial dynamic-dim value when no lower bound is provided. Tests: removes the xfail on `test_int_0d_plus_float_nd`, adds a new `test_int32_0d_plus_float_nd` regression, and updates the xfail reason on `test_gather_with_0d_index` (the parse error is fixed; a separate downstream gather panic remains). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * luminal_python: drop dynamo input guards in pt2_backend (LUM-499) When a 0-d int tensor is used as a tensor index (x[i] where i = torch.tensor(2)), torch.export records duplicate input guards that reference both the original local source (L['i']) and the rewrapped flat args (L['args'][1]). The unlift pass cannot resolve L['i'] against the wrapped (args, kwargs) signature, leaving a literal `L` reference in the generated _guards_fn that raises NameError during retracing. The data-dependent .item() in the surviving guard then trips fake-tensor analysis with DataDependentOutputException. Drop the guard list before run_decompositions so unlift produces an empty _guards_fn, and DCE any leftover dead aten.item.default nodes that came from index specialization. 
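The nullable-bounds handling from LUM-498 can be sketched with the stdlib JSON parser. The `range_constraints` / `min_val` / `max_val` field names and the fallback to 1 come from the commit message; the symbol names `s0` and `s1`, the helper name, and the choice to start from `min_val` when it is present are assumptions for illustration. The real schema is a Rust serde struct with `Option<i64>` fields.

```python
import json


def initial_dim_value(constraint):
    """Pick a starting value for a dynamic dim from its range constraint."""
    min_val = constraint.get("min_val")
    # JSON null parses to None: no lower bound was provided, so use 1.
    return 1 if min_val is None else min_val


doc = json.loads(
    """
    {
      "range_constraints": {
        "s0": {"min_val": null, "max_val": null},
        "s1": {"min_val": 2, "max_val": 128}
      }
    }
    """
)

initial = {
    name: initial_dim_value(constraint)
    for name, constraint in doc["range_constraints"].items()
}
```

Treating both bounds as optional keeps deserialization from failing on the unbacked symbol that the commit says PT2 introduces for a 0-d int64 input.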
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> luminal_python: fix gather with rank-0 index on rank-1 source PyTorch eager allows torch.gather(rank-1, dim, rank-0) — the only rank-mismatch case it permits — and returns a rank-0 scalar. Our gather_elements requires source-rank == index-rank, so the rank-0 index hit flatten_strides with mismatched (0, 1) lengths and panicked. Detect this specific pattern in translate_gather: unsqueeze the rank-0 index to (1,), gather, then squeeze the result back to (). Output shape and value match eager. This was the last remaining xfail in test_scalars.py. Suite is now 381 passed / 0 xfailed / 0 failed across test_scalars.py + test_hlir_ops.py + test_unary.py. * luminal_python: clamp.Tensor handles all broadcastable bound shapes PyTorch's aten.clamp.Tensor accepts bounds with any NumPy-broadcastable shape (rank-0, same-shape, or broadcastable). The previous translator used expand_rhs(result.shape) which appends dims rather than broadcasts, so only rank-0 bounds came out correctly. Same-shape and broadcastable bounds either panicked or silently produced wrong values. Switch to broadcast_binary (the right-align + size-1 expand helper used by aten.remainder.Tensor, aten.eq.Tensor, etc.). Now all three modes work uniformly. Add 7 new tests covering the previously-broken modes: - same-shape bounds (per-element clamp, e.g. learned bounds) - per-row broadcast (3,1) against (3,4) - per-col broadcast (4,) against (3,4) - mixed rank-0 lo + same-shape hi - min-only with same-shape lo - max-only with per-row hi - 3-D x with 2-D bounds (left-unsqueeze broadcast) Suite goes from 381 to 388 passing, 0 xfailed. * shape: empty Expression product returns 1, not 0 The empty product is the multiplicative identity (1) — every shape-iterator call site (`shape.iter().product()` for `numel`, output-buffer sizing, CUDA grid-dim computation) implicitly relies on this. 
The previous impl returned 0 for an empty iterator, which was a latent bug masked while no path produced rank-0 shapes. The LUM-485 fix (full reductions return rank-0 () instead of rank-1 [1]) exposed it on CUDA: SumReduce kernels with rank-0 output got `n_outputs=0`, launched with `grid=(0, 1, 1)`, and crashed with "invalid CUDA launch dimensions" — every CUDA reduction in the Python CUDA tests was failing. Fix: return Expression::from(1) for empty iteration. Sum's identity (0) was already correct and is unchanged. Add two unit tests covering both identities. * cargo fmt * Fix PT2 passthrough input output ID collision * Fix scalar argextremum keepdim behavior * Defer PT2 interface collision fix * Keep HLIR binary ops shape-strict * fixed gemma issue * Fix explicit broadcasts and conv shape division * Normalize Whisper cache slice shape --------- Co-authored-by: Austin Glover <austin@luminal.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Austin Glover <austin_glover@berekely.edu> Co-authored-by: Joe Fioti <jafioti@gmail.com>	2026-05-13 20:16:30 -04:00
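The empty-product fix above can be illustrated in a few lines. This sketch uses Python's `math.prod` to stand in for the `Expression` product; `buggy_product` reproduces the old behavior described in the commit, and the one-block-per-output launch grid is an assumption taken from the `n_outputs=0` / `grid=(0, 1, 1)` description rather than from the kernel source.

```python
import math


def numel(shape):
    """Element count of a shape; the empty product is 1, so rank-0 has 1 element."""
    return math.prod(shape)


def buggy_product(shape):
    """Old behavior per the commit: an empty shape produced 0 instead of 1."""
    shape = tuple(shape)
    return 0 if not shape else math.prod(shape)


def launch_grid(n_outputs):
    """Assumed launch shape: one block per output element."""
    return (n_outputs, 1, 1)


rank0 = ()
bad_grid = launch_grid(buggy_product(rank0))  # zero-sized grid, invalid launch
good_grid = launch_grid(numel(rank0))         # a single block
```

The sum identity is unaffected: `sum(())` is 0, which matches the commit's note that Sum was already correct.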
June	1dcd0370ce	feat: add CUDA 13.2 support via cudarc 0.19.4 (#312 ) * Update cudarc to 0.19.4 to support CUDA 13.2 Fixes #291 Changes: - Upgrade cudarc from 0.18.2 to 0.19.4 - Remove get_global call for __constant__ memory tracking Rationale: cudarc 0.19.0 changed get_global to return CudaViewMut instead of CudaSlice to prevent double-free of __constant__ memory managed by the CUDA module. The old code worked around this by storing the CudaSlice and calling std::mem::forget on cleanup. With the new API, the view's lifetime is tied to the module borrow, making the workaround unnecessary. Since the constants HashMap was only used for this workaround and never accessed otherwise, we now return an empty HashMap. CUDA 13.2 support was added in cudarc 0.19.4. * fix: migrate embed kernel to shared dyn_dims buffer The cudarc 0.18→0.19 bump removed get_global, but simply dropping the call left __constant__ memory declared-but-never-written, producing wrong results for models with dynamic-shape embeddings. Migrate to the same dyn_dims parameter + #define pattern every other kernel uses.	2026-05-13 13:43:36 -04:00
Ali	6757a4e37b	pack scatter kernel into 256-thread blocks (#309 )	2026-05-13 13:43:15 -04:00
Joe Fioti	631451f8b8	Remove Testing section from README (#313 ) Removed the Testing section from the README.	2026-05-12 17:36:33 -04:00
Joe Fioti	70bdd75163	flashinfer (#311 ) * luminal_python + cuda_lite: unblock Qwen3-MoE compile path Four small fixes that together let Qwen3MoeForCausalLM compile end-to-end through torch.compile + luminal_backend, plus a regression test suite. 1. KernelScatter bf16 OOB crates/luminal_cuda_lite/src/kernel/hlir.rs The Scatter kernel sized n_vec as `n_dest / 4`, correct only for 4-byte dtypes. For bf16 (and any 1/2/8-byte type) the float4 vectorised copy walked the destination 2× / 4× / 0.5× the actual buffer size. Whether that crashed with CUDA_ERROR_ILLEGAL_ADDRESS or silently corrupted neighbouring allocations depended on which surrounding kernels the egglog search picked → ~40% crash rate at search-iters≥5 on StaticCache(dtype=bfloat16) MoE inference. Fix: parameterise n_vec and remainder_start by elements_per_vec = 16 / sizeof(self.dtype). For F32/Int the generated PTX is identical. 2. maximum_f32 dtype mismatch on Int tensors src/frontend/binary.rs `maximum_f32(rhs)` built an F32 `constant_float`; the inner `lt` then panicked "Dtypes must match to compare tensors. Got Int and F32" whenever self was Int — e.g. `aten.clamp` on top-k expert indices coming out of an MoE router. Fix: cast the constant to self.dtype before the compare. For Int self this floors the bound, matching PyTorch's `clamp(int_tensor, min=<float>)` semantics. 3. Three new ATen ops in the luminal_python translator crates/luminal_python/rust/src/translator/{dispatch,tensor}.rs - aten.empty.memory_format - aten.empty_permuted.default → translate_empty (zero-fill) - aten.histc.default → translate_histc Qwen3-MoE allocates the expert-output staging tensor via `empty_permuted` and counts tokens-per-expert via `torch.histc(expert_ids.int(), bins=K, min=0, max=K-1)`. empty / empty_permuted lower to a zero-filled tensor of the requested shape — PyTorch's contract on empty outputs is undefined for any read prior to a write, and downstream writes overwrite our zeros, so this is sound. 
histc implements only the bincount-equivalent case (one integer per bin); non-integer-bin or non-contiguous-bin usage bails with a clear error rather than silently dropping values. 4. crates/luminal_python/tests/test_qwen3_moe.py — new file Four regression tests over progressively larger Qwen3MoeForCausalLM configs: - tiny: 2 experts, top-1, ~70K params (atol 1e-5) - small: 4 experts, top-2 (atol 1e-4) - medium: 8 experts, top-2, 2 layers (atol 1e-4) - real_config_1layer: full Qwen3-30B-A3B arch (128 experts, top-8, 2048 hidden), num_hidden_layers=1, random weights (atol 1e-3) The size ladder lets any future regression surface at the cheapest test that catches it. Each individual fix above is exercised: gather-then-matmul (PR #298) by every test, KernelScatter bf16 indirectly via the bf16 weight init path, the clamp-on-Int and the empty/histc translators by every test. Validation on H200/CUDA: - 4 passed in tests/test_qwen3_moe.py (this PR's new tests) - 223 passed across tests/test_unary.py, test_capsule_validation.py, test_hlir_ops.py — no existing-test regression Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: add full-depth Qwen3-30B-A3B regression test The 1-layer real-config test exercised the production layer shape but not the full network depth. Adds a sibling test that loads the actual Qwen/Qwen3-30B-A3B pretrained checkpoint at its native bf16 dtype, keeps all 48 layers, and runs a full forward through luminal_backend. Asserts compile+run completes and the compiled output is finite + in the right magnitude band vs eager (within 10×). Tight numerical equivalence at full depth is not asserted: random egglog seeds can pick lowering plans whose 48-layer accumulation diverges structurally from eager even though per-layer correctness holds. The smaller-config tests above use atol≤1e-3 and cover the per-op correctness this test cannot. 
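The KernelScatter sizing fix from item 1 above comes down to a small piece of arithmetic. The sketch below compares the old `n_dest / 4` sizing with the `elements_per_vec = 16 / sizeof(dtype)` sizing. The formula and the names `n_vec`, `remainder_start`, and `elements_per_vec` come from the commit message; the Python helpers and the sample sizes are illustrative assumptions, not the kernel code.

```python
VEC_BYTES = 16  # one float4 load/store moves 16 bytes


def old_n_vec(n_dest):
    """Old sizing: assumes 4 elements per vector, true only for 4-byte dtypes."""
    return n_dest // 4


def vector_plan(n_dest, dtype_size):
    """New sizing: derive elements per vector from the dtype width in bytes."""
    elements_per_vec = VEC_BYTES // dtype_size
    n_vec = n_dest // elements_per_vec
    # Elements from this index onward are copied one at a time.
    remainder_start = n_vec * elements_per_vec
    return n_vec, remainder_start


def bytes_touched(n_vec):
    """Bytes covered by the vectorised part of the copy."""
    return n_vec * VEC_BYTES


n_dest = 1024
bf16_buffer_bytes = n_dest * 2
old_bf16_bytes = bytes_touched(old_n_vec(n_dest))            # twice the buffer
new_bf16_bytes = bytes_touched(vector_plan(n_dest, 2)[0])    # exactly the buffer
```

For 4-byte dtypes both formulas agree, which is consistent with the commit's note that the generated PTX is unchanged for F32 and Int.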
This catches: - egglog cleanup behaviour over a 48-layer-wide e-graph (the `egglog_utils.rs:1286: No valid graphs` panic surfaces here if the cleanup cascade re-regresses on MoE root-eclasses); - per-layer state plumbing that single-layer tests can't see; - bf16-specific code paths that fp32 random-init tests mask. Memory profile: ~60 GB bf16 weights + ~15 GB compiled-runtime peak; single-token input keeps activations and KV cache trivial. Fits an H200 or H100 with margin to spare. Run time: ~90 s for compile (egglog search at default budget) + ~1 s for both forward passes. Verified with 5 passed in 5:29 on H200/CUDA. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * luminal_python: fix bf16 cast-back on where / masked_fill `translate_where`, `translate_where_scalar_other`, and `translate_masked_fill_scalar` all computed `c * x + (1 - c) * y` in F32 and never cast the result back to the input dtype. When the input was bf16 (the common case for MoE inference), the F32 buffer was downstream read as bf16 — which walks the buffer at half-stride and produces output[1] = input[0], output[3] = input[1], … with zeros at the even positions. For Qwen3-MoE's `batched_mm_experts_forward` the corruption landed at the masked-fill of unused expert outputs and propagated as ~10^38 saturation through the rest of the layer. Three changes: 1. Extract a shared `where_formula(cond, x, y, out_dtype)` helper that builds the cx + (1-c)y graph in F32 and then `cast(out_dtype)`s the result. All three callers route through it now. 2. `translate_where_scalar_other` and `translate_masked_fill_scalar` build a tensor for the scalar branch via the same `constant_float(val).cast(out_dtype).expand_rhs(shape)` recipe that `translate_full_like` uses, then call the shared helper. 3. 
The standalone half-stride misread on a tiny `masked_fill` graph is still observable in isolation (egglog picks a different rewrite plan for that graph than for `full_like + where`), but does not occur in real models — the qwen3-moe test suite (5 tests, including full `Qwen/Qwen3-30B-A3B` pretrained at all 48 layers) is now green and the bench's `Qwen3MoeExperts` path produces correct output. Validation on H200/CUDA: - 5 passed in tests/test_qwen3_moe.py (was: full-config wrong-magnitude output blocking the regression test from being meaningful) - 223 passed in tests/test_unary.py + test_capsule_validation.py + test_hlir_ops.py — no existing-test regression Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cargo fmt * ruff format on tests/test_qwen3_moe.py * clippy: use += instead of x = x + y * fixed whisper with schedule edges in runtime * scatter no copy fix * whisper fix * hold out slow tests * flashinfer * fmt * flashinfer jit --------- Co-authored-by: Tucker Morgan <tucker@luminal.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 23:18:12 -04:00
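The half-stride corruption described for the missing cast-back can be reproduced with the stdlib `struct` module. The sketch packs values as F32, then reads the same bytes as 16-bit bf16 elements. It assumes a little-endian layout and uses hypothetical helper names; it models the buffer misread only, not the translator or the CUDA kernels.

```python
import struct


def bf16_bits_to_float(bits):
    """Decode 16 bf16 bits by placing them in the top half of an f32."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


def misread_f32_as_bf16(values):
    """Write `values` as F32, then read len(values) elements as bf16."""
    raw = struct.pack(f"<{len(values)}f", *values)
    halves = struct.unpack(f"<{len(raw) // 2}H", raw)
    # The consumer expects len(values) bf16 elements, so it only walks
    # the first half of the F32 buffer.
    return [bf16_bits_to_float(h) for h in halves[: len(values)]]


corrupted = misread_f32_as_bf16([1.0, 2.0, 3.0, 4.0])
```

Each F32 value contributes a low 16-bit half followed by a high 16-bit half, so `corrupted[1]` carries `input[0]` and `corrupted[3]` carries `input[1]`. The even positions hold the low mantissa bits, which are zero for these inputs. Casting the F32 result back to the input dtype, as the shared `where_formula` helper is described as doing, removes the mismatch.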
Ali	855f2bfd02	implement warp-level reduce using register shuffle (#310 )	2026-05-11 19:20:09 -04:00
Ali	cf7fa2297c	get_* is leaking mem (#308 )	2026-05-11 15:41:29 -04:00
tucker-luminal	cd3f55a3a7	luminal_python + cuda_lite: unblock Qwen3-MoE compile path (#301 ) * luminal_python + cuda_lite: unblock Qwen3-MoE compile path Four small fixes that together let Qwen3MoeForCausalLM compile end-to-end through torch.compile + luminal_backend, plus a regression test suite. 1. KernelScatter bf16 OOB crates/luminal_cuda_lite/src/kernel/hlir.rs The Scatter kernel sized n_vec as `n_dest / 4`, correct only for 4-byte dtypes. For bf16 (and any 1/2/8-byte type) the float4 vectorised copy walked the destination 2× / 4× / 0.5× the actual buffer size. Whether that crashed with CUDA_ERROR_ILLEGAL_ADDRESS or silently corrupted neighbouring allocations depended on which surrounding kernels the egglog search picked → ~40% crash rate at search-iters≥5 on StaticCache(dtype=bfloat16) MoE inference. Fix: parameterise n_vec and remainder_start by elements_per_vec = 16 / sizeof(self.dtype). For F32/Int the generated PTX is identical. 2. maximum_f32 dtype mismatch on Int tensors src/frontend/binary.rs `maximum_f32(rhs)` built an F32 `constant_float`; the inner `lt` then panicked "Dtypes must match to compare tensors. Got Int and F32" whenever self was Int — e.g. `aten.clamp` on top-k expert indices coming out of an MoE router. Fix: cast the constant to self.dtype before the compare. For Int self this floors the bound, matching PyTorch's `clamp(int_tensor, min=<float>)` semantics. 3. Three new ATen ops in the luminal_python translator crates/luminal_python/rust/src/translator/{dispatch,tensor}.rs - aten.empty.memory_format - aten.empty_permuted.default → translate_empty (zero-fill) - aten.histc.default → translate_histc Qwen3-MoE allocates the expert-output staging tensor via `empty_permuted` and counts tokens-per-expert via `torch.histc(expert_ids.int(), bins=K, min=0, max=K-1)`. 
empty / empty_permuted lower to a zero-filled tensor of the requested shape — PyTorch's contract on empty outputs is undefined for any read prior to a write, and downstream writes overwrite our zeros, so this is sound. histc implements only the bincount-equivalent case (one integer per bin); non-integer-bin or non-contiguous-bin usage bails with a clear error rather than silently dropping values. 4. crates/luminal_python/tests/test_qwen3_moe.py — new file Four regression tests over progressively larger Qwen3MoeForCausalLM configs: - tiny: 2 experts, top-1, ~70K params (atol 1e-5) - small: 4 experts, top-2 (atol 1e-4) - medium: 8 experts, top-2, 2 layers (atol 1e-4) - real_config_1layer: full Qwen3-30B-A3B arch (128 experts, top-8, 2048 hidden), num_hidden_layers=1, random weights (atol 1e-3) The size ladder lets any future regression surface at the cheapest test that catches it. Each individual fix above is exercised: gather-then-matmul (PR #298) by every test, KernelScatter bf16 indirectly via the bf16 weight init path, the clamp-on-Int and the empty/histc translators by every test. Validation on H200/CUDA: - 4 passed in tests/test_qwen3_moe.py (this PR's new tests) - 223 passed across tests/test_unary.py, test_capsule_validation.py, test_hlir_ops.py — no existing-test regression Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: add full-depth Qwen3-30B-A3B regression test The 1-layer real-config test exercised the production layer shape but not the full network depth. Adds a sibling test that loads the actual Qwen/Qwen3-30B-A3B pretrained checkpoint at its native bf16 dtype, keeps all 48 layers, and runs a full forward through luminal_backend. Asserts compile+run completes and the compiled output is finite + in the right magnitude band vs eager (within 10×). 
Tight numerical equivalence at full depth is not asserted: random egglog seeds can pick lowering plans whose 48-layer accumulation diverges structurally from eager even though per-layer correctness holds. The smaller-config tests above use atol≤1e-3 and cover the per-op correctness this test cannot. This catches: - egglog cleanup behaviour over a 48-layer-wide e-graph (the `egglog_utils.rs:1286: No valid graphs` panic surfaces here if the cleanup cascade re-regresses on MoE root-eclasses); - per-layer state plumbing that single-layer tests can't see; - bf16-specific code paths that fp32 random-init tests mask. Memory profile: ~60 GB bf16 weights + ~15 GB compiled-runtime peak; single-token input keeps activations and KV cache trivial. Fits an H200 or H100 with margin to spare. Run time: ~90 s for compile (egglog search at default budget) + ~1 s for both forward passes. Verified with 5 passed in 5:29 on H200/CUDA. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * luminal_python: fix bf16 cast-back on where / masked_fill `translate_where`, `translate_where_scalar_other`, and `translate_masked_fill_scalar` all computed `c * x + (1 - c) * y` in F32 and never cast the result back to the input dtype. When the input was bf16 (the common case for MoE inference), the F32 buffer was downstream read as bf16 — which walks the buffer at half-stride and produces output[1] = input[0], output[3] = input[1], … with zeros at the even positions. For Qwen3-MoE's `batched_mm_experts_forward` the corruption landed at the masked-fill of unused expert outputs and propagated as ~10^38 saturation through the rest of the layer. Three changes: 1. Extract a shared `where_formula(cond, x, y, out_dtype)` helper that builds the cx + (1-c)y graph in F32 and then `cast(out_dtype)`s the result. All three callers route through it now. 2. 
`translate_where_scalar_other` and `translate_masked_fill_scalar` build a tensor for the scalar branch via the same `constant_float(val).cast(out_dtype).expand_rhs(shape)` recipe that `translate_full_like` uses, then call the shared helper. 3. The standalone half-stride misread on a tiny `masked_fill` graph is still observable in isolation (egglog picks a different rewrite plan for that graph than for `full_like + where`), but does not occur in real models — the qwen3-moe test suite (5 tests, including full `Qwen/Qwen3-30B-A3B` pretrained at all 48 layers) is now green and the bench's `Qwen3MoeExperts` path produces correct output. Validation on H200/CUDA: - 5 passed in tests/test_qwen3_moe.py (was: full-config wrong-magnitude output blocking the regression test from being meaningful) - 223 passed in tests/test_unary.py + test_capsule_validation.py + test_hlir_ops.py — no existing-test regression Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cargo fmt * ruff format on tests/test_qwen3_moe.py * clippy: use += instead of x = x + y * fixed whisper with schedule edges in runtime * scatter no copy fix * whisper fix * hold out slow tests * fixing issues with bad rewrite --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Joe Fioti <jafioti@gmail.com>	2026-05-11 12:34:52 -07:00
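The `elements_per_vec = 16 / sizeof(dtype)` arithmetic from fix 1 in this message can be checked with a few lines of host-side Python. This is a sketch of the index arithmetic only, under the assumption that each vectorised copy moves 16 bytes (one float4); the function names are hypothetical and this is not the kernel source.

```python
VEC_BYTES = 16  # assumed size of one float4-style vector copy

def split_old(n_dest):
    """Pre-fix sizing: always treats one vector as 4 elements."""
    n_vec = n_dest // 4
    return n_vec, n_vec * 4  # (n_vec, remainder_start)

def split_fixed(n_dest, dtype_size):
    """Post-fix sizing: elements per vector depends on the element size."""
    elements_per_vec = VEC_BYTES // dtype_size
    n_vec = n_dest // elements_per_vec
    return n_vec, n_vec * elements_per_vec  # (n_vec, remainder_start)

def bytes_touched(n_vec):
    """Bytes covered by the vectorised part of the copy."""
    return n_vec * VEC_BYTES

n_dest = 1024
bf16_buffer_bytes = n_dest * 2
old_bytes = bytes_touched(split_old(n_dest)[0])
new_bytes = bytes_touched(split_fixed(n_dest, 2)[0])
print(old_bytes / bf16_buffer_bytes)  # ratio of bytes walked to buffer size, old sizing
print(new_bytes / bf16_buffer_bytes)  # same ratio with the fixed sizing
```

For 4-byte dtypes the two functions agree, which is consistent with the message's note that the generated PTX is identical for F32/Int.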
Ali	11653c6903	capacity should be used instead of len for Vec::from_raw_parts (#307)	2026-05-11 11:30:02 -04:00
Ali	6d16bdba21	n_elements should use constant not device (#306)	2026-05-11 11:29:20 -04:00
Joe Fioti	7bfd19fb72	Refine cublasLt rewrites and shrink their test coverage (#305)	2026-05-09 01:29:10 -04:00
tucker-luminal	42caa4750e	luminal_python: dynamic shapes through torch.compile + translator cleanups (#302 ) * luminal_python: tighten translator lowerings Reduce graph-node count in PT2 → HLIR translators without semantic changes; CUDA suite is 233P/4X before and after. - where / masked_fill / bool-mask index_put: rewrite the blend as `y + c(x - y)` instead of `cx + (1-c)y`, dropping a mul, a sub, and the `1.0` constant per call. - gather / index.Tensor: keep negative-index normalization in Int instead of round-tripping through F32, dropping three Cast nodes per indexed dim; works for symbolic axis sizes too. - ceil: lower as `trunc(x) + (x > trunc(x))` instead of `-floor(-x)`. - _to_copy: skip the Cast op when the dtype already matches; PT2 emits `_to_copy` as a clone hint and the redundant cast was surviving until later optimizer passes. - Full reductions (sum.default etc.): match the contiguity guard translate_reshape already applies — without it the `[1, N]` view treats stride-0 broadcast dims as if they held N distinct values and reads past the backing buffer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> luminal_python: end-to-end dynamic-shape support through torch.compile Previously the standard torch.compile(model, backend=luminal_backend) path silently dropped Dynamo's dynamic-shape information on re-export, so every new input shape forced a full backend recompile. The luminal.pt2.compile() "explicit" entry point also bailed out on float inputs and on anything beyond a single bare-symbol dim. This commit makes both paths actually flow symbolic dims end-to-end. pt2_backend (the path torch.compile users hit): - Detect SymInt placeholders Dynamo emits alongside tensor inputs and rewrite their uses into `aten.sym_size.int(tensor, dim)` so re-export sees a tensor-only signature. 
- Build a torch.export `dynamic_shapes` spec from the surviving tensor placeholders' FakeTensor shapes (Dim.AUTO; relationships are recovered from the FakeTensor metadata). - Defer the entire compile pipeline to the first runtime call when dynamic_shapes is non-None — torch.export with dynamic_shapes mutates the ShapeEnv that Dynamo is still relying on to install guards, and doing it inside the backend frame trips an internal "Guard failed on the same frame" assertion. Lazy compile sidesteps this cleanly. - Compose the lifted-weight and SymInt filter steps into a single user_indices the CompiledModel uses to drop both kinds of non-tensor args at __call__ time. Fix the device-detection lookup to walk user_inputs (post-filter) rather than `inputs[0]`, which can be a SymInt under Dynamo. - _detect_factory_capsule similarly walks for the first real tensor. Compound shape expressions (`2s`, `s+1`, etc.): - resolve_dim_sizes now parses sympy `srepr` strings — Symbol, Integer, n-ary Mul/Add — into proper luminal Expressions instead of collapsing every non-bare-symbol form to size 1. Falls back to the EP's `hint` when the head isn't recognised so output-shape resolution still returns a usable concrete size. - auto_set_dims_from_input_shapes inverts single-variable affine forms by sampling two probe points (x=2, x=3), recovering slope/intercept, and verifying the candidate value round-trips through exec_single_var_checked. Multi-variable / non-affine / non-monotonic forms are rejected so we never write a wrong guess into dyn_map. Explicit luminal.pt2.compile() API (unchanged behavior for existing callers, plus): - Accepts `dynamic_shapes=` passthrough for full torch.export-style control (named Dims, ranges, multi-input, shared symbols). - `dynamic_dim` accepts an int, an Iterable[int], or "auto"; "auto" marks every non-trivial axis of the first input as Dim.AUTO instead of being integer-input-only. - Multi-input `example_input` lists are accepted directly. 
- The legacy `dynamic_dim=None` integer-tail-axis heuristic is preserved so the existing decode-loop test keeps working unchanged. Op-arg SymInt awareness: - get_int_arg / get_ints_arg fall through to expression resolution and accept SymInt entries that bind to concrete values, instead of failing with a misleading "not an int" message. Tests: - New tests/test_dynamic_shapes.py covers torch.compile under both automatic_dynamic_shapes and dynamic=True (the latter reuses a single compile across every shape — verified via backend invocation count), lifted-weight + SymInt composition, multi-dim dynamic, compound shape expressions (`cat([x, x], 0)` produces `2s`), and the new explicit-API surface (float-input dynamic_dim and dynamic_shapes passthrough). Full CUDA suite: 239 passed / 4 xfailed (was 233/4); no regressions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fix CI: pass user_indices through _save_and_compile + apply fmt The lazy-compile path passes user_indices= to _save_and_compile, but the function signature never accepted it — ruff F821 caught the undefined name in the early return path. Add it as a kwarg. Also apply ruff format and cargo fmt to satisfy the corresponding pre-commit checks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Fix bad merge: restore _decomp_table() on all run_decompositions sites The merge of main into worktree-fasteraten kept _decomp_table() on only one of the three ep.run_decompositions() call sites. The other two — the dynamic-shapes compile() path and the _eager_pt2_compile (torch.compile backend) path — were left calling run_decompositions() with no args, which decomposes SDPA and breaks the translator with unsupported eq.Scalar / scalar_tensor(-Infinity) ops from the all-masked sentinel chain. Restore _decomp_table() at all three sites. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:27:09 -07:00
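The two-probe inversion described for `auto_set_dims_from_input_shapes` can be sketched as follows. This is an illustrative reconstruction from the message text only: `invert_affine` is a hypothetical name, the shape expression is modelled as a plain Python callable, and the real implementation works on luminal Expressions.

```python
def invert_affine(expr, observed):
    """Recover x such that expr(x) == observed, assuming expr is a*x + b.

    Returns None when no integer solution exists or when the round-trip
    check fails (for example because expr was not actually affine).
    """
    y2, y3 = expr(2), expr(3)      # two probe points, as in the message
    slope = y3 - y2                # a
    intercept = y2 - 2 * slope     # b
    if slope == 0:
        return None                # constant form: x cannot be recovered
    numerator = observed - intercept
    if numerator % slope != 0:
        return None                # no integer x produces this size
    candidate = numerator // slope
    # Round-trip check: never report a value the expression does not reproduce.
    if expr(candidate) != observed:
        return None
    return candidate

print(invert_affine(lambda s: 2 * s, 10))   # cat([x, x], 0) style `2s`
print(invert_affine(lambda s: s + 1, 8))
print(invert_affine(lambda s: s * s, 16))   # non-affine form is rejected here
```

The round-trip check is what keeps a wrong guess out of the result: any value that is returned satisfies `expr(x) == observed` by construction.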
Joe Fioti	1279dca4e6	Memory analysis post pass (#303) * Simplify CUDA memory analysis and arena planning * Simplify CUDA memory planning and fix clippy warnings	2026-05-08 11:24:37 -04:00
tucker-luminal	53f7960130	luminal_python: translate F.scaled_dot_product_attention as one fused op (#285 ) Adds translator support for `torch.ops.aten.scaled_dot_product_attention.default` and the four backend variants (`_scaled_dot_product_efficient_attention`, `_scaled_dot_product_flash_attention`, `_scaled_dot_product_flash_attention_for_cpu`, `_scaled_dot_product_cudnn_attention`) so calls to `torch.nn.functional.scaled_dot_product_attention` lower to a single matmul+softmax+matmul chain instead of the ~20-op default decomposition (which uses `eq.Scalar`/`logical_not`/`any.dim`/`where.self`/`full_like` to implement the all-masked-row sentinel). The default `ep.run_decompositions()` table decomposes SDPA away. Strip the five SDPA entries from the table in `pt2.py:_decomp_table()` so the op survives into the FX graph and our translator catches it. Tests cover the three commonly-hit branches: - basic Q/K/V (default scale, no mask, no causal flag) - is_causal=True (triangular-mask branch) - additive attn_mask broadcast over heads Verified on native (224 passed) and CUDA (239 passed / 4 xfailed). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 16:36:36 -04:00
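The "single matmul+softmax+matmul chain" this message refers to is the standard scaled dot-product attention formula. A minimal single-head reference in plain Python is sketched below for the three branches the tests cover; it is not the translator's code, it uses nested lists instead of tensors, and it does not handle fully masked rows.

```python
import math

def sdpa(q, k, v, attn_mask=None, is_causal=False, scale=None):
    """Reference softmax(q @ k^T * scale + mask) @ v for one head.

    q is [L][E], k is [S][E], v is [S][Ev], attn_mask is an additive [L][S] mask.
    """
    scale = 1.0 / math.sqrt(len(q[0])) if scale is None else scale
    out = []
    for i, qi in enumerate(q):
        scores = []
        for j, kj in enumerate(k):
            s = scale * sum(a * b for a, b in zip(qi, kj))  # first matmul
            if attn_mask is not None:
                s += attn_mask[i][j]                        # additive mask branch
            if is_causal and j > i:
                s = -math.inf                               # triangular-mask branch
            scores.append(s)
        m = max(scores)                                     # stabilised softmax
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # second matmul: weighted sum of value rows
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

v = [[1.0, 2.0], [3.0, 4.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
causal = sdpa(eye, eye, v, is_causal=True)
uniform = sdpa([[0.0, 0.0]], eye, v)
print(causal[0], uniform[0])
```

With `is_causal=True` the first query can only attend to the first key, so its output row equals the first value row; a zero query gives uniform weights and therefore the mean of the value rows.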
Joe Fioti	5c3407c596	Reduce default profiling trials to 3 (#299) * Reduce default profiling trials to 3 * rm out.png * Set Modal CI timeouts to 2 hours	2026-05-06 13:04:57 -04:00
tucker-luminal	47530062a4	luminal_python: gather-then-matmul lowering for grouped_mm (#298 ) translate_grouped_mm was casting the full [G, K, N] expert weight tensor to F32 before a broadcast batched matmul, producing ~2.1 GB of intermediate buffers per layer on Qwen3-30B-A3B. Across 48 MoE layers this OOM'd the search profiler at runtime.rs:711 (alloc_zeros), failing every python_luminal qwen3-moe bench run for the past ~2 weeks. Switch to the gather-first pattern that examples/qwen3_moe uses: compute expert_id from offs, gather only the [S, K, N] active slice, then matmul. The shape mirrors what glumoe_rewrite.egg matches, and the gather is 16x smaller at prefill (S = num_tokens * top_k = 8 vs G = 128). Two refinements baked in vs the broadcast-and-mask version: 1. Stay in Int for the entire expert_id computation. arange and offs are already Int; ge → Bool → cast(Int) → sum → minimum handles the clamp without four F32 round-trips. Same value as HF MoE's `expert_ids.clamp(0, num_experts-1)` for invalid expert IDs from EP, AND protects search-time profiling: dummy-1 input bytes give offs=[1,…,1], pushing the raw count to G for any token with index ≥ 1, which would OOB the gather without the clamp. 2. Drop the cast(F32) on input and on the gathered weight. The broadcast-and-mask version needed F32 because it casted the mask to F32; gather-then-matmul has no such requirement, and casting `[S, K, N]` to F32 doubled the gather scratch (~100 MB → ~200 MB per layer for Qwen3-30B-A3B prefill). Matmul rewrites (cuBLASLt etc.) handle bf16 input with F32 accumulator internally — no precision loss in practice. Verification: - tests/test_hlir_ops.py::test_grouped_mm_fallback{,_routing_invariance} pass. - Synthetic g=128, s=8, k=2048, n=1536 bf16 test: max-abs-diff 1.56e-02 (within bf16 accumulation tolerance; expected to drop to F32-accurate once the cuBLASLt rewrite fires at higher search budgets). Result: original OOM-in-search is gone. 
With --search-iters 1 the full Qwen3-30B-A3B bench end-to-ends (TTFT ~9.4s). Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 16:38:15 -04:00
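The integer-only `ge → sum → minimum` computation of `expert_id`, followed by gather-then-matmul, can be sketched in plain Python. This is an illustration under stated assumptions, not the translator's lowering: `offs` is assumed to hold cumulative end offsets per expert, and both function names are hypothetical.

```python
def expert_ids_from_offs(offs, num_rows):
    """expert_id[t] = min(count of offsets with t >= off, G - 1), all in ints."""
    num_experts = len(offs)
    ids = []
    for t in range(num_rows):
        count = sum(1 if t >= off else 0 for off in offs)  # ge -> Int -> sum
        ids.append(min(count, num_experts - 1))            # clamp via minimum
    return ids

def gather_then_matmul(x, weights, offs):
    """Gather one [K][N] weight slice per row of x, then multiply row @ slice."""
    ids = expert_ids_from_offs(offs, len(x))
    out = []
    for row, e in zip(x, ids):
        w = weights[e]  # only the active expert's slice is touched
        out.append([sum(row[k] * w[k][n] for k in range(len(row)))
                    for n in range(len(w[0]))])
    return out

routed = expert_ids_from_offs([2, 2, 5, 6], 6)
dummy = expert_ids_from_offs([1, 1, 1, 1], 3)   # search-time dummy offsets
result = gather_then_matmul([[1], [2], [3]], [[[1]], [[10]], [[100]]], [1, 1, 3])
print(routed, dummy, result)
```

The dummy case shows why the clamp matters: with every offset equal to 1, any row index of 1 or more would count all G offsets and index one past the last expert without the `min`.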
Joe Fioti	8524636d6f	Yolo v11 example (#296 ) * Add YOLO v11n example on luminal_cuda_lite (WIP) End-to-end Object Detection demo running Ultralytics yolo11n on the cuda_lite backend. Includes a Rust example crate (`yolo_v11`, `yolo_v11_tiny`, `yolo_v11_egglog_debug`), a PyTorch reference + weight-prep script, and a torch.compile path through luminal_python. Surfaced and worked around several e-graph extraction issues that the heavy conv + multi-stage Detect head exposes: - Gather dtype propagation (`src/hlir.rs`): the HLIR Gather dtype-from- data rule was emitted in the default ruleset, so it only advanced one Gather per `(run)` iteration of the schedule. YOLO has deeply nested Gathers (each conv padding + each `make_contiguous` becomes a Gather); put the rule in `dtype_prop` so it saturates with Mul/Add/Sum/etc. Did the same for Scatter for symmetry. - KernelGather IList tail variable (`crates/luminal_cuda_lite/src/ kernel/hlir.rs`): mirror the `?__tail` pattern that Gather's dtype rule uses instead of a strict `(INil)` so the kernel-rewrite still matches when egglog has unioned the IList tail eclass with another chain. - Conditional cleanup (`src/egglog_utils/mod.rs`): replaced `(saturate cleanup)` with a Rust post-pass that strips HLIR ops only when a kernel survivor exists in the same Op eclass. Otherwise the cleanup cascade kills the root with "No valid graphs present" on conv-heavy graphs. - inject_kernel_alternatives (`src/egglog_utils/mod.rs`): synthesises KernelMul/KernelAdd/.../KernelMax enodes for HLIR-only Op eclasses whose dtype propagation didn't make it in time, with a deep-clone fallback that creates new ELIST chains so the extractor's first-enode walk is deterministic. Filtered by `OpTextParts::all_op_names` so the native runtime tests don't get CUDA-only kernel kinds. 
- enforce_consistent_first_kind_enodes + prefer_econs_first_in_ elists + extract-time consistency check (`src/egglog_utils/mod.rs`): reorder OpKind eclasses so the first enode is a kernel kind whose ELIST children all walk to the same length, and reorder ELIST eclasses so they start with `ECons`/`ENil` instead of `RemoveNthFromEnd` / `MReplaceList` / `RowMajor` (which would crash `extract_expr_list`). - Defensive truncate in KernelMul::extract (`crates/luminal_cuda_ lite/src/kernel/hlir.rs`): when an inconsistent kind enode survives all the above, truncate shape and strides to the shortest length so `flatten_strides` is structurally satisfied. Numerically wrong for that candidate but harmless to the search, which profiles many. - Diagnostic env vars (`src/egglog_utils/mod.rs`, `crates/luminal_cuda_lite/src/runtime.rs`, `crates/luminal_cuda_lite/src/kernel/fusion/{markers,region_codegen}.rs`): `LUMINAL_DUMP_CLEANUP`, `LUMINAL_DUMP_INJECT`, `LUMINAL_DUMP_GATHER`, `LUMINAL_DUMP_CONSISTENCY`, `LUMINAL_DUMP_EXTRACT`, `LUMINAL_DUMP_ EGGLOG`, `LUMINAL_STRICT_KERNEL_ONLY`, `LUMINAL_DISABLE_INJECT`, `LUMINAL_DISABLE_FUSION`, `LUMINAL_DUMP_FUSED_REGION`, `LUMINAL_SYNC_EACH_OP`. - Unrelated egglog rule disables (`src/egglog_utils/base.rs`): `div-div` and `div-cancel-factor` triggered combinatorial explosion on the conv-heavy graph; replaced `div-div` with the constant-divisor variant `div-div-num`. Status: - Llama: 96/96 tests still pass. - `yolo_v11_tiny YOLO_TINY_LAYERS=1..13` matches PyTorch within cumulative numerical drift. - Full `yolo_v11`: compiles in ~150s and runs the forward in ~640ms. Detection accuracy is currently degraded (max_abs ~182 vs PyTorch reference) because of remaining multi-variant ELIST eclasses that fall through to the defensive truncate. The truncation produces wrong indices for those few ops; further work is needed on the e-graph rewriter side. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Accept YOLO input and output paths as CLI args * Update commit message generation instructions * metal clippy * metal unit tests * Fix yolo example clippy warnings * Simplify yolo_v11 to a single self-contained binary * Extend CUDA Modal test timeout to 2 hours * Require CUDA build in Modal pytest runner * Loosen Modal pytest timeout for CUDA CI * Loosen Modal timeouts for CUDA CI --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 16:22:56 -04:00
Joe Fioti	22e7b2da49	Merge pull request #295 from luminal-ai/add-late-egraph-memory-analysis Add late egraph memory analysis	2026-05-03 21:29:42 -07:00
Joe Fioti	198bd2d76b	Merge main into add late egraph memory analysis	2026-05-04 01:31:02 +00:00
Joe Fioti	6a86e70a19	Merge pull request #293 from spinlocked/spinlocked/fix-metal-index-arithmetic-and-non-contiguous-gather-lowering Fix Metal index arithmetic and non-contiguous gather lowering	2026-05-03 18:29:26 -07:00
Joe Fioti	141c06f2bf	Merge remote-tracking branch 'origin/main' into add-late-egraph-memory-analysis # Conflicts: # src/egglog_utils/mod.rs	2026-05-04 01:12:33 +00:00
Joe Fioti	352478f63c	Merge pull request #294 from luminal-ai/egglog_saturation initial egglog saturation	2026-05-03 18:08:28 -07:00
Joe Fioti	a63a5278b9	Fix Metal lowering ruleset selection	2026-05-03 16:57:14 -07:00
Joe Fioti	6b5504de47	initial egglog saturation	2026-05-03 23:39:15 +00:00
spinlocked	6ad13f06d3	Fix Metal index arithmetic and non-contiguous gather lowering Metal binary kernels were reading Int inputs through float conversion, which could lose precision for large computed indices. Keep Add, Mul, and Mod in integer space when the output dtype is Int, and use the integer `%` operator for Int modulo. MetalGather also lowered gathered data offsets using the output/index shape instead of the source data shape. Thread data_shape through the MetalGather egglog op and use it with data_strides when computing the final data index, so gathers from transposed or otherwise non-contiguous tensors address the right elements.	2026-05-03 14:33:59 -07:00
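The addressing rule described for MetalGather (unravel the logical data index with the source `data_shape`, then apply `data_strides`) can be illustrated on the host. This is a sketch of the index arithmetic only, with a hypothetical function name; it is not the Metal kernel.

```python
def data_offset(logical_index, data_shape, data_strides):
    """Map a row-major logical index into a strided buffer offset."""
    offset = 0
    # Peel off coordinates from the innermost dimension outward.
    for dim, stride in zip(reversed(data_shape), reversed(data_strides)):
        offset += (logical_index % dim) * stride
        logical_index //= dim
    return offset

# A 2x3 row-major buffer viewed as its 3x2 transpose: shape (3, 2), strides (1, 3).
buffer = [0, 1, 2, 3, 4, 5]
transposed = [buffer[data_offset(i, (3, 2), (1, 3))] for i in range(6)]
# The same view addressed as if it were contiguous returns the wrong elements.
contiguous = [buffer[data_offset(i, (3, 2), (2, 1))] for i in range(6)]
print(transposed)
print(contiguous)
```

Reading the transposed view with contiguous strides returns the buffer in storage order, which is the kind of mis-addressing the message describes for non-contiguous sources.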
Joe Fioti	2d736cc499	Merge pull request #292 from luminal-ai/remove-earlyrewrites Remove early rewrites and move GLUMoE and sigmoid staging into main schedule	2026-05-03 13:52:19 -07:00
Joe Fioti	2862f7ed22	Add detailed egglog metrics and plan reporting	2026-05-03 20:24:18 +00:00

2830 Commits