* tests for interface specification * luminal_python: skip CUDA zero-copy for float64 outputs Luminal collapses `DType::F64` to F32 internally, so a CUDA kernel for an f64-typed output actually writes f32 bytes. The Python wrapper was registering an `f64` pre-allocated tensor's `data_ptr` as the zero-copy destination — handing the kernel a 12-byte payload for a 24-byte buffer, leaving half of every f64 element as garbage. Fix: only set the device pointer for the dtypes luminal *natively* writes end-to-end on CUDA (f32, f16, bf16). For f64, pre-allocate the f64 output tensor but skip the device-ptr handoff; the collection path then falls through to `get_output()` (which reads the kernel's actual f32 output) and casts to f64 via the existing read-and-cast branch. Pre-existing latent bug — the test scaffolding from the prior commit exposes it as `test_boundary_noop_preserves_dtype_and_values [cuda-float64_f32_exact]`. Phase E adds first-class f64 IR support which will eventually let the kernel write real f64 bytes and restore zero-copy here; this commit unblocks the CUDA test sweep until then. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * luminal: first-class I64 / F64 in IR + CPU + PT2 boundary Today luminal collapses every PT2 integer dtype to `DType::Int` (i32) and `float64` to `DType::F32` at the FFI boundary. The LUM-486 commit papered over symptoms by storing the user-visible PT2 dtype code in a sidecar and casting back at the Python wrapper — but the IR still computes in i32 / f32, so values outside those ranges (`2**40`, `1.0000000000000002`) lose information before the kernel ever runs. 
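To make the information loss concrete, here is a minimal standard-library sketch of what happens to the two example values when they are forced through 32-bit storage. The helper names are hypothetical and not part of luminal. The integer helper assumes wrapping (keep the low 32 bits); a saturating narrowing would yield `2**31 - 1` instead, and the original value is unrecoverable either way.

```python
import struct


def through_f32(x: float) -> float:
    """Round-trip a Python float (an f64) through 4 bytes of f32 storage."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def wrap_to_i32(n: int) -> int:
    """Keep only the low 32 bits of n and reinterpret them as a signed i32.

    Assumption: wrapping semantics, chosen only for illustration.
    """
    return struct.unpack("<i", struct.pack("<I", n & 0xFFFFFFFF))[0]


big = 2**40                    # needs 41 bits, so it cannot fit in an i32
precise = 1.0000000000000002   # smallest f64 strictly greater than 1.0

print(wrap_to_i32(big))                 # the low 32 bits of 2**40 are all zero
print(through_f32(precise) == precise)  # False: the nearest f32 is exactly 1.0
```

Both values are representable exactly in i64 / f64, which is why a cast back at the Python wrapper cannot repair the result once the IR has computed in i32 / f32.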
This commit makes i64 and f64 first-class through the IR end-to-end: - `DType::I64` added; custom `Debug` impl maps it to `"Int64"` (not `"I64"`) because egglog has a built-in primitive sort named `I64` for integer literals in shape expressions, and the egglog-format sites in `hlir.rs` serialize `DType` via `{:?}` — emitting `"I64"` would shadow the primitive and panic the egraph loader with `UnboundFunction("I64", ...)`. Documented at the variant. - `f64_dt: sort(DTYPE, "F64", &[])` and `int64_dt: sort(DTYPE, "Int64", &[])` registered in `egglog_utils::base`; matching arms added to `extract_dtype`. - `NativeData::I64(Vec<i64>)` and `NativeData::F64(Vec<f64>)` added. `len`, `f32`/`f16`/`bf16`/`i32`/`bool` accessors widen for both; new `i64()` and `f64()` accessors mirror the existing access pattern. `From<Vec<i64>>` and `From<Vec<f64>>` impls round out the inference. - Cast op covers the full new Cartesian product. Cast to `Int` from `I64` saturates, matching `tensor.to(torch.int32)` overflow semantics. Cast to `F32` from `F64` narrows. - CPU kernels handle I64/F64 directly in Add, Mul, Mod, Gather, Scatter, SumReduce, MaxReduce. Unary transcendentals (`Log2`, `Exp2`, etc.) still bridge through f32 in v1 — the translator inserts cast-bridges around them; reaching the kernel with `I64`/`F64` panics with a pointer to the missing bridge. - `dyn_backend::bytes_to_native_data` preserves i64 / f64 bytes directly; `dummy_data_for_dtype` includes i64 fill. New trait methods `get_output_i64` / `get_output_f64` on `DynBackend` with the native runtime impl. - `cuda_dtype` extended (`"long long"` for I64). Full CUDA kernel support for i64/f64 elementwise emit is Phase F — the mapping is here so the egglog ext correctly types the kernel inputs, but several elementwise CUDA paths still need codegen work. - PT2 boundary: `torch_dtype_int_to_luminal` returns `I64`/`F64` for codes 5/8. 
`TypedData::from_pytorch_bytes` and `pt2_compiled_model::bytes_to_typed` preserve raw bytes for both. `luminal_dtype_to_pt2_code` round-trips `I64` to code 5. - `CompiledGraph` exposes `get_output_i64` / `get_output_f64`. The Python wrapper routes `torch.int64` / `torch.float64` outputs through them — no more i32-buffer-then-`.to(int64)` cast-back layer. - Test scaffolding updated: the `int64_*` and `float64_*` cases move from `test_boundary_warns_when_input_dtype_requires_conversion` (where they previously had to warn because a conversion was real) to `test_boundary_does_not_warn_when_input_dtype_matches_graph`. Reflecting the new contract: int64 / float64 inputs match the graph's input dtype directly. xfails removed from `int64_outside_i32_range` and `float64_precision_sensitive`. Both now pass on CPU end-to-end. CUDA parity for i64/f64 elementwise kernels lands in Phase F (commit 17). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * luminal: hard-reject dtype mismatch at the FFI boundary Before: when a caller passed an input whose dtype didn't match the graph's declared input dtype, the Python wrapper silently `.to(expected_dtype)`-ed it and emitted a `DTypeBoundaryWarning`. Two real problems: 1. Precision bugs hid. A user passing `torch.float64` into a graph that wanted `torch.float32` lost precision-sensitive values (`1.0000000000000002` → `1.0`) without anything in the test suite or logs flagging it. The warning only showed up at first call and was trivially missed in a CI log. 2. Per-call allocation+copy burnt cycles the caller couldn't see in their profile. For a model invoked thousands of times a second, the cast was a real cost the user wasn't aware was happening. The contract is now strict: `model(x)` requires `x.dtype == model.input_dtypes[i]` for every positional input. Mismatched dtype raises `DTypeBoundaryError` before any FFI work. Migration: call `.to(model.input_dtypes[i])` at the call site. 
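The strict contract can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `compiled_model.py`: `FakeTensor` stands in for `torch.Tensor`, dtypes are plain strings, and `check_input_dtypes` is a hypothetical helper name. Only `DTypeBoundaryError(TypeError)` and the `.to(...)` migration hint come from the commit message.

```python
from dataclasses import dataclass
from typing import Sequence


class DTypeBoundaryError(TypeError):
    """Raised when an input's dtype differs from the graph's declared dtype."""


@dataclass
class FakeTensor:
    # Stand-in for torch.Tensor; only the dtype matters for this check.
    dtype: str


def check_input_dtypes(inputs: Sequence[FakeTensor],
                       input_dtypes: Sequence[str]) -> None:
    """Raise before any FFI work if a positional input has the wrong dtype."""
    for i, (x, expected) in enumerate(zip(inputs, input_dtypes)):
        if x.dtype != expected:
            # No silent conversion: name the exact fix in the message.
            raise DTypeBoundaryError(
                f"input {i}: got {x.dtype}, graph expects {expected}; "
                f"call .to({expected}) at the call site"
            )


check_input_dtypes([FakeTensor("float32")], ["float32"])  # matching: no error
```

Because the check raises instead of converting, the cost of any cast appears at the caller's own `.to(...)` line, where a profiler can see it.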
- Add `DTypeBoundaryError(TypeError)` to `compiled_model.py` with a docstring that names the prior precision-bug class and points the user to the call-site migration. - Delete `.to(expected_dtype)` from the input hot path; replace with a direct `raise`. `DTypeBoundaryWarning` removed entirely. - Metal backend factory rejects `DType::I64` and `DType::F64` inputs at translate-time with `UnsupportedDtype` — Metal codegen has no native 64-bit kernels, and reaching the kernel emitter with these used to panic deep in MSL generation with an unhelpful error. - Test scaffolding: `test_boundary_warns_when_input_dtype_requires_conversion` becomes `test_input_dtype_mismatch_rejects` and asserts the raise. `test_boundary_does_not_warn_when_input_dtype_matches_graph` becomes `test_matching_dtype_does_not_raise`. The set of "first-class round-trip" dtypes is captured as `_FIRST_CLASS_NOOP_DTYPES` — narrow integers (uint8 / int8 / int16) collapse to luminal's `Int` (i32), so they can't round-trip the noop model without an explicit `.to(int32)` cast and live only in the reject-path test. Breaks user code that today silently autocasts. Intentional. The migration message at the raise site names the exact `.to(...)` call. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * luminal_cuda_lite: I64 / F64 output read paths Wires the runtime side of `DType::I64` / `DType::F64` for CUDA. The `cuda_dtype` mapping in `luminal_cuda_lite/src/lib.rs` already returned `"long long"` / `"double"` for these (added with first-class IR support), so the kernel emitters were producing correctly-typed output bytes — but the Python wrapper's `get_output_i64` / `get_output_f64` calls landed on the trait-default panic ("not supported by 'cuda_lite'"), surfacing as 8 CUDA test failures on the `test_dtype_boundary` suite. Adds: - `CudaRuntime::get_i64` / `get_f64` — read raw 8-byte chunks from the output buffer and reinterpret. 
Mirrors the existing `get_f16` / `get_bf16` byte-reinterpret pattern. - `CudaLiteDynBackend::get_output_i64` / `get_output_f64` — thin forwarders to the runtime methods. Verified end-to-end with `test_boundary_noop_preserves_dtype_and_values[cuda-int64_outside_i32_range]` (2**40 round-trips bit-exactly through the CUDA kernel) and `[cuda-float64_precision_sensitive]` (1.0000000000000002 round-trips without f32 truncation). Full CUDA dtype suite: 42 passed, 0 failed. The design-doc commit 18 (int32 / bool CUDA zero-copy output plumbing) is deferred to a follow-up. Both dtypes already work end-to-end via the host-roundtrip `get_output_*` path; zero-copy is a perf optimization not blocking any test in the contract suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * luminal_cuda_lite: include I64 in scatter elem-size tables CUDA scatter kernels compute output buffer / load / store byte counts via per-dtype size tables. After landing first-class I64, the scatter emission for an i64 output panicked with `Unsupported dtype for scatter output_bytes: Int64`, which surfaced as the egglog optimizer reporting "Failed to find a viable initial genome after 100 attempts" because every candidate genome containing an i64 scatter immediately panicked. Adds I64 → 8 bytes alongside F64 to the five size tables in `kernel/other_ops.rs` and `kernel/hlir.rs`. MoE routing (idx_dtype = int32 and int64) now compiles and runs end-to-end on CUDA. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * tests: drop input-layout / mutation-alias tests from dtype branch These two test files came along with `d0cec1fc tests for interface specification` as the test scaffolding for the broader boundary-contract work — input layout strides (Phase G) and mutation/alias writebacks (Phase D). Neither feature is in the dtype-only branch, so the tests either xfail or skip here and are noise to the reader trying to understand what this branch ships. 
Keep only `test_dtype_boundary.py` since that's the suite that exercises the I64/F64 IR work and the FFI dtype-mismatch rejection this branch actually delivers. The two removed files live on `pt2-boundary-contract` where the features they test land. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * tests: drop removed files from run_all_tests.sh and run_test.sh Follow-up to the previous commit's deletion of test_input_layout.py and test_mutation_alias_contract.py. Both scripts referenced those files in their pytest invocations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * boundary: strict dtype at output read; translator inserts Cast Reviewer's "no implicit casts at the read boundary" directive, applied to both runtimes: * `CudaRuntime::get_i64` / `get_f64` now check the producer buffer's `buffer_specs[..].dtype` and panic on anything other than `I64` / `F64`. The panic message points at the translator as the place to insert an explicit `Cast` — no silent widening from i32 / bool / f32 / f16 / bf16. * `NativeDynBackend::get_output_i64` / `get_output_f64` match only `NativeData::I64` / `F64` and panic otherwise. The internal `NativeData::i64()` / `f64()` accessors stay (they're load-bearing for in-kernel mixed-dtype binary ops); only the user-visible read boundary is strict. * `CompiledGraph::get_output_i64` / `get_output_f64` docstrings drop the "widens i32 / bool when the producer chose a narrower dtype" line; replaced with "Strict on producer dtype — the graph's output node must already be I64 / F64." For the strict boundary to be reachable when the EP-declared dtype differs from what the producer chose (e.g. `Argsort` / `TopK` emit i32 indices but `torch.int64` was requested), the translator's output loop now inserts an explicit `tensor.cast(declared)` before `output()` when the declared dtype is `I64` / `F64`. The Cast is in the graph — egglog can see it. 
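The difference between a strict read and an explicit, graph-visible cast can be sketched as follows. This is a toy model under stated assumptions: `Buffer`, `cast`, and this `get_output_i64` are hypothetical Python stand-ins for the Rust-side buffer, the `Cast` op, and the strict readers, and dtypes are plain strings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Buffer:
    dtype: str          # e.g. "Int" (i32) or "I64"
    values: List[int]


def cast(buf: Buffer, dtype: str) -> Buffer:
    """Explicit conversion: the analogue of a Cast node placed in the graph."""
    return Buffer(dtype, list(buf.values))


def get_output_i64(buf: Buffer) -> List[int]:
    """Strict read: refuse any buffer that is not already I64 (no widening)."""
    if buf.dtype != "I64":
        raise TypeError(
            f"expected an I64 output buffer, got {buf.dtype}; "
            "add an explicit cast before the output"
        )
    return buf.values


indices = Buffer("Int", [2, 0, 1])            # e.g. i32 indices from an argsort
print(get_output_i64(cast(indices, "I64")))   # explicit cast first, then read
```

Calling `get_output_i64(indices)` directly raises, so a dtype mismatch fails at the read boundary instead of being widened silently.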
`Vec<f32>::from([…])` typed-local style applied to test set_data call sites that previously relied on float-literal inference collapsing to `Vec<f32>`; after 941b6962 added `From<Vec<f64>>`, those literals now infer as `Vec<f64>` and the buffer lands as `NativeData::F64`, panicking the strict read. CPU: 234 pytest passed, 21 skipped. Core: 112 luminal + 16 luminal_nn tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * translator: explicit F32 bridge around unary transcendentals on F64 The CPU `unary_impl` has no native F64 path — `Log2` / `Exp2` / `Sin` / `Sqrt` / `Recip` and the higher-level transcendentals that compose them all bridge through f32 in v1. Previously the panic inside `unary_impl` for `NativeData::F64` was the only thing keeping the F32-bridge story honest, and the comment apologized for not inserting the bridge ourselves. Three changes: * Add `Translator::translate_unary_op_f32_bridge` — same shape as `translate_unary_op`, but when the input is `DType::F64` wraps the op as `f(input.cast(F32)).cast(F64)`. The two `Cast` nodes are in the graph; egglog sees them; the kernel only ever sees F32. * Re-dispatch every transcendental unary in `translator/dispatch.rs` (`aten.{log,log2,exp,exp2,sin,cos,sqrt,rsqrt,reciprocal,sigmoid, tanh,silu,gelu}.default`) through the f32-bridge variant. Ops that don't need transcendentals (`neg` = mul-by-(-1), `relu`, `abs`) stay on plain `translate_unary_op` and preserve F64 natively. * Update the `unary_impl` F64 panic message to direct readers at `translate_unary_op_f32_bridge` — reaching the panic now means a new transcendental dispatch site forgot to bridge. Tests: CPU 234 passed, 21 skipped. The `test_boundary_noop_preserves_dtype_and_values[*-float64_*]` cases continue to pass (they go through the noop addition, not a transcendental, so the bridge doesn't fire for them; but if anyone adds an F64-transcendental test it'll exercise the bridge end-to-end). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ffi: panic on narrow-int dtype codes; defer first-class narrower-int IR Reviewer flagged the "narrow-int widening" docstring at typed_data.rs:156 as concerning: today luminal collapses uint8 / int8 / int16 to `DType::Int` at the byte-conversion boundary. The restrained answer is to **panic** at the boundary rather than widen silently — matches the "no implicit casts" directive end-to-end. Both byte-conversion entry points now reject narrow-int PT2 codes: * `TypedData::from_pytorch_bytes` (user inputs via `set_input_from_ptr`) — codes 1 (uint8) / 2 (int8) / 3 (int16) panic with "cast to torch.int32 at the call site, or wait for the narrower-int IR follow-up." * `pt2_compiled_model::bytes_to_typed` (PT2 file weights) — same panic, same message. Models that previously round-tripped through implicit widening (e.g. quantized int8 weights) will now fail at load time with a clear message pointing at the missing infrastructure. Follow-up issue: "Narrower integer dtypes (i8 / u8 / i16) first-class in `NativeData` + CPU kernels" — once that lands, these panics disappear and the bytes flow through as `DType::U8` / etc. Tests: `test_dtype_boundary.py` 21 passed, 21 skipped. The narrow-int cases in `test_input_dtype_mismatch_rejects` continue to assert `pytest.raises` — the rejection now comes from the FFI panic instead of the input-dtype boundary check, but the contract from the user's perspective is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * compiled_model: unify output read dispatch; clarify zero-copy comment Three review comments addressed in one place: * `compiled_model.py:148` — the stale comment ("float64 is collapsed to f32 internally; registering an f64 device-ptr would have the kernel write 12 bytes into a 24-byte buffer") was wrong after 941b6962 made F64 first-class. 
Rewrite to explain why pre-allocation is GPU-only: the CUDA kernel needs the device-ptr registered before `run()`, while CPU reads back after via `_read_typed_output`. * `compiled_model.py:189` — the per-dtype elif chain duplicated across the CUDA-zero-copy and native paths. Refactor into a single `_output_readers` dispatch table keyed on `out_dtype` → `(getter_name, read_dtype, final_cast)`. The zero-copy fast path for f32 / f16 / bf16 stays as a single check at the top; every other dtype goes through `_read_typed_output`. * `compiled_model.py:243` — annotate the `if _use_zero_copy:` pre-allocation branch: "the CUDA kernel needs the output's device pointer registered *before* `_graph.run()` so the final kernel writes directly into PyTorch's buffer. CPU never zero-copies — there's no separate device buffer to register against." Tests: CPU 234 passed, 21 skipped (no behavior change, just refactor). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: clarify scope of kernel-internal widening accessors Two reviewer comments addressed: * `src/hlir.rs:1212` — the "Narrowing cast: explicit i64 -> i32 … used when the translator bridges an i64 value through a kernel that only has an i32 path" comment apologized for a non-existent problem. Reword: the `Cast` op IS the explicit graph-level conversion; saturating via `as i32` matches `tensor.to(torch.int32)` semantics on overflow. No bridging framing. * `src/hlir.rs:2989` (and the matching `f64` accessor at :2914) — the docstring said "Used by I64-aware kernels; widens other variants when an op promotes a mixed-dtype binary to I64" without scoping why that's OK. Rewrite to be explicit: this is a **kernel-internal** widening accessor, used by binary kernels to read RHS at LHS's width, mirroring PyTorch eager's mixed-dtype promotion. The user-visible read boundary (`DynBackend::get_output_*`) is strict — that's where the reviewer was originally complaining about implicit casts. 
A follow-up translator pass that inserts explicit `Cast` ops on mixed-dtype binary operands would remove this in-kernel widening entirely; not in scope here. No code change. Tests unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Revert "translator: explicit F32 bridge around unary transcendentals on F64" This reverts commit f77a2b92. The bridge inserted `Cast(F32) → unary → Cast(F64)` inside the translator whenever a user called `torch.exp(x)` (or sin/cos/log/...) on an `f64` tensor. The output kept the `torch.float64` dtype tag, but the math itself ran in single precision — exactly the kind of silent precision downgrade hidden behind a wider dtype that this PR's "no implicit casts" directive is meant to reject. The bridge solved one reviewer comment ("unary_impl panics on F64") by relocating the implicit cast from the runtime to the translator — not by removing it. Restore the original behavior: `unary_impl` panics on `F64`, and now with a sharper message that says outright "cast inputs to F32 at the call site" and explicitly names the rejected alternative ("silent F32 bridging is intentionally rejected: it would hide a precision downgrade behind an `F64` dtype tag"). The same wording goes on the Int / I64 / Bool arms so each unsupported variant has a clear, self-contained recovery path. A native F64 transcendental kernel is the proper fix for double-precision `exp`/`log`/`sin`/... — tracked in the F64-CUDA-elementwise follow-up issue. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * hlir: drop From<Vec<f64>> for NativeData; revert luminal_nn / movement / tests churn The PR was carrying ~70 lines of `vec![1., 2., 3.]` → `Vec::<f32>::from([1., 2., 3.])` style churn across `luminal_nn/src/attention.rs`, `src/frontend/movement.rs`, and `src/tests/mod.rs`. 
The trigger was the new `impl From<Vec<f64>> for NativeData`: it made float literals ambiguous between `Vec<f32>` and `Vec<f64>` at every `set_data` call site, forcing the explicit `Vec::<f32>::from([...])` spelling. Drop the `From<Vec<f64>>` impl. It had no callers (`grep -rn` for `Vec<f64>` going into NativeData turned up nothing — the F64 buffer-construction sites in `dyn_backend.rs` and `typed_data.rs` use `as_bytes` on a raw `Vec<f64>`, not the `From` impl). Callers that genuinely want an F64 buffer can still write `NativeData::F64(my_vec)` directly. With the impl gone, float literals re-infer to `f32` via the sole `From<Vec<f32>>` impl — the original idiom — so the three churn-only files revert cleanly to their `main` state. A short comment at the deletion site explains why this impl is intentionally absent. Net diff on the PR drops by ~70 lines of pure style churn. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * translator: cast argsort / topk / sort indices to I64 at the PT2 boundary `torch.argsort` / `torch.topk(...).indices` / `torch.sort(...).indices` always return `int64`. luminal's frontend `stable_argsort` returns `Int` (i32) — the storage-efficient default that direct Rust callers want and that the existing `luminal_cuda_lite` op-functional / search-equivalence tests read back via `rt.get_i32(...)`. Previously, the gap was bridged with a post-hoc Cast in the translator's output loop (`translator/mod.rs`) — "if the EP declared `I64` and the producer chose `Int`, insert a Cast(I64) before Output." That meant a graph node was being inserted by the framework whose presence and location the user couldn't see in their dispatch — exactly the kind of hidden behavior this PR's "no implicit casts" directive is meant to avoid. It also did nothing to fix the underlying mismatch — the producer was still emitting i32 indices. 
Move the cast to the producer side of the PT2 boundary instead: * `translate_argsort` casts the `stable_argsort` result to I64 before inserting it into the tensor map. * `translate_topk` casts the sliced `topk_indices` to I64. Same buffer feeds both the values-gather (via `gather_elements`, which accepts any int dtype on its index operand) and the indices output. * `translate_sort` casts the indices half of the tuple to I64; the values half stays at the source dtype. The frontend `argsort` / `stable_argsort` are unchanged — direct Rust callers continue to get i32 indices. Drops the band-aid output-Cast block from `translator/mod.rs`, which is no longer needed (the producer now emits the right dtype). The strict read boundary still catches any future dtype mismatch loudly. Verification: * `cargo test -p luminal -p luminal_nn`: 114 + 16 + 5 passed. * CPU pytest (hlir_ops + unary + dtype_boundary): 250 passed, 21 skipped. * CUDA pytest (same suites + test_llama3 non-slow): 281 passed (previously 278 passed, 3 failed on `test_argsort_stable_duplicates[idx_dtype1]`, `test_topk_values_width_128_with_indices`, `test_tiny_moe_routing[idx_dtype1]` — all now passing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * translator: cast argmax / argmin to I64; fix narrow-int range-pattern clippy CI surfaced two issues on the dtype-i64-f64-first-class branch: * `Python CUDA Tests` failed on 15 `tests/test_scalars.py` cases for `argmax` / `argmin` (all variants — keepdim, 0d, all-reduce, per-dim). Same root cause as the previously-fixed argsort/topk/sort cases: PyTorch's `torch.argmax` / `torch.argmin` return int64 indices (same `kLong` contract as `sort` / `topk`, pinned in the structured kernel meta function), but `translate_argextremum` was emitting i32 — and the strict CUDA `get_i64` read boundary refused to widen. The old docstring for `translate_argextremum` already named the trick: "the Python wrapper widens at the boundary." 
That wrapper is gone (strict reads), so the fix is to cast at the translator site, same as argsort/topk/sort: - `Ok(result * 1)` → `Ok((result * 1).cast(DType::I64))` - The 0-d short-circuit path's `.cast(DType::Int)` becomes `.cast(DType::I64)`. - Docstring updated to reflect the new boundary cast. I had missed these locally because `test_scalars.py` wasn't in the CUDA sweep I ran while iterating; the PR-CI full pytest run caught them. * `CUDA Clippy` failed on two `1 | 2 | 3 =>` match arms — Rust 1.95 clippy now flags those under `manual_range_patterns`. Rewrote both as `1..=3 =>`. No behavior change. Verification: * `cargo clippy -p luminal_python --features cuda --tests -- -D warnings`: clean. * `LUMINAL_TEST_DEVICE=cuda pytest tests/test_scalars.py`: 171 passed, 4 xfailed (all previously-failing argmax/argmin cases now pass). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: refresh get_output_i64 / get_output_f64 comments + panic messages The doc comments on the strict i64 / f64 readers still pointed at the band-aid output-loop Cast in `translator::translate_graph` ("the translator inserts an explicit `Cast(I64)` before the Output; see `translator::translate_graph`"). That block was reverted earlier in this PR — casts now live at each producer op's translator dispatch site (`translate_argsort` / `translate_topk` / `translate_sort` / `translate_argextremum`, mirroring PyTorch's `kLong` contract pinned by the structured-kernel meta function in `Sorting.cpp`). 
Updates the doc + panic-message wording in three places to match the post-revert reality: * `CompiledGraph::get_output_i64` / `get_output_f64` (pyo3 wrapper, `compiled_graph.rs`) * `NativeDynBackend::get_output_i64` / `get_output_f64` (`dyn_backend.rs`) * `CudaRuntime::get_i64` / `get_f64` (`runtime.rs`) Each one now says, in substance: "the producer's buffer must already carry the requested dtype; on the PT2 path that's handled at the per-op translator dispatch site, not in a centralized output loop." Panic messages reworded from "Insert an explicit Cast(I64) in the graph before the Output" — which read like advice to an end user authoring the IR by hand — to "Add a `Cast(DType::I64)` before the Output in the producer graph," which fits both manual IR-authoring callers and the translator-dispatch case naturally. For `get_output_f64`, also added a one-liner pointing readers at the `unary_impl` F64 panic policy (cast inputs to F32 at the call site; no silent F32 bridging behind an F64 tag). No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: trim get_output_i64 / get_output_f64 doc + panic strings Previous pass over these comments name-dropped every per-op translator dispatch site (`translate_argsort`, `translate_topk`, ...) — context that's irrelevant to a caller of the read functions. Reduce each to a one-line contract: "Strict: the buffer must already be `DType::Xxx`; no widening at the read boundary." Panic strings shortened the same way — keep the "Add a `Cast(...)` before the Output" pointer, drop the editorial trailing clause. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: trim argextremum dtype paragraph Replace the seven-line "PyTorch's kLong contract / structured kernel meta function / storage-efficient default" exposition with one line: "The result is cast to `DType::I64` to match PyTorch's int64 argmax / argmin indices." 
The rest of the docstring (FX positional inputs, `dim=None` flattening, slice-then-materialize rationale) stays — those are non-obvious mechanical details a reader fixing a bug actually needs. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * dtype: PyTorch ScalarType as source-of-truth for PT2 dtype codes Addresses the last open review comment on `compiled_graph.rs:84` / `pt2_util.rs:208`: "we maintain our own enum of the datatypes in pytorch, it would be nice if we could bind over the ones from pytorch and use that as the source of truth and not our own." Before: the PT2 dtype-code numbering (`1=uint8, 2=int8, ..., 13=bfloat16`) was duplicated across **four** Rust sites and a hand-rolled dict in Python. Renumbering or new variants in PyTorch's PT2 schema (e.g. the float8 family added in pytorch/pytorch#143343) silently miscompiled at runtime. After: a single `TorchDType` enum in `crates/luminal_python/rust/src/torch_dtype.rs` owns the canonical numbering. All four call sites route through it: * `pt2_util::torch_dtype_int_to_luminal` — delegates to `TorchDType::from_code(...).into()`. * `typed_data::from_pytorch_bytes` — matches on named variants; narrow-int panic now reads `TorchDType::Byte | Char | Short` instead of `1..=3`. The silent `_ => f32` fallback is gone — unknown codes panic with the variant name. * `pt2_compiled_model::bytes_to_typed` — collapsed to a one-line delegate (`TypedData::from_pytorch_bytes(bytes.to_vec(), dtype)`); the duplicated panic block is deleted. * `compiled_graph::luminal_dtype_to_pt2_code` — delegates to `TorchDType::try_from(dtype).map(|t| t.code())`. Python side: `dtype_util.py`'s hardcoded `_TORCH_DTYPE_TO_CODE` dict is rebuilt at import time from `torch._export.serde.schema.ScalarType.<NAME>.value` — PyTorch becomes the runtime source of truth on both sides of the FFI boundary. `torch._export.serde.schema`
is a quasi-private API (leading underscore) but it's the module PT2 actually wire-serializes against; documented at the import site. Parity test: `tests/test_torch_dtype_parity.py` consumes a new pyo3-exported `_torch_dtype_codes()` map and asserts every Rust variant matches PyTorch's enum by name and value. If PyTorch renumbers or adds a variant, the test fails loudly at CI rather than miscompiling silently at runtime. Negative-test verified locally by setting `Long = 99` — fails with `LONG: luminal=99, pytorch=5`. Added to both `run_test.sh` and `run_all_tests.sh`; CUDA runner globs `tests/` so it picks it up automatically. `TorchDType` enumerates all 19 variants currently in `torch._export.serde.schema.ScalarType` (including `Unknown`, the three `Complex*` types, `Uint16`, and the four `Float8E*` variants); `TryFrom<TorchDType> for DType` returns `Err` for any variant luminal's IR doesn't model, with the boundary code panicking on `Err` with the variant name. Verification: * `cargo test -p luminal_python` — 8 passed (3 new for the enum, 5 pre-existing). * `cargo test -p luminal` — 114 passed. * `cargo clippy -p luminal_python --features cuda --tests -- -D warnings` — clean. * CPU pytest (`test_hlir_ops` + `test_unary` + `test_dtype_boundary` + `test_torch_dtype_parity`) — 252 passed, 21 skipped. * CUDA pytest (same suites + `test_scalars`, `-m "not slow"`) — 444 passed, 4 xfailed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fmt: cargo fmt on torch_dtype refactor Applies `cargo fmt --all` to the three files touched by the previous commit. The Fmt CI job caught: * `lib.rs` — `_torch_dtype_codes` chain wrapped over multiple lines. * `pt2_compiled_model.rs` — `use crate::pt2_parser;` ordered before `use crate::pt2_schema;`. * `typed_data.rs` — `unwrap_or_else` closure inlined onto one line. No behavior change. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: bump Python CUDA Slow Tests to A100-80GB `test_hf_qwen3_moe_real_config_full` loads the real Qwen3-30B-A3B checkpoint at bf16 (≈60 GiB of weights). Modal's default `--gpu A100` is the 40 GiB SKU, which can't hold the full model + PyTorch's reference forward state. When the test OOMs it doesn't release its allocated memory back to the CUDA driver, so every subsequent big-model test in the run inherits a ~39 GiB dead-memory wall and also OOMs (`test_hf_llama38b_mark_dynamic_seq_dim_before_compile`, `test_hf_llama3_full`, ...). Request the 80 GiB SKU explicitly. Aligns with the model-specific Modal jobs on this PR (`gemma`, `qwen3_moe`, etc.) which already spec `A100-80GB` and pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * pt2_util: panic on narrow ints instead of widening to Int `torch_dtype_int_to_luminal` was the one remaining site that silently collapsed `Byte` / `Char` / `Short` to `DType::Int`. Even though the byte-loading paths (`typed_data::from_pytorch_bytes`, `pt2_compiled_model::bytes_to_typed`) already refuse those codes, the metadata-read path through `pt2_util` was still happy to widen, which left the user's actual dtype invisible past the FFI boundary on graphs whose declared inputs were narrow ints. Reject at this site too. Same panic message as the byte paths ("isn't a first-class IR type yet — cast to torch.int32 at the call site, or wait for the narrower-int IR follow-up"), so the failure mode is consistent across all three sites. Test update: `test_input_dtype_mismatch_rejects[uint8 / int8 / int16]` previously asserted a `DTypeBoundaryError` raised at *call* time — that was the artifact of the silent widening flow (the graph compiled with narrow → int32 substitution, then call-time refused because the user's tensor still had the narrow dtype). 
The reject now fires at *compile* time via the translator panic, so the test asserts on the panic message instead. `pyo3_runtime.PanicException` inherits from `BaseException`, not `Exception`, so `pytest.raises` broadens to `BaseException`; the message match keeps the contract test specific. Verification: * `cargo test -p luminal_python` — 8 passed. * CPU pytest (`test_hlir_ops` + `test_unary` + `test_dtype_boundary` + `test_torch_dtype_parity`) — 252 passed, 21 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * torch_dtype: refuse narrow-int conversions in both directions The two `TryFrom` impls were silently mapping narrow ints: * `TryFrom<TorchDType> for DType` mapped `Byte` → `DType::U8`, `Char` → `DType::I8`, `Short` → `DType::I16`, `Uint16` → `DType::U16`. Those luminal DType variants exist in the enum but aren't first- class through the IR (no kernels, no codegen) — handing them out produced buffers downstream code couldn't actually run on. * `TryFrom<DType> for TorchDType` was the mirror: `U8` → `Byte`, `I8` → `Char`, `I16` → `Short`, plus a stale `U16` → `Int` *workaround* (silently aliased uint16 bytes as int32, predating PyTorch's `UINT16 = 28` schema entry). Move all of those to the `Err` arm in both directions. Downstream sites (`compiled_graph::luminal_dtype_to_pt2_code`, `pt2_util::torch_dtype_int_to_luminal`, ...) translate the `Err` into a typed panic with the variant name, so the failure mode is consistent with the rest of the no-implicit-cast directive — same spirit as the previous commit on `pt2_util`. Test updates: * `supported_dtypes_roundtrip` no longer includes `U8`/`I8`/`I16` — they aren't first-class, can't roundtrip. * New `narrow_ints_refuse_conversion` asserts the `Err` direction on `Byte`/`Char`/`Short` (forward) and `U8`/`I8`/`I16`/`U16` (reverse). Verification: * `cargo test -p luminal_python --lib torch_dtype` — 4 passed. 
* CPU pytest (`test_hlir_ops` + `test_unary` + `test_dtype_boundary` + `test_torch_dtype_parity`) — 252 passed, 21 skipped. * `cargo clippy -p luminal_python --features cuda --tests -- -D warnings` — clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * torch_dtype: refuse TF32 in DType → TorchDType conversion `TryFrom<DType> for TorchDType` was silently aliasing `DType::TF32 → TorchDType::Float`. TF32 isn't a storage dtype on the PyTorch side (PyTorch has no `torch.tf32`); it's a compute-mode hint that affects how matmuls are rounded but the underlying buffer is still f32. If a luminal graph genuinely carried `DType::TF32` through to the boundary and we mapped it to `Float`, PyTorch would receive a tensor tagged as f32 that the caller had been tracking as TF32 inside luminal — exactly the silent-dtype-aliasing pattern we've been hunting down through the rest of this PR. Refuse instead. A caller that needs a real f32 bridge can insert an explicit `Cast(F32)` upstream — same pattern as the F64 transcendental story (a graph-level Cast rather than a hidden runtime conversion). The existing `Err`-handling at every caller (`compiled_graph::luminal_dtype_to_pt2_code`, ...) panics with the named variant. Test update: `TF32` joins the narrow-int set in `narrow_ints_refuse_conversion`. Verification: * `cargo test -p luminal_python --lib torch_dtype` — 4 passed. * CPU pytest sweep — 252 passed, 21 skipped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * compiled_model: strict per-dtype dispatch; drop narrow-int reread cast Addresses two recent review comments on `crates/luminal_python/src/luminal/compiled_model.py`: * "are we making vectors that are int32 and putting narrow int types in each slot? What is going on?" — `_output_readers` had three entries that read via `get_output_i32` then `.to(narrow_dtype)`'d back: a leftover from when the IR silently widened narrow ints to i32. 
After the recent narrow-int rejections in `pt2_util` and `torch_dtype.rs`, no graph can actually reach this code with a narrow-int declared output, so the dispatch entries are unreachable. Delete them. * "Why do we fallback to f32 instead of erroring?" — `_read_typed_ output`'s `if entry is None:` branch read the buffer as f32 and `.to(out_dtype)`'d back regardless of the declared dtype. That's the same silent-dtype-aliasing pattern we've been hunting down through the rest of the PR. Replace with an explicit `NotImplementedError` naming the unsupported dtype. Add explicit `_output_readers` entries for `float32` (which was relying on the fallback as a no-op cast on CPU) and for `float16` / `bfloat16` (documented as reading via the generic f32 getter and `.to()`-ing back — the runtime kernels already emit f32 bytes for these, so the cast at the end is the inverse of upstream's conversion, not a fresh precision drop; a proper typed getter is follow-up work). Net effect: every supported output dtype is an explicit dispatch entry, every unsupported one raises a clear `NotImplementedError`, and the narrow-int reread-and-cast path is gone. Verification: * CPU pytest (`test_hlir_ops` + `test_unary` + `test_dtype_boundary` + `test_torch_dtype_parity`) — 252 passed, 21 skipped. * CUDA pytest (same suites + `test_scalars`, `-m "not slow"`) — 444 passed, 4 xfailed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * compiled_model: error vs silently default to float32 on missing dtypes Addresses the new review comment on `compiled_model.py:214` ("probably should error vs assume?"): The pattern `code_to_torch_dtype(codes[i]) if i < len(codes) else torch.float32` appeared in three places (one input loop, two output loops) and silently defaulted to float32 when the Rust side returned a shorter dtype-code list than the declared input/output count. Same silent-default pattern the reviewer's been hunting down through the rest of the PR. 
Replace all three sites with up-front length checks that raise `RuntimeError` if the counts don't match, then build the typed `torch.dtype` list once from the codes and reuse it. Net effect: * If the Rust side returns inconsistent counts, the error names the declared names and the count mismatch directly — points at the graph-construction bug instead of papering it over. * No `else torch.float32` remains for missing-code fallbacks. Also tightened `dtype_util.py`: * `code_to_torch_dtype(unknown_code)` and `torch_dtype_code(unsupported_dtype)` now raise `KeyError` listing the known set, instead of silently aliasing the unknown to float32. Verification: * CPU pytest (`test_hlir_ops` + `test_unary` + `test_dtype_boundary` + `test_torch_dtype_parity`) — 273 passed. * CUDA pytest (same suites + `test_scalars`, `-m "not slow"`) — 444 passed, 4 xfailed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * hlir: drop unused widening accessors; add typed f16 / bf16 read path The reviewer flagged the kernel-internal `NativeData::{f32, f16, bf16, i32, i64, f64, bool}(i)` accessors as silent wideners. After main's PR #330 made the binary kernels strict on dtype, those accessors became dead code — `rg` for callers across `src/` and `crates/` turns up only their own panic strings. Delete all seven. The bigger silent-widening surface was on the **read** side: the native backend's `get_output_f32 / get_output_i32 / get_output_bool` just delegated to `NativeData::to_{f32,i32,bool}_vec()`, which happily accept any source variant. That's the same "widen on read" pattern the reviewer's been hammering on for `get_output_i64 / get_output_f64`. Tighten them with the same match-on-variant + panic-on-mismatch pattern (`Add a Cast(DType::X) before the Output`). 
Tightening the read boundary broke the existing `float16` / `bfloat16` output paths in `compiled_model.py`, which were dispatching through the generic f32 getter and `.to(half)`-ing back — relying on exactly the silent widening we just removed. Add proper typed paths: * Backend trait: `get_output_f16` / `get_output_bf16` with default panic impls (`src/dyn_backend.rs`). * `NativeDynBackend`: strict match on `F16` / `Bf16` variants. * `luminal_cuda_lite::CudaRuntime`: pre-existing `get_f16` / `get_bf16` reinterpreted bytes without checking dtype — add the same buffer-spec strictness as `get_i64` / `get_f64`. * `CudaLiteDynBackend`: wire `get_output_f16` / `get_output_bf16` through. * `CompiledGraph` (pyo3): new `get_output_f16` / `get_output_bf16` methods that return `bytes` (Python has no native f16/bf16); caller bit-casts via `torch.frombuffer(..., dtype=torch.float16)` / `torch.bfloat16`. * `compiled_model.py`: dispatch table maps `torch.float16` → `get_output_f16` (and same for bf16); the helper bit-casts the bytes back, then `.clone()`s so the returned tensor owns its storage. Net effect: every supported read boundary is strict — buffer dtype must already match the requested width. No silent widening anywhere in the read path. Verification: * `cargo test -p luminal -p luminal_python` — 114 + 9 + 5 passed. * `cargo clippy -p luminal_python --features cuda --tests -- -D warnings` — clean. * CPU pytest (`test_hlir_ops` + `test_unary` + `test_dtype_boundary` + `test_torch_dtype_parity`) — 252 passed, 21 skipped. * CUDA pytest (same suites + `test_scalars`, `-m "not slow"`) — 444 passed, 4 xfailed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * compiled_model: use bytearray for f16/bf16 frombuffer to silence warning The f16/bf16 read path in `_read_typed_output` calls `torch.frombuffer` on the `bytes` returned by `CompiledGraph::get_output_f16` / `get_output_bf16`. 
Python `bytes` is immutable, so PyTorch emits a `UserWarning` ("The given buffer is not writable... You may want to copy the buffer to protect its data or make it writable **before converting** it to a tensor"). That warning's message contains the word "converting", which `test_dtype_boundary.test_matching_dtype_does_not_raise` catches in its boundary-warning filter — surfaced in CI as a `[cpu-bfloat16]` failure on the most recent run. Wrap the bytes in `bytearray()` before `frombuffer` so the storage is writable and no warning fires. `bytearray(b)` copies the underlying bytes once; the returned tensor owns its own storage, so the previous `.clone()` becomes unnecessary and is removed. No behavior change. CPU sweep still 252 passed / 21 skipped locally (verified before push this time). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ruff format: line-break style on f16/bf16 frombuffer call Ruff's `pre-commit` hook reformats the multi-line `torch.frombuffer(...) .reshape(tuple(shape))` chain to break after `.reshape(` instead of inside `frombuffer(...)`. CI's Ruff Format step flagged it on the previous commit (`4d882763`). No semantic change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Austin Glover <austin_glover@berekely.edu> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Luminal is a high-performance general-purpose inference compiler.
Usage
use luminal::prelude::*;
// Create compute graph
let mut cx = Graph::new();
let a = cx.tensor((3, 1));
let b = cx.tensor((1, 4));
let c = a.matmul(b).output();
// Compile
cx.build_search_space::<NativeRuntime>();
let mut rt = cx.search(NativeRuntime::default(), 1);
// Set input tensors
rt.set_data(a, vec![1.0, 2.0, 3.0]);
rt.set_data(b, vec![1.0, 2.0, 3.0, 3.0]);
// Run
rt.execute(&cx.dyn_map);
// Get output tensor
println!("Result: {:?}", rt.get_f32(c));
Getting Started
Llama 3 8B
Here's a quick example of how you can run Llama 3 8B locally using Luminal on CUDA:
cd ./examples/llama
cargo run --release
Features
Speed
Luminal can run Q8 Llama 3 8B at ~80% of theoretical max performance on an H100. The goal is to become the fastest ML framework for any model on any device.
Simplicity
The core of Luminal is and always will be minimal. It should be possible to understand the entire core library in an afternoon.
PyTorch-native
Luminal directly integrates with PyTorch as a compiler backend. Simply do torch.compile(model, backend=luminal_cuda) to compile your PyTorch models. We also have an excellent tensor API in Rust.
RISC-style architecture
Everything in Luminal boils down to 15 primitive ops:
- Unary - Log2, Exp2, Sin, Sqrt, Recip
- Binary - Add, Mul, Mod, LessThan
- Other - SumReduce, MaxReduce, Iota, Gather, Scatter, Cast
These ops are enough to support transformers, convnets, and nearly every popular model in the world.
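To make the claim concrete, here is a minimal scalar sketch of how some familiar operations can be rebuilt from the primitives listed above. The function names (`sub`, `div`, `exp`, `ln`, `cos`, `max`) and the scalar `f32` setting are assumptions for illustration only; they are not Luminal's API, and the real compiler works on tensors in a graph rather than on plain floats.

```rust
use std::f32::consts::{FRAC_PI_2, LN_2, LOG2_E};

// Stand-ins for the primitive ops, written as plain f32 functions.
// These are hypothetical helpers, not Luminal's actual op definitions.
fn add(a: f32, b: f32) -> f32 { a + b }
fn mul(a: f32, b: f32) -> f32 { a * b }
fn recip(x: f32) -> f32 { 1.0 / x }
fn less_than(a: f32, b: f32) -> f32 { if a < b { 1.0 } else { 0.0 } }

// Derived ops, each expressed only through the primitives above
// (plus the Exp2 / Log2 / Sin primitives via the std methods).

// a - b = a + (b * -1)
fn sub(a: f32, b: f32) -> f32 { add(a, mul(b, -1.0)) }

// a / b = a * recip(b)
fn div(a: f32, b: f32) -> f32 { mul(a, recip(b)) }

// exp(x) = exp2(x * log2(e))
fn exp(x: f32) -> f32 { mul(x, LOG2_E).exp2() }

// ln(x) = log2(x) * ln(2)
fn ln(x: f32) -> f32 { mul(x.log2(), LN_2) }

// cos(x) = sin(pi/2 - x)
fn cos(x: f32) -> f32 { sub(FRAC_PI_2, x).sin() }

// max(a, b) using LessThan as a 0/1 mask: m*b + (1 - m)*a, where m = (a < b)
fn max(a: f32, b: f32) -> f32 {
    let m = less_than(a, b);
    add(mul(m, b), mul(sub(1.0, m), a))
}

fn main() {
    println!("sub(5, 3)  = {}", sub(5.0, 3.0));
    println!("div(6, 3)  = {}", div(6.0, 3.0));
    println!("exp(1)     = {}", exp(1.0));
    println!("ln(8)      = {}", ln(8.0));
    println!("cos(0)     = {}", cos(0.0));
    println!("max(2, 3)  = {}", max(2.0, 3.0));
}
```

The mask-based `max` is only meant to show the idea; it does not handle NaN or infinite inputs the way a production kernel would.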
Search
The best heuristic is no heuristic. Luminal tries to search every possible decision to give the compiler the flexibility to discover complex optimizations. This allows us to automatically discover Flash Attention and other similarly complex optimizations without relying on hand-written operations or heuristics. It also allows us to stay extremely small and simple long into the future and beat the performance of far larger frameworks.
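As a toy illustration of searching instead of hard-coding a rule, the sketch below enumerates a few candidate tile sizes and keeps whichever scores best under a cost function. Everything here is an assumption made for the example: the candidate list, the `cost` formula, and the `search` function are hypothetical and far simpler than Luminal's actual search space.

```rust
// Hypothetical candidate tile sizes a compiler might choose between.
const TILES: [usize; 4] = [16, 32, 64, 128];

// Toy cost model (an assumption, not a real profiler):
// a fixed overhead of 4 per tile launched, plus one unit per padded element.
fn cost(tile: usize, n: usize) -> usize {
    let tiles = (n + tile - 1) / tile; // ceil(n / tile)
    let waste = tiles * tile - n;      // elements of padding
    tiles * 4 + waste
}

// Exhaustive search: score every candidate and keep the cheapest.
// A fixed heuristic such as "always use 64" would skip this step.
fn search(n: usize) -> usize {
    *TILES
        .iter()
        .min_by_key(|&&tile| cost(tile, n))
        .expect("candidate list is non-empty")
}

fn main() {
    for n in [48, 100] {
        println!("n = {n}: best tile = {}", search(n));
    }
}
```

Under this toy cost model, different problem sizes select different tiles, which is the kind of shape-specific decision a single hand-written rule tends to get wrong.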
Native
The current ML ecosystem is too fragmented, and the solution isn't another layer of abstraction. Luminal is written in Rust and interacts directly with the accelerator APIs (CUDA, Metal, etc.). No indirections or abstractions, compatibility layers, Docker containers, or virtual environments. Just a statically-linked Rust crate.
Validated against PyTorch
Correctness matters. We write as many tests as possible to cover all ops and verify they work the same as an equivalent PyTorch implementation. (Improvements needed!)
Ideology
Why does this look so different from other DL libraries?
Most deep learning libraries are eager-first, meaning each op call directly operates on the data. In PyTorch, when you see x + y, the addition actually happens right there. This is great for debugging because it works exactly as most developers expect.
However, this isn't great for performance. What makes sense for a developer doesn't work well for the machine, in the same way that no one writes assembly by hand. Most libraries try to fix this problem by tacking on operator fusion or JIT compilation to change the compilation flow to something better for the machine. Turns out this is super difficult even for PyTorch!
What about XLA?
XLA, torch.compile, TVM, and other traditional compiler stacks suffer from complexity explosion. They are made up of a very large set of destructive (one-direction) rewrite rules that lower and optimize a graph from a high-level representation to low-level machine code. But since these rules are destructive, they are required to only fire when it's certain that there's a performance benefit. This leads to the rules becoming very complex, special-cased, and numerous. Once additional hardware backends, model architectures, and new dtypes get thrown in, these stacks buckle under the weight of their own complexity and often produce very suboptimal code, requiring DSLs like Pallas or Triton to regain performance.
Compile everything
A core tenet of Luminal is ahead-of-time compilation. Whenever possible, push everything to compile time and leave nothing to run time. Luminal takes an approach more similar to XLA and tinygrad. Everything's static here. When you write out an expression like x + y, no actual computation happens. The operation is recorded to a directed acyclic computation graph for execution later. Only once graph.execute() is run does the computation happen. But isn't that just lazy execution? Yes it is! But in Luminal everything is done this way. All neural networks are built up as static computation graphs, compiled, and executed later.
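The record-then-execute idea can be sketched in a few lines. The `TinyGraph` and `Node` types below are hypothetical and exist only for this example; they are not Luminal's `Graph`, and they skip compilation entirely to focus on the fact that building an expression does no arithmetic.

```rust
// Hypothetical miniature of "record now, execute later".
// Not Luminal's real types; scalars stand in for tensors.
#[derive(Clone, Copy)]
enum Node {
    Input(usize),      // index into the inputs supplied at execute time
    Add(usize, usize), // operands are ids of earlier nodes
    Mul(usize, usize),
}

#[derive(Default)]
struct TinyGraph {
    nodes: Vec<Node>,
}

impl TinyGraph {
    // Recording a node only appends to the list and returns its id.
    fn push(&mut self, n: Node) -> usize {
        self.nodes.push(n);
        self.nodes.len() - 1
    }
    fn input(&mut self, slot: usize) -> usize { self.push(Node::Input(slot)) }
    fn add(&mut self, a: usize, b: usize) -> usize { self.push(Node::Add(a, b)) }
    fn mul(&mut self, a: usize, b: usize) -> usize { self.push(Node::Mul(a, b)) }

    // Nodes only refer to earlier ids, so insertion order is already
    // a valid topological order for evaluation.
    fn execute(&self, inputs: &[f32]) -> Vec<f32> {
        let mut vals = Vec::with_capacity(self.nodes.len());
        for n in &self.nodes {
            let v = match *n {
                Node::Input(i) => inputs[i],
                Node::Add(a, b) => vals[a] + vals[b],
                Node::Mul(a, b) => vals[a] * vals[b],
            };
            vals.push(v);
        }
        vals
    }
}

fn main() {
    let mut g = TinyGraph::default();
    let x = g.input(0);
    let y = g.input(1);
    let s = g.add(x, y);   // nothing is computed here
    let out = g.mul(s, s); // (x + y)^2, still only recorded

    // Computation happens only now, with concrete inputs.
    let vals = g.execute(&[2.0, 3.0]);
    println!("result = {}", vals[out]);
}
```

Because the whole computation exists as data before anything runs, a compiler is free to rewrite or fuse the recorded nodes between the build step and the execute step.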
First-class dynamism
A fully-static world would be nice, but we live in a world of necessary dynamism. So we model dynamic shapes natively, as symbolic dimensions. Luminal supports arbitrary symbolic dimensions, including complex expressions, to give us shapes like (s, 4096), (b, h, w + 3), etc. This rich representation gives the compiler full visibility into shapes and lets it still do aggressive specialization.
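One way to picture a symbolic shape such as (b, h, w + 3) is as a small expression tree per dimension that is resolved once concrete sizes are known. The `Dim` enum and `resolve` function below are assumptions made for this sketch; Luminal's own expression type is not shown in this document and is likely richer.

```rust
use std::collections::HashMap;

// Hypothetical symbolic dimension, for illustration only.
enum Dim {
    Const(usize),
    Sym(char),
    Add(Box<Dim>, Box<Dim>),
    Mul(Box<Dim>, Box<Dim>),
}

impl Dim {
    // Resolve to a concrete size; returns None if a symbol has no binding.
    fn eval(&self, env: &HashMap<char, usize>) -> Option<usize> {
        match self {
            Dim::Const(n) => Some(*n),
            Dim::Sym(s) => env.get(s).copied(),
            Dim::Add(a, b) => Some(a.eval(env)? + b.eval(env)?),
            Dim::Mul(a, b) => Some(a.eval(env)? * b.eval(env)?),
        }
    }
}

// Resolve every dimension of a shape, failing if any symbol is unbound.
fn resolve(shape: &[Dim], env: &HashMap<char, usize>) -> Option<Vec<usize>> {
    shape.iter().map(|d| d.eval(env)).collect()
}

fn main() {
    // The shape (b, h, w + 3) from the text.
    let shape = vec![
        Dim::Sym('b'),
        Dim::Sym('h'),
        Dim::Add(Box::new(Dim::Sym('w')), Box::new(Dim::Const(3))),
    ];
    // Bindings supplied at run time.
    let env = HashMap::from([('b', 2), ('h', 8), ('w', 5)]);
    println!("{:?}", resolve(&shape, &env));
}
```

Keeping the expression (rather than a single unknown placeholder) is what lets a compiler reason about relationships between dimensions before the bindings arrive.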
But why?
A consequence of this is that the actual computation that gets run can be radically different from the code that was written. Since we have an entire neural network fully represented in a compute graph, Luminal has global knowledge. This means we can push most ML complexity to the compiler. For instance, devices, datatypes, and even autograd are modeled ahead of time and optimized by the compiler!
Now we can do:
- Aggressive kernel fusion
- Shape-specific kernels compiled at runtime
- Low-precision dtypes (mxfp4, nvfp4, fp8, etc.)
- Complex multi-device parallelism topologies, searched ahead-of-time
- Networks can be written in generic code, but compiled and run fast on hyper-specific architectures
Where are we?
- Native PyTorch support
- Many kernel libraries supported in the search space (FlashInfer, cuBLASLt, etc.)
- Many models implemented in our Rust tensor API in examples/.
- We have a small library of NN modules in luminal_nn, including transformers.
- A significant number of high-level ops are implemented in hl_ops. We are aiming to match the most used ~80% of the PyTorch API.
Some things on the roadmap:
- More fine-grained dialects supporting thread- and warp-level intrinsics like TMA and tcgen.05
- ROCm backend
- More public inference accelerator backends (coming very soon...)
- Public benchmarking suite
- Automatically searched model parallelism (TP, PP, EPS, EPR, SP, etc.)
- Write compiler for quantum photonic retro encabulator
- Build dyson swarm
License
Licensed under the Apache License, Version 2.0 http://www.apache.org/licenses/LICENSE-2.0 or the MIT license http://opensource.org/licenses/MIT, at your option. This file may not be copied, modified, or distributed except according to those terms.