luminal_python: DLRM vanilla-PyTorch fast paths + DCE + batched FFI

Adds the pieces needed for torch.compile(model, backend=luminal_backend) to ingest a vanilla DLRMv1 (with nn.EmbeddingBag modules) at correctness parity (max abs diff ≤ 1.79e-07 across num_cat∈{1,2,3,4,8,16,32}) and within 0.45-0.84× of Inductor (no CG) across that sweep. Translator - aten._embedding_bag_forward_only.default: fixed-stride offsets, mode=sum, routes to the fused embedding_bag_sum_kernel under cuda - aten.index_select.default: 2-D source / dim=0 / 1-D index lowering - aten.addmm.default fast path: detect `addmm(bias, x, weight.t())` and emit luminal_cuda_lite::kernel::linear_bias with the original (N,K) weight - aten.bmm.default split out + fast path through matmul_3d_t when the RHS comes from a [0,2,1] permute - aten.sum.dim_IntList peephole: sum ← view ← index_select → embedding_bag_sum_kernel - aten.index.Tensor peephole: [None, li, lj] over bmm(cat(unsq...), perm(cat)) → dlrm_pairwise_dot_lower_tri (new kernel, ported from worktree) - aten.permute.default records 2-D [1,0] and 3-D [0,2,1] sources into `transpose_2d_source` for the addmm / bmm fast paths - node_chain extended to populate from variadic-first-input ops (cat) so the PairwiseDot peephole can walk back through cat - New `op_inputs` side-table holds all input names per node so multi-input peepholes don't need to re-scan FX nodes - Post-translation DCE pass walks back from Output HLIR sinks and drops every unreachable producer — load-bearing for the PairwiseDot peephole because the cat→bmm→permute chain it supersedes was surviving past egglog otherwise Runtime - Drop the trailing cuStreamSynchronize at end of CudaRuntime::execute (incompatible with torch.cuda.CUDAGraph capture; PyTorch syncs on read) - New batched `run_with_ptrs(input_ptrs, output_ptrs)` PyO3 method that registers all I/O device pointers and runs in one FFI crossing. For DLRM at nc=32 collapses ~7 separate PyO3 calls per iter into 1; ~22% per-iter wall-clock improvement on top of the algorithmic peepholes. Python wrapper - main.py: narrow CUDA weight tensors (i64→i32, f64→f32) before extracting device pointers — luminal only has 32-bit Int/Float, the old direct-ptr path silently read half-words. Surfaced from this DLRM work but applies to every model whose state_dict carries i64 buffers. - main.py: set LUMINAL_BACKEND_CUDA=1 around _compile_pt2 so translator fast paths gate themselves correctly (cuda-only kernels behind native). - pt2.py: thread `search_iterations` through pt2_backend so callers can override the default of 10. - compiled_model.py: CUDA path now uses run_with_ptrs. New cuda_lite kernels - dlrm_interact.rs::PairwiseDotLowerTriKernel — variadic, N (B,D) inputs, computes the strict-lower-tri F*(F-1)/2 dot products in one launch. - embedding_bag.rs::MultiTableEmbeddingBagSumKernel — K (weight, idx) pointer pairs through two packed staging buffers, produces (B, K, D) in one launch. Drop-in for vanilla DLRMv1's F separate nn.EmbeddingBag modules. Translator integration is a follow-up; the kernel currently has a `device_ptr` API miss against the in-tree cudarc that needs a one-line tuple-destructure fix before it'll compile (see follow-up). Status: - 213/213 tests in tests/test_hlir_ops.py pass. - DLRM vanilla-PT sweep matches or beats Inductor (no CG) at every nc≤16, ~tied at nc=32. Still 1.6–3.2× behind Inductor+CG (the remaining gap is PyTorch-side Python/dynamo overhead around our run_with_ptrs call; torch.cuda.graph capture of luminal's runtime is the next item). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Metal qwen (#327 )
2026-06-01 21:49:47 +09:00 · 2026-05-21 00:29:37 +00:00 · 2026-05-20 13:26:34 -04:00 · 2026-05-20 01:13:02 -04:00 · 2026-05-19 17:26:38 -04:00 · 2026-05-18 19:06:31 -04:00
37 changed files with 6147 additions and 646 deletions
--- a/.github/workflows/test-core.yml
+++ b/.github/workflows/test-core.yml
@@ -21,4 +21,4 @@ jobs:
        steps:
            - uses: actions/checkout@v6
            - name: Run tests
-              run: cargo test --release --workspace --exclude luminal_cuda_lite --exclude luminal_metal --exclude luminal_bench --verbose
+              run: cargo test --release -p luminal -p luminal_nn -p luminal_tracing -p luminal_python --verbose
--- a/.github/workflows/test-full-cuda.yml
+++ b/.github/workflows/test-full-cuda.yml
@@ -0,0 +1,67 @@
+name: Test Full CUDA
+
+on:
+    pull_request_target:
+        branches: ["main"]
+        types: [labeled, synchronize]
+    workflow_dispatch:
+
+jobs:
+    rust_cuda_ignored_tests:
+        if: >-
+            github.event_name == 'workflow_dispatch'
+            || (github.event_name == 'pull_request_target'
+                && contains(github.event.pull_request.labels.*.name, 'full-modal-ready'))
+        name: Rust CUDA Ignored Tests
+        runs-on: ubuntu-latest
+        environment: Modal
+        timeout-minutes: 300
+
+        steps:
+            - uses: actions/checkout@v6
+              with:
+                  ref: ${{ github.event.pull_request.head.sha || github.sha }}
+            - name: Set up Python
+              uses: actions/setup-python@v5
+              with:
+                  python-version: "3.11"
+            - name: Install Modal
+              run: pip install modal
+            - name: Run ignored CUDA Rust tests on Modal
+              env:
+                  MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
+                  MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
+                  GPU_TYPE: H100
+                  MODAL_TIMEOUT: "14400"
+                  CARGO_TEST_ARGS: "--ignored --test-threads=1"
+              run: modal run ci/modal_cargo_test.py
+
+    python_cuda_slow_tests:
+        if: >-
+            github.event_name == 'workflow_dispatch'
+            || (github.event_name == 'pull_request_target'
+                && contains(github.event.pull_request.labels.*.name, 'full-modal-ready'))
+        name: Python CUDA Slow Tests
+        runs-on: ubuntu-latest
+        environment: Modal
+        timeout-minutes: 300
+        defaults:
+            run:
+                working-directory: crates/luminal_python
+
+        steps:
+            - uses: actions/checkout@v6
+              with:
+                  ref: ${{ github.event.pull_request.head.sha || github.sha }}
+            - name: Set up Python
+              uses: actions/setup-python@v5
+              with:
+                  python-version: "3.11"
+            - name: Install Modal
+              run: pip install modal
+            - name: Run slow pytest CUDA tests on Modal
+              env:
+                  MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
+                  MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
+                  HF_TOKEN: ${{ secrets.HF_TOKEN }}
+              run: modal run modal_pytest_runner.py --gpu A100 --timeout 14400 tests/ -v -s -m slow
--- a/.github/workflows/test-metal.yml
+++ b/.github/workflows/test-metal.yml
@@ -17,3 +17,20 @@ jobs:
            - uses: actions/checkout@v6
            - name: Run Metal crate tests
              run: rustup update; cargo test --release -p luminal_metal --verbose -- --test-threads=1
+
+    llama_1b_metal_example:
+        name: Llama 1B Metal Example
+        runs-on: macos-14-xlarge
+        timeout-minutes: 120
+
+        steps:
+            - uses: actions/checkout@v6
+            - name: Print runner hardware
+              run: system_profiler SPHardwareDataType SPDisplaysDataType
+            - name: Cache Hugging Face models
+              uses: actions/cache@v4
+              with:
+                  path: ~/.cache/huggingface
+                  key: llama-1b-metal-hf-${{ runner.os }}-${{ runner.arch }}-v1
+            - name: Run Llama 1B Metal example and validate output
+              run: rustup update; python3 ci/metal_llama_1b_example.py
--- a/ci/metal_llama_1b_example.py
+++ b/ci/metal_llama_1b_example.py
@@ -0,0 +1,48 @@
+import os
+import subprocess
+import sys
+
+
+def run_and_capture(command: list[str], *, cwd: str, env: dict[str, str]) -> str:
+    process = subprocess.Popen(
+        command,
+        cwd=cwd,
+        env=env,
+        stdout=subprocess.PIPE,
+        stderr=subprocess.STDOUT,
+    )
+    assert process.stdout is not None
+
+    chunks = []
+    while True:
+        chunk = process.stdout.read1(4096)
+        if not chunk:
+            break
+        sys.stdout.buffer.write(chunk)
+        sys.stdout.buffer.flush()
+        chunks.append(chunk)
+
+    return_code = process.wait()
+    output = b"".join(chunks).decode("utf-8", errors="replace")
+    if return_code:
+        raise subprocess.CalledProcessError(return_code, command, output=output)
+    return output
+
+
+def main():
+    repo_root = os.environ.get("GITHUB_WORKSPACE", os.getcwd())
+    sys.path.insert(0, os.path.join(repo_root, "ci"))
+    from example_output import validate_output
+
+    output = run_and_capture(
+        ["cargo", "run", "--release", "-p", "luminal_metal", "--example", "llama_1b"],
+        cwd=repo_root,
+        env=os.environ.copy(),
+    )
+    if "TTFT:" not in output or "TPOT:" not in output:
+        raise AssertionError("Llama 1B Metal example did not complete generation")
+    validate_output("llama", output)
+
+
+if __name__ == "__main__":
+    main()
--- a/ci/modal_cargo_test.py
+++ b/ci/modal_cargo_test.py
@@ -1,8 +1,10 @@
 import modal
 import subprocess
 import os
+import shlex

 gpu_type = os.environ.get("GPU_TYPE", "T4")
+modal_timeout = int(os.environ.get("MODAL_TIMEOUT", "7200"))
 CUDARC_CUDA_VERSION = "12080"

 app = modal.App("luminal-ci-cargo-test")
@@ -28,7 +30,7 @@ cuda_image = (
@app.function(
    image=cuda_image,
    gpu=gpu_type,
-    timeout=7200,  # 2 hours
+    timeout=modal_timeout,
 )
 def run_cargo_test():
    """Run cargo test for luminal_cuda_lite on a Modal GPU."""
@@ -43,17 +45,20 @@ def run_cargo_test():
    )
    compute_cap = result.stdout.strip().replace(".", "")

+    test_args = shlex.split(os.environ.get("CARGO_TEST_ARGS", "--test-threads=1"))
+    cmd = [
+        "cargo",
+        "test",
+        "--release",
+        "-p",
+        "luminal_cuda_lite",
+        "--verbose",
+        "--",
+        *test_args,
+    ]
+    print("Running:", " ".join(cmd), flush=True)
    subprocess.run(
-        [
-            "cargo",
-            "test",
-            "--release",
-            "-p",
-            "luminal_cuda_lite",
-            "--verbose",
-            "--",
-            "--test-threads=1",
-        ],
+        cmd,
        cwd=WORKDIR,
        env={
            **os.environ,
--- a/crates/luminal_cuda_lite/src/host/cublaslt/mod.rs
+++ b/crates/luminal_cuda_lite/src/host/cublaslt/mod.rs
@@ -196,28 +196,42 @@ impl EgglogOp for CuBlasLt {
            // so their presence is the matmul-broadcast signal — no further
            // stride-form check needed.
            //
-            // Delete the HLIR `Mul` and its generic fusion-region alternative
-            // from the Mul eclass. Emptying that eclass lets the empty-eclass
-            // cascade prune the downstream Sum / KernelSum fallback. cuBLAS,
-            // TileMatmulFullSplit, KernelBatchMatVec, and KernelBatchMatMul all
-            // take original (a, b) inputs rather than the Mul eclass, so they
-            // survive the cascade and remain as the matmul output alternative.
+            // Delete the HLIR `Mul` fallback from the Mul eclass. Emptying that
+            // eclass lets the empty-eclass cascade prune the downstream Sum /
+            // KernelSum fallback. cuBLAS, TileMatmulFullSplit, KernelBatchMatVec,
+            // and KernelBatchMatMul all take original (a, b) inputs rather than
+            // the Mul eclass, so they survive the cascade and remain as the
+            // matmul output alternative.
            Rule::raw("(rule
                ((= ?mul (Op (Mul ?shape ?as ?bs ?os) ?inputs))
-                 (= ?mul (Op (FusionEnd ?fshape ?fos ?fdt) ?finputs))
                 (= ?sum (Op (Sum ?sshape ?sk ?ssi ?sks ?sso) (ICons ?mul (INil))))
                 (= ?sum (Op (cublaslt ?cm ?cn ?ck ?cta ?ctb ?cao ?cbo ?cco ?cdo ?clda ?cldb ?cldc ?cldd ?cbc ?csa ?csb ?csc ?csd ?cadt ?cbdt ?ccdt ?cddt ?ccompute ?cscale ?calpha ?cbeta ?cepilogue) ?ci)))
-                ((delete (Op (Mul ?shape ?as ?bs ?os) ?inputs))
-                 (delete (Op (FusionEnd ?fshape ?fos ?fdt) ?finputs)))
+                ((delete (Op (Mul ?shape ?as ?bs ?os) ?inputs)))
                :ruleset cleanup
            )"),
            Rule::raw("(rule
                ((= ?mul (Op (Mul ?shape ?as ?bs ?os) ?inputs))
-                 (= ?mul (Op (FusionEnd ?fshape ?fos ?fdt) ?finputs))
                 (= ?sum (Op (Sum ?sshape ?sk ?ssi ?sks ?sso) (ICons ?mul (INil))))
                 (= ?sum (Op (KernelBatchMatMul ?bos ?bk ?bas ?baks ?bbs ?bbks ?bouts ?bdt) ?bi)))
-                ((delete (Op (Mul ?shape ?as ?bs ?os) ?inputs))
-                 (delete (Op (FusionEnd ?fshape ?fos ?fdt) ?finputs)))
+                ((delete (Op (Mul ?shape ?as ?bs ?os) ?inputs)))
+                :ruleset cleanup
+            )"),
+            // Also remove any generic fusion wrapper that was unioned with the
+            // broadcast Mul. This is deliberately a separate rule: requiring a
+            // FusionEnd in the same eclass made cleanup miss valid cuBLASLt
+            // matmuls when fusion wrapping was absent.
+            Rule::raw("(rule
+                ((= ?mul (Op (FusionEnd ?fshape ?fos ?fdt) ?finputs))
+                 (= ?sum (Op (Sum ?sshape ?sk ?ssi ?sks ?sso) (ICons ?mul (INil))))
+                 (= ?sum (Op (cublaslt ?cm ?cn ?ck ?cta ?ctb ?cao ?cbo ?cco ?cdo ?clda ?cldb ?cldc ?cldd ?cbc ?csa ?csb ?csc ?csd ?cadt ?cbdt ?ccdt ?cddt ?ccompute ?cscale ?calpha ?cbeta ?cepilogue) ?ci)))
+                ((delete (Op (FusionEnd ?fshape ?fos ?fdt) ?finputs)))
+                :ruleset cleanup
+            )"),
+            Rule::raw("(rule
+                ((= ?mul (Op (FusionEnd ?fshape ?fos ?fdt) ?finputs))
+                 (= ?sum (Op (Sum ?sshape ?sk ?ssi ?sks ?sso) (ICons ?mul (INil))))
+                 (= ?sum (Op (KernelBatchMatMul ?bos ?bk ?bas ?baks ?bbs ?bbks ?bouts ?bdt) ?bi)))
+                ((delete (Op (FusionEnd ?fshape ?fos ?fdt) ?finputs)))
                :ruleset cleanup
            )"),
        ]
--- a/crates/luminal_cuda_lite/src/kernel/conv2d.rs
+++ b/crates/luminal_cuda_lite/src/kernel/conv2d.rs
@@ -1,60 +1,533 @@
-//! Direct conv2d_bias kernel — fuses unfold + matmul + bias into one
-//! CUDA kernel with no `(H_out*W_out, C_in*K*K)` intermediate matrix.
+//! CUDA conv2d-with-bias backend rewrite.
 //!
-//! This is exposed as a luminal `CustomOp`, not a standard egglog-rewritten
-//! `KernelOp`, because the conv has no useful fusion opportunities with
-//! surrounding ops in the graphs it's used in (the VAE's resnet blocks),
-//! and pattern-matching the unfold+permute+merge_dims+matmul+bias chain
-//! reliably from egglog is significantly more work than just bypassing
-//! the egglog rewrite path entirely.
-//!
-//! The kernel is one-thread-per-output: each thread computes
-//!   `out[co, ho, wo] = bias[co] + sum_{ci,ki,kj} input[ci, ho*S+ki-P, wo*S+kj-P] * weight[co, ci, ki, kj]`
-//! with bounds checks on the spatial dims for padding. This is far from
-//! peak FLOPs (no shared-memory tiling, no warp-level reduction over K)
-//! but it's correct and the memory footprint is just the input + weight +
-//! bias + output buffers — no `(M, K)` or `(M, N, K)` intermediate, so it
-//! scales linearly with the actual conv FLOPs rather than blowing up at
-//! large H/W like the unfold-based formulation.
+//! `KernelConv2D` is selected by egglog from pure HLIR conv graphs and lowers
+//! to a one-thread-per-output CUDA kernel. It avoids materializing unfold/im2col
+//! intermediates while keeping model code free of custom ops.

 use std::sync::Arc;

 use cudarc::driver::{CudaFunction, CudaModule, CudaSlice, CudaStream};
 use luminal::prelude::FxHashMap;
 use luminal::{
-    dtype::DType, graph::Graph, op::CustomOp, op::LLIROp, prelude::GraphTensor, shape::Expression,
+    dtype::DType,
+    egglog_utils::{
+        api::{Rule, SortDef, sort},
+        base::{DTYPE, ELIST, EXPRESSION, OP_KIND},
+        extract_dtype, extract_expr, extract_expr_list,
+    },
+    op::{EgglogOp, LLIROp},
+    prelude::FxHashSet,
+    shape::{Expression, flatten_strides},
 };

 use crate::compile_module_image_for_current_device;
-use crate::kernel::KernelOp;
+use crate::kernel::{KernelOp, hlir::generate_dyn_dims_defines};

-/// Direct conv2d-with-bias kernel. All shape/kernel params are static
-/// (baked into the CUDA source via #defines), so each conv shape gets
-/// its own compiled kernel. Inputs (in order): input `(C_in, H_in, W_in)`,
-/// weight `(C_out, C_in*K*K)` (i.e. flattened `(C_out, C_in, K, K)`), bias
-/// `(C_out,)`. Output: `(C_out, H_out, W_out)`.
-#[derive(Debug, Clone)]
-pub struct Conv2DKernel {
-    pub c_in: usize,
-    pub h_in: usize,
-    pub w_in: usize,
-    pub c_out: usize,
-    pub kernel: usize,
-    pub stride: usize,
-    pub padding: usize,
-    pub h_out: usize,
-    pub w_out: usize,
+#[derive(Default, Debug, Clone)]
+pub struct KernelConv2D {
+    out_shape: Vec<Expression>,
+    input_shape: Vec<Expression>,
+    input_stride: Vec<Expression>,
+    weight_co_stride: Expression,
+    weight_inner_stride: Expression,
+    bias_c_stride: Expression,
+    out_stride: Vec<Expression>,
+    kernel_h: Expression,
+    kernel_w: Expression,
+    stride_h: Expression,
+    stride_w: Expression,
+    dilation_h: Expression,
+    dilation_w: Expression,
+    pad_h: Expression,
+    pad_w: Expression,
+    dtype: DType,
 }

-impl Conv2DKernel {
-    fn output_elements(&self) -> usize {
-        self.c_out * self.h_out * self.w_out
+impl EgglogOp for KernelConv2D {
+    fn sort(&self) -> SortDef {
+        sort(
+            OP_KIND,
+            "KernelConv2D",
+            &[
+                ("out_shape", ELIST),
+                ("input_shape", ELIST),
+                ("input_stride", ELIST),
+                ("weight_co_stride", EXPRESSION),
+                ("weight_inner_stride", EXPRESSION),
+                ("bias_c_stride", EXPRESSION),
+                ("out_stride", ELIST),
+                ("kernel_h", EXPRESSION),
+                ("kernel_w", EXPRESSION),
+                ("stride_h", EXPRESSION),
+                ("stride_w", EXPRESSION),
+                ("dilation_h", EXPRESSION),
+                ("dilation_w", EXPRESSION),
+                ("pad_h", EXPRESSION),
+                ("pad_w", EXPRESSION),
+                ("dtype", DTYPE),
+            ],
+        )
+    }
+
+    fn n_inputs(&self) -> usize {
+        3
+    }
+
+    fn rewrites(&self) -> Vec<Rule> {
+        vec![
+            // 1x1 convs in Flux2's VAE are represented without `unfold`:
+            //
+            //   input.permute([H,W,C]).merge(H,W)
+            //     -> matmul(weight.t())
+            //     -> split/permute back to [C_out,H,W]
+            //     -> + channel bias
+            //
+            // The lowered form is still the same Mul -> KernelSum -> Add
+            // matmul skeleton, but the lhs FusionStart reads directly from the
+            // original input instead of a KernelGather window tensor.
+            Rule::raw(
+                "(rule
+                    (
+                        (= ?out (Op (FusionEnd ?out_shape ?out_stride (F32)) (ICons ?add_elem (INil))))
+                        (= ?add_elem (Op (CudaBinaryElementwise \"Add\" ?out_shape ?sum_add_stride ?bias_add_stride ?out_stride (F32)) (ICons ?sum_fs (ICons ?bias_fs (INil)))))
+                        (= ?sum_fs (Op (FusionStart ?out_shape ?sum_add_stride (F32)) (ICons ?sum (INil))))
+                        (= ?bias_fs (Op (FusionStart ?out_shape ?bias_add_stride (F32)) (ICons ?bias (INil))))
+
+                        (= ?sum (Op (KernelSum ?matmul_out_shape ?c_in ?sum_in_stride ?k_stride ?sum_out_stride (F32)) (ICons ?mul_fe (INil))))
+                        (= ?mul_fe (Op (FusionEnd ?mul_shape ?mul_out_stride (F32)) (ICons ?mul_elem (INil))))
+                        (= ?mul_elem (Op (CudaBinaryElementwise \"Mul\" ?mul_shape ?input_1x1_stride ?weight_stride ?mul_out_stride (F32)) (ICons ?input_fs (ICons ?weight_fs (INil)))))
+                        (= ?input_fs (Op (FusionStart ?mul_shape ?input_1x1_stride (F32)) (ICons ?input (INil))))
+                        (= ?weight_fs (Op (FusionStart ?mul_shape ?weight_stride (F32)) (ICons ?weight (INil))))
+
+                        (= ?out_shape (ECons ?c_out (ECons ?h_out (ECons ?w_out (ENil)))))
+                        (= ?matmul_out_shape (ECons ?m (ECons ?c_out (ENil))))
+                        (= ?mul_shape (ECons ?m (ECons ?c_out (ECons ?c_in (ENil)))))
+                        (= ?input_1x1_stride (ECons ?flat_stride (ECons (MNum 0) (ECons ?input_c_stride (ENil)))))
+                        (= ?flat_stride (MIter))
+
+                        (= ?k_stride (MIter))
+                        (= ?sum_in_stride (ECons ?sum_m_stride (ECons ?sum_c_stride (ENil))))
+                        (= ?sum_out_stride (ECons ?sum_out_m_stride (ECons ?sum_out_c_stride (ENil))))
+                        (= ?sum_add_stride (ECons ?sum_add_c_stride (ECons ?sum_add_h_stride (ECons ?sum_add_w_stride (ENil)))))
+                        (= ?weight_co_stride (nth_from_end ?weight_stride 1))
+                        (= ?weight_inner_stride (nth_from_end ?weight_stride 0))
+                        (= (MNum 0) (nth_from_end ?weight_stride 2))
+                        (= ?bias_add_stride (ECons ?bias_c_stride (ECons (MNum 0) (ECons (MNum 0) (ENil)))))
+                    )
+                    (
+                        (let ?conv (Op (KernelConv2D
+                            ?out_shape
+                            (ECons ?c_in (ECons ?h_out (ECons ?w_out (ENil))))
+                            (ECons ?input_c_stride (ECons (MMul ?w_out ?flat_stride) (ECons ?flat_stride (ENil))))
+                            ?weight_co_stride
+                            ?weight_inner_stride
+                            ?bias_c_stride
+                            ?out_stride
+                            (MNum 1)
+                            (MNum 1)
+                            (MNum 1)
+                            (MNum 1)
+                            (MNum 1)
+                            (MNum 1)
+                            (MNum 0)
+                            (MNum 0)
+                            (F32))
+                            (ICons ?input (ICons ?weight (ICons ?bias (INil))))))
+                        (union ?out ?conv)
+                        (subsume (Op (FusionEnd ?out_shape ?out_stride (F32)) (ICons ?add_elem (INil))))
+                        (set (dtype ?conv) (F32))
+                    )
+                    :ruleset kernel_lower
+                    :name \"kernel conv2d 1x1 from cuda lowered matmul bias\"
+                )",
+            ),
+            Rule::raw(
+                "(rule
+                    (
+                        (= ?out (Op (FusionEnd ?out_shape ?out_stride (F32)) (ICons ?add_elem (INil))))
+                        (= ?add_elem (Op (CudaBinaryElementwise \"Add\" ?out_shape ?bias_add_stride ?sum_add_stride ?out_stride (F32)) (ICons ?bias_fs (ICons ?sum_fs (INil)))))
+                        (= ?sum_fs (Op (FusionStart ?out_shape ?sum_add_stride (F32)) (ICons ?sum (INil))))
+                        (= ?bias_fs (Op (FusionStart ?out_shape ?bias_add_stride (F32)) (ICons ?bias (INil))))
+
+                        (= ?sum (Op (KernelSum ?matmul_out_shape ?c_in ?sum_in_stride ?k_stride ?sum_out_stride (F32)) (ICons ?mul_fe (INil))))
+                        (= ?mul_fe (Op (FusionEnd ?mul_shape ?mul_out_stride (F32)) (ICons ?mul_elem (INil))))
+                        (= ?mul_elem (Op (CudaBinaryElementwise \"Mul\" ?mul_shape ?input_1x1_stride ?weight_stride ?mul_out_stride (F32)) (ICons ?input_fs (ICons ?weight_fs (INil)))))
+                        (= ?input_fs (Op (FusionStart ?mul_shape ?input_1x1_stride (F32)) (ICons ?input (INil))))
+                        (= ?weight_fs (Op (FusionStart ?mul_shape ?weight_stride (F32)) (ICons ?weight (INil))))
+
+                        (= ?out_shape (ECons ?c_out (ECons ?h_out (ECons ?w_out (ENil)))))
+                        (= ?matmul_out_shape (ECons ?m (ECons ?c_out (ENil))))
+                        (= ?mul_shape (ECons ?m (ECons ?c_out (ECons ?c_in (ENil)))))
+                        (= ?input_1x1_stride (ECons ?flat_stride (ECons (MNum 0) (ECons ?input_c_stride (ENil)))))
+                        (= ?flat_stride (MIter))
+
+                        (= ?k_stride (MIter))
+                        (= ?sum_in_stride (ECons ?sum_m_stride (ECons ?sum_c_stride (ENil))))
+                        (= ?sum_out_stride (ECons ?sum_out_m_stride (ECons ?sum_out_c_stride (ENil))))
+                        (= ?sum_add_stride (ECons ?sum_add_c_stride (ECons ?sum_add_h_stride (ECons ?sum_add_w_stride (ENil)))))
+                        (= ?weight_co_stride (nth_from_end ?weight_stride 1))
+                        (= ?weight_inner_stride (nth_from_end ?weight_stride 0))
+                        (= (MNum 0) (nth_from_end ?weight_stride 2))
+                        (= ?bias_add_stride (ECons ?bias_c_stride (ECons (MNum 0) (ECons (MNum 0) (ENil)))))
+                    )
+                    (
+                        (let ?conv (Op (KernelConv2D
+                            ?out_shape
+                            (ECons ?c_in (ECons ?h_out (ECons ?w_out (ENil))))
+                            (ECons ?input_c_stride (ECons (MMul ?w_out ?flat_stride) (ECons ?flat_stride (ENil))))
+                            ?weight_co_stride
+                            ?weight_inner_stride
+                            ?bias_c_stride
+                            ?out_stride
+                            (MNum 1)
+                            (MNum 1)
+                            (MNum 1)
+                            (MNum 1)
+                            (MNum 1)
+                            (MNum 1)
+                            (MNum 0)
+                            (MNum 0)
+                            (F32))
+                            (ICons ?input (ICons ?weight (ICons ?bias (INil))))))
+                        (union ?out ?conv)
+                        (subsume (Op (FusionEnd ?out_shape ?out_stride (F32)) (ICons ?add_elem (INil))))
+                        (set (dtype ?conv) (F32))
+                    )
+                    :ruleset kernel_lower
+                    :name \"kernel conv2d 1x1 from cuda lowered bias matmul\"
+                )",
+            ),
+            // Match the same conv after generic CUDA lowering has normalized
+            // the elementwise pieces into fusion regions:
+            //
+            //   KernelGather(input windows)
+            //     -> CudaBinaryElementwise("Mul", weight)
+            //     -> KernelSum(reduce K)
+            //     -> CudaBinaryElementwise("Add", bias)
+            //
+            // This is the form that survives long enough for CUDA search in
+            // real models. The KernelConv2D op consumes the pre-gather input
+            // and avoids materializing both the im2col window tensor and the
+            // elementwise product tensor.
+            //
+            // TODO(egglog-shapes): the current e-graph does not reliably prove
+            // the derived arithmetic equalities for this chain after CUDA
+            // normalization:
+            //   * `M == H_out * W_out`
+            //   * `K == C_in * KH * KW`
+            //   * separately-derived but structurally identical stride
+            //     expressions, e.g. the Mul output stride and KernelSum input
+            //     stride, belong to the same e-class.
+            // Keep the rewrite anchored on the stable conv layout facts the
+            // graph does carry today: six-axis unfold window shape, flattened
+            // `[M, C_out, K]` product, reduction over `K`, the three-axis
+            // `[C_out, H_out, W_out]` output view, and channel-only bias
+            // broadcast. Once expression/list canonicalization can prove those
+            // equalities, tighten this rule and its regression tests.
+            Rule::raw(
+                "(rule
+                    (
+                        (= ?out (Op (FusionEnd ?out_shape ?out_stride (F32)) (ICons ?add_elem (INil))))
+                        (= ?add_elem (Op (CudaBinaryElementwise \"Add\" ?out_shape ?sum_add_stride ?bias_add_stride ?out_stride (F32)) (ICons ?sum_fs (ICons ?bias_fs (INil)))))
+                        (= ?sum_fs (Op (FusionStart ?out_shape ?sum_add_stride (F32)) (ICons ?sum (INil))))
+                        (= ?bias_fs (Op (FusionStart ?out_shape ?bias_add_stride (F32)) (ICons ?bias (INil))))
+
+                        (= ?sum (Op (KernelSum ?matmul_out_shape ?k_dim ?sum_in_stride ?k_stride ?sum_out_stride (F32)) (ICons ?mul_fe (INil))))
+                        (= ?mul_fe (Op (FusionEnd ?mul_shape ?mul_out_stride (F32)) (ICons ?mul_elem (INil))))
+                        (= ?mul_elem (Op (CudaBinaryElementwise \"Mul\" ?mul_shape ?patch_stride ?weight_stride ?mul_out_stride (F32)) (ICons ?patch_fs (ICons ?weight_fs (INil)))))
+                        (= ?patch_fs (Op (FusionStart ?mul_shape ?patch_stride (F32)) (ICons ?patches (INil))))
+                        (= ?weight_fs (Op (FusionStart ?mul_shape ?weight_stride (F32)) (ICons ?weight (INil))))
+                        (= ?patches (Op (KernelGather ?idx_shape ?idx_stride ?input_shape ?input_stride ?gather_out_stride (F32)) (ICons ?indices (ICons ?input (INil)))))
+
+                        (= ?out_shape (ECons ?c_out (ECons ?h_out (ECons ?w_out (ENil)))))
+                        (= ?input_shape (ECons ?c_in (ECons ?h_in (ECons ?w_in (ENil)))))
+                        (= ?idx_shape (ECons ?c_in (ECons ?h_out (ECons ?w_out (ECons (MNum 1) (ECons ?kernel_h (ECons ?kernel_w (ENil))))))))
+                        (= ?matmul_out_shape (ECons ?m (ECons ?c_out (ENil))))
+                        (= ?mul_shape (ECons ?m (ECons ?c_out (ECons ?k_dim (ENil)))))
+
+                        (= ?k_stride (MIter))
+                        (= ?sum_in_stride (ECons ?sum_m_stride (ECons ?sum_c_stride (ENil))))
+                        (= ?sum_out_stride (ECons ?sum_out_m_stride (ECons ?sum_out_c_stride (ENil))))
+                        (= ?sum_add_stride (ECons ?sum_add_c_stride (ECons ?sum_add_h_stride (ECons ?sum_add_w_stride (ENil)))))
+                        (= ?weight_co_stride (nth_from_end ?weight_stride 1))
+                        (= ?weight_inner_stride (nth_from_end ?weight_stride 0))
+                        (= (MNum 0) (nth_from_end ?weight_stride 2))
+                        (= ?bias_add_stride (ECons ?bias_c_stride (ECons (MNum 0) (ECons (MNum 0) (ENil)))))
+                    )
+                    (
+                        (let ?conv (Op (KernelConv2D
+                            ?out_shape
+                            ?input_shape
+                            ?input_stride
+                            ?weight_co_stride
+                            ?weight_inner_stride
+                            ?bias_c_stride
+                            ?out_stride
+                            ?kernel_h
+                            ?kernel_w
+                            (MNum 1)
+                            (MNum 1)
+                            (MNum 1)
+                            (MNum 1)
+                            (MNum 0)
+                            (MNum 0)
+                            (F32))
+                            (ICons ?input (ICons ?weight (ICons ?bias (INil))))))
+                        (union ?out ?conv)
+                        (subsume (Op (FusionEnd ?out_shape ?out_stride (F32)) (ICons ?add_elem (INil))))
+                        (set (dtype ?conv) (F32))
+                    )
+                    :ruleset kernel_lower
+                    :name \"kernel conv2d from cuda lowered unfold matmul bias\"
+                )",
+            ),
+            Rule::raw(
+                "(rule
+                    (
+                        (= ?out (Op (FusionEnd ?out_shape ?out_stride (F32)) (ICons ?add_elem (INil))))
+                        (= ?add_elem (Op (CudaBinaryElementwise \"Add\" ?out_shape ?bias_add_stride ?sum_add_stride ?out_stride (F32)) (ICons ?bias_fs (ICons ?sum_fs (INil)))))
+                        (= ?sum_fs (Op (FusionStart ?out_shape ?sum_add_stride (F32)) (ICons ?sum (INil))))
+                        (= ?bias_fs (Op (FusionStart ?out_shape ?bias_add_stride (F32)) (ICons ?bias (INil))))
+
+                        (= ?sum (Op (KernelSum ?matmul_out_shape ?k_dim ?sum_in_stride ?k_stride ?sum_out_stride (F32)) (ICons ?mul_fe (INil))))
+                        (= ?mul_fe (Op (FusionEnd ?mul_shape ?mul_out_stride (F32)) (ICons ?mul_elem (INil))))
+                        (= ?mul_elem (Op (CudaBinaryElementwise \"Mul\" ?mul_shape ?patch_stride ?weight_stride ?mul_out_stride (F32)) (ICons ?patch_fs (ICons ?weight_fs (INil)))))
+                        (= ?patch_fs (Op (FusionStart ?mul_shape ?patch_stride (F32)) (ICons ?patches (INil))))
+                        (= ?weight_fs (Op (FusionStart ?mul_shape ?weight_stride (F32)) (ICons ?weight (INil))))
+                        (= ?patches (Op (KernelGather ?idx_shape ?idx_stride ?input_shape ?input_stride ?gather_out_stride (F32)) (ICons ?indices (ICons ?input (INil)))))
+
+                        (= ?out_shape (ECons ?c_out (ECons ?h_out (ECons ?w_out (ENil)))))
+                        (= ?input_shape (ECons ?c_in (ECons ?h_in (ECons ?w_in (ENil)))))
+                        (= ?idx_shape (ECons ?c_in (ECons ?h_out (ECons ?w_out (ECons (MNum 1) (ECons ?kernel_h (ECons ?kernel_w (ENil))))))))
+                        (= ?matmul_out_shape (ECons ?m (ECons ?c_out (ENil))))
+                        (= ?mul_shape (ECons ?m (ECons ?c_out (ECons ?k_dim (ENil)))))
+
+                        (= ?k_stride (MIter))
+                        (= ?sum_in_stride (ECons ?sum_m_stride (ECons ?sum_c_stride (ENil))))
+                        (= ?sum_out_stride (ECons ?sum_out_m_stride (ECons ?sum_out_c_stride (ENil))))
+                        (= ?sum_add_stride (ECons ?sum_add_c_stride (ECons ?sum_add_h_stride (ECons ?sum_add_w_stride (ENil)))))
+                        (= ?weight_co_stride (nth_from_end ?weight_stride 1))
+                        (= ?weight_inner_stride (nth_from_end ?weight_stride 0))
+                        (= (MNum 0) (nth_from_end ?weight_stride 2))
+                        (= ?bias_add_stride (ECons ?bias_c_stride (ECons (MNum 0) (ECons (MNum 0) (ENil)))))
+                    )
+                    (
+                        (let ?conv (Op (KernelConv2D
+                            ?out_shape
+                            ?input_shape
+                            ?input_stride
+                            ?weight_co_stride
+                            ?weight_inner_stride
+                            ?bias_c_stride
+                            ?out_stride
+                            ?kernel_h
+                            ?kernel_w
+                            (MNum 1)
+                            (MNum 1)
+                            (MNum 1)
+                            (MNum 1)
+                            (MNum 0)
+                            (MNum 0)
+                            (F32))
+                            (ICons ?input (ICons ?weight (ICons ?bias (INil))))))
+                        (union ?out ?conv)
+                        (subsume (Op (FusionEnd ?out_shape ?out_stride (F32)) (ICons ?add_elem (INil))))
+                        (set (dtype ?conv) (F32))
+                    )
+                    :ruleset kernel_lower
+                    :name \"kernel conv2d from cuda lowered bias unfold matmul\"
+                )",
+            ),
+            // Match the im2col-style HLIR conv used by Flux2:
+            //
+            //   input.unfold([1, kh, kw], [1, 1, 1], [1, 1, 1])
+            //     -> squeeze/permute/merge view
+            //     -> matmul(weight.t())
+            //     -> split/permute view
+            //     -> + bias.expand_dim(1, h_out).expand_dim(2, w_out)
+            //
+            // The kernel consumes the pre-unfold input directly. That input may
+            // already be a padded HLIR tensor, so the rewrite is still correct
+            // for Flux2's padded convs while removing the large patch matrix.
+            Rule::raw(
+                "(rule
+                    (
+                        (= ?add (Op (Add ?out_shape ?sum_add_stride ?bias_add_stride ?add_out_stride) (ICons ?sum (ICons ?bias (INil)))))
+                        (= ?sum (Op (Sum ?matmul_out_shape ?k_dim ?sum_in_stride ?k_stride ?sum_out_stride) (ICons ?mul (INil))))
+                        (= ?mul (Op (Mul ?mul_shape ?patch_stride ?weight_stride ?mul_out_stride) (ICons ?patches (ICons ?weight (INil)))))
+                        (= ?patches (Op (Gather ?idx_shape ?idx_stride ?input_shape ?input_stride) (ICons ?indices (ICons ?input (INil)))))
+
+                        (= ?out_shape (ECons ?c_out (ECons ?h_out (ECons ?w_out (ENil)))))
+                        (= ?input_shape (ECons ?c_in (ECons ?h_in (ECons ?w_in (ENil)))))
+                        (= ?idx_shape (ECons ?c_in (ECons ?h_out (ECons ?w_out (ECons (MNum 1) (ECons ?kernel_h (ECons ?kernel_w (ENil))))))))
+                        (= ?matmul_out_shape (ECons ?m (ECons ?c_out (ENil))))
+
+                        ; This rewrite is for stride=1, dilation=1 over the
+                        ; tensor passed to unfold. Padded HLIR inputs are already
+                        ; represented as their own tensor, so padding is 0 here.
+                        (= ?h_out (MAdd (MSub ?h_in ?kernel_h) (MNum 1)))
+                        (= ?w_out (MAdd (MSub ?w_in ?kernel_w) (MNum 1)))
+                        (= ?m (MMul ?h_out ?w_out))
+                        (= ?k_dim (MMul ?c_in (MMul ?kernel_h ?kernel_w)))
+                        (= ?k_stride (MIter))
+
+                        (= ?weight_co_stride (nth_from_end ?weight_stride 1))
+                        (= ?weight_inner_stride (nth_from_end ?weight_stride 0))
+                        (= (MNum 0) (nth_from_end ?weight_stride 2))
+
+                        (= ?bias_add_stride (ECons ?bias_c_stride (ECons (MNum 0) (ECons (MNum 0) (ENil)))))
+
+                        (= (F32) (dtype ?input))
+                        (= (F32) (dtype ?weight))
+                        (= (F32) (dtype ?bias))
+                    )
+                    (
+                        (let ?conv (Op (KernelConv2D
+                            ?out_shape
+                            ?input_shape
+                            ?input_stride
+                            ?weight_co_stride
+                            ?weight_inner_stride
+                            ?bias_c_stride
+                            ?add_out_stride
+                            ?kernel_h
+                            ?kernel_w
+                            (MNum 1)
+                            (MNum 1)
+                            (MNum 1)
+                            (MNum 1)
+                            (MNum 0)
+                            (MNum 0)
+                            (F32))
+                            (ICons ?input (ICons ?weight (ICons ?bias (INil))))))
+                        (union ?add ?conv)
+                        (subsume (Op (Add ?out_shape ?sum_add_stride ?bias_add_stride ?add_out_stride) (ICons ?sum (ICons ?bias (INil)))))
+                        (set (dtype ?conv) (F32))
+                    )
+                    :ruleset kernel_specialize
+                    :name \"kernel conv2d from unfold matmul bias\"
+                )",
+            ),
+            Rule::raw(
+                "(rule
+                    (
+                        (= ?add (Op (Add ?out_shape ?bias_add_stride ?sum_add_stride ?add_out_stride) (ICons ?bias (ICons ?sum (INil)))))
+                        (= ?sum (Op (Sum ?matmul_out_shape ?k_dim ?sum_in_stride ?k_stride ?sum_out_stride) (ICons ?mul (INil))))
+                        (= ?mul (Op (Mul ?mul_shape ?patch_stride ?weight_stride ?mul_out_stride) (ICons ?patches (ICons ?weight (INil)))))
+                        (= ?patches (Op (Gather ?idx_shape ?idx_stride ?input_shape ?input_stride) (ICons ?indices (ICons ?input (INil)))))
+
+                        (= ?out_shape (ECons ?c_out (ECons ?h_out (ECons ?w_out (ENil)))))
+                        (= ?input_shape (ECons ?c_in (ECons ?h_in (ECons ?w_in (ENil)))))
+                        (= ?idx_shape (ECons ?c_in (ECons ?h_out (ECons ?w_out (ECons (MNum 1) (ECons ?kernel_h (ECons ?kernel_w (ENil))))))))
+                        (= ?matmul_out_shape (ECons ?m (ECons ?c_out (ENil))))
+
+                        (= ?h_out (MAdd (MSub ?h_in ?kernel_h) (MNum 1)))
+                        (= ?w_out (MAdd (MSub ?w_in ?kernel_w) (MNum 1)))
+                        (= ?m (MMul ?h_out ?w_out))
+                        (= ?k_dim (MMul ?c_in (MMul ?kernel_h ?kernel_w)))
+                        (= ?k_stride (MIter))
+
+                        (= ?weight_co_stride (nth_from_end ?weight_stride 1))
+                        (= ?weight_inner_stride (nth_from_end ?weight_stride 0))
+                        (= (MNum 0) (nth_from_end ?weight_stride 2))
+
+                        (= ?bias_add_stride (ECons ?bias_c_stride (ECons (MNum 0) (ECons (MNum 0) (ENil)))))
+
+                        (= (F32) (dtype ?input))
+                        (= (F32) (dtype ?weight))
+                        (= (F32) (dtype ?bias))
+                    )
+                    (
+                        (let ?conv (Op (KernelConv2D
+                            ?out_shape
+                            ?input_shape
+                            ?input_stride
+                            ?weight_co_stride
+                            ?weight_inner_stride
+                            ?bias_c_stride
+                            ?add_out_stride
+                            ?kernel_h
+                            ?kernel_w
+                            (MNum 1)
+                            (MNum 1)
+                            (MNum 1)
+                            (MNum 1)
+                            (MNum 0)
+                            (MNum 0)
+                            (F32))
+                            (ICons ?input (ICons ?weight (ICons ?bias (INil))))))
+                        (union ?add ?conv)
+                        (subsume (Op (Add ?out_shape ?bias_add_stride ?sum_add_stride ?add_out_stride) (ICons ?bias (ICons ?sum (INil)))))
+                        (set (dtype ?conv) (F32))
+                    )
+                    :ruleset kernel_specialize
+                    :name \"kernel conv2d from bias unfold matmul\"
+                )",
+            ),
+            Rule::raw(
+                "(rule
+                    (
+                        (= ?add (Op (Add ?shape ?as ?bs ?os) ?inputs))
+                        (= ?add (Op (KernelConv2D ?out_shape ?input_shape ?input_stride ?wco ?wi ?bc ?out_stride ?kh ?kw ?sh ?sw ?dh ?dw ?ph ?pw ?dt) ?conv_inputs))
+                    )
+                    ((delete (Op (Add ?shape ?as ?bs ?os) ?inputs)))
+                    :ruleset cleanup
+                )",
+            ),
+            Rule::raw(
+                "(rule
+                    (
+                        (= ?fe (Op (FusionEnd ?shape ?os ?dt) ?inputs))
+                        (= ?fe (Op (KernelConv2D ?out_shape ?input_shape ?input_stride ?wco ?wi ?bc ?out_stride ?kh ?kw ?sh ?sw ?dh ?dw ?ph ?pw ?conv_dt) ?conv_inputs))
+                    )
+                    ((delete (Op (FusionEnd ?shape ?os ?dt) ?inputs)))
+                    :ruleset cleanup
+                )",
+            ),
+        ]
+    }
+
+    fn cleanup(&self) -> bool {
+        false
+    }
+
+    fn extract<'a>(
+        &'a self,
+        egraph: &'a luminal::egglog_utils::SerializedEGraph,
+        kind_children: &[&'a luminal::egglog_utils::NodeId],
+        input_enodes: Vec<&'a luminal::egglog_utils::NodeId>,
+        list_cache: &mut FxHashMap<&'a luminal::egglog_utils::NodeId, Vec<Expression>>,
+        expr_cache: &mut FxHashMap<&'a luminal::egglog_utils::NodeId, Expression>,
+    ) -> (LLIROp, Vec<&'a luminal::egglog_utils::NodeId>) {
+        (
+            LLIROp::new::<dyn KernelOp>(Box::new(Self {
+                out_shape: extract_expr_list(egraph, kind_children[0], list_cache, expr_cache)
+                    .unwrap(),
+                input_shape: extract_expr_list(egraph, kind_children[1], list_cache, expr_cache)
+                    .unwrap(),
+                input_stride: extract_expr_list(egraph, kind_children[2], list_cache, expr_cache)
+                    .unwrap(),
+                weight_co_stride: extract_expr(egraph, kind_children[3], expr_cache).unwrap(),
+                weight_inner_stride: extract_expr(egraph, kind_children[4], expr_cache).unwrap(),
+                bias_c_stride: extract_expr(egraph, kind_children[5], expr_cache).unwrap(),
+                out_stride: extract_expr_list(egraph, kind_children[6], list_cache, expr_cache)
+                    .unwrap(),
+                kernel_h: extract_expr(egraph, kind_children[7], expr_cache).unwrap(),
+                kernel_w: extract_expr(egraph, kind_children[8], expr_cache).unwrap(),
+                stride_h: extract_expr(egraph, kind_children[9], expr_cache).unwrap(),
+                stride_w: extract_expr(egraph, kind_children[10], expr_cache).unwrap(),
+                dilation_h: extract_expr(egraph, kind_children[11], expr_cache).unwrap(),
+                dilation_w: extract_expr(egraph, kind_children[12], expr_cache).unwrap(),
+                pad_h: extract_expr(egraph, kind_children[13], expr_cache).unwrap(),
+                pad_w: extract_expr(egraph, kind_children[14], expr_cache).unwrap(),
+                dtype: extract_dtype(egraph, kind_children[15]),
+            }) as Box<dyn KernelOp>),
+            input_enodes,
+        )
    }
 }

-const THREADS_PER_BLOCK: usize = 256;
-
-impl KernelOp for Conv2DKernel {
+impl KernelOp for KernelConv2D {
    fn compile(
        &self,
        stream: &Arc<CudaStream>,
@@ -68,74 +541,135 @@ impl KernelOp for Conv2DKernel {
        Expression,
        FxHashMap<char, CudaSlice<u8>>,
    ) {
-        let total = self.output_elements();
-        let grid = total.div_ceil(THREADS_PER_BLOCK);
+        assert_eq!(self.dtype, DType::F32, "KernelConv2D currently emits F32");
+
+        let vars: FxHashSet<char> = self
+            .out_shape
+            .iter()
+            .chain(&self.input_shape)
+            .chain(&self.input_stride)
+            .chain(&self.out_stride)
+            .flat_map(|e| e.dyn_vars())
+            .chain(self.weight_co_stride.dyn_vars())
+            .chain(self.weight_inner_stride.dyn_vars())
+            .chain(self.bias_c_stride.dyn_vars())
+            .chain(self.kernel_h.dyn_vars())
+            .chain(self.kernel_w.dyn_vars())
+            .chain(self.stride_h.dyn_vars())
+            .chain(self.stride_w.dyn_vars())
+            .chain(self.dilation_h.dyn_vars())
+            .chain(self.dilation_w.dyn_vars())
+            .chain(self.pad_h.dyn_vars())
+            .chain(self.pad_w.dyn_vars())
+            .collect();
+
+        let (dyn_defines, _sorted_dims) = generate_dyn_dims_defines(&vars);
+        let dyn_dims_param = if vars.is_empty() {
+            ""
+        } else {
+            ", const int* dyn_dims"
+        };
+
+        let c_out = self.out_shape[0].to_kernel();
+        let h_out = self.out_shape[1].to_kernel();
+        let w_out = self.out_shape[2].to_kernel();
+        let c_in = self.input_shape[0].to_kernel();
+        let h_in = self.input_shape[1].to_kernel();
+        let w_in = self.input_shape[2].to_kernel();
+        let weight_co_stride = self
+            .weight_co_stride
+            .substitute('z', Expression::from(1))
+            .simplify()
+            .to_kernel();
+        let weight_inner_stride = self
+            .weight_inner_stride
+            .substitute('z', Expression::from(1))
+            .simplify()
+            .to_kernel();
+        let bias_c_stride = self
+            .bias_c_stride
+            .substitute('z', Expression::from(1))
+            .simplify()
+            .to_kernel();
+        let kh = self.kernel_h.to_kernel();
+        let kw = self.kernel_w.to_kernel();
+        let stride_h = self.stride_h.to_kernel();
+        let stride_w = self.stride_w.to_kernel();
+        let dilation_h = self.dilation_h.to_kernel();
+        let dilation_w = self.dilation_w.to_kernel();
+        let pad_h = self.pad_h.to_kernel();
+        let pad_w = self.pad_w.to_kernel();
+        let out_idx = flatten_strides(&self.out_shape, &self.out_stride).to_kernel();
+        let input_idx = flatten_strides(&self.input_shape, &self.input_stride)
+            .to_kernel()
+            .replace("const_z", "input_linear");
+        let n_outputs: Expression = self.out_shape.iter().copied().product();

        let kernel = format!(
            "
-extern \"C\" __global__ void conv2d_bias_kernel(
-    float* __restrict__ out,
-    const float* __restrict__ input,
-    const float* __restrict__ weight,
-    const float* __restrict__ bias
-) {{
-    const int TOTAL = {total};
-    const int CIN  = {c_in};
-    const int H    = {h_in};
-    const int W    = {w_in};
-    const int HOUT = {h_out};
-    const int WOUT = {w_out};
-    const int K    = {k};
-    const int S    = {s};
-    const int P    = {p};
+{dyn_defines}
+extern \"C\" {{
+    __global__ void generic_conv2d_bias(
+        float* __restrict__ out,
+        const float* __restrict__ input,
+        const float* __restrict__ weight,
+        const float* __restrict__ bias{dyn_dims_param}
+    ) {{
+        long long const_z = (long long)blockIdx.x * blockDim.x + threadIdx.x;
+        const long long total = {total};
+        if (const_z >= total) return;

-    int idx = blockIdx.x * blockDim.x + threadIdx.x;
-    if (idx >= TOTAL) return;
-    int hw = HOUT * WOUT;
-    int co = idx / hw;
-    int rem = idx - co * hw;
-    int ho = rem / WOUT;
-    int wo = rem - ho * WOUT;
+        const long long COUT = {c_out};
+        const long long HOUT = {h_out};
+        const long long WOUT = {w_out};
+        const long long CIN = {c_in};
+        const long long HIN = {h_in};
+        const long long WIN = {w_in};
+        const long long KH = {kh};
+        const long long KW = {kw};
+        const long long SH = {stride_h};
+        const long long SW = {stride_w};
+        const long long DH = {dilation_h};
+        const long long DW = {dilation_w};
+        const long long PH = {pad_h};
+        const long long PW = {pad_w};
+        const long long W_CO_STRIDE = {weight_co_stride};
+        const long long W_INNER_STRIDE = {weight_inner_stride};
+        const long long BIAS_C_STRIDE = {bias_c_stride};

-    float acc = bias[co];
-    int weight_co_base = co * (CIN * K * K);
-    for (int ci = 0; ci < CIN; ci++) {{
-        int input_ci_base = ci * (H * W);
-        int weight_ci_base = weight_co_base + ci * (K * K);
-        #pragma unroll
-        for (int ki = 0; ki < K; ki++) {{
-            int hi = ho * S + ki - P;
-            if (hi < 0 || hi >= H) continue;
-            int input_row_base = input_ci_base + hi * W;
-            int weight_row_base = weight_ci_base + ki * K;
-            #pragma unroll
-            for (int kj = 0; kj < K; kj++) {{
-                int wj = wo * S + kj - P;
-                if (wj < 0 || wj >= W) continue;
-                acc += input[input_row_base + wj] * weight[weight_row_base + kj];
+        long long co = const_z / (HOUT * WOUT);
+        long long rem = const_z - co * HOUT * WOUT;
+        long long oh = rem / WOUT;
+        long long ow = rem - oh * WOUT;
+
+        float acc = bias[co * BIAS_C_STRIDE];
+        for (long long ci = 0; ci < CIN; ++ci) {{
+            for (long long r = 0; r < KH; ++r) {{
+                long long ih = oh * SH + r * DH - PH;
+                if (ih < 0 || ih >= HIN) continue;
+                for (long long s = 0; s < KW; ++s) {{
+                    long long iw = ow * SW + s * DW - PW;
+                    if (iw < 0 || iw >= WIN) continue;
+                    long long input_linear = (ci * HIN + ih) * WIN + iw;
+                    long long input_idx = {input_idx};
+                    long long inner = (ci * KH + r) * KW + s;
+                    long long weight_idx = co * W_CO_STRIDE + inner * W_INNER_STRIDE;
+                    acc += input[input_idx] * weight[weight_idx];
+                }}
            }}
        }}
+        out[{out_idx}] = acc;
    }}
-    out[idx] = acc;
-}}
-",
-            total = total,
-            c_in = self.c_in,
-            h_in = self.h_in,
-            w_in = self.w_in,
-            h_out = self.h_out,
-            w_out = self.w_out,
-            k = self.kernel,
-            s = self.stride,
-            p = self.padding,
+}}",
+            total = n_outputs.to_kernel(),
        );

-        let (module, func) = if let Some((m, f)) = compile_cache.get(&kernel) {
-            (m.clone(), f.clone())
+        let (module, func) = if let Some((module, func)) = compile_cache.get(&kernel) {
+            (module.clone(), func.clone())
        } else {
            let ptx = compile_module_image_for_current_device(stream.context(), &kernel).unwrap();
            let module = stream.context().load_module(ptx).unwrap();
-            let func = module.load_function("conv2d_bias_kernel").unwrap();
+            let func = module.load_function("generic_conv2d_bias").unwrap();
            compile_cache.insert(kernel.clone(), (module.clone(), func.clone()));
            (module, func)
        };
@@ -144,37 +678,45 @@ extern \"C\" __global__ void conv2d_bias_kernel(
            func,
            module,
            kernel,
-            (
-                Expression::from(grid),
-                Expression::from(1usize),
-                Expression::from(1usize),
-            ),
-            (
-                Expression::from(THREADS_PER_BLOCK),
-                Expression::from(1usize),
-                Expression::from(1usize),
-            ),
-            Expression::from(0usize),
+            (n_outputs.ceil_div(256), 1.into(), 1.into()),
+            (n_outputs.min(256), 1.into(), 1.into()),
+            0.into(),
            FxHashMap::default(),
        )
    }

    fn output_size(&self) -> Expression {
-        Expression::from(self.output_elements())
+        self.out_shape.iter().copied().product()
+    }
+
+    fn all_dyn_vars(&self) -> FxHashSet<char> {
+        self.out_shape
+            .iter()
+            .chain(&self.input_shape)
+            .chain(&self.input_stride)
+            .chain(&self.out_stride)
+            .flat_map(|e| e.dyn_vars())
+            .chain(self.weight_co_stride.dyn_vars())
+            .chain(self.weight_inner_stride.dyn_vars())
+            .chain(self.bias_c_stride.dyn_vars())
+            .chain(self.kernel_h.dyn_vars())
+            .chain(self.kernel_w.dyn_vars())
+            .chain(self.stride_h.dyn_vars())
+            .chain(self.stride_w.dyn_vars())
+            .chain(self.dilation_h.dyn_vars())
+            .chain(self.dilation_w.dyn_vars())
+            .chain(self.pad_h.dyn_vars())
+            .chain(self.pad_w.dyn_vars())
+            .collect()
    }

    fn output_bytes(&self) -> Expression {
        self.output_size() * 4
    }

-    fn output_dtype(&self) -> DType {
-        DType::F32
-    }
-
    fn bytes_loaded(&self) -> Expression {
-        // Per output: C_in * K * K input loads + same many weight loads + 1 bias load.
-        let per_out = self.c_in * self.kernel * self.kernel * 2 + 1;
-        Expression::from(self.output_elements() * per_out * 4)
+        let c_in = self.input_shape[0];
+        self.output_size() * self.kernel_h * self.kernel_w * c_in * 2 * 4 + self.output_size() * 4
    }

    fn bytes_stored(&self) -> Expression {
@@ -182,108 +724,15 @@ extern \"C\" __global__ void conv2d_bias_kernel(
    }

    fn flops(&self) -> Expression {
-        // 2 * C_in * K * K mul-adds per output, plus the bias add = +1.
-        let per_out = self.c_in * self.kernel * self.kernel * 2 + 1;
-        Expression::from(self.output_elements() * per_out)
+        let c_in = self.input_shape[0];
+        self.output_size() * self.kernel_h * self.kernel_w * c_in * 2
+    }
+
+    fn output_dtype(&self) -> DType {
+        self.dtype
    }

    fn kernel_name(&self) -> &'static str {
-        "Conv2DBias"
+        "GenericConv2D"
    }
 }
-
-/// luminal `CustomOp` that wraps `Conv2DKernel`. Lets us drop the kernel
-/// straight into an HLIR graph via `cx.custom_op(...)` without going
-/// through egglog rewrites.
-#[derive(Debug, Clone)]
-pub struct Conv2DCustom(pub Conv2DKernel);
-
-impl CustomOp for Conv2DCustom {
-    fn to_llir_op(&self) -> LLIROp {
-        LLIROp::new::<dyn KernelOp>(Box::new(self.0.clone()) as Box<dyn KernelOp>)
-    }
-}
-
-/// 2D conv-with-bias on a `(C_in, H, W)` F32 input tensor, with weights
-/// stored as `(C_out, C_in*K*K)` and bias as `(C_out,)`. Stride/padding/kernel
-/// are static. Output: `(C_out, H_out, W_out)`.
-///
-/// This is a thin wrapper over [`Conv2DKernel`] that hides the
-/// `cx.custom_op` plumbing. All inputs MUST be `DType::F32` and contiguous
-/// row-major; pass `tensor * 1.0_f32` first if you have a strided view.
-pub fn conv2d_bias(
-    input: GraphTensor,
-    weight: GraphTensor,
-    bias: GraphTensor,
-    kernel: usize,
-    stride: usize,
-    padding: usize,
-) -> GraphTensor {
-    assert_eq!(input.dtype, DType::F32, "conv2d_bias requires F32 input");
-    assert_eq!(weight.dtype, DType::F32, "conv2d_bias requires F32 weight");
-    assert_eq!(bias.dtype, DType::F32, "conv2d_bias requires F32 bias");
-
-    let dims = input.dims();
-    assert_eq!(dims.len(), 3, "conv2d_bias expects (C_in, H, W) input");
-    let c_in = dims[0].to_usize().expect("C_in must be a static dim");
-    let h_in = dims[1].to_usize().expect("H must be a static dim");
-    let w_in = dims[2].to_usize().expect("W must be a static dim");
-
-    let w_dims = weight.dims();
-    assert_eq!(
-        w_dims.len(),
-        2,
-        "conv2d_bias expects weight (C_out, C_in*K*K)"
-    );
-    let c_out = w_dims[0].to_usize().expect("C_out must be a static dim");
-    let w_kk = w_dims[1]
-        .to_usize()
-        .expect("weight inner dim must be static");
-    assert_eq!(
-        w_kk,
-        c_in * kernel * kernel,
-        "weight inner dim {w_kk} != C_in*K*K = {}",
-        c_in * kernel * kernel,
-    );
-
-    let b_dims = bias.dims();
-    assert_eq!(b_dims.len(), 1, "conv2d_bias expects bias (C_out,)");
-    assert_eq!(
-        b_dims[0].to_usize().expect("bias dim must be static"),
-        c_out
-    );
-
-    assert!(
-        h_in + 2 * padding >= kernel,
-        "padded H_in ({}) is smaller than kernel ({})",
-        h_in + 2 * padding,
-        kernel,
-    );
-    assert!(
-        w_in + 2 * padding >= kernel,
-        "padded W_in ({}) is smaller than kernel ({})",
-        w_in + 2 * padding,
-        kernel,
-    );
-    let h_out = (h_in + 2 * padding - kernel) / stride + 1;
-    let w_out = (w_in + 2 * padding - kernel) / stride + 1;
-
-    let kern = Conv2DKernel {
-        c_in,
-        h_in,
-        w_in,
-        c_out,
-        kernel,
-        stride,
-        padding,
-        h_out,
-        w_out,
-    };
-    let cx: &mut Graph = unsafe { &mut *input.graph_ref };
-    cx.custom_op(
-        Conv2DCustom(kern),
-        vec![input, weight, bias],
-        (c_out, h_out, w_out),
-        DType::F32,
-    )
-}
--- a/crates/luminal_cuda_lite/src/kernel/dlrm_interact.rs
+++ b/crates/luminal_cuda_lite/src/kernel/dlrm_interact.rs
@@ -0,0 +1,243 @@
+//! Fused DLRM pairwise-dot interaction.
+//!
+//! Replaces the cat→bmm(T,Tᵀ)→tril-gather chain with a single kernel
+//! that reads N separate `(batch, d)` tensors and writes the strict
+//! lower-triangular pairwise dot products directly into the output —
+//! `out[b, p] = Σ_d v_i[b, d] * v_j[b, d]` for each ordered pair (i, j)
+//! with i > j.
+//!
+//! Why this matters for the DLRM forward: the natural luminal lowering
+//! materializes the `(B, F, D)` stacked tensor, then the full `(B, F, F)`
+//! BMM output, then a flat gather to pull out F(F-1)/2 pairs. That's
+//! ~12 small kernels and an `F²·B` intermediate even though only half
+//! of those elements are kept. The fused version uses N pointer args
+//! (one per feature vector), computes only the F(F-1)/2 dot products,
+//! and writes directly to the final `(B, F(F-1)/2)` buffer.
+//!
+//! All shapes are static. The kernel source is generated with the
+//! exact pair table baked in (so the inner loop is a fixed `D`-element
+//! reduction with no shape-dependent branching).
+
+use std::sync::Arc;
+
+use cudarc::driver::{CudaFunction, CudaModule, CudaSlice, CudaStream};
+use luminal::{
+    dtype::DType, op::CustomOp, op::LLIROp, prelude::FxHashMap, prelude::GraphTensor,
+    shape::Expression,
+};
+
+use crate::compile_module_image_for_current_device;
+use crate::kernel::KernelOp;
+
+#[derive(Debug, Clone)]
+pub struct PairwiseDotLowerTriKernel {
+    pub batch: usize,
+    pub num_features: usize, // F
+    pub d: usize,
+}
+
+impl PairwiseDotLowerTriKernel {
+    fn pair_count(&self) -> usize {
+        self.num_features * (self.num_features - 1) / 2
+    }
+}
+
+impl KernelOp for PairwiseDotLowerTriKernel {
+    fn compile(
+        &self,
+        stream: &Arc<CudaStream>,
+        compile_cache: &mut FxHashMap<String, (Arc<CudaModule>, CudaFunction)>,
+    ) -> (
+        CudaFunction,
+        Arc<CudaModule>,
+        String,
+        (Expression, Expression, Expression),
+        (Expression, Expression, Expression),
+        Expression,
+        FxHashMap<char, CudaSlice<u8>>,
+    ) {
+        let f = self.num_features;
+        let p = self.pair_count();
+        // Pair table (i, j) with i > j, in strict-lower-tri (row-major over
+        // i then j) order — same convention as torch.tril_indices(F, F, -1).
+        let mut pairs: Vec<(usize, usize)> = Vec::with_capacity(p);
+        for i in 0..f {
+            for j in 0..i {
+                pairs.push((i, j));
+            }
+        }
+        // Build kernel params signature: one pointer per input feature.
+        let in_params: String = (0..f)
+            .map(|k| format!(", const float* __restrict__ v{k}"))
+            .collect::<Vec<_>>()
+            .concat();
+        // For each pair p, generate one branch in the switch that selects
+        // the two input pointers to dot-product. With F small (DLRM has
+        // F=4), the branch is fully unrolled.
+        let mut pair_switch = String::new();
+        for (pidx, (i, j)) in pairs.iter().enumerate() {
+            pair_switch += &format!(
+                "        case {pidx}: pa = v{i}; pb = v{j}; break;\n"
+            );
+        }
+
+        let kernel = format!(
+            "
+extern \"C\" __global__ void dlrm_pairwise_dot_lower_tri_kernel(
+    float* __restrict__ out{in_params}
+) {{
+    const int B = {batch};
+    const int D = {d};
+    const int P = {p};
+    int b = blockIdx.x;
+    int p = blockIdx.y;
+    int t = threadIdx.x;
+    if (b >= B || p >= P) return;
+
+    const float* pa = nullptr;
+    const float* pb = nullptr;
+    switch (p) {{
+{pair_switch}
+        default: return;
+    }}
+
+    // Block-wide reduction of dot(pa[b], pb[b]) over D using shared mem.
+    extern __shared__ float smem[];
+    float partial = 0.0f;
+    for (int d = t; d < D; d += blockDim.x) {{
+        partial += pa[b * D + d] * pb[b * D + d];
+    }}
+    smem[t] = partial;
+    __syncthreads();
+    // Power-of-two tree reduce. blockDim.x must be a power of two.
+    for (int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {{
+        if (t < stride) {{
+            smem[t] += smem[t + stride];
+        }}
+        __syncthreads();
+    }}
+    if (t == 0) {{
+        out[b * P + p] = smem[0];
+    }}
+}}
+",
+            batch = self.batch,
+            d = self.d,
+            p = p,
+            pair_switch = pair_switch,
+            in_params = in_params,
+        );
+
+        let (module, func) = if let Some((m, f)) = compile_cache.get(&kernel) {
+            (m.clone(), f.clone())
+        } else {
+            let ptx = compile_module_image_for_current_device(stream.context(), &kernel).unwrap();
+            let module = stream.context().load_module(ptx).unwrap();
+            let func = module
+                .load_function("dlrm_pairwise_dot_lower_tri_kernel")
+                .unwrap();
+            compile_cache.insert(kernel.clone(), (module.clone(), func.clone()));
+            (module, func)
+        };
+
+        // Pick a power-of-two thread count ≤ D, ≥ 32 where possible.
+        let mut threads = 1usize;
+        while threads * 2 <= self.d.max(32) {
+            threads *= 2;
+        }
+        let threads = threads.max(32).min(1024);
+        (
+            func,
+            module,
+            kernel,
+            (
+                Expression::from(self.batch),
+                Expression::from(p),
+                Expression::from(1usize),
+            ),
+            (
+                Expression::from(threads),
+                Expression::from(1usize),
+                Expression::from(1usize),
+            ),
+            Expression::from(threads * 4),
+            FxHashMap::default(),
+        )
+    }
+
+    fn output_size(&self) -> Expression {
+        Expression::from(self.batch * self.pair_count())
+    }
+
+    fn output_bytes(&self) -> Expression {
+        self.output_size() * 4
+    }
+
+    fn output_dtype(&self) -> DType {
+        DType::F32
+    }
+
+    fn bytes_loaded(&self) -> Expression {
+        // Each pair reads 2 vectors of D floats per batch row. F-choose-2
+        // pairs, so per-batch each input vector is read F-1 times.
+        Expression::from(self.batch * self.num_features * (self.num_features - 1) * self.d * 4)
+    }
+
+    fn bytes_stored(&self) -> Expression {
+        self.output_size() * 4
+    }
+
+    fn flops(&self) -> Expression {
+        // 2D-1 flops per dot product (D mul + D-1 add).
+        Expression::from(self.batch * self.pair_count() * (2 * self.d - 1))
+    }
+
+    fn kernel_name(&self) -> &'static str {
+        "DLRMPairwiseDotLowerTri"
+    }
+}
+
+#[derive(Debug, Clone)]
+pub struct PairwiseDotLowerTriCustom(pub PairwiseDotLowerTriKernel);
+
+impl CustomOp for PairwiseDotLowerTriCustom {
+    fn to_llir_op(&self) -> LLIROp {
+        LLIROp::new::<dyn KernelOp>(Box::new(self.0.clone()) as Box<dyn KernelOp>)
+    }
+}
+
+/// Strict-lower-triangular pairwise dot product of N feature vectors.
+///
+/// * `features`: N tensors, each `(batch, d)`, all F32, all the same shape.
+///
+/// Returns `(batch, N*(N-1)/2)` with pair ordering matching
+/// `torch.tril_indices(N, N, -1)` (row-major: (1,0), (2,0), (2,1), …).
+pub fn dlrm_pairwise_dot_lower_tri(features: Vec<GraphTensor>) -> GraphTensor {
+    assert!(features.len() >= 2, "need at least 2 feature vectors");
+    let first = features[0];
+    let dims = first.dims();
+    assert_eq!(dims.len(), 2, "each feature vector must be 2D (batch, d)");
+    let batch = dims[0].to_usize().expect("batch must be static");
+    let d = dims[1].to_usize().expect("d must be static");
+    let f = features.len();
+    for v in &features {
+        assert_eq!(v.dtype, DType::F32, "features must all be F32");
+        let vd = v.dims();
+        assert_eq!(vd.len(), 2, "features must all be 2D");
+        assert_eq!(vd[0].to_usize().unwrap(), batch, "batch mismatch");
+        assert_eq!(vd[1].to_usize().unwrap(), d, "d mismatch");
+    }
+    let kern = PairwiseDotLowerTriKernel {
+        batch,
+        num_features: f,
+        d,
+    };
+    let p = f * (f - 1) / 2;
+    let cx = unsafe { &mut *first.graph_ref };
+    cx.custom_op(
+        PairwiseDotLowerTriCustom(kern),
+        features,
+        (batch, p),
+        DType::F32,
+    )
+}
--- a/crates/luminal_cuda_lite/src/kernel/embedding_bag.rs
+++ b/crates/luminal_cuda_lite/src/kernel/embedding_bag.rs
@@ -0,0 +1,470 @@
+//! Single-kernel fused EmbeddingBag (sum-pool) operator.
+//!
+//! Folds `gather(table, idx) → sum(L)` into one CUDA kernel. Same wrapping
+//! pattern as `Matmul2DKernel`: implement [`KernelOp`], wrap in a
+//! [`CustomOp`] so the user-facing call appears as a `dyn KernelOp` in
+//! the LLIR and gets absorbed into the same CudaGraphOp as everything
+//! around it — no extra host op, no extra CUDA launch outside the graph.
+//!
+//! Semantics: `out[b, d] = Σ_l table[indices[b, l], d]`
+//!   table:   (n_emb, d), F32, row-major
+//!   indices: (batch, bag), I32, row-major
+//!   out:     (batch, d), F32, row-major
+//!
+//! Used by luminal_python's PT2 translator to lower the DLRM v3 pattern
+//! `index_select(W, 0, idx).view([B, L, D]).sum(dim=1)` directly to one
+//! fused kernel.
+
+use std::sync::Arc;
+
+use cudarc::driver::{CudaFunction, CudaModule, CudaSlice, CudaStream};
+use luminal::{
+    dtype::DType, op::CustomOp, op::LLIROp, prelude::FxHashMap, prelude::GraphTensor,
+    shape::Expression,
+};
+
+use crate::compile_module_image_for_current_device;
+use crate::kernel::KernelOp;
+
+/// One-kernel fused EmbeddingBag with sum pooling and fixed bag size.
+#[derive(Debug, Clone)]
+pub struct EmbeddingBagSumKernel {
+    pub batch: usize,
+    pub bag: usize,
+    pub d: usize,
+    pub n_emb: usize,
+}
+
+impl KernelOp for EmbeddingBagSumKernel {
+    fn compile(
+        &self,
+        stream: &Arc<CudaStream>,
+        compile_cache: &mut FxHashMap<String, (Arc<CudaModule>, CudaFunction)>,
+    ) -> (
+        CudaFunction,
+        Arc<CudaModule>,
+        String,
+        (Expression, Expression, Expression),
+        (Expression, Expression, Expression),
+        Expression,
+        FxHashMap<char, CudaSlice<u8>>,
+    ) {
+        // One block per batch row, `d` threads per block. Each thread sums
+        // one output column over the `bag` indices. Standard bag-size-1..L
+        // pattern, memory-bandwidth bound on `table`.
+        let kernel = format!(
+            "
+extern \"C\" __global__ void embedding_bag_sum_kernel(
+    float* __restrict__ out,
+    const float* __restrict__ table,
+    const int* __restrict__ indices
+) {{
+    const int B = {batch};
+    const int L = {bag};
+    const int D = {d};
+    int b = blockIdx.x;
+    int d = threadIdx.x;
+    if (b >= B || d >= D) return;
+    float acc = 0.0f;
+    #pragma unroll 4
+    for (int l = 0; l < L; ++l) {{
+        int row = indices[b * L + l];
+        acc += table[row * D + d];
+    }}
+    out[b * D + d] = acc;
+}}
+",
+            batch = self.batch,
+            bag = self.bag,
+            d = self.d,
+        );
+
+        let (module, func) = if let Some((m, f)) = compile_cache.get(&kernel) {
+            (m.clone(), f.clone())
+        } else {
+            let ptx = compile_module_image_for_current_device(stream.context(), &kernel).unwrap();
+            let module = stream.context().load_module(ptx).unwrap();
+            let func = module.load_function("embedding_bag_sum_kernel").unwrap();
+            compile_cache.insert(kernel.clone(), (module.clone(), func.clone()));
+            (module, func)
+        };
+
+        (
+            func,
+            module,
+            kernel,
+            (
+                Expression::from(self.batch),
+                Expression::from(1usize),
+                Expression::from(1usize),
+            ),
+            (
+                Expression::from(self.d),
+                Expression::from(1usize),
+                Expression::from(1usize),
+            ),
+            Expression::from(0usize),
+            FxHashMap::default(),
+        )
+    }
+
+    fn output_size(&self) -> Expression {
+        Expression::from(self.batch * self.d)
+    }
+
+    fn output_bytes(&self) -> Expression {
+        self.output_size() * 4
+    }
+
+    fn output_dtype(&self) -> DType {
+        DType::F32
+    }
+
+    fn bytes_loaded(&self) -> Expression {
+        Expression::from(self.batch * self.d * self.bag * 4 + self.batch * self.bag * 4)
+    }
+
+    fn bytes_stored(&self) -> Expression {
+        self.output_size() * 4
+    }
+
+    fn flops(&self) -> Expression {
+        Expression::from(self.batch * self.d * self.bag)
+    }
+
+    fn kernel_name(&self) -> &'static str {
+        "EmbeddingBagSum"
+    }
+}
+
+/// CustomOp wrapper for [`EmbeddingBagSumKernel`].
+#[derive(Debug, Clone)]
+pub struct EmbeddingBagSumCustom(pub EmbeddingBagSumKernel);
+
+impl CustomOp for EmbeddingBagSumCustom {
+    fn to_llir_op(&self) -> LLIROp {
+        LLIROp::new::<dyn KernelOp>(Box::new(self.0.clone()) as Box<dyn KernelOp>)
+    }
+}
+
+/// Frontend helper: `out[b, d] = Σ_l table[indices[b, l], d]` as one
+/// fused custom op. Both inputs must be 2-D and contiguous. The indices
+/// must be 32-bit ints (use `.cast(DType::Int)` first if your input is
+/// `i64`); shape `(batch, bag)`.
+pub fn embedding_bag_sum_kernel(table: GraphTensor, indices: GraphTensor) -> GraphTensor {
+    assert_eq!(table.dtype, DType::F32, "embedding_bag_sum_kernel: table must be F32");
+    assert_eq!(indices.dtype, DType::Int, "embedding_bag_sum_kernel: indices must be Int (i32)");
+    let t_dims = table.dims();
+    let i_dims = indices.dims();
+    assert_eq!(t_dims.len(), 2, "table must be 2D (n_emb, d)");
+    assert_eq!(i_dims.len(), 2, "indices must be 2D (batch, bag)");
+    let n_emb = t_dims[0].to_usize().expect("n_emb must be static");
+    let d = t_dims[1].to_usize().expect("d must be static");
+    let batch = i_dims[0].to_usize().expect("batch must be static");
+    let bag = i_dims[1].to_usize().expect("bag must be static");
+    let kern = EmbeddingBagSumKernel { batch, bag, d, n_emb };
+    let cx = unsafe { &mut *table.graph_ref };
+    cx.custom_op(EmbeddingBagSumCustom(kern), vec![table, indices], (batch, d), DType::F32)
+}
+
+// ---------------------------------------------------------------------------
+// Multi-table EmbeddingBag (one kernel for K independent (weight, idx) pairs)
+// ---------------------------------------------------------------------------
+
+/// Folds K independent `EmbeddingBag(sum)` lookups into a single CUDA
+/// kernel launch. Used by the vanilla-DLRMv1 translator path where the
+/// model has K separate `nn.EmbeddingBag` modules — each one would
+/// otherwise lower to its own (~5 µs) launch.
+///
+/// Inputs (in `KernelOp`-order):
+///   - `weight_0, weight_1, ..., weight_{K-1}` — each `(n_emb_k, d)` F32.
+///     **The per-table `n_emb` may differ**; only `d` and bag size `L`
+///     must match across tables.
+///   - `idx_0, idx_1, ..., idx_{K-1}` — each `(batch, L)` Int (i32).
+///
+/// Two packed staging buffers carry the K weight + K idx device pointers
+/// into the kernel (`build_params` fills them per execution via
+/// `cuMemcpyHtoD`). The hot loop reads each pointer from shared memory
+/// — no per-table switch needed.
+///
+/// Output shape: `(batch, num_tables, d)` F32, row-major.
+#[derive(Debug, Clone)]
+pub struct MultiTableEmbeddingBagSumKernel {
+    pub batch: usize,
+    pub bag: usize,
+    pub d: usize,
+    pub num_tables: usize,
+}
+
+impl KernelOp for MultiTableEmbeddingBagSumKernel {
+    fn compile(
+        &self,
+        stream: &Arc<CudaStream>,
+        compile_cache: &mut FxHashMap<String, (Arc<CudaModule>, CudaFunction)>,
+    ) -> (
+        CudaFunction,
+        Arc<CudaModule>,
+        String,
+        (Expression, Expression, Expression),
+        (Expression, Expression, Expression),
+        Expression,
+        FxHashMap<char, CudaSlice<u8>>,
+    ) {
+        // Layout (mirrors worktree's StackedEmbeddingBagSumKernel):
+        //   - One block per batch row (B blocks).
+        //   - Each block produces (K, D) output cells, striding over K·D
+        //     threads (rounded up to a warp).
+        //   - K weight pointers + K idx pointers come in via two packed
+        //     staging buffers populated in `build_params`.
+        //   - Shared memory caches both pointer arrays so the hot loop
+        //     reads at shmem latency.
+        let kernel = format!(
+            "
+extern \"C\" __global__ void multi_table_embedding_bag_sum_kernel(
+    float* __restrict__ out,
+    const long* __restrict__ w_ptrs_packed,
+    const long* __restrict__ idx_ptrs_packed
+) {{
+    const int B = {batch};
+    const int L = {bag};
+    const int D = {d};
+    const int K = {num_tables};
+    const int total = K * D;
+    int b = blockIdx.x;
+    if (b >= B) return;
+
+    __shared__ const float* s_w_ptrs[K];
+    __shared__ const int*   s_idx_ptrs[K];
+    if (threadIdx.x < K) {{
+        s_w_ptrs[threadIdx.x]   = (const float*)(w_ptrs_packed[threadIdx.x]);
+        s_idx_ptrs[threadIdx.x] = (const int*)(idx_ptrs_packed[threadIdx.x]);
+    }}
+    __syncthreads();
+
+    for (int tid = threadIdx.x; tid < total; tid += blockDim.x) {{
+        int k = tid / D;
+        int d = tid - k * D;
+        const float* w = s_w_ptrs[k];
+        const int*   idx = s_idx_ptrs[k];
+        float acc = 0.0f;
+        #pragma unroll 4
+        for (int l = 0; l < L; ++l) {{
+            int row = idx[b * L + l];
+            acc += w[row * D + d];
+        }}
+        // (B, K, D) row-major.
+        out[(b * K + k) * D + d] = acc;
+    }}
+}}
+",
+            batch = self.batch,
+            bag = self.bag,
+            d = self.d,
+            num_tables = self.num_tables,
+        );
+
+        let (module, func) = if let Some((m, f)) = compile_cache.get(&kernel) {
+            (m.clone(), f.clone())
+        } else {
+            let ptx = compile_module_image_for_current_device(stream.context(), &kernel).unwrap();
+            let module = stream.context().load_module(ptx).unwrap();
+            let func = module
+                .load_function("multi_table_embedding_bag_sum_kernel")
+                .unwrap();
+            compile_cache.insert(kernel.clone(), (module.clone(), func.clone()));
+            (module, func)
+        };
+
+        let total = self.num_tables * self.d;
+        let block_threads = total.next_multiple_of(32).clamp(32, 1024);
+        (
+            func,
+            module,
+            kernel,
+            (
+                Expression::from(self.batch),
+                Expression::from(1usize),
+                Expression::from(1usize),
+            ),
+            (
+                Expression::from(block_threads),
+                Expression::from(1usize),
+                Expression::from(1usize),
+            ),
+            Expression::from(0usize),
+            FxHashMap::default(),
+        )
+    }
+
+    fn output_size(&self) -> Expression {
+        Expression::from(self.batch * self.num_tables * self.d)
+    }
+
+    fn output_bytes(&self) -> Expression {
+        self.output_size() * 4
+    }
+
+    fn output_dtype(&self) -> DType {
+        DType::F32
+    }
+
+    fn bytes_loaded(&self) -> Expression {
+        Expression::from(self.batch * self.num_tables * self.d * self.bag * 4
+            + self.batch * self.num_tables * self.bag * 4)
+    }
+
+    fn bytes_stored(&self) -> Expression {
+        self.output_size() * 4
+    }
+
+    fn flops(&self) -> Expression {
+        Expression::from(self.batch * self.num_tables * self.d * self.bag)
+    }
+
+    fn kernel_name(&self) -> &'static str {
+        "MultiTableEmbeddingBagSum"
+    }
+
+    /// Two staging buffers: one for K weight ptrs, one for K idx ptrs.
+    /// Each is `K * 8` bytes (an array of u64s, written as `long*` on
+    /// the device side).
+    fn allocate_internal_buffers(
+        &self,
+        stream: &Arc<CudaStream>,
+        _dyn_map: &FxHashMap<char, usize>,
+    ) -> Vec<CudaSlice<u8>> {
+        let buf_size = self.num_tables * 8;
+        vec![
+            stream
+                .alloc_zeros::<u8>(buf_size)
+                .expect("alloc MultiTableEmbBag w-ptr staging buffer"),
+            stream
+                .alloc_zeros::<u8>(buf_size)
+                .expect("alloc MultiTableEmbBag idx-ptr staging buffer"),
+        ]
+    }
+
+    /// Pack the K weight + K idx pointers into the two staging buffers
+    /// each execution, then emit `[out, w_buf, idx_buf]` as kernel params.
+    ///
+    /// `input_ptrs` layout: `[w_0, w_1, ..., w_{K-1}, idx_0, ..., idx_{K-1}]`.
+    /// `cuMemcpyHtoD_v2` is a blocking host call so by the time we return
+    /// the staging buffers are populated and the subsequent CUDA-graph
+    /// node-param update reads stable device pointers.
+    fn build_params(
+        &self,
+        stream: &Arc<CudaStream>,
+        output_ptr: u64,
+        input_ptrs: &[u64],
+        internal_bufs: &[CudaSlice<u8>],
+        _dyn_dims_ptr: u64,
+    ) -> Vec<u64> {
+        assert_eq!(
+            input_ptrs.len(),
+            2 * self.num_tables,
+            "MultiTableEmbeddingBagSum: expected {} input pointers (K weights + K idx), got {}",
+            2 * self.num_tables,
+            input_ptrs.len(),
+        );
+        let (w_ptrs, idx_ptrs) = input_ptrs.split_at(self.num_tables);
+        let w_buf = &internal_bufs[0];
+        let idx_buf = &internal_bufs[1];
+        let w_dev_ptr: u64 = w_buf.device_ptr(stream).0;
+        let idx_dev_ptr: u64 = idx_buf.device_ptr(stream).0;
+        unsafe {
+            let r1 = cudarc::driver::sys::cuMemcpyHtoD_v2(
+                w_dev_ptr,
+                w_ptrs.as_ptr() as *const std::ffi::c_void,
+                w_ptrs.len() * 8,
+            );
+            assert_eq!(
+                r1,
+                cudarc::driver::sys::CUresult::CUDA_SUCCESS,
+                "cuMemcpyHtoD_v2 for MultiTableEmbBag w-ptr staging failed: {r1:?}",
+            );
+            let r2 = cudarc::driver::sys::cuMemcpyHtoD_v2(
+                idx_dev_ptr,
+                idx_ptrs.as_ptr() as *const std::ffi::c_void,
+                idx_ptrs.len() * 8,
+            );
+            assert_eq!(
+                r2,
+                cudarc::driver::sys::CUresult::CUDA_SUCCESS,
+                "cuMemcpyHtoD_v2 for MultiTableEmbBag idx-ptr staging failed: {r2:?}",
+            );
+        }
+        vec![output_ptr, w_dev_ptr, idx_dev_ptr]
+    }
+}
+
+#[derive(Debug, Clone)]
+pub struct MultiTableEmbeddingBagSumCustom(pub MultiTableEmbeddingBagSumKernel);
+
+impl CustomOp for MultiTableEmbeddingBagSumCustom {
+    fn to_llir_op(&self) -> LLIROp {
+        LLIROp::new::<dyn KernelOp>(Box::new(self.0.clone()) as Box<dyn KernelOp>)
+    }
+}
+
+/// Frontend helper: K independent EmbeddingBag(sum) lookups in one
+/// kernel launch. Returns `(batch, num_tables, d)` F32, row-major;
+/// slice along axis 1 (`out.slice_along(k..k+1, 1).squeeze(1)`) to
+/// recover the k-th table's `(batch, d)` output.
+///
+/// * `weights`: K `(n_emb_k, d)` F32 tensors. Per-table `n_emb` may
+///   differ; only `d` must be shared.
+/// * `indices`: K `(batch, bag)` Int tensors (cast `.cast(DType::Int)`
+///   on the caller side if your indices are i64).
+pub fn multi_table_embedding_bag_sum_kernel(
+    weights: Vec<GraphTensor>,
+    indices: Vec<GraphTensor>,
+) -> GraphTensor {
+    assert_eq!(
+        weights.len(),
+        indices.len(),
+        "multi_table_embedding_bag_sum_kernel: need one weight per index tensor"
+    );
+    let num_tables = weights.len();
+    assert!(num_tables >= 1, "need at least one table");
+    let first_w = weights[0];
+    let first_idx = indices[0];
+    let w_dims = first_w.dims();
+    let i_dims = first_idx.dims();
+    assert_eq!(w_dims.len(), 2, "weights must be 2D (n_emb, d)");
+    assert_eq!(i_dims.len(), 2, "indices must be 2D (batch, bag)");
+    let d = w_dims[1].to_usize().expect("d must be static");
+    let batch = i_dims[0].to_usize().expect("batch must be static");
+    let bag = i_dims[1].to_usize().expect("bag must be static");
+    for w in &weights {
+        assert_eq!(w.dtype, DType::F32, "weights must all be F32");
+        let wd = w.dims();
+        assert_eq!(wd.len(), 2, "weight must be 2D");
+        assert_eq!(
+            wd[1].to_usize().unwrap(),
+            d,
+            "all weights must share inner dim"
+        );
+    }
+    for idx in &indices {
+        assert_eq!(idx.dtype, DType::Int, "indices must all be Int (i32)");
+        let id = idx.dims();
+        assert_eq!(id.len(), 2);
+        assert_eq!(id[0].to_usize().unwrap(), batch);
+        assert_eq!(id[1].to_usize().unwrap(), bag);
+    }
+    let kern = MultiTableEmbeddingBagSumKernel {
+        batch,
+        bag,
+        d,
+        num_tables,
+    };
+    let mut inputs = weights;
+    inputs.extend(indices);
+    let cx = unsafe { &mut *first_w.graph_ref };
+    cx.custom_op(
+        MultiTableEmbeddingBagSumCustom(kern),
+        inputs,
+        (batch, num_tables, d),
+        DType::F32,
+    )
+}
--- a/crates/luminal_cuda_lite/src/kernel/mod.rs
+++ b/crates/luminal_cuda_lite/src/kernel/mod.rs
@@ -11,21 +11,31 @@ use uuid::Uuid;

 pub mod conv2d;
 pub mod cuda_graph;
+pub mod dlrm_interact;
 pub mod fusion;
+pub mod embedding_bag;
 pub mod hlir;
 pub mod matmul2d;
 pub mod other_ops;
 pub mod rope;

-pub use conv2d::{Conv2DCustom, Conv2DKernel, conv2d_bias};
+pub use conv2d::KernelConv2D;
 pub use cuda_graph::*;
+pub use dlrm_interact::{
+    PairwiseDotLowerTriCustom, PairwiseDotLowerTriKernel, dlrm_pairwise_dot_lower_tri,
+};
+pub use embedding_bag::{
+    EmbeddingBagSumCustom, EmbeddingBagSumKernel, MultiTableEmbeddingBagSumCustom,
+    MultiTableEmbeddingBagSumKernel, embedding_bag_sum_kernel,
+    multi_table_embedding_bag_sum_kernel,
+};
 pub use matmul2d::{
    Matmul2DCustom, Matmul2DKernel, linear_bias, linear_no_bias_bf16_w, matmul_2d, matmul_2d_t,
    matmul_3d, matmul_3d_t,
 };
 pub use rope::{RoPECustom, RoPEKernel, apply_rope};

-pub type Ops = (hlir::Ops, other_ops::Ops, fusion::Ops);
+pub type Ops = (hlir::Ops, other_ops::Ops, conv2d::KernelConv2D, fusion::Ops);

 /// Build a mapping from interned string IDs to their string values for a given sequence.
 fn build_interned_strings(trace: &schema::Trace) -> std::collections::HashMap<(u32, u64), String> {
--- a/crates/luminal_cuda_lite/src/runtime.rs
+++ b/crates/luminal_cuda_lite/src/runtime.rs
@@ -225,7 +225,6 @@ impl CudaRuntime {
            result::memcpy_dtod_async(dst_ptr, src.ptr(), src.len(), stream.cu_stream())
                .expect("cuMemcpyDtoDAsync failed");
        }
-        stream.synchronize().unwrap();
        dst
    }

@@ -1536,8 +1535,26 @@ impl Runtime for CudaRuntime {
                    );
                });
        }
-        // Single sync at end - CUDA stream ordering guarantees sequential execution
-        self.cuda_stream.synchronize().unwrap();
+        // Drop the trailing `cuStreamSynchronize` here. It existed only
+        // so the rust runtime could record `last_total_time_us` after
+        // all kernels finished and so we returned to the host caller
+        // with a synchronized result.
+        //
+        // Two reasons it must go:
+        //   1. `torch.cuda.CUDAGraph()` (capture mode) rejects any sync
+        //      on the current-thread's streams with
+        //      `CUDA_ERROR_STREAM_CAPTURE_UNSUPPORTED`. Even though
+        //      luminal uses its own stream, PyTorch's capture context
+        //      detects cross-thread/cross-stream syncs and panics.
+        //   2. CUDA's stream ordering is already enough: each subsequent
+        //      kernel/HostOp on the same stream will run after prior
+        //      ones, and the host-side caller (PyTorch reading the
+        //      output, or a subsequent torch op consuming our output)
+        //      issues its own sync when it actually needs a result.
+        //
+        // The `last_total_time_us` field becomes a launch-time-only
+        // measurement (CPU clock from execute start to enqueue end);
+        // existing tests use `KernelStats` for real per-kernel timing.
        self.last_total_time_us = total_start.elapsed().as_secs_f64() * 1_000_000.0;

        // Populate last_kernel_stats from HostOps that report stats
--- a/crates/luminal_cuda_lite/src/tests/conv2d_rewrite.rs
+++ b/crates/luminal_cuda_lite/src/tests/conv2d_rewrite.rs
@@ -0,0 +1,482 @@
+use luminal::{
+    egglog_utils::{
+        NodeId, SerializedEGraph, egglog_to_llir, random_initial_choice, validate_choice_set,
+    },
+    prelude::*,
+};
+use rand::{SeedableRng, rngs::StdRng};
+
+use crate::{kernel::KernelOp, runtime::CudaRuntime};
+
+use super::utilities::{assert_close, get_cuda_stream};
+
+fn conv2d_bias_hlir(
+    x: GraphTensor,
+    weight: GraphTensor,
+    bias: GraphTensor,
+    kernel_h: usize,
+    kernel_w: usize,
+) -> GraphTensor {
+    let unfolded = x.unfold(
+        vec![1usize, kernel_h, kernel_w],
+        vec![1usize, 1, 1],
+        vec![1usize, 1, 1],
+    );
+    let output_spatial_dims = unfolded.dims()[1..3].to_vec();
+
+    let mut patches = unfolded.squeeze(3).permute(&[1, 2, 0, 3, 4]);
+    while patches.dims().len() > 3 {
+        let last = patches.dims().len();
+        patches = patches.merge_dims(last - 2, last - 1);
+    }
+    let patches = patches.merge_dims(0, 1);
+
+    let out = patches.matmul(weight.t());
+    let out = out
+        .split_dims(0, output_spatial_dims[1])
+        .permute(&[2, 0, 1]);
+    let out_dims = out.dims();
+    out + bias.expand_dim(1, out_dims[1]).expand_dim(2, out_dims[2])
+}
+
+fn build_conv_graph() -> (Graph, GraphTensor, GraphTensor, GraphTensor, GraphTensor) {
+    let mut cx = Graph::new();
+    let x = cx.tensor((2usize, 5usize, 6usize));
+    let weight = cx.tensor((3usize, 2usize * 3 * 2));
+    let bias = cx.tensor(3usize);
+    let out = conv2d_bias_hlir(x, weight, bias, 3, 2).output();
+    (cx, x, weight, bias, out)
+}
+
+fn conv2d_bias_padded_hlir(
+    x: GraphTensor,
+    weight: GraphTensor,
+    bias: GraphTensor,
+    kernel: usize,
+    padding: usize,
+) -> GraphTensor {
+    let zero = Expression::from(0);
+    let pad = Expression::from(padding);
+    let padded = x.pad(vec![(zero, zero), (pad, pad), (pad, pad)], 0.0);
+    conv2d_bias_hlir(padded, weight, bias, kernel, kernel)
+}
+
+fn build_padded_conv_graph() -> (Graph, GraphTensor, GraphTensor, GraphTensor, GraphTensor) {
+    let mut cx = Graph::new();
+    let x = cx.tensor((2usize, 4usize, 5usize));
+    let weight = cx.tensor((3usize, 2usize * 3 * 3));
+    let bias = cx.tensor(3usize);
+    let out = conv2d_bias_padded_hlir(x, weight, bias, 3, 1).output();
+    (cx, x, weight, bias, out)
+}
+
+fn nearest_upsample_2x_hlir(x: GraphTensor) -> GraphTensor {
+    let stage1 = x.expand_dim(2, 2usize).merge_dims(1, 2);
+    stage1.expand_dim(3, 2usize).merge_dims(2, 3)
+}
+
+fn build_upsample_conv_graph() -> (Graph, GraphTensor, GraphTensor, GraphTensor, GraphTensor) {
+    let mut cx = Graph::new();
+    let x = cx.tensor((2usize, 3usize, 4usize));
+    let weight = cx.tensor((3usize, 2usize * 3 * 3));
+    let bias = cx.tensor(3usize);
+    let up = nearest_upsample_2x_hlir(x);
+    let out = conv2d_bias_padded_hlir(up, weight, bias, 3, 1).output();
+    (cx, x, weight, bias, out)
+}
+
+fn conv1x1_bias_hlir(x: GraphTensor, weight: GraphTensor, bias: GraphTensor) -> GraphTensor {
+    let dims = x.dims();
+    let h = dims[1];
+    let w = dims[2];
+    let xt = x.permute(&[1, 2, 0]).merge_dims(0, 1);
+    let out = xt.matmul(weight.t());
+    let out = out.split_dims(0, w).permute(&[2, 0, 1]);
+    out + bias.expand_dim(1, h).expand_dim(2, w)
+}
+
+fn build_conv1x1_graph() -> (Graph, GraphTensor, GraphTensor, GraphTensor, GraphTensor) {
+    let mut cx = Graph::new();
+    let x = cx.tensor((2usize, 4usize, 5usize));
+    let weight = cx.tensor((3usize, 2usize));
+    let bias = cx.tensor(3usize);
+    let out = conv1x1_bias_hlir(x, weight, bias).output();
+    (cx, x, weight, bias, out)
+}
+
+fn conv2d_matmul_without_conv_output_shape(
+    x: GraphTensor,
+    weight: GraphTensor,
+    bias: GraphTensor,
+    kernel_h: usize,
+    kernel_w: usize,
+) -> GraphTensor {
+    let unfolded = x.unfold(
+        vec![1usize, kernel_h, kernel_w],
+        vec![1usize, 1, 1],
+        vec![1usize, 1, 1],
+    );
+
+    let mut patches = unfolded.squeeze(3).permute(&[1, 2, 0, 3, 4]);
+    while patches.dims().len() > 3 {
+        let last = patches.dims().len();
+        patches = patches.merge_dims(last - 2, last - 1);
+    }
+    let patches = patches.merge_dims(0, 1);
+
+    let out = patches.matmul(weight.t());
+    let out_dims = out.dims();
+    out + bias.expand_dim(0, out_dims[0])
+}
+
+#[test]
+fn generic_conv2d_rewrite_matches_unfold_matmul_bias() {
+    let (mut cx, _, _, _, _) = build_conv_graph();
+    cx.build_search_space::<CudaRuntime>();
+    let egraph = cx.egraph().expect("search space should have an e-graph");
+
+    assert!(
+        !op_ir_nodes(egraph, "KernelConv2D").is_empty(),
+        "expected generic conv2d rewrite candidate"
+    );
+    assert!(
+        op_ir_nodes(egraph, "Add").is_empty(),
+        "generic conv2d cleanup should prune the final bias Add fallback"
+    );
+}
+
+#[test]
+fn generic_conv2d_rewrite_matches_conv1x1_matmul_bias() {
+    let (mut cx, _, _, _, _) = build_conv1x1_graph();
+    cx.build_search_space::<CudaRuntime>();
+    let egraph = cx.egraph().expect("search space should have an e-graph");
+
+    assert!(
+        !op_ir_nodes(egraph, "KernelConv2D").is_empty(),
+        "expected generic conv2d rewrite candidate for 1x1 conv"
+    );
+}
+
+#[test]
+fn generic_conv2d_rewrite_requires_conv_output_shape() {
+    let mut cx = Graph::new();
+    let x = cx.tensor((2usize, 5usize, 6usize));
+    let weight = cx.tensor((3usize, 2usize * 3 * 2));
+    let bias = cx.tensor(3usize);
+    conv2d_matmul_without_conv_output_shape(x, weight, bias, 3, 2).output();
+
+    cx.build_search_space::<CudaRuntime>();
+    let egraph = cx.egraph().expect("search space should have an e-graph");
+
+    assert!(
+        op_ir_nodes(egraph, "KernelConv2D").is_empty(),
+        "matmul+bias without [C_out,H_out,W_out] conv output shape should not match KernelConv2D"
+    );
+}
+
+#[test]
+fn generic_conv2d_candidate_executes_unfold_matmul_bias() {
+    let Some(stream) = get_cuda_stream() else {
+        return;
+    };
+
+    let (mut cx, x, weight, bias, out) = build_conv_graph();
+    cx.build_search_space::<CudaRuntime>();
+    let llir = extract_forced_kernel_llir(&mut cx, "GenericConv2D");
+
+    let input: Vec<f32> = (0..2 * 5 * 6).map(|i| i as f32 * 0.03 - 0.4).collect();
+    let weights: Vec<f32> = (0..3 * 2 * 3 * 2)
+        .map(|i| (i as f32 % 11.0) * 0.04 - 0.2)
+        .collect();
+    let biases = vec![0.25_f32, -0.15, 0.05];
+    let expected = reference_conv2d(
+        &input,
+        &weights,
+        &biases,
+        ConvCase {
+            c_in: 2,
+            h: 5,
+            w: 6,
+            c_out: 3,
+            kh: 3,
+            kw: 2,
+            padding_h: 0,
+            padding_w: 0,
+        },
+    );
+
+    let mut rt = CudaRuntime::initialize(stream);
+    rt.load_llir(&llir);
+    rt.set_data(x, input);
+    rt.set_data(weight, weights);
+    rt.set_data(bias, biases);
+    rt.execute(&cx.dyn_map);
+
+    assert_close(&rt.get_f32(out.id), &expected, 1e-5, 1e-5);
+}
+
+#[test]
+fn generic_conv2d_candidate_executes_conv1x1_matmul_bias() {
+    let Some(stream) = get_cuda_stream() else {
+        return;
+    };
+
+    let (mut cx, x, weight, bias, out) = build_conv1x1_graph();
+    cx.build_search_space::<CudaRuntime>();
+    let llir = extract_forced_kernel_llir(&mut cx, "GenericConv2D");
+
+    let input: Vec<f32> = (0..2 * 4 * 5).map(|i| i as f32 * 0.07 - 1.0).collect();
+    let weights: Vec<f32> = (0..3 * 2).map(|i| (i as f32 % 5.0) * 0.11 - 0.2).collect();
+    let biases = vec![0.2_f32, -0.1, 0.4];
+    let expected = reference_conv2d(
+        &input,
+        &weights,
+        &biases,
+        ConvCase {
+            c_in: 2,
+            h: 4,
+            w: 5,
+            c_out: 3,
+            kh: 1,
+            kw: 1,
+            padding_h: 0,
+            padding_w: 0,
+        },
+    );
+
+    let mut rt = CudaRuntime::initialize(stream);
+    rt.load_llir(&llir);
+    rt.set_data(x, input);
+    rt.set_data(weight, weights);
+    rt.set_data(bias, biases);
+    rt.execute(&cx.dyn_map);
+
+    assert_close(&rt.get_f32(out.id), &expected, 1e-5, 1e-5);
+}
+
+#[test]
+fn generic_conv2d_candidate_executes_padded_unfold_matmul_bias() {
+    let Some(stream) = get_cuda_stream() else {
+        return;
+    };
+
+    let (mut cx, x, weight, bias, out) = build_padded_conv_graph();
+    cx.build_search_space::<CudaRuntime>();
+    let llir = extract_forced_kernel_llir(&mut cx, "GenericConv2D");
+
+    let input: Vec<f32> = (0..2 * 4 * 5).map(|i| i as f32 * 0.05 - 0.5).collect();
+    let weights: Vec<f32> = (0..3 * 2 * 3 * 3)
+        .map(|i| (i as f32 % 13.0) * 0.03 - 0.17)
+        .collect();
+    let biases = vec![0.15_f32, -0.25, 0.35];
+    let expected = reference_conv2d(
+        &input,
+        &weights,
+        &biases,
+        ConvCase {
+            c_in: 2,
+            h: 4,
+            w: 5,
+            c_out: 3,
+            kh: 3,
+            kw: 3,
+            padding_h: 1,
+            padding_w: 1,
+        },
+    );
+
+    let mut rt = CudaRuntime::initialize(stream);
+    rt.load_llir(&llir);
+    rt.set_data(x, input);
+    rt.set_data(weight, weights);
+    rt.set_data(bias, biases);
+    rt.execute(&cx.dyn_map);
+
+    assert_close(&rt.get_f32(out.id), &expected, 1e-5, 1e-5);
+}
+
+#[test]
+fn generic_conv2d_candidate_executes_upsample_view_input() {
+    let Some(stream) = get_cuda_stream() else {
+        return;
+    };
+
+    let (mut cx, x, weight, bias, out) = build_upsample_conv_graph();
+    cx.build_search_space::<CudaRuntime>();
+    let llir = extract_forced_kernel_llir(&mut cx, "GenericConv2D");
+
+    let input: Vec<f32> = (0..2 * 3 * 4).map(|i| i as f32 * 0.09 - 0.8).collect();
+    let weights: Vec<f32> = (0..3 * 2 * 3 * 3)
+        .map(|i| (i as f32 % 17.0) * 0.025 - 0.2)
+        .collect();
+    let biases = vec![0.05_f32, -0.1, 0.2];
+    let upsampled = reference_nearest_upsample_2x(&input, 2, 3, 4);
+    let expected = reference_conv2d(
+        &upsampled,
+        &weights,
+        &biases,
+        ConvCase {
+            c_in: 2,
+            h: 6,
+            w: 8,
+            c_out: 3,
+            kh: 3,
+            kw: 3,
+            padding_h: 1,
+            padding_w: 1,
+        },
+    );
+
+    let mut rt = CudaRuntime::initialize(stream);
+    rt.load_llir(&llir);
+    rt.set_data(x, input);
+    rt.set_data(weight, weights);
+    rt.set_data(bias, biases);
+    rt.execute(&cx.dyn_map);
+
+    assert_close(&rt.get_f32(out.id), &expected, 1e-5, 1e-5);
+}
+
+struct ConvCase {
+    c_in: usize,
+    h: usize,
+    w: usize,
+    c_out: usize,
+    kh: usize,
+    kw: usize,
+    padding_h: usize,
+    padding_w: usize,
+}
+
+fn reference_nearest_upsample_2x(input: &[f32], c: usize, h: usize, w: usize) -> Vec<f32> {
+    let mut out = vec![0.0_f32; c * h * 2 * w * 2];
+    for ci in 0..c {
+        for y in 0..h {
+            for x in 0..w {
+                let value = input[ci * h * w + y * w + x];
+                for dy in 0..2 {
+                    for dx in 0..2 {
+                        let oy = y * 2 + dy;
+                        let ox = x * 2 + dx;
+                        out[ci * h * 2 * w * 2 + oy * w * 2 + ox] = value;
+                    }
+                }
+            }
+        }
+    }
+    out
+}
+
+fn reference_conv2d(input: &[f32], weight: &[f32], bias: &[f32], case: ConvCase) -> Vec<f32> {
+    let ConvCase {
+        c_in,
+        h,
+        w,
+        c_out,
+        kh,
+        kw,
+        padding_h,
+        padding_w,
+    } = case;
+    let h_out = h + 2 * padding_h - kh + 1;
+    let w_out = w + 2 * padding_w - kw + 1;
+    let mut out = vec![0.0; c_out * h_out * w_out];
+    for co in 0..c_out {
+        for oh in 0..h_out {
+            for ow in 0..w_out {
+                let mut acc = bias[co];
+                for ci in 0..c_in {
+                    for r in 0..kh {
+                        for s in 0..kw {
+                            let Some(ih) = (oh + r).checked_sub(padding_h) else {
+                                continue;
+                            };
+                            let Some(iw) = (ow + s).checked_sub(padding_w) else {
+                                continue;
+                            };
+                            if ih >= h || iw >= w {
+                                continue;
+                            }
+                            let input_idx = ci * h * w + ih * w + iw;
+                            let weight_idx = co * c_in * kh * kw + (ci * kh + r) * kw + s;
+                            acc += input[input_idx] * weight[weight_idx];
+                        }
+                    }
+                }
+                out[co * h_out * w_out + oh * w_out + ow] = acc;
+            }
+        }
+    }
+    out
+}
+
+fn extract_forced_kernel_llir(cx: &mut Graph, kernel_name: &str) -> LLIRGraph {
+    let egraph = cx.egraph().expect("search space should have an e-graph");
+    let ops = cx
+        .egglog_ops()
+        .expect("search space should have registered egglog ops");
+    let kernel_nodes = op_ir_nodes(egraph, "KernelConv2D");
+    assert!(
+        !kernel_nodes.is_empty(),
+        "expected at least one {kernel_name} candidate"
+    );
+
+    for (idx, kernel_node) in kernel_nodes.iter().enumerate() {
+        let mut rng = StdRng::seed_from_u64(0xC0_2D00 + idx as u64);
+        let mut choices = random_initial_choice(egraph, &mut rng);
+        let kernel_class = &egraph.node_to_class[*kernel_node];
+        choices.insert(kernel_class, kernel_node);
+
+        if validate_choice_set(egraph, &choices, ops).is_err() {
+            continue;
+        }
+
+        let mut list_cache = FxHashMap::default();
+        let mut expr_cache = FxHashMap::default();
+        let llir = egglog_to_llir(
+            egraph,
+            choices,
+            ops,
+            &cx.custom_ops,
+            &mut list_cache,
+            &mut expr_cache,
+            None,
+        );
+        if llir_kernel_names(&llir).contains(&kernel_name) {
+            return llir;
+        }
+    }
+
+    panic!("could not extract a valid {kernel_name} candidate");
+}
+
+fn llir_kernel_names(llir: &LLIRGraph) -> Vec<&'static str> {
+    llir.node_indices()
+        .filter_map(|node| {
+            llir[node]
+                .to_dialect::<dyn KernelOp>()
+                .map(|kernel| kernel.kernel_name())
+        })
+        .collect()
+}
+
+fn op_ir_nodes<'a>(egraph: &'a SerializedEGraph, kind_label: &str) -> Vec<&'a NodeId> {
+    let op_kind_classes = egraph
+        .enodes
+        .iter()
+        .filter(|(_, (label, _))| label == kind_label)
+        .map(|(node, _)| egraph.node_to_class[node].clone())
+        .collect::<Vec<_>>();
+
+    egraph
+        .enodes
+        .iter()
+        .filter_map(|(node, (label, children))| {
+            (label == "Op"
+                && children
+                    .first()
+                    .is_some_and(|kind| op_kind_classes.contains(kind)))
+            .then_some(node)
+        })
+        .collect()
+}
--- a/crates/luminal_cuda_lite/src/tests/cublaslt_rewrite_tests.rs
+++ b/crates/luminal_cuda_lite/src/tests/cublaslt_rewrite_tests.rs
@@ -445,6 +445,54 @@ fn cublaslt_rewrites_cover_batched_row_order_layout_pairs() {
    }
 }

+#[test]
+fn cublaslt_rewrites_cover_flux2_qk_transposed_matmul() {
+    let mut cx = Graph::new();
+    let q = cx.tensor((8usize, 4usize));
+    let k = cx.tensor((8usize, 4usize));
+    let _out = q.matmul(k.t()).output();
+
+    assert_cublaslt_rewrite(cx, "flux2 q @ k.t()", |llir| {
+        cublaslt_matrix_order_tuples(llir).contains(&("ROW", "COL", "ROW", "ROW"))
+            || cublaslt_matrix_order_tuples(llir).contains(&("COL", "COL", "COL", "COL"))
+    });
+}
+
+#[test]
+fn cublaslt_rewrites_cover_flux2_linear_bias_epilogue() {
+    let mut cx = Graph::new();
+    let x = cx.tensor((8usize, 4usize));
+    let weight = cx.tensor((6usize, 4usize));
+    let bias = cx.tensor(6usize);
+    let _out = (x.matmul(weight.t()) + bias.expand_dim(0, 8usize)).output();
+
+    assert_cublaslt_epilogue_rewrite(
+        cx,
+        "flux2 x @ weight.t() + bias",
+        "BIAS",
+        Some(("COL", "COL", "COL", "COL")),
+    );
+}
+
+#[test]
+fn cublaslt_cleanup_prunes_flux2_broadcast_mul_fallback() {
+    let mut cx = Graph::new();
+    let q = cx.tensor((8usize, 4usize));
+    let k = cx.tensor((8usize, 4usize));
+    let _out = q.matmul(k.t()).output();
+
+    cx.build_search_space::<CudaRuntime>();
+    let egraph = cx.egraph().expect("search space should have an e-graph");
+    assert!(
+        !cublaslt_ir_nodes(egraph).is_empty(),
+        "Flux2 q @ k.t() should have at least one cuBLASLt candidate"
+    );
+    assert!(
+        op_ir_nodes(egraph, "Mul").is_empty(),
+        "cuBLASLt cleanup should prune the broadcast Mul fallback once a cuBLASLt candidate exists"
+    );
+}
+
 #[test]
 fn cublaslt_rewrites_keep_c_and_d_layouts_equal_initially() {
    for case in LAYOUT_CASES {
@@ -3033,10 +3081,17 @@ fn assert_no_cublaslt_llir_where(
 }

 fn cublaslt_ir_nodes(egraph: &SerializedEGraph) -> Vec<&NodeId> {
-    let cublaslt_kind_classes = egraph
+    op_ir_nodes(egraph, "cublaslt")
+        .into_iter()
+        .chain(op_ir_nodes(egraph, "cublaslt_scaled"))
+        .collect()
+}
+
+fn op_ir_nodes<'a>(egraph: &'a SerializedEGraph, kind_label: &str) -> Vec<&'a NodeId> {
+    let op_kind_classes = egraph
        .enodes
        .iter()
-        .filter(|(_, (label, _))| label == "cublaslt" || label == "cublaslt_scaled")
+        .filter(|(_, (label, _))| label == kind_label)
        .map(|(node, _)| egraph.node_to_class[node].clone())
        .collect::<Vec<_>>();

@@ -3047,7 +3102,7 @@ fn cublaslt_ir_nodes(egraph: &SerializedEGraph) -> Vec<&NodeId> {
            (label == "Op"
                && children
                    .first()
-                    .is_some_and(|kind| cublaslt_kind_classes.contains(kind)))
+                    .is_some_and(|kind| op_kind_classes.contains(kind)))
            .then_some(node)
        })
        .collect()
--- a/crates/luminal_cuda_lite/src/tests/mod.rs
+++ b/crates/luminal_cuda_lite/src/tests/mod.rs
@@ -5,6 +5,8 @@ mod bucket_tests;
 #[cfg(test)]
 mod consumed_buffer_tests;
 #[cfg(test)]
+mod conv2d_rewrite;
+#[cfg(test)]
 mod cublaslt_rewrite_tests;
 #[cfg(test)]
 mod flashinfer;
--- a/crates/luminal_metal/Cargo.toml
+++ b/crates/luminal_metal/Cargo.toml
@@ -19,7 +19,14 @@ bytemuck = "1.24.0"

 [dev-dependencies]
 candle-core = "0.9.2-alpha.1"
+hf-hub = { version = "0.4", default-features = false, features = ["rustls-tls", "ureq"] }
+luminal_nn = { path = "../luminal_nn" }
+luminal_tracing = { path = "../luminal_tracing" }
 proptest = "1.9.0"
+rand = "0.9.2"
+rustc-hash = "2.1"
+tokenizers = "0.22.2"
+tracing-subscriber = { version = "0.3", features = ["env-filter"] }

 [lints.rust]
 unexpected_cfgs = { level = "warn", check-cfg = ['cfg(feature, values("cargo-clippy"))'] }
--- a/crates/luminal_metal/examples/llama_1b.rs
+++ b/crates/luminal_metal/examples/llama_1b.rs
@@ -0,0 +1,641 @@
+use hf_hub::api::sync::Api;
+use luminal::{
+    dtype::DType,
+    graph::{BuildSearchSpaceOptions, DimBucket, Graph},
+    prelude::{F32Pow, GraphTensor, Runtime},
+};
+use luminal_metal::MetalRuntime;
+use luminal_nn::{LayerNorm, gather_rows, scatter_rows};
+use luminal_tracing::luminal_filter;
+use rustc_hash::FxHashSet;
+use std::{
+    error::Error,
+    io::Write,
+    path::PathBuf,
+    time::{Duration, Instant},
+};
+use tokenizers::Tokenizer;
+use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt};
+
+const REPO_ID: &str = "unsloth/Llama-3.2-1B-Instruct";
+const MAX_SEQ_LEN: usize = 2048;
+const GEN_TOKENS: usize = 96;
+const SEARCH_GRAPHS: usize = 100;
+const SEARCH_MEMORY_MIB: usize = 1536;
+const PROMPT: &str = "In one short paragraph, explain neural networks using the words layers, neurons, learning, and data.";
+
+const LAYERS: usize = 16;
+const HIDDEN: usize = 2048;
+const INTERMEDIATE: usize = 8192;
+const HEAD_DIM: usize = 64;
+const N_HEADS: usize = 32;
+const N_KV_HEADS: usize = 8;
+const KV_GROUPS: usize = N_HEADS / N_KV_HEADS;
+const KV_DIM: usize = N_KV_HEADS * HEAD_DIM;
+const VOCAB_SIZE: usize = 128256;
+const RMS_NORM_EPS: f32 = 1e-5;
+const ROPE_THETA: f32 = 500_000.0;
+const EOS_TOKEN: u32 = 128009;
+const STOP_TOKEN: u32 = 128001;
+
+fn prepare_hf_model() -> Result<PathBuf, Box<dyn Error>> {
+    let repo = Api::new()?.model(REPO_ID.to_string());
+    let tokenizer_path = repo.get("tokenizer.json")?;
+    repo.get("model.safetensors")?;
+    Ok(tokenizer_path.parent().unwrap().to_path_buf())
+}
+
+fn llama3_chat_prompt(user_prompt: &str) -> String {
+    format!(
+        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{user_prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
+    )
+}
+
+#[derive(Default, Clone)]
+struct StepProfile {
+    total: Duration,
+    execute: Duration,
+    get_logits: Duration,
+    cache_roundtrip: Duration,
+}
+
+fn avg_ms(duration: Duration, n: usize) -> f64 {
+    if n == 0 {
+        0.0
+    } else {
+        duration.as_secs_f64() * 1e3 / n as f64
+    }
+}
+
+fn sample_greedy(logits_row: &[f32], seen: &FxHashSet<u32>, repetition_penalty: f32) -> u32 {
+    let mut row = logits_row.to_vec();
+    for &tok in seen {
+        let logit = &mut row[tok as usize];
+        if *logit > 0.0 {
+            *logit /= repetition_penalty;
+        } else {
+            *logit *= repetition_penalty;
+        }
+    }
+    row.iter()
+        .enumerate()
+        .max_by(|(_, a), (_, b)| a.total_cmp(b))
+        .unwrap()
+        .0 as u32
+}
+
+fn causal_mask(q_pos: &[usize], context_len: usize) -> Vec<f32> {
+    let mut mask = vec![-1e10f32; q_pos.len() * context_len];
+    for (qi, &pos) in q_pos.iter().enumerate() {
+        for ci in 0..context_len {
+            if ci <= pos {
+                mask[qi * context_len + ci] = 0.0;
+            }
+        }
+    }
+    mask
+}
+
+struct KVCache {
+    k_caches: Vec<GraphTensor>,
+    v_caches: Vec<GraphTensor>,
+}
+
+impl KVCache {
+    fn new(cx: &mut Graph, num_slots: usize) -> Self {
+        let mut k_caches = Vec::with_capacity(LAYERS);
+        let mut v_caches = Vec::with_capacity(LAYERS);
+        for l in 0..LAYERS {
+            k_caches.push(
+                cx.named_tensor(format!("kv_cache.{l}.k"), (num_slots, KV_DIM))
+                    .persist(),
+            );
+            v_caches.push(
+                cx.named_tensor(format!("kv_cache.{l}.v"), (num_slots, KV_DIM))
+                    .persist(),
+            );
+        }
+        Self { k_caches, v_caches }
+    }
+}
+
+struct Llama {
+    embedding: GraphTensor,
+    layers: Vec<LlamaLayer>,
+    lm_norm: LayerNorm,
+}
+
+impl Llama {
+    fn init(cx: &mut Graph) -> Self {
+        let mut layers = Vec::with_capacity(LAYERS);
+        for l in 0..LAYERS {
+            layers.push(LlamaLayer {
+                up: cx
+                    .named_tensor(
+                        format!("model.layers.{l}.mlp.up_proj.weight"),
+                        (INTERMEDIATE, HIDDEN),
+                    )
+                    .persist(),
+                gate: cx
+                    .named_tensor(
+                        format!("model.layers.{l}.mlp.gate_proj.weight"),
+                        (INTERMEDIATE, HIDDEN),
+                    )
+                    .persist(),
+                down: cx
+                    .named_tensor(
+                        format!("model.layers.{l}.mlp.down_proj.weight"),
+                        (HIDDEN, INTERMEDIATE),
+                    )
+                    .persist(),
+                q_proj: cx
+                    .named_tensor(
+                        format!("model.layers.{l}.self_attn.q_proj.weight"),
+                        (HIDDEN, HIDDEN),
+                    )
+                    .persist(),
+                k_proj: cx
+                    .named_tensor(
+                        format!("model.layers.{l}.self_attn.k_proj.weight"),
+                        (KV_DIM, HIDDEN),
+                    )
+                    .persist(),
+                v_proj: cx
+                    .named_tensor(
+                        format!("model.layers.{l}.self_attn.v_proj.weight"),
+                        (KV_DIM, HIDDEN),
+                    )
+                    .persist(),
+                o_proj: cx
+                    .named_tensor(
+                        format!("model.layers.{l}.self_attn.o_proj.weight"),
+                        (HIDDEN, HIDDEN),
+                    )
+                    .persist(),
+                attn_rms: LayerNorm::new(
+                    HIDDEN,
+                    Some(&format!("model.layers.{l}.input_layernorm.weight")),
+                    None,
+                    false,
+                    RMS_NORM_EPS,
+                    cx,
+                ),
+                mlp_rms: LayerNorm::new(
+                    HIDDEN,
+                    Some(&format!("model.layers.{l}.post_attention_layernorm.weight")),
+                    None,
+                    false,
+                    RMS_NORM_EPS,
+                    cx,
+                ),
+            });
+        }
+
+        Self {
+            embedding: cx
+                .named_tensor("model.embed_tokens.weight", (VOCAB_SIZE, HIDDEN))
+                .persist(),
+            layers,
+            lm_norm: LayerNorm::new(
+                HIDDEN,
+                Some("model.norm.weight"),
+                None,
+                false,
+                RMS_NORM_EPS,
+                cx,
+            ),
+        }
+    }
+
+    fn forward(
+        &self,
+        input: GraphTensor,
+        q_pos: GraphTensor,
+        scatter_idx: GraphTensor,
+        gather_idx: GraphTensor,
+        attn_mask: GraphTensor,
+        kv_cache: &KVCache,
+    ) -> (GraphTensor, Vec<(GraphTensor, GraphTensor)>) {
+        let seq = input.dims1();
+        let mut x = self.embedding.gather(
+            (input * HIDDEN).expand_dim(1, HIDDEN)
+                + input.graph().arange(HIDDEN).expand_dim(0, seq),
+        );
+        let mut cache_outputs = Vec::with_capacity(LAYERS);
+        for (i, layer) in self.layers.iter().enumerate() {
+            let (x_new, k_out, v_out) = layer.forward(
+                x,
+                q_pos,
+                scatter_idx,
+                gather_idx,
+                attn_mask,
+                kv_cache.k_caches[i],
+                kv_cache.v_caches[i],
+            );
+            x = x_new;
+            cache_outputs.push((k_out, v_out));
+        }
+
+        let logits = self.lm_norm.forward(x).matmul(self.embedding.t());
+        (logits, cache_outputs)
+    }
+}
+
+struct LlamaLayer {
+    up: GraphTensor,
+    gate: GraphTensor,
+    down: GraphTensor,
+    q_proj: GraphTensor,
+    k_proj: GraphTensor,
+    v_proj: GraphTensor,
+    o_proj: GraphTensor,
+    attn_rms: LayerNorm,
+    mlp_rms: LayerNorm,
+}
+
+fn llama_rotary_embeddings(mut input: GraphTensor, pos_ids: GraphTensor) -> GraphTensor {
+    input = input.split_dims(1, HEAD_DIM).transpose(0, 1);
+
+    let freqs = input
+        .graph()
+        .arange_options(0, HEAD_DIM, 2)
+        .cast(DType::F32)
+        / HEAD_DIM as f32;
+    let inv_freqs = ROPE_THETA.pow(freqs).reciprocal();
+    let emb = pos_ids
+        .cast(DType::F32)
+        .expand_dim(1, 1)
+        .matmul(inv_freqs.expand_dim(0, 1));
+
+    let x0 = input.slice((.., .., ..HEAD_DIM / 2));
+    let x1 = input.slice((.., .., HEAD_DIM / 2..));
+
+    let cos = emb.cos().expand_dim(0, x0.dims()[0]);
+    let sin = emb.sin().expand_dim(0, x0.dims()[0]);
+    let x0_out = x0 * cos - x1 * sin;
+    let x1_out = x1 * cos + x0 * sin;
+
+    x0_out
+        .concat_along(x1_out, 2)
+        .transpose(0, 1)
+        .merge_dims(1, 2)
+}
+
+#[allow(clippy::too_many_arguments)]
+fn attention(
+    q_rope: GraphTensor,
+    k_rope: GraphTensor,
+    v: GraphTensor,
+    k_cache: GraphTensor,
+    v_cache: GraphTensor,
+    scatter_idx: GraphTensor,
+    gather_idx: GraphTensor,
+    attn_mask: GraphTensor,
+) -> (GraphTensor, GraphTensor, GraphTensor) {
+    let k_cache_out = scatter_rows(k_rope, scatter_idx, k_cache, KV_DIM);
+    let v_cache_out = scatter_rows(v, scatter_idx, v_cache, KV_DIM);
+
+    let k = gather_rows(k_cache_out, gather_idx, KV_DIM);
+    let v_ctx = gather_rows(v_cache_out, gather_idx, KV_DIM);
+
+    let q = (q_rope * 1.0).split_dims(1, HEAD_DIM).transpose(0, 1);
+    let k = k.split_dims(1, HEAD_DIM).permute((1, 2, 0));
+    let v_ctx = v_ctx.split_dims(1, HEAD_DIM).transpose(0, 1);
+
+    let k = k.expand_dim(1, KV_GROUPS).merge_dims(0, 1) * 1.0;
+    let v_ctx = v_ctx.expand_dim(1, KV_GROUPS).merge_dims(0, 1) * 1.0;
+
+    let scores = q.matmul(k) / (HEAD_DIM as f32).sqrt();
+    let masked_scores = scores + attn_mask.expand_dim(0, N_HEADS);
+    let weights = masked_scores.softmax(2);
+    let out = weights.matmul(v_ctx);
+    let attn_out = out.transpose(0, 1).merge_dims(1, 2);
+
+    (attn_out, k_cache_out, v_cache_out)
+}
+
+impl LlamaLayer {
+    #[allow(clippy::too_many_arguments)]
+    fn forward(
+        &self,
+        mut x: GraphTensor,
+        q_pos: GraphTensor,
+        scatter_idx: GraphTensor,
+        gather_idx: GraphTensor,
+        attn_mask: GraphTensor,
+        k_cache: GraphTensor,
+        v_cache: GraphTensor,
+    ) -> (GraphTensor, GraphTensor, GraphTensor) {
+        let x_attn = self.attn_rms.forward(x);
+        let q = x_attn.matmul(self.q_proj.t());
+        let k = x_attn.matmul(self.k_proj.t());
+        let v = x_attn.matmul(self.v_proj.t());
+
+        let q_rope = llama_rotary_embeddings(q, q_pos);
+        let k_rope = llama_rotary_embeddings(k, q_pos);
+        let (attn_out, k_cache_out, v_cache_out) = attention(
+            q_rope,
+            k_rope,
+            v,
+            k_cache,
+            v_cache,
+            scatter_idx,
+            gather_idx,
+            attn_mask,
+        );
+        x += attn_out.matmul(self.o_proj.t());
+
+        let x_mlp = self.mlp_rms.forward(x);
+        let mlp_out =
+            (x_mlp.matmul(self.gate.t()).swish() * x_mlp.matmul(self.up.t())).matmul(self.down.t());
+        (x + mlp_out, k_cache_out, v_cache_out)
+    }
+}
+
+#[allow(clippy::too_many_arguments)]
+fn run_model_step(
+    cx: &mut Graph,
+    runtime: &mut MetalRuntime,
+    input: GraphTensor,
+    q_pos_t: GraphTensor,
+    scatter_idx_t: GraphTensor,
+    gather_idx_t: GraphTensor,
+    attn_mask_t: GraphTensor,
+    logits: GraphTensor,
+    kv_cache: &KVCache,
+    cache_outputs: &[(GraphTensor, GraphTensor)],
+    tokens: &[u32],
+    q_pos: &[i32],
+    scatter_idx: &[i32],
+    gather_idx: &[i32],
+    attn_mask: &[f32],
+) -> (Vec<f32>, StepProfile) {
+    let start = Instant::now();
+    cx.set_dim('s', tokens.len());
+    cx.set_dim('c', gather_idx.len());
+
+    runtime.set_data(input, tokens.iter().map(|t| *t as i32).collect::<Vec<_>>());
+    runtime.set_data(q_pos_t, q_pos.to_vec());
+    runtime.set_data(scatter_idx_t, scatter_idx.to_vec());
+    runtime.set_data(gather_idx_t, gather_idx.to_vec());
+    runtime.set_data(attn_mask_t, attn_mask.to_vec());
+    runtime.allocate_intermediate_buffers(&cx.dyn_map);
+
+    let execute_start = Instant::now();
+    runtime.execute(&cx.dyn_map);
+    let execute = execute_start.elapsed();
+
+    let logits_start = Instant::now();
+    let logits_data = runtime.get_f32(logits);
+    let get_logits = logits_start.elapsed();
+
+    let cache_start = Instant::now();
+    for (layer_idx, (k_out, v_out)) in cache_outputs.iter().enumerate() {
+        let k_buf = runtime.remove_buffer(*k_out);
+        let v_buf = runtime.remove_buffer(*v_out);
+        runtime.set_buffer(kv_cache.k_caches[layer_idx], k_buf);
+        runtime.set_buffer(kv_cache.v_caches[layer_idx], v_buf);
+    }
+    let cache_roundtrip = cache_start.elapsed();
+
+    (
+        logits_data,
+        StepProfile {
+            total: start.elapsed(),
+            execute,
+            get_logits,
+            cache_roundtrip,
+        },
+    )
+}
+
+fn main() -> Result<(), Box<dyn Error>> {
+    let _ = tracing_subscriber::registry()
+        .with(tracing_subscriber::fmt::layer())
+        .with(luminal_filter())
+        .try_init();
+
+    let model_dir = prepare_hf_model()?;
+    println!("Using model directory: {}", model_dir.display());
+
+    let tokenizer = Tokenizer::from_file(model_dir.join("tokenizer.json"))
+        .map_err(|err| err as Box<dyn Error>)?;
+    let prompt_tokens = tokenizer
+        .encode(llama3_chat_prompt(PROMPT), false)
+        .map_err(|err| err as Box<dyn Error>)?
+        .get_ids()
+        .to_vec();
+
+    let mut cx = Graph::default();
+    let input = cx.named_tensor("input", 's').as_dtype(DType::Int);
+    let q_pos_t = cx.named_tensor("q_pos", 's').as_dtype(DType::Int);
+    let scatter_idx_t = cx.named_tensor("scatter_idx", 's').as_dtype(DType::Int);
+    let gather_idx_t = cx.named_tensor("gather_idx", 'c').as_dtype(DType::Int);
+    let attn_mask_t = cx.named_tensor("attn_mask", ('s', 'c'));
+    let kv_cache = KVCache::new(&mut cx, MAX_SEQ_LEN);
+    let (logits, cache_outputs) = Llama::init(&mut cx).forward(
+        input,
+        q_pos_t,
+        scatter_idx_t,
+        gather_idx_t,
+        attn_mask_t,
+        &kv_cache,
+    );
+    let logits = logits.output();
+    for (k_out, v_out) in &cache_outputs {
+        k_out.output();
+        v_out.output();
+    }
+
+    cx.set_dim('s', 1);
+    cx.set_dim('c', 1);
+
+    println!("Building E-Graph...");
+    let egraph_start = Instant::now();
+    cx.build_search_space_with_options::<MetalRuntime>(
+        BuildSearchSpaceOptions::new().max_memory_mib(SEARCH_MEMORY_MIB),
+    );
+    println!(
+        "  E-Graph build: {:.2} s",
+        egraph_start.elapsed().as_secs_f64()
+    );
+
+    println!("Loading weights...");
+    let load_start = Instant::now();
+    let mut runtime = MetalRuntime::initialize(());
+    runtime.load_safetensors(&cx, model_dir.join("model.safetensors").to_str().unwrap());
+    println!("  Weight load: {:.2} s", load_start.elapsed().as_secs_f64());
+
+    let cache_bytes = MAX_SEQ_LEN * KV_DIM * std::mem::size_of::<f32>();
+    for i in 0..LAYERS {
+        runtime.set_zeros(kv_cache.k_caches[i], cache_bytes);
+        runtime.set_zeros(kv_cache.v_caches[i], cache_bytes);
+    }
+
+    println!("Compiling...");
+    let compile_start = Instant::now();
+    let max_prefill = (prompt_tokens.len() + 16)
+        .next_power_of_two()
+        .min(MAX_SEQ_LEN);
+    let max_context = (prompt_tokens.len() + GEN_TOKENS + 1)
+        .next_power_of_two()
+        .min(MAX_SEQ_LEN);
+    let search_s = 16.min(max_prefill).max(2);
+    let search_c = 16.min(max_context).max(2);
+    cx.set_dim_buckets(
+        's',
+        &[
+            DimBucket::new(1, 1),
+            DimBucket::new(2, max_prefill).representative(search_s),
+        ],
+    );
+    cx.set_dim_buckets(
+        'c',
+        &[
+            DimBucket::new(1, 1),
+            DimBucket::new(2, max_context).representative(search_c),
+        ],
+    );
+    cx.set_dim('s', search_s);
+    cx.set_dim('c', search_c);
+    runtime.set_data(input, vec![1; search_s]);
+    runtime.set_data(q_pos_t, (0..search_s as i32).collect::<Vec<_>>());
+    runtime.set_data(scatter_idx_t, (0..search_s as i32).collect::<Vec<_>>());
+    runtime.set_data(gather_idx_t, (0..search_c as i32).collect::<Vec<_>>());
+    runtime.set_data(attn_mask_t, vec![0.0f32; search_s * search_c]);
+    runtime = cx.search(runtime, SEARCH_GRAPHS);
+    println!(
+        "  Search/compile: {:.2} s",
+        compile_start.elapsed().as_secs_f64()
+    );
+
+    for i in 0..LAYERS {
+        runtime.set_zeros(kv_cache.k_caches[i], cache_bytes);
+        runtime.set_zeros(kv_cache.v_caches[i], cache_bytes);
+    }
+
+    let prompt_len = prompt_tokens.len();
+    let mut context_len = 0usize;
+    let mut profiles = Vec::new();
+    let mut seen_tokens = FxHashSet::default();
+    let repetition_penalty = 1.05;
+
+    println!(
+        "Prompt: {} tokens, generating up to {} tokens",
+        prompt_len, GEN_TOKENS
+    );
+
+    let mut generated = 0usize;
+    let mut next_token = None;
+    if GEN_TOKENS > 0 && prompt_len > 0 {
+        let positions: Vec<usize> = (0..prompt_len).collect();
+        let q_pos: Vec<i32> = positions.iter().map(|&p| p as i32).collect();
+        let mask = causal_mask(&positions, prompt_len);
+        let (logits_data, profile) = run_model_step(
+            &mut cx,
+            &mut runtime,
+            input,
+            q_pos_t,
+            scatter_idx_t,
+            gather_idx_t,
+            attn_mask_t,
+            logits,
+            &kv_cache,
+            &cache_outputs,
+            &prompt_tokens,
+            &q_pos,
+            &q_pos,
+            &q_pos,
+            &mask,
+        );
+        context_len = prompt_len;
+
+        let token = sample_greedy(
+            &logits_data[logits_data.len() - VOCAB_SIZE..],
+            &seen_tokens,
+            repetition_penalty,
+        );
+        seen_tokens.insert(token);
+        next_token = Some(token);
+        generated = 1;
+        profiles.push(profile);
+
+        if token != EOS_TOKEN && token != STOP_TOKEN {
+            print!(
+                "{}",
+                tokenizer
+                    .decode(&[token], true)
+                    .map_err(|err| err as Box<dyn Error>)?
+            );
+            std::io::stdout().flush()?;
+        }
+    }
+
+    while generated < GEN_TOKENS {
+        let current_token = match next_token {
+            Some(token) if token != EOS_TOKEN && token != STOP_TOKEN => token,
+            _ => break,
+        };
+        let gather_idx = (0..=context_len as i32).collect::<Vec<_>>();
+        let mask = causal_mask(&[context_len], context_len + 1);
+        let (logits_data, profile) = run_model_step(
+            &mut cx,
+            &mut runtime,
+            input,
+            q_pos_t,
+            scatter_idx_t,
+            gather_idx_t,
+            attn_mask_t,
+            logits,
+            &kv_cache,
+            &cache_outputs,
+            &[current_token],
+            &[context_len as i32],
+            &[context_len as i32],
+            &gather_idx,
+            &mask,
+        );
+        context_len += 1;
+
+        let token = sample_greedy(
+            &logits_data[logits_data.len() - VOCAB_SIZE..],
+            &seen_tokens,
+            repetition_penalty,
+        );
+        seen_tokens.insert(token);
+        next_token = Some(token);
+        generated += 1;
+        profiles.push(profile);
+
+        if token == EOS_TOKEN || token == STOP_TOKEN {
+            break;
+        }
+        print!(
+            "{}",
+            tokenizer
+                .decode(&[token], true)
+                .map_err(|err| err as Box<dyn Error>)?
+        );
+        std::io::stdout().flush()?;
+    }
+    println!();
+
+    let ttft = profiles.first().map(|p| p.total).unwrap_or_default();
+    let decode_steps = profiles.len().saturating_sub(1);
+    let decode_total: Duration = profiles.iter().skip(1).map(|p| p.total).sum();
+    println!("  TTFT: {:.2} ms", ttft.as_secs_f64() * 1e3);
+    println!("  TPOT: {:.2} ms", avg_ms(decode_total, decode_steps));
+
+    let execute_total: Duration = profiles.iter().map(|p| p.execute).sum();
+    let logits_total: Duration = profiles.iter().map(|p| p.get_logits).sum();
+    let cache_total: Duration = profiles.iter().map(|p| p.cache_roundtrip).sum();
+    println!(
+        "  Profile: n={}, exec={:.2} ms, logits={:.2} ms, cache={:.2} ms",
+        profiles.len(),
+        avg_ms(execute_total, profiles.len()),
+        avg_ms(logits_total, profiles.len()),
+        avg_ms(cache_total, profiles.len()),
+    );
+
+    Ok(())
+}
--- a/crates/luminal_metal/src/kernel/ops.rs
+++ b/crates/luminal_metal/src/kernel/ops.rs
@@ -19,7 +19,8 @@ use luminal::{
    shape::flatten_strides,
 };
 use metal::{
-    Buffer, CommandBufferRef, ComputeCommandEncoderRef, ComputePipelineState, Device, MTLSize,
+    Buffer, CommandBufferRef, ComputeCommandEncoderRef, ComputePipelineState, Device,
+    MTLLanguageVersion, MTLSize,
    foreign_types::{ForeignType, ForeignTypeRef},
    mps,
 };
@@ -56,15 +57,21 @@ pub type MetalOps = (
 );

 fn compile_shader(device: &Device, source: &str, function_name: &str) -> ComputePipelineState {
+    let options = metal::CompileOptions::new();
+    options.set_language_version(MTLLanguageVersion::V2_4);
    let library = device
-        .new_library_with_source(source, &metal::CompileOptions::new())
-        .expect("Failed to compile Metal shader");
+        .new_library_with_source(source, &options)
+        .unwrap_or_else(|err| {
+            panic!("Failed to compile Metal shader {function_name}: {err:?}\n{source}")
+        });
    let function = library
        .get_function(function_name, None)
        .expect("Failed to get function from library");
    device
        .new_compute_pipeline_state_with_function(&function)
-        .expect("Failed to create compute pipeline state")
+        .unwrap_or_else(|err| {
+            panic!("Failed to create Metal compute pipeline state for {function_name}: {err:?}\n{source}")
+        })
 }

 fn lower_dynamic_consts(mut code: String) -> String {
@@ -1039,42 +1046,33 @@ impl MetalKernelOp for MetalSumReduce {
                constant int *dyn [[buffer({dyn_buffer_index})]],
                constant uint &n_outputs [[buffer({n_outputs_index})]],
                uint gid [[threadgroup_position_in_grid]],
-                uint tid [[thread_index_in_threadgroup]],
-                uint simd_lane [[thread_index_in_simdgroup]],
-                uint simd_id [[simdgroup_index_in_threadgroup]]
+                uint tid [[thread_index_in_threadgroup]]
            ) {{
                if (gid >= n_outputs) return;

-                threadgroup float warp_sums[THREADS_PER_GROUP / 32];
+                threadgroup float partials[THREADS_PER_GROUP];

                int in_start = {in_idx};
                int iters = {iters};
                (void)dyn;

-                // Each thread accumulates multiple elements
                float sum = 0.0f;
                for (int i = tid; i < iters; i += THREADS_PER_GROUP) {{
                    sum += {in_val};
                }}

-                // Warp-level reduction using simd_sum
-                sum = simd_sum(sum);
-
-                // First lane of each warp writes to shared memory
-                if (simd_lane == 0) {{
-                    warp_sums[simd_id] = sum;
-                }}
+                partials[tid] = sum;
                threadgroup_barrier(mem_flags::mem_threadgroup);
-
-                // First warp does final reduction
-                if (simd_id == 0) {{
-                    int n_warps = THREADS_PER_GROUP / 32;
-                    float block_sum = (tid < uint(n_warps)) ? warp_sums[tid] : 0.0f;
-                    block_sum = simd_sum(block_sum);
-
-                    if (tid == 0) {{
-                        out[{out_idx}] = {out_val};
+                for (uint stride = THREADS_PER_GROUP / 2; stride > 0; stride >>= 1) {{
+                    if (tid < stride) {{
+                        partials[tid] += partials[tid + stride];
                    }}
+                    threadgroup_barrier(mem_flags::mem_threadgroup);
+                }}
+
+                if (tid == 0) {{
+                    float block_sum = partials[0];
+                    out[{out_idx}] = {out_val};
                }}
            }}
            "#,
@@ -1220,42 +1218,33 @@ impl MetalKernelOp for MetalMaxReduce {
                constant int *dyn [[buffer({dyn_buffer_index})]],
                constant uint &n_outputs [[buffer({n_outputs_index})]],
                uint gid [[threadgroup_position_in_grid]],
-                uint tid [[thread_index_in_threadgroup]],
-                uint simd_lane [[thread_index_in_simdgroup]],
-                uint simd_id [[simdgroup_index_in_threadgroup]]
+                uint tid [[thread_index_in_threadgroup]]
            ) {{
                if (gid >= n_outputs) return;

-                threadgroup float warp_maxs[THREADS_PER_GROUP / 32];
+                threadgroup float partials[THREADS_PER_GROUP];

                int in_start = {in_idx};
                int iters = {iters};
                (void)dyn;

-                // Each thread finds max of multiple elements
                float max_val = NEG_INF_F;
                for (int i = tid; i < iters; i += THREADS_PER_GROUP) {{
                    max_val = fmax(max_val, {in_val});
                }}

-                // Warp-level reduction using simd_max
-                max_val = simd_max(max_val);
-
-                // First lane of each warp writes to shared memory
-                if (simd_lane == 0) {{
-                    warp_maxs[simd_id] = max_val;
-                }}
+                partials[tid] = max_val;
                threadgroup_barrier(mem_flags::mem_threadgroup);
-
-                // First warp does final reduction
-                if (simd_id == 0) {{
-                    int n_warps = THREADS_PER_GROUP / 32;
-                    float block_max = (tid < uint(n_warps)) ? warp_maxs[tid] : NEG_INF_F;
-                    block_max = simd_max(block_max);
-
-                    if (tid == 0) {{
-                        out[{out_idx}] = {out_val};
+                for (uint stride = THREADS_PER_GROUP / 2; stride > 0; stride >>= 1) {{
+                    if (tid < stride) {{
+                        partials[tid] = fmax(partials[tid], partials[tid + stride]);
                    }}
+                    threadgroup_barrier(mem_flags::mem_threadgroup);
+                }}
+
+                if (tid == 0) {{
+                    float block_max = partials[0];
+                    out[{out_idx}] = {out_val};
                }}
            }}
            "#,
@@ -1427,8 +1416,6 @@ impl EgglogOp for MPSMatmul {
            let dt = v(format!("?{}_dt", name.replace('-', "_")));

            rule(union(sum_op.clone(), mps_op.clone()))
-                .subsume(sum_op.clone())
-                .subsume(mul_op)
                .set(dtype(mps_op), dt.clone())
                .fact(eq(dt, dtype(sum_op)))
                .ruleset("kernel_lower")
@@ -1464,6 +1451,17 @@ impl EgglogOp for MPSMatmul {
                1,
                1,
            ),
+            Rule::raw(
+                "(rule
+                    ((= ?mul (Op (MetalMul ?shape ?as ?bs ?os) ?inputs))
+                     (= ?sum (Op (MetalSum ?sshape ?sk ?ssi ?sks ?sso) (ICons ?mul (INil))))
+                     (= ?sum (MPSMatmul ?m ?n ?k ?lhs ?lhsrs ?rhs ?rhsrs ?ors ?tl ?tr)))
+                    ((delete (Op (MetalSum ?sshape ?sk ?ssi ?sks ?sso) (ICons ?mul (INil))))
+                     (delete (Op (MetalMul ?shape ?as ?bs ?os) ?inputs)))
+                    :ruleset cleanup
+                    :name \"delete-broadcast-mul-sum-when-mps-matmul-exists\"
+                )",
+            ),
        ]
    }

@@ -1839,8 +1837,6 @@ impl EgglogOp for MPSBatchedMatmul {
            let dt = v(format!("?{}_dt", name.replace('-', "_")));

            rule(union(sum_op.clone(), mps_op.clone()))
-                .subsume(sum_op.clone())
-                .subsume(mul_op)
                .set(dtype(mps_op), dt.clone())
                .fact(eq(dt, dtype(sum_op)))
                .ruleset("kernel_lower")
@@ -1878,6 +1874,17 @@ impl EgglogOp for MPSBatchedMatmul {
                ),
                1,
            ),
+            Rule::raw(
+                "(rule
+                    ((= ?mul (Op (MetalMul ?shape ?as ?bs ?os) ?inputs))
+                     (= ?sum (Op (MetalSum ?sshape ?sk ?ssi ?sks ?sso) (ICons ?mul (INil))))
+                     (= ?sum (MPSBatchedMatmul ?b ?m ?n ?k ?lhs ?lhsbs ?lhsrs ?rhs ?rhsbs ?rhsrs ?obs ?ors ?tl ?tr)))
+                    ((delete (Op (MetalSum ?sshape ?sk ?ssi ?sks ?sso) (ICons ?mul (INil))))
+                     (delete (Op (MetalMul ?shape ?as ?bs ?os) ?inputs)))
+                    :ruleset cleanup
+                    :name \"delete-broadcast-mul-sum-when-mps-batched-matmul-exists\"
+                )",
+            ),
        ]
    }

@@ -2163,24 +2170,6 @@ impl EgglogOp for GenericMatmul {
                    :name \"delete-broadcast-mul-sum-when-generic-matmul-exists\"
                )",
            ),
-            Rule::raw(
-                "(rule
-                    ((= ?sum (GenericMatmul ?go ?gm ?gk ?gl ?glas ?gr ?grs ?gsis ?gsit ?gos))
-                     (= ?sum (MPSMatmul ?mm ?mn ?mk ?ml ?mls ?mr ?mrs ?mos ?mtl ?mtr)))
-                    ((delete (GenericMatmul ?go ?gm ?gk ?gl ?glas ?gr ?grs ?gsis ?gsit ?gos)))
-                    :ruleset cleanup
-                    :name \"prefer-mps-over-generic-matmul\"
-                )",
-            ),
-            Rule::raw(
-                "(rule
-                    ((= ?sum (GenericMatmul ?go ?gm ?gk ?gl ?glas ?gr ?grs ?gsis ?gsit ?gos))
-                     (= ?sum (MPSBatchedMatmul ?bb ?bm ?bn ?bk ?bl ?blbs ?blrs ?br ?brbs ?brrs ?bobs ?bors ?btl ?btr)))
-                    ((delete (GenericMatmul ?go ?gm ?gk ?gl ?glas ?gr ?grs ?gsis ?gsit ?gos)))
-                    :ruleset cleanup
-                    :name \"prefer-mps-batched-over-generic-matmul\"
-                )",
-            ),
        ]
    }

@@ -2265,13 +2254,11 @@ impl MetalKernelOp for GenericMatmul {
                constant int *dyn [[buffer({dyn_buffer_index})]],
                constant uint &n_outputs [[buffer({n_outputs_index})]],
                uint gid [[threadgroup_position_in_grid]],
-                uint tid [[thread_index_in_threadgroup]],
-                uint simd_lane [[thread_index_in_simdgroup]],
-                uint simd_id [[simdgroup_index_in_threadgroup]]
+                uint tid [[thread_index_in_threadgroup]]
            ) {{
                if (gid >= n_outputs) return;

-                threadgroup float warp_sums[THREADS_PER_GROUP / 32];
+                threadgroup float partials[THREADS_PER_GROUP];
                int base_idx = {sum_base_idx};
                int iters = {iters};
                (void)dyn;
@@ -2282,19 +2269,18 @@ impl MetalKernelOp for GenericMatmul {
                    sum += ({lhs_val}) * ({rhs_val});
                }}

-                sum = simd_sum(sum);
-                if (simd_lane == 0) {{
-                    warp_sums[simd_id] = sum;
-                }}
+                partials[tid] = sum;
                threadgroup_barrier(mem_flags::mem_threadgroup);
-
-                if (simd_id == 0) {{
-                    int n_warps = THREADS_PER_GROUP / 32;
-                    float block_sum = (tid < uint(n_warps)) ? warp_sums[tid] : 0.0f;
-                    block_sum = simd_sum(block_sum);
-                    if (tid == 0) {{
-                        out[{out_idx}] = {out_val};
+                for (uint stride = THREADS_PER_GROUP / 2; stride > 0; stride >>= 1) {{
+                    if (tid < stride) {{
+                        partials[tid] += partials[tid + stride];
                    }}
+                    threadgroup_barrier(mem_flags::mem_threadgroup);
+                }}
+
+                if (tid == 0) {{
+                    float block_sum = partials[0];
+                    out[{out_idx}] = {out_val};
                }}
            }}
            "#,
--- a/crates/luminal_metal/src/lib.rs
+++ b/crates/luminal_metal/src/lib.rs
@@ -1,5 +1,6 @@
 pub mod dyn_backend;
 pub mod kernel;
+mod memory_analysis;
 pub mod runtime;

 #[cfg(test)]
--- a/crates/luminal_metal/src/memory_analysis.rs
+++ b/crates/luminal_metal/src/memory_analysis.rs
--- a/crates/luminal_metal/src/runtime.rs
+++ b/crates/luminal_metal/src/runtime.rs
@@ -3,6 +3,7 @@ use half::{bf16, f16};
 use itertools::Itertools;
 use luminal::{
    dtype::DType,
+    egglog_utils::SerializedEGraph,
    graph::{BucketLLIR, DimBucket, Graph, LLIRGraph},
    hlir::{Input, NativeData, Output},
    op::{ExecutionStats, Runtime, RuntimeStats, TimingMethod},
@@ -304,6 +305,26 @@ impl Runtime for MetalRuntime {
    type ExecReturn = ();
    type ProfileMetric = Duration;

+    fn late_egglog_passes(
+        ops: &[std::sync::Arc<Box<dyn luminal::op::EgglogOp>>],
+        options: &luminal::graph::BuildSearchSpaceOptions,
+        dyn_map: &FxHashMap<char, usize>,
+    ) -> Vec<luminal::egglog_utils::LateEgglogPass> {
+        vec![crate::memory_analysis::metal_memory_analysis_pass(
+            ops,
+            options.max_memory_bytes,
+            dyn_map,
+        )]
+    }
+
+    fn estimate_graph_memory<'a>(
+        egraph: &'a SerializedEGraph,
+        choices: &luminal::egglog_utils::EGraphChoiceSet<'a>,
+        dyn_map: &FxHashMap<char, usize>,
+    ) -> Option<usize> {
+        crate::memory_analysis::estimate_graph_memory_bytes(egraph, choices, dyn_map)
+    }
+
    fn initialize(_: Self::CompileArg) -> Self {
        let device = Device::system_default().expect("No Metal device found!");
        let command_queue = device.new_command_queue();
@@ -347,19 +368,25 @@ impl Runtime for MetalRuntime {
        llir_graph: &LLIRGraph,
        dyn_map: &FxHashMap<char, usize>,
        trials: usize,
-        _timeout: Option<std::time::Duration>,
+        timeout: Option<std::time::Duration>,
    ) -> (Self::ProfileMetric, String) {
        self.load_llir(llir_graph);
        self.allocate_intermediate_buffers(dyn_map);

        let trials = trials.max(1);
+        let profile_start = std::time::Instant::now();
        let mut duration = Duration::default();
+        let mut completed_trials = 0;
        for _ in 0..trials {
            let start = std::time::Instant::now();
            self.execute(dyn_map);
            duration += start.elapsed();
+            completed_trials += 1;
+            if timeout.is_some_and(|timeout| profile_start.elapsed() >= timeout) {
+                break;
+            }
        }
-        duration /= trials as u32;
+        duration /= completed_trials as u32;

        (duration, format!("{:.2?}", duration))
    }
@@ -449,6 +476,21 @@ impl Runtime for MetalRuntime {
        self.buffers.clear();
    }

+    fn intermediate_buffer_bytes(&self) -> usize {
+        self.buffers
+            .values()
+            .map(|buffer| buffer.length() as usize)
+            .sum()
+    }
+
+    fn planned_intermediate_buffer_bytes(&self) -> Option<usize> {
+        Some(self.intermediate_buffer_bytes())
+    }
+
+    fn allocated_intermediate_buffer_bytes(&self) -> Option<usize> {
+        Some(self.intermediate_buffer_bytes())
+    }
+
    fn load_llir_buckets(
        &mut self,
        dim_buckets: &FxHashMap<char, Vec<DimBucket>>,
--- a/crates/luminal_metal/src/tests.rs
+++ b/crates/luminal_metal/src/tests.rs
@@ -3,6 +3,7 @@ use candle_core::{Device as CandleDevice, Tensor as CandleTensor};
 use half::{bf16, f16};
 use luminal::prelude::*;
 use proptest::prelude::*;
+use rand::{SeedableRng, rngs::StdRng};
 use safetensors::{Dtype, tensor::TensorView};
 use std::{
    collections::HashMap,
@@ -38,6 +39,30 @@ fn bytes_of<T: bytemuck::NoUninit>(values: &[T]) -> Vec<u8> {
    bytemuck::cast_slice(values).to_vec()
 }

+fn search_candidates(cx: &mut Graph, rt: MetalRuntime, limit: usize) -> MetalRuntime {
+    let mut rng = StdRng::seed_from_u64(0);
+    cx.search_options(rt, SearchOptions::new(limit), &mut rng)
+}
+
+fn egraph_has_op(cx: &Graph, op_name: &str) -> bool {
+    cx.egraph()
+        .expect("search space should be built")
+        .enodes
+        .values()
+        .any(|(label, _)| label == op_name)
+}
+
+fn assert_matmul_options(cx: &Graph, mps_op_name: &str) {
+    assert!(
+        egraph_has_op(cx, mps_op_name),
+        "expected {mps_op_name} rewrite option in e-graph"
+    );
+    assert!(
+        egraph_has_op(cx, "GenericMatmul"),
+        "expected GenericMatmul rewrite option in e-graph"
+    );
+}
+
 fn write_test_safetensors(tensors: &[(&str, Dtype, Vec<usize>, Vec<u8>)]) -> PathBuf {
    let tensor_views: HashMap<String, TensorView<'_>> = tensors
        .iter()
@@ -401,6 +426,18 @@ proptest! {
    }
 }

+#[test]
+fn metal_build_search_space_accepts_memory_budget() {
+    let mut cx = Graph::default();
+    let a = cx.tensor(4);
+    let b = cx.tensor(4);
+    (a * b).output();
+
+    cx.build_search_space_with_options::<MetalRuntime>(
+        BuildSearchSpaceOptions::new().max_memory_mib(1),
+    );
+}
+
 /// Simple deterministic test for add
 #[test]
 fn metal_simple_add() {
@@ -665,7 +702,7 @@ fn metal_specialized_matmul() {

    rt.set_data(a, &a_data);
    rt.set_data(b, &b_data);
-    rt = cx.search(rt, 1);
+    rt = search_candidates(&mut cx, rt, 32);
    assert!(
        rt.contains_matmul(),
        "expected Metal runtime to fuse matmul, kernels: {:?}",
@@ -698,6 +735,7 @@ fn metal_regular_tiled_matmul_path() {
    let output = a.matmul(b).output();

    cx.build_search_space::<MetalRuntime>();
+    assert_matmul_options(&cx, "MPSMatmul");
    let mut rt = MetalRuntime::initialize(());

    let a_data = seeded_data(m * k, 0.4, -0.2);
@@ -705,19 +743,7 @@ fn metal_regular_tiled_matmul_path() {

    rt.set_data(a, &a_data);
    rt.set_data(b, &b_data);
-    rt = cx.search(rt, 1);
-
-    let kernels = rt.debug_kernel_ops();
-    assert!(
-        kernels.iter().any(|k| k.contains("MPSMatmul")),
-        "expected MPS matmul path, kernels: {:?}",
-        kernels
-    );
-    assert!(
-        !kernels.iter().any(|k| k.contains("GenericMatmul")),
-        "MPS-compatible matmul should not extract the generic fallback, kernels: {:?}",
-        kernels
-    );
+    rt = search_candidates(&mut cx, rt, 32);

    rt.allocate_intermediate_buffers(&cx.dyn_map);
    rt.execute(&cx.dyn_map);
@@ -744,6 +770,7 @@ fn metal_mps_matmul_transposed_rhs_weight_layout() {
    let output = a.matmul(weight.t()).output();

    cx.build_search_space::<MetalRuntime>();
+    assert_matmul_options(&cx, "MPSMatmul");
    let mut rt = MetalRuntime::initialize(());

    let a_data = seeded_data(m * k, 0.35, -0.17);
@@ -751,14 +778,7 @@ fn metal_mps_matmul_transposed_rhs_weight_layout() {

    rt.set_data(a, &a_data);
    rt.set_data(weight, &weight_data);
-    rt = cx.search(rt, 1);
-
-    let kernels = rt.debug_kernel_ops();
-    assert!(
-        kernels.iter().any(|k| k.contains("transpose_rhs: true")),
-        "expected MPS matmul to cover transposed row-major RHS, kernels: {:?}",
-        kernels
-    );
+    rt = search_candidates(&mut cx, rt, 32);

    rt.allocate_intermediate_buffers(&cx.dyn_map);
    rt.execute(&cx.dyn_map);
@@ -785,6 +805,7 @@ fn metal_mps_matmul_transposed_lhs_layout() {
    let output = lhs_storage.t().matmul(rhs).output();

    cx.build_search_space::<MetalRuntime>();
+    assert_matmul_options(&cx, "MPSMatmul");
    let mut rt = MetalRuntime::initialize(());

    let lhs_data = seeded_data(k * m, 0.31, -0.12);
@@ -792,14 +813,7 @@ fn metal_mps_matmul_transposed_lhs_layout() {

    rt.set_data(lhs_storage, &lhs_data);
    rt.set_data(rhs, &rhs_data);
-    rt = cx.search(rt, 1);
-
-    let kernels = rt.debug_kernel_ops();
-    assert!(
-        kernels.iter().any(|k| k.contains("transpose_lhs: true")),
-        "expected MPS matmul to cover transposed row-major LHS, kernels: {:?}",
-        kernels
-    );
+    rt = search_candidates(&mut cx, rt, 32);

    rt.allocate_intermediate_buffers(&cx.dyn_map);
    rt.execute(&cx.dyn_map);
@@ -830,20 +844,14 @@ fn metal_mps_batched_matmul_row_row_layout() {
    let output = a.matmul(b).output();

    cx.build_search_space::<MetalRuntime>();
+    assert_matmul_options(&cx, "MPSBatchedMatmul");
    let mut rt = MetalRuntime::initialize(());

    let a_data = seeded_data(batch * m * k, 0.17, -0.08);
    let b_data = seeded_data(batch * k * n, 0.11, -0.05);
    rt.set_data(a, &a_data);
    rt.set_data(b, &b_data);
-    rt = cx.search(rt, 1);
-
-    let kernels = rt.debug_kernel_ops();
-    assert!(
-        kernels.iter().any(|k| k.contains("MPSBatchedMatmul")),
-        "expected MPS batched matmul path, kernels: {:?}",
-        kernels
-    );
+    rt = search_candidates(&mut cx, rt, 32);

    rt.allocate_intermediate_buffers(&cx.dyn_map);
    rt.execute(&cx.dyn_map);
@@ -880,13 +888,17 @@ fn metal_generic_matmul_covers_noncontiguous_merged_head_projection() {
    let output = merged.matmul(weight.t()).output();

    cx.build_search_space::<MetalRuntime>();
+    assert!(
+        egraph_has_op(&cx, "GenericMatmul"),
+        "expected GenericMatmul rewrite option in e-graph"
+    );
    let mut rt = MetalRuntime::initialize(());

    let attn_data = seeded_data(heads * seq * head_dim, 0.19, -0.09);
    let weight_data = seeded_data(out_dim * hidden, 0.14, -0.06);
    rt.set_data(attn, &attn_data);
    rt.set_data(weight, &weight_data);
-    rt = cx.search(rt, 1);
+    rt = search_candidates(&mut cx, rt, 32);

    let kernels = rt.debug_kernel_ops();
    assert!(
@@ -935,22 +947,14 @@ fn metal_mps_batched_matmul_transposed_rhs_layout() {
    let output = a.matmul(weight.permute((0, 2, 1))).output();

    cx.build_search_space::<MetalRuntime>();
+    assert_matmul_options(&cx, "MPSBatchedMatmul");
    let mut rt = MetalRuntime::initialize(());

    let a_data = seeded_data(batch * m * k, 0.13, -0.06);
    let weight_data = seeded_data(batch * n * k, 0.09, -0.04);
    rt.set_data(a, &a_data);
    rt.set_data(weight, &weight_data);
-    rt = cx.search(rt, 1);
-
-    let kernels = rt.debug_kernel_ops();
-    assert!(
-        kernels
-            .iter()
-            .any(|k| k.contains("MPSBatchedMatmul") && k.contains("transpose_rhs: true")),
-        "expected MPS batched matmul transposed RHS path, kernels: {:?}",
-        kernels
-    );
+    rt = search_candidates(&mut cx, rt, 32);

    rt.allocate_intermediate_buffers(&cx.dyn_map);
    rt.execute(&cx.dyn_map);
@@ -984,6 +988,7 @@ fn metal_mps_matmul_f16_transposed_rhs_weight_layout() {
    let output = a.matmul(weight.t()).cast(DType::F32).output();

    cx.build_search_space::<MetalRuntime>();
+    assert_matmul_options(&cx, "MPSMatmul");
    let mut rt = MetalRuntime::initialize(());

    let a_data = seeded_data(m * k, 0.22, -0.07);
@@ -991,14 +996,7 @@ fn metal_mps_matmul_f16_transposed_rhs_weight_layout() {

    rt.set_data(a, to_f16_vec(&a_data));
    rt.set_data(weight, to_f16_vec(&weight_data));
-    rt = cx.search(rt, 1);
-
-    let kernels = rt.debug_kernel_ops();
-    assert!(
-        kernels.iter().any(|k| k.contains("transpose_rhs: true")),
-        "expected MPS F16 matmul to cover transposed row-major RHS, kernels: {:?}",
-        kernels
-    );
+    rt = search_candidates(&mut cx, rt, 32);

    rt.allocate_intermediate_buffers(&cx.dyn_map);
    rt.execute(&cx.dyn_map);
--- a/crates/luminal_python/rust/src/compiled_graph.rs
+++ b/crates/luminal_python/rust/src/compiled_graph.rs
@@ -439,6 +439,77 @@ impl CompiledGraph {
        Ok(())
    }

+    /// Batched fast path: register all input + output device pointers,
+    /// run the graph, and return zero-copy flags for each output — all
+    /// in one PyO3 boundary crossing instead of ~7 separate FFI calls.
+    ///
+    /// For DLRM-sized graphs (batch=2048) the per-iter cost outside the
+    /// actual cuGraphLaunch is dominated by Python-Rust crossings; this
+    /// collapses them. Inputs and outputs come in name-matched order;
+    /// any name miss errors out instead of silently ignoring.
+    ///
+    /// `output_ptrs` may be `None` for the CPU / native path that
+    /// doesn't pre-register output buffers.
+    ///
+    /// Returns `Vec<bool>` aligned with `output_ptrs` — `true` when
+    /// the registered buffer received the writeback, `false` when an
+    /// aliased-output fallback DtoD copy is needed.
+    fn run_with_ptrs(
+        &mut self,
+        input_ptrs: Vec<(String, u64, usize)>,
+        output_ptrs: Option<Vec<(String, u64, usize)>>,
+    ) -> PyResult<Vec<bool>> {
+        if !self.runtime.supports_device_ptrs() {
+            return Err(pyo3::exceptions::PyValueError::new_err(
+                "run_with_ptrs requires a GPU backend",
+            ));
+        }
+        // Resolve all names once up front so a typo errors before any
+        // pointer registration mutates runtime state.
+        let in_nodes: Vec<(NodeIndex, u64, usize)> = input_ptrs
+            .into_iter()
+            .map(|(name, ptr, n)| {
+                self.tensor_ids
+                    .get(&name)
+                    .copied()
+                    .map(|nid| (nid, ptr, n))
+                    .ok_or_else(|| {
+                        PyErr::new::<pyo3::exceptions::PyKeyError, _>(format!(
+                            "Unknown input tensor: {name}"
+                        ))
+                    })
+            })
+            .collect::<PyResult<_>>()?;
+        let out_nodes: Vec<(NodeIndex, u64, usize)> = if let Some(outs) = output_ptrs {
+            outs.into_iter()
+                .map(|(name, ptr, n)| {
+                    self.tensor_ids
+                        .get(&name)
+                        .copied()
+                        .map(|nid| (nid, ptr, n))
+                        .ok_or_else(|| {
+                            PyErr::new::<pyo3::exceptions::PyKeyError, _>(format!(
+                                "Unknown output tensor: {name}"
+                            ))
+                        })
+                })
+                .collect::<PyResult<_>>()?
+        } else {
+            Vec::new()
+        };
+        for (nid, ptr, n) in &in_nodes {
+            unsafe { self.runtime.set_device_ptr(*nid, *ptr, *n) };
+        }
+        for (nid, ptr, n) in &out_nodes {
+            unsafe { self.runtime.set_output_device_ptr(*nid, *ptr, *n) };
+        }
+        self.runtime.execute(&self.graph.dyn_map);
+        Ok(out_nodes
+            .iter()
+            .map(|(nid, _, _)| self.runtime.output_is_zero_copy(*nid))
+            .collect())
+    }
+
    /// Check whether an output tensor was zero-copied (written directly to the registered pointer).
    /// Returns false for aliased outputs that need a fallback DtoD copy, or if no GPU backend.
    /// Must be called after run().
--- a/crates/luminal_python/rust/src/translator/dispatch.rs
+++ b/crates/luminal_python/rust/src/translator/dispatch.rs
@@ -119,30 +119,118 @@ impl<'a> Translator<'a> {
            "torch.ops.aten.argsort.default" => self.translate_argsort(node)?,

            // Matmul
-            "torch.ops.aten.mm.default" | "torch.ops.aten.bmm.default" => {
+            "torch.ops.aten.mm.default" => {
                let a = self.get_input_tensor(node, 0)?;
                let b = self.get_input_tensor(node, 1)?;
                let (a, b) = ensure_same_dtype(a, b);
                a.matmul(b)
            }
+            // bmm: batched 3-D matmul. Fast path under cuda + F32 when
+            // B was produced by a permute([0, 2, 1]) (i.e. `T @ T.T` —
+            // the DLRM pairwise-interaction pattern): route to
+            // `matmul_3d_t` with the original (B, F, D) tensor, which
+            // uses the fused Matmul2DKernel and avoids the
+            // expand+mul+sum-reduce decomposition that produces ~25
+            // small kernels per bmm.
+            "torch.ops.aten.bmm.default" => {
+                let a = self.get_input_tensor(node, 0)?;
+                let b = self.get_input_tensor(node, 1)?;
+
+                let b_src = node.inputs.get(1).and_then(|n| n.arg.as_tensor_name())
+                    .and_then(|n| self.transpose_2d_source.get(n).cloned());
+
+                let f32_all = a.dtype == DType::F32 && b.dtype == DType::F32;
+                let backend_is_cuda = std::env::var("LUMINAL_BACKEND_CUDA")
+                    .map(|v| v == "1")
+                    .unwrap_or(false);
+                if cfg!(feature = "cuda")
+                    && backend_is_cuda
+                    && f32_all
+                    && a.shape.dims.len() == 3
+                    && b.shape.dims.len() == 3
+                    && let Some(orig_name) = b_src
+                    && let Some(orig_b) = self.tensors.get(&orig_name).copied()
+                    && orig_b.shape.dims.len() == 3
+                {
+                    // a: (B, M, K), orig_b: (B, N, K) — matmul_3d_t does
+                    // a @ orig_b.t() = (B, M, K) @ (B, K, N) = (B, M, N).
+                    luminal_cuda_lite::kernel::matmul_3d_t(a, orig_b)
+                } else {
+                    let (a, b) = ensure_same_dtype(a, b);
+                    a.matmul(b)
+                }
+            }

            // addmm: beta*input + alpha*(mat1 @ mat2)
+            //
+            // Fast path (CUDA, the common nn.Linear case): when
+            //   * shapes are 2-D F32, alpha=beta=1
+            //   * mat2 was produced by `aten.permute([1,0])` of a 2-D
+            //     tensor (`weight.t()` from nn.Linear)
+            //   * bias is 1-D
+            // we lower to the fused `linear_bias` kernel using the
+            // *original* (N,K) weight — bypassing the
+            // expand+mul+sum-reduce decomposition that otherwise
+            // produces ~25 small kernels per Linear layer (~3.7 ms on
+            // tiny shapes due to launch overhead).
+            //
+            // The transpose detection comes from
+            // `translate_permute`, which populates
+            // `transpose_2d_source` whenever it sees a 2-D permute.
            "torch.ops.aten.addmm.default" => {
                let input = self.get_input_tensor(node, 0)?;
                let mat1 = self.get_input_tensor(node, 1)?;
                let mat2 = self.get_input_tensor(node, 2)?;
                let beta = self.get_float_arg(node, 3).unwrap_or(1.0) as f32;
                let alpha = self.get_float_arg(node, 4).unwrap_or(1.0) as f32;
-                let mm = mat1.matmul(mat2);
-                let (input, mm) = broadcast_binary(input, mm);
-                input * beta + mm * alpha
+
+                let mat2_src = node.inputs.get(2).and_then(|n| n.arg.as_tensor_name())
+                    .and_then(|n| self.transpose_2d_source.get(n).cloned());
+
+                let unit_scale = (alpha - 1.0).abs() < 1e-7 && (beta - 1.0).abs() < 1e-7;
+                let f32_all = mat1.dtype == DType::F32
+                    && mat2.dtype == DType::F32
+                    && input.dtype == DType::F32;
+                let two_d = mat1.shape.dims.len() == 2 && mat2.shape.dims.len() == 2;
+                let bias_is_1d = input.shape.dims.len() == 1;
+
+                // Runtime CUDA check: the fast path emits
+                // `Matmul2DCustom` which is a cuda_lite kernel — the
+                // native CPU backend's load_llir doesn't know how to
+                // handle it. Python sets `LUMINAL_BACKEND_CUDA=1`
+                // before each compile when the chosen backend is cuda.
+                let backend_is_cuda = std::env::var("LUMINAL_BACKEND_CUDA")
+                    .map(|v| v == "1")
+                    .unwrap_or(false);
+                if cfg!(feature = "cuda")
+                    && backend_is_cuda
+                    && two_d
+                    && f32_all
+                    && unit_scale
+                    && bias_is_1d
+                    && let Some(weight_name) = mat2_src
+                    && let Some(orig_weight) = self.tensors.get(&weight_name).copied()
+                    && orig_weight.shape.dims.len() == 2
+                {
+                    // `orig_weight` has shape (N, K) — the original
+                    // nn.Linear weight before PyTorch's `.t()`.
+                    // `linear_bias(a, b, bias)` computes `a @ b.t() +
+                    // bias` with the fused Matmul2DKernel.
+                    luminal_cuda_lite::kernel::linear_bias(mat1, orig_weight, input)
+                } else {
+                    // Generic fallback (non-cuda, scaled, or unknown
+                    // mat2 source).
+                    let mm = mat1.matmul(mat2);
+                    let (input, mm) = broadcast_binary(input, mm);
+                    input * beta + mm * alpha
+                }
            }

            // Convolution
            "torch.ops.aten.convolution.default" => self.translate_conv(node)?,

            // Reduction ops
-            "torch.ops.aten.sum.dim_IntList" => self.translate_reduction(node, ReductionOp::Sum)?,
+            "torch.ops.aten.sum.dim_IntList" => self.translate_sum_with_embbag_peephole(node)?,
            "torch.ops.aten.mean.dim" => self.translate_reduction(node, ReductionOp::Mean)?,
            "torch.ops.aten.amax.default" => self.translate_reduction(node, ReductionOp::Max)?,

@@ -151,10 +239,22 @@ impl<'a> Translator<'a> {
            "torch.ops.aten.select.int" => self.translate_select(node)?,
            "torch.ops.aten.cat.default" => self.translate_cat(node)?,
            "torch.ops.aten.index.Tensor" => self.translate_index_tensor(node)?,
+            "torch.ops.aten.index_select.default" => self.translate_index_select(node)?,

            // Embedding
            "torch.ops.aten.embedding.default" => self.translate_embedding(node)?,

+            // EmbeddingBag (sum-pool, fixed bag size). PT2 export decomposes
+            // nn.EmbeddingBag → _embedding_bag_forward_only which returns a
+            // 4-tuple (output, offset2bag, bag_size, max_indices). Only the
+            // first tuple slot is used at inference; the translator returns
+            // early after inserting that slot's tensor and we skip the
+            // single-output `tensors.insert` at the end of the dispatch.
+            "torch.ops.aten._embedding_bag_forward_only.default" => {
+                self.translate_embedding_bag_forward_only(node)?;
+                return Ok(());
+            }
+
            // Softmax
            "torch.ops.aten._softmax.default" => {
                let a = self.get_input_tensor(node, 0)?;
@@ -514,6 +614,51 @@ impl<'a> Translator<'a> {
        };

        if !output_name.is_empty() {
+            // Record the chain (FX target + first input name) keyed by
+            // output name so multi-node peepholes (e.g. the EmbBag fast
+            // path that detects sum ← view ← index_select) can walk
+            // back without re-scanning all parsed nodes.
+            //
+            // For variadic ops (e.g. `aten.cat.default` whose first
+            // arg is `as_tensors`) fall back to the first entry of the
+            // variadic tensor list. The DLRM PairwiseDot peephole
+            // needs `node_chain[cat]` to walk back from `bmm → cat`.
+            let first_input_name: Option<String> = node
+                .inputs
+                .first()
+                .and_then(|i| {
+                    i.arg
+                        .as_tensor_name()
+                        .map(|s| s.to_string())
+                        .or_else(|| {
+                            i.arg
+                                .as_tensors()
+                                .and_then(|ts| ts.first().map(|tn| tn.name.clone()))
+                        })
+                });
+            if let Some(first_input) = first_input_name {
+                self.node_chain.insert(
+                    output_name.clone(),
+                    (node.target.clone(), first_input),
+                );
+            }
+            // Also record the full input-name list (in order, including
+            // entries that come from `as_tensors` for variadic ops like
+            // `aten.cat`). Used by the DLRM PairwiseDot peephole which
+            // needs all cat inputs and both bmm inputs.
+            let mut all_inputs: Vec<String> = Vec::new();
+            for inp in &node.inputs {
+                if let Some(names) = inp.arg.as_tensors() {
+                    for tn in names {
+                        all_inputs.push(tn.name.clone());
+                    }
+                } else if let Some(name) = inp.arg.as_tensor_name() {
+                    all_inputs.push(name.to_string());
+                }
+            }
+            if !all_inputs.is_empty() {
+                self.op_inputs.insert(output_name.clone(), all_inputs);
+            }
            self.tensors.insert(output_name, result);
        }
        Ok(())
@@ -521,6 +666,71 @@ impl<'a> Translator<'a> {
 }

 impl<'a> Translator<'a> {
+    /// Peephole for the DLRM-v3 embedding-bag pattern:
+    ///   `sum(dim=[1], keepdim=False)( view([?, L, D])( index_select(W, 0, IDX) ) )`
+    /// substitutes the fused `embedding_bag_sum_kernel(W, IDX.view(?, L))`
+    /// — same kernel as the hand-rolled DLRM example uses. Falls back to
+    /// the generic reduction path when the chain doesn't match.
+    pub(crate) fn translate_sum_with_embbag_peephole(
+        &mut self,
+        node: &Node,
+    ) -> Result<GraphTensor> {
+        let dims = self.get_ints_arg(node, 1).unwrap_or_default();
+        let keepdim = self.get_bool_arg(node, 2).unwrap_or(false);
+        let backend_is_cuda = std::env::var("LUMINAL_BACKEND_CUDA")
+            .map(|v| v == "1")
+            .unwrap_or(false);
+
+        // Only attempt fast-path under cuda + the specific sum(dim=[1]) pattern.
+        if cfg!(feature = "cuda")
+            && backend_is_cuda
+            && dims.len() == 1
+            && dims[0] == 1
+            && !keepdim
+            && let Some(sum_input_name) = node.inputs.first().and_then(|i| i.arg.as_tensor_name())
+            && let Some((view_target, view_src)) = self.node_chain.get(sum_input_name).cloned()
+            && view_target == "torch.ops.aten.view.default"
+            && let Some((is_target, is_src)) = self.node_chain.get(&view_src).cloned()
+            && is_target == "torch.ops.aten.index_select.default"
+            // Pull the FX index_select node so we can grab its dim + index args.
+            && let Some(is_node) = self.parsed.program.graph_module.graph.nodes.iter()
+                .find(|n| n.outputs.first()
+                    .and_then(|o| o.as_tensor.as_ref())
+                    .map(|t| t.name == view_src)
+                    .unwrap_or(false))
+        {
+            let weight = self.tensors.get(&is_src).copied();
+            let idx_name = is_node.inputs.get(2).and_then(|i| i.arg.as_tensor_name());
+            let is_dim = self.get_int_arg(is_node, 1).unwrap_or(-1);
+            let in_tensor = self.tensors.get(sum_input_name).copied();
+            if is_dim == 0
+                && let Some(w) = weight
+                && let Some(idx_n) = idx_name
+                && let Some(idx) = self.tensors.get(idx_n).copied()
+                && let Some(inp) = in_tensor
+                && w.shape.dims.len() == 2
+                && idx.shape.dims.len() == 1
+                && inp.shape.dims.len() == 3
+                // ensure view's middle dim == idx's bag dim divides idx total
+                && inp.dtype == DType::F32
+                && w.dtype == DType::F32
+            {
+                let l = inp.shape.dims[1];
+                let kb = inp.shape.dims[0];
+                let d = inp.shape.dims[2];
+                // Reshape flat indices (K*B*L,) to (K*B, L).
+                let idx_2d = reshape_tensor(idx, vec![kb, l]);
+                // embedding_bag_sum_kernel expects (n_emb, d) weights +
+                // (batch, bag) indices, returns (batch, d).
+                let _ = d; // d already encoded in `w.shape.dims[1]`
+                return Ok(luminal_cuda_lite::kernel::embedding_bag_sum_kernel(w, idx_2d));
+            }
+        }
+
+        // Generic fallback.
+        self.translate_reduction(node, ReductionOp::Sum)
+    }
+
    fn translate_scalar_comparison(
        &mut self,
        node: &Node,
--- a/crates/luminal_python/rust/src/translator/mod.rs
+++ b/crates/luminal_python/rust/src/translator/mod.rs
@@ -51,6 +51,27 @@ pub(crate) struct Translator<'a> {
    pub(crate) output_ids: Vec<(String, NodeIndex)>,
    /// Extra tensor metadata from inlined subgraphs.
    pub(crate) extra_tensor_values: HashMap<String, TensorMeta>,
+    /// Peephole: maps an output-tensor name produced by a `permute([1,0])`
+    /// (i.e. a 2-D transpose) back to its input-tensor name. Used by the
+    /// addmm dispatch to detect `aten.addmm(bias, x, weight.t())` and
+    /// route it through the fused `Matmul2DKernel` (`matmul_2d_t`) with
+    /// the original weight, instead of through the generic
+    /// expand+mul+sum decomposition that materializes ~25 small kernels.
+    pub(crate) transpose_2d_source: HashMap<String, String>,
+    /// Trace each emitted node by its first output's name → (FX target,
+    /// first input's name). Used by the EmbBag peephole to walk back
+    /// `sum.dim_IntList → aten.view.default → aten.index_select.default`
+    /// and substitute the fused `embedding_bag_sum_kernel` for the slow
+    /// expand+gather decomposition. Populated by `record_node_chain`
+    /// after dispatching each op.
+    pub(crate) node_chain: HashMap<String, (String, String)>,
+    /// Per-node side table mapping the primary output name → list of all
+    /// input tensor names (in order). Lets multi-input peepholes — e.g.
+    /// `index.Tensor(bmm(cat([…]), permute(cat([…]))), [None, li, lj])`
+    /// → `dlrm_pairwise_dot_lower_tri([…])` — walk back through cat and
+    /// bmm without re-scanning the FX node array. Populated alongside
+    /// `node_chain` after each translated op.
+    pub(crate) op_inputs: HashMap<String, Vec<String>>,
 }

 impl<'a> Translator<'a> {
@@ -61,6 +82,9 @@ impl<'a> Translator<'a> {
            graph: Graph::new(),
            tensors: HashMap::new(),
            sym_map,
+            transpose_2d_source: HashMap::new(),
+            node_chain: HashMap::new(),
+            op_inputs: HashMap::new(),
            user_input_ids: Vec::new(),
            output_ids: Vec::new(),
            extra_tensor_values: HashMap::new(),
@@ -90,9 +114,64 @@ impl<'a> Translator<'a> {
            self.output_ids.push((name.clone(), tensor.id));
        }

+        // Post-translation dead-code elimination. luminal's egglog DOES
+        // prune unreachable subgraphs in the common case (e.g. an unused
+        // `x*2.0` next to a returned `x+1.0`), but in some patterns the
+        // optimizer holds onto subgraphs that were created and then
+        // superseded by a translator peephole — most notably the DLRM
+        // PairwiseDot path where `index.Tensor(bmm(cat(...), perm(cat(...))), ...)`
+        // is replaced with a fused custom op but the original bmm/cat
+        // pad-and-add chain remains in the HLIR. Walk back from every
+        // `Output` HLIR node, mark reachable producers, and drop the rest.
+        // Preserves `Input` nodes unconditionally so the runtime's input
+        // signature stays intact even when an input is unused (a few
+        // models pass dead constants alongside live tensors).
+        self.dce();
+
        Ok(())
    }

+    /// Sweep the HLIR graph: remove every node not reachable backward
+    /// from an `Output` HLIR sink. Inputs are kept regardless so the
+    /// runtime input contract is preserved.
+    fn dce(&mut self) {
+        use luminal::hlir::{Input, Output};
+        use petgraph::Direction;
+        use std::collections::HashSet;
+
+        let mut keep: HashSet<NodeIndex> = HashSet::new();
+        let mut stack: Vec<NodeIndex> = Vec::new();
+        let node_ids: Vec<NodeIndex> = self.graph.graph.node_indices().collect();
+        for n in &node_ids {
+            if self.graph.try_get_op::<Output>(*n).is_some() {
+                if keep.insert(*n) {
+                    stack.push(*n);
+                }
+            }
+            if self.graph.try_get_op::<Input>(*n).is_some() {
+                keep.insert(*n);
+            }
+        }
+        while let Some(n) = stack.pop() {
+            // Walk incoming edges — operands of `n`.
+            let preds: Vec<NodeIndex> = self
+                .graph
+                .graph
+                .neighbors_directed(n, Direction::Incoming)
+                .collect();
+            for pred in preds {
+                if keep.insert(pred) {
+                    stack.push(pred);
+                }
+            }
+        }
+        for n in node_ids {
+            if !keep.contains(&n) {
+                self.graph.graph.remove_node(n);
+            }
+        }
+    }
+
    fn create_inputs(&mut self) -> Result<()> {
        let inputs = self.parsed.classify_inputs();
        for input in &inputs {
--- a/crates/luminal_python/rust/src/translator/movement.rs
+++ b/crates/luminal_python/rust/src/translator/movement.rs
@@ -72,6 +72,24 @@ impl<'a> Translator<'a> {
            .iter()
            .map(|&d| normalize_dim(d, a.shape.len()))
            .collect();
+        // Record matmul-compatible inner-axis transposes so addmm /
+        // bmm can route them through the fused Matmul2DKernel /
+        // matmul_3d_t with the *original* input. The view-transposed
+        // tensor has non-contiguous strides that the SGEMM kernel
+        // doesn't honor, so we need the original. We recognize two
+        // patterns:
+        //   * 2-D permute [1, 0] — `weight.t()` from nn.Linear
+        //   * 3-D permute [0, 2, 1] — `T.transpose(1, 2)` for bmm
+        let is_inner_transpose = (axes == [1usize, 0usize] && a.shape.dims.len() == 2)
+            || (axes == [0usize, 2usize, 1usize] && a.shape.dims.len() == 3);
+        if is_inner_transpose
+            && let Some(src_name) = node.inputs.first().and_then(|i| i.arg.as_tensor_name())
+            && let Some(out_ref) = node.outputs.first()
+            && let Some(out_t) = out_ref.as_tensor.as_ref()
+        {
+            self.transpose_2d_source
+                .insert(out_t.name.clone(), src_name.to_string());
+        }
        Ok(a.permute(axes))
    }

@@ -256,7 +274,206 @@ impl<'a> Translator<'a> {
        Ok(weight.gather(ids_expanded + arange_expanded))
    }

+    /// `aten.index_select(input, dim, index)` — pick rows/slices of `input`
+    /// along `dim` using a 1-D `index` tensor. Output shape is
+    /// `input.shape` with `dim` replaced by `index.shape[0]`.
+    ///
+    /// For the DLRM v3 use case this is `index_select(emb_weight, 0,
+    /// flat_indices)` — a 2-D source and 1-D index along dim 0. We lower
+    /// it the same way `translate_embedding` does: build a flat-rows
+    /// gather index `(index * hidden_dim) + arange(hidden_dim)` and read
+    /// the flattened weight in one pass. Higher-rank sources and non-zero
+    /// `dim` are not yet wired (would need stride math over Expression
+    /// shapes); they error out cleanly so they're easy to add when the
+    /// next model surfaces them.
+    pub(crate) fn translate_index_select(&mut self, node: &Node) -> Result<GraphTensor> {
+        let source = self.get_input_tensor(node, 0)?;
+        let dim_raw = self.get_int_arg(node, 1)?;
+        let index = self.get_input_tensor(node, 2)?;
+
+        let rank = source.shape.dims.len();
+        anyhow::ensure!(
+            rank == 2,
+            "translate_index_select: only 2-D source supported (got rank {rank}); \
+             extend this when a model needs higher rank."
+        );
+        let dim = if dim_raw < 0 {
+            dim_raw + rank as i64
+        } else {
+            dim_raw
+        };
+        anyhow::ensure!(
+            dim == 0,
+            "translate_index_select: only dim=0 supported (got {dim}); \
+             extend this when a model needs another axis."
+        );
+        anyhow::ensure!(
+            index.shape.dims.len() == 1,
+            "translate_index_select: index must be 1-D (got rank {})",
+            index.shape.dims.len()
+        );
+
+        // Same lowering as `translate_embedding`: build a flat gather index
+        // that combines the row-base offsets (`index * hidden_dim`) with a
+        // per-row `arange(hidden_dim)` broadcast.
+        let hidden_dim = source.shape.dims[1];
+        let n_idx = index.shape.dims[0];
+        let index_int = index.cast(DType::Int);
+        let base_expanded = (index_int * hidden_dim).expand_dim(1, hidden_dim);
+        let arange = self.graph.arange(hidden_dim);
+        let arange_expanded = arange.expand_dim(0, n_idx);
+        Ok(source.gather(base_expanded + arange_expanded))
+    }
+
+    /// `aten._embedding_bag_forward_only(weight, indices, offsets,
+    ///   scale_grad_by_freq, mode, sparse, per_sample_weights,
+    ///   include_last_offset, padding_idx)` →
+    ///   `(output, offset2bag, bag_size, max_indices)`.
+    ///
+    /// PyTorch decomposes `nn.EmbeddingBag` to this op. For the DLRM use
+    /// case all bags share a fixed stride `L = indices.len() / offsets.len()`
+    /// and `mode == 0` (sum). We detect that and lower to the fused
+    /// [`embedding_bag_sum_kernel`] on CUDA, or to a generic
+    /// `gather → reshape → sum` chain on CPU.
+    ///
+    /// Only `output` (tuple slot 0) is computed — `offset2bag`, `bag_size`
+    /// and `max_indices` are training-time dead ends for inference DLRM
+    /// and never read by any downstream `getitem`.
+    pub(crate) fn translate_embedding_bag_forward_only(&mut self, node: &Node) -> Result<()> {
+        let weight = self.get_input_tensor(node, 0)?;
+        let indices = self.get_input_tensor(node, 1)?;
+        let offsets = self.get_input_tensor(node, 2)?;
+        let mode = self.get_int_arg(node, 4).unwrap_or(0);
+        anyhow::ensure!(
+            mode == 0,
+            "translate_embedding_bag_forward_only: only mode=0 (sum) supported (got {mode}); \
+             vanilla DLRM uses sum-pooled bags. Extend this when a model needs mean/max."
+        );
+        // per_sample_weights is input index 6 and may be None / absent.
+        let has_per_sample_weights = node
+            .inputs
+            .get(6)
+            .and_then(|i| i.arg.as_tensor_name())
+            .is_some();
+        anyhow::ensure!(
+            !has_per_sample_weights,
+            "translate_embedding_bag_forward_only: per_sample_weights not supported \
+             (DLRM doesn't use them)."
+        );
+
+        anyhow::ensure!(
+            weight.shape.dims.len() == 2,
+            "translate_embedding_bag_forward_only: weight must be 2-D (got rank {})",
+            weight.shape.dims.len()
+        );
+        anyhow::ensure!(
+            indices.shape.dims.len() == 1,
+            "translate_embedding_bag_forward_only: indices must be 1-D (got rank {})",
+            indices.shape.dims.len()
+        );
+        anyhow::ensure!(
+            offsets.shape.dims.len() == 1,
+            "translate_embedding_bag_forward_only: offsets must be 1-D (got rank {})",
+            offsets.shape.dims.len()
+        );
+
+        let n_idx = indices.shape.dims[0]
+            .to_usize()
+            .context("translate_embedding_bag_forward_only: indices length must be static")?;
+        let batch = offsets.shape.dims[0]
+            .to_usize()
+            .context("translate_embedding_bag_forward_only: offsets length must be static")?;
+        anyhow::ensure!(
+            n_idx % batch == 0,
+            "translate_embedding_bag_forward_only: indices length ({n_idx}) must be a \
+             multiple of offsets length ({batch}); variable bag sizes not supported."
+        );
+        let bag = n_idx / batch;
+        let d = weight.shape.dims[1]
+            .to_usize()
+            .context("translate_embedding_bag_forward_only: weight dim 1 must be static")?;
+
+        // Reshape indices (B*L,) → (B, L) and cast to i32 (luminal kernel
+        // wants Int). Then either use the fused kernel under CUDA or
+        // a host-portable gather+sum lowering.
+        let indices_int = indices.cast(DType::Int);
+        let indices_2d = {
+            let new_shape = ShapeTracker::new(vec![
+                Expression::from(batch),
+                Expression::from(bag),
+            ]);
+            GraphTensor {
+                id: indices_int.id,
+                graph_ref: indices_int.graph_ref,
+                shape: new_shape,
+                dtype: indices_int.dtype,
+            }
+        };
+
+        let backend_is_cuda = std::env::var("LUMINAL_BACKEND_CUDA")
+            .map(|v| v == "1")
+            .unwrap_or(false);
+
+        let result = if cfg!(feature = "cuda") && backend_is_cuda && weight.dtype == DType::F32 {
+            // Fused CUDA path: one kernel for the whole bag-sum.
+            #[cfg(feature = "cuda")]
+            {
+                luminal_cuda_lite::kernel::embedding_bag_sum_kernel(weight, indices_2d)
+            }
+            #[cfg(not(feature = "cuda"))]
+            {
+                // Unreachable — gated above, but keep the compiler happy.
+                unreachable!()
+            }
+        } else {
+            // Generic fallback: gather (B*L, D) then reshape and sum.
+            let hidden_dim = weight.shape.dims[1];
+            let ids_expanded = (indices_2d * hidden_dim).expand_dim(2, hidden_dim);
+            let arange = self.graph.arange(hidden_dim);
+            let arange_expanded = arange.expand_dim(0, batch).expand_dim(0, bag);
+            // Note: weight.gather expects the gather indices to broadcast
+            // against weight's row-flattened layout; we want (B, L, D)
+            // out, then sum along L.
+            let _ = d; // hidden_dim is used; keep `d` reachable for debug only.
+            let gathered = weight.gather(ids_expanded + arange_expanded);
+            gathered.sum(1)
+        };
+
+        // Record the output under outputs[0][0] (the tuple's first slot).
+        // The other three slots are dead under inference and there's no
+        // downstream `getitem` that reads them — but if there ever is,
+        // we'd need to materialize them too.
+        let out_name = node
+            .outputs
+            .first()
+            .and_then(|o| o.as_tensors.as_ref())
+            .and_then(|ts| ts.first().map(|t| t.name.clone()))
+            .or_else(|| {
+                node.outputs
+                    .first()
+                    .and_then(|o| o.as_tensor.as_ref().map(|t| t.name.clone()))
+            })
+            .context(
+                "translate_embedding_bag_forward_only: missing output[0] name in FX node",
+            )?;
+        self.tensors.insert(out_name, result);
+        Ok(())
+    }
+
    pub(crate) fn translate_index_tensor(&mut self, node: &Node) -> Result<GraphTensor> {
+        // Try the DLRM PairwiseDot peephole before falling back to the
+        // generic gather-based lowering. Detects the
+        //   Z[:, li, lj] ← bmm(T, T.transpose(1, 2)) ← cat([t1.unsqueeze(1), ..., tF.unsqueeze(1)], dim=1)
+        // pattern that vanilla `nn.Sequential` DLRM (DLRMv1) emits — the
+        // version a user writes with one `EmbeddingBag` per categorical
+        // table. Replacing it with `dlrm_pairwise_dot_lower_tri` collapses
+        // the cat-then-bmm-then-gather chain (which lowers to ~40
+        // small Iota/Cast/Gather/FusedRegion kernels via pad_along+add)
+        // into a single CUDA kernel.
+        if let Some(t) = self.try_translate_pairwise_dot_lower_tri(node)? {
+            return Ok(t);
+        }
+
        let source = self.get_input_tensor(node, 0)?;

        // Handle indices as_tensors (all non-None) or as individual args with None entries
@@ -308,6 +525,98 @@ impl<'a> Translator<'a> {
                    expanded.shape.expand(target);
                    return Ok(source.gather_elements(expanded, first_non_none_dim));
                }
+
+                // Multi-index advanced indexing through leading dims that
+                // pass through (e.g. DLRM's `Z[:, li, lj]` where Z has
+                // shape `(B, F, F)` and output[b, p] = Z[b, li[p], lj[p]]).
+                //
+                // Strategy: reduce to the proven single-index simple
+                // case. Combine the multi-axis indices into one (`li * F
+                // + lj`) and reshape the source so the indexed region
+                // becomes a single dim. Then take the exact `gather_elements`
+                // path the rest of this translator already uses.
+                //
+                // Supported shape pattern (DLRM): exactly one leading
+                // passthrough dim, no trailing dims after the indexed
+                // region, per-axis indices all 1-D of the same length.
+                if first_non_none_dim > 0 {
+                    let src_dims = source.shape.dims;
+                    let src_rank = src_dims.len();
+                    let n_idx = index_names.len();
+                    let trailing_start = first_non_none_dim + n_idx;
+                    anyhow::ensure!(
+                        first_non_none_dim == 1,
+                        "index.Tensor: leading-dim passthrough only supported for \
+                         exactly one leading dim (got {first_non_none_dim})."
+                    );
+                    anyhow::ensure!(
+                        trailing_start == src_rank,
+                        "index.Tensor: trailing dims after indexed region not yet supported."
+                    );
+                    let mut idx_tensors: Vec<GraphTensor> = Vec::with_capacity(n_idx);
+                    for n in &index_names {
+                        idx_tensors.push(self.get_tensor(&n.name)?.cast(DType::Int));
+                    }
+                    let idx0_shape = idx_tensors[0].shape.dims;
+                    anyhow::ensure!(
+                        idx0_shape.len() == 1,
+                        "index.Tensor: only 1-D per-axis indices supported (got rank {})",
+                        idx0_shape.len()
+                    );
+                    for it in idx_tensors.iter().skip(1) {
+                        anyhow::ensure!(
+                            it.shape.dims == idx0_shape,
+                            "index.Tensor: per-axis indices must share a common shape"
+                        );
+                    }
+                    // strides over indexed axes (no trailing dims).
+                    let mut strides_idx: Vec<Expression> = vec![Expression::from(1usize); n_idx];
+                    for i in (0..n_idx - 1).rev() {
+                        strides_idx[i] =
+                            strides_idx[i + 1] * src_dims[first_non_none_dim + i + 1];
+                    }
+                    // combined[p] = sum_i idx_i * stride_i  (1-D)
+                    let mut combined: Option<GraphTensor> = None;
+                    for (i, it) in idx_tensors.into_iter().enumerate() {
+                        let weighted = if strides_idx[i].to_usize() == Some(1) {
+                            it
+                        } else {
+                            it * strides_idx[i]
+                        };
+                        combined = Some(match combined {
+                            Some(acc) => {
+                                let (a, b) = broadcast_binary(acc, weighted);
+                                a + b
+                            }
+                            None => weighted,
+                        });
+                    }
+                    let combined = combined.context("index.Tensor: no indices")?;
+
+                    // Indexed region size, then a (leading, indexed_size) reshape.
+                    let mut indexed_size = Expression::from(1usize);
+                    for d in &src_dims[first_non_none_dim..trailing_start] {
+                        indexed_size *= *d;
+                    }
+                    let leading_dim = src_dims[0];
+                    let flat_source =
+                        reshape_tensor(source, vec![leading_dim, indexed_size]);
+
+                    // Now dispatch through the exact single-index simple
+                    // case lowering — known-good. Add unit leading dims
+                    // to match flat_source rank, then expand to the full
+                    // (leading_dim, pair_count) shape.
+                    let mut expanded = combined;
+                    let flat_rank = 2; // (leading, indexed_size)
+                    for _ in 0..(flat_rank - expanded.shape.len()) {
+                        expanded = expanded.expand_dim(0, Expression::from(1usize));
+                    }
+                    let idx_dim_size = expanded.shape.dims[1];
+                    let mut target: Vec<Expression> = vec![leading_dim, indexed_size];
+                    target[1] = idx_dim_size;
+                    expanded.shape.expand(target);
+                    return Ok(flat_source.gather_elements(expanded, 1));
+                }
            } else {
                bail!(
                    "index.Tensor: unsupported indices format: {:?}",
@@ -552,4 +861,179 @@ impl<'a> Translator<'a> {

        first_chunk.ok_or_else(|| anyhow::anyhow!("split_with_sizes: empty sizes list"))
    }
+
+    /// DLRM PairwiseDot peephole: detect
+    ///   `aten.index.Tensor(bmm, [None, li, lj])`
+    /// where
+    ///   `bmm = aten.bmm.default(T, T_permuted)`
+    ///   `T_permuted = aten.permute.default(T, [0, 2, 1])`
+    ///   `T = aten.cat.default([unsqueeze_a, unsqueeze_b, …], dim=1)`
+    ///   each `unsqueeze_k = aten.unsqueeze.default(t_k, 1)`
+    /// and lower to `dlrm_pairwise_dot_lower_tri([t_0, t_1, …])`.
+    ///
+    /// Why this matters: at DLRM nc=3 the generic lowering produces
+    /// ~80 CUDA-graph kernels from the cat+bmm+gather chain alone (the
+    /// `pad_along + add` decomposition of cat fans out into many
+    /// Iota/Cast/Gather/FusedRegion launches). The fused kernel
+    /// computes the F(F-1)/2 dot products directly with one launch.
+    ///
+    /// Returns `Ok(Some(out))` on match, `Ok(None)` if the pattern
+    /// doesn't apply, `Err(_)` only if matching diagnostics surface a
+    /// genuine bug.
+    fn try_translate_pairwise_dot_lower_tri(
+        &mut self,
+        node: &Node,
+    ) -> Result<Option<GraphTensor>> {
+        // CUDA-only fast path. The kernel is in luminal_cuda_lite.
+        #[cfg(not(feature = "cuda"))]
+        {
+            let _ = node;
+            return Ok(None);
+        }
+
+        #[cfg(feature = "cuda")]
+        {
+            let backend_is_cuda = std::env::var("LUMINAL_BACKEND_CUDA")
+                .map(|v| v == "1")
+                .unwrap_or(false);
+            if !backend_is_cuda {
+                return Ok(None);
+            }
+
+            // 1. Detect [None, li, lj] index list.
+            let opt_tensors =
+                match node.inputs.get(1).and_then(|i| i.arg.as_optional_tensors()) {
+                    Some(t) => t,
+                    None => {
+                        return Ok(None);
+                    }
+                };
+            // Expect [None, li, lj]: three entries, first None, last two
+            // are tensors.
+            if opt_tensors.len() != 3 {
+                return Ok(None);
+            }
+            use crate::pt2_schema::OptionalTensorEntry;
+            let (li_name, lj_name) =
+                match (&opt_tensors[0], &opt_tensors[1], &opt_tensors[2]) {
+                    (OptionalTensorEntry::None(_), OptionalTensorEntry::Tensor(li), OptionalTensorEntry::Tensor(lj)) => {
+                        (li.as_tensor.name.clone(), lj.as_tensor.name.clone())
+                    }
+                    _ => {
+                        return Ok(None);
+                    }
+                };
+            // 2. Source must be a bmm of (T, T_permuted).
+            let source_name = match node.inputs.first().and_then(|i| i.arg.as_tensor_name()) {
+                Some(s) => s.to_string(),
+                None => {
+                    return Ok(None);
+                }
+            };
+            let bmm_info = match self.node_chain.get(&source_name) {
+                Some(x) => x.clone(),
+                None => {
+                    return Ok(None);
+                }
+            };
+            if bmm_info.0 != "torch.ops.aten.bmm.default" {
+                return Ok(None);
+            }
+            let bmm_inputs = match self.op_inputs.get(&source_name) {
+                Some(v) => v.clone(),
+                None => {
+                    return Ok(None);
+                }
+            };
+            if bmm_inputs.len() != 2 {
+                return Ok(None);
+            }
+            let (bmm_a, bmm_b) = (bmm_inputs[0].clone(), bmm_inputs[1].clone());
+            // 3. Both bmm inputs must descend from the same cat — one
+            //    directly, one via a [0, 2, 1] permute. The permute is
+            //    already recorded in `transpose_2d_source` (we extended
+            //    it to cover 3-D `[0, 2, 1]` for bmm fast paths).
+            let permute_src = self.transpose_2d_source.get(&bmm_b).cloned();
+            let (cat_name, _has_transpose) = if permute_src.as_deref() == Some(bmm_a.as_str()) {
+                (bmm_a.clone(), true)
+            } else if self
+                .transpose_2d_source
+                .get(&bmm_a)
+                .map(|s| s.as_str() == bmm_b.as_str())
+                .unwrap_or(false)
+            {
+                (bmm_b.clone(), true)
+            } else {
+                return Ok(None);
+            };
+            let cat_info = match self.node_chain.get(&cat_name) {
+                Some(x) => x.clone(),
+                None => {
+                    return Ok(None);
+                }
+            };
+            if cat_info.0 != "torch.ops.aten.cat.default" {
+                return Ok(None);
+            }
+            let cat_inputs = match self.op_inputs.get(&cat_name) {
+                Some(v) => v.clone(),
+                None => {
+                    return Ok(None);
+                }
+            };
+            if cat_inputs.len() < 2 {
+                return Ok(None);
+            }
+            // 4. Each cat input should be `unsqueeze(t_k, 1)` — peel the
+            //    unsqueeze so we recover the original (B, D) tensor.
+            let mut feature_tensors: Vec<GraphTensor> = Vec::with_capacity(cat_inputs.len());
+            for ci in &cat_inputs {
+                let unsqueeze_info = match self.node_chain.get(ci) {
+                    Some(x) => x.clone(),
+                    None => {
+                        return Ok(None);
+                    }
+                };
+                if unsqueeze_info.0 != "torch.ops.aten.unsqueeze.default" {
+                    return Ok(None);
+                }
+                // unsqueeze's first input is the source tensor name.
+                let src = unsqueeze_info.1;
+                let t = self.get_tensor(&src)?;
+                if t.dtype != DType::F32 || t.shape.dims.len() != 2 {
+                    return Ok(None);
+                }
+                feature_tensors.push(t);
+            }
+            // 5. Sanity check on li/lj — they must be the strict lower-tri
+            //    pair table for F = feature_tensors.len(). We don't
+            //    materialize them; just verify the index buffers have
+            //    the right length, then trust they're tril-indices.
+            //    A user passing arbitrary indices through this exact
+            //    chain would silently get tril results; gating on
+            //    `index buffer length == F*(F-1)/2` catches the common
+            //    case without invasive constant-folding work.
+            let f = feature_tensors.len();
+            let pair_count = f * (f - 1) / 2;
+            let li_t = self.get_tensor(&li_name)?;
+            let lj_t = self.get_tensor(&lj_name)?;
+            if li_t.shape.dims.len() != 1
+                || lj_t.shape.dims.len() != 1
+                || li_t.shape.dims[0].to_usize() != Some(pair_count)
+                || lj_t.shape.dims[0].to_usize() != Some(pair_count)
+            {
+                return Ok(None);
+            }
+
+            // All checks passed — emit the fused kernel. The bmm, cat,
+            // and unsqueeze nodes left dangling in the HLIR get picked
+            // up by `Translator::dce()` after every FX node is translated
+            // (walks back from `Output` HLIR sinks and drops everything
+            // unreachable). luminal's egglog optimizer leaves some of
+            // these subgraphs alive on its own, so the explicit pass is
+            // load-bearing for this peephole.
+            let out = luminal_cuda_lite::kernel::dlrm_pairwise_dot_lower_tri(feature_tensors);
+            Ok(Some(out))
+        }
+    }
 }
--- a/crates/luminal_python/src/luminal/compiled_model.py
+++ b/crates/luminal_python/src/luminal/compiled_model.py
@@ -87,25 +87,6 @@ class CompiledModel:
            input_shapes = [list(t.shape) for t in user_inputs]
            self._graph.auto_set_dims_from_input_shapes(input_shapes)

-        # Set user input data via pointer.
-        # Convert to the graph's expected dtype so bytes match the Input node's dtype tag.
-        # For CUDA inputs, keep references alive so the caching allocator doesn't
-        # recycle GPU memory before run() reads the pointers.
-        _input_refs = []
-        for name, tensor, expected_dtype in zip(
-            self._input_names, user_inputs, self._input_dtypes
-        ):
-            if self._supports_device_ptrs and tensor.is_cuda:
-                t = tensor.detach().contiguous().to(expected_dtype)
-                n_bytes = t.numel() * t.element_size()
-                self._graph.set_input_device_ptr(name, t.data_ptr(), n_bytes)
-                _input_refs.append(t)
-            else:
-                t = tensor.detach().cpu().contiguous().to(expected_dtype)
-                n_bytes = t.numel() * t.element_size()
-                dtype_code = _torch_dtype_code(t.dtype)
-                self._graph.set_input_from_ptr(name, t.data_ptr(), n_bytes, dtype_code)
-
        # Resolve output shapes before run() (needed for pre-allocation).
        if self._has_dynamic_dims:
            output_shapes = self._graph.resolve_output_shapes()
@@ -114,11 +95,30 @@ class CompiledModel:

        output_dtype_codes = self._graph.output_dtypes

-        # CUDA zero-copy path: pre-allocate output tensors and register their device
-        # pointers so the final kernel writes directly into PyTorch's buffer.
-        _use_zero_copy = self._supports_device_ptrs
+        # GPU fast path: batch input registration + output pre-alloc +
+        # run() into a single FFI call to the Rust runtime. For tiny
+        # graphs (DLRM at batch=2048 lands in ~10 µs of GPU work) the
+        # ~7 separate PyO3 crossings of the slow path eat enough host
+        # time to dominate the per-iter cost; the batched path saves
+        # roughly 30–60 µs depending on input/output count.
+        _input_refs = []
        output_tensors = []
-        if _use_zero_copy:
+        _use_zero_copy = self._supports_device_ptrs
+        _used_batched = False
+        _zero_copy_flags: list[bool] = []
+        if _use_zero_copy and all(t.is_cuda for t in user_inputs):
+            input_ptrs = []
+            for name, tensor, expected_dtype in zip(
+                self._input_names, user_inputs, self._input_dtypes
+            ):
+                t = tensor.detach().contiguous().to(expected_dtype)
+                n_bytes = t.numel() * t.element_size()
+                input_ptrs.append((name, t.data_ptr(), n_bytes))
+                _input_refs.append(t)
+            output_ptrs = []
+            # Track which output_ptrs index corresponds to each float output
+            # name, so we can map run_with_ptrs's zero-copy bools back.
+            _zc_index_by_name: dict[str, int] = {}
            for i, (name, shape) in enumerate(zip(self._output_names, output_shapes)):
                out_dtype = (
                    code_to_torch_dtype(output_dtype_codes[i])
@@ -127,13 +127,45 @@ class CompiledModel:
                )
                out = torch.empty(shape, dtype=out_dtype, device=input_device)
                if out_dtype.is_floating_point:
-                    self._graph.set_output_device_ptr(
-                        name, out.data_ptr(), out.numel() * out.element_size()
+                    _zc_index_by_name[name] = len(output_ptrs)
+                    output_ptrs.append(
+                        (name, out.data_ptr(), out.numel() * out.element_size())
                    )
                output_tensors.append(out)
+            # Single FFI call: set all input ptrs, set all output ptrs, run.
+            _zero_copy_flags = self._graph.run_with_ptrs(input_ptrs, output_ptrs)
+            _used_batched = True
+        else:
+            # Slow path (CPU input or mixed-device): per-tensor FFI like before.
+            for name, tensor, expected_dtype in zip(
+                self._input_names, user_inputs, self._input_dtypes
+            ):
+                if self._supports_device_ptrs and tensor.is_cuda:
+                    t = tensor.detach().contiguous().to(expected_dtype)
+                    n_bytes = t.numel() * t.element_size()
+                    self._graph.set_input_device_ptr(name, t.data_ptr(), n_bytes)
+                    _input_refs.append(t)
+                else:
+                    t = tensor.detach().cpu().contiguous().to(expected_dtype)
+                    n_bytes = t.numel() * t.element_size()
+                    dtype_code = _torch_dtype_code(t.dtype)
+                    self._graph.set_input_from_ptr(name, t.data_ptr(), n_bytes, dtype_code)
+            if _use_zero_copy:
+                for i, (name, shape) in enumerate(zip(self._output_names, output_shapes)):
+                    out_dtype = (
+                        code_to_torch_dtype(output_dtype_codes[i])
+                        if i < len(output_dtype_codes)
+                        else torch.float32
+                    )
+                    out = torch.empty(shape, dtype=out_dtype, device=input_device)
+                    if out_dtype.is_floating_point:
+                        self._graph.set_output_device_ptr(
+                            name, out.data_ptr(), out.numel() * out.element_size()
+                        )
+                    output_tensors.append(out)

-        # Run the graph
-        self._graph.run()
+            # Run the graph
+            self._graph.run()

        # Integer dtypes for which we read the buffer as i32 and then cast.
        # Includes int64 because luminal collapses all integer types to its
@@ -151,7 +183,14 @@ class CompiledModel:
                )
                out = output_tensors[i]
                if out_dtype.is_floating_point:
-                    if not self._graph.output_is_zero_copy(name):
+                    # Prefer the zero-copy flag returned by run_with_ptrs to
+                    # avoid a second FFI call. Falls back to the per-name
+                    # query when we took the unbatched slow path.
+                    if _used_batched:
+                        is_zc = _zero_copy_flags[_zc_index_by_name[name]]
+                    else:
+                        is_zc = self._graph.output_is_zero_copy(name)
+                    if not is_zc:
                        self._graph.copy_output_to_device_ptr(
                            name, out.data_ptr(), out.numel() * out.element_size()
                        )
--- a/crates/luminal_python/src/luminal/main.py
+++ b/crates/luminal_python/src/luminal/main.py
@@ -1,3 +1,5 @@
+import os
+
 import torch
 import torch._dynamo

@@ -31,10 +33,22 @@ def _detect_factory_capsule(example_inputs):
    return _native_factory_capsule()


+# torch dtypes luminal does not have a same-width equivalent for — must
+# narrow on the host before handing the device pointer to luminal, or the
+# Rust side will read bytes at the wrong stride. Mirrors the narrowing
+# already done by `TypedData::from_pytorch_bytes` on the CPU weight path.
+_LUMINAL_NARROW_MAP = {
+    torch.int64: torch.int32,
+    torch.float64: torch.float32,
+}
+
+
 def _collect_weight_pointers(weights):
    """Partition weight tensors into CUDA device pointers and CPU host pointers.

-    Preserves native dtype — no forced conversion to float32.
+    Preserves native dtype where luminal has a same-width equivalent, and
+    explicitly narrows the few it doesn't (i64 → i32, f64 → f32) so the
+    CUDA path's raw device pointer matches what the Rust side expects.

    Args:
        weights: dict of name -> torch.Tensor
@@ -50,6 +64,16 @@ def _collect_weight_pointers(weights):
    cpu_ptrs = {}
    for name, tensor in weights.items():
        t = tensor.detach().contiguous()
+        # If luminal stores this dtype at a different width than torch, we
+        # must materialize a converted copy here — handing the original
+        # i64 device pointer to a luminal Int (i32) tensor produces
+        # half-word reads. The CPU path's `from_pytorch_bytes` already
+        # narrows on a byte-level copy; do the equivalent for CUDA.
+        target = _LUMINAL_NARROW_MAP.get(t.dtype)
+        if target is not None:
+            t = t.to(target)
+            # `.to` always returns a fresh tensor when the dtype changes,
+            # so it's safe to take its data_ptr below.
        n_bytes = t.numel() * t.element_size()
        if t.is_cuda:
            keep_alive.append(t)
@@ -83,7 +107,7 @@ def register_backend(factory_capsule):
    """

    def backend(gm, example_inputs, options=None):
-        return _compile_pt2(gm, example_inputs, factory_capsule)
+        return _compile_pt2(gm, example_inputs, factory_capsule, options=options)

    return backend

@@ -102,7 +126,21 @@ def luminal_backend(gm, example_inputs, options=None):
    For external backends, use register_backend with the backend's factory capsule.
    """
    capsule = _detect_factory_capsule(example_inputs)
-    return _compile_pt2(gm, example_inputs, capsule)
+    # Tell the Rust translator which backend was selected so its dispatch
+    # peepholes can choose backend-appropriate ops (e.g. emit cuda_lite
+    # Matmul2DKernel for addmm only when the cuda backend is active —
+    # the native backend has no such kernel and would panic in load_llir).
+    first_t = next((t for t in (example_inputs or []) if torch.is_tensor(t)), None)
+    is_cuda = first_t is not None and first_t.device.type == "cuda"
+    prior = os.environ.get("LUMINAL_BACKEND_CUDA")
+    os.environ["LUMINAL_BACKEND_CUDA"] = "1" if is_cuda else "0"
+    try:
+        return _compile_pt2(gm, example_inputs, capsule, options=options)
+    finally:
+        if prior is None:
+            os.environ.pop("LUMINAL_BACKEND_CUDA", None)
+        else:
+            os.environ["LUMINAL_BACKEND_CUDA"] = prior


 # ---------------------------------------------------------------------------
@@ -110,8 +148,8 @@ def luminal_backend(gm, example_inputs, options=None):
 # ---------------------------------------------------------------------------


-def _compile_pt2(gm, example_inputs, factory_capsule):
+def _compile_pt2(gm, example_inputs, factory_capsule, options=None):
    """PT2/torch.export path — delegates to pt2.pt2_backend."""
    from .pt2 import pt2_backend

-    return pt2_backend(gm, example_inputs, factory=factory_capsule)
+    return pt2_backend(gm, example_inputs, factory=factory_capsule, options=options)
--- a/crates/luminal_python/src/luminal/pt2.py
+++ b/crates/luminal_python/src/luminal/pt2.py
@@ -603,7 +603,13 @@ def _build_dynamic_shapes_from_dim_arg(dynamic_dim, example_args):


 def _eager_pt2_compile(
-    gm, user_inputs, original_weights, user_indices, dynamic_shapes, factory
+    gm,
+    user_inputs,
+    original_weights,
+    user_indices,
+    dynamic_shapes,
+    factory,
+    search_iterations,
 ):
    """Run torch.export → save → Rust compile end-to-end. Returns CompiledModel.

@@ -660,7 +666,7 @@ def _eager_pt2_compile(
        return _save_and_compile(
            pt2_path,
            factory,
-            10,
+            search_iterations,
            original_weights=original_weights,
            user_indices=user_indices,
        )
@@ -693,6 +699,7 @@ class _LazyDynamicCompiledModel:
        user_indices,
        dynamic_shapes,
        factory,
+        search_iterations,
    ):
        self._gm = gm
        self._user_inputs = user_inputs
@@ -700,6 +707,7 @@ class _LazyDynamicCompiledModel:
        self._user_indices = user_indices
        self._dynamic_shapes = dynamic_shapes
        self._factory = factory
+        self._search_iterations = search_iterations
        self._compiled = None

    def _ensure_compiled(self):
@@ -711,6 +719,7 @@ class _LazyDynamicCompiledModel:
                self._user_indices,
                self._dynamic_shapes,
                self._factory,
+                self._search_iterations,
            )
            # Drop references to inputs we no longer need — the Rust side
            # holds onto weights via device pointers / CPU buffers.
@@ -734,7 +743,7 @@ class _LazyDynamicCompiledModel:
        return self._ensure_compiled().set_dim(name, value)


-def pt2_backend(gm, example_inputs, factory=None):
+def pt2_backend(gm, example_inputs, factory=None, options=None):
    """torch.compile backend using PT2 pipeline.

    Usage: torch.compile(model, backend=luminal.register_backend(capsule))
@@ -743,6 +752,8 @@ def pt2_backend(gm, example_inputs, factory=None):

    if factory is None:
        factory = _detect_factory_capsule(example_inputs)
+    options = options or {}
+    search_iterations = int(options.get("search_iterations", 10))

    # Work on a private copy of the GraphModule. Dynamo holds onto the
    # original to install guards and to retrace on shape changes; mutating it
@@ -777,9 +788,21 @@ def pt2_backend(gm, example_inputs, factory=None):
        # Dynamo is still relying on, and running it inside the backend frame
        # corrupts the freshly-installed guards.
        return _LazyDynamicCompiledModel(
-            gm, user_inputs, original_weights, user_indices, dynamic_shapes, factory
+            gm,
+            user_inputs,
+            original_weights,
+            user_indices,
+            dynamic_shapes,
+            factory,
+            search_iterations,
        )

    return _eager_pt2_compile(
-        gm, user_inputs, original_weights, user_indices, None, factory
+        gm,
+        user_inputs,
+        original_weights,
+        user_indices,
+        None,
+        factory,
+        search_iterations,
    )
--- a/examples/flux2/src/text_encoder.rs
+++ b/examples/flux2/src/text_encoder.rs
@@ -68,20 +68,10 @@ pub const WEIGHT_DTYPE: DType = DType::Bf16;
 // =============================================================================

 fn linear_no_bias(x: GraphTensor, w: GraphTensor) -> GraphTensor {
-    // Direct mixed-precision kernel: F32 A × BF16 B^T → F32 (M, N), with the
-    // BF16 → F32 conversion happening on each load inside the kernel rather
-    // than as a separate cast op. This keeps the BF16 weight in memory as-is
-    // (a 24 GB → 48 GB cast for the full encoder would not fit on the GPU)
-    // and bypasses the egglog matmul lowering, where the cublaslt 2D rule
-    // doesn't reliably fire for these shapes — see kernel::matmul2d's docs.
-    //
-    // Falls back to the standard `x.matmul(w.cast(x.dtype).t())` lowering
-    // for ranks > 2 (e.g. attention's batched (heads, seq, head_dim) form),
-    // since the custom kernel is only 2D.
-    if x.shape.len() == 2 && w.shape.len() == 2 {
-        luminal_cuda_lite::kernel::linear_no_bias_bf16_w(x, w)
+    if x.dtype == w.dtype {
+        x.matmul(w.t())
    } else {
-        x.matmul(w.cast(x.dtype).t())
+        x.cast(w.dtype).matmul(w.t()).cast(x.dtype)
    }
 }

@@ -131,13 +121,6 @@ fn apply_rope(x: GraphTensor, pos_ids: GraphTensor, n_heads: usize, theta: f32)
 /// Standard scaled dot-product attention over `(n_heads, seq_q, head_dim)`,
 /// `(n_heads, seq_k, head_dim)`, `(n_heads, seq_k, head_dim)` with a causal
 /// mask. Returns `(seq_q, n_heads * head_dim)`.
-///
-/// Routes the two batched matmuls through `kernel::matmul_3d_t` /
-/// `matmul_3d` rather than the egglog matmul lowering. The standard path
-/// has the same problem the VAE attention had (cublaslt batched rules
-/// fail to fire reliably; the broadcast Mul + SumReduce fallback creates
-/// a `(n_heads, M, N, K)` intermediate that scales O(seq²) and OOMs at
-/// seq_len ≥ ~256 even with BF16 weights elsewhere).
 fn causal_sdpa(
    q: GraphTensor,
    k: GraphTensor,
@@ -148,13 +131,16 @@ fn causal_sdpa(
    let n_heads = q.dims()[0];
    let seq = q.dims()[1];
    let scale = (HEAD_DIM as f32).sqrt().recip();
-    // The kernel needs contiguous batches; a `* 1.0` after the upstream
-    // transpose / GQA-expand chain materialises the strided view.
+    // Materialize strided views from the upstream transpose / GQA-expand chain
+    // before expressing attention as HLIR matmuls. Today the generic batched
+    // matmul fallback can handle those arbitrary strides correctly, but the
+    // full model becomes too memory-heavy unless cuBLASLt sees contiguous
+    // per-head matrices.
    let q = q * 1.0_f32;
    let k = k * 1.0_f32;
    let v = v * 1.0_f32;
    // Q @ K^T: (heads, seq, head_dim) @ (heads, seq, head_dim)^T = (heads, seq, seq).
-    let scores = luminal_cuda_lite::kernel::matmul_3d_t(q, k) * scale;
+    let scores = q.matmul(k.transpose(1, 2)) * scale;
    // Causal mask: positions where k_pos > q_pos are masked.
    let q_pos = cx.arange(seq).cast(DType::F32);
    let k_pos = cx.arange(seq).cast(DType::F32);
@@ -177,13 +163,9 @@ fn causal_sdpa(
    let masked = scores + mask * (-1e10_f32);
    let weights = masked.softmax(2);
    // attn = weights @ v: (heads, seq, seq) @ (heads, seq, head_dim) = (heads, seq, head_dim).
-    let attn = luminal_cuda_lite::kernel::matmul_3d(weights, v);
-    // `transpose(0, 1).merge_dims(1, 2)` produces the merge_dims
-    // non-contiguous K stride `(((z/HEAD_DIM)*HEAD_DIM)*SEQ)+(z%HEAD_DIM)`.
-    // The cublaslt 2D rule requires `K stride = MIter` (contiguous), so
-    // without forcing materialization here the downstream o_proj matmul
-    // falls through to a broadcast Mul whose `(SEQ, HIDDEN, KV_DIM)`
-    // intermediate is ~20 GB BF16 and OOMs the GPU during search.
+    let attn = weights.matmul(v);
+    // `transpose(0, 1).merge_dims(1, 2)` produces a non-contiguous K stride;
+    // materialize before the downstream o_proj matmul.
    attn.transpose(0, 1).merge_dims(1, 2) * 1.0_f32 // (seq_q, n_heads*head_dim)
 }

@@ -372,8 +354,26 @@ pub fn format_chat(system_message: &str, user_prompt: &str) -> String {

 #[cfg(test)]
 mod tests {
+    use luminal::hlir::CustomOpKind;
+
    use super::*;

+    fn assert_no_custom_ops(cx: &Graph) {
+        assert!(
+            cx.custom_ops.is_empty(),
+            "Flux2 text encoder helpers should use pure HLIR, not registered CustomOp wrappers"
+        );
+        let custom_nodes: Vec<_> = cx
+            .graph
+            .node_indices()
+            .filter(|&node| cx.try_get_op::<CustomOpKind>(node).is_some())
+            .collect();
+        assert!(
+            custom_nodes.is_empty(),
+            "Flux2 text encoder graph contains CustomOpKind nodes: {custom_nodes:?}"
+        );
+    }
+
    #[test]
    fn chat_template_matches_jinja_output() {
        // Sanity check: the result is the deterministic concatenation we
@@ -395,4 +395,23 @@ mod tests {
        // hidden_states[30] requires running 30 layers (0..29 inclusive).
        assert_eq!(NUM_LAYERS_USED, *TAP_LAYERS.iter().max().unwrap());
    }
+
+    #[test]
+    fn text_encoder_helpers_use_no_custom_ops() {
+        let mut cx = Graph::default();
+
+        let x = cx.named_tensor("x", (2usize, 3usize));
+        let w = cx
+            .named_tensor("w", (4usize, 3usize))
+            .as_dtype(WEIGHT_DTYPE);
+        let _ = linear_no_bias(x, w).output();
+
+        let q = cx.named_tensor("q", (1usize, 2usize, HEAD_DIM));
+        let k = cx.named_tensor("k", (1usize, 2usize, HEAD_DIM));
+        let v = cx.named_tensor("v", (1usize, 2usize, HEAD_DIM));
+        let mask = cx.named_tensor("attention_mask", 2usize);
+        let _ = causal_sdpa(q, k, v, mask).output();
+
+        assert_no_custom_ops(&cx);
+    }
 }
--- a/examples/flux2/src/transformer.rs
+++ b/examples/flux2/src/transformer.rs
@@ -120,26 +120,10 @@ pub const WEIGHT_DTYPE: DType = DType::Bf16;
 // =============================================================================

 fn linear_no_bias(x: GraphTensor, w: GraphTensor) -> GraphTensor {
-    // For 2D inputs we go through `kernel::linear_no_bias_bf16_w`, which
-    // is a direct mixed-precision SGEMM (F32 A × BF16 B^T → F32) that
-    // converts BF16 → F32 on each load instead of materializing a
-    // separate F32 cast tensor. Two reasons we don't use the egglog
-    // matmul lowering for these:
-    //   1. The cublaslt 2D rule fails to fire reliably for some matmul
-    //      shapes (see kernel::matmul2d's docs); even one bad genome
-    //      pick on the broadcast Mul + SumReduce fallback creates an
-    //      `(M, N, K)` intermediate that OOMs the GPU.
-    //   2. Explicitly casting all BF16 weights to F32 first would more
-    //      than double the transformer's working set (~120 GB) and
-    //      wouldn't fit. The kernel keeps weights as BF16 in memory.
-    //
-    // Higher-rank cases (3D batched matmul inside attention) fall
-    // through to the standard matmul lowering — those go through the
-    // separate `matmul_3d` / `matmul_3d_t` helpers in `sdpa` below.
-    if x.shape.len() == 2 && w.shape.len() == 2 {
-        luminal_cuda_lite::kernel::linear_no_bias_bf16_w(x, w)
+    if x.dtype == w.dtype {
+        x.matmul(w.t())
    } else {
-        x.matmul(w.cast(x.dtype).t())
+        x.cast(w.dtype).matmul(w.t()).cast(x.dtype)
    }
 }

@@ -191,20 +175,20 @@ fn apply_rope(x: GraphTensor, cos: GraphTensor, sin: GraphTensor) -> GraphTensor

 /// Scaled dot-product attention with NO mask, no causal: standard SDPA.
 /// q, k, v: `(H, S, D)`. Returns `(S, H, D)`.
-///
-/// Routes through the direct batched matmul kernels for the same reason
-/// the text encoder does — see `text_encoder::causal_sdpa` for context.
 fn sdpa(q: GraphTensor, k: GraphTensor, v: GraphTensor) -> GraphTensor {
    let head_dim = q.dims()[2].to_usize().expect("head_dim must be static");
    let scale = (head_dim as f32).sqrt().recip();
-    // The kernel needs contiguous batches; materialize the strided views
-    // produced upstream (transpose / split_dims chains).
+    // Materialize the strided views produced upstream (transpose /
+    // split_dims chains) before expressing attention as HLIR matmuls. cuBLASLt
+    // can represent the leading dimensions, but the current rewrite rules do
+    // not yet match the interleaved per-head layout, so omitting these copies
+    // falls back to a much larger generic plan in the full Flux2 graph.
    let q = q * 1.0_f32;
    let k = k * 1.0_f32;
    let v = v * 1.0_f32;
-    let scores = luminal_cuda_lite::kernel::matmul_3d_t(q, k) * scale; // (H, S, S)
+    let scores = q.matmul(k.transpose(1, 2)) * scale; // (H, S, S)
    let attn_w = scores.softmax(2);
-    let attn = luminal_cuda_lite::kernel::matmul_3d(attn_w, v); // (H, S, D)
+    let attn = attn_w.matmul(v); // (H, S, D)
    attn.transpose(0, 1) // (S, H, D)
 }

@@ -518,9 +502,8 @@ impl SingleStreamAttn {
        let q = q.transpose(0, 1);
        let k = k.transpose(0, 1);
        let v = v.transpose(0, 1);
-        // `merge_dims(1, 2)` on (S, H, D) produces non-contiguous K
-        // stride; force materialization so cublaslt can match the
-        // downstream `to_out` matmul. See dual-stream block above.
+        // `merge_dims(1, 2)` on (S, H, D) produces non-contiguous K stride;
+        // materialize before the downstream `to_out` matmul.
        let attn = sdpa(q, k, v).merge_dims(1, 2) * 1.0_f32; // (S, HIDDEN)

        let mlp = swiglu(mlp_in); // (S, MLP_HIDDEN)
@@ -915,3 +898,44 @@ impl Flux2Transformer {
 // =============================================================================
 // Tests
 // =============================================================================
+
+#[cfg(test)]
+mod tests {
+    use luminal::hlir::CustomOpKind;
+
+    use super::*;
+
+    fn assert_no_custom_ops(cx: &Graph) {
+        assert!(
+            cx.custom_ops.is_empty(),
+            "Flux2 transformer helpers should use pure HLIR, not registered CustomOp wrappers"
+        );
+        let custom_nodes: Vec<_> = cx
+            .graph
+            .node_indices()
+            .filter(|&node| cx.try_get_op::<CustomOpKind>(node).is_some())
+            .collect();
+        assert!(
+            custom_nodes.is_empty(),
+            "Flux2 transformer graph contains CustomOpKind nodes: {custom_nodes:?}"
+        );
+    }
+
+    #[test]
+    fn transformer_helpers_use_no_custom_ops() {
+        let mut cx = Graph::default();
+
+        let x = cx.named_tensor("x", (3usize, 4usize));
+        let w = cx
+            .named_tensor("w", (5usize, 4usize))
+            .as_dtype(WEIGHT_DTYPE);
+        let _ = linear_no_bias(x, w).output();
+
+        let q = cx.named_tensor("q", (2usize, 3usize, 4usize));
+        let k = cx.named_tensor("k", (2usize, 3usize, 4usize));
+        let v = cx.named_tensor("v", (2usize, 3usize, 4usize));
+        let _ = sdpa(q, k, v).output();
+
+        assert_no_custom_ops(&cx);
+    }
+}
--- a/examples/flux2/src/vae.rs
+++ b/examples/flux2/src/vae.rs
@@ -2,7 +2,7 @@
 //!
 //! ## Status
 //!
-//! - All three primitives (`conv2d_bias`, `group_norm`, `nearest_upsample_2x`)
+//! - All three building blocks (`conv2d_bias`, `group_norm`, `nearest_upsample_2x`)
 //!   are implemented and **individually validated** against numerical
 //!   references — see the tests at the bottom of this file.
 //! - Stitching them into the full decoder currently hits a `luminal_cuda_lite`
@@ -70,14 +70,9 @@ fn decoder_block_channels(block_idx: usize) -> (usize, usize) {
 // HLIR primitive helpers
 // =============================================================================

-/// 2D convolution with bias on a `(C_in, H, W)` input, weights stored as
-/// `(C_out, C_in, K, K)` flat-loaded, bias as `(C_out,)`. Returns
+/// 2D convolution with bias on a `(C_in, H, W)` input, weights stored flat as
+/// `(C_out, C_in * K * K)`, bias as `(C_out,)`. Returns
 /// `(C_out, H_out, W_out)` where `H_out = (H + 2*padding - kernel) / stride + 1`.
-///
-/// Wraps the direct conv kernel from [`luminal_cuda_lite::kernel::conv2d_bias`]
-/// (one CUDA thread per output element), which avoids materializing the
-/// `(H_out*W_out, C_in*K*K)` unfold intermediate that earlier HLIR-only
-/// implementations needed.
 fn conv2d_bias(
    x: GraphTensor,
    weight: GraphTensor,
@@ -86,7 +81,58 @@ fn conv2d_bias(
    stride: usize,
    padding: usize,
 ) -> GraphTensor {
-    luminal_cuda_lite::kernel::conv2d_bias(x, weight, bias, kernel, stride, padding)
+    let dims = x.dims();
+    assert_eq!(dims.len(), 3, "conv2d_bias expects (C, H, W)");
+    let h = dims[1];
+    let w = dims[2];
+
+    if kernel == 1 && stride == 1 && padding == 0 {
+        let xt = x.permute(&[1, 2, 0]).merge_dims(0, 1); // (H*W, C_in)
+        let out = xt.matmul(weight.t()); // (H*W, C_out)
+        let out = out.split_dims(0, w).permute(&[2, 0, 1]); // (C_out, H, W)
+        return out + bias.expand_dim(1, h).expand_dim(2, w);
+    }
+
+    let zero = Expression::from(0);
+    let pad = Expression::from(padding);
+    let padded = if padding > 0 {
+        x.pad(vec![(zero, zero), (pad, pad), (pad, pad)], 0.0)
+    } else {
+        x
+    };
+
+    let unfolded = padded.unfold(
+        vec![1usize, kernel, kernel],
+        vec![1usize, stride, stride],
+        vec![1usize, 1, 1],
+    );
+    let output_spatial_dims = unfolded.dims()[1..3].to_vec();
+
+    // (C, H_out, W_out, 1, K, K) -> (H_out, W_out, C, K, K)
+    let mut patches = unfolded.squeeze(3).permute(&[1, 2, 0, 3, 4]);
+    while patches.dims().len() > 3 {
+        let last = patches.dims().len();
+        patches = patches.merge_dims(last - 2, last - 1);
+    }
+    let patches = patches.merge_dims(0, 1); // (H_out*W_out, C_in*K*K)
+
+    let out = patches.matmul(weight.t()); // (H_out*W_out, C_out)
+    let out = out
+        .split_dims(0, output_spatial_dims[1])
+        .permute(&[2, 0, 1]); // (C_out, H_out, W_out)
+    let out_dims = out.dims();
+    out + bias.expand_dim(1, out_dims[1]).expand_dim(2, out_dims[2])
+}
+
+fn linear_bias(x: GraphTensor, weight: GraphTensor, bias: GraphTensor) -> GraphTensor {
+    let out = x.matmul(weight.cast(x.dtype).t());
+    let out_dims = out.dims();
+    match out_dims.len() {
+        1 => out + bias,
+        2 => out + bias.expand_dim(0, out_dims[0]),
+        3 => out + bias.expand_dim(0, out_dims[0]).expand_dim(1, out_dims[1]),
+        n => panic!("linear_bias: unsupported rank {n}"),
+    }
 }

 /// PyTorch-style GroupNorm on a (C, H, W) tensor.
@@ -148,9 +194,7 @@ fn group_norm(
 fn nearest_upsample_2x(x: GraphTensor) -> GraphTensor {
    // (C, H, W) -> (C, H, 2, W) -> (C, 2H, W) -> (C, 2H, W, 2) -> (C, 2H, 2W)
    let stage1 = x.expand_dim(2, 2_usize).merge_dims(1, 2);
-    let stage2 = stage1.expand_dim(3, 2_usize).merge_dims(2, 3);
-    // Materialize the broadcast view so subsequent ops see contiguous strides.
-    stage2 + 0.0_f32
+    stage1.expand_dim(3, 2_usize).merge_dims(2, 3)
 }

 /// SiLU = x * sigmoid(x).
@@ -300,30 +344,21 @@ impl AttnBlock {
            NORM_NUM_GROUPS,
            NORM_EPS,
        );
-        // (C, H, W) -> (C, H*W) -> (H*W, C). The transpose at the end leaves
-        // a column-major view, which the direct matmul kernels assume away;
-        // `* 1.0` forces a contiguous row-major materialization.
-        let merged = normed.merge_dims(1, 2).transpose(0, 1) * 1.0_f32;
+        // (C, H, W) -> (C, H*W) -> (H*W, C). This is a column-major view
+        // that cuBLASLt can consume directly.
+        let merged = normed.merge_dims(1, 2).transpose(0, 1);

-        // Q, K, V projections — direct kernel routes around the cublaslt
-        // 2D rule, which silently fails to fire for some of these matmuls
-        // and lets search occasionally pick the broadcast Mul + SumReduce
-        // fallback. At 1024² the bad path on `q @ kᵀ` allocates a
-        // `(HW, HW, C) = (16384, 16384, 512)` ≈ 524 GiB intermediate.
-        let q = luminal_cuda_lite::kernel::linear_bias(merged, self.to_q_w, self.to_q_b);
-        let k = luminal_cuda_lite::kernel::linear_bias(merged, self.to_k_w, self.to_k_b);
-        let v = luminal_cuda_lite::kernel::linear_bias(merged, self.to_v_w, self.to_v_b);
+        let q = linear_bias(merged, self.to_q_w, self.to_q_b);
+        let k = linear_bias(merged, self.to_k_w, self.to_k_b);
+        let v = linear_bias(merged, self.to_v_w, self.to_v_b);

        // Standard scaled dot-product attention over the spatial axis.
-        // `q @ kᵀ` with k stored row-major as `(HW, C)`: matmul_2d_t handles
-        // the transpose without materialising k as a separate tensor.
        let scale = (self.channels as f32).sqrt().recip();
-        let scores = luminal_cuda_lite::kernel::matmul_2d_t(q, k) * scale;
+        let scores = q.matmul(k.t()) * scale;
        let attn_w = scores.softmax(1);
-        // attn_w is (HW, HW) row-major, v is (HW, C) row-major; plain matmul.
-        let attn = luminal_cuda_lite::kernel::matmul_2d(attn_w, v);
+        let attn = attn_w.matmul(v);

-        let out = luminal_cuda_lite::kernel::linear_bias(attn, self.to_out_w, self.to_out_b);
+        let out = linear_bias(attn, self.to_out_w, self.to_out_b);
        // (H*W, C) -> (C, H*W) -> (C, H, W)
        let out = out.transpose(0, 1).split_dims(1, w);
        residual + out
@@ -500,3 +535,338 @@ impl VaeDecoder {
        conv2d_bias(x, self.conv_out_w, self.conv_out_b, 3, 1, 1)
    }
 }
+
+#[cfg(test)]
+mod tests {
+    use luminal::hlir::CustomOpKind;
+    use luminal_cuda_lite::{cudarc::driver::CudaContext, runtime::CudaRuntime};
+
+    use super::*;
+
+    fn assert_no_custom_ops(cx: &Graph) {
+        assert!(
+            cx.custom_ops.is_empty(),
+            "Flux2 VAE helpers should use pure HLIR, not registered CustomOp wrappers"
+        );
+        let custom_nodes: Vec<_> = cx
+            .graph
+            .node_indices()
+            .filter(|&node| cx.try_get_op::<CustomOpKind>(node).is_some())
+            .collect();
+        assert!(
+            custom_nodes.is_empty(),
+            "Flux2 VAE graph contains CustomOpKind nodes: {custom_nodes:?}"
+        );
+    }
+
+    #[test]
+    fn vae_helpers_use_no_custom_ops() {
+        let mut cx = Graph::default();
+
+        let x = cx.named_tensor("x", (2usize, 3usize, 3usize));
+        let conv_w = cx.named_tensor("conv_w", (4usize, 2usize * 3 * 3));
+        let conv_b = cx.named_tensor("conv_b", 4usize);
+        let _ = conv2d_bias(x, conv_w, conv_b, 3, 1, 1).output();
+
+        let lin_x = cx.named_tensor("lin_x", (2usize, 3usize));
+        let lin_w = cx.named_tensor("lin_w", (4usize, 3usize));
+        let lin_b = cx.named_tensor("lin_b", 4usize);
+        let _ = linear_bias(lin_x, lin_w, lin_b).output();
+
+        assert_no_custom_ops(&cx);
+    }
+
+    struct Conv2dCase {
+        c_in: usize,
+        h: usize,
+        w: usize,
+        c_out: usize,
+        kernel: usize,
+        stride: usize,
+        padding: usize,
+    }
+
+    fn reference_conv2d_bias(
+        input: &[f32],
+        weight: &[f32],
+        bias: &[f32],
+        case: Conv2dCase,
+    ) -> Vec<f32> {
+        let Conv2dCase {
+            c_in,
+            h,
+            w,
+            c_out,
+            kernel,
+            stride,
+            padding,
+        } = case;
+        let h_out = (h + 2 * padding - kernel) / stride + 1;
+        let w_out = (w + 2 * padding - kernel) / stride + 1;
+        let mut out = vec![0.0_f32; c_out * h_out * w_out];
+        for co in 0..c_out {
+            for oy in 0..h_out {
+                for ox in 0..w_out {
+                    let mut acc = bias[co];
+                    for ci in 0..c_in {
+                        for ky in 0..kernel {
+                            for kx in 0..kernel {
+                                let iy_padded = oy * stride + ky;
+                                let ix_padded = ox * stride + kx;
+                                if iy_padded < padding || ix_padded < padding {
+                                    continue;
+                                }
+                                let iy = iy_padded - padding;
+                                let ix = ix_padded - padding;
+                                if iy >= h || ix >= w {
+                                    continue;
+                                }
+                                let input_idx = ci * h * w + iy * w + ix;
+                                let weight_idx = co * c_in * kernel * kernel
+                                    + ci * kernel * kernel
+                                    + ky * kernel
+                                    + kx;
+                                acc += input[input_idx] * weight[weight_idx];
+                            }
+                        }
+                    }
+                    out[co * h_out * w_out + oy * w_out + ox] = acc;
+                }
+            }
+        }
+        out
+    }
+
+    fn assert_close(actual: &[f32], expected: &[f32]) {
+        assert_eq!(actual.len(), expected.len());
+        for (idx, (a, e)) in actual.iter().zip(expected).enumerate() {
+            assert!(
+                (*a - *e).abs() < 1e-4,
+                "value mismatch at {idx}: got {a}, expected {e}"
+            );
+        }
+    }
+
+    fn reference_nearest_upsample_2x(input: &[f32], c: usize, h: usize, w: usize) -> Vec<f32> {
+        let mut out = vec![0.0_f32; c * h * 2 * w * 2];
+        for ci in 0..c {
+            for y in 0..h {
+                for x in 0..w {
+                    let value = input[ci * h * w + y * w + x];
+                    for dy in 0..2 {
+                        for dx in 0..2 {
+                            let oy = y * 2 + dy;
+                            let ox = x * 2 + dx;
+                            out[ci * h * 2 * w * 2 + oy * w * 2 + ox] = value;
+                        }
+                    }
+                }
+            }
+        }
+        out
+    }
+
+    struct GroupNormCase {
+        c: usize,
+        h: usize,
+        w: usize,
+        num_groups: usize,
+        eps: f32,
+    }
+
+    fn reference_group_norm(
+        input: &[f32],
+        weight: &[f32],
+        bias: &[f32],
+        case: GroupNormCase,
+    ) -> Vec<f32> {
+        let GroupNormCase {
+            c,
+            h,
+            w,
+            num_groups,
+            eps,
+        } = case;
+        let group_size = c / num_groups;
+        let group_volume = group_size * h * w;
+        let mut out = vec![0.0_f32; input.len()];
+        for group in 0..num_groups {
+            let c_start = group * group_size;
+            let mut mean = 0.0_f32;
+            for ci in c_start..c_start + group_size {
+                for idx in 0..h * w {
+                    mean += input[ci * h * w + idx];
+                }
+            }
+            mean /= group_volume as f32;
+
+            let mut variance = 0.0_f32;
+            for ci in c_start..c_start + group_size {
+                for idx in 0..h * w {
+                    let centered = input[ci * h * w + idx] - mean;
+                    variance += centered * centered;
+                }
+            }
+            variance /= group_volume as f32;
+            let inv_std = (variance + eps).sqrt().recip();
+
+            for ci in c_start..c_start + group_size {
+                for idx in 0..h * w {
+                    let flat = ci * h * w + idx;
+                    out[flat] = (input[flat] - mean) * inv_std * weight[ci] + bias[ci];
+                }
+            }
+        }
+        out
+    }
+
+    #[test]
+    fn conv2d_bias_matches_reference() {
+        let mut cx = Graph::default();
+        let input_t = cx.named_tensor("input", (2usize, 3usize, 3usize));
+        let weight_t = cx.named_tensor("weight", (2usize, 2usize * 3 * 3));
+        let bias_t = cx.named_tensor("bias", 2usize);
+        let out = conv2d_bias(input_t, weight_t, bias_t, 3, 1, 1).output();
+
+        let input: Vec<f32> = (0..18).map(|i| i as f32 * 0.1 - 0.7).collect();
+        let weight: Vec<f32> = (0..36).map(|i| (i as f32 % 7.0) * 0.05 - 0.15).collect();
+        let bias = vec![0.25_f32, -0.5_f32];
+        let expected = reference_conv2d_bias(
+            &input,
+            &weight,
+            &bias,
+            Conv2dCase {
+                c_in: 2,
+                h: 3,
+                w: 3,
+                c_out: 2,
+                kernel: 3,
+                stride: 1,
+                padding: 1,
+            },
+        );
+
+        cx.build_search_space::<NativeRuntime>();
+        let mut rt = cx.search(NativeRuntime::default(), 1);
+        rt.set_data(input_t, input);
+        rt.set_data(weight_t, weight);
+        rt.set_data(bias_t, bias);
+        rt.execute(&cx.dyn_map);
+
+        assert_close(rt.get_f32(out.id), &expected);
+    }
+
+    #[test]
+    fn nearest_upsample_2x_matches_reference_native() {
+        let mut cx = Graph::default();
+        let input_t = cx.named_tensor("input", (2usize, 3usize, 4usize));
+        let out = nearest_upsample_2x(input_t).output();
+
+        let input: Vec<f32> = (0..2 * 3 * 4).map(|i| i as f32 - 11.0).collect();
+        let expected = reference_nearest_upsample_2x(&input, 2, 3, 4);
+
+        cx.build_search_space::<NativeRuntime>();
+        let mut rt = cx.search(NativeRuntime::default(), 1);
+        rt.set_data(input_t, input);
+        rt.execute(&cx.dyn_map);
+
+        assert_close(rt.get_f32(out.id), &expected);
+    }
+
+    #[test]
+    fn nearest_upsample_2x_matches_reference_cuda() {
+        let Ok(ctx) = CudaContext::new(0) else {
+            return;
+        };
+        ctx.bind_to_thread().unwrap();
+
+        let mut cx = Graph::default();
+        let input_t = cx.named_tensor("input", (2usize, 3usize, 4usize));
+        let out = nearest_upsample_2x(input_t).output();
+
+        let input: Vec<f32> = (0..2 * 3 * 4).map(|i| i as f32 - 11.0).collect();
+        let expected = reference_nearest_upsample_2x(&input, 2, 3, 4);
+
+        cx.build_search_space::<CudaRuntime>();
+        let mut rt = CudaRuntime::initialize(ctx.default_stream());
+        rt.set_data(input_t, input);
+        rt = cx.search(rt, 1);
+        rt.execute(&cx.dyn_map);
+
+        assert_close(&rt.get_f32(out.id), &expected);
+    }
+
+    #[test]
+    fn group_norm_matches_reference_native() {
+        let mut cx = Graph::default();
+        let input_t = cx.named_tensor("input", (4usize, 2usize, 3usize));
+        let weight_t = cx.named_tensor("weight", 4usize);
+        let bias_t = cx.named_tensor("bias", 4usize);
+        let out = group_norm(input_t, weight_t, bias_t, 2, 1e-6).output();
+
+        let input: Vec<f32> = (0..4 * 2 * 3).map(|i| i as f32 * 0.2 - 2.0).collect();
+        let weight = vec![0.7_f32, -0.2, 1.3, 0.5];
+        let bias = vec![0.1_f32, -0.3, 0.4, -0.6];
+        let expected = reference_group_norm(
+            &input,
+            &weight,
+            &bias,
+            GroupNormCase {
+                c: 4,
+                h: 2,
+                w: 3,
+                num_groups: 2,
+                eps: 1e-6,
+            },
+        );
+
+        cx.build_search_space::<NativeRuntime>();
+        let mut rt = cx.search(NativeRuntime::default(), 1);
+        rt.set_data(input_t, input);
+        rt.set_data(weight_t, weight);
+        rt.set_data(bias_t, bias);
+        rt.execute(&cx.dyn_map);
+
+        assert_close(rt.get_f32(out.id), &expected);
+    }
+
+    #[test]
+    fn group_norm_matches_reference_cuda() {
+        let Ok(ctx) = CudaContext::new(0) else {
+            return;
+        };
+        ctx.bind_to_thread().unwrap();
+
+        let mut cx = Graph::default();
+        let input_t = cx.named_tensor("input", (4usize, 2usize, 3usize));
+        let weight_t = cx.named_tensor("weight", 4usize);
+        let bias_t = cx.named_tensor("bias", 4usize);
+        let out = group_norm(input_t, weight_t, bias_t, 2, 1e-6).output();
+
+        let input: Vec<f32> = (0..4 * 2 * 3).map(|i| i as f32 * 0.2 - 2.0).collect();
+        let weight = vec![0.7_f32, -0.2, 1.3, 0.5];
+        let bias = vec![0.1_f32, -0.3, 0.4, -0.6];
+        let expected = reference_group_norm(
+            &input,
+            &weight,
+            &bias,
+            GroupNormCase {
+                c: 4,
+                h: 2,
+                w: 3,
+                num_groups: 2,
+                eps: 1e-6,
+            },
+        );
+
+        cx.build_search_space::<CudaRuntime>();
+        let mut rt = CudaRuntime::initialize(ctx.default_stream());
+        rt.set_data(input_t, input);
+        rt.set_data(weight_t, weight);
+        rt.set_data(bias_t, bias);
+        rt = cx.search(rt, 1);
+        rt.execute(&cx.dyn_map);
+
+        assert_close(&rt.get_f32(out.id), &expected);
+    }
+}
--- a/examples/gemma4_moe/Cargo.toml
+++ b/examples/gemma4_moe/Cargo.toml
@@ -11,6 +11,7 @@ luminal_nn = { path = "../../crates/luminal_nn" }
 luminal_cuda_lite = { path = "../../crates/luminal_cuda_lite" }
 tokenizers = "0.22.2"
 rustc-hash = "2"
+rand = "0.9.2"

 # HuggingFace model download
 hf-hub = { version = "0.4", default-features = false, features = ["rustls-tls", "ureq"] }
--- a/examples/gemma4_moe/src/main.rs
+++ b/examples/gemma4_moe/src/main.rs
@@ -5,11 +5,13 @@ use hf::prepare_hf_model;
 use luminal::prelude::*;
 use luminal_cuda_lite::{cudarc::driver::CudaContext, runtime::CudaRuntime};
 use model::*;
+use rand::{SeedableRng, rngs::SmallRng};
 use rustc_hash::FxHashSet;
 use std::{io::Write, time::Duration};
 use tokenizers::Tokenizer;

 const REPO_ID: &str = "google/gemma-4-26B-A4B";
+const SEARCH_SEED: u64 = 0;

 fn env_bool(name: &str) -> bool {
    std::env::var(name)
@@ -78,7 +80,12 @@ fn main() {
    cx.set_dim('p', 0);
    runtime.set_data(input, vec![1; search_s]);
    runtime.set_data(pos_ids, (0..search_s as i32).collect::<Vec<_>>());
-    runtime = cx.search(runtime, search_graphs);
+    let mut rng = SmallRng::seed_from_u64(SEARCH_SEED);
+    runtime = cx.search_options(
+        runtime,
+        SearchOptions::new(search_graphs).profile_timeout(Duration::from_secs(2)),
+        &mut rng,
+    );

    for layer in 0..LAYERS {
        let cache_bytes = cache_bytes_for_layer(layer, max_seq_len);
--- a/examples/yolo_v11/README.md
+++ b/examples/yolo_v11/README.md
@@ -10,7 +10,7 @@ examples/yolo_v11/
 ├── Cargo.toml              # Rust crate (binary: yolo_v11)
 ├── src/
 │   ├── main.rs             # Full forward, NMS, and annotated image output
-│   └── model.rs            # YOLO v11n architecture in luminal IR
+│   └── model.rs            # YOLO v11n architecture in pure HLIR
 ├── python/
 │   ├── reference.py        # PyTorch eager reference + weight prep
 │   └── luminal_example.py  # torch.compile(..., backend=luminal_backend) demo
@@ -77,6 +77,8 @@ examples/yolo_v11/
 * All Conv blocks are loaded with `bn` folded into a bias-augmented Conv2d
  (`forward_fuse`), so the saved tensors are just `<layer>.conv.weight` and
  `<layer>.conv.bias`.
+* The model computation is specified as pure HLIR tensor algebra. The YOLO graph
+  does not register `Graph::custom_op` wrappers or insert `CustomOpKind` nodes.
 * The `C3k2`, `C3k`, `C2PSA`, and `Attention` modules in PyTorch use
  `tensor.chunk(2, dim=1)` (or `qkv.split([...], dim=...)`) to produce two/three
  channel-slices that then take separate paths. Slicing followed by a residual
@@ -91,8 +93,8 @@ examples/yolo_v11/
  non-contiguous view via `gather + iota` (the same trick `GraphTensor::output`
  uses internally). It's applied wherever an op chain produces a strided view
  that the next op needs to read sequentially.
-* 1x1 convolutions skip the unfold path and use a direct 2D matmul, so
-  luminal_cuda_lite's `TileMatmulFullSplit` kernel can match.
+* 1x1 convolutions skip the unfold path and use a plain HLIR matmul over
+  flattened spatial positions.

 ## Known limitation: full-model compile time

--- a/examples/yolo_v11/src/model.rs
+++ b/examples/yolo_v11/src/model.rs
@@ -99,7 +99,7 @@ impl Conv {

    /// Apply the convolution + bias (no activation). Closely mirrors the
    /// pt2-translator's `conv_unfold` so it exercises the same tested code
-    /// paths in the luminal e-graph. Special-cases 1x1 convs to a plain matmul
+    /// paths in the luminal e-graph. Special-cases 1x1 convs to an HLIR matmul
    /// (no unfold) since they don't need spatial windowing.
    pub fn forward_no_act(&self, x: GraphTensor) -> GraphTensor {
        let x = canonicalize_static_shape(x);
@@ -179,9 +179,8 @@ impl Conv {
        self.forward_no_act(x).silu()
    }

-    /// Specialized 1x1 conv path: equivalent to a per-spatial matmul with no
-    /// unfold and no padding. Uses a 2D matmul so the e-graph can match
-    /// luminal_cuda_lite's TileMatmulFullSplit specialization.
+    /// Specialized 1x1 conv path: equivalent to a per-spatial HLIR matmul with
+    /// no unfold and no padding.
    fn forward_1x1(&self, x: GraphTensor) -> GraphTensor {
        // x: (1, c_in, H, W) -> drop batch dim, then permute to (H, W, c_in)
        let dims = x.dims();
@@ -189,7 +188,7 @@ impl Conv {
        let w = dims[3];
        let x = x.squeeze(0); // (c_in, H, W)
        let xt = x.permute(&[1, 2, 0]); // (H, W, c_in)
-        // 2D matmul matches the specialized kernel in cuda_lite.
+        // Pure HLIR matmul over flattened spatial positions.
        let xt = xt.merge_dims(0, 1); // (H*W, c_in)
        let out = xt.matmul(self.weight.t()); // (H*W, c_out)
        let out = out.split_dims(0, w); // (H, W, c_out)
@@ -1054,3 +1053,33 @@ pub fn make_anchors_and_strides(
 pub fn dfl_weight() -> Vec<f32> {
    (0..REG_MAX as i32).map(|i| i as f32).collect()
 }
+
+#[cfg(test)]
+mod tests {
+    use luminal::hlir::CustomOpKind;
+
+    use super::*;
+
+    #[test]
+    fn yolo_forward_graph_uses_no_custom_ops() {
+        let mut cx = Graph::default();
+        let img = cx.named_tensor("input.image", (1usize, 3usize, IMG_SIZE, IMG_SIZE));
+        let yolo = YoloV11::init(&mut cx);
+        let _ = yolo.forward(img).output();
+
+        assert!(
+            cx.custom_ops.is_empty(),
+            "YOLO should express model computation in pure HLIR, not registered CustomOp wrappers"
+        );
+
+        let custom_nodes: Vec<_> = cx
+            .graph
+            .node_indices()
+            .filter(|&node| cx.try_get_op::<CustomOpKind>(node).is_some())
+            .collect();
+        assert!(
+            custom_nodes.is_empty(),
+            "YOLO graph contains CustomOpKind HLIR nodes: {custom_nodes:?}"
+        );
+    }
+}
--- a/src/egglog_utils/mod.rs
+++ b/src/egglog_utils/mod.rs
@@ -1620,7 +1620,56 @@ pub fn extract_expr<'a>(

 pub type EGraphChoiceSet<'a> = FxHashMap<&'a ClassId, &'a NodeId>;

-/// Count the total number of possible IR/IList choice sets, capped at `limit`.
+fn is_search_choice_eclass(label: &str) -> bool {
+    label.contains("IR") || label.contains("IList") || label.contains("OpKind")
+}
+
+fn extractor_list_len(egraph: &SerializedEGraph, eclass_id: &ClassId) -> Option<usize> {
+    let mut len = 0usize;
+    let mut cur_eclass: ClassId = eclass_id.clone();
+    let mut visited: FxHashSet<ClassId> = FxHashSet::default();
+    loop {
+        if !visited.insert(cur_eclass.clone()) {
+            return None;
+        }
+        let (label, enodes) = egraph.eclasses.get(&cur_eclass)?;
+        if !label.contains("List") {
+            return Some(len);
+        }
+        let head_enode = enodes.first()?;
+        let head_label = &egraph.enodes[head_enode].0;
+        if head_label == "ENil" || head_label == "INil" {
+            return Some(len);
+        }
+        if head_label != "ECons" && head_label != "ICons" {
+            return Some(len);
+        }
+        len += 1;
+        let children = &egraph.enodes[head_enode].1;
+        if children.len() < 2 {
+            return Some(len);
+        }
+        cur_eclass = children[1].clone();
+    }
+}
+
+fn opkind_metadata_consistent(egraph: &SerializedEGraph, node: &NodeId) -> bool {
+    let lens: Vec<usize> = egraph.enodes[node]
+        .1
+        .iter()
+        .filter_map(|c| {
+            let lbl = &egraph.eclasses[c].0;
+            if lbl.contains("List") {
+                extractor_list_len(egraph, c)
+            } else {
+                None
+            }
+        })
+        .collect();
+    lens.is_empty() || lens.iter().all(|l| *l == lens[0])
+}
+
+/// Count the total number of possible searchable choice sets, capped at `limit`.
 ///
 /// Search deduplicates candidates by `EGraphChoiceSet`, so this gives the exact
 /// number of candidates when it is below `limit` without risking overflow on
@@ -1632,7 +1681,7 @@ pub fn count_choice_sets_up_to(egraph: &SerializedEGraph, limit: usize) -> usize

    let mut count = 1usize;
    for (label, enodes) in egraph.eclasses.values() {
-        if !label.contains("IR") && !label.contains("IList") {
+        if !is_search_choice_eclass(label) {
            continue;
        }

@@ -1650,10 +1699,10 @@ pub fn random_initial_choice<'a>(
 ) -> EGraphChoiceSet<'a> {
    let mut choices = FxHashMap::default();
    for (eclass, (label, enodes)) in &egraph.eclasses {
-        if !label.contains("IR") && !label.contains("IList") {
+        if !is_search_choice_eclass(label) {
            continue;
        }
-        // Prefer synth-injected enodes when available — they point at
+        // Use synth-injected enodes when available — they point at
        // deterministic single-variant kind eclasses produced by the
        // deep-clone fallback in `inject_kernel_alternatives`, so the
        // extractor's first-enode walk is guaranteed length-consistent.
@@ -1667,7 +1716,18 @@ pub fn random_initial_choice<'a>(
            .enumerate()
            .filter_map(|(i, n)| n.as_ref().starts_with("synth_").then_some(i))
            .collect();
-        let pick_idx = if !synth_indices.is_empty() {
+        let consistent_opkind_indices: Vec<usize> = if label == "OpKind" {
+            enodes
+                .iter()
+                .enumerate()
+                .filter_map(|(i, n)| opkind_metadata_consistent(egraph, n).then_some(i))
+                .collect()
+        } else {
+            Vec::new()
+        };
+        let pick_idx = if !consistent_opkind_indices.is_empty() {
+            consistent_opkind_indices[rng.random_range(0..consistent_opkind_indices.len())]
+        } else if !synth_indices.is_empty() {
            synth_indices[rng.random_range(0..synth_indices.len())]
        } else {
            rng.random_range(0..enodes.len())
@@ -1684,9 +1744,9 @@ pub fn validate_choice_set<'a>(
    choices: &EGraphChoiceSet<'a>,
    ops: &[Arc<Box<dyn EgglogOp>>],
 ) -> Result<(), String> {
-    // Check all IR/IList eclasses have a choice
+    // Check all searchable eclasses have a choice.
    for (eclass, (label, enodes)) in &egraph.eclasses {
-        if !label.contains("IR") && !label.contains("IList") {
+        if !is_search_choice_eclass(label) {
            continue;
        }
        let Some(chosen) = choices.get(eclass) else {
@@ -1719,7 +1779,7 @@ pub fn validate_choice_set<'a>(
                .eclasses
                .get(ch)
                .ok_or_else(|| format!("Eclass {} not found", ch.as_ref()))?;
-            if label.contains("IR") || label.contains("IList") {
+            if is_search_choice_eclass(label) {
                let n = choices
                    .get(ch)
                    .ok_or_else(|| format!("No choice for reachable eclass {}", ch.as_ref()))?;
@@ -1745,14 +1805,12 @@ pub fn validate_choice_set<'a>(
        if op_name == "Op" {
            // Normalized op — check OpKind child
            if let Some(kind_eclass) = children.first() {
-                if let Some((_, kind_enodes)) = egraph.eclasses.get(kind_eclass) {
-                    if let Some(kn) = kind_enodes.first() {
-                        let kind_name = &egraph.enodes[kn].0;
-                        if kind_name != "CustomOpKind"
-                            && !ops.iter().any(|op| op.sort().name == *kind_name)
-                        {
-                            return Err(format!("No extractor for OpKind {kind_name}"));
-                        }
+                if let Some(kn) = choices.get(kind_eclass) {
+                    let kind_name = &egraph.enodes[kn].0;
+                    if kind_name != "CustomOpKind"
+                        && !ops.iter().any(|op| op.sort().name == *kind_name)
+                    {
+                        return Err(format!("No extractor for OpKind {kind_name}"));
                    }
                }
            }
@@ -1813,9 +1871,7 @@ pub fn extract_generation<'a>(
    let mutable_classes: Vec<&ClassId> = egraph
        .eclasses
        .iter()
-        .filter(|(_, (label, enodes))| {
-            (label.contains("IR") || label.contains("IList")) && enodes.len() > 1
-        })
+        .filter(|(_, (label, enodes))| is_search_choice_eclass(label) && enodes.len() > 1)
        .map(|(class_id, _)| class_id)
        .collect();

@@ -1848,9 +1904,21 @@ pub fn extract_generation<'a>(
        for _ in 0..rng.random_range(1..=mutations_per_generation) {
            // Pick a random mutable eclass
            let class_id = mutable_classes[rng.random_range(0..mutable_classes.len())];
-            let (_, enodes) = &egraph.eclasses[class_id];
+            let (label, enodes) = &egraph.eclasses[class_id];
            // Pick a random enode for this class
-            let new_node = &enodes[rng.random_range(0..enodes.len())];
+            let consistent_opkind_nodes: Vec<&NodeId> = if label == "OpKind" {
+                enodes
+                    .iter()
+                    .filter(|n| opkind_metadata_consistent(egraph, n))
+                    .collect()
+            } else {
+                Vec::new()
+            };
+            let new_node = if !consistent_opkind_nodes.is_empty() {
+                consistent_opkind_nodes[rng.random_range(0..consistent_opkind_nodes.len())]
+            } else {
+                &enodes[rng.random_range(0..enodes.len())]
+            };
            // Insert returns the previous binding (if any); fold the diff
            // into the running hash. If the new pick equals the old one,
            // the two XORs cancel and `child_hash` is unchanged — exactly
@@ -1932,7 +2000,7 @@ pub fn egglog_to_llir_from_root<'a>(
    let mut reachability_stack = vec![choices[root_class]];
    while let Some(r) = reachability_stack.pop() {
        for ch in &egraph.enodes[r].1 {
-            if egraph.eclasses[ch].0.contains("IR") || egraph.eclasses[ch].0.contains("IList") {
+            if is_search_choice_eclass(&egraph.eclasses[ch].0) {
                let n = choices[ch];
                if !reachable.contains(n) {
                    reachability_stack.push(n);
@@ -1968,69 +2036,19 @@ pub fn egglog_to_llir_from_root<'a>(
            // structurally-equivalent kind enodes whose ELIST children
            // were unioned but resolve (under the extractor's first-enode
            // walk) to inconsistent lengths — picking such an enode causes
-            // a downstream `flatten_strides` length mismatch. Prefer the
-            // first kind enode whose ELIST children all walk to the same
-            // length; fall back to the original first enode if no
-            // consistent candidate exists (rare; only happens for ops
-            // outside the runnable subgraph).
+            // a downstream `flatten_strides` length mismatch. Candidate
+            // generation filters these out where possible; this fallback is
+            // structural only and does not rank backend implementations.
            let kind_enodes = &egraph.eclasses[kind_eclass].1;
-            let extractor_length = |eclass_id: &ClassId| -> Option<usize> {
-                let mut len = 0usize;
-                let mut cur_eclass: ClassId = eclass_id.clone();
-                let mut visited: FxHashSet<ClassId> = FxHashSet::default();
-                loop {
-                    if !visited.insert(cur_eclass.clone()) {
-                        return None;
-                    }
-                    let (label, enodes) = egraph.eclasses.get(&cur_eclass)?;
-                    if !label.contains("List") {
-                        return Some(len);
-                    }
-                    let head_enode = enodes.first()?;
-                    let head_label = &egraph.enodes[head_enode].0;
-                    if head_label == "ENil" || head_label == "INil" {
-                        return Some(len);
-                    }
-                    if head_label != "ECons" && head_label != "ICons" {
-                        return Some(len);
-                    }
-                    len += 1;
-                    let children = &egraph.enodes[head_enode].1;
-                    if children.len() < 2 {
-                        return Some(len);
-                    }
-                    cur_eclass = children[1].clone();
-                }
-            };
-            let elist_lens_for = |n: &NodeId| -> Vec<usize> {
-                egraph.enodes[n]
-                    .1
-                    .iter()
-                    .filter_map(|c| {
-                        let lbl = &egraph.eclasses[c].0;
-                        if lbl.contains("List") {
-                            extractor_length(c)
-                        } else {
-                            None
-                        }
-                    })
-                    .collect()
-            };
-            let is_consistent = |n: &NodeId| -> bool {
-                let lens = elist_lens_for(n);
-                lens.is_empty() || lens.iter().all(|l| *l == lens[0])
-            };
-            let is_kernel = |n: &NodeId| -> bool {
-                let l = &egraph.enodes[n].0;
-                l.starts_with("Kernel") || l.starts_with("Fused")
-            };
-            // Prefer a consistent kernel kind; then any consistent;
-            // then any kernel; then fall back to first.
-            let kind_enode = kind_enodes
-                .iter()
-                .find(|n| is_kernel(n) && is_consistent(n))
-                .or_else(|| kind_enodes.iter().find(|n| is_consistent(n)))
-                .or_else(|| kind_enodes.iter().find(|n| is_kernel(n)))
+            let kind_enode = choices
+                .get(kind_eclass)
+                .copied()
+                .filter(|n| opkind_metadata_consistent(egraph, n))
+                .or_else(|| {
+                    kind_enodes
+                        .iter()
+                        .find(|n| opkind_metadata_consistent(egraph, n))
+                })
                .unwrap_or(&kind_enodes[0]);
            let kind_label = &egraph.enodes[kind_enode].0;

@@ -2039,8 +2057,7 @@ pub fn egglog_to_llir_from_root<'a>(
                .1
                .iter()
                .map(|c| {
-                    if egraph.eclasses[c].0.contains("IR") || egraph.eclasses[c].0.contains("IList")
-                    {
+                    if is_search_choice_eclass(&egraph.eclasses[c].0) {
                        choices[c]
                    } else {
                        &egraph.eclasses[c].1[0]
@@ -2085,8 +2102,7 @@ pub fn egglog_to_llir_from_root<'a>(
                .1
                .iter()
                .map(|c| {
-                    if egraph.eclasses[c].0.contains("IR") || egraph.eclasses[c].0.contains("IList")
-                    {
+                    if is_search_choice_eclass(&egraph.eclasses[c].0) {
                        choices[c]
                    } else {
                        &egraph.eclasses[c].1[0]
@@ -2165,10 +2181,11 @@ mod tests {
        let egraph = egraph(vec![
            eclass("a", "IR", 2),
            eclass("b", "IList", 3),
+            eclass("op", "OpKind", 5),
            eclass("c", "Shape", 99),
        ]);

-        assert_eq!(count_choice_sets_up_to(&egraph, 100), 6);
+        assert_eq!(count_choice_sets_up_to(&egraph, 100), 30);
    }

    #[test]
--- a/src/graph.rs
+++ b/src/graph.rs
@@ -173,15 +173,16 @@ impl BuildSearchSpaceOptions {
 pub struct SearchOptions {
    /// Maximum number of graphs to evaluate
    pub limit: usize,
-    /// Number of offspring per generation (default: 30)
+    /// Number of offspring per generation (default: 10)
    pub generation_size: usize,
-    /// Number of mutations applied to each offspring (default: 30)
+    /// Number of mutations applied to each offspring (default: 10)
    pub mutations: usize,
    /// Number of profiling trials per candidate (default: 3)
    pub trials: usize,
    /// Number of best genomes to keep as parents per generation (default: 1)
    pub keep_best: usize,
-    /// Optional per-candidate profiling timeout.
+    /// Per-candidate profiling timeout. If a profile call reaches this budget,
+    /// that candidate is discarded and search continues.
    pub profile_timeout: Option<std::time::Duration>,
    /// Optional per-group search timeout.
    pub group_timeout: Option<std::time::Duration>,
@@ -194,11 +195,11 @@ impl SearchOptions {
    pub fn new(limit: usize) -> Self {
        Self {
            limit,
-            generation_size: 30,
-            mutations: 30,
+            generation_size: 10,
+            mutations: 10,
            trials: 3,
            keep_best: 1,
-            profile_timeout: None,
+            profile_timeout: Some(std::time::Duration::from_secs(1)),
            group_timeout: None,
            profile_dims: FxHashMap::default(),
        }
@@ -315,6 +316,27 @@ fn maybe_dump_selected_llir(label: &str, dyn_map: &FxHashMap<char, usize>, llir:
    }
 }

+fn random_choice_generation<'a, G: rand::Rng>(
+    egraph: &'a SerializedEGraph,
+    generation_size: usize,
+    prev_selected: &mut FxHashSet<u64>,
+    rng: &mut G,
+) -> Vec<crate::egglog_utils::EGraphChoiceSet<'a>> {
+    let mut generation = Vec::with_capacity(generation_size);
+    let max_attempts = generation_size.saturating_mul(100);
+    let mut attempts = 0;
+
+    while generation.len() < generation_size && attempts < max_attempts {
+        attempts += 1;
+        let genome = random_initial_choice(egraph, rng);
+        if prev_selected.insert(hash_choice_set(&genome)) {
+            generation.push(genome);
+        }
+    }
+
+    generation
+}
+
 /// A Luminal compute graph.
 ///
 /// All computation is represented as a directed acyclic graph.
@@ -1347,6 +1369,11 @@ impl Graph {
        let mut list_cache = FxHashMap::default();
        let mut expr_cache = FxHashMap::default();
        runtime.clear_intermediate_buffers();
+        let profile_timed_out = |elapsed: std::time::Duration| {
+            options
+                .profile_timeout
+                .is_some_and(|timeout| elapsed >= timeout)
+        };

        // Find a viable initial genome (may need multiple attempts if some panic)
        let (mut best_genome, mut best_metric, display, mut n_graphs);
@@ -1385,13 +1412,15 @@ impl Graph {
                // unrolled graph size.
                collapse_loops_to_first_iter(&mut graph);
                runtime.clear_intermediate_buffers();
+                let profile_start = std::time::Instant::now();
                let (rep_metric, rep_display) = runtime.profile(
                    &graph,
                    &profile_dyn_map,
                    options.trials,
                    options.profile_timeout,
                );
-                let has_nan = runtime.has_nan_outputs(&graph, &profile_dyn_map);
+                let timed_out = profile_timed_out(profile_start.elapsed());
+                let has_nan = !timed_out && runtime.has_nan_outputs(&graph, &profile_dyn_map);
                (
                    rep_metric,
                    append_memory_display(
@@ -1401,11 +1430,12 @@ impl Graph {
                        runtime.allocated_intermediate_buffer_bytes(),
                    ),
                    has_nan,
+                    timed_out,
                )
            }));

            match result {
-                Ok((metric, disp, false)) => {
+                Ok((metric, disp, false, false)) => {
                    best_genome = genome;
                    best_metric = R::aggregate_profile_metrics(&[metric]);
                    display = disp;
@@ -1435,6 +1465,7 @@ impl Graph {
        // Track top-N parents for offspring generation
        let mut parents: Vec<(R::ProfileMetric, crate::egglog_utils::EGraphChoiceSet<'_>)> =
            vec![(best_metric.clone(), best_genome.clone())];
+        let mut resample_generation = false;

        while n_graphs < search_limit {
            if options
@@ -1446,26 +1477,33 @@ impl Graph {

            // Generate offspring from all parents, dividing budget evenly
            let budget = (search_limit - n_graphs).min(options.generation_size);
-            let per_parent = budget.div_ceil(parents.len());
-            let mut all_offspring = Vec::new();
-            for (_, parent_genome) in &parents {
-                let remaining = budget.saturating_sub(all_offspring.len());
-                if remaining == 0 {
-                    break;
+            let all_offspring = if resample_generation {
+                random_choice_generation(egraph, budget, &mut prev_selected, rng)
+            } else {
+                let per_parent = budget.div_ceil(parents.len());
+                let mut offspring = Vec::new();
+                for (_, parent_genome) in &parents {
+                    let remaining = budget.saturating_sub(offspring.len());
+                    if remaining == 0 {
+                        break;
+                    }
+                    offspring.extend(extract_generation(
+                        egraph,
+                        parent_genome,
+                        per_parent.min(remaining),
+                        options.mutations,
+                        &mut prev_selected,
+                        rng,
+                    ));
                }
-                all_offspring.extend(extract_generation(
-                    egraph,
-                    parent_genome,
-                    per_parent.min(remaining),
-                    options.mutations,
-                    &mut prev_selected,
-                    rng,
-                ));
-            }
+                offspring
+            };
            if all_offspring.is_empty() {
                break;
            }

+            let mut generation_found_non_timeout = false;
+
            for genome in all_offspring {
                if options
                    .group_timeout
@@ -1502,13 +1540,16 @@ impl Graph {
                    // before profiling — see initial-genome path.
                    collapse_loops_to_first_iter(&mut llir_graph);
                    runtime.clear_intermediate_buffers();
+                    let profile_start = std::time::Instant::now();
                    let (rep_metric, rep_display) = runtime.profile(
                        &llir_graph,
                        &profile_dyn_map,
                        options.trials,
                        options.profile_timeout,
                    );
-                    let has_nan = runtime.has_nan_outputs(&llir_graph, &profile_dyn_map);
+                    let timed_out = profile_timed_out(profile_start.elapsed());
+                    let has_nan =
+                        !timed_out && runtime.has_nan_outputs(&llir_graph, &profile_dyn_map);
                    (
                        rep_metric,
                        append_memory_display(
@@ -1518,15 +1559,28 @@ impl Graph {
                            runtime.allocated_intermediate_buffer_bytes(),
                        ),
                        has_nan,
+                        timed_out,
                    )
                }));

                let (new_metric, display_metric) = match profile_result {
-                    Ok((metric, display, false)) => {
+                    Ok((metric, display, false, false)) => {
+                        generation_found_non_timeout = true;
                        (R::aggregate_profile_metrics(&[metric]), display)
                    }
-                    Ok((_, _, true)) | Err(_) => {
-                        // NaN or panic — redraw bars and skip
+                    Ok((_, _, _, true)) | Err(_) => {
+                        // Timed out or panicked — redraw bars and skip.
+                        for _ in 1..n_bar_lines {
+                            print!("\x1b[1A");
+                        }
+                        print!("\r\x1b[2K");
+                        render_bars(n_graphs, search_limit, bucket_progress);
+                        std::io::stdout().flush().unwrap();
+                        continue;
+                    }
+                    Ok((_, _, true, false)) => {
+                        generation_found_non_timeout = true;
+                        // Completed profiling but produced NaNs — redraw bars and skip.
                        for _ in 1..n_bar_lines {
                            print!("\x1b[1A");
                        }
@@ -1577,6 +1631,8 @@ impl Graph {
                render_bars(n_graphs, search_limit, bucket_progress);
                std::io::stdout().flush().unwrap();
            }
+
+            resample_generation = !generation_found_non_timeout;
        }

        // Clear progress bars
Author	SHA1	Message	Date
Tucker Morgan	6a0c6321f4	luminal_python: DLRM vanilla-PyTorch fast paths + DCE + batched FFI Adds the pieces needed for torch.compile(model, backend=luminal_backend) to ingest a vanilla DLRMv1 (with nn.EmbeddingBag modules) at correctness parity (max abs diff ≤ 1.79e-07 across num_cat∈{1,2,3,4,8,16,32}) and within 0.45-0.84× of Inductor (no CG) across that sweep. Translator - aten._embedding_bag_forward_only.default: fixed-stride offsets, mode=sum, routes to the fused embedding_bag_sum_kernel under cuda - aten.index_select.default: 2-D source / dim=0 / 1-D index lowering - aten.addmm.default fast path: detect `addmm(bias, x, weight.t())` and emit luminal_cuda_lite::kernel::linear_bias with the original (N,K) weight - aten.bmm.default split out + fast path through matmul_3d_t when the RHS comes from a [0,2,1] permute - aten.sum.dim_IntList peephole: sum ← view ← index_select → embedding_bag_sum_kernel - aten.index.Tensor peephole: [None, li, lj] over bmm(cat(unsq...), perm(cat)) → dlrm_pairwise_dot_lower_tri (new kernel, ported from worktree) - aten.permute.default records 2-D [1,0] and 3-D [0,2,1] sources into `transpose_2d_source` for the addmm / bmm fast paths - node_chain extended to populate from variadic-first-input ops (cat) so the PairwiseDot peephole can walk back through cat - New `op_inputs` side-table holds all input names per node so multi-input peepholes don't need to re-scan FX nodes - Post-translation DCE pass walks back from Output HLIR sinks and drops every unreachable producer — load-bearing for the PairwiseDot peephole because the cat→bmm→permute chain it supersedes was surviving past egglog otherwise Runtime - Drop the trailing cuStreamSynchronize at end of CudaRuntime::execute (incompatible with torch.cuda.CUDAGraph capture; PyTorch syncs on read) - New batched `run_with_ptrs(input_ptrs, output_ptrs)` PyO3 method that registers all I/O device pointers and runs in one FFI crossing. For DLRM at nc=32 collapses ~7 separate PyO3 calls per iter into 1; ~22% per-iter wall-clock improvement on top of the algorithmic peepholes. Python wrapper - main.py: narrow CUDA weight tensors (i64→i32, f64→f32) before extracting device pointers — luminal only has 32-bit Int/Float, the old direct-ptr path silently read half-words. Surfaced from this DLRM work but applies to every model whose state_dict carries i64 buffers. - main.py: set LUMINAL_BACKEND_CUDA=1 around _compile_pt2 so translator fast paths gate themselves correctly (cuda-only kernels behind native). - pt2.py: thread `search_iterations` through pt2_backend so callers can override the default of 10. - compiled_model.py: CUDA path now uses run_with_ptrs. New cuda_lite kernels - dlrm_interact.rs::PairwiseDotLowerTriKernel — variadic, N (B,D) inputs, computes the strict-lower-tri F*(F-1)/2 dot products in one launch. - embedding_bag.rs::MultiTableEmbeddingBagSumKernel — K (weight, idx) pointer pairs through two packed staging buffers, produces (B, K, D) in one launch. Drop-in for vanilla DLRMv1's F separate nn.EmbeddingBag modules. Translator integration is a follow-up; the kernel currently has a `device_ptr` API miss against the in-tree cudarc that needs a one-line tuple-destructure fix before it'll compile (see follow-up). Status: - 213/213 tests in tests/test_hlir_ops.py pass. - DLRM vanilla-PT sweep matches or beats Inductor (no CG) at every nc≤16, ~tied at nc=32. Still 1.6–3.2× behind Inductor+CG (the remaining gap is PyTorch-side Python/dynamo overhead around our run_with_ptrs call; torch.cuda.graph capture of luminal's runtime is the next item). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 00:29:37 +00:00
Joe Fioti	156fac518e	Metal qwen (#327 ) * Refine Luminal graph rewrite handling * Generalize Metal scatter reuse and Qwen validation * Add Qwen safetensor size accounting * Fix Modal example imports for shared output validation * Clarify Luminal contributor guidance * Revert direct shard loading from qwen metal * Remove qwen Metal CI job * Add Metal Llama 1B CI and restore safe profiling timeouts * Fix duplicate Metal ops and tests * Fix Metal pipeline compilation on llama * Run llama Metal CI on xlarge runners * Resample search generations after timeout failures	2026-05-20 13:26:34 -04:00
Joe Fioti	a3df68bd43	Add full-modal-ready CUDA test workflows (#329 )	2026-05-20 01:13:02 -04:00
Ali	7a95e56a8b	copy_device_buffer_to_new_slice synchronizes stream unnecessarily (#322 )	2026-05-19 17:26:38 -04:00
Joe Fioti	e558ce6849	Flux2 cleanup (#319 ) * Refactor core graph and plugin interfaces * Switch examples to batched prefill * Add native-reference MoE fuzz tests * Add native MoE fuzzing and relax qwen3_moe CI check * Fix CI checks and CUDA fuzz harness * Fix llama clippy warnings and normalize fuzz seeds * Use pure HLIR for YOLO v11 model * Remove conv2d custom wrapper and use KernelConv2D rewrites * Fix conv view indexing and trim flux materializations * Skip flux CUDA tests without driver * Restrict core CI to CPU packages	2026-05-18 19:06:31 -04:00