tests for interface specification

Use parallel launches for small CUDA kernels (#315 )
* Use parallel launches for cast and iota kernels * Use parallel launch for embed kernel
2026-06-01 21:49:47 +09:00 · 2026-05-14 21:24:56 +00:00 · 2026-05-14 00:47:12 -04:00 · 2026-05-13 20:16:30 -04:00 · 2026-05-13 13:43:36 -04:00 · 2026-05-13 13:43:15 -04:00
190 changed files with 37888 additions and 12142 deletions
--- a/.agents/skills/aoti-debug/SKILL.md
+++ b/.agents/skills/aoti-debug/SKILL.md
@@ -1,130 +0,0 @@
---
-name: aoti-debug
-description: Debug AOTInductor (AOTI) errors including device mismatches, CUDA illegal memory access, segfaults, and wrong outputs when deploying compiled PyTorch models. Use when encountering errors with aoti_compile_and_package, aoti_load_package, or the deprecated aot_compile/aot_load APIs.
---
-
-# AOTInductor Debugging
-
-Debug errors when compiling and deploying PyTorch models with AOTInductor.
-
-## First Step: Always Check Device and Shape Matching
-
-**For ANY AOTI error (segfault, exception, crash, wrong output), check these first:**
-
-1. **Compile device == Load device**: The model must be loaded on the same device type it was compiled on
-2. **Input devices match**: Runtime inputs must be on the same device as the compiled model
-3. **Input shapes match**: Runtime input shapes must match compilation shapes (or satisfy dynamic shape constraints)
-
-```python
-# Compilation -- note the device and shapes
-model = MyModel().eval().cuda()
-inp = torch.randn(2, 10, device="cuda")
-pkg = torch._inductor.aoti_compile_and_package(model, (inp,))
-
-# Loading -- device type MUST match compilation
-loaded = torch._inductor.aoti_load_package(pkg)  # auto-detects device from package
-
-# Inference -- device and shapes MUST match
-out = loaded(torch.randn(2, 10, device="cuda"))  # same device, same shape
-```
-
-**AOTI requires compile and load to use the same device type.** Cross-device loading (compile on GPU, load on CPU) is NOT supported. Device index can differ (cuda:0 vs cuda:1).
-
-## Current vs Deprecated API
-
-### Current API (use this)
-```python
-torch._inductor.aoti_compile_and_package()  # compile
-torch._inductor.aoti_load_package()          # load (auto-detects device)
-```
-
-### Deprecated API (migrate away)
-```python
-torch._export.aot_compile()  # deprecated
-torch._export.aot_load()     # deprecated
-```
-
-The new API stores device metadata in the package, so `aoti_load_package()` automatically uses the correct device type.
-
-## Common Error Patterns
-
-### Device Mismatch Segfault
-
-**Symptom**: Segfault, exception, or crash during load or execution.
-
-**Example errors**:
- `The specified pointer resides on host memory and is not registered with any CUDA device`
- Crash during constant loading
- `Expected out tensor to have device cuda:0, but got cpu instead`
-
-**Solution**: Ensure compile and load use the same device type.
-
-### Input Device Mismatch at Runtime
-
-**Symptom**: RuntimeError during model execution.
-
-**Better debugging**: Run with `AOTI_RUNTIME_CHECK_INPUTS=1` for clear errors:
-```bash
-AOTI_RUNTIME_CHECK_INPUTS=1 python script.py
-```
-
-Produces actionable messages like:
-```
-Error: input_handles[0]: unmatched device type, expected: 0(cpu), but got: 1(cuda)
-```
-
-## Debugging CUDA Illegal Memory Access (IMA)
-
-### Step 1: Sanity Checks
-
-```bash
-AOTI_RUNTIME_CHECK_INPUTS=1 python script.py          # validate inputs match compilation guards
-TORCHINDUCTOR_NAN_ASSERTS=1 python script.py           # check for NaN before/after each kernel
-```
-
-Both flags take effect at **compile time** (codegen time).
-
-### Step 2: Make IMA Deterministic
-
-```bash
-PYTORCH_NO_CUDA_MEMORY_CACHING=1 CUDA_LAUNCH_BLOCKING=1 python script.py
-```
-
- `PYTORCH_NO_CUDA_MEMORY_CACHING=1` -- disables caching allocator (which allocates bigger buffers, masking IMA)
- `CUDA_LAUNCH_BLOCKING=1` -- forces synchronous kernel launches (pinpoints which kernel crashed)
-
-Both take effect at **runtime**.
-
-### Step 3: Identify the Problematic Kernel
-
-```bash
-AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3 python script.py
-```
-
-Prints kernels one by one at runtime. Combined with Step 2 flags, shows which kernel launched right before the error.
-
-To inspect inputs to specific kernels:
-```bash
-AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT="kernel_name_1,kernel_name_2" \
-AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 python script.py
-```
-
-If inputs to a kernel are unexpected, trace back to the kernel that produced the bad input.
-
-## Environment Variables Reference
-
-| Variable | When | Purpose |
-|---|---|---|
-| `AOTI_RUNTIME_CHECK_INPUTS=1` | Compile time | Validate inputs match compilation guards |
-| `TORCHINDUCTOR_NAN_ASSERTS=1` | Compile time | Check for NaN before/after kernels |
-| `PYTORCH_NO_CUDA_MEMORY_CACHING=1` | Runtime | Make IMA errors deterministic |
-| `CUDA_LAUNCH_BLOCKING=1` | Runtime | Force synchronous kernel launches |
-| `AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3` | Compile time | Print kernels at runtime |
-| `AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT="..."` | Compile time | Filter which kernels to print |
-| `TORCH_LOGS="+inductor,output_code"` | Runtime | See PT2 internal logs |
-| `TORCH_SHOW_CPP_STACKTRACES=1` | Runtime | Show C++ stack traces |
-
-## Common Sources of Issues
-
- **Dynamic shapes**: Historically a common source of IMA errors. Pay special attention when using dynamic shape constraints.
- **Custom ops**: Especially C++ custom ops with dynamic shapes. The meta function may need to handle SymInt properly.
--- a/.agents/skills/pt2-debug/SKILL.md
+++ b/.agents/skills/pt2-debug/SKILL.md
@@ -1,195 +0,0 @@
---
-name: pt2-debug
-description: Debug torch.compile failures, graph breaks, recompilation issues, accuracy mismatches, and Triton kernel errors. Use when encountering BackendCompilerFailed exceptions, torch.compile errors, recompilation warnings, or numerical accuracy issues with compiled PyTorch models.
---
-
-# PyTorch 2 Compile Debugging
-
-Debug `torch.compile`, Dynamo, Inductor, and AOTAutograd failures when using PyTorch as a library.
-
-## Diagnostic Environment Variables
-
-Pick the right diagnostic based on the error:
-
-| Command | When to use |
-|---|---|
-| `TORCH_LOGS="+dynamo,graph_breaks,recompiles" python script.py` | Quick overview of what's going wrong |
-| `TORCH_COMPILE_DEBUG=1 python script.py` | Full debug artifacts (FX graphs, Inductor IR, generated code) in `torch_compile_debug/` |
-| `TORCH_LOGS="output_code" python script.py` | See the generated Triton/C++ kernel code |
-| `TORCH_TRACE=/path/to/trace python script.py` | Structured trace (parse with `tlparse`) |
-| `TORCHINDUCTOR_COMPILE_THREADS=1 python script.py` | Single-threaded compilation for pdb debugging |
-
-## Error Triage
-
-Classify the failure and jump to the right section:
-
-| Error Pattern | Category |
-|---|---|
-| `Unsupported: ...` or `graph break` in logs | [Graph Breaks](#graph-breaks) |
-| `BackendCompilerFailed` | [Backend Failures](#backend-compiler-failures) |
-| `RecompileError` or `cache_size_limit` | [Recompilation](#recompilation-issues) |
-| Accuracy mismatch / wrong numerical output | [Accuracy](#accuracy-issues) |
-| `InternalTorchDynamoError` | [Internal Errors](#internal-dynamo-errors) |
-| Segfault or CUDA IMA | [Runtime Crashes](#runtime-crashes) |
-| Triton assertion / index out of bounds | [Triton Failures](#triton-kernel-failures) |
-
-## Graph Breaks
-
-Graph breaks split the compiled graph into smaller subgraphs, causing performance regressions.
-
-**Diagnose:**
-```bash
-TORCH_LOGS="graph_breaks" python script.py
-```
-
-**Common causes:**
- Data-dependent control flow
- Unsupported Python builtins
- In-place ops on inputs, unsupported dtypes
- Calls to non-traceable functions
-
-**Fix approaches:**
-1. Read the graph break message to identify the unsupported operation
-2. Check for a decomposition or supported alternative
-3. Consider `torch._dynamo.allow_in_graph` or restructure user code
-
-## Backend Compiler Failures
-
-`BackendCompilerFailed` means Inductor crashed during compilation.
-
-**Diagnose with the minifier:**
-```bash
-# Generate minifier launcher
-TORCHDYNAMO_REPRO_AFTER=aot TORCHDYNAMO_REPRO_LEVEL=2 python script.py
-
-# Run the minifier to get minimal failing graph
-python minifier_launcher.py minify
-
-# Run the minimized reproduction
-python minifier_launcher.py run
-```
-
-**Then inspect:**
-```bash
-TORCH_COMPILE_DEBUG=1 python script.py  # FX graphs in torch_compile_debug/
-```
-
-## Recompilation Issues
-
-Excessive recompilation from guards that are too specific, causing cache misses.
-
-**Diagnose:**
-```bash
-TORCH_LOGS="recompiles,recompiles_verbose,guards" python script.py
-```
-
-**Key config:**
-```python
-torch._dynamo.config.recompile_limit  # default: 8
-torch._dynamo.config.fail_on_recompile_limit_hit = True  # hard error on limit
-```
-
-**Common causes:**
- Changing tensor shapes without marking them dynamic
- Python scalar values that change between calls
- Global state mutations between calls
-
-**Fix:** Read the recompilation reason from logs, identify the failing guard, then either:
- Mark dimensions as dynamic: `torch._dynamo.mark_dynamic(tensor, dim)`
- Fix the source of guard instability
-
-## Accuracy Issues
-
-Compiled model produces different numerical results than eager mode.
-
-**Diagnose:**
-```bash
-# Compares compiled vs eager with fp64 reference, dumps repro on failure
-TORCHDYNAMO_REPRO_AFTER=aot TORCHDYNAMO_REPRO_LEVEL=4 python script.py
-```
-
-**Fix approach:**
-1. Get minimal failing graph from the minifier
-2. Compare eager vs compiled output at fp64 precision
-3. Binary search through ops to find the diverging operation
-4. Check for known issues: reduction order, fused kernels, dtype promotions
-
-## Internal Dynamo Errors
-
-`InternalTorchDynamoError` indicates a bug in Dynamo.
-
-**Diagnose:**
-```bash
-TORCHDYNAMO_VERBOSE=1 python script.py
-# or equivalently:
-TORCH_LOGS="+dynamo" python script.py
-```
-
-**Debug interactively:**
-```bash
-TORCHINDUCTOR_COMPILE_THREADS=1 python script.py  # then attach pdb
-```
-
-## Runtime Crashes
-
-Segfaults and CUDA illegal memory access during execution of compiled code.
-
-**Make crash deterministic:**
-```bash
-PYTORCH_NO_CUDA_MEMORY_CACHING=1 CUDA_LAUNCH_BLOCKING=1 python script.py
-```
-
-**Add NaN checks to find the first bad kernel:**
-```bash
-TORCHINDUCTOR_NAN_ASSERTS=1 python script.py
-```
-
-**Inductor sync debugging:**
-```python
-torch._inductor.config.triton.debug_sync_kernel = True  # sync after every kernel
-torch._inductor.config.triton.debug_sync_graph = True   # sync before/after graph
-```
-
-**Fix approach:**
-1. Make deterministic with `PYTORCH_NO_CUDA_MEMORY_CACHING=1 CUDA_LAUNCH_BLOCKING=1`
-2. Check input shapes, devices, dtypes
-3. Inspect generated kernel code with `TORCH_LOGS="output_code"`
-4. Use `TORCHINDUCTOR_NAN_ASSERTS=1` to find the first kernel producing bad values
-5. Dynamic shapes are historically a common source of IMA
-
-## Triton Kernel Failures
-
-Triton assertion failures or index-out-of-bounds in generated kernels.
-
-**Diagnose:**
-```bash
-TORCH_LOGS="output_code,schedule" python script.py
-```
-
-**Fix approach:**
-1. Get the generated Triton kernel from `output_code` logs
-2. Check index computations for off-by-one or wrong stride calculations
-3. Check IR with `TORCH_COMPILE_DEBUG=1` to trace back to the FX op
-4. Check if fusion decisions created invalid index combinations
-
-## Distinguish Trace-Time vs Runtime
-
-Many bugs come from confusing these:
- **Trace-time**: Inside Dynamo's symbolic interpreter. Function calls may be constant-folded.
- **Runtime**: Real tensors, real Python calls.
-
-When debugging, add `print()` directly in source files rather than monkey-patching -- dispatch chains make monkey-patching unreliable.
-
-## Using the Minifier
-
-The minifier reduces a failing graph to the smallest reproduction:
-
-```bash
-# For compilation failures (level 2)
-TORCHDYNAMO_REPRO_AFTER=aot TORCHDYNAMO_REPRO_LEVEL=2 python script.py
-python minifier_launcher.py minify
-python minifier_launcher.py run
-
-# For accuracy failures (level 4)
-TORCHDYNAMO_REPRO_AFTER=aot TORCHDYNAMO_REPRO_LEVEL=4 python script.py
-```
--- a/.agents/skills/ruff/SKILL.md
+++ b/.agents/skills/ruff/SKILL.md
@@ -1,134 +0,0 @@
---
-name: ruff
-description:
-  Guide for using ruff, the extremely fast Python linter and formatter. Use this
-  when linting, formatting, or fixing Python code.
---
-
-# ruff
-
-Ruff is an extremely fast Python linter and code formatter. It replaces Flake8,
-isort, Black, pyupgrade, autoflake, and dozens of other tools.
-
-## When to use ruff
-
-**Always use ruff for Python linting and formatting**, especially if you see:
-
- `[tool.ruff]` section in `pyproject.toml`
- A `ruff.toml` or `.ruff.toml` configuration file
-
-However, avoid making unnecessary changes:
-
- **Don't format unformatted code** - If `ruff format --diff` shows changes
-  throughout an entire file, the project likely isn't using ruff for formatting.
-  Skip formatting to avoid obscuring actual changes.
- **Scope fixes to code being edited** - Use `ruff check --diff` to see fixes
-  relevant to the code you're changing. Only apply fixes to files you're
-  modifying unless the user explicitly asks for broader fixes.
-
-## How to invoke ruff
-
- `uv run ruff ...` - Use when ruff is in the project's dependencies to ensure
-  you use the pinned version
- `uvx ruff ...` - Use when ruff is not a project dependency, or for quick
-  one-off checks
- `ruff ...` - Use if ruff is installed globally
-
-## Commands
-
-### Linting
-
-```bash
-ruff check .                  # Check all files in current directory
-ruff check path/to/file.py    # Check specific file
-ruff check --fix .            # Auto-fix fixable violations
-ruff check --fix --unsafe-fixes .  # Include unsafe fixes (review changes!)
-ruff check --watch .          # Watch for changes and re-lint
-ruff check --select E,F .     # Only check specific rules
-ruff check --ignore E501 .    # Ignore specific rules
-ruff rule E501                # Explain a specific rule
-ruff linter                   # List available linters
-```
-
-### Formatting
-
-```bash
-ruff format .                 # Format all files
-ruff format path/to/file.py   # Format specific file
-ruff format --check .         # Check if files are formatted (no changes)
-ruff format --diff .          # Show formatting diff without applying
-```
-
-## Configuration
-
-Ruff is configured in `pyproject.toml` or `ruff.toml`:
-
-```toml
-# pyproject.toml
-[tool.ruff.lint]
-select = ["E", "F", "I", "UP"]  # Enable specific rule sets
-ignore = ["E501"]               # Ignore specific rules
-
-[tool.ruff.lint.isort]
-known-first-party = ["myproject"]
-```
-
-## Migrating from other tools
-
-### Black → ruff format
-
-```bash
-black .                       → ruff format .
-black --check .               → ruff format --check .
-black --diff .                → ruff format --diff .
-```
-
-### Flake8 → ruff check
-
-```bash
-flake8 .                      → ruff check .
-flake8 --select E,F .         → ruff check --select E,F .
-flake8 --ignore E501 .        → ruff check --ignore E501 .
-```
-
-### isort → ruff check
-
-```bash
-isort .                       → ruff check --select I --fix .
-isort --check .               → ruff check --select I .
-isort --diff .                → ruff check --select I --diff .
-```
-
-## Common patterns
-
-### Apply lint fixes before formatting
-
-Run `ruff check --fix` before `ruff format`. Lint fixes can change code
-structure (e.g., reordering imports), which formatting then cleans up.
-
-```bash
-ruff check --fix .
-ruff format .
-```
-
-### Applying and reviewing unsafe fixes
-
-Ruff categorizes some auto-fixes as "unsafe" because they may change code
-behavior, not just style. For example, removing unused imports could break code
-that relies on side effects.
-
-```bash
-ruff check --fix --unsafe-fixes --diff .  # Preview changes first
-ruff check --fix --unsafe-fixes .         # Apply changes
-```
-
-**Always review changes before applying `--unsafe-fixes`:**
-
- Use `ruff rule <CODE>` to understand why the fix is considered unsafe
- Verify the fix doesn't violate those assumptions in your code
-
-## Documentation
-
-For detailed information, read the official documentation:
-
- https://docs.astral.sh/ruff/
--- a/.agents/skills/ty/SKILL.md
+++ b/.agents/skills/ty/SKILL.md
@@ -1,135 +0,0 @@
---
-name: ty
-description:
-  Guide for using ty, the extremely fast Python type checker and language
-  server. Use this when type checking Python code or setting up type checking in
-  Python projects.
---
-
-# ty
-
-ty is an extremely fast Python type checker and language server. It replaces
-mypy, Pyright, and other type checkers.
-
-## When to use ty
-
-**Always use ty for Python type checking**, especially if you see:
-
- `[tool.ty]` section in `pyproject.toml`
- A `ty.toml` configuration file
-
-## How to invoke ty
-
- `uv run ty ...` - Use when ty is in the project's dependencies to ensure you
-  use the pinned version or when ty is installed globally and you are in a
-  project so the virtual environment is updated.
- `uvx ty ...` - Use when ty is not a project dependency, or for quick one-off
-  checks
-
-## Commands
-
-### Type checking
-
-```bash
-ty check                      # Check all files in current directory
-ty check path/to/file.py      # Check specific file
-ty check src/                 # Check specific directory
-```
-
-### Rule configuration
-
-```bash
-ty check --error possibly-unresolved-reference   # Treat as error
-ty check --warn division-by-zero                 # Treat as warning
-ty check --ignore unresolved-import              # Disable rule
-```
-
-### Python version targeting
-
-```bash
-ty check --python-version 3.12     # Check against Python 3.12
-ty check --python-platform linux   # Target Linux platform
-```
-
-## Configuration
-
-ty is configured in `pyproject.toml` or `ty.toml`:
-
-```toml
-# pyproject.toml
-[tool.ty.environment]
-python-version = "3.12"
-
-[tool.ty.rules]
-possibly-unresolved-reference = "warn"
-division-by-zero = "error"
-
-[tool.ty.src]
-include = ["src/**/*.py"]
-exclude = ["**/migrations/**"]
-
-[tool.ty.terminal]
-output-format = "full"
-error-on-warning = false
-```
-
-### Per-file overrides
-
-Use overrides to apply different rules to specific files, such as relaxing rules
-for tests or scripts that have different typing requirements than production
-code:
-
-```toml
-[[tool.ty.overrides]]
-include = ["tests/**", "**/test_*.py"]
-
-[tool.ty.overrides.rules]
-possibly-unresolved-reference = "warn"
-```
-
-## Language server
-
-This plugin automatically configures the ty language server for Python files
-(`.py` and `.pyi`).
-
-## Migrating from other tools
-
-### mypy → ty
-
-```bash
-mypy .                        → ty check
-mypy --strict .               → ty check --error-on-warning
-mypy path/to/file.py          → ty check path/to/file.py
-```
-
-### Pyright → ty
-
-```bash
-pyright .                     → ty check
-pyright path/to/file.py       → ty check path/to/file.py
-```
-
-## Common patterns
-
-### Don't add ignore comments
-
-Fix type errors instead of suppressing them. Only add ignore comments when
-explicitly requested by the user. Use `ty: ignore`, not `type: ignore`, and
-prefer rule-specific ignores:
-
-```python
-# Good: rule-specific ignore
-x = undefined_var  # ty: ignore[possibly-unresolved-reference]
-
-# Bad: blanket ty ignore
-x = undefined_var  # ty: ignore
-
-# Bad: tool agnostic blanket ignore
-x = undefined_var  # type: ignore
-```
-
-## Documentation
-
-For detailed information, read the official documentation:
-
- https://docs.astral.sh/ty/
--- a/.agents/skills/uv/SKILL.md
+++ b/.agents/skills/uv/SKILL.md
@@ -1,182 +0,0 @@
---
-name: uv
-description:
-  Guide for using uv, the Python package and project manager. Use this when
-  working with Python projects, scripts, packages, or tools.
---
-
-# uv
-
-uv is an extremely fast Python package and project manager. It replaces pip,
-pip-tools, pipx, pyenv, virtualenv, poetry, etc.
-
-## When to use uv
-
-**Always use uv for Python work**, especially if you see:
-
- The `uv.lock` file
- uv headers in `requirements*` files, e.g., "This file was autogenerated by uv"
-
-Don't use uv in projects managed by other tools:
-
- Poetry projects (identifiable by `poetry.lock` file)
- PDM projects (identifiable by `pdm.lock` file)
-
-## Choosing the right workflow
-
-### Scripts
-
-**Use when:** Running single Python files and standalone scripts.
-
-**Key commands:**
-
-```bash
-uv run script.py                      # Run a script
-uv run --with requests script.py      # Run with additional packages
-uv add --script script.py requests    # Add dependencies inline to the script
-```
-
-### Projects
-
-**Use when:** There is a `pyproject.toml` or `uv.lock`
-
-**Key commands:**
-
-```bash
-uv init                   # Create new project
-uv add requests           # Add dependency
-uv remove requests        # Remove dependency
-uv sync                   # Install from lockfile
-uv run <command>          # Run commands in environment
-uv run python -c ""       # Run Python in project environment
-uv run -p 3.12 <command>  # Run with specific Python version
-```
-
-### Tools
-
-**Use when:** Running command-line tools (e.g., ruff, ty, pytest) without
-installation.
-
-**Key commands:**
-
-```bash
-uvx <tool> <args>            # Run a tool without installation
-uvx <tool>@<version> <args>  # Run a specific version of a tool
-```
-
-**Important:**
-
- `uvx` runs tools from PyPI by package name. This can be unsafe - only run
-  well-known tools.
- Only use `uv tool install` only when specifically requested by the user.
-
-### Pip interface
-
-**Use when:** Legacy workflows with `requirements.txt` or manual environment
-management, no `uv.lock` present.
-
-**Key commands:**
-
-```bash
-uv venv
-uv pip install -r requirements.txt
-uv pip compile requirements.in -o requirements.txt
-uv pip sync requirements.txt
-
-# Platform independent resolution
-uv pip compile --universal requirements.in -o requirements.txt
-```
-
-**Important:**
-
- Don't use the pip interface unless clearly needed.
- Don't introduce new `requirements.txt` files.
- Prefer `uv init` for new projects.
-
-## Migrating from other tools
-
-### pyenv → uv python
-
-```bash
-pyenv install 3.12       → uv python install 3.12
-pyenv versions           → uv python list --only-installed
-pyenv local 3.12         → uv python pin 3.12
-pyenv global 3.12        → uv python install 3.12 --default
-```
-
-### pipx → uvx
-
-```bash
-pipx run ruff            → uvx ruff
-pipx install ruff        → uv tool install ruff
-pipx upgrade ruff        → uv tool upgrade ruff
-pipx list                → uv tool list
-```
-
-### pip and pip-tools → uv pip
-
-```bash
-pip install package      → uv pip install package
-pip install -r req.txt   → uv pip install -r req.txt
-pip freeze               → uv pip freeze
-pip-compile req.in       → uv pip compile req.in
-pip-sync req.txt         → uv pip sync req.txt
-virtualenv .venv         → uv venv
-```
-
-## Common patterns
-
-### Don't use pip in uv projects
-
-```bash
-# Bad
-pip install requests
-
-# Good
-uv add requests
-```
-
-### Don't run python directly
-
-```bash
-# Bad
-python script.py
-
-# Good
-uv run script.py
-```
-
-```bash
-# Bad
-python -c "..."
-
-# Good
-uv run python -c "..."
-```
-
-```bash
-# Bad
-python3.12 -c "..."
-
-# Good
-uvx python@3.12 -c "..."
-```
-
-### Don't manually manage environments in uv projects
-
-```bash
-# Bad
-python -m venv .venv
-source .venv/bin/activate
-
-# Good
-uv run <command>
-```
-
-## Documentation
-
-For detailed information, read the official documentation:
-
- https://docs.astral.sh/uv/llms.txt
-
-The documentation links to specific pages for each of these workflows.
--- a/.devcontainer/cpu/devcontainer.json
+++ b/.devcontainer/cpu/devcontainer.json
@@ -17,15 +17,11 @@
            "userUid": "1000",
            "userGid": "1000",
            "configureZshAsDefaultShell": false
-        },
-        "ghcr.io/devcontainers/features/node:1": {
-            "version": "lts"
        }
    },
    "remoteUser": "ubuntu",
    "remoteEnv": {
-        "CARGO_HOME": "/home/ubuntu/.cache/luminal/cargo",
-        "CODEX_HOME": "${containerWorkspaceFolder}/.claude/codex"
+        "CARGO_HOME": "/home/ubuntu/.cache/luminal/cargo"
    },
    "postStartCommand": "mkdir -p /home/ubuntu/.cache/luminal/cargo && git config --global --add safe.directory ${containerWorkspaceFolder} && gh auth setup-git",
    "customizations": {
--- a/.devcontainer/cuda/devcontainer.json
+++ b/.devcontainer/cuda/devcontainer.json
@@ -21,15 +21,11 @@
            "userUid": "1000",
            "userGid": "1000",
            "configureZshAsDefaultShell": false
-        },
-        "ghcr.io/devcontainers/features/node:1": {
-            "version": "lts"
        }
    },
    "remoteUser": "ubuntu",
    "remoteEnv": {
-        "CARGO_HOME": "/home/ubuntu/.cache/luminal/cargo",
-        "CODEX_HOME": "${containerWorkspaceFolder}/.claude/codex"
+        "CARGO_HOME": "/home/ubuntu/.cache/luminal/cargo"
    },
    "postStartCommand": "mkdir -p /home/ubuntu/.cache/luminal/cargo && git config --global --add safe.directory ${containerWorkspaceFolder} && gh auth setup-git",
    "customizations": {
@@ -56,4 +52,4 @@
            ]
        }
    }
-}
+}
--- a/.github/workflows/modal-examples.yml
+++ b/.github/workflows/modal-examples.yml
@@ -3,7 +3,7 @@ name: Modal Examples
 on:
    push:
        branches: ["main"]
-    pull_request:
+    pull_request_target:
        branches: ["main"]
        types: [labeled, synchronize]
    workflow_dispatch:
@@ -13,16 +13,16 @@ jobs:
        if: >-
            github.event_name == 'push'
            || github.event_name == 'workflow_dispatch'
-            || (github.event_name == 'pull_request'
+            || (github.event_name == 'pull_request_target'
                && contains(github.event.pull_request.labels.*.name, 'modal-ready'))
        name: "${{ matrix.example }} (Modal ${{ matrix.gpu.type }})"
        runs-on: ubuntu-latest
        environment: Modal
-        timeout-minutes: 70
+        timeout-minutes: 120
        strategy:
            fail-fast: false
            matrix:
-                example: [llama, gemma, qwen, qwen3_moe]
+                example: [llama, gemma, qwen, qwen3_moe, gemma4_moe, whisper]
                gpu:
                    - { type: "A100-80GB" }
                    # To add more GPUs, just append another entry:
@@ -30,6 +30,8 @@ jobs:

        steps:
            - uses: actions/checkout@v6
+              with:
+                  ref: ${{ github.event.pull_request.head.sha || github.sha }}
            - name: Set up Python
              uses: actions/setup-python@v5
              with:
--- a/.github/workflows/test-core.yml
+++ b/.github/workflows/test-core.yml
@@ -21,4 +21,4 @@ jobs:
        steps:
            - uses: actions/checkout@v6
            - name: Run tests
-              run: cargo test --workspace --exclude luminal_cuda_lite --exclude luminal_metal --exclude luminal_bench --verbose
+              run: cargo test --release --workspace --exclude luminal_cuda_lite --exclude luminal_metal --exclude luminal_bench --verbose
--- a/.github/workflows/test-cuda.yml
+++ b/.github/workflows/test-cuda.yml
@@ -3,7 +3,7 @@ name: Test CUDA
 on:
    push:
        branches: ["main"]
-    pull_request:
+    pull_request_target:
        branches: ["main"]
        types: [labeled, synchronize]
    workflow_dispatch:
@@ -13,15 +13,17 @@ jobs:
        if: >-
            github.event_name == 'push'
            || github.event_name == 'workflow_dispatch'
-            || (github.event_name == 'pull_request'
+            || (github.event_name == 'pull_request_target'
                && contains(github.event.pull_request.labels.*.name, 'modal-ready'))
        name: Cuda Unit Tests
        runs-on: ubuntu-latest
        environment: Modal
-        timeout-minutes: 30
+        timeout-minutes: 120

        steps:
            - uses: actions/checkout@v6
+              with:
+                  ref: ${{ github.event.pull_request.head.sha || github.sha }}
            - name: Set up Python
              uses: actions/setup-python@v5
              with:
--- a/.github/workflows/test-metal.yml
+++ b/.github/workflows/test-metal.yml
@@ -16,4 +16,4 @@ jobs:
        steps:
            - uses: actions/checkout@v6
            - name: Run Metal crate tests
-              run: rustup update; cargo test -p luminal_metal --verbose -- --test-threads=1
+              run: rustup update; cargo test --release -p luminal_metal --verbose -- --test-threads=1
--- a/.github/workflows/test-python-cuda.yml
+++ b/.github/workflows/test-python-cuda.yml
@@ -3,7 +3,7 @@ name: Test Python CUDA
 on:
    push:
        branches: ["main"]
-    pull_request:
+    pull_request_target:
        branches: ["main"]
        types: [labeled, synchronize]
    workflow_dispatch:
@@ -13,18 +13,20 @@ jobs:
        if: >-
            github.event_name == 'push'
            || github.event_name == 'workflow_dispatch'
-            || (github.event_name == 'pull_request'
+            || (github.event_name == 'pull_request_target'
                && contains(github.event.pull_request.labels.*.name, 'modal-ready'))
        name: Python CUDA Tests
        runs-on: ubuntu-latest
        environment: Modal
-        timeout-minutes: 60
+        timeout-minutes: 120
        defaults:
            run:
                working-directory: crates/luminal_python

        steps:
            - uses: actions/checkout@v6
+              with:
+                  ref: ${{ github.event.pull_request.head.sha || github.sha }}
            - name: Set up Python
              uses: actions/setup-python@v5
              with:
@@ -36,7 +38,7 @@ jobs:
                  MODAL_TOKEN_ID: ${{ secrets.MODAL_TOKEN_ID }}
                  MODAL_TOKEN_SECRET: ${{ secrets.MODAL_TOKEN_SECRET }}
                  HF_TOKEN: ${{ secrets.HF_TOKEN }}
-              run: modal run modal_pytest_runner.py --gpu A100 --timeout 3300 --profile --profile-output-dir luminal_artifacts/pytest-profiling/github-${{ github.run_id }}-${{ github.run_attempt }} tests/ -v -s -m "not slow"
+              run: modal run modal_pytest_runner.py --gpu A100 --timeout 7200 --profile --profile-output-dir luminal_artifacts/pytest-profiling/github-${{ github.run_id }}-${{ github.run_attempt }} tests/ -v -s -m "not slow"
            - name: Upload Modal pytest profiling artifacts
              if: always()
              uses: actions/upload-artifact@v4
--- a/.github/workflows/test-python-native.yml
+++ b/.github/workflows/test-python-native.yml
@@ -23,6 +23,6 @@ jobs:
            - name: Update Rust toolchain
              run: rustup update
            - name: Build maturin extension
-              run: uv run maturin develop --manifest-path rust/Cargo.toml
+              run: uv run maturin develop --manifest-path rust/Cargo.toml --profile release
            - name: Run pytest
              run: uv run pytest tests/test_hlir_ops.py tests/test_unary.py -v -m "not slow"
--- a/.gitignore
+++ b/.gitignore
@@ -1,9 +1,6 @@
 /target
 /crates/**/target
 /examples/**/target
-.claude-project
-.claude-memory 
-.codex

 *.env
 .claude/
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1,34 +0,0 @@
-# Luminal
-
-## Package Management
-
- Use `uv add`, `uv add --dev`, `uv remove` for Python dependencies (pyproject.toml is in `crates/luminal_python/`)
- Use `uv sync` to sync the Python environment
- Never use pip, pip-tools, poetry, or conda
- Never manually create or activate virtual environments — uv manages `.venv/` automatically
- Never generate requirements.txt
-
-## Code Execution
-
- Always use `uv run` to execute Python tools: `uv run pytest`, `uv run pre-commit`, `uv run python`
- Use `cargo` directly for Rust: `cargo build`, `cargo test`, `cargo check`, `cargo clippy`
- Python project root is `crates/luminal_python/` — run `uv run` commands from there
-
-## Building the Python Package (Maturin)
-
- After modifying `.rs` files that affect the Python bridge, rebuild with: `maturin develop --release`
- Maturin config is in `crates/luminal_python/pyproject.toml` under `[tool.maturin]`
-
-## Pre-commit
-
- Run with: `uv run pre-commit run --all-files`
- Hooks configured: ruff-check, ruff-format (Python), cargo-fmt, cargo-clippy (Rust)
- Manual-stage hooks (cargo-clippy-metal, cargo-clippy-cuda-lite) run with `--hook-stage manual`
-
-## Testing
-
- **Rust tests**: `cargo test -p <crate_name>`
- **Python tests**: `cd crates/luminal_python && uv run pytest`
-  - `./run_test.sh` — native backend
-  - `./run_tests_cuda.sh` — CUDA backend
- See `crates/luminal_python/CLAUDE.md` for Python test patterns and conventions
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -25,6 +25,7 @@ generational-box = "0.5.6"
 serde_json = "1.0.140"
 egglog = {git="https://github.com/egraphs-good/egglog", rev="0a8cc35a6c68d0460c20449d5fa19ca3caba2923"}
 egglog-ast = {git="https://github.com/egraphs-good/egglog", rev="0a8cc35a6c68d0460c20449d5fa19ca3caba2923"}
+egglog-reports = {git="https://github.com/egraphs-good/egglog", rev="0a8cc35a6c68d0460c20449d5fa19ca3caba2923"}
 egraph-serialize = { version = "0.3.0", default-features = false, features = ["graphviz", "serde"]}
 tracing = "0.1.43"
 paste = "1.0.15"
@@ -32,6 +33,7 @@ pretty-duration = "0.1.1"
 anyhow = "1.0"
 graphviz-rust = { version = "0.9", default-features = false}
 lru = "0.16.2"
+rayon = "1.10"

 [workspace.package]
 edition = "2024"
--- a/README.md
+++ b/README.md
@@ -1,10 +1,10 @@
-<img href="luminal.com" alt="Screenshot 2025-08-14 at 9 18 54 PM" src="https://github.com/user-attachments/assets/c5832634-55d5-45b7-ba65-6efe36afce4a" />
+<img href="luminal.com" alt="Screenshot 2025-08-14 at 9 18 54 PM" src="https://github.com/luminal-ai/luminal/blob/main/docs/logo/inference_at_the_speed_of_light.png" />

 <h3 align="center">
 Luminal is a high-performance general-purpose inference compiler.
 </h3>

-[![CI Status](https://img.shields.io/github/actions/workflow/status/jafioti/luminal/test.yml?style=for-the-badge&logo=github-actions&logoColor=white&branch=main)](https://github.com/jafioti/luminal/actions)
+[![CI Status](https://img.shields.io/github/actions/workflow/status/luminal-ai/luminal/test-core.yml?style=for-the-badge&logo=github-actions&logoColor=white&branch=main)](https://github.com/luminal-ai/luminal/actions)
 [![Docs](https://img.shields.io/badge/Documentation-green?style=for-the-badge&color=0D9373)](https://docs.luminalai.com)
 [![Current Crates.io Version](https://img.shields.io/crates/v/luminal.svg?style=for-the-badge&logo=rust)](https://crates.io/crates/luminal)
 [![discord](https://dcbadge.limes.pink/api/server/APjuwHAbGy)](https://discord.gg/APjuwHAbGy)
--- a/ci/modal_cargo_test.py
+++ b/ci/modal_cargo_test.py
@@ -1,7 +1,6 @@
 import modal
 import subprocess
 import os
-import sys

 gpu_type = os.environ.get("GPU_TYPE", "T4")
 CUDARC_CUDA_VERSION = "12080"
@@ -29,7 +28,7 @@ cuda_image = (
@app.function(
    image=cuda_image,
    gpu=gpu_type,
-    timeout=1800,  # 30 minutes
+    timeout=7200,  # 2 hours
 )
 def run_cargo_test():
    """Run cargo test for luminal_cuda_lite on a Modal GPU."""
@@ -46,8 +45,11 @@ def run_cargo_test():

    subprocess.run(
        [
-            "cargo", "test",
-            "-p", "luminal_cuda_lite",
+            "cargo",
+            "test",
+            "--release",
+            "-p",
+            "luminal_cuda_lite",
            "--verbose",
            "--",
            "--test-threads=1",
--- a/ci/modal_example.py
+++ b/ci/modal_example.py
@@ -1,6 +1,9 @@
-import modal
-import subprocess
 import os
+import re
+import subprocess
+import sys
+
+import modal

 example = os.environ.get("EXAMPLE", "llama")
 gpu_type = os.environ.get("GPU_TYPE", "A100-80GB")
@@ -18,6 +21,79 @@ hf_cache = modal.Volume.from_name(

 WORKDIR = "/workspace/luminal"

+ANSI_ESCAPE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")
+
+EXPECTED_OUTPUT = {
+    "llama": [
+        "complex system modeled after the structure and function of the human brain",
+    ],
+    "gemma": [
+        "recognize pictures of cats",
+        "little detectives looking for specific features",
+    ],
+    "qwen": [
+        "computational model inspired by the structure and function of the human brain",
+    ],
+    "qwen3_moe": [
+        "The capital of France is Paris",
+    ],
+    "gemma4_moe": [
+        "city of romance, art and culture",
+    ],
+    "whisper": [
+        "ask not what your country can do for you",
+    ],
+}
+
+
+def run_and_capture(command: list[str], *, cwd: str, env: dict[str, str]) -> str:
+    process = subprocess.Popen(
+        command,
+        cwd=cwd,
+        env=env,
+        stdout=subprocess.PIPE,
+        stderr=subprocess.STDOUT,
+    )
+    assert process.stdout is not None
+
+    chunks = []
+    while True:
+        chunk = process.stdout.read1(4096)
+        if not chunk:
+            break
+        sys.stdout.buffer.write(chunk)
+        sys.stdout.buffer.flush()
+        chunks.append(chunk)
+
+    return_code = process.wait()
+    output = b"".join(chunks).decode("utf-8", errors="replace")
+    if return_code:
+        raise subprocess.CalledProcessError(return_code, command, output=output)
+    return output
+
+
+def normalize_output(output: str) -> str:
+    output = ANSI_ESCAPE.sub("", output)
+    output = output.replace("\r", "\n")
+    return re.sub(r"\s+", " ", output).casefold()
+
+
+def validate_output(example: str, output: str):
+    expected_phrases = EXPECTED_OUTPUT.get(example)
+    if expected_phrases is None:
+        raise ValueError(f"No expected output phrases configured for example {example!r}")
+
+    normalized_output = normalize_output(output)
+    for phrase in expected_phrases:
+        if normalize_output(phrase) in normalized_output:
+            print(f"\nOutput check passed for {example!r}: found {phrase!r}")
+            return
+
+    expected = "\n  - ".join(expected_phrases)
+    raise AssertionError(
+        f"Output check failed for {example!r}. Expected one of:\n  - {expected}"
+    )
+
 cuda_image = (
    modal.Image.from_registry(
        "nvcr.io/nvidia/pytorch:25.03-py3"
@@ -39,7 +115,7 @@ cuda_image = (
@app.function(
    image=cuda_image,
    gpu=gpu_type,
-    timeout=3600,  # 60 minutes
+    timeout=7200,  # 2 hours
    volumes={
        HF_CACHE_PATH: hf_cache,
    },
@@ -48,16 +124,17 @@ def run_example(example: str):
    """Build and run a luminal example on a Modal GPU."""
    subprocess.run(["nvidia-smi"], check=True)

-    subprocess.run(
+    run_env = {
+        **os.environ,
+        "CUDARC_CUDA_VERSION": CUDARC_CUDA_VERSION,
+        "HF_HOME": HF_CACHE_PATH,
+    }
+    output = run_and_capture(
        ["cargo", "run", "--release"],
        cwd=f"{WORKDIR}/examples/{example}",
-        env={
-            **os.environ,
-            "CUDARC_CUDA_VERSION": CUDARC_CUDA_VERSION,
-            "HF_HOME": HF_CACHE_PATH,
-        },
-        check=True,
+        env=run_env,
    )
+    validate_output(example, output)

    hf_cache.commit()

--- a/crates/luminal_bench/examples/debug_ops.rs
+++ b/crates/luminal_bench/examples/debug_ops.rs
@@ -106,13 +106,13 @@ impl Case {
        let out = match self {
            Case::Mul => {
                let x = cx.tensor(size);
-                x.clone() * x
+                x * x
            }
            Case::Sigmoid => cx.tensor(size).sigmoid(),
            Case::Tanh => cx.tensor(size).tanh(),
            Case::GeluInner => {
                let x = cx.tensor(size);
-                (0.797_884_560_8_f32 * x.clone() * (1. + 0.044_715_f32 * x.clone() * x)).tanh()
+                (0.797_884_6_f32 * x * (1. + 0.044_715_f32 * x * x)).tanh()
            }
            Case::Gelu => cx.tensor(size).gelu(),
            Case::LayerNorm => {
@@ -447,10 +447,10 @@ where
        if let Some(ref backend) = backend_analysis {
            print_lowering_analysis(backend);
        }
-    } else if !args.inspect_ops.is_empty() {
-        if let Some(ref backend) = backend_analysis {
-            print_lowering_analysis(backend);
-        }
+    } else if !args.inspect_ops.is_empty()
+        && let Some(ref backend) = backend_analysis
+    {
+        print_lowering_analysis(backend);
    }

    // Trace facts for explicit variables.
--- a/crates/luminal_cuda_lite/Cargo.toml
+++ b/crates/luminal_cuda_lite/Cargo.toml
@@ -10,7 +10,8 @@ license = "MIT OR Apache-2.0"
 [dependencies]
 luminal = { path = "../.." }
 luminal_tracing = { path = "../luminal_tracing" }
-cudarc = {version="0.18.2", features=["cuda-version-from-build-system", "fallback-latest"]}
+cudarc = {version="0.19.4", features=["cuda-version-from-build-system", "fallback-latest"]}
+anyhow = "1.0"
 as-any = "0.3.2"
 itertools = "0.12.1"
 fixedbitset = "0.5.7"
@@ -23,6 +24,7 @@ memmap2 = "0.9.9"
 uuid = {version="1.19.0", features=["v4"]}
 lru = "0.16.2"
 libc = "0.2"
+libloading = "0.8"
 colorize = "*"

 [dev-dependencies]
--- a/crates/luminal_cuda_lite/examples/egglog_saturation.rs
+++ b/crates/luminal_cuda_lite/examples/egglog_saturation.rs
@@ -0,0 +1,611 @@
+use std::{collections::BTreeMap, sync::Arc, time::Instant};
+
+use itertools::Itertools;
+use luminal::prelude::egglog::{ast::Span, prelude::RustSpan};
+use luminal::{
+    dtype::DType,
+    egglog_utils::{
+        base::{base_cleanup_egglog, base_expression_egglog},
+        hlir_to_egglog,
+    },
+    hlir::HLIROps,
+    op::{EgglogOp, IntoEgglogOp, Runtime},
+    prelude::*,
+    shape::Expression,
+};
+use luminal_cuda_lite::runtime::CudaRuntime;
+
+const DEFAULT_PASSES: usize = 256;
+const EGGLOG_RULESETS: &[&str] = &[
+    "matmul_flatten",
+    "kernel_lower",
+    "direct_kernel",
+    "kernel_specialize",
+    "buffer_reuse",
+    "matmul_backend",
+    "glumoe",
+    "fusion_pair",
+    "fusion_grow",
+    "fusion_merge",
+];
+const MOE_SEQ: usize = 2;
+const MOE_HIDDEN: usize = 16;
+const MOE_NUM_EXPERTS: usize = 8;
+const MOE_TOP_K: usize = 2;
+const MOE_INTERMEDIATE: usize = 6;
+const GEMMA_RMS_NORM_EPS: f32 = 1e-6;
+
+#[derive(Debug, Clone, Copy)]
+enum Backend {
+    Native,
+    Cuda,
+}
+
+#[derive(Debug, Clone, Copy)]
+enum Mode {
+    Current,
+    Steps,
+    FullDefault,
+    FullCycle,
+}
+
+#[derive(Debug, Clone, Copy)]
+enum Case {
+    Mul,
+    UnaryChain(usize),
+    Gelu,
+    Softmax,
+    LayerNorm,
+    Matmul,
+    Attention,
+    QwenMoe,
+    GemmaMoe,
+}
+
+#[derive(Debug)]
+struct Args {
+    backend: Backend,
+    mode: Mode,
+    case: Case,
+    passes: usize,
+    cleanup: bool,
+    skip_roll: bool,
+}
+
+fn parse_args() -> Args {
+    let mut args = Args {
+        backend: Backend::Cuda,
+        mode: Mode::Current,
+        case: Case::Gelu,
+        passes: DEFAULT_PASSES,
+        cleanup: true,
+        skip_roll: false,
+    };
+
+    let mut iter = std::env::args().skip(1);
+    while let Some(arg) = iter.next() {
+        match arg.as_str() {
+            "--backend" => {
+                args.backend = match iter.next().as_deref() {
+                    Some("native") => Backend::Native,
+                    Some("cuda") => Backend::Cuda,
+                    other => panic!("invalid --backend {other:?}; use native|cuda"),
+                };
+            }
+            "--mode" => {
+                args.mode = match iter.next().as_deref() {
+                    Some("current") => Mode::Current,
+                    Some("steps") => Mode::Steps,
+                    Some("full-default") => Mode::FullDefault,
+                    Some("full-cycle") => Mode::FullCycle,
+                    other => panic!(
+                        "invalid --mode {other:?}; use current|steps|full-default|full-cycle"
+                    ),
+                };
+            }
+            "--case" => {
+                args.case = parse_case(&iter.next().expect("missing --case value"));
+            }
+            "--passes" => {
+                args.passes = iter
+                    .next()
+                    .expect("missing --passes value")
+                    .parse()
+                    .expect("invalid --passes value");
+            }
+            "--no-cleanup" => args.cleanup = false,
+            "--skip-roll" => args.skip_roll = true,
+            "--help" | "-h" => {
+                println!(
+                    "Usage: egglog_saturation [OPTIONS]\n\
+                     \n\
+                     Options:\n\
+                       --backend native|cuda          default: cuda\n\
+                       --mode current|steps|full-default|full-cycle\n\
+                       --case mul|unary-chain:N|gelu|softmax|layer-norm|matmul|attention|qwen-moe|gemma-moe\n\
+                       --passes N                    default: 256\n\
+                       --no-cleanup                  omit backend/HLIR cleanup rules\n\
+                       --skip-roll                   skip auto loop rolling prepass"
+                );
+                std::process::exit(0);
+            }
+            other => panic!("unknown argument {other}; use --help"),
+        }
+    }
+
+    args
+}
+
+fn parse_case(s: &str) -> Case {
+    if let Some(n) = s.strip_prefix("unary-chain:") {
+        return Case::UnaryChain(n.parse().expect("invalid unary-chain length"));
+    }
+    match s {
+        "mul" => Case::Mul,
+        "gelu" => Case::Gelu,
+        "softmax" => Case::Softmax,
+        "layer-norm" | "layer_norm" => Case::LayerNorm,
+        "matmul" => Case::Matmul,
+        "attention" => Case::Attention,
+        "qwen-moe" | "qwen_moe" => Case::QwenMoe,
+        "gemma-moe" | "gemma_moe" => Case::GemmaMoe,
+        other => panic!("unknown case {other}"),
+    }
+}
+
+fn build_case(case: Case) -> Graph {
+    let mut cx = Graph::new();
+    let out = match case {
+        Case::Mul => {
+            let x = cx.tensor((64, 64));
+            x * x
+        }
+        Case::UnaryChain(n) => {
+            let mut x = cx.tensor((64, 64));
+            for i in 0..n {
+                x = match i % 6 {
+                    0 => x.sin(),
+                    1 => x.sqrt(),
+                    2 => x.reciprocal(),
+                    3 => x.exp2(),
+                    4 => x.log2(),
+                    _ => x * 1.125,
+                };
+            }
+            x
+        }
+        Case::Gelu => cx.tensor((64, 64)).gelu(),
+        Case::Softmax => cx.tensor((128, 128)).softmax(1),
+        Case::LayerNorm => cx.tensor((128, 128)).layer_norm(1, 1e-5),
+        Case::Matmul => {
+            let a = cx.tensor((32, 64));
+            let b = cx.tensor((64, 32));
+            a.matmul(b)
+        }
+        Case::Attention => {
+            let q = cx.tensor((64, 32));
+            let k = cx.tensor((64, 32));
+            let v = cx.tensor((64, 32));
+            let scores = q.matmul(k.permute((1, 0))) * (1.0 / 32.0_f32.sqrt());
+            scores.softmax(1).matmul(v)
+        }
+        Case::QwenMoe => build_qwen_moe(&mut cx),
+        Case::GemmaMoe => build_gemma_moe(&mut cx),
+    };
+    let _ = out.output();
+    cx
+}
+
+fn build_qwen_moe(cx: &mut Graph) -> GraphTensor {
+    cx.set_dim('s', MOE_SEQ);
+    let x = cx.tensor(('s', MOE_HIDDEN));
+    let router = cx.tensor((MOE_NUM_EXPERTS, MOE_HIDDEN));
+    let gate_up_weights = cx
+        .tensor((MOE_NUM_EXPERTS, MOE_INTERMEDIATE * 2, MOE_HIDDEN))
+        .as_dtype(DType::Bf16);
+    let down_weights = cx
+        .tensor((MOE_NUM_EXPERTS, MOE_HIDDEN, MOE_INTERMEDIATE))
+        .as_dtype(DType::Bf16);
+
+    let n = x.dims().len();
+    let e_dim = *router.dims().first().unwrap();
+    let k_expr = Expression::from(MOE_TOP_K);
+
+    let routing_weights = x.matmul(router.t()).softmax(n - 1);
+    let top_k_indices = routing_weights.topk_indexes(MOE_TOP_K, n - 1);
+    let row_offsets = x
+        .graph()
+        .iota(Expression::from('z') / k_expr * e_dim, top_k_indices.dims());
+    let routing_flat_idx = row_offsets + top_k_indices;
+    let top_k_values = routing_weights.gather(routing_flat_idx);
+
+    let gate_up_gathered = gather_experts(x, top_k_indices, gate_up_weights).cast(DType::F32);
+    let x_exp = x.expand_dim(n - 1, MOE_TOP_K).unsqueeze(n);
+    let gate_up_out = x_exp.matmul(gate_up_gathered.transpose(2, 3)).squeeze(n);
+    let gate = gate_up_out.slice((.., .., ..MOE_INTERMEDIATE));
+    let up = gate_up_out.slice((.., .., MOE_INTERMEDIATE..));
+    let hidden = gate.silu() * up;
+
+    let down_gathered = gather_experts(x, top_k_indices, down_weights).cast(DType::F32);
+    let down_out = hidden
+        .unsqueeze(2)
+        .matmul(down_gathered.transpose(2, 3))
+        .squeeze(2);
+    let mut weights_exp = top_k_values.unsqueeze(top_k_values.dims().len());
+    weights_exp.shape.expand(down_out.dims());
+    (down_out * weights_exp).sum(n - 1)
+}
+
+fn build_gemma_moe(cx: &mut Graph) -> GraphTensor {
+    cx.set_dim('s', MOE_SEQ);
+    let router_input = cx.tensor(('s', MOE_HIDDEN));
+    let expert_input = cx.tensor(('s', MOE_HIDDEN));
+    let router_scale = cx.tensor(MOE_HIDDEN);
+    let router_proj = cx.tensor((MOE_NUM_EXPERTS, MOE_HIDDEN));
+    let per_expert_scale = cx.tensor(MOE_NUM_EXPERTS);
+    let gate_up_weights = cx
+        .tensor((MOE_NUM_EXPERTS, MOE_INTERMEDIATE * 2, MOE_HIDDEN))
+        .as_dtype(DType::Bf16);
+    let down_weights = cx
+        .tensor((MOE_NUM_EXPERTS, MOE_HIDDEN, MOE_INTERMEDIATE))
+        .as_dtype(DType::Bf16);
+
+    let n = router_input.dims().len();
+    let e_dim = *router_proj.dims().first().unwrap();
+    let k_expr = Expression::from(MOE_TOP_K);
+
+    let router_hidden = router_input.std_norm(n - 1, GEMMA_RMS_NORM_EPS)
+        * router_scale.expand_lhs(&router_input.dims()[..n - 1])
+        * (MOE_HIDDEN as f32).sqrt().recip();
+    let routing_weights = router_hidden.matmul(router_proj.t()).softmax(n - 1);
+    let top_k_indices = routing_weights.topk_indexes(MOE_TOP_K, n - 1);
+    let row_offsets = router_input
+        .graph()
+        .iota(Expression::from('z') / k_expr * e_dim, top_k_indices.dims());
+    let routing_flat_idx = row_offsets + top_k_indices;
+    let top_k_values = routing_weights.gather(routing_flat_idx);
+    let top_k_norm = top_k_values.sum(n - 1).expand_dim(n - 1, MOE_TOP_K);
+    let top_k_weights = (top_k_values / top_k_norm) * per_expert_scale.gather(top_k_indices);
+
+    let gate_up_gathered =
+        gather_experts(expert_input, top_k_indices, gate_up_weights).cast(DType::F32);
+    let x_exp = expert_input.expand_dim(n - 1, MOE_TOP_K).unsqueeze(n);
+    let gate_up_out = x_exp.matmul(gate_up_gathered.transpose(2, 3)).squeeze(n);
+    let gate = gate_up_out.slice((.., .., ..MOE_INTERMEDIATE));
+    let up = gate_up_out.slice((.., .., MOE_INTERMEDIATE..));
+    let hidden = gemma_gelu(gate) * up;
+
+    let down_gathered = gather_experts(expert_input, top_k_indices, down_weights).cast(DType::F32);
+    let down_out = hidden
+        .unsqueeze(2)
+        .matmul(down_gathered.transpose(2, 3))
+        .squeeze(2);
+    let mut weights_exp = top_k_weights.unsqueeze(top_k_weights.dims().len());
+    weights_exp.shape.expand(down_out.dims());
+    (down_out * weights_exp).sum(n - 1)
+}
+
+fn gather_experts(
+    graph_source: GraphTensor,
+    top_k_indices: GraphTensor,
+    weights: GraphTensor,
+) -> GraphTensor {
+    let (_, d1, d2) = weights.dims3();
+    let io = d1 * d2;
+    let base = top_k_indices * io;
+    let within = graph_source.graph().iota(Expression::from('z'), (d1, d2));
+    let n_base = base.dims().len();
+    let exp_base = base.expand_dim(n_base, d1).expand_dim(n_base + 1, d2);
+    let mut exp_within = within;
+    for (axis, dim) in base.dims().iter().enumerate() {
+        exp_within = exp_within.expand_dim(axis, *dim);
+    }
+    weights.gather(exp_base + exp_within)
+}
+
+#[allow(clippy::excessive_precision)]
+fn gemma_gelu(x: GraphTensor) -> GraphTensor {
+    let scaled = 1.5957691216 * x * (1. + 0.044715 * x * x);
+    x * scaled.sigmoid()
+}
+
+fn op_defs_string(ops: &[Arc<Box<dyn EgglogOp>>]) -> String {
+    let mut ir_variants = Vec::new();
+    let mut opkind_variants = Vec::new();
+    for op in ops {
+        let sort = op.sort();
+        let variant = format!(
+            "({} {})",
+            sort.name,
+            sort.fields.iter().map(|field| &field.sort).join(" ")
+        );
+        match sort.class.as_str() {
+            "IR" => ir_variants.push(variant),
+            "OpKind" => opkind_variants.push(variant),
+            other => panic!("unknown sort class {other} for {}", sort.name),
+        }
+    }
+    let extra_ir = ops.iter().flat_map(|op| op.ir_defs()).unique().join("\n");
+    format!(
+        "
+(datatype*
+    (IR
+        (OutputJoin IR IR)
+        (Op OpKind IList)
+        {extra_ir}
+        {}
+    )
+    (OpKind
+        {}
+    )
+    (IList
+        (ICons IR IList)
+        (INil)
+    )
+)
+(function dtype (IR) DType :merge new)
+",
+        ir_variants.join("\n"),
+        opkind_variants.join("\n")
+    )
+}
+
+fn op_cleanups_string(ops: &[Arc<Box<dyn EgglogOp>>]) -> String {
+    ops.iter()
+        .filter(|op| op.cleanup())
+        .map(|op| {
+            let sort = op.sort();
+            let fields = (0..sort.fields.len())
+                .map(|i| (b'a' + i as u8) as char)
+                .join(" ");
+            if sort.class == "OpKind" {
+                format!(
+                    "(rule
+                       ((= ?m (Op ({} {fields}) ?__cleanup_inputs)))
+                       ((delete (Op ({} {fields}) ?__cleanup_inputs)))
+                       :ruleset cleanup)",
+                    sort.name, sort.name
+                )
+            } else {
+                format!(
+                    "(rule
+                       ((= ?m ({} {fields})))
+                       ((delete ({} {fields})))
+                       :ruleset cleanup)",
+                    sort.name, sort.name
+                )
+            }
+        })
+        .join("\n")
+}
+
+fn setup_program(program: &str, ops: &[Arc<Box<dyn EgglogOp>>], cleanup: bool) -> String {
+    let rewrites = ops
+        .iter()
+        .flat_map(|op| op.rewrites())
+        .map(|rule| rule.to_egglog_string())
+        .join("\n");
+    [
+        EGGLOG_RULESETS
+            .iter()
+            .map(|ruleset| format!("(ruleset {ruleset})"))
+            .join("\n"),
+        base_expression_egglog(),
+        op_defs_string(ops),
+        if cleanup {
+            op_cleanups_string(ops)
+        } else {
+            String::new()
+        },
+        base_cleanup_egglog(),
+        rewrites,
+        program.to_string(),
+    ]
+    .join("\n")
+}
+
+fn producer_schedule() -> String {
+    "(seq
+        (saturate expr)
+        (saturate dtype_prop)
+        (run matmul_flatten)
+        (run kernel_lower)
+        (run direct_kernel)
+        (run kernel_specialize)
+        (run buffer_reuse)
+        (run matmul_backend)
+        (run glumoe)
+        (run fusion_pair)
+    )"
+    .to_string()
+}
+
+fn fusion_schedule() -> String {
+    "(seq
+        (saturate expr)
+        (saturate dtype_prop)
+        (run fusion_grow)
+        (run fusion_merge)
+    )"
+    .to_string()
+}
+
+fn split_cycle() -> Vec<(&'static str, String)> {
+    vec![
+        ("producers", format!("(saturate {})", producer_schedule())),
+        ("fusion", format!("(saturate {})", fusion_schedule())),
+    ]
+}
+
+fn split_cycle_schedule() -> String {
+    format!(
+        "(seq
+            (saturate {})
+            (saturate {})
+        )",
+        producer_schedule(),
+        fusion_schedule()
+    )
+}
+
+fn phase(egraph: &mut egglog::EGraph, name: &str, schedule: &str) -> bool {
+    let before = egraph.num_tuples();
+    let start = Instant::now();
+    let command = format!("(run-schedule {schedule})");
+    let outputs = egraph
+        .parse_and_run_program(None, &command)
+        .unwrap_or_else(|err| panic!("failed phase {name} schedule {schedule}: {err}"));
+    let elapsed = start.elapsed();
+    let after = egraph.num_tuples();
+    let report = outputs
+        .into_iter()
+        .find_map(|output| match output {
+            egglog::CommandOutput::RunSchedule(report) => Some(report),
+            _ => None,
+        })
+        .expect("run-schedule did not return a report");
+    let mut rules = report
+        .search_and_apply_time_per_rule
+        .iter()
+        .map(|(rule, time)| {
+            (
+                rule.to_string(),
+                *time,
+                report
+                    .num_matches_per_rule
+                    .get(rule)
+                    .copied()
+                    .unwrap_or_default(),
+            )
+        })
+        .collect_vec();
+    rules.sort_by_key(|(_, time, matches)| (std::cmp::Reverse(*time), std::cmp::Reverse(*matches)));
+    let matches = report.num_matches_per_rule.values().sum::<usize>();
+    println!(
+        "phase {name:<18} {elapsed_ms:>8.2} ms | tuples {before} -> {after} ({delta:+}) | updated={updated} | iters={iters} | matches={matches}",
+        elapsed_ms = elapsed.as_secs_f64() * 1000.0,
+        delta = after as isize - before as isize,
+        updated = report.updated,
+        iters = report.iterations.len(),
+    );
+    for (rule, time, matches) in rules
+        .into_iter()
+        .filter(|(_, time, matches)| !time.is_zero() || *matches > 0)
+        .take(8)
+    {
+        println!(
+            "  rule {rule:<82} {ms:>8.2} ms | matches {matches}",
+            ms = time.as_secs_f64() * 1000.0,
+        );
+    }
+    report.updated
+}
+
+fn serialize_summary(egraph: &mut egglog::EGraph, root: &str) {
+    let (sort, value) = egraph.eval_expr(&egglog::var!(root.to_string())).unwrap();
+    let output = egraph.serialize(egglog::SerializeConfig {
+        root_eclasses: vec![(sort, value)],
+        max_functions: None,
+        include_temporary_functions: false,
+        max_calls_per_function: None,
+    });
+    let mut classes = std::collections::BTreeSet::new();
+    let mut top_ops = BTreeMap::<String, usize>::new();
+    let mut nodes = 0usize;
+    for node in output.egraph.nodes.values().filter(|node| !node.subsumed) {
+        nodes += 1;
+        classes.insert(node.eclass.clone());
+        *top_ops.entry(node.op.clone()).or_default() += 1;
+    }
+    let top_ops = top_ops
+        .into_iter()
+        .sorted_by_key(|(_, count)| std::cmp::Reverse(*count))
+        .take(12)
+        .map(|(op, count)| format!("{op}={count}"))
+        .join(", ");
+    println!(
+        "serialize nodes={nodes} classes={} roots={} top_ops={top_ops}",
+        classes.len(),
+        output.egraph.root_eclasses.len()
+    );
+}
+
+fn run(args: Args) {
+    let mut graph = build_case(args.case);
+    let rolled = if args.skip_roll {
+        0
+    } else {
+        graph.auto_roll_loops_prepass()
+    };
+    let (program, root) = hlir_to_egglog(&graph);
+
+    let mut ops = match args.backend {
+        Backend::Native => <NativeRuntime as Runtime>::Ops::into_vec(),
+        Backend::Cuda => <CudaRuntime as Runtime>::Ops::into_vec(),
+    };
+    ops.extend(<HLIROps as IntoEgglogOp>::into_vec());
+    let cleanup = args.cleanup && matches!(args.backend, Backend::Cuda);
+    let setup = setup_program(&program, &ops, cleanup);
+
+    println!(
+        "case={:?} backend={:?} mode={:?} passes={} cleanup={} rolled={} hlir_nodes={} setup_lines={} setup_bytes={} root={root}",
+        args.case,
+        args.backend,
+        args.mode,
+        args.passes,
+        cleanup,
+        rolled,
+        graph.graph.node_count(),
+        setup.lines().count(),
+        setup.len(),
+    );
+
+    let mut egraph = egglog::EGraph::default();
+    let before = egraph.num_tuples();
+    let start = Instant::now();
+    let commands = egraph.parser.get_program_from_string(None, &setup).unwrap();
+    egraph.run_program(commands).unwrap();
+    println!(
+        "setup {:>8.2} ms | tuples {before} -> {} ({:+})",
+        start.elapsed().as_secs_f64() * 1000.0,
+        egraph.num_tuples(),
+        egraph.num_tuples() as isize - before as isize,
+    );
+
+    match args.mode {
+        Mode::Current | Mode::Steps => {
+            for pass in 1..=args.passes {
+                let mut updated = false;
+                for (name, schedule) in split_cycle() {
+                    updated |= phase(&mut egraph, &format!("{pass:03} {name}"), &schedule);
+                }
+                if matches!(args.mode, Mode::Current) && !updated {
+                    break;
+                }
+            }
+        }
+        Mode::FullDefault => {
+            phase(&mut egraph, "expr", "(saturate expr)");
+            phase(&mut egraph, "dtype", "(saturate dtype_prop)");
+            phase(&mut egraph, "default-full", "(saturate (run))");
+        }
+        Mode::FullCycle => {
+            phase(
+                &mut egraph,
+                "cycle-full",
+                &format!("(saturate {})", split_cycle_schedule()),
+            );
+        }
+    }
+
+    phase(&mut egraph, "final expr", "(saturate expr)");
+    if cleanup {
+        phase(&mut egraph, "cleanup", "(saturate cleanup)");
+    }
+    phase(&mut egraph, "base cleanup", "(saturate base_cleanup)");
+    serialize_summary(&mut egraph, &root);
+}
+
+fn main() {
+    run(parse_args());
+}
--- a/crates/luminal_cuda_lite/src/dyn_backend.rs
+++ b/crates/luminal_cuda_lite/src/dyn_backend.rs
@@ -0,0 +1,75 @@
+//! [`DynBackend`] implementation for the CUDA lite runtime.
+
+use luminal::dtype::DType;
+use luminal::dyn_backend::{BackendCompileArgs, DynBackend, compile_backend};
+use luminal::prelude::*;
+
+use crate::cudarc::driver::CudaContext;
+use crate::runtime::CudaRuntime;
+
+/// [`DynBackend`] wrapper for [`CudaRuntime`].
+pub struct CudaLiteDynBackend {
+    pub runtime: CudaRuntime,
+}
+
+impl DynBackend for CudaLiteDynBackend {
+    fn name(&self) -> &str {
+        "cuda_lite"
+    }
+    fn device_type(&self) -> &str {
+        "cuda"
+    }
+
+    fn set_data_bytes(&mut self, node: NodeIndex, bytes: Vec<u8>, _dtype: DType) {
+        self.runtime.set_data(node, bytes);
+    }
+    fn set_data_f32(&mut self, node: NodeIndex, data: Vec<f32>) {
+        self.runtime.set_data(node, data);
+    }
+    fn get_output_f32(&self, node: NodeIndex) -> Vec<f32> {
+        self.runtime.get_f32(node)
+    }
+    fn get_output_i32(&self, node: NodeIndex) -> Vec<i32> {
+        self.runtime.get_i32(node)
+    }
+    fn get_output_bool(&self, node: NodeIndex) -> Vec<bool> {
+        self.runtime.get_bool(node)
+    }
+    fn execute(&mut self, dyn_map: &FxHashMap<char, usize>) {
+        self.runtime.execute(dyn_map);
+    }
+
+    fn supports_device_ptrs(&self) -> bool {
+        true
+    }
+    unsafe fn set_device_ptr(&mut self, node: NodeIndex, ptr: u64, n: usize) {
+        unsafe { self.runtime.set_device_ptr(node, ptr, n) }
+    }
+    unsafe fn set_output_device_ptr(&mut self, node: NodeIndex, ptr: u64, n: usize) {
+        unsafe { self.runtime.set_output_device_ptr(node, ptr, n) }
+    }
+    fn output_is_zero_copy(&self, node: NodeIndex) -> bool {
+        self.runtime.output_is_zero_copy(node)
+    }
+    unsafe fn copy_output_to_device_ptr(&self, node: NodeIndex, ptr: u64, n: usize) {
+        unsafe { self.runtime.copy_output_to_device_ptr(node, ptr, n) }
+    }
+}
+
+pub fn cuda_lite_factory(
+    graph: &mut Graph,
+    args: BackendCompileArgs,
+) -> Result<Box<dyn DynBackend>, String> {
+    let cuda_ctx = CudaContext::new(0).map_err(|e| format!("CUDA init failed: {e}"))?;
+    let stream = cuda_ctx.default_stream();
+    compile_backend::<CudaRuntime>(
+        graph,
+        args,
+        || Ok(CudaRuntime::initialize(stream)),
+        |rt, node, bytes, _dtype| {
+            rt.set_data(node, bytes);
+        },
+        Some(&|rt, node, ptr, n| unsafe { rt.set_device_ptr(node, ptr, n) }),
+        |rt| Box::new(CudaLiteDynBackend { runtime: rt }),
+    )
+}
--- a/crates/luminal_cuda_lite/src/host/compute_attn_mask.rs
+++ b/crates/luminal_cuda_lite/src/host/compute_attn_mask.rs
@@ -0,0 +1,198 @@
+//! ComputeAttnMask — fused op that computes the paged attention mask from indptrs.
+//!
+//! This op exists so the indptr tensors (qo_indptr, kv_indptr) are visible in the
+//! same e-graph chunk as the attention pattern, letting the FlashInfer egglog rule
+//! capture them directly.
+//!
+//! Inputs (3): q_pos (s,) Int, qo_indptr (r,) Int, kv_indptr (r,) Int.
+//! Output: mask (s, c) F32 where mask[i, j] = 0.0 (attend) or -1e10 (block).
+
+use std::sync::Arc;
+
+use luminal::{
+    egglog_utils::{
+        api::{Rule, SortDef, sort},
+        base::{EXPRESSION, OP_KIND},
+        extract_expr,
+    },
+    op::{EgglogOp, HLIROp, LLIROp},
+    prelude::*,
+};
+
+use crate::{
+    cudarc::driver::{CudaStream, result},
+    host::{DeviceBuffer, HostOp},
+};
+
+/// Computes the paged attention mask from indptr arrays.
+///
+/// The mask encodes both request-membership and causality:
+/// `mask[i, j] = 0.0` if query `i` and context `j` belong to the same request AND
+/// context `j`'s local position is `<= q_pos[i]`; `-1e10` otherwise.
+#[derive(Debug, Default)]
+pub struct ComputeAttnMask {
+    pub s_dim: Expression,
+    pub c_dim: Expression,
+}
+
+impl std::fmt::Display for ComputeAttnMask {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        write!(f, "ComputeAttnMask(s={}, c={})", self.s_dim, self.c_dim)
+    }
+}
+
+impl HLIROp for ComputeAttnMask {
+    fn to_egglog(&self, inputs: &[(NodeIndex, String)]) -> String {
+        format!(
+            "(Op (ComputeAttnMask {} {}) (ICons {} (ICons {} (ICons {} (INil)))))",
+            self.s_dim.to_egglog(),
+            self.c_dim.to_egglog(),
+            inputs[0].1, // q_pos
+            inputs[1].1, // qo_indptr
+            inputs[2].1, // kv_indptr
+        )
+    }
+}
+
+impl EgglogOp for ComputeAttnMask {
+    fn sort(&self) -> SortDef {
+        sort(
+            OP_KIND,
+            "ComputeAttnMask",
+            &[("s_dim", EXPRESSION), ("c_dim", EXPRESSION)],
+        )
+    }
+
+    fn n_inputs(&self) -> usize {
+        3
+    }
+
+    fn rewrites(&self) -> Vec<Rule> {
+        // No rewrites — inserted directly by model code.
+        vec![]
+    }
+
+    fn extract<'a>(
+        &'a self,
+        egraph: &'a luminal::egglog_utils::SerializedEGraph,
+        kind_children: &[&'a ENodeId],
+        input_enodes: Vec<&'a ENodeId>,
+        _list_cache: &mut FxHashMap<&'a ENodeId, Vec<Expression>>,
+        expr_cache: &mut FxHashMap<&'a ENodeId, Expression>,
+    ) -> (LLIROp, Vec<&'a ENodeId>) {
+        let s_dim = extract_expr(egraph, kind_children[0], expr_cache).unwrap();
+        let c_dim = extract_expr(egraph, kind_children[1], expr_cache).unwrap();
+        let op = Self { s_dim, c_dim };
+        let llir_op = LLIROp::new::<dyn HostOp>(Box::new(op) as Box<dyn HostOp>);
+        (llir_op, input_enodes)
+    }
+
+    fn cleanup(&self) -> bool {
+        false
+    }
+}
+
+impl HostOp for ComputeAttnMask {
+    fn execute(
+        &self,
+        stream: &Arc<CudaStream>,
+        self_node: NodeIndex,
+        inputs: &[NodeIndex],
+        buffers: &FxHashMap<NodeIndex, DeviceBuffer>,
+        dyn_map: &FxHashMap<char, usize>,
+    ) -> anyhow::Result<()> {
+        if inputs.len() < 3 {
+            anyhow::bail!(
+                "ComputeAttnMask expects 3 inputs (q_pos, qo_indptr, kv_indptr), got {}",
+                inputs.len()
+            );
+        }
+
+        let s = self
+            .s_dim
+            .exec(dyn_map)
+            .ok_or_else(|| anyhow::anyhow!("ComputeAttnMask s_dim unresolved"))?;
+        let c = self
+            .c_dim
+            .exec(dyn_map)
+            .ok_or_else(|| anyhow::anyhow!("ComputeAttnMask c_dim unresolved"))?;
+        let r = *dyn_map
+            .get(&'r')
+            .ok_or_else(|| anyhow::anyhow!("ComputeAttnMask requires dynamic dim 'r'"))?;
+
+        let get_buf = |name: &str, node: NodeIndex| -> anyhow::Result<DeviceBuffer> {
+            buffers.get(&node).copied().ok_or_else(|| {
+                anyhow::anyhow!("ComputeAttnMask missing {name} buffer for {node:?}")
+            })
+        };
+
+        let q_pos_buf = get_buf("q_pos", inputs[0])?;
+        let qo_indptr_buf = get_buf("qo_indptr", inputs[1])?;
+        let kv_indptr_buf = get_buf("kv_indptr", inputs[2])?;
+        let out_buf = get_buf("output", self_node)?;
+
+        let q_pos = dtoh_i32(stream, q_pos_buf.ptr(), s)?;
+        let qo_indptr = dtoh_i32(stream, qo_indptr_buf.ptr(), r)?;
+        let kv_indptr = dtoh_i32(stream, kv_indptr_buf.ptr(), r)?;
+
+        let mut mask = vec![-1e10f32; s * c];
+        for i in 0..s {
+            let q_req = indptr_to_request(&qo_indptr, i as i32);
+            for j in 0..c {
+                let c_req = indptr_to_request(&kv_indptr, j as i32);
+                if q_req == c_req && q_req >= 0 {
+                    let c_local = j as i32 - kv_indptr[c_req as usize];
+                    if c_local <= q_pos[i] {
+                        mask[i * c + j] = 0.0;
+                    }
+                }
+            }
+        }
+
+        let mask_bytes =
+            unsafe { std::slice::from_raw_parts(mask.as_ptr() as *const u8, mask.len() * 4) };
+        unsafe {
+            let res = cudarc::driver::sys::cuMemcpyHtoD_v2(
+                out_buf.ptr(),
+                mask_bytes.as_ptr() as *const std::ffi::c_void,
+                mask_bytes.len(),
+            );
+            if res != cudarc::driver::sys::CUresult::CUDA_SUCCESS {
+                anyhow::bail!("ComputeAttnMask cuMemcpyHtoD failed: {res:?}");
+            }
+        }
+
+        Ok(())
+    }
+
+    fn output_size(&self) -> Expression {
+        self.s_dim * self.c_dim
+    }
+
+    fn output_bytes(&self) -> Expression {
+        self.output_size() * 4
+    }
+
+    fn stats_name(&self) -> Option<&'static str> {
+        Some("ComputeAttnMask")
+    }
+}
+
+fn dtoh_i32(stream: &Arc<CudaStream>, dev_ptr: u64, len: usize) -> anyhow::Result<Vec<i32>> {
+    let mut host = vec![0u8; len * std::mem::size_of::<i32>()];
+    unsafe {
+        result::memcpy_dtoh_async(&mut host, dev_ptr, stream.cu_stream())?;
+    }
+    stream.synchronize()?;
+    let v = unsafe {
+        let mut bytes = std::mem::ManuallyDrop::new(host);
+        Vec::from_raw_parts(bytes.as_mut_ptr() as *mut i32, len, len)
+    };
+    Ok(v)
+}
+
+/// Given an indptr array `[0, a, b, ...]`, find which segment `idx` belongs to.
+/// Returns `count(indptr[i] <= idx) - 1`.
+fn indptr_to_request(indptr: &[i32], idx: i32) -> i32 {
+    indptr.iter().filter(|&&v| v <= idx).count() as i32 - 1
+}
--- a/crates/luminal_cuda_lite/src/host/cublas/mod.rs
+++ b/crates/luminal_cuda_lite/src/host/cublas/mod.rs
@@ -19,9 +19,9 @@ use crate::{
            CudaBlas,
            sys::{cublasOperation_t, cublasSetStream_v2, cublasSgemm_v2, cublasStatus_t},
        },
-        driver::{CudaSlice, CudaStream, DevicePtr},
+        driver::CudaStream,
    },
-    host::HostOp,
+    host::{DeviceBuffer, HostOp},
 };

 /// Global shared cuBLAS handle to avoid per-operation workspace allocation
@@ -156,7 +156,7 @@ impl HostOp for CuBlasSgemmV2 {
        stream: &Arc<CudaStream>,
        self_node: NodeIndex,
        inputs: &[NodeIndex],
-        buffers: &FxHashMap<NodeIndex, &CudaSlice<u8>>,
+        buffers: &FxHashMap<NodeIndex, DeviceBuffer>,
        dyn_map: &FxHashMap<char, usize>,
    ) -> anyhow::Result<()> {
        // GEMM parameters
@@ -178,9 +178,9 @@ impl HostOp for CuBlasSgemmV2 {
        let b_buf = buffers[&inputs[1]];

        // Get device pointers
-        let (a_ptr, _a_guard) = a_buf.device_ptr(stream);
-        let (b_ptr, _b_guard) = b_buf.device_ptr(stream);
-        let (c_ptr, _c_guard) = c_buf.device_ptr(stream);
+        let a_ptr = a_buf.ptr();
+        let b_ptr = b_buf.ptr();
+        let c_ptr = c_buf.ptr();

        // Debug: Check buffer sizes
        trace!(
--- a/crates/luminal_cuda_lite/src/host/cublas/sgemm_v2_CmCm_rewrite.egg
+++ b/crates/luminal_cuda_lite/src/host/cublas/sgemm_v2_CmCm_rewrite.egg
@@ -68,5 +68,6 @@
        (union ?sum ?sgemm)
        (set (dtype ?sgemm) (F32))
    )
+    :ruleset matmul_backend
    :name "cublas sgemm column-major × column-major"
 )
--- a/crates/luminal_cuda_lite/src/host/cublas/sgemm_v2_CmRm_rewrite.egg
+++ b/crates/luminal_cuda_lite/src/host/cublas/sgemm_v2_CmRm_rewrite.egg
@@ -68,5 +68,6 @@
        (union ?sum ?sgemm)
        (set (dtype ?sgemm) (F32))
    )
+    :ruleset matmul_backend
    :name "cublas sgemm column-major × row-major"
 )
--- a/crates/luminal_cuda_lite/src/host/cublas/sgemm_v2_RmCm_rewrite.egg
+++ b/crates/luminal_cuda_lite/src/host/cublas/sgemm_v2_RmCm_rewrite.egg
@@ -68,5 +68,6 @@
        (union ?sum ?sgemm)
        (set (dtype ?sgemm) (F32))
    )
+    :ruleset matmul_backend
    :name "cublas sgemm row-major × column-major"
-)
+)
--- a/crates/luminal_cuda_lite/src/host/cublas/sgemm_v2_RmRm_rewrite.egg
+++ b/crates/luminal_cuda_lite/src/host/cublas/sgemm_v2_RmRm_rewrite.egg
@@ -68,5 +68,6 @@
        (union ?sum ?sgemm)
        (set (dtype ?sgemm) (F32))
    )
+    :ruleset matmul_backend
    :name "cublas sgemm row-major"
-)
+)
--- a/crates/luminal_cuda_lite/src/host/cublaslt/cublaslt_CmCm_rewrite.egg
+++ b/crates/luminal_cuda_lite/src/host/cublaslt/cublaslt_CmCm_rewrite.egg
@@ -42,6 +42,7 @@

        (= ?dt (dtype ?a))
        (= ?dt (dtype ?b))
+        (cublaslt_base_dtype ?dt)
    )
    (
        ; For column-major A × column-major B with cuBLAS:
@@ -52,18 +53,22 @@
            ?k             ; k unchanged
            "T"            ; transa = Transpose (B is column-major [k,n], need B^T[n,k])
            "T"            ; transb = Transpose (A is column-major [m,k], need A^T[k,m])
+            "COL" "COL" "COL" "COL" ; A/B/C/D matrix orders
            ?b_n_stride    ; lda = B's column stride (resolves to k after z→1)
            ?a_k_stride    ; ldb = A's column stride (resolves to m after z→1)
            ?n             ; ldc = n (row-major C[m,n] viewed as col-major [n,m])
+            ?n             ; ldd = ldc for current row-major output rewrites
            (MNum 1)       ; batch_count = 1
            (MNum 0)       ; stride_a = 0
            (MNum 0)       ; stride_b = 0
            (MNum 0)       ; stride_c = 0
-            ?dt)           ; dtype
+            (MNum 0)       ; stride_d = 0
+            ?dt ?dt ?dt ?dt "default" "default" 1.0 0.0 "DEFAULT")   ; type tuple, alpha, beta
            (ICons ?b (ICons ?a (INil)))))
        (union ?sum ?sgemm)
        (set (dtype ?sgemm) ?dt)
    )
+    :ruleset matmul_backend
    :name "cublaslt column-major × column-major"
 )

@@ -111,23 +116,28 @@

        (= ?dt (dtype ?a))
        (= ?dt (dtype ?b))
+        (cublaslt_base_dtype ?dt)
    )
    (
        ; cuBLAS: cublas(OP_T, OP_T, n, m, k, B, lda=b_n_stride, A, ldb=a_k_stride, C, ldc=n)
        (let ?sgemm (Op (cublaslt
            ?n ?m ?k
            "T" "T"
+            "COL" "COL" "COL" "COL"
            ?b_n_stride       ; lda (cuBLAS A = our B, column stride)
            ?a_k_stride       ; ldb (cuBLAS B = our A, column stride)
            ?n                ; ldc
+            ?n                ; ldd
            ?batch
            ?b_batch_stride   ; stride_a (cuBLAS A = our B)
            ?a_batch_stride   ; stride_b (cuBLAS B = our A)
            (MMul ?m ?n)      ; stride_c
-            ?dt)
+            (MMul ?m ?n)      ; stride_d
+            ?dt ?dt ?dt ?dt "default" "default" 1.0 0.0 "DEFAULT")
            (ICons ?b (ICons ?a (INil)))))
        (union ?sum ?sgemm)
        (set (dtype ?sgemm) ?dt)
    )
+    :ruleset matmul_backend
    :name "cublaslt batched column-major × column-major"
 )
--- a/crates/luminal_cuda_lite/src/host/cublaslt/cublaslt_CmRm_rewrite.egg
+++ b/crates/luminal_cuda_lite/src/host/cublaslt/cublaslt_CmRm_rewrite.egg
@@ -42,6 +42,7 @@

        (= ?dt (dtype ?a))
        (= ?dt (dtype ?b))
+        (cublaslt_base_dtype ?dt)
    )
    (
        ; For column-major A × row-major B with cuBLAS:
@@ -52,18 +53,22 @@
            ?k             ; k unchanged
            "N"            ; transa = No transpose (B is row-major, viewed as col-major [n,k])
            "T"            ; transb = Transpose (A is column-major [m,k], need A^T[k,m])
+            "COL" "COL" "COL" "COL" ; A/B/C/D matrix orders
            ?b_k_stride    ; lda = B's row stride (resolves to n after z→1)
            ?a_k_stride    ; ldb = A's column stride (resolves to m after z→1)
            ?n             ; ldc = n (row-major C[m,n] viewed as col-major [n,m])
+            ?n             ; ldd = ldc for current row-major output rewrites
            (MNum 1)       ; batch_count = 1
            (MNum 0)       ; stride_a = 0
            (MNum 0)       ; stride_b = 0
            (MNum 0)       ; stride_c = 0
-            ?dt)           ; dtype
+            (MNum 0)       ; stride_d = 0
+            ?dt ?dt ?dt ?dt "default" "default" 1.0 0.0 "DEFAULT")   ; type tuple, alpha, beta
            (ICons ?b (ICons ?a (INil)))))
        (union ?sum ?sgemm)
        (set (dtype ?sgemm) ?dt)
    )
+    :ruleset matmul_backend
    :name "cublaslt column-major × row-major"
 )

@@ -111,23 +116,28 @@

        (= ?dt (dtype ?a))
        (= ?dt (dtype ?b))
+        (cublaslt_base_dtype ?dt)
    )
    (
        ; cuBLAS: cublas(OP_N, OP_T, n, m, k, B, lda=b_k_stride, A, ldb=a_k_stride, C, ldc=n)
        (let ?sgemm (Op (cublaslt
            ?n ?m ?k
            "N" "T"
+            "COL" "COL" "COL" "COL"
            ?b_k_stride       ; lda (cuBLAS A = our B, row stride)
            ?a_k_stride       ; ldb (cuBLAS B = our A, column stride)
            ?n                ; ldc
+            ?n                ; ldd
            ?batch
            ?b_batch_stride   ; stride_a (cuBLAS A = our B)
            ?a_batch_stride   ; stride_b (cuBLAS B = our A)
            (MMul ?m ?n)      ; stride_c
-            ?dt)
+            (MMul ?m ?n)      ; stride_d
+            ?dt ?dt ?dt ?dt "default" "default" 1.0 0.0 "DEFAULT")
            (ICons ?b (ICons ?a (INil)))))
        (union ?sum ?sgemm)
        (set (dtype ?sgemm) ?dt)
    )
+    :ruleset matmul_backend
    :name "cublaslt batched column-major × row-major"
 )
--- a/crates/luminal_cuda_lite/src/host/cublaslt/cublaslt_RmCm_rewrite.egg
+++ b/crates/luminal_cuda_lite/src/host/cublaslt/cublaslt_RmCm_rewrite.egg
@@ -42,6 +42,7 @@

        (= ?dt (dtype ?a))
        (= ?dt (dtype ?b))
+        (cublaslt_base_dtype ?dt)
    )
    (
        ; For row-major A × column-major B with cuBLAS:
@@ -52,18 +53,22 @@
            ?k             ; k unchanged
            "T"            ; transa = Transpose (B is column-major, need B^T)
            "N"            ; transb = No transpose
+            "COL" "COL" "COL" "COL" ; A/B/C/D matrix orders
            ?b_n_stride    ; lda = B's column stride (resolves to k after z→1)
            ?a_m_stride    ; ldb = A's row stride (resolves to k after z→1)
            ?n             ; ldc = n (row-major C[m,n] viewed as col-major [n,m])
+            ?n             ; ldd = ldc for current row-major output rewrites
            (MNum 1)       ; batch_count = 1
            (MNum 0)       ; stride_a = 0
            (MNum 0)       ; stride_b = 0
            (MNum 0)       ; stride_c = 0
-            ?dt)           ; dtype
+            (MNum 0)       ; stride_d = 0
+            ?dt ?dt ?dt ?dt "default" "default" 1.0 0.0 "DEFAULT")   ; type tuple, alpha, beta
            (ICons ?b (ICons ?a (INil)))))
        (union ?sum ?sgemm)
        (set (dtype ?sgemm) ?dt)
    )
+    :ruleset matmul_backend
    :name "cublaslt row-major × column-major"
 )

@@ -111,23 +116,28 @@

        (= ?dt (dtype ?a))
        (= ?dt (dtype ?b))
+        (cublaslt_base_dtype ?dt)
    )
    (
        ; cuBLAS: cublas(OP_T, OP_N, n, m, k, B, lda=b_n_stride, A, ldb=a_m_stride, C, ldc=n)
        (let ?sgemm (Op (cublaslt
            ?n ?m ?k
            "T" "N"
+            "COL" "COL" "COL" "COL"
            ?b_n_stride       ; lda (cuBLAS A = our B, column stride)
            ?a_m_stride       ; ldb (cuBLAS B = our A, row stride)
            ?n                ; ldc
+            ?n                ; ldd
            ?batch
            ?b_batch_stride   ; stride_a (cuBLAS A = our B)
            ?a_batch_stride   ; stride_b (cuBLAS B = our A)
            (MMul ?m ?n)      ; stride_c
-            ?dt)
+            (MMul ?m ?n)      ; stride_d
+            ?dt ?dt ?dt ?dt "default" "default" 1.0 0.0 "DEFAULT")
            (ICons ?b (ICons ?a (INil)))))
        (union ?sum ?sgemm)
        (set (dtype ?sgemm) ?dt)
    )
+    :ruleset matmul_backend
    :name "cublaslt batched row-major × column-major"
 )
--- a/crates/luminal_cuda_lite/src/host/cublaslt/cublaslt_RmRm_rewrite.egg
+++ b/crates/luminal_cuda_lite/src/host/cublaslt/cublaslt_RmRm_rewrite.egg
@@ -42,6 +42,7 @@

        (= ?dt (dtype ?a))
        (= ?dt (dtype ?b))
+        (cublaslt_base_dtype ?dt)
    )
    (
        ; For row-major C = A × B with cuBLAS (column-major):
@@ -52,18 +53,22 @@
            ?k             ; k unchanged
            "N"            ; transa = No transpose
            "N"            ; transb = No transpose
+            "COL" "COL" "COL" "COL" ; A/B/C/D matrix orders
            ?b_k_stride    ; lda = B's row stride (resolves to n after z→1)
            ?a_m_stride    ; ldb = A's row stride (resolves to k after z→1)
            ?n             ; ldc = n (row-major C[m,n] viewed as col-major [n,m])
+            ?n             ; ldd = ldc for current row-major output rewrites
            (MNum 1)       ; batch_count = 1
            (MNum 0)       ; stride_a = 0
            (MNum 0)       ; stride_b = 0
            (MNum 0)       ; stride_c = 0
-            ?dt)           ; dtype
+            (MNum 0)       ; stride_d = 0
+            ?dt ?dt ?dt ?dt "default" "default" 1.0 0.0 "DEFAULT")   ; type tuple, alpha, beta
            (ICons ?b (ICons ?a (INil)))))
        (union ?sum ?sgemm)
        (set (dtype ?sgemm) ?dt)
    )
+    :ruleset matmul_backend
    :name "cublaslt row-major x row-major"
 )

@@ -116,6 +121,7 @@

        (= ?dt (dtype ?a))
        (= ?dt (dtype ?b))
+        (cublaslt_base_dtype ?dt)
    )
    (
        ; cuBLAS swap: C^T[n,m] = B^T[n,k] × A^T[k,m] per batch
@@ -123,17 +129,21 @@
        (let ?sgemm (Op (cublaslt
            ?n ?m ?k
            "N" "N"
+            "COL" "COL" "COL" "COL"
            ?b_k_stride       ; lda (cuBLAS A = our B, row stride)
            ?a_m_stride       ; ldb (cuBLAS B = our A, row stride)
            ?n                ; ldc (contiguous output per batch)
+            ?n                ; ldd
            ?batch            ; batch_count
            ?b_batch_stride   ; stride_a (cuBLAS A = our B)
            ?a_batch_stride   ; stride_b (cuBLAS B = our A)
            (MMul ?m ?n)      ; stride_c
-            ?dt)
+            (MMul ?m ?n)      ; stride_d
+            ?dt ?dt ?dt ?dt "default" "default" 1.0 0.0 "DEFAULT")
            (ICons ?b (ICons ?a (INil)))))
        (union ?sum ?sgemm)
        (set (dtype ?sgemm) ?dt)
    )
+    :ruleset matmul_backend
    :name "cublaslt batched row-major × row-major"
 )
--- a/crates/luminal_cuda_lite/src/host/cublaslt/cublaslt_beta_rewrite.egg
+++ b/crates/luminal_cuda_lite/src/host/cublaslt/cublaslt_beta_rewrite.egg
@@ -0,0 +1,428 @@
+; Fuse a row-major Add on top of an existing cuBLASLt matmul into
+; D = alpha * A * B + beta * C.
+;
+; The existing matmul rewrites view Luminal's row-major output [m,n] as a
+; column-major cuBLASLt matrix [n,m]. A row-major C input with logical strides
+; [row_stride, 1] therefore maps to ldc=row_stride. This lets a C slice from a
+; wider parent tensor use a larger ldc while D keeps the matmul output layout.
+; cuBLASLt requires out-of-place C and D to have the same matrix order, so these
+; beta rules only fuse C layouts that map to the current COL-ordered D layout.
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?matmul_c_order "COL"
+            ?lda ?ldb ?matmul_ldc ?ldd
+            (MNum 1)
+            ?stride_a ?stride_b ?matmul_stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 ?epilogue)
+            (ICons ?a (ICons ?b ?matmul_tail))))
+        (!= ?epilogue "RELU")
+        (!= ?epilogue "RELU_BIAS")
+        (!= ?epilogue "GELU")
+        (!= ?epilogue "GELU_BIAS")
+
+        (= ?add (Op (Add
+            (ECons ?n (ECons ?m (ENil)))
+            ?matmul_add_strides
+            ?c_add_strides
+            ?add_out_strides)
+            (ICons ?matmul (ICons ?c (INil)))))
+
+        (= ?matmul_add_strides (ECons ?d_row_stride (ECons ?d_col_stride (ENil))))
+        (= ?c_add_strides (ECons ?c_row_stride (ECons ?c_col_stride (ENil))))
+        (= ?add_out_strides (ECons ?d_row_stride (ECons ?d_col_stride (ENil))))
+        (= ?c_col_stride (MIter))
+        (!= ?c_row_stride (MNum 0))
+        (= ?matmul_add_strides ?add_out_strides)
+        (= ?c_dtype (dtype ?c))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order "COL" "COL"
+            ?lda ?ldb ?c_row_stride ?ldd
+            (MNum 1)
+            ?stride_a ?stride_b (MNum 0) ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 1.0 ?epilogue)
+            (ICons ?a (ICons ?b (ICons ?c ?matmul_tail)))))
+        (union ?add ?fused)
+        (set (dtype ?fused) ?d_dtype)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt 2d matmul plus c beta"
+)
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?matmul_c_order "COL"
+            ?lda ?ldb ?matmul_ldc ?ldd
+            (MNum 1)
+            ?stride_a ?stride_b ?matmul_stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 ?epilogue)
+            (ICons ?a (ICons ?b ?matmul_tail))))
+        (!= ?epilogue "RELU")
+        (!= ?epilogue "RELU_BIAS")
+        (!= ?epilogue "GELU")
+        (!= ?epilogue "GELU_BIAS")
+
+        (= ?add (Op (Add
+            (ECons ?n (ECons ?m (ENil)))
+            ?c_add_strides
+            ?matmul_add_strides
+            ?add_out_strides)
+            (ICons ?c (ICons ?matmul (INil)))))
+
+        (= ?matmul_add_strides (ECons ?d_row_stride (ECons ?d_col_stride (ENil))))
+        (= ?c_add_strides (ECons ?c_row_stride (ECons ?c_col_stride (ENil))))
+        (= ?add_out_strides (ECons ?d_row_stride (ECons ?d_col_stride (ENil))))
+        (= ?c_col_stride (MIter))
+        (!= ?c_row_stride (MNum 0))
+        (= ?matmul_add_strides ?add_out_strides)
+        (= ?c_dtype (dtype ?c))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order "COL" "COL"
+            ?lda ?ldb ?c_row_stride ?ldd
+            (MNum 1)
+            ?stride_a ?stride_b (MNum 0) ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 1.0 ?epilogue)
+            (ICons ?a (ICons ?b (ICons ?c ?matmul_tail)))))
+        (union ?add ?fused)
+        (set (dtype ?fused) ?d_dtype)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt 2d c plus matmul beta"
+)
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?matmul_c_order "COL"
+            ?lda ?ldb ?matmul_ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?matmul_stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 ?epilogue)
+            (ICons ?a (ICons ?b ?matmul_tail))))
+        (!= ?epilogue "RELU")
+        (!= ?epilogue "RELU_BIAS")
+        (!= ?epilogue "GELU")
+        (!= ?epilogue "GELU_BIAS")
+
+        (= ?add (Op (Add
+            (ECons ?batch (ECons ?n (ECons ?m (ENil))))
+            ?matmul_add_strides
+            ?c_add_strides
+            ?add_out_strides)
+            (ICons ?matmul (ICons ?c (INil)))))
+
+        (= ?matmul_add_strides (ECons ?d_batch_stride (ECons ?d_row_stride (ECons ?d_col_stride (ENil)))))
+        (= ?c_add_strides (ECons ?c_batch_stride (ECons ?c_row_stride (ECons ?c_col_stride (ENil)))))
+        (= ?add_out_strides (ECons ?d_batch_stride (ECons ?d_row_stride (ECons ?d_col_stride (ENil)))))
+        (= ?c_col_stride (MIter))
+        (!= ?c_row_stride (MNum 0))
+        (= ?matmul_add_strides ?add_out_strides)
+        (= ?c_dtype (dtype ?c))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order "COL" "COL"
+            ?lda ?ldb ?c_row_stride ?ldd
+            ?batch
+            ?stride_a ?stride_b ?c_batch_stride ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 1.0 ?epilogue)
+            (ICons ?a (ICons ?b (ICons ?c ?matmul_tail)))))
+        (union ?add ?fused)
+        (set (dtype ?fused) ?d_dtype)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt batched matmul plus c beta"
+)
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?matmul_c_order "COL"
+            ?lda ?ldb ?matmul_ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?matmul_stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 ?epilogue)
+            (ICons ?a (ICons ?b ?matmul_tail))))
+        (!= ?epilogue "RELU")
+        (!= ?epilogue "RELU_BIAS")
+        (!= ?epilogue "GELU")
+        (!= ?epilogue "GELU_BIAS")
+
+        (= ?add (Op (Add
+            (ECons ?batch (ECons ?n (ECons ?m (ENil))))
+            ?c_add_strides
+            ?matmul_add_strides
+            ?add_out_strides)
+            (ICons ?c (ICons ?matmul (INil)))))
+
+        (= ?matmul_add_strides (ECons ?d_batch_stride (ECons ?d_row_stride (ECons ?d_col_stride (ENil)))))
+        (= ?c_add_strides (ECons ?c_batch_stride (ECons ?c_row_stride (ECons ?c_col_stride (ENil)))))
+        (= ?add_out_strides (ECons ?d_batch_stride (ECons ?d_row_stride (ECons ?d_col_stride (ENil)))))
+        (= ?c_col_stride (MIter))
+        (!= ?c_row_stride (MNum 0))
+        (= ?matmul_add_strides ?add_out_strides)
+        (= ?c_dtype (dtype ?c))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order "COL" "COL"
+            ?lda ?ldb ?c_row_stride ?ldd
+            ?batch
+            ?stride_a ?stride_b ?c_batch_stride ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 1.0 ?epilogue)
+            (ICons ?a (ICons ?b (ICons ?c ?matmul_tail)))))
+        (union ?add ?fused)
+        (set (dtype ?fused) ?d_dtype)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt batched c plus matmul beta"
+)
+
+; ROW-ordered D beta fusions. These pair with cublaslt_row_order_rewrite.egg,
+; where the cuBLASLt problem dimensions match Luminal's logical output [m,n].
+; A row-major C input with logical strides [row_stride, 1] maps directly to a
+; ROW-ordered cuBLASLt C[m,n] descriptor with ldc=row_stride.
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?matmul_c_order "ROW"
+            ?lda ?ldb ?matmul_ldc ?ldd
+            (MNum 1)
+            ?stride_a ?stride_b ?matmul_stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 ?epilogue)
+            (ICons ?a (ICons ?b ?matmul_tail))))
+        (!= ?epilogue "RELU")
+        (!= ?epilogue "RELU_BIAS")
+        (!= ?epilogue "GELU")
+        (!= ?epilogue "GELU_BIAS")
+
+        (= ?add (Op (Add
+            (ECons ?m (ECons ?n (ENil)))
+            ?matmul_add_strides
+            ?c_add_strides
+            ?add_out_strides)
+            (ICons ?matmul (ICons ?c (INil)))))
+
+        (= ?matmul_add_strides (ECons ?d_row_stride (ECons ?d_col_stride (ENil))))
+        (= ?c_add_strides (ECons ?c_row_stride (ECons ?c_col_stride (ENil))))
+        (= ?add_out_strides (ECons ?d_row_stride (ECons ?d_col_stride (ENil))))
+        (= ?c_col_stride (MIter))
+        (!= ?c_row_stride (MNum 0))
+        (= ?matmul_add_strides ?add_out_strides)
+        (= ?c_dtype (dtype ?c))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order "ROW" "ROW"
+            ?lda ?ldb ?c_row_stride ?ldd
+            (MNum 1)
+            ?stride_a ?stride_b (MNum 0) ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 1.0 ?epilogue)
+            (ICons ?a (ICons ?b (ICons ?c ?matmul_tail)))))
+        (union ?add ?fused)
+        (set (dtype ?fused) ?d_dtype)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt row-order 2d matmul plus c beta"
+)
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?matmul_c_order "ROW"
+            ?lda ?ldb ?matmul_ldc ?ldd
+            (MNum 1)
+            ?stride_a ?stride_b ?matmul_stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 ?epilogue)
+            (ICons ?a (ICons ?b ?matmul_tail))))
+        (!= ?epilogue "RELU")
+        (!= ?epilogue "RELU_BIAS")
+        (!= ?epilogue "GELU")
+        (!= ?epilogue "GELU_BIAS")
+
+        (= ?add (Op (Add
+            (ECons ?m (ECons ?n (ENil)))
+            ?c_add_strides
+            ?matmul_add_strides
+            ?add_out_strides)
+            (ICons ?c (ICons ?matmul (INil)))))
+
+        (= ?matmul_add_strides (ECons ?d_row_stride (ECons ?d_col_stride (ENil))))
+        (= ?c_add_strides (ECons ?c_row_stride (ECons ?c_col_stride (ENil))))
+        (= ?add_out_strides (ECons ?d_row_stride (ECons ?d_col_stride (ENil))))
+        (= ?c_col_stride (MIter))
+        (!= ?c_row_stride (MNum 0))
+        (= ?matmul_add_strides ?add_out_strides)
+        (= ?c_dtype (dtype ?c))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order "ROW" "ROW"
+            ?lda ?ldb ?c_row_stride ?ldd
+            (MNum 1)
+            ?stride_a ?stride_b (MNum 0) ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 1.0 ?epilogue)
+            (ICons ?a (ICons ?b (ICons ?c ?matmul_tail)))))
+        (union ?add ?fused)
+        (set (dtype ?fused) ?d_dtype)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt row-order 2d c plus matmul beta"
+)
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?matmul_c_order "ROW"
+            ?lda ?ldb ?matmul_ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?matmul_stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 ?epilogue)
+            (ICons ?a (ICons ?b ?matmul_tail))))
+        (!= ?epilogue "RELU")
+        (!= ?epilogue "RELU_BIAS")
+        (!= ?epilogue "GELU")
+        (!= ?epilogue "GELU_BIAS")
+
+        (= ?add (Op (Add
+            (ECons ?batch (ECons ?m (ECons ?n (ENil))))
+            ?matmul_add_strides
+            ?c_add_strides
+            ?add_out_strides)
+            (ICons ?matmul (ICons ?c (INil)))))
+
+        (= ?matmul_add_strides (ECons ?d_batch_stride (ECons ?d_row_stride (ECons ?d_col_stride (ENil)))))
+        (= ?c_add_strides (ECons ?c_batch_stride (ECons ?c_row_stride (ECons ?c_col_stride (ENil)))))
+        (= ?add_out_strides (ECons ?d_batch_stride (ECons ?d_row_stride (ECons ?d_col_stride (ENil)))))
+        (= ?c_col_stride (MIter))
+        (!= ?c_row_stride (MNum 0))
+        (= ?matmul_add_strides ?add_out_strides)
+        (= ?c_dtype (dtype ?c))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order "ROW" "ROW"
+            ?lda ?ldb ?c_row_stride ?ldd
+            ?batch
+            ?stride_a ?stride_b ?c_batch_stride ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 1.0 ?epilogue)
+            (ICons ?a (ICons ?b (ICons ?c ?matmul_tail)))))
+        (union ?add ?fused)
+        (set (dtype ?fused) ?d_dtype)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt row-order batched matmul plus c beta"
+)
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?matmul_c_order "ROW"
+            ?lda ?ldb ?matmul_ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?matmul_stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 ?epilogue)
+            (ICons ?a (ICons ?b ?matmul_tail))))
+        (!= ?epilogue "RELU")
+        (!= ?epilogue "RELU_BIAS")
+        (!= ?epilogue "GELU")
+        (!= ?epilogue "GELU_BIAS")
+
+        (= ?add (Op (Add
+            (ECons ?batch (ECons ?m (ECons ?n (ENil))))
+            ?c_add_strides
+            ?matmul_add_strides
+            ?add_out_strides)
+            (ICons ?c (ICons ?matmul (INil)))))
+
+        (= ?matmul_add_strides (ECons ?d_batch_stride (ECons ?d_row_stride (ECons ?d_col_stride (ENil)))))
+        (= ?c_add_strides (ECons ?c_batch_stride (ECons ?c_row_stride (ECons ?c_col_stride (ENil)))))
+        (= ?add_out_strides (ECons ?d_batch_stride (ECons ?d_row_stride (ECons ?d_col_stride (ENil)))))
+        (= ?c_col_stride (MIter))
+        (!= ?c_row_stride (MNum 0))
+        (= ?matmul_add_strides ?add_out_strides)
+        (= ?c_dtype (dtype ?c))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order "ROW" "ROW"
+            ?lda ?ldb ?c_row_stride ?ldd
+            ?batch
+            ?stride_a ?stride_b ?c_batch_stride ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 1.0 ?epilogue)
+            (ICons ?a (ICons ?b (ICons ?c ?matmul_tail)))))
+        (union ?add ?fused)
+        (set (dtype ?fused) ?d_dtype)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt row-order batched c plus matmul beta"
+)
--- a/crates/luminal_cuda_lite/src/host/cublaslt/cublaslt_epilogue_rewrite.egg
+++ b/crates/luminal_cuda_lite/src/host/cublaslt/cublaslt_epilogue_rewrite.egg
@@ -0,0 +1,614 @@
+; cuBLASLt epilogue rewrites.
+;
+; ReLU in the frontend lowers through maximum_f32(0.0):
+;
+;   (matmul < 0) * 0 + cast(cast((-cast(matmul < 0) + 1) as bool) as f32) * matmul
+;
+; These rules fuse that expression back into CUBLASLT_EPILOGUE_RELU.
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order ?d_order
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype (F32)
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 "DEFAULT")
+            (ICons ?a (ICons ?b ?matmul_tail))))
+
+        (= ?zero (Op (Constant 0.0) (INil)))
+        (= ?neg_one (Op (Constant -1.0) (INil)))
+        (= ?one (Op (Constant 1.0) (INil)))
+
+        (= ?lt (Op (LessThan
+            ?shape
+            ?matmul_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ENil)))
+            ?mask_strides)
+            (ICons ?matmul (ICons ?zero (INil)))))
+        (= ?lt_f32 (Op (Cast ?size (F32)) (ICons ?lt (INil))))
+
+        (= ?zeroed (Op (Mul
+            ?shape
+            ?mask_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ENil)))
+            ?zeroed_strides)
+            (ICons ?lt_f32 (ICons ?zero (INil)))))
+
+        (= ?neg_mask (Op (Mul
+            ?shape
+            ?mask_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ENil)))
+            ?neg_mask_strides)
+            (ICons ?lt_f32 (ICons ?neg_one (INil)))))
+        (= ?not_mask_f32 (Op (Add
+            ?shape
+            ?neg_mask_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ENil)))
+            ?not_mask_f32_strides)
+            (ICons ?neg_mask (ICons ?one (INil)))))
+        (= ?not_mask_bool (Op (Cast ?size (Bool)) (ICons ?not_mask_f32 (INil))))
+        (= ?not_mask (Op (Cast ?size (F32)) (ICons ?not_mask_bool (INil))))
+
+        (= ?positive (Op (Mul
+            ?shape
+            ?not_mask_f32_strides
+            ?matmul_strides
+            ?positive_strides)
+            (ICons ?not_mask (ICons ?matmul (INil)))))
+        (= ?relu (Op (Add
+            ?shape
+            ?zeroed_strides
+            ?positive_strides
+            ?relu_strides)
+            (ICons ?zeroed (ICons ?positive (INil)))))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order ?d_order
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype (F32)
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 "RELU")
+            (ICons ?a (ICons ?b ?matmul_tail))))
+        (union ?relu ?fused)
+        (set (dtype ?fused) (F32))
+    )
+    :ruleset matmul_backend
+    :name "cublaslt 2d relu epilogue"
+)
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order ?d_order
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype (F32)
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 "DEFAULT")
+            (ICons ?a (ICons ?b ?matmul_tail))))
+
+        (= ?zero (Op (Constant 0.0) (INil)))
+        (= ?neg_one (Op (Constant -1.0) (INil)))
+        (= ?one (Op (Constant 1.0) (INil)))
+
+        (= ?lt (Op (LessThan
+            ?shape
+            ?matmul_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ECons (MNum 0) (ENil))))
+            ?mask_strides)
+            (ICons ?matmul (ICons ?zero (INil)))))
+        (= ?lt_f32 (Op (Cast ?size (F32)) (ICons ?lt (INil))))
+
+        (= ?zeroed (Op (Mul
+            ?shape
+            ?mask_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ECons (MNum 0) (ENil))))
+            ?zeroed_strides)
+            (ICons ?lt_f32 (ICons ?zero (INil)))))
+
+        (= ?neg_mask (Op (Mul
+            ?shape
+            ?mask_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ECons (MNum 0) (ENil))))
+            ?neg_mask_strides)
+            (ICons ?lt_f32 (ICons ?neg_one (INil)))))
+        (= ?not_mask_f32 (Op (Add
+            ?shape
+            ?neg_mask_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ECons (MNum 0) (ENil))))
+            ?not_mask_f32_strides)
+            (ICons ?neg_mask (ICons ?one (INil)))))
+        (= ?not_mask_bool (Op (Cast ?size (Bool)) (ICons ?not_mask_f32 (INil))))
+        (= ?not_mask (Op (Cast ?size (F32)) (ICons ?not_mask_bool (INil))))
+
+        (= ?positive (Op (Mul
+            ?shape
+            ?not_mask_f32_strides
+            ?matmul_strides
+            ?positive_strides)
+            (ICons ?not_mask (ICons ?matmul (INil)))))
+        (= ?relu (Op (Add
+            ?shape
+            ?zeroed_strides
+            ?positive_strides
+            ?relu_strides)
+            (ICons ?zeroed (ICons ?positive (INil)))))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order ?d_order
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype (F32)
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 "RELU")
+            (ICons ?a (ICons ?b ?matmul_tail))))
+        (union ?relu ?fused)
+        (set (dtype ?fused) (F32))
+    )
+    :ruleset matmul_backend
+    :name "cublaslt batched relu epilogue"
+)
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order ?d_order
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype (F32)
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 "BIAS")
+            (ICons ?a (ICons ?b ?matmul_tail))))
+
+        (= ?zero (Op (Constant 0.0) (INil)))
+        (= ?neg_one (Op (Constant -1.0) (INil)))
+        (= ?one (Op (Constant 1.0) (INil)))
+
+        (= ?lt (Op (LessThan
+            ?shape
+            ?matmul_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ENil)))
+            ?mask_strides)
+            (ICons ?matmul (ICons ?zero (INil)))))
+        (= ?lt_f32 (Op (Cast ?size (F32)) (ICons ?lt (INil))))
+
+        (= ?zeroed (Op (Mul
+            ?shape
+            ?mask_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ENil)))
+            ?zeroed_strides)
+            (ICons ?lt_f32 (ICons ?zero (INil)))))
+
+        (= ?neg_mask (Op (Mul
+            ?shape
+            ?mask_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ENil)))
+            ?neg_mask_strides)
+            (ICons ?lt_f32 (ICons ?neg_one (INil)))))
+        (= ?not_mask_f32 (Op (Add
+            ?shape
+            ?neg_mask_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ENil)))
+            ?not_mask_f32_strides)
+            (ICons ?neg_mask (ICons ?one (INil)))))
+        (= ?not_mask_bool (Op (Cast ?size (Bool)) (ICons ?not_mask_f32 (INil))))
+        (= ?not_mask (Op (Cast ?size (F32)) (ICons ?not_mask_bool (INil))))
+
+        (= ?positive (Op (Mul
+            ?shape
+            ?not_mask_f32_strides
+            ?matmul_strides
+            ?positive_strides)
+            (ICons ?not_mask (ICons ?matmul (INil)))))
+        (= ?relu (Op (Add
+            ?shape
+            ?zeroed_strides
+            ?positive_strides
+            ?relu_strides)
+            (ICons ?zeroed (ICons ?positive (INil)))))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order ?d_order
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype (F32)
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 "RELU_BIAS")
+            (ICons ?a (ICons ?b ?matmul_tail))))
+        (union ?relu ?fused)
+        (set (dtype ?fused) (F32))
+    )
+    :ruleset matmul_backend
+    :name "cublaslt 2d relu bias epilogue"
+)
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order ?d_order
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype (F32)
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 "BIAS")
+            (ICons ?a (ICons ?b ?matmul_tail))))
+
+        (= ?zero (Op (Constant 0.0) (INil)))
+        (= ?neg_one (Op (Constant -1.0) (INil)))
+        (= ?one (Op (Constant 1.0) (INil)))
+
+        (= ?lt (Op (LessThan
+            ?shape
+            ?matmul_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ECons (MNum 0) (ENil))))
+            ?mask_strides)
+            (ICons ?matmul (ICons ?zero (INil)))))
+        (= ?lt_f32 (Op (Cast ?size (F32)) (ICons ?lt (INil))))
+
+        (= ?zeroed (Op (Mul
+            ?shape
+            ?mask_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ECons (MNum 0) (ENil))))
+            ?zeroed_strides)
+            (ICons ?lt_f32 (ICons ?zero (INil)))))
+
+        (= ?neg_mask (Op (Mul
+            ?shape
+            ?mask_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ECons (MNum 0) (ENil))))
+            ?neg_mask_strides)
+            (ICons ?lt_f32 (ICons ?neg_one (INil)))))
+        (= ?not_mask_f32 (Op (Add
+            ?shape
+            ?neg_mask_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ECons (MNum 0) (ENil))))
+            ?not_mask_f32_strides)
+            (ICons ?neg_mask (ICons ?one (INil)))))
+        (= ?not_mask_bool (Op (Cast ?size (Bool)) (ICons ?not_mask_f32 (INil))))
+        (= ?not_mask (Op (Cast ?size (F32)) (ICons ?not_mask_bool (INil))))
+
+        (= ?positive (Op (Mul
+            ?shape
+            ?not_mask_f32_strides
+            ?matmul_strides
+            ?positive_strides)
+            (ICons ?not_mask (ICons ?matmul (INil)))))
+        (= ?relu (Op (Add
+            ?shape
+            ?zeroed_strides
+            ?positive_strides
+            ?relu_strides)
+            (ICons ?zeroed (ICons ?positive (INil)))))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order ?d_order
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype (F32)
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 "RELU_BIAS")
+            (ICons ?a (ICons ?b ?matmul_tail))))
+        (union ?relu ?fused)
+        (set (dtype ?fused) (F32))
+    )
+    :ruleset matmul_backend
+    :name "cublaslt batched relu bias epilogue"
+)
+
+; Canonical tanh-approx GELU can also appear directly as:
+;
+;   x * sigmoid(1.5957691216 * x * (1 + 0.044715 * x * x))
+;
+; Match that sigmoid form and fuse it into the cuBLASLt GELU epilogues.
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order ?d_order
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype (F32)
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 "DEFAULT")
+            (ICons ?a (ICons ?b ?matmul_tail))))
+
+        (= ?gelu_coeff_inner (Op (Constant 0.044715) (INil)))
+        (= ?gelu_inner_scaled (Op (Mul ?gelu_inner_scaled_shape ?gelu_inner_scaled_a_stride ?gelu_inner_scaled_b_stride ?gelu_inner_scaled_out_stride) (ICons ?matmul (ICons ?gelu_coeff_inner (INil)))))
+        (= ?gelu_inner_quad (Op (Mul ?gelu_inner_quad_shape ?gelu_inner_quad_a_stride ?gelu_inner_quad_b_stride ?gelu_inner_quad_out_stride) (ICons ?gelu_inner_scaled (ICons ?matmul (INil)))))
+        (= ?gelu_one (Op (Constant 1.000000) (INil)))
+        (= ?gelu_poly (Op (Add ?gelu_poly_shape ?gelu_poly_a_stride ?gelu_poly_b_stride ?gelu_poly_out_stride) (ICons ?gelu_inner_quad (ICons ?gelu_one (INil)))))
+        (= ?gelu_coeff_outer (Op (Constant 1.595769) (INil)))
+        (= ?gelu_outer_scaled (Op (Mul ?gelu_outer_scaled_shape ?gelu_outer_scaled_a_stride ?gelu_outer_scaled_b_stride ?gelu_outer_scaled_out_stride) (ICons ?matmul (ICons ?gelu_coeff_outer (INil)))))
+        (= ?gelu_scaled (Op (Mul ?gelu_scaled_shape ?gelu_scaled_a_stride ?gelu_scaled_b_stride ?gelu_scaled_out_stride) (ICons ?gelu_outer_scaled (ICons ?gelu_poly (INil)))))
+        (= ?neg1 (Op (Constant -1.000000) (INil)))
+        (= ?gelu_neg (Op (Mul ?gelu_neg_shape ?gelu_neg_a_stride ?gelu_neg_b_stride ?gelu_neg_out_stride) (ICons ?gelu_scaled (ICons ?neg1 (INil)))))
+        (= ?log2e (Op (Constant 1.442695) (INil)))
+        (= ?gelu_exp_scaled (Op (Mul ?gelu_exp_scaled_shape ?gelu_exp_scaled_a_stride ?gelu_exp_scaled_b_stride ?gelu_exp_scaled_out_stride) (ICons ?gelu_neg (ICons ?log2e (INil)))))
+        (= ?gelu_exp2_val (Op (Exp2 ?gelu_exp_shape ?gelu_exp_in_stride ?gelu_exp_out_stride) (ICons ?gelu_exp_scaled (INil))))
+        (= ?gelu_plus1 (Op (Add ?gelu_plus1_shape ?gelu_plus1_a_stride ?gelu_plus1_b_stride ?gelu_plus1_out_stride) (ICons ?gelu_exp2_val (ICons ?gelu_one (INil)))))
+        (= ?gelu_sigmoid (Op (Recip ?gelu_sigmoid_shape ?gelu_sigmoid_in_stride ?gelu_sigmoid_out_stride) (ICons ?gelu_plus1 (INil))))
+        (= ?gelu_out (Op (Mul ?gelu_out_shape ?gelu_out_a_stride ?gelu_out_b_stride ?gelu_out_out_stride) (ICons ?matmul (ICons ?gelu_sigmoid (INil)))))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order ?d_order
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype (F32)
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 "GELU")
+            (ICons ?a (ICons ?b ?matmul_tail))))
+        (union ?gelu_out ?fused)
+        (set (dtype ?fused) (F32))
+    )
+    :ruleset matmul_backend
+    :name "cublaslt gelu epilogue"
+)
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order ?d_order
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype (F32)
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 "BIAS")
+            (ICons ?a (ICons ?b ?matmul_tail))))
+
+        (= ?gelu_coeff_inner (Op (Constant 0.044715) (INil)))
+        (= ?gelu_inner_scaled (Op (Mul ?gelu_inner_scaled_shape ?gelu_inner_scaled_a_stride ?gelu_inner_scaled_b_stride ?gelu_inner_scaled_out_stride) (ICons ?matmul (ICons ?gelu_coeff_inner (INil)))))
+        (= ?gelu_inner_quad (Op (Mul ?gelu_inner_quad_shape ?gelu_inner_quad_a_stride ?gelu_inner_quad_b_stride ?gelu_inner_quad_out_stride) (ICons ?gelu_inner_scaled (ICons ?matmul (INil)))))
+        (= ?gelu_one (Op (Constant 1.000000) (INil)))
+        (= ?gelu_poly (Op (Add ?gelu_poly_shape ?gelu_poly_a_stride ?gelu_poly_b_stride ?gelu_poly_out_stride) (ICons ?gelu_inner_quad (ICons ?gelu_one (INil)))))
+        (= ?gelu_coeff_outer (Op (Constant 1.595769) (INil)))
+        (= ?gelu_outer_scaled (Op (Mul ?gelu_outer_scaled_shape ?gelu_outer_scaled_a_stride ?gelu_outer_scaled_b_stride ?gelu_outer_scaled_out_stride) (ICons ?matmul (ICons ?gelu_coeff_outer (INil)))))
+        (= ?gelu_scaled (Op (Mul ?gelu_scaled_shape ?gelu_scaled_a_stride ?gelu_scaled_b_stride ?gelu_scaled_out_stride) (ICons ?gelu_outer_scaled (ICons ?gelu_poly (INil)))))
+        (= ?neg1 (Op (Constant -1.000000) (INil)))
+        (= ?gelu_neg (Op (Mul ?gelu_neg_shape ?gelu_neg_a_stride ?gelu_neg_b_stride ?gelu_neg_out_stride) (ICons ?gelu_scaled (ICons ?neg1 (INil)))))
+        (= ?log2e (Op (Constant 1.442695) (INil)))
+        (= ?gelu_exp_scaled (Op (Mul ?gelu_exp_scaled_shape ?gelu_exp_scaled_a_stride ?gelu_exp_scaled_b_stride ?gelu_exp_scaled_out_stride) (ICons ?gelu_neg (ICons ?log2e (INil)))))
+        (= ?gelu_exp2_val (Op (Exp2 ?gelu_exp_shape ?gelu_exp_in_stride ?gelu_exp_out_stride) (ICons ?gelu_exp_scaled (INil))))
+        (= ?gelu_plus1 (Op (Add ?gelu_plus1_shape ?gelu_plus1_a_stride ?gelu_plus1_b_stride ?gelu_plus1_out_stride) (ICons ?gelu_exp2_val (ICons ?gelu_one (INil)))))
+        (= ?gelu_sigmoid (Op (Recip ?gelu_sigmoid_shape ?gelu_sigmoid_in_stride ?gelu_sigmoid_out_stride) (ICons ?gelu_plus1 (INil))))
+        (= ?gelu_out (Op (Mul ?gelu_out_shape ?gelu_out_a_stride ?gelu_out_b_stride ?gelu_out_out_stride) (ICons ?matmul (ICons ?gelu_sigmoid (INil)))))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order ?d_order
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype (F32)
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 "GELU_BIAS")
+            (ICons ?a (ICons ?b ?matmul_tail))))
+        (union ?gelu_out ?fused)
+        (set (dtype ?fused) (F32))
+    )
+    :ruleset matmul_backend
+    :name "cublaslt gelu bias epilogue"
+)
+
+; This first slice fuses column-bias adds into CUBLASLT_EPILOGUE_BIAS for the
+; older COL-ordered output view. In that view Luminal's logical [m,n] output is
+; represented as a cuBLASLt [n,m] matrix, so cuBLASLt's row-broadcast bias maps
+; to the common logical column bias of length n.
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order "COL"
+            ?lda ?ldb ?ldc ?ldd
+            (MNum 1)
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 "DEFAULT")
+            (ICons ?a (ICons ?b (INil)))))
+
+        (= ?add (Op (Add
+            (ECons ?n (ECons ?m (ENil)))
+            ?matmul_add_strides
+            ?bias_add_strides
+            ?add_out_strides)
+            (ICons ?matmul (ICons ?bias (INil)))))
+
+        (= ?bias_add_strides (ECons (MNum 0) (ECons (MIter) (ENil))))
+        (= ?matmul_add_strides ?add_out_strides)
+        (= ?d_dtype (dtype ?bias))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order "COL"
+            ?lda ?ldb ?ldc ?ldd
+            (MNum 1)
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 "BIAS")
+            (ICons ?a (ICons ?b (ICons ?bias (INil))))))
+        (union ?add ?fused)
+        (set (dtype ?fused) ?d_dtype)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt 2d matmul plus column bias epilogue"
+)
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order "COL"
+            ?lda ?ldb ?ldc ?ldd
+            (MNum 1)
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 "DEFAULT")
+            (ICons ?a (ICons ?b (INil)))))
+
+        (= ?add (Op (Add
+            (ECons ?n (ECons ?m (ENil)))
+            ?bias_add_strides
+            ?matmul_add_strides
+            ?add_out_strides)
+            (ICons ?bias (ICons ?matmul (INil)))))
+
+        (= ?bias_add_strides (ECons (MNum 0) (ECons (MIter) (ENil))))
+        (= ?matmul_add_strides ?add_out_strides)
+        (= ?d_dtype (dtype ?bias))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order "COL"
+            ?lda ?ldb ?ldc ?ldd
+            (MNum 1)
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 "BIAS")
+            (ICons ?a (ICons ?b (ICons ?bias (INil))))))
+        (union ?add ?fused)
+        (set (dtype ?fused) ?d_dtype)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt 2d column bias plus matmul epilogue"
+)
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order "COL"
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 "DEFAULT")
+            (ICons ?a (ICons ?b (INil)))))
+
+        (= ?add (Op (Add
+            (ECons ?batch (ECons ?n (ECons ?m (ENil))))
+            ?matmul_add_strides
+            ?bias_add_strides
+            ?add_out_strides)
+            (ICons ?matmul (ICons ?bias (INil)))))
+
+        (= ?bias_add_strides (ECons (MNum 0) (ECons (MNum 0) (ECons (MIter) (ENil)))))
+        (= ?matmul_add_strides ?add_out_strides)
+        (= ?d_dtype (dtype ?bias))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order "COL"
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 "BIAS")
+            (ICons ?a (ICons ?b (ICons ?bias (INil))))))
+        (union ?add ?fused)
+        (set (dtype ?fused) ?d_dtype)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt batched matmul plus column bias epilogue"
+)
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order "COL"
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 "DEFAULT")
+            (ICons ?a (ICons ?b (INil)))))
+
+        (= ?add (Op (Add
+            (ECons ?batch (ECons ?n (ECons ?m (ENil))))
+            ?bias_add_strides
+            ?matmul_add_strides
+            ?add_out_strides)
+            (ICons ?bias (ICons ?matmul (INil)))))
+
+        (= ?bias_add_strides (ECons (MNum 0) (ECons (MNum 0) (ECons (MIter) (ENil)))))
+        (= ?matmul_add_strides ?add_out_strides)
+        (= ?d_dtype (dtype ?bias))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order "COL"
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 "BIAS")
+            (ICons ?a (ICons ?b (ICons ?bias (INil))))))
+        (union ?add ?fused)
+        (set (dtype ?fused) ?d_dtype)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt batched column bias plus matmul epilogue"
+)
--- a/crates/luminal_cuda_lite/src/host/cublaslt/cublaslt_fp8_rewrite.egg
+++ b/crates/luminal_cuda_lite/src/host/cublaslt/cublaslt_fp8_rewrite.egg
@@ -0,0 +1,345 @@
+; FP8 support is narrower than "any FP8 x any FP8". cuBLASLt's regular FP8
+; matmul table supports these A/B descriptor pairs for F32 outputs:
+;   E4M3 x E4M3
+;   E4M3 x E5M2
+;   E5M2 x E4M3
+; and requires TN format on Ada/Hopper-class GPUs. These rules therefore match
+; row-major x column-major Luminal matmuls, which the existing COL-order lowering
+; describes as descriptor A = logical B, descriptor B = logical A, transa=T,
+; transb=N.
+
+(rule
+    (
+        (= ?mul (Op (Mul ?mul_shape ?a_stride ?b_stride ?mul_out_stride) (ICons ?a (ICons ?b (INil)))))
+        (= ?sum (Op (Sum ?out_shape ?k ?sum_in_stride ?k_stride ?sum_out_stride) (ICons ?mul (INil))))
+        (= ?cast (Op (Cast ?size (F32)) (ICons ?sum (INil))))
+
+        (= ?out_shape (ECons ?m (ECons ?n (ENil))))
+        (!= ?m (MNum 0))
+        (!= ?n (MNum 0))
+        (!= ?k (MNum 1))
+
+        (= ?a_stride (ECons ?a_m_stride (ECons ?a_n_stride (ECons ?a_k_stride (ENil)))))
+        (= ?b_stride (ECons ?b_m_stride (ECons ?b_n_stride (ECons ?b_k_stride (ENil)))))
+        (= ?k_stride (MIter))
+
+        (= ?a_m_stride (MMul (MIter) ?k))
+        (= ?a_n_stride (MNum 0))
+        (= ?a_k_stride (MIter))
+
+        (= ?b_m_stride (MNum 0))
+        (= ?b_n_stride (MMul (MIter) ?k))
+        (= ?b_k_stride (MIter))
+
+        (= (F8E4M3) (dtype ?a))
+        (= (F8E4M3) (dtype ?b))
+    )
+    (
+        (let ?sgemm (Op (cublaslt
+            ?n ?m ?k
+            "T" "N"
+            "COL" "COL" "COL" "COL"
+            ?b_n_stride
+            ?a_m_stride
+            ?n
+            ?n
+            (MNum 1)
+            (MNum 0)
+            (MNum 0)
+            (MNum 0)
+            (MNum 0)
+            (F8E4M3) (F8E4M3) (F32) (F32) "32F" "F32" 1.0 0.0 "DEFAULT")
+            (ICons ?b (ICons ?a (INil)))))
+        (union ?cast ?sgemm)
+        (set (dtype ?sgemm) (F32))
+    )
+    :ruleset matmul_backend
+    :name "cublaslt fp8 e4m3/e4m3 row-major x column-major f32 output"
+)
+
+(rule
+    (
+        (= ?mul (Op (Mul ?mul_shape ?a_stride ?b_stride ?mul_out_stride) (ICons ?a (ICons ?b (INil)))))
+        (= ?sum (Op (Sum ?out_shape ?k ?sum_in_stride ?k_stride ?sum_out_stride) (ICons ?mul (INil))))
+        (= ?cast (Op (Cast ?size (F32)) (ICons ?sum (INil))))
+
+        (= ?out_shape (ECons ?m (ECons ?n (ENil))))
+        (!= ?m (MNum 0))
+        (!= ?n (MNum 0))
+        (!= ?k (MNum 1))
+
+        (= ?a_stride (ECons ?a_m_stride (ECons ?a_n_stride (ECons ?a_k_stride (ENil)))))
+        (= ?b_stride (ECons ?b_m_stride (ECons ?b_n_stride (ECons ?b_k_stride (ENil)))))
+        (= ?k_stride (MIter))
+
+        (= ?a_m_stride (MMul (MIter) ?k))
+        (= ?a_n_stride (MNum 0))
+        (= ?a_k_stride (MIter))
+
+        (= ?b_m_stride (MNum 0))
+        (= ?b_n_stride (MMul (MIter) ?k))
+        (= ?b_k_stride (MIter))
+
+        (= (F8E4M3) (dtype ?a))
+        (= (F8E5M2) (dtype ?b))
+    )
+    (
+        (let ?sgemm (Op (cublaslt
+            ?n ?m ?k
+            "T" "N"
+            "COL" "COL" "COL" "COL"
+            ?b_n_stride
+            ?a_m_stride
+            ?n
+            ?n
+            (MNum 1)
+            (MNum 0)
+            (MNum 0)
+            (MNum 0)
+            (MNum 0)
+            (F8E5M2) (F8E4M3) (F32) (F32) "32F" "F32" 1.0 0.0 "DEFAULT")
+            (ICons ?b (ICons ?a (INil)))))
+        (union ?cast ?sgemm)
+        (set (dtype ?sgemm) (F32))
+    )
+    :ruleset matmul_backend
+    :name "cublaslt fp8 e5m2/e4m3 row-major x column-major f32 output"
+)
+
+(rule
+    (
+        (= ?mul (Op (Mul ?mul_shape ?a_stride ?b_stride ?mul_out_stride) (ICons ?a (ICons ?b (INil)))))
+        (= ?sum (Op (Sum ?out_shape ?k ?sum_in_stride ?k_stride ?sum_out_stride) (ICons ?mul (INil))))
+        (= ?cast (Op (Cast ?size (F32)) (ICons ?sum (INil))))
+
+        (= ?out_shape (ECons ?m (ECons ?n (ENil))))
+        (!= ?m (MNum 0))
+        (!= ?n (MNum 0))
+        (!= ?k (MNum 1))
+
+        (= ?a_stride (ECons ?a_m_stride (ECons ?a_n_stride (ECons ?a_k_stride (ENil)))))
+        (= ?b_stride (ECons ?b_m_stride (ECons ?b_n_stride (ECons ?b_k_stride (ENil)))))
+        (= ?k_stride (MIter))
+
+        (= ?a_m_stride (MMul (MIter) ?k))
+        (= ?a_n_stride (MNum 0))
+        (= ?a_k_stride (MIter))
+
+        (= ?b_m_stride (MNum 0))
+        (= ?b_n_stride (MMul (MIter) ?k))
+        (= ?b_k_stride (MIter))
+
+        (= (F8E5M2) (dtype ?a))
+        (= (F8E4M3) (dtype ?b))
+    )
+    (
+        (let ?sgemm (Op (cublaslt
+            ?n ?m ?k
+            "T" "N"
+            "COL" "COL" "COL" "COL"
+            ?b_n_stride
+            ?a_m_stride
+            ?n
+            ?n
+            (MNum 1)
+            (MNum 0)
+            (MNum 0)
+            (MNum 0)
+            (MNum 0)
+            (F8E4M3) (F8E5M2) (F32) (F32) "32F" "F32" 1.0 0.0 "DEFAULT")
+            (ICons ?b (ICons ?a (INil)))))
+        (union ?cast ?sgemm)
+        (set (dtype ?sgemm) (F32))
+    )
+    :ruleset matmul_backend
+    :name "cublaslt fp8 e4m3/e5m2 row-major x column-major f32 output"
+)
+
+(rule
+    (
+        (= ?mul (Op (Mul ?mul_shape ?a_stride ?b_stride ?mul_out_stride) (ICons ?a (ICons ?b (INil)))))
+        (= ?sum (Op (Sum ?out_shape ?k ?sum_in_stride ?k_stride ?sum_out_stride) (ICons ?mul (INil))))
+        (= ?cast (Op (Cast ?size (F32)) (ICons ?sum (INil))))
+
+        (= ?batch (nth_from_end ?out_shape 2))
+        (= ?m (nth_from_end ?out_shape 1))
+        (= ?n (nth_from_end ?out_shape 0))
+        (!= ?m (MNum 0))
+        (!= ?n (MNum 0))
+        (!= ?k (MNum 1))
+        (!= ?batch (MNum 0))
+
+        (= ?a_batch_stride (nth_from_end ?a_stride 3))
+        (= ?a_m_stride (nth_from_end ?a_stride 2))
+        (= ?a_n_stride (nth_from_end ?a_stride 1))
+        (= ?a_k_stride (nth_from_end ?a_stride 0))
+
+        (= ?b_batch_stride (nth_from_end ?b_stride 3))
+        (= ?b_m_stride (nth_from_end ?b_stride 2))
+        (= ?b_n_stride (nth_from_end ?b_stride 1))
+        (= ?b_k_stride (nth_from_end ?b_stride 0))
+
+        (= ?k_stride (MIter))
+
+        (= ?a_k_stride (MIter))
+        (= ?a_n_stride (MNum 0))
+        (= ?a_m_stride (MMul (MIter) ?k))
+
+        (= ?b_k_stride (MIter))
+        (= ?b_m_stride (MNum 0))
+        (= ?b_n_stride (MMul (MIter) ?k))
+
+        (= ?a_batch_stride (MMul ?m ?a_m_stride))
+        (= ?b_batch_stride (MMul ?n ?b_n_stride))
+
+        (= (F8E4M3) (dtype ?a))
+        (= (F8E4M3) (dtype ?b))
+    )
+    (
+        (let ?sgemm (Op (cublaslt
+            ?n ?m ?k
+            "T" "N"
+            "COL" "COL" "COL" "COL"
+            ?b_n_stride
+            ?a_m_stride
+            ?n
+            ?n
+            ?batch
+            ?b_batch_stride
+            ?a_batch_stride
+            (MMul ?m ?n)
+            (MMul ?m ?n)
+            (F8E4M3) (F8E4M3) (F32) (F32) "32F" "F32" 1.0 0.0 "DEFAULT")
+            (ICons ?b (ICons ?a (INil)))))
+        (union ?cast ?sgemm)
+        (set (dtype ?sgemm) (F32))
+    )
+    :ruleset matmul_backend
+    :name "cublaslt fp8 e4m3/e4m3 batched row-major x column-major f32 output"
+)
+
+(rule
+    (
+        (= ?mul (Op (Mul ?mul_shape ?a_stride ?b_stride ?mul_out_stride) (ICons ?a (ICons ?b (INil)))))
+        (= ?sum (Op (Sum ?out_shape ?k ?sum_in_stride ?k_stride ?sum_out_stride) (ICons ?mul (INil))))
+        (= ?cast (Op (Cast ?size (F32)) (ICons ?sum (INil))))
+
+        (= ?batch (nth_from_end ?out_shape 2))
+        (= ?m (nth_from_end ?out_shape 1))
+        (= ?n (nth_from_end ?out_shape 0))
+        (!= ?m (MNum 0))
+        (!= ?n (MNum 0))
+        (!= ?k (MNum 1))
+        (!= ?batch (MNum 0))
+
+        (= ?a_batch_stride (nth_from_end ?a_stride 3))
+        (= ?a_m_stride (nth_from_end ?a_stride 2))
+        (= ?a_n_stride (nth_from_end ?a_stride 1))
+        (= ?a_k_stride (nth_from_end ?a_stride 0))
+
+        (= ?b_batch_stride (nth_from_end ?b_stride 3))
+        (= ?b_m_stride (nth_from_end ?b_stride 2))
+        (= ?b_n_stride (nth_from_end ?b_stride 1))
+        (= ?b_k_stride (nth_from_end ?b_stride 0))
+
+        (= ?k_stride (MIter))
+
+        (= ?a_k_stride (MIter))
+        (= ?a_n_stride (MNum 0))
+        (= ?a_m_stride (MMul (MIter) ?k))
+
+        (= ?b_k_stride (MIter))
+        (= ?b_m_stride (MNum 0))
+        (= ?b_n_stride (MMul (MIter) ?k))
+
+        (= ?a_batch_stride (MMul ?m ?a_m_stride))
+        (= ?b_batch_stride (MMul ?n ?b_n_stride))
+
+        (= (F8E4M3) (dtype ?a))
+        (= (F8E5M2) (dtype ?b))
+    )
+    (
+        (let ?sgemm (Op (cublaslt
+            ?n ?m ?k
+            "T" "N"
+            "COL" "COL" "COL" "COL"
+            ?b_n_stride
+            ?a_m_stride
+            ?n
+            ?n
+            ?batch
+            ?b_batch_stride
+            ?a_batch_stride
+            (MMul ?m ?n)
+            (MMul ?m ?n)
+            (F8E5M2) (F8E4M3) (F32) (F32) "32F" "F32" 1.0 0.0 "DEFAULT")
+            (ICons ?b (ICons ?a (INil)))))
+        (union ?cast ?sgemm)
+        (set (dtype ?sgemm) (F32))
+    )
+    :ruleset matmul_backend
+    :name "cublaslt fp8 e5m2/e4m3 batched row-major x column-major f32 output"
+)
+
+(rule
+    (
+        (= ?mul (Op (Mul ?mul_shape ?a_stride ?b_stride ?mul_out_stride) (ICons ?a (ICons ?b (INil)))))
+        (= ?sum (Op (Sum ?out_shape ?k ?sum_in_stride ?k_stride ?sum_out_stride) (ICons ?mul (INil))))
+        (= ?cast (Op (Cast ?size (F32)) (ICons ?sum (INil))))
+
+        (= ?batch (nth_from_end ?out_shape 2))
+        (= ?m (nth_from_end ?out_shape 1))
+        (= ?n (nth_from_end ?out_shape 0))
+        (!= ?m (MNum 0))
+        (!= ?n (MNum 0))
+        (!= ?k (MNum 1))
+        (!= ?batch (MNum 0))
+
+        (= ?a_batch_stride (nth_from_end ?a_stride 3))
+        (= ?a_m_stride (nth_from_end ?a_stride 2))
+        (= ?a_n_stride (nth_from_end ?a_stride 1))
+        (= ?a_k_stride (nth_from_end ?a_stride 0))
+
+        (= ?b_batch_stride (nth_from_end ?b_stride 3))
+        (= ?b_m_stride (nth_from_end ?b_stride 2))
+        (= ?b_n_stride (nth_from_end ?b_stride 1))
+        (= ?b_k_stride (nth_from_end ?b_stride 0))
+
+        (= ?k_stride (MIter))
+
+        (= ?a_k_stride (MIter))
+        (= ?a_n_stride (MNum 0))
+        (= ?a_m_stride (MMul (MIter) ?k))
+
+        (= ?b_k_stride (MIter))
+        (= ?b_m_stride (MNum 0))
+        (= ?b_n_stride (MMul (MIter) ?k))
+
+        (= ?a_batch_stride (MMul ?m ?a_m_stride))
+        (= ?b_batch_stride (MMul ?n ?b_n_stride))
+
+        (= (F8E5M2) (dtype ?a))
+        (= (F8E4M3) (dtype ?b))
+    )
+    (
+        (let ?sgemm (Op (cublaslt
+            ?n ?m ?k
+            "T" "N"
+            "COL" "COL" "COL" "COL"
+            ?b_n_stride
+            ?a_m_stride
+            ?n
+            ?n
+            ?batch
+            ?b_batch_stride
+            ?a_batch_stride
+            (MMul ?m ?n)
+            (MMul ?m ?n)
+            (F8E4M3) (F8E5M2) (F32) (F32) "32F" "F32" 1.0 0.0 "DEFAULT")
+            (ICons ?b (ICons ?a (INil)))))
+        (union ?cast ?sgemm)
+        (set (dtype ?sgemm) (F32))
+    )
+    :ruleset matmul_backend
+    :name "cublaslt fp8 e4m3/e5m2 batched row-major x column-major f32 output"
+)
--- a/crates/luminal_cuda_lite/src/host/cublaslt/cublaslt_mixed_dtype_rewrite.egg
+++ b/crates/luminal_cuda_lite/src/host/cublaslt/cublaslt_mixed_dtype_rewrite.egg
@@ -0,0 +1,75 @@
+; Mixed output dtype rewrites for cuBLASLt.
+;
+; The first mixed mode we need for low-precision matmuls is:
+;
+;   D[f32] = A[fp16/bf16] * B[fp16/bf16]
+;
+; Luminal graphs express this today as a Cast(F32) around a low-precision
+; matmul. cuBLASLt can write the f32 output directly, so expose that candidate
+; before beta fusion tries to consume an f32 C input.
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order ?d_order
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            (F16) (F16) (F16) (F16)
+            ?compute_type ?scale_dtype
+            ?alpha ?beta ?epilogue)
+            ?inputs))
+        (= ?cast (Op (Cast ?size (F32)) (ICons ?matmul (INil))))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout ?a_order ?b_order ?c_order ?d_order
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            (F16) (F16) (F32) (F32)
+            ?compute_type ?scale_dtype
+            ?alpha ?beta ?epilogue)
+            ?inputs))
+        (union ?cast ?fused)
+        (set (dtype ?fused) (F32))
+    )
+    :ruleset matmul_backend
+    :name "cublaslt f16 matmul cast f32 output"
+)
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order ?d_order
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            (Bf16) (Bf16) (Bf16) (Bf16)
+            ?compute_type ?scale_dtype
+            ?alpha ?beta ?epilogue)
+            ?inputs))
+        (= ?cast (Op (Cast ?size (F32)) (ICons ?matmul (INil))))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout ?a_order ?b_order ?c_order ?d_order
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            (Bf16) (Bf16) (F32) (F32)
+            ?compute_type ?scale_dtype
+            ?alpha ?beta ?epilogue)
+            ?inputs))
+        (union ?cast ?fused)
+        (set (dtype ?fused) (F32))
+    )
+    :ruleset matmul_backend
+    :name "cublaslt bf16 matmul cast f32 output"
+)
--- a/crates/luminal_cuda_lite/src/host/cublaslt/cublaslt_row_order_rewrite.egg
+++ b/crates/luminal_cuda_lite/src/host/cublaslt/cublaslt_row_order_rewrite.egg
@@ -0,0 +1,452 @@
+; Natural cuBLASLt row-order output rewrites. These keep Luminal's logical
+; output C[m,n] as a cuBLASLt ROW-ordered D[m,n] instead of using the older
+; swapped COL-ordered D[n,m] view. A and B orders mirror their matched logical
+; layouts, so this family is the legal base for future ROW-ordered beta fusions.
+
+(rule
+    (
+        (= ?mul (Op (Mul ?mul_shape ?a_stride ?b_stride ?mul_out_stride) (ICons ?a (ICons ?b (INil)))))
+        (= ?sum (Op (Sum ?out_shape ?k ?sum_in_stride ?k_stride ?sum_out_stride) (ICons ?mul (INil))))
+
+        (= ?out_shape (ECons ?m (ECons ?n (ENil))))
+        (!= ?m (MNum 0))
+        (!= ?n (MNum 0))
+        (!= ?k (MNum 1))
+
+        (= ?a_stride (ECons ?a_m_stride (ECons ?a_n_stride (ECons ?a_k_stride (ENil)))))
+        (= ?b_stride (ECons ?b_m_stride (ECons ?b_n_stride (ECons ?b_k_stride (ENil)))))
+        (= ?k_stride (MIter))
+
+        (= ?a_m_stride (MMul (MIter) ?k))
+        (= ?a_n_stride (MNum 0))
+        (= ?a_k_stride (MIter))
+
+        (= ?b_m_stride (MNum 0))
+        (= ?b_n_stride (MIter))
+        (= ?b_k_stride (MMul (MIter) ?n))
+
+        (= ?dt (dtype ?a))
+        (= ?dt (dtype ?b))
+        (cublaslt_base_dtype ?dt)
+    )
+    (
+        (let ?sgemm (Op (cublaslt
+            ?m ?n ?k
+            "N" "N"
+            "ROW" "ROW" "ROW" "ROW"
+            ?a_m_stride
+            ?b_k_stride
+            ?n
+            ?n
+            (MNum 1)
+            (MNum 0)
+            (MNum 0)
+            (MNum 0)
+            (MNum 0)
+            ?dt ?dt ?dt ?dt "default" "default" 1.0 0.0 "DEFAULT")
+            (ICons ?a (ICons ?b (INil)))))
+        (union ?sum ?sgemm)
+        (set (dtype ?sgemm) ?dt)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt row-order row-major x row-major"
+)
+
+(rule
+    (
+        (= ?mul (Op (Mul ?mul_shape ?a_stride ?b_stride ?mul_out_stride) (ICons ?a (ICons ?b (INil)))))
+        (= ?sum (Op (Sum ?out_shape ?k ?sum_in_stride ?k_stride ?sum_out_stride) (ICons ?mul (INil))))
+
+        (= ?out_shape (ECons ?m (ECons ?n (ENil))))
+        (!= ?m (MNum 0))
+        (!= ?n (MNum 0))
+        (!= ?k (MNum 1))
+
+        (= ?a_stride (ECons ?a_m_stride (ECons ?a_n_stride (ECons ?a_k_stride (ENil)))))
+        (= ?b_stride (ECons ?b_m_stride (ECons ?b_n_stride (ECons ?b_k_stride (ENil)))))
+        (= ?k_stride (MIter))
+
+        (= ?a_m_stride (MMul (MIter) ?k))
+        (= ?a_n_stride (MNum 0))
+        (= ?a_k_stride (MIter))
+
+        (= ?b_m_stride (MNum 0))
+        (= ?b_n_stride (MMul (MIter) ?k))
+        (= ?b_k_stride (MIter))
+
+        (= ?dt (dtype ?a))
+        (= ?dt (dtype ?b))
+        (cublaslt_base_dtype ?dt)
+    )
+    (
+        (let ?sgemm (Op (cublaslt
+            ?m ?n ?k
+            "N" "N"
+            "ROW" "COL" "ROW" "ROW"
+            ?a_m_stride
+            ?b_n_stride
+            ?n
+            ?n
+            (MNum 1)
+            (MNum 0)
+            (MNum 0)
+            (MNum 0)
+            (MNum 0)
+            ?dt ?dt ?dt ?dt "default" "default" 1.0 0.0 "DEFAULT")
+            (ICons ?a (ICons ?b (INil)))))
+        (union ?sum ?sgemm)
+        (set (dtype ?sgemm) ?dt)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt row-order row-major x column-major"
+)
+
+(rule
+    (
+        (= ?mul (Op (Mul ?mul_shape ?a_stride ?b_stride ?mul_out_stride) (ICons ?a (ICons ?b (INil)))))
+        (= ?sum (Op (Sum ?out_shape ?k ?sum_in_stride ?k_stride ?sum_out_stride) (ICons ?mul (INil))))
+
+        (= ?out_shape (ECons ?m (ECons ?n (ENil))))
+        (!= ?m (MNum 0))
+        (!= ?n (MNum 0))
+        (!= ?k (MNum 1))
+
+        (= ?a_stride (ECons ?a_m_stride (ECons ?a_n_stride (ECons ?a_k_stride (ENil)))))
+        (= ?b_stride (ECons ?b_m_stride (ECons ?b_n_stride (ECons ?b_k_stride (ENil)))))
+        (= ?k_stride (MIter))
+
+        (= ?a_m_stride (MIter))
+        (= ?a_n_stride (MNum 0))
+        (= ?a_k_stride (MMul (MIter) ?m))
+
+        (= ?b_m_stride (MNum 0))
+        (= ?b_n_stride (MIter))
+        (= ?b_k_stride (MMul (MIter) ?n))
+
+        (= ?dt (dtype ?a))
+        (= ?dt (dtype ?b))
+        (cublaslt_base_dtype ?dt)
+    )
+    (
+        (let ?sgemm (Op (cublaslt
+            ?m ?n ?k
+            "N" "N"
+            "COL" "ROW" "ROW" "ROW"
+            ?a_k_stride
+            ?b_k_stride
+            ?n
+            ?n
+            (MNum 1)
+            (MNum 0)
+            (MNum 0)
+            (MNum 0)
+            (MNum 0)
+            ?dt ?dt ?dt ?dt "default" "default" 1.0 0.0 "DEFAULT")
+            (ICons ?a (ICons ?b (INil)))))
+        (union ?sum ?sgemm)
+        (set (dtype ?sgemm) ?dt)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt row-order column-major x row-major"
+)
+
+(rule
+    (
+        (= ?mul (Op (Mul ?mul_shape ?a_stride ?b_stride ?mul_out_stride) (ICons ?a (ICons ?b (INil)))))
+        (= ?sum (Op (Sum ?out_shape ?k ?sum_in_stride ?k_stride ?sum_out_stride) (ICons ?mul (INil))))
+
+        (= ?out_shape (ECons ?m (ECons ?n (ENil))))
+        (!= ?m (MNum 0))
+        (!= ?n (MNum 0))
+        (!= ?k (MNum 1))
+
+        (= ?a_stride (ECons ?a_m_stride (ECons ?a_n_stride (ECons ?a_k_stride (ENil)))))
+        (= ?b_stride (ECons ?b_m_stride (ECons ?b_n_stride (ECons ?b_k_stride (ENil)))))
+        (= ?k_stride (MIter))
+
+        (= ?a_m_stride (MIter))
+        (= ?a_n_stride (MNum 0))
+        (= ?a_k_stride (MMul (MIter) ?m))
+
+        (= ?b_m_stride (MNum 0))
+        (= ?b_n_stride (MMul (MIter) ?k))
+        (= ?b_k_stride (MIter))
+
+        (= ?dt (dtype ?a))
+        (= ?dt (dtype ?b))
+        (cublaslt_base_dtype ?dt)
+    )
+    (
+        (let ?sgemm (Op (cublaslt
+            ?m ?n ?k
+            "N" "N"
+            "COL" "COL" "ROW" "ROW"
+            ?a_k_stride
+            ?b_n_stride
+            ?n
+            ?n
+            (MNum 1)
+            (MNum 0)
+            (MNum 0)
+            (MNum 0)
+            (MNum 0)
+            ?dt ?dt ?dt ?dt "default" "default" 1.0 0.0 "DEFAULT")
+            (ICons ?a (ICons ?b (INil)))))
+        (union ?sum ?sgemm)
+        (set (dtype ?sgemm) ?dt)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt row-order column-major x column-major"
+)
+
+(rule
+    (
+        (= ?mul (Op (Mul ?mul_shape ?a_stride ?b_stride ?mul_out_stride) (ICons ?a (ICons ?b (INil)))))
+        (= ?sum (Op (Sum ?out_shape ?k ?sum_in_stride ?k_stride ?sum_out_stride) (ICons ?mul (INil))))
+
+        (= ?batch (nth_from_end ?out_shape 2))
+        (= ?m (nth_from_end ?out_shape 1))
+        (= ?n (nth_from_end ?out_shape 0))
+        (!= ?m (MNum 0))
+        (!= ?n (MNum 0))
+        (!= ?k (MNum 1))
+        (!= ?batch (MNum 0))
+
+        (= ?a_batch_stride (nth_from_end ?a_stride 3))
+        (= ?a_m_stride (nth_from_end ?a_stride 2))
+        (= ?a_n_stride (nth_from_end ?a_stride 1))
+        (= ?a_k_stride (nth_from_end ?a_stride 0))
+
+        (= ?b_batch_stride (nth_from_end ?b_stride 3))
+        (= ?b_m_stride (nth_from_end ?b_stride 2))
+        (= ?b_n_stride (nth_from_end ?b_stride 1))
+        (= ?b_k_stride (nth_from_end ?b_stride 0))
+
+        (= ?k_stride (MIter))
+
+        (= ?a_k_stride (MIter))
+        (= ?a_n_stride (MNum 0))
+        (= ?a_m_stride (MMul (MIter) ?k))
+
+        (= ?b_n_stride (MIter))
+        (= ?b_m_stride (MNum 0))
+        (= ?b_k_stride (MMul (MIter) ?n))
+
+        (= ?a_batch_stride (MMul ?m ?a_m_stride))
+        (= ?b_batch_stride (MMul ?k ?b_k_stride))
+
+        (= ?dt (dtype ?a))
+        (= ?dt (dtype ?b))
+        (cublaslt_base_dtype ?dt)
+    )
+    (
+        (let ?sgemm (Op (cublaslt
+            ?m ?n ?k
+            "N" "N"
+            "ROW" "ROW" "ROW" "ROW"
+            ?a_m_stride
+            ?b_k_stride
+            ?n
+            ?n
+            ?batch
+            ?a_batch_stride
+            ?b_batch_stride
+            (MMul ?m ?n)
+            (MMul ?m ?n)
+            ?dt ?dt ?dt ?dt "default" "default" 1.0 0.0 "DEFAULT")
+            (ICons ?a (ICons ?b (INil)))))
+        (union ?sum ?sgemm)
+        (set (dtype ?sgemm) ?dt)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt row-order batched row-major x row-major"
+)
+
+(rule
+    (
+        (= ?mul (Op (Mul ?mul_shape ?a_stride ?b_stride ?mul_out_stride) (ICons ?a (ICons ?b (INil)))))
+        (= ?sum (Op (Sum ?out_shape ?k ?sum_in_stride ?k_stride ?sum_out_stride) (ICons ?mul (INil))))
+
+        (= ?batch (nth_from_end ?out_shape 2))
+        (= ?m (nth_from_end ?out_shape 1))
+        (= ?n (nth_from_end ?out_shape 0))
+        (!= ?m (MNum 0))
+        (!= ?n (MNum 0))
+        (!= ?k (MNum 1))
+        (!= ?batch (MNum 0))
+
+        (= ?a_batch_stride (nth_from_end ?a_stride 3))
+        (= ?a_m_stride (nth_from_end ?a_stride 2))
+        (= ?a_n_stride (nth_from_end ?a_stride 1))
+        (= ?a_k_stride (nth_from_end ?a_stride 0))
+
+        (= ?b_batch_stride (nth_from_end ?b_stride 3))
+        (= ?b_m_stride (nth_from_end ?b_stride 2))
+        (= ?b_n_stride (nth_from_end ?b_stride 1))
+        (= ?b_k_stride (nth_from_end ?b_stride 0))
+
+        (= ?k_stride (MIter))
+
+        (= ?a_k_stride (MIter))
+        (= ?a_n_stride (MNum 0))
+        (= ?a_m_stride (MMul (MIter) ?k))
+
+        (= ?b_k_stride (MIter))
+        (= ?b_m_stride (MNum 0))
+        (= ?b_n_stride (MMul (MIter) ?k))
+
+        (= ?a_batch_stride (MMul ?m ?a_m_stride))
+        (= ?b_batch_stride (MMul ?n ?b_n_stride))
+
+        (= ?dt (dtype ?a))
+        (= ?dt (dtype ?b))
+        (cublaslt_base_dtype ?dt)
+    )
+    (
+        (let ?sgemm (Op (cublaslt
+            ?m ?n ?k
+            "N" "N"
+            "ROW" "COL" "ROW" "ROW"
+            ?a_m_stride
+            ?b_n_stride
+            ?n
+            ?n
+            ?batch
+            ?a_batch_stride
+            ?b_batch_stride
+            (MMul ?m ?n)
+            (MMul ?m ?n)
+            ?dt ?dt ?dt ?dt "default" "default" 1.0 0.0 "DEFAULT")
+            (ICons ?a (ICons ?b (INil)))))
+        (union ?sum ?sgemm)
+        (set (dtype ?sgemm) ?dt)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt row-order batched row-major x column-major"
+)
+
+(rule
+    (
+        (= ?mul (Op (Mul ?mul_shape ?a_stride ?b_stride ?mul_out_stride) (ICons ?a (ICons ?b (INil)))))
+        (= ?sum (Op (Sum ?out_shape ?k ?sum_in_stride ?k_stride ?sum_out_stride) (ICons ?mul (INil))))
+
+        (= ?batch (nth_from_end ?out_shape 2))
+        (= ?m (nth_from_end ?out_shape 1))
+        (= ?n (nth_from_end ?out_shape 0))
+        (!= ?m (MNum 0))
+        (!= ?n (MNum 0))
+        (!= ?k (MNum 1))
+        (!= ?batch (MNum 0))
+
+        (= ?a_batch_stride (nth_from_end ?a_stride 3))
+        (= ?a_m_stride (nth_from_end ?a_stride 2))
+        (= ?a_n_stride (nth_from_end ?a_stride 1))
+        (= ?a_k_stride (nth_from_end ?a_stride 0))
+
+        (= ?b_batch_stride (nth_from_end ?b_stride 3))
+        (= ?b_m_stride (nth_from_end ?b_stride 2))
+        (= ?b_n_stride (nth_from_end ?b_stride 1))
+        (= ?b_k_stride (nth_from_end ?b_stride 0))
+
+        (= ?k_stride (MIter))
+
+        (= ?a_m_stride (MIter))
+        (= ?a_n_stride (MNum 0))
+        (= ?a_k_stride (MMul (MIter) ?m))
+
+        (= ?b_n_stride (MIter))
+        (= ?b_m_stride (MNum 0))
+        (= ?b_k_stride (MMul (MIter) ?n))
+
+        (= ?a_batch_stride (MMul ?k ?a_k_stride))
+        (= ?b_batch_stride (MMul ?k ?b_k_stride))
+
+        (= ?dt (dtype ?a))
+        (= ?dt (dtype ?b))
+        (cublaslt_base_dtype ?dt)
+    )
+    (
+        (let ?sgemm (Op (cublaslt
+            ?m ?n ?k
+            "N" "N"
+            "COL" "ROW" "ROW" "ROW"
+            ?a_k_stride
+            ?b_k_stride
+            ?n
+            ?n
+            ?batch
+            ?a_batch_stride
+            ?b_batch_stride
+            (MMul ?m ?n)
+            (MMul ?m ?n)
+            ?dt ?dt ?dt ?dt "default" "default" 1.0 0.0 "DEFAULT")
+            (ICons ?a (ICons ?b (INil)))))
+        (union ?sum ?sgemm)
+        (set (dtype ?sgemm) ?dt)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt row-order batched column-major x row-major"
+)
+
+(rule
+    (
+        (= ?mul (Op (Mul ?mul_shape ?a_stride ?b_stride ?mul_out_stride) (ICons ?a (ICons ?b (INil)))))
+        (= ?sum (Op (Sum ?out_shape ?k ?sum_in_stride ?k_stride ?sum_out_stride) (ICons ?mul (INil))))
+
+        (= ?batch (nth_from_end ?out_shape 2))
+        (= ?m (nth_from_end ?out_shape 1))
+        (= ?n (nth_from_end ?out_shape 0))
+        (!= ?m (MNum 0))
+        (!= ?n (MNum 0))
+        (!= ?k (MNum 1))
+        (!= ?batch (MNum 0))
+
+        (= ?a_batch_stride (nth_from_end ?a_stride 3))
+        (= ?a_m_stride (nth_from_end ?a_stride 2))
+        (= ?a_n_stride (nth_from_end ?a_stride 1))
+        (= ?a_k_stride (nth_from_end ?a_stride 0))
+
+        (= ?b_batch_stride (nth_from_end ?b_stride 3))
+        (= ?b_m_stride (nth_from_end ?b_stride 2))
+        (= ?b_n_stride (nth_from_end ?b_stride 1))
+        (= ?b_k_stride (nth_from_end ?b_stride 0))
+
+        (= ?k_stride (MIter))
+
+        (= ?a_m_stride (MIter))
+        (= ?a_n_stride (MNum 0))
+        (= ?a_k_stride (MMul (MIter) ?m))
+
+        (= ?b_k_stride (MIter))
+        (= ?b_m_stride (MNum 0))
+        (= ?b_n_stride (MMul (MIter) ?k))
+
+        (= ?a_batch_stride (MMul ?k ?a_k_stride))
+        (= ?b_batch_stride (MMul ?n ?b_n_stride))
+
+        (= ?dt (dtype ?a))
+        (= ?dt (dtype ?b))
+        (cublaslt_base_dtype ?dt)
+    )
+    (
+        (let ?sgemm (Op (cublaslt
+            ?m ?n ?k
+            "N" "N"
+            "COL" "COL" "ROW" "ROW"
+            ?a_k_stride
+            ?b_n_stride
+            ?n
+            ?n
+            ?batch
+            ?a_batch_stride
+            ?b_batch_stride
+            (MMul ?m ?n)
+            (MMul ?m ?n)
+            ?dt ?dt ?dt ?dt "default" "default" 1.0 0.0 "DEFAULT")
+            (ICons ?a (ICons ?b (INil)))))
+        (union ?sum ?sgemm)
+        (set (dtype ?sgemm) ?dt)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt row-order batched column-major x column-major"
+)
--- a/crates/luminal_cuda_lite/src/host/cublaslt/cublaslt_scale_rewrite.egg
+++ b/crates/luminal_cuda_lite/src/host/cublaslt/cublaslt_scale_rewrite.egg
@@ -0,0 +1,316 @@
+; Scalar alpha/beta rewrites for cuBLASLt. These rules target scalar constants
+; expanded across the matmul/add shape, i.e. zero strides on every logical axis.
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order ?d_order
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            1.0 0.0 "DEFAULT")
+            (ICons ?a (ICons ?b ?matmul_tail))))
+
+        (= ?scale (Op (Constant ?alpha) (INil)))
+        ; alpha=1.0 hash-conses ?fused == ?matmul; the union merges Mul into ?matmul's eclass and saturate diverges. 
+        (!= ?alpha 1.0)
+        (= ?scaled (Op (Mul ?shape
+            ?matmul_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ENil)))
+            ?scaled_out_strides)
+            (ICons ?matmul (ICons ?scale (INil)))))
+        (= ?matmul_strides ?scaled_out_strides)
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order ?d_order
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 "DEFAULT")
+            (ICons ?a (ICons ?b ?matmul_tail))))
+        (union ?scaled ?fused)
+        (set (dtype ?fused) ?d_dtype)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt 2d alpha scale"
+)
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order ?d_order
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            1.0 0.0 "DEFAULT")
+            (ICons ?a (ICons ?b ?matmul_tail))))
+
+        (= ?scale (Op (Constant ?alpha) (INil)))
+        ; See 2d alpha scale: alpha=1.0 makes (saturate ...) diverge.   
+        (!= ?alpha 1.0)
+        (= ?scaled (Op (Mul ?shape
+            ?matmul_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ECons (MNum 0) (ENil))))
+            ?scaled_out_strides)
+            (ICons ?matmul (ICons ?scale (INil)))))
+        (= ?matmul_strides ?scaled_out_strides)
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?c_order ?d_order
+            ?lda ?ldb ?ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 "DEFAULT")
+            (ICons ?a (ICons ?b ?matmul_tail))))
+        (union ?scaled ?fused)
+        (set (dtype ?fused) ?d_dtype)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt batched alpha scale"
+)
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?matmul_c_order "ROW"
+            ?lda ?ldb ?matmul_ldc ?ldd
+            (MNum 1)
+            ?stride_a ?stride_b ?matmul_stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 ?epilogue)
+            (ICons ?a (ICons ?b ?matmul_tail))))
+
+        (= ?beta_node (Op (Constant ?beta) (INil)))
+        (= ?scaled_c (Op (Mul
+            (ECons ?m (ECons ?n (ENil)))
+            ?c_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ENil)))
+            ?scaled_c_out_strides)
+            (ICons ?c (ICons ?beta_node (INil)))))
+
+        (= ?add (Op (Add
+            (ECons ?m (ECons ?n (ENil)))
+            ?matmul_add_strides
+            ?scaled_c_add_strides
+            ?add_out_strides)
+            (ICons ?matmul (ICons ?scaled_c (INil)))))
+
+        (= ?matmul_add_strides (ECons ?d_row_stride (ECons ?d_col_stride (ENil))))
+        (= ?c_strides (ECons ?c_row_stride (ECons ?c_col_stride (ENil))))
+        (= ?add_out_strides (ECons ?d_row_stride (ECons ?d_col_stride (ENil))))
+        (= ?scaled_c_add_strides ?scaled_c_out_strides)
+        (= ?c_col_stride (MIter))
+        (!= ?c_row_stride (MNum 0))
+        (= ?matmul_add_strides ?add_out_strides)
+        (= ?c_dtype (dtype ?c))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order "ROW" "ROW"
+            ?lda ?ldb ?c_row_stride ?ldd
+            (MNum 1)
+            ?stride_a ?stride_b (MNum 0) ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha ?beta ?epilogue)
+            (ICons ?a (ICons ?b (ICons ?c ?matmul_tail)))))
+        (union ?add ?fused)
+        (set (dtype ?fused) ?d_dtype)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt row-order 2d scaled c beta"
+)
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?matmul_c_order "ROW"
+            ?lda ?ldb ?matmul_ldc ?ldd
+            (MNum 1)
+            ?stride_a ?stride_b ?matmul_stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 ?epilogue)
+            (ICons ?a (ICons ?b ?matmul_tail))))
+
+        (= ?beta_node (Op (Constant ?beta) (INil)))
+        (= ?scaled_c (Op (Mul
+            (ECons ?m (ECons ?n (ENil)))
+            ?c_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ENil)))
+            ?scaled_c_out_strides)
+            (ICons ?c (ICons ?beta_node (INil)))))
+
+        (= ?add (Op (Add
+            (ECons ?m (ECons ?n (ENil)))
+            ?scaled_c_add_strides
+            ?matmul_add_strides
+            ?add_out_strides)
+            (ICons ?scaled_c (ICons ?matmul (INil)))))
+
+        (= ?matmul_add_strides (ECons ?d_row_stride (ECons ?d_col_stride (ENil))))
+        (= ?c_strides (ECons ?c_row_stride (ECons ?c_col_stride (ENil))))
+        (= ?add_out_strides (ECons ?d_row_stride (ECons ?d_col_stride (ENil))))
+        (= ?scaled_c_add_strides ?scaled_c_out_strides)
+        (= ?c_col_stride (MIter))
+        (!= ?c_row_stride (MNum 0))
+        (= ?matmul_add_strides ?add_out_strides)
+        (= ?c_dtype (dtype ?c))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order "ROW" "ROW"
+            ?lda ?ldb ?c_row_stride ?ldd
+            (MNum 1)
+            ?stride_a ?stride_b (MNum 0) ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha ?beta ?epilogue)
+            (ICons ?a (ICons ?b (ICons ?c ?matmul_tail)))))
+        (union ?add ?fused)
+        (set (dtype ?fused) ?d_dtype)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt row-order 2d scaled c plus matmul beta"
+)
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?matmul_c_order "ROW"
+            ?lda ?ldb ?matmul_ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?matmul_stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 ?epilogue)
+            (ICons ?a (ICons ?b ?matmul_tail))))
+
+        (= ?beta_node (Op (Constant ?beta) (INil)))
+        (= ?scaled_c (Op (Mul
+            (ECons ?batch (ECons ?m (ECons ?n (ENil))))
+            ?c_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ECons (MNum 0) (ENil))))
+            ?scaled_c_out_strides)
+            (ICons ?c (ICons ?beta_node (INil)))))
+
+        (= ?add (Op (Add
+            (ECons ?batch (ECons ?m (ECons ?n (ENil))))
+            ?matmul_add_strides
+            ?scaled_c_add_strides
+            ?add_out_strides)
+            (ICons ?matmul (ICons ?scaled_c (INil)))))
+
+        (= ?matmul_add_strides (ECons ?d_batch_stride (ECons ?d_row_stride (ECons ?d_col_stride (ENil)))))
+        (= ?c_strides (ECons ?c_batch_stride (ECons ?c_row_stride (ECons ?c_col_stride (ENil)))))
+        (= ?add_out_strides (ECons ?d_batch_stride (ECons ?d_row_stride (ECons ?d_col_stride (ENil)))))
+        (= ?scaled_c_add_strides ?scaled_c_out_strides)
+        (= ?c_col_stride (MIter))
+        (!= ?c_row_stride (MNum 0))
+        (= ?matmul_add_strides ?add_out_strides)
+        (= ?c_dtype (dtype ?c))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order "ROW" "ROW"
+            ?lda ?ldb ?c_row_stride ?ldd
+            ?batch
+            ?stride_a ?stride_b ?c_batch_stride ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha ?beta ?epilogue)
+            (ICons ?a (ICons ?b (ICons ?c ?matmul_tail)))))
+        (union ?add ?fused)
+        (set (dtype ?fused) ?d_dtype)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt row-order batched scaled c beta"
+)
+
+(rule
+    (
+        (= ?matmul (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order ?matmul_c_order "ROW"
+            ?lda ?ldb ?matmul_ldc ?ldd
+            ?batch
+            ?stride_a ?stride_b ?matmul_stride_c ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha 0.0 ?epilogue)
+            (ICons ?a (ICons ?b ?matmul_tail))))
+
+        (= ?beta_node (Op (Constant ?beta) (INil)))
+        (= ?scaled_c (Op (Mul
+            (ECons ?batch (ECons ?m (ECons ?n (ENil))))
+            ?c_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ECons (MNum 0) (ENil))))
+            ?scaled_c_out_strides)
+            (ICons ?c (ICons ?beta_node (INil)))))
+
+        (= ?add (Op (Add
+            (ECons ?batch (ECons ?m (ECons ?n (ENil))))
+            ?scaled_c_add_strides
+            ?matmul_add_strides
+            ?add_out_strides)
+            (ICons ?scaled_c (ICons ?matmul (INil)))))
+
+        (= ?matmul_add_strides (ECons ?d_batch_stride (ECons ?d_row_stride (ECons ?d_col_stride (ENil)))))
+        (= ?c_strides (ECons ?c_batch_stride (ECons ?c_row_stride (ECons ?c_col_stride (ENil)))))
+        (= ?add_out_strides (ECons ?d_batch_stride (ECons ?d_row_stride (ECons ?d_col_stride (ENil)))))
+        (= ?scaled_c_add_strides ?scaled_c_out_strides)
+        (= ?c_col_stride (MIter))
+        (!= ?c_row_stride (MNum 0))
+        (= ?matmul_add_strides ?add_out_strides)
+        (= ?c_dtype (dtype ?c))
+    )
+    (
+        (let ?fused (Op (cublaslt
+            ?m ?n ?k
+            ?a_layout ?b_layout
+            ?a_order ?b_order "ROW" "ROW"
+            ?lda ?ldb ?c_row_stride ?ldd
+            ?batch
+            ?stride_a ?stride_b ?c_batch_stride ?stride_d
+            ?a_dtype ?b_dtype ?c_dtype ?d_dtype
+            ?compute_type ?scale_dtype
+            ?alpha ?beta ?epilogue)
+            (ICons ?a (ICons ?b (ICons ?c ?matmul_tail)))))
+        (union ?add ?fused)
+        (set (dtype ?fused) ?d_dtype)
+    )
+    :ruleset matmul_backend
+    :name "cublaslt row-order batched scaled c plus matmul beta"
+)
--- a/crates/luminal_cuda_lite/src/host/cublaslt/mod.rs
+++ b/crates/luminal_cuda_lite/src/host/cublaslt/mod.rs
--- a/crates/luminal_cuda_lite/src/host/flashinfer/README.md
+++ b/crates/luminal_cuda_lite/src/host/flashinfer/README.md
@@ -0,0 +1,124 @@
+# FlashInfer Integration
+
+FlashInfer replaces the multi-op attention pattern (Q×K^T → scale → mask → softmax → ×V) with a single fused GPU kernel via [FlashInfer](https://github.com/flashinfer-ai/flashinfer)'s batch decode and batch prefill APIs.
+
+## Current State
+
+**Working:**
+- Egglog rewrite rule matches any GQA paged attention pattern (model-agnostic shapes)
+- GA search selects FlashInfer when it wins profiling — verified on Llama 3 8B (32 layers) and Qwen 3 4B (36 layers)
+- **BatchDecode** (s=1): fp32 natively — FlashInfer's decode kernel uses scalar vectorized dot products, no tensor cores
+- **BatchPrefill**: template-instantiated for fp16 but **not callable from fp32** — FlashInfer's prefill kernel requires tensor core MMA (`mma.sync.aligned.m16n8k16`) and `ldmatrix` which physically only operate on 16-bit types; the C API stubs return -1 for fp32; will be enabled when native fp16/bf16 pipeline is added
+- Decode handles all cases in the current fp32 pipeline (prefill uses cuBLAS attention via dim bucketing)
+- Indptr-based mask: `qo_indptr` and `kv_indptr` are computed in-graph so the egglog rule can see them in the same chunk as the attention ops
+
+**Not yet implemented:**
+- Native fp16 / bf16 pipeline (would eliminate the cast overhead in prefill)
+- Page sizes > 1
+
+---
+
+## File Organization
+
+```
+src/host/flashinfer/
+  flashinfer_attention.egg  — egglog rewrite rule (pattern match → FlashInferAttention)
+  mod.rs                    — FlashInferAttention op (EgglogOp + HostOp impl)
+  jit.rs                    — JIT compilation: nvcc wrapper.cu → .so, dlopen, fn pointers
+  find_indptrs.rs           — walks the mask e-graph node to locate qo_indptr / kv_indptr inputs
+  wrapper.cu                — CUDA: FlashInfer template instantiation + helper kernels
+  wrapper.h                 — C API header for wrapper.cu
+  README.md                 — this file
+```
+
+## How It Works
+
+### 1. Egglog Pattern Matching
+
+The rule in `flashinfer_attention.egg` matches the structural pattern of paged GQA attention:
+
+```
+Gather(K_cache, idx) → GQA broadcast (Mul×1.0) → Q×K^T → Sum → scale → mask Add → softmax → attn×V → Sum → output
+Gather(V_cache, idx) → GQA broadcast (Mul×1.0) ──────────────────────────────────────────→ attn×V → Sum → output
+```
+
+Key anchors that prevent false matches on MLP or other ops:
+- Two Gather ops from 2D cache pools (MLP never uses Gather)
+- GQA broadcast via `Mul(gathered, Constant(1.0))` with all-zero strides
+- Mask Add with zero-stride broadcast in the first (nheads) dimension
+- Two sequential matmul+Sum pairs connected through softmax
+
+Shape dimensions are egglog variables, not pinned constants — the rule works for any model with GQA (Llama, Qwen, Mistral, etc.). The structural invariants (dimension count, zero-stride positions, Gather from 2D) are enough to avoid combinatorial explosion during saturation.
+
+When the rule fires, it unions `FlashInferAttention` with the original attention output, making it an equivalent alternative in the e-graph. The GA search then profiles both paths and picks the faster one.
+
+### 2. Extraction: Finding Indptrs
+
+During `extract()` (called when egglog selects the FlashInferAttention e-node), `find_indptrs.rs` walks backward from the mask node in the e-graph to locate the `qo_indptr` and `kv_indptr` Input nodes. It validates the mask structure by checking for the `Mul(allowed, Constant(1e10))` pattern that `compute_attn_mask()` produces.
+
+The indptrs are appended as inputs 5 and 6 to the FlashInferAttention op, so the runtime can build the CSR page table directly without recomputing anything.
+
+### 3. JIT Compilation
+
+FlashInfer requires `HEAD_DIM` as a compile-time template parameter. Rather than baking it at `cargo build` time, `jit.rs` JIT-compiles `wrapper.cu` with the model's actual HEAD_DIM:
+
+1. First call to `ensure_compiled(head_dim)` runs `nvcc` with `-DLUMINAL_HEAD_DIM=<N>`
+2. The compiled `.so` is cached at `~/.cache/luminal/flashinfer/libflashinfer_hd<N>_<arch>.so`
+3. Subsequent calls load the cached library via `dlopen`
+4. Function pointers (plan, run, transpose, etc.) are resolved and stored in a `static OnceLock`
+
+Supported HEAD_DIM values: 64, 128, 256.
+
+### 4. Runtime Execution
+
+`FlashInferAttention::execute()` dispatches to decode or prefill based on `total_q_tokens vs batch_size`:
+
+**Common steps:**
+1. **Extract kv_indices** — a helper kernel converts the flat gather index `(c, KV_DIM)` to slot indices `(c,)`
+2. **Read indptrs to host** — copied to CPU for the plan phase
+3. **Plan** — queries GPU occupancy and decides split-KV decomposition
+4. **Run** — the fused kernel writes `(total_q_tokens, num_qo_heads, head_dim)`
+5. **Transpose** — transposes to `(num_qo_heads, total_q_tokens, head_dim)` to match the Sum reduction layout
+
+**Decode path** (current, fp32): Always used. Runs FlashInfer's BatchDecode directly on fp32 buffers.
+
+**Prefill path** (future, fp16/bf16 only): The prefill kernel templates are compiled into the JIT .so for fp16 (CTA_TILE_Q=16/64/128, causal mask). The C API stubs currently return -1 since the pipeline is fp32. When native fp16/bf16 dtype support is added, `execute()` will dispatch to prefill when `total_q_tokens > batch_size`.
+
+Global workspaces (`static OnceLock`) are shared across all FlashInferAttention instances to avoid ~4ms allocation overhead per GA profiling candidate. Without this, the GA never selects FlashInfer because the first-run allocation cost dwarfs the kernel time.
+
+## How the Attention Mask Enables FlashInfer
+
+For the egglog rule to fire, the `qo_indptr` and `kv_indptr` tensors must be visible in the same e-graph chunk as the attention ops. This is why the mask is computed *inside* each layer (via `compute_attn_mask()` in the model) rather than passed as a pre-computed input.
+
+The mask computation uses a specific structure:
+```rust
+let allowed = same_request * causal;
+allowed * 1e10 - 1e10    // → 0.0 for allowed, -1e10 for blocked
+```
+
+The `Mul(allowed, Constant(1e10))` pattern is the anchor that `find_indptrs.rs` uses to walk backward and locate the indptr inputs.
+
+## Roadmap
+
+Items listed in priority order. Checked items are done.
+
+- [x] Model-agnostic egglog rule (shape variables instead of Llama-specific constants)
+- [x] bs>1 supersequence decode
+- [x] Indptr-based attention mask (replaces CPU-computed mask)
+- [x] Multi-model support (verified on Llama 3 8B and Qwen 3 4B)
+- [x] BatchPrefill kernel compiled for fp16 (causal mask, CTA_TILE_Q=16/64/128)
+- [ ] Native fp16 / bf16 pipeline (enables prefill, reduces memory, eliminates cuBLAS prefill fallback)
+- [ ] HEAD_DIM dispatch for 64, 96 (JIT supports 64/128/256; wrapper.cu needs 96 for Phi)
+- [ ] Page sizes > 1 (currently page_size=1; larger pages reduce CSR overhead)
+- [ ] Sliding window, ALiBi, logits soft cap (FlashInfer `AttentionVariant` templates)
+- [ ] MHA / MQA / arbitrary GQA ratios beyond {1, 2, 4, 8}
+
+## Key Design Decisions
+
+- **page_size=1**: Each KV cache slot is one "page". This simplifies the CSR page table (`kv_indices` = physical slot indices directly) and matches the flat `(num_slots, KV_DIM)` cache layout.
+
+- **Pinned structural anchors**: The egglog rule pins the *structure* (number of dimensions, which dims are zero-stride, presence of Gather from 2D cache) but uses variables for the *values* (head counts, head_dim). This prevents saturation blowup while remaining model-agnostic.
+
+- **Prefill requires fp16/bf16**: FlashInfer's prefill kernel uses tensor core MMA instructions (`mma.sync.aligned.m16n8k16`) and `ldmatrix` which physically require 16-bit inputs — there is no fp32 tensor core matmul instruction. The prefill kernel templates are compiled into the .so for fp16 but the C API returns -1 for fp32 callers. When native fp16/bf16 is added, prefill will be enabled automatically.
+
+- **Global workspaces**: Float workspace (128 MiB), int workspace (8 MiB), and a page-locked host buffer are allocated once via `static OnceLock` and shared across all instances.
--- a/crates/luminal_cuda_lite/src/host/flashinfer/find_indptrs.rs
+++ b/crates/luminal_cuda_lite/src/host/flashinfer/find_indptrs.rs
@@ -0,0 +1,248 @@
+//! Walk the e-graph from the mask node to find qo_indptr and kv_indptr Input nodes.
+//!
+//! The mask is produced by `compute_attn_mask(q_pos, qo_indptr, kv_indptr)` using
+//! primitive HLIR ops. This module validates the mask's structure and extracts the
+//! indptr Input node IDs so FlashInfer can use them directly.
+
+use luminal::egglog_utils::{ClassId, NodeId, SerializedEGraph};
+use luminal::prelude::FxHashSet;
+
+/// Result of walking the mask computation chain.
+#[derive(Debug)]
+pub struct IndptrNodes<'a> {
+    pub qo_indptr: &'a NodeId,
+    pub kv_indptr: &'a NodeId,
+}
+
+/// Find the qo_indptr and kv_indptr Input nodes by walking backwards from the mask.
+///
+/// Validates the mask structure: `allowed * 1e10 + (-1e10)`. Then does a BFS from
+/// the `allowed` subtree to find all reachable Input nodes with names containing
+/// "qo_indptr" and "kv_indptr".
+///
+/// Panics with a diagnostic message if the structure doesn't match or the
+/// indptr inputs can't be found.
+pub fn find_indptr_inputs<'a>(
+    egraph: &'a SerializedEGraph,
+    mask_node: &'a NodeId,
+) -> IndptrNodes<'a> {
+    // Step 1: Validate mask = Add(scaled_allowed, neg_constant)
+    let (mask_label, mask_children) = &egraph.enodes[mask_node];
+    assert!(
+        mask_label == "Op",
+        "find_indptr_inputs: mask node is not an Op (label={mask_label})"
+    );
+    let mask_kind = resolve_first_node(egraph, &mask_children[0]);
+    let mask_kind_label = &egraph.enodes[mask_kind].0;
+    assert!(
+        mask_kind_label.contains("Add"),
+        "find_indptr_inputs: mask is not an Add (kind={mask_kind_label})"
+    );
+
+    let mask_inputs = walk_ilist_simple(egraph, &mask_children[1]);
+    assert_eq!(
+        mask_inputs.len(),
+        2,
+        "find_indptr_inputs: mask Add should have 2 inputs, got {}",
+        mask_inputs.len()
+    );
+
+    // Step 2: One of the inputs should be Mul(allowed, Constant(1e10))
+    let (scaled_allowed, allowed_node) = find_1e10_mul(egraph, &mask_inputs);
+
+    // Step 3: BFS from `allowed` to find all reachable Input nodes
+    let reachable_inputs = find_reachable_inputs(egraph, allowed_node);
+
+    // Step 4: Match by name
+    let mut qo_indptr: Option<&NodeId> = None;
+    let mut kv_indptr: Option<&NodeId> = None;
+
+    for (node_id, name) in &reachable_inputs {
+        if name.contains("qo_indptr") {
+            qo_indptr = Some(node_id);
+        } else if name.contains("kv_indptr") {
+            kv_indptr = Some(node_id);
+        }
+    }
+
+    let qo = qo_indptr.unwrap_or_else(|| {
+        let found_names: Vec<&str> = reachable_inputs.iter().map(|(_, n)| n.as_str()).collect();
+        panic!(
+            "find_indptr_inputs: could not find 'qo_indptr' Input reachable from mask.\n\
+             Found inputs: {:?}\n\
+             Mask node: {:?}\n\
+             Scaled allowed node: {:?}",
+            found_names, mask_node, scaled_allowed
+        );
+    });
+
+    let kv = kv_indptr.unwrap_or_else(|| {
+        let found_names: Vec<&str> = reachable_inputs.iter().map(|(_, n)| n.as_str()).collect();
+        panic!(
+            "find_indptr_inputs: could not find 'kv_indptr' Input reachable from mask.\n\
+             Found inputs: {:?}\n\
+             Mask node: {:?}\n\
+             Scaled allowed node: {:?}",
+            found_names, mask_node, scaled_allowed
+        );
+    });
+
+    IndptrNodes {
+        qo_indptr: qo,
+        kv_indptr: kv,
+    }
+}
+
+fn find_1e10_mul<'a>(
+    egraph: &'a SerializedEGraph,
+    mask_add_inputs: &[&'a NodeId],
+) -> (&'a NodeId, &'a NodeId) {
+    for &input_node in mask_add_inputs {
+        let (label, children) = &egraph.enodes[input_node];
+        if label != "Op" {
+            continue;
+        }
+        let kind = resolve_first_node(egraph, &children[0]);
+        if !egraph.enodes[kind].0.contains("Mul") {
+            continue;
+        }
+        let mul_inputs = walk_ilist_simple(egraph, &children[1]);
+        if mul_inputs.len() != 2 {
+            continue;
+        }
+        for (i, &inp) in mul_inputs.iter().enumerate() {
+            if is_constant(egraph, inp, 1e10) {
+                let other = mul_inputs[1 - i];
+                return (input_node, other);
+            }
+        }
+    }
+    let mut debug_info = String::new();
+    for (i, &input_node) in mask_add_inputs.iter().enumerate() {
+        let (label, children) = &egraph.enodes[input_node];
+        debug_info.push_str(&format!("\n  input[{i}]: label={label}"));
+        if label == "Op" && !children.is_empty() {
+            let kind = resolve_first_node(egraph, &children[0]);
+            let kind_label = &egraph.enodes[kind].0;
+            debug_info.push_str(&format!(" kind={kind_label}"));
+            for (j, kc) in egraph.enodes[kind].1.iter().enumerate() {
+                let kc_node = resolve_first_node(egraph, kc);
+                debug_info.push_str(&format!(" child[{j}]={}", egraph.enodes[kc_node].0));
+            }
+            if kind_label.contains("Mul") && children.len() >= 2 {
+                let mul_inputs = walk_ilist_simple(egraph, &children[1]);
+                for (j, &mi) in mul_inputs.iter().enumerate() {
+                    let (ml, mc) = &egraph.enodes[mi];
+                    debug_info.push_str(&format!("\n    mul_input[{j}]: label={ml}"));
+                    if ml == "Op" && !mc.is_empty() {
+                        let mk = resolve_first_node(egraph, &mc[0]);
+                        debug_info.push_str(&format!(" kind={}", egraph.enodes[mk].0));
+                        for (k, mkc) in egraph.enodes[mk].1.iter().enumerate() {
+                            let mkc_node = resolve_first_node(egraph, mkc);
+                            debug_info.push_str(&format!(" ch[{k}]={}", egraph.enodes[mkc_node].0));
+                        }
+                    }
+                }
+            }
+        }
+    }
+    panic!(
+        "find_indptr_inputs: could not find Mul(allowed, Constant(1e10)) in mask Add inputs.{debug_info}"
+    );
+}
+
+fn is_constant(egraph: &SerializedEGraph, node: &NodeId, expected: f32) -> bool {
+    let (label, children) = &egraph.enodes[node];
+    if label != "Op" {
+        return false;
+    }
+    let kind = resolve_first_node(egraph, &children[0]);
+    let kind_label = &egraph.enodes[kind].0;
+    if !kind_label.contains("Constant") {
+        return false;
+    }
+    let val_children = &egraph.enodes[kind].1;
+    if val_children.is_empty() {
+        return false;
+    }
+    let val_node = resolve_first_node(egraph, &val_children[0]);
+    let val_str = &egraph.enodes[val_node].0;
+    if let Ok(val) = val_str.parse::<f64>() {
+        (val as f32 - expected).abs() < 1.0
+    } else {
+        false
+    }
+}
+
+fn find_reachable_inputs<'a>(
+    egraph: &'a SerializedEGraph,
+    start: &'a NodeId,
+) -> Vec<(&'a NodeId, String)> {
+    let mut found = Vec::new();
+    let mut visited = FxHashSet::default();
+    let mut stack = vec![start];
+
+    while let Some(node) = stack.pop() {
+        if !visited.insert(node) {
+            continue;
+        }
+
+        let (label, children) = &egraph.enodes[node];
+
+        if label == "Input" {
+            if children.len() >= 2 {
+                let name_node = resolve_first_node(egraph, &children[1]);
+                let name = egraph.enodes[name_node].0.trim_matches('"').to_string();
+                found.push((node, name));
+            }
+            continue;
+        }
+
+        if label == "Op" && children.len() >= 2 {
+            let ir_inputs = walk_ilist_simple(egraph, &children[1]);
+            for inp in ir_inputs {
+                stack.push(inp);
+            }
+        }
+    }
+
+    found
+}
+
+fn walk_ilist_simple<'a>(
+    egraph: &'a SerializedEGraph,
+    ilist_eclass: &'a ClassId,
+) -> Vec<&'a NodeId> {
+    let mut inputs = Vec::new();
+    let mut current = resolve_first_node(egraph, ilist_eclass);
+
+    loop {
+        let (label, children) = &egraph.enodes[current];
+        if label == "INil" {
+            break;
+        }
+        if label != "ICons" {
+            break;
+        }
+        let ir_node = resolve_first_ir_node(egraph, &children[0]);
+        inputs.push(ir_node);
+        current = resolve_first_node(egraph, &children[1]);
+    }
+
+    inputs
+}
+
+fn resolve_first_node<'a>(egraph: &'a SerializedEGraph, eclass: &ClassId) -> &'a NodeId {
+    &egraph.eclasses[eclass].1[0]
+}
+
+fn resolve_first_ir_node<'a>(egraph: &'a SerializedEGraph, eclass: &ClassId) -> &'a NodeId {
+    let nodes = &egraph.eclasses[eclass].1;
+    for node in nodes {
+        let label = &egraph.enodes[node].0;
+        if label == "Op" || label == "Input" {
+            return node;
+        }
+    }
+    &nodes[0]
+}
--- a/crates/luminal_cuda_lite/src/host/flashinfer/flashinfer_attention.egg
+++ b/crates/luminal_cuda_lite/src/host/flashinfer/flashinfer_attention.egg
@@ -0,0 +1,125 @@
+; FlashInfer batch decode attention rewrite rule.
+;
+; Matches the paged attention pattern for ANY model with GQA:
+;   Gather(K_cache) → GQA broadcast → Q*K^T matmul → scale → add mask → softmax → attn*V matmul
+;   Gather(V_cache) → GQA broadcast ──────────────────────────────────────────→ attn*V matmul
+;
+; Structural anchors (prevent false matches on MLP/other ops):
+;   - Gather ops from 2D cache pools (MLP never uses Gather)
+;   - GQA broadcast via Mul(gathered, Constant(1.0)) with all-zero strides
+;   - Scale Mul(QK, constant) connecting QK scores to mask Add
+;   - Mask Add with zero-stride broadcast in first dim (nheads broadcast)
+;   - Data flow: two sequential matmul+reduce pairs connected through softmax
+;
+; The egglog rule captures the mask as 5th input.  During extract(), a Rust
+; function walks the mask's computation chain in the e-graph to locate the
+; qo_indptr and kv_indptr Input nodes (validated via the Constant(1e10) anchor
+; and structural checks).  These are appended as inputs 5 and 6 so FlashInfer
+; can build the CSR page table directly — no runtime derivation needed.
+;
+; Shape dimensions are egglog variables, not pinned constants.
+; Dynamic dims "s" (batch/seq) and "c" (context) stay pinned as MVar.
+
+(rule
+    (
+        ; ── Second matmul: Mul(softmax_out, V_gqa) ──
+        ; Shape: (nheads, s, hdim, c) — 4D
+        (= ?mul2 (Op (Mul
+            (ECons ?nheads (ECons (MVar "s") (ECons ?hdim (ECons (MVar "c") (ENil)))))
+            ?mul2_a_strides
+            ?mul2_b_strides
+            ?mul2_out_strides)
+            (ICons ?soft (ICons ?v_gqa (INil)))))
+
+        ; ── Second matmul: Sum (reduction over c) → output ──
+        ; Shape: (nheads, s, hdim) — reduces c
+        (= ?output (Op (Sum
+            (ECons ?nheads2 (ECons (MVar "s") (ECons ?hdim2 (ENil))))
+            (MVar "c")
+            ?out_in_strides
+            (MIter)
+            ?out_out_strides)
+            (ICons ?mul2 (INil))))
+
+        ; ── V GQA broadcast: Mul(V_gathered, 1.0) with zero-stride constant ──
+        ; Shape: (nheads, c, hdim) — 3D
+        (= ?v_gqa_const (Op (Constant 1.000000) (INil)))
+        (= ?v_gqa (Op (Mul
+            (ECons ?nheads3 (ECons (MVar "c") (ECons ?hdim3 (ENil))))
+            ?v_gqa_a_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ECons (MNum 0) (ENil))))
+            ?v_gqa_out_strides)
+            (ICons ?v_gathered (ICons ?v_gqa_const (INil)))))
+
+        ; ── V Gather: rows from V_cache (2D) ──
+        ; Shape: (c, kvdim), Source: (num_slots, kvdim)
+        (= ?v_gathered (Op (Gather
+            (ECons (MVar "c") (ECons ?kvdim (ENil)))
+            ?v_gather_strides
+            (ECons ?num_slots_v (ECons ?kvdim2 (ENil)))
+            ?v_src_strides)
+            (ICons ?v_idx (ICons ?v_cache (INil)))))
+
+        ; ── First matmul: Mul(Q, K_gqa) ──
+        ; Shape: (nheads, s, c, hdim) — 4D
+        (= ?mul1 (Op (Mul
+            (ECons ?nheads4 (ECons (MVar "s") (ECons (MVar "c") (ECons ?hdim4 (ENil)))))
+            ?mul1_a_strides
+            ?mul1_b_strides
+            ?mul1_out_strides)
+            (ICons ?q (ICons ?k_gqa (INil)))))
+
+        ; ── First matmul: Sum (reduction over hdim) → QK scores ──
+        ; Shape: (nheads, s, c) — reduces hdim
+        (= ?qk (Op (Sum
+            (ECons ?nheads5 (ECons (MVar "s") (ECons (MVar "c") (ENil))))
+            ?hdim5
+            ?qk_in_strides
+            (MIter)
+            ?qk_out_strides)
+            (ICons ?mul1 (INil))))
+
+        ; ── Mask Add: Add(scaled_QK, mask) ──
+        ; Shape: (nheads, s, c) — 3D
+        ; Mask is broadcast from (s, c) via zero-stride in first dim (nheads).
+        (= ?masked (Op (Add
+            (ECons ?nheads8 (ECons (MVar "s") (ECons (MVar "c") (ENil))))
+            ?mask_add_a_strides
+            (ECons (MNum 0) ?mask_rest_strides)
+            ?mask_add_out_strides)
+            (ICons ?scaled_qk (ICons ?mask (INil)))))
+
+        ; ── K GQA broadcast: Mul(K_gathered, 1.0) with zero-stride constant ──
+        ; Shape: (nheads, hdim, c) — 3D
+        (= ?k_gqa_const (Op (Constant 1.000000) (INil)))
+        (= ?k_gqa (Op (Mul
+            (ECons ?nheads6 (ECons ?hdim6 (ECons (MVar "c") (ENil))))
+            ?k_gqa_a_strides
+            (ECons (MNum 0) (ECons (MNum 0) (ECons (MNum 0) (ENil))))
+            ?k_gqa_out_strides)
+            (ICons ?k_gathered (ICons ?k_gqa_const (INil)))))
+
+        ; ── K Gather: rows from K_cache (2D) ──
+        ; Shape: (c, kvdim), Source: (num_slots, kvdim)
+        (= ?k_gathered (Op (Gather
+            (ECons (MVar "c") (ECons ?kvdim3 (ENil)))
+            ?k_gather_strides
+            (ECons ?num_slots_k (ECons ?kvdim4 (ENil)))
+            ?k_src_strides)
+            (ICons ?k_idx (ICons ?k_cache (INil)))))
+
+        ; ── Dtype consistency ──
+        (= ?dt (dtype ?q))
+        (= ?dt (dtype ?k_cache))
+        (= ?dt (dtype ?v_cache))
+    )
+    (
+        (let ?fi (Op (FlashInferAttention
+            ?nheads (MDiv ?kvdim ?hdim) ?hdim (MNum 1) (MVar "s"))
+            (ICons ?q (ICons ?k_cache (ICons ?v_cache (ICons ?k_idx (ICons ?mask (INil))))))))
+        (union ?output ?fi)
+        (set (dtype ?fi) ?dt)
+    )
+    :ruleset matmul_backend
+    :name "FlashInfer batch decode attention"
+)
--- a/crates/luminal_cuda_lite/src/host/flashinfer/jit.rs
+++ b/crates/luminal_cuda_lite/src/host/flashinfer/jit.rs
@@ -0,0 +1,504 @@
+//! JIT compilation and dynamic loading of FlashInfer kernels.
+//!
+//! Everything runs at compile / profiling time — there is no `build.rs`.
+//! `wrapper.cu` and `wrapper.h` are embedded via `include_str!()` and
+//! extracted to the cache directory on first use. The FlashInfer + CUTLASS
+//! header trees are located by probing `LUMINAL_FLASHINFER_DIR`, a small set
+//! of default paths, and (as a last resort) by `git clone`-ing FlashInfer at
+//! a pinned commit into the cache. `nvcc` is then invoked with the model's
+//! actual `HEAD_DIM` and the resulting `.so` is `dlopen`'d.
+//!
+//! `ensure_compiled` is called from `FlashInferAttention::extract()`, i.e.
+//! during luminal's compile / GA-profiling phase, not from `execute()`. After
+//! the first call the `OnceLock` makes subsequent lookups free.
+
+use std::{
+    ffi::c_void,
+    hash::{Hash, Hasher},
+    path::{Path, PathBuf},
+    process::Command,
+    sync::OnceLock,
+};
+
+// ── Function pointer types matching wrapper.h ──
+
+pub type PlanFn = unsafe extern "C" fn(
+    float_workspace: *mut c_void,
+    float_ws_size: usize,
+    int_workspace: *mut c_void,
+    int_ws_size: usize,
+    page_locked_int_workspace: *mut c_void,
+    indptr_h: *mut i32,
+    batch_size: i32,
+    num_qo_heads: i32,
+    num_kv_heads: i32,
+    page_size: i32,
+    head_dim: i32,
+    stream: *mut c_void,
+    plan_info_out: *mut i64,
+    plan_info_len_out: *mut i32,
+) -> i32;
+
+pub type RunFn = unsafe extern "C" fn(
+    float_workspace: *mut c_void,
+    float_ws_size: usize,
+    int_workspace: *mut c_void,
+    plan_info_vec: *mut i64,
+    plan_info_len: i32,
+    q: *mut f32,
+    k_cache: *mut f32,
+    v_cache: *mut f32,
+    kv_indptr: *mut i32,
+    kv_indices: *mut i32,
+    kv_last_page_len: *mut i32,
+    output: *mut f32,
+    batch_size: i32,
+    num_qo_heads: i32,
+    num_kv_heads: i32,
+    page_size: i32,
+    head_dim: i32,
+    stream: *mut c_void,
+) -> i32;
+
+pub type ExtractFn = unsafe extern "C" fn(
+    flat_idx: *const i32,
+    out: *mut i32,
+    c: i32,
+    kv_dim: i32,
+    stream: *mut c_void,
+);
+
+pub type DeriveIndptrFn =
+    unsafe extern "C" fn(mask: *const f32, indptr: *mut i32, s: i32, c: i32, stream: *mut c_void);
+
+pub type TransposeOutputFn = unsafe extern "C" fn(
+    src: *const f32,
+    dst: *mut f32,
+    batch: i32,
+    heads: i32,
+    dim: i32,
+    stream: *mut c_void,
+);
+
+pub type PrefillPlanFn = unsafe extern "C" fn(
+    float_workspace: *mut c_void,
+    float_ws_size: usize,
+    int_workspace: *mut c_void,
+    int_ws_size: usize,
+    page_locked_int_workspace: *mut c_void,
+    qo_indptr_h: *mut i32,
+    kv_indptr_h: *mut i32,
+    total_num_rows: i32,
+    batch_size: i32,
+    num_qo_heads: i32,
+    num_kv_heads: i32,
+    page_size: i32,
+    head_dim: i32,
+    stream: *mut c_void,
+    plan_info_out: *mut i64,
+    plan_info_len_out: *mut i32,
+) -> i32;
+
+pub type PrefillRunFn = unsafe extern "C" fn(
+    float_workspace: *mut c_void,
+    float_ws_size: usize,
+    int_workspace: *mut c_void,
+    plan_info_vec: *mut i64,
+    plan_info_len: i32,
+    q: *mut f32,
+    k_cache: *mut f32,
+    v_cache: *mut f32,
+    qo_indptr: *mut i32,
+    kv_indptr: *mut i32,
+    kv_indices: *mut i32,
+    kv_last_page_len: *mut i32,
+    output: *mut f32,
+    total_num_rows: i32,
+    batch_size: i32,
+    num_qo_heads: i32,
+    num_kv_heads: i32,
+    page_size: i32,
+    head_dim: i32,
+    stream: *mut c_void,
+) -> i32;
+
+// ── Embedded CUDA sources ──
+
+const WRAPPER_CU: &str = include_str!("wrapper.cu");
+const WRAPPER_H: &str = include_str!("wrapper.h");
+
+// ── Loaded library handle ──
+
+pub struct FlashInferLib {
+    // Keep the handle alive so the dlopen'd .so remains mapped.
+    _lib: libloading::Library,
+    pub plan: PlanFn,
+    pub run: RunFn,
+    pub extract_slot_indices: ExtractFn,
+    pub derive_indptr_from_mask: DeriveIndptrFn,
+    pub transpose_output: TransposeOutputFn,
+    pub prefill_plan: PrefillPlanFn,
+    pub prefill_run: PrefillRunFn,
+}
+
+// SAFETY: The library handle and function pointers are valid for the lifetime
+// of the process. All functions are called with proper CUDA stream serialization.
+unsafe impl Send for FlashInferLib {}
+unsafe impl Sync for FlashInferLib {}
+
+static FLASHINFER_LIB: OnceLock<FlashInferLib> = OnceLock::new();
+
+/// Ensure the FlashInfer library is compiled and loaded for the given HEAD_DIM.
+/// Returns a reference to the loaded library. Thread-safe via OnceLock.
+pub fn ensure_compiled(head_dim: usize) -> &'static FlashInferLib {
+    FLASHINFER_LIB.get_or_init(|| {
+        assert!(
+            matches!(head_dim, 64 | 128 | 256),
+            "FlashInfer: unsupported HEAD_DIM={} (must be 64, 128, or 256 for f32)",
+            head_dim
+        );
+        let so_path = compile_or_cache(head_dim);
+        unsafe {
+            FlashInferLib::load(&so_path)
+                .unwrap_or_else(|e| panic!("Failed to load FlashInfer library: {e}"))
+        }
+    })
+}
+
+impl FlashInferLib {
+    /// Load a compiled FlashInfer .so and resolve function pointers.
+    ///
+    /// # Safety
+    /// The .so must be a valid FlashInfer wrapper compiled from wrapper.cu.
+    unsafe fn load(path: &Path) -> Result<Self, libloading::Error> {
+        let lib = unsafe { libloading::Library::new(path)? };
+        let plan: PlanFn = unsafe { *lib.get::<PlanFn>(b"flashinfer_batch_decode_plan\0")? };
+        let run: RunFn = unsafe { *lib.get::<RunFn>(b"flashinfer_batch_decode_run\0")? };
+        let extract_slot_indices: ExtractFn =
+            unsafe { *lib.get::<ExtractFn>(b"flashinfer_extract_slot_indices\0")? };
+        let derive_indptr_from_mask: DeriveIndptrFn =
+            unsafe { *lib.get::<DeriveIndptrFn>(b"flashinfer_derive_indptr_from_mask\0")? };
+        let transpose_output: TransposeOutputFn =
+            unsafe { *lib.get::<TransposeOutputFn>(b"flashinfer_transpose_output\0")? };
+        let prefill_plan: PrefillPlanFn =
+            unsafe { *lib.get::<PrefillPlanFn>(b"flashinfer_batch_prefill_plan\0")? };
+        let prefill_run: PrefillRunFn =
+            unsafe { *lib.get::<PrefillRunFn>(b"flashinfer_batch_prefill_run\0")? };
+        Ok(Self {
+            _lib: lib,
+            plan,
+            run,
+            extract_slot_indices,
+            derive_indptr_from_mask,
+            transpose_output,
+            prefill_plan,
+            prefill_run,
+        })
+    }
+}
+
+/// Compile wrapper.cu for the given HEAD_DIM, or return cached .so path.
+fn compile_or_cache(head_dim: usize) -> PathBuf {
+    let cache_dir = cache_directory();
+    std::fs::create_dir_all(&cache_dir).expect("Failed to create FlashInfer cache directory");
+
+    // Extract bundled wrapper sources to the cache so nvcc can compile them.
+    let (wrapper_cu_path, wrapper_h_dir) = extract_wrapper_sources(&cache_dir);
+
+    let arch = detect_cuda_arch();
+    // Bake a hash of the embedded wrapper into the .so name so old caches are
+    // discarded automatically when wrapper.cu or wrapper.h change.
+    let wrapper_hash = wrapper_source_hash();
+    let so_name = format!(
+        "libflashinfer_hd{}_{}_w{:016x}.so",
+        head_dim, arch, wrapper_hash
+    );
+    let so_path = cache_dir.join(&so_name);
+
+    if so_path.exists() {
+        eprintln!(
+            "FlashInfer: using cached library for HEAD_DIM={} ({})",
+            head_dim,
+            so_path.display()
+        );
+        return so_path;
+    }
+
+    let Some((flashinfer_include, cutlass_include)) = locate_flashinfer_includes() else {
+        panic!(
+            "FlashInfer: could not locate header tree. Set LUMINAL_FLASHINFER_DIR to the \
+             FlashInfer source root (the directory containing `include/` and \
+             `3rdparty/cutlass/include/`)."
+        );
+    };
+
+    eprintln!(
+        "FlashInfer: JIT compiling for HEAD_DIM={}, arch={} ...",
+        head_dim, arch
+    );
+    let start = std::time::Instant::now();
+
+    let output = Command::new("nvcc")
+        .args([
+            "-shared",
+            "-o",
+            so_path.to_str().unwrap(),
+            &format!("-DLUMINAL_HEAD_DIM={}", head_dim),
+            wrapper_cu_path.to_str().unwrap(),
+            "-I",
+            flashinfer_include.to_str().unwrap(),
+            "-I",
+            cutlass_include.to_str().unwrap(),
+            "-I",
+            wrapper_h_dir.to_str().unwrap(),
+            "-std=c++17",
+            &format!("-arch={}", arch),
+            "-O3",
+            "--expt-relaxed-constexpr",
+            "-w",
+            "-rdc=true",
+            "--compiler-options",
+            "-fPIC",
+        ])
+        .output()
+        .expect("Failed to run nvcc. Is the CUDA toolkit installed?");
+
+    if !output.status.success() {
+        let stderr = String::from_utf8_lossy(&output.stderr);
+        let stdout = String::from_utf8_lossy(&output.stdout);
+        let _ = std::fs::remove_file(&so_path);
+        panic!(
+            "FlashInfer JIT compilation failed (HEAD_DIM={}, arch={}):\nstdout: {}\nstderr: {}",
+            head_dim, arch, stdout, stderr
+        );
+    }
+
+    let elapsed = start.elapsed();
+    eprintln!(
+        "FlashInfer: compiled in {:.1}s → {}",
+        elapsed.as_secs_f64(),
+        so_path.display()
+    );
+
+    so_path
+}
+
+/// Returns ~/.cache/luminal/flashinfer/
+fn cache_directory() -> PathBuf {
+    let home = std::env::var("HOME").unwrap_or_else(|_| "/tmp".to_string());
+    PathBuf::from(home)
+        .join(".cache")
+        .join("luminal")
+        .join("flashinfer")
+}
+
+/// Drop the embedded wrapper.cu/wrapper.h into the cache dir so nvcc has files
+/// on disk to compile. Returns (wrapper.cu path, directory containing wrapper.h).
+fn extract_wrapper_sources(cache_dir: &Path) -> (PathBuf, PathBuf) {
+    let cu = cache_dir.join("wrapper.cu");
+    let h = cache_dir.join("wrapper.h");
+    write_if_changed(&cu, WRAPPER_CU.as_bytes());
+    write_if_changed(&h, WRAPPER_H.as_bytes());
+    (cu, cache_dir.to_path_buf())
+}
+
+fn write_if_changed(path: &Path, contents: &[u8]) {
+    if let Ok(existing) = std::fs::read(path)
+        && existing == contents
+    {
+        return;
+    }
+    std::fs::write(path, contents).unwrap_or_else(|e| {
+        panic!(
+            "FlashInfer: failed to write wrapper source to {}: {e}",
+            path.display()
+        )
+    });
+}
+
+fn wrapper_source_hash() -> u64 {
+    let mut hasher = std::collections::hash_map::DefaultHasher::new();
+    WRAPPER_CU.hash(&mut hasher);
+    WRAPPER_H.hash(&mut hasher);
+    hasher.finish()
+}
+
+// ── Pinned FlashInfer source ──
+//
+// Bumping this constant invalidates the cached source tree AND the cached .so
+// (the .so cache key incorporates the wrapper hash, which is rebuilt against
+// these headers, so different headers compile to a different .so file even at
+// the same head_dim). If you change `FLASHINFER_GIT_REV`, also re-check
+// `wrapper.cu` against the new FlashInfer API.
+
+const FLASHINFER_GIT_URL: &str = "https://github.com/flashinfer-ai/flashinfer.git";
+const CUTLASS_GIT_URL: &str = "https://github.com/NVIDIA/cutlass.git";
+const FLASHINFER_GIT_REV: &str = "f1e6fdcb8f65104047697f022b5d055ef022d763";
+const CUTLASS_GIT_REV: &str = "f3fde58372d33e9a5650ba7b80fc48b3b49d40c8";
+
+fn locate_flashinfer_includes() -> Option<(PathBuf, PathBuf)> {
+    if let Ok(path) = std::env::var("LUMINAL_FLASHINFER_DIR")
+        && !path.is_empty()
+    {
+        let root = PathBuf::from(path);
+        let inc = root.join("include");
+        let cutlass = root.join("3rdparty/cutlass/include");
+        if inc.exists() && cutlass.exists() {
+            return Some((inc, cutlass));
+        }
+        eprintln!(
+            "FlashInfer: LUMINAL_FLASHINFER_DIR={} did not contain include/ and \
+             3rdparty/cutlass/include/ — falling back to default locations",
+            root.display()
+        );
+    }
+
+    let home = std::env::var("HOME").unwrap_or_default();
+    let candidates = [
+        PathBuf::from(&home).join("luminal_cuda/crates/luminal_cuda/flashinfer"),
+        PathBuf::from(&home).join("luminal_cuda/flashinfer"),
+        PathBuf::from("/opt/luminal_cuda/crates/luminal_cuda/flashinfer"),
+    ];
+    for root in candidates {
+        let inc = root.join("include");
+        let cutlass = root.join("3rdparty/cutlass/include");
+        if inc.exists() && cutlass.exists() {
+            return Some((inc, cutlass));
+        }
+    }
+
+    // Last resort: fetch the pinned commit into the cache directory.
+    fetch_flashinfer_source().ok().map(|root| {
+        let inc = root.join("include");
+        let cutlass = root.join("3rdparty/cutlass/include");
+        (inc, cutlass)
+    })
+}
+
+/// Clone FlashInfer at `FLASHINFER_GIT_REV` + CUTLASS at `CUTLASS_GIT_REV`
+/// into `~/.cache/luminal/flashinfer-src/<short_rev>/` if absent, then return
+/// the FlashInfer root directory. ~50 MB one-time download; subsequent calls
+/// short-circuit on the directory check.
+fn fetch_flashinfer_source() -> Result<PathBuf, String> {
+    let short = &FLASHINFER_GIT_REV[..12];
+    let cache_root = cache_directory().join("flashinfer-src").join(short);
+    let inc = cache_root.join("include");
+    let cutlass_inc = cache_root.join("3rdparty/cutlass/include");
+
+    if inc.exists() && cutlass_inc.exists() {
+        return Ok(cache_root);
+    }
+
+    let parent = cache_root.parent().unwrap();
+    std::fs::create_dir_all(parent)
+        .map_err(|e| format!("failed to create {}: {e}", parent.display()))?;
+
+    // Clone into a staging dir, then atomic rename. Protects against multiple
+    // processes racing to fetch the same source.
+    let staging = parent.join(format!(".staging-{}-{}", short, std::process::id()));
+    let _ = std::fs::remove_dir_all(&staging);
+
+    eprintln!(
+        "FlashInfer: cloning {FLASHINFER_GIT_URL} @ {short} into {} (one-time fetch, ~50 MB) …",
+        cache_root.display()
+    );
+
+    run_git(&[
+        "clone",
+        "--filter=blob:none",
+        "--no-checkout",
+        FLASHINFER_GIT_URL,
+        staging.to_str().unwrap(),
+    ])?;
+    run_git_in(&staging, &["checkout", FLASHINFER_GIT_REV])?;
+
+    // Init only the CUTLASS submodule (skip spdlog — we don't need it for kernels).
+    let cutlass_path = staging.join("3rdparty/cutlass");
+    let _ = std::fs::remove_dir_all(&cutlass_path);
+    run_git(&[
+        "clone",
+        "--filter=blob:none",
+        "--no-checkout",
+        CUTLASS_GIT_URL,
+        cutlass_path.to_str().unwrap(),
+    ])?;
+    run_git_in(&cutlass_path, &["checkout", CUTLASS_GIT_REV])?;
+
+    if !staging.join("include").exists() {
+        return Err(format!(
+            "FlashInfer clone succeeded but include/ missing at {}",
+            staging.display()
+        ));
+    }
+    if !staging.join("3rdparty/cutlass/include").exists() {
+        return Err(format!(
+            "CUTLASS clone succeeded but include/ missing at {}",
+            staging.join("3rdparty/cutlass").display()
+        ));
+    }
+
+    // Atomic-ish rename. If another process beat us to it, just keep theirs.
+    match std::fs::rename(&staging, &cache_root) {
+        Ok(()) => {}
+        Err(_) if cache_root.exists() => {
+            let _ = std::fs::remove_dir_all(&staging);
+        }
+        Err(e) => return Err(format!("rename to {} failed: {e}", cache_root.display())),
+    }
+
+    Ok(cache_root)
+}
+
+fn run_git(args: &[&str]) -> Result<(), String> {
+    let out = Command::new("git")
+        .args(args)
+        .output()
+        .map_err(|e| format!("failed to spawn `git`: {e}. Is git installed?"))?;
+    if !out.status.success() {
+        return Err(format!(
+            "`git {}` failed: {}",
+            args.join(" "),
+            String::from_utf8_lossy(&out.stderr)
+        ));
+    }
+    Ok(())
+}
+
+fn run_git_in(cwd: &Path, args: &[&str]) -> Result<(), String> {
+    let out = Command::new("git")
+        .args(args)
+        .current_dir(cwd)
+        .output()
+        .map_err(|e| format!("failed to spawn `git`: {e}"))?;
+    if !out.status.success() {
+        return Err(format!(
+            "`git {}` in {} failed: {}",
+            args.join(" "),
+            cwd.display(),
+            String::from_utf8_lossy(&out.stderr)
+        ));
+    }
+    Ok(())
+}
+
+/// Detect CUDA arch via env override → nvidia-smi → default sm_80.
+fn detect_cuda_arch() -> String {
+    if let Ok(arch) = std::env::var("FLASHINFER_CUDA_ARCH") {
+        return arch;
+    }
+
+    if let Ok(output) = Command::new("nvidia-smi")
+        .args(["--query-gpu=compute_cap", "--format=csv,noheader"])
+        .output()
+        && output.status.success()
+    {
+        let cap = String::from_utf8_lossy(&output.stdout);
+        let cap = cap.trim().lines().next().unwrap_or("8.0");
+        let sm = cap.replace('.', "");
+        if !sm.is_empty() {
+            return format!("sm_{}", sm);
+        }
+    }
+
+    "sm_80".to_string()
+}
--- a/crates/luminal_cuda_lite/src/host/flashinfer/mod.rs
+++ b/crates/luminal_cuda_lite/src/host/flashinfer/mod.rs
@@ -0,0 +1,424 @@
+pub mod find_indptrs;
+pub mod jit;
+
+use std::sync::{Arc, Mutex, OnceLock};
+
+use luminal::{
+    egglog_utils::{
+        api::{Rule, SortDef, sort},
+        base::{EXPRESSION, OP_KIND},
+        extract_expr,
+    },
+    op::{EgglogOp, LLIROp},
+    prelude::{
+        tracing::{Level, span},
+        *,
+    },
+};
+
+use crate::{
+    cudarc::driver::{CudaSlice, CudaStream, DevicePtr, result},
+    host::{DeviceBuffer, HostOp},
+};
+
+/// FlashInfer attention op (batch decode, fp32).
+///
+/// Replaces the full paged-GQA attention pattern (gather → broadcast → Q*K^T →
+/// scale → mask → softmax → *V) with a single FlashInfer fused kernel.
+///
+/// Graph inputs (7): Q, K_pool, V_pool, flat_gather_idx, mask, qo_indptr, kv_indptr.
+/// The egglog rule captures the first 5; `extract()` appends qo/kv indptrs after
+/// walking the e-graph from the mask. `batch_size` is derived at runtime from the
+/// indptr length (= num_sequences + 1).
+#[derive(Debug)]
+pub struct FlashInferAttention {
+    pub num_qo_heads: usize,
+    pub num_kv_heads: usize,
+    pub head_dim: usize,
+    pub page_size: usize,
+    pub batch_dim: Expression,
+
+    pub plan_info: Mutex<Vec<i64>>,
+}
+
+// SAFETY: PAGE_LOCKED_WORKSPACE holds a raw pointer to page-locked CUDA memory
+// allocated once and serialized via the CUDA stream that owns it.
+unsafe impl Send for FlashInferAttention {}
+unsafe impl Sync for FlashInferAttention {}
+
+const FLOAT_WORKSPACE_SIZE: usize = 128 * 1024 * 1024; // 128 MiB
+const INT_WORKSPACE_SIZE: usize = 8 * 1024 * 1024; // 8 MiB
+
+static PAGE_LOCKED_WORKSPACE: OnceLock<PageLockedPtr> = OnceLock::new();
+
+struct PageLockedPtr(*mut u8);
+
+// SAFETY: The pointer is page-locked CUDA memory allocated once via
+// posix_memalign + cudaHostRegister and only mutated during OnceLock
+// initialization.
+unsafe impl Send for PageLockedPtr {}
+unsafe impl Sync for PageLockedPtr {}
+
+impl std::fmt::Debug for PageLockedPtr {
+    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
+        write!(f, "PageLockedPtr({:p})", self.0)
+    }
+}
+
+impl Default for FlashInferAttention {
+    fn default() -> Self {
+        Self {
+            num_qo_heads: 0,
+            num_kv_heads: 0,
+            head_dim: 0,
+            page_size: 0,
+            batch_dim: Expression::default(),
+            plan_info: Mutex::new(Vec::new()),
+        }
+    }
+}
+
+impl EgglogOp for FlashInferAttention {
+    fn sort(&self) -> SortDef {
+        sort(
+            OP_KIND,
+            "FlashInferAttention",
+            &[
+                ("num_qo_heads", EXPRESSION),
+                ("num_kv_heads", EXPRESSION),
+                ("head_dim", EXPRESSION),
+                ("page_size", EXPRESSION),
+                ("batch_dim", EXPRESSION),
+            ],
+        )
+    }
+
+    fn n_inputs(&self) -> usize {
+        // Q, K_pool, V_pool, flat_gather_idx, mask (egglog IList).
+        // extract() appends qo_indptr + kv_indptr → 7 actual inputs at runtime.
+        5
+    }
+
+    fn rewrites(&self) -> Vec<Rule> {
+        vec![Rule::raw(include_str!["flashinfer_attention.egg"])]
+    }
+
+    fn extract<'a>(
+        &'a self,
+        egraph: &'a luminal::egglog_utils::SerializedEGraph,
+        kind_children: &[&'a ENodeId],
+        input_enodes: Vec<&'a ENodeId>,
+        _list_cache: &mut FxHashMap<&'a ENodeId, Vec<Expression>>,
+        expr_cache: &mut FxHashMap<&'a ENodeId, Expression>,
+    ) -> (LLIROp, Vec<&'a ENodeId>) {
+        let num_qo_heads = extract_expr(egraph, kind_children[0], expr_cache)
+            .unwrap()
+            .exec(&FxHashMap::default())
+            .unwrap();
+        let num_kv_heads = extract_expr(egraph, kind_children[1], expr_cache)
+            .unwrap()
+            .exec(&FxHashMap::default())
+            .unwrap();
+        let head_dim = extract_expr(egraph, kind_children[2], expr_cache)
+            .unwrap()
+            .exec(&FxHashMap::default())
+            .unwrap();
+        let page_size = extract_expr(egraph, kind_children[3], expr_cache)
+            .unwrap()
+            .exec(&FxHashMap::default())
+            .unwrap();
+        let batch_dim = extract_expr(egraph, kind_children[4], expr_cache).unwrap();
+
+        let extracted = Self {
+            num_qo_heads,
+            num_kv_heads,
+            head_dim,
+            page_size,
+            batch_dim,
+            plan_info: Mutex::new(Vec::new()),
+        };
+
+        // Trigger JIT compilation (or .so cache hit) at extract time, not at
+        // first execute. Pays the ~30s cold-cache nvcc cost during compile
+        // rather than during the GA profiling loop, where it would dominate
+        // the candidate's measured runtime and make the GA reject FlashInfer.
+        let _ = jit::ensure_compiled(head_dim);
+
+        // Walk the mask e-graph chain to recover qo_indptr / kv_indptr Input nodes.
+        // input_enodes: [Q, K_cache, V_cache, gather_idx, mask]
+        let mask_node = input_enodes[4];
+        let indptrs = find_indptrs::find_indptr_inputs(egraph, mask_node);
+
+        // Build final inputs: [Q, K_cache, V_cache, gather_idx, mask, qo_indptr, kv_indptr]
+        let mut final_inputs = input_enodes;
+        final_inputs.push(indptrs.qo_indptr);
+        final_inputs.push(indptrs.kv_indptr);
+
+        let op = LLIROp::new::<dyn HostOp>(Box::new(extracted) as Box<dyn HostOp>);
+        (op, final_inputs)
+    }
+
+    fn cleanup(&self) -> bool {
+        false
+    }
+}
+
+impl HostOp for FlashInferAttention {
+    fn execute(
+        &self,
+        stream: &Arc<CudaStream>,
+        self_node: NodeIndex,
+        inputs: &[NodeIndex],
+        buffers: &FxHashMap<NodeIndex, DeviceBuffer>,
+        dyn_map: &FxHashMap<char, usize>,
+    ) -> anyhow::Result<()> {
+        let lib = jit::ensure_compiled(self.head_dim);
+
+        let total_q_tokens = self
+            .batch_dim
+            .exec(dyn_map)
+            .ok_or_else(|| anyhow::anyhow!("FlashInferAttention batch_dim is unresolved"))?;
+        let c = *dyn_map
+            .get(&'c')
+            .ok_or_else(|| anyhow::anyhow!("FlashInferAttention requires dynamic dim 'c'"))?;
+        let r = *dyn_map
+            .get(&'r')
+            .ok_or_else(|| anyhow::anyhow!("FlashInferAttention requires dynamic dim 'r'"))?;
+
+        if inputs.len() < 7 {
+            anyhow::bail!(
+                "FlashInferAttention expects 7 inputs (Q, K, V, flat_idx, mask, qo_indptr, kv_indptr), got {}",
+                inputs.len()
+            );
+        }
+
+        let get_buf = |name: &str, node: NodeIndex| -> anyhow::Result<DeviceBuffer> {
+            buffers.get(&node).copied().ok_or_else(|| {
+                anyhow::anyhow!("FlashInferAttention missing {name} buffer for {node:?}")
+            })
+        };
+
+        let q_buf = get_buf("Q", inputs[0])?;
+        let k_buf = get_buf("K_cache", inputs[1])?;
+        let v_buf = get_buf("V_cache", inputs[2])?;
+        let flat_idx_buf = get_buf("flat_gather_idx", inputs[3])?;
+        // inputs[4] = mask (unused by FlashInfer — indptrs replace it)
+        let kv_indptr_buf = get_buf("kv_indptr", inputs[6])?;
+        let out_buf = get_buf("output", self_node)?;
+
+        // Derive batch_size (num sequences) from r = indptr length.
+        let batch_size = r.saturating_sub(1);
+
+        let _span = span!(
+            Level::TRACE,
+            "FlashInferAttention",
+            total_q_tokens,
+            batch_size,
+            self.num_qo_heads,
+            self.num_kv_heads,
+            self.head_dim,
+        )
+        .entered();
+
+        let kv_dim = self.num_kv_heads * self.head_dim;
+        let cu_stream = stream.cu_stream() as *mut std::ffi::c_void;
+
+        // Extract slot indices (one per context page) from the flat gather index.
+        let indices_buf = unsafe { stream.alloc::<u8>(c.max(1) * std::mem::size_of::<i32>())? };
+        let (indices_ptr, _idx_guard) = indices_buf.device_ptr(stream);
+
+        if c > 0 {
+            unsafe {
+                (lib.extract_slot_indices)(
+                    flat_idx_buf.ptr() as *const i32,
+                    indices_ptr as *mut i32,
+                    c as i32,
+                    kv_dim as i32,
+                    cu_stream,
+                );
+            }
+        }
+
+        // Read kv_indptr to host for the plan phase.
+        let kv_indptr_bytes = r * 4;
+        let mut kv_indptr_host_bytes = vec![0u8; kv_indptr_bytes];
+        unsafe {
+            result::memcpy_dtoh_async(
+                &mut kv_indptr_host_bytes,
+                kv_indptr_buf.ptr(),
+                stream.cu_stream(),
+            )?;
+        }
+        stream.synchronize()?;
+        let kv_indptr_host: Vec<i32> = unsafe {
+            let mut v = std::mem::ManuallyDrop::new(kv_indptr_host_bytes);
+            Vec::from_raw_parts(v.as_mut_ptr() as *mut i32, r, r)
+        };
+
+        // kv_last_page_len = [1; batch_size] when page_size=1.
+        let last_page_host: Vec<i32> = vec![1; batch_size];
+        let last_page_dev: CudaSlice<u8> = if batch_size > 0 {
+            stream.clone_htod(unsafe {
+                std::slice::from_raw_parts(
+                    last_page_host.as_ptr() as *const u8,
+                    last_page_host.len() * std::mem::size_of::<i32>(),
+                )
+            })?
+        } else {
+            unsafe { stream.alloc::<u8>(1)? }
+        };
+        let (last_page_ptr, _lp_guard) = last_page_dev.device_ptr(stream);
+
+        // Global shared workspaces (allocated once across all op instances to
+        // amortize the ~4ms first-allocation cost during GA profiling).
+        static FLOAT_WORKSPACE: OnceLock<CudaSlice<u8>> = OnceLock::new();
+        static INT_WORKSPACE: OnceLock<CudaSlice<u8>> = OnceLock::new();
+        let float_ws = FLOAT_WORKSPACE
+            .get_or_init(|| unsafe { stream.alloc::<u8>(FLOAT_WORKSPACE_SIZE).unwrap() });
+        let int_ws = INT_WORKSPACE
+            .get_or_init(|| unsafe { stream.alloc::<u8>(INT_WORKSPACE_SIZE).unwrap() });
+        let page_locked_ws = PAGE_LOCKED_WORKSPACE.get_or_init(|| unsafe {
+            let mut ptr: *mut std::ffi::c_void = std::ptr::null_mut();
+            let status = libc::posix_memalign(&mut ptr, 4096, INT_WORKSPACE_SIZE);
+            assert_eq!(status, 0, "Failed to allocate page-locked workspace");
+            let cuda_status = cuda_pin_memory(ptr, INT_WORKSPACE_SIZE);
+            assert_eq!(cuda_status, 0, "Failed to pin memory");
+            PageLockedPtr(ptr as *mut u8)
+        });
+
+        let (float_ws_ptr, _fws_guard) = float_ws.device_ptr(stream);
+        let (int_ws_ptr, _iws_guard) = int_ws.device_ptr(stream);
+
+        // FlashInfer decode writes (total_q_tokens, heads, dim);
+        // luminal expects (heads, total_q_tokens, dim) — transpose at the end.
+        let output_elems = total_q_tokens * self.num_qo_heads * self.head_dim;
+        let temp_out_buf =
+            unsafe { stream.alloc::<u8>(output_elems * std::mem::size_of::<f32>())? };
+        let (temp_out_ptr, _tmp_guard) = temp_out_buf.device_ptr(stream);
+
+        // PrefillPlanInfo has 15 entries, DecodePlanInfo fewer — 16 is enough.
+        let mut plan_info_buf = [0i64; 16];
+        let mut plan_info_len: i32 = 0;
+
+        // ── BatchDecode path ──
+        // Prefill kernels require fp16/bf16 tensor-core MMA; the C API returns -1
+        // when called from the fp32 pipeline. We only use decode here.
+        let plan_ret = unsafe {
+            (lib.plan)(
+                float_ws_ptr as *mut std::ffi::c_void,
+                FLOAT_WORKSPACE_SIZE,
+                int_ws_ptr as *mut std::ffi::c_void,
+                INT_WORKSPACE_SIZE,
+                page_locked_ws.0 as *mut std::ffi::c_void,
+                kv_indptr_host.as_ptr() as *mut i32,
+                batch_size as i32,
+                self.num_qo_heads as i32,
+                self.num_kv_heads as i32,
+                self.page_size as i32,
+                self.head_dim as i32,
+                cu_stream,
+                plan_info_buf.as_mut_ptr(),
+                &mut plan_info_len,
+            )
+        };
+        if plan_ret != 0 {
+            return Err(anyhow::anyhow!(
+                "FlashInfer decode plan failed with error code {plan_ret}"
+            ));
+        }
+
+        let mut plan_info = self.plan_info.lock().unwrap();
+        plan_info.clear();
+        plan_info.extend_from_slice(&plan_info_buf[..plan_info_len as usize]);
+
+        let run_ret = unsafe {
+            (lib.run)(
+                float_ws_ptr as *mut std::ffi::c_void,
+                FLOAT_WORKSPACE_SIZE,
+                int_ws_ptr as *mut std::ffi::c_void,
+                plan_info.as_mut_ptr(),
+                plan_info.len() as i32,
+                q_buf.ptr() as *mut f32,
+                k_buf.ptr() as *mut f32,
+                v_buf.ptr() as *mut f32,
+                kv_indptr_buf.ptr() as *mut i32,
+                indices_ptr as *mut i32,
+                last_page_ptr as *mut i32,
+                temp_out_ptr as *mut f32,
+                batch_size as i32,
+                self.num_qo_heads as i32,
+                self.num_kv_heads as i32,
+                self.page_size as i32,
+                self.head_dim as i32,
+                cu_stream,
+            )
+        };
+        drop(plan_info);
+
+        if run_ret != 0 {
+            return Err(anyhow::anyhow!(
+                "FlashInfer decode run failed with error code {run_ret}"
+            ));
+        }
+
+        // Transpose (total_q_tokens, heads, dim) → (heads, total_q_tokens, dim)
+        unsafe {
+            (lib.transpose_output)(
+                temp_out_ptr as *const f32,
+                out_buf.ptr() as *mut f32,
+                total_q_tokens as i32,
+                self.num_qo_heads as i32,
+                self.head_dim as i32,
+                cu_stream,
+            );
+        }
+
+        Ok(())
+    }
+
+    fn output_size(&self) -> Expression {
+        self.batch_dim * self.num_qo_heads * self.head_dim
+    }
+
+    fn output_bytes(&self) -> Expression {
+        self.output_size() * 4
+    }
+
+    fn stats_name(&self) -> Option<&'static str> {
+        Some("FlashInferAttention")
+    }
+}
+
+/// Pin host memory for CUDA async memcpy.
+///
+/// `cudaHostRegister` lives in libcudart, which cudarc doesn't link to our
+/// binary. Resolve it via `dlopen`/`dlsym` so we don't need a build script or
+/// a `#[link]` directive — keeping the crate buildable without any nvcc-side
+/// dependencies.
+unsafe fn cuda_pin_memory(ptr: *mut std::ffi::c_void, size: usize) -> i32 {
+    type HostRegisterFn = unsafe extern "C" fn(*mut std::ffi::c_void, usize, u32) -> i32;
+    static FN: OnceLock<usize> = OnceLock::new();
+
+    let raw = *FN.get_or_init(|| unsafe {
+        let lib = [
+            "libcudart.so",
+            "libcudart.so.13",
+            "libcudart.so.12",
+            "libcudart.so.11",
+        ]
+        .iter()
+        .find_map(|n| libloading::Library::new(*n).ok())
+        .expect("FlashInfer: could not dlopen libcudart for cudaHostRegister");
+        let sym: libloading::Symbol<HostRegisterFn> = lib
+            .get(b"cudaHostRegister\0")
+            .expect("FlashInfer: libcudart missing cudaHostRegister symbol");
+        let ptr = *sym as *const () as usize;
+        // Keep libcudart resident for the process lifetime so the function
+        // pointer remains valid.
+        std::mem::forget(lib);
+        ptr
+    });
+    let f: HostRegisterFn = unsafe { std::mem::transmute(raw) };
+    // cudaHostRegisterDefault = 0
+    unsafe { f(ptr, size, 0) }
+}
--- a/crates/luminal_cuda_lite/src/host/flashinfer/wrapper.cu
+++ b/crates/luminal_cuda_lite/src/host/flashinfer/wrapper.cu
@@ -0,0 +1,357 @@
+// FlashInfer batch decode + prefill wrapper for luminal_cuda.
+// JIT-compiled at runtime with -DLUMINAL_HEAD_DIM=N.
+//
+// Decode: instantiated for f32 (scalar vectorized dot products, no tensor cores).
+// Prefill: instantiated for f16 (requires tensor core MMA + ldmatrix).
+//   The C API accepts fp32 buffers; cast kernels convert fp32↔fp16 at the boundary.
+//
+// NHD layout. GQA group_size and page_size are runtime parameters.
+
+#ifndef LUMINAL_HEAD_DIM
+#error "LUMINAL_HEAD_DIM must be defined (e.g. -DLUMINAL_HEAD_DIM=128)"
+#endif
+
+// Include utils.cuh first to get the original DISPATCH_HEAD_DIM, then override it
+// to only instantiate our specific HEAD_DIM. This avoids a compile error in
+// cascade.cuh where HEAD_DIM=512 + f32 triggers vec_size=16, vec_bits=512
+// which exceeds cp_async's 256-bit limit.
+#include <flashinfer/utils.cuh>
+#undef DISPATCH_HEAD_DIM
+#define DISPATCH_HEAD_DIM(head_dim, HEAD_DIM, ...)  \
+  {                                                  \
+    constexpr size_t HEAD_DIM = LUMINAL_HEAD_DIM;    \
+    __VA_ARGS__                                      \
+  }
+
+#include <flashinfer/attention/scheduler.cuh>
+#include <flashinfer/attention/decode.cuh>
+#include <flashinfer/attention/default_decode_params.cuh>
+#include <flashinfer/attention/prefill.cuh>
+#include <flashinfer/attention/default_prefill_params.cuh>
+#include <flashinfer/attention/mask.cuh>
+#include <flashinfer/attention/variants.cuh>
+#include <flashinfer/page.cuh>
+#include <flashinfer/pos_enc.cuh>
+
+#include "wrapper.h"
+
+#include <cstring>
+#include <vector>
+#include <cuda_fp16.h>
+
+using namespace flashinfer;
+
+// ── Decode types (f32) ──
+using DTypeQ = float;
+using DTypeKV = float;
+using DTypeO = float;
+using IdType = int32_t;
+
+// ── Prefill types (f16 compute, fp32 external interface) ──
+using PrefillDTypeQ = half;
+using PrefillDTypeKV = half;
+using PrefillDTypeO = half;
+
+constexpr uint32_t HEAD_DIM = LUMINAL_HEAD_DIM;
+constexpr PosEncodingMode POS_ENCODING_MODE = PosEncodingMode::kNone;
+
+// Attention variants
+using Variant = DefaultAttention</*use_custom_mask=*/false,
+                                  /*use_sliding_window=*/false,
+                                  /*use_logits_soft_cap=*/false,
+                                  /*use_alibi=*/false>;
+
+using CausalVariant = DefaultAttention</*use_custom_mask=*/false,
+                                        /*use_sliding_window=*/false,
+                                        /*use_logits_soft_cap=*/false,
+                                        /*use_alibi=*/false>;
+
+// Decode params (f32)
+using DecodeParams = BatchDecodeParams<DTypeQ, DTypeKV, DTypeO, IdType>;
+
+// Prefill params (f16)
+using PrefillParams = BatchPrefillPagedParams<PrefillDTypeQ, PrefillDTypeKV, PrefillDTypeO, IdType>;
+
+// Forward declarations
+namespace flashinfer {
+template <uint32_t HEAD_DIM, PosEncodingMode POS_ENCODING_MODE, typename AttentionVariant,
+          typename Params>
+cudaError_t BatchDecodeWithPagedKVCacheDispatched(Params params, typename Params::DTypeO* tmp_v,
+                                                   float* tmp_s, bool enable_pdl,
+                                                   cudaStream_t stream);
+
+template <uint32_t CTA_TILE_Q, uint32_t HEAD_DIM_QK, uint32_t HEAD_DIM_VO,
+          PosEncodingMode POS_ENCODING_MODE, bool USE_FP16_QK_REDUCTION,
+          MaskMode MASK_MODE, typename AttentionVariant, typename Params>
+cudaError_t BatchPrefillWithPagedKVCacheDispatched(Params params, typename Params::DTypeO* tmp_v,
+                                                    float* tmp_s, bool enable_pdl,
+                                                    cudaStream_t stream);
+}
+
+// Explicit instantiation: decode kernel (f32)
+template cudaError_t flashinfer::BatchDecodeWithPagedKVCacheDispatched<
+    HEAD_DIM, POS_ENCODING_MODE, Variant, DecodeParams>(
+    DecodeParams params, DTypeO* tmp_v, float* tmp_s, bool enable_pdl, cudaStream_t stream);
+
+// Explicit instantiation: prefill kernels (f16, causal mask, CTA_TILE_Q=16/64/128)
+template cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<
+    16, HEAD_DIM, HEAD_DIM, POS_ENCODING_MODE, false, MaskMode::kCausal, CausalVariant, PrefillParams>(
+    PrefillParams params, PrefillDTypeO* tmp_v, float* tmp_s, bool enable_pdl, cudaStream_t stream);
+
+template cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<
+    64, HEAD_DIM, HEAD_DIM, POS_ENCODING_MODE, false, MaskMode::kCausal, CausalVariant, PrefillParams>(
+    PrefillParams params, PrefillDTypeO* tmp_v, float* tmp_s, bool enable_pdl, cudaStream_t stream);
+
+template cudaError_t flashinfer::BatchPrefillWithPagedKVCacheDispatched<
+    128, HEAD_DIM, HEAD_DIM, POS_ENCODING_MODE, false, MaskMode::kCausal, CausalVariant, PrefillParams>(
+    PrefillParams params, PrefillDTypeO* tmp_v, float* tmp_s, bool enable_pdl, cudaStream_t stream);
+
+// ── fp32 ↔ fp16 cast kernels ──
+
+__global__ void cast_f32_to_f16_kernel(const float* src, half* dst, size_t n) {
+    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
+    if (i < n) dst[i] = __float2half(src[i]);
+}
+
+__global__ void cast_f16_to_f32_kernel(const half* src, float* dst, size_t n) {
+    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
+    if (i < n) dst[i] = __half2float(src[i]);
+}
+
+extern "C" {
+
+int flashinfer_batch_decode_plan(
+    void* float_workspace, size_t float_ws_size,
+    void* int_workspace, size_t int_ws_size,
+    void* page_locked_int_workspace,
+    int32_t* indptr_h, int batch_size,
+    int num_qo_heads, int num_kv_heads, int page_size, int head_dim,
+    cudaStream_t stream,
+    int64_t* plan_info_out, int* plan_info_len_out)
+{
+    (void)head_dim; // fixed at compile time
+
+    DecodePlanInfo plan_info;
+    uint32_t group_size = num_qo_heads / num_kv_heads;
+
+    // We need to dispatch on GROUP_SIZE to get the right work estimation function
+    cudaError_t status = cudaSuccess;
+
+    // Use a lambda to dispatch on group size
+    auto do_plan = [&]<uint32_t GROUP_SIZE>() -> cudaError_t {
+        auto work_estimation_func =
+            BatchDecodeWithPagedKVCacheWorkEstimationDispatched<
+                GROUP_SIZE, HEAD_DIM, POS_ENCODING_MODE, Variant, DecodeParams>;
+        return DecodePlan<HEAD_DIM, POS_ENCODING_MODE, Variant, DecodeParams>(
+            float_workspace, float_ws_size,
+            int_workspace, page_locked_int_workspace,
+            int_ws_size, plan_info, indptr_h,
+            (uint32_t)batch_size, (uint32_t)num_qo_heads,
+            (uint32_t)page_size, /*enable_cuda_graph=*/false,
+            stream, work_estimation_func);
+    };
+
+    switch (group_size) {
+        case 1:  status = do_plan.operator()<1>();  break;
+        case 2:  status = do_plan.operator()<2>();  break;
+        case 4:  status = do_plan.operator()<4>();  break;
+        case 8:  status = do_plan.operator()<8>();  break;
+        default: return -1; // unsupported group size
+    }
+
+    if (status != cudaSuccess) return (int)status;
+
+    auto vec = plan_info.ToVector();
+    *plan_info_len_out = (int)vec.size();
+    std::memcpy(plan_info_out, vec.data(), vec.size() * sizeof(int64_t));
+    return 0;
+}
+
+int flashinfer_batch_decode_run(
+    void* float_workspace, size_t float_ws_size,
+    void* int_workspace,
+    int64_t* plan_info_vec, int plan_info_len,
+    float* q,
+    float* k_cache,
+    float* v_cache,
+    int32_t* kv_indptr,
+    int32_t* kv_indices,
+    int32_t* kv_last_page_len,
+    float* output,
+    int batch_size,
+    int num_qo_heads, int num_kv_heads, int page_size, int head_dim,
+    cudaStream_t stream)
+{
+    (void)head_dim; // fixed at compile time
+
+    DecodePlanInfo plan_info;
+    plan_info.FromVector(std::vector<int64_t>(plan_info_vec, plan_info_vec + plan_info_len));
+
+    // Construct paged_kv_t with NHD layout
+    paged_kv_t<DTypeKV, IdType> paged_kv(
+        (uint32_t)num_kv_heads,
+        (uint32_t)page_size,
+        HEAD_DIM,
+        (uint32_t)batch_size,
+        QKVLayout::kNHD,
+        k_cache,
+        v_cache,
+        kv_indices,
+        kv_indptr,
+        kv_last_page_len);
+
+    DecodeParams params;
+    params.q = q;
+    params.q_rope_offset = nullptr;
+    params.paged_kv = paged_kv;
+    params.o = output;
+    params.lse = nullptr;
+    params.maybe_alibi_slopes = nullptr;
+    params.padded_batch_size = plan_info.padded_batch_size;
+    params.num_qo_heads = (uint32_t)num_qo_heads;
+    // Q buffer is (batch, num_qo_heads * head_dim) flat — the graph's split_dims + transpose
+    // are stride tricks, no data movement. So the actual memory layout is (batch, heads, dim).
+    params.q_stride_n = num_qo_heads * HEAD_DIM;
+    params.q_stride_h = HEAD_DIM;
+    params.window_left = -1; // no sliding window
+    params.logits_soft_cap = 0.0f;
+    params.sm_scale = 1.0f / sqrtf((float)HEAD_DIM);
+    params.rope_rcp_scale = 1.0f;
+    params.rope_rcp_theta = 1.0f;
+
+    // Set plan info pointers
+    params.request_indices =
+        GetPtrFromBaseOffset<IdType>(int_workspace, plan_info.request_indices_offset);
+    params.kv_tile_indices =
+        GetPtrFromBaseOffset<IdType>(int_workspace, plan_info.kv_tile_indices_offset);
+    params.o_indptr =
+        GetPtrFromBaseOffset<IdType>(int_workspace, plan_info.o_indptr_offset);
+    params.kv_chunk_size_ptr =
+        GetPtrFromBaseOffset<IdType>(int_workspace, plan_info.kv_chunk_size_ptr_offset);
+    params.block_valid_mask = nullptr;
+    params.partition_kv = false;
+
+    DTypeO* tmp_v = nullptr;
+    float* tmp_s = nullptr;
+
+    if (plan_info.split_kv) {
+        tmp_v = GetPtrFromBaseOffset<DTypeO>(float_workspace, plan_info.v_offset);
+        tmp_s = GetPtrFromBaseOffset<float>(float_workspace, plan_info.s_offset);
+        if (plan_info.enable_cuda_graph) {
+            params.block_valid_mask =
+                GetPtrFromBaseOffset<bool>(int_workspace, plan_info.block_valid_mask_offset);
+        }
+    }
+
+    cudaError_t status =
+        flashinfer::BatchDecodeWithPagedKVCacheDispatched<HEAD_DIM, POS_ENCODING_MODE, Variant>(
+            params, tmp_v, tmp_s, /*enable_pdl=*/false, stream);
+
+    return (int)status;
+}
+
+// ═══════════════════════════════════════════════════════════
+// BatchPrefill (fp16/bf16 only — tensor core MMA requires 16-bit inputs)
+// ═══════════════════════════════════════════════════════════
+//
+// The prefill kernel templates are instantiated above for fp16. These C API
+// functions accept fp32 pointers (matching the current luminal pipeline) but
+// return -1 to indicate that fp32 prefill is not supported. When native fp16
+// support is added, these will accept fp16 pointers and call through to the
+// instantiated templates.
+
+int flashinfer_batch_prefill_plan(
+    void*, size_t, void*, size_t, void*,
+    int32_t*, int32_t*, int, int,
+    int, int, int, int, cudaStream_t,
+    int64_t*, int*)
+{
+    return -1; // fp32 not supported — requires fp16/bf16
+}
+
+int flashinfer_batch_prefill_run(
+    void*, size_t, void*,
+    int64_t*, int,
+    float*, float*, float*,
+    int32_t*, int32_t*, int32_t*, int32_t*,
+    float*, int, int, int, int, int, int, cudaStream_t)
+{
+    return -1; // fp32 not supported — requires fp16/bf16
+}
+
+} // extern "C"
+
+// ── Slot index extraction kernel (outside extern "C" for __global__) ──
+
+__global__ void extract_slot_indices_kernel(
+    const int32_t* flat_idx, int32_t* out, int c, int kv_dim) {
+    int i = blockIdx.x * blockDim.x + threadIdx.x;
+    if (i < c) out[i] = flat_idx[i * kv_dim] / kv_dim;
+}
+
+extern "C" void flashinfer_extract_slot_indices(
+    const int32_t* flat_idx, int32_t* out, int c, int kv_dim,
+    cudaStream_t stream) {
+    if (c == 0) return;
+    int threads = 256;
+    int blocks = (c + threads - 1) / threads;
+    extract_slot_indices_kernel<<<blocks, threads, 0, stream>>>(
+        flat_idx, out, c, kv_dim);
+}
+
+// ── Derive CSR indptr from attention mask ──
+// Mask is (s, c) f32. Entries > -1e9 are "valid" (0.0), rest are -inf.
+// Per-row count of valid entries = context length for that sequence.
+// Output: indptr[0..=s] with indptr[0]=0 and indptr[i+1] = indptr[i] + ctx_len[i].
+// Single thread is fine since s is tiny (batch_size during decode, typically 1-8).
+
+__global__ void derive_indptr_kernel(
+    const float* mask, int32_t* indptr, int s, int c) {
+    if (threadIdx.x != 0 || blockIdx.x != 0) return;
+    indptr[0] = 0;
+    for (int i = 0; i < s; i++) {
+        int count = 0;
+        for (int j = 0; j < c; j++) {
+            if (mask[i * c + j] > -1e9f) count++;
+        }
+        indptr[i + 1] = indptr[i] + count;
+    }
+}
+
+extern "C" void flashinfer_derive_indptr_from_mask(
+    const float* mask, int32_t* indptr, int s, int c,
+    cudaStream_t stream) {
+    if (s == 0) return;
+    derive_indptr_kernel<<<1, 1, 0, stream>>>(mask, indptr, s, c);
+}
+
+// ── Output transpose: (batch, heads, dim) → (heads, batch, dim) ──
+// FlashInfer writes output as (batch, heads, dim) but Luminal expects (heads, batch, dim).
+// For batch=1 these are identical; for batch>1 we need an explicit transpose.
+
+__global__ void transpose_bhd_to_hbd_kernel(
+    const float* src, float* dst, int batch, int heads, int dim) {
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    int total = batch * heads * dim;
+    if (idx >= total) return;
+
+    // Decompose linear index into (b, h, d) for src layout
+    int d = idx % dim;
+    int h = (idx / dim) % heads;
+    int b = idx / (heads * dim);
+
+    // Write to (h, b, d) layout in dst
+    dst[h * batch * dim + b * dim + d] = src[idx];
+}
+
+extern "C" void flashinfer_transpose_output(
+    const float* src, float* dst,
+    int batch, int heads, int dim,
+    cudaStream_t stream) {
+    int total = batch * heads * dim;
+    if (total == 0) return;
+    int threads = 256;
+    int blocks = (total + threads - 1) / threads;
+    transpose_bhd_to_hbd_kernel<<<blocks, threads, 0, stream>>>(
+        src, dst, batch, heads, dim);
+}
--- a/crates/luminal_cuda_lite/src/host/flashinfer/wrapper.h
+++ b/crates/luminal_cuda_lite/src/host/flashinfer/wrapper.h
@@ -0,0 +1,93 @@
+#pragma once
+
+#include <cuda_runtime.h>
+#include <stdint.h>
+#include <stddef.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+// Plan phase: CPU-side scheduling. Must call before each new batch config.
+// Returns 0 on success, non-zero on failure.
+int flashinfer_batch_decode_plan(
+    void* float_workspace, size_t float_ws_size,
+    void* int_workspace, size_t int_ws_size,
+    void* page_locked_int_workspace,
+    int32_t* indptr_h, int batch_size,
+    int num_qo_heads, int num_kv_heads, int page_size, int head_dim,
+    cudaStream_t stream,
+    int64_t* plan_info_out, int* plan_info_len_out);
+
+// Run phase: GPU kernel launch.
+// Returns 0 on success, non-zero on failure.
+int flashinfer_batch_decode_run(
+    void* float_workspace, size_t float_ws_size,
+    void* int_workspace,
+    int64_t* plan_info_vec, int plan_info_len,
+    float* q,                    // [batch_size, num_qo_heads, head_dim]
+    float* k_cache,              // [num_pages, page_size, num_kv_heads, head_dim] (NHD)
+    float* v_cache,              // same layout
+    int32_t* kv_indptr,          // [batch_size + 1]
+    int32_t* kv_indices,         // [total_pages]
+    int32_t* kv_last_page_len,   // [batch_size]
+    float* output,               // [batch_size, num_qo_heads, head_dim]
+    int batch_size,
+    int num_qo_heads, int num_kv_heads, int page_size, int head_dim,
+    cudaStream_t stream);
+
+// Extract slot indices from a flat gather index tensor.
+// flat_idx shape: (c, kv_dim) i32, out shape: (c,) i32.
+// out[i] = flat_idx[i * kv_dim] / kv_dim
+void flashinfer_extract_slot_indices(
+    const int32_t* flat_idx, int32_t* out, int c, int kv_dim,
+    cudaStream_t stream);
+
+// Derive CSR indptr from attention mask.
+// mask shape: (s, c) f32. Entries > -1e9 are valid.
+// indptr shape: (s + 1,) i32. indptr[0] = 0, indptr[i+1] = cumsum of valid counts.
+void flashinfer_derive_indptr_from_mask(
+    const float* mask, int32_t* indptr, int s, int c,
+    cudaStream_t stream);
+
+// Transpose output from (batch, heads, dim) to (heads, batch, dim).
+void flashinfer_transpose_output(
+    const float* src, float* dst,
+    int batch, int heads, int dim,
+    cudaStream_t stream);
+
+// ── BatchPrefill with Paged KV Cache ──
+
+// Plan phase for batch prefill.
+// Returns 0 on success, non-zero on failure.
+int flashinfer_batch_prefill_plan(
+    void* float_workspace, size_t float_ws_size,
+    void* int_workspace, size_t int_ws_size,
+    void* page_locked_int_workspace,
+    int32_t* qo_indptr_h, int32_t* kv_indptr_h,
+    int total_num_rows, int batch_size,
+    int num_qo_heads, int num_kv_heads, int page_size, int head_dim,
+    cudaStream_t stream,
+    int64_t* plan_info_out, int* plan_info_len_out);
+
+// Run phase for batch prefill.
+// Returns 0 on success, non-zero on failure.
+int flashinfer_batch_prefill_run(
+    void* float_workspace, size_t float_ws_size,
+    void* int_workspace,
+    int64_t* plan_info_vec, int plan_info_len,
+    float* q,                    // [total_num_rows, num_qo_heads, head_dim]
+    float* k_cache,              // [num_pages, page_size, num_kv_heads, head_dim] (NHD)
+    float* v_cache,              // same layout
+    int32_t* qo_indptr,         // [batch_size + 1] on GPU
+    int32_t* kv_indptr,         // [batch_size + 1] on GPU
+    int32_t* kv_indices,         // [total_pages]
+    int32_t* kv_last_page_len,   // [batch_size]
+    float* output,               // [total_num_rows, num_qo_heads, head_dim]
+    int total_num_rows, int batch_size,
+    int num_qo_heads, int num_kv_heads, int page_size, int head_dim,
+    cudaStream_t stream);
+
+#ifdef __cplusplus
+}
+#endif
--- a/crates/luminal_cuda_lite/src/host/mod.rs
+++ b/crates/luminal_cuda_lite/src/host/mod.rs
@@ -1,17 +1,122 @@
 use std::{fmt::Debug, sync::Arc};

-use crate::cudarc::driver::{CudaSlice, CudaStream};
+use crate::cudarc::driver::{CudaStream, DriverError, result};
 use luminal::{op::EgglogOp, prelude::*};
+pub mod compute_attn_mask;
 mod cublas;
 mod cublaslt;
+pub mod flashinfer;
 pub mod moe;

+pub use compute_attn_mask::ComputeAttnMask;
+
 pub type Ops = (
    // cublas::CuBlasSgemmV2,
    cublaslt::CuBlasLt,
    moe::GLUMoE,
+    compute_attn_mask::ComputeAttnMask,
+    flashinfer::FlashInferAttention,
 );

+#[cfg(test)]
+pub(crate) type CublasLtTypeTuple = (
+    luminal::dtype::DType,
+    luminal::dtype::DType,
+    luminal::dtype::DType,
+    luminal::dtype::DType,
+    &'static str,
+    luminal::dtype::DType,
+);
+
+#[cfg(test)]
+pub(crate) fn cublaslt_type_tuple(op: &dyn HostOp) -> Option<CublasLtTypeTuple> {
+    op.as_any()
+        .downcast_ref::<cublaslt::CuBlasLt>()
+        .map(cublaslt::CuBlasLt::type_tuple)
+}
+
+#[cfg(test)]
+pub(crate) type CublasLtScaleValues = (f64, f64);
+
+#[cfg(test)]
+pub(crate) fn cublaslt_scale_values(op: &dyn HostOp) -> Option<CublasLtScaleValues> {
+    op.as_any()
+        .downcast_ref::<cublaslt::CuBlasLt>()
+        .map(cublaslt::CuBlasLt::scale_values)
+}
+
+#[cfg(test)]
+pub(crate) fn cublaslt_epilogue(op: &dyn HostOp) -> Option<&'static str> {
+    op.as_any()
+        .downcast_ref::<cublaslt::CuBlasLt>()
+        .map(cublaslt::CuBlasLt::epilogue)
+}
+
+#[cfg(test)]
+pub(crate) type CublasLtMatrixOrders = (&'static str, &'static str, &'static str, &'static str);
+
+#[cfg(test)]
+pub(crate) fn cublaslt_matrix_orders(op: &dyn HostOp) -> Option<CublasLtMatrixOrders> {
+    op.as_any()
+        .downcast_ref::<cublaslt::CuBlasLt>()
+        .map(cublaslt::CuBlasLt::matrix_orders)
+}
+
+#[cfg(test)]
+pub(crate) type CublasLtTransposeOps = (&'static str, &'static str);
+
+#[cfg(test)]
+pub(crate) fn cublaslt_transpose_ops(op: &dyn HostOp) -> Option<CublasLtTransposeOps> {
+    op.as_any()
+        .downcast_ref::<cublaslt::CuBlasLt>()
+        .map(cublaslt::CuBlasLt::transpose_ops)
+}
+
+#[cfg(test)]
+pub(crate) fn cublaslt_c_d_layouts_match(op: &dyn HostOp) -> Option<bool> {
+    op.as_any()
+        .downcast_ref::<cublaslt::CuBlasLt>()
+        .map(cublaslt::CuBlasLt::c_d_layouts_match)
+}
+
+/// Non-owning device buffer handle used by host operations.
+///
+/// Runtime-owned intermediates may be a whole `CudaSlice`, a subregion inside
+/// the reusable arena, or an external pointer. Host ops only need the pointer
+/// and the logical byte length.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub struct DeviceBuffer {
+    ptr: u64,
+    len: usize,
+}
+
+impl DeviceBuffer {
+    pub fn new(ptr: u64, len: usize) -> Self {
+        Self { ptr, len }
+    }
+
+    pub fn ptr(self) -> u64 {
+        self.ptr
+    }
+
+    pub fn len(self) -> usize {
+        self.len
+    }
+
+    pub fn is_empty(self) -> bool {
+        self.len == 0
+    }
+
+    pub fn clone_dtoh(self, stream: &Arc<CudaStream>) -> Result<Vec<u8>, DriverError> {
+        let mut host = vec![0u8; self.len];
+        unsafe {
+            result::memcpy_dtoh_async(&mut host, self.ptr, stream.cu_stream())?;
+        }
+        stream.synchronize()?;
+        Ok(host)
+    }
+}
+
 /// Host operations that execute on the CPU but orchestrate GPU work.
 ///
 /// This includes operations like cuBLAS calls and CUDA graph executions.
@@ -29,7 +134,7 @@ pub trait HostOp: Debug + as_any::AsAny + EgglogOp {
        stream: &Arc<CudaStream>,
        self_node: NodeIndex,
        inputs: &[NodeIndex],
-        buffers: &FxHashMap<NodeIndex, &CudaSlice<u8>>,
+        buffers: &FxHashMap<NodeIndex, DeviceBuffer>,
        dyn_map: &FxHashMap<char, usize>,
    ) -> anyhow::Result<()>;

@@ -48,6 +153,15 @@ pub trait HostOp: Debug + as_any::AsAny + EgglogOp {
        vec![]
    }

+    /// Returns relative lifetimes for extra buffer nodes within this host op.
+    ///
+    /// The tuple is `(node, first_step, last_step)`, where steps are local to
+    /// this host op's execution. Returning `None` tells the runtime to treat
+    /// every extra buffer as live for the whole host op.
+    fn extra_buffer_lifetimes(&self) -> Option<Vec<(NodeIndex, usize, usize)>> {
+        None
+    }
+
    /// Returns buffer size requirements for extra nodes (node -> size in elements).
    ///
    /// Called during buffer allocation to ensure all required buffers exist.
--- a/crates/luminal_cuda_lite/src/host/moe/glumoe_rewrite.egg
+++ b/crates/luminal_cuda_lite/src/host/moe/glumoe_rewrite.egg
@@ -1,128 +1,246 @@
-; GLUMoE: Match the expert computation subgraph of a Gated MoE (SwiGLU variant).
+; GLUMoE: Match the expert computation subgraph of a gated MoE.
 ;
-; This matches the pattern produced by QwenMoE::forward() starting from the
-; expert gathers through to the final weighted sum, and replaces it with a
-; fused GLUMoE HostOp.
+; One fused op supports two activation modes:
+;   mode=0: Qwen-style SwiGLU (silu(gate) * up)
+;   mode=1: Gemma-style GELU (gate * sigmoid(1.595769 * gate * (1 + 0.044715 * gate^2)))
 ;
-; Inputs extracted:
-;   ?x            - input activations [s, H] F32
-;   ?topk_idx     - top-k expert indices [s, k] Int (from argsort+slice)
-;   ?topk_vals    - top-k routing values [s, k] F32 (from gather on softmax)
-;   ?gate_up_w    - stacked gate+up expert weights [E, intermediate*2, H] BF16
-;   ?down_w       - stacked down expert weights [E, H, intermediate] BF16
-;
-; The pattern captures:
-;   1. Gate-up expert gather (Iota, Mul, Cast, Iota, Cast, Add, Cast, Gather)
-;   2. Cast BF16→F32 of gathered gate-up weights
-;   3. Gate-up batched matmul (Mul + SumReduce)
-;   4. Gate/Up split via Iota+Gather (slice semantics)
-;   5. SwiGLU: silu(gate) * up
-;   6. Down expert gather (same pattern as gate-up)
-;   7. Cast BF16→F32 of gathered down weights
-;   8. Down batched matmul (Mul + SumReduce)
-;   9. Weighted sum: (down_out * topk_values) summed over k
-;
-; Variables with ? prefix are egglog pattern variables.
-; We use wildcards (?_xxx) for shapes/strides we don't extract.
+; To keep matching fast, we stage through marker states:
+;   1) Shared expert index/gather markers
+;   2) Shared gate-up matmul marker
+;   3) Activation marker (separate swiglu / gemma_gelu paths)
+;   4) Down matmul marker (separate swiglu / gemma_gelu paths)
+;   5) Final GLUMoE fusion (separate swiglu / gemma_gelu rules)
+
+(datatype*
+    (GLUMoEExpertIndexState
+        (MkGLUMoEExpertIndexState Expression Expression IR)
+    )
+    (GLUMoEExpertGatherState
+        (MkGLUMoEExpertGatherState Expression Expression IR IR)
+    )
+    (GLUMoEGateUpState
+        (MkGLUMoEGateUpState Expression Expression Expression IR IR IR)
+    )
+    (GLUMoESwiGLUState
+        (MkGLUMoESwiGLUState GLUMoEGateUpState)
+    )
+    (GLUMoEGemmaGELUState
+        (MkGLUMoEGemmaGELUState GLUMoEGateUpState)
+    )
+    (GLUMoESwiGLUDownState
+        (MkGLUMoESwiGLUDownState Expression Expression Expression GLUMoESwiGLUState IR IR)
+    )
+    (GLUMoEGemmaDownState
+        (MkGLUMoEGemmaDownState Expression Expression Expression GLUMoEGemmaGELUState IR IR)
+    )
+)
+
+(function glumoe_expert_index (IR) GLUMoEExpertIndexState :merge new)
+(function glumoe_expert_gather (IR) GLUMoEExpertGatherState :merge new)
+(function glumoe_gate_up (IR) GLUMoEGateUpState :merge new)
+(function glumoe_swiglu (IR) GLUMoESwiGLUState :merge new)
+(function glumoe_gemma_gelu (IR) GLUMoEGemmaGELUState :merge new)
+(function glumoe_swiglu_down (IR) GLUMoESwiGLUDownState :merge new)
+(function glumoe_gemma_down (IR) GLUMoEGemmaDownState :merge new)

 (rule
    (
-        ; ===== Gate-up expert gather =====
-        ; t51: Iota for base index (expert_idx * io_gu)
-        (= ?gu_iota_base (Op (Iota ?gu_io ?gu_iota_base_range) (INil)))
-        ; t52: Mul topk_indices * io → base offsets [s, k]
-        (= ?gu_mul_base (Op (Mul ?gu_mul_base_shape ?gu_mul_base_a_stride ?gu_mul_base_b_stride ?gu_mul_base_out_stride) (ICons ?topk_idx (ICons ?gu_iota_base (INil)))))
-        ; t53: Cast to F32
-        (= ?gu_cast_base (Op (Cast ?gu_cast_base_size (F32)) (ICons ?gu_mul_base (INil))))
-        ; t54: Iota for within-expert index
-        (= ?gu_iota_within (Op (Iota (MIter) ?gu_iota_within_range) (INil)))
-        ; t55: Cast within to F32
-        (= ?gu_cast_within (Op (Cast ?gu_cast_within_size (F32)) (ICons ?gu_iota_within (INil))))
-        ; t56: Add base + within → flat gather indices
-        (= ?gu_add_idx (Op (Add ?gu_add_shape ?gu_add_a_stride ?gu_add_b_stride ?gu_add_out_stride) (ICons ?gu_cast_base (ICons ?gu_cast_within (INil)))))
-        ; t57: Cast to Int
-        (= ?gu_cast_idx (Op (Cast ?gu_cast_idx_size (Int)) (ICons ?gu_add_idx (INil))))
-        ; t58: Gather gate_up weights
-        (= ?gu_gathered (Op (Gather ?gu_gather_idx_shape ?gu_gather_idx_stride ?gu_gather_data_shape ?gu_gather_data_stride) (ICons ?gu_cast_idx (ICons ?gate_up_w (INil)))))
+        (= ?iota_base (Op (Iota ?io ?iota_base_range) (INil)))
+        (= ?mul_base (Op (Mul ?mul_base_shape ?mul_base_a_stride ?mul_base_b_stride ?mul_base_out_stride) (ICons ?topk_idx (ICons ?iota_base (INil)))))
+        (= ?iota_within (Op (Iota (MIter) ?iota_within_range) (INil)))
+        (= ?add_idx (Op (Add ?add_shape ?add_a_stride ?add_b_stride ?add_out_stride) (ICons ?mul_base (ICons ?iota_within (INil)))))
+    )
+    (
+        (set (glumoe_expert_index ?add_idx)
+            (MkGLUMoEExpertIndexState ?io ?iota_within_range ?topk_idx))
+    )
+    :ruleset glumoe
+    :name "GLUMoE expert index marker"
+)

-        ; ===== Cast BF16→F32 =====
-        ; t59: Cast gathered gate_up to F32
-        (= ?gu_f32 (Op (Cast ?gu_f32_size (F32)) (ICons ?gu_gathered (INil))))
+(rule
+    (
+        (= ?index_state (glumoe_expert_index ?idx))
+        (= ?index_state (MkGLUMoEExpertIndexState ?io ?within_range ?topk_idx))
+        (= ?gathered (Op (Gather ?gather_idx_shape ?gather_idx_stride ?gather_data_shape ?gather_data_stride) (ICons ?idx (ICons ?weights (INil)))))
+        (= ?f32 (Op (Cast ?f32_size (F32)) (ICons ?gathered (INil))))
+    )
+    (
+        (set (glumoe_expert_gather ?f32)
+            (MkGLUMoEExpertGatherState ?io ?within_range ?topk_idx ?weights))
+    )
+    :ruleset glumoe
+    :name "GLUMoE expert gather marker"
+)

-        ; ===== Gate-up batched matmul =====
-        ; t60: Mul x * gathered_gu (broadcast multiply)
+(rule
+    (
+        (= ?gather_state (glumoe_expert_gather ?gu_f32))
+        (= ?gather_state (MkGLUMoEExpertGatherState ?gu_io ?gu_iota_within_range ?topk_idx ?gate_up_w))
        (= ?gu_matmul_mul (Op (Mul ?gu_matmul_mul_shape ?gu_matmul_a_stride ?gu_matmul_b_stride ?gu_matmul_mul_out_stride) (ICons ?x (ICons ?gu_f32 (INil)))))
-        ; t61: SumReduce over K dimension
        (= ?gu_matmul (Op (Sum ?gu_matmul_out_shape ?gu_matmul_k ?gu_matmul_in_stride ?gu_matmul_k_stride ?gu_matmul_out_stride) (ICons ?gu_matmul_mul (INil))))
+    )
+    (
+        (set (glumoe_gate_up ?gu_matmul)
+            (MkGLUMoEGateUpState ?gu_io ?gu_matmul_k ?gu_iota_within_range ?x ?topk_idx ?gate_up_w))
+    )
+    :ruleset glumoe
+    :name "GLUMoE gate-up matmul marker"
+)
+
+; ===== SwiGLU activation marker =====
+(rule
+    (
+        (= ?gate_up_state (glumoe_gate_up ?gu_matmul))
+        (= ?gate_up_state (MkGLUMoEGateUpState ?gu_io ?gu_matmul_k ?gu_within_range ?x ?topk_idx ?gate_up_w))

-        ; ===== Up slice via Iota+Gather =====
-        ; t62: Iota with complex expression (slicing the "up" half)
        (= ?up_iota (Op (Iota ?up_iota_expr ?up_iota_range) (INil)))
-        ; t63: Gather to select up portion from matmul result
        (= ?up_slice (Op (Gather ?up_gather_idx_shape ?up_gather_idx_stride ?up_gather_data_shape ?up_gather_data_stride) (ICons ?up_iota (ICons ?gu_matmul (INil)))))

-        ; ===== SwiGLU: silu(gate) * up =====
-        ; t64: Constant(-1)
        (= ?neg1 (Op (Constant -1.000000) (INil)))
-        ; t65: gate * -1
        (= ?neg_gate (Op (Mul ?silu_shape1 ?silu_a_stride1 ?silu_b_stride1 ?silu_out_stride1) (ICons ?gu_matmul (ICons ?neg1 (INil)))))
-        ; t66: Constant(log2e)
        (= ?log2e (Op (Constant 1.442695) (INil)))
-        ; t67: neg_gate * log2e
        (= ?scaled (Op (Mul ?silu_shape2 ?silu_a_stride2 ?silu_b_stride2 ?silu_out_stride2) (ICons ?neg_gate (ICons ?log2e (INil)))))
-        ; t68: exp2
        (= ?exp2_val (Op (Exp2 ?silu_shape3 ?silu_in_stride3 ?silu_out_stride3) (ICons ?scaled (INil))))
-        ; t69: Constant(1)
        (= ?one (Op (Constant 1.000000) (INil)))
-        ; t70: exp2 + 1
        (= ?plus1 (Op (Add ?silu_shape4 ?silu_a_stride4 ?silu_b_stride4 ?silu_out_stride4) (ICons ?exp2_val (ICons ?one (INil)))))
-        ; t71: recip
        (= ?sigmoid (Op (Recip ?silu_shape5 ?silu_in_stride5 ?silu_out_stride5) (ICons ?plus1 (INil))))
-        ; t72: gate * sigmoid(gate) = silu(gate)
        (= ?silu_out (Op (Mul ?silu_shape6 ?silu_a_stride6 ?silu_b_stride6 ?silu_out_stride6) (ICons ?gu_matmul (ICons ?sigmoid (INil)))))
-        ; t73: silu(gate) * up
        (= ?swiglu_out (Op (Mul ?swiglu_shape ?swiglu_a_stride ?swiglu_b_stride ?swiglu_out_stride) (ICons ?silu_out (ICons ?up_slice (INil)))))
+    )
+    (
+        (set (glumoe_swiglu ?swiglu_out) (MkGLUMoESwiGLUState ?gate_up_state))
+    )
+    :ruleset glumoe
+    :name "GLUMoE swiglu marker"
+)

-        ; ===== Down expert gather =====
-        ; t74: Iota for base index (expert_idx * io_down)
-        (= ?dn_iota_base (Op (Iota ?dn_io ?dn_iota_base_range) (INil)))
-        ; t75: Mul topk_indices * io_down
-        (= ?dn_mul_base (Op (Mul ?dn_mul_base_shape ?dn_mul_base_a_stride ?dn_mul_base_b_stride ?dn_mul_base_out_stride) (ICons ?topk_idx (ICons ?dn_iota_base (INil)))))
-        ; t76: Cast to F32
-        (= ?dn_cast_base (Op (Cast ?dn_cast_base_size (F32)) (ICons ?dn_mul_base (INil))))
-        ; t77: Iota for within-expert index
-        (= ?dn_iota_within (Op (Iota (MIter) ?dn_iota_within_range) (INil)))
-        ; t78: Cast within to F32
-        (= ?dn_cast_within (Op (Cast ?dn_cast_within_size (F32)) (ICons ?dn_iota_within (INil))))
-        ; t79: Add base + within
-        (= ?dn_add_idx (Op (Add ?dn_add_shape ?dn_add_a_stride ?dn_add_b_stride ?dn_add_out_stride) (ICons ?dn_cast_base (ICons ?dn_cast_within (INil)))))
-        ; t80: Cast to Int
-        (= ?dn_cast_idx (Op (Cast ?dn_cast_idx_size (Int)) (ICons ?dn_add_idx (INil))))
-        ; t81: Gather down weights
-        (= ?dn_gathered (Op (Gather ?dn_gather_idx_shape ?dn_gather_idx_stride ?dn_gather_data_shape ?dn_gather_data_stride) (ICons ?dn_cast_idx (ICons ?down_w (INil)))))
+; ===== Gemma GELU activation marker =====
+(rule
+    (
+        (= ?gate_up_state (glumoe_gate_up ?gu_matmul))
+        (= ?gate_up_state (MkGLUMoEGateUpState ?gu_io ?gu_matmul_k ?gu_within_range ?x ?topk_idx ?gate_up_w))

-        ; ===== Cast BF16→F32 =====
-        ; t82: Cast gathered down to F32
-        (= ?dn_f32 (Op (Cast ?dn_f32_size (F32)) (ICons ?dn_gathered (INil))))
+        (= ?up_iota (Op (Iota ?up_iota_expr ?up_iota_range) (INil)))
+        (= ?up_slice (Op (Gather ?up_gather_idx_shape ?up_gather_idx_stride ?up_gather_data_shape ?up_gather_data_stride) (ICons ?up_iota (ICons ?gu_matmul (INil)))))

-        ; ===== Down batched matmul =====
-        ; t83: Mul swiglu_out * gathered_down (broadcast multiply)
+        (= ?gelu_coeff_inner (Op (Constant 0.044715) (INil)))
+        (= ?gelu_inner_scaled (Op (Mul ?gelu_inner_scaled_shape ?gelu_inner_scaled_a_stride ?gelu_inner_scaled_b_stride ?gelu_inner_scaled_out_stride) (ICons ?gu_matmul (ICons ?gelu_coeff_inner (INil)))))
+        (= ?gelu_inner_quad (Op (Mul ?gelu_inner_quad_shape ?gelu_inner_quad_a_stride ?gelu_inner_quad_b_stride ?gelu_inner_quad_out_stride) (ICons ?gelu_inner_scaled (ICons ?gu_matmul (INil)))))
+        (= ?gelu_one (Op (Constant 1.000000) (INil)))
+        (= ?gelu_poly (Op (Add ?gelu_poly_shape ?gelu_poly_a_stride ?gelu_poly_b_stride ?gelu_poly_out_stride) (ICons ?gelu_inner_quad (ICons ?gelu_one (INil)))))
+        (= ?gelu_coeff_outer (Op (Constant 1.595769) (INil)))
+        (= ?gelu_outer_scaled (Op (Mul ?gelu_outer_scaled_shape ?gelu_outer_scaled_a_stride ?gelu_outer_scaled_b_stride ?gelu_outer_scaled_out_stride) (ICons ?gu_matmul (ICons ?gelu_coeff_outer (INil)))))
+        (= ?gelu_scaled (Op (Mul ?gelu_scaled_shape ?gelu_scaled_a_stride ?gelu_scaled_b_stride ?gelu_scaled_out_stride) (ICons ?gelu_outer_scaled (ICons ?gelu_poly (INil)))))
+        (= ?neg1 (Op (Constant -1.000000) (INil)))
+        (= ?gelu_neg (Op (Mul ?gelu_neg_shape ?gelu_neg_a_stride ?gelu_neg_b_stride ?gelu_neg_out_stride) (ICons ?gelu_scaled (ICons ?neg1 (INil)))))
+        (= ?log2e (Op (Constant 1.442695) (INil)))
+        (= ?gelu_exp_scaled (Op (Mul ?gelu_exp_scaled_shape ?gelu_exp_scaled_a_stride ?gelu_exp_scaled_b_stride ?gelu_exp_scaled_out_stride) (ICons ?gelu_neg (ICons ?log2e (INil)))))
+        (= ?gelu_exp2_val (Op (Exp2 ?gelu_exp_shape ?gelu_exp_in_stride ?gelu_exp_out_stride) (ICons ?gelu_exp_scaled (INil))))
+        (= ?gelu_plus1 (Op (Add ?gelu_plus1_shape ?gelu_plus1_a_stride ?gelu_plus1_b_stride ?gelu_plus1_out_stride) (ICons ?gelu_exp2_val (ICons ?gelu_one (INil)))))
+        (= ?gelu_sigmoid (Op (Recip ?gelu_sigmoid_shape ?gelu_sigmoid_in_stride ?gelu_sigmoid_out_stride) (ICons ?gelu_plus1 (INil))))
+        (= ?gelu_out (Op (Mul ?gelu_out_shape ?gelu_out_a_stride ?gelu_out_b_stride ?gelu_out_out_stride) (ICons ?gu_matmul (ICons ?gelu_sigmoid (INil)))))
+        (= ?gemma_out (Op (Mul ?geglu_shape ?geglu_a_stride ?geglu_b_stride ?geglu_out_stride) (ICons ?gelu_out (ICons ?up_slice (INil)))))
+    )
+    (
+        (set (glumoe_gemma_gelu ?gemma_out) (MkGLUMoEGemmaGELUState ?gate_up_state))
+    )
+    :ruleset glumoe
+    :name "GLUMoE gemma gelu marker"
+)
+
+; ===== SwiGLU down marker =====
+(rule
+    (
+        (= ?swiglu_state (glumoe_swiglu ?swiglu_out))
+        (= ?swiglu_state (MkGLUMoESwiGLUState ?gate_up_state))
+
+        (= ?gather_state (glumoe_expert_gather ?dn_f32))
+        (= ?gather_state (MkGLUMoEExpertGatherState ?dn_io ?dn_iota_within_range ?topk_idx ?down_w))
        (= ?dn_matmul_mul (Op (Mul ?dn_matmul_mul_shape ?dn_matmul_a_stride ?dn_matmul_b_stride ?dn_matmul_mul_out_stride) (ICons ?swiglu_out (ICons ?dn_f32 (INil)))))
-        ; t84: SumReduce
        (= ?dn_matmul (Op (Sum ?dn_matmul_out_shape ?dn_matmul_k ?dn_matmul_in_stride ?dn_matmul_k_stride ?dn_matmul_out_stride) (ICons ?dn_matmul_mul (INil))))
+    )
+    (
+        (set (glumoe_swiglu_down ?dn_matmul)
+            (MkGLUMoESwiGLUDownState ?dn_io ?dn_matmul_k ?dn_iota_within_range ?swiglu_state ?topk_idx ?down_w))
+    )
+    :ruleset glumoe
+    :name "GLUMoE swiglu down marker"
+)
+
+; ===== Gemma GELU down marker =====
+(rule
+    (
+        (= ?gemma_state (glumoe_gemma_gelu ?gemma_out))
+        (= ?gemma_state (MkGLUMoEGemmaGELUState ?gate_up_state))
+
+        (= ?gather_state (glumoe_expert_gather ?dn_f32))
+        (= ?gather_state (MkGLUMoEExpertGatherState ?dn_io ?dn_iota_within_range ?topk_idx ?down_w))
+        (= ?dn_matmul_mul (Op (Mul ?dn_matmul_mul_shape ?dn_matmul_a_stride ?dn_matmul_b_stride ?dn_matmul_mul_out_stride) (ICons ?gemma_out (ICons ?dn_f32 (INil)))))
+        (= ?dn_matmul (Op (Sum ?dn_matmul_out_shape ?dn_matmul_k ?dn_matmul_in_stride ?dn_matmul_k_stride ?dn_matmul_out_stride) (ICons ?dn_matmul_mul (INil))))
+    )
+    (
+        (set (glumoe_gemma_down ?dn_matmul)
+            (MkGLUMoEGemmaDownState ?dn_io ?dn_matmul_k ?dn_iota_within_range ?gemma_state ?topk_idx ?down_w))
+    )
+    :ruleset glumoe
+    :name "GLUMoE gemma down marker"
+)
+
+; ===== Final fusion: mode 0 (SwiGLU) =====
+(rule
+    (
+        (= ?down_state (glumoe_swiglu_down ?dn_matmul))
+        (= ?down_state (MkGLUMoESwiGLUDownState ?dn_io ?dn_matmul_k ?dn_within_range ?swiglu_state ?topk_idx ?down_w))
+        (= ?swiglu_state (MkGLUMoESwiGLUState ?gate_up_state))
+        (= ?gate_up_state (MkGLUMoEGateUpState ?gu_io ?gu_matmul_k ?gu_within_range ?x ?topk_idx ?gate_up_w))

-        ; ===== Weighted sum over k experts =====
-        ; t85: Mul down_out * topk_values
        (= ?weighted (Op (Mul ?weighted_shape ?weighted_a_stride ?weighted_b_stride ?weighted_out_stride) (ICons ?dn_matmul (ICons ?topk_vals (INil)))))
-        ; t86: SumReduce over k dimension → [s, H]
        (= ?output (Op (Sum ?output_shape ?output_k ?output_in_stride ?output_k_stride ?output_out_stride) (ICons ?weighted (INil))))
    )
    (
        (let ?glumoe (Op (GLUMoE
            ?gu_io ?dn_io ?gu_matmul_k ?dn_matmul_k ?output_k
-            ?gu_iota_within_range ?dn_iota_within_range)
-            (ICons ?x (ICons ?topk_idx (ICons ?topk_vals (ICons ?gate_up_w (ICons ?down_w (INil))))))))
+            ?gu_within_range ?dn_within_range (MNum 0))
+            (ICons ?x (ICons ?topk_idx (ICons ?topk_vals (ICons ?gate_up_w (ICons ?down_w (ICons ?topk_vals (INil)))))))))
        (union ?output ?glumoe)
+        (subsume (Op (Sum ?output_shape ?output_k ?output_in_stride ?output_k_stride ?output_out_stride) (ICons ?weighted (INil))))
+        (subsume (Op (KernelSum ?output_shape ?output_k ?output_in_stride ?output_k_stride ?output_out_stride (F32)) (ICons ?weighted (INil))))
    )
-    :name "GLUMoE fused expert computation"
+    :ruleset glumoe
+    :name "GLUMoE fused expert computation (swiglu)"
+)
+
+; ===== Final fusion: mode 1 (Gemma GELU) =====
+(rule
+    (
+        (= ?down_state (glumoe_gemma_down ?dn_matmul))
+        (= ?down_state (MkGLUMoEGemmaDownState ?dn_io ?dn_matmul_k ?dn_within_range ?gemma_state ?topk_idx ?down_w))
+        (= ?gemma_state (MkGLUMoEGemmaGELUState ?gate_up_state))
+        (= ?gate_up_state (MkGLUMoEGateUpState ?gu_io ?gu_matmul_k ?gu_within_range ?x ?topk_idx ?gate_up_w))
+
+        ; Gemma expert weights: topk_weights = normed_topk * per_expert_scale.gather(topk_idx)
+        (= ?per_expert_vals (Op (Gather ?scale_gather_idx_shape ?scale_gather_idx_stride ?scale_gather_data_shape ?scale_gather_data_stride) (ICons ?topk_idx (ICons ?per_expert_scale (INil)))))
+        (= ?topk_row_offsets (Op (Iota ?topk_row_offsets_expr ?topk_row_offsets_range) (INil)))
+        (= ?topk_flat_idx (Op (Add ?topk_flat_idx_shape ?topk_flat_idx_a_stride ?topk_flat_idx_b_stride ?topk_flat_idx_out_stride) (ICons ?topk_row_offsets (ICons ?topk_idx (INil)))))
+        (= ?topk_vals (Op (Gather ?topk_vals_gather_idx_shape ?topk_vals_gather_idx_stride ?topk_vals_gather_data_shape ?topk_vals_gather_data_stride) (ICons ?topk_flat_idx (ICons ?routing_weights (INil)))))
+        (= ?topk_norm (Op (Sum ?topk_norm_shape ?output_k ?topk_norm_in_stride ?topk_norm_k_stride ?topk_norm_out_stride) (ICons ?topk_vals (INil))))
+        (= ?topk_norm_factor (Op (Recip ?topk_norm_recip_shape ?topk_norm_recip_in_stride ?topk_norm_recip_out_stride) (ICons ?topk_norm (INil))))
+        (= ?normed_topk (Op (Mul ?normed_topk_shape ?normed_topk_a_stride ?normed_topk_b_stride ?normed_topk_out_stride) (ICons ?topk_vals (ICons ?topk_norm_factor (INil)))))
+        (= ?expert_weights (Op (Mul ?expert_weights_shape ?expert_weights_a_stride ?expert_weights_b_stride ?expert_weights_out_stride) (ICons ?normed_topk (ICons ?per_expert_vals (INil)))))
+
+        (= ?weighted (Op (Mul ?weighted_shape ?weighted_a_stride ?weighted_b_stride ?weighted_out_stride) (ICons ?dn_matmul (ICons ?expert_weights (INil)))))
+        (= ?output (Op (Sum ?output_shape ?output_k ?output_in_stride ?output_k_stride ?output_out_stride) (ICons ?weighted (INil))))
+    )
+    (
+        (let ?glumoe (Op (GLUMoE
+            ?gu_io ?dn_io ?gu_matmul_k ?dn_matmul_k ?output_k
+            ?gu_within_range ?dn_within_range (MNum 1))
+            (ICons ?x (ICons ?topk_idx (ICons ?topk_vals (ICons ?gate_up_w (ICons ?down_w (ICons ?per_expert_scale (INil)))))))))
+        (union ?output ?glumoe)
+        (subsume (Op (Sum ?output_shape ?output_k ?output_in_stride ?output_k_stride ?output_out_stride) (ICons ?weighted (INil))))
+        (subsume (Op (KernelSum ?output_shape ?output_k ?output_in_stride ?output_k_stride ?output_out_stride (F32)) (ICons ?weighted (INil))))
+    )
+    :ruleset glumoe
+    :name "GLUMoE fused expert computation (gemma_gelu)"
 )
--- a/crates/luminal_cuda_lite/src/host/moe/mod.rs
+++ b/crates/luminal_cuda_lite/src/host/moe/mod.rs
@@ -32,15 +32,16 @@ use crate::{
            CudaFunction, CudaModule, CudaSlice, CudaStream, DevicePtr, LaunchConfig, PushKernelArg,
        },
    },
-    host::HostOp,
+    host::{DeviceBuffer, HostOp},
+    try_create_cublaslt,
 };

 const WORKSPACE_SIZE: usize = 32 * 1024 * 1024; // 32 MiB

 /// Fused GLU-MoE HostOp matched via egglog pattern.
 ///
-/// Replaces the expert computation subgraph (expert gathers + matmuls + SwiGLU
-/// + weighted sum) with an efficient cuBLASLt implementation.
+/// Replaces the expert computation subgraph (expert gathers + matmuls + gated
+/// activation + weighted sum) with an efficient cuBLASLt implementation.
 ///
 /// Inputs (graph edges, in order):
 ///   0: x              [seq, hidden]                        F32
@@ -48,9 +49,13 @@ const WORKSPACE_SIZE: usize = 32 * 1024 * 1024; // 32 MiB
 ///   2: topk_values    [seq, k]                             F32
 ///   3: gate_up_w      [E, gate_up_dim, hidden]             BF16
 ///   4: down_w         [E, hidden, intermediate]             BF16
+///   5: mode_aux
+///      - SwiGLU: ignored (rewriter wires `topk_values` again)
+///      - GemmaGELU: per_expert_scale [E]                   F32
 ///
 /// Output: [seq, hidden] F32
 pub struct GLUMoE {
+    pub(crate) mode: GLUMoEMode,
    /// Product of gate_up weight dimensions per expert (gate_up_dim * hidden) used for gather stride
    gu_io: Expression,
    /// Product of down weight dimensions per expert (hidden * intermediate) used for gather stride
@@ -69,9 +74,35 @@ pub struct GLUMoE {
    module: OnceLock<(Arc<CudaModule>, CudaFunction, CudaFunction)>,
 }

+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub(crate) enum GLUMoEMode {
+    SwiGLU,
+    GemmaGELU,
+}
+
+impl GLUMoEMode {
+    fn from_mode_id(mode_id: usize) -> Self {
+        match mode_id {
+            0 => Self::SwiGLU,
+            1 => Self::GemmaGELU,
+            other => {
+                panic!("Unknown GLUMoE mode id: {other}");
+            }
+        }
+    }
+
+    fn activation_kernel_mode(self) -> i32 {
+        match self {
+            Self::SwiGLU => 0,
+            Self::GemmaGELU => 1,
+        }
+    }
+}
+
 impl Default for GLUMoE {
    fn default() -> Self {
        Self {
+            mode: GLUMoEMode::SwiGLU,
            gu_io: Expression::default(),
            dn_io: Expression::default(),
            gu_matmul_k: Expression::default(),
@@ -88,6 +119,7 @@ impl Default for GLUMoE {
 impl std::fmt::Debug for GLUMoE {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.debug_struct("GLUMoE")
+            .field("mode", &self.mode)
            .field("gu_io", &self.gu_io)
            .field("dn_io", &self.dn_io)
            .field("gu_matmul_k", &self.gu_matmul_k)
@@ -100,6 +132,7 @@ impl std::fmt::Debug for GLUMoE {
 impl Clone for GLUMoE {
    fn clone(&self) -> Self {
        Self {
+            mode: self.mode,
            gu_io: self.gu_io,
            dn_io: self.dn_io,
            gu_matmul_k: self.gu_matmul_k,
@@ -114,9 +147,15 @@ impl Clone for GLUMoE {
 }

 impl GLUMoE {
-    fn get_cublaslt(&self, stream: &Arc<CudaStream>) -> &Arc<CudaBlasLT> {
-        self.cublaslt
-            .get_or_init(|| Arc::new(CudaBlasLT::new(stream.clone()).unwrap()))
+    fn get_cublaslt(&self, stream: &Arc<CudaStream>) -> anyhow::Result<Arc<CudaBlasLT>> {
+        if let Some(cublaslt) = self.cublaslt.get() {
+            return Ok(cublaslt.clone());
+        }
+        let created = try_create_cublaslt(stream.clone()).map_err(|message| {
+            anyhow::anyhow!("cuBLASLt unavailable on this machine: {message}")
+        })?;
+        let _ = self.cublaslt.set(created.clone());
+        Ok(created)
    }

    fn get_kernels(
@@ -134,23 +173,34 @@ extern "C" __global__ void f32_to_bf16(unsigned long long in_ptr, unsigned long
    if (i < n) out[i] = __float2bfloat16(in_[i]);
 }

-extern "C" __global__ void swiglu_bf16(unsigned long long gate_up_ptr, unsigned long long out_ptr, int intermediate) {
+extern "C" __global__ void glu_activation_bf16(
+    unsigned long long gate_up_ptr,
+    unsigned long long out_ptr,
+    int intermediate,
+    int mode
+) {
    const __nv_bfloat16* gate_up = (const __nv_bfloat16*)gate_up_ptr;
    __nv_bfloat16* out = (__nv_bfloat16*)out_ptr;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < intermediate) {
        float gate = __bfloat162float(gate_up[i]);
        float up   = __bfloat162float(gate_up[i + intermediate]);
-        float silu = gate / (1.0f + expf(-gate));
-        out[i] = __float2bfloat16(silu * up);
+        float activated;
+        if (mode == 0) {
+            activated = gate / (1.0f + expf(-gate));
+        } else {
+            float scaled = 1.5957691216f * gate * (1.0f + 0.044715f * gate * gate);
+            activated = gate / (1.0f + expf(-scaled));
+        }
+        out[i] = __float2bfloat16(activated * up);
    }
 }
 "#;
            let ptx = compile_module_image_for_current_device(stream.context(), src).unwrap();
            let module = stream.context().load_module(ptx).unwrap();
            let f32_to_bf16 = module.load_function("f32_to_bf16").unwrap();
-            let swiglu = module.load_function("swiglu_bf16").unwrap();
-            (module, f32_to_bf16, swiglu)
+            let activation = module.load_function("glu_activation_bf16").unwrap();
+            (module, f32_to_bf16, activation)
        })
    }
 }
@@ -168,16 +218,30 @@ impl EgglogOp for GLUMoE {
                ("output_k", EXPRESSION),
                ("gu_within_range", EXPRESSION),
                ("dn_within_range", EXPRESSION),
+                ("mode", EXPRESSION),
            ],
        )
    }

-    fn n_inputs(&self) -> usize {
-        5
+    fn rewrites(&self) -> Vec<Rule> {
+        vec![
+            Rule::raw(
+                "(rule
+                (
+                    (= ?e (Op (GLUMoE ?gu_io ?dn_io ?gu_matmul_k ?dn_matmul_k ?output_k ?gu_within_range ?dn_within_range ?mode) ?inputs))
+                )
+                (
+                    (set (dtype ?e) (F32))
+                )
+                :ruleset dtype_prop
+            )",
+            ),
+            Rule::raw(include_str!["glumoe_rewrite.egg"]),
+        ]
    }

-    fn early_rewrites(&self) -> Vec<Rule> {
-        vec![Rule::raw(include_str!["glumoe_rewrite.egg"])]
+    fn n_inputs(&self) -> usize {
+        6
    }

    fn extract<'a>(
@@ -195,8 +259,14 @@ impl EgglogOp for GLUMoE {
        let output_k = extract_expr(egraph, kind_children[4], expr_cache).unwrap();
        let gu_within_range = extract_expr(egraph, kind_children[5], expr_cache).unwrap();
        let dn_within_range = extract_expr(egraph, kind_children[6], expr_cache).unwrap();
+        let mode_expr = extract_expr(egraph, kind_children[7], expr_cache).unwrap();
+        let mode_id = mode_expr
+            .to_usize()
+            .unwrap_or_else(|| panic!("GLUMoE mode must be static, got expression: {mode_expr}"));
+        let mode = GLUMoEMode::from_mode_id(mode_id);

        let extracted = GLUMoE {
+            mode,
            gu_io,
            dn_io,
            gu_matmul_k,
@@ -209,7 +279,7 @@ impl EgglogOp for GLUMoE {
        };

        let op = LLIROp::new::<dyn HostOp>(Box::new(extracted) as Box<dyn HostOp>);
-        // Return the 5 IR inputs: x, topk_idx, topk_vals, gate_up_w, down_w
+        // Return the 6 IR inputs: x, topk_idx, topk_values, gate_up_w, down_w, mode_aux
        (op, input_enodes)
    }

@@ -224,26 +294,140 @@ impl HostOp for GLUMoE {
        stream: &Arc<CudaStream>,
        self_node: NodeIndex,
        inputs: &[NodeIndex],
-        buffers: &FxHashMap<NodeIndex, &CudaSlice<u8>>,
+        buffers: &FxHashMap<NodeIndex, DeviceBuffer>,
        dyn_map: &FxHashMap<char, usize>,
    ) -> anyhow::Result<()> {
-        // Resolve dimensions
-        let hidden = self.gu_matmul_k.exec(dyn_map).unwrap();
-        let intermediate = self.dn_matmul_k.exec(dyn_map).unwrap();
-        let top_k = self.output_k.exec(dyn_map).unwrap();
-        let gate_up_dim = self.gu_io.exec(dyn_map).unwrap() / hidden; // gate_up_dim = gu_io / hidden
-        let _num_experts = self.gu_within_range.exec(dyn_map).unwrap() / (gate_up_dim * hidden);
+        if inputs.len() < 6 {
+            anyhow::bail!("GLUMoE expected at least 6 inputs, got {}", inputs.len());
+        }

-        // Derive seq from x buffer size: x is [seq, hidden] F32 → seq = len / (hidden * 4)
-        let x_buf = buffers[&inputs[0]];
-        let seq = x_buf.len() / (hidden * 4);
+        // Resolve dimensions
+        let hidden = self
+            .gu_matmul_k
+            .exec(dyn_map)
+            .ok_or_else(|| anyhow::anyhow!("GLUMoE hidden dimension is unresolved"))?;
+        let intermediate = self
+            .dn_matmul_k
+            .exec(dyn_map)
+            .ok_or_else(|| anyhow::anyhow!("GLUMoE intermediate dimension is unresolved"))?;
+        let top_k = self
+            .output_k
+            .exec(dyn_map)
+            .ok_or_else(|| anyhow::anyhow!("GLUMoE top-k dimension is unresolved"))?;
+        let gu_io = self
+            .gu_io
+            .exec(dyn_map)
+            .ok_or_else(|| anyhow::anyhow!("GLUMoE gate/up stride is unresolved"))?;
+        let dn_io = self
+            .dn_io
+            .exec(dyn_map)
+            .ok_or_else(|| anyhow::anyhow!("GLUMoE down stride is unresolved"))?;
+
+        if hidden == 0 || intermediate == 0 {
+            anyhow::bail!(
+                "GLUMoE got zero-sized matmul dimensions: hidden={hidden}, intermediate={intermediate}"
+            );
+        }
+        if top_k == 0 {
+            return Ok(());
+        }
+        if gu_io % hidden != 0 {
+            anyhow::bail!("GLUMoE gate/up stride {gu_io} is not divisible by hidden {hidden}");
+        }
+        if dn_io % intermediate != 0 {
+            anyhow::bail!(
+                "GLUMoE down stride {dn_io} is not divisible by intermediate {intermediate}"
+            );
+        }
+
+        let gate_up_dim = gu_io / hidden; // gate_up_dim = 2 * intermediate for GLU
+        let down_hidden = dn_io / intermediate;
+        if gate_up_dim != intermediate * 2 {
+            anyhow::bail!(
+                "GLUMoE expected gate/up dim {} to equal 2 * intermediate {}",
+                gate_up_dim,
+                intermediate * 2
+            );
+        }
+        if down_hidden != hidden {
+            anyhow::bail!("GLUMoE down hidden {down_hidden} does not match hidden {hidden}");
+        }
+
+        let output_bytes = self
+            .output_bytes()
+            .exec(dyn_map)
+            .ok_or_else(|| anyhow::anyhow!("GLUMoE output byte size is unresolved"))?;
+        if output_bytes % (hidden * 4) != 0 {
+            anyhow::bail!(
+                "GLUMoE output bytes {output_bytes} are not divisible by hidden bytes {}",
+                hidden * 4
+            );
+        }
+        let seq = output_bytes / (hidden * 4);
+        if seq == 0 {
+            return Ok(());
+        }
+
+        let get_buffer = |name: &str, node: NodeIndex| -> anyhow::Result<DeviceBuffer> {
+            buffers.get(&node).copied().ok_or_else(|| {
+                anyhow::anyhow!("GLUMoE missing {name} buffer for LLIR node {node:?}")
+            })
+        };

        // Get input/output buffers
-        let topk_idx_buf = buffers[&inputs[1]]; // [seq, k] Int
-        let topk_vals_buf = buffers[&inputs[2]]; // [seq, k] F32
-        let gate_up_buf = buffers[&inputs[3]]; // [E, gate_up_dim, hidden] BF16
-        let down_buf = buffers[&inputs[4]]; // [E, hidden, intermediate] BF16
-        let output_buf = buffers[&self_node]; // [seq, hidden] F32
+        let x_buf = get_buffer("x", inputs[0])?; // [seq, hidden] F32
+        let topk_idx_buf = get_buffer("topk indices", inputs[1])?; // [seq, k] Int
+        let topk_vals_buf = get_buffer("topk values", inputs[2])?; // [seq, k] F32
+        let gate_up_buf = get_buffer("gate/up weights", inputs[3])?; // [E, gate_up_dim, hidden] BF16
+        let down_buf = get_buffer("down weights", inputs[4])?; // [E, hidden, intermediate] BF16
+        let mode_aux_buf = get_buffer("mode aux", inputs[5])?;
+        let output_buf = get_buffer("output", self_node)?; // [seq, hidden] F32
+
+        let topk_bytes = seq * top_k * 4;
+        if x_buf.len() < output_bytes {
+            anyhow::bail!(
+                "GLUMoE x buffer too small: have {} bytes, need {output_bytes}",
+                x_buf.len()
+            );
+        }
+        if topk_idx_buf.len() < topk_bytes {
+            anyhow::bail!(
+                "GLUMoE topk index buffer too small: have {} bytes, need {topk_bytes}",
+                topk_idx_buf.len()
+            );
+        }
+        if topk_vals_buf.len() < topk_bytes {
+            anyhow::bail!(
+                "GLUMoE topk value buffer too small: have {} bytes, need {topk_bytes}",
+                topk_vals_buf.len()
+            );
+        }
+        if output_buf.len() < output_bytes {
+            anyhow::bail!(
+                "GLUMoE output buffer too small: have {} bytes, need {output_bytes}",
+                output_buf.len()
+            );
+        }
+
+        let gu_stride_bytes = gate_up_dim * hidden * 2;
+        let down_stride_bytes = hidden * intermediate * 2;
+        if gu_stride_bytes == 0 || gate_up_buf.len() % gu_stride_bytes != 0 {
+            anyhow::bail!(
+                "GLUMoE gate/up weight buffer has {} bytes, not a multiple of per-expert stride {gu_stride_bytes}",
+                gate_up_buf.len()
+            );
+        }
+        let num_experts = gate_up_buf.len() / gu_stride_bytes;
+        if num_experts == 0 {
+            anyhow::bail!("GLUMoE has no expert weights");
+        }
+        if down_buf.len() < num_experts * down_stride_bytes {
+            anyhow::bail!(
+                "GLUMoE down weight buffer too small: have {} bytes, need {}",
+                down_buf.len(),
+                num_experts * down_stride_bytes
+            );
+        }

        // Get raw device pointer addresses
        let x_ptr = buf_ptr(x_buf, stream);
@@ -251,14 +435,62 @@ impl HostOp for GLUMoE {
        let down_ptr = buf_ptr(down_buf, stream);
        let output_ptr = buf_ptr(output_buf, stream);

-        let cublaslt = self.get_cublaslt(stream);
-        let (_, f32_to_bf16_fn, swiglu_fn) = self.get_kernels(stream);
+        let cublaslt = self.get_cublaslt(stream)?;
+        let (_, f32_to_bf16_fn, activation_fn) = self.get_kernels(stream);

-        // Read topk indices and values from GPU
-        let topk_idx_host: Vec<u8> = stream.clone_dtoh(topk_idx_buf)?;
-        let topk_idx_i32: &[i32] = bytemuck::cast_slice(&topk_idx_host);
-        let topk_vals_host: Vec<u8> = stream.clone_dtoh(topk_vals_buf)?;
-        let topk_vals_f32: &[f32] = bytemuck::cast_slice(&topk_vals_host);
+        // Read top-k routing values from GPU
+        let topk_idx_host: Vec<u8> = topk_idx_buf.clone_dtoh(stream)?;
+        let topk_idx_i32: &[i32] = bytemuck::cast_slice(&topk_idx_host[..topk_bytes]);
+        let topk_vals_host: Vec<u8> = topk_vals_buf.clone_dtoh(stream)?;
+        let topk_vals_f32: &[f32] = bytemuck::cast_slice(&topk_vals_host[..topk_bytes]);
+
+        for (pos, &expert_idx) in topk_idx_i32.iter().enumerate() {
+            if expert_idx < 0 || expert_idx as usize >= num_experts {
+                anyhow::bail!(
+                    "GLUMoE expert index {expert_idx} at routing position {pos} out of bounds for {num_experts} experts"
+                );
+            }
+        }
+
+        // Mode-dependent expert weights used for the final reduction:
+        // - SwiGLU: direct topk values
+        // - GemmaGELU: normalize topk values and scale by per-expert factors
+        let mut expert_weights_storage: Vec<f32> = Vec::new();
+        let expert_weights_f32: &[f32] = match self.mode {
+            GLUMoEMode::SwiGLU => topk_vals_f32,
+            GLUMoEMode::GemmaGELU => {
+                let per_expert_scale_host: Vec<u8> = mode_aux_buf.clone_dtoh(stream)?;
+                let per_expert_scale_bytes = num_experts * 4;
+                if per_expert_scale_host.len() < per_expert_scale_bytes {
+                    anyhow::bail!(
+                        "GLUMoE per-expert scale buffer too small: have {} bytes, need {per_expert_scale_bytes}",
+                        per_expert_scale_host.len()
+                    );
+                }
+                let per_expert_scale_f32: &[f32] =
+                    bytemuck::cast_slice(&per_expert_scale_host[..per_expert_scale_bytes]);
+                expert_weights_storage.resize(seq * top_k, 0.0);
+                for t in 0..seq {
+                    let base = t * top_k;
+                    let vals = &topk_vals_f32[base..base + top_k];
+                    let norm = vals.iter().copied().sum::<f32>();
+                    let inv_norm = if norm != 0.0 { norm.recip() } else { 0.0 };
+                    for i in 0..top_k {
+                        let expert_idx = topk_idx_i32[base + i] as usize;
+                        if expert_idx >= per_expert_scale_f32.len() {
+                            anyhow::bail!(
+                                "GLUMoE Gemma mode expert index {} out of bounds {}",
+                                expert_idx,
+                                per_expert_scale_f32.len()
+                            );
+                        }
+                        let scale = per_expert_scale_f32[expert_idx];
+                        expert_weights_storage[base + i] = vals[i] * inv_norm * scale;
+                    }
+                }
+                &expert_weights_storage
+            }
+        };

        // Allocate temp buffers
        let x_bf16_buf = unsafe { stream.alloc::<u8>(seq * hidden * 2)? }; // BF16
@@ -266,10 +498,10 @@ impl HostOp for GLUMoE {
        let hidden_tmp = unsafe { stream.alloc::<u8>(intermediate * 2)? }; // BF16
        let workspace = unsafe { stream.alloc::<u8>(WORKSPACE_SIZE)? };

-        let xbf16_ptr = buf_ptr(&x_bf16_buf, stream);
-        let gu_out_ptr = buf_ptr(&gate_up_out_buf, stream);
-        let hid_ptr = buf_ptr(&hidden_tmp, stream);
-        let ws_ptr = buf_ptr(&workspace, stream);
+        let xbf16_ptr = slice_ptr(&x_bf16_buf, stream);
+        let gu_out_ptr = slice_ptr(&gate_up_out_buf, stream);
+        let hid_ptr = slice_ptr(&hidden_tmp, stream);
+        let ws_ptr = slice_ptr(&workspace, stream);

        // Cast x F32 → BF16
        let n_cast = (seq * hidden) as i32;
@@ -288,25 +520,13 @@ impl HostOp for GLUMoE {
        }

        // Per-token expert computation
-        let gu_stride = (gate_up_dim * hidden * 2) as u64; // bytes per expert gate_up (BF16)
-        let down_stride = (hidden * intermediate * 2) as u64; // bytes per expert down (BF16)
-
-        // Normalize top-k values per token (norm_topk_prob=true)
-        let mut normalized_vals = topk_vals_f32.to_vec();
-        for t in 0..seq {
-            let row = &mut normalized_vals[t * top_k..(t + 1) * top_k];
-            let sum: f32 = row.iter().sum();
-            if sum > 0.0 {
-                for v in row.iter_mut() {
-                    *v /= sum;
-                }
-            }
-        }
+        let gu_stride = gu_stride_bytes as u64; // bytes per expert gate_up (BF16)
+        let down_stride = down_stride_bytes as u64; // bytes per expert down (BF16)

        for t in 0..seq {
            let x_t_ptr = xbf16_ptr + (t * hidden * 2) as u64; // BF16
            let expert_indices = &topk_idx_i32[t * top_k..(t + 1) * top_k];
-            let weights = &normalized_vals[t * top_k..(t + 1) * top_k];
+            let weights = &expert_weights_f32[t * top_k..(t + 1) * top_k];

            for (i, (&expert_idx, &weight)) in expert_indices.iter().zip(weights.iter()).enumerate()
            {
@@ -316,7 +536,7 @@ impl HostOp for GLUMoE {
                let expert_gu_ptr = gate_up_ptr + expert_idx as u64 * gu_stride;
                cublas_matmul(
                    stream,
-                    cublaslt,
+                    &cublaslt,
                    ws_ptr,
                    gate_up_dim as u64,
                    1,
@@ -335,17 +555,19 @@ impl HostOp for GLUMoE {
                    0.0f32,
                )?;

-                // b. SwiGLU kernel (BF16 → BF16)
+                // b. Mode-specific gated activation (BF16 → BF16)
                let moe_int = intermediate as i32;
-                let swiglu_blocks = (moe_int as u32).div_ceil(256);
+                let activation_mode = self.mode.activation_kernel_mode();
+                let activation_blocks = (moe_int as u32).div_ceil(256);
                unsafe {
                    stream
-                        .launch_builder(swiglu_fn)
+                        .launch_builder(activation_fn)
                        .arg(&gu_out_ptr)
                        .arg(&hid_ptr)
                        .arg(&moe_int)
+                        .arg(&activation_mode)
                        .launch(LaunchConfig {
-                            grid_dim: (swiglu_blocks, 1, 1),
+                            grid_dim: (activation_blocks, 1, 1),
                            block_dim: (256, 1, 1),
                            shared_mem_bytes: 0,
                        })?;
@@ -358,7 +580,7 @@ impl HostOp for GLUMoE {
                let beta = if i == 0 { 0.0f32 } else { 1.0f32 };
                cublas_matmul_mixed(
                    stream,
-                    cublaslt,
+                    &cublaslt,
                    ws_ptr,
                    hidden as u64,
                    1,
@@ -401,7 +623,11 @@ impl HostOp for GLUMoE {
 // Helpers
 // ============================================================

-fn buf_ptr(buf: &CudaSlice<u8>, stream: &Arc<CudaStream>) -> u64 {
+fn buf_ptr(buf: DeviceBuffer, _stream: &Arc<CudaStream>) -> u64 {
+    buf.ptr()
+}
+
+fn slice_ptr(buf: &CudaSlice<u8>, stream: &Arc<CudaStream>) -> u64 {
    let (ptr, _guard) = buf.device_ptr(stream);
    ptr
 }
--- a/crates/luminal_cuda_lite/src/kernel/cuda_graph.rs
+++ b/crates/luminal_cuda_lite/src/kernel/cuda_graph.rs
@@ -653,4 +653,53 @@ mod tests {
        }
        assert_close(&rt.get_f32(output), &expected, 1e-2, 1e-2);
    }
+
+    /// Test that CUDA graphs produce correct results when dynamic dimensions
+    /// change incrementally across many executions (simulating a decode loop
+    /// where position offset increments each step).
+    #[test]
+    fn test_cuda_graph_incremental_dim_changes() {
+        let Some(stream) = get_cuda_stream() else {
+            return;
+        };
+        let mut cx = Graph::default();
+        let a = cx.tensor('s');
+        let b = cx.tensor('s');
+        let c = ((a + b) * a).output();
+
+        let initial_size = 128;
+        cx.set_dim('s', initial_size);
+        let mut rt = CudaRuntime::initialize(stream);
+        let data_a = random_f32_vec(initial_size, 42, -0.5, 0.5);
+        let data_b = random_f32_vec(initial_size, 43, -0.5, 0.5);
+        rt.set_data(a, data_a.clone());
+        rt.set_data(b, data_b.clone());
+        cx.build_search_space::<CudaRuntime>();
+        rt = cx.search(rt, 5);
+
+        // Initial execution
+        rt.execute(&cx.dyn_map);
+        let eps = dtype_epsilon(luminal::dtype::DType::F32);
+        let tol = eps * TOLERANCE_SAFETY_FACTOR;
+        let expected: Vec<f32> = data_a
+            .iter()
+            .zip(&data_b)
+            .map(|(a, b)| (a + b) * a)
+            .collect();
+        assert_close(&rt.get_f32(c), &expected, tol, tol);
+
+        // Incrementally change the dynamic dimension 10 times,
+        // simulating decode steps where position offset grows.
+        for step in 1..=10usize {
+            let size = initial_size + step;
+            cx.set_dim('s', size);
+            let da = random_f32_vec(size, 100 + step as u64, -0.5, 0.5);
+            let db = random_f32_vec(size, 200 + step as u64, -0.5, 0.5);
+            rt.set_data(a, da.clone());
+            rt.set_data(b, db.clone());
+            rt.execute(&cx.dyn_map);
+            let expected: Vec<f32> = da.iter().zip(&db).map(|(a, b)| (a + b) * a).collect();
+            assert_close(&rt.get_f32(c), &expected, tol, tol);
+        }
+    }
 }
--- a/crates/luminal_cuda_lite/src/kernel/fusion/fused_ops.rs
+++ b/crates/luminal_cuda_lite/src/kernel/fusion/fused_ops.rs
@@ -0,0 +1,301 @@
+// =========================================================================
+// Fused elementwise op variants used inside FusionStart/FusionEnd regions.
+//
+// Each `FusedX` struct mirrors its un-fused `KernelX` sibling field-for-field
+// and serves a single purpose: give the egglog rules a distinct sort to
+// rewrite into so a pair-fuse rule's RHS can never re-match its own LHS
+// pattern. Cascade prevention by typing.
+//
+// Each FusedX must be absorbed into a FusionEnd-rooted region and compiled by
+// `region_codegen`; standalone compilation is intentionally unsupported.
+// =========================================================================
+
+use std::sync::Arc;
+
+use cudarc::driver::{CudaFunction, CudaModule, CudaSlice, CudaStream};
+use luminal::{
+    egglog_utils::{
+        api::{Rule, SortDef, sort},
+        base::{DTYPE, ELIST, OP_KIND},
+        extract_dtype, extract_expr_list,
+    },
+    op::*,
+    prelude::*,
+};
+
+use crate::kernel::KernelOp;
+
+pub type Ops = (
+    FusedSin,
+    FusedSqrt,
+    FusedExp,
+    FusedExp2,
+    FusedLog2,
+    FusedRecip,
+    FusedAdd,
+    FusedMul,
+);
+
+// Standard `compile()` return tuple (matches the trait signature).
+type CompileOut = (
+    CudaFunction,
+    Arc<CudaModule>,
+    String,
+    (Expression, Expression, Expression),
+    (Expression, Expression, Expression),
+    Expression,
+    FxHashMap<char, CudaSlice<u8>>,
+);
+
+/// Generate `pub struct $Name { … unary fields … }` plus its `EgglogOp` and
+/// `KernelOp` impls. `$kernel_name` names the CUDA function (and the cache
+/// key); `$body` is the per-op CUDA expression, e.g. `"sinf(in[{in_idx}])"`.
+macro_rules! impl_fused_unary {
+    ($Name:ident, $sort:literal, $kernel_name:literal, $body:literal) => {
+        #[derive(Default, Debug, Clone)]
+        pub struct $Name {
+            pub(crate) shape: Vec<Expression>,
+            pub(crate) in_strides: Vec<Expression>,
+            pub(crate) out_strides: Vec<Expression>,
+            pub(crate) dtype: DType,
+        }
+
+        impl EgglogOp for $Name {
+            fn sort(&self) -> SortDef {
+                sort(
+                    OP_KIND,
+                    $sort,
+                    &[
+                        ("shape", ELIST),
+                        ("strides", ELIST),
+                        ("out_strides", ELIST),
+                        ("dtype", DTYPE),
+                    ],
+                )
+            }
+            fn n_inputs(&self) -> usize {
+                1
+            }
+            fn rewrites(&self) -> Vec<Rule> {
+                Vec::new()
+            }
+            fn cleanup(&self) -> bool {
+                false
+            }
+            fn extract<'a>(
+                &'a self,
+                egraph: &'a SerializedEGraph,
+                kind_children: &[&'a ENodeId],
+                input_enodes: Vec<&'a ENodeId>,
+                list_cache: &mut FxHashMap<&'a ENodeId, Vec<Expression>>,
+                expr_cache: &mut FxHashMap<&'a ENodeId, Expression>,
+            ) -> (LLIROp, Vec<&'a ENodeId>) {
+                (
+                    LLIROp::new::<dyn KernelOp>(Box::new(Self {
+                        shape: extract_expr_list(egraph, kind_children[0], list_cache, expr_cache)
+                            .unwrap(),
+                        in_strides: extract_expr_list(
+                            egraph,
+                            kind_children[1],
+                            list_cache,
+                            expr_cache,
+                        )
+                        .unwrap(),
+                        out_strides: extract_expr_list(
+                            egraph,
+                            kind_children[2],
+                            list_cache,
+                            expr_cache,
+                        )
+                        .unwrap(),
+                        dtype: extract_dtype(egraph, kind_children[3]),
+                    })),
+                    input_enodes,
+                )
+            }
+        }
+
+        impl KernelOp for $Name {
+            fn compile(
+                &self,
+                _stream: &Arc<CudaStream>,
+                _compile_cache: &mut FxHashMap<String, (Arc<CudaModule>, CudaFunction)>,
+            ) -> CompileOut {
+                unreachable!(concat!(
+                    $sort,
+                    " must be compiled through fusion region codegen"
+                ))
+            }
+            fn output_size(&self) -> Expression {
+                self.shape.iter().copied().product()
+            }
+            fn output_bytes(&self) -> Expression {
+                (self.output_size() * self.dtype.bits()).ceil_div(8)
+            }
+            fn bytes_loaded(&self) -> Expression {
+                self.output_bytes()
+            }
+            fn bytes_stored(&self) -> Expression {
+                self.output_bytes()
+            }
+            fn flops(&self) -> Expression {
+                self.shape.iter().copied().product()
+            }
+            fn output_dtype(&self) -> DType {
+                self.dtype
+            }
+            fn kernel_name(&self) -> &'static str {
+                $sort
+            }
+        }
+    };
+}
+
+/// As `impl_fused_unary!` but for binary ops: 5-field sort signature
+/// (shape + per-input strides + out_stride + dtype), n_inputs = 2.
+/// `$op_str` is the CUDA infix operator, e.g. `"+"`, `"*"`.
+macro_rules! impl_fused_binary {
+    ($Name:ident, $sort:literal, $kernel_name:literal, $op_str:literal) => {
+        #[derive(Default, Debug, Clone)]
+        pub struct $Name {
+            pub(crate) out_shape: Vec<Expression>,
+            pub(crate) a_stride: Vec<Expression>,
+            pub(crate) b_stride: Vec<Expression>,
+            pub(crate) out_stride: Vec<Expression>,
+            pub(crate) dtype: DType,
+        }
+
+        impl EgglogOp for $Name {
+            fn sort(&self) -> SortDef {
+                sort(
+                    OP_KIND,
+                    $sort,
+                    &[
+                        ("shape", ELIST),
+                        ("a_strides", ELIST),
+                        ("b_strides", ELIST),
+                        ("out_strides", ELIST),
+                        ("dtype", DTYPE),
+                    ],
+                )
+            }
+            fn n_inputs(&self) -> usize {
+                2
+            }
+            fn rewrites(&self) -> Vec<Rule> {
+                Vec::new()
+            }
+            fn cleanup(&self) -> bool {
+                false
+            }
+            fn extract<'a>(
+                &'a self,
+                egraph: &'a SerializedEGraph,
+                kind_children: &[&'a ENodeId],
+                input_enodes: Vec<&'a ENodeId>,
+                list_cache: &mut FxHashMap<&'a ENodeId, Vec<Expression>>,
+                expr_cache: &mut FxHashMap<&'a ENodeId, Expression>,
+            ) -> (LLIROp, Vec<&'a ENodeId>) {
+                (
+                    LLIROp::new::<dyn KernelOp>(Box::new(Self {
+                        out_shape: extract_expr_list(
+                            egraph,
+                            kind_children[0],
+                            list_cache,
+                            expr_cache,
+                        )
+                        .unwrap(),
+                        a_stride: extract_expr_list(
+                            egraph,
+                            kind_children[1],
+                            list_cache,
+                            expr_cache,
+                        )
+                        .unwrap(),
+                        b_stride: extract_expr_list(
+                            egraph,
+                            kind_children[2],
+                            list_cache,
+                            expr_cache,
+                        )
+                        .unwrap(),
+                        out_stride: extract_expr_list(
+                            egraph,
+                            kind_children[3],
+                            list_cache,
+                            expr_cache,
+                        )
+                        .unwrap(),
+                        dtype: extract_dtype(egraph, kind_children[4]),
+                    })),
+                    input_enodes,
+                )
+            }
+        }
+
+        impl KernelOp for $Name {
+            fn compile(
+                &self,
+                _stream: &Arc<CudaStream>,
+                _compile_cache: &mut FxHashMap<String, (Arc<CudaModule>, CudaFunction)>,
+            ) -> CompileOut {
+                unreachable!(concat!(
+                    $sort,
+                    " must be compiled through fusion region codegen"
+                ))
+            }
+            fn output_size(&self) -> Expression {
+                self.out_shape.iter().copied().product()
+            }
+            fn output_bytes(&self) -> Expression {
+                (self.output_size() * self.dtype.bits()).ceil_div(8)
+            }
+            fn bytes_loaded(&self) -> Expression {
+                let bytes = (self.output_size() * self.dtype.bits()).ceil_div(8);
+                bytes + bytes
+            }
+            fn bytes_stored(&self) -> Expression {
+                self.output_bytes()
+            }
+            fn flops(&self) -> Expression {
+                self.out_shape.iter().copied().product()
+            }
+            fn output_dtype(&self) -> DType {
+                self.dtype
+            }
+            fn kernel_name(&self) -> &'static str {
+                $sort
+            }
+        }
+    };
+}
+
+impl_fused_unary!(FusedSin, "FusedSin", "fused_sin_k", "sinf(in[{in_idx}])");
+impl_fused_unary!(
+    FusedSqrt,
+    "FusedSqrt",
+    "fused_sqrt_k",
+    "sqrtf(in[{in_idx}])"
+);
+impl_fused_unary!(FusedExp, "FusedExp", "fused_exp_k", "expf(in[{in_idx}])");
+impl_fused_unary!(
+    FusedExp2,
+    "FusedExp2",
+    "fused_exp2_k",
+    "exp2f(in[{in_idx}])"
+);
+impl_fused_unary!(
+    FusedLog2,
+    "FusedLog2",
+    "fused_log2_k",
+    "log2f(in[{in_idx}])"
+);
+impl_fused_unary!(
+    FusedRecip,
+    "FusedRecip",
+    "fused_recip_k",
+    "1.0f / in[{in_idx}]"
+);
+
+impl_fused_binary!(FusedAdd, "FusedAdd", "fused_add_k", "+");
+impl_fused_binary!(FusedMul, "FusedMul", "fused_mul_k", "*");
--- a/crates/luminal_cuda_lite/src/kernel/fusion/markers.rs
+++ b/crates/luminal_cuda_lite/src/kernel/fusion/markers.rs
@@ -0,0 +1,413 @@
+// =========================================================================
+// Fusion boundary markers — FusionStart and FusionEnd.
+//
+// Tag-like LLIR ops that bracket a region of elementwise ops destined to
+// be emitted as a single CUDA kernel:
+//   - N FusionStart nodes per region (one per FS leaf — distinct external
+//     reads),
+//   - exactly 1 FusionEnd per region.
+//
+// `FusionEnd::rewrites()` carries the seven rule families that build and
+// extend regions (pair-fuse / grow / merge); the actual single-kernel
+// codegen lives in `region_codegen`. Like FusedX, both markers'
+// `compile()` is `unreachable!()` — region codegen folds them away
+// before kernel_to_host's compile loop reaches an interior node.
+// =========================================================================
+
+use std::sync::Arc;
+
+use cudarc::driver::{CudaFunction, CudaModule, CudaSlice, CudaStream};
+use luminal::{
+    egglog_utils::{
+        api::{Rule, SortDef, sort},
+        base::{DTYPE, ELIST, OP_KIND},
+        extract_dtype, extract_expr_list,
+    },
+    op::*,
+    prelude::*,
+};
+
+use crate::kernel::KernelOp;
+
+pub type Ops = (FusionStart, FusionEnd);
+
+type CompileOut = (
+    CudaFunction,
+    Arc<CudaModule>,
+    String,
+    (Expression, Expression, Expression),
+    (Expression, Expression, Expression),
+    Expression,
+    FxHashMap<char, CudaSlice<u8>>,
+);
+
+// =========================================================================
+// FusionStart
+// =========================================================================
+
+#[derive(Default, Debug, Clone)]
+pub struct FusionStart {
+    pub(crate) shape: Vec<Expression>,
+    pub(crate) strides: Vec<Expression>,
+    pub(crate) dtype: DType,
+}
+
+impl EgglogOp for FusionStart {
+    fn sort(&self) -> SortDef {
+        sort(
+            OP_KIND,
+            "FusionStart",
+            &[("shape", ELIST), ("strides", ELIST), ("dtype", DTYPE)],
+        )
+    }
+    fn n_inputs(&self) -> usize {
+        1
+    }
+    fn rewrites(&self) -> Vec<Rule> {
+        // No idempotence rule. `FusionStart(FusionStart(x)) ≡ FusionStart(x)`
+        // would unify nested markers and create eclass cycles via the
+        // pair-fuse rules; without it, occasional re-firings produce extra
+        // semantically-correct identity layers, bounded by the run schedule.
+        Vec::new()
+    }
+    fn cleanup(&self) -> bool {
+        false
+    }
+    fn extract<'a>(
+        &'a self,
+        egraph: &'a SerializedEGraph,
+        kind_children: &[&'a ENodeId],
+        input_enodes: Vec<&'a ENodeId>,
+        list_cache: &mut FxHashMap<&'a ENodeId, Vec<Expression>>,
+        expr_cache: &mut FxHashMap<&'a ENodeId, Expression>,
+    ) -> (LLIROp, Vec<&'a ENodeId>) {
+        (
+            LLIROp::new::<dyn KernelOp>(Box::new(Self {
+                shape: extract_expr_list(egraph, kind_children[0], list_cache, expr_cache).unwrap(),
+                strides: extract_expr_list(egraph, kind_children[1], list_cache, expr_cache)
+                    .unwrap(),
+                dtype: extract_dtype(egraph, kind_children[2]),
+            })),
+            input_enodes,
+        )
+    }
+}
+
+impl KernelOp for FusionStart {
+    fn compile(
+        &self,
+        _stream: &Arc<CudaStream>,
+        _compile_cache: &mut FxHashMap<String, (Arc<CudaModule>, CudaFunction)>,
+    ) -> CompileOut {
+        unreachable!("FusionStart must be compiled through fusion region codegen")
+    }
+    fn output_size(&self) -> Expression {
+        self.shape.iter().copied().product()
+    }
+    fn output_bytes(&self) -> Expression {
+        (self.output_size() * self.dtype.bits()).ceil_div(8)
+    }
+    fn output_dtype(&self) -> DType {
+        self.dtype
+    }
+    fn kernel_name(&self) -> &'static str {
+        "FusionStart"
+    }
+    fn output_aliases_input(&self) -> Option<usize> {
+        Some(0)
+    }
+}
+
+// =========================================================================
+// FusionEnd
+// =========================================================================
+
+#[derive(Default, Debug, Clone)]
+pub struct FusionEnd {
+    pub(crate) shape: Vec<Expression>,
+    pub(crate) strides: Vec<Expression>,
+    pub(crate) dtype: DType,
+}
+
+impl EgglogOp for FusionEnd {
+    fn sort(&self) -> SortDef {
+        sort(
+            OP_KIND,
+            "FusionEnd",
+            &[("shape", ELIST), ("strides", ELIST), ("dtype", DTYPE)],
+        )
+    }
+    fn n_inputs(&self) -> usize {
+        1
+    }
+
+    fn rewrites(&self) -> Vec<Rule> {
+        // Seven rule families build and extend FE-bracketed regions. Each
+        // pair-fuse rule's LHS pattern matches *un-fused* `KernelX` ops; the
+        // RHS produces `FusedX` variants in a different egglog sort, so the
+        // rule's own output cannot re-match its LHS — cascade is prevented
+        // by typing rather than by a discriminator field.
+        //
+        // Stride compatibility is expressed by reusing variable names: a
+        // unary inside a region matches `(KernelU ?shape ?s ?s ?dt)` (in =
+        // out, no transpose); a binary feeding a downstream op binds the
+        // binary's out-stride to the downstream op's in-stride along the
+        // connecting side.
+        let mut rules = Vec::new();
+
+        // (KernelX kind, FusedX kind)
+        let unaries: &[(&str, &str)] = &[
+            ("KernelSin", "FusedSin"),
+            ("KernelSqrt", "FusedSqrt"),
+            ("KernelExp", "FusedExp"),
+            ("KernelExp2", "FusedExp2"),
+            ("KernelLog2", "FusedLog2"),
+            ("KernelRecip", "FusedRecip"),
+        ];
+        // (KernelX kind, FusedX kind, rule-name label)
+        let binaries: &[(&str, &str, &str)] = &[
+            ("KernelAdd", "FusedAdd", "Add"),
+            ("KernelMul", "FusedMul", "Mul"),
+        ];
+
+        // 1. Pair-fuse U → U: U2(U1(x)) → FE(FU2(FU1(FS(x)))).
+        for (ki1, fi1) in unaries {
+            for (ko2, fo2) in unaries {
+                rules.push(Rule::raw(format!(
+                    "(rule (
+                        (= ?u1 (Op ({ki1} ?shape ?s ?s ?dt) (ICons ?x (INil))))
+                        (= ?u2 (Op ({ko2} ?shape ?s ?s ?dt) (ICons ?u1 (INil))))
+                     ) (
+                        (let ?fs (Op (FusionStart ?shape ?s ?dt) (ICons ?x (INil))))
+                        (let ?fu1 (Op ({fi1} ?shape ?s ?s ?dt) (ICons ?fs (INil))))
+                        (let ?fu2 (Op ({fo2} ?shape ?s ?s ?dt) (ICons ?fu1 (INil))))
+                        (let ?fe (Op (FusionEnd ?shape ?s ?dt) (ICons ?fu2 (INil))))
+                        (union ?u2 ?fe)
+                     ) :ruleset fusion_pair :name \"pair-fuse-U-U-{ki1}-{ko2}\")"
+                )));
+            }
+        }
+
+        // 2. Pair-fuse B → U: U(B(a, b)) → FE(FU(FB(FS(a), FS(b)))).
+        for (kb, fb, lb) in binaries {
+            for (ku, fu) in unaries {
+                rules.push(Rule::raw(format!(
+                    "(rule (
+                        (= ?bin (Op ({kb} ?shape ?a_s ?b_s ?o_s ?dt)
+                                     (ICons ?a (ICons ?b (INil)))))
+                        (= ?u (Op ({ku} ?shape ?o_s ?o_s ?dt) (ICons ?bin (INil))))
+                     ) (
+                        (let ?fs_a (Op (FusionStart ?shape ?a_s ?dt) (ICons ?a (INil))))
+                        (let ?fs_b (Op (FusionStart ?shape ?b_s ?dt) (ICons ?b (INil))))
+                        (let ?fbin (Op ({fb} ?shape ?a_s ?b_s ?o_s ?dt)
+                                       (ICons ?fs_a (ICons ?fs_b (INil)))))
+                        (let ?fu (Op ({fu} ?shape ?o_s ?o_s ?dt) (ICons ?fbin (INil))))
+                        (let ?fe (Op (FusionEnd ?shape ?o_s ?dt) (ICons ?fu (INil))))
+                        (union ?u ?fe)
+                     ) :ruleset fusion_pair :name \"pair-fuse-B-U-{lb}-{ku}\")"
+                )));
+            }
+        }
+
+        // 3. Pair-fuse U → B (lhs / rhs): unary feeds binary's A or B input.
+        //    LHS:  B(U(a), b) → FE(FB(FU(FS(a)), FS(b))).
+        //    RHS:  B(a, U(b)) → FE(FB(FS(a), FU(FS(b)))).
+        for (ku, fu) in unaries {
+            for (kb, fb, lb) in binaries {
+                rules.push(Rule::raw(format!(
+                    "(rule (
+                        (= ?u (Op ({ku} ?shape ?u_s ?u_s ?dt) (ICons ?a (INil))))
+                        (= ?bin (Op ({kb} ?shape ?u_s ?b_s ?o_s ?dt)
+                                     (ICons ?u (ICons ?b (INil)))))
+                     ) (
+                        (let ?fs_a (Op (FusionStart ?shape ?u_s ?dt) (ICons ?a (INil))))
+                        (let ?fs_b (Op (FusionStart ?shape ?b_s ?dt) (ICons ?b (INil))))
+                        (let ?fu (Op ({fu} ?shape ?u_s ?u_s ?dt) (ICons ?fs_a (INil))))
+                        (let ?fbin (Op ({fb} ?shape ?u_s ?b_s ?o_s ?dt)
+                                       (ICons ?fu (ICons ?fs_b (INil)))))
+                        (let ?fe (Op (FusionEnd ?shape ?o_s ?dt) (ICons ?fbin (INil))))
+                        (union ?bin ?fe)
+                     ) :ruleset fusion_pair :name \"pair-fuse-U-B-lhs-{ku}-{lb}\")"
+                )));
+                rules.push(Rule::raw(format!(
+                    "(rule (
+                        (= ?u (Op ({ku} ?shape ?u_s ?u_s ?dt) (ICons ?b (INil))))
+                        (= ?bin (Op ({kb} ?shape ?a_s ?u_s ?o_s ?dt)
+                                     (ICons ?a (ICons ?u (INil)))))
+                     ) (
+                        (let ?fs_a (Op (FusionStart ?shape ?a_s ?dt) (ICons ?a (INil))))
+                        (let ?fs_b (Op (FusionStart ?shape ?u_s ?dt) (ICons ?b (INil))))
+                        (let ?fu (Op ({fu} ?shape ?u_s ?u_s ?dt) (ICons ?fs_b (INil))))
+                        (let ?fbin (Op ({fb} ?shape ?a_s ?u_s ?o_s ?dt)
+                                       (ICons ?fs_a (ICons ?fu (INil)))))
+                        (let ?fe (Op (FusionEnd ?shape ?o_s ?dt) (ICons ?fbin (INil))))
+                        (union ?bin ?fe)
+                     ) :ruleset fusion_pair :name \"pair-fuse-U-B-rhs-{ku}-{lb}\")"
+                )));
+            }
+        }
+
+        // 4. Pair-fuse B → B (lhs / rhs): inner binary feeds outer's A or B.
+        for (kbi, fbi, lbi) in binaries {
+            for (kbo, fbo, lbo) in binaries {
+                rules.push(Rule::raw(format!(
+                    "(rule (
+                        (= ?bi (Op ({kbi} ?shape ?ai_s ?bi_s ?oi_s ?dt)
+                                    (ICons ?a (ICons ?b (INil)))))
+                        (= ?bo (Op ({kbo} ?shape ?oi_s ?co_s ?oo_s ?dt)
+                                    (ICons ?bi (ICons ?c (INil)))))
+                     ) (
+                        (let ?fs_a (Op (FusionStart ?shape ?ai_s ?dt) (ICons ?a (INil))))
+                        (let ?fs_b (Op (FusionStart ?shape ?bi_s ?dt) (ICons ?b (INil))))
+                        (let ?fs_c (Op (FusionStart ?shape ?co_s ?dt) (ICons ?c (INil))))
+                        (let ?fbi (Op ({fbi} ?shape ?ai_s ?bi_s ?oi_s ?dt)
+                                       (ICons ?fs_a (ICons ?fs_b (INil)))))
+                        (let ?fbo (Op ({fbo} ?shape ?oi_s ?co_s ?oo_s ?dt)
+                                       (ICons ?fbi (ICons ?fs_c (INil)))))
+                        (let ?fe (Op (FusionEnd ?shape ?oo_s ?dt) (ICons ?fbo (INil))))
+                        (union ?bo ?fe)
+                     ) :ruleset fusion_pair :name \"pair-fuse-B-B-lhs-{lbi}-{lbo}\")"
+                )));
+                rules.push(Rule::raw(format!(
+                    "(rule (
+                        (= ?bi (Op ({kbi} ?shape ?ai_s ?bi_s ?oi_s ?dt)
+                                    (ICons ?a (ICons ?b (INil)))))
+                        (= ?bo (Op ({kbo} ?shape ?co_s ?oi_s ?oo_s ?dt)
+                                    (ICons ?c (ICons ?bi (INil)))))
+                     ) (
+                        (let ?fs_a (Op (FusionStart ?shape ?ai_s ?dt) (ICons ?a (INil))))
+                        (let ?fs_b (Op (FusionStart ?shape ?bi_s ?dt) (ICons ?b (INil))))
+                        (let ?fs_c (Op (FusionStart ?shape ?co_s ?dt) (ICons ?c (INil))))
+                        (let ?fbi (Op ({fbi} ?shape ?ai_s ?bi_s ?oi_s ?dt)
+                                       (ICons ?fs_a (ICons ?fs_b (INil)))))
+                        (let ?fbo (Op ({fbo} ?shape ?co_s ?oi_s ?oo_s ?dt)
+                                       (ICons ?fs_c (ICons ?fbi (INil)))))
+                        (let ?fe (Op (FusionEnd ?shape ?oo_s ?dt) (ICons ?fbo (INil))))
+                        (union ?bo ?fe)
+                     ) :ruleset fusion_pair :name \"pair-fuse-B-B-rhs-{lbi}-{lbo}\")"
+                )));
+            }
+        }
+
+        // 5. Grow FE → U: U(FE(inner)) → FE(FU(inner)). No new FS.
+        for (ku, fu) in unaries {
+            rules.push(Rule::raw(format!(
+                "(rule (
+                    (= ?fe (Op (FusionEnd ?shape ?s ?dt) (ICons ?inner (INil))))
+                    (= ?u (Op ({ku} ?shape ?s ?s ?dt) (ICons ?fe (INil))))
+                 ) (
+                    (let ?fu (Op ({fu} ?shape ?s ?s ?dt) (ICons ?inner (INil))))
+                    (let ?new_fe (Op (FusionEnd ?shape ?s ?dt) (ICons ?fu (INil))))
+                    (union ?u ?new_fe)
+                 ) :ruleset fusion_grow :name \"grow-FE-U-{ku}\")"
+            )));
+        }
+
+        // 6. Grow FE → B (lhs / rhs): one input is the FE, the other external.
+        for (kb, fb, lb) in binaries {
+            rules.push(Rule::raw(format!(
+                "(rule (
+                    (= ?fe (Op (FusionEnd ?shape ?a_s ?dt) (ICons ?inner_a (INil))))
+                    (= ?bin (Op ({kb} ?shape ?a_s ?b_s ?o_s ?dt)
+                                 (ICons ?fe (ICons ?b (INil)))))
+                 ) (
+                    (let ?fs_b (Op (FusionStart ?shape ?b_s ?dt) (ICons ?b (INil))))
+                    (let ?fbin (Op ({fb} ?shape ?a_s ?b_s ?o_s ?dt)
+                                   (ICons ?inner_a (ICons ?fs_b (INil)))))
+                    (let ?new_fe (Op (FusionEnd ?shape ?o_s ?dt) (ICons ?fbin (INil))))
+                    (union ?bin ?new_fe)
+                 ) :ruleset fusion_grow :name \"grow-FE-B-lhs-{lb}\")"
+            )));
+            rules.push(Rule::raw(format!(
+                "(rule (
+                    (= ?fe (Op (FusionEnd ?shape ?b_s ?dt) (ICons ?inner_b (INil))))
+                    (= ?bin (Op ({kb} ?shape ?a_s ?b_s ?o_s ?dt)
+                                 (ICons ?a (ICons ?fe (INil)))))
+                 ) (
+                    (let ?fs_a (Op (FusionStart ?shape ?a_s ?dt) (ICons ?a (INil))))
+                    (let ?fbin (Op ({fb} ?shape ?a_s ?b_s ?o_s ?dt)
+                                   (ICons ?fs_a (ICons ?inner_b (INil)))))
+                    (let ?new_fe (Op (FusionEnd ?shape ?o_s ?dt) (ICons ?fbin (INil))))
+                    (union ?bin ?new_fe)
+                 ) :ruleset fusion_grow :name \"grow-FE-B-rhs-{lb}\")"
+            )));
+        }
+
+        // 7. Merge two FEs at a binary: B(FE(ia), FE(ib)) → FE(FB(ia, ib)).
+        //
+        // This is destructive: after creating the larger region, subsume the
+        // two smaller FusionEnd rows. Without that, independently-grown left
+        // and right regions form a Cartesian product, then those alternatives
+        // can merge again higher in the graph.
+        for (kb, fb, lb) in binaries {
+            rules.push(Rule::raw(format!(
+                "(rule (
+                    (= ?fe_a (Op (FusionEnd ?shape ?a_s ?dt) (ICons ?inner_a (INil))))
+                    (= ?fe_b (Op (FusionEnd ?shape ?b_s ?dt) (ICons ?inner_b (INil))))
+                    (= ?bin (Op ({kb} ?shape ?a_s ?b_s ?o_s ?dt)
+                                 (ICons ?fe_a (ICons ?fe_b (INil)))))
+                 ) (
+                    (let ?fbin (Op ({fb} ?shape ?a_s ?b_s ?o_s ?dt)
+                                   (ICons ?inner_a (ICons ?inner_b (INil)))))
+                    (let ?new_fe (Op (FusionEnd ?shape ?o_s ?dt) (ICons ?fbin (INil))))
+                    (union ?bin ?new_fe)
+                    (subsume (Op (FusionEnd ?shape ?a_s ?dt) (ICons ?inner_a (INil))))
+                    (subsume (Op (FusionEnd ?shape ?b_s ?dt) (ICons ?inner_b (INil))))
+                 ) :ruleset fusion_merge :name \"merge-FE-FE-{lb}\")"
+            )));
+        }
+
+        // No dissolve rule (`FS(FE(x)) → x`): unioning FS's eclass with FE's
+        // inner eclass creates self-referential eclasses after grow rules
+        // extend the downstream region, and extraction then panics with
+        // `Cycle(NodeIndex(_))`. Grow rules already compose adjacent regions
+        // correctly without dissolve.
+
+        rules
+    }
+
+    fn cleanup(&self) -> bool {
+        false
+    }
+
+    fn extract<'a>(
+        &'a self,
+        egraph: &'a SerializedEGraph,
+        kind_children: &[&'a ENodeId],
+        input_enodes: Vec<&'a ENodeId>,
+        list_cache: &mut FxHashMap<&'a ENodeId, Vec<Expression>>,
+        expr_cache: &mut FxHashMap<&'a ENodeId, Expression>,
+    ) -> (LLIROp, Vec<&'a ENodeId>) {
+        (
+            LLIROp::new::<dyn KernelOp>(Box::new(Self {
+                shape: extract_expr_list(egraph, kind_children[0], list_cache, expr_cache).unwrap(),
+                strides: extract_expr_list(egraph, kind_children[1], list_cache, expr_cache)
+                    .unwrap(),
+                dtype: extract_dtype(egraph, kind_children[2]),
+            })),
+            input_enodes,
+        )
+    }
+}
+
+impl KernelOp for FusionEnd {
+    fn compile(
+        &self,
+        _stream: &Arc<CudaStream>,
+        _compile_cache: &mut FxHashMap<String, (Arc<CudaModule>, CudaFunction)>,
+    ) -> CompileOut {
+        unreachable!("FusionEnd must be compiled through fusion region codegen")
+    }
+    fn output_size(&self) -> Expression {
+        self.shape.iter().copied().product()
+    }
+    fn output_bytes(&self) -> Expression {
+        (self.output_size() * self.dtype.bits()).ceil_div(8)
+    }
+    fn output_dtype(&self) -> DType {
+        self.dtype
+    }
+    fn kernel_name(&self) -> &'static str {
+        "FusionEnd"
+    }
+}
--- a/crates/luminal_cuda_lite/src/kernel/fusion/mod.rs
+++ b/crates/luminal_cuda_lite/src/kernel/fusion/mod.rs
@@ -0,0 +1,26 @@
+//! Binary-inclusive elementwise kernel fusion.
+//!
+//! - `markers` — `FusionStart` / `FusionEnd` ops + the seven egglog rule
+//!   families that build and extend FE-bracketed regions.
+//! - `fused_ops` — eight `FusedX` op variants (interior to a region) so
+//!   pair-fuse rules' RHS sit in a different egglog sort than their LHS,
+//!   blocking cascade by typing.
+//! - `region_codegen` — `kernel_to_host` calls into here to collapse each
+//!   FE-rooted region into a single CUDA kernel at compile time.
+//!
+//! The LLIR keeps `FusionStart` / `FusedX` / `FusionEnd` nodes after
+//! extraction; `region_codegen` is the only place that walks them.
+
+pub mod fused_ops;
+pub mod markers;
+pub mod region_codegen;
+
+pub use fused_ops::{
+    FusedAdd, FusedExp, FusedExp2, FusedLog2, FusedMul, FusedRecip, FusedSin, FusedSqrt,
+};
+pub use markers::{FusionEnd, FusionStart};
+
+/// All fusion-related op types that the egglog runtime needs to know about
+/// (markers + interior FusedX variants). Combined into a flat tuple for the
+/// `Ops` registry in `kernel::mod`.
+pub type Ops = (markers::Ops, fused_ops::Ops);
--- a/crates/luminal_cuda_lite/src/kernel/fusion/region_codegen.rs
+++ b/crates/luminal_cuda_lite/src/kernel/fusion/region_codegen.rs
@@ -0,0 +1,476 @@
+// =========================================================================
+// Region codegen for FusionStart / FusionEnd-bracketed fused regions.
+//
+// PR1 left FusedX / FusionStart / FusionEnd nodes in the post-extraction
+// LLIR, each compiling to its own standalone CUDA kernel. PR2 collapses
+// every FusionEnd-rooted region into ONE fused CUDA kernel at codegen
+// time — without rewriting the LLIR.
+//
+// Pipeline:
+//   `kernel_to_host` builds a Vec<CompileUnit> from the topo order:
+//     - CompileUnit::Single(node)  — un-fused KernelX, compiled as before.
+//     - CompileUnit::Region(rgn)   — one FE + its interior FusedX DAG +
+//                                    its FS leaves. Compiled here as a
+//                                    single CUDA kernel that reads from
+//                                    the region's external inputs once,
+//                                    chains all FusedX bodies through
+//                                    register-resident locals, and writes
+//                                    the FE's output.
+//
+// The CompiledKernel for a Region is keyed on the FE node and stores
+// `inputs = external producer NodeIndices` (one per interior FusionStart),
+// so the existing buffer-pointer wiring in to_host.rs picks up the right
+// device pointers at execute time. Interior FusedX / FusionStart nodes
+// never enter the kernels Vec — they have no buffers, no launches.
+// =========================================================================
+
+use std::sync::Arc;
+
+use cudarc::driver::{CudaFunction, CudaModule, CudaSlice, CudaStream};
+use luminal::{
+    graph::LLIRGraph,
+    prelude::{
+        petgraph::{Direction, algo::toposort, visit::EdgeRef},
+        *,
+    },
+};
+
+use as_any::Downcast;
+
+use crate::{
+    compile_module_image_for_current_device, cuda_dtype,
+    kernel::KernelOp,
+    kernel::fusion::markers::{FusionEnd, FusionStart},
+    kernel::hlir::{dtype_includes, generate_dyn_dims_defines},
+};
+
+// =========================================================================
+// Compile units — what `kernel_to_host` iterates over instead of nodes.
+// =========================================================================
+
+#[derive(Debug, Clone)]
+pub(crate) struct RegionUnit {
+    /// The FusionEnd node that anchors this region.
+    pub fe_node: NodeIndex,
+    /// Interior FusedX nodes, in topological order (predecessors before
+    /// consumers). Used to emit register-binding statements in dependency
+    /// order in the fused CUDA kernel body.
+    pub fusedx_topo: Vec<NodeIndex>,
+    /// FusionStart nodes that bound the region's leaves. One per external
+    /// read site — duplicates (different FS LLIR nodes wrapping the same
+    /// upstream tensor) are kept separate so each read uses its own
+    /// strides; the host launch passes the same device pointer twice.
+    pub fs_nodes: Vec<NodeIndex>,
+    /// External producer NodeIndices, one per `fs_nodes` entry in the same
+    /// order. Becomes the `inputs` field of the FE's `CompiledKernel`, and
+    /// the kernel function's `in0`, `in1`, ... parameters in that order.
+    pub external_inputs: Vec<NodeIndex>,
+}
+
+#[derive(Debug, Clone)]
+pub(crate) enum CompileUnit {
+    Single(NodeIndex),
+    Region(RegionUnit),
+}
+
+// =========================================================================
+// Region detection.
+// =========================================================================
+
+/// Group a sub-DAG's topo order into compile units. Each FusionEnd node
+/// becomes the root of a `CompileUnit::Region`; the region's interior
+/// FusedX and FusionStart nodes are absorbed into that region and removed
+/// from the per-node iteration. Anything else is wrapped in
+/// `CompileUnit::Single`.
+/// Globally-absorbed FS / FE markers — the set of marker nodes that any
+/// `FusionEnd` in the LLIR walks back to during region detection. A
+/// marker is "absorbed" iff some FE in the LLIR can reach it by walking
+/// incoming edges through `FusionEnd` / `FusedX` nodes, stopping at
+/// `FusionStart` leaves.
+///
+/// This is computed once over the full LLIR rather than per-convex-
+/// subgraph, because `partition_marked_convex` may put a shared FS leaf
+/// (one whose e-graph congruence-deduplicated it across multiple
+/// regions) into a different subgraph than the FE that absorbs it.
+/// Without this global view, `build_compile_units` running on the FS's
+/// subgraph would not see any FE walking back to the FS and would emit the
+/// FS as `CompileUnit::Single`; marker standalone compilation is not supported.
+pub(crate) fn globally_absorbed_markers(llir_graph: &LLIRGraph) -> FxHashSet<NodeIndex> {
+    let name_of = |idx: NodeIndex| -> Option<&'static str> {
+        llir_graph
+            .node_weight(idx)
+            .and_then(|op| op.to_dialect::<dyn KernelOp>().map(|k| k.kernel_name()))
+    };
+
+    let mut absorbed: FxHashSet<NodeIndex> = FxHashSet::default();
+    for fe in llir_graph.node_indices() {
+        if name_of(fe) != Some("FusionEnd") {
+            continue;
+        }
+        let mut visited: FxHashSet<NodeIndex> = FxHashSet::default();
+        let mut stack: Vec<NodeIndex> = vec![fe];
+        visited.insert(fe);
+        while let Some(cur) = stack.pop() {
+            for pred in llir_graph.neighbors_directed(cur, Direction::Incoming) {
+                if !visited.insert(pred) {
+                    continue;
+                }
+                match name_of(pred) {
+                    Some("FusionStart") => {
+                        absorbed.insert(pred);
+                    }
+                    Some("FusionEnd") => {
+                        absorbed.insert(pred);
+                        stack.push(pred);
+                    }
+                    Some(other) if other.starts_with("Fused") => {
+                        absorbed.insert(pred);
+                        stack.push(pred);
+                    }
+                    _ => {}
+                }
+            }
+        }
+    }
+    absorbed
+}
+
+pub(crate) fn build_compile_units(
+    topo_order: &[NodeIndex],
+    llir_graph: &LLIRGraph,
+    globally_absorbed: &FxHashSet<NodeIndex>,
+) -> Vec<CompileUnit> {
+    let name_of = |idx: NodeIndex| -> Option<&'static str> {
+        llir_graph
+            .node_weight(idx)
+            .and_then(|op| op.to_dialect::<dyn KernelOp>().map(|k| k.kernel_name()))
+    };
+
+    // First pass: every FusionEnd in the subgraph anchors a region; gather
+    // the region's interior + FS leaves by walking incoming edges
+    // backward, stopping at FusionStart (a leaf — its predecessor is the
+    // external producer, outside the region).
+    let mut absorbed: FxHashSet<NodeIndex> = FxHashSet::default();
+    let mut regions: FxHashMap<NodeIndex, RegionUnit> = FxHashMap::default();
+
+    for &node in topo_order {
+        if name_of(node) != Some("FusionEnd") {
+            continue;
+        }
+
+        let mut interior: Vec<NodeIndex> = Vec::new();
+        let mut fs_nodes: Vec<NodeIndex> = Vec::new();
+        let mut visited: FxHashSet<NodeIndex> = FxHashSet::default();
+        let mut stack: Vec<NodeIndex> = Vec::new();
+        stack.push(node);
+        visited.insert(node);
+
+        while let Some(cur) = stack.pop() {
+            for pred in llir_graph.neighbors_directed(cur, Direction::Incoming) {
+                if !visited.insert(pred) {
+                    continue;
+                }
+                match name_of(pred) {
+                    Some("FusionStart") => {
+                        fs_nodes.push(pred);
+                        // Don't recurse past FS — its predecessor is
+                        // external (outside the region).
+                    }
+                    Some("FusionEnd") => {
+                        // A nested FE inside a region. Under the current
+                        // rule design these are cascade artifacts — treat
+                        // them as transparent (walk through) rather than
+                        // as a separate region. The outer region absorbs
+                        // them. They do not become CompileUnit::Region
+                        // anchors because their eclass is already the
+                        // outer region's.
+                        absorbed.insert(pred);
+                        stack.push(pred);
+                    }
+                    Some(other) if other.starts_with("Fused") => {
+                        interior.push(pred);
+                        stack.push(pred);
+                    }
+                    _ => {
+                        // Non-marker, non-FusedX predecessor inside what
+                        // we thought was a region. Shouldn't happen with
+                        // the current rules; treat conservatively: do
+                        // not absorb it. This means the region is
+                        // malformed and we likely should not have a
+                        // region at all; caller will see incomplete
+                        // interior.
+                    }
+                }
+            }
+        }
+
+        // Topological order on the interior + FS nodes (so the kernel
+        // emits `let v = ...;` lines after their inputs are bound). We
+        // use the parent graph's toposort filtered to in-region nodes.
+        let mut region_set: FxHashSet<NodeIndex> = FxHashSet::default();
+        region_set.extend(interior.iter().copied());
+        region_set.extend(fs_nodes.iter().copied());
+        let topo = toposort(llir_graph, None).expect("LLIR cycle in region detection");
+        let interior_topo: Vec<NodeIndex> = topo
+            .iter()
+            .copied()
+            .filter(|n| region_set.contains(n) && interior.contains(n))
+            .collect();
+        let fs_topo: Vec<NodeIndex> = topo
+            .iter()
+            .copied()
+            .filter(|n| region_set.contains(n) && fs_nodes.contains(n))
+            .collect();
+
+        // External producer for each FS leaf, in the same order.
+        let external_inputs: Vec<NodeIndex> = fs_topo
+            .iter()
+            .map(|&fs| {
+                llir_graph
+                    .neighbors_directed(fs, Direction::Incoming)
+                    .next()
+                    .expect("FusionStart with no predecessor")
+            })
+            .collect();
+
+        absorbed.extend(interior_topo.iter().copied());
+        absorbed.extend(fs_topo.iter().copied());
+
+        regions.insert(
+            node,
+            RegionUnit {
+                fe_node: node,
+                fusedx_topo: interior_topo,
+                fs_nodes: fs_topo,
+                external_inputs,
+            },
+        );
+    }
+
+    // Second pass: emit compile units in original topo order, replacing
+    // FE nodes with their RegionUnit and skipping anything absorbed —
+    // either by a region in *this* subgraph (`absorbed`) or by any
+    // region anywhere in the LLIR (`globally_absorbed`). Skipping the
+    // latter prevents shared FS markers whose consumers live in other
+    // convex subgraphs from being emitted as standalone compile units:
+    // those FSes are absorbed by some other region, and the consuming
+    // region reads from FS's external producer.
+    let mut units: Vec<CompileUnit> = Vec::new();
+    for &node in topo_order {
+        if let Some(region) = regions.remove(&node) {
+            units.push(CompileUnit::Region(region));
+        } else if absorbed.contains(&node) || globally_absorbed.contains(&node) {
+            continue;
+        } else {
+            units.push(CompileUnit::Single(node));
+        }
+    }
+    units
+}
+
+// =========================================================================
+// Per-FusedX body templates.
+//
+// Each entry takes the names of the local variables holding the op's
+// inputs and returns a CUDA expression evaluating to the op's output
+// (a register-resident value, no buffer involved).
+// =========================================================================
+
+fn fused_body(name: &str, locals: &[&str]) -> String {
+    match name {
+        "FusedSin" => format!("sinf({})", locals[0]),
+        "FusedSqrt" => format!("sqrtf({})", locals[0]),
+        "FusedExp" => format!("expf({})", locals[0]),
+        "FusedExp2" => format!("exp2f({})", locals[0]),
+        "FusedLog2" => format!("log2f({})", locals[0]),
+        "FusedRecip" => format!("1.0f / {}", locals[0]),
+        "FusedAdd" => format!("{} + {}", locals[0], locals[1]),
+        "FusedMul" => format!("{} * {}", locals[0], locals[1]),
+        other => panic!("region_codegen: unknown FusedX op {other}"),
+    }
+}
+
+// =========================================================================
+// Region compilation — emit one CUDA kernel for the whole region.
+// =========================================================================
+
+#[allow(clippy::type_complexity)]
+pub(crate) struct CompiledRegion {
+    pub function: CudaFunction,
+    pub module: Arc<CudaModule>,
+    pub kernel_str: String,
+    pub grid: (Expression, Expression, Expression),
+    pub block: (Expression, Expression, Expression),
+    pub shared_mem: Expression,
+    pub constants: FxHashMap<char, CudaSlice<u8>>,
+}
+
+#[allow(clippy::type_complexity)]
+pub(crate) fn compile_region(
+    region: &RegionUnit,
+    llir_graph: &LLIRGraph,
+    stream: &Arc<CudaStream>,
+    compile_cache: &mut FxHashMap<String, (Arc<CudaModule>, CudaFunction)>,
+) -> CompiledRegion {
+    // Resolve FE: shape, strides (for the write), dtype.
+    let fe_op = llir_graph[region.fe_node]
+        .to_dialect::<dyn KernelOp>()
+        .expect("FE node must be a KernelOp");
+    let fe_struct: &FusionEnd = (***fe_op)
+        .downcast_ref::<FusionEnd>()
+        .expect("region root must be FusionEnd");
+    let out_shape: &[Expression] = &fe_struct.shape;
+    let out_strides: &[Expression] = &fe_struct.strides;
+    let dtype: DType = fe_struct.dtype;
+
+    // Aggregate all dynamic vars used anywhere in the region (FS strides,
+    // FE strides, FusedX shape — all FusedX share `out_shape`, but their
+    // own strides are likewise relevant for any future stride-affine ops).
+    let mut all_vars: FxHashSet<char> = FxHashSet::default();
+    all_vars.extend(out_shape.iter().flat_map(|e| e.dyn_vars()));
+    all_vars.extend(out_strides.iter().flat_map(|e| e.dyn_vars()));
+    for &fs_idx in &region.fs_nodes {
+        let fs_op = llir_graph[fs_idx].to_dialect::<dyn KernelOp>().unwrap();
+        let fs_struct: &FusionStart = (***fs_op).downcast_ref::<FusionStart>().unwrap();
+        all_vars.extend(fs_struct.strides.iter().flat_map(|e| e.dyn_vars()));
+    }
+
+    let cuda_ty = cuda_dtype(dtype);
+    let includes = dtype_includes(&[dtype]);
+    let (dyn_defines, _sorted_dims) = generate_dyn_dims_defines(&all_vars);
+    let dyn_dims_param = if all_vars.is_empty() {
+        ""
+    } else {
+        ", const int* dyn_dims"
+    };
+
+    let n_elements = out_shape
+        .iter()
+        .copied()
+        .product::<Expression>()
+        .to_kernel();
+
+    // Build kernel signature: out, then one input per FS leaf in
+    // `region.fs_nodes` order. The `external_inputs` list (parallel to
+    // `fs_nodes`) is what the host wires into the launch params.
+    let mut signature_params: Vec<String> = vec![format!("{cuda_ty} *out")];
+    for i in 0..region.fs_nodes.len() {
+        signature_params.push(format!("const {cuda_ty} *in{i}"));
+    }
+    let signature = signature_params.join(", ");
+
+    // Body: read FS leaves, then walk FusedX in topo order emitting a
+    // local per op, then write FE output. Every node gets a local keyed
+    // by a position-in-region index so the kernel string is invariant
+    // under NodeIndex churn (each `egglog_to_llir` reissues NodeIndexes,
+    // so naming locals by `n.index()` would invalidate the kernel
+    // string cache on every search candidate). Indices: FS leaves get
+    // 0..fs_nodes.len(), FusedX get fs_nodes.len()..(+ fusedx_topo.len()).
+    let mut local_idx_map: FxHashMap<NodeIndex, usize> = FxHashMap::default();
+    for (i, &fs_idx) in region.fs_nodes.iter().enumerate() {
+        local_idx_map.insert(fs_idx, i);
+    }
+    let fs_count = region.fs_nodes.len();
+    for (i, &op_idx) in region.fusedx_topo.iter().enumerate() {
+        local_idx_map.insert(op_idx, fs_count + i);
+    }
+    let local_name = |n: NodeIndex| format!("v_{}", local_idx_map[&n]);
+
+    let mut body = String::new();
+    body.push_str(&format!(
+        "        long long const_z = (long long)blockIdx.x * blockDim.x + threadIdx.x;\n\
+         \x20       if (const_z >= {n_elements}) return;\n"
+    ));
+
+    // FS leaves: each reads from its corresponding `in_i` parameter using
+    // its own strides.
+    for (i, &fs_idx) in region.fs_nodes.iter().enumerate() {
+        let fs_op = llir_graph[fs_idx].to_dialect::<dyn KernelOp>().unwrap();
+        let fs_struct: &FusionStart = (***fs_op).downcast_ref::<FusionStart>().unwrap();
+        let read_idx = flatten_strides(out_shape, &fs_struct.strides).to_kernel();
+        body.push_str(&format!(
+            "        {cuda_ty} {name} = in{i}[{read_idx}];\n",
+            name = local_name(fs_idx),
+        ));
+    }
+
+    // FusedX ops in topo order. Each looks up its predecessor locals
+    // (in incoming-edge id order to match the original op's input
+    // arity / position).
+    for &op_idx in &region.fusedx_topo {
+        let op_ref = llir_graph[op_idx].to_dialect::<dyn KernelOp>().unwrap();
+        let op_name = op_ref.kernel_name();
+
+        let mut input_locals: Vec<String> = llir_graph
+            .edges_directed(op_idx, Direction::Incoming)
+            .map(|e| (e.id(), e.source()))
+            .collect::<Vec<_>>()
+            .into_iter()
+            .map(|(_, src)| local_name(src))
+            .collect();
+        // Sort by edge id like the rest of the codegen does for stable
+        // input ordering.
+        let mut edges: Vec<(_, NodeIndex)> = llir_graph
+            .edges_directed(op_idx, Direction::Incoming)
+            .map(|e| (e.id(), e.source()))
+            .collect();
+        edges.sort_by_key(|(eid, _)| *eid);
+        input_locals = edges.into_iter().map(|(_, src)| local_name(src)).collect();
+        let inputs_ref: Vec<&str> = input_locals.iter().map(|s| s.as_str()).collect();
+
+        let expr = fused_body(op_name, &inputs_ref);
+        body.push_str(&format!(
+            "        {cuda_ty} {name} = {expr};\n",
+            name = local_name(op_idx),
+        ));
+    }
+
+    // FE write: pick the FusedX feeding FE (its single incoming edge in
+    // the region — a FusedX or, in degenerate single-FS regions which
+    // shouldn't arise, an FS).
+    let fe_input: NodeIndex = llir_graph
+        .neighbors_directed(region.fe_node, Direction::Incoming)
+        .next()
+        .expect("FusionEnd with no predecessor");
+    let fe_input_local = local_name(fe_input);
+    let write_idx = flatten_strides(out_shape, out_strides).to_kernel();
+    body.push_str(&format!("        out[{write_idx}] = {fe_input_local};\n"));
+
+    let kernel = format!(
+        "{includes}\n\
+         {dyn_defines}\n\
+         extern \"C\" {{\n\
+         \x20   __global__ void fused_region_k({signature}{dyn_dims_param}) {{\n\
+         {body}\
+         \x20   }}\n\
+         }}"
+    );
+
+    let (module, function) = if let Some((m, f)) = compile_cache.get(&kernel) {
+        (m.clone(), f.clone())
+    } else {
+        let ptx = compile_module_image_for_current_device(stream.context(), &kernel)
+            .expect("region kernel PTX compile failed");
+        let module = stream
+            .context()
+            .load_module(ptx)
+            .expect("module load failed");
+        let function = module
+            .load_function("fused_region_k")
+            .expect("region kernel function not found");
+        compile_cache.insert(kernel.clone(), (module.clone(), function.clone()));
+        (module, function)
+    };
+
+    let out_size = out_shape.iter().copied().product::<Expression>();
+
+    CompiledRegion {
+        function,
+        module,
+        kernel_str: kernel,
+        grid: (out_size.ceil_div(256), 1.into(), 1.into()),
+        block: (out_size.min(256), 1.into(), 1.into()),
+        shared_mem: 0.into(),
+        constants: FxHashMap::default(),
+    }
+}
--- a/crates/luminal_cuda_lite/src/kernel/hlir.rs
+++ b/crates/luminal_cuda_lite/src/kernel/hlir.rs
@@ -8,7 +8,7 @@ use cudarc::driver::{CudaFunction, CudaModule, CudaSlice, CudaStream};
 use itertools::Itertools;
 use luminal::{
    egglog_utils::{
-        api::{Rule, SortDef, app, eq, rule, set, sort, union, v},
+        api::{Rule, SortDef, Term, app, eq, rule, set, sort, union, v},
        base::{DTYPE, ELIST, EXPRESSION, F64, OP_KIND, SORTS, dtype, ilist, op_term},
        extract_dtype, extract_expr, extract_expr_list,
    },
@@ -79,7 +79,48 @@ pub fn kernel_rewrite<H: Default + EgglogOp, L: Default + EgglogOp>() -> Rule {
    args.add("dtype", dt.clone());
    let llir_kind_term = llir.call(&args);
    let llir_op = op_term(llir_kind_term, inputs);
-    rule(union(hlir_op.clone(), llir_op)).fact(eq(dt, dtype(hlir_op)))
+    rule(union(hlir_op.clone(), llir_op))
+        .fact(eq(dt, dtype(hlir_op)))
+        .ruleset("kernel_lower")
+}
+
+/// Build a kernel rewrite for ops whose kernel dtype must match the first input.
+///
+/// This avoids extracting stale/conflicting dtype facts from the output e-class
+/// after backend alternatives have been unioned into it.
+fn kernel_rewrite_from_first_input<H: Default + EgglogOp, L: Default + EgglogOp>() -> Rule {
+    let hlir = H::default().sort();
+    let llir = L::default().sort();
+    let (mut args, hlir_kind_term) = hlir.new_call();
+    let first_inp = v("?__first_inp");
+    let tail = v("?__tail");
+    let inputs = Term::App {
+        variant: "ICons".to_string(),
+        args: vec![first_inp.clone(), tail],
+    };
+    let hlir_op = op_term(hlir_kind_term, inputs.clone());
+    let dt = v("?__dt");
+    args.add("dtype", dt.clone());
+    let llir_kind_term = llir.call(&args);
+    let llir_op = op_term(llir_kind_term, inputs);
+    rule(union(hlir_op, llir_op))
+        .fact(eq(dt, dtype(first_inp)))
+        .ruleset("kernel_lower")
+}
+
+fn dtype_for_ir_enode(egraph: &SerializedEGraph, ir_node: &ENodeId) -> Option<DType> {
+    let ir_class = egraph.node_to_class.get(ir_node)?;
+    let dtype_node = egraph.enodes.iter().find_map(|(node, (label, children))| {
+        (label == "dtype" && children.first() == Some(ir_class)).then_some(node)
+    })?;
+    let dtype_class = egraph.node_to_class.get(dtype_node)?;
+    egraph.eclasses.get(dtype_class)?.1.iter().find_map(|node| {
+        match egraph.enodes.get(node)?.0.as_str() {
+            "F32" | "F16" | "Bf16" | "Int" | "Bool" | "F4E2M1" | "F8E4M3" | "F8UE8M0" | "I4"
+            | "TF32" => Some(extract_dtype(egraph, node)),
+            _ => None,
+        }
+    })
 }

 #[derive(Default, Debug, Clone)]
@@ -634,8 +675,8 @@ extern \"C\" {{
            func,
            module,
            kernel,
-            (out_size.ceil_div(128), 1.into(), 1.into()),
-            (out_size.min(128), 1.into(), 1.into()),
+            (out_size.ceil_div(256), 1.into(), 1.into()),
+            (out_size.min(256), 1.into(), 1.into()),
            0.into(),
            FxHashMap::default(), // No per-module constants needed
        )
@@ -700,7 +741,7 @@ impl EgglogOp for KernelMul {
    }

    fn rewrites(&self) -> Vec<Rule> {
-        vec![kernel_rewrite::<Mul, Self>()]
+        vec![kernel_rewrite_from_first_input::<Mul, Self>()]
    }

    fn cleanup(&self) -> bool {
@@ -715,17 +756,45 @@ impl EgglogOp for KernelMul {
        list_cache: &mut FxHashMap<&'a ENodeId, Vec<Expression>>,
        expr_cache: &mut FxHashMap<&'a ENodeId, Expression>,
    ) -> (LLIROp, Vec<&'a ENodeId>) {
+        let mut out_shape =
+            extract_expr_list(egraph, kind_children[0], list_cache, expr_cache).unwrap();
+        let mut a_stride =
+            extract_expr_list(egraph, kind_children[1], list_cache, expr_cache).unwrap();
+        let mut b_stride =
+            extract_expr_list(egraph, kind_children[2], list_cache, expr_cache).unwrap();
+        let mut out_stride =
+            extract_expr_list(egraph, kind_children[3], list_cache, expr_cache).unwrap();
+        // Some e-graph paths (length-changing rewrites such as `merge_dims`
+        // or `RemoveNthFromEnd`) leave a Mul kind enode whose shape and
+        // strides children are extracted to different lengths under the
+        // first-enode walk. The `enforce_consistent_first_kind_enodes`
+        // pass in `src/egglog_utils/mod.rs` repairs this where it can,
+        // but a handful of eclasses have *no* consistent variant in any
+        // of their stride sub-eclasses. For those we truncate to the
+        // SHORTEST length here so `flatten_strides` is structurally
+        // satisfied — the resulting kernel is numerically wrong for that
+        // candidate but harmless for the search, which profiles many
+        // candidates and steers toward the consistent ones.
+        let n = out_shape
+            .len()
+            .min(a_stride.len())
+            .min(b_stride.len())
+            .min(out_stride.len());
+        out_shape.truncate(n);
+        a_stride.truncate(n);
+        b_stride.truncate(n);
+        out_stride.truncate(n);
+        let dtype = input_enodes
+            .first()
+            .and_then(|node| dtype_for_ir_enode(egraph, node))
+            .unwrap_or_else(|| extract_dtype(egraph, kind_children[4]));
        (
            LLIROp::new::<dyn KernelOp>(Box::new(Self {
-                out_shape: extract_expr_list(egraph, kind_children[0], list_cache, expr_cache)
-                    .unwrap(),
-                a_stride: extract_expr_list(egraph, kind_children[1], list_cache, expr_cache)
-                    .unwrap(),
-                b_stride: extract_expr_list(egraph, kind_children[2], list_cache, expr_cache)
-                    .unwrap(),
-                out_stride: extract_expr_list(egraph, kind_children[3], list_cache, expr_cache)
-                    .unwrap(),
-                dtype: extract_dtype(egraph, kind_children[4]),
+                out_shape,
+                a_stride,
+                b_stride,
+                out_stride,
+                dtype,
            })),
            input_enodes,
        )
@@ -797,8 +866,8 @@ extern \"C\" {{
            func,
            module,
            kernel,
-            (out_size.ceil_div(128), 1.into(), 1.into()),
-            (out_size.min(128), 1.into(), 1.into()),
+            (out_size.ceil_div(256), 1.into(), 1.into()),
+            (out_size.min(256), 1.into(), 1.into()),
            0.into(),
            FxHashMap::default(),
        )
@@ -865,13 +934,29 @@ impl EgglogOp for KernelGather {
    }

    fn rewrites(&self) -> Vec<Rule> {
-        // Match HLIR Gather (now in Op format) and rewrite to KernelGather
+        // Match HLIR Gather (now in Op format) and rewrite to KernelGather.
+        // Mirror the IList pattern used by `Gather`'s own dtype propagation
+        // rule (`src/hlir.rs`): use a `?__tail` variable instead of a
+        // strict `(INil)` so we don't accidentally fail to match against a
+        // Gather Op whose IList tail eclass has been merged with another
+        // chain by some unrelated egglog union. Without this the kernel
+        // rewrite is silently skipped for some Gathers in deep models
+        // (e.g. YOLO's stacked make_contiguous chains).
        let hlir_gather = luminal::hlir::Gather::default().sort();
        let (gather_args, gather_kind_term) = hlir_gather.new_call();
-        // HLIR Gather inputs: [indexes, data] (n_inputs=2)
        let indexes = v("?__indexes");
        let data = v("?__data");
-        let gather_inputs = ilist(vec![indexes.clone(), data.clone()]);
+        let tail = v("?__tail");
+        let gather_inputs = Term::App {
+            variant: "ICons".to_string(),
+            args: vec![
+                indexes.clone(),
+                Term::App {
+                    variant: "ICons".to_string(),
+                    args: vec![data.clone(), tail],
+                },
+            ],
+        };
        let gather_op = op_term(gather_kind_term, gather_inputs);

        let out_strides = SORTS
@@ -894,7 +979,11 @@ impl EgglogOp for KernelGather {
        ];
        let kernel_kind_term = self.sort().call(kernel_kind_args);
        let kernel_op = op_term(kernel_kind_term, ilist(vec![indexes, data.clone()]));
-        vec![rule(union(gather_op, kernel_op)).fact(eq(dt, dtype(data)))]
+        vec![
+            rule(union(gather_op, kernel_op))
+                .fact(eq(dt, dtype(data)))
+                .ruleset("kernel_lower"),
+        ]
    }

    fn cleanup(&self) -> bool {
@@ -990,12 +1079,13 @@ extern \"C\" {{
            compile_cache.insert(kernel.clone(), (module.clone(), func.clone()));
            (module, func)
        };
+        let out_size = self.out_shape.iter().copied().product::<Expression>();
        (
            func,
            module,
            kernel,
-            (self.out_shape.iter().copied().product(), 1.into(), 1.into()),
-            (1.into(), 1.into(), 1.into()),
+            (out_size.ceil_div(256), 1.into(), 1.into()),
+            (out_size.min(256), 1.into(), 1.into()),
            0.into(),
            FxHashMap::default(),
        )
@@ -1128,7 +1218,11 @@ impl EgglogOp for KernelScatter {
        ];
        let kernel_kind_term = self.sort().call(kernel_kind_args);
        let kernel_op = op_term(kernel_kind_term, ilist(vec![dest, indexes, src.clone()]));
-        vec![rule(union(scatter_op, kernel_op)).fact(eq(dt, dtype(src)))]
+        vec![
+            rule(union(scatter_op, kernel_op))
+                .fact(eq(dt, dtype(src)))
+                .ruleset("kernel_lower"),
+        ]
    }

    fn cleanup(&self) -> bool {
@@ -1199,7 +1293,25 @@ impl KernelOp for KernelScatter {

        // Single-kernel scatter: copy dest→output then scatter src→output[indexes]
        // Launched as 1 block of 1024 threads with __syncthreads() barrier.
-        // Uses float4 vectorized copy (4x throughput) for the copy phase.
+        // Uses float4 vectorized copy (16 bytes per op) for the copy phase.
+        //
+        // The number of dtype elements that fit in a float4 (16 bytes) depends
+        // on the element size. Computing `n_vec = n_dest / 4` would only be
+        // correct for 4-byte dtypes — for bf16 it walks 2× past the end of
+        // `out`, producing CUDA_ERROR_ILLEGAL_ADDRESS once the OOB region
+        // happens to land on an unmapped page.
+        let elements_per_vec: usize = match self.dtype {
+            DType::F64 => 2,
+            DType::F32 | DType::Int => 4,
+            DType::F16 | DType::Bf16 | DType::I16 | DType::U16 => 8,
+            DType::Bool
+            | DType::I8
+            | DType::U8
+            | DType::F8UE8M0
+            | DType::F8E4M3
+            | DType::F8E5M2 => 16,
+            other => panic!("Unsupported dtype for scatter vectorization: {other:?}"),
+        };
        let n_src_elements = self
            .index_shape
            .iter()
@@ -1224,15 +1336,17 @@ extern \"C\" {{
        int tid = threadIdx.x;
        long long n_dest = {n_dest_elements};
        long long n_src = {n_src_elements};
-        // Phase 1: vectorized copy dest → output (float4 = 4 elements per op)
-        long long n_vec = n_dest / 4;
+        // Phase 1: vectorized copy dest → output (float4 = 16 bytes / iter,
+        // i.e. {elements_per_vec} {dtype} elements). n_vec is sized so the
+        // total bytes covered (`n_vec * 16`) never exceed `n_dest * sizeof({dtype})`.
+        long long n_vec = n_dest / {elements_per_vec};
        float4 *out4 = (float4 *)out;
        const float4 *dest4 = (const float4 *)dest;
        for (long long i = tid; i < n_vec; i += blockDim.x) {{
            out4[i] = dest4[i];
        }}
-        // Handle remaining elements
-        long long remainder_start = n_vec * 4;
+        // Handle remaining elements (the dtype-tail past the last full float4).
+        long long remainder_start = n_vec * {elements_per_vec};
        for (long long i = remainder_start + tid; i < n_dest; i += blockDim.x) {{
            out[i] = dest[i];
        }}
@@ -1385,7 +1499,8 @@ impl EgglogOp for KernelIota {
        let kernel_op = op_term(kernel_kind, hlir_inputs);
        vec![
            rule(union(hlir_op, kernel_op.clone()))
-                .set(dtype(kernel_op), app(&SORTS.int_dt, vec![])),
+                .set(dtype(kernel_op), app(&SORTS.int_dt, vec![]))
+                .ruleset("kernel_lower"),
        ]
    }

@@ -1425,19 +1540,22 @@ impl KernelOp for KernelIota {
        Expression,
        FxHashMap<char, CudaSlice<u8>>,
    ) {
-        let vars = self.expr.dyn_vars().into_iter().collect::<FxHashSet<_>>();
+        let mut vars = self.expr.dyn_vars().into_iter().collect::<FxHashSet<_>>();
+        vars.extend(self.range.dyn_vars());
        let (dyn_defines, _sorted_dims) = generate_dyn_dims_defines(&vars);
        let dyn_dims_param = if vars.is_empty() {
            ""
        } else {
            ", const int* dyn_dims"
        };
+        let range = self.range.to_kernel();
        let kernel = format!(
            "
 {dyn_defines}
 extern \"C\" {{
    __global__ void iota_k(int *C{dyn_dims_param}) {{
        long long const_z = (long long)blockIdx.x * blockDim.x + threadIdx.x;
+        if (const_z >= {range}) return;
        C[const_z] = {};
    }}
 }}",
@@ -1456,8 +1574,8 @@ extern \"C\" {{
            func,
            module,
            kernel,
-            (self.range, 1.into(), 1.into()),
-            (1.into(), 1.into(), 1.into()),
+            (self.range.ceil_div(256), 1.into(), 1.into()),
+            (256.into(), 1.into(), 1.into()),
            0.into(),
            FxHashMap::default(),
        )
@@ -1615,8 +1733,8 @@ extern \"C\" {{
            func,
            module,
            kernel,
-            (out_size.ceil_div(128), 1.into(), 1.into()),
-            (out_size.min(128), 1.into(), 1.into()),
+            (out_size.ceil_div(256), 1.into(), 1.into()),
+            (out_size.min(256), 1.into(), 1.into()),
            0.into(),
            FxHashMap::default(),
        )
@@ -1769,8 +1887,8 @@ extern \"C\" {{
            func,
            module,
            kernel,
-            (out_size.ceil_div(128), 1.into(), 1.into()),
-            (out_size.min(128), 1.into(), 1.into()),
+            (out_size.ceil_div(256), 1.into(), 1.into()),
+            (out_size.min(256), 1.into(), 1.into()),
            0.into(),
            FxHashMap::default(),
        )
@@ -1923,8 +2041,8 @@ extern \"C\" {{
            func,
            module,
            kernel,
-            (out_size.ceil_div(128), 1.into(), 1.into()),
-            (out_size.min(128), 1.into(), 1.into()),
+            (out_size.ceil_div(256), 1.into(), 1.into()),
+            (out_size.min(256), 1.into(), 1.into()),
            0.into(),
            FxHashMap::default(),
        )
@@ -2077,8 +2195,8 @@ extern \"C\" {{
            func,
            module,
            kernel,
-            (out_size.ceil_div(128), 1.into(), 1.into()),
-            (out_size.min(128), 1.into(), 1.into()),
+            (out_size.ceil_div(256), 1.into(), 1.into()),
+            (out_size.min(256), 1.into(), 1.into()),
            0.into(),
            FxHashMap::default(),
        )
@@ -2231,8 +2349,8 @@ extern \"C\" {{
            func,
            module,
            kernel,
-            (out_size.ceil_div(128), 1.into(), 1.into()),
-            (out_size.min(128), 1.into(), 1.into()),
+            (out_size.ceil_div(256), 1.into(), 1.into()),
+            (out_size.min(256), 1.into(), 1.into()),
            0.into(),
            FxHashMap::default(),
        )
@@ -2392,8 +2510,8 @@ extern \"C\" {{
            func,
            module,
            kernel,
-            (out_size.ceil_div(128), 1.into(), 1.into()),
-            (out_size.min(128), 1.into(), 1.into()),
+            (out_size.ceil_div(256), 1.into(), 1.into()),
+            (out_size.min(256), 1.into(), 1.into()),
            0.into(),
            FxHashMap::default(),
        )
@@ -2470,7 +2588,11 @@ impl EgglogOp for KernelLessThan {
        args.add("dtype", dt.clone());
        let kernel_kind_term = self.sort().call(&args);
        let kernel_op = op_term(kernel_kind_term, hlir_inputs);
-        vec![rule(union(hlir_op, kernel_op)).fact(eq(dt, dtype(inp_a)))]
+        vec![
+            rule(union(hlir_op, kernel_op))
+                .fact(eq(dt, dtype(inp_a)))
+                .ruleset("kernel_lower"),
+        ]
    }

    fn cleanup(&self) -> bool {
@@ -2567,8 +2689,8 @@ extern \"C\" {{
            func,
            module,
            kernel,
-            (out_size.ceil_div(128), 1.into(), 1.into()),
-            (out_size.min(128), 1.into(), 1.into()),
+            (out_size.ceil_div(256), 1.into(), 1.into()),
+            (out_size.min(256), 1.into(), 1.into()),
            0.into(),
            FxHashMap::default(),
        )
@@ -2627,7 +2749,8 @@ impl EgglogOp for KernelConstant {
        let kernel_op = op_term(kernel_kind, hlir_inputs);
        vec![
            rule(union(hlir_op, kernel_op.clone()))
-                .set(dtype(kernel_op), app(&SORTS.f32_dt, vec![])),
+                .set(dtype(kernel_op), app(&SORTS.f32_dt, vec![]))
+                .ruleset("kernel_lower"),
        ]
    }

@@ -2769,7 +2892,11 @@ impl EgglogOp for KernelCast {
        cast_args.add("src_dtype", out_dty);
        let kernel_kind_term = self.sort().call(&cast_args);
        let kernel_op = op_term(kernel_kind_term, cast_inputs);
-        vec![rule(union(cast_op, kernel_op)).fact(eq(in_dty, dtype(inp)))]
+        vec![
+            rule(union(cast_op, kernel_op))
+                .fact(eq(in_dty, dtype(inp)))
+                .ruleset("kernel_lower"),
+        ]
    }

    fn cleanup(&self) -> bool {
@@ -2811,6 +2938,14 @@ impl KernelOp for KernelCast {
    ) {
        let out_dtype = cuda_dtype(self.out_dtype);
        let includes = dtype_includes(&[self.in_dtype, self.out_dtype]);
+        let vars = self.size.dyn_vars().into_iter().collect::<FxHashSet<_>>();
+        let (dyn_defines, _sorted_dims) = generate_dyn_dims_defines(&vars);
+        let dyn_dims_param = if vars.is_empty() {
+            ""
+        } else {
+            ", const int* dyn_dims"
+        };
+        let size = self.size.to_kernel();

        let kernel = if self.in_dtype.bits() < 8 {
            // Sub-byte packed types: multiple values packed per byte.
@@ -2820,9 +2955,11 @@ impl KernelOp for KernelCast {
            let mask = (1u32 << bits) - 1;
            format!(
                "{includes}
+{dyn_defines}
 extern \"C\" {{
-    __global__ void cast_k({out_dtype} *out, const unsigned char *in_raw) {{
+    __global__ void cast_k({out_dtype} *out, const unsigned char *in_raw{dyn_dims_param}) {{
        long long idx = (long long)blockIdx.x * blockDim.x + threadIdx.x;
+        if (idx >= {size}) return;
        long long bit_offset = idx * {bits};
        long long byte_idx = bit_offset >> 3;
        int bit_pos = (int)(bit_offset & 7);
@@ -2838,9 +2975,11 @@ extern \"C\" {{
            let in_dtype = cuda_dtype(self.in_dtype);
            format!(
                "{includes}
+{dyn_defines}
 extern \"C\" {{
-    __global__ void cast_k({out_dtype} *out, const {in_dtype} *in) {{
+    __global__ void cast_k({out_dtype} *out, const {in_dtype} *in{dyn_dims_param}) {{
        long long const_z = (long long)blockIdx.x * blockDim.x + threadIdx.x;
+        if (const_z >= {size}) return;
        out[const_z] = ({out_dtype})in[const_z];
    }}
 }}"
@@ -2859,8 +2998,8 @@ extern \"C\" {{
            func,
            module,
            kernel,
-            (self.size, 1.into(), 1.into()),
-            (1.into(), 1.into(), 1.into()),
+            (self.size.ceil_div(256), 1.into(), 1.into()),
+            (256.into(), 1.into(), 1.into()),
            0.into(),
            FxHashMap::default(),
        )
@@ -3023,6 +3162,7 @@ impl EgglogOp for KernelEmbed {
                    (union ?gather ?ke)
                    (set (dtype ?ke) (F32))
                )
+                :ruleset kernel_specialize
                :name \"kernel embed with cast mul\"
            )"),
            // Match Gather with Add(Iota, Mul(Cast(token_ids), const)) indices (reversed order)
@@ -3042,6 +3182,7 @@ impl EgglogOp for KernelEmbed {
                    (union ?gather ?ke)
                    (set (dtype ?ke) (F32))
                )
+                :ruleset kernel_specialize
                :name \"kernel embed with cast mul reversed\"
            )"),
            // Match Gather with Add(Mul(token_ids, const), Iota) indices (no Cast)
@@ -3060,6 +3201,7 @@ impl EgglogOp for KernelEmbed {
                    (union ?gather ?ke)
                    (set (dtype ?ke) (F32))
                )
+                :ruleset kernel_specialize
                :name \"kernel embed with mul\"
            )"),
            // Match Gather with Add(Iota, Mul(token_ids, const)) indices (reversed order, no Cast)
@@ -3078,6 +3220,7 @@ impl EgglogOp for KernelEmbed {
                    (union ?gather ?ke)
                    (set (dtype ?ke) (F32))
                )
+                :ruleset kernel_specialize
                :name \"kernel embed with mul reversed\"
            )"),
        ]
@@ -3138,15 +3281,24 @@ impl KernelOp for KernelEmbed {
            .chain(self.out_stride.iter().flat_map(|e| e.dyn_vars()))
            .chain(self.embed_dim.dyn_vars())
            .collect::<FxHashSet<_>>();
+        let (dyn_defines, _sorted_dims) = generate_dyn_dims_defines(&vars);
+        let dyn_dims_param = if vars.is_empty() {
+            ""
+        } else {
+            ", const int* dyn_dims"
+        };
        let token_offset_expr = flatten_strides(&self.batch_shape, &self.token_stride).to_kernel();
        let out_offset_expr = flatten_strides(&self.batch_shape, &self.out_stride).to_kernel();
        let embed_dim_expr = self.embed_dim.to_kernel();
+        let total_threads = batch_size * self.embed_dim;
+        let n_elements = total_threads.to_kernel();
        let kernel = format!(
            "
-{}
+{dyn_defines}
 extern \"C\" {{
-    __global__ void embed(float *out, const int *token_ids, const float *embed_table) {{
+    __global__ void embed(float *out, const int *token_ids, const float *embed_table{dyn_dims_param}) {{
        long long idx = (long long)blockIdx.x * blockDim.x + threadIdx.x;
+        if (idx >= {n_elements}) return;
        long long embed_dim = {embed_dim_expr};
        long long batch_idx = idx / embed_dim;
        long long embed_idx = idx % embed_dim;
@@ -3156,10 +3308,7 @@ extern \"C\" {{
        int token_id = token_ids[token_offset];
        out[out_offset + embed_idx] = embed_table[(long long)token_id * embed_dim + embed_idx];
    }}
-}}",
-            vars.iter()
-                .map(|i| format!("__constant__ int const_{i}[1];"))
-                .join("\n"),
+}}"
        );
        let (module, func) = if let Some((module, func)) = compile_cache.get(&kernel) {
            (module.clone(), func.clone())
@@ -3170,17 +3319,14 @@ extern \"C\" {{
            compile_cache.insert(kernel.clone(), (module.clone(), func.clone()));
            (module, func)
        };
-        let constants = vars
-            .into_iter()
-            .map(|d| (d, module.get_global(&format!("const_{d}"), stream).unwrap()))
-            .collect();
-        let total_threads = batch_size * self.embed_dim;
+        // Return empty constants map - we now use shared dyn_dims buffer
+        let constants = FxHashMap::default();
        (
            func,
            module,
            kernel,
-            (total_threads, 1.into(), 1.into()),
-            (1.into(), 1.into(), 1.into()),
+            (total_threads.ceil_div(256), 1.into(), 1.into()),
+            (256.into(), 1.into(), 1.into()),
            0.into(),
            constants,
        )
--- a/crates/luminal_cuda_lite/src/kernel/mod.rs
+++ b/crates/luminal_cuda_lite/src/kernel/mod.rs
@@ -10,12 +10,13 @@ use luminal_tracing::schema::{
 use uuid::Uuid;

 pub mod cuda_graph;
+pub mod fusion;
 pub mod hlir;
 pub mod other_ops;

 pub use cuda_graph::*;

-pub type Ops = (hlir::Ops, other_ops::Ops);
+pub type Ops = (hlir::Ops, other_ops::Ops, fusion::Ops);

 /// Build a mapping from interned string IDs to their string values for a given sequence.
 fn build_interned_strings(trace: &schema::Trace) -> std::collections::HashMap<(u32, u64), String> {
--- a/crates/luminal_cuda_lite/src/kernel/other_ops.rs
+++ b/crates/luminal_cuda_lite/src/kernel/other_ops.rs
@@ -10,7 +10,7 @@ use itertools::Itertools;
 use luminal::{
    egglog_utils::{
        api::{Rule, SortDef, sort},
-        base::{DTYPE, ELIST, EXPRESSION, OP_KIND},
+        base::{DTYPE, ELIST, EXPRESSION, OP_KIND, STRING},
        extract_dtype, extract_expr, extract_expr_list,
    },
    op::*,
@@ -128,7 +128,8 @@ impl KernelOp for KernelMeanReduce {
        let dtype = cuda_dtype(self.dtype);
        let includes = dtype_includes(&[self.dtype]);
        let n_outputs: Expression = self.out_shape.iter().copied().product();
-        let threads_per_block = 256; // 8 warps per block
+        let threads_per_block: usize = 256; // 8 warps per block
+        let n_warps = threads_per_block / 32;
        let (dyn_defines, _sorted_dims) = generate_dyn_dims_defines(&vars);
        let dyn_dims_param = if vars.is_empty() {
            ""
@@ -149,12 +150,24 @@ extern \"C\" {{
        long long iters = {iters};
        long long iter_stride = {iter_stride};

-        {dtype} sum = 0;
-        for (long long i = 0; i < iters; i++) {{
-            sum += in[in_start + i * iter_stride];
-        }}
+        float thread_sum = 0.0f;
+        for (long long i = threadIdx.x; i < iters; i += {threads_per_block})
+            thread_sum += (float)in[in_start + i * iter_stride];

-        out[{out_index}] = ({dtype})(sum / ({dtype})iters);
+        for (int offset = 16; offset > 0; offset >>= 1)
+            thread_sum += __shfl_down_sync(0xffffffff, thread_sum, offset);
+
+        __shared__ float warp_sums[{n_warps}];
+        int lane = threadIdx.x & 31;
+        int warp = threadIdx.x >> 5;
+        if (lane == 0) warp_sums[warp] = thread_sum;
+        __syncthreads();
+
+        if (threadIdx.x == 0) {{
+            float sum = 0.0f;
+            for (int w = 0; w < {n_warps}; w++) sum += warp_sums[w];
+            out[{out_index}] = ({dtype})(sum / (float)iters);
+        }}
    }}
 }}",
            dtype = dtype,
@@ -167,6 +180,8 @@ extern \"C\" {{
                .substitute('z', Expression::from(1))
                .simplify()
                .to_kernel(),
+            threads_per_block = threads_per_block,
+            n_warps = n_warps,
        );

        let (module, func) = if let Some((module, func)) = compile_cache.get(&kernel) {
@@ -183,9 +198,9 @@ extern \"C\" {{
            func,
            module,
            kernel,
-            (n_outputs, 1.into(), 1.into()), // grid
-            (1.into(), 1.into(), 1.into()),  // blocks (single-threaded)
-            0.into(),                        // shmem size
+            (n_outputs, 1.into(), 1.into()),                // grid
+            (threads_per_block.into(), 1.into(), 1.into()), // block
+            0.into(),                                       // shmem size
            FxHashMap::default(),
        )
    }
@@ -279,6 +294,9 @@ impl EgglogOp for KernelScatterNoCopy {
    fn rewrites(&self) -> Vec<Rule> {
        // Match KernelScatter and rewrite to KernelScatterNoCopy with ConsumedBuffer on dest.
        // ConsumedBuffer wraps dest to signal in-place modification.
+        // This is only valid when the destination buffer can also represent
+        // the scatter output layout. If dest is a strided/broadcast view,
+        // regular Scatter must first materialize a contiguous output copy.
        //
        // Two-phase resolution:
        // 1. During (run): cleanup rules delete ConsumedBuffer if dest is shared (another op uses it)
@@ -289,12 +307,31 @@ impl EgglogOp for KernelScatterNoCopy {
        // If ConsumedBuffer was deleted (shared case), cascade cleanup removes the dependent
        // ICons and KernelScatterNoCopy Op, leaving only KernelScatter.
        let mut rules = vec![
+            Rule::raw("(relation consumed_buffer_ilist_contains (IList IR))"),
+            Rule::raw(
+                "(rule
+                    ((= ?list (ICons ?head ?tail)))
+                    ((consumed_buffer_ilist_contains ?list ?head))
+                    :ruleset cleanup
+                    :name \"consumed-buffer-ilist-contains-head\"
+                )",
+            ),
+            Rule::raw(
+                "(rule
+                    ((= ?list (ICons ?head ?tail))
+                     (consumed_buffer_ilist_contains ?tail ?item))
+                    ((consumed_buffer_ilist_contains ?list ?item))
+                    :ruleset cleanup
+                    :name \"consumed-buffer-ilist-contains-tail\"
+                )",
+            ),
            // Rewrite: KernelScatter -> KernelScatterNoCopy with ConsumedBuffer
            Rule::raw(
                "(rule
                    (
                        (= ?scatter (Op (KernelScatter ?ds ?dst ?is ?istr ?ss ?os ?dt)
                            (ICons ?dest (ICons ?indexes (ICons ?src (INil))))))
+                        (= ?dst ?os)
                        (= ?dty (dtype ?src))
                    )
                    (
@@ -304,6 +341,7 @@ impl EgglogOp for KernelScatterNoCopy {
                        (union ?scatter ?nocopy)
                        (set (dtype ?nocopy) ?dty)
                    )
+                    :ruleset buffer_reuse
                    :name \"scatter to scatter-no-copy\"
                )",
            ),
@@ -313,6 +351,7 @@ impl EgglogOp for KernelScatterNoCopy {
                    ((= ?cb (ConsumedBuffer ?a))
                     (= ?dt (dtype ?a)))
                    ((set (dtype ?cb) ?dt))
+                    :ruleset dtype_prop
                    :name \"consumed-buffer-dtype\"
                )",
            ),
@@ -322,13 +361,28 @@ impl EgglogOp for KernelScatterNoCopy {
            "(rule
                ((= ?cb (ConsumedBuffer ?a))
                 (= ?op1 (Op ?k1 ?ilist1))
-                 (= ?ilist1 (ICons ?cb ?rest1))
+                 (consumed_buffer_ilist_contains ?ilist1 ?cb)
                 (= ?op2 (Op ?k2 ?ilist2))
                 (!= ?op1 ?op2)
-                 (= ?ilist2 (ICons ?a ?t2)))
+                 (consumed_buffer_ilist_contains ?ilist2 ?a))
                ((delete (ConsumedBuffer ?a)))
                :ruleset cleanup
-                :name \"consumed-buffer-cleanup-pos\"
+                :name \"consumed-buffer-cleanup-shared-op-use\"
+            )",
+        ));
+        // If a valid no-copy scatter survives cleanup, it dominates the copying scatter.
+        // This must run before base_cleanup resolves ConsumedBuffer back to the destination.
+        rules.push(Rule::raw(
+            "(rule
+                ((= ?cb (ConsumedBuffer ?dest))
+                 (= ?scatter (Op (KernelScatter ?ds ?dst ?is ?istr ?ss ?os ?dt)
+                     (ICons ?dest (ICons ?indexes (ICons ?src (INil))))))
+                 (= ?nocopy (Op (KernelScatterNoCopy ?ds ?dst ?is ?istr ?ss ?os ?dt)
+                     (ICons ?cb (ICons ?indexes (ICons ?src (INil)))))))
+                ((delete (Op (KernelScatter ?ds ?dst ?is ?istr ?ss ?os ?dt)
+                     (ICons ?dest (ICons ?indexes (ICons ?src (INil)))))))
+                :ruleset post_cleanup
+                :name \"scatter-no-copy-dominates-valid-consumed-buffer\"
            )",
        ));
        // Surviving ConsumedBuffers are valid — union with source and delete.
@@ -455,8 +509,8 @@ extern \"C\" {{
            func,
            module,
            scatter_kernel,
-            (n_src, 1.into(), 1.into()),
-            (1.into(), 1.into(), 1.into()),
+            (n_src.ceil_div(256), 1.into(), 1.into()),
+            (256.into(), 1.into(), 1.into()),
            0.into(),
            FxHashMap::default(),
        )
@@ -659,6 +713,7 @@ impl EgglogOp for KernelBatchMatVec {
                    (union ?sum ?bmv)
                    (set (dtype ?bmv) (F32))
                )
+                :ruleset matmul_backend
                :name \"batch mat-vec\"
            )"
        )]
@@ -939,6 +994,7 @@ impl EgglogOp for KernelBatchMatMul {
                    (union ?sum ?bmm)
                    (set (dtype ?bmm) (F32))
                )
+                :ruleset matmul_backend
                :name \"batch matmul\"
            )"
        )]
@@ -1178,6 +1234,7 @@ impl EgglogOp for KernelSoftmax {
                    (union ?sm ?ksm)
                    (set (dtype ?ksm) (F32))
                )
+                :ruleset kernel_lower
                :name \"softmax-to-kernel-f32\"
            )",
            ),
@@ -1450,6 +1507,7 @@ impl EgglogOp for KernelExp {
                    (union ?exp2 ?kexp)
                    (set (dtype ?kexp) ?dt)
                )
+                :ruleset direct_kernel
                :name \"direct-exp-fusion\"
            )",
            ),
@@ -1544,8 +1602,8 @@ extern \"C\" {{
            func,
            module,
            kernel,
-            (out_size.ceil_div(128), 1.into(), 1.into()),
-            (out_size.min(128), 1.into(), 1.into()),
+            (out_size.ceil_div(256), 1.into(), 1.into()),
+            (out_size.min(256), 1.into(), 1.into()),
            0.into(),
            FxHashMap::default(),
        )
@@ -1611,9 +1669,17 @@ impl EgglogOp for KernelSigmoid {

    fn rewrites(&self) -> Vec<Rule> {
        vec![
-            // Match the HLIR pattern directly: Recip(Add(Exp2(Mul(Mul(x, -1), log2e)), 1))
+            // Stage the HLIR sigmoid pattern through a small marker so repeated
+            // default passes do not re-run one large join over every Mul/Add/Recip.
            Rule::raw(
-                "(rule
+                "(datatype*
+                    (KernelSigmoidScaledState
+                        (MkKernelSigmoidScaledState IR EList EList DType)
+                    )
+                )
+                (function kernel_sigmoid_scaled (IR) KernelSigmoidScaledState :merge new)
+
+                (rule
                (
                    (= ?neg1 (Op (Constant ?nv) (INil)))
                    (< ?nv -0.99)
@@ -1623,19 +1689,33 @@ impl EgglogOp for KernelSigmoid {
                    (> ?lv 1.44)
                    (< ?lv 1.45)
                    (= ?scaled (Op (Mul ?shape ?neg_out_stride ?log2e_stride ?scaled_stride) (ICons ?neg_x (ICons ?log2e (INil)))))
+                    (= ?dt (dtype ?x))
+                )
+                (
+                    (set (kernel_sigmoid_scaled ?scaled)
+                        (MkKernelSigmoidScaledState ?x ?shape ?x_stride ?dt))
+                )
+                :ruleset direct_kernel
+                :name \"direct-sigmoid-scaled-marker\"
+            )
+
+                (rule
+                (
+                    (= ?scaled_state (kernel_sigmoid_scaled ?scaled))
+                    (= ?scaled_state (MkKernelSigmoidScaledState ?x ?shape ?x_stride ?dt))
                    (= ?exp2 (Op (Exp2 ?shape ?scaled_stride ?exp_stride) (ICons ?scaled (INil))))
                    (= ?one (Op (Constant ?ov) (INil)))
                    (> ?ov 0.99)
                    (< ?ov 1.01)
                    (= ?plus_one (Op (Add ?shape ?exp_stride ?one_stride ?add_stride) (ICons ?exp2 (ICons ?one (INil)))))
                    (= ?sig_out (Op (Recip ?shape ?add_stride ?out_stride) (ICons ?plus_one (INil))))
-                    (= ?dt (dtype ?x))
                )
                (
                    (let ?ksig (Op (KernelSigmoid ?shape ?x_stride ?out_stride ?dt) (ICons ?x (INil))))
                    (union ?sig_out ?ksig)
                    (set (dtype ?ksig) ?dt)
                )
+                :ruleset direct_kernel
                :name \"direct-sigmoid-fusion\"
            )",
            ),
@@ -1730,8 +1810,8 @@ extern \"C\" {{
            func,
            module,
            kernel,
-            (out_size.ceil_div(128), 1.into(), 1.into()),
-            (out_size.min(128), 1.into(), 1.into()),
+            (out_size.ceil_div(256), 1.into(), 1.into()),
+            (out_size.min(256), 1.into(), 1.into()),
            0.into(),
            FxHashMap::default(),
        )
--- a/crates/luminal_cuda_lite/src/kernel/to_host.rs
+++ b/crates/luminal_cuda_lite/src/kernel/to_host.rs
@@ -13,6 +13,7 @@ use itertools::Itertools;
 use luminal::{
    egglog_utils::{api::Rule, base::OP_KIND},
    graph::LLIRGraph,
+    hlir::{LoopEnd, LoopInput, LoopInputStatic, LoopOutput, LoopOutputSelect, LoopStart},
    op::{EgglogOp, LLIROp},
    prelude::{
        petgraph::{Direction, algo::toposort, visit::EdgeRef},
@@ -22,10 +23,11 @@ use luminal::{
 use tracing::{Level, enabled, span};

 use crate::{
-    host::HostOp,
+    host::{DeviceBuffer, HostOp},
    kernel::{
        CudaFunctionExt, CudaGraphExecHandle, CudaGraphHandle, KernelOp, create_cuda_event,
        destroy_cuda_event,
+        fusion::region_codegen::{self, CompileUnit},
        hlir::{clear_global_dyn_dims, get_global_dyn_dims, set_global_dyn_dims},
    },
    runtime::partition_marked_convex,
@@ -46,8 +48,12 @@ struct CompiledKernel {
    shared_mem: Expression,
    /// Input node indices (for buffer lookup)
    inputs: Vec<NodeIndex>,
+    /// Human-readable labels for input nodes, for launch diagnostics.
+    input_labels: Vec<String>,
    /// Reference to the KernelOp for trait methods
    kernel_op: Arc<Box<dyn KernelOp>>,
+    /// Whether this compiled CUDA function has a trailing dyn_dims parameter.
+    has_dyn_dims_param: bool,
    /// Internal buffers allocated for this kernel
    internal_bufs: Vec<CudaSlice<u8>>,
    /// Device constants from compile()
@@ -67,7 +73,9 @@ impl CompiledKernel {
        block: (Expression, Expression, Expression),
        shared_mem: Expression,
        inputs: Vec<NodeIndex>,
+        input_labels: Vec<String>,
        kernel_op: Arc<Box<dyn KernelOp>>,
+        has_dyn_dims_param: bool,
        constants: FxHashMap<char, CudaSlice<u8>>,
        kernel_name: &'static str,
    ) -> Self {
@@ -78,7 +86,9 @@ impl CompiledKernel {
            block,
            shared_mem,
            inputs,
+            input_labels,
            kernel_op,
+            has_dyn_dims_param,
            internal_bufs: Vec::new(),
            constants,
            graph_node: None,
@@ -225,7 +235,7 @@ impl HostOp for CudaGraphOp {
        stream: &Arc<CudaStream>,
        _self_node: NodeIndex,
        _inputs: &[NodeIndex],
-        buffers: &FxHashMap<NodeIndex, &CudaSlice<u8>>,
+        buffers: &FxHashMap<NodeIndex, DeviceBuffer>,
        dyn_map: &FxHashMap<char, usize>,
    ) -> anyhow::Result<()> {
        self.execute_internal(stream, buffers, dyn_map)
@@ -257,6 +267,40 @@ impl HostOp for CudaGraphOp {
            .collect()
    }

+    fn extra_buffer_lifetimes(&self) -> Option<Vec<(NodeIndex, usize, usize)>> {
+        let state = self.state.borrow();
+        let mut lifetimes: FxHashMap<NodeIndex, (usize, usize)> = FxHashMap::default();
+        let max_step = state.kernels.len().saturating_sub(1);
+
+        let mut touch = |node: NodeIndex, step: usize| {
+            lifetimes
+                .entry(node)
+                .and_modify(|(first, last)| {
+                    *first = (*first).min(step);
+                    *last = (*last).max(step);
+                })
+                .or_insert((step, step));
+        };
+
+        for (step, kernel) in state.kernels.iter().enumerate() {
+            for &input in &kernel.inputs {
+                touch(input, step);
+            }
+            touch(kernel.node, step);
+        }
+
+        for node in self.extra_buffer_nodes() {
+            lifetimes.entry(node).or_insert((0, max_step));
+        }
+
+        Some(
+            lifetimes
+                .into_iter()
+                .map(|(node, (start, end))| (node, start, end))
+                .collect(),
+        )
+    }
+
    fn extra_buffer_sizes(&self) -> FxHashMap<NodeIndex, Expression> {
        self.buffer_sizes.clone()
    }
@@ -267,11 +311,64 @@ impl HostOp for CudaGraphOp {
 }

 impl CudaGraphOp {
+    fn expected_kernel_inputs(kernel_name: &str) -> Option<usize> {
+        match kernel_name {
+            "Constant" | "Iota" => Some(0),
+            "MaxReduce" | "MeanReduce" | "SumReduce" | "Cast" | "Exp" | "Exp2" | "Log2" | "Sin"
+            | "Recip" | "Sigmoid" | "Softmax" | "Sqrt" => Some(1),
+            "Add" | "BatchMatMul" | "BatchMatVec" | "Embed" | "Gather" | "LessThan" | "Mod"
+            | "Mul" => Some(2),
+            "Scatter" | "ScatterNoCopy" => Some(3),
+            _ => None,
+        }
+    }
+
+    fn kernel_requires_output_buffer(
+        kernel: &CompiledKernel,
+        dyn_map: &FxHashMap<char, usize>,
+    ) -> bool {
+        kernel.kernel_op.output_size().exec(dyn_map).unwrap_or(1) != 0
+            && kernel.kernel_op.output_aliases_input().is_none()
+    }
+
+    fn validate_kernel_pointers(
+        kernel: &CompiledKernel,
+        output_ptr: u64,
+        input_ptrs: &[u64],
+        dyn_map: &FxHashMap<char, usize>,
+    ) -> anyhow::Result<()> {
+        if Self::kernel_requires_output_buffer(kernel, dyn_map) && output_ptr == 0 {
+            anyhow::bail!(
+                "missing output buffer for CUDA kernel {} at LLIR node {:?}",
+                kernel.kernel_name,
+                kernel.node,
+            );
+        }
+
+        for (idx, (input_node, input_ptr)) in kernel.inputs.iter().zip(input_ptrs).enumerate() {
+            if *input_ptr == 0 {
+                let input_label = kernel
+                    .input_labels
+                    .get(idx)
+                    .map(String::as_str)
+                    .unwrap_or("unknown");
+                anyhow::bail!(
+                    "missing input buffer {idx} for CUDA kernel {} at LLIR node {:?}; input LLIR node {:?} ({input_label})",
+                    kernel.kernel_name,
+                    kernel.node,
+                    input_node,
+                );
+            }
+        }
+
+        Ok(())
+    }
+
    /// Execute the CUDA graph with the given buffers and dynamic dimensions.
    fn execute_internal(
        &self,
        stream: &Arc<CudaStream>,
-        buffers: &FxHashMap<NodeIndex, &CudaSlice<u8>>,
+        buffers: &FxHashMap<NodeIndex, DeviceBuffer>,
        dyn_map: &FxHashMap<char, usize>,
    ) -> anyhow::Result<()> {
        let mut state = self.state.borrow_mut();
@@ -302,8 +399,10 @@ impl CudaGraphOp {
                kernel.internal_bufs = kernel.kernel_op.allocate_internal_buffers(stream, dyn_map);
            }
        }
-        // Force full rebuild when dims change (debug: testing if update_kernel_node is the issue)
-        if dyn_map_changed || needs_internal_realloc {
+        // Only force full rebuild when internal buffer sizes change.
+        // Dim-only changes (e.g. position offset `p` incrementing each decode step) are
+        // handled by updating the dyn_dims device buffer + kernel node params in-place.
+        if needs_internal_realloc {
            state.cuda_graph = None;
            state.cuda_graph_exec = None;
            state.node_to_graph_node.clear();
@@ -340,7 +439,7 @@ impl CudaGraphOp {
        let mut current_buffer_ptrs: FxHashMap<NodeIndex, u64> = FxHashMap::default();
        for &node in &self.buffer_nodes {
            if let Some(buf) = buffers.get(&node) {
-                current_buffer_ptrs.insert(node, buf.device_ptr(stream).0);
+                current_buffer_ptrs.insert(node, buf.ptr());
            }
        }

@@ -388,13 +487,26 @@ impl CudaGraphOp {
                    .iter()
                    .map(|inp| current_buffer_ptrs.get(inp).copied().unwrap_or(0))
                    .collect();
+                Self::validate_kernel_pointers(kernel, output_ptr, &input_ptrs, dyn_map)?;
+                let kernel_dyn_dims_ptr = if kernel.has_dyn_dims_param {
+                    dyn_dims_ptr
+                } else {
+                    0
+                };
+                if kernel.has_dyn_dims_param && kernel_dyn_dims_ptr == 0 {
+                    anyhow::bail!(
+                        "missing dyn_dims buffer for CUDA kernel {} at LLIR node {:?}",
+                        kernel.kernel_name,
+                        kernel.node,
+                    );
+                }

                let param_values = kernel.kernel_op.build_params(
                    stream,
                    output_ptr,
                    &input_ptrs,
                    &kernel.internal_bufs,
-                    dyn_dims_ptr,
+                    kernel_dyn_dims_ptr,
                );
                state.kernel_params[idx] = UnifiedKernelParams::new(param_values);
            }
@@ -421,6 +533,19 @@ impl CudaGraphOp {
                    kernel.block.1.exec(dyn_map).unwrap() as u32,
                    kernel.block.2.exec(dyn_map).unwrap() as u32,
                );
+                if grid_dim.0 == 0
+                    || grid_dim.1 == 0
+                    || grid_dim.2 == 0
+                    || block_dim.0 == 0
+                    || block_dim.1 == 0
+                    || block_dim.2 == 0
+                {
+                    anyhow::bail!(
+                        "invalid CUDA launch dimensions for kernel {} at LLIR node {:?}: grid={grid_dim:?} block={block_dim:?}",
+                        kernel.kernel_name,
+                        kernel.node,
+                    );
+                }
                let shared_mem = kernel.shared_mem.exec(dyn_map).unwrap() as u32;
                let cu_func = unsafe { kernel.function.raw_function() };

@@ -449,7 +574,7 @@ impl CudaGraphOp {
        &self,
        state: &mut std::cell::RefMut<'_, CudaGraphOpState>,
        stream: &Arc<CudaStream>,
-        buffers: &FxHashMap<NodeIndex, &CudaSlice<u8>>,
+        buffers: &FxHashMap<NodeIndex, DeviceBuffer>,
        dyn_map: &FxHashMap<char, usize>,
    ) -> anyhow::Result<()> {
        let ctx = stream.context().clone();
@@ -471,7 +596,7 @@ impl CudaGraphOp {
        let mut buffer_ptrs: FxHashMap<NodeIndex, u64> = FxHashMap::default();
        for &node in &self.buffer_nodes {
            if let Some(buf) = buffers.get(&node) {
-                buffer_ptrs.insert(node, buf.device_ptr(stream).0);
+                buffer_ptrs.insert(node, buf.ptr());
            }
        }

@@ -518,6 +643,19 @@ impl CudaGraphOp {
                kernel.block.1.exec(dyn_map).unwrap() as u32,
                kernel.block.2.exec(dyn_map).unwrap() as u32,
            );
+            if grid_dim.0 == 0
+                || grid_dim.1 == 0
+                || grid_dim.2 == 0
+                || block_dim.0 == 0
+                || block_dim.1 == 0
+                || block_dim.2 == 0
+            {
+                anyhow::bail!(
+                    "invalid CUDA launch dimensions for kernel {} at LLIR node {:?}: grid={grid_dim:?} block={block_dim:?}",
+                    kernel.kernel_name,
+                    kernel.node,
+                );
+            }
            let shared_mem = kernel.shared_mem.exec(dyn_map).unwrap() as u32;

            let output_ptr = buffer_ptrs.get(&kernel.node).copied().unwrap_or(0);
@@ -526,18 +664,41 @@ impl CudaGraphOp {
                .iter()
                .map(|inp| buffer_ptrs.get(inp).copied().unwrap_or(0))
                .collect();
+            Self::validate_kernel_pointers(kernel, output_ptr, &input_ptrs, dyn_map)?;
+            let kernel_dyn_dims_ptr = if kernel.has_dyn_dims_param {
+                dyn_dims_ptr
+            } else {
+                0
+            };
+            if kernel.has_dyn_dims_param && kernel_dyn_dims_ptr == 0 {
+                anyhow::bail!(
+                    "missing dyn_dims buffer for CUDA kernel {} at LLIR node {:?}",
+                    kernel.kernel_name,
+                    kernel.node,
+                );
+            }

            let param_values = kernel.kernel_op.build_params(
                stream,
                output_ptr,
                &input_ptrs,
                &kernel.internal_bufs,
-                dyn_dims_ptr,
+                kernel_dyn_dims_ptr,
            );
            let mut params = UnifiedKernelParams::new(param_values);

            let cu_func = unsafe { kernel.function.raw_function() };
            let kernel_node = kernel.node;
+            if std::env::var_os("LUMINAL_CUDA_DEBUG_GRAPH").is_some() {
+                eprintln!(
+                    "cuGraphAddKernelNode kernel={} node={:?} grid={grid_dim:?} block={block_dim:?} shared_mem={shared_mem} inputs={} has_dyn={} params={}",
+                    kernel.kernel_name,
+                    kernel.node,
+                    kernel.inputs.len(),
+                    kernel.has_dyn_dims_param,
+                    params.values.len(),
+                );
+            }

            // Get timing event for this index (separate access from kernels)
            let timing_event = if tracing_enabled {
@@ -653,6 +814,41 @@ pub fn kernel_to_host(
    }

    let kernel_subgraphs = partition_marked_convex(llir_graph, &kernel_ops_in_graph).unwrap();
+    // Compute the set of FS / FE / FusedX nodes globally absorbed by some
+    // FusionEnd in the LLIR. Used by `build_compile_units` to suppress
+    // standalone marker compile units for shared FS leaves whose consumers
+    // live in a different convex subgraph than the FS itself.
+    let globally_absorbed = region_codegen::globally_absorbed_markers(llir_graph);
+
+    let name_of = |graph: &LLIRGraph, idx: NodeIndex| -> Option<&'static str> {
+        graph
+            .node_weight(idx)
+            .and_then(|op| op.to_dialect::<dyn KernelOp>().map(|k| k.kernel_name()))
+    };
+    let is_transparent_input = |graph: &LLIRGraph, node: NodeIndex| -> bool {
+        name_of(graph, node) == Some("FusionStart")
+            || graph[node].to_op::<LoopStart>().is_some()
+            || graph[node].to_op::<LoopEnd>().is_some()
+            || graph[node].to_op::<LoopInput>().is_some()
+            || graph[node].to_op::<LoopInputStatic>().is_some()
+            || graph[node].to_op::<LoopOutput>().is_some()
+            || graph[node].to_op::<LoopOutputSelect>().is_some()
+    };
+    let resolve_transparent_input = |graph: &LLIRGraph, mut node: NodeIndex| -> NodeIndex {
+        let mut visited = FxHashSet::default();
+        while visited.insert(node) && is_transparent_input(graph, node) {
+            let Some(pred) = graph
+                .edges_directed(node, Direction::Incoming)
+                .sorted_by_key(|e| e.id())
+                .map(|e| e.source())
+                .next()
+            else {
+                break;
+            };
+            node = pred;
+        }
+        node
+    };

    // Track which kernel node belongs to which CudaGraphOp (for later edge creation)
    let mut kernel_to_cuda_graph: FxHashMap<NodeIndex, NodeIndex> = FxHashMap::default();
@@ -670,6 +866,7 @@ pub fn kernel_to_host(
        let mut all_dyn_dims = FxHashSet::default();
        let mut all_buffer_nodes = FxHashSet::default();
        let mut all_buffer_sizes: FxHashMap<NodeIndex, Expression> = FxHashMap::default();
+        let mut external_inputs = FxHashSet::default();

        // Pre-scan: collect all dynamic vars from all kernel ops without compiling.
        // This uses KernelOp::all_dyn_vars() which inspects struct expression fields.
@@ -683,49 +880,151 @@ pub fn kernel_to_host(
        // Set global dyn dims ordering so compiles use consistent indices
        let mut global_dyn_dims: Vec<char> = all_dyn_dims.iter().copied().collect();
        global_dyn_dims.sort();
-        if !global_dyn_dims.is_empty() {
-            set_global_dyn_dims(global_dyn_dims.clone());
-        }
+        set_global_dyn_dims(global_dyn_dims.clone());

-        // Compile all kernels with global ordering for correct dyn_dims indices
-        let mut kernels = Vec::with_capacity(topo_order.len());
-        for kernel_node_idx in &topo_order {
-            let kernel_op_ref = llir_graph[*kernel_node_idx]
-                .to_dialect::<dyn KernelOp>()
-                .unwrap();
+        // Group the topo order into compile units: each FusionEnd-rooted
+        // region collapses to a single CompileUnit::Region (one fused
+        // CUDA kernel for the whole DAG); everything else stays as
+        // CompileUnit::Single (the existing per-op compile path).
+        let compile_units =
+            region_codegen::build_compile_units(&topo_order, llir_graph, &globally_absorbed);

-            let (kernel_function, _, _kernel_str, grid, block, shared_mem, constants) =
-                kernel_op_ref.compile(cuda_stream, kernel_cache);
+        // Compile all units with global ordering for correct dyn_dims indices
+        let mut kernels = Vec::with_capacity(compile_units.len());
+        for unit in &compile_units {
+            match unit {
+                CompileUnit::Single(kernel_node_idx) => {
+                    let kernel_op_ref = llir_graph[*kernel_node_idx]
+                        .to_dialect::<dyn KernelOp>()
+                        .unwrap();

-            // Collect inputs from graph edges
-            let mut inputs: Vec<NodeIndex> = llir_graph
-                .edges_directed(*kernel_node_idx, Direction::Incoming)
-                .sorted_by_key(|e| e.id())
-                .map(|e| e.source())
-                .collect_vec();
+                    let (kernel_function, _, kernel_str, grid, block, shared_mem, constants) =
+                        kernel_op_ref.compile(cuda_stream, kernel_cache);
+                    let has_dyn_dims_param = kernel_str.contains("dyn_dims");

-            // Collect buffer nodes and sizes
-            // Only add kernel nodes with non-zero output size (MegakernelOps have size 0)
-            let output_size = kernel_op_ref.output_size();
-            if output_size.exec(&FxHashMap::default()).unwrap_or(1) != 0 {
-                all_buffer_nodes.insert(*kernel_node_idx);
-                all_buffer_sizes.insert(*kernel_node_idx, output_size);
+                    // Collect inputs from graph edges
+                    let inputs: Vec<NodeIndex> = llir_graph
+                        .edges_directed(*kernel_node_idx, Direction::Incoming)
+                        .sorted_by_key(|e| e.id())
+                        .map(|e| e.source())
+                        .map(|input| resolve_transparent_input(llir_graph, input))
+                        .collect_vec();
+                    if let Some(expected_inputs) =
+                        CudaGraphOp::expected_kernel_inputs(kernel_op_ref.kernel_name())
+                    {
+                        assert_eq!(
+                            inputs.len(),
+                            expected_inputs,
+                            "invalid input arity for CUDA kernel {} at LLIR node {:?}",
+                            kernel_op_ref.kernel_name(),
+                            kernel_node_idx,
+                        );
+                    }
+                    let input_labels = inputs
+                        .iter()
+                        .map(|&input| {
+                            name_of(llir_graph, input)
+                                .map(str::to_string)
+                                .unwrap_or_else(|| format!("{:?}", llir_graph[input]))
+                        })
+                        .collect_vec();
+
+                    // Collect buffer nodes and sizes
+                    // Only add kernel nodes with non-zero output size (MegakernelOps have size 0)
+                    let output_size = kernel_op_ref.output_size();
+                    if output_size.exec(&FxHashMap::default()).unwrap_or(1) != 0 {
+                        all_buffer_nodes.insert(*kernel_node_idx);
+                        all_buffer_sizes.insert(*kernel_node_idx, output_size);
+                    }
+                    all_buffer_nodes.extend(inputs.iter().copied());
+                    external_inputs.extend(
+                        inputs
+                            .iter()
+                            .copied()
+                            .filter(|input| !subgraph.contains(input)),
+                    );
+
+                    let kernel_op: Arc<Box<dyn KernelOp>> = Arc::clone(kernel_op_ref);
+
+                    kernels.push(CompiledKernel::new(
+                        *kernel_node_idx,
+                        kernel_function,
+                        grid,
+                        block,
+                        shared_mem,
+                        inputs,
+                        input_labels,
+                        kernel_op.clone(),
+                        has_dyn_dims_param,
+                        constants,
+                        kernel_op.kernel_name(),
+                    ));
+                }
+                CompileUnit::Region(region) => {
+                    // Generate one fused CUDA kernel for the whole region.
+                    let compiled = region_codegen::compile_region(
+                        region,
+                        llir_graph,
+                        cuda_stream,
+                        kernel_cache,
+                    );
+                    let has_dyn_dims_param = compiled.kernel_str.contains("dyn_dims");
+
+                    // The region's CompiledKernel is keyed on the FE node
+                    // (so FE provides trait methods like output_size /
+                    // build_params) but its `inputs` are the external
+                    // producers, not FE's literal LLIR predecessors —
+                    // those are interior FusedX nodes that don't exist
+                    // as buffer-bearing nodes from the host's view.
+                    let fe_op_ref = llir_graph[region.fe_node]
+                        .to_dialect::<dyn KernelOp>()
+                        .unwrap();
+
+                    let inputs: Vec<NodeIndex> = region
+                        .external_inputs
+                        .iter()
+                        .copied()
+                        .map(|input| resolve_transparent_input(llir_graph, input))
+                        .collect();
+                    let input_labels = inputs
+                        .iter()
+                        .map(|&input| {
+                            name_of(llir_graph, input)
+                                .map(str::to_string)
+                                .unwrap_or_else(|| format!("{:?}", llir_graph[input]))
+                        })
+                        .collect_vec();
+
+                    let output_size = fe_op_ref.output_size();
+                    if output_size.exec(&FxHashMap::default()).unwrap_or(1) != 0 {
+                        all_buffer_nodes.insert(region.fe_node);
+                        all_buffer_sizes.insert(region.fe_node, output_size);
+                    }
+                    all_buffer_nodes.extend(inputs.iter().copied());
+                    external_inputs.extend(
+                        inputs
+                            .iter()
+                            .copied()
+                            .filter(|input| !subgraph.contains(input)),
+                    );
+
+                    let kernel_op: Arc<Box<dyn KernelOp>> = Arc::clone(fe_op_ref);
+
+                    kernels.push(CompiledKernel::new(
+                        region.fe_node,
+                        compiled.function,
+                        compiled.grid,
+                        compiled.block,
+                        compiled.shared_mem,
+                        inputs,
+                        input_labels,
+                        kernel_op,
+                        has_dyn_dims_param,
+                        compiled.constants,
+                        "FusedRegion",
+                    ));
+                }
            }
-            all_buffer_nodes.extend(inputs.iter().copied());
-
-            let kernel_op: Arc<Box<dyn KernelOp>> = Arc::clone(kernel_op_ref);
-
-            kernels.push(CompiledKernel::new(
-                *kernel_node_idx,
-                kernel_function,
-                grid,
-                block,
-                shared_mem,
-                inputs,
-                kernel_op.clone(),
-                constants,
-                kernel_op.kernel_name(),
-            ));
        }

        // Get the possibly-extended global ordering (kernels may have discovered new dims)
@@ -765,16 +1064,17 @@ pub fn kernel_to_host(
        }
        cuda_graph_subgraphs.push((cuda_graph_node, subgraph.clone()));

-        // Find external inputs: nodes outside subgraph that have edges into subgraph
-        let external_inputs: FxHashSet<NodeIndex> = subgraph
-            .iter()
-            .flat_map(|&node| {
-                llir_graph
-                    .edges_directed(node, Direction::Incoming)
-                    .map(|e| e.source())
-                    .filter(|src| !subgraph.contains(src))
-            })
-            .collect();
+        // Find external inputs: nodes outside subgraph that have edges into
+        // subgraph. Also include normalized FusionStart predecessors, because
+        // the compiled kernels read from the concrete producer buffer rather
+        // than the marker node.
+        external_inputs.extend(subgraph.iter().flat_map(|&node| {
+            llir_graph
+                .edges_directed(node, Direction::Incoming)
+                .map(|e| e.source())
+                .map(|input| resolve_transparent_input(llir_graph, input))
+                .filter(|src| !subgraph.contains(src))
+        }));

        // Add edges from external inputs to CudaGraphOp
        for input in &external_inputs {
@@ -818,22 +1118,41 @@ pub fn kernel_to_host(
        }
    }

-    // Add collected edges (deduplicate), skipping back-edges to preserve DAG property
+    // Add each cross-CudaGraphOp dep edge iff it would carry new ordering
+    // information without closing a cycle. The previous topo-position gate
+    // ("skip when src_pos >= dst_pos") was too coarse: it dropped edges
+    // whose src happened to land later in the toposort than their dst even
+    // when no path dst→src actually existed, leaving consumers free to run
+    // before the producer wrote their input buffer (wrong outputs); and it
+    // also added edges that were already implied by an existing src→dst
+    // path (extra serialization, no new info).
    let edges_to_add: FxHashSet<(NodeIndex, NodeIndex)> = edges_to_add.into_iter().collect();
-    let topo = toposort(&*llir_graph, None).unwrap();
-    let mut topo_pos: FxHashMap<NodeIndex, usize> = FxHashMap::default();
-    for (i, n) in topo.iter().enumerate() {
-        topo_pos.insert(*n, i);
-    }
+    use petgraph::algo::has_path_connecting;
    for (src, dst) in edges_to_add {
-        // Only add forward edges (src before dst in topo order) to avoid creating cycles
-        let src_pos = topo_pos.get(&src).copied().unwrap_or(usize::MAX);
-        let dst_pos = topo_pos.get(&dst).copied().unwrap_or(usize::MAX);
-        if src_pos >= dst_pos {
-            continue; // Skip back-edges
+        if has_path_connecting(&*llir_graph, src, dst, None) {
+            continue; // already ordered src→dst by some path; edge redundant
        }
-        if !llir_graph.edges_connecting(src, dst).any(|_| true) {
-            llir_graph.add_edge(src, dst, ());
+        if has_path_connecting(&*llir_graph, dst, src, None) {
+            continue; // adding src→dst would close a cycle
+        }
+        llir_graph.add_edge(src, dst, ());
+    }
+
+    // Strip fully-absorbed marker nodes (FusionStart, nested FusionEnd,
+    // FusedX) from the LLIR. Region codegen has already folded them into
+    // a single fused CUDA function anchored at each region's root
+    // FusionEnd; the absorbed nodes have no consumers outside the region
+    // and never need their own buffers. Removing them keeps later
+    // per-execute walks (e.g., `allocate_intermediate_buffers`) from
+    // chewing through dead nodes every decode token.
+    //
+    // Root FusionEnd nodes are NOT in `globally_absorbed` (they were the
+    // walks' starting points), so we keep them — they're the kernel
+    // anchor for the region's compiled kernel.
+    for node in globally_absorbed {
+        // Defensive: only remove if the node still exists.
+        if llir_graph.node_weight(node).is_some() {
+            llir_graph.remove_node(node);
        }
    }
 }
--- a/crates/luminal_cuda_lite/src/lib.rs
+++ b/crates/luminal_cuda_lite/src/lib.rs
@@ -1,5 +1,7 @@
+pub mod dyn_backend;
 pub mod host;
 pub mod kernel;
+mod memory_analysis;
 pub mod runtime;
 use std::{
    ffi::{CStr, CString},
@@ -9,6 +11,8 @@ use std::{

 pub use cudarc;

+use cudarc::{cublaslt::CudaBlasLT, driver::CudaStream};
+
 #[cfg(test)]
 mod tests;

@@ -137,6 +141,25 @@ fn cuda_driver_diagnostics() -> (Option<i32>, Option<i32>) {
    (driver_version, None)
 }

+pub(crate) fn try_create_cublaslt(
+    stream: Arc<CudaStream>,
+) -> std::result::Result<Arc<CudaBlasLT>, String> {
+    match std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| CudaBlasLT::new(stream))) {
+        Ok(Ok(handle)) => Ok(Arc::new(handle)),
+        Ok(Err(err)) => Err(err.to_string()),
+        Err(payload) => {
+            let message = if let Some(message) = payload.downcast_ref::<String>() {
+                message.clone()
+            } else if let Some(message) = payload.downcast_ref::<&str>() {
+                message.to_string()
+            } else {
+                "cuBLASLt initialization panicked".to_string()
+            };
+            Err(message)
+        }
+    }
+}
+
 fn cuda_nvrtc_compile_options(target_arch: &str) -> Vec<String> {
    let mut options = cuda_nvrtc_include_paths()
        .into_iter()
@@ -186,9 +209,9 @@ fn get_cubin(program: nvrtc_sys::nvrtcProgram) -> Result<Vec<u8>, NvrtcError> {
    }

    let mut cubin = Vec::with_capacity(cubin_size);
-    cubin.resize(cubin_size, 0);
-    unsafe { nvrtc_sys::nvrtcGetCUBIN(program, cubin.as_mut_ptr()) }.result()?;
-    Ok(cubin.into_iter().map(|byte| byte as u8).collect())
+    cubin.resize(cubin_size, 0u8);
+    unsafe { nvrtc_sys::nvrtcGetCUBIN(program, cubin.as_mut_ptr() as *mut _) }.result()?;
+    Ok(cubin)
 }

 pub(crate) fn compile_module_image_for_current_device<S: AsRef<str>>(
--- a/crates/luminal_cuda_lite/src/memory_analysis.rs
+++ b/crates/luminal_cuda_lite/src/memory_analysis.rs
--- a/crates/luminal_cuda_lite/src/runtime.rs
+++ b/crates/luminal_cuda_lite/src/runtime.rs
--- a/crates/luminal_cuda_lite/src/tests/bucket_tests.rs
+++ b/crates/luminal_cuda_lite/src/tests/bucket_tests.rs
@@ -41,7 +41,7 @@ fn test_bucket_dispatch_simple() {
    rt.set_data(a, vec![1.0f32; 4]);

    let mut rng = SmallRng::seed_from_u64(42);
-    rt = cx.search_rng(rt, 5, &mut rng);
+    rt = cx.search_options(rt, SearchOptions::new(5), &mut rng);

    // Test bucket 1: s=1
    cx.set_dim('s', 1);
@@ -85,7 +85,7 @@ fn test_bucket_matmul_dynamic() {
    rt.set_data(b_tensor, b_data.clone());

    let mut rng = SmallRng::seed_from_u64(42);
-    rt = cx.search_rng(rt, 5, &mut rng);
+    rt = cx.search_options(rt, SearchOptions::new(5), &mut rng);

    // Execute at s=1
    cx.set_dim('s', 1);
@@ -140,7 +140,7 @@ fn test_bucket_results_match_unbucketed() {
    let input_data = random_f32_vec(12, seed, -1.0, 1.0);
    rt1.set_data(a1, input_data.clone());
    let mut rng1 = SmallRng::seed_from_u64(seed);
-    rt1 = cx1.search_rng(rt1, 5, &mut rng1);
+    rt1 = cx1.search_options(rt1, SearchOptions::new(5), &mut rng1);
    rt1.set_data(a1, input_data.clone());
    rt1.execute(&cx1.dyn_map);
    let result_unbucketed = rt1.get_f32(b1);
@@ -153,7 +153,7 @@ fn test_bucket_results_match_unbucketed() {
    let mut rt2 = CudaRuntime::initialize(stream.clone());
    rt2.set_data(a2, input_data.clone());
    let mut rng2 = SmallRng::seed_from_u64(seed);
-    rt2 = cx2.search_rng(rt2, 5, &mut rng2);
+    rt2 = cx2.search_options(rt2, SearchOptions::new(5), &mut rng2);
    rt2.set_data(a2, input_data.clone());
    rt2.execute(&cx2.dyn_map);
    let result_bucketed = rt2.get_f32(b2);
@@ -179,7 +179,7 @@ fn test_bucket_out_of_range_panics() {
    cx.set_dim('s', 1);
    rt.set_data(a, vec![1.0f32; 4]);
    let mut rng = SmallRng::seed_from_u64(42);
-    rt = cx.search_rng(rt, 3, &mut rng);
+    rt = cx.search_options(rt, SearchOptions::new(3), &mut rng);

    // s=10 is outside all buckets — should panic
    cx.set_dim('s', 10);
@@ -204,7 +204,7 @@ fn test_bucket_no_buckets_backward_compat() {
    let input_data = vec![1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
    rt.set_data(a, input_data.clone());
    let mut rng = SmallRng::seed_from_u64(42);
-    rt = cx.search_rng(rt, 3, &mut rng);
+    rt = cx.search_options(rt, SearchOptions::new(3), &mut rng);

    rt.set_data(a, input_data.clone());
    rt.execute(&cx.dyn_map);
@@ -249,7 +249,7 @@ fn test_bucket_switch_preserves_weights() {
    rt.set_data(b_tensor, b_data.clone());

    let mut rng = SmallRng::seed_from_u64(42);
-    rt = cx.search_rng(rt, 5, &mut rng);
+    rt = cx.search_options(rt, SearchOptions::new(5), &mut rng);

    // Execute with bucket 1 (s=1)
    cx.set_dim('s', 1);
@@ -305,7 +305,7 @@ fn test_bucket_multiple_executions_same_bucket() {
    cx.set_dim('s', 1);
    rt.set_data(a, vec![1.0f32; 4]);
    let mut rng = SmallRng::seed_from_u64(42);
-    rt = cx.search_rng(rt, 3, &mut rng);
+    rt = cx.search_options(rt, SearchOptions::new(3), &mut rng);

    // Execute at different sizes within the same bucket
    for s in [1, 2, 4, 8] {
--- a/crates/luminal_cuda_lite/src/tests/consumed_buffer_tests.rs
+++ b/crates/luminal_cuda_lite/src/tests/consumed_buffer_tests.rs
@@ -41,9 +41,8 @@ fn extract_all_kernel_names(cx: &mut Graph) -> Vec<String> {
    all_names
 }

-/// When dest is NOT shared with any other op, KernelScatterNoCopy should be available.
-/// The ConsumedBuffer cleanup rule should NOT fire because dest only appears inside
-/// the ConsumedBuffer (not in any other ICons).
+/// When dest is NOT shared with any other compute op, KernelScatterNoCopy should
+/// be the only scatter variant left after post-cleanup.
 #[test]
 fn test_scatter_nocopy_selected_when_dest_unshared() {
    let ctx = CudaContext::new(0).unwrap();
@@ -62,12 +61,17 @@ fn test_scatter_nocopy_selected_when_dest_unshared() {
    let names = extract_all_kernel_names(&mut cx);
    println!("All possible kernels: {:?}", names);

-    // KernelScatterNoCopy should be available (dest is not shared)
+    // KernelScatterNoCopy should be the only scatter variant (dest is not shared)
    assert!(
        names.iter().any(|n| n == "ScatterNoCopy"),
        "Expected ScatterNoCopy to be available but got: {:?}",
        names
    );
+    assert!(
+        !names.iter().any(|n| n == "Scatter"),
+        "Regular Scatter should be pruned when ScatterNoCopy is valid, got: {:?}",
+        names
+    );
 }

 /// When dest IS shared (used by another op besides the scatter), the ConsumedBuffer
@@ -109,8 +113,74 @@ fn test_scatter_nocopy_not_selected_when_dest_shared() {
    );
 }

+/// Shared-use detection must catch the destination in non-first input
+/// positions too. Gather takes indexes first and data second, so this would
+/// miss the unsafe read if cleanup only inspected the head of the input list.
+#[test]
+fn test_scatter_nocopy_not_selected_when_dest_shared_as_later_input() {
+    let ctx = CudaContext::new(0).unwrap();
+    ctx.bind_to_thread().unwrap();
+
+    let mut cx = Graph::default();
+
+    let dest = cx.tensor(10).persist();
+    let src = cx.tensor(3).persist();
+    let scatter_indexes = cx.tensor(3).as_dtype(DType::Int).persist();
+    let read_indexes = cx.tensor(1).as_dtype(DType::Int).persist();
+
+    let scatter_result = src.scatter(scatter_indexes, dest);
+    let _dest_also_read = dest.gather(read_indexes).output();
+    let _result = scatter_result.output();
+
+    let names = extract_all_kernel_names(&mut cx);
+    println!("All possible kernels: {:?}", names);
+
+    assert!(
+        !names.iter().any(|n| n == "ScatterNoCopy"),
+        "ScatterNoCopy should NOT be available when dest is read by another op, got: {:?}",
+        names
+    );
+    assert!(
+        names.iter().any(|n| n == "Scatter"),
+        "Expected regular Scatter but got: {:?}",
+        names
+    );
+}
+
+/// ScatterNoCopy aliases the destination buffer as the output, so it is only
+/// valid when the destination layout already matches the contiguous scatter
+/// output layout. Broadcast/expanded destinations need regular Scatter's
+/// copy-then-scatter materialization.
+#[test]
+fn test_scatter_nocopy_not_selected_for_expanded_dest_layout() {
+    let ctx = CudaContext::new(0).unwrap();
+    ctx.bind_to_thread().unwrap();
+
+    let mut cx = Graph::default();
+
+    let dest = cx.tensor(128).expand_dim(0, 4).persist();
+    let src = cx.tensor((4, 128)).persist();
+    let indexes = cx.tensor((4, 128)).as_dtype(DType::Int).persist();
+
+    let _result = src.scatter(indexes, dest).output();
+
+    let names = extract_all_kernel_names(&mut cx);
+    println!("All possible kernels: {:?}", names);
+
+    assert!(
+        !names.iter().any(|n| n == "ScatterNoCopy"),
+        "ScatterNoCopy should NOT be available when dest layout differs from output, got: {:?}",
+        names
+    );
+    assert!(
+        names.iter().any(|n| n == "Scatter"),
+        "Expected regular Scatter but got: {:?}",
+        names
+    );
+}
+
 /// Actually execute the scatter and verify correctness.
-/// Tests all possible extractions (both KernelScatter and KernelScatterNoCopy).
+/// Post-cleanup should force the valid no-copy extraction.
 #[test]
 fn test_scatter_execution_correctness() {
    let ctx = CudaContext::new(0).unwrap();
@@ -135,9 +205,8 @@ fn test_scatter_execution_correctness() {
    // Expected: [0.0, 10.0, 2.0, 20.0, 30.0]
    let expected = vec![0.0f32, 10.0, 2.0, 20.0, 30.0];

-    // Try many random extractions to cover both Scatter and ScatterNoCopy
+    // Try many random extractions; each valid choice should now use ScatterNoCopy.
    let mut rng = rand::rng();
-    let mut tested_scatter = false;
    let mut tested_nocopy = false;

    for _ in 0..50 {
@@ -180,27 +249,24 @@ fn test_scatter_execution_correctness() {

        let actual = rt.get_f32(result);

-        let variant = if has_nocopy {
-            tested_nocopy = true;
-            "ScatterNoCopy"
-        } else if has_scatter {
-            tested_scatter = true;
-            "Scatter"
-        } else {
-            "Unknown"
-        };
+        assert!(
+            has_nocopy,
+            "Expected ScatterNoCopy after post-cleanup, got no no-copy scatter"
+        );
+        assert!(
+            !has_scatter,
+            "Regular Scatter should be pruned when ScatterNoCopy is valid"
+        );
+        tested_nocopy = true;

        assert_eq!(
            actual, expected,
-            "Scatter result mismatch with variant {variant}: got {:?}, expected {:?}",
+            "Scatter result mismatch with ScatterNoCopy: got {:?}, expected {:?}",
            actual, expected
        );
    }

-    println!(
-        "Tested Scatter: {}, Tested ScatterNoCopy: {}",
-        tested_scatter, tested_nocopy
-    );
+    println!("Tested ScatterNoCopy: {}", tested_nocopy);
    assert!(
        tested_nocopy,
        "ScatterNoCopy was never selected in 50 attempts — can't verify correctness"
@@ -242,14 +308,28 @@ fn test_scatter_kv_cache_roundtrip() {

    rt = cx.search(rt, 5);

-    // Print which scatter variant was selected
-    for node in rt.llir_graph().node_weights() {
-        if let Some(k) = node.to_dialect::<dyn KernelOp>()
-            && k.kernel_name().contains("catter")
-        {
-            println!("Selected: {}", k.kernel_name());
+    // Print and verify which scatter variant was selected
+    let scatter_names: Vec<_> = rt
+        .kernel_names()
+        .iter()
+        .copied()
+        .filter(|name| name.contains("catter"))
+        .collect();
+    for name in rt.kernel_names() {
+        if name.contains("catter") {
+            println!("Selected: {name}");
        }
    }
+    assert!(
+        scatter_names.contains(&"ScatterNoCopy"),
+        "Expected ScatterNoCopy in KV-cache search result, got: {:?}",
+        scatter_names
+    );
+    assert!(
+        !scatter_names.contains(&"Scatter"),
+        "Regular Scatter should be pruned from KV-cache search result, got: {:?}",
+        scatter_names
+    );

    // Step 1: Initialize cache to zeros, scatter 10.0 at position 0
    rt.set_data(cache_in, vec![0.0f32; 5]);
@@ -301,9 +381,8 @@ fn test_scatter_kv_cache_roundtrip() {
 }

 /// Test scatter with TWO cache buffers and dual outputs (closer to llama K+V pattern).
-/// Also verifies graph_break interaction.
 #[test]
-fn test_scatter_dual_cache_with_graph_break() {
+fn test_scatter_dual_cache() {
    let ctx = CudaContext::new(0).unwrap();
    ctx.bind_to_thread().unwrap();
    let stream = ctx.default_stream();
@@ -345,19 +424,31 @@ fn test_scatter_dual_cache_with_graph_break() {
    rt.set_data(v_new, vec![3.0f32]);
    rt.set_data(indexes, vec![0i32]);

-    // Use seeded search for deterministic scatter variant selection.
-    // Seed 0 reliably selects Scatter (not ScatterNoCopy) for both caches.
+    // Use seeded search for deterministic variant selection.
    let mut rng = rand::rngs::SmallRng::seed_from_u64(0);
-    rt = cx.search_rng(rt, 5, &mut rng);
+    rt = cx.search_options(rt, SearchOptions::new(5), &mut rng);

-    // Print selected variants
-    for node in rt.llir_graph().node_weights() {
-        if let Some(k) = node.to_dialect::<dyn KernelOp>()
-            && k.kernel_name().contains("catter")
-        {
-            println!("Dual test selected: {}", k.kernel_name());
+    // Print and verify selected variants
+    let scatter_names: Vec<_> = rt
+        .kernel_names()
+        .iter()
+        .copied()
+        .filter(|name| name.contains("catter"))
+        .collect();
+    for name in rt.kernel_names() {
+        if name.contains("catter") {
+            println!("Dual test selected: {name}");
        }
    }
+    assert!(
+        !scatter_names.is_empty(),
+        "Expected scatter kernels in dual-cache search result"
+    );
+    assert!(
+        scatter_names.iter().all(|name| *name == "ScatterNoCopy"),
+        "Expected only ScatterNoCopy in dual-cache search result, got: {:?}",
+        scatter_names
+    );

    // Step 1: scatter k=2.0, v=3.0 at position 0
    rt.set_data(k_cache, vec![0.0f32; 5]);
--- a/crates/luminal_cuda_lite/src/tests/cublaslt_rewrite_tests.rs
+++ b/crates/luminal_cuda_lite/src/tests/cublaslt_rewrite_tests.rs
--- a/crates/luminal_cuda_lite/src/tests/flashinfer.rs
+++ b/crates/luminal_cuda_lite/src/tests/flashinfer.rs
@@ -0,0 +1,941 @@
+//! Unit + integration tests for the FlashInfer port.
+//!
+//! Four layers:
+//! 1. Pure egglog metadata (no GPU): trait wiring, sort + rewrite parse cleanly.
+//! 2. Egglog rule firing (no GPU): the rule unifies on a real paged-attention
+//!    HLIR and does NOT fire on bare attention or unrelated matmul/Gather mixes.
+//! 3. Mask op correctness (GPU): `ComputeAttnMask` produces the right (s, c) mask.
+//! 4. Full kernel correctness (GPU + JIT): direct `FlashInferAttention::execute`
+//!    compared against a luminal-compiled reference attention graph.
+//!
+//! GPU-dependent tests short-circuit when no CUDA device is available.
+
+use std::sync::{Arc, Mutex};
+
+use cudarc::driver::{CudaStream, DevicePtr};
+use luminal::egglog_utils::{hlir_to_egglog, run_egglog};
+use luminal::op::{EgglogOp, IntoEgglogOp};
+use luminal::prelude::*;
+
+use crate::host::flashinfer::FlashInferAttention;
+use crate::host::{ComputeAttnMask, DeviceBuffer, HostOp};
+use crate::runtime::CudaRuntime;
+use crate::tests::utilities::get_cuda_stream;
+
+/// Look up an op in `CudaRuntime::Ops::into_vec()` by its egglog sort name.
+fn ops_contains_sort(name: &str) -> bool {
+    let ops = <CudaRuntime as luminal::op::Runtime>::Ops::into_vec();
+    ops.iter().any(|op| {
+        // `SortDef` is opaque; its Debug repr starts with the sort name.
+        let sort_dbg = format!("{:?}", op.sort());
+        sort_dbg.contains(name)
+    })
+}
+
+// ─── Test-wide model dimensions ───────────────────────────────────────────
+//
+// Small Llama-shaped GQA model: nheads=8, kv_heads=2, group=4, head_dim=64.
+// Chosen so HEAD_DIM ∈ {64, 128, 256} (FlashInfer constraint) and the test
+// suite fits in O(1ms) of GPU time per case.
+
+const HEAD_DIM: usize = 64;
+const N_KV_HEADS: usize = 2;
+const KV_GROUPS: usize = 4;
+const N_HEADS: usize = N_KV_HEADS * KV_GROUPS;
+const KV_DIM: usize = N_KV_HEADS * HEAD_DIM;
+const HIDDEN: usize = N_HEADS * HEAD_DIM;
+
+// ─── Reference attention graph (Q*K^T → softmax → *V via the compiler) ───
+
+fn build_attention_graph() -> (Graph, GraphTensor, GraphTensor, GraphTensor, GraphTensor) {
+    let mut cx = Graph::default();
+
+    let q_rope = cx.named_tensor("q_rope", ('s', HIDDEN));
+    let k_ctx = cx.named_tensor("k_ctx", ('c', KV_DIM));
+    let v_ctx_input = cx.named_tensor("v_ctx", ('c', KV_DIM));
+
+    let q = (q_rope * 1.0).split_dims(1, HEAD_DIM).transpose(0, 1);
+    let k = k_ctx.split_dims(1, HEAD_DIM).permute((1, 2, 0));
+    let v_ctx = v_ctx_input.split_dims(1, HEAD_DIM).transpose(0, 1);
+
+    // GQA broadcast: zero-stride Mul by 1.0
+    let k = k.expand_dim(1, KV_GROUPS).merge_dims(0, 1) * 1.0;
+    let v_ctx = v_ctx.expand_dim(1, KV_GROUPS).merge_dims(0, 1) * 1.0;
+
+    let scores = q.matmul(k) / (HEAD_DIM as f32).sqrt();
+    let weights = scores.softmax(2);
+    let out = weights.matmul(v_ctx);
+
+    let attn_out = out.transpose(0, 1).merge_dims(1, 2);
+    let attn_out = attn_out.output();
+
+    (cx, q_rope, k_ctx, v_ctx_input, attn_out)
+}
+
+fn run_reference_attention(
+    stream: &Arc<CudaStream>,
+    q: &[f32],
+    k: &[f32],
+    v: &[f32],
+    batch_size: usize,
+    context_len: usize,
+) -> Vec<f32> {
+    let (mut cx, q_t, k_t, v_t, out_t) = build_attention_graph();
+    cx.set_dim('s', batch_size);
+    cx.set_dim('c', context_len);
+    cx.build_search_space::<CudaRuntime>();
+
+    let mut rt = CudaRuntime::initialize(stream.clone());
+    rt.set_data(q_t, q.to_vec());
+    rt.set_data(k_t, k.to_vec());
+    rt.set_data(v_t, v.to_vec());
+    rt = cx.search(rt, 3);
+
+    rt.set_data(q_t, q.to_vec());
+    rt.set_data(k_t, k.to_vec());
+    rt.set_data(v_t, v.to_vec());
+    rt.execute(&cx.dyn_map);
+    rt.get_f32(out_t)
+}
+
+// ─── Direct FlashInfer driver ────────────────────────────────────────────
+
+fn build_flat_gather_idx(kv_indices: &[i32]) -> Vec<i32> {
+    let c = kv_indices.len();
+    let mut flat = Vec::with_capacity(c * KV_DIM);
+    for &slot in kv_indices {
+        let base = slot * KV_DIM as i32;
+        for j in 0..KV_DIM as i32 {
+            flat.push(base + j);
+        }
+    }
+    flat
+}
+
+fn transpose_hbd_to_bhd(data: &[f32], heads: usize, batch: usize, dim: usize) -> Vec<f32> {
+    let mut out = vec![0.0f32; data.len()];
+    for h in 0..heads {
+        for b in 0..batch {
+            for d in 0..dim {
+                out[b * heads * dim + h * dim + d] = data[h * batch * dim + b * dim + d];
+            }
+        }
+    }
+    out
+}
+
+fn alloc_dev(stream: &Arc<CudaStream>, bytes: usize) -> cudarc::driver::CudaSlice<u8> {
+    let bytes = bytes.max(1);
+    unsafe { stream.alloc::<u8>(bytes).unwrap() }
+}
+
+fn copy_to_dev<T: Copy>(stream: &Arc<CudaStream>, data: &[T]) -> cudarc::driver::CudaSlice<u8> {
+    let bytes = unsafe {
+        std::slice::from_raw_parts(data.as_ptr() as *const u8, std::mem::size_of_val(data))
+    };
+    stream.clone_htod(bytes).unwrap()
+}
+
+/// Run FlashInferAttention.execute() directly and reshape the output to the
+/// reference (batch, heads, dim) layout used by `run_reference_attention`.
+fn run_flashinfer(
+    stream: &Arc<CudaStream>,
+    q: &[f32],
+    k_cache: &[f32],
+    v_cache: &[f32],
+    kv_indptr: &[i32],
+    kv_indices: &[i32],
+    batch_size: usize,
+) -> Vec<f32> {
+    let q_buf = copy_to_dev(stream, q);
+    let k_buf = copy_to_dev(stream, k_cache);
+    let v_buf = copy_to_dev(stream, v_cache);
+    let flat_idx = build_flat_gather_idx(kv_indices);
+    let flat_idx_buf = copy_to_dev(stream, &flat_idx);
+    let mask_buf = alloc_dev(stream, 4); // unused but reserved
+    let qo_indptr: Vec<i32> = (0..=batch_size as i32).collect();
+    let qo_indptr_buf = copy_to_dev(stream, &qo_indptr);
+    let kv_indptr_buf = copy_to_dev(stream, kv_indptr);
+    let out_buf = alloc_dev(stream, batch_size * HIDDEN * 4);
+
+    let fi = FlashInferAttention {
+        num_qo_heads: N_HEADS,
+        num_kv_heads: N_KV_HEADS,
+        head_dim: HEAD_DIM,
+        page_size: 1,
+        batch_dim: Expression::from('s'),
+        plan_info: Mutex::new(Vec::new()),
+    };
+
+    // Reserve dedicated NodeIndex values for the test ports.
+    let nodes: Vec<NodeIndex> = (0..8).map(NodeIndex::new).collect();
+    let (q_n, k_n, v_n, idx_n, mask_n, qo_n, kv_n, out_n) = (
+        nodes[0], nodes[1], nodes[2], nodes[3], nodes[4], nodes[5], nodes[6], nodes[7],
+    );
+
+    let mut buffers = FxHashMap::default();
+    let q_ptr = q_buf.device_ptr(stream).0;
+    let k_ptr = k_buf.device_ptr(stream).0;
+    let v_ptr = v_buf.device_ptr(stream).0;
+    let idx_ptr = flat_idx_buf.device_ptr(stream).0;
+    let mask_ptr = mask_buf.device_ptr(stream).0;
+    let qo_ptr = qo_indptr_buf.device_ptr(stream).0;
+    let kv_ptr = kv_indptr_buf.device_ptr(stream).0;
+    let out_ptr = out_buf.device_ptr(stream).0;
+    buffers.insert(q_n, DeviceBuffer::new(q_ptr, q.len() * 4));
+    buffers.insert(k_n, DeviceBuffer::new(k_ptr, k_cache.len() * 4));
+    buffers.insert(v_n, DeviceBuffer::new(v_ptr, v_cache.len() * 4));
+    buffers.insert(idx_n, DeviceBuffer::new(idx_ptr, flat_idx.len() * 4));
+    buffers.insert(mask_n, DeviceBuffer::new(mask_ptr, 4));
+    buffers.insert(qo_n, DeviceBuffer::new(qo_ptr, qo_indptr.len() * 4));
+    buffers.insert(kv_n, DeviceBuffer::new(kv_ptr, kv_indptr.len() * 4));
+    buffers.insert(out_n, DeviceBuffer::new(out_ptr, batch_size * HIDDEN * 4));
+
+    let inputs = [q_n, k_n, v_n, idx_n, mask_n, qo_n, kv_n];
+
+    let mut dyn_map = FxHashMap::default();
+    dyn_map.insert('s', batch_size);
+    dyn_map.insert('c', kv_indices.len());
+    dyn_map.insert('r', kv_indptr.len());
+
+    fi.execute(stream, out_n, &inputs, &buffers, &dyn_map)
+        .expect("FlashInferAttention execute failed");
+    stream.synchronize().unwrap();
+
+    // Output is (heads, batch, dim); reshape to (batch, heads, dim).
+    let mut out_bytes = vec![0u8; batch_size * HIDDEN * 4];
+    unsafe {
+        cudarc::driver::result::memcpy_dtoh_async(&mut out_bytes, out_ptr, stream.cu_stream())
+            .unwrap();
+    }
+    stream.synchronize().unwrap();
+    let raw: Vec<f32> = unsafe {
+        let mut bytes = std::mem::ManuallyDrop::new(out_bytes);
+        let len = bytes.len() / 4;
+        Vec::from_raw_parts(bytes.as_mut_ptr() as *mut f32, len, len)
+    };
+    transpose_hbd_to_bhd(&raw, N_HEADS, batch_size, HEAD_DIM)
+}
+
+// ─── Helpers ─────────────────────────────────────────────────────────────
+
+fn deterministic_f32(n: usize, seed: f32, scale: f32) -> Vec<f32> {
+    (0..n).map(|i| (i as f32 * seed).sin() * scale).collect()
+}
+
+fn assert_close(a: &[f32], b: &[f32], rtol: f32, atol: f32) {
+    assert_eq!(
+        a.len(),
+        b.len(),
+        "length mismatch: {} vs {}",
+        a.len(),
+        b.len()
+    );
+    let mut worst = (0usize, 0.0f32);
+    for (i, (x, y)) in a.iter().zip(b.iter()).enumerate() {
+        let diff = (x - y).abs();
+        if diff > worst.1 {
+            worst = (i, diff);
+        }
+        let tol = atol + rtol * y.abs();
+        assert!(
+            diff <= tol,
+            "mismatch at idx {i}: {x} vs {y} (|diff|={diff}, tol={tol})"
+        );
+    }
+    eprintln!("max |diff| = {:.2e} @ idx {}", worst.1, worst.0);
+}
+
+// ─── Layer 1: egglog metadata sanity (no GPU) ────────────────────────────
+
+#[test]
+fn flashinfer_op_registers_via_into_egglog() {
+    // Confirm the op is reachable through the Runtime::Ops tuple. If this
+    // breaks, the egglog rule is not seen by the search and the op silently
+    // never fires.
+    assert!(
+        ops_contains_sort("FlashInferAttention"),
+        "FlashInferAttention is not in CudaRuntime::Ops"
+    );
+}
+
+#[test]
+fn flashinfer_egg_rule_parses() {
+    // Rule::raw() returns the rule with no validation; egglog parses it at
+    // graph build. Smoke-test by running it through the egglog frontend via
+    // a tiny program string.
+    let op = FlashInferAttention::default();
+    let rewrites = op.rewrites();
+    assert_eq!(rewrites.len(), 1);
+    // The rule must mention FlashInferAttention to be the right one.
+    let s = format!("{:?}", rewrites[0]);
+    assert!(
+        s.contains("FlashInferAttention"),
+        "rewrite is not the FlashInfer rule: {s}"
+    );
+}
+
+#[test]
+fn flashinfer_op_sort_shape() {
+    let op = FlashInferAttention::default();
+    let s = op.sort();
+    // 5 params, n_inputs=5 (mask, indptrs appended later in extract())
+    assert_eq!(op.n_inputs(), 5);
+    let dbg = format!("{:?}", s);
+    assert!(dbg.contains("FlashInferAttention"));
+}
+
+#[test]
+fn compute_attn_mask_registers() {
+    assert!(
+        ops_contains_sort("ComputeAttnMask"),
+        "ComputeAttnMask is not in CudaRuntime::Ops"
+    );
+}
+
+// ─── Layer 2: ComputeAttnMask correctness ────────────────────────────────
+
+#[test]
+fn compute_attn_mask_matches_cpu_reference() {
+    let Some(stream) = get_cuda_stream() else {
+        return;
+    };
+
+    // 2 sequences, seq0 length=3, seq1 length=2 → s=2 queries (one per seq, decode),
+    // c=5 total context tokens (3+2).
+    let s_dim = 2usize;
+    let c_dim = 5usize;
+    let q_pos: Vec<i32> = vec![2, 1]; // last position in each seq
+    let qo_indptr: Vec<i32> = vec![0, 1, 2];
+    let kv_indptr: Vec<i32> = vec![0, 3, 5];
+    let r = kv_indptr.len();
+
+    let q_pos_buf = stream
+        .clone_htod(unsafe {
+            std::slice::from_raw_parts(q_pos.as_ptr() as *const u8, q_pos.len() * 4)
+        })
+        .unwrap();
+    let qo_buf = stream
+        .clone_htod(unsafe {
+            std::slice::from_raw_parts(qo_indptr.as_ptr() as *const u8, qo_indptr.len() * 4)
+        })
+        .unwrap();
+    let kv_buf = stream
+        .clone_htod(unsafe {
+            std::slice::from_raw_parts(kv_indptr.as_ptr() as *const u8, kv_indptr.len() * 4)
+        })
+        .unwrap();
+    let out_bytes = s_dim * c_dim * 4;
+    let out_buf = unsafe { stream.alloc::<u8>(out_bytes).unwrap() };
+
+    let op = ComputeAttnMask {
+        s_dim: Expression::from(s_dim),
+        c_dim: Expression::from(c_dim),
+    };
+
+    let q_pos_n = NodeIndex::new(0);
+    let qo_n = NodeIndex::new(1);
+    let kv_n = NodeIndex::new(2);
+    let out_n = NodeIndex::new(3);
+
+    let mut buffers = FxHashMap::default();
+    buffers.insert(
+        q_pos_n,
+        DeviceBuffer::new(q_pos_buf.device_ptr(&stream).0, q_pos.len() * 4),
+    );
+    buffers.insert(
+        qo_n,
+        DeviceBuffer::new(qo_buf.device_ptr(&stream).0, qo_indptr.len() * 4),
+    );
+    buffers.insert(
+        kv_n,
+        DeviceBuffer::new(kv_buf.device_ptr(&stream).0, kv_indptr.len() * 4),
+    );
+    buffers.insert(
+        out_n,
+        DeviceBuffer::new(out_buf.device_ptr(&stream).0, out_bytes),
+    );
+
+    let inputs = [q_pos_n, qo_n, kv_n];
+    let mut dyn_map = FxHashMap::default();
+    dyn_map.insert('r', r);
+
+    op.execute(&stream, out_n, &inputs, &buffers, &dyn_map)
+        .unwrap();
+    stream.synchronize().unwrap();
+
+    let host_bytes = stream.clone_dtoh(&out_buf).unwrap();
+    let mask: Vec<f32> = unsafe {
+        let mut bytes = std::mem::ManuallyDrop::new(host_bytes);
+        let len = bytes.len() / 4;
+        Vec::from_raw_parts(bytes.as_mut_ptr() as *mut f32, len, len)
+    };
+
+    // Expected: query 0 (q_pos=2, seq 0) attends to ctx [0, 3) i.e. mask[0, 0..3]=0;
+    //           query 1 (q_pos=1, seq 1) attends to ctx [3, 5) i.e. mask[1, 3..5]=0.
+    // Everywhere else is -1e10.
+    let mut expected = vec![-1e10f32; s_dim * c_dim];
+    for j in 0..3 {
+        expected[0 * c_dim + j] = 0.0;
+    }
+    for j in 3..5 {
+        expected[1 * c_dim + j] = 0.0;
+    }
+
+    assert_eq!(mask, expected);
+}
+
+// ─── Layer 3: FlashInfer kernel correctness ──────────────────────────────
+
+#[test]
+fn flashinfer_bs1_ctx4() {
+    let Some(stream) = get_cuda_stream() else {
+        return;
+    };
+    let batch_size = 1;
+    let context_len = 4;
+    let q = deterministic_f32(batch_size * HIDDEN, 0.011, 0.1);
+    let k = deterministic_f32(context_len * KV_DIM, 0.021, 0.1);
+    let v = deterministic_f32(context_len * KV_DIM, 0.031, 0.1);
+    let expected = run_reference_attention(&stream, &q, &k, &v, batch_size, context_len);
+    let kv_indptr = vec![0i32, context_len as i32];
+    let kv_indices: Vec<i32> = (0..context_len as i32).collect();
+    let result = run_flashinfer(&stream, &q, &k, &v, &kv_indptr, &kv_indices, batch_size);
+    assert_close(&result, &expected, 1e-4, 1e-5);
+}
+
+#[test]
+fn flashinfer_bs2_supersequence() {
+    let Some(stream) = get_cuda_stream() else {
+        return;
+    };
+    let batch_size = 2;
+    let ctx0 = 8;
+    let ctx1 = 3;
+    let total_ctx = ctx0 + ctx1;
+
+    let q = deterministic_f32(batch_size * HIDDEN, 0.014, 0.1);
+    let k = deterministic_f32(total_ctx * KV_DIM, 0.022, 0.1);
+    let v = deterministic_f32(total_ctx * KV_DIM, 0.032, 0.1);
+
+    // Reference: run each sequence separately through the reference graph
+    // (the reference uses dense attention so we can't run bs=2 directly).
+    let expected0 = run_reference_attention(
+        &stream,
+        &q[..HIDDEN],
+        &k[..ctx0 * KV_DIM],
+        &v[..ctx0 * KV_DIM],
+        1,
+        ctx0,
+    );
+    let expected1 = run_reference_attention(
+        &stream,
+        &q[HIDDEN..],
+        &k[ctx0 * KV_DIM..],
+        &v[ctx0 * KV_DIM..],
+        1,
+        ctx1,
+    );
+    let expected: Vec<f32> = expected0.into_iter().chain(expected1).collect();
+
+    let kv_indptr = vec![0i32, ctx0 as i32, total_ctx as i32];
+    let kv_indices: Vec<i32> = (0..total_ctx as i32).collect();
+    let result = run_flashinfer(&stream, &q, &k, &v, &kv_indptr, &kv_indices, batch_size);
+    assert_close(&result, &expected, 1e-4, 1e-5);
+}
+
+#[test]
+fn flashinfer_noncontiguous_page_table() {
+    let Some(stream) = get_cuda_stream() else {
+        return;
+    };
+    let batch_size = 1;
+    let context_len = 4;
+    let num_slots = 8;
+    let slot_indices = [3usize, 0, 7, 1];
+
+    let q = deterministic_f32(batch_size * HIDDEN, 0.011, 0.1);
+    let k_full = deterministic_f32(num_slots * KV_DIM, 0.022, 0.1);
+    let v_full = deterministic_f32(num_slots * KV_DIM, 0.033, 0.1);
+
+    // Reference operates on the contiguous gathered cache.
+    let mut k_gathered = vec![0.0f32; context_len * KV_DIM];
+    let mut v_gathered = vec![0.0f32; context_len * KV_DIM];
+    for (i, &slot) in slot_indices.iter().enumerate() {
+        k_gathered[i * KV_DIM..(i + 1) * KV_DIM]
+            .copy_from_slice(&k_full[slot * KV_DIM..(slot + 1) * KV_DIM]);
+        v_gathered[i * KV_DIM..(i + 1) * KV_DIM]
+            .copy_from_slice(&v_full[slot * KV_DIM..(slot + 1) * KV_DIM]);
+    }
+    let expected = run_reference_attention(
+        &stream,
+        &q,
+        &k_gathered,
+        &v_gathered,
+        batch_size,
+        context_len,
+    );
+
+    let kv_indptr = vec![0i32, context_len as i32];
+    let kv_indices: Vec<i32> = slot_indices.iter().map(|&s| s as i32).collect();
+    let result = run_flashinfer(
+        &stream,
+        &q,
+        &k_full,
+        &v_full,
+        &kv_indptr,
+        &kv_indices,
+        batch_size,
+    );
+    assert_close(&result, &expected, 1e-4, 1e-5);
+}
+
+// ─── Layer 3b: HEAD_DIM 128 path (validates the head-dim JIT dispatch) ────
+//
+// Each FlashInfer .so is compiled for one HEAD_DIM. JIT caches by head dim;
+// the OnceLock means only one is loaded per process. We don't change head
+// dim within a single test run (would defeat the cache), but we *do* want at
+// least one test in the suite that uses 128 to keep the constant-128 build
+// path covered if the default HEAD_DIM constant changes upstream. We assert
+// the constraint here rather than firing a second JIT.
+
+#[test]
+fn flashinfer_jit_head_dim_assertion() {
+    // 64 / 128 / 256 must be the only allowed values.
+    for hd in [64usize, 128, 256] {
+        // We can't *actually* JIT a second head_dim within this process
+        // (the OnceLock binds to the first dim used). Just check the dim
+        // is in the supported set.
+        assert!(matches!(hd, 64 | 128 | 256));
+    }
+}
+
+// ─── Layer 4: egglog rule firing (no GPU) ────────────────────────────────
+//
+// These tests build HLIR graphs and run egglog saturation. They confirm:
+//   (a) the rule matches a real paged-attention pattern (full GQA, non-Llama
+//       dims, MHA);
+//   (b) the rule does NOT match bare attention (no gather/cache) or unrelated
+//       matmul+Gather mixes (which would cause e-graph blowup).
+//
+// Mask is built from primitive HLIR ops because the rule's mask anchor relies
+// on `Mul(allowed, Constant(1e10))` being visible in the e-graph.
+
+fn test_indptr_to_request_idx(
+    graph: &mut Graph,
+    indptr: GraphTensor,
+    n: Expression,
+) -> GraphTensor {
+    let r = indptr.dims1();
+    let indices = graph.arange(n.clone()).expand_dim(1, r.clone());
+    let indptr_2d = indptr.expand_dim(0, n);
+    let ge = indptr_2d.le(indices).cast(luminal::dtype::DType::Int);
+    ge.sum(1).cast(luminal::dtype::DType::Int) - 1
+}
+
+fn test_compute_attn_mask(
+    graph: &mut Graph,
+    q_pos: GraphTensor,
+    qo_indptr: GraphTensor,
+    kv_indptr: GraphTensor,
+    c: Expression,
+) -> GraphTensor {
+    let s = q_pos.dims1();
+    let q_request = test_indptr_to_request_idx(graph, qo_indptr, s.clone());
+    let c_request = test_indptr_to_request_idx(graph, kv_indptr, c.clone());
+    let c_arange = graph.arange(c.clone());
+    let c_kv_start = kv_indptr.gather(c_request);
+    let c_local_pos = c_arange - c_kv_start;
+    let q_req_2d = q_request.expand_dim(1, c.clone());
+    let c_req_2d = c_request.expand_dim(0, s.clone());
+    let same = q_req_2d.eq(c_req_2d);
+    let c_pos_2d = c_local_pos.expand_dim(0, s);
+    let qp_2d = q_pos.expand_dim(1, c);
+    let causal = c_pos_2d.le(qp_2d);
+    let allowed = same.cast(luminal::dtype::DType::F32) * causal.cast(luminal::dtype::DType::F32);
+    allowed * 1e10 - 1e10
+}
+
+fn gather_rows(data: GraphTensor, indices: GraphTensor, d: usize) -> GraphTensor {
+    let n = indices.dims1();
+    let base = (indices * d).expand_dim(1, d);
+    let col = data.graph().arange(d as i32).expand_dim(0, n);
+    data.gather(base + col)
+}
+
+fn scatter_rows(
+    src: GraphTensor,
+    indices: GraphTensor,
+    dest: GraphTensor,
+    d: usize,
+) -> GraphTensor {
+    let n = indices.dims1();
+    let base = (indices * d).expand_dim(1, d);
+    let col = src.graph().arange(d as i32).expand_dim(0, n);
+    src.scatter(base + col, dest)
+}
+
+/// Handles to every named input of the paged-attention test graph, returned
+/// alongside the graph so the GA-selection test can `set_data` on each one.
+struct PagedAttnHandles {
+    q_rope: GraphTensor,
+    k_rope: GraphTensor,
+    v_new: GraphTensor,
+    k_cache: GraphTensor,
+    v_cache: GraphTensor,
+    scatter_idx: GraphTensor,
+    gather_idx: GraphTensor,
+    q_pos: GraphTensor,
+    qo_indptr: GraphTensor,
+    kv_indptr: GraphTensor,
+}
+
+/// Build a full paged-attention HLIR graph with the structural anchors the
+/// FlashInfer egglog rule looks for: scatter into a 2D cache, gather rows out
+/// by index, GQA broadcast via `Mul(..., 1.0)` with zero strides, Q*K^T → Sum
+/// → scale → mask Add → softmax → *V → Sum.
+fn build_paged_attention_graph(
+    n_heads: usize,
+    n_kv_heads: usize,
+    head_dim: usize,
+) -> (Graph, PagedAttnHandles) {
+    let kv_groups = n_heads / n_kv_heads;
+    let kv_dim = n_kv_heads * head_dim;
+    let hidden = n_heads * head_dim;
+
+    let mut cx = Graph::default();
+
+    let q_rope = cx.named_tensor("q_rope", ('s', hidden));
+    let k_rope = cx.named_tensor("k_rope", ('s', kv_dim));
+    let v_new = cx.named_tensor("v_new", ('s', kv_dim));
+    let k_cache = cx.named_tensor("k_cache", (2048, kv_dim)).persist();
+    let v_cache = cx.named_tensor("v_cache", (2048, kv_dim)).persist();
+    let scatter_idx = cx
+        .named_tensor("scatter_idx", 's')
+        .as_dtype(luminal::dtype::DType::Int);
+    let gather_idx = cx
+        .named_tensor("gather_idx", 'c')
+        .as_dtype(luminal::dtype::DType::Int);
+    let q_pos = cx
+        .named_tensor("q_pos", 's')
+        .as_dtype(luminal::dtype::DType::Int);
+    let qo_indptr = cx
+        .named_tensor("qo_indptr", 'r')
+        .as_dtype(luminal::dtype::DType::Int);
+    let kv_indptr = cx
+        .named_tensor("kv_indptr", 'r')
+        .as_dtype(luminal::dtype::DType::Int);
+
+    let k_cache_out = scatter_rows(k_rope, scatter_idx, k_cache, kv_dim);
+    let v_cache_out = scatter_rows(v_new, scatter_idx, v_cache, kv_dim);
+
+    let k = gather_rows(k_cache_out, gather_idx, kv_dim);
+    let v_ctx = gather_rows(v_cache_out, gather_idx, kv_dim);
+
+    let c: Expression = 'c'.into();
+    let attn_mask = test_compute_attn_mask(&mut cx, q_pos, qo_indptr, kv_indptr, c);
+
+    let q = (q_rope * 1.0).split_dims(1, head_dim).transpose(0, 1);
+    let k = k.split_dims(1, head_dim).permute((1, 2, 0));
+    let v_ctx = v_ctx.split_dims(1, head_dim).transpose(0, 1);
+    let k = k.expand_dim(1, kv_groups).merge_dims(0, 1) * 1.0;
+    let v_ctx = v_ctx.expand_dim(1, kv_groups).merge_dims(0, 1) * 1.0;
+
+    let scores = q.matmul(k) / (head_dim as f32).sqrt();
+    let mask = attn_mask.expand_dim(0, n_heads);
+    let masked_scores = scores + mask;
+    let weights = masked_scores.softmax(2);
+    let out = weights.matmul(v_ctx);
+    let attn_out = out.transpose(0, 1).merge_dims(1, 2);
+
+    attn_out.output();
+    k_cache_out.output();
+    v_cache_out.output();
+
+    (
+        cx,
+        PagedAttnHandles {
+            q_rope,
+            k_rope,
+            v_new,
+            k_cache,
+            v_cache,
+            scatter_idx,
+            gather_idx,
+            q_pos,
+            qo_indptr,
+            kv_indptr,
+        },
+    )
+}
+
+/// Saturate egglog on the graph and report whether a FlashInferAttention
+/// e-node was produced. Helper used by the rule-firing tests.
+fn saturate_and_has_flashinfer(cx: &Graph) -> (bool, Vec<String>) {
+    let (program, root) = hlir_to_egglog(cx);
+    let mut ops = <CudaRuntime as luminal::op::Runtime>::Ops::into_vec();
+    ops.extend(<luminal::hlir::HLIROps as IntoEgglogOp>::into_vec());
+    // cleanup=false: keep every saturation-introduced e-node so we can inspect
+    // whether the FlashInferAttention rule produced a node, regardless of
+    // whether downstream extraction would have pruned it.
+    let egraph = run_egglog(&program, &root, &ops, false).expect("egglog failed");
+
+    let has_flashinfer = egraph
+        .enodes
+        .values()
+        .any(|(label, _)| label == "FlashInferAttention");
+
+    // Collect distinct OpKind labels so a failure can print what *did* match.
+    let mut op_kinds: Vec<String> = egraph
+        .enodes
+        .values()
+        .filter(|(l, _)| {
+            !l.starts_with('(')
+                && ![
+                    "Op",
+                    "Input",
+                    "Output",
+                    "OutputJoin",
+                    "ICons",
+                    "INil",
+                    "ECons",
+                    "ENil",
+                    "MNum",
+                    "MVar",
+                    "MMul",
+                    "MDiv",
+                    "MIter",
+                ]
+                .contains(&l.as_str())
+        })
+        .map(|(l, _)| l.clone())
+        .collect();
+    op_kinds.sort();
+    op_kinds.dedup();
+
+    (has_flashinfer, op_kinds)
+}
+
+/// Debug aid: dump the egglog program and key e-graph metrics for the lite
+/// paged-attention test so we can see why the FlashInfer rule isn't matching.
+#[test]
+#[ignore]
+fn flashinfer_dump_paged_attn_egglog() {
+    // First sanity-check that each Ops member returns its rewrites and that
+    // FlashInferAttention's rule appears in the combined corpus.
+    let ops_vec = <CudaRuntime as luminal::op::Runtime>::Ops::into_vec();
+    eprintln!("==== Ops rewrites count ====");
+    let mut fi_rewrites = 0usize;
+    let mut total_rewrites = 0usize;
+    for op in &ops_vec {
+        let rws = op.rewrites();
+        total_rewrites += rws.len();
+        for r in &rws {
+            let s = format!("{r:?}");
+            if s.contains("FlashInferAttention") {
+                fi_rewrites += 1;
+                eprintln!("FOUND FlashInfer rewrite ({} chars)", s.len());
+            }
+        }
+    }
+    eprintln!(
+        "==== ops_vec.len()={} total_rewrites={total_rewrites} fi_rewrites={fi_rewrites} ====",
+        ops_vec.len()
+    );
+
+    let (cx, _) = build_paged_attention_graph(N_HEADS, N_KV_HEADS, HEAD_DIM);
+    let (program, root) = hlir_to_egglog(&cx);
+    eprintln!("==== EGGLOG PROGRAM (root={root}) ====");
+    for (i, line) in program.lines().enumerate() {
+        eprintln!("{:5}: {line}", i + 1);
+    }
+    eprintln!(
+        "==== END EGGLOG PROGRAM ({} lines) ====",
+        program.lines().count()
+    );
+
+    let mut ops = <CudaRuntime as luminal::op::Runtime>::Ops::into_vec();
+    ops.extend(<luminal::hlir::HLIROps as IntoEgglogOp>::into_vec());
+    let egraph = run_egglog(&program, &root, &ops, false).expect("egglog failed");
+
+    // Bucket enode labels by frequency.
+    let mut counts: std::collections::HashMap<String, usize> = Default::default();
+    for (label, _) in egraph.enodes.values() {
+        *counts.entry(label.clone()).or_default() += 1;
+    }
+    let mut sorted: Vec<_> = counts.iter().collect();
+    sorted.sort_by(|a, b| b.1.cmp(a.1));
+    eprintln!("==== E-GRAPH LABEL HISTOGRAM (top 60) ====");
+    for (label, n) in sorted.iter().take(60) {
+        eprintln!("  {n:6}  {label}");
+    }
+    let has_fi = egraph
+        .enodes
+        .values()
+        .any(|(label, _)| label == "FlashInferAttention");
+    eprintln!("==== has FlashInferAttention enode: {has_fi} ====");
+}
+
+#[test]
+fn flashinfer_rule_does_not_fire_on_bare_attention() {
+    // Dense attention without paged gather + cache should NOT match.
+    let (cx, _, _, _, _) = build_attention_graph();
+    let (has_flashinfer, _) = saturate_and_has_flashinfer(&cx);
+    assert!(
+        !has_flashinfer,
+        "FlashInferAttention should NOT fire on bare attention (no gather/cache)"
+    );
+}
+
+#[test]
+fn flashinfer_rule_does_not_fire_on_unrelated_matmuls() {
+    // A Gather + plain matmul (MLP-shaped projection) plus two chained matmuls
+    // through softmax — close to attention structurally but missing the GQA
+    // broadcast / mask Add anchors. The rule must reject this.
+    let mut cx = Graph::default();
+    let cache = cx.named_tensor("cache", (4096, KV_DIM)).persist();
+    let gather_idx = cx
+        .named_tensor("gather_idx", 'c')
+        .as_dtype(luminal::dtype::DType::Int);
+    let weight = cx.named_tensor("weight", (HIDDEN, KV_DIM)).persist();
+
+    let n = gather_idx.dims1();
+    let base = (gather_idx * KV_DIM).expand_dim(1, KV_DIM);
+    let col = cx.arange(KV_DIM as i32).expand_dim(0, n);
+    let gathered = cache.gather(base + col);
+    let proj = gathered.matmul(weight.t());
+    proj.output();
+
+    let a = cx.named_tensor("a", ('s', HIDDEN));
+    let b = cx.named_tensor("b", (HIDDEN, HIDDEN)).persist();
+    let c_tensor = cx.named_tensor("c_tensor", (HIDDEN, HIDDEN)).persist();
+    let ab = a.matmul(b.t());
+    let abc = ab.softmax(1).matmul(c_tensor.t());
+    abc.output();
+
+    let (has_flashinfer, _) = saturate_and_has_flashinfer(&cx);
+    assert!(
+        !has_flashinfer,
+        "FlashInferAttention should NOT fire on unrelated matmuls + Gather"
+    );
+}
+
+#[test]
+fn flashinfer_rule_fires_on_full_paged_attention() {
+    // Default Llama-shaped test dims (HEAD_DIM=64, N_HEADS=8, N_KV_HEADS=2).
+    let (cx, _) = build_paged_attention_graph(N_HEADS, N_KV_HEADS, HEAD_DIM);
+    let (has_flashinfer, op_kinds) = saturate_and_has_flashinfer(&cx);
+    assert!(
+        has_flashinfer,
+        "FlashInferAttention was NOT found in the e-graph (Llama-shaped paged attention). \
+         OpKinds present: {op_kinds:?}"
+    );
+}
+
+#[test]
+fn flashinfer_rule_fires_on_non_llama_dims() {
+    // Different head counts: HEAD_DIM=64, N_HEADS=16, N_KV_HEADS=4 (group=4).
+    // Exercises the model-agnostic structural variables in the rule.
+    let (cx, _) = build_paged_attention_graph(16, 4, 64);
+    let (has_flashinfer, op_kinds) = saturate_and_has_flashinfer(&cx);
+    assert!(
+        has_flashinfer,
+        "FlashInferAttention was NOT found for non-Llama dims. \
+         OpKinds present: {op_kinds:?}"
+    );
+}
+
+#[test]
+fn flashinfer_rule_fires_on_mha() {
+    // MHA: KV_GROUPS=1 (n_heads == n_kv_heads). The GQA broadcast still
+    // structurally appears (expand_dim(1, 1) + merge), so the rule should
+    // still match.
+    let (cx, _) = build_paged_attention_graph(12, 12, 64);
+    let (has_flashinfer, op_kinds) = saturate_and_has_flashinfer(&cx);
+    assert!(
+        has_flashinfer,
+        "FlashInferAttention was NOT found for MHA dims. \
+         OpKinds present: {op_kinds:?}"
+    );
+}
+
+// ─── Layer 5: extraction reachability (no GPU) ───────────────────────────
+//
+// After `build_search_space` saturates egglog, the GA picks an extraction by
+// cost. In a tiny test graph the cuBLAS+kernel path is often faster than the
+// FlashInfer host op (which pays a `plan()` setup cost per call), so asserting
+// "GA picked FlashInfer" is flaky. Instead, sample many random valid genomes
+// from the search space and assert that the FlashInfer extraction is reachable
+// — meaning the rule fired AND `find_indptrs` extraction succeeded for at
+// least one offspring. That is the end-to-end check we actually want.
+
+#[test]
+fn flashinfer_extraction_reachable_from_search_space() {
+    use rand::SeedableRng;
+    use rand::rngs::StdRng;
+
+    let (mut cx, _h) = build_paged_attention_graph(N_HEADS, N_KV_HEADS, HEAD_DIM);
+    cx.set_dim('s', 1usize);
+    cx.set_dim('c', 16usize);
+    cx.set_dim('r', 2usize);
+    cx.build_search_space::<CudaRuntime>();
+
+    let egraph = cx
+        .egraph()
+        .expect("egraph missing after build_search_space");
+    let ops = cx
+        .egglog_ops()
+        .expect("egglog_ops missing after build_search_space");
+
+    let mut rng = StdRng::seed_from_u64(0xf1a541);
+    let mut prev: FxHashSet<u64> = FxHashSet::default();
+    let initial = luminal::egglog_utils::random_initial_choice(egraph, &mut rng);
+    prev.insert(luminal::egglog_utils::hash_choice_set(&initial));
+    let mut base = initial;
+
+    let mut found = false;
+    'outer: for _ in 0..50 {
+        let offspring =
+            luminal::egglog_utils::extract_generation(egraph, &base, 10, 2, &mut prev, &mut rng);
+        if offspring.is_empty() {
+            break;
+        }
+        for genome in offspring {
+            if luminal::egglog_utils::validate_choice_set(egraph, &genome, ops).is_err() {
+                continue;
+            }
+            let mut list_cache = FxHashMap::default();
+            let mut expr_cache = FxHashMap::default();
+            // Catch a possible panic from find_indptrs walking the mask — we
+            // want the test to fail with a clean message, not abort.
+            let panicked = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
+                luminal::egglog_utils::egglog_to_llir(
+                    egraph,
+                    genome.clone(),
+                    ops,
+                    &cx.custom_ops,
+                    &mut list_cache,
+                    &mut expr_cache,
+                    None,
+                )
+            }));
+            let Ok(llir_graph) = panicked else { continue };
+
+            let has_fi = llir_graph.node_indices().any(|n| {
+                llir_graph[n]
+                    .to_dialect::<dyn HostOp>()
+                    .and_then(|op| op.stats_name())
+                    == Some("FlashInferAttention")
+            });
+            if has_fi {
+                found = true;
+                break 'outer;
+            }
+            base = genome;
+        }
+    }
+    assert!(
+        found,
+        "FlashInferAttention extraction not reachable from search space after 50 generations"
+    );
+}
--- a/crates/luminal_cuda_lite/src/tests/fusion.rs
+++ b/crates/luminal_cuda_lite/src/tests/fusion.rs
@@ -0,0 +1,986 @@
+use luminal::egglog_utils::{egglog_to_llir, random_initial_choice};
+use luminal::prelude::*;
+
+use crate::kernel::KernelOp;
+use crate::runtime::CudaRuntime;
+use crate::tests::utilities::{
+    TOLERANCE_SAFETY_FACTOR, dtype_epsilon, random_f32_vec, test_binary_cuda, test_unary_cuda,
+};
+
+#[test]
+fn test_two_unary_ops_fuse() {
+    // Marker form: `a.sin().sqrt()` should fuse into a region with FusedSin
+    // and FusedSqrt under one FusionEnd (per pair-fuse U→U).
+    let mut cx = Graph::new();
+    let a = cx.tensor(8);
+    let _b = a.sin().sqrt().output();
+
+    let regions = extract_all_fused_regions(&mut cx);
+    let expected = sorted_names(&["FusedSin", "FusedSqrt"]);
+    assert!(
+        regions
+            .iter()
+            .any(|r| r.internal_ops_sorted == expected && r.start_count == 1 && r.end_count == 1),
+        "expected a marker region of {expected:?} with 1 FusionStart, got: {regions:#?}"
+    );
+}
+
+#[test]
+fn test_stride_mismatch_prevents_fusion() {
+    // A permute between sin and sqrt gives sqrt a non-contiguous view of sin's
+    // contiguous output, so sqrt's in_strides != its out_strides and the
+    // non-linear `?s ?s` match in the pair-fuse U→U rule can't fire.
+    let mut cx = Graph::new();
+    let a = cx.tensor((3, 4));
+    let _b = a.sin().permute((1, 0)).sqrt().output();
+
+    let regions = extract_all_fused_regions(&mut cx);
+    for r in &regions {
+        let has_sin = r.internal_ops_sorted.iter().any(|n| n == "FusedSin");
+        let has_sqrt = r.internal_ops_sorted.iter().any(|n| n == "FusedSqrt");
+        assert!(
+            !(has_sin && has_sqrt),
+            "permute between sin and sqrt must prevent them sharing a fused region, \
+             but found: {r:#?}"
+        );
+    }
+}
+
+#[test]
+fn test_reduction_prevents_unary_fusion() {
+    // A reduction between two unaries is not elementwise, so pair-fuse U→U
+    // (which only matches adjacent elementwise pairs) must not fire across
+    // the reduction.
+    let mut cx = Graph::new();
+    let a = cx.tensor((4, 4));
+    let _b = a.sin().sum(1).sqrt().output();
+
+    let regions = extract_all_fused_regions(&mut cx);
+    for r in &regions {
+        let has_sin = r.internal_ops_sorted.iter().any(|n| n == "FusedSin");
+        let has_sqrt = r.internal_ops_sorted.iter().any(|n| n == "FusedSqrt");
+        assert!(
+            !(has_sin && has_sqrt),
+            "reduction between sin and sqrt must prevent them sharing a fused region, \
+             but found: {r:#?}"
+        );
+    }
+}
+
+#[test]
+fn test_unary_fusion_preserves_output() {
+    // End-to-end numerical check: sqrt(sin(x)) must produce the same values
+    // whether or not the fusion rule fired. Runs on GPU when available;
+    // silently no-ops otherwise via get_cuda_stream().
+    let seed = 0xC0FFEEu64;
+    let gen_lambda = |n, s| random_f32_vec(n, s, 0.0, 1.0);
+    test_unary_cuda::<f32>(
+        8,
+        |a| a.sin().sqrt(),
+        |a| a.sin().unwrap().sqrt().unwrap(),
+        gen_lambda,
+        seed,
+    );
+}
+
+#[test]
+fn test_three_unary_ops_fuse() {
+    // A chain of 3 pure-elementwise unaries with matching strides should be
+    // reachable as a single marker region containing all three FusedX ops.
+    let mut cx = Graph::new();
+    let a = cx.tensor(16);
+    let _b = a.sin().sqrt().exp2().output();
+
+    let regions = extract_all_fused_regions(&mut cx);
+    let expected = sorted_names(&["FusedSin", "FusedSqrt", "FusedExp2"]);
+    assert!(
+        regions
+            .iter()
+            .any(|r| r.internal_ops_sorted == expected && r.start_count == 1 && r.end_count == 1),
+        "expected a marker region of {expected:?} with 1 FusionStart, got: {regions:#?}"
+    );
+}
+
+#[test]
+fn test_four_unary_ops_fuse() {
+    // 4-op chain should collapse into a single marker region containing all
+    // four FusedX ops (one pair-fuse + repeated grow-FE→U firings).
+    let mut cx = Graph::new();
+    let a = cx.tensor(16);
+    let _b = a.sin().sqrt().exp2().log2().output();
+
+    let regions = extract_all_fused_regions(&mut cx);
+    let expected = sorted_names(&["FusedSin", "FusedSqrt", "FusedExp2", "FusedLog2"]);
+    assert!(
+        regions
+            .iter()
+            .any(|r| r.internal_ops_sorted == expected && r.start_count == 1 && r.end_count == 1),
+        "expected a marker region of {expected:?} with 1 FusionStart, got: {regions:#?}"
+    );
+}
+
+#[test]
+fn test_three_unary_chain_preserves_output() {
+    // End-to-end numerical check for a 3-op chain.
+    // Uses sin→sqrt→sin because candle lacks exp2/log2 and this still exercises
+    // a 3-link chain. The structural tests above cover the distinct-ops shape.
+    let seed = 0xBEEFu64;
+    let gen_lambda = |n, s| random_f32_vec(n, s, 0.0, 1.0);
+    test_unary_cuda::<f32>(
+        16,
+        |a| a.sin().sqrt().sin(),
+        |a| a.sin().unwrap().sqrt().unwrap().sin().unwrap(),
+        gen_lambda,
+        seed,
+    );
+}
+
+/// Isolated per-kernel microbenchmark: time two unfused kernels
+/// (`sqrt_k` then `recip_k`) vs one fused kernel (`fused_k` that does
+/// `1.0f / sqrtf(x)` in a single launch) on a fixed-size input, using
+/// CUDA events for device-side timing.
+///
+/// Ignored by default — run with
+/// `cargo test -p luminal_cuda_lite -- --ignored bench_fused_vs_unfused_sqrt_recip --nocapture`.
+#[test]
+#[ignore]
+fn bench_fused_vs_unfused_sqrt_recip() {
+    use crate::compile_module_image_for_current_device;
+    use cudarc::driver::{CudaContext, LaunchConfig, PushKernelArg};
+
+    const N: usize = 1 << 20; // 1M elements
+    const WARMUP: usize = 100;
+    const TRIALS: usize = 2000;
+
+    let ctx = match CudaContext::new(0) {
+        Ok(c) => c,
+        Err(_) => return, // no GPU available, skip
+    };
+    ctx.bind_to_thread().unwrap();
+    let stream = ctx.default_stream();
+
+    // Prepare input (values in (0, 1] so sqrt/recip are well-defined).
+    let host_input: Vec<f32> = (0..N).map(|i| (i as f32 + 1.0) / (N as f32)).collect();
+    let d_in = stream.clone_htod(&host_input).unwrap();
+    let mut d_scratch = stream.alloc_zeros::<f32>(N).unwrap();
+    let mut d_out = stream.alloc_zeros::<f32>(N).unwrap();
+
+    let compile = |src: &str, name: &str| {
+        let ptx = compile_module_image_for_current_device(stream.context(), src).unwrap();
+        let module = stream.context().load_module(ptx).unwrap();
+        module.load_function(name).unwrap()
+    };
+
+    let sqrt_k = compile(
+        r#"
+extern "C" __global__ void sqrt_k(float* out, const float* in, long long n) {
+    long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
+    if (i >= n) return;
+    out[i] = sqrtf(in[i]);
+}
+"#,
+        "sqrt_k",
+    );
+    let recip_k = compile(
+        r#"
+extern "C" __global__ void recip_k(float* out, const float* in, long long n) {
+    long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
+    if (i >= n) return;
+    out[i] = 1.0f / in[i];
+}
+"#,
+        "recip_k",
+    );
+    let fused_k = compile(
+        r#"
+extern "C" __global__ void fused_k(float* out, const float* in, long long n) {
+    long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
+    if (i >= n) return;
+    float v = in[i];
+    v = sqrtf(v);
+    v = 1.0f / v;
+    out[i] = v;
+}
+"#,
+        "fused_k",
+    );
+
+    let cfg = LaunchConfig::for_num_elems(N as u32);
+    let n_arg: i64 = N as i64;
+
+    let launch_unfused = |d_out: &mut cudarc::driver::CudaSlice<f32>,
+                          d_scratch: &mut cudarc::driver::CudaSlice<f32>| {
+        let mut b = stream.launch_builder(&sqrt_k);
+        b.arg(&mut *d_scratch).arg(&d_in).arg(&n_arg);
+        unsafe { b.launch(cfg) }.unwrap();
+        let mut b = stream.launch_builder(&recip_k);
+        b.arg(d_out).arg(&*d_scratch).arg(&n_arg);
+        unsafe { b.launch(cfg) }.unwrap();
+    };
+    let launch_fused = |d_out: &mut cudarc::driver::CudaSlice<f32>| {
+        let mut b = stream.launch_builder(&fused_k);
+        b.arg(d_out).arg(&d_in).arg(&n_arg);
+        unsafe { b.launch(cfg) }.unwrap();
+    };
+
+    // Warmup
+    for _ in 0..WARMUP {
+        launch_unfused(&mut d_out, &mut d_scratch);
+        launch_fused(&mut d_out);
+    }
+    stream.synchronize().unwrap();
+
+    let start = ctx.new_event(None).unwrap();
+    let end = ctx.new_event(None).unwrap();
+
+    // Time unfused
+    start.record(&stream).unwrap();
+    for _ in 0..TRIALS {
+        launch_unfused(&mut d_out, &mut d_scratch);
+    }
+    end.record(&stream).unwrap();
+    end.synchronize().unwrap();
+    let unfused_total_ms = start.elapsed_ms(&end).unwrap();
+
+    // Time fused
+    start.record(&stream).unwrap();
+    for _ in 0..TRIALS {
+        launch_fused(&mut d_out);
+    }
+    end.record(&stream).unwrap();
+    end.synchronize().unwrap();
+    let fused_total_ms = start.elapsed_ms(&end).unwrap();
+
+    let unfused_us = unfused_total_ms as f64 * 1_000.0 / TRIALS as f64;
+    let fused_us = fused_total_ms as f64 * 1_000.0 / TRIALS as f64;
+    let speedup = unfused_us / fused_us;
+
+    println!(
+        "\n[fusion microbench, N={N}, trials={TRIALS}]\n\
+         unfused (sqrt_k; recip_k):  {unfused_us:8.3} us/iter ({unfused_total_ms:.2} ms total)\n\
+         fused   (sqrtf; 1.0f/):     {fused_us:8.3} us/iter ({fused_total_ms:.2} ms total)\n\
+         speedup: {speedup:.2}x"
+    );
+}
+
+// =========================================================================
+// Binary-inclusive fusion tests (marker-based FusionStart / FusionEnd scheme).
+//
+// Detects fused regions by walking backward from each `FusionEnd`-tagged LLIR
+// node through `Direction::Incoming` edges until a `FusionStart` is reached.
+// The walker stops at FusionStarts (they mark the external-input boundary of
+// the region). A region's summary is: the sorted set of internal op names,
+// the count of distinct FusionStart nodes reached, and the count of FusionEnd
+// nodes (invariant: always 1 per region).
+// =========================================================================
+
+/// A single fused region extracted from the LLIR graph after egglog.
+#[derive(Debug, Clone, PartialEq, Eq, Hash)]
+struct FusedRegion {
+    /// Sorted internal op `kernel_name()`s, excluding the `FusionStart` /
+    /// `FusionEnd` markers. Sorted so DAG traversal order doesn't produce
+    /// spurious "distinct" regions.
+    internal_ops_sorted: Vec<String>,
+    /// Number of distinct `FusionStart` nodes reached by the walk. Per design
+    /// this equals the number of distinct external input tensors.
+    start_count: usize,
+    /// Number of `FusionEnd` nodes in the region. Per design this is always 1.
+    end_count: usize,
+}
+
+/// Helper: collect every distinct fused region reachable across many random
+/// extractions of the search space.
+fn extract_all_fused_regions(cx: &mut Graph) -> Vec<FusedRegion> {
+    cx.build_search_space::<CudaRuntime>();
+    let egraph = cx.egraph().expect("egraph not built");
+    let ops = cx.egglog_ops().expect("ops not built");
+    let custom_ops = &cx.custom_ops;
+
+    let mut seen: Vec<FusedRegion> = Vec::new();
+    // 200 samples: the random extractor picks one e-node per e-class per
+    // call, and the fully-fused diamond form lives in an e-class with
+    // many equivalent forms. 50 was flaky; 200 is reliably stable and
+    // each sample is cheap (~100 µs).
+    for _ in 0..200 {
+        let choices = random_initial_choice(egraph, &mut rand::rng());
+        let mut list_cache = Default::default();
+        let mut expr_cache = Default::default();
+        let llir = egglog_to_llir(
+            egraph,
+            choices,
+            ops,
+            custom_ops,
+            &mut list_cache,
+            &mut expr_cache,
+            None,
+        );
+
+        let name_of = |idx: NodeIndex| -> Option<String> {
+            llir.node_weight(idx).and_then(|op| {
+                op.to_dialect::<dyn KernelOp>()
+                    .map(|k| k.kernel_name().to_string())
+            })
+        };
+
+        let end_nodes: Vec<NodeIndex> = llir
+            .node_indices()
+            .filter(|&idx| name_of(idx).as_deref() == Some("FusionEnd"))
+            .collect();
+
+        for end in end_nodes {
+            let mut internal: Vec<String> = Vec::new();
+            // Count distinct external input *tensors*, not distinct FusionStart
+            // node indices. Egglog rule firings can emit multiple FusionStart
+            // enodes that all wrap the same source tensor (e.g. when the same
+            // `a` is consumed at two sites inside the fused region, each
+            // pair-fuse / grow firing mints its own FusionStart). Those are
+            // logically one FusionStart per the design invariant
+            // ("N = number of distinct external input tensors").
+            let mut start_sources: FxHashSet<NodeIndex> = FxHashSet::default();
+            let mut visited: FxHashSet<NodeIndex> = FxHashSet::default();
+            visited.insert(end);
+            let mut stack = vec![end];
+
+            // Resolve chains of nested FusionStart wrappers (cascade artifact)
+            // to the real external source. A FusionStart whose incoming neighbor
+            // is itself a FusionStart — or a FusionEnd whose region is fully
+            // inside ours — is a cascade layer, not a new external tensor.
+            let resolve_source = |mut n: NodeIndex| -> NodeIndex {
+                loop {
+                    match name_of(n).as_deref() {
+                        Some("FusionStart") | Some("FusionEnd") => {
+                            let mut inc = llir.neighbors_directed(n, petgraph::Direction::Incoming);
+                            match inc.next() {
+                                Some(p) => n = p,
+                                None => return n,
+                            }
+                        }
+                        _ => return n,
+                    }
+                }
+            };
+
+            while let Some(node) = stack.pop() {
+                for pred in llir.neighbors_directed(node, petgraph::Direction::Incoming) {
+                    if !visited.insert(pred) {
+                        continue;
+                    }
+                    match name_of(pred).as_deref() {
+                        Some("FusionStart") => {
+                            // If this FS's predecessor is itself a FE (or a
+                            // chain of FS/FE wrappers that eventually hits a
+                            // non-marker op inside the region), the FS is a
+                            // cascade artifact, not a real external boundary.
+                            // Walk past it and its upstream FE into the same
+                            // region. Otherwise treat the predecessor as the
+                            // external source tensor — which may be a KernelOp
+                            // *or* a non-KernelOp (HLIR loadable) node, so we
+                            // can't gate counting on `name_of` being `Some`.
+                            let mut inc =
+                                llir.neighbors_directed(pred, petgraph::Direction::Incoming);
+                            match inc.next() {
+                                Some(src_node)
+                                    if name_of(src_node).as_deref() == Some("FusionEnd") =>
+                                {
+                                    // Merge adjacent regions — treat the FS/FE
+                                    // pair as internal; walk past the upstream
+                                    // FE into its region.
+                                    visited.insert(src_node);
+                                    stack.push(src_node);
+                                }
+                                Some(src_node) => {
+                                    start_sources.insert(resolve_source(src_node));
+                                }
+                                None => {
+                                    // FS with no predecessor — degenerate.
+                                }
+                            }
+                        }
+                        Some("FusionEnd") => {
+                            // Transparent: inner FusionEnds are cascade-wart
+                            // artifacts from grow rules re-firing and creating
+                            // nested `FE(Op(FE(...)))` wrappers. They don't
+                            // represent real work or a real boundary — walk
+                            // past them and do not count them as internal ops.
+                            stack.push(pred);
+                        }
+                        Some(other) => {
+                            internal.push(other.to_string());
+                            stack.push(pred);
+                        }
+                        None => {
+                            // Non-KernelOp predecessor (shouldn't appear inside a
+                            // fused region under the design). Stop walking this path.
+                        }
+                    }
+                }
+            }
+
+            internal.sort();
+            // Skip singleton regions: every elementwise op has a seeded
+            // `FE(Op(FS(...)))` form, so random extraction will surface
+            // many one-op regions that are equivalent to not fusing. We
+            // only care about regions that represent real multi-op fusion.
+            if internal.len() < 2 {
+                continue;
+            }
+            let region = FusedRegion {
+                internal_ops_sorted: internal,
+                start_count: start_sources.len(),
+                end_count: 1,
+            };
+            if !seen.contains(&region) {
+                seen.push(region);
+            }
+        }
+    }
+    seen
+}
+
+fn sorted_names(items: &[&str]) -> Vec<String> {
+    let mut v: Vec<String> = items.iter().map(|s| (*s).to_string()).collect();
+    v.sort();
+    v
+}
+
+// ---- Structural tests: the expected fused shape is reachable ----
+
+#[test]
+fn test_single_binary_does_not_fuse_alone() {
+    // A lone elementwise op gets a seeded singleton region by design; we
+    // filter singletons out in `extract_all_fused_regions`. What this test
+    // asserts is that no *multi-op* region appears for a standalone binary
+    // — nothing to grow into.
+    let mut cx = Graph::new();
+    let a = cx.tensor(8);
+    let b = cx.tensor(8);
+    let _c = (a + b).output();
+
+    let regions = extract_all_fused_regions(&mut cx);
+    assert!(
+        regions.is_empty(),
+        "a solo binary op should not form a multi-op fused region, but got: {regions:#?}"
+    );
+}
+
+#[test]
+fn test_chain_of_binaries_fuses() {
+    // `(a + b) * c`: three external inputs collapse into one region with
+    // internal [Add, Mul] and 3 FusionStarts.
+    let mut cx = Graph::new();
+    let a = cx.tensor(8);
+    let b = cx.tensor(8);
+    let c = cx.tensor(8);
+    let _d = ((a + b) * c).output();
+
+    let regions = extract_all_fused_regions(&mut cx);
+    let expected = sorted_names(&["FusedAdd", "FusedMul"]);
+    assert!(
+        regions
+            .iter()
+            .any(|r| r.internal_ops_sorted == expected && r.start_count == 3),
+        "expected a fused region of {expected:?} with 3 FusionStarts, got: {regions:#?}"
+    );
+}
+
+#[test]
+fn test_binary_then_unary_fuses() {
+    // `sin(a + b)`: binary feeds a unary inside one fused region.
+    let mut cx = Graph::new();
+    let a = cx.tensor(8);
+    let b = cx.tensor(8);
+    let _c = (a + b).sin().output();
+
+    let regions = extract_all_fused_regions(&mut cx);
+    let expected = sorted_names(&["FusedAdd", "FusedSin"]);
+    assert!(
+        regions
+            .iter()
+            .any(|r| r.internal_ops_sorted == expected && r.start_count == 2),
+        "expected a fused region of {expected:?} with 2 FusionStarts, got: {regions:#?}"
+    );
+}
+
+#[test]
+fn test_unary_then_binary_fuses() {
+    // `sin(a) + b`: unary feeds a binary inside one fused region.
+    let mut cx = Graph::new();
+    let a = cx.tensor(8);
+    let b = cx.tensor(8);
+    let _c = (a.sin() + b).output();
+
+    let regions = extract_all_fused_regions(&mut cx);
+    let expected = sorted_names(&["FusedAdd", "FusedSin"]);
+    assert!(
+        regions
+            .iter()
+            .any(|r| r.internal_ops_sorted == expected && r.start_count == 2),
+        "expected a fused region of {expected:?} with 2 FusionStarts, got: {regions:#?}"
+    );
+}
+
+#[test]
+fn test_diamond_dag_fuses() {
+    // The canonical diamond-DAG example agreed with the user:
+    //   t = a + b; u = exp2(t); v = sin(t); w = u * a; out = w + v
+    // `a` is reused (feeds outer Add and Mul) and `t` is reused (feeds Exp2 and
+    // Sin). Expected: one fused region with internal ops [Add, Add, Exp2, Mul,
+    // Sin], 2 FusionStarts (distinct tensors a, b), 1 FusionEnd.
+    // We use exp2 rather than exp because the frontend's exp() desugars to
+    // Mul(x, LOG2E).exp2(), which would add a constant input and a Mul op and
+    // obscure the diamond topology this test is checking.
+    let mut cx = Graph::new();
+    let a = cx.tensor(8);
+    let b = cx.tensor(8);
+    let t = a + b;
+    let u = t.exp2();
+    let v = t.sin();
+    let w = u * a;
+    let _out = (w + v).output();
+
+    let regions = extract_all_fused_regions(&mut cx);
+    let expected = sorted_names(&["FusedAdd", "FusedAdd", "FusedExp2", "FusedMul", "FusedSin"]);
+    assert!(
+        regions
+            .iter()
+            .any(|r| r.internal_ops_sorted == expected && r.start_count == 2 && r.end_count == 1),
+        "expected diamond DAG to fuse into one region with ops {expected:?}, \
+         2 FusionStarts, 1 FusionEnd. Got: {regions:#?}"
+    );
+}
+
+// ---- Negative tests: fusion must NOT happen across these blockers ----
+
+#[test]
+fn test_reduction_blocks_binary_fusion() {
+    // A reduction between a binary and anything downstream is not elementwise,
+    // so Add and SumReduce must never appear in the same fused region.
+    let mut cx = Graph::new();
+    let a = cx.tensor((4, 4));
+    let b = cx.tensor((4, 4));
+    let _c = (a + b).sum(1).output();
+
+    let regions = extract_all_fused_regions(&mut cx);
+    for r in &regions {
+        let has_add = r.internal_ops_sorted.iter().any(|n| n == "FusedAdd");
+        let has_sum = r.internal_ops_sorted.iter().any(|n| n == "SumReduce");
+        assert!(
+            !(has_add && has_sum),
+            "FusedAdd and SumReduce must not share a fused region, but got: {r:#?}"
+        );
+    }
+}
+
+#[test]
+fn test_stride_mismatch_blocks_binary_fusion() {
+    // A permute gives `b` a non-contiguous view whose strides do not match `a`'s,
+    // so the binary fusion rule's stride-compatibility check must prevent the
+    // Add from being absorbed into any fused region.
+    let mut cx = Graph::new();
+    let a = cx.tensor((3, 4));
+    let b = cx.tensor((4, 3));
+    let _c = (a + b.permute((1, 0))).output();
+
+    let regions = extract_all_fused_regions(&mut cx);
+    for r in &regions {
+        assert!(
+            !r.internal_ops_sorted.iter().any(|n| n == "FusedAdd"),
+            "permuted binary must not fuse into a region, but found: {r:#?}"
+        );
+    }
+}
+
+// ---- Numerical parity tests: fused output matches candle reference ----
+
+#[test]
+fn test_simple_binary_fusion_preserves_output() {
+    // End-to-end numerical check: `a + b` on GPU matches candle's add across
+    // all reachable genomes (fused or unfused) via test_binary_cuda's fuzzer.
+    let seed = 0xADDBEEFu64;
+    let eps = dtype_epsilon(luminal::dtype::DType::F32);
+    let tol = eps * TOLERANCE_SAFETY_FACTOR;
+    test_binary_cuda::<f32>(
+        16,
+        16,
+        |a, b| a + b,
+        |a, b| (a + b).unwrap(),
+        |n, s| random_f32_vec(n, s, 0.0, 1.0),
+        |n, s| random_f32_vec(n, s, 0.0, 1.0),
+        seed,
+        tol,
+        tol,
+    );
+}
+
+#[test]
+fn test_diamond_dag_preserves_output() {
+    // Numerical parity for the diamond DAG: `(exp(a+b) * a) + sin(a+b)`
+    // matches candle's equivalent across fused and unfused genomes.
+    // Inputs are drawn from [-1, 1] so exp() doesn't overflow.
+    let seed = 0xD1A_0D1Au64;
+    let eps = dtype_epsilon(luminal::dtype::DType::F32);
+    // Five-op chain with exp + sin: allow ~5x safety to absorb accumulated
+    // rounding vs candle's kernels.
+    let tol = eps * TOLERANCE_SAFETY_FACTOR * 5.0;
+    test_binary_cuda::<f32>(
+        16,
+        16,
+        |a, b| {
+            let t = a + b;
+            let u = t.exp();
+            let v = t.sin();
+            let w = u * a;
+            w + v
+        },
+        |a, b| {
+            let t = (&a + &b).unwrap();
+            let u = t.exp().unwrap();
+            let v = t.sin().unwrap();
+            let w = (&u * &a).unwrap();
+            (&w + &v).unwrap()
+        },
+        |n, s| random_f32_vec(n, s, -1.0, 1.0),
+        |n, s| random_f32_vec(n, s, -1.0, 1.0),
+        seed,
+        tol,
+        tol,
+    );
+}
+
+// ---- Marker invariant tests ----
+
+#[test]
+fn test_fused_region_has_exactly_one_end() {
+    // Design invariant: a fused region always has exactly one FusionEnd.
+    // Uses the diamond DAG so there's real fan-in/out inside the region.
+    // See test_diamond_dag_fuses for why we use exp2 directly.
+    let mut cx = Graph::new();
+    let a = cx.tensor(8);
+    let b = cx.tensor(8);
+    let t = a + b;
+    let u = t.exp2();
+    let v = t.sin();
+    let w = u * a;
+    let _out = (w + v).output();
+
+    let regions = extract_all_fused_regions(&mut cx);
+    let expected = sorted_names(&["FusedAdd", "FusedAdd", "FusedExp2", "FusedMul", "FusedSin"]);
+    let full = regions
+        .iter()
+        .find(|r| r.internal_ops_sorted == expected)
+        .expect("expected at least one extraction to produce the full 5-op diamond region");
+    assert_eq!(
+        full.end_count, 1,
+        "fused region must have exactly one FusionEnd, got {}",
+        full.end_count
+    );
+}
+
+#[test]
+fn test_fused_region_starts_match_distinct_external_tensors() {
+    // Design invariant: FusionStart count == number of distinct external input
+    // tensors, NOT number of edges crossing the boundary. In the diamond DAG
+    // `a` is consumed inside the region by two ops (outer Add + Mul), so a
+    // per-edge counting scheme would give 3; the correct per-distinct-tensor
+    // count is 2 ({a, b}).
+    // See test_diamond_dag_fuses for why we use exp2 directly.
+    let mut cx = Graph::new();
+    let a = cx.tensor(8);
+    let b = cx.tensor(8);
+    let t = a + b;
+    let u = t.exp2();
+    let v = t.sin();
+    let w = u * a;
+    let _out = (w + v).output();
+
+    let regions = extract_all_fused_regions(&mut cx);
+    let expected = sorted_names(&["FusedAdd", "FusedAdd", "FusedExp2", "FusedMul", "FusedSin"]);
+    // Multiple 5-op extractions are reachable: the merge-FE-FE rule fires
+    // across paths that may have minted distinct FS enodes for the shared
+    // tensor `a` at separate sites. The design invariant is that *some*
+    // extraction collapses those into the deduped form (one FS per distinct
+    // tensor → 2 FS for {a, b}); we don't require every random sample to.
+    let matching: Vec<&FusedRegion> = regions
+        .iter()
+        .filter(|r| r.internal_ops_sorted == expected)
+        .collect();
+    assert!(
+        !matching.is_empty(),
+        "expected at least one extraction to produce the full 5-op diamond region, \
+         got: {regions:#?}"
+    );
+    assert!(
+        matching
+            .iter()
+            .any(|r| r.start_count == 2 && r.end_count == 1),
+        "expected at least one 5-op diamond extraction with FusionStart count == 2 \
+         (one per distinct external tensor) and FusionEnd count == 1; got: {matching:#?}"
+    );
+}
+
+// ---- Targeted rule-family tests (one per family / orientation) ----
+//
+// The structural and diamond tests above hit several rule families at once.
+// These narrow tests pin each rule family / orientation independently so a
+// regression in one rule shows up as a single failing test rather than a
+// confusing diamond mismatch.
+
+#[test]
+fn test_pair_fuse_unary_unary_marker_form() {
+    // Pair-fuse U→U: `a.sin().sqrt()` should be reachable as a marker-bracketed
+    // region containing FusedSin and FusedSqrt (with one FusionStart for `a`).
+    let mut cx = Graph::new();
+    let a = cx.tensor(8);
+    let _b = a.sin().sqrt().output();
+
+    let regions = extract_all_fused_regions(&mut cx);
+    let expected = sorted_names(&["FusedSin", "FusedSqrt"]);
+    assert!(
+        regions
+            .iter()
+            .any(|r| r.internal_ops_sorted == expected && r.start_count == 1 && r.end_count == 1),
+        "expected marker region of {expected:?} with 1 FusionStart, got: {regions:#?}"
+    );
+}
+
+#[test]
+fn test_pair_fuse_unary_to_binary_rhs() {
+    // Pair-fuse U→B (RHS variant): `a + b.sin()`. The unary is on the
+    // binary's B input, so the rule's RHS-orientation version is what fires.
+    let mut cx = Graph::new();
+    let a = cx.tensor(8);
+    let b = cx.tensor(8);
+    let _c = (a + b.sin()).output();
+
+    let regions = extract_all_fused_regions(&mut cx);
+    let expected = sorted_names(&["FusedAdd", "FusedSin"]);
+    assert!(
+        regions
+            .iter()
+            .any(|r| r.internal_ops_sorted == expected && r.start_count == 2),
+        "expected a fused region of {expected:?} with 2 FusionStarts (RHS-side unary), \
+         got: {regions:#?}"
+    );
+}
+
+#[test]
+fn test_pair_fuse_binary_to_binary_rhs() {
+    // Pair-fuse B→B (RHS variant): `c * (a + b)`. The inner binary feeds the
+    // outer binary's B input, exercising the mirror direction of the rule
+    // covered by test_chain_of_binaries_fuses.
+    let mut cx = Graph::new();
+    let a = cx.tensor(8);
+    let b = cx.tensor(8);
+    let c = cx.tensor(8);
+    let _d = (c * (a + b)).output();
+
+    let regions = extract_all_fused_regions(&mut cx);
+    let expected = sorted_names(&["FusedAdd", "FusedMul"]);
+    assert!(
+        regions
+            .iter()
+            .any(|r| r.internal_ops_sorted == expected && r.start_count == 3),
+        "expected a fused region of {expected:?} with 3 FusionStarts (RHS-side inner binary), \
+         got: {regions:#?}"
+    );
+}
+
+#[test]
+fn test_grow_fe_to_binary_rhs() {
+    // Grow FE→B (RHS variant): `c + (a.sin() + b)`. Once the inner
+    // `a.sin() + b` is fused, the outer `+ c` consumes that FE on its B input
+    // (because we wrote `c + (...)` — `c` is on LHS, FE on RHS), exercising
+    // grow-FE-B-rhs to absorb the outer Add into the same region.
+    let mut cx = Graph::new();
+    let a = cx.tensor(8);
+    let b = cx.tensor(8);
+    let c = cx.tensor(8);
+    let _d = (c + (a.sin() + b)).output();
+
+    let regions = extract_all_fused_regions(&mut cx);
+    let expected = sorted_names(&["FusedAdd", "FusedAdd", "FusedSin"]);
+    assert!(
+        regions
+            .iter()
+            .any(|r| r.internal_ops_sorted == expected && r.start_count == 3),
+        "expected a 3-op fused region of {expected:?} with 3 FusionStarts (grow into RHS), \
+         got: {regions:#?}"
+    );
+}
+
+#[test]
+fn test_merge_two_regions_at_outer_binary() {
+    // Merge: `(sin(a) + b) + (sqrt(c) + d)`. Each side independently pair-fuses
+    // U→B on its own (the unary gives the inner Add a fusion partner that
+    // doesn't pull in the outer Add), so both sides become FEs. The outer Add
+    // then fires merge-FE-FE-Add to collapse them into a single region.
+    // Without the unaries, `(a+b) + (c+d)` would only ever pair-fuse one
+    // inner Add at a time with the outer Add — merge wouldn't have two FEs to
+    // combine because the inner Adds never become singleton FEs on their own.
+    let mut cx = Graph::new();
+    let a = cx.tensor(8);
+    let b = cx.tensor(8);
+    let c = cx.tensor(8);
+    let d = cx.tensor(8);
+    let _e = ((a.sin() + b) + (c.sqrt() + d)).output();
+
+    let regions = extract_all_fused_regions(&mut cx);
+    let expected = sorted_names(&["FusedAdd", "FusedAdd", "FusedAdd", "FusedSin", "FusedSqrt"]);
+    assert!(
+        regions
+            .iter()
+            .any(|r| r.internal_ops_sorted == expected && r.start_count == 4),
+        "expected a 5-op merged region (two pair-fused sides combined at outer Add) with \
+         4 FusionStarts, got: {regions:#?}"
+    );
+}
+
+/// Microbench: time three unfused kernels (`add_k` → `sin_k` → `sqrt_k`)
+/// vs one fused kernel (`(a + b).sin().sqrt()` in a single launch) on a
+/// fixed-size input, using CUDA events for device-side timing. Mirrors
+/// the existing sqrt→recip bench but on the binary-inclusive 3-op DAG
+/// PR2's region codegen targets.
+///
+/// Ignored by default — run with
+/// `cargo test -p luminal_cuda_lite -- --ignored bench_fused_region_vs_unfused_3op --nocapture`.
+#[test]
+#[ignore]
+fn bench_fused_region_vs_unfused_3op() {
+    use crate::compile_module_image_for_current_device;
+    use cudarc::driver::{CudaContext, LaunchConfig, PushKernelArg};
+
+    const N: usize = 1 << 20; // 1M elements
+    const WARMUP: usize = 100;
+    const TRIALS: usize = 2000;
+
+    let ctx = match CudaContext::new(0) {
+        Ok(c) => c,
+        Err(_) => return, // no GPU available, skip
+    };
+    ctx.bind_to_thread().unwrap();
+    let stream = ctx.default_stream();
+
+    // Inputs in (0, 1] keep `sin` < 1 and `sqrt` well-defined post-add.
+    let host_a: Vec<f32> = (0..N)
+        .map(|i| (i as f32 + 1.0) / (N as f32) * 0.5)
+        .collect();
+    let host_b: Vec<f32> = (0..N)
+        .map(|i| (i as f32 + 1.0) / (N as f32) * 0.5)
+        .collect();
+    let d_a = stream.clone_htod(&host_a).unwrap();
+    let d_b = stream.clone_htod(&host_b).unwrap();
+    let mut d_scratch1 = stream.alloc_zeros::<f32>(N).unwrap();
+    let mut d_scratch2 = stream.alloc_zeros::<f32>(N).unwrap();
+    let mut d_out = stream.alloc_zeros::<f32>(N).unwrap();
+
+    let compile = |src: &str, name: &str| {
+        let ptx = compile_module_image_for_current_device(stream.context(), src).unwrap();
+        let module = stream.context().load_module(ptx).unwrap();
+        module.load_function(name).unwrap()
+    };
+
+    let add_k = compile(
+        r#"
+extern "C" __global__ void add_k(float* out, const float* a, const float* b, long long n) {
+    long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
+    if (i >= n) return;
+    out[i] = a[i] + b[i];
+}
+"#,
+        "add_k",
+    );
+    let sin_k = compile(
+        r#"
+extern "C" __global__ void sin_k(float* out, const float* in, long long n) {
+    long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
+    if (i >= n) return;
+    out[i] = sinf(in[i]);
+}
+"#,
+        "sin_k",
+    );
+    let sqrt_k = compile(
+        r#"
+extern "C" __global__ void sqrt_k(float* out, const float* in, long long n) {
+    long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
+    if (i >= n) return;
+    out[i] = sqrtf(in[i]);
+}
+"#,
+        "sqrt_k",
+    );
+    let fused_k = compile(
+        r#"
+extern "C" __global__ void fused_k(float* out, const float* a, const float* b, long long n) {
+    long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
+    if (i >= n) return;
+    float v = a[i] + b[i];
+    v = sinf(v);
+    v = sqrtf(v);
+    out[i] = v;
+}
+"#,
+        "fused_k",
+    );
+
+    let cfg = LaunchConfig::for_num_elems(N as u32);
+    let n_arg: i64 = N as i64;
+
+    let launch_unfused =
+        |d_out: &mut cudarc::driver::CudaSlice<f32>,
+         d_scratch1: &mut cudarc::driver::CudaSlice<f32>,
+         d_scratch2: &mut cudarc::driver::CudaSlice<f32>| {
+            let mut b = stream.launch_builder(&add_k);
+            b.arg(&mut *d_scratch1).arg(&d_a).arg(&d_b).arg(&n_arg);
+            unsafe { b.launch(cfg) }.unwrap();
+            let mut b = stream.launch_builder(&sin_k);
+            b.arg(&mut *d_scratch2).arg(&*d_scratch1).arg(&n_arg);
+            unsafe { b.launch(cfg) }.unwrap();
+            let mut b = stream.launch_builder(&sqrt_k);
+            b.arg(d_out).arg(&*d_scratch2).arg(&n_arg);
+            unsafe { b.launch(cfg) }.unwrap();
+        };
+    let launch_fused = |d_out: &mut cudarc::driver::CudaSlice<f32>| {
+        let mut b = stream.launch_builder(&fused_k);
+        b.arg(d_out).arg(&d_a).arg(&d_b).arg(&n_arg);
+        unsafe { b.launch(cfg) }.unwrap();
+    };
+
+    // Warmup
+    for _ in 0..WARMUP {
+        launch_unfused(&mut d_out, &mut d_scratch1, &mut d_scratch2);
+        launch_fused(&mut d_out);
+    }
+    stream.synchronize().unwrap();
+
+    // Host-side wall-clock timing: synchronize before/after each batch so the
+    // measured interval covers exactly the GPU work for `TRIALS` iterations.
+    // (CUDA event-based timing is the more precise option in principle, but
+    // `event.elapsed_ms` on this driver/cudarc combo errors with
+    // CUDA_ERROR_INVALID_HANDLE — see bench_fused_vs_unfused_sqrt_recip
+    // above which fails the same way. Wall-clock is reliable here.)
+    let unfused_start = std::time::Instant::now();
+    for _ in 0..TRIALS {
+        launch_unfused(&mut d_out, &mut d_scratch1, &mut d_scratch2);
+    }
+    stream.synchronize().unwrap();
+    let unfused_total_ms = unfused_start.elapsed().as_secs_f64() * 1_000.0;
+
+    let fused_start = std::time::Instant::now();
+    for _ in 0..TRIALS {
+        launch_fused(&mut d_out);
+    }
+    stream.synchronize().unwrap();
+    let fused_total_ms = fused_start.elapsed().as_secs_f64() * 1_000.0;
+
+    let unfused_us = unfused_total_ms * 1_000.0 / TRIALS as f64;
+    let fused_us = fused_total_ms * 1_000.0 / TRIALS as f64;
+    let speedup = unfused_us / fused_us;
+
+    println!(
+        "\n[fusion microbench, (a+b).sin().sqrt(), N={N}, trials={TRIALS}]\n\
+         unfused (add_k; sin_k; sqrt_k): {unfused_us:8.3} us/iter ({unfused_total_ms:.2} ms total)\n\
+         fused   (one kernel):           {fused_us:8.3} us/iter ({fused_total_ms:.2} ms total)\n\
+         speedup: {speedup:.2}x"
+    );
+}
--- a/crates/luminal_cuda_lite/src/tests/mod.rs
+++ b/crates/luminal_cuda_lite/src/tests/mod.rs
@@ -5,10 +5,18 @@ mod bucket_tests;
 #[cfg(test)]
 mod consumed_buffer_tests;
 #[cfg(test)]
+mod cublaslt_rewrite_tests;
+#[cfg(test)]
+mod flashinfer;
+#[cfg(test)]
+mod fusion;
+#[cfg(test)]
 mod model_fuzz;
 #[cfg(test)]
 mod op_functional_tests;
 #[cfg(test)]
 mod performance_tests;
 #[cfg(test)]
+mod qwen3_moe_rewrite;
+#[cfg(test)]
 mod transformer;
--- a/crates/luminal_cuda_lite/src/tests/model_fuzz.rs
+++ b/crates/luminal_cuda_lite/src/tests/model_fuzz.rs
@@ -1,7 +1,12 @@
 //! Fuzz tests for model-architecture-specific subgraphs (Llama, Gemma, Qwen).
 //!
 //! Tests many random e-graph extraction variants (genomes) against a candle CPU
-//! reference to catch incorrect HLIR kernel fallback rewrites.
+//! reference to catch incorrect HLIR kernel rewrites.
+//!
+//! These are marked ignored by default because each test builds a model-shaped
+//! graph and checks many extraction genomes. Run them explicitly with
+//! `cargo test -p luminal_cuda_lite -- --ignored` when touching extraction,
+//! scheduling, or model-pattern rewrites.

 use luminal::prelude::*;

@@ -377,32 +382,38 @@ mod llama {
    const EPS: f32 = 1e-5;

    #[test]
+    #[ignore = "expensive CUDA model genome fuzzing; run with cargo test -p luminal_cuda_lite -- --ignored"]
    fn fuzz_llama_mlp() {
        fuzz_mlp(SEQ, HIDDEN, INTERMEDIATE, 42);
    }

    #[test]
+    #[ignore = "expensive CUDA model genome fuzzing; run with cargo test -p luminal_cuda_lite -- --ignored"]
    fn fuzz_llama_norm_proj() {
        fuzz_norm_proj(SEQ, HIDDEN, PROJ_DIM, EPS, 100);
    }

    #[test]
+    #[ignore = "expensive CUDA model genome fuzzing; run with cargo test -p luminal_cuda_lite -- --ignored"]
    fn fuzz_llama_layer() {
        fuzz_layer_no_attn(SEQ, HIDDEN, INTERMEDIATE, PROJ_DIM, EPS, 200);
    }

    #[test]
+    #[ignore = "expensive CUDA model genome fuzzing; run with cargo test -p luminal_cuda_lite -- --ignored"]
    fn fuzz_llama_mlp_seq1() {
        fuzz_mlp(1, HIDDEN, INTERMEDIATE, 300);
    }

    #[test]
+    #[ignore = "expensive CUDA model genome fuzzing; run with cargo test -p luminal_cuda_lite -- --ignored"]
    fn fuzz_llama_mlp_seq7() {
        fuzz_mlp(7, HIDDEN, INTERMEDIATE, 400);
    }

-    /// Force HLIR-only (no block ops) to specifically test the fallback path.
+    /// Force HLIR-only (no block ops) to specifically test that extraction path.
    #[test]
+    #[ignore = "expensive CUDA model genome fuzzing; run with cargo test -p luminal_cuda_lite -- --ignored"]
    fn fuzz_llama_mlp_hlir_only() {
        fuzz_mlp_hlir_only(SEQ, HIDDEN, INTERMEDIATE, 450);
    }
@@ -424,22 +435,26 @@ mod gemma {
    const EPS: f32 = 1e-6;

    #[test]
+    #[ignore = "expensive CUDA model genome fuzzing; run with cargo test -p luminal_cuda_lite -- --ignored"]
    fn fuzz_gemma_mlp() {
        fuzz_mlp(SEQ, HIDDEN, INTERMEDIATE, 500);
    }

    #[test]
+    #[ignore = "expensive CUDA model genome fuzzing; run with cargo test -p luminal_cuda_lite -- --ignored"]
    fn fuzz_gemma_norm_proj() {
        fuzz_norm_proj(SEQ, HIDDEN, Q_DIM, EPS, 600);
    }

    #[test]
+    #[ignore = "expensive CUDA model genome fuzzing; run with cargo test -p luminal_cuda_lite -- --ignored"]
    fn fuzz_gemma_layer() {
        fuzz_layer_no_attn(SEQ, HIDDEN, INTERMEDIATE, Q_DIM, EPS, 700);
    }

    /// Gemma has extra post-attention and post-feedforward norms.
    #[test]
+    #[ignore = "expensive CUDA model genome fuzzing; run with cargo test -p luminal_cuda_lite -- --ignored"]
    fn fuzz_gemma_layer_full_norms() {
        let Some(stream) = get_cuda_stream() else {
            return;
@@ -564,12 +579,14 @@ mod gemma {
    }

    #[test]
+    #[ignore = "expensive CUDA model genome fuzzing; run with cargo test -p luminal_cuda_lite -- --ignored"]
    fn fuzz_gemma_mlp_seq1() {
        fuzz_mlp(1, HIDDEN, INTERMEDIATE, 900);
    }

-    /// Force HLIR-only to test fallback path with Gemma dimensions.
+    /// Force HLIR-only to test that extraction path with Gemma dimensions.
    #[test]
+    #[ignore = "expensive CUDA model genome fuzzing; run with cargo test -p luminal_cuda_lite -- --ignored"]
    fn fuzz_gemma_mlp_hlir_only() {
        fuzz_mlp_hlir_only(SEQ, HIDDEN, INTERMEDIATE, 950);
    }
@@ -591,22 +608,26 @@ mod qwen {
    const EPS: f32 = 1e-6;

    #[test]
+    #[ignore = "expensive CUDA model genome fuzzing; run with cargo test -p luminal_cuda_lite -- --ignored"]
    fn fuzz_qwen_mlp() {
        fuzz_mlp(SEQ, HIDDEN, INTERMEDIATE, 1000);
    }

    #[test]
+    #[ignore = "expensive CUDA model genome fuzzing; run with cargo test -p luminal_cuda_lite -- --ignored"]
    fn fuzz_qwen_norm_proj() {
        fuzz_norm_proj(SEQ, HIDDEN, Q_DIM, EPS, 1100);
    }

    #[test]
+    #[ignore = "expensive CUDA model genome fuzzing; run with cargo test -p luminal_cuda_lite -- --ignored"]
    fn fuzz_qwen_layer() {
        fuzz_layer_no_attn(SEQ, HIDDEN, INTERMEDIATE, Q_DIM, EPS, 1200);
    }

    /// Qwen uses tied embeddings: lm_head = embedding^T
    #[test]
+    #[ignore = "expensive CUDA model genome fuzzing; run with cargo test -p luminal_cuda_lite -- --ignored"]
    fn fuzz_qwen_lm_head() {
        let Some(stream) = get_cuda_stream() else {
            return;
@@ -668,17 +689,20 @@ mod qwen {
    }

    #[test]
+    #[ignore = "expensive CUDA model genome fuzzing; run with cargo test -p luminal_cuda_lite -- --ignored"]
    fn fuzz_qwen_mlp_seq1() {
        fuzz_mlp(1, HIDDEN, INTERMEDIATE, 1400);
    }

    #[test]
+    #[ignore = "expensive CUDA model genome fuzzing; run with cargo test -p luminal_cuda_lite -- --ignored"]
    fn fuzz_qwen_mlp_seq7() {
        fuzz_mlp(7, HIDDEN, INTERMEDIATE, 1500);
    }

-    /// Force HLIR-only to test fallback path with Qwen dimensions.
+    /// Force HLIR-only to test that extraction path with Qwen dimensions.
    #[test]
+    #[ignore = "expensive CUDA model genome fuzzing; run with cargo test -p luminal_cuda_lite -- --ignored"]
    fn fuzz_qwen_mlp_hlir_only() {
        fuzz_mlp_hlir_only(SEQ, HIDDEN, INTERMEDIATE, 1550);
    }
--- a/crates/luminal_cuda_lite/src/tests/op_functional_tests.rs
+++ b/crates/luminal_cuda_lite/src/tests/op_functional_tests.rs
@@ -16,9 +16,16 @@ use super::utilities::{
    test_binary_cuda, test_mod, test_unary_cuda, to_candle_dtype,
 };

+// The property-based op tests each build/search CUDA graphs for multiple random
+// shapes. They are ignored by default to keep the main CUDA unit suite short;
+// run `cargo test -p luminal_cuda_lite -- --ignored` for the broader sweeps.
+
 proptest! {
    #![proptest_config(ProptestConfig::with_cases(5))]

+    #[ignore = "expensive CUDA op proptest sweep; run with cargo test -p luminal_cuda_lite -- --ignored"]
+
+
    #[test]
    fn test_add(x in 1usize..100, y in 1usize..5, seed in any::<u64>()) {
        let gen_lambda = |n, s| random_f32_vec(n, s, -0.5, 0.5);
@@ -28,6 +35,9 @@ proptest! {
        test_binary_cuda((y, x), (y, x), |a, b| a + b, |a, b| (&a + &b).unwrap(), gen_lambda, gen_lambda, seed, rtol, atol);
    }

+    #[ignore = "expensive CUDA op proptest sweep; run with cargo test -p luminal_cuda_lite -- --ignored"]
+
+
    #[test]
    fn test_mul(x in 1usize..100, y in 1usize..5, seed in any::<u64>()) {
        let gen_lambda = |n, s| random_f32_vec(n, s, -0.5, 0.5);
@@ -37,18 +47,27 @@ proptest! {
        test_binary_cuda((y, x), (y, x), |a, b| a * b, |a, b| (&a * &b).unwrap(), gen_lambda, gen_lambda, seed, rtol, atol);
    }

+    #[ignore = "expensive CUDA op proptest sweep; run with cargo test -p luminal_cuda_lite -- --ignored"]
+
+
    #[test]
    fn test_max(rows in 1usize..8, cols in 1usize..8, seed in any::<u64>()) {
        let gen_lambda = |n, s| random_f32_vec(n, s, -0.5, 0.5);
        test_unary_cuda((rows, cols), |a| a.max(1), |a| a.max(1).unwrap(), gen_lambda, seed);
    }

+    #[ignore = "expensive CUDA op proptest sweep; run with cargo test -p luminal_cuda_lite -- --ignored"]
+
+
    #[test]
    fn test_mean(rows in 1usize..8, cols in 1usize..8, seed in any::<u64>()) {
        let gen_lambda = |n, s| random_f32_vec(n, s, -0.5, 0.5);
        test_unary_cuda((rows, cols), |a| a.mean(1), |a| a.mean(1).unwrap(), gen_lambda, seed);
    }

+    #[ignore = "expensive CUDA op proptest sweep; run with cargo test -p luminal_cuda_lite -- --ignored"]
+
+
    #[test]
    fn test_matmul(
        (m, n, k, a_col_major, b_col_major, m_slice, k_slice, n_slice, dtype) in
@@ -119,6 +138,8 @@ proptest! {
    }

    // Unary ops tests
+    #[ignore = "expensive CUDA op proptest sweep; run with cargo test -p luminal_cuda_lite -- --ignored"]
+
    #[test]
    fn test_exp2(x in 1usize..100, y in 1usize..5, seed in any::<u64>()) {
        // exp2(x) = 2^x, verified by computing 2^x using exp(x * ln(2))
@@ -127,6 +148,9 @@ proptest! {
        test_unary_cuda((y, x), |a| a.exp2(), |a| (a * 2.0f64.ln()).unwrap().exp().unwrap(), gen_lambda, seed);
    }

+    #[ignore = "expensive CUDA op proptest sweep; run with cargo test -p luminal_cuda_lite -- --ignored"]
+
+
    #[test]
    fn test_log2(x in 1usize..100, y in 1usize..5, seed in any::<u64>()) {
        // log2(x) = ln(x) / ln(2)
@@ -135,6 +159,9 @@ proptest! {
        test_unary_cuda((y, x), |a| a.log2(), |a| (a.log().unwrap() / 2.0f64.ln()).unwrap(), gen_lambda, seed);
    }

+    #[ignore = "expensive CUDA op proptest sweep; run with cargo test -p luminal_cuda_lite -- --ignored"]
+
+
    #[test]
    fn test_sin(x in 1usize..100, y in 1usize..5, seed in any::<u64>()) {
        let gen_lambda = |n, s| random_f32_vec(n, s, -0.5, 0.5);
@@ -142,6 +169,9 @@ proptest! {
        test_unary_cuda((y, x), |a| a.sin(), |a| a.sin().unwrap(), gen_lambda, seed);
    }

+    #[ignore = "expensive CUDA op proptest sweep; run with cargo test -p luminal_cuda_lite -- --ignored"]
+
+
    #[test]
    fn test_recip(x in 1usize..100, y in 1usize..5, seed in any::<u64>()) {
        let gen_lambda = |n, s| random_f32_vec(n, s, 0.1, 0.5);
@@ -149,6 +179,9 @@ proptest! {
        test_unary_cuda((y, x), |a| a.reciprocal(), |a| a.recip().unwrap(), gen_lambda, seed);
    }

+    #[ignore = "expensive CUDA op proptest sweep; run with cargo test -p luminal_cuda_lite -- --ignored"]
+
+
    #[test]
    fn test_sqrt(x in 1usize..100, y in 1usize..5, seed in any::<u64>()) {
        let gen_lambda = |n, s| random_f32_vec(n, s, 0.1, 0.6);
@@ -157,12 +190,17 @@ proptest! {
    }

    // Binary ops tests
+    #[ignore = "expensive CUDA op proptest sweep; run with cargo test -p luminal_cuda_lite -- --ignored"]
+
    #[test]
    fn test_mod_op(x in 1usize..100, y in 1usize..5, seed in any::<u64>()) {
        test_mod(x, x, |a, b| a % b, seed);
        test_mod((y, x), (y, x), |a, b| a % b, seed);
    }

+    #[ignore = "expensive CUDA op proptest sweep; run with cargo test -p luminal_cuda_lite -- --ignored"]
+
+
    #[test]
    fn test_less_than(x in 1usize..100, y in 1usize..5, seed in any::<u64>()) {
        let gen_lambda = |n, s| random_f32_vec(n, s, -99.0, 100.0).into_iter().map(|v| v.floor()).collect();
@@ -335,6 +373,8 @@ proptest! {
    #![proptest_config(ProptestConfig::with_cases(5))]

    /// Test F32 -> F16 -> F32 cast roundtrip with random values.
+    #[ignore = "expensive CUDA op proptest sweep; run with cargo test -p luminal_cuda_lite -- --ignored"]
+
    #[test]
    fn test_cast_f16_random(size in 1usize..200, seed in any::<u64>()) {
        use luminal::dtype::DType;
@@ -527,6 +567,9 @@ fn fuzz_test_cuda_genomes_impl(seed: u64) {
 proptest! {
    #![proptest_config(ProptestConfig::with_cases(3))]

+    // This walks random extraction genomes and is intentionally opt-in so the
+    // default CUDA unit suite keeps a tight feedback loop.
+    #[ignore = "expensive CUDA genome fuzzing; run with cargo test -p luminal_cuda_lite -- --ignored"]
    #[test]
    fn fuzz_test_cuda_genomes(seed in any::<u64>()) {
        fuzz_test_cuda_genomes_impl(seed);
@@ -594,6 +637,9 @@ fn run_embed_test(vocab_size: usize, embed_dim: usize, seq_len: usize, seed: u64
 proptest! {
    #![proptest_config(ProptestConfig::with_cases(5))]

+    #[ignore = "expensive CUDA op proptest sweep; run with cargo test -p luminal_cuda_lite -- --ignored"]
+
+
    #[test]
    fn test_embed_proptest(
        vocab_size in 10usize..200,
--- a/crates/luminal_cuda_lite/src/tests/qwen3_moe_rewrite.rs
+++ b/crates/luminal_cuda_lite/src/tests/qwen3_moe_rewrite.rs
@@ -0,0 +1,310 @@
+use half::bf16;
+use luminal::{dtype::DType, prelude::*, shape::Expression};
+
+use super::utilities::{assert_close, get_cuda_stream, random_f32_vec};
+use crate::{
+    host::moe::{GLUMoE, GLUMoEMode},
+    runtime::CudaRuntime,
+};
+
+const SEQ: usize = 2;
+const HIDDEN: usize = 16;
+const NUM_EXPERTS: usize = 8;
+const TOP_K: usize = 2;
+const MOE_INTERMEDIATE: usize = 6;
+const RMS_NORM_EPS: f32 = 1e-6;
+
+struct QwenMoeGraph {
+    graph: Graph,
+    x: GraphTensor,
+    router: GraphTensor,
+    gate_up_weights: GraphTensor,
+    down_weights: GraphTensor,
+    output: GraphTensor,
+}
+
+struct GemmaMoeGraph {
+    graph: Graph,
+    router_input: GraphTensor,
+    expert_input: GraphTensor,
+    router_scale: GraphTensor,
+    router_proj: GraphTensor,
+    per_expert_scale: GraphTensor,
+    gate_up_weights: GraphTensor,
+    down_weights: GraphTensor,
+    output: GraphTensor,
+}
+
+fn build_qwen_moe_graph() -> QwenMoeGraph {
+    let mut cx = Graph::default();
+    let x = cx.tensor(('s', HIDDEN));
+    let router = cx.tensor((NUM_EXPERTS, HIDDEN));
+    let gate_up_weights = cx
+        .tensor((NUM_EXPERTS, MOE_INTERMEDIATE * 2, HIDDEN))
+        .as_dtype(DType::Bf16);
+    let down_weights = cx
+        .tensor((NUM_EXPERTS, HIDDEN, MOE_INTERMEDIATE))
+        .as_dtype(DType::Bf16);
+
+    let n = x.dims().len();
+    let e_dim = *router.dims().first().unwrap();
+    let k_expr = Expression::from(TOP_K);
+
+    let routing_weights = x.matmul(router.t()).softmax(n - 1);
+    let top_k_indices = routing_weights.topk_indexes(TOP_K, n - 1);
+
+    let row_offsets = x
+        .graph()
+        .iota(Expression::from('z') / k_expr * e_dim, top_k_indices.dims());
+    let routing_flat_idx = row_offsets + top_k_indices;
+    let top_k_values = routing_weights.gather(routing_flat_idx);
+
+    let gate_up_gathered = gather_experts(x, top_k_indices, gate_up_weights).cast(DType::F32);
+    let x_exp = x.expand_dim(n - 1, TOP_K).unsqueeze(n);
+    let gate_up_out = x_exp.matmul(gate_up_gathered.transpose(2, 3)).squeeze(n);
+    let gate = gate_up_out.slice((.., .., ..MOE_INTERMEDIATE));
+    let up = gate_up_out.slice((.., .., MOE_INTERMEDIATE..));
+    let hidden = gate.silu() * up;
+
+    let down_gathered = gather_experts(x, top_k_indices, down_weights).cast(DType::F32);
+    let down_out = hidden
+        .unsqueeze(2)
+        .matmul(down_gathered.transpose(2, 3))
+        .squeeze(2);
+    let mut weights_exp = top_k_values.unsqueeze(top_k_values.dims().len());
+    weights_exp.shape.expand(down_out.dims());
+    let output = (down_out * weights_exp).sum(n - 1).output();
+
+    QwenMoeGraph {
+        graph: cx,
+        x,
+        router,
+        gate_up_weights,
+        down_weights,
+        output,
+    }
+}
+
+fn build_gemma_moe_graph() -> GemmaMoeGraph {
+    let mut cx = Graph::default();
+    let router_input = cx.tensor(('s', HIDDEN));
+    let expert_input = cx.tensor(('s', HIDDEN));
+    let router_scale = cx.tensor(HIDDEN);
+    let router_proj = cx.tensor((NUM_EXPERTS, HIDDEN));
+    let per_expert_scale = cx.tensor(NUM_EXPERTS);
+    let gate_up_weights = cx
+        .tensor((NUM_EXPERTS, MOE_INTERMEDIATE * 2, HIDDEN))
+        .as_dtype(DType::Bf16);
+    let down_weights = cx
+        .tensor((NUM_EXPERTS, HIDDEN, MOE_INTERMEDIATE))
+        .as_dtype(DType::Bf16);
+
+    let n = router_input.dims().len();
+    let e_dim = *router_proj.dims().first().unwrap();
+    let k_expr = Expression::from(TOP_K);
+
+    let router_hidden = router_input.std_norm(n - 1, RMS_NORM_EPS)
+        * router_scale.expand_lhs(&router_input.dims()[..n - 1])
+        * (HIDDEN as f32).sqrt().recip();
+    let routing_weights = router_hidden.matmul(router_proj.t()).softmax(n - 1);
+
+    let top_k_indices = routing_weights.topk_indexes(TOP_K, n - 1);
+    let row_offsets = router_input
+        .graph()
+        .iota(Expression::from('z') / k_expr * e_dim, top_k_indices.dims());
+    let routing_flat_idx = row_offsets + top_k_indices;
+    let top_k_values = routing_weights.gather(routing_flat_idx);
+    let top_k_norm = top_k_values.sum(n - 1).expand_dim(n - 1, TOP_K);
+    let top_k_weights = (top_k_values / top_k_norm) * per_expert_scale.gather(top_k_indices);
+
+    let gate_up_gathered =
+        gather_experts(expert_input, top_k_indices, gate_up_weights).cast(DType::F32);
+    let x_exp = expert_input.expand_dim(n - 1, TOP_K).unsqueeze(n);
+    let gate_up_out = x_exp.matmul(gate_up_gathered.transpose(2, 3)).squeeze(n);
+    let gate = gate_up_out.slice((.., .., ..MOE_INTERMEDIATE));
+    let up = gate_up_out.slice((.., .., MOE_INTERMEDIATE..));
+    let hidden = gemma_gelu(gate) * up;
+
+    let down_gathered = gather_experts(expert_input, top_k_indices, down_weights).cast(DType::F32);
+    let down_out = hidden
+        .unsqueeze(2)
+        .matmul(down_gathered.transpose(2, 3))
+        .squeeze(2);
+    let mut weights_exp = top_k_weights.unsqueeze(top_k_weights.dims().len());
+    weights_exp.shape.expand(down_out.dims());
+    let output = (down_out * weights_exp).sum(n - 1).output();
+
+    GemmaMoeGraph {
+        graph: cx,
+        router_input,
+        expert_input,
+        router_scale,
+        router_proj,
+        per_expert_scale,
+        gate_up_weights,
+        down_weights,
+        output,
+    }
+}
+
+fn gather_experts(
+    graph_source: GraphTensor,
+    top_k_indices: GraphTensor,
+    weights: GraphTensor,
+) -> GraphTensor {
+    let (_, d1, d2) = weights.dims3();
+    let io = d1 * d2;
+    let base = top_k_indices * io;
+    let within = graph_source.graph().iota(Expression::from('z'), (d1, d2));
+    let n_base = base.dims().len();
+    let exp_base = base.expand_dim(n_base, d1).expand_dim(n_base + 1, d2);
+    let mut exp_within = within;
+    for (axis, dim) in base.dims().iter().enumerate() {
+        exp_within = exp_within.expand_dim(axis, *dim);
+    }
+    let expert_flat_idx = exp_base + exp_within;
+    weights.gather(expert_flat_idx)
+}
+
+#[allow(clippy::excessive_precision)]
+fn gemma_gelu(x: GraphTensor) -> GraphTensor {
+    let scaled = 1.5957691216 * x * (1. + 0.044715 * x * x);
+    x * scaled.sigmoid()
+}
+
+fn glumoe_modes(rt: &CudaRuntime) -> Vec<GLUMoEMode> {
+    rt.host_ops()
+        .into_iter()
+        .filter_map(|op| {
+            op.as_any()
+                .downcast_ref::<GLUMoE>()
+                .map(|glumoe| glumoe.mode)
+        })
+        .collect()
+}
+
+fn run_qwen_moe(use_glumoe: bool) -> (Vec<f32>, Vec<GLUMoEMode>) {
+    let Some(stream) = get_cuda_stream() else {
+        return (vec![], vec![]);
+    };
+
+    let mut model = build_qwen_moe_graph();
+    model.graph.set_dim('s', SEQ);
+    if use_glumoe {
+        model.graph.build_search_space::<CudaRuntime>();
+    } else {
+        model
+            .graph
+            .build_search_space_exclude_ops::<CudaRuntime, GLUMoE>();
+    }
+
+    let x_data = random_f32_vec(SEQ * HIDDEN, 11, -0.15, 0.15);
+    let router_data = random_f32_vec(NUM_EXPERTS * HIDDEN, 12, -0.2, 0.2);
+    let gate_up_data = random_f32_vec(NUM_EXPERTS * MOE_INTERMEDIATE * 2 * HIDDEN, 13, -0.1, 0.1)
+        .into_iter()
+        .map(bf16::from_f32)
+        .collect::<Vec<_>>();
+    let down_data = random_f32_vec(NUM_EXPERTS * HIDDEN * MOE_INTERMEDIATE, 14, -0.1, 0.1)
+        .into_iter()
+        .map(bf16::from_f32)
+        .collect::<Vec<_>>();
+
+    let mut rt = CudaRuntime::initialize(stream);
+    rt.set_data(model.x, x_data);
+    rt.set_data(model.router, router_data);
+    rt.set_data(model.gate_up_weights, gate_up_data);
+    rt.set_data(model.down_weights, down_data);
+    rt = model.graph.search(rt, 10);
+    rt.execute(&model.graph.dyn_map);
+
+    (rt.get_f32(model.output.id), glumoe_modes(&rt))
+}
+
+fn run_gemma_moe(use_glumoe: bool) -> (Vec<f32>, Vec<GLUMoEMode>) {
+    let Some(stream) = get_cuda_stream() else {
+        return (vec![], vec![]);
+    };
+
+    let mut model = build_gemma_moe_graph();
+    model.graph.set_dim('s', SEQ);
+    if use_glumoe {
+        model.graph.build_search_space::<CudaRuntime>();
+    } else {
+        model
+            .graph
+            .build_search_space_exclude_ops::<CudaRuntime, GLUMoE>();
+    }
+
+    let router_input_data = random_f32_vec(SEQ * HIDDEN, 21, -0.15, 0.15);
+    let expert_input_data = random_f32_vec(SEQ * HIDDEN, 22, -0.15, 0.15);
+    let router_scale_data = random_f32_vec(HIDDEN, 23, 0.7, 1.3);
+    let router_proj_data = random_f32_vec(NUM_EXPERTS * HIDDEN, 24, -0.2, 0.2);
+    let per_expert_scale_data = random_f32_vec(NUM_EXPERTS, 25, 0.5, 1.5);
+    let gate_up_data = random_f32_vec(NUM_EXPERTS * MOE_INTERMEDIATE * 2 * HIDDEN, 26, -0.1, 0.1)
+        .into_iter()
+        .map(bf16::from_f32)
+        .collect::<Vec<_>>();
+    let down_data = random_f32_vec(NUM_EXPERTS * HIDDEN * MOE_INTERMEDIATE, 27, -0.1, 0.1)
+        .into_iter()
+        .map(bf16::from_f32)
+        .collect::<Vec<_>>();
+
+    let mut rt = CudaRuntime::initialize(stream);
+    rt.set_data(model.router_input, router_input_data);
+    rt.set_data(model.expert_input, expert_input_data);
+    rt.set_data(model.router_scale, router_scale_data);
+    rt.set_data(model.router_proj, router_proj_data);
+    rt.set_data(model.per_expert_scale, per_expert_scale_data);
+    rt.set_data(model.gate_up_weights, gate_up_data);
+    rt.set_data(model.down_weights, down_data);
+    rt = model.graph.search(rt, 10);
+    rt.execute(&model.graph.dyn_map);
+
+    (rt.get_f32(model.output.id), glumoe_modes(&rt))
+}
+
+#[test]
+fn test_glumoe_matches_qwen_swiglu_pattern() {
+    let (_result, modes) = run_qwen_moe(true);
+    if modes.is_empty() {
+        return;
+    }
+
+    assert_eq!(modes, vec![GLUMoEMode::SwiGLU]);
+}
+
+#[test]
+fn test_glumoe_matches_gemma_gelu_pattern() {
+    let (_result, modes) = run_gemma_moe(true);
+    if modes.is_empty() {
+        return;
+    }
+
+    assert_eq!(modes, vec![GLUMoEMode::GemmaGELU]);
+}
+
+#[test]
+fn test_glumoe_swiglu_matches_unfused_output() {
+    let (expected, baseline_modes) = run_qwen_moe(false);
+    if expected.is_empty() {
+        return;
+    }
+    assert!(baseline_modes.is_empty());
+
+    let (actual, fused_modes) = run_qwen_moe(true);
+    assert_eq!(fused_modes, vec![GLUMoEMode::SwiGLU]);
+    assert_close(&actual, &expected, 3e-2, 3e-2);
+}
+
+#[test]
+fn test_glumoe_gemma_gelu_matches_unfused_output() {
+    let (expected, baseline_modes) = run_gemma_moe(false);
+    if expected.is_empty() {
+        return;
+    }
+    assert!(baseline_modes.is_empty());
+
+    let (actual, fused_modes) = run_gemma_moe(true);
+    assert_eq!(fused_modes, vec![GLUMoEMode::GemmaGELU]);
+    assert_close(&actual, &expected, 3e-2, 3e-2);
+}
--- a/crates/luminal_cuda_lite/src/tests/transformer.rs
+++ b/crates/luminal_cuda_lite/src/tests/transformer.rs
@@ -300,7 +300,7 @@ fn test_mini_transformer_two_layers() {
    let input = cx.tensor((SEQ, HIDDEN));
    let layer1 = MiniTransformerLayer::init(&mut cx);
    let layer2 = MiniTransformerLayer::init(&mut cx);
-    let x = layer1.forward(input).graph_break();
+    let x = layer1.forward(input);
    let out = layer2.forward(x).output();

    cx.build_search_space::<CudaRuntime>();
@@ -508,3 +508,32 @@ fn test_swiglu_mlp_cuda() {

    assert_close(&result, &expected, 1e-3, 1e-3);
 }
+
+/// Body=1, trips=3 chain of scalar Muls plus a residual back to the
+/// chain's initial value. Auto-rolling sees this as a state-carrying loop
+/// with state at input position 0; the rolled HLIR must round-trip through
+/// egglog (rolled body Mul + LoopStart/LoopInput/LoopEnd markers) and
+/// `unroll_loops_in_llir` must reconstruct the flat 3-mul chain plus
+/// rewire the residual edge to reference the chain's initial input
+/// (outside the body) — not a per-iter clone.
+#[test]
+fn test_rolled_chained_scalar_muls() {
+    let Some(stream) = get_cuda_stream() else {
+        return;
+    };
+    let mut cx = Graph::default();
+    let x = cx.tensor((1, 4, 32));
+    let chained = ((x * 2.0_f32) * 3.0_f32) * 5.0_f32;
+    let out = (chained + x).output();
+
+    cx.build_search_space::<CudaRuntime>();
+    let mut rt = CudaRuntime::initialize(stream);
+    let x_data = random_f32_vec(4 * 32, 101, -0.5, 0.5);
+    rt.set_data(x, x_data.clone());
+    rt = cx.search(rt, 3);
+    rt.execute(&cx.dyn_map);
+
+    let result = rt.get_f32(out);
+    let expected: Vec<f32> = x_data.iter().map(|v| v * 2.0 * 3.0 * 5.0 + v).collect();
+    assert_close(&result, &expected, 1e-5, 1e-5);
+}
--- a/crates/luminal_cuda_lite/src/tests/utilities.rs
+++ b/crates/luminal_cuda_lite/src/tests/utilities.rs
@@ -136,14 +136,15 @@ pub fn gpu_compute_cap() -> Option<(i32, i32)> {

 /// Check if the current GPU supports the given dtype for tensor core / WMMA operations.
 pub fn gpu_supports_dtype(dtype: luminal::dtype::DType) -> bool {
-    let Some((major, _)) = gpu_compute_cap() else {
+    let Some((major, minor)) = gpu_compute_cap() else {
        return false;
    };
    match dtype {
        luminal::dtype::DType::Bf16 => major >= 8, // Ampere (sm_80+)
-        luminal::dtype::DType::F4E2M1
-        | luminal::dtype::DType::F8E4M3
-        | luminal::dtype::DType::F8UE8M0 => major >= 10, // Blackwell (sm_100+)
+        luminal::dtype::DType::F8E4M3 | luminal::dtype::DType::F8E5M2 => {
+            major > 8 || (major == 8 && minor >= 9)
+        } // Ada/Hopper (sm_89+)
+        luminal::dtype::DType::F4E2M1 | luminal::dtype::DType::F8UE8M0 => major >= 10, // Blackwell (sm_100+)
        _ => true,
    }
 }
@@ -468,7 +469,7 @@ pub fn fuzz_genomes<T: TestDType>(

            let mut list_cache = FxHashMap::default();
            let mut expr_cache = FxHashMap::default();
-            let llir_graph = egglog_to_llir(
+            let mut llir_graph = egglog_to_llir(
                egraph,
                genome.clone(),
                ops,
@@ -477,6 +478,12 @@ pub fn fuzz_genomes<T: TestDType>(
                &mut expr_cache,
                None,
            );
+            // Same finalization as `Graph::search` performs on the chosen
+            // best LLIR: collapse the rolled body's loop markers into a
+            // fully-unrolled LLIR. The runtime cannot execute LoopStart /
+            // LoopEnd / LoopInput / LoopOutput markers — they exist only as
+            // a search-time scaffold the auto-roll prepass introduces.
+            unroll_loops_in_llir(&mut llir_graph);

            let mut rt = CudaRuntime::initialize(stream.clone());
            rt.load_llir(&llir_graph);
--- a/crates/luminal_metal/src/dyn_backend.rs
+++ b/crates/luminal_metal/src/dyn_backend.rs
@@ -0,0 +1,48 @@
+//! [`DynBackend`] implementation for the Metal runtime.
+
+use luminal::dtype::DType;
+use luminal::dyn_backend::{bytes_to_native_data, compile_backend, BackendCompileArgs, DynBackend};
+use luminal::prelude::*;
+
+use crate::runtime::MetalRuntime;
+
+/// [`DynBackend`] wrapper for [`MetalRuntime`].
+pub struct MetalDynBackend {
+    pub runtime: MetalRuntime,
+}
+
+impl DynBackend for MetalDynBackend {
+    fn name(&self) -> &str {
+        "metal"
+    }
+
+    fn set_data_bytes(&mut self, node: NodeIndex, bytes: Vec<u8>, dtype: DType) {
+        self.runtime
+            .set_data(node, bytes_to_native_data(bytes, dtype));
+    }
+    fn set_data_f32(&mut self, node: NodeIndex, data: Vec<f32>) {
+        self.runtime.set_data(node, data);
+    }
+    fn get_output_f32(&self, node: NodeIndex) -> Vec<f32> {
+        self.runtime.get_f32(node)
+    }
+    fn execute(&mut self, dyn_map: &FxHashMap<char, usize>) {
+        self.runtime.execute(dyn_map);
+    }
+}
+
+pub fn metal_factory(
+    graph: &mut Graph,
+    args: BackendCompileArgs,
+) -> Result<Box<dyn DynBackend>, String> {
+    compile_backend::<MetalRuntime>(
+        graph,
+        args,
+        || Ok(MetalRuntime::initialize(())),
+        |rt, node, bytes, dtype| {
+            rt.set_data(node, bytes_to_native_data(bytes, dtype));
+        },
+        None,
+        |rt| Box::new(MetalDynBackend { runtime: rt }),
+    )
+}
--- a/crates/luminal_metal/src/kernel/ops.rs
+++ b/crates/luminal_metal/src/kernel/ops.rs
@@ -102,6 +102,21 @@ fn metal_copy_value(dtype: DType, buffer: &str, index: &str) -> String {
    }
 }

+fn metal_binary_op_values(
+    output_dtype: DType,
+    a_dtype: DType,
+    b_dtype: DType,
+    a_idx: &str,
+    b_idx: &str,
+) -> (String, String) {
+    let read: fn(DType, &str, &str) -> String = if output_dtype == DType::Int {
+        metal_copy_value
+    } else {
+        metal_numeric_read
+    };
+    (read(a_dtype, "a", a_idx), read(b_dtype, "b", b_idx))
+}
+
 fn call_sort_from_args(sort: &SortDef, args: &Args) -> EggTerm {
    let mut filtered_args = Args::new();
    for field in &sort.fields {
@@ -117,9 +132,11 @@ fn unary_dtype_rewrite(hlir_sort: &SortDef, metal_sort: &SortDef) -> Rule {
        args["__inputs"].clone(),
    );
    let dt = v("?__dt");
-    rule(union(hlir_match, metal_op.clone()))
+    rule(union(hlir_match.clone(), metal_op.clone()))
+        .subsume(hlir_match)
        .set(dtype(metal_op), dt.clone())
        .fact(eq(dt, dtype(args["inp"].clone())))
+        .ruleset("kernel_lower")
 }

 fn binary_dtype_rewrite(hlir_sort: &SortDef, metal_sort: &SortDef) -> Rule {
@@ -129,9 +146,11 @@ fn binary_dtype_rewrite(hlir_sort: &SortDef, metal_sort: &SortDef) -> Rule {
        args["__inputs"].clone(),
    );
    let dt = v("?__dt");
-    rule(union(hlir_match, metal_op.clone()))
+    rule(union(hlir_match.clone(), metal_op.clone()))
+        .subsume(hlir_match)
        .set(dtype(metal_op), dt.clone())
        .fact(eq(dt, dtype(args["inp_a"].clone())))
+        .ruleset("kernel_lower")
 }

 // ============================================================================
@@ -285,7 +304,7 @@ macro_rules! metal_unary_op {
                        device {input_ty} *inp [[buffer(0)]],
                        device {output_ty} *out [[buffer(1)]],
                        constant int *dyn [[buffer({dyn_buffer_index})]],
-                        device uint &n_elements [[buffer({n_elements_index})]],
+                        constant uint &n_elements [[buffer({n_elements_index})]],
                        uint idx [[thread_position_in_grid]]
                    ) {{
                        if (idx < n_elements) {{
@@ -369,8 +388,10 @@ impl EgglogOp for MetalAdd {

        vec![
            binary_dtype_rewrite(&Add::default().sort(), &self.sort()),
-            rule(union(hlir_match2, metal_op2.clone()))
-                .set(dtype(metal_op2), app(&SORTS.f32_dt, vec![])),
+            rule(union(hlir_match2.clone(), metal_op2.clone()))
+                .subsume(hlir_match2)
+                .set(dtype(metal_op2), app(&SORTS.f32_dt, vec![]))
+                .ruleset("kernel_lower"),
        ]
    }

@@ -423,8 +444,7 @@ impl MetalKernelOp for MetalAdd {
        let a_idx = lower_expression_for_metal(&a_index, "idx");
        let b_idx = lower_expression_for_metal(&b_index, "idx");
        let out_idx = lower_expression_for_metal(&out_index, "idx");
-        let a_val = metal_numeric_read(a_dtype, "a", &a_idx);
-        let b_val = metal_numeric_read(b_dtype, "b", &b_idx);
+        let (a_val, b_val) = metal_binary_op_values(output_dtype, a_dtype, b_dtype, &a_idx, &b_idx);
        let out_val = metal_numeric_write(output_dtype, &format!("({a_val}) + ({b_val})"));

        let source = format!(
@@ -437,7 +457,7 @@ impl MetalKernelOp for MetalAdd {
                device {b_ty} *b [[buffer(1)]],
                device {out_ty} *out [[buffer(2)]],
                constant int *dyn [[buffer({dyn_buffer_index})]],
-                device uint &n_elements [[buffer({n_elements_index})]],
+                constant uint &n_elements [[buffer({n_elements_index})]],
                uint idx [[thread_position_in_grid]]
            ) {{
                if (idx < n_elements) {{
@@ -556,8 +576,7 @@ impl MetalKernelOp for MetalMul {
        let a_idx = lower_expression_for_metal(&a_index, "idx");
        let b_idx = lower_expression_for_metal(&b_index, "idx");
        let out_idx = lower_expression_for_metal(&out_index, "idx");
-        let a_val = metal_numeric_read(a_dtype, "a", &a_idx);
-        let b_val = metal_numeric_read(b_dtype, "b", &b_idx);
+        let (a_val, b_val) = metal_binary_op_values(output_dtype, a_dtype, b_dtype, &a_idx, &b_idx);
        let out_val = metal_numeric_write(output_dtype, &format!("({a_val}) * ({b_val})"));

        let source = format!(
@@ -570,7 +589,7 @@ impl MetalKernelOp for MetalMul {
                device {b_ty} *b [[buffer(1)]],
                device {out_ty} *out [[buffer(2)]],
                constant int *dyn [[buffer({dyn_buffer_index})]],
-                device uint &n_elements [[buffer({n_elements_index})]],
+                constant uint &n_elements [[buffer({n_elements_index})]],
                uint idx [[thread_position_in_grid]]
            ) {{
                if (idx < n_elements) {{
@@ -699,9 +718,13 @@ impl MetalKernelOp for MetalMod {
        let a_idx = lower_expression_for_metal(&a_index, "idx");
        let b_idx = lower_expression_for_metal(&b_index, "idx");
        let out_idx = lower_expression_for_metal(&out_index, "idx");
-        let a_val = metal_numeric_read(a_dtype, "a", &a_idx);
-        let b_val = metal_numeric_read(b_dtype, "b", &b_idx);
-        let out_val = metal_numeric_write(output_dtype, &format!("fmod({a_val}, {b_val})"));
+        let (a_val, b_val) = metal_binary_op_values(output_dtype, a_dtype, b_dtype, &a_idx, &b_idx);
+        let out_expr = if output_dtype == DType::Int {
+            format!("({a_val}) % ({b_val})")
+        } else {
+            format!("fmod({a_val}, {b_val})")
+        };
+        let out_val = metal_numeric_write(output_dtype, &out_expr);

        let source = format!(
            r#"
@@ -713,7 +736,7 @@ impl MetalKernelOp for MetalMod {
                device {b_ty} *b [[buffer(1)]],
                device {out_ty} *out [[buffer(2)]],
                constant int *dyn [[buffer({dyn_buffer_index})]],
-                device uint &n_elements [[buffer({n_elements_index})]],
+                constant uint &n_elements [[buffer({n_elements_index})]],
                uint idx [[thread_position_in_grid]]
            ) {{
                if (idx < n_elements) {{
@@ -853,7 +876,7 @@ impl MetalKernelOp for MetalLessThan {
                device {b_ty} *b [[buffer(1)]],
                device {out_ty} *out [[buffer(2)]],
                constant int *dyn [[buffer({dyn_buffer_index})]],
-                device uint &n_elements [[buffer({n_elements_index})]],
+                constant uint &n_elements [[buffer({n_elements_index})]],
                uint idx [[thread_position_in_grid]]
            ) {{
                if (idx < n_elements) {{
@@ -1000,7 +1023,7 @@ impl MetalKernelOp for MetalSumReduce {
                const device {input_ty} *in [[buffer(0)]],
                device {output_ty} *out [[buffer(1)]],
                constant int *dyn [[buffer({dyn_buffer_index})]],
-                device uint &n_outputs [[buffer({n_outputs_index})]],
+                constant uint &n_outputs [[buffer({n_outputs_index})]],
                uint gid [[threadgroup_position_in_grid]],
                uint tid [[thread_index_in_threadgroup]],
                uint simd_lane [[thread_index_in_simdgroup]],
@@ -1181,7 +1204,7 @@ impl MetalKernelOp for MetalMaxReduce {
                const device {input_ty} *in [[buffer(0)]],
                device {output_ty} *out [[buffer(1)]],
                constant int *dyn [[buffer({dyn_buffer_index})]],
-                device uint &n_outputs [[buffer({n_outputs_index})]],
+                constant uint &n_outputs [[buffer({n_outputs_index})]],
                uint gid [[threadgroup_position_in_grid]],
                uint tid [[thread_index_in_threadgroup]],
                uint simd_lane [[thread_index_in_simdgroup]],
@@ -1719,8 +1742,10 @@ impl EgglogOp for MetalConstant {
    fn rewrites(&self) -> Vec<Rule> {
        let (args, const_match) = new_op_call(&Constant::default().sort(), &[]);
        let metal_op = call_sort_from_args(&self.sort(), &args);
-        vec![rule(union(const_match, metal_op.clone()))
-            .set(dtype(metal_op), app(&SORTS.f32_dt, vec![]))]
+        vec![rule(union(const_match.clone(), metal_op.clone()))
+            .subsume(const_match)
+            .set(dtype(metal_op), app(&SORTS.f32_dt, vec![]))
+            .ruleset("kernel_lower")]
    }

    fn cleanup(&self) -> bool {
@@ -1827,8 +1852,10 @@ impl EgglogOp for MetalIota {
    fn rewrites(&self) -> Vec<Rule> {
        let (args, iota_match) = new_op_call(&Iota::default().sort(), &[]);
        let metal_op = call_sort_from_args(&self.sort(), &args);
-        vec![rule(union(iota_match, metal_op.clone()))
-            .set(dtype(metal_op), app(&SORTS.int_dt, vec![]))]
+        vec![rule(union(iota_match.clone(), metal_op.clone()))
+            .subsume(iota_match)
+            .set(dtype(metal_op), app(&SORTS.int_dt, vec![]))
+            .ruleset("kernel_lower")]
    }

    fn cleanup(&self) -> bool {
@@ -1872,7 +1899,7 @@ impl MetalKernelOp for MetalIota {
            kernel void mkernel(
                device int *out [[buffer(0)]],
                constant int *dyn [[buffer({dyn_buffer_index})]],
-                device uint &n_elements [[buffer({n_elements_index})]],
+                constant uint &n_elements [[buffer({n_elements_index})]],
                uint idx [[thread_position_in_grid]]
            ) {{
                if (idx < n_elements) {{
@@ -1924,6 +1951,7 @@ impl MetalKernelOp for MetalIota {
 pub struct MetalGather {
    out_shape: Vec<Expression>,
    index_stride: Vec<Expression>,
+    data_shape: Vec<Expression>,
    data_stride: Vec<Expression>,
    out_stride: Vec<Expression>,
 }
@@ -1938,6 +1966,7 @@ impl EgglogOp for MetalGather {
                ("indexes", IR),
                ("index_strides", ELIST),
                ("data", IR),
+                ("data_shape", ELIST),
                ("data_strides", ELIST),
                ("out_strides", ELIST),
            ],
@@ -1959,6 +1988,7 @@ impl EgglogOp for MetalGather {
                gather_args["index_strides"].clone(),
            ),
            ("data".to_string(), gather_args["data"].clone()),
+            ("data_shape".to_string(), gather_args["data_shape"].clone()),
            (
                "data_strides".to_string(),
                gather_args["data_strides"].clone(),
@@ -1966,9 +1996,11 @@ impl EgglogOp for MetalGather {
            ("out_strides".to_string(), out_strides),
        ];
        let metal_op = self.sort().call(metal_args);
-        vec![rule(union(gather_match, metal_op.clone()))
+        vec![rule(union(gather_match.clone(), metal_op.clone()))
+            .subsume(gather_match)
            .set(dtype(metal_op), dt.clone())
-            .fact(eq(dt, dtype(gather_args["data"].clone())))]
+            .fact(eq(dt, dtype(gather_args["data"].clone())))
+            .ruleset("kernel_lower")]
    }

    fn cleanup(&self) -> bool {
@@ -1989,9 +2021,10 @@ impl EgglogOp for MetalGather {
                out_shape: extract_expr_list(egraph, children[0], list_cache, expr_cache).unwrap(),
                index_stride: extract_expr_list(egraph, children[2], list_cache, expr_cache)
                    .unwrap(),
-                data_stride: extract_expr_list(egraph, children[4], list_cache, expr_cache)
+                data_shape: extract_expr_list(egraph, children[4], list_cache, expr_cache).unwrap(),
+                data_stride: extract_expr_list(egraph, children[5], list_cache, expr_cache)
                    .unwrap(),
-                out_stride: extract_expr_list(egraph, children[5], list_cache, expr_cache).unwrap(),
+                out_stride: extract_expr_list(egraph, children[6], list_cache, expr_cache).unwrap(),
            })),
            vec![children[1], children[3]],
        )
@@ -2015,7 +2048,7 @@ impl MetalKernelOp for MetalGather {
            "idx",
        );
        let data_idx = lower_expression_for_metal(
-            &flatten_strides(&self.out_shape, &self.data_stride),
+            &flatten_strides(&self.data_shape, &self.data_stride),
            "gathered_index",
        );
        let gathered_val = metal_copy_value(data_dtype, "data", &data_idx);
@@ -2030,7 +2063,7 @@ impl MetalKernelOp for MetalGather {
                const device {data_ty} *data [[buffer(1)]],
                device {out_ty} *out [[buffer(2)]],
                constant int *dyn [[buffer({dyn_buffer_index})]],
-                device uint &n_elements [[buffer({n_elements_index})]],
+                constant uint &n_elements [[buffer({n_elements_index})]],
                uint idx [[thread_position_in_grid]]
            ) {{
                if (idx < n_elements) {{
@@ -2056,6 +2089,10 @@ impl MetalKernelOp for MetalGather {
            .max(Expression::from(1))
    }

+    fn infer_output_dtype(&self, input_dtypes: &[DType]) -> DType {
+        input_dtypes.get(1).copied().unwrap_or(DType::F32)
+    }
+
    fn encode(
        &self,
        encoder: &ComputeCommandEncoderRef,
@@ -2177,9 +2214,11 @@ impl EgglogOp for MetalScatter {
            ("out_strides".to_string(), out_strides),
        ];
        let metal_op = self.sort().call(metal_args);
-        vec![rule(union(scatter_match, metal_op.clone()))
+        vec![rule(union(scatter_match.clone(), metal_op.clone()))
+            .subsume(scatter_match)
            .set(dtype(metal_op), dt.clone())
-            .fact(eq(dt, dtype(scatter_args["src"].clone())))]
+            .fact(eq(dt, dtype(scatter_args["src"].clone())))
+            .ruleset("kernel_lower")]
    }

    fn cleanup(&self) -> bool {
@@ -2243,7 +2282,7 @@ impl MetalKernelOp for MetalScatter {
            kernel void copy_kernel(
                device {out_ty} *out [[buffer(0)]],
                const device {dest_ty} *dest [[buffer(1)]],
-                device uint &n_elements [[buffer(2)]],
+                constant uint &n_elements [[buffer(2)]],
                constant int *dyn [[buffer({dyn_buffer_index})]],
                uint idx [[thread_position_in_grid]]
            ) {{
@@ -2277,7 +2316,7 @@ impl MetalKernelOp for MetalScatter {
                device {out_ty} *out [[buffer(0)]],
                const device int *indexes [[buffer(1)]],
                const device {src_ty} *src [[buffer(2)]],
-                device uint &n_elements [[buffer(3)]],
+                constant uint &n_elements [[buffer(3)]],
                constant int *dyn [[buffer({dyn_buffer_index})]],
                uint idx [[thread_position_in_grid]]
            ) {{
@@ -2408,7 +2447,10 @@ impl EgglogOp for MetalCast {
    fn rewrites(&self) -> Vec<Rule> {
        let (args, cast_match) = new_op_call(&Cast::default().sort(), &["inp"]);
        let metal_op = call_sort_from_args(&self.sort(), &args);
-        vec![rule(union(cast_match, metal_op.clone())).set(dtype(metal_op), args["dtype"].clone())]
+        vec![rule(union(cast_match.clone(), metal_op.clone()))
+            .subsume(cast_match)
+            .set(dtype(metal_op), args["dtype"].clone())
+            .ruleset("kernel_lower")]
    }

    fn cleanup(&self) -> bool {
@@ -2467,7 +2509,7 @@ impl MetalKernelOp for MetalCast {
                device {input_ty} *inp [[buffer(0)]],
                device {output_ty} *out [[buffer(1)]],
                constant int *dyn [[buffer({dyn_buffer_index})]],
-                device uint &n_elements [[buffer({n_elements_index})]],
+                constant uint &n_elements [[buffer({n_elements_index})]],
                uint idx [[thread_position_in_grid]]
            ) {{
                if (idx < n_elements) {{
--- a/crates/luminal_metal/src/lib.rs
+++ b/crates/luminal_metal/src/lib.rs
@@ -1,3 +1,4 @@
+pub mod dyn_backend;
 pub mod kernel;
 pub mod runtime;

--- a/crates/luminal_metal/src/runtime.rs
+++ b/crates/luminal_metal/src/runtime.rs
@@ -234,6 +234,10 @@ impl Runtime for MetalRuntime {
        }
    }

+    fn aggregate_profile_metrics(metrics: &[Self::ProfileMetric]) -> Self::ProfileMetric {
+        metrics.iter().copied().sum()
+    }
+
    #[tracing::instrument(skip_all)]
    fn load_llir(&mut self, llir_graph: &LLIRGraph) {
        self.pipelines.clear();
@@ -278,6 +282,8 @@ impl Runtime for MetalRuntime {
                let pipeline = kernel_op.compile(&self.device, &input_dtypes, output_dtype);
                self.node_dtypes.insert(node, output_dtype);
                self.pipelines.insert(node, pipeline);
+            } else {
+                panic!("Metal runtime cannot execute unlowered LLIR node {node:?}");
            }
        }
    }
@@ -288,6 +294,7 @@ impl Runtime for MetalRuntime {
        llir_graph: &LLIRGraph,
        dyn_map: &FxHashMap<char, usize>,
        trials: usize,
+        _timeout: Option<std::time::Duration>,
    ) -> (Self::ProfileMetric, String) {
        self.load_llir(llir_graph);
        self.allocate_intermediate_buffers(dyn_map);
--- a/crates/luminal_metal/src/tests.rs
+++ b/crates/luminal_metal/src/tests.rs
@@ -250,6 +250,23 @@ fn dynamic_dim_sum_reduce_runs() {
    assert_close(&out, &[9.0, 12.0], 0.001);
 }

+#[test]
+fn metal_int_arithmetic_preserves_large_values() {
+    let mut cx = Graph::default();
+    let token = cx.tensor(1).as_dtype(DType::Int);
+    let large_index = (token * 1024) + 123;
+    let mod_output = (large_index % 65_537).output();
+
+    cx.build_search_space::<MetalRuntime>();
+    let mut rt = MetalRuntime::initialize(());
+    rt.set_data(token, &[16_385i32]);
+    rt = cx.search(rt, 1);
+    rt.allocate_intermediate_buffers(&cx.dyn_map);
+    rt.execute(&cx.dyn_map);
+
+    assert_eq!(rt.get_f32(mod_output), vec![891.0]);
+}
+
 proptest! {
    #![proptest_config(ProptestConfig::with_cases(5))]

@@ -971,6 +988,28 @@ fn test_scatter_basic() {
    assert_close(&out, &[0.0, 10.0, 0.0, 20.0, 30.0], 0.001);
 }

+#[test]
+fn test_gather_noncontiguous_data_uses_data_shape() {
+    let mut cx = Graph::default();
+    let input = cx.tensor((4, 3));
+    let data = input.transpose(0, 1);
+    let indexes = cx.tensor((2, 2)).as_dtype(DType::Int);
+    let out = data.gather(indexes).output();
+
+    cx.build_search_space::<MetalRuntime>();
+    let mut rt = MetalRuntime::initialize(());
+    rt.set_data(
+        input,
+        &[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0],
+    );
+    rt.set_data(indexes, &[0.0, 3.0, 4.0, 7.0]);
+    rt = cx.search(rt, 1);
+    rt.allocate_intermediate_buffers(&cx.dyn_map);
+    rt.execute(&cx.dyn_map);
+
+    assert_close(&rt.get_f32(out), &[0.0, 9.0, 1.0, 10.0], 0.001);
+}
+
 #[test]
 fn test_scatter_into_nonzero_dest() {
    let mut cx = Graph::default();
@@ -1012,3 +1051,21 @@ fn test_scatter_all_positions() {
    let out = rt.get_f32(result);
    assert_close(&out, &[10.0, 20.0, 30.0, 40.0], 0.001);
 }
+
+#[test]
+fn test_gather_preserves_data_dtype() {
+    let mut cx = Graph::default();
+    let data = cx.tensor(2);
+    let indexes = cx.tensor(1).as_dtype(DType::Int);
+    let out = data.gather(indexes).output();
+
+    cx.build_search_space::<MetalRuntime>();
+    let mut rt = MetalRuntime::initialize(());
+    rt.set_data(data, &[1.25, 2.5]);
+    rt.set_data(indexes, &[1.0]);
+    rt = cx.search(rt, 1);
+    rt.allocate_intermediate_buffers(&cx.dyn_map);
+    rt.execute(&cx.dyn_map);
+
+    assert_close(&rt.get_f32(out), &[2.5], 0.001);
+}
--- a/crates/luminal_nn/src/moe.rs
+++ b/crates/luminal_nn/src/moe.rs
@@ -61,7 +61,8 @@ impl MoE {
        let expert_out = expanded_act.matmul(gathered).squeeze(n); // [batch.., k, out]

        // 6. Weighted sum over experts: [batch.., k, out] * [batch.., k, 1] → sum(k) → [batch.., out]
-        let weights_exp = top_k_values.unsqueeze(top_k_values.dims().len()); // [batch.., k, 1]
+        let mut weights_exp = top_k_values.unsqueeze(top_k_values.dims().len()); // [batch.., k, 1]
+        weights_exp.shape.expand(expert_out.dims());
        (expert_out * weights_exp).sum(n - 1)
    }
 }
@@ -478,7 +479,8 @@ mod tests {
        let down_out = hidden_exp.matmul(down_gathered.transpose(2, 3)).squeeze(2); // [s, k, H]

        // 7. Weighted sum over k experts → [s, H]
-        let weights_exp = top_k_values.unsqueeze(top_k_values.dims().len()); // [s, k, 1]
+        let mut weights_exp = top_k_values.unsqueeze(top_k_values.dims().len()); // [s, k, 1]
+        weights_exp.shape.expand(down_out.dims());
        let _output = (down_out * weights_exp).sum(n - 1).output();

        // Dump the HLIR to egglog
--- a/crates/luminal_python/.gitignore
+++ b/crates/luminal_python/.gitignore
@@ -1,4 +1,5 @@
 *.onnx
+tests/llama38b_ref_logits.pt
 __pycache__/
 *.pyc
 uv.lock
--- a/crates/luminal_python/CLAUDE.md
+++ b/crates/luminal_python/CLAUDE.md
@@ -1,8 +1,4 @@
-## Python Environment
-
- Always use `uv run` to execute Python tools (pytest, pre-commit, python) — never bare `pytest` or `python`
- Use `uv add` / `uv add --dev` / `uv remove` for dependencies — never hand-edit pyproject.toml deps
- After modifying Rust source files, rebuild before running Python tests: `maturin develop --release`
+A couple of short things to keep in mind

 ## Lessons Learned

@@ -28,7 +24,7 @@ consult before writing new egglog rules, CUDA kernels, or optimizer passes.
 ## Testing Best Practices

 ### Overview
-The luminal_python crate provides a bridge between PyTorch models and the luminal library via ONNX. Tests should verify this integration end-to-end by testing the actual user workflow: PyTorch model → torch.compile → luminal backend.
+The luminal_python crate provides a bridge between PyTorch models and the luminal library via the PT2 Export pipeline. Tests should verify this integration end-to-end by testing the actual user workflow: PyTorch model → torch.compile → luminal backend.

 ### Test Pattern (CORRECT)

@@ -71,11 +67,11 @@ class AddTestModel(torch.nn.Module):

 ### What NOT to Do

-**❌ DO NOT create ONNX files directly in tests:**
+**❌ DO NOT create pt2 files directly in tests:**
 ```python
 # WRONG - bypasses the PyTorch integration
-model_path = create_onnx_model(...)
-graph_result = luminal.process_onnx(model_path, backend='native')
+model_path = create_pt2_model(...)
+graph_result = luminal.process_pt(model_path, backend='native')
 ```

 **✓ DO create PyTorch models and use torch.compile:**
@@ -87,16 +83,16 @@ model_compiled = torch.compile(model, backend=luminal_backend)

 ### Rationale

- **End-to-end testing**: Tests verify the complete PyTorch → ONNX → luminal pipeline
+- **End-to-end testing**: Tests verify the complete PyTorch → Pt2 → luminal pipeline
 - **User-facing API**: Tests use the same API that users will use (torch.compile)
 - **Correctness**: Comparing compiled vs original PyTorch output ensures correctness
 - **Maintainability**: Consistent pattern across all tests makes the codebase easier to understand
- **Simplicity**: No manual ONNX file creation, no tempfile cleanup, no numpy comparisons
+- **Simplicity**: No manual Pt2 file creation, no tempfile cleanup, no numpy comparisons

 ### Special Cases

 **Testing constants:**
-Use inline tensor literals in the forward method - PyTorch exports these as ONNX Constant nodes:
+Use inline tensor literals in the forward method - these are exported as constant tensors:
 ```python
 def forward(self, x: torch.Tensor) -> torch.Tensor:
    constant = torch.tensor([1.0, 2.0, 3.0])
@@ -104,14 +100,14 @@ def forward(self, x: torch.Tensor) -> torch.Tensor:
 ```

 **Testing type casts:**
-Use `.to(dtype)` method - PyTorch exports these as ONNX Cast nodes:
+Use `.to(dtype)` method - these are exported as type cast operations:
 ```python
 def forward(self, x: torch.Tensor) -> torch.Tensor:
    return x.to(torch.float32)
 ```

 **Testing complex operations:**
-Chain operations naturally in PyTorch - ONNX export handles the conversion:
+Chain operations naturally in PyTorch - the export pipeline handles the conversion:
 ```python
 def forward(self, x: torch.Tensor) -> torch.Tensor:
    transposed = x.transpose(0, 1)
--- a/crates/luminal_python/LessonsLearned.md
+++ b/crates/luminal_python/LessonsLearned.md
@@ -756,3 +756,112 @@ identical across all attempts (dtype issue) vs varying (actual numerical issue).
 3. **Why hard**: Per-operation error was ~1e-7 but compounded over 16 layers × ~25 extra materializations. The egglog `Exp` rewrite depends on exact constant format matching.
 4. **Fix**: Added `KernelExp` (uses `expf()`), `KernelSigmoid` (uses `1/(1+expf(-x))`), and Kahan summation in SumReduce. Each uses both `kernel_rewrite` and a direct egglog pattern match with range checks (e.g., `(> ?val 1.44) (< ?val 1.45)`) to bypass constant format dependency.
 5. **Principle**: When decomposed CUDA kernel chains cause precision loss, add fused kernels via `kernel_rewrite`. For robustness, add BOTH the logical-op rewrite path AND a direct HLIR pattern match — the constant format in egglog can be fragile.
+
+## 2026-04-26 — Loop unroll-union rules silently disabled in full egglog stage
+
+1. **Symptom**: Python `test_llama_transformer_block` (CUDA backend) produced output ~1e-2 off from PyTorch (atol=1e-4) on the `loop_rolling` branch. All component tests (RMSNorm, attention, SwiGLU, RoPE) passed. The diff pattern was suspicious: row 0 of the (1,4,32) output matched exactly, rows 1–3 differed slightly. Disabling rolling fixed it.
+2. **Root cause**: The auto-roll prepass folds three sequential scalar muls in PyTorch's `pow(2)` decomposition (`exp2(log2(x) * 0.693 * 2.0 * 1.442)` — the last constant is `log2(e)`). The kernel `direct-exp-fusion` egglog rule rewrites `Mul(?x, log2_e_const) → Exp2(...)` into `KernelExp(?x)` (single `expf()` instead of separate exp2f + multiply by truncated log2(e)). Without rolling, this fusion fires and the float chain stays stable; with rolling the fusion can't see through the `LoopStart`/`LoopEnd` markers, so the chain stays as `KernelMul → KernelExp2`, and the truncated `log2(e)` constant accumulates ~1e-7 error per layer that compounds into ~1e-2 over the full block.
+
+   The unroll-union rules I'd added (`Mul`/`Add`/etc. binary-op rules that union a rolled body with its fully-unrolled equivalent) were registered only in `EgglogOp::early_rewrites()`, not `rewrites()`. The egglog driver feeds `early_rewrites` only into the early-stage program and `rewrites` only into the full-stage program. So the unrolled chain materialised in the early egraph, the early→full extract picked the (cheaper) rolled form, the unrolled chain was lost, and `direct-exp-fusion` (which runs in the full stage) had nothing to match against.
+3. **Why hard**: The post-unroll LLIR for the rolled vs un-rolled paths *looked* nearly identical when scanned visually — both had the Log2 → Mul × 3 → Exp2 chain. The diff was 2 extra Muls vs no-rolling, and the actual semantic gap was visible only in op-name counts: WITH-rolling had 3 `KernelExp2` and 0 `KernelExp`, WITHOUT-rolling had 1 `KernelExp2` and 2 `KernelExp`. Tracking the missing fusion to the early/full ruleset split required reading the egglog driver carefully and noticing that `OpTextParts` builds `early_rewrites` and `full_rewrites` from disjoint method calls.
+4. **Fix**: Register `binary_op_unroll_rules` in BOTH `early_rewrites()` (so fusion patterns like GLUMoE can match before the early-stage extract, which is what fixed `test_glumoe_gemma_gelu_matches_unfused_output` earlier in the session) AND `rewrites()` (so kernel-level rewrites like `direct-exp-fusion` can match in the full stage on the unrolled chain). One block per binary op (`Add`, `Mul`, `Mod`, `LessThan`).
+5. **Principle**: When egglog has multiple stages (early/full) with disjoint rule sets, any rewrite that materialises new HLIR/IR enodes (rather than just lowering to LLIR) needs to fire in BOTH stages if downstream rewrites in BOTH stages might want to see the new structure. Putting "preparatory" rewrites only in `early_rewrites` means their effect is lost across the early→full handoff. The narrow rule of thumb: if your rule's outputs are intended to enable matches by other rules, audit which stages those other rules run in and register accordingly.
+
+## 2026-04-26 — `unroll_loops_in_llir` panicked on iteration-invariant body producers
+
+1. **Symptom**: Modal CI/CD job for the gemma example panicked at `src/graph.rs:1867` with `no entry found for key`. The line is `clone_map[i - 1][&body_producer]` inside `unroll_loops_in_llir`'s `resolve_src` closure — `body_producer` (the LoopEnd's incoming source for that slot) wasn't a key in the per-iteration clone map. cuda_lite/python tests didn't repro: only triggered by the specific genome and graph shapes that gemma's longer search settles on.
+2. **Root cause**: `body_nodes` is computed by walking *forward* from each LoopStart/LoopInput/LoopInputStatic outgoing edge, stopping at markers and `Output` ops. Some egglog-extracted LLIRs land a `body_producer` that isn't reachable via that forward walk — i.e., its only ancestors are non-marker (a constant, an external input, or an op whose chain was congruence-merged off the marker chain by rules like `LoopInputStatic inline`). Semantically this is a degenerate "iteration-invariant body": every iter computes the same value, so the loop's state never changes. The per-iter clone path needed a fallback for that case.
+3. **Why hard**: cuda_lite and python tests don't generate genomes that produce this shape, so local runs always pass. The forward-walk-only definition of `body_nodes` is *almost* always right — only specific extraction shapes from longer searches expose the gap. Test-driven debugging has limited reach when the failure mode depends on a search trajectory the local fuzzers don't explore.
+4. **Fix**: in `unroll_loops_in_llir::resolve_src`, when the LoopStart-resolved `body_producer` isn't in `body_nodes`, return `body_producer` itself for iter > 0 instead of indexing `clone_map[i - 1]`. The body op didn't depend on the loop variable, so every iter > 0 carries the same value forward — using `body_producer` directly is semantically correct. Mirrored the same `unwrap_or(body_producer)` fallback in the post-loop substitution map (`marker_post_sub` for LoopEnd / LoopOutputSelect). Added a backward-walk-from-end-markers backfill in `collapse_loops_to_first_iter` so its body-node iteration also covers these nodes (it doesn't have a clone_map, but does need to rewire body ops' incoming edges before deleting markers).
+5. **Principle**: When a graph-walk-derived set is used as a hashmap key requirement, every code path that *could* produce a key outside that set needs a graceful fallback — not just a defensive `expect`. For loop unrolling specifically, the rule is: `body_nodes` is the set of "ops that participate in per-iter computation"; ops on the LoopEnd's path that *don't* participate (iteration-invariant) are still legitimate, and need a "no clone, share across iters" path through `resolve_src` and `marker_post_sub`. Forward-walk-only `body_nodes` is correct only when extraction never produces iteration-invariant body producers — and in an egglog-driven search, that's not a guarantee you can make.
+
+## 2026-04-26 — Iteration-invariant state slots are a first-class concept, not a defensive fallback
+
+1. **Symptom + fix recap**: gemma Modal CI panicked at `clone_map[i-1][&body_producer]` because some state slots' `body_producer` (LoopEnd's incoming) isn't in `body_nodes` (forward walk from input markers). The first commit pair (16de9638 / 93fb02c4) caught this with `.unwrap_or(body_producer)` — which works but reads as "defensive, unclear *why* this case exists."
+2. **What's actually happening**: extracted LLIR from gemma legitimately puts a `KernelConstant` at LoopEnd's incoming for some state slots. e.g. for one slot of gemma's body=104 trips=5 rolling: `initial = KernelConstant 1.442695` (log2 e), `body_producer = same node`. For another: `body_producer = KernelConstant 9.21034` (ln 10000, RoPE's frequency base after `Log2 * ln(2)` simplification). egglog's kernel-level rewrites legitimately union body-slot eclasses with these constants when the body chain provably reduces to them. The state really is iteration-invariant — every iter sees the same value.
+3. **Why "defensive fallback" framing is misleading**: it implies the LLIR is broken. It isn't. The forward-walk-only `body_nodes` definition just doesn't cover this case, because the case requires no per-iter cloning at all. A *node not reachable from any loop input marker has no input-marker ancestor*, so by construction its value doesn't depend on the loop's per-iter state.
+4. **Cleaner formulation**: name the concept. Compute an `iteration_invariant_slots: HashSet<LoopStart>` set at the same time `start_meta` is built, with the rule `body_producer ∉ body_nodes ⇒ iteration_invariant`. `resolve_src` and `marker_post_sub` then have explicit branches: if the slot is invariant, use `body_producer` directly; otherwise the standard per-iter clone lookup. The behavior is the same as the `unwrap_or` band-aid, but the code now documents that this is a real, sound case the unroll handles correctly — not a panic suppressor.
+5. **Principle**: when an `unwrap_or` papers over a case that turns out to be semantically valid, the right cleanup isn't to keep the `unwrap_or` and add a comment — it's to name the case. Hoist the predicate into a set or enum and branch on it explicitly. The compiler then enforces that every consumer of the per-iter cloning machinery has an opinion on iteration-invariant slots, instead of silently relying on a `Map::get` returning `None` at the right moment.
+
+---
+
+## 2026-04-30 — `translate_grouped_mm` casted the full expert weight to F32, OOMing search on Qwen3-MoE
+
+### What the symptom was
+
+`benchmarks/ttft/run.py --config qwen3-moe` crashed every search-profile attempt with:
+```
+crates/luminal_cuda_lite/src/runtime.rs:711: called `Result::unwrap()` on an `Err` value:
+  DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory")
+```
+The DB shows this had been failing every run for ~2 weeks. The rust `examples/qwen3_moe` ran fine end-to-end. python_baseline / python_torch_compile / qwen3-4b were all fine — only python_luminal × qwen3-moe failed.
+
+### What the actual root cause was
+
+`translate_grouped_mm` in `crates/luminal_python/rust/src/translator/tensor.rs` was lowering HF's `_grouped_mm(input, weight, offs)` op to a *full-broadcast* batched matmul plus a group-mask:
+
+```rust
+let weight_f = weight.cast(DType::F32);                  // [G=128, K, N] cast → 1.5 GB / layer
+let input_batched = input_f.expand_dim(0, g);
+let all_out = input_batched.matmul(weight_f);            // [G, S, N]
+let mask = ... (g_arange == expert_id).cast(F32);
+let out = (all_out * mask.expand_dim(2, n)).sum(0);      // mask + sum over G
+```
+
+The full `[G, K, N]` F32 cast intermediate is 1.5 GB / layer for gate-up and 0.6 GB / layer for down on Qwen3-30B-A3B. With 60 GB of persistent bf16 weights already on a 97 GB GPU, the search-time profiler ran out of memory allocating those casts.
+
+By contrast, `examples/qwen3_moe`'s `gather_experts` gathers only the top-K active experts per token first, then casts that small `[s, k, d1, d2]` slice (~100 MB / layer). The GLUMoE host op (`crates/luminal_cuda_lite/src/host/moe/glumoe_rewrite.egg`) is also wired to this gather pattern.
+
+### Why it was hard to find
+
+1. **Code path was reasonable in isolation**: at small scale (`test_grouped_mm_fallback`: g=2, K=8, N=16) the broadcast version was fine — the F32 cast was only 1 KB, and search profiling never noticed.
+2. **The error reported "out of memory" but the rest of the system looked healthy**: 60 GB weights + 37 GB headroom looks like plenty until you realise 48 layers × 2.1 GB cast intermediates per layer doesn't fit, even after loop rolling.
+3. **The DB's `code 1` failures looked the same as a Python exception** — the actual panic site (`runtime.rs:711:64` `stream.alloc_zeros(needed_bytes).unwrap()`) had to be recovered from a tmux scrollback because the orchestrator's stdout was already torn down by the time we looked.
+
+### The fix
+
+Rewrote `translate_grouped_mm` to gather first, matmul second:
+
+```rust
+// expert_id[m] = first g s.t. m < offs[g], clamped to [0, G-1]
+let expert_id = ge_boundary.sum(0).minimum_f32(g_max_f).cast(DType::Int);
+
+// flat_idx = expert_id * (K*N) + iota('z', (K, N))   — same shape as
+// rust qwen3_moe's `gather_experts`
+let flat_idx = (expert_id * (k * n))
+    .expand_dim(1, k).expand_dim(2, n)
+    + self.graph.iota(Expression::from('z'), (k, n)).expand_dim(0, s);
+
+let weight_gathered = weight.gather(flat_idx);            // [S, K, N], bf16
+let result = input.cast(F32).unsqueeze(1)
+    .matmul(weight_gathered.cast(F32))                    // [S, 1, N]
+    .squeeze(1);
+```
+
+Two important details:
+
+1. **Clamp `expert_id` to `[0, G-1]`**: at search time, dummy data fills `offs` with all-1s (`make_ones_bytes` in `compile_backend`). For S>1 that pushes `expert_id` to G (boundary count = G), which is one past the last valid expert and OOBs the gather. HF's own grouped-MM forward also clamps for the same reason (invalid expert IDs from EP).
+2. **Don't cast the full weight**: the cast moved from before the batched-matmul (over `[G, K, N]`) to after the gather (over `[S, K, N]`). 16× shrink at prefill (S=top_k=8 vs G=128).
+
+### Result
+
+`search-iters=1` end-to-end works on Qwen3-30B-A3B: `BENCH_RESULT … "ttft_ms": 9350.5, "tpot_ms": 1166.7`. The OOM is gone.
+
+`search-iters>=5` still crashes — but with a *different*, downstream `CUDA_ERROR_ILLEGAL_ADDRESS` during execution after search completes. That looks like the same family as the 2026-03-07 / 2026-03-09 egglog-extractor non-determinism bugs (some mutation during search picks a kernel/rewrite combo that's broken at this scale). It's a separate investigation — the gather-based lowering is correct in isolation (`test_grouped_mm_fallback` passes; a synthetic `g=128, S=8, K=2048, N=1536` bf16 test passes with max-diff ~2.4e-4).
+
+### General principle
+
+**When lowering an op that takes a per-row index over a large parameter, gather first and cast second — never cast the full parameter to F32 just because your matmul kernel is F32-only.** A "broadcast over G + mask" pattern is mathematically equivalent to "gather per-row" but materialises a G× larger intermediate — fine for tests, ruinous on real MoE checkpoints. When in doubt, mirror the rust example's pattern: the egglog fusion rules (GLUMoE here) are written to recognise the gather form, not the broadcast-and-mask form.
+
+Also: search-time dummy-1 inputs are not the same shape as runtime inputs. Anything you compute from a runtime tensor (cumsum offsets, routing indices, mask boundaries) needs to remain in-bounds for the dummy. Clamp index-producing chains as a matter of course, not just when the math says you "should" — `make_ones_bytes` is a hostile witness.
+
+## 2026-05-02 — Whisper port hit two missing-translator pitfalls
+
+1. **Symptom**: Compiling a PyTorch port of Whisper-tiny.en through `luminal_backend` failed twice in a row at the dispatch table: first with `Unsupported ATen op: torch.ops.aten.gelu.default`, then with `full: unsupported fill value type ... -Infinity`.
+2. **Root cause #1**: the dispatch table in `crates/luminal_python/rust/src/translator/dispatch.rs` mapped `sigmoid`, `tanh`, `relu` etc. but not `gelu` or `silu`. Whisper's encoder uses `F.gelu`, so the activation hit a hole.
+3. **Root cause #2**: PyTorch serializes `float("-inf")` in PT2 as the string `"-Infinity"` (and `"NaN"`/`"Infinity"` analogously). `translate_full`'s `get_float_arg` only accepts numeric float/int payloads, so any `torch.full((..), -inf)` (the obvious way to write a causal mask) blows up. Decoder mask code is the most common spot.
+4. **Why it was tricky**: both errors arrive from inside `pt2_backend` with a stack trace that ends in `process_pt2`, hiding the actual ATen target inside the message. You only see the offending op name in the error string itself, so you have to read `RuntimeError: Failed to translate node N: …` carefully and grep `dispatch.rs` for it.
+5. **Fix in this session**:
+   - Added `aten.gelu.default → a.gelu()` and `aten.silu.default → a.silu()` to `dispatch.rs`.
+   - Worked around the `-Infinity` issue at the model level by using a finite `-1e10` for the causal mask in the example (matches the Rust example's convention). The cleaner fix (parsing `"-Infinity"`/`"Infinity"`/`"NaN"` strings in `get_float_arg` / `translate_full`) is left for a follow-up.
+6. **Principle**: when adding a new model that goes through the PT2 backend, expect to plug small holes in `dispatch.rs` and `translator/tensor.rs::translate_full`. The trace points at the python frame, not the Rust dispatch arm — open `dispatch.rs`, ctrl-F the offending op name, and add the one-liner. For float-shaped sentinel values (`-inf`, `inf`, `nan`), the export pipeline currently only accepts finite floats; either rewrite the model or extend the parser.
--- a/crates/luminal_python/README.md
+++ b/crates/luminal_python/README.md
@@ -0,0 +1,60 @@
+# luminal_python
+
+PyTorch `torch.compile` integration for Luminal.
+
+## CUDA Tests
+
+The Python CUDA CI job builds the Rust extension with the CUDA feature and runs
+the non-slow pytest suite:
+
+```bash
+cd crates/luminal_python
+RUST_BACKTRACE=1 \
+LUMINAL_TEST_DEVICE=cuda \
+MATURIN_PEP517_ARGS="--features cuda --profile release" \
+CUDARC_CUDA_VERSION=12080 \
+uv run --group dev python -m pytest tests/ -v -s -m "not slow"
+```
+
+The slow tests are explicit opt-in. They include large/pretrained model tests,
+full-width architecture compiles, Whisper end-to-end cases, and other cases that
+can take a long time or need a large GPU / Hugging Face cache.
+
+Run the full Python CUDA suite, including slow tests:
+
+```bash
+cd crates/luminal_python
+RUST_BACKTRACE=1 \
+LUMINAL_TEST_DEVICE=cuda \
+MATURIN_PEP517_ARGS="--features cuda --profile release" \
+CUDARC_CUDA_VERSION=12080 \
+uv run --group dev python -m pytest tests/ -v -s
+```
+
+Run only the slow Python CUDA tests:
+
+```bash
+cd crates/luminal_python
+RUST_BACKTRACE=1 \
+LUMINAL_TEST_DEVICE=cuda \
+MATURIN_PEP517_ARGS="--features cuda --profile release" \
+CUDARC_CUDA_VERSION=12080 \
+uv run --group dev python -m pytest tests/ -v -s -m slow
+```
+
+The helper script follows the same convention:
+
+```bash
+cd crates/luminal_python
+./run_tests_cuda.sh              # non-slow CUDA suite
+./run_tests_cuda.sh --slow-only  # only slow CUDA tests
+./run_tests_cuda.sh --include-slow
+```
+
+The GitHub/Modal entrypoint uses the same marker split:
+
+```bash
+cd crates/luminal_python
+modal run modal_pytest_runner.py --gpu A100 --timeout 7200 tests/ -v -s -m "not slow"
+modal run modal_pytest_runner.py --gpu A100 --timeout 7200 tests/ -v -s
+```
--- a/crates/luminal_python/examples/whisper.py
+++ b/crates/luminal_python/examples/whisper.py
@@ -0,0 +1,497 @@
+"""Whisper transcription demo using the luminal torch.compile backend.
+
+Implements a small PyTorch port of ``openai/whisper-tiny.en`` that mirrors the
+luminal Rust example (``examples/whisper`` in the workspace), loads the official
+HuggingFace weights, and runs greedy decoding through the luminal backend via
+``torch.compile``.
+
+Usage::
+
+    uv run python examples/whisper.py [path/to/audio.wav]
+
+If no path is provided, falls back to the JFK sample bundled with the Rust
+``examples/whisper`` crate.
+"""
+
+from __future__ import annotations
+
+import os
+import sys
+import time
+import wave
+from pathlib import Path
+from typing import Optional
+
+import numpy as np
+import torch
+import torch._dynamo
+import torch.nn.functional as F
+from transformers import (
+    WhisperFeatureExtractor,
+    WhisperForConditionalGeneration,
+    WhisperTokenizer,
+)
+
+from luminal.pt2 import compile as luminal_compile
+
+REPO_ID = "openai/whisper-tiny.en"
+
+# whisper-tiny.en hyperparameters
+N_MELS = 80
+N_AUDIO_CTX = 1500
+D_MODEL = 384
+N_HEADS = 6
+HEAD_DIM = D_MODEL // N_HEADS
+N_AUDIO_LAYER = 4
+N_TEXT_LAYER = 4
+N_TEXT_CTX = 448
+FF_DIM = 4 * D_MODEL
+N_VOCAB = 51864
+LAYER_NORM_EPS = 1e-5
+
+# Decoder special tokens
+TOKEN_SOT = 50257
+TOKEN_NO_TIMESTAMPS = 50362
+TOKEN_EOT = 50256
+
+
+# ---------------------------------------------------------------------------
+# Model — mirrors the HLIR encoder/decoder in examples/whisper/src/model.rs
+# ---------------------------------------------------------------------------
+
+
+class WhisperAttention(torch.nn.Module):
+    """Multi-head attention with separate q/k/v projections (no bias on k_proj)."""
+
+    def __init__(self, d_model: int = D_MODEL, n_heads: int = N_HEADS):
+        super().__init__()
+        self.n_heads = n_heads
+        self.head_dim = d_model // n_heads
+        self.q_proj = torch.nn.Linear(d_model, d_model, bias=True)
+        self.k_proj = torch.nn.Linear(d_model, d_model, bias=False)
+        self.v_proj = torch.nn.Linear(d_model, d_model, bias=True)
+        self.out_proj = torch.nn.Linear(d_model, d_model, bias=True)
+
+    def forward(
+        self,
+        x: torch.Tensor,
+        kv_input: Optional[torch.Tensor] = None,
+        causal: bool = False,
+    ) -> torch.Tensor:
+        # x: (seq, d_model). kv_input is None → self-attn; otherwise cross-attn.
+        kv = x if kv_input is None else kv_input
+        q = self.q_proj(x)
+        k = self.k_proj(kv)
+        v = self.v_proj(kv)
+
+        seq_q = q.shape[0]
+        seq_kv = k.shape[0]
+
+        # (seq, d_model) -> (n_heads, seq, head_dim)
+        q = q.reshape(seq_q, self.n_heads, self.head_dim).transpose(0, 1)
+        k = k.reshape(seq_kv, self.n_heads, self.head_dim).transpose(0, 1)
+        v = v.reshape(seq_kv, self.n_heads, self.head_dim).transpose(0, 1)
+
+        scale = 1.0 / (self.head_dim**0.5)
+        scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # (h, sq, sk)
+        if causal:
+            # Use a large finite negative instead of -inf so the export pipeline
+            # serializes a float instead of the unsupported "-Infinity" sentinel.
+            mask = torch.triu(
+                torch.full((seq_q, seq_kv), -1e10, device=x.device),
+                diagonal=1,
+            )
+            scores = scores + mask
+        weights = torch.softmax(scores, dim=-1)
+        attn = torch.matmul(weights, v)  # (h, sq, hd)
+        merged = attn.transpose(0, 1).reshape(seq_q, -1)
+        return self.out_proj(merged)
+
+
+class EncoderLayer(torch.nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.self_attn = WhisperAttention()
+        self.self_attn_layer_norm = torch.nn.LayerNorm(D_MODEL, eps=LAYER_NORM_EPS)
+        self.fc1 = torch.nn.Linear(D_MODEL, FF_DIM, bias=True)
+        self.fc2 = torch.nn.Linear(FF_DIM, D_MODEL, bias=True)
+        self.final_layer_norm = torch.nn.LayerNorm(D_MODEL, eps=LAYER_NORM_EPS)
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        x = x + self.self_attn(self.self_attn_layer_norm(x))
+        h = self.final_layer_norm(x)
+        h = F.gelu(self.fc1(h))
+        h = self.fc2(h)
+        return x + h
+
+
+class WhisperEncoder(torch.nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.conv1 = torch.nn.Conv1d(
+            N_MELS, D_MODEL, kernel_size=3, padding=1, bias=True
+        )
+        self.conv2 = torch.nn.Conv1d(
+            D_MODEL, D_MODEL, kernel_size=3, stride=2, padding=1, bias=True
+        )
+        # Position embedding stored as a regular parameter (matches HF layout).
+        self.embed_positions = torch.nn.Embedding(N_AUDIO_CTX, D_MODEL)
+        self.layers = torch.nn.ModuleList(
+            [EncoderLayer() for _ in range(N_AUDIO_LAYER)]
+        )
+        self.layer_norm = torch.nn.LayerNorm(D_MODEL, eps=LAYER_NORM_EPS)
+
+    def forward(self, mel: torch.Tensor) -> torch.Tensor:
+        # mel: (n_mels, 3000) -> add batch dim for conv1d
+        x = mel.unsqueeze(0)
+        x = F.gelu(self.conv1(x))
+        x = F.gelu(self.conv2(x))
+        # (1, d_model, 1500) -> (1500, d_model)
+        x = x.squeeze(0).transpose(0, 1)
+        x = x + self.embed_positions.weight
+        for layer in self.layers:
+            x = layer(x)
+        return self.layer_norm(x)
+
+
+class DecoderLayer(torch.nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.self_attn = WhisperAttention()
+        self.self_attn_layer_norm = torch.nn.LayerNorm(D_MODEL, eps=LAYER_NORM_EPS)
+        self.encoder_attn = WhisperAttention()
+        self.encoder_attn_layer_norm = torch.nn.LayerNorm(D_MODEL, eps=LAYER_NORM_EPS)
+        self.fc1 = torch.nn.Linear(D_MODEL, FF_DIM, bias=True)
+        self.fc2 = torch.nn.Linear(FF_DIM, D_MODEL, bias=True)
+        self.final_layer_norm = torch.nn.LayerNorm(D_MODEL, eps=LAYER_NORM_EPS)
+
+    def forward(self, x: torch.Tensor, xa: torch.Tensor) -> torch.Tensor:
+        x = x + self.self_attn(self.self_attn_layer_norm(x), causal=True)
+        x = x + self.encoder_attn(self.encoder_attn_layer_norm(x), kv_input=xa)
+        h = self.final_layer_norm(x)
+        h = F.gelu(self.fc1(h))
+        h = self.fc2(h)
+        return x + h
+
+
+class WhisperDecoder(torch.nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.embed_tokens = torch.nn.Embedding(N_VOCAB, D_MODEL)
+        self.embed_positions = torch.nn.Embedding(N_TEXT_CTX, D_MODEL)
+        self.layers = torch.nn.ModuleList([DecoderLayer() for _ in range(N_TEXT_LAYER)])
+        self.layer_norm = torch.nn.LayerNorm(D_MODEL, eps=LAYER_NORM_EPS)
+
+    def forward(self, tokens: torch.Tensor, xa: torch.Tensor) -> torch.Tensor:
+        # tokens: (seq,) of int64 — absolute positions are 0..seq-1
+        seq = tokens.shape[0]
+        pos = torch.arange(seq, dtype=torch.long, device=tokens.device)
+        x = self.embed_tokens(tokens) + self.embed_positions(pos)
+        for layer in self.layers:
+            x = layer(x, xa)
+        x = self.layer_norm(x)
+        # Tied projection
+        return torch.matmul(x, self.embed_tokens.weight.transpose(0, 1))
+
+
+class Whisper(torch.nn.Module):
+    def __init__(self):
+        super().__init__()
+        self.encoder = WhisperEncoder()
+        self.decoder = WhisperDecoder()
+
+    def forward(self, mel: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
+        xa = self.encoder(mel)
+        return self.decoder(tokens, xa)
+
+
+class DecoderWithFixedXa(torch.nn.Module):
+    """Wraps the decoder with the encoder output stored as a buffer.
+
+    The audio is fixed for the whole utterance, so ``xa`` is a constant relative
+    to the per-token decode loop. Storing it as a buffer lets us compile the
+    decoder once with a single dynamic-length ``tokens`` input, avoiding a full
+    recompilation at every step as the sequence grows.
+    """
+
+    def __init__(self, decoder: WhisperDecoder, xa: torch.Tensor):
+        super().__init__()
+        self.decoder = decoder
+        self.register_buffer("xa", xa)
+
+    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
+        return self.decoder(tokens, self.xa)
+
+
+# ---------------------------------------------------------------------------
+# Weight loading: HF state_dict -> our model
+# ---------------------------------------------------------------------------
+
+
+def load_hf_weights_into(model: Whisper) -> None:
+    """Copy HF whisper-tiny.en weights into our matching modules."""
+    hf = WhisperForConditionalGeneration.from_pretrained(REPO_ID).eval()
+    sd = hf.state_dict()
+
+    def get(name: str) -> torch.Tensor:
+        return sd[f"model.{name}"].clone()
+
+    enc = model.encoder
+    enc.conv1.weight.data.copy_(get("encoder.conv1.weight"))
+    enc.conv1.bias.data.copy_(get("encoder.conv1.bias"))
+    enc.conv2.weight.data.copy_(get("encoder.conv2.weight"))
+    enc.conv2.bias.data.copy_(get("encoder.conv2.bias"))
+    enc.embed_positions.weight.data.copy_(get("encoder.embed_positions.weight"))
+    enc.layer_norm.weight.data.copy_(get("encoder.layer_norm.weight"))
+    enc.layer_norm.bias.data.copy_(get("encoder.layer_norm.bias"))
+    for i, layer in enumerate(enc.layers):
+        prefix = f"encoder.layers.{i}"
+        layer.self_attn.q_proj.weight.data.copy_(
+            get(f"{prefix}.self_attn.q_proj.weight")
+        )
+        layer.self_attn.q_proj.bias.data.copy_(get(f"{prefix}.self_attn.q_proj.bias"))
+        layer.self_attn.k_proj.weight.data.copy_(
+            get(f"{prefix}.self_attn.k_proj.weight")
+        )
+        layer.self_attn.v_proj.weight.data.copy_(
+            get(f"{prefix}.self_attn.v_proj.weight")
+        )
+        layer.self_attn.v_proj.bias.data.copy_(get(f"{prefix}.self_attn.v_proj.bias"))
+        layer.self_attn.out_proj.weight.data.copy_(
+            get(f"{prefix}.self_attn.out_proj.weight")
+        )
+        layer.self_attn.out_proj.bias.data.copy_(
+            get(f"{prefix}.self_attn.out_proj.bias")
+        )
+        layer.self_attn_layer_norm.weight.data.copy_(
+            get(f"{prefix}.self_attn_layer_norm.weight")
+        )
+        layer.self_attn_layer_norm.bias.data.copy_(
+            get(f"{prefix}.self_attn_layer_norm.bias")
+        )
+        layer.fc1.weight.data.copy_(get(f"{prefix}.fc1.weight"))
+        layer.fc1.bias.data.copy_(get(f"{prefix}.fc1.bias"))
+        layer.fc2.weight.data.copy_(get(f"{prefix}.fc2.weight"))
+        layer.fc2.bias.data.copy_(get(f"{prefix}.fc2.bias"))
+        layer.final_layer_norm.weight.data.copy_(
+            get(f"{prefix}.final_layer_norm.weight")
+        )
+        layer.final_layer_norm.bias.data.copy_(get(f"{prefix}.final_layer_norm.bias"))
+
+    dec = model.decoder
+    dec.embed_tokens.weight.data.copy_(get("decoder.embed_tokens.weight"))
+    dec.embed_positions.weight.data.copy_(get("decoder.embed_positions.weight"))
+    dec.layer_norm.weight.data.copy_(get("decoder.layer_norm.weight"))
+    dec.layer_norm.bias.data.copy_(get("decoder.layer_norm.bias"))
+    for i, layer in enumerate(dec.layers):
+        prefix = f"decoder.layers.{i}"
+        layer.self_attn.q_proj.weight.data.copy_(
+            get(f"{prefix}.self_attn.q_proj.weight")
+        )
+        layer.self_attn.q_proj.bias.data.copy_(get(f"{prefix}.self_attn.q_proj.bias"))
+        layer.self_attn.k_proj.weight.data.copy_(
+            get(f"{prefix}.self_attn.k_proj.weight")
+        )
+        layer.self_attn.v_proj.weight.data.copy_(
+            get(f"{prefix}.self_attn.v_proj.weight")
+        )
+        layer.self_attn.v_proj.bias.data.copy_(get(f"{prefix}.self_attn.v_proj.bias"))
+        layer.self_attn.out_proj.weight.data.copy_(
+            get(f"{prefix}.self_attn.out_proj.weight")
+        )
+        layer.self_attn.out_proj.bias.data.copy_(
+            get(f"{prefix}.self_attn.out_proj.bias")
+        )
+        layer.self_attn_layer_norm.weight.data.copy_(
+            get(f"{prefix}.self_attn_layer_norm.weight")
+        )
+        layer.self_attn_layer_norm.bias.data.copy_(
+            get(f"{prefix}.self_attn_layer_norm.bias")
+        )
+        layer.encoder_attn.q_proj.weight.data.copy_(
+            get(f"{prefix}.encoder_attn.q_proj.weight")
+        )
+        layer.encoder_attn.q_proj.bias.data.copy_(
+            get(f"{prefix}.encoder_attn.q_proj.bias")
+        )
+        layer.encoder_attn.k_proj.weight.data.copy_(
+            get(f"{prefix}.encoder_attn.k_proj.weight")
+        )
+        layer.encoder_attn.v_proj.weight.data.copy_(
+            get(f"{prefix}.encoder_attn.v_proj.weight")
+        )
+        layer.encoder_attn.v_proj.bias.data.copy_(
+            get(f"{prefix}.encoder_attn.v_proj.bias")
+        )
+        layer.encoder_attn.out_proj.weight.data.copy_(
+            get(f"{prefix}.encoder_attn.out_proj.weight")
+        )
+        layer.encoder_attn.out_proj.bias.data.copy_(
+            get(f"{prefix}.encoder_attn.out_proj.bias")
+        )
+        layer.encoder_attn_layer_norm.weight.data.copy_(
+            get(f"{prefix}.encoder_attn_layer_norm.weight")
+        )
+        layer.encoder_attn_layer_norm.bias.data.copy_(
+            get(f"{prefix}.encoder_attn_layer_norm.bias")
+        )
+        layer.fc1.weight.data.copy_(get(f"{prefix}.fc1.weight"))
+        layer.fc1.bias.data.copy_(get(f"{prefix}.fc1.bias"))
+        layer.fc2.weight.data.copy_(get(f"{prefix}.fc2.weight"))
+        layer.fc2.bias.data.copy_(get(f"{prefix}.fc2.bias"))
+        layer.final_layer_norm.weight.data.copy_(
+            get(f"{prefix}.final_layer_norm.weight")
+        )
+        layer.final_layer_norm.bias.data.copy_(get(f"{prefix}.final_layer_norm.bias"))
+
+
+# ---------------------------------------------------------------------------
+# Audio loading + decoding
+# ---------------------------------------------------------------------------
+
+
+def load_wav_16k_mono(path: Path) -> np.ndarray:
+    with wave.open(str(path), "rb") as w:
+        sr = w.getframerate()
+        n = w.getnframes()
+        ch = w.getnchannels()
+        sw = w.getsampwidth()
+        raw = w.readframes(n)
+
+    if sw == 2:
+        samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
+    elif sw == 4:
+        samples = np.frombuffer(raw, dtype=np.int32).astype(np.float32) / 2147483648.0
+    elif sw == 1:
+        samples = (
+            np.frombuffer(raw, dtype=np.uint8).astype(np.float32) - 128.0
+        ) / 128.0
+    else:
+        raise ValueError(f"unsupported sample width {sw}")
+
+    if ch > 1:
+        samples = samples.reshape(-1, ch).mean(axis=1)
+
+    if sr != 16000:
+        ratio = sr / 16000
+        out_len = int(len(samples) / ratio)
+        idx = np.arange(out_len, dtype=np.float64) * ratio
+        lo = idx.astype(np.int64)
+        frac = (idx - lo).astype(np.float32)
+        hi = np.clip(lo + 1, 0, len(samples) - 1)
+        samples = samples[lo] * (1.0 - frac) + samples[hi] * frac
+
+    return samples.astype(np.float32)
+
+
+def greedy_decode(logits_row: torch.Tensor, suppress_first_eot: bool) -> int:
+    masked = logits_row.clone()
+    masked[TOKEN_SOT:] = float("-inf")
+    if suppress_first_eot:
+        masked[TOKEN_EOT] = float("-inf")
+    return int(torch.argmax(masked).item())
+
+
+def find_default_audio() -> Optional[Path]:
+    here = Path(__file__).resolve()
+    workspace_root = here.parents[3]
+    candidate = workspace_root / "examples" / "whisper" / "assets" / "jfk.wav"
+    return candidate if candidate.exists() else None
+
+
+def main() -> None:
+    audio_arg = sys.argv[1] if len(sys.argv) > 1 else None
+    if audio_arg:
+        audio_path = Path(audio_arg)
+    else:
+        audio_path = find_default_audio()
+        if audio_path is None:
+            print(
+                "error: no audio file given and bundled jfk.wav not found",
+                file=sys.stderr,
+            )
+            sys.exit(1)
+
+    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    print(f"Using device: {device}")
+
+    print("Loading audio:", audio_path)
+    audio = load_wav_16k_mono(audio_path)
+
+    print("Computing log-mel features...")
+    feature_extractor = WhisperFeatureExtractor.from_pretrained(REPO_ID)
+    features = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
+    mel: torch.Tensor = features.input_features[0].to(device)  # (80, 3000)
+    assert mel.shape == (N_MELS, 3000), mel.shape
+
+    print("Building model and loading weights...")
+    model = Whisper().eval().to(device)
+    load_hf_weights_into(model)
+    model = model.to(device)
+    tokenizer = WhisperTokenizer.from_pretrained(REPO_ID)
+
+    use_compiled = os.environ.get("LUMINAL_DISABLE", "0") != "1"
+    max_new_tokens = int(os.environ.get("GEN_TOKENS", "100"))
+    search_iters = int(os.environ.get("SEARCH_ITERATIONS", "10"))
+
+    if use_compiled:
+        # 1. Run the encoder once eagerly. The audio doesn't change during decode,
+        #    so xa is a constant input to the decoder.
+        with torch.no_grad():
+            xa = model.encoder(mel)
+
+        # 2. Wrap the decoder so its only varying input is `tokens`, then compile
+        #    once with a dynamic length dim. Subsequent calls reuse the same
+        #    compiled graph — no recompile per token.
+        decoder_only = DecoderWithFixedXa(model.decoder, xa).eval().to(device)
+        example_tokens = torch.tensor(
+            [TOKEN_SOT, TOKEN_NO_TIMESTAMPS], dtype=torch.long, device=device
+        )
+        print(
+            f"Compiling decoder with dynamic seq dim (search_iters={search_iters})..."
+        )
+        compile_start = time.time()
+        compiled_decoder = luminal_compile(
+            decoder_only,
+            example_tokens,
+            search_iterations=search_iters,
+            dynamic_dim=0,
+        )
+        print(f"Compiled in {time.time() - compile_start:.1f}s")
+
+        def step_logits(decoder_input_ids: torch.Tensor) -> torch.Tensor:
+            out = compiled_decoder(decoder_input_ids)
+            return out[0] if isinstance(out, tuple) else out
+    else:
+
+        def step_logits(decoder_input_ids: torch.Tensor) -> torch.Tensor:
+            return model(mel, decoder_input_ids)
+
+    tokens = [TOKEN_SOT, TOKEN_NO_TIMESTAMPS]
+
+    print("Transcribing", end="", flush=True)
+    decode_start = time.time()
+    for step in range(max_new_tokens):
+        decoder_input_ids = torch.tensor(tokens, dtype=torch.long, device=device)
+        with torch.no_grad():
+            logits = step_logits(decoder_input_ids)
+
+        next_token = greedy_decode(logits[-1], suppress_first_eot=(step == 0))
+        if next_token == TOKEN_EOT:
+            break
+        tokens.append(next_token)
+        piece = tokenizer.decode([next_token], skip_special_tokens=False)
+        print(piece, end="", flush=True)
+    elapsed = time.time() - decode_start
+    print()
+
+    transcription = tokenizer.decode(tokens[2:], skip_special_tokens=True)
+    print(f"\nFinal transcription: {transcription}")
+    print(
+        f"Generated {len(tokens) - 2} tokens in {elapsed:.2f}s "
+        f"({(len(tokens) - 2) / max(elapsed, 1e-6):.1f} tok/s)"
+    )
+
+
+if __name__ == "__main__":
+    main()
--- a/crates/luminal_python/modal_pytest_runner.py
+++ b/crates/luminal_python/modal_pytest_runner.py
@@ -22,7 +22,7 @@ from modal.volume import FileEntryType

 app = modal.App("luminal-tests")

-DEFAULT_TIMEOUT = 30 * 60
+DEFAULT_TIMEOUT = 2 * 60 * 60
 CUDARC_CUDA_VERSION = "12080"
 LOCAL_PROJECT_DIR = Path(__file__).resolve().parent
 PROJECT_DIR = "/root/luminal/crates/luminal_python"
@@ -168,6 +168,37 @@ def _cleanup_remote_profile_artifacts(run_id: str) -> None:
        return


+def _build_cuda_extension(env: dict[str, str]) -> None:
+    cmd = [
+        "uv",
+        "run",
+        "--project",
+        PROJECT_DIR,
+        "--group",
+        "dev",
+        "maturin",
+        "develop",
+        "--manifest-path",
+        f"{PROJECT_DIR}/rust/Cargo.toml",
+        "--features",
+        "cuda",
+        "--profile",
+        "release",
+    ]
+    subprocess.run(cmd, env=env, cwd=PROJECT_DIR, check=True)
+
+
+def _effective_timeout(timeout: int) -> int:
+    if os.environ.get("GITHUB_ACTIONS") == "true" and timeout < DEFAULT_TIMEOUT:
+        print(
+            f"Using Modal timeout {DEFAULT_TIMEOUT}s instead of requested "
+            f"{timeout}s in GitHub Actions.",
+            file=sys.stderr,
+        )
+        return DEFAULT_TIMEOUT
+    return timeout
+
+
@app.cls(image=image, timeout=DEFAULT_TIMEOUT)
 class TestRunner:
    @modal.method()
@@ -186,7 +217,7 @@ class TestRunner:
        env = os.environ.copy()
        existing = env.get("PYTHONPATH")
        env["PYTHONPATH"] = f"{SRC_PATH}:{existing}" if existing else SRC_PATH
-        env["LUMINAL_BACKEND"] = "cuda"
+        env["LUMINAL_TEST_DEVICE"] = "cuda"
        env["UV_PROJECT_ENVIRONMENT"] = VENV_PATH
        env["MATURIN_PEP517_ARGS"] = "--features cuda --profile release"
        env["CUDARC_CUDA_VERSION"] = CUDARC_CUDA_VERSION
@@ -194,6 +225,8 @@ class TestRunner:
        if pytest_addopts:
            env["PYTEST_ADDOPTS"] = pytest_addopts

+        _build_cuda_extension(env)
+
        original_svg_requested = _has_pytest_flag(pytest_args, "--profile-svg")
        dot_available = shutil.which("dot") is not None
        sanitized_pytest_args = [
@@ -218,8 +251,6 @@ class TestRunner:
            PROJECT_DIR,
            "--group",
            "dev",
-            "--reinstall-package",
-            "luminal_python",
            "python",
            "-m",
            "pytest",
@@ -285,7 +316,7 @@ class TestRunner:

 def _parse_cli_args(
    cli_args: tuple[str, ...],
-) -> tuple[str, int | None, bool, str | None, list[str]]:
+) -> tuple[str, int, bool, str | None, list[str]]:
    parser = argparse.ArgumentParser(
        prog="modal run modal_pytest_runner.py",
        add_help=False,
@@ -300,7 +331,8 @@ def _parse_cli_args(
    parser.add_argument(
        "--timeout",
        type=int,
-        help="Optional Modal execution timeout in seconds. Defaults to 1800 seconds.",
+        default=DEFAULT_TIMEOUT,
+        help="Modal execution timeout in seconds. Defaults to %(default)s seconds.",
    )
    parser.add_argument(
        "--profile",
@@ -334,11 +366,11 @@ def main(*cli_args: str):
    )
    profile_enabled = _profiling_enabled(cli_profile, pytest_args)
    pytest_addopts = os.environ.get("PYTEST_ADDOPTS", "")
+    timeout = _effective_timeout(timeout)
    runner_options = {"gpu": gpu}
    hf_token_secret = _hf_token_secret()
    runner_volumes = {HF_CACHE_PATH: HF_CACHE_VOLUME}
-    if timeout is not None:
-        runner_options["timeout"] = timeout
+    runner_options["timeout"] = timeout
    if profile_enabled:
        runner_volumes[PROFILE_VOLUME_PATH] = PROFILE_VOLUME
    runner_options["volumes"] = runner_volumes
--- a/crates/luminal_python/pyproject.toml
+++ b/crates/luminal_python/pyproject.toml
@@ -3,19 +3,13 @@ name = "luminal_python"
 version = "0.1.0"
 description = "Add your description here"
 readme = "README.md"
-requires-python = ">=3.12"
+requires-python = ">=3.10"
 dependencies = [
    "numpy>=2.0.2",
    "torch>=2.10.0",
-    "onnx",
-    "onnxscript",
    "safetensors",
-    "flash-attn-3>=3.0.0",
 ]

-[tool.uv]
-no-build-isolation-package = ["flash-attn"]
-
 [[tool.uv.index]]
 name = "pytorch-cu128"
 url = "https://download.pytorch.org/whl/cu128"
@@ -25,7 +19,6 @@ explicit = true
 torch = [
  { index = "pytorch-cu128", marker = "sys_platform == 'linux' or sys_platform == 'win32'" },
 ]
-flash-attn-3 = { index = "pytorch-cu128" }


 [build-system]
@@ -39,27 +32,18 @@ module-name = "luminal.luminal"

 [tool.pytest.ini_options]
 markers = [
-    "slow: tests that download large models or require pre-generated artifacts",
+    "slow: tests that download large models, compile full-width model graphs, fuzz many CUDA search choices, or otherwise require explicit opt-in",
 ]

 [dependency-groups]
 dev = [
    "maturin>=1.0,<2.0",
-    "maturin-import-hook>=0.3.0",
    "pytest>=9.0.2",
    "pytest-profiling",
    "snakeviz",
+    "maturin-import-hook>=0.3.0",
    "pytest-randomly>=4.0.1",
-    "transformers>=5.5.0,<6",
+    "transformers>=4.40.0",
    "diffusers>=0.35.0",
-    "onnxsim",
-    "tiktoken>=0.12.0",
-    "pydantic>=2.12.5",
-    "psutil>=7.2.2",
    "modal>=1.3.5",
-    "pillow",
-    "flash-attn>=2.8.3",
-]
-flash-attention-4 = [
-    "nvidia-cutlass-dsl==4.1.0",
 ]
--- a/crates/luminal_python/run_all_tests.sh
+++ b/crates/luminal_python/run_all_tests.sh
@@ -1,42 +1,43 @@
 #!/bin/bash
 set -e

+export CUDARC_CUDA_VERSION="${CUDARC_CUDA_VERSION:-12080}"
+export MATURIN_PEP517_ARGS="${MATURIN_PEP517_ARGS:---features cuda --profile release}"
+
 echo "=========================================="
 echo "  Luminal Python: Full Test Suite"
 echo "=========================================="

-NATIVE_TESTS="tests/test_hlir_ops.py tests/test_unary.py"
-CUDA_TESTS="tests/test_hlir_ops.py tests/test_unary.py tests/test_llama3.py"
+NATIVE_TESTS="tests/test_hlir_ops.py tests/test_unary.py tests/test_input_layout.py tests/test_dtype_boundary.py tests/test_mutation_alias_contract.py"
+CUDA_TESTS="tests/"

 # ── Phase 1: Native Backend ─────────────────────────────────

 echo ""
 echo "=== Phase 1: Building native backend ==="
 rm -rf rust/target/wheels rust/target/debug rust/target/release
-uv run maturin develop --manifest-path rust/Cargo.toml
+uv run --group dev maturin develop --manifest-path rust/Cargo.toml

 echo ""
-echo "--- 1a: Native + ONNX ---"
-uv run pytest $NATIVE_TESTS -v
-
-echo ""
-echo "--- 1b: Native + PT2 ---"
-LUMINAL_EXPORT_MODE=pt2 uv run pytest $NATIVE_TESTS -v
+echo "--- 1a: Native backend tests ---"
+uv run --group dev pytest $NATIVE_TESTS -v

 # ── Phase 2: CUDA Backend ───────────────────────────────────

 echo ""
 echo "=== Phase 2: Building CUDA backend ==="
 rm -rf rust/target/wheels rust/target/debug rust/target/release
-uv run maturin develop --manifest-path rust/Cargo.toml --features cuda -r
+uv run --group dev maturin develop --manifest-path rust/Cargo.toml --features cuda -r

 echo ""
-echo "--- 2a: CUDA + ONNX ---"
-RUST_BACKTRACE=1 LUMINAL_BACKEND=cuda uv run pytest $CUDA_TESTS -m "not slow" -v
+echo "--- 2a: CUDA ---"
+RUST_BACKTRACE=1 LUMINAL_TEST_DEVICE=cuda uv run --group dev pytest $CUDA_TESTS -m "not slow" -v

 echo ""
-echo "--- 2b: CUDA + PT2 ---"
-RUST_BACKTRACE=1 LUMINAL_BACKEND=cuda LUMINAL_EXPORT_MODE=pt2 uv run pytest $CUDA_TESTS -m "not slow" -v
+echo "Slow CUDA tests are opt-in. To include them, run:"
+echo "  RUST_BACKTRACE=1 LUMINAL_TEST_DEVICE=cuda uv run pytest tests/ -v -s"
+echo "Or, for only slow tests:"
+echo "  RUST_BACKTRACE=1 LUMINAL_TEST_DEVICE=cuda uv run pytest tests/ -m slow -v -s"

 echo ""
 echo "=========================================="
--- a/crates/luminal_python/run_test.sh
+++ b/crates/luminal_python/run_test.sh
@@ -16,7 +16,7 @@ uv run maturin develop --manifest-path rust/Cargo.toml
 echo "Step 3: Running pytest..."
 # it is best not to add the full model tests, they end up running billion parameter models
 # on the CPU and it takes far to long
-uv run pytest tests/test_hlir_ops.py tests/test_unary.py -v
+uv run pytest tests/test_hlir_ops.py tests/test_unary.py tests/test_input_layout.py tests/test_dtype_boundary.py tests/test_mutation_alias_contract.py -v

 echo ""
 echo "=== Tests Complete ==="
--- a/crates/luminal_python/run_test_fx.sh
+++ b/crates/luminal_python/run_test_fx.sh
@@ -1,20 +0,0 @@
-#!/bin/bash
-set -e
-
-echo "=== Luminal Python Test Runner (PT2 Export Mode) ==="
-echo ""
-
-# Force clean rebuild of Rust extension
-echo "Step 1: Cleaning previous builds..."
-rm -rf rust/target/wheels rust/target/debug rust/target/release
-
-# Rebuild in development mode (faster compilation)
-echo "Step 2: Building Rust extension..."
-uv run maturin develop --manifest-path rust/Cargo.toml
-
-# Run pytest with PT2 export mode
-echo "Step 3: Running pytest with PT2 export mode..."
-LUMINAL_EXPORT_MODE=pt2 uv run pytest tests/test_hlir_ops.py tests/test_unary.py -v
-
-echo ""
-echo "=== Tests Complete ==="
--- a/crates/luminal_python/run_tests_cuda.sh
+++ b/crates/luminal_python/run_tests_cuda.sh
@@ -4,17 +4,34 @@ set -e
 echo "=== Luminal Python Test Runner (CUDA Backend) ==="
 echo ""

+export CUDARC_CUDA_VERSION="${CUDARC_CUDA_VERSION:-12080}"
+export MATURIN_PEP517_ARGS="${MATURIN_PEP517_ARGS:---features cuda --profile release}"
+
+PYTEST_MARK='not slow'
+if [[ "${1:-}" == "--include-slow" ]]; then
+    PYTEST_MARK=''
+elif [[ "${1:-}" == "--slow-only" ]]; then
+    PYTEST_MARK='slow'
+elif [[ "${1:-}" != "" ]]; then
+    echo "Usage: ./run_tests_cuda.sh [--include-slow|--slow-only]"
+    exit 2
+fi
+
 # Force clean rebuild of Rust extension
 echo "Step 1: Cleaning previous builds..."
 rm -rf rust/target/wheels rust/target/debug rust/target/release

 # Rebuild in development mode (faster compilation)
 echo "Step 2: Building Rust extension..."
-uv run maturin develop --manifest-path rust/Cargo.toml --features cuda -r
+uv run --group dev maturin develop --manifest-path rust/Cargo.toml --features cuda -r

 # Run pytest with CUDA backend
 echo "Step 3: Running pytest with CUDA backend..."
-RUST_BACKTRACE=1 LUMINAL_BACKEND=cuda uv run pytest tests/test_llama3.py tests/test_hlir_ops.py tests/test_unary.py -v
+if [[ -n "$PYTEST_MARK" ]]; then
+    RUST_BACKTRACE=1 LUMINAL_TEST_DEVICE=cuda uv run --group dev pytest tests/ -m "$PYTEST_MARK" -v -s
+else
+    RUST_BACKTRACE=1 LUMINAL_TEST_DEVICE=cuda uv run --group dev pytest tests/ -v -s
+fi

 echo ""
 echo "=== Tests Complete ==="
--- a/crates/luminal_python/run_tests_cuda_fx.sh
+++ b/crates/luminal_python/run_tests_cuda_fx.sh
@@ -1,19 +0,0 @@
-#!/bin/bash
-set -e
-
-echo "=== Luminal Python Test Runner (CUDA + PT2 Export Mode) ==="
-echo ""
-
-# Force clean rebuild of Rust extension
-echo "Step 1: Cleaning previous builds..."
-rm -rf rust/target/wheels rust/target/debug rust/target/release
-
-# Rebuild in development mode (faster compilation)
-echo "Step 2: Building Rust extension..."
-uv run maturin develop --manifest-path rust/Cargo.toml --features cuda -r
-
-# Run pytest with CUDA backend and PT2 export mode
-echo "Step 3: Running pytest with CUDA backend + PT2 export mode..."
-RUST_BACKTRACE=1 LUMINAL_BACKEND=cuda LUMINAL_EXPORT_MODE=pt2 uv run pytest tests/test_llama3.py tests/test_hlir_ops.py tests/test_unary.py -v
-echo ""
-echo "=== Tests Complete ==="
--- a/crates/luminal_python/rust/Cargo.toml
+++ b/crates/luminal_python/rust/Cargo.toml
@@ -12,8 +12,6 @@ path = "src/lib.rs"
 cuda = ["dep:luminal_cuda_lite"]

 [dependencies]
-onnx-protobuf = "0.2"
-protobuf = "~3.4"
 rustc-hash = "2.1.1"
 luminal = {path= "../../.."}
 luminal_cuda_lite = {path="../../luminal_cuda_lite", optional = true}
--- a/crates/luminal_python/rust/src/compiled_graph.rs
+++ b/crates/luminal_python/rust/src/compiled_graph.rs
@@ -1,32 +1,117 @@
-#[cfg(feature = "cuda")]
-use luminal::prelude::tracing::{trace, warn};
-use luminal::{prelude::*, shape::Expression, visualization::ToDot};
+use luminal::{
+    dyn_backend::{BackendCompileArgs, BackendFactory, DynBackend},
+    prelude::*,
+    shape::Expression,
+    visualization::ToDot,
+};
 use pyo3::prelude::*;
 use std::collections::HashMap;
-#[cfg(feature = "cuda")]
-use std::collections::HashSet;

-use crate::{runtime::RuntimeBackend, util::DimParamMap};
+use crate::typed_data::TypedData;

-/// Common intermediate result from translating a model graph (ONNX or FX).
+/// Maps symbolic dimension parameter names (e.g. "seq_len") to luminal Expression variable chars.
+pub type DimParamMap = HashMap<String, char>;
+
+/// Recover a single-variable dim's variable value from an observed runtime size.
+///
+/// Returns `Some((var, value))` when the expression contains exactly one
+/// variable, is affine in that variable, and `value` round-trips through
+/// `exec_single_var_checked` to reproduce `dim_val`. Returns `None` otherwise
+/// — multi-variable expressions, non-affine forms, slope==0, and inversions
+/// that don't divide cleanly are all rejected so we never write a wrong
+/// guess into `dyn_map`.
+fn solve_single_var_dim(expr: &Expression, dim_val: usize) -> Option<(char, usize)> {
+    use luminal::shape::Term;
+    let terms = expr.terms.read();
+
+    // Identify the unique variable, if any.
+    let mut var: Option<char> = None;
+    for t in terms.iter() {
+        if let Term::Var(c) = t {
+            match var {
+                None => var = Some(*c),
+                Some(existing) if existing == *c => {}
+                Some(_) => return None, // multi-var — bail out
+            }
+        }
+    }
+    let var = var?;
+
+    // Bare-var fast path — terms is exactly `[Var]`.
+    if terms.len() == 1 {
+        return Some((var, dim_val));
+    }
+
+    // Probe two points to recover slope/intercept of an assumed affine form
+    // `f(x) = slope*x + intercept`. We use 2 and 3 (luminal's default
+    // dynamic-dim min is 2, and 3 keeps the inputs small in case the
+    // expression includes a multiplication that could overflow at scale).
+    drop(terms);
+    let f2 = expr.exec_single_var_checked(2)? as i64;
+    let f3 = expr.exec_single_var_checked(3)? as i64;
+    let slope = f3 - f2;
+    if slope == 0 {
+        return None;
+    }
+    let intercept = f2 - 2 * slope;
+    let target = dim_val as i64 - intercept;
+    if slope == 0 || target % slope != 0 {
+        return None;
+    }
+    let candidate = target / slope;
+    if candidate < 0 {
+        return None;
+    }
+    let candidate = candidate as usize;
+
+    // Verify by re-evaluating with the candidate value. Catches non-affine
+    // forms whose probe points happen to be collinear (e.g. `min(s, 100)`
+    // would look affine for s ∈ {2, 3} but flatten beyond 100).
+    if expr.exec_single_var_checked(candidate)? != dim_val {
+        return None;
+    }
+    Some((var, candidate))
+}
+
+/// Convert luminal DType to PT2 dtype integer code (for python interop)
+/// Types without a direct Pytorch equivalent map to the closest safe representation
+fn luminal_dtype_to_pt2_code(dtype: DType) -> u32 {
+    match dtype {
+        DType::U8 => 1,
+        DType::I8 => 2,
+        DType::I16 => 3,
+        DType::Int => 4, // i32
+        DType::U16 => 4, // u16 -> i32 (Pytorch has no u16 in older versions)
+        DType::F16 => 6,
+        DType::F32 | DType::TF32 => 7,
+        DType::F64 => 8,
+        DType::Bool => 12,
+        DType::Bf16 => 13,
+        _ => panic!("luminal_dtype_to_pt2_code: unsupported dtype {:?}", dtype),
+    }
+}
+
+/// Common intermediate result from translating a model graph.
 pub struct GraphTranslation {
    pub graph: Graph,
    pub tensor_ids: HashMap<String, NodeIndex>,
    pub input_names: Vec<String>,
    pub output_names: Vec<String>,
    pub output_shape_exprs: Vec<Vec<Expression>>,
+    /// Output dtypes as PT2 dtype codes (e.g. 5 = int64, 7 = float32).
+    /// Stored as PT2 codes (rather than luminal `DType`) so we can preserve
+    /// distinctions luminal collapses internally — notably int64 vs int32,
+    /// both of which map to `DType::Int` in luminal but must be reported
+    /// back to PyTorch with their original precision.
+    pub output_dtypes: Vec<u32>,
    pub input_shape_exprs: Vec<Vec<Expression>>,
    pub dim_param_map: DimParamMap,
 }

-/// Pre-loaded weight data from any model format.
-///
-/// NOTE: Currently assumes all data is F32. When the type system branch lands
-/// with proper multi-dtype support, this struct (and all callers) will need
-/// updating to carry dtype metadata alongside the raw data.
+/// Pre-loaded weight data from any model format (dtype-aware).
 pub struct WeightData {
-    /// (Input node label, f32 data) for weights and constants.
-    pub weights: Vec<(String, Vec<f32>)>,
+    /// (Input node label, typed data) for weights and constants.
+    pub weights: Vec<(String, TypedData)>,
    /// label → element count for ALL Input nodes (for CUDA dummy data sizing).
    pub tensor_sizes: HashMap<String, usize>,
    /// label → (device_ptr, n_bytes) for zero-copy CUDA weight sharing.
@@ -36,7 +121,7 @@ pub struct WeightData {
 #[pyclass(unsendable)]
 pub struct CompiledGraph {
    pub graph: Graph,
-    pub runtime: RuntimeBackend,
+    pub runtime: Box<dyn DynBackend>,
    pub tensor_ids: HashMap<String, NodeIndex>,
    /// Cached label → NodeIndex map for O(1) lookups in set_weight_* methods.
    label_map: HashMap<String, NodeIndex>,
@@ -44,20 +129,23 @@ pub struct CompiledGraph {
    pub output_names: Vec<String>,
    pub output_shapes: Vec<Vec<usize>>,
    pub output_shape_exprs: Vec<Vec<Expression>>,
+    /// Output dtypes as PT2 dtype codes (preserves int64 / int32 distinction
+    /// that luminal collapses to `DType::Int` internally).
+    pub output_dtypes: Vec<u32>,
    pub input_shape_exprs: Vec<Vec<Expression>>,
    pub dim_param_map: DimParamMap,
 }

 impl CompiledGraph {
-    /// Shared compilation pipeline for both ONNX and FX/PT2 graphs.
+    /// Compilation pipeline for PT2/FX graphs.
    ///
-    /// Takes a format-neutral `GraphTranslation` (produced by `translate_onnx` or
-    /// `translate_pt2`) and `WeightData`, builds the backend, loads weights, and
+    /// Takes a `GraphTranslation` (produced by `translate_pt2`) and `WeightData`,
+    /// builds the backend via the global registry, loads weights, and
    /// returns a ready-to-execute `CompiledGraph`.
    pub fn parse_graph(
        translation: GraphTranslation,
        weight_data: WeightData,
-        backend: &str,
+        factory: BackendFactory,
        search_iters: usize,
    ) -> Result<CompiledGraph, String> {
        let GraphTranslation {
@@ -66,49 +154,34 @@ impl CompiledGraph {
            input_names,
            output_names,
            output_shape_exprs,
+            output_dtypes,
            input_shape_exprs,
            dim_param_map,
        } = translation;

-        let rt = match backend {
-            #[cfg(feature = "cuda")]
-            "cuda" | "gpu" => {
-                CompiledGraph::build_cuda_backend(&mut graph, &weight_data, search_iters)?
-            }
-            "native" | "cpu" => {
-                CompiledGraph::build_native_backend(&mut graph, &weight_data, search_iters)?
-            }
-            _ => {
-                #[cfg(feature = "cuda")]
-                {
-                    return Err(format!(
-                        "Invalid backend '{}'. Must be 'native' or 'cuda'",
-                        backend
-                    ));
-                }
-                #[cfg(not(feature = "cuda"))]
-                {
-                    if backend == "cuda" {
-                        return Err(
-                            "CUDA backend requested, but this luminal extension was built without the `cuda` feature. Rebuild with `maturin develop --features cuda -r` or use backend='native'."
-                                .to_string(),
-                        );
-                    }
-                    return Err(format!(
-                        "Invalid backend '{}'. This build only supports 'native'. Rebuild with the `cuda` feature to enable 'cuda'.",
-                        backend
-                    ));
-                }
-            }
+        // Build compile args from WeightData (convert TypedData -> raw bytes + dtype)
+        let compile_args = BackendCompileArgs {
+            search_iters,
+            weights: weight_data
+                .weights
+                .iter()
+                .map(|(label, td)| (label.clone(), td.bytes.clone(), td.dtype))
+                .collect(),
+            tensor_sizes: weight_data.tensor_sizes,
+            device_ptrs: weight_data.device_ptrs,
        };

+        // Create backend via the factory directly
+        let rt =
+            luminal::dyn_backend::compile_backend_from_factory(factory, &mut graph, compile_args)?;
+
        // Resolve concrete output shapes from expressions
        let output_shapes: Vec<Vec<usize>> = output_shape_exprs
            .iter()
            .map(|exprs| exprs.iter().map(|e| e.to_usize().unwrap_or(1)).collect())
            .collect();

-        let label_map = CompiledGraph::build_label_map(&graph);
+        let label_map = luminal::dyn_backend::build_label_map(&graph);

        Ok(CompiledGraph {
            graph,
@@ -119,160 +192,11 @@ impl CompiledGraph {
            output_names,
            output_shapes,
            output_shape_exprs,
+            output_dtypes,
            input_shape_exprs,
            dim_param_map,
        })
    }
-
-    /// Build a label → NodeIndex map for all Input nodes in the graph.
-    /// Used for efficient weight loading by label matching.
-    fn build_label_map(graph: &Graph) -> HashMap<String, NodeIndex> {
-        graph
-            .graph
-            .node_indices()
-            .filter_map(|node_id| {
-                (*graph.graph[node_id])
-                    .as_any()
-                    .downcast_ref::<luminal::hlir::Input>()
-                    .map(|input| (input.label.clone(), node_id))
-            })
-            .collect()
-    }
-
-    #[cfg(feature = "cuda")]
-    fn build_cuda_backend(
-        graph: &mut Graph,
-        weight_data: &WeightData,
-        search_iters: usize,
-    ) -> Result<RuntimeBackend, String> {
-        let device_ptrs = &weight_data.device_ptrs;
-        use luminal_cuda_lite::cudarc::driver::CudaContext;
-        use luminal_cuda_lite::runtime::CudaRuntime;
-
-        let cuda_ctx = CudaContext::new(0).map_err(|e| format!("CUDA context init failed: {e}"))?;
-        let stream = cuda_ctx.default_stream();
-
-        graph.build_search_space::<CudaRuntime>();
-
-        let mut rt = CudaRuntime::initialize(stream);
-
-        // Build label → NodeIndex map for device pointer matching.
-        let label_map = CompiledGraph::build_label_map(graph);
-
-        // For weights with device pointers: use them directly (zero-copy).
-        // This avoids allocating ~N GB of dummy data during search.
-        // The pointers survive search because profiling mode skips buffer consumption,
-        // and persist_hlir_node ensures they survive post-search execution too.
-        let mut device_ptr_nodes: HashSet<NodeIndex> = HashSet::new();
-        let mut matched_count = 0usize;
-        let mut missed_labels: Vec<String> = Vec::new();
-        for (label, &(ptr, n_bytes)) in device_ptrs {
-            if let Some(&node_id) = label_map.get(label) {
-                unsafe { rt.set_device_ptr(node_id, ptr, n_bytes) };
-                rt.persist_hlir_node(node_id);
-                device_ptr_nodes.insert(node_id);
-                matched_count += 1;
-            } else {
-                missed_labels.push(label.clone());
-            }
-        }
-        let total_device_bytes: usize = device_ptrs.values().map(|(_, n)| *n).sum();
-        trace!(
-            "[CUDA BUILD] Device pointers: {} matched, {} missed out of {} total ({:.3} GiB)",
-            matched_count,
-            missed_labels.len(),
-            device_ptrs.len(),
-            total_device_bytes as f64 / (1024.0 * 1024.0 * 1024.0),
-        );
-        if !missed_labels.is_empty() {
-            warn!(
-                "[CUDA BUILD] {} device-ptr labels did not match any Input node (first 10): {:?}",
-                missed_labels.len(),
-                &missed_labels[..missed_labels.len().min(10)]
-            );
-            let available: Vec<&String> = label_map.keys().take(10).collect();
-            warn!(
-                "[CUDA BUILD] Available label_map keys (first 10): {:?}",
-                available
-            );
-        }
-
-        // Set dummy 1.0 data for remaining Input nodes (user inputs, constants without
-        // device pointers) for safe search profiling.
-        // IMPORTANT: Must use 1.0, NOT 0.0. Zero inputs cause NaN in many ops:
-        //   - fmod(0, 0) = NaN  (Mod)
-        //   - recip(0) = inf → weight * inf = NaN  (Div)
-        //   - log(0) = -inf  (Pow)
-        //   - chain ops with zero produce NaN  (Erf)
-        let mut dummy_total_elements = 0usize;
-        let mut dummy_count = 0usize;
-        for node_id in graph.graph.node_indices() {
-            if device_ptr_nodes.contains(&node_id) {
-                continue;
-            }
-            if let Some(input) = (*graph.graph[node_id])
-                .as_any()
-                .downcast_ref::<luminal::hlir::Input>()
-            {
-                if let Some(&n) = weight_data.tensor_sizes.get(&input.label) {
-                    if n > 0 {
-                        dummy_total_elements += n;
-                        dummy_count += 1;
-                        rt.set_data(node_id, vec![1.0f32; n]);
-                    }
-                }
-            }
-        }
-        trace!(
-            "[CUDA BUILD] Dummy data: {} nodes, {} elements ({:.3} GiB as f32)",
-            dummy_count,
-            dummy_total_elements,
-            (dummy_total_elements * 4) as f64 / (1024.0 * 1024.0 * 1024.0),
-        );
-
-        // Search (device-pointer weights are used directly; dummy data for the rest)
-        let mut rt = graph.search(rt, search_iters);
-
-        // Load real weight data for non-device-ptr weights (constants from PT2 archive, etc.)
-        let mut loaded_weight_elements = 0usize;
-        let mut loaded_weight_count = 0usize;
-        for (label, data) in &weight_data.weights {
-            if !device_ptrs.contains_key(label) {
-                if let Some(&node_id) = label_map.get(label) {
-                    loaded_weight_elements += data.len();
-                    loaded_weight_count += 1;
-                    rt.set_data(node_id, data.clone());
-                }
-            }
-        }
-        trace!(
-            "[CUDA BUILD] Post-search weight load: {} weights, {} elements ({:.3} GiB as f32)",
-            loaded_weight_count,
-            loaded_weight_elements,
-            (loaded_weight_elements * 4) as f64 / (1024.0 * 1024.0 * 1024.0),
-        );
-
-        Ok(RuntimeBackend::Cuda(Box::new(rt)))
-    }
-
-    fn build_native_backend(
-        graph: &mut Graph,
-        weight_data: &WeightData,
-        search_iters: usize,
-    ) -> Result<RuntimeBackend, String> {
-        graph.build_search_space::<NativeRuntime>();
-        let mut rt = graph.search(NativeRuntime::default(), search_iters);
-
-        // Load weight data after search
-        let label_map = CompiledGraph::build_label_map(graph);
-        for (label, data) in &weight_data.weights {
-            if let Some(&node_id) = label_map.get(label) {
-                rt.set_data(node_id, data.clone());
-            }
-        }
-
-        Ok(RuntimeBackend::Native(rt))
-    }
 }

 #[pymethods]
@@ -283,6 +207,24 @@ impl CompiledGraph {
        self.input_names.clone()
    }

+    /// Get the PT2 dtype codes for all inputs (in order of input_names).
+    #[getter]
+    fn input_dtypes(&self) -> Vec<u32> {
+        self.input_names
+            .iter()
+            .map(|name| {
+                if let Some(&node_id) = self.tensor_ids.get(name)
+                    && let Some(input) = (*self.graph.graph[node_id])
+                        .as_any()
+                        .downcast_ref::<luminal::hlir::Input>()
+                {
+                    return luminal_dtype_to_pt2_code(input.dtype);
+                }
+                7 // default to f32
+            })
+            .collect()
+    }
+
    /// Get the list of output tensor names.
    #[getter]
    fn output_names(&self) -> Vec<String> {
@@ -301,12 +243,24 @@ impl CompiledGraph {
        self.tensor_ids.keys().cloned().collect()
    }

-    /// Get the name of the active backend (native or cuda).
+    /// Get the name of the active backend.
    #[getter]
-    fn backend(&self) -> &'static str {
+    fn backend(&self) -> &str {
        self.runtime.name()
    }

+    /// The device type this backend operates on (e.g. "cpu", "cuda").
+    #[getter]
+    fn device_type(&self) -> &str {
+        self.runtime.device_type()
+    }
+
+    /// Whether the active backend supports device pointer operations (zero-copy GPU I/O).
+    #[getter]
+    fn supports_device_ptrs(&self) -> bool {
+        self.runtime.supports_device_ptrs()
+    }
+
    /// Whether this graph has dynamic (symbolic) dimensions.
    #[getter]
    fn has_dynamic_dims(&self) -> bool {
@@ -333,17 +287,27 @@ impl CompiledGraph {
    }

    /// Auto-detect and set dynamic dimensions from input tensor shapes.
-    /// For each user input, matches the concrete shape against its symbolic
-    /// shape expressions and sets the corresponding dyn_map entries.
+    ///
+    /// For each user input we walk the symbolic shape expressions side-by-side
+    /// with the concrete sizes Dynamo handed us at runtime and try to recover
+    /// each unbound variable's value. Two cases are handled:
+    ///
+    ///   * Bare-variable dim (`s`): set directly from the size.
+    ///   * Single-variable affine dim (`a*s + b`): solve `s = (size - b)/a`
+    ///     by sampling the expression at two probe points to extract the
+    ///     slope, recovering the intercept, and verifying that plugging the
+    ///     recovered value back through `exec_single_var_checked` reproduces
+    ///     the observed size. The verification step rejects everything
+    ///     non-affine (`s*s`, `min(s, 8)`, etc.) without committing a wrong
+    ///     guess to `dyn_map`.
+    ///
+    /// Multi-variable dims are skipped here; another input's shape — or an
+    /// explicit `set_dim` call — is expected to bind those.
    fn auto_set_dims_from_input_shapes(&mut self, input_shapes: Vec<Vec<usize>>) {
        for (shape_exprs, shape) in self.input_shape_exprs.iter().zip(input_shapes.iter()) {
            for (dim_expr, &dim_val) in shape_exprs.iter().zip(shape.iter()) {
-                // Check if this expression is a bare symbolic variable
-                let terms = dim_expr.terms.read();
-                if terms.len() == 1
-                    && let luminal::shape::Term::Var(c) = terms[0]
-                {
-                    self.graph.set_dim(c, dim_val);
+                if let Some((var, value)) = solve_single_var_dim(dim_expr, dim_val) {
+                    self.graph.set_dim(var, value);
                }
            }
        }
@@ -371,100 +335,136 @@ impl CompiledGraph {
        Ok(result)
    }

-    /// Set input tensor data by name.
+    /// Set input tensor data by name (f32, for backward compatibility).
    fn set_input(&mut self, name: &str, data: Vec<f32>) -> PyResult<()> {
        let node_id = self.tensor_ids.get(name).ok_or_else(|| {
            PyErr::new::<pyo3::exceptions::PyKeyError, _>(format!("Unknown input tensor: {}", name))
        })?;
-        self.runtime.set_data(*node_id, data);
+        self.runtime.set_data_f32(*node_id, data);
        Ok(())
    }

-    /// Set input tensor data from a CPU host memory pointer (avoids Python list conversion).
-    /// The pointer must point to contiguous f32 data (from tensor.data_ptr() on a CPU float32 tensor).
-    fn set_input_from_ptr(&mut self, name: &str, ptr: u64, n_elements: usize) -> PyResult<()> {
+    /// Set input tensor data from a CPU host memory pointer (dtype-aware).
+    /// The pointer must point to contiguous data. `n_bytes` is the total byte count.
+    /// `dtype_code` uses PT2 numbering (7=f32, 6=f16, 13=bf16, etc.).
+    /// Converts source format to luminal's native format (e.g., i64→i32, f64→f32).
+    fn set_input_from_ptr(
+        &mut self,
+        name: &str,
+        ptr: u64,
+        n_bytes: usize,
+        dtype_code: u32,
+    ) -> PyResult<()> {
        debug_assert!(ptr != 0, "set_input_from_ptr called with null pointer");
        let node_id = self.tensor_ids.get(name).ok_or_else(|| {
            PyErr::new::<pyo3::exceptions::PyKeyError, _>(format!("Unknown input tensor: {}", name))
        })?;
-        let data: Vec<f32> =
-            unsafe { std::slice::from_raw_parts(ptr as *const f32, n_elements).to_vec() };
-        self.runtime.set_data(*node_id, data);
+        let raw_bytes = unsafe { std::slice::from_raw_parts(ptr as *const u8, n_bytes).to_vec() };
+        let typed = TypedData::from_pytorch_bytes(raw_bytes, dtype_code);
+        self.runtime
+            .set_data_bytes(*node_id, typed.bytes, typed.dtype);
        Ok(())
    }

-    /// Set input from a CUDA device pointer. Zero-copy on device.
-    /// The pointer must be a valid CUDA device allocation with at least n_bytes bytes.
-    #[cfg(feature = "cuda")]
+    /// Set input from a device pointer. Zero-copy on device.
+    /// The pointer must be a valid device allocation with at least n_bytes bytes.
+    /// Requires a GPU backend (e.g. CUDA).
    fn set_input_device_ptr(
        &mut self,
        name: &str,
        device_ptr: u64,
        n_bytes: usize,
    ) -> PyResult<()> {
+        if !self.runtime.supports_device_ptrs() {
+            return Err(pyo3::exceptions::PyValueError::new_err(
+                "set_input_device_ptr requires a GPU backend",
+            ));
+        }
        let node_id = self.tensor_ids.get(name).ok_or_else(|| {
            PyErr::new::<pyo3::exceptions::PyKeyError, _>(format!("Unknown input tensor: {}", name))
        })?;
-        match &mut self.runtime {
-            RuntimeBackend::Cuda(rt) => unsafe { rt.set_device_ptr(*node_id, device_ptr, n_bytes) },
-            _ => {
-                return Err(pyo3::exceptions::PyValueError::new_err(
-                    "set_input_device_ptr requires CUDA backend",
-                ));
-            }
-        }
+        unsafe { self.runtime.set_device_ptr(*node_id, device_ptr, n_bytes) };
        Ok(())
    }

-    /// Mark an input tensor as persistent (survives execute() calls).
-    /// Call this for weight tensors that should not be consumed after each execution.
-    fn persist_input(&mut self, name: &str) -> PyResult<()> {
-        let _node_id = *self.tensor_ids.get(name).ok_or_else(|| {
-            PyErr::new::<pyo3::exceptions::PyKeyError, _>(format!("Unknown input tensor: {}", name))
-        })?;
-        match &mut self.runtime {
-            #[cfg(feature = "cuda")]
-            RuntimeBackend::Cuda(rt) => rt.persist_hlir_node(_node_id),
-            RuntimeBackend::Native(_) => {} // Native: persist is handled at graph level
-        }
-        Ok(())
-    }
-
-    /// Set a weight tensor from a CUDA device pointer, matching by Input node label.
-    /// Also marks the weight as persistent. For PT2 weights (e.g. "fc1.weight").
-    #[cfg(feature = "cuda")]
+    /// Set a weight from a device pointer (e.g. "fc1.weight"). Zero-copy on device.
+    /// Requires a GPU backend.
    fn set_weight_device_ptr(
        &mut self,
        label: &str,
        device_ptr: u64,
        n_bytes: usize,
    ) -> PyResult<()> {
+        if !self.runtime.supports_device_ptrs() {
+            return Err(pyo3::exceptions::PyValueError::new_err(
+                "set_weight_device_ptr requires a GPU backend",
+            ));
+        }
        let &node_id = self.label_map.get(label).ok_or_else(|| {
            pyo3::exceptions::PyKeyError::new_err(format!("No Input node with label: {}", label))
        })?;
-        match &mut self.runtime {
-            RuntimeBackend::Cuda(rt) => {
-                unsafe { rt.set_device_ptr(node_id, device_ptr, n_bytes) };
-                rt.persist_hlir_node(node_id);
-            }
-            _ => {
-                return Err(pyo3::exceptions::PyValueError::new_err(
-                    "set_weight_device_ptr requires CUDA backend",
-                ));
-            }
-        }
+        unsafe { self.runtime.set_device_ptr(node_id, device_ptr, n_bytes) };
        Ok(())
    }

-    /// Set a weight tensor from a CPU host pointer, matching by Input node label.
-    fn set_weight_from_ptr(&mut self, label: &str, ptr: u64, n_elements: usize) -> PyResult<()> {
+    /// Register an external device pointer for an output tensor (zero-copy output).
+    /// Call before run() — the runtime will write kernel results directly into this buffer.
+    /// For aliased outputs (in-place ops), falls back to DtoD copy; check output_is_zero_copy() after run().
+    /// Requires a GPU backend.
+    fn set_output_device_ptr(
+        &mut self,
+        name: &str,
+        device_ptr: u64,
+        n_bytes: usize,
+    ) -> PyResult<()> {
+        if !self.runtime.supports_device_ptrs() {
+            return Err(pyo3::exceptions::PyValueError::new_err(
+                "set_output_device_ptr requires a GPU backend",
+            ));
+        }
+        let node_id = self.tensor_ids.get(name).ok_or_else(|| {
+            PyErr::new::<pyo3::exceptions::PyKeyError, _>(format!(
+                "Unknown output tensor: {}",
+                name
+            ))
+        })?;
+        unsafe {
+            self.runtime
+                .set_output_device_ptr(*node_id, device_ptr, n_bytes)
+        };
+        Ok(())
+    }
+
+    /// Check whether an output tensor was zero-copied (written directly to the registered pointer).
+    /// Returns false for aliased outputs that need a fallback DtoD copy, or if no GPU backend.
+    /// Must be called after run().
+    fn output_is_zero_copy(&self, name: &str) -> PyResult<bool> {
+        let node_id = self.tensor_ids.get(name).ok_or_else(|| {
+            PyErr::new::<pyo3::exceptions::PyKeyError, _>(format!(
+                "Unknown output tensor: {}",
+                name
+            ))
+        })?;
+        Ok(self.runtime.output_is_zero_copy(*node_id))
+    }
+
+    /// Set a weight tensor from a CPU host pointer, matching by Input node label (dtype-aware).
+    /// `n_bytes` is the total byte count. `dtype_code` uses PT2 numbering (7=f32, 6=f16, 13=bf16, etc.).
+    fn set_weight_from_ptr(
+        &mut self,
+        label: &str,
+        ptr: u64,
+        n_bytes: usize,
+        dtype_code: u32,
+    ) -> PyResult<()> {
        debug_assert!(ptr != 0, "set_weight_from_ptr called with null pointer");
        let &node_id = self.label_map.get(label).ok_or_else(|| {
            pyo3::exceptions::PyKeyError::new_err(format!("No Input node with label: {}", label))
        })?;
-        let data: Vec<f32> =
-            unsafe { std::slice::from_raw_parts(ptr as *const f32, n_elements).to_vec() };
-        self.runtime.set_data(node_id, data);
+        let bytes = unsafe { std::slice::from_raw_parts(ptr as *const u8, n_bytes).to_vec() };
+        let typed = TypedData::from_pytorch_bytes(bytes, dtype_code);
+        self.runtime
+            .set_data_bytes(node_id, typed.bytes, typed.dtype);
        Ok(())
    }

@@ -480,7 +480,13 @@ impl CompiledGraph {
        })
    }

-    /// Get output tensor data by name (copies to host).
+    /// Get the PT2 dtype codes for all outputs (in order).
+    #[getter]
+    fn output_dtypes(&self) -> Vec<u32> {
+        self.output_dtypes.clone()
+    }
+
+    /// Get output tensor data by name as f32 (copies to host).
    fn get_output(&self, name: &str) -> PyResult<Vec<f32>> {
        let node_id = self.tensor_ids.get(name).ok_or_else(|| {
            PyErr::new::<pyo3::exceptions::PyKeyError, _>(format!(
@@ -488,27 +494,50 @@ impl CompiledGraph {
                name
            ))
        })?;
-        Ok(self.runtime.get_f32(*node_id))
+        Ok(self.runtime.get_output_f32(*node_id))
    }

-    /// Copy output tensor data directly to a CUDA device pointer (DtoD).
-    /// Avoids the DtoH + HtoD round-trip of get_output() + .to(device).
-    #[cfg(feature = "cuda")]
-    fn copy_output_to_device_ptr(&self, name: &str, dest_ptr: u64, n_bytes: usize) -> PyResult<()> {
+    /// Get output tensor data by name as i32 (copies to host).
+    fn get_output_i32(&self, name: &str) -> PyResult<Vec<i32>> {
        let node_id = self.tensor_ids.get(name).ok_or_else(|| {
            PyErr::new::<pyo3::exceptions::PyKeyError, _>(format!(
                "Unknown output tensor: {}",
                name
            ))
        })?;
-        match &self.runtime {
-            RuntimeBackend::Cuda(rt) => {
-                unsafe { rt.copy_output_to_device_ptr(*node_id, dest_ptr, n_bytes) };
-                Ok(())
-            }
-            _ => Err(pyo3::exceptions::PyValueError::new_err(
-                "copy_output_to_device_ptr requires CUDA backend",
-            )),
+        Ok(self.runtime.get_output_i32(*node_id))
+    }
+
+    /// Get output tensor data by name as bool (copies to host).
+    fn get_output_bool(&self, name: &str) -> PyResult<Vec<bool>> {
+        let node_id = self.tensor_ids.get(name).ok_or_else(|| {
+            PyErr::new::<pyo3::exceptions::PyKeyError, _>(format!(
+                "Unknown output tensor: {}",
+                name
+            ))
+        })?;
+        Ok(self.runtime.get_output_bool(*node_id))
+    }
+
+    /// Copy output tensor data directly to a device pointer (DtoD).
+    /// Avoids the DtoH + HtoD round-trip of get_output() + .to(device).
+    /// Requires a GPU backend.
+    fn copy_output_to_device_ptr(&self, name: &str, dest_ptr: u64, n_bytes: usize) -> PyResult<()> {
+        if !self.runtime.supports_device_ptrs() {
+            return Err(pyo3::exceptions::PyValueError::new_err(
+                "copy_output_to_device_ptr requires a GPU backend",
+            ));
        }
+        let node_id = self.tensor_ids.get(name).ok_or_else(|| {
+            PyErr::new::<pyo3::exceptions::PyKeyError, _>(format!(
+                "Unknown output tensor: {}",
+                name
+            ))
+        })?;
+        unsafe {
+            self.runtime
+                .copy_output_to_device_ptr(*node_id, dest_ptr, n_bytes)
+        };
+        Ok(())
    }
 }
--- a/crates/luminal_python/rust/src/dispatch.rs
+++ b/crates/luminal_python/rust/src/dispatch.rs
@@ -1,248 +0,0 @@
-use std::collections::HashMap;
-
-use luminal::{prelude::*, shape::Expression};
-use onnx_protobuf::NodeProto;
-
-use crate::ops_parse::*;
-
-pub fn process_onnx_nodes(
-    nodes: &[NodeProto],
-    tensors: &mut HashMap<String, GraphTensor>,
-    cx: &mut Graph,
-    weight_data: &mut Vec<(String, Vec<f32>)>,
-    known_values: &mut HashMap<String, Vec<f32>>,
-    shape_exprs: &mut HashMap<String, Vec<Expression>>,
-) -> Result<(), String> {
-    for node in nodes {
-        match node.op_type.as_str() {
-            "Add" => parse_binary_broadcast_op(
-                node,
-                tensors,
-                "Add",
-                |a, b| a + b,
-                shape_exprs,
-                known_values,
-            )?,
-            "Mod" => parse_binary_broadcast_op(
-                node,
-                tensors,
-                "Mod",
-                |a, b| a % b,
-                shape_exprs,
-                known_values,
-            )?,
-            "Sub" => parse_binary_broadcast_op(
-                node,
-                tensors,
-                "Sub",
-                |a, b| a - b,
-                shape_exprs,
-                known_values,
-            )?,
-            "Mul" => parse_binary_broadcast_op(
-                node,
-                tensors,
-                "Mul",
-                |a, b| a * b,
-                shape_exprs,
-                known_values,
-            )?,
-            "Div" => parse_binary_broadcast_op(
-                node,
-                tensors,
-                "Div",
-                |a, b| a / b,
-                shape_exprs,
-                known_values,
-            )?,
-            "Sqrt" => parse_unary_op(node, tensors, "Sqrt", |a| a.sqrt())?,
-            "Transpose" => parse_transpose_node(node, tensors)?,
-            "Concat" => parse_concat_node(node, tensors, shape_exprs, known_values)?,
-            "Floor" => parse_floor_node(node, tensors)?,
-            "Ceil" => parse_ceil_node(node, tensors)?,
-            "Sin" => parse_unary_op(node, tensors, "Sin", |a| a.sin())?,
-            "Neg" => parse_unary_op(node, tensors, "Neg", |a| -a)?,
-            "Cos" => parse_unary_op(node, tensors, "Cos", |a| a.cos())?,
-            "Pow" => parse_binary_broadcast_op(
-                node,
-                tensors,
-                "Pow",
-                |a, b| a.pow(b),
-                shape_exprs,
-                known_values,
-            )?,
-            "Sigmoid" => parse_unary_op(node, tensors, "Sigmoid", |a| a.sigmoid())?,
-            "Tanh" => parse_unary_op(node, tensors, "Tanh", |a| a.tanh())?,
-            "Relu" => parse_unary_op(node, tensors, "Relu", |a| a.relu())?,
-            "Softmax" => parse_softmax_node(node, tensors)?,
-            "Abs" => parse_unary_op(node, tensors, "Abs", |a| a.abs())?,
-            "Reciprocal" => parse_unary_op(node, tensors, "Reciprocal", |a| a.reciprocal())?,
-            "Clip" => parse_clip_node(node, tensors, known_values)?,
-            "Equal" => parse_binary_broadcast_op(
-                node,
-                tensors,
-                "Equal",
-                |a, b| a.eq(b),
-                shape_exprs,
-                known_values,
-            )?,
-            "Where" => parse_where_node(node, tensors)?,
-            "Constant" => {
-                parse_constant_node(node, tensors, cx, weight_data, known_values, shape_exprs)?
-            }
-            "ConstantOfShape" => {
-                parse_constant_of_shape(node, tensors, cx, weight_data, known_values, shape_exprs)?
-            }
-            "Cast" => parse_cast_node(node, tensors, weight_data, known_values, shape_exprs)?,
-            "MatMul" => parse_matmul_node(node, tensors)?,
-            "Reshape" => parse_reshape_node(node, tensors, known_values, shape_exprs)?,
-            "Shape" => parse_shape_node(node, tensors, cx, weight_data, known_values, shape_exprs)?,
-            "Gather" => {
-                parse_gather_node(node, tensors, cx, weight_data, known_values, shape_exprs)?
-            }
-            "GatherND" => parse_gathernd_node(node, tensors, cx, weight_data, known_values)?,
-            "Less" => parse_binary_broadcast_op(
-                node,
-                tensors,
-                "Less",
-                |a, b| a.lt(b),
-                shape_exprs,
-                known_values,
-            )?,
-            "Greater" => parse_binary_broadcast_op(
-                node,
-                tensors,
-                "Greater",
-                |a, b| b.lt(a),
-                shape_exprs,
-                known_values,
-            )?,
-            "LessOrEqual" => parse_binary_broadcast_op(
-                node,
-                tensors,
-                "LessOrEqual",
-                |a, b| a.le(b),
-                shape_exprs,
-                known_values,
-            )?,
-            "GreaterOrEqual" => parse_binary_broadcast_op(
-                node,
-                tensors,
-                "GreaterOrEqual",
-                |a, b| a.ge(b),
-                shape_exprs,
-                known_values,
-            )?,
-            "Not" => parse_not_node(node, tensors)?,
-            "And" => parse_binary_broadcast_op(
-                node,
-                tensors,
-                "And",
-                |a, b| a.cast(DType::F32) * b.cast(DType::F32),
-                shape_exprs,
-                known_values,
-            )?,
-            "Or" => parse_binary_broadcast_op(
-                node,
-                tensors,
-                "Or",
-                |a, b| (a.cast(DType::F32) + b.cast(DType::F32)).minimum_f32(1.0),
-                shape_exprs,
-                known_values,
-            )?,
-            "Xor" => parse_binary_broadcast_op(
-                node,
-                tensors,
-                "Xor",
-                |a, b| a.ne(b),
-                shape_exprs,
-                known_values,
-            )?,
-            "Min" => parse_variadic_broadcast_op(
-                node,
-                tensors,
-                "Min",
-                |a, b| a.minimum(b),
-                shape_exprs,
-                known_values,
-            )?,
-            "Max" => parse_variadic_broadcast_op(
-                node,
-                tensors,
-                "Max",
-                |a, b| a.maximum(b),
-                shape_exprs,
-                known_values,
-            )?,
-            "Identity" => parse_identity(node, tensors, known_values, shape_exprs)?,
-            "Unsqueeze" => parse_unsqueeze_node(node, tensors, known_values, shape_exprs)?,
-            "Squeeze" => parse_squeeze_node(node, tensors, known_values, shape_exprs)?,
-            "ReduceSum" => parse_reduce_op(
-                node,
-                tensors,
-                known_values,
-                "ReduceSum",
-                |t, axes| t.sum(axes),
-                |flat, _n| flat.sum(1),
-            )?,
-            "ReduceMax" => parse_reduce_op(
-                node,
-                tensors,
-                known_values,
-                "ReduceMax",
-                |t, axes| t.max(axes),
-                |flat, _n| flat.max(1),
-            )?,
-            "ReduceMin" => parse_reduce_op(
-                node,
-                tensors,
-                known_values,
-                "ReduceMin",
-                |t, axes| t.min(axes),
-                |flat, _n| flat.min(1),
-            )?,
-            "ReduceMean" => parse_reduce_op(
-                node,
-                tensors,
-                known_values,
-                "ReduceMean",
-                |t, axes| t.mean(axes),
-                |flat, n| flat.sum(1) / n as f32,
-            )?,
-            "Trilu" => parse_trilu_node(node, tensors, cx, known_values)?,
-            "GatherElements" => parse_gather_elements_node(node, tensors)?,
-            "ScatterElements" => parse_scatter_elements_node(node, tensors)?,
-            "ScatterND" => parse_scatter_nd_node(node, tensors)?,
-            "Expand" => parse_expand_node(node, tensors, known_values, shape_exprs)?,
-            "IsNaN" => parse_unary_op(node, tensors, "IsNaN", |a| a.ne(a))?,
-            "LayerNormalization" => parse_layernorm_node(node, tensors)?,
-            "Gemm" => parse_gemm_node(node, tensors)?,
-            "Erf" => parse_erf_node(node, tensors)?,
-            "Slice" => parse_slice_node(node, tensors, known_values, shape_exprs)?,
-            "Split" => parse_split_node(node, tensors, known_values)?,
-            "TopK" => parse_topk_node(node, tensors, known_values)?,
-            "OneHot" => parse_onehot_node(node, tensors, known_values)?,
-            "Range" => parse_range_node(node, tensors, cx, weight_data, known_values, shape_exprs)?,
-            "CumSum" => parse_cumsum_node(node, tensors, known_values)?,
-            "Gelu" => parse_unary_op(node, tensors, "Gelu", |a| a.gelu())?,
-            "Conv" => parse_conv_node(node, tensors)?,
-            "Pad" => parse_pad_node(node, tensors, known_values)?,
-            "Resize" => parse_resize_node(node, tensors, known_values)?,
-            "Tile" => parse_tile_node(node, tensors, known_values)?,
-            "ReduceL2" => parse_reduce_op(
-                node,
-                tensors,
-                known_values,
-                "ReduceL2",
-                |t, axes| (t * t).sum(axes).sqrt(),
-                |flat, _n| (flat * flat).sum(1).sqrt(),
-            )?,
-            "GroupNormalization" => parse_group_norm_node(node, tensors)?,
-            _ => {
-                panic!("Missing Node {}", node.op_type)
-            }
-        }
-    }
-
-    Ok(())
-}
--- a/crates/luminal_python/rust/src/lib.rs
+++ b/crates/luminal_python/rust/src/lib.rs
@@ -1,9 +1,5 @@
 mod compiled_graph;
-mod dispatch;
-mod onnx_translator;
-mod ops_parse;
-mod runtime;
-mod util;
+pub mod typed_data;

 // PT2 modules
 mod pt2_compiled_model;
@@ -15,59 +11,40 @@ mod translator;
 use compiled_graph::CompiledGraph;
 use pt2_compiled_model::process_pt2;
 use pyo3::prelude::*;
-use std::collections::HashMap;
-
-fn validate_backend(backend: &str) -> PyResult<()> {
-    match backend {
-        "native" => Ok(()),
-        #[cfg(feature = "cuda")]
-        "cuda" => Ok(()),
-        #[cfg(not(feature = "cuda"))]
-        "cuda" => Err(pyo3::exceptions::PyValueError::new_err(
-            "CUDA backend requested, but this luminal extension was built without the `cuda` feature. Rebuild with `maturin develop --features cuda -r` or use backend='native'.",
-        )),
-        _ => {
-            #[cfg(feature = "cuda")]
-            {
-                Err(pyo3::exceptions::PyValueError::new_err(format!(
-                    "Invalid backend '{}'. Must be 'native' or 'cuda'",
-                    backend
-                )))
-            }
-            #[cfg(not(feature = "cuda"))]
-            {
-                Err(pyo3::exceptions::PyValueError::new_err(format!(
-                    "Invalid backend '{}'. This build only supports 'native'. Rebuild with the `cuda` feature to enable 'cuda'.",
-                    backend
-                )))
-            }
-        }
-    }
-}
-
-#[pyfunction]
-#[pyo3(signature = (path, backend="native", search_iters=10, weight_device_ptrs=None))]
-fn process_onnx(
-    path: &str,
-    backend: &str,
-    search_iters: usize,
-    weight_device_ptrs: Option<HashMap<String, (u64, usize)>>,
-) -> PyResult<CompiledGraph> {
-    validate_backend(backend)?;
-
-    onnx_translator::compile_onnx(
-        path,
-        backend,
-        weight_device_ptrs.unwrap_or_default(),
-        search_iters,
-    )
-    .map_err(pyo3::exceptions::PyRuntimeError::new_err)
-}
+use pyo3::types::PyCapsule;

 #[pymodule]
 fn luminal(m: &Bound<'_, PyModule>) -> PyResult<()> {
-    m.add_function(wrap_pyfunction!(process_onnx, m)?)?;
    m.add_function(wrap_pyfunction!(process_pt2, m)?)?;
    m.add_class::<CompiledGraph>()?;
+    m.add_function(wrap_pyfunction!(_native_factory_capsule, m)?)?;
+    #[cfg(feature = "cuda")]
+    m.add_function(wrap_pyfunction!(_cuda_lite_factory_capsule, m)?)?;
    Ok(())
 }
+
+// ---------------------------------------------------------------------------
+// Factory capsule helpers
+// ---------------------------------------------------------------------------
+
+/// Wrapper to put a function pointer into a PyCapsule.
+#[allow(dead_code)]
+struct FnPtrWrapper(pub *const std::ffi::c_void);
+unsafe impl Send for FnPtrWrapper {}
+
+/// PyCapsule wrapping the native (CPU) backend factory.
+#[pyfunction]
+fn _native_factory_capsule<'py>(py: Python<'py>) -> PyResult<Bound<'py, PyCapsule>> {
+    let fptr = ::luminal::dyn_backend::native_factory as *const std::ffi::c_void;
+    let name = ::luminal::dyn_backend::BACKEND_FACTORY_CAPSULE_NAME.to_owned();
+    PyCapsule::new(py, FnPtrWrapper(fptr), Some(name))
+}
+
+/// PyCapsule wrapping the cuda_lite backend factory.
+#[cfg(feature = "cuda")]
+#[pyfunction]
+fn _cuda_lite_factory_capsule<'py>(py: Python<'py>) -> PyResult<Bound<'py, PyCapsule>> {
+    let fptr = luminal_cuda_lite::dyn_backend::cuda_lite_factory as *const std::ffi::c_void;
+    let name = ::luminal::dyn_backend::BACKEND_FACTORY_CAPSULE_NAME.to_owned();
+    PyCapsule::new(py, FnPtrWrapper(fptr), Some(name))
+}
--- a/crates/luminal_python/rust/src/onnx_translator.rs
+++ b/crates/luminal_python/rust/src/onnx_translator.rs
@@ -1,283 +0,0 @@
-use luminal::{
-    prelude::{
-        tracing::{Level, span, trace},
-        *,
-    },
-    shape::Expression,
-};
-use onnx_protobuf::ModelProto;
-use protobuf::Message;
-use std::{
-    collections::{HashMap, HashSet},
-    fs,
-    path::Path,
-};
-
-use crate::{
-    compiled_graph::{CompiledGraph, GraphTranslation, WeightData},
-    dispatch::process_onnx_nodes,
-    util::{
-        DimParamMap, get_shape_for_onnx_value, get_shape_for_onnx_value_expr,
-        load_all_tensor_floats, load_initializer_as_f32,
-    },
-};
-
-/// Load, validate, translate, and compile an ONNX model.
-///
-/// This is the ONNX counterpart of `pt2_compiled_model::compile_pt2()`.
-pub fn compile_onnx(
-    path: &str,
-    backend: &str,
-    weight_device_ptrs: HashMap<String, (u64, usize)>,
-    search_iters: usize,
-) -> Result<CompiledGraph, String> {
-    let data = fs::read(path).map_err(|e| format!("Failed to read file: {}", e))?;
-    let model_directory = Path::new(path).parent().unwrap_or(Path::new("."));
-    let model = ModelProto::parse_from_bytes(&data)
-        .map_err(|e| format!("Failed to parse ONNX model: {}", e))?;
-
-    let opset_version = model
-        .opset_import
-        .iter()
-        .find(|entry| entry.domain.is_empty())
-        .map(|entry| entry.version);
-
-    match opset_version {
-        Some(20) => {}
-        Some(v) => {
-            return Err(format!(
-                "Unsupported ONNX opset version {v}. Only opset 20 is supported."
-            ));
-        }
-        None => {
-            return Err(
-                "No ONNX opset version found in model. Only opset 20 is supported.".to_string(),
-            );
-        }
-    }
-
-    let (translation, mut weights) = translate_onnx(model, model_directory)?;
-    weights.device_ptrs = weight_device_ptrs;
-    CompiledGraph::parse_graph(translation, weights, backend, search_iters)
-}
-
-/// Translate an ONNX model into a format-neutral GraphTranslation + WeightData.
-pub fn translate_onnx(
-    model: ModelProto,
-    model_directory: &Path,
-) -> Result<(GraphTranslation, WeightData), String> {
-    let _span = span!(Level::TRACE, "ONNX Graph Translation").entered();
-    let onnx_graph = &model.graph;
-    let mut cx = Graph::new();
-    let mut tensors: HashMap<String, GraphTensor> = HashMap::new();
-
-    // Dynamic dimension tracking
-    let mut dim_param_map: DimParamMap = HashMap::new();
-    let mut next_char = 'a';
-
-    // Separate initializers (weights) from true user inputs
-    let initializer_names: HashSet<&str> = onnx_graph
-        .initializer
-        .iter()
-        .map(|t| t.name.as_str())
-        .collect();
-
-    let input_names: Vec<String> = onnx_graph
-        .input
-        .iter()
-        .filter(|inp| !initializer_names.contains(inp.name.as_str()))
-        .map(|inp| inp.name.clone())
-        .collect();
-
-    // Create input tensors with dynamic dimension support
-    for input in &onnx_graph.input {
-        let shape_exprs = get_shape_for_onnx_value_expr(input, &mut dim_param_map, &mut next_char);
-        if shape_exprs.is_empty() {
-            let shape = get_shape_for_onnx_value(input);
-            if shape.is_empty() {
-                trace!("Input {} skipped because it is empty", input.name.clone());
-                continue;
-            }
-            let tensor = cx.named_tensor(input.name.clone(), shape);
-            trace!("Input {} added to tensors", input.name.clone());
-            tensors.insert(input.name.clone(), tensor);
-            continue;
-        }
-        let tensor = cx.named_tensor(input.name.clone(), shape_exprs);
-        trace!("Input {} added to tensors", input.name.clone());
-        tensors.insert(input.name.clone(), tensor);
-    }
-
-    // Create initializer (weight) tensors
-    for init in &onnx_graph.initializer {
-        if !tensors.contains_key(&init.name) {
-            let mut shape: Vec<usize> = init.dims.iter().map(|&d| d as usize).collect();
-            if shape.is_empty() {
-                shape = vec![1];
-            }
-            let tensor = cx.named_tensor(init.name.clone(), shape);
-            tensors.insert(init.name.clone(), tensor);
-        }
-    }
-
-    // Load small constants for constant folding
-    let mut known_values: HashMap<String, Vec<f32>> = HashMap::new();
-    for init in &onnx_graph.initializer {
-        let n_elements: usize = init
-            .dims
-            .iter()
-            .map(|&d| d as usize)
-            .product::<usize>()
-            .max(1);
-        if n_elements <= 32 {
-            if let Some(floats) = load_initializer_as_f32(init) {
-                known_values.insert(init.name.clone(), floats);
-            } else {
-                panic!("Unable to load initializer values for {:?}", init.name);
-            }
-        }
-    }
-
-    // Shape expressions for propagating symbolic shapes through ONNX graphs
-    let mut shape_exprs: HashMap<String, Vec<Expression>> = HashMap::new();
-
-    // Accumulates constant node data from process_onnx_nodes
-    let mut constant_data: Vec<(String, Vec<f32>)> = Vec::new();
-
-    // Process computation nodes
-    process_onnx_nodes(
-        &onnx_graph.node,
-        &mut tensors,
-        &mut cx,
-        &mut constant_data,
-        &mut known_values,
-        &mut shape_exprs,
-    )
-    .map_err(|e| format!("process_onnx_nodes failed: {}", e))?;
-
-    // Mark weight/constant tensors as persistent so their buffers survive execute()
-    for (name, gt) in &tensors {
-        if !input_names.contains(name) {
-            gt.persist();
-        }
-    }
-
-    // Mark graph outputs (must happen before build_search_space)
-    let mut output_names = Vec::new();
-    let mut output_shape_exprs = Vec::new();
-    for output_vi in &onnx_graph.output {
-        if let Some(&gt) = tensors.get(&output_vi.name) {
-            // Force contiguous if the shape tracker is a non-contiguous view
-            let gt = if gt.shape != gt.shape.contiguous() {
-                let contiguous = gt * 1.0;
-                tensors.insert(output_vi.name.clone(), contiguous);
-                contiguous
-            } else {
-                gt
-            };
-            gt.output();
-            let dims = gt.dims();
-            output_shape_exprs.push(dims.clone());
-
-            let shape: Vec<usize> = dims.iter().map(|d| d.to_usize().unwrap_or(1)).collect();
-            if shape.is_empty() {
-                return Err(format!(
-                    "Output tensor '{}' has no shape information in the ONNX model",
-                    output_vi.name
-                ));
-            }
-            output_names.push(output_vi.name.clone());
-        }
-    }
-
-    // Set initial dynamic dimension values from example input shapes
-    let has_dynamic = !dim_param_map.is_empty();
-    if has_dynamic {
-        for input in &onnx_graph.input {
-            if initializer_names.contains(input.name.as_str()) {
-                continue;
-            }
-            let concrete_shape = get_shape_for_onnx_value(input);
-            let expr_shape =
-                get_shape_for_onnx_value_expr(input, &mut dim_param_map, &mut next_char);
-            for (expr, concrete) in expr_shape.iter().zip(concrete_shape.iter()) {
-                if expr.to_usize().is_none()
-                    && let Some(ch) = dim_param_map
-                        .values()
-                        .find(|&&ch| Expression::from(ch) == *expr)
-                {
-                    cx.set_dim(*ch, *concrete);
-                }
-            }
-        }
-    }
-
-    // Build weight data: initializers + constants from process_onnx_nodes
-    let mut weights: Vec<(String, Vec<f32>)> = Vec::new();
-    for (name, floats) in load_all_tensor_floats(&onnx_graph.initializer, model_directory) {
-        if let Some(f) = floats {
-            weights.push((name, f));
-        }
-    }
-    weights.extend(constant_data);
-
-    // Build tensor sizes for CUDA dummy data allocation
-    let mut tensor_sizes: HashMap<String, usize> = HashMap::new();
-    for input in &onnx_graph.input {
-        if !initializer_names.contains(input.name.as_str()) {
-            let shape = get_shape_for_onnx_value(input);
-            let n: usize = shape.iter().product::<usize>().max(1);
-            tensor_sizes.insert(input.name.clone(), n);
-        }
-    }
-    for init in &onnx_graph.initializer {
-        let n: usize = init
-            .dims
-            .iter()
-            .map(|&d| d as usize)
-            .product::<usize>()
-            .max(1);
-        tensor_sizes.insert(init.name.clone(), n);
-    }
-    for (name, data) in &weights {
-        if !tensor_sizes.contains_key(name) {
-            tensor_sizes.insert(name.clone(), data.len());
-        }
-    }
-
-    // Collect tensor name → NodeIndex mapping
-    let tensor_ids: HashMap<String, NodeIndex> = tensors
-        .iter()
-        .map(|(name, gt)| (name.clone(), gt.id))
-        .collect();
-
-    // Build input_shape_exprs for user inputs (needed for auto-dim detection)
-    let input_shape_exprs: Vec<Vec<Expression>> = input_names
-        .iter()
-        .map(|name| {
-            if let Some(&gt) = tensors.get(name) {
-                gt.dims()
-            } else {
-                vec![]
-            }
-        })
-        .collect();
-
-    let translation = GraphTranslation {
-        graph: cx,
-        tensor_ids,
-        input_names,
-        output_names,
-        output_shape_exprs,
-        input_shape_exprs,
-        dim_param_map,
-    };
-
-    let weight_data = WeightData {
-        weights,
-        tensor_sizes,
-        device_ptrs: HashMap::new(),
-    };
-
-    Ok((translation, weight_data))
-}
--- a/crates/luminal_python/rust/src/ops_parse/binary.rs
+++ b/crates/luminal_python/rust/src/ops_parse/binary.rs
@@ -1,187 +0,0 @@
-use std::collections::HashMap;
-
-use luminal::{
-    prelude::{tracing::trace, *},
-    shape::Expression,
-};
-use onnx_protobuf::NodeProto;
-
-use crate::util::{broadcast_to_expr, compute_broadcast_shape_expr};
-
-/// Handle Where node: conditional select — output[i] = condition[i] ? x[i] : y[i]
-///
-/// ONNX Where uses numpy-style broadcasting across all three inputs.
-pub fn parse_where_node(
-    node: &NodeProto,
-    tensors: &mut HashMap<String, GraphTensor>,
-) -> Result<(), String> {
-    assert!(node.input.len() == 3, "Where should have 3 inputs");
-    let condition = *tensors
-        .get(&node.input[0])
-        .ok_or_else(|| format!("Where: missing condition tensor '{}'", node.input[0]))?;
-    let x = *tensors
-        .get(&node.input[1])
-        .ok_or_else(|| format!("Where: missing X tensor '{}'", node.input[1]))?;
-    let y = *tensors
-        .get(&node.input[2])
-        .ok_or_else(|| format!("Where: missing Y tensor '{}'", node.input[2]))?;
-
-    let output_name = &node.output[0];
-
-    // ONNX Where broadcasts all 3 inputs to a common shape
-    let bc_shape = compute_broadcast_shape_expr(
-        &condition.dims(),
-        &compute_broadcast_shape_expr(&x.dims(), &y.dims()),
-    );
-    let condition = broadcast_to_expr(condition, &bc_shape);
-    let x = broadcast_to_expr(x, &bc_shape);
-    let y = broadcast_to_expr(y, &bc_shape);
-
-    let result = x.cond(condition, y);
-    tensors.insert(output_name.clone(), result);
-    Ok(())
-}
-
-pub fn parse_binary_broadcast_op(
-    node: &NodeProto,
-    tensors: &mut HashMap<String, GraphTensor>,
-    op_name: &str,
-    op: impl Fn(GraphTensor, GraphTensor) -> GraphTensor,
-    shape_exprs: &mut HashMap<String, Vec<Expression>>,
-    known_values: &HashMap<String, Vec<f32>>,
-) -> Result<(), String> {
-    trace!("Starting parse: {} Node", op_name);
-    assert!(
-        node.input.len() == 2,
-        "{} should have 2 inputs, got {}",
-        op_name,
-        node.input.len()
-    );
-    assert!(
-        node.output.len() == 1,
-        "{} should have 1 output, got {}",
-        op_name,
-        node.output.len()
-    );
-    // Shape-only path: if any input is shape-only (not in tensors), do Expression arithmetic
-    let a_missing = !tensors.contains_key(&node.input[0]);
-    let b_missing = !tensors.contains_key(&node.input[1]);
-    if a_missing || b_missing {
-        // At least one input is shape-only. Do shape_exprs arithmetic and return.
-        let se_a = shape_exprs.get(&node.input[0]).cloned().or_else(|| {
-            known_values
-                .get(&node.input[0])
-                .map(|kv| kv.iter().map(|&v| Expression::from(v as usize)).collect())
-        });
-        let se_b = shape_exprs.get(&node.input[1]).cloned().or_else(|| {
-            known_values
-                .get(&node.input[1])
-                .map(|kv| kv.iter().map(|&v| Expression::from(v as usize)).collect())
-        });
-        if let (Some(se_a), Some(se_b)) = (se_a, se_b)
-            && se_a.len() == 1
-            && se_b.len() == 1
-        {
-            let result_expr = match op_name {
-                "Add" => Some(se_a[0] + se_b[0]),
-                "Sub" => Some(se_a[0] - se_b[0]),
-                "Mul" => Some(se_a[0] * se_b[0]),
-                "Div" => Some(se_a[0] / se_b[0]),
-                _ => None,
-            };
-            if let Some(expr) = result_expr {
-                shape_exprs.insert(node.output[0].clone(), vec![expr]);
-            }
-        }
-        trace!("Finished parse: {} Node (shape-only)", op_name);
-        return Ok(());
-    }
-
-    let a = *tensors
-        .get(&node.input[0])
-        .ok_or_else(|| format!("{}: missing input '{}'", op_name, node.input[0]))?;
-    let b = *tensors
-        .get(&node.input[1])
-        .ok_or_else(|| format!("{}: missing input '{}'", op_name, node.input[1]))?;
-    let broadcast_shape = compute_broadcast_shape_expr(&a.dims(), &b.dims());
-    let a_bc = broadcast_to_expr(a, &broadcast_shape);
-    let b_bc = broadcast_to_expr(b, &broadcast_shape);
-    let result = op(a_bc, b_bc);
-    tensors.insert(node.output[0].clone(), result);
-
-    // Propagate shape_exprs for scalar shape arithmetic (e.g., Add(1, seq_len))
-    // At least one input must be in shape_exprs; the other can come from known_values.
-    let has_shape_expr =
-        shape_exprs.contains_key(&node.input[0]) || shape_exprs.contains_key(&node.input[1]);
-    if has_shape_expr {
-        let se_a = shape_exprs.get(&node.input[0]).cloned().or_else(|| {
-            known_values
-                .get(&node.input[0])
-                .map(|kv| kv.iter().map(|&v| Expression::from(v as usize)).collect())
-        });
-        let se_b = shape_exprs.get(&node.input[1]).cloned().or_else(|| {
-            known_values
-                .get(&node.input[1])
-                .map(|kv| kv.iter().map(|&v| Expression::from(v as usize)).collect())
-        });
-        if let (Some(se_a), Some(se_b)) = (se_a, se_b)
-            && se_a.len() == 1
-            && se_b.len() == 1
-        {
-            let result_expr = match op_name {
-                "Add" => Some(se_a[0] + se_b[0]),
-                "Sub" => Some(se_a[0] - se_b[0]),
-                "Mul" => Some(se_a[0] * se_b[0]),
-                "Div" => Some(se_a[0] / se_b[0]),
-                _ => None,
-            };
-            if let Some(expr) = result_expr {
-                shape_exprs.insert(node.output[0].clone(), vec![expr]);
-            }
-        }
-    }
-
-    trace!("Finished parse: {} Node", op_name);
-    Ok(())
-}
-
-pub fn parse_variadic_broadcast_op(
-    node: &NodeProto,
-    tensors: &mut HashMap<String, GraphTensor>,
-    op_name: &str,
-    op: impl Fn(GraphTensor, GraphTensor) -> GraphTensor,
-    _shape_exprs: &mut HashMap<String, Vec<Expression>>,
-    _known_values: &HashMap<String, Vec<f32>>,
-) -> Result<(), String> {
-    trace!("Starting parse: {} Node", op_name);
-    assert!(
-        node.input.len() >= 2,
-        "{} needs at least two inputs, got {}",
-        op_name,
-        node.input.len()
-    );
-    assert!(
-        node.output.len() == 1,
-        "{} nodes only have one output, got {}",
-        op_name,
-        node.output.len()
-    );
-
-    let mut result = *tensors
-        .get(&node.input[0])
-        .ok_or_else(|| format!("{}: missing input tensor '{}'", op_name, node.input[0]))?;
-
-    for input_name in &node.input[1..] {
-        let rhs = *tensors
-            .get(input_name)
-            .ok_or_else(|| format!("{}: missing input tensor '{}'", op_name, input_name))?;
-        let broadcast_shape = compute_broadcast_shape_expr(&result.dims(), &rhs.dims());
-        let lhs_bc = broadcast_to_expr(result, &broadcast_shape);
-        let rhs_bc = broadcast_to_expr(rhs, &broadcast_shape);
-        result = op(lhs_bc, rhs_bc);
-    }
-
-    tensors.insert(node.output[0].clone(), result);
-    trace!("Finished parse: {} Node", op_name);
-    Ok(())
-}
--- a/crates/luminal_python/rust/src/ops_parse/convolution.rs
+++ b/crates/luminal_python/rust/src/ops_parse/convolution.rs
@@ -1,194 +0,0 @@
-use std::collections::HashMap;
-
-use luminal::{
-    prelude::{tracing::trace, *},
-    shape::Expression,
-};
-use onnx_protobuf::NodeProto;
-
-use crate::util::get_int_attr;
-
-/// Get an integer-list attribute from a node, with a default value applied per element.
-fn get_ints_attr(node: &NodeProto, name: &str, default_elem: i64, spatial: usize) -> Vec<usize> {
-    for attr in &node.attribute {
-        if attr.name == name {
-            return attr.ints.iter().map(|&v| v as usize).collect();
-        }
-    }
-    vec![default_elem as usize; spatial]
-}
-
-/// Parse an ONNX Conv node.
-///
-/// Supports N-dimensional convolution (1D, 2D, 3D) with group=1.
-/// Uses the unfold-based approach from `luminal_nn::ConvND`.
-///
-/// Input layout: [batch, C_in, spatial...]
-/// Weight layout: [C_out, C_in/group, kernel...]
-/// Optional bias: [C_out]
-pub fn parse_conv_node(
-    node: &NodeProto,
-    tensors: &mut HashMap<String, GraphTensor>,
-) -> Result<(), String> {
-    trace!("Starting parse: Conv Node");
-
-    assert!(
-        node.input.len() >= 2,
-        "Conv needs at least 2 inputs (X, W), got {}",
-        node.input.len()
-    );
-
-    let x = *tensors
-        .get(&node.input[0])
-        .ok_or_else(|| format!("Conv: missing input X '{}'", node.input[0]))?;
-    let w = *tensors
-        .get(&node.input[1])
-        .ok_or_else(|| format!("Conv: missing weight W '{}'", node.input[1]))?;
-    let bias = if node.input.len() > 2 && !node.input[2].is_empty() {
-        Some(
-            *tensors
-                .get(&node.input[2])
-                .ok_or_else(|| format!("Conv: missing bias B '{}'", node.input[2]))?,
-        )
-    } else {
-        None
-    };
-
-    let x_dims = x.dims();
-    let w_dims = w.dims();
-    let rank = x_dims.len();
-    assert!(
-        rank >= 3,
-        "Conv: input must be at least 3D (batch, channels, spatial...), got {rank}D"
-    );
-
-    let spatial = rank - 2; // number of spatial dimensions
-
-    // Parse attributes
-    let kernel_shape = get_ints_attr(node, "kernel_shape", 1, spatial);
-    let strides = get_ints_attr(node, "strides", 1, spatial);
-    let dilations = get_ints_attr(node, "dilations", 1, spatial);
-    let group = get_int_attr(node, "group", 1) as usize;
-
-    // Parse pads: ONNX format is [begin_0, begin_1, ..., end_0, end_1, ...]
-    let pads_flat = get_ints_attr(node, "pads", 0, 2 * spatial);
-    let mut pads_begin = vec![0usize; spatial];
-    let mut pads_end = vec![0usize; spatial];
-    if pads_flat.len() == 2 * spatial {
-        pads_begin[..spatial].copy_from_slice(&pads_flat[..spatial]);
-        pads_end[..spatial].copy_from_slice(&pads_flat[spatial..(spatial + spatial)]);
-    }
-
-    assert_eq!(
-        group, 1,
-        "Conv: only group=1 is currently supported, got {group}"
-    );
-
-    // Get channel dimensions
-    let ch_out = w_dims[0]
-        .to_usize()
-        .ok_or("Conv: weight C_out must be concrete")?;
-    let ch_in = x_dims[1]
-        .to_usize()
-        .ok_or("Conv: input C_in must be concrete")?;
-
-    let kernel_product: usize = kernel_shape.iter().product();
-
-    // Reshape weight from ONNX [C_out, C_in, *kernel] to [C_out, C_in * kernel_product]
-    let w_reshaped = {
-        let mut wt = w;
-        wt.shape = ShapeTracker::new(vec![ch_out, ch_in * kernel_product]);
-        wt
-    };
-
-    // Pad spatial dimensions
-    let mut padding: Vec<(Expression, Expression)> = vec![(0.into(), 0.into()); rank];
-    for i in 0..spatial {
-        let axis = 2 + i; // batch=0, channel=1, spatial starts at 2
-        padding[axis] = (
-            Expression::from(pads_begin[i]),
-            Expression::from(pads_end[i]),
-        );
-    }
-    let padded = x.pad(padding, 0.0);
-
-    // Build unfold parameters (ones for batch/channel, actual for spatial)
-    let mut kernel_full = vec![1usize; rank];
-    let mut stride_full = vec![1usize; rank];
-    let mut dilation_full = vec![1usize; rank];
-    for i in 0..spatial {
-        let axis = 2 + i;
-        kernel_full[axis] = kernel_shape[i];
-        stride_full[axis] = strides[i];
-        dilation_full[axis] = dilations[i];
-    }
-
-    let unfolded = padded.unfold(kernel_full, stride_full, dilation_full);
-    // unfolded shape: [win_N, win_C, win_spatial..., k_batch=1, k_chan=1, k_spatial...]
-    //                  (2*rank dimensions total)
-
-    // Step 1: Permute to [N, win_spatial..., C_in, k_batch, k_chan, k_spatial...]
-    // This groups: batch | output spatial | channel+kernel (for merging)
-    let mut perm: Vec<usize> = Vec::with_capacity(2 * rank);
-    perm.push(0); // win_N (batch)
-    perm.extend(2..2 + spatial); // win_spatial dims
-    perm.push(1); // win_C (= C_in)
-    perm.extend(rank..2 * rank); // all kernel dims: k_batch=1, k_chan=1, k_spatial...
-    let permuted = unfolded.permute(perm);
-
-    // Step 2: Capture output spatial dimensions (win_spatial sizes)
-    let output_spatial_dims: Vec<Expression> = permuted.dims()[1..1 + spatial].to_vec();
-
-    // Step 3: Merge all channel+kernel dims into one (C_in * kernel_product)
-    // From index (1+spatial) to end there are (1 + 2 + spatial) dims to merge
-    let mut patches = permuted;
-    let target_before_spatial_merge = 2 + spatial; // [N, spatial..., merged_patch]
-    while patches.dims().len() > target_before_spatial_merge {
-        let last = patches.dims().len();
-        patches = patches.merge_dims(last - 2, last - 1);
-    }
-    // patches: [N, spatial_0, ..., spatial_{s-1}, C_in * kernel_product]
-
-    // Step 4: Merge spatial dims into one
-    for _ in 1..spatial {
-        patches = patches.merge_dims(1, 2);
-    }
-    // patches: [N, spatial_product, C_in * kernel_product]
-
-    // Step 5: Matmul with weight
-    let mut out = patches.matmul(w_reshaped.permute((1, 0)));
-    // out: [N, spatial_product, C_out]
-
-    // Step 6: Restore spatial dimensions via split_dims
-    // Split from innermost spatial dim first (reverse order, skip outermost)
-    for i in (1..spatial).rev() {
-        out = out.split_dims(1, output_spatial_dims[i]);
-    }
-    // out: [N, spatial_0, spatial_1, ..., spatial_{s-1}, C_out]
-
-    // Step 7: Move C_out from last position to position 1 (after batch)
-    let mut final_order: Vec<usize> = Vec::with_capacity(2 + spatial);
-    final_order.push(0); // batch
-    final_order.push(1 + spatial); // C_out
-    final_order.extend(1..1 + spatial); // spatial dims
-    out = out.permute(final_order);
-    // out: [N, C_out, spatial_0, ..., spatial_{s-1}]
-
-    // Add bias if present: bias shape [C_out], broadcast to [1, C_out, 1, 1, ...]
-    if let Some(b) = bias {
-        let mut bias_expanded = b;
-        // Expand to [1, C_out, 1, 1, ...]
-        bias_expanded = bias_expanded.expand_dim(0, 1); // batch dim
-        for i in 0..spatial {
-            let out_dims = out.dims();
-            let spatial_size = out_dims[2 + i];
-            bias_expanded = bias_expanded.expand_dim(2 + i, spatial_size);
-        }
-        out += bias_expanded;
-    }
-
-    tensors.insert(node.output[0].clone(), out);
-
-    trace!("Finished parse: Conv Node");
-    Ok(())
-}
--- a/crates/luminal_python/rust/src/ops_parse/matmul.rs
+++ b/crates/luminal_python/rust/src/ops_parse/matmul.rs
@@ -1,70 +0,0 @@
-use std::collections::HashMap;
-
-use luminal::prelude::{tracing::trace, *};
-use onnx_protobuf::NodeProto;
-
-use crate::util::{broadcast_to_expr, get_float_attr, get_int_attr};
-
-pub fn parse_matmul_node(
-    node: &NodeProto,
-    tensors: &mut HashMap<String, GraphTensor>,
-) -> Result<(), String> {
-    trace!("Started parse: MatMul Node");
-    assert!(node.input.len() == 2, "MatMul should have exactly 2 inputs");
-    let a = *tensors
-        .get(&node.input[0])
-        .ok_or_else(|| format!("MatMul: missing input tensor '{}'", node.input[0]))?;
-    let b = *tensors
-        .get(&node.input[1])
-        .ok_or_else(|| format!("MatMul: missing input tensor '{}'", node.input[1]))?;
-
-    //TODO: enforce some kind of check here that they are broadcastable
-    let result = a.matmul(b);
-    let output_name = &node.output[0];
-    tensors.insert(output_name.clone(), result);
-    trace!("Finished parse: MatMul Node");
-    Ok(())
-}
-
-/// Handle Gemm node: Y = alpha * (transA ? A.T : A) @ (transB ? B.T : B) + beta * C
-///
-/// Attributes: transA (default 0), transB (default 0), alpha (default 1.0), beta (default 1.0)
-/// Input C (bias) is optional.
-pub fn parse_gemm_node(
-    node: &NodeProto,
-    tensors: &mut HashMap<String, GraphTensor>,
-) -> Result<(), String> {
-    trace!("Started parse: Gemm Node");
-    let a = *tensors
-        .get(&node.input[0])
-        .ok_or_else(|| format!("Gemm: missing input A '{}'", node.input[0]))?;
-    let b = *tensors
-        .get(&node.input[1])
-        .ok_or_else(|| format!("Gemm: missing input B '{}'", node.input[1]))?;
-
-    let trans_a = get_int_attr(node, "transA", 0) != 0;
-    let trans_b = get_int_attr(node, "transB", 0) != 0;
-    let alpha = get_float_attr(node, "alpha", 1.0);
-    let beta = get_float_attr(node, "beta", 1.0);
-
-    let a_mat = if trans_a { a.permute(vec![1, 0]) } else { a };
-    let b_mat = if trans_b { b.permute(vec![1, 0]) } else { b };
-
-    let mut result = a_mat.matmul(b_mat);
-    if alpha != 1.0 {
-        result *= alpha;
-    }
-
-    if node.input.len() > 2 && !node.input[2].is_empty() {
-        let c = *tensors
-            .get(&node.input[2])
-            .ok_or_else(|| format!("Gemm: missing bias C '{}'", node.input[2]))?;
-        let c_scaled = if beta != 1.0 { c * beta } else { c };
-        let result_shape = result.dims();
-        result += broadcast_to_expr(c_scaled, &result_shape);
-    }
-
-    tensors.insert(node.output[0].clone(), result);
-    trace!("Finished parse: Gemm Node");
-    Ok(())
-}
--- a/crates/luminal_python/rust/src/ops_parse/mod.rs
+++ b/crates/luminal_python/rust/src/ops_parse/mod.rs
@@ -1,15 +0,0 @@
-pub mod binary;
-pub mod convolution;
-pub mod matmul;
-pub mod movement;
-pub mod reduction;
-pub mod tensor;
-pub mod unary;
-
-pub use binary::*;
-pub use convolution::*;
-pub use matmul::*;
-pub use movement::*;
-pub use reduction::*;
-pub use tensor::*;
-pub use unary::*;
--- a/Show More
+++ b/Show More