mirror of
https://git.teahaven.kr/Rust-related/luminal.git
synced 2026-06-04 16:49:49 +09:00
* luminal_python + cuda_lite: unblock Qwen3-MoE compile path
Four small fixes that together let Qwen3MoeForCausalLM compile end-to-end
through torch.compile + luminal_backend, plus a regression test suite.
1. KernelScatter bf16 OOB
crates/luminal_cuda_lite/src/kernel/hlir.rs
The Scatter kernel sized n_vec as `n_dest / 4`, correct only for
4-byte dtypes. For any other width the float4 vectorised copy covered
the wrong number of bytes: 2× the actual buffer size for 2-byte types
such as bf16, 4× for 1-byte types, and 0.5× for 8-byte types.
Whether that crashed with CUDA_ERROR_ILLEGAL_ADDRESS or
silently corrupted neighbouring allocations depended on which
surrounding kernels the egglog search picked → ~40% crash rate at
search-iters≥5 on StaticCache(dtype=bfloat16) MoE inference. Fix:
parameterise n_vec and remainder_start by elements_per_vec =
16 / sizeof(self.dtype). For F32/Int the generated PTX is identical.
2. maximum_f32 dtype mismatch on Int tensors
src/frontend/binary.rs
`maximum_f32(rhs)` built an F32 `constant_float`; the inner `lt`
then panicked "Dtypes must match to compare tensors. Got Int and
F32" whenever self was Int — e.g. `aten.clamp` on top-k expert
indices coming out of an MoE router. Fix: cast the constant to
self.dtype before the compare. For Int self this floors the bound,
matching PyTorch's `clamp(int_tensor, min=<float>)` semantics.
3. Three new ATen ops in the luminal_python translator
crates/luminal_python/rust/src/translator/{dispatch,tensor}.rs
- aten.empty.memory_format
- aten.empty_permuted.default → translate_empty (zero-fill)
- aten.histc.default → translate_histc
Qwen3-MoE allocates the expert-output staging tensor via
`empty_permuted` and counts tokens-per-expert via
`torch.histc(expert_ids.int(), bins=K, min=0, max=K-1)`.
empty / empty_permuted lower to a zero-filled tensor of the
requested shape — PyTorch's contract on empty outputs is undefined
for any read prior to a write, and downstream writes overwrite our
zeros, so this is sound.
histc implements only the bincount-equivalent case (one integer per
bin); non-integer-bin or non-contiguous-bin usage bails with a clear
error rather than silently dropping values.
4. crates/luminal_python/tests/test_qwen3_moe.py — new file
Four regression tests over progressively larger Qwen3MoeForCausalLM
configs:
- tiny: 2 experts, top-1, ~70K params (atol 1e-5)
- small: 4 experts, top-2 (atol 1e-4)
- medium: 8 experts, top-2, 2 layers (atol 1e-4)
- real_config_1layer: full Qwen3-30B-A3B arch
(128 experts, top-8, 2048 hidden),
num_hidden_layers=1, random weights
(atol 1e-3)
The size ladder lets any future regression surface at the cheapest
test that catches it. Each individual fix above is exercised:
gather-then-matmul (PR #298) by every test, KernelScatter bf16
indirectly via the bf16 weight init path, the clamp-on-Int and the
empty/histc translators by every test.
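The sizing arithmetic behind fix 1 can be sketched in a few lines of Python. This is only an illustration of the numbers in the description, not the kernel code in hlir.rs; the function names `scatter_vec_plan` and `old_bytes_walked` are hypothetical.

```python
VEC_BYTES = 16  # one float4 load/store moves 16 bytes


def scatter_vec_plan(n_dest, dtype_size):
    """Return (n_vec, remainder_start) for a 16-byte vectorised copy.

    Hypothetical helper mirroring the described fix:
    elements_per_vec = 16 / sizeof(dtype).
    """
    elements_per_vec = VEC_BYTES // dtype_size
    n_vec = n_dest // elements_per_vec
    remainder_start = n_vec * elements_per_vec
    return n_vec, remainder_start


def old_bytes_walked(n_dest):
    """Bytes covered by the old sizing, n_vec = n_dest / 4."""
    return (n_dest // 4) * VEC_BYTES


n_dest = 1024
for name, size in [("int8", 1), ("bf16", 2), ("f32", 4), ("f64", 8)]:
    n_vec, rem = scatter_vec_plan(n_dest, size)
    ratio = old_bytes_walked(n_dest) / (n_dest * size)
    print(f"{name}: n_vec={n_vec} remainder_start={rem} old/actual={ratio}x")
```

For 4-byte types the new plan gives the same n_vec as `n_dest / 4`, which is consistent with the note that the generated PTX is unchanged for F32/Int.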
Validation on H200/CUDA:
- 4 passed in tests/test_qwen3_moe.py (this PR's new tests)
- 223 passed across tests/test_unary.py, test_capsule_validation.py,
test_hlir_ops.py — no existing-test regression
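For the histc translator in fix 3, the following pure-Python sketch shows why `bins=K, min=0, max=K-1` behaves like a bincount. It assumes histc uses equal-width bins over [min, max], ignores out-of-range values, and places the top edge in the last bin; both function names are hypothetical and this is not the translator's code.

```python
def histc_bin_index(v, bins, lo, hi):
    """Equal-width bin index for an integer v in [lo, hi] (assumed histc rule).

    Integer arithmetic stands in for floor((v - lo) / (hi - lo) * bins).
    """
    return min((v - lo) * bins // (hi - lo), bins - 1)


def histc_integer_bins(values, bins, lo, hi):
    """Count integers per bin, bailing on anything but one integer per bin."""
    if bins < 2 or hi - lo != bins - 1:
        raise ValueError("only the one-integer-per-bin case is supported")
    counts = [0] * bins
    for v in values:
        if v != int(v):
            raise ValueError(f"non-integer value {v!r}")
        v = int(v)
        if lo <= v <= hi:  # out-of-range values are ignored
            counts[v - lo] += 1
    return counts


K = 8
expert_ids = [0, 2, 2, 5, 7, 7, 7]
print(histc_integer_bins(expert_ids, bins=K, lo=0, hi=K - 1))
# Each integer 0..K-1 lands in its own equal-width bin:
print([histc_bin_index(i, K, 0, K - 1) for i in range(K)])
```

Under those assumptions the equal-width rule and the direct `v - lo` index agree for every integer in range, which is what makes the bincount-equivalent lowering valid for this call pattern.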
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test: add full-depth Qwen3-30B-A3B regression test
The 1-layer real-config test exercised the production *layer* shape but
not the full network depth. Adds a sibling test that loads the actual
Qwen/Qwen3-30B-A3B pretrained checkpoint at its native bf16 dtype,
keeps all 48 layers, and runs a full forward through luminal_backend.
Asserts compile+run completes and the compiled output is finite + in the
right magnitude band vs eager (within 10×). Tight numerical equivalence
at full depth is not asserted: random egglog seeds can pick lowering
plans whose 48-layer accumulation diverges structurally from eager
even though per-layer correctness holds. The smaller-config tests above
use atol≤1e-3 and cover the per-op correctness this test cannot.
This catches:
- egglog cleanup behaviour over a 48-layer-wide e-graph (the
`egglog_utils.rs:1286: No valid graphs` panic surfaces here if the
cleanup cascade re-regresses on MoE root-eclasses);
- per-layer state plumbing that single-layer tests can't see;
- bf16-specific code paths that fp32 random-init tests mask.
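The "finite and within 10× of eager" assertion described above could look roughly like the sketch below. It works on plain Python lists to stay self-contained; the real test presumably operates on torch tensors, and using max-abs as the magnitude measure is an assumption, as is the name `in_magnitude_band`.

```python
import math


def in_magnitude_band(compiled, eager, factor=10.0):
    """True if compiled is all-finite and its max-abs is within `factor`
    of eager's max-abs in either direction. Hypothetical helper."""
    if not all(math.isfinite(v) for v in compiled):
        return False
    c = max(abs(v) for v in compiled)
    e = max(abs(v) for v in eager)
    if e == 0.0:
        return c == 0.0
    return e / factor <= c <= e * factor


eager_logits = [1.5, -3.0, 0.25]
print(in_magnitude_band([1.0, -2.0, 0.5], eager_logits))   # same scale
print(in_magnitude_band([1e38, 0.0, 0.0], eager_logits))   # saturated output
print(in_magnitude_band([float("nan"), 0.0, 0.0], eager_logits))
```

A check of this shape rejects NaN/inf and gross saturation while tolerating the plan-dependent drift that the description says can accumulate over 48 layers.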
Memory profile: ~60 GB bf16 weights + ~15 GB compiled-runtime peak;
single-token input keeps activations and KV cache trivial. Fits an H200
or H100 with margin to spare.
Run time: ~90 s for compile (egglog search at default budget) + ~1 s
for both forward passes.
Verified with 5 passed in 5:29 on H200/CUDA.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* luminal_python: fix bf16 cast-back on where / masked_fill
`translate_where`, `translate_where_scalar_other`, and
`translate_masked_fill_scalar` all computed `c * x + (1 - c) * y` in F32
and never cast the result back to the input dtype. When the input was
bf16 (the common case for MoE inference), the F32 buffer was downstream
read as bf16 — which walks the buffer at half-stride and produces
output[1] = input[0], output[3] = input[1], … with zeros at the even
positions. For Qwen3-MoE's `batched_mm_experts_forward` the corruption
landed at the masked-fill of unused expert outputs and propagated as
~10^38 saturation through the rest of the layer.
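The half-stride misread can be reproduced on the host with nothing but `struct`. The sketch below assumes a little-endian layout (as on CUDA devices) and picks inputs whose low 16 bits are zero in F32; `f32_buffer_read_as_bf16` is a hypothetical name, not a function in the repository.

```python
import struct


def f32_buffer_read_as_bf16(values):
    """Pack values as little-endian F32, then decode every 16-bit half
    of that buffer as if it were a bf16 element."""
    raw = struct.pack(f"<{len(values)}f", *values)
    halves = struct.unpack(f"<{len(raw) // 2}H", raw)
    # A bf16 bit pattern is the top 16 bits of the matching F32 pattern.
    return [struct.unpack("<f", struct.pack("<I", h << 16))[0] for h in halves]


inputs = [1.0, 2.0, -3.5]
misread = f32_buffer_read_as_bf16(inputs)
print(misread)                 # even slots hold the zero low halves
print(misread[: len(inputs)])  # what an N-element bf16 reader would see
```

The odd positions carry the original values and the even positions are zero, matching the output[1] = input[0], output[3] = input[1] pattern described above. Inputs with non-zero low mantissa bits would put small garbage values in the even slots instead of exact zeros.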
Two changes, plus one known caveat:
1. Extract a shared `where_formula(cond, x, y, out_dtype)` helper that
builds the c*x + (1-c)*y graph in F32 and then `cast(out_dtype)`s
the result. All three callers route through it now.
2. `translate_where_scalar_other` and `translate_masked_fill_scalar`
build a tensor for the scalar branch via the same
`constant_float(val).cast(out_dtype).expand_rhs(shape)` recipe that
`translate_full_like` uses, then call the shared helper.
3. The standalone half-stride misread on a tiny `masked_fill` graph is
still observable in isolation (egglog picks a different rewrite plan
for that graph than for `full_like + where`), but does not occur in
real models — the qwen3-moe test suite (5 tests, including full
`Qwen/Qwen3-30B-A3B` pretrained at all 48 layers) is now green and
the bench's `Qwen3MoeExperts` path produces correct output.
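As a rough model of the shared helper in item 1, the sketch below evaluates c*x + (1-c)*y in Python floats and then applies an explicit cast to the output dtype. It is not the Rust `where_formula`; the `cast` parameter and the `to_bf16` emulation (round-to-nearest-even on the F32 bit pattern, NaN not handled) are assumptions made for illustration.

```python
import struct


def to_bf16(v):
    """Round a float to the nearest bf16-representable value (ties to even).
    Illustrative emulation only; NaN inputs are not handled."""
    bits = struct.unpack("<I", struct.pack("<f", v))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def where_formula(cond, x, y, cast):
    """Arithmetic select: cond is 1.0 where x is taken, 0.0 where y is.
    The result is cast back to the output dtype before it is returned."""
    return [cast(c * a + (1.0 - c) * b) for c, a, b in zip(cond, x, y)]


# masked_fill-style use: keep x where the mask is 1, fill 0.0 elsewhere.
mask = [1.0, 0.0, 1.0]
x = [1.001, 2.0, 3.0]
fill = [0.0, 0.0, 0.0]
print(where_formula(mask, x, fill, to_bf16))
```

The point of the final `cast` is that every consumer sees a buffer in the dtype it expects, so no downstream kernel reinterprets F32 storage as bf16.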
Validation on H200/CUDA:
- 5 passed in tests/test_qwen3_moe.py (was: full-config wrong-magnitude
output blocking the regression test from being meaningful)
- 223 passed in tests/test_unary.py + test_capsule_validation.py +
test_hlir_ops.py — no existing-test regression
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* cargo fmt
* ruff format on tests/test_qwen3_moe.py
* clippy: use += instead of x = x + y
* fixed whisper with schedule edges in runtime
* scatter no copy fix
* whisper fix
* hold out slow tests
* flashinfer
* fmt
* flashinfer jit
---------
Co-authored-by: Tucker Morgan <tucker@luminal.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
29 lines
819 B
YAML
name: Test Python Native

on:
  push:
    branches: ["main"]
  pull_request:
    branches: ["main"]
  workflow_dispatch:

jobs:
  python_native_tests:
    name: Python Native Tests
    runs-on: ubuntu-latest
    container:
      image: ghcr.io/luminal-ai/luminal-docker:cpu
    timeout-minutes: 45
    defaults:
      run:
        working-directory: crates/luminal_python

    steps:
      - uses: actions/checkout@v6
      - name: Update Rust toolchain
        run: rustup update
      - name: Build maturin extension
        run: uv run maturin develop --manifest-path rust/Cargo.toml --profile release
      - name: Run pytest
        run: uv run pytest tests/test_hlir_ops.py tests/test_unary.py -v -m "not slow"