Handle iteration-invariant body producers in loop unroll

`unroll_loops_in_llir` was panicking on `clone_map[i-1][&body_producer]`
with "no entry found for key" on the gemma Modal CI job. The line
fired when extraction landed a `body_producer` (LoopEnd's incoming
source) that isn't in `body_nodes` — a forward-walk-from-input-markers
set that misses ops whose only ancestors are non-marker (a constant,
external input, or an op whose chain got congruence-merged off the
marker chain by rules like `LoopInputStatic inline`).

Semantically that body op is iteration-invariant: every iter would
compute the same value, so the loop's state never changes. The
per-iter clone path needed a "no clone, share across iters" fallback
rather than indexing the clone map.

Fix:
- In `unroll_loops_in_llir::resolve_src`, when the LoopStart-resolved
  `body_producer` isn't in `body_nodes`, return `body_producer` itself
  for iter > 0 (skip the clone_map lookup).
- Mirror the same `unwrap_or(body_producer)` fallback in
  `marker_post_sub` for LoopEnd / LoopOutputSelect post-loop wiring.
- In `collapse_loops_to_first_iter`, add a backward-walk-from-end-markers
  pass that backfills body_nodes with any non-marker non-Output ancestor
  of an end-marker. Collapse doesn't have a clone_map (no panic site),
  but it does iterate body_nodes to rewire incoming edges before
  deleting markers — without backfill, an iteration-invariant
  body_producer would keep dangling edges to removed markers.

Local cuda_lite + python CUDA suites pass. The extraction shape that
triggers this isn't reachable from the local fuzzers' search depth, so
this lands as a defensive fix to unblock the gemma Modal job; once
that job goes green we'll know whether the fallback covers all cases
or whether more diagnostic info is needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joe Fioti
2026-04-26 07:45:05 +00:00
parent f08d24e73f
commit 16de9638fc
2 changed files with 63 additions and 3 deletions

View File

@@ -766,3 +766,11 @@ identical across all attempts (dtype issue) vs varying (actual numerical issue).
3. **Why hard**: The post-unroll LLIR for the rolled vs un-rolled paths *looked* nearly identical when scanned visually — both had the Log2 → Mul × 3 → Exp2 chain. The diff was 2 extra Muls vs no-rolling, and the actual semantic gap was visible only in op-name counts: WITH-rolling had 3 `KernelExp2` and 0 `KernelExp`, WITHOUT-rolling had 1 `KernelExp2` and 2 `KernelExp`. Tracking the missing fusion to the early/full ruleset split required reading the egglog driver carefully and noticing that `OpTextParts` builds `early_rewrites` and `full_rewrites` from disjoint method calls.
4. **Fix**: Register `binary_op_unroll_rules` in BOTH `early_rewrites()` (so fusion patterns like GLUMoE can match before the early-stage extract, which is what fixed `test_glumoe_gemma_gelu_matches_unfused_output` earlier in the session) AND `rewrites()` (so kernel-level rewrites like `direct-exp-fusion` can match in the full stage on the unrolled chain). One block per binary op (`Add`, `Mul`, `Mod`, `LessThan`).
5. **Principle**: When egglog has multiple stages (early/full) with disjoint rule sets, any rewrite that materialises new HLIR/IR enodes (rather than just lowering to LLIR) needs to fire in BOTH stages if downstream rewrites in BOTH stages might want to see the new structure. Putting "preparatory" rewrites only in `early_rewrites` means their effect is lost across the early→full handoff. The narrow rule of thumb: if your rule's outputs are intended to enable matches by other rules, audit which stages those other rules run in and register accordingly.
## 2026-04-26 — `unroll_loops_in_llir` panicked on iteration-invariant body producers
1. **Symptom**: Modal CI/CD job for the gemma example panicked at `src/graph.rs:1867` with `no entry found for key`. The line is `clone_map[i - 1][&body_producer]` inside `unroll_loops_in_llir`'s `resolve_src` closure — `body_producer` (the LoopEnd's incoming source for that slot) wasn't a key in the per-iteration clone map. cuda_lite/python tests didn't repro: only triggered by the specific genome and graph shapes that gemma's longer search settles on.
2. **Root cause**: `body_nodes` is computed by walking *forward* from each LoopStart/LoopInput/LoopInputStatic outgoing edge, stopping at markers and `Output` ops. Some egglog-extracted LLIRs land a `body_producer` that isn't reachable via that forward walk — i.e., its only ancestors are non-marker (a constant, an external input, or an op whose chain was congruence-merged off the marker chain by rules like `LoopInputStatic inline`). Semantically this is a degenerate "iteration-invariant body": every iter computes the same value, so the loop's state never changes. The per-iter clone path needed a fallback for that case.
3. **Why hard**: cuda_lite and python tests don't generate genomes that produce this shape, so local runs always pass. The forward-walk-only definition of `body_nodes` is *almost* always right — only specific extraction shapes from longer searches expose the gap. Test-driven debugging has limited reach when the failure mode depends on a search trajectory the local fuzzers don't explore.
4. **Fix**: in `unroll_loops_in_llir::resolve_src`, when the LoopStart-resolved `body_producer` isn't in `body_nodes`, return `body_producer` itself for iter > 0 instead of indexing `clone_map[i - 1]`. The body op didn't depend on the loop variable, so every iter > 0 carries the same value forward — using `body_producer` directly is semantically correct. Mirrored the same `unwrap_or(body_producer)` fallback in the post-loop substitution map (`marker_post_sub` for LoopEnd / LoopOutputSelect). Added a backward-walk-from-end-markers backfill in `collapse_loops_to_first_iter` so its body-node iteration also covers these nodes (it doesn't have a clone_map, but does need to rewire body ops' incoming edges before deleting markers).
5. **Principle**: When a graph-walk-derived set is used as a hashmap key requirement, every code path that *could* produce a key outside that set needs a graceful fallback — not just a defensive `expect`. For loop unrolling specifically, the rule is: `body_nodes` is the set of "ops that participate in per-iter computation"; ops on the LoopEnd's path that *don't* participate (iteration-invariant) are still legitimate, and need a "no clone, share across iters" path through `resolve_src` and `marker_post_sub`. Forward-walk-only `body_nodes` is correct only when extraction never produces iteration-invariant body producers — and in an egglog-driven search, that's not a guarantee you can make.

View File

@@ -1863,8 +1863,19 @@ pub fn unroll_loops_in_llir(llir: &mut LLIRGraph) {
if let Some(&(initial, body_producer)) = start_meta.get(&src) {
if i == 0 {
initial
} else {
} else if body_nodes.contains(&body_producer) {
clone_map[i - 1][&body_producer]
} else {
// Iteration-invariant body producer: extracted LLIRs from
// some genomes can put an op at LoopEnd's incoming whose
// only ancestors are non-marker (e.g. a constant that
// egglog congruence collapsed onto the body slot). It
// doesn't depend on iteration, so every iter > 0 carries
// forward the same single value rather than a per-iter
// clone. Without this fallback, `clone_map[i-1][&body]`
// panics with "no entry found for key" because the body
// producer isn't in the forward-walk-derived `body_nodes`.
body_producer
}
} else if let Some(sources) = input_per_iter.get(&src) {
sources[i]
@@ -1947,13 +1958,25 @@ pub fn unroll_loops_in_llir(llir: &mut LLIRGraph) {
.neighbors_directed(end_node, Direction::Incoming)
.next()
.expect("LoopEnd missing body producer during rewire");
marker_post_sub.insert(end_node, clone_map[iters - 1][&body_producer]);
// Mirror the iteration-invariant fallback in `resolve_src`: if the
// body producer isn't in the forward-walk-derived `body_nodes`, it
// wasn't cloned per iter, so the post-loop substitution is just the
// body producer itself (used as the loop's final value).
let sub = clone_map[iters - 1]
.get(&body_producer)
.copied()
.unwrap_or(body_producer);
marker_post_sub.insert(end_node, sub);
}
// Each LoopOutputSelect(stream, iter) routes to iter's clone of that
// stream's body producer.
for (&select_node, &(stream_id, iter)) in &output_selects {
let body_producer = output_body_producer[&stream_id];
marker_post_sub.insert(select_node, clone_map[iter][&body_producer]);
let sub = clone_map[iter]
.get(&body_producer)
.copied()
.unwrap_or(body_producer);
marker_post_sub.insert(select_node, sub);
}
for &consumer in &post_loop_consumers {
@@ -2119,6 +2142,35 @@ pub fn collapse_loops_to_first_iter(llir: &mut LLIRGraph) {
static_source.insert(static_node, src);
}
// Mirror the backward-walk patch in `unroll_loops_in_llir` so the body-
// node set is closed under "feeds a LoopEnd / LoopOutput" — without it
// a forward-walk-only set can miss body ops on LLIRs extracted from
// genomes that picked structurally unusual representatives.
let end_producers: Vec<NodeIndex> = ends
.values()
.chain(outputs.values())
.flat_map(|n| {
llir.neighbors_directed(*n, Direction::Incoming)
.collect::<Vec<_>>()
})
.collect();
let mut backfill_stack: Vec<NodeIndex> = end_producers;
while let Some(n) = backfill_stack.pop() {
if body_nodes.contains(&n) || loop_markers.contains(&n) {
continue;
}
if llir[n].to_op::<Output>().is_some() {
continue;
}
body_nodes.insert(n);
for pred in llir
.neighbors_directed(n, Direction::Incoming)
.collect::<Vec<_>>()
{
backfill_stack.push(pred);
}
}
// Resolve a source reference to its iter-0 equivalent.
let resolve_src = |src: NodeIndex| -> NodeIndex {
if let Some(&initial) = start_initial.get(&src) {