ai-rfm/GUIDELINES.md

# GUIDELINES

## Purpose

Make single-sample overfitting robust and measurable. The target is `mean_rmsd_100 <= 1.0`.

## Workflow

1. **Branch per line of work**: create a branch (e.g. `attempt/<topic>`) before changing `train.py` for a new experiment.
2. Modify code/config, run training, write `reports/latest_eval.json`.
3. Append one line to the `README.md` attempt log for every attempt, whether it succeeded or failed.
4. **Never commit `train.py` and `README.md` in the same git commit.** After a training run: (a) commit code/eval artifacts (`train.py`, `reports/latest_eval.json`, checkpoints, etc.) without `README.md`; then (b) make a **docs-only** commit that touches **only** `README.md` (attempt log line). Pre-commit enforces this split so `README.md` can be cherry-picked to `main` without dragging `train.py` along. Feature-branch commits are not blocked by the mean-RMSD performance gate.
5. When a branch is ready to land: **merge (or cherry-pick) into `main`**. The performance gate and `BEST_PRACTICE.json` / best-artifact refresh run only on **`main`** when `train.py` is part of the commit.
6. **`README.md` attempt log must also live on `main`**: if the code was merged only later, or a `train.py` merge was abandoned, still **bring new `## Attempt Log` lines onto `main`** soon after. A docs-only commit is fine: stage **only** `README.md` so the mean-RMSD gate does not run. Cherry-pick the README hunk from the branch or copy the lines; do not leave the canonical log only on a feature branch.
7. **Mandatory best-update integration**: if any feature-branch attempt records a strictly better `mean_rmsd_100` than the current `main` anchor, treat it as merge-ready work. Merge/cherry-pick it into `main` promptly (do not keep a known best only on a feature branch), then continue new experiments from a fresh branch off updated `main`.
8. **Per-attempt logging+commit is mandatory**: every experiment run must immediately append its result to `README.md` and record it in git **before** the next run. Follow the split required by rule 4: code commit first, `README.md`-only commit second (same attempt, at least two commits when both files change).
9. **SDF outputs stay out of git**: `*.sdf` is ignored; regenerate trajectories locally instead of committing structure files.
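
The commit split in rule 4 and the SDF rule in rule 9 can be illustrated with a small check over staged paths. This is a minimal sketch, not the repository's actual pre-commit hook: `check_commit_split` is a hypothetical helper, and it assumes paths are given relative to the repository root (for example, as printed by `git diff --cached --name-only`). It does not cover the mean-RMSD gate or the flow-matching token check.

```python
from typing import Iterable, List


def check_commit_split(staged_paths: Iterable[str]) -> List[str]:
    """Return rule violations for a set of staged paths (empty list means OK).

    Hypothetical helper: the real pre-commit hook may work differently.
    """
    staged = set(staged_paths)
    problems = []

    # Rule 4: train.py and README.md must never share a commit.
    if "train.py" in staged and "README.md" in staged:
        problems.append("train.py and README.md must not share a commit")

    # Rule 9: structure files are regenerated locally, never committed.
    sdf_files = sorted(p for p in staged if p.endswith(".sdf"))
    if sdf_files:
        problems.append("SDF outputs must stay out of git: " + ", ".join(sdf_files))

    return problems


if __name__ == "__main__":
    # Code/eval commit: allowed.
    print(check_commit_split(["train.py", "reports/latest_eval.json"]))
    # Mixed commit: rejected.
    print(check_commit_split(["train.py", "README.md"]))
```

A docs-only commit passes this check because `README.md` is staged alone.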

## Training budget and stopping

- Do not shrink the epoch budget by default while the learning curve is still improving.
- If wall-clock is tight, use explicit early stopping on **training-side** signals only (e.g. plateauing training loss), with a large max-epoch cap and patience.
- **Do not** use the final gated metric (`mean_rmsd_100`) or any equivalent “mini-test” of the same objective during training to decide when to stop. That peeks at the evaluation target and is leakage / cheating in this single-sample overfit setting.
- **Do not** introduce a held-out RMSD split for stopping; the reported metric is the quality gate, not a training control signal.
- Record in the attempt note how training ended (e.g. `max_epochs`, `early_stop` on train loss only).
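
The stopping rules above can be sketched as a patience-based stopper that sees only the training loss. This is an illustrative sketch: `TrainLossEarlyStopper`, `train_with_cap`, and the `train_one_epoch` callable are hypothetical names, not part of `train.py`. Nothing here reads `mean_rmsd_100` or any held-out RMSD.

```python
from typing import Callable, Tuple


class TrainLossEarlyStopper:
    """Stops when the training loss has not improved for `patience` epochs."""

    def __init__(self, patience: int, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.stale_epochs = 0

    def update(self, train_loss: float) -> bool:
        """Record one epoch's training loss; return True if training should stop."""
        if train_loss < self.best - self.min_delta:
            self.best = train_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


def train_with_cap(
    train_one_epoch: Callable[[int], float],
    max_epochs: int,
    patience: int,
    min_delta: float = 0.0,
) -> Tuple[int, str]:
    """Run up to `max_epochs`; return (epochs_run, how_training_ended)."""
    stopper = TrainLossEarlyStopper(patience, min_delta)
    for epoch in range(1, max_epochs + 1):
        train_loss = train_one_epoch(epoch)  # training-side signal only
        if stopper.update(train_loss):
            return epoch, "early_stop"
    return max_epochs, "max_epochs"
```

The second element of the returned tuple is the value to record in the attempt note.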

## Rollback and re-integration (not “nuke everything”)

- **Anchor**: `BEST_PRACTICE.json` plus the last `main` commit that passed the merge-time gate define the production story. Feature branches are scratch space.
- **Prefer selective undo**: when attempts pile up, re-read the branch history commit-by-commit, decide what actually helped, and **drop only what is useless** (revert single commits, `git restore` specific paths, or reset a branch tip while keeping good commits reachable).
- **Cherry-pick integration**: to land work on `main` without merging a messy branch wholesale, create a fresh branch from `main` and **cherry-pick** only the commits you still believe in; resolve conflicts; run eval; merge to `main` when the gate passes.
- **Log honestly**: append a short note when you abandon a direction (what was dropped and why), without erasing earlier attempt log lines.

## What Counts As An Independent Attempt

- Independent attempts must change a research-level concept, such as:
  - model architecture/backbone/head design;
  - objective/loss formulation;
  - training strategy (curriculum, teacher forcing style, optimization regime);
  - representation or rollout/evaluation coupling logic.
- Pure hyperparameter sweeps (LR, batch size, seed, minor weight nudges) are not treated as standalone attempts.
- Hyperparameter changes are allowed only as supporting details within a larger conceptual change.

## Micro-tuning cap per strategy

- For each new strategy (new research-level concept), micro-tuning is capped at **5 runs**.
- Micro-tuning includes LR/seed/batch/clip/time-power/weight nudges that do not change the core concept.
- After 5 micro-tuning runs for that strategy, stop tuning it and either:
  - promote the best result from that strategy, or
  - declare the strategy exhausted in `README.md` and move to a new independent strategy.
- Do not reset this counter by branching or renaming; the count is per strategy idea.
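
One way to keep the cap honest is to count runs against a stable strategy label rather than a branch name. The sketch below is purely illustrative: `MicroTuningBudget` and the strategy labels are hypothetical, and the guidelines do not prescribe any tooling for this.

```python
from collections import Counter


class MicroTuningBudget:
    """Counts micro-tuning runs per strategy idea (not per branch name)."""

    def __init__(self, cap: int = 5):
        self.cap = cap
        self.runs = Counter()

    def record_run(self, strategy: str) -> int:
        """Register one micro-tuning run and return how many runs remain."""
        if self.runs[strategy] >= self.cap:
            raise RuntimeError(
                f"micro-tuning cap of {self.cap} reached for {strategy!r}: "
                "promote the best result or declare the strategy exhausted"
            )
        self.runs[strategy] += 1
        return self.cap - self.runs[strategy]
```

Each new research-level concept gets its own label and therefore its own budget of five runs.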

## Non-negotiable flow-matching rule

- Time conditioning in training must be sampled randomly for every training sample (middle-time flow supervision).
- Do not replace training time with fixed constants.
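
A minimal sketch of per-sample random time conditioning follows. It assumes a uniform distribution over `[0, 1)` and a linear interpolation path between a source point `x0` and a target point `x1`; the guidelines fix neither choice, and the function names are hypothetical rather than taken from `train.py`.

```python
import random
from typing import List, Sequence, Tuple


def sample_training_times(batch_size: int, rng: random.Random) -> List[float]:
    """Draw an independent random time for every sample in the batch."""
    # Assumption: uniform on [0, 1); the source only requires randomness.
    return [rng.random() for _ in range(batch_size)]


def flow_matching_pair(
    x0: Sequence[float], x1: Sequence[float], t: float
) -> Tuple[List[float], List[float]]:
    """Return (x_t, target_velocity) for an assumed linear path from x0 to x1."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


if __name__ == "__main__":
    rng = random.Random(0)  # explicit seed, per the repro notes
    for t in sample_training_times(4, rng):
        print(t, flow_matching_pair([0.0, 0.0], [2.0, 4.0], t))
```

Replacing `sample_training_times` with a constant such as `[0.5] * batch_size` is exactly what the rule forbids.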

## Required report format

`reports/latest_eval.json` must include:

- `mean_rmsd_100` (float, lower is better)
- `num_runs` (int, must be 100)
- `timestamp_utc`
- `command`
- `notes`
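
The required fields can be produced and checked with a few lines of standard-library Python. This is a sketch under assumptions: `build_report` and `validate_report` are hypothetical helpers, and the ISO 8601 timestamp format is an assumption, since the guidelines name the `timestamp_utc` field but not its format.

```python
import json
from datetime import datetime, timezone
from statistics import fmean
from typing import Dict, List, Sequence

REQUIRED_KEYS = ("mean_rmsd_100", "num_runs", "timestamp_utc", "command", "notes")


def build_report(rmsds: Sequence[float], command: str, notes: str) -> Dict:
    """Assemble the report dict from exactly 100 per-run RMSD values."""
    if len(rmsds) != 100:
        raise ValueError(f"expected 100 runs, got {len(rmsds)}")
    return {
        "mean_rmsd_100": float(fmean(rmsds)),
        "num_runs": len(rmsds),
        # Assumed format: ISO 8601 in UTC.
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "command": command,
        "notes": notes,
    }


def validate_report(report: Dict) -> List[str]:
    """Return a list of problems with the report (empty list means valid)."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in report]
    if "mean_rmsd_100" in report and not isinstance(report["mean_rmsd_100"], float):
        problems.append("mean_rmsd_100 must be a float")
    if "num_runs" in report and report["num_runs"] != 100:
        problems.append("num_runs must be 100")
    return problems


def write_report(report: Dict, path: str = "reports/latest_eval.json") -> None:
    """Write the report as JSON; the parent directory must already exist."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(report, fh, indent=2)
```

Running `validate_report` before `write_report` catches a wrong run count or a missing field before the file reaches a commit.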

## Repro notes

- Keep the seed explicit in commands.
- Keep the sample path explicit.
- Prefer additive experiments (do not silently remove prior working options).

## Multi-layer diagnosis mindset

- Do not optimize only a scalar metric; analyze behavior from multiple views in each attempt.
- Use trajectory inspection as one analysis axis, not a fixed prescription.
- Combine at least two kinds of evidence when judging a strategy:
  - quantitative metrics (RMSD, train/eval gap, stability);
  - qualitative dynamics (trajectory patterns, mode collapse, unrealistic motion);
  - structural diagnostics (e.g., internal-distance change, geometry consistency).
- Treat metric improvement without believable dynamics (or vice versa) as incomplete progress.
- Example signal: if motion appears translation-dominant with weak internal change, investigate rotation/torsion learning capacity and loss balance.
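
The example signal can be quantified by comparing how far the centroid moves with how much the internal pairwise distances change between two frames. The sketch below is one possible structural diagnostic, not a prescribed one: the function names are hypothetical, and frames are assumed to be lists of `(x, y, z)` tuples with the same atom ordering.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def centroid(coords: Sequence[Point]) -> Point:
    """Unweighted mean position of all atoms in a frame."""
    n = len(coords)
    return tuple(sum(p[i] for p in coords) / n for i in range(3))


def centroid_shift(frame_a: Sequence[Point], frame_b: Sequence[Point]) -> float:
    """Distance the centroid travels between two frames (pure translation)."""
    return math.dist(centroid(frame_a), centroid(frame_b))


def mean_internal_distance_change(
    frame_a: Sequence[Point], frame_b: Sequence[Point]
) -> float:
    """Mean absolute change of all pairwise atom distances between two frames."""
    pairs = list(combinations(range(len(frame_a)), 2))
    total = sum(
        abs(math.dist(frame_a[i], frame_a[j]) - math.dist(frame_b[i], frame_b[j]))
        for i, j in pairs
    )
    return total / len(pairs)


if __name__ == "__main__":
    start = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    moved = [(x + 5.0, y, z) for x, y, z in start]  # rigid translation only
    print(centroid_shift(start, moved), mean_internal_distance_change(start, moved))
```

A large centroid shift with near-zero internal change indicates translation-dominant motion. Rigid rotation also leaves internal distances unchanged, so a near-zero internal change by itself only shows that the motion is rigid, not which kind of rigid motion it is.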

## Safety

- On **`main`**, if pre-commit blocks a `train.py` change due to no RMSD improvement, either improve the model and re-evaluate, or keep iterating on a **feature branch** and merge/cherry-pick only when ready.
- On **feature branches**, you may commit freely **without** the mean-RMSD gate; the **flow-matching token check still runs** whenever `train.py` is staged (same as on `main`).