# ai_rfm

RFM overfitting sandbox for a single ligand sample, with hard quality gates.

## Environment first (UV, cu126 only)

1. Ensure Python 3.10 is available.
2. Install env and deps:
   - `uv sync`
3. Install git hooks:
   - `uv run pre-commit install`

This repository is intentionally pinned to CUDA 12.6 PyTorch wheels and matching PyG wheels.
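
After `uv sync`, a quick sanity check can confirm that the installed PyTorch build targets CUDA 12.6. The sketch below is illustrative only: `is_cu126` and `check_installed_torch` are hypothetical helper names, not part of this repository. It relies on `torch.version.cuda`, which reports the CUDA version the wheel was built against (or `None` for CPU-only builds).

```python
from __future__ import annotations


def is_cu126(cuda_version: str | None) -> bool:
    """Return True when a CUDA version string denotes CUDA 12.6."""
    if cuda_version is None:  # CPU-only wheels report no CUDA version
        return False
    # Compare only major and minor, so "12.6" and "12.6.1" both match.
    return cuda_version.split(".")[:2] == ["12", "6"]


def check_installed_torch() -> bool:
    """Inspect the installed PyTorch build.

    The import is deferred so that `is_cu126` can be used without torch.
    """
    import torch

    return is_cu126(torch.version.cuda)
```

Calling `check_installed_torch()` from a `uv run python` session should return `True` if the environment was synced as intended.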

## Repository policy

- Every attempt must update this README (append a short entry in `## Attempt Log`).
- Attempt log is mandatory for both successful and failed trials.
- **Branch-first attempts**: do training experiments on a **feature branch**; **commit each attempt** as **two commits** when both change: (1) `train.py` plus eval artifacts (`reports/latest_eval.json`, …) **without** `README.md` (PyTorch `*.pt` checkpoints are **not** tracked—local only); (2) a **docs-only** commit with **only** `README.md` (attempt log). Pre-commit blocks staging `train.py` and `README.md` together. Pre-commit does **not** enforce the mean-RMSD improvement rule on feature branches.
- **Main is the gate**: merging or committing to **`main`** with `train.py` staged triggers the performance gate (strictly better `mean_rmsd_100`, staged `latest_eval`, README log, auto-update of `BEST_PRACTICE.json`; checkpoints remain local). Land work via merge or **cherry-pick** of the commits you still trust after re-evaluation.
- **`## Attempt Log` on `main`**: new log lines written on a feature branch must be **replicated on `main`** (docs-only `README.md` commit if `train.py` is not landing yet). See `GUIDELINES.md` workflow step 6.
- The flow-matching time used during training must be sampled randomly, never fixed, so that intermediate times are supervised (middle-time supervision is mandatory).
- Independent attempts must be research-level changes (architecture/training strategy/loss design). Pure hyperparameter-only runs are not counted as standalone attempts.
- When failures accumulate, **re-evaluate branch commits** and integrate with **cherry-pick** (or selective revert / path restore)—not wholesale rollback unless explicitly justified. Do not use `mean_rmsd_100` (or equivalent) as a training-time early-stopping signal.
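
The staging rules above can be summarized as a small check. This is a minimal sketch under stated assumptions, not the actual logic of `scripts/precommit_performance_gate.py`: the function names are hypothetical, and in a real hook the staged paths would come from something like `git diff --cached --name-only`.

```python
from __future__ import annotations


def check_staged_files(staged: list[str]) -> list[str]:
    """Return policy violations for a set of staged paths (empty list means OK)."""
    problems: list[str] = []
    names = set(staged)

    # The code commit and the docs-only commit must be separate.
    if "train.py" in names and "README.md" in names:
        problems.append("train.py and README.md must not be staged together")

    # Checkpoints are local-only and never tracked.
    if any(path.endswith(".pt") for path in names):
        problems.append("*.pt checkpoints must not be staged")

    return problems


def performance_gate_applies(staged: list[str], branch: str) -> bool:
    """The mean-RMSD gate runs only on main, and only when train.py is staged."""
    return branch == "main" and "train.py" in staged
```

For example, staging `train.py` with `reports/latest_eval.json` on a feature branch yields no violations and skips the mean-RMSD gate. The same pair staged on `main` would trigger it.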

## Evaluation target

- Metric: `mean_rmsd_100`, the mean RMSD over 100 evaluation runs (aggregated in the style of a `batchsize=100` evaluation).
- Success criterion: `mean_rmsd_100 <= 1.0`.
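
As an illustration of how the metric and the success criterion fit together, here is a minimal sketch. It is not the code in `train.py`: the helper names are hypothetical, and the `rmsd` function compares coordinates directly without the Kabsch alignment mentioned in the attempt log.

```python
import math
from statistics import fmean

SUCCESS_THRESHOLD = 1.0  # from the success criterion: mean_rmsd_100 <= 1.0


def rmsd(pred, ref):
    """RMSD between two equally sized lists of (x, y, z) points, without alignment."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("pred and ref must be non-empty and have the same length")
    squared = sum(
        (p - r) ** 2
        for atom_pred, atom_ref in zip(pred, ref)
        for p, r in zip(atom_pred, atom_ref)
    )
    return math.sqrt(squared / len(pred))


def mean_rmsd(per_run_rmsds):
    """Aggregate per-run RMSD values (100 runs in this repository) into one mean."""
    return fmean(per_run_rmsds)


def meets_target(per_run_rmsds):
    """Check the aggregated metric against the success threshold."""
    return mean_rmsd(per_run_rmsds) <= SUCCESS_THRESHOLD
```

In practice `per_run_rmsds` would hold one value per evaluation run, so the list has 100 entries.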

## Key files

- `train.py`: training/evaluation entry point.
- `GUIDELINES.md`: operating rules and workflow.
- `BEST_PRACTICE.json`: current best-known metric and config.
- `reports/latest_eval.json`: most recent measured metric.
- `artifacts/*.pt`: checkpoints are **gitignored**; written locally by `train.py` / hooks (`latest_eval_best_model.pt`, `best_model.pt`).
- `reports/trajectories/`: trajectory SDFs are **gitignored** (`*.sdf`); regenerate locally (`python scripts/update_best_artifacts.py` after training when needed).
- `scripts/precommit_performance_gate.py`: flow-matching token check on any branch when `train.py` is staged; **mean-RMSD gate and `BEST_PRACTICE.json` refresh only on `main`** (does not stage `.pt` / `.sdf`).
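
The "strictly better" comparison between `reports/latest_eval.json` and `BEST_PRACTICE.json` might look like the sketch below. Several things here are assumptions rather than facts about the real hook: that both files expose a top-level `mean_rmsd_100` key, that a missing `BEST_PRACTICE.json` lets the first run pass, and all function names.

```python
from __future__ import annotations

import json
from pathlib import Path


def read_metric(path: Path, key: str = "mean_rmsd_100") -> float:
    """Read the metric from a JSON report (a top-level key is an assumed schema)."""
    with path.open(encoding="utf-8") as handle:
        return float(json.load(handle)[key])


def is_strict_improvement(latest: float, best: float | None) -> bool:
    """Lower RMSD is better; a tie does not count as an improvement."""
    # Assumption: with no recorded best yet, any measured value passes.
    return best is None or latest < best


def gate(latest_path: Path, best_path: Path) -> bool:
    """Return True when the latest evaluation beats the recorded best."""
    latest = read_metric(latest_path)
    best = read_metric(best_path) if best_path.exists() else None
    return is_strict_improvement(latest, best)
```

A call such as `gate(Path("reports/latest_eval.json"), Path("BEST_PRACTICE.json"))` would then decide whether the commit on `main` may proceed.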

## Attempt Log

- 2026-04-16: Bootstrapped docs/environment policy and cu126 UV config. Added best-practice/performance gating scaffolding before the next training run.
- 2026-04-16: Updated `train.py` to use final test metric as source of truth (`mean_rmsd_100` from 100 rollout predictions) and removed train-loss based best checkpoint tracking. Current measured `mean_rmsd_100=2.593694`.
- 2026-04-16: Updated evaluation to always use the best training checkpoint, then run 100 random initializations to time=1 and store the final RMSD mean in `reports/latest_eval.json`.
- 2026-04-16: Re-ran with best-checkpoint evaluation path active; current `mean_rmsd_100=2.582932` (improved from `2.593694`), artifacts synced to `BEST_PRACTICE.json`.
- 2026-04-16: Moved `BEST_PRACTICE.json` updates out of `train.py`; pre-commit now auto-generates/stages best report from `reports/latest_eval.json` when an improved train.py commit is made.
- 2026-04-16: Re-ran after pre-commit auto-best refactor; current `mean_rmsd_100=2.570120` (improved from `2.582932`).
- 2026-04-16: Added model-type support (`gcn`/`mlp`) and time-sampling control; best current run is `gcn hidden=512 layers=8 batch=96` with `mean_rmsd_100=2.523552`.
- 2026-04-16: Added pre-commit artifact refresh: on best update it now stages `BEST_PRACTICE.json`, `artifacts/best_model.pt`, and regenerates 6 trajectory visualizations in `reports/trajectories/`.
- 2026-04-16: Enforced random-time flow-matching rule (no fixed training time), saved best checkpoint to git-tracked artifact path, and improved metric to `mean_rmsd_100=2.519821` with `gcn hidden=512 layers=8 batch=96`.
- 2026-04-16: Added a general multi-layer diagnosis principle to `GUIDELINES.md` so experiments are judged with quantitative + qualitative + structural evidence, not metric-only optimization.
- 2026-04-16: Tried weighted objective to counter weak rotation/torsion motion (`w_center=0.8, w_omega=2.0, w_torsion=3.0, grad_clip=0.8`) and improved to `mean_rmsd_100=2.505556`.
- 2026-04-16: Failed attempt B (longer, lower-lr weighted run) reached `mean_rmsd_100=2.531661`; reverted artifacts to current best.
- 2026-04-16: Failed attempt C (torsion-heavy weights, `time_power=1.2`) reached `mean_rmsd_100=2.564594`; no commit.
- 2026-04-16: Failed attempt D (deeper GCN config) reached `mean_rmsd_100=2.739573`; no commit.
- 2026-04-16: Failed attempt E (`w_center=0.75, w_omega=2.1, w_torsion=3.2, lr=9e-4`) reached `mean_rmsd_100=2.535795`; no commit.
- 2026-04-16: Failed attempt F (balanced weights `w_center=0.9, w_omega=1.8, w_torsion=2.6`) reached `mean_rmsd_100=2.522751`; no commit.
- 2026-04-16: Failed attempt G (`accum=3` for stability) reached `mean_rmsd_100=2.561071`; no commit.
- 2026-04-16: Policy update: every attempt (success/failure) must be logged; checkpoint flow changed to `artifacts/latest_eval_best_model.pt` per run, while pre-commit promotes improved runs to `artifacts/best_model.pt`.
- 2026-04-16: Improved attempt H (same weighted config, `seed=1`) reached `mean_rmsd_100=2.461592` (improved from `2.505556`).
- 2026-04-16: Failed attempt I (same weighted config, `seed=2`) reached `mean_rmsd_100=2.590216`; no commit.
- 2026-04-16: Failed attempt J (same weighted config, `seed=3`) reached `mean_rmsd_100=2.554448`; no commit.
- 2026-04-16: Failed attempt K (research-level: added terminal-consistency auxiliary loss from `x_t` to `x_1`) reached `mean_rmsd_100=2.722863`; no commit.
- 2026-04-16: Failed attempt L (research-level: decoupled architecture with centered-coordinate trunk + separate translation head, with terminal auxiliary term) reached `mean_rmsd_100=2.637292`; no commit.
- 2026-04-16: Failed attempt M (research-level: decoupled centered-coordinate architecture only, no terminal auxiliary term) reached `mean_rmsd_100=2.479326`; close to best but no commit.
- 2026-04-16: Failed attempt N (training-strategy: added configurable early stopping with large max-epoch budget, patience/min-delta/check cadence controls) ran to max epoch with ongoing improvements (`stop_reason=max_epochs`) and reached `mean_rmsd_100=2.764940`; no commit.
- 2026-04-16: Rollback (per `GUIDELINES.md`): restored `train.py`, `reports/latest_eval.json`, and `artifacts/latest_eval_best_model.pt` to last committed baseline after attempts K–N; `mean_rmsd_100` anchor unchanged at `2.461592` (`BEST_PRACTICE.json`). Objective-aligned early stopping remains disallowed for training control.
- 2026-04-16: Policy update: experiments run on **feature branches** with **one commit per attempt**; mean-RMSD pre-commit gate applies only on **`main`** (merge/cherry-pick integration). Re-triage failed stacks via **cherry-pick** / selective drops, not default full-tree rollback.
- 2026-04-16: Branch `attempt/gat-wrapped-torsion` (single commit batching three evals): Failed O — `gat` + `--torsion-wrapped-loss`, `mean_rmsd_100=2.691410`. Failed P — `gcn` + `--torsion-wrapped-loss`, `2.657594`. Failed Q — `gcn` + `--gcn-residual` (best on branch `2.514058`); all above main best `2.461592` — no merge to `main`.
- 2026-04-16: Branch `attempt/default-wrapped-clean-deps`: Made wrapped torsion loss default (`--torsion-wrapped-loss` via `BooleanOptionalAction`, default on) and added displacement-domain objective option. Dependency cleanup removed unused packages from `pyproject.toml`. Validation run (`python train.py --sdf reports/trajectories/trajectory_00.sdf --epochs 200 --batch-size 32 --eval-runs 100 --model-type gcn --hidden 256 --gcn-layers 6 --loss-domain displacement --seed 1`) reached `mean_rmsd_100=2.528226` (no improvement vs best `2.461592`), so branch not ready to merge.
- 2026-04-16: Branch `attempt/default-wrapped-clean-deps` update: removed `--torsion-wrapped-loss` CLI toggle and enforced wrapped torsion loss always-on in code. Failed R — stronger baseline (`sample.sdf`, `gcn hidden=512 layers=8`, displacement loss, `epochs=800`, `seed=1`) reached `mean_rmsd_100=2.512292`.
- 2026-04-16: Failed S — weighted config (`w_center=0.8, w_omega=2.0, w_torsion=3.0, grad_clip=0.8`, `epochs=1200`, `seed=1`) reached `mean_rmsd_100=2.507794` (better than R, still above best `2.461592`).
- 2026-04-16: Failed T — same weighted config with time-bias (`time_power=1.3`) reached `mean_rmsd_100=2.517704`; no branch promotion.
- 2026-04-16: Attempt U (recommended #1, residual GCN): `--gcn-residual` with weighted displacement setup (`epochs=1200`, `seed=1`) reached `mean_rmsd_100=2.463247` (close, but above best `2.461592`).
- 2026-04-16: Attempt V (recommended #2, SO(3) geodesic rotation loss): initial full-budget run was too slow, then reduced-budget run (`--rotation-loss geodesic --epochs=200 --batch-size=24 --seed=1`) improved to `mean_rmsd_100=2.429729` (new branch best at the time).
- 2026-04-16: Attempt W (recommended #3, split heads + normalization): `--channel-layernorm --head-mlp-layers 2` with weighted displacement setup (`epochs=1200`, `seed=1`) reached `mean_rmsd_100=2.634111` (degraded).
- 2026-04-16: Attempt X (geodesic refinement, longer budget): `--rotation-loss geodesic --epochs=400 --batch-size=24 --seed=2` showed NaN instability and reached `mean_rmsd_100=2.552385`.
- 2026-04-16: Attempt Y (geodesic seed sweep): `--rotation-loss geodesic --epochs=200 --batch-size=24 --seed=3` diverged to NaN early and reached `mean_rmsd_100=2.591940`.
- 2026-04-16: Attempt Z (geodesic stable rerun): same setup as V (`--rotation-loss geodesic --epochs=200 --batch-size=24 --seed=1`) improved further to `mean_rmsd_100=2.426296` (current best in this branch, better than anchor `2.461592`).
- 2026-04-16: Added train-loss-only early stopping controls (`--early-stop-patience`, `--early-stop-min-delta`, `--early-stop-check-every`, `--early-stop-warmup`) with `stop_reason`/`stop_epoch` reporting in logs and `reports/latest_eval.json`; objective-metric stopping remains disabled.
- 2026-04-16: Attempt AA (merge prep rerun on CUDA): repeated geodesic best-practice config (`--rotation-loss geodesic --epochs=200 --batch-size=24 --seed=1`) and measured `mean_rmsd_100=2.429895` (`num_runs=100`), still improving over main anchor `2.461592`.
- 2026-04-16: Branch `attempt/geodesic-stability-next`: stress-tested geodesic+residual variants; best observed metric reached `mean_rmsd_100=2.388103` (`--rotation-loss geodesic --gcn-residual --epochs=280 --batch-size=24 --lr=7e-4 --seed=1`), with occasional NaN instability in nearby runs.
- 2026-04-16: Stabilization-only update: added non-finite guards/clamps in geodesic loss, Kabsch RMSD, and training loss fallback to reduce NaN-caused crashes during long geodesic sweeps.
- 2026-04-16: Policy update in `GUIDELINES.md`: when a branch obtains a strict best `mean_rmsd_100`, integration into `main` is mandatory before continuing new branch experiments.
- 2026-04-16: Hook policy update: `train-performance-gate` now runs at both commit-time and `post-merge`, and enforces main-branch merge-time validation/refresh when merged diff includes `train.py`.
- 2026-04-16: Attempt AB (trajectory-instability hypothesis): added `--omega-max-norm` clipping to stabilize geodesic+residual rotation outputs and reduce NaN-prone spikes; run with `--omega-max-norm 3.0` reached `mean_rmsd_100=2.436618` (more stable but worse than branch best `2.388103`).
- 2026-04-16: Strategy S1 (hybrid rotation loss, capped at 5 micro-tuning runs) completed: alpha sweep (`0.7/0.5/0.3`) then lr/seed tuning on best alpha; best S1 result was `mean_rmsd_100=2.439254` (no new best), strategy marked exhausted.
- 2026-04-16: Strategy S2 (rotation-weight curriculum, capped at 5 micro-tuning runs) completed: best run used `--rotation-weight-start 1.0 --rotation-weight-warmup-epochs 120` and reached `mean_rmsd_100=2.417450` (no new best), strategy marked exhausted.
- 2026-04-16: Multi-GPU parallel sweep (GPU0/1/2) around residual-geodesic schedules produced `2.394431`, `2.420601`, and `2.450024`; no update over branch best `2.388103`.
- 2026-04-16: Follow-up parallel sweep (GPU0/1/2) with direct best-axis reruns produced `2.430481`, `2.412036`, and `2.720380`; observed heavy seed sensitivity and intermittent fallback-to-1000 behavior on unstable seeds.
- 2026-04-16: Continued parallel sweep with rotation curriculum variants (`start=0.85/0.95` and lower-lr schedule) produced `2.450391`, `2.457748`, and `2.426384`; no improvement over branch best `2.388103`.
- 2026-04-16: Deep schedule parallel sweep (`epochs=320~380`, `start=1.0` with warmup variants, multi-seed) produced `2.464117`, `2.410706`, and `2.419527`; still worse than branch best `2.388103`, and some runs showed late-epoch fallback instability.
- 2026-04-16: Post-reset attempt on `attempt/s3-tail-risk-next` (trajectory-tail-risk focus) using residual-geodesic with clipped omega and scheduled rotation weight (`lr=6.8e-4`, `grad_clip=0.7`, `start=1.0`, warmup `120`) reached `mean_rmsd_100=2.464730`; no improvement.
- 2026-04-16: Restarted branch `attempt/s3-restart-after-doc-sync` and ran immediate S3 continuation (`lr=6.0e-4`, `grad_clip=0.7`, geodesic+residual, `omega_max_norm=5.0`, warmup `120`), obtaining `mean_rmsd_100=2.474573`; no improvement over best `2.388103`.
- 2026-04-16: Strategy S4 start (tail-risk suppressor): added upper-quantile tail penalty in training loss and ran first trial (`tail-risk-weight=0.2`, `tail-risk-quantile=0.85`, `lr=6.8e-4`, geodesic+residual), yielding `mean_rmsd_100=2.466082`; no improvement over best `2.388103`.
- 2026-04-16: Strategy S4 micro-tuning #2 lowered tail penalty (`tail-risk-weight=0.1`, quantile `0.85`, `lr=6.4e-4`) to reduce over-regularization, but result was `mean_rmsd_100=2.476267`; no improvement.
- 2026-04-16: Strategy S4 micro-tuning #3 softened tail coverage (`tail-risk-quantile=0.9`, `tail-risk-weight=0.2`, `lr=6.8e-4`) and improved to `mean_rmsd_100=2.440570`, but still worse than best `2.388103`.
- 2026-04-16: Strategy S4 micro-tuning #4 increased tail penalty (`tail-risk-weight=0.25`, quantile `0.9`) and regressed sharply to `mean_rmsd_100=2.601258`; indicates over-penalization risk.
- 2026-04-16: Strategy S4 micro-tuning #5 changed seed (`seed=2`, `tail-risk-weight=0.2`, quantile `0.9`) and encountered prolonged fallback-to-1000 behavior with `mean_rmsd_100=2.709563`; S4 hit 5-run cap with no best update.
- 2026-04-16: Structural torsion head (`--torsion-head bond_pair`, GCN only): translation/rotation still use full-graph mean-pooled trunk+time; each torsion `k` runs the **same GCN weights** on the **movable-side induced subgraph** (mask only selects nodes/edges for that subgraph—mask values are not fed as features), mean-pools that subgraph, concatenates with global pooled context, `LayerNorm`, then a small MLP to one scalar. Replaced the prior mask-as-feature design. One calibration run (`epochs=320`, geodesic+residual) reached `mean_rmsd_100=2.598530` with long `train_mse=1000` plateaus; worse than best `2.388103`, likely dominated by multi-forward cost + same geodesic instability rather than readout alone.
- 2026-04-16: bond_pair stabilization pass: subgraph batch `add_self_loops`, post-pool `LayerNorm` on subgraph embedding, small Xavier init on torsion MLP, `torch.nan_to_num` + optional output clamp (`--subgraph-torsion-clip`), and Adam param-group with `--subgraph-lr-scale` (default `0.3`) for `sub_convs`/torsion head vs main `--lr`. Smoke (48ep) avoided `1000` train spikes; full run (`epochs=320`, `lr=5.5e-4`, `subgraph_lr_scale=0.25`, `clip=6`, `eval-runs=100`) reached `mean_rmsd_100=2.606118` (still above best `2.388103`) but training telemetry stayed below the non-finite fallback wall in logged epochs.
- 2026-04-16: bond_pair subgraph **full-graph trunk fuse**: subgraph GCN first-layer input is `concat(local_coords, full-graph node h)` (same-step trunk embedding, state-dependent—not static chemistry). Full run (`epochs=320`, same stab hyperparams as prior) yielded `mean_rmsd_100=2.620511` vs `2.606118` without fuse—slightly worse, so next axis is likely rotation/geodesic coupling or subgraph depth, not more static atom features.
- 2026-04-16: **Post-fuse strategy sweep (automation)**: implemented `--torsion-sub-gcn-layers` (subgraph depth ≤ trunk) and dedicated `sub_gcn_skip` for the subgraph stack; added `scripts/run_bond_pair_strategy_sweep.sh` to run six queued experiments in order—(E1) `rotation-loss=hybrid` α=0.5, (E2) `rotation-loss=mse`, (E3) `--torsion-sub-gcn-layers 4`, (E4) `--subgraph-lr-scale 0.4`, (E5) `--subgraph-lr-scale 0.15`, (E6) main `--lr 4.5e-4`—all with `bond_pair`, `gcn-residual`, `omega_max_norm=5`, stab clip `6`, default `subgraph_lr_scale=0.25` except E4/E5/E6 as noted. Run locally: `bash scripts/run_bond_pair_strategy_sweep.sh /path/to/sample.sdf` → `reports/bond_pair_strategy_sweep.tsv`; append each `mean_rmsd_100` line to this log after the run (this agent session could not complete long GPU sweeps end-to-end).
- 2026-04-17: GPU via project `.venv`: A/B on residual-geodesic (`gcn` hidden=512 layers=8, epochs=280, batch=24, lr=7e-4, weighted displacement, seed=1, eval-runs=100). Run A with `--ema-decay 0.999` reached `mean_rmsd_100=2.554015` (late epochs showed `train_mse=1000` fallback). Run B without EMA reached `mean_rmsd_100=2.350750`; `reports/latest_eval.json` and `artifacts/latest_eval_best_model.pt` left from Run B. On this pair, EMA best-train checkpoint hurt final RMSD vs online weights; no claim vs historical branch best `2.388103` (different wall-clock / step budget semantics).
- 2026-04-17: Fast-forward merged `attempt/post-main-doc-sync` into `main`; committed `BEST_PRACTICE.json` + `artifacts/best_model.pt` sync to anchor `mean_rmsd_100=2.350750` (command matches no-EMA residual-geodesic run).
- 2026-04-17: Branch `attempt/graph-readout-geodesic` (off updated `main`): identical budget vs anchor but `--graph-readout attention` reached `mean_rmsd_100=2.716856` with early `train_mse=1000` wall; worse than anchor `2.350750`; branch not for merging to `main`.
- 2026-04-17: Fixed `scripts/update_best_artifacts.py` to build `RFMModel` with the same flags as training (notably `gcn_residual`); best-practice trajectories had looked static because the forward pass did not match the checkpoint. Checkpoints now store RFM metadata; old runs infer from `BEST_PRACTICE.json` command when keys are absent.
- 2026-04-17: Removed all `*.pt` and `*.sdf` from git history (`git-filter-repo`); added `*.pt` to `.gitignore`; pre-commit no longer stages checkpoints. `git-filter-repo` drops `origin`—re-add with `git remote add origin <url>` before push. Pushes to existing remotes need **`git push --force-with-lease`** once because history was rewritten.
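
For readers unfamiliar with the train-loss-only early stopping controls mentioned in the log (`--early-stop-patience`, `--early-stop-min-delta`, `--early-stop-check-every`, `--early-stop-warmup`), the sketch below shows one common way such controls interact. The exact semantics in `train.py` may differ. The class name, the 1-indexed epochs, and the `early_stop` label are assumptions; only `max_epochs` appears in the log as a `stop_reason`. The stopper sees only the training loss, never `mean_rmsd_100`, in line with the policy above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TrainLossEarlyStopper:
    patience: int = 5        # non-improving checks tolerated before stopping
    min_delta: float = 0.0   # minimum decrease that counts as an improvement
    check_every: int = 1     # only evaluate the rule every N epochs
    warmup: int = 0          # epochs to skip before any check
    best: float = float("inf")
    bad_checks: int = 0

    def step(self, epoch: int, train_loss: float) -> bool:
        """Return True when training should stop after this (1-indexed) epoch."""
        if epoch <= self.warmup or epoch % self.check_every != 0:
            return False
        if train_loss < self.best - self.min_delta:
            self.best = train_loss
            self.bad_checks = 0
            return False
        # A NaN loss never compares as smaller, so it counts as a bad check.
        self.bad_checks += 1
        return self.bad_checks >= self.patience


def run(losses: list[float], stopper: TrainLossEarlyStopper) -> tuple[str, int]:
    """Replay a sequence of train losses and report (stop_reason, stop_epoch)."""
    for epoch, loss in enumerate(losses, start=1):
        if stopper.step(epoch, loss):
            return "early_stop", epoch
    return "max_epochs", len(losses)
```

With `patience=2`, `min_delta=0.01`, and `warmup=1`, a loss sequence that plateaus at `0.8` from epoch 3 onward stops at epoch 5. A sequence that keeps improving runs to `max_epochs`.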