docs: attempt log for 2026-04-17 GPU A/B and merge prep.

Made-with: Cursor
2026-04-17 10:33:29 +09:00
parent 23c43400b1
commit 2a9f08d218
1 changed files with 2 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -102,3 +102,5 @@ This repository is intentionally pinned to CUDA 12.6 PyTorch wheels and matching
 - 2026-04-16: Structural torsion head (`--torsion-head bond_pair`, GCN only): translation/rotation still use full-graph mean-pooled trunk+time; each torsion `k` runs the **same GCN weights** on the **movable-side induced subgraph** (mask only selects nodes/edges for that subgraph—mask values are not fed as features), mean-pools that subgraph, concatenates with global pooled context, `LayerNorm`, then a small MLP to one scalar. Replaced the prior mask-as-feature design. One calibration run (`epochs=320`, geodesic+residual) reached `mean_rmsd_100=2.598530` with long `train_mse=1000` plateaus; worse than best `2.388103`, likely dominated by multi-forward cost + same geodesic instability rather than readout alone.
 - 2026-04-16: bond_pair stabilization pass: subgraph batch `add_self_loops`, post-pool `LayerNorm` on subgraph embedding, small Xavier init on torsion MLP, `torch.nan_to_num` + optional output clamp (`--subgraph-torsion-clip`), and Adam param-group with `--subgraph-lr-scale` (default `0.3`) for `sub_convs`/torsion head vs main `--lr`. Smoke (48ep) avoided `1000` train spikes; full run (`epochs=320`, `lr=5.5e-4`, `subgraph_lr_scale=0.25`, `clip=6`, `eval-runs=100`) reached `mean_rmsd_100=2.606118` (still above best `2.388103`) but training telemetry stayed below the non-finite fallback wall in logged epochs.
 - 2026-04-16: bond_pair subgraph **full-graph trunk fuse**: subgraph GCN first-layer input is `concat(local_coords, full-graph node h)` (same-step trunk embedding, state-dependent—not static chemistry). Full run (`epochs=320`, same stab hyperparams as prior) yielded `mean_rmsd_100=2.620511` vs `2.606118` without fuse—slightly worse, so next axis is likely rotation/geodesic coupling or subgraph depth, not more static atom features.
+- 2026-04-16: **Post-fuse strategy sweep (automation)**: implemented `--torsion-sub-gcn-layers` (subgraph depth ≤ trunk depth) and a dedicated `sub_gcn_skip` for the subgraph stack; added `scripts/run_bond_pair_strategy_sweep.sh` to run six queued experiments in order: (E1) `rotation-loss=hybrid` α=0.5, (E2) `rotation-loss=mse`, (E3) `--torsion-sub-gcn-layers 4`, (E4) `--subgraph-lr-scale 0.4`, (E5) `--subgraph-lr-scale 0.15`, (E6) main `--lr 4.5e-4`. All six use `bond_pair`, `gcn-residual`, `omega_max_norm=5`, and stab clip `6`, with default `subgraph_lr_scale=0.25` except where overridden (E4/E5 change `subgraph_lr_scale`; E6 changes main `--lr`). Run locally: `bash scripts/run_bond_pair_strategy_sweep.sh /path/to/sample.sdf` → `reports/bond_pair_strategy_sweep.tsv`; append each `mean_rmsd_100` line to this log after the run (this agent session could not complete long GPU sweeps end-to-end).
+- 2026-04-17: GPU via project `.venv`: A/B on residual-geodesic (`gcn` hidden=512 layers=8, epochs=280, batch=24, lr=7e-4, weighted displacement, seed=1, eval-runs=100). Run A with `--ema-decay 0.999` reached `mean_rmsd_100=2.554015` (late epochs showed `train_mse=1000` fallback). Run B without EMA reached `mean_rmsd_100=2.350750`; `reports/latest_eval.json` and `artifacts/latest_eval_best_model.pt` left from Run B. On this pair, EMA best-train checkpoint hurt final RMSD vs online weights; no claim vs historical branch best `2.388103` (different wall-clock / step budget semantics).
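
Note on the EMA A/B above: the following is a minimal, dependency-free sketch of what an exponential moving average of weights does in general, assuming the conventional update `ema = decay * ema + (1 - decay) * online`. It is not taken from this repository; the names `ema_update` and `run` are hypothetical, and how `--ema-decay` is actually wired into training or checkpoint selection here is not shown in this commit. The sketch only illustrates that, with a decay near 1, the averaged copy lags the online weights, so the two can evaluate differently.

```python
def ema_update(ema, online, decay):
    """In-place EMA step: ema <- decay * ema + (1 - decay) * online."""
    for name, value in online.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value


def run(steps, decay):
    """Toy loop: one scalar 'weight' that grows by 1.0 per step.

    The increment is a stand-in for an optimizer step (an assumption
    for illustration only, not the project's training loop).
    """
    online = {"w": 0.0}
    ema = dict(online)  # EMA copy starts equal to the online weights
    for _ in range(steps):
        online["w"] += 1.0
        ema_update(ema, online, decay)
    return online["w"], ema["w"]


if __name__ == "__main__":
    online_w, ema_w = run(steps=1000, decay=0.999)
    # The EMA value trails the online value when decay is close to 1.
    print(f"online={online_w:.1f} ema={ema_w:.1f}")
```

With `decay=0.0` the averaged copy equals the online weights after every step, which is a quick sanity check when comparing an EMA run against a no-EMA run such as Run B.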