chore: enforce train vs README split, ignore SDFs, drop tracked trajectories.

Add a pre-commit guard against staging train.py together with README.md, document the two-commit workflow, gitignore *.sdf, and remove trajectory SDFs from the index so the git history stays small.

Made-with: Cursor
2026-04-16 23:59:19 +09:00
parent 9e221c62a6
commit 8e4e38e851
5 changed files with 41 additions and 4 deletions
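
The two-commit workflow described in the message can be sketched as a small planning helper. This is only an illustration: `plan_commits` is a hypothetical name that is not part of this commit, and it assumes paths are given relative to the repository root.

```python
README = "README.md"


def plan_commits(changed: list[str]) -> list[list[str]]:
    """Split changed paths into a code commit and a README-only commit.

    Hypothetical helper for illustration; not part of the repository.
    """
    # *.sdf files are gitignored, so they never belong in either commit.
    code = [p for p in changed if p != README and not p.endswith(".sdf")]
    docs = [p for p in changed if p == README]
    # Code and eval artifacts come first; the docs-only commit follows.
    return [group for group in (code, docs) if group]


if __name__ == "__main__":
    changed = ["train.py", "reports/latest_eval.json", "README.md"]
    for files in plan_commits(changed):
        print("git add", " ".join(files), "&& git commit")
```

Keeping `README.md` in its own commit is what lets the attempt-log line be cherry-picked to `main` without bringing `train.py` along.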
--- a/.gitignore
+++ b/.gitignore
@@ -1,2 +1,5 @@
 __pycache__/
 *.pyc
+
+# Structure outputs / trajectories (large; keep out of git history)
+*.sdf
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -1,6 +1,12 @@
 repos:
  - repo: local
    hooks:
+      - id: no-train-readme-same-commit
+        name: forbid train.py + README.md in one commit
+        entry: python scripts/precommit_no_train_readme_mix.py
+        language: system
+        pass_filenames: false
+        stages: [pre-commit]
      - id: train-performance-gate
        name: train.py gate (flow all branches; RMSD gate main only)
        entry: python scripts/precommit_performance_gate.py
--- a/GUIDELINES.md
+++ b/GUIDELINES.md
@@ -9,11 +9,12 @@ Make overfitting robust and measurable, targeting `mean_rmsd_100 <= 1.0`.
 1. **Branch per line of work**: create a branch (e.g. `attempt/<topic>`) before changing `train.py` for a new experiment.
 2. Modify code/config, run training, write `reports/latest_eval.json`.
 3. Append one line to `README.md` attempt log for every attempt (success and failure).
-4. **Commit each attempt on the branch** (include `train.py`, `reports/latest_eval.json`, and README log when you touched training). Feature-branch commits are not blocked by the mean-RMSD performance gate.
+4. **Never commit `train.py` and `README.md` in the same git commit.** After a training run: (a) commit code/eval artifacts (`train.py`, `reports/latest_eval.json`, checkpoints, etc.) without `README.md`; then (b) make a **docs-only** commit that touches **only** `README.md` (attempt log line). Pre-commit enforces this split so `README.md` can be cherry-picked to `main` without dragging `train.py` along. Feature-branch commits are not blocked by the mean-RMSD performance gate.
 5. When a branch is ready to land: **merge (or cherry-pick) into `main`**. The performance gate and `BEST_PRACTICE.json` / best-artifact refresh run only on **`main`** when `train.py` is part of the commit.
 6. **`README.md` attempt log must also live on `main`**: if you only merged code later or abandoned a `train.py` merge, still **bring new `## Attempt Log` lines onto `main`** soon after (docs-only commit is fine—stage **only** `README.md` so the mean-RMSD gate does not run). Cherry-pick the README hunk from the branch or copy the lines; do not leave the canonical log only on a feature branch.
 7. **Mandatory best-update integration**: if any feature-branch attempt records a strictly better `mean_rmsd_100` than the current `main` anchor, treat it as merge-ready work. Merge/cherry-pick it into `main` promptly (do not keep a known best only on a feature branch), then continue new experiments from a fresh branch off updated `main`.
-8. **Per-attempt logging+commit is mandatory**: every experiment run must immediately (a) append its result to `README.md` and then (b) create a branch commit for that attempt before starting the next run. Do not batch multiple uncommitted runs.
+8. **Per-attempt logging+commit is mandatory**: every experiment run must immediately append its result to `README.md`, then record it in git **before** the next run, using the **separate commits** required by rule 4: code commit first, `README.md`-only commit second (same attempt, at least two commits when both files change).
+9. **SDF outputs stay out of git**: `*.sdf` is ignored; regenerate trajectories locally instead of committing structure files.

 ## Training budget and stopping

--- a/README.md
+++ b/README.md
@@ -16,7 +16,7 @@ This repository is intentionally pinned to CUDA 12.6 PyTorch wheels and matching

 - Every attempt must update this README (append a short entry in `## Attempt Log`).
 - Attempt log is mandatory for both successful and failed trials.
-- **Branch-first attempts**: do training experiments on a **feature branch**; **commit each attempt** on that branch (typically include `train.py`, `reports/latest_eval.json`, and README log for that run). Pre-commit does **not** enforce the mean-RMSD improvement rule on feature branches.
+- **Branch-first attempts**: do training experiments on a **feature branch**; **commit each attempt** as **two commits** when both `train.py` and `README.md` change: (1) `train.py` plus eval artifacts (`reports/latest_eval.json`, checkpoints, …) **without** `README.md`; (2) a **docs-only** commit with **only** `README.md` (attempt log). Pre-commit blocks staging `train.py` and `README.md` together. Pre-commit does **not** enforce the mean-RMSD improvement rule on feature branches.
 - **Main is the gate**: merging or committing to **`main`** with `train.py` staged triggers the performance gate (strictly better `mean_rmsd_100`, staged `latest_eval`, README log, auto-update of `BEST_PRACTICE.json` and best artifacts). Land work via merge or **cherry-pick** of the commits you still trust after re-evaluation.
 - **`## Attempt Log` on `main`**: new log lines written on a feature branch must be **replicated on `main`** (docs-only `README.md` commit if `train.py` is not landing yet). See `GUIDELINES.md` workflow step 6.
 - Flow-matching training time must stay random (middle-time supervision is mandatory).
@@ -36,7 +36,7 @@ This repository is intentionally pinned to CUDA 12.6 PyTorch wheels and matching
 - `reports/latest_eval.json`: most recent measured metric.
 - `artifacts/latest_eval_best_model.pt`: checkpoint from latest run that produced `latest_eval`.
 - `artifacts/best_model.pt`: best checkpoint from latest improved run.
-- `reports/trajectories/`: 6 regenerated trajectories from current best model.
+- `reports/trajectories/`: trajectory SDFs are **gitignored** (`*.sdf`); regenerate locally after training when needed.
 - `scripts/precommit_performance_gate.py`: flow-matching token check on any branch when `train.py` is staged; **mean-RMSD gate and best-artifact refresh only on `main`**.

 ## Attempt Log
--- a/scripts/precommit_no_train_readme_mix.py
+++ b/scripts/precommit_no_train_readme_mix.py
@@ -0,0 +1,27 @@
+#!/usr/bin/env python3
+"""Reject commits that stage train.py and README.md together (cherry-pick hygiene)."""
+from __future__ import annotations
+
+import subprocess
+import sys
+
+
+def main() -> int:
+    out = subprocess.check_output(
+        ["git", "diff", "--cached", "--name-only"],
+        text=True,
+    )
+    names = {line.strip() for line in out.splitlines() if line.strip()}
+    if "train.py" in names and "README.md" in names:
+        print(
+            "pre-commit: do not commit train.py and README.md in the same commit.\n"
+            "Use two commits: (1) train.py plus any eval artifacts you intend to land; "
+            "(2) README.md attempt-log line only (docs-only commit for easy cherry-pick to main).",
+            file=sys.stderr,
+        )
+        return 1
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
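
The check in `scripts/precommit_no_train_readme_mix.py` can be exercised without a git repository by separating the name parsing from the `git diff --cached --name-only` call. The following is a minimal sketch under that assumption; `mixes_train_and_readme` is a hypothetical function name, not one defined in this commit.

```python
def mixes_train_and_readme(staged_output: str) -> bool:
    """Return True if the staged-file listing names both train.py and README.md.

    `staged_output` is assumed to be the text printed by
    `git diff --cached --name-only` (one repo-relative path per line).
    """
    # Same parsing as the hook: strip each line and drop empty ones.
    names = {line.strip() for line in staged_output.splitlines() if line.strip()}
    return "train.py" in names and "README.md" in names


if __name__ == "__main__":
    print(mixes_train_and_readme("train.py\nREADME.md\n"))
    print(mixes_train_and_readme("train.py\nreports/latest_eval.json\n"))
```

Because the comparison is against exact repo-relative paths, only the top-level `train.py` and `README.md` trigger the guard; a file such as `docs/README.md` would not.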