sessions/planning/TRACK_G_V1_BIDIRECTIONAL_SYNC.md
Myeongseon Choi 60a8ad1f0b docs(planning): land Track G v1 bidirectional-sync plan
Bring the post-v0.7.23 audit + redesign of Track G's `.git` sync into
the planning tree as a tracked document. The plan was authored as a
working draft from a code audit + external-tool methodology survey
(Git refspecs, VS Code/Zed remote-dev, Jujutsu's op log, Syncthing
conflict copies); committing it makes the rationale and the phased
delivery (A0 verification → A1 op log → A2 `git bundle` → A3
conflict UI) reviewable alongside the code that will eventually
implement it.

The originally co-authored Track T (Terminus pane survival) section
has been removed from this plan; that fix already shipped in commit
0e2fdd9 (`fix(sublime/terminal): pin stdio to /dev/tty +
auto_close=False`).

Wire it in:

- README planning index links the new file alongside the existing
  PYTHON_RUST_BOUNDARY / VSCODE_REMOTE_TRANSPORT_MODEL / DEEP-RESEARCH
  documents.
- BACKLOG Track G section's v1 scope paragraph points to the plan,
  so contributors landing v1 work see the architecture before
  touching the wipe-and-replace path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 09:50:51 +09:00


Track G v1 — Bidirectional .git Sync

Status: Draft plan, post-v0.7.23. Authored from a code audit + external-tool methodology survey.

Symptom triggering this plan (verbatim from test.log):

The local test branch created in Sublime Merge is still alive locally, but it is not propagated to the remote. (Translated from the Korean original.)

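The user's framing (translated from Korean): naive bidirectional sync cannot resolve the race condition, and one side ends up overwriting the other, so let's borrow the methodology of collaborative editors.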

Note: the Terminus pane-survival diagnosis that originally accompanied this audit was landed separately as commit 0e2fdd9 (fix(sublime/terminal): pin stdio to /dev/tty + auto_close=False). This document is now scoped to bidirectional .git sync only.


1. What the audit found (concrete)

Current Track G v0 architecture (commands.py:7115+):

on every mirror sync.done:
  for each discovered repo:
    1. read post-checkout marker  →  git checkout <new_head> on remote
    2. probe remote ref fingerprint; skip if unchanged
    3. tar -czf .git | base64  →  WIPE local .git  →  untar
    4. re-install post-checkout hook
    5. materialise dirty working-tree files

Three classes of problem with this model:

A. Hook-based op capture is unreliable in our environment

windows.log evidence: every git.checkout_proxy event in the trace shows proxied: false, meaning the post-checkout hook never fired during the user's test-branch reproduction. There are three plausible explanations, and they are not mutually exclusive:

  • A1. Sublime Merge uses libgit2 internally, and libgit2 does not invoke client-side hooks by default. This means a fundamental class of user actions performed via Sublime Merge — git checkout, git checkout -b, branch deletes — never hit our hook. If true, the entire marker-based mechanism is dead-on-arrival for our primary user.
  • A2. post-checkout only fires on checkout; git branch -d, git branch -m, and git commit never trigger it regardless of front-end. Failure modes #2 (delete), #4 (commit) are uncovered by design.
  • A3. Multiple checkouts in quick succession overwrite the single per-repo marker file (git_branch_proxy.py keeps one marker per repo); intermediate states are lost.

A1 is the load-bearing finding. Before any plan that depends on hooks, we have to verify it. But we should plan as if it's true, because the alternative — building a working hook around libgit2 — is harder than removing the dependency on hooks entirely.

B. Wipe-and-replace is structurally hostile to local writes

git_dot_git_sync.py:194 — every fetch tick removes the entire local .git (preserving only SESSIONS_PENDING_CHECKOUT) and replaces it with the remote tarball. Anything the local user wrote into .git between fetches that isn't on the preserved-files list is destroyed:

  • A branch ref the user created locally that doesn't exist on remote yet (failure mode #1).
  • A commit object Sublime Merge wrote locally that hasn't been pushed (failure mode #4).
  • Stash entries, reflog entries, refs/notes entries.

The v0.7.23 mirror-boundary fix prevents the outer mirror from pruning .git, but the inner tar replace still does the same damage. This is the single biggest correctness hole.

C. No three-way diff over ref state

Track G has no memory of "what the local refs looked like at the end of the last successful refresh." Without that, it can't tell:

  • Did the user create refs/heads/test locally? Or did remote have it last time and we just lost it?
  • Did the user delete refs/heads/feature/old? Or is it just absent from the remote and we should let it be?

Every failure mode reduces to "we couldn't tell who changed what since the last sync."

2. What we steal from the methodology survey

The survey (full report in research notes) covered Git refspecs, VS Code/Zed/Gateway remote-dev, CRDTs, OT, file-sync conflict copies, and Jujutsu. The honest landings:

  • CRDTs: wrong tool. Ref state is a CAS-on-pointers problem under structural constraints, not a free-form text merge. Adopting Automerge here multiplies storage and replaces a tractable problem (Git already solved it) with an intractable one (semantic merge of pointer values).
  • Headless backends (VS Code Server, Zed Headless, JetBrains Backend): foreclosed. Sublime Merge is a separate native app that wants a real on-disk .git; the whole reason we have a local mirror is to feed it. The headless answer would invalidate the project.
  • OT: the algorithm doesn't apply (refs aren't a stream of insert/delete ops), but the central-arbitrator pattern does — and we already have one (the remote box's .git).
  • Git's own model: directly applicable. Two clones of the same repo never silently overwrite each other because of refspec namespacing + fast-forward checks + --force-with-lease. We are reinventing this badly.
  • File-sync conflict copies (Syncthing/Dropbox): directly applicable for the working-tree edge cases.
  • Jujutsu's operation log: directly applicable as the foundation we're missing.

3. The redesign — three changes, in dependency order

Change #1 — Op log + ref snapshot at every refresh boundary (foundation)

Promoted from "safety net" to foundation because of finding A1: without reliable hooks, we have to detect ref-state changes by polling, and polling needs a baseline.

Add a sessions-owned sidecar under each repo: .git/sessions/op-log.jsonl and .git/sessions/last-snapshot.json. The snapshot stores {ref_name → sha} and the symbolic HEAD target for both local and remote at the end of the last successful refresh.

each refresh tick (per repo):
  before     = read_snapshot()                            # {local: {refs}, remote: {refs}}
  local_now  = read_local_refs()                          # cheap: walk refs/heads/*
  remote_now = exec(host, "git for-each-ref ... ; HEAD")  # cheap: one exec/once

  diff = three_way(before, local_now, remote_now)
  apply(diff)                                             # ← Changes #2 + #3
  write_snapshot({local: local_now, remote: remote_now})
  append op_log({ts, diff, actions, errors})
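The `read_local_refs()` and `remote_now` steps above could share one parser. The following is a minimal sketch, assuming the probe runs `git for-each-ref --format='%(objectname) %(refname)' refs/heads` and then `git symbolic-ref -q HEAD`; the function name `parse_ref_listing` and the return shape are illustrative and not taken from the codebase.

```python
from typing import Dict, Optional, Tuple


def parse_ref_listing(listing: str, head_line: str = "") -> Tuple[Dict[str, str], Optional[str]]:
    """Parse `git for-each-ref --format='%(objectname) %(refname)'` output.

    `head_line` is the stdout of `git symbolic-ref -q HEAD` (empty when
    HEAD is detached). Returns ({refname: sha}, symbolic_head_or_None).
    """
    refs: Dict[str, str] = {}
    for line in listing.splitlines():
        line = line.strip()
        if not line:
            continue
        sha, _, name = line.partition(" ")
        if name:  # ignore malformed lines that carry no ref name
            refs[name] = sha
    head = head_line.strip() or None
    return refs, head
```

Running `git for-each-ref` on the local side as well, rather than walking the `refs/heads/` directory, also covers refs that exist only in `packed-refs`.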

The diff classifies every ref into one of:

  • unchanged — both sides match the snapshot. Skip.
  • local_only_new — local has it, remote doesn't, snapshot didn't have it on either. User created. Action in Change #2.
  • local_only_deleted — snapshot had it on both, neither has it now. (Edge case — only happens if user deleted on both sides between ticks.)
  • local_deleted — snapshot had it on local, local doesn't. User deleted. Action in Change #2.
  • remote_only_new — remote has it, local doesn't, snapshot didn't have it. Remote teammate created. Mirror into local.
  • remote_deleted — snapshot had it on remote, remote doesn't. Mirror local prune.
  • local_advanced — local SHA is descendant of snapshot SHA, remote SHA == snapshot SHA. User committed. Action in Change #2.
  • remote_advanced — same on remote side. Fast-forward local.
  • diverged — both sides moved differently. Surface to user; do nothing automatic. Action in Change #3.
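The classification above can be sketched as a pure function over four values per ref. This is an illustrative sketch, not the planned implementation: `classify_ref` and its signature are hypothetical, `None` stands for "ref absent", and `is_ancestor(a, b)` is assumed to wrap something like `git merge-base --is-ancestor a b`. Any case the list does not name (for example, both sides creating the same ref) falls through to `diverged`.

```python
from typing import Callable, Optional

Sha = Optional[str]  # None means the ref does not exist on that side


def classify_ref(snap_local: Sha, snap_remote: Sha, local: Sha, remote: Sha,
                 is_ancestor: Callable[[str, str], bool]) -> str:
    """Three-way classification of one ref against the last snapshot."""
    if local == snap_local and remote == snap_remote:
        return "unchanged"
    if snap_local is None and snap_remote is None:
        if local is not None and remote is None:
            return "local_only_new"
        if remote is not None and local is None:
            return "remote_only_new"
        return "diverged"  # assumption: created on both sides since the snapshot
    if snap_local is not None and snap_remote is not None and local is None and remote is None:
        return "local_only_deleted"
    if snap_local is not None and local is None and remote == snap_remote:
        return "local_deleted"
    if snap_remote is not None and remote is None and local == snap_local:
        return "remote_deleted"
    if (remote == snap_remote and snap_local is not None and local is not None
            and is_ancestor(snap_local, local)):
        return "local_advanced"
    if (local == snap_local and snap_remote is not None and remote is not None
            and is_ancestor(snap_remote, remote)):
        return "remote_advanced"
    return "diverged"
```

Keeping the ancestry check injectable means the classifier can be unit-tested without a repository, which suits the dark-launch phase.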

The op log is append-only JSONL, rotated at N=1000 lines or 30 days. It backs the "Sessions: Show Last Sync" command, which reads the most recent entry and can restore local ref state via git update-ref (scope described under "On undo" below). Critically, it gives us debuggability: when refs vanish, we know which tick wiped them.
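An append with line-count rotation might look like the sketch below. It is a sketch under stated assumptions: `append_op` is a hypothetical helper, the entry fields are whatever the caller passes, and only the 1000-line limit is handled (the 30-day age limit is left out for brevity).

```python
import json
import os
import time
from pathlib import Path

MAX_LINES = 1000  # rotation threshold named in the plan


def append_op(log_path: Path, entry: dict, max_lines: int = MAX_LINES) -> None:
    """Append one JSON record, then trim the log to its newest `max_lines` lines."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    record = dict(entry)
    record.setdefault("ts", time.time())
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # the record must be durable before the action runs
    lines = log_path.read_text(encoding="utf-8").splitlines(keepends=True)
    if len(lines) > max_lines:
        tmp = log_path.with_name(log_path.name + ".tmp")
        tmp.write_text("".join(lines[-max_lines:]), encoding="utf-8")
        os.replace(tmp, log_path)  # swap in the trimmed file atomically
```

Re-reading the file on every append is acceptable here only because the log is capped at a small size.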

Invariants:

  • Every ref-mutating action writes to the log before the action (write-ahead).
  • The log lives under .git/sessions/ so git itself ignores it.
  • Snapshots are atomic: write to last-snapshot.json.tmp, fsync, rename.
  • The whole read snapshot → diff → apply → write snapshot sequence runs under a per-repo flock on .git/sessions/refresh.lock. Sessions stacks overlapping refresh ticks (the mirror_queue evidence in windows.log shows multiple dequeue events for the same workspace within the same second); without the lock, two ticks read the same baseline, both compute "local_only_new" for the same ref, both call update-ref with the same expected_old, the second's CAS fails, and the diff classifier treats it as divergence — false-positive UI noise that trains users to dismiss real divergence. The lock is fcntl.flock(LOCK_EX | LOCK_NB); on contention skip the tick (the next one picks up the new state). This is not deferred to v1+; it's part of Change #1 itself.
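The atomic-snapshot and lock invariants can be illustrated as follows. This is a POSIX-only sketch (it relies on `fcntl`); `write_snapshot` and `refresh_lock` are hypothetical names, and the paths are those listed in the invariants.

```python
import fcntl
import json
import os
from contextlib import contextmanager
from pathlib import Path


def write_snapshot(path: Path, snapshot: dict) -> None:
    """Write to <name>.tmp, fsync, then rename over the real file."""
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(snapshot, fh, sort_keys=True)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)  # readers see either the old or the new snapshot


@contextmanager
def refresh_lock(lock_path: Path):
    """Yield True if this tick holds the lock, False if another tick does."""
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False  # contention: the caller skips this tick
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A tick would wrap the whole read, diff, apply, write sequence in `with refresh_lock(...) as held:` and return immediately when `held` is false.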

On "undo". The op log enables a forensic command — Sessions: Show Last Sync — that displays the previous tick's diff and resulting ref state side-by-side, lets the user copy SHAs, and offers a local-only "restore local refs from snapshot" action. It does not undo remote-side changes that have already been pushed (those may have been built on by other consumers; rolling them back via --force-with-lease is a separate user-driven decision, not a button in the editor). The naming reflects this: forensic + local-restore, not "undo." If users need remote rollback they run git push --force-with-lease themselves with the SHA the readout gave them.

Change #2 — Replace tar wipe with git bundle over the existing bridge (eliminates the wipe)

Borrow Git's own model. After Change #1's diff classifies what happened, perform the actual sync via Git primitives instead of tar-replace.

Transport choice. The Rust bridge today is exec/once only — single round-trip argv → {exit_code, stdout, stderr}. There is no streaming/duplex endpoint. That rules out git fetch ssh://host/path through the bridge (pack-protocol needs a duplex pipe), and it rules out git fetch ssh://... running its own SSH child too — that path would respawn ssh outside the bridge's ControlMaster on every refresh, regressing the v0.7.21 askpass-flash fix and racing the bridge's auth state.

The right primitive is git bundle:

  • git bundle create - <rev-list-args> packs refs + objects into a single self-contained file written to stdout. Fits the existing exec/once shape (one argv, one stdout payload, one timeout), exactly what we already use for the tar -czf .git | base64 path, just with a vastly smaller payload because bundles only contain the requested refs plus reachable objects.
  • Bundles support incremental ranges: git bundle create - <ref> ^<last_seen_sha> writes only objects new since the snapshot. The positive argument has to be a ref name; given only a bare SHA, git refuses to create an empty bundle. Steady-state bandwidth drops from "26 MB tar" to "kilobytes of new commits."
  • Local apply: git bundle unbundle <file> reads the bundle, stores the new objects, and prints the refs the bundle carries. It does not move any ref itself; the reconciler applies the reported SHAs with git update-ref. No streaming required either way.
on remote (one exec/once per refresh):
  set sessions-scoped config (idempotent, one-time per repo):
    git config receive.denyCurrentBranch updateInstead
  for each ref in diff.local_only_new ∪ diff.local_advanced:
    # Send local commits + ref to remote. Reuse `git bundle` in the
    # other direction: build bundle locally, ship to remote, unbundle.
    # (`bundle create` needs a ref name; a bare SHA gives an empty bundle.)
    local: git bundle create - refs/heads/<name> ^<snapshot_sha_or_empty>
                                      | base64 -w0     →  tx
    remote (via exec/once):
      printf %s "<bundle_b64>" | base64 -d | git -C <root> bundle unbundle /dev/stdin <ref>
      git update-ref -m "sessions sync" refs/heads/<name> <local_sha> <snapshot_sha>   # CAS
  for each ref in diff.local_deleted:
    remote: git update-ref -d refs/heads/<name> <snapshot_sha>                          # CAS
  for the active HEAD checkout (the post-checkout case):
    if user moved HEAD locally: git -C <root> checkout <new_head>                       # current behaviour, kept

on local (replaces the tar pull):
  remote (one exec/once):
    git -C <root> bundle create - --branches \
        $(for r in <changed_refs>; do printf '^%s ' "<snapshot_sha_for_$r>"; done)
        | base64 -w0
  local:
    base64 -d | git -C <local-mirror> bundle unbundle /dev/stdin
    # `bundle unbundle` only stores the objects and prints "<sha> <refname>"
    # for each ref in the bundle; it does not write any ref. The reconciler
    # therefore records each reported refs/heads/<name> under the sessions
    # namespace itself, so refs/heads/* is never written by this step:
    git update-ref refs/sessions/<host>/heads/<name> <sha>
  for each ref in diff.remote_only_new ∪ diff.remote_advanced:
    git update-ref refs/heads/<name> refs/sessions/<host>/heads/<name>   # only if local is ancestor (FF)
  for each ref in diff.remote_deleted:
    git update-ref -d refs/heads/<name> <snapshot_sha>                   # CAS
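The remote-side bundle command and the ref-name mapping in the flow above can be built with two small helpers. They are hypothetical (neither `incremental_bundle_argv` nor `tracking_ref` exists in the codebase) and only construct arguments; running them through exec/once is left to the caller.

```python
from typing import Dict, List, Optional


def incremental_bundle_argv(root: str, refs: List[str], snapshot: Dict[str, str]) -> List[str]:
    """argv for `git bundle create -` limited to what changed since the snapshot.

    `refs` must be ref names that actually moved; if every ref is fully
    excluded by its own snapshot SHA, git refuses to create an empty bundle.
    """
    argv = ["git", "-C", root, "bundle", "create", "-"]
    argv += refs
    seen = sorted({snapshot[r] for r in refs if r in snapshot})
    argv += ["^" + sha for sha in seen]  # exclude history the receiver already has
    return argv


def tracking_ref(host: str, refname: str) -> Optional[str]:
    """Map refs/heads/<name> to refs/sessions/<host>/heads/<name>; None for other refs."""
    prefix = "refs/heads/"
    if not refname.startswith(prefix):
        return None
    return f"refs/sessions/{host}/heads/{refname[len(prefix):]}"
```

Passing an argv list rather than a shell string avoids quoting problems with branch names that contain shell metacharacters.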

Notes:

  • The refs/sessions/<host>/heads/* namespace gives Sublime Merge an explicit, separate view of the remote tracking refs. It also means we never write into refs/heads/* except through fast-forward/CAS, so user-created branches survive every refresh by construction.
  • refs/heads/* becomes user-territory; the sync layer only proposes changes there via the diff classifier. Fast-forwards apply automatically; divergence surfaces to UI (Change #3).
  • --force-with-lease-equivalent for ref updates: git update-ref -m "sessions sync" <ref> <new_sha> <expected_old_sha>. Atomic CAS primitive. If the expected-old check fails (someone moved the ref between our snapshot and our update), abort and treat as diverged.
  • Initial seed. First sync after migration: snapshot is empty, bundles are full ref histories. Same one-shot cost as the v0 tar pull, never repeated. Backfill refs/sessions/<host>/heads/* from this first bundle.
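The CAS-guarded updates described in these notes map onto `git update-ref` arguments as sketched below. The helper names are hypothetical. The all-zero old value is the documented way to require that the ref does not exist yet; the 40-character form assumes a SHA-1 repository.

```python
from typing import List, Optional

ZERO_OID = "0" * 40  # "ref must not exist yet" (SHA-1 object format assumed)


def cas_update_argv(root: str, ref: str, new_sha: str, expected_old: Optional[str]) -> List[str]:
    """Create or move `ref` only if it still points at `expected_old`."""
    old = expected_old if expected_old is not None else ZERO_OID
    return ["git", "-C", root, "update-ref", "-m", "sessions sync", ref, new_sha, old]


def cas_delete_argv(root: str, ref: str, expected_old: str) -> List[str]:
    """Delete `ref` only if it still points at `expected_old`."""
    return ["git", "-C", root, "update-ref", "-d", ref, expected_old]


def cas_outcome(exit_code: int) -> str:
    """A failed CAS is reported as divergence, never retried with force."""
    return "applied" if exit_code == 0 else "diverged"
```

For a `local_only_new` ref the snapshot has no old value, so `expected_old=None` turns into the zero OID and the update fails if someone created the same branch on the remote in the meantime.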

updateInstead is not a free pass. It updates the working tree only when the index and worktree are clean relative to HEAD; if either differs from HEAD, the push is refused. So even with the config flipped, a remote with edits in flight on the active branch refuses our update. Explicit handling:

  • The CAS-guarded ref update writes the proposed new_sha into the remote's ref store regardless of working-tree state — update-ref doesn't touch the worktree.
  • The separate working-tree update (the "make the worktree reflect the new HEAD" step, equivalent to git checkout) is the part that fails on dirty trees. That's the existing G6 path.
  • Therefore: split the proxy into ref-mutation (always proceeds via CAS) and worktree-mutation (subject to dirty-tree rejection, retried on next tick). When the worktree update is deferred, the ref already advanced — for-each-ref reports the new tip, the local mirror sees it on the next refresh, but the remote worktree still shows the old contents until the user resolves dirty state. Surface this state explicitly: status bar "Branch advanced; remote worktree out of sync (dirty): <files>".

Failure modes addressed. This Change kills failure modes #1 (local-only branches survive — they live in refs/heads/* which is never wiped), #2 (deletion is detected via the diff and propagated via CAS-guarded update-ref -d), #3 (CAS via expected_old_sha rejects concurrent moves), and #4 (commit objects are bundled and unbundled before any clobber risk). Failure mode #5 stays for Change #3.

Change #3 — Conflict-copy semantics + divergence UI (closes the working-tree edge case)

Two narrow additions for the cases Change #2 surfaces but doesn't auto-resolve:

during materialise(file):
  if local.mtime > last_fetch.mtime
     and hash(local) != hash(remote)
     and hash(local) != hash(last_fetched_remote_for_this_path):
       write remote bytes to <file>.sessions-conflict-<ts>
       leave <file> alone
       enqueue notification

during reconcile_ref where diff == "diverged":
  status bar:
    "Branch <name> diverged: local=<short_sha> remote=<short_sha>.
     Run `Sessions: Resolve Diverged Refs` to choose."
  command-palette resolution prompt: [Keep local | Take remote | Open Sublime Merge]

<file>.sessions-conflict-<ts> is added to .gitignore automatically by Sessions (one-time append on first conflict). Resolution is always user-driven; the sync layer never auto-resolves a divergence.
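The materialise rule above reduces to a three-part predicate plus a naming helper. This is a sketch with hypothetical names (`needs_conflict_copy`, `conflict_path`); SHA-256 is chosen here only for illustration, since the plan does not name a hash.

```python
import hashlib
from pathlib import Path
from typing import Optional


def _digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def needs_conflict_copy(local_mtime: float, last_fetch_mtime: float,
                        local: bytes, remote: bytes,
                        last_fetched_remote: Optional[bytes]) -> bool:
    """True when the local file was edited after the last fetch and now
    differs from both the incoming bytes and the bytes fetched last time."""
    if local_mtime <= last_fetch_mtime:
        return False  # untouched locally: safe to overwrite
    local_hash = _digest(local)
    if local_hash == _digest(remote):
        return False  # both sides already agree
    if last_fetched_remote is not None and local_hash == _digest(last_fetched_remote):
        return False  # mtime moved but content is what we fetched before
    return True


def conflict_path(path: Path, ts: int) -> Path:
    """<file>.sessions-conflict-<ts>, next to the original file."""
    return path.with_name(f"{path.name}.sessions-conflict-{ts}")
```

The third check is what keeps a mere `touch` of the file from producing a conflict copy.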

4. What we explicitly do not do

  • No CRDT for refs. Wrong tool, wrong constraints.
  • No CRDT for working-tree text. Sublime doesn't expose buffer state as a manipulable structure; we'd be shipping a parallel editor. Conflict-copy is the right depth.
  • No headless backend. Foreclosed by Sublime Merge's local-.git requirement.
  • No live ref polling between refresh ticks. The existing refresh cadence is good enough; adding an inotify or filesystem watcher is scope-creep until we have a concrete user complaint about latency.
  • No replacement of the post-checkout hook proxy. Keep it as a latency optimisation — when it does fire (real git binary, e.g., user runs git checkout in a terminal against the local mirror), the marker gives us sub-second response. When it doesn't fire (libgit2 inside Sublime Merge), the polling diff in Change #1 catches it on the next tick. Belt + suspenders.

5. Phased delivery

| Phase | Scope | Ships fixes for |
|-------|-------|-----------------|
| A0 | Verify finding A1: does Sublime Merge fire client-side hooks? See §5.1 protocol below | (decides A1+ rationale) |
| A1 | Change #1: op log + snapshot. Pure addition; no behaviour change. Lets us see what's happening. | Debuggability, not user-visible |
| A2 | Change #2: git bundle sync replaces tar wipe. Largest single change. | Failure modes #1, #2, #3, #4 |
| A3 | Change #3: conflict copies + divergence UI | Failure mode #5, makes A2's diverged-branch case actionable |

A0 must complete before A2 design is finalised (it changes the rationale, not the design). A1 ships first because it's pure addition with no risk. A2 + A3 ship together because A3 closes the UX hole A2 opens.

5.1 A0 verification protocol

Sublime Merge has multiple branch-mutation entry points and may use different code paths for each (libgit2 vs shell-out can vary by operation, by platform, and by Sublime Merge version). A one-bit "did we see a marker" answer doesn't generalise. Run the matrix:

  • Sublime Merge build to test against: the latest stable on the user's primary platform. Record the build number in the report.
  • Setup per repo: install_post_checkout_hook writes the v0 hook; tail <.git>/SESSIONS_PENDING_CHECKOUT and the hook's stderr (redirect via exec 2>>/tmp/sessions-hook-trace.log in the hook).
  • Operations to exercise (in order, fresh marker between each):
    1. Branch checkout — sidebar double-click on an existing branch.
    2. Branch checkout — command palette Switch Branch.
    3. Branch checkout — context menu on a commit, "Checkout Commit."
    4. Branch create — sidebar "New Branch" dialog.
    5. Branch create — git checkout -b from the embedded terminal (control: this must fire the hook; if it doesn't, the hook itself is broken, not Sublime Merge).
    6. Branch delete — sidebar right-click "Delete."
    7. Commit — stage + commit a small change.
    8. Push — push that commit.
  • Per-operation record: marker file present (Y/N), marker contents (paste verbatim if Y), hook stderr (paste).

Outcomes that change the plan:

  • Hook fires for ops 1–4: A1 is partially false; we have a real ops-capture channel for the user's primary path. Plan rationale shifts but Change #1 (polling diff) is still valuable as backstop for delete/commit/push.
  • Hook fires only for op 5 (the control): A1 is true for Sublime Merge entirely; Change #1 becomes the sole capture mechanism, hook stays for terminal users only.
  • Hook fires for none, including op 5: the hook installation itself is broken; investigate that first before any A1 conclusion.
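To keep the A0 report mechanical, the three outcomes can be derived from the per-operation record. This is a sketch: `interpret_a0` and its return labels are hypothetical, and the `"mixed"` label for results the list above does not cover is an assumption of the sketch, not part of the protocol.

```python
from typing import Dict

GUI_CHECKOUT_OPS = (1, 2, 3, 4)   # Sublime Merge checkout/create entry points
CONTROL_OP = 5                    # `git checkout -b` from the terminal
ALL_OPS = tuple(range(1, 9))


def interpret_a0(fired: Dict[int, bool]) -> str:
    """Map {operation number: hook fired?} to one of the plan's outcomes."""
    if not any(fired.get(op, False) for op in ALL_OPS):
        return "hook-install-broken"   # nothing fired, not even the control
    if all(fired.get(op, False) for op in GUI_CHECKOUT_OPS):
        return "a1-partially-false"    # hooks do cover the primary GUI path
    others = [op for op in ALL_OPS if op != CONTROL_OP]
    if fired.get(CONTROL_OP, False) and not any(fired.get(op, False) for op in others):
        return "a1-true"               # only the real git binary fires hooks
    return "mixed"                     # assumption: needs per-operation review
```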

6. Risks & open questions

  1. A0 outcome. If Sublime Merge does fire hooks (we were wrong about libgit2), Change #1's polling diff is still a strict improvement, but the urgency drops. Plan stays the same; rationale shifts.

  2. receive.denyCurrentBranch=updateInstead surprise. Mutates the user's remote git config. Mitigation: scope per-repo, surface a one-time notification, document in release notes, support opt-out (fall back to current git checkout proxy).

  3. Object-pack push size. First sync after adopting Change #2 will push any local-only commits the user accumulated under v0. Could be tens of MB. Mitigation: gate behind a dry-run + confirm.

  4. Migration from existing wiped-and-restored .git directories. Some installs will have refs/sessions/<host>/* empty until the first Change #2 fetch. Backfill on first run; idempotent.

  5. Worktree (.git file) repos — still v1+, deferred. Track G v0 already filters these out (commands.py:7167). No regression.

  6. Op-log size on busy repos: refs/heads/* with thousands of entries × N refresh ticks. Mitigation: log only the diff (typical: 0–3 entries per tick), rotate at 1000 lines.

  7. Concurrent Sessions instances on the same workspace — two editors open against one host. Today: undefined. Post-A2: each instance's per-repo flock (Change #1 invariant) serialises refresh ticks within an editor; cross-editor contention is also covered because flock is at the OS level on the same .git/sessions/refresh.lock file. The losing instance skips its tick and picks up state on the next one.

  8. Critic adjudication notes (post-review). This plan was reviewed adversarially before sign-off. The top issue raised — "Change #2's transport story is incoherent" — is addressed by switching from git fetch ssh://... to git bundle over the existing exec/once bridge (§3 Change #2 transport choice). Other significant issues addressed inline: denyCurrentBranch=updateInstead on dirty trees (§3 "updateInstead is not a free pass"), concurrent refresh atomicity promoted from v1+ to v1 invariant (§3 Change #1 invariants, risk #7), A0 verification protocol made explicit (§5.1), "Undo Last Sync" renamed to forensic "Show Last Sync" (§3 Change #1, "On undo"). Outstanding from the review: bandwidth estimate for for-each-ref polling (low priority — order-of-magnitude analysis can land with the A1 implementation; if a thousand-ref repo crosses 100 KB/tick we'll add response compression).
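For the outstanding bandwidth question, a rough order-of-magnitude estimate is easy to compute. The sketch assumes an uncompressed listing in `%(objectname) %(refname)` format with 40-character SHA-1 hex IDs and ignores any framing the exec/once bridge adds; the branch names are invented for illustration.

```python
from typing import Iterable


def listing_bytes(refnames: Iterable[str], oid_hex_len: int = 40) -> int:
    """Bytes of `<oid> <refname>\n` lines for the given ref names."""
    return sum(oid_hex_len + 1 + len(name.encode("utf-8")) + 1 for name in refnames)


# 1000 hypothetical branches with 30-character ref names.
refs = [f"refs/heads/feature/branch-{i:04d}" for i in range(1000)]
total = listing_bytes(refs)
print(total)  # 72000 bytes, i.e. 72 bytes per ref
```

Under these assumptions a thousand-ref repository stays below the 100 KB/tick threshold, though longer branch names or SHA-256 object IDs would move the figure.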


7. Why this is shippable

  • A1 is a pure addition (no behaviour change). Ships behind a feature flag, dark-launches the diff classifier.
  • A2's footprint replaces git_dot_git_sync.py:_replace_local_dot_git (one ~100-line function) with a git bundle round-trip + a small reconciler. The total spec is smaller than what we have.
  • A3 is two narrow additions, both cheap.
  • Every change is independently reversible: feature flag at the workspace-state level, fall back to v0 tar-wipe for the duration of a release if A2 ships broken.

The single most important sentence in this plan: stop wiping .git. Every other recommendation flows from that, and from the realisation that hooks are a latency optimisation, not the primary ops capture.