OUT-3645: tombstone-based sync recovery + ON CONFLICT alignment #106
Greptile Summary
This PR replaces ad-hoc retry logic with a tombstone-based recovery model: a
Confidence Score: 3/5
The sweep retry path can create duplicate Assembly files when a remote API call succeeds but the DB update fails; this is a real defect on the newly added retry code path that warrants a fix before deploy.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Sched as retryFailedSyncsSchedule
participant RS as ResyncService
participant DB as Database
participant TDev as Trigger.dev
participant Helper as retryFailedSyncsForPortal
participant Sync as SyncService
participant API as Assembly / Dropbox API
Sched->>RS: resyncFailedFiles()
RS->>DB: findFailedSyncs()
DB-->>RS: rows[]
RS->>RS: group rows by portalId
RS->>TDev: resyncFailedFilesInAssembly.trigger(portalId, rows)
TDev->>Helper: retryFailedSyncsForPortal(portalId, rows)
Helper->>Helper: initializeSyncDependencies()
loop for each failedSync row
Helper->>Sync: markAttempt(id, action, target)
alt "action=CREATE target=DROPBOX"
Helper->>API: copilotApi.retrieveFile
Helper->>Sync: completePendingDropboxCreate
Sync->>API: createAndUploadFileInDropbox [idempotent]
Sync->>DB: markUpdated(id, dbxFileId)
else "action=CREATE target=ASSEMBLY"
Helper->>API: getFileFromDropbox
Helper->>Sync: completePendingAssemblyCreate
Sync->>API: copilotApi.createFile [NOT idempotent]
Sync->>DB: markUpdated(id, assemblyFileId)
else "action=DELETE"
Helper->>API: deleteFile or filesDeleteV2
Helper->>DB: markDeleted(id)
end
note over Helper,DB: on error: markFailure(id, message)
end
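The per-row dispatch in the diagram above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `FailedSyncRow` and `RetryDeps` shapes and the `retry*` function names are hypothetical, and only `markAttempt` / `markFailure` and the `(action, target)` branching come from the diagram.

```typescript
// Hypothetical shapes: only markAttempt / markFailure and the (action, target)
// branching come from the diagram; all signatures here are assumptions.
type PendingAction = "CREATE" | "DELETE";
type PendingTarget = "DROPBOX" | "ASSEMBLY";

interface FailedSyncRow {
  id: string;
  action: PendingAction;
  target: PendingTarget;
}

interface RetryDeps {
  markAttempt(id: string, action: PendingAction, target: PendingTarget): Promise<void>;
  markFailure(id: string, message: string): Promise<void>;
  retryDropboxCreate(row: FailedSyncRow): Promise<void>;
  retryAssemblyCreate(row: FailedSyncRow): Promise<void>;
  retryDelete(row: FailedSyncRow): Promise<void>;
}

async function retryRows(rows: FailedSyncRow[], deps: RetryDeps): Promise<void> {
  for (const row of rows) {
    try {
      // Record the attempt first so the backoff window starts before any side effect.
      await deps.markAttempt(row.id, row.action, row.target);
      if (row.action === "DELETE") {
        await deps.retryDelete(row);
      } else if (row.target === "DROPBOX") {
        await deps.retryDropboxCreate(row);
      } else {
        await deps.retryAssemblyCreate(row);
      }
    } catch (err) {
      // One failure record per row per sweep; the loop continues with the next row.
      const message = err instanceof Error ? err.message : String(err);
      await deps.markFailure(row.id, message);
    }
  }
}
```

Catching per row means one failing file does not block the rest of the portal's batch.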
Reviews (3): Last reviewed commit: "fix(OUT-3645): reduce sweeper cadence to..."
…re-insert creates

Make failed sync operations between Dropbox and Assembly durable and recoverable in either direction:
- Per-file delete handlers wrap the side effect with a tombstone (pending_action* columns), soft-deleting on success and leaving the row for the sweeper on failure.
- Create handlers (Assembly→Dropbox in syncAssemblyFilesToDropbox and Dropbox→Assembly leaf-file in createAndUploadFileToAssembly) now pre-insert the mapping row with a create-pending tombstone before the target-side API call. Race protection via INSERT ... ON CONFLICT DO NOTHING against the two partial unique indexes.
- A Trigger.dev schedule (*/15 * * * *) sweeps rows with active tombstones (subject to MAX_ATTEMPTS=10 and per-attempt backoff) and dispatches by (action, target) to the appropriate retry helper.
- Update path keeps the existing delete+sync chain composition to avoid webhook ping-pong (FileDeleted/FileCreated with stale source IDs mid-flight).

Cleanups:
- Removed Vercel cron /api/workers/resync-failed-files and CRON_SECRET.
- Removed recoverLegacySync and the legacy contentHash IS NULL branch in findFailedSyncs. Production cleanup of pre-PR-2 partial-create rows is handled separately via a one-time SQL script.
- Shared normalizeError utility at src/utils/normalizeError.ts.

Tests added for MapFilesService tombstone helpers and ResyncService per-portal fan-out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
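As a rough illustration of the delete-handler behaviour described in this commit, the sketch below wraps a remote delete with a tombstone on an in-memory row. The `SyncRow` shape, the `deleteWithTombstone` name, and the choice to bump the attempt counter before the side effect are assumptions; only the `pending_action*` column idea and the soft-delete-on-success / leave-for-sweeper-on-failure outcome come from the commit message.

```typescript
// In-memory stand-in for a file_folder_sync row. Field names loosely follow the
// pending_action* columns; the exact schema is an assumption.
interface SyncRow {
  id: string;
  deletedAt: Date | null;
  pendingAction: "CREATE" | "DELETE" | null;
  pendingActionAttempts: number;
  pendingActionLastError: string | null;
}

async function deleteWithTombstone(
  row: SyncRow,
  deleteRemote: () => Promise<void>, // hypothetical target-side delete call
): Promise<void> {
  // 1. Write the tombstone before the side effect so a crash still leaves a trace.
  row.pendingAction = "DELETE";
  row.pendingActionAttempts += 1;
  try {
    await deleteRemote();
    // 2. Success: soft-delete the row and clear the tombstone.
    row.deletedAt = new Date();
    row.pendingAction = null;
    row.pendingActionLastError = null;
  } catch (err) {
    // 3. Failure: keep the tombstone so the sweeper can pick the row up later.
    row.pendingActionLastError = err instanceof Error ? err.message : String(err);
  }
}
```

The point of the ordering is that the database always knows an operation was intended, whether or not the remote call completed.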
The partial unique indexes on file_folder_sync were narrowed in OUT-3778
to also require the indexed file_id column IS NOT NULL. ON CONFLICT
clauses targeting those indexes must repeat the same predicate or
Postgres throws "no unique or exclusion constraint matching the
ON CONFLICT specification" at runtime.
In insertCreatePending, derive both the conflict columns and the WHERE
clause from `isDropboxTarget`:
- target=DROPBOX → match (portal, channel, assembly_file_id) index,
WHERE deleted_at IS NULL AND assembly_file_id IS NOT NULL
- target=ASSEMBLY → match (portal, channel, dbx_file_id) index,
WHERE deleted_at IS NULL AND dbx_file_id IS NOT NULL
Semantically the IS NOT NULL is always satisfied at insert time (the
caller pre-populates whichever id corresponds to the target side); the
clause exists only so Postgres can resolve the partial index.
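
To make the predicate matching concrete, here is a small sketch that builds only the conflict clause as SQL text. It is not the drizzle code from this PR: the `conflictClause` function is hypothetical, and the column names are the ones listed for the two partial indexes in the PR description.

```typescript
type Target = "DROPBOX" | "ASSEMBLY";

// Hypothetical helper: returns the ON CONFLICT clause text for a given target.
// The WHERE predicate must repeat the partial index predicate exactly, otherwise
// Postgres cannot infer which unique index the clause refers to.
function conflictClause(target: Target): string {
  // target=DROPBOX rows are keyed by the Assembly file id, and vice versa.
  const fileIdColumn = target === "DROPBOX" ? "assembly_file_id" : "dbx_file_id";
  const columns = ["portal_id", "channel_sync_id", fileIdColumn].join(", ");
  return `ON CONFLICT (${columns}) WHERE deleted_at IS NULL AND ${fileIdColumn} IS NOT NULL DO NOTHING`;
}

console.log(conflictClause("DROPBOX"));
console.log(conflictClause("ASSEMBLY"));
```

Dropping either half of the WHERE predicate from the clause is what produces the "no unique or exclusion constraint" error quoted above.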
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- insertCreatePending now stamps pendingActionLastAttemptAt = NOW() so the sweeper's backoff window guards the row while the original completePending* call is still in-flight. Prevents the sweeper from racing the original call and producing duplicate remote files.
- Removes markFailure from inside completePendingAssemblyCreate / completePendingDropboxCreate; errors now propagate to callers. Each caller (the original create path in Sync.service.ts and the sweeper retry path in retryFailedSyncsForPortal) records the failure exactly once, eliminating the double-markFailure on the sweeper path that was shortening the backoff window.
- Test mock extended to cover the insert chain; new test asserts the pendingActionLastAttemptAt stamp.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
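The effect of the insert-time stamp can be seen in a sketch of a sweeper eligibility check. The source does not specify the backoff formula, so the linear backoff, the `BASE_BACKOFF_MS` value, and the `isSweepable` name are all assumptions; only `MAX_ATTEMPTS = 10` and the existence of a per-attempt backoff come from the commits.

```typescript
const MAX_ATTEMPTS = 10; // value stated in the PR
const BASE_BACKOFF_MS = 5 * 60 * 1000; // assumed; the real window is not given in the source

interface Tombstone {
  pendingActionAttempts: number;
  pendingActionLastAttemptAt: Date | null;
}

// Assumed linear backoff: wait BASE_BACKOFF_MS times the attempt count (at least one unit).
function isSweepable(row: Tombstone, now: Date): boolean {
  if (row.pendingActionAttempts >= MAX_ATTEMPTS) return false; // retries exhausted
  if (row.pendingActionLastAttemptAt === null) return true; // unstamped rows are picked up immediately
  const waitMs = BASE_BACKOFF_MS * Math.max(1, row.pendingActionAttempts);
  return now.getTime() - row.pendingActionLastAttemptAt.getTime() >= waitMs;
}
```

Under this sketch, a freshly inserted row with a null timestamp would be swept at once, while a row stamped at insert time stays protected until the window elapses, which is the race the first bullet describes.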
Drop the sweeper from `*/15 * * * *` to `0 8,20 * * *` (08:00 + 20:00 UTC).

15-minute polling was overkill: the typical failure modes (transient Dropbox/Copilot 5xx, intermittent rate-limits) recover within minutes and the next sweep tick covers the same row regardless. Off-peak hours minimise contention with user-driven syncs during business hours.

Tradeoff: in-flight transient failures now take up to ~12 hours to be swept (vs. ~15 min). The user-triggered Resync button (OUT-3784) is the on-demand recovery path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
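The "up to ~12 hours" figure follows from the spacing of the two daily ticks. A minimal sketch, assuming the schedule is evaluated in UTC as the commit states (`nextSweepTick` is a hypothetical helper, not part of the PR):

```typescript
const SWEEP_HOURS_UTC = [8, 20]; // from the cron expression `0 8,20 * * *`

// Returns the first sweep tick strictly after `now`, in UTC.
function nextSweepTick(now: Date): Date {
  for (let dayOffset = 0; dayOffset <= 1; dayOffset++) {
    for (const hour of SWEEP_HOURS_UTC) {
      // Date.UTC normalises an out-of-range day into the following month.
      const tick = new Date(
        Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + dayOffset, hour, 0, 0, 0),
      );
      if (tick.getTime() > now.getTime()) return tick;
    }
  }
  throw new Error("unreachable: a tick always exists within the next day");
}

// Hours a failure recorded at `failedAt` waits before the next sweep.
function hoursUntilSweep(failedAt: Date): number {
  return (nextSweepTick(failedAt).getTime() - failedAt.getTime()) / 3_600_000;
}
```

A failure recorded just after a tick waits close to the full 12 hours, while one recorded shortly before a tick is swept almost immediately, subject to whatever backoff window applies to the row.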
…fileIdColumn

insertCreatePending previously branched twice on `isDropboxTarget`: once to pick the conflict columns and again to build the WHERE clause. Both references resolve to the same column per branch, so collapse to a single `fileIdColumn` ref and define `conflictColumns` / `conflictWhere` once.

Behaviour unchanged: drizzle interpolates the column ref by name in the sql template, so the generated SQL matches the same partial unique index as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
62b1624 into feature/refresh-sync-OUT-3645
Summary
Replaces ad-hoc retry logic with a tombstone-based recovery model for failed sync operations, plus the partial-unique-index work from OUT-3778 (included in this stack because the runtime code in this PR depends on those indexes).
What's in this PR

Tombstone-based recovery (OUT-3645)
- Adds `pending_action`, `pending_action_target`, `pending_action_attempts`, `pending_action_last_attempt_at`, `pending_action_last_error` columns to `file_folder_sync`.
- `markAttempt` / `markUpdated` / `markDeleted` / `markFailure` helpers on `MapFilesService`.
- `completePendingAssemblyCreate` / `completePendingDropboxCreate` shared between the original create flow and the sweeper retry path.
- Trigger.dev schedule (`0 8,20 * * *`, twice daily at 08:00 and 20:00 UTC, chosen as off-peak hours) replaces the previous Vercel cron; dispatches retries by `(action, target)`.
- Shared `normalizeError` utility.

Partial unique indexes (OUT-3778, bundled)
- Two partial unique indexes on `file_folder_sync`:
  - `(portal_id, channel_sync_id, assembly_file_id) WHERE deleted_at IS NULL AND assembly_file_id IS NOT NULL`
  - `(portal_id, channel_sync_id, dbx_file_id) WHERE deleted_at IS NULL AND dbx_file_id IS NOT NULL`
- `IS NOT NULL` guards in the WHERE keep the index lean (Greptile P2 fix).
- Uses `CREATE UNIQUE INDEX IF NOT EXISTS` so it's idempotent.

ON CONFLICT alignment
- `insertCreatePending`'s `ON CONFLICT` WHERE clause is now target-aware to match the narrowed partial-index predicates. Without this, PG throws `no unique or exclusion constraint matching the ON CONFLICT specification` at runtime.

Known accepted regressions (documented for reviewers)
- Rows with `pending_action_attempts = MAX_ATTEMPTS` mask the file from the bidirectional sync filter. Operational monitoring needed.

Pre-deploy checklist
- One-time SQL cleanup of pre-PR-2 partial-create rows (`pending_action IS NULL AND content_hash IS NULL`). User-handled separately.

Test plan
- `pnpm typecheck` clean
- `pnpm lint` clean
- `pnpm test` passes (tombstone helpers + ResyncService per-portal fan-out + Sweeper retry paths)
- `ON CONFLICT` path in `insertCreatePending` doesn't throw under concurrent creates of the same file.

🤖 Generated with Claude Code