Skip to content

OUT-3645: tombstone-based sync recovery + ON CONFLICT alignment#106

Merged
SandipBajracharya merged 5 commits into
feature/refresh-sync-OUT-3645from
OUT-3645
May 29, 2026
Merged

OUT-3645: tombstone-based sync recovery + ON CONFLICT alignment#106
SandipBajracharya merged 5 commits into
feature/refresh-sync-OUT-3645from
OUT-3645

Conversation

@SandipBajracharya

@SandipBajracharya SandipBajracharya commented May 27, 2026

Copy link
Copy Markdown
Collaborator

Summary

Replaces ad-hoc retry logic with a tombstone-based recovery model for failed sync operations, plus the partial-unique-index work from OUT-3778 (included in this stack because the runtime code in this PR depends on those indexes).

What's in this PR

Tombstone-based recovery (OUT-3645)

  • Adds pending_action, pending_action_target, pending_action_attempts, pending_action_last_attempt_at, pending_action_last_error columns to file_folder_sync.
  • Pre-insert pattern for creates in both directions: a tombstone row is inserted before the Assembly/Dropbox API call, then completed (or marked failed for the sweeper to retry) afterwards.
  • New markAttempt / markUpdated / markDeleted / markFailure helpers on MapFilesService.
  • completePendingAssemblyCreate / completePendingDropboxCreate shared between the original create flow and the sweeper retry path.
  • Scheduled Trigger.dev sweep (0 8,20 * * * — twice daily at 08:00 and 20:00 UTC, chosen as off-peak hours) replaces the previous Vercel cron; dispatches retries by (action, target).
  • Shared normalizeError utility.

Partial unique indexes (OUT-3778, bundled)

  • Two partial unique indexes on file_folder_sync:
    • (portal_id, channel_sync_id, assembly_file_id) WHERE deleted_at IS NULL AND assembly_file_id IS NOT NULL
    • (portal_id, channel_sync_id, dbx_file_id) WHERE deleted_at IS NULL AND dbx_file_id IS NOT NULL
  • The IS NOT NULL guards in the WHERE keep the index lean (Greptile P2 fix).
  • Migration uses CREATE UNIQUE INDEX IF NOT EXISTS so it's idempotent.

ON CONFLICT alignment

  • insertCreatePending's ON CONFLICT WHERE clause is now target-aware to match the narrowed partial-index predicates. Without this, PG throws no unique or exclusion constraint matching the ON CONFLICT specification at runtime.

Known accepted regressions (documented for reviewers)

  • Rows stuck at pending_action_attempts = MAX_ATTEMPTS mask the file from the bidirectional sync filter. Operational monitoring needed.
  • Concurrent user updates during the pre-create window (seconds) can be silently dropped.
  • Stale single-ID mapping rows mask the corresponding source file.
  • Sweeper cadence is twice a day, so transient remote-API failures can take up to ~12 hours to recover (vs. the previous ~15 min cadence). The user-triggered Resync button (OUT-3784) is the immediate-recovery path.

Pre-deploy checklist

  • Audit existing prod data for duplicates before applying the partial index migration — audit queries are in the migration file's header comments. If either returns rows, resolve before deploy.
  • One-time SQL backfill for pre-PR-2 legacy rows (pending_action IS NULL AND content_hash IS NULL). User-handled separately.

Test plan

  • pnpm typecheck clean
  • pnpm lint clean
  • pnpm test passes (tombstone helpers + ResyncService per-portal fan-out + Sweeper retry paths)
  • Manual: trigger a sync that fails mid-flight (e.g. revoke Dropbox token for a moment), verify tombstone row is created and the sweeper retries it on the next tick.
  • Manual: verify ON CONFLICT path in insertCreatePending doesn't throw under concurrent creates of the same file.

🤖 Generated with Claude Code

@linear-code

linear-code Bot commented May 27, 2026

Copy link
Copy Markdown

OUT-3645

@vercel

vercel Bot commented May 27, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
dropbox-integration Ready Ready Preview, Comment May 29, 2026 9:41am

Request Review

@SandipBajracharya SandipBajracharya changed the title feat(OUT-3645): tombstone-based sync recovery + ON CONFLICT alignment OUT-3645: tombstone-based sync recovery + ON CONFLICT alignment May 27, 2026
@greptile-apps

greptile-apps Bot commented May 27, 2026

Copy link
Copy Markdown

Greptile Summary

This PR replaces ad-hoc retry logic with a tombstone-based recovery model: a pending_action row is pre-inserted before every remote API call, then resolved to a full mapping or left for a Trigger.dev scheduled sweep (twice daily at 08:00 and 20:00 UTC) to retry. Two partial unique indexes with IS NOT NULL guards are added to prevent duplicate live rows, and the ON CONFLICT clause in insertCreatePending is aligned to match their WHERE predicates.

  • Tombstone lifecycle: insertCreatePending now stamps pendingActionLastAttemptAt = new Date() on insert, shielding in-flight creates from the sweeper; markAttempt / markUpdated / markDeleted / markFailure helpers manage state transitions.
  • Sweeper: retryFailedSyncsSchedule replaces the old Vercel cron route; ResyncService fans out per-portal retries via resyncFailedFilesInAssembly Trigger.dev tasks with a per-row exponential backoff.
  • Idempotency asymmetry: completePendingDropboxCreate delegates to createAndUploadFileInDropbox which checks for an existing Dropbox file before creating; completePendingAssemblyCreate calls copilotApi.createFile unconditionally, so a retry after a partial failure (API call succeeded, markUpdated failed) creates a second Assembly file.

Confidence Score: 3/5

The sweep retry path can create duplicate Assembly files when a remote API call succeeds but the DB update fails; this is a real defect on the newly added retry code path that warrants a fix before deploy.

completePendingAssemblyCreate calls copilotApi.createFile unconditionally on every retry attempt. If a previous sweep's createFile returned successfully but the subsequent markUpdated failed (transient DB error), the row retains pendingAction = 'create' and assemblyFileId = null. The next sweep fires createFile again, producing a second Assembly file with no cleanup path. The Dropbox direction already guards against this by checking whether the file exists before uploading.

src/features/sync/lib/Sync.service.ts — specifically completePendingAssemblyCreate, which lacks the idempotency guard present in its Dropbox counterpart.

Important Files Changed

Filename Overview
src/features/sync/lib/Sync.service.ts Adds completePendingAssemblyCreate and completePendingDropboxCreate shared between the initial sync path and the sweeper; the Dropbox direction is idempotent but the Assembly direction calls createFile unconditionally on every retry, risking duplicate Assembly files if markUpdated fails after a successful API call.
src/features/workers/resync-failed-files/lib/resync-failed-files.service.ts New ResyncService with findFailedSyncs + resyncFailedFiles; backoff formula and attempt-cap logic are correct, but findMany has no row limit, which could push oversized payloads into Trigger.dev.
src/features/workers/resync-failed-files/helper/resync-failed-files.helper.ts Per-portal retry helpers for all four (action, target) combinations; DELETE paths are idempotent (404/not_found handled), CREATE paths delegate to the shared completePending* methods; getFileFromDropbox still swallows all errors (flagged in a previous review cycle).
src/features/sync/lib/MapFiles.service.ts Adds insertCreatePending, markAttempt, markUpdated, markDeleted, markFailure; tombstone insertion now stamps pendingActionLastAttemptAt = new Date() to guard in-flight rows from the sweeper. Logic looks correct.
src/trigger/processFileSync.ts Adds resyncFailedFilesInAssembly task and retryFailedSyncsSchedule (0 8,20 * * *); removes the old Vercel-cron-based route. Schedule and concurrency limits look reasonable.
src/db/migrations/20260526094109_add_unique_indexes_on_portal_channel_assembly_and_dbx_file.sql Two partial unique indexes with IS NOT NULL guards; IF NOT EXISTS makes the migration idempotent; audit queries in comments are appropriate.

Sequence Diagram

sequenceDiagram
    participant Sched as retryFailedSyncsSchedule
    participant RS as ResyncService
    participant DB as Database
    participant TDev as Trigger.dev
    participant Helper as retryFailedSyncsForPortal
    participant Sync as SyncService
    participant API as Assembly / Dropbox API

    Sched->>RS: resyncFailedFiles()
    RS->>DB: findFailedSyncs()
    DB-->>RS: rows[]
    RS->>RS: group rows by portalId
    RS->>TDev: resyncFailedFilesInAssembly.trigger(portalId, rows)
    TDev->>Helper: retryFailedSyncsForPortal(portalId, rows)
    Helper->>Helper: initializeSyncDependencies()
    loop for each failedSync row
        Helper->>Sync: markAttempt(id, action, target)
        alt "action=CREATE target=DROPBOX"
            Helper->>API: copilotApi.retrieveFile
            Helper->>Sync: completePendingDropboxCreate
            Sync->>API: createAndUploadFileInDropbox [idempotent]
            Sync->>DB: markUpdated(id, dbxFileId)
        else "action=CREATE target=ASSEMBLY"
            Helper->>API: getFileFromDropbox
            Helper->>Sync: completePendingAssemblyCreate
            Sync->>API: copilotApi.createFile [NOT idempotent]
            Sync->>DB: markUpdated(id, assemblyFileId)
        else "action=DELETE"
            Helper->>API: deleteFile or filesDeleteV2
            Helper->>DB: markDeleted(id)
        end
        note over Helper,DB: on error: markFailure(id, message)
    end
Loading

Reviews (3): Last reviewed commit: "fix(OUT-3645): reduce sweeper cadence to..." | Re-trigger Greptile

Comment thread src/features/sync/lib/Sync.service.ts Outdated
@SandipBajracharya

Copy link
Copy Markdown
Collaborator Author

@greptileai

@SandipBajracharya

Copy link
Copy Markdown
Collaborator Author

@greptileai

Comment thread src/features/sync/lib/MapFiles.service.ts Outdated

@priosshrsth priosshrsth left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

SandipBajracharya and others added 5 commits May 29, 2026 15:24
…re-insert creates

Make failed sync operations between Dropbox and Assembly durable and
recoverable in either direction:

- Per-file delete handlers wrap the side effect with a tombstone
  (pending_action* columns), soft-deleting on success and leaving the
  row for the sweeper on failure.
- Create handlers (Assembly→Dropbox in syncAssemblyFilesToDropbox and
  Dropbox→Assembly leaf-file in createAndUploadFileToAssembly) now
  pre-insert the mapping row with a create-pending tombstone before the
  target-side API call. Race protection via INSERT ... ON CONFLICT DO
  NOTHING against the two partial unique indexes.
- A Trigger.dev schedule (*/15 * * * *) sweeps rows with active
  tombstones (subject to MAX_ATTEMPTS=10 and per-attempt backoff),
  dispatches by (action, target) to the appropriate retry helper.
- Update path keeps the existing delete+sync chain composition to avoid
  webhook ping-pong (FileDeleted/FileCreated with stale source IDs
  mid-flight).

Cleanups:
- Removed Vercel cron /api/workers/resync-failed-files and CRON_SECRET.
- Removed recoverLegacySync and the legacy contentHash IS NULL branch
  in findFailedSyncs. Production cleanup of pre-PR-2 partial-create
  rows handled separately via one-time SQL script.
- Shared normalizeError utility at src/utils/normalizeError.ts.

Tests added for MapFilesService tombstone helpers and ResyncService
per-portal fan-out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The partial unique indexes on file_folder_sync were narrowed in OUT-3778
to also require the indexed file_id column IS NOT NULL. ON CONFLICT
clauses targeting those indexes must repeat the same predicate or
Postgres throws "no unique or exclusion constraint matching the
ON CONFLICT specification" at runtime.

In insertCreatePending, derive both the conflict columns and the WHERE
clause from `isDropboxTarget`:
  - target=DROPBOX → match (portal, channel, assembly_file_id) index,
    WHERE deleted_at IS NULL AND assembly_file_id IS NOT NULL
  - target=ASSEMBLY → match (portal, channel, dbx_file_id) index,
    WHERE deleted_at IS NULL AND dbx_file_id IS NOT NULL

Semantically the IS NOT NULL is always satisfied at insert time (the
caller pre-populates whichever id corresponds to the target side); the
clause exists only so Postgres can resolve the partial index.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- insertCreatePending now stamps pendingActionLastAttemptAt = NOW() so
  the sweeper's backoff window guards the row while the original
  completePending* call is still in-flight. Prevents the sweeper from
  racing the original call and producing duplicate remote files.
- Removes markFailure from inside completePendingAssemblyCreate /
  completePendingDropboxCreate; errors now propagate to callers. Each
  caller (the original create path in Sync.service.ts and the sweeper
  retry path in retryFailedSyncsForPortal) records the failure exactly
  once, eliminating the double-markFailure on the sweeper path that was
  shortening the backoff window.
- Test mock extended to cover the insert chain; new test asserts the
  pendingActionLastAttemptAt stamp.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the sweeper from `*/15 * * * *` to `0 8,20 * * *` (08:00 + 20:00
UTC). 15-minute polling was overkill: the typical failure modes (transient
Dropbox/Copilot 5xx, intermittent rate-limits) recover within minutes and
the next sweep tick covers the same row regardless. Off-peak hours
minimise contention with user-driven syncs during business hours.

Tradeoff: in-flight transient failures now take up to ~12 hours to be
swept (vs. ~15 min). The user-triggered Resync button (OUT-3784) is the
on-demand recovery path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fileIdColumn

insertCreatePending previously branched twice on `isDropboxTarget`: once
to pick the conflict columns and again to build the WHERE clause. Both
references resolve to the same column per branch, so collapse to a
single `fileIdColumn` ref and define `conflictColumns` / `conflictWhere`
once. Behaviour unchanged — drizzle interpolates the column ref by name
in the sql template, so the generated SQL matches the same partial
unique index as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@SandipBajracharya SandipBajracharya merged commit 62b1624 into feature/refresh-sync-OUT-3645 May 29, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants