OUT-3784: add Resync button + per-channel orchestrator#108
OUT-3784: add Resync button + per-channel orchestrator#108SandipBajracharya wants to merge 13 commits into
Conversation
Additive migration. Adds pending_action, pending_action_target, pending_action_attempts, pending_action_last_attempt_at, and pending_action_last_error columns plus a CHECK constraint enforcing that pending_action and pending_action_target are both set or both null. Also gitignores docs/ for spec and plan artifacts. Subsequent PRs populate these columns from per-file handlers and read from the resync sweeper. PR 1 is observation-only — no code reads or writes the new columns yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add two partial unique indexes scoped to live rows so a single assemblyFileId maps to at most one dbxFileId per portal+channel: (portal_id, channel_sync_id, assembly_file_id) WHERE deleted_at IS NULL (portal_id, channel_sync_id, dbx_file_id) WHERE deleted_at IS NULL Foundation for upcoming pre-insert work (INSERT ... ON CONFLICT DO NOTHING) that needs these as atomic race-protection targets. Shipping separately so the migration bakes before any dependent code paths land. Audit query for existing duplicates is in the migration SQL comments. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…re-insert creates Make failed sync operations between Dropbox and Assembly durable and recoverable in either direction: - Per-file delete handlers wrap the side effect with a tombstone (pending_action* columns), soft-deleting on success and leaving the row for the sweeper on failure. - Create handlers (Assembly→Dropbox in syncAssemblyFilesToDropbox and Dropbox→Assembly leaf-file in createAndUploadFileToAssembly) now pre-insert the mapping row with a create-pending tombstone before the target-side API call. Race protection via INSERT ... ON CONFLICT DO NOTHING against the two partial unique indexes. - A Trigger.dev schedule (*/15 * * * *) sweeps rows with active tombstones (subject to MAX_ATTEMPTS=10 and per-attempt backoff), dispatches by (action, target) to the appropriate retry helper. - Update path keeps the existing delete+sync chain composition to avoid webhook ping-pong (FileDeleted/FileCreated with stale source IDs mid-flight). Cleanups: - Removed Vercel cron /api/workers/resync-failed-files and CRON_SECRET. - Removed recoverLegacySync and the legacy contentHash IS NULL branch in findFailedSyncs. Production cleanup of pre-PR-2 partial-create rows handled separately via one-time SQL script. - Shared normalizeError utility at src/utils/normalizeError.ts. Tests added for MapFilesService tombstone helpers and ResyncService per-portal fan-out. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add IF NOT EXISTS to the CREATE UNIQUE INDEX statements so the migration is idempotent and tolerates manual pre-application. - Narrow each partial index WHERE clause to also require the indexed file_id column IS NOT NULL. Rows with a NULL file id can never trigger a uniqueness violation (NULLs are distinct in PG), so excluding them keeps the index lean. Snapshot updated to match. - Document the implication in the schema: any future ON CONFLICT targeting these indexes must repeat the partial predicate verbatim. - Expand audit-query comments into multi-line block format. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The partial unique indexes on file_folder_sync were narrowed in OUT-3778
to also require the indexed file_id column IS NOT NULL. ON CONFLICT
clauses targeting those indexes must repeat the same predicate or
Postgres throws "no unique or exclusion constraint matching the
ON CONFLICT specification" at runtime.
In insertCreatePending, derive both the conflict columns and the WHERE
clause from `isDropboxTarget`:
- target=DROPBOX → match (portal, channel, assembly_file_id) index,
WHERE deleted_at IS NULL AND assembly_file_id IS NOT NULL
- target=ASSEMBLY → match (portal, channel, dbx_file_id) index,
WHERE deleted_at IS NULL AND dbx_file_id IS NOT NULL
Semantically the IS NOT NULL is always satisfied at insert time (the
caller pre-populates whichever id corresponds to the target side); the
clause exists only so Postgres can resolve the partial index.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- insertCreatePending now stamps pendingActionLastAttemptAt = NOW() so the sweeper's backoff window guards the row while the original completePending* call is still in-flight. Prevents the sweeper from racing the original call and producing duplicate remote files. - Removes markFailure from inside completePendingAssemblyCreate / completePendingDropboxCreate; errors now propagate to callers. Each caller (the original create path in Sync.service.ts and the sweeper retry path in retryFailedSyncsForPortal) records the failure exactly once, eliminating the double-markFailure on the sweeper path that was shortening the backoff window. - Test mock extended to cover the insert chain; new test asserts the pendingActionLastAttemptAt stamp. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the sweeper from `*/15 * * * *` to `0 8,20 * * *` (08:00 + 20:00 UTC). 15-minute polling was overkill: the typical failure modes (transient Dropbox/Copilot 5xx, intermittent rate-limits) recover within minutes and the next sweep tick covers the same row regardless. Off-peak hours minimise contention with user-driven syncs during business hours. Tradeoff: in-flight transient failures now take up to ~12 hours to be swept (vs. ~15 min). The user-triggered Resync button (OUT-3784) is the on-demand recovery path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Stamp assemblyFileId via updateFileMap immediately after copilotApi.createFile in completePendingAssemblyCreate, before uploadFileInAssembly. Shrinks the race window from upload-duration to one UPDATE so the controller's existingFile lookup dedupes the Assembly file.created echo. - Add a path-based dedupe in the Assembly webhook controller: lookup file_folder_sync by (portalId, normalized itemPath, pendingAction=CREATE, pendingActionTarget=ASSEMBLY, deletedAt IS NULL) and skip handleFileCreated/FolderCreated when a pre-inserted row matches. Stored itemPath has a leading "/"; Copilot's data.path does not, so we normalize. - Swap order in handleChannelFileChanges: process deletes before creates so the (portal, channel, dbx_file_id) partial unique index from OUT-3778 isn't violated during rename flows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rvice
Move the channel lookup, path normalization, and pre-insert row match
out of webhook.controller.ts and into a single method:
AssemblyWebhookService.findInFlightAssemblyCreate(
assemblyChannelId,
filePath,
)
- Dedupe is now scoped to (portalId, channelSyncId, itemPath) rather
than (portalId, itemPath). Path uniqueness lives at per-channel
granularity (matches the partial unique indexes from OUT-3778), so
the previous query could silently drop legitimate file.created
events for a different channel that shared a path.
- Normalization (leading "/") moved inside the service method. It now
only runs for create events instead of for every webhook, addressing
the latent crash on non-create events that omit data.path.
- Controller becomes one call: lighter to read, easier to test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…FileId The pendingCreate lookup in findInFlightAssemblyCreate matched any row with pendingAction=CREATE, target=ASSEMBLY, and a matching itemPath, ignoring whether assemblyFileId was set. The JSDoc already documented the intent as "row exists with assemblyFileId IS NULL", but the query omitted the filter. Defensive correctness fix: if completePendingAssemblyCreate gets past the early updateFileMap (stamping assemblyFileId) but uploadFileInAssembly then throws, the row stays at (pendingAction=CREATE, assemblyFileId=B). The normal delete-webhook cleanup chain usually soft-deletes such rows when the orphaned Assembly file is removed — but if that chain breaks (lost webhook, errored handler, sweeper exhausted at MAX_ATTEMPTS), a later legitimate file.created event for the same path would falsely match the stuck row and silently skip handleFileCreated. Adding isNull(t.assemblyFileId) to the query restricts the match to the actual in-flight window (post-insertCreatePending, pre-updateFileMap) that the dedupe was designed for. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- New POST /api/sync/resync endpoint: verifies channel ownership, resets pendingActionAttempts on tombstoned rows, and enqueues an orchestrator task. - New `resyncFailedFilesAndMasterSync` trigger.dev task that runs `resyncFailedFilesInAssembly` (when there are tombstones) followed by `bidirectionalMasterSync.triggerAndWait`. Triggered with `concurrencyKey: channelSyncId` so concurrent clicks for the same channel are serialised. The scheduled sweeper is unchanged. - New `resyncing_at` timestamp column on channel_sync. Set when the user clicks Resync; cleared by the orchestrator's finally block (or by the API path's catch on trigger() failure) so the row never gets stuck in the resyncing state. - UI: leftmost Resync button on each active mapping row, disabled while `mapItem.resyncingAt` is set. Last Updated column shows "Resyncing..." during that window. State is sourced from channel_sync via realtime so it survives a page refresh. - Service accepts dropbox connections with a null rootNamespaceId (personal Dropbox accounts). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Greptile SummaryThis PR adds a user-triggered "Resync" button to active channel mappings, backed by a new resync API endpoint, a
Confidence Score: 5/5Safe to merge. The feature is well-scoped, the channel-ownership and active-status guards are in place, and the rollback path on trigger failure is tested. The core logic is solid: duplicate-resync guard, proper auth checks, try/catch rollback, and concurrency serialization are all correctly implemented and covered by tests. The two findings are non-blocking quality observations — a brief UI flicker on the rare 409 path (corrected by realtime), and oversized trigger payloads for channels with many large-error tombstones. No files require special attention for merging. Important Files Changed
Sequence DiagramsequenceDiagram
participant UI as Browser UI
participant API as Resync Endpoint
participant SVC as ResyncService
participant DB as Database
participant TR as Trigger.dev
UI->>UI: optimistic setResyncingAt(now)
UI->>API: POST channelSyncId
API->>SVC: resyncFailedFilesForChannel()
SVC->>DB: "findFirst channelSync (status=true, not deleted)"
DB-->>SVC: channel or 404
SVC->>SVC: guard: resyncingAt set? return 409
SVC->>DB: "findFirst dropboxConnections (status=true)"
DB-->>SVC: connection or 404
SVC->>DB: UPDATE file_folder_sync reset attempt counters RETURNING
DB-->>SVC: resetRows
SVC->>DB: "UPDATE channel_sync set resyncing_at=NOW()"
SVC->>TR: enqueue orchestrator (serialized per channel)
alt enqueueing fails
SVC->>DB: "UPDATE channel_sync set resyncing_at=NULL"
API-->>UI: error response
UI->>UI: rollback setResyncingAt(null)
else enqueueing succeeds
API-->>UI: 200 OK
TR->>TR: resyncFailedFilesInAssembly.triggerAndWait (if rows)
TR->>TR: bidirectionalMasterSync.triggerAndWait
TR->>DB: "finally: UPDATE channel_sync set resyncing_at=NULL"
DB-->>UI: realtime UPDATE clears resyncingAt
end
Reviews (3): Last reviewed commit: "fix(OUT-3784): reject resync when one is..." | Re-trigger Greptile |
| const channel = await db.query.channelSync.findFirst({ | ||
| where: (t, { eq }) => | ||
| and(eq(t.id, channelSyncId), eq(t.portalId, portalId), isNull(t.deletedAt)), | ||
| }) | ||
| if (!channel) { | ||
| throw new APIError('Channel mapping not found', httpStatus.NOT_FOUND) | ||
| } |
There was a problem hiding this comment.
Missing active-status guard on the channel
resyncFailedFilesForChannel looks up the channel by id + portalId + !deletedAt but never checks channel.status === true. A soft-inactive (status = false, deletedAt = null) channel passes this guard, so an authenticated portal member can POST to /api/sync/resync with the ID of any of their disabled channels and trigger a full bidirectionalMasterSync + tombstone retry on it — work that shouldn't run because the mapping has been deactivated. The UI hides the Resync button for inactive rows, but the API doesn't enforce the same constraint.
| queue: { | ||
| name: 'resync-failed-files-and-master-sync', | ||
| concurrencyLimit: 1, | ||
| }, |
There was a problem hiding this comment.
concurrencyLimit: 1 on the queue may serialize all channels globally
The intent described in the PR is "different channels run in parallel; same channel serializes." That works when concurrencyKey is passed at trigger time AND trigger.dev applies concurrencyLimit per key — but the queue-level concurrencyLimit: 1 is a global limit for the whole named queue. If trigger.dev treats it globally (1 run across all keys at once), every second channel's resync would queue behind the first, serializing all channels regardless of which portal they belong to. Please confirm the per-key scoping behaviour in your trigger.dev version, or move the limit into the task's concurrency field with a key expression to make the per-channel intent explicit and version-proof.
There was a problem hiding this comment.
concurrencyLimit: 1 does not serialize all channels globally. Self verified.
There was a problem hiding this comment.
You're right, and I was wrong. In trigger.dev v3, concurrencyLimit on a queue applies per unique concurrencyKey, not globally. So concurrencyLimit: 1 + concurrencyKey: channelSyncId correctly means "1 concurrent run per channel" — different channels run in parallel as intended.
Apologies for the noise. The code is correct as written.
Tip: You can customize Greptile's behavior for this repo with .greptile/rules.md and .greptile/config.json.
| } finally { | ||
| // Always clear resyncingAt so the UI re-enables the button — even on failure. | ||
| await db | ||
| .update(channelSync) | ||
| .set({ resyncingAt: null }) | ||
| .where(eq(channelSync.id, channelSyncId)) | ||
| } |
There was a problem hiding this comment.
No recovery path if the trigger.dev worker is killed before
finally executes
If the runner process is terminated by an OOM kill, a cold-start timeout, or a trigger.dev infrastructure event before the finally block runs, resyncing_at is never cleared. The channel row stays in "Resyncing…" state permanently — the UI button is disabled and the "Last Updated" cell shows "Resyncing..." with no way to clear it without a manual DB update. Consider adding a scheduled sweeper (e.g. clear resyncing_at where resyncing_at < NOW() - interval '30 minutes') to act as a watchdog, or expose an admin reset endpoint.
There was a problem hiding this comment.
Let this be for now. Low priority
- resyncFailedFilesForChannel now also requires `channel.status === true`. Without this guard, an authenticated portal member could POST to /api/sync/resync with the id of any of their disabled channels and trigger a full bidirectional sweep. The UI hides the Resync button for inactive rows; the API now mirrors that constraint. - Document the per-channel concurrency model on the orchestrator task: trigger.dev v4 splits a named queue into one sub-queue per concurrencyKey, so concurrencyLimit: 1 + concurrencyKey: channelSyncId serialises same-channel runs while letting different channels run in parallel. Same idiom as processDropboxChanges (keyed by accountId). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| const channel = await db.query.channelSync.findFirst({ | ||
| where: (t, { eq }) => | ||
| and( | ||
| eq(t.id, channelSyncId), | ||
| eq(t.portalId, portalId), | ||
| eq(t.status, true), | ||
| isNull(t.deletedAt), | ||
| ), | ||
| }) | ||
| if (!channel) { | ||
| throw new APIError('Channel mapping not found', httpStatus.NOT_FOUND) | ||
| } |
There was a problem hiding this comment.
resyncingAt goes dark for queued orchestrators
The channel lookup has no isNull(t.resyncingAt) guard. When a second request arrives while the first orchestrator is still running, both tombstone-reset and trigger proceed normally. The first orchestrator's finally block then clears resyncingAt — and the second orchestrator starts running with resyncingAt = null. From that point on the UI shows "Active", the Resync button is re-enabled, and the channel appears idle even though a full bidirectionalMasterSync is in progress. Adding isNull(t.resyncingAt) to the channel lookup (and returning a 409 Conflict) would cap this to one pending resync per channel, matching the invariant that "resyncingAt is set ↔ a resync is enqueued or running."
…annel Without this guard, a second resync arriving while the first orchestrator is still running would enqueue behind it via concurrencyKey, but the first orchestrator's `finally` block then clears `resyncingAt` — leaving the second orchestrator running with `resyncingAt = null`. From that point the UI reports the channel as "Active", the Resync button re-enables, and a full bidirectional sweep runs invisibly. resyncFailedFilesForChannel now throws 409 Conflict if `resyncingAt` is already set on the channel. Preserves the invariant: "resyncingAt is set ↔ a resync is enqueued or running" and caps the channel to one in-flight resync at a time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
81f0c7d to
4474ab7
Compare
Summary
Adds a user-triggered "Resync" button on each active channel mapping. Click → resets attempt counters on tombstoned rows for that channel → triggers an orchestrator task that runs the failed-files retry followed by a full bidirectional sync. Per-channel serialization via
concurrencyKey: channelSyncId. Stacks on top of OUT-3782 (#107).What's in this PR
1. Retry-sync endpoint —
POST /api/sync/resyncVerifies channel ownership, resets
pendingActionAttempts = 0andpendingActionLastAttemptAt = nullon rows with active tombstones, stampsresyncingAt = NOW()on the channel, and enqueues the orchestrator. Wrapstrigger()in try/catch and rolls backresyncingAtif the trigger.dev call itself fails.2. New orchestrator task —
resyncFailedFilesAndMasterSyncTrigger.dev task that:
resyncFailedFilesInAssembly.triggerAndWaitbidirectionalMasterSync.triggerAndWaitfor the channelfinallyblock clearsresyncingAtso the row never gets stuck in the resyncing state, even on failure.Triggered with
concurrencyKey: channelSyncIdso multiple clicks for the same channel serialize; different channels run in parallel. The scheduled sweeper is unchanged — it continues calling the bare retry task, not the orchestrator.3.
resyncing_attimestamp column onchannel_syncNew nullable timestamp column drives the "Resyncing..." UI state and survives page reloads via realtime.
4. UI — Resync button
mapItem.status === true).mapItem.resyncingAtis set.console.erroron failure (matches existing patterns). Optimistic local update; realtime confirms via the channel UPDATE event.5. Personal Dropbox accounts
ResyncServiceno longer requiresrootNamespaceIdto be non-null on the connection lookup —DropboxClientalready accepts null and skips the namespace header for personal accounts.Edge cases handled
trigger()throws synchronously → orchestrator never runs → service rollback clearsresyncingAt.finallyclearsresyncingAt.concurrencyKeyper channel; doesn't pollute other channels.Test plan
pnpm typecheckcleanpnpm lintcleanpnpm testpasses (orchestrator wiring + concurrencyKey + null rootNamespaceId + trigger-throws rollback)resyncing_atfrom realtime).🤖 Generated with Claude Code