refactor(platform): collapse RAG indexing status onto fileMetadata.ragStatus by larryro · Pull Request #1840 · tale-project/tale

larryro · 2026-06-09T03:56:59Z

Context

RAG indexing status was stored in three places that drift: documents.ragInfo + documents.indexed, fileMetadata.ragStatus, and the RAG service's own Postgres row — kept loosely in sync by polling, with two independent writers inside Convex. This makes fileMetadata.ragStatus the single source of truth within Convex and projects it on reads. It's a Convex-internal change: the RAG Python service, the synchronous search/retrieval path, and the receive-only boundary are untouched; vectors stay in RAG's Postgres.

What changed

Single canonical owner

documents.ragInfo / documents.indexed are no longer written (kept @deprecated + optional for existing rows). Every reader projects status through one helper, getDocumentRagProjection (join documents.fileId == fileMetadata.storageId), so the response shape and the frontend are unchanged and drift becomes structurally impossible. This also resolves the within-Convex asymmetry (chat uploads have no documents row) by keying on storageId, which both upload kinds share.

Schema + index swap

Add fileMetadata.ragIndexedAt (canonical completion timestamp) and a by_organizationId_and_ragStatus index. getAgentScopedFileIds and listIndexedDocumentsForAgent query that index and resolve to documents, replacing the retired documents.by_organizationId_and_indexed (kept, deprecate-don't-delete).

Server-driven status advancement

New pollFileRagStatus (storageId-keyed, self-scheduling), scheduled by uploadFileToRag, so Document-Hub uploads reach completed even with no chat open — previously the only server-driven poller was the documents pipeline, and checkFileRagStatuses is polled only by the chat UI. The documents-pipeline uploaders (uploadDocumentToRag, reindexDocumentInRag, retryRagIndexing) now write fileMetadata status and schedule this poller. checkRagDocumentStatus is retired-but-kept (@deprecated) so any pre-deploy scheduled job drains harmlessly.

Coverage + backfill

REST POST /api/v1/documents with a fileId now creates the fileMetadata row (the one creation path that didn't).
One-time backfill backfill_filemetadata_rag_status copies legacy documents.ragInfo → fileMetadata (fill-holes-only, idempotent), registered in migrations runAll.

Verification

tsc --noEmit, oxlint --type-aware, and format:check all clean.
286 server tests pass (documents / file_metadata / projects); updated the mocks in get_agent_scoped_file_ids, list_indexed_documents_for_agent, and upsert_document_by_external_id for the fileMetadata-driven query path.
convex dev --once pushed successfully — schema validated against existing local data and the new index built.
Manual app E2E still recommended (needs the live RAG service): upload a doc with no chat open → badge advances to completed; agent retrieval; chat upload; an integration sync re-queue; REST create with fileId.

Summary by CodeRabbit

Release Notes

New Features
- Server-side polling for RAG indexing status provides more reliable tracking of document indexing progress
Improvements
- Enhanced metadata tracking to ensure accurate RAG indexing status across document updates
- Better handling of file replacements to prevent stale indexed state
- Added timestamps for document indexing completion events

@deprecated

…gStatus RAG indexing status was stored in three places that drifted: documents.ragInfo + documents.indexed, fileMetadata.ragStatus, and the RAG service's own row, kept loosely in sync by polling. Make fileMetadata.ragStatus the single source of truth within Convex and project it on reads. - documents.ragInfo / documents.indexed are no longer written (kept @deprecated, optional, for existing rows). Reads go through a new getDocumentRagProjection helper (join documents.fileId == fileMetadata.storageId), so the response shape and the frontend are unchanged. - Add fileMetadata.ragIndexedAt (canonical completion timestamp) + a new by_organizationId_and_ragStatus index. getAgentScopedFileIds and listIndexedDocumentsForAgent query that index and resolve to documents, replacing the retired documents by_organizationId_and_indexed index. - Add a server-driven poller pollFileRagStatus (keyed by storageId), scheduled by uploadFileToRag, so Document-Hub uploads advance to completed even with no chat open. The documents-pipeline uploaders (uploadDocumentToRag, reindexDocumentInRag, retryRagIndexing) now write fileMetadata status and schedule this poller; checkRagDocumentStatus is retired-but-kept for drain. - Fix the one creation gap: REST POST /api/v1/documents with a fileId now creates the fileMetadata row. - One-time backfill copies legacy documents.ragInfo -> fileMetadata (fill-holes only, idempotent), registered in migrations runAll. Retrieval/search path and the RAG Python service are untouched; vectors stay in RAG's Postgres.

coderabbitai · 2026-06-09T04:10:01Z

📝 Walkthrough

Walkthrough

This PR migrates RAG (Retrieval-Augmented Generation) indexing state from the document-level ragInfo field to the fileMetadata.ragStatus field as the canonical source of truth. It introduces server-side polling infrastructure to track indexing progress, updates all queries and mutations to use the new file-metadata location, and adds comprehensive deprecation documentation for the retired document-level fields. The schema gains new timestamp and index fields to support efficient RAG-status queries, and projection helpers are introduced to abstract reading RAG state for backward compatibility.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

tale-project/tale#862: Earlier PR that introduced the indexed document retrieval patterns now being refactored in this PR to use fileMetadata.ragStatus as the source instead of documents.indexed.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is comprehensive, covering context, changes, verification, and test plan. However, the pre-merge checklist items are not ticked and lack explicit N/A markings required by the template.	Complete the pre-merge checklist by ticking boxes or marking N/A with brief reasons for each item (bun checks, translations, documentation, README updates).
Docstring Coverage	⚠️ Warning	Docstring coverage is 41.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main architectural change: collapsing RAG indexing status to be canonical on fileMetadata.ragStatus rather than distributed across multiple storage locations.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch refactor/rag-status-canonical-filemetadata

Warning

Billing warning: we have not been able to collect payment for this subscription for more than 72 hours. Please update the payment method or pay any pending invoices in Billing to avoid service interruption.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

services/platform/convex/documents/list_indexed_documents_for_agent.ts (2)

292-326: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Same filtering issue in the inline composite-cursor mock.

The ad-hoc mock at lines 304-309 also creates fms for all docs without filtering by ragInfo.status === 'completed'. Apply the same filter as the main createMockCtx.

🛠️ Proposed fix

-    const fms = docs.map((d) => ({
-      storageId: d.fileId,
-      documentId: d._id,
-      ragStatus: 'completed' as const,
-      organizationId: 'org1',
-    }));
+    const fms = docs
+      .filter(
+        (d) =>
+          (d.ragInfo as { status?: string } | undefined)?.status === 'completed',
+      )
+      .map((d) => ({
+        storageId: d.fileId,
+        documentId: d._id,
+        ragStatus: 'completed' as const,
+        organizationId: 'org1',
+      }));
     const docsById = new Map(docs.map((d) => [d._id, d]));

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/platform/convex/documents/list_indexed_documents_for_agent.ts`
around lines 292 - 326, The inline composite-cursor mock in createMockCtx is
building the fms array for all documents without filtering by ragInfo.status ===
'completed'; update the ad-hoc mock (the composite-cursor branch that constructs
fms at runtime) to apply the same filter used in the main createMockCtx so only
documents with ragInfo?.status === 'completed' are included when populating fms
(match the filtering logic and field checks used elsewhere in createMockCtx).

14-50: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Mock does not filter by completion status.

The mock creates fileMetadata rows with ragStatus: 'completed' for all input docs (line 18-23), without filtering by ragInfo.status. This breaks test accuracy:

Test "excludes non-completed documents" (line 175) passes docs with ragInfo: { status: 'queued' } but the mock will return them as completed fileMetadata, causing incorrect test results.
The production code queries by_organizationId_and_ragStatus with .eq('ragStatus', 'completed'), which should only return truly completed files.

Compare to get_agent_scoped_file_ids.test.ts (lines 24-28), which correctly filters:

const completedFms = docs
  .filter(
    (d) =>
      (d.ragInfo as { status?: string } | undefined)?.status === 'completed',
  )
  .map((d) => ({...}));

🛠️ Proposed fix

 function createMockCtx(docs: MockDoc[]) {
-  // Completion is canonical on fileMetadata.ragStatus now: the query paginates
-  // completed fileMetadata (one per doc), then resolves each to its document
-  // via ctx.db.get(documentId).
-  const fms = docs.map((d) => ({
-    storageId: d.fileId,
-    documentId: d._id,
-    ragStatus: 'completed' as const,
-    organizationId: 'org1',
-  }));
+  // Derive completed fileMetadata from docs where ragInfo.status === 'completed'
+  // to match the by_organizationId_and_ragStatus index filter in production.
+  const fms = docs
+    .filter(
+      (d) =>
+        (d.ragInfo as { status?: string } | undefined)?.status === 'completed',
+    )
+    .map((d) => ({
+      storageId: d.fileId,
+      documentId: d._id,
+      ragStatus: 'completed' as const,
+      organizationId: 'org1',
+    }));
   const docsById = new Map(docs.map((d) => [d._id, d]));

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/platform/convex/documents/list_indexed_documents_for_agent.ts`
around lines 14 - 50, The test mock that generates fileMetadata rows should only
mark a doc as ragStatus: 'completed' when the source doc's ragInfo.status ===
'completed'; update the mock generation logic to filter docs by
(d.ragInfo?.status === 'completed') before mapping to fileMetadata so it mirrors
the production query that uses
by_organizationId_and_ragStatus.eq('ragStatus','completed'); locate the mock in
the tests that reference AgentIndexedDocumentItem /
AgentIndexedDocumentListResult (the place creating fileMetadata entries for
docs) and change the mapping/filtering to match the pattern used in
get_agent_scoped_file_ids.test.ts (filter then map).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@services/platform/convex/documents/list_indexed_documents_for_agent.test.ts`:
- Line 45: The inline getter uses a type assertion (id as string); replace it
with a type guard to avoid assertions—update the get function (const get = async
(id: unknown) => ...) to check that id is a string (e.g., if (typeof id !==
'string') return null) before calling docsById.get(id) and return the result or
null; keep the function signature and semantics but remove the "as" assertion
and rely on the runtime type guard.
- Line 322: The inline type assertion in the get function (const get = async
(id: unknown) => docsById.get(id as string) ?? null;) should be replaced with a
type guard: check that id is a string (e.g., typeof id === "string") before
calling docsById.get and return null for non-strings; update the get function to
perform this guard so it no longer uses the `as string` assertion and adheres to
the project's type-safety guideline.

In `@services/platform/convex/documents/rest_api.ts`:
- Around line 78-95: When handling body.fileId in this block (where you call
toId<'_storage'>(body.fileId),
rc.ctx.runQuery(internal.file_metadata.internal_queries.getStorageMetadata, ...)
and
rc.ctx.runMutation(internal.file_metadata.internal_mutations.saveFileMetadata,
...)), add a defensive check for meta === null/undefined: if meta is missing,
emit a warning via the request context logger (e.g., rc.ctx.logger.warn or
rc.log.warn) that includes the storageId and fileId, and either abort with a
clear client error/throw or return early instead of saving a metadata row with
size: 0; ensure the warning message includes enough identifiers (storageId,
rc.org.organizationId, rc.user.userId) so the caller/debugger can correlate the
invalid fileId.

In `@services/platform/convex/documents/upsert_document_by_external_id.test.ts`:
- Around line 66-69: The test fixture's mock for the query builder (.first())
doesn't simulate lookups by storageId, so calls to fileMetadata.by_storageId
always return empty and the existingIndexed branch never gets exercised; update
the fixture that defines matched and the mock first() used in the RAG-status
reindex gate so it captures and responds to storageId queries (i.e., when
fileMetadata.by_storageId is invoked with a specific storageId return the
corresponding matched entry instead of null), ensuring the test triggers the
existingIndexed code path—look for the mocked first(), matched array,
fileMetadata.by_storageId lookup, and the existingIndexed branch to implement
this behavior.

In `@services/platform/convex/documents/upsert_document_by_external_id.ts`:
- Around line 163-170: The code currently does an unconditional lookup using
existingFileId -> ctx.db.query('fileMetadata').withIndex('by_storageId'...) to
set existingFm/existingIndexed even for updates that cannot trigger reindex;
gate that DB read behind the same reindex preconditions (the same
boolean/condition used elsewhere to decide reindexing) so you only query when
reindex is allowed. Modify the logic around
existingFileId/existingFm/existingIndexed (and the similar occurrence around the
block at the other occurrence) to check the reindex-enabled condition first, and
only call ctx.db.query('fileMetadata').withIndex('by_storageId', ...) when that
condition is true; default existingFm/null and existingIndexed=false when
skipped.

In `@services/platform/convex/file_metadata/internal_actions.ts`:
- Around line 195-200: The empty catch around "body = await response.json();"
should not swallow errors; change it to catch the error (e.g., catch (err)) and
log it with console.warn or console.error including context (such as the
response status or that parsing failed) before calling reschedule() and
returning null (or re-throw if you prefer). Update the try/catch in the same
block that calls response.json(), referencing the variables "body", "response",
and the helper "reschedule" so the log clearly identifies which response failed
to parse.

---

Outside diff comments:
In `@services/platform/convex/documents/list_indexed_documents_for_agent.ts`:
- Around line 292-326: The inline composite-cursor mock in createMockCtx is
building the fms array for all documents without filtering by ragInfo.status ===
'completed'; update the ad-hoc mock (the composite-cursor branch that constructs
fms at runtime) to apply the same filter used in the main createMockCtx so only
documents with ragInfo?.status === 'completed' are included when populating fms
(match the filtering logic and field checks used elsewhere in createMockCtx).
- Around line 14-50: The test mock that generates fileMetadata rows should only
mark a doc as ragStatus: 'completed' when the source doc's ragInfo.status ===
'completed'; update the mock generation logic to filter docs by
(d.ragInfo?.status === 'completed') before mapping to fileMetadata so it mirrors
the production query that uses
by_organizationId_and_ragStatus.eq('ragStatus','completed'); locate the mock in
the tests that reference AgentIndexedDocumentItem /
AgentIndexedDocumentListResult (the place creating fileMetadata entries for
docs) and change the mapping/filtering to match the pattern used in
get_agent_scoped_file_ids.test.ts (filter then map).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1b65b587-7964-45b2-8f06-3b8771714a9b

📥 Commits

Reviewing files that changed from the base of the PR and between 2494f42 and 03133ac.

⛔ Files ignored due to path filters (2)

services/platform/convex/_generated/api.d.ts is excluded by !**/_generated/**
services/platform/convex/migrations/backfill_filemetadata_rag_status.ts is excluded by !**/migrations/**

📒 Files selected for processing (22)

services/platform/convex/documents/actions.ts
services/platform/convex/documents/get_agent_scoped_file_ids.test.ts
services/platform/convex/documents/get_agent_scoped_file_ids.ts
services/platform/convex/documents/get_document_rag_projection.ts
services/platform/convex/documents/internal_actions.ts
services/platform/convex/documents/list_indexed_documents_for_agent.test.ts
services/platform/convex/documents/list_indexed_documents_for_agent.ts
services/platform/convex/documents/rest_api.ts
services/platform/convex/documents/schema.ts
services/platform/convex/documents/transform_to_document_item.ts
services/platform/convex/documents/types.ts
services/platform/convex/documents/update_document_internal.ts
services/platform/convex/documents/update_document_rag_info.ts
services/platform/convex/documents/upsert_document_by_external_id.test.ts
services/platform/convex/documents/upsert_document_by_external_id.ts
services/platform/convex/documents/validators.ts
services/platform/convex/file_metadata/internal_actions.ts
services/platform/convex/file_metadata/internal_mutations.ts
services/platform/convex/file_metadata/internal_queries.ts
services/platform/convex/file_metadata/schema.ts
services/platform/convex/migrations.ts
services/platform/convex/projects/queries.ts

coderabbitai · 2026-06-09T04:10:04Z

  });

-  return { db: { query: queryFn } } as unknown as Parameters<
+  const get = async (id: unknown) => docsById.get(id as string) ?? null;


🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Replace type assertion with type guard.

The guideline prohibits using as type assertions. Use a type guard instead to maintain type safety.

♻️ Proposed fix

- const get = async (id: unknown) => docsById.get(id as string) ?? null; + const get = async (id: unknown) => + typeof id === 'string' ? docsById.get(id) ?? null : null;

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

const get = async (id: unknown) => docsById.get(id as string) ?? null;

const get = async (id: unknown) =>

typeof id === 'string' ? docsById.get(id) ?? null : null;

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/platform/convex/documents/list_indexed_documents_for_agent.test.ts` at line 45, The inline getter uses a type assertion (id as string); replace it with a type guard to avoid assertions—update the get function (const get = async (id: unknown) => ...) to check that id is a string (e.g., if (typeof id !== 'string') return null) before calling docsById.get(id) and return the result or null; keep the function signature and semantics but remove the "as" assertion and rely on the runtime type guard.

Source: Coding guidelines

coderabbitai · 2026-06-09T04:10:04Z

      }),
    });
-    const ctx = { db: { query: queryFn } } as unknown as Parameters<
+    const get = async (id: unknown) => docsById.get(id as string) ?? null;


🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Replace type assertion with type guard.

Same issue as line 45—use a type guard instead of as string to align with coding guidelines.

♻️ Proposed fix

- const get = async (id: unknown) => docsById.get(id as string) ?? null; + const get = async (id: unknown) => + typeof id === 'string' ? docsById.get(id) ?? null : null;

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

const get = async (id: unknown) => docsById.get(id as string) ?? null;

const get = async (id: unknown) =>

typeof id === 'string' ? docsById.get(id) ?? null : null;

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/platform/convex/documents/list_indexed_documents_for_agent.test.ts` at line 322, The inline type assertion in the get function (const get = async (id: unknown) => docsById.get(id as string) ?? null;) should be replaced with a type guard: check that id is a string (e.g., typeof id === "string") before calling docsById.get and return null for non-strings; update the get function to perform this guard so it no longer uses the `as string` assertion and adheres to the project's type-safety guideline.

Source: Coding guidelines

coderabbitai · 2026-06-09T04:10:04Z

+  if (body.fileId) {
+    const storageId = toId<'_storage'>(body.fileId);
+    const meta = await rc.ctx.runQuery(
+      internal.file_metadata.internal_queries.getStorageMetadata,
+      { storageId },
+    );
+    await rc.ctx.runMutation(
+      internal.file_metadata.internal_mutations.saveFileMetadata,
+      {
+        organizationId: rc.org.organizationId,
+        storageId,
+        fileName: body.title ?? 'document',
+        contentType:
+          body.mimeType ?? meta?.contentType ?? 'application/octet-stream',
+        size: meta?.size ?? 0,
+        uploadedBy: rc.user.userId,
+      },
+    );


🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider handling missing blob metadata more defensively.

When meta is null (blob doesn't exist), the code proceeds with size: 0 which may create an inconsistent fileMetadata row. If the caller passes an invalid fileId, the operation silently succeeds but the RAG pipeline will likely fail later.

This may be acceptable as a "caller error" scenario, but a warning log would help with debugging.

Optional: Add a warning for missing blob

if (body.fileId) { const storageId = toId<'_storage'>(body.fileId); const meta = await rc.ctx.runQuery( internal.file_metadata.internal_queries.getStorageMetadata, { storageId }, ); + if (!meta) { + console.warn( + `[createDocument] No blob found for fileId=${body.fileId}; fileMetadata row may be inconsistent`, + ); + } await rc.ctx.runMutation(

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/platform/convex/documents/rest_api.ts` around lines 78 - 95, When handling body.fileId in this block (where you call toId<'_storage'>(body.fileId), rc.ctx.runQuery(internal.file_metadata.internal_queries.getStorageMetadata, ...) and rc.ctx.runMutation(internal.file_metadata.internal_mutations.saveFileMetadata, ...)), add a defensive check for meta === null/undefined: if meta is missing, emit a warning via the request context logger (e.g., rc.ctx.logger.warn or rc.log.warn) that includes the storageId and fileId, and either abort with a clear client error/throw or return early instead of saving a metadata row with size: 0; ensure the warning message includes enough identifiers (storageId, rc.org.organizationId, rc.user.userId) so the caller/debugger can correlate the invalid fileId.

coderabbitai · 2026-06-09T04:10:04Z

+            // The RAG-status reindex gate queries fileMetadata `by_storageId`
+            // and calls .first(); these fixtures have no fileMetadata, so the
+            // gate stays inactive (matched is empty for that lookup).
+            first: async () => matched[0] ?? null,


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Mock .first() does not model fileMetadata.by_storageId behavior.

Line 69 adds .first(), but the query builder never captures storageId, so canonical gate lookups always resolve as empty in this fixture. That leaves the new existingIndexed branch effectively untested.

Suggested fixture enhancement

interface MockFolder { @@ } -function createMockCtx(initial: MockDoc[], initialFolders: MockFolder[] = []) { +interface MockFileMetadata { + storageId: string; + ragStatus?: 'queued' | 'running' | 'completed' | 'failed'; +} + +function createMockCtx( + initial: MockDoc[], + initialFolders: MockFolder[] = [], + initialFileMetadata: MockFileMetadata[] = [], +) { @@ const folders = new Map<string, MockFolder>(); for (const f of initialFolders) folders.set(f._id, f); + const fileMetadata = new Map<string, MockFileMetadata>(); + for (const fm of initialFileMetadata) fileMetadata.set(fm.storageId, fm); @@ - query: () => ({ + query: (table: string) => ({ withIndex: ( @@ let orgFilter: string | undefined; let externalFilter: string | undefined; + let storageFilter: string | undefined; const qb = { eq: (field: string, value: unknown) => { if (field === 'organizationId') orgFilter = value as string; if (field === 'externalItemId') externalFilter = value as string; + if (field === 'storageId') storageFilter = value as string; return qb; }, }; cb(qb); + if (table === 'fileMetadata') { + const fm = storageFilter + ? fileMetadata.get(storageFilter) ?? null + : null; + return { + [Symbol.asyncIterator]: async function* () { + if (fm) yield fm; + }, + first: async () => fm, + }; + } const matched: MockDoc[] = [];

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/platform/convex/documents/upsert_document_by_external_id.test.ts` around lines 66 - 69, The test fixture's mock for the query builder (.first()) doesn't simulate lookups by storageId, so calls to fileMetadata.by_storageId always return empty and the existingIndexed branch never gets exercised; update the fixture that defines matched and the mock first() used in the RAG-status reindex gate so it captures and responds to storageId queries (i.e., when fileMetadata.by_storageId is invoked with a specific storageId return the corresponding matched entry instead of null), ensuring the test triggers the existingIndexed code path—look for the mocked first(), matched array, fileMetadata.by_storageId lookup, and the existingIndexed branch to implement this behavior.

coderabbitai · 2026-06-09T04:10:04Z

+    const existingFileId = existing.fileId;
+    const existingFm = existingFileId
+      ? await ctx.db
+          .query('fileMetadata')
+          .withIndex('by_storageId', (q) => q.eq('storageId', existingFileId))
+          .first()
+      : null;
+    const existingIndexed = existingFm?.ragStatus === 'completed';


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Gate the fileMetadata read behind the same reindex preconditions.

Line 164 currently performs a by_storageId lookup on every update, including metadata/folder-only updates where reindex cannot run. This adds avoidable reads on a hot sync path.

Proposed fix

- const existingFileId = existing.fileId; - const existingFm = existingFileId - ? await ctx.db - .query('fileMetadata') - .withIndex('by_storageId', (q) => q.eq('storageId', existingFileId)) - .first() - : null; - const existingIndexed = existingFm?.ragStatus === 'completed'; + let existingIndexed = false; + if ( + contentChanged && + existing.fileId && + args.fileId && + existing.fileId !== args.fileId + ) { + const existingFm = await ctx.db + .query('fileMetadata') + .withIndex('by_storageId', (q) => q.eq('storageId', existing.fileId)) + .first(); + existingIndexed = existingFm?.ragStatus === 'completed'; + }

Also applies to: 177-177

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/platform/convex/documents/upsert_document_by_external_id.ts` around lines 163 - 170, The code currently does an unconditional lookup using existingFileId -> ctx.db.query('fileMetadata').withIndex('by_storageId'...) to set existingFm/existingIndexed even for updates that cannot trigger reindex; gate that DB read behind the same reindex preconditions (the same boolean/condition used elsewhere to decide reindexing) so you only query when reindex is allowed. Modify the logic around existingFileId/existingFm/existingIndexed (and the similar occurrence around the block at the other occurrence) to check the reindex-enabled condition first, and only call ctx.db.query('fileMetadata').withIndex('by_storageId', ...) when that condition is true; default existingFm/null and existingIndexed=false when skipped.

coderabbitai · 2026-06-09T04:10:04Z

+      try {
+        body = await response.json();
+      } catch {
+        await reschedule();
+        return null;
+      }


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Empty catch block violates coding guidelines.

The catch block on line 197 is empty. Per coding guidelines, catch blocks should log with console.warn/console.error or re-throw to avoid hiding bugs.

Proposed fix

let body: unknown; try { body = await response.json(); - } catch { + } catch (parseErr) { + console.warn( + `[pollFileRagStatus] Failed to parse RAG response for ${args.storageId}:`, + parseErr, + ); await reschedule(); return null; }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

try {

body = await response.json();

} catch {

await reschedule();

return null;

}

try {

body = await response.json();

} catch (parseErr) {

console.warn(

`[pollFileRagStatus] Failed to parse RAG response for ${args.storageId}:`,

parseErr,

);

await reschedule();

return null;

}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@services/platform/convex/file_metadata/internal_actions.ts` around lines 195 - 200, The empty catch around "body = await response.json();" should not swallow errors; change it to catch the error (e.g., catch (err)) and log it with console.warn or console.error including context (such as the response status or that parsing failed) before calling reschedule() and returning null (or re-throw if you prefer). Update the try/catch in the same block that calls response.json(), referencing the variables "body", "response", and the helper "reschedule" so the log clearly identifies which response failed to parse.

Source: Coding guidelines

Review follow-up to the fileMetadata.ragStatus collapse. - ragIndexedAt is now stamped in Unix seconds (was Date.now() ms). The RagStatusBadge renders `new Date(indexedAt * 1000)` and both the legacy ragInfo.indexedAt writer and the backfill store seconds, so writing ms made every fresh completion render a far-future date and mixed units in the column. - retryRagIndexing and uploadDocumentToRag now ensure the canonical fileMetadata row exists (new ensureFileMetadataForDocument mutation) before writing status. updateFileRagStatus silently no-ops on a missing row, which left the "Index" button reporting success while status stayed stuck on not_indexed. The ensure mutation reads size/contentType from _storage and schedules no upload, so it does not double-index. - Register backfillFileMetadataDocumentId in runAll ahead of the rag-status backfill so completed fileMetadata rows are linked and stay in agent RAG retrieval scope. - Add unit tests for the seconds stamping and ensureFileMetadataForDocument.

…ueries Addresses medium+ findings from the review of the fileMetadata.ragStatus single-source-of-truth cutover. Migration (backfill_filemetadata_rag_status): - Mirror only TERMINAL legacy ragInfo statuses. Copying queued/running would pin the canonical row forever — nothing advances a non-terminal ragStatus post-cutover and expireStaleRagQueue only rescues 'queued'. - Create the fileMetadata row when missing instead of skipping, so a legacy completed doc whose blob never got a row no longer regresses to not_indexed and silently drops out of agent RAG retrieval. Idempotent (re-run finds the created row and skips); OCC serializes overlapping chains. - BATCH_SIZE 200 -> 100 for write-byte / OCC headroom now that batches insert. - Rewrite the misleading runAll ordering comment (the two backfills are independent; call order is cosmetic). Agent-scope query scale (getAgentScopedFileIds, listIndexedDocumentsForAgent): - Replace by_organizationId_and_ragStatus with composite by_organizationId_and_ragStatus_and_documentId and query .eq(org).eq('completed').gt('documentId', undefined). Convex orders a missing optional field as the least value, so the bound seeks PAST chat-upload / transcript rows (documentId absent) at the index level — the scan stays bounded by Hub-doc count instead of the org's whole completed corpus. - Lower list_indexed MAX_LIMIT 500 -> 100. - Degrade gracefully (skip knowledge context / empty results) when the scope query throws, instead of aborting the agent turn — in generate_response and both rag_search tool paths. Tests: - New: backfill migration (terminal-only filter, insert-when-missing, idempotency, self-schedule), pollFileRagStatus poller (status/HTTP branches), update_document_internal reindex gate. - Extended: upsert reindex gate (now exercised), get_agent_scoped / list_indexed SSOT skip branches (documentId-missing, stale-blob, trashed), transform_to_document_item RAG projection mapping.

coderabbitai Bot requested changes Jun 9, 2026

View reviewed changes

larryro added 2 commits June 9, 2026 16:04

larryro merged commit 08ca625 into main Jun 9, 2026
30 checks passed

larryro deleted the refactor/rag-status-canonical-filemetadata branch June 9, 2026 16:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(platform): collapse RAG indexing status onto fileMetadata.ragStatus#1840

refactor(platform): collapse RAG indexing status onto fileMetadata.ragStatus#1840
larryro merged 3 commits into
mainfrom
refactor/rag-status-canonical-filemetadata

larryro commented Jun 9, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 9, 2026

Walkthrough

Estimated code review effort

Possibly related PRs

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Uh oh!

coderabbitai Bot Jun 9, 2026

Uh oh!

coderabbitai Bot Jun 9, 2026

Uh oh!

coderabbitai Bot Jun 9, 2026

Uh oh!

coderabbitai Bot Jun 9, 2026

Uh oh!

coderabbitai Bot Jun 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	const get = async (id: unknown) => docsById.get(id as string) ?? null;
	const get = async (id: unknown) =>
	typeof id === 'string' ? docsById.get(id) ?? null : null;

Conversation

larryro commented Jun 9, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Context

What changed

Verification

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 9, 2026

Walkthrough

Estimated code review effort

Possibly related PRs

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

larryro commented Jun 9, 2026 •

edited by coderabbitai Bot

Loading