Skip to content

refactor(platform): collapse RAG indexing status onto fileMetadata.ragStatus#1840

Merged
larryro merged 3 commits into
mainfrom
refactor/rag-status-canonical-filemetadata
Jun 9, 2026
Merged

refactor(platform): collapse RAG indexing status onto fileMetadata.ragStatus#1840
larryro merged 3 commits into
mainfrom
refactor/rag-status-canonical-filemetadata

Conversation

@larryro

@larryro larryro commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Context

RAG indexing status was stored in three places that drift: documents.ragInfo + documents.indexed, fileMetadata.ragStatus, and the RAG service's own Postgres row — kept loosely in sync by polling, with two independent writers inside Convex. This makes fileMetadata.ragStatus the single source of truth within Convex and projects it on reads. It's a Convex-internal change: the RAG Python service, the synchronous search/retrieval path, and the receive-only boundary are untouched; vectors stay in RAG's Postgres.

What changed

Single canonical owner

  • documents.ragInfo / documents.indexed are no longer written (kept @deprecated + optional for existing rows). Every reader projects status through one helper, getDocumentRagProjection (join documents.fileId == fileMetadata.storageId), so the response shape and the frontend are unchanged and drift becomes structurally impossible. This also resolves the within-Convex asymmetry (chat uploads have no documents row) by keying on storageId, which both upload kinds share.

Schema + index swap

  • Add fileMetadata.ragIndexedAt (canonical completion timestamp) and a by_organizationId_and_ragStatus index. getAgentScopedFileIds and listIndexedDocumentsForAgent query that index and resolve to documents, replacing the retired documents.by_organizationId_and_indexed (kept, deprecate-don't-delete).

Server-driven status advancement

  • New pollFileRagStatus (storageId-keyed, self-scheduling), scheduled by uploadFileToRag, so Document-Hub uploads reach completed even with no chat open — previously the only server-driven poller was the documents pipeline, and checkFileRagStatuses is polled only by the chat UI. The documents-pipeline uploaders (uploadDocumentToRag, reindexDocumentInRag, retryRagIndexing) now write fileMetadata status and schedule this poller. checkRagDocumentStatus is retired-but-kept (@deprecated) so any pre-deploy scheduled job drains harmlessly.

Coverage + backfill

  • REST POST /api/v1/documents with a fileId now creates the fileMetadata row (the one creation path that didn't).
  • One-time backfill backfill_filemetadata_rag_status copies legacy documents.ragInfofileMetadata (fill-holes-only, idempotent), registered in migrations runAll.

Verification

  • tsc --noEmit, oxlint --type-aware, and format:check all clean.
  • 286 server tests pass (documents / file_metadata / projects); updated the mocks in get_agent_scoped_file_ids, list_indexed_documents_for_agent, and upsert_document_by_external_id for the fileMetadata-driven query path.
  • convex dev --once pushed successfully — schema validated against existing local data and the new index built.
  • Manual app E2E still recommended (needs the live RAG service): upload a doc with no chat open → badge advances to completed; agent retrieval; chat upload; an integration sync re-queue; REST create with fileId.

Summary by CodeRabbit

Release Notes

  • New Features

    • Server-side polling for RAG indexing status provides more reliable tracking of document indexing progress
  • Improvements

    • Enhanced metadata tracking to ensure accurate RAG indexing status across document updates
    • Better handling of file replacements to prevent stale indexed state
    • Added timestamps for document indexing completion events

…gStatus

RAG indexing status was stored in three places that drifted: documents.ragInfo
+ documents.indexed, fileMetadata.ragStatus, and the RAG service's own row,
kept loosely in sync by polling. Make fileMetadata.ragStatus the single source
of truth within Convex and project it on reads.

- documents.ragInfo / documents.indexed are no longer written (kept @deprecated,
  optional, for existing rows). Reads go through a new getDocumentRagProjection
  helper (join documents.fileId == fileMetadata.storageId), so the response
  shape and the frontend are unchanged.
- Add fileMetadata.ragIndexedAt (canonical completion timestamp) + a new
  by_organizationId_and_ragStatus index. getAgentScopedFileIds and
  listIndexedDocumentsForAgent query that index and resolve to documents,
  replacing the retired documents by_organizationId_and_indexed index.
- Add a server-driven poller pollFileRagStatus (keyed by storageId), scheduled
  by uploadFileToRag, so Document-Hub uploads advance to completed even with no
  chat open. The documents-pipeline uploaders (uploadDocumentToRag,
  reindexDocumentInRag, retryRagIndexing) now write fileMetadata status and
  schedule this poller; checkRagDocumentStatus is retired-but-kept for drain.
- Fix the one creation gap: REST POST /api/v1/documents with a fileId now
  creates the fileMetadata row.
- One-time backfill copies legacy documents.ragInfo -> fileMetadata (fill-holes
  only, idempotent), registered in migrations runAll.

Retrieval/search path and the RAG Python service are untouched; vectors stay in
RAG's Postgres.
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

This PR migrates RAG (Retrieval-Augmented Generation) indexing state from the document-level ragInfo field to the fileMetadata.ragStatus field as the canonical source of truth. It introduces server-side polling infrastructure to track indexing progress, updates all queries and mutations to use the new file-metadata location, and adds comprehensive deprecation documentation for the retired document-level fields. The schema gains new timestamp and index fields to support efficient RAG-status queries, and projection helpers are introduced to abstract reading RAG state for backward compatibility.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • tale-project/tale#862: Earlier PR that introduced the indexed document retrieval patterns now being refactored in this PR to use fileMetadata.ragStatus as the source instead of documents.indexed.
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is comprehensive, covering context, changes, verification, and test plan. However, the pre-merge checklist items are not ticked and lack explicit N/A markings required by the template. Complete the pre-merge checklist by ticking boxes or marking N/A with brief reasons for each item (bun checks, translations, documentation, README updates).
Docstring Coverage ⚠️ Warning Docstring coverage is 41.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main architectural change: collapsing RAG indexing status to be canonical on fileMetadata.ragStatus rather than distributed across multiple storage locations.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch refactor/rag-status-canonical-filemetadata

Warning

Billing warning: we have not been able to collect payment for this subscription for more than 72 hours. Please update the payment method or pay any pending invoices in Billing to avoid service interruption.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
services/platform/convex/documents/list_indexed_documents_for_agent.ts (2)

292-326: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Same filtering issue in the inline composite-cursor mock.

The ad-hoc mock at lines 304-309 also creates fms for all docs without filtering by ragInfo.status === 'completed'. Apply the same filter as the main createMockCtx.

🛠️ Proposed fix
-    const fms = docs.map((d) => ({
-      storageId: d.fileId,
-      documentId: d._id,
-      ragStatus: 'completed' as const,
-      organizationId: 'org1',
-    }));
+    const fms = docs
+      .filter(
+        (d) =>
+          (d.ragInfo as { status?: string } | undefined)?.status === 'completed',
+      )
+      .map((d) => ({
+        storageId: d.fileId,
+        documentId: d._id,
+        ragStatus: 'completed' as const,
+        organizationId: 'org1',
+      }));
     const docsById = new Map(docs.map((d) => [d._id, d]));
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/platform/convex/documents/list_indexed_documents_for_agent.ts`
around lines 292 - 326, The inline composite-cursor mock in createMockCtx is
building the fms array for all documents without filtering by ragInfo.status ===
'completed'; update the ad-hoc mock (the composite-cursor branch that constructs
fms at runtime) to apply the same filter used in the main createMockCtx so only
documents with ragInfo?.status === 'completed' are included when populating fms
(match the filtering logic and field checks used elsewhere in createMockCtx).

14-50: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Mock does not filter by completion status.

The mock creates fileMetadata rows with ragStatus: 'completed' for all input docs (line 18-23), without filtering by ragInfo.status. This breaks test accuracy:

  1. Test "excludes non-completed documents" (line 175) passes docs with ragInfo: { status: 'queued' } but the mock will return them as completed fileMetadata, causing incorrect test results.
  2. The production code queries by_organizationId_and_ragStatus with .eq('ragStatus', 'completed'), which should only return truly completed files.

Compare to get_agent_scoped_file_ids.test.ts (lines 24-28), which correctly filters:

const completedFms = docs
  .filter(
    (d) =>
      (d.ragInfo as { status?: string } | undefined)?.status === 'completed',
  )
  .map((d) => ({...}));
🛠️ Proposed fix
 function createMockCtx(docs: MockDoc[]) {
-  // Completion is canonical on fileMetadata.ragStatus now: the query paginates
-  // completed fileMetadata (one per doc), then resolves each to its document
-  // via ctx.db.get(documentId).
-  const fms = docs.map((d) => ({
-    storageId: d.fileId,
-    documentId: d._id,
-    ragStatus: 'completed' as const,
-    organizationId: 'org1',
-  }));
+  // Derive completed fileMetadata from docs where ragInfo.status === 'completed'
+  // to match the by_organizationId_and_ragStatus index filter in production.
+  const fms = docs
+    .filter(
+      (d) =>
+        (d.ragInfo as { status?: string } | undefined)?.status === 'completed',
+    )
+    .map((d) => ({
+      storageId: d.fileId,
+      documentId: d._id,
+      ragStatus: 'completed' as const,
+      organizationId: 'org1',
+    }));
   const docsById = new Map(docs.map((d) => [d._id, d]));
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/platform/convex/documents/list_indexed_documents_for_agent.ts`
around lines 14 - 50, The test mock that generates fileMetadata rows should only
mark a doc as ragStatus: 'completed' when the source doc's ragInfo.status ===
'completed'; update the mock generation logic to filter docs by
(d.ragInfo?.status === 'completed') before mapping to fileMetadata so it mirrors
the production query that uses
by_organizationId_and_ragStatus.eq('ragStatus','completed'); locate the mock in
the tests that reference AgentIndexedDocumentItem /
AgentIndexedDocumentListResult (the place creating fileMetadata entries for
docs) and change the mapping/filtering to match the pattern used in
get_agent_scoped_file_ids.test.ts (filter then map).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@services/platform/convex/documents/list_indexed_documents_for_agent.test.ts`:
- Line 45: The inline getter uses a type assertion (id as string); replace it
with a type guard to avoid assertions—update the get function (const get = async
(id: unknown) => ...) to check that id is a string (e.g., if (typeof id !==
'string') return null) before calling docsById.get(id) and return the result or
null; keep the function signature and semantics but remove the "as" assertion
and rely on the runtime type guard.
- Line 322: The inline type assertion in the get function (const get = async
(id: unknown) => docsById.get(id as string) ?? null;) should be replaced with a
type guard: check that id is a string (e.g., typeof id === "string") before
calling docsById.get and return null for non-strings; update the get function to
perform this guard so it no longer uses the `as string` assertion and adheres to
the project's type-safety guideline.

In `@services/platform/convex/documents/rest_api.ts`:
- Around line 78-95: When handling body.fileId in this block (where you call
toId<'_storage'>(body.fileId),
rc.ctx.runQuery(internal.file_metadata.internal_queries.getStorageMetadata, ...)
and
rc.ctx.runMutation(internal.file_metadata.internal_mutations.saveFileMetadata,
...)), add a defensive check for meta === null/undefined: if meta is missing,
emit a warning via the request context logger (e.g., rc.ctx.logger.warn or
rc.log.warn) that includes the storageId and fileId, and either abort with a
clear client error/throw or return early instead of saving a metadata row with
size: 0; ensure the warning message includes enough identifiers (storageId,
rc.org.organizationId, rc.user.userId) so the caller/debugger can correlate the
invalid fileId.

In `@services/platform/convex/documents/upsert_document_by_external_id.test.ts`:
- Around line 66-69: The test fixture's mock for the query builder (.first())
doesn't simulate lookups by storageId, so calls to fileMetadata.by_storageId
always return empty and the existingIndexed branch never gets exercised; update
the fixture that defines matched and the mock first() used in the RAG-status
reindex gate so it captures and responds to storageId queries (i.e., when
fileMetadata.by_storageId is invoked with a specific storageId return the
corresponding matched entry instead of null), ensuring the test triggers the
existingIndexed code path—look for the mocked first(), matched array,
fileMetadata.by_storageId lookup, and the existingIndexed branch to implement
this behavior.

In `@services/platform/convex/documents/upsert_document_by_external_id.ts`:
- Around line 163-170: The code currently does an unconditional lookup using
existingFileId -> ctx.db.query('fileMetadata').withIndex('by_storageId'...) to
set existingFm/existingIndexed even for updates that cannot trigger reindex;
gate that DB read behind the same reindex preconditions (the same
boolean/condition used elsewhere to decide reindexing) so you only query when
reindex is allowed. Modify the logic around
existingFileId/existingFm/existingIndexed (and the similar occurrence around the
block at the other occurrence) to check the reindex-enabled condition first, and
only call ctx.db.query('fileMetadata').withIndex('by_storageId', ...) when that
condition is true; default existingFm/null and existingIndexed=false when
skipped.

In `@services/platform/convex/file_metadata/internal_actions.ts`:
- Around line 195-200: The empty catch around "body = await response.json();"
should not swallow errors; change it to catch the error (e.g., catch (err)) and
log it with console.warn or console.error including context (such as the
response status or that parsing failed) before calling reschedule() and
returning null (or re-throw if you prefer). Update the try/catch in the same
block that calls response.json(), referencing the variables "body", "response",
and the helper "reschedule" so the log clearly identifies which response failed
to parse.

---

Outside diff comments:
In `@services/platform/convex/documents/list_indexed_documents_for_agent.ts`:
- Around line 292-326: The inline composite-cursor mock in createMockCtx is
building the fms array for all documents without filtering by ragInfo.status ===
'completed'; update the ad-hoc mock (the composite-cursor branch that constructs
fms at runtime) to apply the same filter used in the main createMockCtx so only
documents with ragInfo?.status === 'completed' are included when populating fms
(match the filtering logic and field checks used elsewhere in createMockCtx).
- Around line 14-50: The test mock that generates fileMetadata rows should only
mark a doc as ragStatus: 'completed' when the source doc's ragInfo.status ===
'completed'; update the mock generation logic to filter docs by
(d.ragInfo?.status === 'completed') before mapping to fileMetadata so it mirrors
the production query that uses
by_organizationId_and_ragStatus.eq('ragStatus','completed'); locate the mock in
the tests that reference AgentIndexedDocumentItem /
AgentIndexedDocumentListResult (the place creating fileMetadata entries for
docs) and change the mapping/filtering to match the pattern used in
get_agent_scoped_file_ids.test.ts (filter then map).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1b65b587-7964-45b2-8f06-3b8771714a9b

📥 Commits

Reviewing files that changed from the base of the PR and between 2494f42 and 03133ac.

⛔ Files ignored due to path filters (2)
  • services/platform/convex/_generated/api.d.ts is excluded by !**/_generated/**
  • services/platform/convex/migrations/backfill_filemetadata_rag_status.ts is excluded by !**/migrations/**
📒 Files selected for processing (22)
  • services/platform/convex/documents/actions.ts
  • services/platform/convex/documents/get_agent_scoped_file_ids.test.ts
  • services/platform/convex/documents/get_agent_scoped_file_ids.ts
  • services/platform/convex/documents/get_document_rag_projection.ts
  • services/platform/convex/documents/internal_actions.ts
  • services/platform/convex/documents/list_indexed_documents_for_agent.test.ts
  • services/platform/convex/documents/list_indexed_documents_for_agent.ts
  • services/platform/convex/documents/rest_api.ts
  • services/platform/convex/documents/schema.ts
  • services/platform/convex/documents/transform_to_document_item.ts
  • services/platform/convex/documents/types.ts
  • services/platform/convex/documents/update_document_internal.ts
  • services/platform/convex/documents/update_document_rag_info.ts
  • services/platform/convex/documents/upsert_document_by_external_id.test.ts
  • services/platform/convex/documents/upsert_document_by_external_id.ts
  • services/platform/convex/documents/validators.ts
  • services/platform/convex/file_metadata/internal_actions.ts
  • services/platform/convex/file_metadata/internal_mutations.ts
  • services/platform/convex/file_metadata/internal_queries.ts
  • services/platform/convex/file_metadata/schema.ts
  • services/platform/convex/migrations.ts
  • services/platform/convex/projects/queries.ts

});

return { db: { query: queryFn } } as unknown as Parameters<
const get = async (id: unknown) => docsById.get(id as string) ?? null;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Replace type assertion with type guard.

The guideline prohibits using as type assertions. Use a type guard instead to maintain type safety.

♻️ Proposed fix
-  const get = async (id: unknown) => docsById.get(id as string) ?? null;
+  const get = async (id: unknown) => 
+    typeof id === 'string' ? docsById.get(id) ?? null : null;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const get = async (id: unknown) => docsById.get(id as string) ?? null;
const get = async (id: unknown) =>
typeof id === 'string' ? docsById.get(id) ?? null : null;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/platform/convex/documents/list_indexed_documents_for_agent.test.ts`
at line 45, The inline getter uses a type assertion (id as string); replace it
with a type guard to avoid assertions—update the get function (const get = async
(id: unknown) => ...) to check that id is a string (e.g., if (typeof id !==
'string') return null) before calling docsById.get(id) and return the result or
null; keep the function signature and semantics but remove the "as" assertion
and rely on the runtime type guard.

Source: Coding guidelines

}),
});
const ctx = { db: { query: queryFn } } as unknown as Parameters<
const get = async (id: unknown) => docsById.get(id as string) ?? null;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Replace type assertion with type guard.

Same issue as line 45—use a type guard instead of as string to align with coding guidelines.

♻️ Proposed fix
-  const get = async (id: unknown) => docsById.get(id as string) ?? null;
+  const get = async (id: unknown) => 
+    typeof id === 'string' ? docsById.get(id) ?? null : null;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const get = async (id: unknown) => docsById.get(id as string) ?? null;
const get = async (id: unknown) =>
typeof id === 'string' ? docsById.get(id) ?? null : null;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/platform/convex/documents/list_indexed_documents_for_agent.test.ts`
at line 322, The inline type assertion in the get function (const get = async
(id: unknown) => docsById.get(id as string) ?? null;) should be replaced with a
type guard: check that id is a string (e.g., typeof id === "string") before
calling docsById.get and return null for non-strings; update the get function to
perform this guard so it no longer uses the `as string` assertion and adheres to
the project's type-safety guideline.

Source: Coding guidelines

Comment on lines +78 to +95
if (body.fileId) {
const storageId = toId<'_storage'>(body.fileId);
const meta = await rc.ctx.runQuery(
internal.file_metadata.internal_queries.getStorageMetadata,
{ storageId },
);
await rc.ctx.runMutation(
internal.file_metadata.internal_mutations.saveFileMetadata,
{
organizationId: rc.org.organizationId,
storageId,
fileName: body.title ?? 'document',
contentType:
body.mimeType ?? meta?.contentType ?? 'application/octet-stream',
size: meta?.size ?? 0,
uploadedBy: rc.user.userId,
},
);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | 💤 Low value

Consider handling missing blob metadata more defensively.

When meta is null (blob doesn't exist), the code proceeds with size: 0 which may create an inconsistent fileMetadata row. If the caller passes an invalid fileId, the operation silently succeeds but the RAG pipeline will likely fail later.

This may be acceptable as a "caller error" scenario, but a warning log would help with debugging.

Optional: Add a warning for missing blob
   if (body.fileId) {
     const storageId = toId<'_storage'>(body.fileId);
     const meta = await rc.ctx.runQuery(
       internal.file_metadata.internal_queries.getStorageMetadata,
       { storageId },
     );
+    if (!meta) {
+      console.warn(
+        `[createDocument] No blob found for fileId=${body.fileId}; fileMetadata row may be inconsistent`,
+      );
+    }
     await rc.ctx.runMutation(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/platform/convex/documents/rest_api.ts` around lines 78 - 95, When
handling body.fileId in this block (where you call
toId<'_storage'>(body.fileId),
rc.ctx.runQuery(internal.file_metadata.internal_queries.getStorageMetadata, ...)
and
rc.ctx.runMutation(internal.file_metadata.internal_mutations.saveFileMetadata,
...)), add a defensive check for meta === null/undefined: if meta is missing,
emit a warning via the request context logger (e.g., rc.ctx.logger.warn or
rc.log.warn) that includes the storageId and fileId, and either abort with a
clear client error/throw or return early instead of saving a metadata row with
size: 0; ensure the warning message includes enough identifiers (storageId,
rc.org.organizationId, rc.user.userId) so the caller/debugger can correlate the
invalid fileId.

Comment on lines +66 to +69
// The RAG-status reindex gate queries fileMetadata `by_storageId`
// and calls .first(); these fixtures have no fileMetadata, so the
// gate stays inactive (matched is empty for that lookup).
first: async () => matched[0] ?? null,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Mock .first() does not model fileMetadata.by_storageId behavior.

Line 69 adds .first(), but the query builder never captures storageId, so canonical gate lookups always resolve as empty in this fixture. That leaves the new existingIndexed branch effectively untested.

Suggested fixture enhancement
 interface MockFolder {
@@
 }
 
-function createMockCtx(initial: MockDoc[], initialFolders: MockFolder[] = []) {
+interface MockFileMetadata {
+  storageId: string;
+  ragStatus?: 'queued' | 'running' | 'completed' | 'failed';
+}
+
+function createMockCtx(
+  initial: MockDoc[],
+  initialFolders: MockFolder[] = [],
+  initialFileMetadata: MockFileMetadata[] = [],
+) {
@@
   const folders = new Map<string, MockFolder>();
   for (const f of initialFolders) folders.set(f._id, f);
+  const fileMetadata = new Map<string, MockFileMetadata>();
+  for (const fm of initialFileMetadata) fileMetadata.set(fm.storageId, fm);
@@
-      query: () => ({
+      query: (table: string) => ({
         withIndex: (
@@
           let orgFilter: string | undefined;
           let externalFilter: string | undefined;
+          let storageFilter: string | undefined;
           const qb = {
             eq: (field: string, value: unknown) => {
               if (field === 'organizationId') orgFilter = value as string;
               if (field === 'externalItemId') externalFilter = value as string;
+              if (field === 'storageId') storageFilter = value as string;
               return qb;
             },
           };
           cb(qb);
+          if (table === 'fileMetadata') {
+            const fm = storageFilter
+              ? fileMetadata.get(storageFilter) ?? null
+              : null;
+            return {
+              [Symbol.asyncIterator]: async function* () {
+                if (fm) yield fm;
+              },
+              first: async () => fm,
+            };
+          }
           const matched: MockDoc[] = [];
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/platform/convex/documents/upsert_document_by_external_id.test.ts`
around lines 66 - 69, The test fixture's mock for the query builder (.first())
doesn't simulate lookups by storageId, so calls to fileMetadata.by_storageId
always return empty and the existingIndexed branch never gets exercised; update
the fixture that defines matched and the mock first() used in the RAG-status
reindex gate so it captures and responds to storageId queries (i.e., when
fileMetadata.by_storageId is invoked with a specific storageId return the
corresponding matched entry instead of null), ensuring the test triggers the
existingIndexed code path—look for the mocked first(), matched array,
fileMetadata.by_storageId lookup, and the existingIndexed branch to implement
this behavior.

Comment on lines +163 to +170
const existingFileId = existing.fileId;
const existingFm = existingFileId
? await ctx.db
.query('fileMetadata')
.withIndex('by_storageId', (q) => q.eq('storageId', existingFileId))
.first()
: null;
const existingIndexed = existingFm?.ragStatus === 'completed';

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Gate the fileMetadata read behind the same reindex preconditions.

Line 164 currently performs a by_storageId lookup on every update, including metadata/folder-only updates where reindex cannot run. This adds avoidable reads on a hot sync path.

Proposed fix
-    const existingFileId = existing.fileId;
-    const existingFm = existingFileId
-      ? await ctx.db
-          .query('fileMetadata')
-          .withIndex('by_storageId', (q) => q.eq('storageId', existingFileId))
-          .first()
-      : null;
-    const existingIndexed = existingFm?.ragStatus === 'completed';
+    let existingIndexed = false;
+    if (
+      contentChanged &&
+      existing.fileId &&
+      args.fileId &&
+      existing.fileId !== args.fileId
+    ) {
+      const existingFm = await ctx.db
+        .query('fileMetadata')
+        .withIndex('by_storageId', (q) => q.eq('storageId', existing.fileId))
+        .first();
+      existingIndexed = existingFm?.ragStatus === 'completed';
+    }

Also applies to: 177-177

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/platform/convex/documents/upsert_document_by_external_id.ts` around
lines 163 - 170, The code currently does an unconditional lookup using
existingFileId -> ctx.db.query('fileMetadata').withIndex('by_storageId'...) to
set existingFm/existingIndexed even for updates that cannot trigger reindex;
gate that DB read behind the same reindex preconditions (the same
boolean/condition used elsewhere to decide reindexing) so you only query when
reindex is allowed. Modify the logic around
existingFileId/existingFm/existingIndexed (and the similar occurrence around the
block at the other occurrence) to check the reindex-enabled condition first, and
only call ctx.db.query('fileMetadata').withIndex('by_storageId', ...) when that
condition is true; default existingFm/null and existingIndexed=false when
skipped.

Comment on lines +195 to +200
try {
body = await response.json();
} catch {
await reschedule();
return null;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Empty catch block violates coding guidelines.

The catch block on line 197 is empty. Per coding guidelines, catch blocks should log with console.warn/console.error or re-throw to avoid hiding bugs.

Proposed fix
       let body: unknown;
       try {
         body = await response.json();
-      } catch {
+      } catch (parseErr) {
+        console.warn(
+          `[pollFileRagStatus] Failed to parse RAG response for ${args.storageId}:`,
+          parseErr,
+        );
         await reschedule();
         return null;
       }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
try {
body = await response.json();
} catch {
await reschedule();
return null;
}
try {
body = await response.json();
} catch (parseErr) {
console.warn(
`[pollFileRagStatus] Failed to parse RAG response for ${args.storageId}:`,
parseErr,
);
await reschedule();
return null;
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/platform/convex/file_metadata/internal_actions.ts` around lines 195
- 200, The empty catch around "body = await response.json();" should not swallow
errors; change it to catch the error (e.g., catch (err)) and log it with
console.warn or console.error including context (such as the response status or
that parsing failed) before calling reschedule() and returning null (or re-throw
if you prefer). Update the try/catch in the same block that calls
response.json(), referencing the variables "body", "response", and the helper
"reschedule" so the log clearly identifies which response failed to parse.

Source: Coding guidelines

larryro added 2 commits June 9, 2026 16:04
Review follow-up to the fileMetadata.ragStatus collapse.

- ragIndexedAt is now stamped in Unix seconds (was Date.now() ms). The
  RagStatusBadge renders `new Date(indexedAt * 1000)` and both the legacy
  ragInfo.indexedAt writer and the backfill store seconds, so writing ms made
  every fresh completion render a far-future date and mixed units in the column.
- retryRagIndexing and uploadDocumentToRag now ensure the canonical fileMetadata
  row exists (new ensureFileMetadataForDocument mutation) before writing status.
  updateFileRagStatus silently no-ops on a missing row, which left the "Index"
  button reporting success while status stayed stuck on not_indexed. The ensure
  mutation reads size/contentType from _storage and schedules no upload, so it
  does not double-index.
- Register backfillFileMetadataDocumentId in runAll ahead of the rag-status
  backfill so completed fileMetadata rows are linked and stay in agent RAG
  retrieval scope.
- Add unit tests for the seconds stamping and ensureFileMetadataForDocument.
…ueries

Addresses medium+ findings from the review of the fileMetadata.ragStatus
single-source-of-truth cutover.

Migration (backfill_filemetadata_rag_status):
- Mirror only TERMINAL legacy ragInfo statuses. Copying queued/running would
  pin the canonical row forever — nothing advances a non-terminal ragStatus
  post-cutover and expireStaleRagQueue only rescues 'queued'.
- Create the fileMetadata row when missing instead of skipping, so a legacy
  completed doc whose blob never got a row no longer regresses to not_indexed
  and silently drops out of agent RAG retrieval. Idempotent (re-run finds the
  created row and skips); OCC serializes overlapping chains.
- BATCH_SIZE 200 -> 100 for write-byte / OCC headroom now that batches insert.
- Rewrite the misleading runAll ordering comment (the two backfills are
  independent; call order is cosmetic).

Agent-scope query scale (getAgentScopedFileIds, listIndexedDocumentsForAgent):
- Replace by_organizationId_and_ragStatus with composite
  by_organizationId_and_ragStatus_and_documentId and query
  .eq(org).eq('completed').gt('documentId', undefined). Convex orders a missing
  optional field as the least value, so the bound seeks PAST chat-upload /
  transcript rows (documentId absent) at the index level — the scan stays
  bounded by Hub-doc count instead of the org's whole completed corpus.
- Lower list_indexed MAX_LIMIT 500 -> 100.
- Degrade gracefully (skip knowledge context / empty results) when the scope
  query throws, instead of aborting the agent turn — in generate_response and
  both rag_search tool paths.

Tests:
- New: backfill migration (terminal-only filter, insert-when-missing,
  idempotency, self-schedule), pollFileRagStatus poller (status/HTTP branches),
  update_document_internal reindex gate.
- Extended: upsert reindex gate (now exercised), get_agent_scoped /
  list_indexed SSOT skip branches (documentId-missing, stale-blob, trashed),
  transform_to_document_item RAG projection mapping.
@larryro larryro merged commit 08ca625 into main Jun 9, 2026
30 checks passed
@larryro larryro deleted the refactor/rag-status-canonical-filemetadata branch June 9, 2026 16:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant