feat(platform): simplify file tools — auto RAG indexing, remove parse/template ops#1407
Conversation
…/template ops - Auto-trigger RAG indexing when files are uploaded (no more manual trigger) - Add `retrieve` operation to rag_search tool for full document content access - Remove `parse` operations from all file tools (pdf, docx, pptx, text, excel) - Remove `list_templates` and template-based generation from docx/pptx tools - Rewrite pptx tool to use markdown/HTML generation (consistent with pdf/docx) - Add retry-on-empty logic in RAG search service for files still being indexed - Update file tool descriptions to guide AI toward rag_search retrieve - Clean up dead code: template generators, parse helpers, unused validators
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
This comment has been minimized.
This comment has been minimized.
📝 WalkthroughWalkthroughThis PR removes file parsing capabilities from individual file tools (DOCX, Excel, PDF, PPTX, Text) and their supporting infrastructure, including helpers for parsing files ( Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
services/platform/convex/lib/attachments/process_attachments.ts (2)
231-247:⚠️ Potential issue | 🟠 MajorKeep failed images out of the
rag_search.retrievebucket.
unprocessedAttachmentsnow mixesfailedImageswith text-bearing files, then tells the agent to userag_search.retrievefor all of them. That fallback works for documents/spreadsheets, but it sends image failures to a tool path that cannot recover image content.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/lib/attachments/process_attachments.ts` around lines 231 - 247, The assembled unprocessedAttachments array incorrectly mixes failedImages with text-bearing files then instructs the agent to use rag_search.retrieve for all; update the logic around unprocessedAttachments (the variable built from fileAttachments, failedImages, failedSpreadsheets) so failedImages are excluded from the rag_search.retrieve fallback and are handled separately: build the contentParts block with only fileAttachments and failedSpreadsheets when adding the “[ATTACHED FILES - Use rag_search tool with operation="retrieve"…]” message and loop, and for failedImages push a distinct text entry that does not reference rag_search.retrieve (e.g., note that the image failed and requires re-upload or a different image-recovery tool) so the agent isn’t directed to use rag_search.retrieve for images.
258-263:⚠️ Potential issue | 🟠 MajorDon’t return empty metadata arrays after doing the work.
The function still advertises
imageInfoListandtextFileInfoList, and it already has the attachment data needed to populate them, but the return value hardcodes both to[]. That silently breaks any downstream consumer still relying on these fields for follow-up tool calls or attachment rendering.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/lib/attachments/process_attachments.ts` around lines 258 - 263, The return currently hardcodes imageInfoList and textFileInfoList to empty arrays while the function (process_attachments.ts) already collects attachment data; replace those empty arrays with the actual lists built earlier in the function (use the image/text attachment variables populated during processing) so returned parsedDocuments, imageInfoList, and textFileInfoList reflect the real attachment metadata; keep promptContent as-is and ensure the returned objects use the existing symbols/variables that hold the image and text file metadata instead of [].services/platform/convex/agent_tools/files/pdf_tool.ts (1)
30-66:⚠️ Potential issue | 🟠 MajorUpdate the example file assistant prompt in the same PR.
This tool now explicitly tells models to use
rag_searchfor reading, butexamples/agents/file-assistant.jsonstill advertises “Parse existing PDFs” and lists🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/agent_tools/files/pdf_tool.ts` around lines 30 - 66, The example assistant prompt in the repo still advertises using the pdf parsing tool and the “Parse existing PDFs” capability while pdf_tool.ts now instructs models to use rag_search for reading files and to only call the "generate" operation when explicitly requested; update the example assistant (examples/agents/file-assistant.json) to remove references to the pdf parsing operation and the “Parse existing PDFs” wording, replace guidance to read file contents with using rag_search (operation='retrieve'/'search'), and ensure any example interactions no longer show calling pdf tool for parsing/downloading PDFs except when demonstrating the "generate" operation.services/platform/convex/agent_tools/files/docx_tool.ts (1)
63-90:⚠️ Potential issue | 🟠 MajorMake the two generate modes mutually exclusive.
The schema allows
sections,sourceType, andcontenttogether, but execution silently chooses content mode and ignores the structured sections/title/subtitle. For model-generated tool calls, that ambiguity will produce the wrong document without any validation error.Based on learnings, "In any file under services/platform/convex/agent_tools ... prefer runtime validation of identifiers like tableName coming from LLM/user input over relying on TypeScript type restrictions alone. Implement validation at call boundaries that returns descriptive, user-friendly errors."
Also applies to: 145-196
services/platform/convex/agent_tools/rag/rag_search_tool.ts (1)
230-269:⚠️ Potential issue | 🟠 MajorImplement the zero-result retry that this PR promises.
This branch still does one
/searchcall and immediately returns “No relevant results found” on an empty result set. Newly uploaded files that are still indexing never get the documented 3-second retry, so the main user-facing behavior in this PR is still missing.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/agent_tools/rag/rag_search_tool.ts` around lines 230 - 269, The code currently does one fetch/search and immediately returns "No relevant results found" when formatSearchResults(result.results) is falsy; implement a retry loop around the fetch + fetchJson<SearchResponse> call that, when result.total_results === 0 or formatSearchResults(...) returns falsy, waits and retries for up to the documented 3 seconds (e.g., 3 attempts with ~1s delay or a single 3s wait depending on project convention) before returning the no-results response. Update the flow around the existing fetch(url, ...) and fetchJson<SearchResponse> usage so that debugLog, result usage, and the final returned object are produced after the retries (or after the final attempt), preserving the same fields (result.total_results, processing_time_ms, usage, model) and only returning success:false or the no-results message after all retries are exhausted.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@services/platform/convex/agent_tools/files/excel_tool.ts`:
- Around line 34-49: The sheet schema currently allows rows with arbitrary
column counts; add a runtime check on the sheet object (the z.object that
contains name, headers, rows inside the sheets array) that validates every row
length equals headers.length. Implement this by using zod's refine or
superRefine on that sheet object (reference the sheets schema's sheet object),
iterate rows and if any row length differs from headers.length call ctx.addIssue
or return false with a clear message (include sheet.name in the error). This
ensures rows are validated against the header count at the schema boundary
rather than relying on TypeScript types.
In `@services/platform/convex/agent_tools/rag/helpers/fetch_document_chunks.ts`:
- Around line 32-58: The pagination loop using chunkStart/chunk_range.end can
hang if the service returns a stale or malformed range; add a safety bound and a
fallback advance: introduce a max iteration cap (e.g. MAX_ITERATIONS) and an
iteration counter in the while(true) loop to throw a clear error once exceeded,
and after receiving result check if result.chunk_range.end <= chunkStart — if
so, advance chunkStart by MAX_CHUNK_WINDOW as a fallback (or throw if you prefer
strictness) to ensure progress; also wrap fetch with an AbortController and a
short timeout so the fetch call itself cannot hang indefinitely; update
references to chunkStart, chunkEnd, MAX_CHUNK_WINDOW, fetch and fetchJson
accordingly.
In `@services/platform/convex/agent_tools/rag/rag_search_tool.ts`:
- Around line 117-124: The "retrieve" operation currently accepts any fileId and
directly calls the RAG service, letting callers bypass knowledge-scope checks;
to fix, apply the same authorization/ID resolution used by the "search" flow by
calling resolveFileIds (or the shared authorization function used by search)
before performing the retrieve, validate that the returned resolved IDs include
the requested fileId (or replace the incoming id with the resolved one), and
reject the request if resolution/authorization fails; update both places where
"retrieve" is handled (the z.object operation with literal 'retrieve' and the
duplicate handler around lines 173-195) to enforce this check and reuse the same
resolveFileIds/authorization logic.
In `@services/platform/convex/file_metadata/internal_actions.ts`:
- Line 18: The internal action schema currently validates storageId with
v.string() but should use the project-specific ID validator v.id('_storage') to
match the file_metadata module and schema; update the storageId field in
internal_actions.ts from v.string() to v.id('_storage') (ensuring any references
to storageId in the same internal action signature continue to use the ID type)
so it matches other endpoints, internal_queries/mutations, and the expected
schema.
In `@services/platform/convex/file_metadata/mutations.ts`:
- Around line 58-66: The call to
ctx.scheduler.runAfter(internal.file_metadata.internal_actions.uploadFileToRag,
...) is awaited on the main mutation path so scheduler failures can abort
metadata creation; change this to fire-and-forget by removing the await and
ensuring any returned promise errors are caught and logged (e.g., call
ctx.scheduler.runAfter(...).catch(...) or wrap in try/catch and log) so
scheduling failures do not propagate and block the mutation.
---
Outside diff comments:
In `@services/platform/convex/agent_tools/files/pdf_tool.ts`:
- Around line 30-66: The example assistant prompt in the repo still advertises
using the pdf parsing tool and the “Parse existing PDFs” capability while
pdf_tool.ts now instructs models to use rag_search for reading files and to only
call the "generate" operation when explicitly requested; update the example
assistant (examples/agents/file-assistant.json) to remove references to the pdf
parsing operation and the “Parse existing PDFs” wording, replace guidance to
read file contents with using rag_search (operation='retrieve'/'search'), and
ensure any example interactions no longer show calling pdf tool for
parsing/downloading PDFs except when demonstrating the "generate" operation.
In `@services/platform/convex/agent_tools/rag/rag_search_tool.ts`:
- Around line 230-269: The code currently does one fetch/search and immediately
returns "No relevant results found" when formatSearchResults(result.results) is
falsy; implement a retry loop around the fetch + fetchJson<SearchResponse> call
that, when result.total_results === 0 or formatSearchResults(...) returns falsy,
waits and retries for up to the documented 3 seconds (e.g., 3 attempts with ~1s
delay or a single 3s wait depending on project convention) before returning the
no-results response. Update the flow around the existing fetch(url, ...) and
fetchJson<SearchResponse> usage so that debugLog, result usage, and the final
returned object are produced after the retries (or after the final attempt),
preserving the same fields (result.total_results, processing_time_ms, usage,
model) and only returning success:false or the no-results message after all
retries are exhausted.
In `@services/platform/convex/lib/attachments/process_attachments.ts`:
- Around line 231-247: The assembled unprocessedAttachments array incorrectly
mixes failedImages with text-bearing files then instructs the agent to use
rag_search.retrieve for all; update the logic around unprocessedAttachments (the
variable built from fileAttachments, failedImages, failedSpreadsheets) so
failedImages are excluded from the rag_search.retrieve fallback and are handled
separately: build the contentParts block with only fileAttachments and
failedSpreadsheets when adding the “[ATTACHED FILES - Use rag_search tool with
operation="retrieve"…]” message and loop, and for failedImages push a distinct
text entry that does not reference rag_search.retrieve (e.g., note that the
image failed and requires re-upload or a different image-recovery tool) so the
agent isn’t directed to use rag_search.retrieve for images.
- Around line 258-263: The return currently hardcodes imageInfoList and
textFileInfoList to empty arrays while the function (process_attachments.ts)
already collects attachment data; replace those empty arrays with the actual
lists built earlier in the function (use the image/text attachment variables
populated during processing) so returned parsedDocuments, imageInfoList, and
textFileInfoList reflect the real attachment metadata; keep promptContent as-is
and ensure the returned objects use the existing symbols/variables that hold the
image and text file metadata instead of [].
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 88b0a878-ab99-4f1b-a0a8-37c173073a4f
⛔ Files ignored due to path filters (1)
services/platform/convex/_generated/api.d.tsis excluded by!**/_generated/**
📒 Files selected for processing (27)
services/platform/convex/agent_tools/files/docx_tool.tsservices/platform/convex/agent_tools/files/excel_tool.tsservices/platform/convex/agent_tools/files/helpers/analyze_text.tsservices/platform/convex/agent_tools/files/helpers/parse_file.tsservices/platform/convex/agent_tools/files/helpers/resolve_file_name.tsservices/platform/convex/agent_tools/files/internal_actions.tsservices/platform/convex/agent_tools/files/pdf_tool.tsservices/platform/convex/agent_tools/files/pptx_tool.tsservices/platform/convex/agent_tools/files/text_tool.tsservices/platform/convex/agent_tools/rag/helpers/fetch_document_chunks.tsservices/platform/convex/agent_tools/rag/rag_search_tool.tsservices/platform/convex/documents/generate_document_helpers.tsservices/platform/convex/documents/generate_docx_from_template.tsservices/platform/convex/documents/generate_pptx.tsservices/platform/convex/documents/helpers.tsservices/platform/convex/documents/internal_actions.tsservices/platform/convex/documents/internal_queries.tsservices/platform/convex/documents/list_documents_by_extension.tsservices/platform/convex/documents/types.tsservices/platform/convex/documents/validators.tsservices/platform/convex/file_metadata/internal_actions.tsservices/platform/convex/file_metadata/internal_mutations.tsservices/platform/convex/file_metadata/mutations.tsservices/platform/convex/lib/action_cache/index.tsservices/platform/convex/lib/attachments/process_attachments.tsservices/platform/convex/workflow_engine/action_defs/rag/rag_action.tsservices/rag/app/services/rag_service.py
💤 Files with no reviewable changes (11)
- services/platform/convex/agent_tools/files/internal_actions.ts
- services/platform/convex/documents/validators.ts
- services/platform/convex/lib/action_cache/index.ts
- services/platform/convex/documents/internal_queries.ts
- services/platform/convex/documents/helpers.ts
- services/platform/convex/agent_tools/files/helpers/resolve_file_name.ts
- services/platform/convex/documents/list_documents_by_extension.ts
- services/platform/convex/documents/generate_docx_from_template.ts
- services/platform/convex/documents/generate_pptx.ts
- services/platform/convex/agent_tools/files/helpers/parse_file.ts
- services/platform/convex/agent_tools/files/helpers/analyze_text.ts
| sheets: z | ||
| .array( | ||
| z.object({ | ||
| name: z.string().describe('Sheet name'), | ||
| headers: z | ||
| .array(z.string()) | ||
| .nonempty() | ||
| .describe( | ||
| "Column headers for the sheet (must align with each row's columns)", | ||
| ), | ||
| rows: z | ||
| .array( | ||
| z.array(z.union([z.string(), z.number(), z.boolean(), z.null()])), | ||
| ) | ||
| .describe('2D array of cell values (rows x columns)'), | ||
| }), |
There was a problem hiding this comment.
Validate row width against header count at schema boundary.
At Line 34, the schema allows rows with any column count. This can produce malformed sheet data relative to headers.
Suggested fix
const excelArgs = z.object({
operation: z.literal('generate'),
fileName: z
.string()
.describe('Base name for the Excel file (without extension)'),
sheets: z
.array(
z.object({
name: z.string().describe('Sheet name'),
headers: z
.array(z.string())
.nonempty()
.describe(
"Column headers for the sheet (must align with each row's columns)",
),
rows: z
.array(
z.array(z.union([z.string(), z.number(), z.boolean(), z.null()])),
)
.describe('2D array of cell values (rows x columns)'),
- }),
+ }).superRefine((sheet, ctx) => {
+ const expected = sheet.headers.length;
+ sheet.rows.forEach((row, index) => {
+ if (row.length !== expected) {
+ ctx.addIssue({
+ code: 'custom',
+ path: ['rows', index],
+ message: `Row ${index + 1} has ${row.length} columns; expected ${expected}`,
+ });
+ }
+ });
+ }),
)
.describe('Sheets to include in the Excel file'),
});Based on learnings: “In any file under services/platform/convex/agent_tools ... prefer runtime validation ... coming from LLM/user input over relying on TypeScript type restrictions alone.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/platform/convex/agent_tools/files/excel_tool.ts` around lines 34 -
49, The sheet schema currently allows rows with arbitrary column counts; add a
runtime check on the sheet object (the z.object that contains name, headers,
rows inside the sheets array) that validates every row length equals
headers.length. Implement this by using zod's refine or superRefine on that
sheet object (reference the sheets schema's sheet object), iterate rows and if
any row length differs from headers.length call ctx.addIssue or return false
with a clear message (include sheet.name in the error). This ensures rows are
validated against the header count at the schema boundary rather than relying on
TypeScript types.
| while (true) { | ||
| const chunkEnd = chunkStart + MAX_CHUNK_WINDOW - 1; | ||
| const url = `${serviceUrl}/api/v1/documents/${encodeURIComponent(fileId)}/content?return_chunks=true&chunk_start=${chunkStart}&chunk_end=${chunkEnd}`; | ||
|
|
||
| const response = await fetch(url); | ||
|
|
||
| if (!response.ok) { | ||
| const errorText = await response.text().catch(() => ''); | ||
| throw new Error( | ||
| `RAG get_chunks error (${response.status}): ${errorText || 'Unknown error'}`, | ||
| ); | ||
| } | ||
|
|
||
| const result = await fetchJson<DocumentContentResponse>(response); | ||
| documentId = result.file_id; | ||
| title = result.title; | ||
| totalChunks = result.total_chunks; | ||
|
|
||
| if (result.chunks) { | ||
| allChunks.push(...result.chunks); | ||
| } | ||
|
|
||
| if (result.chunk_range.end >= totalChunks) { | ||
| break; | ||
| } | ||
|
|
||
| chunkStart = result.chunk_range.end + 1; |
There was a problem hiding this comment.
Bound the pagination loop before using this in tools/actions.
This loop assumes the remote chunk_range.end always advances. If the RAG service returns a stale or malformed range, chunkStart never moves and the caller spins forever; paired with the timeout-less fetch, one bad response can pin the tool/action until platform timeout.
Suggested fix
while (true) {
const chunkEnd = chunkStart + MAX_CHUNK_WINDOW - 1;
const url = `${serviceUrl}/api/v1/documents/${encodeURIComponent(fileId)}/content?return_chunks=true&chunk_start=${chunkStart}&chunk_end=${chunkEnd}`;
- const response = await fetch(url);
+ const response = await fetch(url, {
+ signal: AbortSignal.timeout(30_000),
+ });
if (!response.ok) {
const errorText = await response.text().catch(() => '');
throw new Error(
`RAG get_chunks error (${response.status}): ${errorText || 'Unknown error'}`,
@@
- if (result.chunk_range.end >= totalChunks) {
+ if (result.chunk_range.end < chunkStart) {
+ throw new Error(
+ `RAG get_chunks returned a non-advancing chunk range for file ${fileId}`,
+ );
+ }
+
+ if (result.chunk_range.end >= totalChunks) {
break;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| while (true) { | |
| const chunkEnd = chunkStart + MAX_CHUNK_WINDOW - 1; | |
| const url = `${serviceUrl}/api/v1/documents/${encodeURIComponent(fileId)}/content?return_chunks=true&chunk_start=${chunkStart}&chunk_end=${chunkEnd}`; | |
| const response = await fetch(url); | |
| if (!response.ok) { | |
| const errorText = await response.text().catch(() => ''); | |
| throw new Error( | |
| `RAG get_chunks error (${response.status}): ${errorText || 'Unknown error'}`, | |
| ); | |
| } | |
| const result = await fetchJson<DocumentContentResponse>(response); | |
| documentId = result.file_id; | |
| title = result.title; | |
| totalChunks = result.total_chunks; | |
| if (result.chunks) { | |
| allChunks.push(...result.chunks); | |
| } | |
| if (result.chunk_range.end >= totalChunks) { | |
| break; | |
| } | |
| chunkStart = result.chunk_range.end + 1; | |
| while (true) { | |
| const chunkEnd = chunkStart + MAX_CHUNK_WINDOW - 1; | |
| const url = `${serviceUrl}/api/v1/documents/${encodeURIComponent(fileId)}/content?return_chunks=true&chunk_start=${chunkStart}&chunk_end=${chunkEnd}`; | |
| const response = await fetch(url, { | |
| signal: AbortSignal.timeout(30_000), | |
| }); | |
| if (!response.ok) { | |
| const errorText = await response.text().catch(() => ''); | |
| throw new Error( | |
| `RAG get_chunks error (${response.status}): ${errorText || 'Unknown error'}`, | |
| ); | |
| } | |
| const result = await fetchJson<DocumentContentResponse>(response); | |
| documentId = result.file_id; | |
| title = result.title; | |
| totalChunks = result.total_chunks; | |
| if (result.chunks) { | |
| allChunks.push(...result.chunks); | |
| } | |
| if (result.chunk_range.end < chunkStart) { | |
| throw new Error( | |
| `RAG get_chunks returned a non-advancing chunk range for file ${fileId}`, | |
| ); | |
| } | |
| if (result.chunk_range.end >= totalChunks) { | |
| break; | |
| } | |
| chunkStart = result.chunk_range.end + 1; |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/platform/convex/agent_tools/rag/helpers/fetch_document_chunks.ts`
around lines 32 - 58, The pagination loop using chunkStart/chunk_range.end can
hang if the service returns a stale or malformed range; add a safety bound and a
fallback advance: introduce a max iteration cap (e.g. MAX_ITERATIONS) and an
iteration counter in the while(true) loop to throw a clear error once exceeded,
and after receiving result check if result.chunk_range.end <= chunkStart — if
so, advance chunkStart by MAX_CHUNK_WINDOW as a fallback (or throw if you prefer
strictness) to ensure progress; also wrap fetch with an AbortController and a
short timeout so the fetch call itself cannot hang indefinitely; update
references to chunkStart, chunkEnd, MAX_CHUNK_WINDOW, fetch and fetchJson
accordingly.
| z.object({ | ||
| operation: z.literal('retrieve'), | ||
| fileId: z | ||
| .string() | ||
| .describe( | ||
| 'File ID of the document to retrieve full content from (e.g., "kg2bazp7fbgt9srq63knfagjrd7yfenj")', | ||
| ), | ||
| }), |
There was a problem hiding this comment.
Enforce knowledge-scope authorization before retrieve.
retrieve accepts any fileId and calls the RAG service directly. Unlike search, it never goes through resolveFileIds, so anyone who can supply a foreign fileId can bypass the agent’s knowledge boundary and fetch full document text.
Suggested fix
if (args.operation === 'retrieve') {
+ const allowedFileIds = new Set(await resolveFileIds(ctx));
+ if (!allowedFileIds.has(args.fileId)) {
+ throw new Error('This file is not available to the current agent.');
+ }
+
debugLog('tool:rag_search retrieve start', { fileId: args.fileId });
const ragServiceUrl = getRagConfig().serviceUrl;
const result = await fetchDocumentChunks(ragServiceUrl, args.fileId);Also applies to: 173-195
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/platform/convex/agent_tools/rag/rag_search_tool.ts` around lines 117
- 124, The "retrieve" operation currently accepts any fileId and directly calls
the RAG service, letting callers bypass knowledge-scope checks; to fix, apply
the same authorization/ID resolution used by the "search" flow by calling
resolveFileIds (or the shared authorization function used by search) before
performing the retrieve, validate that the returned resolved IDs include the
requested fileId (or replace the incoming id with the resolved one), and reject
the request if resolution/authorization fails; update both places where
"retrieve" is handled (the z.object operation with literal 'retrieve' and the
duplicate handler around lines 173-195) to enforce this check and reuse the same
resolveFileIds/authorization logic.
| */ | ||
| export const uploadFileToRag = internalAction({ | ||
| args: { | ||
| storageId: v.string(), |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Verify current validator shape in the new action
rg -n "uploadFileToRag|storageId:\s*v\.(string|id\('_storage'\))" --type ts services/platform/convex/file_metadata/internal_actions.ts -C 3
# Compare with repo conventions for storage ID boundaries
rg -n "storageId:\s*v\.id\('_storage'\)" --type ts services/platform/convex -C 2Repository: tale-project/tale
Length of output: 4994
Use v.id('_storage') for storageId at this internal action boundary.
storageId is validated as v.string() at line 18, but all other endpoints in the file_metadata module (mutations, queries, internal_queries) and internal actions across the codebase consistently use v.id('_storage'). The schema definition also expects v.id('_storage'). Changing to v.id('_storage') maintains type safety and consistency with the rest of the codebase.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/platform/convex/file_metadata/internal_actions.ts` at line 18, The
internal action schema currently validates storageId with v.string() but should
use the project-specific ID validator v.id('_storage') to match the
file_metadata module and schema; update the storageId field in
internal_actions.ts from v.string() to v.id('_storage') (ensuring any references
to storageId in the same internal action signature continue to use the ID type)
so it matches other endpoints, internal_queries/mutations, and the expected
schema.
| await ctx.scheduler.runAfter( | ||
| 0, | ||
| internal.file_metadata.internal_actions.uploadFileToRag, | ||
| { | ||
| storageId: args.storageId, | ||
| fileName: args.fileName, | ||
| contentType: args.contentType, | ||
| }, | ||
| ); |
There was a problem hiding this comment.
Do not let background RAG scheduling failure fail metadata creation.
At Line 58, await ctx.scheduler.runAfter(...) is on the main mutation path with no isolation. If scheduling fails, the whole mutation can fail, which blocks file metadata creation for a non-critical async step.
Suggested fix
- await ctx.scheduler.runAfter(
- 0,
- internal.file_metadata.internal_actions.uploadFileToRag,
- {
- storageId: args.storageId,
- fileName: args.fileName,
- contentType: args.contentType,
- },
- );
+ try {
+ await ctx.scheduler.runAfter(
+ 0,
+ internal.file_metadata.internal_actions.uploadFileToRag,
+ {
+ storageId: args.storageId,
+ fileName: args.fileName,
+ contentType: args.contentType,
+ },
+ );
+ } catch (error) {
+ console.error('[saveFileMetadata] Failed to schedule uploadFileToRag', {
+ storageId: args.storageId,
+ error: error instanceof Error ? error.message : String(error),
+ });
+ }Based on learnings: “deleteDocumentFromRag is scheduled via ctx.scheduler.runAfter(...) as a fire-and-forget cleanup... failures are intentionally logged ... rather than propagated.”
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| await ctx.scheduler.runAfter( | |
| 0, | |
| internal.file_metadata.internal_actions.uploadFileToRag, | |
| { | |
| storageId: args.storageId, | |
| fileName: args.fileName, | |
| contentType: args.contentType, | |
| }, | |
| ); | |
| try { | |
| await ctx.scheduler.runAfter( | |
| 0, | |
| internal.file_metadata.internal_actions.uploadFileToRag, | |
| { | |
| storageId: args.storageId, | |
| fileName: args.fileName, | |
| contentType: args.contentType, | |
| }, | |
| ); | |
| } catch (error) { | |
| console.error('[saveFileMetadata] Failed to schedule uploadFileToRag', { | |
| storageId: args.storageId, | |
| error: error instanceof Error ? error.message : String(error), | |
| }); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/platform/convex/file_metadata/mutations.ts` around lines 58 - 66,
The call to
ctx.scheduler.runAfter(internal.file_metadata.internal_actions.uploadFileToRag,
...) is awaited on the main mutation path so scheduler failures can abort
metadata creation; change this to fire-and-forget by removing the await and
ensuring any returned promise errors are caught and logged (e.g., call
ctx.scheduler.runAfter(...).catch(...) or wrap in try/catch and log) so
scheduling failures do not propagate and block the mutation.
- Remove unused `isTextFile` and `getParseEndpoint` exports from file-types.ts (Knip) - Mock `get_document_statuses` in test_passes_file_ids to handle retry logic (RAG test)
d5261dd to
4139717
Compare
Chat file uploads were fire-and-forget — users could send messages before files finished indexing, causing the agent to fail reading file content. - Add ragStatus/ragError fields to fileMetadata schema - Track indexing lifecycle: queued → running → completed/failed - Poll RAG service with fast intervals (5-15s) for chat UX - Block send button while any attachment is still indexing - Show per-file indexing spinner/error in chat input - Raise chat attachment size limit from 10MB to 100MB
The previous 4-minute timeout was too aggressive for large PDFs with images, causing premature "Index failed" status while RAG was still processing. Now uses progressive backoff: 5s→10s→30s→60s intervals over 60 attempts (~31 minutes total).
Track extraction progress through the full pipeline: - RAG DB: add progress_phase/progress_detail columns to documents table - PDF extraction: on_progress callback reports page-level progress (throttled to ~3s DB writes to avoid overhead) - RAG status endpoint: returns progress_phase + progress_detail - Convex polling: picks up progress and stores as ragProgress on fileMetadata - Chat UI: shows live progress (e.g. "extracting 12/50") instead of generic "Indexing..." Progress phases: extracting → embedding → storing
This comment has been minimized.
This comment has been minimized.
A 349-chunk document (590KB) was exceeding the model's 163K token limit. Now retrieve truncates at 50K chars (~15K tokens) and advises the agent to use search instead. Also stops fetching chunks once 60K chars are accumulated to avoid unnecessary network calls.
…per call Instead of fetching the entire document and truncating, retrieve now uses the RAG API's native pagination. Default page size is 10 chunks. Agent can use chunkStart/chunkEnd params to paginate through large documents, or use search for targeted lookups. Returns chunkRange, totalChunks, and hasMore for agent pagination.
…text getFailedMessageErrors now reads errors from messageMetadata (primary) in addition to MessageDoc.error (fallback). This fixes the mismatch where UIMessage.id (firstMessage._id in a group) didn't match the failed MessageDoc._id in multi-step agent responses, causing the error to be undefined in the frontend.
…generic text" This reverts commit d2474b2.
…sponses UIMessage.id is the first message in a group, but the error lives on the last (failed) message which has a different _id. When the direct lookup fails, fall back to any error in the messageErrors map for the thread. This ensures the real error (e.g. "context length exceeded") is shown instead of the generic "Something went wrong".
…to runtime Embedding dimensions are determined dynamically by the RAG service, so the HNSW index must be created at runtime (matching the chunks table pattern).
…responses In error scenarios, metadata is saved with the failed message's ID but UIMessage.id is the first message in the group. When direct lookup by messageId fails, fall back to querying by threadId (most recent entry). This fixes "Token usage and model information are not available" in the Message Information dialog for failed responses.
…hours - Convert raw "42/108" progress to "39%" in chat file pill for clarity - Extend polling from 31 min to ~2.5 hours (120 attempts) to match RAG service's 3-hour ingestion timeout for very large files
…rce cards Replace server-side scheduled polling with a client-driven action that batches status checks for pending files, eliminating wasted server resources when users leave the page. Rework source cards to group citations by source file, display chunk content inline with a detail dialog, and show relevance/page metadata.
…tions and tool descriptions Rework citation parsing to correctly handle rag_search retrieve results, offset citation numbers across successive tool calls to prevent collisions, and improve deduplication to distinguish different chunks from the same file. Move "read file" guidance to the top of each file tool description for better agent visibility. Return fileId and filename from retrieve operations to enable proper source card attribution.
Guide the agent to pass user-provided file IDs (from chat uploads) via the fileIds parameter first, falling back to broader search when needed. Clarify that list_indexed only covers Document Hub files.
- Add non-null assertion for threadId in message_metadata query (already guarded by if-check) - Add progress_phase/progress_detail fields to test mock data matching updated SQL query - Update vision_client test to accept additional on_progress kwarg
- Use destructuring instead of non-null assertion for threadId narrowing - Add ragStatus: 'queued' to file metadata insert test expectations
…uter The platform's generateDocument helper builds URLs like /api/v1/pptx/from-html but the crawler only had the JSON-based POST /api/v1/pptx endpoint, causing 404s. Added the missing endpoints following the same pattern as the DOCX router.
Instead of calling the crawler service to generate .pptx files, the pptx tool now accepts a complete HTML document from the LLM and stores it directly. This gives the AI full control over styling, layout, and animations (using reveal.js or any framework via CDN). - Add storeRawContent internal action for direct string-to-storage uploads - Remove sourceType param; tool now takes a single html param - Add explicit "no templates" instruction to prevent template hallucination
Summary
fileMetadatainsert (viasaveFileMetadata), covering all upload paths (agent chat, document hub, webhook, OneDrive, etc.)retrieveoperation torag_searchtool for full document content retrieval, replacing parseparseoperations from file tools (pdf, docx, pptx, text, excel) — file content now accessed viarag_searchretrievelist_templatesand template-based generation removedrag_searchretrieve for reading filesTest plan
fileMetadatais created and RAG indexing is triggered asynchronouslyrag_searchwithoperation='retrieve'to get full content of an indexed filerag_searchwithoperation='search'on a just-uploaded file → verify retry logic worksnpm run lintandnpx tsc --noEmit— both passSummary by CodeRabbit
New Features
Bug Fixes
Refactor