feat(platform): simplify file tools — auto RAG indexing, remove parse/template ops by larryro · Pull Request #1407 · tale-project/tale

larryro · 2026-04-11T02:59:47Z

Summary

Auto RAG indexing: trigger RAG indexing on every new fileMetadata insert (via saveFileMetadata), covering all upload paths (agent chat, document hub, webhook, OneDrive, etc.)
RAG search retry: when search returns no results and some files are still indexing, wait 3s and retry once (zero overhead in normal case)
Add retrieve operation to rag_search tool for full document content retrieval, replacing parse
Remove all parse operations from file tools (pdf, docx, pptx, text, excel) — file content now accessed via rag_search retrieve
Remove all template operations from docx/pptx tools — list_templates and template-based generation removed
Rewrite pptx tool to use markdown/HTML generation (consistent with pdf/docx)
Update file tool descriptions to guide AI toward rag_search retrieve for reading files
Clean up dead code: template generators, parse helpers, unused validators (-1864 lines)

Test plan

Upload a file in agent chat → verify fileMetadata is created and RAG indexing is triggered asynchronously
Upload a file in document hub → verify RAG indexing is triggered
Use rag_search with operation='retrieve' to get full content of an indexed file
Use rag_search with operation='search' on a just-uploaded file → verify retry logic works
Generate PDF/DOCX/PPTX from markdown → verify generation still works
Verify no parse or template operations are available in file tools
Run npm run lint and npx tsc --noEmit — both pass

Summary by CodeRabbit

New Features
- Added document retrieval operation to search functionality for fetching full file contents.
- Automatic file indexing for improved searchability.
Bug Fixes
- Improved search reliability with retry logic for recently uploaded documents.
Refactor
- Removed file parsing operations from document tools; use search functionality with retrieval instead.
- Removed template-based document generation; now supports content-based generation only.

…/template ops - Auto-trigger RAG indexing when files are uploaded (no more manual trigger) - Add `retrieve` operation to rag_search tool for full document content access - Remove `parse` operations from all file tools (pdf, docx, pptx, text, excel) - Remove `list_templates` and template-based generation from docx/pptx tools - Rewrite pptx tool to use markdown/HTML generation (consistent with pdf/docx) - Add retry-on-empty logic in RAG search service for files still being indexed - Update file tool descriptions to guide AI toward rag_search retrieve - Clean up dead code: template generators, parse helpers, unused validators

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

coderabbitai · 2026-04-11T03:11:03Z

📝 Walkthrough

Walkthrough

This PR removes file parsing capabilities from individual file tools (DOCX, Excel, PDF, PPTX, Text) and their supporting infrastructure, including helpers for parsing files (parseFile), analyzing text (analyzeTextContent), and resolving file names. In parallel, it introduces a unified document retrieval mechanism via a new fetchDocumentChunks helper that retrieves document content from a RAG service endpoint. File tools are simplified to generation-only operations, and the document generation pipeline is consolidated through a unified generateDocument action supporting PPTX output. Additionally, file metadata creation now triggers background RAG upload scheduling, and attachment processing is refactored to direct file reading to the RAG search tool's new retrieve operation instead of in-agent parsing.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

feat: add file attachment support for chat agent with document parsing #12 — Deletes the parse-related helpers and tool parse branches that this PR removes, touching the same files and parsing infrastructure.
feat: contract comparison workflow with chunk-level RAG #722 — Modifies RAG document retrieval by adding a retrieve operation to rag_search_tool and introducing helpers to fetch document chunks, directly overlapping with the new RAG retrieval capabilities introduced here.
feat: extract and surface document file dates across platform #862 — Both PRs introduce RAG retrieval helpers (fetchDocumentChunks), add RAG upload scheduling (uploadFileToRag), and modify document metadata flows, so the changes align on the same RAG integration code paths.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main changes: auto RAG indexing, removal of parse operations, and removal of template operations from file tools.
Docstring Coverage	✅ Passed	Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/file-tools-optimization

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)

services/platform/convex/lib/attachments/process_attachments.ts (2)
231-247: ⚠️ Potential issue | 🟠 Major

Keep failed images out of the rag_search.retrieve bucket.

unprocessedAttachments now mixes failedImages with text-bearing files, then tells the agent to use rag_search.retrieve for all of them. That fallback works for documents/spreadsheets, but it sends image failures to a tool path that cannot recover image content.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/lib/attachments/process_attachments.ts` around lines
231 - 247, The assembled unprocessedAttachments array incorrectly mixes
failedImages with text-bearing files then instructs the agent to use
rag_search.retrieve for all; update the logic around unprocessedAttachments (the
variable built from fileAttachments, failedImages, failedSpreadsheets) so
failedImages are excluded from the rag_search.retrieve fallback and are handled
separately: build the contentParts block with only fileAttachments and
failedSpreadsheets when adding the “[ATTACHED FILES - Use rag_search tool with
operation="retrieve"…]” message and loop, and for failedImages push a distinct
text entry that does not reference rag_search.retrieve (e.g., note that the
image failed and requires re-upload or a different image-recovery tool) so the
agent isn’t directed to use rag_search.retrieve for images.
258-263: ⚠️ Potential issue | 🟠 Major

Don’t return empty metadata arrays after doing the work.

The function still advertises imageInfoList and textFileInfoList, and it already has the attachment data needed to populate them, but the return value hardcodes both to []. That silently breaks any downstream consumer still relying on these fields for follow-up tool calls or attachment rendering.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/lib/attachments/process_attachments.ts` around lines
258 - 263, The return currently hardcodes imageInfoList and textFileInfoList to
empty arrays while the function (process_attachments.ts) already collects
attachment data; replace those empty arrays with the actual lists built earlier
in the function (use the image/text attachment variables populated during
processing) so returned parsedDocuments, imageInfoList, and textFileInfoList
reflect the real attachment metadata; keep promptContent as-is and ensure the
returned objects use the existing symbols/variables that hold the image and text
file metadata instead of [].
services/platform/convex/agent_tools/files/pdf_tool.ts (1)
30-66: ⚠️ Potential issue | 🟠 Major

Update the example file assistant prompt in the same PR.

This tool now explicitly tells models to use rag_search for reading, but examples/agents/file-assistant.json still advertises “Parse existing PDFs” and lists pdf as a parsing tool. That example assistant will keep emitting removed operations.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/agent_tools/files/pdf_tool.ts` around lines 30 - 66,
The example assistant prompt in the repo still advertises using the pdf parsing
tool and the “Parse existing PDFs” capability while pdf_tool.ts now instructs
models to use rag_search for reading files and to only call the "generate"
operation when explicitly requested; update the example assistant
(examples/agents/file-assistant.json) to remove references to the pdf parsing
operation and the “Parse existing PDFs” wording, replace guidance to read file
contents with using rag_search (operation='retrieve'/'search'), and ensure any
example interactions no longer show calling pdf tool for parsing/downloading
PDFs except when demonstrating the "generate" operation.
services/platform/convex/agent_tools/files/docx_tool.ts (1)

63-90: ⚠️ Potential issue | 🟠 Major

Make the two generate modes mutually exclusive.

The schema allows sections, sourceType, and content together, but execution silently chooses content mode and ignores the structured sections/title/subtitle. For model-generated tool calls, that ambiguity will produce the wrong document without any validation error.

Based on learnings, "In any file under services/platform/convex/agent_tools ... prefer runtime validation of identifiers like tableName coming from LLM/user input over relying on TypeScript type restrictions alone. Implement validation at call boundaries that returns descriptive, user-friendly errors."

Also applies to: 145-196
services/platform/convex/agent_tools/rag/rag_search_tool.ts (1)
230-269: ⚠️ Potential issue | 🟠 Major

Implement the zero-result retry that this PR promises.

This branch still does one /search call and immediately returns “No relevant results found” on an empty result set. Newly uploaded files that are still indexing never get the documented 3-second retry, so the main user-facing behavior in this PR is still missing.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/agent_tools/rag/rag_search_tool.ts` around lines 230
- 269, The code currently does one fetch/search and immediately returns "No
relevant results found" when formatSearchResults(result.results) is falsy;
implement a retry loop around the fetch + fetchJson<SearchResponse> call that,
when result.total_results === 0 or formatSearchResults(...) returns falsy, waits
and retries for up to the documented 3 seconds (e.g., 3 attempts with ~1s delay
or a single 3s wait depending on project convention) before returning the
no-results response. Update the flow around the existing fetch(url, ...) and
fetchJson<SearchResponse> usage so that debugLog, result usage, and the final
returned object are produced after the retries (or after the final attempt),
preserving the same fields (result.total_results, processing_time_ms, usage,
model) and only returning success:false or the no-results message after all
retries are exhausted.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@services/platform/convex/agent_tools/files/excel_tool.ts`:
- Around line 34-49: The sheet schema currently allows rows with arbitrary
column counts; add a runtime check on the sheet object (the z.object that
contains name, headers, rows inside the sheets array) that validates every row
length equals headers.length. Implement this by using zod's refine or
superRefine on that sheet object (reference the sheets schema's sheet object),
iterate rows and if any row length differs from headers.length call ctx.addIssue
or return false with a clear message (include sheet.name in the error). This
ensures rows are validated against the header count at the schema boundary
rather than relying on TypeScript types.

In `@services/platform/convex/agent_tools/rag/helpers/fetch_document_chunks.ts`:
- Around line 32-58: The pagination loop using chunkStart/chunk_range.end can
hang if the service returns a stale or malformed range; add a safety bound and a
fallback advance: introduce a max iteration cap (e.g. MAX_ITERATIONS) and an
iteration counter in the while(true) loop to throw a clear error once exceeded,
and after receiving result check if result.chunk_range.end <= chunkStart — if
so, advance chunkStart by MAX_CHUNK_WINDOW as a fallback (or throw if you prefer
strictness) to ensure progress; also wrap fetch with an AbortController and a
short timeout so the fetch call itself cannot hang indefinitely; update
references to chunkStart, chunkEnd, MAX_CHUNK_WINDOW, fetch and fetchJson
accordingly.

In `@services/platform/convex/agent_tools/rag/rag_search_tool.ts`:
- Around line 117-124: The "retrieve" operation currently accepts any fileId and
directly calls the RAG service, letting callers bypass knowledge-scope checks;
to fix, apply the same authorization/ID resolution used by the "search" flow by
calling resolveFileIds (or the shared authorization function used by search)
before performing the retrieve, validate that the returned resolved IDs include
the requested fileId (or replace the incoming id with the resolved one), and
reject the request if resolution/authorization fails; update both places where
"retrieve" is handled (the z.object operation with literal 'retrieve' and the
duplicate handler around lines 173-195) to enforce this check and reuse the same
resolveFileIds/authorization logic.

In `@services/platform/convex/file_metadata/internal_actions.ts`:
- Line 18: The internal action schema currently validates storageId with
v.string() but should use the project-specific ID validator v.id('_storage') to
match the file_metadata module and schema; update the storageId field in
internal_actions.ts from v.string() to v.id('_storage') (ensuring any references
to storageId in the same internal action signature continue to use the ID type)
so it matches other endpoints, internal_queries/mutations, and the expected
schema.

In `@services/platform/convex/file_metadata/mutations.ts`:
- Around line 58-66: The call to
ctx.scheduler.runAfter(internal.file_metadata.internal_actions.uploadFileToRag,
...) is awaited on the main mutation path so scheduler failures can abort
metadata creation; change this to fire-and-forget by removing the await and
ensuring any returned promise errors are caught and logged (e.g., call
ctx.scheduler.runAfter(...).catch(...) or wrap in try/catch and log) so
scheduling failures do not propagate and block the mutation.

---

Outside diff comments:
In `@services/platform/convex/agent_tools/files/pdf_tool.ts`:
- Around line 30-66: The example assistant prompt in the repo still advertises
using the pdf parsing tool and the “Parse existing PDFs” capability while
pdf_tool.ts now instructs models to use rag_search for reading files and to only
call the "generate" operation when explicitly requested; update the example
assistant (examples/agents/file-assistant.json) to remove references to the pdf
parsing operation and the “Parse existing PDFs” wording, replace guidance to
read file contents with using rag_search (operation='retrieve'/'search'), and
ensure any example interactions no longer show calling pdf tool for
parsing/downloading PDFs except when demonstrating the "generate" operation.

In `@services/platform/convex/agent_tools/rag/rag_search_tool.ts`:
- Around line 230-269: The code currently does one fetch/search and immediately
returns "No relevant results found" when formatSearchResults(result.results) is
falsy; implement a retry loop around the fetch + fetchJson<SearchResponse> call
that, when result.total_results === 0 or formatSearchResults(...) returns falsy,
waits and retries for up to the documented 3 seconds (e.g., 3 attempts with ~1s
delay or a single 3s wait depending on project convention) before returning the
no-results response. Update the flow around the existing fetch(url, ...) and
fetchJson<SearchResponse> usage so that debugLog, result usage, and the final
returned object are produced after the retries (or after the final attempt),
preserving the same fields (result.total_results, processing_time_ms, usage,
model) and only returning success:false or the no-results message after all
retries are exhausted.

In `@services/platform/convex/lib/attachments/process_attachments.ts`:
- Around line 231-247: The assembled unprocessedAttachments array incorrectly
mixes failedImages with text-bearing files then instructs the agent to use
rag_search.retrieve for all; update the logic around unprocessedAttachments (the
variable built from fileAttachments, failedImages, failedSpreadsheets) so
failedImages are excluded from the rag_search.retrieve fallback and are handled
separately: build the contentParts block with only fileAttachments and
failedSpreadsheets when adding the “[ATTACHED FILES - Use rag_search tool with
operation="retrieve"…]” message and loop, and for failedImages push a distinct
text entry that does not reference rag_search.retrieve (e.g., note that the
image failed and requires re-upload or a different image-recovery tool) so the
agent isn’t directed to use rag_search.retrieve for images.
- Around line 258-263: The return currently hardcodes imageInfoList and
textFileInfoList to empty arrays while the function (process_attachments.ts)
already collects attachment data; replace those empty arrays with the actual
lists built earlier in the function (use the image/text attachment variables
populated during processing) so returned parsedDocuments, imageInfoList, and
textFileInfoList reflect the real attachment metadata; keep promptContent as-is
and ensure the returned objects use the existing symbols/variables that hold the
image and text file metadata instead of [].

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 88b0a878-ab99-4f1b-a0a8-37c173073a4f

📥 Commits

Reviewing files that changed from the base of the PR and between 1d6fbf9 and c54cd12.

⛔ Files ignored due to path filters (1)

services/platform/convex/_generated/api.d.ts is excluded by !**/_generated/**

📒 Files selected for processing (27)

services/platform/convex/agent_tools/files/docx_tool.ts
services/platform/convex/agent_tools/files/excel_tool.ts
services/platform/convex/agent_tools/files/helpers/analyze_text.ts
services/platform/convex/agent_tools/files/helpers/parse_file.ts
services/platform/convex/agent_tools/files/helpers/resolve_file_name.ts
services/platform/convex/agent_tools/files/internal_actions.ts
services/platform/convex/agent_tools/files/pdf_tool.ts
services/platform/convex/agent_tools/files/pptx_tool.ts
services/platform/convex/agent_tools/files/text_tool.ts
services/platform/convex/agent_tools/rag/helpers/fetch_document_chunks.ts
services/platform/convex/agent_tools/rag/rag_search_tool.ts
services/platform/convex/documents/generate_document_helpers.ts
services/platform/convex/documents/generate_docx_from_template.ts
services/platform/convex/documents/generate_pptx.ts
services/platform/convex/documents/helpers.ts
services/platform/convex/documents/internal_actions.ts
services/platform/convex/documents/internal_queries.ts
services/platform/convex/documents/list_documents_by_extension.ts
services/platform/convex/documents/types.ts
services/platform/convex/documents/validators.ts
services/platform/convex/file_metadata/internal_actions.ts
services/platform/convex/file_metadata/internal_mutations.ts
services/platform/convex/file_metadata/mutations.ts
services/platform/convex/lib/action_cache/index.ts
services/platform/convex/lib/attachments/process_attachments.ts
services/platform/convex/workflow_engine/action_defs/rag/rag_action.ts
services/rag/app/services/rag_service.py

💤 Files with no reviewable changes (11)

services/platform/convex/agent_tools/files/internal_actions.ts
services/platform/convex/documents/validators.ts
services/platform/convex/lib/action_cache/index.ts
services/platform/convex/documents/internal_queries.ts
services/platform/convex/documents/helpers.ts
services/platform/convex/agent_tools/files/helpers/resolve_file_name.ts
services/platform/convex/documents/list_documents_by_extension.ts
services/platform/convex/documents/generate_docx_from_template.ts
services/platform/convex/documents/generate_pptx.ts
services/platform/convex/agent_tools/files/helpers/parse_file.ts
services/platform/convex/agent_tools/files/helpers/analyze_text.ts

coderabbitai · 2026-04-11T03:11:06Z

+  sheets: z
+    .array(
+      z.object({
+        name: z.string().describe('Sheet name'),
+        headers: z
+          .array(z.string())
+          .nonempty()
+          .describe(
+            "Column headers for the sheet (must align with each row's columns)",
+          ),
+        rows: z
+          .array(
+            z.array(z.union([z.string(), z.number(), z.boolean(), z.null()])),
+          )
+          .describe('2D array of cell values (rows x columns)'),
+      }),


⚠️ Potential issue | 🟠 Major

Validate row width against header count at schema boundary.

At Line 34, the schema allows rows with any column count. This can produce malformed sheet data relative to headers.

Suggested fix

const excelArgs = z.object({ operation: z.literal('generate'), fileName: z .string() .describe('Base name for the Excel file (without extension)'), sheets: z .array( z.object({ name: z.string().describe('Sheet name'), headers: z .array(z.string()) .nonempty() .describe( "Column headers for the sheet (must align with each row's columns)", ), rows: z .array( z.array(z.union([z.string(), z.number(), z.boolean(), z.null()])), ) .describe('2D array of cell values (rows x columns)'), - }), + }).superRefine((sheet, ctx) => { + const expected = sheet.headers.length; + sheet.rows.forEach((row, index) => { + if (row.length !== expected) { + ctx.addIssue({ + code: 'custom', + path: ['rows', index], + message: `Row ${index + 1} has ${row.length} columns; expected ${expected}`, + }); + } + }); + }), ) .describe('Sheets to include in the Excel file'), });

Based on learnings: “In any file under services/platform/convex/agent_tools ... prefer runtime validation ... coming from LLM/user input over relying on TypeScript type restrictions alone.”

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/agent_tools/files/excel_tool.ts` around lines 34 - 49, The sheet schema currently allows rows with arbitrary column counts; add a runtime check on the sheet object (the z.object that contains name, headers, rows inside the sheets array) that validates every row length equals headers.length. Implement this by using zod's refine or superRefine on that sheet object (reference the sheets schema's sheet object), iterate rows and if any row length differs from headers.length call ctx.addIssue or return false with a clear message (include sheet.name in the error). This ensures rows are validated against the header count at the schema boundary rather than relying on TypeScript types.

coderabbitai · 2026-04-11T03:11:06Z

+  while (true) {
+    const chunkEnd = chunkStart + MAX_CHUNK_WINDOW - 1;
+    const url = `${serviceUrl}/api/v1/documents/${encodeURIComponent(fileId)}/content?return_chunks=true&chunk_start=${chunkStart}&chunk_end=${chunkEnd}`;
+
+    const response = await fetch(url);
+
+    if (!response.ok) {
+      const errorText = await response.text().catch(() => '');
+      throw new Error(
+        `RAG get_chunks error (${response.status}): ${errorText || 'Unknown error'}`,
+      );
+    }
+
+    const result = await fetchJson<DocumentContentResponse>(response);
+    documentId = result.file_id;
+    title = result.title;
+    totalChunks = result.total_chunks;
+
+    if (result.chunks) {
+      allChunks.push(...result.chunks);
+    }
+
+    if (result.chunk_range.end >= totalChunks) {
+      break;
+    }
+
+    chunkStart = result.chunk_range.end + 1;


⚠️ Potential issue | 🟠 Major

Bound the pagination loop before using this in tools/actions.

This loop assumes the remote chunk_range.end always advances. If the RAG service returns a stale or malformed range, chunkStart never moves and the caller spins forever; paired with the timeout-less fetch, one bad response can pin the tool/action until platform timeout.

Suggested fix

while (true) { const chunkEnd = chunkStart + MAX_CHUNK_WINDOW - 1; const url = `${serviceUrl}/api/v1/documents/${encodeURIComponent(fileId)}/content?return_chunks=true&chunk_start=${chunkStart}&chunk_end=${chunkEnd}`; - const response = await fetch(url); + const response = await fetch(url, { + signal: AbortSignal.timeout(30_000), + }); if (!response.ok) { const errorText = await response.text().catch(() => ''); throw new Error( `RAG get_chunks error (${response.status}): ${errorText || 'Unknown error'}`, @@ - if (result.chunk_range.end >= totalChunks) { + if (result.chunk_range.end < chunkStart) { + throw new Error( + `RAG get_chunks returned a non-advancing chunk range for file ${fileId}`, + ); + } + + if (result.chunk_range.end >= totalChunks) { break; }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

while (true) {

const chunkEnd = chunkStart + MAX_CHUNK_WINDOW - 1;

const url = `${serviceUrl}/api/v1/documents/${encodeURIComponent(fileId)}/content?return_chunks=true&chunk_start=${chunkStart}&chunk_end=${chunkEnd}`;

const response = await fetch(url);

if (!response.ok) {

const errorText = await response.text().catch(() => '');

throw new Error(

`RAG get_chunks error (${response.status}): ${errorText || 'Unknown error'}`,

);

}

const result = await fetchJson<DocumentContentResponse>(response);

documentId = result.file_id;

title = result.title;

totalChunks = result.total_chunks;

if (result.chunks) {

allChunks.push(...result.chunks);

}

if (result.chunk_range.end >= totalChunks) {

break;

}

chunkStart = result.chunk_range.end + 1;

while (true) {

const chunkEnd = chunkStart + MAX_CHUNK_WINDOW - 1;

const url = `${serviceUrl}/api/v1/documents/${encodeURIComponent(fileId)}/content?return_chunks=true&chunk_start=${chunkStart}&chunk_end=${chunkEnd}`;

const response = await fetch(url, {

signal: AbortSignal.timeout(30_000),

});

if (!response.ok) {

const errorText = await response.text().catch(() => '');

throw new Error(

`RAG get_chunks error (${response.status}): ${errorText || 'Unknown error'}`,

);

}

const result = await fetchJson<DocumentContentResponse>(response);

documentId = result.file_id;

title = result.title;

totalChunks = result.total_chunks;

if (result.chunks) {

allChunks.push(...result.chunks);

}

if (result.chunk_range.end < chunkStart) {

throw new Error(

`RAG get_chunks returned a non-advancing chunk range for file ${fileId}`,

);

}

if (result.chunk_range.end >= totalChunks) {

break;

}

chunkStart = result.chunk_range.end + 1;

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/agent_tools/rag/helpers/fetch_document_chunks.ts` around lines 32 - 58, The pagination loop using chunkStart/chunk_range.end can hang if the service returns a stale or malformed range; add a safety bound and a fallback advance: introduce a max iteration cap (e.g. MAX_ITERATIONS) and an iteration counter in the while(true) loop to throw a clear error once exceeded, and after receiving result check if result.chunk_range.end <= chunkStart — if so, advance chunkStart by MAX_CHUNK_WINDOW as a fallback (or throw if you prefer strictness) to ensure progress; also wrap fetch with an AbortController and a short timeout so the fetch call itself cannot hang indefinitely; update references to chunkStart, chunkEnd, MAX_CHUNK_WINDOW, fetch and fetchJson accordingly.

coderabbitai · 2026-04-11T03:11:06Z

+  z.object({
+    operation: z.literal('retrieve'),
+    fileId: z
+      .string()
+      .describe(
+        'File ID of the document to retrieve full content from (e.g., "kg2bazp7fbgt9srq63knfagjrd7yfenj")',
+      ),
+  }),


⚠️ Potential issue | 🔴 Critical

Enforce knowledge-scope authorization before retrieve.

retrieve accepts any fileId and calls the RAG service directly. Unlike search, it never goes through resolveFileIds, so anyone who can supply a foreign fileId can bypass the agent’s knowledge boundary and fetch full document text.

Suggested fix

if (args.operation === 'retrieve') { + const allowedFileIds = new Set(await resolveFileIds(ctx)); + if (!allowedFileIds.has(args.fileId)) { + throw new Error('This file is not available to the current agent.'); + } + debugLog('tool:rag_search retrieve start', { fileId: args.fileId }); const ragServiceUrl = getRagConfig().serviceUrl; const result = await fetchDocumentChunks(ragServiceUrl, args.fileId);

Also applies to: 173-195

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/agent_tools/rag/rag_search_tool.ts` around lines 117 - 124, The "retrieve" operation currently accepts any fileId and directly calls the RAG service, letting callers bypass knowledge-scope checks; to fix, apply the same authorization/ID resolution used by the "search" flow by calling resolveFileIds (or the shared authorization function used by search) before performing the retrieve, validate that the returned resolved IDs include the requested fileId (or replace the incoming id with the resolved one), and reject the request if resolution/authorization fails; update both places where "retrieve" is handled (the z.object operation with literal 'retrieve' and the duplicate handler around lines 173-195) to enforce this check and reuse the same resolveFileIds/authorization logic.

coderabbitai · 2026-04-11T03:11:06Z

+ */
+export const uploadFileToRag = internalAction({
+  args: {
+    storageId: v.string(),


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Verify current validator shape in the new action rg -n "uploadFileToRag|storageId:\s*v\.(string|id$'_storage'$)" --type ts services/platform/convex/file_metadata/internal_actions.ts -C 3 # Compare with repo conventions for storage ID boundaries rg -n "storageId:\s*v\.id$'_storage'$" --type ts services/platform/convex -C 2

Repository: tale-project/tale

Length of output: 4994

Use v.id('_storage') for storageId at this internal action boundary.

storageId is validated as v.string() at line 18, but all other endpoints in the file_metadata module (mutations, queries, internal_queries) and internal actions across the codebase consistently use v.id('_storage'). The schema definition also expects v.id('_storage'). Changing to v.id('_storage') maintains type safety and consistency with the rest of the codebase.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/file_metadata/internal_actions.ts` at line 18, The internal action schema currently validates storageId with v.string() but should use the project-specific ID validator v.id('_storage') to match the file_metadata module and schema; update the storageId field in internal_actions.ts from v.string() to v.id('_storage') (ensuring any references to storageId in the same internal action signature continue to use the ID type) so it matches other endpoints, internal_queries/mutations, and the expected schema.

coderabbitai · 2026-04-11T03:11:06Z

+    await ctx.scheduler.runAfter(
+      0,
+      internal.file_metadata.internal_actions.uploadFileToRag,
+      {
+        storageId: args.storageId,
+        fileName: args.fileName,
+        contentType: args.contentType,
+      },
+    );


⚠️ Potential issue | 🟠 Major

Do not let background RAG scheduling failure fail metadata creation.

At Line 58, await ctx.scheduler.runAfter(...) is on the main mutation path with no isolation. If scheduling fails, the whole mutation can fail, which blocks file metadata creation for a non-critical async step.

Suggested fix

- await ctx.scheduler.runAfter( - 0, - internal.file_metadata.internal_actions.uploadFileToRag, - { - storageId: args.storageId, - fileName: args.fileName, - contentType: args.contentType, - }, - ); + try { + await ctx.scheduler.runAfter( + 0, + internal.file_metadata.internal_actions.uploadFileToRag, + { + storageId: args.storageId, + fileName: args.fileName, + contentType: args.contentType, + }, + ); + } catch (error) { + console.error('[saveFileMetadata] Failed to schedule uploadFileToRag', { + storageId: args.storageId, + error: error instanceof Error ? error.message : String(error), + }); + }

Based on learnings: “deleteDocumentFromRag is scheduled via ctx.scheduler.runAfter(...) as a fire-and-forget cleanup... failures are intentionally logged ... rather than propagated.”

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

await ctx.scheduler.runAfter(

0,

internal.file_metadata.internal_actions.uploadFileToRag,

{

storageId: args.storageId,

fileName: args.fileName,

contentType: args.contentType,

},

);

try {

await ctx.scheduler.runAfter(

0,

internal.file_metadata.internal_actions.uploadFileToRag,

{

storageId: args.storageId,

fileName: args.fileName,

contentType: args.contentType,

},

);

} catch (error) {

console.error('[saveFileMetadata] Failed to schedule uploadFileToRag', {

storageId: args.storageId,

error: error instanceof Error ? error.message : String(error),

});

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/file_metadata/mutations.ts` around lines 58 - 66, The call to ctx.scheduler.runAfter(internal.file_metadata.internal_actions.uploadFileToRag, ...) is awaited on the main mutation path so scheduler failures can abort metadata creation; change this to fire-and-forget by removing the await and ensuring any returned promise errors are caught and logged (e.g., call ctx.scheduler.runAfter(...).catch(...) or wrap in try/catch and log) so scheduling failures do not propagate and block the mutation.

- Remove unused `isTextFile` and `getParseEndpoint` exports from file-types.ts (Knip) - Mock `get_document_statuses` in test_passes_file_ids to handle retry logic (RAG test)

Chat file uploads were fire-and-forget — users could send messages before files finished indexing, causing the agent to fail reading file content. - Add ragStatus/ragError fields to fileMetadata schema - Track indexing lifecycle: queued → running → completed/failed - Poll RAG service with fast intervals (5-15s) for chat UX - Block send button while any attachment is still indexing - Show per-file indexing spinner/error in chat input - Raise chat attachment size limit from 10MB to 100MB

The previous 4-minute timeout was too aggressive for large PDFs with images, causing premature "Index failed" status while RAG was still processing. Now uses progressive backoff: 5s→10s→30s→60s intervals over 60 attempts (~31 minutes total).

Track extraction progress through the full pipeline: - RAG DB: add progress_phase/progress_detail columns to documents table - PDF extraction: on_progress callback reports page-level progress (throttled to ~3s DB writes to avoid overhead) - RAG status endpoint: returns progress_phase + progress_detail - Convex polling: picks up progress and stores as ragProgress on fileMetadata - Chat UI: shows live progress (e.g. "extracting 12/50") instead of generic "Indexing..." Progress phases: extracting → embedding → storing

A 349-chunk document (590KB) was exceeding the model's 163K token limit. Now retrieve truncates at 50K chars (~15K tokens) and advises the agent to use search instead. Also stops fetching chunks once 60K chars are accumulated to avoid unnecessary network calls.

…per call Instead of fetching the entire document and truncating, retrieve now uses the RAG API's native pagination. Default page size is 10 chunks. Agent can use chunkStart/chunkEnd params to paginate through large documents, or use search for targeted lookups. Returns chunkRange, totalChunks, and hasMore for agent pagination.

…text getFailedMessageErrors now reads errors from messageMetadata (primary) in addition to MessageDoc.error (fallback). This fixes the mismatch where UIMessage.id (firstMessage._id in a group) didn't match the failed MessageDoc._id in multi-step agent responses, causing the error to be undefined in the frontend.

…generic text" This reverts commit d2474b2.

…sponses UIMessage.id is the first message in a group, but the error lives on the last (failed) message which has a different _id. When the direct lookup fails, fall back to any error in the messageErrors map for the thread. This ensures the real error (e.g. "context length exceeded") is shown instead of the generic "Something went wrong".

…to runtime Embedding dimensions are determined dynamically by the RAG service, so the HNSW index must be created at runtime (matching the chunks table pattern).

…responses In error scenarios, metadata is saved with the failed message's ID but UIMessage.id is the first message in the group. When direct lookup by messageId fails, fall back to querying by threadId (most recent entry). This fixes "Token usage and model information are not available" in the Message Information dialog for failed responses.

…hours - Convert raw "42/108" progress to "39%" in chat file pill for clarity - Extend polling from 31 min to ~2.5 hours (120 attempts) to match RAG service's 3-hour ingestion timeout for very large files

…rce cards Replace server-side scheduled polling with a client-driven action that batches status checks for pending files, eliminating wasted server resources when users leave the page. Rework source cards to group citations by source file, display chunk content inline with a detail dialog, and show relevance/page metadata.

…tions and tool descriptions Rework citation parsing to correctly handle rag_search retrieve results, offset citation numbers across successive tool calls to prevent collisions, and improve deduplication to distinguish different chunks from the same file. Move "read file" guidance to the top of each file tool description for better agent visibility. Return fileId and filename from retrieve operations to enable proper source card attribution.

Guide the agent to pass user-provided file IDs (from chat uploads) via the fileIds parameter first, falling back to broader search when needed. Clarify that list_indexed only covers Document Hub files.

- Add non-null assertion for threadId in message_metadata query (already guarded by if-check) - Add progress_phase/progress_detail fields to test mock data matching updated SQL query - Update vision_client test to accept additional on_progress kwarg

- Use destructuring instead of non-null assertion for threadId narrowing - Add ragStatus: 'queued' to file metadata insert test expectations

…uter The platform's generateDocument helper builds URLs like /api/v1/pptx/from-html but the crawler only had the JSON-based POST /api/v1/pptx endpoint, causing 404s. Added the missing endpoints following the same pattern as the DOCX router.

Instead of calling the crawler service to generate .pptx files, the pptx tool now accepts a complete HTML document from the LLM and stores it directly. This gives the AI full control over styling, layout, and animations (using reveal.js or any framework via CDN). - Add storeRawContent internal action for direct string-to-storage uploads - Remove sourceType param; tool now takes a single html param - Add explicit "no templates" instruction to prevent template hallucination

greptile-apps Bot reviewed Apr 11, 2026

View reviewed changes

This comment has been minimized.

Sign in to view

coderabbitai Bot requested changes Apr 11, 2026

View reviewed changes

fix: resolve CI failures — remove unused exports and fix RAG test mock

4139717

- Remove unused `isTextFile` and `getParseEndpoint` exports from file-types.ts (Knip) - Mock `get_document_statuses` in test_passes_file_ids to handle retry logic (RAG test)

larryro force-pushed the feat/file-tools-optimization branch from d5261dd to 4139717 Compare April 11, 2026 03:13

larryro added 3 commits April 11, 2026 11:43

This comment has been minimized.

Sign in to view

larryro added 16 commits April 11, 2026 12:24

Revert "fix(platform): show actual error messages in chat instead of …

1b548a9

…generic text" This reverts commit d2474b2.

fix(db): make semantic_cache embedding nullable and defer HNSW index …

7efff3d

…to runtime Embedding dimensions are determined dynamically by the RAG service, so the HNSW index must be created at runtime (matching the chunks table pattern).

fix(platform): show progress as percentage and extend polling to 2.5 …

606f042

…hours - Convert raw "42/108" progress to "39%" in chat file pill for clarity - Extend polling from 31 min to ~2.5 hours (120 attempts) to match RAG service's 3-hour ingestion timeout for very large files

fix(platform): improve RAG tool instructions for file ID prioritization

c9da9dd

Guide the agent to pass user-provided file IDs (from chat uploads) via the fileIds parameter first, falling back to broader search when needed. Clarify that list_indexed only covers Document Hub files.

fix: resolve lint and test failures

1894343

- Use destructuring instead of non-null assertion for threadId narrowing - Add ragStatus: 'queued' to file metadata insert test expectations

fix: add ragStatus to internal_mutations test expectations

5b3f223

larryro merged commit 0ccc373 into main Apr 11, 2026
24 checks passed

larryro deleted the feat/file-tools-optimization branch April 11, 2026 06:09

coderabbitai Bot mentioned this pull request Apr 11, 2026

feat(platform): surface OCR visibility with scanned page detection #1401

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(platform): simplify file tools — auto RAG indexing, remove parse/template ops#1407

feat(platform): simplify file tools — auto RAG indexing, remove parse/template ops#1407
larryro merged 21 commits into
mainfrom
feat/file-tools-optimization

larryro commented Apr 11, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

greptile-apps Bot left a comment

Uh oh!

This comment has been minimized.

coderabbitai Bot commented Apr 11, 2026

Walkthrough

Estimated code review effort

Possibly related PRs

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Uh oh!

coderabbitai Bot Apr 11, 2026

Uh oh!

coderabbitai Bot Apr 11, 2026

Uh oh!

coderabbitai Bot Apr 11, 2026

Uh oh!

coderabbitai Bot Apr 11, 2026

Uh oh!

This comment has been minimized.

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

larryro commented Apr 11, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

This comment has been minimized.

coderabbitai Bot commented Apr 11, 2026

Walkthrough

Estimated code review effort

Possibly related PRs

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

This comment has been minimized.

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

larryro commented Apr 11, 2026 •

edited by coderabbitai Bot

Loading