Skip to content

feat(platform): simplify file tools — auto RAG indexing, remove parse/template ops#1407

Merged
larryro merged 21 commits into
mainfrom
feat/file-tools-optimization
Apr 11, 2026
Merged

feat(platform): simplify file tools — auto RAG indexing, remove parse/template ops#1407
larryro merged 21 commits into
mainfrom
feat/file-tools-optimization

Conversation

@larryro

@larryro larryro commented Apr 11, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Auto RAG indexing: trigger RAG indexing on every new fileMetadata insert (via saveFileMetadata), covering all upload paths (agent chat, document hub, webhook, OneDrive, etc.)
  • RAG search retry: when search returns no results and some files are still indexing, wait 3s and retry once (zero overhead in normal case)
  • Add retrieve operation to rag_search tool for full document content retrieval, replacing parse
  • Remove all parse operations from file tools (pdf, docx, pptx, text, excel) — file content now accessed via rag_search retrieve
  • Remove all template operations from docx/pptx tools — list_templates and template-based generation removed
  • Rewrite pptx tool to use markdown/HTML generation (consistent with pdf/docx)
  • Update file tool descriptions to guide AI toward rag_search retrieve for reading files
  • Clean up dead code: template generators, parse helpers, unused validators (-1864 lines)

Test plan

  • Upload a file in agent chat → verify fileMetadata is created and RAG indexing is triggered asynchronously
  • Upload a file in document hub → verify RAG indexing is triggered
  • Use rag_search with operation='retrieve' to get full content of an indexed file
  • Use rag_search with operation='search' on a just-uploaded file → verify retry logic works
  • Generate PDF/DOCX/PPTX from markdown → verify generation still works
  • Verify no parse or template operations are available in file tools
  • Run npm run lint and npx tsc --noEmit — both pass

Summary by CodeRabbit

  • New Features

    • Added document retrieval operation to search functionality for fetching full file contents.
    • Automatic file indexing for improved searchability.
  • Bug Fixes

    • Improved search reliability with retry logic for recently uploaded documents.
  • Refactor

    • Removed file parsing operations from document tools; use search functionality with retrieval instead.
    • Removed template-based document generation; now supports content-based generation only.

…/template ops

- Auto-trigger RAG indexing when files are uploaded (no more manual trigger)
- Add `retrieve` operation to rag_search tool for full document content access
- Remove `parse` operations from all file tools (pdf, docx, pptx, text, excel)
- Remove `list_templates` and template-based generation from docx/pptx tools
- Rewrite pptx tool to use markdown/HTML generation (consistent with pdf/docx)
- Add retry-on-empty logic in RAG search service for files still being indexed
- Update file tool descriptions to guide AI toward rag_search retrieve
- Clean up dead code: template generators, parse helpers, unused validators

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@blacksmith-sh

This comment has been minimized.

@coderabbitai

coderabbitai Bot commented Apr 11, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

This PR removes file parsing capabilities from individual file tools (DOCX, Excel, PDF, PPTX, Text) and their supporting infrastructure, including helpers for parsing files (parseFile), analyzing text (analyzeTextContent), and resolving file names. In parallel, it introduces a unified document retrieval mechanism via a new fetchDocumentChunks helper that retrieves document content from a RAG service endpoint. File tools are simplified to generation-only operations, and the document generation pipeline is consolidated through a unified generateDocument action supporting PPTX output. Additionally, file metadata creation now triggers background RAG upload scheduling, and attachment processing is refactored to direct file reading to the RAG search tool's new retrieve operation instead of in-agent parsing.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main changes: auto RAG indexing, removal of parse operations, and removal of template operations from file tools.
Docstring Coverage ✅ Passed Docstring coverage is 83.33% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/file-tools-optimization

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
services/platform/convex/lib/attachments/process_attachments.ts (2)

231-247: ⚠️ Potential issue | 🟠 Major

Keep failed images out of the rag_search.retrieve bucket.

unprocessedAttachments now mixes failedImages with text-bearing files, then tells the agent to use rag_search.retrieve for all of them. That fallback works for documents/spreadsheets, but it sends image failures to a tool path that cannot recover image content.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/lib/attachments/process_attachments.ts` around lines
231 - 247, The assembled unprocessedAttachments array incorrectly mixes
failedImages with text-bearing files then instructs the agent to use
rag_search.retrieve for all; update the logic around unprocessedAttachments (the
variable built from fileAttachments, failedImages, failedSpreadsheets) so
failedImages are excluded from the rag_search.retrieve fallback and are handled
separately: build the contentParts block with only fileAttachments and
failedSpreadsheets when adding the “[ATTACHED FILES - Use rag_search tool with
operation="retrieve"…]” message and loop, and for failedImages push a distinct
text entry that does not reference rag_search.retrieve (e.g., note that the
image failed and requires re-upload or a different image-recovery tool) so the
agent isn’t directed to use rag_search.retrieve for images.

258-263: ⚠️ Potential issue | 🟠 Major

Don’t return empty metadata arrays after doing the work.

The function still advertises imageInfoList and textFileInfoList, and it already has the attachment data needed to populate them, but the return value hardcodes both to []. That silently breaks any downstream consumer still relying on these fields for follow-up tool calls or attachment rendering.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/lib/attachments/process_attachments.ts` around lines
258 - 263, The return currently hardcodes imageInfoList and textFileInfoList to
empty arrays while the function (process_attachments.ts) already collects
attachment data; replace those empty arrays with the actual lists built earlier
in the function (use the image/text attachment variables populated during
processing) so returned parsedDocuments, imageInfoList, and textFileInfoList
reflect the real attachment metadata; keep promptContent as-is and ensure the
returned objects use the existing symbols/variables that hold the image and text
file metadata instead of [].
services/platform/convex/agent_tools/files/pdf_tool.ts (1)

30-66: ⚠️ Potential issue | 🟠 Major

Update the example file assistant prompt in the same PR.

This tool now explicitly tells models to use rag_search for reading, but examples/agents/file-assistant.json still advertises “Parse existing PDFs” and lists pdf as a parsing tool. That example assistant will keep emitting removed operations.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/agent_tools/files/pdf_tool.ts` around lines 30 - 66,
The example assistant prompt in the repo still advertises using the pdf parsing
tool and the “Parse existing PDFs” capability while pdf_tool.ts now instructs
models to use rag_search for reading files and to only call the "generate"
operation when explicitly requested; update the example assistant
(examples/agents/file-assistant.json) to remove references to the pdf parsing
operation and the “Parse existing PDFs” wording, replace guidance to read file
contents with using rag_search (operation='retrieve'/'search'), and ensure any
example interactions no longer show calling pdf tool for parsing/downloading
PDFs except when demonstrating the "generate" operation.
services/platform/convex/agent_tools/files/docx_tool.ts (1)

63-90: ⚠️ Potential issue | 🟠 Major

Make the two generate modes mutually exclusive.

The schema allows sections, sourceType, and content together, but execution silently chooses content mode and ignores the structured sections/title/subtitle. For model-generated tool calls, that ambiguity will produce the wrong document without any validation error.

Based on learnings, "In any file under services/platform/convex/agent_tools ... prefer runtime validation of identifiers like tableName coming from LLM/user input over relying on TypeScript type restrictions alone. Implement validation at call boundaries that returns descriptive, user-friendly errors."

Also applies to: 145-196

services/platform/convex/agent_tools/rag/rag_search_tool.ts (1)

230-269: ⚠️ Potential issue | 🟠 Major

Implement the zero-result retry that this PR promises.

This branch still does one /search call and immediately returns “No relevant results found” on an empty result set. Newly uploaded files that are still indexing never get the documented 3-second retry, so the main user-facing behavior in this PR is still missing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/agent_tools/rag/rag_search_tool.ts` around lines 230
- 269, The code currently does one fetch/search and immediately returns "No
relevant results found" when formatSearchResults(result.results) is falsy;
implement a retry loop around the fetch + fetchJson<SearchResponse> call that,
when result.total_results === 0 or formatSearchResults(...) returns falsy, waits
and retries for up to the documented 3 seconds (e.g., 3 attempts with ~1s delay
or a single 3s wait depending on project convention) before returning the
no-results response. Update the flow around the existing fetch(url, ...) and
fetchJson<SearchResponse> usage so that debugLog, result usage, and the final
returned object are produced after the retries (or after the final attempt),
preserving the same fields (result.total_results, processing_time_ms, usage,
model) and only returning success:false or the no-results message after all
retries are exhausted.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@services/platform/convex/agent_tools/files/excel_tool.ts`:
- Around line 34-49: The sheet schema currently allows rows with arbitrary
column counts; add a runtime check on the sheet object (the z.object that
contains name, headers, rows inside the sheets array) that validates every row
length equals headers.length. Implement this by using zod's refine or
superRefine on that sheet object (reference the sheets schema's sheet object),
iterate rows and if any row length differs from headers.length call ctx.addIssue
or return false with a clear message (include sheet.name in the error). This
ensures rows are validated against the header count at the schema boundary
rather than relying on TypeScript types.

In `@services/platform/convex/agent_tools/rag/helpers/fetch_document_chunks.ts`:
- Around line 32-58: The pagination loop using chunkStart/chunk_range.end can
hang if the service returns a stale or malformed range; add a safety bound and a
fallback advance: introduce a max iteration cap (e.g. MAX_ITERATIONS) and an
iteration counter in the while(true) loop to throw a clear error once exceeded,
and after receiving result check if result.chunk_range.end <= chunkStart — if
so, advance chunkStart by MAX_CHUNK_WINDOW as a fallback (or throw if you prefer
strictness) to ensure progress; also wrap fetch with an AbortController and a
short timeout so the fetch call itself cannot hang indefinitely; update
references to chunkStart, chunkEnd, MAX_CHUNK_WINDOW, fetch and fetchJson
accordingly.

In `@services/platform/convex/agent_tools/rag/rag_search_tool.ts`:
- Around line 117-124: The "retrieve" operation currently accepts any fileId and
directly calls the RAG service, letting callers bypass knowledge-scope checks;
to fix, apply the same authorization/ID resolution used by the "search" flow by
calling resolveFileIds (or the shared authorization function used by search)
before performing the retrieve, validate that the returned resolved IDs include
the requested fileId (or replace the incoming id with the resolved one), and
reject the request if resolution/authorization fails; update both places where
"retrieve" is handled (the z.object operation with literal 'retrieve' and the
duplicate handler around lines 173-195) to enforce this check and reuse the same
resolveFileIds/authorization logic.

In `@services/platform/convex/file_metadata/internal_actions.ts`:
- Line 18: The internal action schema currently validates storageId with
v.string() but should use the project-specific ID validator v.id('_storage') to
match the file_metadata module and schema; update the storageId field in
internal_actions.ts from v.string() to v.id('_storage') (ensuring any references
to storageId in the same internal action signature continue to use the ID type)
so it matches other endpoints, internal_queries/mutations, and the expected
schema.

In `@services/platform/convex/file_metadata/mutations.ts`:
- Around line 58-66: The call to
ctx.scheduler.runAfter(internal.file_metadata.internal_actions.uploadFileToRag,
...) is awaited on the main mutation path so scheduler failures can abort
metadata creation; change this to fire-and-forget by removing the await and
ensuring any returned promise errors are caught and logged (e.g., call
ctx.scheduler.runAfter(...).catch(...) or wrap in try/catch and log) so
scheduling failures do not propagate and block the mutation.

---

Outside diff comments:
In `@services/platform/convex/agent_tools/files/pdf_tool.ts`:
- Around line 30-66: The example assistant prompt in the repo still advertises
using the pdf parsing tool and the “Parse existing PDFs” capability while
pdf_tool.ts now instructs models to use rag_search for reading files and to only
call the "generate" operation when explicitly requested; update the example
assistant (examples/agents/file-assistant.json) to remove references to the pdf
parsing operation and the “Parse existing PDFs” wording, replace guidance to
read file contents with using rag_search (operation='retrieve'/'search'), and
ensure any example interactions no longer show calling pdf tool for
parsing/downloading PDFs except when demonstrating the "generate" operation.

In `@services/platform/convex/agent_tools/rag/rag_search_tool.ts`:
- Around line 230-269: The code currently does one fetch/search and immediately
returns "No relevant results found" when formatSearchResults(result.results) is
falsy; implement a retry loop around the fetch + fetchJson<SearchResponse> call
that, when result.total_results === 0 or formatSearchResults(...) returns falsy,
waits and retries for up to the documented 3 seconds (e.g., 3 attempts with ~1s
delay or a single 3s wait depending on project convention) before returning the
no-results response. Update the flow around the existing fetch(url, ...) and
fetchJson<SearchResponse> usage so that debugLog, result usage, and the final
returned object are produced after the retries (or after the final attempt),
preserving the same fields (result.total_results, processing_time_ms, usage,
model) and only returning success:false or the no-results message after all
retries are exhausted.

In `@services/platform/convex/lib/attachments/process_attachments.ts`:
- Around line 231-247: The assembled unprocessedAttachments array incorrectly
mixes failedImages with text-bearing files then instructs the agent to use
rag_search.retrieve for all; update the logic around unprocessedAttachments (the
variable built from fileAttachments, failedImages, failedSpreadsheets) so
failedImages are excluded from the rag_search.retrieve fallback and are handled
separately: build the contentParts block with only fileAttachments and
failedSpreadsheets when adding the “[ATTACHED FILES - Use rag_search tool with
operation="retrieve"…]” message and loop, and for failedImages push a distinct
text entry that does not reference rag_search.retrieve (e.g., note that the
image failed and requires re-upload or a different image-recovery tool) so the
agent isn’t directed to use rag_search.retrieve for images.
- Around line 258-263: The return currently hardcodes imageInfoList and
textFileInfoList to empty arrays while the function (process_attachments.ts)
already collects attachment data; replace those empty arrays with the actual
lists built earlier in the function (use the image/text attachment variables
populated during processing) so returned parsedDocuments, imageInfoList, and
textFileInfoList reflect the real attachment metadata; keep promptContent as-is
and ensure the returned objects use the existing symbols/variables that hold the
image and text file metadata instead of [].
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 88b0a878-ab99-4f1b-a0a8-37c173073a4f

📥 Commits

Reviewing files that changed from the base of the PR and between 1d6fbf9 and c54cd12.

⛔ Files ignored due to path filters (1)
  • services/platform/convex/_generated/api.d.ts is excluded by !**/_generated/**
📒 Files selected for processing (27)
  • services/platform/convex/agent_tools/files/docx_tool.ts
  • services/platform/convex/agent_tools/files/excel_tool.ts
  • services/platform/convex/agent_tools/files/helpers/analyze_text.ts
  • services/platform/convex/agent_tools/files/helpers/parse_file.ts
  • services/platform/convex/agent_tools/files/helpers/resolve_file_name.ts
  • services/platform/convex/agent_tools/files/internal_actions.ts
  • services/platform/convex/agent_tools/files/pdf_tool.ts
  • services/platform/convex/agent_tools/files/pptx_tool.ts
  • services/platform/convex/agent_tools/files/text_tool.ts
  • services/platform/convex/agent_tools/rag/helpers/fetch_document_chunks.ts
  • services/platform/convex/agent_tools/rag/rag_search_tool.ts
  • services/platform/convex/documents/generate_document_helpers.ts
  • services/platform/convex/documents/generate_docx_from_template.ts
  • services/platform/convex/documents/generate_pptx.ts
  • services/platform/convex/documents/helpers.ts
  • services/platform/convex/documents/internal_actions.ts
  • services/platform/convex/documents/internal_queries.ts
  • services/platform/convex/documents/list_documents_by_extension.ts
  • services/platform/convex/documents/types.ts
  • services/platform/convex/documents/validators.ts
  • services/platform/convex/file_metadata/internal_actions.ts
  • services/platform/convex/file_metadata/internal_mutations.ts
  • services/platform/convex/file_metadata/mutations.ts
  • services/platform/convex/lib/action_cache/index.ts
  • services/platform/convex/lib/attachments/process_attachments.ts
  • services/platform/convex/workflow_engine/action_defs/rag/rag_action.ts
  • services/rag/app/services/rag_service.py
💤 Files with no reviewable changes (11)
  • services/platform/convex/agent_tools/files/internal_actions.ts
  • services/platform/convex/documents/validators.ts
  • services/platform/convex/lib/action_cache/index.ts
  • services/platform/convex/documents/internal_queries.ts
  • services/platform/convex/documents/helpers.ts
  • services/platform/convex/agent_tools/files/helpers/resolve_file_name.ts
  • services/platform/convex/documents/list_documents_by_extension.ts
  • services/platform/convex/documents/generate_docx_from_template.ts
  • services/platform/convex/documents/generate_pptx.ts
  • services/platform/convex/agent_tools/files/helpers/parse_file.ts
  • services/platform/convex/agent_tools/files/helpers/analyze_text.ts

Comment on lines +34 to +49
sheets: z
.array(
z.object({
name: z.string().describe('Sheet name'),
headers: z
.array(z.string())
.nonempty()
.describe(
"Column headers for the sheet (must align with each row's columns)",
),
rows: z
.array(
z.array(z.union([z.string(), z.number(), z.boolean(), z.null()])),
)
.describe('2D array of cell values (rows x columns)'),
}),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Validate row width against header count at schema boundary.

At Line 34, the schema allows rows with any column count. This can produce malformed sheet data relative to headers.

Suggested fix
 const excelArgs = z.object({
   operation: z.literal('generate'),
   fileName: z
     .string()
     .describe('Base name for the Excel file (without extension)'),
   sheets: z
     .array(
       z.object({
         name: z.string().describe('Sheet name'),
         headers: z
           .array(z.string())
           .nonempty()
           .describe(
             "Column headers for the sheet (must align with each row's columns)",
           ),
         rows: z
           .array(
             z.array(z.union([z.string(), z.number(), z.boolean(), z.null()])),
           )
           .describe('2D array of cell values (rows x columns)'),
-      }),
+      }).superRefine((sheet, ctx) => {
+        const expected = sheet.headers.length;
+        sheet.rows.forEach((row, index) => {
+          if (row.length !== expected) {
+            ctx.addIssue({
+              code: 'custom',
+              path: ['rows', index],
+              message: `Row ${index + 1} has ${row.length} columns; expected ${expected}`,
+            });
+          }
+        });
+      }),
     )
     .describe('Sheets to include in the Excel file'),
 });

Based on learnings: “In any file under services/platform/convex/agent_tools ... prefer runtime validation ... coming from LLM/user input over relying on TypeScript type restrictions alone.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/agent_tools/files/excel_tool.ts` around lines 34 -
49, The sheet schema currently allows rows with arbitrary column counts; add a
runtime check on the sheet object (the z.object that contains name, headers,
rows inside the sheets array) that validates every row length equals
headers.length. Implement this by using zod's refine or superRefine on that
sheet object (reference the sheets schema's sheet object), iterate rows and if
any row length differs from headers.length call ctx.addIssue or return false
with a clear message (include sheet.name in the error). This ensures rows are
validated against the header count at the schema boundary rather than relying on
TypeScript types.

Comment on lines +32 to +58
while (true) {
const chunkEnd = chunkStart + MAX_CHUNK_WINDOW - 1;
const url = `${serviceUrl}/api/v1/documents/${encodeURIComponent(fileId)}/content?return_chunks=true&chunk_start=${chunkStart}&chunk_end=${chunkEnd}`;

const response = await fetch(url);

if (!response.ok) {
const errorText = await response.text().catch(() => '');
throw new Error(
`RAG get_chunks error (${response.status}): ${errorText || 'Unknown error'}`,
);
}

const result = await fetchJson<DocumentContentResponse>(response);
documentId = result.file_id;
title = result.title;
totalChunks = result.total_chunks;

if (result.chunks) {
allChunks.push(...result.chunks);
}

if (result.chunk_range.end >= totalChunks) {
break;
}

chunkStart = result.chunk_range.end + 1;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Bound the pagination loop before using this in tools/actions.

This loop assumes the remote chunk_range.end always advances. If the RAG service returns a stale or malformed range, chunkStart never moves and the caller spins forever; paired with the timeout-less fetch, one bad response can pin the tool/action until platform timeout.

Suggested fix
   while (true) {
     const chunkEnd = chunkStart + MAX_CHUNK_WINDOW - 1;
     const url = `${serviceUrl}/api/v1/documents/${encodeURIComponent(fileId)}/content?return_chunks=true&chunk_start=${chunkStart}&chunk_end=${chunkEnd}`;

-    const response = await fetch(url);
+    const response = await fetch(url, {
+      signal: AbortSignal.timeout(30_000),
+    });

     if (!response.ok) {
       const errorText = await response.text().catch(() => '');
       throw new Error(
         `RAG get_chunks error (${response.status}): ${errorText || 'Unknown error'}`,
@@
-    if (result.chunk_range.end >= totalChunks) {
+    if (result.chunk_range.end < chunkStart) {
+      throw new Error(
+        `RAG get_chunks returned a non-advancing chunk range for file ${fileId}`,
+      );
+    }
+
+    if (result.chunk_range.end >= totalChunks) {
       break;
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
while (true) {
const chunkEnd = chunkStart + MAX_CHUNK_WINDOW - 1;
const url = `${serviceUrl}/api/v1/documents/${encodeURIComponent(fileId)}/content?return_chunks=true&chunk_start=${chunkStart}&chunk_end=${chunkEnd}`;
const response = await fetch(url);
if (!response.ok) {
const errorText = await response.text().catch(() => '');
throw new Error(
`RAG get_chunks error (${response.status}): ${errorText || 'Unknown error'}`,
);
}
const result = await fetchJson<DocumentContentResponse>(response);
documentId = result.file_id;
title = result.title;
totalChunks = result.total_chunks;
if (result.chunks) {
allChunks.push(...result.chunks);
}
if (result.chunk_range.end >= totalChunks) {
break;
}
chunkStart = result.chunk_range.end + 1;
while (true) {
const chunkEnd = chunkStart + MAX_CHUNK_WINDOW - 1;
const url = `${serviceUrl}/api/v1/documents/${encodeURIComponent(fileId)}/content?return_chunks=true&chunk_start=${chunkStart}&chunk_end=${chunkEnd}`;
const response = await fetch(url, {
signal: AbortSignal.timeout(30_000),
});
if (!response.ok) {
const errorText = await response.text().catch(() => '');
throw new Error(
`RAG get_chunks error (${response.status}): ${errorText || 'Unknown error'}`,
);
}
const result = await fetchJson<DocumentContentResponse>(response);
documentId = result.file_id;
title = result.title;
totalChunks = result.total_chunks;
if (result.chunks) {
allChunks.push(...result.chunks);
}
if (result.chunk_range.end < chunkStart) {
throw new Error(
`RAG get_chunks returned a non-advancing chunk range for file ${fileId}`,
);
}
if (result.chunk_range.end >= totalChunks) {
break;
}
chunkStart = result.chunk_range.end + 1;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/agent_tools/rag/helpers/fetch_document_chunks.ts`
around lines 32 - 58, The pagination loop using chunkStart/chunk_range.end can
hang if the service returns a stale or malformed range; add a safety bound and a
fallback advance: introduce a max iteration cap (e.g. MAX_ITERATIONS) and an
iteration counter in the while(true) loop to throw a clear error once exceeded,
and after receiving result check if result.chunk_range.end <= chunkStart — if
so, advance chunkStart by MAX_CHUNK_WINDOW as a fallback (or throw if you prefer
strictness) to ensure progress; also wrap fetch with an AbortController and a
short timeout so the fetch call itself cannot hang indefinitely; update
references to chunkStart, chunkEnd, MAX_CHUNK_WINDOW, fetch and fetchJson
accordingly.

Comment on lines +117 to +124
z.object({
operation: z.literal('retrieve'),
fileId: z
.string()
.describe(
'File ID of the document to retrieve full content from (e.g., "kg2bazp7fbgt9srq63knfagjrd7yfenj")',
),
}),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Enforce knowledge-scope authorization before retrieve.

retrieve accepts any fileId and calls the RAG service directly. Unlike search, it never goes through resolveFileIds, so anyone who can supply a foreign fileId can bypass the agent’s knowledge boundary and fetch full document text.

Suggested fix
       if (args.operation === 'retrieve') {
+        const allowedFileIds = new Set(await resolveFileIds(ctx));
+        if (!allowedFileIds.has(args.fileId)) {
+          throw new Error('This file is not available to the current agent.');
+        }
+
         debugLog('tool:rag_search retrieve start', { fileId: args.fileId });

         const ragServiceUrl = getRagConfig().serviceUrl;
         const result = await fetchDocumentChunks(ragServiceUrl, args.fileId);

Also applies to: 173-195

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/agent_tools/rag/rag_search_tool.ts` around lines 117
- 124, The "retrieve" operation currently accepts any fileId and directly calls
the RAG service, letting callers bypass knowledge-scope checks; to fix, apply
the same authorization/ID resolution used by the "search" flow by calling
resolveFileIds (or the shared authorization function used by search) before
performing the retrieve, validate that the returned resolved IDs include the
requested fileId (or replace the incoming id with the resolved one), and reject
the request if resolution/authorization fails; update both places where
"retrieve" is handled (the z.object operation with literal 'retrieve' and the
duplicate handler around lines 173-195) to enforce this check and reuse the same
resolveFileIds/authorization logic.

*/
export const uploadFileToRag = internalAction({
args: {
storageId: v.string(),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify current validator shape in the new action
rg -n "uploadFileToRag|storageId:\s*v\.(string|id\('_storage'\))" --type ts services/platform/convex/file_metadata/internal_actions.ts -C 3

# Compare with repo conventions for storage ID boundaries
rg -n "storageId:\s*v\.id\('_storage'\)" --type ts services/platform/convex -C 2

Repository: tale-project/tale

Length of output: 4994


Use v.id('_storage') for storageId at this internal action boundary.

storageId is validated as v.string() at line 18, but all other endpoints in the file_metadata module (mutations, queries, internal_queries) and internal actions across the codebase consistently use v.id('_storage'). The schema definition also expects v.id('_storage'). Changing to v.id('_storage') maintains type safety and consistency with the rest of the codebase.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/file_metadata/internal_actions.ts` at line 18, The
internal action schema currently validates storageId with v.string() but should
use the project-specific ID validator v.id('_storage') to match the
file_metadata module and schema; update the storageId field in
internal_actions.ts from v.string() to v.id('_storage') (ensuring any references
to storageId in the same internal action signature continue to use the ID type)
so it matches other endpoints, internal_queries/mutations, and the expected
schema.

Comment on lines +58 to +66
await ctx.scheduler.runAfter(
0,
internal.file_metadata.internal_actions.uploadFileToRag,
{
storageId: args.storageId,
fileName: args.fileName,
contentType: args.contentType,
},
);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Do not let background RAG scheduling failure fail metadata creation.

At Line 58, await ctx.scheduler.runAfter(...) is on the main mutation path with no isolation. If scheduling fails, the whole mutation can fail, which blocks file metadata creation for a non-critical async step.

Suggested fix
-    await ctx.scheduler.runAfter(
-      0,
-      internal.file_metadata.internal_actions.uploadFileToRag,
-      {
-        storageId: args.storageId,
-        fileName: args.fileName,
-        contentType: args.contentType,
-      },
-    );
+    try {
+      await ctx.scheduler.runAfter(
+        0,
+        internal.file_metadata.internal_actions.uploadFileToRag,
+        {
+          storageId: args.storageId,
+          fileName: args.fileName,
+          contentType: args.contentType,
+        },
+      );
+    } catch (error) {
+      console.error('[saveFileMetadata] Failed to schedule uploadFileToRag', {
+        storageId: args.storageId,
+        error: error instanceof Error ? error.message : String(error),
+      });
+    }

Based on learnings: “deleteDocumentFromRag is scheduled via ctx.scheduler.runAfter(...) as a fire-and-forget cleanup... failures are intentionally logged ... rather than propagated.”

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
await ctx.scheduler.runAfter(
0,
internal.file_metadata.internal_actions.uploadFileToRag,
{
storageId: args.storageId,
fileName: args.fileName,
contentType: args.contentType,
},
);
try {
await ctx.scheduler.runAfter(
0,
internal.file_metadata.internal_actions.uploadFileToRag,
{
storageId: args.storageId,
fileName: args.fileName,
contentType: args.contentType,
},
);
} catch (error) {
console.error('[saveFileMetadata] Failed to schedule uploadFileToRag', {
storageId: args.storageId,
error: error instanceof Error ? error.message : String(error),
});
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/file_metadata/mutations.ts` around lines 58 - 66,
The call to
ctx.scheduler.runAfter(internal.file_metadata.internal_actions.uploadFileToRag,
...) is awaited on the main mutation path so scheduler failures can abort
metadata creation; change this to fire-and-forget by removing the await and
ensuring any returned promise errors are caught and logged (e.g., call
ctx.scheduler.runAfter(...).catch(...) or wrap in try/catch and log) so
scheduling failures do not propagate and block the mutation.

- Remove unused `isTextFile` and `getParseEndpoint` exports from file-types.ts (Knip)
- Mock `get_document_statuses` in test_passes_file_ids to handle retry logic (RAG test)
@larryro larryro force-pushed the feat/file-tools-optimization branch from d5261dd to 4139717 Compare April 11, 2026 03:13
larryro added 3 commits April 11, 2026 11:43
Chat file uploads were fire-and-forget — users could send messages before
files finished indexing, causing the agent to fail reading file content.

- Add ragStatus/ragError fields to fileMetadata schema
- Track indexing lifecycle: queued → running → completed/failed
- Poll RAG service with fast intervals (5-15s) for chat UX
- Block send button while any attachment is still indexing
- Show per-file indexing spinner/error in chat input
- Raise chat attachment size limit from 10MB to 100MB
The previous 4-minute timeout was too aggressive for large PDFs with
images, causing premature "Index failed" status while RAG was still
processing. Now uses progressive backoff: 5s→10s→30s→60s intervals
over 60 attempts (~31 minutes total).
Track extraction progress through the full pipeline:
- RAG DB: add progress_phase/progress_detail columns to documents table
- PDF extraction: on_progress callback reports page-level progress
  (throttled to ~3s DB writes to avoid overhead)
- RAG status endpoint: returns progress_phase + progress_detail
- Convex polling: picks up progress and stores as ragProgress on fileMetadata
- Chat UI: shows live progress (e.g. "extracting 12/50") instead of generic "Indexing..."

Progress phases: extracting → embedding → storing
@blacksmith-sh

This comment has been minimized.

larryro added 16 commits April 11, 2026 12:24
A 349-chunk document (590KB) was exceeding the model's 163K token limit.
Now retrieve truncates at 50K chars (~15K tokens) and advises the agent
to use search instead. Also stops fetching chunks once 60K chars are
accumulated to avoid unnecessary network calls.
…per call

Instead of fetching the entire document and truncating, retrieve now
uses the RAG API's native pagination. Default page size is 10 chunks.
Agent can use chunkStart/chunkEnd params to paginate through large
documents, or use search for targeted lookups.

Returns chunkRange, totalChunks, and hasMore for agent pagination.
…text

getFailedMessageErrors now reads errors from messageMetadata (primary)
in addition to MessageDoc.error (fallback). This fixes the mismatch
where UIMessage.id (firstMessage._id in a group) didn't match the
failed MessageDoc._id in multi-step agent responses, causing the
error to be undefined in the frontend.
…sponses

UIMessage.id is the first message in a group, but the error lives on
the last (failed) message which has a different _id. When the direct
lookup fails, fall back to any error in the messageErrors map for the
thread. This ensures the real error (e.g. "context length exceeded")
is shown instead of the generic "Something went wrong".
…to runtime

Embedding dimensions are determined dynamically by the RAG service, so the
HNSW index must be created at runtime (matching the chunks table pattern).
…responses

In error scenarios, metadata is saved with the failed message's ID but
UIMessage.id is the first message in the group. When direct lookup by
messageId fails, fall back to querying by threadId (most recent entry).
This fixes "Token usage and model information are not available" in the
Message Information dialog for failed responses.
…hours

- Convert raw "42/108" progress to "39%" in chat file pill for clarity
- Extend polling from 31 min to ~2.5 hours (120 attempts) to match
  RAG service's 3-hour ingestion timeout for very large files
…rce cards

Replace server-side scheduled polling with a client-driven action that batches
status checks for pending files, eliminating wasted server resources when users
leave the page. Rework source cards to group citations by source file, display
chunk content inline with a detail dialog, and show relevance/page metadata.
…tions and tool descriptions

Rework citation parsing to correctly handle rag_search retrieve results,
offset citation numbers across successive tool calls to prevent collisions,
and improve deduplication to distinguish different chunks from the same file.
Move "read file" guidance to the top of each file tool description for
better agent visibility. Return fileId and filename from retrieve operations
to enable proper source card attribution.
Guide the agent to pass user-provided file IDs (from chat uploads) via
the fileIds parameter first, falling back to broader search when needed.
Clarify that list_indexed only covers Document Hub files.
- Add non-null assertion for threadId in message_metadata query (already
  guarded by if-check)
- Add progress_phase/progress_detail fields to test mock data matching
  updated SQL query
- Update vision_client test to accept additional on_progress kwarg
- Use destructuring instead of non-null assertion for threadId narrowing
- Add ragStatus: 'queued' to file metadata insert test expectations
…uter

The platform's generateDocument helper builds URLs like /api/v1/pptx/from-html
but the crawler only had the JSON-based POST /api/v1/pptx endpoint, causing 404s.
Added the missing endpoints following the same pattern as the DOCX router.
Instead of calling the crawler service to generate .pptx files, the pptx
tool now accepts a complete HTML document from the LLM and stores it
directly. This gives the AI full control over styling, layout, and
animations (using reveal.js or any framework via CDN).

- Add storeRawContent internal action for direct string-to-storage uploads
- Remove sourceType param; tool now takes a single html param
- Add explicit "no templates" instruction to prevent template hallucination
@larryro larryro merged commit 0ccc373 into main Apr 11, 2026
24 checks passed
@larryro larryro deleted the feat/file-tools-optimization branch April 11, 2026 06:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant