Skip to content

feat: add document_write agent tool with approval flow#807

Merged
larryro merged 9 commits into
mainfrom
feat/document-write-tool
Mar 17, 2026
Merged

feat: add document_write agent tool with approval flow#807
larryro merged 9 commits into
mainfrom
feat/document-write-tool

Conversation

@larryro

@larryro larryro commented Mar 17, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Add a new document_write tool that allows agents to save downloaded files as documents, gated behind an approval workflow with a dedicated approval card UI
  • Add crawler HTTP download utility and update file tools for consistent document creation tracking
  • Harden with atomic claim mutation to prevent double-execution, organization ownership checks, rate limiting, and didRetry guard to fix generate_response overwrite bug

Test plan

  • Verify agent can invoke document_write tool and approval card renders correctly
  • Confirm approve/reject flow works for single and batch file operations
  • Test double-execution prevention (atomic claim mutation)
  • Test organization ownership validation on fileIds
  • Verify generate_response no longer overwrites full message content on non-retry generations
  • Run document_write_tool.test.ts unit tests

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features
    • Introduced document write functionality enabling users to save AI-generated files to the documents hub with an approval workflow
    • Enhanced PDF tool to support downloading existing PDFs from URLs
    • Added document approval interface for reviewing and approving file save requests
    • Expanded agent capabilities to support saving generated content directly to document storage

larryro added 2 commits March 17, 2026 17:43
Add a new document_write tool that allows agents to save downloaded files
as documents, gated behind an approval workflow. Includes crawler HTTP
download utility, approval card UI, and updates to file tools for
consistent document creation tracking.
- Add didRetry guard to prevent savedMessageId from overwriting full
  message content (including tool call parts) on non-retry generations
- Add atomic claimDocumentWriteForExecution mutation to prevent
  double-execution race condition
- Add organization ownership check on fileIds
- Add random suffix to resourceId for uniqueness
- Add rate limiting to createDocumentWriteApproval
- Add metadata and organizationId to approval card memo comparator
- Add batch-all-failed UI state for approval card
- Extract hardcoded error strings to i18n
- Replace fragile ValueError with DownloadDetectedException in crawler
- Validate folder path segments early in tool handler
- Add handler tests for document_write tool

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@coderabbitai

coderabbitai Bot commented Mar 17, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

This PR implements an end-to-end document write approval workflow. In the crawler service, it adds download detection for URLs that trigger file downloads via a new DownloadDetectedException exception class, a reusable HTTP download utility with size limits, and integration into image and PDF conversion services to handle downloads as fallback paths. In the platform service, it introduces a new agent tool document_write that enables agents to request approval before saving generated files to the documents hub. The feature includes new UI components for approval cards and message rendering, backend mutations and actions for approval creation and execution, type definitions and validators for the approval workflow, and extensions to the document schema to support 'agent' as a source provider. Integration spans chat interfaces, approval workflows, agent definitions, and the workflow engine.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 51.72% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: add document_write agent tool with approval flow' clearly and concisely summarizes the main change: introducing a new document_write agent tool with an approval workflow.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/document-write-tool
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can suggest fixes for GitHub Check annotations.

Configure the reviews.tools.github-checks setting to adjust the time to wait for GitHub Checks to complete.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
services/crawler/app/services/pdf_service.py (1)

133-147: ⚠️ Potential issue | 🟠 Major

wait_until parameter is partially ignored in url_to_pdf.

Navigation is hardcoded to domcontentloaded, and only load gets an additional wait. Callers can pass commit or networkidle via the API (as documented in UrlToPdfRequest), but these values are not honored. The image_service.py demonstrates the correct handling pattern already used in the codebase.

💡 Proposed fix
-            # Navigate with domcontentloaded first, detecting download-triggered responses
+            # Navigate while preserving explicit caller intent where possible
             try:
-                await page.goto(url, wait_until="domcontentloaded", timeout=timeout)
+                if wait_until == "commit":
+                    await page.goto(url, wait_until="commit", timeout=timeout)
+                else:
+                    await page.goto(url, wait_until="domcontentloaded", timeout=timeout)
             except PlaywrightError as e:
                 if "Download is starting" in str(e):
                     raise DownloadDetectedException(
                         f"URL triggers a file download instead of rendering a page: {url}"
                     ) from e
                 raise

-            # If 'load' was requested, try to wait for it but don't fail if it times out
-            if wait_until == "load":
+            # If stricter load states were requested, try to wait without hard-failing on timeout
+            if wait_until in ("load", "networkidle"):
                 with contextlib.suppress(PlaywrightTimeoutError):
-                    await page.wait_for_load_state("load", timeout=timeout)
+                    await page.wait_for_load_state(wait_until, timeout=timeout)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/services/pdf_service.py` around lines 133 - 147, The
url_to_pdf implementation ignores caller-supplied wait_until values by
hardcoding page.goto(..., wait_until="domcontentloaded") and only conditionally
waiting for "load"; update url_to_pdf to pass the incoming wait_until parameter
into page.goto (e.g., page.goto(url, wait_until=wait_until, timeout=timeout))
and handle special cases like "load" or "commit"/"networkidle" the same way
image_service.py does (preserve the DownloadDetectedException handling for
download-triggering errors and still suppress PlaywrightTimeoutError when doing
any extra wait_for_load_state or equivalent). Ensure the function name
url_to_pdf and the UrlToPdfRequest semantics are preserved so callers that pass
"commit" or "networkidle" receive the intended behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@services/crawler/app/routers/image.py`:
- Around line 158-162: The Content-Disposition filename currently uses
request.options.image_type while media_type uses the actual downloaded content
type (ct); change the code that builds the filename (around the Response return
using file_bytes, ct, and request.options.image_type) to derive the file
extension from ct (e.g., map common MIME types like image/png, image/jpeg,
image/gif, image/webp -> png, jpg/jpeg, gif, webp) and use that extension in the
filename, falling back to request.options.image_type if ct is unknown or
unmapped, so the returned filename extension matches the actual media_type.

In `@services/crawler/app/routers/pdf.py`:
- Line 149: Change the unpacking of the fallback download so the unused
content_type is ignored to silence linters; instead of assigning to content_type
from the call to download_file(...) (the line with file_bytes, content_type =
await download_file(str(request.url), timeout_seconds)), discard the second
value (e.g., file_bytes, _ = await download_file(... ) or file_bytes,
_content_type = ...) so only file_bytes is used; keep the call to download_file,
request.url and timeout_seconds unchanged.

In `@services/crawler/app/utils/http_download.py`:
- Line 53: The debug log currently outputs the full URL
(logger.debug(f"Downloaded {total} bytes from {url}")), which can leak
credentials in query strings; change this to log a sanitized URL by parsing the
url variable (e.g., using urllib.parse.urlparse) and reconstructing only the
scheme, netloc and path or replace the query/fragment with a fixed token like
"<REDACTED>", then call logger.debug with the sanitized value (keep the same
total variable). Locate the logger.debug line that references url and replace it
with the sanitized-url approach so no raw query parameters or fragments are
logged.
- Around line 36-40: Guard the Content-Length parsing so malformed headers don't
raise ValueError and crash: when checking content_length against
MAX_FILE_DOWNLOAD_BYTES, parse it inside a try/except ValueError (e.g. assign
parsed_length = int(content_length) inside try), and if parsing fails simply
skip the size-based HTTPException check (or log the malformed header) so
streaming-size enforcement continues; update the block that references
content_length, int(content_length), MAX_FILE_DOWNLOAD_BYTES, HTTPException and
status.HTTP_400_BAD_REQUEST to use the safe parsed_length variable and only
raise the 400 when parsed_length is successfully obtained and exceeds
MAX_FILE_DOWNLOAD_BYTES.

In
`@services/platform/app/features/chat/components/document-write-approval-card.tsx`:
- Around line 186-307: The success banners and footer currently depend on the
top-level executionError; change their conditionals to derive success/failure
from per-file counts (failedCount and successCount) instead: update the two
"savedSuccessfully" blocks, the batch partial success block, the "All failed
(batch)" block, and the resolved footer (the ternary that shows
statusApprovedPartial/statusApprovedFailed/statusApprovedSuccess) to use
failedCount, successCount and files.length along with isBatch, executedAt and
status instead of checking executionError; keep the executionError/temporary UI
error block to display the top-level error when present and ensure single-file
execution error display still shows file.executionError where appropriate (refer
to variables executionError, failedCount, successCount, files.length, isBatch,
executedAt, status in document-write-approval-card.tsx).

In
`@services/platform/convex/agent_tools/documents/__tests__/document_write_tool.test.ts`:
- Around line 240-248: The test currently only checks that result.success is
defined, which passes even when the handler failed because the default mock
ctx's runQuery returns undefined; update the test that calls handler(...) with
folderPath 'a//b' to stub the context so runQuery returns valid file metadata
(and runMutation or whatever persistence method the handler uses is stubbed to
succeed) before invoking handler(createMockCtx()) and then assert explicitly
that result.success === true; refer to the test helper createMockCtx, the
handler under test, and the mocked runQuery/runMutation behavior to locate and
change the stubbing and the final assertion.

In `@services/platform/convex/agent_tools/documents/document_write_tool.ts`:
- Around line 130-159: The loop that looks up file metadata (using
toId<'_storage'>(file.fileId) and
ctx.runQuery(internal.file_metadata.internal_queries.getByStorageId, ...)) can
throw for malformed LLM-produced fileId values; wrap the validation and query in
a runtime-check + try/catch: first validate file.fileId (non-empty and matches
expected storage ID pattern/format) and if invalid return the same structured
user-facing error { success: false, message: "...invalid fileId...", error:
"Invalid fileId: ..." } as other checks do; if validation passes, run the query
inside try/catch and on any query/argument error return a descriptive
invalid-fileId failure instead of letting the exception bubble out (update the
code around the for-loop and the filesMetadata.push to only run after successful
validation/query).

In `@services/platform/convex/agent_tools/documents/internal_mutations.ts`:
- Around line 74-77: The patch currently uses executedAt as a claim flag which
can leave an approval permanently marked as done if a later step fails; change
the double-execution protection to use a separate claim/status field (e.g.,
claimedAt or status: 'claimed') instead of executedAt. Specifically, in the code
that reads const approval = await ctx.db.get(args.approvalId) and then does
await ctx.db.patch(args.approvalId, { executedAt: Date.now() }), update the
guard to check for approval.claimedAt (or approval.status === 'claimed') and
patch claimedAt: Date.now() (or status:'claimed') there; ensure executedAt is
only set later in the successful completion/result-persistence path (i.e., set
executedAt via ctx.db.patch after results are persisted) so exceptions do not
leave the approval unretryable.
- Around line 94-96: Replace the throw when the approval record is missing with
a silent early return: instead of throwing after const approval = await
ctx.db.get(args.approvalId), detect if (!approval) and return immediately
(no-op) so the result updater does not fail on raced/deleted approvals; keep the
existing executedAt check (if (approval.executedAt) return) and continue
treating an absent approval as a benign condition. Use the same
ctx.db.get(args.approvalId), args.approvalId, and approval.executedAt
identifiers to locate the code to change.
- Around line 29-53: The handler should validate args.files before creating the
approval: add a guard at the start of the handler (before constructing metadata
and computing fileCount/firstTitle) to reject with a clear, user-friendly error
if args.files is empty and to detect duplicate fileId values (e.g., by checking
a Set over args.files.map(f => f.fileId)) and reject if any duplicates exist;
return/throw a descriptive error (not silently create an Id<'approvals'>) so
callers get immediate runtime validation feedback.

In `@services/platform/convex/agent_tools/files/docx_tool.ts`:
- Around line 202-203: The documentation line incorrectly instructs passing
fileStorageId directly; update the guidance to show calling document_write with
a files array containing objects { fileId: fileStorageId } and an optional
folderPath. Specifically, modify the AFTER GENERATING note in docx_tool.ts to
state: call document_write with a payload like { files: [{ fileId: <returned
fileStorageId> }], folderPath: <optional path> } so consumers know to include
fileStorageId inside the files array rather than as a top-level param.

In `@services/platform/convex/agents/file/agent.ts`:
- Around line 119-123: Update the instruction for using the pdf tool to include
the required fileName parameter: when calling pdf_tool with operation="generate"
(the pdf tool handler), include fileName=<desired PDF name>, sourceType="url",
and content=<the URL> so the handler receives fileName and avoids "Missing
required 'fileName' for generate operation" errors; update the text that
currently shows operation="generate", sourceType="url", content=<the URL> to the
full form with fileName as specified.

In `@services/platform/convex/approvals/types.ts`:
- Around line 127-143: The current normalization unconditionally synthesizes a
blank file when raw.files is empty, causing fake file entries (fileId: '') to
bypass validations; update the logic in the normalization block that checks
raw.files to only synthesize a legacy file object when at least one legacy field
is present (check raw.fileId, raw.fileName, raw.title, raw.mimeType,
raw.fileSize, raw.createdDocumentId, raw.executionError), otherwise return an
empty files array (or leave raw.files empty) instead of creating a blank file;
modify the branch that builds the files array so it only constructs the file
object when any of those legacy fields are non-null/defined.

In `@services/platform/convex/lib/agent_response/generate_response.ts`:
- Around line 1065-1072: The mutation that updates the retry message (using
ctx.runMutation(components.agent.messages.updateMessage) when didRetry &&
savedMessageId && result.text) must be best-effort and not allowed to fail the
whole action; wrap that call in a try/catch and skip/return silently on any
error, and also avoid running it if the action has been cancelled (check
ctx.signal?.aborted or the action cancellation flag before attempting the
mutation). Ensure you reference the didRetry, savedMessageId, result.text,
ctx.runMutation, and components.agent.messages.updateMessage symbols so the fix
is applied to that exact patch site.

In
`@services/platform/convex/workflow_engine/action_defs/document/document_action.ts`:
- Around line 238-286: After fetching fileMetadata (variable fileMetadata) and
before calling internal.documents.internal_mutations.createDocument (where
documentId is created), validate that fileMetadata.organizationId matches the
workflow organizationId (variable organizationId) and throw an error if they
differ; i.e., add a check comparing fileMetadata.organizationId !==
organizationId and throw a clear Error("File does not belong to this
organization.") to prevent cross-organization document creation (mirroring the
same check in document_write_tool.ts).

---

Outside diff comments:
In `@services/crawler/app/services/pdf_service.py`:
- Around line 133-147: The url_to_pdf implementation ignores caller-supplied
wait_until values by hardcoding page.goto(..., wait_until="domcontentloaded")
and only conditionally waiting for "load"; update url_to_pdf to pass the
incoming wait_until parameter into page.goto (e.g., page.goto(url,
wait_until=wait_until, timeout=timeout)) and handle special cases like "load" or
"commit"/"networkidle" the same way image_service.py does (preserve the
DownloadDetectedException handling for download-triggering errors and still
suppress PlaywrightTimeoutError when doing any extra wait_for_load_state or
equivalent). Ensure the function name url_to_pdf and the UrlToPdfRequest
semantics are preserved so callers that pass "commit" or "networkidle" receive
the intended behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 752defd2-cc60-412c-85ff-dd762d9c3d2e

📥 Commits

Reviewing files that changed from the base of the PR and between dc28e21 and bb2c94f.

⛔ Files ignored due to path filters (1)
  • services/platform/convex/_generated/api.d.ts is excluded by !**/_generated/**
📒 Files selected for processing (44)
  • services/crawler/app/exceptions.py
  • services/crawler/app/routers/image.py
  • services/crawler/app/routers/pdf.py
  • services/crawler/app/routers/web.py
  • services/crawler/app/services/base_converter.py
  • services/crawler/app/services/image_service.py
  • services/crawler/app/services/pdf_service.py
  • services/crawler/app/utils/http_download.py
  • services/platform/app/features/chat/components/chat-interface.tsx
  • services/platform/app/features/chat/components/chat-messages.tsx
  • services/platform/app/features/chat/components/document-write-approval-card.tsx
  • services/platform/app/features/chat/hooks/mutations.ts
  • services/platform/app/features/chat/hooks/queries.ts
  • services/platform/app/features/chat/hooks/use-merged-chat-items.ts
  • services/platform/app/features/custom-agents/components/test-chat-panel/test-message-list.tsx
  • services/platform/app/features/custom-agents/components/tool-selector.tsx
  • services/platform/app/features/custom-agents/hooks/use-test-chat.ts
  • services/platform/convex/agent_tools/documents/__tests__/document_write_tool.test.ts
  • services/platform/convex/agent_tools/documents/document_write_tool.ts
  • services/platform/convex/agent_tools/documents/internal_actions.ts
  • services/platform/convex/agent_tools/documents/internal_mutations.ts
  • services/platform/convex/agent_tools/files/docx_tool.ts
  • services/platform/convex/agent_tools/files/excel_tool.ts
  • services/platform/convex/agent_tools/files/image_tool.ts
  • services/platform/convex/agent_tools/files/pdf_tool.ts
  • services/platform/convex/agent_tools/files/pptx_tool.ts
  • services/platform/convex/agent_tools/files/text_tool.ts
  • services/platform/convex/agent_tools/tool_names.ts
  • services/platform/convex/agent_tools/tool_registry.ts
  • services/platform/convex/agents/file/agent.ts
  • services/platform/convex/approvals/actions.ts
  • services/platform/convex/approvals/helpers.ts
  • services/platform/convex/approvals/schema.ts
  • services/platform/convex/approvals/types.ts
  • services/platform/convex/approvals/validators.ts
  • services/platform/convex/custom_agents/system_defaults.ts
  • services/platform/convex/documents/query_documents.ts
  • services/platform/convex/documents/schema.ts
  • services/platform/convex/documents/update_document.ts
  • services/platform/convex/documents/update_document_internal.ts
  • services/platform/convex/documents/validators.ts
  • services/platform/convex/lib/agent_response/generate_response.ts
  • services/platform/convex/workflow_engine/action_defs/document/document_action.ts
  • services/platform/messages/en.json

Comment on lines +158 to +162
return Response(
content=file_bytes,
media_type=ct,
headers={"Content-Disposition": f"attachment; filename=screenshot.{request.options.image_type}"},
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Use filename extension derived from actual downloaded content type.

The response media_type is based on downloaded content (ct), but the filename extension is taken from the requested screenshot type. This can return mismatched files (for example, image/gif body with .png filename).

💡 Proposed fix
             return Response(
                 content=file_bytes,
                 media_type=ct,
-                headers={"Content-Disposition": f"attachment; filename=screenshot.{request.options.image_type}"},
+                headers={
+                    "Content-Disposition": (
+                        "attachment; filename=screenshot."
+                        + ("jpg" if ct == "image/jpeg" else ct.split("/", 1)[1])
+                    )
+                },
             )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
return Response(
content=file_bytes,
media_type=ct,
headers={"Content-Disposition": f"attachment; filename=screenshot.{request.options.image_type}"},
)
return Response(
content=file_bytes,
media_type=ct,
headers={
"Content-Disposition": (
"attachment; filename=screenshot."
("jpg" if ct == "image/jpeg" else ct.split("/", 1)[1])
)
},
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/routers/image.py` around lines 158 - 162, The
Content-Disposition filename currently uses request.options.image_type while
media_type uses the actual downloaded content type (ct); change the code that
builds the filename (around the Response return using file_bytes, ct, and
request.options.image_type) to derive the file extension from ct (e.g., map
common MIME types like image/png, image/jpeg, image/gif, image/webp -> png,
jpg/jpeg, gif, webp) and use that extension in the filename, falling back to
request.options.image_type if ct is unknown or unmapped, so the returned
filename extension matches the actual media_type.

Comment thread services/crawler/app/routers/pdf.py Outdated
logger.info(f"URL triggers download, falling back to HTTP: {request.url}")
try:
timeout_seconds = request.timeout / 1000
file_bytes, content_type = await download_file(str(request.url), timeout_seconds)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Remove unused unpacked variable from fallback download.

content_type is never used after unpacking and should be ignored explicitly to avoid lint noise/failures.

💡 Proposed fix
-            file_bytes, content_type = await download_file(str(request.url), timeout_seconds)
+            file_bytes, _content_type = await download_file(str(request.url), timeout_seconds)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
file_bytes, content_type = await download_file(str(request.url), timeout_seconds)
file_bytes, _content_type = await download_file(str(request.url), timeout_seconds)
🧰 Tools
🪛 Ruff (0.15.6)

[warning] 149-149: Unpacked variable content_type is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/routers/pdf.py` at line 149, Change the unpacking of the
fallback download so the unused content_type is ignored to silence linters;
instead of assigning to content_type from the call to download_file(...) (the
line with file_bytes, content_type = await download_file(str(request.url),
timeout_seconds)), discard the second value (e.g., file_bytes, _ = await
download_file(... ) or file_bytes, _content_type = ...) so only file_bytes is
used; keep the call to download_file, request.url and timeout_seconds unchanged.

Comment on lines +36 to +40
if content_length and int(content_length) > MAX_FILE_DOWNLOAD_BYTES:
raise HTTPException(
status_code=status.HTTP_400_BAD_REQUEST,
detail=f"File too large ({int(content_length)} bytes). Maximum: {MAX_FILE_DOWNLOAD_BYTES} bytes.",
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Guard Content-Length parsing to prevent malformed-header crashes.

At Line 36, int(content_length) can throw ValueError for invalid upstream headers, causing a 500 instead of continuing with streaming-size enforcement.

Proposed fix
-        if content_length and int(content_length) > MAX_FILE_DOWNLOAD_BYTES:
-            raise HTTPException(
-                status_code=status.HTTP_400_BAD_REQUEST,
-                detail=f"File too large ({int(content_length)} bytes). Maximum: {MAX_FILE_DOWNLOAD_BYTES} bytes.",
-            )
+        if content_length:
+            try:
+                parsed_content_length = int(content_length)
+            except ValueError:
+                parsed_content_length = None
+
+            if (
+                parsed_content_length is not None
+                and parsed_content_length > MAX_FILE_DOWNLOAD_BYTES
+            ):
+                raise HTTPException(
+                    status_code=status.HTTP_400_BAD_REQUEST,
+                    detail=f"File too large ({parsed_content_length} bytes). Maximum: {MAX_FILE_DOWNLOAD_BYTES} bytes.",
+                )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if content_length and int(content_length) > MAX_FILE_DOWNLOAD_BYTES:
raise HTTPException(
status_code=status.HTTP_400_BAD_REQUEST,
detail=f"File too large ({int(content_length)} bytes). Maximum: {MAX_FILE_DOWNLOAD_BYTES} bytes.",
)
if content_length:
try:
parsed_content_length = int(content_length)
except ValueError:
parsed_content_length = None
if (
parsed_content_length is not None
and parsed_content_length > MAX_FILE_DOWNLOAD_BYTES
):
raise HTTPException(
status_code=status.HTTP_400_BAD_REQUEST,
detail=f"File too large ({parsed_content_length} bytes). Maximum: {MAX_FILE_DOWNLOAD_BYTES} bytes.",
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/utils/http_download.py` around lines 36 - 40, Guard the
Content-Length parsing so malformed headers don't raise ValueError and crash:
when checking content_length against MAX_FILE_DOWNLOAD_BYTES, parse it inside a
try/except ValueError (e.g. assign parsed_length = int(content_length) inside
try), and if parsing fails simply skip the size-based HTTPException check (or
log the malformed header) so streaming-size enforcement continues; update the
block that references content_length, int(content_length),
MAX_FILE_DOWNLOAD_BYTES, HTTPException and status.HTTP_400_BAD_REQUEST to use
the safe parsed_length variable and only raise the 400 when parsed_length is
successfully obtained and exceeds MAX_FILE_DOWNLOAD_BYTES.

)
chunks.append(chunk)

logger.debug(f"Downloaded {total} bytes from {url}")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid logging full download URLs to prevent credential/query leakage.

At Line 53, logging the raw URL can expose signed query params or tokens in logs.

Proposed fix
+from urllib.parse import urlsplit, urlunsplit
@@
-        logger.debug(f"Downloaded {total} bytes from {url}")
+        parsed = urlsplit(url)
+        safe_url = urlunsplit((parsed.scheme, parsed.netloc, parsed.path, "", ""))
+        logger.debug("Downloaded {} bytes from {}", total, safe_url)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/utils/http_download.py` at line 53, The debug log
currently outputs the full URL (logger.debug(f"Downloaded {total} bytes from
{url}")), which can leak credentials in query strings; change this to log a
sanitized URL by parsing the url variable (e.g., using urllib.parse.urlparse)
and reconstructing only the scheme, netloc and path or replace the
query/fragment with a fixed token like "<REDACTED>", then call logger.debug with
the sanitized value (keep the same total variable). Locate the logger.debug line
that references url and replace it with the sanitized-url approach so no raw
query parameters or fragments are logged.

Comment on lines +186 to +307
{/* Batch success summary */}
{status === 'approved' && executedAt && isBatch && !executionError && (
<HStack gap={1} className="mb-3 text-xs text-green-600">
<CheckCircle className="size-3" />
{t('savedSuccessfully')}
</HStack>
)}

{/* Single file success */}
{status === 'approved' && executedAt && !isBatch && !executionError && (
<HStack gap={1} className="mb-3 text-xs text-green-600">
<CheckCircle className="size-3" />
{t('savedSuccessfully')}
</HStack>
)}

{/* Partial success (batch) */}
{status === 'approved' &&
executedAt &&
isBatch &&
failedCount > 0 &&
successCount > 0 && (
<HStack gap={1} className="mb-3 text-xs text-amber-600">
<CheckCircle className="size-3" />
{t('savedPartially', {
success: successCount,
total: files.length,
})}
</HStack>
)}

{/* Execution error (single file) */}
{status === 'approved' && executionError && !isBatch && (
<HStack
gap={1}
align="start"
className="text-destructive mb-3 text-xs wrap-break-word"
>
<XCircle className="size-3 shrink-0" />
<Text as="span" className="min-w-0">
{executionError}
</Text>
</HStack>
)}

{/* All failed (batch) */}
{status === 'approved' &&
executedAt &&
isBatch &&
failedCount === files.length && (
<HStack
gap={1}
align="start"
className="text-destructive mb-3 text-xs wrap-break-word"
>
<XCircle className="size-3 shrink-0" />
<Text as="span" className="min-w-0">
{t('allFilesFailed', { count: files.length })}
</Text>
</HStack>
)}

{/* Temporary UI error */}
{error && (
<HStack
gap={1}
align="start"
className="text-destructive mb-3 text-xs wrap-break-word"
>
<XCircle className="size-3 shrink-0" />
<Text as="span" className="min-w-0">
{error}
</Text>
</HStack>
)}

{/* Action buttons */}
{isPending && (
<ActionRow gap={2}>
<Tooltip
content={isBatch ? t('approveTooltipBatch') : t('approveTooltip')}
>
<Button
size="sm"
variant="primary"
onClick={handleApprove}
disabled={isProcessing}
className="flex-1"
>
{isApproving && <Loader2 className="mr-1 size-4 animate-spin" />}
{t('approve')}
</Button>
</Tooltip>

<Tooltip
content={isBatch ? t('rejectTooltipBatch') : t('rejectTooltip')}
>
<Button
size="sm"
variant="secondary"
onClick={handleReject}
disabled={isProcessing}
className="flex-1"
>
{isRejecting && <Loader2 className="mr-1 size-4 animate-spin" />}
{t('reject')}
</Button>
</Tooltip>
</ActionRow>
)}

{/* Resolved status */}
{!isPending && (
<HStack justify="between" align="center" className="mt-2">
<Text as="div" variant="caption">
{status === 'approved' && failedCount > 0 && successCount > 0
? t('statusApprovedPartial')
: status === 'approved' && executionError
? t('statusApprovedFailed')
: status === 'approved'
? t('statusApprovedSuccess')
: t('statusRejected')}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Derive the success UI from per-file results, not only executionError.

The card already renders failures from file.executionError, but every "success" branch here still keys off the top-level executionError. If execution writes only file-level errors, a single-file or batch approval can still show savedSuccessfully / statusApprovedSuccess next to failed rows. Use failedCount / successCount as the source of truth for the banners and resolved footer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@services/platform/app/features/chat/components/document-write-approval-card.tsx`
around lines 186 - 307, The success banners and footer currently depend on the
top-level executionError; change their conditionals to derive success/failure
from per-file counts (failedCount and successCount) instead: update the two
"savedSuccessfully" blocks, the batch partial success block, the "All failed
(batch)" block, and the resolved footer (the ternary that shows
statusApprovedPartial/statusApprovedFailed/statusApprovedSuccess) to use
failedCount, successCount and files.length along with isBatch, executedAt and
status instead of checking executionError; keep the executionError/temporary UI
error block to display the top-level error when present and ensure single-file
execution error display still shows file.executionError where appropriate (refer
to variables executionError, failedCount, successCount, files.length, isBatch,
executedAt, status in document-write-approval-card.tsx).

Comment on lines +202 to +203

AFTER GENERATING: To save the file to a folder in the documents hub, call document_write with the returned fileStorageId and the desired folderPath.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Verify document_write tool parameter names match the documentation

# Check the document_write_tool.ts file for its parameter schema
ast-grep --pattern $'const $_ = z.object({
  $$$
})'  services/platform/convex/agent_tools/documents/document_write_tool.ts

# Also check for any validator or args definition
rg -n -A 10 "fileStorageId|folderPath" services/platform/convex/agent_tools/documents/document_write_tool.ts

Repository: tale-project/tale

Length of output: 6386


Fix documentation to reflect the correct document_write API structure.

The guidance on lines 202-203 is incomplete. Users must pass the fileStorageId inside a files array with the structure { fileId: fileStorageId }, not as a direct parameter. The corrected guidance should be:

AFTER GENERATING: To save the file to the documents hub, call document_write with the files array (where each entry has fileId set to the returned fileStorageId) and the desired folderPath.

Or more explicitly show the structure: AFTER GENERATING: Call document_write({ files: [{ fileId: <returned fileStorageId> }], folderPath: <optional path> }) to save to the documents hub.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/agent_tools/files/docx_tool.ts` around lines 202 -
203, The documentation line incorrectly instructs passing fileStorageId
directly; update the guidance to show calling document_write with a files array
containing objects { fileId: fileStorageId } and an optional folderPath.
Specifically, modify the AFTER GENERATING note in docx_tool.ts to state: call
document_write with a payload like { files: [{ fileId: <returned fileStorageId>
}], folderPath: <optional path> } so consumers know to include fileStorageId
inside the files array rather than as a top-level param.

Comment on lines +119 to +123
**DOWNLOADING FILES FROM URLS**
When the user provides URLs to existing PDF files and asks to download/save them:
1. Use pdf tool with operation="generate", sourceType="url", content=<the URL>
2. The tool automatically detects direct PDF links and downloads the original file
3. To save to a folder, follow up with document_write using the returned fileStorageId

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

PDF_TOOL_FILE="$(fd -t f pdf_tool.ts | head -n 1)"
echo "Inspecting: $PDF_TOOL_FILE"

sed -n '1,260p' "$PDF_TOOL_FILE"
echo "---- schema/arg key scan ----"
rg -n -C3 --type=ts 'operation|sourceType|content|args|createTool|z\.object|v\.object' "$PDF_TOOL_FILE"

echo "---- related tests/examples ----"
rg -n -C2 --type=ts 'sourceType.*url|operation.*generate|pdf' services/platform/convex/agent_tools

Repository: tale-project/tale

Length of output: 50373


🏁 Script executed:

cat -n services/platform/convex/agents/file/agent.ts | sed -n '119,123p'

Repository: tale-project/tale

Length of output: 458


Update the instruction to include the required fileName parameter.

Line 121 omits fileName, which is required by the pdf_tool handler. The tool will reject calls without it, resulting in "Missing required 'fileName' for generate operation" errors.

Fix:

Replace line 121:

1. Use pdf tool with operation="generate", sourceType="url", content=<the URL>

With:

1. Use pdf tool with operation="generate", fileName=<desired PDF name>, sourceType="url", content=<the URL>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/agents/file/agent.ts` around lines 119 - 123, Update
the instruction for using the pdf tool to include the required fileName
parameter: when calling pdf_tool with operation="generate" (the pdf tool
handler), include fileName=<desired PDF name>, sourceType="url", and
content=<the URL> so the handler receives fileName and avoids "Missing required
'fileName' for generate operation" errors; update the text that currently shows
operation="generate", sourceType="url", content=<the URL> to the full form with
fileName as specified.

Comment on lines +127 to +143
if (raw.files?.length) return raw;
return {
files: [
{
fileId: raw.fileId ?? '',
fileName: raw.fileName ?? '',
title: raw.title ?? '',
mimeType: raw.mimeType ?? '',
fileSize: raw.fileSize ?? 0,
createdDocumentId: raw.createdDocumentId,
executionError: raw.executionError,
},
],
folderPath: raw.folderPath,
requestedAt: raw.requestedAt,
executedAt: raw.executedAt,
};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Do not synthesize a blank file entry during metadata normalization.

If raw.files is empty and legacy fields are absent, this creates a fake file with fileId: ''. That can bypass empty-file validation and fail later in execution with misleading per-file errors.

✅ Safer normalization logic
 export function normalizeDocumentWriteMetadata(
   raw: DocumentWriteMetadata,
 ): DocumentWriteMetadata {
   if (raw.files?.length) return raw;
+  const hasLegacyFileId =
+    typeof raw.fileId === 'string' && raw.fileId.trim().length > 0;
   return {
-    files: [
-      {
-        fileId: raw.fileId ?? '',
-        fileName: raw.fileName ?? '',
-        title: raw.title ?? '',
-        mimeType: raw.mimeType ?? '',
-        fileSize: raw.fileSize ?? 0,
-        createdDocumentId: raw.createdDocumentId,
-        executionError: raw.executionError,
-      },
-    ],
+    files: hasLegacyFileId
+      ? [
+          {
+            fileId: raw.fileId!,
+            fileName: raw.fileName ?? '',
+            title: raw.title ?? '',
+            mimeType: raw.mimeType ?? '',
+            fileSize: raw.fileSize ?? 0,
+            createdDocumentId: raw.createdDocumentId,
+            executionError: raw.executionError,
+          },
+        ]
+      : [],
     folderPath: raw.folderPath,
     requestedAt: raw.requestedAt,
     executedAt: raw.executedAt,
   };
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (raw.files?.length) return raw;
return {
files: [
{
fileId: raw.fileId ?? '',
fileName: raw.fileName ?? '',
title: raw.title ?? '',
mimeType: raw.mimeType ?? '',
fileSize: raw.fileSize ?? 0,
createdDocumentId: raw.createdDocumentId,
executionError: raw.executionError,
},
],
folderPath: raw.folderPath,
requestedAt: raw.requestedAt,
executedAt: raw.executedAt,
};
export function normalizeDocumentWriteMetadata(
raw: DocumentWriteMetadata,
): DocumentWriteMetadata {
if (raw.files?.length) return raw;
const hasLegacyFileId =
typeof raw.fileId === 'string' && raw.fileId.trim().length > 0;
return {
files: hasLegacyFileId
? [
{
fileId: raw.fileId!,
fileName: raw.fileName ?? '',
title: raw.title ?? '',
mimeType: raw.mimeType ?? '',
fileSize: raw.fileSize ?? 0,
createdDocumentId: raw.createdDocumentId,
executionError: raw.executionError,
},
]
: [],
folderPath: raw.folderPath,
requestedAt: raw.requestedAt,
executedAt: raw.executedAt,
};
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/approvals/types.ts` around lines 127 - 143, The
current normalization unconditionally synthesizes a blank file when raw.files is
empty, causing fake file entries (fileId: '') to bypass validations; update the
logic in the normalization block that checks raw.files to only synthesize a
legacy file object when at least one legacy field is present (check raw.fileId,
raw.fileName, raw.title, raw.mimeType, raw.fileSize, raw.createdDocumentId,
raw.executionError), otherwise return an empty files array (or leave raw.files
empty) instead of creating a blank file; modify the branch that builds the files
array so it only constructs the file object when any of those legacy fields are
non-null/defined.

Comment on lines +1065 to +1072
if (didRetry && savedMessageId && result.text) {
await ctx.runMutation(components.agent.messages.updateMessage, {
messageId: savedMessageId,
patch: {
message: { role: 'assistant', content: result.text },
},
});
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Make retry-message patch best-effort and cancellation-aware.

Line 1065 performs a required mutation before the final cancellation gate and without error isolation. A transient patch failure can fail the whole action after a valid response was generated.

🔧 Suggested hardening
-    if (didRetry && savedMessageId && result.text) {
-      await ctx.runMutation(components.agent.messages.updateMessage, {
-        messageId: savedMessageId,
-        patch: {
-          message: { role: 'assistant', content: result.text },
-        },
-      });
-    }
+    if (didRetry && savedMessageId && result.text) {
+      // Avoid patching if generation was cancelled after retry completed.
+      if (!(await checkFreshAbort())) {
+        try {
+          await ctx.runMutation(components.agent.messages.updateMessage, {
+            messageId: savedMessageId,
+            patch: {
+              message: { role: 'assistant', content: result.text },
+            },
+          });
+        } catch (patchError) {
+          console.error(
+            '[generateAgentResponse] Failed to persist retry text to saved message:',
+            patchError,
+          );
+        }
+      }
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/lib/agent_response/generate_response.ts` around
lines 1065 - 1072, The mutation that updates the retry message (using
ctx.runMutation(components.agent.messages.updateMessage) when didRetry &&
savedMessageId && result.text) must be best-effort and not allowed to fail the
whole action; wrap that call in a try/catch and skip/return silently on any
error, and also avoid running it if the action has been cancelled (check
ctx.signal?.aborted or the action cancellation flag before attempting the
mutation). Ensure you reference the didRetry, savedMessageId, result.text,
ctx.runMutation, and components.agent.messages.updateMessage symbols so the fix
is applied to that exact patch site.

Comment on lines +238 to +286
const fileMetadata = await ctx.runQuery(
internal.file_metadata.internal_queries.getByStorageId,
{ storageId },
);

if (!fileMetadata) {
throw new Error(
`File metadata not found for storage ID "${params.fileId}". The file may not exist.`,
);
}

const docTitle = params.title ?? fileMetadata.fileName;
const organizationId =
typeof _variables.organizationId === 'string'
? _variables.organizationId
: undefined;
const userId =
typeof _variables.userId === 'string' ? _variables.userId : undefined;

if (!organizationId) {
throw new Error(
'organizationId is required in workflow variables to create a document',
);
}

let folderId: string | null = null;
if (params.folderPath) {
folderId = await ctx.runMutation(
internal.folders.internal_mutations.getOrCreateFolderPath,
{
organizationId,
pathSegments: params.folderPath.split('/').filter(Boolean),
createdBy: userId,
},
);
}

const documentId = await ctx.runMutation(
internal.documents.internal_mutations.createDocument,
{
organizationId,
title: docTitle,
fileId: storageId,
mimeType: fileMetadata.contentType,
sourceProvider: 'agent',
createdBy: userId,
...(folderId ? { folderId: toId<'folders'>(folderId) } : {}),
},
);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== documents/internal_mutations.ts ==="
fd 'internal_mutations.ts' services/platform/convex/documents -x sed -n '1,260p' {}

echo
echo "=== ownership-related references ==="
rg -n -C3 'createDocument|organizationId|fileId|storageId|getByStorageId' \
  services/platform/convex/documents \
  services/platform/convex/agent_tools/documents \
  services/platform/convex/workflow_engine/action_defs/document

Repository: tale-project/tale

Length of output: 50374


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== create_document helper implementation ==="
fd 'create_document.ts' services/platform/convex/documents -x cat {}

echo
echo "=== file_metadata internal queries (search for getByStorageId) ==="
fd 'internal_queries.ts' services/platform/convex/file_metadata -x head -80 {}

echo
echo "=== file_metadata schema/types (to see organizationId field) ==="
rg -n 'organizationId' services/platform/convex/file_metadata --max-count=20

Repository: tale-project/tale

Length of output: 2469


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== document_write_tool.ts: search for ownership/org check ==="
rg -n -B5 -A5 'fileMetadata|organizationId.*file|createDocument' \
  services/platform/convex/agent_tools/documents/document_write_tool.ts | head -80

Repository: tale-project/tale

Length of output: 1756


Add organization ownership check for file metadata before creating document.

The code fetches file metadata but never validates that fileMetadata.organizationId matches the workflow's organizationId. The document_write_tool.ts already enforces this check (lines 144–150), preventing cross-organization file attachment. Apply the same validation here before calling createDocument:

Suggested fix location

After fetching fileMetadata (line 238–241) and before calling createDocument (line 275), add:

if (fileMetadata.organizationId !== organizationId) {
  throw new Error(`File does not belong to this organization.`);
}

This ensures a document in one organization cannot reference a storage file belonging to another organization.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@services/platform/convex/workflow_engine/action_defs/document/document_action.ts`
around lines 238 - 286, After fetching fileMetadata (variable fileMetadata) and
before calling internal.documents.internal_mutations.createDocument (where
documentId is created), validate that fileMetadata.organizationId matches the
workflow organizationId (variable organizationId) and throw an error if they
differ; i.e., add a check comparing fileMetadata.organizationId !==
organizationId and throw a clear Error("File does not belong to this
organization.") to prevent cross-organization document creation (mirroring the
same check in document_write_tool.ts).

larryro added 5 commits March 17, 2026 19:56
Raise the default agent timeout from 7 minutes to 20 minutes and make
it configurable per agent (1–25 min) via a new Advanced section in the
agent settings UI. Update generate_response to trust the propagated
deadline directly instead of capping it at 9 minutes, and bump Convex
action limits in the Dockerfile to support longer-running agents.
The approval flow already gates document writes, making the rate
limit redundant. The check referenced a rate limit name that was
never registered, causing a runtime error.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant