Skip to content

feat(crawler,platform): multi-format web extraction, DOCX generation, and image quality presets#556

Merged
larryro merged 5 commits into
mainfrom
feat/web-tool-multi-format-extraction
Feb 24, 2026
Merged

feat(crawler,platform): multi-format web extraction, DOCX generation, and image quality presets#556
larryro merged 5 commits into
mainfrom
feat/web-tool-multi-format-extraction

Conversation

@larryro

@larryro larryro commented Feb 24, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Multi-format web extraction: The web tool now detects content type (web page, document, image) via HEAD request and routes to the appropriate handler — Crawl4AI for web pages, direct download + parsing for documents (PDF/DOCX/PPTX), and Vision API for images. Replaces the previous PDF-only extraction pipeline.
  • DOCX from Markdown/HTML: New crawler endpoints (/docx/from-markdown, /docx/from-html) convert structured content into DOCX files, with full support for headings, lists, tables, blockquotes, and code blocks. Wired into the platform DOCX tool.
  • Image quality presets: Image tool now uses quality presets (low, standard, high, ultra) instead of a raw scale parameter. Defaults to low (1x JPEG) for smaller file sizes.

Test plan

  • Tests added for single-URL crawling (test_crawl_single_url.py)
  • Tests added for HTML-to-DOCX conversion (test_docx_converter.py)
  • Tests added for web image extraction and filtering (test_web_image_extractor.py)
  • Tests added for web router content-type routing (test_web_router.py)
  • Tests added for web tool format handling (web_tool.test.ts)
  • Manual: Verify web tool extracts content from a PDF URL
  • Manual: Verify web tool extracts content from an image URL
  • Manual: Verify web tool extracts markdown from a web page
  • Manual: Verify DOCX generation from markdown content
  • Manual: Verify image tool defaults to low quality

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Generate DOCX documents directly from Markdown or HTML content
    • Extract and process content from multiple file types (PDFs, DOCX, PPTX, images, web pages) with automatic type detection
    • Image quality settings for document generation (low, standard, high, ultra)
    • Single URL content crawling and extraction
    • Enhanced image processing with automatic descriptions
  • Tests

    • Added comprehensive test coverage for new features and content extraction paths

@greptile-apps

greptile-apps Bot commented Feb 24, 2026

Copy link
Copy Markdown

Greptile Summary

This PR transforms the web extraction pipeline from PDF-only to multi-format support, adds DOCX generation from markdown/HTML, and introduces image quality presets.

Major Changes:

  • Multi-format web extraction: The web tool now detects content type via HEAD requests and URL extensions, routing to appropriate handlers — Crawl4AI for web pages, direct file parsing for documents (PDF/DOCX/PPTX), and Vision API for images
  • DOCX conversion: New crawler endpoints (/docx/from-markdown, /docx/from-html) convert structured content to DOCX with support for headings, lists, tables, blockquotes, and code blocks
  • Image quality presets: Replaced raw scale parameter with quality presets (low, standard, high, ultra), defaulting to low (1x JPEG) for smaller file sizes
  • Web image extraction: Added intelligent image filtering from web pages with Vision API descriptions, filtering out icons, tracking pixels, and decorative images

Test Coverage:

  • 5 new test files with comprehensive coverage for content type routing, HTML parsing, image extraction, and single-URL crawling
  • Good mix of unit tests (content type detection, URL extraction) and integration tests (web extraction pipeline)

Issues Found:

  • Extension format inconsistency in content_type.py:55 — images return "svg" but documents return ".pdf" (missing leading dot)

Confidence Score: 4/5

  • safe to merge after fixing the extension format bug
  • well-tested multi-format extraction with one logical bug in content type detection that needs fixing
  • services/crawler/app/utils/content_type.py requires fix for extension format inconsistency before merge

Important Files Changed

Filename Overview
services/crawler/app/utils/content_type.py new utility for content type detection with extension format inconsistency bug
services/crawler/app/routers/web.py major refactor to support multi-format extraction with proper SSRF protection and routing
services/crawler/app/routers/docx.py added markdown/HTML to DOCX conversion endpoints with thorough validation
services/crawler/app/services/html_to_docx_converter.py comprehensive HTML parser for DOCX conversion supporting all common elements
services/crawler/app/services/web_image_extractor.py image filtering and Vision API integration with proper size limits and concurrency
services/platform/convex/agent_tools/files/image_tool.ts replaced scale parameter with quality presets, defaulting to low (1x JPEG)
services/platform/convex/agent_tools/files/docx_tool.ts added markdown/HTML to DOCX conversion via sourceType parameter
services/platform/convex/agent_tools/web/web_tool.ts updated to use new multi-format fetch-and-extract endpoint

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Web Tool Request] --> B{URL or Search?}
    B -->|URL provided| C[fetchAndExtract]
    B -->|No URL| D[searchPages]
    
    C --> E[Crawler Service<br/>/api/v1/web/fetch-and-extract]
    E --> F{detect_type_from_url}
    F -->|document ext| G[Download & Parse File]
    F -->|image ext| H[Download & Vision Extract]
    F -->|unknown| I[HEAD Request Probe]
    
    I --> J{detect_type_from_content_type}
    J -->|document| G
    J -->|image| H
    J -->|unknown| K[Crawl4AI Web Extraction]
    
    G --> L[parse_file_with_vision<br/>PDF/DOCX/PPTX]
    H --> M[Vision API OCR + Description]
    K --> N[Crawl4AI Single URL]
    N --> O[extract_and_describe_images]
    O --> P[Vision API for Page Images]
    
    L --> Q[Return Content + Metadata]
    M --> Q
    P --> R[Combine Text + Image Descriptions]
    R --> Q
    
    S[DOCX Tool Request] --> T{sourceType?}
    T -->|markdown/html| U[Crawler Service<br/>/api/v1/docx/from-markdown<br/>or /from-html]
    T -->|sections| V[generateDocx<br/>structured content]
    
    U --> W[html_to_docx_converter]
    W --> X[Parse HTML to Sections]
    X --> Y[DocxService.generate_docx]
    V --> Y
    Y --> Z[Return DOCX bytes]
    
    AA[Image Tool Request] --> AB{quality preset}
    AB --> AC[Quality Presets]
    AC -->|low| AD[1x JPEG default]
    AC -->|standard| AE[1x PNG]
    AC -->|high| AF[2x PNG]
    AC -->|ultra| AG[4x PNG]
Loading

Last reviewed commit: 066672b

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

24 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment thread services/crawler/app/utils/content_type.py Outdated
…s, and web pages

The web tool previously only supported semantic search over crawled pages. Now it
also detects and fetches URLs directly — routing to the appropriate extraction
pipeline based on content type (document, image, or web page).

Crawler changes:
- Add content type detection via URL extension and HTTP HEAD probe
- Route documents (PDF, DOCX, PPTX) to file parser, images to Vision API,
  and web pages to the existing Playwright→PDF pipeline
- Return content_type in the API response
- Add tests for content type detection utilities

Platform changes:
- Rename fetch_url_via_pdf to fetch_and_extract to reflect broader capability
- Add URL detection and dual-mode (fetch vs search) to web_tool
- Add unit tests for URL extraction regex
…m markdown/HTML

Switch web page extraction from Playwright→PDF→Vision to Crawl4AI for text
plus Vision API for image descriptions, improving speed and content quality.
Add markdown/HTML→DOCX conversion endpoints and wire them through the
platform document generation pipeline and DOCX agent tool.
… JPEG)

Replace manual scale option with semantic quality presets (low, standard,
high, ultra) that control scale, format, and compression. Default changed
from 2x PNG to 1x JPEG for smaller file sizes unless higher quality is
explicitly requested.
@larryro larryro force-pushed the feat/web-tool-multi-format-extraction branch from 066672b to 91f03f8 Compare February 24, 2026 21:16
…etection

Image extensions derived from MIME types were missing the leading dot
(e.g. "png" instead of ".png"), inconsistent with document extensions
and breaking _build_filename().
@coderabbitai

coderabbitai Bot commented Feb 24, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

This PR extends the crawler service with multi-format content extraction and DOCX generation capabilities. It introduces new endpoints for converting Markdown and HTML to DOCX, refactors the web content extraction pipeline to support documents, images, and web pages with content-type detection, adds a new method for single-URL crawling with structured metadata extraction, and includes utilities for HTML-to-section conversion and web image extraction with Vision API integration. The platform layer is updated to support the new DOCX generation modes and content type metadata. Comprehensive tests validate all new functionality.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 32.35% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the three main changes: multi-format web extraction (documents, images, web pages), DOCX generation (from markdown and HTML), and image quality presets. It directly reflects the primary modifications across both crawler and platform services.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/web-tool-multi-format-extraction

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 18

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
services/platform/convex/agent_tools/web/helpers/fetch_and_extract.ts (1)

76-100: ⚠️ Potential issue | 🟡 Minor

Pass through content_type so downstream consumers can use it.

The crawler response now includes content_type, but fetchAndExtract drops it. That prevents the web tool from exposing the metadata. Please include it in the success payload (and update the result type accordingly).

✅ Suggested fix (fetch_and_extract.ts)
     return {
       operation: 'fetch_url',
       success: true,
       url: args.url,
       title: result.title,
+      content_type: result.content_type,
       content,
       word_count: result.word_count,
       page_count: result.page_count,
       vision_used: result.vision_used,
       truncated,
     };
✅ Suggested type update (helpers/types.ts)
 export type WebFetchUrlResult = {
   operation: 'fetch_url';
   success: boolean;
   url: string;
   title?: string;
+  content_type?: string;
   content: string;
   word_count: number;
   page_count: number;
   vision_used: boolean;
   truncated?: boolean;
   error?: string;
 };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/agent_tools/web/helpers/fetch_and_extract.ts` around
lines 76 - 100, The success payload from fetchAndExtract is dropping
content_type—capture and pass result.content_type through the returned object
and update the associated response type; specifically, in fetchAndExtract (the
block returning operation: 'fetch_url') add content_type: result.content_type to
the returned object (and ensure the local variable handling like
content/truncated remains unchanged), and update the corresponding
type/interface in helpers/types.ts (the fetch/web result type) to include
content_type so downstream consumers can access the metadata.
services/platform/convex/documents/generate_document_helpers.ts (1)

27-37: ⚠️ Potential issue | 🟠 Major

Add guard for unsupported docx + url combinations.

getEndpointPath() routes docx + url to /api/v1/docx/from-url, but this endpoint does not exist in the crawler service. The docx router supports /from-template, /from-markdown, and /from-html, but not /from-url. Requests with this combination will 404. Add a validation guard to prevent this unsupported case:

Suggested guard
 export function getEndpointPath(
   sourceType: DocumentSourceType,
   outputFormat: DocumentOutputFormat,
 ): string {
+  if (outputFormat === 'docx' && sourceType === 'url') {
+    throw new Error('DOCX output does not support URL sources');
+  }
   const formatPath =
     outputFormat === 'pdf'
       ? 'pdf'
       : outputFormat === 'docx'
         ? 'docx'
         : 'images';
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/documents/generate_document_helpers.ts` around lines
27 - 37, getEndpointPath currently maps outputFormat 'docx' + sourceType 'url'
to /api/v1/docx/from-url which doesn't exist; add a guard in getEndpointPath (or
a helper it calls) that detects when outputFormat === 'docx' and sourceType ===
'url' and throws a descriptive error (or returns a clearly handled error
response) to prevent building/using the unsupported /from-url route; keep
existing path logic for other combinations and reference getEndpointPath,
DocumentSourceType, and DocumentOutputFormat when implementing the check.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@services/crawler/app/routers/docx.py`:
- Around line 219-280: The endpoints convert_markdown_to_docx and
convert_html_to_docx accept empty strings because the Pydantic fields lack a
min_length; enforce validation by updating MarkdownToDocxRequest.content and
HtmlToDocxRequest.html in the models to use Field(..., min_length=1,
description=...) so empty input is rejected with 422, or as an alternative add
an early guard in each endpoint (convert_markdown_to_docx and
convert_html_to_docx) to check if request.content or request.html is empty/blank
and raise HTTPException(status_code=400, detail="Empty content") before calling
get_docx_service().
- Around line 237-241: Add an optional filename field (string, default
"document") to the request models MarkdownToDocxRequest and HtmlToDocxRequest in
app/models.py, and update the docx router endpoints in
services/crawler/app/routers/docx.py that return Response (the blocks that
currently set headers {"Content-Disposition": "attachment;
filename=document.docx"}) to use the request.filename value to construct the
Content-Disposition header (append ".docx" if missing and ensure a safe
filename). Update any references to the endpoint handlers (e.g., the
MarkdownToDocx and HtmlToDocx route functions) to read request.filename and
inject it into the Response headers so callers receive a meaningful filename by
defaulting to "document" when none is provided.
- Around line 219-220: The POST route decorator `@router.post`("/from-markdown")
for function convert_markdown_to_docx returns a raw Response but lacks OpenAPI
metadata for binary output; update its decorator to include a responses entry
that documents a 200 response with the binary content-type (e.g.
application/vnd.openxmlformats-officedocument.wordprocessingml.document or
application/octet-stream) and a schema of type "string" with format "binary",
and apply the same responses annotation to the other POST route in this file
that returns a Response (the endpoint around the other POST at lines ~251-252)
so both endpoints self-document binary responses in the generated OpenAPI spec.
- Line 216: Move the module constant _DOCX_CONTENT_TYPE into the top-level
constants block alongside _FILE_TEMPLATE and _FILE_UPLOAD to centralize
module-level declarations; locate the current mid-file definition of
_DOCX_CONTENT_TYPE and cut it, then paste it into the existing constants section
(near _FILE_TEMPLATE/_FILE_UPLOAD) so all file-related constants are grouped
together and remove the original mid-file definition to avoid duplication.

In `@services/crawler/app/routers/web.py`:
- Around line 79-95: The current _probe_url function uses httpx.AsyncClient with
follow_redirects=True which lets redirects bypass your SSRF hostname/IP checks;
change to disable automatic redirects and instead perform the HEAD (and any
subsequent redirect targets) manually by setting follow_redirects=False on the
AsyncClient, inspect the Location header on 3xx responses, validate each
redirect URL's hostname/IP against your SSRF allowlist/blocklist before issuing
the next request, and loop up to a safe redirect limit; apply the same
manual-redirect validation approach to the similar request flow in the other
block referenced (lines ~97-130) so no redirect target is fetched without
validation.

In `@services/crawler/app/services/crawler_service.py`:
- Around line 411-422: Extract the repeated CrawlerRunConfig construction into a
shared helper (e.g., _build_crawl_config) and call it from both crawl_single_url
and crawl_urls; the helper should accept word_count_threshold plus optional
overrides and return a CrawlerRunConfig with the common defaults (excluded_tags
["nav","footer","header","aside","select","option"], cache_mode="bypass",
screenshot=False, pdf=False, exclude_social_media_links=True, and
markdown_generator=DefaultMarkdownGenerator(content_filter=PruningContentFilter(threshold=0.4))).
Replace the duplicated inline CrawlerRunConfig instantiations in
crawl_single_url and crawl_urls with calls to
_build_crawl_config(word_count_threshold, **overrides) so both paths stay
synchronized when config changes.
- Line 435: The expression assigning markdown_content (markdown_content =
result.markdown.fit_markdown or result.markdown.raw_markdown) can yield None;
change it to fall back to an empty string so subsequent calls like
markdown_content.split() don't raise (e.g. use an explicit fallback such as
(result.markdown.fit_markdown or result.markdown.raw_markdown or "")). Apply the
same defensive fallback in the other occurrence in crawl_urls; update the
assignment sites so downstream code (splitting, length checks) safely operate on
an empty string rather than None.

In `@services/crawler/app/services/web_image_extractor.py`:
- Around line 84-101: The current image downloader in the async with semaphore
block reads the full response via response.content before enforcing
MIN_IMAGE_BYTES / MAX_IMAGE_DOWNLOAD_BYTES, causing large responses to be fully
downloaded then discarded; change the download to stream with
response.aiter_bytes() (or similar chunked reads) inside the same async with
semaphore and accumulate bytes up to MAX_IMAGE_DOWNLOAD_BYTES + 1, aborting the
read early and logging/skipping when the accumulator exceeds
MAX_IMAGE_DOWNLOAD_BYTES; keep the same timeout (IMAGE_DOWNLOAD_TIMEOUT),
preserve the existing small/oversize checks and logger usage (logger.debug) and
return the accumulated image_bytes when within bounds.
- Around line 129-146: The describe_image calls are unbounded and should be
limited by the same (or a new) semaphore to respect max_concurrent; wrap each
call to vision_client.describe_image in a small coroutine that acquires the
semaphore before calling and releases it in a finally block (e.g., create
describe_task = _describe_with_semaphore(image_bytes, vision_client, semaphore)
for items in downloaded), collect those describe tasks into describe_tasks and
then await asyncio.gather(*describe_tasks, return_exceptions=True); ensure the
helper coroutine handles exceptions and always releases the semaphore.

In `@services/crawler/app/utils/content_type.py`:
- Line 10: The MIME-to-extension mapping for key "application/vnd.ms-powerpoint"
in the content type dictionary is wrong (it currently maps to ".pptx"); update
that entry to map to ".ppt" and, if legacy PPT detection is desired, add ".ppt"
to the DOCUMENT_EXTENSIONS collection so functions that inspect URLs or
filenames (DOCUMENT_EXTENSIONS and the content-type mapping dict in
content_type.py) will recognize legacy PowerPoint files.
- Line 34: The code assigns path = parsed.path.lower().split("?")[0] but
parsed.path from urlparse never contains query parameters, so remove the
redundant .split("?")[0] and set path = parsed.path.lower(); update the
assignment in content_type.py where the variable path is defined (look for
parsed and path in that function) to simplify the code while keeping the same
behavior and run tests to confirm no behavioral change.
- Around line 54-55: The function detect_type_from_content_type returns document
extensions with a leading dot but image extensions without; update the image
branch (where ct in IMAGE_CONTENT_TYPES is checked) to prepend a dot to the
returned extension (e.g., return f".{ext}", "image") so image extensions match
the document format, referencing the IMAGE_CONTENT_TYPES check and the
detect_type_from_content_type function; also update the related test expectation
(currently asserting "svg") to expect ".svg".

In `@services/crawler/tests/test_crawl_single_url.py`:
- Around line 172-180: The test test_raises_timeout_error_on_slow_crawl is
relying on a real asyncio.sleep in slow_arun which can be flaky in CI; replace
the real sleep with a deterministic never-completes future or mock asyncio.sleep
so the coroutine never resolves (e.g., await asyncio.Future() that is never set)
and keep the timeout at service.crawl_single_url("https://example.com",
timeout=0.01) to trigger the TimeoutError, updating the helper
_make_service/arun_side_effect usage accordingly so the test no longer depends
on wall-clock sleep.

In `@services/crawler/tests/test_web_image_extractor.py`:
- Around line 98-110: The async test methods in TestExtractAndDescribeImages
(test_returns_empty_for_no_images, test_returns_empty_when_all_filtered) lack
the pytest asyncio marker so they won't run; fix by either adding
`@pytest.mark.asyncio` above the TestExtractAndDescribeImages class or setting a
module-level marker (pytestmark = pytest.mark.asyncio) at the top of the test
module so the async tests that call extract_and_describe_images(...) execute
under asyncio.

In `@services/crawler/tests/test_web_router.py`:
- Around line 82-94: The MIME mapping for legacy PowerPoint is incorrect: update
the mapping in content_type.py (the CONTENT_TYPE_MAPPING or equivalent
tuple/list that contains ("application/vnd.ms-powerpoint", ...)) to map
"application/vnd.ms-powerpoint" to ".ppt" (not ".pptx"), and then either add
support for .ppt in file_parser_service.py (e.g., implement or wire a legacy PPT
handler/parser used by parse_file / the existing parser registry) or explicitly
remove the .ppt mapping and add a clear comment/doc explaining legacy .ppt is
intentionally unsupported; also update tests (e.g., test_web_router.py) to
expect the corrected mapping or documented exclusion so parser and mapping
remain consistent.

In `@services/platform/convex/agent_tools/files/docx_tool.ts`:
- Around line 296-332: When handling Mode A in the docx tool, add an explicit
validation that if either args.sourceType or args.content is provided but the
other is missing, you raise a clear error (or reject) indicating both sourceType
and content are required for "generate from content" mode; implement this check
just before the Mode A branch that uses args.sourceType && args.content (the
block that calls internal.documents.internal_actions.generateDocument) so
partial inputs don't silently fall through to the sections/Mode B logic and the
caller receives a meaningful message about the missing parameter(s).

In `@services/platform/convex/agent_tools/web/web_tool.ts`:
- Around line 73-83: The metadata string built in the meta array doesn't include
the result.content_type; update the construction of meta (in web_tool.ts around
the code that builds meta) to add a conditional entry like `result.content_type
? \`Type: ${result.content_type}\` : null` so the content type (PDF, image,
page, etc.) is shown along with Title, URL, Words, Pages, Vision, and truncated
flags before the .filter(Boolean).join(' | ') call.
- Around line 59-67: The code drops the user instruction when a URL is extracted
from args.query; change the assignment so the extraction instruction is
preserved (e.g., set instruction = args.query when present, not undefined when
extractUrl returned the URL). Update the block around targetUrl/instruction in
web_tool.ts so that when targetUrl is truthy you pass the original query
(args.query) or the query-without-url as the instruction into fetchAndExtract
(referencing extractUrl, targetUrl, instruction, and fetchAndExtract) instead of
setting instruction to undefined.

---

Outside diff comments:
In `@services/platform/convex/agent_tools/web/helpers/fetch_and_extract.ts`:
- Around line 76-100: The success payload from fetchAndExtract is dropping
content_type—capture and pass result.content_type through the returned object
and update the associated response type; specifically, in fetchAndExtract (the
block returning operation: 'fetch_url') add content_type: result.content_type to
the returned object (and ensure the local variable handling like
content/truncated remains unchanged), and update the corresponding
type/interface in helpers/types.ts (the fetch/web result type) to include
content_type so downstream consumers can access the metadata.

In `@services/platform/convex/documents/generate_document_helpers.ts`:
- Around line 27-37: getEndpointPath currently maps outputFormat 'docx' +
sourceType 'url' to /api/v1/docx/from-url which doesn't exist; add a guard in
getEndpointPath (or a helper it calls) that detects when outputFormat === 'docx'
and sourceType === 'url' and throws a descriptive error (or returns a clearly
handled error response) to prevent building/using the unsupported /from-url
route; keep existing path logic for other combinations and reference
getEndpointPath, DocumentSourceType, and DocumentOutputFormat when implementing
the check.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 72be014 and 066672b.

⛔ Files ignored due to path filters (1)
  • services/platform/convex/_generated/api.d.ts is excluded by !**/_generated/**
📒 Files selected for processing (23)
  • services/crawler/app/models.py
  • services/crawler/app/routers/docx.py
  • services/crawler/app/routers/web.py
  • services/crawler/app/services/crawler_service.py
  • services/crawler/app/services/docx_service.py
  • services/crawler/app/services/html_to_docx_converter.py
  • services/crawler/app/services/web_image_extractor.py
  • services/crawler/app/utils/__init__.py
  • services/crawler/app/utils/content_type.py
  • services/crawler/tests/conftest.py
  • services/crawler/tests/test_crawl_single_url.py
  • services/crawler/tests/test_docx_converter.py
  • services/crawler/tests/test_web_image_extractor.py
  • services/crawler/tests/test_web_router.py
  • services/platform/convex/agent_tools/files/docx_tool.ts
  • services/platform/convex/agent_tools/files/image_tool.ts
  • services/platform/convex/agent_tools/web/helpers/fetch_and_extract.ts
  • services/platform/convex/agent_tools/web/helpers/types.ts
  • services/platform/convex/agent_tools/web/web_tool.test.ts
  • services/platform/convex/agent_tools/web/web_tool.ts
  • services/platform/convex/documents/generate_document_helpers.ts
  • services/platform/convex/documents/internal_actions.ts
  • services/platform/convex/documents/types.ts

)


_DOCX_CONTENT_TYPE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Move _DOCX_CONTENT_TYPE to the top-level constant block with _FILE_TEMPLATE / _FILE_UPLOAD.

Currently the constant sits mid-file between two unrelated endpoint handlers (lines 25-26 house the other module-level constants), making it harder to locate. Relocating it next to the existing _FILE_* constants keeps all module-level declarations in one place.

♻️ Proposed move
 _FILE_TEMPLATE = File(None, description="Optional template DOCX file to use as base")
 _FILE_UPLOAD = File(..., description="DOCX file to parse")
+_DOCX_CONTENT_TYPE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
-_DOCX_CONTENT_TYPE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
-
-
 `@router.post`("/from-markdown")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/routers/docx.py` at line 216, Move the module constant
_DOCX_CONTENT_TYPE into the top-level constants block alongside _FILE_TEMPLATE
and _FILE_UPLOAD to centralize module-level declarations; locate the current
mid-file definition of _DOCX_CONTENT_TYPE and cut it, then paste it into the
existing constants section (near _FILE_TEMPLATE/_FILE_UPLOAD) so all
file-related constants are grouped together and remove the original mid-file
definition to avoid duplication.

Comment on lines +219 to +280
@router.post("/from-markdown")
async def convert_markdown_to_docx(request: MarkdownToDocxRequest):
"""
Convert Markdown content to DOCX.

Parses markdown into HTML, then extracts structure (headings, paragraphs,
lists, tables, etc.) and generates a Word document.

Args:
request: Markdown content

Returns:
DOCX file as binary response
"""
try:
docx_service = get_docx_service()
docx_bytes = await docx_service.markdown_to_docx(request.content)

return Response(
content=docx_bytes,
media_type=_DOCX_CONTENT_TYPE,
headers={"Content-Disposition": "attachment; filename=document.docx"},
)

except Exception:
logger.exception("Error converting markdown to DOCX")
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail="Failed to convert markdown to DOCX",
) from None


@router.post("/from-html")
async def convert_html_to_docx(request: HtmlToDocxRequest):
"""
Convert HTML content to DOCX.

Parses HTML to extract structure (headings, paragraphs, lists, tables, etc.)
and generates a Word document.

Args:
request: HTML content

Returns:
DOCX file as binary response
"""
try:
docx_service = get_docx_service()
docx_bytes = await docx_service.html_to_docx(request.html)

return Response(
content=docx_bytes,
media_type=_DOCX_CONTENT_TYPE,
headers={"Content-Disposition": "attachment; filename=document.docx"},
)

except Exception:
logger.exception("Error converting HTML to DOCX")
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
detail="Failed to convert HTML to DOCX",
) from None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Empty inputs will silently produce an empty/trivial DOCX with HTTP 200.

MarkdownToDocxRequest.content and HtmlToDocxRequest.html are declared with Field(...) (required but no min_length), so an empty string "" passes Pydantic validation and reaches markdown_to_docx/html_to_docx. The caller receives a valid DOCX binary response with no indication that the input was empty.

Enforce a minimum length on both request model fields, for example:

# In app/models.py → MarkdownToDocxRequest
content: str = Field(..., min_length=1, description="Markdown content to convert")

# In app/models.py → HtmlToDocxRequest
html: str = Field(..., min_length=1, description="HTML content to convert")

Alternatively, add an early guard inside each endpoint before calling the service:

🛡️ Proposed guard in each endpoint
     try:
         docx_service = get_docx_service()
+        if not request.content.strip():
+            raise HTTPException(
+                status_code=status.HTTP_400_BAD_REQUEST,
+                detail="Markdown content must not be empty",
+            )
         docx_bytes = await docx_service.markdown_to_docx(request.content)
     try:
         docx_service = get_docx_service()
+        if not request.html.strip():
+            raise HTTPException(
+                status_code=status.HTTP_400_BAD_REQUEST,
+                detail="HTML content must not be empty",
+            )
         docx_bytes = await docx_service.html_to_docx(request.html)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/routers/docx.py` around lines 219 - 280, The endpoints
convert_markdown_to_docx and convert_html_to_docx accept empty strings because
the Pydantic fields lack a min_length; enforce validation by updating
MarkdownToDocxRequest.content and HtmlToDocxRequest.html in the models to use
Field(..., min_length=1, description=...) so empty input is rejected with 422,
or as an alternative add an early guard in each endpoint
(convert_markdown_to_docx and convert_html_to_docx) to check if request.content
or request.html is empty/blank and raise HTTPException(status_code=400,
detail="Empty content") before calling get_docx_service().

Comment on lines +219 to +220
@router.post("/from-markdown")
async def convert_markdown_to_docx(request: MarkdownToDocxRequest):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

New endpoints are missing OpenAPI binary-response documentation.

Because Response is returned directly (no response_model), FastAPI omits any schema for these routes. Adding a responses annotation in the decorator self-documents the binary content type in the generated OpenAPI spec, improving client discoverability.

♻️ Proposed annotation
-@router.post("/from-markdown")
+@router.post(
+    "/from-markdown",
+    responses={200: {"content": {_DOCX_CONTENT_TYPE: {}}, "description": "DOCX binary file"}},
+)
 async def convert_markdown_to_docx(request: MarkdownToDocxRequest):

-@router.post("/from-html")
+@router.post(
+    "/from-html",
+    responses={200: {"content": {_DOCX_CONTENT_TYPE: {}}, "description": "DOCX binary file"}},
+)
 async def convert_html_to_docx(request: HtmlToDocxRequest):

Also applies to: 251-252

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/routers/docx.py` around lines 219 - 220, The POST route
decorator `@router.post`("/from-markdown") for function convert_markdown_to_docx
returns a raw Response but lacks OpenAPI metadata for binary output; update its
decorator to include a responses entry that documents a 200 response with the
binary content-type (e.g.
application/vnd.openxmlformats-officedocument.wordprocessingml.document or
application/octet-stream) and a schema of type "string" with format "binary",
and apply the same responses annotation to the other POST route in this file
that returns a Response (the endpoint around the other POST at lines ~251-252)
so both endpoints self-document binary responses in the generated OpenAPI spec.

Comment on lines +237 to +241
return Response(
content=docx_bytes,
media_type=_DOCX_CONTENT_TYPE,
headers={"Content-Disposition": "attachment; filename=document.docx"},
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Hardcoded filename=document.docx makes it impossible for callers to obtain a meaningful filename.

Both endpoints return attachment; filename=document.docx regardless of any caller-supplied name. Adding an optional filename field to MarkdownToDocxRequest / HtmlToDocxRequest (defaulting to "document") would let clients receive a semantically named file with no breaking change.

♻️ Proposed refactor

In app/models.py:

 class MarkdownToDocxRequest(BaseModel):
     content: str = Field(..., description="Markdown content to convert")
+    filename: str = Field("document", description="Output filename (without extension)")

 class HtmlToDocxRequest(BaseModel):
     html: str = Field(..., description="HTML content to convert")
+    filename: str = Field("document", description="Output filename (without extension)")

In the router:

-            headers={"Content-Disposition": "attachment; filename=document.docx"},
+            headers={"Content-Disposition": f'attachment; filename="{request.filename}.docx"'},

Also applies to: 269-273

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/routers/docx.py` around lines 237 - 241, Add an optional
filename field (string, default "document") to the request models
MarkdownToDocxRequest and HtmlToDocxRequest in app/models.py, and update the
docx router endpoints in services/crawler/app/routers/docx.py that return
Response (the blocks that currently set headers {"Content-Disposition":
"attachment; filename=document.docx"}) to use the request.filename value to
construct the Content-Disposition header (append ".docx" if missing and ensure a
safe filename). Update any references to the endpoint handlers (e.g., the
MarkdownToDocx and HtmlToDocx route functions) to read request.filename and
inject it into the Response headers so callers receive a meaningful filename by
defaulting to "document" when none is provided.

Comment on lines +79 to +95
async def _probe_url(url_str: str, timeout: float) -> tuple[str, str | None, str]:
"""
Probe a URL with a HEAD request to determine Content-Type.

Returns:
Tuple of (content_type_header, extension_or_None, category).
"""
try:
async with httpx.AsyncClient(follow_redirects=True, timeout=timeout, headers=_HTTP_HEADERS) as client:
response = await client.head(url_str)
content_type = response.headers.get("content-type", "")
ext, category = detect_type_from_content_type(content_type)
return content_type, ext, category
except httpx.HTTPError:
logger.debug(f"HEAD request failed for {url_str}, will try URL extension detection")
return "", None, "unknown"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Redirects can bypass SSRF validation.

follow_redirects=True allows a public URL to redirect to a private IP after the initial hostname check, which defeats the SSRF guard. Validate each redirect target before issuing the next request (or disable auto-redirects and handle redirects manually).

🔒 Suggested redirect validation
-from urllib.parse import urlparse
+from urllib.parse import urlparse, urljoin
 async def _probe_url(url_str: str, timeout: float) -> tuple[str, str | None, str]:
@@
-    try:
-        async with httpx.AsyncClient(follow_redirects=True, timeout=timeout, headers=_HTTP_HEADERS) as client:
-            response = await client.head(url_str)
-            content_type = response.headers.get("content-type", "")
-            ext, category = detect_type_from_content_type(content_type)
-            return content_type, ext, category
+    try:
+        async with httpx.AsyncClient(follow_redirects=False, timeout=timeout, headers=_HTTP_HEADERS) as client:
+            current_url = url_str
+            for _ in range(5):
+                response = await client.head(current_url)
+                if response.is_redirect:
+                    location = response.headers.get("location")
+                    if not location:
+                        break
+                    next_url = urljoin(current_url, location)
+                    validate_url_not_private(next_url)
+                    current_url = next_url
+                    continue
+                content_type = response.headers.get("content-type", "")
+                ext, category = detect_type_from_content_type(content_type)
+                return content_type, ext, category
+        return "", None, "unknown"
     except httpx.HTTPError:
         logger.debug(f"HEAD request failed for {url_str}, will try URL extension detection")
         return "", None, "unknown"
 async def _download_file(url_str: str, timeout: float) -> tuple[bytes, str]:
@@
-    async with httpx.AsyncClient(follow_redirects=True, timeout=timeout, headers=_HTTP_HEADERS) as client:
-        async with client.stream("GET", url_str) as response:
-            response.raise_for_status()
-            content_type = response.headers.get("content-type", "")
-            content_length = response.headers.get("content-length")
-
-            if content_length and int(content_length) > MAX_FILE_DOWNLOAD_BYTES:
-                raise HTTPException(
-                    status_code=status.HTTP_400_BAD_REQUEST,
-                    detail=f"File too large ({int(content_length)} bytes). Maximum: {MAX_FILE_DOWNLOAD_BYTES} bytes.",
-                )
-
-            chunks = []
-            total = 0
-            async for chunk in response.aiter_bytes():
-                total += len(chunk)
-                if total > MAX_FILE_DOWNLOAD_BYTES:
-                    raise HTTPException(
-                        status_code=status.HTTP_400_BAD_REQUEST,
-                        detail=f"File too large (>{MAX_FILE_DOWNLOAD_BYTES} bytes). Download aborted.",
-                    )
-                chunks.append(chunk)
-
-            return b"".join(chunks), content_type
+    async with httpx.AsyncClient(follow_redirects=False, timeout=timeout, headers=_HTTP_HEADERS) as client:
+        current_url = url_str
+        for _ in range(5):
+            async with client.stream("GET", current_url) as response:
+                if response.is_redirect:
+                    location = response.headers.get("location")
+                    if not location:
+                        raise HTTPException(
+                            status_code=status.HTTP_400_BAD_REQUEST,
+                            detail="Redirect without Location header",
+                        )
+                    next_url = urljoin(current_url, location)
+                    validate_url_not_private(next_url)
+                    current_url = next_url
+                    continue
+                response.raise_for_status()
+                content_type = response.headers.get("content-type", "")
+                content_length = response.headers.get("content-length")
+
+                if content_length and int(content_length) > MAX_FILE_DOWNLOAD_BYTES:
+                    raise HTTPException(
+                        status_code=status.HTTP_400_BAD_REQUEST,
+                        detail=f"File too large ({int(content_length)} bytes). Maximum: {MAX_FILE_DOWNLOAD_BYTES} bytes.",
+                    )
+
+                chunks = []
+                total = 0
+                async for chunk in response.aiter_bytes():
+                    total += len(chunk)
+                    if total > MAX_FILE_DOWNLOAD_BYTES:
+                        raise HTTPException(
+                            status_code=status.HTTP_400_BAD_REQUEST,
+                            detail=f"File too large (>{MAX_FILE_DOWNLOAD_BYTES} bytes). Download aborted.",
+                        )
+                    chunks.append(chunk)
+
+                return b"".join(chunks), content_type
+        raise HTTPException(
+            status_code=status.HTTP_400_BAD_REQUEST,
+            detail="Too many redirects",
+        )

Also applies to: 97-130

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/routers/web.py` around lines 79 - 95, The current
_probe_url function uses httpx.AsyncClient with follow_redirects=True which lets
redirects bypass your SSRF hostname/IP checks; change to disable automatic
redirects and instead perform the HEAD (and any subsequent redirect targets)
manually by setting follow_redirects=False on the AsyncClient, inspect the
Location header on 3xx responses, validate each redirect URL's hostname/IP
against your SSRF allowlist/blocklist before issuing the next request, and loop
up to a safe redirect limit; apply the same manual-redirect validation approach
to the similar request flow in the other block referenced (lines ~97-130) so no
redirect target is fetched without validation.

Comment on lines +98 to +110
class TestExtractAndDescribeImages:
"""Tests for extract_and_describe_images()."""

async def test_returns_empty_for_no_images(self):
descriptions, vision_used = await extract_and_describe_images([], "https://example.com")
assert descriptions == []
assert vision_used is False

async def test_returns_empty_when_all_filtered(self):
images = [{"src": "data:image/png;base64,tiny", "score": 5.0}]
descriptions, vision_used = await extract_and_describe_images(images, "https://example.com")
assert descriptions == []
assert vision_used is False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Missing pytest.mark.asyncio — async tests won't execute.

Same issue as test_crawl_single_url.py: the async test methods in TestExtractAndDescribeImages lack the asyncio marker and will silently "pass" without running.

🔧 Proposed fix: add module-level marker
 import pytest
 
 from app.services.web_image_extractor import (
     MAX_IMAGES_TO_PROCESS,
     MIN_IMAGE_BYTES,
     _filter_image_candidates,
     extract_and_describe_images,
 )
+
+pytestmark = pytest.mark.asyncio
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/tests/test_web_image_extractor.py` around lines 98 - 110,
The async test methods in TestExtractAndDescribeImages
(test_returns_empty_for_no_images, test_returns_empty_when_all_filtered) lack
the pytest asyncio marker so they won't run; fix by either adding
`@pytest.mark.asyncio` above the TestExtractAndDescribeImages class or setting a
module-level marker (pytestmark = pytest.mark.asyncio) at the top of the test
module so the async tests that call extract_and_describe_images(...) execute
under asyncio.

Comment on lines +82 to +94
(
"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
".docx",
"document",
),
(
"application/vnd.openxmlformats-officedocument.presentationml.presentation",
".pptx",
"document",
),
("application/msword", ".doc", "document"),
("application/vnd.ms-powerpoint", ".pptx", "document"),
],

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect MIME-to-extension mapping and parser support
echo "=== Searching for vnd.ms-powerpoint MIME mapping ==="
rg -n "vnd\.ms-powerpoint" services/crawler/

echo -e "\n=== Searching for DOCUMENT_CONTENT_TYPES definition ==="
rg -n "DOCUMENT_CONTENT_TYPES" services/crawler/ -C 3

echo -e "\n=== Searching for ppt/pptx parser support ==="
rg -n "ppt" services/crawler/app/services/file_parser_service.py

echo -e "\n=== Checking content_type.py for PowerPoint mappings ==="
cat -n services/crawler/app/utils/content_type.py | head -100

Repository: tale-project/tale

Length of output: 4698


Correct the legacy PowerPoint MIME type mapping.

application/vnd.ms-powerpoint is the MIME type for legacy Office 97-2003 .ppt files, not .pptx. The current mapping in content_type.py:10 incorrectly assigns it to .pptx. Additionally, the parser only supports .pptx (via python-pptx in file_parser_service.py), with no handler for .ppt files. This mismatch will cause parsing failures for legacy PowerPoint files. Either map application/vnd.ms-powerpoint to .ppt with corresponding parser support, or document why legacy format support is intentionally excluded.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/tests/test_web_router.py` around lines 82 - 94, The MIME
mapping for legacy PowerPoint is incorrect: update the mapping in
content_type.py (the CONTENT_TYPE_MAPPING or equivalent tuple/list that contains
("application/vnd.ms-powerpoint", ...)) to map "application/vnd.ms-powerpoint"
to ".ppt" (not ".pptx"), and then either add support for .ppt in
file_parser_service.py (e.g., implement or wire a legacy PPT handler/parser used
by parse_file / the existing parser registry) or explicitly remove the .ppt
mapping and add a clear comment/doc explaining legacy .ppt is intentionally
unsupported; also update tests (e.g., test_web_router.py) to expect the
corrected mapping or documented exclusion so parser and mapping remain
consistent.

Comment on lines +296 to +332

// Mode A: Generate from markdown/html content
if (args.sourceType && args.content) {
debugLog('tool:docx generate from content start', {
fileName: args.fileName,
sourceType: args.sourceType,
});

try {
const result = await ctx.runAction(
internal.documents.internal_actions.generateDocument,
{
fileName: args.fileName,
sourceType: args.sourceType,
outputFormat: 'docx',
content: args.content,
},
);

debugLog('tool:docx generate from content success', {
fileName: result.fileName,
fileId: result.fileId,
size: result.size,
});

return {
operation: 'generate',
...result,
} as GenerateDocxResult;
} catch (error) {
console.error('[tool:docx generate from content] error', {
fileName: args.fileName,
error: error instanceof Error ? error.message : String(error),
});
throw error;
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Partial Mode A args (sourceType without content, or vice versa) silently fall through to Mode B.

If a caller provides sourceType but omits content (or vice versa), the condition on line 298 is false and execution falls through to the sections check on line 335, which produces an error message about missing sections or sourceType + content. This could be confusing since the user clearly intended Mode A. Consider validating explicitly:

🛡️ Proposed validation
      // Mode A: Generate from markdown/html content
      if (args.sourceType && args.content) {
+     } else if (args.sourceType || args.content) {
+       throw new Error(
+         "Both 'sourceType' and 'content' are required for markdown/HTML generation. Provide both or use 'sections' instead.",
+       );
+     }

Or equivalently, add the check before the Mode A block.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/agent_tools/files/docx_tool.ts` around lines 296 -
332, When handling Mode A in the docx tool, add an explicit validation that if
either args.sourceType or args.content is provided but the other is missing, you
raise a clear error (or reject) indicating both sourceType and content are
required for "generate from content" mode; implement this check just before the
Mode A branch that uses args.sourceType && args.content (the block that calls
internal.documents.internal_actions.generateDocument) so partial inputs don't
silently fall through to the sections/Mode B logic and the caller receives a
meaningful message about the missing parameter(s).

Comment on lines +59 to +67
const targetUrl = args.url || extractUrl(args.query);

if (targetUrl) {
const instruction = args.url ? args.query : undefined;

const result = await fetchAndExtract(ctx, {
url: targetUrl,
instruction,
});

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don’t drop the user instruction when a URL is embedded in query.

Right now instruction becomes undefined when the URL is parsed from query, even though the description says the query guides extraction. This loses intent for inputs like “Summarize https://…”.

✅ Suggested fix
-      const instruction = args.url ? args.query : undefined;
+      const instruction = args.url
+        ? args.query
+        : args.query.trim() === targetUrl
+          ? undefined
+          : args.query.replace(targetUrl, '').trim() || args.query;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/agent_tools/web/web_tool.ts` around lines 59 - 67,
The code drops the user instruction when a URL is extracted from args.query;
change the assignment so the extraction instruction is preserved (e.g., set
instruction = args.query when present, not undefined when extractUrl returned
the URL). Update the block around targetUrl/instruction in web_tool.ts so that
when targetUrl is truthy you pass the original query (args.query) or the
query-without-url as the instruction into fetchAndExtract (referencing
extractUrl, targetUrl, instruction, and fetchAndExtract) instead of setting
instruction to undefined.

Comment on lines +73 to +83
const meta = [
result.title ? `Title: ${result.title}` : null,
`URL: ${result.url}`,
`Words: ${result.word_count}`,
result.page_count ? `Pages: ${result.page_count}` : null,
result.vision_used ? 'Vision: used for extraction' : null,
result.truncated ? '(content truncated)' : null,
]
.filter(Boolean)
.join(' | ');

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Surface content_type in the metadata line.

Once content_type is passed through, include it in meta so users can see whether the content came from a PDF, image, or page.

✅ Suggested tweak
         const meta = [
           result.title ? `Title: ${result.title}` : null,
+          result.content_type ? `Content-Type: ${result.content_type}` : null,
           `URL: ${result.url}`,
           `Words: ${result.word_count}`,
           result.page_count ? `Pages: ${result.page_count}` : null,
           result.vision_used ? 'Vision: used for extraction' : null,
           result.truncated ? '(content truncated)' : null,
         ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/agent_tools/web/web_tool.ts` around lines 73 - 83,
The metadata string built in the meta array doesn't include the
result.content_type; update the construction of meta (in web_tool.ts around the
code that builds meta) to add a conditional entry like `result.content_type ?
\`Type: ${result.content_type}\` : null` so the content type (PDF, image, page,
etc.) is shown along with Title, URL, Words, Pages, Vision, and truncated flags
before the .filter(Boolean).join(' | ') call.

@larryro larryro merged commit 2015851 into main Feb 24, 2026
17 checks passed
@larryro larryro deleted the feat/web-tool-multi-format-extraction branch February 24, 2026 21:36
yannickmonney pushed a commit that referenced this pull request Apr 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant