feat(crawler,platform): multi-format web extraction, DOCX generation, and image quality presets#556
Conversation
Greptile SummaryThis PR transforms the web extraction pipeline from PDF-only to multi-format support, adds DOCX generation from markdown/HTML, and introduces image quality presets. Major Changes:
Test Coverage:
Issues Found:
Confidence Score: 4/5
|
| Filename | Overview |
|---|---|
| services/crawler/app/utils/content_type.py | new utility for content type detection with extension format inconsistency bug |
| services/crawler/app/routers/web.py | major refactor to support multi-format extraction with proper SSRF protection and routing |
| services/crawler/app/routers/docx.py | added markdown/HTML to DOCX conversion endpoints with thorough validation |
| services/crawler/app/services/html_to_docx_converter.py | comprehensive HTML parser for DOCX conversion supporting all common elements |
| services/crawler/app/services/web_image_extractor.py | image filtering and Vision API integration with proper size limits and concurrency |
| services/platform/convex/agent_tools/files/image_tool.ts | replaced scale parameter with quality presets, defaulting to low (1x JPEG) |
| services/platform/convex/agent_tools/files/docx_tool.ts | added markdown/HTML to DOCX conversion via sourceType parameter |
| services/platform/convex/agent_tools/web/web_tool.ts | updated to use new multi-format fetch-and-extract endpoint |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Web Tool Request] --> B{URL or Search?}
B -->|URL provided| C[fetchAndExtract]
B -->|No URL| D[searchPages]
C --> E[Crawler Service<br/>/api/v1/web/fetch-and-extract]
E --> F{detect_type_from_url}
F -->|document ext| G[Download & Parse File]
F -->|image ext| H[Download & Vision Extract]
F -->|unknown| I[HEAD Request Probe]
I --> J{detect_type_from_content_type}
J -->|document| G
J -->|image| H
J -->|unknown| K[Crawl4AI Web Extraction]
G --> L[parse_file_with_vision<br/>PDF/DOCX/PPTX]
H --> M[Vision API OCR + Description]
K --> N[Crawl4AI Single URL]
N --> O[extract_and_describe_images]
O --> P[Vision API for Page Images]
L --> Q[Return Content + Metadata]
M --> Q
P --> R[Combine Text + Image Descriptions]
R --> Q
S[DOCX Tool Request] --> T{sourceType?}
T -->|markdown/html| U[Crawler Service<br/>/api/v1/docx/from-markdown<br/>or /from-html]
T -->|sections| V[generateDocx<br/>structured content]
U --> W[html_to_docx_converter]
W --> X[Parse HTML to Sections]
X --> Y[DocxService.generate_docx]
V --> Y
Y --> Z[Return DOCX bytes]
AA[Image Tool Request] --> AB{quality preset}
AB --> AC[Quality Presets]
AC -->|low| AD[1x JPEG default]
AC -->|standard| AE[1x PNG]
AC -->|high| AF[2x PNG]
AC -->|ultra| AG[4x PNG]
Last reviewed commit: 066672b
…s, and web pages The web tool previously only supported semantic search over crawled pages. Now it also detects and fetches URLs directly — routing to the appropriate extraction pipeline based on content type (document, image, or web page). Crawler changes: - Add content type detection via URL extension and HTTP HEAD probe - Route documents (PDF, DOCX, PPTX) to file parser, images to Vision API, and web pages to the existing Playwright→PDF pipeline - Return content_type in the API response - Add tests for content type detection utilities Platform changes: - Rename fetch_url_via_pdf to fetch_and_extract to reflect broader capability - Add URL detection and dual-mode (fetch vs search) to web_tool - Add unit tests for URL extraction regex
…m markdown/HTML Switch web page extraction from Playwright→PDF→Vision to Crawl4AI for text plus Vision API for image descriptions, improving speed and content quality. Add markdown/HTML→DOCX conversion endpoints and wire them through the platform document generation pipeline and DOCX agent tool.
… JPEG) Replace manual scale option with semantic quality presets (low, standard, high, ultra) that control scale, format, and compression. Default changed from 2x PNG to 1x JPEG for smaller file sizes unless higher quality is explicitly requested.
066672b to
91f03f8
Compare
…etection Image extensions derived from MIME types were missing the leading dot (e.g. "png" instead of ".png"), inconsistent with document extensions and breaking _build_filename().
📝 WalkthroughWalkthroughThis PR extends the crawler service with multi-format content extraction and DOCX generation capabilities. It introduces new endpoints for converting Markdown and HTML to DOCX, refactors the web content extraction pipeline to support documents, images, and web pages with content-type detection, adds a new method for single-URL crawling with structured metadata extraction, and includes utilities for HTML-to-section conversion and web image extraction with Vision API integration. The platform layer is updated to support the new DOCX generation modes and content type metadata. Comprehensive tests validate all new functionality. Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 18
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
services/platform/convex/agent_tools/web/helpers/fetch_and_extract.ts (1)
76-100:⚠️ Potential issue | 🟡 MinorPass through
content_typeso downstream consumers can use it.The crawler response now includes
content_type, butfetchAndExtractdrops it. That prevents the web tool from exposing the metadata. Please include it in the success payload (and update the result type accordingly).✅ Suggested fix (fetch_and_extract.ts)
return { operation: 'fetch_url', success: true, url: args.url, title: result.title, + content_type: result.content_type, content, word_count: result.word_count, page_count: result.page_count, vision_used: result.vision_used, truncated, };✅ Suggested type update (helpers/types.ts)
export type WebFetchUrlResult = { operation: 'fetch_url'; success: boolean; url: string; title?: string; + content_type?: string; content: string; word_count: number; page_count: number; vision_used: boolean; truncated?: boolean; error?: string; };🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/agent_tools/web/helpers/fetch_and_extract.ts` around lines 76 - 100, The success payload from fetchAndExtract is dropping content_type—capture and pass result.content_type through the returned object and update the associated response type; specifically, in fetchAndExtract (the block returning operation: 'fetch_url') add content_type: result.content_type to the returned object (and ensure the local variable handling like content/truncated remains unchanged), and update the corresponding type/interface in helpers/types.ts (the fetch/web result type) to include content_type so downstream consumers can access the metadata.services/platform/convex/documents/generate_document_helpers.ts (1)
27-37:⚠️ Potential issue | 🟠 MajorAdd guard for unsupported
docx+urlcombinations.
getEndpointPath()routesdocx+urlto/api/v1/docx/from-url, but this endpoint does not exist in the crawler service. The docx router supports/from-template,/from-markdown, and/from-html, but not/from-url. Requests with this combination will 404. Add a validation guard to prevent this unsupported case:Suggested guard
export function getEndpointPath( sourceType: DocumentSourceType, outputFormat: DocumentOutputFormat, ): string { + if (outputFormat === 'docx' && sourceType === 'url') { + throw new Error('DOCX output does not support URL sources'); + } const formatPath = outputFormat === 'pdf' ? 'pdf' : outputFormat === 'docx' ? 'docx' : 'images';🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/documents/generate_document_helpers.ts` around lines 27 - 37, getEndpointPath currently maps outputFormat 'docx' + sourceType 'url' to /api/v1/docx/from-url which doesn't exist; add a guard in getEndpointPath (or a helper it calls) that detects when outputFormat === 'docx' and sourceType === 'url' and throws a descriptive error (or returns a clearly handled error response) to prevent building/using the unsupported /from-url route; keep existing path logic for other combinations and reference getEndpointPath, DocumentSourceType, and DocumentOutputFormat when implementing the check.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@services/crawler/app/routers/docx.py`:
- Around line 219-280: The endpoints convert_markdown_to_docx and
convert_html_to_docx accept empty strings because the Pydantic fields lack a
min_length; enforce validation by updating MarkdownToDocxRequest.content and
HtmlToDocxRequest.html in the models to use Field(..., min_length=1,
description=...) so empty input is rejected with 422, or as an alternative add
an early guard in each endpoint (convert_markdown_to_docx and
convert_html_to_docx) to check if request.content or request.html is empty/blank
and raise HTTPException(status_code=400, detail="Empty content") before calling
get_docx_service().
- Around line 237-241: Add an optional filename field (string, default
"document") to the request models MarkdownToDocxRequest and HtmlToDocxRequest in
app/models.py, and update the docx router endpoints in
services/crawler/app/routers/docx.py that return Response (the blocks that
currently set headers {"Content-Disposition": "attachment;
filename=document.docx"}) to use the request.filename value to construct the
Content-Disposition header (append ".docx" if missing and ensure a safe
filename). Update any references to the endpoint handlers (e.g., the
MarkdownToDocx and HtmlToDocx route functions) to read request.filename and
inject it into the Response headers so callers receive a meaningful filename by
defaulting to "document" when none is provided.
- Around line 219-220: The POST route decorator `@router.post`("/from-markdown")
for function convert_markdown_to_docx returns a raw Response but lacks OpenAPI
metadata for binary output; update its decorator to include a responses entry
that documents a 200 response with the binary content-type (e.g.
application/vnd.openxmlformats-officedocument.wordprocessingml.document or
application/octet-stream) and a schema of type "string" with format "binary",
and apply the same responses annotation to the other POST route in this file
that returns a Response (the endpoint around the other POST at lines ~251-252)
so both endpoints self-document binary responses in the generated OpenAPI spec.
- Line 216: Move the module constant _DOCX_CONTENT_TYPE into the top-level
constants block alongside _FILE_TEMPLATE and _FILE_UPLOAD to centralize
module-level declarations; locate the current mid-file definition of
_DOCX_CONTENT_TYPE and cut it, then paste it into the existing constants section
(near _FILE_TEMPLATE/_FILE_UPLOAD) so all file-related constants are grouped
together and remove the original mid-file definition to avoid duplication.
In `@services/crawler/app/routers/web.py`:
- Around line 79-95: The current _probe_url function uses httpx.AsyncClient with
follow_redirects=True which lets redirects bypass your SSRF hostname/IP checks;
change to disable automatic redirects and instead perform the HEAD (and any
subsequent redirect targets) manually by setting follow_redirects=False on the
AsyncClient, inspect the Location header on 3xx responses, validate each
redirect URL's hostname/IP against your SSRF allowlist/blocklist before issuing
the next request, and loop up to a safe redirect limit; apply the same
manual-redirect validation approach to the similar request flow in the other
block referenced (lines ~97-130) so no redirect target is fetched without
validation.
In `@services/crawler/app/services/crawler_service.py`:
- Around line 411-422: Extract the repeated CrawlerRunConfig construction into a
shared helper (e.g., _build_crawl_config) and call it from both crawl_single_url
and crawl_urls; the helper should accept word_count_threshold plus optional
overrides and return a CrawlerRunConfig with the common defaults (excluded_tags
["nav","footer","header","aside","select","option"], cache_mode="bypass",
screenshot=False, pdf=False, exclude_social_media_links=True, and
markdown_generator=DefaultMarkdownGenerator(content_filter=PruningContentFilter(threshold=0.4))).
Replace the duplicated inline CrawlerRunConfig instantiations in
crawl_single_url and crawl_urls with calls to
_build_crawl_config(word_count_threshold, **overrides) so both paths stay
synchronized when config changes.
- Line 435: The expression assigning markdown_content (markdown_content =
result.markdown.fit_markdown or result.markdown.raw_markdown) can yield None;
change it to fall back to an empty string so subsequent calls like
markdown_content.split() don't raise (e.g. use an explicit fallback such as
(result.markdown.fit_markdown or result.markdown.raw_markdown or "")). Apply the
same defensive fallback in the other occurrence in crawl_urls; update the
assignment sites so downstream code (splitting, length checks) safely operate on
an empty string rather than None.
In `@services/crawler/app/services/web_image_extractor.py`:
- Around line 84-101: The current image downloader in the async with semaphore
block reads the full response via response.content before enforcing
MIN_IMAGE_BYTES / MAX_IMAGE_DOWNLOAD_BYTES, causing large responses to be fully
downloaded then discarded; change the download to stream with
response.aiter_bytes() (or similar chunked reads) inside the same async with
semaphore and accumulate bytes up to MAX_IMAGE_DOWNLOAD_BYTES + 1, aborting the
read early and logging/skipping when the accumulator exceeds
MAX_IMAGE_DOWNLOAD_BYTES; keep the same timeout (IMAGE_DOWNLOAD_TIMEOUT),
preserve the existing small/oversize checks and logger usage (logger.debug) and
return the accumulated image_bytes when within bounds.
- Around line 129-146: The describe_image calls are unbounded and should be
limited by the same (or a new) semaphore to respect max_concurrent; wrap each
call to vision_client.describe_image in a small coroutine that acquires the
semaphore before calling and releases it in a finally block (e.g., create
describe_task = _describe_with_semaphore(image_bytes, vision_client, semaphore)
for items in downloaded), collect those describe tasks into describe_tasks and
then await asyncio.gather(*describe_tasks, return_exceptions=True); ensure the
helper coroutine handles exceptions and always releases the semaphore.
In `@services/crawler/app/utils/content_type.py`:
- Line 10: The MIME-to-extension mapping for key "application/vnd.ms-powerpoint"
in the content type dictionary is wrong (it currently maps to ".pptx"); update
that entry to map to ".ppt" and, if legacy PPT detection is desired, add ".ppt"
to the DOCUMENT_EXTENSIONS collection so functions that inspect URLs or
filenames (DOCUMENT_EXTENSIONS and the content-type mapping dict in
content_type.py) will recognize legacy PowerPoint files.
- Line 34: The code assigns path = parsed.path.lower().split("?")[0] but
parsed.path from urlparse never contains query parameters, so remove the
redundant .split("?")[0] and set path = parsed.path.lower(); update the
assignment in content_type.py where the variable path is defined (look for
parsed and path in that function) to simplify the code while keeping the same
behavior and run tests to confirm no behavioral change.
- Around line 54-55: The function detect_type_from_content_type returns document
extensions with a leading dot but image extensions without; update the image
branch (where ct in IMAGE_CONTENT_TYPES is checked) to prepend a dot to the
returned extension (e.g., return f".{ext}", "image") so image extensions match
the document format, referencing the IMAGE_CONTENT_TYPES check and the
detect_type_from_content_type function; also update the related test expectation
(currently asserting "svg") to expect ".svg".
In `@services/crawler/tests/test_crawl_single_url.py`:
- Around line 172-180: The test test_raises_timeout_error_on_slow_crawl is
relying on a real asyncio.sleep in slow_arun which can be flaky in CI; replace
the real sleep with a deterministic never-completes future or mock asyncio.sleep
so the coroutine never resolves (e.g., await asyncio.Future() that is never set)
and keep the timeout at service.crawl_single_url("https://example.com",
timeout=0.01) to trigger the TimeoutError, updating the helper
_make_service/arun_side_effect usage accordingly so the test no longer depends
on wall-clock sleep.
In `@services/crawler/tests/test_web_image_extractor.py`:
- Around line 98-110: The async test methods in TestExtractAndDescribeImages
(test_returns_empty_for_no_images, test_returns_empty_when_all_filtered) lack
the pytest asyncio marker so they won't run; fix by either adding
`@pytest.mark.asyncio` above the TestExtractAndDescribeImages class or setting a
module-level marker (pytestmark = pytest.mark.asyncio) at the top of the test
module so the async tests that call extract_and_describe_images(...) execute
under asyncio.
In `@services/crawler/tests/test_web_router.py`:
- Around line 82-94: The MIME mapping for legacy PowerPoint is incorrect: update
the mapping in content_type.py (the CONTENT_TYPE_MAPPING or equivalent
tuple/list that contains ("application/vnd.ms-powerpoint", ...)) to map
"application/vnd.ms-powerpoint" to ".ppt" (not ".pptx"), and then either add
support for .ppt in file_parser_service.py (e.g., implement or wire a legacy PPT
handler/parser used by parse_file / the existing parser registry) or explicitly
remove the .ppt mapping and add a clear comment/doc explaining legacy .ppt is
intentionally unsupported; also update tests (e.g., test_web_router.py) to
expect the corrected mapping or documented exclusion so parser and mapping
remain consistent.
In `@services/platform/convex/agent_tools/files/docx_tool.ts`:
- Around line 296-332: When handling Mode A in the docx tool, add an explicit
validation that if either args.sourceType or args.content is provided but the
other is missing, you raise a clear error (or reject) indicating both sourceType
and content are required for "generate from content" mode; implement this check
just before the Mode A branch that uses args.sourceType && args.content (the
block that calls internal.documents.internal_actions.generateDocument) so
partial inputs don't silently fall through to the sections/Mode B logic and the
caller receives a meaningful message about the missing parameter(s).
In `@services/platform/convex/agent_tools/web/web_tool.ts`:
- Around line 73-83: The metadata string built in the meta array doesn't include
the result.content_type; update the construction of meta (in web_tool.ts around
the code that builds meta) to add a conditional entry like `result.content_type
? \`Type: ${result.content_type}\` : null` so the content type (PDF, image,
page, etc.) is shown along with Title, URL, Words, Pages, Vision, and truncated
flags before the .filter(Boolean).join(' | ') call.
- Around line 59-67: The code drops the user instruction when a URL is extracted
from args.query; change the assignment so the extraction instruction is
preserved (e.g., set instruction = args.query when present, not undefined when
extractUrl returned the URL). Update the block around targetUrl/instruction in
web_tool.ts so that when targetUrl is truthy you pass the original query
(args.query) or the query-without-url as the instruction into fetchAndExtract
(referencing extractUrl, targetUrl, instruction, and fetchAndExtract) instead of
setting instruction to undefined.
---
Outside diff comments:
In `@services/platform/convex/agent_tools/web/helpers/fetch_and_extract.ts`:
- Around line 76-100: The success payload from fetchAndExtract is dropping
content_type—capture and pass result.content_type through the returned object
and update the associated response type; specifically, in fetchAndExtract (the
block returning operation: 'fetch_url') add content_type: result.content_type to
the returned object (and ensure the local variable handling like
content/truncated remains unchanged), and update the corresponding
type/interface in helpers/types.ts (the fetch/web result type) to include
content_type so downstream consumers can access the metadata.
In `@services/platform/convex/documents/generate_document_helpers.ts`:
- Around line 27-37: getEndpointPath currently maps outputFormat 'docx' +
sourceType 'url' to /api/v1/docx/from-url which doesn't exist; add a guard in
getEndpointPath (or a helper it calls) that detects when outputFormat === 'docx'
and sourceType === 'url' and throws a descriptive error (or returns a clearly
handled error response) to prevent building/using the unsupported /from-url
route; keep existing path logic for other combinations and reference
getEndpointPath, DocumentSourceType, and DocumentOutputFormat when implementing
the check.
ℹ️ Review info
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
⛔ Files ignored due to path filters (1)
services/platform/convex/_generated/api.d.tsis excluded by!**/_generated/**
📒 Files selected for processing (23)
services/crawler/app/models.pyservices/crawler/app/routers/docx.pyservices/crawler/app/routers/web.pyservices/crawler/app/services/crawler_service.pyservices/crawler/app/services/docx_service.pyservices/crawler/app/services/html_to_docx_converter.pyservices/crawler/app/services/web_image_extractor.pyservices/crawler/app/utils/__init__.pyservices/crawler/app/utils/content_type.pyservices/crawler/tests/conftest.pyservices/crawler/tests/test_crawl_single_url.pyservices/crawler/tests/test_docx_converter.pyservices/crawler/tests/test_web_image_extractor.pyservices/crawler/tests/test_web_router.pyservices/platform/convex/agent_tools/files/docx_tool.tsservices/platform/convex/agent_tools/files/image_tool.tsservices/platform/convex/agent_tools/web/helpers/fetch_and_extract.tsservices/platform/convex/agent_tools/web/helpers/types.tsservices/platform/convex/agent_tools/web/web_tool.test.tsservices/platform/convex/agent_tools/web/web_tool.tsservices/platform/convex/documents/generate_document_helpers.tsservices/platform/convex/documents/internal_actions.tsservices/platform/convex/documents/types.ts
| ) | ||
|
|
||
|
|
||
| _DOCX_CONTENT_TYPE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document" |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Move _DOCX_CONTENT_TYPE to the top-level constant block with _FILE_TEMPLATE / _FILE_UPLOAD.
Currently the constant sits mid-file between two unrelated endpoint handlers (lines 25-26 house the other module-level constants), making it harder to locate. Relocating it next to the existing _FILE_* constants keeps all module-level declarations in one place.
♻️ Proposed move
_FILE_TEMPLATE = File(None, description="Optional template DOCX file to use as base")
_FILE_UPLOAD = File(..., description="DOCX file to parse")
+_DOCX_CONTENT_TYPE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"-_DOCX_CONTENT_TYPE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
-
-
`@router.post`("/from-markdown")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/app/routers/docx.py` at line 216, Move the module constant
_DOCX_CONTENT_TYPE into the top-level constants block alongside _FILE_TEMPLATE
and _FILE_UPLOAD to centralize module-level declarations; locate the current
mid-file definition of _DOCX_CONTENT_TYPE and cut it, then paste it into the
existing constants section (near _FILE_TEMPLATE/_FILE_UPLOAD) so all
file-related constants are grouped together and remove the original mid-file
definition to avoid duplication.
| @router.post("/from-markdown") | ||
| async def convert_markdown_to_docx(request: MarkdownToDocxRequest): | ||
| """ | ||
| Convert Markdown content to DOCX. | ||
|
|
||
| Parses markdown into HTML, then extracts structure (headings, paragraphs, | ||
| lists, tables, etc.) and generates a Word document. | ||
|
|
||
| Args: | ||
| request: Markdown content | ||
|
|
||
| Returns: | ||
| DOCX file as binary response | ||
| """ | ||
| try: | ||
| docx_service = get_docx_service() | ||
| docx_bytes = await docx_service.markdown_to_docx(request.content) | ||
|
|
||
| return Response( | ||
| content=docx_bytes, | ||
| media_type=_DOCX_CONTENT_TYPE, | ||
| headers={"Content-Disposition": "attachment; filename=document.docx"}, | ||
| ) | ||
|
|
||
| except Exception: | ||
| logger.exception("Error converting markdown to DOCX") | ||
| raise HTTPException( | ||
| status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, | ||
| detail="Failed to convert markdown to DOCX", | ||
| ) from None | ||
|
|
||
|
|
||
| @router.post("/from-html") | ||
| async def convert_html_to_docx(request: HtmlToDocxRequest): | ||
| """ | ||
| Convert HTML content to DOCX. | ||
|
|
||
| Parses HTML to extract structure (headings, paragraphs, lists, tables, etc.) | ||
| and generates a Word document. | ||
|
|
||
| Args: | ||
| request: HTML content | ||
|
|
||
| Returns: | ||
| DOCX file as binary response | ||
| """ | ||
| try: | ||
| docx_service = get_docx_service() | ||
| docx_bytes = await docx_service.html_to_docx(request.html) | ||
|
|
||
| return Response( | ||
| content=docx_bytes, | ||
| media_type=_DOCX_CONTENT_TYPE, | ||
| headers={"Content-Disposition": "attachment; filename=document.docx"}, | ||
| ) | ||
|
|
||
| except Exception: | ||
| logger.exception("Error converting HTML to DOCX") | ||
| raise HTTPException( | ||
| status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, | ||
| detail="Failed to convert HTML to DOCX", | ||
| ) from None |
There was a problem hiding this comment.
Empty inputs will silently produce an empty/trivial DOCX with HTTP 200.
MarkdownToDocxRequest.content and HtmlToDocxRequest.html are declared with Field(...) (required but no min_length), so an empty string "" passes Pydantic validation and reaches markdown_to_docx/html_to_docx. The caller receives a valid DOCX binary response with no indication that the input was empty.
Enforce a minimum length on both request model fields, for example:
# In app/models.py → MarkdownToDocxRequest
content: str = Field(..., min_length=1, description="Markdown content to convert")
# In app/models.py → HtmlToDocxRequest
html: str = Field(..., min_length=1, description="HTML content to convert")Alternatively, add an early guard inside each endpoint before calling the service:
🛡️ Proposed guard in each endpoint
try:
docx_service = get_docx_service()
+ if not request.content.strip():
+ raise HTTPException(
+ status_code=status.HTTP_400_BAD_REQUEST,
+ detail="Markdown content must not be empty",
+ )
docx_bytes = await docx_service.markdown_to_docx(request.content) try:
docx_service = get_docx_service()
+ if not request.html.strip():
+ raise HTTPException(
+ status_code=status.HTTP_400_BAD_REQUEST,
+ detail="HTML content must not be empty",
+ )
docx_bytes = await docx_service.html_to_docx(request.html)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/app/routers/docx.py` around lines 219 - 280, The endpoints
convert_markdown_to_docx and convert_html_to_docx accept empty strings because
the Pydantic fields lack a min_length; enforce validation by updating
MarkdownToDocxRequest.content and HtmlToDocxRequest.html in the models to use
Field(..., min_length=1, description=...) so empty input is rejected with 422,
or as an alternative add an early guard in each endpoint
(convert_markdown_to_docx and convert_html_to_docx) to check if request.content
or request.html is empty/blank and raise HTTPException(status_code=400,
detail="Empty content") before calling get_docx_service().
| @router.post("/from-markdown") | ||
| async def convert_markdown_to_docx(request: MarkdownToDocxRequest): |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
New endpoints are missing OpenAPI binary-response documentation.
Because Response is returned directly (no response_model), FastAPI omits any schema for these routes. Adding a responses annotation in the decorator self-documents the binary content type in the generated OpenAPI spec, improving client discoverability.
♻️ Proposed annotation
-@router.post("/from-markdown")
+@router.post(
+ "/from-markdown",
+ responses={200: {"content": {_DOCX_CONTENT_TYPE: {}}, "description": "DOCX binary file"}},
+)
async def convert_markdown_to_docx(request: MarkdownToDocxRequest):
-@router.post("/from-html")
+@router.post(
+ "/from-html",
+ responses={200: {"content": {_DOCX_CONTENT_TYPE: {}}, "description": "DOCX binary file"}},
+)
async def convert_html_to_docx(request: HtmlToDocxRequest):Also applies to: 251-252
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/app/routers/docx.py` around lines 219 - 220, The POST route
decorator `@router.post`("/from-markdown") for function convert_markdown_to_docx
returns a raw Response but lacks OpenAPI metadata for binary output; update its
decorator to include a responses entry that documents a 200 response with the
binary content-type (e.g.
application/vnd.openxmlformats-officedocument.wordprocessingml.document or
application/octet-stream) and a schema of type "string" with format "binary",
and apply the same responses annotation to the other POST route in this file
that returns a Response (the endpoint around the other POST at lines ~251-252)
so both endpoints self-document binary responses in the generated OpenAPI spec.
| return Response( | ||
| content=docx_bytes, | ||
| media_type=_DOCX_CONTENT_TYPE, | ||
| headers={"Content-Disposition": "attachment; filename=document.docx"}, | ||
| ) |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Hardcoded filename=document.docx makes it impossible for callers to obtain a meaningful filename.
Both endpoints return attachment; filename=document.docx regardless of any caller-supplied name. Adding an optional filename field to MarkdownToDocxRequest / HtmlToDocxRequest (defaulting to "document") would let clients receive a semantically named file with no breaking change.
♻️ Proposed refactor
In app/models.py:
class MarkdownToDocxRequest(BaseModel):
content: str = Field(..., description="Markdown content to convert")
+ filename: str = Field("document", description="Output filename (without extension)")
class HtmlToDocxRequest(BaseModel):
html: str = Field(..., description="HTML content to convert")
+ filename: str = Field("document", description="Output filename (without extension)")In the router:
- headers={"Content-Disposition": "attachment; filename=document.docx"},
+ headers={"Content-Disposition": f'attachment; filename="{request.filename}.docx"'},Also applies to: 269-273
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/app/routers/docx.py` around lines 237 - 241, Add an optional
filename field (string, default "document") to the request models
MarkdownToDocxRequest and HtmlToDocxRequest in app/models.py, and update the
docx router endpoints in services/crawler/app/routers/docx.py that return
Response (the blocks that currently set headers {"Content-Disposition":
"attachment; filename=document.docx"}) to use the request.filename value to
construct the Content-Disposition header (append ".docx" if missing and ensure a
safe filename). Update any references to the endpoint handlers (e.g., the
MarkdownToDocx and HtmlToDocx route functions) to read request.filename and
inject it into the Response headers so callers receive a meaningful filename by
defaulting to "document" when none is provided.
| async def _probe_url(url_str: str, timeout: float) -> tuple[str, str | None, str]: | ||
| """ | ||
| Probe a URL with a HEAD request to determine Content-Type. | ||
|
|
||
| Returns: | ||
| Tuple of (content_type_header, extension_or_None, category). | ||
| """ | ||
| try: | ||
| async with httpx.AsyncClient(follow_redirects=True, timeout=timeout, headers=_HTTP_HEADERS) as client: | ||
| response = await client.head(url_str) | ||
| content_type = response.headers.get("content-type", "") | ||
| ext, category = detect_type_from_content_type(content_type) | ||
| return content_type, ext, category | ||
| except httpx.HTTPError: | ||
| logger.debug(f"HEAD request failed for {url_str}, will try URL extension detection") | ||
| return "", None, "unknown" | ||
|
|
There was a problem hiding this comment.
Redirects can bypass SSRF validation.
follow_redirects=True allows a public URL to redirect to a private IP after the initial hostname check, which defeats the SSRF guard. Validate each redirect target before issuing the next request (or disable auto-redirects and handle redirects manually).
🔒 Suggested redirect validation
-from urllib.parse import urlparse
+from urllib.parse import urlparse, urljoin async def _probe_url(url_str: str, timeout: float) -> tuple[str, str | None, str]:
@@
- try:
- async with httpx.AsyncClient(follow_redirects=True, timeout=timeout, headers=_HTTP_HEADERS) as client:
- response = await client.head(url_str)
- content_type = response.headers.get("content-type", "")
- ext, category = detect_type_from_content_type(content_type)
- return content_type, ext, category
+ try:
+ async with httpx.AsyncClient(follow_redirects=False, timeout=timeout, headers=_HTTP_HEADERS) as client:
+ current_url = url_str
+ for _ in range(5):
+ response = await client.head(current_url)
+ if response.is_redirect:
+ location = response.headers.get("location")
+ if not location:
+ break
+ next_url = urljoin(current_url, location)
+ validate_url_not_private(next_url)
+ current_url = next_url
+ continue
+ content_type = response.headers.get("content-type", "")
+ ext, category = detect_type_from_content_type(content_type)
+ return content_type, ext, category
+ return "", None, "unknown"
except httpx.HTTPError:
logger.debug(f"HEAD request failed for {url_str}, will try URL extension detection")
return "", None, "unknown" async def _download_file(url_str: str, timeout: float) -> tuple[bytes, str]:
@@
- async with httpx.AsyncClient(follow_redirects=True, timeout=timeout, headers=_HTTP_HEADERS) as client:
- async with client.stream("GET", url_str) as response:
- response.raise_for_status()
- content_type = response.headers.get("content-type", "")
- content_length = response.headers.get("content-length")
-
- if content_length and int(content_length) > MAX_FILE_DOWNLOAD_BYTES:
- raise HTTPException(
- status_code=status.HTTP_400_BAD_REQUEST,
- detail=f"File too large ({int(content_length)} bytes). Maximum: {MAX_FILE_DOWNLOAD_BYTES} bytes.",
- )
-
- chunks = []
- total = 0
- async for chunk in response.aiter_bytes():
- total += len(chunk)
- if total > MAX_FILE_DOWNLOAD_BYTES:
- raise HTTPException(
- status_code=status.HTTP_400_BAD_REQUEST,
- detail=f"File too large (>{MAX_FILE_DOWNLOAD_BYTES} bytes). Download aborted.",
- )
- chunks.append(chunk)
-
- return b"".join(chunks), content_type
+ async with httpx.AsyncClient(follow_redirects=False, timeout=timeout, headers=_HTTP_HEADERS) as client:
+ current_url = url_str
+ for _ in range(5):
+ async with client.stream("GET", current_url) as response:
+ if response.is_redirect:
+ location = response.headers.get("location")
+ if not location:
+ raise HTTPException(
+ status_code=status.HTTP_400_BAD_REQUEST,
+ detail="Redirect without Location header",
+ )
+ next_url = urljoin(current_url, location)
+ validate_url_not_private(next_url)
+ current_url = next_url
+ continue
+ response.raise_for_status()
+ content_type = response.headers.get("content-type", "")
+ content_length = response.headers.get("content-length")
+
+ if content_length and int(content_length) > MAX_FILE_DOWNLOAD_BYTES:
+ raise HTTPException(
+ status_code=status.HTTP_400_BAD_REQUEST,
+ detail=f"File too large ({int(content_length)} bytes). Maximum: {MAX_FILE_DOWNLOAD_BYTES} bytes.",
+ )
+
+ chunks = []
+ total = 0
+ async for chunk in response.aiter_bytes():
+ total += len(chunk)
+ if total > MAX_FILE_DOWNLOAD_BYTES:
+ raise HTTPException(
+ status_code=status.HTTP_400_BAD_REQUEST,
+ detail=f"File too large (>{MAX_FILE_DOWNLOAD_BYTES} bytes). Download aborted.",
+ )
+ chunks.append(chunk)
+
+ return b"".join(chunks), content_type
+ raise HTTPException(
+ status_code=status.HTTP_400_BAD_REQUEST,
+ detail="Too many redirects",
+ )Also applies to: 97-130
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/app/routers/web.py` around lines 79 - 95, The current
_probe_url function uses httpx.AsyncClient with follow_redirects=True which lets
redirects bypass your SSRF hostname/IP checks; change to disable automatic
redirects and instead perform the HEAD (and any subsequent redirect targets)
manually by setting follow_redirects=False on the AsyncClient, inspect the
Location header on 3xx responses, validate each redirect URL's hostname/IP
against your SSRF allowlist/blocklist before issuing the next request, and loop
up to a safe redirect limit; apply the same manual-redirect validation approach
to the similar request flow in the other block referenced (lines ~97-130) so no
redirect target is fetched without validation.
| class TestExtractAndDescribeImages: | ||
| """Tests for extract_and_describe_images().""" | ||
|
|
||
| async def test_returns_empty_for_no_images(self): | ||
| descriptions, vision_used = await extract_and_describe_images([], "https://example.com") | ||
| assert descriptions == [] | ||
| assert vision_used is False | ||
|
|
||
| async def test_returns_empty_when_all_filtered(self): | ||
| images = [{"src": "data:image/png;base64,tiny", "score": 5.0}] | ||
| descriptions, vision_used = await extract_and_describe_images(images, "https://example.com") | ||
| assert descriptions == [] | ||
| assert vision_used is False |
There was a problem hiding this comment.
Missing pytest.mark.asyncio — async tests won't execute.
Same issue as test_crawl_single_url.py: the async test methods in TestExtractAndDescribeImages lack the asyncio marker and will silently "pass" without running.
🔧 Proposed fix: add module-level marker
import pytest
from app.services.web_image_extractor import (
MAX_IMAGES_TO_PROCESS,
MIN_IMAGE_BYTES,
_filter_image_candidates,
extract_and_describe_images,
)
+
+pytestmark = pytest.mark.asyncio🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/tests/test_web_image_extractor.py` around lines 98 - 110,
The async test methods in TestExtractAndDescribeImages
(test_returns_empty_for_no_images, test_returns_empty_when_all_filtered) lack
the pytest asyncio marker so they won't run; fix by either adding
`@pytest.mark.asyncio` above the TestExtractAndDescribeImages class or setting a
module-level marker (pytestmark = pytest.mark.asyncio) at the top of the test
module so the async tests that call extract_and_describe_images(...) execute
under asyncio.
| ( | ||
| "application/vnd.openxmlformats-officedocument.wordprocessingml.document", | ||
| ".docx", | ||
| "document", | ||
| ), | ||
| ( | ||
| "application/vnd.openxmlformats-officedocument.presentationml.presentation", | ||
| ".pptx", | ||
| "document", | ||
| ), | ||
| ("application/msword", ".doc", "document"), | ||
| ("application/vnd.ms-powerpoint", ".pptx", "document"), | ||
| ], |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Inspect MIME-to-extension mapping and parser support
echo "=== Searching for vnd.ms-powerpoint MIME mapping ==="
rg -n "vnd\.ms-powerpoint" services/crawler/
echo -e "\n=== Searching for DOCUMENT_CONTENT_TYPES definition ==="
rg -n "DOCUMENT_CONTENT_TYPES" services/crawler/ -C 3
echo -e "\n=== Searching for ppt/pptx parser support ==="
rg -n "ppt" services/crawler/app/services/file_parser_service.py
echo -e "\n=== Checking content_type.py for PowerPoint mappings ==="
cat -n services/crawler/app/utils/content_type.py | head -100Repository: tale-project/tale
Length of output: 4698
Correct the legacy PowerPoint MIME type mapping.
application/vnd.ms-powerpoint is the MIME type for legacy Office 97-2003 .ppt files, not .pptx. The current mapping in content_type.py:10 incorrectly assigns it to .pptx. Additionally, the parser only supports .pptx (via python-pptx in file_parser_service.py), with no handler for .ppt files. This mismatch will cause parsing failures for legacy PowerPoint files. Either map application/vnd.ms-powerpoint to .ppt with corresponding parser support, or document why legacy format support is intentionally excluded.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/tests/test_web_router.py` around lines 82 - 94, The MIME
mapping for legacy PowerPoint is incorrect: update the mapping in
content_type.py (the CONTENT_TYPE_MAPPING or equivalent tuple/list that contains
("application/vnd.ms-powerpoint", ...)) to map "application/vnd.ms-powerpoint"
to ".ppt" (not ".pptx"), and then either add support for .ppt in
file_parser_service.py (e.g., implement or wire a legacy PPT handler/parser used
by parse_file / the existing parser registry) or explicitly remove the .ppt
mapping and add a clear comment/doc explaining legacy .ppt is intentionally
unsupported; also update tests (e.g., test_web_router.py) to expect the
corrected mapping or documented exclusion so parser and mapping remain
consistent.
|
|
||
| // Mode A: Generate from markdown/html content | ||
| if (args.sourceType && args.content) { | ||
| debugLog('tool:docx generate from content start', { | ||
| fileName: args.fileName, | ||
| sourceType: args.sourceType, | ||
| }); | ||
|
|
||
| try { | ||
| const result = await ctx.runAction( | ||
| internal.documents.internal_actions.generateDocument, | ||
| { | ||
| fileName: args.fileName, | ||
| sourceType: args.sourceType, | ||
| outputFormat: 'docx', | ||
| content: args.content, | ||
| }, | ||
| ); | ||
|
|
||
| debugLog('tool:docx generate from content success', { | ||
| fileName: result.fileName, | ||
| fileId: result.fileId, | ||
| size: result.size, | ||
| }); | ||
|
|
||
| return { | ||
| operation: 'generate', | ||
| ...result, | ||
| } as GenerateDocxResult; | ||
| } catch (error) { | ||
| console.error('[tool:docx generate from content] error', { | ||
| fileName: args.fileName, | ||
| error: error instanceof Error ? error.message : String(error), | ||
| }); | ||
| throw error; | ||
| } | ||
| } |
There was a problem hiding this comment.
Partial Mode A args (sourceType without content, or vice versa) silently fall through to Mode B.
If a caller provides sourceType but omits content (or vice versa), the condition on line 298 is false and execution falls through to the sections check on line 335, which produces an error message about missing sections or sourceType + content. This could be confusing since the user clearly intended Mode A. Consider validating explicitly:
🛡️ Proposed validation
// Mode A: Generate from markdown/html content
if (args.sourceType && args.content) {
+ } else if (args.sourceType || args.content) {
+ throw new Error(
+ "Both 'sourceType' and 'content' are required for markdown/HTML generation. Provide both or use 'sections' instead.",
+ );
+ }Or equivalently, add the check before the Mode A block.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/platform/convex/agent_tools/files/docx_tool.ts` around lines 296 -
332, When handling Mode A in the docx tool, add an explicit validation that if
either args.sourceType or args.content is provided but the other is missing, you
raise a clear error (or reject) indicating both sourceType and content are
required for "generate from content" mode; implement this check just before the
Mode A branch that uses args.sourceType && args.content (the block that calls
internal.documents.internal_actions.generateDocument) so partial inputs don't
silently fall through to the sections/Mode B logic and the caller receives a
meaningful message about the missing parameter(s).
| const targetUrl = args.url || extractUrl(args.query); | ||
|
|
||
| if (targetUrl) { | ||
| const instruction = args.url ? args.query : undefined; | ||
|
|
||
| const result = await fetchAndExtract(ctx, { | ||
| url: targetUrl, | ||
| instruction, | ||
| }); |
There was a problem hiding this comment.
Don’t drop the user instruction when a URL is embedded in query.
Right now instruction becomes undefined when the URL is parsed from query, even though the description says the query guides extraction. This loses intent for inputs like “Summarize https://…”.
✅ Suggested fix
- const instruction = args.url ? args.query : undefined;
+ const instruction = args.url
+ ? args.query
+ : args.query.trim() === targetUrl
+ ? undefined
+ : args.query.replace(targetUrl, '').trim() || args.query;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/platform/convex/agent_tools/web/web_tool.ts` around lines 59 - 67,
The code drops the user instruction when a URL is extracted from args.query;
change the assignment so the extraction instruction is preserved (e.g., set
instruction = args.query when present, not undefined when extractUrl returned
the URL). Update the block around targetUrl/instruction in web_tool.ts so that
when targetUrl is truthy you pass the original query (args.query) or the
query-without-url as the instruction into fetchAndExtract (referencing
extractUrl, targetUrl, instruction, and fetchAndExtract) instead of setting
instruction to undefined.
| const meta = [ | ||
| result.title ? `Title: ${result.title}` : null, | ||
| `URL: ${result.url}`, | ||
| `Words: ${result.word_count}`, | ||
| result.page_count ? `Pages: ${result.page_count}` : null, | ||
| result.vision_used ? 'Vision: used for extraction' : null, | ||
| result.truncated ? '(content truncated)' : null, | ||
| ] | ||
| .filter(Boolean) | ||
| .join(' | '); | ||
|
|
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Surface content_type in the metadata line.
Once content_type is passed through, include it in meta so users can see whether the content came from a PDF, image, or page.
✅ Suggested tweak
const meta = [
result.title ? `Title: ${result.title}` : null,
+ result.content_type ? `Content-Type: ${result.content_type}` : null,
`URL: ${result.url}`,
`Words: ${result.word_count}`,
result.page_count ? `Pages: ${result.page_count}` : null,
result.vision_used ? 'Vision: used for extraction' : null,
result.truncated ? '(content truncated)' : null,
]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/platform/convex/agent_tools/web/web_tool.ts` around lines 73 - 83,
The metadata string built in the meta array doesn't include the
result.content_type; update the construction of meta (in web_tool.ts around the
code that builds meta) to add a conditional entry like `result.content_type ?
\`Type: ${result.content_type}\` : null` so the content type (PDF, image, page,
etc.) is shown along with Title, URL, Words, Pages, Vision, and truncated flags
before the .filter(Boolean).join(' | ') call.
… and image quality presets (#556)
Summary
/docx/from-markdown,/docx/from-html) convert structured content into DOCX files, with full support for headings, lists, tables, blockquotes, and code blocks. Wired into the platform DOCX tool.low,standard,high,ultra) instead of a rawscaleparameter. Defaults tolow(1x JPEG) for smaller file sizes.Test plan
test_crawl_single_url.py)test_docx_converter.py)test_web_image_extractor.py)test_web_router.py)web_tool.test.ts)🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes
New Features
Tests