feat(crawler,platform): multi-format web extraction, DOCX generation, and image quality presets by larryro · Pull Request #556 · tale-project/tale

larryro · 2026-02-24T21:11:48Z

Summary

Multi-format web extraction: The web tool now detects content type (web page, document, image) via HEAD request and routes to the appropriate handler — Crawl4AI for web pages, direct download + parsing for documents (PDF/DOCX/PPTX), and Vision API for images. Replaces the previous PDF-only extraction pipeline.
DOCX from Markdown/HTML: New crawler endpoints (/docx/from-markdown, /docx/from-html) convert structured content into DOCX files, with full support for headings, lists, tables, blockquotes, and code blocks. Wired into the platform DOCX tool.
Image quality presets: Image tool now uses quality presets (low, standard, high, ultra) instead of a raw scale parameter. Defaults to low (1x JPEG) for smaller file sizes.

Test plan

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

New Features
- Generate DOCX documents directly from Markdown or HTML content
- Extract and process content from multiple file types (PDFs, DOCX, PPTX, images, web pages) with automatic type detection
- Image quality settings for document generation (low, standard, high, ultra)
- Single URL content crawling and extraction
- Enhanced image processing with automatic descriptions
Tests
- Added comprehensive test coverage for new features and content extraction paths

greptile-apps · 2026-02-24T21:15:24Z

Greptile Summary

This PR transforms the web extraction pipeline from PDF-only to multi-format support, adds DOCX generation from markdown/HTML, and introduces image quality presets.

Major Changes:

Multi-format web extraction: The web tool now detects content type via HEAD requests and URL extensions, routing to appropriate handlers — Crawl4AI for web pages, direct file parsing for documents (PDF/DOCX/PPTX), and Vision API for images
DOCX conversion: New crawler endpoints (/docx/from-markdown, /docx/from-html) convert structured content to DOCX with support for headings, lists, tables, blockquotes, and code blocks
Image quality presets: Replaced raw scale parameter with quality presets (low, standard, high, ultra), defaulting to low (1x JPEG) for smaller file sizes
Web image extraction: Added intelligent image filtering from web pages with Vision API descriptions, filtering out icons, tracking pixels, and decorative images

Test Coverage:

5 new test files with comprehensive coverage for content type routing, HTML parsing, image extraction, and single-URL crawling
Good mix of unit tests (content type detection, URL extraction) and integration tests (web extraction pipeline)

Issues Found:

Extension format inconsistency in content_type.py:55 — images return "svg" but documents return ".pdf" (missing leading dot)

Confidence Score: 4/5

safe to merge after fixing the extension format bug
well-tested multi-format extraction with one logical bug in content type detection that needs fixing
services/crawler/app/utils/content_type.py requires fix for extension format inconsistency before merge

Important Files Changed

Filename	Overview
services/crawler/app/utils/content_type.py	new utility for content type detection with extension format inconsistency bug
services/crawler/app/routers/web.py	major refactor to support multi-format extraction with proper SSRF protection and routing
services/crawler/app/routers/docx.py	added markdown/HTML to DOCX conversion endpoints with thorough validation
services/crawler/app/services/html_to_docx_converter.py	comprehensive HTML parser for DOCX conversion supporting all common elements
services/crawler/app/services/web_image_extractor.py	image filtering and Vision API integration with proper size limits and concurrency
services/platform/convex/agent_tools/files/image_tool.ts	replaced scale parameter with quality presets, defaulting to low (1x JPEG)
services/platform/convex/agent_tools/files/docx_tool.ts	added markdown/HTML to DOCX conversion via sourceType parameter
services/platform/convex/agent_tools/web/web_tool.ts	updated to use new multi-format fetch-and-extract endpoint

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Web Tool Request] --> B{URL or Search?}
    B -->|URL provided| C[fetchAndExtract]
    B -->|No URL| D[searchPages]
    
    C --> E[Crawler Service<br/>/api/v1/web/fetch-and-extract]
    E --> F{detect_type_from_url}
    F -->|document ext| G[Download & Parse File]
    F -->|image ext| H[Download & Vision Extract]
    F -->|unknown| I[HEAD Request Probe]
    
    I --> J{detect_type_from_content_type}
    J -->|document| G
    J -->|image| H
    J -->|unknown| K[Crawl4AI Web Extraction]
    
    G --> L[parse_file_with_vision<br/>PDF/DOCX/PPTX]
    H --> M[Vision API OCR + Description]
    K --> N[Crawl4AI Single URL]
    N --> O[extract_and_describe_images]
    O --> P[Vision API for Page Images]
    
    L --> Q[Return Content + Metadata]
    M --> Q
    P --> R[Combine Text + Image Descriptions]
    R --> Q
    
    S[DOCX Tool Request] --> T{sourceType?}
    T -->|markdown/html| U[Crawler Service<br/>/api/v1/docx/from-markdown<br/>or /from-html]
    T -->|sections| V[generateDocx<br/>structured content]
    
    U --> W[html_to_docx_converter]
    W --> X[Parse HTML to Sections]
    X --> Y[DocxService.generate_docx]
    V --> Y
    Y --> Z[Return DOCX bytes]
    
    AA[Image Tool Request] --> AB{quality preset}
    AB --> AC[Quality Presets]
    AC -->|low| AD[1x JPEG default]
    AC -->|standard| AE[1x PNG]
    AC -->|high| AF[2x PNG]
    AC -->|ultra| AG[4x PNG]

_{Last reviewed commit: 066672b}

greptile-apps

_{24 files reviewed, 1 comment}

_{Edit Code Review Agent Settings | Greptile}

…s, and web pages The web tool previously only supported semantic search over crawled pages. Now it also detects and fetches URLs directly — routing to the appropriate extraction pipeline based on content type (document, image, or web page). Crawler changes: - Add content type detection via URL extension and HTTP HEAD probe - Route documents (PDF, DOCX, PPTX) to file parser, images to Vision API, and web pages to the existing Playwright→PDF pipeline - Return content_type in the API response - Add tests for content type detection utilities Platform changes: - Rename fetch_url_via_pdf to fetch_and_extract to reflect broader capability - Add URL detection and dual-mode (fetch vs search) to web_tool - Add unit tests for URL extraction regex

…m markdown/HTML Switch web page extraction from Playwright→PDF→Vision to Crawl4AI for text plus Vision API for image descriptions, improving speed and content quality. Add markdown/HTML→DOCX conversion endpoints and wire them through the platform document generation pipeline and DOCX agent tool.

… JPEG) Replace manual scale option with semantic quality presets (low, standard, high, ultra) that control scale, format, and compression. Default changed from 2x PNG to 1x JPEG for smaller file sizes unless higher quality is explicitly requested.

…etection Image extensions derived from MIME types were missing the leading dot (e.g. "png" instead of ".png"), inconsistent with document extensions and breaking _build_filename().

coderabbitai · 2026-02-24T21:23:01Z

📝 Walkthrough

Walkthrough

This PR extends the crawler service with multi-format content extraction and DOCX generation capabilities. It introduces new endpoints for converting Markdown and HTML to DOCX, refactors the web content extraction pipeline to support documents, images, and web pages with content-type detection, adds a new method for single-URL crawling with structured metadata extraction, and includes utilities for HTML-to-section conversion and web image extraction with Vision API integration. The platform layer is updated to support the new DOCX generation modes and content type metadata. Comprehensive tests validate all new functionality.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

refactor(agent_tools): consolidate tools with unified read operations #1: Refactors web fetch helpers (fetchUrlViaPdf → fetchAndExtract) and adds content_type to response types, directly aligned with this PR's web extraction enhancements.
optimize agent tools #11: Modifies DocxService and DOCX generation architecture (template-based generation), overlapping with this PR's new markdown_to_docx and html_to_docx methods.
refactor(platform): consolidate crawler tools into unified web tool #369: Introduces web fetch/extract and document extraction tooling that integrates with this PR's multi-format content extraction pipeline for documents and images.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 32.35% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the three main changes: multi-format web extraction (documents, images, web pages), DOCX generation (from markdown and HTML), and image quality presets. It directly reflects the primary modifications across both crawler and platform services.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feat/web-tool-multi-format-extraction

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 18

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

services/platform/convex/agent_tools/web/helpers/fetch_and_extract.ts (1)

76-100: ⚠️ Potential issue | 🟡 Minor

Pass through content_type so downstream consumers can use it.

The crawler response now includes content_type, but fetchAndExtract drops it. That prevents the web tool from exposing the metadata. Please include it in the success payload (and update the result type accordingly).

✅ Suggested fix (fetch_and_extract.ts)

     return {
       operation: 'fetch_url',
       success: true,
       url: args.url,
       title: result.title,
+      content_type: result.content_type,
       content,
       word_count: result.word_count,
       page_count: result.page_count,
       vision_used: result.vision_used,
       truncated,
     };

✅ Suggested type update (helpers/types.ts)

 export type WebFetchUrlResult = {
   operation: 'fetch_url';
   success: boolean;
   url: string;
   title?: string;
+  content_type?: string;
   content: string;
   word_count: number;
   page_count: number;
   vision_used: boolean;
   truncated?: boolean;
   error?: string;
 };

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/agent_tools/web/helpers/fetch_and_extract.ts` around
lines 76 - 100, The success payload from fetchAndExtract is dropping
content_type—capture and pass result.content_type through the returned object
and update the associated response type; specifically, in fetchAndExtract (the
block returning operation: 'fetch_url') add content_type: result.content_type to
the returned object (and ensure the local variable handling like
content/truncated remains unchanged), and update the corresponding
type/interface in helpers/types.ts (the fetch/web result type) to include
content_type so downstream consumers can access the metadata.

services/platform/convex/documents/generate_document_helpers.ts (1)

27-37: ⚠️ Potential issue | 🟠 Major

Add guard for unsupported docx + url combinations.

getEndpointPath() routes docx + url to /api/v1/docx/from-url, but this endpoint does not exist in the crawler service. The docx router supports /from-template, /from-markdown, and /from-html, but not /from-url. Requests with this combination will 404. Add a validation guard to prevent this unsupported case:

Suggested guard

 export function getEndpointPath(
   sourceType: DocumentSourceType,
   outputFormat: DocumentOutputFormat,
 ): string {
+  if (outputFormat === 'docx' && sourceType === 'url') {
+    throw new Error('DOCX output does not support URL sources');
+  }
   const formatPath =
     outputFormat === 'pdf'
       ? 'pdf'
       : outputFormat === 'docx'
         ? 'docx'
         : 'images';

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/documents/generate_document_helpers.ts` around lines
27 - 37, getEndpointPath currently maps outputFormat 'docx' + sourceType 'url'
to /api/v1/docx/from-url which doesn't exist; add a guard in getEndpointPath (or
a helper it calls) that detects when outputFormat === 'docx' and sourceType ===
'url' and throws a descriptive error (or returns a clearly handled error
response) to prevent building/using the unsupported /from-url route; keep
existing path logic for other combinations and reference getEndpointPath,
DocumentSourceType, and DocumentOutputFormat when implementing the check.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@services/crawler/app/routers/docx.py`:
- Around line 219-280: The endpoints convert_markdown_to_docx and
convert_html_to_docx accept empty strings because the Pydantic fields lack a
min_length; enforce validation by updating MarkdownToDocxRequest.content and
HtmlToDocxRequest.html in the models to use Field(..., min_length=1,
description=...) so empty input is rejected with 422, or as an alternative add
an early guard in each endpoint (convert_markdown_to_docx and
convert_html_to_docx) to check if request.content or request.html is empty/blank
and raise HTTPException(status_code=400, detail="Empty content") before calling
get_docx_service().
- Around line 237-241: Add an optional filename field (string, default
"document") to the request models MarkdownToDocxRequest and HtmlToDocxRequest in
app/models.py, and update the docx router endpoints in
services/crawler/app/routers/docx.py that return Response (the blocks that
currently set headers {"Content-Disposition": "attachment;
filename=document.docx"}) to use the request.filename value to construct the
Content-Disposition header (append ".docx" if missing and ensure a safe
filename). Update any references to the endpoint handlers (e.g., the
MarkdownToDocx and HtmlToDocx route functions) to read request.filename and
inject it into the Response headers so callers receive a meaningful filename by
defaulting to "document" when none is provided.
- Around line 219-220: The POST route decorator `@router.post`("/from-markdown")
for function convert_markdown_to_docx returns a raw Response but lacks OpenAPI
metadata for binary output; update its decorator to include a responses entry
that documents a 200 response with the binary content-type (e.g.
application/vnd.openxmlformats-officedocument.wordprocessingml.document or
application/octet-stream) and a schema of type "string" with format "binary",
and apply the same responses annotation to the other POST route in this file
that returns a Response (the endpoint around the other POST at lines ~251-252)
so both endpoints self-document binary responses in the generated OpenAPI spec.
- Line 216: Move the module constant _DOCX_CONTENT_TYPE into the top-level
constants block alongside _FILE_TEMPLATE and _FILE_UPLOAD to centralize
module-level declarations; locate the current mid-file definition of
_DOCX_CONTENT_TYPE and cut it, then paste it into the existing constants section
(near _FILE_TEMPLATE/_FILE_UPLOAD) so all file-related constants are grouped
together and remove the original mid-file definition to avoid duplication.

In `@services/crawler/app/routers/web.py`:
- Around line 79-95: The current _probe_url function uses httpx.AsyncClient with
follow_redirects=True which lets redirects bypass your SSRF hostname/IP checks;
change to disable automatic redirects and instead perform the HEAD (and any
subsequent redirect targets) manually by setting follow_redirects=False on the
AsyncClient, inspect the Location header on 3xx responses, validate each
redirect URL's hostname/IP against your SSRF allowlist/blocklist before issuing
the next request, and loop up to a safe redirect limit; apply the same
manual-redirect validation approach to the similar request flow in the other
block referenced (lines ~97-130) so no redirect target is fetched without
validation.

In `@services/crawler/app/services/crawler_service.py`:
- Around line 411-422: Extract the repeated CrawlerRunConfig construction into a
shared helper (e.g., _build_crawl_config) and call it from both crawl_single_url
and crawl_urls; the helper should accept word_count_threshold plus optional
overrides and return a CrawlerRunConfig with the common defaults (excluded_tags
["nav","footer","header","aside","select","option"], cache_mode="bypass",
screenshot=False, pdf=False, exclude_social_media_links=True, and
markdown_generator=DefaultMarkdownGenerator(content_filter=PruningContentFilter(threshold=0.4))).
Replace the duplicated inline CrawlerRunConfig instantiations in
crawl_single_url and crawl_urls with calls to
_build_crawl_config(word_count_threshold, **overrides) so both paths stay
synchronized when config changes.
- Line 435: The expression assigning markdown_content (markdown_content =
result.markdown.fit_markdown or result.markdown.raw_markdown) can yield None;
change it to fall back to an empty string so subsequent calls like
markdown_content.split() don't raise (e.g. use an explicit fallback such as
(result.markdown.fit_markdown or result.markdown.raw_markdown or "")). Apply the
same defensive fallback in the other occurrence in crawl_urls; update the
assignment sites so downstream code (splitting, length checks) safely operate on
an empty string rather than None.

In `@services/crawler/app/services/web_image_extractor.py`:
- Around line 84-101: The current image downloader in the async with semaphore
block reads the full response via response.content before enforcing
MIN_IMAGE_BYTES / MAX_IMAGE_DOWNLOAD_BYTES, causing large responses to be fully
downloaded then discarded; change the download to stream with
response.aiter_bytes() (or similar chunked reads) inside the same async with
semaphore and accumulate bytes up to MAX_IMAGE_DOWNLOAD_BYTES + 1, aborting the
read early and logging/skipping when the accumulator exceeds
MAX_IMAGE_DOWNLOAD_BYTES; keep the same timeout (IMAGE_DOWNLOAD_TIMEOUT),
preserve the existing small/oversize checks and logger usage (logger.debug) and
return the accumulated image_bytes when within bounds.
- Around line 129-146: The describe_image calls are unbounded and should be
limited by the same (or a new) semaphore to respect max_concurrent; wrap each
call to vision_client.describe_image in a small coroutine that acquires the
semaphore before calling and releases it in a finally block (e.g., create
describe_task = _describe_with_semaphore(image_bytes, vision_client, semaphore)
for items in downloaded), collect those describe tasks into describe_tasks and
then await asyncio.gather(*describe_tasks, return_exceptions=True); ensure the
helper coroutine handles exceptions and always releases the semaphore.

In `@services/crawler/app/utils/content_type.py`:
- Line 10: The MIME-to-extension mapping for key "application/vnd.ms-powerpoint"
in the content type dictionary is wrong (it currently maps to ".pptx"); update
that entry to map to ".ppt" and, if legacy PPT detection is desired, add ".ppt"
to the DOCUMENT_EXTENSIONS collection so functions that inspect URLs or
filenames (DOCUMENT_EXTENSIONS and the content-type mapping dict in
content_type.py) will recognize legacy PowerPoint files.
- Line 34: The code assigns path = parsed.path.lower().split("?")[0] but
parsed.path from urlparse never contains query parameters, so remove the
redundant .split("?")[0] and set path = parsed.path.lower(); update the
assignment in content_type.py where the variable path is defined (look for
parsed and path in that function) to simplify the code while keeping the same
behavior and run tests to confirm no behavioral change.
- Around line 54-55: The function detect_type_from_content_type returns document
extensions with a leading dot but image extensions without; update the image
branch (where ct in IMAGE_CONTENT_TYPES is checked) to prepend a dot to the
returned extension (e.g., return f".{ext}", "image") so image extensions match
the document format, referencing the IMAGE_CONTENT_TYPES check and the
detect_type_from_content_type function; also update the related test expectation
(currently asserting "svg") to expect ".svg".

In `@services/crawler/tests/test_crawl_single_url.py`:
- Around line 172-180: The test test_raises_timeout_error_on_slow_crawl is
relying on a real asyncio.sleep in slow_arun which can be flaky in CI; replace
the real sleep with a deterministic never-completes future or mock asyncio.sleep
so the coroutine never resolves (e.g., await asyncio.Future() that is never set)
and keep the timeout at service.crawl_single_url("https://example.com",
timeout=0.01) to trigger the TimeoutError, updating the helper
_make_service/arun_side_effect usage accordingly so the test no longer depends
on wall-clock sleep.

In `@services/crawler/tests/test_web_image_extractor.py`:
- Around line 98-110: The async test methods in TestExtractAndDescribeImages
(test_returns_empty_for_no_images, test_returns_empty_when_all_filtered) lack
the pytest asyncio marker so they won't run; fix by either adding
`@pytest.mark.asyncio` above the TestExtractAndDescribeImages class or setting a
module-level marker (pytestmark = pytest.mark.asyncio) at the top of the test
module so the async tests that call extract_and_describe_images(...) execute
under asyncio.

In `@services/crawler/tests/test_web_router.py`:
- Around line 82-94: The MIME mapping for legacy PowerPoint is incorrect: update
the mapping in content_type.py (the CONTENT_TYPE_MAPPING or equivalent
tuple/list that contains ("application/vnd.ms-powerpoint", ...)) to map
"application/vnd.ms-powerpoint" to ".ppt" (not ".pptx"), and then either add
support for .ppt in file_parser_service.py (e.g., implement or wire a legacy PPT
handler/parser used by parse_file / the existing parser registry) or explicitly
remove the .ppt mapping and add a clear comment/doc explaining legacy .ppt is
intentionally unsupported; also update tests (e.g., test_web_router.py) to
expect the corrected mapping or documented exclusion so parser and mapping
remain consistent.

In `@services/platform/convex/agent_tools/files/docx_tool.ts`:
- Around line 296-332: When handling Mode A in the docx tool, add an explicit
validation that if either args.sourceType or args.content is provided but the
other is missing, you raise a clear error (or reject) indicating both sourceType
and content are required for "generate from content" mode; implement this check
just before the Mode A branch that uses args.sourceType && args.content (the
block that calls internal.documents.internal_actions.generateDocument) so
partial inputs don't silently fall through to the sections/Mode B logic and the
caller receives a meaningful message about the missing parameter(s).

In `@services/platform/convex/agent_tools/web/web_tool.ts`:
- Around line 73-83: The metadata string built in the meta array doesn't include
the result.content_type; update the construction of meta (in web_tool.ts around
the code that builds meta) to add a conditional entry like `result.content_type
? \`Type: ${result.content_type}\` : null` so the content type (PDF, image,
page, etc.) is shown along with Title, URL, Words, Pages, Vision, and truncated
flags before the .filter(Boolean).join(' | ') call.
- Around line 59-67: The code drops the user instruction when a URL is extracted
from args.query; change the assignment so the extraction instruction is
preserved (e.g., set instruction = args.query when present, not undefined when
extractUrl returned the URL). Update the block around targetUrl/instruction in
web_tool.ts so that when targetUrl is truthy you pass the original query
(args.query) or the query-without-url as the instruction into fetchAndExtract
(referencing extractUrl, targetUrl, instruction, and fetchAndExtract) instead of
setting instruction to undefined.

---

Outside diff comments:
In `@services/platform/convex/agent_tools/web/helpers/fetch_and_extract.ts`:
- Around line 76-100: The success payload from fetchAndExtract is dropping
content_type—capture and pass result.content_type through the returned object
and update the associated response type; specifically, in fetchAndExtract (the
block returning operation: 'fetch_url') add content_type: result.content_type to
the returned object (and ensure the local variable handling like
content/truncated remains unchanged), and update the corresponding
type/interface in helpers/types.ts (the fetch/web result type) to include
content_type so downstream consumers can access the metadata.

In `@services/platform/convex/documents/generate_document_helpers.ts`:
- Around line 27-37: getEndpointPath currently maps outputFormat 'docx' +
sourceType 'url' to /api/v1/docx/from-url which doesn't exist; add a guard in
getEndpointPath (or a helper it calls) that detects when outputFormat === 'docx'
and sourceType === 'url' and throws a descriptive error (or returns a clearly
handled error response) to prevent building/using the unsupported /from-url
route; keep existing path logic for other combinations and reference
getEndpointPath, DocumentSourceType, and DocumentOutputFormat when implementing
the check.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 72be014 and 066672b.

⛔ Files ignored due to path filters (1)

services/platform/convex/_generated/api.d.ts is excluded by !**/_generated/**

📒 Files selected for processing (23)

services/crawler/app/models.py
services/crawler/app/routers/docx.py
services/crawler/app/routers/web.py
services/crawler/app/services/crawler_service.py
services/crawler/app/services/docx_service.py
services/crawler/app/services/html_to_docx_converter.py
services/crawler/app/services/web_image_extractor.py
services/crawler/app/utils/__init__.py
services/crawler/app/utils/content_type.py
services/crawler/tests/conftest.py
services/crawler/tests/test_crawl_single_url.py
services/crawler/tests/test_docx_converter.py
services/crawler/tests/test_web_image_extractor.py
services/crawler/tests/test_web_router.py
services/platform/convex/agent_tools/files/docx_tool.ts
services/platform/convex/agent_tools/files/image_tool.ts
services/platform/convex/agent_tools/web/helpers/fetch_and_extract.ts
services/platform/convex/agent_tools/web/helpers/types.ts
services/platform/convex/agent_tools/web/web_tool.test.ts
services/platform/convex/agent_tools/web/web_tool.ts
services/platform/convex/documents/generate_document_helpers.ts
services/platform/convex/documents/internal_actions.ts
services/platform/convex/documents/types.ts

coderabbitai · 2026-02-24T21:23:05Z

        )


+_DOCX_CONTENT_TYPE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


🧹 Nitpick | 🔵 Trivial

Move _DOCX_CONTENT_TYPE to the top-level constant block with _FILE_TEMPLATE / _FILE_UPLOAD.

Currently the constant sits mid-file between two unrelated endpoint handlers (lines 25-26 house the other module-level constants), making it harder to locate. Relocating it next to the existing _FILE_* constants keeps all module-level declarations in one place.

♻️ Proposed move

_FILE_TEMPLATE = File(None, description="Optional template DOCX file to use as base") _FILE_UPLOAD = File(..., description="DOCX file to parse") +_DOCX_CONTENT_TYPE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

-_DOCX_CONTENT_TYPE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document" - - `@router.post`("/from-markdown")

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/crawler/app/routers/docx.py` at line 216, Move the module constant _DOCX_CONTENT_TYPE into the top-level constants block alongside _FILE_TEMPLATE and _FILE_UPLOAD to centralize module-level declarations; locate the current mid-file definition of _DOCX_CONTENT_TYPE and cut it, then paste it into the existing constants section (near _FILE_TEMPLATE/_FILE_UPLOAD) so all file-related constants are grouped together and remove the original mid-file definition to avoid duplication.

coderabbitai · 2026-02-24T21:23:05Z

+@router.post("/from-markdown")
+async def convert_markdown_to_docx(request: MarkdownToDocxRequest):
+    """
+    Convert Markdown content to DOCX.
+
+    Parses markdown into HTML, then extracts structure (headings, paragraphs,
+    lists, tables, etc.) and generates a Word document.
+
+    Args:
+        request: Markdown content
+
+    Returns:
+        DOCX file as binary response
+    """
+    try:
+        docx_service = get_docx_service()
+        docx_bytes = await docx_service.markdown_to_docx(request.content)
+
+        return Response(
+            content=docx_bytes,
+            media_type=_DOCX_CONTENT_TYPE,
+            headers={"Content-Disposition": "attachment; filename=document.docx"},
+        )
+
+    except Exception:
+        logger.exception("Error converting markdown to DOCX")
+        raise HTTPException(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+            detail="Failed to convert markdown to DOCX",
+        ) from None
+
+
+@router.post("/from-html")
+async def convert_html_to_docx(request: HtmlToDocxRequest):
+    """
+    Convert HTML content to DOCX.
+
+    Parses HTML to extract structure (headings, paragraphs, lists, tables, etc.)
+    and generates a Word document.
+
+    Args:
+        request: HTML content
+
+    Returns:
+        DOCX file as binary response
+    """
+    try:
+        docx_service = get_docx_service()
+        docx_bytes = await docx_service.html_to_docx(request.html)
+
+        return Response(
+            content=docx_bytes,
+            media_type=_DOCX_CONTENT_TYPE,
+            headers={"Content-Disposition": "attachment; filename=document.docx"},
+        )
+
+    except Exception:
+        logger.exception("Error converting HTML to DOCX")
+        raise HTTPException(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+            detail="Failed to convert HTML to DOCX",
+        ) from None


⚠️ Potential issue | 🟡 Minor

Empty inputs will silently produce an empty/trivial DOCX with HTTP 200.

MarkdownToDocxRequest.content and HtmlToDocxRequest.html are declared with Field(...) (required but no min_length), so an empty string "" passes Pydantic validation and reaches markdown_to_docx/html_to_docx. The caller receives a valid DOCX binary response with no indication that the input was empty.

Enforce a minimum length on both request model fields, for example:

# In app/models.py → MarkdownToDocxRequest content: str = Field(..., min_length=1, description="Markdown content to convert") # In app/models.py → HtmlToDocxRequest html: str = Field(..., min_length=1, description="HTML content to convert")

Alternatively, add an early guard inside each endpoint before calling the service:

🛡️ Proposed guard in each endpoint

try: docx_service = get_docx_service() + if not request.content.strip(): + raise HTTPException( + status_code=status.HTTP_400_BAD_REQUEST, + detail="Markdown content must not be empty", + ) docx_bytes = await docx_service.markdown_to_docx(request.content)

try: docx_service = get_docx_service() + if not request.html.strip(): + raise HTTPException( + status_code=status.HTTP_400_BAD_REQUEST, + detail="HTML content must not be empty", + ) docx_bytes = await docx_service.html_to_docx(request.html)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/crawler/app/routers/docx.py` around lines 219 - 280, The endpoints convert_markdown_to_docx and convert_html_to_docx accept empty strings because the Pydantic fields lack a min_length; enforce validation by updating MarkdownToDocxRequest.content and HtmlToDocxRequest.html in the models to use Field(..., min_length=1, description=...) so empty input is rejected with 422, or as an alternative add an early guard in each endpoint (convert_markdown_to_docx and convert_html_to_docx) to check if request.content or request.html is empty/blank and raise HTTPException(status_code=400, detail="Empty content") before calling get_docx_service().

coderabbitai · 2026-02-24T21:23:05Z

+@router.post("/from-markdown")
+async def convert_markdown_to_docx(request: MarkdownToDocxRequest):


🧹 Nitpick | 🔵 Trivial

New endpoints are missing OpenAPI binary-response documentation.

Because Response is returned directly (no response_model), FastAPI omits any schema for these routes. Adding a responses annotation in the decorator self-documents the binary content type in the generated OpenAPI spec, improving client discoverability.

♻️ Proposed annotation

-@router.post("/from-markdown") +@router.post( + "/from-markdown", + responses={200: {"content": {_DOCX_CONTENT_TYPE: {}}, "description": "DOCX binary file"}}, +) async def convert_markdown_to_docx(request: MarkdownToDocxRequest): -@router.post("/from-html") +@router.post( + "/from-html", + responses={200: {"content": {_DOCX_CONTENT_TYPE: {}}, "description": "DOCX binary file"}}, +) async def convert_html_to_docx(request: HtmlToDocxRequest):

Also applies to: 251-252

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/crawler/app/routers/docx.py` around lines 219 - 220, The POST route decorator `@router.post`("/from-markdown") for function convert_markdown_to_docx returns a raw Response but lacks OpenAPI metadata for binary output; update its decorator to include a responses entry that documents a 200 response with the binary content-type (e.g. application/vnd.openxmlformats-officedocument.wordprocessingml.document or application/octet-stream) and a schema of type "string" with format "binary", and apply the same responses annotation to the other POST route in this file that returns a Response (the endpoint around the other POST at lines ~251-252) so both endpoints self-document binary responses in the generated OpenAPI spec.

coderabbitai · 2026-02-24T21:23:05Z

+        return Response(
+            content=docx_bytes,
+            media_type=_DOCX_CONTENT_TYPE,
+            headers={"Content-Disposition": "attachment; filename=document.docx"},
+        )


🧹 Nitpick | 🔵 Trivial

Hardcoded filename=document.docx makes it impossible for callers to obtain a meaningful filename.

Both endpoints return attachment; filename=document.docx regardless of any caller-supplied name. Adding an optional filename field to MarkdownToDocxRequest / HtmlToDocxRequest (defaulting to "document") would let clients receive a semantically named file with no breaking change.

♻️ Proposed refactor

In app/models.py:

class MarkdownToDocxRequest(BaseModel): content: str = Field(..., description="Markdown content to convert") + filename: str = Field("document", description="Output filename (without extension)") class HtmlToDocxRequest(BaseModel): html: str = Field(..., description="HTML content to convert") + filename: str = Field("document", description="Output filename (without extension)")

In the router:

- headers={"Content-Disposition": "attachment; filename=document.docx"}, + headers={"Content-Disposition": f'attachment; filename="{request.filename}.docx"'},

Also applies to: 269-273

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/crawler/app/routers/docx.py` around lines 237 - 241, Add an optional filename field (string, default "document") to the request models MarkdownToDocxRequest and HtmlToDocxRequest in app/models.py, and update the docx router endpoints in services/crawler/app/routers/docx.py that return Response (the blocks that currently set headers {"Content-Disposition": "attachment; filename=document.docx"}) to use the request.filename value to construct the Content-Disposition header (append ".docx" if missing and ensure a safe filename). Update any references to the endpoint handlers (e.g., the MarkdownToDocx and HtmlToDocx route functions) to read request.filename and inject it into the Response headers so callers receive a meaningful filename by defaulting to "document" when none is provided.

coderabbitai · 2026-02-24T21:23:05Z

+async def _probe_url(url_str: str, timeout: float) -> tuple[str, str | None, str]:
+    """
+    Probe a URL with a HEAD request to determine Content-Type.
+
+    Returns:
+        Tuple of (content_type_header, extension_or_None, category).
+    """
+    try:
+        async with httpx.AsyncClient(follow_redirects=True, timeout=timeout, headers=_HTTP_HEADERS) as client:
+            response = await client.head(url_str)
+            content_type = response.headers.get("content-type", "")
+            ext, category = detect_type_from_content_type(content_type)
+            return content_type, ext, category
+    except httpx.HTTPError:
+        logger.debug(f"HEAD request failed for {url_str}, will try URL extension detection")
+        return "", None, "unknown"
+


⚠️ Potential issue | 🔴 Critical

Redirects can bypass SSRF validation.

follow_redirects=True allows a public URL to redirect to a private IP after the initial hostname check, which defeats the SSRF guard. Validate each redirect target before issuing the next request (or disable auto-redirects and handle redirects manually).

🔒 Suggested redirect validation

-from urllib.parse import urlparse +from urllib.parse import urlparse, urljoin

async def _probe_url(url_str: str, timeout: float) -> tuple[str, str | None, str]: @@ - try: - async with httpx.AsyncClient(follow_redirects=True, timeout=timeout, headers=_HTTP_HEADERS) as client: - response = await client.head(url_str) - content_type = response.headers.get("content-type", "") - ext, category = detect_type_from_content_type(content_type) - return content_type, ext, category + try: + async with httpx.AsyncClient(follow_redirects=False, timeout=timeout, headers=_HTTP_HEADERS) as client: + current_url = url_str + for _ in range(5): + response = await client.head(current_url) + if response.is_redirect: + location = response.headers.get("location") + if not location: + break + next_url = urljoin(current_url, location) + validate_url_not_private(next_url) + current_url = next_url + continue + content_type = response.headers.get("content-type", "") + ext, category = detect_type_from_content_type(content_type) + return content_type, ext, category + return "", None, "unknown" except httpx.HTTPError: logger.debug(f"HEAD request failed for {url_str}, will try URL extension detection") return "", None, "unknown"

async def _download_file(url_str: str, timeout: float) -> tuple[bytes, str]: @@ - async with httpx.AsyncClient(follow_redirects=True, timeout=timeout, headers=_HTTP_HEADERS) as client: - async with client.stream("GET", url_str) as response: - response.raise_for_status() - content_type = response.headers.get("content-type", "") - content_length = response.headers.get("content-length") - - if content_length and int(content_length) > MAX_FILE_DOWNLOAD_BYTES: - raise HTTPException( - status_code=status.HTTP_400_BAD_REQUEST, - detail=f"File too large ({int(content_length)} bytes). Maximum: {MAX_FILE_DOWNLOAD_BYTES} bytes.", - ) - - chunks = [] - total = 0 - async for chunk in response.aiter_bytes(): - total += len(chunk) - if total > MAX_FILE_DOWNLOAD_BYTES: - raise HTTPException( - status_code=status.HTTP_400_BAD_REQUEST, - detail=f"File too large (>{MAX_FILE_DOWNLOAD_BYTES} bytes). Download aborted.", - ) - chunks.append(chunk) - - return b"".join(chunks), content_type + async with httpx.AsyncClient(follow_redirects=False, timeout=timeout, headers=_HTTP_HEADERS) as client: + current_url = url_str + for _ in range(5): + async with client.stream("GET", current_url) as response: + if response.is_redirect: + location = response.headers.get("location") + if not location: + raise HTTPException( + status_code=status.HTTP_400_BAD_REQUEST, + detail="Redirect without Location header", + ) + next_url = urljoin(current_url, location) + validate_url_not_private(next_url) + current_url = next_url + continue + response.raise_for_status() + content_type = response.headers.get("content-type", "") + content_length = response.headers.get("content-length") + + if content_length and int(content_length) > MAX_FILE_DOWNLOAD_BYTES: + raise HTTPException( + status_code=status.HTTP_400_BAD_REQUEST, + detail=f"File too large ({int(content_length)} bytes). Maximum: {MAX_FILE_DOWNLOAD_BYTES} bytes.", + ) + + chunks = [] + total = 0 + async for chunk in response.aiter_bytes(): + total += len(chunk) + if total > MAX_FILE_DOWNLOAD_BYTES: + raise HTTPException( + status_code=status.HTTP_400_BAD_REQUEST, + detail=f"File too large (>{MAX_FILE_DOWNLOAD_BYTES} bytes). Download aborted.", + ) + chunks.append(chunk) + + return b"".join(chunks), content_type + raise HTTPException( + status_code=status.HTTP_400_BAD_REQUEST, + detail="Too many redirects", + )

Also applies to: 97-130

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/crawler/app/routers/web.py` around lines 79 - 95, The current _probe_url function uses httpx.AsyncClient with follow_redirects=True which lets redirects bypass your SSRF hostname/IP checks; change to disable automatic redirects and instead perform the HEAD (and any subsequent redirect targets) manually by setting follow_redirects=False on the AsyncClient, inspect the Location header on 3xx responses, validate each redirect URL's hostname/IP against your SSRF allowlist/blocklist before issuing the next request, and loop up to a safe redirect limit; apply the same manual-redirect validation approach to the similar request flow in the other block referenced (lines ~97-130) so no redirect target is fetched without validation.

coderabbitai · 2026-02-24T21:23:06Z

+class TestExtractAndDescribeImages:
+    """Tests for extract_and_describe_images()."""
+
+    async def test_returns_empty_for_no_images(self):
+        descriptions, vision_used = await extract_and_describe_images([], "https://example.com")
+        assert descriptions == []
+        assert vision_used is False
+
+    async def test_returns_empty_when_all_filtered(self):
+        images = [{"src": "data:image/png;base64,tiny", "score": 5.0}]
+        descriptions, vision_used = await extract_and_describe_images(images, "https://example.com")
+        assert descriptions == []
+        assert vision_used is False


⚠️ Potential issue | 🔴 Critical

Missing pytest.mark.asyncio — async tests won't execute.

Same issue as test_crawl_single_url.py: the async test methods in TestExtractAndDescribeImages lack the asyncio marker and will silently "pass" without running.

🔧 Proposed fix: add module-level marker

import pytest from app.services.web_image_extractor import ( MAX_IMAGES_TO_PROCESS, MIN_IMAGE_BYTES, _filter_image_candidates, extract_and_describe_images, ) + +pytestmark = pytest.mark.asyncio

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/crawler/tests/test_web_image_extractor.py` around lines 98 - 110, The async test methods in TestExtractAndDescribeImages (test_returns_empty_for_no_images, test_returns_empty_when_all_filtered) lack the pytest asyncio marker so they won't run; fix by either adding `@pytest.mark.asyncio` above the TestExtractAndDescribeImages class or setting a module-level marker (pytestmark = pytest.mark.asyncio) at the top of the test module so the async tests that call extract_and_describe_images(...) execute under asyncio.

coderabbitai · 2026-02-24T21:23:06Z

+            (
+                "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
+                ".docx",
+                "document",
+            ),
+            (
+                "application/vnd.openxmlformats-officedocument.presentationml.presentation",
+                ".pptx",
+                "document",
+            ),
+            ("application/msword", ".doc", "document"),
+            ("application/vnd.ms-powerpoint", ".pptx", "document"),
+        ],


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Inspect MIME-to-extension mapping and parser support echo "=== Searching for vnd.ms-powerpoint MIME mapping ===" rg -n "vnd\.ms-powerpoint" services/crawler/ echo -e "\n=== Searching for DOCUMENT_CONTENT_TYPES definition ===" rg -n "DOCUMENT_CONTENT_TYPES" services/crawler/ -C 3 echo -e "\n=== Searching for ppt/pptx parser support ===" rg -n "ppt" services/crawler/app/services/file_parser_service.py echo -e "\n=== Checking content_type.py for PowerPoint mappings ===" cat -n services/crawler/app/utils/content_type.py | head -100

Repository: tale-project/tale

Length of output: 4698

Correct the legacy PowerPoint MIME type mapping.

application/vnd.ms-powerpoint is the MIME type for legacy Office 97-2003 .ppt files, not .pptx. The current mapping in content_type.py:10 incorrectly assigns it to .pptx. Additionally, the parser only supports .pptx (via python-pptx in file_parser_service.py), with no handler for .ppt files. This mismatch will cause parsing failures for legacy PowerPoint files. Either map application/vnd.ms-powerpoint to .ppt with corresponding parser support, or document why legacy format support is intentionally excluded.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/crawler/tests/test_web_router.py` around lines 82 - 94, The MIME mapping for legacy PowerPoint is incorrect: update the mapping in content_type.py (the CONTENT_TYPE_MAPPING or equivalent tuple/list that contains ("application/vnd.ms-powerpoint", ...)) to map "application/vnd.ms-powerpoint" to ".ppt" (not ".pptx"), and then either add support for .ppt in file_parser_service.py (e.g., implement or wire a legacy PPT handler/parser used by parse_file / the existing parser registry) or explicitly remove the .ppt mapping and add a clear comment/doc explaining legacy .ppt is intentionally unsupported; also update tests (e.g., test_web_router.py) to expect the corrected mapping or documented exclusion so parser and mapping remain consistent.

coderabbitai · 2026-02-24T21:23:06Z

+
+      // Mode A: Generate from markdown/html content
+      if (args.sourceType && args.content) {
+        debugLog('tool:docx generate from content start', {
+          fileName: args.fileName,
+          sourceType: args.sourceType,
+        });
+
+        try {
+          const result = await ctx.runAction(
+            internal.documents.internal_actions.generateDocument,
+            {
+              fileName: args.fileName,
+              sourceType: args.sourceType,
+              outputFormat: 'docx',
+              content: args.content,
+            },
+          );
+
+          debugLog('tool:docx generate from content success', {
+            fileName: result.fileName,
+            fileId: result.fileId,
+            size: result.size,
+          });
+
+          return {
+            operation: 'generate',
+            ...result,
+          } as GenerateDocxResult;
+        } catch (error) {
+          console.error('[tool:docx generate from content] error', {
+            fileName: args.fileName,
+            error: error instanceof Error ? error.message : String(error),
+          });
+          throw error;
+        }
+      }


⚠️ Potential issue | 🟡 Minor

Partial Mode A args (sourceType without content, or vice versa) silently fall through to Mode B.

If a caller provides sourceType but omits content (or vice versa), the condition on line 298 is false and execution falls through to the sections check on line 335, which produces an error message about missing sections or sourceType + content. This could be confusing since the user clearly intended Mode A. Consider validating explicitly:

🛡️ Proposed validation

// Mode A: Generate from markdown/html content if (args.sourceType && args.content) { + } else if (args.sourceType || args.content) { + throw new Error( + "Both 'sourceType' and 'content' are required for markdown/HTML generation. Provide both or use 'sections' instead.", + ); + }

Or equivalently, add the check before the Mode A block.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/agent_tools/files/docx_tool.ts` around lines 296 - 332, When handling Mode A in the docx tool, add an explicit validation that if either args.sourceType or args.content is provided but the other is missing, you raise a clear error (or reject) indicating both sourceType and content are required for "generate from content" mode; implement this check just before the Mode A branch that uses args.sourceType && args.content (the block that calls internal.documents.internal_actions.generateDocument) so partial inputs don't silently fall through to the sections/Mode B logic and the caller receives a meaningful message about the missing parameter(s).

coderabbitai · 2026-02-24T21:23:06Z

+      const targetUrl = args.url || extractUrl(args.query);
+
+      if (targetUrl) {
+        const instruction = args.url ? args.query : undefined;
+
+        const result = await fetchAndExtract(ctx, {
+          url: targetUrl,
+          instruction,
+        });


⚠️ Potential issue | 🟠 Major

Don’t drop the user instruction when a URL is embedded in query.

Right now instruction becomes undefined when the URL is parsed from query, even though the description says the query guides extraction. This loses intent for inputs like “Summarize https://…”.

✅ Suggested fix

- const instruction = args.url ? args.query : undefined; + const instruction = args.url + ? args.query + : args.query.trim() === targetUrl + ? undefined + : args.query.replace(targetUrl, '').trim() || args.query;

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/agent_tools/web/web_tool.ts` around lines 59 - 67, The code drops the user instruction when a URL is extracted from args.query; change the assignment so the extraction instruction is preserved (e.g., set instruction = args.query when present, not undefined when extractUrl returned the URL). Update the block around targetUrl/instruction in web_tool.ts so that when targetUrl is truthy you pass the original query (args.query) or the query-without-url as the instruction into fetchAndExtract (referencing extractUrl, targetUrl, instruction, and fetchAndExtract) instead of setting instruction to undefined.

coderabbitai · 2026-02-24T21:23:06Z

+        const meta = [
+          result.title ? `Title: ${result.title}` : null,
+          `URL: ${result.url}`,
+          `Words: ${result.word_count}`,
+          result.page_count ? `Pages: ${result.page_count}` : null,
+          result.vision_used ? 'Vision: used for extraction' : null,
+          result.truncated ? '(content truncated)' : null,
+        ]
+          .filter(Boolean)
+          .join(' | ');
+


🧹 Nitpick | 🔵 Trivial

Surface content_type in the metadata line.

Once content_type is passed through, include it in meta so users can see whether the content came from a PDF, image, or page.

✅ Suggested tweak

const meta = [ result.title ? `Title: ${result.title}` : null, + result.content_type ? `Content-Type: ${result.content_type}` : null, `URL: ${result.url}`, `Words: ${result.word_count}`, result.page_count ? `Pages: ${result.page_count}` : null, result.vision_used ? 'Vision: used for extraction' : null, result.truncated ? '(content truncated)' : null, ]

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/agent_tools/web/web_tool.ts` around lines 73 - 83, The metadata string built in the meta array doesn't include the result.content_type; update the construction of meta (in web_tool.ts around the code that builds meta) to add a conditional entry like `result.content_type ? \`Type: ${result.content_type}\` : null` so the content type (PDF, image, page, etc.) is shown along with Title, URL, Words, Pages, Vision, and truncated flags before the .filter(Boolean).join(' | ') call.

… and image quality presets (#556)

greptile-apps Bot reviewed Feb 24, 2026

View reviewed changes

Comment thread services/crawler/app/utils/content_type.py Outdated

larryro added 3 commits February 25, 2026 05:15

larryro force-pushed the feat/web-tool-multi-format-extraction branch from 066672b to 91f03f8 Compare February 24, 2026 21:16

larryro added 2 commits February 25, 2026 05:19

fix(crawler): combine nested async with statements to fix SIM117 lint

2583947

fix(crawler): add leading dot to image extensions from content-type d…

6474577

…etection Image extensions derived from MIME types were missing the leading dot (e.g. "png" instead of ".png"), inconsistent with document extensions and breaking _build_filename().

coderabbitai Bot requested changes Feb 24, 2026

View reviewed changes

larryro merged commit 2015851 into main Feb 24, 2026
17 checks passed

larryro deleted the feat/web-tool-multi-format-extraction branch February 24, 2026 21:36

coderabbitai Bot mentioned this pull request Mar 18, 2026

feat(agent): generate file download cards in chat via appendFilePart #810

Merged

5 tasks

yannickmonney pushed a commit that referenced this pull request Apr 8, 2026

feat(crawler,platform): multi-format web extraction, DOCX generation,…

92cc32d

… and image quality presets (#556)

		)


		_DOCX_CONTENT_TYPE = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

		@router.post("/from-markdown")
		async def convert_markdown_to_docx(request: MarkdownToDocxRequest):

Conversation

larryro commented Feb 24, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Release Notes

Uh oh!

greptile-apps Bot commented Feb 24, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot commented Feb 24, 2026

Walkthrough

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

larryro commented Feb 24, 2026 •

edited by coderabbitai Bot

Loading