feat(platform): surface OCR visibility with scanned page detection by yannickmonney · Pull Request #1401 · tale-project/tale

yannickmonney · 2026-04-11T02:39:52Z

Summary

Add scanned page detection metadata (scanned_pages_detected, ocr_applied) to both PDF extraction pipelines (tale_knowledge and crawler service)
Return structured PdfExtractionResult from tale_knowledge and PdfExtractionMetadata from the crawler so downstream consumers can inform users about OCR activity
Detect scanned pages using a text threshold combined with a large-image area heuristic, and warn when scanned pages are found but no vision client is configured
Add i18n keys for OCR status messages (documents.ocr.*) in en.json and de.json
Update all tests for the new PdfExtractionResult return type and add a new test for scanned page detection without a vision client

Test plan

Platform lint passes (0 errors)
Platform typecheck passes
Platform server tests pass (1955 tests)
Platform client tests pass (3807 tests)
Verify PDF extraction with scanned pages returns correct scanned_pages_detected count
Verify OCR warning logged when vision client is None and scanned pages detected

Closes #1206

Summary by CodeRabbit

Release Notes

New Features
- Added automatic detection of scanned pages in PDF documents.
- Document preview now displays the count of detected scanned pages.
- Enhanced OCR status reporting with improved warning messages when vision model is unavailable.
Improvements
- Refined vision client configuration feedback with clearer diagnostic messages.

coderabbitai · 2026-04-11T02:39:59Z

📝 Walkthrough

Walkthrough

The pull request introduces OCR detection and tracking for scanned PDF documents across the system. In the knowledge extraction layer, extract_text_from_pdf_bytes now returns a PdfExtractionResult containing text, vision usage flags, and counts of scanned pages and OCR application. The crawler service extends this with metadata endpoints that detect scanned pages by analyzing embedded images and low text content, reporting these findings in file metadata responses. The platform service establishes a new async metadata extraction pipeline: saveFileMetadata schedules extractFileMetadata to run asynchronously, which contacts the crawler's extraction endpoint and persists results via updateFileVisionMetadata. Document-file linking and extraction flows are refactored to use these new mutations, and the UI displays scanned page counts when vision is required. Support for i18n messages for OCR status and warnings is added.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

tale-project/tale\feat(platform): configurable file retention policies #1391: Both PRs extend the Convex fileMetadata surface (schema, queries, mutations, and internal actions) to add and propagate new file-metadata fields and background processing flows.
tale-project/tale\feat(platform): simplify file tools — auto RAG indexing, remove parse/template ops #1407: Both PRs refactor the file-metadata save flow by introducing scheduled internal actions via ctx.scheduler.runAfter in saveFileMetadata and adding new internal mutations around file-metadata persistence.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main objective: surfacing OCR visibility through scanned page detection, which is the core focus of the PR changes across PDF extraction and UI layers.
Linked Issues check	✅ Passed	All objectives from `#1206` are met: scanned page detection is implemented via text-threshold + image-area heuristic, OCR activity is tracked and propagated through extraction pipelines, and UI displays scanned page counts to users.
Out of Scope Changes check	✅ Passed	All changes are directly related to `#1206`'s objective of exposing OCR capability and scanned-page detection; no unrelated modifications to other features or subsystems are present.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/issue-1206-ocr-visibility

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

yannickmonney · 2026-04-11T02:40:02Z

@coderabbitai review

coderabbitai · 2026-04-11T02:40:07Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Add scanned page detection metadata (scanned_pages_detected, ocr_applied) to both PDF extraction pipelines. Return structured PdfExtractionResult from tale_knowledge and PdfExtractionMetadata from the crawler service so downstream consumers can inform users about OCR activity. - Detect scanned pages using text threshold + large-image heuristic - Warn when scanned pages found but no vision client configured - Propagate OCR metadata through crawler file parser response - Add i18n keys for OCR status in en.json and de.json - Update tests for new PdfExtractionResult return type

Move scanned page detection from crawler through fileMetadata to the document preview sidebar. Consolidate all direct fileMetadata inserts into the saveFileMetadata mutation so every upload path automatically triggers metadata extraction. - Extend crawler extract-metadata endpoints with scanned_pages_detected - Add pageCount, scannedPagesDetected, visionRequired to fileMetadata schema - Add extractFileMetadata action triggered by saveFileMetadata on insert - Replace extractDocumentDates with extractFileMetadata (handles dates + vision metadata) - Display scanned page count in document preview sidebar

Persist vision_used as ocr_applied in the RAG documents table during indexing, return it in the status query, and propagate through the platform polling path into fileMetadata. The document preview sidebar now shows whether OCR was applied after indexing completes.

coderabbitai

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)

services/crawler/app/routers/docx.py (1)
494-502: ⚠️ Potential issue | 🟠 Major

Avoid hardcoding a “no scanned pages” result for DOCX metadata.

This handler does not perform any scanned-page detection, but downstream code treats 0 as a definitive signal and sets visionRequired: false in services/platform/convex/file_metadata/internal_actions.ts:166-174. For image-only DOCX files that suppresses the OCR hint even though nothing was checked.
Suggested minimal fix
         return FileMetadataResponse(
             success=True,
             filename=filename,
             file_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
             title=meta["title"] or None,
             author=meta["author"] or None,
             created_at=meta["created_at"],
             modified_at=meta["modified_at"],
-            scanned_pages_detected=0,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/routers/docx.py` around lines 494 - 502, The handler is
hardcoding scanned_pages_detected=0 which wrongly signals "no scanned pages"
downstream; update the response in services/crawler/app/routers/docx.py to omit
that hardcoded zero and return a null/undefined value instead (e.g.,
scanned_pages_detected=None) or remove the scanned_pages_detected field so
FileMetadataResponse does not assert 0; locate the return constructing
FileMetadataResponse and change the scanned_pages_detected assignment to None
(or omit the property) so downstream logic (e.g.,
services/platform/convex/file_metadata/internal_actions.ts) doesn't treat it as
a definitive "no OCR needed" signal.
services/crawler/app/routers/pptx.py (1)
295-304: ⚠️ Potential issue | 🟠 Major

Don't report scanned_pages_detected=0 without actually checking the slides.

This endpoint never inspects PPTX slides for scan-like/image-only content, but services/platform/convex/file_metadata/internal_actions.ts:166-174 turns 0 into visionRequired: false. That creates a false negative for image-only decks and hides OCR need in downstream UI. Return None until detection exists, or add a real heuristic here.
Suggested minimal fix
         return FileMetadataResponse(
             success=True,
             filename=filename,
             file_type="application/vnd.openxmlformats-officedocument.presentationml.presentation",
             title=meta["title"] or None,
             author=meta["author"] or None,
             slide_count=len(prs.slides),
             created_at=meta["created_at"],
             modified_at=meta["modified_at"],
-            scanned_pages_detected=0,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/routers/pptx.py` around lines 295 - 304, The code
currently returns scanned_pages_detected=0 in the FileMetadataResponse (returned
from the PPTX metadata handler using prs.slides), which causes downstream logic
to treat decks as visionNotRequired; change the response to return
scanned_pages_detected=None until real detection is implemented (or implement a
real heuristic that inspects prs.slides for image-only slides), i.e., update the
FileMetadataResponse construction to set scanned_pages_detected=None instead of
0 (refer to FileMetadataResponse and the place where prs.slides is used).
services/platform/convex/file_metadata/internal_mutations.ts (1)
27-43: ⚠️ Potential issue | 🟠 Major

Reschedule metadata extraction when reusing an existing fileMetadata row.

extractFileMetadata is only queued on insert. Now that createDocumentFromUpload reuses existing rows and links the documentId afterward, a retry/re-link path can return from Line 43 without ever rerunning extraction, so the later document never gets its source dates (and any fresh vision metadata) backfilled. Please trigger the extraction action for the existing-row path too, or from linkDocumentToFile when attaching a previously unlinked record.

Also applies to: 68-76
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/file_metadata/internal_mutations.ts` around lines 27
- 43, When reusing an existing fileMetadata row in createDocumentFromUpload (the
branch that finds existing and calls ctx.db.patch on existing._id), ensure
extractFileMetadata is queued for that record too so metadata (source dates,
vision data) gets backfilled; add a call to the same extraction scheduling
mechanism used on insert (or invoke extractFileMetadata scheduling) immediately
after the patch/return of existing._id. Also add the same scheduling call inside
linkDocumentToFile when attaching a previously-unlinked fileMetadata record so
linking triggers metadata extraction for that record as well.
services/crawler/app/services/vision/pdf_extractor.py (1)
125-133: ⚠️ Potential issue | 🔴 Critical

Decouple scanned-page detection from OCR execution.

This branch only marks a page as scanned when ocr_scanned_pages is true, so deployments without Vision will report scanned_pages_detected = 0 even for obviously scanned PDFs. It also ignores the large-image-area check described in the module docstring/PR, which will overcount low-text pages as scanned. Compute is_scanned_page from layout alone, then gate only the OCR call on ocr_scanned_pages.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/services/vision/pdf_extractor.py` around lines 125 -
133, The code currently sets is_scanned_page only when ocr_scanned_pages is true
and skips the large-image-area heuristic; change the flow in the page-processing
block (references: MIN_TEXT_THRESHOLD, ocr_scanned_pages, _ocr_page,
is_scanned_page, scanned_pages_detected) so that is_scanned_page is computed
purely from layout heuristics (text length vs MIN_TEXT_THRESHOLD and the
large-image-area check described in the module docstring) regardless of
ocr_scanned_pages, increment/update scanned_pages_detected based on that
boolean, and then separately gate the OCR call: if is_scanned_page and
ocr_scanned_pages then call await _ocr_page(...), set elements and vision_used
accordingly and emit the debug log about sending to Vision API only when OCR is
actually invoked.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/tale_knowledge/tests/test_pdf.py`:
- Around line 212-219: The test test_scanned_page_without_vision_client should
also assert the warning is emitted when a scanned page is detected without a
vision client: wrap the call to extract_text_from_pdf_bytes(vision_client=None)
in a logging/warning capture (e.g., use pytest's caplog or pytest.warns) and
assert that the expected warning message (the one produced by the
scanned-page-without-vision-client code path) appears; update the test to
reference extract_text_from_pdf_bytes and assert the warning text and/or warning
category so CI verifies the warning path.

In `@services/crawler/app/routers/pdf.py`:
- Around line 255-267: The current loop only uses the large_image_ratio
heuristic (variables page.get_images, bbox, large_image_ratio, scanned_count) to
mark scanned pages, which differs from the extraction pipeline (e.g.,
extract_text_from_pdf_bytes) that also uses low-text detection and sets
scanned_pages_detected; replace or augment this logic to call the shared
scanned-page detector or the existing metadata helper used by
extract_text_from_pdf_bytes so both paths use the same detection logic and value
for scanned_pages_detected, i.e., invoke the common function instead of only
counting large-image pages (or combine both heuristics via that shared helper)
so the UI/visionRequired flag stays consistent.

In `@services/crawler/app/services/vision/pdf_extractor.py`:
- Around line 187-193: The code currently uses page_vision_used (set by both
_ocr_page() and image descriptions) to compute ocr_applied, causing scanned
pages with empty OCR text but later image descriptions to be misattributed;
modify _extract_page_with_layout() to return a separate boolean page_ocr_applied
(true only when OCR actually produced text) in addition to page_vision_used,
update the process_page coroutine to unpack and return page_ocr_applied
alongside page_vision_used (and keep the existing return signature ordering),
and change the aggregation logic that computes ocr_applied (currently using
page_vision_used around lines referenced) to use the new page_ocr_applied flag
so OCR usage is tracked independently from generic Vision/image-description
usage.

In `@services/platform/convex/file_metadata/internal_actions.ts`:
- Around line 89-100: The code currently bases visionRequired solely on
args.contentType (using IMAGE_CONTENT_TYPES) so uploads with a generic MIME like
application/octet-stream and an image filename (e.g.,
.png/.jpg/.jpeg/.gif/.webp) will be misclassified; update the logic in the
branch that checks IMAGE_CONTENT_TYPES and in the later branch around
createDocumentFromUpload to also inspect the uploaded filename/extension
(args.filename or the upload metadata) and treat known image extensions as
images before taking the “no vision needed” path, then call
internal.file_metadata.internal_mutations.updateFileVisionMetadata with
visionRequired: true for those cases; ensure you reuse the same extension set
(or a small helper isKnownImageExtension(filename)) so both places behave
consistently.

In `@services/platform/convex/file_metadata/queries.ts`:
- Around line 43-55: The handler for this public query must verify the caller is
a member of args.organizationId before reading fileMetadata; after obtaining
authUser via getAuthUserIdentity and before calling ctx.db.query('fileMetadata')
(the withIndex/...first() block), invoke the existing org-membership check
helper (e.g., assertUserIsOrgMember(authUser, args.organizationId) or
ensureUserInOrganization(authUser, args.organizationId)) and return null/throw
if the check fails so only org members can access the index query.
- Around line 31-34: Change the args schema to use documentId: v.id('documents')
instead of v.string() to match the typed documents ID and restore ID-format
validation, and update the query that compares the arg against the document.id
to use the correctly typed arg; also enforce org membership by calling
getOrganizationMember(ctx, args.organizationId, authUser) immediately after
getAuthUserIdentity(...) (before returning org-scoped metadata) so only members
of the organization can fetch the metadata.

In `@services/platform/messages/de.json`:
- Around line 2975-2976: The German translation value for the key "scannedPages"
is not title-cased; update the value from "gescannte Seiten" to "Gescannte
Seiten" so it matches the capitalization style of adjacent sidebar labels (e.g.,
"sourceSharepoint" / "SharePoint") and maintains UI copy consistency; locate the
"scannedPages" entry in the translations and change only its string value.

---

Outside diff comments:
In `@services/crawler/app/routers/docx.py`:
- Around line 494-502: The handler is hardcoding scanned_pages_detected=0 which
wrongly signals "no scanned pages" downstream; update the response in
services/crawler/app/routers/docx.py to omit that hardcoded zero and return a
null/undefined value instead (e.g., scanned_pages_detected=None) or remove the
scanned_pages_detected field so FileMetadataResponse does not assert 0; locate
the return constructing FileMetadataResponse and change the
scanned_pages_detected assignment to None (or omit the property) so downstream
logic (e.g., services/platform/convex/file_metadata/internal_actions.ts) doesn't
treat it as a definitive "no OCR needed" signal.

In `@services/crawler/app/routers/pptx.py`:
- Around line 295-304: The code currently returns scanned_pages_detected=0 in
the FileMetadataResponse (returned from the PPTX metadata handler using
prs.slides), which causes downstream logic to treat decks as visionNotRequired;
change the response to return scanned_pages_detected=None until real detection
is implemented (or implement a real heuristic that inspects prs.slides for
image-only slides), i.e., update the FileMetadataResponse construction to set
scanned_pages_detected=None instead of 0 (refer to FileMetadataResponse and the
place where prs.slides is used).

In `@services/crawler/app/services/vision/pdf_extractor.py`:
- Around line 125-133: The code currently sets is_scanned_page only when
ocr_scanned_pages is true and skips the large-image-area heuristic; change the
flow in the page-processing block (references: MIN_TEXT_THRESHOLD,
ocr_scanned_pages, _ocr_page, is_scanned_page, scanned_pages_detected) so that
is_scanned_page is computed purely from layout heuristics (text length vs
MIN_TEXT_THRESHOLD and the large-image-area check described in the module
docstring) regardless of ocr_scanned_pages, increment/update
scanned_pages_detected based on that boolean, and then separately gate the OCR
call: if is_scanned_page and ocr_scanned_pages then call await _ocr_page(...),
set elements and vision_used accordingly and emit the debug log about sending to
Vision API only when OCR is actually invoked.

In `@services/platform/convex/file_metadata/internal_mutations.ts`:
- Around line 27-43: When reusing an existing fileMetadata row in
createDocumentFromUpload (the branch that finds existing and calls ctx.db.patch
on existing._id), ensure extractFileMetadata is queued for that record too so
metadata (source dates, vision data) gets backfilled; add a call to the same
extraction scheduling mechanism used on insert (or invoke extractFileMetadata
scheduling) immediately after the patch/return of existing._id. Also add the
same scheduling call inside linkDocumentToFile when attaching a
previously-unlinked fileMetadata record so linking triggers metadata extraction
for that record as well.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 82570a72-47fe-4ce9-97b5-0e8890206e32

📥 Commits

Reviewing files that changed from the base of the PR and between e80bc60 and ba80647.

📒 Files selected for processing (23)

packages/tale_knowledge/src/tale_knowledge/extraction/image.py
packages/tale_knowledge/src/tale_knowledge/extraction/pdf.py
packages/tale_knowledge/src/tale_knowledge/extraction/router.py
packages/tale_knowledge/tests/test_pdf.py
services/crawler/app/models.py
services/crawler/app/routers/docx.py
services/crawler/app/routers/pdf.py
services/crawler/app/routers/pptx.py
services/crawler/app/services/file_parser_service.py
services/crawler/app/services/vision/pdf_extractor.py
services/platform/app/features/documents/components/document-preview-dialog.tsx
services/platform/convex/agents/mutations.ts
services/platform/convex/conversations/send_message_via_integration.ts
services/platform/convex/documents/__tests__/create_document_from_upload.test.ts
services/platform/convex/documents/internal_actions.ts
services/platform/convex/documents/mutations.ts
services/platform/convex/file_metadata/internal_actions.ts
services/platform/convex/file_metadata/internal_mutations.ts
services/platform/convex/file_metadata/mutations.ts
services/platform/convex/file_metadata/queries.ts
services/platform/convex/file_metadata/schema.ts
services/platform/messages/de.json
services/platform/messages/en.json

coderabbitai · 2026-04-11T12:59:27Z

+    @pytest.mark.asyncio
+    async def test_scanned_page_without_vision_client(self):
+        pdf_bytes = _make_fullpage_image_pdf()
+
+        result = await extract_text_from_pdf_bytes(pdf_bytes, vision_client=None)
+        assert result.scanned_pages_detected == 1
+        assert result.ocr_applied is False
+        assert result.vision_used is False


🧹 Nitpick | 🔵 Trivial

Assert the warning path in this regression test.

This test proves the counters, but the new behavior also relies on emitting a warning when a scanned page is detected without a vision client. Add a log assertion here so that path is covered in CI instead of staying manual-only.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@packages/tale_knowledge/tests/test_pdf.py` around lines 212 - 219, The test test_scanned_page_without_vision_client should also assert the warning is emitted when a scanned page is detected without a vision client: wrap the call to extract_text_from_pdf_bytes(vision_client=None) in a logging/warning capture (e.g., use pytest's caplog or pytest.warns) and assert that the expected warning message (the one produced by the scanned-page-without-vision-client code path) appears; update the test to reference extract_text_from_pdf_bytes and assert the warning text and/or warning category so CI verifies the warning path.

coderabbitai · 2026-04-11T12:59:27Z

+        large_image_ratio = 0.5
+        scanned_count = 0
+        for page_num in range(page_count):
+            page = doc[page_num]
+            page_area = page.rect.get_area()
+            if page_area <= 0:
+                continue
+            for img in page.get_images(full=True):
+                bbox = page.get_image_bbox(img)
+                if bbox and bbox.get_area() / page_area > large_image_ratio:
+                    scanned_count += 1
+                    break
+


⚠️ Potential issue | 🟠 Major

Use the same scanned-page heuristic as the extraction pipeline.

This endpoint only applies the large-image-area check, but the PR’s extraction path also uses low-text detection. That means extract-metadata can return scanned_pages_detected = 0 for PDFs that extract_text_from_pdf_bytes will later treat as scanned and OCR, so the UI’s OCR indicator/visionRequired flag can drift from actual extraction behavior. Please reuse the shared detector or route this through the same metadata helper.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/crawler/app/routers/pdf.py` around lines 255 - 267, The current loop only uses the large_image_ratio heuristic (variables page.get_images, bbox, large_image_ratio, scanned_count) to mark scanned pages, which differs from the extraction pipeline (e.g., extract_text_from_pdf_bytes) that also uses low-text detection and sets scanned_pages_detected; replace or augment this logic to call the shared scanned-page detector or the existing metadata helper used by extract_text_from_pdf_bytes so both paths use the same detection logic and value for scanned_pages_detected, i.e., invoke the common function instead of only counting large-image pages (or combine both heuristics via that shared helper) so the UI/visionRequired flag stays consistent.

coderabbitai · 2026-04-11T12:59:27Z

+    async def process_page(page_num: int) -> tuple[int, str, bool, bool]:
        page = doc[page_num]
-        content, vision_used = await _extract_page_with_layout(
+        content, page_vision_used, page_is_scanned = await _extract_page_with_layout(
            page, doc, semaphore, process_images, ocr_scanned_pages, usage=usage
        )
-        return page_num, f"--- Page {page_num + 1} ---\n{content}", vision_used
+        return page_num, f"--- Page {page_num + 1} ---\n{content}", page_vision_used, page_is_scanned



⚠️ Potential issue | 🟠 Major

Track OCR usage separately from generic Vision usage.

page_vision_used is set for both _ocr_page() and embedded-image descriptions, but Lines 210-213 use it to derive ocr_applied. A scanned page where OCR returns empty text can still flip ocr_applied=True if image description ran later. Return a dedicated page_ocr_applied flag from _extract_page_with_layout() and aggregate that instead.

Also applies to: 206-213

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/crawler/app/services/vision/pdf_extractor.py` around lines 187 - 193, The code currently uses page_vision_used (set by both _ocr_page() and image descriptions) to compute ocr_applied, causing scanned pages with empty OCR text but later image descriptions to be misattributed; modify _extract_page_with_layout() to return a separate boolean page_ocr_applied (true only when OCR actually produced text) in addition to page_vision_used, update the process_page coroutine to unpack and return page_ocr_applied alongside page_vision_used (and keep the existing return signature ordering), and change the aggregation logic that computes ocr_applied (currently using page_vision_used around lines referenced) to use the new page_ocr_applied flag so OCR usage is tracked independently from generic Vision/image-description usage.

coderabbitai · 2026-04-11T12:59:27Z

+    // Images: always need vision, no crawler call needed
+    if (IMAGE_CONTENT_TYPES.has(args.contentType)) {
+      await ctx.runMutation(
+        internal.file_metadata.internal_mutations.updateFileVisionMetadata,
+        {
+          storageId: args.storageId,
+          pageCount: 1,
+          scannedPagesDetected: 0,
+          visionRequired: true,
+        },
+      );
+      return null;


⚠️ Potential issue | 🟠 Major

Treat known image extensions as images when MIME is generic.

This branch relies on contentType alone, but createDocumentFromUpload falls back to application/octet-stream when the upload has no MIME type. A .png/.jpg uploaded that way will fall through to Lines 218-226 and be marked visionRequired: false. Add an extension fallback for common image types before taking the “no vision needed” path.

Also applies to: 218-226

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/file_metadata/internal_actions.ts` around lines 89 - 100, The code currently bases visionRequired solely on args.contentType (using IMAGE_CONTENT_TYPES) so uploads with a generic MIME like application/octet-stream and an image filename (e.g., .png/.jpg/.jpeg/.gif/.webp) will be misclassified; update the logic in the branch that checks IMAGE_CONTENT_TYPES and in the later branch around createDocumentFromUpload to also inspect the uploaded filename/extension (args.filename or the upload metadata) and treat known image extensions as images before taking the “no vision needed” path, then call internal.file_metadata.internal_mutations.updateFileVisionMetadata with visionRequired: true for those cases; ensure you reuse the same extension set (or a small helper isKnownImageExtension(filename)) so both places behave consistently.

coderabbitai · 2026-04-11T12:59:27Z

+        // Write dates to linked document (if any)
+        if (createdAt != null || modifiedAt != null) {
+          const fileMetadata = await ctx.runQuery(
+            internal.file_metadata.internal_queries.getByStorageId,
+            { storageId: args.storageId },
+          );
+
+          if (fileMetadata?.documentId) {
+            await ctx.runMutation(
+              internal.documents.internal_mutations.updateDocumentDates,
+              {
+                documentId: fileMetadata.documentId,
+                sourceCreatedAt: createdAt,
+                sourceModifiedAt: modifiedAt,
+              },
+            );
+          }


⚠️ Potential issue | 🟠 Major

Don't drop extracted dates before the file is linked to its document.

saveFileMetadata schedules this action before createDocumentFromUpload later calls linkDocumentToFile in services/platform/convex/documents/mutations.ts:166-202. If the crawler responds quickly, fileMetadata.documentId is still empty here and the extracted created_at / modified_at values are discarded with no retry. Persist the dates until the link exists, or reschedule a short retry for this block.

coderabbitai · 2026-04-11T12:59:27Z

+  args: {
+    organizationId: v.string(),
+    documentId: v.string(),
+  },


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n services/platform/convex/file_metadata/queries.ts | head -80

Repository: tale-project/tale

Length of output: 2731

🏁 Script executed:

fd -t f fileMetadata 'convex' | head -20

Repository: tale-project/tale

Length of output: 154

🏁 Script executed:

rg "fileMetadata" services/platform/convex --type ts -A 2 -B 2 | grep -E "(v\.object|documentId)" | head -30

Repository: tale-project/tale

Length of output: 1298

🏁 Script executed:

find services/platform/convex -name "*.ts" -type f | xargs grep -l "fileMetadata.*:" | head -10

Repository: tale-project/tale

Length of output: 294

🏁 Script executed:

cat -n services/platform/convex/schema.ts | grep -A 15 "fileMetadata"

Repository: tale-project/tale

Length of output: 1916

🏁 Script executed:

cat -n services/platform/convex/file_metadata/schema.ts

Repository: tale-project/tale

Length of output: 1477

🏁 Script executed:

cat -n services/platform/convex/lib/rls.ts | head -50

Repository: tale-project/tale

Length of output: 128

🏁 Script executed:

rg "getAuthUserIdentity" services/platform/convex --type ts -A 5 | head -60

Repository: tale-project/tale

Length of output: 4294

🏁 Script executed:

fd "rls" services/platform/convex --type f

Repository: tale-project/tale

Length of output: 553

🏁 Script executed:

rg "export.*getAuthUserIdentity" services/platform/convex --type ts -A 10 | head -40

Repository: tale-project/tale

Length of output: 1122

🏁 Script executed:

rg "isOrgMember" services/platform/convex --type ts -B 3 -A 3 | head -60

Repository: tale-project/tale

Length of output: 4954

🏁 Script executed:

cat -n services/platform/convex/documents/queries.ts | head -80

Repository: tale-project/tale

Length of output: 2935

🏁 Script executed:

cat -n services/platform/convex/file_metadata/queries.ts | grep -B 5 "getByDocumentId"

Repository: tale-project/tale

Length of output: 184

🏁 Script executed:

rg "getByDocumentId" services/platform/convex --type ts | head -20

Repository: tale-project/tale

Length of output: 149

Use v.id('documents') for documentId and verify org membership.

Line 33 declares documentId: v.string(), but the schema defines documentId: v.id('documents'). This causes a type mismatch on line 53 where the query compares the string argument against the typed field, breaking type safety and dropping ID-format validation.

Additionally, the query returns organization-scoped metadata without verifying the caller's org membership. Following the authorization pattern in other org-scoped queries (e.g., documents/queries.ts), add getOrganizationMember(ctx, args.organizationId, authUser) after the getAuthUserIdentity check.

Suggested fix

export const getByDocumentId = query({ args: { organizationId: v.string(), - documentId: v.string(), + documentId: v.id('documents'), }, returns: v.union( v.object({ pageCount: v.optional(v.number()), scannedPagesDetected: v.optional(v.number()), visionRequired: v.optional(v.boolean()), ocrApplied: v.optional(v.boolean()), }), v.null(), ), handler: async (ctx, args) => { const authUser = await getAuthUserIdentity(ctx); if (!authUser) return null; + + await getOrganizationMember(ctx, args.organizationId, authUser); const meta = await ctx.db

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/file_metadata/queries.ts` around lines 31 - 34, Change the args schema to use documentId: v.id('documents') instead of v.string() to match the typed documents ID and restore ID-format validation, and update the query that compares the arg against the document.id to use the correctly typed arg; also enforce org membership by calling getOrganizationMember(ctx, args.organizationId, authUser) immediately after getAuthUserIdentity(...) (before returning org-scoped metadata) so only members of the organization can fetch the metadata.

coderabbitai · 2026-04-11T12:59:27Z

+  handler: async (ctx, args) => {
+    const authUser = await getAuthUserIdentity(ctx);
+    if (!authUser) return null;
+
+    const meta = await ctx.db
+      .query('fileMetadata')
+      .withIndex('by_organizationId_and_documentId', (q) =>
+        q
+          .eq('organizationId', args.organizationId)
+          .eq('documentId', args.documentId),
+      )
+      .first();
+    if (!meta) return null;


⚠️ Potential issue | 🟠 Major

Enforce organization membership before reading fileMetadata.

This new public query only checks that the caller is authenticated. Any signed-in user who can guess an organizationId/documentId pair can probe another org’s OCR metadata. Add the org-membership check at this boundary before hitting the index.

Based on learnings, "In tale-project/tale, Convex authorization is enforced at the public API boundary (queries/mutations), while shared helpers ... remain authorization-agnostic. Do not add org-membership checks inside these helpers; perform them in the calling query/mutation."

🧰 Tools

🪛 GitHub Actions: Lint

[error] 52-52: TypeScript (tsc --noEmit) failed with TS2345: Argument of type 'string' is not assignable to parameter of type 'Id<"documents">'.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/file_metadata/queries.ts` around lines 43 - 55, The handler for this public query must verify the caller is a member of args.organizationId before reading fileMetadata; after obtaining authUser via getAuthUserIdentity and before calling ctx.db.query('fileMetadata') (the withIndex/...first() block), invoke the existing org-membership check helper (e.g., assertUserIsOrgMember(authUser, args.organizationId) or ensureUserInOrganization(authUser, args.organizationId)) and return null/throw if the check fails so only org members can access the index query.

coderabbitai · 2026-04-11T12:59:27Z

+        "sourceSharepoint": "SharePoint",
+        "scannedPages": "gescannte Seiten"


⚠️ Potential issue | 🟡 Minor

Capitalize the new sidebar label for consistency.

Line [2976] uses "gescannte Seiten" while adjacent sidebar labels are title-cased nouns. Use "Gescannte Seiten" for consistent UI copy.

Suggested fix

- "scannedPages": "gescannte Seiten" + "scannedPages": "Gescannte Seiten"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

"sourceSharepoint": "SharePoint",

"scannedPages": "gescannte Seiten"

"sourceSharepoint": "SharePoint",

"scannedPages": "Gescannte Seiten"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/platform/messages/de.json` around lines 2975 - 2976, The German translation value for the key "scannedPages" is not title-cased; update the value from "gescannte Seiten" to "Gescannte Seiten" so it matches the capitalization style of adjacent sidebar labels (e.g., "sourceSharepoint" / "SharePoint") and maintains UI copy consistency; locate the "scannedPages" entry in the translations and change only its string value.

Store scannedPagesDetected and ocrApplied directly on the document record so the table list can display them without joining fileMetadata. extractFileMetadata writes scannedPagesDetected, and the RAG polling writes ocrApplied on completion. The RAG status column shows "OCR" next to the Indexed badge when OCR was applied.

…ompletion

The rag_service already queried ocr_applied from the database but the router dropped it when constructing DocumentStatusInfo. Add the field to the Pydantic model and pass it through in the statuses endpoint.

The upload endpoint sets status to 'processing' before indexing starts, which causes find_existing_by_hash (requires status='completed') to miss the document's own row. This led to full re-extraction and vision API calls on every re-upload of the same file. Add a file_id-based content_hash check at the top of index_document that also verifies chunks exist, so only previously completed documents skip.

- Fix line-too-long lint errors in indexing_service.py early-dedup code - Add early-dedup fetchrow mock to all RAG indexing tests - Remove fileMetadata query from document preview sidebar (use document fields directly), fixing the TS2322 type error

greptile-apps Bot reviewed Apr 11, 2026

View reviewed changes

yannickmonney mentioned this pull request Apr 11, 2026

OCR capability for scanned/image documents not evident #1206

Closed

larryro force-pushed the feat/issue-1206-ocr-visibility branch from 2101968 to 14f3a6d Compare April 11, 2026 11:49

larryro added 2 commits April 11, 2026 20:45

This comment has been minimized.

Sign in to view

coderabbitai Bot requested changes Apr 11, 2026

View reviewed changes

larryro added 7 commits April 11, 2026 21:02

fix(platform): rename scanned pages label to image pages

c1f0fbd

fix(platform): rename scannedPages i18n key to imagePages

c70ce91

fix(platform): sync ocrApplied from fileMetadata to document on RAG c…

1562e69

…ompletion

fix: add ocr_applied to RAG test mocks and fix documentId type in query

8681eed

fix(rag): surface ocr_applied in document statuses API response

287b10d

The rag_service already queried ocr_applied from the database but the router dropped it when constructing DocumentStatusInfo. Add the field to the Pydantic model and pass it through in the statuses endpoint.

This comment has been minimized.

Sign in to view

larryro added 2 commits April 11, 2026 21:31

fix: adjust test mocks for early-dedup fetchrow offset

2c290c0

larryro merged commit d6577cc into main Apr 11, 2026
24 checks passed

larryro deleted the feat/issue-1206-ocr-visibility branch April 11, 2026 13:37

		"sourceSharepoint": "SharePoint",
		"scannedPages": "gescannte Seiten"

Conversation

yannickmonney commented Apr 11, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Apr 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

yannickmonney commented Apr 11, 2026

Uh oh!

coderabbitai Bot commented Apr 11, 2026

Uh oh!

This comment has been minimized.

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 11, 2026

Choose a reason for hiding this comment

Uh oh!

This comment has been minimized.

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yannickmonney commented Apr 11, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 11, 2026 •

edited

Loading