Skip to content

feat(platform): surface OCR visibility with scanned page detection#1401

Merged
larryro merged 12 commits into
mainfrom
feat/issue-1206-ocr-visibility
Apr 11, 2026
Merged

feat(platform): surface OCR visibility with scanned page detection#1401
larryro merged 12 commits into
mainfrom
feat/issue-1206-ocr-visibility

Conversation

@yannickmonney

@yannickmonney yannickmonney commented Apr 11, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add scanned page detection metadata (scanned_pages_detected, ocr_applied) to both PDF extraction pipelines (tale_knowledge and crawler service)
  • Return structured PdfExtractionResult from tale_knowledge and PdfExtractionMetadata from the crawler so downstream consumers can inform users about OCR activity
  • Detect scanned pages using a text threshold combined with a large-image area heuristic, and warn when scanned pages are found but no vision client is configured
  • Add i18n keys for OCR status messages (documents.ocr.*) in en.json and de.json
  • Update all tests for the new PdfExtractionResult return type and add a new test for scanned page detection without a vision client

Test plan

  • Platform lint passes (0 errors)
  • Platform typecheck passes
  • Platform server tests pass (1955 tests)
  • Platform client tests pass (3807 tests)
  • Verify PDF extraction with scanned pages returns correct scanned_pages_detected count
  • Verify OCR warning logged when vision client is None and scanned pages detected

Closes #1206

Summary by CodeRabbit

Release Notes

  • New Features

    • Added automatic detection of scanned pages in PDF documents.
    • Document preview now displays the count of detected scanned pages.
    • Enhanced OCR status reporting with improved warning messages when vision model is unavailable.
  • Improvements

    • Refined vision client configuration feedback with clearer diagnostic messages.

@coderabbitai

coderabbitai Bot commented Apr 11, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

The pull request introduces OCR detection and tracking for scanned PDF documents across the system. In the knowledge extraction layer, extract_text_from_pdf_bytes now returns a PdfExtractionResult containing text, vision usage flags, and counts of scanned pages and OCR application. The crawler service extends this with metadata endpoints that detect scanned pages by analyzing embedded images and low text content, reporting these findings in file metadata responses. The platform service establishes a new async metadata extraction pipeline: saveFileMetadata schedules extractFileMetadata to run asynchronously, which contacts the crawler's extraction endpoint and persists results via updateFileVisionMetadata. Document-file linking and extraction flows are refactored to use these new mutations, and the UI displays scanned page counts when vision is required. Support for i18n messages for OCR status and warnings is added.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main objective: surfacing OCR visibility through scanned page detection, which is the core focus of the PR changes across PDF extraction and UI layers.
Linked Issues check ✅ Passed All objectives from #1206 are met: scanned page detection is implemented via text-threshold + image-area heuristic, OCR activity is tracked and propagated through extraction pipelines, and UI displays scanned page counts to users.
Out of Scope Changes check ✅ Passed All changes are directly related to #1206's objective of exposing OCR capability and scanned-page detection; no unrelated modifications to other features or subsystems are present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/issue-1206-ocr-visibility

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@yannickmonney

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 11, 2026

Copy link
Copy Markdown
Contributor
✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Add scanned page detection metadata (scanned_pages_detected, ocr_applied)
to both PDF extraction pipelines. Return structured PdfExtractionResult
from tale_knowledge and PdfExtractionMetadata from the crawler service
so downstream consumers can inform users about OCR activity.

- Detect scanned pages using text threshold + large-image heuristic
- Warn when scanned pages found but no vision client configured
- Propagate OCR metadata through crawler file parser response
- Add i18n keys for OCR status in en.json and de.json
- Update tests for new PdfExtractionResult return type
@larryro larryro force-pushed the feat/issue-1206-ocr-visibility branch from 2101968 to 14f3a6d Compare April 11, 2026 11:49
larryro added 2 commits April 11, 2026 20:45
Move scanned page detection from crawler through fileMetadata to the
document preview sidebar. Consolidate all direct fileMetadata inserts
into the saveFileMetadata mutation so every upload path automatically
triggers metadata extraction.

- Extend crawler extract-metadata endpoints with scanned_pages_detected
- Add pageCount, scannedPagesDetected, visionRequired to fileMetadata schema
- Add extractFileMetadata action triggered by saveFileMetadata on insert
- Replace extractDocumentDates with extractFileMetadata (handles dates + vision metadata)
- Display scanned page count in document preview sidebar
Persist vision_used as ocr_applied in the RAG documents table during
indexing, return it in the status query, and propagate through the
platform polling path into fileMetadata. The document preview sidebar
now shows whether OCR was applied after indexing completes.
@blacksmith-sh

This comment has been minimized.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
services/crawler/app/routers/docx.py (1)

494-502: ⚠️ Potential issue | 🟠 Major

Avoid hardcoding a “no scanned pages” result for DOCX metadata.

This handler does not perform any scanned-page detection, but downstream code treats 0 as a definitive signal and sets visionRequired: false in services/platform/convex/file_metadata/internal_actions.ts:166-174. For image-only DOCX files that suppresses the OCR hint even though nothing was checked.

Suggested minimal fix
         return FileMetadataResponse(
             success=True,
             filename=filename,
             file_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
             title=meta["title"] or None,
             author=meta["author"] or None,
             created_at=meta["created_at"],
             modified_at=meta["modified_at"],
-            scanned_pages_detected=0,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/routers/docx.py` around lines 494 - 502, The handler is
hardcoding scanned_pages_detected=0 which wrongly signals "no scanned pages"
downstream; update the response in services/crawler/app/routers/docx.py to omit
that hardcoded zero and return a null/undefined value instead (e.g.,
scanned_pages_detected=None) or remove the scanned_pages_detected field so
FileMetadataResponse does not assert 0; locate the return constructing
FileMetadataResponse and change the scanned_pages_detected assignment to None
(or omit the property) so downstream logic (e.g.,
services/platform/convex/file_metadata/internal_actions.ts) doesn't treat it as
a definitive "no OCR needed" signal.
services/crawler/app/routers/pptx.py (1)

295-304: ⚠️ Potential issue | 🟠 Major

Don't report scanned_pages_detected=0 without actually checking the slides.

This endpoint never inspects PPTX slides for scan-like/image-only content, but services/platform/convex/file_metadata/internal_actions.ts:166-174 turns 0 into visionRequired: false. That creates a false negative for image-only decks and hides OCR need in downstream UI. Return None until detection exists, or add a real heuristic here.

Suggested minimal fix
         return FileMetadataResponse(
             success=True,
             filename=filename,
             file_type="application/vnd.openxmlformats-officedocument.presentationml.presentation",
             title=meta["title"] or None,
             author=meta["author"] or None,
             slide_count=len(prs.slides),
             created_at=meta["created_at"],
             modified_at=meta["modified_at"],
-            scanned_pages_detected=0,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/routers/pptx.py` around lines 295 - 304, The code
currently returns scanned_pages_detected=0 in the FileMetadataResponse (returned
from the PPTX metadata handler using prs.slides), which causes downstream logic
to treat decks as visionNotRequired; change the response to return
scanned_pages_detected=None until real detection is implemented (or implement a
real heuristic that inspects prs.slides for image-only slides), i.e., update the
FileMetadataResponse construction to set scanned_pages_detected=None instead of
0 (refer to FileMetadataResponse and the place where prs.slides is used).
services/platform/convex/file_metadata/internal_mutations.ts (1)

27-43: ⚠️ Potential issue | 🟠 Major

Reschedule metadata extraction when reusing an existing fileMetadata row.

extractFileMetadata is only queued on insert. Now that createDocumentFromUpload reuses existing rows and links the documentId afterward, a retry/re-link path can return from Line 43 without ever rerunning extraction, so the later document never gets its source dates (and any fresh vision metadata) backfilled. Please trigger the extraction action for the existing-row path too, or from linkDocumentToFile when attaching a previously unlinked record.

Also applies to: 68-76

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/file_metadata/internal_mutations.ts` around lines 27
- 43, When reusing an existing fileMetadata row in createDocumentFromUpload (the
branch that finds existing and calls ctx.db.patch on existing._id), ensure
extractFileMetadata is queued for that record too so metadata (source dates,
vision data) gets backfilled; add a call to the same extraction scheduling
mechanism used on insert (or invoke extractFileMetadata scheduling) immediately
after the patch/return of existing._id. Also add the same scheduling call inside
linkDocumentToFile when attaching a previously-unlinked fileMetadata record so
linking triggers metadata extraction for that record as well.
services/crawler/app/services/vision/pdf_extractor.py (1)

125-133: ⚠️ Potential issue | 🔴 Critical

Decouple scanned-page detection from OCR execution.

This branch only marks a page as scanned when ocr_scanned_pages is true, so deployments without Vision will report scanned_pages_detected = 0 even for obviously scanned PDFs. It also ignores the large-image-area check described in the module docstring/PR, which will overcount low-text pages as scanned. Compute is_scanned_page from layout alone, then gate only the OCR call on ocr_scanned_pages.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/services/vision/pdf_extractor.py` around lines 125 -
133, The code currently sets is_scanned_page only when ocr_scanned_pages is true
and skips the large-image-area heuristic; change the flow in the page-processing
block (references: MIN_TEXT_THRESHOLD, ocr_scanned_pages, _ocr_page,
is_scanned_page, scanned_pages_detected) so that is_scanned_page is computed
purely from layout heuristics (text length vs MIN_TEXT_THRESHOLD and the
large-image-area check described in the module docstring) regardless of
ocr_scanned_pages, increment/update scanned_pages_detected based on that
boolean, and then separately gate the OCR call: if is_scanned_page and
ocr_scanned_pages then call await _ocr_page(...), set elements and vision_used
accordingly and emit the debug log about sending to Vision API only when OCR is
actually invoked.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@packages/tale_knowledge/tests/test_pdf.py`:
- Around line 212-219: The test test_scanned_page_without_vision_client should
also assert the warning is emitted when a scanned page is detected without a
vision client: wrap the call to extract_text_from_pdf_bytes(vision_client=None)
in a logging/warning capture (e.g., use pytest's caplog or pytest.warns) and
assert that the expected warning message (the one produced by the
scanned-page-without-vision-client code path) appears; update the test to
reference extract_text_from_pdf_bytes and assert the warning text and/or warning
category so CI verifies the warning path.

In `@services/crawler/app/routers/pdf.py`:
- Around line 255-267: The current loop only uses the large_image_ratio
heuristic (variables page.get_images, bbox, large_image_ratio, scanned_count) to
mark scanned pages, which differs from the extraction pipeline (e.g.,
extract_text_from_pdf_bytes) that also uses low-text detection and sets
scanned_pages_detected; replace or augment this logic to call the shared
scanned-page detector or the existing metadata helper used by
extract_text_from_pdf_bytes so both paths use the same detection logic and value
for scanned_pages_detected, i.e., invoke the common function instead of only
counting large-image pages (or combine both heuristics via that shared helper)
so the UI/visionRequired flag stays consistent.

In `@services/crawler/app/services/vision/pdf_extractor.py`:
- Around line 187-193: The code currently uses page_vision_used (set by both
_ocr_page() and image descriptions) to compute ocr_applied, causing scanned
pages with empty OCR text but later image descriptions to be misattributed;
modify _extract_page_with_layout() to return a separate boolean page_ocr_applied
(true only when OCR actually produced text) in addition to page_vision_used,
update the process_page coroutine to unpack and return page_ocr_applied
alongside page_vision_used (and keep the existing return signature ordering),
and change the aggregation logic that computes ocr_applied (currently using
page_vision_used around lines referenced) to use the new page_ocr_applied flag
so OCR usage is tracked independently from generic Vision/image-description
usage.

In `@services/platform/convex/file_metadata/internal_actions.ts`:
- Around line 89-100: The code currently bases visionRequired solely on
args.contentType (using IMAGE_CONTENT_TYPES) so uploads with a generic MIME like
application/octet-stream and an image filename (e.g.,
.png/.jpg/.jpeg/.gif/.webp) will be misclassified; update the logic in the
branch that checks IMAGE_CONTENT_TYPES and in the later branch around
createDocumentFromUpload to also inspect the uploaded filename/extension
(args.filename or the upload metadata) and treat known image extensions as
images before taking the “no vision needed” path, then call
internal.file_metadata.internal_mutations.updateFileVisionMetadata with
visionRequired: true for those cases; ensure you reuse the same extension set
(or a small helper isKnownImageExtension(filename)) so both places behave
consistently.

In `@services/platform/convex/file_metadata/queries.ts`:
- Around line 43-55: The handler for this public query must verify the caller is
a member of args.organizationId before reading fileMetadata; after obtaining
authUser via getAuthUserIdentity and before calling ctx.db.query('fileMetadata')
(the withIndex/...first() block), invoke the existing org-membership check
helper (e.g., assertUserIsOrgMember(authUser, args.organizationId) or
ensureUserInOrganization(authUser, args.organizationId)) and return null/throw
if the check fails so only org members can access the index query.
- Around line 31-34: Change the args schema to use documentId: v.id('documents')
instead of v.string() to match the typed documents ID and restore ID-format
validation, and update the query that compares the arg against the document.id
to use the correctly typed arg; also enforce org membership by calling
getOrganizationMember(ctx, args.organizationId, authUser) immediately after
getAuthUserIdentity(...) (before returning org-scoped metadata) so only members
of the organization can fetch the metadata.

In `@services/platform/messages/de.json`:
- Around line 2975-2976: The German translation value for the key "scannedPages"
is not title-cased; update the value from "gescannte Seiten" to "Gescannte
Seiten" so it matches the capitalization style of adjacent sidebar labels (e.g.,
"sourceSharepoint" / "SharePoint") and maintains UI copy consistency; locate the
"scannedPages" entry in the translations and change only its string value.

---

Outside diff comments:
In `@services/crawler/app/routers/docx.py`:
- Around line 494-502: The handler is hardcoding scanned_pages_detected=0 which
wrongly signals "no scanned pages" downstream; update the response in
services/crawler/app/routers/docx.py to omit that hardcoded zero and return a
null/undefined value instead (e.g., scanned_pages_detected=None) or remove the
scanned_pages_detected field so FileMetadataResponse does not assert 0; locate
the return constructing FileMetadataResponse and change the
scanned_pages_detected assignment to None (or omit the property) so downstream
logic (e.g., services/platform/convex/file_metadata/internal_actions.ts) doesn't
treat it as a definitive "no OCR needed" signal.

In `@services/crawler/app/routers/pptx.py`:
- Around line 295-304: The code currently returns scanned_pages_detected=0 in
the FileMetadataResponse (returned from the PPTX metadata handler using
prs.slides), which causes downstream logic to treat decks as visionNotRequired;
change the response to return scanned_pages_detected=None until real detection
is implemented (or implement a real heuristic that inspects prs.slides for
image-only slides), i.e., update the FileMetadataResponse construction to set
scanned_pages_detected=None instead of 0 (refer to FileMetadataResponse and the
place where prs.slides is used).

In `@services/crawler/app/services/vision/pdf_extractor.py`:
- Around line 125-133: The code currently sets is_scanned_page only when
ocr_scanned_pages is true and skips the large-image-area heuristic; change the
flow in the page-processing block (references: MIN_TEXT_THRESHOLD,
ocr_scanned_pages, _ocr_page, is_scanned_page, scanned_pages_detected) so that
is_scanned_page is computed purely from layout heuristics (text length vs
MIN_TEXT_THRESHOLD and the large-image-area check described in the module
docstring) regardless of ocr_scanned_pages, increment/update
scanned_pages_detected based on that boolean, and then separately gate the OCR
call: if is_scanned_page and ocr_scanned_pages then call await _ocr_page(...),
set elements and vision_used accordingly and emit the debug log about sending to
Vision API only when OCR is actually invoked.

In `@services/platform/convex/file_metadata/internal_mutations.ts`:
- Around line 27-43: When reusing an existing fileMetadata row in
createDocumentFromUpload (the branch that finds existing and calls ctx.db.patch
on existing._id), ensure extractFileMetadata is queued for that record too so
metadata (source dates, vision data) gets backfilled; add a call to the same
extraction scheduling mechanism used on insert (or invoke extractFileMetadata
scheduling) immediately after the patch/return of existing._id. Also add the
same scheduling call inside linkDocumentToFile when attaching a
previously-unlinked fileMetadata record so linking triggers metadata extraction
for that record as well.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 82570a72-47fe-4ce9-97b5-0e8890206e32

📥 Commits

Reviewing files that changed from the base of the PR and between e80bc60 and ba80647.

📒 Files selected for processing (23)
  • packages/tale_knowledge/src/tale_knowledge/extraction/image.py
  • packages/tale_knowledge/src/tale_knowledge/extraction/pdf.py
  • packages/tale_knowledge/src/tale_knowledge/extraction/router.py
  • packages/tale_knowledge/tests/test_pdf.py
  • services/crawler/app/models.py
  • services/crawler/app/routers/docx.py
  • services/crawler/app/routers/pdf.py
  • services/crawler/app/routers/pptx.py
  • services/crawler/app/services/file_parser_service.py
  • services/crawler/app/services/vision/pdf_extractor.py
  • services/platform/app/features/documents/components/document-preview-dialog.tsx
  • services/platform/convex/agents/mutations.ts
  • services/platform/convex/conversations/send_message_via_integration.ts
  • services/platform/convex/documents/__tests__/create_document_from_upload.test.ts
  • services/platform/convex/documents/internal_actions.ts
  • services/platform/convex/documents/mutations.ts
  • services/platform/convex/file_metadata/internal_actions.ts
  • services/platform/convex/file_metadata/internal_mutations.ts
  • services/platform/convex/file_metadata/mutations.ts
  • services/platform/convex/file_metadata/queries.ts
  • services/platform/convex/file_metadata/schema.ts
  • services/platform/messages/de.json
  • services/platform/messages/en.json

Comment on lines +212 to +219
@pytest.mark.asyncio
async def test_scanned_page_without_vision_client(self):
pdf_bytes = _make_fullpage_image_pdf()

result = await extract_text_from_pdf_bytes(pdf_bytes, vision_client=None)
assert result.scanned_pages_detected == 1
assert result.ocr_applied is False
assert result.vision_used is False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Assert the warning path in this regression test.

This test proves the counters, but the new behavior also relies on emitting a warning when a scanned page is detected without a vision client. Add a log assertion here so that path is covered in CI instead of staying manual-only.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/tale_knowledge/tests/test_pdf.py` around lines 212 - 219, The test
test_scanned_page_without_vision_client should also assert the warning is
emitted when a scanned page is detected without a vision client: wrap the call
to extract_text_from_pdf_bytes(vision_client=None) in a logging/warning capture
(e.g., use pytest's caplog or pytest.warns) and assert that the expected warning
message (the one produced by the scanned-page-without-vision-client code path)
appears; update the test to reference extract_text_from_pdf_bytes and assert the
warning text and/or warning category so CI verifies the warning path.

Comment on lines +255 to +267
large_image_ratio = 0.5
scanned_count = 0
for page_num in range(page_count):
page = doc[page_num]
page_area = page.rect.get_area()
if page_area <= 0:
continue
for img in page.get_images(full=True):
bbox = page.get_image_bbox(img)
if bbox and bbox.get_area() / page_area > large_image_ratio:
scanned_count += 1
break

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Use the same scanned-page heuristic as the extraction pipeline.

This endpoint only applies the large-image-area check, but the PR’s extraction path also uses low-text detection. That means extract-metadata can return scanned_pages_detected = 0 for PDFs that extract_text_from_pdf_bytes will later treat as scanned and OCR, so the UI’s OCR indicator/visionRequired flag can drift from actual extraction behavior. Please reuse the shared detector or route this through the same metadata helper.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/routers/pdf.py` around lines 255 - 267, The current loop
only uses the large_image_ratio heuristic (variables page.get_images, bbox,
large_image_ratio, scanned_count) to mark scanned pages, which differs from the
extraction pipeline (e.g., extract_text_from_pdf_bytes) that also uses low-text
detection and sets scanned_pages_detected; replace or augment this logic to call
the shared scanned-page detector or the existing metadata helper used by
extract_text_from_pdf_bytes so both paths use the same detection logic and value
for scanned_pages_detected, i.e., invoke the common function instead of only
counting large-image pages (or combine both heuristics via that shared helper)
so the UI/visionRequired flag stays consistent.

Comment on lines +187 to 193
async def process_page(page_num: int) -> tuple[int, str, bool, bool]:
page = doc[page_num]
content, vision_used = await _extract_page_with_layout(
content, page_vision_used, page_is_scanned = await _extract_page_with_layout(
page, doc, semaphore, process_images, ocr_scanned_pages, usage=usage
)
return page_num, f"--- Page {page_num + 1} ---\n{content}", vision_used
return page_num, f"--- Page {page_num + 1} ---\n{content}", page_vision_used, page_is_scanned

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Track OCR usage separately from generic Vision usage.

page_vision_used is set for both _ocr_page() and embedded-image descriptions, but Lines 210-213 use it to derive ocr_applied. A scanned page where OCR returns empty text can still flip ocr_applied=True if image description ran later. Return a dedicated page_ocr_applied flag from _extract_page_with_layout() and aggregate that instead.

Also applies to: 206-213

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/services/vision/pdf_extractor.py` around lines 187 -
193, The code currently uses page_vision_used (set by both _ocr_page() and image
descriptions) to compute ocr_applied, causing scanned pages with empty OCR text
but later image descriptions to be misattributed; modify
_extract_page_with_layout() to return a separate boolean page_ocr_applied (true
only when OCR actually produced text) in addition to page_vision_used, update
the process_page coroutine to unpack and return page_ocr_applied alongside
page_vision_used (and keep the existing return signature ordering), and change
the aggregation logic that computes ocr_applied (currently using
page_vision_used around lines referenced) to use the new page_ocr_applied flag
so OCR usage is tracked independently from generic Vision/image-description
usage.

Comment on lines +89 to +100
// Images: always need vision, no crawler call needed
if (IMAGE_CONTENT_TYPES.has(args.contentType)) {
await ctx.runMutation(
internal.file_metadata.internal_mutations.updateFileVisionMetadata,
{
storageId: args.storageId,
pageCount: 1,
scannedPagesDetected: 0,
visionRequired: true,
},
);
return null;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Treat known image extensions as images when MIME is generic.

This branch relies on contentType alone, but createDocumentFromUpload falls back to application/octet-stream when the upload has no MIME type. A .png/.jpg uploaded that way will fall through to Lines 218-226 and be marked visionRequired: false. Add an extension fallback for common image types before taking the “no vision needed” path.

Also applies to: 218-226

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/file_metadata/internal_actions.ts` around lines 89 -
100, The code currently bases visionRequired solely on args.contentType (using
IMAGE_CONTENT_TYPES) so uploads with a generic MIME like
application/octet-stream and an image filename (e.g.,
.png/.jpg/.jpeg/.gif/.webp) will be misclassified; update the logic in the
branch that checks IMAGE_CONTENT_TYPES and in the later branch around
createDocumentFromUpload to also inspect the uploaded filename/extension
(args.filename or the upload metadata) and treat known image extensions as
images before taking the “no vision needed” path, then call
internal.file_metadata.internal_mutations.updateFileVisionMetadata with
visionRequired: true for those cases; ensure you reuse the same extension set
(or a small helper isKnownImageExtension(filename)) so both places behave
consistently.

Comment on lines +173 to +189
// Write dates to linked document (if any)
if (createdAt != null || modifiedAt != null) {
const fileMetadata = await ctx.runQuery(
internal.file_metadata.internal_queries.getByStorageId,
{ storageId: args.storageId },
);

if (fileMetadata?.documentId) {
await ctx.runMutation(
internal.documents.internal_mutations.updateDocumentDates,
{
documentId: fileMetadata.documentId,
sourceCreatedAt: createdAt,
sourceModifiedAt: modifiedAt,
},
);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don't drop extracted dates before the file is linked to its document.

saveFileMetadata schedules this action before createDocumentFromUpload later calls linkDocumentToFile in services/platform/convex/documents/mutations.ts:166-202. If the crawler responds quickly, fileMetadata.documentId is still empty here and the extracted created_at / modified_at values are discarded with no retry. Persist the dates until the link exists, or reschedule a short retry for this block.

Comment on lines +31 to +34
args: {
organizationId: v.string(),
documentId: v.string(),
},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

cat -n services/platform/convex/file_metadata/queries.ts | head -80

Repository: tale-project/tale

Length of output: 2731


🏁 Script executed:

fd -t f fileMetadata 'convex' | head -20

Repository: tale-project/tale

Length of output: 154


🏁 Script executed:

rg "fileMetadata" services/platform/convex --type ts -A 2 -B 2 | grep -E "(v\.object|documentId)" | head -30

Repository: tale-project/tale

Length of output: 1298


🏁 Script executed:

find services/platform/convex -name "*.ts" -type f | xargs grep -l "fileMetadata.*:" | head -10

Repository: tale-project/tale

Length of output: 294


🏁 Script executed:

cat -n services/platform/convex/schema.ts | grep -A 15 "fileMetadata"

Repository: tale-project/tale

Length of output: 1916


🏁 Script executed:

cat -n services/platform/convex/file_metadata/schema.ts

Repository: tale-project/tale

Length of output: 1477


🏁 Script executed:

cat -n services/platform/convex/lib/rls.ts | head -50

Repository: tale-project/tale

Length of output: 128


🏁 Script executed:

rg "getAuthUserIdentity" services/platform/convex --type ts -A 5 | head -60

Repository: tale-project/tale

Length of output: 4294


🏁 Script executed:

fd "rls" services/platform/convex --type f

Repository: tale-project/tale

Length of output: 553


🏁 Script executed:

rg "export.*getAuthUserIdentity" services/platform/convex --type ts -A 10 | head -40

Repository: tale-project/tale

Length of output: 1122


🏁 Script executed:

rg "isOrgMember" services/platform/convex --type ts -B 3 -A 3 | head -60

Repository: tale-project/tale

Length of output: 4954


🏁 Script executed:

cat -n services/platform/convex/documents/queries.ts | head -80

Repository: tale-project/tale

Length of output: 2935


🏁 Script executed:

cat -n services/platform/convex/file_metadata/queries.ts | grep -B 5 "getByDocumentId"

Repository: tale-project/tale

Length of output: 184


🏁 Script executed:

rg "getByDocumentId" services/platform/convex --type ts | head -20

Repository: tale-project/tale

Length of output: 149


Use v.id('documents') for documentId and verify org membership.

Line 33 declares documentId: v.string(), but the schema defines documentId: v.id('documents'). This causes a type mismatch on line 53 where the query compares the string argument against the typed field, breaking type safety and dropping ID-format validation.

Additionally, the query returns organization-scoped metadata without verifying the caller's org membership. Following the authorization pattern in other org-scoped queries (e.g., documents/queries.ts), add getOrganizationMember(ctx, args.organizationId, authUser) after the getAuthUserIdentity check.

Suggested fix
 export const getByDocumentId = query({
   args: {
     organizationId: v.string(),
-    documentId: v.string(),
+    documentId: v.id('documents'),
   },
   returns: v.union(
     v.object({
       pageCount: v.optional(v.number()),
       scannedPagesDetected: v.optional(v.number()),
       visionRequired: v.optional(v.boolean()),
       ocrApplied: v.optional(v.boolean()),
     }),
     v.null(),
   ),
   handler: async (ctx, args) => {
     const authUser = await getAuthUserIdentity(ctx);
     if (!authUser) return null;
+
+    await getOrganizationMember(ctx, args.organizationId, authUser);
 
     const meta = await ctx.db
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/file_metadata/queries.ts` around lines 31 - 34,
Change the args schema to use documentId: v.id('documents') instead of
v.string() to match the typed documents ID and restore ID-format validation, and
update the query that compares the arg against the document.id to use the
correctly typed arg; also enforce org membership by calling
getOrganizationMember(ctx, args.organizationId, authUser) immediately after
getAuthUserIdentity(...) (before returning org-scoped metadata) so only members
of the organization can fetch the metadata.

Comment on lines +43 to +55
handler: async (ctx, args) => {
const authUser = await getAuthUserIdentity(ctx);
if (!authUser) return null;

const meta = await ctx.db
.query('fileMetadata')
.withIndex('by_organizationId_and_documentId', (q) =>
q
.eq('organizationId', args.organizationId)
.eq('documentId', args.documentId),
)
.first();
if (!meta) return null;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Enforce organization membership before reading fileMetadata.

This new public query only checks that the caller is authenticated. Any signed-in user who can guess an organizationId/documentId pair can probe another org’s OCR metadata. Add the org-membership check at this boundary before hitting the index.

Based on learnings, "In tale-project/tale, Convex authorization is enforced at the public API boundary (queries/mutations), while shared helpers ... remain authorization-agnostic. Do not add org-membership checks inside these helpers; perform them in the calling query/mutation."

🧰 Tools
🪛 GitHub Actions: Lint

[error] 52-52: TypeScript (tsc --noEmit) failed with TS2345: Argument of type 'string' is not assignable to parameter of type 'Id<"documents">'.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/file_metadata/queries.ts` around lines 43 - 55, The
handler for this public query must verify the caller is a member of
args.organizationId before reading fileMetadata; after obtaining authUser via
getAuthUserIdentity and before calling ctx.db.query('fileMetadata') (the
withIndex/...first() block), invoke the existing org-membership check helper
(e.g., assertUserIsOrgMember(authUser, args.organizationId) or
ensureUserInOrganization(authUser, args.organizationId)) and return null/throw
if the check fails so only org members can access the index query.

Comment thread services/platform/messages/de.json Outdated
Comment on lines +2975 to +2976
"sourceSharepoint": "SharePoint",
"scannedPages": "gescannte Seiten"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Capitalize the new sidebar label for consistency.

Line [2976] uses "gescannte Seiten" while adjacent sidebar labels are title-cased nouns. Use "Gescannte Seiten" for consistent UI copy.

Suggested fix
-        "scannedPages": "gescannte Seiten"
+        "scannedPages": "Gescannte Seiten"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"sourceSharepoint": "SharePoint",
"scannedPages": "gescannte Seiten"
"sourceSharepoint": "SharePoint",
"scannedPages": "Gescannte Seiten"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/messages/de.json` around lines 2975 - 2976, The German
translation value for the key "scannedPages" is not title-cased; update the
value from "gescannte Seiten" to "Gescannte Seiten" so it matches the
capitalization style of adjacent sidebar labels (e.g., "sourceSharepoint" /
"SharePoint") and maintains UI copy consistency; locate the "scannedPages" entry
in the translations and change only its string value.

larryro added 7 commits April 11, 2026 21:02
Store scannedPagesDetected and ocrApplied directly on the document
record so the table list can display them without joining fileMetadata.
extractFileMetadata writes scannedPagesDetected, and the RAG polling
writes ocrApplied on completion. The RAG status column shows "OCR"
next to the Indexed badge when OCR was applied.
The rag_service already queried ocr_applied from the database but the
router dropped it when constructing DocumentStatusInfo. Add the field
to the Pydantic model and pass it through in the statuses endpoint.
The upload endpoint sets status to 'processing' before indexing starts,
which causes find_existing_by_hash (requires status='completed') to miss
the document's own row. This led to full re-extraction and vision API
calls on every re-upload of the same file.

Add a file_id-based content_hash check at the top of index_document that
also verifies chunks exist, so only previously completed documents skip.
@blacksmith-sh

This comment has been minimized.

larryro added 2 commits April 11, 2026 21:31
- Fix line-too-long lint errors in indexing_service.py early-dedup code
- Add early-dedup fetchrow mock to all RAG indexing tests
- Remove fileMetadata query from document preview sidebar (use document
  fields directly), fixing the TS2322 type error
@larryro larryro merged commit d6577cc into main Apr 11, 2026
24 checks passed
@larryro larryro deleted the feat/issue-1206-ocr-visibility branch April 11, 2026 13:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OCR capability for scanned/image documents not evident

2 participants