feat(platform): surface OCR visibility with scanned page detection#1401
Conversation
📝 WalkthroughWalkthroughThe pull request introduces OCR detection and tracking for scanned PDF documents across the system. In the knowledge extraction layer, Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
Add scanned page detection metadata (scanned_pages_detected, ocr_applied) to both PDF extraction pipelines. Return structured PdfExtractionResult from tale_knowledge and PdfExtractionMetadata from the crawler service so downstream consumers can inform users about OCR activity. - Detect scanned pages using text threshold + large-image heuristic - Warn when scanned pages found but no vision client configured - Propagate OCR metadata through crawler file parser response - Add i18n keys for OCR status in en.json and de.json - Update tests for new PdfExtractionResult return type
2101968 to
14f3a6d
Compare
Move scanned page detection from crawler through fileMetadata to the document preview sidebar. Consolidate all direct fileMetadata inserts into the saveFileMetadata mutation so every upload path automatically triggers metadata extraction. - Extend crawler extract-metadata endpoints with scanned_pages_detected - Add pageCount, scannedPagesDetected, visionRequired to fileMetadata schema - Add extractFileMetadata action triggered by saveFileMetadata on insert - Replace extractDocumentDates with extractFileMetadata (handles dates + vision metadata) - Display scanned page count in document preview sidebar
Persist vision_used as ocr_applied in the RAG documents table during indexing, return it in the status query, and propagate through the platform polling path into fileMetadata. The document preview sidebar now shows whether OCR was applied after indexing completes.
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
services/crawler/app/routers/docx.py (1)
494-502:⚠️ Potential issue | 🟠 MajorAvoid hardcoding a “no scanned pages” result for DOCX metadata.
This handler does not perform any scanned-page detection, but downstream code treats
0as a definitive signal and setsvisionRequired: falseinservices/platform/convex/file_metadata/internal_actions.ts:166-174. For image-only DOCX files that suppresses the OCR hint even though nothing was checked.Suggested minimal fix
return FileMetadataResponse( success=True, filename=filename, file_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document", title=meta["title"] or None, author=meta["author"] or None, created_at=meta["created_at"], modified_at=meta["modified_at"], - scanned_pages_detected=0, )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/crawler/app/routers/docx.py` around lines 494 - 502, The handler is hardcoding scanned_pages_detected=0 which wrongly signals "no scanned pages" downstream; update the response in services/crawler/app/routers/docx.py to omit that hardcoded zero and return a null/undefined value instead (e.g., scanned_pages_detected=None) or remove the scanned_pages_detected field so FileMetadataResponse does not assert 0; locate the return constructing FileMetadataResponse and change the scanned_pages_detected assignment to None (or omit the property) so downstream logic (e.g., services/platform/convex/file_metadata/internal_actions.ts) doesn't treat it as a definitive "no OCR needed" signal.services/crawler/app/routers/pptx.py (1)
295-304:⚠️ Potential issue | 🟠 MajorDon't report
scanned_pages_detected=0without actually checking the slides.This endpoint never inspects PPTX slides for scan-like/image-only content, but
services/platform/convex/file_metadata/internal_actions.ts:166-174turns0intovisionRequired: false. That creates a false negative for image-only decks and hides OCR need in downstream UI. ReturnNoneuntil detection exists, or add a real heuristic here.Suggested minimal fix
return FileMetadataResponse( success=True, filename=filename, file_type="application/vnd.openxmlformats-officedocument.presentationml.presentation", title=meta["title"] or None, author=meta["author"] or None, slide_count=len(prs.slides), created_at=meta["created_at"], modified_at=meta["modified_at"], - scanned_pages_detected=0, )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/crawler/app/routers/pptx.py` around lines 295 - 304, The code currently returns scanned_pages_detected=0 in the FileMetadataResponse (returned from the PPTX metadata handler using prs.slides), which causes downstream logic to treat decks as visionNotRequired; change the response to return scanned_pages_detected=None until real detection is implemented (or implement a real heuristic that inspects prs.slides for image-only slides), i.e., update the FileMetadataResponse construction to set scanned_pages_detected=None instead of 0 (refer to FileMetadataResponse and the place where prs.slides is used).services/platform/convex/file_metadata/internal_mutations.ts (1)
27-43:⚠️ Potential issue | 🟠 MajorReschedule metadata extraction when reusing an existing
fileMetadatarow.
extractFileMetadatais only queued on insert. Now thatcreateDocumentFromUploadreuses existing rows and links thedocumentIdafterward, a retry/re-link path can return from Line 43 without ever rerunning extraction, so the later document never gets its source dates (and any fresh vision metadata) backfilled. Please trigger the extraction action for the existing-row path too, or fromlinkDocumentToFilewhen attaching a previously unlinked record.Also applies to: 68-76
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/platform/convex/file_metadata/internal_mutations.ts` around lines 27 - 43, When reusing an existing fileMetadata row in createDocumentFromUpload (the branch that finds existing and calls ctx.db.patch on existing._id), ensure extractFileMetadata is queued for that record too so metadata (source dates, vision data) gets backfilled; add a call to the same extraction scheduling mechanism used on insert (or invoke extractFileMetadata scheduling) immediately after the patch/return of existing._id. Also add the same scheduling call inside linkDocumentToFile when attaching a previously-unlinked fileMetadata record so linking triggers metadata extraction for that record as well.services/crawler/app/services/vision/pdf_extractor.py (1)
125-133:⚠️ Potential issue | 🔴 CriticalDecouple scanned-page detection from OCR execution.
This branch only marks a page as scanned when
ocr_scanned_pagesis true, so deployments without Vision will reportscanned_pages_detected = 0even for obviously scanned PDFs. It also ignores the large-image-area check described in the module docstring/PR, which will overcount low-text pages as scanned. Computeis_scanned_pagefrom layout alone, then gate only the OCR call onocr_scanned_pages.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/crawler/app/services/vision/pdf_extractor.py` around lines 125 - 133, The code currently sets is_scanned_page only when ocr_scanned_pages is true and skips the large-image-area heuristic; change the flow in the page-processing block (references: MIN_TEXT_THRESHOLD, ocr_scanned_pages, _ocr_page, is_scanned_page, scanned_pages_detected) so that is_scanned_page is computed purely from layout heuristics (text length vs MIN_TEXT_THRESHOLD and the large-image-area check described in the module docstring) regardless of ocr_scanned_pages, increment/update scanned_pages_detected based on that boolean, and then separately gate the OCR call: if is_scanned_page and ocr_scanned_pages then call await _ocr_page(...), set elements and vision_used accordingly and emit the debug log about sending to Vision API only when OCR is actually invoked.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@packages/tale_knowledge/tests/test_pdf.py`:
- Around line 212-219: The test test_scanned_page_without_vision_client should
also assert the warning is emitted when a scanned page is detected without a
vision client: wrap the call to extract_text_from_pdf_bytes(vision_client=None)
in a logging/warning capture (e.g., use pytest's caplog or pytest.warns) and
assert that the expected warning message (the one produced by the
scanned-page-without-vision-client code path) appears; update the test to
reference extract_text_from_pdf_bytes and assert the warning text and/or warning
category so CI verifies the warning path.
In `@services/crawler/app/routers/pdf.py`:
- Around line 255-267: The current loop only uses the large_image_ratio
heuristic (variables page.get_images, bbox, large_image_ratio, scanned_count) to
mark scanned pages, which differs from the extraction pipeline (e.g.,
extract_text_from_pdf_bytes) that also uses low-text detection and sets
scanned_pages_detected; replace or augment this logic to call the shared
scanned-page detector or the existing metadata helper used by
extract_text_from_pdf_bytes so both paths use the same detection logic and value
for scanned_pages_detected, i.e., invoke the common function instead of only
counting large-image pages (or combine both heuristics via that shared helper)
so the UI/visionRequired flag stays consistent.
In `@services/crawler/app/services/vision/pdf_extractor.py`:
- Around line 187-193: The code currently uses page_vision_used (set by both
_ocr_page() and image descriptions) to compute ocr_applied, causing scanned
pages with empty OCR text but later image descriptions to be misattributed;
modify _extract_page_with_layout() to return a separate boolean page_ocr_applied
(true only when OCR actually produced text) in addition to page_vision_used,
update the process_page coroutine to unpack and return page_ocr_applied
alongside page_vision_used (and keep the existing return signature ordering),
and change the aggregation logic that computes ocr_applied (currently using
page_vision_used around lines referenced) to use the new page_ocr_applied flag
so OCR usage is tracked independently from generic Vision/image-description
usage.
In `@services/platform/convex/file_metadata/internal_actions.ts`:
- Around line 89-100: The code currently bases visionRequired solely on
args.contentType (using IMAGE_CONTENT_TYPES) so uploads with a generic MIME like
application/octet-stream and an image filename (e.g.,
.png/.jpg/.jpeg/.gif/.webp) will be misclassified; update the logic in the
branch that checks IMAGE_CONTENT_TYPES and in the later branch around
createDocumentFromUpload to also inspect the uploaded filename/extension
(args.filename or the upload metadata) and treat known image extensions as
images before taking the “no vision needed” path, then call
internal.file_metadata.internal_mutations.updateFileVisionMetadata with
visionRequired: true for those cases; ensure you reuse the same extension set
(or a small helper isKnownImageExtension(filename)) so both places behave
consistently.
In `@services/platform/convex/file_metadata/queries.ts`:
- Around line 43-55: The handler for this public query must verify the caller is
a member of args.organizationId before reading fileMetadata; after obtaining
authUser via getAuthUserIdentity and before calling ctx.db.query('fileMetadata')
(the withIndex/...first() block), invoke the existing org-membership check
helper (e.g., assertUserIsOrgMember(authUser, args.organizationId) or
ensureUserInOrganization(authUser, args.organizationId)) and return null/throw
if the check fails so only org members can access the index query.
- Around line 31-34: Change the args schema to use documentId: v.id('documents')
instead of v.string() to match the typed documents ID and restore ID-format
validation, and update the query that compares the arg against the document.id
to use the correctly typed arg; also enforce org membership by calling
getOrganizationMember(ctx, args.organizationId, authUser) immediately after
getAuthUserIdentity(...) (before returning org-scoped metadata) so only members
of the organization can fetch the metadata.
In `@services/platform/messages/de.json`:
- Around line 2975-2976: The German translation value for the key "scannedPages"
is not title-cased; update the value from "gescannte Seiten" to "Gescannte
Seiten" so it matches the capitalization style of adjacent sidebar labels (e.g.,
"sourceSharepoint" / "SharePoint") and maintains UI copy consistency; locate the
"scannedPages" entry in the translations and change only its string value.
---
Outside diff comments:
In `@services/crawler/app/routers/docx.py`:
- Around line 494-502: The handler is hardcoding scanned_pages_detected=0 which
wrongly signals "no scanned pages" downstream; update the response in
services/crawler/app/routers/docx.py to omit that hardcoded zero and return a
null/undefined value instead (e.g., scanned_pages_detected=None) or remove the
scanned_pages_detected field so FileMetadataResponse does not assert 0; locate
the return constructing FileMetadataResponse and change the
scanned_pages_detected assignment to None (or omit the property) so downstream
logic (e.g., services/platform/convex/file_metadata/internal_actions.ts) doesn't
treat it as a definitive "no OCR needed" signal.
In `@services/crawler/app/routers/pptx.py`:
- Around line 295-304: The code currently returns scanned_pages_detected=0 in
the FileMetadataResponse (returned from the PPTX metadata handler using
prs.slides), which causes downstream logic to treat decks as visionNotRequired;
change the response to return scanned_pages_detected=None until real detection
is implemented (or implement a real heuristic that inspects prs.slides for
image-only slides), i.e., update the FileMetadataResponse construction to set
scanned_pages_detected=None instead of 0 (refer to FileMetadataResponse and the
place where prs.slides is used).
In `@services/crawler/app/services/vision/pdf_extractor.py`:
- Around line 125-133: The code currently sets is_scanned_page only when
ocr_scanned_pages is true and skips the large-image-area heuristic; change the
flow in the page-processing block (references: MIN_TEXT_THRESHOLD,
ocr_scanned_pages, _ocr_page, is_scanned_page, scanned_pages_detected) so that
is_scanned_page is computed purely from layout heuristics (text length vs
MIN_TEXT_THRESHOLD and the large-image-area check described in the module
docstring) regardless of ocr_scanned_pages, increment/update
scanned_pages_detected based on that boolean, and then separately gate the OCR
call: if is_scanned_page and ocr_scanned_pages then call await _ocr_page(...),
set elements and vision_used accordingly and emit the debug log about sending to
Vision API only when OCR is actually invoked.
In `@services/platform/convex/file_metadata/internal_mutations.ts`:
- Around line 27-43: When reusing an existing fileMetadata row in
createDocumentFromUpload (the branch that finds existing and calls ctx.db.patch
on existing._id), ensure extractFileMetadata is queued for that record too so
metadata (source dates, vision data) gets backfilled; add a call to the same
extraction scheduling mechanism used on insert (or invoke extractFileMetadata
scheduling) immediately after the patch/return of existing._id. Also add the
same scheduling call inside linkDocumentToFile when attaching a
previously-unlinked fileMetadata record so linking triggers metadata extraction
for that record as well.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 82570a72-47fe-4ce9-97b5-0e8890206e32
📒 Files selected for processing (23)
packages/tale_knowledge/src/tale_knowledge/extraction/image.pypackages/tale_knowledge/src/tale_knowledge/extraction/pdf.pypackages/tale_knowledge/src/tale_knowledge/extraction/router.pypackages/tale_knowledge/tests/test_pdf.pyservices/crawler/app/models.pyservices/crawler/app/routers/docx.pyservices/crawler/app/routers/pdf.pyservices/crawler/app/routers/pptx.pyservices/crawler/app/services/file_parser_service.pyservices/crawler/app/services/vision/pdf_extractor.pyservices/platform/app/features/documents/components/document-preview-dialog.tsxservices/platform/convex/agents/mutations.tsservices/platform/convex/conversations/send_message_via_integration.tsservices/platform/convex/documents/__tests__/create_document_from_upload.test.tsservices/platform/convex/documents/internal_actions.tsservices/platform/convex/documents/mutations.tsservices/platform/convex/file_metadata/internal_actions.tsservices/platform/convex/file_metadata/internal_mutations.tsservices/platform/convex/file_metadata/mutations.tsservices/platform/convex/file_metadata/queries.tsservices/platform/convex/file_metadata/schema.tsservices/platform/messages/de.jsonservices/platform/messages/en.json
| @pytest.mark.asyncio | ||
| async def test_scanned_page_without_vision_client(self): | ||
| pdf_bytes = _make_fullpage_image_pdf() | ||
|
|
||
| result = await extract_text_from_pdf_bytes(pdf_bytes, vision_client=None) | ||
| assert result.scanned_pages_detected == 1 | ||
| assert result.ocr_applied is False | ||
| assert result.vision_used is False |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Assert the warning path in this regression test.
This test proves the counters, but the new behavior also relies on emitting a warning when a scanned page is detected without a vision client. Add a log assertion here so that path is covered in CI instead of staying manual-only.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@packages/tale_knowledge/tests/test_pdf.py` around lines 212 - 219, The test
test_scanned_page_without_vision_client should also assert the warning is
emitted when a scanned page is detected without a vision client: wrap the call
to extract_text_from_pdf_bytes(vision_client=None) in a logging/warning capture
(e.g., use pytest's caplog or pytest.warns) and assert that the expected warning
message (the one produced by the scanned-page-without-vision-client code path)
appears; update the test to reference extract_text_from_pdf_bytes and assert the
warning text and/or warning category so CI verifies the warning path.
| large_image_ratio = 0.5 | ||
| scanned_count = 0 | ||
| for page_num in range(page_count): | ||
| page = doc[page_num] | ||
| page_area = page.rect.get_area() | ||
| if page_area <= 0: | ||
| continue | ||
| for img in page.get_images(full=True): | ||
| bbox = page.get_image_bbox(img) | ||
| if bbox and bbox.get_area() / page_area > large_image_ratio: | ||
| scanned_count += 1 | ||
| break | ||
|
|
There was a problem hiding this comment.
Use the same scanned-page heuristic as the extraction pipeline.
This endpoint only applies the large-image-area check, but the PR’s extraction path also uses low-text detection. That means extract-metadata can return scanned_pages_detected = 0 for PDFs that extract_text_from_pdf_bytes will later treat as scanned and OCR, so the UI’s OCR indicator/visionRequired flag can drift from actual extraction behavior. Please reuse the shared detector or route this through the same metadata helper.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/app/routers/pdf.py` around lines 255 - 267, The current loop
only uses the large_image_ratio heuristic (variables page.get_images, bbox,
large_image_ratio, scanned_count) to mark scanned pages, which differs from the
extraction pipeline (e.g., extract_text_from_pdf_bytes) that also uses low-text
detection and sets scanned_pages_detected; replace or augment this logic to call
the shared scanned-page detector or the existing metadata helper used by
extract_text_from_pdf_bytes so both paths use the same detection logic and value
for scanned_pages_detected, i.e., invoke the common function instead of only
counting large-image pages (or combine both heuristics via that shared helper)
so the UI/visionRequired flag stays consistent.
| async def process_page(page_num: int) -> tuple[int, str, bool, bool]: | ||
| page = doc[page_num] | ||
| content, vision_used = await _extract_page_with_layout( | ||
| content, page_vision_used, page_is_scanned = await _extract_page_with_layout( | ||
| page, doc, semaphore, process_images, ocr_scanned_pages, usage=usage | ||
| ) | ||
| return page_num, f"--- Page {page_num + 1} ---\n{content}", vision_used | ||
| return page_num, f"--- Page {page_num + 1} ---\n{content}", page_vision_used, page_is_scanned | ||
|
|
There was a problem hiding this comment.
Track OCR usage separately from generic Vision usage.
page_vision_used is set for both _ocr_page() and embedded-image descriptions, but Lines 210-213 use it to derive ocr_applied. A scanned page where OCR returns empty text can still flip ocr_applied=True if image description ran later. Return a dedicated page_ocr_applied flag from _extract_page_with_layout() and aggregate that instead.
Also applies to: 206-213
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/app/services/vision/pdf_extractor.py` around lines 187 -
193, The code currently uses page_vision_used (set by both _ocr_page() and image
descriptions) to compute ocr_applied, causing scanned pages with empty OCR text
but later image descriptions to be misattributed; modify
_extract_page_with_layout() to return a separate boolean page_ocr_applied (true
only when OCR actually produced text) in addition to page_vision_used, update
the process_page coroutine to unpack and return page_ocr_applied alongside
page_vision_used (and keep the existing return signature ordering), and change
the aggregation logic that computes ocr_applied (currently using
page_vision_used around lines referenced) to use the new page_ocr_applied flag
so OCR usage is tracked independently from generic Vision/image-description
usage.
| // Images: always need vision, no crawler call needed | ||
| if (IMAGE_CONTENT_TYPES.has(args.contentType)) { | ||
| await ctx.runMutation( | ||
| internal.file_metadata.internal_mutations.updateFileVisionMetadata, | ||
| { | ||
| storageId: args.storageId, | ||
| pageCount: 1, | ||
| scannedPagesDetected: 0, | ||
| visionRequired: true, | ||
| }, | ||
| ); | ||
| return null; |
There was a problem hiding this comment.
Treat known image extensions as images when MIME is generic.
This branch relies on contentType alone, but createDocumentFromUpload falls back to application/octet-stream when the upload has no MIME type. A .png/.jpg uploaded that way will fall through to Lines 218-226 and be marked visionRequired: false. Add an extension fallback for common image types before taking the “no vision needed” path.
Also applies to: 218-226
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/platform/convex/file_metadata/internal_actions.ts` around lines 89 -
100, The code currently bases visionRequired solely on args.contentType (using
IMAGE_CONTENT_TYPES) so uploads with a generic MIME like
application/octet-stream and an image filename (e.g.,
.png/.jpg/.jpeg/.gif/.webp) will be misclassified; update the logic in the
branch that checks IMAGE_CONTENT_TYPES and in the later branch around
createDocumentFromUpload to also inspect the uploaded filename/extension
(args.filename or the upload metadata) and treat known image extensions as
images before taking the “no vision needed” path, then call
internal.file_metadata.internal_mutations.updateFileVisionMetadata with
visionRequired: true for those cases; ensure you reuse the same extension set
(or a small helper isKnownImageExtension(filename)) so both places behave
consistently.
| // Write dates to linked document (if any) | ||
| if (createdAt != null || modifiedAt != null) { | ||
| const fileMetadata = await ctx.runQuery( | ||
| internal.file_metadata.internal_queries.getByStorageId, | ||
| { storageId: args.storageId }, | ||
| ); | ||
|
|
||
| if (fileMetadata?.documentId) { | ||
| await ctx.runMutation( | ||
| internal.documents.internal_mutations.updateDocumentDates, | ||
| { | ||
| documentId: fileMetadata.documentId, | ||
| sourceCreatedAt: createdAt, | ||
| sourceModifiedAt: modifiedAt, | ||
| }, | ||
| ); | ||
| } |
There was a problem hiding this comment.
Don't drop extracted dates before the file is linked to its document.
saveFileMetadata schedules this action before createDocumentFromUpload later calls linkDocumentToFile in services/platform/convex/documents/mutations.ts:166-202. If the crawler responds quickly, fileMetadata.documentId is still empty here and the extracted created_at / modified_at values are discarded with no retry. Persist the dates until the link exists, or reschedule a short retry for this block.
| args: { | ||
| organizationId: v.string(), | ||
| documentId: v.string(), | ||
| }, |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n services/platform/convex/file_metadata/queries.ts | head -80Repository: tale-project/tale
Length of output: 2731
🏁 Script executed:
fd -t f fileMetadata 'convex' | head -20Repository: tale-project/tale
Length of output: 154
🏁 Script executed:
rg "fileMetadata" services/platform/convex --type ts -A 2 -B 2 | grep -E "(v\.object|documentId)" | head -30Repository: tale-project/tale
Length of output: 1298
🏁 Script executed:
find services/platform/convex -name "*.ts" -type f | xargs grep -l "fileMetadata.*:" | head -10Repository: tale-project/tale
Length of output: 294
🏁 Script executed:
cat -n services/platform/convex/schema.ts | grep -A 15 "fileMetadata"Repository: tale-project/tale
Length of output: 1916
🏁 Script executed:
cat -n services/platform/convex/file_metadata/schema.tsRepository: tale-project/tale
Length of output: 1477
🏁 Script executed:
cat -n services/platform/convex/lib/rls.ts | head -50Repository: tale-project/tale
Length of output: 128
🏁 Script executed:
rg "getAuthUserIdentity" services/platform/convex --type ts -A 5 | head -60Repository: tale-project/tale
Length of output: 4294
🏁 Script executed:
fd "rls" services/platform/convex --type fRepository: tale-project/tale
Length of output: 553
🏁 Script executed:
rg "export.*getAuthUserIdentity" services/platform/convex --type ts -A 10 | head -40Repository: tale-project/tale
Length of output: 1122
🏁 Script executed:
rg "isOrgMember" services/platform/convex --type ts -B 3 -A 3 | head -60Repository: tale-project/tale
Length of output: 4954
🏁 Script executed:
cat -n services/platform/convex/documents/queries.ts | head -80Repository: tale-project/tale
Length of output: 2935
🏁 Script executed:
cat -n services/platform/convex/file_metadata/queries.ts | grep -B 5 "getByDocumentId"Repository: tale-project/tale
Length of output: 184
🏁 Script executed:
rg "getByDocumentId" services/platform/convex --type ts | head -20Repository: tale-project/tale
Length of output: 149
Use v.id('documents') for documentId and verify org membership.
Line 33 declares documentId: v.string(), but the schema defines documentId: v.id('documents'). This causes a type mismatch on line 53 where the query compares the string argument against the typed field, breaking type safety and dropping ID-format validation.
Additionally, the query returns organization-scoped metadata without verifying the caller's org membership. Following the authorization pattern in other org-scoped queries (e.g., documents/queries.ts), add getOrganizationMember(ctx, args.organizationId, authUser) after the getAuthUserIdentity check.
Suggested fix
export const getByDocumentId = query({
args: {
organizationId: v.string(),
- documentId: v.string(),
+ documentId: v.id('documents'),
},
returns: v.union(
v.object({
pageCount: v.optional(v.number()),
scannedPagesDetected: v.optional(v.number()),
visionRequired: v.optional(v.boolean()),
ocrApplied: v.optional(v.boolean()),
}),
v.null(),
),
handler: async (ctx, args) => {
const authUser = await getAuthUserIdentity(ctx);
if (!authUser) return null;
+
+ await getOrganizationMember(ctx, args.organizationId, authUser);
const meta = await ctx.db🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/platform/convex/file_metadata/queries.ts` around lines 31 - 34,
Change the args schema to use documentId: v.id('documents') instead of
v.string() to match the typed documents ID and restore ID-format validation, and
update the query that compares the arg against the document.id to use the
correctly typed arg; also enforce org membership by calling
getOrganizationMember(ctx, args.organizationId, authUser) immediately after
getAuthUserIdentity(...) (before returning org-scoped metadata) so only members
of the organization can fetch the metadata.
| handler: async (ctx, args) => { | ||
| const authUser = await getAuthUserIdentity(ctx); | ||
| if (!authUser) return null; | ||
|
|
||
| const meta = await ctx.db | ||
| .query('fileMetadata') | ||
| .withIndex('by_organizationId_and_documentId', (q) => | ||
| q | ||
| .eq('organizationId', args.organizationId) | ||
| .eq('documentId', args.documentId), | ||
| ) | ||
| .first(); | ||
| if (!meta) return null; |
There was a problem hiding this comment.
Enforce organization membership before reading fileMetadata.
This new public query only checks that the caller is authenticated. Any signed-in user who can guess an organizationId/documentId pair can probe another org’s OCR metadata. Add the org-membership check at this boundary before hitting the index.
Based on learnings, "In tale-project/tale, Convex authorization is enforced at the public API boundary (queries/mutations), while shared helpers ... remain authorization-agnostic. Do not add org-membership checks inside these helpers; perform them in the calling query/mutation."
🧰 Tools
🪛 GitHub Actions: Lint
[error] 52-52: TypeScript (tsc --noEmit) failed with TS2345: Argument of type 'string' is not assignable to parameter of type 'Id<"documents">'.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/platform/convex/file_metadata/queries.ts` around lines 43 - 55, The
handler for this public query must verify the caller is a member of
args.organizationId before reading fileMetadata; after obtaining authUser via
getAuthUserIdentity and before calling ctx.db.query('fileMetadata') (the
withIndex/...first() block), invoke the existing org-membership check helper
(e.g., assertUserIsOrgMember(authUser, args.organizationId) or
ensureUserInOrganization(authUser, args.organizationId)) and return null/throw
if the check fails so only org members can access the index query.
| "sourceSharepoint": "SharePoint", | ||
| "scannedPages": "gescannte Seiten" |
There was a problem hiding this comment.
Capitalize the new sidebar label for consistency.
Line [2976] uses "gescannte Seiten" while adjacent sidebar labels are title-cased nouns. Use "Gescannte Seiten" for consistent UI copy.
Suggested fix
- "scannedPages": "gescannte Seiten"
+ "scannedPages": "Gescannte Seiten"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "sourceSharepoint": "SharePoint", | |
| "scannedPages": "gescannte Seiten" | |
| "sourceSharepoint": "SharePoint", | |
| "scannedPages": "Gescannte Seiten" |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/platform/messages/de.json` around lines 2975 - 2976, The German
translation value for the key "scannedPages" is not title-cased; update the
value from "gescannte Seiten" to "Gescannte Seiten" so it matches the
capitalization style of adjacent sidebar labels (e.g., "sourceSharepoint" /
"SharePoint") and maintains UI copy consistency; locate the "scannedPages" entry
in the translations and change only its string value.
Store scannedPagesDetected and ocrApplied directly on the document record so the table list can display them without joining fileMetadata. extractFileMetadata writes scannedPagesDetected, and the RAG polling writes ocrApplied on completion. The RAG status column shows "OCR" next to the Indexed badge when OCR was applied.
The rag_service already queried ocr_applied from the database but the router dropped it when constructing DocumentStatusInfo. Add the field to the Pydantic model and pass it through in the statuses endpoint.
The upload endpoint sets status to 'processing' before indexing starts, which causes find_existing_by_hash (requires status='completed') to miss the document's own row. This led to full re-extraction and vision API calls on every re-upload of the same file. Add a file_id-based content_hash check at the top of index_document that also verifies chunks exist, so only previously completed documents skip.
This comment has been minimized.
This comment has been minimized.
- Fix line-too-long lint errors in indexing_service.py early-dedup code - Add early-dedup fetchrow mock to all RAG indexing tests - Remove fileMetadata query from document preview sidebar (use document fields directly), fixing the TS2322 type error
Summary
scanned_pages_detected,ocr_applied) to both PDF extraction pipelines (tale_knowledge and crawler service)PdfExtractionResultfrom tale_knowledge andPdfExtractionMetadatafrom the crawler so downstream consumers can inform users about OCR activitydocuments.ocr.*) inen.jsonandde.jsonPdfExtractionResultreturn type and add a new test for scanned page detection without a vision clientTest plan
scanned_pages_detectedcountNoneand scanned pages detectedCloses #1206
Summary by CodeRabbit
Release Notes
New Features
Improvements