feat(platform): surface OCR visibility with scanned page detection #1401
@@ -251,6 +251,20 @@ async def extract_pdf_metadata(file: UploadFile = _FILE_UPLOAD):
         doc = fitz.open(stream=file_bytes, filetype="pdf")
         raw = doc.metadata or {}
         page_count = len(doc)
+
+        large_image_ratio = 0.5
+        scanned_count = 0
+        for page_num in range(page_count):
+            page = doc[page_num]
+            page_area = page.rect.get_area()
+            if page_area <= 0:
+                continue
+            for img in page.get_images(full=True):
+                bbox = page.get_image_bbox(img)
+                if bbox and bbox.get_area() / page_area > large_image_ratio:
+                    scanned_count += 1
+                    break
+
Comment on lines +255 to +267
Contributor:
Use the same scanned-page heuristic as the extraction pipeline. This endpoint only applies the large-image-area check, but the PR’s extraction path also uses low-text detection. That means …
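A minimal sketch of how the two code paths could share one heuristic, assuming the extraction pipeline treats a page as scanned when either a large image covers it or it has very little extractable text. `PageStats`, `is_scanned_page`, `stats_from_page`, `count_scanned_pages`, and the `min_text_chars` threshold are hypothetical names and values, not taken from the PR; only `large_image_ratio = 0.5` and the PyMuPDF page calls mirror the diff. Whether the real pipeline combines the two checks with "or" is also an assumption.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PageStats:
    """Per-page measurements the heuristic needs (hypothetical container)."""
    page_area: float
    largest_image_area: float
    text_chars: int


def is_scanned_page(
    stats: PageStats,
    large_image_ratio: float = 0.5,  # same value as in the diff
    min_text_chars: int = 20,        # assumed threshold, not from the PR
) -> bool:
    """Return True if the page looks scanned under either check."""
    if stats.page_area <= 0:
        return False  # degenerate page: skip, as the endpoint does
    covered = stats.largest_image_area / stats.page_area > large_image_ratio
    low_text = stats.text_chars < min_text_chars
    # Assumption: the two signals are combined with "or".
    return covered or low_text


def stats_from_page(page) -> PageStats:
    """Collect stats from a PyMuPDF page (or any object with the same methods)."""
    largest = 0.0
    for img in page.get_images(full=True):
        bbox = page.get_image_bbox(img)
        if bbox:
            largest = max(largest, bbox.get_area())
    text_chars = len(page.get_text().strip())
    return PageStats(page.rect.get_area(), largest, text_chars)


def count_scanned_pages(all_stats: Iterable[PageStats]) -> int:
    """Count pages flagged as scanned; both callers would use this."""
    return sum(1 for stats in all_stats if is_scanned_page(stats))
```

With a helper like this, the metadata endpoint and the extraction path would report the same `scanned_pages_detected` value for the same file, because neither re-implements the thresholds on its own.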
         doc.close()
 
         return FileMetadataResponse(
@@ -262,6 +276,7 @@ async def extract_pdf_metadata(file: UploadFile = _FILE_UPLOAD):
             page_count=page_count,
             created_at=_parse_pdf_date(raw.get("creationDate")),
             modified_at=_parse_pdf_date(raw.get("modDate")),
+            scanned_pages_detected=scanned_count,
         )
     except HTTPException:
         raise
🧹 Nitpick | 🔵 Trivial
Assert the warning path in this regression test.
This test proves the counters, but the new behavior also relies on emitting a warning when a scanned page is detected without a vision client. Add a log assertion here so that path is covered in CI instead of staying manual-only.
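A rough sketch of what such a log assertion could look like, using only the standard library. The `process_pages` function, the `"extraction"` logger name, the counter keys, and the warning text are hypothetical stand-ins for the code under test, which is not shown in this diff; the point is only the `assertLogs` pattern.

```python
import logging
import unittest

# Hypothetical logger name; the real module may use a different one.
logger = logging.getLogger("extraction")


def process_pages(scanned_flags, vision_client=None):
    """Hypothetical stand-in for the extraction path under test."""
    counters = {"scanned": 0, "ocr": 0}
    for page_num, is_scanned in enumerate(scanned_flags, start=1):
        if not is_scanned:
            continue
        counters["scanned"] += 1
        if vision_client is None:
            # The behavior the review asks to cover: warn, do not fail.
            logger.warning(
                "Page %d looks scanned but no vision client is configured",
                page_num,
            )
        else:
            counters["ocr"] += 1
    return counters


class ScannedPageWarningTest(unittest.TestCase):
    def test_warns_without_vision_client(self):
        # assertLogs fails the test if no WARNING is emitted in the block.
        with self.assertLogs("extraction", level="WARNING") as captured:
            counters = process_pages([False, True], vision_client=None)

        self.assertEqual(counters, {"scanned": 1, "ocr": 0})
        self.assertEqual(len(captured.records), 1)
        self.assertIn("no vision client", captured.records[0].getMessage())
```

If the project's tests run under pytest, the `caplog` fixture would serve the same purpose; either way the warning becomes part of the regression test instead of something checked by hand.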