feat(rag): optimize PDF parsing with OCR and improve chunk settings by larryro · Pull Request #34 · tale-project/tale

larryro · 2025-12-23T04:47:14Z

Summary

This PR optimizes the RAG pipeline's PDF parsing capabilities and improves document chunking settings.

Changes

PDF Extraction

Enable OCR-based PDF extraction with hi_res strategy for better text extraction from scanned documents
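For reviewers who want to try the hi_res path outside the service, here is a minimal sketch. It assumes the `unstructured` package is the backend and calls its `partition_pdf` function directly rather than going through the cognee loader configuration this PR changes. The helper names `build_partition_kwargs` and `extract_text` are hypothetical.

```python
from typing import Any


def build_partition_kwargs(languages: list[str]) -> dict[str, Any]:
    """Build keyword arguments for an OCR-backed, layout-aware PDF parse."""
    if not languages:
        raise ValueError("at least one OCR language code is required")
    return {
        "strategy": "hi_res",    # layout detection plus OCR; slower than "fast"
        "languages": languages,  # Tesseract language codes such as "deu", "eng"
    }


def extract_text(pdf_path: str, languages: list[str]) -> str:
    """Return the text of every element found in the PDF, one per line."""
    # Imported lazily so build_partition_kwargs works without the dependency.
    from unstructured.partition.pdf import partition_pdf

    elements = partition_pdf(filename=pdf_path, **build_partition_kwargs(languages))
    return "\n".join(el.text for el in elements if el.text)
```

Calling `extract_text("handbook.pdf", ["deu", "eng"])` needs the matching Tesseract language packs installed in the image; the file name is only a placeholder.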

LiteLLM/OpenRouter Compatibility

Configure LiteLLM to drop unsupported params for OpenRouter compatibility
Patch LiteLLM embedding to fix OpenRouter encoding_format error
Patch LiteLLM internal OpenAIChatCompletion.embedding method
Add **kwargs to patched embedding to support max_retries parameter

Chunk Optimization

Optimize chunk size for structured documents
Add configurable RAG_CHUNK_SIZE and RAG_CHUNK_OVERLAP environment variables
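A minimal sketch of how the two variables could be read and validated at startup. The function name `load_chunk_settings` is hypothetical, and the fallback values of 1024 and 100 are assumed to mirror the compose settings; the service's real defaults may differ.

```python
import os
from collections.abc import Mapping


def load_chunk_settings(env: Mapping[str, str] = os.environ) -> tuple[int, int]:
    """Read chunk size and overlap (in tokens) from the environment."""
    # Fallbacks are assumptions based on the values set in compose.yml.
    chunk_size = int(env.get("RAG_CHUNK_SIZE", "1024"))
    chunk_overlap = int(env.get("RAG_CHUNK_OVERLAP", "100"))

    if chunk_size <= 0:
        raise ValueError("RAG_CHUNK_SIZE must be a positive integer")
    # An overlap equal to or larger than the chunk would never advance.
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("RAG_CHUNK_OVERLAP must be >= 0 and < RAG_CHUNK_SIZE")
    return chunk_size, chunk_overlap
```

Passing a plain dict instead of `os.environ` keeps the function easy to exercise in tests.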

Testing

Verified PDF extraction with OCR works correctly
Verified embedding calls work with OpenRouter
Verified chunk settings are applied correctly

Pull Request opened by Augment Code with guidance from the PR author

Summary by CodeRabbit

New Features
- Added OCR-based text extraction for image-heavy PDFs with German and English language support
Chores
- Updated service dependencies for enhanced OCR and graphics capabilities
- Optimized RAG chunk processing configuration parameters

- Add Tesseract OCR language packs (English, German, French) to Dockerfile
- Add OpenCV dependencies (libgl1, libglib2.0-0, libsm6, libxext6, libxrender1)
- Configure cognee PDF loader to use hi_res strategy with OCR support
- Enable multi-language OCR (deu, eng) for image-based PDF extraction
- Add test scripts for PDF extraction validation

… compatibility

Add configure_litellm_drop_params() to set litellm.drop_params=True globally. This fixes embedding errors when using OpenAI-compatible APIs (like OpenRouter) that don't support the 'encoding_format' parameter.

Error fixed:
- Embedding error with model openai/text-embedding-3-large: litellm.BadRequestError: 'Invalid option: expected one of "float"|"base64"'
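A rough sketch of what such a function might look like. Only the function name and the `litellm.drop_params` flag come from this PR; the body, the boolean return value, and the log message are assumptions for illustration.

```python
import logging

logger = logging.getLogger(__name__)


def configure_litellm_drop_params() -> bool:
    """Ask LiteLLM to drop request parameters a provider does not support.

    Returns True when the flag was set, False when LiteLLM is unavailable.
    """
    try:
        import litellm
    except ImportError:
        # LiteLLM is treated as optional here, so this is not an error.
        logger.debug("litellm is not installed; skipping drop_params setup")
        return False

    litellm.drop_params = True  # module-level flag, affects later calls
    return True
```

Because the flag is global, the call has to run before the first embedding or completion request is made.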

…rror

OpenRouter's embeddings API only accepts 'float' or 'base64' for encoding_format, but LiteLLM sends unsupported values. This patch:
- Sets litellm.modify_params=True for better compatibility
- Patches both litellm.embedding and litellm.aembedding to remove the encoding_format parameter when calling non-OpenAI endpoints

This fixes the 400 error: 'Invalid option: expected one of float|base64' that occurs when processing documents with the hi_res PDF strategy.

The previous patch on litellm.aembedding didn't work because encoding_format is added internally by LiteLLM after the entry point. This fix patches the internal OpenAIChatCompletion.embedding method which receives optional_params containing encoding_format, allowing us to remove it before the API call. This properly fixes the OpenRouter 400 error: 'Invalid option: expected one of float|base64'

…rameter

- Change RAG_CHUNK_SIZE from 2048 to 1024
- Add RAG_CHUNK_OVERLAP=100 (roughly 10% overlap)

Optimized for employee handbooks and similar structured documents, where 1024-token chunks give better retrieval precision while preserving section context.
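To make the effect of the two numbers concrete, here is a small sliding-window sketch. It assumes the chunker advances by `chunk_size - overlap` tokens per step; the actual chunking is done inside cognee and may work differently, and `chunk_tokens` is a hypothetical name.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def chunk_tokens(tokens: Sequence[T], chunk_size: int = 1024, overlap: int = 100) -> list[Sequence[T]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # 924 with the settings from this PR
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With these settings a 2000-token document yields three chunks starting at tokens 0, 924, and 1848, and each neighbouring pair shares 100 tokens.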

coderabbitai · 2025-12-23T04:52:01Z

📝 Walkthrough

This PR enhances PDF text extraction within the RAG service. It adjusts environment variables in the compose configuration (CHUNK_SIZE → RAG_CHUNK_SIZE, value 2048→1024, plus RAG_CHUNK_OVERLAP=100). The Docker image receives OCR and graphics dependencies (tesseract language packs, libgl1, libglib2.0-0). Two new test scripts validate PDF extraction using PyMuPDF and container-based backends (unstructured/pypdf). The RAG service initialization adds LiteLLM configuration functions for parameter dropping and embedding patching. PDF document ingestion is modified to apply high-resolution OCR-based extraction with German and English language support.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

tale-project/poc2#393 — Overlaps with same RAG service modules including config, main, cognee_service, Dockerfile, and environment variable handling.
talecorp/poc2#31 — Modifies RAG/Cognee ingestion initialization and document ingestion paths, similar to LiteLLM patching and OCR loader configurations in this PR.

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between a3c1645 and ee3f304.

📒 Files selected for processing (6)

compose.yml
scripts/test_pdf_extraction.py
scripts/test_pdf_extraction_docker.py
services/rag/Dockerfile
services/rag/app/services/cognee/config.py
services/rag/app/services/cognee/service.py

🧰 Additional context used

📓 Path-based instructions (1)

**/*

📄 CodeRabbit inference engine (.cursor/rules/workspace_rules.mdc)

Use English only for ALL user-facing content including UI components, labels, buttons, dialogs, forms, toast messages, error messages, success messages, comments, documentation, README files, variable names, function names, and type names

Files:

services/rag/Dockerfile
services/rag/app/services/cognee/service.py
compose.yml
scripts/test_pdf_extraction_docker.py
services/rag/app/services/cognee/config.py
scripts/test_pdf_extraction.py

🧠 Learnings (1)

📚 Learning: 2025-12-19T04:29:46.183Z

Learnt from: larryro
Repo: tale-project/tale PR: 26
File: services/rag/Dockerfile:10-20
Timestamp: 2025-12-19T04:29:46.183Z
Learning: Do not pin apt package versions in Dockerfiles within the tale-project/tale repository (e.g., services/rag/Dockerfile). Rely on regularly updated base images (like python:3.11-slim) and unpinned apt packages (curl, build-essential, libpq-dev) so that security updates and compatibility are handled via base image refresh and CI/CD caching. This reduces maintenance burden; verify through CI pipelines and ensure reproducibility comes from image rebuilds rather than manual pinning.

Applied to files:

services/rag/Dockerfile

🧬 Code graph analysis (1)

scripts/test_pdf_extraction_docker.py (1)

scripts/test_pdf_extraction.py (1)

main (94-121)

🪛 Hadolint (2.14.0)

services/rag/Dockerfile

[warning] 15-15: Pin versions in apt get install. Instead of apt-get install <package> use apt-get install <package>=<version>

(DL3008)

🪛 Ruff (0.14.10)

scripts/test_pdf_extraction_docker.py

16-16: Test function parameter search_term has default argument

Remove default argument

(PT028)

78-78: Do not catch blind exception: Exception

(BLE001)

84-84: Test function parameter search_term has default argument

Remove default argument

(PT028)

120-120: Do not catch blind exception: Exception

(BLE001)

services/rag/app/services/cognee/config.py

98-98: Dynamically typed expressions (typing.Any) are disallowed in self

(ANN401)

102-102: Dynamically typed expressions (typing.Any) are disallowed in logging_obj

(ANN401)

105-105: Dynamically typed expressions (typing.Any) are disallowed in model_response

(ANN401)

107-107: Dynamically typed expressions (typing.Any) are disallowed in client

(ANN401)

108-108: Boolean-typed positional argument in function definition

(FBT001)

108-108: Boolean default positional argument in function definition

(FBT002)

109-109: Dynamically typed expressions (typing.Any) are disallowed in **kwargs

(ANN401)

110-110: Dynamically typed expressions (typing.Any) are disallowed in _patched_embedding

(ANN401)

137-137: Do not catch blind exception: Exception

(BLE001)

scripts/test_pdf_extraction.py

36-36: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (13)

services/rag/Dockerfile (1)

13-30: LGTM! OCR and OpenCV dependencies properly documented.

The added dependencies (Tesseract language packs and OpenCV system libraries) are well-documented with inline comments explaining their purpose for OCR-based PDF extraction. The unpinned apt packages align with the repository's documented practice of relying on base image updates for security patches.

Based on learnings, unpinned apt packages are the preferred approach in this repository.

scripts/test_pdf_extraction.py (5)

1-22: LGTM! Clear script structure with helpful error handling.

The shebang, docstring, and dependency check provide good UX for standalone execution. The PyMuPDF ImportError handling guides users to install the required package.

34-38: Broad exception handling is acceptable for generic PDF errors.

Catching Exception here is reasonable since PyMuPDF can raise various exception types when opening PDFs. The error is logged with context and the script exits gracefully.

47-92: LGTM! Clear text extraction and analysis logic.

The page-by-page extraction with character counting, search term filtering, and summary output provides good diagnostic information. The warning messages (lines 87-91) helpfully suggest possible causes when a search term is not found.

94-122: LGTM! Well-structured CLI with proper output redirection.

The argparse configuration is clear and the stdout redirection pattern (lines 112-119) correctly preserves and restores the original stdout. UTF-8 encoding ensures international characters are handled properly.
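The save-and-restore pattern described here can be sketched as follows. This is an illustration of the general technique, not the script's actual code, and `run_with_output_file` is a hypothetical name.

```python
import sys
from typing import Callable


def run_with_output_file(report: Callable[[], None], output_path: str) -> None:
    """Run `report` with print output redirected to a UTF-8 text file."""
    original_stdout = sys.stdout  # keep a reference so it can be restored
    try:
        with open(output_path, "w", encoding="utf-8") as handle:
            sys.stdout = handle
            report()
    finally:
        # Restore even if report() raises, so later prints reach the console.
        sys.stdout = original_stdout
```

The standard library's `contextlib.redirect_stdout` gives the same behaviour with less code; the manual version is shown because it matches the pattern the review describes.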

124-126: LGTM! Standard Python entry point.

scripts/test_pdf_extraction_docker.py (3)

16-82: LGTM! Comprehensive unstructured extraction test with helpful diagnostics.

The function provides detailed element-level analysis and search term tracking. The broad exception handling at line 78 with traceback.print_exc() is appropriate for a diagnostic script that should show full error details.

Note: The Ruff PT028 hint is a false positive—this is a utility function, not a pytest test fixture, so the default parameter is acceptable.

84-122: LGTM! Alternative extraction method with context snippets.

The pypdf extraction provides a complementary view with search term context windows (lines 108-116). The broad exception handling is appropriate for diagnostic purposes.
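A context window around each search hit can be produced along these lines. The sketch assumes case-insensitive matching and a fixed character window; the function name and the window size are hypothetical, and the script itself may differ.

```python
import re


def find_context_snippets(text: str, term: str, window: int = 50) -> list[str]:
    """Return each occurrence of `term` with `window` characters either side."""
    if not term:
        return []

    snippets = []
    # re.escape treats the term literally; offsets refer to the original text.
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        left = max(0, match.start() - window)
        right = min(len(text), match.end() + window)
        snippets.append(text[left:right])
    return snippets
```

Printing the snippets next to the page number makes it easy to see whether OCR produced the expected wording or a near miss.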

Note: PT028 hint is also a false positive here—this is a utility function.

124-147: LGTM! Straightforward argument handling for container execution.

The direct sys.argv parsing is simpler than argparse and appropriate for the intended container-based usage documented in the script header.

compose.yml (1)

205-208: LGTM! Chunking parameters optimized for structured documents.

The reduction from 2048 to 1024 tokens with 100-token overlap (~10%) aligns with the PR's goal to improve retrieval precision for structured documents like employee handbooks. The updated comments clearly explain the rationale.

services/rag/app/services/cognee/config.py (3)

50-75: LGTM! Clear LiteLLM parameter configuration.

The function configures LiteLLM to drop unsupported parameters globally, addressing OpenRouter compatibility issues. The docstring clearly explains the purpose, and the ImportError handling at debug level is appropriate for optional dependency.

77-139: LGTM! Monkey-patch correctly strips encoding_format for OpenRouter.

The patch logic is sound:

- Idempotency check prevents double-patching (lines 92-93)
- Removes encoding_format only for non-OpenAI endpoints (lines 111-115)
- Preserves all original parameters including **kwargs for extensibility
- Exception handling at line 137 is appropriate for startup-time patches

Note: The static analysis hints about Any types (ANN401) and boolean arguments (FBT001/FBT002) are expected for monkey-patching internal library methods and can be safely ignored.
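The patching pattern described above can be sketched against a stand-in class. `FakeEmbeddingClient` is not LiteLLM's real class and its signature is invented for the example; the marker attribute and the rule that a missing `api_base` means the OpenAI endpoint are likewise assumptions.

```python
from typing import Any, Optional


class FakeEmbeddingClient:
    """Stand-in for a third-party client; signature is illustrative only."""

    def embedding(self, model: str, optional_params: dict, api_base: Optional[str] = None, **kwargs: Any) -> dict:
        return {"model": model, "sent_params": dict(optional_params), "extra": kwargs}


def patch_embedding(cls: type) -> bool:
    """Wrap cls.embedding so encoding_format is dropped for non-OpenAI hosts."""
    if getattr(cls.embedding, "_encoding_format_patched", False):
        return False  # already patched; keeps the operation idempotent

    original = cls.embedding

    def _patched_embedding(self: Any, *args: Any, **kwargs: Any) -> Any:
        api_base = kwargs.get("api_base")
        # Assumption: no api_base means the default OpenAI endpoint.
        is_openai = not api_base or "api.openai.com" in api_base
        params = kwargs.get("optional_params")  # handles keyword calls only
        if isinstance(params, dict) and not is_openai:
            # Copy instead of mutating so the caller's dict is untouched.
            kwargs["optional_params"] = {k: v for k, v in params.items() if k != "encoding_format"}
        return original(self, *args, **kwargs)

    _patched_embedding._encoding_format_patched = True
    cls.embedding = _patched_embedding
    return True
```

Forwarding `**kwargs` unchanged is what lets extra arguments such as `max_retries` pass through the wrapper.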

391-417: LGTM! LiteLLM patches properly integrated into initialization flow.

The new configuration functions are called in the correct order:

- configure_litellm_drop_params() sets global flags
- _patch_litellm_embedding() patches the embedding method
- Both execute before importing cognee, ensuring patches are in place

This maintains the existing initialization pattern while adding the OpenRouter compatibility layer.


…— validator tightening

Closes round-5 findings #27, #28, #34, #35, #36.

- `tts/queries.ts` getMessageChunks return validator narrows `format` and `error` from `v.optional(v.string())` to the closed unions built from `audioFormatLiterals` and `ttsErrorCodeLiterals`. The schema's writer validator already uses those unions; the query was the only seam where a future drift could fan out unnoticed.
- `tts/queries.ts` getVoiceModeEffective now falls back to a prefix-only `userPreferences` lookup when the thread has no `organizationId` (legacy / edge rows). A user who toggled voice ON globally previously got silently-off voice on those threads.
- `lib/shared/schemas/providers.ts` `defaultVoice` and `voicesByLocale` values now reject all-whitespace strings (`.regex(/\S/)`) so `' '` no longer slips through `.min(1)` and surfaces as UNKNOWN_VOICE at synth time.
- `lib/shared/schemas/providers.ts` locale-regex docs explicitly note the narrow BCP-47 subset (ISO-639-1 + optional ISO-3166-1 alpha-2); script subtags (`zh-Hans`), 3-letter codes (`fil`), and UN region codes (`en-419`) are intentionally out of scope. Adds a follow-up pointer in the comment so future widening is a deliberate, lockstep change with the resolver.
- `lib/shared/schemas/providers.ts` superRefine now uses the `forEach` index instead of `data.models.indexOf(model)` (O(n²) → O(n)) and points the error `path` at the actually-missing field (`voicesByLocale` when the operator only typed an empty map, else `defaultVoice`), so the operator's editor jumps to the right line.

larryro and others added 6 commits December 23, 2025 11:13

fix(rag): add **kwargs to patched embedding to support max_retries pa…

922555f

…rameter

coderabbitai Bot requested changes Dec 23, 2025

View reviewed changes

Comment thread services/rag/app/services/cognee/service.py

feat(rag): add French language support to OCR configuration

23cb92d

Add 'fra' to the OCR languages list to match the tesseract-ocr-fra package already installed in the Dockerfile.

coderabbitai Bot approved these changes Dec 23, 2025

View reviewed changes

larryro merged commit 8681377 into main Dec 23, 2025
1 check passed

larryro deleted the optimize-pdf-parse branch December 23, 2025 04:56

coderabbitai Bot mentioned this pull request Dec 30, 2025

refactor(rag): Replace OCR with OpenAI Vision API to reduce Docker image size #49

Merged

7 tasks

yannickmonney pushed a commit that referenced this pull request Apr 8, 2026

feat(rag): optimize PDF parsing with OCR and improve chunk settings (#34)

63f80ee

larryro merged 7 commits into main from optimize-pdf-parse
