feat(rag): optimize PDF parsing with OCR and improve chunk settings#34
Conversation
- Add Tesseract OCR language packs (English, German, French) to Dockerfile - Add OpenCV dependencies (libgl1, libglib2.0-0, libsm6, libxext6, libxrender1) - Configure cognee PDF loader to use hi_res strategy with OCR support - Enable multi-language OCR (deu, eng) for image-based PDF extraction - Add test scripts for PDF extraction validation
… compatibility Add configure_litellm_drop_params() to set litellm.drop_params=True globally. This fixes embedding errors when using OpenAI-compatible APIs (like OpenRouter) that don't support the 'encoding_format' parameter. Error fixed: - Embedding error with model openai/text-embedding-3-large: litellm.BadRequestError: 'Invalid option: expected one of "float"|"base64"'
…rror OpenRouter's embeddings API only accepts 'float' or 'base64' for encoding_format, but LiteLLM sends unsupported values. This patch: - Sets litellm.modify_params=True for better compatibility - Patches both litellm.embedding and litellm.aembedding to remove the encoding_format parameter when calling non-OpenAI endpoints This fixes the 400 error: 'Invalid option: expected one of float|base64' that occurs when processing documents with the hi_res PDF strategy.
The previous patch on litellm.aembedding didn't work because encoding_format is added internally by LiteLLM after the entry point. This fix patches the internal OpenAIChatCompletion.embedding method which receives optional_params containing encoding_format, allowing us to remove it before the API call. This properly fixes the OpenRouter 400 error: 'Invalid option: expected one of float|base64'
- Change RAG_CHUNK_SIZE from 2048 to 1024 - Add RAG_CHUNK_OVERLAP=100 (10% overlap) Optimized for employee handbooks and similar structured documents where 1024 tokens provides better retrieval precision while preserving section context.
📝 WalkthroughWalkthroughThis PR enhances PDF text extraction within the RAG service. It adjusts environment variables in the compose configuration (CHUNK_SIZE → RAG_CHUNK_SIZE, value 2048→1024, plus RAG_CHUNK_OVERLAP=100). The Docker image receives OCR and graphics dependencies (tesseract language packs, libgl1, libglib2.0-0). Two new test scripts validate PDF extraction using PyMuPDF and container-based backends (unstructured/pypdf). The RAG service initialization adds LiteLLM configuration functions for parameter dropping and embedding patching. PDF document ingestion is modified to apply high-resolution OCR-based extraction with German and English language support. Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related PRs
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro (Legacy)
📒 Files selected for processing (6)
compose.ymlscripts/test_pdf_extraction.pyscripts/test_pdf_extraction_docker.pyservices/rag/Dockerfileservices/rag/app/services/cognee/config.pyservices/rag/app/services/cognee/service.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*
📄 CodeRabbit inference engine (.cursor/rules/workspace_rules.mdc)
Use English only for ALL user-facing content including UI components, labels, buttons, dialogs, forms, toast messages, error messages, success messages, comments, documentation, README files, variable names, function names, and type names
Files:
services/rag/Dockerfileservices/rag/app/services/cognee/service.pycompose.ymlscripts/test_pdf_extraction_docker.pyservices/rag/app/services/cognee/config.pyscripts/test_pdf_extraction.py
🧠 Learnings (1)
📚 Learning: 2025-12-19T04:29:46.183Z
Learnt from: larryro
Repo: tale-project/tale PR: 26
File: services/rag/Dockerfile:10-20
Timestamp: 2025-12-19T04:29:46.183Z
Learning: Do not pin apt package versions in Dockerfiles within the tale-project/tale repository (e.g., services/rag/Dockerfile). Rely on regularly updated base images (like python:3.11-slim) and unpinned apt packages (curl, build-essential, libpq-dev) so that security updates and compatibility are handled via base image refresh and CI/CD caching. This reduces maintenance burden; verify through CI pipelines and ensure reproducibility comes from image rebuilds rather than manual pinning.
Applied to files:
services/rag/Dockerfile
🧬 Code graph analysis (1)
scripts/test_pdf_extraction_docker.py (1)
scripts/test_pdf_extraction.py (1)
main(94-121)
🪛 Hadolint (2.14.0)
services/rag/Dockerfile
[warning] 15-15: Pin versions in apt get install. Instead of apt-get install <package> use apt-get install <package>=<version>
(DL3008)
🪛 Ruff (0.14.10)
scripts/test_pdf_extraction_docker.py
16-16: Test function parameter search_term has default argument
Remove default argument
(PT028)
78-78: Do not catch blind exception: Exception
(BLE001)
84-84: Test function parameter search_term has default argument
Remove default argument
(PT028)
120-120: Do not catch blind exception: Exception
(BLE001)
services/rag/app/services/cognee/config.py
98-98: Dynamically typed expressions (typing.Any) are disallowed in self
(ANN401)
102-102: Dynamically typed expressions (typing.Any) are disallowed in logging_obj
(ANN401)
105-105: Dynamically typed expressions (typing.Any) are disallowed in model_response
(ANN401)
107-107: Dynamically typed expressions (typing.Any) are disallowed in client
(ANN401)
108-108: Boolean-typed positional argument in function definition
(FBT001)
108-108: Boolean default positional argument in function definition
(FBT002)
109-109: Dynamically typed expressions (typing.Any) are disallowed in **kwargs
(ANN401)
110-110: Dynamically typed expressions (typing.Any) are disallowed in _patched_embedding
(ANN401)
137-137: Do not catch blind exception: Exception
(BLE001)
scripts/test_pdf_extraction.py
36-36: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (13)
services/rag/Dockerfile (1)
13-30: LGTM! OCR and OpenCV dependencies properly documented.The added dependencies (Tesseract language packs and OpenCV system libraries) are well-documented with inline comments explaining their purpose for OCR-based PDF extraction. The unpinned apt packages align with the repository's documented practice of relying on base image updates for security patches.
Based on learnings, unpinned apt packages are the preferred approach in this repository.
scripts/test_pdf_extraction.py (5)
1-22: LGTM! Clear script structure with helpful error handling.The shebang, docstring, and dependency check provide good UX for standalone execution. The PyMuPDF ImportError handling guides users to install the required package.
34-38: Broad exception handling is acceptable for generic PDF errors.Catching
Exceptionhere is reasonable since PyMuPDF can raise various exception types when opening PDFs. The error is logged with context and the script exits gracefully.
47-92: LGTM! Clear text extraction and analysis logic.The page-by-page extraction with character counting, search term filtering, and summary output provides good diagnostic information. The warning messages (lines 87-91) helpfully suggest possible causes when a search term is not found.
94-122: LGTM! Well-structured CLI with proper output redirection.The argparse configuration is clear and the stdout redirection pattern (lines 112-119) correctly preserves and restores the original stdout. UTF-8 encoding ensures international characters are handled properly.
124-126: LGTM! Standard Python entry point.scripts/test_pdf_extraction_docker.py (3)
16-82: LGTM! Comprehensive unstructured extraction test with helpful diagnostics.The function provides detailed element-level analysis and search term tracking. The broad exception handling at line 78 with
traceback.print_exc()is appropriate for a diagnostic script that should show full error details.Note: The Ruff PT028 hint is a false positive—this is a utility function, not a pytest test fixture, so the default parameter is acceptable.
84-122: LGTM! Alternative extraction method with context snippets.The pypdf extraction provides a complementary view with search term context windows (lines 108-116). The broad exception handling is appropriate for diagnostic purposes.
Note: PT028 hint is also a false positive here—this is a utility function.
124-147: LGTM! Straightforward argument handling for container execution.The direct sys.argv parsing is simpler than argparse and appropriate for the intended container-based usage documented in the script header.
compose.yml (1)
205-208: LGTM! Chunking parameters optimized for structured documents.The reduction from 2048 to 1024 tokens with 100-token overlap (~10%) aligns with the PR's goal to improve retrieval precision for structured documents like employee handbooks. The updated comments clearly explain the rationale.
services/rag/app/services/cognee/config.py (3)
50-75: LGTM! Clear LiteLLM parameter configuration.The function configures LiteLLM to drop unsupported parameters globally, addressing OpenRouter compatibility issues. The docstring clearly explains the purpose, and the ImportError handling at debug level is appropriate for optional dependency.
77-139: LGTM! Monkey-patch correctly strips encoding_format for OpenRouter.The patch logic is sound:
- Idempotency check prevents double-patching (lines 92-93)
- Removes
encoding_formatonly for non-OpenAI endpoints (lines 111-115)- Preserves all original parameters including
**kwargsfor extensibility- Exception handling at line 137 is appropriate for startup-time patches
Note: The static analysis hints about
Anytypes (ANN401) and boolean arguments (FBT001/FBT002) are expected for monkey-patching internal library methods and can be safely ignored.
391-417: LGTM! LiteLLM patches properly integrated into initialization flow.The new configuration functions are called in the correct order:
configure_litellm_drop_params()sets global flags_patch_litellm_embedding()patches the embedding method- Both execute before importing cognee, ensuring patches are in place
This maintains the existing initialization pattern while adding the OpenRouter compatibility layer.
Add 'fra' to the OCR languages list to match the tesseract-ocr-fra package already installed in the Dockerfile.
…— validator tightening Closes round-5 findings #27, #28, #34, #35, #36. - `tts/queries.ts` getMessageChunks return validator narrows `format` and `error` from `v.optional(v.string())` to the closed unions built from `audioFormatLiterals` and `ttsErrorCodeLiterals`. The schema's writer validator already uses those unions; the query was the only seam where a future drift could fan out unnoticed. - `tts/queries.ts` getVoiceModeEffective now falls back to a prefix-only `userPreferences` lookup when the thread has no `organizationId` (legacy / edge rows). A user who toggled voice ON globally previously got silently-off voice on those threads. - `lib/shared/schemas/providers.ts` `defaultVoice` and `voicesByLocale` values now reject all-whitespace strings (`.regex(/\S/)`) so `' '` no longer slips through `.min(1)` and surfaces as UNKNOWN_VOICE at synth time. - `lib/shared/schemas/providers.ts` locale-regex docs explicitly note the narrow BCP-47 subset (ISO-639-1 + optional ISO-3166-1 alpha-2); script subtags (`zh-Hans`), 3-letter codes (`fil`), and UN region codes (`en-419`) are intentionally out of scope. Adds a follow-up pointer in the comment so future widening is a deliberate, lockstep change with the resolver. - `lib/shared/schemas/providers.ts` superRefine now uses the `forEach` index instead of `data.models.indexOf(model)` (O(n²) → O(n)) and points the error `path` at the actually-missing field (`voicesByLocale` when the operator only typed an empty map, else `defaultVoice`), so the operator's editor jumps to the right line.
…— validator tightening Closes round-5 findings #27, #28, #34, #35, #36. - `tts/queries.ts` getMessageChunks return validator narrows `format` and `error` from `v.optional(v.string())` to the closed unions built from `audioFormatLiterals` and `ttsErrorCodeLiterals`. The schema's writer validator already uses those unions; the query was the only seam where a future drift could fan out unnoticed. - `tts/queries.ts` getVoiceModeEffective now falls back to a prefix-only `userPreferences` lookup when the thread has no `organizationId` (legacy / edge rows). A user who toggled voice ON globally previously got silently-off voice on those threads. - `lib/shared/schemas/providers.ts` `defaultVoice` and `voicesByLocale` values now reject all-whitespace strings (`.regex(/\S/)`) so `' '` no longer slips through `.min(1)` and surfaces as UNKNOWN_VOICE at synth time. - `lib/shared/schemas/providers.ts` locale-regex docs explicitly note the narrow BCP-47 subset (ISO-639-1 + optional ISO-3166-1 alpha-2); script subtags (`zh-Hans`), 3-letter codes (`fil`), and UN region codes (`en-419`) are intentionally out of scope. Adds a follow-up pointer in the comment so future widening is a deliberate, lockstep change with the resolver. - `lib/shared/schemas/providers.ts` superRefine now uses the `forEach` index instead of `data.models.indexOf(model)` (O(n²) → O(n)) and points the error `path` at the actually-missing field (`voicesByLocale` when the operator only typed an empty map, else `defaultVoice`), so the operator's editor jumps to the right line.
Summary
This PR optimizes the RAG pipeline's PDF parsing capabilities and improves document chunking settings.
Changes
PDF Extraction
hi_resstrategy for better text extraction from scanned documentsLiteLLM/OpenRouter Compatibility
encoding_formaterrorOpenAIChatCompletion.embeddingmethod**kwargsto patched embedding to supportmax_retriesparameterChunk Optimization
RAG_CHUNK_SIZEandRAG_CHUNK_OVERLAPenvironment variablesTesting
Pull Request opened by Augment Code with guidance from the PR author
Summary by CodeRabbit
New Features
Chores
✏️ Tip: You can customize this high-level summary in your review settings.