Skip to content

feat(rag): optimize PDF parsing with OCR and improve chunk settings#34

Merged
larryro merged 7 commits into
mainfrom
optimize-pdf-parse
Dec 23, 2025
Merged

feat(rag): optimize PDF parsing with OCR and improve chunk settings#34
larryro merged 7 commits into
mainfrom
optimize-pdf-parse

Conversation

@larryro

@larryro larryro commented Dec 23, 2025

Copy link
Copy Markdown
Collaborator

Summary

This PR optimizes the RAG pipeline's PDF parsing capabilities and improves document chunking settings.

Changes

PDF Extraction

  • Enable OCR-based PDF extraction with hi_res strategy for better text extraction from scanned documents

LiteLLM/OpenRouter Compatibility

  • Configure LiteLLM to drop unsupported params for OpenRouter compatibility
  • Patch LiteLLM embedding to fix OpenRouter encoding_format error
  • Patch LiteLLM internal OpenAIChatCompletion.embedding method
  • Add **kwargs to patched embedding to support max_retries parameter

Chunk Optimization

  • Optimize chunk size for structured documents
  • Add configurable RAG_CHUNK_SIZE and RAG_CHUNK_OVERLAP environment variables

Testing

  • Verified PDF extraction with OCR works correctly
  • Verified embedding calls work with OpenRouter
  • Verified chunk settings are applied correctly

Pull Request opened by Augment Code with guidance from the PR author

Summary by CodeRabbit

  • New Features

    • Added OCR-based text extraction for image-heavy PDFs with German and English language support
  • Chores

    • Updated service dependencies for enhanced OCR and graphics capabilities
    • Optimized RAG chunk processing configuration parameters

✏️ Tip: You can customize this high-level summary in your review settings.

larryro and others added 6 commits December 23, 2025 11:13
- Add Tesseract OCR language packs (English, German, French) to Dockerfile
- Add OpenCV dependencies (libgl1, libglib2.0-0, libsm6, libxext6, libxrender1)
- Configure cognee PDF loader to use hi_res strategy with OCR support
- Enable multi-language OCR (deu, eng) for image-based PDF extraction
- Add test scripts for PDF extraction validation
… compatibility

Add configure_litellm_drop_params() to set litellm.drop_params=True globally.
This fixes embedding errors when using OpenAI-compatible APIs (like OpenRouter)
that don't support the 'encoding_format' parameter.

Error fixed:
- Embedding error with model openai/text-embedding-3-large: litellm.BadRequestError:
  'Invalid option: expected one of "float"|"base64"'
…rror

OpenRouter's embeddings API only accepts 'float' or 'base64' for encoding_format,
but LiteLLM sends unsupported values. This patch:

- Sets litellm.modify_params=True for better compatibility
- Patches both litellm.embedding and litellm.aembedding to remove
  the encoding_format parameter when calling non-OpenAI endpoints

This fixes the 400 error: 'Invalid option: expected one of float|base64'
that occurs when processing documents with the hi_res PDF strategy.
The previous patch on litellm.aembedding didn't work because encoding_format
is added internally by LiteLLM after the entry point. This fix patches the
internal OpenAIChatCompletion.embedding method which receives optional_params
containing encoding_format, allowing us to remove it before the API call.

This properly fixes the OpenRouter 400 error:
'Invalid option: expected one of float|base64'
- Change RAG_CHUNK_SIZE from 2048 to 1024
- Add RAG_CHUNK_OVERLAP=100 (10% overlap)

Optimized for employee handbooks and similar structured documents
where 1024 tokens provides better retrieval precision while
preserving section context.
@coderabbitai

coderabbitai Bot commented Dec 23, 2025

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

This PR enhances PDF text extraction within the RAG service. It adjusts environment variables in the compose configuration (CHUNK_SIZE → RAG_CHUNK_SIZE, value 2048→1024, plus RAG_CHUNK_OVERLAP=100). The Docker image receives OCR and graphics dependencies (tesseract language packs, libgl1, libglib2.0-0). Two new test scripts validate PDF extraction using PyMuPDF and container-based backends (unstructured/pypdf). The RAG service initialization adds LiteLLM configuration functions for parameter dropping and embedding patching. PDF document ingestion is modified to apply high-resolution OCR-based extraction with German and English language support.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • tale-project/poc2#393 — Overlaps with same RAG service modules including config, main, cognee_service, Dockerfile, and environment variable handling.
  • talecorp/poc2#31 — Modifies RAG/Cognee ingestion initialization and document ingestion paths, similar to LiteLLM patching and OCR loader configurations in this PR.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between a3c1645 and ee3f304.

📒 Files selected for processing (6)
  • compose.yml
  • scripts/test_pdf_extraction.py
  • scripts/test_pdf_extraction_docker.py
  • services/rag/Dockerfile
  • services/rag/app/services/cognee/config.py
  • services/rag/app/services/cognee/service.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*

📄 CodeRabbit inference engine (.cursor/rules/workspace_rules.mdc)

Use English only for ALL user-facing content including UI components, labels, buttons, dialogs, forms, toast messages, error messages, success messages, comments, documentation, README files, variable names, function names, and type names

Files:

  • services/rag/Dockerfile
  • services/rag/app/services/cognee/service.py
  • compose.yml
  • scripts/test_pdf_extraction_docker.py
  • services/rag/app/services/cognee/config.py
  • scripts/test_pdf_extraction.py
🧠 Learnings (1)
📚 Learning: 2025-12-19T04:29:46.183Z
Learnt from: larryro
Repo: tale-project/tale PR: 26
File: services/rag/Dockerfile:10-20
Timestamp: 2025-12-19T04:29:46.183Z
Learning: Do not pin apt package versions in Dockerfiles within the tale-project/tale repository (e.g., services/rag/Dockerfile). Rely on regularly updated base images (like python:3.11-slim) and unpinned apt packages (curl, build-essential, libpq-dev) so that security updates and compatibility are handled via base image refresh and CI/CD caching. This reduces maintenance burden; verify through CI pipelines and ensure reproducibility comes from image rebuilds rather than manual pinning.

Applied to files:

  • services/rag/Dockerfile
🧬 Code graph analysis (1)
scripts/test_pdf_extraction_docker.py (1)
scripts/test_pdf_extraction.py (1)
  • main (94-121)
🪛 Hadolint (2.14.0)
services/rag/Dockerfile

[warning] 15-15: Pin versions in apt get install. Instead of apt-get install <package> use apt-get install <package>=<version>

(DL3008)

🪛 Ruff (0.14.10)
scripts/test_pdf_extraction_docker.py

16-16: Test function parameter search_term has default argument

Remove default argument

(PT028)


78-78: Do not catch blind exception: Exception

(BLE001)


84-84: Test function parameter search_term has default argument

Remove default argument

(PT028)


120-120: Do not catch blind exception: Exception

(BLE001)

services/rag/app/services/cognee/config.py

98-98: Dynamically typed expressions (typing.Any) are disallowed in self

(ANN401)


102-102: Dynamically typed expressions (typing.Any) are disallowed in logging_obj

(ANN401)


105-105: Dynamically typed expressions (typing.Any) are disallowed in model_response

(ANN401)


107-107: Dynamically typed expressions (typing.Any) are disallowed in client

(ANN401)


108-108: Boolean-typed positional argument in function definition

(FBT001)


108-108: Boolean default positional argument in function definition

(FBT002)


109-109: Dynamically typed expressions (typing.Any) are disallowed in **kwargs

(ANN401)


110-110: Dynamically typed expressions (typing.Any) are disallowed in _patched_embedding

(ANN401)


137-137: Do not catch blind exception: Exception

(BLE001)

scripts/test_pdf_extraction.py

36-36: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (13)
services/rag/Dockerfile (1)

13-30: LGTM! OCR and OpenCV dependencies properly documented.

The added dependencies (Tesseract language packs and OpenCV system libraries) are well-documented with inline comments explaining their purpose for OCR-based PDF extraction. The unpinned apt packages align with the repository's documented practice of relying on base image updates for security patches.

Based on learnings, unpinned apt packages are the preferred approach in this repository.

scripts/test_pdf_extraction.py (5)

1-22: LGTM! Clear script structure with helpful error handling.

The shebang, docstring, and dependency check provide good UX for standalone execution. The PyMuPDF ImportError handling guides users to install the required package.


34-38: Broad exception handling is acceptable for generic PDF errors.

Catching Exception here is reasonable since PyMuPDF can raise various exception types when opening PDFs. The error is logged with context and the script exits gracefully.


47-92: LGTM! Clear text extraction and analysis logic.

The page-by-page extraction with character counting, search term filtering, and summary output provides good diagnostic information. The warning messages (lines 87-91) helpfully suggest possible causes when a search term is not found.


94-122: LGTM! Well-structured CLI with proper output redirection.

The argparse configuration is clear and the stdout redirection pattern (lines 112-119) correctly preserves and restores the original stdout. UTF-8 encoding ensures international characters are handled properly.


124-126: LGTM! Standard Python entry point.

scripts/test_pdf_extraction_docker.py (3)

16-82: LGTM! Comprehensive unstructured extraction test with helpful diagnostics.

The function provides detailed element-level analysis and search term tracking. The broad exception handling at line 78 with traceback.print_exc() is appropriate for a diagnostic script that should show full error details.

Note: The Ruff PT028 hint is a false positive—this is a utility function, not a pytest test fixture, so the default parameter is acceptable.


84-122: LGTM! Alternative extraction method with context snippets.

The pypdf extraction provides a complementary view with search term context windows (lines 108-116). The broad exception handling is appropriate for diagnostic purposes.

Note: PT028 hint is also a false positive here—this is a utility function.


124-147: LGTM! Straightforward argument handling for container execution.

The direct sys.argv parsing is simpler than argparse and appropriate for the intended container-based usage documented in the script header.

compose.yml (1)

205-208: LGTM! Chunking parameters optimized for structured documents.

The reduction from 2048 to 1024 tokens with 100-token overlap (~10%) aligns with the PR's goal to improve retrieval precision for structured documents like employee handbooks. The updated comments clearly explain the rationale.

services/rag/app/services/cognee/config.py (3)

50-75: LGTM! Clear LiteLLM parameter configuration.

The function configures LiteLLM to drop unsupported parameters globally, addressing OpenRouter compatibility issues. The docstring clearly explains the purpose, and the ImportError handling at debug level is appropriate for optional dependency.


77-139: LGTM! Monkey-patch correctly strips encoding_format for OpenRouter.

The patch logic is sound:

  • Idempotency check prevents double-patching (lines 92-93)
  • Removes encoding_format only for non-OpenAI endpoints (lines 111-115)
  • Preserves all original parameters including **kwargs for extensibility
  • Exception handling at line 137 is appropriate for startup-time patches

Note: The static analysis hints about Any types (ANN401) and boolean arguments (FBT001/FBT002) are expected for monkey-patching internal library methods and can be safely ignored.


391-417: LGTM! LiteLLM patches properly integrated into initialization flow.

The new configuration functions are called in the correct order:

  1. configure_litellm_drop_params() sets global flags
  2. _patch_litellm_embedding() patches the embedding method
  3. Both execute before importing cognee, ensuring patches are in place

This maintains the existing initialization pattern while adding the OpenRouter compatibility layer.

Comment thread services/rag/app/services/cognee/service.py
Add 'fra' to the OCR languages list to match the tesseract-ocr-fra
package already installed in the Dockerfile.
@larryro larryro merged commit 8681377 into main Dec 23, 2025
1 check passed
@larryro larryro deleted the optimize-pdf-parse branch December 23, 2025 04:56
larryro added a commit that referenced this pull request May 17, 2026
…— validator tightening

Closes round-5 findings #27, #28, #34, #35, #36.

- `tts/queries.ts` getMessageChunks return validator narrows `format` and
  `error` from `v.optional(v.string())` to the closed unions built from
  `audioFormatLiterals` and `ttsErrorCodeLiterals`. The schema's writer
  validator already uses those unions; the query was the only seam where
  a future drift could fan out unnoticed.
- `tts/queries.ts` getVoiceModeEffective now falls back to a prefix-only
  `userPreferences` lookup when the thread has no `organizationId`
  (legacy / edge rows). A user who toggled voice ON globally previously
  got silently-off voice on those threads.
- `lib/shared/schemas/providers.ts` `defaultVoice` and `voicesByLocale`
  values now reject all-whitespace strings (`.regex(/\S/)`) so `'   '`
  no longer slips through `.min(1)` and surfaces as UNKNOWN_VOICE at
  synth time.
- `lib/shared/schemas/providers.ts` locale-regex docs explicitly note
  the narrow BCP-47 subset (ISO-639-1 + optional ISO-3166-1 alpha-2);
  script subtags (`zh-Hans`), 3-letter codes (`fil`), and UN region
  codes (`en-419`) are intentionally out of scope. Adds a follow-up
  pointer in the comment so future widening is a deliberate, lockstep
  change with the resolver.
- `lib/shared/schemas/providers.ts` superRefine now uses the `forEach`
  index instead of `data.models.indexOf(model)` (O(n²) → O(n)) and
  points the error `path` at the actually-missing field
  (`voicesByLocale` when the operator only typed an empty map, else
  `defaultVoice`), so the operator's editor jumps to the right line.
larryro added a commit that referenced this pull request May 17, 2026
…— validator tightening

Closes round-5 findings #27, #28, #34, #35, #36.

- `tts/queries.ts` getMessageChunks return validator narrows `format` and
  `error` from `v.optional(v.string())` to the closed unions built from
  `audioFormatLiterals` and `ttsErrorCodeLiterals`. The schema's writer
  validator already uses those unions; the query was the only seam where
  a future drift could fan out unnoticed.
- `tts/queries.ts` getVoiceModeEffective now falls back to a prefix-only
  `userPreferences` lookup when the thread has no `organizationId`
  (legacy / edge rows). A user who toggled voice ON globally previously
  got silently-off voice on those threads.
- `lib/shared/schemas/providers.ts` `defaultVoice` and `voicesByLocale`
  values now reject all-whitespace strings (`.regex(/\S/)`) so `'   '`
  no longer slips through `.min(1)` and surfaces as UNKNOWN_VOICE at
  synth time.
- `lib/shared/schemas/providers.ts` locale-regex docs explicitly note
  the narrow BCP-47 subset (ISO-639-1 + optional ISO-3166-1 alpha-2);
  script subtags (`zh-Hans`), 3-letter codes (`fil`), and UN region
  codes (`en-419`) are intentionally out of scope. Adds a follow-up
  pointer in the comment so future widening is a deliberate, lockstep
  change with the resolver.
- `lib/shared/schemas/providers.ts` superRefine now uses the `forEach`
  index instead of `data.models.indexOf(model)` (O(n²) → O(n)) and
  points the error `path` at the actually-missing field
  (`voicesByLocale` when the operator only typed an empty map, else
  `defaultVoice`), so the operator's editor jumps to the right line.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant