Goal
Reduce RAG Docker image from 9.37GB to ~2GB by replacing Tesseract OCR + unstructured hi_res strategy with OpenAI Vision API.
Current Problem
- Heavy system deps: tesseract-ocr, poppler-utils, OpenCV libs (~600MB)
- Heavy Python deps: unstructured-inference, ONNX models (~2-3GB)
- Causes 9.37GB Docker image vs 2.77GB for platform service
What the current hi_res strategy does
- Layout detection - Uses ONNX models (Detectron2/YOLOX) to identify regions (text, tables, images)
- OCR for scanned PDFs - Uses Tesseract to extract text from image-based pages
- Image extraction - Identifies embedded images but doesn't understand their content
Proposed Solution: Smart Hybrid Approach
Use PyMuPDF (no system deps) for PDF processing with selective Vision API:
- Digital PDFs: Extract text directly using PyMuPDF (no API calls)
- Scanned PDFs: Detect low-text pages → send to Vision API for OCR
- Embedded images: Extract images → send to Vision API for descriptions
- Combine extracted text + image descriptions → feed to Cognee
Benefits over full-page Vision approach:
- Lower API costs (skip Vision for text-heavy pages)
- Faster processing (direct text extraction is instant)
- Still removes all heavy OCR dependencies
PDF Processing Logic
For each page in PDF:
1. Extract text using PyMuPDF (fitz.Page.get_text())
2. Check if page is scanned:
- If len(text) < MIN_TEXT_THRESHOLD (50 chars):
- Render page as image (fitz.Page.get_pixmap())
- Send to Vision API for OCR
- Use OCR result instead of empty text
3. Extract embedded images from page:
- Get images via fitz.Page.get_images()
- For each image > MIN_IMAGE_SIZE:
- Extract image bytes
- Send to Vision API for description
- Append: "\n[Image: {description}]\n"
4. Combine: page_text + image_descriptions
Final output: All pages combined with page markers
Files to Create
1. services/rag/app/services/vision/__init__.py
- Export
extract_text_from_document
2. services/rag/app/services/vision/processor.py
- Main entry point:
extract_text_from_document(file_path) -> (text, was_processed)
- Route PDF/images to Vision API, pass through DOCX/PPTX/XLSX
3. services/rag/app/services/vision/pdf_extractor.py
Smart PDF processing with selective Vision API usage:
- Use PyMuPDF to extract text directly from each page
- Scanned page detection: If page text < threshold (e.g., 50 chars), render as image → Vision API
- Image extraction: Extract embedded images from pages → Vision API for descriptions
- Combine: page text + OCR'd scanned pages + image descriptions
- Concurrent Vision API calls (semaphore-limited)
4. services/rag/app/services/vision/image_extractor.py
- Handle direct image files (PNG, JPG, etc.)
5. services/rag/app/services/vision/openai_client.py
- Vision API wrapper with retry logic
ocr_image(image_bytes) -> str - Extract text from scanned page
describe_image(image_bytes) -> str - Generate description of photo/chart/diagram
Files to Modify
1. services/rag/Dockerfile
Remove:
tesseract-ocr \
tesseract-ocr-eng \
tesseract-ocr-deu \
tesseract-ocr-fra \
poppler-utils \
libgl1 \
libglib2.0-0 \
libsm6 \
libxext6 \
libxrender1 \
Keep:
curl \
build-essential \
libpq-dev \
libmagic1 \
2. services/rag/requirements.txt
- unstructured[xlsx,docx,pptx,pdf]==0.18.21
+ unstructured[xlsx,docx,pptx]==0.18.21
+ PyMuPDF>=1.24.0
3. services/rag/app/config.py
Add Vision settings:
# Vision API Configuration
openai_vision_model: str = "qwen/qwen3-vl-32b-instruct" # Default vision model
vision_max_concurrent_pages: int = 5
vision_pdf_dpi: int = 150
vision_extraction_prompt: str = "Extract ALL text from this document image..."
def get_vision_model(self) -> str:
return self.openai_vision_model or os.environ.get("OPENAI_VISION_MODEL") or "qwen/qwen3-vl-32b-instruct"
4. services/rag/app/services/cognee/service.py
Modify add_document():
- Import and call Vision pre-processor for PDFs/images
- Save extracted text to temp file
- Pass text file to
cognee.add() (no preferred_loaders needed)
- Remove
hi_res strategy configuration
Implementation Sequence
-
Create Vision module (services/rag/app/services/vision/)
- openai_client.py → pdf_extractor.py → image_extractor.py → processor.py
-
Add config (services/rag/app/config.py)
- Vision model, DPI, concurrency settings
-
Integrate into Cognee service (services/rag/app/services/cognee/service.py)
- Pre-process PDFs/images before cognee.add()
-
Update dependencies (requirements.txt, Dockerfile)
- Remove heavy OCR deps, add PyMuPDF
-
Build and test
- Verify image size ~2GB
- Test PDF, DOCX, PPTX, XLSX processing
Key Design Decisions
| Decision |
Choice |
Rationale |
| PDF to image library |
PyMuPDF |
No system deps, 17-24MB wheel, faster than pdf2image |
| Integration approach |
Pre-processor |
Cleaner than hacking Cognee loaders, full control |
| Concurrency |
Semaphore (5 pages) |
Prevent rate limiting on large PDFs |
| Vision model |
qwen/qwen3-vl-32b-instruct |
Cost-effective multi-language vision model |
| Image DPI |
150 |
Balance quality vs token usage |
Expected Results
| Metric |
Before |
After |
| Docker image size |
9.37GB |
~2GB |
| System dependencies |
10+ packages |
4 packages |
| Python OCR packages |
unstructured-inference, ONNX |
PyMuPDF only |
| OCR location |
Self-hosted (Tesseract) |
Cloud (Vision API) |
Risks & Mitigations
- API costs: ~$0.01-0.03 per PDF page (mitigated by smart hybrid - only scanned pages/images use API)
- Rate limits: Semaphore concurrency control (5 concurrent calls)
- API downtime: Clear error messages, no silent fallback
Goal
Reduce RAG Docker image from 9.37GB to ~2GB by replacing Tesseract OCR + unstructured
hi_resstrategy with OpenAI Vision API.Current Problem
What the current
hi_resstrategy doesProposed Solution: Smart Hybrid Approach
Use PyMuPDF (no system deps) for PDF processing with selective Vision API:
Benefits over full-page Vision approach:
PDF Processing Logic
Files to Create
1.
services/rag/app/services/vision/__init__.pyextract_text_from_document2.
services/rag/app/services/vision/processor.pyextract_text_from_document(file_path) -> (text, was_processed)3.
services/rag/app/services/vision/pdf_extractor.pySmart PDF processing with selective Vision API usage:
4.
services/rag/app/services/vision/image_extractor.py5.
services/rag/app/services/vision/openai_client.pyocr_image(image_bytes) -> str- Extract text from scanned pagedescribe_image(image_bytes) -> str- Generate description of photo/chart/diagramFiles to Modify
1.
services/rag/DockerfileRemove:
Keep:
2.
services/rag/requirements.txt3.
services/rag/app/config.pyAdd Vision settings:
4.
services/rag/app/services/cognee/service.pyModify
add_document():cognee.add()(nopreferred_loadersneeded)hi_resstrategy configurationImplementation Sequence
Create Vision module (
services/rag/app/services/vision/)Add config (services/rag/app/config.py)
Integrate into Cognee service (services/rag/app/services/cognee/service.py)
Update dependencies (requirements.txt, Dockerfile)
Build and test
Key Design Decisions
Expected Results
Risks & Mitigations