Skip to content

refactor(rag): Replace self-hosted OCR with OpenAI Vision API to reduce Docker image size #43

@larryro

Description

@larryro

Goal

Reduce RAG Docker image from 9.37GB to ~2GB by replacing Tesseract OCR + unstructured hi_res strategy with OpenAI Vision API.

Current Problem

  • Heavy system deps: tesseract-ocr, poppler-utils, OpenCV libs (~600MB)
  • Heavy Python deps: unstructured-inference, ONNX models (~2-3GB)
  • Causes 9.37GB Docker image vs 2.77GB for platform service

What the current hi_res strategy does

  1. Layout detection - Uses ONNX models (Detectron2/YOLOX) to identify regions (text, tables, images)
  2. OCR for scanned PDFs - Uses Tesseract to extract text from image-based pages
  3. Image extraction - Identifies embedded images but doesn't understand their content

Proposed Solution: Smart Hybrid Approach

Use PyMuPDF (no system deps) for PDF processing with selective Vision API:

  1. Digital PDFs: Extract text directly using PyMuPDF (no API calls)
  2. Scanned PDFs: Detect low-text pages → send to Vision API for OCR
  3. Embedded images: Extract images → send to Vision API for descriptions
  4. Combine extracted text + image descriptions → feed to Cognee

Benefits over full-page Vision approach:

  • Lower API costs (skip Vision for text-heavy pages)
  • Faster processing (direct text extraction is instant)
  • Still removes all heavy OCR dependencies

PDF Processing Logic

For each page in PDF:
    1. Extract text using PyMuPDF (fitz.Page.get_text())

    2. Check if page is scanned:
       - If len(text) < MIN_TEXT_THRESHOLD (50 chars):
         - Render page as image (fitz.Page.get_pixmap())
         - Send to Vision API for OCR
         - Use OCR result instead of empty text

    3. Extract embedded images from page:
       - Get images via fitz.Page.get_images()
       - For each image > MIN_IMAGE_SIZE:
         - Extract image bytes
         - Send to Vision API for description
         - Append: "\n[Image: {description}]\n"

    4. Combine: page_text + image_descriptions

Final output: All pages combined with page markers

Files to Create

1. services/rag/app/services/vision/__init__.py

  • Export extract_text_from_document

2. services/rag/app/services/vision/processor.py

  • Main entry point: extract_text_from_document(file_path) -> (text, was_processed)
  • Route PDF/images to Vision API, pass through DOCX/PPTX/XLSX

3. services/rag/app/services/vision/pdf_extractor.py

Smart PDF processing with selective Vision API usage:

  • Use PyMuPDF to extract text directly from each page
  • Scanned page detection: If page text < threshold (e.g., 50 chars), render as image → Vision API
  • Image extraction: Extract embedded images from pages → Vision API for descriptions
  • Combine: page text + OCR'd scanned pages + image descriptions
  • Concurrent Vision API calls (semaphore-limited)

4. services/rag/app/services/vision/image_extractor.py

  • Handle direct image files (PNG, JPG, etc.)

5. services/rag/app/services/vision/openai_client.py

  • Vision API wrapper with retry logic
  • ocr_image(image_bytes) -> str - Extract text from scanned page
  • describe_image(image_bytes) -> str - Generate description of photo/chart/diagram

Files to Modify

1. services/rag/Dockerfile

Remove:

tesseract-ocr \
tesseract-ocr-eng \
tesseract-ocr-deu \
tesseract-ocr-fra \
poppler-utils \
libgl1 \
libglib2.0-0 \
libsm6 \
libxext6 \
libxrender1 \

Keep:

curl \
build-essential \
libpq-dev \
libmagic1 \

2. services/rag/requirements.txt

- unstructured[xlsx,docx,pptx,pdf]==0.18.21
+ unstructured[xlsx,docx,pptx]==0.18.21
+ PyMuPDF>=1.24.0

3. services/rag/app/config.py

Add Vision settings:

# Vision API Configuration
openai_vision_model: str = "qwen/qwen3-vl-32b-instruct"  # Default vision model
vision_max_concurrent_pages: int = 5
vision_pdf_dpi: int = 150
vision_extraction_prompt: str = "Extract ALL text from this document image..."

def get_vision_model(self) -> str:
    return self.openai_vision_model or os.environ.get("OPENAI_VISION_MODEL") or "qwen/qwen3-vl-32b-instruct"

4. services/rag/app/services/cognee/service.py

Modify add_document():

  1. Import and call Vision pre-processor for PDFs/images
  2. Save extracted text to temp file
  3. Pass text file to cognee.add() (no preferred_loaders needed)
  4. Remove hi_res strategy configuration

Implementation Sequence

  1. Create Vision module (services/rag/app/services/vision/)

    • openai_client.py → pdf_extractor.py → image_extractor.py → processor.py
  2. Add config (services/rag/app/config.py)

    • Vision model, DPI, concurrency settings
  3. Integrate into Cognee service (services/rag/app/services/cognee/service.py)

    • Pre-process PDFs/images before cognee.add()
  4. Update dependencies (requirements.txt, Dockerfile)

    • Remove heavy OCR deps, add PyMuPDF
  5. Build and test

    • Verify image size ~2GB
    • Test PDF, DOCX, PPTX, XLSX processing

Key Design Decisions

Decision Choice Rationale
PDF to image library PyMuPDF No system deps, 17-24MB wheel, faster than pdf2image
Integration approach Pre-processor Cleaner than hacking Cognee loaders, full control
Concurrency Semaphore (5 pages) Prevent rate limiting on large PDFs
Vision model qwen/qwen3-vl-32b-instruct Cost-effective multi-language vision model
Image DPI 150 Balance quality vs token usage

Expected Results

Metric Before After
Docker image size 9.37GB ~2GB
System dependencies 10+ packages 4 packages
Python OCR packages unstructured-inference, ONNX PyMuPDF only
OCR location Self-hosted (Tesseract) Cloud (Vision API)

Risks & Mitigations

  • API costs: ~$0.01-0.03 per PDF page (mitigated by smart hybrid - only scanned pages/images use API)
  • Rate limits: Semaphore concurrency control (5 concurrent calls)
  • API downtime: Clear error messages, no silent fallback

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions