refactor(rag): Replace self-hosted OCR with OpenAI Vision API to reduce Docker image size

## Goal
Reduce RAG Docker image from **9.37GB** to **~2GB** by replacing Tesseract OCR + unstructured `hi_res` strategy with OpenAI Vision API.

## Current Problem
- Heavy system deps: tesseract-ocr, poppler-utils, OpenCV libs (~600MB)
- Heavy Python deps: unstructured-inference, ONNX models (~2-3GB)
- Causes 9.37GB Docker image vs 2.77GB for platform service

## What the current `hi_res` strategy does
1. **Layout detection** - Uses ONNX models (Detectron2/YOLOX) to identify regions (text, tables, images)
2. **OCR for scanned PDFs** - Uses Tesseract to extract text from image-based pages
3. **Image extraction** - Identifies embedded images but doesn't understand their content

## Proposed Solution: Smart Hybrid Approach

Use **PyMuPDF** (no system deps) for PDF processing with selective Vision API:

1. **Digital PDFs**: Extract text directly using PyMuPDF (no API calls)
2. **Scanned PDFs**: Detect low-text pages → send to Vision API for OCR
3. **Embedded images**: Extract images → send to Vision API for descriptions
4. Combine extracted text + image descriptions → feed to Cognee

**Benefits over full-page Vision approach:**
- Lower API costs (skip Vision for text-heavy pages)
- Faster processing (direct text extraction is instant)
- Still removes all heavy OCR dependencies

---

## PDF Processing Logic

```
For each page in PDF:
    1. Extract text using PyMuPDF (fitz.Page.get_text())

    2. Check if page is scanned:
       - If len(text) < MIN_TEXT_THRESHOLD (50 chars):
         - Render page as image (fitz.Page.get_pixmap())
         - Send to Vision API for OCR
         - Use OCR result instead of empty text

    3. Extract embedded images from page:
       - Get images via fitz.Page.get_images()
       - For each image > MIN_IMAGE_SIZE:
         - Extract image bytes
         - Send to Vision API for description
         - Append: "\n[Image: {description}]\n"

    4. Combine: page_text + image_descriptions

Final output: All pages combined with page markers
```

---

## Files to Create

### 1. `services/rag/app/services/vision/__init__.py`
- Export `extract_text_from_document`

### 2. `services/rag/app/services/vision/processor.py`
- Main entry point: `extract_text_from_document(file_path) -> (text, was_processed)`
- Route PDF/images to Vision API, pass through DOCX/PPTX/XLSX

### 3. `services/rag/app/services/vision/pdf_extractor.py`
Smart PDF processing with selective Vision API usage:
- Use PyMuPDF to extract text directly from each page
- **Scanned page detection**: If page text < threshold (e.g., 50 chars), render as image → Vision API
- **Image extraction**: Extract embedded images from pages → Vision API for descriptions
- Combine: page text + OCR'd scanned pages + image descriptions
- Concurrent Vision API calls (semaphore-limited)

### 4. `services/rag/app/services/vision/image_extractor.py`
- Handle direct image files (PNG, JPG, etc.)

### 5. `services/rag/app/services/vision/openai_client.py`
- Vision API wrapper with retry logic
- `ocr_image(image_bytes) -> str` - Extract text from scanned page
- `describe_image(image_bytes) -> str` - Generate description of photo/chart/diagram

---

## Files to Modify

### 1. `services/rag/Dockerfile`

**Remove:**
```dockerfile
tesseract-ocr \
tesseract-ocr-eng \
tesseract-ocr-deu \
tesseract-ocr-fra \
poppler-utils \
libgl1 \
libglib2.0-0 \
libsm6 \
libxext6 \
libxrender1 \
```

**Keep:**
```dockerfile
curl \
build-essential \
libpq-dev \
libmagic1 \
```

### 2. `services/rag/requirements.txt`
```diff
- unstructured[xlsx,docx,pptx,pdf]==0.18.21
+ unstructured[xlsx,docx,pptx]==0.18.21
+ PyMuPDF>=1.24.0
```

### 3. `services/rag/app/config.py`
Add Vision settings:
```python
# Vision API Configuration
openai_vision_model: str = "qwen/qwen3-vl-32b-instruct"  # Default vision model
vision_max_concurrent_pages: int = 5
vision_pdf_dpi: int = 150
vision_extraction_prompt: str = "Extract ALL text from this document image..."

def get_vision_model(self) -> str:
    return self.openai_vision_model or os.environ.get("OPENAI_VISION_MODEL") or "qwen/qwen3-vl-32b-instruct"
```

### 4. `services/rag/app/services/cognee/service.py`
Modify `add_document()`:
1. Import and call Vision pre-processor for PDFs/images
2. Save extracted text to temp file
3. Pass text file to `cognee.add()` (no `preferred_loaders` needed)
4. Remove `hi_res` strategy configuration

---

## Implementation Sequence

1. **Create Vision module** (`services/rag/app/services/vision/`)
   - openai_client.py → pdf_extractor.py → image_extractor.py → processor.py

2. **Add config** (services/rag/app/config.py)
   - Vision model, DPI, concurrency settings

3. **Integrate into Cognee service** (services/rag/app/services/cognee/service.py)
   - Pre-process PDFs/images before cognee.add()

4. **Update dependencies** (requirements.txt, Dockerfile)
   - Remove heavy OCR deps, add PyMuPDF

5. **Build and test**
   - Verify image size ~2GB
   - Test PDF, DOCX, PPTX, XLSX processing

---

## Key Design Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| PDF to image library | PyMuPDF | No system deps, 17-24MB wheel, faster than pdf2image |
| Integration approach | Pre-processor | Cleaner than hacking Cognee loaders, full control |
| Concurrency | Semaphore (5 pages) | Prevent rate limiting on large PDFs |
| Vision model | qwen/qwen3-vl-32b-instruct | Cost-effective multi-language vision model |
| Image DPI | 150 | Balance quality vs token usage |

---

## Expected Results

| Metric | Before | After |
|--------|--------|-------|
| Docker image size | 9.37GB | ~2GB |
| System dependencies | 10+ packages | 4 packages |
| Python OCR packages | unstructured-inference, ONNX | PyMuPDF only |
| OCR location | Self-hosted (Tesseract) | Cloud (Vision API) |

---

## Risks & Mitigations

- **API costs**: ~$0.01-0.03 per PDF page (mitigated by smart hybrid - only scanned pages/images use API)
- **Rate limits**: Semaphore concurrency control (5 concurrent calls)
- **API downtime**: Clear error messages, no silent fallback

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(rag): Replace self-hosted OCR with OpenAI Vision API to reduce Docker image size #43

Goal

Current Problem

What the current `hi_res` strategy does

Proposed Solution: Smart Hybrid Approach

PDF Processing Logic

Files to Create

1. `services/rag/app/services/vision/init.py`

2. `services/rag/app/services/vision/processor.py`

3. `services/rag/app/services/vision/pdf_extractor.py`

4. `services/rag/app/services/vision/image_extractor.py`

5. `services/rag/app/services/vision/openai_client.py`

Files to Modify

1. `services/rag/Dockerfile`

2. `services/rag/requirements.txt`

3. `services/rag/app/config.py`

4. `services/rag/app/services/cognee/service.py`

Implementation Sequence

Key Design Decisions

Expected Results

Risks & Mitigations

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Decision	Choice	Rationale
PDF to image library	PyMuPDF	No system deps, 17-24MB wheel, faster than pdf2image
Integration approach	Pre-processor	Cleaner than hacking Cognee loaders, full control
Concurrency	Semaphore (5 pages)	Prevent rate limiting on large PDFs
Vision model	qwen/qwen3-vl-32b-instruct	Cost-effective multi-language vision model
Image DPI	150	Balance quality vs token usage

Metric	Before	After
Docker image size	9.37GB	~2GB
System dependencies	10+ packages	4 packages
Python OCR packages	unstructured-inference, ONNX	PyMuPDF only
OCR location	Self-hosted (Tesseract)	Cloud (Vision API)

refactor(rag): Replace self-hosted OCR with OpenAI Vision API to reduce Docker image size #43

Description

Goal

Current Problem

What the current hi_res strategy does

Proposed Solution: Smart Hybrid Approach

PDF Processing Logic

Files to Create

1. services/rag/app/services/vision/__init__.py

2. services/rag/app/services/vision/processor.py

3. services/rag/app/services/vision/pdf_extractor.py

4. services/rag/app/services/vision/image_extractor.py

5. services/rag/app/services/vision/openai_client.py

Files to Modify

1. services/rag/Dockerfile

2. services/rag/requirements.txt

3. services/rag/app/config.py

4. services/rag/app/services/cognee/service.py

Implementation Sequence

Key Design Decisions

Expected Results

Risks & Mitigations

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

What the current `hi_res` strategy does

1. `services/rag/app/services/vision/init.py`

2. `services/rag/app/services/vision/processor.py`

3. `services/rag/app/services/vision/pdf_extractor.py`

4. `services/rag/app/services/vision/image_extractor.py`

5. `services/rag/app/services/vision/openai_client.py`

1. `services/rag/Dockerfile`

2. `services/rag/requirements.txt`

3. `services/rag/app/config.py`

4. `services/rag/app/services/cognee/service.py`