Problem
HybridRetriever in csv_chroma.py passes all retrieved documents to create_stuff_documents_chain with zero token awareness. With two databases installed (Reactome + UniProt), this results in 60-100 documents per query reaching the LLM.
This causes three problems:
1. Verbose, unfocused answers
The LLM receives too much context, including tangentially related documents. When this is combined with multi-query expansion across both databases, the signal-to-noise ratio degrades significantly.
2. High token cost
60-100 docs × ~150 tokens avg = 9,000-15,000 tokens per query
× 1,000 users/day = ~15M tokens/day
GPT-4o at $2.50/1M input tokens = ~$1,100/month
3. Lost in the middle
Research on the "lost in the middle" effect (Liu et al., 2023) shows that LLMs pay significantly less attention to documents in the middle of long contexts. Passing 60-100 documents means the LLM effectively ignores most of them, and beyond a certain point adding more documents degrades answer quality rather than improving it.
Root Cause
rag_chain.py uses LangChain's create_stuff_documents_chain:
question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)
LangChain's own documentation states: "Use StuffDocumentsChain when documents are small enough to fit in context window." There is no token budgeting or document count limit applied before documents reach this chain.
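For context, a minimal sketch of how a stuff-documents chain consumes the retrieved documents (the prompt and variable names here are illustrative, not the actual rag_chain.py code; llm, retrieved_docs and question are assumed to be in scope): every Document is formatted and concatenated into the prompt's {context} variable, so whatever the retriever returns is sent to the LLM in full.

```python
# Illustrative only -- prompt and variable names are assumptions, not rag_chain.py itself.
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the following context:\n\n{context}"),
    ("human", "{input}"),
])
question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)

# Every retrieved Document is formatted and joined into {context}; with 60-100
# documents that is roughly 9,000-15,000 tokens of prompt per query.
answer = question_answer_chain.invoke({"context": retrieved_docs, "input": question})
```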
Expected Behaviour
Retrieved documents should be truncated to a configurable token and count budget before being passed to the LLM. Truncation should run after ranking (WRR or FlashRank) so the least relevant documents are always removed first and the most relevant documents always survive.
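Illustratively, the call order inside the retriever would look like the sketch below. The class body and helper names are hypothetical, not the actual csv_chroma.py internals; only the ordering matters: retrieve, then rank, then truncate.

```python
# Hypothetical wiring -- helper names are invented for illustration.
class HybridRetriever:
    ...
    def retrieve_documents(self, query: str) -> list[Document]:
        docs = self._gather_from_all_databases(query)   # Reactome + UniProt, multi-query
        ranked = self._rank(docs)                        # WRR or FlashRank, best first
        # Truncation runs last, so only the low-relevance tail is ever dropped.
        return truncate_to_token_limit(ranked, max_docs=15, max_tokens=12000)
```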
Proposed Fix
Add src/util/context_truncator.py with a truncate_to_token_limit() function using tiktoken for exact token counting. Call it at the end of both retrieve_documents() and aretrieve_documents() in csv_chroma.py.
Default limits:
max_docs: 15
max_tokens: 12000
Both configurable via config_default.yml.
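A minimal sketch of the proposed truncator, assuming documents arrive best-first and using tiktoken's cl100k_base encoding (the encoding choice and parameter names are assumptions; the real defaults would be read from config_default.yml rather than hard-coded):

```python
# Sketch of src/util/context_truncator.py as proposed -- not final code.
import tiktoken
from langchain_core.documents import Document


def truncate_to_token_limit(
    docs: list[Document],
    max_docs: int = 15,                   # default mirrors config_default.yml
    max_tokens: int = 12000,              # default mirrors config_default.yml
    encoding_name: str = "cl100k_base",   # assumption; choose the encoding matching the LLM
) -> list[Document]:
    """Keep the highest-ranked documents that fit both the count and the token budget."""
    enc = tiktoken.get_encoding(encoding_name)
    kept: list[Document] = []
    used = 0
    for doc in docs[:max_docs]:
        n_tokens = len(enc.encode(doc.page_content))
        if used + n_tokens > max_tokens:
            break  # docs are ranked best-first, so everything after this is less relevant
        kept.append(doc)
        used += n_tokens
    return kept
```

Stopping at the first document that would exceed the token budget keeps the ranked prefix intact; an alternative is to skip oversized documents and keep scanning, which trades a strict ranking prefix for better budget utilisation.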
Related