Problem
HybridRetriever in csv_chroma.py passes all retrieved documents to create_stuff_documents_chain with zero token awareness. With two databases installed (Reactome + UniProt), this results in 60-100 documents per query reaching the LLM.
This causes three problems:
1. Verbose, unfocused answers
The LLM receives too much context, including tangentially related documents. When this is combined with multi-query expansion across both databases, the signal-to-noise ratio degrades significantly.
2. High token cost
60-100 docs × ~150 tokens avg = 9,000-15,000 tokens per query
× 1,000 users/day = ~15M tokens/day
GPT-4o at $2.50/1M input tokens = ~$1,100/month
3. Lost in the middle
Research on the "lost in the middle" effect (Liu et al., 2023) shows that LLMs pay significantly less attention to documents in the middle of long contexts. Passing 60-100 documents means the LLM effectively ignores most of them, and beyond a certain point adding more documents degrades answer quality rather than improving it.
Root Cause
rag_chain.py uses LangChain's create_stuff_documents_chain:
question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)
LangChain's own documentation states: "Use StuffDocumentsChain when documents are small enough to fit in context window." There is no token budgeting or document count limit applied before documents reach this chain.
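For context, a minimal sketch of how a stuff-documents chain consumes the retrieved documents (the prompt and variable names here are illustrative, not the actual rag_chain.py code; llm, retrieved_docs and question are assumed to be in scope): every Document is formatted and concatenated into the prompt's {context} variable, so whatever the retriever returns is sent to the LLM in full.

```python
# Illustrative only -- prompt and variable names are assumptions, not rag_chain.py itself.
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the following context:\n\n{context}"),
    ("human", "{input}"),
])
question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)

# Every retrieved Document is formatted and joined into {context}; with 60-100
# documents that is roughly 9,000-15,000 tokens of prompt per query.
answer = question_answer_chain.invoke({"context": retrieved_docs, "input": question})
```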
Expected Behaviour
Retrieved documents should be truncated to a configurable token and count budget before being passed to the LLM. Truncation should run after ranking (WRR or FlashRank) so the least relevant documents are always removed first and the most relevant documents always survive.
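Illustratively, the call order inside the retriever would look like the sketch below. The class body and helper names are hypothetical, not the actual csv_chroma.py internals; only the ordering matters: retrieve, then rank, then truncate.

```python
# Hypothetical wiring -- helper names are invented for illustration.
class HybridRetriever:
    ...
    def retrieve_documents(self, query: str) -> list[Document]:
        docs = self._gather_from_all_databases(query)   # Reactome + UniProt, multi-query
        ranked = self._rank(docs)                        # WRR or FlashRank, best first
        # Truncation runs last, so only the low-relevance tail is ever dropped.
        return truncate_to_token_limit(ranked, max_docs=15, max_tokens=12000)
```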
Proposed Fix
Add src/util/context_truncator.py with a truncate_to_token_limit() function using tiktoken for exact token counting. Call it at the end of both retrieve_documents() and aretrieve_documents() in csv_chroma.py.
Default limits:
max_docs: 15
max_tokens: 12000
Both configurable via config_default.yml.
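A minimal sketch of the proposed truncator, assuming documents arrive best-first and using tiktoken's cl100k_base encoding (the encoding choice and parameter names are assumptions; the real defaults would be read from config_default.yml rather than hard-coded):

```python
# Sketch of src/util/context_truncator.py as proposed -- not final code.
import tiktoken
from langchain_core.documents import Document


def truncate_to_token_limit(
    docs: list[Document],
    max_docs: int = 15,                   # default mirrors config_default.yml
    max_tokens: int = 12000,              # default mirrors config_default.yml
    encoding_name: str = "cl100k_base",   # assumption; choose the encoding matching the LLM
) -> list[Document]:
    """Keep the highest-ranked documents that fit both the count and the token budget."""
    enc = tiktoken.get_encoding(encoding_name)
    kept: list[Document] = []
    used = 0
    for doc in docs[:max_docs]:
        n_tokens = len(enc.encode(doc.page_content))
        if used + n_tokens > max_tokens:
            break  # docs are ranked best-first, so everything after this is less relevant
        kept.append(doc)
        used += n_tokens
    return kept
```

Stopping at the first document that would exceed the token budget keeps the ranked prefix intact; an alternative is to skip oversized documents and keep scanning, which trades a strict ranking prefix for better budget utilisation.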
Related