fix(retriever): HybridRetriever passes unlimited documents to LLM with no token budget #138

@GovindhKishore

Problem

HybridRetriever in csv_chroma.py passes all retrieved documents to create_stuff_documents_chain with zero token awareness. With two databases installed (Reactome + UniProt), this results in 60-100 documents per query reaching the LLM.

This causes three problems:

1. Verbose, unfocused answers
The LLM receives too much context including tangentially related documents. Combined with multi-query expansion across both databases, the signal-to-noise ratio degrades significantly.

2. High token cost

60-100 docs × ~150 tokens avg = 9,000-15,000 tokens per query
× 1,000 users/day (one query each) = 9M-15M tokens/day
GPT-4o at $2.50/1M input tokens = ~$675-$1,125/month (30 days)

3. Lost in the middle
Research shows LLMs pay significantly less attention to documents in the middle of long contexts. With 60-100 documents, most of the context sits in that middle region, so relevant passages placed there are likely to be underused and answer quality degrades as the context grows.
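
The cost estimate under problem 2 can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes one query per user per day and a 30-day month, neither of which is stated above; the function and parameter names are illustrative only.

```python
def monthly_input_cost(
    docs_per_query: int,
    tokens_per_doc: int = 150,             # average document size used in this issue
    queries_per_day: int = 1000,           # assumption: one query per user per day
    usd_per_million_tokens: float = 2.50,  # GPT-4o input price quoted in this issue
    days_per_month: int = 30,              # assumption
) -> float:
    """Estimate monthly spend on retrieved-context input tokens, in USD."""
    tokens_per_day = docs_per_query * tokens_per_doc * queries_per_day
    return tokens_per_day / 1_000_000 * usd_per_million_tokens * days_per_month


if __name__ == "__main__":
    # Compare the current 60-100 document range with the proposed 15-document cap.
    for docs in (100, 60, 15):
        print(f"{docs:>3} docs/query: ${monthly_input_cost(docs):,.2f}/month")
```

Under these assumptions, capping retrieval at 15 documents would cut the context portion of the bill roughly in proportion to the document count. The prompt template and the question itself add tokens that this estimate ignores.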

Root Cause

rag_chain.py uses LangChain's create_stuff_documents_chain:

question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)

LangChain's own documentation states: "Use StuffDocumentsChain when documents are small enough to fit in context window." There is no token budgeting or document count limit applied before documents reach this chain.

Expected Behaviour

Retrieved documents should be truncated to a configurable token and count budget before being passed to the LLM. Truncation should run after ranking (WRR or FlashRank) so the least relevant documents are always removed first and the most relevant documents always survive.

Proposed Fix

Add src/util/context_truncator.py with a truncate_to_token_limit() function using tiktoken for exact token counting. Call it at the end of both retrieve_documents() and aretrieve_documents() in csv_chroma.py.
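
A minimal sketch of what truncate_to_token_limit() could look like is given below. It is not the final implementation: the Doc class is a stand-in for LangChain's Document (only page_content is used), and the token counter is passed in as a parameter so the example runs without extra dependencies. The whitespace counter is only a placeholder for the tiktoken-based counting proposed above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Doc:
    """Stand-in for a retrieved document; only page_content is needed here."""
    page_content: str


def approx_token_count(text: str) -> int:
    """Placeholder counter: whitespace-separated words, not real model tokens."""
    return len(text.split())


def truncate_to_token_limit(
    docs: Sequence[Doc],
    max_docs: int = 15,
    max_tokens: int = 12000,
    count_tokens: Callable[[str], int] = approx_token_count,
) -> List[Doc]:
    """Keep the highest-ranked docs that fit both the count and token budgets.

    Assumes `docs` is already sorted most relevant first, so cutting the
    tail removes the least relevant documents.
    """
    kept: List[Doc] = []
    used = 0
    for doc in docs[:max_docs]:
        cost = count_tokens(doc.page_content)
        if used + cost > max_tokens:
            # Stop at the first doc that does not fit, so the result is
            # always a contiguous prefix of the ranked list.
            break
        kept.append(doc)
        used += cost
    return kept
```

Stopping at the first document that overflows the budget, rather than skipping it and trying smaller ones further down, keeps the output a strict prefix of the ranking. To get exact counts, a tiktoken-based counter such as `lambda text: len(enc.encode(text))`, with `enc` obtained from `tiktoken.get_encoding(...)`, could be passed as `count_tokens`.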

Default limits:

  • max_docs: 15
  • max_tokens: 12000

Both configurable via config_default.yml.
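
One possible way to read these limits from the parsed configuration is sketched below. The section name context_truncation and the helper load_truncation_limits() are hypothetical; the actual layout of config_default.yml is not shown in this issue, so the keys would need to be adapted to it.

```python
from typing import Any, Mapping, Tuple

# Defaults proposed in this issue.
DEFAULT_MAX_DOCS = 15
DEFAULT_MAX_TOKENS = 12000


def load_truncation_limits(config: Mapping[str, Any]) -> Tuple[int, int]:
    """Return (max_docs, max_tokens) from an already-parsed config mapping.

    Assumes a hypothetical `context_truncation` section; missing keys fall
    back to the proposed defaults.
    """
    section = config.get("context_truncation") or {}
    max_docs = int(section.get("max_docs", DEFAULT_MAX_DOCS))
    max_tokens = int(section.get("max_tokens", DEFAULT_MAX_TOKENS))
    if max_docs < 1 or max_tokens < 1:
        raise ValueError("max_docs and max_tokens must be positive integers")
    return max_docs, max_tokens
```

Validating the values at load time means a misconfigured budget fails at startup instead of silently sending an empty context to the LLM.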
