
feat(retriever): add token-aware context truncation to cap documents passed to LLM#139

Open
GovindhKishore wants to merge 1 commit into reactome:main from GovindhKishore:feat/token-aware-context-truncation


@GovindhKishore

Summary

Adds token-aware context truncation to HybridRetriever to cap the number of documents and tokens passed to the LLM. Fixes the unbounded document passing problem described in #138.

Motivation

create_stuff_documents_chain in rag_chain.py stuffs ALL retrieved documents into the LLM context with zero token awareness. With two databases installed (Reactome + UniProt), this results in 60-100 documents per query reaching the LLM, causing:

  • Verbose, unfocused answers from too much noisy context
  • High token cost, scaling with per-document token counts and current OpenAI API pricing
  • "Lost in the middle" degradation - LLMs tend to ignore documents in the middle of long contexts

Changes

New - src/util/context_truncator.py

  • truncate_to_token_limit() - truncates a ranked document list to fit within a token budget and a document-count limit (see the sketch after this list)
  • Uses tiktoken for exact GPT-4o token counting - already installed via langchain-openai, so no new dependencies
  • Processes documents in ranked order (best first) and stops when either limit is hit, so the least relevant documents are always removed first
  • Guarantees that at least one document is returned even if it exceeds the token budget, preventing an empty-context edge case
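
A minimal sketch of what the helper could look like, assuming langchain_core's Document type; the exact signature, defaults, and encoding lookup are assumptions, not the code in this PR:

```python
from typing import List

import tiktoken
from langchain_core.documents import Document


def truncate_to_token_limit(
    docs: List[Document],
    max_docs: int = 15,
    max_tokens: int = 12_000,
) -> List[Document]:
    """Trim a ranked document list to fit document-count and token budgets.

    Documents are consumed best-first, so the least relevant ones are
    dropped first. At least one document is always returned, even if it
    alone exceeds the token budget.
    """
    # gpt-4o maps to the o200k_base encoding in recent tiktoken versions.
    encoding = tiktoken.encoding_for_model("gpt-4o")

    kept: List[Document] = []
    used_tokens = 0
    for doc in docs:
        doc_tokens = len(encoding.encode(doc.page_content))
        # Keep the first document unconditionally; stop once either limit hits.
        if kept and (len(kept) >= max_docs or used_tokens + doc_tokens > max_tokens):
            break
        kept.append(doc)
        used_tokens += doc_tokens
    return kept
```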

Modified - src/retrievers/csv_chroma.py

  • retrieve_documents() - returns truncate_to_token_limit(subdirectory_docs) instead of the raw subdirectory_docs (sketched below)
  • aretrieve_documents() - same change applied to the async retrieval path
  • Added an import of truncate_to_token_limit from util.context_truncator
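
For illustration, the change amounts to wrapping the existing return values. Here _rank_and_rerank and _arank_and_rerank are hypothetical stand-ins for the pre-existing retrieval and ranking code, which this PR does not touch:

```python
from util.context_truncator import truncate_to_token_limit


class HybridRetriever:
    def retrieve_documents(self, query: str):
        # Existing BM25 + vector + WRR + FlashRank pipeline (name is a stand-in).
        subdirectory_docs = self._rank_and_rerank(query)
        # Was: return subdirectory_docs
        return truncate_to_token_limit(subdirectory_docs)

    async def aretrieve_documents(self, query: str):
        subdirectory_docs = await self._arank_and_rerank(query)
        # Same truncation applied on the async path.
        return truncate_to_token_limit(subdirectory_docs)
```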

Modified - config_default.yml

  • Added a retriever.context_truncation block with max_docs: 15 and max_tokens: 12000 (shown below)
  • Config wiring to HybridRetriever can be added later - for now the defaults are hardcoded in the truncator
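
The new block in config_default.yml (exact nesting inferred from the key path above):

```yaml
retriever:
  context_truncation:
    max_docs: 15       # hard cap on documents passed to the LLM
    max_tokens: 12000  # token budget for retrieved context
```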

Why These Default Values

  • max_docs: 15 - fifteen high-quality ranked docs provide sufficient context for any question; beyond this, quality degrades due to the lost-in-the-middle effect
  • max_tokens: 12000 - leaves room for the system prompt, chat history, and answer within GPT-4o's 128k context window (see the arithmetic below)
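
A rough illustration of the budget arithmetic (the average chunk size is an assumption, not a measurement):

```python
CONTEXT_WINDOW = 128_000  # GPT-4o context window
DOC_BUDGET = 12_000       # max_tokens for retrieved documents

# Retrieved context uses under 10% of the window, leaving 116,000
# tokens for the system prompt, chat history, and the model's answer.
remaining = CONTEXT_WINDOW - DOC_BUDGET  # 116,000

# If chunks average ~800 tokens (assumption), the two caps coincide:
# 15 docs * 800 tokens = 12,000 tokens.
```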

Expected Impact

Limits context passed to the LLM to a maximum of 15 documents and 12,000 tokens per query, down from an unbounded 60-100 documents. This reduces token usage significantly and avoids the "lost in the middle" quality degradation that occurs with excessively long contexts.

Note: Exact token counts per document depend on installed database versions and chunk sizes. A follow-up evaluation with real embeddings will quantify the precise reduction.

Note: GPT-4o supports a 128k context window, but the 12,000-token limit is intentional: it reserves space for the system prompt, chat history, and model output, and keeps the context short enough to avoid lost-in-the-middle degradation.

Interaction With Other PRs

Truncation runs at the end of retrieve_documents() and aretrieve_documents(), after WRR ranking and after FlashRank reranking (PR #116). This means truncation always drops documents from the tail of the ranked list, preserving the quality ordering established by both ranking stages:

BM25 + Vector retrieval
  -> weighted_reciprocal_rank
  -> FlashRank reranking (PR #116)
  -> truncate_to_token_limit()
  -> create_stuff_documents_chain

Related

#138 - fix(retriever): HybridRetriever passes unlimited documents to LLM with no token budget
