Problem
The current HybridRetriever in csv_chroma.py retrieves documents
from multiple subdirectories using BM25 + SelfQuery + MultiQuery
expansion, resulting in ~90 documents being passed to
create_stuff_documents_chain.
All retrieved documents are stuffed directly into the LLM prompt
with no relevance filtering across subdirectories. This causes:
- Responses becoming increasingly long as more data sources are added
- No cross-subdirectory relevance ranking - a low-relevance document
from one subdirectory is treated equally to a high-relevance document
from another
- LLM receiving noisy context which reduces answer precision
Proposed Solution
Add a reranking layer after weighted_reciprocal_rank and before
returning documents in both retrieve_documents() and
aretrieve_documents().
The reranker scores all retrieved documents against the original user
query using a cross-encoder model and returns only the top N most
relevant documents regardless of which subdirectory they came from.
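The reranking step described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `Document` dataclass is a stand-in for the LangChain type, and `rerank`, `arerank`, and `overlap_score` are hypothetical names. The toy word-overlap scorer only takes the place of a cross-encoder so the example runs without extra dependencies.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Document:
    """Stand-in for the LangChain Document type (assumption, for illustration)."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


# A scorer maps (query, passage text) to a relevance score; higher is better.
Scorer = Callable[[str, str], float]


def rerank(query: str, docs: list[Document], score: Scorer, top_n: int) -> list[Document]:
    """Score every document against the query and keep the top_n best,
    ignoring which subdirectory each document came from."""
    scored = [(score(query, d.page_content), i, d) for i, d in enumerate(docs)]
    # Sort by descending score; ties keep the original retrieval order.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [d for _, _, d in scored[:top_n]]


async def arerank(query: str, docs: list[Document], score: Scorer, top_n: int) -> list[Document]:
    """Async variant: run the CPU-bound scoring in a worker thread."""
    return await asyncio.to_thread(rerank, query, docs, score, top_n)


def overlap_score(query: str, text: str) -> float:
    """Toy scorer (NOT a cross-encoder): fraction of query words found in text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0
```

Passing the scorer in as a parameter keeps the sync and async paths identical apart from the thread hop, so both retrieve methods could share one code path.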
Implementation:
- Add src/retrievers/reranker.py using FlashRank (ms-marco-MiniLM-L-12-v2)
- Modify return statements in both sync and async retrieve methods
in csv_chroma.py
- Add reranker configuration to config_default.yml
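A possible shape for src/retrievers/reranker.py is sketched below, assuming FlashRank's `Ranker` and `RerankRequest` interface. The function names `to_passages` and `flashrank_rerank`, the default `top_n`, and the `Document` stand-in are illustrative assumptions, not the final module.

```python
from dataclasses import dataclass, field
from functools import lru_cache
from typing import Any


@dataclass
class Document:
    """Stand-in for the LangChain Document type (assumption, for illustration)."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def to_passages(docs: list[Document]) -> list[dict[str, Any]]:
    """Convert documents to the passage dicts FlashRank expects.
    The list index is used as the id so results can be mapped back."""
    return [
        {"id": i, "text": d.page_content, "meta": d.metadata}
        for i, d in enumerate(docs)
    ]


@lru_cache(maxsize=None)
def _get_ranker(model_name: str):
    """Load the model once per process; imported lazily so the module
    can be imported even where flashrank is not installed."""
    from flashrank import Ranker
    return Ranker(model_name=model_name)


def flashrank_rerank(
    query: str,
    docs: list[Document],
    top_n: int = 10,  # hypothetical default
    model_name: str = "ms-marco-MiniLM-L-12-v2",
) -> list[Document]:
    """Return the top_n documents ranked by the cross-encoder score."""
    if not docs:
        return []
    from flashrank import RerankRequest
    request = RerankRequest(query=query, passages=to_passages(docs))
    # rerank() returns the passages sorted by descending score.
    results = _get_ranker(model_name).rerank(request)
    return [docs[r["id"]] for r in results[:top_n]]
```

In this sketch the return statements in retrieve_documents() and aretrieve_documents() would wrap the fused result list in a call to `flashrank_rerank` (or an async wrapper around it).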
Why FlashRank:
- Runs locally - no API key required
- CPU only - no GPU needed
- Lightweight (the default ms-marco-TinyBERT-L-2-v2 model is ~4MB; ms-marco-MiniLM-L-12-v2 is larger, roughly 34MB)
- Already compatible with the existing list[Document] pipeline
Impact
- Applies automatically to both Reactome and UniProt retrievers
since csv_chroma.py is shared
- Any future database integrations get reranking for free
- Response length directly controlled via top_n config parameter
- Zero changes to downstream pipeline - same list[Document] type returned throughout
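One way the reranker settings could be read from config_default.yml once it is parsed into a dict is sketched here. Only `top_n` comes from this proposal; the `enabled` and `model_name` keys, the default values, and the `RerankerConfig` class are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class RerankerConfig:
    enabled: bool = True                           # hypothetical key
    top_n: int = 10                                # hypothetical default
    model_name: str = "ms-marco-MiniLM-L-12-v2"    # hypothetical key

    @classmethod
    def from_mapping(cls, raw: Optional[dict[str, Any]]) -> "RerankerConfig":
        """Build the config from the parsed reranker section, using
        defaults for any missing key."""
        raw = raw or {}
        cfg = cls(
            enabled=bool(raw.get("enabled", cls.enabled)),
            top_n=int(raw.get("top_n", cls.top_n)),
            model_name=str(raw.get("model_name", cls.model_name)),
        )
        if cfg.top_n < 1:
            raise ValueError("top_n must be at least 1")
        return cfg
```

Rejecting a top_n below 1 at load time avoids silently passing an empty context to the LLM.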