## Problem
The current `HybridRetriever` fetches up to 10 documents from both BM25
and vector search across multiple queries. Many of these documents are
only tangentially related to the user's query.
Passing noisy, low-relevance context to the LLM leads to:
- Unfocused or verbose answers
- Increased risk of hallucinated pathway/protein associations
- Wasted LLM tokens on irrelevant content
## Proposed Fix
Wrap the `HybridRetriever` with LangChain's `ContextualCompressionRetriever`
using an `EmbeddingsFilter` compressor.
This filters out any retrieved document whose cosine similarity to the
query embedding falls below a configurable threshold (e.g. 0.76), so that
only context with high measured relevance to the query reaches the LLM.
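A minimal sketch of the wiring, using LangChain's documented `ContextualCompressionRetriever` / `EmbeddingsFilter` API. The `hybrid_retriever` instance, the `OpenAIEmbeddings` model, and the query string are placeholders; in practice we would reuse the embedding model the vector index is already built on:

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain_openai import OpenAIEmbeddings

# Placeholder: reuse whichever embedding model the vector index already uses.
embeddings = OpenAIEmbeddings()

# Drop every retrieved document whose cosine similarity to the query
# embedding falls below the threshold.
embeddings_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.76)

# hybrid_retriever is the existing HybridRetriever instance (assumed to
# implement LangChain's BaseRetriever interface).
compression_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter,
    base_retriever=hybrid_retriever,
)

docs = compression_retriever.invoke("Which pathways is TP53 involved in?")
```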
This is complementary to reranking (PR #116) — compression removes noise,
reranking orders what remains.
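For illustration only, since PR #116's actual reranker may differ: assuming a cross-encoder reranker, the two stages can be chained with LangChain's `DocumentCompressorPipeline`, filtering first and reranking the survivors (reusing `embeddings_filter` from the sketch above):

```python
from langchain.retrievers.document_compressors import (
    CrossEncoderReranker,
    DocumentCompressorPipeline,
)
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Hypothetical reranker stand-in; substitute whatever PR #116 introduces.
reranker = CrossEncoderReranker(
    model=HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base"),
    top_n=5,
)

# Filter first (remove noise), then rerank what remains.
pipeline = DocumentCompressorPipeline(transformers=[embeddings_filter, reranker])
```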
I am working on a fix and will submit a PR shortly.