Fixing retrieval bugs by GFJHogue · Pull Request #94 · reactome/reactome_chatbot

GFJHogue · 2025-09-26T22:32:31Z

I identified multiple issues in the retrieval system that together cause the inconsistent behaviours noted in #91.

BM25Retriever was relying on the default whitespace tokenization, with no additional text processing to normalize case (see #92, "BM25 case-sensitive retrieval causing failed retrievals") or punctuation. I added case normalization and switched to nltk's word tokenizer. This seems to help with failed queries involving gene symbols (e.g. NFAT1).
The old version of langchain-chroma we were using did not properly return Documents from SelfQueryRetriever. This appears to have been a bug that is fixed in the version I've upgraded to here. It caused EnsembleRetriever and MergerRetriever to miss vector search results (e.g. for glycolysis).
For reasons I still do not understand, RAG would occasionally claim no results were found for the 1st message, even though an identical query or rephrasing produces valid responses in all other cases. I've inserted the user's original message into chat_history for the 1st message only, which strangely seems to work as a workaround.
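
To illustrate the first point, here is a minimal sketch of why whitespace-only tokenization misses gene symbols and how case and punctuation normalization recovers them. It is not the code from this PR: the regex tokenizer is a dependency-free stand-in for nltk's word tokenizer (which keeps punctuation as separate tokens instead of dropping it), and the function names and the example sentence are hypothetical.

```python
import re

# Stand-in for nltk's word tokenizer (assumption: the PR itself uses nltk).
# Keeps runs of ASCII letters/digits and drops everything else.
_TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def whitespace_tokenize(text: str) -> list[str]:
    """Baseline behaviour: split on whitespace only, no normalization."""
    return text.split()


def normalized_tokenize(text: str) -> list[str]:
    """Lowercase first, then split into alphanumeric tokens."""
    return _TOKEN_RE.findall(text.lower())


def shared_terms(query: str, document: str, tokenize) -> set[str]:
    """Terms that a lexical scorer such as BM25 could match on."""
    return set(tokenize(query)) & set(tokenize(document))


# Hypothetical document text, for illustration only.
doc = "Calcineurin activates NFAT1, which then translocates to the nucleus."
query = "nfat1"

# Whitespace splitting yields the token "NFAT1," (upper case, trailing comma),
# so the lower-case query shares no term with the document.
print(shared_terms(query, doc, whitespace_tokenize))  # set()

# After normalization both sides contain the token "nfat1".
print(shared_terms(query, doc, normalized_tokenize))  # {'nfat1'}
```

With LangChain's BM25Retriever, a tokenizer like this would presumably be supplied through the `preprocess_func` argument of `BM25Retriever.from_documents`; that wiring is an assumption here and is not taken from the diff. Note that this regex also splits hyphenated symbols such as "IL-2" into two tokens, which may or may not be desirable.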

@heliamoh you should evaluate this branch and see if it is an acceptable minimal fix for the study-system.

GFJHogue added 4 commits September 26, 2025 16:50

upgrade chroma to fix dropped SelfQueryRetrievals

b3c6cf1

add minimal case-normalization for BM25Retriever

7b22043

wordaround weird 1st-message bug

eb533ef

improved word tokenization for BM25Retriever

0ad84d7

GFJHogue self-assigned this Sep 26, 2025

GFJHogue requested review from adamjohnwright and heliamoh September 26, 2025 22:32

heliamoh approved these changes Sep 27, 2025

heliamoh merged commit 5137c46 into main Sep 27, 2025
9 checks passed

GFJHogue mentioned this pull request Sep 30, 2025

Flakey retrieval #91

Open

GFJHogue deleted the retrieval-fixes branch February 6, 2026 16:56