feat(retriever): add hybrid retrieval engine pipeline with lexical + semantic ranking #23
Conversation
Walkthrough
Adds a new TypeScript Smart Notes hybrid retrieval app: project config and tsconfig, core types and public barrel, cosine-similarity and hybrid-scoring utilities, a RetrievalEngine orchestration class, a demo with mock store/embedder, and updates to .gitignore and package.json.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant RetrievalEngine
    participant RetrievalStore
    participant QueryEmbeddingProvider
    participant CosineSimilarity
    participant HybridScorer
    Client->>RetrievalEngine: search(query, limit)
    RetrievalEngine->>RetrievalStore: searchLexical(query, limit * 5)
    RetrievalStore-->>RetrievalEngine: candidates[]
    RetrievalEngine->>QueryEmbeddingProvider: embedQuery(query)
    QueryEmbeddingProvider-->>RetrievalEngine: queryEmbedding
    RetrievalEngine->>RetrievalStore: loadEmbeddings(chunkIds)
    RetrievalStore-->>RetrievalEngine: chunkEmbeddings[][]
    RetrievalEngine->>CosineSimilarity: cosineSimilarity(queryEmbedding, chunkEmbedding)
    CosineSimilarity-->>RetrievalEngine: semanticScores[]
    RetrievalEngine->>HybridScorer: scoreHybridResults(candidates, semanticScores, weights)
    HybridScorer-->>RetrievalEngine: SearchResult[] (sorted)
    RetrievalEngine->>RetrievalEngine: truncate to limit
    RetrievalEngine-->>Client: SearchResult[]
```
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels
Poem
🚥 Pre-merge checks: ✅ Passed checks (2 passed)
Actionable comments posted: 9
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/retriever/src/demo/RetrievalDemo.ts`:
- Around line 37-43: The mock currently ignores the RetrievalStore contract:
update searchLexical to honor the incoming limit by returning at most limit
candidates (preserving deterministic ordering of the hard-coded candidate set)
and update loadEmbeddings to return embeddings only for the requested chunkIds
in the same order as chunkIds (preserving one embedding per chunkId); locate the
hard-coded candidates array in RetrievalDemo.ts and change searchLexical to
slice it by Math.max(0, limit) and change loadEmbeddings to map over the
incoming chunkIds to produce exactly one vector per id (or throw/return null for
missing ids) so cardinality and ordering match the real RetrievalStore behavior.
- Around line 15-20: The imports RetrievalStore, QueryEmbeddingProvider,
HybridScoreWeights, and SearchCandidate are only used as type annotations;
change the import to be type-only by using an "import type" statement for those
symbols from "../types" (so the runtime require is not emitted) and keep the
rest of the file unchanged.
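The mock-contract fix described above can be sketched as follows. The types and candidate data here are hypothetical stand-ins; the real `SearchCandidate` shape and hard-coded candidate set live in the retriever sources.

```typescript
// Hypothetical minimal shape; the real SearchCandidate lives in ../types.
interface SearchCandidate {
  chunkId: string;
  lexicalScore: number;
}

// Stand-in for the hard-coded candidate set in RetrievalDemo.ts.
const candidates: SearchCandidate[] = [
  { chunkId: "a", lexicalScore: 3.2 },
  { chunkId: "b", lexicalScore: 1.7 },
  { chunkId: "c", lexicalScore: 0.4 },
];

const embeddingsById = new Map<string, number[]>([
  ["a", [1, 0]],
  ["b", [0.6, 0.8]],
  ["c", [0, 1]],
]);

// Honor the incoming limit: at most `limit` candidates, deterministic order.
function searchLexical(_query: string, limit: number): SearchCandidate[] {
  return candidates.slice(0, Math.max(0, limit));
}

// Exactly one embedding per requested chunkId, in the same order as chunkIds.
function loadEmbeddings(chunkIds: string[]): number[][] {
  return chunkIds.map((id) => {
    const vector = embeddingsById.get(id);
    if (vector === undefined) {
      throw new Error(`loadEmbeddings: no embedding for chunk "${id}"`);
    }
    return vector;
  });
}
```

This keeps the mock's cardinality and ordering aligned with what the real RetrievalStore would return.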
In `@apps/retriever/src/HybridScorer.ts`:
- Line 10: The import in HybridScorer.ts should be type-only: update the import
of SearchCandidate, SearchResult, and HybridScoreWeights to use a type-only
import so the compiled output doesn't include a runtime require("./types");
locate the top-level import statement that currently imports these symbols and
change it to an `import type` form referencing SearchCandidate, SearchResult,
and HybridScoreWeights.
- Around line 99-113: The current lexical normalization uses reduce(..., 0) to
compute maxLexicalScore and falls back to 0 when max <= 0, which zeroes out
lexical signal for all-negative score sets; update the logic in the block that
computes maxLexicalScore and normalizedLexicalScore (around candidates.reduce,
maxLexicalScore, normalizedLexicalScore, finalScore) to either validate adapters
emit non-negative lexicalScore or—preferably—compute both minLexicalScore and
maxLexicalScore and normalize each candidate via (lexicalScore -
minLexicalScore) / (maxLexicalScore - minLexicalScore) when range > 0 (otherwise
fall back to 0.0), then use that normalized value in finalScore
(weights.alpha/weights.beta unchanged).
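A minimal sketch of the suggested min/max normalization (the function name is mine; the real logic sits inline in `scoreHybridResults`):

```typescript
// Min/max normalize lexical scores into [0, 1]. Unlike dividing by the max,
// this remains meaningful for all-negative score sets.
function normalizeLexicalScores(scores: number[]): number[] {
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  const range = max - min;
  // Degenerate case: empty input or all scores equal means no lexical signal.
  if (range <= 0) return scores.map(() => 0);
  return scores.map((s) => (s - min) / range);
}
```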
In `@apps/retriever/src/index.ts`:
- Around line 9-13: The example import in the docblock uses the wrong package
name; update the example to import from "retriever" (matching
apps/retriever/package.json) instead of "@smart-notes/retriever". Edit the
import lines that reference RetrievalEngine, cosineSimilarity,
scoreHybridResults and the type imports SearchResult, HybridScoreWeights in the
example at the top of apps/retriever/src/index.ts so they use "retriever", and
ensure both example import and type import use the same package string.
In `@apps/retriever/src/RetrievalEngine.ts`:
- Around line 10-15: The import statement is pulling in type-only symbols at
runtime; change the import to a type-only import by using "import type" for
RetrievalStore, QueryEmbeddingProvider, HybridScoreWeights, and SearchResult in
RetrievalEngine.ts so the compiler only emits types and no runtime import
side-effects—update the existing import line that currently reads "import {
RetrievalStore, QueryEmbeddingProvider, HybridScoreWeights, SearchResult } from
'./types';" to use "import type" for those symbols and leave other non-type
imports unchanged.
- Around line 120-129: The semanticScores array currently uses raw
cosineSimilarity values (range [-1,1]) which must be normalized to [0,1] before
hybrid blending; update the computation that builds semanticScores (the variable
computed from cosineSimilarity) to normalize each score via (score + 1) / 2,
then pass those normalized values into scoreHybridResults so HybridScorer's
blend (weights.alpha * semanticScore + weights.beta * normalizedLexicalScore)
uses aligned scales for correct ranking of candidates.
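The suggested semantic-score normalization is a simple shift-and-scale; a sketch (helper name is mine):

```typescript
// Map a cosine similarity from [-1, 1] onto [0, 1] so it blends on the
// same scale as the normalized lexical score.
function normalizeCosine(score: number): number {
  return (score + 1) / 2;
}
```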
In `@apps/retriever/src/types.ts`:
- Around line 67-72: Update the JSDoc for finalScore to match the implementation
in scoreHybridResults: state that lexicalScore is normalized before blending,
that scoreHybridResults enforces/assumes alpha + beta ≈ 1 (throws if not), and
that finalScore is the weighted blend of the normalized lexical score and
semanticScore computed after normalization; apply the same clarification to the
other comment block referenced (lines around the second occurrence) and
reference scoreHybridResults, finalScore, lexicalScore, alpha and beta so
adapter authors understand the exact contract.
In `@apps/retriever/tsconfig.json`:
- Around line 2-11: Enable TypeScript declaration emission by updating
tsconfig.json's compilerOptions to include "declaration": true (and optionally
"declarationMap": true) so .d.ts files are emitted into the existing "outDir":
"dist"; ensure "rootDir": "src" and "include": ["src"] still cover your public
API surface (types like SearchResult, HybridScoreWeights, RetrievalStore), and
add "types": "dist/index.d.ts" to package.json to point consumers to the
generated declaration entry if applicable.
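A tsconfig along these lines would emit declarations into dist (a sketch; option values beyond those named in the comment are assumptions):

```json
{
  "compilerOptions": {
    "rootDir": "src",
    "outDir": "dist",
    "declaration": true,
    "declarationMap": true
  },
  "include": ["src"]
}
```

package.json would then point consumers at the generated entry via `"types": "dist/index.d.ts"`.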
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b784b02a-c6af-4b05-9807-3d13788d91f5
⛔ Files ignored due to path filters (1)
apps/retriever/package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (9)
.gitignore, apps/retriever/package.json, apps/retriever/src/CosineSimilarity.ts, apps/retriever/src/HybridScorer.ts, apps/retriever/src/RetrievalEngine.ts, apps/retriever/src/demo/RetrievalDemo.ts, apps/retriever/src/index.ts, apps/retriever/src/types.ts, apps/retriever/tsconfig.json
Thanks for the suggestion! I've updated the implementation to address this and pushed the fix in the latest commit.
Actionable comments posted: 3
♻️ Duplicate comments (1)
apps/retriever/src/HybridScorer.ts (1)
39-60: ⚠️ Potential issue | 🟡 Minor: Publish the actual scoring contract in this docblock.
This export still documents blending raw `lexicalScore`, says non-1 weight sums only warn, and only lists a length-mismatch `RangeError`. The implementation now min/max normalizes lexical scores, throws on invalid weights, and relies on callers to pass `semanticScores` on the expected scale, so this docblock is no longer an accurate API contract.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@apps/retriever/src/HybridScorer.ts` around lines 39 - 60, The JSDoc for HybridScorer's scoring API is outdated: update the docblock on HybridScorer (the scoring function that takes candidates, semanticScores, and weights) to accurately state that lexicalScore values are min/max normalized before blending, that weights (weights.alpha and weights.beta) must be valid (e.g., sum to 1 or meet the function's validation) and the function now throws on invalid weights, that semanticScores are expected to already be cosine-similarity-scaled by the caller, and that the function throws on candidates/semanticScores length mismatch; also keep the note that results are returned sorted by finalScore descending and document the exact finalScore formula used after normalization.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/retriever/src/HybridScorer.ts`:
- Around line 84-97: The validation currently in scoreHybridResults lets
NaN/Infinity through because comparisons with < and Math.abs(NaN - 1) are false;
add an explicit finite-number check using Number.isFinite for weights.alpha and
weights.beta (reject if not finite) before the existing non-negative check so
you throw a RangeError early; keep the existing checks for non-negativity and
sum-to-1 (using WEIGHT_SUM_TOLERANCE) after this new finite validation to avoid
producing NaN finalScore values later.
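The validation order described above might look like this sketch (the tolerance value is an assumption; the real `WEIGHT_SUM_TOLERANCE` lives in HybridScorer.ts):

```typescript
// Assumed tolerance value; the real constant is defined in HybridScorer.ts.
const WEIGHT_SUM_TOLERANCE = 1e-6;

function validateWeights(alpha: number, beta: number): void {
  // Reject NaN/Infinity first: comparisons like `alpha < 0` are false for NaN,
  // so without this check non-finite weights slip through to the blend.
  if (!Number.isFinite(alpha) || !Number.isFinite(beta)) {
    throw new RangeError(`weights must be finite (got alpha=${alpha}, beta=${beta})`);
  }
  if (alpha < 0 || beta < 0) {
    throw new RangeError("weights must be non-negative");
  }
  if (Math.abs(alpha + beta - 1) > WEIGHT_SUM_TOLERANCE) {
    throw new RangeError("alpha + beta must sum to 1");
  }
}
```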
In `@apps/retriever/src/RetrievalEngine.ts`:
- Around line 66-67: Update the public doc comment for the RetrievalEngine
`limit` parameter to accurately state that the engine requests a widened
candidate pool from the upstream lexical search (currently expands to limit * 5
candidates before reranking) so callers and store adapters understand the actual
query load; reference the RetrievalEngine class and the `limit` parameter and
explicitly mention the candidate expansion multiplier (5) and that the final
ranked output is trimmed to `limit`.
In `@apps/retriever/tsconfig.json`:
- Around line 5-13: The base tsconfig currently hardcodes "rootDir": "src" and
"include": ["src"], which breaks when package-level .ts files exist; remove
"rootDir" (and any emit-only settings like "outDir" or
"declaration"/"declarationMap") from the base tsconfig and create a new
build-only tsconfig (e.g., tsconfig.build.json) that sets "rootDir": "src",
"outDir": "dist", and other emit options; update the base tsconfig "include" to
["*.ts", "src"] (or simply ["**/*.ts"]) so files like vitest.config.ts and
eslint.config.ts are type-checked while keeping compilation settings isolated in
tsconfig.build.json referenced by build scripts or "extends" as appropriate.
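Concretely, the split could look like this (file contents are a sketch under the assumptions in the comment above):

```jsonc
// tsconfig.build.json: emit-only settings, referenced by build scripts
{
  "extends": "./tsconfig.json",
  "compilerOptions": {
    "rootDir": "src",
    "outDir": "dist",
    "declaration": true
  },
  "include": ["src"]
}
```

The base tsconfig.json would then drop "rootDir" and the emit options and widen "include" to ["**/*.ts"] so files like vitest.config.ts and eslint.config.ts are type-checked.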
---
Duplicate comments:
In `@apps/retriever/src/HybridScorer.ts`:
- Around line 39-60: The JSDoc for HybridScorer's scoring API is outdated:
update the docblock on HybridScorer (the scoring function that takes candidates,
semanticScores, and weights) to accurately state that lexicalScore values are
min/max normalized before blending, that weights (weights.alpha and
weights.beta) must be valid (e.g., sum to 1 or meet the function's validation)
and the function now throws on invalid weights, that semanticScores are expected
to already be cosine-similarity-scaled by the caller, and that the function
throws on candidates/semanticScores length mismatch; also keep the note that
results are returned sorted by finalScore descending and document the exact
finalScore formula used after normalization.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 663714ab-04af-483f-a450-e2ea86ab3728
📒 Files selected for processing (6)
apps/retriever/src/HybridScorer.ts, apps/retriever/src/RetrievalEngine.ts, apps/retriever/src/demo/RetrievalDemo.ts, apps/retriever/src/index.ts, apps/retriever/src/types.ts, apps/retriever/tsconfig.json
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/retriever/src/RetrievalEngine.ts`:
- Around line 106-114: Wrap the embedding calls in RetrievalEngine (the call to
this.embedder.embedQuery and this.store.loadEmbeddings) with explicit try/catch
so failures are caught and annotated with context; on error, log or rethrow a
new Error that includes which operation failed (embedQuery vs loadEmbeddings),
the query or chunkIds involved (or their counts/ids), and the original error to
preserve the stack, and consider returning a sensible fallback (empty
embeddings/empty candidates) or letting the caller handle the failure. Ensure
you update the code paths that use queryEmbedding and embeddings to handle a
thrown or fallback result.
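A sketch of the wrapping described above. `wrapOperationError` is hypothetical here (a helper of that name appears in a later commit on this PR); the signature shown is my assumption.

```typescript
// Hypothetical helper: annotate a failure with the operation and context,
// preserving the original stack trace in the wrapped error.
function wrapOperationError(op: string, context: string, cause: unknown): Error {
  const detail = cause instanceof Error ? cause.message : String(cause);
  const wrapped = new Error(`RetrievalEngine.${op} failed (${context}): ${detail}`);
  if (cause instanceof Error && cause.stack) {
    wrapped.stack = `${wrapped.stack ?? ""}\nCaused by: ${cause.stack}`;
  }
  return wrapped;
}

// Usage sketch: wrap the embedder call so failures carry context.
async function embedQuerySafely(
  embedQuery: (q: string) => Promise<number[]>,
  query: string
): Promise<number[]> {
  try {
    return await embedQuery(query);
  } catch (error: unknown) {
    throw wrapOperationError("embedQuery", `queryLength=${query.length}`, error);
  }
}
```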
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4ee3361b-9ecf-493c-af0d-b950cd3f9ebd
📒 Files selected for processing (2)
apps/retriever/src/HybridScorer.ts, apps/retriever/src/RetrievalEngine.ts
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@apps/retriever/src/RetrievalEngine.ts`:
- Around line 127-131: The error handler in RetrievalEngine currently injects
raw user query text into wrapOperationError in the embedQuery block (and
similarly includes full chunk ID lists elsewhere); change these to avoid leaking
sensitive data by passing redacted values (e.g., replace query with "<redacted>"
or a fixed-length hash/summary) and replace explicit ID arrays with safe
metadata such as the count of chunks or hashed/trimmed IDs; update the calls to
wrapOperationError in embedQuery and the other location (the block handling
chunk IDs) to include only non-sensitive context (e.g., queryLength or
chunkCount and optionally a deterministic short hash) instead of the full query
or full ID lists.
- Around line 102-106: The check in RetrievalEngine.search currently only
enforces limit > 0 but needs an upper bound to avoid resource amplification (the
code later multiplies limit by 5); add a MAX_LIMIT constant (e.g.
MAX_SEARCH_LIMIT) and enforce that limit is an integer between 1 and MAX_LIMIT
in the existing validation for limit (update the RangeError message
accordingly), and ensure any subsequent calculation (the multiplication at the
adjusted-limit step where limit is multiplied by 5) uses the validated/capped
value so callers cannot trigger excessive lexical lookups, embeddings, or
ranking work.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 4ecbd3b5-1fb9-4b57-aa30-636e29d3e9e3
📒 Files selected for processing (1)
apps/retriever/src/RetrievalEngine.ts
```typescript
if (!Number.isInteger(limit) || limit <= 0) {
  throw new RangeError(
    `RetrievalEngine.search: limit must be a positive integer (got ${limit}).`
  );
}
```
Add an upper bound for limit to prevent resource amplification.
Line 114 multiplies caller input by 5. With no max cap at Line 102-106, a large limit can trigger expensive lexical lookup, embedding loads, and ranking work.
⚙️ Proposed fix

```diff
+const MAX_LIMIT = 100;
 ...
-  if (!Number.isInteger(limit) || limit <= 0) {
+  if (!Number.isInteger(limit) || limit <= 0 || limit > MAX_LIMIT) {
     throw new RangeError(
-      `RetrievalEngine.search: limit must be a positive integer (got ${limit}).`
+      `RetrievalEngine.search: limit must be an integer between 1 and ${MAX_LIMIT} (got ${limit}).`
     );
   }
```

As per coding guidelines: "The code adheres to best practices recommended for performance".
Also applies to: 114-115
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/retriever/src/RetrievalEngine.ts` around lines 102 - 106, The check in
RetrievalEngine.search currently only enforces limit > 0 but needs an upper
bound to avoid resource amplification (the code later multiplies limit by 5);
add a MAX_LIMIT constant (e.g. MAX_SEARCH_LIMIT) and enforce that limit is an
integer between 1 and MAX_LIMIT in the existing validation for limit (update the
RangeError message accordingly), and ensure any subsequent calculation (the
multiplication at the adjusted-limit step where limit is multiplied by 5) uses
the validated/capped value so callers cannot trigger excessive lexical lookups,
embeddings, or ranking work.
```typescript
try {
  queryEmbedding = await this.embedder.embedQuery(query);
} catch (error: unknown) {
  throw wrapOperationError("embedQuery", `query="${query}"`, error);
}
```
Avoid logging raw query text and full chunk ID lists in thrown errors.
Line 130 and Line 145 embed potentially sensitive payloads directly into error messages. This can leak user input/internal identifiers via logs and telemetry.
🔒 Proposed fix

```diff
-      throw wrapOperationError("embedQuery", `query="${query}"`, error);
+      throw wrapOperationError("embedQuery", `queryLength=${query.length}`, error);
 ...
-        `chunkCount=${chunkIds.length}, chunkIds=[${chunkIds.join(", ")}]`,
+        `chunkCount=${chunkIds.length}`,
```

As per coding guidelines: "No exposed API keys or sensitive data".
Also applies to: 143-146
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@apps/retriever/src/RetrievalEngine.ts` around lines 127 - 131, The error
handler in RetrievalEngine currently injects raw user query text into
wrapOperationError in the embedQuery block (and similarly includes full chunk ID
lists elsewhere); change these to avoid leaking sensitive data by passing
redacted values (e.g., replace query with "<redacted>" or a fixed-length
hash/summary) and replace explicit ID arrays with safe metadata such as the
count of chunks or hashed/trimmed IDs; update the calls to wrapOperationError in
embedQuery and the other location (the block handling chunk IDs) to include only
non-sensitive context (e.g., queryLength or chunkCount and optionally a
deterministic short hash) instead of the full query or full ID lists.
Overview
This PR introduces a modular hybrid retrieval engine for Smart Notes.
The goal is to establish a retrieval architecture capable of combining lexical search (e.g. SQLite FTS5) with semantic similarity for future AI-assisted features.
The design follows an adapter-based architecture so the retrieval pipeline remains independent of specific storage or embedding implementations.
Architecture
The implemented pipeline follows this flow:
Query
→ lexical candidate search
→ embedding loading
→ cosine similarity scoring
→ hybrid ranking
→ top-k results
Hybrid ranking blends lexical and semantic signals using a weighted scoring function:
final_score = α * semantic_score + β * normalized_lexical_score
where α and β are the blend weights (validated to sum to 1) and lexical scores are min/max normalized before blending to avoid scale mismatch with cosine similarity.
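As code, the blend is just a weighted sum; a sketch assuming the weights and scores have already been validated and normalized:

```typescript
// final_score = alpha * semantic_score + beta * normalized_lexical_score
function hybridScore(
  semanticScore: number,
  normalizedLexicalScore: number,
  alpha: number,
  beta: number
): number {
  return alpha * semanticScore + beta * normalizedLexicalScore;
}
```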
Key Components
RetrievalEngine
Orchestrates the full retrieval pipeline while remaining independent of storage implementations.
HybridScorer
Combines lexical and semantic signals into a final ranking score.
CosineSimilarity
Efficient vector similarity utility used for semantic scoring.
Types & Adapter Contracts
Defines clear interfaces for the storage and embedding adapters (RetrievalStore, QueryEmbeddingProvider) and the scoring types (SearchCandidate, SearchResult, HybridScoreWeights).
This allows future integration with concrete backends such as SQLite FTS5 for lexical search and external embedding providers.
Improvements After Code Audit
Based on automated review feedback, the following robustness improvements were added:
- candidate pool widening (limit * 5) documented for store adapters
- limit input validation
A demo runner is included to demonstrate the pipeline without external dependencies.
Run:
Why This Matters
This module lays the foundation for future AI capabilities in Smart Notes, including:
The design keeps the system:
Summary by CodeRabbit
New Features
Chores
Documentation / Demos