feat(crawler,db): add cross-page paragraph deduplication for boilerplate filtering#604
Conversation
…ate filtering Track paragraph fingerprints per page in a new page_paragraph_hashes table and filter paragraphs appearing on >50% of a domain's pages before chunking. This removes repeated boilerplate (nav bars, footers, CMP vendor lists) from search chunks, improving retrieval quality.
📝 WalkthroughWalkthroughThis pull request introduces cross-page paragraph deduplication for boilerplate filtering in the crawler indexing pipeline. A new utility module provides paragraph hashing, content deduplication, and frequency-based boilerplate filtering functions. The indexing service integrates these utilities to compute and track paragraph hashes per page in a new database table, calculate cross-page frequencies when minimum thresholds are met, filter boilerplate paragraphs from content, and store filtering hashes. Indexing is skipped when both content and filtering hashes match existing stored values; otherwise, only filtered content is chunked, embedded, and indexed in a single transaction. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@services/crawler/app/services/indexing_service.py`:
- Around line 57-67: The delete-then-insert sequence operating on
page_paragraph_hashes must be run inside a DB transaction so the refresh is
atomic; update the block using the existing acquire_with_retry(conn) session to
open a transaction (e.g., conn.transaction() or the library's transaction
context) and perform the DELETE and conditional INSERT (executemany) within that
single transaction, ensuring you still handle the case where hashes is empty by
at least running the DELETE inside the transaction. Reference:
acquire_with_retry, page_paragraph_hashes, and the executemany/execute calls
shown.
- Around line 69-73: The current total_pages calculation counts only URLs
present in page_paragraph_hashes, excluding pages with zero extracted paragraphs
and thus inflating frequency rates; update the query that sets total_pages (used
with MIN_DOMAIN_PAGES_FOR_DEDUP and hashes) to count all domain pages
instead—either by querying the pages table for the domain (e.g., SELECT COUNT(*)
FROM pages WHERE domain = $1) or by using a LEFT JOIN/COUNT DISTINCT on the
pages table to include pages with zero paragraphs—so the denominator reflects “%
of domain pages” as intended.
- Around line 100-103: When chunks is empty the current early return leaves
previously indexed chunks around; modify the empty-branch inside the same method
(the block that uses acquire_with_retry(self._pool) and calls
_UPSERT_WEBSITE_URL) to also remove any prior chunks for this URL before
returning: execute a DELETE for the URL via the DB connection (e.g.
conn.execute(_DELETE_CHUNKS_FOR_URL, url) — create that SQL constant if missing)
and invoke the vector/embedding cleanup on the in-memory/indexed store if
applicable (e.g. self._vector_store.delete_documents or
delete_by_namespace(url)); keep the existing upsert of the website metadata and
then return the same {"url": url, "status": "empty", "chunks_indexed": 0}.
In `@services/crawler/app/utils/paragraph_dedup.py`:
- Around line 29-30: Normalize newlines in the input before splitting into
paragraphs: replace CRLF (\r\n) and lone CR (\r) with LF (\n) on the variable
used as the source (the `content` string) so both the loop using
`content.strip().split("\n\n")` and the later similar block (lines around where
`para` is iterated) operate on consistent newline characters; then split on
one-or-more blank lines (i.e., treat consecutive newlines as paragraph
separators) to avoid missing paragraphs on CRLF pages and preserve the existing
`para.strip()` checks.
In `@services/crawler/tests/test_indexing_service.py`:
- Around line 236-249: The test is missing verification that old chunks are
removed when filtering yields an empty page; after calling
indexing_service.index_page with content that filters to empty, assert that
chunk deletion ran by checking mock_chunk.delete_by_paragraph_hash was called
with boilerplate_h (or, if deletions happen via SQL, assert
mock_conn.execute.call_args_list contains a DELETE targeting
website_paragraphs/website_paragraph_chunks). Use the existing boilerplate_h,
mock_chunk, indexing_service.index_page and mock_conn.execute call list to
locate where to add the assertion.
In `@services/crawler/tests/test_paragraph_dedup.py`:
- Around line 41-45: Add tests that cover CRLF paragraph boundaries for
extract_paragraph_hashes by duplicating the existing newline test cases but
using "\r\n\r\n" between paragraphs (e.g., create content like "First
paragraph\r\n\r\nSecond paragraph\r\n\r\nThird paragraph") and assert the
extractor returns the same number of paragraphs/hashes and same values as the
LF-only cases; update both the test_multiple_paragraphs case and the other test
block around lines 92-102 to include CRLF variants to ensure
extraction/filtering is stable across newline formats.
ℹ️ Review info
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (5)
services/crawler/app/services/indexing_service.pyservices/crawler/app/utils/paragraph_dedup.pyservices/crawler/tests/test_indexing_service.pyservices/crawler/tests/test_paragraph_dedup.pyservices/db/init-scripts/04-create-search-database.sql
| async with acquire_with_retry(self._pool) as conn: | ||
| existing_hash = await conn.fetchval("SELECT content_hash FROM chunks WHERE url = $1 LIMIT 1", url) | ||
| if existing_hash == content_hash: | ||
| return {"url": url, "status": "skipped", "chunks_indexed": 0} | ||
| await conn.execute( | ||
| "DELETE FROM page_paragraph_hashes WHERE domain = $1 AND url = $2", | ||
| domain, | ||
| url, | ||
| ) | ||
| if hashes: | ||
| await conn.executemany( | ||
| "INSERT INTO page_paragraph_hashes (domain, url, paragraph_hash) VALUES ($1, $2, $3)", | ||
| [(domain, url, h) for h in hashes], | ||
| ) |
There was a problem hiding this comment.
Make paragraph-hash refresh atomic.
Line 58 deletes old hashes before Line 64 inserts replacements, but this runs outside a transaction. A mid-sequence failure can leave a page with no hashes and corrupt subsequent frequency calculations.
🔧 Proposed fix
- async with acquire_with_retry(self._pool) as conn:
+ async with acquire_with_retry(self._pool) as conn, conn.transaction():
await conn.execute(
"DELETE FROM page_paragraph_hashes WHERE domain = $1 AND url = $2",
domain,
url,
)
if hashes:
await conn.executemany(
"INSERT INTO page_paragraph_hashes (domain, url, paragraph_hash) VALUES ($1, $2, $3)",
[(domain, url, h) for h in hashes],
)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/app/services/indexing_service.py` around lines 57 - 67, The
delete-then-insert sequence operating on page_paragraph_hashes must be run
inside a DB transaction so the refresh is atomic; update the block using the
existing acquire_with_retry(conn) session to open a transaction (e.g.,
conn.transaction() or the library's transaction context) and perform the DELETE
and conditional INSERT (executemany) within that single transaction, ensuring
you still handle the case where hashes is empty by at least running the DELETE
inside the transaction. Reference: acquire_with_retry, page_paragraph_hashes,
and the executemany/execute calls shown.
| total_pages = await conn.fetchval( | ||
| "SELECT COUNT(DISTINCT url) FROM page_paragraph_hashes WHERE domain = $1", | ||
| domain, | ||
| ) | ||
| if total_pages >= MIN_DOMAIN_PAGES_FOR_DEDUP and hashes: |
There was a problem hiding this comment.
Use domain page count as the frequency denominator.
Line 69 currently counts only URLs present in page_paragraph_hashes. Pages with zero extracted paragraphs are excluded, which inflates frequencies and can over-filter content versus the “% of domain pages” objective.
🔧 Proposed fix
- total_pages = await conn.fetchval(
- "SELECT COUNT(DISTINCT url) FROM page_paragraph_hashes WHERE domain = $1",
- domain,
- )
+ total_pages = await conn.fetchval(
+ "SELECT COUNT(*) FROM website_urls WHERE domain = $1 AND content IS NOT NULL",
+ domain,
+ )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/app/services/indexing_service.py` around lines 69 - 73, The
current total_pages calculation counts only URLs present in
page_paragraph_hashes, excluding pages with zero extracted paragraphs and thus
inflating frequency rates; update the query that sets total_pages (used with
MIN_DOMAIN_PAGES_FOR_DEDUP and hashes) to count all domain pages instead—either
by querying the pages table for the domain (e.g., SELECT COUNT(*) FROM pages
WHERE domain = $1) or by using a LEFT JOIN/COUNT DISTINCT on the pages table to
include pages with zero paragraphs—so the denominator reflects “% of domain
pages” as intended.
| if not chunks: | ||
| async with acquire_with_retry(self._pool) as conn: | ||
| await conn.execute(_UPSERT_WEBSITE_URL, domain, url, title, content_hash, filtered_hash) | ||
| return {"url": url, "status": "empty", "chunks_indexed": 0} |
There was a problem hiding this comment.
Delete stale chunks when filtered content becomes empty.
When chunks is empty, Line 103 returns without removing previously indexed chunks for that URL. This leaves stale content in search even though current filtered content is empty.
🔧 Proposed fix
if not chunks:
- async with acquire_with_retry(self._pool) as conn:
+ async with acquire_with_retry(self._pool) as conn, conn.transaction():
await conn.execute(_UPSERT_WEBSITE_URL, domain, url, title, content_hash, filtered_hash)
+ await conn.execute("DELETE FROM chunks WHERE domain = $1 AND url = $2", domain, url)
return {"url": url, "status": "empty", "chunks_indexed": 0}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if not chunks: | |
| async with acquire_with_retry(self._pool) as conn: | |
| await conn.execute(_UPSERT_WEBSITE_URL, domain, url, title, content_hash, filtered_hash) | |
| return {"url": url, "status": "empty", "chunks_indexed": 0} | |
| if not chunks: | |
| async with acquire_with_retry(self._pool) as conn, conn.transaction(): | |
| await conn.execute(_UPSERT_WEBSITE_URL, domain, url, title, content_hash, filtered_hash) | |
| await conn.execute("DELETE FROM chunks WHERE domain = $1 AND url = $2", domain, url) | |
| return {"url": url, "status": "empty", "chunks_indexed": 0} |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/app/services/indexing_service.py` around lines 100 - 103,
When chunks is empty the current early return leaves previously indexed chunks
around; modify the empty-branch inside the same method (the block that uses
acquire_with_retry(self._pool) and calls _UPSERT_WEBSITE_URL) to also remove any
prior chunks for this URL before returning: execute a DELETE for the URL via the
DB connection (e.g. conn.execute(_DELETE_CHUNKS_FOR_URL, url) — create that SQL
constant if missing) and invoke the vector/embedding cleanup on the
in-memory/indexed store if applicable (e.g. self._vector_store.delete_documents
or delete_by_namespace(url)); keep the existing upsert of the website metadata
and then return the same {"url": url, "status": "empty", "chunks_indexed": 0}.
| for para in content.strip().split("\n\n"): | ||
| if not para.strip(): |
There was a problem hiding this comment.
Normalize newlines before paragraph splitting.
Line 29 and Line 57 split only on "\n\n", which breaks for CRLF content and causes missed dedup/filtering on pages using \r\n boundaries.
💡 Proposed fix
import hashlib
+import re
import unicodedata
BOILERPLATE_FREQUENCY_THRESHOLD = 0.5
MIN_DOMAIN_PAGES_FOR_DEDUP = 5
+_PARAGRAPH_SPLIT_RE = re.compile(r"\n\s*\n")
+
+
+def _split_paragraphs(content: str) -> list[str]:
+ normalized = content.replace("\r\n", "\n").strip()
+ if not normalized:
+ return []
+ return [p for p in _PARAGRAPH_SPLIT_RE.split(normalized) if p.strip()]
@@
def extract_paragraph_hashes(content: str) -> list[str]:
@@
- for para in content.strip().split("\n\n"):
- if not para.strip():
- continue
+ for para in _split_paragraphs(content):
h = paragraph_hash(para)
if h not in seen:
seen.add(h)
result.append(h)
@@
def filter_boilerplate_paragraphs(
@@
- paragraphs = content.strip().split("\n\n")
+ paragraphs = _split_paragraphs(content)
kept = [p for p in paragraphs if frequencies.get(paragraph_hash(p), 0.0) <= threshold]
return "\n\n".join(kept)Also applies to: 57-59
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/app/utils/paragraph_dedup.py` around lines 29 - 30,
Normalize newlines in the input before splitting into paragraphs: replace CRLF
(\r\n) and lone CR (\r) with LF (\n) on the variable used as the source (the
`content` string) so both the loop using `content.strip().split("\n\n")` and the
later similar block (lines around where `para` is iterated) operate on
consistent newline characters; then split on one-or-more blank lines (i.e.,
treat consecutive newlines as paragraph separators) to avoid missing paragraphs
on CRLF pages and preserve the existing `para.strip()` checks.
| async def test_empty_after_filtering_stores_filtering_hash(self, mock_chunk, mock_conn, indexing_service): | ||
| content = "Only boilerplate" | ||
| boilerplate_h = paragraph_hash("Only boilerplate") | ||
|
|
||
| mock_conn.fetchval = AsyncMock(return_value=10) | ||
| mock_conn.fetch = AsyncMock(return_value=[{"paragraph_hash": boilerplate_h, "url_count": 10}]) | ||
| mock_conn.fetchrow = AsyncMock(return_value=None) | ||
|
|
||
| result = await indexing_service.index_page("example.com", "https://example.com/page", "Title", content) | ||
|
|
||
| assert result["status"] == "empty" | ||
| upsert_calls = [str(c) for c in mock_conn.execute.call_args_list if "website_urls" in str(c).lower()] | ||
| assert any("filtering_hash" in c for c in upsert_calls) | ||
|
|
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Assert chunk cleanup on the empty path.
This test should also verify that old chunks are deleted when filtering removes all content, to prevent stale-index regressions.
🧪 Suggested assertion
async def test_empty_after_filtering_stores_filtering_hash(self, mock_chunk, mock_conn, indexing_service):
@@
assert result["status"] == "empty"
upsert_calls = [str(c) for c in mock_conn.execute.call_args_list if "website_urls" in str(c).lower()]
assert any("filtering_hash" in c for c in upsert_calls)
+ delete_chunk_calls = [str(c) for c in mock_conn.execute.call_args_list if "DELETE FROM chunks" in str(c)]
+ assert delete_chunk_calls📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| async def test_empty_after_filtering_stores_filtering_hash(self, mock_chunk, mock_conn, indexing_service): | |
| content = "Only boilerplate" | |
| boilerplate_h = paragraph_hash("Only boilerplate") | |
| mock_conn.fetchval = AsyncMock(return_value=10) | |
| mock_conn.fetch = AsyncMock(return_value=[{"paragraph_hash": boilerplate_h, "url_count": 10}]) | |
| mock_conn.fetchrow = AsyncMock(return_value=None) | |
| result = await indexing_service.index_page("example.com", "https://example.com/page", "Title", content) | |
| assert result["status"] == "empty" | |
| upsert_calls = [str(c) for c in mock_conn.execute.call_args_list if "website_urls" in str(c).lower()] | |
| assert any("filtering_hash" in c for c in upsert_calls) | |
| async def test_empty_after_filtering_stores_filtering_hash(self, mock_chunk, mock_conn, indexing_service): | |
| content = "Only boilerplate" | |
| boilerplate_h = paragraph_hash("Only boilerplate") | |
| mock_conn.fetchval = AsyncMock(return_value=10) | |
| mock_conn.fetch = AsyncMock(return_value=[{"paragraph_hash": boilerplate_h, "url_count": 10}]) | |
| mock_conn.fetchrow = AsyncMock(return_value=None) | |
| result = await indexing_service.index_page("example.com", "https://example.com/page", "Title", content) | |
| assert result["status"] == "empty" | |
| upsert_calls = [str(c) for c in mock_conn.execute.call_args_list if "website_urls" in str(c).lower()] | |
| assert any("filtering_hash" in c for c in upsert_calls) | |
| delete_chunk_calls = [str(c) for c in mock_conn.execute.call_args_list if "DELETE FROM chunks" in str(c)] | |
| assert delete_chunk_calls |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/tests/test_indexing_service.py` around lines 236 - 249, The
test is missing verification that old chunks are removed when filtering yields
an empty page; after calling indexing_service.index_page with content that
filters to empty, assert that chunk deletion ran by checking
mock_chunk.delete_by_paragraph_hash was called with boilerplate_h (or, if
deletions happen via SQL, assert mock_conn.execute.call_args_list contains a
DELETE targeting website_paragraphs/website_paragraph_chunks). Use the existing
boilerplate_h, mock_chunk, indexing_service.index_page and mock_conn.execute
call list to locate where to add the assertion.
| def test_multiple_paragraphs(self): | ||
| content = "First paragraph\n\nSecond paragraph\n\nThird paragraph" | ||
| result = extract_paragraph_hashes(content) | ||
| assert len(result) == 3 | ||
|
|
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Add CRLF boundary regression tests.
Please add coverage for \r\n\r\n paragraph boundaries so extraction/filtering behavior is stable across newline formats.
🧪 Suggested test additions
class TestExtractParagraphHashes:
@@
+ def test_multiple_paragraphs_crlf(self):
+ content = "First paragraph\r\n\r\nSecond paragraph\r\n\r\nThird paragraph"
+ result = extract_paragraph_hashes(content)
+ assert len(result) == 3
+
class TestFilterBoilerplateParagraphs:
@@
+ def test_filtering_with_crlf_boundaries(self):
+ content = "Boilerplate text\r\n\r\nUnique content here"
+ frequencies = {paragraph_hash("Boilerplate text"): 0.8}
+ result = filter_boilerplate_paragraphs(content, frequencies)
+ assert "Boilerplate text" not in result
+ assert "Unique content here" in resultAlso applies to: 92-102
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/tests/test_paragraph_dedup.py` around lines 41 - 45, Add
tests that cover CRLF paragraph boundaries for extract_paragraph_hashes by
duplicating the existing newline test cases but using "\r\n\r\n" between
paragraphs (e.g., create content like "First paragraph\r\n\r\nSecond
paragraph\r\n\r\nThird paragraph") and assert the extractor returns the same
number of paragraphs/hashes and same values as the LF-only cases; update both
the test_multiple_paragraphs case and the other test block around lines 92-102
to include CRLF variants to ensure extraction/filtering is stable across newline
formats.
Greptile SummaryThis PR implements incremental cross-page paragraph deduplication to filter boilerplate content (navigation bars, footers, CMP vendor lists) that appears on more than 50% of a domain's pages. The implementation introduces a new Confidence Score: 5/5
|
| Filename | Overview |
|---|---|
| services/db/init-scripts/04-create-search-database.sql | Added page_paragraph_hashes table with proper foreign key constraints and indexing for efficient boilerplate detection queries |
| services/crawler/app/utils/paragraph_dedup.py | New utility module with hash extraction and frequency-based filtering functions, properly handles unicode normalization and within-page deduplication |
| services/crawler/app/services/indexing_service.py | Integrated paragraph deduplication with dual-hash skip logic, properly handles edge cases (empty after filtering, embedding failures), correctly updates tracking tables |
Sequence Diagram
sequenceDiagram
participant Client
participant IndexingService
participant ParagraphDedup
participant Database
participant EmbeddingService
Client->>IndexingService: index_page(domain, url, content)
IndexingService->>IndexingService: compute content_hash
IndexingService->>ParagraphDedup: extract_paragraph_hashes(content)
ParagraphDedup-->>IndexingService: paragraph_hashes[]
IndexingService->>Database: DELETE old paragraph_hashes
IndexingService->>Database: INSERT new paragraph_hashes
IndexingService->>Database: COUNT total pages in domain
Database-->>IndexingService: total_pages
alt total_pages >= MIN_DOMAIN_PAGES_FOR_DEDUP
IndexingService->>Database: SELECT frequency for paragraph_hashes
Database-->>IndexingService: frequencies{}
IndexingService->>ParagraphDedup: filter_boilerplate_paragraphs(content, frequencies)
ParagraphDedup-->>IndexingService: filtered_content
else domain too small
IndexingService->>IndexingService: use original content
end
IndexingService->>IndexingService: compute filtering_hash
IndexingService->>Database: SELECT existing content_hash, filtering_hash
Database-->>IndexingService: existing_hashes
alt both hashes match existing
IndexingService-->>Client: status: skipped
else hashes differ or no existing
IndexingService->>IndexingService: chunk_content(filtered)
alt chunks empty
IndexingService->>Database: UPSERT website_urls with filtering_hash
IndexingService-->>Client: status: empty
else chunks exist
IndexingService->>EmbeddingService: embed_texts(chunks)
EmbeddingService-->>IndexingService: embeddings[]
IndexingService->>Database: BEGIN TRANSACTION
IndexingService->>Database: UPSERT website_urls
IndexingService->>Database: DELETE old chunks
IndexingService->>Database: INSERT new chunks
IndexingService->>Database: COMMIT
IndexingService-->>Client: status: indexed
end
end
Last reviewed commit: 3e4ecab
…rity Crawl4ai emits single-newline-separated markdown, so splitting on double newlines missed most boilerplate (nav bars, cookie banners, footers). Switch to line-level hashing with a MIN_LINE_LENGTH threshold to skip short lines that lack fingerprinting signal. Also cast filtering_hash to text in the upsert query.
…ratio for dedup Replace fractional frequency threshold (>50% of pages) with absolute page count threshold (>5 pages) for boilerplate detection, making filtering behavior more predictable and independent of domain size.
Summary
page_paragraph_hashestable to track paragraph fingerprints per page and aparagraph_deduputility module for hashing and frequency-based filteringTest plan
paragraph_dedup.pyunit tests pass (hashing, extraction, filtering, boundary cases)test_indexing_service.pytests pass (skip logic, hash tracking, frequency filtering, empty-after-filtering)page_paragraph_hashestable is created correctly via the DB init script🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests
Chores