Skip to content

feat(crawler,db): add cross-page paragraph deduplication for boilerplate filtering#604

Merged
larryro merged 3 commits into
mainfrom
feat/crawler-paragraph-dedup
Feb 28, 2026
Merged

feat(crawler,db): add cross-page paragraph deduplication for boilerplate filtering#604
larryro merged 3 commits into
mainfrom
feat/crawler-paragraph-dedup

Conversation

@larryro

@larryro larryro commented Feb 28, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Add incremental cross-page paragraph deduplication to the crawler indexing pipeline, filtering boilerplate content (nav bars, footers, CMP vendor lists) that appears on >50% of a domain's pages before chunking
  • Introduce a new page_paragraph_hashes table to track paragraph fingerprints per page and a paragraph_dedup utility module for hashing and frequency-based filtering
  • Update the indexing service to use a dual-hash skip check (content hash + filtering hash) so pages are re-indexed when boilerplate detection improves over time

Test plan

  • Verify paragraph_dedup.py unit tests pass (hashing, extraction, filtering, boundary cases)
  • Verify test_indexing_service.py tests pass (skip logic, hash tracking, frequency filtering, empty-after-filtering)
  • Verify integration scenarios: shared boilerplate filtered, unique content preserved, small domains unaffected
  • Confirm the page_paragraph_hashes table is created correctly via the DB init script

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Implemented cross-page boilerplate detection and filtering to identify and remove duplicate content that appears frequently across pages within a domain, improving content quality and search result relevance.
  • Tests

    • Added comprehensive test coverage for the new boilerplate filtering and paragraph deduplication features, including edge cases and integration scenarios.
  • Chores

    • Updated database schema to support paragraph hash tracking for boilerplate detection across pages.

…ate filtering

Track paragraph fingerprints per page in a new page_paragraph_hashes table
and filter paragraphs appearing on >50% of a domain's pages before chunking.
This removes repeated boilerplate (nav bars, footers, CMP vendor lists) from
search chunks, improving retrieval quality.
@coderabbitai

coderabbitai Bot commented Feb 28, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

This pull request introduces cross-page paragraph deduplication for boilerplate filtering in the crawler indexing pipeline. A new utility module provides paragraph hashing, content deduplication, and frequency-based boilerplate filtering functions. The indexing service integrates these utilities to compute and track paragraph hashes per page in a new database table, calculate cross-page frequencies when minimum thresholds are met, filter boilerplate paragraphs from content, and store filtering hashes. Indexing is skipped when both content and filtering hashes match existing stored values; otherwise, only filtered content is chunked, embedded, and indexed in a single transaction.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 13.73% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: adding cross-page paragraph deduplication for boilerplate filtering across crawler and database components.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/crawler-paragraph-dedup

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@services/crawler/app/services/indexing_service.py`:
- Around line 57-67: The delete-then-insert sequence operating on
page_paragraph_hashes must be run inside a DB transaction so the refresh is
atomic; update the block using the existing acquire_with_retry(conn) session to
open a transaction (e.g., conn.transaction() or the library's transaction
context) and perform the DELETE and conditional INSERT (executemany) within that
single transaction, ensuring you still handle the case where hashes is empty by
at least running the DELETE inside the transaction. Reference:
acquire_with_retry, page_paragraph_hashes, and the executemany/execute calls
shown.
- Around line 69-73: The current total_pages calculation counts only URLs
present in page_paragraph_hashes, excluding pages with zero extracted paragraphs
and thus inflating frequency rates; update the query that sets total_pages (used
with MIN_DOMAIN_PAGES_FOR_DEDUP and hashes) to count all domain pages
instead—either by querying the pages table for the domain (e.g., SELECT COUNT(*)
FROM pages WHERE domain = $1) or by using a LEFT JOIN/COUNT DISTINCT on the
pages table to include pages with zero paragraphs—so the denominator reflects “%
of domain pages” as intended.
- Around line 100-103: When chunks is empty the current early return leaves
previously indexed chunks around; modify the empty-branch inside the same method
(the block that uses acquire_with_retry(self._pool) and calls
_UPSERT_WEBSITE_URL) to also remove any prior chunks for this URL before
returning: execute a DELETE for the URL via the DB connection (e.g.
conn.execute(_DELETE_CHUNKS_FOR_URL, url) — create that SQL constant if missing)
and invoke the vector/embedding cleanup on the in-memory/indexed store if
applicable (e.g. self._vector_store.delete_documents or
delete_by_namespace(url)); keep the existing upsert of the website metadata and
then return the same {"url": url, "status": "empty", "chunks_indexed": 0}.

In `@services/crawler/app/utils/paragraph_dedup.py`:
- Around line 29-30: Normalize newlines in the input before splitting into
paragraphs: replace CRLF (\r\n) and lone CR (\r) with LF (\n) on the variable
used as the source (the `content` string) so both the loop using
`content.strip().split("\n\n")` and the later similar block (lines around where
`para` is iterated) operate on consistent newline characters; then split on
one-or-more blank lines (i.e., treat consecutive newlines as paragraph
separators) to avoid missing paragraphs on CRLF pages and preserve the existing
`para.strip()` checks.

In `@services/crawler/tests/test_indexing_service.py`:
- Around line 236-249: The test is missing verification that old chunks are
removed when filtering yields an empty page; after calling
indexing_service.index_page with content that filters to empty, assert that
chunk deletion ran by checking mock_chunk.delete_by_paragraph_hash was called
with boilerplate_h (or, if deletions happen via SQL, assert
mock_conn.execute.call_args_list contains a DELETE targeting
website_paragraphs/website_paragraph_chunks). Use the existing boilerplate_h,
mock_chunk, indexing_service.index_page and mock_conn.execute call list to
locate where to add the assertion.

In `@services/crawler/tests/test_paragraph_dedup.py`:
- Around line 41-45: Add tests that cover CRLF paragraph boundaries for
extract_paragraph_hashes by duplicating the existing newline test cases but
using "\r\n\r\n" between paragraphs (e.g., create content like "First
paragraph\r\n\r\nSecond paragraph\r\n\r\nThird paragraph") and assert the
extractor returns the same number of paragraphs/hashes and same values as the
LF-only cases; update both the test_multiple_paragraphs case and the other test
block around lines 92-102 to include CRLF variants to ensure
extraction/filtering is stable across newline formats.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b6d1468 and 3e4ecab.

📒 Files selected for processing (5)
  • services/crawler/app/services/indexing_service.py
  • services/crawler/app/utils/paragraph_dedup.py
  • services/crawler/tests/test_indexing_service.py
  • services/crawler/tests/test_paragraph_dedup.py
  • services/db/init-scripts/04-create-search-database.sql

Comment on lines 57 to +67
async with acquire_with_retry(self._pool) as conn:
existing_hash = await conn.fetchval("SELECT content_hash FROM chunks WHERE url = $1 LIMIT 1", url)
if existing_hash == content_hash:
return {"url": url, "status": "skipped", "chunks_indexed": 0}
await conn.execute(
"DELETE FROM page_paragraph_hashes WHERE domain = $1 AND url = $2",
domain,
url,
)
if hashes:
await conn.executemany(
"INSERT INTO page_paragraph_hashes (domain, url, paragraph_hash) VALUES ($1, $2, $3)",
[(domain, url, h) for h in hashes],
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Make paragraph-hash refresh atomic.

Line 58 deletes old hashes before Line 64 inserts replacements, but this runs outside a transaction. A mid-sequence failure can leave a page with no hashes and corrupt subsequent frequency calculations.

🔧 Proposed fix
-        async with acquire_with_retry(self._pool) as conn:
+        async with acquire_with_retry(self._pool) as conn, conn.transaction():
             await conn.execute(
                 "DELETE FROM page_paragraph_hashes WHERE domain = $1 AND url = $2",
                 domain,
                 url,
             )
             if hashes:
                 await conn.executemany(
                     "INSERT INTO page_paragraph_hashes (domain, url, paragraph_hash) VALUES ($1, $2, $3)",
                     [(domain, url, h) for h in hashes],
                 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/services/indexing_service.py` around lines 57 - 67, The
delete-then-insert sequence operating on page_paragraph_hashes must be run
inside a DB transaction so the refresh is atomic; update the block using the
existing acquire_with_retry(conn) session to open a transaction (e.g.,
conn.transaction() or the library's transaction context) and perform the DELETE
and conditional INSERT (executemany) within that single transaction, ensuring
you still handle the case where hashes is empty by at least running the DELETE
inside the transaction. Reference: acquire_with_retry, page_paragraph_hashes,
and the executemany/execute calls shown.

Comment on lines +69 to +73
total_pages = await conn.fetchval(
"SELECT COUNT(DISTINCT url) FROM page_paragraph_hashes WHERE domain = $1",
domain,
)
if total_pages >= MIN_DOMAIN_PAGES_FOR_DEDUP and hashes:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Use domain page count as the frequency denominator.

Line 69 currently counts only URLs present in page_paragraph_hashes. Pages with zero extracted paragraphs are excluded, which inflates frequencies and can over-filter content versus the “% of domain pages” objective.

🔧 Proposed fix
-            total_pages = await conn.fetchval(
-                "SELECT COUNT(DISTINCT url) FROM page_paragraph_hashes WHERE domain = $1",
-                domain,
-            )
+            total_pages = await conn.fetchval(
+                "SELECT COUNT(*) FROM website_urls WHERE domain = $1 AND content IS NOT NULL",
+                domain,
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/services/indexing_service.py` around lines 69 - 73, The
current total_pages calculation counts only URLs present in
page_paragraph_hashes, excluding pages with zero extracted paragraphs and thus
inflating frequency rates; update the query that sets total_pages (used with
MIN_DOMAIN_PAGES_FOR_DEDUP and hashes) to count all domain pages instead—either
by querying the pages table for the domain (e.g., SELECT COUNT(*) FROM pages
WHERE domain = $1) or by using a LEFT JOIN/COUNT DISTINCT on the pages table to
include pages with zero paragraphs—so the denominator reflects “% of domain
pages” as intended.

Comment on lines 100 to 103
if not chunks:
async with acquire_with_retry(self._pool) as conn:
await conn.execute(_UPSERT_WEBSITE_URL, domain, url, title, content_hash, filtered_hash)
return {"url": url, "status": "empty", "chunks_indexed": 0}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Delete stale chunks when filtered content becomes empty.

When chunks is empty, Line 103 returns without removing previously indexed chunks for that URL. This leaves stale content in search even though current filtered content is empty.

🔧 Proposed fix
         if not chunks:
-            async with acquire_with_retry(self._pool) as conn:
+            async with acquire_with_retry(self._pool) as conn, conn.transaction():
                 await conn.execute(_UPSERT_WEBSITE_URL, domain, url, title, content_hash, filtered_hash)
+                await conn.execute("DELETE FROM chunks WHERE domain = $1 AND url = $2", domain, url)
             return {"url": url, "status": "empty", "chunks_indexed": 0}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if not chunks:
async with acquire_with_retry(self._pool) as conn:
await conn.execute(_UPSERT_WEBSITE_URL, domain, url, title, content_hash, filtered_hash)
return {"url": url, "status": "empty", "chunks_indexed": 0}
if not chunks:
async with acquire_with_retry(self._pool) as conn, conn.transaction():
await conn.execute(_UPSERT_WEBSITE_URL, domain, url, title, content_hash, filtered_hash)
await conn.execute("DELETE FROM chunks WHERE domain = $1 AND url = $2", domain, url)
return {"url": url, "status": "empty", "chunks_indexed": 0}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/services/indexing_service.py` around lines 100 - 103,
When chunks is empty the current early return leaves previously indexed chunks
around; modify the empty-branch inside the same method (the block that uses
acquire_with_retry(self._pool) and calls _UPSERT_WEBSITE_URL) to also remove any
prior chunks for this URL before returning: execute a DELETE for the URL via the
DB connection (e.g. conn.execute(_DELETE_CHUNKS_FOR_URL, url) — create that SQL
constant if missing) and invoke the vector/embedding cleanup on the
in-memory/indexed store if applicable (e.g. self._vector_store.delete_documents
or delete_by_namespace(url)); keep the existing upsert of the website metadata
and then return the same {"url": url, "status": "empty", "chunks_indexed": 0}.

Comment on lines +29 to +30
for para in content.strip().split("\n\n"):
if not para.strip():

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Normalize newlines before paragraph splitting.

Line 29 and Line 57 split only on "\n\n", which breaks for CRLF content and causes missed dedup/filtering on pages using \r\n boundaries.

💡 Proposed fix
 import hashlib
+import re
 import unicodedata
 
 BOILERPLATE_FREQUENCY_THRESHOLD = 0.5
 MIN_DOMAIN_PAGES_FOR_DEDUP = 5
+_PARAGRAPH_SPLIT_RE = re.compile(r"\n\s*\n")
+
+
+def _split_paragraphs(content: str) -> list[str]:
+    normalized = content.replace("\r\n", "\n").strip()
+    if not normalized:
+        return []
+    return [p for p in _PARAGRAPH_SPLIT_RE.split(normalized) if p.strip()]
 
@@
 def extract_paragraph_hashes(content: str) -> list[str]:
@@
-    for para in content.strip().split("\n\n"):
-        if not para.strip():
-            continue
+    for para in _split_paragraphs(content):
         h = paragraph_hash(para)
         if h not in seen:
             seen.add(h)
             result.append(h)
@@
 def filter_boilerplate_paragraphs(
@@
-    paragraphs = content.strip().split("\n\n")
+    paragraphs = _split_paragraphs(content)
     kept = [p for p in paragraphs if frequencies.get(paragraph_hash(p), 0.0) <= threshold]
     return "\n\n".join(kept)

Also applies to: 57-59

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/utils/paragraph_dedup.py` around lines 29 - 30,
Normalize newlines in the input before splitting into paragraphs: replace CRLF
(\r\n) and lone CR (\r) with LF (\n) on the variable used as the source (the
`content` string) so both the loop using `content.strip().split("\n\n")` and the
later similar block (lines around where `para` is iterated) operate on
consistent newline characters; then split on one-or-more blank lines (i.e.,
treat consecutive newlines as paragraph separators) to avoid missing paragraphs
on CRLF pages and preserve the existing `para.strip()` checks.

Comment on lines +236 to 249
async def test_empty_after_filtering_stores_filtering_hash(self, mock_chunk, mock_conn, indexing_service):
content = "Only boilerplate"
boilerplate_h = paragraph_hash("Only boilerplate")

mock_conn.fetchval = AsyncMock(return_value=10)
mock_conn.fetch = AsyncMock(return_value=[{"paragraph_hash": boilerplate_h, "url_count": 10}])
mock_conn.fetchrow = AsyncMock(return_value=None)

result = await indexing_service.index_page("example.com", "https://example.com/page", "Title", content)

assert result["status"] == "empty"
upsert_calls = [str(c) for c in mock_conn.execute.call_args_list if "website_urls" in str(c).lower()]
assert any("filtering_hash" in c for c in upsert_calls)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Assert chunk cleanup on the empty path.

This test should also verify that old chunks are deleted when filtering removes all content, to prevent stale-index regressions.

🧪 Suggested assertion
     async def test_empty_after_filtering_stores_filtering_hash(self, mock_chunk, mock_conn, indexing_service):
@@
         assert result["status"] == "empty"
         upsert_calls = [str(c) for c in mock_conn.execute.call_args_list if "website_urls" in str(c).lower()]
         assert any("filtering_hash" in c for c in upsert_calls)
+        delete_chunk_calls = [str(c) for c in mock_conn.execute.call_args_list if "DELETE FROM chunks" in str(c)]
+        assert delete_chunk_calls
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async def test_empty_after_filtering_stores_filtering_hash(self, mock_chunk, mock_conn, indexing_service):
content = "Only boilerplate"
boilerplate_h = paragraph_hash("Only boilerplate")
mock_conn.fetchval = AsyncMock(return_value=10)
mock_conn.fetch = AsyncMock(return_value=[{"paragraph_hash": boilerplate_h, "url_count": 10}])
mock_conn.fetchrow = AsyncMock(return_value=None)
result = await indexing_service.index_page("example.com", "https://example.com/page", "Title", content)
assert result["status"] == "empty"
upsert_calls = [str(c) for c in mock_conn.execute.call_args_list if "website_urls" in str(c).lower()]
assert any("filtering_hash" in c for c in upsert_calls)
async def test_empty_after_filtering_stores_filtering_hash(self, mock_chunk, mock_conn, indexing_service):
content = "Only boilerplate"
boilerplate_h = paragraph_hash("Only boilerplate")
mock_conn.fetchval = AsyncMock(return_value=10)
mock_conn.fetch = AsyncMock(return_value=[{"paragraph_hash": boilerplate_h, "url_count": 10}])
mock_conn.fetchrow = AsyncMock(return_value=None)
result = await indexing_service.index_page("example.com", "https://example.com/page", "Title", content)
assert result["status"] == "empty"
upsert_calls = [str(c) for c in mock_conn.execute.call_args_list if "website_urls" in str(c).lower()]
assert any("filtering_hash" in c for c in upsert_calls)
delete_chunk_calls = [str(c) for c in mock_conn.execute.call_args_list if "DELETE FROM chunks" in str(c)]
assert delete_chunk_calls
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/tests/test_indexing_service.py` around lines 236 - 249, The
test is missing verification that old chunks are removed when filtering yields
an empty page; after calling indexing_service.index_page with content that
filters to empty, assert that chunk deletion ran by checking
mock_chunk.delete_by_paragraph_hash was called with boilerplate_h (or, if
deletions happen via SQL, assert mock_conn.execute.call_args_list contains a
DELETE targeting website_paragraphs/website_paragraph_chunks). Use the existing
boilerplate_h, mock_chunk, indexing_service.index_page and mock_conn.execute
call list to locate where to add the assertion.

Comment on lines +41 to +45
def test_multiple_paragraphs(self):
content = "First paragraph\n\nSecond paragraph\n\nThird paragraph"
result = extract_paragraph_hashes(content)
assert len(result) == 3

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Add CRLF boundary regression tests.

Please add coverage for \r\n\r\n paragraph boundaries so extraction/filtering behavior is stable across newline formats.

🧪 Suggested test additions
 class TestExtractParagraphHashes:
@@
+    def test_multiple_paragraphs_crlf(self):
+        content = "First paragraph\r\n\r\nSecond paragraph\r\n\r\nThird paragraph"
+        result = extract_paragraph_hashes(content)
+        assert len(result) == 3
+
 class TestFilterBoilerplateParagraphs:
@@
+    def test_filtering_with_crlf_boundaries(self):
+        content = "Boilerplate text\r\n\r\nUnique content here"
+        frequencies = {paragraph_hash("Boilerplate text"): 0.8}
+        result = filter_boilerplate_paragraphs(content, frequencies)
+        assert "Boilerplate text" not in result
+        assert "Unique content here" in result

Also applies to: 92-102

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/tests/test_paragraph_dedup.py` around lines 41 - 45, Add
tests that cover CRLF paragraph boundaries for extract_paragraph_hashes by
duplicating the existing newline test cases but using "\r\n\r\n" between
paragraphs (e.g., create content like "First paragraph\r\n\r\nSecond
paragraph\r\n\r\nThird paragraph") and assert the extractor returns the same
number of paragraphs/hashes and same values as the LF-only cases; update both
the test_multiple_paragraphs case and the other test block around lines 92-102
to include CRLF variants to ensure extraction/filtering is stable across newline
formats.

@greptile-apps

greptile-apps Bot commented Feb 28, 2026

Copy link
Copy Markdown

Greptile Summary

This PR implements incremental cross-page paragraph deduplication to filter boilerplate content (navigation bars, footers, CMP vendor lists) that appears on more than 50% of a domain's pages. The implementation introduces a new page_paragraph_hashes database table to track paragraph fingerprints, a paragraph_dedup utility module for hashing and frequency-based filtering, and updates the indexing pipeline with a dual-hash skip check (content hash + filtering hash) to trigger re-indexing when boilerplate detection improves over time. The changes are well-architected with proper handling of edge cases (empty pages after filtering, embedding failures, small domains), comprehensive test coverage including unit and integration tests, and efficient database queries with appropriate indexing.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The implementation is well-designed with proper database schema, comprehensive test coverage (unit tests for all functions, integration tests for real-world scenarios), correct handling of edge cases, and no security vulnerabilities. The dual-hash skip logic ensures pages are re-indexed appropriately, and the frequency-based filtering correctly identifies boilerplate content without affecting unique content.
  • No files require special attention

Important Files Changed

Filename Overview
services/db/init-scripts/04-create-search-database.sql Added page_paragraph_hashes table with proper foreign key constraints and indexing for efficient boilerplate detection queries
services/crawler/app/utils/paragraph_dedup.py New utility module with hash extraction and frequency-based filtering functions, properly handles unicode normalization and within-page deduplication
services/crawler/app/services/indexing_service.py Integrated paragraph deduplication with dual-hash skip logic, properly handles edge cases (empty after filtering, embedding failures), correctly updates tracking tables

Sequence Diagram

sequenceDiagram
    participant Client
    participant IndexingService
    participant ParagraphDedup
    participant Database
    participant EmbeddingService

    Client->>IndexingService: index_page(domain, url, content)
    IndexingService->>IndexingService: compute content_hash
    IndexingService->>ParagraphDedup: extract_paragraph_hashes(content)
    ParagraphDedup-->>IndexingService: paragraph_hashes[]
    
    IndexingService->>Database: DELETE old paragraph_hashes
    IndexingService->>Database: INSERT new paragraph_hashes
    IndexingService->>Database: COUNT total pages in domain
    Database-->>IndexingService: total_pages
    
    alt total_pages >= MIN_DOMAIN_PAGES_FOR_DEDUP
        IndexingService->>Database: SELECT frequency for paragraph_hashes
        Database-->>IndexingService: frequencies{}
        IndexingService->>ParagraphDedup: filter_boilerplate_paragraphs(content, frequencies)
        ParagraphDedup-->>IndexingService: filtered_content
    else domain too small
        IndexingService->>IndexingService: use original content
    end
    
    IndexingService->>IndexingService: compute filtering_hash
    IndexingService->>Database: SELECT existing content_hash, filtering_hash
    Database-->>IndexingService: existing_hashes
    
    alt both hashes match existing
        IndexingService-->>Client: status: skipped
    else hashes differ or no existing
        IndexingService->>IndexingService: chunk_content(filtered)
        alt chunks empty
            IndexingService->>Database: UPSERT website_urls with filtering_hash
            IndexingService-->>Client: status: empty
        else chunks exist
            IndexingService->>EmbeddingService: embed_texts(chunks)
            EmbeddingService-->>IndexingService: embeddings[]
            IndexingService->>Database: BEGIN TRANSACTION
            IndexingService->>Database: UPSERT website_urls
            IndexingService->>Database: DELETE old chunks
            IndexingService->>Database: INSERT new chunks
            IndexingService->>Database: COMMIT
            IndexingService-->>Client: status: indexed
        end
    end
Loading

Last reviewed commit: 3e4ecab

…rity

Crawl4ai emits single-newline-separated markdown, so splitting on
double newlines missed most boilerplate (nav bars, cookie banners,
footers). Switch to line-level hashing with a MIN_LINE_LENGTH threshold
to skip short lines that lack fingerprinting signal. Also cast
filtering_hash to text in the upsert query.
…ratio for dedup

Replace fractional frequency threshold (>50% of pages) with absolute page count
threshold (>5 pages) for boilerplate detection, making filtering behavior more
predictable and independent of domain size.
@larryro larryro merged commit 9eaba4a into main Feb 28, 2026
17 checks passed
@larryro larryro deleted the feat/crawler-paragraph-dedup branch February 28, 2026 12:23
yannickmonney pushed a commit that referenced this pull request Apr 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant