feat(crawler,db): add cross-page paragraph deduplication for boilerplate filtering by larryro · Pull Request #604 · tale-project/tale

larryro · 2026-02-28T11:02:36Z

Summary

Add incremental cross-page paragraph deduplication to the crawler indexing pipeline, filtering boilerplate content (nav bars, footers, CMP vendor lists) that appears on >50% of a domain's pages before chunking
Introduce a new page_paragraph_hashes table to track paragraph fingerprints per page and a paragraph_dedup utility module for hashing and frequency-based filtering
Update the indexing service to use a dual-hash skip check (content hash + filtering hash) so pages are re-indexed when boilerplate detection improves over time

Test plan

Verify paragraph_dedup.py unit tests pass (hashing, extraction, filtering, boundary cases)
Verify test_indexing_service.py tests pass (skip logic, hash tracking, frequency filtering, empty-after-filtering)
Verify integration scenarios: shared boilerplate filtered, unique content preserved, small domains unaffected
Confirm the page_paragraph_hashes table is created correctly via the DB init script

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Implemented cross-page boilerplate detection and filtering to identify and remove duplicate content that appears frequently across pages within a domain, improving content quality and search result relevance.
Tests
- Added comprehensive test coverage for the new boilerplate filtering and paragraph deduplication features, including edge cases and integration scenarios.
Chores
- Updated database schema to support paragraph hash tracking for boilerplate detection across pages.

…ate filtering Track paragraph fingerprints per page in a new page_paragraph_hashes table and filter paragraphs appearing on >50% of a domain's pages before chunking. This removes repeated boilerplate (nav bars, footers, CMP vendor lists) from search chunks, improving retrieval quality.

coderabbitai · 2026-02-28T11:08:51Z

📝 Walkthrough

Walkthrough

This pull request introduces cross-page paragraph deduplication for boilerplate filtering in the crawler indexing pipeline. A new utility module provides paragraph hashing, content deduplication, and frequency-based boilerplate filtering functions. The indexing service integrates these utilities to compute and track paragraph hashes per page in a new database table, calculate cross-page frequencies when minimum thresholds are met, filter boilerplate paragraphs from content, and store filtering hashes. Indexing is skipped when both content and filtering hashes match existing stored values; otherwise, only filtered content is chunked, embedded, and indexed in a single transaction.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

feat: migrate website search and embeddings to crawler service #595: Establishes the foundational crawler indexing pipeline and search database schema that this PR extends with cross-page paragraph deduplication and the new page_paragraph_hashes table.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 13.73% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: adding cross-page paragraph deduplication for boilerplate filtering across crawler and database components.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feat/crawler-paragraph-dedup

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@services/crawler/app/services/indexing_service.py`:
- Around line 57-67: The delete-then-insert sequence operating on
page_paragraph_hashes must be run inside a DB transaction so the refresh is
atomic; update the block using the existing acquire_with_retry(conn) session to
open a transaction (e.g., conn.transaction() or the library's transaction
context) and perform the DELETE and conditional INSERT (executemany) within that
single transaction, ensuring you still handle the case where hashes is empty by
at least running the DELETE inside the transaction. Reference:
acquire_with_retry, page_paragraph_hashes, and the executemany/execute calls
shown.
- Around line 69-73: The current total_pages calculation counts only URLs
present in page_paragraph_hashes, excluding pages with zero extracted paragraphs
and thus inflating frequency rates; update the query that sets total_pages (used
with MIN_DOMAIN_PAGES_FOR_DEDUP and hashes) to count all domain pages
instead—either by querying the pages table for the domain (e.g., SELECT COUNT(*)
FROM pages WHERE domain = $1) or by using a LEFT JOIN/COUNT DISTINCT on the
pages table to include pages with zero paragraphs—so the denominator reflects “%
of domain pages” as intended.
- Around line 100-103: When chunks is empty the current early return leaves
previously indexed chunks around; modify the empty-branch inside the same method
(the block that uses acquire_with_retry(self._pool) and calls
_UPSERT_WEBSITE_URL) to also remove any prior chunks for this URL before
returning: execute a DELETE for the URL via the DB connection (e.g.
conn.execute(_DELETE_CHUNKS_FOR_URL, url) — create that SQL constant if missing)
and invoke the vector/embedding cleanup on the in-memory/indexed store if
applicable (e.g. self._vector_store.delete_documents or
delete_by_namespace(url)); keep the existing upsert of the website metadata and
then return the same {"url": url, "status": "empty", "chunks_indexed": 0}.

In `@services/crawler/app/utils/paragraph_dedup.py`:
- Around line 29-30: Normalize newlines in the input before splitting into
paragraphs: replace CRLF (\r\n) and lone CR (\r) with LF (\n) on the variable
used as the source (the `content` string) so both the loop using
`content.strip().split("\n\n")` and the later similar block (lines around where
`para` is iterated) operate on consistent newline characters; then split on
one-or-more blank lines (i.e., treat consecutive newlines as paragraph
separators) to avoid missing paragraphs on CRLF pages and preserve the existing
`para.strip()` checks.

In `@services/crawler/tests/test_indexing_service.py`:
- Around line 236-249: The test is missing verification that old chunks are
removed when filtering yields an empty page; after calling
indexing_service.index_page with content that filters to empty, assert that
chunk deletion ran by checking mock_chunk.delete_by_paragraph_hash was called
with boilerplate_h (or, if deletions happen via SQL, assert
mock_conn.execute.call_args_list contains a DELETE targeting
website_paragraphs/website_paragraph_chunks). Use the existing boilerplate_h,
mock_chunk, indexing_service.index_page and mock_conn.execute call list to
locate where to add the assertion.

In `@services/crawler/tests/test_paragraph_dedup.py`:
- Around line 41-45: Add tests that cover CRLF paragraph boundaries for
extract_paragraph_hashes by duplicating the existing newline test cases but
using "\r\n\r\n" between paragraphs (e.g., create content like "First
paragraph\r\n\r\nSecond paragraph\r\n\r\nThird paragraph") and assert the
extractor returns the same number of paragraphs/hashes and same values as the
LF-only cases; update both the test_multiple_paragraphs case and the other test
block around lines 92-102 to include CRLF variants to ensure
extraction/filtering is stable across newline formats.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b6d1468 and 3e4ecab.

📒 Files selected for processing (5)

services/crawler/app/services/indexing_service.py
services/crawler/app/utils/paragraph_dedup.py
services/crawler/tests/test_indexing_service.py
services/crawler/tests/test_paragraph_dedup.py
services/db/init-scripts/04-create-search-database.sql

coderabbitai · 2026-02-28T11:08:54Z

        async with acquire_with_retry(self._pool) as conn:
-            existing_hash = await conn.fetchval("SELECT content_hash FROM chunks WHERE url = $1 LIMIT 1", url)
-            if existing_hash == content_hash:
-                return {"url": url, "status": "skipped", "chunks_indexed": 0}
+            await conn.execute(
+                "DELETE FROM page_paragraph_hashes WHERE domain = $1 AND url = $2",
+                domain,
+                url,
+            )
+            if hashes:
+                await conn.executemany(
+                    "INSERT INTO page_paragraph_hashes (domain, url, paragraph_hash) VALUES ($1, $2, $3)",
+                    [(domain, url, h) for h in hashes],
+                )


⚠️ Potential issue | 🟠 Major

Make paragraph-hash refresh atomic.

Line 58 deletes old hashes before Line 64 inserts replacements, but this runs outside a transaction. A mid-sequence failure can leave a page with no hashes and corrupt subsequent frequency calculations.

🔧 Proposed fix

- async with acquire_with_retry(self._pool) as conn: + async with acquire_with_retry(self._pool) as conn, conn.transaction(): await conn.execute( "DELETE FROM page_paragraph_hashes WHERE domain = $1 AND url = $2", domain, url, ) if hashes: await conn.executemany( "INSERT INTO page_paragraph_hashes (domain, url, paragraph_hash) VALUES ($1, $2, $3)", [(domain, url, h) for h in hashes], )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/crawler/app/services/indexing_service.py` around lines 57 - 67, The delete-then-insert sequence operating on page_paragraph_hashes must be run inside a DB transaction so the refresh is atomic; update the block using the existing acquire_with_retry(conn) session to open a transaction (e.g., conn.transaction() or the library's transaction context) and perform the DELETE and conditional INSERT (executemany) within that single transaction, ensuring you still handle the case where hashes is empty by at least running the DELETE inside the transaction. Reference: acquire_with_retry, page_paragraph_hashes, and the executemany/execute calls shown.

coderabbitai · 2026-02-28T11:08:54Z

+            total_pages = await conn.fetchval(
+                "SELECT COUNT(DISTINCT url) FROM page_paragraph_hashes WHERE domain = $1",
+                domain,
+            )
+            if total_pages >= MIN_DOMAIN_PAGES_FOR_DEDUP and hashes:


⚠️ Potential issue | 🟠 Major

Use domain page count as the frequency denominator.

Line 69 currently counts only URLs present in page_paragraph_hashes. Pages with zero extracted paragraphs are excluded, which inflates frequencies and can over-filter content versus the “% of domain pages” objective.

🔧 Proposed fix

- total_pages = await conn.fetchval( - "SELECT COUNT(DISTINCT url) FROM page_paragraph_hashes WHERE domain = $1", - domain, - ) + total_pages = await conn.fetchval( + "SELECT COUNT(*) FROM website_urls WHERE domain = $1 AND content IS NOT NULL", + domain, + )

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/crawler/app/services/indexing_service.py` around lines 69 - 73, The current total_pages calculation counts only URLs present in page_paragraph_hashes, excluding pages with zero extracted paragraphs and thus inflating frequency rates; update the query that sets total_pages (used with MIN_DOMAIN_PAGES_FOR_DEDUP and hashes) to count all domain pages instead—either by querying the pages table for the domain (e.g., SELECT COUNT(*) FROM pages WHERE domain = $1) or by using a LEFT JOIN/COUNT DISTINCT on the pages table to include pages with zero paragraphs—so the denominator reflects “% of domain pages” as intended.

coderabbitai · 2026-02-28T11:08:54Z

        if not chunks:
+            async with acquire_with_retry(self._pool) as conn:
+                await conn.execute(_UPSERT_WEBSITE_URL, domain, url, title, content_hash, filtered_hash)
            return {"url": url, "status": "empty", "chunks_indexed": 0}


⚠️ Potential issue | 🔴 Critical

Delete stale chunks when filtered content becomes empty.

When chunks is empty, Line 103 returns without removing previously indexed chunks for that URL. This leaves stale content in search even though current filtered content is empty.

🔧 Proposed fix

if not chunks: - async with acquire_with_retry(self._pool) as conn: + async with acquire_with_retry(self._pool) as conn, conn.transaction(): await conn.execute(_UPSERT_WEBSITE_URL, domain, url, title, content_hash, filtered_hash) + await conn.execute("DELETE FROM chunks WHERE domain = $1 AND url = $2", domain, url) return {"url": url, "status": "empty", "chunks_indexed": 0}

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if not chunks:

async with acquire_with_retry(self._pool) as conn:

await conn.execute(_UPSERT_WEBSITE_URL, domain, url, title, content_hash, filtered_hash)

return {"url": url, "status": "empty", "chunks_indexed": 0}

if not chunks:

async with acquire_with_retry(self._pool) as conn, conn.transaction():

await conn.execute(_UPSERT_WEBSITE_URL, domain, url, title, content_hash, filtered_hash)

await conn.execute("DELETE FROM chunks WHERE domain = $1 AND url = $2", domain, url)

return {"url": url, "status": "empty", "chunks_indexed": 0}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/crawler/app/services/indexing_service.py` around lines 100 - 103, When chunks is empty the current early return leaves previously indexed chunks around; modify the empty-branch inside the same method (the block that uses acquire_with_retry(self._pool) and calls _UPSERT_WEBSITE_URL) to also remove any prior chunks for this URL before returning: execute a DELETE for the URL via the DB connection (e.g. conn.execute(_DELETE_CHUNKS_FOR_URL, url) — create that SQL constant if missing) and invoke the vector/embedding cleanup on the in-memory/indexed store if applicable (e.g. self._vector_store.delete_documents or delete_by_namespace(url)); keep the existing upsert of the website metadata and then return the same {"url": url, "status": "empty", "chunks_indexed": 0}.

coderabbitai · 2026-02-28T11:08:54Z

+    for para in content.strip().split("\n\n"):
+        if not para.strip():


⚠️ Potential issue | 🟠 Major

Normalize newlines before paragraph splitting.

Line 29 and Line 57 split only on "\n\n", which breaks for CRLF content and causes missed dedup/filtering on pages using \r\n boundaries.

💡 Proposed fix

import hashlib +import re import unicodedata BOILERPLATE_FREQUENCY_THRESHOLD = 0.5 MIN_DOMAIN_PAGES_FOR_DEDUP = 5 +_PARAGRAPH_SPLIT_RE = re.compile(r"\n\s*\n") + + +def _split_paragraphs(content: str) -> list[str]: + normalized = content.replace("\r\n", "\n").strip() + if not normalized: + return [] + return [p for p in _PARAGRAPH_SPLIT_RE.split(normalized) if p.strip()] @@ def extract_paragraph_hashes(content: str) -> list[str]: @@ - for para in content.strip().split("\n\n"): - if not para.strip(): - continue + for para in _split_paragraphs(content): h = paragraph_hash(para) if h not in seen: seen.add(h) result.append(h) @@ def filter_boilerplate_paragraphs( @@ - paragraphs = content.strip().split("\n\n") + paragraphs = _split_paragraphs(content) kept = [p for p in paragraphs if frequencies.get(paragraph_hash(p), 0.0) <= threshold] return "\n\n".join(kept)

Also applies to: 57-59

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/crawler/app/utils/paragraph_dedup.py` around lines 29 - 30, Normalize newlines in the input before splitting into paragraphs: replace CRLF (\r\n) and lone CR (\r) with LF (\n) on the variable used as the source (the `content` string) so both the loop using `content.strip().split("\n\n")` and the later similar block (lines around where `para` is iterated) operate on consistent newline characters; then split on one-or-more blank lines (i.e., treat consecutive newlines as paragraph separators) to avoid missing paragraphs on CRLF pages and preserve the existing `para.strip()` checks.

coderabbitai · 2026-02-28T11:08:54Z

+    async def test_empty_after_filtering_stores_filtering_hash(self, mock_chunk, mock_conn, indexing_service):
+        content = "Only boilerplate"
+        boilerplate_h = paragraph_hash("Only boilerplate")
+
+        mock_conn.fetchval = AsyncMock(return_value=10)
+        mock_conn.fetch = AsyncMock(return_value=[{"paragraph_hash": boilerplate_h, "url_count": 10}])
+        mock_conn.fetchrow = AsyncMock(return_value=None)
+
+        result = await indexing_service.index_page("example.com", "https://example.com/page", "Title", content)
+
+        assert result["status"] == "empty"
+        upsert_calls = [str(c) for c in mock_conn.execute.call_args_list if "website_urls" in str(c).lower()]
+        assert any("filtering_hash" in c for c in upsert_calls)



🧹 Nitpick | 🔵 Trivial

Assert chunk cleanup on the empty path.

This test should also verify that old chunks are deleted when filtering removes all content, to prevent stale-index regressions.

🧪 Suggested assertion

async def test_empty_after_filtering_stores_filtering_hash(self, mock_chunk, mock_conn, indexing_service): @@ assert result["status"] == "empty" upsert_calls = [str(c) for c in mock_conn.execute.call_args_list if "website_urls" in str(c).lower()] assert any("filtering_hash" in c for c in upsert_calls) + delete_chunk_calls = [str(c) for c in mock_conn.execute.call_args_list if "DELETE FROM chunks" in str(c)] + assert delete_chunk_calls

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

async def test_empty_after_filtering_stores_filtering_hash(self, mock_chunk, mock_conn, indexing_service):

content = "Only boilerplate"

boilerplate_h = paragraph_hash("Only boilerplate")

mock_conn.fetchval = AsyncMock(return_value=10)

mock_conn.fetch = AsyncMock(return_value=[{"paragraph_hash": boilerplate_h, "url_count": 10}])

mock_conn.fetchrow = AsyncMock(return_value=None)

result = await indexing_service.index_page("example.com", "https://example.com/page", "Title", content)

assert result["status"] == "empty"

upsert_calls = [str(c) for c in mock_conn.execute.call_args_list if "website_urls" in str(c).lower()]

assert any("filtering_hash" in c for c in upsert_calls)

async def test_empty_after_filtering_stores_filtering_hash(self, mock_chunk, mock_conn, indexing_service):

content = "Only boilerplate"

boilerplate_h = paragraph_hash("Only boilerplate")

mock_conn.fetchval = AsyncMock(return_value=10)

mock_conn.fetch = AsyncMock(return_value=[{"paragraph_hash": boilerplate_h, "url_count": 10}])

mock_conn.fetchrow = AsyncMock(return_value=None)

result = await indexing_service.index_page("example.com", "https://example.com/page", "Title", content)

assert result["status"] == "empty"

upsert_calls = [str(c) for c in mock_conn.execute.call_args_list if "website_urls" in str(c).lower()]

assert any("filtering_hash" in c for c in upsert_calls)

delete_chunk_calls = [str(c) for c in mock_conn.execute.call_args_list if "DELETE FROM chunks" in str(c)]

assert delete_chunk_calls

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/crawler/tests/test_indexing_service.py` around lines 236 - 249, The test is missing verification that old chunks are removed when filtering yields an empty page; after calling indexing_service.index_page with content that filters to empty, assert that chunk deletion ran by checking mock_chunk.delete_by_paragraph_hash was called with boilerplate_h (or, if deletions happen via SQL, assert mock_conn.execute.call_args_list contains a DELETE targeting website_paragraphs/website_paragraph_chunks). Use the existing boilerplate_h, mock_chunk, indexing_service.index_page and mock_conn.execute call list to locate where to add the assertion.

coderabbitai · 2026-02-28T11:08:54Z

+    def test_multiple_paragraphs(self):
+        content = "First paragraph\n\nSecond paragraph\n\nThird paragraph"
+        result = extract_paragraph_hashes(content)
+        assert len(result) == 3
+


🧹 Nitpick | 🔵 Trivial

Add CRLF boundary regression tests.

Please add coverage for \r\n\r\n paragraph boundaries so extraction/filtering behavior is stable across newline formats.

🧪 Suggested test additions

class TestExtractParagraphHashes: @@ + def test_multiple_paragraphs_crlf(self): + content = "First paragraph\r\n\r\nSecond paragraph\r\n\r\nThird paragraph" + result = extract_paragraph_hashes(content) + assert len(result) == 3 + class TestFilterBoilerplateParagraphs: @@ + def test_filtering_with_crlf_boundaries(self): + content = "Boilerplate text\r\n\r\nUnique content here" + frequencies = {paragraph_hash("Boilerplate text"): 0.8} + result = filter_boilerplate_paragraphs(content, frequencies) + assert "Boilerplate text" not in result + assert "Unique content here" in result

Also applies to: 92-102

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@services/crawler/tests/test_paragraph_dedup.py` around lines 41 - 45, Add tests that cover CRLF paragraph boundaries for extract_paragraph_hashes by duplicating the existing newline test cases but using "\r\n\r\n" between paragraphs (e.g., create content like "First paragraph\r\n\r\nSecond paragraph\r\n\r\nThird paragraph") and assert the extractor returns the same number of paragraphs/hashes and same values as the LF-only cases; update both the test_multiple_paragraphs case and the other test block around lines 92-102 to include CRLF variants to ensure extraction/filtering is stable across newline formats.

greptile-apps · 2026-02-28T11:11:53Z

Greptile Summary

This PR implements incremental cross-page paragraph deduplication to filter boilerplate content (navigation bars, footers, CMP vendor lists) that appears on more than 50% of a domain's pages. The implementation introduces a new page_paragraph_hashes database table to track paragraph fingerprints, a paragraph_dedup utility module for hashing and frequency-based filtering, and updates the indexing pipeline with a dual-hash skip check (content hash + filtering hash) to trigger re-indexing when boilerplate detection improves over time. The changes are well-architected with proper handling of edge cases (empty pages after filtering, embedding failures, small domains), comprehensive test coverage including unit and integration tests, and efficient database queries with appropriate indexing.

Confidence Score: 5/5

This PR is safe to merge with minimal risk
The implementation is well-designed with proper database schema, comprehensive test coverage (unit tests for all functions, integration tests for real-world scenarios), correct handling of edge cases, and no security vulnerabilities. The dual-hash skip logic ensures pages are re-indexed appropriately, and the frequency-based filtering correctly identifies boilerplate content without affecting unique content.
No files require special attention

Important Files Changed

Filename	Overview
services/db/init-scripts/04-create-search-database.sql	Added `page_paragraph_hashes` table with proper foreign key constraints and indexing for efficient boilerplate detection queries
services/crawler/app/utils/paragraph_dedup.py	New utility module with hash extraction and frequency-based filtering functions, properly handles unicode normalization and within-page deduplication
services/crawler/app/services/indexing_service.py	Integrated paragraph deduplication with dual-hash skip logic, properly handles edge cases (empty after filtering, embedding failures), correctly updates tracking tables

Sequence Diagram

sequenceDiagram
    participant Client
    participant IndexingService
    participant ParagraphDedup
    participant Database
    participant EmbeddingService

    Client->>IndexingService: index_page(domain, url, content)
    IndexingService->>IndexingService: compute content_hash
    IndexingService->>ParagraphDedup: extract_paragraph_hashes(content)
    ParagraphDedup-->>IndexingService: paragraph_hashes[]
    
    IndexingService->>Database: DELETE old paragraph_hashes
    IndexingService->>Database: INSERT new paragraph_hashes
    IndexingService->>Database: COUNT total pages in domain
    Database-->>IndexingService: total_pages
    
    alt total_pages >= MIN_DOMAIN_PAGES_FOR_DEDUP
        IndexingService->>Database: SELECT frequency for paragraph_hashes
        Database-->>IndexingService: frequencies{}
        IndexingService->>ParagraphDedup: filter_boilerplate_paragraphs(content, frequencies)
        ParagraphDedup-->>IndexingService: filtered_content
    else domain too small
        IndexingService->>IndexingService: use original content
    end
    
    IndexingService->>IndexingService: compute filtering_hash
    IndexingService->>Database: SELECT existing content_hash, filtering_hash
    Database-->>IndexingService: existing_hashes
    
    alt both hashes match existing
        IndexingService-->>Client: status: skipped
    else hashes differ or no existing
        IndexingService->>IndexingService: chunk_content(filtered)
        alt chunks empty
            IndexingService->>Database: UPSERT website_urls with filtering_hash
            IndexingService-->>Client: status: empty
        else chunks exist
            IndexingService->>EmbeddingService: embed_texts(chunks)
            EmbeddingService-->>IndexingService: embeddings[]
            IndexingService->>Database: BEGIN TRANSACTION
            IndexingService->>Database: UPSERT website_urls
            IndexingService->>Database: DELETE old chunks
            IndexingService->>Database: INSERT new chunks
            IndexingService->>Database: COMMIT
            IndexingService-->>Client: status: indexed
        end
    end

_{Last reviewed commit: 3e4ecab}

…rity Crawl4ai emits single-newline-separated markdown, so splitting on double newlines missed most boilerplate (nav bars, cookie banners, footers). Switch to line-level hashing with a MIN_LINE_LENGTH threshold to skip short lines that lack fingerprinting signal. Also cast filtering_hash to text in the upsert query.

…ratio for dedup Replace fractional frequency threshold (>50% of pages) with absolute page count threshold (>5 pages) for boilerplate detection, making filtering behavior more predictable and independent of domain size.

…ate filtering (#604)

coderabbitai Bot requested changes Feb 28, 2026

View reviewed changes

larryro added 2 commits February 28, 2026 19:33

larryro merged commit 9eaba4a into main Feb 28, 2026
17 checks passed

larryro deleted the feat/crawler-paragraph-dedup branch February 28, 2026 12:23

coderabbitai Bot mentioned this pull request Mar 4, 2026

fix(crawler): delete 404/410 pages instead of retrying forever #660

Merged

3 tasks

yannickmonney pushed a commit that referenced this pull request Apr 8, 2026

feat(crawler,db): add cross-page paragraph deduplication for boilerpl…

13ddc70

…ate filtering (#604)

coderabbitai Bot mentioned this pull request Apr 24, 2026

refactor(db): split service migrations and add chunk overlap fields #1615

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(crawler,db): add cross-page paragraph deduplication for boilerplate filtering#604

feat(crawler,db): add cross-page paragraph deduplication for boilerplate filtering#604
larryro merged 3 commits into
mainfrom
feat/crawler-paragraph-dedup

larryro commented Feb 28, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Feb 28, 2026

Walkthrough

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Feb 28, 2026

Uh oh!

coderabbitai Bot Feb 28, 2026

Uh oh!

coderabbitai Bot Feb 28, 2026

Uh oh!

coderabbitai Bot Feb 28, 2026

Uh oh!

coderabbitai Bot Feb 28, 2026

Uh oh!

coderabbitai Bot Feb 28, 2026

Uh oh!

greptile-apps Bot commented Feb 28, 2026

Important Files Changed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		for para in content.strip().split("\n\n"):
		if not para.strip():

Conversation

larryro commented Feb 28, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Feb 28, 2026

Walkthrough

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented Feb 28, 2026

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

larryro commented Feb 28, 2026 •

edited by coderabbitai Bot

Loading