Skip to content

fix(crawler): delete 404/410 pages instead of retrying forever#660

Merged
larryro merged 4 commits into
mainfrom
fix/crawler-delete-404-pages
Mar 4, 2026
Merged

fix(crawler): delete 404/410 pages instead of retrying forever#660
larryro merged 4 commits into
mainfrom
fix/crawler-delete-404-pages

Conversation

@larryro

@larryro larryro commented Mar 4, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Mark 404/410 pages as deleted instead of endlessly incrementing fail_count. Permanent HTTP errors (404 Not Found, 410 Gone) now trigger soft-deletion: chunks and paragraph hashes are removed, content fields are cleared, and the URL status is set to deleted.
  • Mass deletion guard: if >50% of a site's known URLs return 404/410 in a single batch, deletion is blocked and URLs are routed to fail_count increment instead — protecting against false positives from temporary server misconfiguration.
  • Fail count recrawl limit (max_fail_count=10): URLs that have failed 10+ times are excluded from recrawl queries, preventing wasted crawl cycles on permanently broken pages.
  • Exclude deleted URLs from page counts: get_total_count and get_website lateral join now filter out status = 'deleted' so dashboard metrics stay accurate.
  • Add compound index on chunks(domain, url) for efficient bulk deletion.

Test plan

  • New test suite test_scheduler_errors.py covers error classification (404→deleted, 500→fail_count, network failure→fail_count, mixed batches)
  • New test suite test_deleted_url_counts.py covers deleted URL exclusion from counts, soft-deletion content cleanup, fail_count filtering, and idempotency
  • Mass deletion threshold tests verify blocking at >50%, allowing at <=50%, and edge cases (total=0, exact boundary)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added protection mechanism to prevent unintended mass URL deletions with safety thresholds
    • Enhanced classification of permanent HTTP errors (404, 410) from transient failures
  • Improvements

    • Improved URL recrawl decision-making with updated failure tracking
    • Database query performance optimization

larryro added 4 commits March 4, 2026 13:21
URLs returning permanent HTTP errors (404/410) were retried every scan
cycle indefinitely, keeping the progress bar stuck at 99%. Now the
scheduler classifies these as permanent errors and marks them deleted,
while transient errors (5xx, network failures) continue with the
existing retry behavior.

Changes:
- Split HTTP error handling: 404/410 → mark_urls_deleted, others → increment_fail_count
- Exclude deleted URLs from page count queries (get_website, get_total_count)
- Clean up chunks and paragraph hashes when marking URLs as deleted
- Resurrect deleted URLs if re-discovered in sitemap (handles false-positive 404s)
… cleanup

Use ANY($2) batch queries in a transaction instead of executemany loops,
clear all content fields on soft delete, switch discovered URL upsert to
DO NOTHING to prevent re-discovery of deleted URLs, and cap deletion log
to 5 sample URLs.
Prevent catastrophic data loss when a site temporarily 404s all pages
by blocking deletion when >50% of known URLs are gone in a single scan.
URLs exceeding 10 consecutive failures are excluded from recrawl queues.
Extract identical _run_scan method from TestSchedulerErrorClassification
and TestMassDeletionThreshold into a shared module-level function.

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@coderabbitai

coderabbitai Bot commented Mar 4, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

This PR implements soft deletion of URLs and introduces mass-deletion safety checks. The main changes include: updating pg_website_store.py to accept a max_fail_count parameter in get_urls_needing_recrawl, implementing a multi-step transactional deletion in mark_urls_deleted that clears related chunks and page_paragraph_hashes, and modifying count queries to exclude deleted URLs. In scheduler.py, new constants define permanent HTTP errors (404, 410) and a max deletion ratio threshold (0.5), and the crawl results handler now classifies responses into gone_urls, transient_error_urls, and network_failed_urls, with a safety check that prevents mass deletions exceeding the threshold. New test modules validate deleted URL exclusion from counts and comprehensive HTTP error classification scenarios. A new index is added to the chunks table for domain and url columns.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 17.65% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely describes the main change: permanent HTTP errors (404/410) are now marked for deletion instead of being retried indefinitely.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/crawler-delete-404-pages

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@services/crawler/app/services/scheduler.py`:
- Around line 330-332: The deletion-guard uses site_store.get_total_count(),
which only counts rows with content_hash IS NOT NULL and underestimates "known"
URLs; replace that usage with a new method (e.g., get_known_url_count) that
returns COUNT(*) WHERE domain = $1 AND status != 'deleted' and update
scheduler.py to call site_store.get_known_url_count() when computing ratio
(instead of get_total_count()); implement get_known_url_count in
pg_website_store.py using acquire_with_retry on the connection pool and fetchval
with the suggested WHERE clause so the mass-deletion ratio denominator correctly
reflects known URLs.

In `@services/crawler/tests/test_scheduler_errors.py`:
- Around line 84-87: Tests patch _bulk_head_check but not _seed_cache_headers,
so _scan_website may still perform real HTTP HEAD requests via
httpx.AsyncClient.head; patch _seed_cache_headers (or the httpx AsyncClient.head
it uses) in the test to prevent live network calls. Specifically, in the test
where you patch "_bulk_head_check" and call _scan_website, also patch
"app.services.scheduler._seed_cache_headers" (or patch "httpx.AsyncClient.head"
used by _seed_cache_headers) with an AsyncMock that returns a safe/fake result
so _scan_website never performs real HTTP requests.

ℹ️ Review info
Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 92f02b00-9249-4787-a689-d5b9c26f19f1

📥 Commits

Reviewing files that changed from the base of the PR and between d9dbc20 and c6293da.

📒 Files selected for processing (5)
  • services/crawler/app/services/pg_website_store.py
  • services/crawler/app/services/scheduler.py
  • services/crawler/tests/test_deleted_url_counts.py
  • services/crawler/tests/test_scheduler_errors.py
  • services/db/init-scripts/03-create-knowledge-database.sql

Comment on lines +330 to +332
total_count = await site_store.get_total_count()
ratio = len(gone_urls) / total_count if total_count > 0 else 0.0
if total_count > 0 and ratio > _MAX_DELETION_RATIO:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Mass-deletion ratio denominator excludes discovered URLs, making the guard too strict.

This currently uses get_total_count(), which (in services/crawler/app/services/pg_website_store.py, Lines 220-221) counts only rows with content_hash IS NOT NULL. That undercounts “known URLs” and can incorrectly block valid deletions.

🔧 Proposed fix
-                total_count = await site_store.get_total_count()
+                total_count = await site_store.get_known_url_count()
# services/crawler/app/services/pg_website_store.py
async def get_known_url_count(self) -> int:
    async with acquire_with_retry(self._pool) as conn:
        return await conn.fetchval(
            """SELECT COUNT(*) FROM website_urls
               WHERE domain = $1 AND status != 'deleted'""",
            self._domain,
        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/services/scheduler.py` around lines 330 - 332, The
deletion-guard uses site_store.get_total_count(), which only counts rows with
content_hash IS NOT NULL and underestimates "known" URLs; replace that usage
with a new method (e.g., get_known_url_count) that returns COUNT(*) WHERE domain
= $1 AND status != 'deleted' and update scheduler.py to call
site_store.get_known_url_count() when computing ratio (instead of
get_total_count()); implement get_known_url_count in pg_website_store.py using
acquire_with_retry on the connection pool and fetchval with the suggested WHERE
clause so the mass-deletion ratio denominator correctly reflects known URLs.

Comment on lines +84 to +87
with patch("app.services.scheduler._bulk_head_check", new_callable=AsyncMock) as mock_head:
mock_head.return_value = ([], urls_to_crawl, set())
await _scan_website("example.com", store_manager, crawler_service)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Patch _seed_cache_headers too; current helper can still make live HTTP calls.

When crawl results include content, _scan_website can call _seed_cache_headers, which uses httpx.AsyncClient.head. That makes these tests flaky/slow and environment-dependent.

🔧 Proposed fix
-    with patch("app.services.scheduler._bulk_head_check", new_callable=AsyncMock) as mock_head:
+    with patch("app.services.scheduler._bulk_head_check", new_callable=AsyncMock) as mock_head, patch(
+        "app.services.scheduler._seed_cache_headers", new_callable=AsyncMock
+    ):
         mock_head.return_value = ([], urls_to_crawl, set())
         await _scan_website("example.com", store_manager, crawler_service)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
with patch("app.services.scheduler._bulk_head_check", new_callable=AsyncMock) as mock_head:
mock_head.return_value = ([], urls_to_crawl, set())
await _scan_website("example.com", store_manager, crawler_service)
with patch("app.services.scheduler._bulk_head_check", new_callable=AsyncMock) as mock_head, patch(
"app.services.scheduler._seed_cache_headers", new_callable=AsyncMock
):
mock_head.return_value = ([], urls_to_crawl, set())
await _scan_website("example.com", store_manager, crawler_service)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/tests/test_scheduler_errors.py` around lines 84 - 87, Tests
patch _bulk_head_check but not _seed_cache_headers, so _scan_website may still
perform real HTTP HEAD requests via httpx.AsyncClient.head; patch
_seed_cache_headers (or the httpx AsyncClient.head it uses) in the test to
prevent live network calls. Specifically, in the test where you patch
"_bulk_head_check" and call _scan_website, also patch
"app.services.scheduler._seed_cache_headers" (or patch "httpx.AsyncClient.head"
used by _seed_cache_headers) with an AsyncMock that returns a safe/fake result
so _scan_website never performs real HTTP requests.

@larryro larryro merged commit 4514360 into main Mar 4, 2026
16 checks passed
@larryro larryro deleted the fix/crawler-delete-404-pages branch March 4, 2026 06:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant