fix(crawler): delete 404/410 pages instead of retrying forever#660
Conversation
URLs returning permanent HTTP errors (404/410) were retried every scan cycle indefinitely, keeping the progress bar stuck at 99%. Now the scheduler classifies these as permanent errors and marks them deleted, while transient errors (5xx, network failures) continue with the existing retry behavior. Changes: - Split HTTP error handling: 404/410 → mark_urls_deleted, others → increment_fail_count - Exclude deleted URLs from page count queries (get_website, get_total_count) - Clean up chunks and paragraph hashes when marking URLs as deleted - Resurrect deleted URLs if re-discovered in sitemap (handles false-positive 404s)
… cleanup Use ANY($2) batch queries in a transaction instead of executemany loops, clear all content fields on soft delete, switch discovered URL upsert to DO NOTHING to prevent re-discovery of deleted URLs, and cap deletion log to 5 sample URLs.
Prevent catastrophic data loss when a site temporarily 404s all pages by blocking deletion when >50% of known URLs are gone in a single scan. URLs exceeding 10 consecutive failures are excluded from recrawl queues.
Extract identical _run_scan method from TestSchedulerErrorClassification and TestMassDeletionThreshold into a shared module-level function.
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
📝 WalkthroughWalkthroughThis PR implements soft deletion of URLs and introduces mass-deletion safety checks. The main changes include: updating Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@services/crawler/app/services/scheduler.py`:
- Around line 330-332: The deletion-guard uses site_store.get_total_count(),
which only counts rows with content_hash IS NOT NULL and underestimates "known"
URLs; replace that usage with a new method (e.g., get_known_url_count) that
returns COUNT(*) WHERE domain = $1 AND status != 'deleted' and update
scheduler.py to call site_store.get_known_url_count() when computing ratio
(instead of get_total_count()); implement get_known_url_count in
pg_website_store.py using acquire_with_retry on the connection pool and fetchval
with the suggested WHERE clause so the mass-deletion ratio denominator correctly
reflects known URLs.
In `@services/crawler/tests/test_scheduler_errors.py`:
- Around line 84-87: Tests patch _bulk_head_check but not _seed_cache_headers,
so _scan_website may still perform real HTTP HEAD requests via
httpx.AsyncClient.head; patch _seed_cache_headers (or the httpx AsyncClient.head
it uses) in the test to prevent live network calls. Specifically, in the test
where you patch "_bulk_head_check" and call _scan_website, also patch
"app.services.scheduler._seed_cache_headers" (or patch "httpx.AsyncClient.head"
used by _seed_cache_headers) with an AsyncMock that returns a safe/fake result
so _scan_website never performs real HTTP requests.
ℹ️ Review info
Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 92f02b00-9249-4787-a689-d5b9c26f19f1
📒 Files selected for processing (5)
services/crawler/app/services/pg_website_store.pyservices/crawler/app/services/scheduler.pyservices/crawler/tests/test_deleted_url_counts.pyservices/crawler/tests/test_scheduler_errors.pyservices/db/init-scripts/03-create-knowledge-database.sql
| total_count = await site_store.get_total_count() | ||
| ratio = len(gone_urls) / total_count if total_count > 0 else 0.0 | ||
| if total_count > 0 and ratio > _MAX_DELETION_RATIO: |
There was a problem hiding this comment.
Mass-deletion ratio denominator excludes discovered URLs, making the guard too strict.
This currently uses get_total_count(), which (in services/crawler/app/services/pg_website_store.py, Lines 220-221) counts only rows with content_hash IS NOT NULL. That undercounts “known URLs” and can incorrectly block valid deletions.
🔧 Proposed fix
- total_count = await site_store.get_total_count()
+ total_count = await site_store.get_known_url_count()# services/crawler/app/services/pg_website_store.py
async def get_known_url_count(self) -> int:
async with acquire_with_retry(self._pool) as conn:
return await conn.fetchval(
"""SELECT COUNT(*) FROM website_urls
WHERE domain = $1 AND status != 'deleted'""",
self._domain,
)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/app/services/scheduler.py` around lines 330 - 332, The
deletion-guard uses site_store.get_total_count(), which only counts rows with
content_hash IS NOT NULL and underestimates "known" URLs; replace that usage
with a new method (e.g., get_known_url_count) that returns COUNT(*) WHERE domain
= $1 AND status != 'deleted' and update scheduler.py to call
site_store.get_known_url_count() when computing ratio (instead of
get_total_count()); implement get_known_url_count in pg_website_store.py using
acquire_with_retry on the connection pool and fetchval with the suggested WHERE
clause so the mass-deletion ratio denominator correctly reflects known URLs.
| with patch("app.services.scheduler._bulk_head_check", new_callable=AsyncMock) as mock_head: | ||
| mock_head.return_value = ([], urls_to_crawl, set()) | ||
| await _scan_website("example.com", store_manager, crawler_service) | ||
|
|
There was a problem hiding this comment.
Patch _seed_cache_headers too; current helper can still make live HTTP calls.
When crawl results include content, _scan_website can call _seed_cache_headers, which uses httpx.AsyncClient.head. That makes these tests flaky/slow and environment-dependent.
🔧 Proposed fix
- with patch("app.services.scheduler._bulk_head_check", new_callable=AsyncMock) as mock_head:
+ with patch("app.services.scheduler._bulk_head_check", new_callable=AsyncMock) as mock_head, patch(
+ "app.services.scheduler._seed_cache_headers", new_callable=AsyncMock
+ ):
mock_head.return_value = ([], urls_to_crawl, set())
await _scan_website("example.com", store_manager, crawler_service)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| with patch("app.services.scheduler._bulk_head_check", new_callable=AsyncMock) as mock_head: | |
| mock_head.return_value = ([], urls_to_crawl, set()) | |
| await _scan_website("example.com", store_manager, crawler_service) | |
| with patch("app.services.scheduler._bulk_head_check", new_callable=AsyncMock) as mock_head, patch( | |
| "app.services.scheduler._seed_cache_headers", new_callable=AsyncMock | |
| ): | |
| mock_head.return_value = ([], urls_to_crawl, set()) | |
| await _scan_website("example.com", store_manager, crawler_service) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/crawler/tests/test_scheduler_errors.py` around lines 84 - 87, Tests
patch _bulk_head_check but not _seed_cache_headers, so _scan_website may still
perform real HTTP HEAD requests via httpx.AsyncClient.head; patch
_seed_cache_headers (or the httpx AsyncClient.head it uses) in the test to
prevent live network calls. Specifically, in the test where you patch
"_bulk_head_check" and call _scan_website, also patch
"app.services.scheduler._seed_cache_headers" (or patch "httpx.AsyncClient.head"
used by _seed_cache_headers) with an AsyncMock that returns a safe/fake result
so _scan_website never performs real HTTP requests.
Summary
fail_count. Permanent HTTP errors (404 Not Found, 410 Gone) now trigger soft-deletion: chunks and paragraph hashes are removed, content fields are cleared, and the URL status is set todeleted.fail_countincrement instead — protecting against false positives from temporary server misconfiguration.max_fail_count=10): URLs that have failed 10+ times are excluded from recrawl queries, preventing wasted crawl cycles on permanently broken pages.get_total_countandget_websitelateral join now filter outstatus = 'deleted'so dashboard metrics stay accurate.chunks(domain, url)for efficient bulk deletion.Test plan
test_scheduler_errors.pycovers error classification (404→deleted, 500→fail_count, network failure→fail_count, mixed batches)test_deleted_url_counts.pycovers deleted URL exclusion from counts, soft-deletion content cleanup, fail_count filtering, and idempotency🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Improvements