fix(crawler): delete 404/410 pages instead of retrying forever by larryro · Pull Request #660 · tale-project/tale

larryro · 2026-03-04T06:34:39Z

Summary

Mark 404/410 pages as deleted instead of endlessly incrementing fail_count. Permanent HTTP errors (404 Not Found, 410 Gone) now trigger soft-deletion: chunks and paragraph hashes are removed, content fields are cleared, and the URL status is set to deleted.
Mass deletion guard: if >50% of a site's known URLs return 404/410 in a single batch, deletion is blocked and URLs are routed to fail_count increment instead — protecting against false positives from temporary server misconfiguration.
Fail count recrawl limit (max_fail_count=10): URLs that have failed 10+ times are excluded from recrawl queries, preventing wasted crawl cycles on permanently broken pages.
Exclude deleted URLs from page counts: get_total_count and get_website lateral join now filter out status = 'deleted' so dashboard metrics stay accurate.
Add compound index on chunks(domain, url) for efficient bulk deletion.
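The classification rule in the summary can be sketched as a small pure function. This is a minimal illustration, not the scheduler's actual code: `CrawlResult`, `Buckets`, and `classify` are hypothetical names, and a missing status code is assumed to model a network failure. The bucket names follow the ones mentioned in the CodeRabbit walkthrough.

```python
from dataclasses import dataclass, field
from typing import Optional

# Status codes treated as permanent, per the 404/410 rule in the summary.
PERMANENT_HTTP_ERRORS = frozenset({404, 410})


@dataclass
class CrawlResult:
    """Hypothetical result record; the real crawler's type may differ."""
    url: str
    status_code: Optional[int]  # None models a network failure (no response)


@dataclass
class Buckets:
    ok_urls: list[str] = field(default_factory=list)
    gone_urls: list[str] = field(default_factory=list)
    transient_error_urls: list[str] = field(default_factory=list)
    network_failed_urls: list[str] = field(default_factory=list)


def classify(results: list[CrawlResult]) -> Buckets:
    """Split crawl results into success, permanent, transient, and network buckets."""
    buckets = Buckets()
    for r in results:
        if r.status_code is None:
            buckets.network_failed_urls.append(r.url)
        elif r.status_code in PERMANENT_HTTP_ERRORS:
            buckets.gone_urls.append(r.url)
        elif 200 <= r.status_code < 300:
            buckets.ok_urls.append(r.url)
        else:
            # Assumption: every other non-2xx status goes down the fail_count path.
            buckets.transient_error_urls.append(r.url)
    return buckets
```

Under this sketch, only `gone_urls` would be candidates for soft-deletion; the transient and network buckets would both feed the existing fail_count increment.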

Test plan

New test suite test_scheduler_errors.py covers error classification (404→deleted, 500→fail_count, network failure→fail_count, mixed batches)
New test suite test_deleted_url_counts.py covers deleted URL exclusion from counts, soft-deletion content cleanup, fail_count filtering, and idempotency
Mass deletion threshold tests verify blocking at >50%, allowing at <=50%, and edge cases (total=0, exact boundary)
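The threshold cases listed above can be illustrated with a small predicate. This is a sketch that mirrors the ratio check quoted later in the review (`ratio > _MAX_DELETION_RATIO`, guarded by `total_count > 0`); the function name `deletion_blocked` is hypothetical.

```python
MAX_DELETION_RATIO = 0.5  # mirrors the 50% threshold described in the PR


def deletion_blocked(gone_count: int, total_count: int,
                     max_ratio: float = MAX_DELETION_RATIO) -> bool:
    """Return True when the share of gone URLs strictly exceeds max_ratio."""
    if total_count <= 0:
        # With no known URLs there is no ratio to compare, so nothing is blocked.
        return False
    return gone_count / total_count > max_ratio
```

Because the comparison is strict, a batch where exactly half of the URLs are gone is still allowed to proceed to deletion.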

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added protection mechanism to prevent unintended mass URL deletions with safety thresholds
- Enhanced classification to distinguish permanent HTTP errors (404, 410) from transient failures
Improvements
- Improved URL recrawl decision-making with updated failure tracking
- Database query performance optimization

URLs returning permanent HTTP errors (404/410) were retried every scan cycle indefinitely, keeping the progress bar stuck at 99%. Now the scheduler classifies these as permanent errors and marks them deleted, while transient errors (5xx, network failures) continue with the existing retry behavior.

Changes:
- Split HTTP error handling: 404/410 → mark_urls_deleted, others → increment_fail_count
- Exclude deleted URLs from page count queries (get_website, get_total_count)
- Clean up chunks and paragraph hashes when marking URLs as deleted
- Resurrect deleted URLs if re-discovered in sitemap (handles false-positive 404s)

… cleanup

Use ANY($2) batch queries in a transaction instead of executemany loops, clear all content fields on soft delete, switch discovered URL upsert to DO NOTHING to prevent re-discovery of deleted URLs, and cap deletion log to 5 sample URLs.
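The batched soft-delete described in this commit might look roughly like the following. It is a sketch under several assumptions: the connection behaves like an asyncpg connection (`transaction()` and `execute()`), the tables `chunks`, `page_paragraph_hashes`, and `website_urls` all carry `domain` and `url` columns, and `content_hash` stands in for "all content fields". The real `mark_urls_deleted` in pg_website_store.py may differ in signature and SQL.

```python
from typing import Any, Protocol, Sequence


class Connection(Protocol):
    """Subset of the asyncpg connection API that this sketch relies on."""

    def transaction(self) -> Any: ...

    async def execute(self, query: str, *args: Any) -> str: ...


async def mark_urls_deleted(conn: Connection, domain: str,
                            urls: Sequence[str]) -> None:
    """Soft-delete a batch of URLs with one statement per table."""
    if not urls:
        return
    url_list = list(urls)
    # One transaction, so a URL is never left half cleaned up.
    async with conn.transaction():
        await conn.execute(
            "DELETE FROM chunks WHERE domain = $1 AND url = ANY($2::text[])",
            domain, url_list,
        )
        # Assumed column layout for page_paragraph_hashes.
        await conn.execute(
            "DELETE FROM page_paragraph_hashes "
            "WHERE domain = $1 AND url = ANY($2::text[])",
            domain, url_list,
        )
        # content_hash is the only content column named in this PR;
        # other content fields would be cleared the same way.
        await conn.execute(
            "UPDATE website_urls SET status = 'deleted', content_hash = NULL "
            "WHERE domain = $1 AND url = ANY($2::text[])",
            domain, url_list,
        )
```

Passing the whole URL list as a single array parameter keeps the number of round trips constant at three regardless of batch size, which is the usual motivation for replacing per-row executemany loops.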

Prevent catastrophic data loss when a site temporarily 404s all pages by blocking deletion when >50% of known URLs are gone in a single scan. URLs exceeding 10 consecutive failures are excluded from recrawl queues.
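The recrawl limit can be modelled in memory as a simple filter. This sketch assumes the "10+" reading from the PR summary (a URL is skipped once `fail_count` reaches `max_fail_count`); `UrlRow` and `urls_needing_recrawl` are hypothetical names, and the real `get_urls_needing_recrawl` is a SQL query rather than a Python loop.

```python
from dataclasses import dataclass

MAX_FAIL_COUNT = 10  # mirrors max_fail_count=10 from the PR summary


@dataclass
class UrlRow:
    """Hypothetical in-memory stand-in for a website_urls row."""
    url: str
    status: str
    fail_count: int


def urls_needing_recrawl(rows: list[UrlRow],
                         max_fail_count: int = MAX_FAIL_COUNT) -> list[str]:
    """Keep URLs that are not deleted and have failed fewer than max_fail_count times."""
    return [
        r.url
        for r in rows
        if r.status != "deleted" and r.fail_count < max_fail_count
    ]
```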

Extract identical _run_scan method from TestSchedulerErrorClassification and TestMassDeletionThreshold into a shared module-level function.


coderabbitai · 2026-03-04T06:41:12Z

📝 Walkthrough

Walkthrough

This PR implements soft deletion of URLs and introduces mass-deletion safety checks. The main changes include: updating pg_website_store.py to accept a max_fail_count parameter in get_urls_needing_recrawl, implementing a multi-step transactional deletion in mark_urls_deleted that clears related chunks and page_paragraph_hashes, and modifying count queries to exclude deleted URLs. In scheduler.py, new constants define permanent HTTP errors (404, 410) and a max deletion ratio threshold (0.5), and the crawl results handler now classifies responses into gone_urls, transient_error_urls, and network_failed_urls, with a safety check that prevents mass deletions exceeding the threshold. New test modules validate deleted URL exclusion from counts and comprehensive HTTP error classification scenarios. A new index is added to the chunks table for domain and url columns.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

fix(crawler): handle non-2xx responses and increase sync reliability #624: Modifies scheduler error handling to separate crawled pages from HTTP-error vs network-failure results, directly overlapping with the error classification logic in scheduler.py.
feat(crawler,db): add cross-page paragraph deduplication for boilerplate filtering #604: Introduces page_paragraph_hashes table; this PR updates mark_urls_deleted to remove page_paragraph_hashes rows as part of cleanup, creating a direct code-level dependency.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 17.65% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately and concisely describes the main change: permanent HTTP errors (404/410) are now marked for deletion instead of being retried indefinitely.


coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@services/crawler/app/services/scheduler.py`:
- Around line 330-332: The deletion-guard uses site_store.get_total_count(),
which only counts rows with content_hash IS NOT NULL and underestimates "known"
URLs; replace that usage with a new method (e.g., get_known_url_count) that
returns COUNT(*) WHERE domain = $1 AND status != 'deleted' and update
scheduler.py to call site_store.get_known_url_count() when computing ratio
(instead of get_total_count()); implement get_known_url_count in
pg_website_store.py using acquire_with_retry on the connection pool and fetchval
with the suggested WHERE clause so the mass-deletion ratio denominator correctly
reflects known URLs.

In `@services/crawler/tests/test_scheduler_errors.py`:
- Around line 84-87: Tests patch _bulk_head_check but not _seed_cache_headers,
so _scan_website may still perform real HTTP HEAD requests via
httpx.AsyncClient.head; patch _seed_cache_headers (or the httpx AsyncClient.head
it uses) in the test to prevent live network calls. Specifically, in the test
where you patch "_bulk_head_check" and call _scan_website, also patch
"app.services.scheduler._seed_cache_headers" (or patch "httpx.AsyncClient.head"
used by _seed_cache_headers) with an AsyncMock that returns a safe/fake result
so _scan_website never performs real HTTP requests.

ℹ️ Review info

Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 92f02b00-9249-4787-a689-d5b9c26f19f1

📥 Commits

Reviewing files that changed from the base of the PR and between d9dbc20 and c6293da.

📒 Files selected for processing (5)

services/crawler/app/services/pg_website_store.py
services/crawler/app/services/scheduler.py
services/crawler/tests/test_deleted_url_counts.py
services/crawler/tests/test_scheduler_errors.py
services/db/init-scripts/03-create-knowledge-database.sql

coderabbitai · 2026-03-04T06:41:15Z

+                total_count = await site_store.get_total_count()
+                ratio = len(gone_urls) / total_count if total_count > 0 else 0.0
+                if total_count > 0 and ratio > _MAX_DELETION_RATIO:


⚠️ Potential issue | 🟠 Major

Mass-deletion ratio denominator excludes discovered URLs, making the guard too strict.

This currently uses get_total_count(), which (in services/crawler/app/services/pg_website_store.py, Lines 220-221) counts only rows with content_hash IS NOT NULL. That undercounts “known URLs” and can incorrectly block valid deletions.
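A small worked example shows how the choice of denominator changes the outcome. The numbers are hypothetical and chosen only for illustration: a domain with 100 known URLs, of which 40 have stored content, and a batch in which 30 URLs return 404/410.

```python
MAX_DELETION_RATIO = 0.5


def ratio(gone: int, denominator: int) -> float:
    """Share of gone URLs, defined as 0.0 when the denominator is zero."""
    return gone / denominator if denominator > 0 else 0.0


known_urls = 100        # hypothetical: all non-deleted rows for the domain
urls_with_content = 40  # hypothetical: rows where content_hash IS NOT NULL
gone = 30               # hypothetical: URLs returning 404/410 in this batch

# Denominator counts only rows with content: 30 / 40 = 0.75, deletion is blocked.
blocked_with_content_count = ratio(gone, urls_with_content) > MAX_DELETION_RATIO

# Denominator counts all known URLs: 30 / 100 = 0.30, deletion proceeds.
blocked_with_known_count = ratio(gone, known_urls) > MAX_DELETION_RATIO
```

With the smaller denominator the same batch crosses the threshold, which is the over-strict behaviour the review comment describes.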

🔧 Proposed fix

- total_count = await site_store.get_total_count()
+ total_count = await site_store.get_known_url_count()

# services/crawler/app/services/pg_website_store.py
async def get_known_url_count(self) -> int:
    async with acquire_with_retry(self._pool) as conn:
        return await conn.fetchval(
            """SELECT COUNT(*) FROM website_urls
               WHERE domain = $1 AND status != 'deleted'""",
            self._domain,
        )


coderabbitai · 2026-03-04T06:41:15Z

+    with patch("app.services.scheduler._bulk_head_check", new_callable=AsyncMock) as mock_head:
+        mock_head.return_value = ([], urls_to_crawl, set())
+        await _scan_website("example.com", store_manager, crawler_service)
+


⚠️ Potential issue | 🟠 Major

Patch _seed_cache_headers too; current helper can still make live HTTP calls.

When crawl results include content, _scan_website can call _seed_cache_headers, which uses httpx.AsyncClient.head. That makes these tests flaky/slow and environment-dependent.

🔧 Proposed fix

-    with patch("app.services.scheduler._bulk_head_check", new_callable=AsyncMock) as mock_head:
+    with patch("app.services.scheduler._bulk_head_check", new_callable=AsyncMock) as mock_head, patch(
+        "app.services.scheduler._seed_cache_headers", new_callable=AsyncMock
+    ):
         mock_head.return_value = ([], urls_to_crawl, set())
         await _scan_website("example.com", store_manager, crawler_service)


larryro added 4 commits March 4, 2026 13:21

fix(crawler): add mass deletion guard and fail count recrawl limit

1da56f6

Prevent catastrophic data loss when a site temporarily 404s all pages by blocking deletion when >50% of known URLs are gone in a single scan. URLs exceeding 10 consecutive failures are excluded from recrawl queues.

test(crawler): deduplicate _run_scan helper in scheduler error tests

c6293da

Extract identical _run_scan method from TestSchedulerErrorClassification and TestMassDeletionThreshold into a shared module-level function.

greptile-apps Bot reviewed Mar 4, 2026


coderabbitai Bot requested changes Mar 4, 2026


larryro merged commit 4514360 into main Mar 4, 2026
16 checks passed

larryro deleted the fix/crawler-delete-404-pages branch March 4, 2026 06:41

yannickmonney pushed a commit that referenced this pull request Apr 8, 2026

fix(crawler): delete 404/410 pages instead of retrying forever (#660)

8066dfb
