refactor(platform): consolidate crawler tools into unified web tool #369
- Replace separate crawler helpers with unified web tool module
- Add browser_operate and fetch_url_via_pdf helpers for web interactions
- Move operator service helpers to web/helpers directory
- Add new crawler service web router with page content and search endpoints
- Update agents to use new web tool structure
- Simplify web assistant tool implementation
- Remove deprecated action cache functions
…serve user intent
Add isProcessingToolResult state to detect when the agent is processing tool results but hasn't resumed streaming text. This fixes the UI gap where no loading indicator was shown between tool completion and agent response.
Also update the agent and web_assistant_tool prompts to preserve the user's specific questions when delegating to sub-agents, instead of reducing them to generic "Get content from URL" requests.
📝 Walkthrough
This PR refactors web content fetching and agent tooling across the platform. It introduces a new FastAPI endpoint …
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@services/crawler/app/routers/web.py`:
- Around line 34-56: This code fetches arbitrary user URLs (url_str/hostname)
before any validation in the PDF flow (get_pdf_service -> url_to_pdf), so add
SSRF protection by resolving the request.url hostname to IP(s) and rejecting
requests that resolve to loopback, link-local, or private (RFC 1918 or IPv6
unique-local) addresses, or to hostnames on a denylist; perform this check
immediately after computing hostname and before calling get_pdf_service or
url_to_pdf, returning a 4xx error for blocked hosts and logging the blocked
hostname/IPs.
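The address classification this comment asks for can be sketched as below. This is only an illustration: the router under review is Python, the sketch uses TypeScript to match the rest of the examples here, and `isBlockedIpv4` and `isBlockedIpv6` are hypothetical names. DNS resolution of the hostname is left out, and the IPv6 check assumes addresses arrive in the usual lowercase, zero-compressed textual form.

```typescript
// Hypothetical helper: true when an IPv4 address must not be fetched.
function isBlockedIpv4(ip: string): boolean {
  // Fail closed on anything that is not a plain dotted quad.
  if (!/^\d{1,3}(\.\d{1,3}){3}$/.test(ip)) return true;
  const parts = ip.split('.').map((p) => Number(p));
  if (parts.some((n) => n > 255)) return true;
  const a = parts[0];
  const b = parts[1];
  return (
    a === 0 ||                           // 0.0.0.0/8
    a === 127 ||                         // loopback
    a === 10 ||                          // RFC 1918: 10.0.0.0/8
    (a === 172 && b >= 16 && b <= 31) || // RFC 1918: 172.16.0.0/12
    (a === 192 && b === 168) ||          // RFC 1918: 192.168.0.0/16
    (a === 169 && b === 254)             // link-local: 169.254.0.0/16
  );
}

// Hypothetical helper: true when an IPv6 address must not be fetched.
function isBlockedIpv6(ip: string): boolean {
  const addr = ip.toLowerCase();
  if (addr === '::1' || addr === '::') return true; // loopback, unspecified
  // IPv4-mapped addresses inherit the IPv4 decision.
  const mapped = addr.match(/^::ffff:(\d{1,3}(?:\.\d{1,3}){3})$/);
  if (mapped) return isBlockedIpv4(mapped[1]);
  return (
    /^f[cd][0-9a-f]{2}:/.test(addr) ||  // unique local: fc00::/7
    /^fe[89ab][0-9a-f]:/.test(addr)     // link-local: fe80::/10
  );
}
```

A real check would run every address the hostname resolves to through these functions and reject the request if any one of them is blocked.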
In `@services/platform/convex/agent_tools/sub_agents/web_assistant_tool.ts`:
- Around line 64-78: The web_assistant_tool handler currently calls
validateToolContext(ctx, 'web_assistant') without requiring a userId, which lets
undefined userId propagate; update the call in the handler (the
validateToolContext invocation inside handler: async (ctx: ToolCtx, args):
Promise<ToolResponse>) to pass the options object { requireUserId: true } so
validation fails fast and returns a clear error when userId is missing before
calling getOrCreateSubThread or invoking the Web Agent action.
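To make the fail-fast behaviour concrete, here is a minimal sketch of a validator that honours a `requireUserId` option. The repository's actual `validateToolContext` and `ToolCtx` are not shown in this review, so `validateToolContextSketch`, `ToolCtxSketch`, and the result shape are hypothetical stand-ins.

```typescript
// Hypothetical stand-ins; the real ToolCtx and validateToolContext may differ.
interface ToolCtxSketch {
  organizationId?: string;
  threadId?: string;
  userId?: string;
}

interface ValidateOptions {
  requireUserId?: boolean;
}

interface ValidationResult {
  ok: boolean;
  error?: string;
}

function validateToolContextSketch(
  ctx: ToolCtxSketch,
  toolName: string,
  options: ValidateOptions = {},
): ValidationResult {
  // Reject a missing userId up front so it never reaches downstream calls.
  if (options.requireUserId && !ctx.userId) {
    return { ok: false, error: `${toolName}: userId is required` };
  }
  return { ok: true };
}
```

With `{ requireUserId: true }`, a context lacking `userId` yields a clear error before any sub-thread is created.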
In `@services/platform/convex/agent_tools/web/helpers/browser_operate.ts`:
- Around line 28-44: The try block using AbortController/timeoutId for the fetch
can produce generic errors on timeout; update the error handling so that when
fetch throws due to abort you detect the AbortError (from controller.signal or
error.name === 'AbortError') and throw or log a clear, specific timeout/abort
error message (e.g., "Operator service request timed out after 300000ms")
instead of the generic error; make this change around the fetch call and the
existing catch path that handles response errors, referencing AbortController,
timeoutId, controller.signal, and response to locate where to add the explicit
AbortError handling.
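One way to add the explicit abort handling is sketched below. It assumes a runtime with global `fetch` and `AbortController`; `toTimeoutError` and `postWithTimeout` are hypothetical names, and the request shape (JSON POST) is an assumption rather than the helper's confirmed behaviour.

```typescript
// Hypothetical helper: maps an abort to a specific timeout error,
// and passes every other error through unchanged.
function toTimeoutError(error: unknown, timeoutMs: number): Error {
  const name =
    typeof error === 'object' && error !== null
      ? (error as { name?: unknown }).name
      : undefined;
  if (name === 'AbortError') {
    return new Error(`Operator service request timed out after ${timeoutMs}ms`);
  }
  return error instanceof Error ? error : new Error(String(error));
}

// Hypothetical wrapper around fetch with an abort-based timeout.
async function postWithTimeout(
  url: string,
  body: unknown,
  timeoutMs: number,
): Promise<Response> {
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const response = await fetch(url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
      signal: controller.signal,
    });
    if (!response.ok) {
      throw new Error(`Operator service error: ${response.status}`);
    }
    return response;
  } catch (error) {
    // Aborts surface as a clear timeout message instead of a generic failure.
    throw toTimeoutError(error, timeoutMs);
  } finally {
    clearTimeout(timeoutId);
  }
}
```

Keeping the mapping in a separate function makes the timeout message testable without a network call.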
In `@services/platform/convex/agent_tools/web/helpers/fetch_url_via_pdf.ts`:
- Around line 73-96: The code computes a truncated boolean but doesn't return
it, so update the WebFetchUrlResult type (in types.ts) to include a truncated:
boolean field and then include truncated in the object returned by the function
that builds the fetch result (the block that currently sets operation:
'fetch_url', success: true, url: args.url, title: result.title, content, ...).
Ensure the computed truncated value (from content.length > MAX_CONTENT_LENGTH)
is preserved and returned so callers/LLMs can detect truncation.
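The change described above might look like the following sketch. The value of `MAX_CONTENT_LENGTH` and the exact fields of `WebFetchUrlResult` are assumptions for illustration, and `buildFetchResult` is a hypothetical name for the block that assembles the result.

```typescript
// Assumed limit; the real MAX_CONTENT_LENGTH is not shown in the review.
const MAX_CONTENT_LENGTH = 50000;

// Assumed shape, extended with the `truncated` flag the comment asks for.
interface WebFetchUrlResult {
  operation: 'fetch_url';
  success: boolean;
  url: string;
  title: string;
  content: string;
  truncated: boolean;
}

// Hypothetical builder: trims the content and reports whether it did so.
function buildFetchResult(
  url: string,
  title: string,
  rawContent: string,
): WebFetchUrlResult {
  const truncated = rawContent.length > MAX_CONTENT_LENGTH;
  const content = truncated
    ? rawContent.slice(0, MAX_CONTENT_LENGTH)
    : rawContent;
  return { operation: 'fetch_url', success: true, url, title, content, truncated };
}
```

Callers can then check `result.truncated` before treating `content` as the full page.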
In `@services/platform/convex/agents/crm/agent.ts`:
- Around line 30-36: The multi-match response currently exposes full email
addresses; update the code path that formats/returns multiple CRM matches (e.g.,
the function handling multiple-results in agent.ts such as
getContactMatches/formatMatchList or the branch labeled "**CRITICAL - MULTIPLE
MATCHES:**") to mask email addresses by default (e.g., show local-part partial
or initials and domain as ****) and include non‑PII distinguishing fields
(title, company, last activity) so the user can disambiguate; only return full
email addresses when an explicit authorization flag or an explicit user
confirmation is present (check for an isAuthorized or revealEmail parameter) and
log that full PII was disclosed. Ensure the response asks the user to clarify
which record they mean rather than selecting one automatically.
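A minimal sketch of the masking behaviour follows. The names `maskEmail`, `formatMatchList`, `CrmMatch`, and the `revealEmail` flag are hypothetical, the masking format is just one possible choice, and the audit logging the comment mentions is omitted.

```typescript
// Hypothetical helper: keeps the first character of the local part only.
function maskEmail(email: string): string {
  const at = email.lastIndexOf('@');
  if (at <= 0) return '****'; // Not a recognizable address: reveal nothing.
  return `${email.charAt(0)}***@****`;
}

// Hypothetical record shape with non-PII fields for disambiguation.
interface CrmMatch {
  name: string;
  email: string;
  title?: string;
  company?: string;
}

// Hypothetical formatter: masks emails unless revealEmail is explicitly set,
// and always ends by asking the user which record they mean.
function formatMatchList(matches: CrmMatch[], revealEmail = false): string {
  const lines = matches.map((m, i) => {
    const email = revealEmail ? m.email : maskEmail(m.email);
    const details = [m.title, m.company].filter(Boolean).join(', ');
    const suffix = details ? ` - ${details}` : '';
    return `${i + 1}. ${m.name} (${email})${suffix}`;
  });
  return `${lines.join('\n')}\nWhich of these records do you mean?`;
}
```

The default of `revealEmail = false` means a caller has to opt in before a full address appears in the response.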
…ontext
Clarify in the pre-analyzed content marker and routing agent instructions that attachments from the current message take priority over any previous conversation context. This prevents the AI from confusing attached documents with previously discussed content.
Summary
Changes
Web Tool Consolidation
- Replace separate crawler helpers (`fetch_page_content`, `fetch_searxng_results`, `search_and_fetch`, `search_web`) with unified `web/` module
- Add `browser_operate` and `fetch_url_via_pdf` helpers for web interactions
- Move operator service helpers to `web/helpers/` directory

UX Improvements
- Add `isProcessingToolResult` state to detect when the agent is processing tool results but hasn't resumed streaming text

Test plan
🤖 Generated with Claude Code