Skip to content

refactor(platform): rename website scan operations, add embedding workpool, and improve streaming#574

Merged
larryro merged 5 commits into
mainfrom
refactor/website-scan-naming-and-embedding-pool
Feb 26, 2026
Merged

refactor(platform): rename website scan operations, add embedding workpool, and improve streaming#574
larryro merged 5 commits into
mainfrom
refactor/website-scan-naming-and-embedding-pool

Conversation

@larryro

@larryro larryro commented Feb 26, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Rename website scan operations for clarity: register_discovered_urlsregister_urls, sync_pending_pagescrawl_and_upsert, generateForPageembedPage
  • Add embedding workpool using @convex-dev/workpool (maxParallelism=1, 3 retries with exponential backoff) to serialize embedding generation during bulk scans and prevent rate-limit failures
  • Inject configured dimensions into embedding API calls — the dimension setting was only used for vector table selection but never passed to the model
  • Support comma-separated model lists across all services (platform, crawler, operator, rag), always selecting the first entry. Replace useFastModel boolean with explicit model string resolution. Add modelId field to custom agents for user-selectable models
  • Fix streaming text regression — add useStableStreamText hook to suppress temporary regression from WebSocket reconnection, and make useAutoScroll ResizeObserver run continuously through typewriter drain

Test plan

  • Unit tests added for useStableStreamText, useAutoScroll, model list parsing, and embedding config
  • Verify website scan workflow executes correctly with renamed operations
  • Verify embedding workpool serializes and retries on rate limits
  • Verify custom agent model selection works end-to-end
  • Verify streaming chat auto-scrolls correctly during typewriter drain

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Support for comma-separated model configurations for flexible model selection across services
    • Model ID-based selection system for custom agents replacing preset categories
  • Bug Fixes

    • Fixed streaming text regression during content reconnections
    • Improved auto-scroll behavior for proper content growth handling during streaming
    • Enhanced scroll position stability across content updates
  • Improvements

    • Optimized embedding job scheduling with enhanced rate-limit protection
    • Refined website operation naming: register_urls, crawl_and_upsert

…workpool

Clarify website scan operation names to better reflect their purpose:
- register_discovered_urls → register_urls
- sync_pending_pages → crawl_and_upsert
- generateForPage → embedPage

Replace manual embedding retry logic with @convex-dev/workpool
(maxParallelism=1, 3 retries with exponential backoff) to serialize
embedding generation during bulk scans and prevent rate-limit failures.

Also add missing output schemas for register_urls and crawl_and_upsert
operations in the workflow engine validation layer.
The embedding dimension setting was only used to select the vector table
but never passed to the model, so API calls still returned the default
1536-d vectors. Add a withDimensions wrapper for the AI SDK embedding
model and inject EMBEDDING_DIMENSIONS into litellm sync/async patches.
…model selection

Parse OPENAI_MODEL, OPENAI_FAST_MODEL, etc. as comma-separated lists across
all services (platform, crawler, operator, rag), always selecting the first
entry. Replace the `useFastModel` boolean flag with explicit model string
resolution through centralized helpers in agent_runtime_config. Add `modelId`
field to custom agents so users can select a specific model from the available
pool rather than only choosing a preset tier.
…oll during typewriter drain

Add useStableStreamText hook to suppress temporary text regression caused
by WebSocket reconnection delivering shorter committed text while the
sync stream reconnects. Make useAutoScroll's ResizeObserver run
continuously so content growth is followed even after streaming ends
(typewriter drain), while shrinkage is only followed during active
streaming.
@larryro larryro force-pushed the refactor/website-scan-naming-and-embedding-pool branch from 5b4774a to 28a0d28 Compare February 26, 2026 05:49
@coderabbitai

coderabbitai Bot commented Feb 26, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

This PR systematically refactors model selection across multiple services by introducing standardized utilities for parsing comma-separated model environment variables and replacing direct environment variable reads with centralized helper functions. It updates agent creation factories to use explicit model strings instead of boolean flags, adds a modelId field to custom agents schema, refactors website workflow operations with new naming conventions, introduces a work pool for serialized embedding jobs, and enhances streaming text stability with a new useStableStreamText hook. The changes span services/crawler, services/operator, services/platform, and services/rag.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main changes: renaming website operations, adding an embedding workpool, and improving streaming. It directly reflects the primary objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch refactor/website-scan-naming-and-embedding-pool

Tip

Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs).
Share your feedback on Discord.


Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps

greptile-apps Bot commented Feb 26, 2026

Copy link
Copy Markdown

Greptile Summary

This PR implements five key refactorings across the platform to improve reliability and clarity:

1. Embedding Workpool (Rate Limit Fix)

  • Introduces @convex-dev/workpool to serialize embedding generation (maxParallelism=1) during bulk website scans
  • Prevents concurrent API calls that were causing rate-limit failures and server instability
  • Adds automatic retry logic (3 attempts, exponential backoff starting at 10s)

2. Embedding Dimension Injection (Bug Fix)

  • The EMBEDDING_DIMENSIONS setting was only used for table selection, never passed to the API
  • Adds withDimensions() wrapper that injects dimensions via providerOptions.openai.dimensions
  • Ensures embedding models receive the configured dimension parameter

3. Comma-Separated Model Lists

  • Replaces single-model environment variables with comma-separated lists across all services
  • Always selects the first entry, enabling easy model fallback/rotation
  • Removes useFastModel boolean flag in favor of explicit model string resolution
  • Consistent implementation in TypeScript and Python services

4. Website Scan Operation Renames

  • register_discovered_urlsregister_urls (clearer, URL source is irrelevant)
  • sync_pending_pagescrawl_and_upsert (more accurate description)
  • generateForPageembedPage (consistent naming with other operations)

5. Streaming Text Regression Fix

  • WebSocket reconnection was causing temporary text regression during streaming
  • Adds useStableStreamText hook to suppress backward jumps in displayed text
  • Updates useAutoScroll ResizeObserver to run continuously through typewriter drain

All changes include comprehensive test coverage and maintain backward compatibility in behavior.

Confidence Score: 5/5

  • Safe to merge with high confidence - well-tested refactoring with clear improvements
  • This PR demonstrates excellent engineering practices: comprehensive test coverage (unit tests for all new utilities and hooks), clear architectural improvements (workpool prevents rate limits, dimension injection fixes API bug), consistent patterns across services (model list parsing in TS/Python), and thorough documentation. The changes are primarily refactorings with improved naming and behavior fixes. No breaking changes or risky operations detected.
  • No files require special attention - all changes are well-implemented and tested

Important Files Changed

Filename Overview
services/platform/convex/website_page_embeddings/embedding_pool.ts Introduces workpool (maxParallelism=1) to serialize embedding generation and prevent rate-limit failures during bulk scans
services/platform/convex/lib/embedding_config.ts Adds withDimensions() wrapper to inject configured dimensions into embedding API calls via providerOptions
services/platform/lib/shared/utils/model-list.ts New utility for parsing comma-separated model lists, selecting first entry from environment variables
services/platform/convex/lib/agent_runtime_config.ts Refactors to use model list utilities, removes useFastModel boolean in favor of explicit model resolution
services/platform/convex/websites/bulk_upsert_pages.ts Uses embeddingPool.enqueueAction() instead of scheduler.runAfter() to serialize embedding jobs
services/platform/app/hooks/use-auto-scroll.ts ResizeObserver now runs continuously (not just when enabled), follows growth always but shrinkage only during streaming
services/platform/app/features/chat/components/typewriter-text.tsx Adds useStableStreamText hook to suppress temporary text regression from WebSocket reconnection
services/platform/convex/custom_agents/config.ts Adds resolveModel() function to support custom modelId or fall back to preset-based resolution
services/platform/convex/workflow_engine/action_defs/website_pages/website_pages_action.ts Renames operations: register_discovered_urlsregister_urls, sync_pending_pagescrawl_and_upsert

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    Start[Website Scan Triggered] --> FetchMain[Fetch Main Page]
    FetchMain --> QueryURLs[Query URLs from Sitemap]
    QueryURLs --> RegisterURLs[register_urls: Register/Update URLs]
    RegisterURLs --> CheckHash{Content Hash Changed?}
    CheckHash -->|New/Changed| AddPending[Mark as pending sync]
    CheckHash -->|Unchanged| Skip[Skip]
    AddPending --> CrawlUpsert[crawl_and_upsert: Fetch & Upsert Pages]
    CrawlUpsert --> EnqueueEmbed[Enqueue to embeddingPool]
    EnqueueEmbed --> EmbedPage[embedPage: Generate Embeddings]
    EmbedPage --> InjectDims[withDimensions: Inject API dimensions]
    InjectDims --> CallAPI[Embedding API Call]
    CallAPI -->|Rate Limit| Retry[Retry with Backoff]
    CallAPI -->|Success| StoreVectors[Store in Vector Table]
    Retry -->|Max 3 attempts| StoreVectors
    StoreVectors --> HasMore{More URLs?}
    HasMore -->|Yes| RegisterURLs
    HasMore -->|No| UpdateStatus[Update Website Status]
    
    style EnqueueEmbed fill:#e1f5ff
    style InjectDims fill:#e1f5ff
    style Retry fill:#fff4e1
Loading

Last reviewed commit: 28a0d28

…h test

Add @ai-sdk/provider as an explicit dependency (was only transitive via
@ai-sdk/openai) to satisfy Knip. Align cleanup_website test expectations
with the actual PAGE_BATCH_SIZE of 2.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
services/platform/convex/website_page_embeddings/internal_actions.ts (1)

60-66: ⚠️ Potential issue | 🔴 Critical

Prevent partial-embedding lock-in under retries.

The skip check only compares contentHash. If an earlier attempt fails mid-insert, a retry can observe matching hash and skip, leaving an incomplete embedding set for the page.

Suggested fix pattern
-    // 2. Check content hash to skip if unchanged
-    const contentHash = computeContentHash(page.content);
-    const existing = await ctx.runQuery(
-      internal.website_page_embeddings.internal_queries
-        .getExistingEmbeddingHash,
-      { organizationId, pageId, dimension },
-    );
-
-    if (existing?.contentHash === contentHash) {
-      debugLog('embedPage skip - content unchanged', {
-        pageId,
-        contentHash,
-      });
-      return { status: 'skipped', reason: 'unchanged' };
-    }
-
-    // 3. Chunk content
-    const chunks = chunkContent(page.content, page.title);
+    // 2. Chunk first so unchanged checks can validate completeness, not just hash
+    const contentHash = computeContentHash(page.content);
+    const chunks = chunkContent(page.content, page.title);
+    const existing = await ctx.runQuery(
+      internal.website_page_embeddings.internal_queries
+        .getExistingEmbeddingStats, // contentHash + chunkCount
+      { organizationId, pageId, dimension },
+    );
+
+    if (
+      existing?.contentHash === contentHash &&
+      existing.chunkCount === chunks.length
+    ) {
+      debugLog('embedPage skip - content unchanged', {
+        pageId,
+        contentHash,
+      });
+      return { status: 'skipped', reason: 'unchanged' };
+    }

Also applies to: 81-88, 104-116

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@services/crawler/app/config.py`:
- Line 74: The line assigning model is terse and hard to read; split it into two
clear steps: first evaluate the preferred source(s) (e.g., assign a variable
like preferred_model = self.openai_fast_model and fallback_env =
os.environ.get("OPENAI_FAST_MODEL")), then call get_first_model(preferred_model)
and if that returns falsy call get_first_model(fallback_env) and assign the
result to model; update the code around the get_first_model call so the intent
is explicit and easier to debug.
- Line 67: Split the long assignment to improve readability by first resolving
the environment value and then calling get_first_model; e.g., extract
os.environ.get("OPENAI_VISION_MODEL") into a named variable (e.g., env_model)
and then set model = get_first_model(self.openai_vision_model) or
get_first_model(env_model), referencing the existing get_first_model function
and the self.openai_vision_model attribute to preserve current precedence and
behavior.

In `@services/operator/app/config.py`:
- Around line 77-89: The model properties can return None or malformed raw
strings when environment values parse to empty/invalid lists; update
get_first_model to return None for empty/invalid parses and then change
llm_model, llm_vision_model and llm_fast_model to validate that result and fail
fast: llm_model should raise a clear ValueError if
get_first_model(self.openai_model) yields None (so startup fails on invalid
OPENAI_MODEL), llm_vision_model should return None only when openai_vision_model
is falsy or when get_first_model(...) returns None, and llm_fast_model should
use get_first_model(self.openai_fast_model) but fall back to self.llm_model only
if that call returns a non-None model; otherwise raise a ValueError indicating
OPENAI_FAST_MODEL is invalid (or if optional, fall back to llm_model only when
safe). Ensure references are to get_first_model, llm_model, llm_vision_model,
llm_fast_model, openai_model, openai_vision_model, and openai_fast_model so the
change is locatable.

In `@services/operator/app/utils/model_list.py`:
- Around line 1-22: This file duplicates the same model-list parsing utilities
across services; extract the shared functions parse_model_list, get_first_model,
and get_first_model_or_raise from model_list.py into a common/shared utility
package (e.g., common.utils.model_list) and update each service to import those
functions instead of keeping local copies: move the implementations to the
shared package, remove the three local model_list.py copies, adjust imports in
services/operator, services/crawler, and services/rag to reference the new
common module, and ensure the shared package is added to each service’s
build/requirements and covered by unit tests.

In `@services/platform/app/hooks/use-auto-scroll.ts`:
- Around line 37-44: Update the comment/docs in use-auto-scroll.ts to reflect
the new invariant: state that the hook's core behavior always follows content
growth when the user is at the bottom, and that the enabled option only controls
whether shrinkage is followed (not global auto-scroll). Edit the
function/comment block for useAutoScroll and the options description to replace
wording that suggests enabled toggles all auto-scrolling with language that
enabled = "follow shrinkage" (or similar), and add a short note clarifying that
growth-following remains active regardless of enabled.

In
`@services/platform/app/routes/dashboard/`$id/custom-agents/$agentId/instructions.tsx:
- Around line 61-66: getDefaultModelId can return '' when presets aren't present
which allows initialModelId to be empty and saves an invalid agent.modelId;
locate getDefaultModelId and the code that builds initialModelId and either (A)
disable the model Select component until modelPresets is loaded (use the route
loader state or a modelPresets truthy check to set disabled on the Select) or
(B) add a guard inside handleSave (check agent.modelId or initialModelId is
non-empty) to abort saving and surface an error/validation message to the user;
update references to getDefaultModelId, initialModelId, the Select control, and
handleSave accordingly so an empty string cannot be persisted.

In `@services/platform/convex/agents/workflow/generate_response.ts`:
- Line 45: The use of getCodingModel() ?? '' is confusing because it relies on
downstream truthiness in createAgentConfig to pick a default; change the
initialization to an explicit fallback or enforce presence: either set model to
getCodingModel() || getDefaultModel() so the default is chosen clearly, or call
getCodingModelOrThrow() if a coding model is mandatory for workflow agents;
update the call site where model is passed into createAgentConfig to use one of
these two alternatives and ensure tests/consumers expect an explicit model or a
thrown error accordingly.

In `@services/platform/convex/predefined_workflows/website_scan.ts`:
- Around line 9-10: The flow comment at the top lists step slug
"crawl_and_upsert_pages" but the actual workflow uses "sync_pages" in
nextSteps/stepSlug; update the comment to match the real slug chain (or rename
the step slug to match the comment) so they are consistent. Locate the step
definitions that use nextSteps/stepSlug (check references to "sync_pages" and
any transition logic) and either change the comment to include "sync_pages" in
place of "crawl_and_upsert_pages" or rename the step slug and its occurrences to
"crawl_and_upsert_pages" so the documented chain and actual step names align.

In
`@services/platform/convex/workflow_engine/helpers/validation/variables/action_schemas.ts`:
- Around line 622-625: The schema for the field urlsToSync is declared as type
'array' but missing an items spec; update the action schema (the object that
contains urlsToSync) to add an items property describing the element type (e.g.,
items: { type: 'string', format: 'uri' } or items: { type: 'string' } as
appropriate) so validation mirrors other array fields like tags; modify the
urlsToSync definition in action_schemas.ts to include the items schema and any
required constraints (minItems, uniqueItems) consistent with surrounding array
definitions.

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 362380d and 28a0d28.

⛔ Files ignored due to path filters (1)
  • services/platform/convex/_generated/api.d.ts is excluded by !**/_generated/**
📒 Files selected for processing (62)
  • services/crawler/app/config.py
  • services/crawler/app/utils/model_list.py
  • services/crawler/tests/test_config.py
  • services/operator/app/config.py
  • services/operator/app/utils/__init__.py
  • services/operator/app/utils/model_list.py
  • services/operator/tests/conftest.py
  • services/operator/tests/test_config.py
  • services/platform/app/features/chat/components/__tests__/typewriter-text.test.tsx
  • services/platform/app/features/chat/components/typewriter-text.tsx
  • services/platform/app/features/chat/hooks/__tests__/use-stream-buffer.test.ts
  • services/platform/app/features/custom-agents/components/custom-agents-table.tsx
  • services/platform/app/features/custom-agents/hooks/use-custom-agents-table-config.tsx
  • services/platform/app/hooks/__tests__/use-auto-scroll.test.ts
  • services/platform/app/hooks/use-auto-scroll.ts
  • services/platform/app/routes/dashboard/$id/custom-agents/$agentId/instructions.tsx
  • services/platform/convex/agent_tools/files/helpers/analyze_text.ts
  • services/platform/convex/agent_tools/files/helpers/vision_agent.ts
  • services/platform/convex/agents/crm/agent.ts
  • services/platform/convex/agents/document/agent.ts
  • services/platform/convex/agents/integration/agent.ts
  • services/platform/convex/agents/web/agent.ts
  • services/platform/convex/agents/workflow/agent.ts
  • services/platform/convex/agents/workflow/generate_response.ts
  • services/platform/convex/conversations/improve_message.ts
  • services/platform/convex/convex.config.ts
  • services/platform/convex/custom_agents/config.ts
  • services/platform/convex/custom_agents/internal_queries.ts
  • services/platform/convex/custom_agents/mutations.ts
  • services/platform/convex/custom_agents/queries.ts
  • services/platform/convex/custom_agents/schema.ts
  • services/platform/convex/custom_agents/test_chat.test.ts
  • services/platform/convex/lib/agent_chat/internal_actions.ts
  • services/platform/convex/lib/agent_chat/types.ts
  • services/platform/convex/lib/agent_response/generate_response.ts
  • services/platform/convex/lib/agent_runtime_config.ts
  • services/platform/convex/lib/create_agent_config.ts
  • services/platform/convex/lib/embedding_config.test.ts
  • services/platform/convex/lib/embedding_config.ts
  • services/platform/convex/lib/summarize_context.ts
  • services/platform/convex/predefined_workflows/website_scan.ts
  • services/platform/convex/website_page_embeddings/embedding_pool.ts
  • services/platform/convex/website_page_embeddings/internal_actions.ts
  • services/platform/convex/websites/bulk_upsert_pages.ts
  • services/platform/convex/websites/cleanup_website.ts
  • services/platform/convex/websites/helpers.ts
  • services/platform/convex/websites/internal_mutations.ts
  • services/platform/convex/websites/register_urls.ts
  • services/platform/convex/workflow_engine/action_defs/website_pages/helpers/types.ts
  • services/platform/convex/workflow_engine/action_defs/website_pages/website_pages_action.ts
  • services/platform/convex/workflow_engine/helpers/nodes/llm/utils/validate_and_normalize_config.ts
  • services/platform/convex/workflow_engine/helpers/validation/variables/action_schemas.ts
  • services/platform/convex/workflow_engine/workflow_syntax_compact.ts
  • services/platform/convex/workflows/triggers/actions.ts
  • services/platform/lib/shared/schemas/custom_agents.ts
  • services/platform/lib/shared/utils/model-list.test.ts
  • services/platform/lib/shared/utils/model-list.ts
  • services/platform/messages/en.json
  • services/rag/app/config.py
  • services/rag/app/utils/__init__.py
  • services/rag/app/utils/model_list.py
  • services/rag/tests/test_config.py

def get_vision_model(self) -> str:
"""Get Vision model from CRAWLER_OPENAI_VISION_MODEL or OPENAI_VISION_MODEL."""
model = self.openai_vision_model or os.environ.get("OPENAI_VISION_MODEL")
model = get_first_model(self.openai_vision_model) or get_first_model(os.environ.get("OPENAI_VISION_MODEL"))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider splitting long line for readability.

The logic is correct—prioritizing the crawler-prefixed config over the generic environment variable and extracting the first model from comma-separated lists. However, the line is quite long.

♻️ Optional: Split for readability
     def get_vision_model(self) -> str:
         """Get Vision model from CRAWLER_OPENAI_VISION_MODEL or OPENAI_VISION_MODEL."""
-        model = get_first_model(self.openai_vision_model) or get_first_model(os.environ.get("OPENAI_VISION_MODEL"))
+        model = get_first_model(self.openai_vision_model) or get_first_model(
+            os.environ.get("OPENAI_VISION_MODEL")
+        )
         if not model:
             raise ValueError("OPENAI_VISION_MODEL must be set in environment.")
         return model
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/config.py` at line 67, Split the long assignment to
improve readability by first resolving the environment value and then calling
get_first_model; e.g., extract os.environ.get("OPENAI_VISION_MODEL") into a
named variable (e.g., env_model) and then set model =
get_first_model(self.openai_vision_model) or get_first_model(env_model),
referencing the existing get_first_model function and the
self.openai_vision_model attribute to preserve current precedence and behavior.

def get_fast_model(self) -> str:
"""Get Fast LLM model from CRAWLER_OPENAI_FAST_MODEL or OPENAI_FAST_MODEL."""
model = self.openai_fast_model or os.environ.get("OPENAI_FAST_MODEL")
model = get_first_model(self.openai_fast_model) or get_first_model(os.environ.get("OPENAI_FAST_MODEL"))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Same readability suggestion applies here.

♻️ Optional: Split for readability
     def get_fast_model(self) -> str:
         """Get Fast LLM model from CRAWLER_OPENAI_FAST_MODEL or OPENAI_FAST_MODEL."""
-        model = get_first_model(self.openai_fast_model) or get_first_model(os.environ.get("OPENAI_FAST_MODEL"))
+        model = get_first_model(self.openai_fast_model) or get_first_model(
+            os.environ.get("OPENAI_FAST_MODEL")
+        )
         if not model:
             raise ValueError("OPENAI_FAST_MODEL must be set in environment.")
         return model
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
model = get_first_model(self.openai_fast_model) or get_first_model(os.environ.get("OPENAI_FAST_MODEL"))
def get_fast_model(self) -> str:
"""Get Fast LLM model from CRAWLER_OPENAI_FAST_MODEL or OPENAI_FAST_MODEL."""
model = get_first_model(self.openai_fast_model) or get_first_model(
os.environ.get("OPENAI_FAST_MODEL")
)
if not model:
raise ValueError("OPENAI_FAST_MODEL must be set in environment.")
return model
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/crawler/app/config.py` at line 74, The line assigning model is terse
and hard to read; split it into two clear steps: first evaluate the preferred
source(s) (e.g., assign a variable like preferred_model = self.openai_fast_model
and fallback_env = os.environ.get("OPENAI_FAST_MODEL")), then call
get_first_model(preferred_model) and if that returns falsy call
get_first_model(fallback_env) and assign the result to model; update the code
around the get_first_model call so the intent is explicit and easier to debug.

Comment on lines 77 to 89
def llm_model(self) -> str:
return self.openai_model
return get_first_model(self.openai_model) or self.openai_model

@property
def llm_vision_model(self) -> str | None:
"""Return vision model if configured, None otherwise."""
return self.openai_vision_model if self.openai_vision_model else None
return get_first_model(self.openai_vision_model) if self.openai_vision_model else None

@property
def llm_fast_model(self) -> str:
"""Return fast model for summarization, falls back to coding model."""
return self.openai_fast_model if self.openai_fast_model else self.openai_model
return get_first_model(self.openai_fast_model) if self.openai_fast_model else self.llm_model

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Harden model-list parsing to avoid invalid/None model values.

llm_fast_model can currently return None when OPENAI_FAST_MODEL is set but parses to an empty list (e.g., whitespace/comma-only). Also, llm_model can fall back to malformed raw strings instead of failing fast.

🔧 Proposed fix
-from app.utils.model_list import get_first_model
+from app.utils.model_list import get_first_model, get_first_model_or_raise
@@
     `@property`
     def llm_model(self) -> str:
-        return get_first_model(self.openai_model) or self.openai_model
+        return get_first_model_or_raise(self.openai_model, "OPENAI_MODEL")
@@
     `@property`
     def llm_fast_model(self) -> str:
         """Return fast model for summarization, falls back to coding model."""
-        return get_first_model(self.openai_fast_model) if self.openai_fast_model else self.llm_model
+        fast_model = get_first_model(self.openai_fast_model) if self.openai_fast_model else None
+        return fast_model or self.llm_model

Based on learnings: Enforce fail-fast validation for required configuration at service startup so invalid config fails early rather than causing runtime issues.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/operator/app/config.py` around lines 77 - 89, The model properties
can return None or malformed raw strings when environment values parse to
empty/invalid lists; update get_first_model to return None for empty/invalid
parses and then change llm_model, llm_vision_model and llm_fast_model to
validate that result and fail fast: llm_model should raise a clear ValueError if
get_first_model(self.openai_model) yields None (so startup fails on invalid
OPENAI_MODEL), llm_vision_model should return None only when openai_vision_model
is falsy or when get_first_model(...) returns None, and llm_fast_model should
use get_first_model(self.openai_fast_model) but fall back to self.llm_model only
if that call returns a non-None model; otherwise raise a ValueError indicating
OPENAI_FAST_MODEL is invalid (or if optional, fall back to llm_model only when
safe). Ensure references are to get_first_model, llm_model, llm_vision_model,
llm_fast_model, openai_model, openai_vision_model, and openai_fast_model so the
change is locatable.

Comment on lines +1 to +22
"""Utilities for parsing comma-separated model lists from environment variables."""


def parse_model_list(value: str | None) -> list[str]:
"""Parse a comma-separated model list into a list of trimmed, non-empty strings."""
if not value:
return []
return [m.strip() for m in value.split(",") if m.strip()]


def get_first_model(value: str | None) -> str | None:
"""Get the first model from a comma-separated model list."""
models = parse_model_list(value)
return models[0] if models else None


def get_first_model_or_raise(value: str | None, var_name: str) -> str:
"""Get the first model from a comma-separated model list, or raise if none available."""
model = get_first_model(value)
if not model:
raise ValueError(f"{var_name} is not set or contains no valid models.")
return model

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider extracting shared utilities to a common package.

This module is duplicated verbatim across three services:

  • services/operator/app/utils/model_list.py
  • services/crawler/app/utils/model_list.py
  • services/rag/app/utils/model_list.py

While duplication across independently deployed services can be acceptable to avoid coupling, if these services share a build system or common package infrastructure, consider extracting this into a shared utility package to reduce maintenance burden.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/operator/app/utils/model_list.py` around lines 1 - 22, This file
duplicates the same model-list parsing utilities across services; extract the
shared functions parse_model_list, get_first_model, and get_first_model_or_raise
from model_list.py into a common/shared utility package (e.g.,
common.utils.model_list) and update each service to import those functions
instead of keeping local copies: move the implementations to the shared package,
remove the three local model_list.py copies, adjust imports in
services/operator, services/crawler, and services/rag to reference the new
common module, and ensure the shared package is added to each service’s
build/requirements and covered by unit tests.

Comment on lines +37 to +44
* useAutoScroll - Position-based auto-scrolling for growing content.
*
* Uses a single principle: "Was the user at the bottom before content grew?"
* - Yes → scroll to the new bottom (follow the stream)
* - No → don't scroll (user is reading above)
* Core invariant: "If the user is at the bottom and content grows, follow it."
*
* Position is the sole source of truth — no input-event sniffing, no direction
* tracking. A programmatic scroll guard distinguishes our own scrollTo calls
* from user scrolls, preventing async scroll events from falsely updating the
* position state. Works correctly with ALL scroll methods (mouse wheel,
* trackpad, scrollbar drag, keyboard, programmatic scroll).
* The ResizeObserver and scroll handler run continuously — position is the sole
* source of truth, not streaming state. The `enabled` flag only controls whether
* shrinkage is also followed (useful during streaming when markdown restructuring
* can temporarily reduce height).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Align enabled API documentation with the new behavior.

The new invariant says enabled only controls shrinkage follow, but the options doc still implies it toggles auto-scroll globally. Please make both docs consistent to avoid misuse.

📝 Suggested doc update
 interface UseAutoScrollOptions {
   /**
-   * Whether auto-scroll should be active (e.g., during streaming)
+   * Whether shrinkage should also be followed (typically while streaming).
+   * Growth is followed whenever the user is at bottom.
    */
   enabled: boolean;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/app/hooks/use-auto-scroll.ts` around lines 37 - 44, Update
the comment/docs in use-auto-scroll.ts to reflect the new invariant: state that
the hook's core behavior always follows content growth when the user is at the
bottom, and that the enabled option only controls whether shrinkage is followed
(not global auto-scroll). Edit the function/comment block for useAutoScroll and
the options description to replace wording that suggests enabled toggles all
auto-scrolling with language that enabled = "follow shrinkage" (or similar), and
add a short note clarifying that growth-following remains active regardless of
enabled.

Comment on lines +61 to +66
function getDefaultModelId(
preset: string,
presets: Record<string, string[]> | undefined,
): string {
return presets?.[preset]?.[0] ?? '';
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider guarding against empty modelId.

getDefaultModelId returns an empty string when presets is undefined or the preset key doesn't exist. If agent.modelId is also unset and modelPresets hasn't loaded yet, initialModelId will be '', which propagates to the form and could be saved.

The route loader prefetches modelPresets, so in practice this window should be brief. Consider either:

  1. Disabling the Select until modelPresets is loaded, or
  2. Adding a guard in handleSave to prevent saving an empty modelId.
💡 Optional guard in handleSave
 const handleSave = useCallback(
   async (data: InstructionsFormData) => {
+    if (!data.modelId) return; // Skip save if modelId not yet resolved
+
     const modelPreset = modelPresets
       ? derivePresetFromModelId(data.modelId, modelPresets)
       : 'standard';
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@services/platform/app/routes/dashboard/`$id/custom-agents/$agentId/instructions.tsx
around lines 61 - 66, getDefaultModelId can return '' when presets aren't
present which allows initialModelId to be empty and saves an invalid
agent.modelId; locate getDefaultModelId and the code that builds initialModelId
and either (A) disable the model Select component until modelPresets is loaded
(use the route loader state or a modelPresets truthy check to set disabled on
the Select) or (B) add a guard inside handleSave (check agent.modelId or
initialModelId is non-empty) to abort saving and surface an error/validation
message to the user; update references to getDefaultModelId, initialModelId, the
Select control, and handleSave accordingly so an empty string cannot be
persisted.

createAgent: (options) =>
createWorkflowAgent({ withTools: true, delegationMode, ...options }),
model: process.env.OPENAI_CODING_MODEL || '',
model: getCodingModel() ?? '',

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# First, let's locate and examine generate_response.ts
echo "=== File structure check ==="
fd "generate_response.ts" services/platform/convex/agents/workflow/

echo
echo "=== Content around line 45 in generate_response.ts ==="
head -60 services/platform/convex/agents/workflow/generate_response.ts

echo
echo "=== getCodingModel function definition ==="
rg "getCodingModel" services/platform/convex/ -A 5 -B 2

echo
echo "=== Full generate_response.ts (limited to 150 lines for analysis) ==="
head -150 services/platform/convex/agents/workflow/generate_response.ts

Repository: tale-project/tale

Length of output: 8889


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== generateAgentResponse function definition ==="
find services/platform/convex/lib -name "*.ts" -o -name "*.tsx" | xargs grep -l "generateAgentResponse" | head -5

echo
echo "=== Generate Agent Response implementation ==="
fd "generate_response.ts" services/platform/convex/lib/ -x cat {} | head -200

echo
echo "=== How model is used in agent creation and API calls ==="
rg "model:" services/platform/convex/lib/create_agent_config.ts -A 3 -B 3

echo
echo "=== Check if model is validated before API use ==="
rg "function generateAgentResponse" services/platform/convex/lib -A 50 | head -80

Repository: tale-project/tale

Length of output: 7055


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Search for where model is used in generateAgentResponse ==="
grep -n "model" services/platform/convex/lib/agent_response/generate_response.ts | head -30

echo
echo "=== Check createAgent call and how model is passed ==="
sed -n '1,250p' services/platform/convex/lib/agent_response/generate_response.ts | grep -A 20 "createAgent"

echo
echo "=== Check agent creation in workflow agent (where model validation happens) ==="
sed -n '50,100p' services/platform/convex/agents/workflow/agent.ts

echo
echo "=== Full agent.ts to see model fallback ==="
head -80 services/platform/convex/agents/workflow/agent.ts

Repository: tale-project/tale

Length of output: 5144


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check how other agents handle model fallback ==="
fd "generate_response.ts" services/platform/convex/agents/ -x grep -H "model:" {}

echo
echo "=== Compare agent factory patterns across the codebase ==="
rg "getCodingModel\(\) \?\? ''" services/platform/convex/ 

echo
echo "=== Check if getCodingModelOrThrow is used elsewhere in similar patterns ==="
rg "getCodingModelOrThrow\(\)" services/platform/convex/agents/ -B 2 -A 2

echo
echo "=== Verify the nullish coalescing behavior in agent.ts ==="
sed -n '75,90p' services/platform/convex/agents/workflow/agent.ts

echo
echo "=== Check how createAgentConfig uses the model parameter ==="
rg "function createAgentConfig\|const createAgentConfig" services/platform/convex/lib/create_agent_config.ts -A 20

Repository: tale-project/tale

Length of output: 1210


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check createAgentConfig definition and how it handles model ==="
cat services/platform/convex/lib/create_agent_config.ts

echo
echo "=== Check how Agent constructor uses the model in agent config ==="
rg "class Agent\|new Agent" services/platform/convex -B 2 -A 5 | head -40

echo
echo "=== Verify if empty string would cause issues at SDK level ==="
rg "model.*\|\|.*model\|model.*??.*model" services/platform/convex/lib/create_agent_config.ts

Repository: tale-project/tale

Length of output: 6924


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check if getDefaultAgentRuntimeConfig exists and how it's used ==="
rg "getDefaultAgentRuntimeConfig" services/platform/convex/ -B 2 -A 5

echo
echo "=== Check getDefaultModel to understand the fallback ==="
rg "function getDefaultModel\|const getDefaultModel" services/platform/convex/lib/ -A 5

echo
echo "=== Trace complete flow: empty string through the chain ==="
echo "1. generate_response.ts passes: getCodingModel() ?? '' "
echo "2. createWorkflowAgent receives options.model = ''"
echo "   - Line: const model = options?.model ?? getCodingModelOrThrow()"
echo "   - Result: '' is NOT nullish, so model = '' (does NOT throw)"
echo "3. createAgentConfig receives model = ''"
echo "   - truthiness check: if (opts.model) → '' is falsy → uses getDefaultModel()"
echo ""
echo "=== Verify if this pattern is used elsewhere or if there's a preferred pattern ==="
rg "generateWorkflowResponse\|createWorkflowAgent" services/platform/convex/ -B 5 | grep -A 5 "model:"

Repository: tale-project/tale

Length of output: 7442


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Get the full getDefaultModel implementation ==="
rg "function getDefaultModel|export.*getDefaultModel" services/platform/convex/lib/agent_runtime_config.ts -A 10

echo
echo "=== Confirm the truthiness check behavior in createAgentConfig ==="
sed -n '95,110p' services/platform/convex/lib/create_agent_config.ts

echo
echo "=== Check other agents to see if they use similar or different patterns ==="
rg "generateAgentResponse" services/platform/convex/agents/ -B 20 | grep -A 10 "model:"

echo
echo "=== Verify the learning about using getDefaultAgentRuntimeConfig ==="
echo "According to PR `#397` learning, agents should use getDefaultAgentRuntimeConfig()"
echo "Let's see if chat_agent uses that pattern for reference"
rg "generateAgentResponse\|generateChatResponse" services/platform/convex/agents/chat/ -A 20 | head -40

Repository: tale-project/tale

Length of output: 1198


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Get the actual getModel function in createAgentConfig ==="
sed -n '95,120p' services/platform/convex/lib/create_agent_config.ts

echo
echo "=== Check if getCodingModel vs getCodingModelOrThrow pattern is documented ==="
cat services/platform/convex/lib/agent_runtime_config.ts | head -50

echo
echo "=== Verify that empty string is indeed falsy and will fall back to getDefaultModel ==="
echo "In TypeScript/JavaScript:"
echo "  '' (empty string) is falsy"
echo "  if ('') → false"
echo "  So: if (opts.model) where opts.model = '' → will call getDefaultModel()"
echo ""
echo "=== Check if there's any direct API call that uses model before createAgentConfig ==="
rg "getCodingModel\(\)" services/platform/convex/agents/workflow/ -B 5 -A 10

Repository: tale-project/tale

Length of output: 3606


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check how chat agent (if it exists) handles model ==="
fd "generate_response.ts" services/platform/convex/agents/chat/ -x cat {}

echo
echo "=== Check if there are other agents and their model handling patterns ==="
fd "generate_response.ts" services/platform/convex/agents/

echo
echo "=== Verify the complete call chain: model parameter in generateAgentResponse ==="
echo "generateAgentResponse receives config.model"
echo "Let's see where it's actually used in generateAgentResponse"
sed -n '315,330p' services/platform/convex/lib/agent_response/generate_response.ts

echo
echo "=== Check if model is ever used directly before being passed to createAgent ==="
sed -n '315,400p' services/platform/convex/lib/agent_response/generate_response.ts | grep -i "model"

Repository: tale-project/tale

Length of output: 977


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Find where createAgent is called within generateAgentResponse ==="
rg "createAgent" services/platform/convex/lib/agent_response/generate_response.ts -B 3 -A 3

echo
echo "=== Get the context around line 400-420 where the agent is created ==="
sed -n '400,420p' services/platform/convex/lib/agent_response/generate_response.ts

echo
echo "=== Search for all places where createAgent is invoked ==="
rg "await.*createAgent\|createAgent\(" services/platform/convex/lib/agent_response/generate_response.ts -B 2 -A 2

Repository: tale-project/tale

Length of output: 2295


Reconsider the model initialization pattern for clarity and consistency.

The empty string fallback is actually safe—createAgentConfig treats '' as falsy and falls back to getDefaultModel(). However, the pattern getCodingModel() ?? '' is unintuitive because empty string is not nullish; the truthiness check downstream only works by accident.

Instead, either:

  • Use getCodingModel() || getDefaultModel() for explicit fallback, or
  • Use getCodingModelOrThrow() if coding model is mandatory for workflow agents

This aligns with the codebase pattern where agents source runtime config centrally (per your PR #397 guidance) rather than relying on implicit truthiness checks.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/agents/workflow/generate_response.ts` at line 45,
The use of getCodingModel() ?? '' is confusing because it relies on downstream
truthiness in createAgentConfig to pick a default; change the initialization to
an explicit fallback or enforce presence: either set model to getCodingModel()
|| getDefaultModel() so the default is chosen clearly, or call
getCodingModelOrThrow() if a coding model is mandatory for workflow agents;
update the call site where model is passed into createAgentConfig to use one of
these two alternatives and ensure tests/consumers expect an explicit model or a
thrown error accordingly.

Comment on lines 9 to +10
* start → fetch_main_page → update_metadata → query_urls → register_urls
* → sync_pages → check_has_more → (loop or update_status)
* → crawl_and_upsert_pages → check_has_more → (loop or update_status)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Flow documentation no longer matches the actual step slug chain.

At Line 10, the flow says crawl_and_upsert_pages, but the workflow transitions through sync_pages (see nextSteps/stepSlug). Please align the comment with the actual slug names (or rename the slug chain consistently).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/platform/convex/predefined_workflows/website_scan.ts` around lines 9
- 10, The flow comment at the top lists step slug "crawl_and_upsert_pages" but
the actual workflow uses "sync_pages" in nextSteps/stepSlug; update the comment
to match the real slug chain (or rename the step slug to match the comment) so
they are consistent. Locate the step definitions that use nextSteps/stepSlug
(check references to "sync_pages" and any transition logic) and either change
the comment to include "sync_pages" in place of "crawl_and_upsert_pages" or
rename the step slug and its occurrences to "crawl_and_upsert_pages" so the
documented chain and actual step names align.

Comment on lines +622 to +625
urlsToSync: {
type: 'array',
description: 'URLs that need content fetching',
},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Missing items specification for array field.

The urlsToSync field is typed as array but lacks an items specification to describe its element type. Other array fields in this file (e.g., line 142-143 for tags) include items definitions.

📝 Suggested fix
       urlsToSync: {
         type: 'array',
         description: 'URLs that need content fetching',
+        items: { type: 'string' },
       },
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
urlsToSync: {
type: 'array',
description: 'URLs that need content fetching',
},
urlsToSync: {
type: 'array',
description: 'URLs that need content fetching',
items: { type: 'string' },
},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@services/platform/convex/workflow_engine/helpers/validation/variables/action_schemas.ts`
around lines 622 - 625, The schema for the field urlsToSync is declared as type
'array' but missing an items spec; update the action schema (the object that
contains urlsToSync) to add an items property describing the element type (e.g.,
items: { type: 'string', format: 'uri' } or items: { type: 'string' } as
appropriate) so validation mirrors other array fields like tags; modify the
urlsToSync definition in action_schemas.ts to include the items schema and any
required constraints (minItems, uniqueItems) consistent with surrounding array
definitions.

@larryro larryro merged commit 12c618e into main Feb 26, 2026
17 checks passed
@larryro larryro deleted the refactor/website-scan-naming-and-embedding-pool branch February 26, 2026 06:00
yannickmonney pushed a commit that referenced this pull request Apr 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant