Skip to content

optimize rag action#4

Merged
larryro merged 3 commits into
mainfrom
optimize-rag-action
Dec 3, 2025
Merged

optimize rag action#4
larryro merged 3 commits into
mainfrom
optimize-rag-action

Conversation

@larryro

@larryro larryro commented Dec 3, 2025

Copy link
Copy Markdown
Collaborator

Summary by CodeRabbit

Release Notes

  • Improvements
    • Updated default scheduling for Document RAG Sync, Product RAG Sync, Website Pages RAG Sync, and Customer RAG Sync from hourly to every 20 minutes
    • Standardized workflow processing terminology: renamed document-related fields to record-based identifiers for consistency across workflow execution and processing APIs
    • Simplified RAG upload action parameters—now accepts recordId instead of multiple document fields
    • Enhanced workflow execution tracking with root workflow definition reference

✏️ Tip: You can customize this high-level summary in your review settings.

…ories

- Move action helper files and types into dedicated /helpers subdirectories
  across multiple workflow action modules:
  - approval/
  - conversation/
  - crawler/
  - imap/
  - integration/
  - rag/
  - website/
  - websitePages/
  - workflow/
  - workflow_processing_records/

- Simplify RAG action interface:
  - Replace documentId/organizationId params with single recordId
  - Remove forceReupload and includeMetadata options
  - Remove explicit timeout parameters from workflow definitions
  - Update upload_text and upload_document operations accordingly

- Update predefined workflow definitions to use new recordId parameter:
  - customer_rag_sync.ts
  - document_rag_sync.ts
  - product_rag_sync.ts
  - website_pages_rag_sync.ts

- Update all import paths across IMAP library files and action modules
- Remove obsolete README documentation files (imap, product, rag)
- Regenerate Convex API type definitions
- Rename `documentId` to `recordId` across workflow processing records
- Rename `documentCreationTime` to `recordCreationTime`
- Rename `workflowId` to `wfDefinitionId` for consistency with schema
- Add `rootWfDefinitionId` to wfExecutions for tracking workflow family versions
- Update schema indexes: `by_document` -> `by_record`, `by_org_table_workflow_*` -> `by_org_table_wfDefinition_*`
- Update predefined workflows to use `{{rootWfDefinitionId}}` instead of `{{workflowId}}`
- Change default RAG sync schedules from hourly to every 20 minutes
- Remove unused `create_execution.ts` and related types
- Rename helper function `isDocumentProcessed` to `isRecordProcessed`
@larryro larryro force-pushed the optimize-rag-action branch from c8f6e70 to e2378b4 Compare December 3, 2025 10:27
@larryro larryro merged commit ab75157 into main Dec 3, 2025
1 check passed
@larryro larryro deleted the optimize-rag-action branch December 3, 2025 10:30
@coderabbitai

coderabbitai Bot commented Dec 3, 2025

Copy link
Copy Markdown
Contributor

Caution

Review failed

The pull request is closed.

📝 Walkthrough

Walkthrough

This PR systematically refactors the workflow processing and execution system by renaming fields from document-centric to record-centric terminology (documentId → recordId, documentCreationTime → recordCreationTime), renaming workflowId to wfDefinitionId, and introducing rootWfDefinitionId to track workflow versions. It reorganizes helper modules into ./helpers/ subdirectories across multiple action types, refactors RAG upload APIs from multi-parameter to object-based signatures, removes the createExecution function, updates database schema indexes to reflect record terminology, and adjusts default RAG sync workflow schedules from hourly to 20 minutes. Import paths throughout are updated to accommodate the new directory structure.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Specific areas requiring attention:

  • RAG action refactoring (rag_action.ts, upload_file_direct.ts, upload_text_document.ts): Signature changes from multi-parameter to object-based API, metadata field restructuring, and recordId propagation logic need verification across all call sites
  • Schema index renames (schema.ts): Verify all index reference updates (by_org_table_workflow → by_org_table_wfDefinition, by_document → by_record) are correctly reflected throughout model files and query builders
  • Workflow execution changes (start_workflow_handler.ts, initialize_execution_variables.ts): rootWfDefinitionId introduction and propagation into execution payload and variables must be traced through the entire workflow lifecycle
  • Predefined workflows (10+ files): Consistent parameter updates across all workflow definitions—verify wfDefinitionId/recordId substitutions match the new action reference signatures
  • Helper module relocations: Multiple files moved to helpers/ subdirectories with corresponding import path updates—verify all references are correctly updated and no circular dependencies introduced
  • Removed function impacts (createExecution removal): Verify all callers have been updated and the execution creation logic has been properly relocated/refactored

Possibly related PRs

  • tale-project/poc2#354: Directly modifies the same ACTION_REFERENCE metadata for record_processed operation with matching field renamings (documentId/workflowId → recordId/wfDefinitionId)
  • tale-project/poc2#447: Parallel coordinated changes to workflow identifiers and APIs, updating workflow start/processing helpers with wfDefinitionId and record terminology throughout
  • tale-project/poc2#393: Overlapping RAG upload API refactoring with identical parameter shape changes (documentId → recordId) and uploadTextDocument/uploadFileDirect signature updates

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro (Legacy)

📥 Commits

Reviewing files that changed from the base of the PR and between 3c13944 and e2378b4.

⛔ Files ignored due to path filters (1)
  • services/platform/convex/_generated/api.d.ts is excluded by !**/_generated/**
📒 Files selected for processing (97)
  • services/platform/convex/agent_tools/convex_tools/workflows/action_reference.ts (1 hunks)
  • services/platform/convex/lib/create_workflow_agent.ts (1 hunks)
  • services/platform/convex/lib/variables/jexl_instance.ts (1 hunks)
  • services/platform/convex/model/organizations/save_default_workflows.ts (3 hunks)
  • services/platform/convex/model/wf_executions/create_execution.ts (0 hunks)
  • services/platform/convex/model/wf_executions/index.ts (0 hunks)
  • services/platform/convex/model/wf_executions/types.ts (0 hunks)
  • services/platform/convex/model/workflow_processing_records/find_product_recommendation_by_status.ts (3 hunks)
  • services/platform/convex/model/workflow_processing_records/find_unprocessed.ts (3 hunks)
  • services/platform/convex/model/workflow_processing_records/find_unprocessed_open_conversation.ts (3 hunks)
  • services/platform/convex/model/workflow_processing_records/helpers/find_unprocessed_with_custom_query.ts (3 hunks)
  • services/platform/convex/model/workflow_processing_records/helpers/get_latest_processed_creation_time.ts (3 hunks)
  • services/platform/convex/model/workflow_processing_records/helpers/is_document_processed.ts (3 hunks)
  • services/platform/convex/model/workflow_processing_records/helpers/run_query.ts (3 hunks)
  • services/platform/convex/model/workflow_processing_records/index.ts (1 hunks)
  • services/platform/convex/model/workflow_processing_records/record_processed.ts (3 hunks)
  • services/platform/convex/model/workflow_processing_records/types.ts (1 hunks)
  • services/platform/convex/node_only/imap/lib/build_email_message.ts (1 hunks)
  • services/platform/convex/node_only/imap/lib/collect_thread_message_ids.ts (1 hunks)
  • services/platform/convex/node_only/imap/lib/fetch_and_parse_message.ts (1 hunks)
  • services/platform/convex/node_only/imap/lib/fetch_email_by_uid.ts (1 hunks)
  • services/platform/convex/node_only/imap/lib/fetch_messages_from_search_results.ts (1 hunks)
  • services/platform/convex/node_only/imap/lib/find_root_message.ts (1 hunks)
  • services/platform/convex/node_only/imap/lib/search_thread_messages.ts (1 hunks)
  • services/platform/convex/node_only/imap/retrieve_imap_emails.ts (1 hunks)
  • services/platform/convex/predefined_workflows/conversation_auto_reply.ts (2 hunks)
  • services/platform/convex/predefined_workflows/customer_rag_sync.ts (3 hunks)
  • services/platform/convex/predefined_workflows/document_rag_sync.ts (3 hunks)
  • services/platform/convex/predefined_workflows/general_customer_status_assessment.ts (2 hunks)
  • services/platform/convex/predefined_workflows/general_product_recommendation.ts (3 hunks)
  • services/platform/convex/predefined_workflows/loopi_customer_status_assessment.ts (2 hunks)
  • services/platform/convex/predefined_workflows/loopi_product_recommendation.ts (3 hunks)
  • services/platform/convex/predefined_workflows/onedrive_sync.ts (2 hunks)
  • services/platform/convex/predefined_workflows/product_rag_sync.ts (3 hunks)
  • services/platform/convex/predefined_workflows/product_recommendation_email.ts (2 hunks)
  • services/platform/convex/predefined_workflows/product_relationship_analysis.ts (2 hunks)
  • services/platform/convex/predefined_workflows/website_pages_rag_sync.ts (3 hunks)
  • services/platform/convex/schema.ts (2 hunks)
  • services/platform/convex/workflow/actions/approval/approval_action.ts (1 hunks)
  • services/platform/convex/workflow/actions/approval/helpers/create_approval.ts (1 hunks)
  • services/platform/convex/workflow/actions/approval/helpers/get_approval.ts (1 hunks)
  • services/platform/convex/workflow/actions/approval/helpers/get_approval_history.ts (1 hunks)
  • services/platform/convex/workflow/actions/approval/helpers/list_approvals_for_execution.ts (1 hunks)
  • services/platform/convex/workflow/actions/approval/helpers/list_pending_approvals.ts (1 hunks)
  • services/platform/convex/workflow/actions/approval/helpers/list_pending_approvals_for_execution.ts (1 hunks)
  • services/platform/convex/workflow/actions/approval/helpers/types.ts (1 hunks)
  • services/platform/convex/workflow/actions/approval/helpers/update_approval_status.ts (1 hunks)
  • services/platform/convex/workflow/actions/conversation/conversation_action.ts (1 hunks)
  • services/platform/convex/workflow/actions/conversation/helpers/add_message_to_conversation.ts (1 hunks)
  • services/platform/convex/workflow/actions/conversation/helpers/build_conversation_metadata.ts (1 hunks)
  • services/platform/convex/workflow/actions/conversation/helpers/build_email_metadata.ts (1 hunks)
  • services/platform/convex/workflow/actions/conversation/helpers/build_initial_message.ts (1 hunks)
  • services/platform/convex/workflow/actions/conversation/helpers/create_conversation.ts (2 hunks)
  • services/platform/convex/workflow/actions/conversation/helpers/create_conversation_from_email.ts (1 hunks)
  • services/platform/convex/workflow/actions/conversation/helpers/create_conversation_from_sent_email.ts (2 hunks)
  • services/platform/convex/workflow/actions/conversation/helpers/find_or_create_customer_from_email.ts (1 hunks)
  • services/platform/convex/workflow/actions/conversation/helpers/find_related_conversation.ts (1 hunks)
  • services/platform/convex/workflow/actions/conversation/helpers/get_conversation_by_id.ts (1 hunks)
  • services/platform/convex/workflow/actions/conversation/helpers/query_conversation_messages.ts (2 hunks)
  • services/platform/convex/workflow/actions/conversation/helpers/query_conversations.ts (2 hunks)
  • services/platform/convex/workflow/actions/conversation/helpers/query_latest_message_by_delivery_state.ts (2 hunks)
  • services/platform/convex/workflow/actions/conversation/helpers/types.ts (1 hunks)
  • services/platform/convex/workflow/actions/conversation/helpers/update_conversations.ts (2 hunks)
  • services/platform/convex/workflow/actions/conversation/helpers/update_message.ts (1 hunks)
  • services/platform/convex/workflow/actions/crawler/crawler_action.ts (1 hunks)
  • services/platform/convex/workflow/actions/crawler/helpers/types.ts (1 hunks)
  • services/platform/convex/workflow/actions/imap/README.md (0 hunks)
  • services/platform/convex/workflow/actions/imap/helpers/get_imap_credentials.ts (1 hunks)
  • services/platform/convex/workflow/actions/imap/helpers/types.ts (1 hunks)
  • services/platform/convex/workflow/actions/imap/imap_action.ts (1 hunks)
  • services/platform/convex/workflow/actions/integration/helpers/build_secrets_from_integration.ts (1 hunks)
  • services/platform/convex/workflow/actions/integration/integration_action.ts (1 hunks)
  • services/platform/convex/workflow/actions/product/README.md (0 hunks)
  • services/platform/convex/workflow/actions/rag/README.md (0 hunks)
  • services/platform/convex/workflow/actions/rag/helpers/get_document_info.ts (4 hunks)
  • services/platform/convex/workflow/actions/rag/helpers/get_rag_config.ts (1 hunks)
  • services/platform/convex/workflow/actions/rag/helpers/types.ts (3 hunks)
  • services/platform/convex/workflow/actions/rag/helpers/upload_file_direct.ts (4 hunks)
  • services/platform/convex/workflow/actions/rag/helpers/upload_text_document.ts (3 hunks)
  • services/platform/convex/workflow/actions/rag/rag_action.ts (3 hunks)
  • services/platform/convex/workflow/actions/website/website_action.ts (1 hunks)
  • services/platform/convex/workflow/actions/websitePages/helpers/types.ts (1 hunks)
  • services/platform/convex/workflow/actions/websitePages/websitePages_action.ts (1 hunks)
  • services/platform/convex/workflow/actions/workflow/helpers/types.ts (1 hunks)
  • services/platform/convex/workflow/actions/workflow/helpers/upload_workflows.ts (4 hunks)
  • services/platform/convex/workflow/actions/workflow/workflow_action.ts (1 hunks)
  • services/platform/convex/workflow/actions/workflow_processing_records/helpers/find_product_recommendation_by_status.ts (2 hunks)
  • services/platform/convex/workflow/actions/workflow_processing_records/helpers/find_unprocessed.ts (3 hunks)
  • services/platform/convex/workflow/actions/workflow_processing_records/helpers/find_unprocessed_open_conversation.ts (1 hunks)
  • services/platform/convex/workflow/actions/workflow_processing_records/helpers/record_processed.ts (1 hunks)
  • services/platform/convex/workflow/actions/workflow_processing_records/helpers/types.ts (2 hunks)
  • services/platform/convex/workflow/actions/workflow_processing_records/record_processed.ts (0 hunks)
  • services/platform/convex/workflow/actions/workflow_processing_records/workflow_processing_records_action.ts (8 hunks)
  • services/platform/convex/workflow/helpers/engine/start_workflow_handler.ts (2 hunks)
  • services/platform/convex/workflow/helpers/step_execution/initialize_execution_variables.ts (1 hunks)
  • services/platform/convex/workflow/helpers/step_execution/types.ts (1 hunks)
  • services/platform/convex/workflow_processing_records.ts (5 hunks)

Comment @coderabbitai help to get the list of available commands and usage tips.

yannickmonney pushed a commit that referenced this pull request Apr 8, 2026
matek-alphacruncher added a commit to nuvolos-cloud/tale that referenced this pull request May 4, 2026
…volos)

Replaces the runtime patch_sops.py monkey-patch with an in-source change.
Reads existing *.secrets.json as plain JSON (seeded by tale-init.sh from
OPENROUTER_API_KEY) with OPENROUTER_API_KEY / OPENAI_API_KEY env fallback;
saveProviderSecret writes plain JSON instead of SOPS-encrypting. Also
documented in FORK.md as patch tale-project#4.
larryro added a commit that referenced this pull request May 7, 2026
Round-2-confirmed critical bug. The docstring on `requestErasure`
advertised `LEGAL_HOLD_BLOCKS_ERASURE` as a refusal code; the body had
`const blocked: string[] = []` hard-coded as a "phase-8 placeholder"
with `loadActiveHolds` commented out. Result: an org admin (or anyone
with a compromised admin token) could spoliate litigation-hold data via
this mutation, and the audit row would record `threadsBlocked: []` —
actively misleading. Direct GDPR Art 17(3)(e) and FRCP 37(e) violation.

`loadActiveHolds` ships in the same branch (legal_hold.ts:65) and is
already consumed by retention_cleanup; only erasure ignored it.

This commit:
- Calls `loadActiveHolds` and partitions the subject's threads into
  held vs erasable.
- Fails the whole request closed when `holds.orgHeld === true` or any
  of the subject's threads is held. Emits a `gdpr_erasure_blocked_by_hold`
  audit row before throwing so the refusal itself is logged in the
  tamper-evident chain (compliance teams need this).
- Throws `ConvexError({ code: 'LEGAL_HOLD_BLOCKS_ERASURE', orgHeld,
  heldThreadIds })` matching the docstring contract.
- Tracks per-thread cascade completion explicitly. The previous loop
  ran cascadeDeleteThreadChildren up to 50 times then incremented
  `erased` regardless of outcome; a thread with >50 pages of children
  was reported as erased while leaving residue. Now only counts threads
  whose cascade actually returned `done: true`, and emits the audit row
  with `status: 'failure'` + an explicit error message when any thread
  was partial.
- Audit row records `heldThreadIds` (when present) and
  `threadsTargeted` (so reviewers can spot partial erasures).

Out-of-scope here (separate work-streams):
- Sub-thread cascade hold check (W1 #3 — cascade_helpers).
- Snapshot-race fix in deleteExpired* mutations (W1 #4).
- Subject scope expansion (W2: userMemories, BetterAuth, audit-PII
  scrub, RAG propagation, gdprErasureRequests state machine).
larryro added a commit that referenced this pull request May 7, 2026
…ascade

Round-1 reviewer-confirmed: 12 of 14 retention cleanup categories did
not consult holds.orgHeld, and cascadeDeleteThreadChildren recursed
into sub-threads with no hold check. Both let a litigation/preservation
hold be silently bypassed for the table that records *why* the hold
exists (auditLogs), every PII-bearing table, and any held sub-thread
when its parent ages out — direct US FRCP 37(e) / EU GDPR Art 21
spoliation risk.

retention_cleanup.ts (W1 #2)
- Added `holds: ActiveHolds` to every cleanup category that lacked it:
  cleanupTempFiles (user + agent), cleanupAuditLogs, cleanupWorkflowLogs,
  cleanupUsageLedger, cleanupChatFilterEvents, cleanupPromptTemplates,
  cleanupMessageFeedback, cleanupMemoryAudit, cleanupCustomers,
  cleanupVendors, cleanupExternalConversations, cleanupMessageMetadata.
- Each short-circuits with a clear info log when holds.orgHeld is true.
- cleanupWorkflowLogs additionally consults holds.executionIds for
  per-execution holds (`targetType: 'execution'`) — until now those
  rows were silently ignored by cleanup.
- The dispatcher's category list updated to thread `holds` to all 15
  category invocations.

cascade_helpers.ts (W1 #3 + partial W1 #4)
- cascadeDeleteThreadChildren accepts an optional `holds` snapshot. If
  omitted AND organizationId is set, the helper re-loads `legalHolds`
  itself — closing the snapshot-once race where a hold placed mid-run
  provided zero protection because the dispatcher's pre-fetched Set
  was already stale by the time per-thread cascade fired.
- Sub-thread recursion now passes the snapshot through, so the
  per-sub-thread hold check uses the same authoritative read.
- The helper returns `{ done: true, remaining: 0 }` (no-op) on a held
  thread; callers (retention / erasure) treat that as "skip and
  continue" rather than "delete completed".
larryro added a commit that referenced this pull request May 7, 2026
…OCTOU

Adds assertSafeRetentionDelete to internal_mutations_retention.ts and
threads the new cutoffMs through every retention dispatcher call site.
The guard runs in-mutation (V8 transaction) and closes three round-2
findings simultaneously:

- W1 #4 — snapshot-race vs legal hold. The dispatcher's
  loadActiveHoldsForOrg snapshot is up to 25 min stale; a hold placed
  AFTER snapshot but BEFORE the per-row mutation otherwise had zero
  protection. Re-reading inside each mutation makes the in-mutation
  read the only authoritative gate.
- W6 #10 — cross-org corruption. Every deleteExpired* trusted
  args.organizationId blindly; a swapped id silently deleted org A's
  row while logging the audit under org B (and forking the per-org
  hash chain). Now we assert row.organizationId === args.organizationId.
- W4 #13 — cutoff TOCTOU. Between the dispatcher's listExpired query
  and the per-row delete, a user can re-touch the row (chat
  thread.updatedAt bump, document patch). The mutation now re-evaluates
  cutoff against (updatedAt ?? _creationTime) and skips when the row
  is no longer eligible.

Mutations updated (each accepts new optional cutoffMs):
- deleteExpiredDocument            (targetType: 'document')
- deleteExpiredThread              (targetType: 'thread')
- deleteExpiredWorkflowExecution   (targetType: 'execution')
- deleteExpiredWorkflowTriggerLog  (orgId + cutoff only)
- deleteExpiredCustomer / Vendor / ExternalConversation /
  PromptTemplate / MessageFeedback / MemoryAuditRow /
  ChatFilterEvent / UsageLedgerRow (orgId + cutoff)

Out-of-scope here:
- deleteExpiredTempFile, deleteExpiredMessageMetadata,
  deleteExpiredTwoFactorAttempt, deleteExpiredLoginAttempt,
  deleteExpiredLoginBlockCounter — none of these carry organizationId
  on their args today (they're either cross-org-by-shape or have an
  indirect link). The orgHeld / cross-org checks are not applicable
  without an org backreference, which is itself a separate phase-10
  follow-up. Their existing absence-of-row idempotence is preserved.
larryro added a commit that referenced this pull request May 7, 2026
…eout

Round-2 v15 confirmed: /config unauthenticated, /openapi.json + /docs +
/redoc unauthenticated, RAG container ran as root, default token baked
into image ENV, strict-mode env name diverged across the wire,
non-constant-time token compare, plus three SSRF-guard gaps.

services/rag/app/auth.py
- W7 #3: hmac.compare_digest replaces == on the bearer compare. Removes
  the dead-code EXEMPT_PATHS frozenset.

services/rag/app/routers/health.py
- W7 #1: split into public_router (`/`, `/health`) and protected_router
  (`/config`). main.py mounts the protected one under
  Depends(verify_internal_token). Old `router` re-export stays for
  backwards compat.

services/rag/app/main.py
- W7 #2: docs_url / redoc_url / openapi_url are None outside debug.
- W7 #4: CORS allow_credentials flipped to False (bearer rides
  Authorization, never cookies).
- W7 #1 wiring: mount health-public + health-protected separately.

services/rag/app/config.py
- W7 #8: require_custom_internal_token accepts BOTH
  RAG_REQUIRE_CUSTOM_INTERNAL_TOKEN and TALE_REQUIRE_CUSTOM_RAG_TOKEN
  via pydantic AliasChoices.

services/rag/Dockerfile + services/convex/Dockerfile
- W7 #5: RAG container runs as non-root (uid:gid 1001:1001 `app`).
  RAG ingests untrusted PDFs/DOCX through native parsers; biggest
  blast radius in the stack, now hardened.
- W7 #6: removed RAG_INTERNAL_TOKEN=tale-rag-dev-only ENV bake from
  both runtime + scratch-squash stages and the matching bake in
  services/convex/Dockerfile. Operators MUST supply via env / compose
  / k8s secret.

services/platform/convex/lib/helpers/rag_config.ts
- W7 #9 F1: `redirect: 'manual'` on every ragFetch.
- W7 #9 F2: added fc00::/7 (IPv6 ULA) to v6 blocklist (AWS IPv6 IMDSv2).
- W7 #9 F3: strip trailing `.` before hostname blocklist lookup.
- W7 #9 F4: re-validate URL per ragFetch invocation (DNS rebinding +
  env rotation mitigation).
- W7 #9 F9: deleted path.startsWith('http') override branch (future-
  bypass foot-gun).

services/platform/convex/agent_tools/rag/helpers/fetch_document_chunks.ts
- W7 #10: pass timeoutMs=60_000 (default 10s was a regression).
- Plus MAX_ITERATIONS=30 cap and "cursor did not advance" break to
  defend against an adversarial RAG response.
larryro added a commit that referenced this pull request May 7, 2026
… rag propagation

W2 — completes the round-2-confirmed gaps in requestErasure:

- v11 C2: erasure was a mutation, structurally unable to call the RAG
  service. Vector chunks + indexed text persisted after a "successful"
  erasure receipt.
- v11 C2/C3: bounded loop reported partial cascades as success; receipt
  lied about coverage. No durable receipt for the subject (Art 19).
- v12 H3: single-mutation cascade hit Convex transaction limits.

Schema (gdprErasureRequestsTable)
- New per-request state machine: pending → running → done | partial |
  failed. Carries threadsTargeted snapshot, threadsErased counter,
  threadsBlockedByHold, ragDocumentsRemoved, slaDeadlineAt
  (= requestedAt + 30 days per Art 12(3)), startedAt / completedAt.

erasure.ts — split-phase pipeline
- Public mutation requestErasure: auth + org-admin gate + legal-hold
  gate (refusal also audit-logged) + inserts request row + schedules
  the processor + returns { requestId, threadsTargeted } so the admin
  can poll the row for progress.
- Internal action processErasureRequest: flips row to 'running',
  cascades each thread via cascadeDeleteThreadChildren (up to 50 pages
  per thread), propagates to RAG via ragFetch DELETE per subject-owned
  document, then finalizeProcessing transitions the row to done |
  partial | failed with explicit counts.
- Three supporting internal mutations: beginProcessing,
  eraseThreadById, finalizeProcessing, listSubjectDocuments.

Subject scope
- Today: chat threads + their RAG-indexed descendants (departing-
  employee case).
- Out of scope (documented in file header): userMemories,
  userPreferences, feedback, fileMetadata blobs, audit-chain PII
  scrub (W3 #4), BetterAuth tables, loginAttempts / twoFactorAttempts,
  policyAcknowledgements.

Receipt = the gdprErasureRequests row. The admin UI can render the
full state machine without scraping audit logs; SLA deadline is
durable.
larryro added a commit that referenced this pull request May 7, 2026
Two work-streams in one commit because they share the erasure
processor's per-table cleanup path.

Subject-scope expansion (W2 follow-up)
- Adds eight per-table internal mutations called sequentially after
  the chat-thread cascade + RAG propagation:
    * eraseSubjectUserMemories
    * eraseSubjectUserPreferences
    * eraseSubjectMessageFeedback
    * eraseSubjectFileMetadata (deletes _storage blobs too)
    * eraseSubjectUsageLedger
    * eraseSubjectTwoFactorAttempts
    * eraseSubjectPolicyAcknowledgements
    * eraseSubjectOnedrive
- loginAttempts / loginBlockCounters are email-keyed; new
  lookupSubjectEmail mutation reads the BetterAuth user row first,
  then eraseSubjectLoginAttempts wipes both tables.
- BetterAuth `user`/`account`/`session`/`verification` rows are
  intentionally NOT touched here — the auth component owns its own
  delete-user flow and direct table manipulation could leave dangling
  sessions in flight. Documented as out-of-scope in the file header
  for a follow-up that goes through `authComponent`.

Audit-chain PII scrub (W3 #4)
- New `auditLogs.piiScrubbed` (+ `piiScrubbedAt`) field. The processor
  blanks `actorEmail`, `actorRole`, `ipAddress`, `userAgent`,
  `previousState`, `newState`, `metadata` on every row where
  `actorId === targetUserId`. Row existence + timestamp + action are
  preserved per Art 17(3)(b)/(e) accountability requirements.
- New `auditLogCheckpoints.subtype` field discriminates 'retention'
  (existing) from 'pii_scrub'. The scrub mutation inserts a signed
  checkpoint with `scrubbedSubjectId` + `scrubbedRowCount` so an
  external auditor can confirm the divergence is bounded and
  intentional.
- `verifyIntegrity` updated to recognize the boundary: rows with
  `piiScrubbed: true` (or whose actorId is in any pii_scrub
  checkpoint's subjectId set) skip the SHA-256 recompute step. Chain
  forward-link via `previousHash` is still validated — the scrub does
  not break order, only blanks the body.
- `scrubSubjectAuditLogs` is in audit_logs/internal_mutations.ts so
  it can use the existing `signCheckpoint` HMAC helper directly.

The processor finalizeProcessing audit log on a 'done' run now
implicitly covers the scrub-marker checkpoint (admins reviewing the
erasure see both rows in the chain). No client-callable surface added
— scrub fires only as part of an authorized requestErasure flow.
larryro added a commit that referenced this pull request May 8, 2026
Round-2-confirmed critical bug. The docstring on `requestErasure`
advertised `LEGAL_HOLD_BLOCKS_ERASURE` as a refusal code; the body had
`const blocked: string[] = []` hard-coded as a "phase-8 placeholder"
with `loadActiveHolds` commented out. Result: an org admin (or anyone
with a compromised admin token) could spoliate litigation-hold data via
this mutation, and the audit row would record `threadsBlocked: []` —
actively misleading. Direct GDPR Art 17(3)(e) and FRCP 37(e) violation.

`loadActiveHolds` ships in the same branch (legal_hold.ts:65) and is
already consumed by retention_cleanup; only erasure ignored it.

This commit:
- Calls `loadActiveHolds` and partitions the subject's threads into
  held vs erasable.
- Fails the whole request closed when `holds.orgHeld === true` or any
  of the subject's threads is held. Emits a `gdpr_erasure_blocked_by_hold`
  audit row before throwing so the refusal itself is logged in the
  tamper-evident chain (compliance teams need this).
- Throws `ConvexError({ code: 'LEGAL_HOLD_BLOCKS_ERASURE', orgHeld,
  heldThreadIds })` matching the docstring contract.
- Tracks per-thread cascade completion explicitly. The previous loop
  ran cascadeDeleteThreadChildren up to 50 times then incremented
  `erased` regardless of outcome; a thread with >50 pages of children
  was reported as erased while leaving residue. Now only counts threads
  whose cascade actually returned `done: true`, and emits the audit row
  with `status: 'failure'` + an explicit error message when any thread
  was partial.
- Audit row records `heldThreadIds` (when present) and
  `threadsTargeted` (so reviewers can spot partial erasures).

Out-of-scope here (separate work-streams):
- Sub-thread cascade hold check (W1 #3 — cascade_helpers).
- Snapshot-race fix in deleteExpired* mutations (W1 #4).
- Subject scope expansion (W2: userMemories, BetterAuth, audit-PII
  scrub, RAG propagation, gdprErasureRequests state machine).
larryro added a commit that referenced this pull request May 8, 2026
…ascade

Round-1 reviewer-confirmed: 12 of 14 retention cleanup categories did
not consult holds.orgHeld, and cascadeDeleteThreadChildren recursed
into sub-threads with no hold check. Both let a litigation/preservation
hold be silently bypassed for the table that records *why* the hold
exists (auditLogs), every PII-bearing table, and any held sub-thread
when its parent ages out — direct US FRCP 37(e) / EU GDPR Art 21
spoliation risk.

retention_cleanup.ts (W1 #2)
- Added `holds: ActiveHolds` to every cleanup category that lacked it:
  cleanupTempFiles (user + agent), cleanupAuditLogs, cleanupWorkflowLogs,
  cleanupUsageLedger, cleanupChatFilterEvents, cleanupPromptTemplates,
  cleanupMessageFeedback, cleanupMemoryAudit, cleanupCustomers,
  cleanupVendors, cleanupExternalConversations, cleanupMessageMetadata.
- Each short-circuits with a clear info log when holds.orgHeld is true.
- cleanupWorkflowLogs additionally consults holds.executionIds for
  per-execution holds (`targetType: 'execution'`) — until now those
  rows were silently ignored by cleanup.
- The dispatcher's category list updated to thread `holds` to all 15
  category invocations.

cascade_helpers.ts (W1 #3 + partial W1 #4)
- cascadeDeleteThreadChildren accepts an optional `holds` snapshot. If
  omitted AND organizationId is set, the helper re-loads `legalHolds`
  itself — closing the snapshot-once race where a hold placed mid-run
  provided zero protection because the dispatcher's pre-fetched Set
  was already stale by the time per-thread cascade fired.
- Sub-thread recursion now passes the snapshot through, so the
  per-sub-thread hold check uses the same authoritative read.
- The helper returns `{ done: true, remaining: 0 }` (no-op) on a held
  thread; callers (retention / erasure) treat that as "skip and
  continue" rather than "delete completed".
larryro added a commit that referenced this pull request May 8, 2026
…OCTOU

Adds assertSafeRetentionDelete to internal_mutations_retention.ts and
threads the new cutoffMs through every retention dispatcher call site.
The guard runs in-mutation (V8 transaction) and closes three round-2
findings simultaneously:

- W1 #4 — snapshot-race vs legal hold. The dispatcher's
  loadActiveHoldsForOrg snapshot is up to 25 min stale; a hold placed
  AFTER snapshot but BEFORE the per-row mutation otherwise had zero
  protection. Re-reading inside each mutation makes the in-mutation
  read the only authoritative gate.
- W6 #10 — cross-org corruption. Every deleteExpired* trusted
  args.organizationId blindly; a swapped id silently deleted org A's
  row while logging the audit under org B (and forking the per-org
  hash chain). Now we assert row.organizationId === args.organizationId.
- W4 #13 — cutoff TOCTOU. Between the dispatcher's listExpired query
  and the per-row delete, a user can re-touch the row (chat
  thread.updatedAt bump, document patch). The mutation now re-evaluates
  cutoff against (updatedAt ?? _creationTime) and skips when the row
  is no longer eligible.

Mutations updated (each accepts new optional cutoffMs):
- deleteExpiredDocument            (targetType: 'document')
- deleteExpiredThread              (targetType: 'thread')
- deleteExpiredWorkflowExecution   (targetType: 'execution')
- deleteExpiredWorkflowTriggerLog  (orgId + cutoff only)
- deleteExpiredCustomer / Vendor / ExternalConversation /
  PromptTemplate / MessageFeedback / MemoryAuditRow /
  ChatFilterEvent / UsageLedgerRow (orgId + cutoff)

Out-of-scope here:
- deleteExpiredTempFile, deleteExpiredMessageMetadata,
  deleteExpiredTwoFactorAttempt, deleteExpiredLoginAttempt,
  deleteExpiredLoginBlockCounter — none of these carry organizationId
  on their args today (they're either cross-org-by-shape or have an
  indirect link). The orgHeld / cross-org checks are not applicable
  without an org backreference, which is itself a separate phase-10
  follow-up. Their existing absence-of-row idempotence is preserved.
larryro added a commit that referenced this pull request May 8, 2026
…eout

Round-2 v15 confirmed: /config unauthenticated, /openapi.json + /docs +
/redoc unauthenticated, RAG container ran as root, default token baked
into image ENV, strict-mode env name diverged across the wire,
non-constant-time token compare, plus three SSRF-guard gaps.

services/rag/app/auth.py
- W7 #3: hmac.compare_digest replaces == on the bearer compare. Removes
  the dead-code EXEMPT_PATHS frozenset.

services/rag/app/routers/health.py
- W7 #1: split into public_router (`/`, `/health`) and protected_router
  (`/config`). main.py mounts the protected one under
  Depends(verify_internal_token). Old `router` re-export stays for
  backwards compat.

services/rag/app/main.py
- W7 #2: docs_url / redoc_url / openapi_url are None outside debug.
- W7 #4: CORS allow_credentials flipped to False (bearer rides
  Authorization, never cookies).
- W7 #1 wiring: mount health-public + health-protected separately.

services/rag/app/config.py
- W7 #8: require_custom_internal_token accepts BOTH
  RAG_REQUIRE_CUSTOM_INTERNAL_TOKEN and TALE_REQUIRE_CUSTOM_RAG_TOKEN
  via pydantic AliasChoices.

services/rag/Dockerfile + services/convex/Dockerfile
- W7 #5: RAG container runs as non-root (uid:gid 1001:1001 `app`).
  RAG ingests untrusted PDFs/DOCX through native parsers; biggest
  blast radius in the stack, now hardened.
- W7 #6: removed RAG_INTERNAL_TOKEN=tale-rag-dev-only ENV bake from
  both runtime + scratch-squash stages and the matching bake in
  services/convex/Dockerfile. Operators MUST supply via env / compose
  / k8s secret.

services/platform/convex/lib/helpers/rag_config.ts
- W7 #9 F1: `redirect: 'manual'` on every ragFetch.
- W7 #9 F2: added fc00::/7 (IPv6 ULA) to v6 blocklist (AWS IPv6 IMDSv2).
- W7 #9 F3: strip trailing `.` before hostname blocklist lookup.
- W7 #9 F4: re-validate URL per ragFetch invocation (DNS rebinding +
  env rotation mitigation).
- W7 #9 F9: deleted path.startsWith('http') override branch (future-
  bypass foot-gun).

services/platform/convex/agent_tools/rag/helpers/fetch_document_chunks.ts
- W7 #10: pass timeoutMs=60_000 (default 10s was a regression).
- Plus MAX_ITERATIONS=30 cap and "cursor did not advance" break to
  defend against an adversarial RAG response.
larryro added a commit that referenced this pull request May 8, 2026
… rag propagation

W2 — completes the round-2-confirmed gaps in requestErasure:

- v11 C2: erasure was a mutation, structurally unable to call the RAG
  service. Vector chunks + indexed text persisted after a "successful"
  erasure receipt.
- v11 C2/C3: bounded loop reported partial cascades as success; receipt
  lied about coverage. No durable receipt for the subject (Art 19).
- v12 H3: single-mutation cascade hit Convex transaction limits.

Schema (gdprErasureRequestsTable)
- New per-request state machine: pending → running → done | partial |
  failed. Carries threadsTargeted snapshot, threadsErased counter,
  threadsBlockedByHold, ragDocumentsRemoved, slaDeadlineAt
  (= requestedAt + 30 days per Art 12(3)), startedAt / completedAt.

erasure.ts — split-phase pipeline
- Public mutation requestErasure: auth + org-admin gate + legal-hold
  gate (refusal also audit-logged) + inserts request row + schedules
  the processor + returns { requestId, threadsTargeted } so the admin
  can poll the row for progress.
- Internal action processErasureRequest: flips row to 'running',
  cascades each thread via cascadeDeleteThreadChildren (up to 50 pages
  per thread), propagates to RAG via ragFetch DELETE per subject-owned
  document, then finalizeProcessing transitions the row to done |
  partial | failed with explicit counts.
- Three supporting internal mutations: beginProcessing,
  eraseThreadById, finalizeProcessing, listSubjectDocuments.

Subject scope
- Today: chat threads + their RAG-indexed descendants (departing-
  employee case).
- Out of scope (documented in file header): userMemories,
  userPreferences, feedback, fileMetadata blobs, audit-chain PII
  scrub (W3 #4), BetterAuth tables, loginAttempts / twoFactorAttempts,
  policyAcknowledgements.

Receipt = the gdprErasureRequests row. The admin UI can render the
full state machine without scraping audit logs; SLA deadline is
durable.
larryro added a commit that referenced this pull request May 8, 2026
Two work-streams in one commit because they share the erasure
processor's per-table cleanup path.

Subject-scope expansion (W2 follow-up)
- Adds eight per-table internal mutations called sequentially after
  the chat-thread cascade + RAG propagation:
    * eraseSubjectUserMemories
    * eraseSubjectUserPreferences
    * eraseSubjectMessageFeedback
    * eraseSubjectFileMetadata (deletes _storage blobs too)
    * eraseSubjectUsageLedger
    * eraseSubjectTwoFactorAttempts
    * eraseSubjectPolicyAcknowledgements
    * eraseSubjectOnedrive
- loginAttempts / loginBlockCounters are email-keyed; new
  lookupSubjectEmail mutation reads the BetterAuth user row first,
  then eraseSubjectLoginAttempts wipes both tables.
- BetterAuth `user`/`account`/`session`/`verification` rows are
  intentionally NOT touched here — the auth component owns its own
  delete-user flow and direct table manipulation could leave dangling
  sessions in flight. Documented as out-of-scope in the file header
  for a follow-up that goes through `authComponent`.

Audit-chain PII scrub (W3 #4)
- New `auditLogs.piiScrubbed` (+ `piiScrubbedAt`) field. The processor
  blanks `actorEmail`, `actorRole`, `ipAddress`, `userAgent`,
  `previousState`, `newState`, `metadata` on every row where
  `actorId === targetUserId`. Row existence + timestamp + action are
  preserved per Art 17(3)(b)/(e) accountability requirements.
- New `auditLogCheckpoints.subtype` field discriminates 'retention'
  (existing) from 'pii_scrub'. The scrub mutation inserts a signed
  checkpoint with `scrubbedSubjectId` + `scrubbedRowCount` so an
  external auditor can confirm the divergence is bounded and
  intentional.
- `verifyIntegrity` updated to recognize the boundary: rows with
  `piiScrubbed: true` (or whose actorId is in any pii_scrub
  checkpoint's subjectId set) skip the SHA-256 recompute step. Chain
  forward-link via `previousHash` is still validated — the scrub does
  not break order, only blanks the body.
- `scrubSubjectAuditLogs` is in audit_logs/internal_mutations.ts so
  it can use the existing `signCheckpoint` HMAC helper directly.

The processor finalizeProcessing audit log on a 'done' run now
implicitly covers the scrub-marker checkpoint (admins reviewing the
erasure see both rows in the chain). No client-callable surface added
— scrub fires only as part of an authorized requestErasure flow.
larryro added a commit that referenced this pull request May 8, 2026
Round-2-confirmed critical bug. The docstring on `requestErasure`
advertised `LEGAL_HOLD_BLOCKS_ERASURE` as a refusal code; the body had
`const blocked: string[] = []` hard-coded as a "phase-8 placeholder"
with `loadActiveHolds` commented out. Result: an org admin (or anyone
with a compromised admin token) could spoliate litigation-hold data via
this mutation, and the audit row would record `threadsBlocked: []` —
actively misleading. Direct GDPR Art 17(3)(e) and FRCP 37(e) violation.

`loadActiveHolds` ships in the same branch (legal_hold.ts:65) and is
already consumed by retention_cleanup; only erasure ignored it.

This commit:
- Calls `loadActiveHolds` and partitions the subject's threads into
  held vs erasable.
- Fails the whole request closed when `holds.orgHeld === true` or any
  of the subject's threads is held. Emits a `gdpr_erasure_blocked_by_hold`
  audit row before throwing so the refusal itself is logged in the
  tamper-evident chain (compliance teams need this).
- Throws `ConvexError({ code: 'LEGAL_HOLD_BLOCKS_ERASURE', orgHeld,
  heldThreadIds })` matching the docstring contract.
- Tracks per-thread cascade completion explicitly. The previous loop
  ran cascadeDeleteThreadChildren up to 50 times then incremented
  `erased` regardless of outcome; a thread with >50 pages of children
  was reported as erased while leaving residue. Now only counts threads
  whose cascade actually returned `done: true`, and emits the audit row
  with `status: 'failure'` + an explicit error message when any thread
  was partial.
- Audit row records `heldThreadIds` (when present) and
  `threadsTargeted` (so reviewers can spot partial erasures).

Out-of-scope here (separate work-streams):
- Sub-thread cascade hold check (W1 #3 — cascade_helpers).
- Snapshot-race fix in deleteExpired* mutations (W1 #4).
- Subject scope expansion (W2: userMemories, BetterAuth, audit-PII
  scrub, RAG propagation, gdprErasureRequests state machine).
larryro added a commit that referenced this pull request May 8, 2026
…ascade

Round-1 reviewer-confirmed: 12 of 14 retention cleanup categories did
not consult holds.orgHeld, and cascadeDeleteThreadChildren recursed
into sub-threads with no hold check. Both let a litigation/preservation
hold be silently bypassed for the table that records *why* the hold
exists (auditLogs), every PII-bearing table, and any held sub-thread
when its parent ages out — direct US FRCP 37(e) / EU GDPR Art 21
spoliation risk.

retention_cleanup.ts (W1 #2)
- Added `holds: ActiveHolds` to every cleanup category that lacked it:
  cleanupTempFiles (user + agent), cleanupAuditLogs, cleanupWorkflowLogs,
  cleanupUsageLedger, cleanupChatFilterEvents, cleanupPromptTemplates,
  cleanupMessageFeedback, cleanupMemoryAudit, cleanupCustomers,
  cleanupVendors, cleanupExternalConversations, cleanupMessageMetadata.
- Each short-circuits with a clear info log when holds.orgHeld is true.
- cleanupWorkflowLogs additionally consults holds.executionIds for
  per-execution holds (`targetType: 'execution'`) — until now those
  rows were silently ignored by cleanup.
- The dispatcher's category list updated to thread `holds` to all 15
  category invocations.

cascade_helpers.ts (W1 #3 + partial W1 #4)
- cascadeDeleteThreadChildren accepts an optional `holds` snapshot. If
  omitted AND organizationId is set, the helper re-loads `legalHolds`
  itself — closing the snapshot-once race where a hold placed mid-run
  provided zero protection because the dispatcher's pre-fetched Set
  was already stale by the time per-thread cascade fired.
- Sub-thread recursion now passes the snapshot through, so the
  per-sub-thread hold check uses the same authoritative read.
- The helper returns `{ done: true, remaining: 0 }` (no-op) on a held
  thread; callers (retention / erasure) treat that as "skip and
  continue" rather than "delete completed".
larryro added a commit that referenced this pull request May 8, 2026
…OCTOU

Adds assertSafeRetentionDelete to internal_mutations_retention.ts and
threads the new cutoffMs through every retention dispatcher call site.
The guard runs in-mutation (V8 transaction) and closes three round-2
findings simultaneously:

- W1 #4 — snapshot-race vs legal hold. The dispatcher's
  loadActiveHoldsForOrg snapshot is up to 25 min stale; a hold placed
  AFTER snapshot but BEFORE the per-row mutation otherwise had zero
  protection. Re-reading inside each mutation makes the in-mutation
  read the only authoritative gate.
- W6 #10 — cross-org corruption. Every deleteExpired* trusted
  args.organizationId blindly; a swapped id silently deleted org A's
  row while logging the audit under org B (and forking the per-org
  hash chain). Now we assert row.organizationId === args.organizationId.
- W4 #13 — cutoff TOCTOU. Between the dispatcher's listExpired query
  and the per-row delete, a user can re-touch the row (chat
  thread.updatedAt bump, document patch). The mutation now re-evaluates
  cutoff against (updatedAt ?? _creationTime) and skips when the row
  is no longer eligible.

Mutations updated (each accepts new optional cutoffMs):
- deleteExpiredDocument            (targetType: 'document')
- deleteExpiredThread              (targetType: 'thread')
- deleteExpiredWorkflowExecution   (targetType: 'execution')
- deleteExpiredWorkflowTriggerLog  (orgId + cutoff only)
- deleteExpiredCustomer / Vendor / ExternalConversation /
  PromptTemplate / MessageFeedback / MemoryAuditRow /
  ChatFilterEvent / UsageLedgerRow (orgId + cutoff)

Out-of-scope here:
- deleteExpiredTempFile, deleteExpiredMessageMetadata,
  deleteExpiredTwoFactorAttempt, deleteExpiredLoginAttempt,
  deleteExpiredLoginBlockCounter — none of these carry organizationId
  on their args today (they're either cross-org-by-shape or have an
  indirect link). The orgHeld / cross-org checks are not applicable
  without an org backreference, which is itself a separate phase-10
  follow-up. Their existing absence-of-row idempotence is preserved.
larryro added a commit that referenced this pull request May 8, 2026
…eout

Round-2 v15 confirmed: /config unauthenticated, /openapi.json + /docs +
/redoc unauthenticated, RAG container ran as root, default token baked
into image ENV, strict-mode env name diverged across the wire,
non-constant-time token compare, plus three SSRF-guard gaps.

services/rag/app/auth.py
- W7 #3: hmac.compare_digest replaces == on the bearer compare. Removes
  the dead-code EXEMPT_PATHS frozenset.

services/rag/app/routers/health.py
- W7 #1: split into public_router (`/`, `/health`) and protected_router
  (`/config`). main.py mounts the protected one under
  Depends(verify_internal_token). Old `router` re-export stays for
  backwards compat.

services/rag/app/main.py
- W7 #2: docs_url / redoc_url / openapi_url are None outside debug.
- W7 #4: CORS allow_credentials flipped to False (bearer rides
  Authorization, never cookies).
- W7 #1 wiring: mount health-public + health-protected separately.

services/rag/app/config.py
- W7 #8: require_custom_internal_token accepts BOTH
  RAG_REQUIRE_CUSTOM_INTERNAL_TOKEN and TALE_REQUIRE_CUSTOM_RAG_TOKEN
  via pydantic AliasChoices.

services/rag/Dockerfile + services/convex/Dockerfile
- W7 #5: RAG container runs as non-root (uid:gid 1001:1001 `app`).
  RAG ingests untrusted PDFs/DOCX through native parsers; biggest
  blast radius in the stack, now hardened.
- W7 #6: removed RAG_INTERNAL_TOKEN=tale-rag-dev-only ENV bake from
  both runtime + scratch-squash stages and the matching bake in
  services/convex/Dockerfile. Operators MUST supply via env / compose
  / k8s secret.

services/platform/convex/lib/helpers/rag_config.ts
- W7 #9 F1: `redirect: 'manual'` on every ragFetch.
- W7 #9 F2: added fc00::/7 (IPv6 ULA) to v6 blocklist (AWS IPv6 IMDSv2).
- W7 #9 F3: strip trailing `.` before hostname blocklist lookup.
- W7 #9 F4: re-validate URL per ragFetch invocation (DNS rebinding +
  env rotation mitigation).
- W7 #9 F9: deleted path.startsWith('http') override branch (future-
  bypass foot-gun).

services/platform/convex/agent_tools/rag/helpers/fetch_document_chunks.ts
- W7 #10: pass timeoutMs=60_000 (default 10s was a regression).
- Plus MAX_ITERATIONS=30 cap and "cursor did not advance" break to
  defend against an adversarial RAG response.
larryro added a commit that referenced this pull request May 8, 2026
… rag propagation

W2 — completes the round-2-confirmed gaps in requestErasure:

- v11 C2: erasure was a mutation, structurally unable to call the RAG
  service. Vector chunks + indexed text persisted after a "successful"
  erasure receipt.
- v11 C2/C3: bounded loop reported partial cascades as success; receipt
  lied about coverage. No durable receipt for the subject (Art 19).
- v12 H3: single-mutation cascade hit Convex transaction limits.

Schema (gdprErasureRequestsTable)
- New per-request state machine: pending → running → done | partial |
  failed. Carries threadsTargeted snapshot, threadsErased counter,
  threadsBlockedByHold, ragDocumentsRemoved, slaDeadlineAt
  (= requestedAt + 30 days per Art 12(3)), startedAt / completedAt.

erasure.ts — split-phase pipeline
- Public mutation requestErasure: auth + org-admin gate + legal-hold
  gate (refusal also audit-logged) + inserts request row + schedules
  the processor + returns { requestId, threadsTargeted } so the admin
  can poll the row for progress.
- Internal action processErasureRequest: flips row to 'running',
  cascades each thread via cascadeDeleteThreadChildren (up to 50 pages
  per thread), propagates to RAG via ragFetch DELETE per subject-owned
  document, then finalizeProcessing transitions the row to done |
  partial | failed with explicit counts.
- Three supporting internal mutations: beginProcessing,
  eraseThreadById, finalizeProcessing, listSubjectDocuments.

Subject scope
- Today: chat threads + their RAG-indexed descendants (departing-
  employee case).
- Out of scope (documented in file header): userMemories,
  userPreferences, feedback, fileMetadata blobs, audit-chain PII
  scrub (W3 #4), BetterAuth tables, loginAttempts / twoFactorAttempts,
  policyAcknowledgements.

Receipt = the gdprErasureRequests row. The admin UI can render the
full state machine without scraping audit logs; SLA deadline is
durable.
larryro added a commit that referenced this pull request May 8, 2026
Two work-streams in one commit because they share the erasure
processor's per-table cleanup path.

Subject-scope expansion (W2 follow-up)
- Adds eight per-table internal mutations called sequentially after
  the chat-thread cascade + RAG propagation:
    * eraseSubjectUserMemories
    * eraseSubjectUserPreferences
    * eraseSubjectMessageFeedback
    * eraseSubjectFileMetadata (deletes _storage blobs too)
    * eraseSubjectUsageLedger
    * eraseSubjectTwoFactorAttempts
    * eraseSubjectPolicyAcknowledgements
    * eraseSubjectOnedrive
- loginAttempts / loginBlockCounters are email-keyed; new
  lookupSubjectEmail mutation reads the BetterAuth user row first,
  then eraseSubjectLoginAttempts wipes both tables.
- BetterAuth `user`/`account`/`session`/`verification` rows are
  intentionally NOT touched here — the auth component owns its own
  delete-user flow and direct table manipulation could leave dangling
  sessions in flight. Documented as out-of-scope in the file header
  for a follow-up that goes through `authComponent`.

Audit-chain PII scrub (W3 #4)
- New `auditLogs.piiScrubbed` (+ `piiScrubbedAt`) field. The processor
  blanks `actorEmail`, `actorRole`, `ipAddress`, `userAgent`,
  `previousState`, `newState`, `metadata` on every row where
  `actorId === targetUserId`. Row existence + timestamp + action are
  preserved per Art 17(3)(b)/(e) accountability requirements.
- New `auditLogCheckpoints.subtype` field discriminates 'retention'
  (existing) from 'pii_scrub'. The scrub mutation inserts a signed
  checkpoint with `scrubbedSubjectId` + `scrubbedRowCount` so an
  external auditor can confirm the divergence is bounded and
  intentional.
- `verifyIntegrity` updated to recognize the boundary: rows with
  `piiScrubbed: true` (or whose actorId is in any pii_scrub
  checkpoint's subjectId set) skip the SHA-256 recompute step. Chain
  forward-link via `previousHash` is still validated — the scrub does
  not break order, only blanks the body.
- `scrubSubjectAuditLogs` is in audit_logs/internal_mutations.ts so
  it can use the existing `signCheckpoint` HMAC helper directly.

The processor finalizeProcessing audit log on a 'done' run now
implicitly covers the scrub-marker checkpoint (admins reviewing the
erasure see both rows in the chain). No client-callable surface added
— scrub fires only as part of an authorized requestErasure flow.
larryro added a commit that referenced this pull request May 9, 2026
Round-2-confirmed critical bug. The docstring on `requestErasure`
advertised `LEGAL_HOLD_BLOCKS_ERASURE` as a refusal code; the body had
`const blocked: string[] = []` hard-coded as a "phase-8 placeholder"
with `loadActiveHolds` commented out. Result: an org admin (or anyone
with a compromised admin token) could spoliate litigation-hold data via
this mutation, and the audit row would record `threadsBlocked: []` —
actively misleading. Direct GDPR Art 17(3)(e) and FRCP 37(e) violation.

`loadActiveHolds` ships in the same branch (legal_hold.ts:65) and is
already consumed by retention_cleanup; only erasure ignored it.

This commit:
- Calls `loadActiveHolds` and partitions the subject's threads into
  held vs erasable.
- Fails the whole request closed when `holds.orgHeld === true` or any
  of the subject's threads is held. Emits a `gdpr_erasure_blocked_by_hold`
  audit row before throwing so the refusal itself is logged in the
  tamper-evident chain (compliance teams need this).
- Throws `ConvexError({ code: 'LEGAL_HOLD_BLOCKS_ERASURE', orgHeld,
  heldThreadIds })` matching the docstring contract.
- Tracks per-thread cascade completion explicitly. The previous loop
  ran cascadeDeleteThreadChildren up to 50 times then incremented
  `erased` regardless of outcome; a thread with >50 pages of children
  was reported as erased while leaving residue. Now only counts threads
  whose cascade actually returned `done: true`, and emits the audit row
  with `status: 'failure'` + an explicit error message when any thread
  was partial.
- Audit row records `heldThreadIds` (when present) and
  `threadsTargeted` (so reviewers can spot partial erasures).

Out-of-scope here (separate work-streams):
- Sub-thread cascade hold check (W1 #3 — cascade_helpers).
- Snapshot-race fix in deleteExpired* mutations (W1 #4).
- Subject scope expansion (W2: userMemories, BetterAuth, audit-PII
  scrub, RAG propagation, gdprErasureRequests state machine).
larryro added a commit that referenced this pull request May 9, 2026
…ascade

Round-1 reviewer-confirmed: 12 of 14 retention cleanup categories did
not consult holds.orgHeld, and cascadeDeleteThreadChildren recursed
into sub-threads with no hold check. Both let a litigation/preservation
hold be silently bypassed for the table that records *why* the hold
exists (auditLogs), every PII-bearing table, and any held sub-thread
when its parent ages out — direct US FRCP 37(e) / EU GDPR Art 21
spoliation risk.

retention_cleanup.ts (W1 #2)
- Added `holds: ActiveHolds` to every cleanup category that lacked it:
  cleanupTempFiles (user + agent), cleanupAuditLogs, cleanupWorkflowLogs,
  cleanupUsageLedger, cleanupChatFilterEvents, cleanupPromptTemplates,
  cleanupMessageFeedback, cleanupMemoryAudit, cleanupCustomers,
  cleanupVendors, cleanupExternalConversations, cleanupMessageMetadata.
- Each short-circuits with a clear info log when holds.orgHeld is true.
- cleanupWorkflowLogs additionally consults holds.executionIds for
  per-execution holds (`targetType: 'execution'`) — until now those
  rows were silently ignored by cleanup.
- The dispatcher's category list updated to thread `holds` to all 15
  category invocations.

cascade_helpers.ts (W1 #3 + partial W1 #4)
- cascadeDeleteThreadChildren accepts an optional `holds` snapshot. If
  omitted AND organizationId is set, the helper re-loads `legalHolds`
  itself — closing the snapshot-once race where a hold placed mid-run
  provided zero protection because the dispatcher's pre-fetched Set
  was already stale by the time per-thread cascade fired.
- Sub-thread recursion now passes the snapshot through, so the
  per-sub-thread hold check uses the same authoritative read.
- The helper returns `{ done: true, remaining: 0 }` (no-op) on a held
  thread; callers (retention / erasure) treat that as "skip and
  continue" rather than "delete completed".
larryro added a commit that referenced this pull request May 9, 2026
…OCTOU

Adds assertSafeRetentionDelete to internal_mutations_retention.ts and
threads the new cutoffMs through every retention dispatcher call site.
The guard runs in-mutation (V8 transaction) and closes three round-2
findings simultaneously:

- W1 #4 — snapshot-race vs legal hold. The dispatcher's
  loadActiveHoldsForOrg snapshot is up to 25 min stale; a hold placed
  AFTER snapshot but BEFORE the per-row mutation otherwise had zero
  protection. Re-reading inside each mutation makes the in-mutation
  read the only authoritative gate.
- W6 #10 — cross-org corruption. Every deleteExpired* trusted
  args.organizationId blindly; a swapped id silently deleted org A's
  row while logging the audit under org B (and forking the per-org
  hash chain). Now we assert row.organizationId === args.organizationId.
- W4 #13 — cutoff TOCTOU. Between the dispatcher's listExpired query
  and the per-row delete, a user can re-touch the row (chat
  thread.updatedAt bump, document patch). The mutation now re-evaluates
  cutoff against (updatedAt ?? _creationTime) and skips when the row
  is no longer eligible.

Mutations updated (each accepts new optional cutoffMs):
- deleteExpiredDocument            (targetType: 'document')
- deleteExpiredThread              (targetType: 'thread')
- deleteExpiredWorkflowExecution   (targetType: 'execution')
- deleteExpiredWorkflowTriggerLog  (orgId + cutoff only)
- deleteExpiredCustomer / Vendor / ExternalConversation /
  PromptTemplate / MessageFeedback / MemoryAuditRow /
  ChatFilterEvent / UsageLedgerRow (orgId + cutoff)

Out-of-scope here:
- deleteExpiredTempFile, deleteExpiredMessageMetadata,
  deleteExpiredTwoFactorAttempt, deleteExpiredLoginAttempt,
  deleteExpiredLoginBlockCounter — none of these carry organizationId
  on their args today (they're either cross-org-by-shape or have an
  indirect link). The orgHeld / cross-org checks are not applicable
  without an org backreference, which is itself a separate phase-10
  follow-up. Their existing absence-of-row idempotence is preserved.
larryro added a commit that referenced this pull request May 9, 2026
…eout

Round-2 v15 confirmed: /config unauthenticated, /openapi.json + /docs +
/redoc unauthenticated, RAG container ran as root, default token baked
into image ENV, strict-mode env name diverged across the wire,
non-constant-time token compare, plus three SSRF-guard gaps.

services/rag/app/auth.py
- W7 #3: hmac.compare_digest replaces == on the bearer compare. Removes
  the dead-code EXEMPT_PATHS frozenset.

services/rag/app/routers/health.py
- W7 #1: split into public_router (`/`, `/health`) and protected_router
  (`/config`). main.py mounts the protected one under
  Depends(verify_internal_token). Old `router` re-export stays for
  backwards compat.

services/rag/app/main.py
- W7 #2: docs_url / redoc_url / openapi_url are None outside debug.
- W7 #4: CORS allow_credentials flipped to False (bearer rides
  Authorization, never cookies).
- W7 #1 wiring: mount health-public + health-protected separately.

services/rag/app/config.py
- W7 #8: require_custom_internal_token accepts BOTH
  RAG_REQUIRE_CUSTOM_INTERNAL_TOKEN and TALE_REQUIRE_CUSTOM_RAG_TOKEN
  via pydantic AliasChoices.

services/rag/Dockerfile + services/convex/Dockerfile
- W7 #5: RAG container runs as non-root (uid:gid 1001:1001 `app`).
  RAG ingests untrusted PDFs/DOCX through native parsers; biggest
  blast radius in the stack, now hardened.
- W7 #6: removed RAG_INTERNAL_TOKEN=tale-rag-dev-only ENV bake from
  both runtime + scratch-squash stages and the matching bake in
  services/convex/Dockerfile. Operators MUST supply via env / compose
  / k8s secret.

services/platform/convex/lib/helpers/rag_config.ts
- W7 #9 F1: `redirect: 'manual'` on every ragFetch.
- W7 #9 F2: added fc00::/7 (IPv6 ULA) to v6 blocklist (AWS IPv6 IMDSv2).
- W7 #9 F3: strip trailing `.` before hostname blocklist lookup.
- W7 #9 F4: re-validate URL per ragFetch invocation (DNS rebinding +
  env rotation mitigation).
- W7 #9 F9: deleted path.startsWith('http') override branch (future-
  bypass foot-gun).

services/platform/convex/agent_tools/rag/helpers/fetch_document_chunks.ts
- W7 #10: pass timeoutMs=60_000 (default 10s was a regression).
- Plus MAX_ITERATIONS=30 cap and "cursor did not advance" break to
  defend against an adversarial RAG response.
larryro added a commit that referenced this pull request May 9, 2026
… rag propagation

W2 — completes the round-2-confirmed gaps in requestErasure:

- v11 C2: erasure was a mutation, structurally unable to call the RAG
  service. Vector chunks + indexed text persisted after a "successful"
  erasure receipt.
- v11 C2/C3: bounded loop reported partial cascades as success; receipt
  lied about coverage. No durable receipt for the subject (Art 19).
- v12 H3: single-mutation cascade hit Convex transaction limits.

Schema (gdprErasureRequestsTable)
- New per-request state machine: pending → running → done | partial |
  failed. Carries threadsTargeted snapshot, threadsErased counter,
  threadsBlockedByHold, ragDocumentsRemoved, slaDeadlineAt
  (= requestedAt + 30 days per Art 12(3)), startedAt / completedAt.

erasure.ts — split-phase pipeline
- Public mutation requestErasure: auth + org-admin gate + legal-hold
  gate (refusal also audit-logged) + inserts request row + schedules
  the processor + returns { requestId, threadsTargeted } so the admin
  can poll the row for progress.
- Internal action processErasureRequest: flips row to 'running',
  cascades each thread via cascadeDeleteThreadChildren (up to 50 pages
  per thread), propagates to RAG via ragFetch DELETE per subject-owned
  document, then finalizeProcessing transitions the row to done |
  partial | failed with explicit counts.
- Three supporting internal mutations: beginProcessing,
  eraseThreadById, finalizeProcessing, listSubjectDocuments.

Subject scope
- Today: chat threads + their RAG-indexed descendants (departing-
  employee case).
- Out of scope (documented in file header): userMemories,
  userPreferences, feedback, fileMetadata blobs, audit-chain PII
  scrub (W3 #4), BetterAuth tables, loginAttempts / twoFactorAttempts,
  policyAcknowledgements.

Receipt = the gdprErasureRequests row. The admin UI can render the
full state machine without scraping audit logs; SLA deadline is
durable.
larryro added a commit that referenced this pull request May 9, 2026
Two work-streams in one commit because they share the erasure
processor's per-table cleanup path.

Subject-scope expansion (W2 follow-up)
- Adds eight per-table internal mutations called sequentially after
  the chat-thread cascade + RAG propagation:
    * eraseSubjectUserMemories
    * eraseSubjectUserPreferences
    * eraseSubjectMessageFeedback
    * eraseSubjectFileMetadata (deletes _storage blobs too)
    * eraseSubjectUsageLedger
    * eraseSubjectTwoFactorAttempts
    * eraseSubjectPolicyAcknowledgements
    * eraseSubjectOnedrive
- loginAttempts / loginBlockCounters are email-keyed; new
  lookupSubjectEmail mutation reads the BetterAuth user row first,
  then eraseSubjectLoginAttempts wipes both tables.
- BetterAuth `user`/`account`/`session`/`verification` rows are
  intentionally NOT touched here — the auth component owns its own
  delete-user flow and direct table manipulation could leave dangling
  sessions in flight. Documented as out-of-scope in the file header
  for a follow-up that goes through `authComponent`.

Audit-chain PII scrub (W3 #4)
- New `auditLogs.piiScrubbed` (+ `piiScrubbedAt`) field. The processor
  blanks `actorEmail`, `actorRole`, `ipAddress`, `userAgent`,
  `previousState`, `newState`, `metadata` on every row where
  `actorId === targetUserId`. Row existence + timestamp + action are
  preserved per Art 17(3)(b)/(e) accountability requirements.
- New `auditLogCheckpoints.subtype` field discriminates 'retention'
  (existing) from 'pii_scrub'. The scrub mutation inserts a signed
  checkpoint with `scrubbedSubjectId` + `scrubbedRowCount` so an
  external auditor can confirm the divergence is bounded and
  intentional.
- `verifyIntegrity` updated to recognize the boundary: rows with
  `piiScrubbed: true` (or whose actorId is in any pii_scrub
  checkpoint's subjectId set) skip the SHA-256 recompute step. Chain
  forward-link via `previousHash` is still validated — the scrub does
  not break order, only blanks the body.
- `scrubSubjectAuditLogs` is in audit_logs/internal_mutations.ts so
  it can use the existing `signCheckpoint` HMAC helper directly.

The processor finalizeProcessing audit log on a 'done' run now
implicitly covers the scrub-marker checkpoint (admins reviewing the
erasure see both rows in the chain). No client-callable surface added
— scrub fires only as part of an authorized requestErasure flow.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant