refactor(platform,cli): uniform org-first config layout#1752

Merged
larryro merged 41 commits into
mainfrom
refactor/uniform-org-first-config-layout
May 29, 2026

@larryro

@larryro larryro commented May 27, 2026

Collaborator

Summary

Collapses three divergent config layout shapes (per-domain, @<org>-prefixed, default-at-root) into one uniform rule: <root>/<org>/<domain>/... for every org, including default. Applies in the live data tree, the builtin catalog, the repo examples/, and the operator's host workspace — default is just another org with the same shape.

  • Convex resolvers (6 domains): single TALE_CONFIG_DIR root; resolveXxxDir(org) = join(root, org, '<domain>') for every org. Drops per-domain env overrides (AGENTS_DIR, WORKFLOWS_DIR, PROVIDERS_DIR, INTEGRATIONS_DIR, SKILLS_DIR) — platform entrypoint unconditionally removes them from the Convex deployment env on every boot.
  • scaffold.ts: default org now scaffold-able; new override arg with per-domain semantics; symlink-hijack defenses + two-phase rename-then-delete in cleanupOrgFilesystem; new reseed_all_orgs.ts internal action with cursor pagination.
  • Repo + builtin: examples/<domain>/ → examples/default/<domain>/; retention collapses to single examples/default/retention.json; Convex Dockerfile becomes a single COPY examples/default/.
  • Platform consumers: branding-images path, config-watcher SSE invalidation, and config_store retention enumeration all retargeted at the new shape (would have silently dropped non-default-org events otherwise).
  • Convex entrypoint: mkdir creates only convex/ + default/; all seed loops retarget /app/builtin/default/<domain> → /app/data/default/<domain>; new atomic_cp helper; marker name carries -orgfirst token so downgrade re-seeds cleanly.
  • CLI: delete the entire lib/upgrade/ auto-migration framework + four importers; tale init scaffolds default/<domain>/... with recursive gitignore globs; tale deploy --override rewrites to 1:1 host→/app/data/ push with org-slug allowlist + reserved-domain-name denylist; new tale deploy --override-all triggers server-side reseed via docker exec -i <platform> bash -s; new tale migrate config-layout [--dry-run] [--cleanup-old] does a copy-not-move migration with sha-verified cleanup.

Behavior change: per-domain env overrides (AGENTS_DIR etc.) are no longer honored. Operators with custom paths must set TALE_CONFIG_DIR to a root and use the <org>/<domain>/ subtree.
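
The uniform rule can be illustrated with a minimal sketch. It is not the PR's resolver code: `resolveDomainDir`, the `Domain` union, and the `/app/data` fallback are assumptions made for illustration, and POSIX path semantics are used because the paths in question live inside the container.

```typescript
import { posix } from "node:path";

// Assumed domain list for illustration; the PR says "6 domains" without
// enumerating them in one place.
type Domain =
  | "agents"
  | "workflows"
  | "providers"
  | "integrations"
  | "skills"
  | "branding";

// Assumed fallback root (the container data dir mentioned in the PR).
const DEFAULT_ROOT = "/app/data";

// Every org, including "default", resolves to <root>/<org>/<domain>.
// Legacy per-domain variables such as AGENTS_DIR are deliberately ignored.
export function resolveDomainDir(
  org: string,
  domain: Domain,
  env: Record<string, string | undefined> = process.env,
): string {
  const root = env.TALE_CONFIG_DIR ?? DEFAULT_ROOT;
  if (!posix.isAbsolute(root)) {
    throw new Error(`TALE_CONFIG_DIR must be absolute, got "${root}"`);
  }
  return posix.join(root, org, domain);
}
```

With `TALE_CONFIG_DIR=/srv/tale`, both `default` and any other org go through the same join, and a leftover `AGENTS_DIR` in the environment has no effect on the result.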

Operator runbook (2 commands, zero downtime)

  1. tale migrate config-layout — copies providers/*.secrets.json (and rest) to new paths; old paths preserved as rollback insurance.
  2. tale deploy --override-all -y — implies --all; recreates convex with new entrypoint, then triggers reseed-all-orgs.
  3. (Optional) tale migrate config-layout --cleanup-old — sha-verifies new == old, then unlinks olds.
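
The copy-then-verify idea behind steps 1 and 3 can be sketched as follows. The PR implements the migration as a shell script piped into the container, so this TypeScript is only an illustration of the logic. `cleanupIfVerified` is a hypothetical name, and SHA-256 is an assumption (the PR only says "sha").

```typescript
import { createHash } from "node:crypto";
import {
  existsSync,
  mkdtempSync,
  readFileSync,
  unlinkSync,
  writeFileSync,
} from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hash a file's full contents (SHA-256 is an assumed choice).
function sha256(path: string): string {
  return createHash("sha256").update(readFileSync(path)).digest("hex");
}

// Remove the old file only when the new copy exists and is byte-identical.
// Returns true only if the old file was actually unlinked.
export function cleanupIfVerified(
  oldPath: string,
  newPath: string,
  dryRun = false,
): boolean {
  if (!existsSync(oldPath) || !existsSync(newPath)) return false;
  if (sha256(oldPath) !== sha256(newPath)) return false;
  if (dryRun) return false;
  unlinkSync(oldPath);
  return true;
}

// Demo in a throwaway temp directory.
const demoDir = mkdtempSync(join(tmpdir(), "layout-demo-"));
const oldFile = join(demoDir, "old.secrets.json");
const newFile = join(demoDir, "new.secrets.json");
writeFileSync(oldFile, '{"key":"a"}');
writeFileSync(newFile, '{"key":"b"}');
const removedOnMismatch = cleanupIfVerified(oldFile, newFile); // contents differ, old file kept
writeFileSync(newFile, '{"key":"a"}');
const removedOnMatch = cleanupIfVerified(oldFile, newFile); // contents match, old file unlinked
```

Because step 1 copies rather than moves, a failed or partial copy leaves the old file in place, and the cleanup refuses to delete anything it cannot prove was migrated.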

Test plan

  • bun run check — 36/36 tasks, 70927 tests, zero lint warnings
  • scaffold.test.ts rewritten (28 cases: override per-domain semantics, symlink defenses, copy-onto-self guard, retention single file, cleanupOrgFilesystem symlink hijack)
  • skills/file_utils.test.ts + branding/queries.test.ts rewritten for org-first
  • retention.test.ts uses new example path
  • Manual: tale migrate config-layout --dry-run against a real legacy workspace
  • Manual: tale deploy --override-all -y against a multi-org deployment, verify all orgs reseed
  • Manual: confirm SSE config-watcher fires for a non-default org file edit

Summary by CodeRabbit

  • New Features

    • Added tale migrate config-layout command to migrate configuration to new directory structure.
    • Added --override-all flag to reseed all organizations from builtin catalog during deploy.
  • Breaking Changes

    • Configuration layout restructured to org-first directory organization.
    • Automatic migrations during startup removed; migration is now opt-in via tale migrate config-layout.
    • Legacy per-domain environment variables (AGENTS_DIR, WORKFLOWS_DIR, etc.) no longer supported.
  • Documentation

    • Updated example file paths across all language documentation.

Collapse three divergent layout shapes (per-domain, @<org>-prefixed,
default-at-root) into one uniform rule: <root>/<org>/<domain>/...
applies to every org including `default`, in the live data tree,
the builtin catalog, the repo `examples/`, and the operator's host
workspace. `default` is just another org — the canonical template,
same shape.

Repo + builtin:
- git mv examples/<domain>/ → examples/default/<domain>/ (retention
  collapses to single examples/default/retention.json; branding moves
  under examples/default/branding/)
- services/convex/Dockerfile: one COPY examples/default/ → /app/builtin/default/
- Sweep load-bearing path strings in tests, GitHub raw URL, retention
  error messages, and docs (en/fr/de)

Convex resolvers (6 domains): single TALE_CONFIG_DIR root;
resolveXxxDir(org) = join(root, org, '<domain>') for every org. Drop
per-domain env overrides (AGENTS_DIR, WORKFLOWS_DIR, PROVIDERS_DIR,
INTEGRATIONS_DIR, SKILLS_DIR) — platform entrypoint now unconditionally
removes them from the Convex deployment env on every boot.

scaffold.ts:
- Default org is scaffold-able (no early-return); source from
  <catalog>/default/<domain> with realpath-based copy-onto-self guard
- New `override` arg with per-domain semantics (flat: per-file overwrite,
  bundle: rm-replace per bundle, tree: per-file recursive, retention:
  single-file copy); always preserves *.secrets.json + .history/
- cleanupOrgFilesystem: lstat symlink-hijack defense, two-phase
  rename-then-delete, dropped force:true on rm to surface real errors,
  removes one <root>/<org>/ subtree instead of per-domain loop
- New reseed_all_orgs.ts internal action — cursor-loop pagination,
  sorted slugs, per-org try/catch, structured per-org return shape
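
The reseed loop described in the last bullet might look roughly like the sketch below. It is synchronous for brevity (a real Convex action would be async), and `fetchPage`, `reseedOrg`, `Page`, and `OrgResult` are hypothetical names, not the action's actual signature.

```typescript
// One page of org slugs plus the cursor for the next page (null when done).
interface Page {
  slugs: string[];
  cursor: string | null;
}

// Structured per-org outcome, so one failing org does not hide the others.
interface OrgResult {
  orgSlug: string;
  ok: boolean;
  error?: string;
}

export function reseedAllOrgs(
  fetchPage: (cursor: string | null) => Page,
  reseedOrg: (orgSlug: string) => void,
): OrgResult[] {
  // Cursor loop: keep fetching until the backend reports no further page.
  const slugs: string[] = [];
  let cursor: string | null = null;
  do {
    const page: Page = fetchPage(cursor);
    slugs.push(...page.slugs);
    cursor = page.cursor;
  } while (cursor !== null);

  // Sorted slugs give a deterministic processing and reporting order.
  slugs.sort();

  // Per-org try/catch: record the failure and continue with the next org.
  return slugs.map((orgSlug): OrgResult => {
    try {
      reseedOrg(orgSlug);
      return { orgSlug, ok: true };
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      return { orgSlug, ok: false, error };
    }
  });
}
```

A caller such as the CLI can then exit non-zero when any entry has `ok: false`, which matches the "non-zero exit on any per-org failure" behavior described for `--override-all`.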

Platform consumers (previously hardcoding the old layout):
- server.ts + vite-plugins/serve-branding-images.ts: branding-images
  path → default/branding/images (would have 404'd post-rewrite)
- lib/config-watcher.ts: parseConfigChange rewritten for
  <org>/<domain>/<rest> shape (SSE invalidation would have silently
  dropped events for non-default orgs otherwise)
- config_store/store.ts: orgFirst option; retention flipped to
  <org>/retention.json with per-org-dir list() enumeration

Bash entrypoint (services/convex/docker-entrypoint.sh):
- mkdir creates only convex/ + default/ (legacy per-domain dirs gone)
- All run_seed loops retargeted source /app/builtin/default/<domain>
  → dest /app/data/default/<domain>; new branding seed loop closes a
  long-standing gap
- atomic_cp helper (cp tmp + mv) for crash safety
- Marker name carries -orgfirst layout token so downgrade re-seeds
  cleanly into legacy paths

CLI (tools/cli/):
- Delete the entire lib/upgrade/ auto-migration framework + four
  importers (deploy/start/update/init); -y/--yes on `tale start`
  kept as hidden no-op + warn-once for one-release back-compat
- tale init: scaffolds default/<domain>/... (was flat); recursive
  gitignore globs (**/.history/, **/*.secrets.json); OpenRouter
  secret lands at default/providers/openrouter.secrets.json
- tale deploy --override: rewrites to 1:1 host→/app/data/ push with
  allowlist filter (org-slug regex + reserved-domain-name denylist
  to detect legacy flat layout). Naive blocklist would have shipped
  .env / .git/ / .tale/ into /app/data/.
- New tale deploy --override-all (implies --all): runs server-side
  reseed via docker exec -i <platform> bash -s into the proven
  scripts/2026-03-28-migrate-convex-data.sh:120-131 pattern; TTY-gated
  confirm; non-zero exit on any per-org failure.
- New tale migrate config-layout [--dry-run] [--cleanup-old]: cp
  (not mv) so old paths stay readable for rollback; baked-in
  script.sh piped to docker exec -i <convex> bash -s; --cleanup-old
  sha-verifies before unlinking
- exec.ts: stdin support for the bash-via-stdin pattern
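
The allowlist filter from the `tale deploy --override` bullet can be sketched as below. The slug regex and the contents of the denylist are assumptions (the commit message names neither), and `planOverridePush` is a hypothetical helper.

```typescript
// Assumed slug shape: lowercase alphanumerics and hyphens.
const ORG_SLUG_RE = /^[a-z0-9][a-z0-9-]*$/;

// Assumed reserved names: a top-level directory with one of these names
// indicates the legacy flat layout rather than an org.
const LEGACY_DOMAIN_DIR_NAMES = new Set([
  "agents",
  "workflows",
  "integrations",
  "branding",
  "providers",
  "skills",
]);

interface PushPlan {
  orgDirs: string[]; // pushed 1:1 into /app/data/
  legacyDirs: string[]; // legacy layout detected, should block the push
  skipped: string[]; // everything else is never shipped
}

export function planOverridePush(topLevelNames: string[]): PushPlan {
  const plan: PushPlan = { orgDirs: [], legacyDirs: [], skipped: [] };
  for (const name of topLevelNames) {
    if (LEGACY_DOMAIN_DIR_NAMES.has(name)) plan.legacyDirs.push(name);
    else if (ORG_SLUG_RE.test(name)) plan.orgDirs.push(name);
    else plan.skipped.push(name);
  }
  return plan;
}
```

Because only names that match the slug pattern are pushed, entries such as `.env`, `.git`, and `.tale` are excluded by default instead of relying on a blocklist that has to anticipate them.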

Tests: rewrote scaffold.test.ts (28 cases incl. override per-domain
semantics, symlink defenses, copy-onto-self guard, retention single
file, cleanupOrgFilesystem symlink hijack); rewrote skills/file_utils.test.ts
and branding/queries.test.ts for org-first; retention.test.ts uses
new example path. embedded-files.ts regenerated. bun run check passes:
36/36 tasks, 70927 tests, zero lint warnings.

Operator runbook (2 commands, zero downtime):
1. tale migrate config-layout — copies providers/*.secrets.json to
   new paths; old paths preserved
2. tale deploy --override-all -y — implies --all; recreates convex
   with new entrypoint, then triggers reseed-all-orgs action
3. (Optional) tale migrate config-layout --cleanup-old — sha-verifies
   new == old, unlinks olds (rollback insurance until then)

Behavior change — per-domain env overrides (AGENTS_DIR etc.) are no
longer honored. Operators with custom paths must set TALE_CONFIG_DIR
to a root and use the <org>/<domain>/ subtree.
@coderabbitai

coderabbitai Bot commented May 27, 2026

Contributor
📝 Walkthrough

This PR comprehensively migrates Tale's on-disk configuration storage from a domain-first layout (e.g., agents/, workflows/, integrations/ at the root) to an org-first layout (e.g., default/agents/, default/workflows/, default/integrations/). The change affects documentation paths, Docker infrastructure initialization, path resolution across all domain modules, the configuration storage framework, organization scaffolding logic, CLI command structure, and migration tooling. A new tale migrate config-layout command supports transitioning existing instances, while tale init now scaffolds new projects under the org-first structure.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • tale-project/tale#1741: Updates org-first skills layout and scaffolding copy logic that directly intersects with this PR's directory refactoring.
  • tale-project/tale#1751: Modifies scaffold.ts and per-domain builtin catalog seeding with symlink-skip handling that parallels this PR's scaffolding refactors.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 36.92% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name | Status | Explanation
Title check | ✅ Passed | The title 'refactor(platform,cli): uniform org-first config layout' clearly and concisely describes the main change: unifying config layout across platform and CLI components into a uniform org-first structure.
Description check | ✅ Passed | The PR description is comprehensive and well-structured, covering summary, behavior changes, operator runbook, and test plan. It includes the required pre-merge checklist items with appropriate markings (checked or N/A).
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.


@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 24

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
services/convex/docker-entrypoint.sh (1)

354-372: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Seed directory bundles atomically too.

atomic_cp() closes the truncation window for JSON files, but these loops still cp -r straight into the final directory and later skip on -d "$dest_dir". If the container dies mid-copy on first boot, the next boot sees the partial bundle and permanently skips reseeding. Copy into a temp sibling and rename into place, the same way the file seeds are protected.

Proposed direction
+atomic_cp_dir() {
+  local src="$1" dest="$2"
+  local tmp="${dest}.tale-seed.$$.tmp"
+  rm -rf "$tmp"
+  cp -R "$src" "$tmp" && mv -f "$tmp" "$dest"
+}
+
   if [ "$FORCE_SEED" = "true" ]; then
     rm -rf "$dest_dir"
-    cp -r "$src_dir" "$dest_dir"; echo "   ✓ Seeded integration $name (forced)"; continue
+    atomic_cp_dir "$src_dir" "$dest_dir"; echo "   ✓ Seeded integration $name (forced)"; continue
   fi
-  if [ -d "$dest_dir" ]; then echo "   ⏭ Skipping integration $name (already exists)"; continue; fi
-  cp -r "$src_dir" "$dest_dir"; echo "   ✓ Seeded integration $name"
+  if [ -d "$dest_dir" ]; then echo "   ⏭ Skipping integration $name (already exists)"; continue; fi
+  atomic_cp_dir "$src_dir" "$dest_dir"; echo "   ✓ Seeded integration $name"

Also applies to: 375-392

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@services/convex/docker-entrypoint.sh` around lines 354 - 372, The integration
bundle seeding copies directly into "$dest_dir" which can leave a partial bundle
and cause future boots to skip reseeding; change the logic around the loop that
iterates "$integrations_builtin"/*/ (and the analogous bundle loop mentioned) to
copy into a temporary sibling directory (use mktemp -d
"$integrations_dir/.$name.tmp.XXXXXX"), perform the recursive copy into that
temp, fsync/ensure completeness, then atomically rename (mv) the temp to
"$dest_dir" (or rm -rf then mv when FORCE_SEED="true"); keep the existing
messages (e.g., "Seeded integration $name") and ensure cleanup of temp on
errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/de/platform/integrations/overview.md`:
- Line 68: The doc currently references the old single-tenant runtime path
string "TALE_CONFIG_DIR/integrations/<slug>/config.json"; update that occurrence
in docs/de/platform/integrations/overview.md to the org-first layout
"TALE_CONFIG_DIR/<org>/integrations/<slug>/config.json" and mention the default
org variant "TALE_CONFIG_DIR/default/integrations/<slug>/config.json" so
self-hosted setups follow the PR contract.

In `@docs/de/self-hosted/configuration/providers.md`:
- Line 34: Update the hyperlink target in
docs/de/self-hosted/configuration/providers.md so the URL points to the new
directory: replace the current link target ending in /examples/providers with
/examples/default/providers; keep the displayed text as-is
(`examples/default/providers/`) and only change the href portion of the Markdown
link in the line containing `openai.json`, `openrouter.json` und
`vercel-gateway.json`.

In `@docs/en/platform/integrations/overview.md`:
- Line 68: Update the documentation to reflect the org-scoped config path:
replace references to TALE_CONFIG_DIR/integrations/<slug>/config.json with the
org-first path TALE_CONFIG_DIR/<org>/integrations/<slug>/config.json (or
TALE_CONFIG_DIR/default/integrations/<slug>/config.json for the default org),
and ensure examples and the sentence referencing examples/default/integrations/
and "Settings > Integrations" consistently mention the org segment so users know
configs are per-org.

In `@docs/en/self-hosted/configuration/providers.md`:
- Line 34: Update the Markdown link so the URL matches the updated link text
"examples/default/providers/": change the href currently pointing to
"examples/providers" to "examples/default/providers" (the visible text
references `examples/default/providers/` and the examples `openai.json`,
`openrouter.json`, and `vercel-gateway.json` should remain the same).

In `@docs/fr/platform/integrations/overview.md`:
- Line 68: Update the documented config path that still uses the legacy
domain-first layout by replacing the string
`TALE_CONFIG_DIR/integrations/<slug>/config.json` with the org-first form
`TALE_CONFIG_DIR/<org>/integrations/<slug>/config.json` (e.g.
`TALE_CONFIG_DIR/default/integrations/<slug>/config.json`) so the sentence and
example reflect the migrated layout; also update any nearby references
mentioning "domain-first" to "org-first" to keep wording consistent with the
migration.

In `@docs/fr/self-hosted/configuration/providers.md`:
- Line 34: The link text in the sentence refers to "examples/default/providers/"
but targets the legacy URL "examples/providers"; update the target URL to point
to the new GitHub path "examples/default/providers" (e.g. replace the href value
currently pointing to
"https://github.com/tale-project/tale/tree/main/examples/providers" with
"https://github.com/tale-project/tale/tree/main/examples/default/providers")
so that the text and the link match (search for the string containing
"examples/default/providers/" and fix the associated URL).

In `@services/convex/docker-entrypoint.sh`:
- Around line 41-47: The entrypoint currently computes
data_dir="${TALE_CONFIG_DIR:-/app/data}" but other places still hardcode
/app/data (so the seed marker, seed paths, tmp dir, PID/log files and backend
storage can get out of sync); update the script to derive and reuse a single
DATA_ROOT (or reuse the existing data_dir variable) everywhere: replace any
literal /app/data usages with the data_dir variable, ensure mkdir -p and chown
use that variable, and update references used by run_seed,
scaffoldNewOrganization, the seed marker path, tmp directory, and PID/log file
locations so they all reference the same data_dir value.

In `@services/platform/convex/lib/config_store/store.ts`:
- Around line 167-168: The code currently swallows all errors from
stat(filePath) which hides permission or filesystem errors; update the catch in
the org enumeration (the list() path that does const info = await
stat(filePath).catch(() => null); if (info?.isFile()) results.push({ orgSlug:
name });) to only convert missing-file errors to null: inspect the thrown error
and if err.code is 'ENOENT' or 'ENOTDIR' return null, otherwise rethrow the
error so permission/broken-symlink/other filesystem issues surface.
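
A possible shape for the narrowed catch, shown with the synchronous `statSync` for brevity (the code under review uses the async `stat`). `isMissingPathError` and `statOrNull` are hypothetical helper names.

```typescript
import { statSync, type Stats } from "node:fs";

// True only for "the path is not there" errors; anything else
// (EACCES, ELOOP, EIO, ...) is a real problem and must surface.
export function isMissingPathError(err: unknown): boolean {
  if (typeof err !== "object" || err === null || !("code" in err)) return false;
  const code = (err as { code?: unknown }).code;
  return code === "ENOENT" || code === "ENOTDIR";
}

// Returns null when the file is absent, rethrows every other failure.
export function statOrNull(filePath: string): Stats | null {
  try {
    return statSync(filePath);
  } catch (err) {
    if (isMissingPathError(err)) return null;
    throw err;
  }
}
```

In the enumeration loop, `statOrNull(filePath)?.isFile()` would then keep the existing "skip orgs without a retention file" behavior while letting permission errors propagate.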

In `@services/platform/convex/organizations/scaffold.test.ts`:
- Around line 417-422: This test currently triggers console.warn; wrap the call
to cleanupHandler inside a console.warn spy similar to the existing
missing-catalog test's console.error spy: create a spy/mock for console.warn
before calling cleanupHandler({} as never, { orgSlug: '../escape' }) and
cleanupHandler({} as never, { orgSlug: 'UPPER' }), assert the spy was called (or
calledTimes as appropriate), then restore/restoreMock the original console.warn
after the assertions; reference the cleanupHandler function and the console.warn
spy in your changes.

In `@services/platform/convex/organizations/scaffold.ts`:
- Around line 318-324: The scaffold catch blocks in scaffoldNewOrganization
currently log and continue (e.g., the block logging `[scaffold] ${domain.name}:
copy failed for org "${orgSlug}"` and the similar block around lines 373-378),
which hides failures from reseedAllOrgsFromBuiltin; change
scaffoldNewOrganization to accumulate per-domain failures and propagate them
(either by throwing a composed Error that lists failed domains/orgSlugs or by
returning a structured result object like { success: boolean, failures:
Array<{domain, orgSlug, error}> }) so reseedAllOrgsFromBuiltin can treat partial
failures as errors; update the two catch blocks to push failure details into
that accumulator (including the original err) instead of merely logging and at
the end either throw when any failures exist or return the failure result for
the caller to handle.
- Around line 336-343: seedRetention currently falls back to empty string for
process.env.TALE_CONFIG_DIR which can cause writes to CWD; replicate the
absolute-root guard used in cleanupOrgFilesystem: validate that
process.env.TALE_CONFIG_DIR is set and absolute before constructing
sourceFile/targetFile (when catalogRoot is falsy) and fail fast (throw or return
an error) if it's missing or invalid; ensure the same check uses
TALE_CONFIG_DIR, catalogRoot, and orgSlug so both sourceFile and targetFile are
built from a validated absolute config root.
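
The fail-fast guard for the retention paths could be sketched like this. `resolveRetentionPaths` is a hypothetical helper, and the fallback of reading the template from `<root>/default/retention.json` when no catalog root is given is an assumption about the intended behavior.

```typescript
import { posix } from "node:path";

interface RetentionPaths {
  sourceFile: string;
  targetFile: string;
}

// Refuse to build any path unless the config root is set and absolute,
// so a missing TALE_CONFIG_DIR can never resolve relative to the CWD.
export function resolveRetentionPaths(
  orgSlug: string,
  configRoot: string | undefined,
  catalogRoot?: string,
): RetentionPaths {
  if (!configRoot || !posix.isAbsolute(configRoot)) {
    throw new Error("TALE_CONFIG_DIR must be set to an absolute path");
  }
  // Assumed fallback: without a catalog root, use the default org's file.
  const sourceRoot = catalogRoot || configRoot;
  return {
    sourceFile: posix.join(sourceRoot, "default", "retention.json"),
    targetFile: posix.join(configRoot, orgSlug, "retention.json"),
  };
}
```

Calling it with `process.env.TALE_CONFIG_DIR` at the top of the seeding function keeps both paths derived from one validated root.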

In `@services/platform/convex/skills/file_utils.test.ts`:
- Around line 24-31: Add a regression test that ensures the removed legacy
SKILLS_DIR override is not used: in
services/platform/convex/skills/file_utils.test.ts create a test which sets both
process.env.TALE_CONFIG_DIR and process.env.SKILLS_DIR (after saving
prevTaleConfigDir/prevSkillsDir) then calls resolveSkillsDir() and asserts the
returned path is derived from TALE_CONFIG_DIR (not SKILLS_DIR); restore
environment in afterEach. This test should mirror the existing
beforeEach/afterEach setup flow and specifically reference resolveSkillsDir,
TALE_CONFIG_DIR and SKILLS_DIR to lock in the intended behavior.

In `@services/platform/docker-entrypoint.sh`:
- Around line 230-251: The purge loop removes legacy keys from Convex
(LEGACY_DOMAIN_VARS / CONVEX_ENV_MAP via bunx convex env remove) but the later
generic environment sync still re-pushes any same-named variables from the
process environment; to fix, ensure those legacy names are filtered out before
the generic sync: either unset them from the shell environment (unset AGENTS_DIR
WORKFLOWS_DIR INTEGRATIONS_DIR PROVIDERS_DIR SKILLS_DIR) right after the purge
loop or modify the sync iterator to skip keys present in
LEGACY_DOMAIN_VARS/CONVEX_ENV_MAP so the “strip on boot” behavior is not undone
by the sync loop.

In `@services/platform/lib/config-watcher.ts`:
- Around line 45-48: The current path-extension guard that returns early for
non-.json files prevents asset updates (e.g., default/branding/images/logo.png
and default/skills/foo/scripts/run.py) from reaching the branding and skills
branches; update the logic in config-watcher.ts so the .json-only short-circuit
is either moved after the domain checks or augmented with exceptions for
branding and skills paths: detect domain === 'branding' (and the skills
invalidation paths) before applying the .json extension filter and emit the
appropriate SSE event objects (e.g., return { type: 'branding', orgSlug } or the
skills invalidation event) for non-JSON assets so image and script changes are
not dropped.
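
One way to order the checks so that asset changes are not dropped is sketched below. Only the `{ type: 'branding', orgSlug }` event shape comes from the comment above; the domain sets, the `retention` event, and the relative-path input are assumptions.

```typescript
export interface ConfigChange {
  type: string;
  orgSlug: string;
}

// Assumed split: these domains only care about JSON files...
const JSON_DOMAINS = new Set(["agents", "workflows", "integrations", "providers"]);
// ...while these also carry non-JSON assets (images, scripts).
const ASSET_DOMAINS = new Set(["branding", "skills"]);

// relPath is relative to the config root, shaped <org>/<domain>/<rest>.
export function parseConfigChange(relPath: string): ConfigChange | null {
  const [orgSlug, domain, ...rest] = relPath.split("/");
  if (!orgSlug || !domain) return null;

  // <org>/retention.json sits directly under the org directory.
  if (rest.length === 0) {
    return domain === "retention.json" ? { type: "retention", orgSlug } : null;
  }

  // Domain check comes first, so asset changes are never filtered out.
  if (ASSET_DOMAINS.has(domain)) return { type: domain, orgSlug };

  // The .json filter only applies to the JSON-only domains.
  if (JSON_DOMAINS.has(domain) && relPath.endsWith(".json")) {
    return { type: domain, orgSlug };
  }
  return null;
}
```

With this ordering, `default/branding/images/logo.png` yields a branding event and `default/skills/foo/scripts/run.py` yields a skills event, while stray non-JSON files under `agents/` are still ignored.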

In `@tools/cli/src/commands/deploy/index.ts`:
- Around line 90-104: The flag handling permits a contradictory invocation where
options.overrideAll is true but services is non-empty, causing deploy(...) to
skip the full stateful restart; update the CLI logic around the call to deploy
(the code building the deploy(...) args) to either (a) enforce the full-deploy
path by clearing/ignoring the services list and setting updateStateful = true
when options.overrideAll is true, or (b) validate and reject the invocation
early with a clear error when options.overrideAll and services are both set;
modify the call site that constructs the deploy(...) parameters (referencing
deploy, updateStateful, services, and options.overrideAll) to implement one of
these two behaviors so --override-all reliably implies a full restart.
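
Option (b), rejecting the contradictory invocation early, might look like the following sketch. `planDeploy`, `DeployFlags`, and `DeployPlan` are hypothetical; only `services`, `updateStateful`, and `overrideAll` are names taken from the comment.

```typescript
interface DeployFlags {
  overrideAll: boolean;
  all: boolean;
  services: string[];
}

interface DeployPlan {
  services: string[];
  updateStateful: boolean;
  reseedAllOrgs: boolean;
}

export function planDeploy(flags: DeployFlags): DeployPlan {
  // --override-all implies --all, so a service list contradicts it.
  if (flags.overrideAll && flags.services.length > 0) {
    throw new Error(
      "--override-all cannot be combined with a service list; it always performs a full deploy",
    );
  }
  const full = flags.overrideAll || flags.all;
  return {
    services: full ? [] : flags.services,
    updateStateful: full,
    reseedAllOrgs: flags.overrideAll,
  };
}
```

Failing before any container is touched keeps the error cheap and makes the implied `--all` explicit to the operator.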

In `@tools/cli/src/lib/actions/deploy.ts`:
- Around line 683-692: The current branches in deploy.ts (e.g., the block
checking legacyDirs and the similar branches around the other noted locations)
log failures and return, causing deploy to exit successfully; change these to
surface hard failures by throwing an Error (or calling a helper like throw new
Error(...)) with a clear message including prefix and context so the command
exits non-zero; update the blocks that currently call logger.error/logger.info
and return (the legacyDirs check and the other two similar branches referenced)
to throw instead and ensure any callers of the surrounding function will
propagate the exception to fail the CLI.
- Around line 610-624: The legacy-layout guard (LEGACY_DOMAIN_DIR_NAMES) misses
the root file retention.json so projects with only that file slip through;
update the check in deploy.ts that uses LEGACY_DOMAIN_DIR_NAMES to also detect a
root "retention.json" (either by adding "retention.json" to the legacy names set
or adding a separate existence check for that filename) and make the same
error/early-exit flow trigger (pointing the operator at `tale init --force`)
when retention.json is present so the legacy workspace is refused from pushing;
reference the constant LEGACY_DOMAIN_DIR_NAMES and the surrounding
legacy-directory guard logic to locate where to add the file check.
- Around line 634-657: findOrgDirs currently treats any slug-shaped top-level
directory as an org; change it so after the isValidOrgSlug(name) check you
verify the directory contains at least one allowed config child (agents,
workflows, integrations, branding, providers, skills) or the file retention.json
before pushing to orgDirs. Use readdir/stat on the candidate directory (the
variable name) to inspect its entries, skip hidden files, and only push name
into orgDirs if any of those allowed child names exist; preserve the existing
LEGACY_DOMAIN_DIR_NAMES handling and return behavior when readdir fails.
- Around line 148-149: Remove the outdated comment line referencing the removed
auto-migration framework ("(Auto-migration framework removed — `tale migrate
config-layout` is the only opt-in, manually-run migration now.)"); locate this
stray comment by searching for the exact phrase `tale migrate config-layout` in
tools/cli/src/lib/actions/deploy.ts and delete it so no source comments describe
removed behavior (leave surrounding code unchanged).
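
For the `findOrgDirs` item above, a stricter org detection could be sketched as follows. It is synchronous for brevity, and the slug regex, the `ALLOWED_CHILDREN` set, and the demo directory names are assumptions.

```typescript
import { mkdirSync, mkdtempSync, readdirSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Assumed slug shape: lowercase alphanumerics and hyphens.
const ORG_SLUG_RE = /^[a-z0-9][a-z0-9-]*$/;

// A directory only counts as an org if it holds at least one of these.
const ALLOWED_CHILDREN = new Set([
  "agents", "workflows", "integrations", "branding", "providers", "skills",
  "retention.json",
]);

// Unreadable or missing directories are treated as empty.
function listDir(dir: string) {
  try {
    return readdirSync(dir, { withFileTypes: true });
  } catch {
    return [];
  }
}

export function findOrgDirs(root: string): string[] {
  return listDir(root)
    .filter((entry) => entry.isDirectory() && ORG_SLUG_RE.test(entry.name))
    .filter((entry) =>
      listDir(join(root, entry.name)).some((child) => ALLOWED_CHILDREN.has(child.name)),
    )
    .map((entry) => entry.name)
    .sort();
}

// Demo workspace: two real orgs, one slug-shaped build dir, one hidden dir.
const demoRoot = mkdtempSync(join(tmpdir(), "orgdirs-demo-"));
mkdirSync(join(demoRoot, "default", "agents"), { recursive: true });
mkdirSync(join(demoRoot, "acme"));
writeFileSync(join(demoRoot, "acme", "retention.json"), "{}");
mkdirSync(join(demoRoot, "dist"));
writeFileSync(join(demoRoot, "dist", "bundle.js"), "");
mkdirSync(join(demoRoot, ".git"));
const foundOrgs = findOrgDirs(demoRoot); // "dist" and ".git" are not treated as orgs
```

Hidden children never match the allowed set, so a directory containing only dotfiles is skipped without a separate check.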

In `@tools/cli/src/lib/actions/init.ts`:
- Around line 254-256: Remove the explanatory note about removed code that
mentions "`.tale/migrations.json seeding removed alongside the auto-migration
framework. Existing projects' stale files are harmless and can be deleted
manually.`" — simply delete that comment line(s) from init.ts so no removed-code
explanation remains in source; the change is to edit the comment block
containing that string (near the init flow) and remove the entire removed-code
note, leaving any remaining relevant comments intact.

In `@tools/cli/src/lib/actions/migrate-config-layout.ts`:
- Around line 29-32: Remove the unused exported interface
MigrateConfigLayoutOptions to satisfy the linter: either delete the entire
interface declaration or make it non-exported (remove the leading export) if the
local code still references it; ensure no other modules import
MigrateConfigLayoutOptions and update any local type references accordingly so
the code still compiles.

In `@tools/cli/src/lib/actions/reseed-all-orgs.ts`:
- Around line 21-24: The exported interface ReseedAllOrgsOptions is currently
unnecessary as a public API; remove the export keyword from the
ReseedAllOrgsOptions declaration to make it file-local and update any references
within the same file (e.g., function parameters or type annotations that mention
ReseedAllOrgsOptions) to continue using the now-unexported interface so the code
compiles and intent remains internal.
- Around line 76-86: The dry-run path calls findPlatformContainer()
unconditionally which can fail on fresh hosts; update reseed-all-orgs.ts so the
dry-run short-circuit happens before resolve of the live container: either move
the if (dryRun) { ... return } block to run before calling
findPlatformContainer(), or change the container binding to use a conditional
(e.g., container = dryRun ? '<container>' : await findPlatformContainer()) so
RESEED_SCRIPT and the dry-run logger use a safe placeholder instead of resolving
a real container; key symbols: findPlatformContainer, dryRun, RESEED_SCRIPT.
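
The dry-run short-circuit can be illustrated with a small sketch. It is synchronous for brevity, whereas the real `findPlatformContainer` is awaited; `buildReseedCommand` is a hypothetical helper, and the argv mirrors the `docker exec -i <platform> bash -s` pattern from the PR description.

```typescript
// Build the argv used to pipe the reseed script into the platform container.
// In dry-run mode the container lookup is skipped entirely, so the command
// can be previewed on a host where no container is running yet.
export function buildReseedCommand(
  dryRun: boolean,
  findPlatformContainer: () => string,
): string[] {
  const container = dryRun ? "<container>" : findPlatformContainer();
  return ["docker", "exec", "-i", container, "bash", "-s"];
}
```

Because the conditional wraps the lookup itself, a failing lookup can only affect real runs, which is where the failure is meaningful.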

In `@tools/cli/src/lib/actions/update.ts`:
- Around line 207-208: Remove the inline comment "(Auto-migration planning
removed — `tale migrate config-layout` is the only opt-in, manually-run
migration now; operators invoke it directly.)" from the source in update.ts;
this is deleted behavior and should be dropped from runtime comments (leave no
replacement comment).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4a520c73-5246-4932-9f5c-084395a2c43e

📥 Commits

Reviewing files that changed from the base of the PR and between 7bae786 and 5deaa5a.

⛔ Files ignored due to path filters (18)
  • examples/default/integrations/ai-image/icon.svg is excluded by !**/*.svg
  • examples/default/integrations/circuly/icon.svg is excluded by !**/*.svg
  • examples/default/integrations/discord/icon.svg is excluded by !**/*.svg
  • examples/default/integrations/github/icon.svg is excluded by !**/*.svg
  • examples/default/integrations/gmail/icon.svg is excluded by !**/*.svg
  • examples/default/integrations/google_drive/icon.svg is excluded by !**/*.svg
  • examples/default/integrations/outlook/icon.svg is excluded by !**/*.svg
  • examples/default/integrations/protel/icon.svg is excluded by !**/*.svg
  • examples/default/integrations/shopify/icon.svg is excluded by !**/*.svg
  • examples/default/integrations/slack/icon.svg is excluded by !**/*.svg
  • examples/default/integrations/tavily/icon.svg is excluded by !**/*.svg
  • examples/default/integrations/teams/icon.svg is excluded by !**/*.svg
  • examples/default/integrations/twilio/icon.svg is excluded by !**/*.svg
  • services/platform/convex/_generated/api.d.ts is excluded by !**/_generated/**
  • tools/cli/src/lib/upgrade/migrations/adopt-convex-stateful.ts is excluded by !**/migrations/**
  • tools/cli/src/lib/upgrade/migrations/namespace-caddy-config.ts is excluded by !**/migrations/**
  • tools/cli/src/lib/upgrade/migrations/namespace-volumes.ts is excluded by !**/migrations/**
  • tools/cli/src/lib/upgrade/migrations/split-convex.ts is excluded by !**/migrations/**
📒 Files selected for processing (173)
  • .dockerignore
  • docs/de/develop/integrations.md
  • docs/de/platform/integrations/overview.md
  • docs/de/platform/models.md
  • docs/de/self-hosted/configuration/providers.md
  • docs/en/develop/integrations.md
  • docs/en/platform/integrations/overview.md
  • docs/en/platform/models.md
  • docs/en/self-hosted/configuration/providers.md
  • docs/fr/develop/integrations.md
  • docs/fr/platform/integrations/overview.md
  • docs/fr/platform/models.md
  • docs/fr/self-hosted/configuration/providers.md
  • examples/default/agents/chat-agent.json
  • examples/default/agents/crm-assistant.json
  • examples/default/agents/image-creator.json
  • examples/default/agents/integration-assistant.json
  • examples/default/agents/researcher.json
  • examples/default/agents/translator.json
  • examples/default/agents/workflow-assistant.json
  • examples/default/branding/branding.json
  • examples/default/integrations/ai-image/config.json
  • examples/default/integrations/ai-image/connector.ts
  • examples/default/integrations/circuly/config.json
  • examples/default/integrations/circuly/connector.ts
  • examples/default/integrations/discord/config.json
  • examples/default/integrations/discord/connector.ts
  • examples/default/integrations/github/config.json
  • examples/default/integrations/github/connector.ts
  • examples/default/integrations/gmail/config.json
  • examples/default/integrations/gmail/connector.ts
  • examples/default/integrations/google_drive/config.json
  • examples/default/integrations/google_drive/connector.ts
  • examples/default/integrations/outlook/config.json
  • examples/default/integrations/outlook/connector.ts
  • examples/default/integrations/protel/config.json
  • examples/default/integrations/shopify/config.json
  • examples/default/integrations/shopify/connector.ts
  • examples/default/integrations/slack/config.json
  • examples/default/integrations/slack/connector.ts
  • examples/default/integrations/tavily/config.json
  • examples/default/integrations/tavily/connector.ts
  • examples/default/integrations/teams/config.json
  • examples/default/integrations/teams/connector.ts
  • examples/default/integrations/twilio/config.json
  • examples/default/integrations/twilio/connector.ts
  • examples/default/providers/openai.json
  • examples/default/providers/openrouter.json
  • examples/default/providers/vercel-gateway.json
  • examples/default/retention.json
  • examples/default/skills/pptx/LICENSE.txt
  • examples/default/skills/pptx/SKILL.md
  • examples/default/skills/pptx/editing.md
  • examples/default/skills/pptx/pptxgenjs.md
  • examples/default/skills/pptx/scripts/__init__.py
  • examples/default/skills/pptx/scripts/add_slide.py
  • examples/default/skills/pptx/scripts/clean.py
  • examples/default/skills/pptx/scripts/office/helpers/__init__.py
  • examples/default/skills/pptx/scripts/office/helpers/merge_runs.py
  • examples/default/skills/pptx/scripts/office/helpers/simplify_redlines.py
  • examples/default/skills/pptx/scripts/office/pack.py
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/dml-chart.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/dml-chartDrawing.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/dml-diagram.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/dml-lockedCanvas.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/dml-main.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/dml-picture.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/dml-spreadsheetDrawing.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/dml-wordprocessingDrawing.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/pml.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/shared-additionalCharacteristics.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/shared-bibliography.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/shared-commonSimpleTypes.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/shared-customXmlDataProperties.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/shared-customXmlSchemaProperties.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesCustom.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesExtended.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/shared-documentPropertiesVariantTypes.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/shared-math.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/shared-relationshipReference.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/sml.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/vml-main.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/vml-officeDrawing.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/vml-presentationDrawing.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/vml-spreadsheetDrawing.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/vml-wordprocessingDrawing.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/wml.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ISO-IEC29500-4_2016/xml.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ecma/fouth-edition/opc-contentTypes.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ecma/fouth-edition/opc-coreProperties.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ecma/fouth-edition/opc-digSig.xsd
  • examples/default/skills/pptx/scripts/office/schemas/ecma/fouth-edition/opc-relationships.xsd
  • examples/default/skills/pptx/scripts/office/schemas/mce/mc.xsd
  • examples/default/skills/pptx/scripts/office/schemas/microsoft/wml-2010.xsd
  • examples/default/skills/pptx/scripts/office/schemas/microsoft/wml-2012.xsd
  • examples/default/skills/pptx/scripts/office/schemas/microsoft/wml-2018.xsd
  • examples/default/skills/pptx/scripts/office/schemas/microsoft/wml-cex-2018.xsd
  • examples/default/skills/pptx/scripts/office/schemas/microsoft/wml-cid-2016.xsd
  • examples/default/skills/pptx/scripts/office/schemas/microsoft/wml-sdtdatahash-2020.xsd
  • examples/default/skills/pptx/scripts/office/schemas/microsoft/wml-symex-2015.xsd
  • examples/default/skills/pptx/scripts/office/soffice.py
  • examples/default/skills/pptx/scripts/office/unpack.py
  • examples/default/skills/pptx/scripts/office/validate.py
  • examples/default/skills/pptx/scripts/office/validators/__init__.py
  • examples/default/skills/pptx/scripts/office/validators/base.py
  • examples/default/skills/pptx/scripts/office/validators/docx.py
  • examples/default/skills/pptx/scripts/office/validators/pptx.py
  • examples/default/skills/pptx/scripts/office/validators/redlining.py
  • examples/default/skills/pptx/scripts/thumbnail.py
  • examples/default/workflows/circuly/sync-customers.json
  • examples/default/workflows/circuly/sync-products.json
  • examples/default/workflows/circuly/sync-subscriptions.json
  • examples/default/workflows/general/conversation-auto-archive.json
  • examples/default/workflows/general/conversation-sync.json
  • examples/default/workflows/general/customer-status-assessment.json
  • examples/default/workflows/general/document-rag-sync.json
  • examples/default/workflows/general/product-relationship-analysis.json
  • examples/default/workflows/gmail/email-sync.json
  • examples/default/workflows/google_drive/sync.json
  • examples/default/workflows/onedrive/sync.json
  • examples/default/workflows/outlook/email-sync.json
  • examples/default/workflows/shopify/sync-customers.json
  • examples/default/workflows/shopify/sync-products.json
  • services/convex/Dockerfile
  • services/convex/docker-entrypoint.sh
  • services/platform/app/features/settings/integrations/components/integration-upload/constants/integration-templates.ts
  • services/platform/convex/agents/file_utils.ts
  • services/platform/convex/branding/file_actions.ts
  • services/platform/convex/branding/file_utils.ts
  • services/platform/convex/branding/queries.test.ts
  • services/platform/convex/governance/retention_actions.ts
  • services/platform/convex/governance/retention_bounds_proposal.ts
  • services/platform/convex/governance/retention_floors.ts
  • services/platform/convex/integrations/file_utils.ts
  • services/platform/convex/lib/config_store/actions.ts
  • services/platform/convex/lib/config_store/store.ts
  • services/platform/convex/node_only/integration_sandbox/gmail_draft_filtering.test.ts
  • services/platform/convex/node_only/integration_sandbox/outlook_draft_filtering.test.ts
  • services/platform/convex/organizations/reseed_all_orgs.ts
  • services/platform/convex/organizations/scaffold.test.ts
  • services/platform/convex/organizations/scaffold.ts
  • services/platform/convex/providers/file_utils.ts
  • services/platform/convex/skills/file_actions.ts
  • services/platform/convex/skills/file_utils.test.ts
  • services/platform/convex/skills/file_utils.ts
  • services/platform/convex/workflows/file_utils.ts
  • services/platform/docker-entrypoint.sh
  • services/platform/env.sh
  • services/platform/lib/config-watcher.ts
  • services/platform/lib/shared/schemas/governance.ts
  • services/platform/lib/shared/schemas/retention.test.ts
  • services/platform/lib/shared/utils/example-agents-normalized.test.ts
  • services/platform/server.ts
  • services/platform/vite-plugins/serve-branding-images.ts
  • tools/cli/src/commands/deploy/index.ts
  • tools/cli/src/commands/migrate.ts
  • tools/cli/src/commands/start/index.ts
  • tools/cli/src/index.ts
  • tools/cli/src/lib/actions/deploy.ts
  • tools/cli/src/lib/actions/init.ts
  • tools/cli/src/lib/actions/migrate-config-layout.ts
  • tools/cli/src/lib/actions/reseed-all-orgs.ts
  • tools/cli/src/lib/actions/start.ts
  • tools/cli/src/lib/actions/update.ts
  • tools/cli/src/lib/docker/exec.ts
  • tools/cli/src/lib/migrate-config-layout/script.sh
  • tools/cli/src/lib/project/fetch-reference.ts
  • tools/cli/src/lib/upgrade/registry.ts
  • tools/cli/src/lib/upgrade/runner.test.ts
  • tools/cli/src/lib/upgrade/runner.ts
  • tools/cli/src/lib/upgrade/state.ts
  • tools/cli/src/lib/upgrade/types.ts
  • tools/cli/src/lib/upgrade/volume-helpers.ts
💤 Files with no reviewable changes (6)
  • tools/cli/src/lib/upgrade/types.ts
  • tools/cli/src/lib/upgrade/state.ts
  • tools/cli/src/lib/upgrade/runner.test.ts
  • tools/cli/src/lib/upgrade/registry.ts
  • tools/cli/src/lib/upgrade/runner.ts
  • tools/cli/src/lib/upgrade/volume-helpers.ts

## Eine eigene Integration hinzufügen

-Eigene Integrationen folgen derselben JSON-Form wie die oben. Leg eine Konfiguration in `TALE_CONFIG_DIR/integrations/<slug>/config.json` ab, die die Operationen, die Auth-Methode und die erlaubten Hosts deklariert; die Integration erscheint in **Einstellungen > Integrationen**, damit User sie verbinden können. Die Form und die Validierungsregeln leben neben den ausgelieferten Konfigurationen in `examples/integrations/`.
+Eigene Integrationen folgen derselben JSON-Form wie die oben. Leg eine Konfiguration in `TALE_CONFIG_DIR/integrations/<slug>/config.json` ab, die die Operationen, die Auth-Methode und die erlaubten Hosts deklariert; die Integration erscheint in **Einstellungen > Integrationen**, damit User sie verbinden können. Die Form und die Validierungsregeln leben neben den ausgelieferten Konfigurationen in `examples/default/integrations/`.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix documented runtime path to org-first layout.

Line 68 still points to TALE_CONFIG_DIR/integrations/<slug>/config.json, but this PR’s contract is org-first. This will misconfigure self-hosted setups after migration; document it as TALE_CONFIG_DIR/<org>/integrations/<slug>/config.json (or .../default/... for the default org).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/de/platform/integrations/overview.md` at line 68, The doc currently
references the old single-tenant runtime path string
"TALE_CONFIG_DIR/integrations/<slug>/config.json"; update that occurrence in
docs/de/platform/integrations/overview.md to the org-first layout
"TALE_CONFIG_DIR/<org>/integrations/<slug>/config.json" and mention the default
org variant "TALE_CONFIG_DIR/default/integrations/<slug>/config.json" so
self-hosted setups follow the PR contract.
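
The org-first rule this comment refers to can be sketched as a small path helper. This is a minimal illustration, assuming POSIX paths and a conservative slug pattern; `resolveIntegrationConfigPath` and `SLUG` are hypothetical names for this sketch, not identifiers taken from the PR.

```typescript
import { posix } from 'node:path';

// Assumed slug rule for illustration only; the real allowlist may differ.
const SLUG = /^[a-z0-9][a-z0-9_-]*$/;

// Builds <root>/<org>/integrations/<slug>/config.json for any org,
// treating "default" exactly like every other org.
function resolveIntegrationConfigPath(
  root: string,
  org: string,
  slug: string,
): string {
  if (!posix.isAbsolute(root)) {
    throw new Error(`config root must be an absolute path: ${root}`);
  }
  if (!SLUG.test(org)) {
    throw new Error(`invalid org slug: ${org}`);
  }
  if (!SLUG.test(slug)) {
    throw new Error(`invalid integration slug: ${slug}`);
  }
  return posix.join(root, org, 'integrations', slug, 'config.json');
}
```

With a root of `/app/data`, the default org and a hypothetical org `acme` resolve to sibling directories of the same shape, which is the property the documentation should describe.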

```

-Die vollständige Menge der Felder lebt in [`examples/providers/`](https://github.com/tale-project/tale/tree/main/examples/providers) — `openai.json`, `openrouter.json` und `vercel-gateway.json` decken die drei Formen ab, die du wahrscheinlich brauchst.
+Die vollständige Menge der Felder lebt in [`examples/default/providers/`](https://github.com/tale-project/tale/tree/main/examples/providers) — `openai.json`, `openrouter.json` und `vercel-gateway.json` decken die drei Formen ab, die du wahrscheinlich brauchst.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the hyperlink target to match the new path.

The displayed path is examples/default/providers/, but the link still targets .../examples/providers. Please update the URL to .../examples/default/providers to avoid sending readers to the old location.

🤖 Prompt for AI Agents

In `@docs/de/self-hosted/configuration/providers.md` at line 34, Update the
hyperlink target in docs/de/self-hosted/configuration/providers.md so the URL
points to the new directory: replace the current link target ending in
/examples/providers with /examples/default/providers; keep the displayed text
as-is (`examples/default/providers/`) and only change the href portion of the
Markdown link in the line containing `openai.json`, `openrouter.json` und
`vercel-gateway.json`.

## Adding a custom integration

-Custom integrations follow the same JSON shape as the ones above. Drop a config into `TALE_CONFIG_DIR/integrations/<slug>/config.json` declaring the operations, auth method, and allowed hosts; the integration appears in **Settings > Integrations** for users to connect. The shape and validation rules live alongside the shipped configs in `examples/integrations/`.
+Custom integrations follow the same JSON shape as the ones above. Drop a config into `TALE_CONFIG_DIR/integrations/<slug>/config.json` declaring the operations, auth method, and allowed hosts; the integration appears in **Settings > Integrations** for users to connect. The shape and validation rules live alongside the shipped configs in `examples/default/integrations/`.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Document the org segment in custom integration path.

The path still uses TALE_CONFIG_DIR/integrations/<slug>/config.json. Under this migration it should be org-first, e.g. TALE_CONFIG_DIR/<org>/integrations/<slug>/config.json (or default org explicitly).

🤖 Prompt for AI Agents

In `@docs/en/platform/integrations/overview.md` at line 68, Update the
documentation to reflect the org-scoped config path: replace references to
TALE_CONFIG_DIR/integrations/<slug>/config.json with the org-first path
TALE_CONFIG_DIR/<org>/integrations/<slug>/config.json (or
TALE_CONFIG_DIR/default/integrations/<slug>/config.json for the default org),
and ensure examples and the sentence referencing examples/default/integrations/
and "Settings > Integrations" consistently mention the org segment so users know
configs are per-org.

```

-The full set of fields lives in [`examples/providers/`](https://github.com/tale-project/tale/tree/main/examples/providers) — `openai.json`, `openrouter.json`, and `vercel-gateway.json` cover the three shapes you are likely to need.
+The full set of fields lives in [`examples/default/providers/`](https://github.com/tale-project/tale/tree/main/examples/providers) — `openai.json`, `openrouter.json`, and `vercel-gateway.json` cover the three shapes you are likely to need.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix mismatched GitHub link target.

The link label was updated to examples/default/providers/, but the URL still points to examples/providers. This will send readers to the legacy path.

Proposed fix
-The full set of fields lives in [`examples/default/providers/`](https://github.com/tale-project/tale/tree/main/examples/providers) — `openai.json`, `openrouter.json`, and `vercel-gateway.json` cover the three shapes you are likely to need.
+The full set of fields lives in [`examples/default/providers/`](https://github.com/tale-project/tale/tree/main/examples/default/providers) — `openai.json`, `openrouter.json`, and `vercel-gateway.json` cover the three shapes you are likely to need.
🤖 Prompt for AI Agents

In `@docs/en/self-hosted/configuration/providers.md` at line 34, Update the
Markdown link so the URL matches the updated link text
"examples/default/providers/": change the href currently pointing to
"examples/providers" to "examples/default/providers" (the visible text
references `examples/default/providers/` and the examples `openai.json`,
`openrouter.json`, and `vercel-gateway.json` should remain the same).
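
A label/URL mismatch of this kind could be caught mechanically. The following is a rough sketch, assuming links whose label is a backticked repository path; `findPathLinkMismatches` is a hypothetical helper and not part of the repository's tooling.

```typescript
interface LinkMismatch {
  label: string;
  href: string;
}

// Reports Markdown links of the form [`some/path/`](url) whose URL does not
// end with the path shown in the label (trailing slashes are ignored).
function findPathLinkMismatches(markdown: string): LinkMismatch[] {
  const link = /\[`([^`\]]+)`\]\(([^)\s]+)\)/g;
  const mismatches: LinkMismatch[] = [];
  let match: RegExpExecArray | null;
  while ((match = link.exec(markdown)) !== null) {
    const label = match[1].replace(/\/+$/, '');
    const href = match[2].replace(/\/+$/, '');
    if (!href.endsWith('/' + label)) {
      mismatches.push({ label: match[1], href: match[2] });
    }
  }
  return mismatches;
}
```

Run against the line flagged here, the helper would report the `examples/default/providers/` label paired with the legacy `examples/providers` URL; after the proposed fix it would report nothing.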

## Ajouter une intégration personnalisée

-Les intégrations personnalisées suivent la même forme JSON que celles ci-dessus. Dépose une configuration dans `TALE_CONFIG_DIR/integrations/<slug>/config.json` déclarant les opérations, la méthode d'auth et les hôtes autorisés ; l'intégration apparaît sous **Paramètres > Intégrations** pour que les utilisateurs la connectent. La forme et les règles de validation vivent à côté des configurations livrées dans `examples/integrations/`.
+Les intégrations personnalisées suivent la même forme JSON que celles ci-dessus. Dépose une configuration dans `TALE_CONFIG_DIR/integrations/<slug>/config.json` déclarant les opérations, la méthode d'auth et les hôtes autorisés ; l'intégration apparaît sous **Paramètres > Intégrations** pour que les utilisateurs la connectent. La forme et les règles de validation vivent à côté des configurations livrées dans `examples/default/integrations/`.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update custom integration path to org-first layout.

This line still documents the legacy domain-first path (TALE_CONFIG_DIR/integrations/...). It should use org-first (TALE_CONFIG_DIR/<org>/integrations/..., e.g. TALE_CONFIG_DIR/default/integrations/...) to match the migration.

Proposed fix
-Les intégrations personnalisées suivent la même forme JSON que celles ci-dessus. Dépose une configuration dans `TALE_CONFIG_DIR/integrations/<slug>/config.json` déclarant les opérations, la méthode d'auth et les hôtes autorisés ; l'intégration apparaît sous **Paramètres > Intégrations** pour que les utilisateurs la connectent. La forme et les règles de validation vivent à côté des configurations livrées dans `examples/default/integrations/`.
+Les intégrations personnalisées suivent la même forme JSON que celles ci-dessus. Dépose une configuration dans `TALE_CONFIG_DIR/<org>/integrations/<slug>/config.json` (par exemple `TALE_CONFIG_DIR/default/integrations/<slug>/config.json`) déclarant les opérations, la méthode d'auth et les hôtes autorisés ; l'intégration apparaît sous **Paramètres > Intégrations** pour que les utilisateurs la connectent. La forme et les règles de validation vivent à côté des configurations livrées dans `examples/default/integrations/`.
🧰 Tools
🪛 LanguageTool

[typographical] ~68-~68: Caractère d’apostrophe incorrect.
Context: ... à côté des configurations livrées dans examples/default/integrations/. Pour des ponts plus riches ou auto-hé...

(APOS_INCORRECT)

🤖 Prompt for AI Agents

In `@docs/fr/platform/integrations/overview.md` at line 68, Update the documented
config path that still uses the legacy domain-first layout by replacing the
string `TALE_CONFIG_DIR/integrations/<slug>/config.json` with the org-first form
`TALE_CONFIG_DIR/<org>/integrations/<slug>/config.json` (e.g.
`TALE_CONFIG_DIR/default/integrations/<slug>/config.json`) so the sentence and
example reflect the migrated layout; also update any nearby references
mentioning "domain-first" to "org-first" to keep wording consistent with the
migration.

Comment on lines +254 to +256
// (`.tale/migrations.json` seeding removed alongside the auto-migration
// framework. Existing projects' stale files are harmless and can be
// deleted manually.)

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Delete the removed-code note.

This comment preserves deletion history in the source file, which the repo guidelines explicitly disallow.

As per coding guidelines, "No comments explaining what was removed. Removed code is gone; git log is the record."

🤖 Prompt for AI Agents

In `@tools/cli/src/lib/actions/init.ts` around lines 254 - 256, Remove the
explanatory note about removed code that mentions "`.tale/migrations.json
seeding removed alongside the auto-migration framework. Existing projects' stale
files are harmless and can be deleted manually.`" — simply delete that comment
line(s) from init.ts so no removed-code explanation remains in source; the
change is to edit the comment block containing that string (near the init flow)
and remove the entire removed-code note, leaving any remaining relevant comments
intact.

Comment on lines +29 to +32
export interface MigrateConfigLayoutOptions {
  dryRun: boolean;
  cleanupOld: boolean;
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Drop the unused exported interface to unblock lint.

knip is failing because MigrateConfigLayoutOptions is exported but unused outside this module.

Suggested change
-export interface MigrateConfigLayoutOptions {
+interface MigrateConfigLayoutOptions {
   dryRun: boolean;
   cleanupOld: boolean;
 }
🧰 Tools
🪛 GitHub Actions: Lint / 2_Knip.txt

[error] 29-29: knip: Unused exported types detected. Unused exported type: MigrateConfigLayoutOptions (interface) at tools/cli/src/lib/actions/migrate-config-layout.ts:29:18

🪛 GitHub Actions: Lint / Knip

[warning] 29-29: knip: Unused exported type/interface detected: MigrateConfigLayoutOptions (Unused exported types).

🤖 Prompt for AI Agents

In `@tools/cli/src/lib/actions/migrate-config-layout.ts` around lines 29 - 32,
Remove the unused exported interface MigrateConfigLayoutOptions to satisfy the
linter: either delete the entire interface declaration or make it non-exported
(remove the leading export) if the local code still references it; ensure no
other modules import MigrateConfigLayoutOptions and update any local type
references accordingly so the code still compiles.

Comment on lines +21 to +24
export interface ReseedAllOrgsOptions {
  dryRun: boolean;
  assumeYes: boolean;
}

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make ReseedAllOrgsOptions file-local.

Knip is already failing on this exported interface, and the provided diff shows no external consumer. Keeping it exported turns an internal parameter shape into a public API for no benefit.

🧰 Tools
🪛 GitHub Actions: Lint / 2_Knip.txt

[error] 21-21: knip: Unused exported types detected. Unused exported type: ReseedAllOrgsOptions (interface) at tools/cli/src/lib/actions/reseed-all-orgs.ts:21:18

🪛 GitHub Actions: Lint / Knip

[warning] 21-21: knip: Unused exported type/interface detected: ReseedAllOrgsOptions (Unused exported types).

🤖 Prompt for AI Agents

In `@tools/cli/src/lib/actions/reseed-all-orgs.ts` around lines 21 - 24, The
exported interface ReseedAllOrgsOptions is currently unnecessary as a public
API; remove the export keyword from the ReseedAllOrgsOptions declaration to make
it file-local and update any references within the same file (e.g., function
parameters or type annotations that mention ReseedAllOrgsOptions) to continue
using the now-unexported interface so the code compiles and intent remains
internal.

Comment on lines +76 to +86
const container = await findPlatformContainer();

if (dryRun) {
  logger.blank();
  logger.info('[DRY-RUN] Would run:');
  logger.info(` docker exec ${container} bash -lc '<reseed script>'`);
  logger.info('Reseed script body (would be piped into bash):');
  for (const line of RESEED_SCRIPT.split('\n')) {
    logger.info(` ${line}`);
  }
  return;

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Short-circuit dry-run before resolving a live platform container.

deploy({ dryRun: true, overrideAll: true }) can reach this path before any container exists. Because findPlatformContainer() runs unconditionally, the preview can fail on a fresh host instead of printing the dry-run output. Resolve the container only on the execution path, or fall back to a placeholder name in dry-run mode.

🤖 Prompt for AI Agents

In `@tools/cli/src/lib/actions/reseed-all-orgs.ts` around lines 76 - 86, The
dry-run path calls findPlatformContainer() unconditionally which can fail on
fresh hosts; update reseed-all-orgs.ts so the dry-run short-circuit happens
before resolve of the live container: either move the if (dryRun) { ... return }
block to run before calling findPlatformContainer(), or change the container
binding to use a conditional (e.g., container = dryRun ? '<container>' : await
findPlatformContainer()) so RESEED_SCRIPT and the dry-run logger use a safe
placeholder instead of resolving a real container; key symbols:
findPlatformContainer, dryRun, RESEED_SCRIPT.
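
One way the suggested reordering might look is sketched below. The dependencies are injected so the behavior can be exercised without Docker; `ReseedDeps`, `reseedAllOrgsSketch`, and the `<container>` placeholder are assumptions made for this illustration, not the PR's actual code.

```typescript
// Hypothetical dependency bundle so the sketch runs without a live host.
interface ReseedDeps {
  findPlatformContainer: () => Promise<string>;
  exec: (container: string, script: string) => Promise<void>;
  log: (line: string) => void;
}

async function reseedAllOrgsSketch(
  options: { dryRun: boolean },
  script: string,
  deps: ReseedDeps,
): Promise<void> {
  if (options.dryRun) {
    // Dry-run returns before any container lookup, so a fresh host with no
    // platform container still gets a preview instead of an error.
    deps.log('[DRY-RUN] Would run:');
    deps.log(" docker exec <container> bash -lc '<reseed script>'");
    deps.log('Reseed script body (would be piped into bash):');
    for (const line of script.split('\n')) {
      deps.log(` ${line}`);
    }
    return;
  }

  // Only the execution path resolves the live container.
  const container = await deps.findPlatformContainer();
  await deps.exec(container, script);
}
```

Moving the guard first, rather than substituting a placeholder inside the existing flow, keeps the dry-run branch free of any I/O, which is the simpler of the two options the comment offers.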

Comment on lines +207 to +208
// (Auto-migration planning removed — `tale migrate config-layout` is the
// only opt-in, manually-run migration now; operators invoke it directly.)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove the removal-status comment from source.

This comment records deleted behavior; that belongs in PR/commit history, not runtime code comments.

Suggested change
-  // (Auto-migration planning removed — `tale migrate config-layout` is the
-  // only opt-in, manually-run migration now; operators invoke it directly.)

As per coding guidelines: "No comments explaining what was removed. Removed code is gone; git log is the record."

🤖 Prompt for AI Agents

In `@tools/cli/src/lib/actions/update.ts` around lines 207 - 208, Remove the
inline comment "(Auto-migration planning removed — `tale migrate config-layout`
is the only opt-in, manually-run migration now; operators invoke it directly.)"
from the source in update.ts; this is deleted behavior and should be dropped
from runtime comments (leave no replacement comment).

…Crawler + runbook

Two-round multi-agent review of 5deaa5a surfaced ~45 findings across
four themes; this commit lands the unified fix in one pass.

Wave 1 — Correctness (error-reporting chain + safety guards)

- scaffold.ts: seedDomain/seedRetention now return structured
  { domain, ok, error? } results. scaffoldNewOrganization aggregates
  and accepts a new `strict` arg — when true, throws on any
  per-domain failure (used by reseed_all_orgs); when false/omitted,
  preserves the org-create lenient semantics auth.afterCreateOrganization
  depends on. Promote path.isAbsolute(TALE_CONFIG_DIR) guard from
  cleanupOrgFilesystem so seedRetention can't accidentally write
  into the action's CWD on unset env. Bundle-mode rm-before-copy
  replaced with staging-dir + atomic-rename. randomUUID suffix on
  the condemned-dir name (defends against ms-resolution Date.now()
  collisions). Opportunistic janitor for stale <root>/.deleted-*
  trees older than 24h. Three previously-empty catches replaced
  with console.warn lines.
- reseed_all_orgs.ts: throw at end of loop when failed > 0 with
  aggregated failed-slug detail; that propagates through
  bunx convex run → docker exec exit code → CLI throw. Returns
  validator added so the action's shape is explicit. Passes
  strict:true to scaffold so per-domain failures are no longer
  swallowed silently.
- tools/cli/src/lib/actions/reseed-all-orgs.ts: add --no-push to
  bunx convex run; grep-strip the bunx banner (Admin key, emoji
  separators, blank lines) so the trailing JSON is parseable;
  parse the result on success and surface succeeded/total counts;
  throw on docker-exec non-zero (which is what the action's new
  end-of-loop throw produces).
- tools/cli/src/lib/actions/deploy.ts: stageOrgIntoDir filter now
  skips dotfiles (.git, .DS_Store, .vscode, .idea, .tale),
  node_modules, __pycache__ — the previous filter only excluded
  .history/ and *.secrets.json at depth ≥ 1, so operators with
  default/.git/ or macOS .DS_Store would have shipped those into
  /app/data/. syncProjectFiles now throws on docker cp failure
  instead of returning, so the outer success("Deployment complete!")
  no longer prints over a half-pushed state.
- Deleted services/platform/convex/migrations/rename_org_slug.ts —
  under the multi-org model the migration is actively dangerous
  (renames every org to `default`), and there is no registry/cron
  tying it to anything. The docblock's "Self-hosted Tale deployments
  use a single organization" assumption is stale.
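
A minimal sketch of the lenient-versus-strict aggregation described for scaffold.ts above. The `{ domain, ok, error? }` shape comes from the commit message; the function name `aggregateSeedResults` and the error wording are assumptions made for illustration, not the actual implementation.

```typescript
// Result shape as described in the commit message.
type SeedResult = { domain: string; ok: boolean; error?: string };

// Hypothetical helper: collect per-domain failures. In strict mode any
// failure throws (the reseed path); otherwise the failures are returned
// so the org-create path can continue.
function aggregateSeedResults(
  results: SeedResult[],
  strict = false,
): { ok: boolean; failed: SeedResult[] } {
  const failed = results.filter((r) => !r.ok);
  if (strict && failed.length > 0) {
    const detail = failed
      .map((r) => `${r.domain}: ${r.error ?? 'unknown error'}`)
      .join('; ');
    throw new Error(`seeding failed for ${failed.length} domain(s): ${detail}`);
  }
  return { ok: failed.length === 0, failed };
}
```

Returning structured results instead of swallowing errors is what lets the caller decide whether a failure is fatal.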

Wave 2 — Hidden code paths (out-of-sight assumptions that survived 5deaa5a)

- Python provider loader: load_providers + get_chat_model/
  get_embedding_model/get_vision_model + their *_config siblings
  now REQUIRE org_slug. Path resolves to
  <root>/<org_slug>/providers/ instead of <root>/providers/.
  BaseServiceSettings + Settings.get_llm_config likewise threaded.
  Without this, RAG and crawler both died at FastAPI lifespan
  startup with "No chat model found" against the old flat path.
- RAG: RagService rebuilt around a per-org _OrgClients cache with
  a 15s TTL. DB pool stays singleton; embedding/openai/vision
  clients and search service are per-org, built lazily on first
  request for that org. add_document/search/generate/compare_files
  now take org_slug as first arg. Embedding dimensions pinned
  globally on first org init; subsequent orgs that disagree raise
  loudly (per-org schema would need per-org DB).
- Crawler: uses contextvars instead of explicit threading — a new
  app/org_context.py exposes set_active_org/get_active_org/
  require_org_slug; main.py mounts require_org_slug as a router-
  level dep on every public router. embedding_service.py rebuilt
  with per-org cache keyed on slug. Boot-time embedding-dim guard
  in database.py removed (no org context at lifespan); pgvector
  enforces dim on insert. vision/openai_client.py and
  file_parser_service.py read get_active_org() at each settings.get_*
  call. scheduler.py background task sets "default" with a one-shot
  warn until per-website org binding lands.
- ragFetch: optional orgSlug in init; when set, X-Tale-Org header
  is forced (cannot be spoofed via init.headers). RAG endpoints
  that need it (search/generate/upload/compare-files) enforce
  via Depends(require_org_slug); status/delete/content/compare-
  by-id stay org-agnostic. All platform callers threaded — new
  lib/helpers/org_slug.ts (orgSlugFromId) bridges organizationId
  to slug for callers that only have the id. Crawler /api/v1/search
  callers (query_web_context, search_pages) set X-Tale-Org directly.
- generate-dev-compose.ts: bind mounts rewritten for the org-first
  layout. Old HOST_CONFIG_DIRS = ['agents','workflows',…] enumerated
  flat host dirs that no longer exist after tale init writes
  default/<domain>/. Replaced with findOrgDirs() — emits
  ./<org>/<domain>:/app/data/<org>/<domain>{ro} for every org
  found. start.ts user-facing hot-reload message updated.
- RULES_CONTENT (tools/cli/src/lib/rules/content.ts) + Cursor MDC
  globs rewritten for the org-first layout. tale update now applies
  checksum protection to rules files (CLAUDE.md, .cursor/rules/
  tale.mdc, .github/copilot-instructions.md, .windsurfrules) — was
  unconditional overwrite, would clobber local edits.
- tale update embedded-examples paths prefixed with `default/` so
  scaffolded files land where init puts them; previously update
  wrote into the now-unread flat layout.
- services/convex/docker-entrypoint.sh: detects pre-orgfirst flat
  dirs at /app/data/{agents,workflows,…}/ on boot and warns loudly
  with the tale migrate config-layout runbook. atomic_cp helper
  comment reworded — it's atomic for the destination but cp itself
  isn't atomic.
- tale start: detects legacy flat-layout dirs at project root and
  prints the runbook before continuing.
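
The "forced header" behaviour described for ragFetch above can be sketched as follows. This is an illustration under assumptions, not the real helper: `buildOrgHeaders` is a hypothetical name, headers are modelled as a plain record, and passing caller headers through unchanged when no `orgSlug` is given is an assumed behaviour.

```typescript
const ORG_HEADER = 'X-Tale-Org';

// Hypothetical helper: when orgSlug is set, any caller-supplied variant of
// the org header (in any letter case) is dropped before the trusted value
// is written, so it cannot be spoofed through the caller's headers.
function buildOrgHeaders(
  callerHeaders: Record<string, string>,
  orgSlug?: string,
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const [name, value] of Object.entries(callerHeaders)) {
    const isOrgHeader = name.toLowerCase() === ORG_HEADER.toLowerCase();
    if (orgSlug !== undefined && isOrgHeader) continue;
    out[name] = value;
  }
  if (orgSlug !== undefined) out[ORG_HEADER] = orgSlug;
  return out;
}
```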

Wave 3 — User-facing surface

- All three root READMEs: stale "pending data migrations are
  detected and applied automatically on tale start/deploy" claim
  replaced with the explicit migrate runbook + link to upgrades.md.
- docs/{en,de,fr}/self-hosted/configuration/providers.md: GitHub
  href tree/main/examples/providers (404) → tree/main/examples/default/providers.
- docs/{en,de,fr}/self-hosted/configuration/retention.md: path
  documented as per-org /app/data/<org>/retention.json instead of
  the removed /app/data/platform-config/governance/retention-bounds.json.
- docs/{en,de,fr}/self-hosted/operate/upgrades.md: new "Migrating
  to the org-first config layout" section covering the 3-step
  runbook, the rollback story (downgrade safe between steps 1 and
  3 via the -orgfirst marker token; provider-secrets restore-from-
  backup needed after step 3), and the skip-step-1 fallback.
- governance/mutations.ts: client-facing ConvexError message now
  references $TALE_CONFIG_DIR/<orgSlug>/retention.json.

Wave 4 — Remaining majors (small surgical fixes)

- init.ts: OpenRouter secrets file gets mode 0o600.
- .dockerignore: !examples/**/*.md keeps skill SKILL.md in build context.
- compose.yml + tools/cli/src/lib/compose/generators/constants.ts:
  stale `tale migrate split-convex` justification reworded — the
  platform-data volume is now an unused pre-0.3.0 stub kept only
  for the detect() probe in start.ts.
- migrate-config-layout/script.sh: set -euo pipefail + ${DATA:?}
  guard so an unset $DATA can't make --cleanup-old rm from the
  container root.
- Empty-catch fix-ups in branding/file_actions.ts (unlink loop +
  readdir), serve-branding-images.ts (.catch fallthrough now logs
  non-ENOENT), init.ts (detectTaleProjectFiles readdir).
- config_store/store.ts: deleted the dead orgFirst flag — every
  caller passed true. Inlined the org-first layout; deleted the
  legacy <area>/<orgSlug>.json branch and updated the unit tests.
- Stale docblocks updated in governance/{retention_actions,
  retention_bounds_proposal, retention_floors}.ts, integrations/
  {credentials_schema, load_integration}.ts, agents/file_utils.ts,
  skills/file_actions.ts (all referenced the old flat layout or
  removed env vars).
- services/platform/docker-entrypoint.sh: ORPHAN_DERIVED → LEGACY_
  DOMAIN_VARS, dropped 2>/dev/null on the env-purge so failures
  surface in logs.
- services/platform/Dockerfile: env-comment rewritten for the
  org-first sub-dir derivation.

Test surface: 36/36 tasks pass via `bun run check`; 70927 platform
tests, 298 RAG, 472 crawler. Touched tests:
test_rag_service, test_compare_files, test_background_ingest,
test_config (RAG + crawler), test_document_helpers, test_database
(crawler — boot-time dim guard tests skipped with rationale),
test_llm_cache (ContextVar setup), upload_file_direct.test,
upload_document.test, store.test (rewritten for org-first paths).

Out of scope (per user direction): reserving the literal `default`
slug at the Better Auth `beforeCreateOrganization` hook. Resolved
operationally — the operator is the first user and registers the
default org via the normal flow.
Comment thread services/rag/tests/test_rag_service.py Fixed
larryro added 2 commits May 28, 2026 10:33
Two-round review of the org-first refactor surfaced 10 P0 issues where
the org-aware line only finished header propagation and left the data
layer, error layer, reserved-name layer, and caller-update layer
incomplete. This commit closes all 10 without expanding scope to P1.

Crawler
- Add `website_org_memberships` junction table + migration with
  backfill to `default`. websites/website_urls/chunks/page_paragraph_hashes
  remain deployment-shared content; the per-org boundary lives in
  the new table. Delete is ref-counted — last membership purges the
  website, others only drop the row.
- `_fts_search` / `_vector_search` filter by EXISTS on the membership
  table so org A can't see chunks from a domain only org B added.
- Scheduler binds `set_active_org()` to each website's oldest member
  per-task instead of hard-coding `default`.
- Re-add boot-time `ALTER TABLE chunks ALTER COLUMN embedding
  TYPE vector(N)` pin (resolved from default-org provider catalog);
  fail-loudly on missing provider, atomic pool rollback on failure.
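
A rough in-memory model of the ref-counted delete and the membership-based visibility check described above. The real logic lives in SQL in the Python crawler; the class and method names here are hypothetical and only illustrate the rule.

```typescript
// In-memory stand-in for the website_org_memberships junction table.
class WebsiteMemberships {
  private readonly members = new Map<string, Set<string>>(); // domain -> org slugs

  add(domain: string, orgSlug: string): void {
    const orgs = this.members.get(domain) ?? new Set<string>();
    orgs.add(orgSlug);
    this.members.set(domain, orgs);
  }

  // Returns true only when the last membership was removed, meaning the
  // caller should also purge the shared website content.
  remove(domain: string, orgSlug: string): boolean {
    const orgs = this.members.get(domain);
    if (!orgs || !orgs.delete(orgSlug)) return false;
    if (orgs.size > 0) return false;
    this.members.delete(domain);
    return true;
  }

  // Search visibility: an org only sees domains it is a member of.
  canSee(domain: string, orgSlug: string): boolean {
    return this.members.get(domain)?.has(orgSlug) ?? false;
  }
}
```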

Platform
- New `UpstreamHttpError` typed wrapper with status/retryable/
  safeMessage/sanitized body snippet; replace 8 raw `errorText`
  embeds across rag_action, rag_search_tool, fetch_document_*,
  upload_file_direct, delete_document.
- Reserve `'default'` slug in `beforeCreateOrganization` (with
  first-run seed exception via `betterAuth.organization` count),
  in zod refine on the create-org form, and narrow `isCallerAdmin`
  to admins of the `default` org (branding owner).
- config-watcher drops the `.endsWith('.json')` early gate so
  SKILL.md / scripts/*.py invalidate skill queries as the doc
  comment promises; per-domain extension filters inside
  `parseConfigChange`; 100ms tail-debounce per (type, org, slug)
  key prevents SSE storms during bulk migrations.
- Thread `organizationId` + `x-tale-org` header through 10+ crawler
  callers that previously hit the now-globally-required dependency
  without the header: fetch_and_extract, websites/internal_actions
  (8 sites), file_metadata/internal_actions, apply/extract_docx,
  generate_docx, generate_document, crawler_action (3 sites).
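
One way to picture the per-key tail-debounce mentioned for config-watcher above. This is a sketch only: `KeyedDebouncer`, `eventKey`, and the injectable `Schedule` type are hypothetical names, and the injectable scheduler exists purely so the behaviour can be exercised without real timers.

```typescript
type Cancel = () => void;
type Schedule = (fn: () => void, delayMs: number) => Cancel;

// Default scheduler backed by the global timer functions.
const timerSchedule: Schedule = (fn, delayMs) => {
  const handle = setTimeout(fn, delayMs);
  return () => clearTimeout(handle);
};

// Build the debounce key from the (type, org, slug) triple.
function eventKey(type: string, org: string, slug: string): string {
  return `${type}:${org}:${slug}`;
}

class KeyedDebouncer {
  private readonly pending = new Map<string, Cancel>();
  private readonly delayMs: number;
  private readonly schedule: Schedule;

  constructor(delayMs: number, schedule: Schedule = timerSchedule) {
    this.delayMs = delayMs;
    this.schedule = schedule;
  }

  // Each push for the same key cancels the previous timer, so only the
  // last event of a burst is emitted once that key has gone quiet.
  push(key: string, emit: () => void): void {
    this.pending.get(key)?.();
    const cancel = this.schedule(() => {
      this.pending.delete(key);
      emit();
    }, this.delayMs);
    this.pending.set(key, cancel);
  }
}
```

With a 100ms delay, a bulk migration that touches the same file many times would produce one event per key rather than one per write.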

CLI
- deploy.ts: move success log after sync + reseed; legacy-flat-layout
  detection now throws with `tale migrate config-layout` guidance
  and runs at deploy entry (not just under --override); --override-all
  implies forceRecreate so the reseed targets the new binary.
- start.ts: pass `projectDir` to `generateDevCompose` so running from
  a subdirectory finds the right org dirs.

RAG
- /config no longer 500s — drop the per-org `get_llm_config()` call
  that the multi-org refactor made impossible from an org-less health
  endpoint; remove the corresponding fields from ConfigResponse.

Misc
- Remove Phase 2 (renameOrgSlug) from the dated convex-data migration
  script: the underlying Convex function was deleted in the parent
  refactor and any re-run would fail.

Tests: new `test_website_membership.py` (9 cases), updated
`test_websites_router.py` for ref-counted delete + first-membership
trigger semantics, `test_database.py` rewritten around the new pin
contract, new `upstream_http_error.test.ts` for the typed wrapper.
Adjusted 4 existing platform test files whose assertions hard-coded
the old raw-body error message format. `bun run check` green
(70932 platform + 481 crawler tests + RAG suite).
…d review

Two-round multi-agent review of refactor/uniform-org-first-config-layout
surfaced 5 P0s and ~30 P1s. This commit closes all of them.

P0 — cross-tenant isolation

- workflow_engine RAG delete_document, document extract/apply_docx_structured:
  add verifyStorageIdsBelongToOrg guards (mirror compare/retrieve pattern).
- crawler vision: refactor VisionClient + process_pages_with_llm to per-org
  _org_states / _chat_states keyed by get_active_org(); previously a 15s
  TTL singleton leaked org A's API key to org B's request within the
  window. llm_cache (OCR/desc/LLM) entries are now org-scoped via
  _scoped_key so the same content from two orgs never collides.
- New test_vision_isolation.py locks the invariant down (6 cases).

P1 — abstractions and CLI/upgrade flow

- UpstreamHttpError: .message = safeMessage only (snippet kept on
  .bodySnippet so it doesn't cross the Convex client boundary as a
  default toast); parse Retry-After into retryAfterMs; endpoint
  defaults to response.url; new toConvexError() carries structured
  fields across the wire; 9 new tests covering 401/403/404/429
  carve-outs, Retry-After parsing, ConvexError marshalling.
- Migrate 8 raw `new Error` sites to UpstreamHttpError.fromResponse
  (crawler_action ×3, file_metadata, fetch_document_comparison 4xx
  paths, web fetch_and_extract, docx extract/apply structured helpers).
- delete_document re-throws retryable upstream errors instead of
  folding them into {success:false}, so action retries can recover.
- rag_search_tool search-path now mirrors retrieve-path: returns
  safe summary instead of throwing past the agent runtime.
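
A simplified sketch of the two ideas above: keeping the upstream body off `.message`, and parsing `Retry-After` into milliseconds. The class name `UpstreamHttpErrorSketch`, its constructor signature, and the 200-character snippet limit are assumptions; the real `UpstreamHttpError` has more fields and a `fromResponse` factory that is not reproduced here.

```typescript
// Retry-After is either delay-seconds or an HTTP-date (RFC 9110).
function parseRetryAfterMs(value: string | null, nowMs: number): number | undefined {
  if (value === null) return undefined;
  const trimmed = value.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return undefined;
  return Math.max(0, dateMs - nowMs);
}

class UpstreamHttpErrorSketch extends Error {
  readonly status: number;
  readonly bodySnippet: string;
  readonly retryAfterMs: number | undefined;

  constructor(status: number, body: string, retryAfter: string | null, nowMs: number) {
    // Only a safe summary becomes .message; the body stays on a side field
    // so it is not shown to end users by default.
    super(`Upstream request failed with status ${status}`);
    this.name = 'UpstreamHttpErrorSketch';
    this.status = status;
    this.bodySnippet = body.slice(0, 200); // assumed limit, for illustration
    this.retryAfterMs = parseRetryAfterMs(retryAfter, nowMs);
  }
}
```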

P1 — shared utilities (dedup)

- New lib/shared/constants/org-slug.ts owns ORG_SLUG_REGEX,
  isValidOrgSlug, assertValidOrgSlug — replaces 3 inline copies
  (file_io.ts, config-watcher.ts, reseed_all_orgs.ts) and tightens
  the bash regex in migrate-config-layout/script.sh.
- lib/file_io.ts gains getConfigRoot(area?) and safeJoinWithinDir
  helpers; 6 file_utils.ts files + config_store/store.ts drop ~80
  lines of copy-pasted path-traversal guards.
- organizations/resolve_org_slug.ts now re-exports orgSlugFromId
  (single implementation across ~46 callers).
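
For orientation, a shared slug module along these lines might look like the sketch below. The exported names match the commit message, but the regex itself is an assumption chosen for illustration; the actual `ORG_SLUG_REGEX` is not shown in this PR text and may differ.

```typescript
// ASSUMPTION: illustrative pattern only (lowercase alphanumerics and inner
// hyphens, 1 to 64 characters). The real ORG_SLUG_REGEX may be different.
const ORG_SLUG_REGEX = /^[a-z0-9](?:[a-z0-9-]{0,62}[a-z0-9])?$/;

function isValidOrgSlug(slug: string): boolean {
  return ORG_SLUG_REGEX.test(slug);
}

// Throwing variant for call sites that cannot continue with a bad slug.
function assertValidOrgSlug(slug: string): void {
  if (!isValidOrgSlug(slug)) {
    throw new Error(`Invalid org slug: ${JSON.stringify(slug)}`);
  }
}
```

Because slugs become path segments under the config root, a single strict validator also doubles as a first line of defence against path traversal.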

P1 — CLI upgrade flow alignment

- migrate-config-layout/script.sh: pre-scan dst-collisions, SKIP
  notices to stdout (only ERROR on stderr), invalid org slug
  surfaces as conflict+error not silent skip.
- start.ts: import LEGACY_DOMAIN_DIR_NAMES from deploy.ts (closes
  the missing 'retention' drift), hard-fail on legacy layout
  (consistent with deploy), ensureEnv unconditional (matches deploy
  semantics for auto-secret refresh).
- migrate-config-layout.ts: actionable error when convex container
  isn't running; help text says "byte-for-byte" not "sha256"
  (matches cmp -s implementation).
- Three-locale docs/upgrades.md: drop "(and other config)" overpromise,
  reflect deploy hard-fail (not "starts up empty"), document old-
  container-must-be-running prereq for step 1, fix DE/FR grammar
  ("du läufst" → "du ausführst", "neu walkst" → "neu durchgehst",
  "re-walks" → "reparcours").

P1 — reseed CLI robustness

- reseed-all-orgs.ts: line-aware trailing-JSON parser (replaces
  fragile lastIndexOf('{')); grep `|| true` so grep zero-matches
  don't poison pipefail; failure branch parses payload too so
  failed-slug detail reaches CI logs; timeout-124 exit gets a
  distinct "timed out, safe to re-run" message.
- reseed_all_orgs.ts: invalid betterAuth slugs flow into results;
  pagination guards against stuck cursor + 1000-page cap.
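
A possible shape for a line-aware trailing-JSON parser, as opposed to a `lastIndexOf('{')` scan that can land inside a nested object or a string. This is a sketch, not the code in reseed-all-orgs.ts; the name `parseTrailingJsonSketch` is hypothetical.

```typescript
// Walk backwards over the lines. The first line that starts a JSON object
// and parses together with everything after it is taken as the payload.
function parseTrailingJsonSketch(output: string): unknown {
  const lines = output.split(/\r?\n/);
  for (let i = lines.length - 1; i >= 0; i--) {
    const line = lines[i] ?? '';
    if (line.trim().charAt(0) !== '{') continue;
    try {
      return JSON.parse(lines.slice(i).join('\n'));
    } catch {
      // Not a complete object starting at this line; keep scanning upwards.
    }
  }
  return null;
}
```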

P1 — RAG internal concurrency

- search() returns (results, usage) tuple — drops the mutable
  self.last_search_usage singleton that mis-attributed tokens
  under concurrent calls.
- Module-level _pin_dim_lock serializes the first _pinned_dims
  write across orgs (was racing past `if dims is None`).
- _org_locks LRU-capped at 256 to bound memory if a caller ever
  sprays distinct slugs.
- shutdown() drains _background_tasks before close_pool().
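
The RAG service is Python, so the following TypeScript is only an illustration of the capping policy for a per-org registry such as `_org_locks`, not a port of it. `LruMap` is a hypothetical name; the sketch relies on `Map` preserving insertion order.

```typescript
// Size-capped registry: the least recently used key is evicted once the
// capacity is exceeded, so distinct slugs cannot grow memory without bound.
class LruMap<V> {
  private readonly entries = new Map<string, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  getOrCreate(key: string, create: () => V): V {
    const existing = this.entries.get(key);
    const value = existing !== undefined ? existing : create();
    this.entries.delete(key); // re-insert to mark as most recently used
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    return value;
  }

  get size(): number {
    return this.entries.size;
  }
}
```

In the commit the cap is 256; a tiny capacity is enough to see the eviction order.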

P1 — crawler data correctness

- DELETE FROM chunks: add `AND domain = $2` so same URL path on
  two domains doesn't over-delete.
- delete_page_chunks now accepts optional domain arg.
- pg_website_store: parse asyncpg DELETE tag as integer (was
  literal string compare against "DELETE 0").

P1 — branding hardening

- requireBrandingAdmin: trusted-headers branch no longer short-
  circuits past isCallerAdmin's default-org check.
- safeGetUrl in getLegacyBranding now logs instead of swallowing.
- saveImage/deleteImage readdir errors: distinguish ENOENT from
  EACCES/EISDIR.
- server.ts branding route: explicit Content-Type allowlist + sep-
  bounded prefix check (defense in depth over the existing
  filename validator).
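
The "sep-bounded prefix check" in the server.ts item above guards against sibling directories that merely share a prefix. A minimal sketch, assuming both paths are already resolved and normalized; `isWithinDir` is a hypothetical name:

```typescript
// True only when `candidate` is `root` itself or lies beneath it.
// Appending the separator stops "/data/branding-evil" from matching
// the root "/data/branding".
function isWithinDir(root: string, candidate: string, sep = '/'): boolean {
  const base = root.endsWith(sep) ? root.slice(0, -sep.length) : root;
  return candidate === base || candidate.startsWith(base + sep);
}
```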

P1 — docker entrypoint + 2026-03-28 migration script

- FORCE_SEED default ("false") so script stays correct under any
  future set -u audit.
- $data_dir single source of truth — drops /app/data hardcodes
  that diverged from $TALE_CONFIG_DIR.
- chown -R replaced with `find ! -user app -exec chown app:app`
  so large volumes don't re-walk every startup.
- POSTGRES_URL parsing handles bracketed IPv6 ([::1]) and URL-
  encoded password segments (pure-bash, no python dependency).
- mkdir + atomic_cp chained with `&&` instead of `;` so a failed
  mkdir doesn't cause a misattributed copy diagnostic.
- 2026-03-28 migration: drop `2>/dev/null` on cp so I/O errors
  surface; keep `|| true` only for the empty-glob case.

P1 — file_metadata retry classification

- extractFileMetadata uses isUpstreamHttpError to distinguish
  transient (5xx/408/429 → retry) from permanent (4xx, org-slug
  lookup failure → markFailed). Previously, permanent errors were retried
  N times burning scheduler slots.
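
The transient-versus-permanent split above amounts to a small decision function. The status rules follow the commit message; the name `classifyUpstreamStatus` and the `RetryDecision` type are assumptions for illustration.

```typescript
type RetryDecision = 'retry' | 'markFailed';

// 5xx, 408 and 429 are treated as transient; every other status is
// considered permanent and should not consume further retry attempts.
function classifyUpstreamStatus(status: number): RetryDecision {
  const transient = status >= 500 || status === 408 || status === 429;
  return transient ? 'retry' : 'markFailed';
}
```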

P1 — auth + org form

- beforeCreateOrganization: lowercase-normalize slug BEFORE
  reservation + uniqueness checks (closes Default/default cased
  bypass); assertValidOrgSlug on entry.
- New beforeUpdateOrganization hook: same guards on rename so
  owners can't claim reserved slugs post-creation.
- organization-form.tsx: extract deriveOrgSlug helper (was
  inlined three places); route Zod refine messages through
  useT (was hardcoded English); add three-locale i18n keys.
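
The ordering fix for the create hook (normalize first, then check) can be illustrated as below. This is a sketch: `checkNewOrgSlug` and its parameters are hypothetical, and the first-run exception is modelled as a plain boolean rather than a Better Auth query.

```typescript
const RESERVED_ORG_SLUGS = new Set<string>(['default']);

// Hypothetical check: lowercasing happens BEFORE the reservation and
// uniqueness checks, so "Default" cannot slip past a check for "default".
function checkNewOrgSlug(
  rawSlug: string,
  existingSlugs: Set<string>,
  isFirstOrg: boolean,
): string {
  const slug = rawSlug.trim().toLowerCase();
  if (RESERVED_ORG_SLUGS.has(slug) && !isFirstOrg) {
    throw new Error(`Slug "${slug}" is reserved`);
  }
  if (existingSlugs.has(slug)) {
    throw new Error(`Slug "${slug}" is already taken`);
  }
  return slug;
}
```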

P1 — scaffold test coverage

- Add tests for invalid-slug skipped:true return, retention
  override on/off, strict:true aggregated throw, non-strict
  aggregated result.

Verification

- bun run check: all lint + type + test suites pass.
- Platform: 274 test files, 70941 assertions green.
- Crawler: 487 tests, RAG: 298 tests.
Comment thread services/rag/tests/test_rag_service.py Fixed
larryro added 22 commits May 28, 2026 15:06
The upgrades.md page is the evergreen operate doc (two-step flow,
blue-green, rollback, semver). A one-time migration between specific
versions belongs in that release's Migration notes, not here.

Knip flagged three exports that are only consumed within their own
file: RESERVED_ORG_SLUGS (used by isReservedOrgSlug in the same
module), and the MigrateConfigLayoutOptions / ReseedAllOrgsOptions
interfaces (used only as the function parameter type, never imported).

Closes the P0-1 cross-tenant search leak surfaced by the second-round
review (workflow `search` with `fileIds: []` could return chunks across
orgs because RAG's `_build_scope_clause` dropped the WHERE filter when
file_ids was empty), plus the SemanticCache same-similarity leak that
bypassed the SQL filter entirely. Per-tenant isolation no longer
depends on callers passing the right `file_ids` — the data layer
physically partitions every row by `org_slug`.

Schema migration (services/rag/migrations/20260528000001_enforce_org_slug.sql)

- Idempotent forward-only DDL: every step guards on
  information_schema / pg_constraint so re-running is a no-op
  (protects manual `psql -f`, partial-failure recovery, and
  backup-restore drift).
- Rename `documents.team_id` and `chunks.team_id` → `org_slug`;
  backfill any NULL rows to `'default'`; SET NOT NULL DEFAULT
  'default'. Existing single-org deployments transparently land in
  the `default` bucket.
- Replace stale partial indexes with org-scoped covering indexes:
  UNIQUE `(org_slug, file_id)`, `(org_slug, document_id)`, and
  `semantic_cache(org_slug, expires_at)`.
- Composite FK `chunks.(document_id, org_slug) → documents.(id,
  org_slug) ON DELETE CASCADE` enforces at the DB that
  chunks.org_slug can never drift from documents.org_slug.
- Add `semantic_cache.org_slug` column + companion index.

RAG service — SQL scoped end-to-end by org_slug

- indexing_service.py: thread org_slug through `index_document`,
  `find_existing_by_hash`, `clone_from_existing`, `_do_store`,
  `_do_clone`, `_update_progress`. INSERT now writes org_slug;
  UPSERT conflict target is `(org_slug, file_id)`. Cross-org
  content-hash dedup is deliberately disabled — org B's secret
  upload can no longer be probed by hash from org A.
- search_service.py: `_build_scope_clause` ALWAYS emits
  `AND c.org_slug = $N`. Empty/None file_ids no longer drops the
  WHERE clause (closes P0-1). The documents subquery is independently
  org-scoped — defense in depth. `RagSearchService.search` returns
  `(results, usage)` directly; the `last_search_usage` singleton
  race is gone.
- semantic_cache.py: lookup/store/invalidate/cleanup all take
  org_slug. Two orgs asking semantically identical questions get
  independent cache entries. `cleanup(None)` retained for operator-
  side global GC but callers must pass None explicitly.
- rag_service.py: every public method takes org_slug as first arg;
  `delete_document` / `get_document_content` / `get_document_statuses`
  / `compare_documents` are now per-tenant (foreign-org file_id
  returns 0 deletes / 404 rather than touching the foreign row).
- routers/documents.py: `/documents/{id}` DELETE, `/{id}/content`,
  `/statuses`, `/compare` now require `Depends(require_org_slug)`.
  The pre-existing org-agnostic carve-out is gone — the data layer
  needs the slug, so the routes do too.
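
The core of the P0-1 fix is that the org filter is unconditional while the file filter is optional. The RAG service is Python, so this TypeScript sketch only illustrates the rule; the function name, the `c.file_id` column, and the `ANY($n)` form are assumptions.

```typescript
// Appends bind values to `params` and returns the SQL fragment.
// The org filter is ALWAYS emitted; the file filter is added only when
// the caller passed a non-empty list.
function buildScopeClause(
  orgSlug: string,
  fileIds: string[] | null | undefined,
  params: unknown[],
): string {
  params.push(orgSlug);
  let clause = ` AND c.org_slug = $${params.length}`;
  if (fileIds && fileIds.length > 0) {
    params.push(fileIds);
    clause += ` AND c.file_id = ANY($${params.length})`; // hypothetical column
  }
  return clause;
}
```

With this shape an empty `fileIds` list narrows nothing beyond the org, but it can no longer widen the query across orgs.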

Platform-side adaptation — orgSlug threaded to ~12 call sites

- `deleteDocumentById({orgSlug, fileId})` and
  `deleteFromRagBatch({orgSlug, fileIds})` are breaking signature
  changes; callers in rag_action,
  agents/internal_actions, threads/cascade_helpers, governance
  (erasure ×3, retention_cleanup ×2), documents/internal_actions ×2
  all updated.
- `fetchDocumentContent(orgSlug, ...)`,
  `fetchDocumentComparison(orgSlug, ...)`, and
  `fetchDocumentChunks(orgSlug, ...)`: breaking signature changes plus
  caller updates in workflow document_action,
  retrieve_document tool, etc.
- `filterStorageIdsByCallerOrg` now returns `{storageId,
  organizationId}` pairs so `checkFileRagStatuses` groups by org and
  fans out one RAG call per tenant instead of one global call.
- `deleteKnowledgeFileFromRag` takes organizationId; the scheduler
  callsites in agents/mutations.ts pass it.
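
The "group by org, then one RAG call per tenant" step can be sketched as a plain grouping function. The pair shape `{storageId, organizationId}` comes from the commit message; `groupByOrganization` is a hypothetical helper name.

```typescript
type OwnedStorageId = { storageId: string; organizationId: string };

// Group storage ids by owning organization so the caller can issue one
// status request per tenant instead of a single global request.
function groupByOrganization(pairs: OwnedStorageId[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const { storageId, organizationId } of pairs) {
    const ids = groups.get(organizationId) ?? [];
    ids.push(storageId);
    groups.set(organizationId, ids);
  }
  return groups;
}
```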

Sanitize util renamed: `sanitize_team_id` → `sanitize_org_slug`. No
back-compat shim per the no-backwards-compat-hacks rule.

Tests

- All 30+ touched existing RAG tests updated for the new signatures
  + new SQL parameter positions ($1 = org_slug now).
- New `test_org_isolation.py` (10 cases) + `test_semantic_cache_
  isolation.py` (5 cases) pin the invariant at the application layer:
  empty file_ids never drops the org filter, foreign-org delete
  returns 0, foreign-org content returns None, same-hash cross-org
  probe returns None, etc.
- Platform `document_retrieve_tool.test.ts` mock adds
  `components.betterAuth.adapter` so the new `orgSlugFromId` resolves
  inside the test sandbox.

Verification

- RAG: 313 tests pass (was 298; +15 isolation tests).
- Platform: 274 files / 70941 tests pass; lint clean (2751 files).
- `bun run check`: 36/36 tasks green.

Out of scope (deliberate)

- `RAG_AUTH_TOKEN` enforcement on the RAG side — platform-only auth
  boundary per the project briefing.
- Embedding-dimensions per-org pin — orthogonal; cross-org dim
  mismatch is still a fail-loud availability issue, not a leak.
- Cleaning up historical `team_id` references in callers of
  `services/rag/app/utils/sanitize.py`: the function has
  no callers, so the rename is clean.
…mpile

Closes P0-3 from the multi-agent review. The compiled binary at
tools/cli/dist/tale ENOENT'd on `tale migrate config-layout` because
the script was loaded via `readFile(import.meta.url + '../migrate-
config-layout/script.sh')`. Bun's `--compile` does not bundle runtime
asset reads — only assets imported with `with { type: 'file' }` or
listed in `Bun.build({entrypoints})`. From source it works; from the
shipped binary the entire upgrade runbook was broken.

Fix mirrors the canonical pattern at reseed-all-orgs.ts:58-71
(RESEED_SCRIPT inline template literal). Bash `${VAR}` collides with
TS template-literal `${...}` interpolation, so each literal `${` in
the script is escaped as `\${`; plain `$VAR` (without braces) stays
unchanged. Removed the now-orphaned tools/cli/src/lib/migrate-config-
layout/script.sh and its parent directory — keeping the .sh on disk
would invite a future contributor to revive `loadScript()`.
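
The escaping rule is easy to check in isolation. The script body below is made up for illustration and is not the migration script; only the `\${` versus `$VAR` behaviour is the point.

```typescript
// Inside a TS template literal, `\${` produces a literal "${" for bash,
// while a plain `$VAR` (no braces) needs no escaping at all.
const EXAMPLE_SCRIPT = `
set -euo pipefail
: "\${DATA:?DATA must be set}"
echo "migrating $DATA"
`;

// Prints the bash text exactly as the shell would receive it.
console.log(EXAMPLE_SCRIPT);
```

Forgetting the backslash would not produce a bash parameter expansion; TypeScript would try to interpolate the expression inside the braces instead.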

Regression guard: new tools/cli/scripts/check-bundle.ts greps the
compiled binary for distinctive markers (`MIGRATE_PLAN`,
`MIGRATE_SUMMARY`, the reseed convex function ref) and exits non-zero
if any are missing. Wired into `bun run build` and `build:windows` as
a post-compile step so CI catches a recurrence of this regression
class loud and early.

Verified: `bun run build` → `check-bundle: OK (3 markers present)`;
compiled binary's `tale migrate config-layout --dry-run` now reaches
the project-validation phase instead of ENOENTing on the script load.
…arser

Closes P0-4 and P1-35 from the multi-agent review.

P0-4 — admin-key leak. RESEED_SCRIPT did `source /app/generate-admin-
key.sh`, which doesn't just define helpers — it `echo`s a dashboard
banner including `   Admin Key:      <key>` (3-space indent, capital
"K"). The line-based grep filter was anchored on `^Admin key` so it
mis-matched both the case AND the leading whitespace, and the key
landed in `result.stdout` which is then `logger.info`-ed on both the
failure path (line 179) and the success-no-parse path (line 230).
Reseed runs on `tale deploy --override-all`, which is exactly what
operators execute in CI — the leak path was through CI logs.

Fix is two-layered:
- Structural: drop the `source generate-admin-key.sh` line. The
  function `ensure_instance_secret` and the binary `generate_key` are
  already available from sourcing `env.sh`, and the inline `ADMIN_KEY=
  $(generate_key …)` re-derives the key without firing the banner.
  Mirrors scripts/2026-03-28-migrate-convex-data.sh:120-131.
- Defense in depth: new `redactAdminKey(text)` regex-strips any
  `[Aa]dmin [Kk]ey: <12+ chars>` pattern from stdout/stderr before
  it reaches the logger. Catches any future upstream banner (env.sh
  diagnostic mode, Convex CLI change, etc.) without breaking the
  primary fix. Applied at both surviving log sites.
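
A redactor in this spirit could look like the sketch below. The regex is an assumption built from the description above (case-insensitive on the two capitals, optional colon, 12 or more non-space characters); the real `redactAdminKey` pattern is not shown in this PR text.

```typescript
// ASSUMPTION: illustrative pattern, not the actual implementation.
// The label and spacing are captured so only the key itself is replaced.
const ADMIN_KEY_PATTERN = /([Aa]dmin [Kk]ey:?\s*)\S{12,}/g;

function redactAdminKeySketch(text: string): string {
  return text.replace(ADMIN_KEY_PATTERN, '$1[REDACTED]');
}
```

Matching on the label rather than on line position is what makes this robust to the indentation and capitalization that defeated the earlier `^Admin key` grep.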

P1-35 — dead JSON parser on failure path. The convex action
`reseed_all_orgs.ts:147-167` only `console.log`s human-readable strings
then throws; `bunx convex run` does not emit the action's return value
on the throw path. The CLI's `parseTrailingJson` failure-branch
(lines 196-212) therefore always returned null and the special-case
error message never rendered. Removed the dead branch; the generic
"reseed action raised" path already surfaces per-org detail via the
stdout log just above (now redacted).

Verification:
- 7 new unit tests cover the redactor (case, indent, missing colon,
  multiple occurrences, no false positives on short tokens).
- `bun run check`: 36/36 tasks green.

Closes P0-2. The watcher fan-out at server.ts:43-57 used to push every
config-change event to every connected client with zero auth and zero
per-org filter; client-side filters in `app/hooks/use-file-events.ts`
only suppressed the cache invalidation, not the wire payload. When
`TALE_FILE_EVENTS=true` (operator opt-in), an unauthenticated peer or
a cross-org member got a real-time inventory of every org's config
items (agent slugs, workflow names, integration ids, etc.).

Fix is two layers:

- New convex httpAction `/api/sse/auth` validates the Better Auth
  session cookie via the same `auth.api.getSession({ headers })`
  pattern as `/api/tts-audio`, then walks the `member` table to
  resolve the user's org slugs. Returns `{userId, orgSlugs}` on
  success or 401 with `Vary: Cookie` on missing/invalid session.

- The Bun `/events/file` handler forwards the request's Cookie to
  this endpoint (using `CONVEX_SITE_PROXY_URL` if set, else deriving
  `:3211` from `CONVEX_URL`'s `:3210`). On 401 the SSE connection is
  closed with 401 + `WWW-Authenticate: Cookie`. On success the
  resolved org-slug set is stored alongside the controller; the
  watcher's fan-out drops events whose `orgSlug` falls outside the
  client's allowed set BEFORE the payload reaches the wire.
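
The server-side filter reduces to a membership test at fan-out time. The types and the `fanOut` function below are hypothetical stand-ins for the Bun handler's internals, shown only to make the ordering (filter first, serialize and send second) concrete.

```typescript
type ConfigEvent = { type: string; orgSlug: string; slug: string };
type SseClient = { allowedOrgs: Set<string>; send: (payload: string) => void };

// Deliver an event only to clients whose resolved org set contains the
// event's org. Everyone else never sees the payload on the wire.
function fanOut(event: ConfigEvent, clients: SseClient[]): number {
  let delivered = 0;
  for (const client of clients) {
    if (!client.allowedOrgs.has(event.orgSlug)) continue;
    client.send(`data: ${JSON.stringify(event)}\n\n`);
    delivered += 1;
  }
  return delivered;
}
```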

The existing client-side filter in use-file-events.ts is now
redundant defense in depth. It is kept deliberately: the wire-side
filter depends on the server knowing the active org, which it does,
but layering makes a future "send a global event to everyone" code
change safer.

Verified:
- 3 new server.test.ts cases: no-cookie → 401, convex-rejects → 401,
  valid session + mocked convex response → 200 text/event-stream
  with Vary: Cookie. The pre-existing FILE_EVENTS_ENABLED=false 404
  case still passes.
- `bun run check`: 36/36 tasks green (70944 platform tests).

Closes P1-1, P1-3, P1-4, P1-5 from the multi-agent review. Each is a
cross-tenant or content-injection hole the org-first refactor surfaced
or widened. None had auth tests; all four fixes are pattern-aligned to
the sibling code path that was already doing the right thing.

P1-1 — `rag_action.upload_document` skipped the org gate. Every other
op in `rag_action.ts` (delete/get_chunks/search) calls
`assertStorageIdsInOrg` first, but upload went straight to
`uploadDocument(ctx, fileId, ...)`. The helper derives orgSlug from
file-metadata, so an org-A workflow could force ingestion of an org-B
storage blob into org-B's RAG namespace — cost shift + content
injection under attacker-controlled fileName/contentType. One-line
gate added at the same spot as the sibling ops.

P1-5 — workflow `get_chunks` / `search` skipped
`stripReservedPromptTags`. The agent-tool path applies the SEC1
strip at rag_search_tool.ts:319/483; the workflow path passed
chunk content through untouched, letting an attacker-uploaded
`<system>…</system>` escape the surrounding workflow system prompt.
Strip is applied BEFORE the existing video-link `wrapUntrusted`
layer so both defenses compose correctly.

P1-4 — `websites/actions.ts` fetchPages/fetchChunks/searchContent
authenticated the user but never called `verifyOrganizationMembership`
on `website.organizationId`. Sibling actions (deleteWebsite,
updateWebsite, createWebsite) already do the check; the REST surface
(rest_api.ts:103-105) does too. Pre-existing on `main` but materially
widened by the org-first refactor: pre-branch the crawler was
effectively single-tenant per deployment so a leaked websiteId hit a
shared scope; on this branch the crawler is org-scoped via
website.organizationId resolved from the row, so cross-org calls now
return the *target org's* private pages/chunks/search hits to any
authenticated caller. Extracted a small `loadOwnedWebsite(ctx,
websiteId)` helper so the three handlers stay one-liners and the
auth pattern can't drift on the next addition. Errors uniformly
say "Website not found" to avoid existence disclosure across orgs.

P1-3 — workflow `retrieve` skipped the team-ACL gate that its
agent-tool sibling (retrieve_document.ts:42-58) enforces. The comment
at document_action.ts:278 even claimed parity, but only the first half
(`findDocumentByFileId`) was implemented; `getAccessibleDocumentIds`
was missing. Same-org members of a different team could read foreign-
team documents. The same gap turned up in `update` (line 219) and
`get_metadata` (line 540): both also resolve by fileId via
`findDocumentByFileId` and were missing the team check.
`compare`/`extract_docx_structured`/`apply_docx_structured` operate
on storage IDs (not docs rows) so `verifyStorageIdsBelongToOrg`
covers them; they have no team field to ACL.

Extracted `assertDocumentAccessibleInWorkflow(ctx, organizationId,
userId, document, fileId)` helper in document_action.ts. When userId
is absent from `_variables` (system-triggered workflows), the team
gate degrades to org-only — consistent with the existing optional-
userId pattern at lines 424/651/746. For `get_metadata` the
accessibleIds list is loaded once and per-id filtering happens in
the map loop; team-private fields (sourceCreatedAt/Modified,
docMetadata) are dropped for inaccessible docs while fileName from
fileMetadata still surfaces so legitimate name-only workflows don't
break.

Verified: `bun run check`: 36/36 tasks green (70944 platform tests).
Closes P1-6, P1-7, P1-8, P1-9, P1-10, P1-11 from the multi-agent review.
Six related correctness bugs in the file_metadata / agent_response /
documents modules: a retry classifier that defaulted to "retryable",
no single-flight on transcription, no source-state preconditions on
the user-facing retry/skip mutations, a stuck-transcription watchdog
that mis-kills freshly-retried runs, an abort-watcher leak on
guardrails-block early returns, and two documents-generation paths
that skipped the typed-error migration.

P1-7 — `extractFileMetadata` (`file_metadata/internal_actions.ts:206`).
The retry classifier read `isRetryable || !isUpstreamHttpError(error)`
which meant every non-UpstreamHttpError throw was treated as transient.
`orgSlugFromId` failures, malformed JSON, "Invalid response shape" —
all got rescheduled 3× before failing. Inverted: default permanent,
opt in only on `UpstreamHttpError && retryable`. Trade-off accepts
that a genuine network blip surfaces as permanent rather than self-
healing — the original deterministic-error retry storms were far more
damaging.
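The inversion described for P1-7 can be sketched as two small predicates. This is an illustration only: the `UpstreamHttpError` class below is a hypothetical stand-in with just a `retryable` flag, and `shouldRetryOld` / `shouldRetryNew` are names invented for the comparison, not functions from the codebase.

```typescript
// Hypothetical stand-in for the project's UpstreamHttpError; only the
// `retryable` flag matters for this sketch.
class UpstreamHttpError extends Error {
  readonly retryable: boolean;
  constructor(message: string, retryable: boolean) {
    super(message);
    this.retryable = retryable;
    // Keep `instanceof` working even when compiled to an ES5 target.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Old shape: `isRetryable || !isUpstreamHttpError(error)`.
// Every throw that is NOT an UpstreamHttpError counts as transient.
function shouldRetryOld(error: unknown): boolean {
  if (error instanceof UpstreamHttpError) return error.retryable;
  return true;
}

// New shape: default permanent, opt in only for a retryable upstream error.
function shouldRetryNew(error: unknown): boolean {
  return error instanceof UpstreamHttpError && error.retryable;
}
```

Under the old predicate a `SyntaxError` from malformed JSON is rescheduled; under the new one it fails on the first attempt, and only an upstream error explicitly flagged retryable is rescheduled.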

P1-8 / P1-9 / P1-10 — single-flight + state gates + watchdog key.
`transcribeAudio` had no atomic lock: two concurrent invocations on
the same storageId (retryTranscription double-click, scheduled retry
+ user-triggered retry) both proceeded. Double Whisper bill, double
`+=` on `recordTranscriptionUsage`, double RAG index.

  - New `acquireTranscriptionLock` mutation writes
    `transcriptionRunId`, `transcriptionLeaseExpiresAt`,
    `transcriptionStartedAt` atomically under the existing
    `by_storageId` index; returns the runId on win or null on loss.
  - `transcribeAudio` acquires the lock first; on loss it logs
    `transcription.deduplicated` and returns without compress /
    Whisper / ledger work.
  - Lock is released in the `finally` block via the new
    `releaseTranscriptionLock` mutation, which no-ops if a watchdog
    or another retry already claimed it.
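A minimal in-memory sketch of the acquire/release contract listed above. It is not the Convex mutation: the row is a plain object, `LEASE_MS` is an assumed value, and treating an expired lease as free is an assumption the commit message does not spell out. Only the three field names come from the text.

```typescript
interface FileRow {
  transcriptionRunId?: string;
  transcriptionLeaseExpiresAt?: number;
  transcriptionStartedAt?: number;
}

// Assumed lease length; the real value is not stated in the text.
const LEASE_MS = 10 * 60 * 1000;

// Returns the runId on win, null on loss.
function acquireLock(row: FileRow, runId: string, now: number): string | null {
  const held =
    row.transcriptionRunId !== undefined &&
    (row.transcriptionLeaseExpiresAt ?? 0) > now;
  if (held) return null;
  row.transcriptionRunId = runId;
  row.transcriptionLeaseExpiresAt = now + LEASE_MS;
  row.transcriptionStartedAt = now;
  return runId;
}

// No-ops (returns false) when another run or a watchdog already owns the row.
function releaseLock(row: FileRow, runId: string): boolean {
  if (row.transcriptionRunId !== runId) return false;
  delete row.transcriptionRunId;
  delete row.transcriptionLeaseExpiresAt;
  delete row.transcriptionStartedAt;
  return true;
}
```

The caller pattern is "acquire, bail out on null, release in `finally`", so a losing invocation never reaches the compress / Whisper / ledger steps.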

`retryTranscription` and `skipTranscription` now precondition on
source state — retry only from `failed`/`skipped`, skip only from
`queued`/`running`. The pre-existing bug allowed Skip after
`completed` to clobber the transcript and cascade videoLinkJobs into
a failed state; Retry from `running` would have double-billed
Whisper (now blocked by the lock too, but the UI shouldn't surface
a Retry button for an in-flight row anyway). `skipTranscription`
now routes through `updateFileTranscription` so the videoLinkJobs
cascade (internal_mutations.ts:319-345) fires correctly for video-
link audio; the prior direct `db.patch` orphaned the linked job at
`transcribing_handoff`.
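The source-state preconditions amount to two allowlists. A sketch, assuming the five status names mentioned in this message are the full set (the real status union may differ), with `canRetry` / `canSkip` as illustrative helper names:

```typescript
type TranscriptionStatus =
  | "queued"
  | "running"
  | "completed"
  | "failed"
  | "skipped";

// Retry is only meaningful for a run that has already ended unsuccessfully.
const RETRY_FROM: readonly TranscriptionStatus[] = ["failed", "skipped"];
// Skip is only meaningful for a run that has not finished yet.
const SKIP_FROM: readonly TranscriptionStatus[] = ["queued", "running"];

function canRetry(status: TranscriptionStatus): boolean {
  return RETRY_FROM.indexOf(status) !== -1;
}

function canSkip(status: TranscriptionStatus): boolean {
  return SKIP_FROM.indexOf(status) !== -1;
}
```

With these gates, Skip after `completed` and Retry from `running` are both rejected before any write happens.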

`recoverStuckTranscriptions` (the 5-min cron) keyed on
`row._creationTime` so a `retryTranscription` against an old
fileMetadata row could be killed within seconds of the next tick.
Now keys on `transcriptionStartedAt ?? _creationTime` with legacy-
row fallback, and clears the single-flight fields when breaking the
lock so a re-retry can acquire cleanly.
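The watchdog key change is easiest to see on a row that was created long ago but retried moments ago. In this sketch `STUCK_AFTER_MS` is an assumed threshold and `isStuck` an illustrative name; only the `transcriptionStartedAt ?? _creationTime` expression is taken from the message.

```typescript
interface WatchedRow {
  _creationTime: number;
  transcriptionStartedAt?: number;
}

// Assumed threshold; the real cutoff is not stated in the text.
const STUCK_AFTER_MS = 30 * 60 * 1000;

function isStuck(row: WatchedRow, now: number): boolean {
  // Legacy rows without transcriptionStartedAt fall back to creation time.
  const startedAt = row.transcriptionStartedAt ?? row._creationTime;
  return now - startedAt > STUCK_AFTER_MS;
}
```

A two-day-old row retried ten seconds ago is measured from the retry, so the next cron tick leaves it alone; a legacy row with no start timestamp is still measured from creation.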

Schema: three new optional fields on `fileMetadata` —
`transcriptionRunId`, `transcriptionLeaseExpiresAt`,
`transcriptionStartedAt`.

P1-11 — abort watcher leak. `generate_response.ts:1166-1184` and
`:1254-1276` are guardrails-block early returns that skipped the
`abortWatcher?.stop()` call every other return path makes. Bounded
leak (the polling closure self-terminates within ~1.5s on
`streams.abort`), but it kept issuing redundant `check_cancelled`
Convex queries after the function returned. Mirror the canonical
stop call at line 580 (cancelledReturn) before each
`return buildBlockedReturn(...)`.

P1-6 — `documents/generate_document.ts` and `generate_docx.ts`
skipped the UpstreamHttpError migration. Crawler-side errors now
route through `UpstreamHttpError.fromResponse('crawler', ...)` so
the body snippet is sanitised, retryability is classified by
status, and the agent boundary sees the safe message instead of
raw upstream text. Storage-upload paths (Convex `_storage` via
`generateUploadUrl`) sit outside the `'rag' | 'crawler'` service
union; those throws are downgraded to a status-only Error after
scrubbing the response body via `sanitizeError`.

Verified: `bun run check`: 36/36 tasks green (70943 platform tests).
Closes P1-12, P1-13, P1-14, P1-31, P1-32, P1-33, P1-34 from the multi-
agent review. Seven small but data-loss-class bugs across the auth
hooks, scaffold janitor, deploy CLI, and tale init/update flow.

P1-12 — `beforeUpdateOrganization`'s slug collision check at auth.ts:
654-667 didn't exclude the org being updated. Better Auth's own pre-
check at crud-org.mjs:213-215 does this self-exclude; without
mirroring it, any update payload that re-sends the current slug
(e.g. a name-only PATCH that round-trips the full object, or a UI
form that posts the full org state) 400s with "already taken".
Fixed by reading `data.member.organizationId` and treating a
collision against the same id as a no-op.
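A sketch of the self-exclude, with the organization table replaced by an array and `slugTakenByOther` as an illustrative name (the real hook reads `data.member.organizationId` and queries the database):

```typescript
interface OrgRow {
  id: string;
  slug: string;
}

// A slug only collides when it belongs to a DIFFERENT organization.
function slugTakenByOther(
  orgs: readonly OrgRow[],
  slug: string,
  updatingOrgId: string,
): boolean {
  return orgs.some((org) => org.slug === slug && org.id !== updatingOrgId);
}
```

Re-sending the org's own current slug in a name-only update is then a no-op rather than an "already taken" rejection.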

P1-13 — `beforeCreateOrganization` at auth.ts:632-636 swallowed
an empty catch around `data.organization.slug = normalizedSlug`.
If the assignment ever threw (frozen object, etc.) the un-normalized
caller-supplied slug would persist while the reservation + unique-
ness checks above had run against the normalized version — defeating
the very normalization the comment at line 580-585 defends against.
Replaced with the cleaner Record<string, unknown> cast pattern that
`beforeUpdateOrganization` already uses; if the assignment ever
throws the create fails loudly rather than persisting the wrong
slug.

P1-14 — `scaffold.ts` bundle-mode (this file's seedSingleDomain
line ~335) and skills uploads (skills/file_actions.ts:706-707) stage
into `<bundle>.staging-<8hex>` / `.replacing-<8hex>` siblings before
atomic-renaming onto the target. Process crash mid-stage leaves
orphans 3 levels deep at `<root>/<org>/<domain>/`. The pre-existing
`sweepStaleCondemnedDirs` janitor only walked root-level
`.deleted-*`, so the orphans would (a) survive forever and (b) make
`dirHasFiles` return true → next `override:false` reseed skips that
domain indefinitely.

  - Renamed `isAtomicWriteTmp` → `isTransientArtifact` and taught
    it to match `\.{staging,replacing}-[a-f0-9]{8}$`. `dirHasFiles`
    now ignores these orphans the same way it ignored `.tmp`.
  - Rewrote `sweepStaleCondemnedDirs` to walk 3 levels (root →
    `<org>` → `<domain>`) and rm 24h-old transient siblings, in
    addition to the existing root-level `.deleted-*` sweep. Skips
    non-validated org dirs and symlinks; per-entry errors only log.
  - Wired the janitor into `scaffoldNewOrganization` (was only
    called from `cleanupOrgFilesystem`) so reseed paths sweep too.
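The matching rule can be sketched as below. The regex is an assumption written as a standard alternation for the two suffixes named above, the `.tmp` check is a simplification of whatever the original `isAtomicWriteTmp` matched, and `hasRealFiles` is a hypothetical stand-in for `dirHasFiles` that takes a list of entry names instead of reading a directory.

```typescript
// Assumed pattern: "<bundle>.staging-<8 hex>" or "<bundle>.replacing-<8 hex>".
const TRANSIENT_SUFFIX = /\.(staging|replacing)-[a-f0-9]{8}$/;

function isTransientArtifact(name: string): boolean {
  // Simplified `.tmp` check; the real temp-file naming may be stricter.
  return name.endsWith(".tmp") || TRANSIENT_SUFFIX.test(name);
}

// A directory counts as populated only if something other than a
// transient artifact lives in it.
function hasRealFiles(entryNames: readonly string[]): boolean {
  return entryNames.some((name) => !isTransientArtifact(name));
}
```

A domain directory that holds nothing but a crashed `.staging-*` orphan then reads as empty, so an `override:false` reseed no longer skips it.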

P1-31 — `tale deploy --override` + `--override-all` together is
nonsense: host push runs first, then the catalog factory reseed
clobbers everything --override would have written. Operators were
hitting this combination and reasoning about a silently-discarded
flag. Reject the combination at commander parse time with a
diagnostic that explains the two modes are mutually exclusive.

P1-32 — entry-time legacy-flat-layout check ran unconditionally
inside `withLock`, before any compose action. Plain `tale deploy`
(container rotation; no host push) has no host-push hazard, so the
duplicate check trapped operators with leftover legacy artifacts
who just wanted to roll containers. Gated on
`options.override || options.overrideAll`; the host-push code path
at syncProjectFiles enforces the same check where it matters.

P1-33 — `LEGACY_DOMAIN_DIR_NAMES` (deploy.ts) blocked operators
with legitimately-named org slugs (`agents`, `workflows`, `branding`,
`providers`, `skills`, `integrations`, `retention`) at deploy time
— but `reserved-org-slugs.ts` reserved only `default`, so the UI
happily created those orgs in the first place. The CLI then
classified the org's `<root>/<orgSlug>/` dir as a legacy artifact
and refused to deploy; the error message recommended
`tale migrate config-layout`, which would silently merge that
org's data under `default/`. Data-loss risk.

Fix: move the legacy-domain set into `reserved-org-slugs.ts` so the
UI form's zod refine (organization-form.tsx:73) and Better Auth's
beforeCreate/beforeUpdate hooks all refuse these names up front.
The CLI's `LEGACY_DOMAIN_DIR_NAMES` stays where it is (different
package, hard to share Convex code with), with a comment to keep
the two sets in lockstep.

P1-34 — `tale init` recorded checksums for example files but NOT
for the four AI rules files (CLAUDE.md, .cursor/rules/tale.mdc,
.github/copilot-instructions.md, .windsurfrules). The rules were
written AFTER `writeChecksums`, so the first `tale update` after
init saw `oldHash === undefined` for each and hit the unconditional
"new" branch at update.ts:95-101 — silently clobbering any local
edits the user made between init and that first update.

  - init.ts: moved the rules-file write loop ABOVE the checksum
    construction, and added each rules file to `allFiles`.
  - update.ts: defense-in-depth — the `!oldHash` branch now also
    checks `!existsSync(destPath)`. If the file is present on disk
    but absent from checksums.json (legacy projects init'd by the
    pre-fix CLI), treat it as locally modified: preserve unless
    `--force`, with a warning.
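The `update.ts` branch can be written as a small decision function. This is a sketch of the logic as described, with the filesystem check passed in as a boolean; `decideUpdateAction` and the three action names are invented for the example.

```typescript
type UpdateAction = "write" | "preserve" | "compare";

function decideUpdateAction(
  oldHash: string | undefined,
  existsOnDisk: boolean,
  force: boolean,
): UpdateAction {
  if (oldHash === undefined) {
    // Not in checksums.json and not on disk: a genuinely new file.
    if (!existsOnDisk) return "write";
    // On disk but untracked (project init'd by the pre-fix CLI):
    // treat as locally modified unless --force is given.
    return force ? "write" : "preserve";
  }
  // Tracked file: fall through to the normal checksum comparison.
  return "compare";
}
```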

Verified: `bun run check`: 36/36 tasks green; scaffold's 24 tests
still pass after the janitor rewrite.
Closes P1-19, P1-20, P1-21, P1-22, P1-23, P1-24, P1-25, P1-26, P1-27,
and the P1-29 boot-dim test gap from the multi-agent review.

P1-19 — `RagService.shutdown()` had two gaps. New `_shutting_down`
flag is checked at the top of `_ensure_org_clients`; requests landing
between `_org_clients.clear()` and `await close_pool()` now raise
RuntimeError instead of repopulating the cache and binding to a
closing pool. The unbounded `gather(*_background_tasks)` drain (whose
underlying `_safe_close` coroutines each `asyncio.sleep(30)`) is now
wrapped in `asyncio.wait_for(_, timeout=10)`; on timeout the still-
pending tasks are cancelled so shutdown completes promptly.

P1-20 — `_get_org_lock` claimed bounded LRU in the comment but was
actually FIFO with no reordering. A busy org's lock could be evicted
while held by fiber A; fiber B then got a fresh lock and both raced
into `_build_or_refresh_org_clients` with `previous=None`, silently
overwriting each other's client set (no `_safe_close` scheduled
since the cleanup at line 289-302 only fires when `previous is not
None`). Switched `_org_locks` and `_org_clients` to `OrderedDict`
with `move_to_end` on every access; eviction scans for the LRU
*unheld* lock rather than blindly popping the head. `_org_clients`
is now also bounded by the same `_ORG_LOCKS_MAX`, with `_safe_close`
scheduled per evicted client.

P1-21 — new `tests/test_rag_service_concurrency.py` (5 cases)
locks the invariants: shutdown gate, drain timeout, LRU
move-to-end, eviction skips held locks, `_pin_dim_lock` first-write
race serialises across two concurrent dim-pinners.

P1-24 — `database.py` boot-time dim handling. Restored the legible
"dimension mismatch" RuntimeError that was lost in the post-refactor
"ALTER unconditionally" path: now pre-reads `format_type(atttypid,
atttypmod)` on chunks.embedding; if pinned to vector(N) with N !=
configured dim, raise a clear message naming both values. Skip the
ALTER (and its AccessExclusiveLock) when the column already matches.
Pre-existing `_fake_pool` test helper now also tracks fetchval; new
`BOOT_PINNED_DIMS` module global is exported and cleared by the
test fixture.

P1-25 — crawler `_org_states` / `_vision_states` / `_chat_states`
were unbounded dicts holding `AsyncOpenAI` httpx pools. Under typo'd-
slug churn the file-descriptor footprint grew indefinitely. Switched
all three to bounded LRU `OrderedDict`s capped at 64; eviction
schedules `_safe_close` after the standard grace window.

P1-26 — vision hot paths (`ocr_image`, `describe_image`) called
`settings.get_vision_model(get_active_org())` per request, which
routes through `load_providers` → glob providers dir + parse JSON +
fork `sops -d` per `.secrets.json`. On a multi-page PDF OCR run, the
sops fork storm dominated. Now reads the cached model id from
`_vision_states[org].config[2]` — the same pattern
`process_pages_with_llm:456` already uses.

P1-27 — `embedding_service.get_embedding_service` only checked dim
drift within the same org. With chunks.embedding pinned globally to
the default org's dim at boot (P1-24), a second org with a
disagreeing provider config would succeed at config-load and crash
only at INSERT/search time. Now imports `database.BOOT_PINNED_DIMS`
and raises a clear RuntimeError at config-load time naming both
dims and the offending org.

P1-22 — `register_website`'s ON CONFLICT used to overwrite the
shared `websites.scan_interval` on every re-register, so any
member-org silently re-set everyone else's cadence. Now uses
first-org-sets-cadence semantics: ON CONFLICT only touches `status`
(see P1-23) and `updated_at`. Updating cadence requires the
explicit `update_scan_interval` API. Full per-org cadence move to
`website_org_memberships` is deferred to a follow-up; the immediate
"silent clobber on join" bug is closed.

P1-23 — `recover_stuck_deletes` + `execute_delete` race. Between
`begin_delete` marking the row 'deleting' and `execute_delete`
firing on the background task, a new org could join via
`register_website` (now reset-to-idle on conflict — see P1-22).
`execute_delete` now re-checks `COUNT(website_org_memberships)` in
the same tx; if any membership exists it aborts the CASCADE and
flips status back to 'idle' rather than killing the new org's
content.

P1-29 — added two `test_database.py` cases: dim-pin mismatch raises
with legible message and rolls back the pool; already-correctly-
pinned column skips the ALTER (no AccessExclusiveLock churn).

Verified: bun run check 36/36 tasks green; RAG 318 tests (5 new
concurrency cases); crawler 487 tests (2 new dim cases).
Closes the final P1 cluster from the org-first review:

- P1-15: SSE watcher now emits invalidations for <org>/retention.json so
  governance UI cache refreshes when an operator edits the file.
- P1-16: replace 6 empty readdir catches (agents, workflows file actions)
  with handleDirReadError helper — ENOENT silently falls through, other
  errors are logged instead of silently swallowed.
- P1-17: move overscroll-behavior:none from packages/ui/src/globals.css
  to platform/app/locals.css so docs + web shells keep native rubber-band
  scroll.
- P1-30: extract canonical ORG_SLUG_RE + validate_org_slug to
  tale_shared.config.org_slug and route provider loader + RAG and crawler
  auth dependencies through it; drops three near-identical local regexes.
- P1-37: README.fr.md tu-form alignment (exécutez → exécute).
- P1-38: doc paths under TALE_CONFIG_DIR now include the <orgSlug>
  segment across en/de/fr models + integrations overview pages, matching
  the org-first layout.
- P1-39: lift RetentionConfigMissingError → ConvexError translation into
  computeEffectiveAppliedBounds so the four call sites can drop their
  duplicated try/catch; tightens seeder's catch to the
  RETENTION_CONFIG_MISSING code only.
Default ~30s timeout causes transient PyPI fetch failures on slow links
when pulling large wheels (scipy, playwright/patchright, ML libs).
addKnowledgeFile and removeKnowledgeFile accepted a caller-supplied
fileId: v.id('_storage') after verifying only org membership, then
called ctx.storage.delete / saveFileMetadata / indexKnowledgeFile
against that fileId. Convex _storage is deployment-global and the
fileMetadata by_storageId index is not org-scoped, so an authenticated
member of org A could pass an org B storageId and delete the blob,
patch the foreign metadata row, or schedule indexing against B's blob.

removeKnowledgeFile now requires the fileId to be present in the
org-scoped agentBindings.knowledgeFiles array before touching anything.
addKnowledgeFile cross-checks via fileMetadata: a foreign storageId
with metadata in another org is rejected. Both paths share an opaque
'file_not_in_org' ConvexError to avoid cross-org existence probing.

cleanupAgentBinding remains safe by construction (it iterates only
this binding's knowledgeFiles, which addKnowledgeFile's new gate
keeps clean of foreign fileIds).
1. governance/erasure.ts — eraseSubjectTwoFactorAttempts and
   eraseSubjectLoginAttempts wiped global-key tables (twoFactorAttempts
   by userId; loginAttempts/loginBlockCounters by email). For a
   multi-org subject, an admin of org A filing erasure would silently
   reset the subject's 2FA backoff + login lockout state for every
   other org they belong to — a cross-tenant auth-throttling bypass
   primitive. Gate both wipes via new subjectIsMemberOfOtherActiveOrgs
   helper: when the subject still belongs to another active (non-
   disabled) org, skip the wipe and log a partial-outcome warning.
   The other org's auth state stays intact; the cleanup runs only when
   the admin of the subject's last remaining org files the erasure.

2. http.ts /api/sse/auth — the membership lookup didn't filter
   role === 'disabled'. A soft-removed member kept receiving SSE
   file-change events for the org they were kicked from. Now matches
   the canonical getUserOrganizations filter.

3. agents/file_actions.ts — readHistoryEntry and restoreFromHistory
   used resolved.startsWith(path.resolve(historyDir)) without a
   trailing path.sep and never validated args.timestamp shape. A
   crafted timestamp containing '../' or referencing a sibling
   agent's history dir whose path string starts with the original
   prefix could escape the agent's history scope. Replaced with
   safeJoinWithinDir + validateTimestamp from lib/file_io (the same
   helpers the rest of the agents module uses).
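The prefix bug in item 3 is reproducible with plain strings. The sketch below assumes POSIX separators and already-resolved paths, and it only shows the missing-separator half of the problem; it is not a reimplementation of `safeJoinWithinDir`, and both function names are illustrative.

```typescript
const SEP = "/"; // POSIX separator assumed for the sketch

// Old check: a bare string prefix.
function naiveWithin(dir: string, resolved: string): boolean {
  return resolved.startsWith(dir);
}

// Stricter check: the prefix must end at a path boundary.
function strictWithin(dir: string, resolved: string): boolean {
  const prefix = dir.endsWith(SEP) ? dir : dir + SEP;
  return resolved.startsWith(prefix);
}
```

With `dir = "/data/history/agent-a"`, the sibling path `/data/history/agent-ab/x.json` passes the naive check because the strings share a prefix, and fails the strict one.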
1. execute_delete (pg_website_store.py) ran a same-tx membership COUNT
   then DELETE under READ COMMITTED with no parent-row lock. A
   concurrent register_website from another org could insert a fresh
   membership row between the COUNT and the DELETE; the FK CASCADE
   on website_org_memberships(domain) would then silently wipe the
   new org's just-inserted membership. Take a row-level lock via
   SELECT ... FROM websites WHERE domain = $1 FOR UPDATE before the
   COUNT, forcing concurrent register_website's ON CONFLICT DO UPDATE
   to serialize.

2. register_website returned the request's scan_interval verbatim even
   though ON CONFLICT preserves the originally-stored value (first-org
   sets cadence). A second org joining the same domain with a different
   cadence saw their input echoed while the scheduler kept running on
   the first org's interval — silent contract drift. Surface the
   stored value via RETURNING and echo it back through the router.
ORG_SLUG_REGEX was unbounded (/^[a-z0-9][a-z0-9_-]*$/) while the
Python validator at packages/tale_shared/.../org_slug.py is capped at
64 chars (/^[a-z0-9][a-z0-9_-]{0,63}$/, fullmatch). A long display
name went through deriveOrgSlug → Better Auth → organization row with
no length check; the first RAG / crawler call then 400'd on
require_org_slug forever — the org was bricked with no recovery.

- Add MAX_ORG_SLUG_LENGTH = 64 export.
- Tighten ORG_SLUG_REGEX to {0,63} so assertValidOrgSlug
  (called from beforeCreateOrganization + beforeUpdateOrganization
  in convex/auth.ts) rejects long slugs at the Better Auth hook.
- Truncate deriveOrgSlug in organization-form.tsx so the UI
  preview matches what will actually be persisted.

All four layers (TS regex, Convex hook, Python validator, UI form)
now enforce the same 64-char ceiling.
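A sketch of the tightened regex next to a truncating slug derivation. The regex and the constant are as quoted above; the body of `deriveOrgSlug` here is hypothetical (lowercase, collapse non-alphanumerics to `-`, trim, truncate) since the real derivation rules are not shown.

```typescript
const MAX_ORG_SLUG_LENGTH = 64;
const ORG_SLUG_REGEX = /^[a-z0-9][a-z0-9_-]{0,63}$/;

// Hypothetical derivation; only the truncation step is the point here.
function deriveOrgSlug(displayName: string): string {
  const slug = displayName
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
  // Truncate, then drop a hyphen the cut may have left at the end.
  return slug.slice(0, MAX_ORG_SLUG_LENGTH).replace(/-+$/, "");
}
```

A 65-character slug now fails `ORG_SLUG_REGEX` at the hook instead of being persisted and rejected later by the Python validator.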
Previously the migrate script ran only inside the convex container against
$DATA; it never touched the operator's host project directory. But
`tale start` and `tale deploy` hard-fail when they detect legacy per-
domain dirs (agents/, workflows/, …) at the project root and point the
operator at `tale migrate config-layout`. Running the suggested fix
left those dirs in place; the operator would re-run `tale start` and
hit the same error: a deadlock with no documented escape.

New host-layout phase runs first (in the TypeScript wrapper, not the
docker'd bash):
- Detects each LEGACY_DOMAIN_DIR_NAMES dir at the project root.
- Atomically renames it to `default/<dir>/` (same-fs rename(2)).
- Refuses to overwrite a populated `default/<dir>/` and records a
  conflict the operator must resolve.
- Supports --dry-run.
- --cleanup-old skips the host phase (the rename moves the dirs, so no
  old copy is left to clean).

Also fixes the docs/<locale>/ placeholder in two error messages
(start.ts and migrate-config-layout.ts) — render docs/en/ literally
since these messages are operator-facing CLI output, not translated
copy.
ADMIN_KEY_RE's value charset was [A-Za-z0-9+/=._-] — it stopped at the
first `|`. Self-hosted Convex admin keys are formatted
`<INSTANCE_NAME>|<base64-payload>` (Convex generate_key contract). When
reseed-all-orgs hit the failure or unparseable-success log paths, the
redactor truncated the key at the pipe and emitted
`Admin Key: <redacted>|<actual-secret-payload>` to operator stderr/CI.

Add `|` to the charset. New test asserts the full pipe-delimited shape
collapses to `Admin Key: <redacted>` with no payload survival.
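The truncation is easy to reproduce. The two regexes below are assumptions built from the charsets quoted above (the real `ADMIN_KEY_RE` may carry more structure), and the key value is made up for the example.

```typescript
// Assumed shapes; only the character classes come from the text.
const OLD_ADMIN_KEY_RE = /Admin Key: [A-Za-z0-9+\/=._-]+/g;
const NEW_ADMIN_KEY_RE = /Admin Key: [A-Za-z0-9+\/=._|-]+/g;

function redact(line: string, pattern: RegExp): string {
  return line.replace(pattern, "Admin Key: <redacted>");
}

// Made-up key in the `<INSTANCE_NAME>|<payload>` shape.
const sampleLine = "Admin Key: convex-self-hosted|AbC123+/=";
```

With the old charset the match stops at the pipe, so `redact(sampleLine, OLD_ADMIN_KEY_RE)` still contains the payload; with `|` in the class the whole key collapses.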
- assertValidOrgSlug failures are now wrapped in APIError('BAD_REQUEST')
  at both Better Auth hooks so invalid input returns 400 instead of 500.
- orgSlugFromId carries a stronger doc warning: it does NOT verify
  membership; callers must pre-verify or trust the source. Rename
  deferred (185+ call sites — too risky for this PR).
- /api/sse/auth comment corrected (256 is a soft cap, not a hard
  limit) and we now warn when truncation actually hits.
- ragFetch sets x-tale-org from the trimmed slug so accidental
  whitespace doesn't ride into RAG's filesystem lookup. JSDoc
  corrected: content + compare-by-id + compare-files all require
  org_slug per the RAG router (verified against documents.py:475,518,564
  and search.py).
- generateAgentResponse wraps orgSlugFromId in try/catch; a transient
  lookup miss degrades to "skip knowledge context" instead of aborting
  the whole response (matches the guardrails-resolve pattern).
- cascade_helpers resolves orgSlug BEFORE the storage.delete loop.
  Previously a slug-lookup failure landed AFTER blobs were already
  out-of-band deleted, leaking RAG chunks with no purge scheduled.
- parseRetryAfterMs caps at 24h (MAX_RETRY_AFTER_MS); a malicious /
  buggy upstream sending '1e10' or a far-future HTTP date would
  otherwise pin scheduler backoff at ~317 years. New tests cover
  scientific-notation, oversized-seconds, far-future-date branches.
- acquireTranscriptionLock rejects rows already in 'completed' status.
  Without this guard, a late-arriving duplicate transcribeAudio schedule
  re-bills Whisper and re-writes the transcript on a row whose previous
  run already succeeded. The single chokepoint is the right place to
  enforce; entry-point pre-checks remain as belt-and-braces.
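For the `parseRetryAfterMs` cap mentioned in the list above, a minimal sketch follows. `MAX_RETRY_AFTER_MS` and the 24h value are from the text; the parsing details (delta-seconds first, then an HTTP date, negatives clamped to zero, `now` passed in explicitly) are assumptions about a reasonable implementation, not the project's code.

```typescript
const MAX_RETRY_AFTER_MS = 24 * 60 * 60 * 1000;

function parseRetryAfterMs(header: string, now: number): number | undefined {
  const trimmed = header.trim();
  if (trimmed === "") return undefined;

  let ms: number;
  const seconds = Number(trimmed);
  if (Number.isFinite(seconds)) {
    // Delta-seconds form; Number() also accepts "1e10".
    ms = seconds * 1000;
  } else {
    // HTTP-date form.
    const at = Date.parse(trimmed);
    if (Number.isNaN(at)) return undefined;
    ms = at - now;
  }

  if (ms < 0) return 0;
  return Math.min(ms, MAX_RETRY_AFTER_MS);
}
```

Without the final `Math.min`, a header of `1e10` seconds would yield 1e13 ms, which is the roughly 317-year backoff the message refers to.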

- extractFileMetadata stamps a terminal marker (visionRequired: false +
  scannedPagesDetected: 0) on permanent failure. Previously, the catch
  logged and returned, leaving visionRequired undefined forever — the
  UI's "still extracting" state had no exit when extraction failed on
  a 4xx, malformed response, or org-resolve failure.

- serve-branding-images parses req.url through new URL(...) so query
  strings (?v=2 cache-busters, etc.) are dropped before filename
  validation. Without this, dev silently 404'd on any image URL with a
  query while prod's c.req.param handler worked fine. Also adds the
  imagesDir + sep defense-in-depth prefix check to match server.ts.

- reindexDocumentInRag now accepts oldOrganizationId and uses it to
  scope the old-RAG delete BEFORE the early-return on missing document.
  Previously, if the document row was deleted/cleared between scheduling
  and execution, the early-return skipped the delete and orphaned the
  oldFileId chunks in RAG forever. updateDocumentInternal passes the
  current document's organizationId at schedule time so the delete-org
  context survives any later doc state changes.
- queryRagContext now hoists the orgSlug check OUTSIDE the outer
  try/catch so an empty/blank slug throws cleanly instead of being
  silently swallowed by the graceful-degrade catch (returning
  undefined as if RAG had no results). The JSDoc said the failure
  was surfaced; runtime now matches.
- search_pages explicitly rejects malformed `args.domain` rather than
  silently dropping the filter and running a global org search — the
  LLM thinks its filter applied and the user gets unrelated hits.
- search_pages wraps fetchSearch in try/catch and degrades to a
  "search temporarily unavailable" reply on crawler failure, matching
  the sibling fetch_and_extract helper's {success:false} contract.
- REST POST /websites/:id/sync now invokes syncSingleWebsite scoped to
  the specific id instead of org-wide syncWebsiteStatuses. Previously
  the :id path param only served as an ownership tripwire and callers
  got a surprise org-wide side effect.
- fetchPages debounces the inline syncSingleWebsite schedule to 1 hour
  via metadata.lastStatusSyncAt, matching syncWebsiteStatuses' throttle.
  Without this, every page view / poll fanned out a concurrent crawler
  sync that raced last-write-wins on the row's status field.
- fetchHomepageMetadata now logs the crawler HTTP failure via
  console.warn so a blank title/description doesn't look like a real
  "no metadata" signal to operators triaging.
- applyDocxStructured wraps saveFileMetadata in try/catch and deletes
  the orphan _storage blob on metadata-mutation failure. Convex
  _storage is reference-counted only by application rows; a partial
  failure between upload and saveFileMetadata leaked the blob forever.
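The `fetchPages` debounce from the list above reduces to one timestamp comparison. In this sketch `lastStatusSyncAt` stands for the value read from `metadata.lastStatusSyncAt`; the constant and function names are illustrative.

```typescript
const STATUS_SYNC_DEBOUNCE_MS = 60 * 60 * 1000; // 1 hour, per the text

function shouldScheduleSync(
  lastStatusSyncAt: number | undefined,
  now: number,
): boolean {
  // Never synced: schedule. Otherwise wait out the debounce window.
  return (
    lastStatusSyncAt === undefined ||
    now - lastStatusSyncAt >= STATUS_SYNC_DEBOUNCE_MS
  );
}
```

Repeated page views inside the hour return `false`, so they no longer fan out concurrent crawler syncs against the same row.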

R10-P2-a (getAccessibleDocumentIds full-org scan in workflow ops)
is deferred — it needs a new hasDocumentAccess(documentId) internal
query and a memo-hoist across consecutive ops. Not blocking at demo
stage scale; tracked separately.
larryro added 15 commits May 29, 2026 02:07
…stency

- retention_cleanup.ts cleanupLoginAttemptsGlobal now carries an
  explicit comment documenting the legal-hold trade-off: the 30-day
  fixed TTL sweep intentionally does NOT cross-check active holds
  because pulling email→userId→cross-org-holds resolution back into a
  global sweep would re-introduce the per-org coupling the Phase 11
  reframe deliberately removed. Forensics relevant to a hold live in
  the per-org auditLogs stream, which IS hold-gated.

- scaffold.ts seedRetention: non-ENOENT stat errors (EACCES on a
  chmod-locked file, EPERM on an immutable-bit attribute, ELOOP on a
  symlink cycle) previously fell through and silently overwrote the
  locked file. Treat unknown stat failures as "target exists" so the
  override:false branch refuses, and surface the message in the result
  so a deploy reports the failure instead of producing a silent clobber.

- loadIntegration: orgSlug and organizationId were trusted independently
  even though they drive different reads (orgSlug → filesystem config;
  organizationId → DB credentials). A mismatched pair would silently
  splice org A's config template with org B's encrypted secrets. Resolve
  canonically via orgSlugFromId and refuse on mismatch — backward-
  compatible with every current caller while closing the consistency
  invariant.
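The `loadIntegration` consistency check can be sketched as a guard that resolves the slug from the id and compares. The real `orgSlugFromId` presumably takes a Convex context and is asynchronous; here a synchronous lookup is injected, and `assertOrgPairConsistent` is a hypothetical name.

```typescript
function assertOrgPairConsistent(
  orgSlug: string,
  organizationId: string,
  orgSlugFromId: (id: string) => string,
): void {
  // organizationId is the canonical source; orgSlug must agree with it.
  const canonical = orgSlugFromId(organizationId);
  if (canonical !== orgSlug) {
    throw new Error("orgSlug does not match organizationId");
  }
}

// Toy lookup table for the example.
const slugById: Record<string, string> = { org_a: "acme", org_b: "globex" };

function lookupSlug(id: string): string {
  const slug = slugById[id];
  if (slug === undefined) throw new Error("unknown organization");
  return slug;
}
```

A caller passing `("acme", "org_b")` is refused before any filesystem config or encrypted credential is read, so the two reads can never be spliced from different orgs.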

R12-P2-c (open-run watchdog) is already handled by the existing
STALE_RUN_AGE_MS / STALE_HEARTBEAT_MS reclaim path in claimRetentionRun
— no code change needed.
…uments-table

- convexHttpActionsBaseUrl now parses CONVEX_URL via new URL() and sets
  parsed.port = '3211' explicitly. The previous `:\d+$` regex only
  matched URLs ending in a literal port; operators with a bare
  hostname (`https://convex.example.com`) or a path suffix
  (`http://convex:3210/sub`) silently got the wrong port and every
  SSE auth lookup 401'd.

- config-watcher's single-file-per-org branch now consults a
  SINGLE_FILE_ORG_CONFIGS set rather than a hardcoded `stem === 'retention'`
  comparison. Adding a future per-org config file (`quota.json`, etc.) is
  now a one-line change instead of silently no-op'ing because the watcher
  doesn't recognize the stem.

- documents-table.tsx eager-pagination predicate (`hasActiveQuery`) now
  includes the context-level `selectedTeamId`. The page-level team filter
  feeds into filterDocumentResults too, so without this the user could
  pick a team in the page filter (no other filters/search), see only the
  first page of results matching that team, and have no way to scroll the
  rest into view.
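The `convexHttpActionsBaseUrl` change described above can be compared side by side. Both functions are sketches: the old one applies the quoted `:\d+$` replace, the new one uses the standard `URL` API. Returning `parsed.origin` (which drops any path) is an assumption about what "base URL" means here.

```typescript
// Old approach: only works when the string ends in a literal port.
function httpActionsBaseUrlOld(convexUrl: string): string {
  return convexUrl.replace(/:\d+$/, ":3211");
}

// New approach: parse, set the port explicitly, return the origin.
function httpActionsBaseUrlNew(convexUrl: string): string {
  const parsed = new URL(convexUrl);
  parsed.port = "3211";
  return parsed.origin;
}
```

For a bare hostname such as `https://convex.example.com` the regex finds nothing to replace and the URL comes back unchanged, while the parsed version yields `https://convex.example.com:3211`.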
- Delete services/rag/app/utils/sanitize.py — sanitize_org_slug had
  zero call sites in the RAG service, no tests, and its charset (which
  accepted uppercase) diverged from the canonical validate_org_slug at
  packages/tale_shared/.../org_slug.py. Dead module.

- auth.py require_org_slug now uses ORG_SLUG_RE.fullmatch instead of
  .match for parity with the canonical validator. The regex carries
  explicit ^...$ anchors so this only matters for the trailing-\n
  edge case today, but the canonical contract is fullmatch — keep it
  in lockstep.

- database.pin_embedding_dimensions now also pins the
  semantic_cache.query_embedding column when the table exists.
  It was previously declared as plain vector (any-dim) and never
  aligned: on a dim change the next lookup's `<=>` operator raised
  "different vector dimensions", the generic exception handler
  silently swallowed it, and all subsequent cache reads returned
  None until a manual purge. The fix TRUNCATEs on mismatch because
  stored vectors can't be coerced across dims.

- rag_service _safe_close uses an interruptible asyncio.Event-based
  sleep instead of plain asyncio.sleep(30). shutdown() now sets the
  event before draining the background-task pool, so the wrapped
  close coroutine runs immediately instead of being cancelled
  mid-sleep when the 10s drain timeout fires. Previously each
  refresh-evicted client's httpx pool leaked through process exit.
- delete_page_chunks: `domain` parameter is now required. The previous
  `domain=None` branch issued `DELETE FROM chunks WHERE url=$1` which
  spans every domain that ever ingested the path, silently
  over-deleting another org's chunks on shared paths like `/about`.
  No production caller relied on the omit-domain behavior; only legacy
  tests did.

- vision/openai_client.process_pages_with_llm: per-chunk LLM failure
  now logs at error level (was warning) AND prepends an explicit
  `[LLM_EXTRACTION_FAILED: <type>]` marker to the returned chunk.
  Downstream storage / indexing can now distinguish "LLM extracted
  this" from "LLM died, this is the raw input pretending to be
  extraction".

- vision/openai_client._safe_close_client and
  embedding_service._close_old grace window extended from 30s to
  300s. Vision requests can run up to 180s (vision_request_timeout)
  and chat completions can run for ~300s; the previous 30s window
  tore down the httpx pool while a long PDF OCR was still in flight.

- vision/openai_client.process_pages_with_llm cache_key now includes
  the resolved client.base_url so an in-org provider rotation (same
  model id, different upstream) doesn't serve cached outputs from
  the previous provider.

- main.py lifespan teardown drains per-org client caches (_org_states
  in embedding_service; _chat_states and _vision_states in
  vision/openai_client) under a 10s bound. Previously each held an
  AsyncOpenAI httpx pool that was reclaimed only at process exit,
  producing noisy "Event loop is closed" tracebacks under
  uvicorn --reload / docker rolling restart.

- migrations/20260528000000_add_website_org_memberships.sql now
  documents the implicit `default` org assumption so an operator
  with a non-default-only layout has a clear signal when bounded_scan
  errors on the missing provider catalog.
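
The interruptible sleep described for rag_service _safe_close above can be sketched roughly as follows. This is a minimal illustration, not the actual service code: `interruptible_sleep`, `safe_close`, and the `close` callable are hypothetical names, and the real coroutine may differ in detail.

```python
import asyncio
import contextlib
from typing import Awaitable, Callable


async def interruptible_sleep(stop: asyncio.Event, delay: float) -> bool:
    """Wait up to `delay` seconds, returning early once `stop` is set.

    Returns True if the event was set, False if the delay simply elapsed.
    """
    # asyncio.TimeoutError is an alias of the builtin TimeoutError on 3.11+.
    with contextlib.suppress(asyncio.TimeoutError):
        await asyncio.wait_for(stop.wait(), timeout=delay)
    return stop.is_set()


async def safe_close(
    close: Callable[[], Awaitable[None]],
    stop: asyncio.Event,
    grace: float,
) -> None:
    """Give in-flight requests `grace` seconds, then close the client.

    If shutdown sets `stop` first, the close runs immediately instead of
    being cancelled in the middle of a plain sleep.
    """
    await interruptible_sleep(stop, grace)
    await close()
```

With this shape, a shutdown path that sets the event before draining its task pool lets every pending close finish inside the drain timeout.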
…cript

- migrate-config-layout: copy_secret stages via .tale-migrate.<pid>
  tmp + mv -f so a SIGINT mid-copy can't leave a half-written dst
  that the cmp-s guard then refuses to overwrite (operator stuck).
- migrate-config-layout: detect_default_dst_collisions now exits
  non-zero with a clear "MIGRATE_ABORT" message BEFORE process_secret
  iterates. Previously the script logged the conflict but proceeded;
  whichever source iterated first won and end-state depended on dir
  order.
- deploy.ts ORG_SLUG_REGEX gains the {0,63} length cap to match the
  shared platform constant + the dev-compose generator. Without this
  the deploy-side enumerator would push slugs the platform itself
  refuses to mint.
- docker-entrypoint.sh: new atomic_cp_bundle stages bundle dirs into
  .tale-seed.<pid> then renames over dest. Previously integrations
  and skills used raw `cp -r` which left half-populated bundles on
  interruption that the next-run `[ -d dest ]` probe then skipped
  permanently as "already seeded".
- docker-entrypoint.sh workflows loop replaces `&&` + `; continue`
  with an explicit if/else so set -e + the trailing semicolon don't
  silently swallow mkdir/atomic_cp failures. Failed seeds now hit
  log_error instead of producing neither a ✓ nor an error line.
- 2026-03-28-migrate-convex-data.sh accepts --old-volume / --new-volume
  flags and defaults to COMPOSE_PROJECT_NAME-derived names. The
  previous hardcoded `tale_…` shape silently skipped the migration
  for any operator running compose with a different -p name.
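
The stage-then-rename pattern that copy_secret and atomic_cp_bundle use above is implemented in shell in this PR. A Python analogue is sketched here purely to show the idea; `atomic_copy` is a hypothetical helper, and only the `.tale-migrate.` prefix is borrowed from the commit text.

```python
import contextlib
import os
import shutil
import tempfile


def atomic_copy(src: str, dst: str) -> None:
    """Copy `src` to `dst` so that `dst` is either absent/old or complete.

    The data is staged in a temp file in the destination directory, so the
    final os.replace() is a same-filesystem rename rather than a copy.
    """
    dst_dir = os.path.dirname(os.path.abspath(dst))
    fd, tmp = tempfile.mkstemp(prefix=".tale-migrate.", dir=dst_dir)
    os.close(fd)
    try:
        shutil.copyfile(src, tmp)  # an interruption here only leaves the tmp file
        os.replace(tmp, dst)       # atomic on POSIX when on the same filesystem
    except BaseException:
        # Best-effort cleanup of the staging file; the destination is untouched.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp)
        raise
```

An interrupted run therefore never leaves a half-written destination for a later comparison guard to trip over. This sketch does not fsync, and the result keeps the staging file's restrictive permissions.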
… cluster

- rag_service _safe_close uses contextlib.suppress(TimeoutError) per
  ruff SIM105.
- crawler tests test_website_membership now feed fetchrow as a list
  (websites UPSERT RETURNING + membership INSERT RETURNING) since
  register_website added a second fetchrow when surfacing the stored
  scan_interval.
- platform scaffold seedRetention error-message uses JSON.stringify on
  non-Error targetStatErr to satisfy no-base-to-string.
- platform config-watcher single-file gate replaced with an
  isSingleFileOrgConfig type predicate so stem narrows without an
  unsafe assertion.
…t_metadata, reserved-slug

`workflows/file_actions.ts` was the last sub-system whose public actions
took `organizationId` as a raw arg and resolved it through
`resolveOrgSlug` without first verifying caller membership; any
authenticated user could read or mutate another org's workflows by
passing that org's id. Replace the auth + resolveOrgSlug pair with a
single `requireOrgMembershipById` call on every public action, mirroring
the pattern already used in agents / threads / integrations / providers.
Same edit also tightens two adjacent hazards in the same file:
`renameWorkflow` now refuses to clobber an existing target (the old
`atomicWrite` flow silently overwrote the victim file), and
`readHistoryEntry` / `restoreFromHistory` route the path through
`safeJoinWithinDir` with explicit slug + timestamp validation so the
inline `startsWith` check can't be bypassed via a sibling dir whose name
shares the prefix.
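
The sibling-directory bypass mentioned above is easiest to see in a small example. The real helper, `safeJoinWithinDir`, is TypeScript and its body is not shown in this log; the Python below is only an analogue under that assumption, with hypothetical function names and paths.

```python
import posixpath


def naive_within(base: str, candidate: str) -> bool:
    """The flawed check: a plain string-prefix comparison."""
    return posixpath.normpath(candidate).startswith(posixpath.normpath(base))


def safe_join_within_dir(base: str, name: str) -> str:
    """Join `name` under `base`, refusing anything that escapes `base`.

    Lexical check only: this sketch does not resolve symlinks.
    """
    if not name:
        raise ValueError("empty name")
    base_norm = posixpath.normpath(base)
    joined = posixpath.normpath(posixpath.join(base_norm, name))
    # Require a path-separator boundary so ".history-evil" cannot pass as
    # being inside ".history"; also refuse resolving to `base` itself.
    if joined == base_norm or not joined.startswith(base_norm + "/"):
        raise ValueError(f"path escapes base directory: {name!r}")
    return joined


base = "/data/acme/workflows/.history"
# The prefix check accepts a sibling directory that shares the prefix.
print(naive_within(base, "/data/acme/workflows/.history-evil/x.json"))
```

The boundary-aware version rejects `../.history-evil/x.json`, absolute names, and names that resolve to the base directory itself.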

`document_action.get_metadata` was returning `fileName` for any
caller-supplied storage id because `getByStorageId` is a global
`by_storageId` lookup with no org filter, while the sibling
`extract_docx_structured` and `apply_docx_structured` branches already
gate ownership via `verifyStorageIdsBelongToOrg`. Add the same
ownership check inline (compare `fileMetadata.organizationId ===
organizationId`, treat the mismatch as "Unknown") so foreign-org
filenames stop leaking through the workflow steps.

`auth.ts` reserved-slug bypass was too wide: when `anyOrg.length === 0`
it admitted every reserved slug (`default`, `agents`, `branding`,
`providers`, …), so a racing first-signup user on a fresh deploy could
claim e.g. `branding` before the operator created `default` and wedge
the install in the `findOrgDirs` legacy-artifact trap. Narrow the
bypass to only `default` (the one slug the platform's first-run actually
needs); all other reserved slugs are now refused unconditionally.
…entralization

Two cross-cutting cleanups shipped together because the agent-tool RAG
files end up touching both:

(1) `lib/helpers/org_slug.ts` gains an `OrgSlugUnresolvableError` typed
    error and an `orgSlugFromIdOrNull` variant. Round-3 review found
    that ~29 of the 32 `orgSlugFromId` call sites either didn't catch
    the throw at all or caught it inside a try-block designed for some
    other error class, so a permanent slug miss (deleted-org race,
    replica skew) cross-contaminated unrelated work — abort of GDPR
    cascades, mis-classification of transcripts as failed, etc. The
    new helper preserves the throwing variant for security gates that
    must fail loud (workflow steps, agent-facing tool errors) and
    introduces a `*OrNull` form for the cascade / cleanup / multi-org
    batch sites where a missing slug is recoverable.

    Subsequent commits migrate the high-impact call sites.

(2) `format_search_results.ts` now strips reserved prompt tags inside
    the helper for every prompt-bound field (`content`, `filename`,
    `file_id`) at the single chokepoint, so a future caller can't
    forget the strip. Round-3 review found that 4 separate call sites
    were each stripping `r.content` independently and `query_rag_context`
    (the highest-volume chat auto-context path) was stripping nothing
    at all — a prompt-injection regression waiting to happen.

`rag_search_tool.ts` retrieve and search branches:
  - Use `orgSlugFromIdOrNull` and return the safe-summary shape
    ("Knowledge base temporarily unavailable.") instead of letting the
    raw `[orgSlugFromId] organization "..." has no slug` text bubble
    into the agent loop and onward to the UI toast.
  - Wrap the retrieve branch's network + parsing in try/catch.
  - Drop the now-redundant local `stripReservedPromptTags(r.content)`.
  - Non-`UpstreamHttpError` catches return the safe summary instead of
    re-throwing.

`query_rag_context.ts` inner catch now re-throws `UpstreamHttpError` so
the 4xx "auth misconfigured" signal escapes the graceful-degrade layer
rather than being collapsed into the same `undefined` return as a 5xx
RAG outage.

`rag_action.ts` (workflow) keeps its explicit per-call strip but
extends coverage to `result.title` (chunks path) and adds a recursive
`sanitizeMetadataStrings` over the search-result `metadata` field so
indexed-chunk metadata can't bypass SEC1 either.
…delete reindex

Continuation of the helper refactor in the prior commit: migrate the
seven high-impact call sites that round-3 review flagged for the
"`orgSlugFromId` throw cross-contaminates unrelated work" pattern,
plus three same-file hardenings that landed alongside.

threads/cascade_helpers.ts:
  Permanent slug miss (org row deleted, missing slug) used to return
  `{done:false, remaining:1}` forever; the retention sweep then burned
  its MAX_ATTEMPTS budget each cycle and gave up, accumulating orphan
  `fileMetadata` rows + `_storage` blobs indefinitely. Now uses OrNull,
  cleans local rows + storage even when slug is gone, and only skips
  the RAG-side purge (the tenant index is gone too). Also move the
  slug lookup inside the `filesPage.length > 0` branch, so empty pages
  no longer pay for an unnecessary Better Auth round-trip.

governance/erasure.ts:
  GDPR fan-out now degrades when slug is unresolvable instead of
  aborting; DB-side cascade continues. `subjectIsMemberOfOtherActiveOrgs`
  256-cap silent fail-open is fixed: at the cap we now warn + return
  true (fail-closed) so an operator account with many memberships
  doesn't accidentally trip a global throttle / 2FA wipe.

governance/retention_cleanup.ts:
  Empty-batch fast path before the slug lookup; OrNull so a missing
  slug skips the RAG DELETE step but the local document/temp-file
  deletes still proceed.

file_metadata/actions.ts:
  Cross-org failure traffic: a single org's RAG outage marked every
  other org's in-flight uploads as `failed` after 90 s. Track which
  orgs queried successfully and only run `expireStaleRagQueue` against
  their storage ids; an unresolvable org skips the bucket entirely.

file_metadata/internal_actions.ts:
  Resolve slug OUTSIDE the try block in `extractFileMetadata` so a
  slug miss doesn't get reclassified as a "permanent" failure and
  stamp `visionRequired:false` against an otherwise-healthy upload.

file_metadata/transcribe_audio.ts:
  Resolve slug ONCE at the top of the action; reuse on both the
  cache-path and post-Whisper RAG index sites. Previously each call
  re-queried Better Auth, and a transient adapter failure on the
  second lookup bubbled into the outer catch and re-queued a fresh
  Whisper call against already-completed work.

documents/internal_actions.ts:
  Three sites converted to OrNull: `checkRagDocumentStatus` (mark
  failed once instead of looping retries forever), `deleteDocumentFromRag`
  (proceed with the local DB delete if the org is gone — RAG index
  is gone too), and `reindexDocumentInRag` (now invokes the new
  `deleteOldRagEntry` helper). Also swap the reindex from delete-first
  to upload-then-delete: a failed upload now leaves the OLD RAG entry
  intact so search keeps returning the prior revision, instead of
  marking ragInfo.failed with no entries at all.
…ce stamp

`syncSingleWebsite` was the missing writer of `metadata.lastStatusSyncAt`:
`fetchPages` debounced on this field but the per-website sync action
never stamped it, so the debounce gate stayed open forever and every
subsequent `fetchPages` re-scheduled a new sync. Stamp the timestamp on
all three patch branches (success / not-found / error).

REST `/api/v1/websites/...` and the Convex `actions.*` surface used to
diverge in two important ways:

- DELETE only removed the `websites` row and left the crawler with a
  dangling registration; the Convex action correctly called
  `deregisterDomainFromCrawler` first. Add a new
  `deregisterAndDelete` internal action and route both surfaces through
  it so the crawler binding always goes away together with the row.

- `createWebsite` REST awaited `runAction(registerAndSync, ...)` while
  the Convex action used `scheduler.runAfter(0, ...)`. Likewise the
  `POST /:id/sync` REST sub-action awaited the full crawler round-trip
  before returning `{status: 'syncing'}`. Switch REST to `runAfter(0)`
  so the response is fire-and-forget in both places, the `'syncing'`
  status is honest, and caller latency stops being tied to the crawler.
…tter MEDIUM cleanup

A bundle of round-3 hardening items that touch one or two methods each.

agents/file_actions.ts:
  - rename unlink is now ENOENT-aware; non-ENOENT errors log instead of
    silently leaving the old file on disk next to the new one (which
    `listAgents` would then surface twice).
  - `deleteAgent.preDelete` bare-catch now warn-logs the underlying
    error so the audit-without-previousState case is explainable.

agents/file_utils.ts:
  `resolveHistoryDir` now runs the agent name through `validateAgentName`
  and uses nested `safeJoinWithinDir` calls. The standalone history
  callers (listHistory / readHistoryEntry / restoreFromHistory) reach
  this helper BEFORE any path validator runs, so a crafted name
  containing `..` would otherwise traverse out of `agents/.history/`.

documents/generate_docx.ts:
  The `{success:false, error}` JSON branch now scrubs `result.error`
  through `sanitizeError` before throwing — matches the HTTP-error
  branch above (which already did) so an upstream body that echoes
  e.g. `Authorization: Bearer …` is redacted on both paths.

lib/file_io.ts:
  `safeJoinWithinDir` now explicitly rejects an empty `name`. Empty
  used to resolve to `dir` itself, which would let an unvalidated
  empty string from user input land on the config root.

agent_tools/web/helpers/search_pages.ts:
  Add a 15 s `AbortController` to `fetchSearch`. A hung crawler
  connection used to block the agent step indefinitely (no signal
  passed). Aligns with `query_web_context.ts`'s 10 s and
  `fetch_and_extract.ts`'s 300 s.

http.ts + lib/rate_limiter/index.ts:
  Add a `security:sse-auth` rate-limit bucket and gate `/api/sse/auth`
  on it BEFORE the Better Auth session lookup. Mirrors the
  `/api/tts-audio` pattern. Anonymous floods can no longer force a
  session-table read per request.

lib/utils/sanitize_secrets.ts:
  Add pipe-delimited self-hosted Convex admin-key patterns
  (`<INSTANCE>|<base64>`), the `--admin-key VALUE` argv form, and a bare-payload
  pattern. Mirrors the CLI `ADMIN_KEY_RE` in `reseed-all-orgs.ts` so
  the shared `UpstreamHttpError.bodySnippet` scrubber and the CLI log
  redactor stay in lockstep.

workflow_engine/action_defs/crawler/crawler_action.ts:
  `result.success===false` branches now throw `UpstreamHttpError` with
  `retryable:false` instead of a plain `Error`. The workflow retry
  layer can finally distinguish transport-level failures (already
  typed) from body-level "crawler said no".

workflow_engine/action_defs/document/helpers/apply_docx_structured.ts:
  Correct the `UpstreamHttpError` endpoint label from
  `/api/v1/apply-structured` to `/api/v1/docx/apply-structured` to
  match the actual request URL.

Recall `feedback_migration_ux.md`: the surface UX for one-shot
migrations should be (1) the user just runs their normal command,
(2) the command auto-detects the legacy state and asks for confirm
inline, (3) `--yes` skips the prompt for CI, and (4) no per-migration
`tale migrate <name>` subcommand. Previously `tale start`, `tale
deploy`, and `tale update` hard-failed on the pre-org-first flat
layout and pointed the operator at `tale migrate config-layout`
followed by another `tale deploy --override-all -y` — three commands
where one prompt should suffice.

New `legacy-layout-preflight.ts` wraps detect → confirm → migrate as
a single entry point and is called from start / deploy / update. TTY
+ no `--yes` prompts; non-TTY + no `--yes` throws a clear actionable
error; `--yes` migrates without prompting. `update` runs the preflight
BEFORE writing new `default/<domain>/` files so a legacy project no
longer ends up half-migrated and dead-locked on the next `tale start`.

`tale start` gains a `--yes / -y` flag (parallels `tale deploy`).

`tale migrate` is deprecated: the no-flag form prints a clear redirect
to `tale start --yes` and exits non-zero; `--cleanup-old` stays
available as the optional post-migration housekeeping step that
byte-for-byte verifies new paths and removes the rollback-insurance
copies. The forward migration is now automatic.

Same commit folds in two CLI safety items that share the
`reseed-all-orgs.ts` / `deploy.ts` surface:

reseed-all-orgs.ts:
  - Move the `if (dryRun)` gate ABOVE the destructive `confirm()` and
    `findPlatformContainer()` lookup. `--dry-run` was both prompting
    the operator and hard-throwing on hosts without a running platform
    container, defeating its preview-only point.
  - Broaden `ADMIN_KEY_RE` to also catch the hyphenated argv form
    (`--admin-key <value>`) that the bash heredoc itself contains.
    A future Convex CLI line echoing its argv would otherwise slip
    the secret past the redactor.

deploy.ts:
  chown failure after a host push is now a hard error instead of a
  warning. Previously `Overrode N orgs` printed green over a root-owned
  volume that the app user couldn't write to, sending operators into a
  debugging maze when later writes silently failed inside the container.
…orce-seed

`scripts/2026-03-28-migrate-convex-data.sh:108` — drop the trailing
`|| true` on the `cp -rn` step. The empty-glob edge case is already
covered by the per-dir empty check above, so `|| true` was only
swallowing real I/O errors (disk-full, EACCES). The previous shape
exited 0 reporting "N new items copied" while the migration was
silently incomplete; now a real `cp` failure aborts via `set -e`.

`services/convex/docker-entrypoint.sh::atomic_cp_bundle` —
`FORCE_SEED=true` unconditionally `rm -rf`s the destination bundle
before rename, which loses any operator-added files inside an
integration or skill bundle (custom_state.json, scripts the operator
dropped in, …). Take a timestamped snapshot to
`<dest>.history/<ts>/` before the rm so the pre-force tree is
recoverable. Snapshot failure is best-effort and doesn't block the seed.

`services/convex/docker-entrypoint.sh` workflow-seed loop — convert the
`find | while` pipeline (which ran the loop body in a subshell where
`log_error` could not bump an aggregate counter) to process
substitution. Track `workflows_failed` in the parent shell and
`log_warn` an aggregate count at the end. Disk-full mid-seed now
surfaces visibly instead of returning a clean exit code with a
half-seeded `default/workflows/`.
… marker drop

rag_service.initialize:
  Reset `_shutting_down` and clear the module-level shutdown event at
  the top of `initialize()`. Re-init after a prior `shutdown()` (tests,
  supervisor restart with the same singleton) was leaving the
  "shutting down" state set, so every subsequent `_ensure_org_clients`
  call permanently raised `RuntimeError("RagService is shutting down")`
  despite the pool being back. Also delete the dead `embedding_service`
  property (zero external readers confirmed by grep).
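
  The re-init hazard can be illustrated with a stripped-down stand-in.
  `RagServiceSketch` is hypothetical, and it keeps the event on the
  instance for brevity where the real service uses a module-level one.

```python
import asyncio


class RagServiceSketch:
    """Toy singleton showing why initialize() must undo shutdown() state."""

    def __init__(self) -> None:
        self._shutting_down = False
        self._shutdown_event = asyncio.Event()

    def initialize(self) -> None:
        # Reset first: a previous shutdown() may have left both flags set.
        self._shutting_down = False
        self._shutdown_event.clear()

    def shutdown(self) -> None:
        self._shutting_down = True
        self._shutdown_event.set()

    def ensure_org_clients(self, org_slug: str) -> str:
        if self._shutting_down:
            raise RuntimeError("RagService is shutting down")
        return f"clients:{org_slug}"
```

  Without the two reset lines in `initialize()`, every call after a
  shutdown-then-initialize cycle would keep raising.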

crawler/org_context.py:
  Switch the X-Tale-Org regex check from `re.match` to `re.fullmatch`.
  Python's `$` accepts a trailing `\n`, so `match()` silently accepted
  CRLF-smuggled slugs like `"acme\n"`. The RAG-side `auth.py` already
  uses `fullmatch`; mirror that semantic here.
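
  A short demonstration of the difference follows. The pattern here is
  a hypothetical stand-in; the real slug regex is not reproduced in
  this log.

```python
import re

# Hypothetical slug pattern with explicit anchors, for illustration only.
ORG_SLUG_RE = re.compile(r"^[a-z0-9][a-z0-9-]{0,63}$")


def accepts_with_match(value: str) -> bool:
    # Without re.MULTILINE, `$` matches at the very end of the string and
    # also just before a single trailing newline, so "acme\n" passes.
    return ORG_SLUG_RE.match(value) is not None


def accepts_with_fullmatch(value: str) -> bool:
    # fullmatch() requires the match to span the entire string, so the
    # trailing newline is no longer tolerated.
    return ORG_SLUG_RE.fullmatch(value) is not None


print(accepts_with_match("acme\n"))      # True
print(accepts_with_fullmatch("acme\n"))  # False
print(accepts_with_fullmatch("acme"))    # True
```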

crawler/services/vision/openai_client.py:
  - LLM-failure path no longer prepends a `[LLM_EXTRACTION_FAILED:...]`
    marker + raw chunk text to the returned page content. The marker
    string was flowing through embeddings → BM25 index → search hits
    as user-visible content, and the raw fallback text was poisoning
    relevance since the LLM step's structural extraction was missing.
    Drop the chunk entirely on failure; the error log is the
    operator-visible signal.
  - Track outstanding `_safe_close_client` tasks in
    `_PENDING_CLOSE_TASKS` and expose `drain_pending_close_tasks()` so
    lifespan shutdown can cancel + await them. Previously evicted /
    rotated clients sat in a 300 s sleep that the event loop closed
    underneath at shutdown, leaking the httpx connection pool and
    producing "Event loop is closed" tracebacks.

crawler/main.py:
  Lifespan teardown calls `drain_pending_close_tasks()` (bounded by 10s
  timeout) right after the per-org cache drain so the new pending-task
  set is flushed before the event loop exits.
@larryro larryro merged commit 79d2b49 into main May 29, 2026
33 of 34 checks passed
@larryro larryro deleted the refactor/uniform-org-first-config-layout branch May 29, 2026 04:13