Multi-tenancy seam: tenant columns, resolver, quotas, sharded ingest #3

Merged: gaurav0107 merged 11 commits into main from feat/multitenancy-seam on Jun 7, 2026

@gaurav0107

Collaborator

Implements the multi-tenancy spec (docs/multi-tenancy-spec.md) end-to-end. Cloud-only deployment per the office-hours decisions; on-prem / BYOC deferred.

Summary

  • Schema: postgres 0024_multitenancy_seam.sql (plan/plan_meter_limit/quota_period/org.retention_days), ClickHouse 0006_tenant_columns.sql (8 trace tables swap to (org_id, project_id, …) ORDER BY via CREATE-INSERT-RENAME), ClickHouse 0007_audit_log.sql (audit_log + audit_reconciliation_gap).
  • New shared module services/_shared/tenant/ (tracebility-tenant): TenantContext, Redis-cached Resolver with pubsub invalidation, GCRA RateLimiter, QuotaMeter, ShardRouter (blake2b mod 16), AuditWriter, EvalConcurrency semaphore, plus the quota + audit reconciler loops.
  • ingest-api: auth chains through the resolver, every envelope stamps (org_id, workspace_id, project_id), enqueue routes to tracebility:ingest:v1:{0..15} keyed by hash(org_id). Per-plan rate limit (429) and per-org hard quota (402) at the edge.
  • ingest-worker: writer threads tenant tuple into all CH row inserts; consumer reads from N shards round-robin and dual-reads the legacy stream during cutover; per-org consume-time quota filter (XACK+drop, never whole-shard stop) — the consume-time-filter refinement of spec §7.2.
  • api: central ScopedClickHouse auto-injects `and org_id = {org_id:UUID}` into every CH WHERE; canary router runs_query migrated; workspace switcher (/v1/me/workspaces); audit egress dual-writes (postgres + CH); admin endpoints (/v1/admin/orgs/{id}/quotas, /audit, /audit/gaps).
  • Cutover doc docs/multi-tenancy-cutover.md — sequencing, verification XLEN commands, the backlog-skew alert that's the trigger for the deferred weighted_map TODO.
  • Property test statically asserts no router emits a tenant-aware CH query without org_id in WHERE (or ScopedClickHouse in scope, which auto-injects). 14 routers still on the raw ch.query() path are tracked in _PENDING_ROUTERS as remaining work; they all scope by project_id (FK-bound to org_id) so they're not security holes today, just no partition pruning.

Phase ↔ commit map

| Phase | Commit |
| --- | --- |
| 1: postgres 0024 | 7dd133d |
| 2: clickhouse 0006 | 283bb70 |
| 3: clickhouse 0007 | a59b07e |
| 4: shared tenant module (+ phase 8 eval semaphore seam) | e9593fb |
| 5: ingest-api | 03ebd14 |
| 6: ingest-worker | b1de7c2 |
| 7+10: api + admin endpoints | ce927cb |
| 9: cutover doc | 306bb00 |
| 11: reconcilers | 5b367bd |
| 12: integration + property tests | 6db1cdb |

Test plan

  • `uv run pytest services/`: 46 passed against the live docker-compose stack (postgres, redis, clickhouse).
  • Apply migrations to staging (0024_multitenancy_seam.sql first, then CH 0006, then CH 0007). Both CH migrations were validated against the local container during development.
  • Deploy ingest-worker first (dual-reads legacy + sharded streams), then ingest-api, per docs/multi-tenancy-cutover.md.
  • Watch XLEN tracebility:ingest:v1 decay to 0; flip TRACEBILITY_INGEST_DUAL_READ_LEGACY=false and roll the worker.
  • Smoke-test: send a run through POST /v1/runs with a real api_key, confirm row in tracebility.run has the right org_id + workspace_id.
  • Smoke-test: hit GET /v1/admin/orgs/{id}/quotas from an org-admin session, confirm meter bars render.
  • Schedule the quota reconciler (60s) and audit reconciler (24h) in the operator service.

Deliberately deferred (each has a reactivation trigger in the spec)

  • 14 routers still on the raw ch.query() path. Each is a 5-line _assert_project_access → resolve_project_scope swap; tracked in services/api/tests/test_property_org_id_in_where.py::_PENDING_ROUTERS.
  • Web UI for /v1/admin/orgs/{id}/quotas and /audit (frontend not in this branch).
  • eval-orchestrator service (the seam exists in tracebility_tenant.eval_concurrency; the service itself doesn't exist yet on main).
  • Weighted shard map — TODO documented in services/_shared/tenant/tracebility_tenant/shard.py and the cutover runbook. Triggered by sustained backlog skew on a single shard.
  • License JWT / on-prem self_hosted mode: collapsed into a future focused project per office-hours decision #7.

🤖 Generated with Claude Code

gaurav0107 and others added 11 commits June 7, 2026 13:48
Adds plan + plan_meter_limit + quota_period reference tables, org.retention_days,
and project_tenant_view (consumed by the ClickHouse 0006 backfill dictionary).

Numbered 0024 because litellm-dispatch already claimed 0023.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds (org_id, workspace_id) to run, span, eval_score, eval_aggregate,
replay_capture, replay_run, dataset_item, billing_meter. ORDER BY rewritten
to put org_id first so per-tenant reads partition-prune to near-zero work.

Backfill via project_tenant_dict (postgres-backed). _v1 tables retained one
release as the rollback target. Swap pattern is CREATE-INSERT-RENAME because
ClickHouse cannot reorder primary-key columns in place.

Spec §6.2.1 named 5 tables; we add eval_aggregate + replay_run for
consistency (every trace row carries the tenant tuple per goal #1).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ClickHouse audit_log per multi-tenancy spec §5.8: monthly partitions on
event_time, ORDER BY (org_id, event_time, event_type), 730-day TTL. Bloom
filters on actor_user_id and target_id for the admin UI's per-actor /
per-target lookups.

audit_reconciliation_gap is where the daily reconciler logs detected
postgres↔ClickHouse divergences for the auditor evidence pack.

Postgres audit_log (migration 0005) becomes read-only after this lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
services/_shared/tenant — new uv workspace member that the four services
import from for non-bypassable tenant enforcement:

- TenantContext: frozen (org, workspace, project, plan, scopes) tuple
- Resolver: api_key public_id -> TenantContext, Redis-cached, pubsub-invalidated
- RateLimiter: GCRA Lua (one round-trip, atomic) per ingest key
- QuotaMeter: per-org / period / meter Redis counter + reset hook
- ShardRouter: blake2b(org_id) % 16 with stream key helpers
- AuditWriter: append-only writer to ClickHouse audit_log
- EvalConcurrency: per-org eval semaphore in Redis Lua (the §5.6 cap)

19 tests against the live docker-compose redis (rate-limit GCRA, quota
counter, semaphore behavior, shard hash determinism, audit roundtrip).
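
For readers unfamiliar with GCRA, the RateLimiter bullet above can be illustrated with a minimal in-memory sketch. This is not the module's code: the real limiter is described as a Redis Lua script, and the class name, parameters, and the choice to pass `now` explicitly are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GcraLimiter:
    """In-memory GCRA sketch (hypothetical; the PR's version runs as Redis Lua)."""

    rate_per_sec: float  # sustained requests per second
    burst: int           # requests admitted back-to-back from idle
    _tat: dict[str, float] = field(default_factory=dict)  # theoretical arrival time per key

    def allow(self, key: str, now: float) -> bool:
        interval = 1.0 / self.rate_per_sec        # emission interval between requests
        tolerance = interval * (self.burst - 1)   # how far ahead of `now` the TAT may run
        tat = max(self._tat.get(key, now), now)
        if tat - now > tolerance:
            return False                          # too early: reject, state unchanged
        self._tat[key] = tat + interval           # admit and push the TAT forward
        return True
```

The whole decision needs only one stored value per key, which is what makes a single atomic round-trip practical.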

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…queue

Auth path now resolves api_key.public_id through the Redis-cached Resolver
(returns TenantContext including workspace_id + plan + scopes). All four
ingest routers (runs, langsmith_shim, otel, multipart) chain
enforce_quota -> enforce_rate_limit -> require_ingest_key, stamp org_id /
workspace_id on every envelope, and increment the optimistic quota
counter after enqueue.

Enqueue routes via hash(org_id) % 16 to per-shard Redis streams
(tracebility:ingest:v1:{0..15}). Disk-buffer fallback survives the
shard cutover by encoding the shard index in the spill filename.

Test: in-process envelope routing lands on the right shard, never the
legacy stream (services/ingest-api/tests/test_enqueue_shard.py).
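
The routing rule above can be sketched as follows. Only blake2b, the modulus of 16, and the `tracebility:ingest:v1:{n}` key shape come from this PR; the digest size, the choice to hash the raw UUID bytes, the byte order, and the function names are assumptions, so the shard numbers it produces need not match the real ShardRouter.

```python
import hashlib
import uuid

NUM_SHARDS = 16
STREAM_PREFIX = "tracebility:ingest:v1"


def shard_for(org_id: uuid.UUID, num_shards: int = NUM_SHARDS) -> int:
    """Map an org to a shard index with a process-independent hash."""
    # blake2b gives the same value in every process, unlike Python's built-in
    # hash() on strings, which is salted per interpreter run.
    digest = hashlib.blake2b(org_id.bytes, digest_size=8).digest()  # 8 bytes: assumption
    return int.from_bytes(digest, "big") % num_shards


def stream_key(org_id: uuid.UUID) -> str:
    """Per-shard stream key, never the unsuffixed legacy stream."""
    return f"{STREAM_PREFIX}:{shard_for(org_id)}"
```

Because the key is derived from org_id alone, all envelopes for one org land on one stream.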

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…me-time quota filter

Writer threads (org_id, workspace_id, project_id) into _RUN_COLUMNS,
_SPAN_COLUMNS, _REPLAY_CAPTURE_COLUMNS so every CH row carries the
tenant tuple. Pre-Phase-5 envelopes (no tenant fields) get a sentinel
UUID and a structured warning so the worker never stalls during the
rolling cutover.
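
A rough sketch of the sentinel fallback described above. The sentinel value, the function name, the logger name, and the assumption that legacy envelopes still carry project_id are all hypothetical; the PR does not state them.

```python
import logging
import uuid

log = logging.getLogger("ingest_worker.writer")  # hypothetical logger name

# Hypothetical sentinel: the PR says "a sentinel UUID" without giving its value.
SENTINEL_TENANT_ID = uuid.UUID(int=0)


def tenant_tuple(envelope: dict) -> tuple[uuid.UUID, uuid.UUID, uuid.UUID]:
    """Return (org_id, workspace_id, project_id) for a row insert."""
    project_id = uuid.UUID(str(envelope["project_id"]))  # assumed always present
    raw_org = envelope.get("org_id")
    raw_ws = envelope.get("workspace_id")
    if raw_org is None or raw_ws is None:
        # Legacy envelope: warn with structured context, but keep writing.
        log.warning("envelope missing tenant fields", extra={"project_id": str(project_id)})
        return SENTINEL_TENANT_ID, SENTINEL_TENANT_ID, project_id
    return uuid.UUID(str(raw_org)), uuid.UUID(str(raw_ws)), project_id
```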

Consumer reads from N sharded streams round-robin (XREADGROUP with N
keys, per-stream count = batch_size / N). dual_read_legacy=True keeps
draining tracebility:ingest:v1 until the cutover finishes.

Per-org consume-time quota filter: when the reconciler sets
quota:over:<org>:<period>, the worker XACKs and drops envelopes for
that org only, NOT the whole shard. Co-tenants on the same shard are
unaffected — the consume-time-filter refinement of spec §7.2.
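
The per-org filter can be pictured as a partition of each consumed batch. In this sketch the flag key format follows the `quota:over:<org>:<period>` pattern from the text, while the function names, the envelope shape, and the use of an in-memory set in place of Redis lookups are assumptions.

```python
from typing import Iterable


def over_quota_key(org_id: str, period: str) -> str:
    """Flag key maintained by the quota reconciler."""
    return f"quota:over:{org_id}:{period}"


def split_batch(
    envelopes: Iterable[dict], over_flags: set[str], period: str
) -> tuple[list[dict], list[dict]]:
    """Split one shard's batch into envelopes to write and envelopes to drop.

    Both groups would be XACKed so the shard keeps moving; only `keep` is written.
    """
    keep: list[dict] = []
    drop: list[dict] = []
    for env in envelopes:
        target = drop if over_quota_key(env["org_id"], period) in over_flags else keep
        target.append(env)
    return keep, drop
```

Filtering per envelope rather than pausing the stream is what keeps co-tenants on the same shard flowing.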

Tests: row-shape, sentinel fallback, and a live ClickHouse round-trip
(insert envelope -> SELECT, verify org_id/workspace_id columns match).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…endpoints

Central tenant scope for the api service:
- TenantScope = (org_id, workspace_id, project_id, principal) resolved
  per-request via resolve_project_scope (replaces the per-router
  _assert_project_access pattern).
- ScopedClickHouse wraps ClickHouseQuery with a non-bypassable org_id
  injector — auto-appends `and org_id = {org_id:UUID}` to any WHERE
  chain, raises if SQL has no WHERE at all.
- Property test (test_property_org_id_in_where.py) walks router SQL
  literals statically; either ScopedClickHouse is in scope or the SQL
  must mention org_id explicitly. _PENDING_ROUTERS lists the 14
  routers still on the raw ch.query() path (not security holes today —
  they all scope by project_id which is FK-bound to the org — but the
  spec wants org_id explicit so partition pruning takes the fast path).

Routers landed:
- runs_query: migrated (canary; the `final` + complex group-by patterns
  the helper has to handle).
- workspaces_me: GET /v1/me/workspaces + POST /v1/me/workspaces/{id}/select
  for the top-nav switcher.
- admin_quotas: GET /v1/admin/orgs/{id}/quotas reads from
  quota_period (postgres truth, not the Redis hot counter).
- admin_audit: GET /v1/admin/orgs/{id}/audit unions postgres + CH
  audit feeds; GET .../audit/gaps reads audit_reconciliation_gap.

Audit egress: api/audit.py grows record_egress() that dual-writes to
postgres audit_log AND ClickHouse audit_log. Identity events stay
postgres-only; the daily reconciler catches gaps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sequencing (deploy-worker-first), verification XLEN commands, the
backlog-skew alert that's the trigger for the weighted_map TODO, DLQ
layout, and the rollback story. Companion to multi-tenancy-spec.md
§5.6 and §9 step 9.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reconciler_quota: every 60s, SUM authoritative ClickHouse billing_meter
into postgres quota_period.used_amount, reset Redis hot counter to the
authoritative value, and maintain quota:over:<org>:<period> for the
ingest-worker's consume-time filter. Loop is crash-resistant — a
failed tick logs and retries; the worst case is the optimistic-cap
window stretching from 60s to 120s.
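
One tick of the quota reconciler, reduced to pure data so the three outputs are visible. The function signature, the in-memory inputs standing in for ClickHouse, postgres, and Redis, and the `>=` comparison against the limit are assumptions, not the actual reconcile_once.

```python
from collections import defaultdict


def reconcile_once(
    billing_rows: list[tuple[str, int]],  # (org_id, amount) rows, standing in for billing_meter
    limits: dict[str, int],               # per-org hard limit for the period
    period: str,
) -> tuple[dict[str, int], dict[str, int], set[str]]:
    totals: dict[str, int] = defaultdict(int)
    for org_id, amount in billing_rows:
        totals[org_id] += amount           # SUM of the authoritative meter

    used_amount = dict(totals)             # would be written to quota_period.used_amount
    hot_counters = dict(used_amount)       # hot counter reset to the authoritative value
    over_flags = {
        f"quota:over:{org}:{period}"       # flag read by the consume-time filter
        for org, total in used_amount.items()
        if org in limits and total >= limits[org]  # `>=` is an assumption
    }
    return used_amount, hot_counters, over_flags
```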

reconciler_audit: daily, scan postgres api_key.created_at /
api_key.revoked_at and verify each has a matching ClickHouse
audit_log row. Misses go to audit_reconciliation_gap with the
detection_run UUID for the auditor evidence pack.

Test: live round-trip writes 2M-span billing_meter row, runs
reconcile_once, confirms quota_period row + Redis counter +
quota:over flag all line up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
integration/test_cross_org_isolation.py: writes one run per org against
the live ClickHouse, verifies queries scoped to org A return only A's
rows and never B's (and the unfiltered query returns both — the org_id
filter is the load-bearing thing). Also asserts workspace-A-vs-B
isolation inside one org. Cleans up its own state.

test_property_org_id_in_where.py: AST-walks every router's SQL string
literals. Any string that reads from a tenant-aware ClickHouse table
(detected via {name:Type} placeholders + `final` + toDateTime64) must
contain `org_id =`, UNLESS the file uses ScopedClickHouse (which
auto-injects). Tracks the still-pending router migrations so it shows
up as remaining work in CI but doesn't block the gate.
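
A simplified sketch of the static check described above. The detection heuristic here (any one of the three markers is enough) and the function name are assumptions; the real test may combine the markers differently and also consults _PENDING_ROUTERS, which this sketch omits.

```python
import ast
import re

_PLACEHOLDER = re.compile(r"\{\w+:\w+\}")  # ClickHouse-style {name:Type} parameter


def find_unscoped_sql(source: str) -> list[str]:
    """Return SQL-looking string literals in `source` that lack an org_id filter."""
    if "ScopedClickHouse" in source:
        return []  # the wrapper injects the filter, so the file is exempt
    offenders = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Constant) and isinstance(node.value, str)):
            continue
        text = node.value
        looks_tenant_aware = (
            _PLACEHOLDER.search(text) is not None
            or " final" in text
            or "toDateTime64" in text
        )
        if looks_tenant_aware and "org_id =" not in text:
            offenders.append(text)
    return offenders
```

Working on the AST rather than on raw file text means comments that happen to mention SQL are not scanned.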

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
main brought in litellm 1.52.16 + a chunk of new transitive deps; this
branch's tenant module pulls in aiohttp via clickhouse-connect[async].
Single regenerated lockfile holds them all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gaurav0107 force-pushed the feat/multitenancy-seam branch from 6db1cdb to 41a701e on June 7, 2026 08:20
@gaurav0107 merged commit ac72dcd into main on Jun 7, 2026
0 of 3 checks passed
gaurav0107 added a commit that referenced this pull request Jun 7, 2026
…est-api (#9)

Two crash-loops on the live deploy after PR #8 unblocked the rollout:

  ingest-api: RuntimeError: TRACEBILITY_CLICKHOUSE_URL is required
  api:        clickhouse_connect.driver.exceptions.DatabaseError:
              Code: 516. Authentication failed: ... default user

Root cause (single pattern, two symptoms):

PR #3 (multi-tenancy) added new ClickHouse-touching code paths in api
and ingest-api but invented a credential-passing convention the helm
chart didn't ship — splitting URL / USER / PASSWORD / DATABASE into
four env vars. The chart's existing tracebility-clickhouse secret has
ONE key (``url``) holding the full DSN with embedded credentials,
matching the postgres / redis pattern.

So in production:
- ingest-api template never set TRACEBILITY_CLICKHOUSE_URL at all
  → config.load() raised on the missing required env var.
- api template DID set TRACEBILITY_CLICKHOUSE_URL, but the new code
  passed username='default' / password='' as kwargs to
  clickhouse_connect.get_async_client(dsn=URL, username=...).
  Those kwargs override the DSN's embedded credentials, so it tried
  to auth as 'default' (which doesn't exist in the cluster).

Fix:

- AuditWriter.from_url(url) now accepts only the DSN; no override
  kwargs. The DSN's embedded credentials carry through.
- reconciler_loop() in both reconcilers: same simplification.
- Settings on api + ingest-api: drop clickhouse_user /
  clickhouse_password / clickhouse_database. Only clickhouse_url.
- ingest-api deployment template: add the missing
  TRACEBILITY_CLICKHOUSE_URL env-from-secret line.
- api deployment template: also pass TRACEBILITY_REDIS_URL (PR #3
  added optional Redis support to the api for api-key invalidation
  + reconciler hooks).
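
To illustrate why a single DSN is enough, the sketch below pulls host, port, user, password, and database out of one URL using only the standard library. It does not call clickhouse_connect; the function name and the example DSN in the usage note are made up for illustration.

```python
from urllib.parse import unquote, urlsplit


def parse_dsn(dsn: str) -> dict:
    """Split a full DSN into the fields that were previously four env vars."""
    parts = urlsplit(dsn)
    return {
        "host": parts.hostname,
        "port": parts.port,
        # Credentials may be percent-encoded inside a URL, so decode them.
        "username": unquote(parts.username) if parts.username else None,
        "password": unquote(parts.password) if parts.password else None,
        "database": parts.path.lstrip("/") or None,
    }
```

For example, `parse_dsn("https://ingest:s3cret@ch.example.internal:8443/tracebility")` yields the user `ingest` and database `tracebility` with no extra settings, which is the property the one-key secret relies on.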

Verified locally:
- ``uv run pytest services/`` → 68 passed.
- ``helm template`` confirms api + ingest-api templates render with
  PG_DSN + REDIS_URL + CLICKHOUSE_URL all sourced from secrets.
- ``ruff check`` and ``ruff format --check`` clean.

Signed-off-by: Gaurav Dubey <gauravdubey0107@gmail.com>
Signed-off-by: gaurav0107 <gauravdubey0107@gmail.com>