Multi-tenancy seam: tenant columns, resolver, quotas, sharded ingest #3
Merged
Adds plan + plan_meter_limit + quota_period reference tables, org.retention_days, and project_tenant_view (consumed by the ClickHouse 0006 backfill dictionary). Numbered 0024 because litellm-dispatch already claimed 0023.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds (org_id, workspace_id) to run, span, eval_score, eval_aggregate, replay_capture, replay_run, dataset_item, billing_meter. ORDER BY rewritten to put org_id first so per-tenant reads partition-prune to near-zero work. Backfill via project_tenant_dict (postgres-backed). _v1 tables retained one release as the rollback target.
Swap pattern is CREATE-INSERT-RENAME because ClickHouse cannot reorder primary-key columns in place. Spec §6.2.1 named 5 tables; we add eval_aggregate + replay_run for consistency (every trace row carries the tenant tuple per goal #1).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ClickHouse audit_log per multi-tenancy spec §5.8: monthly partitions on event_time, ORDER BY (org_id, event_time, event_type), 730-day TTL. Bloom filters on actor_user_id and target_id for the admin UI's per-actor / per-target lookups.
audit_reconciliation_gap is where the daily reconciler logs detected postgres↔ClickHouse divergences for the auditor evidence pack. Postgres audit_log (migration 0005) becomes read-only after this lands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
services/_shared/tenant is a new uv workspace member that the four services import from for non-bypassable tenant enforcement:
- TenantContext: frozen (org, workspace, project, plan, scopes) tuple
- Resolver: api_key public_id -> TenantContext, Redis-cached, pubsub-invalidated
- RateLimiter: GCRA Lua (one round-trip, atomic) per ingest key
- QuotaMeter: per-org / period / meter Redis counter + reset hook
- ShardRouter: blake2b(org_id) % 16 with stream key helpers
- AuditWriter: append-only writer to ClickHouse audit_log
- EvalConcurrency: per-org eval semaphore in Redis Lua (the §5.6 cap)
19 tests against the live docker-compose redis (rate-limit GCRA, quota counter, semaphore behavior, shard hash determinism, audit roundtrip).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
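For readers unfamiliar with GCRA, the RateLimiter entry above can be pictured with a small in-memory sketch. The PR runs the algorithm as a Redis Lua script; the Python class below is not that script, and the names `GcraLimiter`, `rate_per_sec`, and `burst` are hypothetical, chosen only to illustrate the "theoretical arrival time" bookkeeping.

```python
from dataclasses import dataclass, field


@dataclass
class GcraLimiter:
    """In-memory GCRA sketch; the real limiter keeps this state in Redis."""

    rate_per_sec: float  # sustained requests per second (assumed parameter)
    burst: int           # requests allowed back-to-back (assumed parameter)
    _tat: dict[str, float] = field(default_factory=dict)

    def allow(self, key: str, now: float) -> bool:
        interval = 1.0 / self.rate_per_sec        # emission interval T
        tolerance = interval * (self.burst - 1)   # how far TAT may run ahead
        # Theoretical arrival time: never earlier than the current time.
        tat = max(self._tat.get(key, now), now)
        if tat - now > tolerance:
            return False                          # over the limit, state unchanged
        self._tat[key] = tat + interval           # admit and push TAT forward
        return True


limiter = GcraLimiter(rate_per_sec=1.0, burst=3)
burst_results = [limiter.allow("key-a", now=0.0) for _ in range(4)]
```

With these settings the first three calls at `now=0.0` are admitted and the fourth is rejected; one second later a single further request fits. Because each key stores one number, the check and update map naturally onto a single atomic script call, which is consistent with the "one round-trip, atomic" note in the commit.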
…queue
Auth path now resolves api_key.public_id through the Redis-cached Resolver
(returns TenantContext including workspace_id + plan + scopes). All four
ingest routers (runs, langsmith_shim, otel, multipart) chain
enforce_quota -> enforce_rate_limit -> require_ingest_key, stamp org_id /
workspace_id on every envelope, and increment the optimistic quota
counter after enqueue.
Enqueue routes via hash(org_id) % 16 to per-shard Redis streams
(tracebility:ingest:v1:{0..15}). Disk-buffer fallback survives the
shard cutover by encoding the shard index in the spill filename.
Test: in-process envelope routing lands on the right shard, never the
legacy stream (services/ingest-api/tests/test_enqueue_shard.py).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
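The shard routing described in this commit can be sketched as below. This is a minimal illustration, not the PR's `ShardRouter`: the function names, the 8-byte digest size, hashing the UUID's raw bytes, and the big-endian conversion are all assumptions. Only the stream prefix, the shard count of 16, and the use of blake2b come from the commit messages.

```python
import hashlib
import uuid

NUM_SHARDS = 16
STREAM_PREFIX = "tracebility:ingest:v1"


def shard_for(org_id: uuid.UUID, num_shards: int = NUM_SHARDS) -> int:
    """Map an org to a shard index with a process-independent hash."""
    # Assumption: hash the 16 raw UUID bytes with an 8-byte blake2b digest.
    digest = hashlib.blake2b(org_id.bytes, digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_shards


def stream_key(org_id: uuid.UUID) -> str:
    """Return the per-shard stream name, e.g. tracebility:ingest:v1:7."""
    return f"{STREAM_PREFIX}:{shard_for(org_id)}"


example_org = uuid.UUID(int=1)
example_key = stream_key(example_org)
```

A cryptographic digest is used here rather than Python's built-in `hash()` because string hashing in Python is salted per process, so producers and consumers in different processes would otherwise disagree on the shard.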
…me-time quota filter
Writer threads (org_id, workspace_id, project_id) into _RUN_COLUMNS, _SPAN_COLUMNS, _REPLAY_CAPTURE_COLUMNS so every CH row carries the tenant tuple. Pre-Phase-5 envelopes (no tenant fields) get a sentinel UUID and a structured warning so the worker never stalls during the rolling cutover.
Consumer reads from N sharded streams round-robin (XREADGROUP with N keys, per-stream count = batch_size / N). dual_read_legacy=True keeps draining tracebility:ingest:v1 until the cutover finishes.
Per-org consume-time quota filter: when the reconciler sets quota:over:<org>:<period>, the worker XACKs and drops envelopes for that org only, NOT the whole shard. Co-tenants on the same shard are unaffected; this is the consume-time-filter refinement of spec §7.2.
Tests: row-shape, sentinel fallback, and a live ClickHouse round-trip (insert envelope -> SELECT, verify org_id/workspace_id columns match).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
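The sentinel fallback and the per-org consume-time filter can be pictured with the following sketch. It is an illustration under stated assumptions: the sentinel value (the all-zero UUID), the helper names, and the plain `set` standing in for the `quota:over:<org>:<period>` Redis flags are all hypothetical.

```python
import logging
import uuid

log = logging.getLogger("ingest_worker_sketch")

# Hypothetical sentinel; the commit does not state the actual value.
SENTINEL_TENANT = str(uuid.UUID(int=0))


def stamp_tenant(envelope: dict) -> dict:
    """Return a copy with missing tenant fields filled by the sentinel."""
    stamped = dict(envelope)
    missing = [f for f in ("org_id", "workspace_id") if not stamped.get(f)]
    for name in missing:
        stamped[name] = SENTINEL_TENANT
    if missing:
        # Warn instead of raising so the worker keeps draining the stream.
        log.warning("envelope missing tenant fields", extra={"missing_fields": missing})
    return stamped


def split_by_quota(envelopes: list[dict], over_quota_orgs: set[str]):
    """Split a batch into (keep, drop); only over-quota orgs are dropped."""
    keep, drop = [], []
    for env in map(stamp_tenant, envelopes):
        (drop if env["org_id"] in over_quota_orgs else keep).append(env)
    return keep, drop


batch = [
    {"org_id": "org-a", "workspace_id": "w1"},
    {"org_id": "org-b", "workspace_id": "w2"},
    {"run": 1},  # pre-Phase-5 envelope with no tenant fields
]
keep, drop = split_by_quota(batch, over_quota_orgs={"org-b"})
```

In this sketch the dropped envelopes would still be acknowledged (XACK) by the caller, so an over-quota org does not hold back co-tenants on the same shard.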
…endpoints
Central tenant scope for the api service:
- TenantScope = (org_id, workspace_id, project_id, principal) resolved
per-request via resolve_project_scope (replaces the per-router
_assert_project_access pattern).
- ScopedClickHouse wraps ClickHouseQuery with a non-bypassable org_id
injector — auto-appends `and org_id = {org_id:UUID}` to any WHERE
chain, raises if SQL has no WHERE at all.
- Property test (test_property_org_id_in_where.py) walks router SQL
literals statically; either ScopedClickHouse is in scope or the SQL
must mention org_id explicitly. _PENDING_ROUTERS lists the 14
routers still on the raw ch.query() path (not security holes today —
they all scope by project_id which is FK-bound to the org — but the
spec wants org_id explicit so partition pruning takes the fast path).
Routers landed:
- runs_query: migrated (canary; the `final` + complex group-by patterns
the helper has to handle).
- workspaces_me: GET /v1/me/workspaces + POST /v1/me/workspaces/{id}/select
for the top-nav switcher.
- admin_quotas: GET /v1/admin/orgs/{id}/quotas reads from
quota_period (postgres truth, not the Redis hot counter).
- admin_audit: GET /v1/admin/orgs/{id}/audit unions postgres + CH
audit feeds; GET .../audit/gaps reads audit_reconciliation_gap.
Audit egress: api/audit.py grows record_egress() that dual-writes to
postgres audit_log AND ClickHouse audit_log. Identity events stay
postgres-only; the daily reconciler catches gaps.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
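The org_id injector behind ScopedClickHouse might look roughly like the sketch below. This is not the PR's implementation: `inject_org_filter` is a hypothetical name, and the regex-based clause detection only handles a single-level query with no subqueries or string literals containing keywords. It does show the two behaviours the commit names, appending the filter to the WHERE chain and raising when there is no WHERE.

```python
import re

ORG_FILTER = "org_id = {org_id:UUID}"

_WHERE = re.compile(r"\bwhere\b", re.IGNORECASE)
# Clauses that can follow WHERE in a simple SELECT (assumed, not exhaustive).
_CLAUSE_END = re.compile(
    r"\b(?:group\s+by|order\s+by|having|limit|settings)\b", re.IGNORECASE
)


def inject_org_filter(sql: str) -> str:
    """Append the org_id predicate to the WHERE clause of a simple query."""
    where = _WHERE.search(sql)
    if where is None:
        raise ValueError("refusing to run tenant query without a WHERE clause")
    rest = sql[where.end():]
    end = _CLAUSE_END.search(rest)
    cut = end.start() if end else len(rest)
    condition, tail = rest[:cut].strip(), rest[cut:]
    # Parenthesise the original condition so an OR cannot escape the filter.
    return f"{sql[:where.end()]} ({condition}) and {ORG_FILTER} {tail}".rstrip()


scoped = inject_org_filter(
    "select * from run where project_id = {p:UUID} order by start_time"
)
```

The parentheses are the important design choice: without them a condition such as `a = 1 or b = 2` would bind as `a = 1 or (b = 2 and org_id = ...)` and leak rows from other orgs.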
Sequencing (deploy-worker-first), verification XLEN commands, the backlog-skew alert that's the trigger for the weighted_map TODO, DLQ layout, and the rollback story. Companion to multi-tenancy-spec.md §5.6 and §9 step 9.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reconciler_quota: every 60s, SUM authoritative ClickHouse billing_meter into postgres quota_period.used_amount, reset Redis hot counter to the authoritative value, and maintain quota:over:<org>:<period> for the ingest-worker's consume-time filter. Loop is crash-resistant: a failed tick logs and retries; the worst case is the optimistic-cap window stretching from 60s to 120s.
reconciler_audit: daily, scan postgres api_key.created_at / api_key.revoked_at and verify each has a matching ClickHouse audit_log row. Misses go to audit_reconciliation_gap with the detection_run UUID for the auditor evidence pack.
Test: live round-trip writes 2M-span billing_meter row, runs reconcile_once, confirms quota_period row + Redis counter + quota:over flag all line up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
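One tick of the quota reconciler can be sketched with in-memory stand-ins for the three stores. Everything here is illustrative: the `Stores` dataclass replaces postgres `quota_period` and the Redis counter and flags, the `usage` dict replaces the ClickHouse SUM, and treating "over quota" as `used >= limit` is an assumption the commit does not spell out. The function reuses the name `reconcile_once` from the commit's test description but is not that function.

```python
from dataclasses import dataclass, field

MeterKey = tuple[str, str, str]  # (org_id, period, meter)


@dataclass
class Stores:
    quota_period: dict[MeterKey, int] = field(default_factory=dict)  # postgres stand-in
    hot_counter: dict[MeterKey, int] = field(default_factory=dict)   # Redis stand-in
    over_flags: set[str] = field(default_factory=set)                # Redis flag keys


def reconcile_once(usage: dict[MeterKey, int],
                   limits: dict[MeterKey, int],
                   stores: Stores) -> None:
    """Copy authoritative usage into both stores and refresh over-quota flags."""
    over: dict[tuple[str, str], bool] = {}
    for key, used in usage.items():
        org_id, period, _meter = key
        stores.quota_period[key] = used   # authoritative total
        stores.hot_counter[key] = used    # reset optimistic counter
        limit = limits.get(key)
        is_over = limit is not None and used >= limit  # assumed comparison
        over[(org_id, period)] = over.get((org_id, period), False) or is_over
    for (org_id, period), flagged in over.items():
        name = f"quota:over:{org_id}:{period}"
        if flagged:
            stores.over_flags.add(name)
        else:
            stores.over_flags.discard(name)


key = ("org-a", "2026-06", "spans")
stores = Stores(hot_counter={key: 5})
reconcile_once({key: 2_000_000}, {key: 1_000_000}, stores)
```

Because a tick only overwrites state with authoritative values, rerunning it after a failed tick is harmless, which fits the "logs and retries" behaviour described above.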
integration/test_cross_org_isolation.py: writes one run per org against
the live ClickHouse, verifies queries scoped to org A return only A's
rows and never B's (and the unfiltered query returns both — the org_id
filter is the load-bearing thing). Also asserts workspace-A-vs-B
isolation inside one org. Cleans up its own state.
test_property_org_id_in_where.py: AST-walks every router's SQL string
literals. Any string that reads from a tenant-aware ClickHouse table
(detected via {name:Type} placeholders + `final` + toDateTime64) must
contain `org_id =`, UNLESS the file uses ScopedClickHouse (which
auto-injects). Tracks the still-pending router migrations so it shows
up as remaining work in CI but doesn't block the gate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
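The static check described for test_property_org_id_in_where.py can be approximated with Python's `ast` module. The sketch below is an assumption-laden illustration, not the PR's test: the function names are hypothetical, and treating any one of the three markers (`{name:Type}` placeholder, `final`, `toDateTime64`) as enough to call a string tenant-aware SQL is a guess at how the detection combines them.

```python
import ast
import re

_PLACEHOLDER = re.compile(r"\{\w+:[^{}]+\}")      # e.g. {project_id:UUID}
_FINAL = re.compile(r"\bfinal\b", re.IGNORECASE)
_ORG_FILTER = re.compile(r"\borg_id\s*=")


def looks_tenant_aware(text: str) -> bool:
    """Heuristic: does this string literal look like tenant-table SQL?"""
    return bool(
        _PLACEHOLDER.search(text)
        or _FINAL.search(text)
        or "todatetime64" in text.lower()
    )


def unscoped_sql_literals(source: str) -> list[str]:
    """Return SQL-looking string literals that lack an explicit org_id filter."""
    if "ScopedClickHouse" in source:
        return []  # the wrapper injects org_id, so the file is exempt
    return [
        node.value
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Constant)
        and isinstance(node.value, str)
        and looks_tenant_aware(node.value)
        and not _ORG_FILTER.search(node.value)
    ]


bad_router = 'Q = "select * from run final where project_id = {project_id:UUID}"\n'
good_router = 'Q = "select * from run final where org_id = {org_id:UUID}"\n'
offenders = unscoped_sql_literals(bad_router)
```

A real gate would presumably skip files listed in `_PENDING_ROUTERS` and report them separately, so pending migrations stay visible in CI without failing the build.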
main brought in litellm 1.52.16 + a chunk of new transitive deps; this branch's tenant module pulls in aiohttp via clickhouse-connect[async]. Single regenerated lockfile holds them all.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6db1cdb to 41a701e
gaurav0107 added a commit that referenced this pull request Jun 7, 2026
…est-api (#9)
Two crash-loops on the live deploy after PR #8 unblocked the rollout:
ingest-api: RuntimeError: TRACEBILITY_CLICKHOUSE_URL is required
api: clickhouse_connect.driver.exceptions.DatabaseError: Code: 516. Authentication failed: ... default user
Root cause (single pattern, two symptoms): PR #3 (multi-tenancy) added new ClickHouse-touching code paths in api and ingest-api but invented a credential-passing convention the helm chart didn't ship: splitting URL / USER / PASSWORD / DATABASE into four env vars. The chart's existing tracebility-clickhouse secret has ONE key (``url``) holding the full DSN with embedded credentials, matching the postgres / redis pattern.
So in production:
- ingest-api template never set TRACEBILITY_CLICKHOUSE_URL at all → config.load() raised on the missing required env var.
- api template DID set TRACEBILITY_CLICKHOUSE_URL, but the new code passed username='default' / password='' as kwargs to clickhouse_connect.get_async_client(dsn=URL, username=...). Those kwargs override the DSN's embedded credentials, so it tried to auth as 'default' (which doesn't exist in the cluster).
Fix:
- AuditWriter.from_url(url) now accepts only the DSN; no override kwargs. The DSN's embedded credentials carry through.
- reconciler_loop() in both reconcilers: same simplification.
- Settings on api + ingest-api: drop clickhouse_user / clickhouse_password / clickhouse_database. Only clickhouse_url.
- ingest-api deployment template: add the missing TRACEBILITY_CLICKHOUSE_URL env-from-secret line.
- api deployment template: also pass TRACEBILITY_REDIS_URL (PR #3 added optional Redis support to the api for api-key invalidation + reconciler hooks).
Verified locally:
- ``uv run pytest services/`` → 68 passed.
- ``helm template`` confirms api + ingest-api templates render with PG_DSN + REDIS_URL + CLICKHOUSE_URL all sourced from secrets.
- ``ruff check`` and ``ruff format --check`` clean.
Signed-off-by: Gaurav Dubey <gauravdubey0107@gmail.com>
Signed-off-by: gaurav0107 <gauravdubey0107@gmail.com>
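To make the "DSN carries the credentials" point concrete, the sketch below pulls the connection parts out of a single DSN string using only the standard library. It does not call clickhouse_connect; the helper name and the example DSN (scheme, host, user, and password) are made up for illustration.

```python
from urllib.parse import unquote, urlsplit


def credentials_from_dsn(dsn: str) -> dict:
    """Split one DSN into its parts so no separate USER/PASSWORD vars are needed."""
    parts = urlsplit(dsn)
    return {
        "host": parts.hostname,
        "port": parts.port,
        # Percent-decoding covers passwords with reserved characters.
        "username": unquote(parts.username) if parts.username else None,
        "password": unquote(parts.password) if parts.password else None,
        "database": parts.path.lstrip("/") or None,
    }


# Hypothetical DSN; real values would come from the chart's secret.
parsed = credentials_from_dsn("clickhouse://svc:s3cret@ch.internal:8123/tracebility")
```

Keeping one secret key with the full DSN means a deployment template has a single value to wire through, which is the convention the commit says postgres and redis already follow.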
Implements the multi-tenancy spec (docs/multi-tenancy-spec.md) end-to-end. Cloud-only deployment per the office-hours decisions; on-prem / BYOC deferred.
Summary
- `0024_multitenancy_seam.sql` (plan / plan_meter_limit / quota_period / org.retention_days), ClickHouse `0006_tenant_columns.sql` (8 trace tables swap to `(org_id, project_id, …)` ORDER BY via CREATE-INSERT-RENAME), ClickHouse `0007_audit_log.sql` (audit_log + audit_reconciliation_gap).
- `services/_shared/tenant/` (`tracebility-tenant`): `TenantContext`, Redis-cached `Resolver` with pubsub invalidation, GCRA `RateLimiter`, `QuotaMeter`, `ShardRouter` (blake2b mod 16), `AuditWriter`, `EvalConcurrency` semaphore, plus the quota + audit reconciler loops.
- … `(org_id, workspace_id, project_id)`, enqueue routes to `tracebility:ingest:v1:{0..15}` keyed by `hash(org_id)`. Per-plan rate limit (429) and per-org hard quota (402) at the edge.
- `ScopedClickHouse` auto-injects `and org_id = {org_id:UUID}` into every CH WHERE; canary router `runs_query` migrated; workspace switcher (`/v1/me/workspaces`); audit egress dual-writes (postgres + CH); admin endpoints (`/v1/admin/orgs/{id}/quotas`, `/audit`, `/audit/gaps`).
- `docs/multi-tenancy-cutover.md`: sequencing, verification XLEN commands, the backlog-skew alert that's the trigger for the deferred weighted_map TODO.
- … `org_id` in WHERE (or `ScopedClickHouse` in scope, which auto-injects). 14 routers still on the raw `ch.query()` path are tracked in `_PENDING_ROUTERS` as remaining work; they all scope by `project_id` (FK-bound to `org_id`) so they're not security holes today, just no partition pruning.
Phase ↔ commit map
- 7dd133d
- 283bb70
- a59b07e
- e9593fb
- 03ebd14
- b1de7c2
- ce927cb
- 306bb00
- 5b367bd
- 6db1cdb
Test plan
- `uv run pytest services/`: 46 passed against the live docker-compose stack (postgres, redis, clickhouse).
- … `0024_multitenancy_seam.sql` first, then CH 0006, then CH 0007). Both CH migrations were validated against the local container during development.
- … `docs/multi-tenancy-cutover.md`.
- … `XLEN tracebility:ingest:v1` decay to 0; flip `TRACEBILITY_INGEST_DUAL_READ_LEGACY=false` and roll the worker.
- `POST /v1/runs` with a real api_key, confirm row in `tracebility.run` has the right `org_id` + `workspace_id`.
- `GET /v1/admin/orgs/{id}/quotas` from an org-admin session, confirm meter bars render.
Deliberately deferred (each has a reactivation trigger in the spec)
- … `ch.query()` path. Each is a 5-line `_assert_project_access` → `resolve_project_scope` swap; tracked in `services/api/tests/test_property_org_id_in_where.py::_PENDING_ROUTERS`.
- … `/v1/admin/orgs/{id}/quotas` and `/audit` (frontend not in this branch).
- … `tracebility_tenant.eval_concurrency`; the service itself doesn't exist yet on main).
- … `services/_shared/tenant/tracebility_tenant/shard.py` and the cutover runbook. Triggered by sustained backlog skew on a single shard.
- … `self_hosted` mode: collapsed into a future focused project per office-hours decision fix(migrator): version-track ClickHouse migrations (unblock helm-deploy) #7.
🤖 Generated with Claude Code