feat(telemetry): add Prometheus metrics to all services by larryro · Pull Request #605 · tale-project/tale

larryro · 2026-02-28T11:58:10Z

Summary

- Add Prometheus /metrics endpoints to all 4 services (crawler, rag, operator, platform)
- Python services use prometheus_client with HTTP request count/duration + process metrics
- Platform (Node.js) uses prom-client with process-level metrics (CPU, memory, event loop)
- Caddy proxies /metrics/{service} with Bearer token auth (METRICS_BEARER_TOKEN env var)
- Disabled by default; set METRICS_BEARER_TOKEN to enable external access

How it works

Customer's Prometheus/Alloy
  → GET https://tale.example.com/metrics/crawler
    Header: Authorization: Bearer <token>
  → Caddy verifies token → proxies to internal crawler:8002/metrics
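
As a rough illustration of the scrape flow above, the request a client sends can be sketched in Python. The helper name `build_scrape_request` and the token value are hypothetical; only the URL shape and the Authorization header come from the description. Nothing is sent over the network here.

```python
import urllib.request


def build_scrape_request(base_url: str, service: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for /metrics/{service}.

    Hypothetical helper: the PR only specifies the URL shape and the
    Bearer header, not any client code.
    """
    url = f"{base_url.rstrip('/')}/metrics/{service}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Placeholder host and token, matching the example in the description.
req = build_scrape_request("https://tale.example.com", "crawler", "example-token")
```

Passing `req` to `urllib.request.urlopen` would perform the actual scrape against a running deployment.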

Supported backends: Self-hosted Prometheus, Grafana Cloud (agentless), Datadog (via agent), New Relic, VictoriaMetrics

Test plan

- 30 unit tests pass (10 per Python service)
- Ruff lint passes for all Python changes
- docker compose up --build, verify curl http://localhost:8002/metrics returns Prometheus text
- Set METRICS_BEARER_TOKEN=test, verify curl -H "Authorization: Bearer test" https://tale.local/metrics/crawler works
- Verify curl https://tale.local/metrics/crawler without token returns 404

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added Prometheus metrics monitoring and observability across all services
- Metrics endpoints now secured with bearer token authentication via proxy
- Configurable metrics endpoints for integration with monitoring systems
Tests
- Added comprehensive test coverage for telemetry functionality
Chores
- Added monitoring client dependencies to support metrics collection

coderabbitai · 2026-02-28T12:05:22Z

📝 Walkthrough

This pull request adds Prometheus monitoring integration across multiple microservices. It includes telemetry modules in the crawler, operator, platform, and RAG services that expose HTTP metrics (request count and duration) via /metrics endpoints. Each telemetry module integrates with the application lifecycle during startup and shutdown. Configuration in .env.example defines metrics endpoints and bearer token authentication. The Caddyfile is updated to gate metrics endpoints behind bearer token authentication and proxy requests to the appropriate services. Dependencies prometheus-client and prom-client are added to enable Prometheus metrics collection. Comprehensive tests validate telemetry initialization, metrics endpoint availability, HTTP metrics recording, and path templating behavior across services.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 24.19% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title accurately and concisely describes the main objective: adding Prometheus metrics to all services. It directly matches the changeset scope.

coderabbitai

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@services/crawler/app/telemetry.py`:
- Around line 37-40: The fallback currently returns the raw request.url.path
which can create unbounded path_template cardinality; change the fallback to
return a bounded sentinel string (for example "/unmatched" or "unknown_route")
instead of request.url.path. Locate the code that checks route =
request.scope.get("route") and hasattr(route, "path") and replace the final
return request.url.path with a fixed sentinel value, keeping the existing
route.path return intact. Ensure the sentinel is a constant used consistently in
telemetry code so consumers of path_template see a limited set of labels.
- Around line 48-58: The current middleware only records metrics after a
successful await call_next(request), so exceptions skip metrics; change the
implementation in the middleware function that calls call_next (the block using
start = time.perf_counter(), await call_next(request), duration calculation and
calls to _request_count/_request_duration) to record metrics in a finally block:
start the timer before awaiting call_next, then in the try/except/finally
capture the response.status_code when available, fall back to a default status
(e.g., 500) if an exception occurred, always observe duration via
_request_duration.labels(method=request.method,
path_template=path_template).observe(duration) and increment
_request_count.labels(method=request.method, path_template=path_template,
status=status).inc(); re-raise the exception after metrics are recorded so
behavior is unchanged.
- Around line 72-94: The metrics are being registered into the same
CollectorRegistry as MultiProcessCollector which causes duplicate series in
gunicorn multiprocess mode; update the logic around
multiprocess.MultiProcessCollector(_registry) so that when
MultiProcessCollector() succeeds you register app metrics (_request_count and
_request_duration) without passing a registry (so they attach to the default
REGISTRY), and only when MultiProcessCollector() fails (non-multiprocess mode)
pass registry=_registry to the Counter/Histogram constructors; keep using
multiprocess.MultiProcessCollector to build a fresh CollectorRegistry for the
scrape endpoint but do not register app metrics into that same _registry.

In `@services/operator/app/telemetry.py`:
- Around line 43-58: The dispatch method in _MetricsMiddleware doesn't record
metrics when call_next(request) raises, so wrap the call_next invocation in
try/except/finally: record start = time.perf_counter() before calling call_next,
in the try block await call_next and capture response; in except capture the
exception, synthesize a 500-like status (e.g., status_code = 500) and re-raise
after recording; in finally compute duration, derive path_template via
_get_path_template(request), and increment/observe _request_count and
_request_duration (using request.method, path_template, and the status_code
variable) so failed requests are recorded even when call_next throws.

In `@services/platform/telemetry.js`:
- Around line 13-19: Add a guard to initTelemetry to prevent calling
client.collectDefaultMetrics() more than once: detect prior initialization
either via a module-level boolean (e.g., telemetryInitialized) or by inspecting
the client.register for existing metrics before calling collectDefaultMetrics(),
then set the flag after successful registration; ensure the /metrics route still
uses client.register.metrics() and does not re-register metrics on repeated
initTelemetry() calls.

In `@services/proxy/Caddyfile`:
- Around line 124-127: The `@metricsAuth` matcher currently uses a dangerous
default token value (__disabled__) when METRICS_BEARER_TOKEN is unset; replace
that default with an unmatchable value (e.g., a long random/UUID string) or
change the Caddyfile generation so the Authorization header matcher is omitted
entirely when METRICS_BEARER_TOKEN is not provided. Update the Authorization
header line in the `@metricsAuth` block (and any code that injects
METRICS_BEARER_TOKEN) to use the new unmatchable default or conditional logic so
no valid bearer token is accepted when the env var is absent.
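
For the first crawler finding above (unbounded path_template cardinality), a bounded fallback might look like the following sketch. It assumes only what the comment states: the request exposes `scope["route"]` with a `path` attribute when a route matched. The constant name `UNMATCHED_ROUTE` is hypothetical, and `SimpleNamespace` objects stand in for Starlette requests so the snippet runs without the framework.

```python
from types import SimpleNamespace

# Hypothetical sentinel; the review suggests "/unmatched" or "unknown_route".
UNMATCHED_ROUTE = "/unmatched"


def get_path_template(request) -> str:
    """Return the matched route template, or a fixed sentinel if none matched."""
    route = request.scope.get("route")
    path = getattr(route, "path", None)
    if isinstance(path, str):
        return path
    # Never fall back to the raw URL path: every distinct URL would
    # otherwise become its own label value.
    return UNMATCHED_ROUTE


# Stand-ins for a matched and an unmatched request.
matched = SimpleNamespace(scope={"route": SimpleNamespace(path="/items/{item_id}")})
unmatched = SimpleNamespace(scope={}, url=SimpleNamespace(path="/random/abc123"))

matched_label = get_path_template(matched)
unmatched_label = get_path_template(unmatched)
```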

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b6d1468 and 518a6f4.

📒 Files selected for processing (17)

.env.example
services/crawler/app/main.py
services/crawler/app/telemetry.py
services/crawler/pyproject.toml
services/crawler/tests/test_telemetry.py
services/operator/app/main.py
services/operator/app/telemetry.py
services/operator/pyproject.toml
services/operator/tests/test_telemetry.py
services/platform/package.json
services/platform/server.js
services/platform/telemetry.js
services/proxy/Caddyfile
services/rag/app/main.py
services/rag/app/telemetry.py
services/rag/pyproject.toml
services/rag/tests/test_telemetry.py

coderabbitai · 2026-02-28T12:05:25Z

+class _MetricsMiddleware(BaseHTTPMiddleware):
+    async def dispatch(self, request: Request, call_next: RequestResponseEndpoint) -> StarletteResponse:
+        if request.url.path == "/metrics":
+            return await call_next(request)
+
+        start = time.perf_counter()
+        response = await call_next(request)
+        duration = time.perf_counter() - start
+
+        path_template = _get_path_template(request)
+        if _request_count is not None:
+            _request_count.labels(method=request.method, path_template=path_template, status=response.status_code).inc()
+        if _request_duration is not None:
+            _request_duration.labels(method=request.method, path_template=path_template).observe(duration)
+
+        return response


🧹 Nitpick | 🔵 Trivial

Unhandled exceptions bypass metrics recording.

If call_next(request) raises an exception, Lines 52-56 won't execute and the failed request won't be recorded in metrics. Consider wrapping with try/except to capture error responses as well.

♻️ Proposed fix to record metrics for exceptions

 class _MetricsMiddleware(BaseHTTPMiddleware):
     async def dispatch(self, request: Request, call_next: RequestResponseEndpoint) -> StarletteResponse:
         if request.url.path == "/metrics":
             return await call_next(request)

         start = time.perf_counter()
-        response = await call_next(request)
-        duration = time.perf_counter() - start
-
-        path_template = _get_path_template(request)
-        if _request_count is not None:
-            _request_count.labels(method=request.method, path_template=path_template, status=response.status_code).inc()
-        if _request_duration is not None:
-            _request_duration.labels(method=request.method, path_template=path_template).observe(duration)
-
-        return response
+        status_code = 500
+        try:
+            response = await call_next(request)
+            status_code = response.status_code
+            return response
+        finally:
+            duration = time.perf_counter() - start
+            path_template = _get_path_template(request)
+            if _request_count is not None:
+                _request_count.labels(method=request.method, path_template=path_template, status=status_code).inc()
+            if _request_duration is not None:
+                _request_duration.labels(method=request.method, path_template=path_template).observe(duration)

greptile-apps · 2026-02-28T12:09:01Z

Greptile Summary

This PR adds Prometheus /metrics endpoints to all four Tale services (crawler, rag, operator, platform) behind Caddy bearer-token authentication, enabling external Prometheus/Grafana scraping without exposing metrics by default. The implementation is well-structured — the three Python services share a consistent _MetricsMiddleware + init_telemetry/shutdown_telemetry pattern with proper multiprocess support, and the test suites are thorough.

Key findings:

- Security: __disabled__ backdoor token (Caddyfile line 126). When METRICS_BEARER_TOKEN is unset, Caddy substitutes the literal string __disabled__ as the token. Because this default is visible in the public repository, any client that sends Authorization: Bearer __disabled__ can access all four metrics endpoints on any deployment that hasn't explicitly set the env var. This directly contradicts the documented "disabled by default" intent and should be addressed before merging.
- Missing idempotency guard in telemetry.js: Unlike the Python services, initTelemetry doesn't check whether it's already been called. A second invocation causes prom-client to throw a duplicate-metric registration error, crashing the server.
- Exception-path requests not counted: In all three Python _MetricsMiddleware implementations, if call_next raises an exception, the metric labels are never incremented. HTTP 500s caused by unhandled exceptions will be absent from http_requests_total.
- prom-client version not pinned: "prom-client": "^15.1.0" is inconsistent with the exact-version pinning used for every other runtime dependency in package.json.

Confidence Score: 2/5

Not safe to merge until the __disabled__ default token issue in the Caddyfile is resolved — it exposes metrics on all deployments where the env var is unset.
The __disabled__ fallback creates a publicly-known static bearer token that grants metrics access on any instance that hasn't explicitly set METRICS_BEARER_TOKEN, directly contradicting the "disabled by default" guarantee. This is a meaningful security regression for all existing deployments. The missing idempotency guard in telemetry.js is a secondary crash risk. The other two issues (exception-path metrics, unpinned version) are lower-severity but worth fixing.
services/proxy/Caddyfile (security) and services/platform/telemetry.js (crash on double-init) need attention before merging.

Important Files Changed

Filename	Overview
services/proxy/Caddyfile	Adds metrics proxy routes with bearer token auth; the `__disabled__` default is a publicly-known static string that acts as an accessible backdoor token in open-source deployments where `METRICS_BEARER_TOKEN` is unset.
services/platform/telemetry.js	New Node.js telemetry module using `prom-client`; lacks an idempotency guard unlike the three Python equivalents, which would cause a fatal error if `initTelemetry` is called more than once.
services/crawler/app/telemetry.py	Well-structured Prometheus middleware with idempotency and multiprocess-mode support; exception-path requests are not counted (shared pattern across all three Python services).
services/rag/app/telemetry.py	Identical implementation to `crawler/app/telemetry.py`; same exception-path metric gap applies.
services/operator/app/telemetry.py	Identical implementation to `crawler/app/telemetry.py`; same exception-path metric gap applies.
services/platform/package.json	Adds `prom-client` as a runtime dependency; uses a caret range (`^15.1.0`) inconsistent with the exact-version pinning used for all other dependencies.
services/crawler/tests/test_telemetry.py	Comprehensive test suite covering endpoint, HTTP metrics, path-template cardinality, and idempotency; uses a proper autouse fixture to reset global module state between tests.
.env.example	Adds well-documented Prometheus scrape config example; correctly notes the feature is disabled by default.

Sequence Diagram

sequenceDiagram
    participant P as Prometheus
    participant C as Caddy proxy
    participant S as Internal Service

    P->>C: GET /metrics/crawler Authorization: Bearer token
    Note over C: metricsAuth matcher checks path + header
    alt Token matches METRICS_BEARER_TOKEN
        C->>C: rewrite to /metrics
        C->>S: GET /metrics internal
        S-->>C: 200 Prometheus text
        C-->>P: 200 Prometheus text
    else No token or wrong token
        Note over C: Falls through to metricsBlock
        C-->>P: 404 Not Found
    end
    Note over S: Python services use _MetricsMiddleware for HTTP metrics
    Note over S: Node.js platform exposes process metrics only
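
The gate in the diagram above, including the fix for the unset-token case, can be modelled in Python roughly as follows. This is a behavioural sketch, not Caddy configuration: `metrics_gate` and `UPSTREAMS` are hypothetical names, only the crawler upstream (crawler:8002) is taken from the PR description, and a returned 200 simply stands for "proxied to the upstream".

```python
import hmac
from typing import Optional, Tuple

# Only the crawler upstream appears in the PR description; other services omitted.
UPSTREAMS = {"crawler": "crawler:8002"}


def metrics_gate(
    path: str, authorization: Optional[str], configured_token: Optional[str]
) -> Tuple[int, Optional[str]]:
    """Return (status, upstream_url) for a request to /metrics/{service}."""
    prefix = "/metrics/"
    if not path.startswith(prefix):
        return 404, None
    upstream = UPSTREAMS.get(path[len(prefix):])
    # An unset or empty token disables the endpoint: no header can match.
    if upstream is None or not configured_token:
        return 404, None
    expected = f"Bearer {configured_token}"
    if authorization is None or not hmac.compare_digest(
        authorization.encode(), expected.encode()
    ):
        return 404, None
    # Rewrite /metrics/{service} to the internal /metrics path.
    return 200, f"{upstream}/metrics"


allowed = metrics_gate("/metrics/crawler", "Bearer test", "test")
missing = metrics_gate("/metrics/crawler", None, "test")
disabled = metrics_gate("/metrics/crawler", "Bearer __disabled__", None)
```

With the token unset, even the previously accepted `Bearer __disabled__` header is rejected, which is the behaviour the review asks for.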

Last reviewed commit: 518a6f4

greptile-apps

17 files reviewed, 4 comments

greptile-apps · 2026-02-28T12:09:12Z

Additional Comments (1)

services/crawler/app/telemetry.py
Exception-level errors are invisible in request metrics

The metrics are only recorded after call_next returns a Response. If the downstream handler raises an unhandled exception (which Starlette converts to a 500 response outside this middleware), duration and the label increment are never reached. Real-world HTTP 500s that originate from exceptions will be absent from http_requests_total{status="500"}.

The same pattern is present in all three Python telemetry.py files (rag, operator, crawler).

Wrapping the call in try/finally records the metrics even when an exception propagates, with the status defaulting to 500:

start = time.perf_counter()
status_code = 500
try:
    response = await call_next(request)
    status_code = response.status_code
    return response
finally:
    duration = time.perf_counter() - start
    path_template = _get_path_template(request)
    if _request_count is not None:
        _request_count.labels(method=request.method, path_template=path_template, status=status_code).inc()
    if _request_duration is not None:
        _request_duration.labels(method=request.method, path_template=path_template).observe(duration)

This issue also applies to services/operator/app/telemetry.py (lines 116-129) and services/rag/app/telemetry.py (lines 116-129).
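
To see the effect of the try/finally pattern in isolation, here is a self-contained sketch using plain asyncio. A `Counter` and a list stand in for the prometheus_client metrics, and the handlers return a bare status code instead of a Response. All names are illustrative, not taken from the services.

```python
import asyncio
import time
from collections import Counter

request_count: Counter = Counter()  # stand-in for http_requests_total
durations: list[float] = []         # stand-in for the duration histogram


async def record(call_next, method: str, path_template: str) -> int:
    """Call the handler and record metrics whether it returns or raises."""
    start = time.perf_counter()
    status_code = 500  # assumed status when the handler raises
    try:
        status_code = await call_next()
        return status_code
    finally:
        durations.append(time.perf_counter() - start)
        request_count[(method, path_template, status_code)] += 1


async def ok_handler() -> int:
    return 200


async def failing_handler() -> int:
    raise RuntimeError("boom")


async def main() -> None:
    await record(ok_handler, "GET", "/items/{id}")
    try:
        await record(failing_handler, "GET", "/items/{id}")
    except RuntimeError:
        pass  # the exception still propagates after metrics are recorded


asyncio.run(main())
```

Both requests end up counted: one under status 200 and one under the assumed 500.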

Add lightweight Prometheus metrics endpoints to crawler, rag, operator, and platform services using prometheus_client (Python) and prom-client (Node.js). Each service exposes GET /metrics on its internal port with HTTP request count/duration and process-level metrics. Caddy proxies /metrics/{service} with Bearer token auth, gated behind METRICS_BEARER_TOKEN env var (disabled by default). Supports scraping by Prometheus, Grafana Alloy, Datadog Agent, and Grafana Cloud.

…ry package Replace duplicated telemetry modules in crawler, rag, and operator with a shared tale_telemetry package. Harden platform telemetry with idempotent init, shutdown support, and error handling. Fix proxy metrics auth to require a non-empty bearer token.

coderabbitai Bot requested changes Feb 28, 2026

greptile-apps Bot reviewed Feb 28, 2026

Comment thread services/proxy/Caddyfile Outdated

Comment thread services/platform/telemetry.js Outdated

Comment thread services/platform/package.json

larryro force-pushed the feat/telemetry-metrics branch 2 times, most recently from 67828c4 to e8551ee · March 2, 2026 12:42

larryro added 7 commits March 2, 2026 20:47

fix(platform): adapt telemetry to Bun server

ae32d79

Replace Express-based telemetry.js with telemetry.ts compatible with Bun.serve(), following the server.js → server.ts migration on main.

fix(platform): init telemetry after app setup and include telemetry.t…

c7b8545

…s in Docker Move init_telemetry() from lifespan handler to after routes/middleware registration across crawler, operator, and rag services. Add missing telemetry.ts to platform Dockerfile COPY steps.

feat(proxy): add Convex backend metrics endpoint to reverse proxy

b6fb61d

Route /metrics/convex through Caddy to the Convex backend on port 3210, protected by the same bearer-token auth as the other metrics endpoints.

docs: add monitoring and metrics section to README

f5bbd35

Document Prometheus endpoints, bearer token auth, and scrape config for all Tale services.

fix(platform): fix lint and test CI failures

c6ea60d

Sort tale_telemetry imports into third-party group for ruff isort compliance and replace bun:test with vitest in telemetry.test.ts.

larryro force-pushed the feat/telemetry-metrics branch from e8551ee to c6ea60d · March 2, 2026 12:47

larryro merged commit 95c6dad into main Mar 2, 2026
16 checks passed

larryro deleted the feat/telemetry-metrics branch March 2, 2026 12:55

coderabbitai Bot mentioned this pull request Mar 11, 2026

feat(platform): add designer service for AI document transformation #759

Closed

yannickmonney pushed a commit that referenced this pull request Apr 8, 2026

feat(telemetry): add Prometheus metrics to all services (#605)

69b8be0

larryro merged 7 commits into main from feat/telemetry-metrics
