feat(telemetry): add Prometheus metrics to all services#605
Conversation
📝 WalkthroughWalkthroughThis pull request adds Prometheus monitoring integration across multiple microservices. It includes telemetry modules in the crawler, operator, platform, and RAG services that expose HTTP metrics (request count and duration) via /metrics endpoints. Each telemetry module integrates with the application lifecycle during startup and shutdown. Configuration in .env.example defines metrics endpoints and bearer token authentication. The Caddyfile is updated to gate metrics endpoints behind bearer token authentication and proxy requests to the appropriate services. Dependencies prometheus-client and prom-client are added to enable Prometheus metrics collection. Comprehensive tests validate telemetry initialization, metrics endpoint availability, HTTP metrics recording, and path templating behavior across services. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@services/crawler/app/telemetry.py`:
- Around line 37-40: The fallback currently returns the raw request.url.path
which can create unbounded path_template cardinality; change the fallback to
return a bounded sentinel string (for example "/unmatched" or "unknown_route")
instead of request.url.path. Locate the code that checks route =
request.scope.get("route") and hasattr(route, "path") and replace the final
return request.url.path with a fixed sentinel value, keeping the existing
route.path return intact. Ensure the sentinel is a constant used consistently in
telemetry code so consumers of path_template see a limited set of labels.
- Around line 48-58: The current middleware only records metrics after a
successful await call_next(request), so exceptions skip metrics; change the
implementation in the middleware function that calls call_next (the block using
start = time.perf_counter(), await call_next(request), duration calculation and
calls to _request_count/_request_duration) to record metrics in a finally block:
start the timer before awaiting call_next, then in the try/except/finally
capture the response.status_code when available, fall back to a default status
(e.g., 500) if an exception occurred, always observe duration via
_request_duration.labels(method=request.method,
path_template=path_template).observe(duration) and increment
_request_count.labels(method=request.method, path_template=path_template,
status=status).inc(); re-raise the exception after metrics are recorded so
behavior is unchanged.
- Around line 72-94: The metrics are being registered into the same
CollectorRegistry as MultiProcessCollector which causes duplicate series in
gunicorn multiprocess mode; update the logic around
multiprocess.MultiProcessCollector(_registry) so that when
MultiProcessCollector() succeeds you register app metrics (_request_count and
_request_duration) without passing a registry (so they attach to the default
REGISTRY), and only when MultiProcessCollector() fails (non-multiprocess mode)
pass registry=_registry to the Counter/Histogram constructors; keep using
multiprocess.MultiProcessCollector to build a fresh CollectorRegistry for the
scrape endpoint but do not register app metrics into that same _registry.
In `@services/operator/app/telemetry.py`:
- Around line 43-58: The dispatch method in _MetricsMiddleware doesn't record
metrics when call_next(request) raises, so wrap the call_next invocation in
try/except/finally: record start = time.perf_counter() before calling call_next,
in the try block await call_next and capture response; in except capture the
exception, synthesize a 500-like status (e.g., status_code = 500) and re-raise
after recording; in finally compute duration, derive path_template via
_get_path_template(request), and increment/observe _request_count and
_request_duration (using request.method, path_template, and the status_code
variable) so failed requests are recorded even when call_next throws.
In `@services/platform/telemetry.js`:
- Around line 13-19: Add a guard to initTelemetry to prevent calling
client.collectDefaultMetrics() more than once: detect prior initialization
either via a module-level boolean (e.g., telemetryInitialized) or by inspecting
the client.register for existing metrics before calling collectDefaultMetrics(),
then set the flag after successful registration; ensure the /metrics route still
uses client.register.metrics() and does not re-register metrics on repeated
initTelemetry() calls.
In `@services/proxy/Caddyfile`:
- Around line 124-127: The `@metricsAuth` matcher currently uses a dangerous
default token value (__disabled__) when METRICS_BEARER_TOKEN is unset; replace
that default with an unmatchable value (e.g., a long random/UUID string) or
change the Caddyfile generation so the Authorization header matcher is omitted
entirely when METRICS_BEARER_TOKEN is not provided. Update the Authorization
header line in the `@metricsAuth` block (and any code that injects
METRICS_BEARER_TOKEN) to use the new unmatchable default or conditional logic so
no valid bearer token is accepted when the env var is absent.
ℹ️ Review info
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (17)
.env.exampleservices/crawler/app/main.pyservices/crawler/app/telemetry.pyservices/crawler/pyproject.tomlservices/crawler/tests/test_telemetry.pyservices/operator/app/main.pyservices/operator/app/telemetry.pyservices/operator/pyproject.tomlservices/operator/tests/test_telemetry.pyservices/platform/package.jsonservices/platform/server.jsservices/platform/telemetry.jsservices/proxy/Caddyfileservices/rag/app/main.pyservices/rag/app/telemetry.pyservices/rag/pyproject.tomlservices/rag/tests/test_telemetry.py
| class _MetricsMiddleware(BaseHTTPMiddleware): | ||
| async def dispatch(self, request: Request, call_next: RequestResponseEndpoint) -> StarletteResponse: | ||
| if request.url.path == "/metrics": | ||
| return await call_next(request) | ||
|
|
||
| start = time.perf_counter() | ||
| response = await call_next(request) | ||
| duration = time.perf_counter() - start | ||
|
|
||
| path_template = _get_path_template(request) | ||
| if _request_count is not None: | ||
| _request_count.labels(method=request.method, path_template=path_template, status=response.status_code).inc() | ||
| if _request_duration is not None: | ||
| _request_duration.labels(method=request.method, path_template=path_template).observe(duration) | ||
|
|
||
| return response |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Unhandled exceptions bypass metrics recording.
If call_next(request) raises an exception, Lines 52-56 won't execute and the failed request won't be recorded in metrics. Consider wrapping with try/except to capture error responses as well.
♻️ Proposed fix to record metrics for exceptions
class _MetricsMiddleware(BaseHTTPMiddleware):
async def dispatch(self, request: Request, call_next: RequestResponseEndpoint) -> StarletteResponse:
if request.url.path == "/metrics":
return await call_next(request)
start = time.perf_counter()
- response = await call_next(request)
- duration = time.perf_counter() - start
-
- path_template = _get_path_template(request)
- if _request_count is not None:
- _request_count.labels(method=request.method, path_template=path_template, status=response.status_code).inc()
- if _request_duration is not None:
- _request_duration.labels(method=request.method, path_template=path_template).observe(duration)
-
- return response
+ status_code = 500
+ try:
+ response = await call_next(request)
+ status_code = response.status_code
+ return response
+ finally:
+ duration = time.perf_counter() - start
+ path_template = _get_path_template(request)
+ if _request_count is not None:
+ _request_count.labels(method=request.method, path_template=path_template, status=status_code).inc()
+ if _request_duration is not None:
+ _request_duration.labels(method=request.method, path_template=path_template).observe(duration)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@services/operator/app/telemetry.py` around lines 43 - 58, The dispatch method
in _MetricsMiddleware doesn't record metrics when call_next(request) raises, so
wrap the call_next invocation in try/except/finally: record start =
time.perf_counter() before calling call_next, in the try block await call_next
and capture response; in except capture the exception, synthesize a 500-like
status (e.g., status_code = 500) and re-raise after recording; in finally
compute duration, derive path_template via _get_path_template(request), and
increment/observe _request_count and _request_duration (using request.method,
path_template, and the status_code variable) so failed requests are recorded
even when call_next throws.
Greptile SummaryThis PR adds Prometheus Key findings:
Confidence Score: 2/5
|
| Filename | Overview |
|---|---|
| services/proxy/Caddyfile | Adds metrics proxy routes with bearer token auth; the __disabled__ default is a publicly-known static string that acts as an accessible backdoor token in open-source deployments where METRICS_BEARER_TOKEN is unset. |
| services/platform/telemetry.js | New Node.js telemetry module using prom-client; lacks an idempotency guard unlike the three Python equivalents, which would cause a fatal error if initTelemetry is called more than once. |
| services/crawler/app/telemetry.py | Well-structured Prometheus middleware with idempotency and multiprocess-mode support; exception-path requests are not counted (shared pattern across all three Python services). |
| services/rag/app/telemetry.py | Identical implementation to crawler/app/telemetry.py; same exception-path metric gap applies. |
| services/operator/app/telemetry.py | Identical implementation to crawler/app/telemetry.py; same exception-path metric gap applies. |
| services/platform/package.json | Adds prom-client as a runtime dependency; uses a caret range (^15.1.0) inconsistent with the exact-version pinning used for all other dependencies. |
| services/crawler/tests/test_telemetry.py | Comprehensive test suite covering endpoint, HTTP metrics, path-template cardinality, and idempotency; uses a proper autouse fixture to reset global module state between tests. |
| .env.example | Adds well-documented Prometheus scrape config example; correctly notes the feature is disabled by default. |
Sequence Diagram
sequenceDiagram
participant P as Prometheus
participant C as Caddy proxy
participant S as Internal Service
P->>C: GET /metrics/crawler Authorization: Bearer token
Note over C: metricsAuth matcher checks path + header
alt Token matches METRICS_BEARER_TOKEN
C->>C: rewrite to /metrics
C->>S: GET /metrics internal
S-->>C: 200 Prometheus text
C-->>P: 200 Prometheus text
else No token or wrong token
Note over C: Falls through to metricsBlock
C-->>P: 404 Not Found
end
Note over S: Python services use _MetricsMiddleware for HTTP metrics
Note over S: Node.js platform exposes process metrics only
Last reviewed commit: 518a6f4
Additional Comments (1)
The metrics are only recorded after The same pattern is present in all three Python Wrapping in a start = time.perf_counter()
status_code = 500
try:
response = await call_next(request)
status_code = response.status_code
return response
finally:
duration = time.perf_counter() - start
path_template = _get_path_template(request)
if _request_count is not None:
_request_count.labels(method=request.method, path_template=path_template, status=status_code).inc()
if _request_duration is not None:
_request_duration.labels(method=request.method, path_template=path_template).observe(duration)This issue also applies to |
67828c4 to
e8551ee
Compare
Add lightweight Prometheus metrics endpoints to crawler, rag, operator,
and platform services using prometheus_client (Python) and prom-client
(Node.js). Each service exposes GET /metrics on its internal port with
HTTP request count/duration and process-level metrics.
Caddy proxies /metrics/{service} with Bearer token auth, gated behind
METRICS_BEARER_TOKEN env var (disabled by default). Supports scraping
by Prometheus, Grafana Alloy, Datadog Agent, and Grafana Cloud.
Replace Express-based telemetry.js with telemetry.ts compatible with Bun.serve(), following the server.js → server.ts migration on main.
…s in Docker Move init_telemetry() from lifespan handler to after routes/middleware registration across crawler, operator, and rag services. Add missing telemetry.ts to platform Dockerfile COPY steps.
Route /metrics/convex through Caddy to the Convex backend on port 3210, protected by the same bearer-token auth as the other metrics endpoints.
Document Prometheus endpoints, bearer token auth, and scrape config for all Tale services.
…ry package Replace duplicated telemetry modules in crawler, rag, and operator with a shared tale_telemetry package. Harden platform telemetry with idempotent init, shutdown support, and error handling. Fix proxy metrics auth to require a non-empty bearer token.
Sort tale_telemetry imports into third-party group for ruff isort compliance and replace bun:test with vitest in telemetry.test.ts.
e8551ee to
c6ea60d
Compare
Summary
/metricsendpoints to all 4 services (crawler, rag, operator, platform)prometheus_clientwith HTTP request count/duration + process metricsprom-clientwith process-level metrics (CPU, memory, event loop)/metrics/{service}with Bearer token auth (METRICS_BEARER_TOKENenv var)METRICS_BEARER_TOKENto enable external accessHow it works
Supported backends: Self-hosted Prometheus, Grafana Cloud (agentless), Datadog (via agent), New Relic, VictoriaMetrics
Test plan
docker compose up --build, verifycurl http://localhost:8002/metricsreturns Prometheus textMETRICS_BEARER_TOKEN=test, verifycurl -H "Authorization: Bearer test" https://tale.local/metrics/crawlerworkscurl https://tale.local/metrics/crawlerwithout token returns 404🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Tests
Chores