feat(telemetry): add Prometheus metrics to all services #605

Merged
larryro merged 7 commits into main from feat/telemetry-metrics on Mar 2, 2026

Conversation

larryro commented Feb 28, 2026

Collaborator

Summary

  • Add Prometheus /metrics endpoints to all 4 services (crawler, rag, operator, platform)
  • Python services use prometheus_client with HTTP request count/duration + process metrics
  • Platform (Node.js) uses prom-client with process-level metrics (CPU, memory, event loop)
  • Caddy proxies /metrics/{service} with Bearer token auth (METRICS_BEARER_TOKEN env var)
  • Disabled by default — set METRICS_BEARER_TOKEN to enable external access

How it works

Customer's Prometheus/Alloy
  → GET https://tale.example.com/metrics/crawler
    Header: Authorization: Bearer <token>
  → Caddy verifies token → proxies to internal crawler:8002/metrics

Supported backends: Self-hosted Prometheus, Grafana Cloud (agentless), Datadog (via agent), New Relic, VictoriaMetrics
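
The scrape flow above can be sketched from the client side. This is a minimal illustration, not part of the PR: the helper name `build_scrape_request` is hypothetical, and the host and token are the placeholders used in the description. It only builds the request, so nothing is sent over the network.

```python
import urllib.request


def build_scrape_request(base_url: str, service: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for one service's proxied metrics."""
    # Caddy exposes each service under /metrics/{service}, per the PR description.
    url = f"{base_url.rstrip('/')}/metrics/{service}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


if __name__ == "__main__":
    req = build_scrape_request("https://tale.example.com", "crawler", "<token>")
    print(req.get_method(), req.full_url)
    # Sending is left to the caller, e.g. urllib.request.urlopen(req, timeout=5).
```

A real scraper (Prometheus, Alloy) would express the same thing in its own scrape configuration; the sketch only makes the URL shape and header explicit.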

Test plan

  • 30 unit tests pass (10 per Python service)
  • Ruff lint passes for all Python changes
  • docker compose up --build, verify curl http://localhost:8002/metrics returns Prometheus text
  • Set METRICS_BEARER_TOKEN=test, verify curl -H "Authorization: Bearer test" https://tale.local/metrics/crawler works
  • Verify curl https://tale.local/metrics/crawler without token returns 404

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added Prometheus metrics monitoring and observability across all services
    • Metrics endpoints now secured with bearer token authentication via proxy
    • Configurable metrics endpoints for integration with monitoring systems
  • Tests

    • Added comprehensive test coverage for telemetry functionality
  • Chores

    • Added monitoring client dependencies to support metrics collection

coderabbitai Bot commented Feb 28, 2026

Contributor

Walkthrough

This pull request adds Prometheus monitoring integration across multiple microservices. It includes telemetry modules in the crawler, operator, platform, and RAG services that expose HTTP metrics (request count and duration) via /metrics endpoints. Each telemetry module integrates with the application lifecycle during startup and shutdown. Configuration in .env.example defines metrics endpoints and bearer token authentication. The Caddyfile is updated to gate metrics endpoints behind bearer token authentication and proxy requests to the appropriate services. Dependencies prometheus-client and prom-client are added to enable Prometheus metrics collection. Comprehensive tests validate telemetry initialization, metrics endpoint availability, HTTP metrics recording, and path templating behavior across services.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 24.19% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately and concisely describes the main objective: adding Prometheus metrics to all services. It directly matches the changeset scope. |


coderabbitai Bot left a comment


Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@services/crawler/app/telemetry.py`:
- Around line 37-40: The fallback currently returns the raw request.url.path
which can create unbounded path_template cardinality; change the fallback to
return a bounded sentinel string (for example "/unmatched" or "unknown_route")
instead of request.url.path. Locate the code that checks route =
request.scope.get("route") and hasattr(route, "path") and replace the final
return request.url.path with a fixed sentinel value, keeping the existing
route.path return intact. Ensure the sentinel is a constant used consistently in
telemetry code so consumers of path_template see a limited set of labels.
- Around line 48-58: The current middleware only records metrics after a
successful await call_next(request), so exceptions skip metrics; change the
implementation in the middleware function that calls call_next (the block using
start = time.perf_counter(), await call_next(request), duration calculation and
calls to _request_count/_request_duration) to record metrics in a finally block:
start the timer before awaiting call_next, then in the try/except/finally
capture the response.status_code when available, fall back to a default status
(e.g., 500) if an exception occurred, always observe duration via
_request_duration.labels(method=request.method,
path_template=path_template).observe(duration) and increment
_request_count.labels(method=request.method, path_template=path_template,
status=status).inc(); re-raise the exception after metrics are recorded so
behavior is unchanged.
- Around line 72-94: The metrics are being registered into the same
CollectorRegistry as MultiProcessCollector which causes duplicate series in
gunicorn multiprocess mode; update the logic around
multiprocess.MultiProcessCollector(_registry) so that when
MultiProcessCollector() succeeds you register app metrics (_request_count and
_request_duration) without passing a registry (so they attach to the default
REGISTRY), and only when MultiProcessCollector() fails (non-multiprocess mode)
pass registry=_registry to the Counter/Histogram constructors; keep using
multiprocess.MultiProcessCollector to build a fresh CollectorRegistry for the
scrape endpoint but do not register app metrics into that same _registry.
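
The first finding above (bounded `path_template` labels) can be sketched as follows. This is an illustration under assumptions, not the code in the PR: the function takes a plain ASGI-style scope mapping rather than a Starlette `Request`, and the constant name `UNMATCHED_ROUTE` is hypothetical (the review only suggests a value such as "/unmatched").

```python
from types import SimpleNamespace
from typing import Any, Mapping

# Hypothetical sentinel; any single fixed string keeps label cardinality bounded.
UNMATCHED_ROUTE = "/unmatched"


def get_path_template(scope: Mapping[str, Any]) -> str:
    """Return the matched route template, or a fixed sentinel if nothing matched."""
    route = scope.get("route")
    path = getattr(route, "path", None)
    if isinstance(path, str):
        return path
    # Never fall back to the raw URL path: every distinct URL would become a new label value.
    return UNMATCHED_ROUTE


if __name__ == "__main__":
    matched = {"route": SimpleNamespace(path="/items/{item_id}")}
    unmatched = {"path": "/wp-admin/random-123"}
    print(get_path_template(matched))
    print(get_path_template(unmatched))
```

The trade-off is that all unmatched requests (typically 404s from scanners) collapse into one series, which is usually what a dashboard wants.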

In `@services/operator/app/telemetry.py`:
- Around line 43-58: The dispatch method in _MetricsMiddleware doesn't record
metrics when call_next(request) raises, so wrap the call_next invocation in
try/except/finally: record start = time.perf_counter() before calling call_next,
in the try block await call_next and capture response; in except capture the
exception, synthesize a 500-like status (e.g., status_code = 500) and re-raise
after recording; in finally compute duration, derive path_template via
_get_path_template(request), and increment/observe _request_count and
_request_duration (using request.method, path_template, and the status_code
variable) so failed requests are recorded even when call_next throws.

In `@services/platform/telemetry.js`:
- Around line 13-19: Add a guard to initTelemetry to prevent calling
client.collectDefaultMetrics() more than once: detect prior initialization
either via a module-level boolean (e.g., telemetryInitialized) or by inspecting
the client.register for existing metrics before calling collectDefaultMetrics(),
then set the flag after successful registration; ensure the /metrics route still
uses client.register.metrics() and does not re-register metrics on repeated
initTelemetry() calls.

In `@services/proxy/Caddyfile`:
- Around line 124-127: The `@metricsAuth` matcher currently uses a dangerous
default token value (__disabled__) when METRICS_BEARER_TOKEN is unset; replace
that default with an unmatchable value (e.g., a long random/UUID string) or
change the Caddyfile generation so the Authorization header matcher is omitted
entirely when METRICS_BEARER_TOKEN is not provided. Update the Authorization
header line in the `@metricsAuth` block (and any code that injects
METRICS_BEARER_TOKEN) to use the new unmatchable default or conditional logic so
no valid bearer token is accepted when the env var is absent.
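
The intended behavior of the Caddyfile finding (no token accepted when METRICS_BEARER_TOKEN is absent) can be stated as a small decision function. This is a Python sketch of the logic only, not Caddyfile syntax; the function name `metrics_request_allowed` is hypothetical, and the constant-time comparison is an addition of this sketch rather than something the review asks for.

```python
import hmac
from typing import Optional


def metrics_request_allowed(authorization: Optional[str], configured_token: Optional[str]) -> bool:
    """Decide whether a metrics request may pass the proxy."""
    # An unset or empty token means the feature is disabled: nothing can match.
    if not configured_token:
        return False
    if not authorization:
        return False
    expected = f"Bearer {configured_token}"
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(authorization.encode(), expected.encode())


if __name__ == "__main__":
    print(metrics_request_allowed("Bearer __disabled__", None))  # unset token: rejected
    print(metrics_request_allowed("Bearer test", "test"))        # matching token: allowed
```

The key property is that there is no default value a client could guess: the disabled state is represented by the absence of a token, not by a placeholder string.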

ℹ️ Review info

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b6d1468 and 518a6f4.

📒 Files selected for processing (17)
  • .env.example
  • services/crawler/app/main.py
  • services/crawler/app/telemetry.py
  • services/crawler/pyproject.toml
  • services/crawler/tests/test_telemetry.py
  • services/operator/app/main.py
  • services/operator/app/telemetry.py
  • services/operator/pyproject.toml
  • services/operator/tests/test_telemetry.py
  • services/platform/package.json
  • services/platform/server.js
  • services/platform/telemetry.js
  • services/proxy/Caddyfile
  • services/rag/app/main.py
  • services/rag/app/telemetry.py
  • services/rag/pyproject.toml
  • services/rag/tests/test_telemetry.py

Comment thread services/crawler/app/telemetry.py Outdated
Comment thread services/crawler/app/telemetry.py Outdated
Comment thread services/crawler/app/telemetry.py Outdated
Comment thread services/operator/app/telemetry.py Outdated
Comment on lines +43 to +58
class _MetricsMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next: RequestResponseEndpoint) -> StarletteResponse:
        if request.url.path == "/metrics":
            return await call_next(request)

        start = time.perf_counter()
        response = await call_next(request)
        duration = time.perf_counter() - start

        path_template = _get_path_template(request)
        if _request_count is not None:
            _request_count.labels(method=request.method, path_template=path_template, status=response.status_code).inc()
        if _request_duration is not None:
            _request_duration.labels(method=request.method, path_template=path_template).observe(duration)

        return response


🧹 Nitpick | 🔵 Trivial

Unhandled exceptions bypass metrics recording.

If call_next(request) raises an exception, Lines 52-56 won't execute and the failed request won't be recorded in metrics. Consider wrapping the call in try/finally so that failed requests are recorded as well.

♻️ Proposed fix to record metrics for exceptions
 class _MetricsMiddleware(BaseHTTPMiddleware):
     async def dispatch(self, request: Request, call_next: RequestResponseEndpoint) -> StarletteResponse:
         if request.url.path == "/metrics":
             return await call_next(request)
 
         start = time.perf_counter()
-        response = await call_next(request)
-        duration = time.perf_counter() - start
-
-        path_template = _get_path_template(request)
-        if _request_count is not None:
-            _request_count.labels(method=request.method, path_template=path_template, status=response.status_code).inc()
-        if _request_duration is not None:
-            _request_duration.labels(method=request.method, path_template=path_template).observe(duration)
-
-        return response
+        status_code = 500
+        try:
+            response = await call_next(request)
+            status_code = response.status_code
+            return response
+        finally:
+            duration = time.perf_counter() - start
+            path_template = _get_path_template(request)
+            if _request_count is not None:
+                _request_count.labels(method=request.method, path_template=path_template, status=status_code).inc()
+            if _request_duration is not None:
+                _request_duration.labels(method=request.method, path_template=path_template).observe(duration)

Comment thread services/platform/telemetry.js Outdated
Comment thread services/proxy/Caddyfile
greptile-apps Bot commented Feb 28, 2026


Greptile Summary

This PR adds Prometheus /metrics endpoints to all four Tale services (crawler, rag, operator, platform) behind Caddy bearer-token authentication, enabling external Prometheus/Grafana scraping without exposing metrics by default. The implementation is well-structured — the three Python services share a consistent _MetricsMiddleware + init_telemetry/shutdown_telemetry pattern with proper multiprocess support, and the test suites are thorough.

Key findings:

  • Security — __disabled__ backdoor token (Caddyfile line 126): When METRICS_BEARER_TOKEN is unset, Caddy substitutes the literal string __disabled__ as the token. Because this default is visible in the public repository, any client that sends Authorization: Bearer __disabled__ can access all four metrics endpoints on any deployment that hasn't explicitly set the env var. This directly contradicts the documented "disabled by default" intent and should be addressed before merging.
  • Missing idempotency guard in telemetry.js: Unlike the Python services, initTelemetry doesn't check whether it's already been called. A second invocation causes prom-client to throw a duplicate-metric registration error, crashing the server.
  • Exception-path requests not counted: In all three Python _MetricsMiddleware implementations, if call_next raises an exception, the metric labels are never incremented. HTTP 500s caused by unhandled exceptions will be absent from http_requests_total.
  • prom-client version not pinned: "prom-client": "^15.1.0" is inconsistent with the exact-version pinning used for every other runtime dependency in package.json.

Confidence Score: 2/5

  • Not safe to merge until the __disabled__ default token issue in the Caddyfile is resolved — it exposes metrics on all deployments where the env var is unset.
  • The __disabled__ fallback creates a publicly-known static bearer token that grants metrics access on any instance that hasn't explicitly set METRICS_BEARER_TOKEN, directly contradicting the "disabled by default" guarantee. This is a meaningful security regression for all existing deployments. The missing idempotency guard in telemetry.js is a secondary crash risk. The other two issues (exception-path metrics, unpinned version) are lower-severity but worth fixing.
  • services/proxy/Caddyfile (security) and services/platform/telemetry.js (crash on double-init) need attention before merging.

Important Files Changed

| Filename | Overview |
| --- | --- |
| services/proxy/Caddyfile | Adds metrics proxy routes with bearer token auth; the __disabled__ default is a publicly-known static string that acts as an accessible backdoor token in open-source deployments where METRICS_BEARER_TOKEN is unset. |
| services/platform/telemetry.js | New Node.js telemetry module using prom-client; lacks an idempotency guard unlike the three Python equivalents, which would cause a fatal error if initTelemetry is called more than once. |
| services/crawler/app/telemetry.py | Well-structured Prometheus middleware with idempotency and multiprocess-mode support; exception-path requests are not counted (shared pattern across all three Python services). |
| services/rag/app/telemetry.py | Identical implementation to crawler/app/telemetry.py; same exception-path metric gap applies. |
| services/operator/app/telemetry.py | Identical implementation to crawler/app/telemetry.py; same exception-path metric gap applies. |
| services/platform/package.json | Adds prom-client as a runtime dependency; uses a caret range (^15.1.0) inconsistent with the exact-version pinning used for all other dependencies. |
| services/crawler/tests/test_telemetry.py | Comprehensive test suite covering endpoint, HTTP metrics, path-template cardinality, and idempotency; uses a proper autouse fixture to reset global module state between tests. |
| .env.example | Adds well-documented Prometheus scrape config example; correctly notes the feature is disabled by default. |

Sequence Diagram

sequenceDiagram
    participant P as Prometheus
    participant C as Caddy proxy
    participant S as Internal Service

    P->>C: GET /metrics/crawler Authorization: Bearer token
    Note over C: metricsAuth matcher checks path + header
    alt Token matches METRICS_BEARER_TOKEN
        C->>C: rewrite to /metrics
        C->>S: GET /metrics internal
        S-->>C: 200 Prometheus text
        C-->>P: 200 Prometheus text
    else No token or wrong token
        Note over C: Falls through to metricsBlock
        C-->>P: 404 Not Found
    end
    Note over S: Python services use _MetricsMiddleware for HTTP metrics
    Note over S: Node.js platform exposes process metrics only

Last reviewed commit: 518a6f4

greptile-apps Bot left a comment


17 files reviewed, 4 comments

Comment thread services/proxy/Caddyfile Outdated
Comment thread services/platform/telemetry.js Outdated
Comment thread services/platform/package.json
greptile-apps Bot commented Feb 28, 2026

Additional Comments (1)

services/crawler/app/telemetry.py
Exception-level errors are invisible in request metrics

The metrics are only recorded after call_next returns a Response. If the downstream handler raises an unhandled exception (which Starlette converts to a 500 response outside this middleware), duration and the label increment are never reached. Real-world HTTP 500s that originate from exceptions will be absent from http_requests_total{status="500"}.

The same pattern is present in all three Python telemetry.py files (rag, operator, crawler).

Wrapping the call in try/finally, with the status code defaulting to 500, records the request even when an exception propagates:

start = time.perf_counter()
status_code = 500
try:
    response = await call_next(request)
    status_code = response.status_code
    return response
finally:
    duration = time.perf_counter() - start
    path_template = _get_path_template(request)
    if _request_count is not None:
        _request_count.labels(method=request.method, path_template=path_template, status=status_code).inc()
    if _request_duration is not None:
        _request_duration.labels(method=request.method, path_template=path_template).observe(duration)

This issue also applies to services/operator/app/telemetry.py (lines 116-129) and services/rag/app/telemetry.py (lines 116-129).
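
The effect of the try/finally pattern can be checked without Starlette or prometheus_client. The following is a simplified stand-in, not the middleware from the PR: `record_request`, `ok_handler`, and `failing_handler` are hypothetical names, `call_next` takes no arguments and returns a bare status code, and a `collections.Counter` plays the role of http_requests_total.

```python
import asyncio
import time
from collections import Counter
from typing import Awaitable, Callable


async def record_request(
    counts: Counter,
    method: str,
    path_template: str,
    call_next: Callable[[], Awaitable[int]],
) -> int:
    """Run the handler and count the request even if the handler raises."""
    start = time.perf_counter()
    status_code = 500  # assumed status when an exception propagates
    try:
        status_code = await call_next()
        return status_code
    finally:
        _duration = time.perf_counter() - start  # would feed the duration histogram
        counts[(method, path_template, status_code)] += 1


async def ok_handler() -> int:
    return 200


async def failing_handler() -> int:
    raise RuntimeError("handler failed")


if __name__ == "__main__":
    counts: Counter = Counter()
    asyncio.run(record_request(counts, "GET", "/items", ok_handler))
    try:
        asyncio.run(record_request(counts, "GET", "/items", failing_handler))
    except RuntimeError:
        pass  # the exception still propagates after the count is recorded
    print(dict(counts))
```

Because the bookkeeping lives in `finally`, the exception is re-raised unchanged, so error handling further up the stack behaves as before.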

larryro force-pushed the feat/telemetry-metrics branch 2 times, most recently from 67828c4 to e8551ee, on March 2, 2026 12:42
larryro added 7 commits March 2, 2026 20:47
  • Add lightweight Prometheus metrics endpoints to crawler, rag, operator,
    and platform services using prometheus_client (Python) and prom-client
    (Node.js). Each service exposes GET /metrics on its internal port with
    HTTP request count/duration and process-level metrics.

    Caddy proxies /metrics/{service} with Bearer token auth, gated behind
    METRICS_BEARER_TOKEN env var (disabled by default). Supports scraping
    by Prometheus, Grafana Alloy, Datadog Agent, and Grafana Cloud.

  • Replace Express-based telemetry.js with telemetry.ts compatible with
    Bun.serve(), following the server.js → server.ts migration on main.

  • …s in Docker

    Move init_telemetry() from lifespan handler to after routes/middleware
    registration across crawler, operator, and rag services. Add missing
    telemetry.ts to platform Dockerfile COPY steps.

  • Route /metrics/convex through Caddy to the Convex backend on port 3210,
    protected by the same bearer-token auth as the other metrics endpoints.

  • Document Prometheus endpoints, bearer token auth, and scrape config
    for all Tale services.

  • …ry package

    Replace duplicated telemetry modules in crawler, rag, and operator with a
    shared tale_telemetry package. Harden platform telemetry with idempotent
    init, shutdown support, and error handling. Fix proxy metrics auth to
    require a non-empty bearer token.

  • Sort tale_telemetry imports into third-party group for ruff isort
    compliance and replace bun:test with vitest in telemetry.test.ts.
larryro force-pushed the feat/telemetry-metrics branch from e8551ee to c6ea60d on March 2, 2026 12:47
larryro merged commit 95c6dad into main on Mar 2, 2026
16 checks passed
larryro deleted the feat/telemetry-metrics branch on March 2, 2026 12:55