feat: wire up Prometheus metrics server#25
feat: wire up Prometheus metrics server#25jacderida wants to merge 11 commits intoWithAutonomi:mainfrom
Conversation
Connect the existing metrics_port config to saorsa-core's HealthServer, exposing /health, /ready, /metrics, and /debug/vars endpoints. - Move metrics_port from PaymentConfig to NodeConfig (semantically correct) - Add metrics_host field to NodeConfig and --metrics-host CLI arg - Instantiate HealthManager with 5 component checkers (DHT, network, transport, peers, storage) using P2PNode data sources - Spawn HealthServer as background task with graceful shutdown - Disable metrics server when port is 0 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Build a complete metrics pipeline with two data paths: - Event-driven: MetricsAggregator processes MetricEvents from saorsa-core into atomic counters and sliding windows (lookups, DHT ops, auth, streams, storage, peer connections) - Pull-based: SnapshotCollector reads state snapshots from saorsa-core accessors on each /metrics scrape (DHT health, security, trust, placement, transport, EigenTrust scores) Replace saorsa-core's HealthServer with our own Axum server that combines health component metrics with ~80 domain metric families in Prometheus text exposition format on /metrics. Update saorsa-core dependency to git branch feat-metrics_event_channel which provides the MetricEvent channel and accessor methods. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extend the metrics pipeline with ~25 new metric families: - Handshake latency percentiles (PQ key exchange timing) - Separate DHT put/get latency percentiles (p50/p95/p99) - Operations per second (derived from total ops / uptime) - Extended transport stats: connection success/failure counts, byte counters, NAT traversal success rate, connection pool size - Connection failure breakdown by reason (labeled counter) - Replication timing: repair cycle duration, keys repaired, bytes transferred, grace period expiry tracking Update saorsa-core dependency to feat-metrics_phase2 branch which provides new MetricEvent variants (ConnectionEstablished, ConnectionFailed, HandshakeCompleted, ReplicationStarted, ReplicationCompleted, GracePeriodExpired) and extended TransportStats. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move start_metric_event_loop() before p2p_node.start() so that connection and handshake MetricEvents emitted during startup are captured by the aggregator instead of being lost. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- StrategyStats.name replaced with strategy: StrategyChoice enum; format via Debug when rendering Prometheus labels - HandshakeCompleted.duration and ConnectionEstablished.duration are now Option<Duration>; guard with if-let before recording latency - Add match arm for new ConnectionLost metric event variant - Update test fixtures and assertions accordingly Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR adds a Prometheus-compatible metrics/health HTTP server to saorsa-node, plumbing saorsa-core’s metric event stream into an in-process aggregator and formatting both event-driven and snapshot-derived state into Prometheus text exposition.
Changes:
- Add an Axum HTTP server exposing
/health,/ready,/metrics, and/debug/vars, and subscribe to saorsa-coreMetricEvents during node startup. - Introduce a metrics pipeline:
MetricsAggregator(event-driven),SnapshotCollector(pull-based), andPrometheusFormatter(text renderer). - Move metrics config to top-level
NodeConfig(metrics_port,metrics_host) and update CLI + bootstrap peer conversion for updated saorsa-core config types.
Reviewed changes
Copilot reviewed 9 out of 13 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
src/node.rs |
Wires up health manager, metric event loop, and Axum metrics/health server handlers. |
src/metrics/aggregator.rs |
Implements atomic counters + bounded sliding windows for high-frequency metric events. |
src/metrics/snapshot.rs |
Adds pull-based snapshot collection for transport/DHT/security/trust/placement state. |
src/metrics/prometheus.rs |
Renders aggregated + snapshot metrics into Prometheus text exposition (+ tests). |
src/metrics/mod.rs |
Exposes the new metrics modules. |
src/lib.rs |
Exports the new metrics module. |
src/config.rs |
Adds metrics_port/metrics_host to NodeConfig and migrates deprecated payment.metrics_port. |
src/bin/saorsa-node/cli.rs |
Adds CLI flags for metrics host/port and maps them into NodeConfig. |
src/bin/saorsa-cli/main.rs |
Adapts bootstrap peer config to MultiAddr. |
src/devnet.rs |
Adapts bootstrap peer config to MultiAddr. |
Cargo.toml |
Switches saorsa-core to a git branch and adds axum. |
.gitignore |
Stops ignoring Cargo.lock so it can be committed. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Greptile SummaryAdds a Prometheus-compatible metrics HTTP server to saorsa-node, wiring up saorsa-core's
Confidence Score: 3/5
|
| Filename | Overview |
|---|---|
| Cargo.toml | Adds axum = "0.8" for HTTP server and switches saorsa-core to a personal fork branch — should be updated before merge. |
| src/config.rs | Moves metrics_port/metrics_host to NodeConfig with backward-compatible deprecation migration. Well-tested with 4 new tests. |
| src/metrics/aggregator.rs | Event-driven metrics accumulator with bounded sliding windows and atomic counters. Well-structured with comprehensive test coverage (14 tests). Uses Ordering::Relaxed appropriately for counters. |
| src/metrics/prometheus.rs | 1634-line Prometheus text exposition formatter covering 119 metrics across 14 categories. Follows spec correctly with HELP/TYPE/value ordering. Comprehensive tests verify no orphaned headers. |
| src/metrics/snapshot.rs | Pull-based snapshot collector for saorsa-core state. Clean design but depends on standalone metric collector instances that won't receive data from saorsa-core's internals. |
| src/node.rs | Core integration: wires up Axum HTTP server, metric event loop, health manager, and snapshot collector. Has premature log message and uses standalone collector instances that won't be populated. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
subgraph "saorsa-core"
ME[MetricEvent Channel]
PE[P2PEvent Channel]
PN[P2PNode]
TE[EigenTrustEngine]
end
subgraph "saorsa-node metrics pipeline"
MA[MetricsAggregator]
SC[SnapshotCollector]
PF[PrometheusFormatter]
end
subgraph "HTTP Server (Axum)"
HE["/health"]
RE["/ready"]
MET["/metrics"]
DV["/debug/vars"]
end
HM[HealthManager]
ME -->|"subscribe_metric_events()"| MA
PE -->|"PeerConnected/Disconnected"| MA
PN -->|"transport_stats()"| SC
TE -->|"cached_global_trust()"| SC
MA -->|"counters + sliding windows"| PF
SC -->|"point-in-time snapshot"| PF
HM -->|"component health"| HE
HM -->|"readiness status"| RE
PF -->|"text exposition"| MET
HM -->|"PrometheusExporter"| MET
HM -->|"debug info"| DV
Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/node.rs
Line: 837
Comment:
**Misleading log before bind completes**
This `info!` runs synchronously immediately after `tokio::spawn`, before the spawned task has actually called `TcpListener::bind`. If the bind fails (port already in use, permission denied), the log will still say "listening on ..." even though the server never started. Consider moving this log inside the spawned task, after the bind succeeds, to avoid misleading operators.
```suggestion
// Log is emitted inside the spawned task after successful bind.
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/node.rs
Line: 478-500
Comment:
**Standalone collectors will report zero values**
`build_snapshot_collector` creates new, empty instances of `DhtMetricsCollector`, `TrustMetricsCollector`, `PlacementMetricsCollector`, and `SecurityMetricsCollector`. Since these are not the same `Arc`s used internally by saorsa-core's DHT/security layers, they will never be populated and will always report default (zero) values for routing table, replication, security, trust, and placement snapshot metrics.
Only `transport_stats()` (line 418 in `snapshot.rs`) and `trust_engine()` (line 490 here) go through the actual `P2PNode`, so those work correctly.
Is this intentional as a placeholder until saorsa-core exposes its internal collectors, or is there a way to get the actual collector instances from `P2PNode`?
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: Cargo.toml
Line: 35
Comment:
**Dependency on personal fork branch**
`saorsa-core` is pinned to a personal fork (`jacderida/saorsa-core`) on a feature branch. The PR description references a companion PR (saorsa-labs/saorsa-core#46) — once that lands, this should be updated to point to the organization repo with a versioned release or at minimum `saorsa-labs/saorsa-core` on `main`. Merging with a personal fork reference would break builds if the branch is deleted or force-pushed.
How can I resolve this? If you propose a fix, please make it concise.Last reviewed commit: b28cfe9
- Fix all clippy pedantic/nursery lints without blanket #[allow]: - Replace u128-as-u64 casts with u64::try_from().unwrap_or() - Extract helper functions (ratio, percentile_index, duration_to_micros) - Drop RwLock guards early to fix significant_drop_tightening - Split long functions (security, trust, placement, transport metrics) - Use #[expect] for unavoidable u64->f64 precision-loss casts - Fix format! in format_args, pass StreamClass by value, use Option<&T> - Combine identical ConnectionEstablished/ConnectionLost match arms - Fix formatting issues (cargo fmt) - Fix e2e test compilation: convert Vec<SocketAddr> to Vec<MultiAddr> - Address review comments: - Move metrics server log inside spawned task after bind succeeds - Fix JSON error responses with serde_json::json! for proper escaping - Only start metric event loop when metrics are enabled Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use the same Arc collector instances that saorsa-core's DHT layer writes to, instead of creating standalone empty instances. This means snapshot-based metrics (DHT health, security, trust, placement) now reflect live data rather than always reporting zeros. Falls back to standalone instances only when security_dashboard is None (minimal test configurations). Depends on saorsa-core aaad76c (SecurityDashboard accessor methods). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR wires saorsa-core’s metric event channel into an HTTP endpoint that exposes Prometheus-compatible metrics, combining event-driven aggregation with per-scrape snapshots and health endpoints.
Changes:
- Add metrics pipeline modules (
MetricsAggregator,SnapshotCollector,PrometheusFormatter) and expose them viasaorsa_node::metrics. - Add an Axum-based health/metrics server to
RunningNode, including/metrics,/health,/ready, and/debug/vars. - Update configs/CLI/tests to support
metrics_host/metrics_portand adapt bootstrap peer types to saorsa-core’sMultiAddr.
Reviewed changes
Copilot reviewed 11 out of 15 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/e2e/testnet.rs | Convert bootstrap peers to MultiAddr for e2e config compatibility. |
| tests/e2e/live_testnet.rs | Convert bootstrap peers to MultiAddr in live testnet client. |
| src/node.rs | Integrate health manager, metrics aggregation, snapshot collection, and Axum server into node lifecycle. |
| src/metrics/snapshot.rs | Add pull-based snapshot collector for saorsa-core state. |
| src/metrics/prometheus.rs | Add Prometheus text exposition renderer and unit tests. |
| src/metrics/mod.rs | New metrics module wiring + public re-exports. |
| src/metrics/aggregator.rs | Add event-driven metrics accumulator with counters and sliding windows. |
| src/lib.rs | Export the new metrics module. |
| src/devnet.rs | Convert bootstrap peers to MultiAddr in devnet config. |
| src/config.rs | Add top-level metrics_port/metrics_host + deprecated migration from payment.metrics_port. |
| src/bin/saorsa-node/cli.rs | Add CLI args for metrics host/port and map into NodeConfig. |
| src/bin/saorsa-cli/main.rs | Convert bootstrap peers to MultiAddr in client node construction. |
| Cargo.toml | Switch saorsa-core to git dependency and add axum. |
| .gitignore | Stop ignoring Cargo.lock so it can be committed. |
Comments suppressed due to low confidence (1)
src/node.rs:865
connected_peersis only updated insidestart_protocol_routing(), but that function returns early whenant_protocolisNone. This means thep2p_connected_peersgauge (and peer connect/disconnect tracking) will stay at 0 for configs where storage/chunk protocol routing is disabled (e.g., client nodes). Consider subscribing toP2PEvents for peer connect/disconnect in a separate task that always runs when metrics are enabled, or restructurestart_protocol_routing()to always drain events and only gate the chunk-message handling onant_protocolbeing present.
fn start_protocol_routing(&mut self) {
let protocol = match self.ant_protocol {
Some(ref p) => Arc::clone(p),
None => return,
};
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Update doc to accurately describe emission behavior: counters and gauges are always emitted (even when zero) for stable metric names, while optional families like stream bandwidth are conditional. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Set allow_loopback = true in e2e test node config. saorsa-core 0.15.0 defaults to rejecting loopback addresses, which caused all e2e tests to fail with "No remote peers found near target address" since every test node runs on 127.0.0.1. - Update prometheus module doc to accurately describe emission behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds a Prometheus-compatible metrics/health HTTP server to saorsa-node, wiring saorsa-core’s MetricEvent stream plus periodic state snapshots into a single /metrics endpoint for scraping/monitoring.
Changes:
- Introduce a metrics pipeline:
MetricsAggregator(event-driven),SnapshotCollector(pull-based), andPrometheusFormatter(text exposition). - Add an Axum-based HTTP server exposing
/health,/ready,/metrics, and/debug/vars, controlled viametrics_host/metrics_port(0 disables). - Update config/CLI and e2e/devnet/bootstrap wiring to adapt to saorsa-core API changes (SocketAddr → MultiAddr, loopback allowance).
Reviewed changes
Copilot reviewed 11 out of 15 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/e2e/testnet.rs | Allow loopback + convert bootstrap addrs to MultiAddr for core config. |
| tests/e2e/live_testnet.rs | Convert bootstrap addrs to MultiAddr. |
| src/node.rs | Wire health manager, metrics tasks, and Axum metrics/health server into node runtime. |
| src/metrics/aggregator.rs | New event-driven aggregator with counters + bounded sliding windows. |
| src/metrics/snapshot.rs | New pull-based snapshot collector from saorsa-core accessors. |
| src/metrics/prometheus.rs | New Prometheus text formatter + unit tests for emitted output. |
| src/metrics/mod.rs | Export metrics module types. |
| src/lib.rs | Expose metrics module publicly. |
| src/devnet.rs | Convert devnet bootstrap addrs to MultiAddr. |
| src/config.rs | Add metrics_host/metrics_port to NodeConfig and migrate deprecated payment.metrics_port. |
| src/bin/saorsa-node/cli.rs | Add --metrics-host flag and map CLI metrics settings onto NodeConfig. |
| src/bin/saorsa-cli/main.rs | Convert bootstrap addrs to MultiAddr for client node creation. |
| Cargo.toml | Add axum and switch saorsa-core to a git branch dependency. |
| .gitignore | Stop ignoring Cargo.lock (lockfile now tracked). |
Comments suppressed due to low confidence (1)
src/node.rs:865
p2p_connected_peersis driven byMetricsAggregator::record_peer_connected/disconnected(), but those updates only happen insidestart_protocol_routing(). Sincestart_protocol_routing()returns early whenant_protocolisNone, nodes running without storage/protocol routing will always reportp2p_connected_peers = 0even when connected. Consider tracking peer connect/disconnect in a dedicated P2P event task (independent ofant_protocol), or updatingconnected_peersfromMetricEvent::ConnectionEstablished/ConnectionLostand ensuring the two sources don’t double-count.
/// Also tracks peer connect/disconnect events in the metrics aggregator.
fn start_protocol_routing(&mut self) {
let protocol = match self.ant_protocol {
Some(ref p) => Arc::clone(p),
None => return,
};
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /// Formats aggregated + snapshot metrics into Prometheus text exposition format. | ||
| pub struct PrometheusFormatter; | ||
|
|
The test_payment_with_node_failures test shuts down 3 of 10 nodes then tries to store a chunk. On Windows with saorsa-core 0.15.0, DHT routing table convergence after node failures is slower due to loopback/diversity changes. Increase the wait-after-failure and post-warmup sleeps from 15s to 30s to give the routing tables time to adapt. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Metrics will be explored in a more experimental way. |
Complete the Section 18 test matrix with the remaining scenarios: - #3: Fresh replication stores chunk + updates PaidForList on remote nodes - #9: Fetch retry rotates to alternate source - #10: Fetch retry exhaustion with single source - #11: Repeated ApplicationFailure events decrease peer trust score - #12: Bootstrap node discovers keys stored on multiple peers - #14: Hint construction covers all locally stored keys - #15: Data and PaidForList survive node shutdown (partition) - #17: Neighbor sync request returns valid response (admission test) - #21: Paid-list majority confirmed from multiple peers via verification - #24: PaidNotify propagates paid-list entries after fresh replication - #25: Paid-list convergence verified via majority peer queries - #44: PaidForList persists across restart (cold-start recovery) - #45: PaidForList lost in fresh directory (unrecoverable scenario) All 56 Section 18 scenarios now have test coverage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete the Section 18 test matrix with the remaining scenarios: - #3: Fresh replication stores chunk + updates PaidForList on remote nodes - #9: Fetch retry rotates to alternate source - #10: Fetch retry exhaustion with single source - #11: Repeated ApplicationFailure events decrease peer trust score - #12: Bootstrap node discovers keys stored on multiple peers - #14: Hint construction covers all locally stored keys - #15: Data and PaidForList survive node shutdown (partition) - #17: Neighbor sync request returns valid response (admission test) - #21: Paid-list majority confirmed from multiple peers via verification - #24: PaidNotify propagates paid-list entries after fresh replication - #25: Paid-list convergence verified via majority peer queries - #44: PaidForList persists across restart (cold-start recovery) - #45: PaidForList lost in fresh directory (unrecoverable scenario) All 56 Section 18 scenarios now have test coverage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
MetricEventbroadcast channel to a Prometheus-compatible HTTP metrics serverMetricsAggregatorthat consumes high-frequency metric events from saorsa-core into atomic counters and sliding windowsPrometheusRendererthat formats aggregated metrics as Prometheus text expositionMetricsSnapshotfor periodic state polling (transport stats, routing strategy stats, EigenTrust scores)--metrics-port,--metrics-hostNew modules
src/metrics/aggregator.rs— Event-driven metrics accumulator with sliding windows and atomic counterssrc/metrics/prometheus.rs— Prometheus text format renderer (119 metrics across 14 categories)src/metrics/snapshot.rs— Periodic state snapshot from saorsa-core accessorsCommits
790a3c4feat: wire up HTTP metrics server from saorsa-core HealthServer25e65f0feat: add Prometheus metrics aggregation and export pipeline5f122cafeat: add phase 2 metrics — transport, DHT latency, replicationb885e26fix: subscribe to metric events before starting P2P nodee8fa264chore: track Cargo.lock for reproducible buildsb28cfe9fix: adapt to saorsa-core MetricEvent API changesTestnet Validation
These metrics were observed on a live testnet with a corresponding
saorsa-corePR that consumes the event channel and exposes metrics via a Prometheus-compatible HTTP server.Infrastructure
Monitoring Stack
Observed Metrics (119 total across 12 categories)
On an idle network (no data storage or lookups), connection and transport metrics were active while operation-specific metrics remained at zero as expected:
p2p_connected_peers = 100p2p_operations_per_secondp2p_auth_failures_totalTest plan
cargo test— all existing tests pass🤖 Generated with Claude Code