feat: wire up HTTP metrics server from saorsa-core HealthServer#24
feat: wire up HTTP metrics server from saorsa-core HealthServer#24jacderida wants to merge 5 commits intoWithAutonomi:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR wires saorsa-core’s HealthServer/HealthManager into saorsa-node, exposing health, readiness, metrics, and debug endpoints over HTTP with configurable bind host/port.
Changes:
- Moved metrics server configuration to
NodeConfig(metrics_port) and addedmetrics_host(plus corresponding CLI/env wiring). - Added a
HealthManagerinNodeBuilder::build()and registered component health checkers backed byP2PNodedata sources. - Spawned/shutdown the
HealthServerlifecycle alongside the running node.
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/node.rs |
Initializes health checkers and starts/stops the HTTP health/metrics server task based on metrics config. |
src/config.rs |
Adds metrics_port/metrics_host to NodeConfig and removes metrics_port from PaymentConfig. |
src/bin/saorsa-node/cli.rs |
Adds --metrics-host and maps CLI metrics settings into NodeConfig. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Greptile SummaryThis PR wires up Two issues were found:
Confidence Score: 3/5
|
| Filename | Overview |
|---|---|
| src/node.rs | Wires up HealthManager with 5 component checkers and spawns HealthServer as a background task with graceful oneshot shutdown; two of the five checkers ("network" and "peers") use identical data sources (peer_count), which may produce redundant/indistinguishable health metrics. |
| src/config.rs | Moves metrics_port from PaymentConfig to NodeConfig and adds metrics_host with correct serde defaults; struct change is correct, but the matching config template (production.toml) was not updated, leaving a silent misconfiguration trap for operators. |
| src/bin/saorsa-node/cli.rs | Adds --metrics-host CLI arg and moves metrics_port/metrics_host assignment outside the PaymentConfig constructor; clean change with correct env-var bindings and default values. |
Sequence Diagram
sequenceDiagram
participant Main as saorsa-node main
participant Builder as NodeBuilder::build()
participant P2P as P2PNode (Arc)
participant HM as HealthManager
participant RN as RunningNode::run()
participant HS as HealthServer (tokio task)
Main->>Builder: NodeBuilder::new(config)
Builder->>P2P: P2PNode::new(core_config)
Builder->>HM: HealthManager::new(version)
Builder->>HM: register_checker("dht", DhtHealthChecker)
Builder->>HM: register_checker("network", NetworkHealthChecker)
Builder->>HM: register_checker("transport", TransportHealthChecker)
Builder->>HM: register_checker("peers", PeerHealthChecker)
Builder->>HM: register_checker("storage", StorageHealthChecker)
Builder-->>Main: RunningNode { health_manager, health_shutdown_tx: None, health_handle: None }
Main->>RN: run()
RN->>P2P: start()
RN->>HS: HealthServer::new(health_manager, addr)
Note right of HS: Spawned as background tokio task
HS-->>RN: (health_server, shutdown_tx)
RN->>HS: tokio::spawn(health_server.run())
Note over HS: Serves /health /ready /metrics /debug/vars
Main->>RN: shutdown signal (Ctrl-C / SIGTERM)
RN->>HS: shutdown_tx.send(())
RN->>HS: handle.await (graceful wait)
RN->>P2P: shutdown()
Comments Outside Diff (1)
-
config/production.toml, line 42-43 (link)metrics_portstill under[payment]— silently ignored after this PRmetrics_porthas been moved fromPaymentConfigtoNodeConfig(root level) insrc/config.rs, butconfig/production.tomlwas not updated. The value on line 43 now lives inside the[payment]TOML table, and sincePaymentConfigno longer defines ametrics_portfield (and has no#[serde(deny_unknown_fields)]), serde will silently discard it during deserialization.This means:
- Any operator who customised
metrics_portto a non-default value in their config file will silently lose that setting after upgrading, and the server will revert to port 9100. - The config template is now misleading — new operators following the template will incorrectly place
metrics_port(and the newmetrics_host) under[payment]where they have no effect.
The fix is to move
metrics_portout of the[payment]block and addmetrics_hostat the root level:Prompt To Fix With AI
This is a comment left during a code review. Path: config/production.toml Line: 42-43 Comment: **`metrics_port` still under `[payment]` — silently ignored after this PR** `metrics_port` has been moved from `PaymentConfig` to `NodeConfig` (root level) in `src/config.rs`, but `config/production.toml` was not updated. The value on line 43 now lives inside the `[payment]` TOML table, and since `PaymentConfig` no longer defines a `metrics_port` field (and has no `#[serde(deny_unknown_fields)]`), serde will silently discard it during deserialization. This means: - Any operator who customised `metrics_port` to a non-default value in their config file will silently lose that setting after upgrading, and the server will revert to port 9100. - The config template is now misleading — new operators following the template will incorrectly place `metrics_port` (and the new `metrics_host`) under `[payment]` where they have no effect. The fix is to move `metrics_port` out of the `[payment]` block and add `metrics_host` at the root level: How can I resolve this? If you propose a fix, please make it concise.
- Any operator who customised
Prompt To Fix All With AI
This is a comment left during a code review.
Path: config/production.toml
Line: 42-43
Comment:
**`metrics_port` still under `[payment]` — silently ignored after this PR**
`metrics_port` has been moved from `PaymentConfig` to `NodeConfig` (root level) in `src/config.rs`, but `config/production.toml` was not updated. The value on line 43 now lives inside the `[payment]` TOML table, and since `PaymentConfig` no longer defines a `metrics_port` field (and has no `#[serde(deny_unknown_fields)]`), serde will silently discard it during deserialization.
This means:
- Any operator who customised `metrics_port` to a non-default value in their config file will silently lose that setting after upgrading, and the server will revert to port 9100.
- The config template is now misleading — new operators following the template will incorrectly place `metrics_port` (and the new `metrics_host`) under `[payment]` where they have no effect.
The fix is to move `metrics_port` out of the `[payment]` block and add `metrics_host` at the root level:
```suggestion
# Prometheus metrics / health-server address (0 to disable)
metrics_port = 9100
metrics_host = "0.0.0.0"
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/node.rs
Line: 173-207
Comment:
**`"network"` and `"peers"` health checkers use identical data sources**
Both the `NetworkHealthChecker` (registered as `"network"`) and the `PeerHealthChecker` (registered as `"peers"`) are given closures that return exactly the same value — `p2p.peer_count().await`. If these two checker types interpret the supplied value differently (e.g., one checking liveness vs. the other reporting a gauge), the duplication may be harmless. However, if they apply equivalent thresholds, the `/metrics` and `/health` endpoints will expose two separate components that are always in lockstep: any change that makes one healthy/unhealthy will do the same to the other, making the distinction meaningless to operators.
Consider whether `"network"` should instead reflect a different data source such as network reachability or transport status (e.g., `p2p.is_running()`), or whether one of the two checkers should be removed entirely if they genuinely duplicate the same metric.
How can I resolve this? If you propose a fix, please make it concise.Last reviewed commit: 508e8e5
508e8e5 to
cab5dc2
Compare
cab5dc2 to
5b5cb0e
Compare
There was a problem hiding this comment.
Pull request overview
Wires saorsa-core’s HTTP health/metrics server into saorsa-node, adds configuration for metrics bind host/port, and hooks server lifecycle into node startup/shutdown.
Changes:
- Move metrics configuration to
NodeConfig(addmetrics_host, migrate deprecatedpayment.metrics_port). - Add
HealthManagerinitialization with component health checkers and spawnHealthServeras a background task. - Extend CLI with
--metrics-hostand plumb metrics settings into runtime config.
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/node.rs |
Builds a HealthManager, registers health checkers, spawns/shuts down HealthServer based on configured metrics address. |
src/config.rs |
Adds metrics_port/metrics_host to NodeConfig and migrates deprecated payment.metrics_port during config file load. |
src/bin/saorsa-node/cli.rs |
Adds --metrics-host and maps CLI metrics settings onto NodeConfig. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Connect the existing metrics_port config to saorsa-core's HealthServer, exposing /health, /ready, /metrics, and /debug/vars endpoints. - Move metrics_port from PaymentConfig to NodeConfig (semantically correct) - Add metrics_host field to NodeConfig and --metrics-host CLI arg - Instantiate HealthManager with 5 component checkers (DHT, network, transport, peers, storage) using P2PNode data sources - Spawn HealthServer as background task with graceful shutdown - Disable metrics server when port is 0 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5b5cb0e to
790a3c4
Compare
Build a complete metrics pipeline with two data paths: - Event-driven: MetricsAggregator processes MetricEvents from saorsa-core into atomic counters and sliding windows (lookups, DHT ops, auth, streams, storage, peer connections) - Pull-based: SnapshotCollector reads state snapshots from saorsa-core accessors on each /metrics scrape (DHT health, security, trust, placement, transport, EigenTrust scores) Replace saorsa-core's HealthServer with our own Axum server that combines health component metrics with ~80 domain metric families in Prometheus text exposition format on /metrics. Update saorsa-core dependency to git branch feat-metrics_event_channel which provides the MetricEvent channel and accessor methods. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Wires up an HTTP health/metrics server for saorsa-node using saorsa-core health data plus new node-side Prometheus metrics, and updates configuration/CLI to support metrics_host + top-level metrics_port with backward-compatible migration.
Changes:
- Add an Axum-based background HTTP server exposing
/health,/ready,/metrics, and/debug/vars, with graceful shutdown. - Introduce a
metricsmodule (event-driven aggregator + pull-based snapshot collector + Prometheus formatter) and hook it into node runtime. - Move
metrics_porttoNodeConfig, addmetrics_host, and implement deprecated config migration; update bootstrap peer conversion toMultiAddr.
Reviewed changes
Copilot reviewed 9 out of 11 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
Cargo.toml |
Switches saorsa-core dependency source and adds axum for the HTTP server. |
src/config.rs |
Adds metrics_port/metrics_host to NodeConfig, deprecates payment.metrics_port, and migrates legacy config. |
src/bin/saorsa-node/cli.rs |
Adds --metrics-host and maps CLI metrics settings into NodeConfig. |
src/node.rs |
Builds health manager + metrics components; spawns/shuts down Axum server; subscribes to metric events. |
src/metrics/mod.rs |
Exposes new metrics submodules publicly. |
src/metrics/aggregator.rs |
Implements event-driven counters/sliding windows for metrics. |
src/metrics/snapshot.rs |
Implements pull-based snapshot collection from saorsa-core accessors/collectors. |
src/metrics/prometheus.rs |
Formats aggregator + snapshot into Prometheus text exposition. |
src/devnet.rs |
Updates bootstrap peers conversion to MultiAddr. |
src/bin/saorsa-cli/main.rs |
Updates bootstrap peers conversion to MultiAddr. |
src/lib.rs |
Exports the new metrics module. |
Comments suppressed due to low confidence (1)
src/node.rs:848
start_protocol_routing()returns early whenant_protocolisNone, which also disables the P2P event subscription used to updateMetricsAggregatorpeer connection gauges. If metrics should work when storage is disabled, consider splitting peer connect/disconnect tracking into a separate task that always subscribes toP2PEvents (independent of protocol routing).
fn start_protocol_routing(&mut self) {
let protocol = match self.ant_protocol {
Some(ref p) => Arc::clone(p),
None => return,
};
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Already emitted distribution above from cached scores if available; | ||
| // only emit the collector's distribution if we didn't have live scores. | ||
| if trust_scores.is_none() { | ||
| writeln!( | ||
| out, |
| pub fn migrate_deprecated(&mut self) { | ||
| if let Some(port) = self.payment.metrics_port.take() { | ||
| if self.metrics_port == default_metrics_port() { | ||
| tracing::warn!( | ||
| "Deprecated: `payment.metrics_port` is now a top-level field `metrics_port`. \ |
| let app = Router::new() | ||
| .route("/health", get(health_handler)) | ||
| .route("/ready", get(ready_handler)) | ||
| .route("/metrics", get(metrics_handler)) | ||
| .route("/debug/vars", get(debug_handler)) |
| [dependencies] | ||
| # Core (provides EVERYTHING: networking, DHT, security, trust, storage) | ||
| saorsa-core = "0.14.1" | ||
| saorsa-core = { git = "https://github.com/jacderida/saorsa-core", branch = "feat-metrics_event_channel" } |
| /// Build the snapshot collector, wiring in live saorsa-core components. | ||
| fn build_snapshot_collector(p2p_node: &Arc<P2PNode>) -> SnapshotCollector { | ||
| // DhtMetricsCollector, TrustMetricsCollector, PlacementMetricsCollector | ||
| // are standalone instances — they serve as the canonical source for | ||
| // snapshot metrics and will be populated as the DHT layer reports data. | ||
| let dht_health = Arc::new(DhtMetricsCollector::new()); | ||
| let trust = Arc::new(TrustMetricsCollector::new()); | ||
| let placement = Arc::new(PlacementMetricsCollector::new()); |
| Err(e) => ( | ||
| axum::http::StatusCode::INTERNAL_SERVER_ERROR, | ||
| [(header::CONTENT_TYPE, "application/json")], | ||
| format!("{{\"error\":\"{e}\"}}"), | ||
| ), |
| Err(e) => ( | ||
| axum::http::StatusCode::INTERNAL_SERVER_ERROR, | ||
| [(header::CONTENT_TYPE, "application/json")], | ||
| format!("{{\"error\":\"{e}\"}}"), | ||
| ), |
| Err(e) => ( | ||
| axum::http::StatusCode::INTERNAL_SERVER_ERROR, | ||
| [(header::CONTENT_TYPE, "application/json")], | ||
| format!("{{\"error\":\"{e}\"}}"), | ||
| ), |
Extend the metrics pipeline with ~25 new metric families: - Handshake latency percentiles (PQ key exchange timing) - Separate DHT put/get latency percentiles (p50/p95/p99) - Operations per second (derived from total ops / uptime) - Extended transport stats: connection success/failure counts, byte counters, NAT traversal success rate, connection pool size - Connection failure breakdown by reason (labeled counter) - Replication timing: repair cycle duration, keys repaired, bytes transferred, grace period expiry tracking Update saorsa-core dependency to feat-metrics_phase2 branch which provides new MetricEvent variants (ConnectionEstablished, ConnectionFailed, HandshakeCompleted, ReplicationStarted, ReplicationCompleted, GracePeriodExpired) and extended TransportStats. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move start_metric_event_loop() before p2p_node.start() so that connection and handshake MetricEvents emitted during startup are captured by the aggregator instead of being lost. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR wires up an HTTP health/metrics server for saorsa-node, adds a metrics aggregation pipeline (event-driven + pull-based snapshots), and updates configuration/CLI to support a top-level metrics bind address/port with backward-compatible migration.
Changes:
- Add an Axum-based HTTP server serving
/health,/ready,/metrics, and/debug/vars, with graceful shutdown. - Introduce a new
metricsmodule:MetricsAggregator(event-driven),SnapshotCollector(pull-based), andPrometheusFormatter(Prometheus text output). - Move
metrics_porttoNodeConfig, addmetrics_host, and implement deprecated config migration frompayment.metrics_port.
Reviewed changes
Copilot reviewed 9 out of 11 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
src/node.rs |
Initializes health/metrics components, spawns Axum server, drains metric events, and tracks peer connect/disconnect in metrics. |
src/config.rs |
Adds NodeConfig.metrics_port/metrics_host, deprecates payment.metrics_port, and migrates old config fields on load. |
src/bin/saorsa-node/cli.rs |
Adds --metrics-host and wires CLI metrics settings to NodeConfig. |
src/bin/saorsa-cli/main.rs |
Updates bootstrap peers conversion to MultiAddr. |
src/devnet.rs |
Updates bootstrap peers conversion to MultiAddr. |
src/lib.rs |
Exposes the new metrics module. |
src/metrics/mod.rs |
Declares and re-exports the new metrics components. |
src/metrics/aggregator.rs |
Implements event-driven counters/sliding windows over saorsa-core MetricEvents. |
src/metrics/snapshot.rs |
Implements pull-based snapshot collection from saorsa-core accessors. |
src/metrics/prometheus.rs |
Formats aggregated + snapshot metrics into Prometheus text exposition output (with tests). |
Cargo.toml |
Adds axum and switches saorsa-core to a git branch dependency. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /// Migrate deprecated config fields, logging warnings for any that are found. | ||
| /// | ||
| /// If both the deprecated `payment.metrics_port` and the new top-level | ||
| /// `metrics_port` are set, the top-level value takes precedence and the | ||
| /// deprecated value is ignored with a warning. | ||
| pub fn migrate_deprecated(&mut self) { | ||
| if let Some(port) = self.payment.metrics_port.take() { | ||
| if self.metrics_port == default_metrics_port() { | ||
| tracing::warn!( | ||
| "Deprecated: `payment.metrics_port` is now a top-level field `metrics_port`. \ | ||
| Migrating value {port} automatically — please update your config file." | ||
| ); | ||
| self.metrics_port = port; | ||
| } else { | ||
| tracing::warn!( | ||
| "Deprecated `payment.metrics_port = {port}` ignored — \ | ||
| using top-level `metrics_port = {}`. \ | ||
| Please remove `payment.metrics_port` from your config file.", | ||
| self.metrics_port | ||
| ); | ||
| } | ||
| } |
| self.health_handle = Some(tokio::spawn(async move { | ||
| let app = Router::new() | ||
| .route("/health", get(health_handler)) | ||
| .route("/ready", get(ready_handler)) | ||
| .route("/metrics", get(metrics_handler)) | ||
| .route("/debug/vars", get(debug_handler)) | ||
| .with_state(shared_state); | ||
|
|
||
| let listener = match TcpListener::bind(metrics_addr).await { | ||
| Ok(l) => l, | ||
| Err(e) => { | ||
| error!("Failed to bind metrics server on {metrics_addr}: {e}"); | ||
| return; | ||
| } | ||
| }; | ||
|
|
||
| let server = axum::serve(listener, app).with_graceful_shutdown(async { | ||
| let _ = shutdown_rx.await; | ||
| }); | ||
|
|
||
| if let Err(e) = server.await { | ||
| error!("Metrics server error: {e}"); | ||
| } | ||
| })); | ||
|
|
||
| info!("Metrics server listening on {metrics_addr}"); | ||
| } |
| Err(e) => ( | ||
| axum::http::StatusCode::INTERNAL_SERVER_ERROR, | ||
| [(header::CONTENT_TYPE, "application/json")], | ||
| format!("{{\"error\":\"{e}\"}}"), | ||
| ), |
| /// Build the snapshot collector, wiring in live saorsa-core components. | ||
| fn build_snapshot_collector(p2p_node: &Arc<P2PNode>) -> SnapshotCollector { | ||
| // DhtMetricsCollector, TrustMetricsCollector, PlacementMetricsCollector | ||
| // are standalone instances — they serve as the canonical source for | ||
| // snapshot metrics and will be populated as the DHT layer reports data. | ||
| let dht_health = Arc::new(DhtMetricsCollector::new()); | ||
| let trust = Arc::new(TrustMetricsCollector::new()); | ||
| let placement = Arc::new(PlacementMetricsCollector::new()); | ||
|
|
||
| // SecurityMetricsCollector: standalone instance that will be populated | ||
| // as security events are observed by the DHT layer. | ||
| let security = Arc::new(saorsa_core::dht::metrics::SecurityMetricsCollector::new()); | ||
|
|
||
| let eigentrust = p2p_node.trust_engine(); | ||
|
|
||
| SnapshotCollector::new( | ||
| dht_health, | ||
| security, | ||
| trust, | ||
| placement, | ||
| Arc::clone(p2p_node), | ||
| eigentrust, | ||
| ) |
| [dependencies] | ||
| # Core (provides EVERYTHING: networking, DHT, security, trust, storage) | ||
| saorsa-core = "0.14.1" | ||
| saorsa-core = { git = "https://github.com/jacderida/saorsa-core", branch = "feat-metrics_phase2" } |
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
This will be replaced with another PR. |
Complete the Section 18 test matrix with the remaining scenarios: - #3: Fresh replication stores chunk + updates PaidForList on remote nodes - #9: Fetch retry rotates to alternate source - #10: Fetch retry exhaustion with single source - #11: Repeated ApplicationFailure events decrease peer trust score - #12: Bootstrap node discovers keys stored on multiple peers - #14: Hint construction covers all locally stored keys - #15: Data and PaidForList survive node shutdown (partition) - #17: Neighbor sync request returns valid response (admission test) - #21: Paid-list majority confirmed from multiple peers via verification - #24: PaidNotify propagates paid-list entries after fresh replication - #25: Paid-list convergence verified via majority peer queries - #44: PaidForList persists across restart (cold-start recovery) - #45: PaidForList lost in fresh directory (unrecoverable scenario) All 56 Section 18 scenarios now have test coverage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete the Section 18 test matrix with the remaining scenarios: - #3: Fresh replication stores chunk + updates PaidForList on remote nodes - #9: Fetch retry rotates to alternate source - #10: Fetch retry exhaustion with single source - #11: Repeated ApplicationFailure events decrease peer trust score - #12: Bootstrap node discovers keys stored on multiple peers - #14: Hint construction covers all locally stored keys - #15: Data and PaidForList survive node shutdown (partition) - #17: Neighbor sync request returns valid response (admission test) - #21: Paid-list majority confirmed from multiple peers via verification - #24: PaidNotify propagates paid-list entries after fresh replication - #25: Paid-list convergence verified via majority peer queries - #44: PaidForList persists across restart (cold-start recovery) - #45: PaidForList lost in fresh directory (unrecoverable scenario) All 56 Section 18 scenarios now have test coverage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
metrics_portfromPaymentConfigtoNodeConfigand addmetrics_hostfield (IpAddr, default127.0.0.1) with--metrics-hostCLI argpayment.metrics_portis detected, warned, and migrated automatically; top-level value takes precedence if both are setHealthManagerinNodeBuilder::build()with 4 component health checkers (DHT, transport, peers, storage) wired toP2PNodedata sourcesHealthServeras a background task inRunningNode::run(), serving/health,/ready,/metrics, and/debug/varson the configured addressTest plan
cargo build --releasesucceedscargo test— all tests pass including newmigrate_deprecatedunit testscargo clippy --all-targets --all-features -- -D warningscleancargo fmt --all -- --checkclean/healthreturns JSON health status/readyreturns HTTP 200/503/metricsreturns Prometheus text format with component health gauges/debug/varsreturns JSON with system info, runtime stats, and per-component debug data🤖 Generated with Claude Code