moq-net(stats): aggregate per-node into a single gzipped broadcast#1517
Conversation
Per-broadcast level fan-out produced K*N stats broadcasts (K aggregation
levels times N served broadcasts times M relays) and quickly hit the
browser's 100-subscription ceiling. Collapse to one .stats/node/{name}
broadcast per relay carrying a gzipped JSON map of broadcast path to
cumulative counters for every broadcast active in the last
retention_ticks ticks. Dashboard-shape aggregation moves to a separate
downstream binary.
- Drop the Level / level_keys / advertised_path fan-out machinery; replace
with one BroadcastEntry per path and a single snapshot task that ticks
every tick_secs (default 1s) and writes gzipped JSON frames per the
four existing tier x role tracks (renamed .json.gz).
- GC entries that have no outstanding guards (Arc::strong_count == 1) and
haven't been observed active for retention_ticks (default 10) ticks.
- Replace --stats-levels with --stats-tick-secs and
--stats-retention-ticks. Both Option<T> per the TOML-merge rule;
regression test updated to cover them.
- Document the new wire format, broadcast path, tracks, and flags in
doc/bin/relay/config.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis PR refactors stats aggregation from a multi-level bucketing model to per-broadcast-path publishing with configurable tick intervals and retention windows. The 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@rs/moq-net/src/stats.rs`:
- Around line 687-691: The code advances last_payload (`*last = json`) even when
`track.write_frame(compressed)` fails; change the control flow so `*last = json`
is executed only on success—i.e., call `track.write_frame(compressed)` and if it
returns Err log the error and do not update `last`, otherwise update `last` (for
example, move the `*last = json` into the success branch or use an early
continue/return on error). Ensure you modify the block around
`track.write_frame(compressed)` and the `*last = json` assignment so failed
writes do not advance `last_payload`.
- Around line 635-641: The retention window check is off-by-one: change the
condition that currently uses `current_tick.saturating_sub(last) <
retention_ticks as u64` to use `<=` so it becomes
`current_tick.saturating_sub(last) <= retention_ticks as u64`; this preserves
the intended "linger for N ticks" semantics and ensures `retention_ticks == 0`
still allows emission while `counters.active()` updates `last_tick` (identify
the code around `last_tick`, `current_tick`, `retention_ticks`,
`counters.active()` and the `frame.insert(entry.path.as_str().to_string(),
snap)` insertion).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 698f1649-c811-477a-846e-61356adc8158
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (5)
doc/bin/relay/config.mdrs/moq-net/Cargo.tomlrs/moq-net/src/stats.rsrs/moq-relay/src/config.rsrs/moq-relay/src/stats.rs
CodeRabbit caught two issues: 1. Retention window was `< retention_ticks`, dropping entries one tick earlier than the documented "linger for N ticks after". Also broke retention_ticks=0 (the relay CLI clamps to >= 1, but the library API accepts u32 raw). Now uses `<=` and a separate active-branch insert so retention=0 emits while subs are live and never lingers. 2. `last_payload` advanced even when `track.write_frame()` failed, silently dropping the snapshot on the retry. Now updates only on success so the next tick re-attempts the same payload. Adds retention_boundary_keeps_last_idle_tick to lock in the boundary behaviour precisely (diff==N kept, diff==N+1 GC'd) so a future regression in either direction is caught. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The per-node stats refactor in the prior commit removed --stats-levels; the demo TOMLs still set it and would fail to load under deny_unknown_fields. Also retire the now-stale comment block describing .stats/prefix/<level> paths in favor of the new .stats/node/<node> schema. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-frame payload isn't large enough yet to justify the flate2 dep, and a future moq-lite revision is expected to negotiate transparent compression at the protocol layer. Tracks rename `.json.gz` -> `.json`; subscribers consume the raw JSON directly. Also lowers the default retention window to 1 tick to match the typical reconnect cadence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Actionable comments posted: 0 |
Three billing-grade correctness fixes for the per-node stats producer: 1. Short-lived subscriptions are now surfaced. Previously a sub that opened and closed within a single tick window would never have active() true at a snapshot, so the entry's last_active_tick stayed 0 and it was GC'd silently (all bumps lost). The snapshot task now tracks the last emitted snapshot per (entry, slot) in its local state and includes any entry whose snapshot has changed since the last emission, lingering for retention_ticks past the most recent change. delta_subs > 0 catches sub-tick flickers as a full open/close pair. 2. Counters::snapshot() now reads *_closed atomics before their open counterparts. Every close-bump is preceded in real time by an open-bump, so reading close first guarantees the snapshot satisfies open >= closed under concurrent bumps. Cost: open may be slightly inflated when a bump lands between loads, which is logically valid. 3. broadcasts / broadcasts_closed are now derived in the snapshot task from observed subscription-active transitions (0->1 and 1->0). "broadcasts" is the count of broadcast lifetimes with at least one active sub on that (tier, role) slot, which is what UI and billing want for "active broadcasts". The existing publisher() / subscriber() guard counts move to a new "announced" / "announced_closed" pair for "all broadcasts ever known on this slot, regardless of subs". Wire-format change: each Snapshot gains "announced" / "announced_closed" fields and "broadcasts" semantics changes (see doc/bin/relay/config.md). Downstream consumers that treat counter decreases as session restarts (per the existing reset-handling contract) need no other changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Actionable comments posted: 0 |
… slots Three PR review followups: - Rename `tick_secs` -> `interval` and `retention_ticks` -> `retention` on `StatsConfig`, the CLI flags, the env vars, and the `Stats::new` signature. The shorter names read more naturally; units stay in the field docs (`interval` is seconds, `retention` is intervals). - Replace the `[Counters; 4]` flat array on `BroadcastEntry` with named `publisher` and `subscriber` fields (each `[Counters; 2]` indexed by Tier). Drops the `Role` enum: the bump-path call sites always know the side at compile time, so a runtime enum just to pass that constant was gross. `EntrySnapState` mirrors the same shape for the snapshot task's local state. - Snapshot task gets two small helpers: `process_slot` factors out the per-tick derivation + emit decision, and `within_retention` is shared between the global-entry and local-state GC paths so they always agree. No wire-format change; all 16 stats tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI failed on `cargo doc -D warnings` because `Tier` is public but linked to the private `BroadcastEntry` type. Drop the bracketed link; the surrounding sentence still conveys the structure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five followups from the pre-merge review: - Treat `None` prev_emitted as the default Snapshot for change detection. Before, every entry surfaced in all four tracks with all-zero counters on the first tick (because `None != Some(default)`), so a broadcast with only external-publisher traffic was misleadingly emitted on the three unused tier-side tracks too. Added `unused_slots_dont_surface` to lock the behavior in. - Strengthen ordering on `*_closed` atomics: Acquire on the snapshot loads, Release on the four RAII drops. The previous all-Relaxed pairing guaranteed `open >= closed` only on x86 (where loads are implicitly Acquire). On ARM / POWER / Apple Silicon / AWS Graviton the snapshot could observe `close=1` with `open=0` even though both bumps had already happened. Pairing the close load (Acquire) with the close store (Release) restores the invariant via a synchronizes-with edge, and lets the open / payload counters stay Relaxed for free. Updated the module docs to reflect the actual guarantee. Cost is zero on x86 and a single `ldar` instruction per close-load on ARM. - Stale `retention_ticks` reference in `doc/bin/relay/config.md` updated to `retention` intervals. - Cleaned up the dangling tail of `multiple_subs_count_as_one_broadcast` with a real post-drop assertion (raw subs_closed reaches 2 once both track guards drop) instead of the vestigial drop block + confusing "next assertion" comment. - Renamed `frame_round_trips_as_json` to `frame_emits_expected_counters` (no gzip round-trip layer anymore). 17 stats tests pass; clippy / fmt / shear / doc all clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
.stats/node/{name}broadcast per relay (or.stats/nodewhen unset). The fourtier × roletracks remain, renamed to.json.gz.retention_ticksticks. Tier / role / node are implied by the track and broadcast paths, not repeated in the payload.--stats-levelsis gone; level aggregation moves to a separate downstream binary. New flags:--stats-tick-secs(default1) and--stats-retention-ticks(default10), bothOption<T>per the TOML-merge rule.Why
Per-(level × node) broadcasts grow as
(N + K) × M(broadcasts × levels × relays), and a browser dashboard subscribing to all of them hit the 100-subscription ceiling almost immediately. Per-frame payload now grows linearly in active-broadcast count instead, and gzip handles the repetitive JSON shape so the wire cost stays bounded.Wire format
A broadcast appears in the frame for a given (tier, role) while it has at least one active subscription, and lingers for
retention_ticksticks after the last one drops. Counters are cumulative; downstream computes rates.Implementation notes
BroadcastEntryholds fourCountersand fourlast_active_tickatomics (one pertier × role). The single snapshot task ticks everytick_secs, builds per-slot frames, writes them gzipped, and GCs entries that have no outstanding RAII guards (Arc::strong_count == 1) and are past the retention window. The strong-count check is the safety hatch: a bump on an orphanedArcafter GC can't be silently lost because we never GC while a guard is still alive.broadcast()call and exits after2 × retention_ticksof an empty entry map (next bump respawns it). One broadcast and four tracks per node lifetime.Reviewer notes
Stats::newsignature changed (levels: u32→tick: Duration, retention_ticks: u32). Only moq-relay constructsStatsso the rest of moq-net'sBroadcastStats/ guard surface is unchanged.js/netdoesn't consume the stats wire format today, so no JS sync is needed per the Cross-Package Sync table;doc/bin/relay/config.mdis updated with the new schema, flags, env vars.Test plan
cargo test --workspace --all-targets(sansmoq-gst, which needs system libs locally) — 11 new stats tests cover per-broadcast isolation, tier independence, single-broadcast announce, gzip round-trip, retention-keeps and retention-evicts, prefix/disabled no-ops, plus existing 78 moq-relay tests including the updated TOML/CLI regression test fortick_secs/retention_ticks.cargo clippy --workspace --all-targets -- -D warningscargo fmt --all --checkRUSTDOCFLAGS=-D warnings cargo doc --no-deps --workspacecargo shear/cargo sort --workspace --checkmoq-relay --stats-enabled --stats-node sjc/1, publish a couple broadcasts, subscribe to.stats/node/sjc/1/publisher.json.gzthroughgunzip, drop a publisher and confirm the entry lingers ~10 ticks then disappears.(Written by Claude)