Skip to content

moq-net(stats): aggregate per-node into a single gzipped broadcast#1517

Merged
kixelated merged 8 commits into
mainfrom
claude/youthful-goldberg-990757
May 28, 2026
Merged

moq-net(stats): aggregate per-node into a single gzipped broadcast#1517
kixelated merged 8 commits into
mainfrom
claude/youthful-goldberg-990757

Conversation

@kixelated

Copy link
Copy Markdown
Collaborator

Summary

  • Collapses the per-(level × node) stats fan-out into a single .stats/node/{name} broadcast per relay (or .stats/node when unset). The four tier × role tracks remain, renamed to .json.gz.
  • Each frame is a gzipped JSON map of broadcast path to cumulative counter snapshot for every broadcast active in the last retention_ticks ticks. Tier / role / node are implied by the track and broadcast paths, not repeated in the payload.
  • --stats-levels is gone; level aggregation moves to a separate downstream binary. New flags: --stats-tick-secs (default 1) and --stats-retention-ticks (default 10), both Option<T> per the TOML-merge rule.

Why

Per-(level × node) broadcasts grow as (N + K) × M (broadcasts × levels × relays), and a browser dashboard subscribing to all of them hit the 100-subscription ceiling almost immediately. Per-frame payload now grows linearly in active-broadcast count instead, and gzip handles the repetitive JSON shape so the wire cost stays bounded.

Wire format

// .stats/node/sjc/1, track publisher.json.gz, gunzipped
{
  \"demo/bbb\": { \"broadcasts\": 1, \"broadcasts_closed\": 0, \"subscriptions\": 5,
                 \"subscriptions_closed\": 2, \"bytes\": 12345, \"frames\": 678, \"groups\": 9 },
  \"anon/foo\": { \"broadcasts\": 1, \"broadcasts_closed\": 0, \"subscriptions\": 2,
                 \"subscriptions_closed\": 0, \"bytes\": 234,   \"frames\": 12,  \"groups\": 1 }
}

A broadcast appears in the frame for a given (tier, role) while it has at least one active subscription, and lingers for retention_ticks ticks after the last one drops. Counters are cumulative; downstream computes rates.

Implementation notes

  • BroadcastEntry holds four Counters and four last_active_tick atomics (one per tier × role). The single snapshot task ticks every tick_secs, builds per-slot frames, writes them gzipped, and GCs entries that have no outstanding RAII guards (Arc::strong_count == 1) and are past the retention window. The strong-count check is the safety hatch: a bump on an orphaned Arc after GC can't be silently lost because we never GC while a guard is still alive.
  • The task lazily spawns on the first broadcast() call and exits after 2 × retention_ticks of an empty entry map (next bump respawns it). One broadcast and four tracks per node lifetime.
  • Idle-frame skipping is preserved by byte-comparing the per-slot JSON payload to the last emitted one.

Reviewer notes

  • Stats::new signature changed (levels: u32tick: Duration, retention_ticks: u32). Only moq-relay constructs Stats so the rest of moq-net's BroadcastStats / guard surface is unchanged.
  • js/net doesn't consume the stats wire format today, so no JS sync is needed per the Cross-Package Sync table; doc/bin/relay/config.md is updated with the new schema, flags, env vars.

Test plan

  • cargo test --workspace --all-targets (sans moq-gst, which needs system libs locally) — 11 new stats tests cover per-broadcast isolation, tier independence, single-broadcast announce, gzip round-trip, retention-keeps and retention-evicts, prefix/disabled no-ops, plus existing 78 moq-relay tests including the updated TOML/CLI regression test for tick_secs / retention_ticks.
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo fmt --all --check
  • RUSTDOCFLAGS=-D warnings cargo doc --no-deps --workspace
  • cargo shear / cargo sort --workspace --check
  • End-to-end against a local relay: run moq-relay --stats-enabled --stats-node sjc/1, publish a couple broadcasts, subscribe to .stats/node/sjc/1/publisher.json.gz through gunzip, drop a publisher and confirm the entry lingers ~10 ticks then disappears.

(Written by Claude)

Per-broadcast level fan-out produced K*N stats broadcasts (K aggregation
levels times N served broadcasts times M relays) and quickly hit the
browser's 100-subscription ceiling. Collapse to one .stats/node/{name}
broadcast per relay carrying a gzipped JSON map of broadcast path to
cumulative counters for every broadcast active in the last
retention_ticks ticks. Dashboard-shape aggregation moves to a separate
downstream binary.

- Drop the Level / level_keys / advertised_path fan-out machinery; replace
  with one BroadcastEntry per path and a single snapshot task that ticks
  every tick_secs (default 1s) and writes gzipped JSON frames per the
  four existing tier x role tracks (renamed .json.gz).
- GC entries that have no outstanding guards (Arc::strong_count == 1) and
  haven't been observed active for retention_ticks (default 10) ticks.
- Replace --stats-levels with --stats-tick-secs and
  --stats-retention-ticks. Both Option<T> per the TOML-merge rule;
  regression test updated to cover them.
- Document the new wire format, broadcast path, tracks, and flags in
  doc/bin/relay/config.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

This PR refactors stats aggregation from a multi-level bucketing model to per-broadcast-path publishing with configurable tick intervals and retention windows. The Stats::new() signature changes to accept tick: Duration and retention_ticks: u32 instead of levels: u32. A single shared publisher task now builds four gzipped JSON frames per tick—one for each (tier, role) combination—mapping broadcast paths to cumulative counter snapshots. Idle frames are skipped when JSON bytes match the previously emitted frame. Entries are garbage-collected when they have no active subscriptions and fall outside the retention window. Configuration is added to relay's TOML/CLI, documentation updated, and demo configs migrated to the new tick/retention model.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically summarizes the main change: collapsing per-level stats into a single per-node gzipped broadcast.
Description check ✅ Passed The description is comprehensively related to the changeset, explaining the motivation, wire format, implementation details, API changes, and test coverage.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
✨ Simplify code
  • Create PR with simplified code
  • Commit simplified code in branch claude/youthful-goldberg-990757

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rs/moq-net/src/stats.rs`:
- Around line 687-691: The code advances last_payload (`*last = json`) even when
`track.write_frame(compressed)` fails; change the control flow so `*last = json`
is executed only on success—i.e., call `track.write_frame(compressed)` and if it
returns Err log the error and do not update `last`, otherwise update `last` (for
example, move the `*last = json` into the success branch or use an early
continue/return on error). Ensure you modify the block around
`track.write_frame(compressed)` and the `*last = json` assignment so failed
writes do not advance `last_payload`.
- Around line 635-641: The retention window check is off-by-one: change the
condition that currently uses `current_tick.saturating_sub(last) <
retention_ticks as u64` to use `<=` so it becomes
`current_tick.saturating_sub(last) <= retention_ticks as u64`; this preserves
the intended "linger for N ticks" semantics and ensures `retention_ticks == 0`
still allows emission while `counters.active()` updates `last_tick` (identify
the code around `last_tick`, `current_tick`, `retention_ticks`,
`counters.active()` and the `frame.insert(entry.path.as_str().to_string(),
snap)` insertion).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 698f1649-c811-477a-846e-61356adc8158

📥 Commits

Reviewing files that changed from the base of the PR and between e9a1086 and 07a2584.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • doc/bin/relay/config.md
  • rs/moq-net/Cargo.toml
  • rs/moq-net/src/stats.rs
  • rs/moq-relay/src/config.rs
  • rs/moq-relay/src/stats.rs

Comment thread rs/moq-net/src/stats.rs Outdated
Comment thread rs/moq-net/src/stats.rs Outdated
kixelated and others added 3 commits May 27, 2026 12:29
CodeRabbit caught two issues:

1. Retention window was `< retention_ticks`, dropping entries one tick
   earlier than the documented "linger for N ticks after". Also broke
   retention_ticks=0 (the relay CLI clamps to >= 1, but the library API
   accepts u32 raw). Now uses `<=` and a separate active-branch insert
   so retention=0 emits while subs are live and never lingers.

2. `last_payload` advanced even when `track.write_frame()` failed,
   silently dropping the snapshot on the retry. Now updates only on
   success so the next tick re-attempts the same payload.

Adds retention_boundary_keeps_last_idle_tick to lock in the boundary
behaviour precisely (diff==N kept, diff==N+1 GC'd) so a future regression
in either direction is caught.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The per-node stats refactor in the prior commit removed --stats-levels;
the demo TOMLs still set it and would fail to load under
deny_unknown_fields. Also retire the now-stale comment block describing
.stats/prefix/<level> paths in favor of the new .stats/node/<node>
schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-frame payload isn't large enough yet to justify the flate2 dep, and a
future moq-lite revision is expected to negotiate transparent compression
at the protocol layer. Tracks rename `.json.gz` -> `.json`; subscribers
consume the raw JSON directly. Also lowers the default retention window
to 1 tick to match the typical reconnect cadence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Actionable comments posted: 0

Three billing-grade correctness fixes for the per-node stats producer:

1. Short-lived subscriptions are now surfaced. Previously a sub that
   opened and closed within a single tick window would never have
   active() true at a snapshot, so the entry's last_active_tick stayed 0
   and it was GC'd silently (all bumps lost). The snapshot task now
   tracks the last emitted snapshot per (entry, slot) in its local
   state and includes any entry whose snapshot has changed since the
   last emission, lingering for retention_ticks past the most recent
   change. delta_subs > 0 catches sub-tick flickers as a full
   open/close pair.

2. Counters::snapshot() now reads *_closed atomics before their open
   counterparts. Every close-bump is preceded in real time by an
   open-bump, so reading close first guarantees the snapshot satisfies
   open >= closed under concurrent bumps. Cost: open may be slightly
   inflated when a bump lands between loads, which is logically valid.

3. broadcasts / broadcasts_closed are now derived in the snapshot task
   from observed subscription-active transitions (0->1 and 1->0).
   "broadcasts" is the count of broadcast lifetimes with at least one
   active sub on that (tier, role) slot, which is what UI and billing
   want for "active broadcasts". The existing publisher() / subscriber()
   guard counts move to a new "announced" / "announced_closed" pair for
   "all broadcasts ever known on this slot, regardless of subs".

Wire-format change: each Snapshot gains "announced" / "announced_closed"
fields and "broadcasts" semantics changes (see doc/bin/relay/config.md).
Downstream consumers that treat counter decreases as session restarts
(per the existing reset-handling contract) need no other changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Actionable comments posted: 0

Comment thread doc/bin/relay/config.md Outdated
Comment thread doc/bin/relay/config.md Outdated
Comment thread rs/moq-net/src/stats.rs Outdated
kixelated and others added 3 commits May 28, 2026 11:51
… slots

Three PR review followups:

- Rename `tick_secs` -> `interval` and `retention_ticks` -> `retention`
  on `StatsConfig`, the CLI flags, the env vars, and the `Stats::new`
  signature. The shorter names read more naturally; units stay in the
  field docs (`interval` is seconds, `retention` is intervals).
- Replace the `[Counters; 4]` flat array on `BroadcastEntry` with named
  `publisher` and `subscriber` fields (each `[Counters; 2]` indexed by
  Tier). Drops the `Role` enum: the bump-path call sites always know
  the side at compile time, so a runtime enum just to pass that
  constant was gross. `EntrySnapState` mirrors the same shape for the
  snapshot task's local state.
- Snapshot task gets two small helpers: `process_slot` factors out the
  per-tick derivation + emit decision, and `within_retention` is shared
  between the global-entry and local-state GC paths so they always
  agree.

No wire-format change; all 16 stats tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI failed on `cargo doc -D warnings` because `Tier` is public but linked
to the private `BroadcastEntry` type. Drop the bracketed link; the
surrounding sentence still conveys the structure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five followups from the pre-merge review:

- Treat `None` prev_emitted as the default Snapshot for change detection.
  Before, every entry surfaced in all four tracks with all-zero counters
  on the first tick (because `None != Some(default)`), so a broadcast
  with only external-publisher traffic was misleadingly emitted on the
  three unused tier-side tracks too. Added `unused_slots_dont_surface`
  to lock the behavior in.

- Strengthen ordering on `*_closed` atomics: Acquire on the snapshot
  loads, Release on the four RAII drops. The previous all-Relaxed pairing
  guaranteed `open >= closed` only on x86 (where loads are implicitly
  Acquire). On ARM / POWER / Apple Silicon / AWS Graviton the snapshot
  could observe `close=1` with `open=0` even though both bumps had
  already happened. Pairing the close load (Acquire) with the close
  store (Release) restores the invariant via a synchronizes-with edge,
  and lets the open / payload counters stay Relaxed for free. Updated
  the module docs to reflect the actual guarantee. Cost is zero on x86
  and a single `ldar` instruction per close-load on ARM.

- Stale `retention_ticks` reference in `doc/bin/relay/config.md` updated
  to `retention` intervals.

- Cleaned up the dangling tail of `multiple_subs_count_as_one_broadcast`
  with a real post-drop assertion (raw subs_closed reaches 2 once both
  track guards drop) instead of the vestigial drop block + confusing
  "next assertion" comment.

- Renamed `frame_round_trips_as_json` to `frame_emits_expected_counters`
  (no gzip round-trip layer anymore).

17 stats tests pass; clippy / fmt / shear / doc all clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kixelated kixelated enabled auto-merge (squash) May 28, 2026 19:48
@kixelated kixelated merged commit be70da4 into main May 28, 2026
1 check passed
@kixelated kixelated deleted the claude/youthful-goldberg-990757 branch May 28, 2026 19:59
@moq-bot moq-bot Bot mentioned this pull request May 28, 2026
@moq-bot moq-bot Bot mentioned this pull request May 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant