Add KV.benchmark: lean Tsavorite KV throughput benchmark #1823

Merged
badrishc merged 9 commits into main from badrishc/kv-bench
May 25, 2026
@badrishc
Collaborator

A clean, allocation-free, NUMA-aware throughput benchmark for the Tsavorite KV engine. Designed for repeatable engineering numbers (stable mean / low stdev) rather than ad-hoc spot-checks.

Highlights

  • 8-byte synthetic keys generated per-op (no pre-built key arrays).
  • Configurable value size, key count, RUMD ratio (reads/upserts/RMW/deletes), and key distribution (uniform or zipf).
  • Single ObjectAllocator store with Garnet-aligned defaults (16 MB pages, 1 GB segments, 16 KB max-inline value, mutable=0.9).
  • Auto-sized log (fits dataset in mutable region) and index (matching YCSB fill factor); both overridable.
  • Configurable device backend: native (libaio), randomaccess, filestream, null. Native backend ctor extended with an IoBackend selector (Default / Libaio).
  • Separate --load-threads and --run-threads-sweep so a single load can feed a multi-thread-count run sweep.
  • NUMA pinning + ThreadPool auto-tuning + full-GC quiescence before each timed window; hot loop is allocation-free (stackalloc only, 32-byte aligned to match AVX2 copy).
  • Workers coordinate start/end via a single gate + per-thread scoreboard slots padded to 128 bytes; warmup excluded from the timed window.
  • Output: human-readable progress + final summary; optional pretty-printed JSON file (one row per phase) and CSV file. Stdout JSON blobs are off by default to keep logs clean.
  • Always cleans up its data directory on exit.

Adds a new NativeStorageDevice.IoBackend enum (Default, Libaio) and an optional ctor parameter. The shipped libnative_device.so currently uses libaio on Linux for both values; the enum is forward-compatible for additional backends.
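
As a rough illustration of how a RUMD ratio can drive per-operation selection, the sketch below turns four percentages into cumulative cut-offs and maps a random draw to an operation. It is illustrative Python, not the benchmark's C# code; the names `rumd_thresholds` and `pick_op` and the percent-based encoding are assumptions made for this example.

```python
import random

OPS = ("read", "upsert", "rmw", "delete")

def rumd_thresholds(reads, upserts, rmws, deletes):
    """Turn four percentages (summing to 100) into cumulative cut-offs."""
    parts = (reads, upserts, rmws, deletes)
    if any(p < 0 for p in parts) or sum(parts) != 100:
        raise ValueError("RUMD percentages must be non-negative and sum to 100")
    cuts, total = [], 0
    for p in parts:
        total += p
        cuts.append(total)
    return cuts

def pick_op(cuts, draw):
    """Map a draw in [0, 100) to an operation name."""
    for op, cut in zip(OPS, cuts):
        if draw < cut:
            return op
    raise ValueError("draw must be in [0, 100)")

if __name__ == "__main__":
    cuts = rumd_thresholds(50, 50, 0, 0)   # half reads, half upserts
    rng = random.Random(1)                 # fixed seed for repeatability
    counts = {op: 0 for op in OPS}
    for _ in range(10_000):
        counts[pick_op(cuts, rng.randrange(100))] += 1
    print(counts)
```

A category with a zero share gets the same cut-off as its predecessor, so it is never selected.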

badrishc and others added 3 commits May 23, 2026 11:42
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Apply analyzer-suggested fixes:
- IMPORTS: System.* usings ordered before others
- IDE0300/IDE0301: collection initializer syntax (new int[]{...} -> [...])

Pure mechanical reformatting; no behavioural change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The slnx was copied from an in-flight branch that had inadvertently
removed test/test.stress. Restoring so the PR diff against main only
adds KV.benchmark.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@badrishc badrishc marked this pull request as ready for review May 24, 2026 02:07
Copilot AI review requested due to automatic review settings May 24, 2026 02:07
Contributor

Copilot AI left a comment

Pull request overview

Adds a new KV.benchmark project under Tsavorite benchmarks to provide a lean, repeatable KV throughput benchmark (load + RUMD run phases) with NUMA pinning, allocation-free hot loop, and optional JSON/CSV output. Also extends NativeStorageDevice with a forward-compatible Linux async I/O backend selector (IoBackend).

Changes:

  • Add KV.benchmark (CLI parsing, sizing defaults, NUMA pinning, worker orchestration, output emitters, Zipf + RNG helpers).
  • Add NativeStorageDevice.IoBackend enum + optional ctor parameter for future backend selection/reporting.
  • Register the new benchmark project in Tsavorite.slnx.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| libs/storage/Tsavorite/cs/Tsavorite.slnx | Adds the new benchmark/KV.benchmark project to the Tsavorite solution. |
| libs/storage/Tsavorite/cs/src/core/Device/NativeStorageDevice.cs | Introduces IoBackend selector and stores configured value on the device. |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KV.benchmark.csproj | New benchmark project definition (CommandLineParser ref, Tsavorite.core reference, signing settings). |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/Program.cs | Minimal Main that forwards to EntryPoint.Run. |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/EntryPoint.cs | CLI parse/validate, lifecycle orchestration, interrupt handling, ThreadPool tuning restore. |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/Options.cs | CLI options + auto-default resolution (log/index sizing, device selection). |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvBenchmark.cs | Benchmark engine: store/device creation, phase execution, scoreboard-based timing, cleanup. |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvBenchmark.Worker.cs | Worker entrypoint and allocation-free hot loop (uniform/zipf keygen, RUMD selection). |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvBenchmark.Setup.cs | Run directory management, stale cleanup, device construction helper, RUMD helper. |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvBenchmark.Validate.cs | Optional post-load validation pass. |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvSessionFunctions.cs | SpanByte session functions tuned for benchmark semantics (fixed-size read copy, RMW safety). |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvKey.cs | 8-byte synthetic key type (with padding for record alignment). |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvNumaPinning.cs | NUMA CPU discovery + best-effort thread affinity pinning (Linux/Windows). |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/PaddedTypes.cs | 128B-padded counter/flag structs to avoid false sharing. |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvSize.cs | Size parsing/formatting helpers + NextPow2. |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvOutput.cs | Human-readable reporting + shared helpers for JSON/CSV. |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvOutput.Json.cs | JSON schema + emit (compact + pretty). |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvOutput.Csv.cs | CSV schema + emit (wide rows, header-once). |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/XoshiroRng.cs | Per-thread PRNG implementation for Zipf sampling. |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/ZipfGenerator.cs | Zipf constants + sampler used by the benchmark. |
| libs/storage/Tsavorite/cs/benchmark/KV.benchmark/README.md | Benchmark documentation, CLI reference, and worked examples. |

Comment thread libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvBenchmark.cs
Comment thread libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvBenchmark.Worker.cs Outdated
Comment thread libs/storage/Tsavorite/cs/benchmark/KV.benchmark/Options.cs
Comment thread libs/storage/Tsavorite/cs/benchmark/KV.benchmark/Options.cs
Comment thread libs/storage/Tsavorite/cs/benchmark/KV.benchmark/Options.cs
Comment thread libs/storage/Tsavorite/cs/benchmark/KV.benchmark/README.md Outdated
Comment thread libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvBenchmark.cs
Comment thread libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvBenchmark.cs
Comment thread libs/storage/Tsavorite/cs/benchmark/KV.benchmark/README.md Outdated
badrishc and others added 6 commits May 23, 2026 20:04
The load loop uses 'localOps & (kChunkSize - 1) == 0' as a bitmask check
for the scoreboard / CompletePending throttling. With kChunkSize=640
this mask is 639 = 0b1001111111, which is not a contiguous run of low bits: the check
fires whenever localOps mod 1024 is 0, 128, 256, or 384, not every 640 ops as intended.

Changed kChunkSize from 640 to 512 so the bitmask is correct (mask 511
= 0b111111111). 512 is also large enough to amortize the Interlocked.Add
contention in the run loop. The run loop's 'idx % 512 == 0' for
CompletePending now aligns with chunk boundaries exactly (single check
per chunk).

No measurable perf change at T=1 (within run-to-run noise).
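
One way to see why an `n & (chunk - 1)` test needs a power-of-two chunk size is to enumerate the counters at which it fires. The sketch below is illustrative Python rather than the benchmark's C# loop; `fires` and `firing_points` are hypothetical helper names.

```python
def fires(n, chunk):
    """True when the 'n & (chunk - 1) == 0' style test passes."""
    return (n & (chunk - 1)) == 0

def firing_points(chunk, limit):
    """All counters in [1, limit] at which the mask test fires."""
    return [n for n in range(1, limit + 1) if fires(n, chunk)]

def is_power_of_two(x):
    """A chunk size is safe for mask tests only if this holds."""
    return x > 0 and (x & (x - 1)) == 0

if __name__ == "__main__":
    print(bin(639))                    # mask for chunk = 640
    print(firing_points(640, 2048))    # irregular: 639 has gaps in its low bits
    print(firing_points(512, 2048))    # regular: every 512 ops
```

With a power-of-two chunk the mask is a solid run of low bits, so the test is equivalent to `n % chunk == 0`; with 640 it is not.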

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes from PR #1823 review:

- KvBenchmark.cs: use Volatile.Write for doneFlag stores so the stop
  signal is reliably visible to workers (which read via Volatile.Read).
  Removes the Thread.MemoryBarrier+plain-store pattern in the zero-second
  run path.

- KvBenchmark.cs: change kChunkSize from 512 to 1024. 1024 keeps the
  bitmask check (n & (kChunkSize-1)) correct and aligns the run loop's
  'idx % 512 == 0' to fire twice per chunk (matching YCSB's run-loop
  CompletePending cadence). Measured: 2.04 M ops/sec, within noise of
  prior 512 result.

- Options.cs: validate --device, --device-io-backend, and --zipf-theta.
  Unknown device/backend names now produce a clear error listing valid
  values (was: silently mapped to Default). --zipf-theta is rejected
  if outside [0, 1) (was: theta=1 divided by zero; theta>1 produced NaN).

- Options.cs: correct --value-size help text (was: 'Range: 8..4096';
  actual: 32..1048576).

- KvBenchmark.Setup.cs: correct CreateDevice doc comment (was: 'Native +
  io_uring on Linux'; native device only uses libaio in this PR).

- README: correct sample-output reader-copy size (32 B, not 64 B), and
  describe the actual GC default (Workstation, with DOTNET_gcServer=1
  opt-in) instead of claiming Server GC is enabled in the csproj.
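
The `--zipf-theta` range check above can be sketched as follows. This assumes the sampler uses YCSB-style Zipf constants (`alpha = 1 / (1 - theta)` and the usual `eta` term), which would explain the division by zero at theta = 1; the PR text does not spell out the formulas, so treat them and the function names as assumptions.

```python
def validate_theta(theta):
    """Accept only theta in [0, 1); NaN fails the comparison and is rejected too."""
    if not (0.0 <= theta < 1.0):
        raise ValueError(f"zipf theta must be in [0, 1), got {theta}")
    return theta

def zeta(n, theta):
    """Generalized harmonic number: sum of 1 / i**theta for i = 1..n."""
    return sum(1.0 / (i ** theta) for i in range(1, n + 1))

def zipf_constants(n, theta):
    """Return (alpha, zetan, eta) for n keys; ASSUMED YCSB-style definitions."""
    theta = validate_theta(theta)
    if n < 3:
        raise ValueError("need at least 3 keys (eta's denominator is zero at n = 2)")
    zetan = zeta(n, theta)
    alpha = 1.0 / (1.0 - theta)   # the term that divides by zero at theta = 1
    eta = (1.0 - (2.0 / n) ** (1.0 - theta)) / (1.0 - zeta(2, theta) / zetan)
    return alpha, zetan, eta

if __name__ == "__main__":
    print(zipf_constants(1000, 0.99))
    for bad in (1.0, 1.5, -0.1):
        try:
            zipf_constants(1000, bad)
        except ValueError as err:
            print("rejected:", err)
```

Validating before computing the constants turns a silent NaN or a runtime exception into a clear option error.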

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously:
  - Run loop:   'if (idx % 512 == 0)' fired CompletePending twice per
                1024-op chunk (leftover from kChunkSize=640).
  - Load loop:  per-op '& (kChunkSize-1)' bitmask + nested '& 65535'
                throttle for CompletePending every 64K ops.

Now both phases:
  - Iterate one chunk (kChunkSize=1024 ops) at a time.
  - Call CompletePending(false) once at the start of each chunk.
  - Write scoreboard tick at end of each chunk.

The per-op checks are gone from both inner loops. Cleaner, fewer
branches, and CompletePending fires at a consistent cadence (every
chunk) in both phases.
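
The chunk-at-a-time structure described above could look roughly like the sketch below. It is a simplified Python model: the real loop is C# and is presumably bounded by a stop flag rather than a fixed chunk count, and `do_op`, `complete_pending`, and the list-based scoreboard are hypothetical stand-ins.

```python
CHUNK_SIZE = 1024  # ops per chunk, as stated in the commit message

def run_chunks(total_chunks, do_op, complete_pending, scoreboard, slot):
    """Run whole chunks; the inner loop carries no per-op throttling checks."""
    ops_done = 0
    for _ in range(total_chunks):
        complete_pending()              # once at the start of each chunk
        for _ in range(CHUNK_SIZE):
            do_op(ops_done)
            ops_done += 1
        scoreboard[slot] = ops_done     # scoreboard tick at the end of each chunk
    return ops_done

if __name__ == "__main__":
    drains = []
    board = [0]
    total = run_chunks(
        total_chunks=3,
        do_op=lambda i: None,                   # placeholder operation
        complete_pending=lambda: drains.append(1),
        scoreboard=board,
        slot=0,
    )
    print(total, len(drains), board[0])
```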

Perf: load 3.05 M ops/s, run mean 2.07 M ops/s (3 iters at 4.6M × 96B
× T=1, BasicContext) — within noise of prior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a 'Cookbook' section at the top of 'Worked examples' with a copy-paste
matrix covering common scenarios:

- small (4.6 M) vs large (100 M) dataset
- log fits in memory (auto-sized) vs log constrained smaller than dataset
- uniform vs zipf θ=0.99 distribution
- pure reads vs mixed RUMD
- single thread vs thread sweep

Each row is a single-line command using a $KV environment variable for
the binary path, NUMA-pinned via numactl. Added knob-summary subsection
explaining how --log-memory, --device, and --max-inline-value-size
interact to control the in-memory vs out-of-memory mix.

Demoted the existing 9 detailed examples to '####' under a new 'Detailed
examples' subsection so the cookbook table is the new first-stop.

Verified by smoke-running the small-data constrained-log row: 4.6M
dataset with 256m log memory dropped from 2 M ops/sec (fits) to 320 K
ops/sec (spills to disk), as designed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t default

The pool target was computed from opts.Threads, which is the default 1
when --run-threads-sweep is in use. For sweeps with peak thread counts
> 128, that left the pool target floored at 256 (correct floor, but
not actually sized for the sweep peak). Switch to opts.ResolvedMaxThreads,
which is the max of --load-threads and the sweep entries.

No measurable impact on the 100M-key disk-bound sweep up to 32 run threads
(both old and new target floored at 256 there), but it makes the pool
sizing correct for higher-thread sweeps without the operator having to
override --no-threadpool-tune and SetMinThreads manually.
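
A small sketch of the sizing change, under stated assumptions: the floor of 256 comes from the commit message, but the per-thread multiplier of 2 is only inferred from the "> 128" remark and is not confirmed by the source. The function names are hypothetical.

```python
POOL_FLOOR = 256         # floor mentioned in the commit message
PER_THREAD_FACTOR = 2    # ASSUMPTION: inferred from the "> 128" remark, not from the code

def resolved_max_threads(load_threads, run_threads_sweep):
    """Peak thread count across the load phase and every sweep entry."""
    return max([load_threads, *run_threads_sweep])

def pool_target(threads):
    """Pool size target: scale with the thread count, but never below the floor."""
    return max(POOL_FLOOR, PER_THREAD_FACTOR * threads)

if __name__ == "__main__":
    sweep = [1, 2, 4, 192]
    print(pool_target(1))                               # sized from the default of 1
    print(pool_target(resolved_max_threads(8, sweep)))  # sized from the sweep peak
```

Under these assumptions, any sweep peaking at 128 threads or fewer lands on the floor either way, which matches the note that the 32-thread sweep saw no change.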

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two latent bugs in the previous --validate path:

1) writerThread reconstruction used Options.Threads (the run-phase count)
   instead of Options.ResolvedLoadThreads (the actual load-phase count).
   With --load-threads N and --threads M (N != M), this verified against the
   wrong per-thread baked pattern and reported phantom mismatches.

2) Reads of records below HeadAddress return Status.IsPending. The previous
   code counted these as misses, so --validate falsely reported huge miss
   counts on any disk-spill scenario (log smaller than dataset).

Fix: issue reads in batches of 256 with per-slot pinned output buffers and
drain via CompletePendingWithOutputs between batches, verifying each
completed output against the per-thread baked pattern.
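
The batch-and-drain pattern can be modeled with a toy store, as in the sketch below. `FakeStore` is a hypothetical stand-in, not the Tsavorite API: it treats the key itself as the record address, and its `complete_pending_with_outputs` method only mimics the idea of draining pending reads together with their outputs.

```python
BATCH = 256  # reads issued between drains, as in the commit message

class FakeStore:
    """Toy store: reads of keys below head_address complete later (pending)."""
    def __init__(self, data, head_address):
        self.data, self.head, self.pending = data, head_address, []

    def read(self, key):
        if key < self.head:
            self.pending.append(key)
            return "pending", None
        return ("found", self.data[key]) if key in self.data else ("notfound", None)

    def complete_pending_with_outputs(self):
        done = [(k, self.data.get(k)) for k in self.pending]
        self.pending = []
        return done

def validate(store, keys, expected):
    """Return (ok, missing, mismatched); pending reads are verified after the drain."""
    ok = miss = bad = 0

    def check(key, value):
        nonlocal ok, miss, bad
        if value is None:
            miss += 1
        elif value == expected(key):
            ok += 1
        else:
            bad += 1

    for start in range(0, len(keys), BATCH):
        for key in keys[start:start + BATCH]:
            status, value = store.read(key)
            if status != "pending":
                check(key, value)
        for key, value in store.complete_pending_with_outputs():
            check(key, value)
    return ok, miss, bad

if __name__ == "__main__":
    store = FakeStore({k: k * 7 for k in range(1000)}, head_address=600)
    print(validate(store, list(range(1000)), lambda k: k * 7))
```

Counting a pending status as a miss would report 600 false misses in this toy setup; draining first reports none.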

This is a pure --validate bug fix; it is independent of the
optimize-v2-io device-IO changes and applies cleanly to kv-bench.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@badrishc badrishc merged commit 4d85ab5 into main May 25, 2026
367 of 369 checks passed
@badrishc badrishc deleted the badrishc/kv-bench branch May 25, 2026 23:32
3 participants