Add KV.benchmark: lean Tsavorite KV throughput benchmark by badrishc · Pull Request #1823 · microsoft/garnet

badrishc · 2026-05-23T18:52:00Z

A clean, allocation-free, NUMA-aware throughput benchmark for the Tsavorite KV engine. Designed for repeatable engineering numbers (stable mean / low stdev) rather than ad-hoc spot-checks.

Highlights

- 8-byte synthetic keys generated per-op (no pre-built key arrays).
- Configurable value size, key count, RUMD ratio (reads/upserts/RMW/deletes), and key distribution (uniform or zipf).
- Single ObjectAllocator store with Garnet-aligned defaults (16 MB pages, 1 GB segments, 16 KB max-inline value, mutable=0.9).
- Auto-sized log (fits dataset in mutable region) and index (matching YCSB fill factor); both overridable.
- Configurable device backend: native (libaio), randomaccess, filestream, null. Native backend ctor extended with an IoBackend selector (Default / Libaio).
- Separate --load-threads and --run-threads-sweep so a single load can feed a multi-thread-count run sweep.
- NUMA pinning + ThreadPool auto-tuning + full-GC quiescence before each timed window; hot loop is allocation-free (stackalloc only, 32-byte aligned to match AVX2 copy).
- Workers coordinate start/end via a single gate + per-thread scoreboard slots padded to 128 bytes; warmup excluded from the timed window.
- Output: human-readable progress + final summary; optional pretty-printed JSON file (one row per phase) and CSV file. Stdout JSON blobs are off by default to keep logs clean.
- Always cleans up its data directory on exit.

Adds a new NativeStorageDevice.IoBackend enum (Default, Libaio) and an optional ctor parameter. The shipped libnative_device.so currently uses libaio on Linux for both values; the enum is forward-compatible for additional backends.
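
The per-thread scoreboard slots padded to 128 bytes (listed in the highlights) exist to keep each worker's counter away from its neighbours' cache lines. The following is a minimal Python sketch of that layout, not the PR's C# code: `PaddedCounter` and `make_scoreboard` are hypothetical names, and only the 128-byte slot size comes from the description.

```python
import ctypes

SLOT_BYTES = 128  # slot size stated in the PR description


class PaddedCounter(ctypes.Structure):
    """One scoreboard slot: an 8-byte counter padded out to 128 bytes."""
    _fields_ = [
        ("value", ctypes.c_int64),
        # Padding fills the rest of the slot so that two counters never
        # share the same 128-byte region.
        ("_pad", ctypes.c_byte * (SLOT_BYTES - ctypes.sizeof(ctypes.c_int64))),
    ]


def make_scoreboard(num_threads):
    """Contiguous array of slots; slot i starts at byte offset i * 128."""
    return (PaddedCounter * num_threads)()


scoreboard = make_scoreboard(4)
scoreboard[2].value += 1024  # worker 2 publishes one chunk of progress
stride = ctypes.addressof(scoreboard[1]) - ctypes.addressof(scoreboard[0])
print(ctypes.sizeof(PaddedCounter), stride)
```

Each worker writes only its own slot, so the coordinator can sum the slots without workers contending on a shared counter.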


Apply analyzer-suggested fixes:
- IMPORTS: System.* usings ordered before others
- IDE0300/IDE0301: collection initializer syntax (new int[]{...} -> [...])

Pure mechanical reformatting; no behavioural change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The slnx was copied from an in-flight branch that had inadvertently removed test/test.stress. Restoring so the PR diff against main only adds KV.benchmark. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Adds a new KV.benchmark project under Tsavorite benchmarks to provide a lean, repeatable KV throughput benchmark (load + RUMD run phases) with NUMA pinning, allocation-free hot loop, and optional JSON/CSV output. Also extends NativeStorageDevice with a forward-compatible Linux async I/O backend selector (IoBackend).

Changes:

Add KV.benchmark (CLI parsing, sizing defaults, NUMA pinning, worker orchestration, output emitters, Zipf + RNG helpers).
Add NativeStorageDevice.IoBackend enum + optional ctor parameter for future backend selection/reporting.
Register the new benchmark project in Tsavorite.slnx.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 10 comments.


File	Description
libs/storage/Tsavorite/cs/Tsavorite.slnx	Adds the new `benchmark/KV.benchmark` project to the Tsavorite solution.
libs/storage/Tsavorite/cs/src/core/Device/NativeStorageDevice.cs	Introduces `IoBackend` selector and stores configured value on the device.
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KV.benchmark.csproj	New benchmark project definition (CommandLineParser ref, Tsavorite.core reference, signing settings).
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/Program.cs	Minimal `Main` that forwards to `EntryPoint.Run`.
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/EntryPoint.cs	CLI parse/validate, lifecycle orchestration, interrupt handling, ThreadPool tuning restore.
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/Options.cs	CLI options + auto-default resolution (log/index sizing, device selection).
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvBenchmark.cs	Benchmark engine: store/device creation, phase execution, scoreboard-based timing, cleanup.
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvBenchmark.Worker.cs	Worker entrypoint and allocation-free hot loop (uniform/zipf keygen, RUMD selection).
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvBenchmark.Setup.cs	Run directory management, stale cleanup, device construction helper, RUMD helper.
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvBenchmark.Validate.cs	Optional post-load validation pass.
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvSessionFunctions.cs	SpanByte session functions tuned for benchmark semantics (fixed-size read copy, RMW safety).
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvKey.cs	8-byte synthetic key type (with padding for record alignment).
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvNumaPinning.cs	NUMA CPU discovery + best-effort thread affinity pinning (Linux/Windows).
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/PaddedTypes.cs	128B-padded counter/flag structs to avoid false sharing.
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvSize.cs	Size parsing/formatting helpers + NextPow2.
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvOutput.cs	Human-readable reporting + shared helpers for JSON/CSV.
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvOutput.Json.cs	JSON schema + emit (compact + pretty).
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/KvOutput.Csv.cs	CSV schema + emit (wide rows, header-once).
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/XoshiroRng.cs	Per-thread PRNG implementation for Zipf sampling.
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/ZipfGenerator.cs	Zipf constants + sampler used by the benchmark.
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/README.md	Benchmark documentation, CLI reference, and worked examples.

The load loop uses '(localOps & (kChunkSize - 1)) == 0' as a bitmask check for the scoreboard / CompletePending throttling. That idiom is only equivalent to 'localOps % kChunkSize == 0' when kChunkSize is a power of two. With kChunkSize=640 the mask is 639 = 0b1001111111, so the check passes whenever bits 0-6 and bit 9 of localOps are clear (0, 128, 256, 384, 1024, 1152, ...), not every 640 ops as intended.

Changed kChunkSize from 640 to 512 so the bitmask is correct (mask 511 = 0b111111111). 512 is also large enough to amortize the Interlocked.Add contention in the run loop. The run loop's 'idx % 512 == 0' for CompletePending now aligns with chunk boundaries exactly (single check per chunk).

No measurable perf change at T=1 (within run-to-run noise).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
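
To see concretely which op counts pass the bitmask check for a given chunk size, the values can simply be enumerated. This is an illustrative Python sketch rather than the benchmark's C# code; `fires` and `firing_points` are hypothetical helper names.

```python
def fires(local_ops, chunk_size):
    """True when the bitmask check (local_ops & (chunk_size - 1)) == 0 passes."""
    return (local_ops & (chunk_size - 1)) == 0


def firing_points(chunk_size, limit):
    """All op counts in [0, limit) at which the check passes."""
    return [n for n in range(limit) if fires(n, chunk_size)]


def is_power_of_two(n):
    """The mask trick equals 'n % chunk_size == 0' only for powers of two."""
    return n > 0 and (n & (n - 1)) == 0


hits_640 = firing_points(640, 2048)
hits_512 = firing_points(512, 2048)
print(hits_640)  # [0, 128, 256, 384, 1024, 1152, 1280, 1408]
print(hits_512)  # [0, 512, 1024, 1536]
```

With 640 the check never fires at 640 itself, and it fires in irregular bursts; with 512 it fires exactly once per chunk.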

Fixes from PR #1823 review:
- KvBenchmark.cs: use Volatile.Write for doneFlag stores so the stop signal is reliably visible to workers (which read via Volatile.Read). Removes the Thread.MemoryBarrier+plain-store pattern in the zero-second run path.
- KvBenchmark.cs: change kChunkSize from 512 to 1024. 1024 keeps the bitmask check (n & (kChunkSize-1)) correct and aligns the run loop's 'idx % 512 == 0' to fire twice per chunk (matching YCSB's run-loop CompletePending cadence). Measured: 2.04 M ops/sec, within noise of prior 512 result.
- Options.cs: validate --device, --device-io-backend, and --zipf-theta. Unknown device/backend names now produce a clear error listing valid values (was: silently mapped to Default). --zipf-theta is rejected if outside [0, 1) (was: theta=1 divided by zero; theta>1 produced NaN).
- Options.cs: correct --value-size help text (was: 'Range: 8..4096'; actual: 32..1048576).
- KvBenchmark.Setup.cs: correct CreateDevice doc comment (was: 'Native + io_uring on Linux'; native device only uses libaio in this PR).
- README: correct sample-output reader-copy size (32 B, not 64 B), and describe the actual GC default (Workstation, with DOTNET_gcServer=1 opt-in) instead of claiming Server GC is enabled in the csproj.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
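
The option validation described in this commit can be sketched roughly as follows. This is an illustrative Python version, not the C# in Options.cs: the function names, the error wording, and the case-insensitive matching are assumptions; the valid device and backend names, the [0, 1) theta range, and the 32..1048576 value-size range are taken from the PR text.

```python
VALID_DEVICES = ("native", "randomaccess", "filestream", "null")
VALID_IO_BACKENDS = ("default", "libaio")
VALUE_SIZE_RANGE = (32, 1048576)  # corrected --value-size range from the PR


def validate_choice(flag, value, valid):
    """Reject unknown names with an error listing the valid values."""
    normalized = value.strip().lower()  # case-insensitivity is an assumption
    if normalized not in valid:
        raise ValueError(
            f"{flag}: unknown value '{value}'; valid values: {', '.join(valid)}")
    return normalized


def validate_zipf_theta(theta):
    """Accept only theta in [0, 1); NaN also fails this comparison."""
    if not (0.0 <= theta < 1.0):
        raise ValueError(f"--zipf-theta must be in [0, 1), got {theta}")
    return theta


def validate_value_size(size):
    lo, hi = VALUE_SIZE_RANGE
    if not (lo <= size <= hi):
        raise ValueError(f"--value-size must be in {lo}..{hi}, got {size}")
    return size


def error_of(fn, *args):
    """Helper for the demo: return the error message, or None on success."""
    try:
        fn(*args)
        return None
    except ValueError as exc:
        return str(exc)


print(error_of(validate_choice, "--device", "io_uring", VALID_DEVICES))
print(error_of(validate_zipf_theta, 1.0))
```

Failing fast on an unknown name avoids the earlier behaviour of silently running with a different backend than the one requested.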

Previously:
- Run loop: 'if (idx % 512 == 0)' fired CompletePending twice per 1024-op chunk (leftover from kChunkSize=640).
- Load loop: per-op '& (kChunkSize-1)' bitmask + nested '& 65535' throttle for CompletePending every 64K ops.

Now both phases:
- Iterate one chunk (kChunkSize=1024 ops) at a time.
- Call CompletePending(false) once at the start of each chunk.
- Write scoreboard tick at end of each chunk.

The per-op checks are gone from both inner loops. Cleaner, fewer branches, and CompletePending fires at a consistent cadence (every chunk) in both phases.

Perf: load 3.05 M ops/s, run mean 2.07 M ops/s (3 iters at 4.6M × 96B × T=1, BasicContext), within noise of prior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
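
The chunk-at-a-time structure described in this commit might look roughly like the sketch below. It is a simplified Python illustration under stated assumptions: `run_chunks`, `do_op`, and the list-based scoreboard are hypothetical, and the loop is bounded by an op count, whereas the benchmark's run phase is bounded by a timed window and a stop flag.

```python
K_CHUNK_SIZE = 1024  # chunk size named in the commit message


def run_chunks(total_ops, do_op, complete_pending, scoreboard, slot):
    """Process ops one chunk at a time with no per-op throttle checks."""
    done = 0
    while done < total_ops:
        complete_pending(False)           # once at the start of each chunk
        n = min(K_CHUNK_SIZE, total_ops - done)
        for i in range(done, done + n):   # inner loop: just the operation
            do_op(i)
        done += n
        scoreboard[slot] = done           # scoreboard tick at end of chunk
    return done


# Demo with counting stubs in place of a real store session.
calls = {"ops": 0, "complete_pending": 0}


def fake_op(i):
    calls["ops"] += 1


def fake_complete_pending(wait):
    calls["complete_pending"] += 1


board = [0, 0]
total = run_chunks(2500, fake_op, fake_complete_pending, board, slot=1)
print(total, calls, board)
```

For 2500 ops the loop runs three chunks (1024, 1024, 452), so the pending-completion call and the scoreboard write each happen three times rather than being tested for on every op.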

Add a 'Cookbook' section at the top of 'Worked examples' with a copy-paste matrix covering common scenarios:
- small (4.6 M) vs large (100 M) dataset
- log fits in memory (auto-sized) vs log constrained smaller than dataset
- uniform vs zipf θ=0.99 distribution
- pure reads vs mixed RUMD
- single thread vs thread sweep

Each row is a single-line command using a $KV environment variable for the binary path, NUMA-pinned via numactl.

Added knob-summary subsection explaining how --log-memory, --device, and --max-inline-value-size interact to control the in-memory vs out-of-memory mix.

Demoted the existing 9 detailed examples to '####' under a new 'Detailed examples' subsection so the cookbook table is the new first-stop.

Verified by smoke-running the small-data constrained-log row: 4.6M dataset with 256m log memory dropped from 2 M ops/sec (fits) to 320 K ops/sec (spills to disk), as designed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…t default

The pool target was computed from opts.Threads, which is the default 1 when --run-threads-sweep is in use. For sweeps with peak thread counts > 128, that left the pool target floored at 256 (correct floor, but not actually sized for the sweep peak). Switch to opts.ResolvedMaxThreads, which is the max of --load-threads and the sweep entries.

No measurable impact on the 100M-key disk-bound sweep up to 32 run threads (both old and new target floored at 256 there), but it makes the pool sizing correct for higher-thread sweeps without the operator having to override --no-threadpool-tune and SetMinThreads manually.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
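
A rough sketch of the sizing change follows, in Python for illustration. The max-of-load-and-sweep rule and the floor of 256 come from the commit message; the multiplier of 2 per thread is an assumption, chosen only because it is consistent with the statement that peaks above 128 exceed the floor, and both function names are hypothetical.

```python
def resolved_max_threads(load_threads, run_threads_sweep):
    """Peak thread count: max of --load-threads and every sweep entry."""
    return max([load_threads, *run_threads_sweep])


def pool_target(peak_threads, per_thread_factor=2, floor=256):
    """ThreadPool minimum target; per_thread_factor=2 is an assumed value."""
    return max(floor, peak_threads * per_thread_factor)


sweep = [1, 2, 4, 8, 16, 32]
peak = resolved_max_threads(load_threads=8, run_threads_sweep=sweep)
print(peak, pool_target(peak))        # small sweep: the floor applies
print(pool_target(1), pool_target(192))
```

Sizing from the default thread count of 1 always yields the floor, whereas sizing from the resolved peak lets a high-thread sweep raise the target above it.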

Two latent bugs in the previous --validate path:

1) writerThread reconstruction used Options.Threads (the run-phase count) instead of Options.ResolvedLoadThreads (the actual load-phase count). With --load-threads N and --threads M (N != M), this verified against the wrong per-thread baked pattern and reported phantom mismatches.
2) Reads of records below HeadAddress return Status.IsPending. The previous code counted these as misses, so --validate falsely reported huge miss counts on any disk-spill scenario (log smaller than dataset).

Fix: issue reads in batches of 256 with per-slot pinned output buffers and drain via CompletePendingWithOutputs between batches, verifying each completed output against the per-thread baked pattern.

This is a pure --validate bug fix; it is independent of the optimize-v2-io device-IO changes and applies cleanly to kv-bench.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
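
The batched read-and-drain pattern from the fix can be sketched as below. This is a Python illustration against a stand-in store, not the Tsavorite API: `FakeStore`, its methods, and `expected_value` are hypothetical, and the per-key pattern replaces the per-thread baked pattern used by the real benchmark. Only the batch size of 256 and the rule that pending reads are verified after draining come from the commit message.

```python
BATCH_SIZE = 256  # batch size named in the commit message


def expected_value(key):
    """Hypothetical deterministic pattern standing in for the baked pattern."""
    return (key * 2654435761) & 0xFFFFFFFF


class FakeStore:
    """Stand-in store: reads of keys below head_key complete asynchronously."""

    def __init__(self, data, head_key):
        self.data, self.head_key, self._pending = data, head_key, []

    def read(self, key, slot):
        if key not in self.data:
            return "notfound", None
        if key < self.head_key:            # "on disk": result arrives later
            self._pending.append((slot, key))
            return "pending", None
        return "ok", self.data[key]

    def complete_pending_with_outputs(self):
        done = [(slot, key, self.data[key]) for slot, key in self._pending]
        self._pending.clear()
        return done


def validate(store, keys):
    ok = mismatches = misses = 0
    for start in range(0, len(keys), BATCH_SIZE):
        for slot, key in enumerate(keys[start:start + BATCH_SIZE]):
            status, value = store.read(key, slot)
            if status == "pending":
                continue                   # not a miss: verified after drain
            if status == "notfound":
                misses += 1
            elif value == expected_value(key):
                ok += 1
            else:
                mismatches += 1
        # Drain between batches and verify every completed output.
        for slot, key, value in store.complete_pending_with_outputs():
            if value == expected_value(key):
                ok += 1
            else:
                mismatches += 1
    return ok, mismatches, misses


data = {k: expected_value(k) for k in range(1000)}
data[5] = 0                                # corrupt one "on-disk" record
result = validate(FakeStore(data, head_key=600), list(range(1002)))
print(result)  # (999, 1, 2)
```

The corrupted record sits below the simulated head, so it is caught only because pending reads are drained and checked instead of being counted as misses.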

badrishc and others added 3 commits May 23, 2026 11:42

Restore test.stress project entry in Tsavorite.slnx

8058233


badrishc marked this pull request as ready for review May 24, 2026 02:07

Copilot AI review requested due to automatic review settings May 24, 2026 02:07

Copilot started reviewing on behalf of badrishc May 24, 2026 02:07

Copilot AI reviewed May 24, 2026


badrishc and others added 6 commits May 23, 2026 20:04

TedHartMS approved these changes May 25, 2026


badrishc merged commit 4d85ab5 into main May 25, 2026
367 of 369 checks passed

badrishc deleted the badrishc/kv-bench branch May 25, 2026 23:32

badrishc merged 9 commits into main from badrishc/kv-bench

3 participants
