Add tracing support to the compressor #7385
Conversation
## Summary

Tracking issue: #7216

Makes the compressor types more robust (removes the possibility of invalid state), which also makes it easier to add tracing later (draft at #7385).

## API Changes

Changes some types:

```rust
/// Closure type for [`DeferredEstimate::Callback`].
///
/// The compressor calls this with the same arguments it would pass to sampling. The closure must
/// resolve directly to a terminal [`EstimateVerdict`].
#[rustfmt::skip]
pub type EstimateFn = dyn FnOnce(
    &CascadingCompressor,
    &mut ArrayAndStats,
    CompressorContext,
) -> VortexResult<EstimateVerdict> + Send + Sync;

/// The result of a [`Scheme`]'s compression ratio estimation.
///
/// This type is returned by [`Scheme::expected_compression_ratio`] to tell the compressor how
/// promising this scheme is for a given array without performing any expensive work.
///
/// [`CompressionEstimate::Verdict`] means the scheme already knows the terminal answer.
/// [`CompressionEstimate::Deferred`] means the compressor must do extra work before the scheme can
/// produce a terminal answer.
#[derive(Debug)]
pub enum CompressionEstimate {
    /// The scheme already knows the terminal estimation verdict.
    Verdict(EstimateVerdict),
    /// The compressor must perform deferred work to resolve the terminal estimation verdict.
    Deferred(DeferredEstimate),
}

/// The terminal answer to a compression estimate request.
#[derive(Debug)]
pub enum EstimateVerdict {
    /// Do not use this scheme for this array.
    Skip,
    /// Always use this scheme, as it is definitively the best choice.
    ///
    /// Some examples include constant detection, decimal byte parts, and temporal decomposition.
    ///
    /// The compressor will select this scheme immediately without evaluating further candidates.
    /// Schemes that return `AlwaysUse` must be mutually exclusive per canonical type (enforced by
    /// [`Scheme::matches`]), otherwise the winner depends silently on registration order.
    ///
    /// [`Scheme::matches`]: crate::scheme::Scheme::matches
    AlwaysUse,
    /// The estimated compression ratio. This must be greater than `1.0` to be considered by the
    /// compressor, otherwise it is worse than the canonical encoding.
    Ratio(f64),
}

/// Deferred work that can resolve to a terminal [`EstimateVerdict`].
pub enum DeferredEstimate {
    /// The scheme cannot cheaply estimate its ratio, so the compressor should compress a small
    /// sample to determine effectiveness.
    Sample,
    /// A fallible estimation requiring a custom expensive computation.
    ///
    /// Use this only when the scheme needs to perform trial encoding or other costly checks to
    /// determine its compression ratio. The callback returns an [`EstimateVerdict`] directly, so
    /// it cannot request more sampling or another deferred callback.
    Callback(Box<EstimateFn>),
}
```

This will also make some changes we want to make in the future easier (tracing, better decision making about what to try, etc.).

## Testing

Some new tests.

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
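As a sketch of how a driver loop might consume these types, here is a simplified stand-in (not the actual `vortex-compressor` code; the `CascadingCompressor`, `ArrayAndStats`, and `CompressorContext` context types are elided, and the two `DeferredEstimate` variants are collapsed into one case for brevity):

```rust
// Simplified stand-ins for the enums above; illustrative only.
#[derive(Debug)]
enum EstimateVerdict {
    Skip,
    AlwaysUse,
    Ratio(f64),
}

#[derive(Debug)]
enum CompressionEstimate {
    Verdict(EstimateVerdict),
    // Sampling and callbacks collapsed into a single case for this sketch.
    Deferred,
}

/// What a hypothetical driver decides to do next for one candidate scheme.
#[derive(Debug, PartialEq)]
enum Decision {
    Reject,
    SelectImmediately,
    Candidate(f64),
    RunDeferredWork,
}

fn decide(estimate: CompressionEstimate) -> Decision {
    match estimate {
        // A ratio must beat 1.0, otherwise the scheme is worse than canonical.
        CompressionEstimate::Verdict(EstimateVerdict::Ratio(r)) if r > 1.0 => {
            Decision::Candidate(r)
        }
        CompressionEstimate::Verdict(EstimateVerdict::Ratio(_)) => Decision::Reject,
        CompressionEstimate::Verdict(EstimateVerdict::Skip) => Decision::Reject,
        // AlwaysUse short-circuits: no further candidates need evaluating.
        CompressionEstimate::Verdict(EstimateVerdict::AlwaysUse) => Decision::SelectImmediately,
        // Deferred work (sampling or a callback) resolves to a verdict later.
        CompressionEstimate::Deferred => Decision::RunDeferredWork,
    }
}
```

Because every path ends in a terminal `EstimateVerdict`, there is no representable "estimate pending forever" state, which is the robustness property the new types are after.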
robert3005
left a comment
I think this is reasonable but will wait for it to not be a draft
Merging this PR will not alter performance
**File Sizes: PolarSignals Profiling**
No file size changes detected.

**Benchmarks: TPC-H SF=1 on NVME**
Verdict: No clear signal (low confidence)
- datafusion / vortex-file-compressed (0.966x ➖, 0↑ 0↓)
- datafusion / vortex-compact (0.950x ➖, 0↑ 0↓)
- datafusion / parquet (1.001x ➖, 1↑ 1↓)
- datafusion / arrow (1.020x ➖, 0↑ 1↓)
- duckdb / vortex-file-compressed (0.983x ➖, 0↑ 0↓)
- duckdb / vortex-compact (0.992x ➖, 0↑ 0↓)
- duckdb / parquet (0.973x ➖, 1↑ 0↓)
- duckdb / duckdb (0.985x ➖, 0↑ 0↓)

**File Sizes: TPC-H SF=1 on NVME**
No file size changes detected.

**Benchmarks: FineWeb NVMe**
Verdict: No clear signal (low confidence)
- datafusion / vortex-file-compressed (0.964x ➖, 1↑ 0↓)
- datafusion / vortex-compact (1.011x ➖, 0↑ 0↓)
- datafusion / parquet (0.992x ➖, 0↑ 0↓)
- duckdb / vortex-file-compressed (0.974x ➖, 0↑ 0↓)
- duckdb / vortex-compact (1.008x ➖, 0↑ 1↓)
- duckdb / parquet (0.997x ➖, 0↑ 0↓)

**File Sizes: FineWeb NVMe**
No file size changes detected.

**Benchmarks: TPC-DS SF=1 on NVME**
Verdict: No clear signal (low confidence)
- datafusion / vortex-file-compressed (0.910x ➖, 32↑ 0↓)
- datafusion / vortex-compact (0.928x ➖, 23↑ 0↓)
- datafusion / parquet (0.919x ➖, 27↑ 1↓)
- duckdb / vortex-file-compressed (0.924x ➖, 30↑ 0↓)
- duckdb / vortex-compact (0.934x ➖, 20↑ 0↓)
- duckdb / parquet (0.946x ➖, 12↑ 0↓)
- duckdb / duckdb (0.920x ➖, 29↑ 1↓)

**File Sizes: TPC-DS SF=1 on NVME**
No file size changes detected.

**Benchmarks: FineWeb S3**
Verdict: No clear signal (low confidence)
- datafusion / vortex-file-compressed (0.996x ➖, 1↑ 0↓)
- datafusion / vortex-compact (1.000x ➖, 0↑ 0↓)
- datafusion / parquet (1.019x ➖, 0↑ 0↓)
- duckdb / vortex-file-compressed (0.939x ➖, 0↑ 0↓)
- duckdb / vortex-compact (0.975x ➖, 0↑ 0↓)
- duckdb / parquet (0.981x ➖, 0↑ 0↓)

**Benchmarks: TPC-H SF=10 on NVME**
Verdict: No clear signal (low confidence)
- datafusion / vortex-file-compressed (0.997x ➖, 0↑ 0↓)
- datafusion / vortex-compact (1.002x ➖, 0↑ 0↓)
- datafusion / parquet (0.999x ➖, 0↑ 0↓)
- datafusion / arrow (0.995x ➖, 1↑ 0↓)
- duckdb / vortex-file-compressed (1.004x ➖, 0↑ 0↓)
- duckdb / vortex-compact (0.999x ➖, 0↑ 0↓)
- duckdb / parquet (1.002x ➖, 0↑ 0↓)
- duckdb / duckdb (0.991x ➖, 0↑ 0↓)

**File Sizes: TPC-H SF=10 on NVME**
No file size changes detected.

**Benchmarks: Statistical and Population Genetics**
Verdict: No clear signal (low confidence)
- duckdb / vortex-file-compressed (1.052x ➖, 0↑ 2↓)
- duckdb / vortex-compact (1.037x ➖, 0↑ 1↓)
- duckdb / parquet (1.015x ➖, 0↑ 0↓)

**File Sizes: Statistical and Population Genetics**
No file size changes detected.

**Benchmarks: TPC-H SF=1 on S3**
Verdict: No clear signal (environment too noisy)
- datafusion / vortex-file-compressed (1.091x ➖, 0↑ 3↓)
- datafusion / vortex-compact (1.005x ➖, 1↑ 1↓)
- datafusion / parquet (0.948x ➖, 0↑ 0↓)
- duckdb / vortex-file-compressed (1.023x ➖, 0↑ 0↓)
- duckdb / vortex-compact (1.036x ➖, 0↑ 0↓)
- duckdb / parquet (0.962x ➖, 0↑ 0↓)

**Benchmarks: Clickbench on NVME**
Verdict: No clear signal (low confidence)
- datafusion / vortex-file-compressed (1.006x ➖, 0↑ 0↓)
- datafusion / parquet (0.990x ➖, 0↑ 0↓)
- duckdb / vortex-file-compressed (0.990x ➖, 6↑ 1↓)
- duckdb / parquet (0.994x ➖, 1↑ 0↓)
- duckdb / duckdb (1.038x ➖, 1↑ 12↓)

**File Sizes: Clickbench on NVME**
File size changes (1 file changed, -0.0% overall, 0↑ 1↓)

**Benchmarks: Random Access**
Vortex (geomean): 1.049x ➖
- unknown / unknown (1.057x ➖, 0↑ 10↓)

**Benchmarks: Compression**
Vortex (geomean): 1.001x ➖
- unknown / unknown (1.002x ➖, 1↑ 0↓)

**Benchmarks: TPC-H SF=10 on S3**
Verdict: No clear signal (environment too noisy)
- datafusion / vortex-file-compressed (1.078x ➖, 0↑ 2↓)
- datafusion / vortex-compact (0.942x ➖, 2↑ 0↓)
- datafusion / parquet (0.964x ➖, 1↑ 0↓)
- duckdb / vortex-file-compressed (0.985x ➖, 0↑ 0↓)
- duckdb / vortex-compact (1.008x ➖, 0↑ 0↓)
- duckdb / parquet (1.034x ➖, 0↑ 1↓)
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
## Summary

Tracking issue: #7216
We have very little observability into the compressor. When debugging, we have no real visibility into which schemes the compressor is trying, how accurate its estimates are, how reliable sampling is, what the cascading paths look like, and so on.
This change adds structured `tracing` support to `vortex-compressor`. The compressor now emits a top-level `compress` span and decision/debug events on the `vortex_compressor::encode` target, so a normal tracing subscriber can see what the compressor sampled, selected, accepted or rejected, and where nested failures happened.

The `scheme.compress_result` event reports `scheme`, `before_nbytes`, `after_nbytes`, `estimated_ratio` when available, `actual_ratio` when available, and `accepted`. Sampling is recorded through `sample.result`; compression failures are recorded through `scheme.compress_failed`/`sample.compress_failed` with `cascade_path` and `cascade_depth`. Zero-byte outputs intentionally omit ratio fields instead of logging infinities.

This also adds JSON formatting to the benchmark logging setup via `--log-format json`, which makes `data-gen`/`compress-bench` output usable as JSONL. One useful workflow is to generate TPC-H data with compressor logs enabled and use `jq` to find over-optimistic estimates that were rejected.

Example jq query for rejected over-estimates
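The zero-byte rule above can be modeled with an `Option<f64>` ratio, so that a zero-byte output yields `None` rather than an infinite value in the log. This is a hypothetical helper for illustration, not the actual `vortex-compressor` code:

```rust
/// Hypothetical helper: compute a compression ratio for a log event, omitting
/// it (`None`) when the output is zero bytes instead of emitting infinity.
fn logged_ratio(before_nbytes: u64, after_nbytes: u64) -> Option<f64> {
    if after_nbytes == 0 {
        // Dividing by zero bytes would log an infinite ratio; omit the field.
        None
    } else {
        Some(before_nbytes as f64 / after_nbytes as f64)
    }
}
```

With this shape, a JSON formatter naturally drops the `actual_ratio` field for `None`, which keeps the JSONL output `jq`-friendly (no `Infinity` tokens, which are not valid JSON).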
## Testing

Some basic tracing tests (Claude-generated).