perf: idempotent_dir for dataset generation#7493
joseph-isaacs wants to merge 3 commits into develop
Benchmarks: PolarSignals Profiling
Vortex (geomean): 0.952x ➖
datafusion / vortex-file-compressed (0.952x ➖, 0↑ 0↓)

File Sizes: PolarSignals Profiling
No file size changes detected.

Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.934x ➖, 2↑ 1↓)
datafusion / vortex-compact (0.975x ➖, 0↑ 0↓)
datafusion / parquet (0.967x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (1.017x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.033x ➖, 0↑ 2↓)
duckdb / parquet (0.949x ➖, 0↑ 0↓)

File Sizes: FineWeb NVMe
No file size changes detected.

Benchmarks: TPC-H SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.051x ➖, 0↑ 2↓)
datafusion / vortex-compact (1.019x ➖, 0↑ 0↓)
datafusion / parquet (1.033x ➖, 0↑ 1↓)
datafusion / arrow (1.039x ➖, 0↑ 2↓)
duckdb / vortex-file-compressed (1.012x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.021x ➖, 0↑ 0↓)
duckdb / parquet (1.001x ➖, 0↑ 0↓)
duckdb / duckdb (1.009x ➖, 0↑ 0↓)

File Sizes: TPC-H SF=1 on NVME
No file size changes detected.

Benchmarks: TPC-DS SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.105x ❌, 0↑ 54↓)
datafusion / vortex-compact (1.098x ➖, 0↑ 44↓)
datafusion / parquet (1.102x ❌, 0↑ 46↓)
duckdb / vortex-file-compressed (1.084x ➖, 0↑ 30↓)
duckdb / vortex-compact (1.060x ➖, 2↑ 20↓)
duckdb / parquet (1.060x ➖, 0↑ 15↓)
duckdb / duckdb (1.067x ➖, 0↑ 25↓)

File Sizes: TPC-DS SF=1 on NVME
No file size changes detected.

Benchmarks: FineWeb S3
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.954x ➖, 0↑ 1↓)
datafusion / vortex-compact (0.830x ➖, 1↑ 0↓)
datafusion / parquet (0.976x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.003x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.887x ➖, 0↑ 0↓)
duckdb / parquet (0.973x ➖, 0↑ 0↓)

Benchmarks: TPC-H SF=10 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.993x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.026x ➖, 0↑ 3↓)
datafusion / parquet (0.991x ➖, 0↑ 0↓)
datafusion / arrow (0.989x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.148x ❌, 0↑ 20↓)
duckdb / vortex-compact (1.125x ❌, 0↑ 18↓)
duckdb / parquet (1.069x ➖, 0↑ 4↓)
duckdb / duckdb (1.017x ➖, 0↑ 2↓)

File Sizes: TPC-H SF=10 on NVME
No file size changes detected.

Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (0.997x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.990x ➖, 0↑ 0↓)
duckdb / parquet (0.981x ➖, 0↑ 0↓)

File Sizes: Statistical and Population Genetics
No file size changes detected.

Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (0.979x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.911x ➖, 0↑ 0↓)
datafusion / parquet (1.058x ➖, 1↑ 3↓)
duckdb / vortex-file-compressed (1.001x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.992x ➖, 0↑ 0↓)
duckdb / parquet (0.969x ➖, 0↑ 0↓)

Benchmarks: Clickbench on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.003x ➖, 1↑ 2↓)
datafusion / parquet (0.989x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.970x ➖, 7↑ 0↓)
duckdb / parquet (0.998x ➖, 0↑ 1↓)
duckdb / duckdb (0.975x ➖, 2↑ 0↓)

File Sizes: Clickbench on NVME
File Size Changes (1 files changed, -0.0% overall, 0↑ 1↓)

Benchmarks: Random Access
Vortex (geomean): 0.853x ✅
unknown / unknown (0.951x ➖, 8↑ 1↓)

Benchmarks: Compression
Vortex (geomean): 1.004x ➖
unknown / unknown (1.007x ➖, 1↑ 4↓)

Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (0.961x ➖, 1↑ 0↓)
datafusion / vortex-compact (0.907x ➖, 2↑ 0↓)
datafusion / parquet (0.819x ➖, 6↑ 0↓)
duckdb / vortex-file-compressed (1.017x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.021x ➖, 0↑ 0↓)
duckdb / parquet (1.048x ➖, 0↑ 0↓)
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
21dfaa0 to 95850b6
Add a `dir: &Path` parameter to `download_many`. On entry, it skips all downloads if `dir/.success` already exists; once the downloads succeed, it writes that marker so subsequent runs skip the whole batch. Call sites updated: clickbench partitioned, public_bi bzips, vector_dataset train shards.
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
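The marker-file pattern this commit describes can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from this PR: the helper name `idempotent_dir`, its closure-based signature, and the `bool` return value are assumptions made for the example, and the real `download_many` is async and takes different arguments.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Marker file name, matching the `.success` file described in the commit message.
const SUCCESS_MARKER: &str = ".success";

/// Hypothetical helper (not the PR's actual API): populate `dir` by calling
/// `generate`, unless an earlier run already left a success marker there.
/// Returns `Ok(true)` if `generate` ran and `Ok(false)` if it was skipped.
pub fn idempotent_dir<F>(dir: &Path, generate: F) -> io::Result<bool>
where
    F: FnOnce(&Path) -> io::Result<()>,
{
    let marker = dir.join(SUCCESS_MARKER);

    // A previous run completed successfully: skip the whole batch.
    if marker.exists() {
        return Ok(false);
    }

    fs::create_dir_all(dir)?;
    generate(dir)?;

    // Write the marker only after `generate` succeeded, so a failed or
    // interrupted run is retried in full next time.
    fs::write(&marker, b"")?;
    Ok(true)
}
```

Writing the marker last is the important design choice in this sketch: if generation fails partway through, no marker exists and the next run starts the batch again rather than trusting a partial directory.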
18703fd to 762e4c6
convert_parquet_directory_to_vortex(&base_path, CompactionStrategy::Default).await?;
}
// All conversions are only meaningful for local file URLs.
if benchmark.data_url().scheme() != "file" {
A bunch of this is already handled in the underlying benchmarks. If we want unified handling here, let's pull all of it out of the benchmarks and document that on the trait.
Have a `.success` file for dataset generation to speed up repeated benchmark runs.