Implement bin command functionality in OpenSearch PPL
Summary
Implement a bin command in OpenSearch PPL for data discretization and binning. The command would extend what the existing span() function offers into a standalone, more flexible binning interface.
Background
OpenSearch PPL currently has a span() function that provides basic binning within stats operations. A dedicated bin command would cover binning cases that span() does not, which would be valuable for data analysis workflows.
Current PPL span() function capabilities:
- Basic numerical binning: span(age, 10)
- Works within stats aggregations
- Always becomes the first grouping key
- Limited to simple interval-based binning
Additional binning capabilities we want to add:
- Standalone binning operation (not just within stats)
- Time-based binning with various time scales
- Custom start/end range specification
- Flexible bin count specification
- Time alignment options
- Field aliasing support
Proposed Implementation
Create a dedicated bin command with comprehensive binning syntax:
Complete Bin Command Syntax
bin <field> [span=<value>] [bins=<number>] [minspan=<value>] [aligntime=<value>] [start=<value>] [end=<value>] [AS <alias>]
Detailed Feature Requirements
1. Core Binning Parameters
span=<value> - Fixed interval binning
source=logs | bin timestamp span=3600000 | stats count() by timestamp
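For reviewers, fixed-interval binning can be sketched as flooring each value to a multiple of the span. This is only an illustration in Python, not the proposed Java/Calcite implementation. The function name bin_floor is hypothetical, and labelling a bin by its lower edge is an assumption.

```python
def bin_floor(value, span):
    """Return the lower edge of the fixed-width bin that contains value.

    Hypothetical helper: assumes bins are labelled by their lower edge
    and are anchored at zero (no aligntime).
    """
    if span <= 0:
        raise ValueError("span must be positive")
    # Floor division also handles negative values: -5 // 10 == -1.
    return (value // span) * span


if __name__ == "__main__":
    # Epoch-millisecond timestamp bucketed into one-hour (3600000 ms) bins.
    print(bin_floor(1640998800123, 3600000))
    # Numeric field bucketed into bins of width 10.
    print(bin_floor(37, 10))
```

With span=3600000, every timestamp in the same hour maps to the same lower edge, which is what makes the later stats count() by timestamp group correctly.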
bins=<number> - Fixed number of equal-width bins
source=sales | bin price bins=50 | stats count() by price
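One possible reading of bins=<number> is sketched below: divide the observed range of the field into exactly that many equal-width bins. The helper name equal_width_bin and the choice to place the maximum value in the last bin are assumptions. The issue does not say whether the final command should use exactly N bins or round to a "nice" span that yields at most N bins.

```python
def equal_width_bin(value, lo, hi, bins):
    """Lower edge of the equal-width bin containing value.

    Hypothetical helper: lo and hi are the observed minimum and maximum
    of the field, and bins is the requested number of bins.
    """
    if bins <= 0:
        raise ValueError("bins must be positive")
    if hi <= lo:
        # Degenerate range: everything falls into a single bin.
        return lo
    width = (hi - lo) / bins
    index = int((value - lo) // width)
    # Clamp so that out-of-range values and the maximum itself
    # land in the first or last bin instead of creating new ones.
    index = max(0, min(index, bins - 1))
    return lo + index * width


if __name__ == "__main__":
    # Prices between 0 and 100 split into 50 bins of width 2.
    for price in (0, 5, 99.9, 100):
        print(price, "->", equal_width_bin(price, 0, 100, 50))
```

Note that this approach needs the minimum and maximum of the field before any row can be assigned to a bin, so an implementation would need either a pre-aggregation pass or explicit start/end bounds.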
minspan=<value> - Automatic span calculation with minimum interval
source=metrics | bin timestamp minspan=300000 | stats count() by timestamp
aligntime=<value> - Time alignment for consistent bucketing
-- Align to earliest time
source=events | bin timestamp span=86400000 aligntime=earliest | stats count() by timestamp
-- Align to specific timestamp
source=logs | bin timestamp span=3600000 aligntime=1640995200000 | stats count() by timestamp
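The alignment behaviour can be sketched with a common formula: shift the bin grid so that one bin edge falls on the alignment point, then floor as usual. The function names below are hypothetical, and treating aligntime=earliest as "the smallest timestamp in the result set" is an assumption rather than something this issue specifies.

```python
def aligned_bin_start(ts_ms, span_ms, align_ms=0):
    """Lower edge of the bin containing ts_ms when bin edges sit at
    align_ms + k * span_ms for integer k.

    Equivalent to floor((ts - align) / span) * span + align.
    """
    if span_ms <= 0:
        raise ValueError("span_ms must be positive")
    offset = align_ms % span_ms
    return ((ts_ms - offset) // span_ms) * span_ms + offset


def bin_with_earliest(timestamps, span_ms):
    """Assumed semantics for aligntime=earliest: align the grid to the
    smallest timestamp, so the first bin starts exactly there."""
    earliest = min(timestamps)
    return [aligned_bin_start(t, span_ms, earliest) for t in timestamps]


if __name__ == "__main__":
    # Grid aligned to 100 with span 600: edges at ..., -500, 100, 700, 1300, ...
    print(aligned_bin_start(1000, 600, 100))
    # Aligning to the earliest event (130) with span 1000.
    print(bin_with_earliest([130, 250, 1130], 1000))
```

Only the remainder of the alignment point modulo the span affects the grid, so aligning to any timestamp that is already a multiple of the span gives the same buckets as no alignment at all.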
start=<num> / end=<num> - Range-bounded binning
source=data | bin value span=100 start=0 end=10000 | stats count() by value
2. Time-Based Binning
Support time scale units:
- Seconds: s, sec, secs, second, seconds
- Minutes: m, min, mins, minute, minutes
- Hours: h, hr, hrs, hour, hours
- Days: d, day, days
- Months: mon, month, months
- Subseconds: us, ms, cs, ds
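To make the unit list concrete, here is a rough sketch of converting a span string such as 5min into a fixed duration. Several things are assumptions: the function name parse_span_micros, the <integer><unit> form with no separator, case-sensitive matching, and the reading of cs and ds as centiseconds and deciseconds. Months are rejected here because a calendar month has no fixed length and would need calendar-aware bucketing instead.

```python
import re

# Duration of each fixed-length unit in microseconds (integers avoid
# floating-point error for sub-millisecond units).
_UNIT_GROUPS = [
    (("us",), 1),
    (("ms",), 1_000),
    (("cs",), 10_000),        # assumed: centiseconds
    (("ds",), 100_000),       # assumed: deciseconds
    (("s", "sec", "secs", "second", "seconds"), 1_000_000),
    (("m", "min", "mins", "minute", "minutes"), 60_000_000),
    (("h", "hr", "hrs", "hour", "hours"), 3_600_000_000),
    (("d", "day", "days"), 86_400_000_000),
]
_UNIT_MICROS = {name: micros for names, micros in _UNIT_GROUPS for name in names}
_MONTH_UNITS = {"mon", "month", "months"}
_SPAN_RE = re.compile(r"^(\d+)([a-z]+)$")


def parse_span_micros(text):
    """Convert a span like '5min' to microseconds (hypothetical helper)."""
    match = _SPAN_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a <number><unit> span: {text!r}")
    count, unit = int(match.group(1)), match.group(2)
    if unit in _MONTH_UNITS:
        raise ValueError("month spans are calendar-based, not fixed-length")
    if unit not in _UNIT_MICROS:
        raise ValueError(f"unknown time unit: {unit!r}")
    return count * _UNIT_MICROS[unit]


if __name__ == "__main__":
    for span in ("1h", "5min", "10cs", "250ms"):
        print(span, "->", parse_span_micros(span), "us")
```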
3. Field Aliasing
Support AS <newfield> syntax for all binning operations:
source=accounts | bin age span=10 AS age_group | stats count() by age_group
Implementation Requirements
Core Components
- Grammar: Extend PPL lexer/parser with bin command tokens and syntax rules
- AST: Create/update Bin node to support all parameters
- Logic: Implement binning algorithms including time alignment formulas
- Integration: Full Calcite integration with proper relational algebra generation
Key Features
- Time alignment supporting earliest, latest, and specific timestamp values
- Type safety: time-specific parameters only apply to temporal fields
- Flexible parameter combinations for complex binning scenarios
- Locale-independent number formatting
- Backward compatibility with existing PPL commands
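As a discussion aid for the type-safety point, the parameter set and one validation rule could be modelled roughly as follows. BinSpec and validate are hypothetical names, not the actual AST node. Only aligntime is treated as time-specific here; the full set of time-specific parameters and the remaining checks are assumptions for the implementation to settle.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BinSpec:
    """Hypothetical container mirroring the proposed bin syntax."""
    field: str
    span: Optional[str] = None
    bins: Optional[int] = None
    minspan: Optional[str] = None
    aligntime: Optional[str] = None
    start: Optional[float] = None
    end: Optional[float] = None
    alias: Optional[str] = None


def validate(spec: BinSpec, field_is_temporal: bool) -> List[str]:
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    # Type safety: time alignment is meaningless for non-temporal fields.
    if spec.aligntime is not None and not field_is_temporal:
        errors.append("aligntime applies only to temporal fields")
    # Assumed sanity checks, not stated in the issue.
    if spec.bins is not None and spec.bins <= 0:
        errors.append("bins must be a positive integer")
    if spec.start is not None and spec.end is not None and spec.start >= spec.end:
        errors.append("start must be less than end")
    return errors


if __name__ == "__main__":
    bad = BinSpec(field="age", span="10", aligntime="earliest")
    print(validate(bad, field_is_temporal=False))
```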
Usage Examples
Basic Time Series Binning
-- Hourly buckets aligned to the earliest event time
source=apache_logs | bin timestamp span=3600000 aligntime=earliest AS hourly | stats count() by hourly
-- Daily buckets with custom alignment
source=sales_data | bin order_date span=86400000 aligntime=1640995200000 AS daily_sales
Advanced Histogram Analysis
-- Price distribution with 20 equal bins
source=products | bin price bins=20 AS price_ranges | stats count() by price_ranges
-- Revenue analysis with adaptive binning
source=transactions | bin amount minspan=1000 AS revenue_tiers | stats sum(profit) by revenue_tiers
Complex Multi-Dimensional Binning
-- Time + amount analysis
source=orders |
bin order_time span=7200000 aligntime=earliest AS two_hour_windows |
bin order_value bins=10 AS value_tiers |
stats count(), avg(shipping_cost) by two_hour_windows, value_tiers
Test Coverage Required
- Unit tests for bin command parsing and AST generation
- Integration tests for end-to-end functionality
- Calcite tests for logical plan generation
- Performance tests with various bin configurations
- Locale compatibility tests
Documentation
docs/user/ppl/cmd/bin.rst - Complete parameter documentation with examples