[RFC] Implement bin command in PPL #3876

@ahkcs

Implement bin command functionality in OpenSearch PPL

Summary

Implement a bin command in OpenSearch PPL for data discretization and binning. The command would build on the existing span() function and expose binning as a standalone operation with more options.

Background

OpenSearch PPL currently has a span() function that provides basic binning within stats operations. A dedicated bin command would cover binning cases that span() cannot express, which matters for data analysis workflows.

Current PPL span() function capabilities:

  • Basic numerical binning: span(age, 10)
  • Works within stats aggregations
  • Always becomes the first grouping key
  • Limited to simple interval-based binning

Additional binning capabilities we want to add:

  • Standalone binning operation (not just within stats)
  • Time-based binning with various time scales
  • Custom start/end range specification
  • Flexible bin count specification
  • Time alignment options
  • Field aliasing support

Proposed Implementation

Create a dedicated bin command that accepts the parameters below:

Complete Bin Command Syntax

bin <field> [span=<value>] [bins=<number>] [minspan=<value>] [aligntime=<value>] [start=<value>] [end=<value>] [AS <alias>]

Detailed Feature Requirements

1. Core Binning Parameters

span=<value> - Fixed interval binning

source=logs | bin timestamp span=3600000 | stats count() by timestamp
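
A minimal Python sketch of what fixed-interval binning could compute, assuming each value is mapped to the lower bound of its bucket (the RFC does not specify the output label format). The name bin_by_span is hypothetical and is not part of PPL.

```python
def bin_by_span(value, span):
    """Return the lower bound of the span-wide bucket that contains value."""
    if span <= 0:
        raise ValueError("span must be positive")
    # Floor division keeps negative values in the correct bucket as well.
    return value // span * span
```

With span=3600000, the millisecond timestamp 1640998800123 maps to 1640998800000, the start of its hour.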

bins=<number> - Fixed number of equal-width bins

source=sales | bin price bins=50 | stats count() by price
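
One possible reading of bins=<number>, sketched in Python: divide the observed range [min, max] into exactly that many equal-width buckets. This is an assumption. The RFC does not say whether widths are exact or rounded to "nice" values, and equal_width_bins is a hypothetical helper.

```python
def equal_width_bins(values, bins):
    """Map each value to the lower bound of its equal-width bucket."""
    if bins <= 0:
        raise ValueError("bins must be positive")
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        # All values are identical, so they share a single bucket.
        return [lo for _ in values]
    result = []
    for v in values:
        # Clamp so that the maximum value falls into the last bucket.
        index = min(int((v - lo) / width), bins - 1)
        result.append(lo + index * width)
    return result
```

Note that this approach needs the minimum and maximum of the field before any row can be binned, so it implies a pass over the data (or an aggregation) ahead of the binning step.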

minspan=<value> - Automatic span calculation with minimum interval

source=metrics | bin timestamp minspan=300000 | stats count() by timestamp
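
The RFC does not define how the automatic span is derived. As a sketch under stated assumptions, the Python below computes a candidate span from the data range and a target bucket count, then refuses to go below minspan. Both auto_span and its target_bins default are hypothetical.

```python
def auto_span(lo, hi, minspan, target_bins=100):
    """Pick a span for the range [lo, hi] that is never smaller than minspan."""
    if minspan <= 0 or target_bins <= 0:
        raise ValueError("minspan and target_bins must be positive")
    candidate = (hi - lo) / target_bins  # span that would yield target_bins buckets
    return max(candidate, minspan)
```

For a narrow range the minimum wins (auto_span(0, 1000, 300) gives 300), while a wide range produces a larger span (auto_span(0, 100000, 300) gives 1000.0).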

aligntime=<value> - Time alignment for consistent bucketing

-- Align to earliest time
source=events | bin timestamp span=86400000 aligntime=earliest | stats count() by timestamp

-- Align to specific timestamp
source=logs | bin timestamp span=3600000 aligntime=1640995200000 | stats count() by timestamp
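
A common way to express aligned bucketing is to shift by an origin, floor to the span, and shift back. The Python sketch below assumes this formula and assumes that earliest and latest resolve to the minimum and maximum timestamp in the data. The RFC mentions "time alignment formulas" without spelling them out, and both function names are hypothetical.

```python
def resolve_origin(aligntime, timestamps):
    """Turn an aligntime setting into a concrete origin timestamp."""
    if aligntime == "earliest":
        return min(timestamps)
    if aligntime == "latest":
        return max(timestamps)
    return int(aligntime)  # an explicit timestamp, e.g. 1640995200000


def bin_aligned(ts, span, origin):
    """Return the start of the bucket containing ts, with buckets anchored at origin."""
    return (ts - origin) // span * span + origin


timestamps = [250, 900, 1300, 2260]
origin = resolve_origin("earliest", timestamps)
aligned = [bin_aligned(t, 1000, origin) for t in timestamps]
# aligned == [250, 250, 1250, 2250]
```

Bucket boundaries now fall at origin, origin + span, origin + 2 * span, and so on, instead of at multiples of span.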

start=<value> / end=<value> - Range-bounded binning

source=data | bin value span=100 start=0 end=10000 | stats count() by value
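
The RFC does not state how values outside the range are handled or where buckets are anchored. The Python sketch below assumes a half-open range [start, end), buckets anchored at start, and None for out-of-range values. All three choices, and the name bin_in_range, are assumptions made for illustration.

```python
def bin_in_range(value, span, start, end):
    """Bin value into span-wide buckets anchored at start; None if out of range."""
    if span <= 0:
        raise ValueError("span must be positive")
    if not (start <= value < end):
        return None  # assumption: out-of-range values are left unbinned
    return (value - start) // span * span + start
```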

2. Time-Based Binning

Support time scale units:

  • Seconds: s, sec, secs, second, seconds
  • Minutes: m, min, mins, minute, minutes
  • Hours: h, hr, hrs, hour, hours
  • Days: d, day, days
  • Months: mon, month, months
  • Subseconds: us, ms, cs, ds
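
The unit list above could be handled by a small parser. The Python sketch below assumes span values are written as an integer followed by a unit (for example 5m or 1h), which the RFC examples do not show since they use raw milliseconds. It returns microseconds so that us stays an integer, and it rejects month units because a month has no fixed length. parse_span_us is a hypothetical name.

```python
import re

# Unit spellings from the RFC, mapped to their length in microseconds.
_UNIT_GROUPS = [
    (("us",), 1),
    (("ms",), 1_000),
    (("cs",), 10_000),
    (("ds",), 100_000),
    (("s", "sec", "secs", "second", "seconds"), 1_000_000),
    (("m", "min", "mins", "minute", "minutes"), 60_000_000),
    (("h", "hr", "hrs", "hour", "hours"), 3_600_000_000),
    (("d", "day", "days"), 86_400_000_000),
]
_UNIT_US = {name: factor for names, factor in _UNIT_GROUPS for name in names}
_MONTH_UNITS = {"mon", "month", "months"}
_SPAN_RE = re.compile(r"^(\d+)([a-z]+)$")


def parse_span_us(text):
    """Convert a span such as '5m' or '10cs' into microseconds."""
    match = _SPAN_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid span: {text!r}")
    count, unit = int(match.group(1)), match.group(2)
    if unit in _MONTH_UNITS:
        # Months vary in length, so they need calendar-aware binning instead.
        raise ValueError("month spans cannot be converted to a fixed duration")
    if unit not in _UNIT_US:
        raise ValueError(f"unknown time unit: {unit!r}")
    return count * _UNIT_US[unit]
```

Treating months separately is a design choice of this sketch: fixed-width arithmetic works for every other unit, while month buckets would have to be computed from calendar fields.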

3. Field Aliasing

Support AS <newfield> syntax for all binning operations:

source=accounts | bin age span=10 AS age_group | stats count() by age_group
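
To illustrate the intended effect of the query above, the Python sketch below bins age into a new age_group field and then counts rows per group. It assumes that AS writes to a new field and leaves the original untouched, which the RFC implies but does not state. The sample rows and the helper bin_as are made up for illustration.

```python
from collections import Counter

# Made-up sample rows standing in for the accounts index.
accounts = [{"age": 23}, {"age": 28}, {"age": 35}, {"age": 41}, {"age": 47}]


def bin_as(rows, field, span, alias):
    """Add alias to every row, holding the lower bound of the row's bucket."""
    return [{**row, alias: row[field] // span * span} for row in rows]


binned = bin_as(accounts, "age", 10, "age_group")
counts = Counter(row["age_group"] for row in binned)
# counts == Counter({20: 2, 40: 2, 30: 1})
```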

Implementation Requirements

Core Components

  • Grammar: Extend PPL lexer/parser with bin command tokens and syntax rules
  • AST: Create/update Bin node to support all parameters
  • Logic: Implement binning algorithms including time alignment formulas
  • Integration: Full Calcite integration with proper relational algebra generation

Key Features

  • Time alignment supporting earliest, latest, and specific timestamp values
  • Type safety: time-specific parameters only apply to temporal fields
  • Flexible parameter combinations for complex binning scenarios
  • Locale-independent number formatting
  • Backward compatibility with existing PPL commands
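
The type-safety requirement can be pictured as a validation step on the parsed command. The Python sketch below models the Bin node as a dataclass whose fields mirror the proposed syntax and rejects aligntime on non-temporal fields. The class name BinNode, its field types, and the validate method are hypothetical, and treating aligntime as the only time-specific parameter is an assumption. The actual implementation would live in the Java AST.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class BinNode:
    """Parsed form of: bin <field> [span=] [bins=] [minspan=] [aligntime=] [start=] [end=] [AS alias]."""
    field: str
    span: Optional[Union[int, str]] = None
    bins: Optional[int] = None
    minspan: Optional[Union[int, str]] = None
    aligntime: Optional[Union[int, str]] = None
    start: Optional[float] = None
    end: Optional[float] = None
    alias: Optional[str] = None

    def validate(self, field_is_temporal: bool) -> None:
        """Raise ValueError if the parameters do not fit the field type."""
        if self.aligntime is not None and not field_is_temporal:
            raise ValueError("aligntime only applies to temporal fields")
        if self.bins is not None and self.bins <= 0:
            raise ValueError("bins must be a positive integer")
```

For example, BinNode("timestamp", span=3600000, aligntime="earliest").validate(True) passes, while the same aligntime on a numeric price field raises ValueError.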

Usage Examples

Basic Time Series Binning

-- Hourly buckets aligned to the earliest event time
source=apache_logs | bin timestamp span=3600000 aligntime=earliest AS hourly | stats count() by hourly

-- Daily buckets with custom alignment
source=sales_data | bin order_date span=86400000 aligntime=1640995200000 AS daily_sales

Advanced Histogram Analysis

-- Price distribution with 20 equal bins
source=products | bin price bins=20 AS price_ranges | stats count() by price_ranges

-- Revenue analysis with adaptive binning
source=transactions | bin amount minspan=1000 AS revenue_tiers | stats sum(profit) by revenue_tiers

Complex Multi-Dimensional Binning

-- Time + amount analysis
source=orders |
bin order_time span=7200000 aligntime=earliest AS two_hour_windows |
bin order_value bins=10 AS value_tiers |
stats count(), avg(shipping_cost) by two_hour_windows, value_tiers

Test Coverage Required

  • Unit tests for bin command parsing and AST generation
  • Integration tests for end-to-end functionality
  • Calcite tests for logical plan generation
  • Performance tests with various bin configurations
  • Locale compatibility tests

Documentation

  • docs/user/ppl/cmd/bin.rst - Complete parameter documentation with examples

Metadata

Labels

PPL (Piped processing language), enhancement (New feature or request), v3.3.0
