Register DISTINCT_COUNT_APPROX logical marker in PPLFuncImpTable #5466
songkant-aws wants to merge 1 commit into
PPL parser fails with "Cannot resolve function: DISTINCT_COUNT_APPROX" on any execution path that does not construct OpenSearchExecutionEngine (unified-query / analytics-engine / DataFusion), because the UDAF was only registered to PPLFuncImpTable.aggExternalFunctionRegistry as a side effect of that constructor.

Add a logical marker class DistinctCountApproxLogicalAggFunction in core, expose it as PPLBuiltinOperators.DISTINCT_COUNT_APPROX, and register it inside PPLFuncImpTable.AggBuilder.populate() alongside other built-in aggregates. The marker has no JVM execution: init / add / result throw UnsupportedOperationException, mirroring the pattern already used by RelevanceQueryFunction.RelevanceQueryImplementor for match-family functions which similarly have no JVM semantics.

Behavior on V3 path is preserved: OpenSearchExecutionEngine still registers the real HyperLogLog++ DistinctCountApproxAggFunction in aggExternalFunctionRegistry, and getImplementation() consults that external registry first, so the marker is overridden whenever the V3 constructor has run. AggregateAnalyzer continues to translate the operator to OpenSearch cardinality DSL via the BuiltinFunctionName switch which is independent of the wrapped SqlAggFunction instance.

Operand metadata for the marker is null to match the existing external registration's permissive type policy and avoid introducing new type rejections that would surface as regressions in existing dc IT.

Signed-off-by: Songkan Tang <songkant@amazon.com>
public class DistinctCountApproxLogicalAggFunction
    implements UserDefinedAggFunction<DistinctCountApproxLogicalAggFunction.MarkerAccumulator> {
How many UDFs/UDAFs skip the JVM execution path now? How about adding a new parent class for them, such as:
class DistinctCountApproxLogicalAggFunction
    implements NativeSpecificFunction, UserDefinedAggFunction<DistinctCountApproxLogicalAggFunction.MarkerAccumulator>
There are only a couple of functions in this category today:
- GEOIP is a UDF that depends on NodeClient / OpenSearch-side lookup.
- DISTINCT_COUNT_APPROX is a UDAF whose OpenSearch implementation uses the HLL++ core search cardinality aggregation path.
Both are registered from OpenSearchExecutionEngine.registerOpenSearchFunctions():
https://github.com/opensearch-project/sql/blob/main/opensearch/src/main/java/org/opensearch/sql/opensearch/executor/OpenSearchExecutionEngine.java#L329-L355
That registration only happens on the SQL/PPL V3 OpenSearchExecutionEngine path. The unified-query / analytics-engine path does not construct OpenSearchExecutionEngine; it uses UnifiedQueryPlanner to produce a Calcite RelNode via CalciteRelNodeVisitor, then passes that RelNode to the backend planner. As a result, PPLFuncImpTable cannot rely on aggExternalFunctionRegistry being populated on that path.
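To make the failure mode concrete, here is a minimal sketch of a two-registry lookup in which the external registry is consulted first and the built-in registry acts as a fallback. The class RegistrySketch, its methods, and the string placeholders for implementations are hypothetical stand-ins for illustration only; they are not the actual PPLFuncImpTable code.

```java
import java.util.HashMap;
import java.util.Map;

final class RegistrySketch {
    // Hypothetical stand-ins for aggExternalFunctionRegistry and aggFunctionRegistry.
    private final Map<String, String> external = new HashMap<>();
    private final Map<String, String> builtin = new HashMap<>();

    void registerExternal(String name, String impl) {
        external.put(name, impl);
    }

    void registerBuiltin(String name, String impl) {
        builtin.put(name, impl);
    }

    // External registrations take precedence; built-ins are the fallback.
    String resolve(String name) {
        String impl = external.get(name);
        if (impl == null) {
            impl = builtin.get(name);
        }
        if (impl == null) {
            throw new IllegalStateException("Cannot resolve function: " + name);
        }
        return impl;
    }

    public static void main(String[] args) {
        RegistrySketch registry = new RegistrySketch();

        // Unified-query path before the change: nothing is registered, so lookup fails.
        try {
            registry.resolve("DISTINCT_COUNT_APPROX");
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }

        // With a built-in marker registered, lookup succeeds on every path.
        registry.registerBuiltin("DISTINCT_COUNT_APPROX", "marker");
        System.out.println(registry.resolve("DISTINCT_COUNT_APPROX"));

        // On the V3 path the engine constructor adds the real implementation, which wins.
        registry.registerExternal("DISTINCT_COUNT_APPROX", "hll");
        System.out.println(registry.resolve("DISTINCT_COUNT_APPROX"));
    }
}
```

With only the external registry populated by a constructor side effect, any path that skips that constructor hits the first branch; a built-in entry removes that dependency without changing what the V3 path resolves to.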
Description
PPL parser fails with "Cannot resolve function: DISTINCT_COUNT_APPROX" on any execution path that does not construct OpenSearchExecutionEngine (unified-query / analytics-engine / DataFusion), because the UDAF was only registered to PPLFuncImpTable.aggExternalFunctionRegistry as a side effect of that constructor (registerOpenSearchFunctions).

This change adds a logical marker UDAF in core that lets PPL parser succeed on every path. Backends are expected to push down or rewrite the operator before execution.

Layered registration
- core (this PR): register a marker DistinctCountApproxLogicalAggFunction in PPLFuncImpTable.AggBuilder.populate() via the new PPLBuiltinOperators.DISTINCT_COUNT_APPROX. Lookup precedence in PPLFuncImpTable.getImplementation() is aggExternalFunctionRegistry first, then aggFunctionRegistry.
- opensearch (unchanged): OpenSearchExecutionEngine.registerOpenSearchFunctions() still installs the real DistinctCountApproxAggFunction (HyperLogLog++) into aggExternalFunctionRegistry. Because external is consulted first, the V3 path always sees the real implementation once the constructor has run, and the marker is effectively overridden.

This is the same pattern already used by RelevanceQueryFunction.RelevanceQueryImplementor for match / match_phrase / etc.: relevance functions register a no-op operator whose RelevanceQueryImplementor.implement() throws UnsupportedOperationException, because their semantics live entirely on the OpenSearch index side. DISTINCT_COUNT_APPROX is in the same situation: real semantics on the OpenSearch side (cardinality aggregation) and on backends like DataFusion (approx_count_distinct).

Marker class behavior
init / add / result and the accumulator's value all throw UnsupportedOperationException with the message:

Reaching the body means a backend either failed to push down or did not register an adapter; surfacing this as a clear error rather than silently producing wrong results is the intended behavior.
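As an illustration of this fail-fast marker pattern, the sketch below defines a UDAF whose every entry point throws. The Accumulator and AggFunction interfaces are simplified stand-ins, not the project's real UserDefinedAggFunction contract, and the MESSAGE wording is invented for this example; only the throw-on-every-method behavior is taken from the description above.

```java
final class MarkerSketch {
    // Simplified, hypothetical stand-ins for the project's UDAF contract.
    interface Accumulator {
        Object value(Object... args);
    }

    interface AggFunction<S extends Accumulator> {
        S init();

        S add(S acc, Object... values);

        Object result(S acc);
    }

    // Hypothetical wording; the actual message in the PR is not reproduced here.
    static final String MESSAGE =
        "DISTINCT_COUNT_APPROX has no JVM implementation; "
            + "the backend must push it down or rewrite it";

    static UnsupportedOperationException unsupported() {
        return new UnsupportedOperationException(MESSAGE);
    }

    // The accumulator is never meant to hold state, so reading it fails too.
    static final class MarkerAccumulator implements Accumulator {
        @Override
        public Object value(Object... args) {
            throw unsupported();
        }
    }

    // Every entry point throws: the class exists only so the name resolves at plan time.
    static final class LogicalMarker implements AggFunction<MarkerAccumulator> {
        @Override
        public MarkerAccumulator init() {
            throw unsupported();
        }

        @Override
        public MarkerAccumulator add(MarkerAccumulator acc, Object... values) {
            throw unsupported();
        }

        @Override
        public Object result(MarkerAccumulator acc) {
            throw unsupported();
        }
    }

    public static void main(String[] args) {
        try {
            new LogicalMarker().init();
        } catch (UnsupportedOperationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Throwing from all four methods, rather than returning a default such as 0 or null, means a missed pushdown surfaces as an immediate error instead of a plausible but wrong count.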
V3 path behavior is preserved
- AggregateAnalyzer translates DISTINCT_COUNT_APPROX to OpenSearch cardinality aggregation through a BuiltinFunctionName switch, independent of the wrapped SqlAggFunction instance. This is unchanged by this PR.
- RexImpTable.getAggImplementor reads SqlUserDefinedAggFunction.function, which is the real DistinctCountApproxAggFunction registered by OpenSearchExecutionEngine (external takes precedence). This is also unchanged.
- Operand metadata for the marker is null to match the existing external registration's permissive type policy. No new type rejection is introduced.

How this unblocks unified-query / DataFusion path
RestUnifiedQueryAction does not construct OpenSearchExecutionEngine; before this change, dc() / distinct_count() / distinct_count_approx() failed at PPL parse stage on that path. After this change, the parser succeeds via the marker, and the downstream backend (e.g. analytics-engine's DataFusion adapter) rewrites RexOver(DISTINCT_COUNT_APPROX) to APPROX_COUNT_DISTINCT before substrait emission. End-to-end verified locally with the analytics-engine REST IT in the OpenSearch sandbox.

Related Issues
Check List
--signoff or -s.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.