Skip to content

Added UDT for IP and Binary support#5463

Open
vinaykpud wants to merge 3 commits into
opensearch-project:mainfrom
vinaykpud:feature/udt_ip_binary
Open

Added UDT for IP and Binary support#5463
vinaykpud wants to merge 3 commits into
opensearch-project:mainfrom
vinaykpud:feature/udt_ip_binary

Conversation

@vinaykpud
Copy link
Copy Markdown
Contributor

@vinaykpud vinaykpud commented May 22, 2026

Description

Companion change for opensearch-project/OpenSearch#21807 which adds Calcite UDTs (IpType, BinaryType) for OpenSearch ip and binary columns to the analytics-engine path. This PR teaches the SQL plugin to recognize the new UDTs at the response-schema boundary so:

  • byte[] cells from the analytics backend get rendered as canonical IP strings / base64 strings (was: unsupported object class [B).
  • The response schema reports "type": "ip" / "type": "binary" correctly (was: both labelled "binary").
  • Existing comparison and CIDRMATCH query shapes keep working unchanged — the UDT is invisible to Calcite's planner-internal coercion machinery.

Background

OpenSearchSchemaBuilder (in the OpenSearch sandbox) used to map both ip and binary field types to plain SqlTypeName.VARBINARY. That collapse caused three downstream bugs in this plugin's PPL path through the analytics engine:

  1. The DataFusion backend ships back a 16-byte ipv4-mapped-ipv6 buffer; AnalyticsExecutionEngine.convertRows calls ExprValueUtils.fromObjectValue(byte[]) which has no byte[] case → unsupported object class [B.
  2. The response schema reports "type": "binary" for an IP column (because convertRelDataTypeToExprType(VARBINARY) → BINARY ExprType).
  3. CIDRMATCH against a column was registered with PPLTypeChecker.family(SqlTypeFamily.BINARY, SqlTypeFamily.STRING), with the byte-range expansion living inline at the registration site.

The companion OpenSearch PR introduces IpType / BinaryType (Calcite UDTs that extend AbstractSqlType with VARBINARY underneath) and moves CIDRMATCH dispatch into a backend ScalarFunctionAdapter.

Approach

Two response-boundary changes plus one cleanup, all minimal:

1. Result-column UDT recognition (OpenSearchTypeFactory).
New sibling function convertAnalyticsEngineRelDataTypeToExprType does an instanceof dispatch:

if (type instanceof IpType) return IP;
if (type instanceof BinaryType) return BINARY;
return convertRelDataTypeToExprType(type);

The original convertRelDataTypeToExprType is deliberately unchanged — Calcite's coercion machinery round-trips through it, so returning IP ExprType for a VARBINARY column synthesizes IP(string) casts that DataFusion can't resolve. Keeping UDT recognition in a sibling function isolates it to the response-schema path.

2. Per-column UDT dispatch in row conversion (AnalyticsExecutionEngine.convertRows).
New static helper toExprValue(Object value, RelDataType type):

  • byte[] + IpTypeInetAddresses.toAddrString(InetAddress.getByAddress(bytes)) (matches IpFieldMapper.valueFetcher semantics: dotted-quad for IPv4 / IPv4-mapped, RFC 5952 for pure IPv6).
  • byte[] + BinaryTypeBase64.getEncoder().encodeToString(bytes) (matches the OpenSearch binary field wire contract).
  • Otherwise → ExprValueUtils.fromObjectValue(value) unchanged.

UnknownHostException from a malformed IP buffer falls through to the default handler so the user sees a clear error rather than a malformed address string.

3. CIDRMATCH cleanup (PPLFuncImpTable).

  • Removes udf/ip/CidrMatchAdapter and its inline registration. CIDRMATCH dispatch now lives in CidrMatchFunctionAdapter on the backend (in the companion OpenSearch PR), which means it serves both the production SQL-plugin variant and the sandbox test front-end with a single implementation.
  • Collapses the two-line registration (typecheck + withRexBuilderShim) to a single typecheck-only registerOperator(CIDRMATCH, PPLBuiltinOperators.CIDRMATCH).
  • The runtime CidrMatchFunction UDF stays as the dynamic last-resort fallback.

Testing

Unit tests added:

  • OpenSearchTypeFactoryTest:

    • testConvertResultColumnIpTypeReturnsIpExprTypeIpTypeExprCoreType.IP.
    • testConvertResultColumnBinaryTypeReturnsBinaryExprTypeBinaryTypeExprCoreType.BINARY.
    • testConvertResultColumnPlainVarbinaryFallsBackToBinary — plain VARBINARY (no UDT) keeps returning BINARY.
    • testConvertResultColumnDelegatesParityForNonUdtTypes — for every non-UDT RelDataType, the result-column variant must agree with the planner-internal variant. Drift here would mean response-schema labels diverge from what Calcite's coercion sees.
  • AnalyticsExecutionEngineTest:

    • executeRelNode_ipColumnRendersAsAddressString — both ipv4-mapped IPv6 (::ffff:1.2.3.4"1.2.3.4") and pure IPv6 (::1) buffers render to the right canonical form, schema reports "ip".
    • executeRelNode_binaryColumnRendersAsBase64byte[] payload base64-encodes to match OpenSearch wire format, schema reports "binary".

Existing tests still pass. [B-class regression caught at: physicalPlanExecute_callsOnFailure (already in the file, exercises the same converter dispatch).

End-to-end (against a single-node gradle run cluster with the companion OpenSearch PR applied):

  • cast(host as STRING) returns "1.2.3.4" / "::1" (was a 16-byte garbage buffer).
  • cast(blob as STRING) matches | fields blob (base64-encoded).
  • where host = '1.2.3.4' coerces cleanly with no Unable to convert call IP(string) regression.
  • cidrmatch(host, '1.2.3.0/24') (column form) and CIDRMATCH('1.2.3.4', '1.2.3.0/24') (literal form) both return correct row counts via the backend adapter.
  • Response schema labels: "type": "ip" for ip columns, "type": "binary" for binary columns.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 22, 2026

PR Reviewer Guide 🔍

(Review updated until commit c854bbd)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Possible Issue

The toExprValue method catches UnknownHostException when converting IP byte arrays but wraps it in an IllegalStateException with a message about "invalid IP buffer length". However, InetAddress.getByAddress(byte[]) only throws UnknownHostException when the byte array length is not 4 or 16. This exception handling is correct but the error message could be misleading if the actual cause is something other than length (though in practice length is the only cause). More critically, if the byte array has an invalid length, the exception is thrown but the method continues to fall through to ExprValueUtils.fromObjectValue(value) in the catch block's absence of a return/rethrow, which would then fail with the original "unsupported object class [B" error. The catch block should rethrow or the method should not continue past the catch.

try {
  return ExprValueUtils.stringValue(
      InetAddresses.toAddrString(InetAddress.getByAddress(bytes)));
} catch (UnknownHostException e) {
  throw new IllegalStateException("invalid IP buffer length: " + bytes.length, e);
}
Design Concern

The new convertAnalyticsEngineRelDataTypeToExprType method is documented as "result-schema-only" and warns against using it in the general path because it would cause Calcite to synthesize CAST operations that the substrait converter cannot handle. However, there is no runtime enforcement to prevent accidental use in the planner path. If a developer mistakenly calls this method during query planning instead of schema building, it could introduce subtle bugs that only manifest when IP/Binary UDTs are involved. Consider adding a runtime check or making the method package-private with a more restrictive name to reduce misuse risk.

public static ExprType convertAnalyticsEngineRelDataTypeToExprType(RelDataType type) {
  if (type instanceof IpType) {
    return IP;
  }
  if (type instanceof BinaryType) {
    return BINARY;
  }
  return convertRelDataTypeToExprType(type);
}

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 22, 2026

PR Code Suggestions ✨

Latest suggestions up to c854bbd

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Add null value handling

When value is null, the method will pass it to fromObjectValue, which may not handle
null values correctly for UDT types. Add an explicit null check at the beginning to
ensure consistent null handling across all type conversions.

core/src/main/java/org/opensearch/sql/executor/analytics/AnalyticsExecutionEngine.java [156-170]

 private static ExprValue toExprValue(Object value, RelDataType type) {
+  if (value == null) {
+    return ExprValueUtils.nullValue();
+  }
   if (value instanceof byte[] bytes) {
     if (type instanceof IpType) {
       try {
         return ExprValueUtils.stringValue(
             InetAddresses.toAddrString(InetAddress.getByAddress(bytes)));
       } catch (UnknownHostException e) {
         throw new IllegalStateException("invalid IP buffer length: " + bytes.length, e);
       }
     } else if (type instanceof BinaryType) {
       return ExprValueUtils.stringValue(Base64.getEncoder().encodeToString(bytes));
     }
   }
   return ExprValueUtils.fromObjectValue(value);
 }
Suggestion importance[1-10]: 7

__

Why: The suggestion addresses a potential issue where null values might not be handled consistently for UDT types. Adding explicit null handling at the beginning ensures consistent behavior and prevents potential issues with fromObjectValue, making this a valuable defensive programming practice.

Medium
Improve error message detail

The UnknownHostException is thrown only when the byte array length is invalid (not 4
or 16 bytes). However, wrapping it in IllegalStateException loses the original
exception context. Consider rethrowing the original exception or providing more
detailed error information including the actual byte array content for debugging.

core/src/main/java/org/opensearch/sql/executor/analytics/AnalyticsExecutionEngine.java [158-164]

 if (value instanceof byte[] bytes) {
   if (type instanceof IpType) {
     try {
       return ExprValueUtils.stringValue(
           InetAddresses.toAddrString(InetAddress.getByAddress(bytes)));
     } catch (UnknownHostException e) {
-      throw new IllegalStateException("invalid IP buffer length: " + bytes.length, e);
+      throw new IllegalStateException(
+          String.format("invalid IP buffer length: %d (expected 4 or 16 bytes)", bytes.length), e);
     }
   } else if (type instanceof BinaryType) {
     return ExprValueUtils.stringValue(Base64.getEncoder().encodeToString(bytes));
   }
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion improves error message clarity by explicitly stating the expected byte array lengths (4 or 16 bytes). However, this is a minor enhancement to error handling that doesn't fix a bug or significantly improve functionality.

Low

Previous suggestions

Suggestions up to commit 90defce
CategorySuggestion                                                                                                                                    Impact
General
Handle null values explicitly

When row.length < fields.size(), missing cells default to null and are passed to
toExprValue. If value is null but type is IpType or BinaryType, the method skips the
UDT branches and delegates to fromObjectValue(null). Verify that fromObjectValue
correctly handles null for these UDT types, or add an explicit null-check before the
instanceof byte[] test to ensure consistent null handling.

core/src/main/java/org/opensearch/sql/executor/analytics/AnalyticsExecutionEngine.java [131-135]

 for (int i = 0; i < fields.size(); i++) {
   RelDataTypeField field = fields.get(i);
   Object value = (i < row.length) ? row[i] : null;
-  valueMap.put(field.getName(), toExprValue(value, field.getType()));
+  if (value == null) {
+    valueMap.put(field.getName(), ExprValueUtils.nullValue());
+  } else {
+    valueMap.put(field.getName(), toExprValue(value, field.getType()));
+  }
 }
Suggestion importance[1-10]: 7

__

Why: Valid suggestion to explicitly handle null values before passing them to toExprValue. This ensures consistent null handling and avoids potential issues with UDT type checking on null values, though the current code likely works correctly via fromObjectValue.

Medium
Suggestions up to commit 90defce
CategorySuggestion                                                                                                                                    Impact
General
Preserve exception diagnostic information

The UnknownHostException catch block may mask the actual cause of IP conversion
failures. The exception message only reports buffer length but doesn't include the
original exception message. Consider including e.getMessage() in the error message
to preserve diagnostic information about why the conversion failed.

core/src/main/java/org/opensearch/sql/executor/analytics/AnalyticsExecutionEngine.java [158-164]

 if (value instanceof byte[] bytes) {
   if (type instanceof IpType) {
     try {
       return ExprValueUtils.stringValue(
           InetAddresses.toAddrString(InetAddress.getByAddress(bytes)));
     } catch (UnknownHostException e) {
-      throw new IllegalStateException("invalid IP buffer length: " + bytes.length, e);
+      throw new IllegalStateException(
+          "invalid IP buffer length: " + bytes.length + ", cause: " + e.getMessage(), e);
     }
   } else if (type instanceof BinaryType) {
     return ExprValueUtils.stringValue(Base64.getEncoder().encodeToString(bytes));
   }
 }
Suggestion importance[1-10]: 5

__

Why: Including e.getMessage() in the error message provides better diagnostics. However, UnknownHostException from InetAddress.getByAddress() typically only occurs for invalid buffer lengths (not 4 or 16 bytes), so the current message already captures the key information. The improvement is marginal.

Low
Validate schema and row length match

The loop assumes fields.size() matches or exceeds row.length, but if fields.size() >
row.length, the code silently fills missing columns with null. This could mask data
integrity issues. Consider validating that fields.size() == row.length and logging a
warning or throwing an exception when they don't match to catch potential schema
mismatches early.

core/src/main/java/org/opensearch/sql/executor/analytics/AnalyticsExecutionEngine.java [131-135]

+if (fields.size() != row.length) {
+  throw new IllegalStateException(
+      "Schema mismatch: expected " + fields.size() + " columns but got " + row.length);
+}
 for (int i = 0; i < fields.size(); i++) {
   RelDataTypeField field = fields.get(i);
-  Object value = (i < row.length) ? row[i] : null;
-  valueMap.put(field.getName(), toExprValue(value, field.getType()));
+  valueMap.put(field.getName(), toExprValue(row[i], field.getType()));
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion to validate fields.size() == row.length could catch schema mismatches, but the existing code already handles the case where row.length < fields.size() by filling with null. Throwing an exception might be too strict if the executor legitimately returns partial rows. The suggestion may introduce breaking changes without clear evidence of a bug.

Low
Suggestions up to commit 0918d6b
CategorySuggestion                                                                                                                                    Impact
General
Validate IP buffer length explicitly

The UnknownHostException is thrown only for invalid buffer lengths (not 4 or 16
bytes), but the error message doesn't clarify the expected lengths. Add validation
to check buffer length explicitly before calling InetAddress.getByAddress() and
provide a more descriptive error message indicating valid lengths (4 or 16 bytes).

core/src/main/java/org/opensearch/sql/executor/analytics/AnalyticsExecutionEngine.java [157-170]

 private static ExprValue toExprValue(Object value, RelDataType type) {
   if (value instanceof byte[] bytes) {
     if (type instanceof IpType) {
+      if (bytes.length != 4 && bytes.length != 16) {
+        throw new IllegalStateException(
+            "invalid IP buffer length: " + bytes.length + " (expected 4 or 16 bytes)");
+      }
       try {
         return ExprValueUtils.stringValue(
             InetAddresses.toAddrString(InetAddress.getByAddress(bytes)));
       } catch (UnknownHostException e) {
-        throw new IllegalStateException("invalid IP buffer length: " + bytes.length, e);
+        throw new IllegalStateException("failed to parse IP address", e);
       }
     } else if (type instanceof BinaryType) {
       return ExprValueUtils.stringValue(Base64.getEncoder().encodeToString(bytes));
     }
   }
   return ExprValueUtils.fromObjectValue(value);
 }
Suggestion importance[1-10]: 7

__

Why: The suggestion improves error handling by explicitly validating IP buffer lengths before processing, providing clearer error messages. However, InetAddress.getByAddress() already validates buffer length internally, so this is primarily a defensive programming improvement rather than fixing a critical bug.

Medium
Log warning for row-field mismatch

When row.length < fields.size(), the code defaults missing values to null. However,
this silently masks potential data corruption or schema mismatches. Consider logging
a warning when row length doesn't match field count to aid debugging of upstream
data issues.

core/src/main/java/org/opensearch/sql/executor/analytics/AnalyticsExecutionEngine.java [131-135]

+if (row.length != fields.size()) {
+  logger.warn("Row length mismatch: expected {} fields but got {} values", 
+              fields.size(), row.length);
+}
 for (int i = 0; i < fields.size(); i++) {
   RelDataTypeField field = fields.get(i);
   Object value = (i < row.length) ? row[i] : null;
   valueMap.put(field.getName(), toExprValue(value, field.getType()));
 }
Suggestion importance[1-10]: 6

__

Why: Adding logging for row-field mismatches would help detect data corruption or schema issues during debugging. However, the existing code already handles this case gracefully by defaulting to null, and the suggestion only adds observability without changing behavior or fixing a bug.

Low
Suggestions up to commit b7fd95b
CategorySuggestion                                                                                                                                    Impact
General
Log invalid IP buffer errors

The UnknownHostException catch block silently falls through to default handling,
which may throw an unclear error. Consider logging the exception with the invalid
buffer length to aid debugging when the backend sends malformed IP data.

core/src/main/java/org/opensearch/sql/executor/analytics/AnalyticsExecutionEngine.java [158-167]

 if (value instanceof byte[] bytes) {
   if (type instanceof IpType) {
     try {
       return ExprValueUtils.stringValue(
           InetAddresses.toAddrString(InetAddress.getByAddress(bytes)));
     } catch (UnknownHostException e) {
-      // Defensive: backend gave us a buffer that isn't 4 or 16 bytes. Fall through to
-      // the default handling so the user sees a clear error rather than a malformed
-      // address string.
+      // Log the error with buffer length for debugging
+      logger.warn("Invalid IP buffer length: {} bytes, expected 4 or 16", bytes.length, e);
+      // Fall through to default handling
     }
   } else if (type instanceof BinaryType) {
     return ExprValueUtils.stringValue(Base64.getEncoder().encodeToString(bytes));
   }
 }
Suggestion importance[1-10]: 5

__

Why: Adding logging for the UnknownHostException would help with debugging malformed IP buffers from the backend. However, the current fallthrough behavior is intentional and documented, and the default handling will already surface an error to the user. The logging is a minor enhancement rather than a critical fix.

Low
Suggestions up to commit 540074b
CategorySuggestion                                                                                                                                    Impact
General
Log invalid IP byte arrays

The UnknownHostException catch block silently falls through to default handling,
which may throw an unclear error. Consider logging the exception with context (e.g.,
buffer length) to aid debugging when invalid IP byte arrays are encountered.

core/src/main/java/org/opensearch/sql/executor/analytics/AnalyticsExecutionEngine.java [158-166]

 if (value instanceof byte[] bytes) {
   if (type instanceof IpType) {
     try {
       return ExprValueUtils.stringValue(InetAddresses.toAddrString(InetAddress.getByAddress(bytes)));
     } catch (UnknownHostException e) {
-      // Defensive: backend gave us a buffer that isn't 4 or 16 bytes. Fall through to
-      // the default handling so the user sees a clear error rather than a malformed
-      // address string.
+      // Log the invalid buffer for debugging
+      logger.warn("Invalid IP address byte array of length {}: {}", bytes.length, e.getMessage());
+      // Fall through to default handling
     }
   } else if (type instanceof BinaryType) {
     return ExprValueUtils.stringValue(Base64.getEncoder().encodeToString(bytes));
   }
 }
Suggestion importance[1-10]: 5

__

Why: Adding logging for invalid IP byte arrays would help with debugging, but the current defensive approach of falling through to default handling is already reasonable. The suggestion improves observability without fixing a critical issue. The score reflects a moderate improvement in maintainability and debugging capability.

Low

@vinaykpud vinaykpud force-pushed the feature/udt_ip_binary branch from 540074b to b7fd95b Compare May 22, 2026 18:21
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit b7fd95b

@vinaykpud vinaykpud marked this pull request as ready for review May 22, 2026 20:22
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 0918d6b

penghuo
penghuo previously approved these changes May 22, 2026
vinaykpud added 2 commits May 25, 2026 17:56
Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com>
Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com>
@vinaykpud vinaykpud force-pushed the feature/udt_ip_binary branch from 0918d6b to 90defce Compare May 25, 2026 17:57
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 90defce

@vinaykpud vinaykpud added enhancement New feature or request PPL Piped processing language labels May 25, 2026
@vinaykpud vinaykpud closed this May 25, 2026
@vinaykpud vinaykpud reopened this May 25, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 90defce

Signed-off-by: Vinay Krishna Pudyodu <vinkrish.neo@gmail.com>
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit c854bbd

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request PPL Piped processing language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants