Skip to content

[FEATURE][RFC] Support native OpenSearch JSON document format as PPL query response #5225

@qianheng-aws

Description

@qianheng-aws

Problem Statement

OpenSearch PPL query results use a JDBC(schema-data) format where the schema section describes column names and types, and the data section contains the actual values. However, nested type information is lost in the schema section:

  • ARRAY(STRING) is serialized as just "array"
  • ARRAY(STRUCT(name:STRING, id:INTEGER)) is serialized as just "array"
  • STRUCT(cola:INTEGER, colb:STRING) is serialized as just "struct".

Impact on GraphLookup

The graphLookup command returns results where the output field is typed as ARRAY<STRUCT(name, reportsTo, id, ...>> internally (see GraphLookup.java:155-163). However, in the final response, this rich type information is reduced to just "array", making it impossible for downstream applications (e.g., LightRAG) to correctly parse the edge/node data without hardcoding assumptions about the structure.

Current Response Example

Query:

source=employees | graphLookup employees startField=reportsTo fromField=reportsTo toField=name maxDepth=3 as hierarchy

Current Response (schema-data format):

{
  "schema": [
    {"name": "id", "alias": null, "type": "integer"},
    {"name": "name", "alias": null, "type": "keyword"},
    {"name": "reportsTo", "alias": null, "type": "keyword"},
    {"name": "hierarchy", "alias": null, "type": "array"}
  ],
  "datarows": [
    [1, "Dev", "Eliot", [
      ["Eliot", "Ron", 2],
      ["Ron", "Andrew", 3],
      ["Andrew", null, 4]
    ]]
  ],
  "total": 1,
  "size": 1,
  "status": 200
}

Problem: The "type": "array" for the hierarchy column tells the client nothing about the structure of each element in the array. The client cannot know the column names or types for each element.


Proposed Options

Option A: New Document JSON Response Format

Introduce a new response format option (similar to OpenSearch DSL _search responses) that returns results as self-describing JSON documents instead of the schema-data tabular format.

Activation: Via a query parameter, e.g., format=json or a new PPL setting.

Proposed Response:

{
  "hits": {
    "total": {"value": 1, "relation": "eq"},
    "hits": [
      {
        "_source": {
          "id": 1,
          "name": "Dev",
          "reportsTo": "Eliot",
          "hierarchy": [
            {"name": "Eliot", "reportsTo": "Ron", "id": 2},
            {"name": "Ron", "reportsTo": "Andrew", "id": 3},
            {"name": "Andrew", "reportsTo": null, "id": 4}
          ]
        }
      }
    ]
  }
}

Alternatively, a lightweight variant without the DSL envelope:

{
  "data": [
    {
      "id": 1,
      "name": "Dev",
      "reportsTo": "Eliot",
      "hierarchy": [
        {"name": "Eliot", "reportsTo": "Ron", "id": 2},
        {"name": "Ron", "reportsTo": "Andrew", "id": 3},
        {"name": "Andrew", "reportsTo": null, "id": 4}
      ]
    }
  ],
  "total": 1,
  "size": 1
}

Pros

  • Self-describing: nested structures are naturally represented in JSON; no separate schema needed for parsing.
  • Familiar to OpenSearch users who already work with DSL _search responses.
  • No ambiguity — the structure of complex fields is directly visible in the data.
  • Easy for downstream applications (LightRAG, dashboards, etc.) to consume using standard JSON parsing.

Cons

  • New format: Introduces a parallel response format that needs long-term maintenance alongside schema-data.
  • No explicit type information: JSON itself doesn't distinguish integer vs long, keyword vs text, date vs string, etc. Applications needing precise types still need schema metadata.
  • Breaking change for clients: Existing clients expecting schema-data format would need to handle the new format.
  • Redundancy: Column names are repeated in every row, increasing payload size for large result sets.
  • Scope: This is a large cross-cutting change affecting the entire response pipeline, not just graphLookup.

Option B: Enhance Schema with Full Nested Type Information

Extend the current schema-data format to include component/field type information for complex types (ARRAY and STRUCT).

Proposed Response:

{
  "schema": [
    {"name": "id", "alias": null, "type": "integer"},
    {"name": "name", "alias": null, "type": "keyword"},
    {"name": "reportsTo", "alias": null, "type": "keyword"},
    {
      "name": "hierarchy",
      "alias": null,
      "type": "array",
      "elementType": {
        "type": "struct",
        "fields": [
          {"name": "name", "type": "keyword"},
          {"name": "reportsTo", "type": "keyword"},
          {"name": "id", "type": "integer"}
        ]
      }
    }
  ],
  "datarows": [
    [1, "Dev", "Eliot", [
      ["Eliot", "Ron", 2],
      ["Ron", "Andrew", 3],
      ["Andrew", null, 4]
    ]]
  ],
  "total": 1,
  "size": 1,
  "status": 200
}

For simpler cases:

// ARRAY(STRING) → e.g., a tags field
{"name": "tags", "type": "array", "elementType": {"type": "keyword"}}

// STRUCT(city:STRING, zip:INTEGER) → e.g., an address field
{
  "name": "address",
  "type": "struct",
  "fields": [
    {"name": "city", "type": "keyword"},
    {"name": "zip", "type": "integer"}
  ]
}

// Deeply nested: ARRAY(STRUCT(edges: ARRAY(STRING)))
{
  "name": "graph",
  "type": "array",
  "elementType": {
    "type": "struct",
    "fields": [
      {"name": "edges", "type": "array", "elementType": {"type": "keyword"}}
    ]
  }
}

Pros

  • Backward compatible: The top-level "type" field remains unchanged ("array", "struct"). Existing clients that only read type continue to work. New clients can read the additional elementType/fields properties.
  • Complete type information: Preserves full type hierarchy including deeply nested structures.
  • Minimal change surface: Only the schema serialization in JdbcResponseFormatter.fetchColumn() and Column class need modification. The data section remains unchanged.
  • Precise types: Unlike Option A, this format explicitly provides type names for every level of nesting.

Cons

  • Breaks the backward-compatibility: In the current implementation, except for the root object, data of struct type are shown as a map object instead of array.
  • Schema complexity: The schema section becomes more complex and deeply nested for complex types.
  • Implementation effort: Requires threading the Calcite RelDataType (which preserves component types) through to the formatter, since the current ExprCoreType enum doesn't carry component type info.
  • Client update needed: Clients need to understand the new optional elementType/fields properties to benefit from the enhancement (though they can safely ignore them).

Implementation Sketch

  1. Extend Schema.Column to carry RelDataType or a new TypeDescriptor alongside ExprType.
  2. Modify JdbcResponseFormatter.Column to support optional elementType / fields properties.
  3. In fetchColumn(), recursively build nested type descriptors from the Calcite RelDataType:
    • ARRAY(componentType){"type":"array", "elementType": describe(componentType)}
    • STRUCT(fields...){"type":"object", "fields": [describe(f) for f in fields]}
    • Primitive types → {"type": legacyName}

Option C: Flatten data of struct type to Map

Change the graphLookup output from ARRAY<STRUCT> to a STRUCT (map) representation so that field names appear as map keys in the data section, providing some structural information without schema changes.

Change the final output result of StructImpl from List to Map. We don't really want to use SqlTypeName.Map in the execution phase as map is less efficient and cost more memory than using list.

Proposed Response:

{
  "schema": [
    {"name": "id", "alias": null, "type": "integer"},
    {"name": "name", "alias": null, "type": "keyword"},
    {"name": "reportsTo", "alias": null, "type": "keyword"},
    {"name": "hierarchy", "alias": null, "type": "array"}
  ],
  "datarows": [
    [1, "Dev", "Eliot", [
      {"name": "Eliot", "reportsTo": "Ron", "id": 2},
      {"name": "Ron", "reportsTo": "Andrew", "id": 3},
      {"name": "Andrew", "reportsTo": null, "id": 4}
    ]]
  ],
  "total": 1,
  "size": 1,
  "status": 200
}

Note: the data section already looks like this today because ExprTupleValue.value() serializes structs as maps. The schema still says "struct" with no inner type information.

Pros

  • No schema format changes: The schema structure remains flat and simple.
  • Column names visible: In the flattened variant, field names are encoded in column names (dot notation).

Cons

  • Type information still lost: Schema still says "struct" — the nesting relationship between columns is implicit in naming convention.
  • GraphLookup-specific: This is a workaround for one command, not a general solution. Every new command with complex output would need its own workaround.
  • Lossy: The flattened variant loses the association between fields within the same array element. It's unclear that hierarchy.name[0] and hierarchy.id[0] belong to the same struct.
  • Awkward semantics: Converting an array of structs to parallel arrays of fields is unusual and confusing.
  • Doesn't address STRUCT: Standalone STRUCT columns still lose field type information.

Note: In Calcite's avatica, struct data are processed as StructImpl with a list stored internally. Actually, before graphlookup, we never touch struct type but only map type.

Option Comparison

Option A[Preferred in long term] could be pursued as a complementary long-term enhancement (independent of this issue) for clients that prefer document-oriented responses. However, it does not solve the type information problem by itself — JSON values like 42 are ambiguous (integer? long? short?).

Option B[Not recommended] breaks the backward-compatibility while not bring any parsing convenience as OptionA.

Option C[Preferred] is a command-specific workaround. It can perfectly satisfy our requirements from LightRAG with less change and effort.

Metadata

Metadata

Assignees

Labels

PPLPiped processing languageenhancementNew feature or request

Type

No type
No fields configured for issues without a type.

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions