Problem Statement
OpenSearch PPL query results use a JDBC(schema-data) format where the schema section describes column names and types, and the data section contains the actual values. However, nested type information is lost in the schema section:
ARRAY(STRING) is serialized as just "array"
ARRAY(STRUCT(name:STRING, id:INTEGER)) is serialized as just "array"
STRUCT(cola:INTEGER, colb:STRING) is serialized as just "struct".
Impact on GraphLookup
The graphLookup command returns results where the output field is typed as ARRAY<STRUCT(name, reportsTo, id, ...>> internally (see GraphLookup.java:155-163). However, in the final response, this rich type information is reduced to just "array", making it impossible for downstream applications (e.g., LightRAG) to correctly parse the edge/node data without hardcoding assumptions about the structure.
Current Response Example
Query:
source=employees | graphLookup employees startField=reportsTo fromField=reportsTo toField=name maxDepth=3 as hierarchy
Current Response (schema-data format):
{
"schema": [
{"name": "id", "alias": null, "type": "integer"},
{"name": "name", "alias": null, "type": "keyword"},
{"name": "reportsTo", "alias": null, "type": "keyword"},
{"name": "hierarchy", "alias": null, "type": "array"}
],
"datarows": [
[1, "Dev", "Eliot", [
["Eliot", "Ron", 2],
["Ron", "Andrew", 3],
["Andrew", null, 4]
]]
],
"total": 1,
"size": 1,
"status": 200
}
Problem: The "type": "array" for the hierarchy column tells the client nothing about the structure of each element in the array. The client cannot know the column names or types for each element.
Proposed Options
Option A: New Document JSON Response Format
Introduce a new response format option (similar to OpenSearch DSL _search responses) that returns results as self-describing JSON documents instead of the schema-data tabular format.
Activation: Via a query parameter, e.g., format=json or a new PPL setting.
Proposed Response:
{
"hits": {
"total": {"value": 1, "relation": "eq"},
"hits": [
{
"_source": {
"id": 1,
"name": "Dev",
"reportsTo": "Eliot",
"hierarchy": [
{"name": "Eliot", "reportsTo": "Ron", "id": 2},
{"name": "Ron", "reportsTo": "Andrew", "id": 3},
{"name": "Andrew", "reportsTo": null, "id": 4}
]
}
}
]
}
}
Alternatively, a lightweight variant without the DSL envelope:
{
"data": [
{
"id": 1,
"name": "Dev",
"reportsTo": "Eliot",
"hierarchy": [
{"name": "Eliot", "reportsTo": "Ron", "id": 2},
{"name": "Ron", "reportsTo": "Andrew", "id": 3},
{"name": "Andrew", "reportsTo": null, "id": 4}
]
}
],
"total": 1,
"size": 1
}
Pros
- Self-describing: nested structures are naturally represented in JSON; no separate schema needed for parsing.
- Familiar to OpenSearch users who already work with DSL
_search responses.
- No ambiguity — the structure of complex fields is directly visible in the data.
- Easy for downstream applications (LightRAG, dashboards, etc.) to consume using standard JSON parsing.
Cons
- New format: Introduces a parallel response format that needs long-term maintenance alongside schema-data.
- No explicit type information: JSON itself doesn't distinguish
integer vs long, keyword vs text, date vs string, etc. Applications needing precise types still need schema metadata.
- Breaking change for clients: Existing clients expecting schema-data format would need to handle the new format.
- Redundancy: Column names are repeated in every row, increasing payload size for large result sets.
- Scope: This is a large cross-cutting change affecting the entire response pipeline, not just graphLookup.
Option B: Enhance Schema with Full Nested Type Information
Extend the current schema-data format to include component/field type information for complex types (ARRAY and STRUCT).
Proposed Response:
{
"schema": [
{"name": "id", "alias": null, "type": "integer"},
{"name": "name", "alias": null, "type": "keyword"},
{"name": "reportsTo", "alias": null, "type": "keyword"},
{
"name": "hierarchy",
"alias": null,
"type": "array",
"elementType": {
"type": "struct",
"fields": [
{"name": "name", "type": "keyword"},
{"name": "reportsTo", "type": "keyword"},
{"name": "id", "type": "integer"}
]
}
}
],
"datarows": [
[1, "Dev", "Eliot", [
["Eliot", "Ron", 2],
["Ron", "Andrew", 3],
["Andrew", null, 4]
]]
],
"total": 1,
"size": 1,
"status": 200
}
For simpler cases:
// ARRAY(STRING) → e.g., a tags field
{"name": "tags", "type": "array", "elementType": {"type": "keyword"}}
// STRUCT(city:STRING, zip:INTEGER) → e.g., an address field
{
"name": "address",
"type": "struct",
"fields": [
{"name": "city", "type": "keyword"},
{"name": "zip", "type": "integer"}
]
}
// Deeply nested: ARRAY(STRUCT(edges: ARRAY(STRING)))
{
"name": "graph",
"type": "array",
"elementType": {
"type": "struct",
"fields": [
{"name": "edges", "type": "array", "elementType": {"type": "keyword"}}
]
}
}
Pros
- Backward compatible: The top-level
"type" field remains unchanged ("array", "struct"). Existing clients that only read type continue to work. New clients can read the additional elementType/fields properties.
- Complete type information: Preserves full type hierarchy including deeply nested structures.
- Minimal change surface: Only the schema serialization in
JdbcResponseFormatter.fetchColumn() and Column class need modification. The data section remains unchanged.
- Precise types: Unlike Option A, this format explicitly provides type names for every level of nesting.
Cons
- Breaks the backward-compatibility: In the current implementation, except for the root object, data of struct type are shown as a map object instead of array.
- Schema complexity: The schema section becomes more complex and deeply nested for complex types.
- Implementation effort: Requires threading the Calcite
RelDataType (which preserves component types) through to the formatter, since the current ExprCoreType enum doesn't carry component type info.
- Client update needed: Clients need to understand the new optional
elementType/fields properties to benefit from the enhancement (though they can safely ignore them).
Implementation Sketch
- Extend
Schema.Column to carry RelDataType or a new TypeDescriptor alongside ExprType.
- Modify
JdbcResponseFormatter.Column to support optional elementType / fields properties.
- In
fetchColumn(), recursively build nested type descriptors from the Calcite RelDataType:
ARRAY(componentType) → {"type":"array", "elementType": describe(componentType)}
STRUCT(fields...) → {"type":"object", "fields": [describe(f) for f in fields]}
- Primitive types →
{"type": legacyName}
Option C: Flatten data of struct type to Map
Change the graphLookup output from ARRAY<STRUCT> to a STRUCT (map) representation so that field names appear as map keys in the data section, providing some structural information without schema changes.
Change the final output result of StructImpl from List to Map. We don't really want to use SqlTypeName.Map in the execution phase as map is less efficient and cost more memory than using list.
Proposed Response:
{
"schema": [
{"name": "id", "alias": null, "type": "integer"},
{"name": "name", "alias": null, "type": "keyword"},
{"name": "reportsTo", "alias": null, "type": "keyword"},
{"name": "hierarchy", "alias": null, "type": "array"}
],
"datarows": [
[1, "Dev", "Eliot", [
{"name": "Eliot", "reportsTo": "Ron", "id": 2},
{"name": "Ron", "reportsTo": "Andrew", "id": 3},
{"name": "Andrew", "reportsTo": null, "id": 4}
]]
],
"total": 1,
"size": 1,
"status": 200
}
Note: the data section already looks like this today because ExprTupleValue.value() serializes structs as maps. The schema still says "struct" with no inner type information.
Pros
- No schema format changes: The schema structure remains flat and simple.
- Column names visible: In the flattened variant, field names are encoded in column names (dot notation).
Cons
- Type information still lost: Schema still says
"struct" — the nesting relationship between columns is implicit in naming convention.
GraphLookup-specific: This is a workaround for one command, not a general solution. Every new command with complex output would need its own workaround.
Lossy: The flattened variant loses the association between fields within the same array element. It's unclear that hierarchy.name[0] and hierarchy.id[0] belong to the same struct.
Awkward semantics: Converting an array of structs to parallel arrays of fields is unusual and confusing.
Doesn't address STRUCT: Standalone STRUCT columns still lose field type information.
Note: In Calcite's avatica, struct data are processed as StructImpl with a list stored internally. Actually, before graphlookup, we never touch struct type but only map type.
Option Comparison
Option A[Preferred in long term] could be pursued as a complementary long-term enhancement (independent of this issue) for clients that prefer document-oriented responses. However, it does not solve the type information problem by itself — JSON values like 42 are ambiguous (integer? long? short?).
Option B[Not recommended] breaks the backward-compatibility while not bring any parsing convenience as OptionA.
Option C[Preferred] is a command-specific workaround. It can perfectly satisfy our requirements from LightRAG with less change and effort.
Problem Statement
OpenSearch PPL query results use a JDBC(schema-data) format where the schema section describes column names and types, and the data section contains the actual values. However, nested type information is lost in the schema section:
ARRAY(STRING)is serialized as just"array"ARRAY(STRUCT(name:STRING, id:INTEGER))is serialized as just"array"STRUCT(cola:INTEGER, colb:STRING)is serialized as just"struct".Impact on GraphLookup
The
graphLookupcommand returns results where the output field is typed asARRAY<STRUCT(name, reportsTo, id, ...>>internally (seeGraphLookup.java:155-163). However, in the final response, this rich type information is reduced to just"array", making it impossible for downstream applications (e.g., LightRAG) to correctly parse the edge/node data without hardcoding assumptions about the structure.Current Response Example
Query:
Current Response (schema-data format):
{ "schema": [ {"name": "id", "alias": null, "type": "integer"}, {"name": "name", "alias": null, "type": "keyword"}, {"name": "reportsTo", "alias": null, "type": "keyword"}, {"name": "hierarchy", "alias": null, "type": "array"} ], "datarows": [ [1, "Dev", "Eliot", [ ["Eliot", "Ron", 2], ["Ron", "Andrew", 3], ["Andrew", null, 4] ]] ], "total": 1, "size": 1, "status": 200 }Problem: The
"type": "array"for thehierarchycolumn tells the client nothing about the structure of each element in the array. The client cannot know the column names or types for each element.Proposed Options
Option A: New Document JSON Response Format
Introduce a new response format option (similar to OpenSearch DSL
_searchresponses) that returns results as self-describing JSON documents instead of the schema-data tabular format.Activation: Via a query parameter, e.g.,
format=jsonor a new PPL setting.Proposed Response:
{ "hits": { "total": {"value": 1, "relation": "eq"}, "hits": [ { "_source": { "id": 1, "name": "Dev", "reportsTo": "Eliot", "hierarchy": [ {"name": "Eliot", "reportsTo": "Ron", "id": 2}, {"name": "Ron", "reportsTo": "Andrew", "id": 3}, {"name": "Andrew", "reportsTo": null, "id": 4} ] } } ] } }Alternatively, a lightweight variant without the DSL envelope:
{ "data": [ { "id": 1, "name": "Dev", "reportsTo": "Eliot", "hierarchy": [ {"name": "Eliot", "reportsTo": "Ron", "id": 2}, {"name": "Ron", "reportsTo": "Andrew", "id": 3}, {"name": "Andrew", "reportsTo": null, "id": 4} ] } ], "total": 1, "size": 1 }Pros
_searchresponses.Cons
integervslong,keywordvstext,datevsstring, etc. Applications needing precise types still need schema metadata.Option B: Enhance Schema with Full Nested Type Information
Extend the current schema-data format to include component/field type information for complex types (ARRAY and STRUCT).
Proposed Response:
{ "schema": [ {"name": "id", "alias": null, "type": "integer"}, {"name": "name", "alias": null, "type": "keyword"}, {"name": "reportsTo", "alias": null, "type": "keyword"}, { "name": "hierarchy", "alias": null, "type": "array", "elementType": { "type": "struct", "fields": [ {"name": "name", "type": "keyword"}, {"name": "reportsTo", "type": "keyword"}, {"name": "id", "type": "integer"} ] } } ], "datarows": [ [1, "Dev", "Eliot", [ ["Eliot", "Ron", 2], ["Ron", "Andrew", 3], ["Andrew", null, 4] ]] ], "total": 1, "size": 1, "status": 200 }For simpler cases:
Pros
"type"field remains unchanged ("array","struct"). Existing clients that only readtypecontinue to work. New clients can read the additionalelementType/fieldsproperties.JdbcResponseFormatter.fetchColumn()andColumnclass need modification. The data section remains unchanged.Cons
RelDataType(which preserves component types) through to the formatter, since the currentExprCoreTypeenum doesn't carry component type info.elementType/fieldsproperties to benefit from the enhancement (though they can safely ignore them).Implementation Sketch
Schema.Columnto carryRelDataTypeor a newTypeDescriptoralongsideExprType.JdbcResponseFormatter.Columnto support optionalelementType/fieldsproperties.fetchColumn(), recursively build nested type descriptors from the CalciteRelDataType:ARRAY(componentType)→{"type":"array", "elementType": describe(componentType)}STRUCT(fields...)→{"type":"object", "fields": [describe(f) for f in fields]}{"type": legacyName}Option C: Flatten data of struct type to Map
Change the graphLookup output fromARRAY<STRUCT>to aSTRUCT(map) representation so that field names appear as map keys in the data section, providing some structural information without schema changes.Change the final output result of
StructImplfrom List to Map. We don't really want to use SqlTypeName.Map in the execution phase as map is less efficient and cost more memory than using list.Proposed Response:
{ "schema": [ {"name": "id", "alias": null, "type": "integer"}, {"name": "name", "alias": null, "type": "keyword"}, {"name": "reportsTo", "alias": null, "type": "keyword"}, {"name": "hierarchy", "alias": null, "type": "array"} ], "datarows": [ [1, "Dev", "Eliot", [ {"name": "Eliot", "reportsTo": "Ron", "id": 2}, {"name": "Ron", "reportsTo": "Andrew", "id": 3}, {"name": "Andrew", "reportsTo": null, "id": 4} ]] ], "total": 1, "size": 1, "status": 200 }Pros
Cons
"struct"— the nesting relationship between columns is implicit in naming convention.GraphLookup-specific: This is a workaround for one command, not a general solution. Every new command with complex output would need its own workaround.Lossy: The flattened variant loses the association between fields within the same array element. It's unclear thathierarchy.name[0]andhierarchy.id[0]belong to the same struct.Awkward semantics: Converting an array of structs to parallel arrays of fields is unusual and confusing.Doesn't address STRUCT: Standalone STRUCT columns still lose field type information.Note: In Calcite's avatica, struct data are processed as StructImpl with a list stored internally. Actually, before graphlookup, we never touch struct type but only map type.
Option Comparison
Option A[Preferred in long term] could be pursued as a complementary long-term enhancement (independent of this issue) for clients that prefer document-oriented responses. However, it does not solve the type information problem by itself — JSON values like
42are ambiguous (integer? long? short?).Option B[Not recommended] breaks the backward-compatibility while not bring any parsing convenience as OptionA.
Option C[Preferred] is a command-specific workaround. It can perfectly satisfy our requirements from LightRAG with less change and effort.