[FEATURE][RFC] Support native OpenSearch JSON document format as PPL query response

## Problem Statement

OpenSearch PPL query results use a **JDBC(schema-data)** format where the schema section describes column names and types, and the data section contains the actual values. However, **nested type information is lost** in the schema section:

- `ARRAY(STRING)` is serialized as just `"array"`
- `ARRAY(STRUCT(name:STRING, id:INTEGER))` is serialized as just `"array"`
- `STRUCT(cola:INTEGER, colb:STRING)` is serialized as just `"struct"`.

### Impact on GraphLookup

The `graphLookup` command returns results where the output field is typed as `ARRAY<STRUCT(name, reportsTo, id, ...>>` internally (see `GraphLookup.java:155-163`). However, in the final response, this rich type information is reduced to just `"array"`, making it impossible for downstream applications (e.g., LightRAG) to correctly parse the edge/node data without hardcoding assumptions about the structure.

### Current Response Example

**Query:**
```
source=employees | graphLookup employees startField=reportsTo fromField=reportsTo toField=name maxDepth=3 as hierarchy
```

**Current Response (schema-data format):**
```json
{
  "schema": [
    {"name": "id", "alias": null, "type": "integer"},
    {"name": "name", "alias": null, "type": "keyword"},
    {"name": "reportsTo", "alias": null, "type": "keyword"},
    {"name": "hierarchy", "alias": null, "type": "array"}
  ],
  "datarows": [
    [1, "Dev", "Eliot", [
      ["Eliot", "Ron", 2],
      ["Ron", "Andrew", 3],
      ["Andrew", null, 4]
    ]]
  ],
  "total": 1,
  "size": 1,
  "status": 200
}
```

**Problem:** The `"type": "array"` for the `hierarchy` column tells the client nothing about the structure of each element in the array. The client cannot know the column names or types for each element.

---

## Proposed Options

### Option A: New Document JSON Response Format

Introduce a new response format option (similar to OpenSearch DSL `_search` responses) that returns results as self-describing JSON documents instead of the schema-data tabular format.

**Activation:** Via a query parameter, e.g., `format=json` or a new PPL setting.

**Proposed Response:**
```json
{
  "hits": {
    "total": {"value": 1, "relation": "eq"},
    "hits": [
      {
        "_source": {
          "id": 1,
          "name": "Dev",
          "reportsTo": "Eliot",
          "hierarchy": [
            {"name": "Eliot", "reportsTo": "Ron", "id": 2},
            {"name": "Ron", "reportsTo": "Andrew", "id": 3},
            {"name": "Andrew", "reportsTo": null, "id": 4}
          ]
        }
      }
    ]
  }
}
```

**Alternatively**, a lightweight variant without the DSL envelope:
```json
{
  "data": [
    {
      "id": 1,
      "name": "Dev",
      "reportsTo": "Eliot",
      "hierarchy": [
        {"name": "Eliot", "reportsTo": "Ron", "id": 2},
        {"name": "Ron", "reportsTo": "Andrew", "id": 3},
        {"name": "Andrew", "reportsTo": null, "id": 4}
      ]
    }
  ],
  "total": 1,
  "size": 1
}
```

#### Pros
- Self-describing: nested structures are naturally represented in JSON; no separate schema needed for parsing.
- Familiar to OpenSearch users who already work with DSL `_search` responses.
- No ambiguity — the structure of complex fields is directly visible in the data.
- Easy for downstream applications (LightRAG, dashboards, etc.) to consume using standard JSON parsing.

#### Cons
- **New format**: Introduces a parallel response format that needs long-term maintenance alongside schema-data.
- **No explicit type information**: JSON itself doesn't distinguish `integer` vs `long`, `keyword` vs `text`, `date` vs `string`, etc. Applications needing precise types still need schema metadata.
- **Breaking change for clients**: Existing clients expecting schema-data format would need to handle the new format.
- **Redundancy**: Column names are repeated in every row, increasing payload size for large result sets.
- **Scope**: This is a large cross-cutting change affecting the entire response pipeline, not just graphLookup.

---

### Option B: Enhance Schema with Full Nested Type Information

Extend the current schema-data format to include component/field type information for complex types (ARRAY and STRUCT).

**Proposed Response:**
```json
{
  "schema": [
    {"name": "id", "alias": null, "type": "integer"},
    {"name": "name", "alias": null, "type": "keyword"},
    {"name": "reportsTo", "alias": null, "type": "keyword"},
    {
      "name": "hierarchy",
      "alias": null,
      "type": "array",
      "elementType": {
        "type": "struct",
        "fields": [
          {"name": "name", "type": "keyword"},
          {"name": "reportsTo", "type": "keyword"},
          {"name": "id", "type": "integer"}
        ]
      }
    }
  ],
  "datarows": [
    [1, "Dev", "Eliot", [
      ["Eliot", "Ron", 2],
      ["Ron", "Andrew", 3],
      ["Andrew", null, 4]
    ]]
  ],
  "total": 1,
  "size": 1,
  "status": 200
}
```

**For simpler cases:**
```json
// ARRAY(STRING) → e.g., a tags field
{"name": "tags", "type": "array", "elementType": {"type": "keyword"}}

// STRUCT(city:STRING, zip:INTEGER) → e.g., an address field
{
  "name": "address",
  "type": "struct",
  "fields": [
    {"name": "city", "type": "keyword"},
    {"name": "zip", "type": "integer"}
  ]
}

// Deeply nested: ARRAY(STRUCT(edges: ARRAY(STRING)))
{
  "name": "graph",
  "type": "array",
  "elementType": {
    "type": "struct",
    "fields": [
      {"name": "edges", "type": "array", "elementType": {"type": "keyword"}}
    ]
  }
}
```

#### Pros
- **Backward compatible**: The top-level `"type"` field remains unchanged (`"array"`, `"struct"`). Existing clients that only read `type` continue to work. New clients can read the additional `elementType`/`fields` properties.
- **Complete type information**: Preserves full type hierarchy including deeply nested structures.
- **Minimal change surface**: Only the schema serialization in `JdbcResponseFormatter.fetchColumn()` and `Column` class need modification. The data section remains unchanged.
- **Precise types**: Unlike Option A, this format explicitly provides type names for every level of nesting.

#### Cons
- **Breaks the backward-compatibility**: In the current implementation, except for the root object, data of struct type are shown as a map object instead of array.
- **Schema complexity**: The schema section becomes more complex and deeply nested for complex types.
- **Implementation effort**: Requires threading the Calcite `RelDataType` (which preserves component types) through to the formatter, since the current `ExprCoreType` enum doesn't carry component type info.
- **Client update needed**: Clients need to understand the new optional `elementType`/`fields` properties to benefit from the enhancement (though they can safely ignore them).

#### Implementation Sketch

1. **Extend `Schema.Column`** to carry `RelDataType` or a new `TypeDescriptor` alongside `ExprType`.
2. **Modify `JdbcResponseFormatter.Column`** to support optional `elementType` / `fields` properties.
3. **In `fetchColumn()`**, recursively build nested type descriptors from the Calcite `RelDataType`:
   - `ARRAY(componentType)` → `{"type":"array", "elementType": describe(componentType)}`
   - `STRUCT(fields...)` → `{"type":"object", "fields": [describe(f) for f in fields]}`
   - Primitive types → `{"type": legacyName}`

---

### Option C: Flatten data of struct type to Map

~~Change the graphLookup output from `ARRAY<STRUCT>` to a `STRUCT` (map) representation so that field names appear as map keys in the data section, providing *some* structural information without schema changes.~~

Change the final output result of `StructImpl` from List to Map. We don't really want to use SqlTypeName.Map in the execution phase as map is less efficient and cost more memory than using list.

**Proposed Response:**
```json
{
  "schema": [
    {"name": "id", "alias": null, "type": "integer"},
    {"name": "name", "alias": null, "type": "keyword"},
    {"name": "reportsTo", "alias": null, "type": "keyword"},
    {"name": "hierarchy", "alias": null, "type": "array"}
  ],
  "datarows": [
    [1, "Dev", "Eliot", [
      {"name": "Eliot", "reportsTo": "Ron", "id": 2},
      {"name": "Ron", "reportsTo": "Andrew", "id": 3},
      {"name": "Andrew", "reportsTo": null, "id": 4}
    ]]
  ],
  "total": 1,
  "size": 1,
  "status": 200
}
```

> Note: the data section already looks like this today because `ExprTupleValue.value()` serializes structs as maps. The schema still says `"struct"` with no inner type information.

#### Pros
- **No schema format changes**: The schema structure remains flat and simple.
- **Column names visible**: In the flattened variant, field names are encoded in column names (dot notation).

#### Cons
- **Type information still lost**: Schema still says `"struct"` — the nesting relationship between columns is implicit in naming convention.
- ~~**GraphLookup-specific**: This is a workaround for one command, not a general solution. Every new command with complex output would need its own workaround.~~
- ~~**Lossy**: The flattened variant loses the association between fields within the same array element. It's unclear that `hierarchy.name[0]` and `hierarchy.id[0]` belong to the same struct.~~
- ~~**Awkward semantics**: Converting an array of structs to parallel arrays of fields is unusual and confusing.~~
- ~~**Doesn't address STRUCT**: Standalone STRUCT columns still lose field type information.~~

Note: In Calcite's avatica, struct data are processed as StructImpl with a list stored internally. Actually, before graphlookup, we never touch struct type but only map type.
---

## Option Comparison

**Option A[Preferred in long term]** could be pursued as a complementary long-term enhancement (independent of this issue) for clients that prefer document-oriented responses. However, it does not solve the type information problem by itself — JSON values like `42` are ambiguous (integer? long? short?).

**Option B[Not recommended]**  breaks the backward-compatibility while not bring any parsing convenience as OptionA.

**Option C[Preferred]** is a command-specific workaround. It can perfectly satisfy our requirements from LightRAG with less change and  effort.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEATURE][RFC] Support native OpenSearch JSON document format as PPL query response #5225

Problem Statement

Impact on GraphLookup

Current Response Example

Proposed Options

Option A: New Document JSON Response Format

Pros

Cons

Option B: Enhance Schema with Full Nested Type Information

Pros

Cons

Implementation Sketch

Option C: Flatten data of struct type to Map

Pros

Cons

Note: In Calcite's avatica, struct data are processed as StructImpl with a list stored internally. Actually, before graphlookup, we never touch struct type but only map type.

Option Comparison

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[FEATURE][RFC] Support native OpenSearch JSON document format as PPL query response #5225

Description

Problem Statement

Impact on GraphLookup

Current Response Example

Proposed Options

Option A: New Document JSON Response Format

Pros

Cons

Option B: Enhance Schema with Full Nested Type Information

Pros

Cons

Implementation Sketch

Option C: Flatten data of struct type to Map

Pros

Cons

Note: In Calcite's avatica, struct data are processed as StructImpl with a list stored internally. Actually, before graphlookup, we never touch struct type but only map type.

Option Comparison

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions