[C++] Kernel to select subset of fields of a StructArray #31101

@asfimport

Triggered by https://stackoverflow.com/questions/71035754/pyarrow-drop-a-column-in-a-nested-structure. I thought there was already an issue about this, but I couldn't find one.

Assume you have a struct array with some fields:

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> arr = pa.StructArray.from_arrays([[1, 2, 3]]*3, names=['a', 'b', 'c'])
>>> arr.type
StructType(struct<a: int64, b: int64, c: int64>)

We have a kernel to select a single child field:

>>> pc.struct_field(arr, [0])
<pyarrow.lib.Int64Array object at 0x7ffa9e229940>
[
  1,
  2,
  3
]

But if you want to subset the StructArray to some of its fields, resulting in a new StructArray, that's not possible with struct_field, and doing it manually is a bit cumbersome:

>>> fields = ['a', 'c']
>>> arrays = [arr.field(n) for n in fields]
>>> arr_subset = pa.StructArray.from_arrays(arrays, names=fields)
>>> arr_subset.type
StructType(struct<a: int64, c: int64>)

(this is still OK, but if you had a ChunkedArray, it certainly gets annoying)

One option could be to extend the existing struct_field to allow selecting multiple fields, although that would probably become ambiguous or confusing given how you currently select a recursively nested field: [0, 1] currently means "first child, second subchild" and not "first and second child".
Another option would be a new kernel, named "struct_subset" or something similar.

This might also overlap with general projection functionality? (cc @westonpace)

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Dhruv Vats / @dhruv9vats

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-15643. Please see the migration documentation for further details.
