Skip to content

[pick](Variant) pick nested impl and unique_id fix (#39841 #39022)#40320

Merged
eldenmoon merged 2 commits intoapache:branch-3.0from
eldenmoon:branch-3.0
Sep 3, 2024
Merged

[pick](Variant) pick nested impl and unique_id fix (#39841 #39022)#40320
eldenmoon merged 2 commits intoapache:branch-3.0from
eldenmoon:branch-3.0

Conversation

@eldenmoon
Copy link
Copy Markdown
Member

@eldenmoon eldenmoon commented Sep 3, 2024

@doris-robot
Copy link
Copy Markdown

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

…pache#39022)

Currently, importing nested data formats, such as:

``` json
{
  "a": [{"nested1": 1}, {"nested2": "123"}]
}
```
This results in the a column type becoming JSON, which has worse
compression and query performance compared to native arrays, mainly due
to the inability to leverage low cardinality optimizations and the
overhead of parsing JSON during queries.

A common example:

``` json
{
  "eventId": 1,
  "firstName": "Name1",
  "lastName": "Surname1",
  "body": {
    "phoneNumbers": [
      {
        "number": "5550219210",
        "type": "GSM",
        "callLimit": 5
      },
      {
        "number": "02124713252",
        "type": "HOME",
        "callLimit": 3
      },
      {
        "number": "05550219211",
        "type": "WORK",
        "callLimit": 2
      }
    ]
  }
}
```

Consider storing the expanded nested structure so that the schema merge
logic can be utilized directly, and querying becomes easier, for
example:
``` json
{
  "n": [{"a": 1, "b": 2}, {"a": 10, "b": 11, "c": 12}, {"a": 1001, "d": "12"}]
},
{
  "n": [{"x": 1, "y": 2}]
}
```
Data would be stored as follows, with following storage format
Column | Row 0 | Row 1
-- | -- | --
n.a (array<int>) | [1, 10, 1001] | [null]
n.b (int) | [2, 11, null] | [null]
n.c (int) | [null, 12, null] | [null]
n.d (text) | [null, null, "12"] | [null]
n.x | [null, null, null] | [1]
n.y | [null, null, null] | [1]

Data offsets are aligned (equal size).

To maintain the relationship between nested nodes, such as n.a, n.b,
n.c, and n.d, during compaction, if any of these columns are missing,
their offsets are filled using any sibling column's offset.

```sql
SELECT v['n']['a'] FROM tbl;
--- This outputs [1, 10, 1001].
```

```  sql
SELECT v['n'] FROM tbl;
--- This outputs [{"a" : 1, "b" : 2}, {"a" : 10, "b" : 11, "c" : 12}, {"a":1001, "d" : "12"}].
```

During queries, the path's nested information is not perceived because
this information is ignored during path evaluation (not stored in the
subcolumn tree).
Currently, the variant type is not supported rename column because its
column reader accesses columns by path rather than by unique ID. If the
name is modified, the column reader may not locate the column
correctly.So we should access by unique id
@eldenmoon eldenmoon changed the title [Fix](Variant) use uinque id to access column reader [pick](Variant) pick nested impl and unique_id fix Sep 3, 2024
Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

#endif
}

Status DefaultNestedColumnIterator::next_batch(size_t* n, vectorized::MutableColumnPtr& dst,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: pointer parameter 'n' can be pointer to const [readability-non-const-parameter]

Suggested change
Status DefaultNestedColumnIterator::next_batch(size_t* n, vectorized::MutableColumnPtr& dst,
Status DefaultNestedColumnIterator::next_batch(const size_t* n, vectorized::MutableColumnPtr& dst,


Field ColumnObject::Subcolumn::get_last_field() const {
if (data.empty()) {
return Field();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: avoid repeating the return type from the declaration; use a braced initializer list instead [modernize-return-braced-init-list]

Suggested change
return Field();
return {};

return (*last_part)[last_part->size() - 1];
}

void ColumnObject::Subcolumn::insert_range_from(const Subcolumn& src, size_t start, size_t length) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: function 'insert_range_from' exceeds recommended size/complexity thresholds [readability-function-size]

void ColumnObject::Subcolumn::insert_range_from(const Subcolumn& src, size_t start, size_t length) {
                              ^
Additional context

be/src/vec/columns/column_object.cpp:524: 85 lines including whitespace and comments (threshold 80)

void ColumnObject::Subcolumn::insert_range_from(const Subcolumn& src, size_t start, size_t length) {
                              ^

}
}

void ColumnObject::finalize(FinalizeMode mode) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: method 'finalize' can be made const [readability-make-member-function-const]

be/src/vec/columns/column_object.h:350:

-     void finalize(FinalizeMode mode);
+     void finalize(FinalizeMode mode) const;
Suggested change
void ColumnObject::finalize(FinalizeMode mode) {
void ColumnObject::finalize(FinalizeMode mode) const {

// and modified by Doris

#pragma once
#include <butil/compiler_specific.h>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: 'butil/compiler_specific.h' file not found [clang-diagnostic-error]

#include <butil/compiler_specific.h>
         ^


#pragma once

#include <glog/logging.h>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: 'glog/logging.h' file not found [clang-diagnostic-error]

#include <glog/logging.h>
         ^

#pragma once

#include <glog/logging.h>
#include <stdint.h>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: inclusion of deprecated C++ header 'stdint.h'; consider using 'cstdint' instead [modernize-deprecated-headers]

Suggested change
#include <stdint.h>
#include <cstdint>

@@ -29,15 +29,18 @@
#include <vector>

#include "gen_cpp/segment_v2.pb.h"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: 'gen_cpp/segment_v2.pb.h' file not found [clang-diagnostic-error]

#include "gen_cpp/segment_v2.pb.h"
         ^

@eldenmoon eldenmoon merged commit 839f9b8 into apache:branch-3.0 Sep 3, 2024
@eldenmoon eldenmoon deleted the branch-3.0 branch September 3, 2024 08:35
@gavinchou gavinchou changed the title [pick](Variant) pick nested impl and unique_id fix [pick](Variant) pick nested impl and unique_id fix (#39841 #39022) Sep 29, 2024
@gavinchou gavinchou mentioned this pull request Oct 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants