[pick](Variant) pick nested impl and unique_id fix (#39841 #39022)#40320
[pick](Variant) pick nested impl and unique_id fix (#39841 #39022)#40320eldenmoon merged 2 commits intoapache:branch-3.0from
Conversation
|
Thank you for your contribution to Apache Doris. Since 2024-03-18, the Document has been moved to doris-website. |
…pache#39022) Currently, importing nested data formats, such as: ``` json { "a": [{"nested1": 1}, {"nested2": "123"}] } ``` This results in the a column type becoming JSON, which has worse compression and query performance compared to native arrays, mainly due to the inability to leverage low cardinality optimizations and the overhead of parsing JSON during queries. A common example: ``` json { "eventId": 1, "firstName": "Name1", "lastName": "Surname1", "body": { "phoneNumbers": [ { "number": "5550219210", "type": "GSM", "callLimit": 5 }, { "number": "02124713252", "type": "HOME", "callLimit": 3 }, { "number": "05550219211", "type": "WORK", "callLimit": 2 } ] } } ``` Consider storing the expanded nested structure so that the schema merge logic can be utilized directly, and querying becomes easier, for example: ``` json { "n": [{"a": 1, "b": 2}, {"a": 10, "b": 11, "c": 12}, {"a": 1001, "d": "12"}] }, { "n": [{"x": 1, "y": 2}] } ``` Data would be stored as follows, with following storage format Column | Row 0 | Row 1 -- | -- | -- n.a (array<int>) | [1, 10, 1001] | [null] n.b (int) | [2, 11, null] | [null] n.c (int) | [null, 12, null] | [null] n.d (text) | [null, null, "12"] | [null] n.x | [null, null, null] | [1] n.y | [null, null, null] | [1] Data offsets are aligned (equal size). To maintain the relationship between nested nodes, such as n.a, n.b, n.c, and n.d, during compaction, if any of these columns are missing, their offsets are filled using any sibling column's offset. ```sql SELECT v['n']['a'] FROM tbl; --- This outputs [1, 10, 1001]. ``` ``` sql SELECT v['n'] FROM tbl; --- This outputs [{"a" : 1, "b" : 2}, {"a" : 10, "b" : 11, "c" : 12}, {"a":1001, "d" : "12"}]. ``` During queries, the path's nested information is not perceived because this information is ignored during path evaluation (not stored in the subcolumn tree).
Currently, the variant type is not supported rename column because its column reader accesses columns by path rather than by unique ID. If the name is modified, the column reader may not locate the column correctly.So we should access by unique id
0b62261 to
c546d84
Compare
| #endif | ||
| } | ||
|
|
||
| Status DefaultNestedColumnIterator::next_batch(size_t* n, vectorized::MutableColumnPtr& dst, |
There was a problem hiding this comment.
warning: pointer parameter 'n' can be pointer to const [readability-non-const-parameter]
| Status DefaultNestedColumnIterator::next_batch(size_t* n, vectorized::MutableColumnPtr& dst, | |
| Status DefaultNestedColumnIterator::next_batch(const size_t* n, vectorized::MutableColumnPtr& dst, |
|
|
||
| Field ColumnObject::Subcolumn::get_last_field() const { | ||
| if (data.empty()) { | ||
| return Field(); |
There was a problem hiding this comment.
warning: avoid repeating the return type from the declaration; use a braced initializer list instead [modernize-return-braced-init-list]
| return Field(); | |
| return {}; |
| return (*last_part)[last_part->size() - 1]; | ||
| } | ||
|
|
||
| void ColumnObject::Subcolumn::insert_range_from(const Subcolumn& src, size_t start, size_t length) { |
There was a problem hiding this comment.
warning: function 'insert_range_from' exceeds recommended size/complexity thresholds [readability-function-size]
void ColumnObject::Subcolumn::insert_range_from(const Subcolumn& src, size_t start, size_t length) {
^Additional context
be/src/vec/columns/column_object.cpp:524: 85 lines including whitespace and comments (threshold 80)
void ColumnObject::Subcolumn::insert_range_from(const Subcolumn& src, size_t start, size_t length) {
^| } | ||
| } | ||
|
|
||
| void ColumnObject::finalize(FinalizeMode mode) { |
There was a problem hiding this comment.
warning: method 'finalize' can be made const [readability-make-member-function-const]
be/src/vec/columns/column_object.h:350:
- void finalize(FinalizeMode mode);
+ void finalize(FinalizeMode mode) const;| void ColumnObject::finalize(FinalizeMode mode) { | |
| void ColumnObject::finalize(FinalizeMode mode) const { |
| // and modified by Doris | ||
|
|
||
| #pragma once | ||
| #include <butil/compiler_specific.h> |
There was a problem hiding this comment.
warning: 'butil/compiler_specific.h' file not found [clang-diagnostic-error]
#include <butil/compiler_specific.h>
^|
|
||
| #pragma once | ||
|
|
||
| #include <glog/logging.h> |
There was a problem hiding this comment.
warning: 'glog/logging.h' file not found [clang-diagnostic-error]
#include <glog/logging.h>
^| #pragma once | ||
|
|
||
| #include <glog/logging.h> | ||
| #include <stdint.h> |
There was a problem hiding this comment.
warning: inclusion of deprecated C++ header 'stdint.h'; consider using 'cstdint' instead [modernize-deprecated-headers]
| #include <stdint.h> | |
| #include <cstdint> |
| @@ -29,15 +29,18 @@ | |||
| #include <vector> | |||
|
|
|||
| #include "gen_cpp/segment_v2.pb.h" | |||
There was a problem hiding this comment.
warning: 'gen_cpp/segment_v2.pb.h' file not found [clang-diagnostic-error]
#include "gen_cpp/segment_v2.pb.h"
^
#39841
#39022