Skip to content

Support arrow struct#739

Open
ColinLeeo wants to merge 8 commits intodevelopfrom
support_arrow_struct
Open

Support arrow struct#739
ColinLeeo wants to merge 8 commits intodevelopfrom
support_arrow_struct

Conversation

@ColinLeeo
Copy link
Contributor

No description provided.

@codecov-commenter
Copy link

codecov-commenter commented Mar 9, 2026

Codecov Report

❌ Patch coverage is 40.48507% with 319 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.87%. Comparing base (dd27faf) to head (12bbcba).
⚠️ Report is 2 commits behind head on develop.

Files with missing lines Patch % Lines
cpp/src/cwrapper/arrow_c.cc 46.63% 180 Missing and 42 partials ⚠️
cpp/src/common/tablet.cc 0.00% 35 Missing ⚠️
cpp/src/utils/date_utils.h 0.00% 26 Missing ⚠️
cpp/src/cwrapper/tsfile_cwrapper.cc 0.00% 22 Missing ⚠️
cpp/src/reader/table_result_set.cc 50.00% 1 Missing and 5 partials ⚠️
cpp/src/common/container/bit_map.h 50.00% 4 Missing ⚠️
cpp/src/reader/result_set.h 0.00% 2 Missing ⚠️
cpp/src/reader/tsfile_reader.cc 50.00% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #739      +/-   ##
===========================================
- Coverage    62.09%   61.87%   -0.23%     
===========================================
  Files          700      704       +4     
  Lines        40142    41218    +1076     
  Branches      5650     5908     +258     
===========================================
+ Hits         24926    25503     +577     
- Misses       14522    14885     +363     
- Partials       694      830     +136     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

*/
int add_timestamp(uint32_t row_index, int64_t timestamp);

int set_timestamps(const int64_t* timestamps, uint32_t count);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment to explain how cur_row_size is updated after calling this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines +179 to +180
int Tablet::set_column_values(uint32_t schema_index, const void* data,
const uint8_t* null_bitmap, uint32_t count) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add Tablet::set_column_values(uint32_t schema_index, const std::string* data, const uint32_t data_len,
const uint8_t* null_bitmap, uint32_t count)
to support STRING/TEXT/BLOB?

may also use char** to replace std::string* or std::vector to replace data_len.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add a bulk API like set_column_strings, but it wouldn't provide a meaningful performance gain over the current per-row add_value approach. Each string still needs to be individually allocated and copied into the PageArena via dup_from, so the internal implementation would just be a loop doing the same work. The bulk interface would only make the call site slightly cleaner, not faster.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is my intention.

Using two sets of interfaces at the same time is a great burden to understanding.

Comment on lines +230 to +236
// Convert Arrow bitmap (1=valid, 0=null) to TsFile bitmap (1=null,
// 0=valid) by inverting and writing directly.
char* tsfile_bm = bitmaps_[schema_index].get_bitmap();
uint32_t full_bytes = count / 8;
for (uint32_t i = 0; i < full_bytes; i++) {
tsfile_bm[i] = ~static_cast<char>(null_bitmap[i]);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null_bitmap -> nonnull_bitmap?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines +338 to +341
if (bm.test(i)) {
// null row: write zero placeholder in Arrow buffer
std::memset(static_cast<char*>(data_buffer) + i * type_size,
0, type_size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to just skip to the next row?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment on lines +623 to 630
for (uint32_t i = 0; i < column_count; ++i) {
children_arrays[i] = nullptr;
}

for (uint32_t i = 0; i < column_count; ++i) {
children_arrays[i] = static_cast<ArrowArray*>(
common::mem_alloc(sizeof(ArrowArray), common::MOD_TSBLOCK));
if (children_arrays[i] == nullptr) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the first initialization necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cleanup routine after an error checks whether each pointer is null before freeing it. Without initialization, these pointers could contain garbage values, leading to undefined behavior.

Comment on lines +693 to +694
out_schema->format = schema_data->format_strings->at(0).c_str();
out_schema->name = schema_data->name_strings->at(0).c_str();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the format string a vector?

Comment on lines +709 to +710
if (time_col_index < 0 || time_col_index >= n_cols)
return common::E_INVALID_ARG;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the reg_schema already specifies the time column, may use it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The time_column_index should be set by caller.

// Write Arrow C Data Interface batch into a table (Arrow -> Tablet -> write).
// time_col_index: index of the time column in the Arrow struct.
// >= 0: use the specified column as the time column.
// < 0: auto-detect by Arrow format "tsn:" (TIMESTAMP type).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Image

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

data: pyarrow.RecordBatch or pyarrow.Table
time_col_index: index of the time column in the Arrow schema.
>= 0: use the specified column as the time column.
< 0: auto-detect by Arrow timestamp type (default).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the related logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants