[fix](csv reader) fix data loss when concurrency read using multi char line delimiter #53374
Merged
liaoxin01 merged 1 commit into apache:master on Jul 17, 2025
liaoxin01
reviewed
Jul 16, 2025
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
TPC-H: Total hot run time: 33710 ms
TPC-DS: Total hot run time: 186366 ms
ClickBench: Total hot run time: 32.63 s
github-actions bot pushed a commit that referenced this pull request on Jul 17, 2025
…r line delimiter (#53374)

The split points for concurrent reads of a file are determined in the plan phase. If a split point falls in the middle of a multi-char line delimiter:

- The earlier reader reads all of row1, then reads slightly past the end of its range to consume the rest of the line delimiter.
- The later reader starts in the middle of the line delimiter, so row2 is the first line of its range. Because the first line of every non-initial range is always discarded, row2 is lost.
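The failure can be reproduced with a toy model of range reads. This is a simplified Python sketch, not Doris code: `read_range` is a hypothetical helper, and its two rules (a non-initial range discards its first line, and a range owns every row that starts before its end) are assumptions drawn from the description above.

```python
def read_range(data: bytes, start: int, end: int, delim: bytes) -> list[bytes]:
    """Toy model: return the rows read by the byte range [start, end)."""
    pos = start
    if start != 0:
        # Non-initial range: discard everything up to and including the
        # first complete delimiter found at or after the start offset.
        hit = data.find(delim, pos)
        if hit == -1:
            return []
        pos = hit + len(delim)
    rows = []
    # A row belongs to this range if it starts before `end`; the reader may
    # run past `end` to finish the row and its delimiter.
    while pos < end and pos < len(data):
        hit = data.find(delim, pos)
        if hit == -1:
            rows.append(data[pos:])
            break
        rows.append(data[pos:hit])
        pos = hit + len(delim)
    return rows


data = b"row1\r\nrow2\r\nrow3\r\n"
delim = b"\r\n"
split = 5  # falls between "\r" and "\n" after row1

first = read_range(data, 0, split, delim)
second = read_range(data, split, len(data), delim)
print(first, second)  # [b'row1'] [b'row3']
```

With the split at offset 5, the first range returns only row1 and the second returns only row3, so neither range reads row2.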
dataroaring
pushed a commit
that referenced
this pull request
Jul 21, 2025
…ng multi char line delimiter (#53374) (#53634) pick (#53374)
yiguolei
pushed a commit
that referenced
this pull request
Jul 26, 2025
…ng multi char line delimiter (#53374) (#53635) pick (#53374)
What problem does this PR solve?
Background
During a bulk load, one record was found to be lost (expected "2060625", got "2060624"):

Root Cause
Solution
For a non-initial range, start reading earlier by the length of the line delimiter, so that the reader sees the whole delimiter even when the split point falls inside it.
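One reading of this solution can be sketched as a toy model in Python rather than in Doris's actual C++ code. The helper names `read_range` and `read_all` and the `lookback` parameter are hypothetical, and the sketch assumes that a range owns every row starting before its end and that a non-initial range discards its first line.

```python
def read_range(data: bytes, start: int, end: int, delim: bytes, lookback: int) -> list[bytes]:
    """Toy model: return the rows read by the byte range [start, end)."""
    pos = start
    if start != 0:
        # Non-initial range: back up `lookback` bytes, then discard everything
        # up to and including the first complete delimiter.
        pos = max(0, start - lookback)
        hit = data.find(delim, pos)
        if hit == -1:
            return []
        pos = hit + len(delim)
    rows = []
    # A row belongs to this range if it starts before `end`.
    while pos < end and pos < len(data):
        hit = data.find(delim, pos)
        if hit == -1:
            rows.append(data[pos:])
            break
        rows.append(data[pos:hit])
        pos = hit + len(delim)
    return rows


def read_all(data: bytes, split: int, delim: bytes, lookback: int) -> list[bytes]:
    """Read the file as two ranges that meet at `split`."""
    return (read_range(data, 0, split, delim, lookback)
            + read_range(data, split, len(data), delim, lookback))


data = b"row1\r\nrow2\r\nrow3\r\n"
delim = b"\r\n"
expected = [b"row1", b"row2", b"row3"]

# Starting exactly at a split inside the delimiter loses row2.
lost = read_all(data, 5, delim, lookback=0)

# Backing up by len(delim) returns every row exactly once at every split point.
ok = all(read_all(data, s, delim, lookback=len(delim)) == expected
         for s in range(1, len(data)))
print(lost, ok)  # [b'row1', b'row3'] True
```

Backing up by the full delimiter length means the discarded "first line" always ends at the delimiter that precedes the first row starting at or after the split, so that row is read exactly once.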
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)