[Opt](cloud-sc) Clear stop token when commit_tablet_job fails#49275
Merged
dataroaring merged 2 commits into apache:master on Mar 26, 2025
Contributor
Author
run buildall
TPC-H: Total hot run time: 32647 ms
TPC-DS: Total hot run time: 192526 ms
ClickBench: Total hot run time: 31.44 s
Contributor
Author
run p0
zhannngchen
reviewed
Mar 20, 2025
        }
    }};
    if (_new_tablet->enable_unique_key_merge_on_write()) {
        has_stop_token = true;
Contributor
We should move `register_compaction_stop_token()` here.
Shouldn't the register and unregister operations be requested in the same scope?
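To illustrate the reviewer's point about keeping register and unregister in the same scope, here is a minimal sketch using a scope guard. It is not the code from this PR: `FakeMetaService`, `ScopeGuard`, and `process_mow_tablet` are hypothetical names introduced only for illustration, and the real `register_compaction_stop_token()` may have a different signature.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the meta service; it records calls so the
// order of operations can be inspected.
struct FakeMetaService {
    std::vector<std::string> calls;
    bool token_held = false;

    bool register_stop_token() {
        if (token_held) return false;  // an earlier token still blocks registration
        token_held = true;
        calls.push_back("register");
        return true;
    }
    void unregister_stop_token() {
        token_held = false;
        calls.push_back("unregister");
    }
};

// Minimal scope guard: runs the callback when it goes out of scope.
class ScopeGuard {
public:
    explicit ScopeGuard(std::function<void()> fn) : _fn(std::move(fn)) {}
    ~ScopeGuard() {
        if (_fn) _fn();
    }
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;

private:
    std::function<void()> _fn;
};

// Registers and unregisters in the same scope; `work` may fail.
bool process_mow_tablet(FakeMetaService& ms, const std::function<bool()>& work) {
    if (!ms.register_stop_token()) return false;
    ScopeGuard guard([&ms] { ms.unregister_stop_token(); });
    return work();  // the guard unregisters on both the success and failure paths
}
```

With this shape, a failing `work` callback still leaves no token behind, so a later call to `process_mow_tablet` can register again.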
Yukang-Lian
approved these changes
Mar 20, 2025
Contributor
PR approved by anyone and no changes requested.
Contributor
PR approved by at least one committer and no changes requested.
github-actions bot pushed a commit that referenced this pull request on Mar 26, 2025
### What problem does this PR solve?

A cloud heavy sc job retries the whole alter task when it encounters a `KV_TXN_CONFLICT_RETRY_EXCEEDED_MAX_TIMES` error in `commit_tablet_job` (#46748). We should remove the stop token (#48399) in MS for the sc job if it fails in `commit_tablet_job`; otherwise later retries may fail to register a stop token (because the first stop token won't expire for `config::lease_compaction_interval_seconds * 4=80s`) and the schema change job will fail.

```
I20250318 15:40:15.851157 7677 task_worker_pool.cpp:423] successfully submit task|type=ALTER|signature=1742283174829
I20250318 15:40:31.346628 6496 task_worker_pool.cpp:1999] get alter table task, signature: 1742283174829
I20250318 15:40:31.346635 6496 task_worker_pool.cpp:281] start alter tablet|signature=1742283174829|base_tablet_id=1742283170711|new_tablet_id=1742283174829|mem_limit=10682972209
I20250318 15:40:31.350860 6496 cloud_schema_change_job.cpp:132] Begin to alter tablet. base_tablet_id=1742283170711, new_tablet_id=1742283174829, alter_version=2, job_id=1742283173457
I20250318 15:40:31.350906 6496 cloud_schema_change_job.cpp:226] Begin to convert historical rowsets for new_tablet from base_tablet. base_tablet=1742283170711, new_tablet=1742283174829, job_id=1742283173457
I20250318 15:40:31.350916 6496 cloud_schema_change_job.cpp:247] schema change type, sc_sorting: 0, sc_directly: 1, base_tablet=1742283170711, new_tablet=1742283174829
I20250318 15:40:31.382493 6496 segment_creator.cpp:308] tablet_id:1742283174829, flushing rowset_dir: , rowset_id:020000000000038f6644fd8945079d22209de0cad6c7e5b8, data size:73808, index size:3289
I20250318 15:40:31.385416 6496 cloud_schema_change_job.cpp:416] process mow table|new_tablet_id=1742283174829|out_rowset_size=1|start_calc_delete_bitmap_version=3|alter_version=2
I20250318 15:40:31.387535 6496 cloud_storage_engine.cpp:894] successfully register compaction stop token for tablet_id=1742283174829, delete_bitmap_lock_initiator=6632031443518271970
I20250318 15:40:31.388285 6496 cloud_schema_change_job.cpp:439] alter table for mow table, calculate delete bitmap of incremental rowsets without lock, version: 3-2 new_table_id: 1742283174829
I20250318 15:40:31.391326 6496 cloud_schema_change_job.cpp:460] alter table for mow table, calculate delete bitmap of incremental rowsets with lock, version: 3-2 new_tablet_id: 1742283174829
I20250318 15:40:31.392035 6496 cloud_storage_engine.cpp:915] successfully unregister compaction stop token for tablet_id=1742283174829, delete_bitmap_lock_initiator=6632031443518271970
W20250318 15:40:39.947554 6496 task_worker_pool.cpp:306] failed to alter tablet|signature=1742283174829|base_tablet_id=1742283170711|new_tablet_id=1742283174829|error=[DELETE_BITMAP_LOCK_ERROR]txn conflict when commit tablet job idx { table_id: 1742283165243 index_id: 1742283165244 partition_id: 1742283165242 tablet_id: 1742283170711 } schema_change { initiator: "172.20.56.12:9050" id: "1742283173457" new_tablet_idx { table_id: 1742283165243 index_id: 1742283173458 partition_id: 1742283165242 tablet_id: 1742283174829 } txn_ids: 610474243072 alter_version: 2 num_output_rowsets: 1 num_output_segments: 1 size_output_rowsets: 77097 num_output_rows: 611 output_versions: 2 output_cumulative_point: 2 delete_bitmap_lock_initiator: 6632031443518271970 index_size_output_rowsets: 3289 segment_size_output_rowsets: 73808 }
I20250318 15:40:46.204162 7677 task_worker_pool.cpp:423] successfully submit task|type=ALTER|signature=1742283174829
I20250318 15:41:07.487172 6496 task_worker_pool.cpp:1999] get alter table task, signature: 1742283174829
I20250318 15:41:07.487183 6496 task_worker_pool.cpp:281] start alter tablet|signature=1742283174829|base_tablet_id=1742283170711|new_tablet_id=1742283174829|mem_limit=10682972209
I20250318 15:41:07.489440 6496 cloud_schema_change_job.cpp:132] Begin to alter tablet. base_tablet_id=1742283170711, new_tablet_id=1742283174829, alter_version=2, job_id=1742283173457
I20250318 15:41:07.489511 6496 cloud_schema_change_job.cpp:226] Begin to convert historical rowsets for new_tablet from base_tablet. base_tablet=1742283170711, new_tablet=1742283174829, job_id=1742283173457
I20250318 15:41:07.489523 6496 cloud_schema_change_job.cpp:247] schema change type, sc_sorting: 0, sc_directly: 1, base_tablet=1742283170711, new_tablet=1742283174829
I20250318 15:41:07.490249 6496 cloud_schema_change_job.cpp:285] Rowset [2-2] has already existed in tablet 1742283174829
I20250318 15:41:07.490275 6496 cloud_schema_change_job.cpp:416] process mow table|new_tablet_id=1742283174829|out_rowset_size=1|start_calc_delete_bitmap_version=2|alter_version=2
W20250318 15:41:07.490864 6496 cloud_compaction_stop_token.cpp:89] failed to register compaction stop token|job_id=a018587a-c12f-4926-9d7e-514ff9d88457|delete_bitmap_lock_initiator=1847151139249560285|tablet_id=1742283174829|error=[INTERNAL_ERROR]failed to start tablet job: compactions are not allowed on tablet_id=1742283174829 currently, blocked by schema change job delete_bitmap_initiator=6632031443518271970
W20250318 15:41:07.490897 6496 task_worker_pool.cpp:306] failed to alter tablet|signature=1742283174829|base_tablet_id=1742283170711|new_tablet_id=1742283174829|error=[INTERNAL_ERROR]failed to start tablet job: compactions are not allowed on tablet_id=1742283174829 currently, blocked by schema change job delete_bitmap_initiator=6632031443518271970
```
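The timing in the log can be checked with a little arithmetic. The sketch below assumes that `lease_compaction_interval_seconds` is 20 (implied by the `* 4=80s` in the description) and that expiry is measured from the moment the first token was registered; both are assumptions, and the helper names are hypothetical. The first token was registered at 15:40:31.387535 and the retry tried to register at 15:41:07.490864, roughly 36 seconds later, well inside the 80-second window.

```cpp
#include <cassert>
#include <cstdio>

// Seconds since midnight for a glog-style "HH:MM:SS.ffffff" timestamp.
// Returns a negative value if the text does not match that shape.
double seconds_of_day(const char* ts) {
    int h = 0, m = 0;
    double s = 0.0;
    if (std::sscanf(ts, "%d:%d:%lf", &h, &m, &s) != 3) return -1.0;
    return h * 3600.0 + m * 60.0 + s;
}

// Assumption: the configured interval is 20 s, as implied by "* 4=80s".
constexpr int kLeaseCompactionIntervalSeconds = 20;
constexpr int kStopTokenExpirySeconds = kLeaseCompactionIntervalSeconds * 4;

// True if a token registered at `registered` has not yet expired at `retried`
// (same-day timestamps only; expiry counted from registration by assumption).
bool token_still_blocks(const char* registered, const char* retried) {
    const double elapsed = seconds_of_day(retried) - seconds_of_day(registered);
    return elapsed >= 0.0 && elapsed < kStopTokenExpirySeconds;
}
```

Calling `token_still_blocks("15:40:31.387535", "15:41:07.490864")` with the two timestamps from the log reports that the stale token is still live, which matches the `blocked by schema change job delete_bitmap_initiator=6632031443518271970` error on the retry.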
koarz pushed a commit to koarz/doris that referenced this pull request on Jun 4, 2025
What problem does this PR solve?
A cloud heavy sc job retries the whole alter task when it encounters a `KV_TXN_CONFLICT_RETRY_EXCEEDED_MAX_TIMES` error in `commit_tablet_job` (#46748). We should remove the stop token (#48399) in MS for the sc job if it fails in `commit_tablet_job`; otherwise later retries may fail to register a stop token (because the first stop token won't expire for `config::lease_compaction_interval_seconds * 4=80s`) and the schema change job will fail.
Release note
None
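The retry behavior described in this PR can be sketched as a small simulation. This is only an illustration under stated assumptions, not the actual change: `MetaServiceStub`, `CommitResult`, and `run_alter_attempt` are hypothetical names, and the assumption that a successful commit also releases the token is made purely to keep the model simple.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class CommitResult { kOk, kTxnConflict };

// Hypothetical stand-in for the stop-token state kept in the meta service (MS).
struct MetaServiceStub {
    bool stop_token_held = false;
    std::vector<std::string> log;

    bool register_stop_token() {
        if (stop_token_held) {
            log.push_back("register rejected");
            return false;
        }
        stop_token_held = true;
        log.push_back("register ok");
        return true;
    }
    void clear_stop_token() {
        stop_token_held = false;
        log.push_back("token cleared");
    }
};

// One attempt of the alter task. `clear_on_failure` toggles the behavior the
// PR describes: remove the stop token when commit_tablet_job fails.
bool run_alter_attempt(MetaServiceStub& ms, CommitResult commit, bool clear_on_failure) {
    if (!ms.register_stop_token()) return false;  // blocked by a leftover token
    if (commit == CommitResult::kOk) {
        ms.clear_stop_token();  // assumption: a successful commit releases the token
        return true;
    }
    if (clear_on_failure) ms.clear_stop_token();
    return false;
}
```

Without the cleanup, a conflict on the first attempt leaves the token in place and the retry is rejected at registration. With the cleanup, the retry can register a fresh token and proceed.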