✅ I checked the Altinity Stable Builds lifecycle table, and the Altinity Stable Build version I'm using is still supported.
Type of problem
Bug report - something's broken
Describe the situation
A regression was introduced in 25.8.15 affecting ReplicatedMergeTree tables using S3 / MinIO with zero-copy replication.
When a replica is dropped and recreated during concurrent inserts, the recreated replica fails to fetch one data part from another replica.
As a result, the replica remains permanently out of sync, with fewer rows than expected.
This issue is:
- reproducible with Altinity Stable 25.8.15
- reproducible with official ClickHouse Docker image 25.8.15
- not reproducible on 25.8.14
Yes, the issue can be reproduced using an official ClickHouse build of the same version.
How to reproduce the behavior
The issue can be reproduced in two ways: via automation or manually.
Option 1: Reproduce using automation
Altinity Stable Build:
- ❌ 25.8.15 → fails
- ✅ 25.8.14 → passes
Command (example):
python3 -u s3/regression.py \
--clickhouse https://altinity-build-artifacts.s3.amazonaws.com/PRs/1331/80a50080a2dddad4ef2fc02d90e0ef1d2d5182d5/build_amd_binary/clickhouse \
--storage minio \
--only '/s3/minio/part 2/zero copy replication/add remove one replica/*' \
--log log.log
Official ClickHouse Docker image:
- ❌ 25.8.15.35 → fails
- ✅ 25.8.14.17 → passes
Command (example):
python3 -u s3/regression.py \
--local \
--clickhouse docker://clickhouse/clickhouse-server:25.8.15.35 \
--clickhouse-version 25.8.15.35 \
--storage minio \
--only '/s3/minio/part 2/zero copy replication/add remove one replica/*' \
--log log3.log
The test "add remove one replica" consistently fails on 25.8.15 and passes on 25.8.14.
Option 2: Manual reproduction (Docker environment)
repro_env.zip
A ZIP file is attached containing:
- docker-compose.yml
- ClickHouse config files (cluster.xml, storage.xml, macros*.xml)
Start the environment:
Steps
- Open a ClickHouse client on each node:
docker exec -it s3_env-clickhouse1-1 clickhouse-client
docker exec -it s3_env-clickhouse2-1 clickhouse-client
docker exec -it s3_env-clickhouse3-1 clickhouse-client
- On all three nodes, create the database and replicated table:
DROP DATABASE IF EXISTS s3test SYNC;
CREATE DATABASE s3test;
DROP TABLE IF EXISTS s3test.add_remove_one_replica SYNC;
CREATE TABLE s3test.add_remove_one_replica
(
d UInt64
)
ENGINE = ReplicatedMergeTree(
'/clickhouse/tables/{shard}/s3test.add_remove_one_replica',
'{replica}'
)
ORDER BY d
SETTINGS
storage_policy = 'external',
allow_remote_fs_zero_copy_replication = 1;
- On node2, insert the first batch of data:
INSERT INTO s3test.add_remove_one_replica
SELECT number FROM numbers(1000000);
- On node3, delete the replica:
DROP TABLE IF EXISTS s3test.add_remove_one_replica SYNC;
- On node3, recreate the replicated table:
CREATE TABLE s3test.add_remove_one_replica
(
d UInt64
)
ENGINE = ReplicatedMergeTree(
'/clickhouse/tables/{shard}/s3test.add_remove_one_replica',
'{replica}'
)
ORDER BY d
SETTINGS
storage_policy = 'external',
allow_remote_fs_zero_copy_replication = 1;
- On node1, insert the second batch of data:
INSERT INTO s3test.add_remove_one_replica
SELECT number + 1000000 FROM numbers(1000000);
- Verify row count on each node:
SELECT count(*)
FROM s3test.add_remove_one_replica;
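The final verification step can be scripted so that the three counts are collected in one go. The following is only a sketch: it assumes the container names from the `docker exec` commands above and that `clickhouse-client --query` works non-interactively inside each container. The helper names (`run_query`, `row_counts`, `in_sync`) are made up for illustration and are not part of the regression suite.

```python
import subprocess

# Container names copied from the `docker exec` commands in the steps above.
CONTAINERS = [
    "s3_env-clickhouse1-1",
    "s3_env-clickhouse2-1",
    "s3_env-clickhouse3-1",
]
COUNT_QUERY = "SELECT count() FROM s3test.add_remove_one_replica"


def run_query(container, query):
    """Run a query via clickhouse-client inside a container and return stdout."""
    result = subprocess.run(
        ["docker", "exec", container, "clickhouse-client", "--query", query],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def row_counts(containers=CONTAINERS, run=run_query):
    """Return {container: row_count} for the test table.

    `run` is injectable so the logic can be exercised without Docker.
    """
    return {c: int(run(c, COUNT_QUERY).strip()) for c in containers}


def in_sync(counts):
    """True when every replica reports the same row count."""
    return len(set(counts.values())) == 1
```

Calling `row_counts()` against the running environment and passing the result to `in_sync()` gives a quick pass/fail signal; on an affected build the node3 container would be expected to report a lower count than the other two.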
Expected behavior
All replicas should eventually converge and return a count of 2000000 (two inserts of 1000000 rows each).
The recreated replica should fetch all missing parts and fully synchronize.
Actual behavior
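The third replica (node3) remains permanently out of sync: it holds only part all_1_1_0 (1000000 rows) and is missing all_0_0_0 (see the parts comparison under "Additional context").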
Replication does not recover even after waiting.
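To tell "slow to catch up" apart from "never recovers", the check can be polled with a deadline. This is a minimal sketch, not what the regression test does: `wait_for_convergence` is a hypothetical helper, `get_counts` is any callable returning `{replica: row_count}`, and the timeout and interval defaults are arbitrary. The expected value of 2000000 follows from the two 1000000-row inserts in the steps.

```python
import time


def wait_for_convergence(get_counts, expected, timeout=300.0, interval=5.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll get_counts() until every replica reports `expected` rows.

    Returns (True, counts) on convergence, or (False, last_counts) once
    `timeout` seconds have elapsed without convergence.
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout
    while True:
        counts = get_counts()
        if all(n == expected for n in counts.values()):
            return True, counts
        if clock() >= deadline:
            return False, counts
        sleep(interval)
```

On 25.8.14 a call such as `wait_for_convergence(get_counts, 2000000)` would be expected to return `True` fairly quickly; on 25.8.15 it should time out with node3 still short.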
Logs, error messages, stacktraces
Replication queue error (node3)
Query executed:
SELECT
count() AS queue_items,
anyIf(last_exception, last_exception != '') AS last_exception
FROM system.replication_queue
WHERE database='s3test' AND table='add_remove_one_replica';
Result:
┌─queue_items─┬─last_exception─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
1. │ 1 │ Poco::Exception. Code: 1000, e.code() = 0, Malformed message: Unexpected EOF, Stack trace (when copying this message, always include the lines below): ↴│
│ │↳ ↴│
│ │↳0. Poco::Net::HTTPChunkedStreamBuf::readFromDevice(char*, long) @ 0x000000001f174834 ↴│
│ │↳1. DB::ReadBufferFromIStream::nextImpl() @ 0x00000000158d5a10 ↴│
│ │↳2. DB::ReadBuffer::next() @ 0x0000000013684bed ↴│
│ │↳3. void std::__function::__policy_invoker<void ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<DB::ReadWriteBufferFromHTTP::nextImpl()::$_0, void ()>>(std::__function::__policy_storage const*) @ 0x00000000158d457d ↴│
│ │↳4. DB::ReadWriteBufferFromHTTP::doWithRetries(std::function<void ()>&&, std::function<void ()>, bool) const @ 0x00000000158cae3b ↴│
│ │↳5. DB::ReadWriteBufferFromHTTP::nextImpl() @ 0x00000000158cf7ff ↴│
│ │↳6. DB::ReadBuffer::next() @ 0x0000000013684bed ↴│
│ │↳7. DB::BuilderRWBufferFromHTTP::create(Poco::Net::HTTPBasicCredentials const&) @ 0x00000000158d1821 ↴│
│ │↳8. DB::DataPartsExchange::Fetcher::fetchSelectedPart(std::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::shared_ptr<DB::Context const>, String const&, String const&, String const&, String const&, int, DB::ConnectionTimeouts const&, String const&, String const&, String const&, std::shared_ptr<DB::IThrottler>, bool, String const&, std::optional<DB::CurrentlySubmergingEmergingTagger>*, bool, std::shared_ptr<DB::IDisk>) @ 0x000000001924a018↴│
│ │↳9. DB::DataPartsExchange::Fetcher::fetchSelectedPart(std::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::shared_ptr<DB::Context const>, String const&, String const&, String const&, String const&, int, DB::ConnectionTimeouts const&, String const&, String const&, String const&, std::shared_ptr<DB::IThrottler>, bool, String const&, std::optional<DB::CurrentlySubmergingEmergingTagger>*, bool, std::shared_ptr<DB::IDisk>) @ 0x00000000192513de↴│
│ │↳10. std::shared_ptr<DB::IMergeTreeDataPart> std::__function::__policy_invoker<std::shared_ptr<DB::IMergeTreeDataPart> ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<DB::StorageReplicatedMergeTree::fetchPart(String const&, std::shared_ptr<DB::StorageInMemoryMetadata const> const&, String const&, String const&, bool, unsigned long, std::shared_ptr<zkutil::ZooKeeper>, bool)::$_4, std::shared_ptr<DB::IMergeTreeDataPart> ()>>(std::__function::__policy_storage const*) @ 0x0000000018f1dd09↴│
│ │↳11. DB::StorageReplicatedMergeTree::fetchPart(String const&, std::shared_ptr<DB::StorageInMemoryMetadata const> const&, String const&, String const&, bool, unsigned long, std::shared_ptr<zkutil::ZooKeeper>, bool) @ 0x0000000018de290f ↴│
│ │↳12. DB::StorageReplicatedMergeTree::executeFetch(DB::ReplicatedMergeTreeLogEntry&, bool) @ 0x0000000018dce760 ↴│
│ │↳13. DB::StorageReplicatedMergeTree::executeLogEntry(DB::ReplicatedMergeTreeLogEntry&) @ 0x0000000018dba56a ↴│
│ │↳14. bool std::__function::__policy_invoker<bool (std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&)>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<DB::StorageReplicatedMergeTree::processQueueEntry(std::shared_ptr<DB::ReplicatedMergeTreeQueue::SelectedEntry>)::$_1, bool (std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&)>>(std::__function::__policy_storage const*, std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&) @ 0x0000000018f1a71b↴│
│ │↳15. DB::ReplicatedMergeTreeQueue::processEntry(std::function<std::shared_ptr<zkutil::ZooKeeper> ()>, std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&, std::function<bool (std::shared_ptr<DB::ReplicatedMergeTreeLogEntry>&)>) @ 0x00000000197f7968 ↴│
│ │↳16. DB::StorageReplicatedMergeTree::processQueueEntry(std::shared_ptr<DB::ReplicatedMergeTreeQueue::SelectedEntry>) @ 0x0000000018e1381c ↴│
│ │↳17. DB::ExecutableLambdaAdapter::executeStep() @ 0x00000000192a6c12 ↴│
│ │↳18. DB::TaskRuntimeData::executeStep() const @ 0x0000000019389a0c ↴│
│ │↳19. DB::MergeTreeBackgroundExecutor<DB::RoundRobinRuntimeQueue>::threadFunction() @ 0x000000001938bc0d ↴│
│ │↳20. ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool::worker() @ 0x00000000136f5c2b ↴│
│ │↳21. void std::__function::__policy_invoker<void ()>::__call_impl[abi:ne190107]<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false, true>::ThreadFromGlobalPoolImpl<void (ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool::*)(), ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool*>(void (ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool::*&&)(), ThreadPoolImpl<ThreadFromGlobalPoolImpl<false, true>>::ThreadFromThreadPool*&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x00000000136fcfa6↴│
│ │↳22. ThreadPoolImpl<std::thread>::ThreadFromThreadPool::worker() @ 0x00000000136f2c12 ↴│
│ │↳23. void* std::__thread_proxy[abi:ne190107]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void (ThreadPoolImpl<std::thread>::ThreadFromThreadPool::*)(), ThreadPoolImpl<std::thread>::ThreadFromThreadPool*>>(void*) @ 0x00000000136fa6da↴│
│ │↳24. ? @ 0x0000000000094ac3 ↴│
│ │↳25. ? @ 0x0000000000125a74 ↴│
│ │↳ (version 25.8.15.35 (official build)) │
└─────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Additional context
Data parts comparison between replicas
After the reproduction steps, the active parts differ between replicas.
Query executed:
SELECT name, rows
FROM system.parts
WHERE database='s3test'
AND table='add_remove_one_replica'
AND active
ORDER BY name;
Node 1 (source replica)
Result:
┌─name──────┬────rows─┐
1. │ all_0_0_0 │ 1000000 │
2. │ all_1_1_0 │ 1000000 │
└───────────┴─────────┘
Node 3 (affected replica)
Result:
┌─name──────┬────rows─┐
1. │ all_1_1_0 │ 1000000 │
└───────────┴─────────┘
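The comparison above can also be done mechanically. The sketch below assumes the same `system.parts` query is run with `FORMAT TSV` appended, so each output line is `name<TAB>rows`; `parse_parts` and `missing_parts` are illustrative names only. Comparing part names directly is only meaningful while no merges have run differently on the replicas, which holds for the two-part state shown here.

```python
def parse_parts(tsv_output):
    """Parse `name<TAB>rows` lines into {part_name: rows}."""
    parts = {}
    for line in tsv_output.splitlines():
        if not line.strip():
            continue  # skip blank lines such as a trailing newline
        name, rows = line.split("\t")
        parts[name] = int(rows)
    return parts


def missing_parts(source, replica):
    """Names of parts active on `source` but absent on `replica`, sorted."""
    return sorted(set(source) - set(replica))


# Values taken from the Node 1 and Node 3 results above.
node1 = parse_parts("all_0_0_0\t1000000\nall_1_1_0\t1000000\n")
node3 = parse_parts("all_1_1_0\t1000000\n")
print(missing_parts(node1, node3))  # ['all_0_0_0']
```

The missing part is the one written by the first insert, before node3 was dropped and recreated, which matches the single failing fetch entry in the replication queue.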