Context:
I have a Postgres DB in the cloud that serves as the Event Store. A MicroService (also in the cloud) subscribes to the Event Store and synchronizes Entity states to a SQL Server database (also in the cloud). For now I only have Create, Update, and Delete events, so there is no complex read-model/transformation logic. The subscription Checkpoint is maintained in the receiving database (SQL Server). I have begun running bulk/load tests to check performance and reliability.
Scenario:
I have uploaded 20,000 events to the Event Store, concerning 4,000 Entities; each Entity gets 1 Create event followed by 4 Update events. The events were generated sequentially (synchronously), so they are stored in order (events 1-5 concern Entity 1, events 6-10 concern Entity 2, etc.). The event data is written in such a way that, at the end, I can easily query whether Entities are out of sync or events were processed out of (chronological) order.
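To make the test layout concrete, here is a minimal Python sketch of it (Python only for brevity; the real service is not written in it). The names `Event`, `generate_events`, `project`, and `find_out_of_sync` are hypothetical, and the `version` field is an assumption standing in for whatever the real event data carries to detect ordering problems.

```python
from __future__ import annotations

from dataclasses import dataclass

EVENTS_PER_ENTITY = 5  # 1 Create + 4 Updates, as in the scenario


@dataclass(frozen=True)
class Event:
    position: int   # global position in the Event Store, starting at 1
    entity_id: int
    kind: str       # "Create" or "Update"
    version: int    # assumed ordering marker: 1 for the Create, 2..5 for Updates


def generate_events(entity_count: int) -> list[Event]:
    """Build the sequential stream: events 1-5 are Entity 1, 6-10 Entity 2, etc."""
    events = []
    for entity_id in range(1, entity_count + 1):
        for version in range(1, EVENTS_PER_ENTITY + 1):
            position = (entity_id - 1) * EVENTS_PER_ENTITY + version
            kind = "Create" if version == 1 else "Update"
            events.append(Event(position, entity_id, kind, version))
    return events


def project(events: list[Event]) -> tuple[dict[int, int], list[int]]:
    """Apply events in the given order; collect positions that arrived out of order."""
    versions: dict[int, int] = {}
    violations: list[int] = []
    for event in events:
        expected = versions.get(event.entity_id, 0) + 1
        if event.version != expected:
            violations.append(event.position)
        versions[event.entity_id] = event.version
    return versions, violations


def find_out_of_sync(versions: dict[int, int], entity_count: int) -> list[int]:
    """Return ids of Entities whose final version is not the last expected one."""
    return [
        entity_id
        for entity_id in range(1, entity_count + 1)
        if versions.get(entity_id) != EVENTS_PER_ENTITY
    ]
```

With `generate_events(4000)` this yields the 20,000 events described above; a clean run should leave both the violation list and the out-of-sync list empty.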
Expected:
After starting the MicroService and waiting for the subscription to catch up (checkpoint is 0-leading) I expect to see the correct outcome after each run:

After a run, I stop the MicroService and delete the Checkpoint and the Entities from the database in order to re-run the test.
Issue:
In some runs (not all, and not predictably), the subscription loses track of its Checkpoint very early on:

The above screenshot was taken after the subscription had caught up (confirmed by logging), but as can be seen its Checkpoint is badly out of sync. This means that, were I to restart the MicroService (in a Production scenario, for instance, to release a newer version of the app), it would reconstruct its entire read-model unnecessarily. That is undesirable in our business case.
The issue occurs intermittently. It either breaks almost immediately (within the first 20 or so processed events) or not at all. From glancing at the Eventuous code I learned that the SaveCheckpoint method only updates the Checkpoint when the stored position is exactly one less than the new one (e.g. it only updates to 14 if the current position is 13). Thus, if even a single Save is missed, the Checkpoint can never be updated again by the same running application (the only fix would be a restart, causing the reconstruction mentioned above).
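The suspected failure mode can be illustrated with a toy model in Python. This is only a sketch of the behavior as I understand it from reading the code, not Eventuous's actual implementation; `StrictCheckpoint` and `replay` are hypothetical names, and the starting position of 0 is an assumption.

```python
from typing import Iterable


class StrictCheckpoint:
    """Toy model: the checkpoint only advances when the new position is current + 1."""

    def __init__(self, start: int = 0) -> None:
        # Assumption: the stored position starts at 0 before any event is processed.
        self.position = start

    def save(self, position: int) -> bool:
        """Store the position if it directly follows the current one; otherwise ignore it."""
        if position == self.position + 1:
            self.position = position
            return True
        return False


def replay(saves: Iterable[int]) -> int:
    """Feed a sequence of save attempts to a fresh checkpoint and return where it ends up."""
    checkpoint = StrictCheckpoint()
    for position in saves:
        checkpoint.save(position)
    return checkpoint.position
```

In this model, `replay(range(1, 20001))` ends at 20000, while dropping a single save, as in `replay(p for p in range(1, 20001) if p != 13)`, leaves the checkpoint stuck at 12 for the rest of the run. That matches the pattern I observe: one lost save early on, and no further updates afterwards.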
Guess:
Since I always see it breaking shortly after application startup or not at all, I am guessing there is an issue with application startup, bootstrapping, or establishing the initial database connection. Somewhere in that process, one of the SaveCheckpoint actions is lost (oddly, it loses for instance action 13 after the first 12 were processed successfully). So far I have not been able to lose a Checkpoint once the application has been running for a few seconds, even under a heavy processing load.
Addendum:
I should add that I have also encountered the above scenario while running the MicroService locally, connected to the two databases in the cloud.