[SS-83] fix MySQL dataflow restarts#35458

Merged
patrickwwbutler merged 8 commits into MaterializeInc:main from patrickwwbutler:patrick/mysql-df-restart
Mar 30, 2026


@patrickwwbutler patrickwwbutler commented Mar 12, 2026

We've had an issue with "zombie" dataflows that run forever. When a definite error is emitted at the max GTID, the stats operator continues to emit probes, so the output never reaches the max GTID, which is what would surface the error and kill the dataflow. This PR addresses that by adding a new input to the statistics operator that takes in the error streams from the snapshot and replication operators. The stats operator consumes this input regularly and exits upon receiving a definite error.
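
As a rough illustration of the idea (not the code in this PR), here is a minimal sketch that uses a standard-library channel in place of the dataflow's error input. `ReplicationError::Definite` is the variant named in the diff; the `Transient` variant, the `run_stats_loop` function, and the probe counter are hypothetical stand-ins.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};

/// Stand-in for the source's error type. `Definite` appears in the PR;
/// `Transient` is a hypothetical variant added for illustration.
#[derive(Debug)]
pub enum ReplicationError {
    Definite(String),
    Transient(String),
}

/// Emits up to `max_probes` probes, draining `errors` before each one.
/// Returns the number of probes emitted and the definite error, if any.
pub fn run_stats_loop(
    errors: &Receiver<ReplicationError>,
    max_probes: usize,
) -> (usize, Option<String>) {
    let mut probes = 0;
    while probes < max_probes {
        // Drain everything currently queued on the error input.
        loop {
            match errors.try_recv() {
                // A definite error ends the whole loop, so no further
                // probes are emitted after it.
                Ok(ReplicationError::Definite(msg)) => return (probes, Some(msg)),
                // Other errors are ignored in this sketch.
                Ok(ReplicationError::Transient(_)) => continue,
                Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => break,
            }
        }
        // No definite error seen: emit the next probe (only counted here).
        probes += 1;
    }
    (probes, None)
}
```

Calling `run_stats_loop(&rx, 5)` after a definite error has been sent on the channel returns before any probe is emitted. With no errors queued, it emits all five probes.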

Motivation

https://linear.app/materializeinc/issue/SS-83/mysql-dataflow-restarts-on-definite-error-causes-data-incorrectness

Verification

Added some new tests and updated an existing test

@github-actions

Thanks for opening this PR! Here are a few tips to help make the review process smooth for everyone.

PR title guidelines

  • Use imperative mood: "Fix X" not "Fixed X" or "Fixes X"
  • Be specific: "Fix panic in catalog sync when controller restarts" not "Fix bug" or "Update catalog code"
  • Prefix with area if helpful: compute: , storage: , adapter: , sql:

Pre-merge checklist

  • The PR title is descriptive and will make sense in the git log.
  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).

@patrickwwbutler patrickwwbutler changed the title from "logs/traces" to "fix MySQL dataflow restarts" Mar 12, 2026
@patrickwwbutler patrickwwbutler force-pushed the patrick/mysql-df-restart branch from b0d6561 to 6ae9a99 Compare March 19, 2026 14:13
@patrickwwbutler patrickwwbutler changed the title from "fix MySQL dataflow restarts" to "[SS-83] fix MySQL dataflow restarts" Mar 24, 2026
@patrickwwbutler patrickwwbutler marked this pull request as ready for review March 26, 2026 14:23
@patrickwwbutler patrickwwbutler requested a review from a team as a code owner March 26, 2026 14:23
@DAlperin DAlperin requested a review from a team March 26, 2026 14:43

@martykulma martykulma left a comment

Nice work! Cloning the error streams should be ok as we don't expect much data to flow through them.

Some of the logging added is at info, but it seems it should be at warn - I didn't flag all of them though!

Comment thread src/storage/src/source/mysql/replication/events.rs Outdated
Comment thread src/storage/src/source/mysql/statistics.rs Outdated
@patrickwwbutler patrickwwbutler merged commit 9e91428 into MaterializeInc:main Mar 30, 2026
126 checks passed
def- added a commit to def-/materialize that referenced this pull request Mar 31, 2026
for err in err_data {
    if let ReplicationError::Definite(def_err) = err {
        tracing::info!(
            "ts: {:?} Definite replication error detected in statistics operator: {def_err}, exiting", ts

It doesn't actually exit, since there are two loops here and we only break out of the inner one
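
For context, a plain `break` in Rust only leaves the innermost loop; a labeled `break` is one way to leave both. The sketch below contrasts the two. The enclosing loop is not visible in the excerpt above, so the outer loop over batches, the `'outer` label, and both function names are assumptions made for illustration, not the operator's actual code.

```rust
/// Stand-in for the source's error type. `Definite` appears in the diff;
/// `Transient` is a hypothetical variant added for illustration.
#[derive(Debug, Clone)]
pub enum ReplicationError {
    Definite(String),
    Transient(String),
}

/// Plain `break`: leaves only the inner loop, so later batches are still visited.
pub fn visited_with_inner_break(batches: Vec<Vec<ReplicationError>>) -> usize {
    let mut visited = 0;
    for err_data in batches {
        visited += 1;
        for err in err_data {
            if let ReplicationError::Definite(_) = err {
                break; // exits `for err in err_data` only
            }
        }
    }
    visited
}

/// Labeled `break`: leaves both loops as soon as a definite error is seen.
pub fn visited_with_labeled_break(batches: Vec<Vec<ReplicationError>>) -> usize {
    let mut visited = 0;
    'outer: for err_data in batches {
        visited += 1;
        for err in err_data {
            if let ReplicationError::Definite(_) = err {
                break 'outer; // exits the outer loop as well
            }
        }
    }
    visited
}
```

Returning from the enclosing function would work too, if the operator has no cleanup to run after the loops.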

3 participants