Fix resume data loss: route heartbeat coords through applyEventsQueue#1684
Open
grodowski wants to merge 1 commit into
Open
Conversation
onChangelogHeartbeatEvent was mutating applier.CurrentCoordinates directly from the streamer goroutine, before any DML that preceded the heartbeat was applied to the ghost table. The checkpoint loop reads CurrentCoordinates as "applied through this GTID" and could persist a checkpoint whose LastTrxCoords was ahead of what was actually applied. If gh-ost crashed before applyEventsQueue drained, --resume read that checkpoint and called StartSyncGTID with the persisted set; MySQL treated the un-applied GTIDs as already-seen and never re-streamed them. The ghost table silently lost those DMLs and cut-over produced a stale table. Fix: enqueue a tableWriteFunc onto applyEventsQueue that performs the coords bump. The apply goroutine executes it in order, after the DMLs the streamer enqueued before the heartbeat, restoring the invariant. Adds TestMigratorHeartbeatDoesNotAdvancePastUnappliedDML, which fails at the previous HEAD and passes after the fix; also asserts queue ordering to guard against future changes that wrap the heartbeat enqueue in a goroutine. Co-authored-by: Bastian Bartmann <bastian.bartmann@shopify.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes a resume/checkpoint correctness bug where changelog heartbeats could advance applier.CurrentCoordinates from the streamer goroutine before earlier queued DMLs were actually applied, allowing checkpoints to persist GTIDs that were not applied and causing --resume to skip those transactions (silent data loss).
Changes:
- Route heartbeat coordinate advancement through
applyEventsQueueto preserve ordering relative to queued DML. - Add a regression test ensuring heartbeats do not advance
CurrentCoordinatesbeyond unapplied queued DML.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| go/logic/migrator.go | Enqueue heartbeat coordinate updates through applyEventsQueue to preserve apply-order invariants needed for safe checkpoint/resume. |
| go/logic/migrator_test.go | Adds a regression test that reproduces the “heartbeat advances past unapplied DML” failure mode and verifies correct queue ordering. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| mgtr.applier.CurrentCoordinatesMutex.Unlock() | ||
| return nil | ||
| } | ||
| mgtr.applyEventsQueue <- newApplyEventStructByFunc(&writeFunc) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
onChangelogHeartbeatEventwas mutatingapplier.CurrentCoordinatesdirectly from the streamer goroutine (introduced by #1595) , before any DML that preceded the heartbeat was applied to the ghost table. The checkpoint loop readsCurrentCoordinatesas "applied through this GTID" and could persist a checkpoint whoseLastTrxCoordswas ahead of what was actually applied.If
gh-ostcrashed beforeapplyEventsQueuedrained,--resumeread that checkpoint and calledStartSyncGTIDwith the persisted set; MySQL treated the un-applied GTIDs as already-seen and never re-streamed them. The ghost table silently lost those DMLs and cut-over produced a stale table.Fix: enqueue the heartbeat through
applyEventsQueue. The apply goroutine executes it in order, after the DMLs the streamer enqueued before the heartbeat, restoring the invariant.Adds
TestMigratorHeartbeatDoesNotAdvancePastUnappliedDML, which fails at the previous HEAD and passes after the fix.script/cibuildreturns with no formatting errors, build errors or unit test errors.