Fix resume data loss: route heartbeat coords through applyEventsQueue #1684

Open
grodowski wants to merge 1 commit into github:master from Shopify:grodowski/coding-chimp/fix-heartbeat-dml-checkpoint-race-gh

Conversation

@grodowski
Contributor

onChangelogHeartbeatEvent was mutating applier.CurrentCoordinates directly from the streamer goroutine (introduced by #1595), before any DML that preceded the heartbeat was applied to the ghost table. The checkpoint loop reads CurrentCoordinates as "applied through this GTID" and could persist a checkpoint whose LastTrxCoords was ahead of what was actually applied.

If gh-ost crashed before applyEventsQueue drained, --resume read that checkpoint and called StartSyncGTID with the persisted set; MySQL treated the un-applied GTIDs as already-seen and never re-streamed them. The ghost table silently lost those DMLs and cut-over produced a stale table.
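To make the failure mode concrete, here is a minimal model of it, not gh-ost code. It assumes events carry a simple increasing sequence number in place of a real GTID set, and `event`, `replayAfter`, and `lostOnResume` are hypothetical names used only for illustration. It shows that a checkpoint ahead of the applied position causes the un-applied DML to be neither applied nor re-streamed.

```go
package main

import "fmt"

// event is a hypothetical DML event. seq stands in for a GTID; real GTID
// sets are not a single integer, so this is a deliberate simplification.
type event struct {
	seq int
	dml string
}

// replayAfter models what the source re-streams on resume: only events
// strictly after the persisted checkpoint.
func replayAfter(stream []event, checkpoint int) []event {
	var out []event
	for _, e := range stream {
		if e.seq > checkpoint {
			out = append(out, e)
		}
	}
	return out
}

// lostOnResume returns events that were neither applied before the crash
// nor replayed after resuming from the checkpoint.
func lostOnResume(stream []event, applied map[int]bool, checkpoint int) []event {
	replayed := make(map[int]bool)
	for _, e := range replayAfter(stream, checkpoint) {
		replayed[e.seq] = true
	}
	var lost []event
	for _, e := range stream {
		if !applied[e.seq] && !replayed[e.seq] {
			lost = append(lost, e)
		}
	}
	return lost
}

func main() {
	stream := []event{{1, "INSERT ..."}, {2, "UPDATE ..."}}
	applied := map[int]bool{1: true} // crash happened before seq 2 was applied

	// A heartbeat observed at seq 3 bumped the checkpoint ahead of the apply position.
	fmt.Println(len(lostOnResume(stream, applied, 3))) // 1: seq 2 is silently lost

	// A checkpoint that matches what was applied loses nothing.
	fmt.Println(len(lostOnResume(stream, applied, 1))) // 0
}
```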

Fix: enqueue the heartbeat through applyEventsQueue. The apply goroutine executes it in order, after the DMLs the streamer enqueued before the heartbeat, restoring the invariant.
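The ordering argument can be sketched with a simplified, self-contained model. This is an illustration under assumptions, not the actual change in go/logic/migrator.go: `coords`, `applier`, `applyEvent`, `enqueueHeartbeat`, and `applyNext` are hypothetical stand-ins, and coordinates are reduced to an integer. The point is that a coordinate bump placed on the same FIFO queue as the DML cannot become visible before that DML is applied.

```go
package main

import (
	"fmt"
	"sync"
)

// coords is a hypothetical stand-in for binlog/GTID coordinates.
type coords int

// applier models only the state relevant to the invariant:
// current must never be ahead of what has been applied.
type applier struct {
	mu      sync.Mutex
	current coords
	applied []string
}

func (a *applier) setCoords(c coords) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.current = c
}

func (a *applier) getCoords() coords {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.current
}

// applyEvent is either a DML (dml, at) or a deferred write function.
type applyEvent struct {
	dml       string
	at        coords
	writeFunc func() error
}

// enqueueHeartbeat queues the coordinate bump instead of mutating the
// applier directly from the streamer side.
func enqueueHeartbeat(queue chan<- applyEvent, a *applier, c coords) {
	queue <- applyEvent{writeFunc: func() error {
		a.setCoords(c)
		return nil
	}}
}

// applyNext consumes one event in queue order, as the apply goroutine would.
func applyNext(queue <-chan applyEvent, a *applier) error {
	ev := <-queue
	if ev.writeFunc != nil {
		return ev.writeFunc()
	}
	a.mu.Lock()
	a.applied = append(a.applied, ev.dml)
	a.current = ev.at
	a.mu.Unlock()
	return nil
}

// simulate enqueues a DML followed by a heartbeat, samples the coordinates
// before anything is applied (what a checkpoint would see), then drains.
func simulate() (beforeDrain, afterDrain coords, appliedCount int) {
	a := &applier{}
	queue := make(chan applyEvent, 4)
	queue <- applyEvent{dml: "UPDATE ...", at: 10}
	enqueueHeartbeat(queue, a, 11)
	beforeDrain = a.getCoords()
	for i := 0; i < 2; i++ {
		if err := applyNext(queue, a); err != nil {
			panic(err)
		}
	}
	return beforeDrain, a.getCoords(), len(a.applied)
}

func main() {
	before, after, n := simulate()
	fmt.Println(before, after, n) // 0 11 1
}
```

In this model a checkpoint taken while the queue is still full sees coordinates 0, not 11, so a resume would re-stream the pending DML rather than skip it.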

Adds TestMigratorHeartbeatDoesNotAdvancePastUnappliedDML, which fails at the previous HEAD and passes after the fix.

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

onChangelogHeartbeatEvent was mutating applier.CurrentCoordinates directly
from the streamer goroutine, before any DML that preceded the heartbeat was
applied to the ghost table. The checkpoint loop reads CurrentCoordinates as
"applied through this GTID" and could persist a checkpoint whose
LastTrxCoords was ahead of what was actually applied.

If gh-ost crashed before applyEventsQueue drained, --resume read that
checkpoint and called StartSyncGTID with the persisted set; MySQL treated
the un-applied GTIDs as already-seen and never re-streamed them. The ghost
table silently lost those DMLs and cut-over produced a stale table.

Fix: enqueue a tableWriteFunc onto applyEventsQueue that performs the
coords bump. The apply goroutine executes it in order, after the DMLs the
streamer enqueued before the heartbeat, restoring the invariant.

Adds TestMigratorHeartbeatDoesNotAdvancePastUnappliedDML, which fails at
the previous HEAD and passes after the fix; also asserts queue ordering to
guard against future changes that wrap the heartbeat enqueue in a goroutine.

Co-authored-by: Bastian Bartmann <bastian.bartmann@shopify.com>
Copilot AI review requested due to automatic review settings May 20, 2026 14:56
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a resume/checkpoint correctness bug where changelog heartbeats could advance applier.CurrentCoordinates from the streamer goroutine before earlier queued DMLs were actually applied, allowing checkpoints to persist GTIDs that were not applied and causing --resume to skip those transactions (silent data loss).

Changes:

  • Route heartbeat coordinate advancement through applyEventsQueue to preserve ordering relative to queued DML.
  • Add a regression test ensuring heartbeats do not advance CurrentCoordinates beyond unapplied queued DML.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| go/logic/migrator.go | Enqueue heartbeat coordinate updates through applyEventsQueue to preserve apply-order invariants needed for safe checkpoint/resume. |
| go/logic/migrator_test.go | Adds a regression test that reproduces the “heartbeat advances past unapplied DML” failure mode and verifies correct queue ordering. |

Comment thread go/logic/migrator.go
mgtr.applier.CurrentCoordinatesMutex.Unlock()
return nil
}
mgtr.applyEventsQueue <- newApplyEventStructByFunc(&writeFunc)