[WIP]#5143
Conversation
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request modifies the GC safety checks to allow the changefeed startTs (or checkpointTs) to be equal to the GC safepoint, rather than requiring it to be strictly greater. It introduces getGCSafepointUpperBound to set the GC barrier at startTs - 1 and updates tests accordingly. The review feedback highlights a critical issue in next-gen mode where setting startTs == gcSafepoint could leave the changefeed unprotected if the barrier is not explicitly set to startTs, and suggests adding assertions in tests to verify the barrier is correctly set at this boundary.
| if err != nil { | ||
| return errors.ErrStartTsBeforeGC.GenWithStackByArgs(startTs) | ||
| if !errors.IsGCBarrierTSBehindTxnSafePointError(err) { | ||
| return errors.ErrStartTsBeforeGC.GenWithStackByArgs(startTs, gcSafepointUpperBound) | ||
| } | ||
| gcSafepoint, getErr := UnifyGetServiceGCSafepoint(ctx, pdCli, keyspaceID, gcServiceID) | ||
| if getErr != nil { | ||
| return getErr | ||
| } | ||
| if startTs > 0 && startTs < gcSafepoint { | ||
| return errors.ErrStartTsBeforeGC.GenWithStackByArgs(startTs, gcSafepoint) | ||
| } | ||
| } |
There was a problem hiding this comment.
In next-gen mode, if SetGCBarrier fails because gcSafepointUpperBound (startTs - 1) is behind the current GC safepoint, and startTs == gcSafepoint, the current implementation simply returns nil (success) without setting any barrier. This leaves the changefeed unprotected, and a subsequent GC run could advance the safepoint past startTs before the changefeed starts and sets its own barrier.
To prevent this potential data loss, we should explicitly set the GC barrier to startTs when startTs == gcSafepoint.
if err != nil {
if !errors.IsGCBarrierTSBehindTxnSafePointError(err) {
return errors.ErrStartTsBeforeGC.GenWithStackByArgs(startTs, gcSafepointUpperBound)
}
gcSafepoint, getErr := UnifyGetServiceGCSafepoint(ctx, pdCli, keyspaceID, gcServiceID)
if getErr != nil {
return getErr
}
if startTs > 0 && startTs < gcSafepoint {
return errors.ErrStartTsBeforeGC.GenWithStackByArgs(startTs, gcSafepoint)
}
if startTs == gcSafepoint {
_, err = SetGCBarrier(ctx, gcCli, gcServiceID, startTs, time.Duration(ttl)*time.Second)
if err != nil {
return errors.Trace(err)
}
}
}| err = EnsureChangefeedStartTsSafety(ctx, pdCli, | ||
| "ticdc-creating-", | ||
| 0, | ||
| common.NewChangeFeedIDWithName("changefeed-boundary", "default"), TTL, 60) | ||
| require.NoError(t, err) |
There was a problem hiding this comment.
We should assert that the GC barrier is actually set and active in PD when startTs == gcSafepoint (boundary condition), rather than just checking that no error is returned.
| err = EnsureChangefeedStartTsSafety(ctx, pdCli, | |
| "ticdc-creating-", | |
| 0, | |
| common.NewChangeFeedIDWithName("changefeed-boundary", "default"), TTL, 60) | |
| require.NoError(t, err) | |
| err = EnsureChangefeedStartTsSafety(ctx, pdCli, | |
| "ticdc-creating-", | |
| 0, | |
| common.NewChangeFeedIDWithName("changefeed-boundary", "default"), TTL, 60) | |
| require.NoError(t, err) | |
| require.Equal(t, uint64(60), pdCli.gcBarriers["ticdc-creating-default_changefeed-boundary"]) |
|
[FORMAT CHECKER NOTIFICATION] Notice: To remove the 📖 For more info, you can check the "Contribute Code" section in the development guide. |
What problem does this PR solve?
Issue Number: close #xxx
What is changed and how it works?
Check List
Tests
Questions
Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?
Release note