Skip to content

contrib: wait cluster bootstrap before put store to pd#10863

Open
yongman wants to merge 1 commit into
pingcap:masterfrom
yongman:fix-bootstrap
Open

contrib: wait cluster bootstrap before put store to pd#10863
yongman wants to merge 1 commit into
pingcap:masterfrom
yongman:fix-bootstrap

Conversation

@yongman
Copy link
Copy Markdown
Member

@yongman yongman commented May 22, 2026

What problem does this PR solve?

Issue Number: close #10859

Problem Summary:
Panic when cluster not bootstrapped in columnar mode.

What is changed and how it works?


Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

Manual test

[2026/05/22 19:08:23.773 +08:00] [ERROR] [TCPServersHolder.cpp:158] ["tls config is set but tcp_port_secure is not set."] [thread_id=1]
[2026/05/22 19:08:23.773 +08:00] [INFO] [TCPServersHolder.cpp:167] ["Listening tcp: 0.0.0.0:9001"] [thread_id=1]
[2026/05/22 19:08:23.773 +08:00] [INFO] [Server.cpp:1183] ["Available RAM = 101332271104; physical cores = 56; logical cores = 56."] [thread_id=1]
[2026/05/22 19:08:23.774 +08:00] [INFO] [Server.cpp:1186] ["Ready for connections."] [thread_id=1]
[2026/05/22 19:08:23.774 +08:00] [INFO] [MetricsPrometheus.cpp:237] ["Config: status.metrics_interval = 15"] [source=Prometheus] [thread_id=1]
[2026/05/22 19:08:23.774 +08:00] [INFO] [MetricsPrometheus.cpp:242] ["Disable prometheus push mode, cause status.metrics_addr is not set!"] [source=Prometheus] [thread_id=1]
[2026/05/22 19:08:23.777 +08:00] [INFO] [MetricsPrometheus.cpp:290] ["Enable prometheus secure pull mode; Listen Host = 0.0.0.0, Metrics Port = 8235"] [source=Prometheus] [thread_id=1]
[2026/05/22 19:08:23.778 +08:00] [INFO] [ProxyStateMachine.h:352] ["Waiting for TiKV cluster to be bootstrapped"] [thread_id=1]
[2026/05/22 19:08:23.779 +08:00] [ERROR] [ProxyStateMachine.h:359] ["Waiting for cluster to be bootstrapped, we will sleep for 3 seconds and try again."] [thread_id=1]
[2026/05/22 19:08:25.778 +08:00] [INFO] [KVStoreConfig.cpp:75] ["Region compact log thresholds, rows=40960 bytes=33554432 gap=200 eager_gc_gap=4096"] [thread_id=3]
[2026/05/22 19:08:25.778 +08:00] [INFO] [KVStoreConfig.cpp:97] ["read-index timeout: 10000ms; wait-index timeout: 300000ms; wait-region-ready timeout: 1200s; read-index-worker-tick: 10ms"] [thread_id=3]
[2026/05/22 19:08:25.779 +08:00] [INFO] [RateLimiter.cpp:511] ["storage.io_rate_limit is not found in config, use default config."] [source=IORateLimiter] [thread_id=3]
[2026/05/22 19:08:25.779 +08:00] [INFO] [RateLimiter.cpp:515] ["storage.io_rate_limit is not changed."] [source=IORateLimiter] [thread_id=3]
[2026/05/22 19:08:25.780 +08:00] [INFO] [Context.cpp:693] ["reload delta tree "] [source=Context] [thread_id=3]
[2026/05/22 19:08:26.781 +08:00] [ERROR] [ProxyStateMachine.h:359] ["Waiting for cluster to be bootstrapped, we will sleep for 3 seconds and try again."] [thread_id=1]
^C[2026/05/22 19:08:28.750 +08:00] [INFO] [BaseDaemon.cpp:1293] ["Received termination signal (Interrupt)"] [source=Application] [thread_id=4]
[2026/05/22 19:08:28.751 +08:00] [ERROR] [ProxyStateMachine.h:359] ["Waiting for cluster to be bootstrapped, we will sleep for 3 seconds and try again."] [thread_id=1]
^C[2026/05/22 19:08:29.238 +08:00] [INFO] [BaseDaemon.cpp:1293] ["Received termination signal (Interrupt)"] [source=Application] [thread_id=4]

Summary by CodeRabbit

  • Refactor
    • Improved TiFlash Columnar Hub startup sequence by optimizing the timing of store registration and heartbeat initialization to occur after the engine-store server reaches operational readiness, enhancing system stability during initialization.

Review Change Stack

Signed-off-by: yongman <yming0221@gmail.com>
@ti-chi-bot ti-chi-bot Bot added the release-note-none Denotes a PR that doesn't merit a release note. label May 22, 2026
@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented May 22, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign calvinneo for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pantheon-ai
Copy link
Copy Markdown

pantheon-ai Bot commented May 22, 2026

@yongman I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

ℹ️ Learn more details on Pantheon AI.

@ti-chi-bot ti-chi-bot Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 22, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

📝 Walkthrough

Walkthrough

run_proxy now defers TiFlash Columnar Hub store registration and heartbeat startup until the engine-store server transitions to Running. A store_registration slot holds the constructed store and start_time; PD registration, heartbeat initialization, and heartbeat-thread spawning occur only after observing EngineStoreServerStatus::Running. The main loop refreshes engine_store_status each iteration.

Changes

Deferred Store Registration on Engine-Store Ready

Layer / File(s) Summary
Store preparation and deferral slot
contrib/tiflash-columnar-hub/hub-runtime/src/run.rs
Introduces store_registration optional to hold the constructed metapb::Store and start_time, and stores the prepared store into that slot instead of immediately registering it with PD.
Deferred registration and heartbeat on Running status
contrib/tiflash-columnar-hub/hub-runtime/src/run.rs
Restructures engine-store status handling to wait for Idle, then gates put_store(...) and heartbeat_context initialization behind Running observation; refreshes engine_store_status each loop iteration to detect transitions and remain responsive to Stopping/Terminated signals.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 The store now waits with patient grace,
For engines running into place,
No panic when PD's not quite right,
Just defer until the stars align bright!
A gentle heartbeat in the night. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title directly matches the main change: deferring store registration to PD until cluster bootstrap completes.
Linked Issues check ✅ Passed The code changes implement the core requirement from #10859: wait until cluster bootstrap before registering store to PD.
Out of Scope Changes check ✅ Passed All changes in run.rs are directly scoped to the bootstrap-wait requirement; no unrelated modifications detected.
Description check ✅ Passed The PR description includes the issue number, problem summary, checklist with manual test marked, and release note as required by the template.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented May 22, 2026

@yongman: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-sanitizer-asan 09ec8e0 link false /test pull-sanitizer-asan

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Columnar Hub panics when TiFlash starts before PD cluster bootstrap completes

1 participant