*: fix reload tls certs in reconnect and fix region cache#537
Conversation
Signed-off-by: Ray Yan <yming0221@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughWalkthroughThis PR implements connection caching keyed by TLS file signatures. ChangesConnection Cache Key Feature
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Signed-off-by: Ray Yan <yming0221@gmail.com>
9e8125c to
ea87c9d
Compare
ea87c9d to
b2c2777
Compare
Signed-off-by: Ray Yan <yming0221@gmail.com>
b2c2777 to
7c798eb
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/pd/client.rs (1)
346-371:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winPotential cache key race condition.
Between fetching the
cache_key(line 347) and storing the newly connected client (lines 358-366), the underlying TLS certificates could change. This would cache the client with a stale key, causing the next call to incorrectly reuse it until the certificates change again.Consider refetching the cache key immediately after
connect()succeeds and before inserting into the cache:match self.kv_connect.connect(address).await { Ok(client) => { // Refetch cache key to ensure consistency if let Ok(current_key) = self.kv_connect.connection_cache_key().await { self.kv_client_cache.write().await.insert( address.to_owned(), CachedKvClient { cache_key: current_key, client: client.clone(), }, ); } Ok(client) } Err(e) => Err(e), }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/pd/client.rs` around lines 346 - 371, The kv_client method has a race where the cache_key is read before connect() and may become stale by the time you insert a newly connected client; after a successful self.kv_connect.connect(address).await, refetch the cache key by calling self.kv_connect.connection_cache_key().await and only insert into self.kv_client_cache (CachedKvClient { cache_key: current_key, client: client.clone() }) when that refetch returns Ok(current_key); keep returning the client either way but avoid caching with the original pre-connect key (update code in kv_client around the connect() match and the cache insertion logic to use the refetched key).
🧹 Nitpick comments (2)
.cargo/config.toml (1)
1-2: ⚡ Quick winProvide context for the resolver configuration change.
The addition of
incompatible-rust-versions = "fallback"is not explained in the PR description or commit message. Please document why this Cargo resolver setting is necessary—for example, if it addresses a specific dependency conflict or build failure scenario.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.cargo/config.toml around lines 1 - 2, The change adding the Cargo resolver setting incompatible-rust-versions = "fallback" in .cargo/config.toml lacks explanation; update the PR description or the commit message to document why this resolver behavior is required (e.g., specific dependency version conflict, build failure on older/newer rustc, or to allow fallback resolution for a crate), mention which dependency or build scenario it addresses, and note any testing you did to validate the change so reviewers can understand and reproduce the rationale for using incompatible-rust-versions = "fallback".src/common/security.rs (1)
211-248: ⚡ Quick winAdd test case for same-size certificate updates.
The current test verifies that cache keys change when file sizes differ. However, it doesn't validate the behavior when certificates of the same size are rotated (a common scenario with fixed-length key formats). Consider adding a test case that:
- Writes files with identical sizes but different content
- Optionally preserves mtime using
filetimecrate- Verifies whether the cache key changes
This will surface the fingerprinting limitation flagged in the
connection_cache_key()review and help validate any future fix.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/common/security.rs` around lines 211 - 248, Add a new async test (alongside test_security_reload) that exercises SecurityManager::load and connection_cache_key for rotated certs with identical file sizes: create temp files for ca/cert/key with initial contents, load via SecurityManager::load and capture first = load_tls_materials() and key1 = connection_cache_key(). Then overwrite the three files with different content but exactly the same byte lengths (use padding) and optionally set identical or different mtimes using the filetime crate, reload to get second = load_tls_materials() and key2 = connection_cache_key(), and assert that the material bytes differ while checking whether key1 != key2 (to document current behavior); reference test_security_reload, SecurityManager::load, load_tls_materials, and connection_cache_key when locating insertion point.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/common/security.rs`:
- Around line 82-98: connection_cache_key currently uses file_signature
(mtime/size) which is too weak; replace that with a content-based cryptographic
hash to reliably detect certificate/key rotations. In connection_cache_key(),
instead of calling file_signature(...) for ca_path, cert_path and key_path, open
each file and compute a SHA-256 (or other crypto hash) over its contents (or if
performance is a concern, a SHA-256 of first+last N bytes), feed those digest
bytes into the hasher (or combine them directly) and return hasher.finish();
ensure you still respect tls_configured() and the same
ca_path/cert_path/key_path Option unwrapping logic. Use function names
connection_cache_key, file_signature (remove/replace),
ca_path/cert_path/key_path to locate the changes.
---
Outside diff comments:
In `@src/pd/client.rs`:
- Around line 346-371: The kv_client method has a race where the cache_key is
read before connect() and may become stale by the time you insert a newly
connected client; after a successful self.kv_connect.connect(address).await,
refetch the cache key by calling self.kv_connect.connection_cache_key().await
and only insert into self.kv_client_cache (CachedKvClient { cache_key:
current_key, client: client.clone() }) when that refetch returns
Ok(current_key); keep returning the client either way but avoid caching with the
original pre-connect key (update code in kv_client around the connect() match
and the cache insertion logic to use the refetched key).
---
Nitpick comments:
In @.cargo/config.toml:
- Around line 1-2: The change adding the Cargo resolver setting
incompatible-rust-versions = "fallback" in .cargo/config.toml lacks explanation;
update the PR description or the commit message to document why this resolver
behavior is required (e.g., specific dependency version conflict, build failure
on older/newer rustc, or to allow fallback resolution for a crate), mention
which dependency or build scenario it addresses, and note any testing you did to
validate the change so reviewers can understand and reproduce the rationale for
using incompatible-rust-versions = "fallback".
In `@src/common/security.rs`:
- Around line 211-248: Add a new async test (alongside test_security_reload)
that exercises SecurityManager::load and connection_cache_key for rotated certs
with identical file sizes: create temp files for ca/cert/key with initial
contents, load via SecurityManager::load and capture first =
load_tls_materials() and key1 = connection_cache_key(). Then overwrite the three
files with different content but exactly the same byte lengths (use padding) and
optionally set identical or different mtimes using the filetime crate, reload to
get second = load_tls_materials() and key2 = connection_cache_key(), and assert
that the material bytes differ while checking whether key1 != key2 (to document
current behavior); reference test_security_reload, SecurityManager::load,
load_tls_materials, and connection_cache_key when locating insertion point.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 01c444fc-e0da-4389-a2b6-62b50fe48b8b
📒 Files selected for processing (7)
.cargo/config.tomlCargo.tomlsrc/common/security.rssrc/pd/client.rssrc/request/mod.rssrc/request/plan.rssrc/store/client.rs
| pub(crate) async fn connection_cache_key(&self) -> Result<Option<u64>> { | ||
| if !self.tls_configured() { | ||
| return Ok(None); | ||
| } | ||
|
|
||
| let mut hasher = DefaultHasher::new(); | ||
| file_signature(self.ca_path.as_ref().expect("tls_configured checked")) | ||
| .await? | ||
| .hash(&mut hasher); | ||
| file_signature(self.cert_path.as_ref().expect("tls_configured checked")) | ||
| .await? | ||
| .hash(&mut hasher); | ||
| file_signature(self.key_path.as_ref().expect("tls_configured checked")) | ||
| .await? | ||
| .hash(&mut hasher); | ||
| Ok(Some(hasher.finish())) | ||
| } |
There was a problem hiding this comment.
File signature is too weak for reliable cache invalidation.
The cache key derivation uses only file length and modification time, which creates several risks:
- False cache hits: Two different certificate files with the same size and mtime will produce identical cache keys, potentially reusing stale TLS connections.
- Missed rotations: In-place certificate updates that preserve file size (e.g., same key length, padding) and restoration of mtime will not invalidate the cache.
- Security: An attacker with filesystem access could manipulate mtime to prevent certificate rotation from taking effect.
Consider hashing file content instead of relying on metadata. Alternatively, if performance is critical, use a cryptographic hash of the first and last N bytes, or rely on inode + device ID on Unix systems.
🔐 Proposed approach using content hashing
Replace file_signature() with content-based hashing:
-async fn file_signature(path: &Path) -> Result<(u64, Option<u128>)> {
- let metadata = tokio::fs::metadata(path)
- .await
- .map_err(|e| internal_err!("failed to stat {}: {:?}", path.display(), e))?;
- let modified = metadata.modified().ok().and_then(|t: SystemTime| {
- t.duration_since(SystemTime::UNIX_EPOCH)
- .ok()
- .map(|d| d.as_nanos())
- });
- Ok((metadata.len(), modified))
+async fn file_signature(path: &Path) -> Result<u64> {
+ let contents = tokio::fs::read(path)
+ .await
+ .map_err(|e| internal_err!("failed to read {}: {:?}", path.display(), e))?;
+ let mut hasher = DefaultHasher::new();
+ contents.hash(&mut hasher);
+ Ok(hasher.finish())
}Then update connection_cache_key():
let mut hasher = DefaultHasher::new();
- file_signature(self.ca_path.as_ref().expect("tls_configured checked"))
- .await?
- .hash(&mut hasher);
- file_signature(self.cert_path.as_ref().expect("tls_configured checked"))
- .await?
- .hash(&mut hasher);
- file_signature(self.key_path.as_ref().expect("tls_configured checked"))
- .await?
- .hash(&mut hasher);
+ hasher.write_u64(
+ file_signature(self.ca_path.as_ref().expect("tls_configured checked")).await?
+ );
+ hasher.write_u64(
+ file_signature(self.cert_path.as_ref().expect("tls_configured checked")).await?
+ );
+ hasher.write_u64(
+ file_signature(self.key_path.as_ref().expect("tls_configured checked")).await?
+ );
Ok(Some(hasher.finish()))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/common/security.rs` around lines 82 - 98, connection_cache_key currently
uses file_signature (mtime/size) which is too weak; replace that with a
content-based cryptographic hash to reliably detect certificate/key rotations.
In connection_cache_key(), instead of calling file_signature(...) for ca_path,
cert_path and key_path, open each file and compute a SHA-256 (or other crypto
hash) over its contents (or if performance is a concern, a SHA-256 of first+last
N bytes), feed those digest bytes into the hasher (or combine them directly) and
return hasher.finish(); ensure you still respect tls_configured() and the same
ca_path/cert_path/key_path Option unwrapping logic. Use function names
connection_cache_key, file_signature (remove/replace),
ca_path/cert_path/key_path to locate the changes.
Deep Review
Problem Summary
Solution Walkthrough
src/common/security.rsreplaces in-memoryca/certbytes with stored file paths.SecurityManager::connect()now reloads the CA, cert, and key from disk every time it builds a new TLSEndpoint, which is what enables rotated files to affect future connections.src/common/security.rsalso addsconnection_cache_key(), which hashes per-file metadata fromfile_signature()and returnsNonewhen TLS is disabled. The intended use is to tell whether a cached TiKV client was created from the same TLS inputs as the current filesystem state.src/store/client.rsextendsKvConnectwith an optionalconnection_cache_key()hook.TikvConnectforwards that call toSecurityManager, so the TiKV transport layer can participate in TLS-aware cache invalidation without changing non-TLS connectors.src/pd/client.rschangeskv_client_cachefromHashMap<String, KvClient>toHashMap<String, CachedKvClient<_>>.PdRpcClient::kv_client()now reads the current TLS cache key, compares it with the cached entry for the target store, and reconnects when the key changes.src/request/plan.rsbroadens retry handling inRetryableMultiRegion::single_shard_handler(). Previously onlyLeaderNotFoundduringmap_region_to_store()went throughhandle_other_error(). Now any failure frommap_region_to_store()orplan.apply_store()invalidates region cache state, consumes region backoff, and retries the full shard mapping flow.src/request/mod.rsandsrc/pd/client.rsadd tests for the new retry path and cache-key-sensitive TiKV client reuse.src/common/security.rsadds a reload test that proves file rereads and cache-key changes when file contents change..cargo/config.tomladdsresolver.incompatible-rust-versions = "fallback"so Cargo can resolve dependencies on older toolchains in CI.Findings (ordered by severity)
Medium: TLS rotation detection is based only on(file length, modified time). A renewed PEM with the same length and an unchanged or coarse-grained mtime will produce the same cache key, so the cachedChannelkeeps using the old TLS identity indefinitely. This undercuts the main goal of the patch, and the added test only covers size-changing rewrites, so the blind spot is untested. Refs:src/common/security.rs:83-92,src/common/security.rs:161-169,src/common/security.rs:205-241.It's enough use file length and the file mtime as fingerprint.
Costs and Negative Impacts
src/request/plan.rs:166-189is directionally correct for stale store mapping, but the TLS cache-key design introduces new failure cases where cert-file metadata availability, rather than connection health, decides whether a request can proceed.RetryableMultiRegionnow retries anymap_region_to_store()/apply_store()error instead of onlyLeaderNotFound. For in-tree request types that mostly means more aggressive recovery, but it is still a behavioral change for any downstreamShardableimplementation that usedapply_store()errors as terminal failures. Refs:src/request/plan.rs:168-189.Option<u64>.Engineering Rules Check
AGENTS.mdwas found under/home/dev/workspace/client-rust, so there were no repository-specific engineering rules to validate beyond the code itself.Questions and Assumptions
origin/HEAD...HEADbecause no explicit diff or PR metadata was provided.bd2c601,5e975b0, and29f4ff1.Suggested Tests / Validation
connection_cache_key()temporarily returns an error after a client has already been cached; verify whether requests should reuse the cached client or fail closed.kv_client()cache hits under TLS to measure the cost of the per-request metadata reads.cargo test --libpassed locally on May 12, 2026.Summary by CodeRabbit
Refactor
Tests
Chores