MDEV-34358 Encryption threads consume CPU when no work available#5040
MDEV-34358 Encryption threads consume CPU when no work available#5040Thirunarayanan wants to merge 4 commits into10.6from
Conversation
Problem: -------- 1. Encryption threads busy-wait when no work is available. When reaching fil_system.space_list.end(), fil_crypt_return_iops() is called with wake=true, causing pthread_cond_broadcast() to wake all threads unnecessarily, leading to CPU waste. 2. Tablespaces with CLOSING/STOPPING flags (set during DDL operations) are skipped during iteration. Since DDL completion doesn't wake encryption threads, these spaces may never be encrypted if threads sleep indefinitely. 3. For default_encrypt_list iteration, when spaces exist but none are acquirable, threads need to wake others for cooperative retry, but this case was not distinguished from fil_system.space_list.end(). 4. IOPS are allocated before searching for tablespaces, wasting resources during iteration when no I/O occurs. Solution: --------- 1. Implement timed wait with exponential backoff (5s -> 10s -> 20s -> 40s -> 60s, max 5 attempts) for fil_system.space_list iteration. After ~135 seconds, switch to indefinite wait. This periodically rechecks for spaces that become available after DDL completes. 2. Use indefinite wait for default_encrypt_list iteration since other threads will retry and wake when needed. 3. fil_space_t::next(): Add default_encrypt_list flag to distinguish between the two iteration modes. Wake other threads only when this flag is true (spaces exist but unacquirable). 4. Move IOPS allocation from before tablespace search to after finding a space that needs rotation. 5. Handle wake logic explicitly at call site based on default_encrypt_list flag. rotate_thread_t changes ------------------------ - default_encrypt_list (bool): Indicates if default_encrypt_list has unacquirable spaces - timed_wait_count (uint8_t): Counts consecutive timeouts for exponential backoff - sleep_timeout_ms (uint16_t): Current timeout in ms (5s -> 60s max) - wait_for_work(): Implements timed/indefinite wait based on iteration mode - increase_sleep_timeout(): Doubles timeout up to 60s - reset_sleep_timeout(): Resets timeout to 5s and clears count
…it behavior Added debug-only status variables Innodb_encryption_timed_waits and Innodb_encryption_indefinite_waits to track encryption thread wait types, enabling verification that threads use timed wait with exponential backoff instead of busy-waiting when idle. The counters are incremented in rotate_thread_t::wait_for_work() and exposed via SHOW STATUS in debug builds only. Also added a debug sync point 'rotate_only_2_timed_waits' to reduce the timed wait threshold from 5 to 2 for faster testing of the indefinite wait transition.
|
|
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request implements an exponential backoff mechanism for InnoDB encryption threads to reduce CPU usage when tablespaces are temporarily unavailable, such as during DDL operations. It introduces a versioning system for encryption settings to trigger iteration restarts and adds debug status variables to track wait behavior. The review feedback highlights several improvement opportunities: addressing dead code wrapped in #if 0, correcting the timeout threshold logic to match the intended behavior described in comments, removing redundant counter resets, and increasing the bit-width of the settings version counter to prevent potential overflow.
| Incremented each time innodb_encrypt_tables or | ||
| innodb_encryption_rotate_key_age is modified to signal | ||
| encryption threads to restart iteration */ | ||
| static Atomic_counter<uint16_t> fil_crypt_settings_version; |
There was a problem hiding this comment.
Using uint16_t for a version counter is risky as it can wrap around relatively quickly (65535 changes). While unlikely to cause an issue in typical operation, using uint32_t or uint64_t is safer and generally has the same performance characteristics for atomic operations on modern architectures.
static Atomic_counter<uint32_t> fil_crypt_settings_version;There was a problem hiding this comment.
If user changes innodb_encrypt_tables variable every hour then this variable may ran out in 7.5 years. If we restart the server in the mean time then this variable becomes 0. This datatype is enough for practical usage. I don't think sane people is going to toggle encryption for every hour.
dr-m
left a comment
There was a problem hiding this comment.
I think that we need to consider two paths forward. I see that this pull request is currently targeting the 10.6 branch, to which only minimal changes should be done, given that the branch will soon reach its end of life.
There is also the hang MDEV-37946, which I expect we should be hitting when testing this extensively enough. I think that the hang needs to be fixed as part of a minimal fix; it should be fairly easy.
For newer branches, I would like to see a more comprehensive solution, which I am outlining below.
I’d like to see some testing of this subsystem. I have some doubts of the efficacy of the current multi-threaded implementation. We currently have multiple threads that can increase the size of buf_pool.flush_list, It could be the case that multiple threads are concurrently calling buf_flush_list_space(), which could repeatedly invalidate buf_pool.flush_hp.
As a starting point, I think that we have to test if there is any benefit of having multiple "encryption threads" which perform dummy writes to data pages. The name "encryption thread" is misleading, because the actual encryption will take place in the single buf_flush_page_cleaner() thread, in the function buf_page_encrypt(). Yes, this CPU intensive operation will be executed in a single thread!
How can it possibly be helpful to have multiple "producers" (fil_crypt_thread) that are dirtying the buffer pool, when we have a single "consumer" (buf_flush_page_cleaner) that is doing some very CPU intensive work (buf_page_encrypt() and computing CRC-32C)?
Could we implement tighter coupling between the current "encryption" (page-dirtying) and the buf_flush_page_cleaner()? That thread is already regularly scanning potentially the entire buffer pool. So, why not implement an "encryption step" that would process a number of pages that were missed by buf_flush_list_holding_mutex() and buf_flush_LRU()? Such a design could trivially follow the innodb_io_capacity budget and scale back the key rotation activity when there are bursts of application workload (some threads are blocked in buf_flush_wait() or buf_LRU_get_free_block()).
When a clean page that we wish to re-encrypt is in the buffer pool, the buf_flush_page_cleaner() thread can simply write a dummy log record and move the page to the buf_pool.flush_list.
When a page will need to be read into the buffer pool, the current implementation is invoking synchronous buf_page_read() via buf_page_get_gen(). It would be much better to submit asynchronous reads, implementing the logic with a future implementation of MDEV-11378 in mind. The read completion callback would have to detect that this is a special read for encryption, and it would write the dummy record so that the page will be re-encrypted.
The relevant parts of fil_space_rotate_state_t and key_state_t may have to be made part of fil_space_t or fil_space_crypt_t.
A final part of this would be to repurpose the parameter innodb_encrypt_threads to refer to worker tasks that the buf_flush_page_cleaner() would employ to encrypt pages or compute CRC-32C. The function buf_page_t::flush() would be split. Only initial part would be invoked by the buf_flush_page_cleaner() thread; the rest would be invoked by worker tasks, which would process a number of buf_page_t* from their queue.
Problem: ======= When innodb_encrypt_tables or innodb_encryption_rotate_key_age is changed during encryption thread iteration, threads continue with stale configuration values, potentially missing tablespaces that should be encrypted or rotated under the new settings. Solution: ======== Added atomic version counter fil_crypt_settings_version that is incremented whenever innodb_encrypt_tables or innodb_encryption_rotate_key_age changes. Encryption threads capture the version at iteration start and check for changes during iteration. If config changed, threads immediately restart iteration from the beginning to ensure complete coverage with new settings. fil_crypt_settings_version: Atomic counter to track the innodb_encrypt_tables or innodb_encryption_rotate_key_age changes rotate_thread_t::settings_version: To compare the existing fil_crypt_settings_version to restart the encryption from the beginning
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces an exponential backoff mechanism for InnoDB encryption threads to optimize CPU usage and a versioning system to manage configuration changes, including updates to status variable tests. The review feedback identifies several improvements: using 32-bit integers for timeout and version counters to prevent overflow and wrap-around, correcting the wait logic for the default encryption list, resetting timeout states when work is found, and addressing a busy-wait loop in IOPS allocation.
| Used only when iterating fil_system.space_list to periodically recheck for | ||
| spaces that become available after DDL operations complete. Reset to 5000ms | ||
| when woken by signal or when work is found. */ | ||
| uint16_t sleep_timeout_ms= 5000; |
There was a problem hiding this comment.
The variable sleep_timeout_ms should be changed to uint32_t to avoid potential overflow issues during calculations in increase_sleep_timeout(). When sleep_timeout_ms is 40,000, sleep_timeout_ms * 2 results in 80,000, which exceeds the uint16_t range and causes incorrect behavior when passed to std::min<uint16_t>.
uint32_t sleep_timeout_ms= 5000;| timed wait expires without being woken by a signal, indicating no work | ||
| became available during the timeout period. */ | ||
| void increase_sleep_timeout() { | ||
| sleep_timeout_ms = std::min<uint16_t>(sleep_timeout_ms * 2, 60000); |
There was a problem hiding this comment.
| Incremented each time innodb_encrypt_tables or | ||
| innodb_encryption_rotate_key_age is modified to signal | ||
| encryption threads to restart iteration */ | ||
| static Atomic_counter<uint16_t> fil_crypt_settings_version; |
There was a problem hiding this comment.
Using uint16_t for a version counter is risky in a long-running database server as it can wrap around relatively quickly (65,535 changes). While the logic only checks for inequality, using uint32_t is safer and more standard for versioning to avoid any collision during a single iteration.
static Atomic_counter<uint32_t> fil_crypt_settings_version;| /** Config version when thread started current iteration. | ||
| Used to detect innodb_encrypt_tables changes during iteration | ||
| and restart from beginning to ensure complete encryption coverage. */ | ||
| uint16_t settings_version; |
| return; | ||
| } | ||
|
|
||
| if (space == fil_system.space_list.end()) { |
There was a problem hiding this comment.
There is a discrepancy between the code and the intended design described in the PR and the comments (lines 1131-1133). The code currently performs a timed wait when space == .end(), regardless of whether it's iterating the default_encrypt_list. To follow the stated design of using an indefinite wait for the default list (where other threads are expected to wake this one), the condition should check the default_encrypt_list flag.
if (space == fil_system.space_list.end() && !default_encrypt_list) {There was a problem hiding this comment.
Are you kidding me, right?
Please check the callers for wait_for_work()
if (thr.default_encrypt_list) {
pthread_cond_broadcast(
&fil_crypt_threads_cond);
}
goto wait_for_work;
}
| goto wait_for_work; | ||
| } | ||
|
|
||
| /* we found a space to rotate */ |
There was a problem hiding this comment.
When a tablespace is successfully found for rotation, the sleep timeout and timed wait count should be reset. This ensures that the exponential backoff starts fresh the next time the thread becomes idle, rather than potentially jumping straight to an indefinite wait due to stale state from previous idle periods.
/* we found a space to rotate */
thr.reset_sleep_timeout();| do { | ||
| if (thr.should_shutdown()) { | ||
| thr.space->release(); | ||
| thr.space = fil_system.space_list.end(); | ||
| goto func_exit; | ||
| } | ||
| } while (!fil_crypt_alloc_iops(&thr)); | ||
|
|
There was a problem hiding this comment.
This loop for IOPS allocation is a busy-wait that consumes 100% CPU if fil_crypt_alloc_iops returns false (e.g., when the global encryption IOPS limit is reached). While this pattern existed in the previous version, it remains a performance bottleneck. Consider adding a short sleep or yielding the thread if IOPS are unavailable.
Problem:
space that needs rotation.