Fix superblock IO error handling and add footer validation#866
Fix superblock IO error handling and add footer validation#866xiaoxichen wants to merge 12 commits into
Conversation
b39d73b to
f965093
Compare
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #866 +/- ##
==========================================
- Coverage 56.51% 48.24% -8.27%
==========================================
Files 108 110 +2
Lines 10300 12906 +2606
Branches 1402 6202 +4800
==========================================
+ Hits 5821 6227 +406
+ Misses 3894 2577 -1317
- Partials 585 4102 +3517 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
9849896 to
0a4ef2e
Compare
JacksonYao287
left a comment
There was a problem hiding this comment.
I am curious where the pdev superblk is found corrupted. production?
| if (err) { | ||
| hs_utils::iobuf_free(buf, sisl::buftag::superblk); | ||
| HS_LOG(ERROR, device, "IO error reading first block from device={}, error={}", devname, err.message()); | ||
| throw std::system_error(err, "Failed to read first block from device"); |
There was a problem hiding this comment.
if err happens when writing, we use assert false . why throws exception here when reading? is there any consideration?
There was a problem hiding this comment.
no special consideration. They are roughly the same, stopping the application from running
We hit one issue in QLC that the read superblk returns EIO, the original code fails silently and pretend it is a new drive, sending registration to CM. |
|
let other parts looks good. |
b0d2d76 to
0d33eb5
Compare
…ime-based rate limiting This change enhances LOG_EVERY_N macros to fix memory leaks and add new capabilities: - Refactor check_logged_already() to check_and_format_log() which consolidates rate-limiting decision and suffix formatting into single function - Use hash-based keys instead of full message strings for memory efficiency - Clear entire map every 300s to prevent unbounded memory growth - Store millisecond offsets and counts for time-based rate limiting - Log first occurrence (count==1) in addition to every Nth occurrence - Add HS_LOG_EVERY_N_SEC for time-based rate limiting (log every N seconds) - Add HS_LOG_EVERY_N_OR_SEC for hybrid rate limiting (every N occurrences OR every M seconds) - Use consistent suffix format: "Last logged X.Xs ago, N occurrences" Memory impact: Reduces per-entry storage from ~124 bytes to ~20 bytes (6x improvement) Resolves: eBay#880
Use milliseconds directly in format string instead of converting to seconds, avoiding rvalue binding issue with fmt::make_format_args. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix count tracking: reset to 0 instead of 1 to avoid off-by-one in rate limiting (was logging at 10, 19, 28... instead of 10, 20, 30...) - Use happened flag to detect first occurrence instead of count==1 - Use swap with empty map to actually release memory during cleanup (clear() only removes entries but retains bucket array capacity) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add #include <functional> for std::hash to avoid relying on transitive includes - Handle edge case when both freq=0 and interval_sec=0: always log (fallback behavior) instead of only logging first occurrence then suppressing all subsequent logs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add function comment documenting the 300s cleanup limitation - Log one-time warning when interval_sec exceeds cleanup period - Clarifies expected behavior when using large interval_sec values Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Here is the issue description:
journal–table metadata mismatch due to CP vs destroy ordering
A split hits crash flip and marks its parent buffer with m_crash_flag_on during transact_bufs (src/lib/index/wb_cache.cpp:237-247).
The same logical window removes the table: index_table::destroy() immediately removes its superblock from meta via MetaBlkService::remove_sub_sb (src/include/homestore/index/index_table.hpp:135-147 →
src/lib/meta/meta_blk_service.cpp:872+).
CP flush later starts and writes the txn_journal to meta first, then begins flushing dirty buffers; when the flagged parent buffer is reached, it crashes (src/lib/index/wb_cache.cpp:860-871, 896-903).
On restart, recovery replays the persisted txn_journal and attempts to repair the table by ordinal, but the table superblock is gone and the table isn’t loaded → HS_DBG_ASSERT in repair_index_node (src/lib/index/index_service.cpp:205-212).
Key ordering rules that cause the mismatch
- Table destroy persistence is immediate at destroy(): meta superblock is removed synchronously (not tied to CP).
- Index CP flush ordering is fixed: (1) persist txn_journal; (2) flush dirty buffers; crash can occur at (2).
- Thus it’s possible to have a persisted journal entry for a table whose superblock was already removed.
Co-authored-by: yuwmao <yuwmao@ebaychina.com>
…ondition between upper layer reset and self operation on it (cp_flush and free) (eBay#886) Fix eBay/HomeObject#401 Co-authored-by: yawzhang <yawzhang@ebay.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
This change improves superblock error handling and integrity checking: - Fail fast on IO errors during superblock read instead of treating them as fresh/unformatted disks - Add footer superblock validation for HDD devices to detect corruption - Check header and footer write errors independently to prevent silent failures - Add comprehensive unit tests covering all error scenarios Changes: - read_first_block(): Now throws std::system_error on IO errors instead of returning garbage data - write_super_block(): Separately validates header and footer writes with independent error checking - sanity_check(): New method that validates footer consistency on HDD devices by comparing header and footer superblocks using full memcmp - Added 6 unit tests covering IO errors, corruption detection, and footer validation scenarios Use release assert instead of exceptions for superblock IO errors Per review feedback, replace exception throws with HS_REL_ASSERT for all superblock IO errors to ensure immediate crash on failure: - read_first_block(): Use HS_REL_ASSERT instead of throwing std::system_error - sanity_check(): Use HS_REL_ASSERT for header/footer read errors and mismatch - Update tests from ASSERT_THROW to ASSERT_DEATH to verify crash behavior
0d33eb5 to
ba36650
Compare
This change improves superblock error handling and integrity checking:
Changes: