Skip to content

Fix superblock IO error handling and add footer validation#866

Closed
xiaoxichen wants to merge 12 commits into
eBay:masterfrom
xiaoxichen:fix-superblock-io-error
Closed

Fix superblock IO error handling and add footer validation#866
xiaoxichen wants to merge 12 commits into
eBay:masterfrom
xiaoxichen:fix-superblock-io-error

Conversation

@xiaoxichen
Copy link
Copy Markdown
Collaborator

This change improves superblock error handling and integrity checking:

  • Fail fast on IO errors during superblock read instead of treating them as fresh/unformatted disks
  • Add footer superblock validation for HDD devices to detect corruption
  • Check header and footer write errors independently to prevent silent failures
  • Add comprehensive unit tests covering all error scenarios

Changes:

  • read_first_block(): Now throws std::system_error on IO errors instead of returning garbage data
  • write_super_block(): Separately validates header and footer writes with independent error checking
  • sanity_check(): New method that validates footer consistency on HDD devices by comparing header and footer superblocks using full memcmp
  • Added 6 unit tests covering IO errors, corruption detection, and footer validation scenarios

@xiaoxichen xiaoxichen force-pushed the fix-superblock-io-error branch from b39d73b to f965093 Compare February 24, 2026 17:03
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Feb 24, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 40.54054% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 48.24%. Comparing base (1a0cef8) to head (0a4ef2e).
⚠️ Report is 318 commits behind head on master.

Files with missing lines Patch % Lines
src/lib/device/physical_dev.cpp 40.54% 2 Missing and 20 partials ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #866      +/-   ##
==========================================
- Coverage   56.51%   48.24%   -8.27%     
==========================================
  Files         108      110       +2     
  Lines       10300    12906    +2606     
  Branches     1402     6202    +4800     
==========================================
+ Hits         5821     6227     +406     
+ Misses       3894     2577    -1317     
- Partials      585     4102    +3517     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@xiaoxichen xiaoxichen force-pushed the fix-superblock-io-error branch 2 times, most recently from 9849896 to 0a4ef2e Compare February 25, 2026 02:03
Copy link
Copy Markdown
Contributor

@JacksonYao287 JacksonYao287 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious where the pdev superblk is found corrupted. production?

Comment thread src/lib/device/physical_dev.cpp Outdated
if (err) {
hs_utils::iobuf_free(buf, sisl::buftag::superblk);
HS_LOG(ERROR, device, "IO error reading first block from device={}, error={}", devname, err.message());
throw std::system_error(err, "Failed to read first block from device");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if err happens when writing, we use assert false . why throws exception here when reading? is there any consideration?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no special consideration. They are roughly the same, stopping the application from running

@xiaoxichen
Copy link
Copy Markdown
Collaborator Author

I am curious where the pdev superblk is found corrupted. production?

We hit one issue in QLC that the read superblk returns EIO, the original code fails silently and pretend it is a new drive, sending registration to CM.

@JacksonYao287
Copy link
Copy Markdown
Contributor

lets use release assert and print the error log for all case here, since theoretically exception can be catched . wed better let it crash immediately if any io error of superblk read or write happens.

other parts looks good.

@xiaoxichen xiaoxichen force-pushed the fix-superblock-io-error branch 2 times, most recently from b0d2d76 to 0d33eb5 Compare March 11, 2026 09:01
szmyd and others added 12 commits April 24, 2026 07:22
…ime-based rate limiting

This change enhances LOG_EVERY_N macros to fix memory leaks and add new capabilities:

- Refactor check_logged_already() to check_and_format_log() which consolidates
  rate-limiting decision and suffix formatting into single function
- Use hash-based keys instead of full message strings for memory efficiency
- Clear entire map every 300s to prevent unbounded memory growth
- Store millisecond offsets and counts for time-based rate limiting
- Log first occurrence (count==1) in addition to every Nth occurrence
- Add HS_LOG_EVERY_N_SEC for time-based rate limiting (log every N seconds)
- Add HS_LOG_EVERY_N_OR_SEC for hybrid rate limiting (every N occurrences OR every M seconds)
- Use consistent suffix format: "Last logged X.Xs ago, N occurrences"

Memory impact: Reduces per-entry storage from ~124 bytes to ~20 bytes (6x improvement)

Resolves: eBay#880
Use milliseconds directly in format string instead of converting to seconds,
avoiding rvalue binding issue with fmt::make_format_args.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix count tracking: reset to 0 instead of 1 to avoid off-by-one in rate limiting
  (was logging at 10, 19, 28... instead of 10, 20, 30...)
- Use happened flag to detect first occurrence instead of count==1
- Use swap with empty map to actually release memory during cleanup
  (clear() only removes entries but retains bucket array capacity)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add #include <functional> for std::hash to avoid relying on transitive includes
- Handle edge case when both freq=0 and interval_sec=0: always log (fallback behavior)
  instead of only logging first occurrence then suppressing all subsequent logs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add function comment documenting the 300s cleanup limitation
- Log one-time warning when interval_sec exceeds cleanup period
- Clarifies expected behavior when using large interval_sec values

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Here is the issue description:
journal–table metadata mismatch due to CP vs destroy ordering

A split hits crash flip and marks its parent buffer with m_crash_flag_on during transact_bufs (src/lib/index/wb_cache.cpp:237-247).
The same logical window removes the table: index_table::destroy() immediately removes its superblock from meta via MetaBlkService::remove_sub_sb (src/include/homestore/index/index_table.hpp:135-147 →
  src/lib/meta/meta_blk_service.cpp:872+).
CP flush later starts and writes the txn_journal to meta first, then begins flushing dirty buffers; when the flagged parent buffer is reached, it crashes (src/lib/index/wb_cache.cpp:860-871, 896-903).
On restart, recovery replays the persisted txn_journal and attempts to repair the table by ordinal, but the table superblock is gone and the table isn’t loaded → HS_DBG_ASSERT in repair_index_node (src/lib/index/index_service.cpp:205-212).
Key ordering rules that cause the mismatch
    - Table destroy persistence is immediate at destroy(): meta superblock is removed synchronously (not tied to CP).
    - Index CP flush ordering is fixed: (1) persist txn_journal; (2) flush dirty buffers; crash can occur at (2).
    - Thus it’s possible to have a persisted journal entry for a table whose superblock was already removed.

Co-authored-by: yuwmao <yuwmao@ebaychina.com>
…ondition between upper layer reset and self operation on it (cp_flush and free) (eBay#886)

Fix eBay/HomeObject#401

Co-authored-by: yawzhang <yawzhang@ebay.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Xiaoxi Chen <xiaoxchen@ebay.com>
This change improves superblock error handling and integrity checking:

- Fail fast on IO errors during superblock read instead of treating them as
  fresh/unformatted disks
- Add footer superblock validation for HDD devices to detect corruption
- Check header and footer write errors independently to prevent silent failures
- Add comprehensive unit tests covering all error scenarios

Changes:
- read_first_block(): Now throws std::system_error on IO errors instead of
  returning garbage data
- write_super_block(): Separately validates header and footer writes with
  independent error checking
- sanity_check(): New method that validates footer consistency on HDD devices
  by comparing header and footer superblocks using full memcmp
- Added 6 unit tests covering IO errors, corruption detection, and footer
  validation scenarios

Use release assert instead of exceptions for superblock IO errors

Per review feedback, replace exception throws with HS_REL_ASSERT for all
superblock IO errors to ensure immediate crash on failure:

- read_first_block(): Use HS_REL_ASSERT instead of throwing std::system_error
- sanity_check(): Use HS_REL_ASSERT for header/footer read errors and mismatch
- Update tests from ASSERT_THROW to ASSERT_DEATH to verify crash behavior
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants