acud reviewed Mar 30, 2026
acud reviewed Apr 1, 2026
default:
}

buf := make([]byte, item.Location.Length)
Contributor
There's no need to allocate a new buffer for every chunk: you can preallocate once outside the loop with some extra size (there's a specific const we use for that, I think some kind of SOC-with-header size const).
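The suggested pattern can be sketched as follows. This is an illustrative snippet, not the PR's code: the `location` type and the `maxChunkWithHeaderSize` const are hypothetical stand-ins (the reviewer hints at a real SOC-with-header size const in the codebase, whose exact name is not given here).

```go
package main

import "fmt"

// location is a hypothetical stand-in for the real index entry's location.
type location struct{ Length uint16 }

// Assumption: an upper bound on chunk size plus header; illustrative only.
const maxChunkWithHeaderSize = 4096 + 8

// readLengths preallocates one buffer outside the loop and reslices it per
// chunk, instead of calling make([]byte, item.Location.Length) every turn.
func readLengths(locs []location) []int {
	buf := make([]byte, maxChunkWithHeaderSize) // one allocation for the whole loop
	out := make([]int, 0, len(locs))
	for _, loc := range locs {
		b := buf[:loc.Length] // reslice, no new allocation
		out = append(out, len(b))
	}
	return out
}

func main() {
	fmt.Println(readLengths([]location{{Length: 100}, {Length: 4104}})) // [100 4104]
}
```

Reslicing is safe here as long as each iteration finishes with the data before the next read overwrites the shared buffer.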
// the index store so the node starts clean without serving invalid data.
// If a corrupted index entry cannot be deleted, an error is returned and the
// node startup is aborted to prevent serving or operating on corrupt state.
func validateAndAddLocations(ctx context.Context, store storage.Store, sharkyRecover *sharky.Recovery, logger log.Logger) error {
Contributor
How long does this take compared to the previous version, which did not validate? It seems the recovery will not take significantly more resources and time, but it would be good to have an idea about this.
Checklist
Description
Fixes #4737
Problem
When a node crashes or loses power while actively storing chunks (e.g. during a file upload via pushsync), two corruption scenarios can occur:
Scenario 1 — slots bitmap wiped on crash.
`slots.save()` previously called `Truncate(0)` before seeking back to position 0 and rewriting the bitmap. If the node crashed between the truncate and the rewrite, the slots file was left empty. On the next startup, Sharky would treat all slots as free and begin overwriting chunks that were still referenced by the LevelDB index.

Scenario 2 — shard data lost from OS page cache.
Sharky writes chunk data to a shard file via the OS page cache. LevelDB, which has its own write-ahead log, commits the chunk index entry durably. If the node crashes before the OS flushes the page cache to disk, the LevelDB index points to Sharky slots that contain stale or zeroed data. The node then serves corrupted chunks.
This was reproducible specifically when nodes were actively uploading under load, consistent with the reports in the issue. Idle nodes were unaffected because no new writes were in-flight.
Solution
Fix 1 — remove `Truncate(0)` from `slots.save()`.

Slots only ever grow (the only mutation is `extend`), so `sl.data` is always >= the previous file size. Seeking to 0 and overwriting is always safe — no stale tail bytes can survive. The truncate before the write was unnecessary and introduced a crash window. Removing it closes that window with zero performance impact.

Fix 2 — validate and prune corrupted chunks on recovery.

The existing `.DIRTY` file mechanism already detects unclean shutdowns. On recovery, instead of simply rebuilding the slots bitmap from the index, each chunk's data is now read from Sharky and its content hash is validated (CAC and SOC). Valid chunks are registered in the recovery bitmap as before. Corrupted entries — those whose data is unreadable or whose hash does not match the indexed address — are removed from the LevelDB index so the node starts clean and does not serve invalid data.

If a corrupted index entry cannot be deleted, an error is returned and node startup is aborted to prevent operating on corrupt state.
Open API Spec Version Changes (if applicable)
Motivation and Context (Optional)
Related Issue (Optional)
Screenshots (if appropriate):
AI Disclosure