acud reviewed Mar 30, 2026
acud reviewed Apr 1, 2026
default:
}

buf := make([]byte, item.Location.Length)
Contributor
There's no need to allocate a new buffer for every chunk: you can preallocate once outside the loop with some extra size (there's a specific const we use for that, I think some kind of SOC-with-header size const).
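The suggested pattern can be sketched as follows. This is an illustrative snippet, not the PR's code: the `location` type and the `maxChunkWithHeaderSize` const are hypothetical stand-ins (the reviewer hints at a real SOC-with-header size const in the codebase, whose exact name is not given here).

```go
package main

import "fmt"

// location is a hypothetical stand-in for the real index entry's location.
type location struct{ Length uint16 }

// Assumption: an upper bound on chunk size plus header; illustrative only.
const maxChunkWithHeaderSize = 4096 + 8

// readLengths preallocates one buffer outside the loop and reslices it per
// chunk, instead of calling make([]byte, item.Location.Length) every turn.
func readLengths(locs []location) []int {
	buf := make([]byte, maxChunkWithHeaderSize) // one allocation for the whole loop
	out := make([]int, 0, len(locs))
	for _, loc := range locs {
		b := buf[:loc.Length] // reslice, no new allocation
		out = append(out, len(b))
	}
	return out
}

func main() {
	fmt.Println(readLengths([]location{{Length: 100}, {Length: 4104}})) // [100 4104]
}
```

Reslicing is safe here as long as each iteration finishes with the data before the next read overwrites the shared buffer.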
// the index store so the node starts clean without serving invalid data.
// If a corrupted index entry cannot be deleted, an error is returned and the
// node startup is aborted to prevent serving or operating on corrupt state.
func validateAndAddLocations(ctx context.Context, store storage.Store, sharkyRecover *sharky.Recovery, logger log.Logger) error {
Contributor
How long does this take compared to the previous version, which did not validate? It seems the recovery will not take significantly more resources and time, but it would be good to have an idea about this.
Checklist
Description
Fixes #4737
Problem
When a node crashes or loses power while actively storing chunks (e.g. during a file upload via pushsync), two corruption scenarios can occur:
Scenario 1 — slots bitmap wiped on crash.
`slots.save()` previously called `Truncate(0)` before seeking back to position 0 and rewriting the bitmap. If the node crashed between the truncate and the rewrite, the slots file was left empty. On the next startup, Sharky would treat all slots as free and begin overwriting chunks that were still referenced by the LevelDB index.

Scenario 2 — shard data lost from OS page cache.
Sharky writes chunk data to a shard file via the OS page cache. LevelDB, which has its own write-ahead log, commits the chunk index entry durably. If the node crashes before the OS flushes the page cache to disk, the LevelDB index points to Sharky slots that contain stale or zeroed data. The node then serves corrupted chunks.
This was reproducible specifically when nodes were actively uploading under load, consistent with the reports in the issue. Idle nodes were unaffected because no new writes were in-flight.
Solution
Fix 1 — remove `Truncate(0)` from `slots.save()`.

Slots only ever grow (the only mutation is `extend`), so `sl.data` is always >= the previous file size. Seeking to 0 and overwriting is always safe — no stale tail bytes can survive. The truncate before the write was unnecessary and introduced a crash window. Removing it closes that window with zero performance impact.

Fix 2 — validate and prune corrupted chunks on recovery.

The existing `.DIRTY` file mechanism already detects unclean shutdowns. On recovery, instead of simply rebuilding the slots bitmap from the index, each chunk's data is now read from Sharky and its content hash is validated (CAC and SOC). Valid chunks are registered in the recovery bitmap as before. Corrupted entries — those whose data is unreadable or whose hash does not match the indexed address — are removed from the LevelDB index so the node starts clean and does not serve invalid data.

If a corrupted index entry cannot be deleted, an error is returned and node startup is aborted to prevent operating on corrupt state.
Open API Spec Version Changes (if applicable)
Motivation and Context (Optional)
Related Issue (Optional)
Screenshots (if appropriate):
AI Disclosure