storageRadius decrease logic too restrictive on active networks #5396

@crtahlin

Description
Summary

The reserve worker only decreases storageRadius when SyncRate() == 0, which is unrealistic on active networks where pullsync never fully stops due to continuous uploads. Combined with the 15-minute check interval and single-step decrements, this causes nodes to report a stale (too high) storageRadius even when their reserve already contains enough chunks to justify a lower radius.

Code Reference

pkg/storer/reserve.go, reserveWorker():

case <-thresholdTicker.C:
    radius := db.reserve.Radius()
    count, err := db.countWithinRadius(ctx)
    // ...
    // The radius only decreases when all three hold: the reserve is below the
    // occupancy threshold, pullsync is completely idle, and the radius is
    // still above the configured minimum.
    if count < threshold(db.reserve.Capacity()) && db.syncer.SyncRate() == 0 && radius > db.reserveOptions.minimumRadius {
        radius--
        // ...
    }

Three constraints compound the problem:

  1. SyncRate() == 0 gate — On an active network, pullsync never fully stops. New data is continuously being uploaded, so there's always some sync activity. This condition blocks radius adjustment indefinitely.

  2. 15-minute ticker (reserveWakeUpDuration = 15 * time.Minute) — Even when conditions are met, the radius can only decrease by 1 every 15 minutes.

  3. Single-step decrement (radius--) — If the radius needs to drop by several levels (e.g., after a restart with batchstore reset), adaptation takes N * 15 minutes.

Observable Impact

After a node restart with batchstore resync, the radius starts high and needs to decrease to match the node's actual reserve capacity. Example:

  • Node has ~115M chunks in reserve (near full capacity for its configuration)
  • storageRadius is stuck at 5 instead of the correct value of 4
  • pullsyncRate is ~8 chunks/sec (normal background sync activity)
  • Radius never decreases because SyncRate() != 0

This causes:

  • Incorrect committedDepth reported to peers via the status protocol, which feeds into salud's network radius consensus
  • Node marked unhealthy by salud due to committedDepth mismatch (self_radius vs network_radius)
  • Potential redistribution game issues — node plays with wrong radius, risking incorrect samples

Nodes With reserve-capacity-doubling Are More Affected

Doubled nodes cover more neighborhoods and receive proportionally more chunk offers via pullsync. Their SyncRate() is structurally higher than non-doubled nodes, making the == 0 condition even harder to satisfy.

Suggestion

The radius adjustment could be based on the relationship between reserveSize and capacity rather than requiring zero sync activity. For example:

  • Remove or relax the SyncRate() == 0 gate — perhaps use a threshold relative to the node's capacity or check that sync rate is stable rather than zero
  • Allow multi-step radius adjustment — if the reserve size clearly justifies a lower radius, skip intermediate steps
  • Reduce the check interval — 15 minutes is very coarse for a value that affects network consensus

The radius increase path (in unreserve()) already operates without a sync rate gate, so there's precedent for the radius responding to actual reserve state rather than sync activity.
