At the moment, a distributed Restate cluster can choke up when trying to replicate large `Store` messages. Large `Store` messages can appear if services create large journal entries (e.g. when setting large state values) or when multiple messages are batched into a single `Store` message. Right now we have a hard upper limit of 32 MB for `Store` messages. If the system tries to replicate a message or batch of messages that is larger than 32 MB, the `SequencerAppender` will indefinitely retry an append that can never succeed, because sending the message will always fail due to the size limit. When this happens, one will eventually see the following in the logs:
```
2026-01-05T09:27:25.050405Z WARN restate_bifrost::providers::replicated_loglet::sequencer::appender: Append wave failed, retrying with a new wave after 6.483829236s. Status is [N1(ERROR(7)), N2(ERROR(7)), N3(COMMITTED)] wave=7 loglet_id=0_1 first_offset=53 to_offset=56 length=4 otel.name="replicated_loglet::sequencer::appender: run"
2026-01-05T09:27:31.695533Z WARN restate_bifrost::providers::replicated_loglet::sequencer::appender: Append wave failed, retrying with a new wave after 5.451028884s. Status is [N1(ERROR(8)), N2(ERROR(8)), N3(COMMITTED)] wave=8 loglet_id=0_1 first_offset=53 to_offset=56 length=4 otel.name="replicated_loglet::sequencer::appender: run"
2026-01-05T09:27:37.327455Z WARN restate_bifrost::providers::replicated_loglet::sequencer::appender: Append wave failed, retrying with a new wave after 5.934103456s. Status is [N1(ERROR(9)), N2(ERROR(9)), N3(COMMITTED)] wave=9 loglet_id=0_1 first_offset=53 to_offset=56 length=4 otel.name="replicated_loglet::sequencer::appender: run"
2026-01-05T09:27:43.457925Z WARN restate_bifrost::providers::replicated_loglet::sequencer::appender: Append wave failed, retrying with a new wave after 5.069349886s. Status is [N1(ERROR(10)), N2(ERROR(10)), N3(COMMITTED)] wave=10 loglet_id=0_1 first_offset=53 to_offset=56 length=4 otel.name="replicated_loglet::sequencer::appender: run"
2026-01-05T09:27:48.702249Z WARN restate_bifrost::providers::replicated_loglet::sequencer::appender: Append wave failed, retrying with a new wave after 5.765945394s. Status is [N1(ERROR(11)), N2(ERROR(11)), N3(COMMITTED)] wave=11 loglet_id=0_1 first_offset=53 to_offset=56 length=4 otel.name="replicated_loglet::sequencer::appender: run"
2026-01-05T09:27:54.613599Z WARN restate_bifrost::providers::replicated_loglet::sequencer::appender: Append wave failed, retrying with a new wave after 5.442828005s. Status is [N1(ERROR(12)), N2(ERROR(12)), N3(COMMITTED)] wave=12 loglet_id=0_1 first_offset=53 to_offset=56 length=4 otel.name="replicated_loglet::sequencer::appender: run"
```
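To make the failure mode concrete, here is a minimal Rust sketch of the problematic retry behavior. All names in it (`SendError`, `send_store_message`, `append_with_retries`) are hypothetical and not the actual `SequencerAppender` code; the point is that a permanent size-limit rejection is retried like a transient failure and therefore loops forever:

```rust
use std::time::Duration;

/// Hypothetical error type standing in for the real networking error.
#[derive(Debug)]
#[allow(dead_code)]
enum SendError {
    /// Permanent: the encoded message exceeds the hard limit and will never fit.
    MessageTooLarge { size: usize, limit: usize },
    /// Transient: e.g. a dropped connection; retrying can succeed.
    ConnectionLost,
}

/// The hard upper limit mentioned above: 32 MB.
const MAX_MESSAGE_SIZE: usize = 32 * 1024 * 1024;

fn send_store_message(payload: &[u8]) -> Result<(), SendError> {
    if payload.len() > MAX_MESSAGE_SIZE {
        return Err(SendError::MessageTooLarge {
            size: payload.len(),
            limit: MAX_MESSAGE_SIZE,
        });
    }
    // ... actual network send elided ...
    Ok(())
}

/// The problematic pattern: every error, permanent or not, triggers a new
/// wave, so an over-sized payload is retried forever.
fn append_with_retries(payload: &[u8]) {
    let mut wave = 0u32;
    loop {
        wave += 1;
        match send_store_message(payload) {
            Ok(()) => return,
            Err(err) => {
                eprintln!("Append wave failed, retrying with a new wave. wave={wave} err={err:?}");
                std::thread::sleep(Duration::from_secs(5));
            }
        }
    }
}

fn main() {
    // A payload within the limit commits on the first wave; anything above
    // 32 MB would spin in append_with_retries indefinitely.
    append_with_retries(&[0u8; 1024]);
    println!("small append committed");
}
```

With an over-sized payload, `append_with_retries` never returns, which matches the ever-increasing `wave` counter in the log excerpt above.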
Note that replication to the co-located log server will succeed because it uses an in-memory communication channel that does not impose any size restrictions.
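Restate's actual transport is more involved, but the asymmetry can be sketched in a few lines: an in-memory channel hands the payload over as a reference-counted handle, while a framed network codec enforces a maximum size on encode. The sketch below uses tokio's `LengthDelimitedCodec` with a 32 MB cap purely as a stand-in for the networked path, not as Restate's real wire protocol:

```rust
// Cargo deps (sketch): tokio = { features = ["full"] }, tokio-util = { features = ["codec"] }, futures, bytes
use bytes::Bytes;
use futures::SinkExt;
use tokio_util::codec::{FramedWrite, LengthDelimitedCodec};

#[tokio::main]
async fn main() {
    let payload = Bytes::from(vec![0u8; 40 * 1024 * 1024]); // 40 MB, above the 32 MB cap

    // In-memory path: the channel moves the Bytes handle; no size check applies.
    let (tx, mut rx) = tokio::sync::mpsc::channel::<Bytes>(1);
    tx.send(payload.clone()).await.unwrap();
    assert_eq!(rx.recv().await.unwrap().len(), payload.len());
    println!("in-memory send succeeded");

    // Networked path (stand-in): the codec enforces a maximum frame length on
    // encode, so the same payload is rejected before it ever hits the wire.
    let codec = LengthDelimitedCodec::builder()
        .max_frame_length(32 * 1024 * 1024)
        .new_codec();
    let mut framed = FramedWrite::new(tokio::io::sink(), codec);
    let err = framed.send(payload).await.unwrap_err();
    println!("network send rejected: {err}");
}
```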
A `SequencerAppender` in this state prevents any other messages from being appended to the log and thereby eventually brings the whole system down.
A helper service to reproduce the problem can be found here: https://github.com/tillrohrmann/large-state-service. Note that the problem does not occur with single-node clusters because the in-memory connection does not impose any size limits.