
docs: add WAL vs changeset replication tradeoffs analysis#54

Open
WillPapper wants to merge 3 commits into main from
claude/evaluate-litestream-uQGLX

Conversation


WillPapper commented Dec 17, 2025

Opening this PR to allow for easy discussion @jorgemmsilva @daniilrrr. We can iterate on PR comments (and if we'd like, have Claude Code make PR updates in response to our questions)

Some of this recreates the original logic from the earlier WAL vs changeset debate, but in a more structured way

Summary

Evaluates WAL-based replication (Litestream) as an alternative to changeset-based
replication. Documents why changesets fit SyndDB's validator verification model better.

Key findings:

  • WAL-based: Best for disaster recovery, zero integration, but opaque to validators
  • Changeset-based: Auditable logical operations, cross-architecture determinism
  • Inversion: Changesets can be inverted via sqlite3changeset_invert() for surgical rollback; WAL is forward-only with no undo capability
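For readers unfamiliar with inversion: `sqlite3changeset_invert()` operates on SQLite's binary changeset format, but the transformation it performs is easy to model. A toy sketch of those semantics (plain Python with hypothetical data shapes, not the real C API):

```python
# Toy model of sqlite3changeset_invert() semantics (not the real binary
# format): INSERT <-> DELETE, UPDATE swaps old/new, apply order reverses.

def invert_change(change):
    op = change["op"]
    if op == "INSERT":
        # Undo an INSERT by deleting the inserted row.
        return {"op": "DELETE", "table": change["table"], "old": change["new"]}
    if op == "DELETE":
        # Undo a DELETE by re-inserting the deleted row.
        return {"op": "INSERT", "table": change["table"], "new": change["old"]}
    # Undo an UPDATE by swapping old and new column values.
    return {"op": "UPDATE", "table": change["table"],
            "old": change["new"], "new": change["old"]}

def invert_changeset(changes):
    # To undo, the inverted changes must be applied in reverse order.
    return [invert_change(c) for c in reversed(changes)]

changeset = [
    {"op": "INSERT", "table": "balances", "new": {"id": 1, "amount": 100}},
    {"op": "UPDATE", "table": "balances",
     "old": {"id": 1, "amount": 100}, "new": {"id": 1, "amount": 250}},
]
undo = invert_changeset(changeset)
# undo[0] reverts the UPDATE (amount 250 -> 100); undo[1] deletes the row.
```

A WAL frame, by contrast, is a full page image with no record of the prior page contents, which is why the WAL offers no equivalent undo operation.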

Discussion points

  • Is there value in adding Litestream as a secondary backup mechanism?
  • Pain points with the current Session Extension approach?
  • Should we document Session Extension lifecycle more clearly?
  • Do validators need fine-grained rollback (favors changesets), or is snapshot restore acceptable?

Evaluates Litestream and WAL-based replication as alternatives to
SyndDB's changeset-based approach. Documents why changesets are
preferred for validator verification while acknowledging WAL's
strengths for disaster recovery.
Copilot AI review requested due to automatic review settings December 17, 2025 18:23

Copilot AI left a comment


Pull request overview

This PR adds comprehensive documentation analyzing the tradeoffs between WAL-based and changeset-based SQLite replication approaches for SyndDB. The document explains why SyndDB uses the SQLite Session Extension for changeset-based replication instead of WAL-based tools like Litestream, while acknowledging that WAL-based approaches have strengths for disaster recovery scenarios.

Key points:

  • Compares physical (WAL) vs logical (changeset) replication approaches
  • Evaluates Litestream as a WAL-based alternative
  • Documents five key requirements that make changesets the better choice for SyndDB's validator verification architecture


Restructure as a discussion-focused design doc with:
- Clear options (A/B/C/D) for decision making
- Key decision factors with specific questions
- Effort estimates and risk levels
- 10 open questions for team discussion
- Suggested experiments to validate assumptions
- Tentative recommendation with rationale

daniilrrr commented Dec 17, 2025

L O L I've been reading the docs for SQLite WAL Mode, Session Extension, and Litestream this morning and these are similar to the points I was going to bring up. I'll add my thoughts in the next ~hour


daniilrrr commented Dec 17, 2025

My resources

My thoughts BEFORE reading Claude's docs above:

  1. Litestream is a great find and is a mature (though pre-v1) project that accomplishes something similar to what we want to do with our synddb-client, which relays database deltas to the Storage Layer. We should definitely look at the project for inspiration and steal anything useful.
  2. Litestream uses WAL-based (write-ahead log) replication, which is fundamentally different from our Session Extension-based approach. We need to understand both and the value of each to make a good technical assessment.
  3. Litestream states it is designed for disaster recovery to copy DB content.
    1. I am concerned that it has made some assumptions in the implementation to accomplish this goal. We will want to try to unearth these.
    2. It is also designed to handle authn and publishing to object stores like AWS S3 and GCP on its own for a seamless user experience for disaster recovery. I think that we will need to roll our own if we want to support crypto-native Storage layers like Arweave (near-term) or IPFS and EigenDA (longer term). This removes a chunk of the value of Litestream imo
  4. WAL vs changesets
    1. WAL is a journal mode, which is core functionality of SQLite. Per R3:
      1. Litestream works by effectively taking over the checkpointing process. It starts a long-running read transaction to prevent any other process from checkpointing and restarting the WAL file. Instead, it continually copies over new WAL pages to a staging area called the shadow WAL and manually calls out to SQLite to perform checkpoints as necessary.

    2. I would categorize this as an "intrusive" approach, one that requires Litestream to be responsible for SQLite working correctly, since it is running the WAL. We would take on a similar responsibility.
    3. Related, @jorgemmsilva said
      1. We would provide the pre-configured DB (with wal mode, etc). All they have to do is set the busy_timeout for better UX.

      2. I think this is actually a large commitment from us. I am concerned that this may turn us into a "SQLite company" and we would have to become experts in SQLite and be responsible for that piece of infra for a customer. See R2 above, but there's lots of tuning and settings in the WAL section and care to take when running it. It would be nice to avoid this level of effort if possible and if it doesn't have significant advantages
    4. In my opinion, the Session Extension approach of listening to changesets is less intrusive; it appears to be "sidecar" functionality in SQLite, separate from the core DB machinery. Since we are only listening to SQL update statements that have already run, rather than taking over something core to the DB, this seems like a lower level of effort to get right. Indeed, this repo currently has the components wired up and communicating with one another using the Session Extension.
    5. This approach is also saving generic SQL and thus potentially extensible to validators running any other SQL DB, if that is desired.
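The "long-running read transaction" mechanism described in R3 can be observed directly with a stock `sqlite3` build (a minimal sketch with throwaway file paths): an open reader snapshot blocks a `TRUNCATE` checkpoint, which is exactly the control point Litestream seizes.

```python
# Demonstrates why Litestream's long-lived read transaction works: an open
# reader snapshot prevents SQLite from fully checkpointing/resetting the WAL.
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
db = sqlite3.connect(path, isolation_level=None)  # autocommit mode
db.execute("PRAGMA journal_mode=WAL")
db.execute("PRAGMA busy_timeout=100")  # fail fast instead of waiting 5s
db.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
db.execute("INSERT INTO t (v) VALUES ('a')")

# A second connection opens a read snapshot (this is Litestream's trick).
reader = sqlite3.connect(path, isolation_level=None)
reader.execute("BEGIN")
reader.execute("SELECT count(*) FROM t").fetchone()

db.execute("INSERT INTO t (v) VALUES ('b')")  # new frames land in the WAL
busy, _, _ = db.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
# busy == 1: the checkpoint could not complete while the snapshot is open.

reader.execute("ROLLBACK")  # release the snapshot
busy2, _, _ = db.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
# busy2 == 0 and the -wal file is truncated to zero bytes.
```

Litestream holds a snapshot like `reader`'s permanently, copies new frames to its shadow WAL, and issues checkpoints itself when it decides they are safe.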

Big questions for me

  1. What kind of latency requirements do we have for the "db" content to be available to validators?
    1. Do WAL vs Session Extension fail these in any significant way?
  2. Do validators need to receive generic SQL, or can they receive WAL file content? What will validators do with the data beyond reconstructing the state, if anything?

I think we need to zoom out and define the SyndDB product requirements - who SyndDB is for and not for, and what they should and should not be able to do with it. Both the WAL and Session Extension approaches are perfectly viable, but each has tradeoffs we need to understand in order to decide how to proceed


daniilrrr commented Dec 17, 2025

Thoughts on what Claude wrote

WAL-based replication requires zero application integration

This doesn't seem true based on what Litestream says. I misread: zero app integration means there's no library to run (true), but there's a lot of DB integration. Litestream works by "taking over the WAL process," so it seems to me that to build something similar (potentially in Rust) we would have to do the same.

Changeset > Current pain points

  • Must enable/disable session around certain operations
  • Some operations not captured (PRAGMA, ATTACH, VACUUM)

This is an interesting point. I am not sure what "certain operations" it refers to that aren't captured. We would need to determine if this is acceptable for us.

Key Decision Factors

I don't think the Payload Size matters; iirc Ar.io (Arweave network for permanent storage) told us that up to 100 KB data uploads are free (subsidized), and GCS definitely handles such files.

Cross-Architecture Determinism
WAL pages may differ across:...

I didn't know this and it wasn't explicitly discussed in docs. It seems like an unpleasant experience to specify a validator machine type

Took a brief look into this further (https://www.sqlite.org/wal.html) and I am not sure if the WAL file does differ across platform, or if it does then I didn't find conclusive evidence from reading the docs and other Googling. Claude may have hallucinated this

The other decision factors appear relevant and echo what I said in my comment above.

@jorgemmsilva

@daniilrrr

I am concerned that this may turn us into a "SQLite company" and we would have to become experts in SQLite and be responsible for that piece of infra for a customer.

We are developing a product called synd DB. We better become experts in SQLite, otherwise debugging issues will be impossible. We need to develop specialized expertise to have any chance of offering a quality product.

In my opinion, the Session Extension approach of listening to changesets is less intrusive and itself appears to be "sidecar" functionality in SQLite to the main db functionality. Since we are only listening to SQL update statements that have already run, rather than something core to the DB, this seems like a lower level of effort to get right. Indeed, this repo currently has the components wired up and communicating with one another using Session Extension.

IMO the value proposition of offering "no changes needed" integration is huge! The "session extension way" requires some non-trivial changes in the application code.
Besides all that, the current "session extension" architecture is fundamentally flawed. It is prone to race conditions, as we're relying on the application to send the changesets in a predictable order. The data layer becomes unrecoverably desynchronized from the DB if there is an issue (a bug in the sequencer code or the application, someone pulls the plug, etc.); anything like that can make a changeset be applied in the DB but not transmitted to the storage layer.
Essentially the way I see it, we have an amazing point to connect the pieces: the DB itself - which resolves all race conditions, provides ACID properties and acts as the single source of truth, and instead of using it we're creating a system on top (the "session extension way") which implicitly adds a lot of complexity and ways the whole thing can go wrong.

  • What kind of latency requirements do we have for the "db" content to be available to validators?
    Do WAL vs Session Extension fail these in any significant way

I don't see a reason we particularly care about this. The only thing the storage layer is useful for is letting validators reconstruct the state and sign withdrawals, and that's not a latency-sensitive part of the system. (For reference, Litestream does WAL checkpoints/backups every second by default.)

  • Do validators need to receive generic SQL, or can they receive WAL file content? What will validators do with the data beyond reconstructing the state, if anything?

They should assert constraints over the data. Overall, reconstructing the DB state and making checks over table data should be a simple, robust way to do it.

Comment on lines +133 to +137
WAL pages may differ across:
- Endianness (big vs little endian)
- Alignment/padding
- Page size configuration
- SQLite compile options


Comment on lines +145 to +149
Rough comparison for a single-column UPDATE:
- WAL: 4KB page (minimum)
- Changeset: ~50-200 bytes (column value + metadata)

**Question:** Is bandwidth/storage cost a significant concern?


this looks like gibberish. How were these values calculated?


---

## Hybrid Architecture (Option C)


this is the worst of both worlds; we won't be able to recover changesets from WAL backups, so the storage layer data is borked anyway in case of a failure

daniilrrr commented Dec 18, 2025


yeah agree, imo there's no disaster recovery need here

@daniilrrr

@jorgemmsilva

Besides all that, the current "session extension" architecture is fundamentally flawed. It is prone to race conditions as we're relying on the application to send the changesets in a predictable order.

I don't think that's true. The Session Extension is a native SQLite API, so changesets should be captured and sent in the order they're written to the DB. I don't think we are relying on the app for DB ordering.

The data layer becomes unrecoverably desynchronized from the DB in case there is an issue (a bug in the sequencer code or the application, someone pulls the plug, etc. anything like that can make a changeset be applied in the DB and not transmitted to the storage layer).

I think this is still true if there is an issue in the WAL version of the application we build


daniilrrr commented Dec 18, 2025

@jorgemmsilva having listened to your explanation in standup today, I think what you are really against is the FFI client library approach where we depend on the app using the client to publish() changesets. Which I do understand.

I think both Session Extension and WAL are viable for the reasons discussed above. To address your concern, with the Session Extension approach we could write a standalone "listener" node outside of the App but still in "VM 1" that would solely forward changesets. This would introduce another node with associated network calls and failure modes though

e.g. (diagram omitted: a standalone listener node alongside the app in VM 1, forwarding changesets)

@jorgemmsilva

I don't think that's true. Session Extension is a native API on SQLite, so changesets should be listened to and sent in the order that they're written to on the DB. I don't think we are relying on the app for DB ordering.

https://sqlite.org/sessionintro.html#capturing_a_changeset

It is not necessary to delete a session object after extracting a changeset or patchset from it. It can be left attached to the database handle and will continue monitoring for changes on the configured tables as before. However, if sqlite3session_changeset() or sqlite3session_patchset() is called a second time on a session object, the changeset or patchset will contain all changes that have taken place on the connection since the session was created. In other words, a session object is not reset or zeroed by a call to sqlite3session_changeset() or sqlite3session_patchset().

we're relying on the way the application is written to finish sessions and send changesets ordered correctly.
I really don't like that we're relying on the way the client will write their application to guarantee correct behaviour.
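The quoted behaviour is easy to model (a toy Python sketch, not the real API) and it shows the failure mode: because a session is cumulative, an application that extracts twice without recreating the session ships overlapping changesets, so correctness depends entirely on app-side session discipline.

```python
# Toy model of the documented sqlite3session_changeset() behaviour:
# a session is NOT reset by extracting a changeset from it.

class Session:
    def __init__(self):
        self._changes = []

    def record(self, change):
        self._changes.append(change)

    def changeset(self):
        # Returns ALL changes since the session was created (cumulative),
        # mirroring the quoted documentation.
        return list(self._changes)

# Naive app: extracts twice from one session -> overlapping payloads.
s = Session()
s.record("INSERT A")
first = s.changeset()
s.record("INSERT B")
second = s.changeset()  # contains "INSERT A" again -> double-apply hazard

# Correct pattern: delete and recreate the session per extraction,
# which is exactly the discipline the application must get right.
s1 = Session(); s1.record("INSERT A")
delta1 = s1.changeset()
s2 = Session(); s2.record("INSERT B")
delta2 = s2.changeset()
```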

The data layer becomes unrecoverably desynchronized from the DB in case there is an issue (a bug in the sequencer code or the application, someone pulls the plug, etc. anything like that can make a changeset be applied in the DB and not transmitted to the storage layer).

I think this is still true if there is an issue in the WAL version of the application we build

Not true, we can reconcile WAL frames with the DB / storage layer so no information is ever missing
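Reconciliation is plausible because the WAL file layout is documented and independently parseable: a 32-byte header, then repeated 24-byte frame headers each followed by one page image, with a nonzero "database size after commit" field marking a transaction boundary. A minimal sketch (stdlib only, throwaway paths):

```python
# Parse the -wal file per the documented WAL format to enumerate frames
# and find commit boundaries (candidate reconciliation points).
import os
import sqlite3
import struct
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
db = sqlite3.connect(path, isolation_level=None)  # autocommit mode
db.execute("PRAGMA journal_mode=WAL")
db.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")  # txn 1
db.execute("INSERT INTO t (v) VALUES ('a')")                   # txn 2

frames = []
with open(path + "-wal", "rb") as f:
    # WAL header: magic, format version, page size, checkpoint sequence...
    magic, version, page_size, ckpt_seq = struct.unpack(">4I", f.read(16))
    assert magic in (0x377F0682, 0x377F0683)  # documented WAL magic numbers
    f.read(16)  # skip salts and header checksum
    while True:
        hdr = f.read(24)
        if len(hdr) < 24:
            break
        # Frame header starts with page number and db-size-after-commit.
        pgno, commit_size = struct.unpack(">2I", hdr[:8])
        f.seek(page_size, 1)  # skip the page image itself
        frames.append((pgno, commit_size))

commits = [i for i, (_, c) in enumerate(frames) if c > 0]
# Each index in `commits` is the last frame of a committed transaction.
```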

@jorgemmsilva

we could write a standalone "listener" node outside of the App

Is this even possible to do?

we also can't guarantee that the client will call "sqlite3session_attach() with a NULL argument"
What happens if the application decides to call sqlite3session_delete() ?

Just feels like this strategy leaves a lot of implementation detail burden on the application. Which I really don't like when the alternative is "no custom implementation necessary"

There is a reason litestream uses the WAL strategy


daniilrrr commented Dec 18, 2025

I see what you are saying. I think you are correct that it is a question of where to store the implementation complexity. The Session Extension approach means that the customer needs to use our library correctly, while the WAL approach means that we need to configure the [DB + Syndicate-Litestream-app] and make sure that works correctly. The latter is going to be more work for us but probably the more robust approach.

I searched "session extension" in the Litestream Github and there are some interesting results.

This is interesting - benbjohnson/litestream#129 - marked as "wont fix." I think this indicates that the WAL approach is orthogonal to saving logical changesets, if such a feature is complicated to support in Litestream

@jorgemmsilva

I'm glad we're getting on the same page 🙂

This is interesting - benbjohnson/litestream#129 - marked as "wont fix." I think this indicates that the WAL approach is orthogonal to saving logical changesets, if such a feature is complicated to support in Litestream

That issue is about potentially using a "session extension format" (changesets) to share changes to a downstream client as litestream replicates WAL data. It's quite different from using the "session extension" as the replication mechanism itself.

I feel like the session extension tech would be useful for CRDTs and offline apps, where one could, for example: do changes to a TODO list mobile app while offline, save those changes as a changeset and then relay the info over to a server once the device comes online again. But yeah, all logic must be purpose-built on the app side.

Add section 6 covering changeset inversion as a key differentiator:
- Document sqlite3changeset_invert() API and how it transforms operations
- Explain why WAL cannot support inversion (forward-only, checkpointing)
- List SyndDB use cases: validator rollback, dispute resolution, optimistic execution
- Add inversion row to comparison table
- Update recommendation rationale

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@WillPapper

This is a very helpful discussion! Thank you for all of the thoughtful input @jorgemmsilva @daniilrrr. I agree that placing this discussion in the context of product goals will help us make a decision.

One blind spot in our current understanding is how validators will work. Do validators need the diffs provided by changesets, or are forward-only WAL updates suitable? That product question (validator flexibility) would affect the DX and security of SyndDB more than the WAL vs changeset approach. A hard-to-write, brittle validator is more painful than a required client library or set of initialization instructions for an application.

I built out some example use cases (price oracle and prediction market) that involve custom validator rules in #58 (also fixed some client bugs and some nuances in changeset handling while I was at it). That will help guide the discussion after break
