-
Notifications
You must be signed in to change notification settings - Fork 16
Expand and clarify consitency/durability docs in store.wit #56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 2 commits
89c63fc
aaeec54
ee77cef
b4f3fff
5b3c65b
94183c2
cf03f8a
418d48f
a9f34a4
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -7,22 +7,65 @@ | |||||
| /// ensuring compatibility between different key-value stores. Note: the clients will be expecting | ||||||
| /// serialization/deserialization overhead to be handled by the key-value store. The value could be | ||||||
| /// a serialized object from JSON, HTML or vendor-specific data types like AWS S3 objects. | ||||||
| /// | ||||||
| /// ## Consistency | ||||||
| /// | ||||||
| /// Data consistency in a key value store refers to the guarantee that once a write operation | ||||||
| /// completes, all subsequent read operations will return the value that was written. | ||||||
| /// | ||||||
| /// Any implementation of this interface must have enough consistency to guarantee "reading your | ||||||
| /// writes." In particular, this means that the client should never get a value that is older than | ||||||
| /// the one it wrote, but it MAY get a newer value if one was written around the same time. These | ||||||
| /// guarantees only apply to the same client (which will likely be provided by the host or an | ||||||
| /// external capability of some kind). In this context a "client" is referring to the caller or | ||||||
| /// guest that is consuming this interface. Once a write request is committed by a specific client, | ||||||
| /// all subsequent read requests by the same client will reflect that write or any subsequent | ||||||
| /// writes. Another client running in a different context may or may not immediately see the result | ||||||
| /// due to the replication lag. As an example of all of this, if a value at a given key is A, and | ||||||
| /// the client writes B, then immediately reads, it should get B. If something else writes C in | ||||||
| /// quick succession, then the client may get C. However, a client running in a separate context may | ||||||
| /// still see A or B | ||||||
| /// writes." In particular, this means that a `get` call for a given key on a given `bucket` | ||||||
| /// resource should never return a value that is older than the the last value written to that key | ||||||
| /// on the same resource, but it MAY get a newer value if one was written around the same | ||||||
| /// time. These guarantees only apply to reads and writes on the same resource; they do not hold | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I think we might be burying the lead a bit. It might be useful to start the consistency section with a quick sentence that says that there are no consistency guarantees across resource handles.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Makes sense; please see my latest push and let me know if it still needs improvement. |
||||||
| /// across multiple resources -- even when those resources were opened using the same string | ||||||
| /// identifier by the same component instance. | ||||||
| /// | ||||||
| /// The following pseudocode example illustrates this behavior. Note that we assume there is | ||||||
| /// initially no value set for any key and that no other writes are happening beyond what is shown | ||||||
| /// in the example. | ||||||
| /// | ||||||
| /// bucketA = open("foo") | ||||||
| /// bucketB = open("foo") | ||||||
| /// bucketA.set("bar", "a") | ||||||
| /// // The following are guaranteed to succeed: | ||||||
| /// assert bucketA.get("bar").equals("a") | ||||||
| /// assert bucketB.get("bar").equals("a") or bucketB.get("bar") is None | ||||||
| /// // ...whereas this is NOT guaranteed to succeed immediately (but should eventually): | ||||||
| /// // assert bucketB.get("bar").equals("a") | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It sounds like from what's written above this is not guranteed to ever be true. Since consistency is not guaranteed across resource handles,
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Right, hence the "should". I think it's worth mentioning what a conforming implementation should make a best effort to do (i.e. in normal operation, barring exceptional circumstances) as well as what it must do. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we're using the RFC 2119 meaning of "should" I think we should write it as "SHOULD" (in all caps). A non-RFC definition of "should" here might lead readers to interpret "should" as "will". |
||||||
| /// | ||||||
| /// Once a value is `set` for a given key on a given `bucket`, all subsequent `get` requests on that | ||||||
| /// same bucket will reflect that write or any subsequent writes. `get` requests using a different | ||||||
| /// bucket may or may not immediately see the new value due to e.g. cache effects and/or replication | ||||||
| /// lag. | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'd prefer if we were consistent about when we used "resource" vs. "bucket". I think you mean "resource" here, because if there is a second resource handle to the same logical "bucket" then subsequent
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That's fair. I'm using
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I just pushed an update to consistently use the term " |
||||||
| /// | ||||||
| /// Continuing the above example: | ||||||
| /// | ||||||
| /// bucketB.set("bar", "b") | ||||||
| /// bucketC = open("foo") | ||||||
| /// value = bucketC.get("bar") | ||||||
| /// assert value.equals("a") or value.equals("b") or value is None | ||||||
| /// | ||||||
| /// In other words, the `bucketC` resource may reflect either the most recent write to the `bucketA` | ||||||
| /// resource, or the one to the `bucketB` resource, or neither, depending on how quickly either of | ||||||
| /// those writes reached the replica from which the `bucketC` resource is reading. However, | ||||||
| /// assuming there are no unrecoverable errors -- such that the state of a replica is irretrievably | ||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm confused why we mention "unrecoverable errors". Such errors aren't visible to the guest and thus aren't really of consequence to the guest. I believe the important bit is that the writes one one resource are not guaranteed to be reflected on subsequent reads of a different resource. As things are written I'm unsure about the following situation. Imagine the guest code: The client has left sufficient time (1,000,000 years) for replication to happen. However, the backing implementation uses caching such that once Is that spec compliant?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The scenario I had in mind regarding "unrecoverable errors" was where BTW, if the discussion of unusual errors is distracting and/or superfluous, I can omit it or move it to a footnote. I mainly just wanted to point out that failures in a distributed system are non-atomic and can affect the behavior of that system even when it's still (partially) available. That's in contrast to a centralized, ACID database where it either fails completely or not at all. Regarding caching: I expect There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think this discussion is superfluous. I think it's extremely important. It's the difference between whether host implementors of this interface need to wait for guarantee of replication or not. When we settle on the semantics of writes are not guaranteed to replicate, then that means the guest can never trust a write except by opening a new resource handle and doing a new read, right?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes, that sounds correct to me. FWIW, I do think supporting two kinds of writes (one that uses write-behind caching to avoid blocking and another that blocks until it has received confirmation from at least one replica) and two kinds of reads (one that uses a cache and one that doesn't) could make sense. Even when using the blocking versions of those operations, though, we still wouldn't be able to make guarantees about if/when the write is visible using a different resource handle (since it might be connected to a different replica). Some distributed databases use a single-master replication model, which make it easier to provide stronger guarantees -- e.g. as long as you get write confirmation from the master and then, when reading, request that the replica syncs with the master before returning a result, then you'll get very ACID-style semantics. That's what Turso does to implement transactional writes and It might help in this discussion to nail down the minimum feature set (related to consistency, durability, or otherwise) a backing key value store must provide to be compatible with |
||||||
| /// lost before it can be propagated -- one of the values ("a" or "b") will eventually be considered | ||||||
| /// the "latest" and replicated across the system, at which point all three resources will return | ||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Who decides what is to be considered the "latest"? In a distributed system, nodes might have clock skews that there is no global concept of "latest". My main concen here is related to conflict resolution mechansm and it's not clear to me whether the system or the application is responsible for resolving these conflicts and how to resolve them.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. To elaborate more: if the system is using Last-Write-Wins to resolve conflict using a timestamp from the node's clock, due to clock skew, a write operation that actually occurred earlier in real-time might be assigned a later timestamp by a node whose clock is skewed. In this case, two nodes may not have agreed on the same value being the "latest".
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I expect that different backing store implementations will potentially use different approaches to resolving conflicts and deciding which write wins (e.g. vector clocks, a centralized serializer, or even (pseudo)random choice), which is why I didn't specify which value ("a" or "b") would be preferred. We haven't provided any API in this interface to resolve conflicts, so it can only be the responsibility of the system, not the application, AFAICT. I agree that the term "latest" here may cause confusion. Would it help to change the wording to
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
be considered the "winner" per node, but there won't be a global "winner" for that specific value. e.g. Node A might determine "a" as the winner while node B might determine "b" as the winner. I think it's fine to say "one of the values ("a" or "b") MUST eventually be considered the "winner" at each replica ..."
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Or perhaps we can say "In the absence of irrecoverable errors and prolonged network partitions, the system will strive to converge towards a consistent state" This emphasizes that after updates cease and replication processes have completed, each replica will reach a stable and internally consistent state. It is important to recognize that while the keyvalue system aims for replicas to have the same view of the data, realistically, a consistent state in eventual consistency does not guarantee that all replicas will converge to exactly the same value for every key.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
That surprises me; I thought part of the definition of eventual consistency was that, as long a no new updates are made, all replicas will eventually converge on the same value for every key. Can you give an example of an eventually consistent system that does not have that property?
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It is a surpising fact that eventual consistency does not guarantee that all replicas will always converge on the same value for every key. It only guarantees that replicas will converge to some consistent state. The key is conflict resolution mechanism. Since this proposal does not specify a conflict resolution mechanism and leaves it for the provider to decide, I can imagine the following scenario. let's say we have two replicas A and B and initially there is a network partition After some time, the network partition recovers, and the system finds a conflict! If the key-value store uses Last-Write-Win but relies on-node system clock (not a vector clock), conflicting timestamps might cause replicas to inconsistently order updates. Let's say, replica A believes That's why I raised the point about the importance of the conflict resolution mechanism (CRM) in scenarios where concurrent writes exist. If the system is using more advanced CRM, e.g. vector clocks, then it might ensure a single "winner" value.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That certainly is surprising. I wouldn't have guessed that each node is allowed to reach its own conclusion about what the "latest value" is and, in so doing, permanently diverge from its peers. TIL. I'll update the PR to reflect that; probably next week some time.
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the kind of eventual consistent system that guarantees that any two nodes have the same value for the same keys is apprantly called "strong eventual consistency". You will need to use CRDT to achieve that. That's a funny name
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry for the delay; I finally got around to updating this. Please let me know what you think. |
||||||
| /// that same value. | ||||||
| /// | ||||||
| /// ## Durability | ||||||
| /// | ||||||
| /// This interface does not currently make any hard guarantees about the durability of values | ||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it's okay to leave the durability wide open. I am wondering in your case 3 - under async Now, there is a question of "what happens if an async I/O error occurs right after the In a strict interpretation of the spec, once If the store experiences a critical I/O failure that causes data corruption or data loss, there are currently no instructions on how the store should respond. Should it return I think there are two possible ways to extend the specification to address the above concerns: Handle defunct after errorsWe could define that once a bucket handle experiences a critical I/O error, all further operations on that handle must return an error. That is, if a store fails after a Best-effort guarantee tied to success conditionsThe specification could define that “read your writes” holds as long as the store does not fail irrecoverably between operations. A
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @Mossaka Based on the previous discussion above, I think there's performance reasons not to require "read your writes" (even when reads follow writes on the same |
||||||
| /// stored. A valid implementation might rely on an in-memory hash table, the contents of which are | ||||||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For in-memory stores, we probably want to emphasize that the data might be lost due to store crashed, and the Best-effort guarantee described in my comment above should apply to our specification - stating that the "read your write" consistency contract should only apply to store operating under normal conditions. |
||||||
| /// lost when the process exits. Alternatively, another implementation might synchronously persist | ||||||
| /// all writes to disk -- or even to a quorum of disk-backed nodes at multiple locations -- before | ||||||
| /// returning a result for a `set` call. Finally, a third implementation might persist values | ||||||
| /// asynchronously on a best-effort basis without blocking `set` calls, in which case an I/O error | ||||||
| /// could occur after the component instance which originally made the call has exited. | ||||||
| /// | ||||||
| /// Future versions of the `wasi-keyvalue` package may provide ways to query and control the | ||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
| /// durability and consistency provided by the backing implementation. | ||||||
| interface store { | ||||||
| /// The set of errors which may be raised by functions in this package | ||||||
| variant error { | ||||||
|
|
||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to understand what "context" (borrowing the terminology below) this is meant for.
One reading of this (which I assume is not meant) is "all subsequent read operations globally (from any client) will return the value that was written". I assume what is actually meant is all reads from the client that performed the write. Perhaps we should move the definitions of client and context from below to the top of this section and then be explicit about how all operations unless otherwise stated are only from the perspective of the current client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense; I just pushed an update which simply removes the first paragraph since the second one says the same thing more precisely.