The key problem with Prism's architecture is that you need to have a cloned database per region. This isn't that bad of an idea from a logistical standpoint (after all, this is needed for certain larger scale applications anyway) but it means that there is always a chance of database desyncs. Here is a current example of one:
NA1 <- Receives Update Request
NA1 = Performs Database Update on NA1 DB
NA1 -> Broadcasts Update to EU1 and AS1 peer nodes
EU1 <- Receives Update Request
AS1 <- Network Transmission Error (Never receives Update)
EU1 = Fails Database Update
In the example above, the EU and AS databases are now completely desynced from each other since a DB query was never made. There is really no way to avoid this at it's core, but we can implement fail safes in order to preserve data integrity. Here are some ideas I've collected (which I'll probably implement all of them) to circumvent the desync.
NA1 <- Receives Update Request
NA1 = Performs Database Update on NA1 DB
NA1 -> Broadcasts Update to EU1 (Awaiting an ACK)
EU1 <- Network Transmission Error (Never receives Update)
NA1 = Realizes an ACK was never returned from EU1
NA1 -> Retries Update to EU1 (Awaiting an ACK)
EU1 <- Receives Update Request
EU1 = Successfully Performs Database Update
EU1 -> Sends ACK back to NA1
NA1 <- Receives ACK from EU1 (Shifts the ACK queue)
NA1 <- Receives Update Request
NA1 = Performs Database Update on NA1 DB
NA1 -> Broadcasts Update to EU1 (Awaiting an ACK)
EU1 <- Network Transmission Error (Never receives Update)
NA1 = Realizes an ACK was never returned from EU1
NA1 -> Retries Update to EU1 (Awaiting an ACK)
EU1 <- Network Transmission Error (5x)
NA1 <- Retries Update to EU1 (5x)
NA1 = Realizes EU1 is not responding after 5 retries
NA1 = Switches to EU2 as the failover node
NA1 = Performs Update to EU2 (Awaiting an ACK)
EU2 <- Receives Update Request
EU2 = Successfully Performs Database Update
EU2 -> Sends ACK back to NA1
NA1 <- Receives ACK from EU2 (Shifts the ACK queue)
... NA1 has exhausted all retries with EU1 and all of its failover
NA1 -> Creates SQL connection the EU Database
NA1 -> Writes update to EU Database
Other Ideas:
- Undo requests from peer nodes to source node
- Awaiting DB write on main node until successful writes on all peer nodes
- Storing unsynced requests in file system or a cache and retrying on startup
The key problem with Prism's architecture is that you need to have a cloned database per region. This isn't that bad of an idea from a logistical standpoint (after all, this is needed for certain larger scale applications anyway) but it means that there is always a chance of database desyncs. Here is a current example of one:
In the example above, the EU and AS databases are now completely desynced from each other since a DB query was never made. There is really no way to avoid this at it's core, but we can implement fail safes in order to preserve data integrity. Here are some ideas I've collected (which I'll probably implement all of them) to circumvent the desync.
Other Ideas: