Yes, this is a bit long. Sorry.
The Acme Corporation has an accounts database on a central server. They have a desktop accounts application in use by workers around the world. The application makes SSL connections to the central server to issue commands and receive responses.
The underlying accounts system is horrifically complex, and the commands have effects on the database that depend on the previous state of the database (and possibly the time of day and phase of the moon) in bizarre ways that no-one at Acme understands.
A while back they decided to add some resilience to this system. They bought four more servers and distributed them around the world. They modified the command/response process on the main server so that it became:
- Take a filesystem snapshot of the database files
- Accept a command from a client
- Apply the command to the database application, flushing changes to disk
- Send the response back to the client
- Take another filesystem snapshot
- Use the pair of snapshots to generate a filesystem level delta that represents the database changes that resulted from running the command.
- Distribute the delta to the four backup servers
This worked well, and the Acme board was happy, until one day an intern observed that the main server could be destroyed by a nuclear strike just after sending the response back to the client. If that happened, then the effects of that command would be lost when a backup server was converted into the new live server. For a change to the database to be lost after an accountant has been told that the change has been made is Not Allowed At Acme, so heads rolled.
You have been recruited as the new Chief Accounts Database Resilience Officer, and given the task of redesigning the resilience mechanism. Unfortunately, it took them 3 weeks to find your CV, and they spent the time thinking up requirements for the new system.
- Any of the five servers may crash and reboot at any time, your design must allow for this.
- Any of the five servers may be permanently destroyed by a nuclear strike at any time. Your design must allow for up to 2 such server nukings, which may be simultaneous.
- The (surviving) replicas may not diverge, each must end up applying the same sequence of deltas.
- Once the response to a command has reached the client, the effects of that command may not be lost.
- Communications between the servers may be disrupted in all sorts of ways, ranging from packet loss to the Internet becoming temporarily partitioned.
- Without human intervention, the system must keep accepting commands and generating responses so long as 3 or more servers are up and able to exchange messages.
- A team of unusually pedantic and inventive interns has been hired to poke holes in your design and think up situations in which it fails, so expect to face pathologically selective packet loss.