A resilient distributed system

Yes, this is a bit long. Sorry.

The Acme Corporation has an accounts database on a central server. They have a desktop accounts application in use by workers around the world. The application makes SSL connections to the central server to issue commands and receive responses.

The underlying accounts system is horrifically complex, and the commands have effects on the database that depend on the previous state of the database (and possibly the time of day and phase of the moon) in bizarre ways that no-one at Acme understands.

A while back they decided to add some resilience to this system. They bought four more servers and distributed them around the world. They modified the command/response process on the main server so that it became:

  1. Take a filesystem snapshot of the database files
  2. Accept a command from a client
  3. Apply the command to the database application, flushing changes to disk
  4. Send the response back to the client
  5. Take another filesystem snapshot
  6. Use the pair of snapshots to generate a filesystem level delta that represents the database changes that resulted from running the command.
  7. Distribute the delta to the four backup servers

This worked well, and the Acme board was happy, until one day an intern observed that the main server could be destroyed by a nuclear strike just after sending the response back to the client. If that happened, then the effects of that command would be lost when a backup server was converted into the new live server. For a change to the database to be lost after an accountant has been told that the change has been made is Not Allowed At Acme, so heads rolled.

You have been recruited as the new Chief Accounts Database Resilience Officer, and given the task of redesigning the resilience mechanism. Unfortunately, it took them 3 weeks to find your CV, and they spent the time thinking up requirements for the new system.

My solution


Hosted by Exonetric
2007
Home