Kafka and other systems solve this by using a large enough replication factor: rather than trusting any single fsync, they replicate to at least 3 nodes in the cluster.
The WISC team found that Kafka’s ZooKeeper still has a few unfixed storage bugs: local single-disk block failures, system crashes being conflated with corruption in the middle of the journal, and misdirected writes, all of which can propagate across the replicated cluster to cause data loss and/or complete cluster failure: https://www.usenix.org/system/files/conference/fast18/fast18-alagappan.pdf
With this storage fault model in mind, we’re building WISC’s CTRL protocol into TigerBeetle, and some of the CTRL protocol is already present in AlphaBeetle in how we duplicate critical sections of the journal file to detect all kinds of disk failures. There are some ZooKeeper clusters running custom patches with CTRL, at least at Apple I think, but as far as I understand they were never merged back, which is a pity. It looks like Kafka is replacing ZooKeeper, so it will be interesting to see what their storage fault model will be.
You can find our fault model here: tigerbeetle/DESIGN.md at master · coilhq/tigerbeetle · GitHub
We can also use something like a ZFS filesystem, which provides error correction and proper storage resiliency.
ZFS is the bomb! They paved the way on all of this: ZFS: The Last Word in File Systems Part 1 - YouTube
The reason we’re not delegating to ZFS is because we want TigerBeetle to be usable as an embeddable library on any platform or system in the long run for financial inclusion, and because we need to handle local storage failures in the context of the global consensus protocol, which is essential for a distributed cluster. ZFS would be fine for a single node, but for a cluster running consensus I don’t think ZFS would surface the local data we’d need to get the global consensus protocol right: the fine-grained local answers to “Was this caused by a system crash?” or “Was this caused by corruption?”, which affect the global consensus protocol in surprising ways (as documented in the Alagappan paper above).
How does this parallelise and survive failures?
The question was more about how the system can survive a NIC, server, or power failure. From what I got from the presentation, TigerBeetle’s replication model is a single active node with passive replicas; if it’s not parallelised, or active-active, how is traffic routed to a replica in case of catastrophic failure of the main server? Are you considering putting this in the client?
Yes, a strong stable leader in the distributed consensus sense, with passive followers, classic Multi-Paxos. In the event of catastrophic failure of the leader (or network partition etc.), a new leader would be promoted automatically, maintaining strict serializability throughout.
We’ve started on the code for automated leader election from amongst the cluster, seamless, within a very small time window, with no chance of split brain, using the Viewstamped Replication Revisited distributed consensus protocol: http://pmg.csail.mit.edu/papers/vr-revisited.pdf
As a variant of Multi-Paxos, I believe VRR has one of the lowest latencies for leader failover. It’s also easy to understand, very similar to Raft. It differs from Raft in that it gives better liveness, and it’s nice and predictable: you can predict the next leader in the cluster, which is also what the client would do in the common case, falling back to asking the cluster (a single RTT) in the worst case.
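That predictability comes from how VR chooses its primary: the primary for a view is picked round-robin from a fixed, ordered configuration, so every replica and client can compute the next leader deterministically. A minimal sketch (names are illustrative, not TigerBeetle’s API):

```python
def primary(view: int, configuration: list[str]) -> str:
    """In Viewstamped Replication, the primary for a given view number is
    chosen round-robin from the fixed, sorted configuration, so replicas
    and clients can predict the next leader without negotiation."""
    return configuration[view % len(configuration)]


config = sorted(["replica-0", "replica-1", "replica-2"])
assert primary(0, config) == "replica-0"
# After a view change, everyone independently agrees on the next primary:
assert primary(1, config) == "replica-1"
assert primary(4, config) == "replica-1"
```

This is why a client can usually guess the new leader immediately after a failover, instead of always paying an extra round trip to discover it.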
Do you already have any idea how does the replication affect latency?
As Adrian said, it should be minimal, and optimal, if fault-tolerance is not negotiable:
Compared to Paxos (a two-phase protocol in the best case, assuming no failures), which requires two RTTs to agree on a value, Multi-Paxos reuses the first phase of Paxos (required for leader election) across subsequent “multiple” rounds of replication, so each round then requires only a single RTT from the leader to the replication quorum.
From there, with Flexible Paxos (https://arxiv.org/pdf/1608.06696.pdf), we can then optimize the size of the replication quorum further so that we would synchronously replicate only to the leader’s disk plus one other follower’s disk (in parallel) before acking back to the client. The rest of the durability guarantee (beyond two servers) would come through later within a few ms, outside of the critical path with asynchronous replication amongst the remaining followers (who have spare time on their hands).
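The safety condition behind this is simple: Flexible Paxos only requires that every replication quorum intersects every leader-election quorum, i.e. |Q_replication| + |Q_election| > n, instead of both being majorities. A quick sketch of the arithmetic (illustrative numbers, not TigerBeetle’s final configuration):

```python
def quorums_intersect(n: int, q_replication: int, q_election: int) -> bool:
    """Flexible Paxos safety condition: any replication quorum must
    intersect any leader-election quorum, i.e. q_replication + q_election > n."""
    return q_replication + q_election > n


# With 5 replicas we can ack after writing to only 2 disks (leader + one
# follower), provided leader election waits to hear from 4 replicas:
assert quorums_intersect(5, q_replication=2, q_election=4)
# Classic majority quorums (3 of 5 for both phases) also satisfy it:
assert quorums_intersect(5, q_replication=3, q_election=3)
# But shrinking both quorums (2 and 3 of 5) would lose intersection:
assert not quorums_intersect(5, q_replication=2, q_election=3)
```

The trade-off is that a smaller replication quorum (fast commits) is paid for with a larger leader-election quorum (slower, but rare, failovers).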
Just as a back-of-the-envelope sketch: replication would affect latency and throughput by a single network hop, with little contention for NIC bandwidth since the data structures are small. Given that synchronous replication is required for fault-tolerance in any system, this is at least the fastest way of doing it, and it’s what LMAX also do (sync replication between two nodes in the same DC, async replication to another node cross-DC).
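To put rough numbers on that sketch (purely illustrative assumptions, not TigerBeetle measurements):

```python
# Assumed figures for illustration only:
intra_dc_rtt_us = 500       # assumed round trip to a follower in the same DC
leader_fsync_us = 1000      # assumed fsync latency on the leader's disk
follower_fsync_us = 1000    # assumed fsync latency on the follower's disk

# The leader sends the batch to one follower and fsyncs its own disk while
# the follower fsyncs in parallel, so the two fsyncs overlap rather than add:
commit_latency_us = intra_dc_rtt_us + max(leader_fsync_us, follower_fsync_us)
print(f"~{commit_latency_us} us added to the critical path per committed batch")
```

With batching on top, that cost is amortised across many transfers per commit, which is why the replication hop shouldn’t dominate throughput.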
Hope to have an answer with actual numbers in March.
Again, congrats on great work. You got me a bit jealous, because I love to do low-level code and don’t get many chances to do it.
Thanks again Pedro, and looking forward to integrating TigerBeetle with you. You might also enjoy Zig ShowTime by Loris Cro (https://zig.show)… see if that gets you hooked!