TigerBeetle Update

Thursday January 28, 2021
2:50 PM UTC

Session Leads: Adrian Hope-Bailie (@adrianhopebailie), Joran Greef (@jorandirkgreef), Donovan Changfoot (@donchangfoot)

Session Description: A demonstration of the latest work on TigerBeetle and a discussion of the next steps for the community and Mojaloop integration

Session Slides

Session Recording

Please post questions, suggestions and thoughts about this session in this thread.

Post by clicking the “Reply” button below.

Can you provide a link to your repo, please?

First off, this is amazing work, so congratulations! Coupling the logic with the storage in a low-level language will always be far faster than any distributed design - this is really exciting.

I see you're using lots of smart techniques like ring buffers, and best-practice I/O techniques that Kafka also uses, like zero-copy.
It's good you mentioned the lack of trust in fsync; this is a statistical fact that we keep forgetting in typical designs (we assume RAID-ed disks don't fail). Kafka and other systems solve this with a large enough replication factor, so that we don't trust any single fsync and replicate to at least 3 nodes in the cluster. We can also use something like the ZFS filesystem, which provides error correction and proper storage resiliency.

I like the draining of warm data to SQL; it sounds like CQRS! Good validation :wink:

But my questions are:

  • How do we query the system and run complex queries? It sounds like CQRS.
  • How does this parallelise and survive failures?
  • Do you already have any idea how the replication affects latency?

Again, congrats on great work. You've got me a bit jealous, because I love doing low-level code and don't get many chances to do it.

Can't wait to see how this can be integrated with the patterns and principles we have in the PoC.

This will depend on the context. The lookup command returns the current state of the queried accounts, including their balances. This might be used by internal components/monitoring systems if there is a need for direct queries against the system of record.

Other queries (e.g. servicing external systems) would be executed against the materialised views of the data that come from draining it into external data stores. (Yes, very CQRS-like.)
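
To make the split concrete, here's a rough Python sketch (the function names and the SQLite view are purely illustrative, not TigerBeetle's actual API): direct lookups hit the system of record, while richer queries run against a materialised view populated by draining.

```python
import sqlite3

# Hypothetical in-memory system of record: account id -> current balance state.
ledger = {
    1: {"debits_accepted": 100_00, "credits_accepted": 250_00},
    2: {"debits_accepted": 0, "credits_accepted": 40_00},
}

def lookup_accounts(ids):
    """Direct query against the system of record: only the current state of the given accounts."""
    return {account_id: ledger[account_id] for account_id in ids if account_id in ledger}

# Complex or external-facing queries go to a materialised view instead,
# populated by periodically draining warm data out of the hot path.
view = sqlite3.connect(":memory:")
view.execute("CREATE TABLE balances (id INTEGER PRIMARY KEY, net INTEGER)")

def drain():
    """Drain current account states into the external store (the CQRS-style read model)."""
    rows = [(account_id, a["credits_accepted"] - a["debits_accepted"]) for account_id, a in ledger.items()]
    view.executemany("INSERT OR REPLACE INTO balances VALUES (?, ?)", rows)
    view.commit()

drain()
print(lookup_accounts([1]))  # internal monitoring: direct lookup against the system of record
print(view.execute("SELECT id, net FROM balances WHERE net > 0").fetchall())  # external reporting query
```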

We have no immediate plans to parallelise the core functionality. We may introduce additional processing threads for specific functions such as networking, draining off data, etc. This way, recovery from failure should be deterministic. (I hope that answers the question.)
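
Just to illustrate why a single-threaded core makes recovery deterministic, here is a toy Python sketch (not TigerBeetle code, and the command shape is made up): one core thread applies commands in journal order while auxiliary threads only feed its queue, so recovery is simply a replay of the same journal and always reproduces the same state.

```python
import queue
import threading

journal = []            # ordered log of committed commands
state = {"balance": 0}
inbox = queue.Queue()   # fed by auxiliary threads (networking, draining, ...)

def apply(command):
    """Single deterministic transition function: the state depends only on journal order."""
    state["balance"] += command["amount"]

def core_loop():
    """The one and only thread that mutates state, in strict arrival order."""
    while True:
        cmd = inbox.get()
        if cmd is None:
            break
        journal.append(cmd)
        apply(cmd)

def recover(journal_copy):
    """Recovery is just a replay: same journal in, same state out, every time."""
    global state
    state = {"balance": 0}
    for cmd in journal_copy:
        apply(cmd)

core = threading.Thread(target=core_loop)
core.start()
for amount in (10, -3, 7):
    inbox.put({"amount": amount})  # e.g. produced by a networking thread
inbox.put(None)
core.join()

recover(journal)
print(state)  # {'balance': 14}, deterministic regardless of wall-clock timing
```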

Back-of-the-envelope calculations suggest it is minimal. Because we batch early, we also replicate batches, so it's a single network call per batch. We are also leveraging Flexible Paxos, which means we wait for fewer acks from replicas before we consider it safe to continue (vs. traditional Paxos).
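
As a hypothetical illustration only (these numbers are assumptions, not measurements), batching amortises the replication round trip across every transfer in the batch:

```python
# Illustrative figures, not TigerBeetle measurements.
transfers_per_batch = 10_000  # assumed batch size
rtt_to_quorum_ms = 0.5        # assumed intra-DC round trip to the replication quorum

# Unbatched: one replication round trip per transfer.
unbatched_overhead_ms = rtt_to_quorum_ms
# Batched: one round trip for the whole batch, shared by every transfer in it.
batched_overhead_ms = rtt_to_quorum_ms / transfers_per_batch

print(f"per-transfer replication overhead, unbatched: {unbatched_overhead_ms} ms")
print(f"per-transfer replication overhead, batched:   {batched_overhead_ms:.5f} ms")
```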

@jorandirkgreef has some actual numbers I think he can share.

Thanks, I hope you’ll consider contributing. I think you’ll love Zig

The question was more about how the system can survive a NIC/server or power failure. From what I got from the presentation, TigerBeetle's replication model is a single active node with passive replicas; if it's not parallelised, or active-active, how is traffic routed to a replica in the case of catastrophic failure of the main server? Are you considering putting this logic in the client?

I’d love to contribute, but I don’t want to make promises I can’t keep :frowning: I can commit to discussing ideas, though.

Thanks for your answers. Looking forward to the next round of progress on the replication and high availability; that's where the daemons hide!! But I trust you guys have the skills to succeed.

Kafka and other systems solve this with a large enough replication factor, so that we don't trust any single fsync and replicate to at least 3 nodes in the cluster.

The WISC team found that Kafka's ZooKeeper still has a few unfixed storage bugs (local single-disk block failures, conflating system crashes with corruption in the middle of the journal, and misdirected writes) that can all propagate across the replicated cluster to cause data loss and/or complete cluster failure: https://www.usenix.org/system/files/conference/fast18/fast18-alagappan.pdf

With this storage fault model in mind, we're building WISC's CTRL protocol into TigerBeetle, and some of CTRL is already present in AlphaBeetle in how we duplicate critical sections of the journal file to detect all kinds of disk failures. There are some ZooKeeper clusters running custom patches with CTRL, at least at Apple I think, but as far as I understand it has never been merged back, which is a pity. It looks like Kafka is replacing ZooKeeper, so it will be interesting to see what their storage fault model will be.
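
Very roughly, the duplication idea looks like this; this is a hedged conceptual sketch in Python, not AlphaBeetle's actual on-disk layout, record sizes or checksum algorithm. Writing each checksummed entry to two regions lets a reader distinguish "never written because we crashed" from "written but corrupted", and repair from the surviving copy rather than conflating the two cases.

```python
import hashlib

PAYLOAD_SIZE = 128
RECORD_SIZE = 16 + PAYLOAD_SIZE  # 16-byte checksum prefix + payload (illustrative sizes)

def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()[:16]

def make_record(payload: bytes) -> bytes:
    assert len(payload) == PAYLOAD_SIZE
    return checksum(payload) + payload

def classify(record: bytes) -> str:
    """What does one on-disk copy of a journal entry tell us?"""
    if record == bytes(RECORD_SIZE):
        return "unwritten"  # all zeroes: most likely a crash before the write reached disk
    if checksum(record[16:]) == record[:16]:
        return "intact"
    return "corrupt"        # bit rot, a torn write, or a misdirected write

def read_entry(copy_a: bytes, copy_b: bytes) -> bytes:
    """Two independent copies let us tell a crash apart from corruption, and repair."""
    states = (classify(copy_a), classify(copy_b))
    if "intact" in states:
        return copy_a if states[0] == "intact" else copy_b  # repair from the good copy
    if states == ("unwritten", "unwritten"):
        return b""  # never durably written: treat as a crash and truncate here
    raise IOError("both copies corrupt: ask the rest of the cluster for this entry")

good = make_record(b"x" * PAYLOAD_SIZE)
corrupt = good[:20] + bytes([good[20] ^ 0xFF]) + good[21:]  # simulate a flipped byte on disk
print(classify(corrupt))                  # corrupt
print(read_entry(corrupt, good) == good)  # True: repaired from the second copy
```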

You can find our fault model here: tigerbeetle/DESIGN.md at master · coilhq/tigerbeetle · GitHub

We can also use something like a ZFS filesystem, which provides error correction and proper storage resiliency.

ZFS is the bomb! They paved the way on all of this: ZFS: The Last Word in File Systems Part 1 - YouTube

The reason we're not delegating to ZFS is that we want TigerBeetle to be usable as an embeddable library on any platform or system in the long run, for financial inclusion, and because we need to handle local storage failures in the context of the global consensus protocol, which is essential for a distributed cluster. ZFS would be fine for a single node, but for a cluster running consensus I don't think ZFS would surface the local data we'd need to get the global consensus protocol right: the local, fine-grained answers to "Was this caused by a system crash?" or "Was this caused by corruption?", which all affect the global consensus protocol in surprising ways (as documented in the Alagappan paper above).

How does this parallelise and survive failures?

The question was more about how the system can survive a NIC/server or power failure. From what I got from the presentation, TigerBeetle's replication model is a single active node with passive replicas; if it's not parallelised, or active-active, how is traffic routed to a replica in the case of catastrophic failure of the main server? Are you considering putting this logic in the client?

Yes, a strong stable leader in the distributed consensus sense, with passive followers: classic Multi-Paxos. In the event of a catastrophic failure of the leader (or a network partition, etc.), a new leader would be promoted automatically, maintaining strict serializability throughout.

We’ve started on the code for automated leader election from amongst the cluster, seamless, within a very small time window, with no chance of split brain, using the Viewstamped Replication Revisited distributed consensus protocol: http://pmg.csail.mit.edu/papers/vr-revisited.pdf

As a variant of Multi-Paxos, I believe VRR has one of the lowest latencies for leader failover. It's also easy to understand, very similar to Raft. It differs from Raft in that it gives better liveness, and it's nice and predictable: you can predict the next leader in the cluster, which is also what the client would do in the common case, falling back to asking the cluster in a single RTT in the worst case.
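
The predictability comes from VRR's round-robin rule: the primary for a view is the view number modulo the cluster size, so replicas and clients can compute the next leader locally. A tiny sketch (replica names and cluster size are made up):

```python
replicas = ["replica-0", "replica-1", "replica-2"]  # hypothetical 3-node cluster configuration

def primary_for(view: int) -> str:
    """VRR assigns the primary round-robin by view number over the fixed configuration."""
    return replicas[view % len(replicas)]

view = 7
print(primary_for(view))      # current leader for view 7: replica-1
print(primary_for(view + 1))  # where a client would retry first after a view change: replica-2
```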

Do you already have any idea how the replication affects latency?

As Adrian said, it should be minimal, and optimal, if fault-tolerance is not negotiable:

Compared to Paxos (a two-phase protocol in the best case, assuming no failures), which requires two RTTs to agree on a value, Multi-Paxos reuses the first phase of Paxos, the one needed for leader election, across subsequent "multiple" rounds of replication, so each round then requires only a single RTT from the leader to the replication quorum.

From there, with Flexible Paxos (https://arxiv.org/pdf/1608.06696.pdf), we can optimize the size of the replication quorum further, so that we would synchronously replicate only to the leader's disk plus one other follower's disk (in parallel) before acking back to the client. The rest of the durability guarantee (beyond two servers) would come through later, within a few ms, outside of the critical path, with asynchronous replication amongst the remaining followers (who have spare time on their hands).
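
The quorum trade-off can be seen with a little arithmetic; the cluster size below is illustrative, not TigerBeetle's. Flexible Paxos only requires that every leader-election quorum intersects every replication quorum (q_election + q_replication > n), so shrinking the replication quorum to two disks simply means the election quorum must grow to compensate:

```python
def quorums_intersect(n: int, q_election: int, q_replication: int) -> bool:
    """Flexible Paxos safety condition: every election quorum must overlap every replication quorum."""
    return q_election + q_replication > n

n = 5  # hypothetical cluster size
for q_replication in range(1, n + 1):
    q_election = n - q_replication + 1  # smallest election quorum that still guarantees overlap
    assert quorums_intersect(n, q_election, q_replication)
    print(f"sync replicate to {q_replication}/{n} disks -> leader election needs {q_election}/{n} replicas")
```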

Just as a back-of-the-envelope sketch: replication would affect latency and throughput by a single network hop, and since the data structures are small there's not much NIC bandwidth being divided. Given that synchronous replication is required for fault-tolerance in any system, this is at least the fastest way of doing it, and it's what LMAX also do (sync replication between two nodes in the same DC, async replication to another node cross-DC).

Hope to have an answer with actual numbers in March.

Again, congrats on great work. You've got me a bit jealous, because I love doing low-level code and don't get many chances to do it.

Thanks again, Pedro, and looking forward to integrating TigerBeetle with you. You might also enjoy Zig ShowTime by Loris Cro (https://zig.show)… see if that gets you hooked! :wink:
