How Many Planet-Wide Leaders Should There Be?

Shengyun Liu,Marko Vukolic
DOI: https://doi.org/10.1145/2847220.2847222
2015-01-01
Abstract:Geo-replication becomes increasingly important for modern planetary scale distributed systems, yet it comes with a specific challenge: latency, bounded by the speed of light. In particular, clients of a geo-replicated system must communicate with a leader which must in turn communicate with other replicas: wrong selection of a leader may result in unnecessary round-trips across the globe. Classical protocols such as celebrated Paxos, have a single leader making them unsuitable for serving widely dispersed clients. To address this issue, several all-leader geo-replication protocols have been proposed recently, in which every replica acts as a leader. However, because these protocols require coordination among all replicas, commiting a client's request at some replica may incure the so-called "delayed commit" problem, which can introduce even a higher latency than a classical single-leader majority-based protocol such as Paxos. In this paper, we argue that the "right" choice of the number of leaders in a geo-replication protocol depends on a given replica configuration and propose Droopy, an optimization for state machine replication protocols that explores the space between single-leader and all-leader by dynamically reconfiguring the leader set. We implement Droopy on top of Clock-RSM, a state-of-the-art all-leader protocol. Our evaluation on Amazon EC2 shows that, under typical imbalanced workloads, Droopy-enabled Clock-RSM efficiently reduces latency compared to native Clock-RSM, whereas in other cases the latency is the same as that of the native Clock-RSM.
What problem does this paper attempt to address?