Fault tolerance via idempotence

Ganesan Ramalingam, Kapil Vaswani
DOI: https://doi.org/10.1145/2480359.2429100
2013-01-23
ACM SIGPLAN Notices
Abstract: Building distributed services and applications is challenging due to the pitfalls of distribution, such as process and communication failures. A natural solution to these problems is to detect potential failures and retry the failed computation and/or resend messages. Ensuring correctness in such an environment requires distributed services and applications to be idempotent. In this paper, we study the inter-related aspects of process failures, duplicate messages, and idempotence. We first introduce a simple core language (based on lambda calculus) inspired by modern distributed computing platforms. This language formalizes the notions of a service, duplicate requests, process failures, data partitioning, and local atomic transactions that are restricted to a single store. We then formalize a desired (generic) correctness criterion for applications written in this language, consisting of idempotence (which captures the desired safety properties) and failure-freedom (which captures the desired progress properties). Next, we propose language support in the form of a monad that automatically ensures failfree idempotence. A key characteristic of our implementation is that it is decentralized and does not require distributed coordination. We show that the language support can be enriched with other useful constructs, such as compensations, while retaining the coordination-free, decentralized nature of the implementation. We have implemented the idempotence monad (and its variants) in F# and C# and used our implementation to build realistic applications on Windows Azure. We find that the monad has low runtime overheads and leads to more declarative applications.
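The core idea described above, making retries and duplicate requests safe using only transactions local to a single store, can be sketched in Python. This is a simplified illustration under stated assumptions, not the paper's idempotence monad (which is implemented in F# and C#). The names `Store`, `IdempotentRun`, `transfer`, and `Crash` are hypothetical. Each store is modeled as an in-memory dict, and a "local atomic transaction" is modeled by applying an action to a copy of the store and committing the new state together with a log entry keyed by (request id, step number).

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Tuple


class Crash(Exception):
    """Hypothetical stand-in for a process failure between two steps."""


@dataclass
class Store:
    """Hypothetical single store; one call to atomic_step models one local transaction."""
    data: Dict[str, Any] = field(default_factory=dict)
    # Results of completed steps, keyed by (request_id, step_number).
    log: Dict[Tuple[str, int], Any] = field(default_factory=dict)

    def atomic_step(self, request_id: str, step: int,
                    action: Callable[[Dict[str, Any]], Any]) -> Any:
        key = (request_id, step)
        if key in self.log:
            # Retried or duplicate step: return the recorded result, skip the effect.
            return self.log[key]
        # Apply the action to a copy (shallow copy suffices for flat values),
        # so an exception inside the action leaves the store unchanged.
        snapshot = dict(self.data)
        result = action(snapshot)
        # Commit the effect and its log entry together in the same store.
        self.data = snapshot
        self.log[key] = result
        return result


class IdempotentRun:
    """Numbers a request's steps deterministically, so a retry replays the same numbers."""

    def __init__(self, request_id: str):
        self.request_id = request_id
        self.step = 0

    def run(self, store: Store, action: Callable[[Dict[str, Any]], Any]) -> Any:
        self.step += 1
        return store.atomic_step(self.request_id, self.step, action)


def transfer(request_id: str, src: Store, dst: Store, amount: int,
             crash_after_debit: bool = False) -> None:
    run = IdempotentRun(request_id)

    def debit(d: Dict[str, Any]) -> int:
        d["balance"] = d.get("balance", 0) - amount
        return d["balance"]

    def credit(d: Dict[str, Any]) -> int:
        d["balance"] = d.get("balance", 0) + amount
        return d["balance"]

    run.run(src, debit)
    if crash_after_debit:
        raise Crash("process failed between the two steps")
    run.run(dst, credit)


a = Store({"balance": 100})
b = Store({"balance": 0})
try:
    transfer("req-1", a, b, 30, crash_after_debit=True)  # fails after the debit
except Crash:
    pass
transfer("req-1", a, b, 30)  # retry: debit is skipped, credit runs once
transfer("req-1", a, b, 30)  # duplicate request: both steps are skipped
print(a.data["balance"], b.data["balance"])  # prints "70 30"
```

Because each log entry lives in the same store as the effect it records, no coordination across stores is needed. The sketch does assume that a retried request executes the same sequence of steps, so that step numbers line up between the original attempt and the retry.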
What problem does this paper attempt to address?