The Neutralizer: a Self-Configurable Failure Detector for Minimizing Distributed Storage Maintenance Cost

Zhi Yang,Yafei Dai,Xiaoming Li
DOI: https://doi.org/10.1002/cpe.1338
2008-01-01
Concurrency and Computation Practice and Experience
Abstract:To achieve high data availability or reliability in an efficient manner, distributed storage systems must detect whether an observed node failure is permanent or transient, and if necessary, generate replicas to restore the desired level of replication. Given the unpredictability of network dynamics, however, distinguishing permanent and transient failures is extremely difficult. Though timeout-based detectors can be used to avoid mistaking transient failures as permanent failures, it is unknown how the timeout values should be selected to achieve a better tradeoff between detection latency and accuracy. In this paper, we address this fundamental tradeoff from several perspectives. First, we explore the impact of different timeout values on maintenance cost by examining the probability of their false positives and false negatives. Second, we propose a self-configurable failure detector called the Neutralizer based on the idea of counteracting false positives with false negatives. The Neutralizer could enable the system to maintain a desired replication level on average with the least amount of bandwidth. We conduct extensive simulations using real trace data from a widely deployed peer-to-peer system and synthetic traces based on PlanetLab and Microsoft PCs, showing a significant reduction in aggregate bandwidth usage after applying the Neutralizer (especially in an environment with a low average node availability). Overall, we demonstrate that the Neutralizer closely approximates the performance of a perfect ‘oracle’ detector in many cases. Copyright © 2008 John Wiley & Sons, Ltd.
What problem does this paper attempt to address?