Chronos: Finding Timeout Bugs in Practical Distributed Systems by Deep-Priority Fuzzing with Transient Delay

Yuanliang Chen,Fuchen Ma,Yuanhang Zhou,Ming Gu,Qing Liao,Yu Jiang
DOI: https://doi.org/10.1109/sp54263.2024.00109
2024-01-01
Abstract:Delays are inevitable in complex distributed environments. Timeout mechanisms are commonly used to handle unexpected failures in distributed systems. However, incorrect timeout handling or implementation errors in timeout mechanisms can lead to system hang-ups or crashes. Such timeout bugs may be crucial and pose a significant threat to the availability and security of distributed systems.In this work, we introduce Chronos, a general testing framework for automatically detecting timeout bugs in distributed systems with deep-priority transient delays. First, we propose general runtime delayed libraries that dynamically inject fine-grained delays in a Distributed System Under Test (DSUT). To effectively trigger delays and constantly explore timeout bugs in deep paths, Chronos harnesses a deep-priority guided fuzzing that dynamically generates high-quality delay sequences in the runtime. Then, Chronos utilizes transient delays to eliminate the time overhead caused by actual delays and accelerate the test process. We implemented and evaluated Chronos on four widely used distributed systems, including ZooKeeper, MySQL-Cluster, HDFS, and Go-Ethereum. Compared with the state-of-the-art techniques, Random, Brute-Force, and Coverage-Guided fault injection, Chronos covers 26.40%, 21.69%, and 15.14% more timeout mechanism logic, respectively. Furthermore, Chronos has detected 27 timeout bugs in these real-world applications, which have been repaired by the corresponding maintainers.
What problem does this paper attempt to address?