FCatch

Haopeng Liu, Xu Wang, Guangpu Li, Shan Lu, Feng Ye, Chen Tian
DOI: https://doi.org/10.1145/3173162.3177161
2018-01-01
ACM SIGPLAN Notices
Abstract: It is crucial for distributed systems to achieve high availability. Unfortunately, this is challenging given the common component failures (i.e., faults). Developers often cannot anticipate all the timing conditions and system states under which a fault might occur, and introduce time-of-fault (TOF) bugs that only manifest when a node crashes or a message drops at a special moment. Although challenging, detecting TOF bugs is fundamental to developing highly available distributed systems. Unlike previous work that relies on fault injection to expose TOF bugs, this paper carefully models TOF bugs as a new type of concurrency bugs, and develops FCatch to automatically predict TOF bugs by observing correct execution. Evaluation on representative cloud systems shows that FCatch is effective, accurately finding severe TOF bugs.