FlyMC: Highly Scalable Testing of Complex Interleavings in Distributed Systems

Jeffrey F. Lukman, Huan Ke, Cesar A. Stuardo, Riza O. Suminto, Daniar H. Kurniawan, Dikaimin Simon, Satria Priambada, Chen Tian, Feng Ye, Tanakorn Leesatapornwongsa, Aarti Gupta, Shan Lu, Haryadi S. Gunawi
DOI: https://doi.org/10.1145/3302424.3303986
2019-01-01
Abstract: We present a fast and scalable testing approach for datacenter/cloud systems such as Cassandra, Hadoop, Spark, and ZooKeeper. The uniqueness of our approach is in its ability to overcome the path/state-space explosion problem in testing workloads with complex interleavings of messages and faults. We introduce three powerful algorithms: state symmetry, event independence, and parallel flips, which collectively make our approach on average 16x (up to 78x) faster than other state-of-the-art solutions. We have integrated our techniques with 8 popular datacenter systems, successfully reproduced 12 old bugs, and found 10 new bugs, all without random walks or manual checkpoints.