When Amnesia Strikes: Understanding and Reproducing Data Loss Bugs with Fault Injection

Maria Ramos, João Azevedo, Kyle Kingsbury, José Pereira, Tânia Esteves, Ricardo Macedo, João Paulo
DOI: https://doi.org/10.14778/3681954.3681980
IF: 2.5
2024-07-01
Proceedings of the VLDB Endowment
Abstract: We present LazyFS, a new fault injection tool that simplifies the debugging and reproduction of complex data durability bugs experienced by databases, key-value stores, and other data-centric systems in crashes. Our tool simulates persistence properties of POSIX file systems (e.g., operations ordering and atomicity) and enables users to inject lost and torn write faults with a precise and controlled approach. Further, it provides profiling information about the system's operations flow and persisted data, enabling users to better understand the root cause of errors. We use LazyFS to study seven important systems: PostgreSQL, etcd, Zookeeper, Redis, LevelDB, PebblesDB, and Lightning Network. Our fault injection campaign shows that LazyFS automates and facilitates the reproduction of five known bug reports containing manual and complex reproducibility steps. Further, it aids in understanding and reproducing seven ambiguous bugs reported by users. Finally, LazyFS is used to find eight new bugs, which lead to data loss, corruption, and unavailability.
computer science, information systems, theory & methods
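
The two fault classes named in the abstract, lost writes and torn writes, can be illustrated with a toy in-memory model. This is only a sketch of the general idea under simplifying assumptions; it is not LazyFS's implementation or interface, and the class and method names (`ToyFile`, `crash_lost`, `crash_torn`) are hypothetical. Writes are assumed to stay in a volatile cache until an explicit `fsync`, and a crash either drops the cached writes entirely or persists only a prefix of the oldest one.

```python
class ToyFile:
    """Toy model of one file: a durable copy plus unsynced (cached) writes.

    Hypothetical illustration only; not the LazyFS API.
    """

    def __init__(self):
        self.durable = bytearray()  # bytes that survive a crash
        self.pending = []           # (offset, data) writes not yet fsynced

    @staticmethod
    def _apply(buf, offset, data):
        # Zero-fill any gap, then overwrite (or extend) at the offset.
        if len(buf) < offset:
            buf.extend(b"\x00" * (offset - len(buf)))
        buf[offset:offset + len(data)] = data

    def write(self, offset, data):
        # Writes only reach the volatile cache.
        self.pending.append((offset, bytes(data)))

    def read(self):
        # A running process sees durable data plus cached writes.
        view = bytearray(self.durable)
        for offset, data in self.pending:
            self._apply(view, offset, data)
        return bytes(view)

    def fsync(self):
        # Flush every cached write to the durable copy, in order.
        for offset, data in self.pending:
            self._apply(self.durable, offset, data)
        self.pending.clear()

    def crash_lost(self):
        # Lost write: everything not fsynced disappears.
        self.pending.clear()
        return bytes(self.durable)

    def crash_torn(self, keep_bytes):
        # Torn write: only the first keep_bytes of the oldest cached
        # write reach durable storage; the rest is dropped.
        if self.pending and keep_bytes > 0:
            offset, data = self.pending[0]
            self._apply(self.durable, offset, data[:keep_bytes])
        self.pending.clear()
        return bytes(self.durable)


# Lost write: the unsynced suffix is visible before the crash, gone after.
f = ToyFile()
f.write(0, b"committed")
f.fsync()
f.write(9, b"-unsynced")
before_crash = f.read()
after_lost = f.crash_lost()

# Torn write: only part of a single write survives the crash.
g = ToyFile()
g.write(0, b"HEADERBODY")
after_torn = g.crash_torn(6)
```

An application that reports success to a client before calling `fsync` would, in this model, lose the acknowledged data on `crash_lost`, which is the kind of durability bug the abstract refers to.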