Causal fault localisation in dataflow systems

Andrei Paleyes, Neil D. Lawrence
2023-04-24
Abstract: Dataflow computing was shown to bring significant benefits to multiple niches of systems engineering and has the potential to become a general-purpose paradigm of choice for data-driven application development. One of the characteristic features of dataflow computing is the natural access to the dataflow graph of the entire system. Recently it has been observed that these dataflow graphs can be treated as complete graphical causal models, opening opportunities to apply causal inference techniques to dataflow systems. In this demonstration paper we aim to provide the first practical validation of this idea with a particular focus on causal fault localisation. We provide multiple demonstrations of how causal inference can be used to detect software bugs and data shifts in multiple scenarios with three modern dataflow engines.
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
The paper explores how causal inference techniques can be applied to dataflow systems to localise faults. The authors note that dataflow computing has brought significant benefits to several areas of systems engineering and could become a preferred paradigm for data-driven application development. A characteristic feature of dataflow computing is natural access to the dataflow graph of the entire system. Because this graph can be treated as a complete graphical causal model, causal inference techniques can be applied to dataflow systems directly.

The paper's goal is to validate causal inference for fault localisation in dataflow systems. The authors present multiple demonstrations, covering different scenarios across three modern dataflow engines, of how causal inference can detect both software bugs and data shifts (changes in data distribution).

To this end, the paper proposes a causal attribution algorithm that recursively traverses the dataflow graph and computes two quantities for each node: its deviation (the change in the node's output distribution) and its attribution score (the node's contribution to the change in the overall system output). Experiments on real-world software applications show that the method identifies the specific components responsible for output changes, which is what localises the fault.
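The recursive traversal described above can be illustrated with a small sketch. This is not the paper's algorithm, only a toy heuristic under stated assumptions: the dataflow graph is given as a mapping from each node to its upstream parents, each node's outputs are recorded for a baseline run and an observed run, deviation is taken to be the shift in the mean scaled by the baseline spread, and a node's attribution score is the part of its deviation not already present in its most-deviating parent. The names `deviation`, `attribute`, and the three-node pipeline are all hypothetical.

```python
from statistics import mean, pstdev


def deviation(baseline, observed):
    """Mean shift scaled by the baseline spread (an assumed, simplistic metric)."""
    spread = pstdev(baseline) or 1.0  # avoid division by zero for constant outputs
    return abs(mean(observed) - mean(baseline)) / spread


def attribute(node, parents, baseline, observed, scores=None):
    """Walk upstream from `node` and assign every visited node a score.

    Score = the node's own deviation minus the largest deviation among
    its parents, floored at zero (toy heuristic, not the paper's method).
    """
    if scores is None:
        scores = {}
    if node in scores:  # already visited via another path in the DAG
        return scores
    own = deviation(baseline[node], observed[node])
    inherited = 0.0
    for parent in parents.get(node, []):
        attribute(parent, parents, baseline, observed, scores)
        inherited = max(inherited, deviation(baseline[parent], observed[parent]))
    scores[node] = max(0.0, own - inherited)
    return scores


# Hypothetical pipeline: source -> transform -> sink.
parents = {"sink": ["transform"], "transform": ["source"], "source": []}

# Baseline run: transform doubles the input, sink adds one.
baseline = {
    "source": [1, 2, 3, 4, 5],
    "transform": [2, 4, 6, 8, 10],
    "sink": [3, 5, 7, 9, 11],
}
# Observed run: a bug in transform adds a spurious offset of 4.
observed = {
    "source": [1, 2, 3, 4, 5],
    "transform": [6, 8, 10, 12, 14],
    "sink": [7, 9, 11, 13, 15],
}

scores = attribute("sink", parents, baseline, observed)
suspect = max(scores, key=scores.get)
print(suspect, scores)
```

In this toy run the sink's output has shifted as much as the transform's, but the sink inherits that shift from its parent, so only the transform receives a non-zero score. A realistic implementation would compare full distributions instead of means and would need a principled attribution rule for nodes with several parents.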