Abstract: Dataflow computing was shown to bring significant benefits to multiple niches of systems engineering and has the potential to become a general-purpose paradigm of choice for data-driven application development. One of the characteristic features of dataflow computing is the natural access to the dataflow graph of the entire system. Recently it has been observed that these dataflow graphs can be treated as complete graphical causal models, opening opportunities to apply causal inference techniques to dataflow systems. In this demonstration paper we aim to provide the first practical validation of this idea with a particular focus on causal fault localisation. We provide multiple demonstrations of how causal inference can be used to detect software bugs and data shifts in multiple scenarios with three modern dataflow engines.

What problem does this paper attempt to address?

The paper explores how causal inference techniques can be applied to dataflow systems for fault localisation. The authors argue that dataflow computing has brought significant benefits to several areas of systems engineering and could become the paradigm of choice for data-driven application development. A characteristic feature of dataflow computing is natural access to the dataflow graph of the entire system. Because this graph can be treated as a complete graphical causal model, causal inference techniques can be applied to the system directly.

The specific goal is to provide a practical validation of causal inference for fault localisation in dataflow systems. The authors present multiple demonstrations, covering different scenarios on three modern dataflow engines, of how causal inference can detect both software bugs and data shifts (changes in the data distribution).

To this end, the paper proposes a causal attribution algorithm that recursively traverses the dataflow graph and computes two quantities for each node: its deviation (the change in the node's output distribution) and its attribution score (the node's contribution to the change in the overall system output). In experiments on real-world software applications, the method is shown to identify the specific components responsible for output changes, which is what localises the fault.

In summary, the paper addresses fault localisation in dataflow systems using causal inference techniques.
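The recursive traversal described in this summary can be illustrated with a small sketch. This is not the paper's algorithm: the deviation metric (a standardised shift of the mean), the attribution rule (a node's deviation minus the largest deviation among its upstream nodes), the function names `deviation` and `attribute`, and the three-node graph with its sample data are all assumptions chosen to keep the example short. It also assumes the dataflow graph is acyclic.

```python
from statistics import mean, pstdev


def deviation(baseline, observed):
    """Hypothetical deviation metric: shift of the mean, scaled by baseline spread."""
    scale = pstdev(baseline) or 1.0  # avoid dividing by zero for constant outputs
    return abs(mean(observed) - mean(baseline)) / scale


def attribute(graph, baseline, observed, node, scores=None):
    """Recursively walk upstream from `node` and score every node visited.

    graph:    dict mapping node name -> list of upstream (parent) node names
    baseline: dict mapping node name -> output samples from a healthy run
    observed: dict mapping node name -> output samples from the suspect run
    Returns a dict mapping node name -> (deviation, attribution).
    """
    if scores is None:
        scores = {}
    if node in scores:  # already scored via another path
        return scores
    parents = graph.get(node, [])
    for parent in parents:
        attribute(graph, baseline, observed, parent, scores)
    dev = deviation(baseline[node], observed[node])
    # Assumed rule: deviation already present upstream is not this node's fault.
    inherited = max((scores[p][0] for p in parents), default=0.0)
    scores[node] = (dev, max(dev - inherited, 0.0))
    return scores


# Illustrative pipeline: source -> clean -> aggregate, with a bug in `clean`.
graph = {"source": [], "clean": ["source"], "aggregate": ["clean"]}
baseline = {
    "source": [1, 2, 3, 4, 5],
    "clean": [1, 2, 3, 4, 5],
    "aggregate": [2, 4, 6, 8, 10],
}
observed = {
    "source": [1, 2, 3, 4, 5],      # inputs unchanged
    "clean": [3, 4, 5, 6, 7],       # shifted output: the injected fault
    "aggregate": [6, 8, 10, 12, 14],  # shift propagated downstream
}

scores = attribute(graph, baseline, observed, "aggregate")
culprit = max(scores, key=lambda name: scores[name][1])
print(culprit)
```

In this toy run both `clean` and `aggregate` deviate from the baseline, but only `clean` deviates while its input does not, so it receives the highest attribution. A realistic implementation would compare full output distributions and apportion contributions between several upstream nodes rather than subtracting a single maximum.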

Causal fault localisation in dataflow systems

Applications of Causality and Causal Inference in Software Engineering

CausalFlow: Visual Analytics of Causality in Event Sequences

An Overview of the Quantitative Causality Analysis and Causal Graph Reconstruction Based on a Rigorous Formalism of Information Flow

Causality and Temporal Dependencies in the Design of Fault Management Systems

Extracting Physical Causality from Measurements to Detect and Localize False Data Injection Attacks

Cloud Atlas: Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight

The Landscape of Causal Discovery Data: Grounding Causal Discovery in Real-World Applications

Causal inference for data centric engineering

Efficient Discovery of Actual Causality using Abstraction-Refinement

Causal Data Integration

Progress in Root Cause and Fault Propagation Analysis of Large-Scale Industrial Processes

A Survey on Causal Discovery: Theory and Practice

Computational Causal Inference

BayesFLo: Bayesian fault localization of complex software systems

Flow of dynamical causal structures with an application to correlations

CausIL: Causal Graph for Instance Level Microservice Data

Accelerating Causal Algorithms for Industrial-scale Data: A Distributed Computing Approach with Ray Framework

On Geometry of Information Flow for Causal Inference

Data-Driven Root-Cause Analysis For Distributed System Anomalies

On the Fly Detection of Root Causes from Observed Data with Application to IT Systems