Accelerating Causal Algorithms for Industrial-scale Data: A Distributed Computing Approach with Ray Framework

Vishal Verma, Vinod Reddy, Jaiprakash Ravi
2024-01-22
Abstract: The increasing need for causal analysis in large-scale industrial datasets necessitates the development of efficient and scalable causal algorithms for real-world applications. This paper addresses the challenge of scaling causal algorithms in the context of conducting causal analysis on extensive datasets commonly encountered in industrial settings. Our proposed solution involves enhancing the scalability of causal algorithm libraries, such as EconML, by leveraging the parallelism capabilities offered by the distributed computing framework Ray. We explore the potential of parallelizing key iterative steps within causal algorithms to significantly reduce overall runtime, supported by a case study that examines the impact on estimation times and costs. Through this approach, we aim to provide a more effective solution for implementing causal analysis in large-scale industrial applications.
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper addresses causal analysis on industrial-scale data, aiming to make causal algorithms efficient and scalable enough for practical applications. The main challenge is that existing causal algorithm libraries (such as EconML) are limited in computational efficiency and scalability on large industrial datasets, especially those containing hundreds of covariates and confounding factors. To tackle this challenge, the authors propose enhancing causal algorithm libraries with the distributed computing framework Ray. The method targets three goals:

1. **Significant Parallelism**: Parallelizing key iterative steps in causal algorithms significantly reduces the overall runtime.
2. **Rapid Hyperparameter Tuning**: The method makes efficient hyperparameter tuning possible.
3. **Cost Optimization**: Effective resource management and scheduling reduce the overall computational cost.

The authors demonstrate the impact of this distributed computing method on estimation time and cost through a case study. They also implement a prototype of an Orthogonal Machine Learning (OCML) algorithm that extends the corresponding algorithm in the EconML library using the Ray framework. In summary, the paper aims to show how distributed computing can improve the performance and efficiency of causal analysis algorithms on industrial-scale datasets.
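
The parallelism goal can be illustrated with a small sketch. The summary does not say which iterative steps the authors parallelize, so this example assumes one natural candidate: the fold-wise cross-fitting of nuisance models in orthogonal machine learning, where each fold can be fitted independently. It is not the authors' implementation and does not use EconML's API. The names `simulate`, `residualize_fold`, and `cross_fit_effect` are hypothetical, and the nuisance models are closed-form one-variable linear regressions standing in for real learners. Ray is used when it is installed; otherwise the same folds run in a plain loop.

```python
import random

try:
    import ray  # optional: without Ray the folds below run sequentially
except ImportError:
    ray = None


def simulate(n=2000, theta=2.0, seed=0):
    """Toy data: X confounds both the treatment T and the outcome Y."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ts = [0.8 * x + rng.gauss(0, 1) for x in xs]
    ys = [theta * t + 1.5 * x + rng.gauss(0, 1) for t, x in zip(ts, xs)]
    return xs, ts, ys


def residualize_fold(xs, ts, ys, test_idx):
    """Fit nuisance models off-fold; return (T, Y) residuals on the held-out fold."""
    held_out = set(test_idx)
    train = [i for i in range(len(xs)) if i not in held_out]

    def fit_line(target):
        # Closed-form simple linear regression of target on x (training rows only).
        n = len(train)
        mean_x = sum(xs[i] for i in train) / n
        mean_y = sum(target[i] for i in train) / n
        sxx = sum((xs[i] - mean_x) ** 2 for i in train)
        sxy = sum((xs[i] - mean_x) * (target[i] - mean_y) for i in train)
        slope = sxy / sxx
        return slope, mean_y - slope * mean_x

    t_slope, t_icpt = fit_line(ts)
    y_slope, y_icpt = fit_line(ys)
    return [
        (ts[i] - (t_slope * xs[i] + t_icpt), ys[i] - (y_slope * xs[i] + y_icpt))
        for i in test_idx
    ]


def cross_fit_effect(xs, ts, ys, n_folds=5):
    """Estimate the treatment effect by residual-on-residual regression."""
    folds = [list(range(k, len(xs), n_folds)) for k in range(n_folds)]
    if ray is not None:
        ray.init(ignore_reinit_error=True)
        remote_fold = ray.remote(residualize_fold)
        # Store the data once in Ray's object store so all fold tasks share it.
        xs_ref, ts_ref, ys_ref = ray.put(xs), ray.put(ts), ray.put(ys)
        futures = [remote_fold.remote(xs_ref, ts_ref, ys_ref, f) for f in folds]
        per_fold = ray.get(futures)  # blocks until every fold task finishes
    else:
        per_fold = [residualize_fold(xs, ts, ys, f) for f in folds]

    residuals = [pair for fold in per_fold for pair in fold]
    numerator = sum(rt * ry for rt, ry in residuals)
    denominator = sum(rt * rt for rt, _ in residuals)
    return numerator / denominator


if __name__ == "__main__":
    xs, ts, ys = simulate()
    estimate = cross_fit_effect(xs, ts, ys)
    print(f"estimated effect: {estimate:.3f} (simulated truth: 2.0)")
```

Because the fold tasks share no state, the only coordination point is the final `ray.get`. With these toy nuisance models the scheduling overhead will likely outweigh any gain; the pattern only pays off when each fold fit is expensive, as with the large covariate sets described above.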