DAFEE: A Scalable Distributed Automatic Feature Engineering Algorithm for Relational Datasets

Wenqian Zhao, Xiangxiang Li, Guoping Rong, Mufeng Lin, Chen Lin, Yifan Yang
DOI: https://doi.org/10.1007/978-3-030-60239-0_3
2020-01-01
Abstract: Automatic feature engineering aims to construct informative features automatically and reduce manual labor in machine learning applications. Most existing approaches are designed for tasks with a single data source, which makes them less applicable to real scenarios. In this paper, we present DAFEE, a distributed automatic feature engineering algorithm that generates features across multiple large-scale relational datasets. Starting from the target table, the algorithm uses a breadth-first-search-style traversal to find related tables and constructs advanced high-order features that are remarkably effective in practical applications. Moreover, DAFEE implements a feature selection method to reduce computational cost and improve predictive performance. Furthermore, it is highly optimized to process massive volumes of data. Experimental results demonstrate that it improves predictive performance by 7% compared to state-of-the-art algorithms.
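The table-discovery step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `bfs_related_tables`, the `relations` adjacency mapping, the `max_depth` cutoff, and the example table names are all hypothetical, chosen only to show how a breadth-first traversal from a target table might enumerate related tables by join distance.

```python
from collections import deque


def bfs_related_tables(target, relations, max_depth=2):
    """Return (table, depth) pairs reachable from `target`, in BFS order.

    `relations` is a hypothetical adjacency mapping: table name -> list of
    tables it can be joined with. `max_depth` is an assumed cutoff on the
    number of joins away from the target table.
    """
    visited = {target}
    order = [(target, 0)]
    queue = deque([(target, 0)])
    while queue:
        table, depth = queue.popleft()
        if depth == max_depth:
            # Do not expand tables that already sit at the depth limit.
            continue
        for neighbor in relations.get(table, []):
            if neighbor not in visited:
                visited.add(neighbor)
                order.append((neighbor, depth + 1))
                queue.append((neighbor, depth + 1))
    return order


# Hypothetical schema: customers is the target table.
relations = {
    "customers": ["orders", "sessions"],
    "orders": ["order_items"],
    "order_items": ["products"],
}
result = bfs_related_tables("customers", relations, max_depth=2)
print(result)
```

In this sketch, tables closer to the target are visited first, so features could be built from nearer tables before deeper ones; `products` is three joins away and is skipped under the assumed depth limit of 2.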