Abstract: Machine learning (ML) over relational data is a booming area of data management. While there is a lot of work on scalable and fast ML systems, little work has addressed the pains of sourcing data for ML tasks. Real-world relational databases typically have many tables (often dozens), and data scientists often struggle even to obtain all the tables needed for joins before ML. In this context, Kumar et al. recently showed that key-foreign key dependencies (KFKDs) between tables often let us avoid such joins without significantly affecting prediction accuracy, an idea they called "avoiding joins safely." While initially controversial, this idea has since been used by multiple companies to reduce the burden of data sourcing for ML. But their work applied only to linear classifiers. In this work, we verify whether their results hold for three popular high-capacity classifiers: decision trees, non-linear SVMs, and ANNs. We conduct an extensive experimental study using both real-world datasets and simulations to analyze the effects of avoiding KFK joins on such models. Our results show that these high-capacity classifiers are, surprisingly and counter-intuitively, more robust to avoiding KFK joins than linear classifiers, refuting an intuition from the prior work's analysis. We explain this behavior intuitively and identify open questions at the intersection of data management and ML theoretical research. All of our code and datasets are available for download from http://cseweb.ucsd.edu/~arunkk/hamlet.
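
As a rough illustration of what "avoiding a KFK join" means, the sketch below builds a toy fact table and dimension table and contrasts the joined feature set with the foreign-key-only feature set. The schema, the table names `S` and `R`, the feature `x_r`, and all helper functions are hypothetical and chosen only for illustration; they are not taken from the authors' code. The point shown is that, under a KFKD, the foreign key determines every foreign feature, so the join adds no information beyond what the foreign key already carries.

```python
# Hypothetical toy schema (an assumption for illustration, not the paper's data):
#   R: dimension table, primary key rid -> foreign feature x_r
#   S: fact table, rows of (label, fk) where fk references R.rid
R = {1: "urban", 2: "rural", 3: "urban"}
S = [(1, 1), (0, 2), (1, 3), (0, 2), (1, 1)]


def joined_features(rows, dim):
    """Materialize the KFK join: each row gains the foreign feature x_r."""
    return [(label, fk, dim[fk]) for label, fk in rows]


def fk_only_features(rows):
    """Avoid the join: keep only the foreign key as a categorical feature."""
    return [(label, fk) for label, fk in rows]


def fk_determines_foreign_feature(joined):
    """Check the functional dependency fk -> x_r on the joined rows."""
    seen = {}
    for _, fk, x_r in joined:
        # setdefault records the first x_r seen for this fk; any later
        # mismatch would violate the dependency.
        if seen.setdefault(fk, x_r) != x_r:
            return False
    return True


joined = joined_features(S, R)
avoided = fk_only_features(S)
dependency_holds = fk_determines_foreign_feature(joined)
```

Whether a classifier trained on `avoided` matches the accuracy of one trained on `joined` is the empirical question the study examines; this sketch shows only the dependency that makes avoiding the join possible.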

Are key-foreign key joins safe to avoid when learning high-capacity classifiers?
