GaussML: an End-to-End In-Database Machine Learning System

Guoliang Li, Ji Sun, Lijie Xu, Shifu Li, Jiang Wang, Wen Nie
DOI: https://doi.org/10.1109/icde60146.2024.00391
2024-01-01
Abstract: In-database machine learning (in-DB ML) is appealing to database users with security and privacy concerns, as it avoids copying data out of the database to a separate machine learning system. The common way to implement in-DB ML is the ML-as-UDF approach, which uses User-Defined Functions (UDFs) within SQL to implement ML training and prediction. However, UDFs may introduce security risks through vulnerable code and suffer from performance problems, as they are constrained by the data access and execution patterns of SQL query operators. To address these limitations, we propose a new in-database machine learning system, namely GaussML, which provides end-to-end machine learning capability with a native SQL interface. To support ML training and inference within SQL queries, GaussML directly integrates typical ML operators into the query engine without UDFs. GaussML also introduces an ML-aware cardinality and cost estimator to optimize the SQL+ML query plan. Moreover, GaussML leverages Single Instruction Multiple Data (SIMD) and data prefetching techniques to accelerate the ML operators for training. We have implemented a series of algorithms inside GaussML in the openGauss database. Compared to state-of-the-art in-DB ML systems such as Apache MADlib, GaussML achieves a 2-6× speed-up in extensive experiments.