Architecting Intermediate Layers for Efficient Composition of Data Management and Machine Learning Systems

Supun Abeysinghe, Fei Wang, Gregory Essertel, Tiark Rompf
2023-11-06
Abstract: Modern data analytics workloads combine relational data processing with machine learning (ML). Most DBMS handle these workloads by offloading these ML operations to external specialized ML systems. While both DBMS and ML systems go to great lengths to optimize performance for their specific workloads, significant performance is lost when used in combination, due to data movement across system boundaries, conversions between incompatible internal data formats, and the lack of cross system optimizations. A key idea to remove these bottlenecks is to integrate existing data manipulation systems with ML systems by building a common intermediate layer (IR). Although this idea has been explored before (Weld, Delite), previous such attempts require significant re-engineering of prior systems and still fall short in achieving best-of-breed performance for individual tasks (e.g., SQL, Deep Learning). Specifically, they rely on re-implementing existing systems using a generic set of operators and fail to match best-of-breed individual performance due to the inability to recover high-level optimizations from this generic IR through compiler analysis. We present Flern, the first intermediate-layer integration between DB and ML systems that are best-of-breed individually, competitive with the best compiled query engines such as HyPer on comprehensive relational benchmarks (TPC-H) and competitive with TensorFlow and PyTorch in state-of-the-art ML models (e.g., DeepSpeech, SqueezeNet, Transformers) and also represents a new state-of-the-art for integration. A key realization is to architect intermediate layers based on generative programming capabilities, which preserves high-level contextual information for cross optimizations and enables the construction of a variety of complex structures and cross system optimizations with minimal effort.
Programming Languages
What problem does this paper attempt to address?
The paper addresses the performance bottleneck that arises when relational data processing and machine learning (ML) are combined in modern data analytics workloads. Current database management systems (DBMS) and ML systems deliver poor end-to-end performance on such mixed workloads for the following reasons:

1. **Insufficient global optimization across system boundaries**: No optimization pass spans both the DBMS and the ML system, so each side is optimized in isolation.
2. **Overhead of data movement and format conversion**: Transferring data between systems and converting between their incompatible internal formats causes significant performance loss.
3. **Lack of efficient integration mechanisms**: Existing intermediate-layer approaches (e.g., Weld, Delite) either require significant re-engineering of the original systems or fail to match best-of-breed performance on individual tasks.

To solve these problems, the paper proposes a new intermediate-layer architecture called Flern, which combines relational data processing and ML systems and enables cross-system optimization while preserving the best-of-breed performance of each system.

The core contributions of the paper are:

- **Analysis of the limitations of existing methods**: The paper analyzes in detail the problems of existing approaches to building a common intermediate layer, such as the limitations of a fixed, generic IR and the high cost of re-implementing complex operations.
- **Introduction of generative programming**: By adopting generative programming techniques, in particular Lightweight Modular Staging (LMS), cross-system integration can be achieved without discarding high-level contextual information.
- **Demonstration of Flern's performance**: Flern is competitive on individual tasks (comparable to HyPer on the TPC-H benchmark, and comparable to TensorFlow and PyTorch on deep-learning models), and it achieves speedups of up to an order of magnitude on combined workloads.

### Specific Solutions

The solutions proposed in the paper include the following aspects:

1. **Constructing a common intermediate layer**: Using generative programming techniques, the paper constructs a common intermediate layer (IR) that supports both SQL / DataFrame operations and deep-learning operations. This layer retains high-level contextual information, which the compiler can exploit for optimization.
2. **Minimizing re-engineering costs**: The approach relies on generative programming to avoid large-scale re-engineering of existing systems, so cross-system integration comes at a relatively small engineering cost.
3. **Implementing cross-system optimization**: Optimizations applied at the intermediate layer reduce the overhead of data transfer and format conversion and improve overall performance.

### Conclusion

The main contribution of the paper is Flern, an intermediate-layer architecture based on generative programming that addresses the performance bottlenecks existing methods encounter when combining relational data processing and machine learning. The approach preserves the best-of-breed performance of each system and enables cross-system optimization, which significantly improves the execution efficiency of combined workloads.
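
The generative-programming idea summarized above can be illustrated with a toy sketch. This is not Flern's or LMS's actual API (LMS is a Scala framework); the `Rep` class, `stage_pipeline`, and the sample columns and weights are all hypothetical names chosen for illustration. A relational filter and a linear model are both written against one staged layer, which then emits a single fused loop, so no intermediate table is materialized between the query step and the ML step.

```python
class Rep:
    """A staged value: holds the source text of an expression, not a result."""

    def __init__(self, code):
        self.code = code

    def _binary(self, op, other):
        # Constants known at staging time are inlined into the generated code.
        rhs = other.code if isinstance(other, Rep) else repr(other)
        return Rep(f"({self.code} {op} {rhs})")

    def __add__(self, other):
        return self._binary("+", other)

    def __mul__(self, other):
        return self._binary("*", other)

    def __gt__(self, other):
        return self._binary(">", other)


def stage_pipeline(predicate, model, columns):
    """Run the filter and the model once on staged values, then emit one loop."""
    row = {c: Rep(f"row[{c!r}]") for c in columns}
    cond = predicate(row)   # relational side: builds the WHERE expression
    score = model(row)      # ML side: builds the scoring expression
    src = (
        "def fused(rows):\n"
        "    out = []\n"
        "    for row in rows:\n"
        f"        if {cond.code}:\n"
        f"            out.append({score.code})\n"
        "    return out\n"
    )
    namespace = {}
    exec(src, namespace)    # compile the generated source into a function
    return namespace["fused"], src


# Hypothetical model weights and query, for illustration only.
weights = {"age": 0.5, "income": 0.25}


def where_adult(row):
    return row["age"] > 18


def linear_model(row):
    return row["age"] * weights["age"] + row["income"] * weights["income"]


fused, source = stage_pipeline(where_adult, linear_model, ["age", "income"])
rows = [{"age": 20, "income": 100}, {"age": 10, "income": 50}]
print(source)
print(fused(rows))  # -> [35.0]
```

Because `where_adult` and `linear_model` are ordinary functions executed at code-generation time, the generator sees both operators at once and can fuse them. That is the sense in which a staged layer keeps high-level context that a fixed, generic IR would have to recover through compiler analysis.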