PyTond: Efficient Python Data Science on the Shoulders of Databases

Hesam Shahrokhi, Amirali Kaboli, Mahdi Ghorbani, Amir Shaikhha

2024-07-16

Abstract: Python data science libraries such as Pandas and NumPy have recently gained immense popularity. Although these libraries are feature-rich and easy to use, their scalability limitations require more robust computational resources. In this paper, we present PyTond, an efficient approach to push the processing of data science workloads down into the database engines that are already known for their big data handling capabilities. Compared to the previous work, by introducing TondIR, our approach can capture a more comprehensive set of workloads and data layouts. Moreover, by doing IR-level optimizations, we generate better SQL code that improves the query processing by the underlying database engine. Our evaluation results show promising performance improvement compared to Python and other alternatives for diverse data science workloads.

Databases, Programming Languages

What problem does this paper attempt to address?

The main problem this paper attempts to address is the scalability limitations of Python data science libraries (such as Pandas and NumPy) when handling large-scale datasets. Although these libraries are feature-rich and easy to use, their interpreted execution mode leads to performance bottlenecks, especially when processing large datasets that require more computational resources.

To solve this problem, the paper proposes the PyTond framework, which improves performance by converting Pandas and NumPy workloads into SQL code and leveraging the processing capabilities of database engines. Specifically, the design goals of PyTond include:

1. **Broad Coverage**: Support for various APIs of Pandas (relational algebra) and NumPy (linear algebra), as well as their mixed workloads.
2. **Optimization of Generated SQL Code**: Optimization through an intermediate representation (TondIR) to generate SQL code that is better suited for the underlying query engine.
3. **Support for Multiple Data Layouts**: Support for both sparse and dense data layouts to meet different types of data processing needs.
4. **Automated Translation Process**: Users can achieve automated translation from Python to SQL by simply adding the `@pytond` decorator to functions, without modifying program logic or imported libraries.

Through the above design, PyTond aims to significantly improve the execution efficiency of Python data science workloads on large-scale datasets.
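To make the idea of pushing Python workloads into a database more concrete, here is a minimal sketch using only the standard-library `sqlite3` module. It does not use PyTond or TondIR. The SQL is written by hand and is not claimed to match what PyTond generates, and the table names, column names, and data are all hypothetical. It shows a relational workload (filter, group, sum) and a linear-algebra workload (matrix-vector product over a sparse coordinate layout) expressed as SQL and checked against plain Python.

```python
import sqlite3

# Hypothetical toy data; nothing here comes from the paper.
orders = [("a", 10.0), ("b", 5.0), ("a", 7.5), ("c", 1.0), ("b", 20.0)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)", orders)

# Relational workload. In Pandas this would be roughly:
#   df[df.amount > 2].groupby("customer").amount.sum()
# Hand-written SQL counterpart (an assumption, not PyTond output):
totals_sql = dict(con.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "WHERE amount > 2 GROUP BY customer"
).fetchall())

# Pure-Python reference for the same computation.
totals_py = {}
for customer, amount in orders:
    if amount > 2:
        totals_py[customer] = totals_py.get(customer, 0.0) + amount

# Linear-algebra workload on a sparse layout: one row per non-zero
# entry (row index i, column index j, value). The matrix is 2x3.
matrix = [(0, 0, 2.0), (0, 2, 1.0), (1, 1, 5.0)]
vector = [(0, 1.0), (1, 2.0), (2, 4.0)]

con.execute("CREATE TABLE m (i INTEGER, j INTEGER, val REAL)")
con.execute("CREATE TABLE v (j INTEGER, val REAL)")
con.executemany("INSERT INTO m VALUES (?, ?, ?)", matrix)
con.executemany("INSERT INTO v VALUES (?, ?)", vector)

# Matrix-vector product as a join on the shared index plus a grouped sum.
matvec_sql = dict(con.execute(
    "SELECT m.i, SUM(m.val * v.val) FROM m "
    "JOIN v ON m.j = v.j GROUP BY m.i"
).fetchall())

# Pure-Python reference for the same product.
vec = dict(vector)
matvec_py = {}
for i, j, val in matrix:
    matvec_py[i] = matvec_py.get(i, 0.0) + val * vec[j]

print(totals_sql, matvec_sql)
```

In this sketch the database engine does the filtering, grouping, and joining, while Python only issues queries and collects results. A dense layout would instead store one table row per matrix row with one column per matrix column, which trades storage of zeros for simpler column-wise expressions.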
