PyTond: Efficient Python Data Science on the Shoulders of Databases

Hesam Shahrokhi, Amirali Kaboli, Mahdi Ghorbani, Amir Shaikhha
2024-07-16
Abstract: Python data science libraries such as Pandas and NumPy have recently gained immense popularity. Although these libraries are feature-rich and easy to use, their scalability limitations require more robust computational resources. In this paper, we present PyTond, an efficient approach to push the processing of data science workloads down into the database engines that are already known for their big data handling capabilities. Compared to the previous work, by introducing TondIR, our approach can capture a more comprehensive set of workloads and data layouts. Moreover, by doing IR-level optimizations, we generate better SQL code that improves the query processing by the underlying database engine. Our evaluation results show promising performance improvement compared to Python and other alternatives for diverse data science workloads.
Databases, Programming Languages
What problem does this paper attempt to address?
The paper addresses the scalability limitations of Python data science libraries such as Pandas and NumPy on large datasets. These libraries are feature-rich and easy to use, but their interpreted execution mode creates performance bottlenecks, and large workloads end up demanding more computational resources. PyTond tackles this by translating Pandas and NumPy workloads into SQL and running them on database engines that are already built to handle large volumes of data.

The design goals of PyTond are:

1. **Broad Coverage**: Support for a wide range of Pandas APIs (relational algebra) and NumPy APIs (linear algebra), as well as workloads that mix the two.
2. **Optimization of Generated SQL Code**: Optimizations applied on an intermediate representation (TondIR), so that the generated SQL is better suited to the underlying query engine.
3. **Support for Multiple Data Layouts**: Support for both sparse and dense data layouts, to cover different kinds of data processing.
4. **Automated Translation Process**: Users obtain the Python-to-SQL translation by adding the `@pytond` decorator to a function, without modifying the program logic or the imported libraries.

With this design, PyTond aims to significantly improve the execution efficiency of Python data science workloads on large datasets.
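To make the idea of pushing a workload down into a database engine concrete, here is a minimal sketch that computes the same filter-and-aggregate result twice: once row by row in Python, and once as a single SQL query. It does not use PyTond and does not reproduce its generated SQL or TondIR. The table name `sales`, the sample rows, and both function names are hypothetical, and SQLite from the standard library stands in for whatever engine a real deployment would target.

```python
import sqlite3

# Hypothetical toy data: (region, amount) rows.
rows = [
    ("east", 10.0),
    ("west", 5.0),
    ("east", 2.5),
    ("west", 7.5),
    ("north", 1.0),
]


def total_by_region_python(rows, min_amount):
    """In-process version: filter, then group and sum, one row at a time."""
    totals = {}
    for region, amount in rows:
        if amount >= min_amount:
            totals[region] = totals.get(region, 0.0) + amount
    return dict(sorted(totals.items()))


def total_by_region_sql(rows, min_amount):
    """Pushed-down version: the engine performs the filter and aggregation."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, SUM(amount) "
            "FROM sales "
            "WHERE amount >= ? "
            "GROUP BY region "
            "ORDER BY region",
            (min_amount,),
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


if __name__ == "__main__":
    print(total_by_region_python(rows, 2.0))
    print(total_by_region_sql(rows, 2.0))
```

Both functions return the same mapping for the same inputs. The difference is where the work happens: in the second version the engine's query planner and execution runtime carry out the filter and aggregation, which is the kind of delegation the summary above describes.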