Transactional Python for Durable Machine Learning: Vision, Challenges, and Feasibility

Supawit Chockchowwat, Zhaoheng Li, Yongjoo Park
DOI: https://doi.org/10.1145/3595360.3595855
2023-05-16
Abstract: In machine learning (ML), Python serves as a convenient abstraction for working with key libraries such as PyTorch, scikit-learn, and others. Unlike DBMS, however, Python applications may lose important data, such as trained models and extracted features, due to machine failures or human errors, leading to a waste of time and resources. Specifically, they lack four essential properties that could make ML more reliable and user-friendly: durability, atomicity, replicability, and time-versioning (DART). This paper presents our vision of Transactional Python that provides DART without any code modifications to user programs or the Python kernel, by non-intrusively monitoring application states at the object level and determining a minimal amount of information sufficient to reconstruct a whole application. Our evaluation of a proof-of-concept implementation with public PyTorch and scikit-learn applications shows that DART can be offered with overheads ranging from 1.5% to 15.6%.
Databases, Machine Learning, Programming Languages
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper addresses the lack of durability, atomicity, replicability, and time-versioning (DART) in machine learning (ML) applications written in Python. Such applications may lose important data, such as trained models and extracted features, when a machine fails or a user makes a mistake, wasting time and resources. Unlike database management systems (DBMS), Python applications lack four key properties:

1. **Durability**: The application state is not automatically persisted, so it can be lost after unexpected errors or failures.
2. **Atomicity**: An interrupted function may leave partial updates behind instead of completing as an all-or-nothing transaction.
3. **Replicability**: The running application state is difficult to replicate to another machine and recover there.
4. **Time-Versioning**: The application cannot be restored to a past state.

To address these issues, the paper proposes a framework, Transactional Python, that provides DART by non-intrusively monitoring application state at the object level and determining a minimal amount of information sufficient to reconstruct the whole application, without requiring any code modifications to user programs or the Python kernel. An evaluation of a proof-of-concept implementation on public PyTorch and scikit-learn applications indicates that DART can be offered with overheads ranging from 1.5% to 15.6%.
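To make object-level state tracking more concrete, here is a minimal sketch in Python. It is not the authors' implementation and does not reflect how Transactional Python monitors state. The `CheckpointStore` class, its methods, and the file naming scheme are all hypothetical, and the sketch requires explicit `commit` calls, which the paper's approach avoids. It illustrates three of the ideas: only variables whose serialized bytes are not yet stored get written (a crude stand-in for "minimal information"), each file is written atomically through a temporary file and a rename, and every commit records a manifest so an earlier version can be restored.

```python
import hashlib
import os
import pickle
import tempfile


class CheckpointStore:
    """Toy object-level checkpoint store (hypothetical, for illustration only)."""

    def __init__(self, directory):
        self.directory = directory
        self.next_version = 0
        os.makedirs(directory, exist_ok=True)

    def _write_atomically(self, filename, payload):
        # Write to a temporary file, then rename it into place, so a crash
        # leaves either the complete file or no file under the final name.
        fd, tmp_path = tempfile.mkstemp(dir=self.directory)
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, os.path.join(self.directory, filename))

    def commit(self, namespace):
        """Persist variables whose bytes are not stored yet; return (version, written)."""
        manifest, written = {}, []
        for name, value in namespace.items():
            payload = pickle.dumps(value)
            digest = hashlib.sha256(payload).hexdigest()
            filename = f"{name}-{digest[:16]}.pkl"
            # Comparing serialized bytes is conservative: it may rewrite an
            # object that is logically unchanged but pickles differently.
            if not os.path.exists(os.path.join(self.directory, filename)):
                self._write_atomically(filename, payload)
                written.append(name)
            manifest[name] = filename
        version = self.next_version
        # The manifest is written last, so a version only becomes visible
        # after all of the blobs it refers to are on disk.
        self._write_atomically(f"manifest-{version}.pkl", pickle.dumps(manifest))
        self.next_version += 1
        return version, written

    def restore(self, version):
        """Rebuild the namespace recorded by an earlier commit."""
        with open(os.path.join(self.directory, f"manifest-{version}.pkl"), "rb") as f:
            manifest = pickle.load(f)
        state = {}
        for name, filename in manifest.items():
            with open(os.path.join(self.directory, filename), "rb") as f:
                state[name] = pickle.load(f)
        return state


if __name__ == "__main__":
    store = CheckpointStore(tempfile.mkdtemp())
    variables = {"weights": [0.1, 0.2], "epoch": 1}
    first, _ = store.commit(variables)
    variables["epoch"] = 2
    second, written = store.commit(variables)
    print("rewritten in second commit:", written)
    print("restored first version:", store.restore(first))
```

A real system would have to handle objects that cannot be pickled, references shared between variables, and change detection that does not require serializing every object on every commit. The sketch ignores all three.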