BigDL 2.0: Seamless Scaling of AI Pipelines from Laptops to Distributed Cluster

Jason Dai, Ding Ding, Dongjie Shi, Shengsheng Huang, Jiao Wang, Xin Qiu, Kai Huang, Guoqiong Song, Yang Wang, Qiyuan Gong, Jiaming Song, Shan Yu, Le Zheng, Yina Chen, Junwei Deng, Ge Song
DOI: https://doi.org/10.48550/arXiv.2204.01715
2022-04-19
Abstract: Most AI projects start with a Python notebook running on a single laptop; however, one usually needs to go through a mountain of pains to scale it to handle larger datasets (for both experimentation and production deployment). These usually entail many manual and error-prone steps for the data scientists to fully take advantage of the available hardware resources (e.g., SIMD instructions, multi-processing, quantization, memory allocation optimization, data partitioning, distributed computing, etc.). To address this challenge, we have open sourced BigDL 2.0 at https://github.com/intel-analytics/BigDL/ under Apache 2.0 license (combining the original BigDL and Analytics Zoo projects); using BigDL 2.0, users can simply build conventional Python notebooks on their laptops (with possible AutoML support), which can then be transparently accelerated on a single node (with up to 9.6x speedup in our experiments), and seamlessly scaled out to a large cluster (across several hundred servers in real-world use cases). BigDL 2.0 has already been adopted by many real-world users (such as Mastercard, Burger King, Inspur, etc.) in production.
Machine Learning
What problem does this paper attempt to address?
The paper addresses the following problem: how to seamlessly scale an AI project from a Python notebook on a single laptop to a distributed cluster that can handle larger datasets, for both large-scale experimentation and production deployment. Specifically, it aims to remove the complex, error-prone manual steps that data scientists currently need in order to fully use the available hardware resources.

### Problem Background

Most AI projects start as a Python notebook running on a single laptop. Once the project has to handle larger datasets, data scientists must work through a series of manual steps to take advantage of the hardware. These steps include:

- Optimization using SIMD instructions
- Multiprocessing parallelization
- Model quantization
- Memory allocation optimization
- Data partitioning
- Distributed computing

These manual steps are both complex and error-prone, which makes projects harder to develop and maintain.

### Solution

To address these challenges, the authors open-sourced BigDL 2.0, which combines the original BigDL and Analytics Zoo projects. Its main features are:

1. **Transparent Acceleration**: Users build conventional Python notebooks with standard APIs on their laptops, and BigDL 2.0 transparently accelerates them on a single node, with a speedup of up to 9.6x in the authors' experiments.
2. **Seamless Scaling**: BigDL 2.0 scales the AI pipeline out to large clusters (several hundred servers in real-world use cases) without invasive code modifications.
3. **End-to-End Pipeline Optimization**: BigDL 2.0 optimizes the entire AI pipeline, including data preprocessing, feature transformation, hyperparameter tuning, model training and inference, and model optimization and deployment.
4. **Automated Machine Learning (AutoML) Support**: Built-in AutoML support helps users automate hyperparameter search and improves model development efficiency.

### Implementation Method

BigDL 2.0 achieves these goals through two main libraries:

- **BigDL-Nano**: Transparently accelerates the AI pipeline on a single node by integrating optimization techniques such as SIMD instructions, multiprocessing, quantization, and memory allocation optimization.
- **BigDL-Orca**: Seamlessly scales out AI applications by automatically configuring distributed data processing and AI systems (such as Apache Spark and Ray) and by running data-parallel processing, model training, and inference efficiently in a distributed environment.

### Conclusion

With BigDL 2.0, users can build AI pipelines in notebooks on their laptops and scale them out to large distributed clusters without rewriting them, which makes processing large datasets more efficient. The toolkit has been adopted in production by real-world users such as Mastercard, Burger King, and Inspur.
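
The data partitioning and data-parallel training mentioned in the summary can be made concrete with a minimal sketch. The code below is not BigDL's API. It is a standard-library illustration of the general technique, under these assumptions: the function names (`partition`, `shard_gradient`, `data_parallel_step`, `fit`) are hypothetical, the model is a one-parameter linear fit `y = w * x`, and a thread pool stands in for the cluster workers that a system such as Spark or Ray would provide.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_shards):
    """Split data into num_shards contiguous, nearly equal shards."""
    size, rem = divmod(len(data), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        # The first `rem` shards each take one extra element.
        end = start + size + (1 if i < rem else 0)
        shards.append(data[start:end])
        start = end
    return shards


def shard_gradient(shard, w):
    """Return (gradient sum, sample count) for the loss 0.5 * (w*x - y)**2."""
    grad = sum((w * x - y) * x for x, y in shard)
    return grad, len(shard)


def data_parallel_step(data, w, lr, num_workers):
    """One synchronous step: partition, compute per-shard gradients, aggregate."""
    shards = partition(data, num_workers)
    # Threads are only a stand-in for remote workers in this sketch.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(lambda s: shard_gradient(s, w), shards))
    total_grad = sum(g for g, _ in results)
    total_n = sum(n for _, n in results)
    # Averaging over all samples makes the update independent of the shard count.
    return w - lr * total_grad / total_n


def fit(data, num_workers, steps=200, lr=0.01):
    """Run gradient descent from w = 0 using data-parallel steps."""
    w = 0.0
    for _ in range(steps):
        w = data_parallel_step(data, w, lr, num_workers)
    return w


# Toy dataset generated from y = 3 * x.
data = [(float(x), 3.0 * x) for x in range(1, 11)]
w_single = fit(data, num_workers=1)
w_parallel = fit(data, num_workers=4)
```

Because each step sums the per-shard gradients and divides by the total sample count, the result matches single-worker training up to floating-point rounding. That property is what lets the same pipeline run unchanged on one node or many.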