Deployment and Verification of Machine Learning Tool-Chain Based on Kubernetes Distributed Clusters

Cai Haoyu, Wang Chao, Zhou Xuehai
DOI: https://doi.org/10.1007/s42514-021-00065-w
2021-01-01
CCF Transactions on High Performance Computing
Abstract: In software engineering, environmental dependency conflicts are a significant problem for engineers. Containerization (Pahl 2015) was proposed to resolve such conflicts and is now widely used in cloud computing and distributed systems. At the same time, in large-scale application development and deployment, the microservice architecture (Nadareishvili et al. 2016) offers robust scalability and low coupling, so it is increasingly favored by software developers. For example, Google is one of the few companies that need to manage the deployment and development of a large number of service components on hundreds of thousands of servers. With the concepts of containerization and microservices at its core, an open-source distributed container management system called Kubernetes was developed. Kubernetes not only keeps applications fully independent of one another but also improves the utilization of hardware resources, so it is favored by Internet companies and widely used by many institutions. In recent years, machine learning applications have demanded ever more computing resources, and running machine learning tasks on a single machine is often unsustainable. Many data scientists therefore rely on distributed systems to provide sufficient computing resources for these tasks. Kubernetes supports deploying machine learning applications on distributed systems while retaining features such as containerization and microservices. As a result, developing and deploying machine learning applications on Kubernetes is favored by data scientists. Google has developed Kubeflow, a machine learning tool suite based on Kubernetes, which helps data scientists run machine learning workloads on distributed clusters.
For historical reasons, Kubeflow's support for TensorFlow is quite complete, but its support for PyTorch is less mature. In addition, although the pipeline component included in Kubeflow can build an entire machine learning workflow, this component still depends on the Google Cloud Platform. The Kubeflow pipeline is therefore not friendly to developers who are unable to lease Google Cloud services. This paper designs a complete solution to this problem and verifies its feasibility through an example, so that the Kubeflow pipeline no longer depends on Google Cloud services and better supports machine learning applications based on PyTorch. With this solution, data scientists can carry out development, demonstration, workflow construction, service deployment, operation, maintenance, and other functions for PyTorch-based deep learning applications on Kubernetes-based distributed systems. In principle, any data scientist who wants to use PyTorch to develop a machine learning project can deploy their machine learning applications on distributed systems by following this set of solutions. The proposed approach keeps the application running stably and continuously, and schedules it reasonably on a cluster containing heterogeneous computing resources.