Abstract: Environmental dependency conflicts are a significant problem for software engineers. Containerization (Pahl 2015) was proposed to resolve these conflicts and is now widely used in cloud computing and distributed systems. At the same time, in large-scale application development and deployment, the microservice architecture (Nadareishvili et al. 2016) offers robust scalability and low coupling, and it is increasingly favored by software developers. Google, for example, is one of the few companies that must manage the development and deployment of a large number of service components on hundreds of thousands of servers. With containerization and microservices at its core, an open-source distributed container management system called Kubernetes was developed. Kubernetes keeps applications fully independent of one another while improving the utilization of hardware resources, so it is favored by Internet companies and widely used by many institutions. In recent years, the demand for computing resources from machine learning applications has been increasing, and a single machine often cannot sustain machine learning tasks. Many data scientists therefore rely on distributed systems to provide sufficient computing resources. Kubernetes can host machine learning applications and supports their deployment on distributed systems, while retaining the benefits of containerization and microservices. As a result, developing and deploying machine learning applications on Kubernetes is favored by data scientists. Google has developed Kubeflow, a machine learning tool suite based on Kubernetes, which helps data scientists run machine learning workloads on distributed clusters.
For historical reasons, Kubeflow’s support for TensorFlow is quite complete, but its support for PyTorch is not. In addition, although the pipeline component included in Kubeflow can build an entire machine learning workflow, this component still depends on the Google Cloud Platform. The Kubeflow pipeline is therefore not friendly to developers who are unable to lease Google Cloud services. This paper designs a complete solution to this problem and verifies its feasibility through an example, so that the Kubeflow pipeline no longer depends on Google Cloud services and PyTorch-based machine learning applications are better supported. With this solution, data scientists can carry out the development, demonstration, workflow construction, service deployment, operation, and maintenance of PyTorch-based deep learning applications on Kubernetes-based distributed systems. In principle, any data scientist who wants to use PyTorch to develop a machine learning project can deploy the application on a distributed system by following this solution. The proposed approach keeps the application running stably and continuously, and schedules it reasonably on clusters containing heterogeneous computing resources.

Kubeflow-based Automatic Data Processing Service for Data Center of State Grid Scenario

Electric Load Data Compression and Classification Based on Deep Stacked Auto-Encoders

An Efficient Sampling and Classification Approach for Flow Detection in SDN-Based Big Data Centers

Load Data Mining Based on Deep Learning Method

Deployment and Verification of Machine Learning Tool-Chain Based on Kubernetes Distributed Clusters

Bigflow: A General Optimization Layer for Distributed Computing Frameworks

Design and Implementation of Power Big Data Platform

Data-Intensive Application Deployment at Edge: A Deep Reinforcement Learning Approach

E3: an Elastic Execution Engine for Scalable Data Processing.

Power Grid Data Monitoring and Analysis System based on Edge Computing

Deep Learning on Operational Facility Data Related to Large-Scale Distributed Area Scientific Workflows

AutoDDL: Automatic Distributed Deep Learning With Near-Optimal Bandwidth Cost

A Data-Driven Architecture Design of Stream Computing for the Dispatch and Control System of the Power Grid

Toward Building Edge Learning Pipelines

Analyzing large-scale Data Cubes with user-defined algorithms: A cloud-native approach

An Integrated Data Preprocessing Framework Based on Apache Spark for Fault Diagnosis of Power Grid Equipment

DPDPU: Data Processing with DPUs

Data pipeline for real-time energy consumption data management and prediction

Application of the Kubeflow Tool for the Integration of Machine Learning and Artificial Intelligence in Unmanned Aerial Vehicle

PowerAI DDL

ElasticFlow: an Elastic Serverless Training Platform for Distributed Deep Learning.