Abstract: Deep neural networks (DNNs) have attracted tremendous attention as compelling solutions for applications such as image classification, object detection, and speech recognition. Their success depends on extensive training to ensure that model accuracy is good enough for those applications. Training a DNN model has become challenging because 1) model size and data size keep increasing, which usually requires more training iterations; and 2) DNN algorithms evolve rapidly, which requires the training phase to be short so that models can be deployed quickly. To address these challenges, distributed training platforms have been proposed that leverage large numbers of server nodes, with the hope of significantly reducing training time. Scalability is therefore a critical performance metric for evaluating a distributed training platform. Nevertheless, our analysis reveals that traditional server clusters scale poorly for training because of traffic congestion within the server and beyond. Intra-server traffic on the I/O fabric can cause severe congestion and skewed quality of service as high-performance devices compete with each other. Moreover, traffic congestion on the Ethernet used for inter-server communication can also incur significant performance degradation. In this work, we devise a novel distributed training platform, EFLOPS, that adopts an algorithm and system co-design methodology to achieve good scalability. A new server architecture is proposed to alleviate intra-server congestion. Moreover, a new network topology, BiGraph, is proposed that divides the network into two separate parts, so that there is always a direct connection between any two nodes from different parts. Finally, to accompany BiGraph, a topology-aware allreduce algorithm is proposed to eliminate traffic congestion on the direct connections.
The experimental results show that eliminating congestion on the network interface yields up to 11.3x communication speedup. The proposed algorithm and topology provide a further improvement of up to 6.08x. Overall, ResNet-50 training achieves near-linear scalability and is competitive with the top-ranked MLPerf results.
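
The idea of pairing a two-part topology with a matching allreduce can be illustrated with a small simulation. The sketch below is not the algorithm from this work; it is a textbook recursive-doubling allreduce combined with an assumed placement rule (`part_of`, a hypothetical helper) that puts each rank in one of two parts by the parity of its set bits. Under that assumption, every exchange partner differs in exactly one bit, so every message travels between the two parts, where the abstract says a direct connection always exists.

```python
def part_of(rank: int) -> int:
    """Assumed placement rule: part 0 or 1, by parity of set bits in the rank."""
    return bin(rank).count("1") % 2


def has_direct_link(a: int, b: int) -> bool:
    """In this toy model, nodes in different parts are directly connected."""
    return part_of(a) != part_of(b)


def recursive_doubling_allreduce(values):
    """Sum-allreduce over a power-of-two number of simulated ranks."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("number of ranks must be a power of two")
    current = list(values)
    step = 1
    while step < n:
        nxt = list(current)
        for rank in range(n):
            peer = rank ^ step  # partner differs in exactly one bit
            # Flipping one bit flips the parity, so the peer is in the other part.
            assert has_direct_link(rank, peer)
            nxt[rank] = current[rank] + current[peer]
        current = nxt
        step <<= 1
    return current


result = recursive_doubling_allreduce([1, 2, 3, 4, 5, 6, 7, 8])
print(result)
```

After log2(n) rounds every simulated rank holds the full sum. The placement rule is what makes the sketch topology-aware: a different rank-to-part assignment could pair two nodes from the same part, which have no guaranteed direct connection in this model.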