Abstract: To alleviate hardware scarcity in training large deep neural networks (DNNs), particularly large language models (LLMs), we present FusionLLM, a decentralized training system designed and implemented for training DNNs using geo-distributed GPUs across different computing clusters or individual devices. Decentralized training faces significant challenges in system design and efficiency, including: 1) the need for remote automatic differentiation (RAD), 2) support for flexible model definitions and heterogeneous software, 3) heterogeneous hardware, which leads to low resource utilization or the straggler problem, and 4) slow network communication. To address these challenges, in the system design, we represent the model as a directed acyclic graph of operators (OP-DAG). Each node in the DAG represents an operator in the DNN, and each edge represents a data dependency between operators. Based on this design, 1) users can customize any DNN without concern for low-level operator implementations; 2) task scheduling operates on finer-grained sub-tasks, which enlarges the optimization space; 3) a DAG runtime executor can implement RAD without requiring consistent low-level ML framework versions. To enhance system efficiency, we implement a workload estimator and design an OP-Fence scheduler that clusters devices with similar bandwidths together and partitions the DAG to increase throughput. Additionally, we propose an AdaTopK compressor to adaptively compress intermediate activations and gradients at the slowest communication links. To evaluate the convergence and efficiency of our system and algorithms, we train ResNet-101 and GPT-2 on three real-world testbeds using 48 GPUs connected by networks ranging from 8 Mbps to 10 Gbps. Experimental results demonstrate that our system and method achieve a 1.45x to 9.39x speedup over baseline methods while ensuring convergence.
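
As a rough illustration of the kind of compression the abstract refers to, the sketch below shows plain top-k sparsification of an activation or gradient vector: only the k largest-magnitude entries are sent, together with their indices, and the receiver fills the rest with zeros. This is a generic technique, not the paper's AdaTopK compressor; the abstract does not specify how AdaTopK chooses its compression ratio per link, so that adaptive step is not shown. The function names `topk_compress` and `topk_decompress` and the sample values are hypothetical.

```python
import heapq


def topk_compress(values, k):
    """Keep the k largest-magnitude entries of a flat list.

    Returns (indices, kept_values), with indices in ascending order.
    Hypothetical helper: plain top-k, not the paper's AdaTopK.
    """
    if k <= 0:
        return [], []
    k = min(k, len(values))
    # Select the positions of the k entries with the largest absolute value.
    idx = heapq.nlargest(k, range(len(values)), key=lambda i: abs(values[i]))
    idx.sort()
    return idx, [values[i] for i in idx]


def topk_decompress(indices, kept, length):
    """Rebuild a dense list of the given length, zero-filling dropped entries."""
    dense = [0.0] * length
    for i, v in zip(indices, kept):
        dense[i] = v
    return dense


# Example: send 2 of 6 values across a slow link.
activations = [0.1, -2.5, 0.03, 1.7, -0.2, 0.0]
idx, vals = topk_compress(activations, k=2)
restored = topk_decompress(idx, vals, len(activations))
```

Sending indices alongside values means the payload shrinks only when k is well below the vector length, which is why such schemes are typically reserved for the slowest links.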

FLNA: Flexibly Accelerating Feature Learning Networks for Large-Scale Point Clouds with Efficient Dataflow Decoupling

Fast Point Cloud Sampling Network

Adaptive Recurrent Forward Network for Dense Point Cloud Completion

FG-Net: A Fast and Accurate Framework for Large-Scale LiDAR Point Cloud Understanding

SFL-NET: Slight Filter Learning Network for Point Cloud Semantic Segmentation

An Efficient FPGA Accelerator for Point Cloud

FusionArch: A Fusion-Based Accelerator for Point-Based Point Cloud Neural Networks

FLBooster: A Unified and Efficient Platform for Federated Learning Acceleration

FINet: Fast Point Cloud Interpolation Network Via Distance Transform

A Lightweight Network for Point Cloud Analysis via the Fusion of Local Features and Distribution Characteristics

PointGL: A Simple Global-Local Framework for Efficient Point Cloud Analysis

FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression

LPFE-Net: a local parallel feature extraction network for large-scale point cloud semantic segmentation

L-FNNG: Accelerating Large-Scale KNN Graph Construction on CPU-FPGA Heterogeneous Platform

FPS-Net: A Convolutional Fusion Network for Large-Scale LiDAR Point Cloud Segmentation

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

Accelerating DNN-based 3D point cloud processing for mobile computing

AFpoint: adaptively fusing local and global features for point cloud

A 28-Nm Energy-Efficient Sparse Neural Network Processor for Point Cloud Applications Using Block-Wise Online Neighbor Searching