Abstract: The recent progress of AI can be largely attributed to large language models (LLMs). However, their escalating memory requirements introduce challenges for machine learning (ML) researchers and engineers. Addressing this requires developers to partition a large model to distribute it across multiple GPUs or TPUs. This necessitates considerable coding and intricate configuration efforts with existing model parallel tools, such as Megatron-LM, DeepSpeed, and Alpa. These tools require users' expertise in machine learning systems (MLSys), creating a bottleneck in LLM development, particularly for developers without an MLSys background. In this work, we present RedCoast (Redco), a lightweight and user-friendly tool crafted to automate distributed training and inference for LLMs, as well as to simplify ML pipeline development. The design of Redco emphasizes two key aspects. Firstly, to automate model parallelism, our study identifies two straightforward rules to generate tensor parallel strategies for any given LLM. Integrating these rules into Redco facilitates effortless distributed LLM training and inference, eliminating the need for additional coding or complex configurations. We demonstrate the effectiveness by applying Redco to a set of LLM architectures, such as GPT-J, LLaMA, T5, and OPT, up to the size of 66B. Secondly, we propose a mechanism that allows for the customization of diverse ML pipelines through the definition of merely three functions, avoiding redundant and formulaic code like multi-host related processing. This mechanism proves adaptable across a spectrum of ML algorithms, from foundational language modeling to complex algorithms like meta-learning and reinforcement learning. As a result, Redco implementations exhibit significantly fewer lines of code compared to their official counterparts.
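
To make the idea of tensor parallelism mentioned in the abstract concrete, here is a minimal sketch in plain Python. It shows the commonly used column-then-row split of a two-layer feed-forward block; this is an illustration of the general technique and is not claimed to be the two rules the paper identifies. All function names are hypothetical, and the per-shard loop stands in for work that would run on separate GPUs or TPUs.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    """Element-wise ReLU."""
    return [[max(0.0, v) for v in row] for row in m]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def split_columns(m, n_shards):
    """Split a matrix into n_shards blocks of contiguous columns."""
    width = len(m[0]) // n_shards
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(n_shards)]

def split_rows(m, n_shards):
    """Split a matrix into n_shards blocks of contiguous rows."""
    height = len(m) // n_shards
    return [m[i * height:(i + 1) * height] for i in range(n_shards)]

def mlp_full(x, w1, w2):
    """Reference feed-forward block with both weights on one device."""
    return matmul(relu(matmul(x, w1)), w2)

def mlp_tensor_parallel(x, w1, w2, n_shards):
    """Same block, with w1 split by columns and w2 split by rows.

    Each shard holds 1/n_shards of both weight matrices. The final sum
    plays the role of an all-reduce across devices.
    """
    w1_shards = split_columns(w1, n_shards)
    w2_shards = split_rows(w2, n_shards)
    partials = [matmul(relu(matmul(x, a)), b) for a, b in zip(w1_shards, w2_shards)]
    out = partials[0]
    for p in partials[1:]:
        out = add(out, p)
    return out

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
x = rand_matrix(2, 4, rng)
w1 = rand_matrix(4, 8, rng)
w2 = rand_matrix(8, 4, rng)

full = mlp_full(x, w1, w2)
sharded = mlp_tensor_parallel(x, w1, w2, n_shards=4)
max_diff = max(abs(a - b) for ra, rb in zip(full, sharded) for a, b in zip(ra, rb))
```

Because ReLU acts element-wise, applying it to each column shard separately gives the same hidden activations as applying it to the full matrix, so `max_diff` should differ from zero only by floating-point rounding. The point of the split is that no single device ever has to hold all of `w1` or `w2`.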

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

The paper aims to address several key challenges in the distributed training of large language models (LLMs):

1. **High Memory Requirements**: As the number of parameters in LLMs continues to increase, the memory of a single GPU or TPU often cannot meet the model's needs, necessitating the partitioning of the model and distributed training across multiple devices.
2. **Complex Model Parallelism Techniques**: Existing model parallelism tools (such as Megatron-LM, DeepSpeed, and Alpa) provide solutions but require users to have deep knowledge of machine learning systems (MLSys) and involve a significant amount of coding and configuration work.
3. **Low Development Efficiency**: In traditional ML pipeline development, there is a lot of repetitive boilerplate code, such as backpropagation, gradient application, and batch iteration, which increases the complexity and time cost of development.

To address these challenges, the paper introduces RedCoast (Redco), a lightweight and user-friendly tool designed to automate the distributed training and inference of LLMs, simplifying the development process of ML pipelines. Specifically, Redco achieves this goal through the following two key aspects:

1. **Automatic Model Parallelism**: Redco automatically generates tensor parallelism strategies suitable for any given LLM by identifying and integrating two simple rules, thereby eliminating the need for additional coding and complex configuration.
2. **Concise ML Pipeline Development Mechanism**: Redco allows users to design ML pipelines by defining three intuitive functions, while Redco handles all underlying execution details such as data parallelism, multi-host related processing, checkpoint management, and more.

Through these designs, Redco not only improves development efficiency but also makes it easy for users without an MLSys background to use the tool.
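
The "three functions" idea can be sketched in plain Python as follows. This is not Redco's API: the summary above does not name the three functions, so `collate_fn`, `loss_fn`, and `pred_fn` are assumed names, and `fit` is a hypothetical stand-in for the framework side that owns batch iteration, gradient computation, and parameter updates. A finite-difference gradient replaces backpropagation to keep the example dependency-free.

```python
def collate_fn(examples):
    """User-defined: turn raw examples into a model-ready batch."""
    return {"x": [e["x"] for e in examples], "y": [e["y"] for e in examples]}

def pred_fn(batch, params):
    """User-defined: model predictions for a batch (a 1-D linear model here)."""
    return [params["w"] * x + params["b"] for x in batch["x"]]

def loss_fn(batch, params):
    """User-defined: mean squared error on a batch."""
    preds = pred_fn(batch, params)
    return sum((p - y) ** 2 for p, y in zip(preds, batch["y"])) / len(preds)

def numerical_grad(loss_fn, batch, params, eps=1e-6):
    """Framework side: central-difference gradient, standing in for backprop."""
    grads = {}
    for name in params:
        up = dict(params, **{name: params[name] + eps})
        down = dict(params, **{name: params[name] - eps})
        grads[name] = (loss_fn(batch, up) - loss_fn(batch, down)) / (2 * eps)
    return grads

def fit(examples, params, collate_fn, loss_fn, batch_size=4, epochs=300, lr=0.05):
    """Framework side: batch iteration and SGD updates (the boilerplate)."""
    params = dict(params)
    for _ in range(epochs):
        for start in range(0, len(examples), batch_size):
            batch = collate_fn(examples[start:start + batch_size])
            grads = numerical_grad(loss_fn, batch, params)
            for name in params:
                params[name] -= lr * grads[name]
    return params

# Toy data drawn from y = 2x + 1, so the fitted values should approach w=2, b=1.
examples = [{"x": float(x), "y": 2.0 * x + 1.0} for x in range(-4, 4)]
trained = fit(examples, {"w": 0.0, "b": 0.0}, collate_fn, loss_fn)
preds = pred_fn(collate_fn(examples[:2]), trained)
```

The user writes only the three functions at the top; everything in `numerical_grad` and `fit` is the kind of repetitive code the summary says a tool like Redco takes over, along with concerns this sketch omits entirely, such as data parallelism, multi-host processing, and checkpointing.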

RedCoast: A Lightweight Tool to Automate Distributed Training of LLMs on Any GPU/TPUs

An Optimization Toolchain Design Of Deep Learning Deployment Based On Heterogeneous Computing Platform

FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs

An Efficient 2D Method for Training Super-Large Deep Learning Models

CoLLiE: Collaborative Training of Large Language Models in an Efficient Way

CoLLM: A Collaborative LLM Inference Framework for Resource-Constrained Devices

DiLoCo: Distributed Low-Communication Training of Language Models

FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression

TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices

Pipette: Automatic Fine-grained Large Language Model Training Configurator for Real-World Clusters

Data-parallel distributed training of very large models beyond GPU capacity

Efficient Deployment of Large Language Model Across Cloud-Device Systems

Hiding Communication Cost in Distributed LLM Training via Micro-batch Co-execution

An Adaptive Placement and Parallelism Framework for Accelerating RLHF Training

Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment

Liger Kernel: Efficient Triton Kernels for LLM Training

Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference

RTiL: Real-Time Inference of Large Language Models on Memory-Constrained GPU Devices

Efficient and Economic Large Language Model Inference with Attention Offloading

DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models