Abstract: Large Language Models (LLMs) have achieved remarkable success in various fields, but their training and finetuning require massive computation and memory, necessitating parallelism, which introduces heavy communication overheads. Driven by advances in packaging, the chiplet architecture has emerged as a potential solution, as it can integrate computing power and exploit on-package links with better signal integrity, higher bandwidth, and lower energy consumption. However, most existing chiplet-related works focus on DNN inference. Directly porting them to LLM training incurs heavy DRAM access and network-on-package (NoP) overheads that cause state-of-the-art chiplet designs to fail, highlighting a research gap. This work proposes Hecaton, a scalable and cost-effective chiplet system for LLM training. We first provide a chiplet architecture with tailored scheduling that largely reduces DRAM accesses. We further design an efficient distributed training method that reduces NoP communication complexity and relieves constraints on SRAM capacity and layout. Theoretical analysis shows that the entire system achieves weak scaling: as the workload and hardware resources grow proportionally, the computation-to-communication ratio remains nearly constant. Experiments with various workloads and hardware configurations verify this property, and Hecaton achieves a $5.29\times$ performance improvement and a $3.46\times$ energy reduction on Llama3.1-405B compared to the tensor parallelism in Megatron. To the best of our knowledge, we propose the first chiplet architecture designed specifically for LLM training or finetuning, with guaranteed performance regardless of the problem scale.

ChipAlign: Instruction Alignment in Large Language Models for Chip Design via Geodesic Interpolation

ChipNeMo: Domain-Adapted LLMs for Chip Design

Aligners: Decoupling LLMs and Alignment

Improving In-context Learning via Bidirectional Alignment

Hecaton: Training Large Language Models with Scalable Chiplet Systems

Assessing Economic Viability: A Comparative Analysis of Total Cost of Ownership for Domain-Adapted Large Language Models versus State-of-the-art Counterparts in Chip Design Coding Assistance

MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time

ChipExpert: The Open-Source Integrated-Circuit-Design-Specific Large Language Model

LongAlign: A Recipe for Long Context Alignment of Large Language Models

Mixture-of-Instructions: Comprehensive Alignment of a Large Language Model through the Mixture of Diverse System Prompting Instructions

AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability

Combining Domain and Alignment Vectors to Achieve Better Knowledge-Safety Trade-offs in LLMs

Align$^2$LLaVA: Cascaded Human and Large Language Model Preference Alignment for Multi-modal Instruction Curation

InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance

Model Merging and Safety Alignment: One Bad Model Spoils the Bunch

Human-Instruction-Free LLM Self-Alignment with Limited Samples

Robo-Instruct: Simulator-Augmented Instruction Alignment For Finetuning CodeLLMs

Aligner: Efficient Alignment by Learning to Correct

NeMo-Aligner: Scalable Toolkit for Efficient Model Alignment