Abstract: Recently, deep learning has made remarkable strides, especially in generative modeling, such as large language models and probabilistic diffusion models. However, training these models often involves significant computational resources, requiring billions of petaFLOPs. This high resource consumption results in substantial energy usage and a large carbon footprint, raising critical environmental concerns. Back-propagation (BP) is a major source of computational expense when training deep learning models. To advance research on energy-efficient training and allow for sparse learning on any machine and device, we propose a general, energy-efficient convolution module that can be seamlessly integrated into any deep learning architecture. Specifically, we introduce channel-wise sparsity with additional gradient selection schedulers during the backward pass, based on the assumption that BP is often dense and inefficient, which can lead to over-fitting and high computational consumption. Our experiments demonstrate that our approach reduces computation by 40% while potentially improving model performance, validated on image classification and generation tasks. This reduction can lead to significant energy savings and a lower carbon footprint during the research and development phases of large-scale AI systems. Additionally, our method mitigates over-fitting in a manner distinct from Dropout, allowing it to be combined with Dropout to further enhance model performance and reduce computational resource usage. Extensive experiments validate that our method generalizes to a variety of datasets and tasks and is compatible with a wide range of deep learning architectures and modules. Code is publicly available at https://github.com/lujiazho/ssProp.
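
The channel-wise gradient selection described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`channel_scores`, `sparsify_channels`, `alternating_drop_rate`), the mean-absolute-value ranking criterion, and the alternating schedule are all assumptions made for illustration. Each channel's gradient is represented as a flat list of floats.

```python
def channel_scores(grad):
    """Mean absolute gradient per channel (assumed ranking criterion)."""
    return [sum(abs(g) for g in ch) / len(ch) for ch in grad]


def sparsify_channels(grad, drop_rate):
    """Keep the highest-scoring channels and zero out the rest.

    grad: list of channels, each a non-empty flat list of floats.
    drop_rate: fraction of channels to drop, in [0, 1].
    """
    n = len(grad)
    keep = n - int(n * drop_rate)
    scores = channel_scores(grad)
    # Rank channel indices from largest to smallest score.
    ranked = sorted(range(n), key=lambda c: scores[c], reverse=True)
    kept = set(ranked[:keep])
    return [list(ch) if c in kept else [0.0] * len(ch)
            for c, ch in enumerate(grad)]


def alternating_drop_rate(epoch, target=0.5):
    """Hypothetical scheduler: dense on even epochs, sparse on odd ones."""
    return 0.0 if epoch % 2 == 0 else target


# Four channels with two gradient values each.
grad = [[0.1, -0.1], [1.0, -2.0], [0.0, 0.0], [0.5, 0.5]]
sparse = sparsify_channels(grad, alternating_drop_rate(epoch=1))
```

In this sketch the dropped channels are only zeroed. Actual compute savings would require skipping the backward computation for those channels altogether, which is beyond what this example shows.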

OASR-WFBP: An overlapping aware start-up sharing gradient merging strategy for efficient communication in distributed deep learning

OF-WFBP: A near-optimal communication mechanism for tensor fusion in distributed deep learning

MG-WFBP: Merging Gradients Wisely for Efficient Communication in Distributed Deep Learning

WBSP: Addressing Stragglers in Distributed Machine Learning with Worker-Busy Synchronous Parallel

US-Byte: an Efficient Communication Framework for Scheduling Unequal-Sized Tensor Blocks in Distributed Deep Learning

Asynchronous Proximal Stochastic Gradient Algorithm for Composition Optimization Problems

Adaptive Batchsize Selection and Gradient Compression for Wireless Federated Learning

OSP: Boosting Distributed Model Training with 2-Stage Synchronization

Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging

ACCO: Accumulate while you Communicate, Hiding Communications in Distributed LLM Training

Automated Tensor Model Parallelism with Overlapped Communication for Efficient Foundation Model Training

Prophet: Speeding Up Distributed DNN Training with Predictable Communication Scheduling

Overlapped speech recognition from a jointly learned multi-channel neural speech extraction and representation

ssProp: Energy-Efficient Training for Convolutional Neural Networks with Scheduled Sparse Back Propagation

Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates

Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization

Communication-Efficient Distributed Learning via Sparse and Adaptive Stochastic Gradient

Weighted Aggregating Stochastic Gradient Descent for Parallel Deep Learning

COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection

FLOPS: Forward Learning with OPtimal Sampling

Projection-free Online Learning with Arbitrary Delays