Abstract: Latent Dirichlet Allocation (LDA) is a widely used machine learning technique for big data analysis. An LDA application runs an inference algorithm that iteratively updates a model until it converges. A major challenge in parallelizing LDA is scaling: the model is huge, and parallel workers must communicate it continually. We identify three important features of the model in parallel LDA computation: (1) the volume of model parameters required for local computation is high; (2) the time complexity of local computation is proportional to the required model size; (3) the model size shrinks as the model converges. By investigating collective and asynchronous methods for model communication in different tools, we find that optimized collective communication can improve the model update speed, allowing the model to converge faster. The performance improvement derives not only from accelerated communication but also from reduced per-iteration computation time, because the model size shrinks as the model converges. To foster faster model convergence, we design new collective communication abstractions and implement two Harp-LDA applications, lgs and rtt. We compare our approach with Yahoo! LDA and Petuum LDA, two leading implementations that favor asynchronous communication methods, on a 100-node, 4000-thread Intel Haswell cluster. The experiments show that lgs reaches higher model likelihood than Yahoo! LDA in shorter or similar execution time, while rtt runs up to 3.9 times faster than Petuum LDA when achieving similar model likelihood.
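The collective style of model communication described in the abstract can be illustrated with a minimal single-process sketch. It is not the Harp-LDA code and is neither lgs nor rtt: the function names (`allreduce_sum`, `local_sweep`, `train`), the toy corpus, and the hyperparameters are assumptions made for illustration, and the "allreduce" is simulated with a plain element-wise sum rather than a real collective operation. Each simulated worker runs a collapsed Gibbs sweep against the same snapshot of the word-topic counts, and the per-worker count deltas are then summed and applied in one synchronized step. The number of nonzero model entries is recorded per iteration so that the "model size shrinks" observation can be inspected; on a corpus this small the trend may not be visible.

```python
import random

def allreduce_sum(deltas):
    """Element-wise sum of per-worker delta matrices (stand-in for a collective allreduce)."""
    rows, cols = len(deltas[0]), len(deltas[0][0])
    return [[sum(d[w][k] for d in deltas) for k in range(cols)] for w in range(rows)]

def local_sweep(docs, z, doc_topic, model, V, K, alpha, beta, rng):
    """One collapsed Gibbs sweep over a worker's documents.

    The worker samples against a private copy of the model snapshot and
    returns only the change (delta) it made to the word-topic counts.
    """
    local = [row[:] for row in model]
    totals = [sum(local[w][k] for w in range(V)) for k in range(K)]
    delta = [[0] * K for _ in range(V)]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            old = z[d][i]
            # Remove the token's current assignment from the counts.
            doc_topic[d][old] -= 1
            local[w][old] -= 1
            totals[old] -= 1
            # Conditional topic weights for this token (unnormalized).
            weights = [
                (doc_topic[d][k] + alpha) * (local[w][k] + beta) / (totals[k] + V * beta)
                for k in range(K)
            ]
            new = rng.choices(range(K), weights=weights)[0]
            z[d][i] = new
            doc_topic[d][new] += 1
            local[w][new] += 1
            totals[new] += 1
            delta[w][old] -= 1
            delta[w][new] += 1
    return delta

def train(shards, V, K, iterations=20, alpha=0.1, beta=0.01, seed=0):
    """Synchronized training loop: local sweeps, then one collective model update."""
    rng = random.Random(seed)
    model = [[0] * K for _ in range(V)]  # global word-topic counts
    z, doc_topic = [], []
    for docs in shards:  # random initial topic assignments
        zs, dts = [], []
        for doc in docs:
            assign = [rng.randrange(K) for _ in doc]
            counts = [0] * K
            for w, k in zip(doc, assign):
                model[w][k] += 1
                counts[k] += 1
            zs.append(assign)
            dts.append(counts)
        z.append(zs)
        doc_topic.append(dts)
    nonzeros = []
    for _ in range(iterations):
        # Every worker sees the same model snapshot for this iteration.
        deltas = [
            local_sweep(shards[p], z[p], doc_topic[p], model, V, K, alpha, beta, rng)
            for p in range(len(shards))
        ]
        total = allreduce_sum(deltas)
        for w in range(V):
            for k in range(K):
                model[w][k] += total[w][k]
        nonzeros.append(sum(1 for row in model for c in row if c > 0))
    return model, z, nonzeros

# Toy corpus: two workers, two documents each, word ids in [0, 6).
shards = [
    [[0, 1, 2, 0, 1], [0, 0, 1, 2]],
    [[3, 4, 5, 3, 4], [5, 5, 3, 4]],
]
model, z, nonzeros = train(shards, V=6, K=3)
```

Because each worker only moves its own tokens between topics, the summed deltas leave the global counts exactly consistent with the current topic assignments after every synchronized update. Asynchronous schemes relax that guarantee.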


High Performance LDA Through Collective Model Communication Optimization
