Abstract: Recent research has shown that collaborative computing across CPUs and GPUs in the same system can effectively accelerate large-scale SGD-based matrix factorization (MF), but its scalability is limited by parameter synchronization at the server. In theory, asynchronous methods can overcome this limitation. However, through a series of tests, observations, and analyses, we find that developing an effective asynchronous multi-CPU/GPU MF framework faces three major design challenges: underutilized CPUs, high communication overhead, and asynchronous data safety. This article presents UMA-MF, a unified multi-CPU/GPU asynchronous computing framework for SGD-based matrix factorization. UMA-MF treats the CPUs and GPUs in the system as distributed workers that train on matrix datasets in parallel and update feature parameters asynchronously. It provides a cache-friendly CPU external working mode that improves the CPU's cache hit rate and thereby makes more efficient use of the CPUs. It offers an algorithm that finds the shortest communication ring topology among heterogeneous CPU/GPU workers and builds computing-communication pipelines to minimize communication overhead. It implements a wait-free structure and load-balanced data distribution to achieve asynchronous data safety. Together, these techniques let UMA-MF accelerate SGD-based MF on multi-CPU/GPU systems asynchronously. On a physical platform with configurations ranging from a single-processor system to a 2CPUs-4GPUs system, and on five common datasets (Netflix, R1, R2, Goodreads, and de-dense), UMA-MF achieves up to 3.56x speedup over HCC-MF, the state-of-the-art synchronous multi-CPU/GPU computing framework for SGD-based MF. UMA-MF also scales well: when the system is scaled to 2CPUs-4GPUs, its training-time speedup reaches 70% to 97% of the ideal speedup.
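For readers unfamiliar with the base computation that such frameworks parallelize, the following is a minimal single-threaded sketch of SGD-based matrix factorization. It is not UMA-MF's implementation: the function names (`sgd_mf`, `rmse`), the hyperparameters, and the toy ratings are all assumptions chosen for illustration, and it contains none of the asynchronous, ring-communication, or wait-free machinery described in the abstract.

```python
import random

def sgd_mf(ratings, n_users, n_items, k=4, lr=0.05, reg=0.02, epochs=200, seed=0):
    """Plain SGD matrix factorization on (user, item, rating) triples.

    Returns user factors P (n_users x k) and item factors Q (n_items x k).
    All hyperparameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_items)]
    samples = list(ratings)
    for _ in range(epochs):
        rng.shuffle(samples)  # visit the observed ratings in random order
        for u, i, r in samples:
            pu, qi = P[u], Q[i]
            err = r - sum(a * b for a, b in zip(pu, qi))  # prediction error
            for f in range(k):
                pu_f, qi_f = pu[f], qi[f]
                # Gradient step on the L2-regularized squared error
                pu[f] += lr * (err * qi_f - reg * pu_f)
                qi[f] += lr * (err * pu_f - reg * qi_f)
    return P, Q

def rmse(ratings, P, Q):
    """Root-mean-square error of the factor model on the given triples."""
    se = sum((r - sum(a * b for a, b in zip(P[u], Q[i]))) ** 2
             for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5

# Toy data (hypothetical): 4 users, 4 items, 10 observed ratings.
toy_ratings = [
    (0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0),
    (2, 3, 5.0), (3, 2, 4.0), (3, 3, 3.0), (0, 3, 1.0), (1, 1, 2.0),
]
```

Each update touches only one row of `P` and one row of `Q`, which is why workers processing disjoint users and items can in principle proceed without blocking each other; how conflicts are actually avoided is the framework's concern and is not modeled here.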

UMA-MF: A Unified Multi-CPU/GPU Asynchronous Computing Framework for SGD-Based Matrix Factorization
