Abstract: Recent research has shown that collaborative computing of CPUs and GPUs in the same system can effectively accelerate large-scale SGD-based matrix factorization (MF), but its scalability is limited by parameter synchronization in the server. In theory, asynchronous methods can overcome this shortcoming. However, through a series of tests, observations, and analyses, we find that developing an effective asynchronous multi-CPU/GPU MF framework faces three major design challenges: underutilized CPUs, high communication overhead, and asynchronous data safety. This article presents UMA-MF, a unified multi-CPU/GPU asynchronous computing framework for SGD-based matrix factorization. UMA-MF treats the CPUs and GPUs in the system as distributed workers that train on matrix datasets in parallel and update feature parameters asynchronously. It provides a cache-friendly CPU external working mode that improves the CPU cache hit rate and thereby CPU utilization. It offers an algorithm that finds the shortest communication ring topology over the heterogeneous CPU/GPU workers and builds computing-communication pipelines to minimize communication overhead. It implements a wait-free structure and load-balanced data distribution to achieve asynchronous data safety. Together, these let UMA-MF accelerate SGD-based MF on multi-CPU/GPU systems asynchronously. On a physical platform with configurations ranging from a single-processor system to a 2CPUs-4GPUs system, for five common datasets (Netflix, R1, R2, Goodreads, and de-dense), UMA-MF achieves up to 3.56x speedup over HCC-MF, the state-of-the-art multi-CPU/GPU synchronous computing framework for SGD-based MF. UMA-MF also scales well: when the system is scaled to 2CPUs-4GPUs, its training-time speedup reaches 70% to 97% of the ideal speedup.
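For readers unfamiliar with the base technique, the per-rating SGD update that frameworks of this kind parallelize can be sketched as follows. This is a minimal single-threaded illustration in Python, not UMA-MF's implementation: the function names, the hyperparameters (`k`, `lr`, `reg`), and the toy ratings are all assumptions made for illustration.

```python
import math
import random

# Hypothetical toy data: (user, item, rating) triples of a sparse 3x3 matrix.
TOY_RATINGS = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]


def sgd_mf(ratings, n_users, n_items, k=2, lr=0.05, reg=0.01,
           epochs=500, seed=0):
    """Factor sparse ratings into P (users x k) and Q (items x k) with SGD."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.0, 0.5) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0.0, 0.5) for _ in range(k)] for _ in range(n_items)]
    samples = list(ratings)
    for _ in range(epochs):
        rng.shuffle(samples)  # visit the observed ratings in random order
        for u, i, r in samples:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on the L2-regularized squared error.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def rmse(ratings, P, Q):
    """Root-mean-square error of the factorization on the given ratings."""
    total = 0.0
    for u, i, r in ratings:
        pred = sum(pf * qf for pf, qf in zip(P[u], Q[i]))
        total += (r - pred) ** 2
    return math.sqrt(total / len(ratings))
```

Each update touches only one row of `P` and one row of `Q`, which is why the work can be split across workers; the synchronization and data-safety issues the abstract mentions arise when two workers update the same row concurrently.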

An Efficient Parallel Stochastic Gradient Descent for Matrix Factorization on GPUs

An Efficient Approach of GPU-accelerated Stochastic Gradient Descent Method for Matrix Factorization

CuMF_SGD: Parallelized Stochastic Gradient Descent for Matrix Factorization on GPUs.

Gpusgd: A Gpu-Accelerated Stochastic Gradient Descent Algorithm for Matrix Factorization

A Fast Parallel Stochastic Gradient Method for Matrix Factorization in Shared Memory Systems

Efficient Matrix Factorization on Heterogeneous CPU-GPU Systems

GPUMF: A GPU-Enpowered Collaborative Filtering Algorithm Through Matrix Factorization

CuMF_SGD: Fast and Scalable Matrix Factorization.

GPU accelerated matrix factorization of large scale data using block based approach

MSGD: A Novel Matrix Factorization Approach for Large-Scale Collaborative Filtering Recommender Systems on GPUs.

A Fast Distributed Stochastic Gradient Descent Algorithm for Matrix Factorization.

A Parallel Matrix Factorization Based Recommender by Alternating Stochastic Gradient Decent

An Efficient Parallelization Approach for Large-Scale Sparse Non-Negative Matrix Factorization Using Kullback-Leibler Divergence on Multi-GPU.

Fast Asynchronous Parallel Stochastic Gradient Decent

UMA-MF: A Unified Multi-CPU/GPU Asynchronous Computing Framework for SGD-Based Matrix Factorization

Stochastic Gradient Descent for matrix completion: Hybrid parallelization on shared- and distributed-memory systems

Alternating Mixing Stochastic Gradient Descent for Large-scale Matrix Factorization

Parallel Inference for Latent Dirichlet Allocation on Graphics Processing Units.

Scaling up stochastic gradient descent for non-convex optimisation

Efficient Gradient Boosted Decision Tree Training on GPUs

Large-scale and Scalable Latent Factor Analysis via Distributed Alternative Stochastic Gradient Descent for Recommender Systems