Abstract: Recent research has shown that collaborative computing across the CPUs and GPUs of a single system can effectively accelerate large-scale SGD-based matrix factorization (MF), but its scalability is limited by parameter synchronization at the server. In theory, asynchronous methods can overcome this limitation. However, through a series of tests, observations, and analyses, we find that developing an effective asynchronous multi-CPU/GPU MF framework faces several major design challenges: underutilized CPUs, high communication overhead, and asynchronous data safety. This article presents UMA-MF, a unified multi-CPU/GPU asynchronous computing framework for SGD-based matrix factorization. UMA-MF treats the CPUs and GPUs in the system as distributed workers that train on matrix datasets in parallel and update feature parameters asynchronously. It provides a cache-friendly CPU external working mode that improves the CPU's cache hit rate and thereby makes more efficient use of the CPUs. It offers an algorithm that finds the shortest communication ring topology among heterogeneous CPU/GPU workers and builds computing-communication pipelines to minimize communication overhead. It implements a wait-free structure and load-balanced data distribution to achieve asynchronous data safety. Together, these techniques let UMA-MF accelerate SGD-based MF on multi-CPU/GPU systems asynchronously. On a physical platform with configurations ranging from a single-processor system to a 2CPUs-4GPUs system, across five common datasets (Netflix, R1, R2, Goodreads, and de-dense), UMA-MF achieves up to 3.56x speedup over HCC-MF, the state-of-the-art multi-CPU/GPU synchronous computing framework for SGD-based MF. UMA-MF also scales well: when the system is scaled to 2CPUs-4GPUs, its training-time speedup reaches 70% to 97% of the ideal speedup.
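For readers unfamiliar with the underlying computation, the following is a minimal, single-threaded sketch of textbook SGD-based matrix factorization. It is not the UMA-MF implementation and includes none of its asynchronous, ring-communication, or load-balancing machinery. The function names (`sgd_mf`, `rmse`), the hyperparameters, and the toy rating data are all assumptions chosen for illustration.

```python
import random


def sgd_mf(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02, epochs=200, seed=0):
    """Factor sparse ratings [(user, item, value), ...] into P (users x k) and Q (items x k).

    Hypothetical helper: plain SGD with L2 regularization, one worker, no parallelism.
    """
    rng = random.Random(seed)
    # Small random initialization of the two feature matrices.
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    samples = list(ratings)
    for _ in range(epochs):
        rng.shuffle(samples)  # visit the observed entries in random order
        for u, i, r in samples:
            pred = sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on the squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def rmse(ratings, P, Q):
    """Root-mean-square error of the factorization on the observed entries."""
    se = sum((r - sum(a * b for a, b in zip(P[u], Q[i]))) ** 2 for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5


if __name__ == "__main__":
    # Toy data (assumed): 3 users, 3 items, 6 observed ratings.
    toy = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
    P, Q = sgd_mf(toy, n_users=3, n_items=3)
    print("training RMSE:", rmse(toy, P, Q))
```

Each update touches only one row of `P` and one row of `Q`, which is why the rating matrix can be partitioned across workers. Frameworks such as the one described above differ in how those workers exchange and synchronize the updated feature parameters.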

A Time-Cost Based Automatic Scheduling Framework for Matrix Computation on Various Distributed Computing Platforms

Distributed High-Dimension Matrix Operation Optimization on Spark

A parallel computing method for irregular work

Towards Efficient Scheduling of Federated Mobile Devices under Computational and Statistical Heterogeneity

A Deadline And Budget Constrained Cost-Time Optimization Algorithm For Scheduling Dependent Tasks In Grid Computing

Improving Execution Concurrency of Large-Scale Matrix Multiplication on Distributed Data-Parallel Platforms

Efficient Large Scale Distributed Matrix Computation with Spark

Communication-Efficient Task Scheduling for Real-Time Distributed Computing

Distributed matrix computing system for big data

A Novel Multi-CPU/GPU Collaborative Computing Framework for SGD-based Matrix Factorization

Learning scheduling algorithms for data processing clusters

Exploiting Matrix Dependency For Efficient Distributed Matrix Computation

Julia Cloud Matrix Machine: Dynamic Matrix Language Acceleration on Multicore Clusters in the Cloud

A Deadline and Budget Constrained Cost-Time Optimisation Algorithm for Scheduling Task Farming Applications on Global Grids

Cascaded Prediction and Asynchronous Execution of Iterative Algorithms on Heterogeneous Platforms

Unified Programming Model and Software Framework for Big Data Machine Learning and Data Analytics

Magas: matrix-based asynchronous graph analytics on shared memory systems

UMA-MF: A Unified Multi-CPU/GPU Asynchronous Computing Framework for SGD-Based Matrix Factorization

Increasing the Efficiency of Sparse Matrix-Matrix Multiplication with a 2.5D Algorithm and One-Sided MPI

Cost-Efficient Workflow Scheduling Algorithm for Applications With Deadline Constraint on Heterogeneous Clouds

Octopus-DF: Unified DataFrame-based cross-platform data analytic system