Abstract: Software-managed heterogeneous memory (HM) provides a promising solution to increase memory capacity and cost efficiency. However, to unlock the performance potential of HM, we face a data management problem. Given an application with various execution phases, each with a possibly distinct working set, we must move data between the memory components of HM to optimize performance. The deep neural network (DNN), a common workload in data centers, poses great challenges for data management on HM. This workload often employs a task dataflow execution model and features a large number of small data objects and fine-grained operations (tasks). This execution model makes both memory profiling and efficient data migration difficult. We present Sentinel, a runtime system that automatically optimizes data migration (i.e., data management) on HM to achieve performance similar to that of a fast-memory-only system while using a much smaller capacity of fast memory. To achieve this, Sentinel exploits domain knowledge about deep learning to adopt a custom approach to data management. Sentinel leverages workload repeatability to break the dilemma between profiling accuracy and overhead. It enables profiling and data migration at the granularity of data objects (not pages) by controlling memory allocation. This method bridges the semantic gap between the operating system and applications. By associating data objects with the DNN topology, Sentinel avoids unnecessary data movement and proactively triggers data movement. Using only 20% of the peak memory consumption of DNN models as the fast memory size, Sentinel achieves the same or comparable performance (at most 8% performance difference) to that of the fast-memory-only system on common DNN models. Sentinel also consistently outperforms a state-of-the-art solution by 18%.
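The idea of profiling one iteration and then migrating data objects proactively can be illustrated with a minimal sketch. This is not Sentinel's algorithm: the function name `plan_migrations`, the trace format, and the farthest-next-use eviction rule are all assumptions made for illustration. The sketch only shows why a repeatable workload helps: a trace recorded once reveals every object's future uses, so moves can be planned per step at object granularity.

```python
def plan_migrations(trace, fast_capacity):
    """Plan object-level moves between fast and slow memory.

    trace: one profiled iteration, as a list of steps (e.g. layers); each
           step is a list of (object_name, size) pairs it accesses.
    Returns a list of (step, objects_to_fetch, objects_to_evict).
    Hypothetical helper for illustration, not part of any real system.
    """
    uses, sizes = {}, {}
    for step, objs in enumerate(trace):
        for name, size in objs:
            uses.setdefault(name, []).append(step)
            sizes[name] = size

    def next_use(name, step):
        # Because iterations repeat, the profiled trace tells us the future.
        return next((s for s in uses[name] if s > step), float("inf"))

    fast, used, plan = set(), 0, []
    for step, objs in enumerate(trace):
        needed = {name for name, _ in objs}
        to_fetch = [n for n in sorted(needed) if n not in fast]
        demand = sum(sizes[n] for n in to_fetch)
        # Assumed policy: evict the resident object whose next use is farthest.
        candidates = sorted(fast - needed,
                            key=lambda n: (next_use(n, step), n),
                            reverse=True)
        evicted = []
        for n in candidates:
            if used + demand <= fast_capacity:
                break
            fast.remove(n)
            used -= sizes[n]
            evicted.append(n)
        if used + demand > fast_capacity:
            raise ValueError("one step's working set exceeds fast memory")
        fast.update(to_fetch)
        used += demand
        plan.append((step, to_fetch, evicted))
    return plan


# Three steps sharing weights (w*) and activations (a*); sizes are arbitrary units.
trace = [
    [("w1", 4), ("a1", 2)],
    [("w2", 4), ("a1", 2), ("a2", 2)],
    [("w1", 4), ("a2", 2)],
]
plan = plan_migrations(trace, fast_capacity=8)
```

In a real runtime the planned fetches would be issued ahead of the step that needs them so that migration overlaps with computation; the sketch leaves that scheduling out.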

TENSILE: A Tensor Granularity Dynamic GPU Memory Scheduling Method Toward Multiple Dynamic Workloads System

G10: Enabling An Efficient Unified GPU Memory and Storage Architecture with Smart Tensor Migrations

MegTaiChi: Dynamic Tensor-based Memory Management Optimization for DNN Training

FusionStitching: Boosting Memory Intensive Computations for Deep Learning Workloads

TSPLIT: Fine-grained GPU Memory Management for Efficient DNN Training Via Tensor Splitting

Efficient Memory Management for GPU-based Deep Learning Systems

vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving

WELDER: Scheduling Deep Learning Memory Access Via Tile-graph

TensorCache: Reconstructing Memory Architecture with SRAM-Based In-Cache Computing for Efficient Tensor Computations in GPGPUs

MAGIS: Memory Optimization Via Coordinated Graph Transformation and Scheduling for DNN

Dynamic Space-Time Scheduling for GPU Inference

TensorTEE: Unifying Heterogeneous TEE Granularity for Efficient Secure Collaborative Tensor Computing

A Framework for Memory Oversubscription Management in Graphics Processing Units

Thread Batching for High-performance Energy-efficient GPU Memory Design

HOME: A Holistic GPU Memory Management Framework for Deep Learning

Surpassing Sycamore: Achieving Energetic Superiority Through System-Level Circuit Simulation

Accelerating Heterogeneous Tensor Parallelism via Flexible Workload Control

Scalable CP Decomposition for Tensor Learning using GPU Tensor Cores

SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks

Breaking the Memory Wall: A Study of I/O Patterns and GPU Memory Utilization for Hybrid CPU-GPU Offloaded Optimizers

Sentinel: Runtime Data Management on Heterogeneous Main Memory Systems for Deep Learning