Abstract: Applications running concurrently in CMP systems interfere with each other in DRAM, leading to poor system performance and fairness. Memory access scheduling reorders memory requests to improve system throughput and fairness, but it cannot resolve the interference effectively. To reduce interference, memory partitioning divides memory resources among threads. Memory channel partitioning (MCP) maps the data of threads that are likely to interfere severely with each other to different channels. However, it allocates memory resources unfairly and physically exacerbates memory contention among intensive threads, ultimately increasing the slowdown of these threads and the unfairness of the system. Bank partitioning divides memory banks among cores and eliminates interference. However, previous equal bank partitioning restricts the number of banks available to each thread and reduces bank-level parallelism. In this paper, we first propose Dynamic Bank Partitioning (DBP), which partitions memory banks according to each thread's demand for banks. DBP compensates for the reduced bank-level parallelism caused by equal bank partitioning. The key principle is to profile threads' memory characteristics at run-time, estimate how many banks each thread needs, and use that estimate to direct the bank partitioning. Second, we observe that bank partitioning and memory scheduling are orthogonal, in the sense that the two methods can complement each other when applied together. Therefore, we present a comprehensive approach that integrates Dynamic Bank Partitioning with Thread Cluster Memory scheduling (DBP-TCM; TCM is one of the best memory scheduling algorithms) to further improve system performance. Experimental results show that the proposed DBP improves system performance by 4.3% and system fairness by 16% over equal bank partitioning. Compared to TCM, DBP-TCM improves system throughput by 6.2% and fairness by 16.7%. Compared with MCP, DBP-TCM provides 5.3% better system throughput and 37% better system fairness. We conclude that our methods are effective in improving both system throughput and fairness.
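
The idea of sizing each thread's bank partition by its estimated demand can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not say how demand is estimated or how banks are rounded, so the `demand` values, the function name `partition_banks`, the one-bank minimum, and the largest-remainder rounding are all assumptions made for illustration.

```python
from typing import Dict, List


def partition_banks(demand: Dict[str, float], total_banks: int) -> Dict[str, List[int]]:
    """Split bank indices 0..total_banks-1 among threads in proportion to demand.

    `demand` is a hypothetical per-thread estimate of how many banks a thread
    could use (for example, derived from profiled memory intensity). It is an
    assumption of this sketch, not a metric defined in the abstract.
    """
    threads = sorted(demand)
    if total_banks < len(threads):
        raise ValueError("need at least one bank per thread")

    # Assumption: every thread keeps at least one bank so none is starved.
    counts = {t: 1 for t in threads}
    spare = total_banks - len(threads)

    # Ideal (fractional) share of the spare banks for each thread.
    total_demand = sum(demand.values())
    if total_demand > 0:
        shares = {t: spare * demand[t] / total_demand for t in threads}
    else:
        shares = {t: spare / len(threads) for t in threads}

    # Largest-remainder rounding: hand out whole parts first, then give the
    # leftover banks to the threads with the largest fractional parts.
    for t in threads:
        counts[t] += int(shares[t])
    leftover = total_banks - sum(counts.values())
    by_remainder = sorted(threads, key=lambda t: shares[t] - int(shares[t]), reverse=True)
    for t in by_remainder[:leftover]:
        counts[t] += 1

    # Assign disjoint, contiguous ranges of bank indices.
    partition: Dict[str, List[int]] = {}
    next_bank = 0
    for t in threads:
        partition[t] = list(range(next_bank, next_bank + counts[t]))
        next_bank += counts[t]
    return partition


if __name__ == "__main__":
    # One memory-intensive thread and two light threads sharing 8 banks.
    print(partition_banks({"heavy": 6.0, "light1": 1.0, "light2": 1.0}, 8))
```

With equal demand values the sketch degenerates to equal bank partitioning, the baseline the abstract compares against. With skewed demand, the intensive thread receives more banks and so keeps more bank-level parallelism.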

CMLB: a Communication-aware and Memory Load Balance Mapping Optimization for Modern NUMA Systems

Latency Optimization for Cellular Assisted Mobile Edge Computing Via Non-Orthogonal Multiple Access

HASO: A Hot-Page Aware Scheduling Optimization Method in Virtualized NUMA Systems

A User-Level NUMA-Aware Scheduler for Optimizing Virtual Machine Performance.

Evaluation of Virtual Machine Performance on NUMA Multicore Systems

Memory Affinity: Balancing Performance, Power, Thermal and Fairness for Multi-core Systems

Performance-Monitoring-Based Traffic-Aware Virtual Machine Deployment on NUMA Systems

NestedMP: Enabling Cache-Aware Thread Mapping for Nested Parallel Shared Memory Applications

EMBA: Efficient Memory Bandwidth Allocation to Improve Performance on Intel Commodity Processor

A performance comparison of data and memory allocation strategies for sequence aligners on NUMA architectures

Poster: revisiting virtual channel memory for performance and fairness on multi-core architecture.

JArena: Partitioned Shared Memory for NUMA-awareness in Multi-threaded Scientific Applications

Affinity-Based Thread and Data Mapping in Shared Memory Systems

High-performance application mapping in network-on-chip-based multicore systems

Distributed Memory Management Units Architecture for NoC-based CMPs

Improving system throughput and fairness simultaneously in shared memory CMP systems via Dynamic Bank Partitioning

Optimizing Memory Access Traffic Via Runtime Thread Migration for On-Chip Distributed Memory Systems

Load Balance Scheduling Algorithm for CMP Architecture

Memory and Computation Coordinated Mapping of DNNs Onto Complex Heterogeneous SoC.

Object-Level Memory Allocation and Migration in Hybrid Memory Systems

Agent-Based Memory Access for Many-Core CMPs