Abstract: Ever since Dennard scaling broke down in the early 2000s and CPU frequencies stalled, vendors have increased the core count of each CPU chip at the expense of introducing heterogeneity, thus ushering in the era of NUMA processors. Since then, heterogeneity in the hardware design space has only increased, to the point that DBMS performance may vary by up to an order of magnitude on modern servers. Important factors that affect performance include the location of the logical cores where DBMS queries are scheduled and the location of the data that the queries access. This paper introduces P-MOSS, a learned spatial scheduling framework that schedules query execution on specific logical cores and places data accordingly on specific integrated memory controllers (IMCs), to integrate hardware consciousness into the system. In the spirit of hardware-software synergy, P-MOSS guides its scheduling decisions solely by low-level hardware statistics collected by performance monitoring counters, with the aid of a Decision Transformer. Experimental evaluation is performed in the context of the B-tree and R-tree indexes. Performance results demonstrate that P-MOSS achieves up to 6x improvement in query throughput over traditional schedules.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to optimize the query execution performance of main-memory indexes on NUMA servers, especially in multi-core heterogeneous environments. Specifically, it aims to learn a spatial scheduling strategy that assigns queries to specific logical cores and places data on the corresponding integrated memory controllers (IMCs), so as to minimize cross-node communication latency and maximize local memory access.

### Main Problem Background

Since Dennard scaling broke down in the early 2000s, CPU frequencies have stagnated, and manufacturers have increased the number of cores per CPU chip, introducing non-uniform memory access (NUMA) architectures and heterogeneous processors. As a result, the performance of database management systems (DBMS) on modern servers can vary significantly with the heterogeneity of the hardware design space, by up to an order of magnitude.

### Specific Manifestations of the Problem

1. **Query Scheduling and Data Partitioning**: On NUMA servers, query scheduling strategies and data partitioning methods have a significant impact on performance. For example, the performance of a B+-tree on the same NUMA architecture can differ by up to 5.83 times under different scheduling strategies.
2. **Hardware Heterogeneity**: Communication latencies between sockets and within a socket differ significantly on NUMA servers, so different scheduling strategies produce very different performance results.

### Solutions Proposed in the Paper

To address these problems, the paper introduces P-MOSS (Performance MOnitoring Unit-driven Spatial Query Scheduling), a learning-based spatial scheduling framework driven by low-level hardware statistics.

The main objectives of P-MOSS are:

- **Spatial Query Scheduling**: Determine which logical cores execute queries and which IMCs store data, to minimize communication distance and maximize local memory access.
- **Utilize the Hardware Performance Monitoring Unit (PMU)**: Guide scheduling decisions with collected low-level hardware statistics, rather than scheduling without visibility into hardware behavior as traditional methods do.
- **Offline Reinforcement Learning (Offline RL)**: Train scheduling strategies with offline reinforcement learning, so that the normal operation of the DBMS is not affected during learning.

### Specific Implementation

P-MOSS divides the main-memory index into multiple index slices, each corresponding to a specific key range. It then learns a mapping for these slices (that is, which slice is assigned to which core and which IMC) to optimize query performance.

### Experimental Results

The experimental results show that P-MOSS can significantly improve query throughput across index structures (such as the B-tree and R-tree) and hardware configurations, with an improvement of up to 6 times.

### Summary

By introducing the concept of spatial query scheduling and combining low-level hardware statistics with offline reinforcement learning, P-MOSS addresses the impact of query scheduling and data placement on performance in NUMA architectures and achieves significant performance improvements.
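The slice-to-core-and-IMC mapping described under Specific Implementation can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `Placement` and `SliceScheduler`, the half-open key ranges, and the hand-written placement table are all assumptions made for illustration. In P-MOSS the mapping would come from the learned policy, not from a fixed table.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Hypothetical placement record for one index slice."""
    core: int  # logical core that executes queries for the slice
    imc: int   # integrated memory controller that holds the slice's data


class SliceScheduler:
    """Routes a key to the placement of the index slice that owns it.

    Assumption: slices are contiguous key ranges separated by sorted
    split keys, so n split keys define n + 1 slices.
    """

    def __init__(self, boundaries, placements):
        boundaries = list(boundaries)
        placements = list(placements)
        if boundaries != sorted(boundaries):
            raise ValueError("split keys must be sorted")
        if len(placements) != len(boundaries) + 1:
            raise ValueError("need exactly one placement per slice")
        self.boundaries = boundaries
        self.placements = placements

    def slice_of(self, key):
        # Slice i covers boundaries[i-1] <= key < boundaries[i].
        return bisect_right(self.boundaries, key)

    def route(self, key):
        # Look up where a query on this key should run and where its data lives.
        return self.placements[self.slice_of(key)]


if __name__ == "__main__":
    # Three slices: keys below 100, keys in [100, 200), keys from 200 up.
    scheduler = SliceScheduler(
        boundaries=[100, 200],
        placements=[Placement(0, 0), Placement(4, 0), Placement(8, 1)],
    )
    for key in (50, 100, 250):
        print(key, scheduler.route(key))
```

Keeping the placement table separate from the slice boundaries means a new schedule can be installed by swapping the table alone, which matches the summary's framing of scheduling as learning a mapping over fixed slices.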

P-MOSS: Learned Scheduling For Indexes Over NUMA Servers Using Low-Level Hardware Statistics
