Data locality optimization of shared memory programs on NUMA architectures using an integrated tool environment

J. Tao
Abstract: Due to their excellent price-performance ratio, clusters built from commodity nodes have become broadly adopted and increasingly popular as platforms for parallel processing. Among them, clusters of standard PCs interconnected with high-speed system area networks (SANs) are especially attractive and have been widely established. At the same time, developments in interconnection technologies have formed the basis for the rise of Non-Uniform Memory Access (NUMA) architectures, i.e. systems with physically distributed memories but a global address space that allows efficient but non-uniform access to any memory location in the system. Such systems, especially when offered as non-cache-coherent NUMA for loosely coupled commodity architectures, can be implemented in a straightforward manner without major hardware effort. They form a favorable architectural tradeoff by combining the scalability and cost-effectiveness of standard clusters with shared memory support close to that of symmetric multiprocessors. The non-uniform memory access characteristic, however, introduces a distinction between local and remote memory, which causes different memory access latencies. In systems with such characteristics, a remote memory access can take up to an order of magnitude longer than a local one. As a result, many shared memory applications initially do not achieve a good parallel speedup when running on NUMA-like architectures because of excessive remote memory accesses. This thesis targets such inefficiency problems of NUMA-based shared memory programs. For this purpose, a comprehensive and integrated tool environment has been built which aims at improving the data locality of running applications by combining individual frameworks that enable both program tuning and runtime manipulation. This environment comprises a low-level data acquisition system, a distributed tool middleware, and a set of performance tools.
Based on the hardware monitoring facility, which is capable of observing all memory transactions performed on the interconnection fabric, the data acquisition system provides information about an application’s memory access behavior, as well as information about, e.g., synchronization primitives and the address mapping necessary for data placement. This information is then aggregated across the distributed system and made accessible to the tools through an established on-line monitoring interface specification serving as middleware. Tools further process the acquired performance data and use it to steer the execution of programs, thereby optimizing the runtime data layout. Currently, two such tools have been implemented: a Data Layout Visualizer (DLV) and an Adaptive Runtime System (ARS). DLV is used to present the monitoring data in an easy-