Abstract: Translation lookaside buffers (TLBs) have a significant impact on system performance. Numerous prior studies focus on TLB design for uniprocessors. With the advent of chip multiprocessors (CMPs), which today typically employ private per-core TLBs, attention needs to shift to TLB design for CMPs. This paper presents SL2-TLB, a software-implemented level-two TLB that is shared among the cores of a multiprocessor. It reduces not only the cost of the TLB refill handler on every processor core but also the cost of redundant TLB misses across the CMP. SL2-TLB and the hardware TLBs together form a software-hardware co-designed multilevel TLB system that improves system performance without changing the hardware TLB, which makes it a convenient and efficient way to improve TLB performance on CMPs. SL2-TLB benefits SPEC CPU2000 less than SPEC CPU2006, with improvements of about 5% and 7% respectively; the average improvement for SPECint 2006 reaches about 12.7%. The difference arises because TLB refill overhead is small for applications with small memory footprints, where the cache is large enough that walking the page table does not miss. As a further optimization, the SL2-TLB table is kept permanently in the L2 cache by a cache locking scheme. SL2-TLB combined with cache locking improves performance by over 13% for SPECint 2006. An average performance improvement of over 7% is obtained on the emerging parallel benchmark suite PARSEC (Princeton Application Repository for Shared-Memory Computers). All evaluations are performed on Godson-3 processors, the latest generation of China's most powerful microprocessor family.
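The abstract does not describe the internal layout of the SL2-TLB table, so the following is only a minimal sketch of the general idea: a refill handler probes a shared, memory-resident table before falling back to a page-table walk. The direct-mapped organization, the `(asid, vpn)` tag, the table size, the 4 KiB page size, and every name in the code are assumptions made for illustration, not details taken from the paper.

```python
from dataclasses import dataclass, field

PAGE_SHIFT = 12       # assumption: 4 KiB pages
SL2_ENTRIES = 1024    # assumption: table size, a power of two


@dataclass
class SL2Tlb:
    """Hypothetical direct-mapped table shared by the refill handlers of all cores."""
    slots: list = field(default_factory=lambda: [None] * SL2_ENTRIES)
    hits: int = 0
    walks: int = 0

    def _index(self, asid, vpn):
        # Assumed hash: XOR of page number and address-space ID, masked to table size.
        return (vpn ^ asid) & (SL2_ENTRIES - 1)

    def refill(self, asid, vaddr, page_table):
        """Model a refill after a hardware TLB miss: probe the table, walk on a miss."""
        vpn = vaddr >> PAGE_SHIFT
        i = self._index(asid, vpn)
        entry = self.slots[i]
        if entry is not None and entry[0] == (asid, vpn):
            self.hits += 1
            return entry[1]
        # The dictionary lookup stands in for a full page-table walk.
        self.walks += 1
        pfn = page_table[(asid, vpn)]
        self.slots[i] = ((asid, vpn), pfn)
        return pfn


# Two cores in the same address space miss on the same page in turn.
sl2 = SL2Tlb()
page_table = {(1, 0x40): 0x9A}
core0_pfn = sl2.refill(1, 0x40123, page_table)  # first miss: page-table walk
core1_pfn = sl2.refill(1, 0x40456, page_table)  # second miss: served by the shared table
```

In this model the second core's miss on a page already translated by the first core is resolved from the shared table without a second walk, which is the kind of redundant miss cost the abstract refers to.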

Evaluation of TLB Prefetching Techniques

Agile TLB Prefetching and Prediction Replacement Policy

Performance Optimization Technology for TLB on Godson Processors

Performance Comparison of Data Prefetching for Pointer-Chasing Applications

Prefetching Techniques for STT-RAM Based Last-Level Cache in CMP Systems

Performance Analysis and Optimization of Prefetching Thread on CMPs

Software and Hardware Co-designed Multi-level TLBs for Chip Multiprocessors

Unified-TP: A Unified TLB and Page Table Cache Structure for Efficient Address Translation

The Optimization Design of TLB of High Performance Processor

Performance Analysis of Prefetching Thread for Linked Data Structure in CMPs

BTIP: Branch Triggered Instruction Prefetcher Ensuring Timeliness

Tolerating Memory Latency Using a Hardware-Based Active-Pushing Technique

Estimating Effective Prefetch Distance in Threaded Prefetching for Linked Data Structures

Translation lookaside buffer design based on dynamic memory page merging

Evaluating the Memory System Performance of Software-Initiated Inter-core LLC Prepushing

New Prefetch Technique Design for L2 Cache

The Performance Optimization of Threaded Prefetching for Linked Data Structures

Energy-Efficient Hardware Data Prefetching

An Efficient Hardware Prefetcher Exploiting the Prefetch Potential of Long-Stride Access Pattern on Virtual Address

AAP and AAPM: Improved Prefetching Structures of the L2 Cache