Exposing Shadow Branches

Chrysanthos Pepi, Bhargav Reddy Godala, Krishnam Tibrewala, Gino Chacon, Paul V. Gratz, Daniel A. Jiménez, Gilles A. Pokam, David I. August
2024-08-23
Abstract: Modern processors implement a decoupled front-end in the form of Fetch Directed Instruction Prefetching (FDIP) to avoid front-end stalls. FDIP is driven by the Branch Prediction Unit (BPU), relying on the BPU's accuracy and branch target tracking structures to speculatively fetch instructions into the Instruction Cache (L1I). As data center applications become more complex, their code footprints also grow, resulting in an increase in Branch Target Buffer (BTB) misses. FDIP can alleviate L1I cache misses, but when it encounters a BTB miss, the BPU may not identify the current instruction as a branch to FDIP. This can prevent FDIP from prefetching or cause it to speculate down the wrong path, further polluting the L1I cache. We observe that the vast majority, 75%, of BTB-missing, unidentified branches are actually present in instruction cache lines that FDIP has previously fetched, but these missing branches have not yet been decoded and inserted into the BTB. This is because the instruction line is decoded from an entry point (which is the target of the previous taken branch) to an exit point (the taken branch). We call branch instructions present in the ignored portion of the cache line "Shadow Branches". Here we present Skeia, a novel shadow branch decoding technique that identifies and decodes unused bytes in cache lines fetched by FDIP, inserting them into a Shadow Branch Buffer (SBB). The SBB is accessed in parallel with the BTB, allowing FDIP to speculate despite a BTB miss. With a minimal storage state of 12.25KB, Skeia delivers a geomean speedup of ~5.7% over an 8K-entry BTB (78KB) and ~2% versus adding an equal amount of state to the BTB across 16 front-end bound applications. Since many branches stored in the SBB are unique compared to those in a similarly sized BTB, we consistently observe greater performance gains with Skeia across all examined sizes until saturation.
Hardware Architecture
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the high Branch Target Buffer (BTB) miss rate that modern processor front-ends encounter when running complex data-center and commercial workloads. Specifically, under Fetch Directed Instruction Prefetching (FDIP), branches that have not yet been decoded and inserted into the BTB (the "shadow branches") cause BTB misses and hurt performance.

#### Background and problem description

1. **Front-end pressure and instruction cache misses**:
   - Modern data-center and commercial applications put heavy pressure on the processor front-end and the instruction cache.
   - Instruction prefetching techniques such as FDIP can reduce L1-I cache misses, but they depend on the addresses predicted by the Branch Prediction Unit (BPU), so the BPU's accuracy directly determines how effective FDIP is.
2. **BTB miss problem**:
   - As applications grow more complex and their code footprints increase, the BTB miss rate rises.
   - When FDIP encounters a BTB miss, the BPU may fail to identify the current instruction as a branch. FDIP then either cannot prefetch or speculates down the wrong path, further polluting the L1-I cache.
3. **Shadow branch phenomenon**:
   - The authors observe that 75% of BTB-missing branches are already present in instruction cache lines that FDIP has previously fetched.
   - A cache line is decoded only from its entry point (the target of the previous taken branch) to its exit point (the taken branch). Branches in the ignored portion of the line are never decoded or inserted into the BTB, and the authors call them "shadow branches".
4. **Limitations of existing solutions**:
   - Existing solutions (such as enhancing FDIP to handle L1-I and BTB misses simultaneously) rely on the contents of the BTB to generate predictions. They are therefore ineffective for cold (infrequently accessed) branches and may pollute the cache.
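The shadow branch phenomenon described above can be illustrated with a minimal sketch. It is not the paper's decoder: the 64-byte line size, the `Branch` record, and the `split_branches` helper are all hypothetical names and values chosen for illustration, and the branch offsets are assumed to be known in advance rather than discovered by decoding raw bytes.

```python
from dataclasses import dataclass

LINE_BYTES = 64  # assumed cache line size; the summary does not state one


@dataclass(frozen=True)
class Branch:
    offset: int  # byte offset of the branch within the cache line
    target: int  # branch target address


def split_branches(branches, entry, exit_):
    """Partition a line's branches into decoded and shadow sets.

    entry: offset where fetch enters the line (target of the previous taken branch).
    exit_: offset of the taken branch where fetch leaves the line.
    """
    if not 0 <= entry <= exit_ < LINE_BYTES:
        raise ValueError("expected 0 <= entry <= exit_ < LINE_BYTES")
    # Branches inside [entry, exit_] lie on the fetched path and get decoded.
    decoded = [b for b in branches if entry <= b.offset <= exit_]
    # Branches before the entry or after the exit sit in the ignored bytes.
    shadow = [b for b in branches if b.offset < entry or b.offset > exit_]
    return decoded, shadow


# Hypothetical line with four branches; fetch enters at byte 16 and
# leaves through the taken branch at byte 40.
line_branches = [
    Branch(4, 0x4000),
    Branch(24, 0x5000),
    Branch(40, 0x6000),
    Branch(56, 0x7000),
]
decoded, shadow = split_branches(line_branches, entry=16, exit_=40)
```

In this example the branches at offsets 4 and 56 are in the line that was fetched, yet neither would reach the BTB through the normal decode path.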
### Solutions proposed in the paper

To address this, the paper proposes Skeia, a new shadow-branch decoding technique. Skeia identifies and decodes the unused bytes in cache lines fetched by FDIP and inserts the shadow branches it finds into a structure called the Shadow Branch Buffer (SBB). The SBB is accessed in parallel with the BTB, so FDIP can continue to speculate even when the BTB misses.

### Main contributions

1. **First identification of shadow branches**: The paper reveals that cache lines fetched by FDIP contain branches that are never decoded.
2. **Introduction of the Skeia mechanism**: A new mechanism that speculatively identifies and decodes shadow branches, applicable to both fixed-length and variable-length instruction sets.
3. **Introduction of the SBB structure**: A compact structure for storing shadow branches. With only 12.25KB of state, it delivers a geomean speedup of about 5.7% over an 8K-entry BTB (78KB).
4. **Performance advantage**: Compared with spending the same amount of state on a larger BTB, Skeia gains about 2% more performance, and it outperforms BTB expansion across all examined sizes until saturation.

Through these contributions, Skeia lets FDIP recover from BTB misses and improves front-end performance, especially on complex commercial and data-center workloads.
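The parallel BTB and SBB lookup can be sketched as follows. This is an illustrative model under stated assumptions, not the paper's hardware design: the `FrontEndTables` class and its method names are hypothetical, both tables are modeled as unbounded dictionaries with no sets, ways, or replacement policy, and giving the BTB priority when both tables hit is an assumption made here.

```python
class FrontEndTables:
    """Toy model of a BTB with a Shadow Branch Buffer probed alongside it."""

    def __init__(self):
        self.btb = {}  # pc -> target, filled by the normal decode path
        self.sbb = {}  # pc -> target, filled by shadow branch decoding

    def insert_decoded(self, pc, target):
        self.btb[pc] = target

    def insert_shadow(self, pc, target):
        self.sbb[pc] = target

    def lookup(self, pc):
        # Both tables are probed for every pc, modeling parallel access.
        btb_target = self.btb.get(pc)
        sbb_target = self.sbb.get(pc)
        if btb_target is not None:
            return ("btb", btb_target)  # assumed priority on a dual hit
        if sbb_target is not None:
            return ("sbb", sbb_target)  # BTB miss covered by the SBB
        return ("miss", None)           # branch not identified at all


tables = FrontEndTables()
tables.insert_decoded(0x1018, 0x5000)  # branch seen on the fetched path
tables.insert_shadow(0x1038, 0x7000)   # branch found in ignored bytes

hit_btb = tables.lookup(0x1018)
hit_sbb = tables.lookup(0x1038)
no_hit = tables.lookup(0x2000)
```

The second lookup is the case the summary highlights: the BTB misses, but the prefetcher still receives a target and can keep speculating.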