Evaluating Intel AVX2 Vgather Instructions with Stencils

LIN Xin-hua,QIN Qiang,LI Shuo,WEN Min-hua,MATSUOKA Satoshi
DOI: https://doi.org/10.11896/j.issn.1002-137x.2017.01.004
2018-01-01
Computer Science
Abstract:Intel provided AVX2 vgather instruction on Haswell CPU to better support reading discontinued data in vectorization.We found the compiler generates vgather instructions,which slow down the performance of Stencil on Haswell,because the branches exist in defining boundary condition of Stencils.We proposed to utilize peel optimization or intrinsic load to avoid these vgather instructions.We applied these optimizations to three Stencil benchmarks,a longrange Stencil 3DFD,and a hybrid Stencil application,and archived the speedup from 1.22X to 3.88X on Haswell.By analyzing the implementation of the instruction,we found the vgather instructions are decoded into multiple micro-operations (μops),and the instructions generate one μops for each element to be gathered.Due to the high overhead of decoder,the vgather instructions become the performance bottleneck of Stencils on Haswell.It is believed that the understanding of the implementation of AVX2 vgather instructions and adopting the optimizations to avoid the vgather instructions are quite helpful for performance tuning the applications with good spatial locality on Haswell.
What problem does this paper attempt to address?