Abstract: With widening vectors and the proliferation of advanced vector instructions in today's processors, vectorization plays an ever-increasing role in delivering application performance. Achieving the performance potential of this vector hardware has required significant support at the software level, such as new explicit vector programming models and advanced vectorizing compilers. Today, with the combination of these software tools and new SIMD ISA extensions such as gather/scatter instructions, it is not uncommon to find that even codes with complex and irregular data access patterns can be vectorized. In this paper, we focus on these vectorized codes with irregular accesses and show that, while the best-in-class Intel Compiler Vectorizer does provide speedup through efficient vectorization, there remain opportunities where program transformations can increase performance further. After identifying these opportunities, this paper describes two automatic compiler optimizations that target these data access patterns. The first optimization improves performance for a group of adjacent gathers/scatters. The second optimization improves performance for a group of stencil vector accesses by using more efficient SIMD instructions. Both optimizations are implemented in version 17.0 of the Intel Compiler. We evaluate the optimizations using an extensive set of micro-kernels, representative benchmarks, and application kernels.
On these benchmarks, we demonstrate performance gains of 3–750% on the Intel® Xeon processor (Haswell, HSW), up to 25% on the Intel® Xeon Phi™ coprocessor (Knights Corner, KNC), and up to 430% on the Intel® Xeon Phi™ processor with AVX-512 instruction support (Knights Landing, KNL).
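
To make the two access patterns named in the abstract concrete, the following C sketch shows one loop of each kind. It is an illustration only and is not taken from the paper: the `Point` struct, the function names, and the loop bodies are assumptions chosen to show what "adjacent gathers" and "stencil vector accesses" can look like in source code. It does not reproduce the compiler transformations themselves.

```c
#include <stddef.h>

/* Hypothetical record type: x, y, z, w are adjacent in memory. */
typedef struct { float x, y, z, w; } Point;

/* Adjacent gathers (illustrative): every iteration reads p[idx[i]].x,
 * .y and .z. When this loop is vectorized over i, each field access
 * becomes an indexed (gather) load, and the three gathers share the
 * same index vector while differing only by a small constant offset. */
float sum_adjacent_gathers(const Point *p, const int *idx, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        const Point *q = &p[idx[i]];
        sum += q->x + q->y + q->z;
    }
    return sum;
}

/* Stencil accesses (illustrative): out[i] reads in[i-1], in[i], in[i+1].
 * When vectorized over i, these become overlapping unit-stride vector
 * loads that are shifted by one element relative to each other.
 * Boundary elements out[0] and out[n-1] are left untouched. */
void stencil3(const float *in, float *out, size_t n) {
    for (size_t i = 1; i + 1 < n; ++i) {
        out[i] = in[i - 1] + in[i] + in[i + 1];
    }
}
```

In both loops, several memory accesses in one iteration touch neighbouring locations, which is the kind of redundancy across a group of accesses that the abstract says the two optimizations target.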
