Abstract: With widening vectors and the proliferation of advanced vector instructions in today's processors, vectorization plays an ever-increasing role in delivering application performance. Achieving the performance potential of this vector hardware has required significant support at the software level, such as new explicit vector programming models and advanced vectorizing compilers. Today, with the combination of these software tools and new SIMD ISA extensions like gather/scatter instructions, it is not uncommon to find that even codes with complex and irregular data access patterns can be vectorized. In this paper we focus on these vectorized codes with irregular accesses and show that, while the best-in-class Intel Compiler Vectorizer does indeed provide speedup through efficient vectorization, there are opportunities where clever program transformations can increase performance further. After identifying these opportunities, this paper describes two automatic compiler optimizations that target these data access patterns. The first optimization improves performance for a group of adjacent gathers/scatters. The second optimization improves performance for a group of stencil vector accesses by using more efficient SIMD instructions. Both optimizations are now implemented in version 17.0 of the Intel Compiler. We evaluate the optimizations using an extensive set of micro-kernels, representative benchmarks, and application kernels.
On these benchmarks, we demonstrate performance gains of 3–750% on the Intel® Xeon processor (Haswell, HSW), up to 25% on the Intel® Xeon Phi™ coprocessor (Knights Corner, KNC), and up to 430% on the Intel® Xeon Phi™ processor with AVX-512 instruction support (Knights Landing, KNL).
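
As a rough illustration of the two access patterns the abstract names, the C sketch below shows a loop with a group of adjacent indirect loads and a loop with a group of stencil loads. It shows only the source-level patterns, not the compiler transformations themselves, which the abstract does not detail. The `Point` type and the function names `sum_indexed` and `stencil3` are hypothetical and chosen for this example only.

```c
#include <stddef.h>

/* Hypothetical array-of-structures layout: x, y, z sit next to each
   other in memory. */
typedef struct { float x, y, z; } Point;

/* Pattern 1: a group of adjacent indirect loads. Each iteration reads
   three neighbouring fields through the same index, so a vectorizer
   that handles each field separately would issue three gathers. */
float sum_indexed(const Point *pts, const int *idx, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        const Point *p = &pts[idx[i]];
        acc += p->x + p->y + p->z;
    }
    return acc;
}

/* Pattern 2: a group of stencil loads. in[i-1], in[i], and in[i+1]
   overlap heavily between neighbouring iterations. Boundary elements
   out[0] and out[n-1] are left untouched. */
void stencil3(const float *in, float *out, size_t n) {
    for (size_t i = 1; i + 1 < n; ++i)
        out[i] = in[i - 1] + in[i] + in[i + 1];
}
```

Both loops are written as plain scalar code; whether and how a given compiler vectorizes them depends on the target ISA and optimization flags.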

High-performance Computation of Kubo Formula with Vectorization of Batched Linear Algebra Operation

Cache Optimization and Performance Modeling of Batched, Small, and Rectangular Matrix Multiplication on Intel, AMD, and Fujitsu Processors

Performance Evaluation of a Next-Generation SX-Aurora TSUBASA Vector Supercomputer

Performance Acceleration of Kernel Polynomial Method Applying Graphics Processing Units

Performance Optimization for Sparse A(T)Ax in Parallel on Multicore CPU

Implementation of a Parallel Sparse Direct Solver on Vector Architecture

High Performance Optimizations for Nuclear Physics Code MFDn on KNL

Analyzing the Performance Portability of Tensor Decomposition

A performance portable, fully implicit Landau collision operator with batched linear solvers

A High Performance Block Eigensolver for Nuclear Configuration Interaction Calculations

Rapid Exploration of Optimization Strategies on Advanced Architectures using TestSNAP and LAMMPS

Automated Compiler Optimization of Multiple Vector Loads/Stores

PBBFMM3D: A parallel black-box algorithm for kernel matrix-vector multiplication

A study of vectorization for matrix-free finite element methods

Accelerating and Tuning Small Matrix Multiplications on Sunway TaihuLight: A Case Study of Spectral Element CFD Code Nek5000

Acceleration of multiple precision matrix multiplication based on multi-component floating-point arithmetic using AVX2

Memory-Constrained Vectorization and Scheduling of Dataflow Graphs for Hybrid CPU-GPU Platforms

Boosting the effective performance of massively parallel tensor network state algorithms on hybrid CPU-GPU based architectures via non-Abelian symmetries

Software for Sparse Tensor Decomposition on Emerging Computing Architectures

Computing Low-Rank Approximation of a Dense Matrix on Multicore CPUs with a GPU and Its Application to Solving a Hierarchically Semiseparable Linear System of Equations

Performance Enhancement of the Ozaki Scheme on Integer Matrix Multiplication Unit