Abstract: With widening vectors and the proliferation of advanced vector instructions in today's processors, vectorization plays an ever-increasing role in delivering application performance. Achieving the performance potential of this vector hardware has required significant support at the software level, such as new explicit vector programming models and advanced vectorizing compilers. Today, with the combination of these software tools and new SIMD ISA extensions like gather/scatter instructions, it is not uncommon to find that even codes with complex and irregular data access patterns can be vectorized. In this paper, we focus on these vectorized codes with irregular accesses and show that, while the best-in-class Intel Compiler Vectorizer does indeed provide speedup through efficient vectorization, there remain opportunities where program transformations can increase performance further. After identifying these opportunities, this paper describes two automatic compiler optimizations that target these data access patterns. The first optimization improves the performance of a group of adjacent gathers/scatters. The second optimization improves the performance of a group of stencil vector accesses by using more efficient SIMD instructions. Both optimizations are now implemented in the 17.0 version of the Intel Compiler. We evaluate the optimizations using an extensive set of micro-kernels, representative benchmarks, and application kernels.
On these benchmarks, we demonstrate performance gains of 3–750% on the Intel® Xeon processor (Haswell, HSW), up to 25% on the Intel® Xeon Phi™ coprocessor (Knights Corner, KNC), and up to 430% on the Intel® Xeon Phi™ processor with AVX-512 instruction support (Knights Landing, KNL).
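
To make the two access patterns named in the abstract concrete, the following is a minimal sketch of scalar loops that would exhibit them once vectorized over `i`. The `Point` record, the function names, and the array shapes are all hypothetical and chosen only for illustration; the sketch shows the source-level patterns, not the transformations the paper's optimizations actually apply.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 3-field record; the layout is an assumption for illustration. */
typedef struct { float x, y, z; } Point;

/* Pattern 1: a group of adjacent gathers.
 * For each i, the loads of x, y and z go through the same index idx[i]
 * and touch consecutive memory. Vectorized over i, each field load
 * becomes a gather, and the three gathers differ only by a constant
 * one-element offset. */
void sum_fields(const Point *p, const int *idx, float *out, size_t n) {
    assert(p != NULL && idx != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        const Point *q = &p[idx[i]];
        out[i] = q->x + q->y + q->z;
    }
}

/* Pattern 2: a group of stencil accesses.
 * The loads b[i-1], b[i] and b[i+1] are unit-stride but shifted by one
 * element each, so the corresponding vector loads overlap heavily.
 * Only interior elements 1..n-2 of out are written. */
void stencil3(const float *b, float *out, size_t n) {
    assert(b != NULL && out != NULL);
    for (size_t i = 1; i + 1 < n; i++) {
        out[i] = b[i - 1] + b[i] + b[i + 1];
    }
}
```

In both loops the individual memory references are related by small constant offsets, which is the property that makes it plausible to treat them as a group rather than one access at a time.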

Automated Compiler Optimization of Multiple Vector Loads/Stores
