Abstract: With widening vectors and the proliferation of advanced vector instructions in today’s processors, vectorization plays an ever-increasing role in delivering application performance. Achieving the performance potential of this vector hardware has required significant support from the software level, such as new explicit vector programming models and advanced vectorizing compilers. Today, with the combination of these software tools plus new SIMD ISA extensions like gather/scatter instructions, it is not uncommon to find that even codes with complex and irregular data access patterns can be vectorized. In this paper, we focus on these vectorized codes with irregular accesses and show that, while the best-in-class Intel Compiler Vectorizer does indeed provide speedup through efficient vectorization, there are opportunities where clever program transformations can increase performance further. After identifying these opportunities, this paper describes two automatic compiler optimizations that target these data access patterns. The first optimization improves performance for a group of adjacent gathers/scatters. The second optimization improves performance for a group of stencil vector accesses by using more efficient SIMD instructions. Both optimizations are now implemented in the 17.0 version of the Intel Compiler. We evaluate the optimizations using an extensive set of micro-kernels, representative benchmarks, and application kernels.
On these benchmarks, we demonstrate performance gains of 3–750% on the Intel® Xeon processor (Haswell, HSW), up to 25% on the Intel® Xeon Phi™ coprocessor (Knights Corner, KNC), and up to 430% on the Intel® Xeon Phi™ processor with AVX-512 instructions support (Knights Landing, KNL).
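
To make the two access patterns named in the abstract concrete, the following scalar C sketch shows what "adjacent gathers" and "stencil accesses" might look like in source code. It only illustrates the kinds of loops involved and is not the compiler transformation described in the paper. The `Point` record and the function names `sum_fields` and `stencil3` are hypothetical and chosen for illustration.

```c
#include <stddef.h>

/* Hypothetical 3-field record; the layout is an assumption for illustration. */
typedef struct { float x, y, z; } Point;

/* Adjacent gathers: each iteration reads three fields stored at consecutive
   addresses, starting at an irregular base p[idx[i]]. A straightforward
   vectorization would issue one gather per field. */
void sum_fields(const Point *p, const int *idx, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        const Point *q = &p[idx[i]];
        out[i] = q->x + q->y + q->z;
    }
}

/* Stencil accesses: overlapping unit-stride reads a[i-1], a[i], a[i+1].
   Only interior elements of b are written. */
void stencil3(const float *a, float *b, size_t n) {
    for (size_t i = 1; i + 1 < n; ++i)
        b[i] = a[i - 1] + a[i] + a[i + 1];
}
```

In both loops, neighbouring memory accesses share data or cache lines. That sharing is the property a compiler could exploit when replacing several independent vector memory operations with fewer, wider ones.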

Improving SIMD Parallelism via Dynamic Binary Translation

Exploiting SIMD Asymmetry in ARM-to-x86 Dynamic Binary Translation

A Hardware Non-Invasive Mapping Method for Condition Bits in Binary Translation

SPC-Indexed Indirect Branch Hardware Cache Redirecting Technique in Binary Translation

SIMD Code Translation in an Enhanced HQEMU

Performance Improvements via Peephole Optimization in Dynamic Binary Translation

FPGA based hardware-software co-designed dynamic binary translation system

On Static Binary Translation of ARM/Thumb Mixed ISA Binaries

Condition code optimization in dynamic binary translation

A Case Study of LLVM-Based Analysis for Optimizing SIMD Code Generation

Efficient Binary Translation System with Low Hardware Cost

Optimizing Compiler for Shared-Memory Multiple Simd Architecture

Automated Compiler Optimization of Multiple Vector Loads/Stores

A Quantitative Evaluation of Vector Transcendental Functions on ARMv8-Based Processors

Improving Dynamically-Generated Code Performance on Dynamic Binary Translators

Optimizing the SIMD Parallelism Through Bitwidth Analysis

Reverse Compilation for Speculative Parallel Threading

Designing and Implementing a Generator Framework for a SIMD Abstraction Library

GSM: An Efficient Code Generation Algorithm for Dynamic Binary Translator

Mis-speculation-Driven Compiler Framework for Aggressive Loop Automatic Parallelization

Spire: Improving Dynamic Binary Translation Through Spc-Indexed Indirect Branch Redirecting