Abstract: The Square Kilometre Array (SKA), which will be the world's largest radio telescope, will benefit a large number of science projects, including the search for pulsars. The frequency-domain acceleration search is an efficient approach to searching for binary pulsars. A significant part of it is the harmonic-summing module, which is the research subject of this paper. Most of the operations in the harmonic-summing module are relatively cheap for FPGAs. The main challenge is the large number of point accesses to off-chip memory, which are not consecutive but irregular. Although harmonic summing alone might not be targeted for FPGA acceleration, it is part of a pulsar search pipeline that contains many other compute-intensive modules, which are executed efficiently on FPGAs. Hence, having the harmonic summing on the FPGA as well avoids off-board communication, which could destroy the other acceleration benefits. Two types of harmonic-summing approaches are investigated in this paper: 1) storing intermediate data in off-chip memory and 2) processing the input signals directly without storing. For the second type, two approaches to caching data are proposed and evaluated: 1) preloading points that are frequently touched and 2) preloading all necessary points that are used to generate a chunk of output points. OpenCL is adopted to implement the proposed approaches. In an extensive experimental evaluation, the same OpenCL kernel codes are evaluated on FPGA boards and GPU cards. Among the proposed preloading methods, preloading all necessary points while reordering the input signals is faster than all the other methods. While a single FPGA board cannot compete with a GPU in raw performance, in terms of energy dissipation the GPU consumes up to 2.6 times more energy than the FPGAs when executing the same NDRange kernels.
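To make the irregular access pattern concrete, here is a minimal Python sketch of harmonic summing on a 1-D power spectrum. It assumes the common formulation in which harmonic `h` of fundamental bin `k` is read from bin `h * k`; the abstract does not give the paper's exact indexing, and the names `harmonic_sum` and `touched_bins` are hypothetical.

```python
def harmonic_sum(power, n_harmonics):
    """Sum the first n_harmonics harmonics for every fundamental bin k.

    Assumption (not taken from the paper): harmonic h of bin k is bin h * k.
    """
    # Largest k for which n_harmonics * k is still a valid index.
    n_out = (len(power) - 1) // n_harmonics + 1
    return [
        sum(power[h * k] for h in range(1, n_harmonics + 1))
        for k in range(n_out)
    ]


def touched_bins(k, n_harmonics):
    """Input bins that are read to produce output bin k."""
    return [h * k for h in range(1, n_harmonics + 1)]


if __name__ == "__main__":
    spectrum = list(range(16))
    print(harmonic_sum(spectrum, 4))
    print(touched_bins(3, 4))
```

Each output reads bins whose stride grows with `k`, so neighbouring outputs do not read neighbouring inputs. This is the kind of non-consecutive off-chip access the abstract identifies as the main challenge.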

What problem does this paper attempt to address?

The paper aims to optimize irregular memory access in the pulsar search pipeline of the Square Kilometre Array (SKA), specifically in the harmonic-summing module of the frequency-domain acceleration search (FDAS). SKA will become the world's largest radio telescope, and its pulsar search involves a large amount of computation. The harmonic-summing module in particular must frequently access non-contiguous locations in off-chip memory, which creates a performance bottleneck.

### Core of the problem

1. **Irregular memory access**: The main challenge of the harmonic-summing module is that a large number of point accesses to off-chip memory are non-consecutive and irregular, which limits the performance that hardware accelerators can reach.
2. **Avoiding unnecessary data transfer**: The other compute-intensive modules of the pipeline run efficiently on the FPGA. Executing the harmonic-summing module on the FPGA as well avoids off-board communication that could cancel out those acceleration gains.

### Solutions

To address these problems, the paper proposes several methods to optimize the memory access of the harmonic-summing module:

1. **Reducing intermediate data access**: By changing the processing order and storing intermediate data in on-chip memory, the total number of off-chip memory accesses is reduced.
2. **Preloading data**: Two methods of preloading data are proposed:
   1. **Preloading frequently accessed points**: Load the points that are touched most often.
   2. **Preloading all necessary points**: Load all points needed to generate a chunk of output points.
3. **Reordering inputs**: Building on the method that preloads all necessary points, the input points are reordered to speed up memory access. After reordering, the data required by each workgroup comes from contiguous addresses and can be streamed from off-chip memory to the FPGA.
4. **Cross-device evaluation**: These methods are implemented in OpenCL and ported to different devices for evaluation, including different series of FPGAs, general-purpose GPUs and CPUs for comparison.

### Experimental results

- In terms of raw performance, a single FPGA board cannot compete with a GPU.
- In terms of energy consumption, a GPU uses up to 2.6 times as much energy as an FPGA when executing the same kernels.

### Conclusion

By optimizing the memory access pattern of the harmonic-summing module, the paper improves the performance and energy efficiency of FPGAs in pulsar search tasks. These optimization methods apply to the SKA project and can also be extended to other applications that need to handle irregular memory access.
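The preloading and reordering ideas listed under Solutions can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's OpenCL kernels: harmonic `h` of output bin `k` is assumed to live at input bin `h * k`, a Python `dict` stands in for on-chip memory, and all function names are hypothetical.

```python
def needed_bins(k_start, k_end, n_harmonics):
    """Sorted, de-duplicated input bins needed for outputs k_start..k_end-1."""
    return sorted({h * k
                   for k in range(k_start, k_end)
                   for h in range(1, n_harmonics + 1)})


def reorder_input(power, chunk_size, n_harmonics):
    """Host-side step: lay out each chunk's needed points contiguously."""
    n_out = (len(power) - 1) // n_harmonics + 1
    reordered, offsets = [], []
    for k_start in range(0, n_out, chunk_size):
        k_end = min(k_start + chunk_size, n_out)
        # Remember where this chunk's data starts in the reordered stream.
        offsets.append((k_start, k_end, len(reordered)))
        reordered.extend(power[i] for i in needed_bins(k_start, k_end, n_harmonics))
    return reordered, offsets


def harmonic_sum_preloaded(reordered, offsets, n_harmonics):
    """Per chunk: one contiguous read into a local buffer, then sum from it."""
    out = []
    for k_start, k_end, base in offsets:
        bins = needed_bins(k_start, k_end, n_harmonics)
        # Contiguous slice of the reordered input; the dict models on-chip memory.
        local = dict(zip(bins, reordered[base:base + len(bins)]))
        for k in range(k_start, k_end):
            out.append(sum(local[h * k] for h in range(1, n_harmonics + 1)))
    return out


if __name__ == "__main__":
    power = [float(i * i % 7) for i in range(64)]
    reordered, offsets = reorder_input(power, chunk_size=4, n_harmonics=4)
    print(harmonic_sum_preloaded(reordered, offsets, n_harmonics=4))
```

In this sketch a point needed by several chunks is copied into each of them, so the reordered stream can be longer than the original input. That is the price paid here for making every chunk's reads contiguous; whether the paper's layout makes the same trade-off is not stated in the summary.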

Harmonic-summing Module of SKA on FPGA--Optimising the Irregular Memory Accesses

FPGA-based Acceleration of FT Convolution for Pulsar Search Using OpenCL

A Novel Greedy Approach To Harmonic Summing Using GPUs

FPGA implementation of hardware processing modules as coprocessors in brain-machine interfaces.

FPGA architecture based on OpenCL for studying the acoustic backscattering by an immersed tube

Towards On-Board SAR Processing with FPGA Accelerators and a PCIe Interface

A Ubiquitous Machine Learning Accelerator With Automatic Parallelization on FPGA

An Efficient FPGA Implementation of Orthogonal Matching Pursuit With Square-Root-Free QR Decomposition

Accelerating unstructured finite volume computations on field-programmable gate arrays

Efficient Ab-Initio Molecular Dynamic Simulations by Offloading Fast Fourier Transformations to FPGAs

Initial Architecture Design of Ultrasound Synthetic Aperture Imaging Based on FPGA

SolarAccel: FPGA accelerated 2D cross-correlation of digital images: Application to solar adaptive optics

Parallel Optimisation and Implementation of a Real-Time Back Projection (BP) Algorithm for SAR Based on FPGA

Accelerating Graph Analytics by Co-Optimizing Storage and Access on an FPGA-HMC Platform

An optimized architecture for accelerating graph computing on FPGAs

High Performance Scalable FPGA Accelerator for Deep Neural Networks

Field Programmable Gate Array (FPGA) Implementation of Parallel Jacobi for Eigen-Decomposition in Direction of Arrival (DOA) Estimation Algorithm

Optimizing FPGA-based Accelerator Design for Large-Scale Molecular Similarity Search

Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks

Hardware Acceleration and Implementation of YOLOX-s for On-Orbit FPGA

Using OpenCL to Enable Software-like Development of an FPGA-Accelerated Biophotonic Cancer Treatment Simulator