Harmonic-summing Module of SKA on FPGA--Optimising the Irregular Memory Accesses

Haomiao Wang, Prabu Thiagaraj, Oliver Sinnen
DOI: https://doi.org/10.1109/TVLSI.2018.2882238
2018-06-29
Abstract: The Square Kilometre Array (SKA), which will be the world's largest radio telescope, will enhance and boost a large number of science projects, including the search for pulsars. The frequency domain acceleration search is an efficient approach to search for binary pulsars. A significant part of it is the harmonic-summing module, which is the research subject of this paper. Most of the operations in the harmonic-summing module are relatively cheap operations for FPGAs. The main challenge is the large number of point accesses to off-chip memory, which are not consecutive but irregular. Although harmonic-summing alone might not be targeted for FPGA acceleration, it is a part of the pulsar search pipeline that contains many other compute-intensive modules, which are efficiently executed on FPGA. Hence, having the harmonic-summing also on the FPGA will avoid off-board communication, which could destroy other acceleration benefits. Two types of harmonic-summing approaches are investigated in this paper: 1) storing intermediate data in off-chip memory and 2) processing the input signals directly without storing. For the second type, two approaches of caching data are proposed and evaluated: 1) preloading points that are frequently touched and 2) preloading all necessary points that are used to generate a chunk of output points. OpenCL is adopted to implement the proposed approaches. In an extensive experimental evaluation, the same OpenCL kernel codes are evaluated on FPGA boards and GPU cards. Among the proposed preloading methods, preloading all necessary points while reordering the input signals is faster than all the other methods. While in raw performance a single FPGA board cannot compete with a GPU, in terms of energy dissipation, the GPU consumes up to 2.6 times more energy than the FPGAs when executing the same NDRange kernels.
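To make the memory-access problem in the abstract concrete, here is a minimal Python sketch of harmonic summing. It assumes a simplified 1-D definition in which output bin i is the sum of the input power at bins i, 2i, ..., Hi. The function name `harmonic_sum` and this definition are illustrative assumptions; the paper's OpenCL kernels operate on the output of the frequency domain acceleration search and are not reproduced here.

```python
import numpy as np

def harmonic_sum(power, num_harmonics):
    """Simplified harmonic sum: out[i] = power[i] + power[2*i] + ... + power[H*i].

    Assumed 1-D definition for illustration only. The output is truncated
    so that the highest harmonic index H*i stays inside the input array.
    """
    power = np.asarray(power, dtype=np.float64)
    n_out = (len(power) - 1) // num_harmonics + 1
    idx = np.arange(n_out)
    out = np.zeros(n_out)
    for h in range(1, num_harmonics + 1):
        # Gather with stride h: neighbouring outputs read inputs h bins apart,
        # so the reads are not consecutive in memory for h > 1.
        out += power[h * idx]
    return out

# Example: with power[k] = k and H = 4, out[i] = (1 + 2 + 3 + 4) * i = 10 * i.
print(harmonic_sum(np.arange(16.0), 4))
```

The arithmetic is a handful of additions per output, while each output needs H reads at widely separated addresses. That imbalance is why the abstract describes the operations as cheap and the off-chip accesses as the main challenge.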
Distributed, Parallel, and Cluster Computing
What problem does this paper attempt to address?
The paper addresses irregular memory access in the pulsar search pipeline of the Square Kilometre Array (SKA), specifically in the harmonic-summing module of the frequency domain acceleration search (FDAS). SKA will be the world's largest radio telescope, and its pulsar search involves a large amount of computation. The harmonic-summing module in particular makes frequent accesses to non-consecutive off-chip memory locations, which creates a performance bottleneck.

### Core of the problem

1. **Irregular memory access**: The main challenge of the harmonic-summing module is that its many point accesses to off-chip memory are non-consecutive and irregular, which limits the performance a hardware accelerator can reach.
2. **Avoiding unnecessary data transfer**: The other compute-intensive modules of the pulsar search pipeline run efficiently on the FPGA. Executing the harmonic-summing module on the FPGA as well avoids off-board communication that could cancel out the acceleration benefit.

### Solutions

The paper proposes several methods to optimize the memory access of the harmonic-summing module:

1. **Reducing intermediate data access**: By changing the processing order and storing intermediate data in on-chip memory, the total number of off-chip memory accesses is reduced.
2. **Preloading data**: Two preloading methods are proposed:
   1. **Preloading frequently accessed points**: Load the points that are touched most often.
   2. **Preloading all necessary points**: Load every point needed to generate a chunk of output points.
3. **Reordering inputs**: Building on the method of preloading all necessary points, the input points are reordered to speed up memory access. After reordering, the data required by each work-group comes from consecutive addresses and can be streamed from off-chip memory to the FPGA.
4. **Cross-device evaluation**: The methods are implemented in OpenCL and ported to different devices for evaluation, including different series of FPGAs, with general-purpose GPUs and CPUs for comparison.

### Experimental results

- In raw performance, a single FPGA board cannot compete with a GPU.
- In energy consumption, the GPU uses up to 2.6 times as much energy as the FPGAs when executing the same NDRange kernels.

### Conclusion

By optimizing the memory access pattern of the harmonic-summing module, the paper improves the performance and energy efficiency of FPGAs in pulsar search. The optimization methods are not specific to the SKA project and can be extended to other applications that have to handle irregular memory access.
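The idea of preloading all necessary points and reordering the inputs, described under Solutions, can be sketched in Python as follows. This is a hedged illustration, not the paper's OpenCL implementation: it assumes a simplified 1-D harmonic sum where output bin i adds the input power at bins i, 2i, ..., Hi, and the helper names `necessary_indices`, `reorder_input` and `harmonic_sum_preloaded` are hypothetical. The array `local` stands in for on-chip memory.

```python
import numpy as np

def necessary_indices(start, stop, num_harmonics):
    """Sorted unique input indices needed for output bins start..stop-1
    (assumed definition: output i reads inputs h*i for h = 1..H)."""
    i = np.arange(start, stop)
    h = np.arange(1, num_harmonics + 1)
    return np.unique(np.outer(h, i))

def reorder_input(power, chunk, num_harmonics):
    """Host-side reordering: for each chunk of outputs, copy its necessary
    points into one contiguous block that could be streamed to the device."""
    power = np.asarray(power, dtype=np.float64)
    n_out = (len(power) - 1) // num_harmonics + 1
    blocks = []
    for start in range(0, n_out, chunk):
        stop = min(start + chunk, n_out)
        idx = necessary_indices(start, stop, num_harmonics)
        blocks.append((start, stop, idx, power[idx]))
    return n_out, blocks

def harmonic_sum_preloaded(power, chunk, num_harmonics):
    """Compute the harmonic sum chunk by chunk from the preloaded blocks."""
    n_out, blocks = reorder_input(power, chunk, num_harmonics)
    out = np.zeros(n_out)
    for start, stop, idx, local in blocks:
        i = np.arange(start, stop)
        for h in range(1, num_harmonics + 1):
            # Map each global index h*i to its slot in the local buffer.
            # A hardware kernel would use a precomputed mapping instead.
            pos = np.searchsorted(idx, h * i)
            out[start:stop] += local[pos]
    return out

# Example: power[k] = k**2, H = 4, so out[i] = (1 + 4 + 9 + 16) * i**2.
print(harmonic_sum_preloaded(np.arange(32.0) ** 2, chunk=3, num_harmonics=4))
```

In this sketch each chunk (standing in for a work-group) reads one contiguous block and performs all of its irregular lookups in the local buffer. The cost is that a point needed by several chunks is duplicated in the reordered input, so the chunk size trades off-chip traffic against on-chip buffer size.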