Global Atmospheric Simulation on a Reconfigurable Platform.
Lin Gan, Haohuan Fu, Wayne Luk, Chao Yang, Wei Xue, Guangwen Yang
DOI: https://doi.org/10.1109/fccm.2013.26
2013-01-01
Abstract: Summary form only given. As the only method for studying long-term climate trends and predicting potential climate risks, climate modeling is becoming a key research topic among governments and research organizations. One of the most essential and challenging components in climate modeling is the atmospheric model. To support high-resolution climate simulation scenarios, developers must cope with billions of mesh points and extremely complex algorithms. The shallow water equations (SWEs) are a set of conservation laws that exhibit most of the essential characteristics of the atmosphere, so studying them can serve as a starting point for understanding the dynamic behavior of the global atmosphere. We choose the cubed-sphere mesh as the computational mesh because it offers better load balance in the polar regions than other meshes such as the latitude-longitude mesh. The cubed-sphere mesh is obtained by mapping a cube onto the surface of the sphere. The computational domain then consists of six patches, each covered with N × N mesh points. When written in local coordinates, the SWEs have an identical expression on all six patches: $\frac{\partial Q}{\partial t} + \frac{1}{\Lambda}\frac{\partial(\Lambda F^1)}{\partial x^1} + \frac{1}{\Lambda}\frac{\partial(\Lambda F^2)}{\partial x^2} + S = 0$ (1), where $(x^1, x^2) \in [-\pi/4, \pi/4]$ are the local coordinates, $Q = (h, hu^1, hu^2)^T$ is the prognostic variable, $F^i = u^i Q$ ($i = 1, 2$) are the convective fluxes, and $S$ is the source term. Spatially discretized with a cell-centered finite volume method and integrated with a second-order accurate TVD Runge-Kutta method, the SWE solver reduces to the computation of a 13-point upwind stencil with a diamond shape. To obtain the prognostic components ($h$, $hu^1$ and $hu^2$) of the central point, its 12 neighboring points must be accessed. The stencil kernel includes at least 434 ADD/SUB operations, 570 multiplications, and 99 divisions. This high arithmetic density makes it difficult to fit even one kernel onto a resource-limited FPGA card.
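As a rough illustration of the stencil shape described above, the following Python sketch enumerates a 13-point diamond, assuming it consists of all offsets within Manhattan distance 2 of the central point. The abstract states only the point count and the diamond shape, so the exact offsets and the helper name `diamond_offsets` are assumptions. The sketch also totals the lower-bound operation counts quoted above.

```python
def diamond_offsets(radius=2):
    """Return (di, dj) offsets with |di| + |dj| <= radius, centre included.

    Assumption: the diamond is a Manhattan-distance ball of the given radius.
    """
    return [(di, dj)
            for di in range(-radius, radius + 1)
            for dj in range(-radius, radius + 1)
            if abs(di) + abs(dj) <= radius]

offsets = diamond_offsets()
neighbours = [o for o in offsets if o != (0, 0)]

# Lower bounds on per-point operation counts, as quoted in the abstract.
ops = {"add_sub": 434, "mul": 570, "div": 99}
total_ops = sum(ops.values())

print(len(offsets), len(neighbours), total_ops)  # 13 12 1103
```

Under this assumption, each updated mesh point reads 12 neighbours and costs at least 1103 arithmetic operations, which gives a sense of why a fully pipelined kernel is resource-hungry.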
In this study, we first propose a hybrid algorithm that uses both CPUs and FPGAs to simulate the global shallow water equations (SWEs). In each computational patch, most of the complicated communication happens in the two layers of the outer boundary, whose values need to be exchanged with other patches. Therefore, we decompose each of the six patches into an outer part that comprises the two layers of outer boundary mesh points, and an inner part that comprises the rest. We assign the CPU to handle the communication and the stencil calculation of the outer part, and the FPGA to process the inner-part stencil. In this way, the FPGA and the CPU work simultaneously, and the CPU time for stencil computation and communication is hidden behind the FPGA time for the stencil. For the Virtex-6 SX475T used in our study, the original program in double precision would require 299% of the on-board LUTs, 283% of the FFs, and 189% of the DSPs, and therefore cannot fit into one FPGA. To fit the SWE kernel into one FPGA chip, we apply two algorithmic optimizations to the original design. The first replaces certain computations with lookup tables to reduce the usage of computation resources. The second identifies common factors in the algorithm and removes redundant computations. Together, these two optimizations reduce the resource usage by 20%. To further reduce the resource cost and fit the extremely complex stencil kernel into one FPGA chip, we optimize in the space of customizable number representations and precisions. For variables with a relatively small range, we replace double-precision numbers with fixed-point numbers. For the remaining variables, which have a wide dynamic range, we use floating-point numbers with mixed precision. Through mixed-precision floating-point and fixed-point arithmetic, we build a complex upwind stencil kernel on a single FPGA.
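The inner/outer decomposition described above can be sketched in a few lines of Python. This is an illustration only: the function name `split_patch` and the patch size `n = 256` are hypothetical, and the sketch counts mesh points without modelling the communication cost that the CPU also carries.

```python
def split_patch(n, halo=2):
    """Classify the mesh points of one n x n patch as outer or inner.

    Points within `halo` layers of the patch boundary form the outer part
    (CPU side in the scheme above); the rest form the inner part (FPGA side).
    """
    outer, inner = [], []
    for i in range(n):
        for j in range(n):
            # Distance (in layers) from the nearest patch edge.
            depth = min(i, j, n - 1 - i, n - 1 - j)
            (outer if depth < halo else inner).append((i, j))
    return outer, inner

n = 256  # hypothetical patch size, not taken from the abstract
outer, inner = split_patch(n)
outer_fraction = len(outer) / (n * n)

print(len(outer), len(inner), round(outer_fraction, 3))
```

For a two-layer boundary the outer part holds 8n - 16 points, so its share of the patch shrinks as n grows. That is consistent with the idea that the CPU work can overlap with the larger inner-part stencil on the FPGA.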
The design includes a highly efficient pipeline that can perform hundreds of floating-point and fixed-point arithmetic operations concurrently. Compared with our previous work in [1], the solution based on one FPGA acceleration card provides a 100-fold speedup over a 6-core CPU and a 4-fold speedup over a Tianhe-1A supercomputer node that consists of 12 CPU cores and one Fermi GPU.
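To make the fixed-point side of the mixed-precision approach more concrete, here is a minimal Python sketch of signed fixed-point quantization with saturation. The word length (24 bits) and the number of fractional bits (20) are hypothetical, as are the helper names; the abstract does not state the formats actually used. The value π/4 serves only as an example of a quantity with a small range.

```python
def to_fixed(x, frac_bits, total_bits):
    """Quantize x to a signed fixed-point integer.

    The result has `frac_bits` fractional bits and saturates at the limits
    of a `total_bits`-wide two's-complement word.
    """
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    q = round(x * scale)
    return max(lo, min(hi, q))

def from_fixed(q, frac_bits):
    """Convert a fixed-point integer back to a float."""
    return q / (1 << frac_bits)

x = 0.7853981633974483  # pi/4, an example small-range value
q = to_fixed(x, frac_bits=20, total_bits=24)
err = abs(from_fixed(q, 20) - x)

# The rounding error is bounded by half of one least-significant bit.
print(q, err <= 2 ** -21)
```

A narrower word needs fewer LUTs and DSPs than a double-precision unit, at the price of a bounded rounding error and a limited range, which is why the approach suits only variables whose range is known to be small.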