Model-Parallel Fourier Neural Operators as Learned Surrogates for Large-Scale Parametric PDEs

Thomas J. Grady II, Rishi Khan, Mathias Louboutin, Ziyi Yin, Philipp A. Witte, Ranveer Chandra, Russell J. Hewett, Felix J. Herrmann
DOI: https://doi.org/10.1016/j.cageo.2023.105402
2023-02-02
Abstract: Fourier neural operators (FNOs) are a recently introduced neural network architecture for learning solution operators of partial differential equations (PDEs), which have been shown to perform significantly better than comparable deep learning approaches. Once trained, FNOs can achieve speed-ups of multiple orders of magnitude over conventional numerical PDE solvers. However, due to the high dimensionality of their input data and network weights, FNOs have so far only been applied to two-dimensional or small three-dimensional problems. To remove this limited problem-size barrier, we propose a model-parallel version of FNOs based on domain-decomposition of both the input data and network weights. We demonstrate that our model-parallel FNO is able to predict time-varying PDE solutions of over 2.6 billion variables on Perlmutter using up to 512 A100 GPUs and show an example of training a distributed FNO on the Azure cloud for simulating multiphase CO$_2$ dynamics in the Earth's subsurface.
Machine Learning; Distributed, Parallel, and Cluster Computing; Numerical Analysis
What problem does this paper attempt to address?
This paper addresses the computational cost and limited scalability of conventional numerical simulators for partial differential equations (PDEs). Finite-difference, finite-volume, and finite-element methods are highly accurate, but they demand large amounts of compute time and memory on large-scale, multi-dimensional problems. That cost limits their practicality in applications that require many simulations, such as uncertainty quantification, inverse problems, and numerical optimization.

To address this, the paper proposes a model-parallel method based on Fourier neural operators (FNOs) for learning and predicting solutions of large-scale parametric PDEs. FNOs are a recently introduced neural network architecture for learning PDE solution operators and have been shown to perform significantly better than comparable deep learning approaches. Once trained, FNOs can achieve speed-ups of multiple orders of magnitude over conventional solvers at inference time. However, because of the high dimensionality of the input data and network weights, FNOs have so far been applied only to two-dimensional or small three-dimensional problems.

To remove this barrier, the paper proposes a model-parallel FNO based on domain decomposition, which distributes the input data and network weights across multiple GPUs and predicts time-varying PDE solutions of over 2.6 billion variables. The paper also shows an example of training a distributed FNO on the Azure cloud to simulate multiphase CO₂ dynamics in the Earth's subsurface.

#### Main challenges

1. **Scalability**: Scaling deep surrogate models such as FNOs to the problem sizes required in practical applications, especially beyond small 2D or 3D time-dependent scenarios.
2. **Memory limitations**: Modern GPUs (such as NVIDIA Ampere GPUs) do not provide enough memory to process even a single training sample for medium-sized 3D problems (more than 64³ grid points).
3. **Parallel computing**: The network must be distributed across multiple GPUs to support training on large-scale 2D and 3D time-dependent problems.

#### Solutions

The paper uses domain decomposition to achieve model parallelism: all tensors (input, output, weight, and gradient tensors) are partitioned along their feature dimensions (space and time). The input data, weights, and hidden states are thereby distributed across multiple workers, which allows both the network and the data size to scale beyond the memory of a single device. Compared with existing 3D parallelism approaches, this method gives finer-grained control over how tensors are partitioned and is well suited to architectures such as FNOs, which differ from natural language processing (NLP) models. Using this approach, the authors ran experiments on the Perlmutter supercomputer with up to 512 A100 GPUs, demonstrating the potential of model-parallel FNOs for extremely large-scale PDE problems.
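
The domain decomposition described above can be illustrated for the multi-dimensional Fourier transform on which an FNO layer relies. What follows is a minimal single-process sketch, not the paper's implementation: the function name `slab_fft2`, the one-dimensional (slab) partitioning, and the use of a Python list of NumPy arrays in place of GPUs and an all-to-all exchange are all assumptions made for illustration. Each simulated worker only ever transforms along an axis it holds in full, and a repartition step swaps which axis is local.

```python
import numpy as np


def slab_fft2(x, n_workers):
    """Simulate a slab-decomposed 2D FFT on one process (illustrative only)."""
    # Step 1: partition rows across workers. Axis 1 is fully local to each
    # worker, so each one can transform along it without communication.
    row_slabs = np.array_split(x, n_workers, axis=0)
    row_slabs = [np.fft.fft(s, axis=1) for s in row_slabs]

    # Step 2: repartition so that each worker owns a block of columns.
    # On a real cluster this would be an all-to-all exchange; here it is
    # plain slicing and concatenation.
    col_slabs = [
        np.concatenate(
            [np.array_split(s, n_workers, axis=1)[w] for s in row_slabs],
            axis=0,
        )
        for w in range(n_workers)
    ]

    # Step 3: axis 0 is now fully local, so transform along it.
    col_slabs = [np.fft.fft(s, axis=0) for s in col_slabs]

    # Gather the pieces only so the result can be checked against NumPy.
    return np.concatenate(col_slabs, axis=1)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 12))
y = slab_fft2(x, n_workers=4)

# The decomposed transform should agree with the single-device reference.
assert np.allclose(y, np.fft.fft2(x))
```

The sketch gathers the result onto one process purely for verification. In a distributed setting the output would stay partitioned, so that no single device has to hold the full tensor.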