Abstract: Transformers have shown state-of-the-art performance on various applications and have recently emerged as a promising tool for surrogate modeling of partial differential equations (PDEs). Despite the introduction of linear-complexity attention, applying Transformers to problems with a large number of grid points can be numerically unstable and computationally expensive. In this work, we propose Factorized Transformer (FactFormer), which is based on an axial factorized kernel integral. Concretely, we introduce a learnable projection operator that decomposes the input function into multiple sub-functions with one-dimensional domains. These sub-functions are then evaluated and used to compute the instance-based kernel with an axial factorized scheme. We showcase that the proposed model is able to simulate 2D Kolmogorov flow on a $256\times 256$ grid and 3D smoke buoyancy on a $64\times64\times64$ grid with good accuracy and efficiency. The proposed factorized scheme can serve as a computationally efficient low-rank surrogate for the full attention scheme when dealing with multi-dimensional problems.
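
As a reading aid, the axial factorized kernel integral named in the abstract can be sketched for a 2D domain $\Omega_1\times\Omega_2$ as follows. The notation is an assumption introduced here and is not taken from the abstract: $v$ is the input (value) function, $z$ is the output, and $\kappa^{(1)}$, $\kappa^{(2)}$ are the per-axis kernels computed from the one-dimensional sub-functions.

$$
z(x_1, x_2) = \int_{\Omega_1} \kappa^{(1)}(x_1, \xi_1) \int_{\Omega_2} \kappa^{(2)}(x_2, \xi_2)\, v(\xi_1, \xi_2)\, d\xi_2\, d\xi_1
$$

Under this reading, the effective kernel on the full domain is the separable product $\kappa^{(1)}(x_1,\xi_1)\,\kappa^{(2)}(x_2,\xi_2)$, which is one way to understand the abstract's description of the scheme as a structured surrogate for full attention.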

What problem does this paper attempt to address?

The paper addresses two difficulties that arise when Transformer models are used for surrogate modeling of partial differential equations (PDEs) on problems with a large number of grid points:

1. **Numerical instability**: On high-resolution grids, stacking multiple attention layers can become numerically unstable as the number of grid points increases.
2. **High computational cost**: For multi-dimensional problems, the number of grid points grows exponentially with the dimension, so the attention matrix becomes very large and expensive to compute.

To address these problems, the authors propose the Factorized Transformer (FactFormer), an attention mechanism based on an axial factorized kernel integral. The input function is decomposed into multiple one-dimensional sub-functions, which are then used to compute an instance-based kernel along each axis. This reduces the computational cost and improves numerical stability while keeping good accuracy, which lets the model handle multi-dimensional problems with many grid points.

### Specific improvement measures

- **Axial factorized kernel integral**: A learnable projection operator decomposes the input function into multiple one-dimensional sub-functions. These sub-functions are used to compute a kernel on each axis, which avoids forming a large full-attention matrix directly.
- **Low-rank approximation**: The low-rank structure of the kernel matrix is exploited to reduce the computational complexity, which makes the approach suitable for multi-dimensional problems.
- **Numerical stability**: Factorizing the attention mechanism avoids the numerical instability that can occur when multiple attention layers are stacked on high-resolution grids.

### Experimental verification

The authors evaluate FactFormer on several benchmark problems, including:

- 2D Kolmogorov flow (256×256 grid)
- 3D smoke buoyancy (64×64×64 grid)

The experimental results show that FactFormer achieves good accuracy and efficiency on these problems and compares favorably with existing methods.

### Summary

The main contribution of this paper is a new attention mechanism, FactFormer, which addresses the numerical instability and high computational cost that existing Transformer models encounter in PDE surrogate modeling, and which provides a method for efficient and stable PDE simulation.
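
The axial scheme described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The projection used here (averaging over the other axis followed by a linear map), the `1/n` scaling, and all function and parameter names (`project_to_axis`, `axis_kernel`, `proj_x`, `wq_x`, and so on) are assumptions made for illustration; the source only states that a learnable projection yields one-dimensional sub-functions from which per-axis kernels are computed. The helper `kernel_entries` counts how many kernel entries would be stored explicitly in the full and axial cases.

```python
import math
import numpy as np


def project_to_axis(u, axis, weight):
    """Hypothetical projection: average over the other spatial axes, then a linear map.

    u has shape (n_1, ..., n_d, c); the result has shape (n_axis, d_model).
    """
    other = tuple(a for a in range(u.ndim - 1) if a != axis)
    pooled = u.mean(axis=other)
    return pooled @ weight


def axis_kernel(phi, w_q, w_k):
    """Instance-based 1D kernel from a sub-function phi of shape (n, d_model).

    Assumption: dot-product kernel scaled by 1/n as a simple quadrature weight.
    """
    q = phi @ w_q
    k = phi @ w_k
    return (q @ k.T) / phi.shape[0]


def factorized_attention_2d(u, v, params):
    """Apply one kernel per axis to v (shape (nx, ny, c)) instead of one full kernel."""
    kern_x = axis_kernel(project_to_axis(u, 0, params["proj_x"]),
                         params["wq_x"], params["wk_x"])
    kern_y = axis_kernel(project_to_axis(u, 1, params["proj_y"]),
                         params["wq_y"], params["wk_y"])
    # z[i, j, c] = sum_{a, b} kern_x[i, a] * kern_y[j, b] * v[a, b, c]
    z = np.einsum("ia,jb,abc->ijc", kern_x, kern_y, v)
    return z, kern_x, kern_y


def kernel_entries(grid):
    """Return (entries of one full kernel matrix, total entries of the per-axis kernels)."""
    return math.prod(grid) ** 2, sum(n * n for n in grid)


# Small random example; sizes and weights are arbitrary.
rng = np.random.default_rng(0)
nx, ny, c, d = 8, 6, 4, 5
u = rng.standard_normal((nx, ny, c))
v = rng.standard_normal((nx, ny, c))
shapes = {"proj_x": (c, d), "wq_x": (d, d), "wk_x": (d, d),
          "proj_y": (c, d), "wq_y": (d, d), "wk_y": (d, d)}
params = {name: rng.standard_normal(shape) for name, shape in shapes.items()}

z, kern_x, kern_y = factorized_attention_2d(u, v, params)
print(z.shape)
print(kernel_entries((256, 256)))
print(kernel_entries((64, 64, 64)))
```

In this sketch the result matches applying the Kronecker product of the two per-axis kernels to the flattened input, so the axial scheme acts as a structured stand-in for a single kernel over all grid points. For the grid sizes quoted above, the per-axis kernels hold far fewer entries than an explicitly stored full kernel would.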

Scalable Transformer for PDE Surrogate Modeling
