Sim2Real for Environmental Neural Processes

Jonas Scholz, Tom R. Andersson, Anna Vaughan, James Requeima, Richard E. Turner
2023-10-31
Abstract: Machine learning (ML)-based weather models have recently undergone rapid improvements. These models are typically trained on gridded reanalysis data from numerical data assimilation systems. However, reanalysis data comes with limitations, such as assumptions about physical laws and low spatiotemporal resolution. The gap between reanalysis and reality has sparked growing interest in training ML models directly on observations such as weather stations. Modelling scattered and sparse environmental observations requires scalable and flexible ML architectures, one of which is the convolutional conditional neural process (ConvCNP). ConvCNPs can learn to condition on both gridded and off-the-grid context data to make uncertainty-aware predictions at target locations. However, the sparsity of real observations presents a challenge for data-hungry deep learning models like the ConvCNP. One potential solution is 'Sim2Real': pre-training on reanalysis and fine-tuning on observational data. We analyse Sim2Real with a ConvCNP trained to interpolate surface air temperature over Germany, using varying numbers of weather stations for fine-tuning. On held-out weather stations, Sim2Real training substantially outperforms the same model architecture trained only with reanalysis data or only with station data, showing that reanalysis data can serve as a stepping stone for learning from real observations. Sim2Real could thus enable more accurate models for weather prediction and climate monitoring.
Machine Learning, Atmospheric and Oceanic Physics
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper explores how the Sim2Real method can improve ML weather models that learn from observational data. Specifically:

1. **Background and Challenges**:
   - Current machine learning (ML) weather models are typically trained on reanalysis data, which is generated by numerical data assimilation systems.
   - Reanalysis data has limitations, such as built-in assumptions about physical laws and low spatiotemporal resolution.
   - Real-world environmental observations (e.g., weather station data) are sparse and scattered, which makes it hard to train deep learning models on them directly.
2. **Sim2Real Method**:
   - The method combines pre-training and fine-tuning.
   - The model is first pre-trained on abundant reanalysis data and then fine-tuned on limited real observational data.
   - The model architecture is the Convolutional Conditional Neural Process (ConvCNP), which can condition on both gridded and off-the-grid data and produces predictions with uncertainty estimates.
3. **Experimental Setup**:
   - The experiment interpolates surface air temperature over Germany, with ERA5 reanalysis as the simulated data and station data from the German Weather Service (DWD) as the real data.
   - Model performance is evaluated for varying numbers of weather stations (N_stations) and time slices (N_times).
4. **Key Findings**:
   - With a moderate amount of data (e.g., N_stations = 500), Sim2Real significantly outperforms training on reanalysis data only or on observational data only.
   - Through fine-tuning, the model learns higher-frequency spatial features and better captures weather phenomena at short length scales.
   - With very sparse data (e.g., N_stations = 20 or 100), the advantage of Sim2Real is not apparent.
5. **Conclusions and Future Work**:
   - Sim2Real performs best with a moderate amount of data, which helps address the shortage of real observational data.
   - Future research can explore transfer from data-rich regions (e.g., Germany) to data-sparse regions (e.g., the Himalayas or Antarctica) to alleviate data gaps and socioeconomic disparities.
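
The pre-train-then-fine-tune recipe summarised above can be illustrated on a toy problem. The sketch below is not the paper's code and does not reproduce its results. A three-parameter linear Gaussian model stands in for the ConvCNP, and two synthetic data sources stand in for ERA5 and the DWD stations. All names (`make_data`, `train`, `pretrained`, `sim2real`) and all numeric settings (slopes, noise levels, learning rates, step counts) are assumptions made for illustration. The one idea it shares with the summary is the training order: fit a probabilistic model on abundant but biased "simulated" data, then continue training the same parameters on a small set of "real" observations and evaluate on held-out real data.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, slope, intercept, noise_std):
    """Hypothetical 1-D data source: y = slope * x + intercept + Gaussian noise."""
    x = rng.uniform(-2.0, 2.0, size=n)
    y = slope * x + intercept + rng.normal(0.0, noise_std, size=n)
    return x, y

def gaussian_nll(params, x, y):
    """Mean negative log-likelihood of y under N(w * x + b, exp(log_var))."""
    w, b, log_var = params
    resid = y - (w * x + b)
    return 0.5 * np.mean(log_var + resid**2 * np.exp(-log_var) + np.log(2.0 * np.pi))

def nll_grad(params, x, y):
    """Analytic gradient of gaussian_nll with respect to (w, b, log_var)."""
    w, b, log_var = params
    resid = y - (w * x + b)
    inv_var = np.exp(-log_var)
    d_w = -np.mean(resid * x) * inv_var
    d_b = -np.mean(resid) * inv_var
    d_log_var = 0.5 * (1.0 - np.mean(resid**2) * inv_var)
    return np.array([d_w, d_b, d_log_var])

def train(params, x, y, lr, steps):
    """Plain gradient descent; returns new parameters, leaves the input untouched."""
    params = np.array(params, dtype=float)
    for _ in range(steps):
        params = params - lr * nll_grad(params, x, y)
    return params

# "Simulation": abundant but systematically biased (stand-in for reanalysis).
x_sim, y_sim = make_data(2000, slope=2.0, intercept=0.5, noise_std=0.3)
# "Reality": a handful of observations for fine-tuning, plus held-out points.
x_real, y_real = make_data(15, slope=2.3, intercept=1.0, noise_std=0.3)
x_test, y_test = make_data(500, slope=2.3, intercept=1.0, noise_std=0.3)

# Stage 1: pre-train on simulated data from a neutral initialisation.
pretrained = train([0.0, 0.0, 0.0], x_sim, y_sim, lr=0.05, steps=3000)
# Stage 2: fine-tune the pre-trained parameters on the few real observations,
# using a smaller learning rate (an assumed, commonly used choice).
sim2real = train(pretrained, x_real, y_real, lr=0.01, steps=1000)

# Compare both models on held-out "real" data.
nll_sim_only = gaussian_nll(pretrained, x_test, y_test)
nll_sim2real = gaussian_nll(sim2real, x_test, y_test)
print(f"held-out NLL, simulation only: {nll_sim_only:.3f}")
print(f"held-out NLL, Sim2Real:        {nll_sim2real:.3f}")
```

In this toy setting the fine-tuned model should score a lower held-out negative log-likelihood than the simulation-only model, because fine-tuning corrects the simulator's bias in slope and intercept. The comparison with a model trained on observations alone is deliberately left out: a three-parameter model is not data-hungry, so that regime would not behave like the ConvCNP case discussed in the summary.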