ClimateSet: A Large-Scale Climate Model Dataset for Machine Learning

Julia Kaltenborn, Charlotte E. E. Lange, Venkatesh Ramesh, Philippe Brouillard, Yaniv Gurwicz, Chandni Nagda, Jakob Runge, Peer Nowack, David Rolnick
2023-11-07
Abstract: Climate models have been key for assessing the impact of climate change and simulating future climate scenarios. The machine learning (ML) community has taken an increased interest in supporting climate scientists' efforts on various tasks such as climate model emulation, downscaling, and prediction tasks. Many of those tasks have been addressed on datasets created with single climate models. However, both the climate science and ML communities have suggested that to address those tasks at scale, we need large, consistent, and ML-ready climate model datasets. Here, we introduce ClimateSet, a dataset containing the inputs and outputs of 36 climate models from the Input4MIPs and CMIP6 archives. In addition, we provide a modular dataset pipeline for retrieving and preprocessing additional climate models and scenarios. We showcase the potential of our dataset by using it as a benchmark for ML-based climate model emulation. We gain new insights about the performance and generalization capabilities of the different ML models by analyzing their performance across different climate models. Furthermore, the dataset can be used to train an ML emulator on several climate models instead of just one. Such a "super emulator" can quickly project new climate change scenarios, complementing existing scenarios already provided to policymakers. We believe ClimateSet will create the basis needed for the ML community to tackle climate-related tasks at scale.
Machine Learning; Artificial Intelligence; Computational Engineering, Finance, and Science; Atmospheric and Oceanic Physics
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the shortcomings of climate model datasets for machine learning (ML) applications. Existing climate model datasets are typically built from a single climate model, which limits both the amount of training data and the generalization ability of ML models trained on them. These datasets are also difficult to acquire, preprocess, and keep consistent, which further slows the ML community's progress on climate-related tasks.

### Main Contributions

1. **Introduction of the ClimateSet Data Pipeline**:
   - Provides a method for retrieving and preprocessing climate model data from CMIP6 (climate model outputs) and Input4MIPs (climate model inputs) for climate-related ML tasks.
2. **Construction of the Core ClimateSet Dataset**:
   - Includes output data from 36 climate models, as well as emission field inputs for 4 different Shared Socioeconomic Pathway (SSP) scenarios and historical data.
   - The dataset is currently publicly available through the Digital Research Alliance of Canada.
3. **Comparison of State-of-the-Art Machine Learning Methods**:
   - Uses the ClimateSet dataset to compare different ML methods on the climate model emulation task, that is, emulating the response of temperature and precipitation to climate forcing agents, with results that are more reliable than, and qualitatively different from, previous work.

### Background and Motivation

- **Importance of Climate Models**: Climate models are crucial for assessing the impacts of climate change and simulating future climate scenarios.
- **Interest in Machine Learning**: The ML community has shown increasing interest in supporting the efforts of climate scientists, particularly in climate model emulation, downscaling, and prediction tasks.
- **Limitations of Existing Datasets**: Existing climate model datasets are typically based on a single climate model, so the field lacks large-scale, consistent, ML-ready data.

### Solution

- **Multi-Model Dataset**: ClimateSet includes data from multiple climate models and therefore captures the uncertainty between models, which is important for policy-making.
- **Large-Scale Training Data**: Provides enough training data to support the training of large-scale ML models.
- **Modular Data Pipeline**: Offers a modular data pipeline that can be extended to more climate models, ensemble members, variables, altitude layers, and spatial and temporal resolutions.

### Application Scenarios

- **Climate Projections**: Simulating future climate scenarios.
- **Climate Data Downscaling**: Improving the spatial resolution of climate data.
- **Extreme Weather Prediction**: Predicting extreme weather events under different warming scenarios.
- **Large-Scale Machine Learning Climate Models**: Developing large-scale climate prediction models.

### Conclusion

ClimateSet gives the ML community the foundation it needs to address climate-related tasks at scale. Because it includes data from multiple climate models, ClimateSet both provides sufficient training data and captures the uncertainty between models, which makes it more informative for climate policy-making.
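
As a rough illustration of what "uncertainty between different models" can mean in practice, the sketch below computes a per-grid-cell multi-model mean and standard deviation with NumPy. It is not taken from the ClimateSet code base. The function name `multi_model_spread`, the array layout `(n_models, n_lat, n_lon)`, the grid size, and the synthetic temperature values are all assumptions made for this example; only the count of 36 climate models comes from the text.

```python
import numpy as np


def multi_model_spread(fields):
    """Return the per-cell mean and sample standard deviation across models.

    `fields` is assumed to have shape (n_models, n_lat, n_lon), i.e. one
    2-D map per climate model, all on the same grid (hypothetical layout).
    """
    fields = np.asarray(fields, dtype=float)
    mean = fields.mean(axis=0)        # multi-model mean map
    std = fields.std(axis=0, ddof=1)  # inter-model spread (sample std)
    return mean, std


# Synthetic stand-in data, NOT real ClimateSet output: 36 models on a
# tiny 4 x 8 grid, with values scattered around 15 (e.g. degrees Celsius).
rng = np.random.default_rng(0)
fields = 15.0 + rng.normal(scale=1.0, size=(36, 4, 8))

ensemble_mean, ensemble_std = multi_model_spread(fields)
print(ensemble_mean.shape, ensemble_std.shape)
```

A large standard deviation in a grid cell indicates that the models disagree there. An emulator trained on a single climate model cannot expose this kind of disagreement.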