ClimateGPT: Towards AI Synthesizing Interdisciplinary Research on Climate Change

David Thulke, Yingbo Gao, Petrus Pelser, Rein Brune, Rricha Jalota, Floris Fok, Michael Ramos, Ian van Wyk, Abdallah Nasir, Hayden Goldstein, Taylor Tragemann, Katie Nguyen, Ariana Fowler, Andrew Stanco, Jon Gabriel, Jordan Taylor, Dean Moro, Evgenii Tsymbalov, Juliette de Waal, Evgeny Matusov, Mudar Yaghi, Mohammad Shihadah, Hermann Ney, Christian Dugast, Jonathan Dotan, Daniel Erasmus
2024-01-18
Abstract: This paper introduces ClimateGPT, a model family of domain-specific large language models that synthesize interdisciplinary research on climate change. We trained two 7B models from scratch on a science-oriented dataset of 300B tokens. For the first model, the 4.2B domain-specific tokens were included during pre-training and the second was adapted to the climate domain after pre-training. Additionally, ClimateGPT-7B, 13B and 70B are continuously pre-trained from Llama 2 on a domain-specific dataset of 4.2B tokens. Each model is instruction fine-tuned on a high-quality and human-generated domain-specific dataset that has been created in close cooperation with climate scientists. To reduce the number of hallucinations, we optimize the model for retrieval augmentation and propose a hierarchical retrieval strategy. To increase the accessibility of our model to non-English speakers, we propose to make use of cascaded machine translation and show that this approach can perform comparably to natively multilingual models while being easier to scale to a large number of languages. Further, to address the intrinsic interdisciplinary aspect of climate change we consider different research perspectives. Therefore, the model can produce in-depth answers focusing on different perspectives in addition to an overall answer. We propose a suite of automatic climate-specific benchmarks to evaluate LLMs. On these benchmarks, ClimateGPT-7B performs on par with the ten times larger Llama-2-70B Chat model while not degrading results on general domain benchmarks. Our human evaluation confirms the trends we saw in our benchmarks. All models were trained and evaluated using renewable energy and are released publicly.
Machine Learning, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper addresses the problem of synthesizing information across interdisciplinary research on climate change and proposes ClimateGPT, a family of large language models (LLMs) built specifically for the climate change domain. The goal of ClimateGPT is to facilitate communication and collaboration in the field by integrating knowledge from several disciplines, including the natural sciences, economics, and the social sciences. To achieve this, the research team trained multiple versions of the model on a large science-oriented dataset so that it can understand and generate high-quality text related to climate change.

Specifically, ClimateGPT employs the following strategies and techniques:

1. **Model Architecture**: ClimateGPT is based on a decoder-only Transformer architecture, similar to other large language models such as Llama-2.
2. **Pre-training Data**: The research team constructed a dataset of 300 billion tokens drawn from multiple sources, such as news, publications, modern books, patents, Wikipedia, policy and finance, and science. In addition, a climate-specific dataset of 4.2 billion tokens was collected for pre-training.
3. **Training from Scratch**: The research team also trained models from scratch to retain complete control over the training data, thereby reducing the influence of bias and inaccurate information.
4. **Continual Pre-training**: Existing large language models (such as Llama-2) were further pre-trained on the domain-specific data to strengthen their understanding of the climate change field.
5. **Instruction Fine-Tuning**: The models were instruction fine-tuned (IFT) to improve their ability to follow user instructions, which improves performance in practical applications.
6. **Retrieval-Augmented Generation**: Retrieval-augmented generation (RAG) draws on high-quality climate change resources to improve the factual accuracy of the generated content.
7. **Multilingual Support**: Multilingual support is provided through a cascaded machine translation (MT) approach, enabling non-English users to benefit from the model.

In summary, the main objective of the paper is to develop large language models that can effectively process and generate high-quality text on climate change, thereby promoting knowledge sharing and decision support within this domain.
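To make the interplay of retrieval augmentation and cascaded machine translation more concrete, the following is a minimal Python sketch of how such a pipeline could be wired together. It is not the authors' implementation: the function names (`answer`, `build_prompt`), the prompt layout, and the toy stand-ins for the translation system, retriever, and language model are all assumptions made for illustration, and the paper's hierarchical retrieval strategy is not modeled here.

```python
from typing import Callable, List

# Hypothetical component interfaces (assumptions, not from the paper):
Translate = Callable[[str, str, str], str]  # (text, source_lang, target_lang) -> text
Retrieve = Callable[[str, int], List[str]]  # (english_query, top_k) -> passages
Generate = Callable[[str], str]             # prompt -> english answer


def build_prompt(question: str, passages: List[str]) -> str:
    """Prepend numbered retrieved passages to the question (illustrative layout)."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def answer(
    question: str,
    user_lang: str,
    translate: Translate,
    retrieve: Retrieve,
    generate: Generate,
    top_k: int = 3,
) -> str:
    """Cascaded MT around an English-only, retrieval-augmented model."""
    # 1. Translate the user's question into English (skipped for English input).
    english_q = question if user_lang == "en" else translate(question, user_lang, "en")
    # 2. Retrieve supporting passages for the English question.
    passages = retrieve(english_q, top_k)
    # 3. Generate an English answer conditioned on the retrieved passages.
    english_a = generate(build_prompt(english_q, passages))
    # 4. Translate the answer back into the user's language.
    return english_a if user_lang == "en" else translate(english_a, "en", user_lang)


# Toy stand-ins so the sketch runs without any external model or service.
def toy_translate(text: str, src: str, tgt: str) -> str:
    return f"<{src}->{tgt}>{text}"


def toy_retrieve(query: str, top_k: int) -> List[str]:
    corpus = ["Passage A", "Passage B", "Passage C", "Passage D"]
    return corpus[:top_k]


def toy_generate(prompt: str) -> str:
    # Counts the "[n]" passage markers that build_prompt inserted.
    return f"answer based on {prompt.count('[')} passages"


if __name__ == "__main__":
    print(answer("Was ist CO2?", "de", toy_translate, toy_retrieve, toy_generate, top_k=2))
```

Because translation sits entirely outside the model in this arrangement, supporting another language only requires a translation system for that language pair, which is consistent with the abstract's remark that the cascaded approach is easier to scale to many languages.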