Abstract: This paper introduces ClimateGPT, a model family of domain-specific large language models that synthesize interdisciplinary research on climate change. We trained two 7B models from scratch on a science-oriented dataset of 300B tokens. For the first model, the 4.2B domain-specific tokens were included during pre-training and the second was adapted to the climate domain after pre-training. Additionally, ClimateGPT-7B, 13B and 70B are continuously pre-trained from Llama 2 on a domain-specific dataset of 4.2B tokens. Each model is instruction fine-tuned on a high-quality and human-generated domain-specific dataset that has been created in close cooperation with climate scientists. To reduce the number of hallucinations, we optimize the model for retrieval augmentation and propose a hierarchical retrieval strategy. To increase the accessibility of our model to non-English speakers, we propose to make use of cascaded machine translation and show that this approach can perform comparably to natively multilingual models while being easier to scale to a large number of languages. Further, to address the intrinsic interdisciplinary aspect of climate change we consider different research perspectives. Therefore, the model can produce in-depth answers focusing on different perspectives in addition to an overall answer. We propose a suite of automatic climate-specific benchmarks to evaluate LLMs. On these benchmarks, ClimateGPT-7B performs on par with the ten times larger Llama-2-70B Chat model while not degrading results on general domain benchmarks. Our human evaluation confirms the trends we saw in our benchmarks. All models were trained and evaluated using renewable energy and are released publicly.

What problem does this paper attempt to address?

The paper addresses the problem of synthesizing information across the interdisciplinary research on climate change, and proposes ClimateGPT, a family of large language models (LLMs) built specifically for the climate change domain. The goal of ClimateGPT is to support communication and collaboration in this field by integrating knowledge from several domains, including the environmental and natural sciences, economics, and the social sciences. To this end, the research team trained multiple versions of the model on large scientific corpora so that it can understand and generate high-quality text related to climate change. Specifically, ClimateGPT employs the following strategies and techniques:

1. **Model Architecture**: ClimateGPT is based on a decoder-style Transformer architecture, similar to other large language models like Llama-2.
2. **Pre-training Data**: The research team constructed a comprehensive dataset of 300 billion tokens, covering sources such as news, publications, modern books, patents, Wikipedia, policy and finance, and science. In addition, a climate change-specific subset of 4.2 billion tokens was collected for pre-training.
3. **Training from Scratch**: The research team explored training models from scratch to keep complete control over the training data, thereby reducing the influence of bias and inaccurate information.
4. **Continual Pre-training**: Existing large language models (such as Llama-2) were further pre-trained on the domain-specific data to strengthen their understanding of the climate change field.
5. **Instruction Fine-Tuning**: The models were instruction fine-tuned (IFT) to improve their ability to follow user instructions, which in turn improves performance in practical applications.
6. **Retrieval-Augmented Generation**: Retrieval-augmented generation (RAG) is used to draw on high-quality climate change resources and improve the factual accuracy of the generated content.
7. **Multilingual Support**: Multilingual support is provided through a cascaded machine translation (MT) approach, enabling non-English users to benefit from the model.

In summary, the main objective of the paper is to develop large language models capable of effectively processing and generating high-quality text in the field of climate change, thereby promoting knowledge sharing and decision support within this domain.
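As a rough illustration of how retrieval augmentation and cascaded machine translation can fit together, the following Python sketch wraps an English-only answer step with translation on the way in and on the way out. It is a minimal sketch under stated assumptions, not the paper's implementation: the names `retrieve`, `build_prompt`, and `cascaded_answer` are hypothetical, the word-overlap retriever is a toy stand-in (it is not the hierarchical retrieval strategy the paper proposes), and the `translate` and `generate` callables are placeholders for whatever MT system and language model are actually used.

```python
from typing import Callable, List


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever (assumption): rank passages by word overlap with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda passage: len(query_words & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, passages: List[str]) -> str:
    """Put the retrieved passages in front of the question as context."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def cascaded_answer(
    question: str,
    user_lang: str,
    translate: Callable[[str, str, str], str],  # hypothetical: (text, src, tgt) -> text
    generate: Callable[[str], str],             # hypothetical: prompt -> English answer
    corpus: List[str],
) -> str:
    """Translate to English, answer with retrieval, translate the answer back."""
    is_english = user_lang == "en"
    english_q = question if is_english else translate(question, user_lang, "en")
    prompt = build_prompt(english_q, retrieve(english_q, corpus))
    english_a = generate(prompt)
    return english_a if is_english else translate(english_a, "en", user_lang)
```

Because the model itself only ever sees English text, supporting another language in this arrangement means adding a translation direction rather than retraining the model, which is consistent with the abstract's claim that the cascaded approach is easier to scale to many languages.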

ClimateGPT: Towards AI Synthesizing Interdisciplinary Research on Climate Change
