HouYi: An open-source large language model specially designed for renewable energy and carbon neutrality field

Mingliang Bai, Zhihao Zhou, Ruidong Wang, Yusheng Yang, Zizhen Qin, Yunxiao Chen, Chunjin Mu, Jinfu Liu, Daren Yu
2023-07-31
Abstract: Renewable energy is important for achieving the goal of carbon neutrality. With the great success of Large Language Models (LLMs) like ChatGPT in automatic content generation, LLMs are playing an increasingly important role. However, no LLM has been designed specifically for renewable energy, and no renewable energy dataset exists for training LLMs. Therefore, this paper publishes the first open-source Renewable Energy Academic Paper (REAP) dataset for non-commercial LLM research on renewable energy. The REAP dataset was collected by searching the titles and abstracts of 1,168,970 academic publications from Web of Science. Based on the REAP dataset, the HouYi model, the first LLM for renewable energy, was developed by fine-tuning general LLMs. HouYi demonstrates strong academic paragraph generation ability in the renewable energy field. Experiments show that its ability to generate academic papers on renewable energy is comparable to ChatGPT, slightly better than Claude, ERNIE Bot and SparkDesk, and significantly better than the open-source LLaMA-13B model.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses two gaps: the lack of large language models (LLMs) designed specifically for the renewable energy and carbon neutrality fields, and the lack of open-source datasets for training them. Specifically:

1. **Lack of domain-specific LLMs**: General-purpose LLMs like ChatGPT perform well across many fields, but they are not optimized for renewable energy and carbon neutrality. As a result, they may not give the most accurate or relevant answers to problems specific to this field.
2. **Lack of open-source datasets**: No publicly available renewable energy dataset exists for training LLMs, which limits researchers' ability to develop and improve language models for this field.

To solve these problems, the paper proposes the following:

- **Construct the REAP dataset**: The authors collected the titles and abstracts of 1,168,970 academic papers from the Web of Science database and built the first open-source Renewable Energy Academic Paper (REAP) dataset for non-commercial LLM research.
- **Develop the HouYi model**: Based on the REAP dataset, the authors fine-tuned general LLMs (such as ChatGLM-6B) to develop HouYi, the first LLM designed specifically for the renewable energy field.

Through these contributions, the paper aims to make academic writing in the renewable energy and carbon neutrality fields more efficient and to promote research and development in this area.
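
To make the dataset-to-fine-tuning step concrete, here is a minimal sketch of how title and abstract records might be turned into prompt/response pairs for supervised fine-tuning. It assumes each record is a dictionary with `title` and `abstract` keys. The prompt template, the function names, and the JSON Lines output are illustrative assumptions, not the format the authors describe, and the sample record is invented placeholder text rather than REAP data.

```python
import json

# Hypothetical prompt template; the wording actually used to train HouYi
# is not given in this summary.
PROMPT_TEMPLATE = "Write an academic paragraph for a paper titled: {title}"


def to_training_pair(record):
    """Turn one {'title', 'abstract'} record into a prompt/response pair."""
    title = record["title"].strip()
    abstract = record["abstract"].strip()
    return {"prompt": PROMPT_TEMPLATE.format(title=title), "response": abstract}


def build_pairs(records):
    """Convert records, skipping any with a missing or empty title or abstract."""
    pairs = []
    for record in records:
        title = (record.get("title") or "").strip()
        abstract = (record.get("abstract") or "").strip()
        if not title or not abstract:
            continue
        pairs.append(to_training_pair(record))
    return pairs


def to_jsonl(pairs):
    """Serialize pairs as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(pair, ensure_ascii=False) for pair in pairs)


if __name__ == "__main__":
    # Placeholder records for illustration only (not taken from REAP).
    sample_records = [
        {"title": "Example title on wind power forecasting",
         "abstract": "Example abstract text."},
        {"title": "Record without an abstract", "abstract": ""},
    ]
    print(to_jsonl(build_pairs(sample_records)))
```

Dropping incomplete records before training is a design choice in this sketch. A real pipeline would likely also deduplicate entries and filter by language or length.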