GeoCode-GPT: A Large Language Model for Geospatial Code Generation Tasks

Shuyang Hou, Zhangxiao Shen, Anqi Zhao, Jianyuan Liang, Zhipeng Gui, Xuefeng Guan, Rui Li, Huayi Wu
2024-10-23
Abstract: The increasing demand for spatiotemporal data and modeling tasks in geosciences has made geospatial code generation technology a critical factor in enhancing productivity. Although large language models (LLMs) have demonstrated potential in code generation tasks, they often encounter issues such as refusal to code or hallucination in geospatial code generation due to a lack of domain-specific knowledge and code corpora. To address these challenges, this paper presents and open-sources the GeoCode-PT and GeoCode-SFT corpora, along with the GeoCode-Eval evaluation dataset. Additionally, by leveraging QLoRA and LoRA for pretraining and fine-tuning, we introduce GeoCode-GPT-7B, the first LLM focused on geospatial code generation, fine-tuned from Code Llama-7B. Furthermore, we establish a comprehensive geospatial code evaluation framework, incorporating option matching, expert validation, and prompt engineering scoring for LLMs, and systematically evaluate GeoCode-GPT-7B using the GeoCode-Eval dataset. Experimental results show that GeoCode-GPT outperforms other models in multiple-choice accuracy by 9.1% to 32.1%, in code summarization ability by 1.7% to 25.4%, and in code generation capability by 1.2% to 25.1%. This paper provides a solution and empirical validation for enhancing LLMs' performance in geospatial code generation, extends the boundaries of domain-specific model applications, and offers valuable insights into unlocking their potential in geospatial code generation.
Software Engineering, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the difficulties that existing large language models (LLMs) face when generating geospatial code, such as "refusal to code" or "coding hallucination", which stem from a lack of domain-specific knowledge and code corpora. Specifically, the paper points out:

1. **Limitations of existing models**:
   - Although general large language models perform well in general code-generation tasks, they perform poorly in geospatial code-generation tasks.
   - These models usually generate incorrect code or are completely unable to generate the required geospatial code because they are unfamiliar with the specific data formats, operators, and platform functions in the geospatial domain.
2. **Lack of domain-specific corpora**:
   - Geospatial code involves complex spatiotemporal data and specific data formats (such as geographic coordinates, multi-dimensional rasters, multi-band spectra), as well as large-scale datasets (such as global remote-sensing datasets).
   - Since geospatial code is usually executed on dedicated platforms and uses proprietary internal indexing and naming conventions, general LLMs are often unfamiliar with such details, resulting in errors or illogical constructs in the generated code.

To solve these problems, the paper proposes the GeoCode-GPT-7B model, which it describes as the first large language model focused on geospatial code generation. In addition, the paper introduces the following resources and methods:

- **GeoCode-PT pre-training corpus**: It contains a large amount of multi-source data related to geospatial code, including code snippets and operator knowledge from platforms such as Google Earth Engine and ArcGIS.
- **GeoCode-SFT supervised fine-tuning corpus**: High-quality instruction data generated by structured traversal algorithms and the Self-Instruct framework, used to enhance the model's instruction understanding and code-generation ability.
- **GeoCode-Eval evaluation dataset**: It contains 3,000 multiple-choice questions, 500 code-generation tasks, and 500 code-summary tasks, used to comprehensively evaluate the model's performance.

Through these resources and methods, the paper aims to improve the accuracy and reliability of LLMs in geospatial code-generation tasks, thereby improving productivity in geospatial data analysis and modeling tasks.
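
The abstract lists option matching as one part of the evaluation framework, applied to the multiple-choice questions in GeoCode-Eval. The paper's actual scoring code is not shown here, so the following is only a minimal sketch of what option matching could look like. It assumes four options labeled A to D, free-text model replies, and a simple rule that takes the first standalone capital letter as the chosen option. The names `extract_option` and `option_accuracy` are hypothetical and do not come from the paper.

```python
import re
from typing import Iterable, Optional

# Assumption: options are labeled A-D and the model names its choice with a
# standalone capital letter, e.g. "The answer is B." or "(D) ...".
_OPTION_PATTERN = re.compile(r"\b([A-D])\b")


def extract_option(reply: str) -> Optional[str]:
    """Return the first standalone option letter in a reply, or None."""
    match = _OPTION_PATTERN.search(reply)
    return match.group(1) if match else None


def option_accuracy(replies: Iterable[str], answers: Iterable[str]) -> float:
    """Fraction of replies whose extracted option equals the reference answer."""
    pairs = list(zip(replies, answers))
    if not pairs:
        raise ValueError("no reply/answer pairs supplied")
    # A reply with no recognizable option (e.g. a refusal) counts as wrong.
    correct = sum(extract_option(reply) == answer for reply, answer in pairs)
    return correct / len(pairs)


if __name__ == "__main__":
    # Made-up replies for illustration only; these are not GeoCode-Eval items.
    replies = ["The answer is B.", "C", "I cannot decide", "(D) ee.Image"]
    answers = ["B", "C", "A", "A"]
    print(option_accuracy(replies, answers))
```

A rule this simple misreads replies that begin with the article "A", so a real harness would likely constrain the output format in the prompt or match a stricter pattern such as "Answer: X".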