Abstract: The increasing demand for spatiotemporal data and modeling tasks in geosciences has made geospatial code generation technology a critical factor in enhancing productivity. Although large language models (LLMs) have demonstrated potential in code generation tasks, they often encounter issues such as refusal to code or hallucination in geospatial code generation due to a lack of domain-specific knowledge and code corpora. To address these challenges, this paper presents and open-sources the GeoCode-PT and GeoCode-SFT corpora, along with the GeoCode-Eval evaluation dataset. Additionally, by leveraging QLoRA and LoRA for pretraining and fine-tuning, we introduce GeoCode-GPT-7B, the first LLM focused on geospatial code generation, fine-tuned from Code Llama-7B. Furthermore, we establish a comprehensive geospatial code evaluation framework, incorporating option matching, expert validation, and prompt engineering scoring for LLMs, and systematically evaluate GeoCode-GPT-7B using the GeoCode-Eval dataset. Experimental results show that GeoCode-GPT outperforms other models in multiple-choice accuracy by 9.1% to 32.1%, in code summarization ability by 1.7% to 25.4%, and in code generation capability by 1.2% to 25.1%. This paper provides a solution and empirical validation for enhancing LLMs' performance in geospatial code generation, extends the boundaries of domain-specific model applications, and offers valuable insights into unlocking their potential in geospatial code generation.

What problem does this paper attempt to address?

The paper addresses the difficulties that existing large language models (LLMs) face when generating geospatial code, such as "refusal to code" and "coding hallucination", which stem from a lack of domain-specific knowledge and code corpora. Specifically, the paper points out:

1. **Limitations of existing models**:
   - Although general-purpose LLMs perform well on general code generation tasks, they perform poorly on geospatial code generation tasks.
   - These models often generate incorrect code, or fail to generate the required geospatial code at all, because they are unfamiliar with the specific data formats, operators, and platform functions of the geospatial domain.
2. **Lack of domain-specific corpora**:
   - Geospatial code involves complex spatiotemporal data and specific data formats (such as geographic coordinates, multi-dimensional rasters, and multi-band spectra), as well as large-scale datasets (such as global remote sensing datasets).
   - Because geospatial code is usually executed on dedicated platforms that use proprietary internal indexing and naming conventions, general LLMs are often unfamiliar with such details, which leads to errors or implausible constructs in the generated code.

To solve these problems, the paper proposes GeoCode-GPT-7B, which it describes as the first large language model dedicated to geospatial code generation. The paper also introduces the following resources:

- **GeoCode-PT pre-training corpus**: A large body of multi-source data related to geospatial code, including code snippets and operator knowledge from platforms such as Google Earth Engine and ArcGIS.
- **GeoCode-SFT supervised fine-tuning corpus**: High-quality instruction data generated by structured traversal algorithms and the Self-Instruct framework, used to strengthen the model's instruction understanding and code generation ability.
- **GeoCode-Eval evaluation dataset**: 3,000 multiple-choice questions, 500 code generation tasks, and 500 code summarization tasks, used to evaluate the model's performance comprehensively.

Through these resources and methods, the paper aims to improve the accuracy and reliability of LLMs in geospatial code generation tasks, and thereby to raise productivity in geospatial data analysis and modeling.
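To make the "option matching" part of the evaluation concrete, the following is a minimal sketch of how multiple-choice accuracy could be computed from free-text model responses. It is not the paper's implementation: the function names `extract_option` and `option_accuracy`, the four-option A-D format, and the regular expression are all assumptions made for illustration.

```python
import re
from typing import Optional, Sequence

# Assumption: each question has options labelled A-D and the model names its
# choice as a standalone uppercase letter somewhere in the response.
_OPTION_RE = re.compile(r"\b([A-D])\b")


def extract_option(response: str) -> Optional[str]:
    """Return the first standalone uppercase option letter, or None.

    Limitation of this sketch: a response that starts with the article "A"
    would be misread as option A.
    """
    match = _OPTION_RE.search(response)
    return match.group(1) if match else None


def option_accuracy(responses: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of responses whose extracted option equals the reference answer."""
    if len(responses) != len(answers):
        raise ValueError("responses and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(extract_option(r) == a for r, a in zip(responses, answers))
    return correct / len(answers)


# Hypothetical responses, including a refusal that yields no option at all.
responses = [
    "The answer is B.",
    "C",
    "I cannot write this code.",
    "Option (D) is correct",
]
answers = ["B", "C", "A", "A"]
print(option_accuracy(responses, answers))
```

In this sketch a refusal simply counts as a wrong answer, since no option letter can be extracted from it. A real harness would likely need a stricter answer format to avoid false matches.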

GeoCode-GPT: A Large Language Model for Geospatial Code Generation Tasks
