Toponym resolution leveraging lightweight and open-source large language models and geo-knowledge
Xuke Hu, Jens Kersten, Friederike Klan, Sheikh Mastura Farzana
a Institute of Data Science, German Aerospace Center (DLR), Jena, Germany
b Institute of Software Technology, German Aerospace Center (DLR), Cologne, Germany

Xuke Hu is a permanent researcher at the DLR's Institute of Data Science. He earned his PhD in Geoinformation from Heidelberg University in 2020. His primary research interests include GeoAI, VGI, indoor localization and mapping, with a recent focus on the extraction and analysis of geographic information embedded in big text data, such as news articles, social media data, and historical documents.

Jens Kersten has a background in geodesy, remote sensing and computer vision. At DLR's Institute of Data Science, he leads a group focusing on multimodal and geospatial information retrieval. His research interests focus on acquiring, analyzing and linking textual data from heterogeneous sources to obtain application-specific information from big data for monitoring and decision making.

Friederike Klan is heading the Data Acquisition and Mobilization Department at the DLR Institute of Data Science. She has a scientific background in computer science with a specialization on data acquisition, preparation, management and provision. The focus of her work is on the development of innovative methods for collecting data, ensuring its quality, making it usable and deriving information from it, from intelligent data acquisition with mobile applications to the development of effective approaches for sharing data in data ecosystems.

Sheikh Mastura Farzana is a researcher at the German Aerospace Center, focusing on Geographic Information Retrieval and associated technologies. Her research interests encompass a range of topics including Geoparsing Web Data, Scalable Geographic Information Retrieval, and Geoparsing Multilingual Data. She holds a Master's degree in Computer Science from the University of Bonn, Germany, and a Bachelor's degree in Computer Science and Engineering from BRAC University, Bangladesh.
DOI: https://doi.org/10.1080/13658816.2024.2405182
2024-09-25
International Journal of Geographical Information Science
Abstract: Toponym resolution is crucial for extracting geographic information from natural language texts, such as social media posts and news articles. Current methods have advanced considerably, including state-of-the-art deep learning solutions like GENRE and a sophisticated voting system that integrates seven individual methods, but their accuracy still needs to improve. To this end, we propose a novel method that combines lightweight, open-source large language models with geo-knowledge. Specifically, we first fine-tune Mistral (7B), Baichuan2 (7B), Llama2 (7B & 13B), and Falcon (7B) to estimate an unambiguous reference for each toponym (e.g., city, state, country) given its context. Subsequently, we correct inaccuracies in the generated references and determine their geo-coordinates by sequentially querying the GeoNames, Nominatim, and ArcGIS geocoders until one returns a successful geocoding result. Our methods outperform 20 existing methods across seven challenging datasets comprising 83,365 toponyms worldwide, with the Mistral-based method leading, followed by the Baichuan2-, Llama2-, and Falcon-based methods. Specifically, the Mistral-based method achieves an Accuracy@161km of 0.91, surpassing GENRE, the best individual method, by 17% and the seven-method composite voting system by 7%. Moreover, our methods are computationally efficient: they run on a single general-purpose GPU, have modest memory requirements (14 GB for 7B models and 27 GB for 13B models), and exceed both GENRE and the voting system in inference speed.
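The sequential geocoder fallback and the Accuracy@161km metric mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each geocoder can be modelled as a function returning either a (latitude, longitude) pair or nothing, that distances are great-circle distances on a spherical Earth, and that Accuracy@161km is the share of toponyms resolved to within 161 km of the gold location. All function names are hypothetical, and the three toy lookup tables stand in for the real GeoNames, Nominatim, and ArcGIS services so that the example runs offline. The coordinates are approximate and for illustration only.

```python
import math
from typing import Callable, Iterable, Optional, Sequence, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in decimal degrees
Geocoder = Callable[[str], Optional[Coord]]


def geocode_with_fallback(reference: str, geocoders: Iterable[Geocoder]) -> Optional[Coord]:
    """Query each geocoder in order and return the first successful result."""
    for geocode in geocoders:
        result = geocode(reference)
        if result is not None:
            return result
    return None  # every geocoder failed


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance in km (assumption: spherical Earth, radius 6371 km)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def accuracy_at_k(predicted: Sequence[Optional[Coord]],
                  gold: Sequence[Coord],
                  k_km: float = 161.0) -> float:
    """Share of toponyms predicted within k_km of gold; unresolved ones count as misses."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for p, g in zip(predicted, gold)
               if p is not None and haversine_km(p, g) <= k_km)
    return hits / len(gold)


# Hypothetical offline stand-ins for the three geocoding services.
toy_geonames: Geocoder = {"Paris, Texas, United States": (33.66, -95.56)}.get
toy_nominatim: Geocoder = {"Jena, Thuringia, Germany": (50.93, 11.59)}.get
toy_arcgis: Geocoder = lambda reference: None  # always fails in this toy setup

GEOCODERS = [toy_geonames, toy_nominatim, toy_arcgis]

# References as a fine-tuned model might generate them (illustrative strings).
references = ["Paris, Texas, United States", "Jena, Thuringia, Germany", "Nowhere"]
gold = [(33.6609, -95.5555), (50.9271, 11.5892), (0.0, 0.0)]

predicted = [geocode_with_fallback(r, GEOCODERS) for r in references]
score = accuracy_at_k(predicted, gold)
print(f"Accuracy@161km = {score:.2f}")
```

In this toy run the second reference is resolved only by the second geocoder and the third by none, which shows why the ordering of the fallback chain matters: an earlier service that returns a result, even a poor one, prevents later services from being consulted.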
Geography, Physical; Computer Science, Information Systems; Information Science & Library Science