Zhiyang Dou, Zipeng Wang, Xumeng Han, Chenhui Qiang, Kuiran Wang, Guorong Li, Zhibei Huang, Zhenjun Han
Abstract: Global geolocation, which seeks to predict the geographical location of images captured anywhere in the world, is one of the most challenging tasks in the field of computer vision. In this paper, we introduce an innovative interactive global geolocation assistant named GaGA, built upon the flourishing large vision-language models (LVLMs). GaGA uncovers geographical clues within images and combines them with the extensive world knowledge embedded in LVLMs to determine the geolocations while also providing justifications and explanations for the prediction results. We further designed a novel interactive geolocation method that surpasses traditional static inference approaches. It allows users to intervene, correct, or provide clues for the predictions, making the model more flexible and practical. The development of GaGA relies on the newly proposed Multi-modal Global Geolocation (MG-Geo) dataset, a comprehensive collection of 5 million high-quality image-text pairs. GaGA achieves state-of-the-art performance on the GWS15k dataset, improving accuracy by 4.57% at the country level and 2.92% at the city level, setting a new benchmark. These advancements represent a significant leap forward in developing highly accurate, interactive geolocation systems with global applicability.
What problem does this paper attempt to address?
This paper addresses two weaknesses of global image geolocation: limited accuracy and the lack of interactivity. The authors develop an interactive global geolocation assistant, GaGA, that predicts where an image was taken and explains and justifies each prediction. The key challenges and solutions described in the paper are as follows:
### Main Problems
1. **Lack of High-Quality Geo-related Data**: A main obstacle for existing large vision-language models (LVLMs) in geolocation tasks is the lack of high-quality, geo-related image-text pair data.
2. **Limitations of Traditional Methods**:
- **Retrieval Methods**: They rely on matching similar images in geotagged databases, but it is difficult to ensure the data diversity and integrity of these databases.
- **Classification Methods**: They divide the earth's surface into regions and classify based on visual features, but they cannot provide clear visual cues as explanations.
3. **Lack of Interpretability and Interactivity**: Traditional geolocation methods usually output only a single GPS coordinate or location label. They offer no explanation and no way for the user to interact, which results in a poor user experience.
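To make the classification approach in item 2 concrete, the sketch below maps a coordinate to a class label by dividing the globe into a fixed latitude/longitude grid. This is only an illustration: the fixed 10-degree grid and the names `cell_index` and `cell_center` are assumptions made here, and the summary does not say which partitioning scheme the methods it refers to actually use.

```python
def cell_index(lat: float, lon: float, cell_deg: float = 10.0) -> int:
    """Map a (lat, lon) pair to the index of a fixed-size grid cell.

    Assumption: a uniform grid of cell_deg x cell_deg degrees. This is a
    simplification chosen for illustration only.
    """
    rows = int(180 / cell_deg)
    cols = int(360 / cell_deg)
    # Clamp so that lat = 90 and lon = 180 fall into the last row/column.
    row = min(int((lat + 90.0) / cell_deg), rows - 1)
    col = min(int((lon + 180.0) / cell_deg), cols - 1)
    return row * cols + col


def cell_center(index: int, cell_deg: float = 10.0) -> tuple[float, float]:
    """Return the (lat, lon) center of a cell, used as the predicted location."""
    cols = int(360 / cell_deg)
    row, col = divmod(index, cols)
    return (-90.0 + (row + 0.5) * cell_deg, -180.0 + (col + 0.5) * cell_deg)


# A classifier trained on such labels can only answer with a cell center,
# which is why the output carries no visual explanation by itself.
paris_cell = cell_index(48.8566, 2.3522)
paris_guess = cell_center(paris_cell)
```

The predicted location is always a cell center, so the grid resolution bounds the achievable precision, and the label alone says nothing about which visual cues led to it.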
### Solutions
1. **Construct the Multi-modal Global Geolocation Dataset (MG-Geo)**: To address data scarcity, the authors introduce MG-Geo, a new dataset of 5 million high-quality image-text pairs. The dataset is well structured, covers a wide range of geographical information, and better reflects real-world geolocation challenges.
2. **Develop an Interactive Global Geolocation Assistant, GaGA**: Built on LVLMs, GaGA localizes an image from the geographical cues it contains and combines them with extensive world knowledge to make its prediction. It also interacts with users through conversation, allowing them to intervene, correct, or provide cues, which improves the accuracy and flexibility of geolocation.
3. **Two-stage Training**:
- **Geographical Classification Enhancement Stage**: Pre-train the model on 4.87 million image-location pairs to inject geographical knowledge and strengthen its ability to understand and classify geographical locations.
- **Interactive Correction Enhancement Stage**: Fine-tune on 70,000 image-cue pairs and 73,000 multi-turn image question-answer pairs, focusing on the model's ability to correct its predictions interactively during a conversation.
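The sample counts quoted for the two stages can be checked against the stated size of MG-Geo. The snippet below is a small bookkeeping sketch. The `TrainingStage` class and its field names are hypothetical and only organize the numbers given above; they do not describe the authors' training code. Because 4.87 million is a rounded figure, the total is approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingStage:
    """Hypothetical container for one stage's data sources (name -> sample count)."""
    name: str
    sources: dict[str, int]

    @property
    def num_samples(self) -> int:
        return sum(self.sources.values())


stages = [
    TrainingStage(
        name="geographical classification enhancement",
        sources={"image-location pairs": 4_870_000},  # rounded figure from the summary
    ),
    TrainingStage(
        name="interactive correction enhancement",
        sources={
            "image-cue pairs": 70_000,
            "multi-turn question-answer pairs": 73_000,
        },
    ),
]

total = sum(stage.num_samples for stage in stages)
for stage in stages:
    share = stage.num_samples / total
    print(f"{stage.name}: {stage.num_samples:,} samples ({share:.1%})")
print(f"total: {total:,}")
```

The three sources add up to roughly 5.01 million samples, consistent with the "5 million image-text pairs" stated for MG-Geo, and the second stage accounts for only about 3% of the data.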
### Experimental Results
The experimental results show that GaGA achieves state-of-the-art performance on the GWS15k dataset, improving accuracy by 4.57% at the country level and 2.92% at the city level. GaGA also performs well in coordinate prediction, especially in interactive scenarios: when users provide effective guidance, its geolocation accuracy improves significantly.
Through these improvements, GaGA raises the accuracy and reliability of geolocation while also making the results more interpretable and giving users a more active role, marking an important step forward for global geolocation systems.
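Level-based accuracy figures such as the country-level and city-level numbers above are commonly computed as the fraction of predictions that fall within a fixed great-circle distance of the true location. The sketch below illustrates that style of metric. The threshold values (25 km for city, 750 km for country, and so on) follow a widespread convention in the geolocation literature and are an assumption here, since this summary does not state how the paper defines its levels. The function names are likewise illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Assumed distance thresholds per level; not taken from the summary.
THRESHOLDS_KM = {
    "street": 1,
    "city": 25,
    "region": 200,
    "country": 750,
    "continent": 2500,
}


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def accuracy_at_thresholds(preds, truths, thresholds=THRESHOLDS_KM):
    """Fraction of predictions within each distance threshold of the ground truth."""
    dists = [haversine_km(*p, *t) for p, t in zip(preds, truths)]
    return {
        name: sum(d <= km for d in dists) / len(dists)
        for name, km in thresholds.items()
    }


# Two images taken in Paris: one predicted exactly, one predicted as London.
paris, london = (48.8566, 2.3522), (51.5074, -0.1278)
scores = accuracy_at_thresholds(preds=[paris, london], truths=[paris, paris])
```

In this toy example the London guess misses the city and region thresholds but still counts at the country and continent thresholds, which shows why coarse-level accuracy is always at least as high as fine-level accuracy under this kind of metric.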