Abstract: Geo-entity linking is the task of linking a location mention to the real-world geographic location. In this paper we explore the challenging task of geo-entity linking for noisy, multilingual social media data. There are few open-source multilingual geo-entity linking tools available and existing ones are often rule-based, which break easily in social media settings, or LLM-based, which are too expensive for large-scale datasets. We present a method which represents real-world locations as averaged embeddings from labeled user-input location names and allows for selective prediction via an interpretable confidence score. We show that our approach improves geo-entity linking on a global and multilingual social media dataset, and discuss progress and problems with evaluating at different geographic granularities.
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve
This paper aims to address the problem of geo-entity linking in noisy multilingual social media data. Specifically, the authors focus on how to match user-provided location information (e.g., the "location" field in Twitter profiles) with actual geographic entities.
### Background and Motivation
1. **Importance of Geographic Location**: The actual geographic location of social media users is crucial for many computational social science tasks, including disaster response, disease monitoring, language variation analysis, and regional attitude comparison.
2. **Limitations of Geotags**: Twitter deprecated precise geotags (latitude and longitude coordinates) in 2019, and even before that, fewer than 2% of tweets carried a geotag. Inferring location from user profiles and free-text location fields has therefore become increasingly necessary.
3. **Limitations of Existing Tools**: Currently available multilingual geo-entity linking tools are scarce. Existing tools are either rule-based, which can easily fail in social media environments, or based on large language models (LLMs), which are costly and not suitable for large-scale datasets.
### Research Objectives
1. **Propose a New Method**: The authors propose a new method that represents real-world locations using the average embeddings of annotated user input location names and achieves selective prediction through an adjustable cosine similarity threshold.
2. **Performance Evaluation**: The authors evaluate the performance of the proposed method on a multilingual global social media dataset and compare it with other baseline methods.
3. **Discuss Issues**: The authors discuss the issues encountered when evaluating geo-entity linking at different geographic granularities (country, administrative region, city), particularly the challenges at the city level.
### Main Contributions
1. **New Method**: A method is proposed to represent real-world locations through average embeddings and achieve selective prediction using a cosine similarity threshold.
2. **Performance Improvement**: The proposed method outperforms leading baseline methods across all variants on a multilingual global dataset.
3. **Accuracy Upper Bound**: Through manual annotation experiments, the accuracy upper bound on the dataset is estimated, and the issues of geo-entity linking at the city level are discussed.
### Related Work
1. **Geo-entity Linking**: Previous research typically combines the text and context of location mentions, knowledge bases (such as gazetteers, Wikipedia), and coordinate/geometry features, using rule-based, unsupervised, or supervised methods.
2. **Multilingual Research**: Most prior work has focused on English data and news articles, but there are a few studies involving historical texts and web data.
3. **Social Media Data**: Some previous studies have explored geo-entity linking in social media data, but these studies are mostly rule-based or use large language models.
### Methodology
1. **Task Definition**: Given a target location database, a training set containing user input location names and real location pairs, and a test set, the model needs to predict the best matching geographic entity for each user input.
2. **Data**: A modified GeoNames database is used as the target location database, and geotagged tweets from the Twitter-Global dataset are extracted as training and test data.
3. **Method**: The proposed method (UserGeo) represents each location in the target location database as the average embedding of the labeled user-input location names linked to it. To predict, it embeds a new user input and computes its cosine similarity with every location embedding, returning the most similar location. If even the highest cosine similarity falls below a given threshold, the prediction is treated as low-confidence and no prediction is made.
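
The averaging and thresholding steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `embed` is a toy stand-in encoder based on hashed character trigrams (the summary does not say which embedding model the paper uses), and the function names, location IDs, and default threshold are hypothetical.

```python
import zlib
from collections import defaultdict

import numpy as np

DIM = 256  # assumed dimensionality of the toy embedding


def embed(text: str) -> np.ndarray:
    """Toy stand-in encoder (assumption): hashed character trigrams, L2-normalized."""
    vec = np.zeros(DIM)
    padded = f"  {text.lower().strip()}  "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3].encode("utf-8")
        vec[zlib.crc32(trigram) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def build_location_embeddings(train_pairs):
    """Average the embeddings of all labeled user inputs for each location ID."""
    grouped = defaultdict(list)
    for user_text, location_id in train_pairs:
        grouped[location_id].append(embed(user_text))
    ids = sorted(grouped)
    matrix = np.stack([np.mean(grouped[i], axis=0) for i in ids])
    # Normalize rows so that a dot product with a unit vector is a cosine similarity.
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    return ids, matrix


def predict(user_text, ids, matrix, threshold=0.5):
    """Return (location_id or None, best cosine similarity)."""
    sims = matrix @ embed(user_text)
    best = int(np.argmax(sims))
    score = float(sims[best])
    if score < threshold:
        return None, score  # abstain: confidence too low
    return ids[best], score


if __name__ == "__main__":
    # Hypothetical training pairs: (user-input location string, location ID).
    pairs = [
        ("NYC", "US-NewYork"),
        ("new york city", "US-NewYork"),
        ("Paris, France", "FR-Paris"),
        ("Париж", "FR-Paris"),
    ]
    ids, matrix = build_location_embeddings(pairs)
    print(predict("new york", ids, matrix))
    print(predict("somewhere over the rainbow", ids, matrix, threshold=0.9))
```

Because the best cosine similarity doubles as the confidence score, raising `threshold` makes the sketch abstain more often, which is the selective-prediction behavior the summary describes.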
### Experimental Results
1. **Performance Comparison**: UserGeo achieves the highest accuracy at the country and administrative region levels, outperforming Carmen 2.0 by 25 and 17 percentage points, respectively; NameGeo achieves the highest accuracy at the city level, outperforming Carmen 2.0 by 5 percentage points.
2. **Precision-Coverage Curve**: UserGeo and NameGeo can trade off between precision and coverage by adjusting the threshold, while Carmen 2.0 has lower coverage.
3. **Error Analysis**: UserGeo performs better in handling non-Latin scripts and alternative/informal location names, whereas Carmen 2.0 and NameGeo...
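
The precision-coverage trade-off mentioned in item 2 can be made concrete with a short sketch. It assumes each test input has already been scored with a confidence (for example, the best cosine similarity) and a correctness flag; the function name `precision_coverage` and the sample values are hypothetical and are not results from the paper.

```python
def precision_coverage(scored, thresholds):
    """Compute (threshold, precision, coverage) for each threshold.

    scored: list of (confidence, is_correct) pairs, one per test input.
    Coverage is the share of inputs whose confidence reaches the threshold;
    precision is the share of those covered inputs that are correct.
    """
    if not scored:
        raise ValueError("scored must contain at least one pair")
    curve = []
    for t in thresholds:
        kept = [ok for conf, ok in scored if conf >= t]
        coverage = len(kept) / len(scored)
        # Precision is undefined when nothing is predicted at this threshold.
        precision = sum(kept) / len(kept) if kept else float("nan")
        curve.append((t, precision, coverage))
    return curve


if __name__ == "__main__":
    # Made-up scores for illustration only.
    sample = [(0.9, True), (0.8, True), (0.6, False), (0.4, True)]
    for t, p, c in precision_coverage(sample, [0.0, 0.5, 0.85]):
        print(f"threshold={t:.2f} precision={p:.2f} coverage={c:.2f}")
```

Sweeping the threshold from low to high traces the curve: coverage can only shrink, while precision typically rises if the confidence score is informative.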