DLRGeoTweet: A comprehensive social media geocoding corpus featuring fine-grained places
Xuke Hu, Tobias Elßner, Shiyu Zheng, Helen Ngonidzashe Serere, Jens Kersten, Friederike Klan, Qinjun Qiu
DOI: https://doi.org/10.1016/j.ipm.2024.103742
IF: 7.466
2024-04-14
Information Processing & Management
Abstract: Every day, many short text messages are posted on social media in response to real-world events, providing a valuable resource for domains such as emergency response and traffic management. Since users rarely attach exact coordinates to their posts, accurately recognizing and resolving fine-grained place names, such as home addresses and Points of Interest, is crucial for understanding the precise locations of critical events, such as rescue requests. This task, known as geoparsing, involves toponym recognition followed by toponym resolution (also called geocoding). However, existing social media datasets for evaluating geoparsing approaches often lack sufficient fine-grained place names that have associated geo-coordinates or are linked to gazetteers, which makes it challenging to evaluate, compare, and train geocoding methods for such locations. The absence of supportive annotation tools compounds this challenge. To address these gaps, we implemented a lightweight Python tool leveraging Nominatim. Using this tool, we annotated a comprehensive X (formerly Twitter) geocoding corpus called DLRGeoTweet. The corpus underwent a rigorous cross-validation process to guarantee its quality. It includes a total of 7,364 tweets and 12,510 places, of which 6,012 are fine-grained. It comprises two global datasets covering worldwide events and three local datasets related to local events such as the 2017 Hurricane Harvey. The annotation process spanned more than ten months and required approximately 1,000 person-hours. We then evaluated 15 recent and representative geocoding approaches, including many deep learning-based ones, on DLRGeoTweet. The results highlight the inherent challenges in resolving fine-grained places accurately. Despite increasing constraints on access to Twitter data, the corpus's focus on short, informal text makes it a valuable resource for geocoding across multiple social media platforms.
computer science, information systems; information science & library science
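Illustrative sketch (not from the paper): the abstract describes geoparsing as toponym recognition followed by toponym resolution. The Python snippet below is a minimal toy version of that two-step idea, assuming a tiny in-memory gazetteer in place of Nominatim and naive substring matching in place of a trained recognizer. The names `TOY_GAZETTEER`, `geoparse`, and `haversine_km` are hypothetical, and the coordinates are approximate. The distance function shows one common way a predicted location could be compared with an annotated one; the paper's actual evaluation metrics are not stated in the abstract.

```python
import math

# Hypothetical toy gazetteer; coordinates are approximate and illustrative only.
TOY_GAZETTEER = {
    "houston": (29.7604, -95.3698),
    "buffalo bayou": (29.7620, -95.3830),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))  # 6371 km: mean Earth radius


def geoparse(text, gazetteer):
    """Naive geoparsing: find gazetteer names in the text (recognition)
    and return their coordinates (resolution), ordered by position in the text."""
    lowered = text.lower()
    hits = []
    for name, coords in gazetteer.items():
        pos = lowered.find(name)
        if pos != -1:
            hits.append((pos, name, coords))
    return [(name, coords) for _, name, coords in sorted(hits)]


if __name__ == "__main__":
    tweet = "Rescue needed near Buffalo Bayou, Houston"
    for name, coords in geoparse(tweet, TOY_GAZETTEER):
        error = haversine_km(coords, TOY_GAZETTEER["houston"])
        print(f"{name}: {coords}, {error:.2f} km from the city centroid")
```

A real system would replace the substring match with a toponym recognizer and the dictionary lookup with a gazetteer query, and would need to disambiguate between places that share a name.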