A Feature Analysis for Multimodal News Retrieval

Golsa Tahmasebzadeh, Sherzod Hakimov, Eric Müller-Budack, Ralph Ewerth
DOI: https://doi.org/10.48550/arXiv.2007.06390
2020-10-01
Abstract: Content-based information retrieval is based on the information contained in documents rather than using metadata such as keywords. Most information retrieval methods are either based on text or image. In this paper, we investigate the usefulness of multimodal features for cross-lingual news search in various domains: politics, health, environment, sport, and finance. To this end, we consider five feature types for image and text and compare the performance of the retrieval system using different combinations. Experimental results show that retrieval results can be improved when considering both visual and textual information. In addition, it is observed that among textual features entity overlap outperforms word embeddings, while geolocation embeddings achieve better performance among visual features in the retrieval task.
Computation and Language, Information Retrieval, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper explores the effectiveness of multimodal features in cross-lingual news retrieval. Specifically, the authors analyze how visual features (object, place, and geolocation embeddings) and text features (BERT embeddings and entity overlap) affect news retrieval performance, and they validate the benefit of combining these features through experiments.

### Main Research Content

1. **Background and Motivation**:
   - With the rapid growth of online media content, intelligent technologies that organize this content and meet users' information needs have become increasingly important.
   - Multimodal Information Retrieval (MIR) combines information from different modalities (such as text, images, videos, and audio) to serve users' search needs.
   - Traditional information retrieval methods are usually based on a single modality (such as text or images), while multimodal methods can represent the content of multimedia documents more comprehensively.
2. **Research Methods**:
   - **Dataset**: The authors collected news articles containing text and images from five news domains (politics, health, environment, sports, and finance), in both English and German.
   - **Feature Extraction**:
     - **Visual Features**: Pre-trained deep learning models were used to extract object, place, and geolocation embeddings.
     - **Text Features**: BERT embeddings and entity overlap were used to represent text content.
   - **Similarity Calculation**: The retrieval task was performed by calculating pairwise similarity between news articles (based on cosine similarity).
3. **Experimental Results**:
   - The retrieval system combining visual and text features outperforms systems using either modality alone.
   - Among text features, entity overlap performed better than word embeddings.
   - Among visual features, geolocation embeddings outperformed object and place features across different news domains.
   - Simply averaging the multimodal features already yields a good representation.
4. **Contributions**:
   - Compared the impact of different state-of-the-art feature descriptors on multimodal content.
   - Showed that combining visual and text features improves news retrieval over single-modality features.
   - Provided an analysis of the effectiveness of multimodal features in different news domains.

### Conclusion

The experiments show that a multimodal news retrieval system combining visual and text features outperforms single-modality methods. The benefit of combining multimodal features is most pronounced in the environment and health domains. In the politics and finance domains, however, text features still outperform visual features, possibly because the image content in these domains is less visually significant. Future research could therefore explore additional visual descriptors to better represent the content of news images.
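The similarity calculation summarized under Research Methods (pairwise cosine similarity, entity overlap, and simple averaging across features) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names are hypothetical, the toy vectors and entity lists are made up, and two choices are assumptions the summary does not confirm: entity overlap is computed as a Jaccard ratio, and features are combined by averaging their similarity scores rather than the embeddings themselves. Embedding vectors are assumed to be non-zero.

```python
import numpy as np


def cosine_similarity_matrix(query_vecs, doc_vecs):
    """Pairwise cosine similarity between the rows of two 2-D arrays.

    Assumes every row is a non-zero vector.
    """
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return q @ d.T


def entity_overlap(entities_a, entities_b):
    """Jaccard overlap of two entity collections (assumed definition)."""
    a, b = set(entities_a), set(entities_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def combined_similarity(similarity_matrices):
    """Unweighted mean of per-feature similarity matrices (assumed fusion)."""
    return np.mean(np.stack(similarity_matrices), axis=0)


def rank_documents(similarity_row):
    """Candidate indices ordered from most to least similar."""
    return np.argsort(-similarity_row)


# Toy data: one query article and three candidate articles (made-up values).
query_visual = np.array([[1.0, 0.0]])
docs_visual = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query_entities = ["Berlin", "WHO"]
docs_entities = [["Berlin", "WHO"], ["Paris"], ["Berlin"]]

visual_sim = cosine_similarity_matrix(query_visual, docs_visual)
text_sim = np.array([[entity_overlap(query_entities, e) for e in docs_entities]])

scores = combined_similarity([visual_sim, text_sim])
ranking = rank_documents(scores[0])
print(ranking.tolist())  # [0, 2, 1]
```

In this toy case the first candidate matches the query on both features, the third matches partially on each, and the second matches on neither, so the combined ranking is 0, 2, 1. A weighted mean could replace the unweighted one if some features should count more, but the summary only reports that simple averaging already works well.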