Abstract: Just as human perception gathers information from different sources in natural, multi-modal forms, learning from multiple modalities has become an effective scheme for various information retrieval problems. In this paper, we propose a novel multi-modality fusion approach for video search, where the search modalities are derived from a diverse set of knowledge sources, such as text transcripts from speech recognition, low-level visual features from video frames, and high-level semantic visual concepts from supervised learning. Since the effectiveness of each search modality depends greatly on the specific user query, promptly determining the importance of a modality to a given query is a critical issue in multi-modality search. Our proposed approach, named concept-driven multi-modality fusion (CDMF), exploits a large set of predefined semantic concepts to compute multi-modality fusion weights in a novel way. Specifically, CDMF decomposes the query-modality relationship into two components that are much easier to compute: query-concept relatedness and concept-modality relevancy. The former can be estimated efficiently online using semantic and visual mapping techniques, while the latter can be computed offline based on the concept detection accuracy of each modality. This decomposition makes it possible to learn fusion weights adaptively for each user query on the fly, in contrast to existing approaches, which mostly adopt predefined query classes and/or modality weights. Experimental results on the TREC Video Retrieval Evaluation (TRECVID) 2005-2008 datasets validate the effectiveness of our approach, which outperforms existing multi-modality fusion methods and achieves near-optimal performance (relative to oracle fusion) for many test queries.
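The decomposition described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the combination rule used here is an assumption: each modality weight is taken as the sum, over concepts, of query-concept relatedness times concept-modality relevancy, normalized to sum to one. The function names `fusion_weights` and `fuse_scores`, the linear score fusion, and the uniform fallback are all hypothetical and are not taken from the paper.

```python
def fusion_weights(query_concept, concept_modality):
    """Combine the two components into per-modality fusion weights.

    query_concept:    list of length C, relatedness of the query to each
                      concept (the part estimated online).
    concept_modality: C x M nested list, relevancy of each concept to each
                      modality (the part computed offline).
    Returns a list of M weights that sum to 1.
    ASSUMPTION: a normalized sum of products; the paper may use another rule.
    """
    num_modalities = len(concept_modality[0])
    raw = [
        sum(q * row[m] for q, row in zip(query_concept, concept_modality))
        for m in range(num_modalities)
    ]
    total = sum(raw)
    if total <= 0:
        # No concept is related to the query: fall back to uniform weights.
        return [1.0 / num_modalities] * num_modalities
    return [r / total for r in raw]


def fuse_scores(modality_scores, weights):
    """Weighted linear fusion of per-modality scores.

    modality_scores: M x N nested list, one score list per modality.
    Returns a list of N fused scores, one per candidate video shot.
    """
    num_items = len(modality_scores[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, modality_scores))
        for i in range(num_items)
    ]


if __name__ == "__main__":
    # Two hypothetical concepts, two hypothetical modalities (text, visual).
    relatedness = [1.0, 0.0]
    relevancy = [[0.2, 0.8], [0.9, 0.1]]
    weights = fusion_weights(relatedness, relevancy)
    print(weights)
    print(fuse_scores([[1.0, 0.0], [0.0, 1.0]], weights))
```

Under this assumption, only the relatedness vector has to be recomputed per query, while the relevancy table is built once, which mirrors the online/offline split described in the abstract.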

Multimedia Evidence Fusion for Video Concept Detection via OWA Operator

Multimodal feature fusion for robust event detection in web videos

Concept-Driven Multi-Modality Fusion for Video Search

Dynamic Multimodal Fusion in Video Search

Multisensor Attribute Information Fusion Based on OWA Aggregation Operator

An Integrated Statistical Model for Multimedia Evidence Combination

A multi-modal fusion approach for measuring web video relatedness

Feature Weighting via Optimal Thresholding for Video Analysis

Rethinking the constraints of multimodal fusion: case study in Weakly-Supervised Audio-Visual Video Parsing

Efficient Heuristic Methods for Multimodal Fusion and Concept Fusion in Video Concept Detection

Fusion Detection via Distance-Decay IoU and Weighted Dempster-Shafer Evidence Theory

Violence Video Detection Based on Multi-modal Fusion and Dual Channel Contrastive Learning

Multimodal fusion for video copy detection

A fusion approach for video moving object detection based on evidential reasoning

Embracing Unimodal Aleatoric Uncertainty for Robust Multimodal Fusion

Zhejiang University at TRECVID 2006

An ontology-based evidential framework for video indexing using high-level multimodal fusion

OMOFuse: an Optimized Dual-Attention Mechanism Model for Infrared and Visible Image Fusion

Fusion Detection via Distance-Decay Intersection over Union and Weighted Dempster-Shafer Evidence Theory

Optimal Multimodal Fusion for Multimedia Data Analysis