A multi-modal fusion approach for measuring web video relatedness

Youfu WEN, Caiyan JIA, Zhineng CHEN
DOI: https://doi.org/10.11992/tis.201603040
2016-01-01
Abstract: With advances in internet and multimedia technologies, the number of web videos on social video platforms is growing rapidly. As a result, tasks such as large-scale video retrieval, classification, and annotation urgently need to be addressed. Web video relatedness serves as a basic, common building block for these tasks. This paper investigates the measurement of web video relatedness from a multi-modal fusion perspective, proposing to measure relatedness based on multi-source heterogeneous information. The multi-modal fusion simultaneously leverages a video's visual content, its title and tag text, and social features contributed by human-video interactions (i.e., the upload time, channel, and author of a video). On this basis, a novel multi-modal fusion approach is proposed for computing web video relatedness; the resulting relatedness score serves as a ranking criterion and is applied to the task of large-scale video retrieval. Experimental results on YouTube videos show that the proposed fusion of text, visual, and user social features performs best in comparison tests against three alternative approaches, i.e., approaches that compute web video relatedness based on text features only, on visual features only, or jointly on text and visual features.
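The abstract does not specify how the three modalities are combined, so the following is only a minimal sketch of one common option, a weighted late fusion of per-modality similarity scores used as a ranking criterion. Everything in it is an assumption made for illustration and not the authors' method: the `Video` fields, the Jaccard and cosine similarities, the `time_scale` parameter, and the weights.

```python
import math
from dataclasses import dataclass


@dataclass
class Video:
    # Hypothetical record layout, chosen only for this sketch.
    video_id: str
    terms: set        # words from the title and tags
    visual: list      # visual feature vector (e.g. a histogram)
    channel: str
    author: str
    upload_day: int   # upload time as a day index


def text_sim(a, b):
    """Jaccard overlap of title/tag terms (assumed text similarity)."""
    union = a.terms | b.terms
    return len(a.terms & b.terms) / len(union) if union else 0.0


def visual_sim(a, b):
    """Cosine similarity of visual feature vectors (assumed visual similarity)."""
    dot = sum(x * y for x, y in zip(a.visual, b.visual))
    norm = math.sqrt(sum(x * x for x in a.visual)) * math.sqrt(sum(y * y for y in b.visual))
    return dot / norm if norm else 0.0


def social_sim(a, b, time_scale=30.0):
    """Average of channel match, author match, and upload-time closeness."""
    same_channel = 1.0 if a.channel == b.channel else 0.0
    same_author = 1.0 if a.author == b.author else 0.0
    closeness = 1.0 / (1.0 + abs(a.upload_day - b.upload_day) / time_scale)
    return (same_channel + same_author + closeness) / 3.0


def relatedness(a, b, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of text, visual, and social similarity (illustrative weights)."""
    w_text, w_visual, w_social = weights
    return (w_text * text_sim(a, b)
            + w_visual * visual_sim(a, b)
            + w_social * social_sim(a, b))


def rank(query, candidates):
    """Order candidate videos by decreasing relatedness to the query video."""
    return sorted(candidates, key=lambda v: relatedness(query, v), reverse=True)
```

Setting a weight to zero reproduces the single-modality or two-modality baselines named in the abstract (text only, visual only, or text plus visual), which is why a weighted sum is a convenient form for this kind of comparison.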