VIREO @ TRECVID 2014: Instance Search and Semantic Indexing.

Wei Zhang, Hao Zhang, Ting Yao, Yi-Jie Lu, Jingjing Chen, Chong-Wah Ngo
2014-01-01
Abstract: This paper summarizes the two tasks in which the VIREO group participated: instance search and semantic indexing. We present our approaches and analyze the results obtained in the TRECVID 2014 benchmark evaluation [1].

Instance Search (INS): We submitted seven runs derived from the following three systems: (1) the baseline: our best system from last year; (2) the normalization: a method that refines the normalization terms for both query and reference images; (3) the video augmented query: the original image query is augmented with the video example.
- F A VIREO 7: Baseline using the first image example only. Our baseline system is based on the Bag-of-Words (BoW) model [2], augmented with Hamming Embedding [3], spatial verification via Delaunay Triangulation [4], and context weighting via the “Stare” model [5].
- F B VIREO 6: Baseline using the first two image examples only.
- F C VIREO 5: Baseline using the first three image examples only.
- F D VIREO 2: Baseline using all four image examples.
- F D VIREO 3: Baseline + normalization method, using all four image examples.
- F E VIREO 4: Baseline + video augmented query, using all four image examples as well as the video examples from which the query images are extracted.
- F E VIREO 1: Late fusion of the results from all our systems, including the baseline, normalization, and video augmented query. This run also queries with all four images and the video examples.

Semantic Indexing (SIN): This year, we experimented with visual, motion, and audio features in concept training. Specifically, this year’s benchmark evaluation involves a state-of-the-art motion feature, Improved Trajectories [6], and the audio features MFCC, LPC, LSF, and OBSI [7]. We submitted two runs to test these newly added features:
- 2 BM D VIREO.14 1: Late fusion of the detection scores using visual features.
- 2 BM D VIREO.14 2: Late fusion of the detection scores using visual, motion, and audio features.
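Several of the runs above rely on late fusion, that is, combining the ranked outputs of separately run systems at the score level. The abstract does not state which fusion rule or score normalization was used, so the following is only a minimal sketch under assumed choices: per-system min-max normalization followed by a weighted average. The function names, the shot identifiers, and the equal default weights are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one system's scores to [0, 1]; constant scores map to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {item: 0.0 for item in scores}
    return {item: (s - lo) / (hi - lo) for item, s in scores.items()}


def late_fusion(
    systems: List[Dict[str, float]],
    weights: Optional[List[float]] = None,
) -> List[Tuple[str, float]]:
    """Fuse per-system scores by a weighted average of normalized scores.

    Assumption: an item missing from a system contributes 0 for that system.
    Returns (item, fused_score) pairs sorted by descending fused score.
    """
    if not systems:
        return []
    if weights is None:
        # Assumed default: every system counts equally.
        weights = [1.0 / len(systems)] * len(systems)
    fused: Dict[str, float] = {}
    for weight, scores in zip(weights, systems):
        for item, s in min_max_normalize(scores).items():
            fused[item] = fused.get(item, 0.0) + weight * s
    # Sort by score, breaking ties by item id for a deterministic ranking.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical scores from two systems over three shots.
baseline = {"shot1": 10.0, "shot2": 5.0, "shot3": 0.0}
normalized_run = {"shot1": 0.2, "shot2": 0.8}
ranked = late_fusion([baseline, normalized_run])
```

Normalizing before averaging keeps a system with a larger raw score range from dominating the fused ranking; other fusion rules (for example rank-based ones) would be equally plausible readings of "late fusion" here.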