Experimenting VIREO-374: Bag-of-Visual-Words and Visual-Based Ontology for Semantic Video Indexing and Search.
Chong-Wah Ngo, Yu-Gang Jiang, Xiao-Yong Wei, Feng Wang, Wanlei Zhao, Hung-Khoon Tan, Xiao Wu
2007-01-01
Abstract: In this paper, we present our approaches and results for high-level feature extraction and automatic video search in TRECVID-2007. In high-level feature extraction, our main focus is to explore the upper limit of the bag-of-visual-words (BoW) approach based upon local appearance features. We study and evaluate several factors that could impact the performance of BoW. By considering these factors, we show that a system using local features only already yields top performance (MAP = 0.0935). This conclusion is similar to that of our recent VIREO-374 experiment on the TRECVID-2006 dataset [1], except that the improvement from incorporating other features is marginal.

Description of our submitted runs:
CityU-HK1: linear weighted fusion of 4 SVM classifiers using BoW, edge histogram, grid-based color moment and wavelet texture.
CityU-HK2: average fusion of 5 SVM classifiers using BoW, spatial layout of keypoints, edge histogram, grid-based color moment and wavelet texture.
CityU-HK3: average fusion of 4 SVM classifiers using BoW, edge histogram, grid-based color moment and wavelet texture.
CityU-HK4: bag-of-visual-words (BoW).
CityU-HK5: average fusion of 3 baseline classifiers using edge histogram, grid-based color moment and wavelet texture.
CityU-HK6: average fusion of 2 baseline classifiers using grid-based color moment and wavelet texture.

In automatic search, we study the performance of query-by-example (QBE) and VIREO-374 ontology-based concept search. In QBE, the spatial properties of local keypoints and concept detector confidence are utilized for retrieval. In concept-based search, a small set of VIREO-374 detectors is selected for query answering by measuring the similarity of query terms to semantic concepts in an Ontology-enriched Semantic Space. We submit six runs composed of concept-based, query-based, motion-based and text-based search:
CityUHK-SCS: concept-based search in which one single concept is selected for each query.
CityUHK-MCS: concept-based search in which the top-3 concepts are selected.
CityUHK-Concept: uses 36-d concept detection confidence vectors of keyframes for QBE.
CityUHK-ConceptRerank: uses 36-d concept detection confidence vectors to rerank the result of the text baseline.
CityUHK-VKmotion-Rank: employs the motion histogram of visual keywords (VK) in the video sequence to rerank the result of the text baseline.
CityUHK-Text: baseline run using ASR/MT transcripts.

1 High-Level Feature Extraction
This year, we mainly focus on exploring the upper limit of local features for concept detection. Our local feature approach is largely based on our previous work in [2]. We also implement three baseline features and examine the improvement gained by fusing the local features with the baseline visual features. For the selection of training samples, we rely only on this year's data and combine the two publicly available annotations from LIG [3] and MCG-ICT-CAS.
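To make the fusion of local and baseline features concrete, the following Python fragment sketches average fusion and linear weighted fusion of per-feature classifier scores. It is a minimal illustration under stated assumptions, not the submitted system: the function name `fuse_scores`, the feature keys, the example scores and the weights are all hypothetical, and each classifier is assumed to output a comparable score (for example, a probability in [0, 1]) for every keyframe. How the weights of the weighted run were chosen is not stated here, so the values below are placeholders.

```python
from typing import Dict, List, Optional


def fuse_scores(
    scores: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Late fusion of per-feature classifier scores for one concept.

    scores maps a feature name (hypothetical keys such as "bow") to the
    list of classifier scores for the same ordered list of keyframes.
    With weights=None every classifier counts equally (average fusion);
    otherwise the result is a weighted mean (linear weighted fusion).
    """
    names = sorted(scores)
    if not names:
        raise ValueError("at least one classifier is required")
    n = len(scores[names[0]])
    if any(len(scores[name]) != n for name in names):
        raise ValueError("all classifiers must score the same keyframes")
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalizing by the weight sum keeps fused scores in the input range.
    return [
        sum(weights[name] * scores[name][i] for name in names) / total
        for i in range(n)
    ]


# Hypothetical scores of two classifiers on three keyframes.
example = {
    "bow": [0.75, 0.25, 0.5],
    "edge_histogram": [0.25, 0.25, 1.0],
}
average = fuse_scores(example)
# Placeholder weights that favor the BoW classifier.
weighted = fuse_scores(example, {"bow": 3.0, "edge_histogram": 1.0})
```

Keyframes would then be ranked for the concept by sorting on the fused score. Dividing by the sum of the weights is a design choice of this sketch that keeps the two fusion variants on the same scale.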