Abstract: In this paper, we present our approaches and results for high-level feature extraction and automatic video search in TRECVID-2007. In high-level feature extraction, our main focus is to explore the upper limit of the bag-of-visual-words (BoW) approach based on local appearance features. We study and evaluate several factors that could affect the performance of BoW. By considering these factors, we show that a system using local features alone already yields top performance (MAP = 0.0935). This conclusion is similar to that of our recent VIREO-374 experiment on the TRECVID-2006 dataset [1], except that the improvement from incorporating other features is marginal.

Description of our submitted runs:
CityU-HK1: linear weighted fusion of 4 SVM classifiers using BoW, edge histogram, grid-based color moment and wavelet texture.
CityU-HK2: average fusion of 5 SVM classifiers using BoW, spatial layout of keypoints, edge histogram, grid-based color moment and wavelet texture.
CityU-HK3: average fusion of 4 SVM classifiers using BoW, edge histogram, grid-based color moment and wavelet texture.
CityU-HK4: bag-of-visual-words (BoW).
CityU-HK5: average fusion of 3 baseline classifiers using edge histogram, grid-based color moment and wavelet texture.
CityU-HK6: average fusion of 2 baseline classifiers using grid-based color moment and wavelet texture.

In automatic search, we study the performance of query-by-example (QBE) and VIREO-374 ontology-based concept search. In QBE, the spatial properties of local keypoints and concept detector confidence are used for retrieval. In concept-based search, a small set of VIREO-374 detectors is selected for query answering by measuring the similarity of query terms to semantic concepts in an Ontology-enriched Semantic Space. We submit six runs composed of concept-based, query-based, motion-based and text-based search:
CityUHK-SCS: concept-based search in which a single concept is selected for each query.
CityUHK-MCS: concept-based search in which the top-3 concepts are selected.
CityUHK-Concept: uses 36-d concept detection confidence vectors of keyframes for QBE.
CityUHK-ConceptRerank: uses 36-d concept detection confidence vectors to rerank the result of the text baseline.
CityUHK-VKmotion-Rank: employs the motion histogram of visual keywords (VK) in the video sequence to rerank the result of the text baseline.
CityUHK-Text: baseline run using ASR/MT transcripts.

1 High-Level Feature Extraction

This year, we mainly focus on exploring the upper limit of local features for concept detection. Our local feature approach is based on our previous work in [2]. We also implement three baseline features and examine the improvement gained by fusing the local features with the baseline visual features. For the selection of training samples, we rely only on this year's data and combine the two publicly available annotations from LIG [3] and MCG-ICT-CAS.
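The two fusion schemes named in the run descriptions, average fusion and linear weighted fusion of per-feature SVM classifiers, can be illustrated with a minimal late-fusion sketch. It is not the authors' implementation: the function name `fuse_scores`, the score values, and the weights are hypothetical, and the text does not say how the weights of the weighted run were chosen. The sketch assumes each classifier outputs one confidence score per keyframe on a comparable scale.

```python
from typing import Dict, List, Optional


def fuse_scores(
    scores: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Late fusion of per-feature classifier scores (hypothetical helper).

    scores  maps a feature name to one confidence score per keyframe.
    weights maps a feature name to its fusion weight; None means
            equal weights, i.e. plain average fusion.
    """
    names = list(scores)
    if not names:
        raise ValueError("need at least one classifier")
    n = len(scores[names[0]])
    if any(len(scores[k]) != n for k in names):
        raise ValueError("all classifiers must score the same keyframes")
    if weights is None:
        weights = {k: 1.0 for k in names}
    total = sum(weights[k] for k in names)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted mean of the classifier scores for each keyframe.
    return [
        sum(weights[k] * scores[k][i] for k in names) / total
        for i in range(n)
    ]


# Made-up scores for three keyframes from four classifiers.
example_scores = {
    "bow": [0.9, 0.2, 0.6],
    "edge_histogram": [0.5, 0.4, 0.2],
    "color_moment": [0.7, 0.0, 0.4],
    "wavelet_texture": [0.3, 0.2, 0.8],
}

# Average fusion: every classifier contributes equally.
average = fuse_scores(example_scores)

# Linear weighted fusion: illustrative weights that favour BoW.
example_weights = {
    "bow": 0.4,
    "edge_histogram": 0.2,
    "color_moment": 0.2,
    "wavelet_texture": 0.2,
}
weighted = fuse_scores(example_scores, example_weights)

# Rank keyframes by fused score, highest first.
ranking = sorted(range(len(average)), key=lambda i: average[i], reverse=True)
```

In practice the weights of a linear weighted fusion are usually tuned on held-out data; with equal weights the function reduces to average fusion.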
