Abstract: In this paper, we present our approaches and results for high-level feature extraction and automatic video search in TRECVID-2007.

In high-level feature extraction, our main focus is to explore the upper limit of the bag-of-visual-words (BoW) approach based upon local appearance features. We study and evaluate several factors that could impact the performance of BoW. By considering these factors, we show that a system using only local features already yields top performance (MAP = 0.0935). This conclusion is consistent with our recent VIREO-374 experiment on the TRECVID-2006 dataset [1], except that the improvement from incorporating other features is marginal. Description of our submitted runs:

CityU-HK1: linear weighted fusion of 4 SVM classifiers using BoW, edge histogram, grid-based color moment and wavelet texture.
CityU-HK2: average fusion of 5 SVM classifiers using BoW, spatial layout of keypoints, edge histogram, grid-based color moment and wavelet texture.
CityU-HK3: average fusion of 4 SVM classifiers using BoW, edge histogram, grid-based color moment and wavelet texture.
CityU-HK4: bag-of-visual-words (BoW).
CityU-HK5: average fusion of 3 baseline classifiers using edge histogram, grid-based color moment and wavelet texture.
CityU-HK6: average fusion of 2 baseline classifiers using grid-based color moment and wavelet texture.

In automatic search, we study the performance of query-by-example (QBE) and VIREO-374 ontology-based concept search. In QBE, the spatial properties of local keypoints and concept detector confidence are utilized for retrieval. In concept-based search, a small set of VIREO-374 detectors is selected for query answering by measuring the similarity of query terms to semantic concepts in an Ontology-enriched Semantic Space. We submit six runs composed of concept-based, query-based, motion-based and text-based search:

CityUHK-SCS: concept-based search in which one single concept is selected for each query.
CityUHK-MCS: concept-based search in which the top-3 concepts are selected.
CityUHK-Concept: use 36-d concept detection confidence vectors of keyframes for QBE.
CityUHK-ConceptRerank: use 36-d concept detection confidence vectors to rerank the result of the text baseline.
CityUHK-VKmotion-Rank: employ the motion histogram of visual keywords (VK) in the video sequence to rerank the result of the text baseline.
CityUHK-Text: baseline run using ASR/MT transcripts.

1 High-Level Feature Extraction

This year, we mainly focus on exploring the upper limit of local features for concept detection. Our local feature approach is based on our previous work in [2]. We also implement three baseline features and examine the improvement from fusing the local features with the baseline visual features. For the selection of training samples, we rely only on this year's data and combine the two publicly available annotations from LIG [3] and MCG-ICT-CAS.
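The average and linear weighted fusion used in the submitted runs can be illustrated with a minimal late-fusion sketch. It is not the authors' implementation: the function name `fuse_scores`, the feature keys, the example scores, and the weights are all hypothetical, and it assumes each classifier's outputs have already been normalized to a comparable range (the text does not say how scores were normalized or how the weights were chosen).

```python
from typing import Dict, List, Optional


def fuse_scores(
    scores: Dict[str, List[float]],
    weights: Optional[Dict[str, float]] = None,
) -> List[float]:
    """Late fusion of per-classifier detection scores for the same shots.

    scores  maps a feature name to one score per shot (same shot order).
    weights maps a feature name to a non-negative weight; when omitted,
            every classifier gets equal weight (average fusion).
    """
    names = sorted(scores)
    if not names:
        raise ValueError("at least one classifier is required")

    n_shots = len(scores[names[0]])
    if any(len(scores[name]) != n_shots for name in names):
        raise ValueError("all classifiers must score the same shots")

    if weights is None:
        weights = {name: 1.0 for name in names}

    total = sum(weights[name] for name in names)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Weighted mean of the classifier scores, computed shot by shot.
    return [
        sum(weights[name] * scores[name][i] for name in names) / total
        for i in range(n_shots)
    ]


def rank_shots(fused: List[float]) -> List[int]:
    """Return shot indices ordered from highest to lowest fused score."""
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)


if __name__ == "__main__":
    # Hypothetical, already-normalized scores for three shots.
    example = {
        "bow": [0.9, 0.2, 0.6],
        "edge_histogram": [0.5, 0.4, 0.2],
        "color_moment": [0.7, 0.0, 0.4],
        "wavelet_texture": [0.3, 0.2, 0.8],
    }
    average = fuse_scores(example)  # equal weights
    weighted = fuse_scores(
        example,
        {"bow": 0.7, "edge_histogram": 0.1,
         "color_moment": 0.1, "wavelet_texture": 0.1},
    )
    print(rank_shots(average), rank_shots(weighted))
```

Passing no weights corresponds to average fusion, while a weight table that favors the BoW classifier corresponds to a linear weighted fusion. In practice the weights would be tuned on held-out data rather than fixed by hand as they are here.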
