Abstract: This paper presents our approaches and results for the four TRECVID 2008 tasks we participated in: high-level feature extraction, automatic video search, video copy detection, and rushes summarization.

In high-level feature extraction, we submitted our results jointly with Columbia University. The four runs submitted through CityU explore context-based concept fusion by modeling inter-concept relationships. The relationships are modeled not by semantic reasoning, but by observing how concepts correlate with each other, either directly or indirectly, in the LSCOM common annotation [1]. An observability space (OS) [2] is thus built on top of LSCOM [1] and VIREO-374 [3] for performing concept fusion. Since 19 of the 20 concepts evaluated this year appear in VIREO-374, we apply OS to re-rank the results of both the old models from VIREO-374 and the new models from a joint baseline submission with Columbia.

A_CityU-HK1: re-rank A_CU-run5 using OS; both positively and negatively correlated concepts are used.
A_CityU-HK2: re-rank A_CU-run5 using OS; only positively correlated concepts are used.
A_CityU-HK3: re-rank old models from VIREO-374 using OS; both positively and negatively correlated concepts are used.
A_CityU-HK4: re-rank old models from VIREO-374 using OS; only positively correlated concepts are used.

In automatic search, we focus on concept-based video search. The search goes beyond semantic reasoning: we fuse detectors by considering concept semantics, co-occurrence, diversity, and detector robustness. Two runs are submitted based on the work in [2] and [4], respectively.

F_A_2_CityUHK1_1: multi-modality fusion of concept-based search (Run-2), query-by-example search (Run-4 and Run-5), and the text baseline (Run-6).
F_A_2_CityUHK2_2: concept-based search by fusing the semantics, observability, reliability, and diversity of concept detectors [2].
F_A_2_CityUHK3_3: concept-based search using semantic reasoning [4, 5].
F_A_2_CityUHK4_4: query-by-example using VIREO-374 detection scores as features.
F_A_2_CityUHK5_5: query-by-example using motion histograms as features.
F_A_1_CityUHK6_6: text baseline.

In content-based video copy detection, we adopt a recently proposed near-duplicate video detection method [6, 7] based on the matching of local keypoint features. We submitted three runs:

CityUHK_loose: we use the cosine similarity of visual word histograms to generate a candidate set of near-duplicate keyframes. The set is further filtered by a recently proposed method called SR-PE [6].
CityUHK_vkisect: the same as CityUHK_loose, except that we use histogram intersection instead of cosine similarity to generate the candidate keyframe set.
CityUHK_tight: similar to CityUHK_loose, but with a few more heuristic constraints added.

In BBC rushes summarization, we submitted one run using the same method as in our submission last year [8].

1 High-Level Feature Extraction (HLFE)

This year, we submitted our HLFE results jointly with Columbia University. Detailed descriptions of the joint submissions can be found in the notebook paper of Columbia [9]. For the four runs submitted by CityU, we aim to test context-based concept fusion based on a linear space (the observability space) built from observations derived from manual concept annotation.

1.1 Concept Fusion with Observability Space

The observability space (OS) was proposed to effectively model the co-occurrence relationships among concepts [2]. We refine the individual concept detectors by a simple and efficient linear weighted fusion of each target concept with several peripherally related concepts, where both concept selection and fusion weights are determined by the OS.
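As a rough illustration of linear weighted fusion with related concepts, the sketch below re-scores one target concept from a matrix of detector outputs. It is a minimal sketch under stated assumptions, not the method of [2]: the function name `fuse_scores`, the `weights` matrix (a stand-in for whatever weights the OS yields), the `top_k` selection by absolute weight, and the `positive_only` switch (loosely mirroring the distinction between the HK1/HK3 and HK2/HK4 runs) are all hypothetical.

```python
import numpy as np

def fuse_scores(scores, weights, target, top_k=3, positive_only=False):
    """Linearly fuse a target concept's scores with those of related concepts.

    scores  : (num_shots, num_concepts) detector outputs (hypothetical input).
    weights : (num_concepts, num_concepts) relatedness weights; a placeholder
              for OS-derived weights, which are not reproduced here.
    target  : column index of the concept being refined.
    """
    scores = np.asarray(scores, dtype=float)
    w = np.asarray(weights, dtype=float)[target].copy()
    w[target] = 0.0                      # the target is added separately below
    if positive_only:
        w[w < 0] = 0.0                   # drop negatively related concepts
    # Assumption: keep the top_k concepts with the largest absolute weight.
    related = np.argsort(-np.abs(w))[:top_k]
    related = related[w[related] != 0]   # ignore concepts with zero weight
    # Fused score = target score + weighted sum of related concept scores.
    return scores[:, target] + scores[:, related] @ w[related]
```

Shots can then be re-ranked by sorting the fused scores in descending order, for example with `np.argsort(-fused)`.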
Given a concept set $V$ of $n$ concepts, we first construct an $n \times n$ concept observability matrix $R$, where each entry $r_{ij}$ represents the co-occurrence relationship of a concept pair $(C_i, C_j)$, measured by the Pearson product-moment (PM) correlation:

$$ r_{ij} = \mathrm{PM}(C_i, C_j) = \frac{\sum_{k=1}^{|T|} (O_{ik} - \mu_i)(O_{jk} - \mu_j)}{(|T| - 1)\,\sigma_i \sigma_j} \tag{1} $$

where $O_{ik}$ is the observability of concept $C_i$ in shot $k$, and $\mu_i$ and $\sigma_i$ are the sample mean and standard deviation, respectively, of observing $C_i$ in a training set $T$. We set $O_{ik}$ to 1 if $C_i$ is present in shot $k$, and 0 otherwise. With $R$, the basis vectors $C$ of the OS can be estimated by solving the following equation
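The Pearson correlation matrix of Eq. (1) can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name `observability_matrix` and the layout of the input array (one row per concept, one column per shot) are assumptions, and the estimation of the OS basis vectors is not covered.

```python
import numpy as np

def observability_matrix(O):
    """Pairwise Pearson correlation between concepts, following Eq. (1).

    O : (num_concepts, num_shots) binary array, with O[i, k] = 1 if concept i
        is present in shot k and 0 otherwise (assumed layout).
    """
    O = np.asarray(O, dtype=float)
    num_shots = O.shape[1]
    mu = O.mean(axis=1, keepdims=True)            # sample mean per concept
    sigma = O.std(axis=1, ddof=1, keepdims=True)  # sample std, |T| - 1 normalization
    centered = O - mu
    # Numerator of Eq. (1) for all pairs at once, divided by (|T| - 1).
    cov = centered @ centered.T / (num_shots - 1)
    # A concept that is present in every shot or in none has sigma = 0,
    # so its row and column become NaN.
    return cov / (sigma @ sigma.T)
```

For annotations in which every concept varies across shots, the result should agree with `np.corrcoef(O)`, which applies the same normalization.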
