Abstract: In this paper, we describe the IBM Research system for indexing, analysis, and copy detection of video as applied to the TRECVID-2009 video retrieval benchmark.

A. High-Level Concept Detection: This year, our focus was on global and local feature combination, automatic training data construction from the web domain, and large-scale detection using Hadoop.
1. A ibm.Global 6: Baseline runs using 98 types of global features and 3 SVM learning methods;
2. A ibm.Combine2 5: Fusion of the 2 best models from 5 candidate models on global / local features;
3. A ibm.CombineMore 4: Fusion of all 5 candidate models on global / local features;
4. A ibm.Single+08 3: Single best model from the 5 candidate models, plus the old models from 2008;
5. C ibm.Combine2+FlkBox 2: Combination of A ibm.Combine2 5 with automatically extracted training data from Flickr;
6. A ibm.BOR 1: Best overall run, assembled from the best models for each concept using held-out performance.
Overall, almost all of the individual components improve the mean average precision when fused with the baseline results. To summarize, we have the following observations from our evaluation results: 1) The global and local features are complementary to each other, and their fusion outperforms either type of feature on its own; 2) The more features are combined, the better the performance, even with simple combination rules; 3) The development data collected automatically from the web domain are shown to be useful on a number of the concepts, although their average performance is not comparable with that of manually selected training data, partially because of the large domain gap between web images and documentary video.

B. Content-Based Copy Detection: The focus of our copy detection system this year was on fusing 4 types of complementary fingerprints: a temporal activity-based fingerprint, keyframe-based color correlogram and SIFTogram fingerprints, and an audio-based fingerprint. We also considered two approaches (mean- and median-equalization) for score normalization and fusion across systems that produce vastly different score distributions and ranges. A summary of our runs is listed below:
1. ibm.v.balanced.meanBAL: Video-only submission produced by fusing the temporal activity-based and keyframe color correlogram-based fingerprints after mean equalization and score normalization.
2. ibm.v.balanced.medianBAL: As above, but using the median scores as weighting factors.
3. ibm.v.nofa.meanNOFA: Similar to the first run, but with internal weights for our temporal method tuned more conservatively and a higher score threshold applied to our color feature-based method.
4. ibm.v.nofa.medianNOFA: Similar to the meanNOFA run, but using the median scores for weighting.
5. ibm.m.balanced.meanFuse: For A+V runs, we used the same 2 video-only methods, plus another video method (SIFTogram) and a temporal audio-based method. In this run, we used the mean scores of each constituent for weighting.
6. ibm.m.balanced.medianFuse: As in the above run, but using the median scores for weighting.
7. ibm.m.nofa.meanFuse: As with the video-only runs, we adjusted internal parameters of the temporal methods and the thresholds for the other methods.
8. ibm.m.nofa.medianFuse: As in the m.nofa.meanFuse run, but using the median scores for weighting.
Overall, the SIFTogram approach performed best, followed by the correlogram approach and the temporal activity-based fingerprint approach, while audio did not help. With respect to score normalization and fusion, we found median equalization to be more effective than mean equalization.

∗IBM T. J. Watson Research Center, Hawthorne, NY, USA; †IBM China Research Lab, Beijing, China; ‡IBM Software Group, Cambridge, MA; §Dept. of Computer Science, Columbia University; ¶Machine Learning Dept., Carnegie Mellon Univ.
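The abstract does not give the equalization formula, so the following is only a minimal sketch of how mean- or median-equalization before fusion might look. It assumes that each system's raw scores are positive, that equalization means dividing a system's scores by their own mean or median, and that fusion is a plain sum over systems. The function names `equalize` and `fuse`, the candidate keys, and the toy scores are all hypothetical and not taken from the paper.

```python
from statistics import mean, median


def equalize(scores, center=median):
    """Rescale one system's scores so their mean (or median) equals 1.0.

    Assumption: scores are positive, so the chosen statistic is > 0.
    """
    values = list(scores.values())
    c = center(values)
    if c <= 0:
        raise ValueError("equalization assumes positive scores")
    return {key: s / c for key, s in scores.items()}


def fuse(systems, center=median):
    """Sum equalized scores per candidate match.

    A system that did not return a candidate contributes 0 for it.
    """
    fused = {}
    for scores in systems:
        for key, s in equalize(scores, center).items():
            fused[key] = fused.get(key, 0.0) + s
    return fused


# Toy scores from two hypothetical fingerprints with very different ranges.
temporal = {"q1->refA": 900.0, "q1->refB": 300.0, "q1->refC": 300.0}
color = {"q1->refA": 0.2, "q1->refB": 0.8, "q1->refC": 0.2}

fused_median = fuse([temporal, color], center=median)
fused_mean = fuse([temporal, color], center=mean)
best = max(fused_median, key=fused_median.get)
```

Without equalization, the temporal scores (hundreds) would swamp the color scores (below 1) in any direct sum. Dividing by a per-system statistic puts both on a comparable scale, and the median is less sensitive than the mean to a few very large scores.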

The MediaMill TRECVID 2011 Semantic Video Search Engine.

IBM Research and Columbia University TRECVID-2012 Multimedia Event Detection (MED), Multimedia Event Recounting (MER), and Semantic Indexing (SIN) Systems.

IBM Research and Columbia University TRECVID-2011 Multimedia Event Detection (MED) System

Informedia at TRECVID2014: MED and MER, Semantic Indexing, Surveillance Event Detection

Informedia@TRECVID 2014: MED and MER

BBNVISER : BBN VISER TRECVID 2012 Multimedia Event Detection and Multimedia Event Recounting Systems.

Informedia@TRECVID 2013.

BBN VISER TRECVID 2013 Multimedia Event Detection and Multimedia Event Recounting Systems.

Informedia E-Lamp@TRECVID 2012: Multimedia Event Detection and Recounting (MED and MER)

Multimedia Event Detection and Recounting

Informedia E-Lamp @ TRECVID 2013: Multimedia Event Detection and Recounting (MED and MER)

Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching.

Video Concept Detection Using Support Vector Machines - TRECVID 2007 Evaluations

TRECVID 2007 High-Level Feature Extraction By MCG-ICT-CAS

Video diver: generic video indexing with diverse features.

Bi-Level Semantic Representation Analysis for Multimedia Event Detection

IBM Research TRECVID-2009 Video Retrieval System.

Experimenting VIREO-374: Bag-of-Visual-Words and Visual-Based Ontology for Semantic Video Indexing and Search.

Fast And Accurate Content-Based Semantic Search In 100m Internet Videos

They Are Not Equally Reliable: Semantic Event Search Using Differentiated Concept Classifiers

Event-Driven Semantic Concept Discovery by Exploiting Weakly Tagged Internet Images